RAGAS eval baseline — 2026-05-19¶

Bootstrap of backend/tests/eval/golden_dataset.json from live AWS Bedrock Knowledge Base (OpenSearch-backed index, RAG_ENGINE=langgraph). Golden ground_truth and contexts reflect AWS-generation at bootstrap time.

Dataset¶

Item	Value
Seed questions	12 (`seed_questions.json`)
Promoted rows	10
Dropped	SpeedGrader audio feedback; Gradebook excuse (model out-of-scope despite retrieval)

Bootstrap command¶

./venv/bin/python scripts/bootstrap_golden_dataset.py
./venv/bin/python scripts/promote_golden_draft.py

RAGAS score comparison (AWS Bedrock — original baseline)¶

Historical runs on the same 10-row golden set with AWS LLM and Phase 5 LangGraph retrieval. These values are retained as the primary regression reference.

Metric	Threshold	Baseline (no Phase 5)	Phase 5 (initial)	Phase 5 tuned	Gate
faithfulness	≥ 0.85	0.662	0.712	0.754	fail
answer_relevancy	≥ 0.80	0.675	0.801	0.789	fail
context_recall	≥ 0.75	0.600	0.750	0.800	pass
context_precision	≥ 0.70	0.600	0.505	0.504	fail

Phase 5 tuned (AWS) (~3m 12s): keyword rerank, MULTI_QUERY_COUNT=2, RRF, RERANK_TOP_N=3, RERANK_PREFILTER_MAX=10, RERANK_MIN_KEYWORD_OVERLAP=0 (historical ./scripts/run_eval_phase5.sh profile).

AWS tuning notes (retained): FlashRank + single extra query regressed relevancy/recall vs keyword + two queries. RERANK_TOP_N=2 raised precision to ~0.62 in spot checks but hurt recall; the table above uses RERANK_TOP_N=3 as the balanced AWS profile.

RAGAS runs — Azure OpenAI LLM + judge (same golden set)¶

Follow-up sweeps with LLM_PROVIDER=azure, RETRIEVER_PROVIDER=aws, same golden_dataset.json (AWS-bootstrap references). Judge: Azure GPT-4o via LLM_PROVIDER / RAGAS_LLM_PROVIDER. Scores vary run-to-run (~±0.05–0.15); point estimates from local runs 2026-05-19.

Summary vs default gates¶

Profile	faithfulness	answer_relevancy	context_recall	context_precision	Notes
Azure + chain (best single run)	0.838	0.980	0.833	0.500	Pre–conftest-fix; `RETRIEVER_NUMBER_OF_RESULTS=4`
Azure + chain (`n=2`)	0.642	0.886	0.600	0.508
Azure + chain (`n=3`)	0.639	0.886	0.600	0.558
Azure + chain (`n=4`, repeat)	0.617	0.885	0.600	0.458	Judge variance
LangGraph P5 (`TOP_N=3`, `MQ=2`)	0.768	failed†	0.600	0.558	†per-metric test
LangGraph precision-balanced (`TOP_N=2`, `MQ=3`)	0.651	0.885	0.600	0.542	Current `run_eval_phase5.sh`
LangGraph `TOP_N=2`, `MQ=2`, overlap≥1	0.643	0.891	0.600	0.492
LangGraph no multi-query (`TOP_N=2`)	0.638	0.886	0.600	0.583	Best Azure precision
LangGraph FlashRank `TOP_N=2`	0.684	0.886	0.550	0.492
LangGraph tight prefilter (overlap≥2)	0.626	0.867	0.600	0.398

Gates: faithfulness ≥ 0.85, answer_relevancy ≥ 0.80, context_recall ≥ 0.75, context_precision ≥ 0.70. No Azure profile passed all four.

Recommended retrieval profile (precision-balanced)¶

Setting	Value
`RAG_ENGINE`	`langgraph`
`RERANK_ENABLED`	`true`
`RERANK_BACKEND`	`keyword`
`RERANK_TOP_N`	`2`
`RERANK_CANDIDATE_K`	`15`
`RERANK_PREFILTER_MAX`	`10`
`RERANK_MIN_KEYWORD_OVERLAP`	`1`
`MULTI_QUERY_ENABLED`	`true`
`MULTI_QUERY_COUNT`	`3`

Findings — why these scores¶

1. Golden reference vs live stack¶

ground_truth was bootstrapped with AWS Bedrock answers. RAGAS faithfulness and context_recall compare live output to that reference. Azure generation + Azure judge shifts wording and chunk emphasis vs AWS references — scores move even when answers are acceptable. Mitigation: re-bootstrap with LLM_PROVIDER=azure.

2. `context_precision` is the bottleneck¶

RAGAS scores each retrieved chunk against the reference. Multi-query + RRF improve recall by merging more candidates; rerank trims to top-N but irrelevant fused chunks still depress precision (~0.50–0.58 across AWS/Azure) while AWS Phase 5 tuned reaches context_recall 0.80.

RERANK_TOP_N=2: modest precision gain, often lower recall/faithfulness.
Disable multi-query: precision up to ~0.583, recall ~0.60.
Stricter keyword prefilter: precision drops (~0.40) — over-filters hard queries.

3. Chain vs LangGraph in eval¶

Previously backend/tests/conftest.py forced RAG_ENGINE=chain, so tox -e eval skipped Phase 5 nodes despite run_eval_phase5.sh. The best Azure relevancy/recall run used chain + KB only — not comparable to AWS Phase 5 tuned. Fix (2026-05-19): when RAGAS_EVAL=1, conftest preserves RAG_ENGINE from env.

4. Judge variance¶

answer_relevancy can fail per-metric while test_full_suite_report shows high means (~0.88–0.98). Repeated identical configs varied (e.g. faithfulness 0.838 vs 0.617). Use same session when A/B tuning.

5. Levers to move all metrics above gates¶

Lever	Effect
Re-bootstrap golden with eval LLM	Align faithfulness / recall
Ingestion / chunking / KB hygiene	Precision + recall upstream
FlashRank A/B	Precision vs latency
Metadata filters	Precision on subset
Gate policy	Track without blocking CI — EVALUATION.md

Only context_recall reliably passes under AWS Phase 5 tuned (0.800). Sub-threshold scores are documented baselines, not demo blockers.

Commands¶

tox -e eval
./scripts/run_eval_phase5.sh
tox -e eval -- python -m pytest \
  backend/tests/eval/test_rag_quality.py::TestRAGQuality::test_full_suite_report \
  -v -s -m slow --log-cli-level=WARNING
RAGAS_QUALITY_GATE=1 ./scripts/run_eval_phase5.sh

Eval hygiene: RAGAS_DO_NOT_TRACK=true in tox; quieter vendor loggers in backend/tests/eval/conftest.py.

EVALUATION.md
roadmap/PRODUCT_ROADMAP.md
scripts/bootstrap_golden_dataset.py, scripts/promote_golden_draft.py