
RAGAS eval baseline — 2026-05-19

Bootstrap of backend/tests/eval/golden_dataset.json from the live AWS Bedrock Knowledge Base (OpenSearch-backed index, RAG_ENGINE=langgraph). The golden ground_truth and contexts fields reflect AWS-generated output at bootstrap time.

Dataset

| Item | Value |
| --- | --- |
| Seed questions | 12 (seed_questions.json) |
| Promoted rows | 10 |
| Dropped | SpeedGrader audio feedback; Gradebook excuse (model out-of-scope despite retrieval) |

Bootstrap command

./venv/bin/python scripts/bootstrap_golden_dataset.py
./venv/bin/python scripts/promote_golden_draft.py

RAGAS score comparison (AWS Bedrock — original baseline)

Historical runs on the same 10-row golden set with AWS LLM and Phase 5 LangGraph retrieval. These values are retained as the primary regression reference.

| Metric | Threshold | Baseline (no Phase 5) | Phase 5 (initial) | Phase 5 tuned | Gate |
| --- | --- | --- | --- | --- | --- |
| faithfulness | ≥ 0.85 | 0.662 | 0.712 | 0.754 | fail |
| answer_relevancy | ≥ 0.80 | 0.675 | 0.801 | 0.789 | fail |
| context_recall | ≥ 0.75 | 0.600 | 0.750 | 0.800 | pass |
| context_precision | ≥ 0.70 | 0.600 | 0.505 | 0.504 | fail |

Phase 5 tuned (AWS) (~3m 12s): keyword rerank, MULTI_QUERY_COUNT=2, RRF, RERANK_TOP_N=3, RERANK_PREFILTER_MAX=10, RERANK_MIN_KEYWORD_OVERLAP=0 (historical ./scripts/run_eval_phase5.sh profile).

AWS tuning notes (retained): FlashRank + single extra query regressed relevancy/recall vs keyword + two queries. RERANK_TOP_N=2 raised precision to ~0.62 in spot checks but hurt recall; the table above uses RERANK_TOP_N=3 as the balanced AWS profile.
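The precision/recall trade-off behind the RERANK_TOP_N choice can be illustrated with a toy calculation. This is a minimal sketch on hypothetical relevance flags using plain set-based precision and recall; it is not the LLM-judged RAGAS metric, and `mean_precision_recall` and `toy_queries` are illustrative names, not part of the project.

```python
def mean_precision_recall(queries, top_n):
    """Average set-based precision and recall after keeping the first top_n chunks.

    Each query is a pair: (ranked relevance flags, total relevant chunks).
    """
    precisions, recalls = [], []
    for flags, total_relevant in queries:
        kept = flags[:top_n]
        hits = sum(kept)
        precisions.append(hits / len(kept) if kept else 0.0)
        recalls.append(hits / total_relevant if total_relevant else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


# Hypothetical rankings (True = relevant chunk), two relevant chunks per query.
toy_queries = [
    ([True, True, False], 2),  # third chunk is noise: trimming it helps precision
    ([True, False, True], 2),  # third chunk is needed: trimming it hurts recall
]

for n in (2, 3):
    p, r = mean_precision_recall(toy_queries, n)
    print(f"RERANK_TOP_N={n}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, trimming to two chunks lifts mean precision (0.75 vs 0.67) and lowers mean recall (0.75 vs 1.00), the same direction as the spot checks noted above.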


RAGAS runs — Azure OpenAI LLM + judge (same golden set)

Follow-up sweeps with LLM_PROVIDER=azure, RETRIEVER_PROVIDER=aws, and the same golden_dataset.json (AWS-bootstrap references). Judge: Azure GPT-4o via LLM_PROVIDER / RAGAS_LLM_PROVIDER. Scores vary run to run (roughly ±0.05–0.15); the values below are point estimates from local runs on 2026-05-19.

Summary vs default gates

| Profile | faithfulness | answer_relevancy | context_recall | context_precision | Notes |
| --- | --- | --- | --- | --- | --- |
| Azure + chain (best single run) | 0.838 | 0.980 | 0.833 | 0.500 | Pre–conftest-fix; RETRIEVER_NUMBER_OF_RESULTS=4 |
| Azure + chain (n=2) | 0.642 | 0.886 | 0.600 | 0.508 | |
| Azure + chain (n=3) | 0.639 | 0.886 | 0.600 | 0.558 | |
| Azure + chain (n=4, repeat) | 0.617 | 0.885 | 0.600 | 0.458 | Judge variance |
| LangGraph P5 (TOP_N=3, MQ=2) | 0.768 | failed† | 0.600 | 0.558 | †per-metric test |
| LangGraph precision-balanced (TOP_N=2, MQ=3) | 0.651 | 0.885 | 0.600 | 0.542 | Current run_eval_phase5.sh |
| LangGraph TOP_N=2, MQ=2, overlap≥1 | 0.643 | 0.891 | 0.600 | 0.492 | |
| LangGraph no multi-query (TOP_N=2) | 0.638 | 0.886 | 0.600 | 0.583 | Best Azure precision |
| LangGraph FlashRank TOP_N=2 | 0.684 | 0.886 | 0.550 | 0.492 | |
| LangGraph tight prefilter (overlap≥2) | 0.626 | 0.867 | 0.600 | 0.398 | |

Gates: faithfulness ≥ 0.85, answer_relevancy ≥ 0.80, context_recall ≥ 0.75, context_precision ≥ 0.70. No Azure profile passed all four.
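The gate check itself is a per-metric threshold comparison. The following is a minimal sketch assuming scores arrive as a plain dict of metric name to mean score; `GATES` mirrors the thresholds stated above, while `gate_report` is a hypothetical helper and not the project's actual gate code.

```python
# Thresholds copied from the gates listed in this report.
GATES = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
    "context_precision": 0.70,
}


def gate_report(scores, gates=GATES):
    """Map each gated metric to 'pass' or 'fail'; a missing score counts as fail."""
    report = {}
    for metric, threshold in gates.items():
        value = scores.get(metric)
        passed = value is not None and value >= threshold
        report[metric] = "pass" if passed else "fail"
    return report


# Scores of the "Azure + chain (best single run)" profile from this report.
best_azure_chain = {
    "faithfulness": 0.838,
    "answer_relevancy": 0.980,
    "context_recall": 0.833,
    "context_precision": 0.500,
}

report = gate_report(best_azure_chain)
print(report)
print("all gates passed:", all(v == "pass" for v in report.values()))
```

Even the best single Azure run clears only answer_relevancy and context_recall, which is why no profile passes all four gates.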

| Setting | Value |
| --- | --- |
| RAG_ENGINE | langgraph |
| RERANK_ENABLED | true |
| RERANK_BACKEND | keyword |
| RERANK_TOP_N | 2 |
| RERANK_CANDIDATE_K | 15 |
| RERANK_PREFILTER_MAX | 10 |
| RERANK_MIN_KEYWORD_OVERLAP | 1 |
| MULTI_QUERY_ENABLED | true |
| MULTI_QUERY_COUNT | 3 |

Findings — why these scores

1. Golden reference vs live stack

ground_truth was bootstrapped with AWS Bedrock answers. RAGAS faithfulness and context_recall compare live output to that reference. Azure generation + Azure judge shifts wording and chunk emphasis vs AWS references — scores move even when answers are acceptable. Mitigation: re-bootstrap with LLM_PROVIDER=azure.

2. context_precision is the bottleneck

RAGAS scores each retrieved chunk against the reference. Multi-query + RRF improve recall by merging more candidates, and the rerank step trims the fused list to the top N. Irrelevant fused chunks that survive the trim still depress precision (~0.50–0.58 across AWS/Azure), even while AWS Phase 5 tuned reaches context_recall 0.80.

  • RERANK_TOP_N=2: modest precision gain, often lower recall/faithfulness.
  • Disable multi-query: precision up to ~0.583, recall ~0.60.
  • Stricter keyword prefilter: precision drops (~0.40) — over-filters hard queries.
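To make the fusion-then-trim step concrete, here is a minimal sketch of reciprocal rank fusion (RRF) followed by a top-N cut. It assumes the textbook RRF formula with the conventional constant k=60; the project's actual constant, tie-breaking, and keyword rerank are not shown in this report, and the chunk ids are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in (rank starts at 1).
    Ties are broken by id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical retrieval results for the original query and one rewritten query.
original_query_hits = ["a", "b", "c"]
rewritten_query_hits = ["b", "d", "a"]

fused = rrf_fuse([original_query_hits, rewritten_query_hits])
print("fused candidates:", fused)
print("kept with top N=2:", fused[:2])
print("kept with top N=3:", fused[:3])
```

Two lists of three hits fuse into four candidates, so each extra query widens the pool (good for recall), and any chunk that only one query variant liked, such as "d" here, reaches the prompt only if the top-N cut lets it through.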

3. Chain vs LangGraph in eval

Previously backend/tests/conftest.py forced RAG_ENGINE=chain, so tox -e eval skipped the Phase 5 nodes even when launched via run_eval_phase5.sh. The best Azure relevancy/recall run (0.980 / 0.833) therefore used chain + KB only and is not comparable to AWS Phase 5 tuned. Fix (2026-05-19): when RAGAS_EVAL=1, conftest preserves RAG_ENGINE from the environment.
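The conftest behaviour described above could look roughly like the sketch below. It is an illustration only: `resolve_rag_engine` is a hypothetical helper, and falling back to chain when RAG_ENGINE is unset during an eval run is an assumption, since the actual conftest code is not shown here.

```python
import os


def resolve_rag_engine(environ=None):
    """Pick the engine for a test session.

    Regular test runs are pinned to "chain"; eval runs (RAGAS_EVAL=1) keep
    whatever RAG_ENGINE the caller exported.
    """
    env = os.environ if environ is None else environ
    if env.get("RAGAS_EVAL") == "1":
        # Assumption: fall back to "chain" when RAG_ENGINE is not set.
        return env.get("RAG_ENGINE", "chain")
    return "chain"


# Unit tests: the engine is forced regardless of the caller's environment.
print(resolve_rag_engine({"RAG_ENGINE": "langgraph"}))
# Eval run: the exported engine is preserved.
print(resolve_rag_engine({"RAGAS_EVAL": "1", "RAG_ENGINE": "langgraph"}))
```

Passing the environment as a parameter keeps the decision testable without mutating os.environ.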

4. Judge variance

answer_relevancy can fail in the per-metric test while test_full_suite_report shows high means (~0.88–0.98). Repeated runs of identical configs varied (e.g. faithfulness 0.838 vs 0.617). Run both arms of an A/B comparison in the same session when tuning.
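One way to keep judge variance from driving tuning decisions is to compare any A/B difference against the spread seen on repeated identical runs. The sketch below uses faithfulness values from this report; `summarize_runs` and `exceeds_noise` are hypothetical helpers, and treating the max-min spread as the noise band is a deliberately crude assumption given only two repeats.

```python
from statistics import mean


def summarize_runs(values):
    """Return the mean and the max-min spread of repeated scores for one config."""
    return {"mean": mean(values), "spread": max(values) - min(values)}


def exceeds_noise(score_a, score_b, spread):
    """True only if the A/B difference is larger than the observed repeat spread."""
    return abs(score_a - score_b) > spread


# Faithfulness of the same Azure + chain config on two runs (from this report).
repeat_runs = [0.838, 0.617]
noise = summarize_runs(repeat_runs)
print(f"mean={noise['mean']:.3f} spread={noise['spread']:.3f}")

# LangGraph P5 (0.768) vs LangGraph precision-balanced (0.651) faithfulness.
print("difference exceeds noise:", exceeds_noise(0.768, 0.651, noise["spread"]))
```

With a repeat spread of about 0.22, the 0.117 faithfulness gap between the two LangGraph profiles cannot be separated from judge noise on single runs.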

5. Levers to move all metrics above gates

| Lever | Effect |
| --- | --- |
| Re-bootstrap golden with eval LLM | Align faithfulness / recall |
| Ingestion / chunking / KB hygiene | Precision + recall upstream |
| FlashRank A/B | Precision vs latency |
| Metadata filters | Precision on subset |
| Gate policy | Track without blocking CI (see EVALUATION.md) |

Only context_recall reliably passes under AWS Phase 5 tuned (0.800). Sub-threshold scores are documented baselines, not demo blockers.


Commands

tox -e eval
./scripts/run_eval_phase5.sh
tox -e eval -- python -m pytest \
  backend/tests/eval/test_rag_quality.py::TestRAGQuality::test_full_suite_report \
  -v -s -m slow --log-cli-level=WARNING
RAGAS_QUALITY_GATE=1 ./scripts/run_eval_phase5.sh

Eval hygiene: RAGAS_DO_NOT_TRACK=true in tox; quieter vendor loggers in backend/tests/eval/conftest.py.