Evaluation strategy (RAGAS + LangSmith)¶
How to judge additions to this project — when changing RAG behavior, LangGraph nodes, or retrieval settings.
The baseline below is an engineering signal, not a marketing claim. context_recall passes the gate under the tuned AWS profile; faithfulness, answer_relevancy, and context_precision are documented baselines with active levers (ingestion/chunking, rerank tuning, golden-set re-bootstrap). Gates are release controls, not blockers for local demo or PR CI.
Two tools, two jobs¶
| RAGAS | LangSmith | |
|---|---|---|
| What | Open-source RAG metric library | Platform for traces, datasets, experiments |
| Question answered | "Did quality improve or regress?" | "Why was this request slow/wrong?" |
| In this repo | backend/tests/eval/test_rag_quality.py, golden_dataset.json |
simple_tracer.py, LANGCHAIN_TRACING_V2 |
| CI gate | Yes — RAGAS_QUALITY_GATE=1 on release/nightly |
Optional — traces for debug |
| Same thing? | No — complementary |
LangSmith can run evaluators (including RAG-like judges), but do not treat LangSmith scores as RAGAS without calibrating on the same dataset.
RAGAS metrics (defaults)¶
| Metric | Default min | Interpretation |
|---|---|---|
| faithfulness | 0.85 | Answer grounded in retrieved context |
| answer_relevancy | 0.80 | Answer addresses the question |
| context_recall | 0.75 | Retrieval covers ground truth |
| context_precision | 0.70 | Retrieved chunks are on-topic |
RAGAS baseline (2026-05-19)¶
The checked-in golden set was bootstrapped from one campus knowledge base deployment (Canvas LMS teaching-and-learning and ServiceNow IT articles). Questions and ground_truth rows are corpus-specific—re-run scripts/bootstrap_golden_dataset.py after ingesting your own KB.
Full score tables, Azure sweep results, tuning profiles, and findings: eval_baseline_2026-05-19.md.
| Stack | Best-known highlight | Precision |
|---|---|---|
| AWS Phase 5 tuned (LangGraph) | context_recall 0.80 (passes gate) | ~0.50 |
| Azure LLM + judge, chain path | faithfulness ~0.84, relevancy ~0.98, recall ~0.83 | ~0.50 |
| Azure LangGraph precision-balanced | relevancy ~0.88; recall often ~0.60 | ~0.54–0.58 |
context_precision (~0.70 gate) is the bottleneck for all profiles; multi-query fusion adds recall but also noisy chunks. Align bootstrap provider with eval judge when comparing faithfulness/recall. Eval now respects RAG_ENGINE=langgraph when RAGAS_EVAL=1 (see baseline doc). Sub-threshold scores are documented baselines, not demo blockers.
CI gate policy¶
| Event | RAGAS_QUALITY_GATE |
Notes |
|---|---|---|
PR / main CI (tox -e lint,backend,frontend-vue) |
0 (default) | Eval not required; keeps PRs fast without AWS |
| Local / release milestone | 1 | tox -e eval or ./scripts/run_eval_phase5.sh; needs judge + Bedrock |
CD release workflow |
1 when secrets configured | See CI.md |
Env overrides: RAGAS_FAITHFULNESS_MIN, RAGAS_ANSWER_RELEVANCY_MIN, etc.
Run locally¶
Runtime: The first metric test builds the dataset with one live RAG call per golden question (10 questions ≈ several minutes on AWS). RAGAS scoring then calls the judge LLM per row. Use pytest -v -s (enabled in tox -e eval) to see progress logs.
tox -e eval
# Phase 5 tuned profile:
./scripts/run_eval_phase5.sh
# or: PYTHONPATH=. pytest backend/tests/eval/ -v -m slow
Requires judge LLM: OPENAI_API_KEY, AZURE_OPENAI_API_KEY, or RAGAS_LLM_PROVIDER.
When to run what¶
| Event | RAGAS | LangSmith |
|---|---|---|
| Every PR | Optional — unit tests with mocked RAG | Dev only |
| Pre-release / milestone | Full golden set; compare to baseline | Trace screenshots in README |
| LangGraph parity (Phase 4 (LangGraph)) | chain vs RAG_ENGINE=langgraph, ±0.02 |
Per-node spans |
| Retrieval change (Phase 5 (retrieval)) | Primary metric + faithfulness guardrail | Compare runs |
| Nightly staging | Full gate with secrets | SLO debugging |
Scorecard (per feature)¶
| Field | Value |
|---|---|
| Change | e.g. multi-query retrieval node |
| Roadmap phase | Product phase (see PRODUCT_ROADMAP.md) |
| Primary metric | context_recall |
| Guardrails | faithfulness ≥ 0.85; p95 latency < X ms |
| Baseline | commit / eval_baseline_2026-05-19.md |
| Result | +0.04 recall; faithfulness 0.86 |
| Ship | yes / flag-only / no |
Ship rules¶
- LangGraph parity: ship if RAGAS within ε of chain and tests green.
- Retrieval / prompts: ship if primary metric improves and guardrails hold.
- Agentic (Phase 6 (agentic)): flag-only until staging week is stable.
LangGraph-specific evaluation¶
- Baseline RAGAS on
RAG_ENGINE=chain(or current default). - Run same golden set on
RAG_ENGINE=langgraph. - Compare all four metrics; investigate faithfulness regressions first.
- Use LangSmith to see whether regressions are condense vs retrieve vs generate.
Bootstrapping the golden set (live AWS KB)¶
Rebuild golden_dataset.json from live RAG so ground_truth and contexts match what Bedrock KB retrieval returns.
Prerequisites: RAG_FORCE_MOCK=false, LLM_PROVIDER=aws (or azure), valid AWS credentials, RAG_ENGINE as used in eval (e.g. langgraph).
./venv/bin/python scripts/bootstrap_golden_dataset.py
./venv/bin/python scripts/bootstrap_golden_dataset.py --only 3
./venv/bin/python scripts/bootstrap_golden_dataset.py --dry-run
Output: backend/tests/eval/golden_dataset.draft.json (includes _bootstrap review metadata).
Review checklist before promoting:
- Drop rows where the model returns an out-of-scope refusal despite non-empty contexts.
- Tighten
ground_truthto 2–4 factual sentences. - Keep
contextsas verbatim retrieved chunk text (top 3). - Restore hand-authored
ground_truthwhen live answers are refusals but KB content is on-topic.
Seeds: backend/tests/eval/seed_questions.json. Draft file is gitignored.
LangSmith run naming and traces¶
Chat RAG calls are traced as chat-session-<session_id> with tags:
session_id:<id>request_id:<id>whenX-Request-IDis setresearch_mode:kborresearch_mode:web
Filter the LangSmith UI by run name or tags. With RAG_ENGINE=langgraph, child spans include condense, multi_query, retrieve (or web_search), rerank (KB path), generate, and format.
Screenshots: docs/assets/observability/ (README uses langsmith-trace-kb-waterfall.png).
Capture a trace for docs¶
- In
.env:LANGCHAIN_TRACING_V2=true,LANGCHAIN_API_KEY,LANGCHAIN_PROJECT=chatbot-poc,RAG_ENGINE=langgraph. - Restart API; send one chat message from http://127.0.0.1:5173.
- In LangSmith, open the project and find run name
chat-session-<id>. - Expand condense → multi_query → retrieve (or web_search) → generate → format.
- Screenshot waterfall or tree; save under
docs/assets/observability/(e.g.langsmith-trace-kb-waterfall.png). - Commit the PNG; README links
docs/assets/observability/langsmith-trace-kb-waterfall.png.
Growing the golden set¶
- Start with domain Q&A in
golden_dataset.json. - Add failures from manual testing and (anonymized) bad traces.
- Tag items:
"tags": ["ambiguous", "multi-hop"]for targeted runs.
Related¶
- roadmap/PRODUCT_ROADMAP.md
- roadmap/LANGGRAPH.md
- assets/README.md — demo script