Production hardening backlog
Last updated: 2026-05-19
Optional work needed to move from a source-reviewable reference architecture to campus-scale production. Shipped product phases are listed in roadmap/PRODUCT_ROADMAP.md; campus-scale detail is in roadmap/archive/PHASED_IMPROVEMENT_ROADMAP.md.
Priority: P1 (before multi-tenant production) · P2 (scale / cost) · P3 (maturity / polish)
Backlog
| Item | Priority | Risk if deferred | Mitigation | Status |
|---|---|---|---|---|
| Managed Redis HA + distributed rate limits | P1 | Single-node rate limits; no shared session/cache at scale | Deploy ElastiCache / Azure Cache; wire REDIS_URL; extend rate-limit middleware | Planned — PHASED_IMPROVEMENT_ROADMAP |
| Secrets management (SSM / Key Vault vs .env on EB) | P1 | Credential leakage, rotation pain | Move secrets to parameter store; document rotation runbook | Partial — .env + EB today |
| Tenant isolation guarantees | P1 | Cross-tenant data bleed in shared DB | Enforce tenant_id on all queries; audit routes; optional schema-per-tenant | Logical isolation today — TENANT_CONFIG.md |
| Threat model + abuse controls | P1 | Auth brute-force, chat spam, cost blowout | Document threat model in SECURITY.md; per-user quotas beyond IP rate limit | Partial — rate limits, HTTP-only cookies |
| PII policy + log redaction audit | P1 | Compliance exposure in logs/traces | Formal PII classification; verify redaction on chat/JWT fields; LangSmith data retention policy | Partial — redaction shipped in logging pass |
| Observability dashboards + alerts | P2 | Slow incident response | Grafana from Prometheus; alert on p95 latency, 5xx, pool exhaustion | Metrics endpoint shipped; dashboards optional |
| Per-tenant token / cost budgets | P2 | Runaway LLM spend | Budget counters in Redis; route to cheaper model on threshold | Planned — campus track Phase 4 |
| Exact + semantic response cache | P2 | Repeated queries hit Bedrock every time | Redis cache keyed by (tenant_id, normalized_question) with TTL | Planned — campus track Phase 1 |
| IaC (Terraform / CDK) | P2 | Drift between EB .ebextensions and prod | Codify VPC, EB, OpenSearch, Bedrock KB wiring | Partial — EB config in repo |
| Async ingestion / KB sync cadence | P2 | Stale corpus; recall drops | Scheduled sync jobs; invalidation hooks for cache | AWS-managed via Bedrock KB connectors |
| Queueing for long RAG paths | P2 | Timeouts under load | Background worker for eval/heavy retrieve; SSE keep-alive | Not started |
| Postgres DR / backup posture | P2 | Data loss on failure | Automated backups, restore drill, RPO/RTO doc | Operator-dependent |
| LangGraph-native SSE (Phase 6a) | P3 | Higher TTFT on graph path | astream_events from graph; same SSE contract as chain | Optional — LANGGRAPH.md |
| Expand RAGAS golden set (10 → 30–50) | P3 | Thin regression signal | Bootstrap + manual curation; tag by topic/difficulty | 10 rows today |
| RAG service lifecycle (singleton vs per-request) | P3 | Latency / connection churn | Document tradeoff; optional app-state singleton with config refresh | Per-request construction today |
| Chain vs LangGraph consolidation | P3 | Dual maintenance | Keep both until 6a; deprecate chain when graph streams | Explicit ADR — ADR-002 |
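The response-cache row keys entries by (tenant_id, normalized_question) with a TTL. The following is a minimal sketch of the exact-match half of that idea, not the project's implementation. The names `normalize_question`, `cache_key`, and `TTLCache`, the `rag:cache:` key prefix, and the normalization rules are all hypothetical. `TTLCache` is an in-memory stand-in so the example runs without a Redis server.

```python
import hashlib
import re
import time
import unicodedata


def normalize_question(question: str) -> str:
    """Collapse case, punctuation, and whitespace so trivially different
    phrasings map to the same cache entry (assumed normalization rules)."""
    text = unicodedata.normalize("NFKC", question).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse runs of whitespace


def cache_key(tenant_id: str, question: str) -> str:
    """Build a key scoped to one tenant so answers never cross tenants."""
    digest = hashlib.sha256(
        normalize_question(question).encode("utf-8")
    ).hexdigest()
    return f"rag:cache:{tenant_id}:{digest}"  # hypothetical key layout


class TTLCache:
    """In-memory stand-in for a Redis-backed cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict the expired entry
            return None
        return value
```

Hashing the normalized question keeps keys at a fixed length, and putting `tenant_id` in the key matches the tenant isolation item above. The semantic half of the cache would need an embedding lookup and is out of scope for this sketch.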
Suggested hardening sequence
flowchart TD
secrets[Secrets_SSM]
redis[Redis_HA_RateLimit]
tenant[Tenant_Isolation_Audit]
threat[Threat_Model_Abuse]
obs[Dashboards_Alerts]
cost[Tenant_Budgets_Cache]
secrets --> redis --> tenant --> threat --> obs --> cost
- Secrets + Redis — foundation for distributed limits and cache.
- Tenant audit + threat model — governance before wider rollout.
- Observability + cost — operability at scale.
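The first stage makes rate limits distributed by moving counters into a shared store. A minimal sketch of a fixed-window limiter under that assumption follows. `InMemoryCounterStore` and `FixedWindowRateLimiter` are hypothetical names, and the store is an in-memory stand-in for what Redis `INCR` plus `EXPIRE` would provide. The existing rate-limit middleware may work differently.

```python
import time


class InMemoryCounterStore:
    """Stand-in for a shared counter store (Redis INCR + EXPIRE semantics)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._counters = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key, ttl_seconds):
        """Increment the counter, starting a fresh one if it has expired."""
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            count, expires_at = 0, now + ttl_seconds
        count += 1
        self._counters[key] = (count, expires_at)
        return count


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per tenant and user in each window."""

    def __init__(self, store, limit, window_seconds, clock=time.time):
        self._store = store
        self._limit = limit
        self._window_seconds = window_seconds
        self._clock = clock

    def allow(self, tenant_id, user_id):
        # Requests in the same window share one counter key.
        window = int(self._clock() // self._window_seconds)
        key = f"ratelimit:{tenant_id}:{user_id}:{window}"  # hypothetical layout
        count = self._store.incr_with_ttl(key, self._window_seconds)
        return count <= self._limit
```

Because every app node would increment the same key, the limit holds across nodes rather than per node. Keying by tenant and user instead of by IP is one way to express the per-user quotas mentioned in the threat-model item. A fixed window permits short bursts at window boundaries, so a sliding window may be preferable if that matters.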
Portfolio polish follow-ups (not code)
Tracked here so hiring readers see intentional scope boundaries:
| Follow-up | Why deferred | Owner action |
|---|---|---|
| 90-second demo GIF/video | Needs running app + recording | Record login → chat → sources → web toggle → LangSmith |
| Polished UI screenshot set | Current screenshots need a refresh with consistent sample questions and clean session history | Retake with a clean session history and deliberate sample questions |
| Golden set 10 → 30–50 | Requires live AWS + judge LLM time | ./scripts/bootstrap_golden_dataset.py + review |
Related
- OPERATIONS.md — runbooks, metrics, migrations
- SECURITY.md — dependency audit, production notes
- PERFORMANCE.md — history caps, latency metrics
- LOAD_TESTING.md — k6 profiles
- PORTFOLIO_CASE_STUDY.md — what ships today vs backlog