Scribe IQ — product case study

One-line summary

Scribe IQ is a governed clinical documentation AI prototype showing how an offline synthetic clinical corpus becomes a clinical AI product surface: corpus construction, Postgres/pgvector serving, provider-agnostic LLM workflows, clinical UI, and Responsible AI auditability.


Why I built it

Higher-education systems and healthcare systems share more shape than most people notice: longitudinal records about people, governance constraints that are not optional, multiple stakeholders with different read/write boundaries, and a strong norm against fluent but ungrounded automation. I built Scribe IQ to make that bridge legible — to show, with running code, that the way I think about institutional data systems transfers directly to clinical documentation AI.

The headline problem the project solves is the one that defines clinical LLM work: language models are fluent liars. They produce confident, well-formatted text that may have no basis in the patient record. In a clinical setting that is not a usability bug — it is a safety failure. Scribe IQ is built around a specific answer: ground every response in the actual stored record, require citations, and log what the model saw and what it produced.


Bridge from higher-ed work to healthcare

| Higher-ed (university / institutional systems) | Healthcare (Scribe IQ) |
| --- | --- |
| Longitudinal student records across enrollment, advising, financial aid | Longitudinal patient records across encounters, notes, conditions |
| Governance constraints (FERPA, audit, role boundaries) treated as first-class schema | Governance constraints (PHI posture, ai_interactions audit, provider egress) treated as first-class schema |
| Multiple stakeholder views (student, advisor, registrar, analyst) with different read/write semantics | Multiple stakeholder views (clinician chart, admin Responsible AI Control Center, corpus pipeline operator) |
| Reporting/dashboards that must reconcile to source-of-truth tables, not to a derived cache | Chat answers that must reconcile to source-of-truth notes via citation contract, not post-hoc paraphrase |
| Pipelines (ETL, refresh, dataset cards) kept offline and out of the request path | Corpus pipeline (Synthea → note pool → adapt → JSONL) kept offline and out of the request path |
| Identity / SSO / RBAC as the eventual seam | OptionalApiKeyMiddleware as the explicit eventual seam |

The transfer is not metaphor. The same instincts — colocate the governance with the data, prefer one store with strong consistency over two stores with drift, write audit on the request path instead of through a queue — produce the same shape of system.


Product flows demonstrated

  • Patient list and chart read path with a Read / Sources / Codes & map tab structure that mirrors how clinicians actually scan a chart.
  • Pre-meeting summary generated by an LLM and cached under a fingerprint of the underlying note set, so a stale narrative is never served after new notes arrive.
  • Care timeline anchored to the latest event with pagination over encounters.
  • Generate-note panel with explicit feature-flag gating and a structured-output contract.
  • Grounded RAG chat over note embeddings with a citation contract enforced by the system prompt ([note:uuid]), and explicit 503 surfacing when embeddings are absent.
  • Responsible AI Control Center (admin) surfacing ai_interactions rows for inspection.
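
The fingerprint-keyed summary cache in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `note_set_fingerprint`, `SummaryCache`, and the choice of hashing `(note_id, updated_at)` pairs are hypothetical, and the in-memory dict stands in for whatever storage the service actually uses.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def note_set_fingerprint(notes: List[Tuple[str, str]]) -> str:
    """Hash (note_id, updated_at) pairs in a stable order.

    Adding, removing, or editing a note changes the digest.
    """
    digest = hashlib.sha256()
    for note_id, updated_at in sorted(notes):
        digest.update(f"{note_id}|{updated_at}\n".encode("utf-8"))
    return digest.hexdigest()


class SummaryCache:
    """Hypothetical in-memory stand-in for a per-patient summary cache."""

    def __init__(self) -> None:
        # patient_id -> (fingerprint, summary text)
        self._entries: Dict[str, Tuple[str, str]] = {}

    def get(self, patient_id: str, fingerprint: str) -> Optional[str]:
        entry = self._entries.get(patient_id)
        if entry is None or entry[0] != fingerprint:
            return None  # missing or stale: the caller must regenerate
        return entry[1]

    def put(self, patient_id: str, fingerprint: str, summary: str) -> None:
        self._entries[patient_id] = (fingerprint, summary)
```

Because the lookup compares fingerprints rather than timestamps or TTLs, a new note invalidates the cached summary on the very next read, without any separate invalidation step.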

Architecture demonstrated

  • Persistence: Postgres 16 with pgvector, colocating relational rows and vector embeddings. One transactional model, one failure mode.
  • Service: FastAPI with an asyncpg pool, structured logging, X-Request-ID propagation, and append-only ai_interactions writes for audited AI paths.
  • Frontend: Next.js App Router, capability flags read from GET /health, degraded states surfaced explicitly rather than hidden.
  • Corpus pipeline: a nine-step offline data_prep/ pipeline (Synthea + ACI-Bench + MTSamples + MedSynth → match → score → cohort → adapt → validate) that produces a JSONL artifact, a dataset card, and an audit report. The artifact is loaded into Postgres via scribe-load-corpus.
  • Provider abstraction: LLM and embeddings are configurable across Groq, OpenAI, Azure OpenAI, and Amazon Bedrock through a typed Settings layer.
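
A typed settings layer of the kind described in the last bullet might look like the sketch below. It is a stdlib-only illustration, not the project's Settings class: the environment variable names `LLM_PROVIDER` and `EMBEDDING_PROVIDER`, the enum member names, and the `from_env` helper are all assumptions. The provider lists and the Groq default come from this document.

```python
import os
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional


class LLMProvider(str, Enum):
    GROQ = "groq"
    OPENAI = "openai"
    AZURE_OPENAI = "azure_openai"
    BEDROCK = "bedrock"


class EmbeddingProvider(str, Enum):
    OPENAI = "openai"
    AZURE_OPENAI = "azure_openai"
    BEDROCK = "bedrock"


@dataclass(frozen=True)
class Settings:
    llm_provider: LLMProvider
    # None means embeddings are not configured at all.
    embedding_provider: Optional[EmbeddingProvider]

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Variable names are hypothetical. Unknown values raise ValueError
        # at startup instead of failing later inside a request.
        llm = LLMProvider(env.get("LLM_PROVIDER", "groq"))
        raw_embedding = env.get("EMBEDDING_PROVIDER")
        embedding = EmbeddingProvider(raw_embedding) if raw_embedding else None
        return cls(llm_provider=llm, embedding_provider=embedding)
```

Parsing into enums keeps the LLM and embedding choices independent and makes a misspelled provider name a startup error rather than a runtime surprise.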

Responsible AI demonstrated

  • ai_interactions is a first-class migration, not an observability sidecar. Audited AI paths write append-only rows with redacted previews, content hashes, status values used by the admin surface (success, degraded, failed, blocked), and a JSONB governance blob.
  • Citation contract enforced in the prompt rather than by a post-hoc verifier: the cheap, reliable choice for grounded RAG at this scale. The audit row provides post-hoc visibility for completed and handled degraded interactions.
  • Degraded-state visibility: admin aggregation distinguishes successful, degraded, failed, and blocked rows where the route records an interaction. Provider exceptions that occur before an audit insert are surfaced to the caller rather than silently converted into successful-looking audit entries.
  • Synthetic data only, with the boundary stated plainly in PRIVACY_AND_PROVIDER_BOUNDARIES.md — no real PHI, no claim of HIPAA readiness.
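
To make the [note:uuid] citation contract concrete, here is a small sketch of a system prompt that states the contract, plus a helper that reads citation markers back out of an answer. Both functions and the prompt wording are hypothetical. The project enforces the contract in the prompt; the extraction helper is shown only to illustrate what the marker format makes possible (for example, populating a sources view), not as a verifier the codebase is claimed to contain.

```python
import re
import uuid
from typing import List

# Matches markers such as [note:123e4567-e89b-12d3-a456-426614174000].
CITATION_RE = re.compile(r"\[note:([0-9a-fA-F-]{36})\]")


def build_system_prompt(note_ids: List[str]) -> str:
    """State the citation contract and list the notes that may be cited."""
    citable = "\n".join(f"- [note:{note_id}]" for note_id in note_ids)
    return (
        "Answer only from the notes provided.\n"
        "Cite every claim with a marker of the form [note:<uuid>].\n"
        "If the notes do not contain the answer, say so.\n"
        "Citable notes:\n" + citable
    )


def cited_note_ids(answer: str) -> List[str]:
    """Return distinct, well-formed cited note ids in order of appearance."""
    seen: List[str] = []
    for match in CITATION_RE.finditer(answer):
        try:
            note_id = str(uuid.UUID(match.group(1)))
        except ValueError:
            continue  # 36 characters, but not a valid UUID
        if note_id not in seen:
            seen.append(note_id)
    return seen
```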

Provider strategy

LLM and embedding providers are intentionally pluggable to make the healthcare-realistic posture legible: in real deployments, the institution chooses the provider boundary; the demo's author does not.

  • Groq is the default demo LLM provider (fast, cheap, good enough for the narrative shape).
  • Azure OpenAI is the natural production posture for many healthcare systems that already have a Microsoft tenancy and a BAA-eligible Azure deployment.
  • Amazon Bedrock is the natural production posture for AWS-native systems and supports model selection per organization policy.
  • Embeddings are independently configurable across OpenAI, Azure OpenAI, and Bedrock. Switching embedding providers requires re-embedding because vector spaces are not interchangeable — the docs say this in plain terms.
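
The re-embedding requirement in the last bullet can be expressed as a startup guard. The sketch below assumes the deployment records which provider, model, and dimensionality produced the stored vectors; `EmbeddingSpace`, `EmbeddingSpaceMismatch`, and `assert_same_space` are hypothetical names, not part of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpace:
    provider: str
    model: str
    dimensions: int


class EmbeddingSpaceMismatch(RuntimeError):
    """Raised when stored vectors and query vectors come from different spaces."""


def assert_same_space(stored: EmbeddingSpace, configured: EmbeddingSpace) -> None:
    # Equal dimensions are not enough: two models with the same vector
    # length still produce incomparable vectors.
    if stored != configured:
        raise EmbeddingSpaceMismatch(
            f"stored vectors use {stored}, but queries would use {configured}; "
            "re-embed the corpus before serving retrieval"
        )
```

Comparing the full (provider, model, dimensions) triple rather than only the vector length catches the subtle case where a provider switch keeps the dimensionality but silently degrades similarity search.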

See docs/guides/LLM_AND_EMBEDDING_PROVIDERS.md for the full provider configuration matrix.


Product thinking signals

  • Scope boundary stated once, plainly. Synthetic data only, demonstration system, single-tenant local deployment, English-only. These are choices, not gaps.
  • Deferred list is as deliberate as the built list. Audio transcription, agentic tool loops, enterprise SSO, LangGraph orchestration, and a hosted demo are all called out with rationale and a clear extension seam.
  • Audience-routed documentation. A PM, an architect, a hands-on engineer, an as-built reviewer, and a corpus pipeline reader each have a designated entry point.
  • Honest capability flags. Features that are unconfigured surface explicit 503s or placeholders; nothing is silently absent.
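
As a rough illustration of honest capability flags, the sketch below derives a health payload and a chat precheck from the same flags, so the UI and the API cannot disagree about what is configured. The flag names, payload shape, and messages are assumptions; only the use of GET /health for capability flags and the 503 for absent embeddings come from this document.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical flag names.
    llm_configured: bool
    embeddings_configured: bool


def health_payload(caps: Capabilities) -> Dict[str, Any]:
    """Body a GET /health handler could return for the frontend to read."""
    return {"status": "ok", "capabilities": asdict(caps)}


def chat_precheck(caps: Capabilities) -> Tuple[int, str]:
    """Return (HTTP status, message) before any retrieval or LLM call."""
    if not caps.embeddings_configured:
        return 503, "Grounded chat unavailable: embeddings are not configured."
    if not caps.llm_configured:
        return 503, "Grounded chat unavailable: no LLM provider is configured."
    return 200, "ok"
```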

Engineering discipline signals

  • One canonical path through QUICKSTART; alternative configurations are documented but never compete for primacy.
  • Pre-commit hooks and secret-pattern checks versioned under .githooks/ and wired through scripts/install_dev_hooks.sh.
  • Documentation hygiene is itself a recorded artifact (docs/history/EVOLUTION.md), and superseded long prompts live under docs/archive/ with archive banners — the repository's documentation has a paper trail.
  • X-Request-ID propagated from frontend through FastAPI handlers to structured logs so user-visible actions are traceable end-to-end without logging PHI in bodies.
  • Decimal-numbered pipeline steps (05.5, 06.5) preserved as legible evidence that the pipeline evolved under review rather than being designed perfectly the first time.
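
The X-Request-ID propagation mentioned above can be sketched with the standard library alone. This is not the project's middleware: `bind_request_id`, `RequestIdFilter`, `build_logger`, and the logger name are hypothetical, and a real FastAPI service would set the context variable inside its own middleware. Headers are passed as a plain dict here, so the lookup is case-sensitive, unlike a real framework's header object.

```python
import contextvars
import logging
import uuid
from typing import Mapping, TextIO

# Holds the current request's id for the duration of that request.
request_id_var: contextvars.ContextVar = contextvars.ContextVar(
    "request_id", default="-"
)


class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def bind_request_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's X-Request-ID, or mint one if it is absent."""
    request_id = headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def build_logger(stream: TextIO) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger = logging.getLogger("scribe_sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

The log line carries the id and an event name only, which is how a trace can stay end-to-end without request or response bodies appearing in logs.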

What is intentionally not production-ready

These are explicit choices, not omissions. Each names the seam where the production change would land.

  • No real authentication / SSO. OptionalApiKeyMiddleware is the seam; SSO with org-scoped tokens and per-tenant DB isolation would replace it.
  • No multi-tenant isolation beyond domain on rows. Row-level security or schema-per-tenant is the next step; the data model already anticipates it.
  • No PHI handling. Synthetic data only. PHI-readiness requires institutional approval, BAA, private networking, formal de-identification, and policy work the demo does not pretend to substitute for.
  • No hosted demo URL. Out of scope for this documentation pass; planned separately.
  • No agentic tool-loop chat. Single-shot grounded RAG is the right baseline to govern first; the audit schema already accommodates per-step records when tool loops earn their complexity.
  • No production observability. The audit table is a complete record of AI behavior, not a complete record of system health. OpenTelemetry traces, metrics, and structured log aggregation are the next layer.

How I would extend it

  • Identity and tenancy: replace OptionalApiKeyMiddleware with SSO and bind ai_interactions rows to authenticated principals; enforce per-tenant isolation on domain.
  • Embeddings strategy: benchmark domain-specific embeddings (BGE-M3, MedCPT) and move to hybrid retrieval (BM25 + dense) with reranking; the provider abstraction already exists.
  • LLM provider strategy: add a verified fallback (Anthropic or OpenAI) behind a feature flag; pin prompt versions per route and record hashes in ai_interactions, which the schema already supports.
  • Agentic chat behind a flag: search_notes, get_encounter, get_patient_summary as model-callable tools; benchmark latency cost honestly before defaulting it on.
  • Observability layer: OpenTelemetry traces on FastAPI handlers, metrics surface for latency and error rates by route, log aggregation. X-Request-ID propagation is already in place.
  • Hosted demo + walkthrough video: a short narrated tour showing the chart read path, the citation contract in action, and the Responsible AI Control Center.

What makes this more than a chatbot demo

  • Governance is a schema decision, not a logging decision. ai_interactions ships in Alembic migrations alongside patients and notes. The audit row writes on the request path — same operational window as the user-visible response — so the audit reflects what happened, not what an async sidecar later inferred.
  • The corpus is a deliberate artifact, not a synthetic prop. Nine pipeline steps with decimal-numbered insertions, a quality scorer, a cohort selector, an LLM-adapter pass, and a validation report with a dataset card. The data is part of the engineering surface.
  • Provider boundaries are stated, not assumed. The docs name what leaves the deployment, what the audit table redacts, and where enterprise providers help — and where they do not by themselves create PHI compliance.
  • The deferred list is as load-bearing as the built list. Knowing what not to build, and writing down why and where the seam lives, is the product-thinking signal that distinguishes a finished demonstration from a sprawling prototype.
  • Documentation is audience-routed. Reviewers do not have to read everything to evaluate the system; the entry table sends each audience to the right depth.
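
The first point, audit written on the request path, can be sketched as below. Everything here is a simplified assumption: SQLite stands in for Postgres, the column names are illustrative rather than the project's migration, the redaction rule is a placeholder, and treating an empty response as "degraded" is an invented rule for the example. The status values and the idea of storing hashes plus a redacted preview come from this document.

```python
import hashlib
import json
import re
import sqlite3
import uuid
from typing import Callable


def create_audit_table(conn: sqlite3.Connection) -> None:
    # Illustrative columns only; the real table ships as an Alembic migration.
    conn.execute(
        """
        CREATE TABLE ai_interactions (
            id TEXT PRIMARY KEY,
            request_id TEXT NOT NULL,
            route TEXT NOT NULL,
            status TEXT NOT NULL
                CHECK (status IN ('success', 'degraded', 'failed', 'blocked')),
            prompt_sha256 TEXT NOT NULL,
            response_sha256 TEXT NOT NULL,
            response_preview TEXT NOT NULL,
            governance TEXT NOT NULL  -- JSON text here; JSONB in Postgres
        )
        """
    )


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redacted_preview(text: str, limit: int = 80) -> str:
    # Placeholder rule: mask digits, then truncate. Real redaction needs more.
    return re.sub(r"\d", "#", text)[:limit]


def answer_with_audit(
    conn: sqlite3.Connection,
    request_id: str,
    route: str,
    prompt: str,
    generate: Callable[[str], str],
) -> str:
    # A provider exception raised here propagates to the caller; no
    # successful-looking row is written for it.
    response = generate(prompt)
    status = "success" if response.strip() else "degraded"
    with conn:  # commits the insert before the response is returned
        conn.execute(
            "INSERT INTO ai_interactions VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (
                str(uuid.uuid4()),
                request_id,
                route,
                status,
                sha256_hex(prompt),
                sha256_hex(response),
                redacted_preview(response),
                json.dumps({"synthetic_data_only": True}),
            ),
        )
    return response
```

Because the insert commits inside the same function that returns the response, there is no window in which a user has seen an answer that the audit table does not know about.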

For role-fit interpretation across academic health, university IT, research, education innovation, and AI platform reviews, see TARGET_ROLE_ALIGNMENT.md.