Downstream & Portfolio
This lakehouse is a data platform: it exists to produce one thing the downstream AI projects
can trust — gold.encounter_summary, a denormalized, one-row-per-encounter clinical corpus
published under a versioned contract. One governed, versioned interface, many consumers — the
point where a data platform earns its keep.
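The one-row-per-encounter grain is a property a consumer can check cheaply. A minimal sketch in Python, assuming a hypothetical key column named encounter_id (the real key name is defined by the contract, not by this page) and toy rows in place of the real table:

```python
from collections import Counter


def duplicate_encounters(rows, key="encounter_id"):
    """Return the key values that appear on more than one row.

    `key` defaults to "encounter_id", a hypothetical column name used
    only for illustration.
    """
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Toy rows standing in for gold.encounter_summary; the real table has more columns.
rows = [
    {"encounter_id": "enc-001", "soap_note_text": "S: ..."},
    {"encounter_id": "enc-002", "soap_note_text": "S: ..."},
]

# An empty result means the one-row-per-encounter grain holds for these rows.
assert duplicate_encounters(rows) == []
```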
Why this exists — the portfolio arc
scribe-iq came first. It proved the clinical-documentation product on a corpus assembled
heuristically — Synthea exported as CSV, with clinical notes stitched on from public datasets
(ACI-Bench, MTSamples, MedSynth) matched and adapted onto Synthea encounters.
scribe-iq-lakehouse is the principled rebuild — same domain, done rigorously: the data
foundation moved to Synthea Coherent (FHIR R4) and a true medallion lakehouse, with a semver'd,
test-gated Gold contract and two independent engine-native implementations (ADR-022).
The loop being closed (roadmap): an Ollama generation pipeline will consume
gold.encounter_summary to derive synthetic notes and dialogues — the next-generation corpus for
scribe-iq, superseding the original heuristic assembly. The arc: prototype the product →
industrialize the data foundation → generate the corpus from the governed contract — building
both the platform and the AI product on top of it.
flowchart LR
    GD["gold.encounter_summary<br/>governed contract v1.1.0"]
    OLL["Ollama generation pipeline<br/>(roadmap — not built)"]
    NOTES["synthetic unstructured<br/>notes + dialogues"]
    SIQ["scribe-iq<br/>clinical RAG corpus"]
    BERT["clinical-bert-pipeline"]
    GD --> OLL --> NOTES --> SIQ
    GD --> BERT
    classDef road fill:#fff7ed,stroke:#f59e0b,stroke-dasharray:4 3;
    class OLL,NOTES road
Status — the loop is roadmap
scribe-iq's current corpus is the heuristic assembly above, not this lakehouse. The
Ollama generation pipeline that turns Gold into scribe-iq's next corpus is planned — not yet
built. This page describes the intended direction.
The contract is the interface
Downstream projects pin against the contract's major version and treat the lakehouse as a black box behind it:
- clinical-bert-pipeline: the contract's NLP consumer, reading soap_note_text plus the structured labels (active_conditions, active_medications, …) for entity enrichment.
- Ollama generation pipeline (roadmap): derives synthetic notes/dialogues from each encounter summary, to become…
- scribe-iq (roadmap loop), whose next clinical RAG corpus is the Ollama-generated text grounded on Gold.
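Pinning against the major version can be sketched as a small guard on the consumer side. The function names and the PINNED_MAJOR constant below are illustrative assumptions, not part of any of the projects, and how the published version string reaches the consumer is left out.

```python
from typing import Tuple


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(published: str, pinned_major: int) -> bool:
    """A consumer accepts any published version that shares its pinned major."""
    return parse_semver(published)[0] == pinned_major


# Hypothetical consumer-side pin; 1.1.0 is the contract version shown in the diagram.
PINNED_MAJOR = 1

assert is_compatible("1.1.0", PINNED_MAJOR)      # minor bump: still accepted
assert not is_compatible("2.0.0", PINNED_MAJOR)  # major bump: consumer must opt in
```

A consumer that depends on a column added in a later minor release would also need a minimum-minor check, which this sketch omits.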
Because the handoff is a versioned, test-gated contract (Corpus Contract), the corpus can be rebuilt or re-platformed (LocalLite today, Fabric, later Databricks/AWS) without the consumers changing, as long as the contract holds. A contract test fails if the code, the generated JSON Schema, and the docs ever drift apart, so a breaking change cannot land without a major-version bump.
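The drift check described above can be illustrated with a toy version that compares column names across the three artifacts. This is a sketch under stated assumptions, not the repository's actual test: the column types, the encounter_id key, the docs table layout, and all function names are hypothetical, and only names (not types) are compared.

```python
import json

# Hypothetical stand-ins for the three artifacts: code, generated schema, docs.
CODE_COLUMNS = {
    "encounter_id": "string",        # assumed key column
    "soap_note_text": "string",
    "active_conditions": "array",
    "active_medications": "array",
}

SCHEMA_JSON = json.dumps({
    "type": "object",
    "properties": {name: {"type": t} for name, t in CODE_COLUMNS.items()},
})

DOCS_MARKDOWN = """\
| Column | Type |
|---|---|
| encounter_id | string |
| soap_note_text | string |
| active_conditions | array |
| active_medications | array |
"""


def schema_columns(schema_text):
    """Column names declared under "properties" in the JSON Schema."""
    return set(json.loads(schema_text)["properties"])


def docs_columns(markdown):
    """First-cell values of the docs table, skipping header and separator rows."""
    columns = set()
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 2 or cells[0] == "Column" or set(cells[0]) <= set("-:"):
            continue
        columns.add(cells[0])
    return columns


def contract_drift(code_columns, schema_text, markdown):
    """Map each artifact to the columns it lacks; an empty dict means no drift."""
    views = {
        "code": set(code_columns),
        "schema": schema_columns(schema_text),
        "docs": docs_columns(markdown),
    }
    union = set().union(*views.values())
    return {name: sorted(union - cols) for name, cols in views.items() if union - cols}


# All three artifacts agree, so nothing is reported.
assert contract_drift(CODE_COLUMNS, SCHEMA_JSON, DOCS_MARKDOWN) == {}
```

Dropping a row from the docs table while leaving the code and schema untouched makes `contract_drift` report that column under "docs", which is the kind of failure that would block a silent change.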
Where this sits in the portfolio
| Project | Role | Relationship |
|---|---|---|
| scribe-iq-lakehouse (this) | Data platform / governed corpus | Produces gold.encounter_summary |
| scribe-iq | Clinical-documentation AI | Downstream — next corpus via the Ollama loop (roadmap) |
| clinical-bert-pipeline | Clinical NLP | Downstream consumer of the contract (notes + labels) |
A separate companion repo, fabric-lakehouse-hls-readmission, tells a different story: migrating a Databricks demo to Fabric, CSV-first, with no code dependency in either direction.
Bidirectional linking
The ideal complement to this page is a one-line backlink on the scribe-iq side ("the clinical corpus is produced upstream by scribe-iq-lakehouse"). That lives in the scribe-iq repo and is tracked there as a follow-up.