scribe-iq-lakehouse
This project takes raw, messy, hospital-style patient data and turns it into one clean, reliable, well-documented dataset that AI systems can safely build on. It's engineered to run the same way on a laptop or in the cloud, using only synthetic (non-real) patient data.
For technical reviewers
A production-pattern healthcare data lakehouse: a Bronze → Silver → Gold medallion over multimodal Synthea Coherent FHIR (R4), turning raw clinical bundles into one governed, versioned Gold data contract.
Built twice, on purpose — Polars + delta-rs + DuckDB on a laptop and Spark + Delta + OneLake on Microsoft Fabric — orchestrated as a Dagster asset graph, with a streaming-ingest simulation of Fabric's Auto Loader. Two independent, engine-native implementations converge on the same contract by schema parity and a lockstep version, not shared code (ADR-022).
22 ADRs · 129 fixture-only tests · generated-first docs · three orchestration surfaces · full local run in ~2.5 min.
Built with
Grouped by concern, so the breadth is visible at a glance:
| Concern | Stack |
|---|---|
| Compute / transform | Polars (local, in-process) · Apache Spark (Fabric, distributed) |
| Storage / format | Delta Lake — delta-rs locally, OneLake on Fabric · Change Data Feed on every table |
| Orchestration | Dagster asset graph · dependency-light CLI · Fabric notebooks + Data Factory |
| Ingest | AWS Open Data S3 → append-only Bronze · streaming simulation (Auto Loader pattern) |
| Explore | DuckDB (local SQL UI) · Power BI Direct Lake (Fabric) |
| Data | Synthea Coherent · FHIR R4 · DICOM headers · genomic reports · synthetic, no PHI |
Why this exists
scribe-iq came first — it proved the clinical-documentation product on a corpus assembled
heuristically: Synthea exported as CSV, with clinical notes stitched on from public datasets
(ACI-Bench, MTSamples, MedSynth) matched and adapted onto Synthea encounters.
This repo is the principled rebuild. The headline engineering decision is
ADR-022: the obvious shared-code
abstraction was built first, hit the applyInPandas "bridge tax" fighting Spark's native
parsing, and was deliberately reversed into two independent, engine-native implementations
that meet only at the contract. On that foundation: a true medallion over Synthea Coherent
(FHIR R4) — typed Silver, a semver'd + test-gated Gold contract, multimodal FHIR/DICOM/genomic
handling, and limitations modeled as first-class data.
Closing the loop (roadmap). Next, an Ollama generation pipeline will consume
gold.encounter_summary to derive synthetic unstructured notes and dialogues — the
next-generation corpus for scribe-iq, superseding the original heuristic assembly. The arc:
prototype the product → industrialize the data foundation → generate the corpus from the
governed contract.
Start here
- 90-second tour. New here? Start here: what this is, how to read it, and what's real vs in-progress.
- Why it's built this way. The dual-engine decision (ADR-022) and the interesting problems.
- Laptop ↔ Fabric parity. Two engine-native tiers, one contract.
- The data contract. The versioned, test-gated Gold handoff (encounter_summary, v1.1.0).
What this shows
Engineering first; every claim pairs a competence with a checkable number from a real run (Benchmarks, Corpus Contract).
| Capability | Evidence (real run) |
|---|---|
| Multi-platform engine parity | The same medallion on LocalLite (Polars/delta-rs) and Fabric (Spark/OneLake) — independent implementations, one contract (ADR-022) |
| Multiple orchestration surfaces | Two local surfaces (CLI and Dagster asset graph, with cohort partitions and validate_table as asset checks) share one transform set; the Fabric tier reimplements the transforms in its own notebooks 00–10 (ADR-022) |
| Streaming-shaped ingest | Cohort-partition replay simulating Fabric Auto Loader → core/ingest/streaming_sim.py |
| Contract governance | gold.encounter_summary v1.1.0 — semver + test-gated; downstream consumers pin the major version |
| Healthcare data engineering at scale | 1,280 FHIR bundles (1,278 patients, 4.6 GiB) → 10 typed Silver Delta tables; 669,898 observations |
| Multimodal FHIR handling | Base64 SOAP decode (100% coverage), DICOM headers for 298 studies, genomic flags |
| Honest data modeling | as-of-date problem lists (empty lists 0.9%), genomic data_limitation column, PHI-safe logs |
| Quality discipline | 129 tests (no cloud/network), generated docs-as-test, pre-commit security scanning |
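The "pin the major version" rule in the Contract governance row can be sketched as a small consumer-side guard. This is a minimal illustration, not the repository's code: `parse_semver`, `is_compatible`, and `PINNED_MAJOR` are hypothetical names, and the only value taken from this page is the contract version `1.1.0`.

```python
"""Hypothetical consumer-side guard: accept any contract that shares the pinned major."""

PINNED_MAJOR = 1  # assumption: this consumer was built against contract v1.x


def parse_semver(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers and reject anything else."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def is_compatible(contract_version: str, pinned_major: int = PINNED_MAJOR) -> bool:
    """Under semver, minor and patch bumps are non-breaking, so only the major must match."""
    return parse_semver(contract_version)[0] == pinned_major


if __name__ == "__main__":
    print(is_compatible("1.1.0"))  # the version published on this page
    print(is_compatible("2.0.0"))  # a breaking release this consumer has not adopted
```

A consumer would typically run a check like this once at load time and fail fast, rather than discovering a renamed or retyped column mid-pipeline.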
How it fits together
```mermaid
flowchart LR
    S3["AWS Open Data S3<br/>Synthea Coherent · FHIR R4<br/>1,278 patients · ~4.6 GiB"]
    subgraph LH["scribe-iq-lakehouse (this repo, the data platform)"]
        direction TB
        BR["Bronze<br/>raw, append-only"]
        SV["Silver<br/>10 typed Delta tables · CDC · validated"]
        GD["Gold<br/>gold.encounter_summary · 143,946 rows · 1 / encounter"]
        BR --> SV --> GD
    end
    S3 --> BR
    GD == "corpus contract v1.1.0<br/>versioned · test-gated" ==> C
    subgraph C["Downstream AI consumers"]
        direction TB
        SIQ["scribe-iq<br/>clinical RAG / docs"]
        BERT["clinical-bert-pipeline<br/>NLP"]
        OLL["Ollama pipeline (roadmap)<br/>note + dialogue generation"]
    end
    classDef plat fill:#eef2ff,stroke:#6366f1;
    classDef cons fill:#f0fdf4,stroke:#22c55e;
    classDef road fill:#fff7ed,stroke:#f59e0b,stroke-dasharray:4 3;
    class LH plat
    class C cons
    class OLL road
```
How it actually runs
The medallion, with the engine named at every hop:
- Bronze: core/ingest/download.py pulls Synthea Coherent from AWS Open Data S3 into an append-only, cohort-partitioned landing zone (fhir/ · dicom/ · csv/ + manifests). streaming_sim.py replays cohorts to simulate Fabric's Auto Loader pattern.
- Silver: a pure-Python FHIRBundleParser turns bundles into engine-agnostic records; each tier then materializes 10 typed, CDC-enabled Delta tables its own way: LocalLite via Polars + delta-rs, Fabric via Spark from_json(BUNDLE_SCHEMA) distributed parsing. Validation runs as validate_table, surfaced as Dagster asset checks.
- Gold: a Polars join/aggregation denormalizes Silver into a single gold.encounter_summary (one row per encounter) under the versioned contract.
The same LocalLite transforms run under two surfaces — the CLI and the Dagster asset graph; the Fabric tier is a third surface that runs its own engine-native transforms (ADR-022), not the same code.
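The Silver parsing step above (bundle in, plain records out, with Base64 note decoding) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's FHIRBundleParser: `parse_bundle`, `sample_bundle`, and the output field names are hypothetical, and it assumes the note text arrives as a FHIR `DocumentReference` whose `content[].attachment.data` is Base64, which is how FHIR R4 encodes attachment payloads.

```python
"""Sketch: flatten one FHIR R4 bundle into plain dict records (hypothetical field names)."""
import base64
import json


def parse_bundle(bundle_json: str) -> dict[str, list[dict]]:
    """Group a bundle's resources into engine-agnostic lists of dicts."""
    records: dict[str, list[dict]] = {"encounter": [], "note": []}
    for entry in json.loads(bundle_json).get("entry", []):
        resource = entry.get("resource", {})
        kind = resource.get("resourceType")
        if kind == "Encounter":
            records["encounter"].append(
                {"encounter_id": resource["id"], "status": resource.get("status")}
            )
        elif kind == "DocumentReference":
            # FHIR carries attachment payloads as Base64 in content[].attachment.data.
            for content in resource.get("content", []):
                payload = content.get("attachment", {}).get("data")
                if payload:
                    text = base64.b64decode(payload).decode("utf-8")
                    records["note"].append({"document_id": resource["id"], "text": text})
    return records


def sample_bundle() -> str:
    """Build a tiny two-resource bundle; the note text is invented for the example."""
    note = base64.b64encode("S: cough for three days".encode("utf-8")).decode("ascii")
    return json.dumps({
        "resourceType": "Bundle",
        "entry": [
            {"resource": {"resourceType": "Encounter", "id": "enc-1", "status": "finished"}},
            {"resource": {"resourceType": "DocumentReference", "id": "doc-1",
                          "content": [{"attachment": {"contentType": "text/plain",
                                                      "data": note}}]}},
        ],
    })


if __name__ == "__main__":
    print(parse_bundle(sample_bundle()))
```

Because the output is only lists of dicts, a Polars or Arrow loader can consume it without the parser importing either library.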
```mermaid
flowchart TB
    S3["AWS Open Data S3<br/>FHIR bundles"]
    STREAM["streaming_sim.py<br/>cohort replay · Auto Loader pattern"]
    S3 --> STREAM --> BRONZE
    subgraph BRONZE["Bronze: raw, append-only"]
        B1["fhir/ · dicom/ · csv/ + manifests"]
    end
    subgraph SILVER["Silver: 10 typed Delta tables · CDC · validated"]
        direction LR
        PARSE["pure-Python FHIRBundleParser<br/>(engine-agnostic dicts)"]
        LOCALS["LocalLite: Polars + delta-rs"]
        FABS["Fabric: Spark from_json(BUNDLE_SCHEMA)<br/>distributed"]
        PARSE --> LOCALS
        PARSE --> FABS
    end
    subgraph GOLD["Gold: one governed contract"]
        G1["gold.encounter_summary<br/>Polars join/agg · 1 row / encounter<br/>contract v1.1.0"]
    end
    BRONZE --> SILVER --> GOLD
    VAL["validate_table → Dagster asset checks"] -.-> SILVER
    classDef gold fill:#fff7ed,stroke:#f59e0b;
    class GOLD gold
```
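The cohort replay that feeds Bronze can be pictured as an incremental file scan: walk the cohort partitions in order, emit only files not seen before, and remember what was emitted. The sketch below is a guess at that shape, not the contents of streaming_sim.py. `replay_cohorts`, the `cohort=NNN` folder naming, and the in-memory `seen` checkpoint are all assumptions made for illustration.

```python
"""Sketch: replay cohort folders one batch at a time, skipping files already ingested."""
import tempfile
from pathlib import Path
from typing import Iterator


def replay_cohorts(landing: Path, seen: set[str]) -> Iterator[tuple[str, list[Path]]]:
    """Yield (cohort name, new files) in sorted cohort order.

    `seen` plays the role of a checkpoint: each file is emitted at most once.
    """
    for cohort_dir in sorted(p for p in landing.iterdir() if p.is_dir()):
        new_files = [f for f in sorted(cohort_dir.glob("*.json")) if str(f) not in seen]
        if new_files:
            seen.update(str(f) for f in new_files)
            yield cohort_dir.name, new_files


def demo() -> list[tuple[str, int]]:
    """Run two passes over a throwaway landing zone and report batch sizes."""
    batches: list[tuple[str, int]] = []
    layout = {"cohort=001": ["a.json", "b.json"], "cohort=002": ["c.json"]}
    with tempfile.TemporaryDirectory() as tmp:
        landing = Path(tmp)
        for cohort, names in layout.items():
            (landing / cohort).mkdir()
            for name in names:
                (landing / cohort / name).write_text("{}")
        seen: set[str] = set()
        batches += [(c, len(fs)) for c, fs in replay_cohorts(landing, seen)]
        # A late-arriving file: only it should appear on the second pass.
        (landing / "cohort=002" / "d.json").write_text("{}")
        batches += [(c, len(fs)) for c, fs in replay_cohorts(landing, seen)]
    return batches


if __name__ == "__main__":
    print(demo())
```

A durable version would persist the checkpoint (for example alongside the manifests) so that a restart does not re-ingest earlier cohorts.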
Two tiers at a glance
| Concern | LocalLite tier (core/) | Fabric tier (fabric/) |
|---|---|---|
| Compute | Polars (in-process) | Spark |
| Table format / storage | delta-rs (Delta Lake), data/ | OneLake Delta |
| Parsing | dict-parse → pa.Table | from_json(value, BUNDLE_SCHEMA) (distributed) |
| Orchestration | CLI · Dagster asset graph | notebooks 00–10 · Data Factory |
| Explore | DuckDB UI (read-only SQL) | Spark display() · Power BI Direct Lake |
| Cost (1.3k patients) | $0 | trial (F4) |
| Status | ✅ full 1,278-patient run | ✅ green on 100-patient sample; full re-run pending |
Both are independent, engine-native implementations that emit the same Gold contract —
compatibility by schema parity + lockstep CONTRACT_VERSION, not code sharing. Full detail and
the side-by-side code: Engine Parity.
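"Schema parity plus a lockstep version" amounts to comparing two published descriptions of the same table. The sketch below shows one way such a gate could look; it is not the repository's test. `contract_diff` and the dict layout are hypothetical, `has_ecg` and `data_limitation` are column names mentioned on this page, and `encounter_id` and all logical types are assumptions.

```python
"""Sketch: check that two tiers publish the same contract (illustrative schemas)."""


def contract_diff(local: dict, fabric: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means parity."""
    problems: list[str] = []
    if local["version"] != fabric["version"]:
        problems.append(f"version: local={local['version']} fabric={fabric['version']}")
    local_cols, fabric_cols = local["columns"], fabric["columns"]
    for name in sorted(local_cols.keys() | fabric_cols.keys()):
        if name not in fabric_cols:
            problems.append(f"{name}: missing on fabric")
        elif name not in local_cols:
            problems.append(f"{name}: missing on local")
        elif local_cols[name] != fabric_cols[name]:
            problems.append(f"{name}: local={local_cols[name]} fabric={fabric_cols[name]}")
    return problems


# Assumed schemas: only has_ecg and data_limitation are names taken from this page.
LOCAL = {
    "version": "1.1.0",
    "columns": {"encounter_id": "string", "has_ecg": "boolean", "data_limitation": "string"},
}
FABRIC = {
    "version": "1.1.0",
    "columns": {"encounter_id": "string", "has_ecg": "boolean", "data_limitation": "string"},
}

if __name__ == "__main__":
    print(contract_diff(LOCAL, FABRIC))
```

Comparing logical types rather than engine types is the design choice that lets a Polars schema and a Spark schema be checked against each other without sharing code.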
Downstream: closing the loop
gold.encounter_summary is the single governed interface between this platform and the AI apps.
clinical-bert-pipeline consumes it (the SOAP text plus the structured labels) for NLP. An
Ollama generation pipeline (roadmap) will derive synthetic unstructured notes and dialogues
from Gold to become scribe-iq's next corpus — closing the loop from structured contract back to
clinical text. Change the corpus once, behind the contract, and every downstream model inherits it.
```mermaid
flowchart LR
    GD["gold.encounter_summary<br/>governed contract v1.1.0"]
    OLL["Ollama generation pipeline<br/>(roadmap, not built)"]
    NOTES["synthetic unstructured<br/>notes + dialogues"]
    SIQ["scribe-iq<br/>clinical RAG corpus"]
    BERT["clinical-bert-pipeline"]
    GD --> OLL --> NOTES --> SIQ
    GD --> BERT
    classDef road fill:#fff7ed,stroke:#f59e0b,stroke-dasharray:4 3;
    class OLL,NOTES road
```
→ The full provenance and portfolio arc: Downstream & Portfolio.
Honest boundaries
Data limitations are modeled as first-class columns, not hidden. Synthetic data only (Synthea
Coherent — no PHI). Genomics is simulated inheritance, flagged in a mandatory data_limitation
column; active_medications is a forward status=active approximation, not a point-in-time
timeline; has_ecg is always false (no ECG in the Coherent FHIR). Each limitation is named with
its reason and the production-grade alternative → Responsible Data.
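To make the active_medications caveat concrete, the sketch below contrasts a status-based filter with a point-in-time filter on two invented rows. It assumes "forward status=active approximation" means trusting the status recorded at export time; the row layout (`status`, `start`, `stop`), the function names, and the drug rows are hypothetical and exist only for the comparison.

```python
"""Sketch: status-based approximation vs. a point-in-time medication list (invented rows)."""
from datetime import date

MEDS = [
    {"name": "lisinopril", "status": "active",
     "start": date(2021, 3, 1), "stop": None},
    {"name": "amoxicillin", "status": "stopped",
     "start": date(2019, 5, 1), "stop": date(2019, 5, 10)},
]


def approx_active(meds: list[dict]) -> list[str]:
    """Forward approximation: trust the status recorded at export time."""
    return [m["name"] for m in meds if m["status"] == "active"]


def active_as_of(meds: list[dict], as_of: date) -> list[str]:
    """Point-in-time: started on or before `as_of` and not yet stopped."""
    return [
        m["name"] for m in meds
        if m["start"] <= as_of and (m["stop"] is None or m["stop"] > as_of)
    ]


if __name__ == "__main__":
    encounter_date = date(2019, 5, 5)
    print(approx_active(MEDS))                 # ignores the encounter date entirely
    print(active_as_of(MEDS, encounter_date))  # what the patient was taking that day
```

For an encounter on 2019-05-05 the two filters disagree completely: the approximation reports a drug started two years later and omits the one actually being taken, which is the kind of gap the limitation column is there to flag.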
Run it locally
```bash
python -m venv .venv && source .venv/bin/activate
pip install -e ".[local,dev]"                      # Polars + delta-rs + DuckDB + dev tooling
pytest                                             # 129 tests; no cloud / Fabric / network
python -m core.surfaces.cli.pipeline --with-gold   # Bronze → Silver → Gold
```
Full procedures — ingest, rebuilds, DICOM, verification, troubleshooting — are in the
Runbook. To see one patient flow through the medallion, run
python -m core.scripts.demo_walkthrough.
Built on Synthea Coherent synthetic data — no real patient information. MIT licensed. A portfolio engineering artifact: see About.