Reviewer Guide¶

Review this as an engineering artifact, not a product. The goal of this page is to get you to the evidence quickly and to be honest, up front, about what is fully built versus in-progress.

Pick a depth¶

Not technical?90 seconds10 minutes30 minutes (hands-on)

New to data engineering? In one line: this turns raw hospital-style data into one clean, trusted dataset for AI — built to run on a laptop and in the cloud, on synthetic data only.

Read the Home hero + the What this shows evidence table.
Look at the one system diagram: raw FHIR → medallion → a versioned Gold contract → downstream AI.
Takeaway: this is a governed clinical data product, built twice (laptop + Fabric) against one contract, with limitations modeled as first-class.

Engineering Case Study — problem → decisions → result.
Corpus Contract — the versioned, test-gated Gold handoff.
One ADR that shows judgment: ADR-022 (independent per-platform impls) or ADR-014 (problem-list as-of-date).

git clone https://github.com/sandeep-jay/scribe-iq-lakehouse && cd scribe-iq-lakehouse
python -m venv .venv && source .venv/bin/activate
pip install -e ".[local,dev]"
pytest                                   # 129 tests — no cloud / Fabric / network
python -m core.scripts.demo_walkthrough  # one patient: Bronze → Parse → Silver → Gold

No credentials, no cloud. The tests run against a single synthetic fixture bundle. The walkthrough renders one patient end-to-end, ending in a gold.encounter_summary row with the SOAP note as readable clinical text. Full run: Runbook.

What's real vs in-progress¶

Stated plainly so the evidence isn't oversold.

Area	Status
LocalLite tier (`core/`) — Bronze → Silver → Gold	✅ Run end-to-end on all 1,278 patients (143,946 Gold rows, 0 validation failures)
Dagster asset graph (local orchestration)	✅ Built — cohort partitions, `validate_table` as asset checks, file sensor
Gold corpus contract (v1.1.0)	✅ Versioned + test-gated (schema/JSON-Schema/docs can't drift)
Fabric tier (`fabric/`) — Spark-native, notebooks 00–10	✅ Green end-to-end on F4 against a 100-patient sample; full 1,280-bundle re-run pending
Fabric Data Factory pipeline + Power BI Direct Lake	🚧 In progress (demo deliverables)
Ollama note/dialogue generation → `scribe-iq` corpus loop	🗺️ Roadmap (not built) — Gold is the input; the generation pipeline is the next-gen corpus path
ECG waveform processing · Databricks/AWS tiers	🗺️ Roadmap (scoped, not built)

The Fabric tier is validated end-to-end on real Fabric infrastructure (F4 capacity) at sample scale — the remaining work there is a full-scale re-run, not a design question. Pair it with the LocalLite full-run numbers in Benchmarks.

If you only read one thing per concern¶

Concern	Read
Data modeling / data-as-a-product	Corpus Contract
Engine portability / architecture judgment	ADR-022 + Engine Parity
Healthcare-data judgment & limitations	Responsible Data
Interesting problems solved	Design Notes
Operations & reproducibility	Runbook + Benchmarks
The downstream story	Downstream & Portfolio