Reviewer Guide
Review this as an engineering artifact, not a product. This page gets you to the evidence quickly and states up front what is fully built and what is still in progress.
Pick a depth
New to data engineering? In one line: this turns raw hospital-style data into one clean, trusted dataset for AI — built to run on a laptop and in the cloud, on synthetic data only.
- Read the Home hero + the What this shows evidence table.
- Look at the one system diagram: raw FHIR → medallion → a versioned Gold contract → downstream AI.
- Takeaway: this is a governed clinical data product, built twice (laptop + Fabric) against one contract, with limitations modeled as first-class concerns.
- Engineering Case Study — problem → decisions → result.
- Corpus Contract — the versioned, test-gated Gold handoff.
- One ADR that shows judgment: ADR-022 (independent per-platform implementations) or ADR-014 (problem-list as-of-date).
git clone https://github.com/sandeep-jay/scribe-iq-lakehouse && cd scribe-iq-lakehouse
python -m venv .venv && source .venv/bin/activate
pip install -e ".[local,dev]"
pytest # 129 tests — no cloud / Fabric / network
python -m core.scripts.demo_walkthrough # one patient: Bronze → Parse → Silver → Gold
No credentials, no cloud. The tests run against a single synthetic fixture bundle. The
walkthrough renders one patient end-to-end, ending in a gold.encounter_summary row with
the SOAP note as readable clinical text. Full run: Runbook.
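To make the Bronze → Parse → Silver → Gold path concrete, here is a minimal, self-contained sketch of the same shape in plain Python. It is not the repo's `demo_walkthrough` code: the bundle contents, the function names (`parse`, `to_silver`, `to_gold`), and the output columns are all assumptions chosen for illustration, and the real `gold.encounter_summary` schema is defined by the Corpus Contract.

```python
import json

# Bronze: the raw bundle is stored untouched. This tiny bundle is invented
# for illustration; it is not the repo's synthetic fixture.
RAW_BUNDLE = json.dumps({
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1",
                      "subject": {"reference": "Patient/p1"},
                      "period": {"start": "2024-01-15T09:00:00Z"}}},
        {"resource": {"resourceType": "Condition", "id": "c1",
                      "subject": {"reference": "Patient/p1"},
                      "encounter": {"reference": "Encounter/e1"},
                      "code": {"text": "Essential hypertension"}}},
    ],
})


def parse(raw_bundle):
    """Parse: split a FHIR Bundle into resources grouped by resourceType."""
    grouped = {}
    for entry in json.loads(raw_bundle)["entry"]:
        resource = entry["resource"]
        grouped.setdefault(resource["resourceType"], []).append(resource)
    return grouped


def ref_id(reference):
    """Turn {'reference': 'Patient/p1'} into 'p1'."""
    return reference["reference"].split("/", 1)[1]


def to_silver(grouped):
    """Silver: flat rows with resolved ids (hypothetical column names)."""
    encounters = [
        {"encounter_id": e["id"],
         "patient_id": ref_id(e["subject"]),
         "encounter_start": e["period"]["start"]}
        for e in grouped.get("Encounter", [])
    ]
    conditions = [
        {"encounter_id": ref_id(c["encounter"]),
         "description": c["code"]["text"]}
        for c in grouped.get("Condition", [])
    ]
    return {"encounter": encounters, "condition": conditions}


def to_gold(silver):
    """Gold: one summary row per encounter, with its diagnoses joined in."""
    rows = []
    for enc in silver["encounter"]:
        diagnoses = sorted(
            c["description"] for c in silver["condition"]
            if c["encounter_id"] == enc["encounter_id"]
        )
        rows.append({**enc, "diagnoses": "; ".join(diagnoses)})
    return rows


gold_rows = to_gold(to_silver(parse(RAW_BUNDLE)))
```

The point of the layering is that each step has one job: Bronze keeps the source verbatim, Silver normalizes references and types, and Gold is the only layer a downstream consumer needs to read.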
What's real vs in-progress
Stated plainly so the evidence isn't oversold.
| Area | Status |
|---|---|
| LocalLite tier (core/): Bronze → Silver → Gold | ✅ Run end-to-end on all 1,278 patients (143,946 Gold rows, 0 validation failures) |
| Dagster asset graph (local orchestration) | ✅ Built — cohort partitions, validate_table as asset checks, file sensor |
| Gold corpus contract (v1.1.0) | ✅ Versioned + test-gated (schema/JSON-Schema/docs can't drift) |
| Fabric tier (fabric/): Spark-native, notebooks 00–10 | ✅ Green end-to-end on F4 against a 100-patient sample; full 1,280-bundle re-run pending |
| Fabric Data Factory pipeline + Power BI Direct Lake | 🚧 In progress (demo deliverables) |
| Ollama note/dialogue generation → scribe-iq corpus loop | 🗺️ Roadmap (not built). Gold is the input; the generation pipeline is the next-gen corpus path |
| ECG waveform processing · Databricks/AWS tiers | 🗺️ Roadmap (scoped, not built) |
The Fabric tier is validated end-to-end on real Fabric infrastructure (F4 capacity) at sample scale — the remaining work there is a full-scale re-run, not a design question. Pair it with the LocalLite full-run numbers in Benchmarks.
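The "test-gated" claim in the Gold corpus contract row can be pictured with a small drift check. The sketch below is an assumption about the general technique, not the repo's actual test: the column names, the `$id` format, and the helpers `columns_from_docs` and `find_drift` are hypothetical. It compares three views of one contract (table schema, JSON Schema, docs table) and reports any disagreement, which is the kind of check a test suite can fail on.

```python
CONTRACT_VERSION = "1.1.0"

# Three hypothetical views of the same Gold table contract.
TABLE_SCHEMA = {
    "encounter_id": "string",
    "patient_id": "string",
    "note_text": "string",
}

JSON_SCHEMA = {
    "$id": "encounter_summary/1.1.0",
    "type": "object",
    "properties": {
        "encounter_id": {"type": "string"},
        "patient_id": {"type": "string"},
        "note_text": {"type": "string"},
    },
}

DOCS_TABLE = """
| Column | Type |
|---|---|
| encounter_id | string |
| patient_id | string |
| note_text | string |
"""


def columns_from_docs(markdown):
    """Read {column: type} from a two-column markdown table."""
    columns = {}
    for line in markdown.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip blank lines, the header row, and the |---|---| separator.
        if len(cells) != 2 or cells[0] == "Column" or set(cells[0]) <= {"-"}:
            continue
        columns[cells[0]] = cells[1]
    return columns


def find_drift(table_schema, json_schema, docs_markdown):
    """Return a list of human-readable drift problems (empty means in sync)."""
    problems = []
    json_columns = {
        name: spec["type"] for name, spec in json_schema["properties"].items()
    }
    if json_columns != table_schema:
        problems.append("JSON Schema differs from table schema")
    if columns_from_docs(docs_markdown) != table_schema:
        problems.append("docs differ from table schema")
    if not json_schema["$id"].endswith("/" + CONTRACT_VERSION):
        problems.append("JSON Schema $id does not carry the contract version")
    return problems


# In a test suite this would be: assert find_drift(...) == []
drift = find_drift(TABLE_SCHEMA, JSON_SCHEMA, DOCS_TABLE)
```

Treating the table schema as the single source of truth and diffing the other two views against it keeps the failure messages specific about which artifact fell behind.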
If you only read one thing per concern
| Concern | Read |
|---|---|
| Data modeling / data-as-a-product | Corpus Contract |
| Engine portability / architecture judgment | ADR-022 + Engine Parity |
| Healthcare-data judgment & limitations | Responsible Data |
| Interesting problems solved | Design Notes |
| Operations & reproducibility | Runbook + Benchmarks |
| The downstream story | Downstream & Portfolio |