Engineering Case Study¶
A data-platform case study: the problem, the constraints that shaped it, the decisions that mattered, and the result — each decision linked to its ADR.
Problem¶
Downstream clinical-AI projects (scribe-iq, clinical-bert-pipeline) need a governed, realistic clinical corpus to build against. Synthea Coherent is the richest public synthetic dataset, but it ships as raw, deeply nested, multimodal FHIR R4 — verbose JSON with Base64-embedded notes, DICOM headers, and genomic reports. It is not queryable, not typed, and has no contract a consumer could pin to.
Constraints¶
- Laptop-reproducible and enterprise-credible. A reviewer must be able to clone and run the whole thing with no cloud account; the same design must also stand up on Microsoft Fabric.
- Synthetic only, honest throughout. No PHI; and where the synthetic data or the FHIR model is thin, say so in the data, not in a footnote.
- A real handoff, not a dump. Downstream teams need a stable, versioned interface — not a schema that shifts every rebuild.
Decisions that mattered¶
| Decision | Why | ADR |
|---|---|---|
| Medallion (Bronze → Silver → Gold) | Separate raw landing, typed/validated entities, and the denormalized AI-ready corpus | 012 |
| Decode, don't parse, SOAP notes | Notes are Base64 inside FHIR Binary — a decode step, not a new format |
005 |
| DICOM headers without pixels | stop_before_pixels=True — fast, lightweight, right for encounter context |
006 · 013 |
| Limitation as a first-class column | Synthetic genomics flagged in data_limitation, never silently presented as real |
007 |
| As-of-date problem lists | active_conditions/active_medications scoped to each encounter date — temporally honest |
014 |
| Generated-first docs + a versioned contract | A contract test gates code/JSON-Schema/docs so the handoff can't drift | 011 · 012 |
| Independent per-platform implementations | Each tier engine-native; compatibility by schema parity, not a leaky shared abstraction | 022 |
| Medallion as a Dagster asset graph | Cohort partitions, validation as asset checks, a file sensor — orchestration as a first-class surface | 015 · 016 |
The architecture evolved: it started as one platform-abstraction layer with an Arrow interchange (ADR-002/004), and — once Spark-on-Fabric met a Python-bridge parsing path that fought the abstraction — was deliberately replaced by two independent engine-native tiers emitting one contract (ADR-022). Recognizing when an abstraction is costing more than it saves is the senior move here; the story is told in Design Notes.
Result¶
gold.encounter_summary— 143,946 rows (one per encounter), contract v1.1.0, semver + test-gated, with a lineage manifest.- Built end-to-end on the full 1,278-patient dataset on a laptop: 10 Silver tables in 2m19s with 0 validation failures, then Gold in ~6.5s.
- Proven on Microsoft Fabric — green end-to-end on F4 against a 100-patient sample (notebooks 00–10); full re-run pending.
- 129 tests, fixture-only (no cloud/network); generated docs-as-test; pre-commit security scanning (detect-secrets, gitleaks, bandit, semgrep).
See the numbers in Benchmarks, the handoff in the Corpus Contract, and the parity story in Engine Parity.