Engineering Case Study¶

A data-platform case study: the problem, the constraints that shaped it, the decisions that mattered, and the result — each decision linked to its ADR.

Problem¶

Downstream clinical-AI projects (scribe-iq, clinical-bert-pipeline) need a governed, realistic clinical corpus to build against. Synthea Coherent is the richest public synthetic dataset, but it ships as raw, deeply nested, multimodal FHIR R4 — verbose JSON with Base64-embedded notes, DICOM headers, and genomic reports. It is not queryable, not typed, and has no contract a consumer could pin to.

Constraints¶

Laptop-reproducible and enterprise-credible. A reviewer must be able to clone and run the whole thing with no cloud account; the same design must also stand up on Microsoft Fabric.
Synthetic only, honest throughout. No PHI; and where the synthetic data or the FHIR model is thin, say so in the data, not in a footnote.
A real handoff, not a dump. Downstream teams need a stable, versioned interface — not a schema that shifts every rebuild.

Decisions that mattered¶

Decision	Why	ADR
Medallion (Bronze → Silver → Gold)	Separate raw landing, typed/validated entities, and the denormalized AI-ready corpus	012
Decode, don't parse, SOAP notes	Notes are Base64 inside FHIR `Binary` — a decode step, not a new format	005
DICOM headers without pixels	`stop_before_pixels=True` — fast, lightweight, right for encounter context	006 · 013
Limitation as a first-class column	Synthetic genomics flagged in `data_limitation`, never silently presented as real	007
As-of-date problem lists	`active_conditions`/`active_medications` scoped to each encounter date — temporally honest	014
Generated-first docs + a versioned contract	A contract test gates code/JSON-Schema/docs so the handoff can't drift	011 · 012
Independent per-platform implementations	Each tier engine-native; compatibility by schema parity, not a leaky shared abstraction	022
Medallion as a Dagster asset graph	Cohort partitions, validation as asset checks, a file sensor — orchestration as a first-class surface	015 · 016

The architecture evolved: it started as one platform-abstraction layer with an Arrow interchange (ADR-002/004), and — once Spark-on-Fabric met a Python-bridge parsing path that fought the abstraction — was deliberately replaced by two independent engine-native tiers emitting one contract (ADR-022). Recognizing when an abstraction is costing more than it saves is the senior move here; the story is told in Design Notes.

Result¶

gold.encounter_summary — 143,946 rows (one per encounter), contract v1.1.0, semver + test-gated, with a lineage manifest.
Built end-to-end on the full 1,278-patient dataset on a laptop: 10 Silver tables in 2m19s with 0 validation failures, then Gold in ~6.5s.
Proven on Microsoft Fabric — green end-to-end on F4 against a 100-patient sample (notebooks 00–10); full re-run pending.
129 tests, fixture-only (no cloud/network); generated docs-as-test; pre-commit security scanning (detect-secrets, gitleaks, bandit, semgrep).

See the numbers in Benchmarks, the handoff in the Corpus Contract, and the parity story in Engine Parity.