Design Notes

The interesting problems, narrated with their trade-offs. Each links to the ADR that records the decision.

1. SOAP notes are Base64 inside FHIR Binary — decode, don't parse (ADR-005)

Coherent's clinical notes aren't a separate file format; they're Base64-encoded text inside Binary resources, linked to encounters via DocumentReference. The right extraction is a decode step, not a CCDA/section parser. One consequence worth stating honestly: Synthea SOAP notes contain S/A/P but no Objective section, so the Silver completeness check validates S/A/P only — the validation matches reality rather than an idealized template.
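As a minimal sketch of the decode step, assuming the standard FHIR Binary `data` field holds the Base64 payload: the helper names (`decode_note`, `missing_sections`), the section labels, and the header regex below are illustrative assumptions, not the pipeline's actual code.

```python
import base64
import re
from typing import List, Optional

# Assumed section labels; the real headers in Coherent's notes may be spelled differently.
REQUIRED_SECTIONS = ("Subjective", "Assessment", "Plan")


def decode_note(binary: dict) -> Optional[str]:
    """Return the note text from a FHIR Binary resource dict, or None if no payload."""
    data = binary.get("data")
    if not data:
        return None
    # Decode only: no CCDA or section parsing happens here.
    return base64.b64decode(data).decode("utf-8", errors="replace")


def missing_sections(text: str) -> List[str]:
    """List the required S/A/P sections that have no header line in the note."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not re.search(rf"^\s*#*\s*{name}\b", text, flags=re.IGNORECASE | re.MULTILINE)
    ]


# Made-up note used only to exercise the helpers.
note = "# Subjective\nCough for 3 days.\n# Assessment\nViral URI.\n# Plan\nRest and fluids.\n"
binary = {
    "resourceType": "Binary",
    "contentType": "text/plain",
    "data": base64.b64encode(note.encode("utf-8")).decode("ascii"),
}
text = decode_note(binary)
```

Objective is deliberately absent from `REQUIRED_SECTIONS`, mirroring the S/A/P-only completeness check described above.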

2. DICOM: extract headers, never pixels (ADR-006 · 013)

Every DICOM read uses pydicom.dcmread(..., stop_before_pixels=True). Imaging metadata (modality, body site, series, dimensions) enriches the encounter; pixel arrays are deliberately never loaded, which is fast, lightweight, and the correct scope when the goal is encounter context, not radiology inference. FHIR ImagingStudy is authoritative for the 3,752 studies; the 298 with a downloaded .dcm add real header dimensions. Coherent's descriptive DICOM tags hold the placeholder "UNKNOWN", which is normalized to null, so grounding uses modality + body_site_display, not a study description.
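A header-only read might look like the sketch below. `pydicom.dcmread(..., stop_before_pixels=True)` is the call named above; the function names, the output keys, and the choice of tags are assumptions made for illustration.

```python
from typing import Any, Dict, Optional

PLACEHOLDER = "UNKNOWN"


def null_if_placeholder(value: Any) -> Optional[str]:
    """Map empty or placeholder 'UNKNOWN' tag values to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text in ("", PLACEHOLDER) else text


def read_header(path: str) -> Dict[str, Any]:
    """Read DICOM header fields only; the pixel array is never loaded."""
    import pydicom  # imported lazily so the pure helpers need no dependency

    ds = pydicom.dcmread(path, stop_before_pixels=True)
    return {
        "modality": null_if_placeholder(ds.get("Modality")),
        "study_description": null_if_placeholder(ds.get("StudyDescription")),
        "rows": ds.get("Rows"),
        "columns": ds.get("Columns"),
    }


def grounding_label(modality: Optional[str], body_site_display: Optional[str]) -> Optional[str]:
    """Hypothetical helper: build grounding text from modality + body site."""
    parts = [p for p in (modality, body_site_display) if p]
    return " ".join(parts) if parts else None
```

Because the description tags normalize to null, the label is built from the two fields that are actually populated.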

3. The as-of-date problem list (ADR-014)

This is the decision that most clearly shows judgment about clinical-data temporality. A naive encounter-grain join left the majority of encounters with an empty problem list. Recomputing active_conditions as the patient's problem list active as of each encounter date (onset ≤ date, not yet abated) dropped empty problem lists to 0.9% and lifted average conditions per encounter to 9.57. Medications get the same as-of-date treatment, with one honest caveat: the medication records in this FHIR data carry no stop date, so meds are a forward status=active approximation, not a reconstructable point-in-time timeline (see Responsible Data).
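The as-of-date rule (onset ≤ date, not yet abated) can be sketched in a few lines. The record shape and the function name are assumptions, and treating a condition abated on the encounter date itself as no longer active is one possible reading of "not yet abated".

```python
from datetime import date
from typing import Any, Dict, List


def active_as_of(conditions: List[Dict[str, Any]], encounter_date: date) -> List[str]:
    """Return the codes of conditions active on encounter_date.

    Active means: onset on or before the date, and either no abatement
    or an abatement strictly after the date.
    """
    return [
        c["code"]
        for c in conditions
        if c["onset"] <= encounter_date
        and (c.get("abated") is None or c["abated"] > encounter_date)
    ]


# Made-up patient history, for illustration only.
conditions = [
    {"code": "hypertension", "onset": date(2010, 1, 1), "abated": None},
    {"code": "acute bronchitis", "onset": date(2015, 3, 1), "abated": date(2015, 3, 20)},
    {"code": "diabetes", "onset": date(2018, 6, 1), "abated": None},
]

during_bronchitis = active_as_of(conditions, date(2015, 3, 10))
after_bronchitis = active_as_of(conditions, date(2016, 1, 1))
```

An encounter-grain join would attach only conditions recorded at that encounter; here the long-standing hypertension appears on every later encounter, which is why empty problem lists become rare.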

4. Dict-based FHIR parsing over typed models (ADR-008)

Parsing uses plain dict access with .get(...) defaults rather than fhir.resources Pydantic models. Coherent bundles are large and partially populated; defensive .get() access keeps the transforms pure, dependency-light, and resilient to missing or optional fields, while explicit Arrow schemas (not inference) pin the output types. Clinical codes (SNOMED/LOINC/ICD) stay strings, never cast to numeric.
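A minimal sketch of the defensive access pattern, using a FHIR Condition as the example. The function names and output column names are assumptions, and the sample code value is made up to show why codes stay strings.

```python
from typing import Any, Dict, Optional


def first_coding(resource: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first coding of resource.code, or {} when any level is missing."""
    codings = (resource.get("code") or {}).get("coding") or [{}]
    return codings[0] or {}


def parse_condition(resource: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Flatten a Condition dict into a row; absent fields become None."""
    coding = first_coding(resource)
    code = coding.get("code")
    return {
        "condition_id": resource.get("id"),
        "patient_ref": (resource.get("subject") or {}).get("reference"),
        # Codes stay strings: a numeric cast would drop leading zeros.
        "code": None if code is None else str(code),
        "display": coding.get("display"),
        "onset": resource.get("onsetDateTime"),
        "abated": resource.get("abatementDateTime"),
    }


# A sparse resource parses without raising; every optional field is None.
sparse = parse_condition({"resourceType": "Condition", "id": "c1"})

# "0123" is a made-up code value, not a real terminology entry.
full = parse_condition({
    "resourceType": "Condition",
    "id": "c2",
    "subject": {"reference": "Patient/p1"},
    "code": {"coding": [{"code": "0123", "display": "Example"}]},
    "onsetDateTime": "2015-03-01",
})
```

Rows of this shape could then be handed to an explicit `pa.schema([...])` with `pa.string()` fields, so the output types are declared instead of inferred.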

5. Why the platform abstraction was replaced (ADR-022)

The most consequential architecture decision — and a reversal.

flowchart LR
    A["ADR-002 · platform abstraction<br/>one interface, env var selects engine"] --> C
    B["ADR-004 · Arrow interchange<br/>every transform returns pa.Table"] --> C
    C["ADR-020 · Fabric distributed parse<br/>applyInPandas bridge — friction"] --> D
    D["ADR-022 · independent per-platform impls<br/>each tier engine-native; one shared contract"]
    classDef old fill:#fff7ed,stroke:#fb923c;
    classDef new fill:#eef2ff,stroke:#6366f1;
    class A,B,C old
    class D new

The original design routed everything through one LakehousePlatform interface with an Arrow (pa.Table) interchange — elegant on the laptop. But making Spark-on-Fabric honor that interface meant an applyInPandas Python bridge that fought Spark's native, distributed from_json parsing (ADR-020). The abstraction was now costing more than it saved. ADR-022 replaced it with two independent, engine-native implementations: LocalLite returns pa.Table (Polars + delta-rs), Fabric returns Spark DataFrames (Spark + OneLake), and the tiers stay compatible by schema parity + lockstep CONTRACT_VERSION — not shared code. Each tier is now idiomatic for its engine; the Gold contract is the only thing they must agree on. See Engine Parity.
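Schema parity plus a lockstep version can be checked without any shared transform code. The sketch below assumes each implementation can export its Gold schema as a mapping from column name to logical type; the function name, column names, type labels, and version strings are all hypothetical.

```python
from typing import Dict, List


def parity_errors(
    local_schema: Dict[str, str],
    fabric_schema: Dict[str, str],
    local_version: str,
    fabric_version: str,
) -> List[str]:
    """Return human-readable parity violations; an empty list means the tiers agree."""
    errors: List[str] = []
    if local_version != fabric_version:
        errors.append(f"CONTRACT_VERSION mismatch: {local_version} != {fabric_version}")
    for name in sorted(set(local_schema) | set(fabric_schema)):
        left, right = local_schema.get(name), fabric_schema.get(name)
        if left is None:
            errors.append(f"{name}: missing from LocalLite")
        elif right is None:
            errors.append(f"{name}: missing from Fabric")
        elif left != right:
            errors.append(f"{name}: type {left} != {right}")
    return errors


# Illustrative schemas only; the real Gold contract is defined elsewhere.
local = {"encounter_id": "string", "active_conditions": "list<string>"}
fabric = {"encounter_id": "string", "active_conditions": "string"}

problems = parity_errors(local, fabric, "1.0.0", "1.0.0")
```

A check of this kind could run in CI against both implementations, which keeps the contract as the only coupling between them.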

6. Orchestration as a first-class surface (ADR-015 · 016)

The same pure LocalLite transforms run under a dependency-light CLI and a Dagster software-defined asset graph: cohorts become partitions (per-cohort backfill), validate_table becomes an @asset_check (rule-by-rule pass/fail in the UI), and a sensor watches data/bronze/fhir/ as an Auto-Loader analogue. Each materialization carries rendered metadata (schema + sample rows + a SOAP card) via the shared core/preview.py renderers — the same data shape shows up in the Dagster UI, the CLI walkthrough, and the DuckDB SQL notebook.