Healthcare Data & Responsible Handling

The project's governing principle: state every limitation precisely, pair it with the reason and the production-grade alternative, and treat the limitation itself as the signal. Knowing where synthetic data, the FHIR model, or a trial capacity stops being trustworthy — and modeling that boundary as a first-class column, a contract note, or a roadmap item rather than hiding it — is the difference between a demo and a system.

Synthetic-only, no PHI

The lakehouse is built entirely on Synthea Coherent — the richest public synthetic longitudinal FHIR dataset (1,278 patients, MIT). There is no real patient information anywhere in the repo, the tests, or the published corpus. This is a feature, not a caveat: it is exactly why the whole pipeline is laptop-reproducible and safe to open-source, and why a reviewer can git clone, run all 129 tests, and rebuild the corpus with zero credentials. Tests run against a single synthetic fixture bundle (core/tests/fixtures/sample_bundle.json), never real data.

PHI-safe by construction (ADR-010)

Even on synthetic data, the logging discipline is production-grade. Logs never contain patient or encounter identifiers, or bundle filenames. Identifier-bearing values are redacted to a non-reversible ref:<hash> before they reach a log line, via core.redaction.redact(). The habit is the point — the same code on real PHI would not leak identifiers.
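
A minimal sketch of what such a redaction step could look like. The body of core.redaction.redact() is not shown on this page, so the keyed-hash approach, the REDACTION_KEY name, the 12-character truncation, and the log_bundle_processed helper are all illustrative assumptions rather than the project's actual code.

```python
import hashlib
import hmac
import logging

# Hypothetical key; a real deployment would load this from a secret store.
REDACTION_KEY = b"example-key-not-for-production"


def redact(value: str) -> str:
    """Return a non-reversible reference token of the form ref:<hash>."""
    digest = hmac.new(REDACTION_KEY, value.encode("utf-8"), hashlib.sha256)
    # Truncation length is an illustrative choice, not the project's.
    return "ref:" + digest.hexdigest()[:12]


logger = logging.getLogger("ingest")


def log_bundle_processed(patient_id: str, resource_count: int) -> None:
    # Only the redacted token reaches the log line, never the raw identifier.
    logger.info("processed patient=%s resources=%d", redact(patient_id), resource_count)
```

Because the token is deterministic, the same patient still correlates across log lines, which keeps logs useful for debugging without exposing the identifier itself.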

Limitations, named (from the Corpus Contract)

Medications are a forward status=active approximation — not a point-in-time timeline

active_medications carries forward any status=active med from its authoring date (good for chronic/ongoing meds). But the Coherent FHIR bundles carry no medication stop date, so a med that was active at a past encounter and later stopped cannot be reconstructed on those past encounters. Conditions, by contrast, are temporally precise (onset + abatement gated, ADR-014). A true med timeline needs the CSV STOP column, which is deliberately out of scope (ADR-013). This is the page's clearest demonstration of clinical-data-temporality judgment.
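
The carry-forward rule and its blind spot can be sketched in a few lines. This is a simplified illustration, not the pipeline's code: the Med dataclass, its field names, and the sample medications are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Med:
    name: str
    status: str        # FHIR MedicationRequest.status, e.g. "active" or "stopped"
    authored_on: date  # FHIR MedicationRequest.authoredOn


def active_medications(meds: list[Med], encounter_date: date) -> list[str]:
    """Carry forward every status=active med authored on or before the encounter."""
    return sorted(
        m.name for m in meds
        if m.status == "active" and m.authored_on <= encounter_date
    )


meds = [
    Med("metformin", "active", date(2015, 3, 1)),
    # Stopped at some later date, but no stop date is available to us.
    Med("amoxicillin", "stopped", date(2016, 6, 10)),
]

# The chronic med is carried forward correctly.
print(active_medications(meds, date(2020, 1, 1)))   # ['metformin']
# The antibiotic was presumably active two days after authoring, but that
# cannot be recovered from status alone.
print(active_medications(meds, date(2016, 6, 12)))  # ['metformin']
```

The second call shows the limitation: without a stop date, a since-stopped med is simply absent from every past encounter.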

Genomics is synthetic — and the limitation is a first-class column (ADR-007)

Synthea simulates familial inheritance; it does not model clinically actionable variants: no real BRCA, CYP2D6, or HLA typing, and no pathogenic variants. Rather than footnote this, every silver.genomic_report row carries a mandatory data_limitation column (100% populated): "Synthea simulated inheritance — not clinical variants." The constraint is modeled as data, not prose.
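
One way to enforce a mandatory limitation column is to make an unpopulated value impossible to construct. The sketch below assumes a row type for silver.genomic_report; the GenomicReportRow class, its report_id and patient_ref fields, and the limitation_coverage helper are hypothetical. Only the data_limitation column and its value come from the contract.

```python
from dataclasses import dataclass

# Value quoted from the Corpus Contract.
DATA_LIMITATION = "Synthea simulated inheritance — not clinical variants."


@dataclass(frozen=True)
class GenomicReportRow:
    report_id: str    # hypothetical column name
    patient_ref: str  # hypothetical column name
    data_limitation: str = DATA_LIMITATION

    def __post_init__(self) -> None:
        # Mandatory: an empty or blank limitation is a contract violation.
        if not self.data_limitation or not self.data_limitation.strip():
            raise ValueError("data_limitation must be populated on every row")


def limitation_coverage(rows: list[GenomicReportRow]) -> float:
    """Fraction of rows with a populated data_limitation (contract expects 1.0)."""
    if not rows:
        return 1.0
    return sum(bool(r.data_limitation.strip()) for r in rows) / len(rows)
```

A coverage check like this can run as a test, so a regression in the 100% population rule fails loudly instead of surfacing downstream.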

has_ecg is always false

Coherent has no ECG DiagnosticReports in the FHIR bundles (ECG is Binary waveform data, roadmap Phase 3). The has_ecg / ecg_* fields exist for forward compatibility and are honestly empty rather than silently dropped.

Imaging: metadata yes, study description no

FHIR imaging metadata (modality, body site, series count) is present for all 3,752 studies; the 298 with a downloaded DICOM file additionally carry real header dimensions, read via pydicom with stop_before_pixels=True so that pixel data is never loaded (ADR-006). But Coherent's descriptive DICOM tags are placeholder "UNKNOWN", normalized to null (ADR-013), so there's no human-readable study description to ground on. Use modality + body_site_display instead.
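
A header-only read with placeholder normalization might look like the sketch below. pydicom.dcmread and its stop_before_pixels parameter are real pydicom API; the function names, the returned dictionary keys, and the choice of StudyDescription as the descriptive tag are assumptions made for illustration.

```python
from typing import Optional

PLACEHOLDER = "UNKNOWN"


def normalize_tag(value: Optional[str]) -> Optional[str]:
    """Map placeholder or empty descriptive tags to None (null in the table)."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text in ("", PLACEHOLDER) else text


def read_header(path: str) -> dict:
    """Read DICOM header fields only; pixel data is never loaded."""
    import pydicom  # imported lazily so normalize_tag works without pydicom

    ds = pydicom.dcmread(path, stop_before_pixels=True)
    return {
        "rows": ds.get("Rows"),        # image height from the header
        "columns": ds.get("Columns"),  # image width from the header
        "study_description": normalize_tag(ds.get("StudyDescription")),
    }
```

Normalizing the placeholder to None keeps downstream consumers from treating "UNKNOWN" as a real description.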

Vitals coverage is partial (~14% of encounters)

Only encounters with linked vital-sign observations populate recent_vitals. Blood pressure is parsed from the Silver components_json (systolic LOINC 8480-6 / diastolic 8462-4).
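
As an illustration of the blood-pressure parsing, the sketch below assumes components_json holds a JSON array shaped like FHIR Observation.component entries. That shape, the function name, and the sample values are assumptions; only the two LOINC codes come from the text above.

```python
import json
from typing import Optional

SYSTOLIC_LOINC = "8480-6"
DIASTOLIC_LOINC = "8462-4"


def parse_blood_pressure(components_json: str) -> dict[str, Optional[float]]:
    """Extract systolic/diastolic values from a FHIR-style component array."""
    result: dict[str, Optional[float]] = {"systolic": None, "diastolic": None}
    for component in json.loads(components_json or "[]"):
        codings = component.get("code", {}).get("coding", [])
        codes = {c.get("code") for c in codings}  # codes compared as strings
        value = component.get("valueQuantity", {}).get("value")
        if SYSTOLIC_LOINC in codes:
            result["systolic"] = value
        elif DIASTOLIC_LOINC in codes:
            result["diastolic"] = value
    return result


sample = json.dumps([
    {"code": {"coding": [{"system": "http://loinc.org", "code": "8462-4"}]},
     "valueQuantity": {"value": 82, "unit": "mm[Hg]"}},
    {"code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
     "valueQuantity": {"value": 127, "unit": "mm[Hg]"}},
])
print(parse_blood_pressure(sample))  # {'systolic': 127, 'diastolic': 82}
```

Matching on the LOINC code rather than on array position means the result does not depend on the order in which the components appear.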

Fabric-trial scope

The Fabric tier ran green end-to-end on F4 trial capacity against a 100-patient sample (notebooks 00–10); the full 1,280-bundle re-run is pending. Framed honestly: the architecture is validated end-to-end on real Fabric infrastructure at sample scale, so the remaining work is a full-scale re-run, not a design question. It sits alongside a LocalLite tier that is validated on the full dataset.

FHIR & multimodal ingestion, in brief

| Data type | FHIR resource(s) | Silver handling |
| --- | --- | --- |
| Demographics, encounters | Patient, Encounter | Structured → typed tables |
| Conditions, observations, meds, procedures | Condition, Observation, MedicationRequest, Procedure | Structured; codes kept as strings (never cast to int) |
| SOAP notes | DocumentReference + Binary (Base64) | Decode, not parse (ADR-005); S/A/P (no Objective in Coherent) |
| Imaging | ImagingStudy + DICOM headers | Metadata via pydicom stop_before_pixels (ADR-006/013) |
| Genomics | DiagnosticReport + Binary | Report metadata + data_limitation flag only (ADR-007) |

Clinical codes (SNOMED, LOINC, ICD) are always stored as str, never cast to numeric — a small but load-bearing healthcare-data correctness rule.
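
A short demonstration of why the rule is load-bearing, using only the standard library. The zero-padded code is a hypothetical value chosen for illustration; the other two are an 18-digit SNOMED CT identifier and the systolic LOINC code mentioned earlier.

```python
snomed = "900000000000207008"  # 18-digit SNOMED CT identifier
loinc = "8480-6"               # LOINC codes contain a hyphenated check digit
padded = "0123"                # hypothetical zero-padded code, for illustration

# A float64 column (a common result of numeric type inference) cannot hold
# 18 significant digits exactly, so the identifier silently changes.
assert str(int(float(snomed))) != snomed

# LOINC codes are not numbers at all: the cast fails outright.
try:
    int(loinc)
    loinc_is_numeric = True
except ValueError:
    loinc_is_numeric = False

# Leading zeros vanish on a round trip through int.
round_tripped = str(int(padded))
print(loinc_is_numeric, round_tripped)  # False 123
```

The float case is the dangerous one, because nothing raises: the code is replaced by a nearby number that may identify a different concept or none at all.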