Benchmarks & data baseline

Reference numbers for the lakehouse — captured so changes in scale, runtime, or table sizes are visible over time, and so reviewers can see proof-of-scale. Update when the pipeline or dataset materially changes.

Environment

Machine	Apple Silicon laptop
Python	3.11.9 (`.venv`)
Platform	`local_lite` — Polars 1.41 · deltalake (delta-rs) 1.6 · DuckDB 1.5 · pyarrow 24
Source	`s3://synthea-open-data/coherent/unzipped/fhir/` (AWS Open Data, no credentials)

Dataset (Bronze)

Metric	Value
FHIR files landed	1,280 (2 are reference files: `organizations.json`, `practitioners.json`)
Distinct patients	1,278
Total size	4.6 GB (4,619,518,472 bytes, about 4.3 GiB)
Cohort partition	A = 427 · B = 427 · C = 426 (round-robin)
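
The cohort sizes above follow directly from dealing 1,280 items into three buckets in turn. A minimal sketch of that round-robin split is shown here; the function name, the sort-before-assign step, and the placeholder file names are assumptions for illustration, not the project's ingest code, and it assumes every landed file (reference files included) is assigned, since the three counts sum to 1,280.

```python
from collections import Counter

COHORTS = ("A", "B", "C")

def assign_cohorts(file_names):
    """Map each file name to a cohort label, cycling A, B, C over the sorted names."""
    ordered = sorted(file_names)  # sort first so the assignment is deterministic
    return {name: COHORTS[i % len(COHORTS)] for i, name in enumerate(ordered)}

# Hypothetical file names standing in for the 1,280 landed files.
files = [f"bundle_{i:04d}.json" for i in range(1280)]
sizes = Counter(assign_cohorts(files).values())
print(dict(sorted(sizes.items())))  # {'A': 427, 'B': 427, 'C': 426}
```

Because 1,280 is not divisible by 3, the first two cohorts each receive one more item than the last.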

Ingest (S3 → Bronze)

Metric	Value
FHIR `aws s3 sync` + cohort partition	~18m36s (network-bound, parallel sync)
DICOM `aws s3 sync` (`--with-dicom`)	298 files / 9.3 GiB (network-bound; one-time)
CSV `aws s3 sync` (`--with-csv`)	16 files / 466 MB (reference only, not processed)

Pipeline (Bronze → Silver) — full dataset

Metric	Value
Wall clock	2m19s (128.0s user · 7.5s sys · 97% CPU) — includes DICOM header reads for 298 studies
Per cohort (parse + build + MERGE)	A ~50s · B ~46s · C ~54s
Validation + read-back	<1s total
Cohorts	3 (processed as sequential micro-batches)
Validations failed	0
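
The CPU figure in the wall-clock row can be checked from the other three numbers: utilization is user plus sys time divided by wall-clock time. The helper below is an illustrative sketch (its name is hypothetical), using only the values reported above.

```python
def cpu_utilization(wall_s, user_s, sys_s):
    """Fraction of wall-clock time spent on CPU (can exceed 1.0 on multiple cores)."""
    return (user_s + sys_s) / wall_s

wall = 2 * 60 + 19  # 2m19s expressed in seconds (139 s)
util = cpu_utilization(wall, user_s=128.0, sys_s=7.5)
print(f"{util:.1%}")  # 97.5%
```

That matches the reported 97% once the fraction is dropped, and suggests the run was essentially CPU-bound on a single core.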

Silver row counts

Table	Rows	Notes
observation	669,898	vitals + labs (BP components in `components_json`)
medication_request	209,401
encounter	143,946
soap_note	143,946	~1 note per encounter; S/A/P, no Objective (ADR-005)
procedure	56,092
condition	15,956
imaging_study	3,752	FHIR metadata; 298 enriched with DICOM headers (ADR-013)
genomic_report	419	`data_limitation` 100% populated; 0 pathogenic (synthetic)
patient	1,278
ecg_metadata	0	ECG is Binary waveform, not in FHIR — roadmap Phase 3
ingest_log	11	one validation row per Silver table per run

Gold (Silver → `gold.encounter_summary`) — full dataset

Polars in-process denormalization of all 10 Silver tables → one row per encounter (ADR-012). Reads Silver via the platform, builds with the pure Gold transform, writes one Delta table (overwrite + CDC) plus the corpus manifest.

Metric	Value
Wall clock	~6.5s (incl. patient-level as-of-date joins for conditions/meds, ADR-014)
Output rows	143,946 (one per encounter)
Output columns	22 (incl. nested struct vitals/imaging/versions + array conditions/meds/labs)
Nested types in Delta	round-trip verified; CDC enabled

Corpus coverage

Metric	Value	Note
Encounters	143,946
Distinct patients	1,278
With SOAP note	143,946 (100%)	primary generation anchor
With labs	26,059
With vitals	19,830	BP parsed from `components_json`
With imaging	3,752	298 carry DICOM `study_date` (ADR-013)
With genomics	419	synthetic (ADR-007)
With ECG	0	no ECG in Coherent FHIR
Avg conditions / encounter	9.57	as-of-date problem list, onset+abatement gated (ADR-014)
Avg medications / encounter	1.66	as-of-date, status=active (ADR-014)
Encounters w/ empty problem list	0.9%	down from the majority under the old encounter-grain join
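
To make the "as-of-date, onset+abatement gated" problem list concrete, here is a small pure-Python sketch of the gating rule. It is not the Gold transform itself (which the source describes as Polars joins); the `Condition` class, the field names, and the choice to treat the abatement date as exclusive are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Condition:
    patient_id: str
    code: str
    onset: date
    abatement: Optional[date]  # None means the condition has not resolved

def problem_list_as_of(conditions, patient_id, as_of):
    """Codes for the patient's conditions that have started and not yet abated on `as_of`."""
    return [
        c.code
        for c in conditions
        if c.patient_id == patient_id
        and c.onset <= as_of                               # already diagnosed
        and (c.abatement is None or c.abatement > as_of)   # assumed exclusive end date
    ]

conditions = [
    Condition("p1", "hypertension", date(2015, 3, 1), None),
    Condition("p1", "sinusitis", date(2018, 1, 10), date(2018, 2, 1)),
    Condition("p1", "diabetes", date(2021, 6, 5), None),
]
print(problem_list_as_of(conditions, "p1", date(2018, 1, 20)))  # ['hypertension', 'sinusitis']
print(problem_list_as_of(conditions, "p1", date(2019, 1, 1)))   # ['hypertension']
```

Under this rule an encounter inherits every condition still open for the patient on that date, not only conditions recorded during the encounter itself, which is consistent with the low share of empty problem lists reported above.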

Engine comparison (target matrix)

Independent engine-native tiers emitting the same Gold contract (ADR-022). LocalLite is measured on the full dataset; the Fabric tier ran green on F4 capacity against a 100-patient sample.

Capability	local_lite	Fabric	Databricks	AWS
Bronze→Silver (full)	✅ 2m19s	✅ F4 (100-sample)	roadmap	roadmap
Silver→Gold (full)	✅ ~6.5s	✅ F4 (100-sample)	roadmap	roadmap
CDC	✅	✅	roadmap	roadmap
Streaming	sim only	🔜 Auto Loader	roadmap	roadmap
Cost (1.3k pts)	$0	trial	—	—

Execution surfaces

Two LocalLite surfaces over the same pure transforms, plus the Fabric tier's own notebooks (ADR-022):

Surface	Where	Use it for
`core.surfaces.cli.pipeline` CLI	`core/surfaces/cli/pipeline.py`	Default, dependency-light, CI gate; full-rebuild + per-cohort flags
Dagster asset graph	`core/orchestration/dagster/` (ADR-015/016)	Per-cohort backfill via UI, `validate_table` as asset checks (rule-by-rule pass/fail in metadata), sensor on `data/bronze/fhir/`, run history. Each asset's `MaterializeResult` carries schema + sample rows + sample-bundle/SOAP-card markdown so clicking a node shows what materialized
Fabric notebooks	`fabric/notebooks/` 00–10	Fabric tier's Spark-native impl over OneLake (ADR-022); green on F4 (100-sample)

Dagster timings track the CLI numbers above: the work is in the transforms, and orchestration overhead is on the order of milliseconds per asset on the fixture. There is no standalone Dagster benchmark, because the value is the graph, the checks, and partition-level backfill, not throughput.

Demo / read-only query surface

For exploration alongside the three execution surfaces (not a fourth tier — purely read-only over the existing Delta tables):

Tool	Where	Demo use
`core/scripts/demo_walkthrough.py`	rich CLI, one patient end-to-end	Bronze → Parse → Silver → Gold for one anchor patient, with full SOAP note rendered
`docs/demo/notebooks/demo_notebook.sql`	DuckDB UI, 20 SQL cells	Corpus headlines, top conditions, as-of-date condition growth, full SOAP notes, keyword cohort search
`docs/demo/PLAYBOOK.md`	recording guide	5-beat demo video script + take-by-take recording sequence

The Dagster asset metadata, the CLI walkthrough, and the SQL notebook all render via `core/preview.py`, so they present the same data shape.

Reproduce

pip install -e ".[local,dev]"                 # or: .venv
python -m core.ingest.download --bronze-root data/bronze   # FHIR, ~4.6 GB, network-bound
python -m core.ingest.download --assets-only --with-dicom --with-csv  # +9.3 GiB DICOM, 466 MB CSV
python -m core.surfaces.cli.pipeline --bronze-root data/bronze --with-gold  # → silver/* (~2m19s) + gold/* (~6.5s)
python -m core.surfaces.cli.pipeline --gold-only                        # rebuild Gold from existing Silver

Full re-runs build from a clean slate: delta-rs MERGE is used for incremental cohort landing, not for rewriting whole tables. Remove `data/silver` and `data/gold` before a full `--with-gold` rebuild. Both are reproducible from Bronze.
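
The clean-slate step can be scripted. The helper below is a hedged sketch, not part of the repository: the function name and its defaults are hypothetical, and it assumes the Silver and Gold layers are plain directories under `data/`, as the paths above suggest.

```python
import shutil
from pathlib import Path

def clean_derived_layers(data_root="data", layers=("silver", "gold")):
    """Delete derived layer directories under data_root; Bronze is left untouched."""
    removed = []
    for layer in layers:
        path = Path(data_root) / layer
        if path.is_dir():
            shutil.rmtree(path)   # removes the directory tree, Delta logs included
            removed.append(path)
    return removed                # paths actually deleted, useful for logging
```

Calling `clean_derived_layers()` from the repository root and then re-running the `--with-gold` command above would rebuild both layers from Bronze.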

Methodology & caveats

Single run, warm OS file cache; numbers are indicative, not a controlled benchmark.
Ingest time is network-bound and will vary; the pipeline time is the stable figure.
local_lite holds one cohort's records in memory at a time (~1/3 of the data); peak RSS stayed well under what a typical dev laptop offers. Full-dataset-in-memory was deliberately avoided.
Fabric/Spark full-run figures will be filled in after the full 1,280-bundle re-run; the 100-patient F4 run is green end-to-end (notebooks 00–10).
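
The "one cohort in memory at a time" behaviour noted above can be pictured as a plain sequential loop. This is a toy sketch under stated assumptions: `load`, `transform`, and `merge` are hypothetical callables standing in for the pipeline's parse, build, and MERGE steps, and the peak is counted in records, not bytes.

```python
def run_micro_batches(cohorts, load, transform, merge):
    """Process cohorts one at a time so only one cohort's records are held at once."""
    peak_records = 0
    for cohort in cohorts:
        records = load(cohort)                 # only this cohort is materialized
        peak_records = max(peak_records, len(records))
        merge(cohort, transform(records))      # land the batch before loading the next
        del records                            # drop the reference so memory can be reclaimed
    return peak_records

# Toy stand-ins sized like the Bronze cohorts.
cohort_sizes = {"A": 427, "B": 427, "C": 426}
landed = {}

peak = run_micro_batches(
    cohorts=sorted(cohort_sizes),
    load=lambda c: [{"cohort": c, "n": i} for i in range(cohort_sizes[c])],
    transform=lambda rows: rows,
    merge=lambda c, rows: landed.__setitem__(c, len(rows)),
)
print(peak, sum(landed.values()))  # 427 1280
```

The peak stays at the size of the largest cohort (about a third of the total), which is the trade the pipeline makes in exchange for running the cohorts sequentially.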