Reference numbers for the lakehouse — captured so changes in scale, runtime, or table
sizes are visible over time, and so reviewers can see proof-of-scale. Update when the
pipeline or dataset materially changes.
ECG is Binary waveform, not in FHIR — roadmap Phase 3
| ingest_log | 11 | one validation row per Silver table per run |
Gold (Silver → gold.encounter_summary): full dataset
Polars in-process denormalization of all 10 Silver tables → one row per encounter
(ADR-012). Reads Silver via the platform, builds with the pure Gold transform, writes one
Delta table (overwrite + CDC) plus the corpus manifest.
| Metric | Value |
| --- | --- |
| Wall clock | ~6.5s (incl. patient-level as-of-date joins for conditions/meds, ADR-014) |
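To illustrate what a patient-level as-of-date join means here, the sketch below attaches to each encounter the most recent condition recorded on or before the encounter start. It is a dependency-free illustration in plain Python, not the pipeline's Polars code. The column names, the sample rows, the helpers `latest_as_of` and `build_summary`, and the "latest record wins" rule are all assumptions made for the example. Polars provides `join_asof` for this kind of join; whether the Gold transform uses it is not stated here.

```python
from bisect import bisect_right
from datetime import date


def latest_as_of(events, as_of):
    """Return the value of the latest event dated on or before `as_of`.

    `events` is a list of (date, value) tuples sorted ascending by date.
    Returns None when no event qualifies.
    """
    dates = [d for d, _ in events]
    idx = bisect_right(dates, as_of)  # count of events with date <= as_of
    return events[idx - 1][1] if idx else None


def build_summary(encounters, conditions_by_patient):
    """Emit one output row per encounter, enriched with an as-of condition."""
    return [
        {
            **enc,
            "condition_as_of": latest_as_of(
                conditions_by_patient.get(enc["patient_id"], []), enc["start"]
            ),
        }
        for enc in encounters
    ]


# Hypothetical Silver-like rows; names and values are illustrative only.
conditions_by_patient = {
    "p1": [(date(2023, 1, 5), "hypertension"), (date(2024, 3, 1), "diabetes")],
}
encounters = [
    {"encounter_id": "e1", "patient_id": "p1", "start": date(2023, 6, 1)},
    {"encounter_id": "e2", "patient_id": "p1", "start": date(2024, 6, 1)},
    {"encounter_id": "e3", "patient_id": "p2", "start": date(2024, 6, 1)},
]

summary = build_summary(encounters, conditions_by_patient)
```

The output keeps exactly one row per encounter: a patient with no conditions (`p2` above) still yields a row, with `condition_as_of` set to `None`.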
Independent engine-native tiers emitting the same Gold contract (ADR-022). LocalLite is measured
on the full dataset; the Fabric tier ran green on F4 capacity against a 100-patient sample.
Two LocalLite surfaces over the same pure transforms, plus the Fabric tier's own notebooks (ADR-022):
| Surface | Where | Use it for |
| --- | --- | --- |
| core.surfaces.cli.pipeline CLI | core/surfaces/cli/pipeline.py | Default, dependency-light, CI gate; full-rebuild + per-cohort flags |
| Dagster asset graph | core/orchestration/dagster/ (ADR-015/016) | Per-cohort backfill via UI, validate_table as asset checks (rule-by-rule pass/fail in metadata), sensor on data/bronze/fhir/, run history. Each asset's MaterializeResult carries schema + sample rows + sample-bundle/SOAP-card markdown so clicking a node shows what materialized |
| Fabric notebooks | fabric/notebooks/ 00–10 | Fabric tier's Spark-native impl over OneLake (ADR-022); green on F4 (100-sample) |
Dagster timings track the CLI numbers above: the work is in the transforms, and
orchestration overhead is on the order of milliseconds per asset on the fixture. There is
no standalone Dagster benchmark, because the value is the graph, the checks, and
partition-level backfill, not throughput.
Full re-runs build from a clean slate (delta-rs MERGE is for incremental cohort landing,
not whole-table re-update), so remove data/silver and data/gold before a full
--with-gold rebuild. Both directories are reproducible from Bronze.
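The clean-slate step above can be scripted. The following is a minimal sketch, assuming the layers live under `data/<layer>` relative to a project root; `clean_slate` is a hypothetical helper, not part of the repository. It removes only Silver and Gold and leaves Bronze alone, since Bronze is the source everything else is rebuilt from.

```python
import shutil
import tempfile
from pathlib import Path


def clean_slate(root, layers=("silver", "gold")):
    """Delete data/<layer> under `root` and return the paths that were removed.

    Hypothetical helper: Bronze is never listed, so it is never touched.
    """
    removed = []
    for layer in layers:
        target = Path(root) / "data" / layer
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
    return removed


# Demo against a throwaway directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    for layer in ("bronze", "silver", "gold"):
        (Path(tmp) / "data" / layer).mkdir(parents=True)

    removed = [p.name for p in clean_slate(tmp)]
    bronze_survives = (Path(tmp) / "data" / "bronze").is_dir()
    silver_gone = not (Path(tmp) / "data" / "silver").exists()
    gold_gone = not (Path(tmp) / "data" / "gold").exists()
```

After a call like this against the real project root, the full --with-gold rebuild starts from empty Silver and Gold directories.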
Single run, warm OS file cache; numbers are indicative, not a controlled benchmark.
Ingest time is network-bound and will vary; the pipeline time is the stable figure.
local_lite holds one cohort's records in memory at a time (~1/3 of the data); peak
RSS stayed well under what a typical dev laptop offers. Full-dataset-in-memory was
deliberately avoided.
Fabric/Spark full-run figures will be filled in after the full 1,280-bundle re-run; the
100-patient F4 run is green end-to-end (notebooks 00–10).
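The one-cohort-at-a-time memory note above can be sketched as a generator that loads, transforms, and releases each cohort before touching the next. This is an illustration of the idea only: `process_by_cohort`, the loader, the pass-through transform, and the three fake cohorts are assumptions for the example, not the local_lite implementation.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]


def process_by_cohort(
    cohort_ids: Iterable[str],
    load_cohort: Callable[[str], List[Record]],
    transform: Callable[[List[Record]], List[Record]],
) -> Iterator[Tuple[str, int]]:
    """Yield (cohort_id, row_count), materializing one cohort at a time."""
    for cohort_id in cohort_ids:
        records = load_cohort(cohort_id)  # only this cohort is held in memory
        rows = transform(records)
        yield cohort_id, len(rows)
        # `records` and `rows` are rebound on the next iteration, so the
        # previous cohort becomes eligible for garbage collection.


# Hypothetical in-memory stand-in for Bronze, split into three cohorts.
FAKE_BRONZE: Dict[str, List[Record]] = {
    "cohort_a": [{"id": 1}, {"id": 2}],
    "cohort_b": [{"id": 3}],
    "cohort_c": [{"id": 4}, {"id": 5}, {"id": 6}],
}

loaded_sizes: List[int] = []


def load_cohort(cohort_id: str) -> List[Record]:
    """Load one cohort and record how many records were materialized."""
    records = list(FAKE_BRONZE[cohort_id])
    loaded_sizes.append(len(records))
    return records


def passthrough(records: List[Record]) -> List[Record]:
    """Placeholder transform; a real one would build Silver rows."""
    return records


results = list(process_by_cohort(FAKE_BRONZE, load_cohort, passthrough))
largest_batch = max(loaded_sizes)  # the most records held at any one time
total_records = sum(loaded_sizes)  # what a full in-memory load would hold
```

In this toy run the largest batch is smaller than the total, which is the property that keeps peak memory bounded by the largest cohort rather than by the full dataset.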