Runbook — scribe-iq-lakehouse (local tier)¶
Operational procedures for ingesting, building, verifying, and troubleshooting the lakehouse
on the local_lite platform (Polars + delta-rs, zero cloud). The local stack ships with two
execution surfaces — the core.surfaces.cli.pipeline CLI (default, dependency-light, the CI path) and
a Dagster asset graph (core/orchestration/dagster/, ADR-015/016, optional [orchestration] extra);
both reuse the same pure transforms. Fabric deployment is covered separately in the
Fabric Deployment guide.
For why the system is shaped this way, see ARCHITECTURE.md and the
ADRs; for reference numbers, see BENCHMARKS.md.
All commands assume the repo root and an activated venv (
source .venv/bin/activate). If you haven't activated it, prefix with.venv/bin/(e.g..venv/bin/python -m core.surfaces.cli.pipeline).
1. Prerequisites¶
| Need | Why | Check |
|---|---|---|
| Python 3.11+ | runtime | python --version |
venv with [local,dev] extras |
polars, delta-rs, duckdb, pytest, jsonschema | pip install -e ".[local,dev]" |
Optional [orchestration] extra |
Dagster + dagster-webserver (for §6) | pip install -e ".[local,dev,orchestration]" |
| AWS CLI | S3 ingest only (not needed to run tests or rebuild from existing Bronze) | aws --version |
| Disk | FHIR 4.6 GB; +9.3 GB if pulling DICOM; Silver/Gold a few hundred MB | df -h . |
The S3 source is AWS Open Data — no credentials required
(s3://synthea-open-data/coherent/unzipped/). All ingest uses --no-sign-request.
Configuration¶
| Variable | Default | Purpose |
|---|---|---|
LAKEHOUSE_PLATFORM |
local_lite |
Selects the platform implementation (factory). |
LAKEHOUSE_LOCAL_ROOT |
data |
Local storage root for bronze/silver/gold. |
DAGSTER_HOME |
(unset) | Where dagster dev keeps run history / schedules / storage. Set to $PWD/dagster_home so it lives next to the repo and is gitignored. |
Storage layout (all gitignored under data/):
data/
├── bronze/
│ ├── fhir/cohort=A|B|C/*.json # 1,280 FHIR bundles
│ ├── dicom/*.dcm # 298 DICOM files (optional)
│ ├── csv/*.csv # 16 CSV exports (optional, reference only)
│ └── _metadata/{manifest,assets_manifest}.json
├── silver/<10 tables>/ + ingest_log/ # Delta, CDC enabled
└── gold/encounter_summary/ + _metadata/corpus_manifest.json
2. First full run (end to end)¶
# 1. Install
python -m venv .venv && source .venv/bin/activate
pip install -e ".[local,dev]"
# 2. Land FHIR into Bronze (~18 min, network-bound; partitions into cohorts A/B/C)
python -m core.ingest.download --bronze-root data/bronze
# 3. (Optional) Land DICOM + CSV assets (~10 GB; enables imaging header extraction)
python -m core.ingest.download --assets-only --with-dicom --with-csv
# 4. Build Bronze → Silver → Gold (~2m19s Silver + ~6.5 s Gold)
python -m core.surfaces.cli.pipeline --with-gold
# 5. Verify (see §5)
Step 4 prints a JSON summary of per-table row counts and the Gold corpus stats.
3. Ingest¶
FHIR bundles¶
python -m core.ingest.download --bronze-root data/bronze # full dataset (1,280 files)
python -m core.ingest.download --max-files 30 # dev subset (fast)
Bundles download flat, then partition round-robin into cohort=A|B|C (balanced, order-
independent samples — the local analogue of streaming micro-batches). A manifest is written to
data/bronze/_metadata/manifest.json. Re-runs are idempotent (aws s3 sync skips existing).
DICOM + CSV assets (optional)¶
- DICOM (~9.3 GB, 298 files) →
data/bronze/dicom/. Enablessilver.imaging_studyheader extraction (study_date, dimensions, slice thickness) for the 298 studies that have a file. See ADR-013. Writes_metadata/assets_manifest.json. - CSV (~466 MB, 16 files) →
data/bronze/csv/. Landed for reference; not processed by any transform (the lakehouse is FHIR-first).
If you skip DICOM, the pipeline still runs — imaging rows simply keep FHIR-only metadata
(DICOM header columns stay null, dicom_extracted = false).
4. Build the medallion¶
Full Bronze → Silver → Gold¶
Per cohort: parse bundles → build all 10 Silver tables → MERGE-upsert into Delta (DICOM headers
merged in if data/bronze/dicom/ is present). After all cohorts land, every table is read back,
validated (spec §5.6) and logged to silver.ingest_log. Then Gold is denormalized and written
with its corpus manifest.
Gold only (Silver already built)¶
Reads the existing Silver tables and rebuilds gold.encounter_summary + manifest (~6.5 s). Use
after changing only Gold logic.
Single cohort / incremental¶
python -m core.surfaces.cli.pipeline --cohort A # process one cohort (MERGE-upsert)
python -m core.surfaces.cli.pipeline --cohort A --cohort B # process several
MERGE upsert means re-landing a cohort updates its rows in place — the intended path for incremental ingest.
Full clean rebuild¶
Required for a whole-dataset re-run (see Troubleshooting). Silver and Gold are derived data and always reproducible from Bronze, so deleting them is safe.
Streaming simulation¶
core/ingest/streaming_sim.py replays cohort partitions one at a time with a filesystem
watchdog — the local stand-in for Fabric Auto Loader (spec §5.2). It is a library used by the
pipeline/tests, not a standalone CLI.
5. Verify a build¶
Quick row-count + coverage check (matches the pipeline's printed summary and BENCHMARKS.md):
import json
from deltalake import DeltaTable
for t in ["patient", "encounter", "condition", "observation", "imaging_study"]:
dt = DeltaTable(f"data/silver/{t}")
print(f"silver.{t:14s} rows={dt.to_pyarrow_dataset().count_rows():>7} v{dt.version()}")
g = DeltaTable("data/gold/encounter_summary")
print("gold.encounter_summary rows:", g.to_pyarrow_dataset().count_rows())
print(json.load(open("data/gold/_metadata/corpus_manifest.json"))["corpus_stats"])
Expected at full scale: encounter = 143,946; gold.encounter_summary = 143,946;
distinct_patients = 1,278; avg conditions/encounter ≈ 9.6, medications ≈ 1.7; imaging studies
with DICOM = 298. Validation failures show up as _failed > 0 in the run summary and as rows in
silver.ingest_log with passed = false.
Ad-hoc SQL over the Delta tables with DuckDB (no Spark):
import duckdb
duckdb.sql("""
SELECT encounter_type, count(*) n
FROM delta_scan('data/gold/encounter_summary')
GROUP BY 1 ORDER BY n DESC LIMIT 10
""").show()
Read a SOAP-note sample (the demo centerpiece):
duckdb.sql("SELECT soap_note_text FROM delta_scan('data/gold/encounter_summary') "
"WHERE soap_note_text IS NOT NULL LIMIT 1").show(max_width=120)
Or run the full one-patient walkthrough for a rendered Bronze → Silver → Gold demo (also what's reused by the Dagster asset metadata, see §6):
python -m core.scripts.demo_walkthrough # auto-picks a "good demo" patient
python -m core.scripts.demo_walkthrough --pause 1.5 # 1.5s between sections (screencast pacing)
python -m core.scripts.demo_walkthrough --patient-id <uuid> # reproducible
For interactive SQL exploration — corpus headlines, top conditions, full SOAP notes, keyword search, as-of-date condition evolution — open the DuckDB UI notebook:
brew install duckdb # needs DuckDB ≥1.2 (-ui flag)
duckdb docs/demo/notebooks/demo.duckdb -ui # opens http://localhost:4213
20 SQL cells over the Delta tables; see docs/demo/notebooks/README.md
for the per-cell guide and how to regenerate the .duckdb (gitignored) if missing.
For recording a demo video around it, see docs/demo/PLAYBOOK.md.
6. Run via Dagster (optional)¶
The medallion is also exposed as a Dagster software-defined asset graph in core/orchestration/dagster/
(ADR-015, ADR-016)
— a richer, observable execution surface alongside the CLI. The same pure transforms run
under both; only the orchestration layer differs.
Install + launch the UI¶
pip install -e ".[local,dev,orchestration]"
export DAGSTER_HOME="$PWD/dagster_home" && mkdir -p "$DAGSTER_HOME"
dagster dev # opens http://localhost:3000 (asset graph)
dagster dev reads [tool.dagster] module_name = "core.orchestration.dagster.definitions"
from pyproject.toml. The asset graph nodes are:
bronze_fhir [cohort-partitioned]
└─→ silver_tables (multi_asset, cohort-partitioned, 10 outputs)
├─→ patient · encounter · observation · condition · procedure
├─→ medication_request · soap_note · imaging_study · genomic_report · ecg_metadata
│ (each with a @asset_check wrapping validate_table)
└─→ gold_encounter_summary [unpartitioned aggregate]
(also writes gold/_metadata/corpus_manifest.json)
Per-cohort materialization & backfill¶
Each cohort under data/bronze/fhir/cohort=*/ is a Dagster partition that flows through
both bronze_fhir and the silver_tables multi-asset (Gold is unpartitioned). From
the UI:
- Assets → bronze_fhir → Materialize → pick a partition (e.g.
A) — runs Bronze and all 10 Silver tables for cohortAin one Dagster run (MERGE-upsert); row counts surface as asset metadata. - Backfill the missing partitions (
B,C, …) from the same page — or right-clicksilver_tables→ Materialize all to fan out across every registered partition. - After all cohorts are present, materialize
gold_encounter_summary(unpartitioned) to rebuild the corpus and the manifest.
This is the incremental path that actually works — full-table whole rebuilds still need the CLI clean-slate dance (see Troubleshooting), because delta-rs MERGE is incremental by design.
What clicking an asset shows¶
Each asset surfaces inline metadata so the graph isn't just lineage — it's the actual data:
bronze_fhir[<cohort>]— file count + total bytes + a Markdown breakdown of the first bundle (FHIR resource-type counts: Patient, Encounter, Observation, ...).- Silver outputs (per partition) — row count, primary key, the full Arrow schema as a table, and a sample of the first 5 rows. Click any one of the 10 Silver assets to see what that table actually looks like for that cohort.
gold_encounter_summary— corpus stats JSON + Silver source versions + the full schema + one sample encounter rendered as a Markdown card (patient/date/age + SOAP note text + active conditions / medications / vitals / imaging).
The same renderings are reused by python -m core.scripts.demo_walkthrough for a CLI audience
and by the DuckDB UI notebook for an SQL audience (see §5 above and
docs/demo/PLAYBOOK.md for the demo-video recording guide).
Asset checks — rule-by-rule detail¶
Every Silver table carries an @asset_check that re-reads via the platform and runs the
same validate_table() the CLI logs to silver.ingest_log. Click the check in the UI to
see a Markdown table of every rule that ran — name, pass/fail, and the actual numbers
(e.g. non_null:patient_id → "0 nulls / 1,278 rows", unique:encounter_id → "143,946/143,946
distinct"). Failures surface as red badges; pre-fix this was the imperative pipeline's only
visibility into validation.
Bronze cohort sensor (the demo)¶
bronze_cohort_sensor is the Dagster analogue of the Auto Loader streaming-sim
(spec §5.2). Every tick it
diffs cohort directories under data/bronze/fhir/ against the registered dynamic
partitions and, for each new cohort, fires a single RunRequest that materializes
bronze_fhir and all 10 Silver tables for that partition — Bronze → Silver in one
run. Gold is intentionally not in the sensor target (unpartitioned aggregate; rebuild
manually once the cohorts of interest are present).
- Default status: STOPPED. Otherwise it would replay every existing cohort on
dagster devstartup. Start it from Overview → Sensors → bronze_cohort_sensor. - Polling interval: 30 s.
- Drop a new cohort to demo it: create
data/bronze/fhir/cohort=D/with one or more FHIR bundles and wait up to 30 s — the run appears in Runs and Bronze + Silver partitions forDfill in. (To re-demo across A/B/C, you can clear the dynamic partition set first:dagster asset wipe --partitions A B Cor, simpler for a fresh repo, just point at a freshLAKEHOUSE_LOCAL_ROOTso the sensor sees A/B/C as new.) - The sensor adds new partition keys in the same evaluation as it issues runs
(
dynamic_partitions_requests+run_requestsin oneSensorResult) so the partitions exist before the runs start.
Persistence model (ADR-016)¶
Assets return MaterializeResult (metadata only) — the LakehousePlatform writes the
Delta bytes, not a Dagster IOManager. Single persistence authority; Dagster owns the DAG
and observability. The [orchestration] extra is optional by design — CI and the CLI
path don't need it; core/tests/test_dagster_defs.py uses pytest.importorskip so the suite
collects cleanly without it.
7. Regenerate generated docs¶
The data dictionary and corpus JSON Schema are generated from code and guarded by tests + pre-commit (ADR-011). Regenerate after changing a Silver schema, a validation rule, or the Gold schema/contract:
python core/scripts/gen_data_dictionary.py # docs/DATA_DICTIONARY.md
python core/scripts/gen_corpus_schema.py # schemas/gold_encounter_summary.json
# CI / pre-commit run the read-only --check variants and fail on drift.
Never hand-edit those two files; durable per-column prose goes in the generator's _COLUMN_NOTES.
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
DeltaError: MERGE matched a target row with multiple source rows |
Full re-run on top of existing tables — MERGE upsert is for incremental landing, not whole-table re-update (every source row matches) | rm -rf data/silver data/gold then python -m core.surfaces.cli.pipeline --with-gold (clean slate; rebuilds from Bronze) |
RuntimeError: No cohorts found under .../fhir/ |
Bronze not populated | Run python -m core.ingest.download first |
RuntimeError: AWS CLI not found on PATH |
AWS CLI missing | Install AWS CLI (only ingest needs it; tests/rebuilds don't) |
Imaging DICOM columns all null / dicom_extracted = false |
DICOM prefix not pulled | python -m core.ingest.download --assets-only --with-dicom, then rebuild Silver |
Skipping unreadable DICOM header for one imaging study (warning) |
A malformed .dcm |
Non-fatal by design — FHIR metadata stands for that study (ADR-013) |
imaging.study_description is null even with DICOM |
Coherent ships placeholder UNKNOWN description tags, normalized to null (ADR-013) |
Expected — ground imaging on modality + body_site_display |
active_conditions huge (avg ~9.6) |
As-of-date problem list incl. Synthea SDOH/social "conditions" (ADR-014) | Expected — faithful to source |
DATA_DICTIONARY.md is out of date (test/commit fails) |
Silver schema/rule changed without regen | python core/scripts/gen_data_dictionary.py (and gen_corpus_schema.py) |
| Out of disk during DICOM pull | 9.3 GB of .dcm |
DICOM is optional and re-pullable from S3; safe to delete data/bronze/dicom |
.claude/settings.json keeps showing modified |
Harness appends auto-approved permissions to the tracked file | Move them to gitignored .claude/settings.local.json, git restore the tracked file |
ModuleNotFoundError: No module named 'dagster' when running dagster dev or tests/test_dagster_defs.py |
[orchestration] extra not installed (deliberately optional) |
pip install -e ".[local,dev,orchestration]"; the test file uses importorskip so the rest of the suite still runs |
dagster dev creates a stray .tmp_dagster_home_*/ next to the repo |
DAGSTER_HOME is unset |
export DAGSTER_HOME="$PWD/dagster_home" before dagster dev (gitignored) |
Dagster bronze_cohort_sensor not firing |
Sensor is default-STOPPED (avoids replaying existing cohorts on startup) | Overview → Sensors → toggle on |
Dagster Silver materialize errors with MERGE matched a target row with multiple source rows |
Re-materializing an already-landed cohort against a stale full-table state (same root cause as the CLI gotcha) | Materialize per-partition (cohorts are the working incremental unit); for a full reset, use the CLI clean-slate path |
Logging & PHI safety¶
Logs never contain patient/encounter identifiers, note text, or bundle filenames — identifier-
bearing values are redacted to a non-reversible ref:<hash> (ADR-010).
Set log level via standard logging config; the pipeline emits INFO progress + per-table metrics.