Changelog¶
All notable changes to scribe-iq-lakehouse. Format: Keep a Changelog
[Unreleased]¶
Documentation site (2026-05-31) — MkDocs Material + portfolio docs + consistency fixes¶
Added¶
- MkDocs Material documentation site (mkdocs.yml) for GitHub Pages: Material theme (indigo, light/dark toggle), mermaid via pymdownx.superfences, pymdownx.snippets, curated nav. 8 new hand-authored pages: Home (docs/index.md), Reviewer Guide, Engineering Case Study, Design Notes, Healthcare & Responsible Data, Multi-Platform Engine Parity, Downstream & Portfolio, About. Six mermaid diagrams including both engine-native tiers (LocalLite D2, Fabric D3) and the ADR-022 parity/convergence (D4).
- .github/workflows/docs.yml: GitHub Pages deploy via the Pages-artifact mechanism (upload-pages-artifact → deploy-pages), gated on gen_data_dictionary.py --check + gen_corpus_schema.py --check and mkdocs build --strict. One-time maintainer step: Settings → Pages → Source = GitHub Actions.
- [project.optional-dependencies] docs (mkdocs-material, pymdown-extensions) + requirements-docs.txt (CI-cache mirror); /site/ gitignored.
Changed (doc-consistency pass — reviewer docs had drifted behind ADR-017/022)¶
- README "Architecture at a glance" rewritten to the ADR-022 reality (two independent engine-native tiers, LocalLite → pa.Table and Fabric → Spark DataFrame, compatible by schema parity + lockstep CONTRACT_VERSION). Fixed broken ADR-002/004 links (→ 022/017/021), unified Fabric status (green on F4, 100-sample), and corrected counts (129 tests; 22 ADRs).
- ARCHITECTURE.md module map corrected to the real core/ tree (orchestration + scripts under core/, pipeline under surfaces/cli/); ADR-002/004 → ADR-022 framing.
- CORPUS_CONTRACT / BENCHMARKS / PLAYBOOK: local/ → core/ paths, scripts.demo_walkthrough → core.scripts.demo_walkthrough, Fabric engine-matrix status; source-file links repointed to GitHub blob URLs (resolve in both the repo and the site).
- core/scripts/gen_data_dictionary.py header template local/ → core/ (regenerated DATA_DICTIONARY.md).
- ADR index gains a local/ → core/ rename banner; roadmap spec gains an "intended end-state, not as-built" banner.
Tests / quality¶
- Full suite green (128 passed, 1 skipped: the Fabric-workspace test); generated-doc gates pass; mkdocs build --strict clean (zero warnings); all six diagrams render.
- Note: 20 pre-existing ruff findings remain in the Fabric tier (fabric/*, core/orchestration/*) from the in-progress Session 5 branch, untouched by this work.
Session 5 (in progress) — Fabric Spark-native rewrite (ADR-022) + dedup fix + Power BI¶
Plan: docs/roadmap/fabric-execution-plan.md.
Milestone (2026-05-29 — first green cloud run)¶
- Notebooks 00–10 ran successfully end-to-end on Fabric F4 capacity against SAMPLE_SIZE=100 Coherent bundles. All 10 Silver tables + gold.encounter_summary + Bronze/Gold manifests materialized in the scribe_iq_synthea_coherent lakehouse.
- Branch feat/fabric-spark-native pushed to both GitHub (canonical mirror) and Azure DevOps (Fabric Git Integration source) via multi-push origin. A single git push fans out to both.
Added (2026-05-29)¶
- fabric/notebooks/01_bronze_ingest.Notebook/: self-contained Bronze ingest. Pulls Synthea Coherent from s3://synthea-open-data/coherent/ via anonymous boto3, round-robin partitions into cohort=A,B,C under Files/bronze/fhir/, writes an IngestManifest-shaped JSON under Files/bronze/_metadata/. SAMPLE_SIZE knob for fast demo (100) vs full corpus (None).
- fabric/environments/public_libraries.yml: pip-block file Fabric's Environment "Import .yml" UI accepts; pins boto3==1.35.36 + botocore==1.35.36 for reproducibility.
- FabricPlatform.files_path(subpath): Files/-rooted URI helper for non-table artifacts (Bronze JSON, Gold manifest). One place owns the GUID-vs-name path detail.
Changed (2026-05-29 — operational fixes from cloud run)¶
- FabricPlatform.ensure_env now reads from Spark conf (trident.workspace.id, trident.lakehouse.id) instead of mssparkutils.env.getWorkspaceId(); the latter is a Synapse API not present on Fabric. Returns workspace + lakehouse GUIDs (not name); display name is best-effort, informational only.
- OneLake paths now use lakehouse GUID throughout (drop .Lakehouse suffix). Required for tenants with FriendlyNameSupportDisabled (the trial tenant has this): <name>.Lakehouse paths get HTTP 400. Notebooks 00, 01, 10 updated to use platform.files_path() instead of inline path construction.
- 00_setup Gate 1 reads spark.conf.get("trident.workspace.id") (drops the broken mssparkutils.env.getWorkspaceId call).
- 01_bronze_ingest validation cell uses spark.read.text(wholetext=True) to read the sample bundle; mssparkutils.fs.head silently truncates at ~100 KB even when a larger maxBytes is passed, breaking json.loads. Sample-histogram wrapped in try/except so a parse failure prints a one-liner instead of halting the cell (manifest write below it now always runs).
- fabric/environments/lakehouse_env.yml: documentation-style spec updated to match ADR-022; drops pyarrow/pydicom/python-dateutil (not used by the pure-Spark Fabric tier: Fabric runtime supplies pyarrow; pydicom is local-only; date parsing is Spark-native).
- .github/workflows/fabric-deploy.yml renamed fabric-deploy.yml.disabled. User removed the fabric-prod GitHub Environment; the workflow's environment: fabric-prod would fail on trigger. Matches the existing aws-deploy.yml.disabled / databricks-deploy.yml.disabled convention. Active deploy path is Azure DevOps Git Integration + manual UI wheel upload.
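The GUID-based path rule above can be illustrated with a small sketch. It assumes the general OneLake abfss URI shape (workspace GUID as the container, item GUID as the first path segment); the free function and its three-argument signature are hypothetical and are not the actual FabricPlatform.files_path(subpath) method.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def files_path(workspace_id: str, lakehouse_id: str, subpath: str) -> str:
    """Build a Files/-rooted OneLake URI from GUIDs only (hypothetical helper).

    No "<name>.Lakehouse" segment is used, so the path does not depend on
    friendly-name support being enabled for the tenant.
    """
    clean = subpath.lstrip("/")  # tolerate a leading slash in the caller's subpath
    return f"abfss://{workspace_id}@{ONELAKE_HOST}/{lakehouse_id}/Files/{clean}"


# Placeholder identifiers, not real workspace or lakehouse GUIDs.
uri = files_path("ws-guid", "lh-guid", "bronze/_metadata/manifest.json")
print(uri)
```

Keeping the URI construction in one function is what lets the notebooks stop building paths inline.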
Tests (2026-05-29)¶
- test_fabric_platform.py updated for GUID-based API: test_storage_path_builds_onelake_uri rewritten for the GUID shape (no .Lakehouse suffix). New test_files_path_builds_onelake_uri covers the helper. FabricPlatform(lakehouse_id=...) constructor arg replaces lakehouse_name=... for path-shape tests.
- Full suite: 128 passed + 1 skipped (workspace-only).
Added (2026-05-29 — ADR-022 architecture pivot)¶
- ADR-022 (Independent per-platform implementations) — supersedes ADR-002 (LakehousePlatform ABC as universal contract), ADR-004 (pa.Table as cross-platform interchange), and ADR-020 (applyInPandas bridge — same-day supersession). Each platform tier now owns its complete Silver + Gold + validation stack written engine-native; cross-platform compat is by schema parity + lockstep CONTRACT_VERSION bumps, not code sharing.
fabric/transforms/— Spark-native Silver layer (10 builders + union BUNDLE_SCHEMA + registry). Parses bundles viafrom_jsonand projects to Silver via Spark DataFrame ops; no Python bridge.fabric/gold/— Spark-nativebuild_encounter_summary+corpus_manifest. Output schema matchescore.gold.encounter_summaryfield-for-field. Includes a UUIDv5 expression synthesized in Spark (SHA1 + RFC 4122 bit twiddling) sosummary_idstays deterministic across rebuilds.fabric/validation/— single.agg()per Silver table computes every metric in one pass; ingest_log schema matches core's..claude/rules/fabric-transforms.md— Fabric-tier transform rules.
Changed (2026-05-29)¶
- fabric/platform.py slimmed: dropped write_silver(pa.Table) / read_silver() → pa.Table / write_gold(pa.Table) convenience wrappers, dropped legacy _write_delta(pa.Table), dropped LakehousePlatform inheritance. Spark DataFrames are the only interchange type. Added read_bronze_bundles_spark() as the canonical Bronze entry point.
- core/platform/factory.py PLATFORMS dict drops fabric/databricks/aws/gcp; independent tiers don't dispatch through the local factory.
- All Fabric notebooks (00 + 02–10) rewritten: instantiate FabricPlatform() directly (no factory, no env var), import from fabric.transforms / fabric.gold / fabric.validation, no applyInPandas. Notebook 10 rewritten against the actual manifest keys (gold_table, silver_sources, row_count) and Gold schema names (soap_note_text).
- CLAUDE.md + .claude/rules/transforms.md + .claude/rules/notebooks.md updated for the independence model. ADR index README.md flags 002/004/020 as Superseded with links into docs/_archive/adr/. ADR-017 amended in place.
Removed (2026-05-29)¶
- fabric/spark_helpers.py (housed the applyInPandas bridge factory + pa→Spark schema converter; both dead under pure-Spark).
Tests (2026-05-29)¶
- fabric/tests/test_fabric_platform.py: dropped subclass + abstract-method contract tests; rewrote the workspace round-trip to use Spark DataFrames against fabric.transforms.registry. Added test_name_attribute.
- core/tests/test_platform_factory.py: added test_fabric_not_in_factory; updated unbuilt-platform test to use local_spark placeholder.
- Full suite: 128 passed + 1 skipped (workspace-only).
Session 5 — earlier phases (Fabric end-to-end + dedup fix + Power BI)¶
Plan: docs/roadmap/fabric-execution-plan.md. Phases 1–3 complete (pre-pivot).
Added¶
- ADR-019 (Silver MERGE idempotency): pre-merge target-side dedup guard in LocalLitePlatform._write_delta. Fixes the "matched a target row with multiple source rows" failure that occurred when re-merging into Silver tables written before dedup_by_key() was added to every build_silver_*. Helpers _duplicate_row_count + _dedup_target are pyarrow-only; the guard only triggers a rewrite when total ≠ distinct on the PK. Survivor semantics are "some-survivor-wins" (Delta doesn't preserve write order on read); the following MERGE writes the source's canonical value on top.
- Regression test test_merge_dedupes_target_with_legacy_duplicates in core/tests/test_local_lite.py: writes intentionally-duplicate target via raw write_deltalake, asserts subsequent write_silver MERGE succeeds with canonical source value winning.
- Real FabricPlatform implementation in fabric/platform.py (10 methods replacing the Session 4.5 NotImplementedError stubs): schema-enabled OneLake abfss URIs, pa.Table↔Spark round-trip via pandas, DeltaTable.merge() with matching ADR-019 dedup guard (Spark equivalent: dropDuplicates([pk])), CDC enabled on all writes, manifest via mssparkutils.fs.put. All Fabric-runtime imports (pyspark, notebookutils, delta.tables) are lazy inside method bodies, so the module imports cleanly outside Fabric and the offline contract tests run without Fabric.
- Real fabric/deploy/upload_wheel.py: MSAL Service Principal → Fabric REST v1 client. PUT /workspaces/{ws}/environments/{env}/staging/libraries, POST /publish, then poll until publish state is Success (600 s deadline).
- [fabric] install extra (msal>=1.28, requests>=2.31) in pyproject.toml.
- pytest.mark.fabric marker for behaviour tests that require a real workspace (registered in [tool.pytest.ini_options]); 1 such test gated on FABRIC_TENANT_ID env var.
- 2 new offline contract tests in fabric/tests/test_fabric_platform.py: test_storage_path_builds_onelake_uri + test_storage_path_rejects_bad_layer. 4 → 5 offline tests; obsolete test_methods_raise_not_implemented removed.
- docs/roadmap/fabric-execution-plan.md: 7-phase Session 5 execution plan, linked from docs/roadmap/MASTER_PLAN.md and CLAUDE.md key files.
- .env.example at repo root: canonical FABRIC_* env-var inventory with capture instructions and consumer list. .gitignore updated with !.env.example exception so the template tracks while .env stays out.
- boto3>=1.34 to fabric/environments/lakehouse_env.yml: anonymous-mode S3 client for the public Synthea Coherent bucket, supersedes the original S3-shortcut design (Fabric shortcuts require AWS credentials).
- fabric/notebooks/00_setup.ipynb: first Phase 4 notebook (4 verification gates: wheel imports, Spark + workspace ID, FabricPlatform URI, boto3 anonymous S3). Follows the 8-cell template with cells 4–8 adapted for setup verification (no Delta write, no silver.ingest_log row).
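The ADR-019 guard described in this list (rewrite the target only when total ≠ distinct on the PK) might be sketched as follows. The real helpers are described as pyarrow-only; this version uses plain lists of dicts so it runs with the standard library alone, and the function names and row shape are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def duplicate_row_count(rows: list[Row], pk: str) -> int:
    """Total rows minus distinct primary-key values; 0 means no rewrite is needed."""
    return len(rows) - len({row[pk] for row in rows})


def dedup_target(rows: list[Row], pk: str) -> list[Row]:
    """Keep one row per key. Which duplicate survives is arbitrary by design
    ("some-survivor-wins"); here the first one seen is kept."""
    if duplicate_row_count(rows, pk) == 0:
        return rows  # guard: leave a clean target untouched
    survivors: dict[Any, Row] = {}
    for row in rows:
        survivors.setdefault(row[pk], row)
    return list(survivors.values())


legacy_target = [
    {"patient_id": "p1", "city": "old-a"},
    {"patient_id": "p1", "city": "old-b"},
    {"patient_id": "p2", "city": "old-c"},
]
print(duplicate_row_count(legacy_target, "patient_id"))
print(dedup_target(legacy_target, "patient_id"))
```

The surviving value does not need to be the "right" one, because the MERGE that follows overwrites it with the source's canonical row.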
Changed¶
- fabric/docs/DEPLOYMENT.md: rewritten as a step-by-step operator runbook based on the live Phase 3 walkthrough. 6 numbered setup steps with verify lines, Path A (manual UI) and Path B (REST automation) wheel-upload paths, comprehensive Gotchas section ("Manage access" not in Settings, External repositories ≠ Built-in libraries, "+ Add library" stays clickable, SP secret shown only once, lakehouse must be schema-enabled, environment publish takes 2–5 min).
- .github/workflows/fabric-deploy.yml: installs [fabric] extra + fabric-cicd, runs the real upload_wheel.py, and invokes fabric-cicd smoke-run against notebook 05 (replacing the placeholder echo from Session 4.5).
- HANDOFF.md Open Decisions row "Silver parse-output deduplication" flipped to DONE (ADR-019).
Provisioned (Session 5 Phase 3)¶
- Fabric workspace scribe_iq_lakehouse_fabric (Central US)
- Schema-enabled lakehouse scribe_iq_lakehouse_fabric
- Environment scribe-iq-lakehouse-env (Runtime 1.3, Spark 3.5, Delta 3.2) with 4 PyPI deps + the scribe_iq_lakehouse-0.1.0 core wheel published.
- IDs live in local .env (gitignored) / GitHub fabric-prod secrets, never in committed files.
Tests¶
128 passed, 1 skipped (@pytest.mark.fabric without FABRIC_TENANT_ID). 122
core + 5 fabric offline + 1 fabric behaviour (skipped). All Phase 1–3 code
files ruff + black clean.
Session 4.5 — Multi-platform repo reorg (core/ + fabric/)¶
Added¶
- ADR-017 (multi-platform repo layout) and ADR-018 (CI/CD monorepo, core as wheel).
- docs/roadmap/multi-platform-reorg.md: full planning doc behind the reorg.
- Top-level fabric/ domain: platform.py stub, notebooks/, environments/lakehouse_env.yml, deploy/{upload_wheel.py,fabric_cicd_config.yml}, tests/test_fabric_platform.py, docs/{DEPLOYMENT.md,SCREENSHOTS.md}, scripts/capture_lineage.py. The stub raises NotImplementedError on every method so accidental Fabric dispatch fails loudly.
- .github/workflows/: core-pr-tests.yml, core-build.yml, fabric-deploy.yml (skeleton); databricks-deploy.yml.disabled and aws-deploy.yml.disabled as visible templates.
- One-way dependency rule (core/ never imports from any platform tier) enforced by CI grep.
Changed¶
- local/ → core/ (via git mv, history preserved). core/ now bundles the platform-agnostic kernel + core/platform/local_lite.py (LocalLite impl) + core/orchestration/dagster/ + core/surfaces/cli/pipeline.py + core/tests/ + core/scripts/ + core/docs/.
- Imports rewritten: from local.X → from core.X across all Python, docstrings, and top-level docs. Factory strings for fabric/databricks/aws/gcp now point outside core/ (e.g., "fabric.platform.FabricPlatform").
- pyproject.toml: package discovery ["core*", "fabric*"]; testpaths ["core/tests", "fabric/tests"]; [tool.dagster] module_name = "core.orchestration.dagster.definitions".
- CLAUDE.md, .claude/rules/transforms.md, .claude/rules/notebooks.md: paths and cross-domain-import rule updated.
- README: new "Repository layout" section with two-domain tree + "See also" link to the separate fabric-lakehouse-hls-readmission repo.
- core/scripts/gen_*.py: _REPO_ROOT climbs one extra level (parent.parent.parent) now that scripts live one directory deeper.
Tests¶
- 122 core tests still pass; 4 new fabric/tests/test_fabric_platform.py contract tests verify FabricPlatform subclasses LakehousePlatform, implements every abstract method, and that every method currently raises NotImplementedError. Total: 126 passing.
Session 4 (cont.) — Demoability polish: data shape visible, not just lineage¶
Added¶
- local/preview.py: new shared Markdown renderer module with schema_md, sample_md, bundle_resource_counts, bundle_summary_md, gold_encounter_card. Pure-Python, framework-agnostic; used by both the Dagster asset metadata and the CLI walkthrough.
- core/scripts/demo_walkthrough.py (~280 LOC): one-patient end-to-end medallion tour using the rich library. Auto-picks an anchor patient with ≥3 conditions, ≥3 meds, ≥1 SOAP note (or --patient-id <uuid>), then renders Bronze (FHIR resource counts + sample Patient JSON) → Parse (records dict) → Silver (patient row + 3-5 encounters / observations / conditions / meds + reference schema) → Gold (full SOAP note card with active conditions/medications/vitals/imaging). --pause N for screencast pacing.
- docs/demo/notebooks/demo_notebook.sql: 20-cell DuckDB UI source over the Delta tables: corpus headlines · schema · top conditions/medications · demographics · encounter mix · longitudinal span · co-morbidity buckets · vitals percentiles · imaging modalities · one-patient timeline · as-of-date condition evolution · full SOAP note · keyword cohort search · coverage stats · lineage (silver_versions struct) · cross-layer condition join · final corpus shape.
- docs/demo/notebooks/README.md + docs/demo/notebooks/demo.duckdb (gitignored binary, pre-built locally): how to open / regenerate.
- docs/demo/PLAYBOOK.md: portfolio-video recording playbook with a 5-beat structure (hook / raw mess / medallion / transformation / payoff), preflight + window setup, 6-take shot list (incl. optional live sensor demo), edit guidance, publishing checklist, contingencies.
- pyproject.toml [dev] extra: added rich>=13.0 for the walkthrough.
Changed¶
- core/validation/validate.py: added @dataclass CheckOutcome(name, passed, detail) and ValidationResult.checks: list[CheckOutcome] + an ok() method. Every rule now records its outcome (passing + failing alike), with the actual numbers in the detail string (e.g. unique:patient_id → "1,278/1,278 distinct"). failed_checks is kept unchanged for silver.ingest_log and CLI pipeline compatibility.
- core/orchestration/dagster/checks.py: AssetCheckResult.metadata now includes rules_total, rules_passed, rules_failed, and a rules Markdown table (MetadataValue.md) so clicking a check in the UI shows the full rule-by-rule breakdown, not just a green dot.
- core/orchestration/dagster/assets.py: each MaterializeResult.metadata now carries rendered data shape via local.preview:
  - bronze_fhir → first bundle's FHIR resource-type breakdown table
  - 10 Silver outputs → schema table + first-5-row Markdown table per partition
  - gold_encounter_summary → full schema + a sample-encounter card with the SOAP note rendered as readable text
- core/platform/local_lite.py: LocalLitePlatform.__init__ now anchors relative storage roots to the repo (via Path(__file__).resolve().parents[2]), not Path.cwd(). Fixes a bug where Dagster sensor-triggered runs (spawned from the daemon's CWD) could not find data/bronze. Absolute env-var values pass through unchanged.
- core/orchestration/dagster/assets.py + core/orchestration/dagster/sensors.py: both now derive bronze_root from platform.create().root (not the module-level relative DEFAULT_BRONZE). Sensor now takes the platform resource for consistency.
- core/orchestration/dagster/sensors.py: target widened from bronze_fhir only to bronze_fhir + all 10 Silver asset keys, so each cohort drop materializes Bronze and Silver end-to-end in one Dagster run. Gold stays manual (unpartitioned aggregate). 6th wiring test in tests/test_dagster_defs.py pins the selection (SENSOR_TARGET_KEYS).
- .gitignore: added *.duckdb family (DuckDB UI notebooks are binary, machine-specific).
- README.md: Demo section, Operations row, layout entries pointing at the demo walkthrough and DuckDB notebook.
- docs/ARCHITECTURE.md: module map extended (local/preview.py, demo paragraph noting the three demo surfaces share local/preview.py for one set of renderers).
- docs/RUNBOOK.md: §5 adds the DuckDB UI command; §6 explains "what clicking an asset shows" + rule-by-rule check detail; links to PLAYBOOK.
- docs/BENCHMARKS.md: Execution-surfaces row for Dagster extended (metadata richness); new "Demo / read-only query surface" subsection for walkthrough + notebook + playbook.
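A minimal sketch of the per-rule outcome record described for core/validation/validate.py. The three CheckOutcome fields come from the entry above; the ok() semantics (true when every check passed) and the check_unique helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CheckOutcome:
    name: str
    passed: bool
    detail: str


@dataclass
class ValidationResult:
    checks: list[CheckOutcome] = field(default_factory=list)

    def ok(self) -> bool:
        # Assumed semantics: the table is valid only if every rule passed.
        return all(check.passed for check in self.checks)


def check_unique(name: str, values: list[str]) -> CheckOutcome:
    """Hypothetical rule: record the real numbers whether it passes or fails."""
    distinct, total = len(set(values)), len(values)
    return CheckOutcome(name, distinct == total, f"{distinct:,}/{total:,} distinct")


patient_ids = [f"patient-{i}" for i in range(1278)]
result = ValidationResult(checks=[check_unique("unique:patient_id", patient_ids)])
print(result.ok(), result.checks[0].detail)
```

Recording passing rules as well as failing ones is what allows a UI to show a rule-by-rule table instead of a single pass/fail flag.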
Contract impact¶
- None: orchestration layer + presentation layer only. gold.encounter_summary stays v1.1.0; silver.* schemas unchanged.
Session 4 — Dagster local orchestration (ADR-015, ADR-016)¶
Added¶
- core/orchestration/dagster/: new top-level package modelling the medallion as a software-defined Dagster asset graph. Third execution surface alongside the core.surfaces.cli.pipeline CLI and the (upcoming) Fabric notebooks; reuses the pure transforms verbatim, with zero duplicate logic.
  - assets.py: bronze_fhir (cohort-partitioned inventory) → silver_tables @multi_asset (parse-once → 10 distinct Silver asset nodes, MERGE-upsert via platform.write_silver) → gold_encounter_summary (unpartitioned aggregate that also co-writes the corpus manifest via platform.write_gold_manifest).
  - checks.py: factory-built @asset_check per Silver table wrapping validate_table(); stays in lockstep with SILVER_TABLES.
  - partitions.py: cohort DynamicPartitionsDefinition; per-cohort materialization + backfill replaces the rm -rf full rebuild.
  - resources.py: PlatformResource(ConfigurableResource) delegates to get_platform(LAKEHOUSE_PLATFORM), so Dagster runs honour the platform env var the way the CLI does. Single persistence authority (ADR-002/009/016); no IOManager.
  - sensors.py: bronze_cohort_sensor (default STOPPED), the Dagster analogue of the Auto Loader streaming-sim (spec §5.2); diffs cohort_labels() against the dynamic partition set and emits RunRequests + build_add_request in one tick. Target is bronze_fhir + the 10 Silver asset keys (sourced from SILVER_TABLES via SENSOR_TARGET_KEYS), so each new cohort materializes Bronze and all 10 Silver tables in a single Dagster run (a demoable end-to-end cohort flow). Gold stays out of the sensor target (unpartitioned aggregate; rebuilt manually).
  - definitions.py: thin top-level Definitions(...).
- tests/test_dagster_defs.py (6 tests): Definitions load + every asset/check/sensor present, Silver assets carry the bronze_fhir lineage edge, sensor target = Bronze + 10 Silver and excludes Gold (asserted both against SENSOR_TARGET_KEYS and Dagster's resolved selection), Bronze+Silver materialize writes all 10 Silver Delta tables on the fixture, Bronze metadata records file count, Gold materialize writes the table + the corpus manifest. Guarded by pytest.importorskip("dagster") so [dev]-only installs collect cleanly. 122 tests.
- ADR-015: adopt Dagster for local orchestration; sequence Dagster → Fabric so the asset graph documents the DAG the Fabric notebooks mirror. Local-only; Fabric still orchestrates via Data Factory.
- ADR-016: medallion = software-defined asset graph; @multi_asset for Silver (parse-once → 10-node render); platform-persisted (not IOManager) preserves single authority; Bronze + Silver cohort-partitioned, Gold unpartitioned.
- [tool.dagster] module_name=orchestration.definitions in pyproject.toml: dagster dev loads the medallion graph; UI on http://localhost:3000.
Changed¶
- pyproject.toml: new optional [orchestration] extra (dagster>=1.8,<2.0, dagster-webserver>=1.8,<2.0); orchestration* added to setuptools.packages.find. Kept out of [dev] to keep CI minimal; install with pip install -e ".[local,dev,orchestration]".
- .claude/rules/transforms.md: banned dagster / orchestration imports from core/transforms/ (orchestration imports transforms, never the reverse; mirrors the Spark/notebook isolation rule).
- .gitignore: dagster_home/, .tmp_dagster_home*/, .dagster/ so dagster dev local state never lands in git.
- docs/roadmap/scribe-iq-lakehouse-spec.md §9: renumbered to Session 4 = Dagster, Session 5 = Fabric, Session 6 = CI/docs.
- docs/roadmap/MASTER_PLAN.md: post-weekend update note explaining the Dagster→Fabric sequence and the rationale (permanent artifact independent of the expiring Fabric trial).
- docs/adr/README.md: ADR 015/016 added to index.
Contract impact¶
- None: orchestration layer only; gold.encounter_summary stays v1.1.0.
Session 3 — Documentation refresh (README + Runbook)¶
Added¶
- docs/RUNBOOK.md: operational runbook covering prerequisites/config, first full run, ingest (FHIR + DICOM/CSV), build procedures (full / gold-only / single-cohort / clean rebuild), build verification (delta-rs + DuckDB snippets), doc regeneration, and a troubleshooting table (the clean-slate MERGE gotcha, missing cohorts, DICOM placeholders, etc.).
Changed¶
- README.md: rewrote the Session-1 stub into a full overview with accurate counts (1,278 patients, 116 tests), correct install (pip install -e ".[local,dev]"), an Operations command table, data-products/contract section, current layout, and a documentation map.
- docs/ARCHITECTURE.md, docs/BENCHMARKS.md: corrected stale figures (corpus contract v1.0.0 → v1.1.0; Gold build ~5s → ~6.5s) and linked the runbook.
Session 3 — Problem-list-as-of-date corpus enrichment (ADR-014, contract v1.1.0)¶
Changed¶
- gold.encounter_summary active_conditions / active_medications now reflect the patient's clinical state as of each encounter date, not just what was recorded at that encounter (ADR-014). Conditions: onset ≤ date AND (abatement null OR abatement > date); chronic conditions carry forward, resolved ones drop off. Medications: status=active and authored ≤ date. Same array[string] schema, changed semantics → contract v1.1.0 (MINOR).
- silver.condition: added abatement_date (from Condition.abatementDateTime), an additive column; fhir_parser.extract_condition now emits it.
- core/gold/encounter_summary.py: _conditions / _medications → _active_conditions / _active_medications patient-level as-of-date joins (meds pre-aggregated to earliest start).
- Regenerated docs/DATA_DICTIONARY.md (condition column) + schemas/gold_encounter_summary.json (x-contract-version 1.1.0).
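The as-of-date gates stated above translate directly into two predicates. This is a row-level sketch only; the function names are hypothetical, and the actual implementation is described as patient-level joins rather than per-row Python calls.

```python
from datetime import date
from typing import Optional


def condition_active_as_of(onset: date, abatement: Optional[date], as_of: date) -> bool:
    """onset <= date AND (abatement null OR abatement > date)."""
    return onset <= as_of and (abatement is None or abatement > as_of)


def medication_active_as_of(status: str, authored: date, as_of: date) -> bool:
    """status=active and authored <= date (no stop date is available to gate on)."""
    return status == "active" and authored <= as_of


encounter = date(2020, 3, 1)
# Chronic condition with no abatement carries forward to later encounters.
print(condition_active_as_of(date(2015, 1, 10), None, encounter))
# Condition resolved before the encounter drops off the problem list.
print(condition_active_as_of(date(2019, 1, 1), date(2019, 6, 1), encounter))
```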
Impact (full run)¶
- avg conditions/encounter 0.08 → 9.57; avg medications/encounter 0.05 → 1.66; encounters with an empty problem list dropped to 0.9%. Problem lists are clinically coherent and temporally gated; no duplicates. DICOM enrichment intact (298 studies). Gold build ~6.5s.
Limitation¶
- FHIR has no medication stop date, so active_medications is a forward status=active approximation (a med stopped after a past encounter still won't appear on it). Conditions are temporally precise. Documented in CORPUS_CONTRACT (ADR-014).
Tests¶
- New as-of-date unit test (onset gate, abatement exclusion, med start gate) + fixture carry-forward test; condition schema test covers abatement_date. 116 tests pass.
Session 3 — DICOM ingest + imaging header extraction (ADR-013)¶
Added¶
- core/ingest/dicom_index.py: DicomIndex maps DICOM StudyInstanceUID → local .dcm path (the FHIR↔DICOM join key) and serves bytes; study_uid_from_filename() parses the Coherent file-name convention. File names embed patient names → never logged raw (ADR-010).
- core/ingest/download.py: download_assets() + CLI --with-dicom / --with-csv / --assets-only sync the DICOM (~9.3 GiB, 298 files) and CSV (~466 MB) prefixes into Bronze, writing _metadata/assets_manifest.json. CSV is landed for reference; not otherwise processed.
- tests/test_dicom_extraction.py: 11 tests (synthetic in-memory DICOM, no committed binary) covering UID linkage, placeholder→null, DA-date formatting, FHIR-authoritative modality, DicomIndex, the parse_bundle resolver path, bad-bytes resilience. 114 tests total.
- ADR-013: DICOM ingest, FHIR↔DICOM linkage by StudyInstanceUID, header extraction semantics.
Changed¶
- fhir_parser.py: parse_bundle(bundle, dicom_resolver=...) injects DICOM bytes via a callback (parser stays pure; I/O lives in the ingest layer); imaging_study_uid() helper; _extract_dicom_headers normalizes Coherent placeholder tokens (UNKNOWN…) → null and DICOM DA dates → ISO; FHIR stays authoritative for modality; dicom_binary_id = StudyInstanceUID (never the patient-named file); a malformed file is caught per-study (dicom_extracted=False).
- core/surfaces/cli/pipeline.py: builds a DicomIndex once and threads the resolver through _parse_cohort.
- tests/fixtures/sample_bundle.json: ImagingStudy now carries a real urn:oid: identifier.
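The two header normalizations named above (placeholder → null, DICOM DA date → ISO) could look roughly like this. The DA value representation is an 8-character YYYYMMDD string; the helper names are hypothetical, and the placeholder set contains only UNKNOWN because the full token list is not given in this changelog.

```python
from typing import Optional

# Assumption: only the token named in the changelog; the real list may be longer.
PLACEHOLDER_TOKENS = {"UNKNOWN"}


def normalize_tag(value: Optional[str]) -> Optional[str]:
    """Map empty or placeholder header values to None, otherwise return trimmed text."""
    if value is None:
        return None
    text = value.strip()
    if not text or text.upper() in PLACEHOLDER_TOKENS:
        return None
    return text


def da_to_iso(value: Optional[str]) -> Optional[str]:
    """Convert a DICOM DA value (YYYYMMDD) to ISO YYYY-MM-DD; None if unusable."""
    text = normalize_tag(value)
    if text is None or len(text) != 8 or not text.isdigit():
        return None
    return f"{text[:4]}-{text[4:6]}-{text[6:]}"


print(da_to_iso("20140312"))
print(normalize_tag("UNKNOWN"))
```

Returning None for anything malformed keeps a bad header from failing the whole study.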
Full-run result¶
- 298 of 3,752 silver.imaging_study rows enriched with DICOM rows/columns/slice_thickness_mm/study_date; Gold imaging struct surfaces study_date + dicom_binary_id for those encounters. Descriptive tags are placeholder UNKNOWN → null (honest limitation, documented in CORPUS_CONTRACT). Clean full rebuild: Silver 2m19s + Gold ~5s.
Note¶
- delta-rs MERGE errors on a whole-table re-update (every source row matches); full re-runs build from a clean slate (rm -rf data/silver data/gold). MERGE upsert remains for incremental per-cohort landing. Recorded as a pipeline operational note.
Session 3 — Gold layer + corpus contract (ADR-012)¶
Added¶
- core/gold/encounter_summary.py: pure transform denormalizing all 10 Silver tables → gold.encounter_summary (one row per encounter). Polars join/aggregation engine; output assembled against an explicit GOLD_SCHEMA (nested struct vitals/imaging/versions + array conditions/meds/labs). Deterministic summary_id (UUIDv5 of encounter_id); BP parsed from Silver components_json; anniversary-based age-at-encounter. Defines the corpus contract (CONTRACT_VERSION, REQUIRED_FIELDS, OPTIONAL_FIELDS).
- core/gold/corpus_manifest.py: lineage manifest with contract version, per-Silver row counts + Delta versions, platform, and corpus coverage stats.
- core/scripts/gen_corpus_schema.py + schemas/gold_encounter_summary.json: machine-readable JSON Schema (Draft 2020-12) generated from GOLD_SCHEMA (--check for CI); never hand-edited.
- docs/CORPUS_CONTRACT.md: human contract covering required/optional guarantees, real corpus coverage, honest limitations (encounter-grain sparsity, ECG=0, synthetic genomics), semver versioning policy.
- tests/test_gold_encounter_summary.py: 17 tests covering schema/grain, age, vitals (BP from components), labs, null-safe optional context, idempotent summary_id, manifest stats, contract field-list coverage, JSON Schema currency, and per-row validation against the published JSON Schema (jsonschema). 103 tests total.
- ADR-012: Gold engine (Polars pure transform), grain, silver_versions lineage, contract integrity.
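"Anniversary-based age-at-encounter" can be read as: count whole years, subtracting one if the birthday has not yet occurred in the encounter year. The sketch below shows that reading in plain Python; the function name is hypothetical and the real transform is described as running on Polars.

```python
from datetime import date


def age_at_encounter(birth_date: date, encounter_date: date) -> int:
    """Whole years completed on the encounter date (anniversary-based)."""
    had_birthday = (encounter_date.month, encounter_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return encounter_date.year - birth_date.year - (0 if had_birthday else 1)


birth = date(1980, 6, 15)
print(age_at_encounter(birth, date(2020, 6, 14)))  # day before the birthday
print(age_at_encounter(birth, date(2020, 6, 15)))  # on the birthday
```

Comparing (month, day) tuples avoids the drift that comes from dividing a day count by 365.25.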
Changed¶
- core/surfaces/cli/pipeline.py: added build_gold() + CLI flags --with-gold / --gold-only.
- Platform interface: table_version(layer, table) (delta-rs version() on local_lite, None default on base) and write_gold_manifest(); local_lite also gained read_gold().
- pyproject.toml: jsonschema>=4.0 added to [dev] for corpus-contract validation.
Enforcement¶
- .pre-commit-config.yaml: read-only corpus-schema-current hook (gen_corpus_schema.py --check); /session-end doc-sync now regenerates the corpus schema.
Full-run result¶
- 1,278 patients → 143,946 gold.encounter_summary rows in ~5s on a single laptop, nested Delta types + CDC verified; manifest written to gold/_metadata/corpus_manifest.json.
Documentation — generated-first (ADR-011)¶
Added¶
- core/scripts/gen_data_dictionary.py: renders docs/DATA_DICTIONARY.md from the registry schemas + validation rules (--check mode for CI); never hand-edited.
- docs/DATA_DICTIONARY.md: generated; all 10 Silver tables + ingest_log.
- docs/ARCHITECTURE.md: as-built view (Mermaid diagram + done-vs-planned status table), distinct from the spec's intent.
- docs/BENCHMARKS.md: real Session 2 run metrics (1,280 bundles → Silver in 2m30s, per-table row counts) + engine comparison matrix.
- tests/test_docs_generated.py: doc-as-test; fails if DATA_DICTIONARY is stale (86 total).
- ADR-011: Generated-first documentation.
Enforcement¶
- /session-end command + CLAUDE.md protocol: added a "Sync the docs" step (regenerate DATA_DICTIONARY; update ARCHITECTURE/BENCHMARKS/CORPUS_CONTRACT by judgment; never bulldoze hand-written docs).
- .pre-commit-config.yaml: local data-dictionary-current hook runs gen_data_dictionary.py --check, which is read-only, fails the commit on drift, and never writes.
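The read-only --check behaviour described here (fail on drift, never write) follows a common pattern for generated files. A sketch under stated assumptions: the run function, its parameters, and the exit-code convention are hypothetical and do not reproduce gen_data_dictionary.py.

```python
from pathlib import Path


def is_current(path: Path, rendered: str) -> bool:
    """True when the file on disk already matches the freshly rendered text."""
    return path.exists() and path.read_text(encoding="utf-8") == rendered


def run(path: Path, rendered: str, check: bool) -> int:
    """Return a process exit code: 0 on success, 1 when check mode detects drift."""
    if check:
        # Check mode only compares; it never touches the file.
        return 0 if is_current(path, rendered) else 1
    path.write_text(rendered, encoding="utf-8")
    return 0
```

A pre-commit hook or CI gate would call the check path and treat a non-zero exit code as "regenerate the document before committing".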
Deferred¶
- docs/CORPUS_CONTRACT.md + its schema-conformance test → built with the Gold layer (a contract test is only meaningful once gold.encounter_summary exists).
Post-Session-2 hardening¶
Security¶
- core/redaction.py: redact() → non-reversible ref:<hash> for identifier-bearing values. Applied to "skipping unreadable bundle" warnings in pipeline.py and local_lite.py, which previously logged Synthea filenames embedding patient name + UUID (ADR-010). 4 redaction tests added (83 total).
- fhir_parser.py: per-bundle DEBUG summary logs counts only (no identifiers) + explicit logging-policy note in the module docstring.
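One way to produce a non-reversible ref:<hash> token for log lines is a truncated cryptographic digest. The changelog does not state the hash function, truncation length, or whether a salt is used; SHA-256, 12 hex characters, and no salt are assumptions in this sketch, and the filename is made up.

```python
import hashlib


def redact(value: str) -> str:
    """Replace an identifier-bearing value with a stable, non-reversible reference."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]  # assumed length
    return f"ref:{digest}"


# Hypothetical filename of the kind that embeds a patient name and UUID.
filename = "Jane_Doe_0a1b2c3d-0000-0000-0000-000000000000.json"
print(f"skipping unreadable bundle {redact(filename)}")
```

The same input always maps to the same token, so repeated warnings about one file can still be correlated without exposing the name.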
Changed¶
- Split Claude Code settings: tracked .claude/settings.json trimmed to curated allow globs + deny + hooks (hook command now uses $CLAUDE_PROJECT_DIR, portable); personal/auto-approved permissions moved to gitignored .claude/settings.local.json.
Session 2 — Local Bronze + Silver pipeline (full dataset)¶
Added¶
- core/ingest/download.py: parallel aws s3 sync (no-sign-request) + round-robin cohort partitioning (A/B/C) + ingest manifest
- core/platform/local_lite.py: LocalLitePlatform (Polars + delta-rs) with Delta write/read, CDC enabled on create, MERGE-upsert on primary key
- core/transforms/schema_utils.py: field-type-driven Arrow coercion (UTC timestamps, date32, string codes) + dedup
- core/transforms/silver_{patient,encounter,clinical,soap_notes,ecg,imaging,genomics}.py and registry.py: all 10 Silver tables with explicit Arrow schemas (ADR-004)
- core/validation/{schema_registry,validate}.py: per-table quality checks → silver.ingest_log
- core/ingest/{bronze_landing,streaming_sim}.py: cohort inventory + Auto Loader replay sim
- core/surfaces/cli/pipeline.py: per-cohort micro-batch Bronze→Silver orchestration
- 36 new tests (schema_utils, silver transforms, local_lite Delta round-trip, validation); 79 total, all passing
- ADR-009: Local Silver materialization (delta-rs, type coercion, component JSON)
- venv + full [local,dev] extras (polars, deltalake, duckdb, watchdog, pydicom)
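Round-robin cohort partitioning, as mentioned for the download step, deals files out to cohorts A/B/C in turn. A minimal sketch; sorting the file names first (to make the assignment reproducible) and the function name are assumptions, not details taken from download.py.

```python
def assign_cohorts(
    files: list[str], cohorts: tuple[str, ...] = ("A", "B", "C")
) -> dict[str, list[str]]:
    """Deal files to cohorts in turn so cohort sizes differ by at most one."""
    assignment: dict[str, list[str]] = {cohort: [] for cohort in cohorts}
    for index, name in enumerate(sorted(files)):  # sorted: assumed, for determinism
        assignment[cohorts[index % len(cohorts)]].append(name)
    return assignment


# Placeholder bundle names for illustration.
bundles = [f"bundle_{i:02d}.json" for i in range(7)]
for cohort, members in assign_cohorts(bundles).items():
    print(cohort, len(members))
```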
Results¶
- Full run: 1,280 files (1,278 patients + organizations.json + practitioners.json) → all 10 Silver Delta tables in 2m30s on a single laptop, all validations passed. Row counts: encounter 143,946 · observation 669,898 · medication_request 209,401 · procedure 56,092 · soap_note 143,946 · condition 15,956 · imaging_study 3,752 · genomic_report 419 · patient 1,278 · ecg_metadata 0 (ECG is Binary waveform, not FHIR).
- CDC (delta.enableChangeDataFeed) enabled on every Silver table.
Session 1 — Repo scaffold + FHIR parser¶
Added¶
- Repo scaffold per spec §4: pyproject.toml, requirements.txt, local/ package tree (platform, transforms, ingest, gold, validation), tests/, README stub
- core/platform/base.py: LakehousePlatform abstract interface (ADR-002)
- core/platform/factory.py: LAKEHOUSE_PLATFORM env-var router (default local_lite)
- core/transforms/fhir_parser.py: FHIRBundleParser with extract_patient, encounter, condition, observation (scalar + component), medication_request, procedure, soap_note (Base64 decode + S/O/A/P section detection), ecg_metadata, imaging_study (FHIR + DICOM passes), genomic_report; strip_reference handles urn:uuid: / Type/id forms
- tests/fixtures/sample_bundle.json: synthetic 17-resource bundle covering every type
- tests/test_fhir_parser.py, tests/test_silver_soap_notes.py, tests/test_platform_factory.py, tests/conftest.py: 43 tests, all passing
- ADR-008: Dict-based FHIR parsing (not fhir.resources models)
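The two reference forms that strip_reference is said to handle (urn:uuid:<id> and Type/id) can be reduced to a bare id as sketched below. Only the function name comes from the entry above; the signature, the None handling, and the fallback for other inputs are assumptions.

```python
from typing import Optional


def strip_reference(reference: Optional[str]) -> Optional[str]:
    """Return the bare resource id from a FHIR reference string."""
    if not reference:
        return None
    if reference.startswith("urn:uuid:"):
        return reference[len("urn:uuid:"):]
    # "Patient/123" -> "123"; a string with no slash is returned unchanged.
    return reference.rsplit("/", 1)[-1]


print(strip_reference("urn:uuid:0a1b2c3d-1111-2222-3333-444455556666"))
print(strip_reference("Patient/123"))
```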
Changed¶
- end-of-file-fixer normalized trailing newlines across .claude/ files
Notes¶
- Parser validated against a real Coherent bundle (in gitignored data/); SOAP notes use Markdown clinical headers, not literal SOAP markers, and lack an Objective section, so has_objective is honestly False for most Coherent notes (see ADR-008).
- pre-commit auto-install conflicts with Claude Code global core.hooksPath; use pre-commit run --all-files manually or rely on CI. See HANDOFF.md caveat.
[0.1.0] — 2026-05-27¶
Added¶
- Claude Code configuration: CLAUDE.md, HANDOFF.md, session protocol
- Global ~/.claude/CLAUDE.md with developer context and security rules
- ~/claude-os/ skill library: 7 skills, templates, init.sh
- .claude/skills/: healthcare-data, delta-patterns, mlops
- .claude/commands/: session-end, new-transform, new-adr, session-start
- .claude/rules/: transforms.md, notebooks.md (path-scoped context)
- .claude/hooks/scan-secrets.sh: pre-write secret detection
- docs/adr/README.md: ADR index
- ADR-001: Fabric-first development approach
- ADR-002: Platform abstraction layer design
- ADR-003: Polars + DuckDB for local lite tier
- ADR-004: Arrow as transform interchange format
- ADR-005: FHIR Binary Base64 decode for SOAP notes
- ADR-006: DICOM stop_before_pixels metadata extraction
- ADR-007: Genomic data_limitation as first-class column
- .pre-commit-config.yaml: detect-secrets, gitleaks, bandit, semgrep
- .semgrep/healthcare.yml: OneLake path and PHI logging rules