ADR-016: Medallion as a Dagster software-defined asset graph
- Date: 2026-05-27
- Status: Accepted
- Extends: ADR-015 (the modelling decision under the adopted orchestrator); honours ADR-002 (platform isolation) and ADR-004 (Arrow interchange)
- Contract impact: none
- Deciders: Sandeep Jayaprakash
Context
ADR-015 adopts Dagster. Dagster offers two modelling styles — imperative ops/jobs or
declarative software-defined assets — and a default IO-manager persistence model that
conflicts with this repo's single persistence authority (the LakehousePlatform over delta-rs,
ADR-002/009). The model must (a) preserve the pure-transform + platform-abstraction invariants,
(b) not double-parse Bronze (the current pipeline parses each cohort once and builds all 10
Silver tables from that parse), and (c) yield the lineage graph that is the reason we chose
Dagster over Prefect.
Decision
Model the medallion as software-defined assets, partitioned by cohort, where assets call the platform for persistence and return only materialization metadata. No Dagster IO manager owns the Delta bytes:
- Bronze landed data as source/observable assets, cohort-partitioned.
- A single `@multi_asset` parses each cohort once and emits all 10 Silver tables as distinct asset outputs (parse-once preserved; Dagster still renders 10 nodes in the graph). Each Silver table carries an `@asset_check` wrapping `validate_table()`.
- `gold_encounter_summary` depends on the 10 Silver assets; `corpus_manifest` depends on Gold (both unpartitioned, because Gold aggregates across all cohorts).
- The platform is a Dagster resource reading `LAKEHOUSE_PLATFORM`; assets call `platform.write_silver` / `write_gold` exactly as the notebooks and CLI do.
- A cohort-watching `@sensor` is the Dagster analogue of the Auto Loader streaming-sim (spec §5.2).
Alternatives considered
| Option | Pros | Cons | Why rejected |
|---|---|---|---|
| Assets + platform-persisted (chosen) | Best lineage; preserves single persistence authority; parse-once | Assets return metadata not data; slightly non-idiomatic (no IO manager) | — |
| Ops / jobs (imperative) | Closest to current pipeline.py | No asset lineage; forfeits the reason we picked Dagster | Defeats the purpose |
| Custom IOManager wrapping the platform | Most "Dagster-native" | Couples Dagster to the platform; two persistence paths to reason about | Violates single-authority simplicity (ADR-002/009) |
| Per-table Silver assets, each re-parsing the cohort | Cleanest graph | Re-parses bundles ~10× — wasteful | multi_asset gives the same graph, parse-once |
| Materialize the parsed-records dict as an intermediate asset | Clean lineage, parse-once | Large dict serialized through an IO manager every cohort | Unnecessary memory/IO vs multi_asset |
Consequences
Positive:
- Cohort-partitioned assets give incremental per-cohort materialization + backfill (the
working MERGE path) — ends the rm -rf full rebuild from ADR-015's context.
- Validation is visible as asset checks, not just rows in ingest_log.
- Transforms stay 100% Dagster-unaware (ADR-002 intact) — the asset functions are thin
adapters over _parse_cohort, SILVER_TABLES[*].build, and build_gold internals.
- The asset graph is a literal medallion diagram for reviewers.
Negative:
- The "assets return metadata, the platform persists" pattern is slightly non-idiomatic Dagster
and needs a one-line explanation for reviewers.
- multi_asset couples the 10 Silver outputs into one parse step — acceptable, and it matches
what pipeline.py already does.
Neutral:
- Gold stays unpartitioned and full-rebuild (~6.5 s).
- silver.ingest_log is still written (for Fabric parity) even though checks also surface in the UI.
Implementation notes
- `orchestration/partitions.py`: cohort partitions (`Static` or `Dynamic` from `cohort_labels()`).
- `orchestration/assets.py`: Silver `@multi_asset` (reuse `_parse_cohort` + `SILVER_TABLES[*].build`, then `platform.write_silver`); `gold` + `corpus_manifest` assets (reuse `build_gold` internals).
- `orchestration/checks.py`: `@asset_check` per Silver table wrapping `validate_table()`.
- `orchestration/resources.py`: platform resource from `get_platform()`.
- `orchestration/sensors.py`: cohort sensor whose target is `bronze_fhir` plus the 10 Silver asset keys (sourced from `SILVER_TABLES` so the selection cannot drift): each new cohort triggers one Dagster run that materializes Bronze + Silver for that partition. Gold is intentionally not in the sensor target; it is unpartitioned and rebuilt manually (or by a downstream schedule) once the cohorts of interest are present.
- `orchestration/definitions.py`: `Definitions(assets, asset_checks, resources, sensors, schedules)`.
- `tests/test_dagster_defs.py`: `Definitions` loads; `materialize()` on the fixture writes all Silver + Gold; asset checks pass.