Skip to content

Multi-Platform Engine Parity (ADR-022)

The lakehouse is implemented twice — once for a laptop, once for Microsoft Fabric — as two independent, engine-native implementations that emit the same Gold contract. They share no transform code. Compatibility is guaranteed by schema parity + a lockstep CONTRACT_VERSION, enforced by tests (ADR-022).

This is the senior-architect decision of the project: an earlier single-abstraction design was deliberately replaced once it started fighting Spark's native parsing — see Design Notes §5.

The two tiers converge on one contract

flowchart TB
    subgraph CORE["core/ — LocalLite tier (laptop, $0)"]
        direction TB
        C1["Polars + delta-rs + DuckDB"]
        C2["own transforms (core/transforms/silver_*.py)"]
        C1 --> C2
    end
    subgraph FAB["fabric/ — Fabric tier (Spark / OneLake)"]
        direction TB
        F1["Spark + Delta + OneLake"]
        F2["own transforms (fabric/transforms/silver_*.py)"]
        F1 --> F2
    end
    CONTRACT{{"Gold contract<br/>schema parity + lockstep CONTRACT_VERSION<br/>(compatibility, NOT shared code)"}}
    C2 --> CONTRACT
    F2 --> CONTRACT
    NOTE["Rejected: one shared transform layer.<br/>Lowest-common-denominator + applyInPandas bridge tax → ADR-022"]
    NOTE -.-> CONTRACT
    classDef contract fill:#eef2ff,stroke:#6366f1,font-weight:bold;
    class CONTRACT contract

LocalLite tier (core/)

Pure-Python transforms (dict-parse → explicitly-typed pa.Table) over Polars + delta-rs, run by two local surfaces — a dependency-light CLI and a Dagster asset graph — plus a read-only DuckDB UI.

flowchart TD
    S3["AWS Open Data S3 — coherent/ (anonymous)"]
    subgraph CORE["LocalLite tier · core/ — Polars · delta-rs · DuckDB"]
        ING["core.ingest.download — aws s3 sync → cohort partition"]
        BR["Bronze: data/bronze/fhir/cohort=A|B|C/*.json"]
        TF["core.transforms.* (pure) — dict-parse → pa.Table"]
        SV["Silver: 10 Delta (delta-rs) · CDC · per-cohort MERGE · validate→ingest_log"]
        GD["Gold: encounter_summary (Polars denormalize) · as-of-date problem list · manifest"]
        ING --> BR --> TF --> SV --> GD
    end
    CLI["CLI: core.surfaces.cli.pipeline"] -. drives .-> TF
    DAG["Dagster asset graph · cohorts=partitions · validate=asset checks"] -. drives .-> TF
    DUCK["DuckDB UI (read-only, 20-cell SQL)"] -. reads .-> SV
    S3 --> ING
    GD ==>|contract v1.1.0| C["downstream consumers"]
    classDef l fill:#eef2ff,stroke:#6366f1;
    class CORE l

Fabric tier (fabric/)

Spark-native transforms (from_json(value, BUNDLE_SCHEMA) — distributed, no Python bridge) over OneLake Delta, run by notebooks 00–10 (chained into a Data Factory pipeline), with the core wheel installed via a Fabric Environment and Power BI Direct Lake reading Gold.

flowchart TD
    S3["AWS Open Data S3 (anonymous)"]
    subgraph FAB["Fabric tier · fabric/ — Spark · OneLake Delta"]
        NB1["01_bronze_ingest — anonymous S3 → OneLake"]
        BR["Bronze: Files/bronze/fhir/cohort=* (OneLake)"]
        PARSE["Spark from_json(value, BUNDLE_SCHEMA) — distributed parse"]
        SV["Silver: 10 Delta (OneLake) · CDC · MERGE · single-.agg validation"]
        GD["Gold: encounter_summary (Spark denormalize) + manifest"]
        NB1 --> BR --> PARSE --> SV --> GD
    end
    ENV["Fabric Environment — core wheel installed"] -. utils .-> FAB
    ORCH["notebooks 00–10 (.Notebook, ADR-021) · Data Factory = 1 Run"] -. orchestrate .-> FAB
    PBI["Power BI Direct Lake"] -. reads .-> GD
    S3 --> NB1
    GD ==>|contract v1.1.0 (same shape)| C["downstream consumers"]
    classDef f fill:#ecfeff,stroke:#06b6d4;
    class FAB f

Same step, two engines

The Silver-build step is the clearest illustration of the parity — same registry concept, same spec.build(...), engine-native return types:

from core.platform.factory import get_platform
from core.transforms.registry import SILVER_TABLES

platform = get_platform()                              # LAKEHOUSE_PLATFORM=local_lite (default)
records = parse_cohort_bundles(...)                    # pure FHIRBundleParser → dict records
spec = SILVER_TABLES["patient"]
table = spec.build(records["patient"], ingest_ts)      # → pyarrow.Table (explicit schema)
platform.write_silver("patient", table, mode="merge")  # delta-rs MERGE + CDC
from fabric.platform import FabricPlatform
from fabric.transforms.registry import REGISTRY

platform = FabricPlatform()                            # ADR-022: no factory, no env var
bundles_df = platform.read_bronze_bundles_spark()      # (path, value) text DataFrame
spec = REGISTRY["patient"]
silver_df = spec.build(bundles_df, ingest_ts)          # from_json(value, BUNDLE_SCHEMA) → Spark DF
platform.write_silver_spark("patient", silver_df, mode="merge")  # Delta MERGE + CDC

Side by side

Aspect LocalLite (core/) Fabric (fabric/)
Engine Polars (in-process) Apache Spark
Storage delta-rs Delta (data/) OneLake Delta
Parse dict .get()pa.Table from_json(value, BUNDLE_SCHEMA) (distributed)
Transform return type pyarrow.Table Spark DataFrame
Silver MERGE + CDC delta-rs, per-cohort micro-batch Spark/Delta + server-side dedup guard (ADR-019)
Validation validate_tableingest_log single .agg() per table
Orchestration CLI · Dagster asset graph notebooks 00–10 · Data Factory
Explore DuckDB UI (read-only SQL) Spark display() · Power BI Direct Lake
Bronze→Silver (full) ✅ 2m19s ✅ green on 100-sample; full re-run pending
Cost (1.3k patients) $0 trial (F4)

Numbers: Benchmarks. Operator runbook: Fabric Deployment.