ADR-011: Generated-first documentation
Date: 2026-05-27 Status: Accepted Deciders: Sandeep Jayaprakash
Context
The project needs documentation that evolves with the codebase rather than rotting. The spec (§4) lists a target doc set (REVIEWER_GUIDE, ARCHITECTURE, DATA_DICTIONARY, CORPUS_CONTRACT, STREAMING_DESIGN, PRODUCTION_NOTES, MkDocs). The risk with several of these — especially the data dictionary and the corpus contract — is that a hand-written copy of the schemas silently diverges from the code as tables change.
Decision
Adopt a generated-first documentation strategy with three tiers:
- Living (hand-maintained, change often): HANDOFF, CHANGELOG, ADRs (append-only), ARCHITECTURE (as-built — distinct from the frozen spec).
- Generated (derived from code, never hand-edited): DATA_DICTIONARY is rendered from
  `local/transforms/registry.py` schemas + `local/validation/schema_registry.py` rules by
  `scripts/gen_data_dictionary.py`. A test (`tests/test_docs_generated.py`) asserts that the
  committed file matches the generator output, so any drift fails the test suite. API
  reference will use mkdocstrings over the existing docstrings (Session 5).
- Synthesized (written once at a milestone): REVIEWER_GUIDE, full README, PRODUCTION_NOTES,
  STREAMING_DESIGN (all Session 5).
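As a minimal sketch of the Generated tier, the renderer can be a pure function and the drift check a plain string comparison. The `Column` type, the `SCHEMAS` mapping, the table and column names, and the `is_stale` helper are all hypothetical stand-ins here; the real generator reads the project's registries, whose shape this ADR does not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str


# Hypothetical stand-in for the schemas exposed by the transform registry.
SCHEMAS: dict[str, list[Column]] = {
    "encounters": [
        Column("encounter_id", "string", "Unique encounter key"),
        Column("admitted_at", "timestamp", "Admission time (UTC)"),
    ],
}


def render(schemas: dict[str, list[Column]]) -> str:
    """Pure function: the same schemas always produce the same Markdown."""
    lines = ["# Data dictionary", ""]
    for table in sorted(schemas):  # sorted so output order is deterministic
        lines += [f"## {table}", "", "| Column | Type | Description |", "|---|---|---|"]
        lines += [f"| {c.name} | {c.dtype} | {c.description} |" for c in schemas[table]]
        lines.append("")
    return "\n".join(lines)


def is_stale(committed: str, schemas: dict[str, list[Column]]) -> bool:
    """True when the committed file no longer matches the generator output."""
    return committed != render(schemas)
```

Under these assumptions, a test would read the committed dictionary from disk and assert `not is_stale(text, SCHEMAS)`. Because `render` has no side effects, the test and the generator script can share it.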
Supporting practices: ADRs are immutable (supersede, never edit); diagrams are kept as
code (Mermaid, which renders on GitHub and diffs in git); contracts are verified by tests
(the forthcoming CORPUS_CONTRACT gets a schema-conformance test with the Gold layer);
`docs/BENCHMARKS.md` records real run metrics as a regression baseline.
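A schema-conformance check of the kind described for CORPUS_CONTRACT could look roughly like the sketch below. The contract contents, the column names, and the `conformance_errors` helper are illustrative assumptions, not the project's actual contract or test. Treating undeclared columns as errors is also a design choice made only for this example.

```python
# Hypothetical contract: column name -> logical type. Not the project's real contract.
CONTRACT = {"encounter_id": "string", "patient_id": "string", "total_cost": "float64"}


def conformance_errors(contract: dict[str, str], actual: dict[str, str]) -> list[str]:
    """List every way `actual` breaks `contract`; an empty list means conformant."""
    errors = []
    for name, dtype in contract.items():
        if name not in actual:
            errors.append(f"missing column: {name}")
        elif actual[name] != dtype:
            errors.append(f"type mismatch for {name}: expected {dtype}, got {actual[name]}")
    # Strict mode (an assumption): columns the contract does not declare are also errors.
    for name in actual:
        if name not in contract:
            errors.append(f"undeclared column: {name}")
    return errors
```

A test would build `actual` from the real table schema and assert that the returned list is empty. Collecting all violations gives a more useful failure message than stopping at the first.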
Alternatives considered
| Option | Pros | Cons | Why rejected |
|---|---|---|---|
| Hand-write all docs | Full control, no tooling | Drift is near-certain as schemas change | Defeats "evolves with the project" |
| Generated-first (chosen) | Code-mirroring docs can't drift; verified by tests | A little tooling up front | Best fit; matches the repo's production-pattern ethos |
| All docs in Session 5 | Less work mid-build | Dictionary/contract stale through Sessions 2–4; no living baseline | Loses the value of evolving docs |
Consequences
Positive:
- DATA_DICTIONARY stays in sync with the code (test-enforced); changing a schema forces a regen.
- ARCHITECTURE/BENCHMARKS give reviewers an honest, current view at any commit.
- Demonstrates documentation engineering, which is a portfolio signal.
Negative:
- Contributors must run the generator (or CI will flag a stale dictionary).
Neutral:
- CORPUS_CONTRACT + its conformance test are deferred to the Gold session, since a
contract test is only meaningful once gold.encounter_summary exists.
Implementation notes
- `scripts/gen_data_dictionary.py` (`render()` pure + `--check` for CI) → `docs/DATA_DICTIONARY.md`.
- `tests/test_docs_generated.py`: fails if the dictionary is stale or misses a table.
- `docs/ARCHITECTURE.md` (as-built + Mermaid), `docs/BENCHMARKS.md` (real run metrics).
- Decision recorded in HANDOFF and the project memory `doc-strategy-generated-first`.
- Related: ADR-009 (schemas/registry that feed the generator), ADR-004 (Arrow schemas).
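The `--check` mode noted above can be sketched as a small command-line wrapper around a pure `render()`: without the flag it writes the file, and with the flag it only compares and returns a non-zero exit code when the committed file is stale. The placeholder `render()` body, the `main(argv, target)` signature, and the error message are assumptions for illustration; the real script's interface may differ.

```python
import argparse
import sys
from pathlib import Path


def render() -> str:
    """Placeholder for the real pure renderer; returns the full Markdown text."""
    return "# Data dictionary\n\n(placeholder body)\n"


def main(argv: list[str], target: Path) -> int:
    parser = argparse.ArgumentParser(description="Generate the data dictionary.")
    parser.add_argument(
        "--check",
        action="store_true",
        help="exit non-zero if the committed file is stale; write nothing",
    )
    args = parser.parse_args(argv)

    expected = render()
    if args.check:
        # A missing file counts as stale, so CI fails until it is generated.
        current = target.read_text(encoding="utf-8") if target.exists() else None
        if current != expected:
            print(f"{target} is stale; rerun the generator", file=sys.stderr)
            return 1
        return 0

    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(expected, encoding="utf-8")
    return 0
```

A script entry point would call something like `sys.exit(main(sys.argv[1:], Path("docs/DATA_DICTIONARY.md")))`. CI would run it with `--check`, so a stale dictionary fails the job without the job modifying the working tree.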