Scribe-IQ — Data Generation Pipeline v2¶
Implementation Brief (Reviewed + Corrected)¶
On this page¶
Layout and pipeline
- Corrections applied from review
- Repo layout decision
- Working directory convention
- Full repository structure
Scripts (01–09)
- Script 01 — Generate Patients
- Script 02 — Build Note Pool
- Script 03 — Reserve ACI-Bench Encounters
- Script 04 — Match and Score
- Script 05 — Select Top 50
- Script 06 — Adapt Notes via Groq
- Script 07 — Assemble Corpus
- Script 08 — Generate Dataset Card (NEW)
- Script 09 — Validate Corpus
Reference and runbook
- Summary · Source datasets · Corpus statistics · Specialty distribution · Note source breakdown · Known limitations · Intended use · Reproduction
- Utils: note checks · Utils: mappings
- requirements.txt · README · Prerequisites · Setup · Run pipeline · Resumability · Estimated runtime
- Relation to lakehouse-era proposals
Appendix (merged corrections)
- Appendix: Implementation corrections (merged)
- Script 05.5 — Extract Longitudinal Context (CORRECTED)
- Script 06 — Adapt Notes (CORRECTED)
- Script 06.5 — Match ACI (SIMPLIFIED - uses existing reservations)
- Utils update · Corrected implementation sequence
Corrections Applied from Review¶
- General Medicine candidate list — no longer appends full pool, uses capped sample
- Working directory — all scripts run from
data_prep/, PYTHONPATH set explicitly load_jsonl_as_dfimport removed from Script 04- MedSynth ICD-10 — defensive string normalizer added
- Duplicate SNOMED key
44054006fixed in mappings.py - Validation metrics — showcase vs ACI dialogue counts labeled precisely
dataset_card.md— generator script added (Script 08)- Groq model — pinned with a note to check Groq docs at build time
- Repo layout — canonical path decision documented at top
- ACI-Bench showcase — reserved assignment step added before match/score
- LLM adaptation — condition/medication allowlist check added post-generation
- Synthea modules — validated against known v3 module names
Repo Layout Decision¶
scribe-iq/ ← repo root
docs/reference/ ← architecture proposals + long-form references
agbonnet_lakehouse_precursor_proposal_v2.md
...
data_prep/ ← CANONICAL: offline corpus builder for this repo
scripts/
utils/
requirements.txt
README.md
data/ ← shared data dir
raw/
staging/
clinical_corpus/
backend/
frontend/
Rule: data_prep/ is the only supported tree for running corpus-generation scripts. Historical lakehouse-style proposals live under docs/reference/ (see docs/history/EVOLUTION.md). An optional local corpus_pipelines/agbonnet_hf_clinical_notes/ directory may exist on some machines and is not part of the supported layout.
Related (same repo): adapt-notes + longitudinal contract — data_prep_adapt_notes_longitudinal_design.md; merged implementation corrections — Appendix in this file; archived precursor scripts — corpus_pipelines/agbonnet_hf_clinical_notes/README.md; superseded long prompts — docs/archive/.
Working Directory Convention¶
All scripts are run from the data_prep/ directory:
The utils/ package is a sibling of scripts/ inside data_prep/, so Python
resolves imports correctly when run from data_prep/.
Alternatively, add to shell profile:
Full Repository Structure¶
scribe-iq/
data_prep/
scripts/
01_generate_patients.sh
02_build_note_pool.py
03_reserve_aci_encounters.py ← NEW: reserve ACI-Bench before matching
04_match_and_score.py
05_select_patients.py
06_adapt_notes.py
07_assemble_corpus.py
08_generate_dataset_card.py ← NEW: generates dataset_card.md
09_validate_corpus.py
utils/
__init__.py
mappings.py
io_utils.py
synthea_utils.py
scoring.py
note_checks.py ← NEW: allowlist validation post-adaptation
requirements.txt
README.md
data/
raw/
synthea/
aci_bench/
staging/
note_pool.jsonl
aci_reservations.jsonl ← NEW
match_results.jsonl
selected_patients.jsonl
adapted_notes.jsonl
clinical_corpus/
patients.jsonl
encounters.jsonl
notes.jsonl
conditions.jsonl
medications.jsonl
observations.jsonl
dialogues.jsonl
source_provenance.jsonl
manifest.json
audit_report.md
dataset_card.md
docs/reference/
agbonnet_lakehouse_precursor_proposal_v2.md
Script 01 — Generate Patients¶
File: data_prep/scripts/01_generate_patients.sh
#!/bin/bash
# Generates 1000 synthetic patients via Synthea
# Requires: Java 11+
# Download JAR: https://github.com/synthetichealth/synthea/releases
# → synthea-with-dependencies.jar → place in scribe-iq/ root
#
# VALIDATED module names for Synthea v3 (check release notes if generation fails):
# heart_disease, hypertension, diabetes, stroke, epilepsy,
# colorectal_cancer, crohns_disease, osteoporosis, osteoarthritis,
# asthma, copd, kidney_disease, dermatitis, hypothyroidism,
# anxiety_and_panic, depression, lung_cancer, atrial_fibrillation,
# alzheimers_disease, rheumatoid_arthritis
#
# NOTE: module names are file names in synthea/src/main/resources/modules/
# If a module fails, check the exact filename in that directory.
set -e
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../" && pwd)"
JAR="$REPO_ROOT/synthea-with-dependencies.jar"
OUTPUT_DIR="$REPO_ROOT/data/raw/synthea"
if [ ! -f "$JAR" ]; then
echo "ERROR: synthea-with-dependencies.jar not found at $JAR"
echo "Download from: https://github.com/synthetichealth/synthea/releases"
exit 1
fi
mkdir -p "$OUTPUT_DIR"
java -jar "$JAR" \
-p 1000 \
-s 42 \
--exporter.csv.export true \
--exporter.fhir.export false \
--exporter.baseDirectory "$OUTPUT_DIR" \
-m "heart_disease,hypertension,diabetes,stroke,epilepsy,\
colorectal_cancer,crohns_disease,osteoporosis,\
osteoarthritis,asthma,copd,kidney_disease,\
dermatitis,hypothyroidism,anxiety_and_panic,depression,\
lung_cancer,atrial_fibrillation,alzheimers_disease,\
rheumatoid_arthritis"
echo ""
echo "✓ Synthea generation complete"
echo " Output: $OUTPUT_DIR/csv/"
ls -lh "$OUTPUT_DIR/csv/"
Script 02 — Build Note Pool¶
File: data_prep/scripts/02_build_note_pool.py
"""
02_build_note_pool.py
Run from: scribe-iq/data_prep/
Builds unified indexed note pool from three sources:
MedSynth : HuggingFace Ahmad0067/MedSynth (ICD-10 coded SOAP notes)
MTSamples : HuggingFace harishnair04/mtsamples (human-written progress notes)
ACI-Bench : github.com/microsoft/clinical_visit_note_summarization_corpus
Output: data/staging/note_pool.jsonl
"""
from pathlib import Path
from datasets import load_dataset
import pandas as pd
from tqdm import tqdm
from utils.mappings import ICD10_TO_SPECIALTY, MTSAMPLES_TO_STANDARD_SPECIALTY
from utils.io_utils import write_jsonl, count_jsonl
REPO_ROOT = Path(__file__).resolve().parents[2]
OUTPUT = REPO_ROOT / "data/staging/note_pool.jsonl"
ACI_DIR = REPO_ROOT / "data/raw/aci_bench"
OUTPUT.parent.mkdir(parents=True, exist_ok=True)
OUTPUT.unlink(missing_ok=True)
def normalize_icd10(value) -> str:
"""Safely coerce ICD-10 field to string, handling None/list/float."""
if value is None:
return ""
if isinstance(value, list):
return str(value[0]) if value else ""
try:
return str(value).strip()
except Exception:
return ""
OUTPATIENT_KEYWORDS = [
"progress note", "office visit", "office note", "soap",
"follow-up", "followup", "follow up", "clinic note",
"consult", "history and physical", "h&p", "outpatient",
"checkup", "check-up", "annual", "wellness", "recheck",
]
EXCLUDED_SPECIALTIES = {
"Surgery", "Neurosurgery", "Radiology", "Lab Medicine - Pathology",
"Letters", "Discharge Summary", "Emergency Room Reports",
"Autopsy", "IME-QME-Work Comp etc.", "Cosmetic / Plastic Surgery",
}
def is_outpatient_note(row) -> bool:
if row.get("medical_specialty", "") in EXCLUDED_SPECIALTIES:
return False
text = f"{row.get('sample_name','')} {row.get('description','')}".lower()
return any(kw in text for kw in OUTPATIENT_KEYWORDS)
# ─── MedSynth ──────────────────────────────────────────────────────────────────
print("\n[1/3] Loading MedSynth from HuggingFace...")
medsynth = load_dataset("Ahmad0067/MedSynth", split="train")
for idx, row in enumerate(tqdm(medsynth, desc="MedSynth")):
icd10 = normalize_icd10(row.get("ICD10"))
specialty = ICD10_TO_SPECIALTY.get(icd10[0] if icd10 else "X", "General Medicine")
note_text = row.get("note") or ""
dialogue = row.get("dialogue") or ""
if len(note_text) < 50:
continue # skip empty/stub notes
write_jsonl(OUTPUT, {
"note_id": f"ms_{idx:06d}",
"source": "medsynth",
"specialty": specialty,
"icd10_code": icd10,
"icd10_desc": str(row.get("ICD10_desc") or ""),
"note_type": "soap",
"note_text": note_text,
"dialogue": dialogue if dialogue else None,
"chief_complaint": None,
"gender": None,
"word_count": len(note_text.split()),
"quality_tier": "primary",
})
print(f" MedSynth done: {count_jsonl(OUTPUT)} records")
# ─── MTSamples ─────────────────────────────────────────────────────────────────
print("\n[2/3] Loading MTSamples from HuggingFace...")
mtsamples = load_dataset("harishnair04/mtsamples", split="train")
mts_count = 0
for idx, row in enumerate(tqdm(mtsamples, desc="MTSamples")):
if not is_outpatient_note(row):
continue
transcription = row.get("transcription") or ""
if len(transcription) < 100:
continue
raw_specialty = (row.get("medical_specialty") or "General Medicine").strip()
specialty = MTSAMPLES_TO_STANDARD_SPECIALTY.get(raw_specialty, raw_specialty)
write_jsonl(OUTPUT, {
"note_id": f"mts_{idx:06d}",
"source": "mtsamples",
"specialty": specialty,
"icd10_code": None,
"icd10_desc": None,
"note_type": "progress_note",
"note_text": transcription,
"dialogue": None,
"chief_complaint": row.get("description") or "",
"gender": None,
"keywords": row.get("keywords") or "",
"word_count": len(transcription.split()),
"quality_tier": "primary",
})
mts_count += 1
print(f" MTSamples done: {mts_count} outpatient notes added")
# ─── ACI-Bench ─────────────────────────────────────────────────────────────────
print("\n[3/3] Loading ACI-Bench...")
if not ACI_DIR.exists():
print(" WARNING: ACI-Bench not found.")
print(f" Clone to: {ACI_DIR}")
print(" git clone https://github.com/microsoft/clinical_visit_note_summarization_corpus data/raw/aci_bench")
else:
aci_count = 0
# Walk all subsets: aci, virtassist, virtscribe
for meta_file in sorted(ACI_DIR.rglob("*metadata*.csv")):
try:
meta_df = pd.read_csv(meta_file)
except Exception as e:
print(f" Could not read {meta_file}: {e}")
continue
# Find matching src-tgt file in same directory
src_tgt_candidates = list(meta_file.parent.glob("*src-tgt*.csv"))
if not src_tgt_candidates:
continue
src_tgt_df = pd.read_csv(src_tgt_candidates[0])
# Merge on encounter_id or id (whichever is present in both)
merge_key = "encounter_id" if "encounter_id" in src_tgt_df.columns else "id"
if merge_key not in meta_df.columns:
continue
merged = meta_df.merge(src_tgt_df, on=merge_key, how="inner")
for _, row in merged.iterrows():
note_text = str(row.get("note") or row.get("tgt") or "")
dialogue = str(row.get("src") or "")
if len(note_text) < 50:
continue
write_jsonl(OUTPUT, {
"note_id": f"aci_{row[merge_key]}",
"source": "aci_bench",
"specialty": "General Medicine", # classified in Script 03
"icd10_code": None,
"icd10_desc": None,
"note_type": "full_encounter",
"note_text": note_text,
"dialogue": dialogue if len(dialogue) > 20 else None,
"chief_complaint": str(row.get("cc") or ""),
"gender": str(row.get("gender") or ""),
"patient_name": f"{row.get('patient_firstname','')} {row.get('patient_familyname','')}".strip(),
"doctor_name": str(row.get("doctor_name") or ""),
"word_count": len(note_text.split()),
"quality_tier": "primary",
})
aci_count += 1
print(f" ACI-Bench done: {aci_count} encounters added")
total = count_jsonl(OUTPUT)
print(f"\n✓ Note pool complete: {total} total notes → {OUTPUT}")
Script 03 — Reserve ACI-Bench Encounters¶
File: data_prep/scripts/03_reserve_aci_encounters.py
Why this exists: ACI-Bench is too small (~207 encounters) to leave its assignment to match luck. If ACI notes aren't reserved upfront, many showcase encounters get a MedSynth or MTSamples note instead. This script classifies ACI-Bench by specialty and pre-assigns one encounter per patient specialty slot.
"""
03_reserve_aci_encounters.py
Run from: scribe-iq/data_prep/
Classifies ACI-Bench encounters by specialty using chief complaint.
Reserves one ACI encounter per target specialty for showcase assignment.
Output: data/staging/aci_reservations.jsonl
"""
from pathlib import Path
from collections import defaultdict
from utils.io_utils import load_jsonl, write_jsonl
from utils.mappings import ICD10_TO_SPECIALTY, CC_TO_SPECIALTY
REPO_ROOT = Path(__file__).resolve().parents[2]
NOTE_POOL = REPO_ROOT / "data/staging/note_pool.jsonl"
OUTPUT = REPO_ROOT / "data/staging/aci_reservations.jsonl"
OUTPUT.unlink(missing_ok=True)
# Target: one reserved ACI encounter per specialty
# Extra slots available for specialties with multiple ACI encounters
RESERVE_PER_SPECIALTY = 5 # reserve up to 5 per specialty
aci_notes = [
n for n in load_jsonl(NOTE_POOL)
if n["source"] == "aci_bench" and n.get("dialogue")
]
print(f"ACI-Bench encounters with dialogue: {len(aci_notes)}")
# Classify each by specialty using chief complaint
def classify_chief_complaint(cc: str) -> str:
"""Map chief complaint text to specialty using keyword rules."""
if not cc:
return "General Medicine"
cc_lower = cc.lower()
for keywords, specialty in CC_TO_SPECIALTY.items():
if any(kw in cc_lower for kw in keywords):
return specialty
return "General Medicine"
specialty_buckets = defaultdict(list)
for note in aci_notes:
specialty = classify_chief_complaint(note.get("chief_complaint", ""))
note["specialty"] = specialty # update in-memory
specialty_buckets[specialty].append(note)
print("\nACI-Bench specialty distribution:")
for spec, notes in sorted(specialty_buckets.items(), key=lambda x: -len(x[1])):
print(f" {spec:<30} {len(notes)}")
# Reserve encounters — take up to RESERVE_PER_SPECIALTY per specialty
reserved = []
used_note_ids = set()
for specialty, notes in specialty_buckets.items():
count = 0
for note in notes:
if count >= RESERVE_PER_SPECIALTY:
break
if note["note_id"] not in used_note_ids:
write_jsonl(OUTPUT, {
"note_id": note["note_id"],
"specialty": specialty,
"chief_complaint": note.get("chief_complaint", ""),
"gender": note.get("gender", ""),
"note_text": note["note_text"],
"dialogue": note["dialogue"],
"reserved_for": specialty,
})
used_note_ids.add(note["note_id"])
count += 1
print(f"\n✓ Reserved {sum(1 for _ in load_jsonl(OUTPUT))} ACI encounters")
print(f" Output: {OUTPUT}")
Add to utils/mappings.py:
# Chief complaint keywords → specialty (for ACI-Bench classification)
# Keys are tuples of keywords, values are specialty strings
CC_TO_SPECIALTY = {
("chest", "cardiac", "heart", "palpitation", "angina", "hypertension"): "Cardiology",
("seizure", "neuro", "headache", "migraine", "stroke", "memory", "tremor"): "Neurology",
("knee", "hip", "back pain", "joint", "fracture", "orthop", "shoulder"): "Orthopedics",
("bowel", "abdominal", "crohn", "colitis", "gastro", "reflux", "diarrhea"): "Gastroenterology",
("breath", "asthma", "copd", "pulmon", "cough", "lung"): "Pulmonology",
("kidney", "renal", "dialysis", "ckd"): "Nephrology",
("skin", "rash", "dermat", "psoriasis", "eczema"): "Dermatology",
("diabetes", "thyroid", "endocrin", "glucose", "insulin"): "Endocrinology",
("depression", "anxiety", "psych", "bipolar", "mental"): "Psychiatry",
("urin", "prostate", "bladder", "urolag"): "Urology",
("eye", "vision", "ophthalm", "cataract", "glaucoma"): "Ophthalmology",
("ear", "nose", "throat", "ent", "sinus", "hearing"): "Otorhinolaryngology",
}
Script 04 — Match and Score¶
File: data_prep/scripts/04_match_and_score.py
"""
04_match_and_score.py
Run from: scribe-iq/data_prep/
Matches every Synthea patient/encounter to the best note in the pool.
ACI-Bench reserved encounters (from script 03) are excluded from the general
pool and only assigned in script 06 as showcase encounters.
Output: data/staging/match_results.jsonl
"""
from pathlib import Path
import pandas as pd
from tqdm import tqdm
from collections import defaultdict
from utils.mappings import SNOMED_TO_ICD10, ICD10_TO_SPECIALTY
from utils.io_utils import write_jsonl, load_jsonl
from utils.synthea_utils import load_synthea, get_active_conditions, get_active_medications
REPO_ROOT = Path(__file__).resolve().parents[2]
SYNTHEA_DIR = REPO_ROOT / "data/raw/synthea/csv"
NOTE_POOL = REPO_ROOT / "data/staging/note_pool.jsonl"
ACI_RES = REPO_ROOT / "data/staging/aci_reservations.jsonl"
OUTPUT = REPO_ROOT / "data/staging/match_results.jsonl"
OUTPUT.unlink(missing_ok=True)
# Load reserved ACI IDs so they're excluded from general matching
reserved_aci_ids = {r["note_id"] for r in load_jsonl(ACI_RES)}
print(f"Reserved ACI encounters excluded from pool: {len(reserved_aci_ids)}")
# Load note pool (excluding reserved ACI)
all_notes = [
n for n in load_jsonl(NOTE_POOL)
if n["note_id"] not in reserved_aci_ids
]
print(f"Notes available for matching: {len(all_notes)}")
# ─── Index notes by specialty ─────────────────────────────────────────────────
# FIX: General Medicine index is a CAPPED SAMPLE, not the full pool
# This prevents O(n*m) blowup when falling back to General Medicine
GM_SAMPLE_SIZE = 200 # max General Medicine fallback candidates per query
notes_by_specialty = defaultdict(list)
gm_notes = []
for note in all_notes:
spec = note["specialty"]
notes_by_specialty[spec].append(note)
if spec == "General Medicine":
gm_notes.append(note)
# Shuffle for variety before capping
import random
random.seed(42)
random.shuffle(gm_notes)
gm_fallback = gm_notes[:GM_SAMPLE_SIZE]
print(f"\nNote pool by specialty:")
for spec, notes in sorted(notes_by_specialty.items(), key=lambda x: -len(x[1])):
print(f" {spec:<30} {len(notes)}")
print(f" {'General Medicine fallback cap':<30} {GM_SAMPLE_SIZE}")
# ─── Scoring ──────────────────────────────────────────────────────────────────
def score_match(encounter: dict, note: dict) -> float:
score = 0.0
# 1. Specialty match (0.40)
if note["specialty"] == encounter["specialty"]:
score += 0.40
elif note["specialty"] == "General Medicine":
score += 0.08 # small partial credit only
# 2. ICD-10 match (0.35) — exact 3-char prefix or chapter match
note_icd = (note.get("icd10_code") or "")
enc_icd_list = encounter.get("icd10_codes", [])
if note_icd and enc_icd_list:
for enc_icd in enc_icd_list:
if not enc_icd:
continue
if note_icd[:3] == enc_icd[:3]: # e.g. I25 == I25
score += 0.35
break
elif note_icd[0] == enc_icd[0]: # e.g. I == I
score += 0.12
break
# 3. Condition keyword overlap (0.15)
conditions = encounter.get("conditions", [])
note_text = (note.get("note_text") or "").lower()
if conditions:
# Match on 5+ char words from condition names
matched = sum(
1 for cond in conditions
if any(
word in note_text
for word in cond.lower().split()
if len(word) >= 5
)
)
score += 0.15 * (matched / len(conditions))
# 4. Quality tier bonus (0.05)
tier_weights = {"primary": 0.05, "secondary": 0.02, "fallback": 0.0}
score += tier_weights.get(note.get("quality_tier", "fallback"), 0)
# 5. Has dialogue (0.05 bonus)
if note.get("dialogue"):
score += 0.05
return round(min(score, 1.0), 3)
# ─── Load Synthea ─────────────────────────────────────────────────────────────
print("\nLoading Synthea data...")
synthea = load_synthea(SYNTHEA_DIR)
patients_df = synthea["patients"]
encounters_df = synthea["encounters"]
conditions_df = synthea["conditions"]
meds_df = synthea["medications"]
obs_df = synthea["observations"]
# ─── Match all patients ───────────────────────────────────────────────────────
print(f"\nMatching {len(patients_df)} patients...")
for patient_id in tqdm(patients_df["Id"].unique()):
pt_row = patients_df[patients_df["Id"] == patient_id].iloc[0]
birth_date = pd.to_datetime(pt_row["BIRTHDATE"])
gender = pt_row["GENDER"]
pt_encounters = encounters_df[
encounters_df["PATIENT"] == patient_id
].sort_values("START")
if len(pt_encounters) < 3:
continue
for _, enc in pt_encounters.iterrows():
enc_date = pd.to_datetime(enc["START"])
age = int((enc_date - birth_date).days / 365.25)
active_conditions = get_active_conditions(conditions_df, patient_id, enc_date)
active_meds = get_active_medications(meds_df, patient_id, enc_date)
recent_obs = obs_df[
(obs_df["PATIENT"] == patient_id) &
(pd.to_datetime(obs_df["DATE"]) <= enc_date)
].tail(10)
icd10_codes = list(filter(None, [
SNOMED_TO_ICD10.get(str(code))
for code in active_conditions["CODE"].tolist()
]))
specialty = ICD10_TO_SPECIALTY.get(
icd10_codes[0][0] if icd10_codes else "X",
"General Medicine"
)
enc_record = {
"patient_id": patient_id,
"encounter_id": enc["Id"],
"encounter_date": enc_date.isoformat(),
"encounter_reason": enc.get("REASONDESCRIPTION", ""),
"specialty": specialty,
"age": age,
"gender": gender,
"icd10_codes": icd10_codes,
"conditions": active_conditions["DESCRIPTION"].tolist(),
"medications": active_meds["DESCRIPTION"].tolist(),
"recent_obs": recent_obs[["DESCRIPTION", "VALUE", "UNITS"]].to_dict("records"),
}
# Candidate pool: specialty-specific + capped GM fallback
candidates = notes_by_specialty.get(specialty, [])
if specialty != "General Medicine":
candidates = candidates + gm_fallback
best_note = None
best_score = 0.0
for note in candidates:
s = score_match(enc_record, note)
if s > best_score:
best_score = s
best_note = note
write_jsonl(OUTPUT, {
**enc_record,
"best_note_id": best_note["note_id"] if best_note else None,
"best_note_source": best_note["source"] if best_note else None,
"best_note_text": best_note["note_text"] if best_note else None,
"best_note_dialogue": best_note.get("dialogue") if best_note else None,
"match_score": best_score,
})
print(f"\n✓ Matching complete → {OUTPUT}")
Script 05 — Select Top 50¶
File: data_prep/scripts/05_select_patients.py
"""
05_select_patients.py
Run from: scribe-iq/data_prep/
Selects top 50 patients by match quality with specialty distribution constraint.
Output: data/staging/selected_patients.jsonl
"""
from pathlib import Path
from collections import defaultdict, Counter
from utils.io_utils import load_jsonl, write_jsonl
from utils.mappings import ICD10_TO_SPECIALTY
REPO_ROOT = Path(__file__).resolve().parents[2]
MATCH_RESULTS = REPO_ROOT / "data/staging/match_results.jsonl"
OUTPUT = REPO_ROOT / "data/staging/selected_patients.jsonl"
OUTPUT.unlink(missing_ok=True)
TARGET_TOTAL = 50
MAX_PER_SPECIALTY = 6
MIN_ENCOUNTERS = 3
MIN_AVG_MATCH_SCORE = 0.35
results = list(load_jsonl(MATCH_RESULTS))
patient_data = defaultdict(lambda: {
"encounters": [], "scores": [], "specialties": []
})
for r in results:
pid = r["patient_id"]
patient_data[pid]["encounters"].append(r)
patient_data[pid]["scores"].append(r["match_score"])
patient_data[pid]["specialties"].append(r["specialty"])
scored_patients = []
for pid, data in patient_data.items():
n_enc = len(data["encounters"])
avg_score = sum(data["scores"]) / len(data["scores"])
if n_enc < MIN_ENCOUNTERS:
continue
if avg_score < MIN_AVG_MATCH_SCORE:
continue
if not any(r.get("conditions") for r in data["encounters"]):
continue
primary_specialty = Counter(data["specialties"]).most_common(1)[0][0]
scored_patients.append({
"patient_id": pid,
"quality_score": round(avg_score, 3),
"encounter_count": n_enc,
"primary_specialty": primary_specialty,
})
scored_patients.sort(key=lambda x: x["quality_score"], reverse=True)
selected = []
specialty_counts = defaultdict(int)
for patient in scored_patients:
spec = patient["primary_specialty"]
if specialty_counts[spec] >= MAX_PER_SPECIALTY:
continue
selected.append(patient)
specialty_counts[spec] += 1
if len(selected) >= TARGET_TOTAL:
break
for p in selected:
write_jsonl(OUTPUT, p)
print(f"\n✓ Selected {len(selected)} patients")
print(f"\nSpecialty distribution:")
for spec, count in sorted(specialty_counts.items(), key=lambda x: -x[1]):
print(f" {spec:<30} {count}")
scores = [p["quality_score"] for p in selected]
print(f"\nQuality score range: {min(scores):.3f} – {max(scores):.3f}")
print(f"Output: {OUTPUT}")
Script 06 — Adapt Notes via Groq¶
File: data_prep/scripts/06_adapt_notes.py
"""
06_adapt_notes.py
Run from: scribe-iq/data_prep/
Adapts matched notes to be coherent with each patient's Synthea record.
Assigns reserved ACI-Bench encounter as showcase for each patient.
Model: llama-3.1-70b-versatile (verify current name at console.groq.com/docs)
If deprecated, update MODEL constant before running.
Free tier: ~14,400 requests/day — 125 notes fits comfortably.
Set env var: export GROQ_API_KEY=gsk_...
Get free key: https://console.groq.com
"""
import os, time
from pathlib import Path
from collections import defaultdict
from groq import Groq
from utils.io_utils import load_jsonl, write_jsonl
from utils.synthea_utils import load_synthea, compute_age
from utils.note_checks import check_note_coherence
REPO_ROOT = Path(__file__).resolve().parents[2]
SELECTED = REPO_ROOT / "data/staging/selected_patients.jsonl"
MATCHES = REPO_ROOT / "data/staging/match_results.jsonl"
ACI_RES = REPO_ROOT / "data/staging/aci_reservations.jsonl"
SYNTHEA_DIR = REPO_ROOT / "data/raw/synthea/csv"
OUTPUT = REPO_ROOT / "data/staging/adapted_notes.jsonl"
OUTPUT.unlink(missing_ok=True)
# ── Check Groq model name at https://console.groq.com/docs/models ──────────────
MODEL = "llama-3.1-70b-versatile"
# ──────────────────────────────────────────────────────────────────────────────
client = Groq(api_key=os.environ["GROQ_API_KEY"])
synthea = load_synthea(SYNTHEA_DIR)
# Index ACI reservations by specialty (pick best fit per patient)
aci_by_specialty = defaultdict(list)
for r in load_jsonl(ACI_RES):
aci_by_specialty[r["specialty"]].append(r)
aci_used = set() # track used ACI encounter IDs across patients
# Index match results
all_matches = list(load_jsonl(MATCHES))
by_patient = defaultdict(list)
for m in all_matches:
by_patient[m["patient_id"]].append(m)
selected = list(load_jsonl(SELECTED))
patients_df = synthea["patients"]
note_idx = 0
ADAPT_PROMPT = """\
You are a clinical documentation specialist.
Adapt the REFERENCE NOTE to match the PATIENT DATA below.
Rules:
1. Keep the EXACT same section format and structure as the reference note
2. Keep the clinical writing style, tone, and approximate length
3. Replace any conditions or medications that CONFLICT with the patient data
4. Do NOT add conditions or medications not present in the patient data
5. Do NOT invent lab values — only use values provided below
6. Do NOT include any patient name or identifying information
7. Output only the adapted note text — no preamble, no explanation
PATIENT DATA:
- Age: {age}
- Sex: {sex}
- Visit date: {visit_date}
- Visit reason: {visit_reason}
- Active conditions: {conditions}
- Current medications: {medications}
- Recent labs/vitals: {observations}
REFERENCE NOTE:
{reference_note}"""
def call_groq(prompt: str, retries: int = 3) -> str:
for attempt in range(retries):
try:
resp = client.chat.completions.create(
model=MODEL,
messages=[{"role": "user", "content": prompt}],
temperature=0.2,
max_tokens=900,
)
return resp.choices[0].message.content.strip()
except Exception as e:
if attempt < retries - 1:
print(f" Retry {attempt + 1}: {e}")
time.sleep(2 ** attempt)
else:
raise
def format_obs(obs_list: list) -> str:
if not obs_list:
return "None recorded"
return "\n".join(
f"- {o.get('DESCRIPTION','')}: {o.get('VALUE','')} {o.get('UNITS','')}".strip()
for o in obs_list[:8]
)
for patient in selected:
pid = patient["patient_id"]
pt = patients_df[patients_df["Id"] == pid].iloc[0]
sex = "Male" if pt["GENDER"] == "M" else "Female"
spec = patient["primary_specialty"]
pt_matches = sorted(by_patient[pid], key=lambda x: x["encounter_date"])
prior_visits = pt_matches[:-1]
current_visit = pt_matches[-1]
print(f"\n{pid} {spec} {len(pt_matches)} encounters")
# ── Prior visits: adapt reference note ────────────────────────────────────
for enc in prior_visits:
if not enc.get("best_note_text"):
continue
age = compute_age(pt["BIRTHDATE"], enc["encounter_date"])
prompt = ADAPT_PROMPT.format(
age=age, sex=sex,
visit_date=enc["encounter_date"][:10],
visit_reason=enc.get("encounter_reason", "Follow-up visit"),
conditions=", ".join(enc.get("conditions", [])) or "None documented",
medications=", ".join(enc.get("medications", [])) or "None documented",
observations=format_obs(enc.get("recent_obs", [])),
reference_note=enc["best_note_text"][:2500],
)
print(f" [{enc['encounter_date'][:10]}] adapting "
f"({enc['best_note_source']}, score={enc['match_score']:.2f})...",
end=" ", flush=True)
adapted = call_groq(prompt)
# Post-adaptation coherence check
issues = check_note_coherence(
adapted,
enc.get("conditions", []),
enc.get("medications", [])
)
if issues:
print(f"⚠ {len(issues)} coherence warning(s): {issues[:2]}")
else:
print("ok")
write_jsonl(OUTPUT, {
"adapted_note_id": f"note_{note_idx:06d}",
"encounter_id": enc["encounter_id"],
"patient_id": pid,
"encounter_date": enc["encounter_date"],
"note_text": adapted,
"reference_note_id": enc["best_note_id"],
"reference_source": enc["best_note_source"],
"match_score": enc["match_score"],
"coherence_issues": issues,
"is_showcase": False,
"has_dialogue": False,
})
note_idx += 1
# ── Current (showcase) visit: assign reserved ACI-Bench encounter ─────────
aci_candidates = [
r for r in aci_by_specialty.get(spec, [])
if r["note_id"] not in aci_used
]
# Also try General Medicine ACI as fallback
if not aci_candidates:
aci_candidates = [
r for r in aci_by_specialty.get("General Medicine", [])
if r["note_id"] not in aci_used
]
if aci_candidates:
# Pick best gender match if possible
gender_match = [c for c in aci_candidates
if c.get("gender", "").upper() == pt["GENDER"].upper()]
aci_enc = gender_match[0] if gender_match else aci_candidates[0]
aci_used.add(aci_enc["note_id"])
write_jsonl(OUTPUT, {
"adapted_note_id": f"note_{note_idx:06d}",
"encounter_id": current_visit["encounter_id"],
"patient_id": pid,
"encounter_date": current_visit["encounter_date"],
"note_text": aci_enc["note_text"],
"dialogue": aci_enc["dialogue"],
"reference_note_id": aci_enc["note_id"],
"reference_source": "aci_bench",
"match_score": current_visit["match_score"],
"coherence_issues": [],
"is_showcase": True,
"has_dialogue": True,
})
print(f" [{current_visit['encounter_date'][:10]}] showcase → ACI-Bench "
f"({aci_enc['specialty']}, {aci_enc.get('gender','')})")
else:
# No ACI available: adapt the matched note, use MedSynth dialogue if present
if current_visit.get("best_note_text"):
age = compute_age(pt["BIRTHDATE"], current_visit["encounter_date"])
prompt = ADAPT_PROMPT.format(
age=age, sex=sex,
visit_date=current_visit["encounter_date"][:10],
visit_reason=current_visit.get("encounter_reason", "Follow-up"),
conditions=", ".join(current_visit.get("conditions", [])) or "None",
medications=", ".join(current_visit.get("medications", [])) or "None",
observations=format_obs(current_visit.get("recent_obs", [])),
reference_note=current_visit["best_note_text"][:2500],
)
adapted = call_groq(prompt)
dialogue = current_visit.get("best_note_dialogue")
write_jsonl(OUTPUT, {
"adapted_note_id": f"note_{note_idx:06d}",
"encounter_id": current_visit["encounter_id"],
"patient_id": pid,
"encounter_date": current_visit["encounter_date"],
"note_text": adapted,
"dialogue": dialogue,
"reference_source": current_visit["best_note_source"],
"match_score": current_visit["match_score"],
"coherence_issues": [],
"is_showcase": True,
"has_dialogue": bool(dialogue),
})
print(f" [{current_visit['encounter_date'][:10]}] showcase → adapted "
f"(no ACI available, dialogue={'yes' if dialogue else 'no'})")
note_idx += 1
print(f"\n✓ Adaptation complete: {note_idx} notes → {OUTPUT}")
Script 07 — Assemble Corpus¶
(Unchanged from v1 except updated path resolution using Path(__file__).resolve().parents[2]
for all paths. Apply this pattern to every open() and pd.read_csv() call.)
Key change from v1 — path pattern:
# Replace all hardcoded paths with:
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_DIR = REPO_ROOT / "data/clinical_corpus"
SYNTHEA_DIR = REPO_ROOT / "data/raw/synthea/csv"
Script 08 — Generate Dataset Card (NEW)¶
File: data_prep/scripts/08_generate_dataset_card.py
"""
08_generate_dataset_card.py
Run from: scribe-iq/data_prep/
Generates dataset_card.md for the clinical corpus.
"""
from pathlib import Path
from collections import Counter
from utils.io_utils import load_jsonl
import json
from datetime import datetime
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_DIR = REPO_ROOT / "data/clinical_corpus"
patients = list(load_jsonl(CORPUS_DIR / "patients.jsonl"))
notes = list(load_jsonl(CORPUS_DIR / "notes.jsonl"))
dialogues = list(load_jsonl(CORPUS_DIR / "dialogues.jsonl"))
manifest = json.loads((CORPUS_DIR / "manifest.json").read_text())
specialty_dist = Counter(p["primary_specialty"] for p in patients)
source_dist = Counter(n["reference_source"] for n in notes)
showcase_count = sum(1 for n in notes if n.get("is_showcase"))
aci_dialogue = sum(1 for d in dialogues if True) # all dialogues
card = f"""# Scribe-IQ Clinical Corpus
**Version:** 1.0
**Generated:** {datetime.utcnow().strftime('%Y-%m-%d')}
**Pipeline:** scribe-iq/data_prep/
---
## Summary
Synthetic clinical corpus of {len(patients)} patients designed for the Scribe-IQ
clinical AI demo. Each patient has a longitudinal encounter history with attached
clinical notes and one showcase encounter with a full doctor-patient dialogue.
The corpus is not intended for clinical use. It contains no real patient data.
All structured data is from Synthea (synthetic). Notes are either human-written
(MTSamples), synthetically generated (MedSynth), or expert-created (ACI-Bench),
adapted to each patient's specific Synthea record via LLM.
---
## Source Datasets
| Dataset | License | Role |
|---|---|---|
| Synthea v3 (seed=42) | Apache 2.0 | Patient spine — demographics, conditions, medications, labs, encounters |
| MTSamples | CC0 Public Domain | Human-written outpatient progress notes |
| MedSynth (Ahmad0067/MedSynth) | HuggingFace | ICD-10 coded SOAP dialogue-note pairs |
| ACI-Bench (microsoft/clinical_visit_note_summarization_corpus) | CC BY 4.0 | Expert-created full encounter dialogues |
---
## Corpus Statistics
| Entity | Count |
|---|---|
| Patients | {len(patients)} |
| Encounters | {manifest['record_counts'].get('encounters', 'N/A')} |
| Notes | {len(notes)} |
| Dialogues | {aci_dialogue} |
| Conditions | {manifest['record_counts'].get('conditions', 'N/A')} |
| Medications | {manifest['record_counts'].get('medications', 'N/A')} |
## Specialty Distribution
| Specialty | Patients |
|---|---|
{chr(10).join(f'| {s} | {c} |' for s, c in sorted(specialty_dist.items()))}
## Note Source Breakdown
| Source | Notes |
|---|---|
{chr(10).join(f'| {s} | {c} |' for s, c in sorted(source_dist.items()))}
---
## Known Limitations
- MTSamples notes are 10-15 years old; some medication names and practices may be outdated
- MedSynth notes are LLM-generated; clinical accuracy is not guaranteed
- ACI-Bench specialty coverage is narrow (primarily primary care and common outpatient)
- LLM adaptation (Groq / llama-3.1-70b-versatile) may introduce minor hallucinations; notes were spot-checked but not clinically validated
- SNOMED→ICD-10 mapping covers top ~40 Synthea conditions; rare conditions fall back to specialty-only matching
- Corpus should not be used for clinical decision support, clinical training, or any real patient care
---
## Intended Use
- Demonstrating Scribe-IQ pre-meeting summary, note generation, and transcription features
- Portfolio demonstration of clinical data pipeline design
- Research and development of clinical NLP systems in a synthetic data environment
---
## Reproduction
```bash
cd scribe-iq/data_prep
bash scripts/01_generate_patients.sh
python scripts/02_build_note_pool.py
python scripts/03_reserve_aci_encounters.py
python scripts/04_match_and_score.py
python scripts/05_select_patients.py
python scripts/06_adapt_notes.py
python scripts/07_assemble_corpus.py
python scripts/08_generate_dataset_card.py
python scripts/09_validate_corpus.py
(CORPUS_DIR / "dataset_card.md").write_text(card) print(f"✓ dataset_card.md written to {CORPUS_DIR}")
---
## `utils/note_checks.py` (NEW)
Post-adaptation coherence check — flags obvious hallucinations.
```python
"""
utils/note_checks.py
Checks adapted notes for coherence with the patient's Synthea record.
Returns a list of issue strings (empty list = no issues found).
"""
def check_note_coherence(
note_text: str,
expected_conditions: list,
expected_medications: list,
) -> list:
"""
Checks for two types of issues:
1. Medication mentioned in note that is NOT in the patient's active med list
2. Note is suspiciously short (may indicate adaptation failure)
Returns list of warning strings. Empty list = clean.
"""
issues = []
note_lower = note_text.lower()
# Check note length
if len(note_text.split()) < 50:
issues.append(f"Note is very short ({len(note_text.split())} words) — possible adaptation failure")
# Build allowlist from expected medications
# Extract significant tokens (5+ chars) from medication names
allowed_med_tokens = set()
for med in expected_medications:
for token in med.lower().split():
if len(token) >= 5:
allowed_med_tokens.add(token)
# Spot-check for common drug name patterns NOT in allowlist
# This is a heuristic, not exhaustive
COMMON_DRUG_SUFFIXES = [
"olol", "pril", "sartan", "statin", "mycin",
"cillin", "azole", "prazole", "tidine", "dipine",
]
for suffix in COMMON_DRUG_SUFFIXES:
import re
pattern = rf"\b\w+{suffix}\b"
matches = re.findall(pattern, note_lower)
for match in matches:
if match not in allowed_med_tokens and len(match) >= 7:
issues.append(f"Possible unlisted medication in note: '{match}'")
break # one warning per suffix type is enough
return issues
Script 09 — Validate Corpus¶
File: data_prep/scripts/09_validate_corpus.py
Key fix from v1: precise showcase/dialogue count labeling.
"""
09_validate_corpus.py — corrected version
"""
from pathlib import Path
from collections import defaultdict, Counter
import json
from utils.io_utils import load_jsonl
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS = REPO_ROOT / "data/clinical_corpus"
patients = list(load_jsonl(CORPUS / "patients.jsonl"))
encounters = list(load_jsonl(CORPUS / "encounters.jsonl"))
notes = list(load_jsonl(CORPUS / "notes.jsonl"))
dialogues = list(load_jsonl(CORPUS / "dialogues.jsonl"))
conditions = list(load_jsonl(CORPUS / "conditions.jsonl"))
medications= list(load_jsonl(CORPUS / "medications.jsonl"))
patient_ids = {p["patient_id"] for p in patients}
encounter_ids = {e["encounter_id"] for e in encounters}
issues = []
warnings = []
# Referential integrity
for e in encounters:
if e["patient_id"] not in patient_ids:
issues.append(f"Encounter {e['encounter_id']} has orphaned patient_id")
for n in notes:
if n["encounter_id"] not in encounter_ids:
issues.append(f"Note {n['note_id']} has orphaned encounter_id")
if not n.get("note_text") or len(n["note_text"].split()) < 30:
issues.append(f"Note {n['note_id']} is empty or too short")
for d in dialogues:
if d["encounter_id"] not in encounter_ids:
issues.append(f"Dialogue {d['dialogue_id']} has orphaned encounter_id")
# Showcase / dialogue counts — PRECISE labeling (v1 fix)
showcase_enc_ids = {e["encounter_id"] for e in encounters if e.get("is_showcase")}
showcase_note_ids = {n["encounter_id"] for n in notes if n.get("is_showcase")}
dialogue_enc_ids = {d["encounter_id"] for d in dialogues}
aci_dialogue_ids = {d["encounter_id"] for d in dialogues
if "aci" in d.get("source", "")}
showcase_with_dlg = showcase_enc_ids & dialogue_enc_ids
showcase_aci_dlg = showcase_enc_ids & aci_dialogue_ids
showcase_no_dlg = showcase_enc_ids - dialogue_enc_ids
for enc_id in showcase_no_dlg:
warnings.append(f"Showcase encounter {enc_id} has no dialogue attached")
# Stats
specialty_dist = Counter(p["primary_specialty"] for p in patients)
source_dist = Counter(n["reference_source"] for n in notes)
score_vals = [e["match_score"] for e in encounters]
avg_score = sum(score_vals) / len(score_vals)
print(f"{'='*55}")
print(f"CORPUS AUDIT")
print(f"{'='*55}")
print(f"\nSCALE")
print(f" Patients: {len(patients)}")
print(f" Encounters: {len(encounters)}")
print(f" Notes: {len(notes)}")
print(f" Dialogues (total): {len(dialogues)}")
print(f" Conditions: {len(conditions)}")
print(f" Medications: {len(medications)}")
print(f"\nSHOWCASE ENCOUNTERS")
print(f" Total showcase: {len(showcase_enc_ids)}")
print(f" Showcase with ANY dialogue: {len(showcase_with_dlg)}")
print(f" Showcase with ACI dialogue: {len(showcase_aci_dlg)}")
print(f" Showcase with NO dialogue: {len(showcase_no_dlg)}")
print(f"\nSPECIALTY DISTRIBUTION")
for spec, count in sorted(specialty_dist.items(), key=lambda x: -x[1]):
print(f" {spec:<30} {count}")
print(f"\nNOTE SOURCE BREAKDOWN")
for source, count in sorted(source_dist.items(), key=lambda x: -x[1]):
print(f" {source:<20} {count}")
print(f"\nMATCH QUALITY")
print(f" Average: {avg_score:.3f}"
f" Min: {min(score_vals):.3f}"
f" Max: {max(score_vals):.3f}")
print(f"\n{'ISSUES' if issues else 'ISSUES: none'}")
for issue in issues:
print(f" ✗ {issue}")
print(f"\n{'WARNINGS' if warnings else 'WARNINGS: none'}")
for w in warnings:
print(f" ⚠ {w}")
# Write audit report
report_lines = [
"# Scribe-IQ Corpus Audit Report\n",
f"## Scale\n| Patients | {len(patients)} |\n"
f"| Encounters | {len(encounters)} |\n"
f"| Notes | {len(notes)} |\n"
f"| Dialogues | {len(dialogues)} |\n",
f"## Showcase\n"
f"| Total showcase | {len(showcase_enc_ids)} |\n"
f"| With ACI dialogue | {len(showcase_aci_dlg)} |\n"
f"| With any dialogue | {len(showcase_with_dlg)} |\n"
f"| No dialogue | {len(showcase_no_dlg)} |\n",
"## Issues\n" + ("None\n" if not issues else
"\n".join(f"- {i}" for i in issues) + "\n"),
"## Warnings\n" + ("None\n" if not warnings else
"\n".join(f"- {w}" for w in warnings) + "\n"),
]
(CORPUS / "audit_report.md").write_text("\n".join(report_lines))
print(f"\n✓ audit_report.md written to {CORPUS}/")
utils/mappings.py (corrected)¶
Duplicate SNOMED key 44054006 removed. Both entries were E11.x — kept the more
specific one (E11.9). Added CC_TO_SPECIALTY for ACI-Bench classification.
# ── SNOMED CT → ICD-10-CM (top Synthea conditions) ────────────────────────────
# Source: NLM SNOMED CT to ICD-10-CM Map (simplified lookup for top 50 codes)
# Full map: https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html
SNOMED_TO_ICD10 = {
# Endocrine
"44054006": "E11.9", # Type 2 diabetes mellitus (kept, removed duplicate)
"46635009": "E10.9", # Type 1 diabetes
"73595000": "E03.9", # Hypothyroidism
"55822004": "E78.5", # Hyperlipidemia
"162864005": "E66.9", # Obesity
"237599002": "E66.01", # Morbid obesity
# Cardiovascular
"38341003": "I10", # Essential hypertension
"59621000": "I10", # Hypertension NOS
"53741008": "I25.10", # CAD, unspecified
"49436004": "I48.91", # Atrial fibrillation
"230690007": "I63.9", # Cerebral infarction
# Respiratory
"195967001": "J45.909", # Asthma
"267102003": "J45.20", # Mild intermittent asthma
"703151001": "J45.51", # Moderate persistent asthma
"13645005": "J44.1", # COPD exacerbation
"233604007": "J15.9", # Pneumonia
"444814009": "J06.9", # Viral URI
"40122008": "J18.9", # Pneumonia NOS
"36971009": "J32.9", # Sinusitis
"43878008": "J02.9", # Strep pharyngitis
# GI
"34000006": "K50.90", # Crohn's disease
"15777000": "K21.0", # GERD with esophagitis
"74400008": "K37", # Appendicitis
# Musculoskeletal
"64859006": "M81.0", # Osteoporosis
"396275006": "M19.90", # Osteoarthritis
"69896004": "M06.9", # Rheumatoid arthritis
"370143000": "M54.5", # Low back pain
# Renal / Urology
"40055000": "N18.3", # CKD stage 3
"90688005": "N18.9", # CKD NOS
"9855000": "N40.1", # BPH
"57870002": "N39.0", # UTI
"11840006": "N20.0", # Kidney stone
"62106007": "N10", # Acute pyelonephritis
# Neurological
"84757009": "G40.909", # Epilepsy
"26929004": "G30.9", # Alzheimer's
# Psychiatric
"35489007": "F32.9", # Major depressive disorder
"197480006": "F41.1", # Generalized anxiety
# Dermatology
"415068001": "L40.0", # Psoriasis
# Haematology / Oncology
"271737000": "D64.9", # Anaemia
"93761005": "C18.9", # Colorectal cancer
"363346000": "C80.1", # Malignant neoplasm unspecified
# Other
"195662009": "J06.9", # Acute URI
"3928004": "H66.90", # Otitis media
"50043002": "J96.00", # Respiratory failure
"298705000": "M79.3", # Panniculitis
}
# ── ICD-10 first character → clinical specialty ────────────────────────────────
ICD10_TO_SPECIALTY = {
"A": "Infectious Disease",
"B": "Infectious Disease",
"C": "Oncology",
"D": "Hematology",
"E": "Endocrinology",
"F": "Psychiatry",
"G": "Neurology",
"H": "Ophthalmology",
"I": "Cardiology",
"J": "Pulmonology",
"K": "Gastroenterology",
"L": "Dermatology",
"M": "Orthopedics",
"N": "Nephrology",
"O": "Obstetrics",
"Q": "Pediatrics",
"R": "General Medicine",
"S": "Orthopedics",
"T": "Emergency Medicine",
"Z": "General Medicine",
}
# ── MTSamples specialty → standard specialty name ─────────────────────────────
MTSAMPLES_TO_STANDARD_SPECIALTY = {
"Cardiovascular / Pulmonary": "Cardiology",
"Neurology": "Neurology",
"Orthopedic": "Orthopedics",
"Gastroenterology": "Gastroenterology",
"Urology": "Urology",
"Nephrology": "Nephrology",
"Psychiatry / Psychology": "Psychiatry",
"Dermatology": "Dermatology",
"Endocrinology": "Endocrinology",
"Ophthalmology": "Ophthalmology",
"ENT - Otolaryngology": "Otorhinolaryngology",
"General Medicine": "General Medicine",
"Obstetrics / Gynecology": "Obstetrics",
"Pediatrics - Neonatal": "Pediatrics",
"Rheumatology": "Orthopedics",
"Hematology - Oncology": "Oncology",
"SOAP / Chart / Progress Notes": "General Medicine",
"Office Notes": "General Medicine",
"Consult - History and Phy.": "General Medicine",
"Pain Management": "Orthopedics",
"Sleep Medicine": "Pulmonology",
"Physical Medicine - Rehab": "Orthopedics",
"Podiatry": "Orthopedics",
"Allergy / Immunology": "Infectious Disease",
"Chiropractic": "Orthopedics",
}
# ── Chief complaint keywords → specialty (for ACI-Bench classification) ────────
# Keys are tuples of keywords, values are specialty strings
CC_TO_SPECIALTY = {
("chest", "cardiac", "heart", "palpitation", "angina", "coronary",
"hypertension", "blood pressure"): "Cardiology",
("seizure", "neuro", "headache", "migraine", "stroke", "memory",
"tremor", "epilepsy", "dizziness"): "Neurology",
("knee", "hip", "back pain", "joint", "fracture", "orthop",
"shoulder", "bone", "arthritis", "spine"): "Orthopedics",
("bowel", "abdominal", "crohn", "colitis", "gastro", "reflux",
"diarrhea", "stomach", "colon"): "Gastroenterology",
("breath", "asthma", "copd", "pulmon", "cough", "lung",
"respiratory", "wheez"): "Pulmonology",
("kidney", "renal", "dialysis", "ckd", "creatinine"): "Nephrology",
("skin", "rash", "dermat", "psoriasis", "eczema", "acne"): "Dermatology",
("diabetes", "thyroid", "endocrin", "glucose", "insulin",
"A1c", "weight"): "Endocrinology",
("depression", "anxiety", "psych", "bipolar", "mental",
"mood", "panic"): "Psychiatry",
("urin", "prostate", "bladder", "urolog", "incontinence"): "Urology",
("eye", "vision", "ophthalm", "cataract", "glaucoma"): "Ophthalmology",
("ear", "nose", "throat", "ent", "sinus", "hearing",
"tonsil"): "Otorhinolaryngology",
}
requirements.txt¶
README¶
# Scribe-IQ Data Pipeline
Generates a synthetic clinical corpus of 50 patients for the Scribe-IQ demo.
## Prerequisites
- Python 3.11+
- Java 11+ (for Synthea)
- Groq API key (free at https://console.groq.com)
## Setup
```bash
# From repo root
pip install -r data_prep/requirements.txt
# Download Synthea JAR to repo root
curl -L https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar \
-o synthea-with-dependencies.jar
# Clone ACI-Bench
git clone https://github.com/microsoft/clinical_visit_note_summarization_corpus \
data/raw/aci_bench
# Set Groq key
export GROQ_API_KEY=gsk_...
# Set Python path
export PYTHONPATH="$(pwd)/data_prep:$PYTHONPATH"
Run Pipeline¶
cd data_prep # ALL scripts run from here
bash scripts/01_generate_patients.sh
python scripts/02_build_note_pool.py
python scripts/03_reserve_aci_encounters.py
python scripts/04_match_and_score.py
python scripts/05_select_patients.py
python scripts/06_adapt_notes.py
python scripts/07_assemble_corpus.py
python scripts/08_generate_dataset_card.py
python scripts/09_validate_corpus.py
Resumability¶
Each script reads from the previous script's output. If a script fails, fix the issue and re-run only that script. Scripts 01-05 are idempotent (safe to re-run, they overwrite output).
Estimated Runtime¶
| Script | Runtime |
|---|---|
| 01 generate patients | ~5 min |
| 02 build note pool | ~20 min |
| 03 reserve ACI | seconds |
| 04 match and score | ~45 min (M1 Max) |
| 05 select patients | seconds |
| 06 adapt notes | ~25 min (Groq) |
| 07 assemble corpus | ~5 min |
| 08 dataset card | seconds |
| 09 validate | seconds |
| Total | ~2 hours |
Relation to lakehouse-era proposals¶
Historical proposals sometimes assumed a lakehouse/ script tree alongside demos; this repository executes corpus work only under data_prep/. Written lineage and diagrams live under docs/archive/agbonnet_lakehouse_precursor_proposal_v2.md and docs/history/EVOLUTION.md.
Do not recreate a competing lakehouse/scripts tree for new work.
Appendix: Implementation corrections (merged)¶
Merged (2026-05-06): Former standalone
SCRIBE_IQ_IMPLEMENTATION_CORRECTED.mdnarrative is reproduced below so this brief remains a single source of truth. The archive retains a short supersession stub only.
Changes From Previous Version¶
Bugs Fixed:¶
- ✅
encounter_reasonNaN handling (sanitize to string) - ✅ Gender normalization for ACI matching (M/F consistent)
- ✅ Deterministic patient iteration (file order, not set)
- ✅ Field naming: use
adapted_note_id(matches existing 07) - ✅ Showcase notes: single file pattern (adapted_notes.jsonl)
- ✅ Use existing ACI reservations (no duplicate logic)
- ✅ Groq client in main() (not at import)
- ✅ Model from env with safe default
- ✅ Use
load_synthea/compute_agefrom utils - ✅ Aligned corpus directory name
Script 05.5 — Extract Longitudinal Context (CORRECTED)¶
File: data_prep/scripts/05.5_extract_longitudinal_context.py
"""
05.5_extract_longitudinal_context.py
Extracts longitudinal context for golden patient encounters.
CORRECTED VERSION: Fixes from agent review applied.
"""
from pathlib import Path
import json
from collections import defaultdict
def sanitize_string(value, default="Not documented"):
"""Convert encounter_reason to valid string (handles NaN/float/None)."""
if value is None or (isinstance(value, float) and value != value): # NaN check
return default
return str(value).strip() if str(value).strip() else default
def main():
REPO_ROOT = Path(__file__).resolve().parents[2]
GOLDEN = REPO_ROOT / "data/staging/selected_patients_golden.jsonl"
MATCHES = REPO_ROOT / "data/staging/match_results.jsonl"
OUTPUT = REPO_ROOT / "data/staging/patient_longitudinal_context.jsonl"
K_PRIOR_VISITS = 3
print("="*60)
print("EXTRACTING LONGITUDINAL CONTEXT")
print("="*60)
# Load golden patient IDs IN FILE ORDER (deterministic)
golden_patients = [json.loads(line) for line in open(GOLDEN)]
golden_ids = [p["patient_id"] for p in golden_patients]
print(f"\nGolden patients: {len(golden_ids)} (file order preserved)")
# Load matches for golden patients only (filter while reading)
matches = []
with open(MATCHES) as f:
for line in f:
m = json.loads(line)
if m["patient_id"] in golden_ids:
matches.append(m)
print(f"Total encounters for golden patients: {len(matches)}")
# Group by patient
by_patient = defaultdict(list)
for m in matches:
by_patient[m["patient_id"]].append(m)
# Build longitudinal context
OUTPUT.unlink(missing_ok=True)
total_contexts = 0
# Process in file order for deterministic output
for pid in golden_ids:
encs = by_patient[pid]
# Sort chronologically
encs_sorted = sorted(encs, key=lambda x: x["encounter_date"])
for i, current_enc in enumerate(encs_sorted):
# Get K=3 prior visits (or fewer if at start)
prior_start = max(0, i - K_PRIOR_VISITS)
prior_encs = encs_sorted[prior_start:i]
# Build prior visit blocks
prior_blocks = []
for prior in prior_encs:
# Sanitize encounter_reason (can be NaN/float)
reason = sanitize_string(prior.get("encounter_reason"))
# Cap lists
conditions = prior.get("conditions", [])[:5]
medications = prior.get("medications", [])[:5]
obs = prior.get("recent_obs", [])[:3]
prior_blocks.append({
"date": prior["encounter_date"][:10],
"reason": reason,
"conditions": conditions,
"medications": medications,
"key_observations": [
f"{o.get('DESCRIPTION', 'Unknown')}: {o.get('VALUE', '')} {o.get('UNITS', '')}".strip()
for o in obs
],
})
# Write context record
with open(OUTPUT, "a") as f:
f.write(json.dumps({
"patient_id": pid,
"encounter_id": current_enc["encounter_id"],
"encounter_date": current_enc["encounter_date"],
"prior_visits": prior_blocks,
"num_prior_visits": len(prior_blocks),
}) + "\n")
total_contexts += 1
print(f"\n✓ Longitudinal context extracted")
print(f" Total context records: {total_contexts}")
print(f" Output: {OUTPUT}")
if __name__ == "__main__":
main()
Script 06 — Adapt Notes (CORRECTED)¶
File: data_prep/scripts/06_adapt_notes.py
"""
06_adapt_notes.py
Adapts notes with longitudinal context.
CORRECTED VERSION: Fixes all bugs from agent review.
Key fixes:
- Groq client in main(), not at import
- Model from env with safe default
- Use utils.synthea_utils (not raw pandas)
- encounter_reason sanitization
- Field naming: adapted_note_id (not note_id)
- Showcase notes written to same file
"""
import os
import json
import time
from pathlib import Path
from collections import defaultdict
from groq import Groq
from utils.synthea_utils import load_synthea, compute_age
def sanitize_string(value, default="Not documented"):
"""Convert to valid string (handles NaN/float/None)."""
if value is None or (isinstance(value, float) and value != value):
return default
return str(value).strip() if str(value).strip() else default
def format_prior_context(prior_blocks):
"""Format prior visit blocks for prompt."""
if not prior_blocks:
return "No prior visits in recent history."
lines = []
for i, visit in enumerate(prior_blocks, 1):
lines.append(f"\nVisit {i} ({visit['date']}):")
lines.append(f" Reason: {visit['reason']}")
if visit.get('conditions'):
lines.append(f" Conditions: {', '.join(visit['conditions'])}")
if visit.get('medications'):
lines.append(f" Medications: {', '.join(visit['medications'])}")
if visit.get('key_observations'):
obs_text = '; '.join(visit['key_observations'])
lines.append(f" Key findings: {obs_text}")
return '\n'.join(lines)
def format_observations(obs_list):
"""Format observation list for prompt."""
if not obs_list:
return " None recorded"
lines = []
for o in obs_list[:5]:
desc = o.get('DESCRIPTION', 'Unknown')
val = o.get('VALUE', '')
units = o.get('UNITS', '')
lines.append(f" - {desc}: {val} {units}".strip())
return '\n'.join(lines) if lines else " None recorded"
def main():
REPO_ROOT = Path(__file__).resolve().parents[2]
GOLDEN = REPO_ROOT / "data/staging/selected_patients_golden.jsonl"
MATCHES = REPO_ROOT / "data/staging/match_results.jsonl"
CONTEXT = REPO_ROOT / "data/staging/patient_longitudinal_context.jsonl"
ACI_RES = REPO_ROOT / "data/staging/aci_reservations.jsonl"
SYNTHEA_DIR = REPO_ROOT / "data/raw/synthea/csv"
OUTPUT = REPO_ROOT / "data/staging/adapted_notes.jsonl"
# Model from env with safe default
MODEL = os.environ.get("GROQ_MODEL", "llama-3.1-70b-versatile")
# Initialize Groq client here, not at import
if "GROQ_API_KEY" not in os.environ:
print("ERROR: GROQ_API_KEY not set")
exit(1)
client = Groq(api_key=os.environ["GROQ_API_KEY"])
print("="*60)
print("ADAPTING NOTES WITH LONGITUDINAL CONTEXT")
print("="*60)
print(f"Model: {MODEL}")
# Load longitudinal context
context_lookup = {}
with open(CONTEXT) as f:
for line in f:
ctx = json.loads(line)
context_lookup[(ctx["patient_id"], ctx["encounter_id"])] = ctx
print(f"\nLoaded context for {len(context_lookup)} encounters")
# Load golden patient IDs in file order
golden_patients = [json.loads(line) for line in open(GOLDEN)]
golden_ids = [p["patient_id"] for p in golden_patients]
# Load matches for golden patients only
matches = []
with open(MATCHES) as f:
for line in f:
m = json.loads(line)
if m["patient_id"] in golden_ids:
matches.append(m)
print(f"Encounters to process: {len(matches)}")
# Group by patient
by_patient = defaultdict(list)
for m in matches:
by_patient[m["patient_id"]].append(m)
# Load Synthea using utils
synthea = load_synthea(SYNTHEA_DIR)
patients_df = synthea["patients"]
# Load ACI reservations if they exist
aci_reserved = {}
if ACI_RES.exists():
with open(ACI_RES) as f:
for line in f:
res = json.loads(line)
aci_reserved[res["note_id"]] = res
print(f"ACI reservations loaded: {len(aci_reserved)}")
# Adaptation prompt
ADAPT_PROMPT = """\
You are a clinical documentation specialist.
PATIENT HISTORY (prior visits):
{prior_context}
TODAY'S VISIT:
- Date: {visit_date}
- Patient: {age}-year-old {sex}
- Visit reason: {visit_reason}
- Active conditions: {conditions}
- Current medications: {medications}
- Recent vitals/labs:
{observations}
REFERENCE NOTE (use this structure and clinical writing style):
{reference_note}
Instructions:
1. Write a clinical note for TODAY'S VISIT using the reference note's structure
2. Reference relevant information from prior visits naturally where appropriate
3. Only include conditions and medications listed above for today
4. Do not invent lab values or procedures not mentioned
5. Keep the same professional tone and section format as the reference
6. Do not include patient names or identifiers
Write the adapted note now:"""
def call_groq(prompt, retries=3):
"""Call Groq with retry logic."""
for attempt in range(retries):
try:
response = client.chat.completions.create(
model=MODEL,
messages=[{"role": "user", "content": prompt}],
temperature=0.2,
max_tokens=900,
)
return response.choices[0].message.content.strip()
except Exception as e:
if attempt < retries - 1:
print(f" Retry {attempt + 1}: {e}")
time.sleep(2 ** attempt)
else:
raise
# Process all golden patients
OUTPUT.unlink(missing_ok=True)
note_idx = 0
total_prior_notes = 0
total_showcase = 0
# Process in file order (deterministic)
for pid in golden_ids:
pt_row = patients_df[patients_df["Id"] == pid].iloc[0]
sex = "Male" if pt_row["GENDER"] == "M" else "Female"
# Get encounters sorted chronologically
encs = sorted(by_patient[pid], key=lambda x: x["encounter_date"])
# All but last = prior visits
# Last = showcase
prior_visits = encs[:-1]
showcase = encs[-1]
print(f"\n{pid} — {len(encs)} encounters ({len(prior_visits)} prior + 1 showcase)")
# ── Adapt prior visits ────────────────────────────────────────
for enc in prior_visits:
if not enc.get("best_note_text"):
print(f" [{enc['encounter_date'][:10]}] no reference note, skipping")
continue
# Get context
ctx = context_lookup.get((pid, enc["encounter_id"]))
if not ctx:
print(f" [{enc['encounter_date'][:10]}] no context, skipping")
continue
# Compute age
age = compute_age(pt_row["BIRTHDATE"], enc["encounter_date"])
# Sanitize encounter_reason
reason = sanitize_string(enc.get("encounter_reason"), "Follow-up visit")
# Build prompt
prompt = ADAPT_PROMPT.format(
prior_context=format_prior_context(ctx["prior_visits"]),
visit_date=enc["encounter_date"][:10],
age=age,
sex=sex,
visit_reason=reason,
conditions=", ".join(enc.get("conditions", [])) or "None documented",
medications=", ".join(enc.get("medications", [])) or "None documented",
observations=format_observations(enc.get("recent_obs", [])),
reference_note=enc["best_note_text"][:2500],
)
print(f" [{enc['encounter_date'][:10]}] adapting "
f"({enc['best_note_source']}, {ctx['num_prior_visits']} prior)...",
end=" ", flush=True)
try:
adapted_text = call_groq(prompt)
# Write with correct field names
with open(OUTPUT, "a") as f:
f.write(json.dumps({
"adapted_note_id": f"note_{note_idx:06d}", # NOT note_id
"encounter_id": enc["encounter_id"],
"patient_id": pid,
"encounter_date": enc["encounter_date"],
"note_text": adapted_text,
"reference_source": enc["best_note_source"],
"reference_note_id": enc.get("best_note_id"),
"match_score": enc["match_score"],
"num_prior_visits": ctx["num_prior_visits"],
"adaptation_model": MODEL,
"is_showcase": False,
"has_dialogue": False,
}) + "\n")
note_idx += 1
total_prior_notes += 1
print("✓")
except Exception as e:
print(f"✗ Error: {e}")
# Rate limit courtesy (optional, can remove if too slow)
# time.sleep(0.5)
# ── Showcase encounter ────────────────────────────────────────
# Check if ACI reserved for this encounter
showcase_note_id = showcase.get("best_note_id")
if showcase_note_id and showcase_note_id in aci_reserved:
# Use ACI-Bench directly (already reserved)
aci = aci_reserved[showcase_note_id]
with open(OUTPUT, "a") as f:
f.write(json.dumps({
"adapted_note_id": f"note_{note_idx:06d}",
"encounter_id": showcase["encounter_id"],
"patient_id": pid,
"encounter_date": showcase["encounter_date"],
"note_text": aci["note_text"],
"dialogue": aci.get("dialogue"),
"reference_source": "aci_bench",
"reference_note_id": aci["note_id"],
"match_score": showcase["match_score"],
"num_prior_visits": 0, # showcase doesn't use prior context
"adaptation_model": "aci_bench_direct",
"is_showcase": True,
"has_dialogue": True,
}) + "\n")
note_idx += 1
total_showcase += 1
print(f" [{showcase['encounter_date'][:10]}] showcase → ACI-Bench")
elif showcase.get("best_note_dialogue"):
# Fallback: MedSynth dialogue
with open(OUTPUT, "a") as f:
f.write(json.dumps({
"adapted_note_id": f"note_{note_idx:06d}",
"encounter_id": showcase["encounter_id"],
"patient_id": pid,
"encounter_date": showcase["encounter_date"],
"note_text": showcase.get("best_note_text", ""),
"dialogue": showcase["best_note_dialogue"],
"reference_source": "medsynth",
"reference_note_id": showcase.get("best_note_id"),
"match_score": showcase["match_score"],
"num_prior_visits": 0,
"adaptation_model": "medsynth_direct",
"is_showcase": True,
"has_dialogue": True,
}) + "\n")
note_idx += 1
total_showcase += 1
print(f" [{showcase['encounter_date'][:10]}] showcase → MedSynth dialogue")
else:
# No dialogue available, adapt note
ctx = context_lookup.get((pid, showcase["encounter_id"]))
if ctx and showcase.get("best_note_text"):
age = compute_age(pt_row["BIRTHDATE"], showcase["encounter_date"])
reason = sanitize_string(showcase.get("encounter_reason"), "Follow-up")
prompt = ADAPT_PROMPT.format(
prior_context=format_prior_context(ctx["prior_visits"]),
visit_date=showcase["encounter_date"][:10],
age=age,
sex=sex,
visit_reason=reason,
conditions=", ".join(showcase.get("conditions", [])) or "None",
medications=", ".join(showcase.get("medications", [])) or "None",
observations=format_observations(showcase.get("recent_obs", [])),
reference_note=showcase["best_note_text"][:2500],
)
adapted_text = call_groq(prompt)
with open(OUTPUT, "a") as f:
f.write(json.dumps({
"adapted_note_id": f"note_{note_idx:06d}",
"encounter_id": showcase["encounter_id"],
"patient_id": pid,
"encounter_date": showcase["encounter_date"],
"note_text": adapted_text,
"dialogue": None,
"reference_source": showcase["best_note_source"],
"reference_note_id": showcase.get("best_note_id"),
"match_score": showcase["match_score"],
"num_prior_visits": ctx["num_prior_visits"],
"adaptation_model": MODEL,
"is_showcase": True,
"has_dialogue": False,
}) + "\n")
note_idx += 1
total_showcase += 1
print(f" [{showcase['encounter_date'][:10]}] showcase → adapted (no dialogue)")
print(f"\n{'='*60}")
print(f"ADAPTATION COMPLETE")
print(f"{'='*60}")
print(f"Prior visit notes: {total_prior_notes}")
print(f"Showcase notes: {total_showcase}")
print(f"Total: {note_idx}")
print(f"Output: {OUTPUT}")
if __name__ == "__main__":
main()
Script 06.5 — Match ACI (SIMPLIFIED - uses existing reservations)¶
File: data_prep/scripts/06.5_verify_aci_coverage.py
NOTE: This is now a verification script only. The actual ACI matching should use your existing aci_reservations.jsonl from script 03. This avoids duplicate logic.
"""
06.5_verify_aci_coverage.py
Verifies ACI-Bench dialogue coverage for golden patients.
Uses EXISTING aci_reservations.jsonl (from script 03).
This is a verification-only script. If you don't have aci_reservations.jsonl,
run script 03 first to create it.
"""
import json
from pathlib import Path
from collections import Counter
def main():
REPO_ROOT = Path(__file__).resolve().parents[2]
GOLDEN = REPO_ROOT / "data/staging/selected_patients_golden.jsonl"
MATCHES = REPO_ROOT / "data/staging/match_results.jsonl"
ACI_RES = REPO_ROOT / "data/staging/aci_reservations.jsonl"
print("="*60)
print("VERIFYING ACI-BENCH COVERAGE")
print("="*60)
# Load golden IDs
golden_patients = [json.loads(line) for line in open(GOLDEN)]
golden_ids = set(p["patient_id"] for p in golden_patients)
print(f"\nGolden patients: {len(golden_ids)}")
# Load matches for golden patients
matches = []
with open(MATCHES) as f:
for line in f:
m = json.loads(line)
if m["patient_id"] in golden_ids:
matches.append(m)
# Group by patient, find last encounter
from collections import defaultdict
by_patient = defaultdict(list)
for m in matches:
by_patient[m["patient_id"]].append(m)
showcase_encounters = {}
for pid in golden_ids:
encs = sorted(by_patient[pid], key=lambda x: x["encounter_date"])
showcase = encs[-1]
showcase_encounters[pid] = showcase
# Load ACI reservations
if not ACI_RES.exists():
print(f"\n⚠️ WARNING: {ACI_RES} not found")
print("Run script 03_reserve_aci_encounters.py first to create reservations")
return
aci_reserved = {}
with open(ACI_RES) as f:
for line in f:
res = json.loads(line)
aci_reserved[res["note_id"]] = res
print(f"ACI reservations loaded: {len(aci_reserved)}")
# Check coverage
dialogue_sources = []
for pid, showcase in showcase_encounters.items():
note_id = showcase.get("best_note_id")
if note_id and note_id in aci_reserved:
dialogue_sources.append("aci_bench")
elif showcase.get("best_note_dialogue"):
dialogue_sources.append("medsynth")
else:
dialogue_sources.append("none")
source_counts = Counter(dialogue_sources)
print(f"\n{'='*60}")
print(f"DIALOGUE COVERAGE FOR {len(golden_ids)} SHOWCASES")
print(f"{'='*60}")
print(f"ACI-Bench: {source_counts['aci_bench']}")
print(f"MedSynth: {source_counts['medsynth']}")
print(f"None: {source_counts['none']}")
print(f"Total coverage: {(source_counts['aci_bench'] + source_counts['medsynth']) / len(golden_ids) * 100:.0f}%")
if __name__ == "__main__":
main()
Utils Update¶
File: data_prep/utils/synthea_utils.py
Add this if not present:
def compute_age(birth_date_str: str, as_of_date_str: str) -> int:
"""Compute patient age at a specific date."""
import pandas as pd
birth = pd.to_datetime(birth_date_str)
as_of = pd.to_datetime(as_of_date_str)
return int((as_of - birth).days / 365.25)
Corrected Implementation Sequence¶
Prerequisites¶
# Must have these files already:
# - data/staging/selected_patients_golden.jsonl
# - data/staging/match_results.jsonl
# - data/staging/aci_reservations.jsonl (from script 03)
# - data/raw/synthea/csv/
export GROQ_API_KEY=your_key
export GROQ_MODEL=llama-3.1-70b-versatile # optional, has safe default
Run Sequence¶
cd data_prep
# Step 1: Extract context
python scripts/05.5_extract_longitudinal_context.py
# Step 2: Adapt notes (includes showcases)
python scripts/06_adapt_notes.py
# Step 3: Verify dialogue coverage
python scripts/06.5_verify_aci_coverage.py
# Step 4: Assemble corpus (existing script, should work as-is)
python scripts/07_assemble_corpus.py
Key Differences From Previous Version¶
| Issue | Previous | Corrected |
|---|---|---|
| encounter_reason | Direct use (crashes on NaN) | sanitize_string() helper |
| Patient order | Set iteration (random) | File order (deterministic) |
| Field names | note_id |
adapted_note_id (matches 07) |
| Showcase pattern | Two files | Single adapted_notes.jsonl |
| ACI logic | New script 06.5 | Uses existing aci_reservations.jsonl |
| Groq client | Import-time | In main() |
| Model ID | Hardcoded | From env with default |
| Synthea loading | Raw pandas | utils.synthea_utils |
Validation¶
After running, verify field names match:
# Check adapted_note_id is used (not note_id)
head -1 data/staging/adapted_notes.jsonl | python -m json.tool | grep note_id
# Should show "adapted_note_id"
# Check showcases are in same file
jq 'select(.is_showcase == true)' data/staging/adapted_notes.jsonl | wc -l
# Should show 19
# Check deterministic output (run twice)
python scripts/05.5_extract_longitudinal_context.py
mv data/staging/patient_longitudinal_context.jsonl /tmp/run1.jsonl
python scripts/05.5_extract_longitudinal_context.py
diff /tmp/run1.jsonl data/staging/patient_longitudinal_context.jsonl
# Should be identical (no output from diff)
Summary of Corrections¶
✅ All bugs from agent review fixed
✅ Uses existing ACI reservations (no duplicate logic)
✅ Single file for all notes (prior + showcase)
✅ Deterministic output (file order)
✅ Robust string handling (NaN-safe)
✅ Proper utils integration
✅ Env-based configuration
This version is ready for the coding agent to implement.
Instructions for the Coding Agent¶
Paste this document as a project brief. Then:
- "Create the full directory structure"
- "Implement utils/mappings.py, utils/io_utils.py, utils/synthea_utils.py, utils/note_checks.py"
- "Implement scripts/01 through 09 one at a time, confirm each compiles before continuing"
- "Run script 01 and show me the CSV output file list"
- Continue sequentially — do not skip ahead until each script runs successfully
Each script is independent and debuggable in isolation. The pipeline is fully resumable — fix and rerun any single script without rerunning earlier steps.