Build Pipeline
PIPELINE.md — 622 words
PIPELINE: How Data Flows from Source to Site
No abstraction. No speculation. Just the actual scripts and their order.
End-to-End Pipeline
STAGE 1: INGEST
init_db.py → Creates schema (24 tables incl. image_readings)
catalog_images.py → Parses 674 image filenames → images table (with master_path + web_path)
build_signature_map.py → 1499 collation formula → 448 signature_map entries
STAGE 2: EXTRACT
pdf_to_markdown.py → 37 PDFs → 37 markdown files in md/
chunk_documents.py → 37 markdowns → ~200 chunks in chunks/
extract_references.py → Russell thesis → 282 dissertation_refs
STAGE 3: MATCH
match_refs_to_images.py → Joins refs + images via signature_map → matches
fix_bl_offset.py → Corrects BL folio numbers (offset=13)
build_bl_ground_truth.py → Verified folio mapping from image reading
rebuild_bl_matches.py → Rebuilds BL matches with correct data → 39 BL matches
STAGE 4: ENRICH
add_hands.py → 11 annotator hand profiles
add_bibliography.py → 109 bibliography + 60 scholars + timeline events
consolidate_annotations.py → Migrates dissertation_refs → annotations (282)
classify_annotations.py → Assigns 6 annotation types
seed_dictionary.py → 37 base terms
seed_dictionary_v2.py → 43 HP entity terms
seed_dictionary_v3.py → 14 additional terms
migrate_dictionary_v2.py → Extends schema with significance columns
build_reading_packets.py → 94 evidence packets from corpus
enrich_dictionary.py → Populates source docs, quotes, refs
generate_dictionary_significance.py → significance_to_hp/scholarship for 80 terms
generate_significance_v3.py → significance for 14 more terms
link_scholars.py → scholar_works junction, historical figure tagging
generate_scholar_overviews.py → 38 scholar overviews
generate_remaining_overviews.py → 21 more overviews
migrate_marginalia.py → alchemical_symbols + symbol_occurrences tables
extract_alchemical_data.py → 26 symbol occurrences from thesis evidence
add_alchemist_descriptions.py → 13 folio descriptions
seed_timeline_v2.py → 29 new timeline events
seed_copies.py → 6 hp_copies entries
seed_woodcuts.py → 18 woodcut entries
STAGE 4.5: IMAGE READING INFRASTRUCTURE
migrate_v3_image_reading.py → Schema v3: image_readings table + expanded CHECKs
image_utils.py → Shared path validation (master vs web enforcement)
backfill_previous_readings.py → 30 historical readings into image_readings
STAGE 4.6: VISUAL GROUND TRUTH (Phase 1)
read_images.py --phase 1 → 189 BL photos read via Claude Code vision
→ 189 image_readings rows (phase=1)
→ 189 raw JSON files in staging/image_readings/bl/phase1/
→ BL offset confirmed at 174/174 points
→ 60 woodcuts detected
STAGE 5: BUILD
build_site.py → Generates all 365 HTML pages + data.json
Script Dependencies
Scripts must run in stage order. Within each stage, order matters:
Stage 1: init_db → catalog_images, build_signature_map (parallel OK)
Stage 2: pdf_to_markdown → chunk_documents; extract_references (parallel OK)
Stage 3: match_refs_to_images → fix_bl_offset → build_bl_ground_truth → rebuild_bl_matches (sequential)
Stage 4: All enrichment scripts depend on Stage 1-3 tables existing. Most are idempotent.
Stage 5: build_site.py depends on all data being populated.
Rebuild from Scratch
python scripts/init_db.py
python scripts/build_signature_map.py
python scripts/catalog_images.py
python scripts/extract_references.py
python scripts/match_refs_to_images.py
python scripts/fix_bl_offset.py
python scripts/build_bl_ground_truth.py
python scripts/rebuild_bl_matches.py
python scripts/add_hands.py
python scripts/add_bibliography.py
python scripts/migrate_v2.py
python scripts/consolidate_annotations.py
python scripts/classify_annotations.py
python scripts/seed_dictionary.py
python scripts/seed_dictionary_v2.py
python scripts/seed_dictionary_v3.py
python scripts/migrate_dictionary_v2.py
python scripts/build_reading_packets.py
python scripts/enrich_dictionary.py
python scripts/generate_dictionary_significance.py
python scripts/generate_significance_v3.py
python scripts/link_scholars.py
python scripts/generate_scholar_overviews.py
python scripts/generate_remaining_overviews.py
python scripts/migrate_timeline.py
python scripts/seed_timeline_v2.py
python scripts/migrate_marginalia.py
python scripts/extract_alchemical_data.py
python scripts/add_alchemist_descriptions.py
python scripts/seed_copies.py
python scripts/seed_woodcuts.py
python scripts/migrate_v3_image_reading.py
python scripts/backfill_previous_readings.py
# Phase 1 readings: run read_images.py --ingest with pre-computed results
python scripts/build_site.py
What Each Stage Produces
| Stage | Input | Output | Rows Created |
|---|---|---|---|
| Ingest | Image directories, collation formula | images (674), signature_map (448) | 1122 |
| Extract | PDFs | md/ (37 files), chunks/ (~200), dissertation_refs (282) | 282 |
| Match | Refs + images + map | matches (431) | 431 |
| Enrich | All above + corpus | annotations, dictionary, scholars, bibliography, timeline, symbols, woodcuts | ~1000 |
| Image Infra | Schema v3 migration | image_readings table, expanded CHECKs | 30 (backfill) |
| Phase 1 | Master BL images (via Claude Code vision) | 189 image_readings rows, 189 staging JSONs | 189 |
| Build | All tables | 365 HTML pages + data.json | — |