Pipeline Scripts
This section exposes the deterministic pipeline behind the site. Rather than hiding the build logic, it lists the 31 Python scripts that initialize the database, parse filenames, extract references, match folios to images, seed and enrich metadata, validate assumptions, and generate the static pages. The aim is transparency and reproducibility.
The project's architectural stance is deliberately conservative: SQLite as source of truth, Python for transformation, JSON and static HTML for delivery, and as little framework machinery as possible. That simplicity is not an omission; it is part of the project's long-term durability model.
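The flow from SQLite through Python to JSON and static HTML can be illustrated with a minimal sketch. It is not taken from build_site.py: the `terms` table, its `slug` and `label` columns, and the function names `export_terms` and `render_page` are all hypothetical, chosen only to show the shape of the approach using the standard library.

```python
import json
import sqlite3
from html import escape


def export_terms(conn):
    """Read rows from SQLite (the source of truth) into plain dicts."""
    rows = conn.execute("SELECT slug, label FROM terms ORDER BY slug").fetchall()
    return [{"slug": slug, "label": label} for slug, label in rows]


def render_page(terms):
    """Render the exported records as a small static HTML fragment."""
    items = "\n".join(
        f'<li id="{escape(t["slug"])}">{escape(t["label"])}</li>' for t in terms
    )
    return f"<!doctype html>\n<ul>\n{items}\n</ul>\n"


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (slug TEXT PRIMARY KEY, label TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO terms VALUES (?, ?)",
    [("folio", "Folio"), ("annotator-hand", "Annotator hand")],
)

terms = export_terms(conn)
data_json = json.dumps(terms, indent=2)  # JSON delivery format
page_html = render_page(terms)           # static HTML delivery format
```

Because every step is an ordinary function over query results, the same database always yields the same JSON and HTML, which is the property the pipeline depends on.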
| Script | Name | Description | Lines |
|---|---|---|---|
| add_alchemist_descriptions.py | Add Alchemist Descriptions | Inserts 13 folio-specific scholarly descriptions for the two alchemist annotators from Russell Ch. 6-7. | 352 |
| add_bibliography.py | Add Bibliography | Populates the bibliography (58 entries), scholars (29), and timeline (39 events) from hardcoded research data. | 327 |
| add_hands.py | Add Annotator Hands | Creates 11 annotator hand profiles and attributes dissertation references to specific hands. | 443 |
| build_essay_data.py | Build Essay Data | Extracts structured evidence from DB and corpus for the Russell and Concordance essays. | 318 |
| build_reading_packets.py | Build Reading Packets | Assembles structured research packets from corpus search for dictionary enrichment. | 158 |
| build_scholar_profiles.py | Build Scholar Profiles (Legacy) | Original scholar page generator from summaries.json. Superseded by build_site.py. | 283 |
| build_signature_map.py | Build Signature Map | Generates the 448-entry signature-to-folio concordance from the Aldine collation formula (a-z, A-G). | 103 |
| build_site.py | Build Site | Unified site generator: exports data.json, builds all HTML pages (scholars, dictionary, marginalia, bibliography, docs, code, about). | 4048 |
| catalog_images.py | Catalog Images | Parses image filenames from BL and Siena collections into the images table with folio/side metadata. | 224 |
| chunk_documents.py | Chunk Documents | Splits markdown files into ~1500-word semantic chunks for RAG/retrieval systems. | 261 |
| corpus_search.py | Corpus Search | Keyword-based search across markdown chunks and documents with provenance tracking. | 220 |
| dictionary_audit.py | Dictionary Audit | Audits dictionary coverage: missing fields, duplicate slugs, orphaned links, weak terms. | 160 |
| enrich_dictionary.py | Enrich Dictionary | Populates dictionary fields from reading packets with source provenance and review status. | 170 |
| export_showcase_data.py | Export Showcase Data (Legacy) | Original data.json exporter for the gallery. Superseded by build_site.py. | 116 |
| extract_references.py | Extract References | Uses PyMuPDF + regex to extract 282 folio/signature references from Russell's PhD thesis PDF. | 176 |
| generate_dictionary_significance.py | Generate Significance | Generates significance_to_hp and significance_to_scholarship prose for all 80+ dictionary terms. | 447 |
| generate_scholar_overviews.py | Generate Scholar Overviews | Generates 2-3 paragraph overview prose for modern scholars and role descriptions for historical figures. | 347 |
| ingest_perplexity.py | Ingest Perplexity Research | Adds 9 bibliography entries and 3 timeline events from HPPERPLEXITY.txt web research. | 217 |
| init_db.py | Initialize Database | Creates SQLite schema (7 core tables) and catalogs PDFs/documents from the filesystem. | 221 |
| link_scholars.py | Link Scholars | Links scholars to bibliography entries, tags historical figures, and matches summaries.json to bibliography entries. | 205 |
| match_refs_to_images.py | Match Refs to Images | SQL join pipeline matching dissertation references to manuscript images via the signature map. | 142 |
| migrate_dictionary_v2.py | Dictionary Schema V2 | Extends dictionary_terms with significance, source tracking, provenance, and confidence columns. | 55 |
| migrate_timeline.py | Timeline Migration | Adds category, medium, location, image_ref, confidence columns to timeline_events table. | 41 |
| migrate_v2.py | Schema Migration V2 | Adds annotations, annotators, doc_folio_refs, dictionary tables, review/provenance columns. Downgrades BL confidence. | 389 |
| pdf_to_markdown.py | PDF to Markdown | Extracts all PDFs to markdown with YAML frontmatter, page markers, and metadata lookup. | 373 |
| seed_copies.py | Seed Copies | Creates hp_copies table and seeds six annotated copies with full metadata from Russell 2014. | 146 |
| seed_dictionary.py | Seed Dictionary | Inserts 37 dictionary terms across 6 categories with 76 bidirectional cross-reference links. | 429 |
| seed_dictionary_v2.py | Seed Dictionary V2 | Seeds 43 HP entity terms: characters, places, architecture, gardens, processions, aesthetics, materials. | 543 |
| seed_dictionary_v3.py | Seed Dictionary V3 | Seeds 14 additional terms: narrative form, built form, aesthetics, alchemy, material culture. | 260 |
| seed_timeline_v2.py | Seed Timeline V2 | Seeds ~30 new timeline events: art, literary influence, scholarly milestones, garden design. | 249 |
| validate.py | Validate & QA | Checks data integrity (duplicate slugs, broken links, confidence distribution) and writes AUDIT_REPORT.md. | 264 |
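
As an illustration of the kind of integrity check listed for validate.py, here is a minimal sketch of a duplicate-slug query. The table name `dictionary_terms` appears in the table above; the `slug` and `label` columns, the sample rows, and the function `find_duplicate_slugs` are assumptions made for this example, not the script's actual code.

```python
import sqlite3


def find_duplicate_slugs(conn):
    """Return (slug, count) pairs for slugs that occur more than once."""
    query = """
        SELECT slug, COUNT(*) AS n
        FROM dictionary_terms
        GROUP BY slug
        HAVING COUNT(*) > 1
        ORDER BY slug
    """
    return conn.execute(query).fetchall()


# Hypothetical schema and rows; the real columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dictionary_terms (id INTEGER PRIMARY KEY, slug TEXT, label TEXT)"
)
conn.executemany(
    "INSERT INTO dictionary_terms (slug, label) VALUES (?, ?)",
    [
        ("marginalia", "Marginalia"),
        ("signature", "Signature"),
        ("marginalia", "Marginalia (duplicate)"),
    ],
)

duplicates = find_duplicate_slugs(conn)
```

An empty result means every slug is unique. A non-empty result is the sort of finding that an audit report such as AUDIT_REPORT.md could record.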