Mistakes to Avoid

docs/archive/MISTAKESTOAVOID.md

MISTAKESTOAVOID: Lessons from the HP Marginalia Project

Hard-won takeaways from building a digital humanities database and static site across multiple long Claude Code sessions. These are mistakes we actually made, not hypothetical risks.

1. Background Agents Cannot Be Trusted for Critical Work

What happened: Two background agents (MIT site analysis, scholarship search) ran for 4-6 minutes, produced 0-byte output files, and had to be redone manually. Earlier background agents (article summarization batches) worked fine but their output files were also 0 bytes — they just happened to deliver results via in-conversation notifications before the session moved on.

The rule: If you need the result to proceed, run the agent in the foreground. Background agents are fire-and-forget. If the session state shifts while they're running — you start a new tool call, the context compresses, the conversation moves on — their results may not land.

Corollary: Background agents in sandboxed environments may lose access to tools (WebSearch, WebFetch, Bash) that the main conversation has. The agent doesn't know this until it tries. Two agents spent their entire runtimes repeatedly failing tool calls before returning partial results from training knowledge alone.
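One cheap guard against the 0-byte failure described above is to check the output file before any later step depends on it. The following is a minimal sketch; the function name `agent_output_ready` and the idea that each agent writes to a known path are assumptions for illustration, not part of the project's tooling.

```python
from pathlib import Path


def agent_output_ready(path):
    """Return True only if the agent's output file exists and is non-empty."""
    p = Path(path)
    # A missing file and a 0-byte file are treated the same way: not usable.
    return p.is_file() and p.stat().st_size > 0
```

A check like this only detects the failure. It does not replace running critical agents in the foreground.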

2. Don't Hardcode LLM Output as Source Data Without Marking It

What happened: Article summaries, scholar metadata (birth/death years, nationalities), hand attributions, and dictionary definitions were all generated by Claude in conversation, then copy-pasted into Python scripts as hardcoded data structures (HANDS = [...], SCHOLARS = [...], TERMS = [...]). These entered the database as if they were verified facts.

The rule: Every piece of data that originated from an LLM should carry a source_method field. We added this in the V2 migration, but by then the data was already in the DB without provenance, so we had to retroactively tag everything as LLM_ASSISTED.

Corollary: The fix isn't to avoid LLM-generated data — it's essential for bootstrapping a project this size. The fix is to tag it at the moment of creation, not after the fact. The schema should have had source_method and needs_review from day one.
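Tagging at the moment of creation can be as small as one helper applied to every hardcoded list before it reaches the ingest step. This is a sketch under assumptions: the helper name `tag_llm_records`, the `SCHOLARS_RAW` name, and the placeholder record are hypothetical. Only the `source_method` and `needs_review` fields and the `LLM_ASSISTED` value come from the text above.

```python
def tag_llm_records(rows, source_method="LLM_ASSISTED"):
    """Return copies of the rows with provenance fields attached."""
    # Copy each dict so the raw, untagged list is left unchanged.
    return [
        {**row, "source_method": source_method, "needs_review": True}
        for row in rows
    ]


# Hypothetical placeholder entry standing in for LLM-generated metadata.
SCHOLARS_RAW = [{"name": "Example Scholar", "birth_year": None}]
SCHOLARS = tag_llm_records(SCHOLARS_RAW)
```

Because the tag is applied where the list is defined, no row can enter the database without provenance.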

3. "Unknown" Is Not an Author Name

What happened: When ingesting leads from HPPERPLEXITY.txt, four entries were inserted with authors like "Unknown (botanical study)" and "Unknown (musicological study)". These showed up on the bibliography page as authored by "Unknown" — which is both wrong and embarrassing. A web search immediately resolved three of the four (Rhizopoulou, Godwin, O'Neill).

The rule: Never insert a bibliography entry with "Unknown" as the author if you haven't tried to find the author. A 30-second web search is faster than cleaning up bad data later. If you genuinely can't identify the author, use a field like author_unverified = True rather than filling the author field with a placeholder.

Corollary: One agent confidently attributed the botanical article to "Giulia Caneva" based on training data. The actual author was Sophia Rhizopoulou. Caneva was a cited reference within the article. LLM confidence about bibliographic metadata is not reliable — always verify.
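The rule about placeholder authors could be enforced at ingest with a small guard like the one below. It is a sketch, not the project's ingest code: the function name `prepare_author` and the list of placeholder words are assumptions. The `author_unverified` flag is the field suggested in the rule.

```python
import re

# Assumed set of placeholder prefixes; extend to match the actual lead files.
PLACEHOLDER = re.compile(r"^\s*(unknown|anon(?:ymous)?|various)\b", re.IGNORECASE)


def prepare_author(raw_author):
    """Return (author, author_unverified) for a bibliography row."""
    if raw_author is None or not raw_author.strip() or PLACEHOLDER.match(raw_author):
        # Leave the author field empty and flag the row for a manual search.
        return None, True
    return raw_author.strip(), False
```

With this guard, an entry such as "Unknown (botanical study)" would be stored with an empty author and `author_unverified = True` instead of appearing on the bibliography page as authored by "Unknown".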

4. The BL Photo-Folio Assumption Was Wrong to Ship at MEDIUM Confidence

What happened: The BL C.60.o.12 photos are numbered sequentially (001-189). The matching script assumed that photo number equals folio number and assigned MEDIUM confidence. But the BL copy is the 1545 edition, not the 1499 edition on which the signature map is based, and the photos are Russell's research images (selective, not systematic). As a result, 218 matches shipped at MEDIUM confidence that should have been LOW from the start.

The rule: When a matching assumption hasn't been manually verified, it's LOW or PROVISIONAL, not MEDIUM. MEDIUM should mean "the logic is sound and the data sources are compatible." Here, the logic was sound (folio numbers do correspond) but the data sources were incompatible (different edition, selective photography).

Corollary: The Deckard boundary audit caught this, but only after the matches were already in the database and being displayed on the site. The audit should have been run before the first data export, not after.
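One reading of the confidence rule is that MEDIUM must be earned on three counts at once. The sketch below encodes that reading. The function name and its three boolean parameters are assumptions, and the matching script itself is not shown in this document.

```python
def initial_confidence(logic_sound, sources_compatible, manually_verified):
    """Assign a starting confidence label to a batch of matches."""
    # MEDIUM only when the logic, the data sources, and a manual check all hold.
    if logic_sound and sources_compatible and manually_verified:
        return "MEDIUM"
    return "LOW"


# The BL case: sound logic, incompatible sources, no manual verification.
bl_confidence = initial_confidence(
    logic_sound=True, sources_compatible=False, manually_verified=False
)
```

Under this rule the BL matches would have entered the database as LOW.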

5. Schema Design: Add Review Fields from Day One

What happened: The initial schema (V1) had needs_review on the matches table but not on bibliography, scholars, dictionary_terms, or timeline_events. The V2 migration added review fields to everything — but this required an ALTER TABLE for each column on each table, plus retroactive tagging of all existing rows.

The rule: Every table that stores content should have these columns from the start:


confidence TEXT DEFAULT 'PROVISIONAL',
needs_review BOOLEAN DEFAULT 1,
reviewed BOOLEAN DEFAULT 0,
source_method TEXT,
notes TEXT

It costs nothing to add them upfront and saves a migration later.
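To keep the review columns identical across tables, they can be defined once and appended to every content table at creation time. This is a minimal sketch using the standard `sqlite3` module. The helper `create_content_table` and the `term` and `definition` columns are assumptions for illustration.

```python
import sqlite3

# The review columns listed above, defined once and reused for every table.
REVIEW_COLUMNS = """
    confidence TEXT DEFAULT 'PROVISIONAL',
    needs_review BOOLEAN DEFAULT 1,
    reviewed BOOLEAN DEFAULT 0,
    source_method TEXT,
    notes TEXT
"""


def create_content_table(conn, name, content_columns):
    """Create a table with its content columns plus the shared review columns."""
    # Table and column names are interpolated, so pass trusted strings only.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {name} "
        f"(id INTEGER PRIMARY KEY, {content_columns}, {REVIEW_COLUMNS})"
    )


conn = sqlite3.connect(":memory:")
create_content_table(conn, "dictionary_terms", "term TEXT NOT NULL, definition TEXT")
conn.execute(
    "INSERT INTO dictionary_terms (term, source_method) VALUES (?, ?)",
    ("example term", "LLM_ASSISTED"),
)
row = conn.execute(
    "SELECT confidence, needs_review, reviewed FROM dictionary_terms"
).fetchone()
```

A new row picks up PROVISIONAL confidence and `needs_review = 1` without the insert having to mention them.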

6. One Topic Cluster Per Entry Loses Information

What happened: The bibliography.topic_cluster field is a single TEXT value. Works like Stewering's (architecture AND text-image) or Priki's (reception AND text-image AND dream-religion) were forced into a single category. We added a document_topics junction table in V2, but most entries still have only one topic assigned.

The rule: Use a junction table for any classification that could be multi-valued. Don't use a comma-separated string in a single column — it's harder to query, harder to validate, and harder to display.
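A junction table for topics might look like the sketch below. The table name `document_topics` comes from the V2 migration mentioned above. The column names, the separate `topics` table, and the sample row are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bibliography (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE topics (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE document_topics (
    bibliography_id INTEGER NOT NULL REFERENCES bibliography(id),
    topic_id INTEGER NOT NULL REFERENCES topics(id),
    PRIMARY KEY (bibliography_id, topic_id)  -- one row per (work, topic) pair
);
""")
conn.execute("INSERT INTO bibliography (id, title) VALUES (1, 'Example work')")
conn.executemany(
    "INSERT INTO topics (name) VALUES (?)",
    [("architecture",), ("text-image",), ("reception",)],
)
# Assign two topics to the same work.
conn.executemany("INSERT INTO document_topics VALUES (1, ?)", [(1,), (2,)])


def topics_for(conn, bibliography_id):
    """Return the topic names assigned to one bibliography entry, sorted."""
    rows = conn.execute(
        """SELECT t.name FROM document_topics dt
           JOIN topics t ON t.id = dt.topic_id
           WHERE dt.bibliography_id = ? ORDER BY t.name""",
        (bibliography_id,),
    )
    return [name for (name,) in rows]
```

The composite primary key prevents assigning the same topic twice, and the UNIQUE constraint on `topics.name` gives the validation that a free-text column cannot.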

7. Name Matching Is Harder Than It Looks

What happened: The scholar_works junction table links scholars to bibliography entries by matching author names. But names don't match cleanly:

"James O'Neill" vs "James Calum O'Neill" (same person)
"L. E. Semler" vs "L.E. Semler" (spacing)
"James O'Neill and Maggie O'Neill" (co-authored entry treated as a single author string)
"Sophia Rhizopoulou et al." (et al. breaks last-name matching)

The initial linking script missed 32 of 52 scholars. A second pass with LIKE-based substring matching caught most of them.

The rule: Author names need normalization before matching. At minimum: strip periods from initials, normalize whitespace, split co-author strings on "and" / "&" / ";". Better: use a fuzzy matching library (rapidfuzz). Best: assign stable IDs at the point of data entry and never rely on name strings for joins.
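The minimum normalization described above can be sketched with the standard library alone. The function names are hypothetical, and the regular expressions cover only the cases listed in this section.

```python
import re


def split_authors(author_field):
    """Split a raw author string into individual author names."""
    # Drop a trailing "et al." so it does not end up inside the last name.
    cleaned = re.sub(r"\s*\bet al\.?\s*$", "", author_field.strip(), flags=re.IGNORECASE)
    # Split on " and ", "&", or ";" between names.
    parts = re.split(r"\s+and\s+|\s*&\s*|\s*;\s*", cleaned)
    return [p.strip() for p in parts if p.strip()]


def normalize_name(name):
    """Strip periods from initials, collapse whitespace, and lowercase for comparison."""
    name = name.replace(".", " ")
    name = re.sub(r"\s+", " ", name).strip()
    return name.casefold()
```

With this, "L. E. Semler" and "L.E. Semler" normalize to the same key, and the co-authored O'Neill entry splits into two names. It does not reconcile "James O'Neill" with "James Calum O'Neill". That case still needs fuzzy matching or, better, stable IDs.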

8. Don't Let the Exciting Part Crowd Out the Boring Part

What happened: We spent significant effort on the alchemist hand analysis, dictionary term definitions, and MIT site reverse-engineering — the intellectually exciting parts of the project. Meanwhile, basic data hygiene (deduplicating bibliography entries, verifying URLs, checking scholar birth/death dates against VIAF) was deferred and is still undone.

The rule: Schedule the boring validation work first. The exciting analysis can wait. A bibliography with 4 "Unknown" authors and unverified dates undermines the credibility of the whole site, even if the alchemical hand analysis is brilliant.

9. Static Site Architecture Was the Right Call

What happened: Nothing went wrong here. The static site (plain HTML/CSS/JS, no framework, SQLite as source of truth, Python scripts as build pipeline) has been straightforward to build, debug, and extend. The MIT site from 1997 validated this approach — it's still live 29 years later because it has zero dependencies.

The rule: For a digital humanities project with a small team and no dedicated hosting budget, static files are the correct architecture. The build pipeline (SQLite → Python scripts → HTML/JSON) is simple enough that anyone who reads the README can rebuild the site. No Docker, no Node, no framework churn.

Corollary: The temptation to add a JavaScript framework (React, Vue, Svelte) for the gallery lightbox and filters was resisted. The gallery works with 175 lines of vanilla JS. It loads instantly and will never break because a dependency was deprecated.

10. Write the Audit Report Before You Think You Need It

What happened: The HPDECKARD.md boundary audit and AUDIT_REPORT.md validation report were written mid-project, not at the end. They caught real issues (BL confidence, Unknown authors, missing review fields) that would have been harder to fix later.

The rule: Run your validation checks after every major data ingestion or schema change, not just at the end. The validate.py script takes 2 seconds to run. There is no reason not to run it constantly.
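The contents of validate.py are not reproduced here, but the kind of check worth running after every ingestion can be illustrated. In this sketch the check names, the queries, and the reduced `bibliography` schema are all assumptions.

```python
import sqlite3

# Each check is a query that returns the ids of offending rows.
CHECKS = {
    "placeholder_author": (
        "SELECT id FROM bibliography "
        "WHERE author IS NULL OR TRIM(author) = '' OR author LIKE 'Unknown%' "
        "ORDER BY id"
    ),
    "missing_source_method": (
        "SELECT id FROM bibliography WHERE source_method IS NULL ORDER BY id"
    ),
}


def run_checks(conn):
    """Run every check and return {check_name: [row ids]} for those that fail."""
    failures = {}
    for name, query in CHECKS.items():
        ids = [row[0] for row in conn.execute(query)]
        if ids:
            failures[name] = ids
    return failures


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bibliography (id INTEGER PRIMARY KEY, author TEXT, source_method TEXT)"
)
conn.executemany(
    "INSERT INTO bibliography (id, author, source_method) VALUES (?, ?, ?)",
    [(1, "Unknown (botanical study)", "LLM_ASSISTED"), (2, "Sophia Rhizopoulou", None)],
)
failures = run_checks(conn)
```

Keeping the checks in a plain dictionary makes it easy to add one each time a new class of mistake turns up.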

11. Document Your Agent Strategy Before Using Agents

What happened: Agents were deployed ad-hoc — "I need to summarize 30 articles, let me split them into 4 batches." The batching strategy (which articles in which batch) was improvised. Some agents got 7 articles, others got 13. No thought was given to what would happen if one agent failed.

The rule: Before launching parallel agents, decide:

1. How will the work be split? (By independence, not by size)

2. What happens if one agent fails? (Can you retry just that batch?)

3. How will results be merged? (Specify output schema upfront)

4. Foreground or background? (Foreground unless you genuinely have other work to do)
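The four decisions above can be written down as a small batch plan before any agent is launched. This is a sketch only: `Batch`, `run_batches`, and the `worker` callable are hypothetical stand-ins, with `worker` representing whatever launches a foreground agent for one batch and returns its results.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Batch:
    batch_id: str
    items: list                    # independent work items, e.g. article ids
    result: Optional[dict] = None  # agreed output schema: {item: summary}
    error: Optional[str] = None


def run_batches(batches: list, worker: Callable[[list], dict]) -> list:
    """Run every unfinished batch in the foreground and return the ones that failed."""
    for batch in batches:
        if batch.result is not None:
            continue  # already done, so a retry touches only the failed batches
        try:
            batch.result = worker(batch.items)
            batch.error = None
        except Exception as exc:
            # Record the failure on this batch and keep going with the rest.
            batch.error = str(exc)
    return [b for b in batches if b.result is None]
```

Calling `run_batches` a second time with the same list retries only the batches that failed, and merging is a matter of combining the `result` dictionaries once the returned list is empty.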

12. The Routledge Volume Misattribution Cascade

What happened: HPPERPLEXITY.txt listed the Routledge 2023 volume as "Various (ed. unknown)". One agent guessed "Campanelli and Scafi" as editors. The ingest script inserted it with "Various (ed. unknown)". A web search confirmed it's actually James Calum O'Neill's sole-authored monograph The Allegory of Love in the Early Renaissance. Three different wrong attributions existed briefly before being corrected.

The rule: When multiple sources disagree about metadata, don't insert any of them until you've resolved the disagreement. A single visit to the publisher's page would have resolved it instantly. The cost of inserting wrong data and cleaning it up later is always higher than the cost of verifying first.

Summary: The Three Meta-Rules

1. Tag provenance at creation, not after the fact. Every piece of data should know where it came from and whether a human has verified it.

2. Verify before inserting, not after. A 30-second web search before ingestion beats a 30-minute cleanup after.

3. Run validation constantly. The audit script is your friend. Run it after every change. The 2-second cost is always worth it.