title: Session: Cookbook Ingestion System
tags: [session-log, cookbook, knowledge-base, opencode]
created: 2026-05-21
updated: 2026-05-21
status: active
related:
Session: Cookbook Ingestion System
Date: 2026-05-21
Scope: .opencode/commands/, .opencode/skills/cookbook/, knowledge/cookbook/, AGENTS.md, .opencode/skills/knowledge/, .opencode/skills/notes/
Summary
Built a cookbook knowledge base system for ingesting research papers (arXiv, bioRxiv, any URL) into structured Obsidian-flavored Markdown knowledge files. Replaced an initial Python CLI approach with opencode commands + skills, eliminating all Python dependencies.
Key Learnings
New Patterns
- Opencode commands (`/ingest`, `/ingest-batch`) can replace dedicated CLI tools when the core work is LLM-driven extraction and file I/O. The built-in web fetch, LLM, and file tools eliminate the need for httpx, pdfplumber, an ollama client, etc.
- ar5iv.labs.arxiv.org provides clean HTML for arXiv papers, so no PDF extraction is needed. bioRxiv full-text pages work with defuddle. This eliminates pdfplumber entirely.
- The `defuddle` skill is the preferred way to fetch web content for processing: it strips clutter and returns clean markdown, saving tokens vs raw HTML.
- Knowledge files use arXiv IDs and DOIs as canonical identifiers (not human-chosen slugs), stored in both the frontmatter `papers:` list and the evidence table rows for full backtrackability.
- No raw file storage needed: source URLs go in frontmatter only. The LLM processes the paper in-context and writes directly to knowledge files.
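
The ID-to-ar5iv mapping noted above can be sketched in a few lines. This is an illustration only: the real `/ingest` command does this through prompt instructions rather than code. The function names `arxiv_id_from_url` and `ar5iv_url` are hypothetical, dropping the version suffix from the canonical ID is an assumption, and old-style identifiers (such as `hep-th/9901001`) are not handled.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# New-style arXiv identifiers look like 2301.12345, optionally followed by v2, v3, ...
ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def arxiv_id_from_url(url: str) -> Optional[str]:
    """Return the version-less arXiv ID found in an arxiv.org URL, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "arxiv.org" and not host.endswith(".arxiv.org"):
        return None
    match = ARXIV_ID.search(parsed.path)
    # Assumption: the canonical identifier omits the version suffix.
    return match.group(1) if match else None


def ar5iv_url(arxiv_id: str) -> str:
    """Build the ar5iv HTML URL for a new-style arXiv ID."""
    return f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}"


if __name__ == "__main__":
    paper_id = arxiv_id_from_url("https://arxiv.org/abs/1706.03762v5")
    if paper_id is not None:
        print(paper_id, ar5iv_url(paper_id))
```

The same version-less ID is what would go into the `papers:` list and the evidence table rows, so a single string links a knowledge file back to its source.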
Decisions
- Scrapped the Python CLI (`tools/cookbook/`) in favor of opencode commands + skill. Rationale: no Python dependencies are needed when opencode has built-in web fetch, LLM, and file I/O. Batch processing works via subagents (each gets a fresh context window).
- One `/ingest` command handles all URL types (arXiv, bioRxiv, medRxiv, arbitrary URLs). Domain detection and extraction lens switching happen inside the command logic, not as separate commands.
- Two extraction lenses: ML/DL (techniques, hyperparams, evidence) and Biology (targets, datasets, methods, relevance to protein design). Auto-detect for ambiguous papers.
- Deep category structure under `knowledge/cookbook/knowledge/`: ML categories (optimization, regularization, etc.) and Biology categories (target, dataset, method, pathway).
- `/ingest-batch` spawns sequential subagents, not parallel, to avoid knowledge base write conflicts.
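
The routing and batching decisions above can be sketched as plain control flow. This is a hypothetical illustration, not the command's actual logic, which lives in the command prompt. `KNOWN_HOSTS`, `source_type`, `lens_hint`, and `ingest_batch` are invented names, and defaulting bioRxiv and medRxiv to the Biology lens is an assumption not stated in this log.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse

# Hypothetical host table; the command's real detection rules are not shown in this log.
KNOWN_HOSTS = {
    "arxiv.org": "arxiv",
    "biorxiv.org": "biorxiv",
    "medrxiv.org": "medrxiv",
}


def source_type(url: str) -> str:
    """Classify a URL by host; anything unrecognized is treated as a generic page."""
    host = urlparse(url).netloc.lower()
    for domain, name in KNOWN_HOSTS.items():
        if host == domain or host.endswith("." + domain):
            return name
    return "generic"


def lens_hint(source: str) -> str:
    """Assumption: bio preprint servers default to the Biology lens, the rest auto-detect."""
    return "biology" if source in ("biorxiv", "medrxiv") else "auto"


def ingest_batch(urls: Iterable[str], ingest_one: Callable[[str], str]) -> List[Dict]:
    """Process URLs strictly one at a time so no two writers touch the knowledge base."""
    results = []
    for url in urls:
        try:
            results.append({"url": url, "ok": True, "detail": ingest_one(url)})
        except Exception as exc:  # one failed paper should not abort the whole batch
            results.append({"url": url, "ok": False, "detail": str(exc)})
    return results
```

Here `ingest_one` stands in for spawning a single subagent. Because each call returns before the next one starts, writes to shared files such as the index are serialized without any locking.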
Pitfalls
- Initial implementation used `tools/cookbook/` as a standalone Python CLI with click, httpx, pdfplumber, and an ollama client. This was over-engineered: the LLM is the extraction engine, so Python plumbing just adds deps and maintenance burden without benefit.
- ar5iv doesn't cover all arXiv papers (some old or unusual formats lack HTML). The ingest command includes a fallback to abstract-only via the arXiv API for these cases.
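
A minimal sketch of the abstract-only fallback, assuming the arXiv API's Atom response is already in hand. The real command does this with opencode's built-in fetch, not with code like this. `arxiv_api_url` and `abstract_from_atom` are hypothetical helper names, and the sample feed is a trimmed, made-up stand-in for a real response.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The arXiv API returns an Atom feed, so elements live in the Atom namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def arxiv_api_url(arxiv_id: str) -> str:
    """Build the arXiv API query URL for a single paper ID."""
    return f"http://export.arxiv.org/api/query?id_list={arxiv_id}"


def abstract_from_atom(feed_xml: str) -> Optional[str]:
    """Return the first entry's abstract with whitespace collapsed, or None if absent."""
    root = ET.fromstring(feed_xml)
    summary = root.find("atom:entry/atom:summary", ATOM)
    if summary is None or summary.text is None:
        return None
    return " ".join(summary.text.split())


# Made-up sample feed used only to exercise the parser offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example Paper</title>
    <summary>
      We study an example
      problem.
    </summary>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(arxiv_api_url("1706.03762"))
    print(abstract_from_atom(SAMPLE_FEED))
```

When the fallback is used, the resulting knowledge file is based on the abstract alone, so it is worth recording that fact alongside the source URL.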
Skill Updates Needed
- `map` skill: the cookbook system (`/ingest`, `/ingest-batch`) is a new tool in the Lemna ecosystem. The map skill should list it under Tools Overview.
- `knowledge` skill: already updated to include `cookbook/` in the folder layout and placement table.
- `notes` skill: already updated to include `cookbook/knowledge/` and `cookbook/knowledge/{category}/` in the folder structure table.
Files Modified
- `.opencode/skills/cookbook/SKILL.md`: created (full skill spec)
- `.opencode/commands/ingest.md`: created (single-paper ingestion command)
- `.opencode/commands/ingest-batch.md`: created (batch ingestion command)
- `.opencode/skills/knowledge/SKILL.md`: updated (added cookbook to folder layout and placement table)
- `.opencode/skills/notes/SKILL.md`: updated (cookbook folder entries)
- `AGENTS.md`: updated (added cookbook skill to skills table)
- `knowledge/cookbook/knowledge/_index.md`: created (seed index with category sections)
- `tools/cookbook/`: deleted (Python CLI replaced by opencode commands)
- `knowledge/cookbook/raw/`: deleted (no raw file storage needed)