title: Session: Cookbook Ingestion System
tags: [session-log, cookbook, knowledge-base, opencode]
created: 2026-05-21
updated: 2026-05-21
status: active
related:


Session: Cookbook Ingestion System

Date: 2026-05-21
Scope: .opencode/commands/, .opencode/skills/cookbook/, knowledge/cookbook/, AGENTS.md, .opencode/skills/knowledge/, .opencode/skills/notes/

Summary

Built a cookbook knowledge base system for ingesting research papers (arXiv, bioRxiv, any URL) into structured Obsidian-flavored Markdown knowledge files. Replaced an initial Python CLI approach with opencode commands + skills, eliminating all Python dependencies.

Key Learnings

New Patterns

  • Opencode commands (/ingest, /ingest-batch) can replace dedicated CLI tools when the core work is LLM-driven extraction and file I/O. The built-in web fetch, LLM, and file tools eliminate the need for httpx, pdfplumber, ollama client, etc.
  • ar5iv.labs.arxiv.org provides clean HTML for arXiv papers — no PDF extraction needed. bioRxiv full-text pages work with defuddle. This eliminates pdfplumber entirely.
  • The defuddle skill is the preferred way to fetch web content for processing — it strips clutter and returns clean markdown, saving tokens vs raw HTML.
  • Knowledge files use arXiv IDs and DOIs as canonical identifiers (not human-chosen slugs). Each identifier is stored in both the frontmatter papers: list and the evidence table rows, so every claim can be traced back to its source paper.
  • No raw file storage needed — source URLs go in frontmatter only. The LLM processes the paper in-context and writes directly to knowledge files.
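
To make the identifier and ar5iv points above concrete, here is a minimal Python sketch. It is illustration only: in the actual system this logic lives inside the /ingest command prompt, not in code. The function names `canonical_id` and `ar5iv_url` and the `(kind, id)` return shape are hypothetical, and old-style arXiv IDs (for example hep-th/9901001) are deliberately not handled.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Modern arXiv IDs look like 1706.03762, optionally followed by a version (v5).
ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")
# bioRxiv/medRxiv DOIs use the 10.1101 prefix; stop before any "v1" / ".full" suffix.
PREPRINT_DOI = re.compile(r"10\.1101/[0-9.]*[0-9]")


def canonical_id(url: str) -> Optional[Tuple[str, str]]:
    """Return ("arxiv", id) or ("doi", doi) for a paper URL, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        match = ARXIV_ID.search(url)
        if match:
            # Drop the version suffix so all versions share one identifier.
            return ("arxiv", match.group(1))
    match = PREPRINT_DOI.search(url)
    if match:
        return ("doi", match.group(0))
    return None


def ar5iv_url(arxiv_id: str) -> str:
    """Build the ar5iv HTML URL for a modern arXiv ID."""
    return f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}"
```

Stripping the version suffix is a design assumption here: it keeps one knowledge entry per paper rather than one per revision.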

Decisions

  • Scrapped the Python CLI (tools/cookbook/) in favor of opencode commands + skill. Rationale: no Python dependencies are needed when opencode has built-in web fetch, LLM, and file I/O. Batch processing works via subagents (each gets a fresh context window).
  • One /ingest command handles all URL types (arXiv, bioRxiv, medRxiv, arbitrary URLs). Domain detection and extraction lens switching happen inside the command logic, not as separate commands.
  • Two extraction lenses: ML/DL (techniques, hyperparams, evidence) and Biology (targets, datasets, methods, relevance to protein design). For ambiguous papers, the lens is auto-detected.
  • Deep category structure under knowledge/cookbook/knowledge/: ML categories (optimization, regularization, etc.) and Biology categories (target, dataset, method, pathway).
  • /ingest-batch spawns sequential subagents, not parallel, to avoid knowledge base write conflicts.
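
The routing and sequencing decisions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the command's actual implementation: the real /ingest and /ingest-batch are opencode commands that delegate to the LLM and to subagents. The names `detect_source`, `ingest_batch`, and the `ingest_one` callback are hypothetical, and recording a failure without aborting the batch is an assumption, not something the session confirmed.

```python
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urlparse

# Known preprint hosts; anything else falls through to "other".
KNOWN_HOSTS = {
    "arxiv.org": "arxiv",
    "biorxiv.org": "biorxiv",
    "medrxiv.org": "medrxiv",
}


def detect_source(url: str) -> str:
    """Classify a URL by hostname, matching the bare domain or any subdomain."""
    host = (urlparse(url).hostname or "").lower()
    for domain, label in KNOWN_HOSTS.items():
        if host == domain or host.endswith("." + domain):
            return label
    return "other"


def ingest_batch(
    urls: Iterable[str], ingest_one: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Ingest URLs strictly one at a time so no two writers touch the
    knowledge base concurrently. `ingest_one` stands in for one subagent run."""
    results = []
    for url in urls:
        try:
            results.append((url, ingest_one(url)))
        except Exception as exc:  # assumption: log the failure and continue
            results.append((url, f"failed: {exc}"))
    return results
```

Running the loop sequentially trades throughput for safety: each paper's writes to the index and category files finish before the next one starts.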

Pitfalls

  • Initial implementation used tools/cookbook/ as a standalone Python CLI with click, httpx, pdfplumber, and an ollama client. This was over-engineered — the LLM is the extraction engine, so Python plumbing just adds deps and maintenance burden without benefit.
  • ar5iv doesn’t cover all arXiv papers (some old or unusual formats lack HTML). The ingest command includes a fallback to abstract-only via the arXiv API for these cases.
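
For the abstract-only fallback, a rough Python sketch of what querying the arXiv API involves is shown below. It assumes the public query endpoint at export.arxiv.org with the `id_list` parameter, which returns an Atom feed. The helper names `arxiv_api_url` and `parse_abstract` are hypothetical, and the network fetch itself is left out so the example stays self-contained.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Union
from urllib.parse import urlencode

# The arXiv API responds with an Atom feed, so elements carry this namespace.
ATOM = "{http://www.w3.org/2005/Atom}"


def arxiv_api_url(arxiv_id: str) -> str:
    """Build the query URL that returns metadata for a single arXiv ID."""
    return "http://export.arxiv.org/api/query?" + urlencode({"id_list": arxiv_id})


def parse_abstract(atom_xml: Union[str, bytes]) -> Optional[str]:
    """Extract the abstract (Atom <summary>) of the first entry, or None."""
    root = ET.fromstring(atom_xml)
    entry = root.find(f"{ATOM}entry")
    if entry is None:
        return None
    summary = entry.findtext(f"{ATOM}summary")
    if summary is None:
        return None
    # Abstracts arrive hard-wrapped; collapse all whitespace runs to single spaces.
    return " ".join(summary.split())
```

An abstract-only result carries much less evidence than full text, so it may be worth marking such entries in the knowledge file; how the ingest command flags them is not recorded in this session.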

Skill Updates Needed

  • map skill — cookbook system (/ingest, /ingest-batch) is a new tool in the Lemna ecosystem. Map skill should list it under Tools Overview.
  • knowledge skill — already updated to include cookbook/ in the folder layout and placement table.
  • notes skill — already updated to include cookbook/knowledge/ and cookbook/knowledge/{category}/ in the folder structure table.

Files Modified

  • .opencode/skills/cookbook/SKILL.md — created (full skill spec)
  • .opencode/commands/ingest.md — created (single-paper ingestion command)
  • .opencode/commands/ingest-batch.md — created (batch ingestion command)
  • .opencode/skills/knowledge/SKILL.md — updated (added cookbook to folder layout and placement table)
  • .opencode/skills/notes/SKILL.md — updated (cookbook folder entries)
  • AGENTS.md — updated (added cookbook skill to skills table)
  • knowledge/cookbook/knowledge/_index.md — created (seed index with category sections)
  • tools/cookbook/ — deleted (Python CLI replaced by opencode commands)
  • knowledge/cookbook/raw/ — deleted (no raw file storage needed)