Session: Database Migration Refactor

Date: 2026-05-07 Scope: pipeline/backend/refactor/ (migrate.py, verify.py, recover_pssm.py, dehydrate.py, schemas.py, MIGRATION-PLAN.md)

Summary

Migrated 97,387 production MongoDB documents to a refactored schema, applying transformations (field renames, ID rehashing, status normalization, carbonara-binders registration, pssm recovery, BinaryFile dehydration). Final result: 61,403 clean completed jobs with 100% schema validation and ID consistency.

Key Learnings

New Patterns

Two-pass migration for ID remapping: Pass 1 transforms + writes with new _id from hash_payload(input_data), builds old_id → new_id map. Pass 2 resolves pid array references. Required because _id changes invalidate parent references.
Bulk lookup over per-doc queries: Initial recover_pssm.py did per-doc find_one queries to prod DB (no indexes), yielding ~8 doc/s. Rewrote to bulk-load all prod pssm data into a Python dict first, then iterate refactor docs with in-memory lookup — achieved ~25 doc/s.
MongoDB batch pagination with _id-based cursor: Use _id: {$gt: last_id} with .sort("_id", 1).limit(N) instead of skip/limit or raw cursors, which timeout on large collections (1.4GB prod DB).
Batch async writes with asyncio.gather: Instead of individual update_one per doc, batch 50 coroutines and gather with return_exceptions=True for error counting without crashing.
R2 bucket names are per-endpoint: ContentStore.put() takes a bucket_name param (e.g. "carbonara", "boltz"), not a single "lemna" bucket.

Decisions

Dropped 4 AutoReconnect + 2 missing BinaryFile docs: Accepted as negligible data loss. 4 were transient connection errors during migration, 2 were incomplete boltzgen/openmm jobs with no content or storage_key.
Dropped all non-completed jobs: Running, queued, and failed jobs (30,273 docs) removed from refactor DB — only completed jobs are useful.
Carbonara-binders registered as its own job type: Uses JobCarbonara schema but keeps "carbonara-binders" as job_type in DB.
Pssm data recovery: Original pssm was a list[dict] (amino acid frequency matrix), not a BinaryFile. Converted to CSV, uploaded to R2 as BinaryFile with storage_key.
--skip-cas flag for migration: Keeps BinaryFile content inline during initial migration (avoids slow R2 uploads). Pssm recovery handled as separate step afterward.

Pitfalls

MongoDB cursor timeouts on large docs: Prod DB (1.4GB) has documents up to ~16MB. Raw async for doc in cursor will timeout. Must use _id-based batch pagination with limit(500).
MongoDB connection drops during long writes: AutoReconnect errors after ~50-100 docs. Must close and recreate AsyncIOMotorClient periodically (every 50 successful writes in the original script, or batch-gather approach).
Prod DB has no indexes beyond _id: Queries by project_name/job_name are full collection scans. Avoid per-doc lookups — always bulk-read and build in-memory maps.
nohup + tqdm output: tqdm progress bars in nohup redirect poorly; the process appeared to hang silently. Foreground execution was more reliable for long-running recovery tasks.
Docker volume persistence: data-refactor volume survives docker compose down but docker compose up -d may restart containers with fresh state if volume mount path shifted. Always verify DB contents after restart.
motor async cursor with _id filter: PydanticObjectId is needed for _id comparisons in motor queries, not plain strings.

Skill Updates Needed

essentials: Add note about MongoDB batch pagination pattern for large datasets — always use _id: {$gt: last_id} with sorted limit, never skip/limit or raw cursors on large collections.
patterns: Add the two-pass migration pattern (transform+write, then resolve references) as a template for future schema migrations.
map: Update pipeline/backend/refactor/ entry to reflect final state: 61,403 completed jobs, schemas include carbonara-binders, pssm recovery complete, MIGRATION-PLAN.md updated with actual results.

Files Modified

pipeline/backend/refactor/src/refactor/migrate.py — main migration script
pipeline/backend/refactor/src/refactor/verify.py — post-migration verification
pipeline/backend/refactor/src/refactor/recover_pssm.py — pssm BinaryFile recovery (bulk lookup rewrite)
pipeline/backend/refactor/src/refactor/dehydrate.py — CAS dehydration script
pipeline/backend/refactor/src/refactor/schemas.py — added carbonara-binders to registry, ignore_unknown_atoms rename
pipeline/backend/refactor/MIGRATION-PLAN.md — updated with actual results and execution notes
pipeline/backend/refactor/recover_pssm.sh — bash wrapper for pssm recovery

Lemna Knowledge Base

Explorer

Session: Database Migration Refactor

Summary

Key Learnings

New Patterns

Decisions

Pitfalls

Skill Updates Needed

Files Modified

Graph View

Recent Notes

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Table of Contents

Backlinks