Session: Database Migration Refactor
Date: 2026-05-07
Scope: pipeline/backend/refactor/ (migrate.py, verify.py, recover_pssm.py, dehydrate.py, schemas.py, MIGRATION-PLAN.md)
Summary
Migrated 97,387 production MongoDB documents to a refactored schema, applying transformations (field renames, ID rehashing, status normalization, carbonara-binders registration, pssm recovery, BinaryFile dehydration). Final result: 61,403 clean completed jobs with 100% schema validation and ID consistency.
Key Learnings
New Patterns
- Two-pass migration for ID remapping: Pass 1 transforms + writes with new
_idfromhash_payload(input_data), buildsold_id → new_idmap. Pass 2 resolvespidarray references. Required because_idchanges invalidate parent references. - Bulk lookup over per-doc queries: Initial recover_pssm.py did per-doc
find_onequeries to prod DB (no indexes), yielding ~8 doc/s. Rewrote to bulk-load all prod pssm data into a Python dict first, then iterate refactor docs with in-memory lookup — achieved ~25 doc/s. - MongoDB batch pagination with
_id-based cursor: Use_id: {$gt: last_id}with.sort("_id", 1).limit(N)instead of skip/limit or raw cursors, which timeout on large collections (1.4GB prod DB). - Batch async writes with
asyncio.gather: Instead of individualupdate_oneper doc, batch 50 coroutines andgatherwithreturn_exceptions=Truefor error counting without crashing. - R2 bucket names are per-endpoint:
ContentStore.put()takes abucket_nameparam (e.g."carbonara","boltz"), not a single"lemna"bucket.
Decisions
- Dropped 4 AutoReconnect + 2 missing BinaryFile docs: Accepted as negligible data loss. 4 were transient connection errors during migration, 2 were incomplete boltzgen/openmm jobs with no content or storage_key.
- Dropped all non-completed jobs: Running, queued, and failed jobs (30,273 docs) removed from refactor DB — only completed jobs are useful.
- Carbonara-binders registered as its own job type: Uses
JobCarbonaraschema but keeps"carbonara-binders"asjob_typein DB. - Pssm data recovery: Original pssm was a
list[dict](amino acid frequency matrix), not aBinaryFile. Converted to CSV, uploaded to R2 as BinaryFile withstorage_key. --skip-casflag for migration: Keeps BinaryFile content inline during initial migration (avoids slow R2 uploads). Pssm recovery handled as separate step afterward.
Pitfalls
- MongoDB cursor timeouts on large docs: Prod DB (1.4GB) has documents up to ~16MB. Raw
async for doc in cursorwill timeout. Must use_id-based batch pagination withlimit(500). - MongoDB connection drops during long writes: AutoReconnect errors after ~50-100 docs. Must close and recreate
AsyncIOMotorClientperiodically (every 50 successful writes in the original script, or batch-gather approach). - Prod DB has no indexes beyond
_id: Queries byproject_name/job_nameare full collection scans. Avoid per-doc lookups — always bulk-read and build in-memory maps. nohup+ tqdm output: tqdm progress bars in nohup redirect poorly; the process appeared to hang silently. Foreground execution was more reliable for long-running recovery tasks.- Docker volume persistence:
data-refactorvolume survivesdocker compose downbutdocker compose up -dmay restart containers with fresh state if volume mount path shifted. Always verify DB contents after restart. motorasync cursor with_idfilter:PydanticObjectIdis needed for_idcomparisons in motor queries, not plain strings.
Skill Updates Needed
- essentials: Add note about MongoDB batch pagination pattern for large datasets — always use
_id: {$gt: last_id}with sorted limit, never skip/limit or raw cursors on large collections. - patterns: Add the two-pass migration pattern (transform+write, then resolve references) as a template for future schema migrations.
- map: Update
pipeline/backend/refactor/entry to reflect final state: 61,403 completed jobs, schemas includecarbonara-binders, pssm recovery complete,MIGRATION-PLAN.mdupdated with actual results.
Files Modified
pipeline/backend/refactor/src/refactor/migrate.py— main migration scriptpipeline/backend/refactor/src/refactor/verify.py— post-migration verificationpipeline/backend/refactor/src/refactor/recover_pssm.py— pssm BinaryFile recovery (bulk lookup rewrite)pipeline/backend/refactor/src/refactor/dehydrate.py— CAS dehydration scriptpipeline/backend/refactor/src/refactor/schemas.py— addedcarbonara-bindersto registry,ignore_unknown_atomsrenamepipeline/backend/refactor/MIGRATION-PLAN.md— updated with actual results and execution notespipeline/backend/refactor/recover_pssm.sh— bash wrapper for pssm recovery