Session: Database Migration Refactor

Date: 2026-05-07 Scope: pipeline/backend/refactor/ (migrate.py, verify.py, recover_pssm.py, dehydrate.py, schemas.py, MIGRATION-PLAN.md)

Summary

Migrated 97,387 production MongoDB documents to a refactored schema, applying transformations (field renames, ID rehashing, status normalization, carbonara-binders registration, pssm recovery, BinaryFile dehydration). Final result: 61,403 clean completed jobs with 100% schema validation and ID consistency.

Key Learnings

New Patterns

  • Two-pass migration for ID remapping: Pass 1 transforms + writes with new _id from hash_payload(input_data), builds old_id → new_id map. Pass 2 resolves pid array references. Required because _id changes invalidate parent references.
  • Bulk lookup over per-doc queries: Initial recover_pssm.py did per-doc find_one queries to prod DB (no indexes), yielding ~8 doc/s. Rewrote to bulk-load all prod pssm data into a Python dict first, then iterate refactor docs with in-memory lookup — achieved ~25 doc/s.
  • MongoDB batch pagination with _id-based cursor: Use _id: {$gt: last_id} with .sort("_id", 1).limit(N) instead of skip/limit or raw cursors, which timeout on large collections (1.4GB prod DB).
  • Batch async writes with asyncio.gather: Instead of individual update_one per doc, batch 50 coroutines and gather with return_exceptions=True for error counting without crashing.
  • R2 bucket names are per-endpoint: ContentStore.put() takes a bucket_name param (e.g. "carbonara", "boltz"), not a single "lemna" bucket.

Decisions

  • Dropped 4 AutoReconnect + 2 missing BinaryFile docs: Accepted as negligible data loss. 4 were transient connection errors during migration, 2 were incomplete boltzgen/openmm jobs with no content or storage_key.
  • Dropped all non-completed jobs: Running, queued, and failed jobs (30,273 docs) removed from refactor DB — only completed jobs are useful.
  • Carbonara-binders registered as its own job type: Uses JobCarbonara schema but keeps "carbonara-binders" as job_type in DB.
  • Pssm data recovery: Original pssm was a list[dict] (amino acid frequency matrix), not a BinaryFile. Converted to CSV, uploaded to R2 as BinaryFile with storage_key.
  • --skip-cas flag for migration: Keeps BinaryFile content inline during initial migration (avoids slow R2 uploads). Pssm recovery handled as separate step afterward.

Pitfalls

  • MongoDB cursor timeouts on large docs: Prod DB (1.4GB) has documents up to ~16MB. Raw async for doc in cursor will timeout. Must use _id-based batch pagination with limit(500).
  • MongoDB connection drops during long writes: AutoReconnect errors after ~50-100 docs. Must close and recreate AsyncIOMotorClient periodically (every 50 successful writes in the original script, or batch-gather approach).
  • Prod DB has no indexes beyond _id: Queries by project_name/job_name are full collection scans. Avoid per-doc lookups — always bulk-read and build in-memory maps.
  • nohup + tqdm output: tqdm progress bars in nohup redirect poorly; the process appeared to hang silently. Foreground execution was more reliable for long-running recovery tasks.
  • Docker volume persistence: data-refactor volume survives docker compose down but docker compose up -d may restart containers with fresh state if volume mount path shifted. Always verify DB contents after restart.
  • motor async cursor with _id filter: PydanticObjectId is needed for _id comparisons in motor queries, not plain strings.

Skill Updates Needed

  • essentials: Add note about MongoDB batch pagination pattern for large datasets — always use _id: {$gt: last_id} with sorted limit, never skip/limit or raw cursors on large collections.
  • patterns: Add the two-pass migration pattern (transform+write, then resolve references) as a template for future schema migrations.
  • map: Update pipeline/backend/refactor/ entry to reflect final state: 61,403 completed jobs, schemas include carbonara-binders, pssm recovery complete, MIGRATION-PLAN.md updated with actual results.

Files Modified

  • pipeline/backend/refactor/src/refactor/migrate.py — main migration script
  • pipeline/backend/refactor/src/refactor/verify.py — post-migration verification
  • pipeline/backend/refactor/src/refactor/recover_pssm.py — pssm BinaryFile recovery (bulk lookup rewrite)
  • pipeline/backend/refactor/src/refactor/dehydrate.py — CAS dehydration script
  • pipeline/backend/refactor/src/refactor/schemas.py — added carbonara-binders to registry, ignore_unknown_atoms rename
  • pipeline/backend/refactor/MIGRATION-PLAN.md — updated with actual results and execution notes
  • pipeline/backend/refactor/recover_pssm.sh — bash wrapper for pssm recovery