title: Pomodoro v0.7.x Training Plan — Ablation from v0.6.3 Baseline
tags: [journal, pomodoro, training, plan, ablation, v0.7]
created: 2026-05-04
updated: 2026-05-04
status: active
related:


Pomodoro v0.7.x Training Plan — Ablation from v0.6.3 Baseline

Baseline: v0.6.3 Config

S=64, L=8, r=0.0, max_size=128
Loss: (la + lb + lnb + lr + lc + lda + ldr + ldc) * lw

v0.6.3 is the best so far on lDDT (local structure) but stagnates after ~3 days. Goal: isolate what helps by changing one thing at a time.

New Versions

| Version | Changed Param(s) | Value | Rationale |
|---|---|---|---|
| v0.7.0 | max_size | 64 (was 128) | Smaller structures = faster iterations, more pretraining signal |
| v0.7.1 | loss | RMSD only (la + lr + lc) | Remove all distance losses (lda, ldr, ldc) and push-pull losses (lb, lnb) for a simpler gradient landscape; tests whether these terms cause stagnation |
| v0.7.2 | max_size + loss | 64 + RMSD only | Combine both changes: the smallest and fastest regime, best for rapid iteration during pretraining |
| v0.7.3 | soft_normalize | disabled | Test whether soft_normalize at the end of GT operations constrains gradient flow and contributes to stagnation |
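
The version table above can also be written down as overrides on a single baseline config, which makes it easy to check that each run differs from v0.6.3 in only the intended fields. This is a minimal sketch, not the project's actual config code: `AblationConfig`, `loss_terms`, and `use_soft_normalize` are hypothetical names introduced for illustration (the real settings live in `ConfigData` and `ConfigRuntime`), and S, L, and r are left out because no version changes them.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical flat view of the settings this plan varies."""
    version: str = "0.6.3"
    max_size: int = 128
    # Loss terms that contribute to the backpropagated total.
    loss_terms: tuple = ("la", "lb", "lnb", "lr", "lc", "lda", "ldr", "ldc")
    use_soft_normalize: bool = True


BASELINE = AblationConfig()
RMSD_ONLY = ("la", "lr", "lc")

# One entry per new version: only the fields that differ from v0.6.3.
OVERRIDES = {
    "0.7.0": {"max_size": 64},
    "0.7.1": {"loss_terms": RMSD_ONLY},
    "0.7.2": {"max_size": 64, "loss_terms": RMSD_ONLY},
    "0.7.3": {"use_soft_normalize": False},
}


def build_config(version: str) -> AblationConfig:
    """Return the baseline with the overrides for `version` applied."""
    return replace(BASELINE, version=version, **OVERRIDES[version])


def changed_fields(cfg: AblationConfig) -> set:
    """Names of the fields (other than version) that differ from the baseline."""
    return {
        name
        for name in ("max_size", "loss_terms", "use_soft_normalize")
        if getattr(cfg, name) != getattr(BASELINE, name)
    }
```

Calling `changed_fields(build_config("0.7.2"))` returns the two fields that v0.7.2 changes, which is a quick guard against a run accidentally drifting from the plan.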

Per-Version Config Changes

v0.7.0 — Smaller Structures

  • ConfigData.max_size: 128 → 64
  • ConfigRuntime.version: "0.7.0"
  • Everything else same as v0.6.3

v0.7.1 — Simplified Loss (RMSD Only)

  • ConfigRuntime.version: "0.7.1"
  • In main.py:209, change: loss = (la + lb + lnb + lr + lc + lda + ldr + ldc) * lw → loss = (la + lr + lc) * lw
  • Still compute and log all loss components for monitoring; only the distance and push-pull terms are excluded from backprop
  • Everything else same as v0.6.3

v0.7.2 — Smaller Structures + Simplified Loss

  • ConfigData.max_size: 128 → 64
  • ConfigRuntime.version: "0.7.2"
  • Same loss simplification as v0.7.1

v0.7.3 — No soft_normalize

  • ConfigRuntime.version: "0.7.3"
  • Remove soft_normalize calls at the end of GT operations in model.py:
    • VectorTrack (L60): v = soft_normalize(v, dim=1) → v (identity)
    • ScalarTrack (L140-141): Q = soft_normalize(Q, dim=2), K = soft_normalize(K, dim=2) → remove
    • VectorTrack (L214-215, L220): same pattern → remove
    • BootstrapVectorState (L340): pz = soft_normalize(pz, dim=1) → pz
    • GeometryDecoderModel (L428-429): Q, K normalize → remove
    • GeometryDecoderModel (L488): u = soft_normalize(u, dim=1) → u
  • Alternative: make soft_normalize a no-op via a config flag rather than deleting the calls, so it’s easy to re-enable.
  • Everything else same as v0.6.3
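
The config-flag alternative could look roughly like the sketch below. It assumes nothing about how `soft_normalize` is actually implemented in model.py: `toy_soft_normalize` is a stand-in so the example runs on its own, and `make_normalizer` and the `enabled` flag are hypothetical names.

```python
import math
from typing import Callable, List


def toy_soft_normalize(v: List[float], dim: int = 0, eps: float = 1e-6) -> List[float]:
    """Stand-in only, NOT the real soft_normalize: scales v by 1 / (||v|| + eps).

    `dim` is accepted for call-site compatibility and ignored for 1-D lists.
    """
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def make_normalizer(enabled: bool, normalize_fn: Callable = toy_soft_normalize) -> Callable:
    """Return normalize_fn when enabled, otherwise an identity function."""
    if enabled:
        return normalize_fn
    # Identity: same call signature, input returned unchanged.
    return lambda v, *args, **kwargs: v
```

With this shape, each call site keeps its existing form (for example `v = normalize(v, dim=1)`), and v0.7.3 only flips the flag that builds `normalize`, so re-enabling normalization is a config change rather than a code revert.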

Implementation Steps

  1. Create v0.7.0 worktree from v0.6.3 branch:

    cd models/pomodoro/pomodoro
    git branch v0.7.0 v0.6.3
    git worktree add ../v0.7.0 v0.7.0
    

    Edit config.py: max_size=64, version="0.7.0". Commit + push.

  2. Create v0.7.1 worktree from v0.6.3 branch:

    git branch v0.7.1 v0.6.3
    git worktree add ../v0.7.1 v0.7.1
    

    Edit config.py: version="0.7.1". Edit main.py:209: simplify loss to (la + lr + lc) * lw. Still log all components. Commit + push.

  3. Create v0.7.2 worktree from v0.7.1 branch, after the step 2 commit exists (so it inherits the loss simplification):

    git branch v0.7.2 v0.7.1
    git worktree add ../v0.7.2 v0.7.2
    

    Edit config.py: max_size=64, version="0.7.2". Commit + push.

  4. Create v0.7.3 worktree from v0.6.3 branch:

    git branch v0.7.3 v0.6.3
    git worktree add ../v0.7.3 v0.7.3
    

    Edit config.py: version="0.7.3". Edit model.py: remove or disable soft_normalize calls at end of GT operations. Commit + push.
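
The four branch and worktree steps above follow one pattern, so they can be generated from a small table of (new branch, parent branch) pairs. The sketch below is a hypothetical helper, not part of the repo; it only uses the `git branch` and `git worktree add` invocations already shown in the steps and defaults to a dry run that prints the commands.

```python
import subprocess
from typing import List, Tuple

# (new branch, parent branch), in the order the steps create them.
BRANCH_PLAN: List[Tuple[str, str]] = [
    ("v0.7.0", "v0.6.3"),
    ("v0.7.1", "v0.6.3"),
    ("v0.7.2", "v0.7.1"),  # branches from v0.7.1 to inherit the loss change
    ("v0.7.3", "v0.6.3"),
]


def worktree_commands(plan: List[Tuple[str, str]] = BRANCH_PLAN) -> List[List[str]]:
    """Build the git commands for each planned branch and its worktree."""
    cmds = []
    for new, parent in plan:
        cmds.append(["git", "branch", new, parent])
        cmds.append(["git", "worktree", "add", f"../{new}", new])
    return cmds


def run(cmds: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(worktree_commands())  # dry run: prints the commands only
```

One ordering caveat: a branch only inherits what its parent has committed at the moment it is created. Running all four pairs in one go would create v0.7.2 before the v0.7.1 loss edit is committed, so the v0.7.2 pair should be run separately once that commit exists.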

Loss Simplification Detail

Current loss in main.py:209:

loss = (la + lb + lnb + lr + lc + lda + ldr + ldc) * lw

Simplified (v0.7.1, v0.7.2):

loss = (la + lr + lc) * lw

Removed:

  • lda — atom distance matrix loss (O(N²) memory)
  • ldr — residue distance matrix loss
  • ldc — chain distance matrix loss
  • lb — bonded push-pull loss
  • lnb — non-bonded push-pull loss

All components still computed and logged for observability — only the backward gradient path changes.
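
A minimal sketch of "log everything, backprop only a subset" follows. It is written for plain floats so it runs standalone; the expectation (an assumption, not verified against main.py) is that the same arithmetic carries over to scalar tensors, where only the returned total would be passed to the backward call. `combine_losses`, `ALL_TERMS`, and `RMSD_TERMS` are hypothetical names.

```python
from typing import Dict, Iterable, Tuple

ALL_TERMS = ("la", "lb", "lnb", "lr", "lc", "lda", "ldr", "ldc")
RMSD_TERMS = ("la", "lr", "lc")


def combine_losses(
    components: Dict[str, float], active: Iterable[str], lw: float
) -> Tuple[float, Dict[str, float]]:
    """Sum only the `active` terms into the training loss; report every term."""
    active = tuple(active)
    missing = [name for name in active if name not in components]
    if missing:
        raise KeyError(f"missing loss components: {missing}")
    # Only the active terms enter the total that training would optimize.
    total = sum(components[name] for name in active) * lw
    # Every component is logged, including those excluded from the total.
    log = {name: float(value) for name, value in components.items()}
    log["loss"] = float(total)
    return total, log
```

Passing `ALL_TERMS` as `active` reproduces the v0.6.3 sum and passing `RMSD_TERMS` gives the v0.7.1 and v0.7.2 sum, so the choice of terms could be driven by config instead of editing the loss line per branch.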

What We Learn

| Comparison | Tests |
|---|---|
| v0.7.0 vs v0.6.3 | Does smaller structure size break stagnation? |
| v0.7.1 vs v0.6.3 | Does removing distance/push-pull losses break stagnation? |
| v0.7.2 vs v0.7.0 & v0.7.1 | Are the two effects independent or synergistic? |
| v0.7.3 vs v0.6.3 | Does removing soft_normalize break stagnation? (tests if GT output normalization constrains gradient flow) |
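
For the "independent or synergistic" question, v0.6.3, v0.7.0, v0.7.1, and v0.7.2 form a 2x2 design, so one way to quantify it is the standard interaction term: the combined gain minus the sum of the two individual gains. The sketch below assumes a single scalar metric per run (for example lDDT at a fixed training budget) and that effects are compared additively on that metric's scale; both are assumptions, and `interaction_effect` is a hypothetical helper.

```python
def interaction_effect(base: float, only_a: float, only_b: float, both: float) -> float:
    """Interaction term of a 2x2 ablation.

    base   : metric for v0.6.3 (neither change)
    only_a : metric for v0.7.0 (max_size change only)
    only_b : metric for v0.7.1 (loss change only)
    both   : metric for v0.7.2 (both changes)
    """
    effect_a = only_a - base   # gain from smaller structures alone
    effect_b = only_b - base   # gain from the simplified loss alone
    combined = both - base     # gain from applying both changes
    return combined - (effect_a + effect_b)
```

For a higher-is-better metric, a value near zero suggests the two changes act independently, a clearly positive value suggests synergy, and a clearly negative value suggests they overlap. What counts as "near zero" depends on run-to-run noise, which a single run per version cannot estimate.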