---
title: Pomodoro v0.6.x Training Regime Observations
tags: [journal, pomodoro, training, observations, v0.6]
created: 2026-05-04
updated: 2026-05-04
status: active
related:
---
Pomodoro v0.6.x Training Regime Observations
Version Config Recap
| Version | Changed Param | Value | Baseline (v0.6.0) | Purpose |
|---|---|---|---|---|
| v0.6.0 | — (baseline) | — | — | Reference |
| v0.6.1 | max_size | 512 | 128 | Larger structures |
| v0.6.2 | L | 32 | 8 | Deeper model (4x layers) |
| v0.6.3 | S | 64 | 32 | Wider model (2x state size) |
All versions: r: float = 0.0 (changed from 1e-3 after initial setup).
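
For quick reference, the table can be written as overrides on a baseline config. This is only a sketch: the class name `RunConfig` and the helper `changed_params` are hypothetical, and the real config may hold many more fields than the four listed in this note.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container holding only the parameters named in this note."""
    max_size: int = 128  # maximum structure size in residues (v0.6.0 value)
    L: int = 8           # number of layers (v0.6.0 value)
    S: int = 32          # state size (v0.6.0 value)
    r: float = 0.0       # 0.0 for all versions (was 1e-3 during initial setup)


BASELINE = RunConfig()

# Each version changes exactly one parameter relative to the baseline.
VERSIONS = {
    "v0.6.0": BASELINE,
    "v0.6.1": replace(BASELINE, max_size=512),
    "v0.6.2": replace(BASELINE, L=32),
    "v0.6.3": replace(BASELINE, S=64),
}


def changed_params(cfg: RunConfig, base: RunConfig = BASELINE) -> dict:
    """Return the fields of cfg that differ from the baseline."""
    base_values = asdict(base)
    return {k: v for k, v in asdict(cfg).items() if base_values[k] != v}
```

Calling `changed_params(VERSIONS["v0.6.2"])` gives back just the `L` override, which makes it easy to confirm that each run differs from the baseline in one parameter only.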
Training Observations
v0.6.0 (baseline)
- Stagnates after ~3 days; any training progress beyond that point is very slow.
- Serves as the reference; other versions are compared against this.
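
"Stagnates" is judged by eye so far. One possible way to make it measurable is to flag the first point where the validation loss improved by less than some relative margin over the preceding 24 hours. The function name, the window, and the 1% threshold below are all assumptions for illustration, not values used in these runs.

```python
def stagnation_hour(history, window_hours=24.0, min_rel_improvement=0.01):
    """Return the first elapsed hour at which training looks stagnant.

    history: list of (elapsed_hours, loss) tuples sorted by time, lower loss is better.
    A point counts as stagnant when the loss improved by less than
    min_rel_improvement (relative) compared with the most recent point
    logged at least window_hours earlier. Returns None if never stagnant.
    """
    for k, (t, loss) in enumerate(history):
        # Losses logged at least one full window before the current point.
        past = [l for (tp, l) in history[:k] if tp <= t - window_hours]
        if not past:
            continue
        ref = past[-1]
        if ref <= 0:
            continue  # relative improvement is undefined for non-positive losses
        if (ref - loss) / ref < min_rel_improvement:
            return t
    return None


# Made-up loss curve that flattens out around day 3.
example = [(0, 1.0), (24, 0.5), (48, 0.3), (72, 0.299), (96, 0.2989)]
print(stagnation_hour(example))  # 72
```

The same check can be run on each version's log so that "stagnates after ~3 days" means the same thing for v0.6.0 and v0.6.3.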
v0.6.1 (large structures, max_size=512)
- Will take much longer to train than the baseline: larger structures mean more compute per sample.
- Decision: Stay in the low structure size regime for now to pretrain the model effectively.
- Consider reducing max_size from 128 down to 64 residues to speed up pretraining.
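
A rough sense of the compute gap, assuming per-sample cost is dominated by terms that scale with the number of residue pairs (a distance matrix has max_size² entries). This scaling is an assumption; the model's actual cost may grow differently, and the helper names are hypothetical.

```python
def pair_count(n_residues: int) -> int:
    """Number of entries in an n x n distance matrix."""
    return n_residues * n_residues


def relative_cost(n_residues: int, base: int = 128) -> float:
    """Pairwise cost relative to the baseline max_size (assumed O(n^2) scaling)."""
    return pair_count(n_residues) / pair_count(base)


for size in (64, 128, 512):
    print(size, pair_count(size), relative_cost(size))
# 64 4096 0.25
# 128 16384 1.0
# 512 262144 16.0
```

Under that assumption, v0.6.1 costs about 16x the baseline per sample, while dropping to 64 residues would cost about a quarter.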
v0.6.2 (deep model, L=32)
- Trains very slowly; not fully trained yet.
- May show better performance than v0.6.0 and v0.6.3 once fully trained — too early to tell.
- 4x layers is a significant capacity increase; needs more training time.
v0.6.3 (wide model, S=64)
- Stagnates after ~3 days, similar to v0.6.0.
- Better than v0.6.0 on lDDT, which likely indicates that the extra capacity helps with correct local structure.
- Global structure quality still limited by the same stagnation behavior.
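
For context on why an lDDT gain reads as a local improvement: lDDT only scores distances between atoms that are close together in the reference (15 Å by default) and needs no superposition. Below is a simplified Cα-only sketch of the score. It is not necessarily the exact implementation used to evaluate these runs, and `ca_lddt` is a hypothetical name.

```python
import math


def ca_lddt(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified C-alpha lDDT in [0, 1].

    pred, ref: equal-length sequences of (x, y, z) coordinates in angstroms.
    Only residue pairs closer than `cutoff` in the reference are scored, and
    each pair is checked against every tolerance threshold.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of residues")
    preserved = 0
    total = 0
    n = len(ref)
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= cutoff:
                continue  # distant pairs do not contribute: the score is local
            diff = abs(d_ref - math.dist(pred[i], pred[j]))
            total += len(thresholds)
            preserved += sum(diff < t for t in thresholds)
    return preserved / total if total else 0.0


ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0)]
print(ca_lddt(ref, ref))   # 1.0
print(ca_lddt(pred, ref))  # 0.75
```

Because pairs beyond the cutoff are ignored, a model can score well here while still misplacing domains relative to each other, which fits the local-versus-global picture in these notes.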
Conclusions & Next Steps
- Reduce structure size: Move from 128 → 64 residues to speed up pretraining iterations.
- Try simplified loss: Run a v0.6.3-like model (S=64) without the distance matrix constraint — fewer loss terms may ease optimization and break the stagnation pattern.
- v0.6.1 deprioritized: Large structures are premature; pretrain on smaller structures first.
- v0.6.2 needs more time: Deeper model is slower but may be worth the wait; monitor for improvement.
- Local vs global: v0.6.3’s lDDT advantage over v0.6.0 suggests capacity helps local accuracy, but global convergence remains a bottleneck across versions.
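
The simplified-loss experiment amounts to switching one term off. A minimal sketch of that toggle follows. The actual loss terms in Pomodoro are not listed in this note, so `coordinate_loss` is a hypothetical stand-in for whatever else the loss contains, and both terms use plain mean squared error purely for illustration.

```python
import math


def coordinate_loss(pred, ref):
    """Hypothetical stand-in term: mean squared coordinate error (frames assumed aligned)."""
    return sum(math.dist(p, r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def distance_matrix_loss(pred, ref):
    """Mean squared error between pairwise distances of pred and ref."""
    n = len(ref)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    sq_errors = (
        (math.dist(pred[i], pred[j]) - math.dist(ref[i], ref[j])) ** 2
        for i, j in pairs
    )
    return sum(sq_errors) / len(pairs)


def total_loss(pred, ref, use_distance_term=True, distance_weight=1.0):
    """Sum of loss terms, with the distance matrix constraint optionally disabled."""
    loss = coordinate_loss(pred, ref)
    if use_distance_term:
        loss += distance_weight * distance_matrix_loss(pred, ref)
    return loss


ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(total_loss(pred, ref))                           # 1.5
print(total_loss(pred, ref, use_distance_term=False))  # 0.5
```

Keeping the toggle as a flag rather than deleting the term makes it easy to run the S=64 model both ways and compare where each one plateaus.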