Canvas World Model: Experimentation Loop Summary
Date: 2026-03-18
Dataset: single-action-shoulder-pan-700-combined (700 canvases, 464x480, 6-joint motor strip)
Goal: Action-conditioned next-frame prediction for robotics
Architecture Comparison Table

| Iteration | Model | Params | Config | Val MSE | SSIM | PSNR | Action Disc. | Motor Dir Acc | Motor Pos MAE | Motor Consistency | TF/FR Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT | 6.9M | 256d8 | 0.0127 | 0.697 | 19.4 | 0.030 | 86.9% | 0.877 | 1.170 | 0.007 |
| 2 | GPT | 22M | 384d12 | 0.0102 | 0.756 | 20.4 | 0.031 | 83.1% | 0.612 | 1.007 | 0.006 |
| 3 | MAE | ~15M | 384d8 | 0.0147 | 0.616 | 19.1 | 0.022 | 80.6% | 1.696 | 1.946 | N/A |
| 4 | GPT | 29M | 384d16 | 0.0133 | 0.730 | 19.4 | 0.034 | 83.1% | 0.789 | 1.087 | 0.009 |
| 5 | GPT | 38M | 512d12 | 0.0110 | 0.746 | 20.1 | 0.032 | 87.5% | 0.615 | 1.037 | 0.007 |
| 6 | GPT | 22M | 384d12+cosine | 0.0113 | 0.752 | 20.0 | 0.034 | 85.0% | 0.463 | 0.815 | 0.007 |
Key Findings
1. GPT dominates MAE for this task.
The autoregressive approach (GPT) outperformed the masked reconstruction approach (MAE) on every metric, despite GPT's error compounding issue. The sequential patch dependencies provide crucial context for coherent prediction. MAE's parallel reconstruction produces blurrier, less accurate results.
2. Model capacity has a sweet spot at ~22M params for 700 samples.
Going from 6.9M to 22M (iter1→iter2) improved every metric substantially. Going beyond 22M (iter4: 29M, iter5: 38M) brought no benefit: the larger models overfit more without generalizing better. This is consistent with the usual capacity/data trade-off, where model size should track dataset size.
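The config strings in the table encode width and depth (e.g. 384d12 is width 384, 12 layers). As a sanity check on the reported sizes, the standard per-layer count for a decoder-only transformer block (~4·width² for attention plus ~8·width² for a 4x-expansion MLP, ignoring embeddings and norms) roughly reproduces the table:

```python
def approx_transformer_params(width: int, depth: int) -> int:
    """Rough parameter count for a decoder-only transformer:
    per layer, ~4*width^2 for the Q/K/V/O attention projections
    plus ~8*width^2 for a 4x-expansion MLP; patch embeddings and
    LayerNorms are ignored, so this slightly undercounts."""
    return 12 * width * width * depth

# The "WdL" configs from the table above:
for width, depth in [(256, 8), (384, 12), (384, 16), (512, 12)]:
    millions = approx_transformer_params(width, depth) / 1e6
    print(f"{width}d{depth}: ~{millions:.1f}M")
```

This yields ~6.3M, ~21.2M, ~28.3M, and ~37.7M respectively, matching the table's 6.9M/22M/29M/38M once embedding layers are added back in.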
3. Depth scaling hurts more than width scaling.
Adding more transformer layers (12→16) worsened the TF/FR gap from 0.006 to 0.009, amplifying autoregressive error compounding. Width scaling (384→512) was neutral. For autoregressive models, shallower-and-wider is preferable.
4. Cosine schedule excels at motor strip precision.
The cosine schedule with warmup (iter6) achieved the best motor position MAE (0.463, 24% better than the plateau schedule) and the best motor consistency (0.815, 19% better). The smooth LR decay lets the model fine-tune the precise grayscale values of the motor strip encoding. For robotics applications where motor state accuracy matters, the cosine schedule is recommended.
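A minimal sketch of the iter6 schedule, assuming linear warmup followed by cosine decay (the warmup length and floor LR here are illustrative assumptions, not values taken from the runs):

```python
import math

def cosine_with_warmup(step: int, total_steps: int, warmup_steps: int,
                       base_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The smooth tail is the point: unlike ReduceLROnPlateau's discrete drops, the LR glides toward the floor, giving the model many low-LR steps in which to settle the fine grayscale values of the motor strip.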
5. Autoregressive error compounding is the dominant limitation.
Across all GPT iterations, the TF/FR gap remained 0.006-0.009, and per-position loss showed 10-50x increase at the end of the raster scan (motor strip). This is a fundamental architectural limitation that cannot be solved by scaling or schedule changes.
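The per-position diagnostic behind this finding is a few lines, assuming predictions and targets are arrays of shape (batch, num_patches, patch_dim) in raster-scan order (shapes and names here are illustrative):

```python
import numpy as np

def per_position_mse(preds: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """MSE at each raster position, averaged over batch and patch pixels.
    A rising tail in the returned curve is the signature of autoregressive
    error compounding; the motor strip sits at the end of the scan, which
    is where the 10-50x blow-up shows up."""
    return ((preds - targets) ** 2).mean(axis=(0, 2))
```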
6. Action discrimination is consistently positive but modest.
All models achieved action discrimination scores of 0.022-0.034, confirming they learn to respond to different action separator colors. However, the scores are low in absolute terms. The dataset has only 2 action types (move+ and move-) with no "stay" actions, limiting diversity.
Recommended Configuration
Best overall visual quality: GPT 384d12 with ReduceLROnPlateau (iter2)
SSIM=0.756, PSNR=20.4, val_mse=0.0102
Best motor state prediction: GPT 384d12 with cosine schedule (iter6)
Motor pos MAE=0.463, motor consistency=0.815, motor dir acc=85%
For robotics use (where motor accuracy matters most): iter6 (cosine schedule)
What Would Help Next (not tried)
1. More data (highest priority): 700 canvases limits model capacity to ~22M params. With 2000-5000 canvases, the 38M+ models should start to shine.
2. Scheduled sampling: During training, randomly replace teacher-forced patches with model predictions. Would directly address the TF/FR gap (0.006-0.009) without changing architecture.
3. Bidirectional refinement: After GPT autoregressive pass, add a bidirectional refinement pass (like a denoising step) to correct accumulated errors in later patches.
4. Hybrid approach: Use MAE for the visual frame region (parallel prediction) but GPT-style autoregressive prediction for the motor strip (which is sequential in nature). Best of both worlds.
5. Action diversity: Add "stay" (action=0) samples to the dataset. Current data only has move+ and move-, limiting what the model can learn about action effects.
6. Perceptual loss: Add LPIPS or VGG perceptual loss term alongside MSE. Would improve visual quality (SSIM/PSNR) by penalizing blurriness rather than just pixel error.
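Idea 2 above is cheap to prototype. A sketch of the patch-mixing step, assuming the model's predictions from the previous forward pass are available (all names are illustrative):

```python
import random

def mix_teacher_and_model(gt_patches, pred_patches, sample_prob, rng=None):
    """Build the input sequence for the next training pass: at each
    position, use the model's own previous prediction with probability
    sample_prob, otherwise the ground-truth (teacher-forced) patch.
    Ramping sample_prob up from 0 over training exposes the model to
    its own errors, which should shrink the TF/FR gap."""
    rng = rng or random.Random()
    return [pred if rng.random() < sample_prob else gt
            for gt, pred in zip(gt_patches, pred_patches)]
```

With sample_prob=0 this reduces to ordinary teacher forcing, and with sample_prob=1 to pure free-running, so the same code path covers both regimes.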
Individual Reports