Diffusion Model: Experimentation Loop Summary
Date: 2026-03-19
Dataset: single-action-shoulder-pan-700-combined (700 canvases)
Diffusion Architecture Comparison
| Iteration | Config | Prediction | Weight Decay |
Val MSE | SSIM | PSNR |
Action Disc. | Motor Dir Acc | Motor Pos MAE |
| 1 | 256d8 | epsilon | 0.05 |
0.345 | 0.008 | 4.6 |
0.491 | 50.0% | 8.51 |
| 2 | 384d12 | epsilon | 0.0 |
0.344 | 0.008 | 4.6 |
0.490 | 50.0% | 8.49 |
| 3 | 384d12 | sample | 0.0 |
0.011 | 0.730 | 19.7 |
0.039 | 84.4% | 0.954 |
| 4 | 512d12 | sample | 0.01 |
0.009 | 0.775 | 21.0 |
0.038 | 82.5% | 0.957 |
vs. Best GPT Models
| Model | Val MSE | SSIM | PSNR |
Action Disc. | Motor Dir Acc | Motor Pos MAE |
Dynamic MSE | No Error Compound? |
| GPT iter2 (384d12) |
0.0102 | 0.756 | 20.4 |
0.031 | 83.1% | 0.612 |
0.021 | No (TF/FR gap=0.006) |
| GPT iter6 (384d12+cosine) |
0.0113 | 0.752 | 20.0 |
0.034 | 85.0% | 0.463 |
0.024 | No |
| Diffusion iter4 (512d12+sample) |
0.009 | 0.775 | 21.0 |
0.038 | 82.5% | 0.957 |
0.018 | Yes |
Key Findings
1. Epsilon prediction is BROKEN for conditional canvas diffusion.
Iterations 1-2 both produced pure noise despite training losses dropping to 0.506-0.675.
The issue: epsilon prediction asks the model to predict noise, but the canvas has an asymmetry —
context patches are always clean while only the last frame is noisy. The DDIM inference chain
compounds errors from high-noise timesteps. Switching to sample prediction (predict clean image
directly) fixed everything immediately.
2. Sample prediction + iterative x_0 refinement is the correct inference approach.
Standard DDIM doesn't work well with sample prediction because the noise re-derivation step
introduces errors. Iterative x_0 refinement (predict x_0, re-noise to t-1 level, repeat) is
more stable and produces dramatically better results.
3. Weight decay must be tuned carefully: 0.0 for convergence, 0.01 for regularization.
The baseline (wd=0.05) prevented learning entirely. Zero weight decay (wd=0.0) allowed training
but caused severe overfitting (train/val gap -0.013). Weight decay 0.01 hit the sweet spot:
good convergence + controlled overfitting (gap -0.005).
4. The diffusion model beats GPT on visual quality but loses on motor precision.
The wider diffusion model (512d12) achieves the best SSIM (0.775), PSNR (21.0), and dynamic
pixel MSE (0.018) across all experiments. However, motor position MAE (0.957) is significantly
worse than GPT (0.612). The stochastic generation process adds inherent noise to the precise
grayscale motor patches.
5. Diffusion models eliminate autoregressive error compounding.
Unlike GPT which shows 10-50x higher error at the end of the raster scan (motor strip patches),
the diffusion model produces uniform spatial quality. No TF/FR gap. This is a fundamental
architectural advantage for tasks requiring uniform spatial accuracy.
Recommended Configuration
Best overall quality: Diffusion 512d12, sample prediction, wd=0.01, cosine LR
SSIM=0.775, PSNR=21.0, val_mse=0.009
Best motor precision: GPT 384d12 with cosine schedule (from GPT experiments)
Motor pos MAE=0.463, motor consistency=0.815
For robotics (motor accuracy critical): GPT with cosine schedule
For visual prediction (image quality critical): Diffusion with sample prediction
What Would Help Next
1. Deterministic inference for motor strip: During diffusion inference, skip stochastic re-noising for motor strip patches — use deterministic x_0 prediction only. This should combine diffusion's visual quality with GPT-level motor accuracy.
2. EMA (Exponential Moving Average): Standard for diffusion models. The EMA model weights produce smoother, higher-quality predictions. Not yet implemented.
3. More data: The diffusion model with 512d12 (~50M params) benefits from more capacity than GPT, but 700 canvases still limits it. With 2000+ canvases, the quality gap over GPT would likely widen.
4. Hybrid model: Use diffusion for visual frame prediction, GPT-style direct prediction for motor strip. Best of both worlds.
Individual Reports
← Back to GPT Experiment Summary