Diffusion Model: Experimentation Loop Summary

Date: 2026-03-19

Dataset: single-action-shoulder-pan-700-combined (700 canvases)

Diffusion Architecture Comparison

| Iteration | Config | Prediction | Weight Decay | Val MSE | SSIM | PSNR | Action Disc. | Motor Dir Acc | Motor Pos MAE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 256d8 | epsilon | 0.05 | 0.345 | 0.008 | 4.6 | 0.491 | 50.0% | 8.51 |
| 2 | 384d12 | epsilon | 0.0 | 0.344 | 0.008 | 4.6 | 0.490 | 50.0% | 8.49 |
| 3 | 384d12 | sample | 0.0 | 0.011 | 0.730 | 19.7 | 0.039 | 84.4% | 0.954 |
| 4 | 512d12 | sample | 0.01 | 0.009 | 0.775 | 21.0 | 0.038 | 82.5% | 0.957 |

vs. Best GPT Models

| Model | Val MSE | SSIM | PSNR | Action Disc. | Motor Dir Acc | Motor Pos MAE | Dynamic MSE | Error Compounding? |
|---|---|---|---|---|---|---|---|---|
| GPT iter2 (384d12) | 0.0102 | 0.756 | 20.4 | 0.031 | 83.1% | 0.612 | 0.021 | No (TF/FR gap=0.006) |
| GPT iter6 (384d12+cosine) | 0.0113 | 0.752 | 20.0 | 0.034 | 85.0% | 0.463 | 0.024 | No |
| Diffusion iter4 (512d12+sample) | 0.009 | 0.775 | 21.0 | 0.038 | 82.5% | 0.957 | 0.018 | No (uniform spatial quality) |

Key Findings

1. Epsilon prediction is BROKEN for conditional canvas diffusion. Iterations 1-2 both produced pure noise despite training losses dropping to 0.506-0.675. The cause: epsilon prediction asks the model to predict the added noise, but the canvas is asymmetric: context patches are always clean while only the last frame is noisy. The DDIM inference chain then compounds errors from high-noise timesteps. Switching to sample prediction (predicting the clean image directly) fixed the failure immediately.
2. Sample prediction + iterative x_0 refinement is the correct inference approach. Standard DDIM works poorly with sample prediction because re-deriving the noise from the x_0 estimate introduces errors. Iterative x_0 refinement (predict x_0, re-noise to the t-1 level, repeat) is more stable and produces dramatically better results.
3. Weight decay must be tuned carefully: 0.0 for convergence, 0.01 for regularization. The baseline (wd=0.05) prevented learning entirely. Zero weight decay (wd=0.0) allowed training but caused severe overfitting (train/val gap -0.013). Weight decay 0.01 hit the sweet spot: good convergence with controlled overfitting (gap -0.005).
4. The diffusion model beats GPT on visual quality but loses on motor precision. The wider diffusion model (512d12) achieves the best SSIM (0.775), PSNR (21.0), and dynamic pixel MSE (0.018) across all experiments. However, motor position MAE (0.957) is significantly worse than GPT (0.612). The stochastic generation process adds inherent noise to the precise grayscale motor patches.
5. Diffusion models eliminate autoregressive error compounding. Unlike GPT, which shows 10-50x higher error at the end of the raster scan (the motor-strip patches), the diffusion model produces uniform spatial quality with no TF/FR gap. This is a fundamental architectural advantage for tasks requiring uniform spatial accuracy.
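The iterative x_0 refinement loop from finding 2 can be sketched as follows. This is a minimal illustration, not the project's code: the `model(x_t, t, context) -> x_0_hat` interface, the `alphas_cumprod` schedule, and all names are assumptions.

```python
import numpy as np

def sample_prediction_inference(model, x_T, context, alphas_cumprod, steps, seed=0):
    """Iterative x_0 refinement: at each step the model predicts the clean
    canvas x_0 directly, and that estimate is re-noised to the next (lower)
    timestep's noise level, instead of re-deriving epsilon as plain DDIM does.
    `steps` runs from high t down to low t; all names here are illustrative."""
    rng = np.random.default_rng(seed)
    x = x_T
    for t, t_prev in zip(steps[:-1], steps[1:]):
        x0_hat = model(x, t, context)          # direct clean-image prediction
        a_prev = alphas_cumprod[t_prev]        # cumulative alpha at level t_prev
        noise = rng.standard_normal(x.shape)
        # forward-process re-noising of the x_0 estimate to level t_prev
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise
    return model(x, steps[-1], context)        # final refinement pass
```

Because each step starts from a fresh x_0 estimate, errors in the noise re-derivation never accumulate across the chain, which matches the stability observed in iteration 3.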

Recommended Configuration

Best overall quality: Diffusion 512d12, sample prediction, wd=0.01, cosine LR
SSIM=0.775, PSNR=21.0, val_mse=0.009

Best motor precision: GPT 384d12 with cosine schedule (from GPT experiments)
Motor pos MAE=0.463, motor consistency=0.815

For robotics (motor accuracy critical): GPT with cosine schedule
For visual prediction (image quality critical): Diffusion with sample prediction

What Would Help Next

1. Deterministic inference for motor strip: During diffusion inference, skip stochastic re-noising for motor strip patches — use deterministic x_0 prediction only. This should combine diffusion's visual quality with GPT-level motor accuracy.

2. EMA (Exponential Moving Average): Standard for diffusion models. The EMA model weights produce smoother, higher-quality predictions. Not yet implemented.

3. More data: The 512d12 diffusion model (~50M params) has more capacity than the GPT baselines, but 700 canvases still limits it. With 2000+ canvases, the quality gap over GPT would likely widen.

4. Hybrid model: Use diffusion for visual frame prediction, GPT-style direct prediction for motor strip. Best of both worlds.
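Item 1 amounts to masking out the stochastic re-noising for motor-strip positions. A minimal sketch of one such step, assuming a boolean `motor_mask` marking motor-strip pixels (the mask and function name are illustrative, not from the codebase):

```python
import numpy as np

def masked_renoise_step(x0_hat, a_prev, motor_mask, rng):
    """One diffusion re-noise step in which motor-strip positions
    (motor_mask=True) keep the deterministic x_0 estimate, while the
    remaining visual patches are re-noised to the t-1 level as usual."""
    noise = rng.standard_normal(x0_hat.shape)
    renoised = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise
    return np.where(motor_mask, x0_hat, renoised)
```

Swapped into a sample-prediction inference loop in place of the uniform re-noising step, this would keep the stochastic chain (and its visual quality) for image patches while giving the motor strip a purely deterministic trajectory.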
