Diffusion Model: Experimentation Loop Summary

Date: 2026-03-19

Dataset: single-action-shoulder-pan-700-combined (700 canvases)

Diffusion Architecture Comparison

| Iteration | Config | Prediction | Weight Decay | Val MSE | SSIM | PSNR | Action Disc. | Motor Dir Acc | Motor Pos MAE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 256d8 | epsilon | 0.05 | 0.345 | 0.008 | 4.6 | 0.491 | 50.0% | 8.51 |
| 2 | 384d12 | epsilon | 0.0 | 0.344 | 0.008 | 4.6 | 0.490 | 50.0% | 8.49 |
| 3 | 384d12 | sample | 0.0 | 0.011 | 0.730 | 19.7 | 0.039 | 84.4% | 0.954 |
| 4 | 512d12 | sample | 0.01 | 0.009 | 0.775 | 21.0 | 0.038 | 82.5% | 0.957 |

vs. Best GPT Models

| Model | Val MSE | SSIM | PSNR | Action Disc. | Motor Dir Acc | Motor Pos MAE | Dynamic MSE | Error Compounding? |
|---|---|---|---|---|---|---|---|---|
| GPT iter2 (384d12) | 0.0102 | 0.756 | 20.4 | 0.031 | 83.1% | 0.612 | 0.021 | No (TF/FR gap=0.006) |
| GPT iter6 (384d12+cosine) | 0.0113 | 0.752 | 20.0 | 0.034 | 85.0% | 0.463 | 0.024 | No |
| Diffusion iter4 (512d12+sample) | 0.009 | 0.775 | 21.0 | 0.038 | 82.5% | 0.957 | 0.018 | No (uniform spatial quality) |

Key Findings

1. Epsilon prediction is BROKEN for conditional canvas diffusion. Iterations 1-2 both produced pure noise despite training losses dropping to 0.506-0.675. The cause: epsilon prediction asks the model to predict the added noise, but the canvas is asymmetric: context patches are always clean while only the last frame is noisy. The DDIM inference chain then compounds errors from high-noise timesteps. Switching to sample prediction (predicting the clean image directly) fixed the failure immediately.
2. Sample prediction + iterative x_0 refinement is the correct inference approach. Standard DDIM works poorly with sample prediction because re-deriving the noise from the x_0 estimate introduces errors. Iterative x_0 refinement (predict x_0, re-noise to the t-1 level, repeat) is more stable and produces dramatically better results.
3. Weight decay must be tuned carefully: 0.0 for convergence, 0.01 for regularization. The baseline (wd=0.05) prevented learning entirely. Zero weight decay (wd=0.0) allowed training but caused severe overfitting (train/val gap -0.013). Weight decay 0.01 hit the sweet spot: good convergence with controlled overfitting (gap -0.005).
4. The diffusion model beats GPT on visual quality but loses on motor precision. The wider diffusion model (512d12) achieves the best SSIM (0.775), PSNR (21.0), and dynamic pixel MSE (0.018) across all experiments. However, motor position MAE (0.957) is significantly worse than GPT (0.612). The stochastic generation process adds inherent noise to the precise grayscale motor patches.
5. Diffusion models eliminate autoregressive error compounding. Unlike GPT, which shows 10-50x higher error at the end of the raster scan (the motor-strip patches), the diffusion model produces uniform spatial quality with no TF/FR gap. This is a fundamental architectural advantage for tasks requiring uniform spatial accuracy.
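The iterative x_0 refinement loop from finding 2 can be sketched as follows. This is a minimal illustration, not the project's code: the `model(x_t, t, context) -> x_0_hat` interface, the `alphas_cumprod` schedule, and all names are assumptions.

```python
import numpy as np

def sample_prediction_inference(model, x_T, context, alphas_cumprod, steps, seed=0):
    """Iterative x_0 refinement: at each step the model predicts the clean
    canvas x_0 directly, and that estimate is re-noised to the next (lower)
    timestep's noise level, instead of re-deriving epsilon as plain DDIM does.
    `steps` runs from high t down to low t; all names here are illustrative."""
    rng = np.random.default_rng(seed)
    x = x_T
    for t, t_prev in zip(steps[:-1], steps[1:]):
        x0_hat = model(x, t, context)          # direct clean-image prediction
        a_prev = alphas_cumprod[t_prev]        # cumulative alpha at level t_prev
        noise = rng.standard_normal(x.shape)
        # forward-process re-noising of the x_0 estimate to level t_prev
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise
    return model(x, steps[-1], context)        # final refinement pass
```

Because each step starts from a fresh x_0 estimate, errors in the noise re-derivation never accumulate across the chain, which matches the stability observed in iteration 3.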

Recommended Configuration

Best overall quality: Diffusion 512d12, sample prediction, wd=0.01, cosine LR
SSIM=0.775, PSNR=21.0, val_mse=0.009

Best motor precision: GPT 384d12 with cosine schedule (from GPT experiments)
Motor pos MAE=0.463, motor consistency=0.815

For robotics (motor accuracy critical): GPT with cosine schedule
For visual prediction (image quality critical): Diffusion with sample prediction

What Would Help Next

1. Deterministic inference for motor strip: During diffusion inference, skip stochastic re-noising for motor strip patches — use deterministic x_0 prediction only. This should combine diffusion's visual quality with GPT-level motor accuracy.

2. EMA (Exponential Moving Average): Standard for diffusion models. The EMA model weights produce smoother, higher-quality predictions. Not yet implemented.

3. More data: The 512d12 diffusion model (~50M params) has more capacity than the GPT baselines, but 700 canvases still limits it. With 2000+ canvases, the quality gap over GPT would likely widen.

4. Hybrid model: Use diffusion for visual frame prediction, GPT-style direct prediction for motor strip. Best of both worlds.
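Item 1 amounts to masking out the stochastic re-noising for motor-strip positions. A minimal sketch of one such step, assuming a boolean `motor_mask` marking motor-strip pixels (the mask and function name are illustrative, not from the codebase):

```python
import numpy as np

def masked_renoise_step(x0_hat, a_prev, motor_mask, rng):
    """One diffusion re-noise step in which motor-strip positions
    (motor_mask=True) keep the deterministic x_0 estimate, while the
    remaining visual patches are re-noised to the t-1 level as usual."""
    noise = rng.standard_normal(x0_hat.shape)
    renoised = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise
    return np.where(motor_mask, x0_hat, renoised)
```

Swapped into a sample-prediction inference loop in place of the uniform re-noising step, this would keep the stochastic chain (and its visual quality) for image patches while giving the motor strip a purely deterministic trajectory.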
