Evaluation Report: diff_iter2_384d12_cosine

Run Name: diff_iter2_384d12_cosine

Model Type: diffusion

Checkpoint: local/checkpoints/diff_iter2_larger/best.pth

Dataset: local/datasets/single-action-shoulder-pan-700-combined

Date: 2026-03-19 02:33:11

Val Samples: 80

Analysis Notes

DIFFUSION ITERATION 2: larger model + training fixes

Changes from Iteration 1 (a total failure; the model produced noise):

- embed_dim: 256 -> 384 (50% increase)
- depth: 8 -> 12 (50% more layers)
- num_heads: 8 -> 12
- weight_decay: 0.05 -> 0.0 (critical for diffusion models)
- lr: 1e-4 -> 3e-4 (higher peak, with warmup)
- lr_schedule: plateau -> cosine with 10-epoch warmup
- grad_clip: 1.0 (stability for the larger model)
- epochs: 200 -> 300 (more training time)
- ~22M parameters

Rationale: Iteration 1 failed completely (SSIM = 0.008, essentially noise). The denoising loss was ~0.675 uniformly across all timestep buckets, meaning the model did not learn to denoise at ANY noise level. Root causes:

1. Model too small: 7M parameters is insufficient for denoising. The GPT model also needed 22M to work well. DiT models in the literature are typically 100M+, but 22M should be enough for this canvas size.
2. Weight decay of 0.05: DiT/diffusion models typically use 0.0 weight decay. High weight decay fights the zero-initialization of the adaLN modulation layers, preventing the model from learning.
3. No warmup: DiT training is sensitive to the early learning rate. Warmup stabilizes the initial weight updates.
4. No gradient clipping: large gradients from random noise targets can destabilize early training.
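The "cosine with 10-epoch warmup" schedule described above can be sketched as a simple epoch-indexed function: linear warmup to the 3e-4 peak over 10 epochs, then cosine decay to zero over the remaining 290. This is a minimal illustration, not the training code; the names (`lr_at_epoch`, `EPOCHS`, `WARMUP`, `PEAK_LR`) are assumptions.

```python
import math

EPOCHS = 300   # total epochs for iteration 2
WARMUP = 10    # linear warmup epochs
PEAK_LR = 3e-4 # peak learning rate after warmup

def lr_at_epoch(epoch: int) -> float:
    """Learning rate for a given epoch: linear warmup, then cosine decay."""
    if epoch < WARMUP:
        return PEAK_LR * (epoch + 1) / WARMUP           # ramp up to the peak
    progress = (epoch - WARMUP) / (EPOCHS - WARMUP)     # 0 at peak -> 1 at end
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_epoch(0))    # small LR at the start of warmup
print(lr_at_epoch(9))    # peak LR at the end of warmup
print(lr_at_epoch(299))  # decayed to near zero
```

In PyTorch this shape is typically realized with `torch.optim.lr_scheduler.LambdaLR` or by chaining `LinearLR` and `CosineAnnealingLR` via `SequentialLR`.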

Metrics

| Metric | Value |
| --- | --- |
| val_mse | 0.343989 |
| val_mse_visual | 0.343989 |
| ssim | 0.008377 |
| psnr | 4.636465 |
| val_mse_motor_strip | 0.355602 |
| val_mse_action_1 | 0.346389 |
| val_mse_action_2 | 0.341200 |
| val_mse_static | 0.352569 |
| val_mse_dynamic | 0.333238 |
| motor_position_mae_mean | 8.487006 |
| motor_velocity_mae_mean | 1.673833 |
| motor_direction_accuracy | 0.500000 |
| motor_consistency_error | 8.318934 |
| diffusion_loss_t_0_250 | 0.507211 |
| diffusion_loss_t_250_500 | 0.505593 |
| diffusion_loss_t_500_750 | 0.505646 |
| diffusion_loss_t_750_1000 | 0.505727 |
| action_discrimination_score | 0.490408 |
| motor_discrimination_score | 0.488563 |
| motor_position_mae_per_joint | [22.9280, 4.6155, 8.8229, 7.6031, 2.0917, 4.8608] |
| motor_position_mae_action_1 | 8.232281 |
| motor_position_mae_action_2 | 8.783039 |
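The motor_direction_accuracy of 0.500000 sits exactly at chance for a binary sign decision. A metric consistent with that value is the fraction of joints whose predicted position change has the same sign as the ground-truth change; the sketch below is an illustrative reconstruction, and `direction_accuracy` and its arguments are assumed names, not taken from the evaluation code.

```python
import numpy as np

def direction_accuracy(gt_delta: np.ndarray, pred_delta: np.ndarray) -> float:
    """Fraction of (sample, joint) entries where the predicted position
    change moves in the same direction as the ground truth.

    gt_delta, pred_delta: arrays of shape (N, J) of per-joint deltas.
    """
    return float(np.mean(np.sign(gt_delta) == np.sign(pred_delta)))

# One sample, two joints: J0 direction correct, J1 direction wrong.
print(direction_accuracy(np.array([[1.0, -2.0]]),
                         np.array([[0.5, 1.0]])))  # -> 0.5
```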

Recommendations

Motor direction accuracy is at chance level (0.50); the model may not be learning action-to-motor mappings. Try a larger separator width or explicit action-token embeddings.
Motor position and velocity predictions are inconsistent. Consider adding a consistency loss or simplifying the velocity encoding.
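The consistency loss suggested above could penalize disagreement between the predicted velocity and the finite difference of consecutive predicted positions. The sketch below is one possible formulation (NumPy for illustration; `consistency_loss` and `dt` are assumed names), not the project's implementation.

```python
import numpy as np

def consistency_loss(pred_pos: np.ndarray, pred_vel: np.ndarray,
                     dt: float = 1.0) -> float:
    """MSE between predicted velocity and the finite-difference velocity
    implied by consecutive predicted positions.

    pred_pos, pred_vel: arrays of shape (T, J) over T timesteps, J joints.
    """
    fd_vel = (pred_pos[1:] - pred_pos[:-1]) / dt   # implied velocity
    return float(np.mean((pred_vel[1:] - fd_vel) ** 2))

# Positions advancing by 1 unit/step agree with a predicted velocity of 1.
pos = np.array([[0.0], [1.0], [2.0]])
vel = np.array([[0.0], [1.0], [1.0]])
print(consistency_loss(pos, vel))  # -> 0.0
```

Added to the training objective with a small weight, this term ties the two motor heads together without changing either encoding.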

Counterfactual Action Grids

Each grid: Row 1 = GT, Row 2 = STAY (red), Row 3 = MOVE+ (green), Row 4 = MOVE- (blue)

Sample 0

Counterfactual grid 0
Error heatmap 0

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | -39.7233 | -4.1823 | 0.2376 | -4.9934 |
| J1 | -90.9068 | -81.5035 | -83.3020 | -82.8269 |
| J2 | 66.3224 | 77.7375 | 72.5603 | 74.0157 |
| J3 | 39.2845 | 31.7693 | 31.5507 | 31.5420 |
| J4 | 8.5083 | 5.5430 | 5.4178 | 5.7001 |
| J5 | 10.3896 | 5.5114 | 5.1221 | 5.8552 |

Sample 1

Counterfactual grid 1
Error heatmap 1

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | -59.4725 | -2.1793 | -11.6332 | 1.4510 |
| J1 | -89.7081 | -83.7646 | -82.3371 | -82.4657 |
| J2 | 72.1129 | 76.1374 | 73.9607 | 73.9740 |
| J3 | 39.2845 | 32.7208 | 32.4717 | 31.0648 |
| J4 | 8.5083 | 5.8499 | 6.1927 | 5.4717 |
| J5 | 10.3127 | 5.6170 | 6.1536 | 5.6074 |

Sample 2

Counterfactual grid 2
Error heatmap 2

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | 33.1292 | -10.4921 | -9.1252 | 5.9270 |
| J1 | -89.7081 | -82.5372 | -82.3801 | -83.6590 |
| J2 | 70.5801 | 74.9962 | 74.4199 | 74.1325 |
| J3 | 39.2845 | 33.1372 | 31.4575 | 31.2373 |
| J4 | 8.3953 | 5.5251 | 6.2540 | 5.2965 |
| J5 | 10.5433 | 5.0766 | 5.6930 | 5.2209 |

Sample 3

Counterfactual grid 3
Error heatmap 3

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | 43.2233 | 8.6940 | 2.1623 | -3.6199 |
| J1 | -89.1403 | -81.4538 | -82.7848 | -83.5504 |
| J2 | 84.2048 | 74.8665 | 72.1271 | 73.6063 |
| J3 | 40.3077 | 29.9256 | 30.5893 | 30.9743 |
| J4 | 8.5083 | 5.5587 | 5.2230 | 5.8869 |
| J5 | 10.5433 | 5.1369 | 5.9275 | 5.1094 |

Diffusion Loss by Timestep Bucket

| Bucket | Loss |
| --- | --- |
| t_0_250 | 0.507211 |
| t_250_500 | 0.505593 |
| t_500_750 | 0.505646 |
| t_750_1000 | 0.505727 |
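The per-bucket losses above can be computed by averaging each validation sample's denoising loss within four equal-width timestep buckets over [0, 1000). A minimal sketch (the function and variable names are illustrative, not from the evaluation code):

```python
from collections import defaultdict

def bucket_losses(timesteps, losses, n_buckets=4, t_max=1000):
    """Average per-sample denoising losses into equal-width timestep buckets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    width = t_max // n_buckets
    for t, loss in zip(timesteps, losses):
        b = min(t // width, n_buckets - 1)   # clamp t == t_max into last bucket
        sums[b] += loss
        counts[b] += 1
    return {f"t_{b * width}_{(b + 1) * width}": sums[b] / counts[b]
            for b in sorted(sums)}

print(bucket_losses([10, 300, 600, 900], [0.5, 0.4, 0.3, 0.2]))
# -> {'t_0_250': 0.5, 't_250_500': 0.4, 't_500_750': 0.3, 't_750_1000': 0.2}
```

A near-flat profile across buckets, as in this run (~0.506 everywhere), indicates the model fails uniformly at all noise levels rather than struggling at one specific stage of denoising.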