Eval Report: diffusion - 20260319

DIFFUSION ITERATION 1: Baseline DiT on 700-canvas dataset
==========================================================

Model: ConditionalDiffusionViT (DiT with adaLN-Zero)
- embed_dim=256, depth=8, num_heads=8, patch_size=16
- prediction_type=epsilon, beta_schedule=cosine, 1000 timesteps
- 200 epochs, batch_size=4, lr=1e-4, weight_decay=0.05
- Data normalized to [-1, 1]
- DDIM inference with 50 steps

Dataset: 700 canvases (single-action-shoulder-pan-700-combined)
- 620 train / 80 val
- Canvas: 464x480 (448 visual + 16 motor strip, 6 joints)

Purpose: Establish diffusion baseline on same dataset as GPT experiments.

GPT best results for comparison:
- iter2 (384d12): SSIM=0.756, val_mse=0.0102, action_disc=0.031
- iter6 (384d12+cosine): motor_pos_mae=0.463, motor_dir_acc=85%

Diffusion models have key advantages over GPT:
- No autoregressive error compounding (all patches predicted in parallel)
- Iterative refinement during inference (50 denoising steps)
- Stochastic generation (can model multi-modal futures)

Key challenges:
- Diffusion models need more training epochs
- Pixel-space diffusion can be inefficient (no latent encoding)
- Quality depends heavily on noise schedule and inference steps
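
For orientation, the configuration above (cosine schedule, 1000 training timesteps, 50 DDIM steps, 16-pixel patches on a 464x480 canvas) can be sketched as follows. This is a minimal illustration, not the code used for this run: the function names, the offset s=0.008, the 0.999 beta clip, and the evenly strided DDIM timestep selection are all assumptions taken from common practice, and the actual ConditionalDiffusionViT implementation may differ.

```python
import math

def cosine_alpha_bar(num_timesteps=1000, s=0.008):
    """Cumulative signal fraction alpha_bar[t] for t = 0..num_timesteps.

    Assumed form of the cosine schedule; s is a small offset (hypothetical value).
    """
    def f(t):
        return math.cos((t / num_timesteps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(num_timesteps + 1)]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step noise variances, clipped so the last steps stay below 1."""
    return [min(1 - alpha_bar[t + 1] / alpha_bar[t], max_beta)
            for t in range(len(alpha_bar) - 1)]

def ddim_timesteps(num_train_steps=1000, num_inference_steps=50):
    """Evenly strided subset of training timesteps, highest noise first."""
    stride = num_train_steps // num_inference_steps
    return list(range(num_train_steps - stride, -1, -stride))

alpha_bar = cosine_alpha_bar()
betas = betas_from_alpha_bar(alpha_bar)
timesteps = ddim_timesteps()

# Token count for a 464x480 canvas with patch_size=16 (29 x 30 patches).
num_patches = (464 // 16) * (480 // 16)

print(len(betas), len(timesteps), num_patches)
```

With these assumptions each DDIM step skips 20 training timesteps, and the transformer attends over 870 patch tokens per canvas.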

Counterfactual Action Grids

Each grid (Samples 0-3 below): Row 1 = GT, Row 2 = STAY (red), Row 3 = MOVE+ (green), Row 4 = MOVE- (blue)

Metrics

Metric	Value
val_mse	0.344694
val_mse_visual	0.344694
ssim	0.008250
psnr	4.627576
val_mse_motor_strip	0.356118
val_mse_action_1	0.347077
val_mse_action_2	0.341925
val_mse_static	0.353314
val_mse_dynamic	0.333897
motor_position_mae_mean	8.514683
motor_velocity_mae_mean	1.682004
motor_direction_accuracy	0.500000
motor_consistency_error	8.322731
diffusion_loss_t_0_250	0.680518
diffusion_loss_t_250_500	0.674586
diffusion_loss_t_500_750	0.674806
diffusion_loss_t_750_1000	0.674405
action_discrimination_score	0.491355
motor_discrimination_score	0.489860
motor_position_mae_per_joint	[23.1367, 4.5841, 8.8650, 7.5883, 2.0601, 4.8539]
motor_position_mae_action_1	8.239890
motor_position_mae_action_2	8.834036
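
A few of the numbers in the table can be cross-checked against each other. The sketch below assumes PSNR is defined as 10*log10(data_range^2 / MSE); the data range used by the evaluation code is not stated in this report, so both 1.0 and 2.0 (the width of [-1, 1]) are tried. The GPT reference values are the ones quoted in the notes above.

```python
import math

def psnr_from_mse(mse, data_range=1.0):
    """PSNR in dB under the standard definition (data_range is an assumption)."""
    return 10.0 * math.log10(data_range ** 2 / mse)

val_mse = 0.344694
reported_psnr = 4.627576

psnr_range_1 = psnr_from_mse(val_mse, data_range=1.0)
psnr_range_2 = psnr_from_mse(val_mse, data_range=2.0)

# The reported mean motor MAE should equal the mean of the per-joint values.
per_joint_mae = [23.1367, 4.5841, 8.8650, 7.5883, 2.0601, 4.8539]
mean_joint_mae = sum(per_joint_mae) / len(per_joint_mae)

# Ratios against the GPT results quoted in the notes (iter2 val_mse, iter6 motor_pos_mae).
mse_ratio_vs_gpt = val_mse / 0.0102
mae_ratio_vs_gpt = 8.514683 / 0.463

print(f"PSNR (range 1.0): {psnr_range_1:.3f} dB, reported {reported_psnr:.3f} dB")
print(f"PSNR (range 2.0): {psnr_range_2:.3f} dB")
print(f"mean per-joint MAE: {mean_joint_mae:.4f}")
print(f"val_mse is {mse_ratio_vs_gpt:.1f}x GPT iter2; motor MAE is {mae_ratio_vs_gpt:.1f}x GPT iter6")
```

Under these assumptions the reported PSNR is consistent with a data range of 1.0 applied to the mean MSE; the small remaining gap could come from per-image averaging, which this report does not specify. The per-joint values reproduce motor_position_mae_mean, and motor_direction_accuracy of 0.5 is what a coin flip would score on a two-way direction label.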

Sample 0

Error heatmap (jet colormap)

Joint	GT Pos	STAY	MOVE+	MOVE-
J0	-39.7233	-6.6723	-3.6058	-4.9425
J1	-90.9068	-82.7890	-82.9502	-83.0991
J2	66.3224	75.8442	73.2527	73.4442
J3	39.2845	31.5148	31.4088	30.9081
J4	8.5083	5.8182	5.3726	5.6350
J5	10.3896	5.1894	5.3606	5.8075

Sample 1

Error heatmap (jet colormap)

Joint	GT Pos	STAY	MOVE+	MOVE-
J0	-59.4725	-3.4791	-8.9691	-1.6674
J1	-89.7081	-84.0023	-82.5116	-82.5540
J2	72.1129	75.9572	72.6826	73.2871
J3	39.2845	32.0486	31.0820	31.3631
J4	8.5083	5.8346	5.9727	5.6676
J5	10.3127	5.4062	6.1663	5.4465

Sample 2

Error heatmap (jet colormap)

Joint	GT Pos	STAY	MOVE+	MOVE-
J0	33.1292	-7.3979	-5.5323	7.6900
J1	-89.7081	-82.9439	-82.1501	-83.8303
J2	70.5801	75.9855	73.4693	73.9907
J3	39.2845	32.7813	31.6584	30.9280
J4	8.3953	5.3876	5.9138	5.6883
J5	10.5433	5.3513	5.7258	5.4069

Sample 3

Error heatmap (jet colormap)

Joint	GT Pos	STAY	MOVE+	MOVE-
J0	43.2233	4.5946	-0.0947	-5.4073
J1	-89.1403	-82.3772	-82.5958	-83.7297
J2	84.2048	74.7223	72.6063	74.0730
J3	40.3077	30.7951	30.8535	30.4521
J4	8.5083	5.3983	5.4453	5.9311
J5	10.5433	5.3200	5.7412	5.2440
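
The J0 rows of the four sample tables can be summarized numerically. The sketch below copies those rows and compares two quantities per sample: the mean absolute error of the three predictions against GT, and the spread between the predictions for the three actions. The helper name and the choice of summary statistics are illustrative assumptions, not part of the evaluation pipeline.

```python
# J0 values copied from the Sample 0-3 tables: (GT, STAY, MOVE+, MOVE-).
j0_rows = {
    0: (-39.7233, -6.6723, -3.6058, -4.9425),
    1: (-59.4725, -3.4791, -8.9691, -1.6674),
    2: (33.1292, -7.3979, -5.5323, 7.6900),
    3: (43.2233, 4.5946, -0.0947, -5.4073),
}

def summarize(gt, stay, move_plus, move_minus):
    """Return (mean |prediction - GT|, max - min across the three actions)."""
    preds = (stay, move_plus, move_minus)
    mean_abs_err = sum(abs(p - gt) for p in preds) / len(preds)
    action_spread = max(preds) - min(preds)
    return mean_abs_err, action_spread

summaries = {idx: summarize(*row) for idx, row in j0_rows.items()}

for idx, (err, spread) in summaries.items():
    print(f"sample {idx}: mean |pred - GT| = {err:.2f}, spread across actions = {spread:.2f}")
```

In all four samples the J0 predictions stay within roughly 9 units of zero while GT ranges from about -59 to +43, and the error against GT is larger than the difference between action conditions. That pattern matches the near-0.5 discrimination scores in the metrics table.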

Evaluation Report: diff_iter1_baseline_256d8

Diffusion Loss by Timestep Bucket

Bucket	Loss
t_0_250	0.680518
t_250_500	0.674586
t_500_750	0.674806
t_750_1000	0.674405
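
The bucket losses above are nearly flat across timesteps. The sketch below quantifies that spread and compares the mean against a trivial reference. The reference rests on an assumption this report does not confirm: that the loss is an unweighted MSE on standard-normal epsilon targets, in which case a model that always predicts zero would score about 1.0.

```python
bucket_loss = {
    "t_0_250": 0.680518,
    "t_250_500": 0.674586,
    "t_500_750": 0.674806,
    "t_750_1000": 0.674405,
}

values = list(bucket_loss.values())
mean_loss = sum(values) / len(values)
spread = max(values) - min(values)
relative_spread = spread / mean_loss

# Assumed reference: E[eps^2] = 1 for standard-normal noise, so an
# all-zeros prediction would give an MSE of about 1.0.
zero_predictor_mse = 1.0
fraction_of_trivial = mean_loss / zero_predictor_mse

print(f"mean loss {mean_loss:.4f}, spread {spread:.6f} ({relative_spread:.2%} of mean)")
print(f"mean loss is {fraction_of_trivial:.0%} of the assumed zero-predictor MSE")
```

If that assumption holds, the denoiser removes only about a third of the noise variance at every noise level, including the high-noise buckets where the target is usually easiest to predict.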