Evaluation Report: diff_iter1_baseline_256d8

Run Name: diff_iter1_baseline_256d8

Model Type: diffusion

Checkpoint: local/checkpoints/diff_iter1_baseline/best.pth

Dataset: local/datasets/single-action-shoulder-pan-700-combined

Date: 2026-03-19 00:02:00

Val Samples: 80

Analysis Notes

DIFFUSION ITERATION 1: Baseline DiT on 700-canvas dataset
==========================================================

Model: ConditionalDiffusionViT (DiT with adaLN-Zero)
- embed_dim=256, depth=8, num_heads=8, patch_size=16
- prediction_type=epsilon, beta_schedule=cosine, 1000 timesteps
- 200 epochs, batch_size=4, lr=1e-4, weight_decay=0.05
- Data normalized to [-1, 1]
- DDIM inference with 50 steps

Dataset: 700 canvases (single-action-shoulder-pan-700-combined)
- 620 train / 80 val
- Canvas: 464x480 (448 visual + 16 motor strip, 6 joints)

Purpose: Establish diffusion baseline on same dataset as GPT experiments.

GPT best results for comparison:
- iter2 (384d12): SSIM=0.756, val_mse=0.0102, action_disc=0.031
- iter6 (384d12+cosine): motor_pos_mae=0.463, motor_dir_acc=85%

Diffusion models have key advantages over GPT:
- No autoregressive error compounding (all patches predicted in parallel)
- Iterative refinement during inference (50 denoising steps)
- Stochastic generation (can model multi-modal futures)

Key challenges:
- Diffusion models need more training epochs
- Pixel-space diffusion can be inefficient (no latent encoding)
- Quality depends heavily on noise schedule and inference steps
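
The notes above list beta_schedule=cosine with 1000 timesteps and DDIM inference with 50 steps. As a rough illustration of what those two settings usually mean, here is a minimal sketch of the commonly used cosine schedule and an evenly strided DDIM timestep subsequence. The offset s=0.008, the 0.999 clip, and the function names are assumptions made for illustration. The actual ConditionalDiffusionViT training code is not shown in this report and may differ.

```python
import math


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) for the cosine schedule.

    s is a small offset (assumed here to be 0.008, a common default).
    """
    return math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2


def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    """Per-step betas derived from ratios of consecutive alpha_bar values."""
    betas = []
    for t in range(T):
        a_now = cosine_alpha_bar(t, T, s)
        a_next = cosine_alpha_bar(t + 1, T, s)
        # Clip so the final steps do not reach beta = 1 exactly.
        betas.append(min(1.0 - a_next / a_now, max_beta))
    return betas


def ddim_timesteps(T=1000, steps=50):
    """Evenly strided timesteps, visited from most noisy to least noisy."""
    stride = T // steps
    return list(range(0, T, stride))[::-1]


betas = cosine_betas()
schedule = ddim_timesteps()
print(len(betas), betas[0], betas[-1])
print(len(schedule), schedule[:3], schedule[-3:])
```

With T=1000 and 50 steps, the stride is 20, so the sampler in this sketch visits every 20th training timestep.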

Metrics

Metric                          Value
val_mse                         0.344694
val_mse_visual                  0.344694
ssim                            0.008250
psnr                            4.627576
val_mse_motor_strip             0.356118
val_mse_action_1                0.347077
val_mse_action_2                0.341925
val_mse_static                  0.353314
val_mse_dynamic                 0.333897
motor_position_mae_mean         8.514683
motor_velocity_mae_mean         1.682004
motor_direction_accuracy        0.500000
motor_consistency_error         8.322731
diffusion_loss_t_0_250          0.680518
diffusion_loss_t_250_500        0.674586
diffusion_loss_t_500_750        0.674806
diffusion_loss_t_750_1000       0.674405
action_discrimination_score     0.491355
motor_discrimination_score      0.489860
motor_position_mae_per_joint    [23.1367, 4.5841, 8.8650, 7.5883, 2.0601, 4.8539]
motor_position_mae_action_1     8.239890
motor_position_mae_action_2     8.834036
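
Two of the metrics above can be cross-checked against each other. The sketch below recomputes motor_position_mae_mean from the per-joint values and estimates PSNR from val_mse. The PSNR formula assumes a peak value of 1.0. The report does not say how psnr is computed, so that peak value and the helper name psnr_from_mse are assumptions.

```python
import math

# Values copied from the Metrics table.
per_joint_mae = [23.1367, 4.5841, 8.8650, 7.5883, 2.0601, 4.8539]
reported_mae_mean = 8.514683
val_mse = 0.344694
reported_psnr = 4.627576

# Mean of the six per-joint errors.
mean_mae = sum(per_joint_mae) / len(per_joint_mae)

# Fraction of the total position error contributed by joint J0.
j0_share = per_joint_mae[0] / sum(per_joint_mae)


def psnr_from_mse(mse, peak=1.0):
    """PSNR in dB for a given MSE. peak=1.0 is an assumption."""
    return 10.0 * math.log10(peak ** 2 / mse)


psnr_estimate = psnr_from_mse(val_mse)

print(f"mean MAE: {mean_mae:.6f} (reported {reported_mae_mean})")
print(f"J0 share of total MAE: {j0_share:.3f}")
print(f"PSNR from val_mse: {psnr_estimate:.4f} (reported {reported_psnr})")
```

The recomputed mean agrees with the reported value, and the PSNR estimate lands within about 0.01 dB of the reported one. One possible explanation for the small gap is that the evaluation averages PSNR per sample rather than deriving it from the mean MSE, but the report does not confirm this.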

Recommendations

Motor direction accuracy is low (motor_direction_accuracy = 0.500). The model may not be learning action-to-motor mappings. Try a larger separator width or explicit action token embeddings.
Motor position and velocity predictions are inconsistent (motor_consistency_error = 8.32). Consider adding a consistency loss or simplifying the velocity encoding.

Counterfactual Action Grids

Each grid: Row 1 = GT, Row 2 = STAY (red), Row 3 = MOVE+ (green), Row 4 = MOVE- (blue)

Sample 0

[Figure: Counterfactual grid 0]
[Figure: Error heatmap 0 (jet colormap)]

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      -39.7233   -6.6723    -3.6058    -4.9425
J1      -90.9068   -82.7890   -82.9502   -83.0991
J2      66.3224    75.8442    73.2527    73.4442
J3      39.2845    31.5148    31.4088    30.9081
J4      8.5083     5.8182     5.3726     5.6350
J5      10.3896    5.1894     5.3606     5.8075

Sample 1

[Figure: Counterfactual grid 1]
[Figure: Error heatmap 1 (jet colormap)]

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      -59.4725   -3.4791    -8.9691    -1.6674
J1      -89.7081   -84.0023   -82.5116   -82.5540
J2      72.1129    75.9572    72.6826    73.2871
J3      39.2845    32.0486    31.0820    31.3631
J4      8.5083     5.8346     5.9727     5.6676
J5      10.3127    5.4062     6.1663     5.4465

Sample 2

[Figure: Counterfactual grid 2]
[Figure: Error heatmap 2 (jet colormap)]

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      33.1292    -7.3979    -5.5323    7.6900
J1      -89.7081   -82.9439   -82.1501   -83.8303
J2      70.5801    75.9855    73.4693    73.9907
J3      39.2845    32.7813    31.6584    30.9280
J4      8.3953     5.3876     5.9138     5.6883
J5      10.5433    5.3513     5.7258     5.4069

Sample 3

[Figure: Counterfactual grid 3]
[Figure: Error heatmap 3 (jet colormap)]

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      43.2233    4.5946     -0.0947    -5.4073
J1      -89.1403   -82.3772   -82.5958   -83.7297
J2      84.2048    74.7223    72.6063    74.0730
J3      40.3077    30.7951    30.8535    30.4521
J4      8.5083     5.3983     5.4453     5.9311
J5      10.5433    5.3200     5.7412     5.2440
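
To read the four joint tables above more easily, the sketch below summarizes the J0 rows. For each sample it computes how far the three counterfactual predictions differ from each other (spread) and how far they sit from the ground truth on average. The summary statistics and the helper name summarize_j0 are illustrative choices and are not part of the evaluation pipeline.

```python
# J0 rows copied from the four sample tables: (GT, STAY, MOVE+, MOVE-).
j0_rows = [
    (-39.7233, -6.6723, -3.6058, -4.9425),  # Sample 0
    (-59.4725, -3.4791, -8.9691, -1.6674),  # Sample 1
    (33.1292, -7.3979, -5.5323, 7.6900),    # Sample 2
    (43.2233, 4.5946, -0.0947, -5.4073),    # Sample 3
]


def summarize_j0(row):
    """Return (spread across actions, mean absolute error vs GT)."""
    gt, *preds = row
    spread = max(preds) - min(preds)
    mean_abs_err = sum(abs(p - gt) for p in preds) / len(preds)
    return spread, mean_abs_err


summaries = [summarize_j0(row) for row in j0_rows]
for i, (spread, err) in enumerate(summaries):
    print(f"Sample {i}: spread={spread:.4f}, mean |pred - GT|={err:.4f}")
```

In every sample the mean distance to the ground truth is larger than the spread between actions. In other words, changing the action moves the J0 prediction less than the prediction misses the target.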

Diffusion Loss by Timestep Bucket

Bucket        Loss
t_0_250       0.680518
t_250_500     0.674586
t_500_750     0.674806
t_750_1000    0.674405
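
The bucket table above can be summarized with a short sketch that measures how much the loss varies across timestep ranges. It also shows one way a timestep might be mapped to a bucket name. The half-open boundary convention (250 falls in t_250_500) and the helper name bucket_name are assumptions, since the report does not state how buckets are assigned. As a reference point, an epsilon-prediction model that always outputs zero has an expected MSE of 1.0 against unit-variance Gaussian noise.

```python
# Values copied from the bucket table.
bucket_loss = {
    "t_0_250": 0.680518,
    "t_250_500": 0.674586,
    "t_500_750": 0.674806,
    "t_750_1000": 0.674405,
}


def bucket_name(t, width=250):
    """Map a timestep to its bucket label, assuming half-open ranges [lo, hi)."""
    lo = (t // width) * width
    return f"t_{lo}_{lo + width}"


values = list(bucket_loss.values())
mean_loss = sum(values) / len(values)
spread = max(values) - min(values)
relative_spread = spread / mean_loss

print(f"mean loss: {mean_loss:.6f}")
print(f"spread: {spread:.6f} ({relative_spread:.2%} of mean)")
print(bucket_name(0), bucket_name(250), bucket_name(999))
```

The spread across buckets is under 1% of the mean, so the loss is nearly flat with respect to noise level.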