Evaluation Report: diff_iter4_512d12_wd001

Run Name: diff_iter4_512d12_wd001

Model Type: diffusion

Checkpoint: local/checkpoints/diff_iter4_wider/best.pth

Dataset: local/datasets/single-action-shoulder-pan-700-combined

Date: 2026-03-19 05:25:29

Val Samples: 80

Analysis Notes

DIFFUSION ITERATION 4: Wider model (512d) + weight decay 0.01 for regularization
=================================================================================

Changes from Iteration 3:
- embed_dim: 384 -> 512 (wider model)
- depth: 12 (same)
- num_heads: 16 (matched to embed_dim)
- weight_decay: 0.0 -> 0.01 (light regularization to reduce overfitting)
- epochs: 500 -> 300 (iter3 best was at ~210, didn't improve after)
- lr: 5e-4 -> 3e-4 (slightly lower for larger model)

Rationale: Iteration 3 finally got diffusion working with sample prediction + fixed inference. The model achieves competitive results with GPT but motor accuracy is weaker (pos MAE 0.954 vs GPT's 0.612). The iter3 model overfitted heavily (train/val gap grew to -0.013). Adding light weight decay should help regularize. A wider model should provide more capacity for precise predictions.

Key: iter3 showed diffusion models need NO weight decay to start training, but too little regularization causes overfitting after ~200 epochs. Weight decay 0.01 is a middle ground.

Results so far:
- diff_iter1: 256d8, epsilon, wd=0.05 -> NOISE (total failure)
- diff_iter2: 384d12, epsilon, wd=0.0, cosine -> NOISE (DDIM broken)
- diff_iter3: 384d12, sample, wd=0.0, cosine -> SSIM=0.730 (working!)
- This run: 512d12, sample, wd=0.01, cosine -> targeting SSIM>0.75
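
The hyperparameter changes listed in the notes above can be written down as a small config diff. This is a minimal sketch, not the project's actual config code: the `DiffusionConfig` dataclass, its field names, and the `config_diff` helper are hypothetical. Only values stated in the notes are filled in; the notes do not give iteration 3's `num_heads`, so it is left as `None` and skipped in the diff.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class DiffusionConfig:
    """Hypothetical container for the hyperparameters named in the notes."""
    embed_dim: int
    depth: int
    num_heads: Optional[int]  # None where the notes do not state a value
    weight_decay: float
    epochs: int
    lr: float


iter3 = DiffusionConfig(embed_dim=384, depth=12, num_heads=None,
                        weight_decay=0.0, epochs=500, lr=5e-4)
iter4 = DiffusionConfig(embed_dim=512, depth=12, num_heads=16,
                        weight_decay=0.01, epochs=300, lr=3e-4)


def config_diff(old: DiffusionConfig, new: DiffusionConfig) -> dict:
    """Return {field: (old, new)} for fields that changed and are known in both."""
    old_d, new_d = asdict(old), asdict(new)
    return {k: (old_d[k], new_d[k])
            for k in old_d
            if old_d[k] is not None and old_d[k] != new_d[k]}


changes = config_diff(iter3, iter4)

# Attention heads must divide the embedding width evenly.
assert iter4.embed_dim % iter4.num_heads == 0
head_dim = iter4.embed_dim // iter4.num_heads  # per-head width for iteration 4

for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

With 16 heads, the 512-wide model uses 32 dimensions per head.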

Metrics

Metric                          Value
val_mse                         0.008753
val_mse_visual                  0.008753
ssim                            0.774989
psnr                            21.009076
val_mse_motor_strip             0.011761
val_mse_action_1                0.007847
val_mse_action_2                0.009807
val_mse_static                  0.001445
val_mse_dynamic                 0.018052
motor_position_mae_mean         0.956825
motor_velocity_mae_mean         0.056059
motor_direction_accuracy        0.825000
motor_consistency_error         1.205545
diffusion_loss_t_0_250          0.002058
diffusion_loss_t_250_500        0.004465
diffusion_loss_t_500_750        0.008339
diffusion_loss_t_750_1000       0.017111
action_discrimination_score     0.037558
motor_discrimination_score      0.049088
motor_position_mae_per_joint    [2.4623, 0.4575, 0.9046, 1.0166, 0.8754, 0.0246]
motor_position_mae_action_1     0.865819
motor_position_mae_action_2     1.062589
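
As a quick sanity check on the metrics, the per-joint position MAE values can be averaged and compared with the reported `motor_position_mae_mean`. The sketch below assumes the mean is an unweighted average over the six joints and that the list is ordered J0 to J5 as in the joint tables; neither is stated in the report.

```python
# Values copied from the Metrics table.
per_joint_mae = [2.4623, 0.4575, 0.9046, 1.0166, 0.8754, 0.0246]
reported_mean = 0.956825

# Unweighted mean over joints (assumed to be how the summary metric is built).
computed_mean = sum(per_joint_mae) / len(per_joint_mae)

# Index of the joint with the largest error, and its share of the summed error.
worst_joint = max(range(len(per_joint_mae)), key=per_joint_mae.__getitem__)
worst_share = per_joint_mae[worst_joint] / sum(per_joint_mae)

print(f"computed mean: {computed_mean:.4f} (reported {reported_mean})")
print(f"worst joint: J{worst_joint}, {worst_share:.1%} of summed MAE")
```

The computed mean agrees with the reported value to within the rounding of the per-joint figures, and the first joint alone accounts for roughly 43% of the summed error.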

Recommendations

Motor position and velocity predictions are inconsistent (motor_consistency_error = 1.205545). Consider adding a consistency loss or simplifying the velocity encoding.
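
One possible shape for such a consistency loss is sketched below. It is an illustration under assumptions, not the definition of `motor_consistency_error` used by this report: it assumes the model predicts per-joint positions at two consecutive frames plus a per-joint velocity, and penalizes the gap between the predicted velocity and the finite-difference velocity. The function name, argument names, and `dt` are hypothetical.

```python
from typing import Sequence


def consistency_loss(pos_prev: Sequence[float],
                     pos_next: Sequence[float],
                     vel_pred: Sequence[float],
                     dt: float = 1.0) -> float:
    """Mean absolute gap between predicted and finite-difference velocity.

    All sequences hold one value per joint; `dt` is the assumed time
    between the two frames.
    """
    if not (len(pos_prev) == len(pos_next) == len(vel_pred)):
        raise ValueError("inputs must have one value per joint")
    gaps = [abs((p1 - p0) / dt - v)
            for p0, p1, v in zip(pos_prev, pos_next, vel_pred)]
    return sum(gaps) / len(gaps)


# Illustrative values only (not taken from the evaluation run).
loss_consistent = consistency_loss([0.0, 10.0], [1.0, 10.0], [1.0, 0.0])
loss_inconsistent = consistency_loss([0.0, 10.0], [1.0, 10.0], [0.0, 0.0])
```

In training, a term like this would typically be added to the main loss with a small weight; the weight is a tuning choice that the report does not specify.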

Counterfactual Action Grids

Each grid: Row 1 = GT, Row 2 = STAY (red), Row 3 = MOVE+ (green), Row 4 = MOVE- (blue)

Sample 0

[Figure: Counterfactual grid 0]
[Figure: Error heatmap 0 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0     -39.7233  -58.4362  -46.4312  -58.7470
J1     -90.9068  -90.5657  -90.5461  -90.6769
J2      66.3224   66.3306   65.9561   66.3921
J3      39.2845   38.5997   39.0355   38.8833
J4       8.5083    7.9224    8.6771    8.4876
J5      10.3896   10.5004   10.4672   10.3546

Sample 1

[Figure: Counterfactual grid 1]
[Figure: Error heatmap 1 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0     -59.4725  -56.0054  -40.7560  -58.8133
J1     -89.7081  -89.6776  -89.4568  -89.5167
J2      72.1129   72.7499   72.6902   72.8331
J3      39.2845   38.5274   39.3407   38.9401
J4       8.5083    7.6856    8.7110    8.3896
J5      10.3127   10.5167   10.5440   10.4008

Sample 2

[Figure: Counterfactual grid 2]
[Figure: Error heatmap 2 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0      33.1292   24.1081   33.6663   18.4541
J1     -89.7081  -90.2747  -89.8306  -90.5096
J2      70.5801   71.1516   71.2903   71.1998
J3      39.2845   38.6666   39.2701   39.4000
J4       8.3953    8.3575    8.6081    8.8027
J5      10.5433   10.6071   10.5073   10.5585

Sample 3

[Figure: Counterfactual grid 3]
[Figure: Error heatmap 3 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0      43.2233   29.0241   43.1118   25.5336
J1     -89.1403  -89.2733  -89.3082  -89.3114
J2      84.2048   83.3181   84.5922   85.4572
J3      40.3077   38.7707   39.5202   39.6215
J4       8.5083    7.4355    8.8270    8.7954
J5      10.5433   10.6310   10.5792   10.6138
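
The J0 rows of the four sample tables can be compared directly to see which counterfactual action lands nearest the ground-truth position. The sketch below only reports that nearest prediction; the tables do not state which action was actually taken in each sample, so it says nothing about whether the nearest action is the correct one. The `closest_action` helper is hypothetical.

```python
# J0 rows copied from the four sample tables: (GT, STAY, MOVE+, MOVE-).
j0_rows = {
    0: (-39.7233, -58.4362, -46.4312, -58.7470),
    1: (-59.4725, -56.0054, -40.7560, -58.8133),
    2: (33.1292, 24.1081, 33.6663, 18.4541),
    3: (43.2233, 29.0241, 43.1118, 25.5336),
}
actions = ("STAY", "MOVE+", "MOVE-")


def closest_action(row):
    """Return (action, abs_error) for the prediction nearest to ground truth."""
    gt, *preds = row
    errors = {a: abs(p - gt) for a, p in zip(actions, preds)}
    best = min(errors, key=errors.get)
    return best, errors[best]


closest = {s: closest_action(r) for s, r in j0_rows.items()}
for s, (action, err) in closest.items():
    print(f"sample {s}: closest = {action}, |error| = {err:.4f}")
```

On these four samples the nearest J0 prediction is MOVE+ for samples 0, 2, and 3 and MOVE- for sample 1, while J1 through J5 change little across actions in every table.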

Diffusion Loss by Timestep Bucket

Bucket        Loss
t_0_250       0.002058
t_250_500     0.004465
t_500_750     0.008339
t_750_1000    0.017111
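
A bucketed breakdown like this one can be produced by grouping per-sample losses by their sampled timestep. The sketch below is an assumption about how such a table might be computed, not the evaluation script itself: the bucket edges are inferred from the metric names, the half-open intervals are a guess, and `bucket_losses` is a hypothetical helper. It also computes the ratio between consecutive reported buckets.

```python
from collections import defaultdict

# Bucket edges inferred from the metric names (t_0_250 ... t_750_1000).
# Half-open intervals [lo, hi) are an assumption.
BUCKET_EDGES = [(0, 250), (250, 500), (500, 750), (750, 1000)]


def bucket_losses(timesteps, losses):
    """Average per-sample losses within each timestep bucket."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t, loss in zip(timesteps, losses):
        for lo, hi in BUCKET_EDGES:
            if lo <= t < hi:
                key = f"t_{lo}_{hi}"
                sums[key] += loss
                counts[key] += 1
                break
    return {k: sums[k] / counts[k] for k in sums}


# Illustrative inputs only (not from the evaluation run).
example = bucket_losses([10, 200, 300, 999], [0.002, 0.004, 0.005, 0.02])

# Reported bucket means, copied from the table above.
reported = [0.002058, 0.004465, 0.008339, 0.017111]
ratios = [b / a for a, b in zip(reported, reported[1:])]
print([round(r, 2) for r in ratios])
```

In the reported numbers the loss roughly doubles from each bucket to the next, so the highest-noise timesteps (750 to 1000) dominate the total.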