Evaluation Report: diff_iter2_384d12_cosine

Run Name: diff_iter2_384d12_cosine

Model Type: diffusion

Checkpoint: local/checkpoints/diff_iter2_larger/best.pth

Dataset: local/datasets/single-action-shoulder-pan-700-combined

Date: 2026-03-19 02:33:11

Val Samples: 80

Analysis Notes

DIFFUSION ITERATION 2: larger model + training fixes

Changes from Iteration 1 (a total failure; the model produced noise):

- embed_dim: 256 -> 384 (50% increase)
- depth: 8 -> 12 (50% more layers)
- num_heads: 8 -> 12
- weight_decay: 0.05 -> 0.0 (critical for diffusion models)
- lr: 1e-4 -> 3e-4 (higher peak, with warmup)
- lr_schedule: plateau -> cosine with 10-epoch warmup
- grad_clip: 1.0 (stability for the larger model)
- epochs: 200 -> 300 (more training time)
- ~22M parameters

Rationale: Iteration 1 failed completely (SSIM = 0.008, essentially noise). The denoising loss was ~0.675 uniformly across all timestep buckets, meaning the model did not learn to denoise at ANY noise level. Root causes:

1. Model too small: 7M parameters is insufficient for denoising. The GPT model also needed 22M to work well. DiT models in the literature are typically 100M+, but 22M should be enough for this canvas size.
2. Weight decay of 0.05: DiT/diffusion models typically use 0.0 weight decay. High weight decay fights the zero-initialization of the adaLN modulation layers, preventing the model from learning.
3. No warmup: DiT training is sensitive to the early learning rate. Warmup stabilizes the initial weight updates.
4. No gradient clipping: large gradients from random noise targets can destabilize early training.
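The "cosine with 10-epoch warmup" schedule described above can be sketched as a simple epoch-indexed function: linear warmup to the 3e-4 peak over 10 epochs, then cosine decay to zero over the remaining 290. This is a minimal illustration, not the training code; the names (`lr_at_epoch`, `EPOCHS`, `WARMUP`, `PEAK_LR`) are assumptions.

```python
import math

EPOCHS = 300   # total epochs for iteration 2
WARMUP = 10    # linear warmup epochs
PEAK_LR = 3e-4 # peak learning rate after warmup

def lr_at_epoch(epoch: int) -> float:
    """Learning rate for a given epoch: linear warmup, then cosine decay."""
    if epoch < WARMUP:
        return PEAK_LR * (epoch + 1) / WARMUP           # ramp up to the peak
    progress = (epoch - WARMUP) / (EPOCHS - WARMUP)     # 0 at peak -> 1 at end
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_epoch(0))    # small LR at the start of warmup
print(lr_at_epoch(9))    # peak LR at the end of warmup
print(lr_at_epoch(299))  # decayed to near zero
```

In PyTorch this shape is typically realized with `torch.optim.lr_scheduler.LambdaLR` or by chaining `LinearLR` and `CosineAnnealingLR` via `SequentialLR`.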

Metrics

| Metric | Value |
| --- | --- |
| val_mse | 0.343989 |
| val_mse_visual | 0.343989 |
| ssim | 0.008377 |
| psnr | 4.636465 |
| val_mse_motor_strip | 0.355602 |
| val_mse_action_1 | 0.346389 |
| val_mse_action_2 | 0.341200 |
| val_mse_static | 0.352569 |
| val_mse_dynamic | 0.333238 |
| motor_position_mae_mean | 8.487006 |
| motor_velocity_mae_mean | 1.673833 |
| motor_direction_accuracy | 0.500000 |
| motor_consistency_error | 8.318934 |
| diffusion_loss_t_0_250 | 0.507211 |
| diffusion_loss_t_250_500 | 0.505593 |
| diffusion_loss_t_500_750 | 0.505646 |
| diffusion_loss_t_750_1000 | 0.505727 |
| action_discrimination_score | 0.490408 |
| motor_discrimination_score | 0.488563 |
| motor_position_mae_per_joint | [22.9280, 4.6155, 8.8229, 7.6031, 2.0917, 4.8608] |
| motor_position_mae_action_1 | 8.232281 |
| motor_position_mae_action_2 | 8.783039 |
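The motor_direction_accuracy of 0.500000 sits exactly at chance for a binary sign decision. A metric consistent with that value is the fraction of joints whose predicted position change has the same sign as the ground-truth change; the sketch below is an illustrative reconstruction, and `direction_accuracy` and its arguments are assumed names, not taken from the evaluation code.

```python
import numpy as np

def direction_accuracy(gt_delta: np.ndarray, pred_delta: np.ndarray) -> float:
    """Fraction of (sample, joint) entries where the predicted position
    change moves in the same direction as the ground truth.

    gt_delta, pred_delta: arrays of shape (N, J) of per-joint deltas.
    """
    return float(np.mean(np.sign(gt_delta) == np.sign(pred_delta)))

# One sample, two joints: J0 direction correct, J1 direction wrong.
print(direction_accuracy(np.array([[1.0, -2.0]]),
                         np.array([[0.5, 1.0]])))  # -> 0.5
```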

Recommendations

Motor direction accuracy is at chance level (0.50); the model may not be learning action-to-motor mappings. Try a larger separator width or explicit action-token embeddings.
Motor position and velocity predictions are inconsistent. Consider adding a consistency loss or simplifying the velocity encoding.
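The consistency loss suggested above could penalize disagreement between the predicted velocity and the finite difference of consecutive predicted positions. The sketch below is one possible formulation (NumPy for illustration; `consistency_loss` and `dt` are assumed names), not the project's implementation.

```python
import numpy as np

def consistency_loss(pred_pos: np.ndarray, pred_vel: np.ndarray,
                     dt: float = 1.0) -> float:
    """MSE between predicted velocity and the finite-difference velocity
    implied by consecutive predicted positions.

    pred_pos, pred_vel: arrays of shape (T, J) over T timesteps, J joints.
    """
    fd_vel = (pred_pos[1:] - pred_pos[:-1]) / dt   # implied velocity
    return float(np.mean((pred_vel[1:] - fd_vel) ** 2))

# Positions advancing by 1 unit/step agree with a predicted velocity of 1.
pos = np.array([[0.0], [1.0], [2.0]])
vel = np.array([[0.0], [1.0], [1.0]])
print(consistency_loss(pos, vel))  # -> 0.0
```

Added to the training objective with a small weight, this term ties the two motor heads together without changing either encoding.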

Counterfactual Action Grids

Each grid: Row 1 = GT, Row 2 = STAY (red), Row 3 = MOVE+ (green), Row 4 = MOVE- (blue)

Sample 0

Counterfactual grid 0
Error heatmap 0

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | -39.7233 | -4.1823 | 0.2376 | -4.9934 |
| J1 | -90.9068 | -81.5035 | -83.3020 | -82.8269 |
| J2 | 66.3224 | 77.7375 | 72.5603 | 74.0157 |
| J3 | 39.2845 | 31.7693 | 31.5507 | 31.5420 |
| J4 | 8.5083 | 5.5430 | 5.4178 | 5.7001 |
| J5 | 10.3896 | 5.5114 | 5.1221 | 5.8552 |

Sample 1

Counterfactual grid 1
Error heatmap 1

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | -59.4725 | -2.1793 | -11.6332 | 1.4510 |
| J1 | -89.7081 | -83.7646 | -82.3371 | -82.4657 |
| J2 | 72.1129 | 76.1374 | 73.9607 | 73.9740 |
| J3 | 39.2845 | 32.7208 | 32.4717 | 31.0648 |
| J4 | 8.5083 | 5.8499 | 6.1927 | 5.4717 |
| J5 | 10.3127 | 5.6170 | 6.1536 | 5.6074 |

Sample 2

Counterfactual grid 2
Error heatmap 2

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | 33.1292 | -10.4921 | -9.1252 | 5.9270 |
| J1 | -89.7081 | -82.5372 | -82.3801 | -83.6590 |
| J2 | 70.5801 | 74.9962 | 74.4199 | 74.1325 |
| J3 | 39.2845 | 33.1372 | 31.4575 | 31.2373 |
| J4 | 8.3953 | 5.5251 | 6.2540 | 5.2965 |
| J5 | 10.5433 | 5.0766 | 5.6930 | 5.2209 |

Sample 3

Counterfactual grid 3
Error heatmap 3

Error heatmap (jet colormap)

| Joint | GT Pos | STAY | MOVE+ | MOVE- |
| --- | --- | --- | --- | --- |
| J0 | 43.2233 | 8.6940 | 2.1623 | -3.6199 |
| J1 | -89.1403 | -81.4538 | -82.7848 | -83.5504 |
| J2 | 84.2048 | 74.8665 | 72.1271 | 73.6063 |
| J3 | 40.3077 | 29.9256 | 30.5893 | 30.9743 |
| J4 | 8.5083 | 5.5587 | 5.2230 | 5.8869 |
| J5 | 10.5433 | 5.1369 | 5.9275 | 5.1094 |

Diffusion Loss by Timestep Bucket

| Bucket | Loss |
| --- | --- |
| t_0_250 | 0.507211 |
| t_250_500 | 0.505593 |
| t_500_750 | 0.505646 |
| t_750_1000 | 0.505727 |
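The per-bucket losses above can be computed by averaging each validation sample's denoising loss within four equal-width timestep buckets over [0, 1000). A minimal sketch (the function and variable names are illustrative, not from the evaluation code):

```python
from collections import defaultdict

def bucket_losses(timesteps, losses, n_buckets=4, t_max=1000):
    """Average per-sample denoising losses into equal-width timestep buckets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    width = t_max // n_buckets
    for t, loss in zip(timesteps, losses):
        b = min(t // width, n_buckets - 1)   # clamp t == t_max into last bucket
        sums[b] += loss
        counts[b] += 1
    return {f"t_{b * width}_{(b + 1) * width}": sums[b] / counts[b]
            for b in sorted(sums)}

print(bucket_losses([10, 300, 600, 900], [0.5, 0.4, 0.3, 0.2]))
# -> {'t_0_250': 0.5, 't_250_500': 0.4, 't_500_750': 0.3, 't_750_1000': 0.2}
```

A near-flat profile across buckets, as in this run (~0.506 everywhere), indicates the model fails uniformly at all noise levels rather than struggling at one specific stage of denoising.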