Evaluation Report: diff_iter3_sample_fixed_inference

Run Name: diff_iter3_sample_fixed_inference

Model Type: diffusion

Checkpoint: local/checkpoints/diff_iter3_sample/best.pth

Dataset: local/datasets/single-action-shoulder-pan-700-combined

Date: 2026-03-19 04:10:56

Val Samples: 80

Analysis Notes

DIFFUSION ITERATION 3: Sample prediction + higher LR + 500 epochs
==================================================================

Changes from Iteration 2 (still producing noise):
- prediction_type: epsilon -> sample (predict clean image directly)
- epochs: 300 -> 500 (diffusion needs much more training)
- lr: 3e-4 -> 5e-4 (higher peak since sample prediction is easier to optimize)
- warmup_epochs: 10 -> 20 (longer warmup for stability at higher LR)

Rationale: Iteration 2 reduced denoising loss from 0.675 to 0.506, but inference still produces noise (SSIM=0.008). The issue: epsilon prediction asks the model to predict the noise component, which is harder to learn than predicting the clean image directly.

With "sample" prediction:
- The model predicts x_0 (clean patches) instead of epsilon (noise)
- More intuitive: model learns "what should this look like?" not "what noise was added?"
- Converges faster in practice for conditional generation
- The DDIM step still works: uses pred_x0 directly instead of deriving it from epsilon

Also, this canvas has an asymmetry: context patches are ALWAYS clean while last-frame patches are noisy. Epsilon prediction struggles with this because the model must learn different behaviors for context vs. target regions. Sample prediction is more uniform: predict clean for everything, training only optimizes the last-frame predictions.

The previous training ran for 300 epochs but loss was still decreasing. 500 epochs should allow better convergence.
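
The note that the DDIM step still works with sample prediction can be illustrated with a minimal sketch. This is not the project's sampler: the function name `ddim_step`, the flat-list representation of a patch, and the argument names `alpha_bar_t` / `alpha_bar_prev` (cumulative noise-schedule products) are assumptions made for illustration. It shows a deterministic DDIM update (eta = 0) in which the only difference between the two prediction types is which quantity is taken from the model and which is derived.

```python
import math

def ddim_step(x_t, model_output, alpha_bar_t, alpha_bar_prev, prediction_type="sample"):
    """One deterministic DDIM update (eta = 0) on a flat list of floats.

    Hypothetical helper for illustration; requires 0 < alpha_bar_t < 1.
    """
    sqrt_a = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a = math.sqrt(1.0 - alpha_bar_t)

    if prediction_type == "sample":
        # The model output is already the clean estimate x_0;
        # the implied noise is derived from it.
        pred_x0 = list(model_output)
        pred_eps = [(x - sqrt_a * x0) / sqrt_one_minus_a
                    for x, x0 in zip(x_t, pred_x0)]
    elif prediction_type == "epsilon":
        # The model output is the noise; x_0 has to be derived from it.
        pred_eps = list(model_output)
        pred_x0 = [(x - sqrt_one_minus_a * e) / sqrt_a
                   for x, e in zip(x_t, pred_eps)]
    else:
        raise ValueError(f"unknown prediction_type: {prediction_type}")

    # Re-noise the clean estimate to the previous (less noisy) timestep.
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_a_prev * x0 + sqrt_one_minus_a_prev * e
            for x0, e in zip(pred_x0, pred_eps)]
```

With a perfect model, both prediction types yield the same update, so switching `prediction_type` changes what the network has to learn rather than the sampling rule itself.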

Metrics

Metric                          Value
val_mse                         0.011394
val_mse_visual                  0.011394
ssim                            0.730140
psnr                            19.706405
val_mse_motor_strip             0.015084
val_mse_action_1                0.010401
val_mse_action_2                0.012549
val_mse_static                  0.002058
val_mse_dynamic                 0.023309
motor_position_mae_mean         0.953616
motor_velocity_mae_mean         0.093113
motor_direction_accuracy        0.843750
motor_consistency_error         1.194727
diffusion_loss_t_0_250          0.002217
diffusion_loss_t_250_500        0.004562
diffusion_loss_t_500_750        0.008389
diffusion_loss_t_750_1000       0.018411
action_discrimination_score     0.038722
motor_discrimination_score      0.055427
motor_position_mae_per_joint    [2.9318, 0.6501, 0.8192, 0.2109, 1.0452, 0.0645]
motor_position_mae_action_1     0.954987
motor_position_mae_action_2     0.952021
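
As a quick sanity check on the motor metrics, the per-joint values can be averaged and compared with motor_position_mae_mean. The sketch below assumes that the list is ordered J0 to J5 and that the mean is an unweighted average over joints; neither is stated in the report.

```python
# Per-joint position MAE copied from the metrics table
# (assumed order: J0, J1, J2, J3, J4, J5).
per_joint_mae = [2.9318, 0.6501, 0.8192, 0.2109, 1.0452, 0.0645]

# Unweighted mean over joints (assumption about how the mean is defined).
mean_mae = sum(per_joint_mae) / len(per_joint_mae)

# Index of the joint with the largest error and its share of the summed error.
worst_joint = max(range(len(per_joint_mae)), key=per_joint_mae.__getitem__)
worst_share = per_joint_mae[worst_joint] / sum(per_joint_mae)

print(f"mean MAE: {mean_mae:.6f}")
print(f"worst joint: J{worst_joint} ({worst_share:.1%} of summed error)")
```

Under those assumptions the average agrees with the reported mean to within rounding, and the first joint contributes roughly half of the summed error.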

Recommendations

Motor position and velocity predictions are inconsistent (motor_consistency_error = 1.194727). Consider adding a consistency loss or simplifying the velocity encoding.
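
One possible form of such a consistency loss is sketched below. It is hypothetical: it assumes the velocity channel is meant to equal the frame-to-frame change in position, which the report does not confirm, and the names `consistency_loss`, `pred_pos`, `prev_pos`, `pred_vel`, and `dt` are illustrative only. In training this would be written with tensor operations; plain lists are used here to keep the sketch self-contained.

```python
def consistency_loss(pred_pos, prev_pos, pred_vel, dt=1.0):
    """Mean absolute gap between predicted velocity and the velocity
    implied by the predicted position change (finite difference).

    Hypothetical helper: assumes velocity == (position - previous position) / dt.
    """
    gaps = [abs(v - (p - q) / dt)
            for p, q, v in zip(pred_pos, prev_pos, pred_vel)]
    return sum(gaps) / len(gaps)
```

A value of zero means the predicted velocities match the predicted position changes exactly; the term could be added to the existing loss with a small weight.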

Counterfactual Action Grids

Each grid: Row 1 = GT, Row 2 = STAY (red), Row 3 = MOVE+ (green), Row 4 = MOVE- (blue)

Sample 0

Figure: Counterfactual grid 0
Figure: Error heatmap 0 (jet colormap)

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      -39.7233   -53.1471   -44.0113   -58.1136
J1      -90.9068   -90.5143   -90.5498   -90.7645
J2      66.3224    66.5114    65.4488    65.0265
J3      39.2845    39.6671    39.9770    39.7421
J4      8.5083     8.4293     9.1014     8.6378
J5      10.3896    10.6055    10.6353    10.5477

Sample 1

Figure: Counterfactual grid 1
Figure: Error heatmap 1 (jet colormap)

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      -59.4725   -49.4470   -41.1102   -58.3401
J1      -89.7081   -89.5667   -89.5738   -90.6919
J2      72.1129    72.4014    71.2697    71.2285
J3      39.2845    39.6514    39.9730    39.8268
J4      8.5083     8.6531     8.9874     8.8648
J5      10.3127    10.5807    10.6643    10.5929

Sample 2

Figure: Counterfactual grid 2
Figure: Error heatmap 2 (jet colormap)

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      33.1292    25.1916    34.0501    11.7716
J1      -89.7081   -90.2027   -90.0745   -90.0278
J2      70.5801    70.5743    70.1452    70.0515
J3      39.2845    39.6044    39.8749    40.0282
J4      8.3953     9.0581     8.8724     9.0839
J5      10.5433    10.6597    10.6393    10.6636

Sample 3

Figure: Counterfactual grid 3
Figure: Error heatmap 3 (jet colormap)

Joint   GT Pos     STAY       MOVE+      MOVE-
J0      43.2233    34.7238    51.9994    26.8380
J1      -89.1403   -89.3265   -89.4593   -89.1761
J2      84.2048    83.2126    84.9713    84.2591
J3      40.3077    39.8958    40.1965    40.0169
J4      8.5083     8.9923     9.1526     9.0080
J5      10.5433    10.6727    10.6717    10.6596
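
The J0 rows of the four samples can be used for a simple check of whether the counterfactual actions move the predicted shoulder-pan position in the expected direction. The sketch below assumes that MOVE+ should give a higher J0 reading than STAY and MOVE- a lower one; that interpretation, and the dictionary layout, are assumptions rather than something the report states.

```python
# J0 values copied from the four sample tables above.
j0_rows = {
    0: {"gt": -39.7233, "stay": -53.1471, "move_plus": -44.0113, "move_minus": -58.1136},
    1: {"gt": -59.4725, "stay": -49.4470, "move_plus": -41.1102, "move_minus": -58.3401},
    2: {"gt": 33.1292, "stay": 25.1916, "move_plus": 34.0501, "move_minus": 11.7716},
    3: {"gt": 43.2233, "stay": 34.7238, "move_plus": 51.9994, "move_minus": 26.8380},
}

def ordering_holds(row):
    """True if MOVE- < STAY < MOVE+ (assumed expected ordering for J0)."""
    return row["move_minus"] < row["stay"] < row["move_plus"]

def action_spread(row):
    """Gap between the MOVE+ and MOVE- predictions for J0."""
    return row["move_plus"] - row["move_minus"]

for sample_id, row in j0_rows.items():
    print(sample_id, ordering_holds(row), round(action_spread(row), 4))
```

On these four samples the assumed ordering holds each time, although the STAY prediction for J0 is still several units away from the ground-truth position in every sample.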

Diffusion Loss by Timestep Bucket

Bucket        Loss
t_0_250       0.002217
t_250_500     0.004562
t_500_750     0.008389
t_750_1000    0.018411
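
A table like this could be produced by grouping per-sample denoising losses by their sampled timestep. The sketch below is one way to do that, not the evaluation script's actual code: the function name `loss_by_timestep_bucket` is hypothetical, and the half-open bucket boundaries (for example 0 <= t < 250) are an assumption, since the report does not say which side is inclusive.

```python
def loss_by_timestep_bucket(timesteps, losses, edges=(0, 250, 500, 750, 1000)):
    """Average per-sample losses within timestep buckets.

    Hypothetical helper; buckets are half-open [lo, hi) by assumption.
    Buckets that receive no samples are omitted from the result.
    """
    sums, counts = {}, {}
    for t, loss in zip(timesteps, losses):
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= t < hi:
                key = f"t_{lo}_{hi}"
                sums[key] = sums.get(key, 0.0) + loss
                counts[key] = counts.get(key, 0) + 1
                break
    return {key: sums[key] / counts[key] for key in sums}

# Toy usage with made-up values (not data from this run).
example = loss_by_timestep_bucket([10, 200, 300, 999], [1.0, 3.0, 5.0, 7.0])
print(example)
```

In the reported table the loss rises with each bucket, and the highest-noise bucket is roughly eight times the lowest.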