Evaluation Report: iter3_mae_384d8

Run Name: iter3_mae_384d8

Model Type: mae

Checkpoint: local/checkpoints/mae_iter3/best.pth

Dataset: local/datasets/single-action-shoulder-pan-700-combined

Date: 2026-03-17 23:19:15

Val Samples: 80

Analysis Notes

ITERATION 3: Switch to MAE model (no autoregressive compounding)
================================================================

Changes from Iteration 2:
- Model type: GPT -> MAE (Masked Autoencoder)
- Architecture: encoder-decoder with spatial masking
- embed_dim=384, depth=8 (encoder), decoder_embed_dim=192, decoder_depth=4
- Mask strategy: mask last frame region, reconstruct from context
- NO autoregressive generation: all patches predicted in parallel

Rationale:
The GPT model's biggest limitation is autoregressive error compounding. The per-position loss shows 10-50x higher error at the end of the raster scan (motor strip patches). The TF/FR gap remained at 0.006 even with 3x more capacity. MAE predicts all masked patches simultaneously, eliminating:
1. Error compounding (each patch is independent of other predictions)
2. TF/FR gap (no autoregressive generation)
3. Position-dependent quality degradation

Expected improvements:
- Motor strip accuracy (no longer at bottom of raster scan)
- More uniform spatial error distribution
- Potentially better SSIM (no blur accumulation)

Potential downsides:
- MAE doesn't capture sequential dependencies between patches
- May produce less coherent local structure
- Different failure modes (may produce "average" predictions)

Comparison targets from iter2 (GPT 384d12):
val_mse=0.0102, SSIM=0.756, PSNR=20.4, action_disc=0.031
motor_dir_acc=83.1%, motor_pos_mae=0.612, motor_consistency=1.01
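
The "mask last frame region" strategy in the notes above can be pictured with a small sketch. The report does not describe the canvas layout, so this assumes the frames are tiled left to right on one canvas and cut into a row-major patch grid. The function name `last_frame_patch_mask` and its parameters are hypothetical and are not taken from the training code.

```python
def last_frame_patch_mask(num_frames, patches_per_frame_w, patches_h):
    """Return a row-major list of booleans; True marks a masked patch.

    ASSUMPTION: frames are tiled left to right on one canvas, so the last
    frame occupies the rightmost `patches_per_frame_w` patch columns.
    """
    total_w = num_frames * patches_per_frame_w
    first_masked_col = total_w - patches_per_frame_w
    return [
        col >= first_masked_col
        for _row in range(patches_h)
        for col in range(total_w)
    ]


# Toy canvas: 3 frames, each 4 patches wide and 2 patches tall.
mask = last_frame_patch_mask(num_frames=3, patches_per_frame_w=4, patches_h=2)

# The encoder would see only the visible patches; the decoder would predict
# every masked patch in one parallel pass (no raster-scan ordering).
masked_ids = [i for i, is_masked in enumerate(mask) if is_masked]
visible_ids = [i for i, is_masked in enumerate(mask) if not is_masked]
```

Because the masked set is fixed up front, no prediction feeds into another. That is the property the rationale relies on to rule out error compounding.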

Metrics

Metric                           Value
val_mse                          0.014740
val_mse_visual                   0.014740
ssim                             0.616064
psnr                             19.056644
val_mse_motor_strip              0.012590
val_mse_action_1                 0.013135
val_mse_action_2                 0.016605
val_mse_static                   0.004921
val_mse_dynamic                  0.027464
motor_position_mae_mean          1.695826
motor_velocity_mae_mean          0.068468
motor_direction_accuracy         0.806250
motor_consistency_error          1.946142
action_discrimination_score      0.021702
motor_discrimination_score       0.047564
motor_position_mae_per_joint     [4.2821, 1.0353, 1.6430, 2.2400, 0.9011, 0.0733]
motor_position_mae_action_1      1.649179
motor_position_mae_action_2      1.750037
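
To read these numbers against the iter2 comparison targets in the Analysis Notes, a short script can compute the relative change per metric. This is a sketch under two assumptions that the report does not confirm. First, the abbreviated iter2 names (`action_disc`, `motor_pos_mae`, `motor_consistency`) are taken to be `action_discrimination_score`, `motor_position_mae_mean`, and `motor_consistency_error`. Second, the `higher_is_better` flags are my own labelling.

```python
# iter2 values: comparison targets quoted in the Analysis Notes.
# iter3 values: the Metrics table in this report.
# ASSUMPTION: the name mapping and the higher_is_better flags are mine.
COMPARISON = [
    # (metric, iter2, iter3, higher_is_better)
    ("val_mse", 0.0102, 0.014740, False),
    ("ssim", 0.756, 0.616064, True),
    ("psnr", 20.4, 19.056644, True),
    ("action_discrimination_score", 0.031, 0.021702, True),
    ("motor_direction_accuracy", 0.831, 0.806250, True),
    ("motor_position_mae_mean", 0.612, 1.695826, False),
    ("motor_consistency_error", 1.01, 1.946142, False),
]


def compare(rows):
    """Return (metric, relative_change, improved) for each row."""
    out = []
    for name, old, new, higher_is_better in rows:
        rel = (new - old) / old
        improved = (new > old) == higher_is_better
        out.append((name, rel, improved))
    return out


results = compare(COMPARISON)
for name, rel, improved in results:
    verdict = "improved" if improved else "regressed"
    print(f"{name:30s} {rel:+8.1%}  {verdict}")
```

With the values above, every row is reported as regressed relative to iter2. The largest relative change is in `motor_position_mae_mean`.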

Recommendations

Motor position and velocity predictions are inconsistent (motor_consistency_error = 1.946142, against the iter2 comparison target of 1.01). Consider adding a consistency loss or simplifying the velocity encoding.
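
One possible form of such a consistency loss is sketched below. The report does not define `motor_consistency_error`, so this illustrates the idea rather than the evaluator's formula: penalise the gap between the predicted velocity and the velocity implied by consecutive predicted positions. The names `consistency_error` and `total_loss`, the `dt` step, and the `weight` value are hypothetical. For actual training, the same arithmetic would have to be written on differentiable tensors.

```python
def consistency_error(prev_pos, next_pos, pred_vel, dt=1.0):
    """Mean absolute gap between implied and predicted velocity, per joint.

    ASSUMPTION: implied velocity = (next_pos - prev_pos) / dt. This is not
    necessarily how the report's motor_consistency_error is computed.
    """
    if not (len(prev_pos) == len(next_pos) == len(pred_vel)):
        raise ValueError("all inputs must have one entry per joint")
    gaps = [
        abs((nxt - prv) / dt - vel)
        for prv, nxt, vel in zip(prev_pos, next_pos, pred_vel)
    ]
    return sum(gaps) / len(gaps)


def total_loss(recon_mse, prev_pos, next_pos, pred_vel, weight=0.1):
    """Reconstruction loss plus a weighted consistency penalty (hypothetical)."""
    return recon_mse + weight * consistency_error(prev_pos, next_pos, pred_vel)


# Toy example with two joints: the first is consistent, the second is not.
err = consistency_error([0.0, 10.0], [1.0, 10.0], [1.0, 0.5])
loss = total_loss(0.01, [0.0, 10.0], [1.0, 10.0], [1.0, 0.5])
```

The weight trades reconstruction quality against motor-strip agreement, so it would need tuning. The 0.1 here is only a placeholder.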

Counterfactual Action Grids

Each grid: Row 1 = GT, Row 2 = STAY (red), Row 3 = MOVE+ (green), Row 4 = MOVE- (blue)

Sample 0

[Figure: Counterfactual grid 0]
[Figure: Error heatmap 0 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0     -39.7233  -45.6685  -35.6004  -56.2754
J1     -90.9068  -89.3808  -90.1750  -89.8725
J2      66.3224   63.1548   70.2561   69.9488
J3      39.2845   38.9908   38.9596   38.4939
J4       8.5083    8.5706    8.6750    8.5793
J5      10.3896   10.5482   10.5452    9.9446

Sample 1

[Figure: Counterfactual grid 1]
[Figure: Error heatmap 1 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0     -59.4725  -25.0172  -31.5712  -53.5583
J1     -89.7081  -84.6556  -89.4076  -89.4094
J2      72.1129   72.4946   73.0660   71.8873
J3      39.2845   39.4672   38.7582   38.3339
J4       8.5083    8.8270    8.7659    8.6613
J5      10.3127    9.9885   10.5320    9.8787

Sample 2

[Figure: Counterfactual grid 2]
[Figure: Error heatmap 2 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0      33.1292   27.9879   31.1145   14.8720
J1     -89.7081  -89.7625  -89.3492  -89.4558
J2      70.5801   67.5131   70.9676   70.6356
J3      39.2845   38.5689   38.5775   38.8647
J4       8.3953    8.6432    8.6598    8.9428
J5      10.5433   10.2750   10.2948   10.3580

Sample 3

[Figure: Counterfactual grid 3]
[Figure: Error heatmap 3 (jet colormap)]

Joint    GT Pos      STAY     MOVE+     MOVE-
J0      43.2233   37.6790   41.6237   17.5690
J1     -89.1403  -88.0972  -89.0935  -89.6248
J2      84.2048   79.1703   84.0419   84.6894
J3      40.3077   39.5126   39.6855   39.8198
J4       8.5083    8.7527    8.8158    9.1169
J5      10.5433   10.4297   10.4391   10.3780
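
The counterfactual tables can be summarised by how far each action moves the predicted J0 position away from the STAY prediction. The sketch below copies the J0 rows from the four samples. It assumes that J0 is the actuated joint (suggested by the dataset name, not stated in the report) and that MOVE+ should raise the position while MOVE- lowers it. The helper names are hypothetical.

```python
# J0 rows copied from the four sample tables: (GT, STAY, MOVE+, MOVE-).
J0 = {
    0: (-39.7233, -45.6685, -35.6004, -56.2754),
    1: (-59.4725, -25.0172, -31.5712, -53.5583),
    2: (33.1292, 27.9879, 31.1145, 14.8720),
    3: (43.2233, 37.6790, 41.6237, 17.5690),
}


def action_shifts(row):
    """Return (MOVE+ minus STAY, MOVE- minus STAY) for one J0 row."""
    _gt, stay, plus, minus = row
    return plus - stay, minus - stay


def ordered(row):
    """True when MOVE+ > STAY > MOVE-.

    ASSUMPTION: this sign convention is mine, not confirmed by the report.
    """
    d_plus, d_minus = action_shifts(row)
    return d_plus > 0 > d_minus


for sample, row in J0.items():
    d_plus, d_minus = action_shifts(row)
    print(
        f"sample {sample}: MOVE+ {d_plus:+.4f}, MOVE- {d_minus:+.4f}, "
        f"ordered={ordered(row)}"
    )
```

Under that sign convention, samples 0, 2, and 3 show the expected ordering. In sample 1 the MOVE+ prediction falls below STAY.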