Date: 2026-03-20
Dataset: hold-1500-combined (1500 canvases from 15 datasets, 464×480, 6-joint motor strip)
Goal: Compare (a) fine-tuning from the best existing checkpoints vs. training from scratch on a new combined dataset that includes hold actions, and (b) GPT vs. Diffusion inference speed.
Iterations: 2 (iter0: baseline configs, iter1: adjusted hyperparameters)
All experiments used the same architectures as the best existing checkpoints. GPT: 22M params (384d12: embed 384, depth 12); Diffusion: ~80M params (512d12h16: embed 512, depth 12, 16 heads).
| Metric | GPT Fine-tune | GPT Scratch | Diff Fine-tune | Diff Scratch |
|---|---|---|---|---|
| Total training time | 77.4 min | 76.9 min | 250.5 min | 250.6 min |
| Best epoch | 66 | 81 | 147 | 195 |
| Time to plateau | 31.4 min | 44.2 min | 129.0 min | 166.9 min |
| Best val loss | 0.002065 | 0.002119 | 0.002874 | 0.003492 |
| Val MSE (visual) | 0.005735 | 0.004797 | 0.003354 | 0.004503 |
| Val MSE (motor strip) | 0.006187 | 0.005548 | 0.010641 | 0.013146 |
| SSIM | 0.864 | 0.869 | 0.883 | 0.867 |
| PSNR (dB) | 24.55 | 25.39 | 26.77 | 25.22 |
| Motor direction accuracy | 91.7% | 88.3% | 86.8% | 89.8% |
| Action discrimination | 0.041 | 0.043 | 0.046 | 0.047 |
| Inference mean (ms) | 5,132 | 7,630 | 459 | 459 |
| Overfitting ratio | 1.038 | 1.029 | 1.756 | 1.783 |
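As a reference for the image-quality rows above, here is a minimal sketch of MSE and PSNR for frames scaled to [0, 1]. This is illustrative only: the actual evaluation pipeline may average differently (e.g. per-sample, or with the motor strip masked), which would explain small discrepancies between MSE and PSNR columns.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all pixels of [0, 1]-scaled frames."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    err = mse(pred, target)
    return 10.0 * np.log10(data_range ** 2 / err)
```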
Iteration 1 changes, based on iter0 analysis: weight decay raised 0.05→0.1 for both diffusion runs (to curb overfitting) and epochs cut 300→200; GPT fine-tune LR lowered 0.0002→0.0001. GPT scratch was not re-run.
| Metric | GPT FT (iter1) | Diff FT (iter1) | Diff Scratch (iter1) |
|---|---|---|---|
| Config changes | LR: 0.0002→0.0001 | WD: 0.05→0.1, Epochs: 300→200 | WD: 0.05→0.1, Epochs: 300→200 |
| Training time | 113.0 min | 188.8 min | 188.7 min |
| Time to plateau | 61.0 min | 148.5 min | 148.4 min |
| Best val loss | 0.002183 | 0.002927 | 0.003589 |
| Val MSE (visual) | 0.005928 | 0.002975 | 0.005138 |
| SSIM | 0.858 | 0.891 | 0.857 |
| PSNR (dB) | 24.19 | 27.24 | 24.51 |
| Motor direction accuracy | 88.8% | 87.3% | 90.7% |
| Overfitting ratio | 1.041 | 1.473 | 1.406 |
| vs iter0 (SSIM) | -0.006 | +0.008 | -0.010 |
The hold action (Stay) shows the lowest MSE across all models, since it requires predicting only minimal change from the input frame.
| Action | GPT FT (iter0) | GPT Scratch (iter0) | Diff FT (iter0) | Diff FT (iter1) |
|---|---|---|---|---|
| Move+ (action 1) | 0.006749 | 0.005948 | 0.004184 | 0.003841 |
| Move- (action 2) | 0.008766 | 0.007166 | 0.005009 | 0.004310 |
| Stay/Hold (action 3) | 0.001104 | 0.000806 | 0.000539 | 0.000500 |
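The per-action breakdown above amounts to grouping squared errors by conditioning action. A hedged sketch, assuming predictions and targets as (N, H, W) arrays and action labels 1–3 (the real pipeline's array layout may differ):

```python
import numpy as np

def per_action_mse(preds, targets, actions):
    """MSE broken down by conditioning action (1=Move+, 2=Move-, 3=Stay/Hold)."""
    out = {}
    for a in np.unique(actions):
        mask = actions == a
        out[int(a)] = float(np.mean((preds[mask] - targets[mask]) ** 2))
    return out
```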
Time to plateau = wall-clock time to reach the best-validation-loss epoch; the table below includes fine-tune vs scratch plateau comparisons.
| Model | Source | Iter | Epochs | Total Time | Time to Plateau | Best Epoch | Per Epoch |
|---|---|---|---|---|---|---|---|
| GPT | Fine-tune | 0 | 150 | 77.4 min | 31.4 min | 66 | 30.9s |
| GPT | Scratch | 0 | 150 | 76.9 min | 44.2 min | 81 | 30.8s |
| GPT plateau speedup (FT vs Scratch) | 1.41x faster (31.4 min vs 44.2 min, saving 12.8 min) | ||||||
| Diffusion | Fine-tune | 0 | 300 | 250.5 min | 129.0 min | 147 | 50.1s |
| Diffusion | Scratch | 0 | 300 | 250.6 min | 166.9 min | 195 | 50.1s |
| Diffusion plateau speedup (FT vs Scratch) | 1.29x faster (129.0 min vs 166.9 min, saving 37.9 min) | ||||||
| GPT | Fine-tune | 1 | 150 | 113.0 min | 61.0 min | 81 | 45.2s |
| Diffusion | Fine-tune | 1 | 200 | 188.8 min | 148.5 min | 147 | 56.6s |
| Diffusion | Scratch | 1 | 200 | 188.7 min | 148.4 min | 147 | 56.6s |
| Diffusion iter1 plateau speedup (FT vs Scratch) | 1.00x (148.5 min vs 148.4 min — same epoch, WD increase equalized convergence) | ||||||
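The speedup rows above are simple ratios of the plateau times; a small helper for checking them:

```python
def plateau_speedup(ft_minutes, scratch_minutes):
    """Return (speedup factor, minutes saved) of fine-tuning vs scratch,
    where each input is wall-clock minutes to the best-val-loss epoch."""
    return scratch_minutes / ft_minutes, scratch_minutes - ft_minutes
```

For example, the GPT iter0 row: `plateau_speedup(31.4, 44.2)` gives ~1.41x and 12.8 min saved, matching the table.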
GPT configuration:

| Parameter | Iter 0 | Iter 1 |
|---|---|---|
| Embed dim | 384 | 384 |
| Depth / Heads | 12 / 12 | 12 / 12 |
| Learning rate | 0.0002 | 0.0001 |
| LR schedule | cosine | cosine |
| Warmup epochs | 5 | 5 |
| Weight decay | 0.05 | 0.05 |
| Batch size | 4 | 4 |
| Epochs | 150 | 150 |
| FT source | gpt_iter6_cosine/best.pth | |
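The "cosine" schedule with warmup epochs in the configs admits a common reading: linear warmup followed by cosine decay to zero. A hedged sketch (the actual trainer may decay to a floor LR, or step per-batch rather than per-epoch):

```python
import math

def lr_at_epoch(epoch, base_lr=0.0001, warmup_epochs=5, total_epochs=150):
    """Linear warmup to base_lr over warmup_epochs, then cosine decay to 0.
    Defaults match the GPT iter1 config above."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```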
Diffusion configuration:

| Parameter | Iter 0 | Iter 1 |
|---|---|---|
| Embed dim | 512 | 512 |
| Depth / Heads | 12 / 16 | 12 / 16 |
| Learning rate | 0.0003 | 0.0003 |
| LR schedule | cosine | cosine |
| Warmup epochs | 15 | 15 |
| Weight decay | 0.05 | 0.1 |
| Batch size | 4 | 4 |
| Epochs | 300 | 200 |
| Grad clip / Pred type | 1.0 / sample | 1.0 / sample |
| DDIM steps | 50 | |
| FT source | diff_iter4_wider/best.pth | |
Convergence and overfitting summary:

| Experiment | Epochs | Best Epoch | Best Val Loss | Final Val Loss | Overfitting Ratio | Train-Val Gap | Status |
|---|---|---|---|---|---|---|---|
| GPT FT (iter0) | 150 | 66 | 0.002065 | 0.002142 | 1.038 | -0.001923 | Healthy plateau |
| GPT Scratch (iter0) | 150 | 81 | 0.002119 | 0.002180 | 1.029 | -0.001890 | Healthy plateau |
| Diff FT (iter0) | 300 | 147 | 0.002874 | 0.005047 | 1.756 | -0.004088 | Overfitting |
| Diff Scratch (iter0) | 300 | 195 | 0.003492 | 0.006225 | 1.783 | -0.005145 | Overfitting |
| GPT FT (iter1) | 150 | 81 | 0.002183 | 0.002272 | 1.041 | -0.002001 | Healthy plateau |
| Diff FT (iter1) | 200 | 147 | 0.002927 | 0.004310 | 1.473 | -0.002915 | Reduced overfitting |
| Diff Scratch (iter1) | 200 | 147 | 0.003589 | 0.005047 | 1.406 | -0.003166 | Reduced overfitting |
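A minimal sketch of the two convergence-health columns, assuming overfitting ratio = final val loss / best val loss, and train-val gap = train loss − val loss at the final epoch (the sign convention is inferred from the negative gap values above):

```python
def overfitting_ratio(best_val_loss, final_val_loss):
    """~1.0 means a healthy plateau; well above 1.0 means the model
    degraded after its best epoch."""
    return final_val_loss / best_val_loss

def train_val_gap(train_loss, val_loss):
    """Negative when train loss is below val loss (the usual direction)."""
    return train_loss - val_loss
```

For example, the Diff FT (iter0) row: `overfitting_ratio(0.002874, 0.005047)` is ~1.756, matching the table.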
Measured on CUDA with batch_size=1, 5 warmup + 30 timed iterations.
| Metric | GPT Fine-tune | GPT Scratch | Diffusion Fine-tune | Diffusion Scratch |
|---|---|---|---|---|
| Mean (ms) | 5,132 | 7,630 | 459 | 459 |
| Median (ms) | 5,166 | 8,080 | 476 | 476 |
| P95 (ms) | 5,308 | 8,393 | 507 | 504 |
| Speedup vs GPT FT | 1.0x | 0.7x | 11.2x | 11.2x |
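The protocol above (5 warmup + 30 timed iterations, batch_size=1) can be sketched framework-agnostically; on GPU one would pass `torch.cuda.synchronize` as the `sync` hook so timings capture kernel completion. The exact harness used here may differ.

```python
import time

def benchmark(fn, warmup=5, iters=30, sync=lambda: None):
    """Latency stats (ms) for fn: warmup iterations first, then timed
    iterations with a sync barrier around each call."""
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(iters):
        sync()                      # e.g. torch.cuda.synchronize on GPU
        t0 = time.perf_counter()
        fn()
        sync()                      # wait for async work before stopping the clock
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    s = sorted(times_ms)
    n = len(s)
    return {
        "mean_ms": sum(s) / n,
        "median_ms": s[n // 2],
        "p95_ms": s[min(n - 1, int(0.95 * n))],
    }
```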
Each grid shows the model's predictions under different action conditions for the same input context. Rows: ground truth, then predictions for each action (Move+, Move-, Stay). Differences between rows indicate action sensitivity.
[Figure placeholders: three prediction grids, each showing Sample 1 and Sample 2]