Hold Dataset Experiment: Fine-tune vs Scratch Comparison

Date: 2026-03-20

Dataset: hold-1500-combined (1500 canvases from 15 datasets, 464×480, 6-joint motor strip)

Goal: Compare fine-tuning from best existing checkpoints vs training from scratch on a new combined dataset with hold actions, and compare GPT vs Diffusion inference speed

Iterations: 2 (iter0: baseline configs, iter1: adjusted hyperparameters)

Best Results At a Glance

Best SSIM: 0.891 (Diffusion Fine-tune, iter1)
Best PSNR: 27.2 dB (Diffusion Fine-tune, iter1)
Best motor direction accuracy: 91.7% (GPT Fine-tune, iter0)
Fastest inference: 459 ms, Diffusion (11x faster than GPT)

Main Results: Iteration 0 (Baseline)

All experiments used the same architectures as the best existing checkpoints: GPT 22.2M params (embed 384, depth 12), diffusion ~80M params (embed 512, depth 12, 16 heads).

| Metric | GPT Fine-tune | GPT Scratch | Diff Fine-tune | Diff Scratch |
|---|---|---|---|---|
| Total training time | 77.4 min | 76.9 min | 250.5 min | 250.6 min |
| Best epoch | 66 | 81 | 147 | 195 |
| Time to plateau | 31.4 min | 44.2 min | 129.0 min | 166.9 min |
| Best val loss | 0.002065 | 0.002119 | 0.002874 | 0.003492 |
| Val MSE (visual) | 0.005735 | 0.004797 | 0.003354 | 0.004503 |
| Val MSE (motor strip) | 0.006187 | 0.005548 | 0.010641 | 0.013146 |
| SSIM | 0.864 | 0.869 | 0.883 | 0.867 |
| PSNR (dB) | 24.55 | 25.39 | 26.77 | 25.22 |
| Motor direction accuracy | 91.7% | 88.3% | 86.8% | 89.8% |
| Action discrimination | 0.041 | 0.043 | 0.046 | 0.047 |
| Inference mean (ms) | 5,132 | 7,630 | 459 | 459 |
| Overfitting ratio | 1.038 | 1.029 | 1.756 | 1.783 |
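PSNR in the tables relates to MSE by the standard definition; a minimal sketch, assuming pixel values normalized to [0, 1] (the reported PSNR is presumably averaged per sample, so it need not equal a direct transform of the aggregate MSE):

```python
import math

def psnr_db(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

# An MSE of 0.001 on [0, 1] images corresponds to 30 dB.
```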

Iteration 1: Adjusted Hyperparameters

Based on the iter0 analysis: for diffusion, weight_decay was increased 0.05→0.1 (to counter overfitting) and epochs reduced 300→200; for GPT fine-tune, LR was lowered 0.0002→0.0001. GPT scratch was not re-run.

| Metric | GPT FT (iter1) | Diff FT (iter1) | Diff Scratch (iter1) |
|---|---|---|---|
| Config changes | LR: 0.0002→0.0001 | WD: 0.05→0.1, epochs: 300→200 | WD: 0.05→0.1, epochs: 300→200 |
| Training time | 113.0 min | 188.8 min | 188.7 min |
| Time to plateau | 61.0 min | 148.5 min | 148.4 min |
| Best val loss | 0.002183 | 0.002927 | 0.003589 |
| Val MSE (visual) | 0.005928 | 0.002975 | 0.005138 |
| SSIM | 0.858 | 0.891 | 0.857 |
| PSNR (dB) | 24.19 | 27.24 | 24.51 |
| Motor direction accuracy | 88.8% | 87.3% | 90.7% |
| Overfitting ratio | 1.041 | 1.473 | 1.406 |
| SSIM vs iter0 | -0.006 | +0.008 | -0.010 |

Per-Action MSE Breakdown

Hold action (Stay) shows lowest MSE across all models since it requires predicting minimal change.

| Action | GPT FT (iter0) | GPT Scratch (iter0) | Diff FT (iter0) | Diff FT (iter1) |
|---|---|---|---|---|
| Move+ (action 1) | 0.006749 | 0.005948 | 0.004184 | 0.003841 |
| Move- (action 2) | 0.008766 | 0.007166 | 0.005009 | 0.004310 |
| Stay/Hold (action 3) | 0.001104 | 0.000806 | 0.000539 | 0.000500 |
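The breakdown above can be produced by grouping per-sample MSEs by the conditioning action; a minimal sketch (helper name and record format are assumptions, not the actual eval code):

```python
from collections import defaultdict

# Action ids as in the table: 1 = Move+, 2 = Move-, 3 = Stay/Hold.
def per_action_mse(records):
    """Average per-sample MSE grouped by action id; records = (action_id, mse) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for action, mse in records:
        sums[action] += mse
        counts[action] += 1
    return {action: sums[action] / counts[action] for action in sums}

# Toy records, not the real eval data:
breakdown = per_action_mse([(1, 0.006), (1, 0.008), (3, 0.001), (3, 0.001)])
```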

Training Time Summary

Includes fine-tune vs scratch plateau comparison. Time to plateau = wall-clock time to reach the best validation loss epoch.

| Model | Source | Iter | Epochs | Total Time | Time to Plateau | Best Epoch | Per Epoch |
|---|---|---|---|---|---|---|---|
| GPT | Fine-tune | 0 | 150 | 77.4 min | 31.4 min | 66 | 30.9 s |
| GPT | Scratch | 0 | 150 | 76.9 min | 44.2 min | 81 | 30.8 s |
| Diffusion | Fine-tune | 0 | 300 | 250.5 min | 129.0 min | 147 | 50.1 s |
| Diffusion | Scratch | 0 | 300 | 250.6 min | 166.9 min | 195 | 50.1 s |
| GPT | Fine-tune | 1 | 150 | 113.0 min | 61.0 min | 81 | 45.2 s |
| Diffusion | Fine-tune | 1 | 200 | 188.8 min | 148.5 min | 147 | 56.6 s |
| Diffusion | Scratch | 1 | 200 | 188.7 min | 148.4 min | 147 | 56.6 s |

Plateau speedup (fine-tune vs scratch):
GPT (iter0): 1.41x faster (31.4 min vs 44.2 min, saving 12.8 min)
Diffusion (iter0): 1.29x faster (129.0 min vs 166.9 min, saving 37.9 min)
Diffusion (iter1): 1.00x (148.5 min vs 148.4 min; same best epoch, the WD increase equalized convergence)

Fine-tuning reaches its plateau faster than training from scratch for both architectures in iter0: 1.41x faster for GPT (31 min vs 44 min) and 1.29x for diffusion (129 min vs 167 min). Fine-tuning also reaches a better minimum, especially for diffusion, where the gap in best val loss is 17.7%. In iter1 the increased weight decay equalized diffusion convergence speed, but the fine-tuned model still achieved a better final loss.
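The speedup figures are plain ratios of plateau times; for clarity, the arithmetic (values from the table above):

```python
def plateau_speedup(ft_minutes, scratch_minutes):
    """Fine-tune vs scratch: (speedup factor, wall-clock minutes saved)."""
    return scratch_minutes / ft_minutes, scratch_minutes - ft_minutes

gpt_speedup, gpt_saved = plateau_speedup(31.4, 44.2)      # iter0 GPT
diff_speedup, diff_saved = plateau_speedup(129.0, 166.9)  # iter0 diffusion
```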

Training Configurations

GPT (Autoregressive ViT) — 22.2M params

| Parameter | Iter 0 | Iter 1 |
|---|---|---|
| Embed dim | 384 | 384 |
| Depth / Heads | 12 / 12 | 12 / 12 |
| Learning rate | 0.0002 | 0.0001 |
| LR schedule | cosine | cosine |
| Warmup epochs | 5 | 5 |
| Weight decay | 0.05 | 0.05 |
| Batch size | 4 | 4 |
| Epochs | 150 | 150 |
| FT source | gpt_iter6_cosine/best.pth | (same) |

Diffusion (Conditional DiT) — ~80M params

| Parameter | Iter 0 | Iter 1 |
|---|---|---|
| Embed dim | 512 | 512 |
| Depth / Heads | 12 / 16 | 12 / 16 |
| Learning rate | 0.0003 | 0.0003 |
| LR schedule | cosine | cosine |
| Warmup epochs | 15 | 15 |
| Weight decay | 0.05 | 0.1 |
| Batch size | 4 | 4 |
| Epochs | 300 | 200 |
| Grad clip / Pred type | 1.0 / sample | 1.0 / sample |
| DDIM steps | 50 | 50 |
| FT source | diff_iter4_wider/best.pth | (same) |
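Since iter1 differs from iter0 only in two diffusion hyperparameters, the two configs can be expressed as a base dict plus overrides; a sketch (key names are illustrative, not the actual training script's flags):

```python
# Illustrative config dicts; key names are assumptions, not the real CLI flags.
diff_iter0 = {
    "embed_dim": 512, "depth": 12, "heads": 16,
    "lr": 3e-4, "lr_schedule": "cosine", "warmup_epochs": 15,
    "weight_decay": 0.05, "batch_size": 4, "epochs": 300,
    "grad_clip": 1.0, "pred_type": "sample", "ddim_steps": 50,
    "ft_source": "diff_iter4_wider/best.pth",
}
# iter1 overrides only the two regularization-related knobs.
diff_iter1 = {**diff_iter0, "weight_decay": 0.1, "epochs": 200}

changed = {k for k in diff_iter0 if diff_iter0[k] != diff_iter1[k]}
```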

Convergence Analysis

| Experiment | Epochs | Best Epoch | Best Val Loss | Final Val Loss | Overfitting Ratio | Train-Val Gap | Status |
|---|---|---|---|---|---|---|---|
| GPT FT (iter0) | 150 | 66 | 0.002065 | 0.002142 | 1.038 | -0.001923 | Healthy plateau |
| GPT Scratch (iter0) | 150 | 81 | 0.002119 | 0.002180 | 1.029 | -0.001890 | Healthy plateau |
| Diff FT (iter0) | 300 | 147 | 0.002874 | 0.005047 | 1.756 | -0.004088 | Overfitting |
| Diff Scratch (iter0) | 300 | 195 | 0.003492 | 0.006225 | 1.783 | -0.005145 | Overfitting |
| GPT FT (iter1) | 150 | 81 | 0.002183 | 0.002272 | 1.041 | -0.002001 | Healthy plateau |
| Diff FT (iter1) | 200 | 147 | 0.002927 | 0.004310 | 1.473 | -0.002915 | Reduced overfitting |
| Diff Scratch (iter1) | 200 | 147 | 0.003589 | 0.005047 | 1.406 | -0.003166 | Reduced overfitting |
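The overfitting ratio in the table matches final validation loss divided by best validation loss (e.g. 0.005047 / 0.002874 ≈ 1.756 for Diff FT iter0); a minimal sketch of that definition:

```python
def overfitting_ratio(val_losses):
    """Final val loss over best val loss; ~1.0 means no degradation after the best epoch."""
    return val_losses[-1] / min(val_losses)

# Toy curve whose best and final points are taken from the Diff FT (iter0) row.
curve = [0.004000, 0.002874, 0.005047]
ratio = overfitting_ratio(curve)
```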

Inference Speed Comparison

Measured on CUDA with batch_size=1, 5 warmup + 30 timed iterations.

| Metric | GPT Fine-tune | GPT Scratch | Diffusion Fine-tune | Diffusion Scratch |
|---|---|---|---|---|
| Mean (ms) | 5,132 | 7,630 | 459 | 459 |
| Median (ms) | 5,166 | 8,080 | 476 | 476 |
| P95 (ms) | 5,308 | 8,393 | 507 | 504 |
| Speedup vs GPT FT | 1.0x | 0.7x | 11.2x | 11.2x |

GPT inference is autoregressive: it generates 406 patches sequentially. Diffusion uses 50 DDIM steps, each a single forward pass. Despite being ~4x larger, diffusion is 11x faster.
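A minimal timing harness matching the protocol above (5 warmup iterations excluded, 30 timed, mean/median/p95); `run_model` stands in for a real forward pass, and on CUDA you would also synchronize the device before reading the clock:

```python
import time
import statistics

def benchmark(run_model, warmup=5, iters=30):
    """Return mean/median/p95 latency of run_model() in milliseconds."""
    for _ in range(warmup):
        run_model()                      # warmup runs, excluded from stats
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_model()                      # on CUDA: synchronize before timing ends
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    times_ms.sort()
    p95_index = min(len(times_ms) - 1, int(0.95 * len(times_ms)))
    return {
        "mean_ms": statistics.fmean(times_ms),
        "median_ms": statistics.median(times_ms),
        "p95_ms": times_ms[p95_index],
    }

stats = benchmark(lambda: sum(range(10_000)))  # dummy workload, not a model
```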

Counterfactual Inference Samples

Each grid shows the model's predictions under different action conditions for the same input context. Rows: ground truth, then predictions for each action (Move+, Move-, Stay). Differences between rows indicate action sensitivity.

GPT Fine-tune (iter0): sample grids 0-2 [images omitted]

GPT Scratch (iter0): sample grids 0-2 [images omitted]

Diffusion Fine-tune (iter0): sample grids 0-2 [images omitted]

Diffusion Scratch (iter0): sample grids 0-2 [images omitted]

Diffusion Fine-tune (iter1, best SSIM): sample grids 0-2 [images omitted]

Key Findings

1. Diffusion fine-tune produces the best visual quality. SSIM 0.891 (iter1), PSNR 27.2 dB. Fine-tuning from pretrained weights transfers well to the new hold-action dataset.
2. Diffusion is 11x faster at inference than GPT. 459ms vs 5,132ms per sample. Essential for real-time robotics.
3. Fine-tuning benefits diffusion more than GPT. Diffusion FT reaches a 17.7% lower best val loss than scratch; GPT FT only 2.6% lower.
4. Fine-tuning converges faster. GPT FT 1.41x faster to plateau. Diffusion FT 1.29x faster. Both also reach better minima.
5. GPT fine-tune excels at motor direction prediction. 91.7% accuracy, best across all experiments.
6. Diffusion models overfit significantly. Overfitting ratios reached 1.76-1.78 in iter0; the iter1 weight decay increase reduced them to 1.41-1.47 but did not eliminate the overfitting.
7. Lowering the GPT fine-tune LR hurt performance. Iter1 (lr=0.0001) scored worse than iter0 (lr=0.0002) on SSIM, PSNR, and motor accuracy.

Recommendations

Best visual quality: Diffusion Fine-tune (iter1 config, wd=0.1, 200 epochs). SSIM 0.891, PSNR 27.2 dB.
Best motor accuracy: GPT Fine-tune (iter0 config, lr=0.0002, 150 epochs). 91.7% direction accuracy.
Fastest inference: Diffusion (~459ms vs ~5.1s for GPT).
Fine-tune vs scratch: always fine-tune diffusion when pretrained weights exist (17.7% lower val loss); for GPT, fine-tuning is optional (only 2.6% lower).

Generated 2026-03-20 | canvas-world-model hold dataset experiment | 1500 canvases (15 datasets) | 7 training runs across 2 iterations