March 6-7 Automated Experiment Summary

Date: March 6-7, 2026

Context: First overnight session using Claude Code as an automated research agent. Starting from the best manual decoder-only checkpoint, Claude explored architecture variants, hyperparameters, and loss functions across 14 experiment runs.

Overview

Two architecture families were explored: encoder-decoder (masked autoencoder) and decoder-only. The decoder-only architecture consistently outperformed encoder-decoder, with the best run (run7) achieving a hybrid loss of 0.009178.

Encoder-Decoder Experiments (March 6)

These runs explored the encoder-decoder (masked autoencoder) architecture with depth=5, embed_dim=256. Most runs failed to converge well, with 3 out of 6 runs stuck at loss ~0.089.

| Run | Key Changes | Best Loss | Report |
| --- | --- | --- | --- |
| encdec_debug | Baseline encoder-decoder, sep_width=16, no weight decay | 0.089601 | report |
| encdec_run1 | Same as debug | 0.089601 | report |
| encdec_run1_2 | Re-run with fixes | 0.016448 | report |
| encdec_run2 (Mar 6) | Further iteration | 0.089601 | report |
| encdec_run2 (Mar 7) | sep_width=32, weight_decay=0.01 | 0.018353 | report |
| encdec_run3 | focal_beta=10, focal_loss_alpha=0.3 | 0.075089 | report |
Takeaway: The encoder-decoder architecture was unstable — half the runs failed to converge (stuck at ~0.089 loss). The best successful run (0.016) was still 1.7x worse than the decoder-only baseline. Claude pivoted to decoder-only after these results.
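The exact focal-loss variant used in encdec_run3 isn't recorded in this summary. A minimal sketch of the standard binary focal loss, assuming `focal_loss_alpha` maps to the class-balance factor α and `focal_beta` to the focusing exponent (usually called γ), would look like:

```python
import math

def focal_loss(p, target, alpha=0.3, beta=10.0):
    """Binary focal loss for one predicted probability `p`.

    Easy, well-classified examples are down-weighted by (1 - p_t)^beta,
    so training focuses on hard examples.
    """
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    a_t = alpha if target == 1 else 1.0 - alpha  # class-balance factor
    return -a_t * (1.0 - p_t) ** beta * math.log(max(p_t, 1e-12))
```

Note that beta=10 is an aggressive setting: confident predictions contribute almost nothing to the loss, which concentrates gradient signal on the hardest examples.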

Decoder-Only Experiments (March 7)

Starting from the successful decoder-only architecture (embed=256, depth=12), Claude systematically explored perceptual loss weighting, model scaling, and VGG loss.
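The decoder-only sweep varies a few knobs around a fixed baseline. A minimal config sketch (field names are illustrative, not the repo's actual API) of the runs below:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DecoderOnlyConfig:
    # run4 baseline; field names are illustrative placeholders
    embed_dim: int = 256
    depth: int = 12
    perceptual_loss_weight: float = 0.0

baseline = DecoderOnlyConfig()                                          # run4
run7 = replace(baseline, perceptual_loss_weight=0.01)                   # best run
run9 = replace(baseline, embed_dim=512, depth=8,
               perceptual_loss_weight=0.01)                             # wider, shallower
run10 = replace(baseline, depth=16, perceptual_loss_weight=0.01)        # deeper
```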

| Run | Key Changes vs Baseline | Best Loss | Report |
| --- | --- | --- | --- |
| run4 | Baseline (no perceptual loss) | 0.009492 | report |
| run5 | Continued training from run4 | 0.009360 | report |
| run5_2 | Further continuation | 0.009239 | report |
| run6 | perceptual_loss_weight=0.1 (high) | 0.012570 | report |
| run7 | perceptual_loss_weight=0.01 | 0.009178 | report |
| run8 | perceptual_loss_weight=0.02 | 0.009435 | report |
| run9 | embed=512, depth=8 (wider, shallower) | 0.010240 | report |
| run10 | depth=16 (deeper) | 0.011437 | report |
Takeaways:

  1. A small perceptual loss weight (0.01) gave the best result; 0.02 was roughly neutral and 0.1 clearly hurt.
  2. Scaling capacity did not help: both the wider/shallower run9 and the deeper run10 underperformed the depth=12 baseline.
  3. Continued training from checkpoints (run5, run5_2) yielded steady incremental gains over run4.
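The objective being tuned here is plausibly a weighted sum of a pixel reconstruction term and a perceptual (VGG-feature) term; the exact formulation isn't recorded in this summary, so the sketch below only shows the weighting, and selects the best weight from the sweep's best-loss numbers in the table above:

```python
def hybrid_loss(recon_term, perceptual_term, perceptual_loss_weight=0.01):
    # Hypothetical combination: the runs above sweep only the weight.
    return recon_term + perceptual_loss_weight * perceptual_term

# Best losses from the table, keyed by perceptual_loss_weight
# (run4, run7, run8, run6 respectively):
sweep = {0.0: 0.009492, 0.01: 0.009178, 0.02: 0.009435, 0.1: 0.012570}
best_weight = min(sweep, key=sweep.get)
```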

Results Summary

| Rank | Run | Architecture | Best Loss | Key Config |
| --- | --- | --- | --- | --- |
| 1 | run7 | Decoder-Only | 0.009178 | embed=256, depth=12, perceptual=0.01 |
| 2 | run5_2 | Decoder-Only | 0.009239 | embed=256, depth=12, perceptual=0 (continued) |
| 3 | run5 | Decoder-Only | 0.009360 | embed=256, depth=12, perceptual=0 (continued) |
| 4 | run8 | Decoder-Only | 0.009435 | embed=256, depth=12, perceptual=0.02 |
| 5 | run4 | Decoder-Only | 0.009492 | embed=256, depth=12, perceptual=0 (baseline) |
| 6 | run9 | Decoder-Only | 0.010240 | embed=512, depth=8, perceptual=0.01 |
| 7 | run10 | Decoder-Only | 0.011437 | embed=256, depth=16, perceptual=0.01 |
| 8 | run6 | Decoder-Only | 0.012570 | embed=256, depth=12, perceptual=0.1 |
| 9 | encdec_run1_2 | Encoder-Decoder | 0.016448 | embed=256, depth=5 |
| 10 | encdec_run2 (Mar 7) | Encoder-Decoder | 0.018353 | embed=256, depth=5, sep_width=32 |

Conclusion

In a single overnight session, Claude Code explored 14 experiment variations across two architecture families — work that would have taken over a week manually. Key findings:

  1. Decoder-only > Encoder-decoder for this task and data scale
  2. Small perceptual loss (0.01) helps, but too much (0.1) hurts
  3. Model capacity is not the bottleneck — both wider and deeper models performed worse, pointing to data as the limiting factor
  4. Progressive fine-tuning works — continued training from checkpoints yields incremental gains

These results motivated collecting more data and creating the canvas-world-model repo for further experiments.