March 6-7 Automated Experiment Summary

Date: March 6-7, 2026

Context: First overnight session using Claude Code as an automated research agent. Starting from the best manual decoder-only checkpoint, Claude explored architecture variants, hyperparameters, and loss functions across 14 experiment runs.

Overview

Two architecture families were explored: encoder-decoder (masked autoencoder) and decoder-only. The decoder-only architecture consistently outperformed encoder-decoder, with the best run (run7) achieving a hybrid loss of 0.009178.

Encoder-Decoder Experiments (March 6)

These runs explored the encoder-decoder (masked autoencoder) architecture with depth=5, embed_dim=256. Most runs failed to converge well, with 3 out of 6 runs stuck at loss ~0.089.

| Run | Key Changes | Best Loss | Report |
| --- | --- | --- | --- |
| encdec_debug | Baseline encoder-decoder, sep_width=16, no weight decay | 0.089601 | report |
| encdec_run1 | Same as debug | 0.089601 | report |
| encdec_run1_2 | Re-run with fixes | 0.016448 | report |
| encdec_run2 (Mar 6) | Further iteration | 0.089601 | report |
| encdec_run2 (Mar 7) | sep_width=32, weight_decay=0.01 | 0.018353 | report |
| encdec_run3 | focal_beta=10, focal_loss_alpha=0.3 | 0.075089 | report |
Takeaway: The encoder-decoder architecture was unstable — half the runs failed to converge (stuck at ~0.089 loss). The best successful run (0.016) was still 1.7x worse than the decoder-only baseline. Claude pivoted to decoder-only after these results.
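The exact focal-loss variant used in encdec_run3 isn't recorded in this summary. A minimal sketch of the standard binary focal loss, assuming `focal_loss_alpha` maps to the class-balance factor α and `focal_beta` to the focusing exponent (usually called γ), would look like:

```python
import math

def focal_loss(p, target, alpha=0.3, beta=10.0):
    """Binary focal loss for one predicted probability `p`.

    Easy, well-classified examples are down-weighted by (1 - p_t)^beta,
    so training focuses on hard examples.
    """
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    a_t = alpha if target == 1 else 1.0 - alpha  # class-balance factor
    return -a_t * (1.0 - p_t) ** beta * math.log(max(p_t, 1e-12))
```

Note that beta=10 is an aggressive setting: confident predictions contribute almost nothing to the loss, which concentrates gradient signal on the hardest examples.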

Decoder-Only Experiments (March 7)

Starting from the successful decoder-only architecture (embed=256, depth=12), Claude systematically explored perceptual loss weighting, model scaling, and VGG loss.
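The decoder-only sweep varies a few knobs around a fixed baseline. A minimal config sketch (field names are illustrative, not the repo's actual API) of the runs below:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DecoderOnlyConfig:
    # run4 baseline; field names are illustrative placeholders
    embed_dim: int = 256
    depth: int = 12
    perceptual_loss_weight: float = 0.0

baseline = DecoderOnlyConfig()                                          # run4
run7 = replace(baseline, perceptual_loss_weight=0.01)                   # best run
run9 = replace(baseline, embed_dim=512, depth=8,
               perceptual_loss_weight=0.01)                             # wider, shallower
run10 = replace(baseline, depth=16, perceptual_loss_weight=0.01)        # deeper
```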

| Run | Key Changes vs Baseline | Best Loss | Report |
| --- | --- | --- | --- |
| run4 | Baseline (no perceptual loss) | 0.009492 | report |
| run5 | Continued training from run4 | 0.009360 | report |
| run5_2 | Further continuation | 0.009239 | report |
| run6 | perceptual_loss_weight=0.1 (high) | 0.012570 | report |
| run7 | perceptual_loss_weight=0.01 | 0.009178 | report |
| run8 | perceptual_loss_weight=0.02 | 0.009435 | report |
| run9 | embed=512, depth=8 (wider, shallower) | 0.010240 | report |
| run10 | depth=16 (deeper) | 0.011437 | report |
Takeaways:

  1. A small perceptual loss weight (0.01) gave the best result; 0.02 was roughly neutral and 0.1 clearly hurt.
  2. Scaling capacity did not help: both the wider/shallower run9 and the deeper run10 underperformed the depth=12 baseline.
  3. Continued training from checkpoints (run5, run5_2) yielded steady incremental gains over run4.
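The objective being tuned here is plausibly a weighted sum of a pixel reconstruction term and a perceptual (VGG-feature) term; the exact formulation isn't recorded in this summary, so the sketch below only shows the weighting, and selects the best weight from the sweep's best-loss numbers in the table above:

```python
def hybrid_loss(recon_term, perceptual_term, perceptual_loss_weight=0.01):
    # Hypothetical combination: the runs above sweep only the weight.
    return recon_term + perceptual_loss_weight * perceptual_term

# Best losses from the table, keyed by perceptual_loss_weight
# (run4, run7, run8, run6 respectively):
sweep = {0.0: 0.009492, 0.01: 0.009178, 0.02: 0.009435, 0.1: 0.012570}
best_weight = min(sweep, key=sweep.get)
```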

Results Summary

| Rank | Run | Architecture | Best Loss | Key Config |
| --- | --- | --- | --- | --- |
| 1 | run7 | Decoder-Only | 0.009178 | embed=256, depth=12, perceptual=0.01 |
| 2 | run5_2 | Decoder-Only | 0.009239 | embed=256, depth=12, perceptual=0 (continued) |
| 3 | run5 | Decoder-Only | 0.009360 | embed=256, depth=12, perceptual=0 (continued) |
| 4 | run8 | Decoder-Only | 0.009435 | embed=256, depth=12, perceptual=0.02 |
| 5 | run4 | Decoder-Only | 0.009492 | embed=256, depth=12, perceptual=0 (baseline) |
| 6 | run9 | Decoder-Only | 0.010240 | embed=512, depth=8, perceptual=0.01 |
| 7 | run10 | Decoder-Only | 0.011437 | embed=256, depth=16, perceptual=0.01 |
| 8 | run6 | Decoder-Only | 0.012570 | embed=256, depth=12, perceptual=0.1 |
| 9 | encdec_run1_2 | Encoder-Decoder | 0.016448 | embed=256, depth=5 |
| 10 | encdec_run2 (Mar 7) | Encoder-Decoder | 0.018353 | embed=256, depth=5, sep_width=32 |

Conclusion

In a single overnight session, Claude Code explored 14 experiment variations across two architecture families — work that would have taken over a week manually. Key findings:

  1. Decoder-only > Encoder-decoder for this task and data scale
  2. Small perceptual loss (0.01) helps, but too much (0.1) hurts
  3. Model capacity is not the bottleneck — both wider and deeper models performed worse, pointing to data as the limiting factor
  4. Progressive fine-tuning works — continued training from checkpoints yields incremental gains

These results motivated collecting more data and creating the canvas-world-model repo for further experiments.