The Spectral Lifecycle of Transformer Training
New arXiv paper tracks the full SVD of every weight matrix during transformer pretraining, across models of 30M–285M parameters, sampling every 25 steps.
Finds three unexpected phenomena: compression waves traveling from layer to layer (late layers eventually over-compress), persistent power-law gradients forming inverted-U patterns, and an asymmetry between the spectral behavior of the Q/K and V matrices.
Provides a systematic empirical foundation for understanding what happens inside transformers during training, not just at initialization or convergence.
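
To make the measurement concrete, here is a minimal sketch of what tracking a weight matrix's spectrum every 25 steps could look like. It is not the paper's code. The two summary statistics (an entropy-based effective rank as a compression proxy, and a log-log least-squares slope as a power-law exponent) are assumptions chosen for illustration, as are the helper names `spectrum`, `effective_rank`, `power_law_exponent`, `track`, and the toy matrix names.

```python
import numpy as np

SAMPLE_EVERY = 25  # sampling interval taken from the summary above


def spectrum(weight):
    """Singular values of a 2-D weight matrix, largest first."""
    return np.linalg.svd(weight, compute_uv=False)


def effective_rank(s):
    """exp(entropy) of the normalized singular values.

    Assumed compression proxy: lower values mean the spectrum is
    concentrated in fewer directions.
    """
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))


def power_law_exponent(s):
    """Negated slope of log(sigma_i) against log(i), via least squares.

    A crude fit, assumed here for illustration only.
    """
    s = s[s > 0]
    ranks = np.arange(1, len(s) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(s), 1)
    return float(-slope)


def track(weights_by_name, step, history):
    """Record per-matrix spectral stats when step is a sampling step."""
    if step % SAMPLE_EVERY != 0:
        return
    for name, w in weights_by_name.items():
        s = spectrum(w)
        history.setdefault(name, []).append(
            (step, effective_rank(s), power_law_exponent(s))
        )


# Toy stand-in for a training loop: random matrices with hypothetical names.
rng = np.random.default_rng(0)
weights = {
    "layer0.q": rng.standard_normal((64, 64)),
    "layer0.v": rng.standard_normal((64, 64)),
}
history = {}
for step in range(101):
    track(weights, step, history)
```

In a real run, `weights` would be read from the model at each sampling step, and comparing the recorded series across layers and across Q/K versus V matrices is the kind of analysis the summary describes.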