Nous Research: Token Superposition Training speeds LLM pre-training by up to 2.5x
Nous Research published Token Superposition Training (TST), a two-phase method that cuts wall-clock training time by up to 2.5x without changing model architecture, tokenizer, or inference behavior.
• Phase 1 averages contiguous token embeddings into bags; Phase 2 reverts to standard next-token prediction
• Validated across 270M, 600M, 3B dense, and 10B-A1B MoE scales
• Speedup holds at matched FLOPs: same compute budget, shorter wall-clock time
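
The Phase 1 averaging step can be pictured with a small sketch. This is an illustration only, not Nous Research's implementation: the function name `average_bags`, the `bag_size` parameter, and the handling of a short trailing bag are all assumptions, and the Phase 1 training objective is not shown because the summary above does not describe it.

```python
from typing import List

Vector = List[float]


def average_bags(embeddings: List[Vector], bag_size: int) -> List[Vector]:
    """Average each run of `bag_size` contiguous token embeddings into one vector.

    Hypothetical helper: a trailing run shorter than `bag_size` is averaged
    over the tokens it actually contains (an assumption, not from the source).
    """
    if bag_size < 1:
        raise ValueError("bag_size must be at least 1")
    bags: List[Vector] = []
    for start in range(0, len(embeddings), bag_size):
        chunk = embeddings[start:start + bag_size]
        dim = len(chunk[0])
        # Element-wise mean over the embeddings in this bag.
        bags.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return bags


# Four toy 2-dimensional token embeddings, averaged in bags of two.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0]]
bags = average_bags(tokens, bag_size=2)
print(bags)  # [[2.0, 1.0], [1.0, 5.0]]
```

With a bag size of 2, the sequence the model processes is half as long, which suggests where a wall-clock saving could come from. With a bag size of 1, the function returns the embeddings unchanged, matching the ordinary per-token input used in Phase 2.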