Scaling Laws
Empirical rules showing that AI model performance follows predictable power-law curves as parameters, data, and compute grow
What Are Scaling Laws?
Scaling laws are empirical observations that AI model performance improves along predictable power-law curves as parameter count, training data, and compute increase together. The foundational results come from Kaplan et al. (2020) at OpenAI and DeepMind's Chinchilla paper (Hoffmann et al., 2022).
Simply put: train a bigger model on more data for longer, and loss drops at a predictable rate. This regularity is the quantitative case behind multi-billion-dollar investments in frontier models.
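As one concrete form of that regularity, the Chinchilla paper fits pre-training loss as a sum of power laws in model size and data; E, A, B, α, and β are constants fitted to training runs, and their exact values depend on the setup:

```latex
% Chinchilla-style parametric fit: loss falls as a power law in both the
% parameter count N and the token count D, approaching an irreducible term E.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```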
How Do They Work?
Three quantities have to grow in tandem to stay on the compute-efficient frontier (a short sketch follows the list):
- Parameters (N): model size, e.g. GPT-3 at 175B; GPT-4 is estimated at roughly 1.8T
- Training Tokens (D): data volume. Chinchilla finds roughly 20 training tokens per parameter to be compute-optimal (N:D ≈ 1:20)
- Compute (C): total FLOPs, approximated as C ≈ 6 · N · D
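A minimal sketch of that bookkeeping, assuming the C ≈ 6·N·D approximation and the ~20 tokens-per-parameter heuristic from the list above (the function names and the GPT-3-scale example are illustrative):

```python
# Minimal scaling-law bookkeeping: the 6*N*D FLOP approximation and the
# ~20 tokens-per-parameter heuristic come from the list above.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute as C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-style data budget: roughly 20 training tokens per parameter."""
    return tokens_per_param * n_params


if __name__ == "__main__":
    n = 175e9                 # GPT-3-scale parameter count
    d = chinchilla_tokens(n)  # ~3.5e12 tokens for a compute-optimal run
    c = training_flops(n, d)  # ~3.7e24 FLOPs
    print(f"N = {n:.2e} params, D = {d:.2e} tokens, C ≈ {c:.2e} FLOPs")
```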
Before Chinchilla, the industry tended to scale parameters faster than data, leaving many models undertrained for their size. The insight that a smaller model trained on more data can outperform a larger one under the same compute budget elevated the role of data scaling in training recipes, as the comparison sketched below illustrates.
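A sketch of that comparison: score a Gopher-like and a Chinchilla-like allocation of one fixed budget with the parametric loss fit shown earlier. The fitted constants below are the approximate values reported in the Chinchilla paper; they vary with tokenizer, data, and architecture, so treat the outputs as illustrative rather than predictive.

```python
# Two ways to spend the same compute budget, scored with the parametric fit
# L(N, D) = E + A/N^alpha + B/D^beta. The constants are approximate published
# fits from the Chinchilla paper; the numbers are illustrative only.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss under the fitted scaling form."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


BUDGET = 6 * 70e9 * 1.4e12  # ~5.9e23 FLOPs, roughly Chinchilla's training budget

# Gopher-like recipe: 280B parameters, so only ~0.35T tokens fit in the budget.
n_big = 280e9
d_big = BUDGET / (6 * n_big)

# Chinchilla-like recipe: a 4x smaller model trained on 4x more tokens.
n_small = 70e9
d_small = BUDGET / (6 * n_small)

print(f"280B model, {d_big:.2e} tokens -> loss ≈ {predicted_loss(n_big, d_big):.3f}")
print(f" 70B model, {d_small:.2e} tokens -> loss ≈ {predicted_loss(n_small, d_small):.3f}")
```

Run as written, the smaller, longer-trained model comes out ahead, which is the Chinchilla result in miniature.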
Why Do They Matter?
Scaling laws give AI labs a forecasting tool for investment and product roadmaps. Extrapolating loss curves answers "given this compute budget, what quality should I expect?" However, recent analyses point to limits of naive scaling — data exhaustion, spiraling compute cost, reasoning plateaus — shifting attention to test-time compute, agent architectures, and other scaling dimensions. Even so, scaling laws remain central to the economics and engineering decisions of the LLM ecosystem.
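As a closing sketch of that extrapolation, the snippet below splits a FLOP budget compute-optimally (D ≈ 20·N, C ≈ 6·N·D) and reads a predicted loss off the same illustrative fit; the constants are approximate and the budgets are arbitrary round numbers.

```python
import math

# A rough answer to "given this compute budget, what quality should I expect?":
# allocate the budget compute-optimally, then evaluate the illustrative
# Chinchilla-style loss fit at that allocation.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def optimal_allocation(flop_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (N, D) with D = tokens_per_param * N and C = 6*N*D."""
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA


for budget in (1e22, 1e24, 1e26):
    n, d = optimal_allocation(budget)
    print(f"C = {budget:.0e} FLOPs -> N ≈ {n:.2e}, D ≈ {d:.2e}, "
          f"predicted loss ≈ {predicted_loss(n, d):.3f}")
```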