
Scaling Laws

Empirical rules showing that AI model performance follows predictable power-law curves as parameters, data, and compute grow

#Scaling Laws #LLM #Chinchilla #Model Size #Training Compute

What Are Scaling Laws?

Scaling laws are empirical observations that AI model performance improves along predictable power-law curves as parameter count, training data, and compute increase together. The foundational results came from OpenAI's Kaplan et al. (2020) and DeepMind's Chinchilla paper (Hoffmann et al., 2022).

Simply put: train a bigger model on more data for longer, and loss drops at a predictable rate. This regularity is the quantitative case behind multi-billion-dollar investments in frontier models.
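The "predictable rate" has a concrete functional form. A minimal sketch of the Chinchilla-style parametric loss fit, using the constants reported by Hoffmann et al. (2022) — treat them as illustrative values from that paper's fit, not universal truths:

```python
# Parametric loss fit from the Chinchilla paper:
#   L(N, D) = E + A / N**alpha + B / D**beta
# E is the irreducible loss; the two power-law terms shrink as
# parameters (N) and training tokens (D) grow.
def loss(n_params, n_tokens,
         E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling both model size and data lowers the predicted loss:
small = loss(70e9, 1.4e12)    # Chinchilla-scale: 70B params, 1.4T tokens
big = loss(140e9, 2.8e12)     # everything doubled
print(small, big)             # big < small
```

The power-law exponents (≈0.3) are why progress feels expensive: each constant-sized drop in loss requires a multiplicative increase in resources.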

How Do They Work?

Three variables need to move together to hit an efficient frontier:

  • Parameters (N): model size. e.g., GPT-3 at 175B, GPT-4 estimated ~1.8T
  • Training Tokens (D): data volume. Chinchilla's compute-optimal recipe is roughly 20 tokens per parameter (N:D ≈ 1:20)
  • Compute (C): total FLOPs, approximated as C ≈ 6 · N · D
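Combining the two rules of thumb above (C ≈ 6·N·D and D ≈ 20·N) gives C ≈ 120·N², which can be inverted to split a compute budget. A small sketch under those approximations:

```python
import math

def chinchilla_optimal(compute_flops):
    """Split a FLOP budget into compute-optimal N (params) and D (tokens).

    Uses the approximations from the text: C = 6*N*D and D = 20*N,
    so C = 120*N**2 and N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Roughly the Chinchilla training budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"N = {n:.3g} params, D = {d:.3g} tokens")  # ~70B params, ~1.4T tokens
```

Plugging in Chinchilla's own budget recovers its published configuration (70B parameters, 1.4T tokens), which is exactly the point: the recipe is a closed-form function of compute.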

Before Chinchilla, the industry tended to overshoot on parameters alone. The insight that a smaller model trained on more data can outperform a larger one under the same compute budget elevated the role of data scaling in training recipes.
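This trade-off can be checked numerically. The sketch below scores two ways of spending the same ~5.88e23 FLOP budget with the Chinchilla parametric loss fit (constants from Hoffmann et al., 2022; illustrative, not a guarantee):

```python
# L(N, D) = E + A / N**alpha + B / D**beta  (Chinchilla parametric fit)
def loss(n, d, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n**alpha + B / d**beta

C = 5.88e23  # fixed compute budget in FLOPs; D is determined by C = 6*N*D

# "Pre-Chinchilla" style: large model, correspondingly fewer tokens
big_n = 175e9
big_d = C / (6 * big_n)

# Chinchilla-balanced: smaller model, ~20 tokens per parameter
bal_n = 70e9
bal_d = C / (6 * bal_n)

print(loss(big_n, big_d), loss(bal_n, bal_d))  # balanced loss is lower
```

Under this fit, the 70B model trained on ~1.4T tokens beats the 175B model trained on ~560B tokens at identical compute, which is the paper's headline result in miniature.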

Why Do They Matter?

Scaling laws give AI labs a forecasting tool for investment and product roadmaps. Extrapolating loss curves answers "given this compute budget, what quality should I expect?" However, recent analyses point to limits of naive scaling — data exhaustion, spiraling compute cost, reasoning plateaus — shifting attention to test-time compute, agent architectures, and other scaling dimensions. Even so, scaling laws remain central to the economics and engineering decisions of the LLM ecosystem.
