5/7/2026
Yujian

Scaling Laws for LLMs: Predicting Performance and Optimizing Training

Tags: LLM, Scaling Laws, Machine Learning, AI Research, Model Optimization


In the world of Artificial Intelligence, we often hear the mantra "Scale is All You Need." From the jump between GPT-2 and GPT-3 to the massive clusters powering Llama 3 and Gemini, the trajectory has been clear: more compute, more parameters, and more data lead to more capable systems. But behind the hype lies an empirical, quantitative framework known as Scaling Laws.

For researchers and engineers, scaling laws are the Rosetta Stone of model development. They allow us to predict how well a large model will perform before committing to the full training run, by extrapolating from much cheaper small-scale experiments. In this guide, we will dive deep into the math, the history, and the practical implications of scaling laws for Large Language Models (LLMs).


The Power Law: Why Size Matters

At the heart of LLM performance is a mathematical relationship known as a Power Law. Early research by OpenAI in 2020 (Kaplan et al.) demonstrated that the loss ($L$) of a language model, which measures how poorly it predicts the next token (lower is better), decreases predictably as you increase three primary variables:

  1. $N$ (Number of Parameters): The "capacity" of the model.
  2. $D$ (Dataset Size): The number of tokens the model sees during training.
  3. $C$ (Compute): The total floating-point operations (FLOPs) used for training.

The core finding was that, as long as performance is not bottlenecked by the other two factors, loss falls as a power law in each one. This means that if you plot loss against scale on a log-log graph, you get a straight line.

The Formula

The general form of the scaling law for loss $L$ as a function of compute $C$ looks like this:

$L(C) = (C/C_0)^{-\alpha}$

Where:

  • $C$ is the training compute.
  • $C_0$ and $\alpha$ are constants derived from empirical data.
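
To make the formula concrete, here is a minimal Python sketch. The exponent `ALPHA = 0.05` is the Kaplan-style figure quoted in the takeaways table at the end of this post. `C0` is a made-up placeholder, not a fitted constant, and the helper names are purely illustrative.

```python
import math

ALPHA = 0.05  # exponent; Kaplan-style value used for illustration only
C0 = 1.0e8    # hypothetical reference compute (placeholder, not a fitted value)


def loss(compute: float, c0: float = C0, alpha: float = ALPHA) -> float:
    """Power-law loss L(C) = (C / C0) ** (-alpha)."""
    return (compute / c0) ** (-alpha)


def log_log_slope(c1: float, c2: float) -> float:
    """Slope of log L versus log C between two compute budgets."""
    return (math.log(loss(c2)) - math.log(loss(c1))) / (math.log(c2) - math.log(c1))


# On a log-log plot the slope equals -alpha everywhere, i.e. a straight line.
slope_small = log_log_slope(1e18, 1e19)
slope_large = log_log_slope(1e22, 1e24)

# Each 10x of compute multiplies the loss by the same factor, 10 ** -alpha.
ratio_10x = loss(1e21) / loss(1e20)

# Compute multiplier needed to halve the loss: 2 ** (1 / alpha).
halving_multiplier = 2 ** (1 / ALPHA)

print(f"slopes: {slope_small:.4f}, {slope_large:.4f}")
print(f"loss ratio per 10x compute: {ratio_10x:.3f}")
print(f"compute multiplier to halve loss: {halving_multiplier:.3g}")
```

With an exponent of 0.05, every 10x of compute trims the loss by about 11%, and halving the loss takes roughly a million times more compute. The line is straight, but it is shallow.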

This predictability helped give OpenAI the confidence to build GPT-3. Rather than simply guessing that a 175-billion parameter model would work, they could extrapolate the trend from smaller experiments.


The Chinchilla Pivot: Quality Over Quantity?

In 2022, DeepMind released a seminal paper titled "Training Compute-Optimal Large Language Models," introducing what we now call the Chinchilla Scaling Laws.

Before Chinchilla, the prevailing wisdom (based on the Kaplan paper) suggested that if you increased your compute budget by 10x, you should put most of that into making the model larger (roughly 5.5x more parameters, $N$) and only a small portion into more data (roughly 1.8x more tokens, $D$). This led to massive models that were, in retrospect, "starved" for data.

The Chinchilla Findings

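DeepMind found that model size ($N$) and training tokens ($D$) should be scaled in equal proportion as the compute budget grows. Because training compute is roughly proportional to $N \times D$, each doubling of compute means increasing both $N$ and $D$ by a factor of about $\sqrt{2} \approx 1.4$.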

  • The Golden Ratio: For an LLM to be "compute-optimal," you should have roughly 20 tokens of training data for every 1 parameter.
  • The Result: DeepMind trained Chinchilla, a 70B parameter model, on 1.4 trillion tokens. Despite being much smaller than GPT-3 (175B), it outperformed it across almost all benchmarks because it was trained on significantly more data.
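
As a quick sanity check of the two bullets above, the sketch below computes the tokens-per-parameter ratio for Chinchilla and shows how a compute increase splits between $N$ and $D$ when both grow in equal proportion. It assumes training compute is proportional to $N \times D$. The function names are illustrative, not from any library.

```python
import math


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def equal_scaling_factor(compute_multiplier: float) -> float:
    """Growth factor for both N and D when compute grows by compute_multiplier.

    Assumes compute is proportional to N * D and that N and D scale equally,
    so each one grows by the square root of the compute multiplier.
    """
    return math.sqrt(compute_multiplier)


# Chinchilla: 70B parameters trained on 1.4T tokens.
chinchilla_ratio = tokens_per_param(70e9, 1.4e12)

# How much larger N and D each become for 2x and 10x the compute.
factor_2x = equal_scaling_factor(2.0)
factor_10x = equal_scaling_factor(10.0)

print(f"Chinchilla tokens per parameter: {chinchilla_ratio:.1f}")
print(f"2x compute  -> N and D each grow by {factor_2x:.2f}x")
print(f"10x compute -> N and D each grow by {factor_10x:.2f}x")
```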

Why This Matters for Your Budget

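If you have a fixed budget for training, the Chinchilla laws give you a concrete target for splitting it between parameters and tokens. There is little point in building a 100B parameter model if you only have 500B tokens of data: at 20 tokens per parameter, 500B tokens supports a model of roughly 25B parameters, and a 100B model trained on those same tokens would cost about four times the compute while sitting far from the compute-optimal point.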
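
The budget argument can be sketched in a few lines of Python. This is an illustration under two assumptions: the 20-tokens-per-parameter ratio, and the $C \approx 6ND$ rule of thumb that this post introduces in the compute frontier section. The function names are hypothetical.

```python
TOKENS_PER_PARAM = 20.0  # Chinchilla rule of thumb quoted in this post


def optimal_params_for_tokens(n_tokens: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Model size that a given token budget supports at the chosen ratio."""
    return n_tokens / ratio


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


available_tokens = 500e9  # the 500B-token scenario from the text

n_opt = optimal_params_for_tokens(available_tokens)
flops_optimal = training_flops(n_opt, available_tokens)
flops_100b = training_flops(100e9, available_tokens)
cost_ratio = flops_100b / flops_optimal

print(f"model size supported by the data: {n_opt / 1e9:.0f}B parameters")
print(f"compute for that model:  {flops_optimal:.2e} FLOPs")
print(f"compute for a 100B model: {flops_100b:.2e} FLOPs ({cost_ratio:.0f}x more)")
```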


Calculating the Compute Frontier

To apply scaling laws, we need to understand the relationship between parameters and FLOPs. A common rule of thumb for transformer-based models is:

$C \approx 6ND$

Where:

  • $C$ = Total Compute (in FLOPs)
  • $N$ = Number of Parameters
  • $D$ = Total Tokens in the dataset

Example Scenario:

You have access to 100 NVIDIA H100 GPUs for 30 days. How big should your model be?

  1. Calculate Total Compute ($C$): Assume each H100 sustains roughly 700 TFLOPS at FP16/BF16. This is an optimistic figure: datasheet numbers quoted "with sparsity" do not apply to ordinary dense training, and real runs reach only a fraction of peak throughput, so treat the result as an upper bound.
     $C = 100 \text{ GPUs} \times 700 \times 10^{12} \text{ FLOP/s} \times (30 \times 24 \times 3600) \text{ seconds}$
     $C \approx 1.8 \times 10^{23} \text{ FLOPs}$

  2. Apply Chinchilla Optimality ($D = 20N$):
     $C = 6 \times N \times (20N) = 120N^2$
     $1.8 \times 10^{23} = 120N^2$
     $N^2 = 1.5 \times 10^{21}$
     $N \approx 39 \text{ Billion Parameters}$

  3. Optimal Dataset ($D$): $D = 20 \times 39B = 780 \text{ Billion Tokens}$

By following this math, you avoid wasting compute on a model that is too large for the data available, or too small to make good use of it.
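
The worked example above can be turned into a small calculator. This is a sketch, not a planning tool: the function names are illustrative, and the `utilization` parameter is an added assumption that stands in for the gap between peak and sustained throughput.

```python
import math


def gpu_budget_flops(num_gpus: int, flops_per_gpu: float, days: float,
                     utilization: float = 1.0) -> float:
    """Total training FLOPs for a cluster running for a number of days.

    utilization is a hypothetical knob (0 to 1) for the fraction of the
    quoted per-GPU throughput that is actually sustained.
    """
    seconds = days * 24 * 3600
    return num_gpus * flops_per_gpu * utilization * seconds


def compute_optimal_size(total_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(total_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Scenario from the text: 100 GPUs, 700 TFLOPS each, 30 days.
budget = gpu_budget_flops(100, 700e12, 30)
n_params, n_tokens = compute_optimal_size(budget)

# Same cluster if only 40% of that throughput is sustained (assumed value).
budget_40 = gpu_budget_flops(100, 700e12, 30, utilization=0.4)
n_params_40, n_tokens_40 = compute_optimal_size(budget_40)

print(f"C = {budget:.3g} FLOPs -> N = {n_params / 1e9:.1f}B, D = {n_tokens / 1e9:.0f}B tokens")
print(f"at 40% utilization -> N = {n_params_40 / 1e9:.1f}B, D = {n_tokens_40 / 1e9:.0f}B tokens")
```

At the full 700 TFLOPS this gives roughly 39B parameters and about 780B tokens. At 40% sustained throughput the same cluster supports a model closer to 25B parameters, which is why the utilization assumption matters as much as the GPU count.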


The "Over-Training" Era: Llama 3 and Beyond

While Chinchilla defined "compute-optimal training," the industry has recently shifted toward "inference-optimal" scaling.

Models like Meta's Llama 3 were trained far beyond the Chinchilla-optimal point. The Llama 3 8B model was trained on a staggering 15 trillion tokens: roughly 1,875 tokens per parameter, or about 94 times the Chinchilla ratio!

Why over-train?

If you are a company like Meta or Google, the cost of training the model is high, but it is a one-time cost. However, the cost of serving the model (inference) happens every time a user asks a question.

By training a smaller model on much more data than the scaling laws suggest:

  • The model achieves better performance than a "compute-optimal" model of the same size.
  • The model remains small enough to run on consumer hardware or single GPUs.
  • The total cost of ownership (TCO) over the lifetime of the model can be significantly lower.
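
A rough way to see the trade-off is to add training and inference compute together. The sketch below assumes training costs about $6ND$ FLOPs and that inference costs about $2N$ FLOPs per generated token (a common forward-pass approximation, used here as an assumption). It compares compute only and makes no claim that the two models reach the same quality. The 70B configuration reuses the Chinchilla numbers from earlier in this post, and all function names are illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Approximate inference compute, assuming ~2 * N FLOPs per token."""
    return 2.0 * n_params * served_tokens


def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus inference compute over the model's deployment."""
    return training_flops(n_params, train_tokens) + inference_flops(n_params, served_tokens)


def break_even_served_tokens(small, large) -> float:
    """Served tokens at which both models have used equal lifetime compute.

    Each argument is a (n_params, train_tokens) pair. A negative result means
    the smaller model is cheaper from the first served token.
    """
    (n_s, d_s), (n_l, d_l) = small, large
    extra_training = training_flops(n_s, d_s) - training_flops(n_l, d_l)
    saved_per_token = 2.0 * (n_l - n_s)
    return extra_training / saved_per_token


over_trained_8b = (8e9, 15e12)     # Llama 3 8B figures quoted above
chinchilla_70b = (70e9, 1.4e12)    # Chinchilla figures quoted earlier

break_even = break_even_served_tokens(over_trained_8b, chinchilla_70b)

served = 1e13  # hypothetical lifetime traffic: 10 trillion served tokens
total_8b = lifetime_flops(*over_trained_8b, served)
total_70b = lifetime_flops(*chinchilla_70b, served)

print(f"break-even after {break_even:.3g} served tokens")
print(f"lifetime compute at {served:.0e} tokens: 8B = {total_8b:.3g}, 70B = {total_70b:.3g}")
```

Under these assumptions the 8B model's extra training compute is recovered after roughly one trillion served tokens, and every token after that is served at a fraction of the cost.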

The takeaway: If your model will be used by millions of people, treat the Chinchilla ratio as a floor rather than a target, and consider over-training a smaller model on as much high-quality data as you can afford.


When Do We Hit the Wall? (The Limits of Scaling)

Scaling laws imply that we can just keep adding GPUs and data forever. However, we are approaching three significant bottlenecks:

  1. The Data Wall: We are running out of high-quality human-generated text on the internet. Some estimates suggest we may exhaust "high-quality" public data by around 2026. This is driving research into synthetic data and multi-modal data (video/audio).
  2. The Power Wall: Training the next generation of models may require dedicated nuclear power plants. At fixed hardware efficiency, scaling compute $10\times$ requires roughly $10\times$ the energy, which becomes a logistical and environmental nightmare.
  3. Diminishing Returns: While the loss continues to go down, the functional improvement in reasoning might not scale linearly. Because loss follows a power law, each further reduction costs multiplicatively more compute, and it may take orders of magnitude more to go from "passing the Bar Exam" to "inventing a new theory of physics."

Practical Recommendations for AI Practitioners

If you are building or fine-tuning LLMs today, here is how to use scaling laws to your advantage:

  • Prioritize Data Quality: A smaller dataset of "clean" data (textbooks, high-quality code) often yields better scaling coefficients than a massive scrape of the messy web.
  • Use Proxy Models: Don't start your 70B training run immediately. Train 100M, 500M, and 1B versions on the same data distribution. Plot their loss and use the power law to predict your 70B performance.
  • Benchmark Inference Needs: If your application requires low latency, choose a smaller model architecture and invest your budget into tokens/data rather than parameter count.
  • Consider Architecture Efficiency: Scaling laws are not set in stone. Innovations like Mixture of Experts (MoE) allow models to have high parameter counts (knowledge capacity) with lower per-token compute costs, effectively "bending" the scaling curve.
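
The proxy-model recommendation above can be sketched as a log-log least-squares fit. The example below is a toy under stated assumptions: the proxy losses are synthetic, generated from an invented power law (`TRUE_A`, `TRUE_ALPHA`), and it fits a pure power law in model size with no irreducible-loss term. The function names are illustrative.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log a - alpha * log N. Returns (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a: float, alpha: float, size: float) -> float:
    """Extrapolated loss L(N) = a * N ** (-alpha)."""
    return a * size ** (-alpha)


# Synthetic stand-ins for the 100M, 500M, and 1B proxy runs.
# TRUE_A and TRUE_ALPHA are invented for this demo, not measured values.
TRUE_A, TRUE_ALPHA = 20.0, 0.1
proxy_sizes = [100e6, 500e6, 1e9]
proxy_losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in proxy_sizes]

a, alpha = fit_power_law(proxy_sizes, proxy_losses)
predicted_70b = predict_loss(a, alpha, 70e9)

print(f"fitted a = {a:.3f}, alpha = {alpha:.4f}")
print(f"extrapolated loss at 70B parameters: {predicted_70b:.3f}")
```

Real proxy losses are noisy, and extrapolating 70x beyond the largest proxy is a long reach, so it helps to add more proxy sizes and to check that the points actually fall on a line before trusting the prediction.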

Conclusion

Scaling laws have transformed LLM development from an expensive guessing game into a predictable engineering discipline. By understanding the trade-offs between compute, parameters, and data, we can build models that are not just larger, but smarter and more efficient.

Whether we are nearing the end of the scaling era or just at the beginning of a new chapter involving synthetic data and agentic reasoning, one thing is certain: the math of the power law will continue to be our guide in the quest for AGI.


Key Takeaways Table

| Concept | Metric | Takeaway |
| :--- | :--- | :--- |
| Kaplan Scaling | $L \propto C^{-0.05}$ | Predictable loss reduction via scale. |
| Chinchilla Limit | 20 tokens per param | Optimal training for fixed compute. |
| Inference Scaling | >1000 tokens per param | Best for high-traffic applications. |
| Compute Rule | $6ND$ | Basic heuristic for training cost. |
