Seminal Plots of AI Progress
plots that tell you the most about AI progress
Introduction
I often get asked for my intuitions about AI progress and where things are headed. This post aims to capture that intuition in just five plots, which I believe are important for making accurate predictions about future AI systems.
My perspective leans heavily towards large-scale, empirical research, since I come from a high-performance computing background.
Plot #1: A new kind of “free lunch”
This first plot shows CPU scaling trends over four decades from 1970 to 2010.
You should note that there are two y-axes: CPU clock speed and number of transistors.
In the early days of computing, rising CPU clock speeds were a sort of “free lunch” that programmers could exploit: instead of optimizing their code to run faster, they could simply wait for the next generation of hardware.
In 1965, Gordon Moore, co-founder of Intel, predicted that the number of transistors per CPU would double each year; a decade later he revised this to a doubling every two years. We call this Moore’s Law, and the trend still holds today.
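As a quick sanity check on what that doubling rate implies, here is a minimal Python sketch; the starting point (roughly 2,300 transistors for the Intel 4004 in 1971) is an illustrative anchor I have assumed, not a value read off the plot.

```python
# Rough projection of Moore's Law: transistor count doubling every two years.
# The 1971 baseline (~2,300 transistors, Intel 4004) is an illustrative anchor.

def transistors(year, base_year=1971, base_count=2_300, doubling_years=2):
    """Projected transistor count assuming a doubling every `doubling_years` years."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

for year in (1971, 1981, 1991, 2001, 2011):
    print(f"{year}: ~{transistors(year):,.0f} transistors")
```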
However, this “free lunch” of rising clock speeds eventually ended. If transistor counts are still doubling every two years, why aren’t CPUs getting correspondingly faster? The key here is parallelism.
Here is the first key paradigm shift: instead of a single processor, cutting-edge hardware now ships with many. For GPUs the tradeoff is starker still: they have orders of magnitude more cores, but each core is orders of magnitude slower.
At the software development layer, this is a fundamental shift that enables all sorts of new types of applications, including machine learning.
We’re now in a different sort of “free lunch” regime: instead of faster sequential operations, we get more concurrent ones. Today, the largest supercomputers in the world get most of their processing power from GPUs, and modern phones and personal computers all ship with parallel processing units.
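To make the sequential-versus-throughput tradeoff concrete, here is a toy Python sketch comparing an element-at-a-time loop with a vectorized NumPy reduction. It illustrates the general idea of trading per-operation convenience for throughput on a CPU; it is not a GPU benchmark, and the array size is arbitrary.

```python
import time
import numpy as np

# Toy illustration: a sequential path (a plain Python loop) versus a
# throughput-oriented path (NumPy's vectorized, optimized reduction).
x = np.random.rand(10_000_000)

start = time.perf_counter()
total_loop = 0.0
for value in x:          # one element at a time
    total_loop += value
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_vec = x.sum()      # handled by an optimized, vectorized kernel
vec_time = time.perf_counter() - start

print(f"loop:       {loop_time:.3f}s")
print(f"vectorized: {vec_time:.3f}s")
```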
Plot #2: Floating point arithmetic is all you need
This next plot shows supercomputer scaling trends over time. The y-axis is the total number of gigaFLOP/s (billions of floating-point operations per second, such as additions and multiplications).
Instead of tracking the performance of a single chip, this plot is tracking the performance of entire datacenters filled with them, and it doesn’t plateau like the previous plot.
This trend is staggering and unintuitive. Today, a MacBook with an M3 chip can perform roughly 3.5 teraFLOP/s (3.5 trillion floating-point operations per second). That’s more than all the fastest computers in the world in 1995 combined. In just the past 25 years, supercomputers have gotten over one million times faster.
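As a back-of-the-envelope check, a million-fold speedup over 25 years implies the growth rates computed below; the inputs are just the figures quoted above, not independent measurements.

```python
import math

# Implied growth rate from a ~1,000,000x speedup over 25 years.
speedup = 1_000_000
years = 25

annual_factor = speedup ** (1 / years)          # growth factor per year
doubling_time = years / math.log2(speedup)      # years per doubling

print(f"~{annual_factor:.2f}x per year")                     # ≈ 1.74x
print(f"doubling roughly every {doubling_time:.2f} years")   # ≈ 1.25 years
```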
Plot #3: Scaling neural networks: model size and data
This plot shows neural-network scaling in action. The y-axis is the “test loss,” a measure of how well the model fits the data it is given, where lower is better.
Artificial neural networks are useful for learning patterns from data without explicitly baking them in.
It turns out that if you collect as much text as you can from the internet and train a neural network on it, you get predictably better aggregate performance across a wide variety of tasks like summarization, common-sense reasoning, mathematics, biology, chemistry, etc.1
What the scaling laws paper found was that there are two primary levers for improving performance: the size of the model and the size of the dataset.
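To make those two levers concrete, here is a small sketch of the power-law form that scaling-laws work reports, where test loss falls as a power of model size N or dataset size D. The exponents and scale constants below are roughly those reported by Kaplan et al. (2020), but I’m including them purely as illustrative values, not as fits to this particular plot.

```python
# Power-law form from scaling-laws work:
#   L(N) ~ (N_c / N) ** alpha_N   (model size, in parameters)
#   L(D) ~ (D_c / D) ** alpha_D   (dataset size, in tokens)
# Constants are illustrative, roughly in line with published fits.
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13

def loss_from_params(n_params):
    """Predicted test loss as a function of model size alone."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_data(n_tokens):
    """Predicted test loss as a function of dataset size alone."""
    return (D_C / n_tokens) ** ALPHA_D

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> test loss ~ {loss_from_params(n):.2f}")
```

The key property is that each lever gives smooth, predictable returns on a log scale: multiplying the model or dataset size by a constant factor shaves off a roughly constant fraction of the loss.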
Plot #4: Algorithmic improvements
Plot #5: Scaling thinking time
Other Important Epistemological Intuitions
There are other important intuitions for reasoning about and anticipating progress in AI. I list them here:
Footnotes
It’s still difficult to predict when specific capabilities will arise. This principle only applies to averages.↩︎