Summary: We have collected a dataset and analysed key trends in the training compute of machine learning models since 1950. We identify three major eras of training compute: the Pre-Deep Learning Era, the Deep Learning Era, and the Large-Scale Era. Furthermore, we find that training compute has grown by a factor of 10 billion since 2010, with a doubling time of around 5-6 months. See our recent paper, Compute Trends Across Three Eras of Machine Learning, for more details.
Introduction
It is well known that progress in machine learning (ML) is driven by three primary factors - algorithms, data, and compute. This makes intuitive sense: the development of algorithms like backpropagation transformed the way machine learning models are trained, yielding significantly improved efficiency over previous optimisation techniques (Goodfellow et al., 2016; Rumelhart et al., 1986). Data has become increasingly available, particularly with the advent of “big data” in recent years. At the same time, progress in computing hardware has been rapid, with increasingly powerful and specialised AI hardware (Heim, 2021; Khan and Mann, 2020).
What is less obvious is the relative importance of these factors, and what this implies for the future of AI. Kaplan et al. (2020) studied these developments through the lens of scaling laws, identifying three key variables:
- Number of parameters of a machine learning model
- Training dataset size
- Compute required for the final training run of a machine learning model (henceforth referred to as training compute)
Trying to understand the relative importance of these is challenging because our theoretical understanding of them is insufficient - instead, we need to take large quantities of data and analyse the resulting trends. Previously, we looked at trends in parameter counts of ML models - in this paper, we try to understand how training compute has evolved over time.
Amodei and Hernandez (2018) laid the groundwork for this, finding a 300,000× increase in training compute from 2012 to 2018, doubling every 3.4 months. However, that investigation relied on only around 15 datapoints, and did not include some of the most impressive recent ML models, such as GPT-3 (Brown et al., 2020).
Motivated by these problems, we have curated the largest ever dataset containing the training compute of machine learning models, with over 120 datapoints. Using this data, we have drawn several novel insights into the significance of compute as an input to ML models.
These findings have implications for the future of AI development, and how governments should orient themselves to compute governance and a future with powerful AI systems.
Methodology
Following the approach taken by OpenAI (Amodei and Hernandez, 2018), we use two main approaches to determine the training compute of ML systems:
- Counting the number of operations: Training compute can be determined from the number of arithmetic operations performed during training. By looking at the model architecture and closely monitoring the training process, we can directly count the total number of multiplications and additions, yielding the training compute. As ML models become more complex (as continues to be the case), this approach becomes increasingly tedious and error-prone. It also requires knowledge of key details of the training process, which are not always accessible.
- GPU-time: A second approach, independent of the model architecture, uses information about the total training time and the hardware used to estimate the training compute. This method typically requires making several assumptions about the training process, which leads to greater uncertainty in the final value.
Most of the time, we were able to use at least one of the two approaches above to estimate the training compute for a particular ML model. In practice, this was often difficult, since authors frequently do not publish key information about the hardware used or the training time.
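To make the two approaches concrete, here is a minimal sketch in Python. The function names and default values (the backward-pass multiplier and the utilisation rate) are illustrative assumptions, not the exact constants used in our estimates:

```python
def compute_from_operations(forward_flop_per_example: float,
                            n_examples: float,
                            n_epochs: float = 1.0,
                            backward_multiplier: float = 2.0) -> float:
    """Method 1: count arithmetic operations directly.

    A common rule of thumb is that the backward pass costs roughly twice
    the forward pass, so each example seen costs about 3x the forward FLOP.
    """
    return forward_flop_per_example * (1.0 + backward_multiplier) * n_examples * n_epochs


def compute_from_gpu_time(training_days: float,
                          n_gpus: int,
                          peak_flop_per_second: float,
                          utilization: float = 0.3) -> float:
    """Method 2: GPU-time. Real workloads never reach peak hardware
    throughput, so we scale by an assumed utilisation rate."""
    return training_days * 24 * 3600 * n_gpus * peak_flop_per_second * utilization


# Example: a hypothetical model trained for 14 days on 64 GPUs rated at
# 100 TFLOP/s comes out to roughly 2.3e21 FLOP at 30% utilisation.
print(f"{compute_from_gpu_time(14, 64, 1e14):.2e} FLOP")
```

When both methods are applicable, comparing their outputs provides a useful sanity check on the estimate.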
Of course, it would be infeasible for us to gather this data for all ML systems since 1950. Instead, we focus on milestone systems, based on the following criteria:
- Clear importance: These are systems that have major historical influence, significantly improve on the state-of-the-art, or have over 1000 citations
- Relevance: We only include papers that contain experimental results and a key machine learning component, and whose aim is to advance the existing state-of-the-art
- Uniqueness: If another paper describing the same system is more influential, then the paper is excluded from our dataset
This selection process lets us focus on the most important systems, helping us understand the key drivers of the state-of-the-art.
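As a rough illustration, the inclusion logic can be expressed as a simple predicate. The field names below are hypothetical, and the actual annotation process involved manual judgment rather than a mechanical rule:

```python
from dataclasses import dataclass

@dataclass
class Paper:
    citations: int
    historically_influential: bool
    improves_sota: bool
    has_experiments: bool
    has_ml_component: bool
    superseded_by_more_influential_paper: bool

def is_milestone(p: Paper) -> bool:
    # Clear importance: historical influence, a SOTA improvement, or >1000 citations.
    important = p.historically_influential or p.improves_sota or p.citations > 1000
    # Relevance: experimental results and a key ML component.
    relevant = p.has_experiments and p.has_ml_component
    # Uniqueness: exclude if a more influential paper describes the same system.
    return important and relevant and not p.superseded_by_more_influential_paper
```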
Results
Using these techniques, we produced a dataset with training compute estimates for over 120 milestone ML systems - the largest such dataset yet. We have made this data and our interactive visualisation publicly available, in order to facilitate further research along the same lines.
[Interactive visualisation: Training Compute of Notable Machine Learning Systems Over Time. X axis: publication date; Y axis: training compute (FLOP); large-scale systems are highlighted separately from the rest.]
NOTE: This visualisation is dynamically updated as we collect further information on notable ML systems. As a result, the trends shown may differ from those at the time of our original publication.
When analysing the gathered data, we draw two main conclusions.
- Trends in training compute are slower than previously reported
- We identify three eras of training compute usage across machine learning
Compute trends are slower than previously reported
In the previous investigation by Amodei and Hernandez (2018), the authors found that the training compute used was growing extremely rapidly - doubling every 3.4 months. With approximately 10 times more data than the original study, we find a doubling time closer to 6 months. This is still extraordinarily fast! Since 2010, the amount of training compute for machine learning models has grown by a factor of 10 billion, significantly exceeding a naive extrapolation of Moore’s Law.
This suggests that many previous analyses based on OpenAI’s figure overestimated the pace of progress, roughly by a factor of two.
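To illustrate how such a doubling time is estimated, here is a minimal sketch: fit an ordinary least-squares line to log10(training compute) against publication date, then convert the slope. The datapoints below are made up for illustration; the real numbers live in our public dataset:

```python
import numpy as np

# Hypothetical (publication year, training compute) pairs, for illustration only.
years = np.array([2012.5, 2014.0, 2015.5, 2017.0, 2018.5, 2020.0, 2021.5])
flop = np.array([1e17, 1e18, 3e18, 3e19, 1e21, 3e22, 3e23])

# Fit a line to log10(compute); the slope is in orders of magnitude per year.
slope, intercept = np.polyfit(years, np.log10(flop), 1)

doubling_time_months = 12 * np.log10(2) / slope
print(f"{slope:.2f} OOMs/year, doubling every {doubling_time_months:.1f} months")
# With this toy data: ~0.74 OOMs/year, doubling roughly every 5 months.
```

The same fit, restricted to the systems within each era, yields the per-era doubling times discussed below.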
Three eras of machine learning
One of the more speculative contributions of our paper is that we argue for the presence of three eras of machine learning. This is in contrast to prior work, which identifies two trends separated by the start of the Deep Learning revolution (Amodei and Hernandez, 2018). Instead, we split the history of ML compute into three eras:
- The Pre-Deep Learning Era: Prior to Deep Learning, training compute grows roughly in line with Moore’s Law, with a doubling time of approximately 20 months.
- The Deep Learning Era: This starts somewhere between 2010 and 2012, and displays a doubling time of approximately 6 months.
- The Large-Scale Era: Arguably, a separate trend of models breaks off from the main trend between 2015 and 2016. These systems are characterised by being developed by large corporations, and by training compute 2-3 orders of magnitude larger than that of systems following the Deep Learning Era trend in the same year. Interestingly, compute growth in these large-scale models appears slower, with a doubling time of about 10 months.
A key benefit of this framing is that it helps make sense of developments over the last two decades of ML research. Deep Learning marked a major paradigm shift in ML, with an increased focus on training larger models, using larger datasets, and using more compute. The bifurcation of the Deep Learning trend coincides with the shift in focus towards major projects at large corporations, such as DeepMind and OpenAI.
However, there is a fair amount of ambiguity in this framing. For instance, how do we know which models should be considered large-scale? How can we be sure that the “large-scale” trend is not just noise? To probe these questions, we tried different statistical thresholds for what counts as “large-scale”; the resulting trend changes very little, so the findings are at least somewhat robust to the choice of selection criteria. Of course, the exact threshold we use is still debatable, and it is hard to be certain about the observed trends without more data.
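Concretely, this robustness check can be sketched as follows: classify a system as large-scale when its log-compute exceeds the mean of its temporal neighbours by some number of standard deviations, then vary that threshold and re-fit the trends. The window size and threshold below are placeholders, not the paper’s exact values:

```python
import numpy as np

def classify_large_scale(years, log10_flop, z_threshold=0.7, window_years=2.0):
    """Flag systems whose log-compute exceeds the mean of systems published
    within +/- window_years/2 by z_threshold standard deviations.
    Both parameters are illustrative placeholders."""
    years = np.asarray(years, dtype=float)
    log10_flop = np.asarray(log10_flop, dtype=float)
    large = np.zeros(len(years), dtype=bool)
    for i, (y, lf) in enumerate(zip(years, log10_flop)):
        neighbours = log10_flop[np.abs(years - y) <= window_years / 2]
        mu, sigma = neighbours.mean(), neighbours.std()
        large[i] = sigma > 0 and lf > mu + z_threshold * sigma
    return large
```

Sweeping z_threshold over a range of values and re-fitting the era trend lines is what gives us some confidence that the large-scale split is not an artefact of one particular cutoff.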
Implications and further work
We expect that future work will build upon this research project. Using the compute estimation techniques described above, more training compute data can be gathered, enabling more conclusive analyses. The data gathering process can also be made easier, such as by:
- Developing tools for automatically measuring training compute usage, as well as inference compute (see the sketch after this list)
- Publishing key details about the training process, such as the GPU model used
Taking these steps would help key actors obtain valuable information in the future.
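As a taste of what such tooling might look like, here is a minimal sketch that counts forward-pass FLOP in a PyTorch model by hooking its linear layers. It is an illustrative starting point only; a real tool would also need to cover convolutions, attention, the backward pass, and hardware-level effects:

```python
import torch
import torch.nn as nn

def count_linear_flops(model: nn.Module, example_input: torch.Tensor) -> int:
    """Count multiply and add operations in nn.Linear layers for one forward pass."""
    total = 0
    hooks = []

    def hook(module, inputs, output):
        nonlocal total
        # Each output element requires in_features multiplications and additions.
        total += 2 * module.in_features * output.numel()

    for m in model.modules():
        if isinstance(m, nn.Linear):
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()
    return total

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
print(count_linear_flops(model, torch.randn(1, 784)))  # 406528
```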
Naturally, we will also be looking at trends in dataset sizes, and comparing the relative importance of data and compute for improved performance. We can also examine how factors like funding and talent influence the primary inputs to an ML system, such as data and compute.
Answering questions like these is crucial for understanding what the future of AI will look like. At Epoch AI, we’re particularly concerned with ensuring that AI is developed in a beneficial way, with appropriate governance interventions to ensure safety. Better understanding the progress of compute capabilities can help us better navigate a future with powerful AI systems.