Deep learning's recent history has been one of achievement: from triumphing
over humans in the game of Go to world-leading performance in image
recognition, voice recognition, translation, and other tasks. But this progress
has come with a voracious appetite for computing power. This article reports on
the computational demands of Deep Learning applications in five prominent
application areas and shows that progress in all five is strongly reliant on
increases in computing power. Extrapolating forward this reliance reveals that
progress along current lines is rapidly becoming economically, technically, and
environmentally unsustainable. Thus, continued progress in these applications
will require dramatically more computationally-efficient methods, which will
either have to come from changes to deep learning or from moving to other
machine learning methods.

What explains the dramatic progress from 20th-century to 21st-century AI...


We analyze 1,058 research papers found in the arXiv pre-print repository, as well as other benchmark sources, to understand how deep learning performance depends on computational power in the domains of image classification, object detection, question answering, named entity recognition, and machine translation. We show that computational requirements have escalated rapidly in each of these domains and that these increases in computing power have been central to performance improvements. If progress continues along current lines, these computational requirements will rapidly become technically and economically prohibitive. Thus, our analysis suggests that deep learning progress will be constrained by its computational requirements and that the machine learning community will be pushed to either dramatically increase the efficiency of deep learning or to move to more computationally-efficient machine learning techniques.

To understand why deep learning is so computationally expensive, we analyze its statistical and computational scaling in theory. We show deep learning is not computationally expensive by accident, but by design. The same flexibility that makes it excellent at modeling diverse phenomena and outperforming expert models also makes it dramatically more computationally expensive. Despite this, we find that the actual computational burden of deep learning models is scaling more rapidly than (known) lower bounds from theory, suggesting that substantial improvements might be possible.

It would not be a historical anomaly for deep learning to become computationally constrained. Even at the creation of the first neural networks by Frank Rosenblatt, performance was limited by the available computation. In the past decade, these computational constraints have relaxed due to speed-ups from moving to specialized hardware and a willingness to invest additional resources to get better performance. But, as we show, the computational needs of deep learning scale so rapidly, that they will quickly become burdensome again.

2 Deep Learning’s Computational Requirements in Theory

The relationship between performance, model complexity, and computational requirements in deep learning is still not well understood theoretically.
Nevertheless, there are important reasons to believe that deep learning is intrinsically more reliant on computing power than other techniques, in particular due to the role of overparameterization and how this scales as additional training data are used to improve performance (including, for example, classification error rate, root mean squared regression error, etc.).

It has been proven that there are significant benefits to having a neural network contain more parameters than there are data points available to train it, that is, by overparameterizing it [soltanolkotabi2019theoretical]. Classically this would lead to overfitting, but stochastic gradient-based optimization methods provide a regularizing effect due to early stopping [pillaud2018statistical, Belkin15849] (this is often called implicit regularization, since there is no explicit regularization term in the model), moving the neural networks into an interpolation regime, where the training data is fit almost exactly while still maintaining reasonable predictions on intermediate points [belkin2018overfitting, belkin2019does]. An example of large-scale overparameterization is the current state-of-the-art image recognition system, NoisyStudent, which has 480M parameters for ImageNet's 1.2M data points [xie2019self, russakovsky2015imagenet].

The challenge of overparameterization is that the number of deep learning parameters must grow as the number of data points grows. Since the cost of training a deep learning model scales with the product of the number of parameters and the number of data points, this implies that computational requirements grow as at least the square of the number of data points in the overparameterized setting. This quadratic scaling, however, is an underestimate of how fast deep learning networks must grow to improve performance, since the amount of training data must scale much faster than linearly in order to get a linear improvement in performance. Statistical learning theory tells us that, in general, root mean squared estimation error can at most drop as $1/\sqrt{n}$ (where $n$ is the number of data points), suggesting that at least a quadratic increase in data points would be needed to improve performance, here viewing (possibly continuous-valued) label prediction as a latent variable estimation problem. This back-of-the-envelope calculation thus implies that the computation required to train an overparameterized model should grow at least as a fourth-order polynomial with respect to performance, i.e. $\text{Computation} = O(\text{Performance}^4)$, and may be worse.
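The chain of reasoning in this back-of-the-envelope calculation can be traced in a short Python sketch. It is only illustrative: the function name `compute_multiplier` is hypothetical, "performance" is assumed to mean the inverse of root mean squared error, and parameters are assumed to grow exactly in proportion to the data, which is the most favorable case the argument allows.

```python
def compute_multiplier(performance_gain: float) -> float:
    """Estimate how much training compute must grow to improve performance
    by a factor of `performance_gain` (assumption: performance = 1 / RMSE).
    """
    # Error falls at best as 1/sqrt(n), so a k-fold gain needs k^2 times the data.
    data_multiplier = performance_gain ** 2
    # In the overparameterized setting, parameters grow at least as fast as data.
    parameter_multiplier = data_multiplier
    # Training cost scales with (number of parameters) x (number of data points).
    return data_multiplier * parameter_multiplier


if __name__ == "__main__":
    for gain in (2.0, 10.0):
        print(f"{gain:g}x performance -> {compute_multiplier(gain):g}x compute")
```

Under these assumptions, doubling performance costs 16 times the computation, and a tenfold improvement costs 10,000 times.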

The relationship between model parameters, data, and computational requirements in deep learning can be illustrated by analogy in the setting of linear regression, where the statistical learning theory is better developed (and which is equivalent to a 1-layer neural network with linear activations). Consider the following generative $d$-dimensional linear model: $y(x) = \theta^T x + z$, where $z$ is Gaussian noise. Given $n$ independent $(x, y)$ samples, the least squares estimate of $\theta$ is $\hat{\theta}_{LS} = (X^T X)^{-1} X^T Y$, yielding a predictor $\hat{y}(x_0) = \hat{\theta}_{LS}^T x_0$ on unseen $x_0$ (here $X \in \mathbb{R}^{n \times d}$ is a matrix concatenating the samples from $x$, and $Y$ is an $n$-dimensional vector concatenating the samples of $y$). The root mean squared error of this predictor can be shown to scale as $O(\sqrt{d/n})$. Suppose that $d$ (the number of covariates in $x$) is very large, but we expect that only a few of these covariates (whose identities we don’t know) are sufficient to achieve good prediction performance. A traditional approach to estimating $\theta$ would be to use a small model, i.e. choosing only some small number of covariates, $s$, in $x$, chosen based on expert guidance about what matters. When such a model correctly identifies all the relevant covariates (the “oracle” model), a traditional least-squares estimate of the $s$ covariates is the most efficient unbiased estimate of $\theta$ (by the Gauss-Markov theorem). When such a model is only partially correct and omits some of the relevant covariates, it will quickly learn the correct parts as $n$ increases but will then have its performance plateau. An alternative is to attempt to learn the full $d$-dimensional model by including all covariates as regressors. Unfortunately, this flexible model is often too data inefficient to be practical.

Regularization can help. In regression, one of the simplest forms of regularization is the Lasso [tibshirani1996regression], which penalizes the sum of the absolute values of the coefficients, driving many of them to exactly zero and making the model sparser. Lasso regularization improves the root mean squared error scaling to $O(\sqrt{s \log d / n})$, where $s$ is the number of nonzero coefficients in the true model [meinshausen2009lasso].
Hence if $s$ is a constant and $d$ is large, the data requirements of the Lasso are within a logarithmic factor of the oracle model, and exponentially better than the flexible least squares approach.
This improvement allows the regularized model to be much more flexible (by using larger $d$), but this comes with the full computational costs associated with estimating a large number ($d$) of parameters.
Note that while here $d$ is the dimensionality of the data (which can be quite large, e.g. the number of pixels in an image), one can also view deep learning as mapping data to a very large number of nonlinear features. If these features are viewed as $d$, it is perhaps easier to see why one would want to increase $d$ dramatically to achieve flexibility (as it would now correspond to the number of neurons in the network).

To see these trade-offs quantitatively, consider a generative model that has 10 non-zero parameters out of a possible 1000, and consider 4 models for trying to discover those parameters:

Oracle model: has exactly the correct 10 parameters in the model

Expert model: has exactly 9 correct and 1 incorrect parameters in the model

Flexible model: has all 1000 potential parameters in the model and uses the least-squares estimate

Regularized model: like the flexible model, it has all 1000 potential parameters but now in a regularized (Lasso) model

We measure the performance as $-\log_{10}(\text{MSE})$, where MSE is the normalized mean squared error between the prediction computed using the estimated parameters and the prediction computed using the true 1000-dimensional parameter vector. The prediction MSE is averaged over query vectors sampled from an isotropic Gaussian distribution.
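A small simulation in the spirit of this comparison can be sketched with NumPy. Everything below is an assumption made for illustration rather than the setup behind figure 1: the nonzero coefficients are all set to 1, the noise level is 0.1, there are 200 samples, the Lasso is solved with a plain proximal-gradient (ISTA) loop, its penalty follows a common textbook choice, and the MSE is normalized by the squared norm of the true parameter vector. With fewer samples than parameters, the "flexible" least-squares fit falls back to the minimum-norm solution that `np.linalg.lstsq` returns. Because the query vectors are isotropic Gaussian, the average prediction MSE equals the squared distance between the estimated and true parameters, so no query sampling is needed.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize (1/2n)||y - X theta||^2 + lam * ||theta||_1 with ISTA."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth term
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta


def least_squares_on(X, y, support):
    """Least squares using only the columns in `support`; other coefficients stay 0."""
    theta = np.zeros(X.shape[1])
    theta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return theta


def performance(theta_hat, theta_true):
    """-log10 of the prediction MSE, normalized by ||theta_true||^2 (an assumption)."""
    mse = np.sum((theta_hat - theta_true) ** 2) / np.sum(theta_true ** 2)
    return -np.log10(mse)


def run_experiment(n=200, d=1000, s=10, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    theta[:s] = 1.0  # the 10 truly non-zero parameters
    X = rng.standard_normal((n, d))
    y = X @ theta + sigma * rng.standard_normal(n)

    oracle_support = np.arange(s)                    # exactly the correct parameters
    expert_support = np.append(np.arange(s - 1), s)  # 9 correct, 1 incorrect
    lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)     # illustrative penalty level
    estimates = {
        "oracle": least_squares_on(X, y, oracle_support),
        "expert": least_squares_on(X, y, expert_support),
        "flexible": np.linalg.lstsq(X, y, rcond=None)[0],
        "regularized": lasso_ista(X, y, lam),
    }
    return {name: performance(est, theta) for name, est in estimates.items()}


if __name__ == "__main__":
    for name, score in run_experiment().items():
        print(f"{name:12s} -log10(MSE) = {score:.2f}")
```

Rerunning `run_experiment` over a range of `n` values gives one way to trace out learning curves for the four models.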

As figure 1(a) shows, the neural-network analog (the flexible, regularized model) is much more efficient with data than an unregularized flexible model, but considerably less so than the oracle model or (initially) the expert model. Nevertheless, as the amount of data grows, the regularized flexible model outperforms expert models that don’t capture all contributing factors. This graph generalizes an insight attributed to Andrew Ng: that traditional machine learning techniques do better when the amount of data is small, but that flexible deep learning models do better with more data [kruup_2018]. (In fact, sufficiently large neural networks are universal function approximators [hornik1989multilayer], implying maximum flexibility.) We argue that this is a more general phenomenon of flexible models having greater potential, but also having vastly greater data and computational needs. (Another demonstration of this comes from the fact that certain types of deep neural networks can provably be replaced by Gaussian process models that are also flexible and have the advantage of being less black-box, but scale their computational needs even more poorly than neural networks [novak2018].) In our illustration in figure 1, for example, 1,500 observations are needed for the flexible model to reach the same performance as the oracle reaches with 15. Regularization helps with this, dropping the data need to 175. But, while regularization helps substantially with the pace at which data can be learned from, it helps much less with the computational costs, as figure 1(b) shows.

Hence, by analogy, we can see that deep learning performs well because it uses overparameterization to create a highly flexible model and uses (implicit) regularization to make the sample complexity tractable. At the same time, however, deep learning requires vastly more computation than more efficient models. Thus, the great flexibility of deep learning inherently implies a dependence on large amounts of data and computation.

3 Deep Learning’s Computational Requirements in Practice

3.1 Past

Even in their early days, it was clear that computational requirements limited what neural networks could achieve. In 1960, when Frank Rosenblatt wrote about a 3-layer neural network, there were hopes that it had “gone a long way toward demonstrating the feasibility of a perceptron as a pattern-recognizing device.” But, as Rosenblatt already recognized, “as the number of connections in the network increases, however, the burden on a conventional digital computer soon becomes excessive.” [Rosenblatt1960] Later that decade, in 1969, Minsky and Papert explained the limits of 3-layer networks, including the inability to learn the simple XOR function. At the same time, however, they noted a potential solution: “the experimenters discovered an interesting way to get around this difficulty by introducing longer chains of intermediate units” (that is, by building deeper neural networks). [minsky69perceptrons]

Despite this potential workaround, much of the academic work in this area was abandoned because there simply wasn’t enough computing power available at the time. As Léon Bottou later wrote, “the most obvious application of the perceptron, computer vision, demands computing capabilities that far exceed what could be achieved with the technology of the 1960s”. [minsky69perceptrons]

In the decades that followed, improvements in computer hardware provided, by one measure, a ≈50,000× improvement in performance [HennessyPa19a] and neural networks grew their computational requirements proportionally, as shown in figure 9(a). Since the growth in computing power per dollar closely mimicked the growth in computing power per chip [thompson2018decline], this meant that the economic cost of running such models was largely stable over time. Despite this large increase, deep learning models in 2009 remained “too slow for large-scale applications, forcing researchers to focus on smaller-scale models or to use fewer training examples.” [Raina2009] The turning point seems to have been when deep learning was ported to GPUs, initially yielding a 5−15× speed-up [Raina2009] which by 2012 had grown to more than 35× [nvidia2017], and which led to the important victory of AlexNet at the 2012 ImageNet competition [krizhevsky2012imagenet]. (The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has released a large visual database to evaluate algorithms for classifying and detecting objects and scenes every year since 2010 [imagenet_cvpr09, russakovsky2015imagenet].) But image recognition was just the first of these benchmarks to fall. Shortly thereafter, deep learning systems also won at object detection, named-entity recognition, machine translation, question answering, and speech recognition.

The introduction of GPU-based (and later ASIC-based) deep learning led to widespread adoption of these systems. But the amount of computing power used in cutting-edge systems grew even faster, at approximately 10× per year from 2012 to 2019 [openai2018]. This rate is far faster than the ≈35× total improvement from moving to GPUs, the meager improvements from the last vestiges of Moore’s Law [thompson2018decline], or the improvements in neural network training efficiency [openai2020]. Instead, much of the increase came from a much-less-economically-attractive source: running models for more time on more machines. For example, in 2012 AlexNet trained using 2 GPUs for 5-6 days[krizhevsky2012imagenet], in 2017 ResNeXt-101 [xie2017aggregated] was trained with 8 GPUs for over 10 days, and in 2019 NoisyStudent was trained with ≈1,000 TPUs (one TPU v3 Pod) for 6 days[xie2019self]. Another extreme example is the machine translation system, “Evolved Transformer”, which used more than 2 million GPU hours and cost millions of dollars to run[So2019evolved, wu2020lite]. Scaling deep learning computation by scaling up hardware hours or number of chips is problematic in the longer-term because it implies that costs scale at roughly the same rate as increases in computing power [openai2018], which (as we show) will quickly make it unsustainable.

3.2 Present

To examine deep learning’s dependence on computation, we examine 1,058 research papers, covering the domains of image classification (ImageNet benchmark), object detection (MS COCO), question answering (SQuAD 1.1), named-entity recognition (CoNLL 2003), and machine translation (WMT 2014 En-to-Fr). We also investigated models for CIFAR-10, CIFAR-100, and ASR SWB Hub500 speech recognition, but too few of the papers in these areas reported computational details to analyze those trends.

We source deep learning papers from the arXiv repository as well as other benchmark sources (see Section 6 in the supplement for more information on the data gathering process). In many cases, papers did not report any details of their computational requirements. In other cases only limited computational information was provided. As such, we do two separate analyses of computational requirements, reflecting the two types of information that were available: (1) Computation per network pass (the number of floating point operations required for a single pass in the network, also measurable using multiply-adds or the number of parameters in the model), and (2) Hardware burden (the computational capability of the hardware used to train the model, calculated as #processors×ComputationRate×time).
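The hardware burden measure can be written out as a small Python helper. This is a sketch under assumptions: the function name and the choice of units (GFLOPS for the per-processor rate, hours for time) are hypothetical, and the 8-processor, 24-hour run in the usage example is made up for illustration. The 10609 GFLOPS figure is the NVIDIA GPU 1080 Ti rate quoted in the data collection section.

```python
def hardware_burden_gflop(num_processors: int,
                          gflops_per_processor: float,
                          hours: float) -> float:
    """Hardware burden = #processors x ComputationRate x time.

    With the rate in GFLOPS (10^9 floating point operations per second)
    and time converted to seconds, the result is in GFLOP.
    """
    seconds = hours * 3600.0
    return num_processors * gflops_per_processor * seconds


if __name__ == "__main__":
    # Hypothetical training run: 8 processors at 10609 GFLOPS each for 24 hours.
    burden = hardware_burden_gflop(8, 10609.0, 24.0)
    print(f"hardware burden: {burden:.3e} GFLOP")
```

Because the rate is a capability of the hardware rather than a count of operations actually executed, a figure computed this way is best read as a measure of hardware capacity committed to training.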

We demonstrate our analysis approach first in the area with the most data and longest history: image classification. As opposed to the previous section where we considered mean squared error, here the relevant performance metric is classification error rate. While not always directly equivalent, we argue that applying similar forms of analysis to these different performance metrics is appropriate, as both involve averaging error loss over query data points. (Note, for instance, that classification error rate can be related to regression MSE under a 1-hot encoding of $d$ classes into a $d$-dimensional binary vector.)

Figure 9 (b) shows the fall in error rate in image recognition on the ImageNet dataset and its correlation with the computational requirements of those models. Each data point reflects a particular deep learning model from the literature. Because this is plotted on a log-log scale, a straight line indicates a polynomial growth in computing per unit of performance. In particular, a polynomial relationship between computation and performance of the form $\text{Computation} = \text{Performance}^{\alpha}$ yields a slope of $-1/\alpha$ in our plots. Thus, our estimated slope coefficient of $-0.11$ (p-value $< 0.01$) indicates that computation used for ImageNet scales as $O(\text{Performance}^9)$ (recall: theory shows that, at best, computation could scale as $O(\text{Performance}^4)$). Not only is computational power a highly statistically significant predictor of performance, but it also has substantial explanatory power, explaining 43% of the variance in ImageNet performance. (In supplement section 7.1, we consider alternative forms for this regression. For example, we present a functional form where computation scales exponentially with performance. That form also results in a highly statistically significant reliance on computing power, but has less explanatory power. We also present an alternative prediction to the conditional mean where we instead estimate the best performance achievable for models with a given amount of computation. That analysis shows an even greater dependence of performance on computation, $O(\text{Performance}^{11})$.)
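The step from a fitted log-log slope to a polynomial exponent can be made concrete with a short NumPy sketch. The helper names are hypothetical, the data in the usage example are synthetic rather than taken from the paper, and the sketch assumes that the plotted quantity is error rate against computation, with performance read as the inverse of error rate.

```python
import numpy as np


def fit_loglog_slope(computation, error_rate) -> float:
    """Ordinary least-squares slope of log10(error rate) on log10(computation)."""
    slope, _intercept = np.polyfit(np.log10(computation), np.log10(error_rate), 1)
    return float(slope)


def exponent_from_slope(slope: float) -> float:
    """If Computation = Performance^alpha, the log-log slope is -1/alpha."""
    return -1.0 / slope


if __name__ == "__main__":
    # The slope reported in the text for ImageNet.
    print(f"alpha from slope -0.11: {exponent_from_slope(-0.11):.2f}")

    # Synthetic check: data generated with alpha = 9 should recover alpha = 9.
    computation = np.logspace(0, 9, 10)
    error_rate = computation ** (-1.0 / 9.0)
    slope = fit_loglog_slope(computation, error_rate)
    print(f"recovered alpha: {exponent_from_slope(slope):.2f}")
```

A slope of −0.11 corresponds to an exponent of about 9.1, which the text rounds to 9.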

We do substantial robustness checking on this result in the Supplemental Materials. For example, we attempt to account for algorithmic progress by introducing a time trend. That addition does not weaken the observed dependency on computation power, but does explain an additional 12% of the variance in performance. This simple model of algorithm improvement implies that 3 years of algorithmic improvement is equivalent to an increase in computing power of 10×. But it is unclear how this manifests. In particular, it may apply to performance away from the cutting-edge (as [openai2020] showed) but less to improving performance. Another possibility is that getting algorithmic improvement may itself require complementary increases in computing power.

Unfortunately, despite the efforts of machine learning conferences to encourage more thorough reporting of experimental details (e.g. the reproducibility checklists of ICML [icml] and NeurIPS), few papers in the other benchmark areas provide sufficient information to analyze the computation needed per network pass. More widely reported, however, is the computational hardware burden of running models. This also estimates the computation needed, but is less precise since it depends on hardware implementation efficiency.

Figure 3 shows progress in the areas of image classification, object detection, question answering, named entity recognition, and machine translation. We find highly statistically significant slopes and strong explanatory power ($R^2$ between 29% and 68%) for all benchmarks except machine translation, English to German, where we have very little variation in the computing power used. Interpreting the coefficients for the five remaining benchmarks shows a slightly higher polynomial dependence for ImageNet when calculated using this method (≈14), and a dependence of 7.7 for question answering. Object detection, named-entity recognition and machine translation show large increases in hardware burden with relatively small improvements in outcomes, implying dependencies of around 50. We test other functional forms in the supplementary materials and, again, find that overall the polynomial models best explain this data, but that models implying an exponential increase in computing power as the right functional form are also plausible.

Collectively, our results make it clear that, across many areas of deep learning, progress in training models has depended on large increases in the amount of computing power being used. A dependence on computing power for improved performance is not unique to deep learning, but has also been seen in other areas such as weather prediction and oil exploration [thompson2020exponential]. But in those areas, as might be a concern for deep learning, there has been enormous growth in the cost of systems, with many cutting-edge models now being run on some of the largest computer systems in the world.

3.3 Future

In this section, we extrapolate the estimates from each domain to understand the projected computational power needed to hit various benchmarks. To make these targets tangible, we present them not only in terms of the computational power required, but also in terms of the economic and environmental cost of training such models on current hardware (using the conversions from [strubell2019energy]). Because the polynomial and exponential functional forms have roughly equivalent statistical fits — but quite different extrapolations — we report both in figure 4.

We do not anticipate that the computational requirements implied by the targets in figure 4 will be hit. The hardware, environmental, and monetary costs would be prohibitive. And enormous effort is going into improving scaling performance, as we discuss in the next section. But these projections do provide a scale for the efficiency improvements that would be needed to hit these performance targets. For example, even in the more-optimistic model, it is estimated to take $10^5\times$ more computing to get to an error rate of 5% for ImageNet. Hitting this in an economical way will require more efficient hardware, more efficient algorithms, or other improvements such that the net impact is this large a gain.

The rapid escalation in computing needs in figure 4 also makes a stronger statement: along current trends, it will not be possible for deep learning to hit these benchmarks. Instead, fundamental rearchitecting is needed to lower the computational intensity so that the scaling of these problems becomes less onerous. And there is promise that this could be achieved. Theory tells us that the lower bound for the computational intensity of regularized flexible models is $O(\text{Performance}^4)$, which is much better than current deep learning scaling. Encouragingly, there is historical precedent for algorithms improving rapidly [thompson2020algorithms].

4 Lessening the Computational Burden

The economic and environmental burden of hitting the performance benchmarks in Section 3.3 suggests that deep learning is facing an important challenge: either find a way to increase performance without increasing computing power, or have performance stagnate as computational requirements become a constraint. In this section, we briefly survey approaches that are being used to address this challenge.

Increasing computing power: Hardware accelerators. For much of the 2010s, moving to more-efficient hardware platforms (and more of them) was a key source of increased computing power [thompson2018decline]. For deep learning, these included mostly GPU and TPU implementations, although they have increasingly also included FPGAs and ASICs. Fundamentally, all of these approaches sacrifice generality of the computing platform for the efficiency of increased specialization. But such specialization faces diminishing returns [Leiserson20], and so other hardware frameworks are being explored. These include analog hardware with in-memory computation [ambrogio2018equivalent, kim2019confined], neuromorphic computing [davies2019progress], optical computing [lin2018all], and quantum computing based approaches [welser2018future], as well as hybrid approaches [potok2018study]. Thus far, however, such attempts have yet to disrupt the GPU/TPU and FPGA/ASIC architectures. Of these, quantum computing is the approach with perhaps the most long-term upside, since it offers a potential for sustained exponential increases in computing power [gambetta2019cramming, cross2019validating].

Reducing computational complexity: Network Compression and Acceleration. This body of work primarily focuses on taking a trained neural network and sparsifying or otherwise compressing the connections in the network, so that it requires less computation to use it in prediction tasks [cheng2017survey]. This is typically done by using optimization or heuristics such as “pruning” away weights [dong2017more], quantizing the network [hubara2016binarized], or using low-rank compression [wen2017coordinating], yielding a network that retains the performance of the original network but requires fewer floating point operations to evaluate. Thus far these approaches have produced computational improvements that, while impressive, are not sufficiently large in comparison to the overall orders-of-magnitude increases of computation in the field (e.g. the recent work [chen2018big] reduces computation by a factor of 2, and [wu2020lite] reduces it by a factor of 8 on a specific NLP architecture, both without reducing performance significantly). (Some works, e.g. [han2015deep], focus more on the reduction in the memory footprint of the model; [han2015deep] achieved 50x compression.) Furthermore, many of these works focus on improving the computational cost of evaluating the deployed network, which is useful, but does not mitigate the training cost, which can also be prohibitive.
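As a rough illustration of what “pruning” away weights means, the sketch below zeroes out the smallest-magnitude entries of a weight matrix. This is a generic magnitude-pruning heuristic chosen for brevity, not the procedure of any of the cited works, and the function name `magnitude_prune` is hypothetical.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the `sparsity` fraction of entries
    that are smallest in absolute value set to zero."""
    k = int(sparsity * weights.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = rng.standard_normal((64, 64))  # stand-in for one trained layer
    pruned = magnitude_prune(layer, sparsity=0.75)
    kept = np.count_nonzero(pruned)
    print(f"kept {kept} of {layer.size} weights")
```

With a sparse-aware implementation, the multiply-adds needed to evaluate such a layer scale with the number of remaining non-zero weights. Whether accuracy survives depends on the network and usually requires fine-tuning after pruning, which this sketch does not cover.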

Finding high-performing small deep learning architectures: Neural Architecture Search and Meta Learning. Recently, it has become popular to use optimization to find network architectures that are computationally efficient to train while retaining good performance on some class of learning problems, e.g. [pham2018efficient], [cai2019once] and [finn2017model], as well as exploiting the fact that many datasets are similar and therefore information from previously trained models can be used (meta learning [pham2018efficient] [long2017deep]). While often quite successful, the current downside is that the overhead of doing meta learning or neural architecture search is itself computationally intense (since it requires training many models on a wide variety of datasets) [pham2018efficient], although the cost has been decreasing towards the cost of traditional training [cai2018proxylessnas, cai2019once].

An important limitation to meta learning is the scope of the data that the original model was trained on. For example, for ImageNet, [Barbu2019] showed that image recognition performance depends heavily on image biases (e.g. an object is often photographed at a particular angle with a particular pose), and that without these biases transfer learning performance drops 45%. Even with novel data sets purposely built to mimic their training data, [recht2019] finds that performance drops 11−14%.
Hence, while there seem to be a number of promising research directions for making deep learning computation grow at a more attractive rate, they have yet to achieve the orders-of-magnitude improvements needed to allow deep learning progress to continue scaling.

Another possible approach to evade the computational limits of deep learning would be to move to other, perhaps as yet undiscovered or underappreciated types of machine learning. As figure 1(b) showed, “expert” models can be much more computationally-efficient, but their performance plateaus if those experts cannot see all the contributing factors that a flexible model might explore. One area where such techniques are already outperforming deep learning models is where engineering and physics knowledge can be applied more directly: the recognition of known objects (e.g. vehicles) [He2019, Tzoumas2019]. The recent development of symbolic approaches to machine learning takes this a step further, using symbolic methods to efficiently learn and apply “expert knowledge” in some sense, e.g. [udrescu2020ai] which learns physics laws from data, or approaches [mao2019neuro, asai2020learning, yi2018neural]

5 Conclusion

The explosion in computing power used for deep learning models has ended the “AI winter” and set new benchmarks for computer performance on a wide range of tasks. However, deep learning’s prodigious appetite for computing power imposes a limit on how far it can improve performance in its current form, particularly in an era when improvements in hardware performance are slowing. This article shows that the computational limits of deep learning will soon be constraining for a range of applications, making the achievement of important benchmark milestones impossible if current trajectories hold. Finally, we have discussed the likely impact of these computational limits: forcing Deep Learning towards less computationally-intensive methods of improvement, and pushing machine learning towards techniques that are more computationally-efficient than deep learning.

Acknowledgement

The authors would like to acknowledge funding from the MIT Initiative on the Digital Economy and the Korean Government. This research was partially supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2017R1C1B1010094).

References

Supplemental Materials

6 Methodology

6.1 Data collection

We collect data on the performance and computational requirements of various deep learning models from
arXiv (arXiv.org), which is an open-access archive where scholars upload preprints of their scientific papers (once they are approved by moderators). Preprints are categorized into the following fields: mathematics, physics, astrophysics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Moderators review submissions and can (re)categorize them or reject them (but this is not a full peer review), as discussed at https://arxiv.org/corr/subjectclasses. Preprints are accessible and freely distributed worldwide.

To collect preprints of interest from arXiv, we use the search terms of specific tasks and benchmarks (as discussed below). These allow us to gather pdf pre-prints from arXiv.org. From each paper, we attempt to gather the following pieces of information:

Application area (e.g. image recognition)

Benchmark (name, version, # of items in the training set)

Paper details (authors, title, year made public, model name)

Network characteristics (# parameters, # of training epochs)

We extract information from these pre-prints using a manual review. Our results were also cross-checked with information from paperswithcode.com, where authors upload their papers and the code implementing their algorithms along with the achieved performance metrics.
Where possible, extraction is automated: because preprints are named using a combination of publication year and month, submission number, and version update, we can automatically extract when the paper was made public. For example, the Gpipe paper [huang2019gpipe] has the name “1811.06965v5”, indicating that it was made public in November 2018.
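
The date extraction described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name and the regular expression are assumptions, and it only handles the identifier pattern used in the example (two-digit year, two-digit month, submission number, optional version).

```python
import re
from datetime import date

# Assumed pattern: YYMM.number with an optional version suffix, e.g. "1811.06965v5".
ARXIV_ID = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})(?:v(\d+))?$")


def arxiv_first_public(arxiv_id: str) -> date:
    """Return the first day of the month encoded in the identifier."""
    match = ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"unrecognized arXiv identifier: {arxiv_id!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in identifier: {arxiv_id!r}")
    # Two-digit years are mapped onto 20YY, which is an assumption of this sketch.
    return date(2000 + year, month, 1)


print(arxiv_first_public("1811.06965v5"))  # 2018-11-01
```

The version suffix is parsed but ignored, since the year and month in the name refer to when the paper was first made public.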

Despite manual review, for many papers we are unable to reconstruct the amount of computation used because so little data on this is reported. When a paper provides the type of hardware and the usage data for that hardware, but not its computational capacity, we seek out that performance data from external sources, including the hardware designers (NVIDIA, Google) or publicly-available databases (e.g. Wikipedia). For example, a preprint might reference “NVIDIA GPU 1080 Ti,” which external sources report as performing 10,609 GFLOPS. We take single-precision (FP32) processing power in GFLOPS as our standard.
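
As a rough sketch of how a hardware specification and usage data could be combined into a computation estimate, the snippet below multiplies peak throughput by the number of devices and the run time. The function name, the lookup table, and the example run (4 GPUs for 24 hours) are hypothetical; only the 1080 Ti figure comes from the text. Because real utilization is usually below peak, the result is best read as an upper bound.

```python
# Peak FP32 throughput in GFLOPS. Only the 1080 Ti value is quoted in the text.
PEAK_GFLOPS = {"NVIDIA GTX 1080 Ti": 10609.0}


def total_gflop(device: str, n_devices: int, hours: float) -> float:
    """Upper-bound estimate of total computation, in GFLOP (not per second)."""
    seconds = hours * 3600.0
    return PEAK_GFLOPS[device] * n_devices * seconds


# Hypothetical training run: 4 GPUs for 24 hours.
estimate = total_gflop("NVIDIA GTX 1080 Ti", n_devices=4, hours=24.0)
print(f"{estimate:.3e} GFLOP")
```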

Figure 6 summarizes the benchmark data that we gathered, as well as the number of papers where we were able to extract sufficient data to calculate the computation used.

6.2 Application Area: Images

We examine two applications of deep learning to images: image classification and object detection.

6.2.1 Image classification

Image classification is a computer vision task where the content of an image is identified using only the image itself. For example, an image classification algorithm performs a set of instructions to calculate the probability that an image contains a cat or not. There are a number of image classification datasets, including Caltech 101 [fei2004caltech], Caltech 256 [griffin2007caltech], ImageNet [russakovsky2015imagenet], CIFAR-10/100 [krizhevsky2014cifar], MNIST [lecun1998mnist], SVHN [netzer2019street], STL-10 [coates2011analysis], Fashion-MNIST [xiao2017fashion], CINIC-10 [darlow2018cinic], Flowers-102, iNaturalist [van2018inaturalist], EMNIST-Letters [cohen2017emnist], Kuzushiji-MNIST [lamb2018deep], Stanford Cars [krause2013collecting], and MultiMNIST [eslami2016attend]. We focus on ImageNet and CIFAR-10/100 because their long history allows us to gather many data points (i.e. papers).

One simple performance measure for image classification is the share of total predictions for the test set that are correct, called the “average accuracy” (or, equivalently, the share that are incorrect, the error rate). A common instantiation of this is the “top-k” error rate, which asks whether the correct label is missing from the top k predictions of the model. Thus, the top-1 error rate is the fraction of test images for which the correct label is not the top prediction of the model. Similarly, the top-5 error rate is the fraction of test images for which the correct label is not among the top five predictions.
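
The top-k error rate defined above can be illustrated with a short sketch. The function name and the toy scores are hypothetical and chosen only to show the calculation.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example: three images, three classes (made-up scores).
scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
labels = [0, 1, 1]
print(top_k_error(scores, labels, k=1))  # two of the three top predictions are wrong
print(top_k_error(scores, labels, k=2))  # every true label is within the top two
```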

Benchmark: ImageNet

ImageNet refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It is the successor to the PASCAL Visual Object Classes Challenge (VOC) [everingham2011pascal]. Like PASCAL VOC, ILSVRC provides a public dataset and runs an annual competition. Whereas PASCAL VOC supplied about 20,000 images labelled as 1 of 20 classes by a small group of annotators, ILSVRC provides about 1.5M images labelled as 1 of 1000 classes by a large group of annotators.
The ILSVRC2010 dataset contains 1,261,406 training images. The minimum number of training images for a class is 668 and the maximum number is 3047. The dataset also contains 50 validation images and 150 test images for each class. The images are collected from Flickr and other search engines. Manual labelling is crowdsourced using Amazon Mechanical Turk.

In the ILSVRC2010 competition, the winner and the runner-up both used support vector machines (SVM) with different representation schemes rather than deep learning. The NEC-UIUC team won the competition by using a novel algorithm that combines SIFT [lowe2004distinctive], LBP [ahonen2006face], two non-linear coding representations [zhou2010image, wang2010locality], and stochastic SVM [lin2011large]. The winning top-5 error rate was 28.2%. The second best performance was achieved by Xerox Research Centre Europe (XRCE). XRCE combined an improved Fisher vector representation [perronnin2007fisher], PCA dimensionality reduction and data compression, and a linear SVM [perronnin2010improving], which resulted in a top-5 error rate of 33.6%. The trend of developing advanced Fisher vector-based methods continued until 2014.

Deep learning systems began winning ILSVRC in 2012, starting with the SuperVision team from the University of Toronto, which won with AlexNet, achieving a top-5 error rate of 16.4% [krizhevsky2012imagenet]. Since this victory, the majority of teams submitting to ILSVRC each year have used deep learning algorithms.

Benchmark: CIFAR-10/100
CIFAR refers to the Canadian Institute For Advanced Research (https://www.cifar.ca/). According to [krizhevsky2009learning], groups at MIT and NYU collected 80 million images from the web for building a dataset for unsupervised training of deep generative models. There are two versions of the CIFAR dataset, CIFAR-10 and CIFAR-100, which are subsets of the 80 million images (https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR-10 contains 6,000 low-resolution (32×32) color images for each of 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck). The CIFAR-100 dataset contains 600 low-resolution (32×32) color images for each of 100 classes from 20 super-classes (aquatic mammals, fish, flowers, food containers, fruit and vegetables, household electrical devices, household furniture, insects, large carnivores, large man-made outdoor things, large natural outdoor scenes, large omnivores and herbivores, medium-sized mammals, non-insect invertebrates, people, reptiles, small mammals, trees, vehicles 1, vehicles 2). All the images were annotated by paid students. [krizhevsky2010convolutional] trained a two-layer convolutional Deep Belief Network (DBN) on an NVIDIA GTX 280 GPU using the CIFAR-10 dataset. It took 45 hours to pre-train and 36 hours to fine-tune. The best performance (accuracy rate) was 78.90%.

6.2.2 Object Detection

Object detection is the task of localization and classification of multiple objects in a single image. Localization means drawing a bounding box for each object. Classification means identifying the object in each bounding box. Localization becomes instance segmentation if, instead of a bounding box, an object is outlined. Whereas image classification identifies a single object in a single image, object detection identifies multiple objects in a single image using localization.

There are various performance measures for object detection, all based around the same concept. Intersection Over Union (IOU) measures the overlap between two bounding boxes: the ground truth bounding box and the predicted bounding box. This is calculated with a Jaccard Index, which calculates the similarity between two different sets, A and B, as $J(A,B) = |A \cap B| / |A \cup B|$. Thus, IOU is the area of the intersection between the two bounding boxes divided by the area of their union. It is 1 if the ground truth bounding box coincides with the predicted bounding box.
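
For axis-aligned boxes, the IOU calculation can be sketched as below. The (x_min, y_min, x_max, y_max) box representation and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 2.0, 2.0)
predicted = (1.0, 1.0, 3.0, 3.0)
print(iou(ground_truth, predicted))     # overlap area 1, union area 7
print(iou(ground_truth, ground_truth))  # identical boxes give 1.0
```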

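Box Average Precision (AP), which is also called mean average precision (mAP), averages over a range of IOU thresholds between 0.5 and 0.95: the average precision is computed at each IOU threshold, the results are summed, and the sum is divided by the number of IOU thresholds.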

Benchmark: MS COCO
COCO refers to Microsoft Common Objects in COntext (MS COCO), released in 2014 [lin2014microsoft]. The COCO dataset contains 91 common object categories, 82 of which have more than 5,000 labeled instances. In total the dataset has 2.5M manually labeled objects in 328,000 images. The objects are grouped into 11 super-categories and then classified into 91 common object categories by crowdsourced workers on Amazon’s Mechanical Turk platform. Like other benchmarks, the COCO dataset is publicly available so that new algorithms can be run on it (http://cocodataset.org/). [he2016deep] applied Faster R-CNN to the COCO dataset on an 8-GPU computer for 80k iterations and achieved an AP of 34.9.

6.3 Application area: Text

Deep learning has been applied to various text-related tasks, including named entity recognition, machine translation, question answering, text classification, text generation, text summarization, sentiment analysis, emotion recognition, and part-of-speech tagging. In this section, we consider three: named entity recognition, machine translation, and question answering.

6.3.1 Named Entity Recognition

Named entity recognition is the task of identifying and tagging entities in text with pre-defined classes (also called “types”). For example, Amazon Comprehend Medical extracts relevant medical information such as medical condition, medication, dosage, strength, and frequency from unstructured text such as doctors’ notes [bhatia2019comprehend]. Popular benchmarks are CoNLL 2003, Ontonotes v5, ACE 2004/2005, GENIA, BC5CDR, and SciERC. We focus on CoNLL2003.

Named Entity Recognition is measured using an F1 score, which is the harmonic mean of the precision and the recall on that task. The precision is the percentage of named entities discovered by an algorithm that are correct. The recall is the percentage of named entities present in the corpus that are discovered by the algorithm. Only an exact match is counted in both precision and recall. The F1 score goes to 1 only if the named entity recognition has perfect precision and recall, that is, it finds all instances of the classes and nothing else.
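
The exact-match F1 calculation can be sketched as follows. Representing each entity as a (start, end, type) tuple is an assumption of this sketch, as are the function name and the toy annotations.

```python
def ner_f1(gold, predicted):
    """Precision, recall and F1 where only exact (span, type) matches count."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy annotations as (start_token, end_token, type); values are made up.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
predicted = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (13, 14, "MISC")]
print(ner_f1(gold, predicted))  # two exact matches out of four predictions
```

In the toy example the second prediction has the right span but the wrong type, so it counts against both precision and recall.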

Benchmark: CoNLL2003
[sang2003introduction] shared the CoNLL2003 dataset for language-independent named entity recognition of the following classes: people, locations, organizations and names of miscellaneous entities. The dataset consists of a training file, a development file, a test file, and a large file with unlabeled data in each of English and German. The four files in English are taken from the Reuters Corpus (http://www.reuters.com/researchandstandards/). The English training file has 203,621 tokens from 14,987 sentences across 946 articles, including 7,140 locations, 3,438 miscellaneous entities, 6,321 organizations, and 6,600 people. The English development file has 51,362 tokens from 3,466 sentences across 216 articles, including 1,837 locations, 922 miscellaneous entities, 1,341 organizations, and 1,842 people. The English test file has 46,435 tokens from 3,684 sentences across 231 articles, including 1,668 locations, 702 miscellaneous entities, 1,661 organizations, and 1,617 people.

6.3.2 Machine Translation (MT)

MT tasks a machine with translating a sentence in one language into a different language. MT can be viewed as a form of natural language generation from a textual context. MT can be categorized into rule-based MT, statistical MT, example-based MT, hybrid MT, and neural MT (i.e., MT based on DL). MT is another task that has enjoyed a high degree of improvement due to the introduction of DL. Benchmarks include WMT 2014/2016/2018/2019 [bojar2014findings] and IWSLT 2014/2015 [cettolo2014report].

The BLEU score is a metric for translation that computes the similarity between a human translation and a machine translation based on n-grams. An n-gram is a contiguous sequence of n items from a given text. The score is based on precision, brevity penalty, and clipping. The modified n-gram precision measures the degree of n-gram overlap between the reference sentence and the translated sentence. Put simply, precision is the number of candidate n-grams that occur in any reference divided by the total number of n-grams in the candidate; clipping caps the count of each candidate n-gram at the largest number of times it occurs in any single reference. The brevity penalty is a factor that penalizes candidate translations that are too short, since a high-scoring candidate should match the reference translations in length as well as in word choice and word order. It is computed as a decaying exponential in the ratio of the test corpus’ effective reference length ($r$) to the total length of the candidate translation corpus ($c$), $r/c$. Hence the brevity penalty is 1 if $c$ exceeds $r$ and $\exp(1 - r/c)$ otherwise.

BLEU is the product of an exponential brevity penalty factor and the geometric mean of the modified n-gram precisions after case folding, as given in the equation below.

$$\mathrm{BLEU} = \min\{1, e^{1 - r/c}\} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right).$$

Here, $N$ is the maximum value that $n$ can take, $w_n$ is the weight on n-grams of length $n$, and $p_n$ is the modified n-gram precision.
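
The formula can be made concrete with a small sketch. It is a simplification under stated assumptions: it scores a single tokenized sentence rather than a whole test corpus, uses uniform weights $w_n = 1/N$ with $N = 4$, takes the reference length closest to the candidate length as $r$, and applies no smoothing, so any zero precision yields a score of 0. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n of the candidate against the references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clipping: each candidate n-gram is credited at most max_ref[gram] times.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights (assumption of this sketch)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Effective reference length r: the reference length closest to c.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    brevity_penalty = min(1.0, math.exp(1.0 - r / c))
    return brevity_penalty * math.exp(sum(math.log(p) / max_n for p in precisions))


reference = "the cat is on the mat".split()
print(modified_precision("the the the the the the the".split(), [reference], 1))
print(bleu("the cat is on".split(), [reference]))
```

In the first call, clipping limits the credit for "the" to the two occurrences in the reference, giving a unigram precision of 2/7. In the second call every n-gram precision is 1, so the score equals the brevity penalty $\exp(1 - 6/4)$.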

The BLEU score ranges from 0 to 1. A score of 1 is the best possible but is only achieved when a translation is identical to a reference translation. Thus even a human translator may not reach 1. BLEU has two advantages. First, it can be applied to any language. Second, it is easy and quick to compute. BLEU is known to correlate highly with human judgments because it averages out individual sentence judgment errors over a test corpus.

6.3.3 Question Answering (QA)

QA is the task of having a machine generate a correct answer to a question from an unstructured collection of documents in a given natural language. QA requires reading comprehension ability. For a machine, reading comprehension means understanding natural language and comprehending knowledge about the world.

Benchmarks include but are not limited to Stanford Question Answering Dataset (SQuAD), WikiQA, CNN, Quora Question Pairs, Narrative QA, TrecQA, Children’s Book Test, TriviaQA, NewsQA, YahooCQA.

F1 score and Exact Match (EM) are popular performance measures. EM measures the percentage of predictions that match any one of the ground truth answers exactly. The human performance is known to be 82.304 for EM and 91.221 for F1 score.
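
The two measures can be sketched as below. Two assumptions go beyond the text: F1 is computed over the tokens shared between a prediction and a ground truth answer, and normalization is limited to lowercasing and whitespace splitting. The function names and example strings are hypothetical.

```python
from collections import Counter


def exact_match(prediction, truths):
    """1.0 if the prediction equals any ground truth answer exactly, else 0.0."""
    normalized = prediction.strip().lower()
    return float(any(normalized == truth.strip().lower() for truth in truths))


def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, truths):
    """Score against the most favorable ground truth answer."""
    return max(token_f1(prediction, truth) for truth in truths)


truths = ["Denver Broncos", "the Denver Broncos"]
print(exact_match("Denver Broncos", truths))
print(best_f1("the Broncos", truths))
```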

Benchmark: SQuAD1.1
SQuAD consists of questions posed by crowd workers on a set of Wikipedia articles. The answer to every question is a segment of text from the corresponding article. SQuAD1.1 contains 107,785 question-answer pairs on 536 articles [rajpurkar2016squad]. [rajpurkar2016squad] collected question-answer pairs by crowdsourcing, using curated passages from the top 10,000 articles of English Wikipedia as ranked by Project Nayuki’s Wikipedia internal PageRanks. In addition, the authors collected additional answers to the questions that already had crowdsourced answers.

Speech recognition is the task of recognizing speech within audio and converting it into the corresponding text. The first part, recognizing speech within audio, is performed by an acoustic model, and the second part, converting recognized speech into the corresponding text, is done by a language model [jelinek1976continuous]. The traditional approach is to use Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) for acoustic modeling in speech recognition. Artificial neural networks (ANNs) have been applied to speech recognition since the 1980s, and they strengthened the traditional approach from the end of the 20th century [hochreiter1997long]. Recently, DL models such as CNNs and RNNs improved the performance of acoustic models and language models, respectively. More recently, end-to-end automatic speech recognition based on CTC (Connectionist Temporal Classification) [graves2006connectionist, graves2014towards] has been growing in popularity. Baidu’s DeepSpeech2 [amodei2016deep] and Google’s LAS (Listen, Attend and Spell) [chan2016listen] are examples. Speech recognition also requires large-scale, high-quality datasets in order to improve performance.

One simple way to measure the performance of speech recognition is to compare the body of text read by a speaker with the transcription written by a listener. WER (Word Error Rate) is a popular metric of speech recognition performance. Measuring performance is difficult because the recognized word sequence can have a different length from the reference word sequence. WER is therefore based on the Levenshtein distance at the word level, and dynamic string alignment is used to cope with the difference in word sequence lengths. WER is computed by dividing the sum of the number of substitutions, deletions, and insertions by the number of words in the reference sequence. The corresponding accuracy can be calculated by subtracting WER from 1. Example benchmarks are Switchboard, LibriSpeech, TIMIT, and WSJ.
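
The WER calculation can be sketched with a word-level Levenshtein distance computed by dynamic programming. The function name and the example sentences are hypothetical, and the sketch assumes whitespace tokenization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, WER can exceed 1 when the hypothesis is much longer than the reference.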

Benchmark: ASR SWB Hub500
The Automatic Speech Recognition Switchboard Hub 500 (ASR SWB Hub500) dataset contains 240 hours of English telephone conversations collected by the Linguistic Data Consortium (LDC) [ld20022000] (https://catalog.ldc.upenn.edu/LDC2002S09).

7 Model Analysis

7.1 Computations per network pass

When a deep learning paper from our source data does not report the number of gigaflops or multiply-adds, but does report network size, we perform a mapping between the two. This is done using a prediction line from regressing network size on gigaflops, as shown in figure 7, on models where both are known.
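
One way to read this mapping is as a least-squares line fitted on models where both quantities are known and then used to predict the missing one. The sketch below fits gigaflops as a function of parameter count in log10 space. The log-log form, the function names, and the data points are all assumptions for illustration; the numbers are synthetic and do not come from the paper's dataset.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


# Synthetic (parameters in millions, gigaflops per pass) pairs, not real models.
known = [(5.0, 0.5), (25.0, 4.0), (60.0, 11.0), (140.0, 20.0)]
intercept, slope = fit_line([math.log10(p) for p, _ in known],
                            [math.log10(g) for _, g in known])


def predict_gigaflops(params_millions: float) -> float:
    """Predicted gigaflops for a model that reports only its parameter count."""
    return 10 ** (intercept + slope * math.log10(params_millions))


print(predict_gigaflops(100.0))
```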

Imagenet
                                        Error rate      log10(Error rate)
                                        (1)             (2)
Constant                                0.31∗∗∗         −0.51∗∗∗
                                        (0.01)          (0.02)
log10(Computation per Network Update)   −0.06∗∗∗        −0.11∗∗∗
                                        (0.01)          (0.02)
Observations                            51              51
R2                                      0.36            0.43
Adjusted R2                             0.35            0.41
Residual Std. Error (df = 49)           0.07            0.10
F Statistic (df = 1; 49)                27.43∗∗∗        36.31∗∗∗
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 1: Comparison of (1) exponential and (2) polynomial functional forms for Imagenet analysis

Figure 8 shows a comparison between the conditional-mean regression shown in the paper to a quantile regression (10%) which better approximates the best performance possible for any level of computational burden. Figure 9 presents various polynomial functional forms and the explanatory power (R2) that results from each.

Table 1 reports the Imagenet estimations for (1) an exponential functional form, and (2) the polynomial form. Table 2 adds a time trend to the analysis as a proxy for algorithmic progress. Adding that trend does not reduce the estimated dependence on computation.
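
To give a sense of scale for the polynomial form, the sketch below takes the slope of −0.11 reported in column (2) of Table 1 and asks how much more computation the fitted line implies for a given reduction in error rate. This is a back-of-the-envelope reading of the point estimate that ignores its standard error; the function name is hypothetical.

```python
def compute_multiplier(error_ratio: float, slope: float = -0.11) -> float:
    """Factor by which computation must grow to scale the error rate by error_ratio.

    Derived from log10(error) = constant + slope * log10(computation):
    error_new / error_old = (C_new / C_old) ** slope.
    """
    return error_ratio ** (1.0 / slope)


# Halving the error rate under the fitted slope requires roughly 545 times the computation.
print(compute_multiplier(0.5))
```

The constant term drops out of the calculation, which is why only the slope is needed.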

Imagenet
                                        log10(Error rate)         log10(Error rate)
                                        (1)                       (2)
Constant                                −0.51∗∗∗                  −0.24∗∗∗
                                        (0.02)                    (0.08)
log10(Computation per Network Update)   −0.11∗∗∗                  −0.11∗∗∗
                                        (0.02)                    (0.02)
Year                                                              −0.04∗∗∗
                                                                  (0.01)
Observations                            51                        51
R2                                      0.43                      0.55
Adjusted R2                             0.41                      0.53
Residual Std. Error                     0.10 (df = 49)            0.09 (df = 48)
F Statistic                             36.31∗∗∗ (df = 1; 49)     28.91∗∗∗ (df = 2; 48)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 2: Regressions Imagenet analysis augmented with year time trend to proxy for algorithmic improvement

7.2 Hardware Burden

The analyses in figure 3 are repeated in figure 10, but with an exponential dependence functional form. Tables 3 and 4 show the corresponding regressions.

1 / Error rate
                          Imagenet                MS COCO                 SQuAD 1.1               CoNLL 2003              WMT 2014 (EN-FR)
                          log10(TOP1)             log10(BOXAP)            log10(F1score)          log10(F1score)          log10(BLEU)
                          (1)                     (2)                     (3)                     (4)                     (5)
Constant                  0.10                    −0.002                  0.35                    −0.96∗∗∗                0.03
                          (0.26)                  (0.06)                  (0.22)                  (0.05)                  (0.07)
log10(Hardware Burden)    −0.07∗∗                 −0.02∗∗∗                −0.13∗∗∗                −0.02∗∗                 −0.02∗∗∗
                          (0.02)                  (0.01)                  (0.02)                  (0.01)                  (0.01)
Observations              13                      33                      16                      13                      13
R2                        0.45                    0.29                    0.68                    0.42                    0.52
Adjusted R2               0.40                    0.27                    0.66                    0.37                    0.47
Residual Std. Error       0.11 (df = 11)          0.02 (df = 31)          0.11 (df = 14)          0.04 (df = 11)          0.02 (df = 11)
F Statistic               9.06∗∗ (df = 1; 11)     12.66∗∗∗ (df = 1; 31)   30.00∗∗∗ (df = 1; 14)   7.90∗∗ (df = 1; 11)     11.74∗∗∗ (df = 1; 11)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 3: Hardware burden estimates for the polynomial model

Error rate
                          Imagenet                MS COCO                 SQuAD 1.1               CoNLL 2003              WMT 2014 (EN-FR)
                          TOP1                    BOXAP                   F1score                 F1score                 BLEU
                          (1)                     (2)                     (3)                     (4)                     (5)
Constant                  −4.15                   0.84∗∗∗                 −19.80∗∗∗               8.59∗∗∗                 0.76∗∗
                          (2.91)                  (0.25)                  (4.15)                  (1.59)                  (0.28)
log10(Hardware Burden)    0.87∗∗∗                 0.09∗∗∗                 2.99∗∗∗                 0.50∗∗                  0.09∗∗∗
                          (0.27)                  (0.02)                  (0.45)                  (0.18)                  (0.03)
Observations              13                      33                      16                      13                      13
R2                        0.48                    0.27                    0.76                    0.41                    0.49
Adjusted R2               0.44                    0.25                    0.74                    0.36                    0.44
Residual Std. Error       1.25 (df = 11)          0.08 (df = 31)          2.08 (df = 14)          1.21 (df = 11)          0.08 (df = 11)
F Statistic               10.33∗∗∗ (df = 1; 11)   11.63∗∗∗ (df = 1; 31)   43.66∗∗∗ (df = 1; 14)   7.72∗∗ (df = 1; 11)     10.41∗∗∗ (df = 1; 11)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 4: Hardware burden estimates for the exponential model