1. Introduction
Deep learning (DL) has emerged as a primary driver of recent artificial intelligence (AI) breakthroughs. As DL-enabled products grow, it becomes more important to satisfy the future hardware requirements of DL model training. We aim to develop systems that meet these future DL requirements.
DL accuracy advances can come from two different drivers. First, DL researchers study model architecture changes to better fit data sets and improve accuracy. Model changes tend to be highly nontrivial—often requiring problem reframing—and can substantially change their computational structure. As a result, it is very difficult to predict the model structures that will be important for future DL applications.
However, DL advances can also come from growing data sets, model size, and computation, an approach that has received more attention in recent research. The DL community commonly accepts that model accuracy improves as training dataset size grows (e.g., (Banko and Brill, 2001; Amodei et al., 2016; Sun et al., 2017)). Further, Hestness et al. characterize accuracy and model size growth, showing they are particular power-law functions of dataset size (Hestness et al., 2017).
This paper leverages the prior work to project the data and model size scaling required to advance DL accuracy beyond human-level, to frontier targets defined by machine learning experts. We collect these accuracy targets for five DL domains: word and character language modeling, machine translation, speech recognition, and image classification.
These domains will require substantial increases in dataset and model size to achieve target accuracy. Datasets will need to grow in size – larger than the datasets used to train current state-of-the-art (SOTA) models. Models must also grow in parameter count by – larger. Based on these desired targets, simple estimates suggest that training time would take decades to centuries on current systems.
Not shying away from the challenge, this paper characterizes and projects the growth in computational requirements to train these target applications. Although some DL applications are computationally well-understood, our broader analysis reveals surprisingly predictable compute and memory scaling across a range of very different DL architectures, including deep convolutional networks (CNNs), recurrent sequence-to-sequence models, and recurrent encoder-decoder models with attention.
Our characterization reveals an important segmentation of DL training challenges. While prior works have focused heavily on CNNs, their compute requirements differ significantly from those of recurrent neural networks (RNNs), which are likely to demand far more compute and memory resources. Image processing applications with deep CNNs require relatively small growth in dataset and model size, and they show more potential to leverage emerging compute accelerators with high compute-to-memory throughput ratios. Even small batch sizes can expose sufficient operational intensity for high compute throughput.
On the other hand, RNNs, especially in language domains, will require upwards of more training time to achieve target accuracy. They have moderate operational intensities, and very large memory footprints that exceed current accelerator memory capacity by –. These characteristics make it difficult to efficiently parallelize large-scale training, as we demonstrate in a language modeling case study.
We recommend the hardware community place more focus on supporting RNNs. Systems for RNN training could be substantially different than emerging hardware. For example, a possible approach to better support large-scale RNN training parallelism would be to significantly increase accelerator memory capacity. We could also better leverage growing accelerator compute throughput by building larger on-chip caches to avoid excessive memory data streaming for large matrix multiply operations. These approaches run counter to emerging accelerator designs.
2. Deep Learning Applications
The deep learning research community has developed a large set of important DL applications. The initial release of MLPerf (MLPerf, 2018) identifies seven domains critical for industry DL training: image classification, object detection, recommendations, reinforcement learning in games, sentiment analysis (language understanding), speech recognition, and translation. This paper focuses on some MLPerf applications and a similar breadth: image classification, language modeling, speech recognition, and translation. This section describes the general algorithmic structure of DL applications and the particular applications for which we study scaling behaviors.
2.1. Compute Graphs of DL Applications
Deep learning applications are usually structured algorithmically as compute graphs. These compute graphs include nodes, or "ops", that perform a mathematical computation (e.g., matrix-vector multiplication, convolution, or pointwise operations) on input data. Boxes in the network diagrams below represent ops or groups of ops. Data is passed between ops using "tensors" (like data arrays) that encode the data’s structure and dependencies between ops.
To project future hardware needs, we define three properties of the compute graphs that allow us to characterize their compute and memory requirements. In practice, when executing a compute graph on hardware, numerous hardware factors affect performance and are difficult or impossible to model (e.g., memory/cache hierarchy, addressing modes, kernel optimization). Rather than trying to model each of these factors for all kinds of hardware, we choose to define algorithmic compute requirements, which are independent from particular choices of hardware:
Algorithmic FLOPs are the number of FLOPs required to perform the mathematical calculation of a compute graph op (note: either floating point or integer arithmetic). For example, algorithmic FLOPs include the multiplies and accumulations in a matrix multiply op. Algorithmic FLOPs do not include other instructions executed by hardware to perform the computation, such as address, loop invariant, or branch target calculations. Hardware instructions that are not counted in algorithmic FLOPs are likely to account for at most constant overhead per algorithmic FLOP.
Unlike more general applications, DL compute graphs also perform backward propagation (“backprop”) of gradients from the model’s predictions. Ops in a DL compute graph are differentiable, so that the gradient of each input can be calculated when given gradients of the outputs. After backprop, accumulated gradients are used to update weights and improve the model’s predictions. A compute graph’s backprop has ops highly analogous to those of the forward graph traversal, but it splits gradients to flow to model weights and to activations. The backprop for matrix operations usually has twice the algorithmic FLOPs of the forward traversal.
Analogously, we define an op’s algorithmic bytes accessed as the total memory bytes that an op must read as inputs and write as outputs to perform the operation. Algorithmic op bytes do not include intermediate data or other memory that might be used to perform the operations, and they ignore hardware effects such as caching.
We also define algorithmic memory footprint as the minimum number of memory bytes that must be allocated to execute a training step. More precisely, it is the minimum—over all correct topological compute graph traversals—of the maximum memory capacity required to accommodate all active tensors during any step of the traversal. Active tensors are those produced by an op in a previous traversal step, but not yet consumed by each of its downstream ops.
Finally, algorithmic IO counts the amount of data accessed for input to and output from a model. Training data is often stored on disks, read from the disk, and placed into the model’s input memory allocations. Algorithmic IO is proportional to the batch size, but stays fixed as model size and training step compute requirements grow. We do not investigate algorithmic IO further in this work, because we expect IO will grow very slowly relative to compute.
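As an illustration of the definitions above, the following minimal sketch counts algorithmic FLOPs and algorithmic bytes accessed for a single matrix multiply op. The function names, the 4-byte element size, and the example dimensions are assumptions made for illustration; the factor of three for a training step follows from the statement that backprop for matrix operations usually costs about twice the forward FLOPs.

```python
def matmul_forward_flops(m: int, k: int, n: int) -> int:
    """Algorithmic FLOPs of an (m x k) @ (k x n) matrix multiply.

    Each of the m*n outputs needs k multiplies and k accumulations.
    """
    return 2 * m * k * n


def matmul_bytes_accessed(m: int, k: int, n: int, bytes_per_element: int = 4) -> int:
    """Algorithmic bytes: read both inputs once and write the output once.

    Caching and intermediate data are deliberately ignored.
    """
    return bytes_per_element * (m * k + k * n + m * n)


def matmul_training_flops(m: int, k: int, n: int) -> int:
    """Forward FLOPs plus a backward pass assumed to cost ~2x the forward pass."""
    return 3 * matmul_forward_flops(m, k, n)


if __name__ == "__main__":
    # Hypothetical layer: subbatch of 32 samples, 1024 x 1024 weight matrix.
    flops = matmul_forward_flops(32, 1024, 1024)
    nbytes = matmul_bytes_accessed(32, 1024, 1024)
    print(f"forward FLOPs:      {flops}")
    print(f"bytes accessed:     {nbytes}")
    print(f"FLOPs per byte:     {flops / nbytes:.2f}")
    print(f"training-step FLOPs: {matmul_training_flops(32, 1024, 1024)}")
```

These counts depend only on the op's shape, which is what makes them independent of any particular hardware.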
2.2. Image Classification
ResNets are recognized as high-accuracy convolutional networks (CNNs) for image classification and processing (He et al., 2016). The ResNet bottleneck architecture, diagrammed in Figure 1, shows the generic structure of residual groups that allow the model to grow in depth. Each residual group contains blocks of layers, as well as skip connections that permit activations to bypass the blocks. Blocks contain convolutions (with trainable weights designated in blue), batch normalization, and nonlinearities. The final layer is a fully connected (FC) projection that maps its input to the object classes. These networks tend to be compute-intensive due to their depth (50+ convolutions with 64–2048 filters each). However, as we show later, the following recurrent networks can also require significant compute.
2.3. Language Modeling
Language models (LMs) predict the next word or character given a previous sequence of input text. Most computation in these RNNs occurs in recurrent layer matrix multiplies.
Word Language Models:
LSTM-based RNNs are the SOTA architecture for word LMs (Jozefowicz et al., 2016). Figure 2 shows the generic word LM LSTM architecture: an embedding layer followed by recurrent layers that feed an FC output layer. The embedding layer is a table lookup operation with no algorithmic FLOPs, but it accounts for a large portion of overall weight memory footprint. The LSTM layers are moderately compute-intensive due to their many matrix multiplications in separate recurrent steps. Finally, the FC output layer is compute-intensive and responsible for a large portion of activation memory footprint.
Character Language Models:
Recurrent highway networks (RHNs) have been shown to provide low perplexities for character LMs (Zilly et al., 2017). This character LM architecture, as depicted in Figure 3, is a sequence of an embedding layer, followed by a deep RHN layer, followed by an output layer. Unlike in word LMs, the embedding and output layers account for a small portion of run time and memory footprint, as the vocabulary size (number of characters) is significantly smaller. Each RHN layer contains a sequence of feedforward sublayers, and the last sublayer output feeds into the next timestep. These layers tend to be compute-intensive and responsible for a large portion of activation memory footprint, especially given their many recurrent steps (100–300 per sample).
2.4. Neural Machine Translation (NMT)
The SOTA models for neural machine translation use encoder-decoder architectures with an attention mechanism to identify important recent time steps (Luong et al., 2015). Figure 4 diagrams such an architecture, which uses a recurrent bidirectional LSTM in the encoder and a standard LSTM for the decoder cell. The encoder and decoder feed the attention context and selection layers that choose the best decoder outputs to predict translated words. Most compute and memory access comes from the recurrent cells in this model.
2.5. Speech Recognition
We investigate the hybrid attention speech model diagrammed in Figure 5 (Battenberg et al., 2017). Like NMT, this model is an encoder-decoder model with attention, though the encoder is a multilayer bidirectional LSTM with intermediate pooling layers. Most computation occurs in these encoder layers. This model contains convolutions in its attention context layer, but they are very small relative to the recurrent portions of the network.
3. Application Accuracy Scaling
The DL community has progressively increased dataset and model sizes, and the future system demands of DL training will continue to grow. We would like to project the computational requirements of future DL applications based on the way we expect applications to grow. Recent prior work allows us to project application-level characteristics using analytical models that show the relationships between DL dataset size, model size, and model accuracy. We collect desirable accuracy targets and use the analytical models to predict the dataset and model sizes required to achieve the target accuracy. Compared to current SOTA, DL domains would like – as much data and – larger models!
3.1. Motivation to Grow Data and Models
The DL community has continually grown datasets, with open-source sets larger than 10s of GBs, to increase modeling task difficulty and model accuracy. Industry is already using significantly larger datasets. Google Research recently showed the importance of training image classifiers with more images than prior datasets (Sun et al., 2017). Baidu’s prior work uses speech recognition datasets of multiple terabytes (Hannun et al., 2014). Google has also stated they want to train language models on a trillion word dataset (Shazeer et al., 2017). Such datasets of interest to DL industry are upwards of 5TB, or about 50+ times larger than current publicly available datasets.
As datasets grow, DL models must also grow to fit the larger datasets, and industry is aiming for very large models. Google states they would like to train a trillion parameter model on a trillion word dataset (Shazeer et al., 2017). That same work proposes a method to compose numerous small models to reach the trillion parameter size. Our projections indicate models will easily reach into the 100s of billions of parameters. Such models would be – larger than DL models described in current research.
3.2. Accuracy Scaling with Training Data Growth
Recent work indicates why industry wants to increase dataset and model sizes. Hestness et al. show that on real datasets, DL model accuracy improves predictably with training dataset size (Hestness et al., 2017). They further show that the model size required to fit the data grows predictably with data size. Industry can use these empirical models to estimate the amount of training data and model sizes required to achieve particular accuracy.
Figure 6 shows a sketch of a model’s learning curve, the reduction in prediction error as datasets grow (figure copied with author permission (Hestness et al., 2017)). The curve begins in the small data region, where models can only perform as well as "best" guessing for the output data distribution. The power-law region is where each new training sample offers information to help models improve predictions on previously unseen samples. Error declines predictably. Finally, for real applications, curves are likely to end in an irreducible region where models cannot further improve due to the stochastic nature of the data.
In particular, we project learning curves starting from the power-law region, where most existing large-scale data applications are currently. In this region, model generalization error scales roughly as a power law:
$\varepsilon(m) \approx \alpha m^{\beta_g}$ (1)
Here, $m$ is the number of samples in the training dataset, and $\alpha$ and $\beta_g$ are constants that depend on the structure of the modeling task, possibly including the data distribution and model architecture. $\alpha$ represents aspects of the input data space and the DL model architecture. $\beta_g$ is the power-law exponent and indicates the difficulty for models to learn more information from each additional training example. A more negative $\beta_g$ means models can learn quickly from smaller datasets. Table 1 lists estimates of $\alpha$ and $\beta_g$ for the different modeling tasks as found in the prior work (Hestness et al., 2017).
Domain (model) | Current SOTA | Desired SOTA | Current data size (samples) | Current data size (GB) | Learning curve | Model size | Projected scale (data) | Projected scale (model)
Word LMs (LSTM) | nat/word | (Shannon, 1951) | word | | | | |
Character LMs (RHN) | bit/char | (Shannon, 1951) | char. | | | | |
NMT (enc/dec+attn) | WPER | | WP | | | | |
Speech Recogn. (enc/dec+attn) | CER | (Xiong et al., 2017) | char. | | | | |
Image Classification (ResNet) | Top1 | (Russakovsky et al., 2015) | image | | | | |
To extend the work of Hestness et al. to predict the required data and model size from these models, we need to define accuracy targets that would be desirable for DL-enabled products. We collect feedback from DL experts and refer to prior studies that estimate the irreducible error to select desirable accuracy targets for each domain. For example, word and character LM desired SOTA are near estimated lower bounds on the entropy of English text (Shannon, 1951). The “Desired SOTA” column of Table 1 reflects these projections.
Finally, given these analytical learning curves and target error rates, we solve the analytical models for the required data size to realize the target. The “Projected Scale” columns in Table 1 show the relative data size projections. Desired SOTA values are to better than current SOTA values. However, the amount of data required to achieve these values ranges from more for speech recognition to more for character LMs. Language domains require the most data due to their poorer power-law exponents.
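Solving a power-law learning curve for the required data size can be sketched as follows. This is a minimal illustration that assumes error follows error = alpha * m ** beta_g with a negative exponent beta_g; the function names and the exponent values in the example are hypothetical and are not the fitted constants from Table 1.

```python
def required_samples(target_error: float, alpha: float, beta_g: float) -> float:
    """Invert error = alpha * m ** beta_g for the sample count m."""
    return (target_error / alpha) ** (1.0 / beta_g)


def data_scale(current_error: float, target_error: float, beta_g: float) -> float:
    """Relative dataset growth needed to move from current to target error.

    alpha cancels in the ratio m_target / m_current, so only the exponent matters.
    """
    return (target_error / current_error) ** (1.0 / beta_g)


if __name__ == "__main__":
    # Hypothetical exponents: a steeper curve (-0.5) versus a shallower one (-0.1).
    for beta_g in (-0.5, -0.1):
        scale = data_scale(current_error=1.0, target_error=0.5, beta_g=beta_g)
        print(f"beta_g = {beta_g}: halving error needs ~{scale:.0f}x more data")
```

Because the exponent appears as 1 / beta_g, a shallow learning curve turns a modest error reduction into a very large data requirement, which matches the observation that domains with poorer exponents need the most data.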
3.3. Model Size Scaling with Training Data Growth
As datasets grow in size, models must also grow in size to represent the data. Hestness et al. also collect and characterize model sizes required to fit varying training set sizes. Model parameters (roughly capacity) are expected to grow sublinearly in the training set size with the following form:
$p(m) \approx \sigma m^{\beta_p}$ (2)
Here, $m$ is the number of samples in the training set, and $\sigma$ and $\beta_p$ are constants that depend on the problem structure, possibly including data distribution and model architecture. Models should grow parameter count more slowly than the training set (i.e., $\beta_p < 1$), or we could just store the dataset rather than training a model. Recent prior work shows that deep neural network model capacity (the volume of concepts, or data, it can learn) grows according to a bound that depends on the parameter count and on a measure of the model’s depth (Harvey et al., 2017). Loosening this bound slightly, model size should grow at least with a square root of the dataset size (i.e., $\beta_p \geq 0.5$).
Table 1 shows empirically collected and for the DL domains (Hestness et al., 2017). Given the target data size determined in the last subsection, we project the model sizes required to fit the target dataset sizes. The model scale column shows the relative required increase in model size. For example, current SOTA word LMs use roughly parameters to fit roughly word datasets. Thus, to fit a larger dataset, a model would require ~ parameters (–GB, depending on weight precision).
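The model size projection can be sketched in the same way. The snippet below assumes parameter count follows params = sigma * m ** beta_p with a sublinear exponent, so a dataset that grows by some factor implies model growth of that factor raised to beta_p. The function names and example values are hypothetical, and the memory estimate only counts weights at a chosen precision.

```python
def model_scale(data_scale: float, beta_p: float) -> float:
    """Relative parameter growth implied by a relative dataset growth.

    With params = sigma * m ** beta_p, sigma cancels in the ratio.
    """
    return data_scale ** beta_p


def weights_gigabytes(params: float, bytes_per_weight: int) -> float:
    """Memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return params * bytes_per_weight / 1e9


if __name__ == "__main__":
    # Hypothetical projection: 100x more data with a model-size exponent of 0.7.
    growth = model_scale(data_scale=100.0, beta_p=0.7)
    print(f"model grows by ~{growth:.1f}x")

    # Hypothetical 1-billion-parameter model at 16-bit and 32-bit precision.
    for nbytes in (2, 4):
        print(f"{nbytes} bytes/weight: {weights_gigabytes(1e9, nbytes):.0f} GB")
```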
4. Characterizing Compute Requirements
Now that we have an idea of desirable data and model sizes, we turn our attention to characterizing the computational requirements to train these very large models. This section characterizes DL application compute FLOP, memory access, and memory footprint growth. Although the structure of DL applications is intricate, their training requirements scale mostly predictably. Compute and memory usage grow asymptotically linearly with model size and batch size. We provide these accessible first-order models of compute requirements not characterized in prior work.
4.1. Methodology
We estimate model training compute requirements by collecting statistics from training runs and assembling analytical models to project growth. We train with TensorFlow 1.5.0 (Abadi et al., 2015) running on NVIDIA GPUs and using a modified version of TFprof. TFprof annotates compute graph ops to calculate their algorithmic FLOPs and bytes, and collects run time as they execute. At the end of a training step (i.e., a compute graph traversal), TFprof returns this profile for all ops executed during the step, ensuring that we profile even fine details of an end-to-end training step. We also query TensorFlow’s memory allocators for the maximum amount of training step memory allocated (the memory footprint).
We collect profiles from 100–500 randomly chosen training steps to account for per-training-step differences in FLOPs and memory accessed for different models. For instance, character LMs, NMT, and speech models unroll their recurrent layers for the timesteps required for the longest batch sample. This unrolling results in variable computation and memory access in separate training steps, so we average the profiled results over the training steps.
The most complicated variable to control for is training batch size, the number of data parallel samples to observe in a single training step. Batch size can be set arbitrarily, but particular batch sizes result in best model accuracy depending on data set size (Smith and Le, 2017). For tested domains in this study, SOTA models have been trained using data parallelism across GPUs to increase batch size beyond the maximum memory capacity of a single GPU. It is likely that future DL training will also be constrained by per-compute-unit memory capacity, suggesting that ML researchers will choose per-compute-unit batch sizes (henceforth, “subbatch size”) that can provide near-peak utilization of compute unit resources. We profile with the smallest such subbatch size.
To grow models, we change hyperparameters that have the largest effect on the ability for the model to fit larger data sets as measured by generalization error. For ResNets, increasing depth and convolution channels, rather than filter sizes, improves accuracy the most, so we collect profiles for deeper and wider image classification networks. Most recurrent models have already grown to a depth such that increased depth results in no accuracy improvement. Instead, we increase the number of hidden weights per layer.
Finally, we aim to project forward the compute requirements for models as we scale up data set and model size. The analytical models of application characteristics below use first-order approximations to provide the community with a concise set of formulas for projections. However, we also use high-fidelity modeling to verify these results (Appendix A).
4.2. Estimating Training Step Algorithmic FLOPs
For DL models, the number of FLOPs per training step grows roughly linearly in the number of parameters of the model, suggesting that each model parameter is used roughly the same number of times in a single training step. We demonstrate this observation analytically for word LMs next.
Again, let $p$ be the number of model parameters for an LSTM word LM, and let $p_e$, $p_r$, and $p_o$ be the parameters in the embedding, recurrent, and output layers, respectively. We approximate the total model parameters as:
$p = p_e + p_r + p_o \approx vh + 8h^2 l + hv = 8h^2 l + 2hv$
Here, $v$ is the LM’s vocabulary size, $h$ is the number of hidden weights per recurrent layer, and $l$ is the number of layers.
Next, we show the roughly linear relationship between parameters and FLOPs per step. Since backward propagation adds ~2× the number of FLOPs, regardless of the model, we consider only the forward propagation. For this first-order model, we assume that most compute FLOPs come from the subset of ops that perform vector or matrix operations. We estimate forward propagation algorithmic FLOPs:
$c_{\mathrm{fwd}} \approx 2q(8h^2 l + hv) = 16qh^2 l + 2qhv$
Here, $q$ is the sequence length for the training step (we ignore subbatch size to normalize per training sample). These models indicate that $c_{\mathrm{fwd}} / p \to 2q$ as $h$ grows, a constant. Thus, we expect that for word LMs and similarly structured recurrent models, compute FLOPs should grow roughly linearly with the number of model parameters.
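The linear relationship can be checked numerically with a small sketch. It rests on assumptions that are stated here rather than taken from the paper's equations: each LSTM layer holds roughly 8 * h**2 weights (four gates, each with an input and a recurrent matrix, biases ignored), the embedding and output layers each hold v * h weights, the embedding lookup costs no FLOPs, and every other weight contributes one multiply and one accumulate per timestep. The vocabulary size, layer count, and sequence length in the example are hypothetical.

```python
def word_lm_params(v: int, h: int, l: int) -> int:
    """Approximate parameter count of an LSTM word LM (biases ignored)."""
    embedding = v * h          # table lookup weights
    recurrent = 8 * h * h * l  # four gates x (input + recurrent) matrices per layer
    output = h * v             # fully connected projection to the vocabulary
    return embedding + recurrent + output


def word_lm_forward_flops(v: int, h: int, l: int, q: int) -> int:
    """Approximate forward algorithmic FLOPs per training sample.

    Two FLOPs per recurrent or output weight per timestep; the embedding
    lookup contributes no FLOPs.
    """
    return 2 * q * (8 * h * h * l + h * v)


if __name__ == "__main__":
    v, l, q = 10_000, 2, 80  # hypothetical vocabulary, depth, sequence length
    for h in (256, 1024, 4096, 16384, 65536):
        p = word_lm_params(v, h, l)
        c = word_lm_forward_flops(v, h, l, q)
        print(f"h={h:6d}  params={p:.3e}  forward FLOPs/param={c / p:.1f}")
    print(f"asymptote 2*q = {2 * q}")
```

For small hidden sizes the embedding table holds a large share of the parameters but costs no FLOPs, so FLOPs per parameter start below 2q and rise toward it as the recurrent layers come to dominate.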
We confirm this linear relationship between model parameters and algorithmic FLOPs per training step empirically across our set of applications. Figure 7 plots the TFprof-profiled growth in algorithmic FLOPs (note: batched training roughly multiplies these values by the subbatch size). Each domain’s algorithmic FLOPs grow linearly with model size above – parameters (moderately large models). FLOPs per parameter ranges from 149 for NMT to 1111 for ResNets. For recurrent networks, as sequence length grows, the FLOPs/parameter also grows, approaching ResNet requirements. Character LMs and speech networks unroll layers for 150 and 300 timesteps, respectively.
Table 2 records the asymptotic hardware requirements for each DL domain as models grow. Given the clear linear relationships between FLOPs and parameter counts, we use the following linear trend to project the compute FLOPs per training sample ($c_t$) for models with $p$ parameters:
$c_t \approx \gamma p$
Here, $\gamma$ is a constant that depends on the input data shape, recurrent sequence length, and model architecture.
Domain (model) | Alg. compute (FLOPs/param) | Alg. memory access (bytes/param) | Alg. operational intensity (FLOPs/byte) | Minimal mem. footprint (bytes/param)
Word LMs (LSTM) | | | |
Character LMs (RHN) | | | |
NMT (enc/dec+attn) | | | |
Speech Recogn. (enc/dec+attn) | | | |
Image Classification (ResNet) | | | |
4.3. Estimating Algorithmic Memory Bytes Accessed
Like algorithmic FLOP counts, algorithmic memory accesses also scale linearly with model parameters across the DL applications. However, since a significant portion of training step memory accesses are from reading or updating model weights, which do not depend on the subbatch size, memory access counts depend, to first order, on both model size and subbatch size. This section describes an analytical model and verifies that it fits empirical results.
A training step must access two types of tensors: the DL model and the activation tensors that flow through the model. Hardware loads from and stores to the model parameters roughly a constant number of times each for the forward and backward propagation, and to update the weights at the end of a training step. Similarly, activation memory, with dimensions proportional to the batch size and model dimensions, is accessed roughly a constant number of times. As above, denote $p$ as the model parameter count and $b$ as the subbatch size. Then total memory accesses for a training step ($a_t$) take this first-order form (the supplemental material shows the detailed calculation for word LMs):
$a_t \approx \lambda p + \mu b \sqrt{p}$
Here, $\lambda$ and $\mu$ are constants that depend on input data shape, recurrent sequence length, and model architecture. The $\sqrt{p}$ term approximates the model’s hidden layer weight or channel counts, one dimension of the compute graph’s large linear algebra ops. We find $\sqrt{p}$ is a good approximation for all domains, with a small caveat: for models with many parameters to embed input data (e.g., the larger vocabularies of word LMs and NMT), $\sqrt{p}$ overestimates hidden dimension until the hidden dimension is large relative to the embedding dimension. Figure 8 curves show nearly linear asymptotes.
4.4. Estimating Training Operational Intensity
Conveniently, although model training steps are composed of many ops, their algorithmic FLOPs and memory access characteristics are strikingly similar to those of a single large linear algebra operation. As a result, operational intensity (the ratio of FLOPs to memory bytes accessed) takes a form familiar in linear algebra kernel optimization.
Algorithmic operational intensity for each DL model is listed in Table 2. A model’s ops that contribute the most to FLOPs and memory accesses are often matrix operations with dimensions related to the hidden dimension (~$\sqrt{p}$ for a model with $p$ parameters) and the subbatch size ($b$). The operational intensity of a matrix multiplication with dimensions $(b \times \sqrt{p})(\sqrt{p} \times \sqrt{p})$ is proportional to $b\sqrt{p} / (2b + \sqrt{p})$, the same form as the end-to-end training step operational intensities listed in Table 2.
As a result of its form, operational intensity will approach some fixed upper bound unless both a model’s hidden dimension and the subbatch size grow. When either model size or subbatch size is fixed, it will asymptotically approach the ratio of the slopes of algorithmic FLOPs and bytes growth. Figure 9 shows the leveling of operational intensity for fixed subbatch size as model size grows for each application.
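The saturation behavior can be illustrated with a small sketch of first-order models. It assumes, for illustration only, that FLOPs per step are linear in both parameter count and subbatch size, and that bytes accessed per step are the sum of a weight term linear in parameters and an activation term proportional to subbatch size times the hidden dimension, approximated by the square root of the parameter count. The constants gamma, lam, and mu below are hypothetical placeholders, not the fitted values from Table 2.

```python
import math


def flops_per_step(p: float, b: float, gamma: float) -> float:
    """Algorithmic FLOPs per training step: linear in parameters and subbatch size."""
    return gamma * p * b


def bytes_per_step(p: float, b: float, lam: float, mu: float) -> float:
    """Algorithmic bytes per step: weight accesses plus activation accesses."""
    return lam * p + mu * b * math.sqrt(p)


def operational_intensity(p: float, b: float, gamma: float, lam: float, mu: float) -> float:
    """FLOPs per byte for a full training step."""
    return flops_per_step(p, b, gamma) / bytes_per_step(p, b, lam, mu)


if __name__ == "__main__":
    gamma, lam, mu = 6.0, 2.0, 4.0  # hypothetical constants
    b = 32
    for p in (1e6, 1e8, 1e10, 1e12):
        oi = operational_intensity(p, b, gamma, lam, mu)
        print(f"p={p:.0e}  intensity={oi:.2f} FLOPs/byte")
    # With the subbatch size fixed, intensity levels off at gamma * b / lam.
    print(f"fixed-subbatch limit: {gamma * b / lam:.2f} FLOPs/byte")
```

Under these assumptions, holding the subbatch size fixed bounds intensity by gamma * b / lam, and holding the model size fixed bounds it by gamma * sqrt(p) / mu, so intensity only keeps rising when both grow.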
4.5. Estimating Training Step Memory Footprint
Memory footprint is the measure of the memory capacity required to execute an algorithm. TensorFlow’s memory allocator provides a footprint estimate, but we also estimate minimal memory footprint by tracking it through a topological traversal of the compute graph. For each op, DL frameworks can allocate memory for the op’s output tensors, and after executing the op, free the op’s input tensors if all the tensor’s consumer ops have executed.
Figure 10 plots the TensorFlow allocator memory footprints for each model and our topological estimates. These values agree up to the point that TensorFlow runs out of GPU memory capacity (80% of 12GB). At that point, the allocator starts swapping GPU memory to the CPU’s memory space, where it no longer counts the memory as part of the footprint. When TensorFlow does not swap memory, our models tend to slightly overestimate minimal memory footprint; TensorFlow optimizes to perform some ops on tensors in-place rather than allocating separate output tensors.
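The allocate-then-free bookkeeping described above can be sketched as follows. This is a simplified illustration, not the tool used in this work: the graph representation (a list of tuples), the function name, and the tensor sizes are assumptions, and the sketch evaluates a single given topological order rather than minimizing over all valid orders.

```python
from collections import Counter


def peak_footprint(ops, persistent=None):
    """Peak bytes of live tensors for one topological traversal.

    ops: list of (op_name, input_tensor_names, {output_tensor_name: bytes}),
         already in a valid topological order.
    persistent: {tensor_name: bytes} for tensors such as weights that are
         allocated up front and never freed.
    """
    persistent = dict(persistent or {})
    # Number of consumer ops that still need each tensor.
    remaining = Counter(t for _, inputs, _ in ops for t in inputs)
    live = dict(persistent)
    current = sum(live.values())
    peak = current
    for _, inputs, outputs in ops:
        # Allocate the op's outputs before it runs.
        for tensor, nbytes in outputs.items():
            live[tensor] = nbytes
            current += nbytes
        peak = max(peak, current)
        # After the op runs, free inputs whose consumers have all executed.
        for tensor in inputs:
            remaining[tensor] -= 1
            if remaining[tensor] == 0 and tensor not in persistent:
                current -= live.pop(tensor, 0)
    return peak


if __name__ == "__main__":
    # Hypothetical two-layer forward pass; sizes are in arbitrary byte units.
    weights = {"w1": 32, "w2": 16}
    graph = [
        ("input", [], {"x": 4}),
        ("matmul1", ["x", "w1"], {"h": 8}),
        ("relu", ["h"], {"a": 8}),
        ("matmul2", ["a", "w2"], {"y": 2}),
    ]
    print(peak_footprint(graph, weights))  # prints 64
```

In the example, the peak occurs while the relu op holds both its input and output alongside the persistent weights.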
Minimal memory footprint grows asymptotically linearly with model size for larger models. This trend is expected given that the model’s parameters dominate memory and are persistent, while activation tensors can be freed and reused by the framework. We model minimal footprint ($f_t$) linearly in the model parameter count $p$:
$f_t \approx \delta p$
Here, $\delta$ is a constant dependent on the input data shape, recurrent sequence length, and model architecture. This first-order approximation fits well for parameter counts above ~, but for our projections in the next section, we opt to use more accurate topological traversal estimates.
Language model footprint growth is similar across the domains; character LM footprint growth slows significantly for large models (not depicted in the figure). Speech and image domains show faster memory footprint growth with model size. However, as the next section shows, speech and image domains need much smaller networks to achieve accuracy targets, so their footprint requirements are modest.
5. Projecting the Accuracy Frontier
Here, we project the compute resources required to train models to target accuracy levels. We also project a hypothetical Roofline estimate of model training time and discuss implications of the resource requirements. Improving speech recognition and image classification should be feasible with existing parallelism strategies. Language domains, however, are likely to require more compute, suggesting the need for both improved algorithmic and parallelism strategies.
5.1. Projecting Target Compute Requirements
Using the analytical models from the last two sections, we project compute resource requirements to reach target accuracy levels. Table 3 lists the projected data and model size, our choice of subbatch sizes (Section 5.2.1), and projected training requirements.
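Domain (model) | Projected data size | Projected model params. | Subbatch size | Training step TFLOPs/step | Training step mem. acc. (TB/step) | Training step min. mem. footprint (GB) | Accel. time per step (secs) | Accel. time per epoch (days)
Word LMs (LSTM) | word | | | | | | |
Character LMs (RHN) | char. | | | | | | |
NMT (enc/dec+attn) | WP | | | | | | |
Speech Recogn. (enc/dec+attn) | char. | | | | | | |
Image Classification (ResNet) | image | | | | | | |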
We expect that image processing networks will require the least growth in algorithmic FLOPs and memory access per training step to achieve aggressive accuracy targets. Their required model growth is small relative to recurrent networks, and their convolutional layers offer high operational intensity to utilize compute resources with smaller subbatch sizes. The clearest contrast is with speech recognition, which would require similar model size as image classification, but its larger subbatch size means more FLOPs and memory access per training step. These results suggest it may be easier to parallelize very large image network training by sharding full batches across many accelerators.
The projected compute requirements also illustrate the challenges of scaling language domains specifically, and recurrent networks in general. Reaching target accuracy on language and speech domains will require – more FLOPs and memory access per training step than image classification. In language domains, these increases are largely due to the model size growth required to fit larger data sets.
Finally, we note that all domains are likely to require significantly more memory capacity than available with current accelerators. Current GPUs and Google’s TPU v2 have 16 or 32GB of memory per accelerator chip (Dean and others, 2017). Running any of these models on such accelerators will require either model-level parallelism to split portions of the models across multiple accelerators’ memories, or migrating model parts into and out of accelerator memory, an expensive operation.
5.2. Projecting Run Time on Hardware
Next, we estimate hypothetical best-case run times for each of the target applications running on an accelerator. We configure a target accelerator, describe our process for choosing the training step subbatch size, and then estimate run time. The estimates use the Roofline model to predict the overall system throughput given the full-graph algorithmic FLOPs and memory accesses (Williams et al., 2009).
Component  Configuration 

Compute Throughput, 32-bit ()  TFLOP/s 
On-chip Cache  MB 
Memory Bandwidth ()  GB/s 
Memory Capacity (off-chip)  GB 
Inter-device Bandwidth  GB/s 
Table 4 shows the configuration for a target accelerator similar to NVIDIA’s V100 version 2. We assume maximum achievable throughput of 80% of peak FLOPs and 70% of peak memory bandwidth, consistent with existing hardware. The accelerator’s compute intensity inflection point between memory-bound and compute-bound operation (its Roofline “ridge point”) is FLOP/B, but given peak achievable throughput, rises to FLOP/B. We start by assuming that the accelerator has infinite memory capacity and is able to fit the memory footprint for a training step of any model.
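As a rough illustration of the ridge-point arithmetic, the sketch below computes the operational intensity at which the compute and memory limits meet, first at peak rates and then with the 80% and 70% achievable fractions assumed above. The peak values are hypothetical placeholders rather than the Table 4 configuration, and `ridge_point` is an illustrative helper name.

```python
def ridge_point(peak_flops_per_s, peak_bytes_per_s, flop_eff=1.0, mem_eff=1.0):
    """Operational intensity (FLOP/B) where compute and memory limits meet."""
    return (peak_flops_per_s * flop_eff) / (peak_bytes_per_s * mem_eff)


# Hypothetical accelerator (assumed values, NOT the Table 4 configuration).
peak_flops = 10e12  # 10 TFLOP/s
peak_bw = 500e9     # 500 GB/s

ideal = ridge_point(peak_flops, peak_bw)
# Achievable fractions taken from the text: 80% of FLOPs, 70% of bandwidth.
achievable = ridge_point(peak_flops, peak_bw, flop_eff=0.80, mem_eff=0.70)

print(f"ridge point at peak rates:       {ideal:.2f} FLOP/B")
print(f"ridge point at achievable rates: {achievable:.2f} FLOP/B")
```

Because the achievable compute fraction is higher than the achievable bandwidth fraction, the ridge point moves to a higher operational intensity, which matches the direction of the shift described above.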
5.2.1. Subbatch Size: Minimize Per-Sample Time
Choosing an appropriate subbatch size for model training is a difficult process that depends on many aspects of the DL application. Here, we focus on the hardware trade-offs: we want to ensure good utilization of the accelerator while keeping a small memory footprint. We identify three subbatch size points of interest and show that the smallest size that minimizes per-sample latency (i.e., maximizes throughput) provides the best trade-offs.
Figure 11 shows the effect of subbatch size on the graph-level operational intensity and the training step time per subbatch sample. We could choose subbatch size such that the graph-level operational intensity nears saturation (green marker), giving the most opportunity to utilize the accelerator’s compute throughput. However, this point also requires a very large memory footprint, often – more than a small subbatch. Another option is subbatch size such that the graph-level operational intensity matches the accelerator’s ridge point (blue marker). In practice, however, this point does not optimize the accelerator’s compute throughput: many ops are still memory-bound. The training step time-per-sample curve (orange) shows 40% throughput loss.
Instead, we prefer subbatch size that minimizes the training step time normalized per sample. The orange point in Figure 11 is a subbatch size that keeps memory footprint small while achieving 79% of the peak compute throughput. We use this approach to estimate best subbatch sizes for each domain in Table 3. For recurrent networks, subbatch size settles at about larger than the point where graph-level operational intensity matches the accelerator’s ridge point.
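The selection rule can be sketched with a toy graph-level Roofline model in which weights are read once per step and activation traffic grows with the subbatch. This is only a qualitative illustration: the paper's estimates come from full compute-graph analysis, and every number and function name below is a hypothetical assumption.

```python
def per_sample_time(b, flops_per_sample, weight_bytes, act_bytes_per_sample,
                    flops_per_s, bytes_per_s):
    """Roofline step time divided by subbatch size b (toy graph-level model)."""
    compute = flops_per_sample * b / flops_per_s
    memory = (weight_bytes + act_bytes_per_sample * b) / bytes_per_s
    return max(compute, memory) / b


def smallest_best_subbatch(candidates, tol=0.01, **model):
    """Smallest subbatch whose per-sample time is within tol of the minimum."""
    times = {b: per_sample_time(b, **model) for b in candidates}
    best = min(times.values())
    return min(b for b, t in times.items() if t <= best * (1.0 + tol))


# Hypothetical model and accelerator parameters (assumptions for illustration).
model = dict(
    flops_per_sample=1e9,      # FLOPs per training sample
    weight_bytes=4e9,          # weight traffic per step, independent of b
    act_bytes_per_sample=1e6,  # activation traffic per sample
    flops_per_s=8e12,          # achievable compute throughput
    bytes_per_s=350e9,         # achievable memory bandwidth
)
candidates = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
chosen = smallest_best_subbatch(candidates, **model)
print("chosen subbatch size:", chosen)
```

Larger candidates give no further per-sample speedup in this toy model but would increase the memory footprint, which is why the smallest minimizer is preferred.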
5.2.2. Per-epoch Run Time
Finally, we estimate best-case run time using a Roofline model: performance is bounded either by the accelerator’s compute () or memory access () throughput:
We list training step time in Table 3, and project these out to the training time for one epoch. These estimates were also used for selecting subbatch sizes.
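A minimal sketch of this estimate follows, assuming the usual Roofline form in which step time is the larger of the compute time and the memory access time. The function names and all numeric inputs are hypothetical and are not values from Table 3 or Table 4.

```python
import math


def roofline_step_time(flops, bytes_accessed, flops_per_s, bytes_per_s):
    """Best-case step time: bounded by compute or by memory access."""
    return max(flops / flops_per_s, bytes_accessed / bytes_per_s)


def epoch_days(step_seconds, dataset_samples, subbatch_size):
    """Project per-step time out to one pass over the data set, in days."""
    steps = math.ceil(dataset_samples / subbatch_size)
    return steps * step_seconds / 86400.0


# Hypothetical per-step requirements and achievable accelerator rates.
step_s = roofline_step_time(
    flops=8e12,           # algorithmic FLOPs per training step
    bytes_accessed=70e9,  # algorithmic bytes accessed per training step
    flops_per_s=8e12,     # achievable compute throughput
    bytes_per_s=350e9,    # achievable memory bandwidth
)
days = epoch_days(step_s, dataset_samples=86400 * 128, subbatch_size=128)
print(f"step time: {step_s:.2f} s, epoch time: {days:.2f} days")
```

In this example the step is compute-bound (1.0 s of compute against 0.2 s of memory access), so the compute term sets the step time.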
Although optimistic, these training time projections show that the target accuracies for image classification and speech recognition may not be far out of reach. A single epoch would take ~ months on a single accelerator. Reducing epoch time to less than a day would require parallelizing training over ~ accelerators, a scale that is becoming more common in recent data parallelism work.
However, major challenges exist in language domains, with epoch times of – years on a single accelerator. Achieving target accuracies for these domains will require significant innovation beyond existing parallelization strategies.
6. Case Study: Word LMs at the Frontier
Language and translation domains are among the most challenging problems we have tested. Our results show they may require models of ~– parameters and ~ more compute than other domains. This section works through a case study of word LMs to consider the challenges and potential approaches to scale to their frontier accuracy. A combination of algorithmic and parallelism optimizations is required to train a target word LM in 7 days per epoch.
6.1. Setting the Baseline Train Time
Optimization Stage  Num. Accel.  Batch Size  Accel. Mem. Required (GB)  L2 Cache Capacity  Train Time (days/epoch)  Alg. FLOP Utilization
Best-case (Roofline) Baseline  113.8  —  2707  80%  
Cache-hierarchy-aware Baseline  113.8  6MB  
w/ Data Parallelism (Option 1)  125.7  6MB  
w/ Data Parallelism (Option 2)  125.7  6MB  
+ Layer Parallelism ()  {60, 17, 17, 32}  6MB  
+ Shard the Embedding Layer  {32, 31, 31, 32}  6MB 
We begin by setting the baseline training time. Since the prior word LM would take ~ years per epoch to train, we start by choosing an algorithmic optimization used in recent word models: LSTM projection (Sak et al., 2014). The projected LSTM reduces the inner dimension of the last hidden layer before feeding it to the output layer. We also increase the vocabulary size to match prior work (Jozefowicz et al., 2016). These changes reduce the per-training-step FLOPs and memory accesses, reducing the roofline time by a factor of to ~ on the target accelerator. Table 5 records our process of parallelizing the word LM, starting with this best-case days per epoch.
We also make our target application more realistic by adding simple modeling for memory accesses in matrix arithmetic. Unfortunately, algorithmic memory accesses underestimate total memory accesses for ops that perform large matrix multiplications; portions of the input tensors can be stored in on-chip caches of the accelerator, but large-tensor multiplies will require re-streaming significant portions of the inputs from off-chip memory multiple times. We capture the impact of cache hierarchy on performance by modeling these extra memory accesses, assuming a common, tiled matrix multiply implementation (Coleman and McKinley, 1995). This cache-hierarchy-aware model predicts per-epoch time would take days, reducing algorithmic FLOP utilization to 46%. We validate that this model is still optimistic, but it reduces maximum prediction error from 42% down to 15% on tested hardware.
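To make the re-streaming effect concrete, the sketch below compares algorithmic memory accesses for a matrix multiply with a simple tiled estimate. It assumes square tiles sized so that three tiles fit in cache, with each input re-read once per tile row or column of the output. This is a common textbook tiling model chosen for illustration and is not necessarily the exact model used in the paper; the 6MB cache matches the L2 capacity listed in Table 5, while the matrix sizes are hypothetical.

```python
import math


def algorithmic_bytes(m, n, k, elem_bytes=4):
    """Each input read once and the output written once."""
    return (m * k + k * n + m * n) * elem_bytes


def tiled_bytes(m, n, k, cache_bytes, elem_bytes=4):
    """Estimated off-chip traffic for C[m,n] = A[m,k] @ B[k,n] with square tiles.

    Assumes tile edge t such that three t-by-t tiles fit in cache. A is
    re-read once per column of output tiles, B once per row of output tiles.
    """
    t = max(1, math.isqrt(cache_bytes // (3 * elem_bytes)))
    tiles_m = math.ceil(m / t)
    tiles_n = math.ceil(n / t)
    reads = tiles_n * m * k + tiles_m * k * n
    writes = m * n
    return (reads + writes) * elem_bytes


cache = 6 * 1024 * 1024  # 6MB cache
small = (256, 256, 256)  # fits in a single tile, so nothing is re-streamed
large = (8192, 8192, 8192)

for dims in (small, large):
    ratio = tiled_bytes(*dims, cache_bytes=cache) / algorithmic_bytes(*dims)
    print(dims, f"tiled / algorithmic traffic = {ratio:.2f}")
```

Under these assumptions the small multiply incurs no extra traffic, while the large one moves several times more data than the algorithmic count, which lowers its effective operational intensity.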
6.2. StepbyStep Parallelism Strategy
There are three major challenges to scale this word LM’s training time. First, we will need to reduce training time by ( days), requiring parallelism across at least this many accelerators. Second, the required memory footprint is too large to fit in a single accelerator’s memory, so each data-parallel worker will need to parallelize across at least accelerators (GB per stepGB capacity per accelerator). Finally, we aim for effective use of resources, and describe a parallelism scheme that keeps accelerator algorithmic FLOPs utilization above .
6.2.1. Data Parallelism
We first scale out using data parallelism, the process of dividing batch elements across multiple workers and then collecting their results to update model weights. The baseline subbatch size is 128, found using the method in Section 5.2.1. We model a synchronous stochastic gradient descent (SGD) approach implemented using a ring all-reduce (Patarasuk and Yuan, 2009). SGD communication overheads eventually dominate per-training-step time. As listed in Table 4, we assume accelerators can communicate using high-bandwidth inter-device links at GB/s, consistent with future intra-node and Infiniband 400Gb inter-node interconnects. Figure 12 shows training time per epoch improves while utilization declines as we increase the number of data-parallel workers.
Modeling results show that word LMs would require at least accelerators to reduce epoch time to 6.2 days. Utilization declines slightly to 34% at accelerators due to communication overheads for reducing gradients. Recent prior work uses batch sizes up to samples to train image classification (You et al., 2018) and character LMs (Puri et al., 2018), so we believe batch size – may be feasible with future techniques. We also list a configuration for accelerator data parallelism as the basis for the next stage.
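The trade-off between epoch time and utilization can be sketched with the standard bandwidth term of a ring all-reduce, in which each worker transfers 2(N-1)/N times the gradient size. The sketch assumes no overlap between communication and computation and ignores link latency. All numeric inputs and function names are hypothetical, not the paper's modeled values.

```python
import math


def ring_allreduce_seconds(grad_bytes, workers, link_bytes_per_s):
    """Bandwidth-only cost of a ring all-reduce over `workers` devices."""
    if workers <= 1:
        return 0.0
    return 2.0 * (workers - 1) / workers * grad_bytes / link_bytes_per_s


def data_parallel_epoch(compute_s, grad_bytes, workers, link_bytes_per_s,
                        single_worker_steps):
    """Return (epoch seconds, fraction of step time spent computing)."""
    comm_s = ring_allreduce_seconds(grad_bytes, workers, link_bytes_per_s)
    step_s = compute_s + comm_s  # assumes no compute/communication overlap
    steps = math.ceil(single_worker_steps / workers)
    return steps * step_s, compute_s / step_s


# Hypothetical inputs: 1 s of compute per step, 4 GB of gradients, 50 GB/s links.
for n in (1, 4, 64, 1024):
    secs, eff = data_parallel_epoch(1.0, 4e9, n, 50e9, single_worker_steps=100000)
    print(f"{n:5d} workers: epoch {secs:10.1f} s, compute fraction {eff:.3f}")
```

The communication cost per step approaches a constant as the worker count grows, so epoch time keeps falling while the compute fraction settles below one.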
6.2.2. Model Parallelism
Although data parallelism gets close to our desired training time, it does not address the problem of per-accelerator required memory footprint. At the current optimization stage, each accelerator would require roughly GB of memory capacity, so we must divide the model to parallelize training steps across more accelerators.
We consider “layer-wise parallelism”, an approach that places separate layers of the model across neighboring accelerators. Since the word LM has 4 layers, we allocate 4 accelerators per data-parallel worker. Starting from the data parallel worker option, we require total accelerators to add layer parallelism. This approach would reduce epoch time to just over 7 days, and cut the required memory footprint per accelerator by half.
Although layer parallelism reduces the footprint required per accelerator, the embedding layer of the word LM (GB) will not fit in a single accelerator’s memory. Prior techniques move the embedding layer to locations with more memory capacity, such as host memory, which will require moving embedded data to the accelerator’s memory. Instead, we propose to split the embedding layer into 3 pieces and locate two smaller parts in the memories of accelerators that perform recurrent layer computations. This split evens out the per-accelerator footprints with trivial run time overhead, and results in final algorithmic FLOP utilization of .
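The footprint balancing can be illustrated with the per-accelerator values listed in Table 5, {60, 17, 17, 32} GB before sharding and {32, 31, 31, 32} GB after. The sketch assumes the embedding layer sits on the first accelerator and that 14 GB moves to each of the two recurrent-layer accelerators. That shard size is inferred from the difference between the two table rows rather than stated in the text, and `shard_embedding` is a hypothetical helper.

```python
def shard_embedding(footprints_gb, source, targets, gb_per_target):
    """Move gb_per_target of the source accelerator's footprint to each target."""
    out = list(footprints_gb)
    for t in targets:
        out[source] -= gb_per_target
        out[t] += gb_per_target
    return out


before = [60, 17, 17, 32]  # per-accelerator GB with layer parallelism (Table 5)
# Assumption: embedding on accelerator 0; 14 GB shards inferred from Table 5.
after = shard_embedding(before, source=0, targets=[1, 2], gb_per_target=14)

print("before:", before, "max =", max(before))
print("after: ", after, "max =", max(after))
```

The total footprint is unchanged; only the peak per-accelerator requirement drops, which is the quantity that must fit within memory capacity.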
6.2.3. Discussion and Related Work
This word LM case study highlights the challenges that will exist for scaling RNN model training to frontier-level accuracy. Major opportunities to optimize RNN training all result in the need for more cache or memory capacity.
Memory Capacity: Although emerging accelerators use high-bandwidth memories (HBM), their capacities are currently just or GB. Existing CNN applications can utilize compute FLOPs of these accelerators with small memory footprints, so there is little pressure from these applications to increase memory capacity. On the other hand, each of the language domains shows an extreme per-accelerator training-step memory footprint that exceeds current memory capacities by –.
We started with algorithmic optimization to reduce the word LM’s memory (and compute) requirements. Accelerator memory capacities are orders of magnitude short of LM requirements, but some algorithmic optimizations may be promising to chip away at this gap. Model compression or distillation, and low-precision or sparse computation may reduce model or activation tensor size (Cheng et al., 2018), and reduce memory requirements by –. Currently, many challenges exist to use these techniques during model training.
Parallelism Techniques: DL frameworks could also provide parallelization techniques to enable more effective use of cache and memory capacity. Researchers have shown how to train image and language applications quickly using data-parallel scaling to reduce per-accelerator activation memory requirements (Goyal et al., 2017; You et al., 2018; Puri et al., 2018). Prior work also explores other forms of data parallelism (Krizhevsky, 2014; Recht et al., 2011; Maleki et al., 2018), and there are opportunities to reduce data communication overheads of parallelism (Alistarh et al., 2017; Wen et al., 2017; Lin et al., 2017). Layer parallelism can also reduce the memory requirements for model weights (Shen et al., 2017b, a). Improved model parallelism techniques could recover some of the ~23% algorithmic FLOP utilization lost to layer parallelism in the case study. Frameworks should aim to automatically and dynamically subdivide the computation, automatically map appropriate compute graph portions to compute resources, and prefetch data between host and accelerator memories.
Operational Intensity: Our modeling also shows that RNNs suffer from moderate operational intensity in large matrix multiplications. This moderate operational intensity is caused by the need to stream inputs from memory multiple times during tiled multiplies, decreasing the algorithmic FLOP utilization by ~40% compared to an ideal system that would not need to re-stream inputs. In this setting, increasing on-chip cache size may hinder compute throughput growth, but is likely to proportionally reduce input re-streaming from memory. Better cache tiling, kernel optimization, and fusion techniques might also help (Coates et al., 2013; Chetlur et al., 2014).
Hardware techniques to better support large-scale RNN training (larger memories and on-chip caches) run counter to emerging accelerator designs. Emerging designs aim to support very high compute-to-memory ratios by optimizing for compute throughput. This design philosophy is unlikely to trade die area for memory channels (capacity) or caches.
7. Other Related Work
DL Application Characterization: Many benchmark suites have been developed that aim to analyze DL applications by focusing on particular ops/kernels [13] or quantifying the performance of end-to-end DL applications ([11]; MLPerf (2018)). OpenAI recently characterized the trend in DL FLOP growth over time using coarse approximations of FLOPs for applications ranging from AlexNet to AlphaGo [2].
Performance Modeling: Paleo is an analytical performance model that explores parallelism for CNNs (Qi et al., 2017).
8. Conclusion
This paper leverages the prior work to project the dataset and model size growth required to advance DL accuracy beyond human-level, to “frontier” targets defined by machine learning experts. Datasets will need to grow by –, while models will need to grow by – to achieve target accuracies. We project the computational requirements to train these applications at scale. Our results reveal an important segmentation of DL training challenges for recurrent neural networks (RNNs) that contrasts with prior studies of deep convolutional networks. RNNs will have comparatively moderate operational intensities and very large memory footprint requirements. In contrast to emerging accelerators, large-scale RNN training characteristics suggest designs with significantly larger memory capacity and on-chip caches.
References
 TensorFlow: LargeScale Machine Learning on Heterogeneous Systems. External Links: Link Cited by: §4.1.
 [2] (2018) AI and Compute. Note: External Links: Link Cited by: §7.
 QSGD: CommunicationEfficient SGD via Gradient Quantization and Encoding. In Advances in Neural Information Processing Systems (NIPS), pp. 1709–1720. Cited by: §6.2.3.
 Deep Speech 2: EndtoEnd Speech Recognition in English and Mandarin. In The International Conference on Machine Learning (ICML), pp. 173–182. Cited by: §1.
 Scaling to Very Very Large Corpora for Natural Language Disambiguation. In Association of Computational Linguistics (ACL), Cited by: §1.
 Exploring Neural Transducers for Endtoend Speech Recognition. In IEEE Automatic Speech Recognition and Understanding Workshop, pp. 206–213. Cited by: §2.5.
 Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges. In IEEE Signal Processing Magazine, Vol. 35. Cited by: §6.2.3.
 cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759. Cited by: §6.2.3.
 Deep Learning with COTS HPC Systems. In International Conference on Machine Learning (ICML), pp. 1337–1345. Cited by: §6.2.3.
 Tile Size Selection Using Cache Organization and Data Layout. In ACM SIGPLAN Notices, Vol. 30, pp. 279–290. Cited by: §6.1.
 [11] (2018) DAWNBench. Note: External Links: Link Cited by: §7.
 Machine Learning for Systems and Systems for Machine Learning. Note: Presentation at ML Systems Workshop with Neural Information Processing Systems (NIPS) Conference External Links: Link Cited by: §5.1.
 [13] (2018) DeepBench. Note: External Links: Link Cited by: §7.

 Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Facebook AI Research Publications. Cited by: §6.2.3.
 Deep Speech: Scaling Up End-to-End Speech Recognition. arXiv preprint arXiv:1412.5567. Cited by: §3.1.
 Nearlytight VCdimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. In The Conference on Learning Theory (COLT), Vol. 65, pp. 1064–1068. Cited by: §3.3.

 Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §2.2.
 Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409. Cited by: §1, §3.2, §3.2, §3.3, footnote 1.
 Exploring the Limits of Language Modeling. arXiv preprint arXiv:1602.02410v2. Cited by: §2.3, §6.1.

 One Weird Trick for Parallelizing Convolutional Neural Networks. arXiv preprint arXiv:1404.5997. Cited by: §6.2.3.
 Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In The International Conference on Learning Representations (ICLR), Cited by: §6.2.3.

 Effective Approaches to Attention-based Neural Machine Translation. In The Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1412–1421. Cited by: §2.4.
 Semantics-preserving Parallelization of Stochastic Gradient Descent. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 224–233. Cited by: §6.2.3.
 MLPerf: A Broad ML Benchmark Suite for Measuring Performance of ML Software Frameworks, ML Hardware Accelerators, and ML Cloud Platforms.. External Links: Link Cited by: §2, §7.
 Bandwidth optimal allreduce algorithms for clusters of workstations. Journal on Parallel and Distributed Computing 69 (2), pp. 117–124. External Links: ISSN 07437315 Cited by: §6.2.1.
 Large Scale Language Modeling: Converging on 40GB of Text in Four Hours. arXiv preprint arXiv:1808.01371. Cited by: §6.2.1, §6.2.3.
 Paleo: A Performance Model for Deep Neural Networks. In The International Conference on Learning Representations (ICLR), Cited by: §7.
 Hogwild: A Lockfree Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems (NIPS), pp. 693–701. Cited by: §6.2.3.
 ImageNet Large Scale Visual Recognition Challenge. arXiv preprint arXiv:1409.0575. Cited by: Table 1.
 Long shortterm memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128. Cited by: §6.1.
 Prediction and Entropy of Printed English. Vol. 30, pp. 47–51. Cited by: §3.2, Table 1.
 Outrageously Large Neural Networks: The SparselyGated MixtureofExperts Layer. arXiv preprint arXiv:1701.06538v1. Cited by: §3.1, §3.1.
 Escher: A CNN Accelerator with Flexible Buffering to Minimize OffChip Transfer. In IEEE International Symposium on FieldProgrammable Custom Computing Machines (FCCM). IEEE Computer Society, Los Alamitos, CA, USA, Cited by: §6.2.3.
 Maximizing CNN Accelerator Efficiency Through Resource Partitioning. In The International Symposium on Computer Architecture (ISCA), pp. 535–547. Cited by: §6.2.3.
 A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv preprint arXiv:1710.06451v2. Cited by: §4.1.
 Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In The International Conference on Computer Vision (ICCV), Cited by: §1, §3.1.
 TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In Advances in Neural Information Processing Systems (NIPS), Cited by: §6.2.3.
 Roofline: An Insightful Visual Performance Model for FloatingPoint Programs and Multicore Architectures. Vol. 52, pp. 65–76. Cited by: §5.2.
 The Microsoft 2017 Conversational Speech Recognition System. Technical report External Links: Link Cited by: Table 1.
 ImageNet Training in Minutes. In The International Conference on Parallel Processing (ICPP), pp. 1:1–1:10. Cited by: §6.2.1, §6.2.3.
 Recurrent Highway Networks. In The International Conference on Machine Learning (ICML), Cited by: §2.3.
Appendix A Artifact Appendix
a.1. Abstract
The artifact contains the latest version of our codebase, called “Catamount”, which is a compute graph analysis tool to load, construct, and modify deep learning (DL) models and to symbolically analyze their compute requirements. Catamount can read DL model checkpoints saved from DL frameworks (e.g., from Tensorflow). This artifact includes (A) Tensorflow checkpoints for each of the models (compute graphs) analyzed in the PPoPP 2019 paper, Beyond Human Level Accuracy: Computational Challenges in Deep Learning and (B) shell scripts to run graph analytics and generate results for Figures 7 to 10 of the paper. To validate the results, run the test script, which generates and collects the corresponding outputs:
~$ bash catamount/frameworks/example_graphs/tensorflow/full_models/generate_results.sh
a.2. Artifact checklist (metainformation)

Algorithm: Catamount can construct or load compute graphs and, using various graph traversal algorithms, propagate symbolic graph dimensions and calculate various compute requirements, including compute FLOPs, memory accesses, and memory footprint.

Data set: Consists of compute graph checkpoints from neural network models trained in Tensorflow. The models are a word language model (LSTM), character language model (RHN), neural machine translation (encoder/decoder+attention), speech recognition (encoder/decoder+attention), and image classification (ResNet).

Operating system: Linux

Hardware: No special hardware is required. CPU-based system. Recommended 8+ GB memory.

Program requirements: Python 3.6

Python packages: Catamount requires Python packages to run. First, it requires a virtual environment created with Virtualenv (https://virtualenv.pypa.io/en/latest/, commonly included with Python 3.6), or another virtual environment package. Second, Catamount depends on three other Python packages, numpy, sympy, and tensorflow>=1.7. See instructions below to install these dependencies.

Input: Each Catamount test takes as input the Tensorflow model definition (compute graph) for the problem domain. Model descriptions are in Section 2.

Output: Catamount tests output all analytics about the models (compute graphs) they load, including symbolic model parameters, algorithmic Flops, memory accesses, and memory footprint. By binding these symbolic functions to particular values, the tests also output the numerical values.

How much disk space required (approximately)?: 1 GB

How much time is needed to set up experiments (approximately)?: Less than 5 minutes

How much time is needed to complete experiments (approximately)?: Less than 2 hours, depending on system CPU, memory performance

Publicly available?: Yes

Artifact DOI: https://doi.org/10.5281/zenodo.2259280

Repository location: https://github.com/baiduresearch/catamount

Code/data licenses (if publicly available)?: Apache 2.0
a.3. Description
a.3.1. How delivered
Catamount is an open source Python package under Apache 2.0 license and is hosted with code and example DL compute graphs on GitHub (https://github.com/baiduresearch/catamount).
a.3.2. Software dependencies
Catamount depends on recent versions of numpy, sympy, and tensorflow>=1.7, which work most stably with recent versions of Python. It is strongly recommended that users begin with Python 3.6 to install and run Catamount tests. Catamount may not work smoothly with prior versions of Python.
a.3.3. Data sets
The model definitions (compute graphs) analyzed in this paper are included as endtoend tests of Catamount functionality and are distributed in the Catamount Github repository. These model definitions are in the form of Tensorflow checkpoints that were saved using the standard Tensorflow saver as follows:
import os
import tensorflow as tf

# ... Construct a TF model ...

# Set output directory
outdir = ...

# Start TF session, create saver, and save model
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    saver.save(sess, os.path.join(outdir, 'tf_graph'))
This process saves the graph definition as a Tensorflow MetaGraphDef file, tf_graph.meta, along with saved parameters. Catamount can load the MetaGraphDef (.meta) files as graph-like Python objects for analysis. An example saving process can be found in the Catamount repo at catamount/frameworks/example_graphs/tensorflow/rnn/tf_dyanmic_rnn.py.
The .meta files used to analyze compute requirements for the different applications in this paper can be found in the following locations in the repo:

Machine translation, word and character language models:
catamount/frameworks/example_graphs/tensorflow/full_models/language_models/ 
Image classification ResNet models:
catamount/frameworks/example_graphs/tensorflow/full_models/image_classification/ 
Speech recognition attention model:
catamount/frameworks/example_graphs/tensorflow/full_models/speech_attention/
Finally, the tests that load these graphs and generate analytical models and numerical outputs can be found in the Catamount fullgraph tests directory:

Language models: catamount/tests/full/tf_language_models.py. Pass the parameter --domain <domain>, where <domain> can be one of charlm, nmt, or wordlm, for character, machine translation, or word models, respectively.

Image classification: catamount/tests/full/tf_image_resnet.py. Pass the model depth as the parameter --depth. Supported depths currently include ResNet 18, 34, 50, 101, or 152.

Speech recognition: catamount/tests/full/tf_speech_attention.py
a.4. Regenerating experiments from this paper
To download Catamount and regenerate results for this paper, run the following commands. These commands clone the public repository, check out the commit known to work for validating PPoPP paper results, and run the tests that generate results:
~$ git clone https://github.com/baiduresearch/catamount
~$ cd catamount
~$ git checkout -b ppoppartifactvalidation ppopp2019artifact
~$ bash catamount/frameworks/example_graphs/tensorflow/full_models/generate_results.sh
a.5. Evaluation and expected result
With the commands in the prior subsection, Catamount should create an output file for each of the 9 analyzed compute graphs. These files will be named ppopp_2019_tests/output_*.txt. To regather results from these files after running the generate_results.sh script, you can run the following command inside the top-level Catamount directory:
~$ bash catamount/frameworks/example_graphs/tensorflow/full_models/gather_results.sh
a.6. Experiment customization
To customize Catamount tests or experimental results from this paper, users can modify the appropriate Python script in the tests directory, catamount/tests/full/. By changing the bind_subs dictionary, users can bind symbolic dimensions of the compute graphs to different values. Users can also create their own compute graph definitions manually using the Catamount API (see catamount/api/) or by checkpointing Tensorflow models and loading them into Catamount as shown in the tests.
Catamount can calculate the algorithmic compute requirements for a loaded model. In particular, it calculates algorithmic FLOPs, memory bytes accessed, and minimal memory footprint for a pass through the compute graph. In addition to requirements presented in the paper, Catamount can calculate algorithmic IO, which is the amount of data accessed for input to and output from a model. Training data is often stored on and read from disk, and placed into the model’s input memory allocations. Algorithmic IO is proportional to the batch size, but stays fixed as model size and training step compute requirements grow. We do not investigate algorithmic IO in this work, because we expect IO will grow very slowly relative to compute.