1 Introduction
Deeper neural network models are able to express more complex functions, and recent results have shown that with the use of residual (He et al., 2016a) and skip (Huang et al., 2017) connections to address vanishing gradients, such networks can be trained effectively to leverage this additional capacity. As a result, the use of deeper network architectures has become prevalent, especially for visual inference tasks (He et al., 2016b). The shift to larger architectures has delivered significant improvements in performance, but also increased demand on computational resources. In particular, deeper network architectures require significantly more on-device memory during training, much more so than for inference. This is because training requires retaining the computed activations of all intermediate layers, since they are needed to compute gradients during the backward pass.
The increased memory footprint means fewer training samples can fit in memory and be processed as a batch on a single GPU. This is inefficient: smaller batches are not able to saturate all available parallel cores, especially because computation in “deeper” architectures is distributed to be more sequential. Smaller batches also complicate the use of batch-normalization (Ioffe & Szegedy, 2015), since batch statistics are now computed over fewer samples, making training less stable. These considerations often force the choice of architecture to be based not just on optimality for inference, but also on practical feasibility for training. For instance, deep residual networks for large images drop resolution early, so that most layers have smaller sized outputs.

While prior work to address this has traded off memory for computation (Martens & Sutskever, 2012; Chen et al., 2016; Gruslys et al., 2016; Gomez et al., 2017), its focus has been on enabling exact gradient computation. However, since stochastic gradient descent (SGD) inherently works with noisy gradients at each iteration, we propose an algorithm that computes reasonably approximate gradients, while significantly reducing a network’s memory footprint and with virtually no additional computational cost. Our work is motivated by distributed training algorithms that succeed despite working with approximate and noisy gradients aggregated across multiple devices (Recht et al., 2011; Dean et al., 2012; Seide et al., 2014; Wen et al., 2017). We propose using low-precision approximate activations, which require less memory, to compute approximate gradients during backpropagation (backprop) on a single device. Note that training with a lower precision of 16-bit instead of 32-bit floating-point representations is not uncommon. But this lower precision is used for all computation, and thus allows only for a modest lowering of precision, since the approximation error builds up across the forward and then backward pass through all layers.

In this work, we propose a new backprop implementation that performs the forward pass through the network at full precision, and incurs limited approximation error during the backward pass. We use each layer’s full-precision activations to compute the activations of subsequent layers. However, once these activations have been used in the forward pass, our method discards them and stores a low-precision approximation instead. During the backward pass, gradients are propagated back through all the layers at full precision, but instead of using the original activations, we use their low-precision approximations. As a result, we incur an approximation error at each layer when computing the gradients to the weights from multiplying the incoming gradient with the approximate activations, but ensure the error in gradients going back to the previous layer is minimal.
Our experimental results show that even using only 4-bit fixed-point approximations for the original 32-bit floating-point activations causes only minor degradation in training quality. This significantly lowers the memory required for training, and the saving comes essentially for “free”, incurring only the negligible additional computational cost of converting activations to and from low-precision representations. Our memory-efficient version of backprop is thus able to use larger batch sizes at each iteration (to fully use available parallelism and compute stable batch statistics) and makes it practical for researchers to explore the use of much larger and deeper architectures than before.
2 Related Work
A number of works focus on reducing the memory footprint of a model during inference, e.g., by compression (Han et al., 2015) and quantization (Hubara et al., 2017), to ensure that it can be deployed on resource-limited mobile devices, while still delivering reasonable accuracy. These methods still require storing full versions of model weights and activations during network training, and assume there is sufficient memory to do so. However, training requires significantly more memory than inference because of the need to store all intermediate activations. And so, memory can be a bottleneck during training, especially with the growing preference for larger and deeper network architectures. The most common recourse is to simply use multiple GPUs during training. But this introduces the overhead of inter-device communication, and often under-utilizes the available parallelism on each device: computation in deeper architectures is distributed more sequentially and, without sufficient data parallelism, often does not saturate all GPU cores.
A popular strategy to reduce training memory requirements is “checkpointing”. Activations for only a subset of layers are stored at a time, and the rest recovered by repeating forward computations (Martens & Sutskever, 2012; Chen et al., 2016; Gruslys et al., 2016). This affords memory savings with the trade-off of additional computational cost. For example, Chen et al. (2016) propose a strategy that requires memory proportional to the square root of the number of layers, while requiring up to the computational cost of an additional forward pass. In a similar vein, Gomez et al. (2017) considered network architectures with “reversible” or invertible layers to allow recomputing input activations of such layers from their outputs during the backward pass.
These methods likely represent the best possible solutions if the goal is restricted to computing exact gradients. But SGD is fundamentally a noisy process, and the exact gradients computed over a batch at each iteration are already an approximation of the gradients of the model over the entire training set (Robbins & Monro, 1985). Researchers have posited that further approximations are possible without degrading training ability, and used this to realize gains in efficiency. For distributed training, asynchronous methods (Recht et al., 2011; Dean et al., 2012) delay synchronizing models across devices to mitigate communication latency. Despite each device now working with stale models, there is no major degradation in training performance. Other methods quantize gradients to two (Seide et al., 2014) or three levels (Wen et al., 2017) so as to reduce communication overhead, and again find that training remains robust to such approximation. Our work also adopts an approximation strategy for gradient computation, but targets the problem of memory usage on each device. We approximate activations, rather than gradients, with lower-precision representations, and by doing so, we are able to achieve considerable reductions in a model’s memory footprint during training. Note that since our method achieves a constant-factor saving in memory for backpropagation across any group of layers, it can also be employed within checkpointing to further improve memory cost.
It is worth differentiating our work from those that carry out all training computations at lower precision (Micikevicius et al., 2017; Gupta et al., 2015). This strategy allows for a modest loss in precision: from 32-bit to 16-bit representations. In contrast, our approach allows for a much greater reduction in precision. This is because we carry out the forward pass in full precision, and approximate activations only after they have been used by subsequent layers. Our strategy limits accumulation of errors across layers, and we are able to replace 32-bit floats with 8-bit and even 4-bit fixed-point approximations, with little to no effect on training performance. Note that performing all computation at lower precision also has a computational advantage: due to a reduction in device memory bandwidth usage (transferring data from global device memory to registers) in Micikevicius et al. (2017), and due to the use of specialized hardware in Gupta et al. (2015). While the goal of our work is different, our strategy can be easily combined with these ideas: compressing intermediate activations to a greater degree, while also using 16-bit instead of 32-bit precision for computation.
3 Proposed Method
We now describe our approach to memory-efficient training. We begin by reviewing the computational steps in the forward and backward pass for a typical network layer, and then describe our approximation strategy for reducing the memory required to store intermediate activations.
3.1 Background
A neural network is a composition of linear and non-linear functions that map the input to the final desired output. These functions are often organized into “layers”, where each layer consists of a single linear transformation (typically a convolution or a matrix multiply) and a sequence of non-linearities. We use the “pre-activation” definition of a layer, where we group the linear operation with the non-linearities that immediately precede it. Consider a typical network whose $l^{th}$ layer applies batch-normalization and ReLU to its input $A_{l:i}$ followed by a linear transform:

$$\text{[B.Norm.]}\quad A_{l:1} = (\sigma_l^2 + \epsilon)^{-1/2} \circ (A_{l:i} - \mu_l), \quad \mu_l = \text{Mean}(A_{l:i}), \quad \sigma_l^2 = \text{Var}(A_{l:i}) \tag{1}$$

$$\text{[Sc.\&B.]}\quad A_{l:2} = \gamma_l \circ A_{l:1} + \beta_l \tag{2}$$

$$\text{[ReLU]}\quad A_{l:3} = \max(0, A_{l:2}) \tag{3}$$

$$\text{[Linear]}\quad A_{l:o} = A_{l:3} \times W_l \tag{4}$$

to yield the output activations $A_{l:o}$ that are fed into subsequent layers. Here, each activation is a tensor with two or four dimensions: the first indexing different training examples, the last corresponding to “channels”, and others to spatial location. Mean and Var aggregate statistics over batch and spatial dimensions, to yield vectors $\mu_l$ and $\sigma_l^2$ with per-channel means and variances. Element-wise addition and multiplication (denoted by $\circ$) are carried out by “broadcasting” when the tensors are not of the same size. The final operation represents the linear transformation, with $\times$ denoting matrix multiplication. This linear transform can also correspond to a convolution.

Note that (1)-(4) are defined with respect to learnable parameters $\gamma_l$, $\beta_l$, and $W_l$, where $\gamma_l$ and $\beta_l$ are both vectors of the same length as the number of channels in $A_{l:i}$, and $W_l$ denotes a matrix (for fully-connected layers) or elements of a convolution kernel. These parameters are learned iteratively using SGD, where at each iteration, they are updated based on gradients ($\nabla\gamma_l$, $\nabla\beta_l$, and $\nabla W_l$) of some loss function computed on a batch of training samples.
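The forward steps (1) to (4) can be sketched in plain Python for the two-dimensional case (batch by channels). This is a minimal illustration rather than the authors' implementation: the function name `layer_forward`, the list-of-lists layout, and the `eps` stabilizer are assumptions made for the example.

```python
import math

def layer_forward(a_in, gamma, beta, weight, eps=1e-5):
    """Forward pass of one pre-activation layer on a (batch x channels) input.

    Hypothetical helper: returns the layer output and the intermediates
    produced by steps (1)-(3), plus the per-channel variance.
    """
    n, c = len(a_in), len(a_in[0])
    # Per-channel batch statistics.
    mean = [sum(row[j] for row in a_in) / n for j in range(c)]
    var = [sum((row[j] - mean[j]) ** 2 for row in a_in) / n for j in range(c)]
    inv_std = [1.0 / math.sqrt(v + eps) for v in var]
    # (1) batch-normalization.
    a1 = [[(row[j] - mean[j]) * inv_std[j] for j in range(c)] for row in a_in]
    # (2) per-channel scale and bias.
    a2 = [[gamma[j] * row[j] + beta[j] for j in range(c)] for row in a1]
    # (3) ReLU.
    a3 = [[max(0.0, v) for v in row] for row in a2]
    # (4) linear transform: (batch x c) times (c x c_out).
    c_out = len(weight[0])
    a_out = [[sum(row[j] * weight[j][k] for j in range(c)) for k in range(c_out)]
             for row in a3]
    return a_out, (a1, a2, a3, var)
```

A convolutional layer follows the same four steps, with the statistics also aggregated over spatial positions and the last step replaced by a convolution.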
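To compute gradients with respect to all parameters for all layers in the network, the training algorithm first computes activations for all layers in sequence, ordered such that each layer in the sequence takes as input the output from a previous layer. The loss is computed with respect to activations of the final layer, and then the training algorithm goes through all layers again in reverse sequence, using the chain rule to backpropagate gradients of this loss. For the $l^{th}$ layer, given the gradients $\nabla A_{l:o}$ of the loss with respect to the output, this involves computing gradients $\nabla\gamma_l$, $\nabla\beta_l$, and $\nabla W_l$ with respect to the layer’s learnable parameters, as well as gradients $\nabla A_{l:i}$ with respect to its input for further propagation. These gradients are given by:

$$\text{[Linear]}\quad \nabla W_l = A_{l:3}^T \times (\nabla A_{l:o}), \quad \nabla A_{l:3} = (\nabla A_{l:o}) \times W_l^T \tag{5}$$

$$\text{[ReLU]}\quad \nabla A_{l:2} = \delta(A_{l:2} > 0) \circ (\nabla A_{l:3}) \tag{6}$$

$$\text{[Sc.\&B.]}\quad \nabla\beta_l = \text{Sum}(\nabla A_{l:2}), \quad \nabla\gamma_l = \text{Sum}(A_{l:1} \circ (\nabla A_{l:2})), \quad \nabla A_{l:1} = \gamma_l \circ \nabla A_{l:2} \tag{7}$$

$$\text{[B.Norm.]}\quad \nabla A_{l:i} = (\sigma_l^2 + \epsilon)^{-1/2} \circ \left[\nabla A_{l:1} - \text{Mean}(\nabla A_{l:1}) - A_{l:1} \circ \text{Mean}(A_{l:1} \circ \nabla A_{l:1})\right] \tag{8}$$

where Sum and Mean again aggregate over all but the last dimension, and $\delta(A > 0)$ is a tensor the same size as $A$ that is 1 where the values in $A$ are positive, and 0 otherwise.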
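The backward steps (5) to (8) can be written out in the same plain-Python style. The sketch below assumes the standard gradient expressions for a batch-normalization, scale-and-bias, ReLU, and linear chain on a two-dimensional (batch by channels) input; the function name `layer_backward` and its argument layout are hypothetical.

```python
import math

def layer_backward(a1, a2, a3, var, gamma, weight, g_out, eps=1e-5):
    """Gradients for one BN -> scale/bias -> ReLU -> linear layer (2-D case).

    a1, a2, a3 are the normalized, pre-ReLU and post-ReLU activations;
    g_out is the incoming gradient with respect to the layer output.
    """
    n, c, c_out = len(a1), len(gamma), len(weight[0])
    # (5) linear: the weight gradient needs a3, the input gradient only the weights.
    g_w = [[sum(a3[i][j] * g_out[i][k] for i in range(n)) for k in range(c_out)]
           for j in range(c)]
    g3 = [[sum(g_out[i][k] * weight[j][k] for k in range(c_out)) for j in range(c)]
          for i in range(n)]
    # (6) ReLU: only the sign of a2 matters.
    g2 = [[g3[i][j] if a2[i][j] > 0 else 0.0 for j in range(c)] for i in range(n)]
    # (7) scale and bias.
    g_beta = [sum(g2[i][j] for i in range(n)) for j in range(c)]
    g_gamma = [sum(a1[i][j] * g2[i][j] for i in range(n)) for j in range(c)]
    g1 = [[gamma[j] * g2[i][j] for j in range(c)] for i in range(n)]
    # (8) batch-normalization: three terms, only the last one uses activations.
    m1 = [sum(g1[i][j] for i in range(n)) / n for j in range(c)]
    m2 = [sum(a1[i][j] * g1[i][j] for i in range(n)) / n for j in range(c)]
    g_in = [[(g1[i][j] - m1[j] - a1[i][j] * m2[j]) / math.sqrt(var[j] + eps)
             for j in range(c)] for i in range(n)]
    return g_in, g_gamma, g_beta, g_w
```

Every step other than the weight gradient, the scale gradient, and the last term of the batch-normalization step depends only on the incoming gradient, the parameters, or the sign of the pre-ReLU activations.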
When the goal is to just compute the final output of the network, the activations of an intermediate layer can be discarded during the forward pass as soon as we finish processing the subsequent layer or layers that use it as input. However, we need to store all intermediate activations during training because they are needed to compute gradients during backpropagation: (5)-(8) involve not just the values of the incoming gradient, but also the values of the activations themselves. Thus, training requires enough available memory to hold the activations of all layers in the network.
3.2 Backpropagation with Approximate Activations
We begin by observing that we do not necessarily need to store all intermediate activations $A_{l:1}$, $A_{l:2}$, and $A_{l:3}$ within a layer. For example, it is sufficient to store the activation values $A_{l:2}$ right before the ReLU, along with the variance vector $\sigma_l^2$ (which is typically much smaller than the activations themselves). Given $A_{l:2}$, we can reconstruct the other activations $A_{l:1}$ and $A_{l:3}$ needed in (5)-(8) using element-wise operations, which typically have negligible computational cost. Some deep learning frameworks already use such “fused” layers to conserve memory, and we use this to measure memory usage for exact training.
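As a rough illustration of this reconstruction, the post-ReLU activations follow from the stored pre-ReLU values by clipping at zero, and the normalized activations follow by undoing the scale and bias. The helper name `reconstruct_from_pre_relu` is hypothetical, and the sketch assumes every scale entry is non-zero.

```python
def reconstruct_from_pre_relu(a2, gamma, beta):
    """Recover normalized (a1) and post-ReLU (a3) activations from pre-ReLU a2.

    Assumes a (batch x channels) list-of-lists layout and non-zero gamma.
    """
    c = len(gamma)
    # Undo the per-channel scale and bias to get the normalized activations.
    a1 = [[(row[j] - beta[j]) / gamma[j] for j in range(c)] for row in a2]
    # Re-apply the ReLU to get the input of the linear transform.
    a3 = [[max(0.0, v) for v in row] for row in a2]
    return a1, a3
```

Both operations are element-wise, so their cost is small next to the convolution or matrix multiply in the same layer.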
Storing one activation tensor at full precision for every layer still requires a considerable amount of memory. We therefore propose retaining an approximate low-precision version $\tilde{A}_{l:2}$ of the pre-ReLU activations $A_{l:2}$, which requires much less memory for storage, for use in (5)-(8) during backpropagation. As shown in Fig. 1, we use full-precision versions of all activations during the forward pass to compute the output $A_{l:o}$ from the input $A_{l:i}$ as per (1)-(4), and use $A_{l:2}$ to compute its approximation $\tilde{A}_{l:2}$. The full-precision activations are discarded as soon as they have been used: the intermediate activations $A_{l:1}$, $A_{l:2}$, and $A_{l:3}$ are discarded as soon as the approximation $\tilde{A}_{l:2}$ and output $A_{l:o}$ have been computed, and $A_{l:o}$ is discarded after it has been used by a subsequent layer. Thus, only the approximate activations $\tilde{A}_{l:2}$ and the (full-precision) variance vector $\sigma_l^2$ are retained in memory for backpropagation.
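We use a simple, computationally inexpensive approach to approximate $A_{l:2}$ via a $K$-bit fixed-point representation for some desired value of $K$. Since $A_{l:1}$ is normalized to be zero-mean and unit-variance, $A_{l:2}$ has mean $\beta_l$ and variance $\gamma_l^2$. We compute an integer tensor $\tilde{A}^*_{l:2}$ from $A_{l:2}$ as:

$$\tilde{A}^*_{l:2} = \text{Clip}_K\left(\lfloor A_{l:2} \circ 2^K (6\gamma_l)^{-1} \rfloor + 2^{K-1} - \lfloor \beta_l \circ 2^K (6\gamma_l)^{-1} \rfloor\right) \tag{9}$$

where $\lfloor\cdot\rfloor$ indicates the “floor” operator, and $\text{Clip}_K(x) = \max(0, \min(2^K - 1, x))$. The resulting integers (between $0$ and $2^K - 1$) can be directly stored with $K$ bits. When needed during backpropagation, we recover a floating-point tensor holding the approximate activations $\tilde{A}_{l:2}$ as:

$$\tilde{A}_{l:2} = 2^{-K} (6\gamma_l) \circ \left(\tilde{A}^*_{l:2} + 0.5 - 2^{K-1} + \lfloor \beta_l \circ 2^K (6\gamma_l)^{-1} \rfloor\right) \tag{10}$$

This simply has the effect of clipping $A_{l:2}$ to the range $\beta_l \pm 3\gamma_l$ (the range may be slightly asymmetric around $\beta_l$ because of rounding), and quantizing values in $2^K$ fixed-size intervals (to the median of each interval) within that range. However, crucially, this approximation ensures that the sign of each value is preserved, i.e., $\delta(A_{l:2} > 0) = \delta(\tilde{A}_{l:2} > 0)$.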
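The conversion to and from the fixed-point representation can be sketched as follows. This is one plausible reading of the scheme, not a reference implementation: it assumes the clipping range spans three standard deviations on either side of the per-channel mean, that the scale `gamma` is positive, and that activations are stored as (batch by channels) lists. The names `quantize` and `dequantize` are hypothetical.

```python
import math

def quantize(a2, gamma, beta, k):
    """Map pre-ReLU activations to integers in [0, 2**k - 1], per channel."""
    top = 2 ** k - 1
    q = []
    for row in a2:
        q_row = []
        for v, g, b in zip(row, gamma, beta):
            scale = 2 ** k / (6.0 * g)  # 2**k bins across a width of 6 * gamma
            shift = 2 ** (k - 1) - math.floor(b * scale)
            q_row.append(max(0, min(top, math.floor(v * scale) + shift)))
        q.append(q_row)
    return q

def dequantize(q, gamma, beta, k):
    """Recover floating-point values at the midpoint of each bin."""
    a2_approx = []
    for row in q:
        out_row = []
        for qv, g, b in zip(row, gamma, beta):
            scale = 2 ** k / (6.0 * g)
            shift = 2 ** (k - 1) - math.floor(b * scale)
            out_row.append((qv - shift + 0.5) / scale)
        a2_approx.append(out_row)
    return a2_approx
```

Because bin edges are aligned so that zero is always an edge, a value that is not clipped is mapped to a midpoint on the same side of zero, and its error is at most half a bin width.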
3.3 Approximation Error in Training
Since the forward computations happen in full precision, there is no error introduced in any of the activations prior to approximation. To analyze the error introduced by our approach, we then consider the effect of using the approximation $\tilde{A}_{l:2}$ instead of the exact pre-ReLU activations $A_{l:2}$ (and equivalently, $\tilde{A}_{l:1}$ and $\tilde{A}_{l:3}$ derived from $\tilde{A}_{l:2}$) to compute gradients in (5)-(8). We begin by noting that for all values of $A_{l:2}$ that fall within the range $\beta_l \pm 3\gamma_l$ (and are therefore not clipped), the worst-case approximation error in the activations themselves is bounded by half the width of the quantization intervals:

$$|\tilde{A}_{l:2} - A_{l:2}| \le 3 \cdot 2^{-K} \gamma_l = 3 \cdot 2^{-K} \sqrt{\text{Var}(A_{l:2})} \tag{11}$$

where Var denotes per-channel variance (and the RHS is interpreted as applying to all channels). Hence, the approximation error is a fraction of the standard deviation of the activations themselves, and is lower for higher values of $K$. It is easy to see that $|\tilde{A}_{l:3} - A_{l:3}| \le |\tilde{A}_{l:2} - A_{l:2}|$, since $\tilde{A}_{l:3}$ and $A_{l:3}$ are derived from $\tilde{A}_{l:2}$ and $A_{l:2}$ by clipping negative values of both to 0, which only decreases the error. Further, since $A_{l:1}$ is related to $A_{l:2}$ by simply scaling, the error in $\tilde{A}_{l:1}$ is also bounded as a fraction of its standard deviation, which is one, i.e., $|\tilde{A}_{l:1} - A_{l:1}| \le 3 \cdot 2^{-K}$.
We next examine how these errors in the activations affect the accuracy of gradient computations in (5)-(8). During the first backpropagation step in (5) through the linear transform, the gradient $\nabla W_l$ to the learnable transform weights will be affected by the approximation error in $\tilde{A}_{l:3}$. However, the gradient $\nabla A_{l:3}$ can be computed exactly (as a function of the incoming gradient $\nabla A_{l:o}$ to the layer), because it does not depend on the activations. Backpropagation through the ReLU in (6) is also not affected, because it depends only on the sign of the activations, which is preserved by our approximation. When backpropagating through the scale and bias in (7), only the gradient $\nabla\gamma_l$ to the scale depends on the activations, but the gradients to the bias $\beta_l$ and to $A_{l:1}$ can be computed exactly.

And so, while our approximation introduces some error in the computations of $\nabla W_l$ and $\nabla\gamma_l$, the gradient flowing towards the input of the layer is exact until it reaches the batch-normalization operation in (8). Here, we do incur an error, but note that this is only in one of the three terms of the expression for $\nabla A_{l:i}$: the term that accounts for backpropagating through the variance computation, which is the only term that depends on the activations. Hence, while our activation approximation does introduce some errors in the gradients for the learnable weights, we limit the accumulation of these errors across layers because a majority of the computations for backpropagation to the input of each layer are exact. This is illustrated in Fig. 1, with the use of green arrows to show computations that are exact, and red arrows for those affected by the approximation.
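A small numerical check makes the split between exact and approximate paths concrete. The sketch below backpropagates through a ReLU and a linear transform for a single sample, once with exact pre-ReLU values and once with a perturbed copy that keeps every sign. The values are hand-picked for illustration and the helper name is hypothetical.

```python
def relu_linear_backward(a2, weight, g_out):
    """Backward through ReLU then linear for one sample (vectors as lists)."""
    c, c_out = len(a2), len(g_out)
    a3 = [max(0.0, v) for v in a2]
    # The weight gradient uses the (possibly approximate) activations.
    g_w = [[a3[j] * g_out[k] for k in range(c_out)] for j in range(c)]
    # The gradient to the ReLU output uses only the weights.
    g3 = [sum(g_out[k] * weight[j][k] for k in range(c_out)) for j in range(c)]
    # The gradient through the ReLU uses only the sign of a2.
    g2 = [g3[j] if a2[j] > 0 else 0.0 for j in range(c)]
    return g2, g_w

weight = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
g_out = [1.0, -1.0]
a2_exact = [0.50, -0.20, 1.30]
a2_approx = [0.5625, -0.1875, 1.3125]  # same signs, small perturbation

g2_exact, gw_exact = relu_linear_backward(a2_exact, weight, g_out)
g2_approx, gw_approx = relu_linear_backward(a2_approx, weight, g_out)
```

Here the gradient passed to the previous operation is identical in both runs, while the weight gradient differs in proportion to the perturbation of the activations.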
3.4 Network Architectures and Memory Usage
Our full training algorithm applies our approximation strategy to every layer (defined by grouping linear transforms with preceding non-linear activations) during the forward and backward pass. Skip and residual connections are handled easily, since backpropagation through these connections involves simply copying to and adding gradients from both paths, and doesn’t involve the activations themselves. (Although we do not consider this in our implementation, older residual connections that are added after batch-normalization but before the ReLU can also be handled, but would require saving activations both before and after addition, in the traditional case as well as in our approach.)
Our method is predicated on the use of ReLU activations, since the ReLU gradient depends only on the sign of the activations, and it can also be used with other such non-linearities, for example “leaky” ReLUs. Other activations (like sigmoid) may incur additional errors. In particular, we do not approximate the activations of the final output layer in classifier networks that go through a SoftMax. However, since this is typically at the final layer, and computing these activations is immediately followed by backpropagating through that layer, approximating these activations offers no savings in memory. Our approach also handles average pooling by simply folding it in with the linear transform. For max-pooling, exact backpropagation through the pooling operation would require storing the arg-max indices (the number of bits required to store these would depend on the max-pool receptive field size). However, since max-pool layers are used less often in recent architectures in favor of learned downsampling (ResNet architectures for image classification use max-pooling only in one layer), we instead choose not to approximate layers with max-pooling for simplicity.
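To give a sense of the arg-max storage mentioned above, a back-of-the-envelope count is the number of bits needed to index one position inside the pooling window. This is an illustrative calculation under that assumption, not something the method implements; the function name is hypothetical.

```python
def argmax_index_bits(field_h, field_w):
    """Bits needed to index one position in a field_h x field_w pooling window."""
    positions = field_h * field_w
    # (positions - 1).bit_length() equals ceil(log2(positions)) for positions >= 1.
    return (positions - 1).bit_length()
```

Under this count a 2x2 window needs 2 bits per pooled output and a 3x3 window needs 4.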
Given a network with $L$ layers, our memory usage depends on connectivity for these layers. Our approach requires storing the approximate activations for each layer, each occupying memory at a reduced fractional rate of $\alpha < 1$. During the forward pass, we also need to store, at full precision, those activations that are yet to be used by subsequent layers. This is one layer’s activations for feed-forward networks, and two layers’ for standard residual architectures. More generally, we will need to store activations for up to $W$ layers, where $W$ is the “width” of the architecture, which we define as the maximum number of outstanding layer activations that remain to be used as we process layers in sequence. During backpropagation, the same amount of space is required for storing gradients until they are used by previous layers. We also need space to recreate a layer’s approximate activations as full-precision tensors from the low-bit stored representation, for use in computation.

Thus, assuming that all activations of layers are the same size, our algorithm requires $O(W + 1 + \alpha L)$ memory, compared to the standard requirement of $O(L)$. This leads to substantial savings for deep networks with large $L$, since $\alpha = 1/8$ when approximating 32-bit floating-point activations with $K = 4$ bits.
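The accounting above can be turned into a quick estimate. The sketch below counts memory in units of one layer's full-precision activations, assuming equal-sized layers as in the text: the stored low-bit copies for every layer, the full-precision buffers for outstanding activations, and one extra buffer for recreating a layer's activations. The function and parameter names are hypothetical, and weights and optimizer state are ignored.

```python
def activation_memory_units(num_layers, width, bits, full_bits=32):
    """Estimate activation memory in units of one full-precision layer.

    Returns (approximate_training, exact_training).
    """
    alpha = bits / full_bits                  # fractional storage rate per layer
    approx = width + 1 + alpha * num_layers   # buffers + scratch + low-bit copies
    exact = num_layers                        # one full-precision tensor per layer
    return approx, exact

approx_units, exact_units = activation_memory_units(num_layers=164, width=2, bits=4)
```

For a 164-layer residual network with 4-bit storage this gives 23.5 units against 164, and the gap widens as depth grows because the constant buffer terms do not scale with the number of layers.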
4 Experiments
We developed a library that implements the proposed method for approximate memory-efficient training, given a network architecture specification which can include residual layers (i.e., an architecture width of two). As illustrated in Fig. 1, the method allocates a pair of global buffers for the direct and residual paths that is common to all layers. At any point during the forward pass, these buffers hold the full-precision activations that are needed for computation of subsequent layers. The same buffers are used to store gradients during the backward pass. Beyond these common buffers, the library only stores the low-precision approximate activations for each layer for use during the backward pass. Further details of our implementation are provided in the appendix.
We compare our approximate training approach, with 8-bit and 4-bit activations, to exact training with full-precision activations. For a fair comparison, we again only store one set of activations (like our method, but with full precision) for a group of batch-normalization, ReLU, and linear (convolution) operations. This is achieved with our library by storing the pre-ReLU activations without approximation.
As a baseline approach, we also implement a naive approximation-based training algorithm that replaces activations with low-precision versions during the forward pass. We do this conservatively: at each layer, all computations are carried out in full precision, and the activations are only approximated right before the ReLU in a manner identical to our method. However, unlike our method, this baseline uses this approximated version of the activations as input to the subsequent convolution operation. Note that we use approximate activations only for training, and use exact activations for computing and reporting test errors for models trained with this baseline.
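Table 1: Test and validation errors for exact training, the naive approximation baseline, and the proposed method.

| Method | CIFAR-10 (ResNet-164), Test Set Error | CIFAR-100 (ResNet-164), Test Set Error | ImageNet (ResNet-152), Val Set Top-5 Error |
|---|---|---|---|
| Exact ($\alpha = 1$) | 5.36% ± 0.15 | 23.44% ± 0.26 | 7.20% |
| Naive 8-bit Approximation | 75.49% ± 9.09 | 95.41% ± 2.16 | - |
| Proposed Method, 8-bit ($\alpha = 1/4$) | 5.48% ± 0.13 | 23.63% ± 0.32 | 7.70% |
| Proposed Method, 4-bit ($\alpha = 1/8$) | 5.49% ± 0.16 | 23.58% ± 0.30 | 7.72% |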
CIFAR-10 and CIFAR-100. We begin with comparisons on 164-layer pre-activation residual networks (He et al., 2016b) on CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), using three-layer “bottleneck” residual units and parameter-free shortcuts for all residual connections. We train the network for 64k iterations with a batch size of 128, momentum of 0.9, and weight decay. Following He et al. (2016b), the learning rate is held at a lower warm-up value for the first 400 iterations, then increased, and dropped by a factor of 10 at 32k and 48k iterations. We use standard data augmentation with random translation and horizontal flips. We train these networks with our approach using 8-bit and 4-bit approximations, and measure degradation in accuracy with respect to exact training, repeating training for all cases with ten random seeds. We also include comparisons to the naive approximation baseline (with 8 bits). We visualize the evolution of training losses in Fig. 2, and report test set errors of the final model in Table 1.
We find that the training losses when using our low-memory approximation strategy closely follow those of exact backpropagation throughout the training process. Moreover, the final mean test errors of models trained with even 4-bit approximations (i.e., one eighth of the 32-bit storage) are only slightly higher than those trained with exact computations, with the difference being lower than the standard deviation across different initializations. In contrast, we find that training using the naive approximation baseline simply fails, highlighting the importance of preventing accumulation of errors across layers using our approach.
To examine the reason behind the robustness of our method, Fig. 3 visualizes the error in the final parameter gradients used to update the model. Specifically, we take two models for CIFAR-100, at the start and end of training, and then compute gradients for 100 batches with respect to the convolution kernels of all layers, both exactly and using our approximate strategy. We plot the average squared error between these gradients. We compare this approximation error to the “noise” inherent in SGD, due to the fact that each iteration considers a random batch of training examples. This is measured by the average variance between the (exact) gradients computed in the different batches. We see that our approximation error is between one and two orders of magnitude below the SGD noise for all layers, both at the start and end of training. So while we do incur an error due to approximation, this is added to the much higher error that already exists due to SGD even in exact training, and hence further degradation is limited.
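The two quantities compared in this analysis can be computed as below. This is a sketch under assumptions: gradients for one layer are flattened into plain lists, one list per batch, and the function names are hypothetical rather than taken from the authors' code.

```python
def approximation_error(exact_grads, approx_grads):
    """Average squared error between exact and approximate gradients."""
    total, count = 0.0, 0
    for g, h in zip(exact_grads, approx_grads):
        for x, y in zip(g, h):
            total += (x - y) ** 2
            count += 1
    return total / count

def sgd_noise(exact_grads):
    """Average per-coordinate variance of exact gradients across batches."""
    n, dim = len(exact_grads), len(exact_grads[0])
    var_sum = 0.0
    for j in range(dim):
        mean_j = sum(g[j] for g in exact_grads) / n
        var_sum += sum((g[j] - mean_j) ** 2 for g in exact_grads) / n
    return var_sum / dim

# Toy gradients for two batches of a two-parameter layer (made-up numbers).
exact = [[1.0, 0.0], [3.0, 2.0]]
approx = [[1.1, 0.0], [3.0, 1.9]]
err = approximation_error(exact, approx)
noise = sgd_noise(exact)
```

Comparing `err` against `noise` per layer gives the kind of ratio the figure reports; in this toy case the approximation error is far smaller than the batch-to-batch variance.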
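Table 2: Maximum batch size and per-sample runtime for exact and 4-bit approximate training.

| Measure | Training | ResNet-164 | ResNet-254 | ResNet-488 | ResNet-1001 | ResNet-1001-4x |
|---|---|---|---|---|---|---|
| Maximum Batch-size | Exact | 688 | 474 | 264 | 134 | 26 |
| Maximum Batch-size | 4-bit | 2522 | 2154 | 1468 | 876 | 182 |
| Runtime per Sample | Exact | 4.1 ms | 6.5 ms | 13.3 ms | 31.3 ms | 130.8 ms |
| Runtime per Sample | 4-bit | 4.3 ms | 6.7 ms | 12.7 ms | 26.0 ms | 101.6 ms |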
ImageNet. We also report results on training models for ImageNet (Russakovsky et al., 2015). Here, we use a 152-layer residual architecture, again using three-layer bottleneck units and pre-activation parameter-free shortcuts. We train this network with a batch size of 256 for a total of 640k iterations with a momentum of 0.9, weight decay, and standard scale, color, flip, and translation augmentation. The learning rate is dropped by a factor of 10 every 160k iterations. Figure 2 shows the evolution of training loss in this case as well, and Table 1 reports top-5 validation error (using 10 crops at a scale of 256) for models trained using exact computation, and our approach with 8-bit and 4-bit approximations. As with the CIFAR experiments, training losses using our strategy closely follow that of exact training (interestingly, the loss using our method is slightly lower than that of exact training during the final iterations, although this is likely due to random initialization), and the drop in validation set accuracy is again relatively small: about 0.5 percentage points (7.72% vs. 7.20% top-5 error in Table 1) for a memory savings factor of 8 in stored activations with 4-bit approximations.
Memory and Computational Efficiency. For the CIFAR experiments, the full-size batches were able to fit on a single 1080Ti GPU for both exact training and our method. In this case, the batches were also large enough to saturate all GPU cores for both our method and exact training. The running times per iteration were almost identical, with a very slight increase in our case due to the cost of computing approximations: exact vs approximate (4-bit) training took 0.66 seconds vs 0.72 seconds for CIFAR-100. For ImageNet training, we parallelized computation across two GPUs, and while our method was able to fit half a batch (size 128) on each GPU, exact training required two forward-backward passes (followed by averaging gradients) with batches of size 64 per GPU per pass. In this case, the smaller batches for exact training under-utilized the available parallelism on each GPU compared to our method, and this time our approach had a slight computational advantage: the per-iteration time (across two GPUs) was 2s vs 1.7s for exact vs approximate (4-bit) training using our method.
However, these represent comparisons restricted to having the same total batch size (needed to evaluate relative accuracy). For a more precise evaluation of memory usage, and the resulting computational efficiency from parallelism, we considered residual networks for CIFAR-10 of various depths up to 1001 layers, and additionally for the deepest network, a version with four times as many feature channels in each layer. For each network, we measured the largest batch size that could be fit in memory with our method (with 4-bit approximations) vs exact training, i.e., the size beyond which adding one more sample caused an out-of-memory error on a 1080Ti GPU. We also measured the corresponding wall-clock training time per sample, computed as the training time per iteration divided by this batch size. These results are summarized in Table 2. We find that in all cases, our method allows significantly larger batches to be fit in memory. Moreover, for larger networks, our method also yields a notable computational advantage, since larger batches permit full exploitation of available cores on the GPU.
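The relative gains implied by Table 2 can be read off with a short calculation. The numbers below are copied from the table; the dictionary layout and variable names are choices made for this sketch.

```python
# (exact, 4-bit) pairs copied from Table 2.
max_batch = {
    "ResNet-164": (688, 2522),
    "ResNet-254": (474, 2154),
    "ResNet-488": (264, 1468),
    "ResNet-1001": (134, 876),
    "ResNet-1001-4x": (26, 182),
}
runtime_ms = {
    "ResNet-164": (4.1, 4.3),
    "ResNet-254": (6.5, 6.7),
    "ResNet-488": (13.3, 12.7),
    "ResNet-1001": (31.3, 26.0),
    "ResNet-1001-4x": (130.8, 101.6),
}

# How many times larger a batch fits with 4-bit activations.
batch_gain = {name: approx / exact for name, (exact, approx) in max_batch.items()}
# Per-sample speedup of 4-bit over exact training (values above 1 favor 4-bit).
speedup = {name: exact / approx for name, (exact, approx) in runtime_ms.items()}

for name in max_batch:
    print(f"{name}: batch x{batch_gain[name]:.2f}, speedup x{speedup[name]:.2f}")
```

The batch-size gain grows with depth, and the per-sample runtime moves from a slight overhead on the two shallowest networks to a clear advantage on the three larger ones.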
5 Conclusion
We introduced a new algorithm for approximate gradient computation in neural network training that significantly reduces the amount of required on-device memory. Our experiments show that this comes at a minimal cost in terms of both quality of the learned models, and computational expense. With a lower memory footprint, our method allows training with larger batches in each iteration (improving efficiency and stability) and exploration of deeper architectures that were previously impractical to train. Our reference implementation is available at http://projects.ayanc.org/blpa/.
Our method shows that SGD is reasonably robust to working with approximate activations. While we used an extremely simple approximation strategy (uniform quantization) in this work, we are interested in exploring whether more sophisticated techniques (e.g., based on random projections or vector quantization) can provide better trade-offs, especially if informed by statistics of gradients and errors from prior iterations. It is also worth investigating whether our approach to partial approximation can be utilized in other settings, for example, to reduce inter-device communication for distributed training with data or model parallelism.
Acknowledgments
A. Chakrabarti acknowledges support from NSF grant IIS-1820693. B. Moseley was supported in part by a Google Research Award and NSF grants CCF-1830711, CCF-1824303, and CCF-1733873.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695, 2016.
 Chen et al. (2016) Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
 Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012.

 Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2211–2221, 2017.
 Gruslys et al. (2016) Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., and Graves, A. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems, pp. 4125–4133, 2016.
 Gupta et al. (2015) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In Proc. International Conference on Machine Learning (ICML), pp. 1737–1746, 2015.
 Han et al. (2015) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

He et al. (2016a)
He, K., Zhang, X., Ren, S., and Sun, J.
Deep residual learning for image recognition.
In
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
, pp. 770–778, 2016a.  He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In Proc. the European Conference on Computer Vision (ECCV), 2016b.
 Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 Hubara et al. (2017) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research (JMLR), 2017.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning (ICML), pp. 448–456, 2015.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, 2009.
 Martens & Sutskever (2012) Martens, J. and Sutskever, I. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade - Second Edition, pp. 479–535. 2012.
 Micikevicius et al. (2017) Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
 Recht et al. (2011) Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.
 Robbins & Monro (1985) Robbins, H. and Monro, S. A stochastic approximation method. In Herbert Robbins Selected Papers, pp. 102–109. Springer, 1985.
 Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 2015.
 Seide et al. (2014) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. Interspeech, 2014.
 Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 2017.
Appendix: Implementation Details
We implemented our approximate training algorithm using the TensorFlow library (Abadi et al., 2016). However, we used only TensorFlow's functions for individual forward and gradient computations, and did not rely on its automatic differentiation functionality. Instead, our library allows specifying general residual network architectures and, based on this specification, creates a set of TensorFlow ops for the forward and backward passes through each layer. We also used custom ops implemented in CUDA to handle the conversion to and from low-precision representations. Each layer's forward and backward ops are called in separate sess.run calls, and all data that needs to persist between calls (including full-precision activations and gradients still to be used in the forward and backward passes, and approximate intermediate activations) is stored explicitly in TensorFlow variables.
For the forward and backward passes through the network, we call these ops in sequence, followed by ops to update the model parameters based on the computed gradients. We chose not to allocate and then free variables for the full-precision layer activations and gradients, since this caused memory fragmentation with TensorFlow's memory management routines. Instead, as described in Sec. 4, we used two common variables as buffers for all layers to hold activations (in the forward pass) and gradients (in the backward pass) for the direct and residual paths in the network, respectively. We reuse these buffers by overwriting old activations and gradients with new ones. The size of these buffers is set based on the largest layer, and we used slices of these buffers for smaller layers.
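The buffer-reuse scheme can be illustrated with a minimal sketch. This is not the authors' implementation: it uses NumPy arrays in place of TensorFlow variables, and the names LAYER_SHAPES, SharedBuffer, and forward, the layer shapes, and the toy layer arithmetic are all assumptions made for illustration. It shows only the allocation pattern: two buffers sized once for the largest layer, with smaller layers writing into slices of them.

```python
import numpy as np

# Hypothetical per-layer activation shapes (batch, height, width, channels).
LAYER_SHAPES = [(8, 32, 32, 16), (8, 16, 16, 32), (8, 8, 8, 64)]


class SharedBuffer:
    """One flat, preallocated buffer that every layer overwrites in turn."""

    def __init__(self, shapes, dtype=np.float32):
        # Allocate once, sized for the largest layer, and never free.
        largest = max(int(np.prod(s)) for s in shapes)
        self.flat = np.empty(largest, dtype=dtype)

    def view(self, shape):
        """Return a view (no copy) over the first prod(shape) elements."""
        n = int(np.prod(shape))
        return self.flat[:n].reshape(shape)


def forward(x, direct, residual, shapes):
    """Toy forward pass that writes every layer's output into the buffers."""
    for shape in shapes:
        d = direct.view(shape)
        r = residual.view(shape)
        d[...] = x.mean()  # stand-in for the direct path of a real layer
        r[...] = 0.5 * d   # stand-in for the residual branch
        d += r             # merge the residual into the direct path in place
        x = d              # the next layer overwrites this same memory
    return x


direct_buf = SharedBuffer(LAYER_SHAPES)
residual_buf = SharedBuffer(LAYER_SHAPES)
out = forward(np.ones(LAYER_SHAPES[0], dtype=np.float32),
              direct_buf, residual_buf, LAYER_SHAPES)
```

Because each view aliases the same underlying memory, no allocation happens inside the loop, which is the property that avoids fragmentation in this sketch. In a real backward pass the same two buffers would hold gradients instead of activations.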