Backprop with Approximate Activations for Memory-efficient Network Training

01/23/2019 ∙ by Ayan Chakrabarti, et al. ∙ Washington University in St Louis

Larger and deeper neural network architectures deliver improved accuracy on a variety of tasks, but also require a large amount of memory for training to store intermediate activations for back-propagation. We introduce an approximation strategy to significantly reduce this memory footprint, with minimal effect on training performance and negligible computational cost. Our method replaces intermediate activations with lower-precision approximations to free up memory, after the full-precision versions have been used for computation in subsequent layers in the forward pass. Only these approximate activations are retained for use in the backward pass. Compared to naive low-precision computation, our approach limits the accumulation of errors across layers and allows the use of much lower-precision approximations without affecting training accuracy. Experiments on CIFAR and ImageNet show that our method yields performance comparable to full-precision training, while storing activations at a fraction of the memory cost with 8- and even 4-bit fixed-point precision.




1 Introduction

Deeper neural network models are able to express more complex functions, and recent results have shown that with the use of residual (He et al., 2016a) and skip (Huang et al., 2017) connections to address vanishing gradients, such networks can be trained effectively to leverage this additional capacity. As a result, the use of deeper network architectures has become prevalent, especially for visual inference tasks (He et al., 2016b). The shift to larger architectures has delivered significant improvements in performance, but also increased demand on computational resources. In particular, deeper network architectures require significantly more on-device memory during training—much more so than for inference. This is because training requires retaining the computed activations of all intermediate layers since they are needed to compute gradients during the backward pass.

The increased memory footprint means fewer training samples can fit in memory and be processed as a batch on a single GPU. This is inefficient: smaller batches are not able to saturate all available parallel cores, especially because computation in “deeper” architectures is distributed to be more sequential. Smaller batches also complicate the use of batch-normalization (Ioffe & Szegedy, 2015), since batch statistics are now computed over fewer samples, making training less stable. These considerations often force the choice of architecture to be based not just on optimality for inference, but also on practical feasibility for training. For instance, deep residual networks for large images drop resolution early, so that most layers have smaller sized outputs.

While prior work to address this has traded off memory for computation (Martens & Sutskever, 2012; Chen et al., 2016; Gruslys et al., 2016; Gomez et al., 2017), its focus has been on enabling exact gradient computation. However, since stochastic gradient descent (SGD) inherently works with noisy gradients at each iteration, we propose an algorithm that computes reasonably approximate gradients, while significantly reducing a network’s memory footprint and with virtually no additional computational cost. Our work is motivated by distributed training algorithms that succeed despite working with approximate and noisy gradients aggregated across multiple devices (Recht et al., 2011; Dean et al., 2012; Seide et al., 2014; Wen et al., 2017). We propose using low-precision approximate activations, which require less memory, to compute approximate gradients during back-propagation (backprop) on a single device. Note that training with a lower precision of 16- instead of 32-bit floating-point representations is not uncommon. But this lower precision is used for all computation, and thus allows only for a modest lowering of precision, since the approximation error builds up across the forward and then backward pass through all layers.

Figure 1: Proposed Approach. We show the computations involved in the forward and backward pass during network training for a single “pre-activation” layer, with possible residual connections. The forward pass is exact, but we discard full-precision activations right after use by subsequent layers (we store these in common global buffers, and overwrite activations once they have been used and are no longer needed for forward computation). Instead, we store a low-precision approximation of the activations which occupies less memory, and use these during back-propagation. Our approach limits errors in the gradient flowing back to the input of a layer, and thus accumulation of errors across layers. Indeed, since our approximation preserves signs (needed to compute the ReLU gradient), most computations along the path back to the input are exact, with only the back-prop through the variance computation in batch-normalization being approximate.

In this work, we propose a new backprop implementation that performs the forward pass through the network at full-precision, and incurs limited approximation error during the backward pass. We use each layer’s full-precision activations to compute the activations of subsequent layers. However, once these activations have been used in the forward pass, our method discards them and stores a low-precision approximation instead. During the backward pass, gradients are propagated back through all the layers at full precision, but instead of using the original activations, we use their low-precision approximations. As a result, we incur an approximation error at each layer when computing the gradients to the weights from multiplying the incoming gradient with the approximate activations, but ensure the error in gradients going back to the previous layer is minimal.

Our experimental results show that even 4-bit fixed-point approximations of the original 32-bit floating-point activations cause only minor degradation in training quality. This significantly lowers the memory required for training, and the savings come essentially for “free”, incurring only the negligible additional computational cost of converting activations to and from low-precision representations. Our memory-efficient version of backprop is thus able to use larger batch sizes at each iteration (to fully use available parallelism and compute stable batch statistics), and makes it practical for researchers to explore the use of much larger and deeper architectures than before.

2 Related Work

A number of works focus on reducing the memory footprint of a model during inference, e.g., by compression (Han et al., 2015) and quantization (Hubara et al., 2017), to ensure that it can be deployed on resource-limited mobile devices, while still delivering reasonable accuracy. These methods still require storing full versions of model weights and activations during network training, and assume there is sufficient memory to do so. However, training requires significantly more memory than inference because of the need to store all intermediate activations. Memory can therefore be a bottleneck during training, especially with the growing preference for larger and deeper network architectures. The most common recourse is to simply use multiple GPUs during training. But this introduces the overhead of inter-device communication, and often under-utilizes the available parallelism on each device: computation in deeper architectures is distributed more sequentially and, without sufficient data parallelism, often does not saturate all GPU cores.

A popular strategy to reduce training memory requirements is “checkpointing”. Activations for only a subset of layers are stored at a time, and the rest recovered by repeating forward computations (Martens & Sutskever, 2012; Chen et al., 2016; Gruslys et al., 2016). This affords memory savings with the trade-off of additional computational cost—e.g., Chen et al. (2016) propose a strategy that requires memory proportional to the square-root of the number of layers, while requiring up to the computational cost of an additional forward pass. In a similar vein, Gomez et al. (2017) considered network architectures with “reversible” or invertible layers to allow re-computing input activations of such layers from their outputs during the backward pass.

These methods likely represent the best possible solutions if the goal is restricted to computing exact gradients. But SGD is fundamentally a noisy process, and the exact gradients computed over a batch at each iteration are already an approximation of the gradients of the model over the entire training set (Robbins & Monro, 1985). Researchers have posited that further approximations are possible without degrading training ability, and used this to realize gains in efficiency. For distributed training, asynchronous methods (Recht et al., 2011; Dean et al., 2012) delay synchronizing models across devices to mitigate communication latency. Despite each device now working with stale models, there is no major degradation in training performance. Other methods quantize gradients to two (Seide et al., 2014) or three levels (Wen et al., 2017) so as to reduce communication overhead, and again find that training remains robust to such approximation. Our work also adopts an approximation strategy for gradient computation, but targets the problem of memory usage on each device. We approximate activations, rather than gradients, with lower-precision representations, and by doing so, we are able to achieve considerable reductions in a model’s memory footprint during training. Note that since our method achieves a constant factor saving in memory for back-propagation across any group of layers, it can also be employed within checkpointing to further improve memory cost.

It is worth differentiating our work from those that carry out all training computations at lower precision (Micikevicius et al., 2017; Gupta et al., 2015). This strategy allows for a modest loss in precision: from 32- to 16-bit representations. In contrast, our approach allows for a much greater reduction in precision. This is because we carry out the forward pass in full precision, and approximate activations only after they have been used by subsequent layers. Our strategy limits accumulation of errors across layers, and we are able to replace 32-bit floats with 8- and even 4-bit fixed-point approximations, with little to no effect on training performance. Note that performing all computation at lower precision also has a computational advantage: due to reduced on-device memory bandwidth usage (transferring data from global device memory to registers) in Micikevicius et al. (2017), and due to the use of specialized hardware in Gupta et al. (2015). While the goal of our work is different, our strategy can be easily combined with these ideas: compressing intermediate activations to a greater degree, while also using 16- instead of 32-bit precision for computation.

3 Proposed Method

We now describe our approach to memory-efficient training. We begin by reviewing the computational steps in the forward and backward pass for a typical network layer, and then describe our approximation strategy for reducing the memory required to store intermediate activations.

3.1 Background

A neural network is a composition of linear and non-linear functions that map the input to the final desired output. These functions are often organized into “layers”, where each layer consists of a single linear transformation (typically a convolution or a matrix multiply) and a sequence of non-linearities. We use the “pre-activation” definition of a layer, where we group the linear operation with the non-linearities that immediately precede it. Consider a typical network whose $l$-th layer applies batch-normalization and ReLU to its input $A_l^i$ followed by a linear transform:

$A_{l:1} = (\sigma_l^2 + \epsilon)^{-1/2} \circ (A_l^i - \mu_l), \quad \mu_l = \mathrm{Mean}(A_l^i), \quad \sigma_l^2 = \mathrm{Var}(A_l^i)$ [B.Norm.] (1)
$A_{l:2} = \gamma_l \circ A_{l:1} + \beta_l$ [Sc.&B.] (2)
$A_{l:3} = \max(0, A_{l:2})$ [ReLU] (3)
$A_l^o = A_{l:3} \times W_l$ [Linear] (4)

to yield the output activations $A_l^o$ that are fed into subsequent layers. Here, each activation is a tensor with two or four dimensions: the first indexing different training examples, the last corresponding to “channels”, and others to spatial location. Mean and Var aggregate statistics over batch and spatial dimensions, to yield vectors $\mu_l$ and $\sigma_l^2$ with per-channel means and variances. Element-wise addition and multiplication (denoted by $\circ$) are carried out by “broadcasting” when the tensors are not of the same size. The final operation represents the linear transformation, with $\times$ denoting matrix multiplication. This linear transform can also correspond to a convolution.

Note that (1)-(4) are defined with respect to learnable parameters $\gamma_l$, $\beta_l$, and $W_l$, where $\gamma_l, \beta_l$ are both vectors of the same length as the number of channels in $A_l^i$, and $W_l$ denotes a matrix (for fully-connected layers) or elements of a convolution kernel. These parameters are learned iteratively using SGD, where at each iteration, they are updated based on gradients ($\nabla\gamma_l$, $\nabla\beta_l$, and $\nabla W_l$) of some loss function computed on a batch of training samples.

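To compute gradients with respect to all parameters for all layers in the network, the training algorithm first computes activations for all layers in sequence, ordered such that each layer in the sequence takes as input the output from a previous layer. The loss is computed with respect to activations of the final layer, and then the training algorithm goes through all layers again in reverse sequence, using the chain rule to back-propagate gradients of this loss. For the $l$-th layer, given the gradients $\nabla A_l^o$ of the loss with respect to the output, this involves computing gradients $\nabla\gamma_l$, $\nabla\beta_l$, and $\nabla W_l$ with respect to the layer’s learnable parameters, as well as gradients $\nabla A_l^i$ with respect to its input for further propagation. These gradients are given by:

$\nabla W_l = A_{l:3}^T \times (\nabla A_l^o), \quad \nabla A_{l:3} = (\nabla A_l^o) \times W_l^T$ [Linear] (5)
$\nabla A_{l:2} = \delta(A_{l:2} > 0) \circ (\nabla A_{l:3})$ [ReLU] (6)
$\nabla\beta_l = \mathrm{Sum}(\nabla A_{l:2}), \quad \nabla\gamma_l = \mathrm{Sum}(A_{l:1} \circ (\nabla A_{l:2})), \quad \nabla A_{l:1} = \gamma_l \circ \nabla A_{l:2}$ [Sc.&B.] (7)
$\nabla A_l^i = (\sigma_l^2 + \epsilon)^{-1/2} \circ \left[ \nabla A_{l:1} - \mathrm{Mean}(\nabla A_{l:1}) - A_{l:1} \circ \mathrm{Mean}(A_{l:1} \circ \nabla A_{l:1}) \right]$ [B.Norm.] (8)

where Sum and Mean again aggregate over all but the last dimension, and $\delta(A > 0)$ is a tensor the same size as $A$ that is 1 where the values in $A$ are positive, and 0 otherwise.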
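The forward and backward computations for one pre-activation layer can be sketched in NumPy for the fully-connected case. This is a minimal illustration rather than the authors' library: the function names, the cache layout, and the value of `EPS` are assumptions made for the example.

```python
import numpy as np

EPS = 1e-5  # stability constant inside the square root (assumed value)

def layer_forward(a_in, gamma, beta, w):
    """Batch-norm, scale and bias, ReLU, then linear, for a (batch, channels) input."""
    mean = a_in.mean(axis=0)                  # per-channel mean over the batch
    var = a_in.var(axis=0)                    # per-channel (biased) variance
    a1 = (a_in - mean) / np.sqrt(var + EPS)   # batch-normalization
    a2 = gamma * a1 + beta                    # scale and bias
    a3 = np.maximum(a2, 0.0)                  # ReLU
    a_out = a3 @ w                            # linear transform
    return a_out, (a1, a2, a3, var)

def layer_backward(g_out, cache, gamma, w):
    """Gradients w.r.t. the weights, bias, scale, and the layer input."""
    a1, a2, a3, var = cache
    g_w = a3.T @ g_out                        # uses activation values
    g_a3 = g_out @ w.T                        # independent of activations
    g_a2 = (a2 > 0) * g_a3                    # uses only activation signs
    g_beta = g_a2.sum(axis=0)
    g_gamma = (a1 * g_a2).sum(axis=0)         # uses activation values
    g_a1 = gamma * g_a2
    g_in = (g_a1 - g_a1.mean(axis=0)
            - a1 * (a1 * g_a1).mean(axis=0)) / np.sqrt(var + EPS)
    return g_w, g_beta, g_gamma, g_in

# Small usage example with random data.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
w = rng.normal(size=(3, 2))
out, cache = layer_forward(x, gamma, beta, w)
g_w, g_beta, g_gamma, g_in = layer_backward(np.ones_like(out), cache, gamma, w)
print(out.shape, g_w.shape, g_in.shape)
```

The comments mark which gradient terms read activation values, which read only signs, and which need no activations at all; that distinction is what the approximation strategy in the next section relies on.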

When the goal is to just compute the final output of the network, the activations of an intermediate layer can be discarded during the forward pass as soon as we finish processing the subsequent layer or layers that use it as input. However, we need to store all intermediate activations during training because they are needed to compute gradients during back-propagation: (5)-(8) involve not just the values of the incoming gradient, but also the values of the activations themselves. Thus, training requires enough available memory to hold the activations of all layers in the network.

3.2 Back-propagation with Approximate Activations

We begin by observing that we do not necessarily need to store all intermediate activations $A_{l:1}$, $A_{l:2}$, and $A_{l:3}$ within a layer. For example, it is sufficient to store the activation values $A_{l:2}$ right before the ReLU, along with the variance vector $\sigma_l^2$ (which is typically much smaller than the activations themselves). Given $A_{l:2}$, we can reconstruct the other activations $A_{l:1}$ and $A_{l:3}$ needed in (5)-(8) using element-wise operations, which typically have negligible computational cost. Some deep learning frameworks already use such “fused” layers to conserve memory, and we use this to measure memory usage for exact training.

Storing one activation tensor at full-precision for every layer still requires a considerable amount of memory. We therefore propose retaining an approximate low-precision version $\tilde{A}_{l:2}$ of $A_{l:2}$, that requires much less memory for storage, for use in (5)-(8) during back-propagation. As shown in Fig. 1, we use full-precision versions of all activations during the forward pass to compute $A_l^o$ from $A_l^i$ as per (1)-(4), and use $A_{l:2}$ to compute its approximation $\tilde{A}_{l:2}$. The full-precision activations are discarded as soon as they have been used: the intermediate activations $A_{l:1}$, $A_{l:2}$, and $A_{l:3}$ are discarded as soon as the approximation $\tilde{A}_{l:2}$ and output $A_l^o$ have been computed, and $A_l^o$ is discarded after it has been used by a subsequent layer. Thus, only the approximate activations $\tilde{A}_{l:2}$ and (full-precision) variance vector $\sigma_l^2$ are retained in memory for back-propagation.

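We use a simple, computationally inexpensive approach to approximate $A_{l:2}$ via a $K$-bit fixed-point representation for some desired value of $K$. Since $A_{l:1}$ is normalized to be zero-mean and unit-variance, $A_{l:2}$ has mean $\beta_l$ and variance $\gamma_l^2$. We compute an integer tensor $\tilde{A}^*_{l:2}$ from $A_{l:2}$ as:

$\tilde{A}^*_{l:2} = \mathrm{Clip}_K\left( \lfloor A_{l:2} \circ 2^K (6\gamma_l)^{-1} \rfloor + 2^{K-1} - \lfloor \beta_l \circ 2^K (6\gamma_l)^{-1} \rfloor \right)$

where $\lfloor \cdot \rfloor$ indicates the “floor” operator, and $\mathrm{Clip}_K(x) = \max(0, \min(2^K - 1, x))$. The resulting integers (between $0$ and $2^K - 1$) can be directly stored with $K$ bits. When needed during back-propagation, we recover a floating-point tensor holding the approximate activations $\tilde{A}_{l:2}$ as:

$\tilde{A}_{l:2} = 2^{-K} (6\gamma_l) \circ \left( \tilde{A}^*_{l:2} + 0.5 - 2^{K-1} + \lfloor \beta_l \circ 2^K (6\gamma_l)^{-1} \rfloor \right)$

This simply has the effect of clipping $A_{l:2}$ to the range $\beta_l \pm 3\gamma_l$ (the range may be slightly asymmetric around $\beta_l$ because of rounding), and quantizing values in fixed-size intervals (to the median of each interval) within that range. However, crucially, this approximation ensures that the sign of each value is preserved, i.e., $\delta(A_{l:2} > 0) = \delta(\tilde{A}_{l:2} > 0)$.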
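The fixed-point conversion described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' library: the names `quantize` and `dequantize` are hypothetical, the clipping range of three standard deviations on either side of the mean is an assumption of the sketch, and each code is stored in a full byte rather than being bit-packed (so it needs `k <= 8`).

```python
import numpy as np

def quantize(a2, gamma, beta, k):
    """Map pre-ReLU activations to integer codes in [0, 2**k - 1].

    Intervals are 6*|gamma| / 2**k wide and aligned so that zero is an
    interval boundary, which keeps positive and negative values apart.
    """
    scale = 2.0 ** k / (6.0 * np.abs(gamma))   # intervals per unit of activation
    offset = np.floor(beta * scale)            # integer shift that centers the range
    codes = np.floor(a2 * scale) + 2 ** (k - 1) - offset
    return np.clip(codes, 0, 2 ** k - 1).astype(np.uint8)

def dequantize(codes, gamma, beta, k):
    """Recover floating-point values at the middle of each interval."""
    scale = 2.0 ** k / (6.0 * np.abs(gamma))
    offset = np.floor(beta * scale)
    return (codes.astype(np.float64) + 0.5 - 2 ** (k - 1) + offset) / scale

# Usage: per-channel scale and bias, activations of shape (batch, channels).
rng = np.random.default_rng(0)
gamma = rng.uniform(0.5, 1.5, size=4)
beta = rng.uniform(-1.0, 1.0, size=4)
a2 = gamma * rng.normal(size=(256, 4)) + beta
a2_hat = dequantize(quantize(a2, gamma, beta, k=4), gamma, beta, k=4)
print("signs preserved:", bool(np.array_equal(a2 > 0, a2_hat > 0)))
print("max abs error:", float(np.abs(a2 - a2_hat).max()))
```

In this sketch the sign of a clipped value is preserved as long as the bias stays inside the clipping range, which holds for the example values; values outside the range can have a larger error than unclipped ones.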

3.3 Approximation Error in Training

Since the forward computations happen in full-precision, there is no error introduced in any of the activations prior to approximation. To analyze the error introduced by our approach, we then consider the effect of using $\tilde{A}_{l:2}$ instead of $A_{l:2}$ (and equivalently, $\tilde{A}_{l:1}$ and $\tilde{A}_{l:3}$ derived from $\tilde{A}_{l:2}$) to compute gradients in (5)-(8). We begin by noting that for all values of $A_{l:2}$ that fall within the range $\beta_l \pm 3\gamma_l$ (and are therefore not clipped), the worst-case approximation error in the activations themselves is bounded by half the width of the quantization intervals:

$|A_{l:2} - \tilde{A}_{l:2}| \le 3 \cdot 2^{-K} \sqrt{\mathrm{Var}(A_{l:2})}$

where Var denotes per-channel variance (and the RHS is interpreted as applying to all channels). Hence, the approximation error is a fraction of the standard deviation of the activations themselves, and is lower for higher values of $K$. It is easy to see that $|A_{l:3} - \tilde{A}_{l:3}| \le |A_{l:2} - \tilde{A}_{l:2}|$, since $A_{l:3}$ and $\tilde{A}_{l:3}$ are derived from $A_{l:2}$ and $\tilde{A}_{l:2}$ by clipping negative values of both to 0, which only decreases the error. Further, since $A_{l:1}$ is related to $A_{l:2}$ by simply scaling, the error in $\tilde{A}_{l:1}$ is also bounded as a fraction of its standard deviation, which is one, i.e.: $|A_{l:1} - \tilde{A}_{l:1}| \le 3 \cdot 2^{-K}$.
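To put numbers on the half-interval bound, the following sketch evaluates it for the two bit-widths used in the experiments. The helper name `worst_case_error` is hypothetical, and the clipping range of three standard deviations on each side is an assumption of the sketch.

```python
def worst_case_error(k, clip_std=3.0):
    """Half the quantization interval width, in units of the activation std.

    Assumes 2**k uniform intervals spanning +/- clip_std standard
    deviations (clip_std=3 is an assumption of this sketch).
    """
    interval_width = 2.0 * clip_std / 2 ** k
    return interval_width / 2.0

for k in (8, 4):
    print(f"K={k}: error <= {worst_case_error(k)} standard deviations")
```

Under these assumptions, unclipped 8-bit activations are off by at most about 1.2% of a standard deviation, and 4-bit activations by at most about 19%.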

We next examine how these errors in the activations affect the accuracy of gradient computations in (5)-(8). During the first back-propagation step in (5) through the linear transform, the gradient $\nabla W_l$ to the learnable transform weights will be affected by the approximation error in $\tilde{A}_{l:3}$. However, the gradient $\nabla A_{l:3}$ can be computed exactly (as a function of the incoming gradient to the layer $\nabla A_l^o$), because it does not depend on the activations. Back-propagation through the ReLU in (6) is also not affected, because it depends only on the sign of the activations, which is preserved by our approximation. When back-propagating through the scale and bias in (7), only the gradient $\nabla\gamma_l$ to the scale depends on the activations, but gradients to the bias $\nabla\beta_l$ and to $A_{l:1}$ can be computed exactly.

And so, while our approximation introduces some error in the computations of $\nabla W_l$ and $\nabla\gamma_l$, the gradient flowing towards the input of the layer is exact, until it reaches the batch-normalization operation in (8). Here, we do incur an error, but note that this is only in one of the three terms of the expression for $\nabla A_l^i$: the term which accounts for back-propagating through the variance computation, and which is the only term that depends on the activations. Hence, while our activation approximation does introduce some errors in the gradients for the learnable weights, we limit the accumulation of these errors across layers because a majority of the computations for back-propagation to the input of each layer are exact. This is illustrated in Fig. 1, with the use of green arrows to show computations that are exact, and red arrows for those affected by the approximation.
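The analysis above can be checked numerically on a single fully-connected layer. The sketch below runs the same backward pass twice, once with exact pre-ReLU activations and once with a 4-bit approximation, and reports the largest difference for each gradient. All names are illustrative, the quantizer assumes a clipping range of three standard deviations, and the random data stands in for real activations.

```python
import numpy as np

def approx(a2, gamma, beta, k):
    """Quantize to k bits and immediately dequantize (assumed +/-3 std range)."""
    scale = 2.0 ** k / (6.0 * np.abs(gamma))
    offset = np.floor(beta * scale)
    codes = np.clip(np.floor(a2 * scale) + 2 ** (k - 1) - offset, 0, 2 ** k - 1)
    return (codes + 0.5 - 2 ** (k - 1) + offset) / scale

def backward(g_out, a2, gamma, beta, var, w, eps=1e-5):
    """Backward pass given only the pre-ReLU activations and the variance."""
    a1 = (a2 - beta) / gamma           # recovered element-wise from a2
    a3 = np.maximum(a2, 0.0)           # recovered element-wise from a2
    g_w = a3.T @ g_out
    g_a3 = g_out @ w.T
    g_a2 = (a2 > 0) * g_a3
    g_beta = g_a2.sum(axis=0)
    g_gamma = (a1 * g_a2).sum(axis=0)
    g_a1 = gamma * g_a2
    g_in = (g_a1 - g_a1.mean(axis=0)
            - a1 * (a1 * g_a1).mean(axis=0)) / np.sqrt(var + eps)
    return {"w": g_w, "a3": g_a3, "a2": g_a2, "beta": g_beta,
            "gamma": g_gamma, "a1": g_a1, "in": g_in}

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
gamma = rng.uniform(0.5, 1.5, size=5)
beta = rng.uniform(-1.0, 1.0, size=5)
w = rng.normal(size=(5, 3))
var = x.var(axis=0)
a2 = gamma * (x - x.mean(axis=0)) / np.sqrt(var + 1e-5) + beta
g_out = rng.normal(size=(64, 3))

exact = backward(g_out, a2, gamma, beta, var, w)
appr = backward(g_out, approx(a2, gamma, beta, k=4), gamma, beta, var, w)
for name in exact:
    print(name, float(np.abs(exact[name] - appr[name]).max()))
```

In this sketch the differences for `a3`, `a2`, `beta`, and `a1` are zero, while `w`, `gamma`, and `in` show a non-zero error, matching the split between exact and approximate computations described in the text.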

Figure 2: Approximate Training on CIFAR and ImageNet. We show the evolution of training losses for ResNet-164 models trained on CIFAR-10 and CIFAR-100, and ResNet-152 models trained on ImageNet with exact training and using our approximation strategy. CIFAR results are summarized across ten random initializations with bands depicting minimum and maximum loss values. We find that the loss using our method closely follows that of exact training through all iterations. For CIFAR, we also include results for training when using a “naive” 8-bit approximation baseline, where the approximate activations are also used in the forward pass. In this case, errors accumulate across layers and we find that training fails.

3.4 Network Architectures and Memory Usage

Our full training algorithm applies our approximation strategy to every layer (defined by grouping linear transforms with preceding non-linear activations) during the forward and backward pass. Skip and residual connections are handled easily, since back-propagation through these connections involves simply copying to and adding gradients from both paths, and doesn’t involve the activations themselves. (Although we do not consider this in our implementation, older residual connections that are added after batch-normalization but before the ReLU can also be handled, but would require saving activations both before and after addition, in the traditional case as well as in our approach.)

Our method is predicated on the use of ReLU activations, since the ReLU gradient depends only on the sign of the activations, and it can be used with other such non-linearities, e.g., “leaky”-ReLUs. Other activations (like sigmoid) may incur additional errors. In particular, we do not approximate the activations of the final output layer in classifier networks that go through a Soft-Max. However, since this is typically at the final layer, and computing these activations is immediately followed by back-propagating through that layer, approximating these activations offers no savings in memory. Our approach also handles average pooling by simply folding it in with the linear transform. For max-pooling, exact back-propagation through the pooling operation would require storing the arg-max indices (the number of bits required to store these would depend on the max-pool receptive field size). However, since max-pool layers are used less often in recent architectures in favor of learned downsampling (ResNet architectures for image classification use max-pooling only in one layer), we instead choose not to approximate layers with max-pooling for simplicity.

Given a network with $L$ layers, our memory usage depends on connectivity for these layers. Our approach requires storing the approximate activations for each layer, each occupying memory at a reduced fractional rate of $\alpha < 1$. During the forward pass, we also need to store, at full-precision, those activations that are yet to be used by subsequent layers. This is one layer’s activations for feed-forward networks, and two layers’ for standard residual architectures. More generally, we will need to store activations for up to $W$ layers, where $W$ is the “width” of the architecture, which we define as the maximum number of outstanding layer activations that remain to be used as we process layers in sequence. During back-propagation, the same amount of space is required for storing gradients till they are used by previous layers. We also need space to re-create a layer’s approximate activations as full-precision tensors from the low-bit stored representation, for use in computation.

Thus, assuming that all activations of layers are the same size, our algorithm requires $O(W + 1 + \alpha L)$ memory, compared to the standard requirement of $O(L)$. This leads to substantial savings for deep networks with large $L$, since $\alpha = K/32$ when approximating 32-bit floating point activations with $K$ bits (i.e., $\alpha = 1/4$ and $1/8$ for $K = 8$ and $4$).
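As a rough worked example of this accounting, the sketch below counts activation storage in units of one layer's full-precision activations. It assumes the simple cost model suggested by the text (up to `width` full-precision buffers, one buffer for re-created activations, plus the low-precision copies for every layer) and ignores weights, gradients of weights, and framework overhead; the function name is hypothetical.

```python
def activation_memory(num_layers, width, bits, full_bits=32):
    """Return (approximate, exact) storage in units of one full-precision layer.

    Cost model assumed for this sketch: width + 1 + alpha * num_layers for
    the approximate method, and num_layers for exact training.
    """
    alpha = bits / full_bits                  # fraction of memory per stored layer
    approximate = width + 1 + alpha * num_layers
    exact = num_layers
    return approximate, exact

approximate, exact = activation_memory(num_layers=1001, width=2, bits=4)
print(approximate, exact, exact / approximate)
```

For a 1001-layer residual network with 4-bit approximations, this model gives 128.125 units against 1001, a reduction of roughly 7.8x, close to the limiting factor of 8 for very deep networks.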

4 Experiments

We developed a library that implements the proposed method for approximate memory-efficient training, given a network architecture specification which can include residual layers (i.e., width $W = 2$). As illustrated in Fig. 1, the method allocates a pair of global buffers for the direct and residual paths that is common to all layers. At any point during the forward pass, these buffers hold the full-precision activations that are needed for computation of subsequent layers. The same buffers are used to store gradients during the backward pass. Beyond these common buffers, the library only stores the low-precision approximate activations for each layer for use during the backward pass. Further details of our implementation are provided in the appendix.

We compare our approximate training approach, with 8- and 4-bit activations, to exact training with full-precision activations. For a fair comparison, we again only store one set of activations (like our method, but with full precision) for a group of batch-normalization, ReLU, and linear (convolution) operations. This is achieved with our library by storing the pre-ReLU activations without approximation ($\alpha = 1$).

As a baseline approach, we also implement a naive approximation-based training algorithm that replaces activations with low-precision versions during the forward pass. We do this conservatively—at each layer, all computations are carried out in full precision, and the activations are only approximated right before the ReLU in a manner identical to our method. However, unlike our method, this baseline uses this approximated-version of the activations as input to the subsequent convolution operation. Note that we use approximate activations only for training, and use exact activations for computing and reporting test errors for models trained with this baseline.

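| Method | CIFAR-10 (ResNet-164) Test Set Error | CIFAR-100 (ResNet-164) Test Set Error | ImageNet (ResNet-152) Val Set Top-5 Error |
|---|---|---|---|
| Exact ($\alpha = 1$) | 5.36% ± 0.15 | 23.44% ± 0.26 | 7.20% |
| Naive 8-bit | 75.49% ± 9.09 | 95.41% ± 2.16 | - |
| Proposed Method, 8-bit ($\alpha = 1/4$) | 5.48% ± 0.13 | 23.63% ± 0.32 | 7.70% |
| Proposed Method, 4-bit ($\alpha = 1/8$) | 5.49% ± 0.16 | 23.58% ± 0.30 | 7.72% |

Table 1: Accuracy Comparisons on CIFAR-10, CIFAR-100, and ImageNet. CIFAR results show mean ± std over training with ten random initializations for each case. ImageNet results are with 10-crop testing. Here $\alpha$ is the fraction of full-precision memory used to store each layer's activations.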

CIFAR-10 and CIFAR-100. We begin with comparisons on 164-layer pre-activation residual networks (He et al., 2016b) on CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), using three-layer “bottleneck” residual units and parameter-free shortcuts for all residual connections. We train the network for 64k iterations with a batch size of 128, momentum of 0.9, and weight decay of . Following He et al. (2016b), the learning rate is set to for the first 400 iterations, then increased to , and dropped by a factor of 10 at 32k and 48k iterations. We use standard data-augmentation with random translation and horizontal flips. We train these networks with our approach using 8- and 4-bit approximations, and measure degradation in accuracy with respect to exact training, repeating training for all cases with ten random seeds. We also include comparisons to the naive approximation baseline (with 8 bits). We visualize the evolution of training losses in Fig. 2, and report test set errors of the final model in Table 1.

We find that the training loss when using our low-memory approximation strategy closely follows that of exact back-propagation, throughout the training process. Moreover, the final mean test errors of models trained with even 4-bit approximations (i.e., storing activations at one-eighth of the full-precision size) are only slightly higher than those trained with exact computations, with the difference being lower than the standard deviation across different initializations. In contrast, we find that training using the naive approximation baseline simply fails, highlighting the importance of preventing accumulation of errors across layers using our approach.

Figure 3: We visualize errors in the computed gradients of learnable parameters (convolution kernels) for different layers for two snapshots of a CIFAR-100 model at the start and end of training. We plot errors between the true gradients and those computed by our approximation, averaged over 100 batches. We compare to the errors from SGD itself: the variance between the (exact) gradients computed from different batches. We find this SGD noise to be 1-2 orders of magnitude higher, explaining why our approximation does not significantly impact training performance.

To examine the reason behind the robustness of our method, Fig. 3 visualizes the error in the final parameter gradients used to update the model. Specifically, we take two models for CIFAR-100, at the start and end of training, and then compute gradients for 100 batches with respect to the convolution kernels of all layers, both exactly and using our approximate strategy. We plot the average squared error between these gradients. We compare this approximation error to the “noise” inherent in SGD, due to the fact that each iteration considers a random batch of training examples. This is measured by the average variance between the (exact) gradients computed in the different batches. We see that our approximation error is between one and two orders of magnitude below the SGD noise for all layers, both at the start and end of training. So while we do incur an error due to approximation, this is added to the much higher error that already exists due to SGD even in exact training, and hence further degradation is limited.
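The two quantities compared here can be sketched as follows. This is an illustration of the measurement only: the function names are hypothetical, the exact normalization used for the figure is not specified in the text, and the gradients are synthetic stand-ins rather than values from a trained model.

```python
import numpy as np

def approximation_error(exact_grads, approx_grads):
    """Mean squared difference between exact and approximate gradients.

    Both arrays have shape (num_batches, num_parameters).
    """
    return float(np.mean((exact_grads - approx_grads) ** 2))

def sgd_noise(exact_grads):
    """Average per-parameter variance of exact gradients across batches."""
    return float(np.mean(np.var(exact_grads, axis=0)))

# Synthetic stand-in data: 100 batches, 1000 parameters.
rng = np.random.default_rng(0)
mean_grad = rng.normal(size=1000)
exact = mean_grad + rng.normal(scale=1.0, size=(100, 1000))   # batch-to-batch variation
approx = exact + rng.normal(scale=0.1, size=(100, 1000))      # small additional error

err = approximation_error(exact, approx)
noise = sgd_noise(exact)
print(f"approximation error {err:.4f}, SGD noise {noise:.4f}, ratio {noise / err:.1f}")
```

With real gradients, `exact` and `approx` would be collected per layer by running both back-propagation variants on the same batches.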

| | | ResNet-164 | ResNet-254 | ResNet-488 | ResNet-1001 | ResNet-1001-4x |
|---|---|---|---|---|---|---|
| Maximum Batch-size | Exact | 688 | 474 | 264 | 134 | 26 |
| | 4-bit | 2522 | 2154 | 1468 | 876 | 182 |
| Run-time per Sample | Exact | 4.1 ms | 6.5 ms | 13.3 ms | 31.3 ms | 130.8 ms |
| | 4-bit | 4.3 ms | 6.7 ms | 12.7 ms | 26.0 ms | 101.6 ms |

Table 2: Comparison of maximum batch-size and wall-clock time per training example (i.e., training time per-iteration divided by batch size) for different ResNet architectures on CIFAR-10.

ImageNet. We also report results on training models for ImageNet (Russakovsky et al., 2015). Here, we use a 152-layer residual architecture, again using three-layer bottleneck units and pre-activation parameter-free shortcuts. We train this network with a batch size of 256 for a total of 640k iterations with a momentum of 0.9, weight decay of , and standard scale, color, flip, and translation augmentation. The initial learning rate is set to with drops by a factor of 10 every 160k iterations. Figure 2 shows the evolution of training loss in this case as well, and Table 1 reports top-5 validation error (using 10 crops at a scale of 256) for models trained using exact computation, and our approach with 8- and 4-bit approximations. As with the CIFAR experiments, training losses using our strategy closely follow that of exact training (interestingly, the loss using our method is slightly lower than that of exact training during the final iterations, although this is likely due to random initialization), and the drop in validation set accuracy is again relatively small: about 0.5% (7.72% vs. 7.20% top-5 error in Table 1) with activations stored at one-eighth of the full-precision size using 4-bit approximations.

Memory and Computational Efficiency. For the CIFAR experiments, the full 128-size batch was able to fit on a single 1080Ti GPU for both exact training and our method. In this case, the batches were also large enough to saturate all GPU cores for both our method and exact training. The running times per iteration were almost identical, with a very slight increase in our case due to the cost of computing approximations: exact vs approximate (4-bit) training took 0.66 seconds vs 0.72 seconds for CIFAR-100. For ImageNet training, we parallelized computation across two GPUs, and while our method was able to fit half a batch (size 128) on each GPU, exact training required two forward-backward passes (followed by averaging gradients) with 64-sized batches per-GPU per-pass. In this case, the smaller batches for exact training underutilized the available parallelism on each GPU compared to our method, and this time our approach had a slight computational advantage: the per-iteration time (across two GPUs) was 2s vs 1.7s for exact vs approximate (4-bit) training using our method.

However, these represent comparisons restricted to having the same total batch size (needed to evaluate relative accuracy). For a more precise evaluation of memory usage, and the resulting computational efficiency from parallelism, we considered residual networks for CIFAR-10 of various depths up to 1001 layers, and additionally for the deepest network, a version with four times as many feature channels in each layer. For each network, we measured the largest batch size that could be fit in memory with our method (with 4-bit approximations) vs exact training, i.e., the size $B$ such that a batch of $B + 1$ caused an out-of-memory error on a 1080Ti GPU. We also measured the corresponding wall-clock training time per sample, computed as the training time per-iteration divided by this batch size. These results are summarized in Table 2. We find that in all cases, our method allows significantly larger batches to be fit in memory. Moreover, for larger networks, our method also yields a notable computational advantage since larger batches permit full exploitation of available cores on the GPU.
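The relative gains in Table 2 are easier to read as ratios. The short script below recomputes them from the reported numbers; only the variable names are introduced here, and the values are copied from the table.

```python
# (exact, 4-bit) maximum batch sizes from Table 2
max_batch = {
    "ResNet-164": (688, 2522),
    "ResNet-254": (474, 2154),
    "ResNet-488": (264, 1468),
    "ResNet-1001": (134, 876),
    "ResNet-1001-4x": (26, 182),
}
# (exact, 4-bit) run-time per sample in milliseconds from Table 2
run_time_ms = {
    "ResNet-164": (4.1, 4.3),
    "ResNet-254": (6.5, 6.7),
    "ResNet-488": (13.3, 12.7),
    "ResNet-1001": (31.3, 26.0),
    "ResNet-1001-4x": (130.8, 101.6),
}

batch_gain = {name: approx / exact for name, (exact, approx) in max_batch.items()}
speed_up = {name: exact / approx for name, (exact, approx) in run_time_ms.items()}

for name in max_batch:
    print(f"{name}: {batch_gain[name]:.2f}x batch size, {speed_up[name]:.2f}x speed per sample")
```

The batch-size ratio grows with depth, from about 3.7x for ResNet-164 to 7.0x for ResNet-1001-4x, and the per-sample speed ratio moves from slightly below 1 for the shallowest networks to about 1.3 for the largest one.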

5 Conclusion

We introduced a new algorithm for approximate gradient computation in neural network training that significantly reduces the amount of required on-device memory. Our experiments show that this comes at a minimal cost in terms of both the quality of the learned models and computational expense. With a lower memory footprint, our method allows training with larger batches in each iteration, which improves efficiency and stability, and enables exploration of deeper architectures that were previously impractical to train. Our reference implementation is available at

Our method shows that SGD is reasonably robust to working with approximate activations. While we used an extremely simple approximation strategy—uniform quantization—in this work, we are interested in exploring whether more sophisticated techniques—e.g., based on random projections or vector quantization—can provide better trade-offs, especially if informed by statistics of gradients and errors from prior iterations. It is also worth investigating whether our approach to partial approximation can be utilized in other settings, for example, to reduce inter-device communication for distributed training with data or model parallelism.


Acknowledgments

A. Chakrabarti acknowledges support from NSF grant IIS-1820693. B. Moseley was supported in part by a Google Research Award and NSF grants CCF-1830711, CCF-1824303, and CCF-1733873.


References

  • Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695, 2016.
  • Chen et al. (2016) Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
  • Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012.
  • Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2211–2221, 2017.
  • Gruslys et al. (2016) Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., and Graves, A. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems, pp. 4125–4133, 2016.
  • Gupta et al. (2015) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In Proc. International Conference on Machine Learning (ICML), pp. 1737–1746, 2015.
  • Han et al. (2015) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016a.
  • He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In Proc. the European Conference on Computer Vision (ECCV), 2016b.
  • Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Hubara et al. (2017) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research (JMLR), 2017.
  • Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning (ICML), pp. 448–456, 2015.
  • Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Martens & Sutskever (2012) Martens, J. and Sutskever, I. Training deep and recurrent networks with hessian-free optimization. In Neural Networks: Tricks of the Trade - Second Edition, pp. 479–535. 2012.
  • Micikevicius et al. (2017) Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
  • Recht et al. (2011) Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, 2011.
  • Robbins & Monro (1985) Robbins, H. and Monro, S. A stochastic approximation method. In Herbert Robbins Selected Papers, pp. 102–109. Springer, 1985.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 2015.
  • Seide et al. (2014) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit Stochastic Gradient Descent and its Application to Data-parallel Distributed Training of Speech DNNs. In Proc. Interspeech, 2014.
  • Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, 2017.

Appendix: Implementation Details

We implemented our approximate training algorithm using the TensorFlow library (Abadi et al., 2016). However, we used only TensorFlow’s functions for individual forward and gradient computations, and did not rely on its automatic differentiation functionality. Instead, our library allows specifying general residual network architectures and, based on this specification, creates a set of TensorFlow ops for doing forward and backward passes through each layer. We also used custom ops implemented in CUDA to handle the conversion to and from low-precision representations. Each layer’s forward and backward ops are invoked in separate calls, and all data that needs to persist between calls (including still-to-be-used full-precision activations and gradients in the forward and backward pass, and approximate intermediate activations) is stored explicitly in TensorFlow variables.
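To give a sense of what a conversion to and from a low-precision representation involves, the following is a minimal pure-Python sketch of uniform fixed-point quantization with two 4-bit codes packed per byte. It is not the CUDA op described above: the fixed range `[lo, hi]`, the rounding rule, the packing layout, and all function names are assumptions made for illustration.

```python
def quantize(values, num_bits, lo, hi):
    """Map floats to integer codes in [0, 2**num_bits - 1] on a uniform grid.

    Values outside the assumed range [lo, hi] are clipped first.
    """
    levels = (1 << num_bits) - 1
    scale = levels / (hi - lo)
    codes = []
    for v in values:
        clipped = min(max(v, lo), hi)
        codes.append(int(round((clipped - lo) * scale)))
    return codes


def dequantize(codes, num_bits, lo, hi):
    """Map integer codes back to the grid points they represent."""
    step = (hi - lo) / ((1 << num_bits) - 1)
    return [lo + c * step for c in codes]


def pack_4bit(codes):
    """Pack pairs of 4-bit codes into bytes (odd-length input is zero-padded)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1]
                 for i in range(0, len(padded), 2))


def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


activations = [-3.0, -0.4, 0.0, 0.7, 2.9]        # illustrative values
codes = quantize(activations, num_bits=4, lo=-3.0, hi=3.0)
packed = pack_4bit(codes)                         # 5 codes stored in 3 bytes
restored = dequantize(unpack_4bit(packed, len(codes)), 4, -3.0, 3.0)
print(len(packed), restored)
```

For values inside the range, the reconstruction error is at most half a quantization step, here 0.2, while storage drops from 32 bits to 4 bits per activation.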

For the forward and backward passes through the network, we call these ops in sequence, followed by ops to update the model parameters based on the computed gradients. We chose not to allocate and then free variables for the full-precision layer activations and gradients, since this caused memory fragmentation with TensorFlow’s memory management routines. Instead, as described in Sec. 4, we used two common variables as buffers for all layers to hold activations (in the forward pass) and gradients (in the backward pass) for the direct and residual paths in the network, respectively. We reuse these buffers by overwriting old activations and gradients with new ones. The size of these buffers is set based on the largest layer, and we used slices of these buffers for smaller layers.
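The buffer-reuse scheme can be illustrated with a small pure-Python sketch: storage is allocated once at the size of the largest layer, and each layer writes into a slice of it, overwriting whatever the previous layer left there. The class `SharedBuffer`, its methods, and the layer sizes are hypothetical stand-ins for the TensorFlow variables used in our implementation.

```python
from array import array


class SharedBuffer:
    """A single preallocated float buffer shared by all layers."""

    def __init__(self, layer_sizes):
        # Allocate once, sized for the largest layer, so nothing is
        # allocated or freed while iterating over layers.
        self._storage = array("f", [0.0]) * max(layer_sizes)
        self._view = memoryview(self._storage)

    def slice_for(self, size):
        """Return a writable view over the first `size` elements."""
        if size > len(self._storage):
            raise ValueError("layer is larger than the shared buffer")
        return self._view[:size]

    def write(self, values):
        """Overwrite the start of the buffer with a layer's values."""
        view = self.slice_for(len(values))
        for i, v in enumerate(values):
            view[i] = v
        return view


layer_sizes = [8, 4, 6]                      # illustrative sizes
direct = SharedBuffer(layer_sizes)           # buffer for the direct path
residual = SharedBuffer(layer_sizes)         # buffer for the residual path

first = direct.write([1.0] * 8)              # largest layer fills the buffer
second = direct.write([2.0] * 4)             # smaller layer reuses a slice
print(second.tolist(), direct.slice_for(8).tolist())
```

After the second write, the first four entries hold the new layer's values and the remainder still holds stale data from the earlier layer, which is harmless because each layer reads only its own slice.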