To improve accuracy, deeper and deeper neural networks are being trained, leading to seemingly ever-improving performance in computer vision, natural language processing, and speech recognition. Three techniques have been key to usefully increasing the depth of networks: rectified-linear units (ReLU) [Nair and Hinton, 2010], batch normalization [Ioffe and Szegedy, 2015], and residual connections [He et al., 2016, Wu et al., 2016, Szegedy et al., 2016, Huang et al., 2016, Zagoruyko and Komodakis, 2016, Larsson et al., 2016, Iandola et al., 2016].
These techniques have fulfilled the promise of deep learning, allowing networks with a large number of relatively narrow layers to achieve high accuracy. The experiments that established the usefulness of these techniques were carried out using 32-bit precision floating-point arithmetic.
Network architectures are often compared in terms of the number of parameters in the network, and the number of FLOPs per test evaluation. Memory requirements, either at training time or at test time, are rarely explicitly taken into account, even though they often seem to be a limiting factor in network design. ResNets double the number of feature planes per layer each time the linear size of the hidden layers is down-scaled by a factor of two, in order to keep the computational cost per layer constant. However, each down-sampling halves the number of hidden units, so most of the network’s depth has to come after the input images have been scaled down by at least a factor of 16. Similarly, the placement of batch-normalization blocks in Inception-v4 was limited by memory considerations [Szegedy et al., 2016]. Bandwidth is another reason to take memory into account. Finite data transfer speeds and cache sizes make it harder to make full use of all available processing power.
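The trade-off between feature planes and spatial size can be checked with a little arithmetic. The sketch below assumes a hypothetical four-stage network of 3×3 convolutions that starts with 16 feature planes on a 32×32 grid; these sizes and the helper name `stage_stats` are illustrative only and are not taken from any particular model.

```python
def stage_stats(planes, size, kernel=3):
    """Hidden units and multiply-adds of one square convolution (planes -> planes)."""
    units = planes * size * size
    mult_adds = planes * planes * kernel * kernel * size * size
    return units, mult_adds

planes, size = 16, 32  # hypothetical starting width and spatial size
stats = []
for stage in range(4):
    stats.append(stage_stats(planes, size))
    # Down-scale by two and double the number of feature planes.
    planes, size = planes * 2, size // 2

for stage, (units, mult_adds) in enumerate(stats):
    print(f"stage {stage}: hidden units = {units}, multiply-adds = {mult_adds}")
```

Under these assumptions the multiply-add count stays the same from stage to stage, while the number of hidden units halves each time.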
To improve efficiency—reducing memory and power requirements—networks have been trained with low-precision floating-point arithmetic, fixed-point arithmetic, and even binary arithmetic [Simard and Graf, 1994, Vanhoucke et al., 2011, Courbariaux et al., 2014, Gupta et al., 2015, Lin et al., 2016, Courbariaux et al., 2015, Rastegari et al., 2016, Miyashita et al., 2016, Hubara et al., 2016]. These experiments have been carried out for relatively shallow networks—generally between three and eight layers. We will focus on much deeper networks. Well-designed deep networks are more efficient than wide, shallow networks, so we believe that reducing the level of precision is more challenging, but more useful when successful.
With this in mind, we have tried to train modern network architectures using limited-precision arithmetic, focusing on the activations halfway through each batch-normalization module, after normalization, before the affine transform.
In Section 1.1 we briefly explore the differences in memory consumption for neural networks at test and training time. In Section 1.2 we discuss converting complicated floating-point multiplications into simpler integer addition operations. In Section 1.3 we recap batch-normalization, splitting it into normalization and affine transform steps.
We focus our experiments on classes of models that are popular in computer vision. However, ideas from computer vision have informed network design in other areas, such as speech recognition [Xiong et al., 2016] and machine translation [Zhou et al., 2016, Gehring et al., 2016, Wu et al., 2016], and have been extended to three dimensions to process video [Tran et al., 2015], so low precision batch-normalized activations should be applicable more broadly. The networks we consider use batch-normalization after each learnt layer, before applying an activation function. We explain this in Section 1.4.
Attempts to use lower precision arithmetic have often combined a range of levels of precision, depending on the location. In general, lower precision is used when storing the network weights and activations during the forward pass, while higher precision is needed during accumulation steps, e.g. for matrix multiplication, backpropagation and gradient descent. Storing the activations accurately seems to be crucial. From a full precision AlexNet/ImageNet top-5 accuracy baseline of 80.2%, binarizing only the weights reduces the accuracy only marginally to 79.4%, but binarizing the activations as well reduces the accuracy to 69.2% [Rastegari et al., 2016]. That is still extremely impressive performance, given the replacement of floating-point arithmetic with much simpler 1-bit XNOR operations, but it clearly cannot be used as a drop-in replacement for existing network designs. Also, it seems unlikely that binary activations are compatible with residual connections. We consider a range of quantization schemes using between two and eight bits; see Section 1.5.
1.1 Test and training memory use
Neural network memory usage is very different during training and testing. At test time, hidden units only need to be stored long enough to calculate the values of any output nodes. For feed-forward networks such as VGG [Simonyan and Zisserman, 2014], this means storing one hidden layer to calculate the next. For modular networks such as ResNets [He et al., 2016], once the output of a module is calculated, all the internal state can be forgotten. In contrast, DenseNets [Huang et al., 2016] accumulate a relatively large state within the densely connected blocks. The peak memory used when calculating the forward pass for a single image may not seem huge, but it is significant compared to, say, the size of a typical mobile processor’s cache.
At training time, much more memory is needed. Many of the activations generated during the forward pass must be stored to efficiently implement backpropagation. Additionally, training is generally carried out with batches of size 64–256 for reasons of computational efficiency, and to benefit from batch-normalization.
Quantizing the batch-normalized activations has potential advantages both at training and at test time, reducing memory requirements and computational burden.
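To give a feel for the potential saving, the following sketch compares the storage needed for one layer's activations at 32 bits and at 4 bits per value. The batch size, number of feature planes and spatial size are hypothetical, and the helper `activation_bytes` is introduced here purely for illustration.

```python
def activation_bytes(batch, planes, height, width, bits):
    """Bytes needed to store one activation tensor at the given bit width."""
    values = batch * planes * height * width
    return values * bits // 8

# Hypothetical layer: batch of 128, 64 feature planes, 32x32 spatial size.
full_precision = activation_bytes(128, 64, 32, 32, bits=32)
four_bit = activation_bytes(128, 64, 32, 32, bits=4)

print(full_precision / 2**20, "MiB at 32 bits")
print(four_bit / 2**20, "MiB at 4 bits")
```

At training time this saving applies to every stored layer, so the relative reduction carries over to the whole forward pass.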
1.2 Forward Propagation with fewer multiplications
In addition to saving memory, some low precision representations also have the useful property of allowing general floating-point multiplication to be replaced with integer addition. Simard and Graf [1994] take advantage of the finite range of the tanh activation function; they approximate the output using only 3 bits, with the form of a signed power of two, $\pm 2^{n}$. Using one of the three bits to store the sign, the other two bits allow the integer exponent $n$ to have range four. They also use a similar trick during backpropagation but using five bits. In addition to saving memory, the special form in which the activations are stored means that they can be multiplied by the network weights using just integer addition in the exponent, and a 1-bit XOR for the sign bit. This is computationally much simpler than regular floating-point multiplication.
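A minimal sketch of this idea is given below. It assumes the activation is stored as a sign bit plus an exponent $n \in \{0, 1, 2, 3\}$ representing $\pm 2^{-n}$; that particular encoding and the function name are assumptions made for illustration, not details taken from Simard and Graf. NumPy's `ldexp` adjusts the floating-point exponent directly, standing in for the integer addition, and negation stands in for the XOR on the sign bit.

```python
import numpy as np

def multiply_by_power_of_two(weight, sign_bit, n):
    """Return weight * (+/- 2**-n) without a floating-point multiplication.

    sign_bit is 0 for a positive activation and 1 for a negative one (assumed encoding).
    """
    # ldexp(w, -n) computes w * 2**-n by adding -n to the exponent of w.
    scaled = np.ldexp(weight, -n)
    # Negation only flips the sign bit of the result.
    return np.where(sign_bit == 1, -scaled, scaled)

weights = np.array([0.75, -1.5, 3.0, -0.1])
sign_bits = np.array([0, 1, 1, 0], dtype=np.int32)
exponents = np.array([0, 1, 2, 3], dtype=np.int32)

products = multiply_by_power_of_two(weights, sign_bits, exponents)
print(products)
```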
Unlike tanh, rectified linear units have unbounded range, so quantization may be less reliable. Batch-normalization is a double-edged sword: it regularizes the typical range of the activation during the forward pass, but it increases the complexity of the backward pass. For these reasons, we consider a range of quantization schemes, including log-scale ones that allow multiplications to be replaced by addition in the exponent.
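1.3 Batch-normalization
We will briefly review the definition of batch-normalization and its effect on backpropagation. We will separate the normal batch-normalization layer into its two constituent parts, a feature-wise normalization N, and a feature-wise, learnt, affine transform A. Let $x_1, \dots, x_m$ denote the values of one feature across a batch, with mean $\mu$ and variance $\sigma^2$, and let $\epsilon$ be a small positive constant; then
$$N_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}.$$
The affine transform is then applied, with learnt parameters $\gamma$ and $\beta$:
$$A(N_i) = \gamma N_i + \beta.$$
During backpropagation, the gradient of the cost function $\ell$ is adjusted using the formula
$$\frac{\partial \ell}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \left( \frac{\partial \ell}{\partial N_i} - \frac{1}{m} \sum_{j=1}^{m} \frac{\partial \ell}{\partial N_j} - \frac{N_i}{m} \sum_{j=1}^{m} N_j \frac{\partial \ell}{\partial N_j} \right). \tag{2}$$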
The normalized values N are good candidates for storing using low precision for two reasons. Firstly, having been normalized, they are easier to store efficiently, as typically the bulk of the values lie within a limited range. Secondly, the N occur twice in (2), so we need to store them in some form to apply the chain rule.
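The split into normalization and affine transform can be sketched as follows, assuming the standard batch-normalization definitions $N = (x - \mu)/\sqrt{\sigma^2 + \epsilon}$ and $A(N) = \gamma N + \beta$, with statistics taken over the batch dimension. The function names are illustrative. The backward function needs only N, the incoming gradient and the stored inverse standard deviation, which is why N is the quantity worth keeping.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Feature-wise normalization N; x has shape (batch, features)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    return (x - mu) * inv_std, inv_std

def affine(n, gamma, beta):
    """Feature-wise learnt affine transform A."""
    return gamma * n + beta

def normalize_backward(grad_n, n, inv_std):
    """Gradient with respect to x, given the gradient with respect to N.

    N appears twice: once directly and once inside the batch mean of N * grad_n.
    """
    mean_grad = grad_n.mean(axis=0)
    mean_n_grad = (n * grad_n).mean(axis=0)
    return inv_std * (grad_n - mean_grad - n * mean_n_grad)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
n, inv_std = normalize(x)
out = affine(n, gamma=np.ones(3), beta=np.zeros(3))
grad_x = normalize_backward(np.ones_like(n), n, inv_std)
```

In a low-precision variant, `n` would be replaced by its quantized version before being passed to `normalize_backward`.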
1.4 Cromulent networks
We will focus on networks that have a specific form. Many popular network architectures either have this form, or can be modified to have it with minor changes. Consider a network composed of the following types of modules:
– Learnt layers such as fully-connected (Linear) layers, or convolutional layers. For backpropagation, the input needs to be available, but the output does not. We can also consider the identity function to be a ‘trivial’ learnt layer.
– Batch-Normalization: BN=A(N). N is needed for backpropagation through both the affine layer and the normalization layer. N is commonly recalculated from the layer input during the backward pass. We are proposing instead just to store a quantized version of N.
– The rectified-linear (ReLU) activation function. Either the input or the output is needed for backpropagation.
– ‘Branch’ nodes that have one input, and produce as output multiple identical copies of the input. Nothing needs to be stored for backpropagation; the incoming gradients are simply summed together elementwise.
– ‘Add’ nodes, that take multiple inputs, and add them together elementwise. The activations do not need to be stored for backpropagation.
– Average pooling. Nothing needs to be stored for backpropagation. Max-pooling is also allowed if the indices of the maximal inputs within each pooling region are stored.
We will say that the network is cromulent if any forward path through the network has the form
with denoting either a direct connection, a branch node, an add node or a pooling node. Cromulent networks include VGG-like networks; pre-activated ResNets and WideResNets; and DenseNets, with or without bottlenecks and compression. Some networks, such as Inception V4, SqueezeNets, and FractalNets are not cromulent, but can be turned into cromulent networks by converting them into pre-activated form, with BN-ReLU-Convolution blocks replacing Convolution-BN-ReLU blocks.
The motivation for our definition of cromulent is to ensure the networks permit a number of optimizations:
– At training time, to do backpropagation it is sufficient to store the N, as you can easily recalculate the adjacent Affine-ReLU transformations. The recalculation is very lightweight, compared to the learnt layers and the N part of batch-normalization.
– Combining the Affine-ReLU-Learnt layers into one so multiplications are replaced with additions; see Section 1.5.
– For VGG and DenseNets at test time, the calculation of N can be rolled into the preceding learnt layer, meaning the output of the learnt layer can be stored directly in compressed form, saving bandwidth and memory. This is especially useful for DenseNets which accumulate a large state within each densely connected block.
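The first optimization can be sketched as follows: the backward pass through Affine-ReLU is written so that it needs only the stored N (which could be the quantized version), recomputing the pre-activation rather than saving it. This is a minimal illustration with made-up function names, not the implementation used in the experiments.

```python
import numpy as np

def affine_relu_forward(n, gamma, beta):
    """Affine transform followed by ReLU; nothing is saved for the backward pass."""
    return np.maximum(gamma * n + beta, 0.0)

def affine_relu_backward(grad_out, n_stored, gamma, beta):
    """Backward pass that needs only the stored N."""
    pre = gamma * n_stored + beta            # cheap recomputation of the affine output
    grad_pre = grad_out * (pre > 0)          # ReLU passes gradient where its input was positive
    grad_gamma = (grad_pre * n_stored).sum(axis=0)
    grad_beta = grad_pre.sum(axis=0)
    grad_n = grad_pre * gamma
    return grad_n, grad_gamma, grad_beta

rng = np.random.default_rng(1)
n = rng.normal(size=(8, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
grad_out = rng.normal(size=(8, 3))

grad_n, grad_gamma, grad_beta = affine_relu_backward(grad_out, n, gamma, beta)
```

The recomputation costs one multiply and one add per value, which is small compared with the adjacent learnt layer.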
1.5 Low precision representations
In Table 1 we define eight quantization formulae. The first seven are similar to ones from Miyashita et al. [2016]; however, we apply the formulae immediately after normalization, before the Affine-ReLU layers, so there are three key differences. Firstly, the input can be negative. We therefore use one bit to store the sign of the input; this seems to be more efficient overall as it saves us from having to store additional information from the forward pass to apply equation (2). Secondly, we can fix the output scale, rather than having to vary the output range according to the level within the network as in their paper. Thirdly, it allows a single normalization to be paired with different affine transformations, which is needed to implement DenseNets.
The first four approximations use a log-scale, using 2, 3, 4 and 5 bits of memory respectively. They are compatible with replacing the multiplications during the forward pass with additions. We also consider three uniform scales, using 4, 5, and 8 bits. The uniform scales provide a different route to more efficient computations. Quantizing the network weights with a uniform scale during the forward pass would allow the floating-point multiplications to be replaced with, say, 8-bit integer multiplication.
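As a rough illustration of the uniform-scale route, the sketch below quantizes both activations and weights onto uniform 8-bit grids and forms a dot product with integer multiplications, rescaling once at the end. The step sizes and the helper `quantize_uniform` are hypothetical and do not correspond to the formulae in Table 1.

```python
import numpy as np

def quantize_uniform(x, step, bits=8):
    """Round x to a symmetric uniform grid and store the grid index as a signed integer."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / step), -qmax, qmax).astype(np.int8)

activations = np.array([0.5, -1.25, 2.0, 0.0])
weights = np.array([0.1, 0.3, -0.2, 0.7])
act_step, weight_step = 1 / 16, 1 / 128   # hypothetical step sizes

q_act = quantize_uniform(activations, act_step)
q_weight = quantize_uniform(weights, weight_step)

# Integer multiply-accumulate, using a wider integer type for the running sum.
int_dot = int(np.dot(q_act.astype(np.int32), q_weight.astype(np.int32)))
approx = int_dot * act_step * weight_step   # a single rescaling at the end
exact = float(np.dot(activations, weights))
print(approx, exact)
```

Note that the accumulation is carried out at higher precision than the stored values, in line with the earlier remark that accumulation steps need more precision than storage.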
1.5.1 Four-bit log-scale quantization
First look at the four-bit log-scale formula, which we write here as $Q(x)$. Like all of the approximation formulae, $Q$ has been constructed so that $Q(x) \approx x$ in a weak sense for $x$ in a neighborhood of zero. Its range has size $2^4 = 16$, consisting of the values $\pm 2^{k}$ for eight consecutive integer exponents $k$.
In words, to calculate $Q(x)$, we scale $x$ by a constant factor, take absolute value to make it positive, take logs to the base 2, round down to the nearest integer, restrict the value to the range of allowed exponents, invert the log operation by raising 2 to that power, and then multiply by the sign of $x$ to restore symmetry. The choice for the range of the exponent is a compromise between accuracy near zero and being able to represent values further away from one. If we put a standard Gaussian random variable into $Q$, then the mean value is preserved by symmetry, and the standard deviation is preserved thanks to the choice of the constant 1.36. Note that for a floating-point number, calculating $\lfloor \log_2 |x| \rfloor$ is just a matter of inspecting the exponent.
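The steps described above can be sketched in NumPy as follows. The scaling constant 1.36 comes from the text; the exponent range $[-3, 4]$ and the function name `log_quantize` are assumptions made for this illustration, chosen so that one sign bit plus three exponent bits give 16 output values.

```python
import numpy as np

def log_quantize(x, scale=1.36, min_exp=-3, max_exp=4):
    """Four-bit log-scale quantization sketch: sign(x) * 2**clip(floor(log2(scale*|x|)))."""
    x = np.asarray(x, dtype=np.float64)
    magnitude = scale * np.abs(x)                 # scale, then take the absolute value
    with np.errstate(divide="ignore"):            # log2(0) = -inf is clipped below
        exponent = np.floor(np.log2(magnitude))   # log base 2, rounded down
    exponent = np.clip(exponent, min_exp, max_exp)
    sign = np.where(np.signbit(x), -1.0, 1.0)
    return sign * np.exp2(exponent)               # invert the log and restore the sign

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)
quantized = log_quantize(samples)

print("distinct output values:", np.unique(quantized).size)
print("mean:", quantized.mean(), "std:", quantized.std())
```

With these assumed settings, the printed mean and standard deviation give a quick check of the claim that a standard Gaussian input keeps roughly zero mean and unit standard deviation after quantization.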
We can look at this quantization as a form of dropout with respect to all but the most significant bit of N. Unlike normal dropout, it is applied in the same way during training and testing. To measure how destructive this form of dropout is, we can look at the correlation between N and its quantized value. See Figure 1 and Table 2. Even for a moderately heavy tailed distribution, the quantized output is highly correlated with the input; by Chebyshev’s inequality, at least 99.6% of the N must lie within the range of the quantizer, as the variance of N will generally be close to one.
for a standard Gaussian and a heavy tailed probability distribution.
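The correlation measurement can be reproduced approximately with the sketch below. It assumes a log-scale quantizer with scaling constant 1.36 and exponents clipped to $[-3, 4]$ (the exponent range is an assumption), and uses a Student-t distribution with 5 degrees of freedom as a stand-in for a heavy tailed distribution; neither choice is claimed to match the ones behind Figure 1 or Table 2.

```python
import numpy as np

def log_quantize(x, scale=1.36, min_exp=-3, max_exp=4):
    """Hypothetical 4-bit log-scale quantizer; the exponent range is an assumption."""
    magnitude = scale * np.abs(x)
    with np.errstate(divide="ignore"):
        exponent = np.clip(np.floor(np.log2(magnitude)), min_exp, max_exp)
    return np.where(np.signbit(x), -1.0, 1.0) * np.exp2(exponent)

def correlation_with_quantized(samples):
    """Correlation between normalized samples and their quantized values."""
    normalized = (samples - samples.mean()) / samples.std()
    return np.corrcoef(normalized, log_quantize(normalized))[0, 1]

rng = np.random.default_rng(0)
corr_gaussian = correlation_with_quantized(rng.standard_normal(200_000))
corr_heavy = correlation_with_quantized(rng.standard_t(df=5, size=200_000))

# Chebyshev: for unit variance, P(|N| >= 16) <= 1 / 16**2, so at least this
# fraction lies within [-16, 16], the largest magnitude this sketch can output.
chebyshev_bound = 1.0 - 1.0 / 16.0**2

print(corr_gaussian, corr_heavy, chebyshev_bound)
```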
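Replacing multiplications with additions in the exponent requires a little algebra to merge together the Affine-ReLU-Learnt triplet of layers. Let $N_i$ denote the $i$-th element of the output of the normalization operation, quantized to the form $\pm 2^{e_i}$, let $\gamma_i$ and $\beta_i$ denote the parameters of the affine transform, and let $W_{ij}$ denote a network parameter. Then
$$W_{ij}\,\mathrm{ReLU}(\gamma_i N_i + \beta_i) = \begin{cases} \max\left(W_{ij}\gamma_i N_i + W_{ij}\beta_i,\ 0\right) & \text{if } W_{ij} \ge 0, \\ \min\left(W_{ij}\gamma_i N_i + W_{ij}\beta_i,\ 0\right) & \text{if } W_{ij} < 0, \end{cases}$$
where the products $W_{ij}\gamma_i$ and $W_{ij}\beta_i$ can be calculated in advance, and multiplying $W_{ij}\gamma_i$ by $N_i = \pm 2^{e_i}$ only requires adding $e_i$ to the exponent and, for negative $N_i$, flipping the sign bit.
We have replaced a number of floating-point multiplications with integer additions and floating-point additions. (Alternatively, this could be implemented with bit-shifting and fixed-point addition.) This is significant as it would allow networks to be deployed to less powerful, more energy efficient devices.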
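One way to realize the merged Affine-ReLU-Learnt computation is sketched below for a fully-connected layer. It relies on the identity $W \cdot \mathrm{ReLU}(z) = \max(Wz, 0)$ for $W \ge 0$ and $\min(Wz, 0)$ for $W < 0$, and assumes the quantized activations are given as a sign and an integer exponent. The function name and the data layout are illustrative assumptions; `np.ldexp` stands in for the integer addition in the exponent.

```python
import numpy as np

def merged_affine_relu_linear(sign, exponent, gamma, beta, weights):
    """Compute sum_i W[i, j] * ReLU(gamma[i] * N[i] + beta[i]) with N[i] = sign[i] * 2**exponent[i]."""
    # These two products depend only on parameters, so they can be precomputed once.
    w_gamma = weights * gamma[:, None]
    w_beta = weights * beta[:, None]
    # (W * gamma) * N: add the exponent via ldexp, then flip the sign where N < 0.
    scaled = np.ldexp(w_gamma, exponent[:, None])
    scaled = np.where(sign[:, None] < 0, -scaled, scaled)
    pre = scaled + w_beta                      # floating-point addition
    terms = np.where(weights >= 0, np.maximum(pre, 0.0), np.minimum(pre, 0.0))
    return terms.sum(axis=0)

sign = np.array([1, -1, 1], dtype=np.int32)
exponent = np.array([0, -2, 3], dtype=np.int32)
gamma = np.array([0.5, 1.5, -0.25])
beta = np.array([0.1, 0.4, 1.0])
weights = np.array([[0.2, -0.3], [-1.0, 0.5], [0.7, -0.6]])

merged = merged_affine_relu_linear(sign, exponent, gamma, beta, weights)
n = sign * np.exp2(exponent.astype(np.float64))
reference = np.maximum(gamma * n + beta, 0.0) @ weights
print(merged, reference)
```

The only multiplications left in the inner computation are those used to precompute `w_gamma` and `w_beta`.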
During training, we can save a substantial amount of memory by using the quantized values in place of N in equation (2), and by recalculating the Affine-ReLU transformation as needed, rather than saving the result.
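To make the memory saving concrete, the sketch below stores two four-bit codes per byte. The code layout (one sign bit followed by a three-bit exponent offset) and the minimum exponent of -3 are assumptions made for illustration; the text does not specify a storage format.

```python
import numpy as np

MIN_EXP = -3  # assumed smallest exponent, so exponents -3..4 map to offsets 0..7

def encode(sign_bit, exponent):
    """Build a 4-bit code: high bit is the sign, low three bits the exponent offset."""
    return ((sign_bit << 3) | (exponent - MIN_EXP)).astype(np.uint8)

def pack(codes):
    """Pack pairs of 4-bit codes into single bytes (needs an even number of codes)."""
    pairs = codes.reshape(-1, 2)
    return (pairs[:, 0] << 4) | pairs[:, 1]

def unpack(packed):
    """Recover the 4-bit codes from packed bytes."""
    return np.stack([packed >> 4, packed & 0x0F], axis=1).reshape(-1)

def decode(codes):
    """Turn 4-bit codes back into floating-point values +/- 2**exponent."""
    sign = np.where((codes >> 3) == 1, -1.0, 1.0)
    exponent = (codes & 0x07).astype(np.int64) + MIN_EXP
    return sign * np.exp2(exponent)

sign_bit = np.array([0, 1, 1, 0], dtype=np.uint8)
exponent = np.array([-3, 0, 4, 2])
packed = pack(encode(sign_bit, exponent))
restored = decode(unpack(packed))
print(packed.nbytes, "bytes for", restored.size, "values:", restored)
```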
1.5.2 Other representations
We have also defined a variety of other functions corresponding to low precision representations. Using three bits we can represent a subset of the range of , specifically . Using two bits, we can represent the values . We can still perform the low precision multiplication trick, by storing the weights multiplied by . With 5 bits, we can increase the precision by using a log-scale with base ; the multiplication trick can be modified by storing and .
To construct the uniform scales, we had to compromise between accuracy and width of support. The final representation in Table 1 is an attempt to improve accuracy at the expense of simplicity. The set of output points is less concentrated around zero, allowing it to provide better accuracy further afield. The number 1.29 is chosen to give a range of approximately .
2 Experiments
We have performed a range of experiments with the CIFAR-10 and ImageNet datasets.
2.1 Fully connected, permutation invariant CIFAR-10
To test low precision approximations for fully-connected networks, we treated CIFAR-10’s images as unordered collections of 3072 features, without data augmentation, and preprocessed only by scaling to the interval . We constructed a range of two layer, fully connected networks,
, for each of the quantization schemes. We trained the networks for 100 epochs in batches of size 100, with learning rate
and Nesterov momentum of 0.9. See Figure 2. The two and three bit approximation schemes seem to have a strong regularizing effect during training, with substantially worse recall of the training labels. On the test set the effect is much smaller, and even positive for networks of size 64 to 128, suggesting that some overfitting is being prevented. This suggests that very low precision storage could be useful as a computationally more efficient alternative to regular batch-normalization combined with dropout.
2.2 Convolutional networks on CIFAR-10
Using the fb.resnet.torch package (https://github.com/facebook/fb.resnet.torch) we trained a variety of convolutional networks on CIFAR-10. We trained two VGG-like [Simonyan and Zisserman, 2014] networks with pairs of convolutions, followed by max-pooling with stride 2, and the number of features doubling after each max-pooling halves the spatial size, i.e.
with and . These are relatively shallow networks, but are included for the sake of comparison.
To experiment with deeper networks, we also trained ResNet-20 and ResNet-44 networks [He et al., 2016], WideResNets [Zagoruyko and Komodakis, 2016] with depth 40 and widening factor 4, and DenseNets [Huang et al., 2016] with growth rate and depths 40 and 100. We used the default learning rate schedule for the VGG and ResNet networks: 164 epochs with learning rate annealing. For WideResNets and DenseNets, we used their customary 200 and 300 epoch learning rate schedules, respectively. All networks were trained with weight decay of
, and data augmentation in the form of random crops after 4-pixel zero padding, and horizontal flips.
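The augmentation described here can be sketched as below for a single image stored as a (channels, height, width) array. The function name, the random generator and the 50% flip probability are assumptions for illustration; the fb.resnet.torch package has its own implementation.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Random crop after zero padding, followed by a random horizontal flip."""
    channels, height, width = image.shape
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    # Choose the top-left corner of a crop with the original size.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[:, top:top + height, left:left + width]
    if rng.random() < 0.5:
        crop = crop[:, :, ::-1]   # flip along the width axis
    return crop

rng = np.random.default_rng(0)
image = rng.random((3, 32, 32))   # stand-in for a CIFAR-10 image
augmented = augment(image, rng)
print(augmented.shape)
```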
clearly inhibits proper training, at least with the default learning meta-parameters. and perform much better, especially when you consider the reduction in computational complexity of the operations involved. In this set of experiments, the uniform scale quantizations are broadly similar to the log-scale ones. Although not computationally simpler, the results show that substantial reductions can be made in memory usage with minimal loss of accuracy.
2.3 ImageNet
Again using fb.resnet.torch, we trained a number of ResNet models on ImageNet. Models are trained for 90 epochs with weight decay , except where noted.
The ResNet-18 consists of 8 basic residual blocks; ResNet-50 consists of 16 bottleneck residual blocks. We also trained a wide ResNet with widening factor 2, depth 18; all layers after the initial convolution have double the width of a regular ResNet-18. Although the bodies of these networks are cromulent, they begin with modules
$$\text{Convolution} \to \text{BN} \to \text{ReLU} \to \text{Max-pooling} \tag{3}$$
mapping the 3×224×224 input to size 64×56×56. To avoid changing the model definition, we decided to keep the initial BN operation unchanged, using low precision BN for all subsequent BN layers. See Table 4.
Here we find that the uniform scale quantization can inhibit learning, while the other representations do better. Perhaps unsurprisingly, quantization affects the deep ResNet-50 more than the WideResNet-18. The failure of the uniform quantization methods is in stark contrast to the results for CIFAR-10, demonstrating that quantization methods cannot be assumed to be ‘mostly-harmless’ without extensive testing on challenging datasets.
The version of ResNet-18 has a top-5 error 0.65 percentage points higher than the 32-bit baseline. To see if performance was suffering due to the use of weight decay, we reduced the decay rate to and trained for ten additional epochs: the top-1/top-5 errors improved to 31.15% / 11.13%, narrowing the gap to 0.37 percentage points. This suggests that the low precision models need less regularization, but are capable of very similar levels of performance.
2.4 Mixed networks with fine-tuning
An alternative to training networks with reduced precision arithmetic is to take a network trained using 32-bit precision, and then try to compress the network to reduce test-time complexity, for example by pruning small weights, and quantizing the remaining weights [Han et al., 2016]. With this in mind, we try substituting regular batch-normalization for -BN and -BN in pretrained DenseNet-121 and DenseNet-201 ImageNet networks.
We tried quantizing (i) all normalized activations, (ii) all but the first normalization, and (iii) all the normalizations in the second, third and fourth densely connected blocks (all but the first 14 BN-operations). See Tables 5 and 6. The early layers of the network, where the number of feature planes is small, seem to be most sensitive to reductions in precision.
We did the same with a ResNet-18. For this experiment we made the network fully cromulent: with reference to (3), we moved the max-pooling module to immediately after the first convolution. We tried quantizing all the normalization operations, and all but the first normalization. Again, except at the beginning of the network, quantizing the normalized activations preserves most of their functionality.
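| Quantized layers | Top-1 error (%) | Top-5 error (%) | Top-1 error (%) | Top-5 error (%) |
|---|---|---|---|---|
| DenseNet-121 | | | | |
| …quantizing all 121 BN | 32.05 | 12.07 | 27.33 | 9.04 |
| …quantizing all but the first BN | 30.87 | 11.13 | 27.11 | 8.87 |
| …quantizing 3 out of 4 dense blocks | 27.97 | 9.34 | 26.42 | 8.38 |
| DenseNet-201 | | | | |
| …quantizing all 201 BN | 28.61 | 9.87 | 25.00 | 7.54 |
| …quantizing all but the first BN | 27.97 | 9.21 | 24.94 | 7.33 |
| …quantizing 3 out of 4 dense blocks | 25.42 | 7.68 | 24.06 | 7.07 |
| ResNet-18 | | | | |
| …quantizing all BN | 36.26 | 14.72 | 33.71 | 12.44 |
| …quantizing all but first BN | 34.58 | 13.40 | 32.16 | 12.03 |

| Quantized layers | Top-1 error (%) | Top-5 error (%) | Top-1 error (%) | Top-5 error (%) |
|---|---|---|---|---|
| DenseNet-121 | | | | |
| …quantizing all 121 BN | 26.69 | 8.65 | 25.32 | 7.94 |
| …quantizing all but the first BN | 26.26 | 8.48 | 25.28 | 7.88 |
| …quantizing 3 out of 4 dense blocks | 25.75 | 8.20 | 25.21 | 7.72 |
| DenseNet-201 | | | | |
| …quantizing all 201 BN | 23.84 | 7.16 | 23.47 | 6.69 |
| …quantizing all but the first BN | 23.71 | 6.90 | 23.33 | 6.59 |
| …quantizing 3 out of 4 dense blocks | 23.35 | 6.71 | 23.18 | 6.59 |
| ResNet-18 | | | | |
| …quantizing all BN | 31.53 | 11.52 | 30.63 | 11.00 |
| …quantizing all but first BN | 31.06 | 11.32 | 30.42 | 11.00 |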
2.5 Training with quantized gradients
We have quantized the activations; we have not so far quantized the gradients, i.e. the partial derivatives of the cost function with respect to the activations. While Hubara et al. [2016] and Miyashita et al. [2016] have shown this is possible for feed-forward networks, the situation is more complicated for network architectures with multiple connections such as DenseNets. Consider the DenseNet networks from Section 2.2 (depths 40 and 100). Each of the three densely connected blocks consists of an input followed by a sequence of convolutions. Each convolution takes the original input concatenated with the outputs of the preceding convolutions. During the backward pass, it has to somehow send the backpropagation error signal to each of its inputs.
The messages cannot realistically be sent directly; the number of messages would grow quadratically with the number of convolutions in the block. Instead, the messages are accumulated additively working backwards through the dense block (see https://github.com/liuzhuang13/DenseNet/).
Quantizing the gradients corresponds to iteratively quantizing partial sums of the messages. Experimentally, this amplifies numerical imprecision, destroying the capacity of the DenseNets to learn CIFAR-10.
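A toy example shows one way in which repeatedly quantizing partial sums can lose information. It uses a uniform rounding grid with a hypothetical step of 0.125 and twelve equal messages of size 0.03; this is a deliberately simple stand-in, not the quantization scheme used in the experiments.

```python
import numpy as np

def quantize(x, step=0.125):
    """Round to the nearest multiple of step (a toy uniform quantizer)."""
    return np.round(x / step) * step

messages = np.full(12, 0.03)   # hypothetical small gradient messages
exact_sum = messages.sum()

# Quantize after every addition, as happens when partial sums are stored in low precision.
partial = 0.0
for message in messages:
    partial = quantize(partial + message)

# Accumulate at full precision and quantize only once at the end.
quantized_once = quantize(exact_sum)

print("exact:", exact_sum, "iterative:", partial, "quantized once:", quantized_once)
```

Each individual message is smaller than half a grid step, so the iteratively quantized sum never leaves zero, whereas quantizing the full-precision sum once stays close to the exact value.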
Alternatively, the gradients can be backpropagated with floating-point precision, but quantized immediately before being applied to each convolution—converting multiplications to additions, but not reducing the overall memory footprint. We trained -quantized DenseNets, quantizing the gradients to 8-bits using the set . Compared to totally unquantized gradients, mean test errors increased very slightly: from 5.45% to 5.64% for , and from 4.29% to 4.44% for .
We have looked at the experimental properties of some low precision approximation schemes. The natural way to take advantage of this method is:
– At training time, combine the Affine-ReLU-Convolution operations into one function/kernel, so that the memory-reads can be replaced with low precision versions. This could be particularly useful for DenseNets, as the convolutions (or bottleneck projections) tend to have much larger input than output.
– At test time, for VGG and DenseNets, also roll the N part of batch-normalization into the Affine-ReLU-Convolution operation, and write the output directly in quantized form.
Implementing this efficiently is a moderate engineering challenge, as is true for all BLAS and convolutional operations. Particularly in the case of DenseNets, being able to read the input from memory in a compressed form is an advantage, as each convolutional layer takes many more activations as input than it produces as output. Storing the 4D hidden-layer tensors in the form height×width×batch×features, rather than the usual batch×features×height×width, could be helpful in terms of efficiently reading contiguous blocks from memory.
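The effect of the two layouts on memory contiguity can be inspected with NumPy strides, as in the sketch below. The tensor sizes are hypothetical; the point is only that in the height×width×batch×features layout, all features at one spatial location of one sample are adjacent in memory.

```python
import numpy as np

batch, features, height, width = 2, 8, 4, 4   # hypothetical sizes
nchw = np.arange(batch * features * height * width, dtype=np.float32)
nchw = nchw.reshape(batch, features, height, width)

# Re-order the axes to height x width x batch x features and make the result contiguous.
hwbf = np.ascontiguousarray(nchw.transpose(2, 3, 0, 1))

# Bytes between consecutive features in each layout.
print("feature stride, batch x features x height x width:", nchw.strides[1])
print("feature stride, height x width x batch x features:", hwbf.strides[3])
```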
Modern, efficient, deep neural-networks can be trained with substantially less memory, and nearly all of the multiplication operations can be replaced with additions, all with only very minor loss in accuracy. Moreover, this technique will allow larger networks to be trained in situations where memory is currently a limiting factor, such as 3D convolutional networks for video, and large language models.
The first few layers of a network, where the number of feature planes is smaller, seem to be more sensitive to quantization. Mixing the number of bits used, with 5 bits in the lower layers, and 3 or 4 bits in the higher layers might be a good way to balance accuracy and memory footprint.
Our experiments have been limited to ConvNets designed for use with floating-point arithmetic, using the existing training procedures without modification. Better results may be possible by designing new network architectures, or by fine-tuning the training procedures.
Our definition of cromulent includes a wide range of networks. However, it is not meant to exhaustively identify every opportunity for low-precision batch-normalized activations. For example, in Cooijmans et al. [2016], batch-normalization is added to an LSTM recurrent network. LSTM cells are built using sigmoid and tanh activations instead of ReLUs, but much of the internal state could still potentially be stored at reduced precision.
To keep our analysis focused, we have not tried to optimize everything at the same time, focusing instead on just the activations. There are almost certainly additional opportunities to improve the computational performance of networks with low-precision batch-normalized activations:
– ConvNets are often robust with respect to quantizing the network weights [Merolla et al., 2016]. Allowing the network weights to take just three values, a ResNet-18 achieved 15.8% validation error on ImageNet [Li and Liu, 2016]. An interesting problem for future work is merging mild weight quantization with low-precision batch-normalized-activations whilst minimizing loss of accuracy.
– If the activations and network weights are quantized, lower-precision arithmetic might be sufficient for summing the activation×weight terms for each convolutional layer.
Some of these optimizations will be immediately useful, particularly the reduced memory overhead. To take full advantage of these techniques will require the development of new, low-power hardware.
References
- Cooijmans et al.  T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016. URL http://arxiv.org/abs/1603.09025.
- Courbariaux et al.  M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications, 2014. URL http://arxiv.org/abs/1412.7024. Accepted as a workshop contribution at ICLR 2015.
- Courbariaux et al.  M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. Advances in Neural Information Processing Systems 28, 2015. URL https://arxiv.org/abs/1511.00363.
- Gehring et al.  J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin. A convolutional encoder model for neural machine translation. 2016. URL http://arxiv.org/abs/1611.02344.
- Gupta et al.  S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015. URL http://jmlr.org/proceedings/papers/v37/.
- Han et al.  S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. ICLR, 2016. URL http://arxiv.org/abs/1510.00149.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. ECCV, 2016. URL https://arxiv.org/abs/1603.05027.
- Huang et al.  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. 2016. URL https://arxiv.org/abs/1608.06993.
- Hubara et al.  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. 2016. URL http://arxiv.org/pdf/1609.07061.
- Iandola et al.  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5MB model size, 2016. URL http://arxiv.org/abs/1602.07360.
- Ioffe and Szegedy  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning, 2015. URL http://jmlr.org/proceedings/papers/v37/ioffe15.pdf.
- Larsson et al.  G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. 2016. URL https://arxiv.org/abs/1605.07648.
- Li and Liu  F. Li and B. Liu. Ternary weight networks. 2016. URL http://arxiv.org/abs/1605.04711.
- Lin et al.  Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. ICLR, 2016. URL http://arxiv.org/abs/1510.03009.
- Merolla et al.  P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha. Deep neural networks are robust to weight binarization and other non-linear distortions. 2016. URL http://arxiv.org/abs/1606.01981.
- Miyashita et al.  D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016. URL http://arxiv.org/abs/1603.01025.
- Nair and Hinton  V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning, 2010. URL http://www.icml2010.org/papers/432.pdf.
- Rastegari et al.  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. ECCV, 2016. URL http://arxiv.org/abs/1603.05279.
- Simard and Graf  P. Y. Simard and H. P. Graf. Backpropagation without multiplication. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 232–239. Morgan Kaufmann Publishers, Inc., 1994. URL https://papers.nips.cc/paper/833-backpropagation-without-multiplication.
- Simonyan and Zisserman  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2014. URL http://arxiv.org/abs/1409.1556.
- Szegedy et al.  C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. 2016. URL http://arxiv.org/abs/1602.07261.
- Tran et al.  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015. URL https://arxiv.org/abs/1412.0767.
- Vanhoucke et al.  V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
- Wu et al.  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. 2016. URL http://arxiv.org/abs/1609.08144.
- Xiong et al.  W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 conversational speech recognition system. 2016. URL http://arxiv.org/abs/1609.03528.
- Zagoruyko and Komodakis  S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016. URL https://arxiv.org/abs/1605.07146.
- Zhou et al.  J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu. Deep recurrent models with fast-forward connections for neural machine translation. TACL, 2016. URL http://arxiv.org/abs/1606.04199.