1 Introduction
To improve accuracy, deeper and deeper neural networks are being trained, leading to seemingly ever-improving performance in computer vision, natural language processing, and speech recognition. Three techniques have been key to usefully increasing the depth of networks: rectified linear units (ReLU)
[Nair and Hinton, 2010], batch normalization [Ioffe and Szegedy, 2015], and residual connections
[He et al., 2016, Wu et al., 2016, Szegedy et al., 2016, Huang et al., 2016, Zagoruyko and Komodakis, 2016, Larsson et al., 2016, Iandola et al., 2016]. These techniques have fulfilled the promise of deep learning, allowing networks with a large number of relatively narrow layers to achieve high accuracy. The experiments that established the usefulness of these techniques were carried out using 32-bit floating-point arithmetic.
Network architectures are often compared in terms of the number of parameters in the network, and the number of FLOPs per test evaluation. Memory requirements, either at training time or at test time, are rarely explicitly taken into account, even though they often seem to be a limiting factor in network design. ResNets double the number of feature planes per layer each time the linear size of the hidden layers is downscaled by a factor of two, in order to keep the computational cost per layer constant. However, each downsampling halves the number of hidden units, so most of the network’s depth has to come after the input images have been scaled down by at least a factor of 16. Similarly, the placement of batch-normalization blocks in Inception-v4 was limited by memory considerations [Szegedy et al., 2016]. Bandwidth is another reason to take memory into account: finite data transfer speeds and cache sizes make it harder to make full use of all available processing power.
To improve efficiency (reducing memory and power requirements), networks have been trained with low-precision floating-point arithmetic, fixed-point arithmetic, and even binary arithmetic [Simard and Graf, 1994, Vanhoucke et al., 2011, Courbariaux et al., 2014, Gupta et al., 2015, Lin et al., 2016, Courbariaux et al., 2015, Rastegari et al., 2016, Miyashita et al., 2016, Hubara et al., 2016]. These experiments have been carried out for relatively shallow networks, generally between three and eight layers. We will focus on much deeper networks. Well-designed deep networks are more efficient than wide, shallow networks, so we believe that reducing the level of precision is more challenging, but more useful when successful.
With this in mind, we have tried to train modern network architectures using limited-precision arithmetic, focusing on the activations halfway through each batch-normalization module: after normalization, but before the affine transform.
In Section 1.1 we briefly explore the differences in memory consumption for neural networks at test and training time. In Section 1.2 we discuss converting complicated floating-point multiplications into simpler integer addition operations. In Section 1.3 we recap batch-normalization, splitting it into normalization and affine transform steps.
We focus our experiments on classes of models that are popular in computer vision. However, ideas from computer vision have informed network design in other areas, such as speech recognition [Xiong et al., 2016]
and neural machine translation
[Zhou et al., 2016, Gehring et al., 2016, Wu et al., 2016], and have been extended to three dimensions to process video [Tran et al., 2015], so low-precision batch-normalized activations should be applicable more broadly. The networks we consider use batch-normalization after each learnt layer, before applying an activation function. We explain this in Section 1.4.
Attempts to use lower-precision arithmetic have often combined a range of levels of precision, depending on the location. In general, lower precision is adequate when storing the network weights and activations during the forward pass, while higher precision is needed during accumulation steps, e.g. for matrix multiplication, backpropagation and gradient descent. Storing the activations accurately seems to be crucial: from a full-precision AlexNet/ImageNet top-5 accuracy baseline of 80.2%, binarizing only the weights reduces the accuracy only marginally, to 79.4%, but binarizing the activations as well reduces the accuracy to 69.2%
[Rastegari et al., 2016]. That is still extremely impressive performance, given the replacement of floating-point arithmetic with much simpler 1-bit XNOR operations, but it clearly cannot be used as a drop-in replacement for existing network designs. Also, it seems unlikely that binary activations are compatible with residual connections. We consider a range of quantization schemes using between two and eight bits; see Section 1.5.
1.1 Test and training memory use
Neural network memory usage is very different during training and testing. At test time, hidden units only need to be stored long enough to calculate the values of any output nodes. For feedforward networks such as VGG [Simonyan and Zisserman, 2014], this means storing one hidden layer to calculate the next. For modular networks such as ResNets [He et al., 2016], once the output of a module is calculated, all the internal state can be forgotten. In contrast, DenseNets [Huang et al., 2016] accumulate a relatively large state within the densely connected blocks. The peak memory used when calculating the forward pass for a single image may not seem huge, but it is significant compared to, say, the size of a typical mobile processor’s cache.
At training time, much more memory is needed: many of the activations generated during the forward pass must be stored to implement backpropagation efficiently. Additionally, training is generally carried out with batches of size 64–256, both for reasons of computational efficiency and to benefit from batch-normalization.
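To make the memory figures concrete, here is a back-of-envelope sketch in Python; the layer dimensions (batch of 128, 64 feature planes at 56×56) are hypothetical, chosen only for illustration and not drawn from our experiments:

```python
# Back-of-envelope activation memory for one stored feature-map tensor,
# using hypothetical sizes: batch 128, 64 feature planes at 56x56.
def activation_bytes(batch, planes, height, width, bits_per_value):
    return batch * planes * height * width * bits_per_value // 8

fp32 = activation_bytes(128, 64, 56, 56, 32)
low4 = activation_bytes(128, 64, 56, 56, 4)
print(fp32 // 2**20, "MiB at 32-bit precision")       # 98 MiB
print(low4 // 2**20, "MiB at 4 bits per activation")  # 12 MiB
```

Multiplied over the many layers whose activations must be retained for backpropagation, an 8x reduction per stored tensor is substantial.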
Quantizing the batch-normalized activations has potential advantages both at training and at test time, reducing memory requirements and computational burden.
1.2 Forward propagation with fewer multiplications
In addition to saving memory, some low-precision representations also have the useful property of allowing general floating-point multiplication to be replaced with integer addition. In Simard and Graf [1994], the authors take advantage of the finite range of the tanh activation function: they approximate the output using only 3 bits, with values of the form ±2^−k. One of the three bits stores the sign, and the other two bits allow k to take a range of four values, i.e. k ∈ {0, 1, 2, 3}. They also use a similar trick during backpropagation, but using five bits. In addition to saving memory, the special form in which the activations are stored means that they can be multiplied by the network weights using just integer addition in the exponent, and a 1-bit XOR for the sign bit. This is computationally much simpler than regular floating-point multiplication.
Unlike tanh, rectified linear units have unbounded range, so quantization may be less reliable. Batch-normalization is a double-edged sword: it regularizes the typical range of the activations during the forward pass, but it increases the complexity of the backward pass. For these reasons, we consider a range of quantization schemes, including log-scale ones that allow multiplications to be replaced by additions in the exponent.
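As a minimal illustration of the exponent-addition trick, using Python's `math.ldexp` as a stand-in for direct exponent manipulation (the helper name is ours, for illustration only):

```python
import math

# Minimal sketch of the multiplication-free trick: an activation stored as
# a sign bit and a small integer exponent (value = sign * 2**exp) can be
# multiplied by a weight via math.ldexp, which just adds `exp` to the
# weight's floating-point exponent; the sign flip is a 1-bit XOR in hardware.
def quantized_multiply(weight, sign, exp):
    return sign * math.ldexp(weight, exp)

# e.g. activation -2**-3 = -0.125 times weight 0.75:
assert quantized_multiply(0.75, -1, -3) == 0.75 * -0.125
```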
1.3 Batch-normalization
We will briefly review the definition of batch-normalization and its effect on backpropagation. We will separate the normal batch-normalization layer into its two constituent parts: a feature-wise normalization N, and a feature-wise, learnt, affine transform A. Let x = (x_1, …, x_n) denote a vector of features. During training, the vector is normalized to have mean zero and variance approximately one:

N(x)_i = (x_i − μ) / σ,   μ = (1/n) Σ_j x_j,   σ² = ε + (1/n) Σ_j (x_j − μ)².   (1)

The affine transform is then applied:

A(N)_i = γ N_i + β.

During backpropagation, the gradient is adjusted using the formula (writing g_i = ∂L/∂N_i)

∂L/∂x_i = (1/σ) [ g_i − (1/n) Σ_j g_j − N_i · (1/n) Σ_j N_j g_j ].   (2)

The normalized values N are good candidates for storing at low precision, for two reasons. First, having been normalized, they are easy to store efficiently, as typically the bulk of the values lie within a limited range around zero. Second, the N_i occur twice in (2), so we need to store them in some form to apply the chain rule.
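The backpropagation formula (2) can be checked numerically; a sketch with a random gradient g = ∂L/∂N and ε omitted for clarity:

```python
import math, random

# Numerical check of the backpropagation formula (2) for the normalization
# N(x)_i = (x_i - mu) / sd: compare the closed form against central finite
# differences of a linear loss L = sum_i g_i * N_i, with eps omitted.
random.seed(0)
n = 8
x = [random.gauss(0, 1) for _ in range(n)]
g = [random.gauss(0, 1) for _ in range(n)]   # a random gradient dL/dN

def normalize(v):
    mu = sum(v) / len(v)
    sd = math.sqrt(sum((a - mu) ** 2 for a in v) / len(v))
    return [(a - mu) / sd for a in v]

def loss(v):
    return sum(gi * ni for gi, ni in zip(g, normalize(v)))

mu = sum(x) / n
sd = math.sqrt(sum((a - mu) ** 2 for a in x) / n)
N = [(a - mu) / sd for a in x]
gbar = sum(g) / n
gN = sum(gi * ni for gi, ni in zip(g, N)) / n
dx = [(gi - gbar - ni * gN) / sd for gi, ni in zip(g, N)]   # formula (2)

# Central finite differences agree with the closed form.
h = 1e-6
errs = []
for i in range(n):
    xp, xm = x[:], x[:]
    xp[i] += h
    xm[i] -= h
    errs.append(abs((loss(xp) - loss(xm)) / (2 * h) - dx[i]))
assert max(errs) < 1e-4
```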
1.4 Cromulent networks
We will focus on networks that have a specific form. Many popular network architectures either have this form, or can be modified to have it with minor changes. Consider a network composed of the following types of modules:
– Learnt layers such as fully-connected (Linear) layers, or convolutional layers. For backpropagation, the input needs to be available, but the output does not. We can also consider the identity function to be a ‘trivial’ learnt layer.
– Batch-normalization: BN = A(N). N is needed for backpropagation through both the affine layer and the normalization layer; it is commonly recalculated from the stored input during the backward pass. We are proposing instead just to store a quantized version of N.
– The rectified-linear (ReLU) activation function. Either the input or the output is needed for backpropagation.
– ‘Branch’ nodes that take one input and produce as output multiple identical copies of the input. Nothing needs to be stored for backpropagation; the incoming gradients are simply summed together elementwise.
– ‘Add’ nodes that take multiple inputs and add them together elementwise. The activations do not need to be stored for backpropagation.
– Average pooling. Nothing needs to be stored for backpropagation. Max-pooling is also allowed if the indices of the maximal inputs within each pooling region are stored.
We will say that the network is cromulent if any forward path through the network has the form

Learnt ⇝ BN ⇝ ReLU ⇝ Learnt ⇝ BN ⇝ ReLU ⇝ ⋯

with ⇝ denoting either a direct connection, a branch node, an add node or a pooling node. Cromulent networks include VGG-like networks; pre-activated ResNets and WideResNets; and DenseNets, with or without bottlenecks and compression. Some networks, such as Inception-v4, SqueezeNets, and FractalNets, are not cromulent, but can be turned into cromulent networks by converting them into pre-activated form, with BN-ReLU-Convolution blocks replacing Convolution-BN-ReLU blocks.
The motivation for our definition of cromulent is to ensure the networks permit a number of optimizations:
– At training time, to do backpropagation it is sufficient to store the N, as the adjacent Affine-ReLU transformations can easily be recalculated. The recalculation is very lightweight compared to the learnt layers and the N part of batch-normalization.
– Combining the Affine-ReLU-Learnt layers into one, so that multiplications are replaced with additions; see Section 1.5.
– For VGG and DenseNets at test time, the calculation of N can be rolled into the preceding learnt layer, meaning the output of the learnt layer can be stored directly in compressed form, saving bandwidth and memory. This is especially useful for DenseNets which accumulate a large state within each densely connected block.
1.5 Low-precision representations
In Table 1 we define eight quantization formulae. The first seven are similar to ones from Miyashita et al. [2016]; however, we apply the formulae immediately after normalization, before the Affine-ReLU layers, so there are three key differences. Firstly, the input can be negative. We therefore use one bit to store the sign of the input; this seems to be more efficient overall, as it saves us from having to store additional information from the forward pass to apply equation (2). Secondly, we can fix the output scale, rather than having to vary the output range according to the level within the network as in their paper. Thirdly, it allows a single normalization to be paired with different affine transformations, which is needed to implement DenseNets.
The first four approximations use a log-scale, using 2, 3, 4 and 5 bits of memory respectively. They are compatible with replacing the multiplications during the forward pass with additions. We also consider three uniform scales, using 4, 5, and 8 bits. The uniform scales provide a different route to more efficient computation: quantizing the network weights with a uniform scale during the forward pass would allow the floating-point multiplications to be replaced with, say, 8-bit integer multiplications.
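As an illustration of the uniform case, a minimal sketch; the bit-width and range constants here are hypothetical, not the values from Table 1:

```python
# Sketch of a k-bit uniform quantizer over a symmetric range [-r, r]; the
# defaults (4 bits, r = 2.0) are illustrative, not the paper's constants.
def uniform_quantize(x, bits=4, r=2.0):
    levels = 2 ** bits - 1          # odd number of levels, symmetric about 0
    step = 2 * r / levels
    idx = round(x / step)
    idx = max(-(levels // 2), min(levels // 2, idx))  # clip out-of-range inputs
    return idx * step

assert uniform_quantize(0.0) == 0.0
# every input within range lands within half a step of itself:
assert abs(uniform_quantize(0.9) - 0.9) <= (2 * 2.0 / 15) / 2 + 1e-12
```

A quantized value can then be stored as the small signed integer `idx`, with the common `step` factor folded into subsequent computations.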
1.5.1 Four-bit log-scale quantization
First look at the four-bit quantization function, which we will write here as Q4. Like all of the approximation formulae, Q4 has been constructed so that Q4(x) ≈ x in a weak sense for x in a neighborhood of zero. Its range consists of values of the form ±2^e, for a limited range of integer exponents e.
In words, to calculate Q4(x), we scale x by a constant factor, take the absolute value to make it positive, take logs to base 2, round down to the nearest integer, restrict the value to the permitted range of exponents, invert the log operation by raising 2 to that power, and then multiply by sign(x) to restore symmetry. The choice of the range of the exponent is a compromise between accuracy near zero and being able to represent values further away from one. If we put a standard Gaussian random variable into Q4, then the mean value is preserved by symmetry, and the standard deviation is preserved thanks to the choice of the constant 1.36. Note that for a floating-point number x, calculating ⌊log2 |x|⌋ is just a matter of inspecting the exponent.
We can look at Q4 as a form of dropout with respect to all but the most significant bit of x. Unlike normal dropout, it is applied in the same way during training and testing. To measure how destructive this form of dropout is, we can look at the correlation between N and Q4(N). See Figure 1 and Table 2. Even for a moderately heavy-tailed distribution, the output of Q4 is highly correlated with the input; by Chebyshev’s inequality, at least 99.6% of the N must lie within the range of Q4. The variance of N will generally be close to one.
Bits  Distribution X  Correlation  sd
2  N(0,1)  0.918  1.000
3  N(0,1)  0.965  1.000
4  N(0,1)  0.981  1.000
2  Student’s t (ν = 3)  0.769  0.888
3  Student’s t (ν = 3)  0.857  1.11
4  Student’s t (ν = 3)  0.970  0.978
Table 2: Correlation between the input X and its quantized value, and the standard deviation of the quantized value, for a standard Gaussian and a heavy-tailed probability distribution.
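For illustration, a sketch of the four-bit log-scale quantizer described in words above, with a Monte Carlo estimate of the correlation; the exponent clamp range [−7, 0] is an assumption on our part (chosen so that a sign bit plus three exponent bits fill four bits):

```python
import math, random

# Sketch of the four-bit log-scale quantizer: scale by the constant 1.36,
# round log2 down, clamp the integer exponent (the clamp range [-7, 0] is
# an assumption), and restore the sign.
def q4(x):
    if x == 0.0:
        return 0.0
    e = max(-7, min(0, math.floor(math.log2(1.36 * abs(x)))))
    return math.copysign(2.0 ** e, x)

# Monte Carlo estimate of the correlation between N(0,1) samples and their
# quantized values; with the paper's exact constants, Table 2 reports 0.981.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
ys = [q4(v) for v in xs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in xs) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in ys) / n)
corr = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n * sx * sy)
assert corr > 0.9   # highly correlated, as in Table 2
```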
Replacing multiplications with additions in the exponent requires a little algebra to merge together the Affine-ReLU-Learnt triplet of layers. Let N_i denote the i-th element of the output of the normalization operation, let γ and β denote the parameters of the affine transform, and let w denote a network parameter. Writing the quantized N_i as s_i 2^{e_i}, with sign s_i and integer exponent e_i, we have

w · ReLU(γ N_i + β) = s_i 2^{e_i} (wγ) + wβ   if γ N_i + β > 0, and 0 otherwise.

The products wγ and wβ can be precomputed, so evaluating s_i 2^{e_i} (wγ) only requires adding e_i to the exponent of wγ and flipping the sign bit according to s_i. We have thus replaced a number of floating-point multiplications with integer additions and floating-point additions. (Alternatively, this could be implemented with bit-shifting and fixed-point addition.) This is significant, as it would allow networks to be deployed to less powerful, more energy-efficient devices.
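The merged computation can be sketched as follows; the helper and its arguments are illustrative, with w·γ, w·β and the ReLU threshold −β/γ assumed precomputed:

```python
import math

# Sketch of the merged Affine-ReLU-weight term, assuming the normalized
# activation is stored as sign s and integer exponent e, i.e. N = s * 2**e.
# When gamma*N + beta > 0, the term w*(gamma*N + beta) equals
# ldexp(w*gamma, e)*s + w*beta: an integer addition in the exponent plus
# floating-point additions; w*gamma and w*beta are precomputed per weight.
def merged_term(w_gamma, w_beta, threshold, s, e):
    # ReLU gate: gamma*N + beta > 0 iff s*2**e > threshold, where the
    # threshold -beta/gamma is precomputed (assuming gamma > 0).
    if s * 2.0 ** e <= threshold:
        return 0.0
    return math.ldexp(w_gamma, e) * s + w_beta

# Agreement with the direct computation for w=0.5, gamma=1.25, beta=0.1:
w, gamma, beta, s, e = 0.5, 1.25, 0.1, 1, -2
direct = w * max(0.0, gamma * (s * 2.0 ** e) + beta)
merged = merged_term(w * gamma, w * beta, -beta / gamma, s, e)
assert abs(merged - direct) < 1e-12
```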
During training, we can save a substantial amount of memory by using the quantized values in place of N in equation (2), and by recalculating the Affine-ReLU transformation as needed, rather than saving the result.
1.5.2 Other representations
We have also defined a variety of other functions corresponding to low-precision representations. Using three bits we can represent a subset of the range of the four-bit representation. Using two bits, we can represent a set of just four values; we can still perform the low-precision multiplication trick, by folding the constant scale factor into the stored weights. With 5 bits, we can increase the precision by using a log-scale with a finer, non-integer base; the multiplication trick can be modified by storing a correspondingly scaled copy of each weight.
To construct the uniform scales, we had to compromise between accuracy and width of support. The final representation in Table 1 is an attempt to improve accuracy at the expense of simplicity. The set of output points is less concentrated around zero, allowing it to provide better accuracy further afield. The number 1.29 is chosen to give an appropriate output range.
2 Experiments
We have performed a range of experiments with the CIFAR-10 and ImageNet datasets.
2.1 Fully-connected, permutation-invariant CIFAR-10
To test low-precision approximations for fully-connected networks, we treated CIFAR-10’s images as unordered collections of 3072 features, without data augmentation, and preprocessed only by rescaling to a fixed interval. We constructed a range of two-layer, fully-connected networks of varying width, for each of the quantization schemes. We trained the networks for 100 epochs in batches of size 100, with Nesterov momentum of 0.9. See Figure 2.
The two- and three-bit approximation schemes seem to have a strong regularizing effect during training, with substantially worse recall of the training labels. On the test set the effect is much smaller, and even positive for networks of size 64 to 128, suggesting that some overfitting is being prevented. This suggests that very low-precision storage could be useful as a computationally more efficient alternative to regular batch-normalization combined with dropout.
2.2 CIFAR-10
Using the fb.resnet.torch package (https://github.com/facebook/fb.resnet.torch) we trained a variety of convolutional networks on CIFAR-10. We trained two VGG-like [Simonyan and Zisserman, 2014] networks with pairs of convolutions, each pair followed by max-pooling with stride 2, and with the number of feature planes doubling each time max-pooling halves the spatial size. These are relatively shallow networks, but are included for the sake of comparison.
To experiment with deeper networks, we also trained ResNet-20 and ResNet-44 networks [He et al., 2016], WideResNets [Zagoruyko and Komodakis, 2016] with depth 40 and widening factor 4, and DenseNets [Huang et al., 2016] with depths 40 and 100. We used the default learning rate schedule for the VGG and ResNet networks: 164 epochs with learning rate annealing. For WideResNets and DenseNets, we used their customary 200 and 300 epoch learning rate schedules, respectively. All networks were trained with weight decay, and data augmentation in the form of random crops after 4-pixel zero-padding, and horizontal flips.
The two-bit log-scale quantization clearly inhibits proper training, at least with the default learning meta-parameters. The three- and four-bit log-scale quantizations perform much better, especially when you consider the reduction in the computational complexity of the operations involved. In this set of experiments, the uniform-scale quantizations are broadly similar to the log-scale ones. Although not computationally simpler, the results show that substantial reductions can be made in memory usage with minimal loss of accuracy.
Network  ResNet-18  ResNet-50  WideResNet(2,18)
  top-1  top-5  top-1  top-5  top-1  top-5
fp32  30.43  10.76  24.01  7.02  25.94  8.19 
36.45  14.83  43.88  20.39  29.93  10.71  
33.28  12.54  26.19  8.16  27.24  9.01  
31.05  11.45  24.39  7.39  25.96  8.31  
36.55  15.02  49.75  25.82  30.00  10.92  
30.72  11.24  34.72  13.67  26.03  8.29  
30.15  10.73  25.68  7.93  25.95  8.21  
31.67  11.41  27.27  8.81  26.03  8.37 
2.3 ImageNet
Again using fb.resnet.torch, we trained a number of ResNet models on ImageNet. Models were trained for 90 epochs with weight decay, except where noted.
The ResNet-18 consists of 8 basic residual blocks; the ResNet-50 consists of 16 bottleneck residual blocks. We also trained a wide ResNet with widening factor 2 and depth 18; all layers after the initial convolution have double the width of a regular ResNet-18. Although the bodies of these networks are cromulent, they begin with the modules

Convolution - BN - ReLU - MaxPool   (3)

mapping the 3×224×224 input to size 64×56×56. To avoid changing the model definition, we decided to keep the initial BN operation unchanged, using low-precision BN for all subsequent BN layers. See Table 4.
Here we find that the uniform-scale quantization can inhibit learning, while the other representations do better. Perhaps unsurprisingly, quantization affects the deep ResNet-50 more than the WideResNet-18. The failure of the uniform quantization methods is in stark contrast to the results for CIFAR-10, demonstrating that quantization methods cannot be assumed to be ‘mostly harmless’ without extensive testing on challenging datasets.
The version of ResNet-18 trained with the final representation from Table 1 has a top-5 error 0.65 percentage points higher than the 32-bit baseline. To see if performance was suffering due to the use of weight decay, we reduced the decay rate and trained for ten additional epochs: the top-1/top-5 errors improved to 31.15% / 11.13%, narrowing the gap to 0.37%. This suggests that the low-precision models need less regularization, but are capable of very similar levels of performance.
2.4 Mixed networks with fine-tuning
An alternative to training networks with reduced-precision arithmetic is to take a network trained using 32-bit precision, and then try to compress the network to reduce test-time complexity, for example by pruning small weights and quantizing the remaining weights [Han et al., 2016]. With this in mind, we try substituting two quantized variants of batch-normalization for regular batch-normalization in pretrained DenseNet-121 and DenseNet-201 ImageNet networks.
We tried quantizing (i) all normalized activations, (ii) all but the first normalization, and (iii) all the normalizations in the second, third and fourth densely connected blocks (all but the first 14 BN operations). See Tables 5 and 6. The early layers of the network, where the number of feature planes is small, seem to be most sensitive to reductions in accuracy.
We did the same with a ResNet-18. For this experiment we made the network fully cromulent: with reference to (3), we moved the max-pooling module to immediately after the first convolution. We tried quantizing all the normalization operations, and all but the first normalization. Again, except at the beginning of the network, quantizing the normalized activations preserves most of their functionality.
Network  Raw top-1 / top-5  Fine-tuned top-1 / top-5
DenseNet121  25.25  7.88     
…quantizing all 121 BN  32.05  12.07  27.33  9.04 
…quantizing all but the first BN  30.87  11.13  27.11  8.87 
…quantizing 3 out of 4 dense blocks  27.97  9.34  26.42  8.38 
DenseNet201  22.78  6.46     
…quantizing all 201 BN  28.61  9.87  25.00  7.54 
…quantizing all but the first BN  27.97  9.21  24.94  7.33 
…quantizing 3 out of 4 dense blocks  25.42  7.68  24.06  7.07 
ResNet18  30.21  10.79     
…quantizing all BN  36.26  14.72  33.71  12.44 
…quantizing all but first BN  34.58  13.40  32.16  12.03 
Network  Raw top-1 / top-5  Fine-tuned top-1 / top-5
DenseNet121  25.25  7.88     
…quantizing all 121 BN  26.69  8.65  25.32  7.94 
…quantizing all but the first BN  26.26  8.48  25.28  7.88 
…quantizing 3 out of 4 dense blocks  25.75  8.20  25.21  7.72 
DenseNet201  22.78  6.46     
…quantizing all 201 BN  23.84  7.16  23.47  6.69 
…quantizing all but the first BN  23.71  6.90  23.33  6.59 
…quantizing 3 out of 4 dense blocks  23.35  6.71  23.18  6.59 
ResNet18  30.21  10.79     
…quantizing all BN  31.53  11.52  30.63  11.00 
…quantizing all but first BN  31.06  11.32  30.42  11.00 
2.5 Training with quantized gradients
We have quantized the activations; we have not so far quantized the gradients, i.e. the partial derivatives of the cost function with respect to the activations. While Hubara et al. [2016] and Miyashita et al. [2016] have shown this is possible for feedforward networks, the situation is more complicated for network architectures with multiple connections, such as DenseNets. Consider the DenseNet networks from Section 2.2, of depths 40 and 100. Each of the three densely connected blocks consists of an input followed by a sequence of convolutions; the k-th convolution takes the original input concatenated with the outputs of the k−1 preceding convolutions. During the backward pass, it has to somehow send the backpropagation error signal to each of its inputs.
The messages cannot realistically be sent directly; the number of messages would grow quadratically with the depth of the block. Instead, the messages are accumulated additively, working backwards through the dense block (https://github.com/liuzhuang13/DenseNet/).
Quantizing the gradients corresponds to iteratively quantizing partial sums of the messages. Experimentally, this amplifies numerical imprecision, destroying the capacity of the DenseNets to learn CIFAR-10.
Alternatively, the gradients can be backpropagated with floating-point precision, but quantized immediately before being applied to each convolution, converting multiplications to additions but not reducing the overall memory footprint. We trained quantized DenseNets, quantizing the gradients to 8 bits. Compared to totally unquantized gradients, mean test errors increased very slightly: from 5.45% to 5.64% for the depth-40 network, and from 4.29% to 4.44% for the depth-100 network.
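The destructive effect of iteratively quantizing partial sums can be illustrated with a toy example; the quantizer here is a stand-in log-scale scheme, not the exact one from Table 1:

```python
import math

# Toy illustration of why iteratively quantizing the accumulated gradient
# messages in a dense block is destructive: re-quantizing every partial sum
# can get stuck, absorbing all subsequent small messages. The quantizer q
# is a stand-in log-scale scheme, not the exact one used in the paper.
def q(x):
    if x == 0.0:
        return 0.0
    e = max(-7, min(7, math.floor(math.log2(abs(x)))))
    return math.copysign(2.0 ** e, x)

msgs = [0.3] * 10
exact = sum(msgs)       # approximately 3.0

acc = 0.0
for m in msgs:          # quantize every partial sum, as backward accumulation would
    acc = q(acc + m)    # gets stuck: q(0.5 + 0.3) == 0.5, so later messages vanish
once = q(exact)         # quantizing only the final sum

assert abs(acc - exact) > abs(once - exact)
```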
3 Implementation
We have looked at the experimental properties of some low precision approximation schemes. The natural way to take advantage of this method is:
– At training time, combine the Affine-ReLU-Convolution operations into one function/kernel, so that the memory reads can be replaced with low-precision versions. This could be particularly useful for DenseNets, as the convolutions (or bottleneck projections) tend to have much larger input than output.
– At test time, for VGG and DenseNets, also roll the N part of batch-normalization into the Affine-ReLU-Convolution operation, and write the output directly in quantized form.
Implementing this efficiently is a moderate engineering challenge; the same is true for all BLAS and convolutional operations. Particularly in the case of DenseNets, being able to read the input from memory in a compressed form is an advantage, as each convolutional layer takes many more activations as input than it produces as output. Storing the 4D hidden-layer tensors in the form height × width × batch × features, rather than the usual batch × features × height × width, could be helpful in terms of efficiently reading contiguous blocks from memory.
4 Conclusion
Modern, efficient, deep neural networks can be trained with substantially less memory, and nearly all of the multiplication operations can be replaced with additions, all with only very minor loss in accuracy. Moreover, this technique will allow larger networks to be trained in situations where memory is currently a limiting factor, such as 3D convolutional networks for video, and large language models.
The first few layers of a network, where the number of feature planes is smaller, seem to be more sensitive to quantization. Mixing the number of bits used, with 5 bits in the lower layers, and 3 or 4 bits in the higher layers might be a good way to balance accuracy and memory footprint.
Our experiments have been limited to ConvNets designed for use with floatingpoint arithmetic, using the existing training procedures without modification. Better results may be possible by designing new network architectures, or by finetuning the training procedures.
Our definition of cromulent includes a wide range of networks. However, it is not meant to exhaustively identify every opportunity for lowprecision batchnormalized activations. For example, in Cooijmans et al. [2016], batchnormalization is added to an LSTM recurrent network. LSTM cells are built using sigmoid and tanh activations instead of ReLUs, but much of the internal state could still potentially be stored at reduced precision.
To keep our analysis focused, we have not tried to optimize everything at the same time, focusing instead on just the activations. There are almost certainly additional opportunities to improve the computational performance of networks with low-precision batch-normalized activations:
– ConvNets are often robust with respect to quantizing the network weights [Merolla et al., 2016]. Allowing the network weights to take just three values, a ResNet-18 achieved 15.8% validation error on ImageNet [Li and Liu, 2016]. An interesting problem for future work is merging mild weight quantization with low-precision batch-normalized activations whilst minimizing loss of accuracy.
– If the activations and network weights are quantized, lower-precision arithmetic might be sufficient for summing the activation-weight terms for each convolutional layer.
Some of these optimizations will be immediately useful, particularly the reduced memory overhead. To take full advantage of these techniques will require the development of new, low-power hardware.
References
 Cooijmans et al. [2016] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016. URL http://arxiv.org/abs/1603.09025.
 Courbariaux et al. [2014] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications, 2014. URL http://arxiv.org/abs/1412.7024. Accepted as a workshop contribution at ICLR 2015.
 Courbariaux et al. [2015] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in Neural Information Processing Systems 28, 2015. URL https://arxiv.org/abs/1511.00363.
 Gehring et al. [2016] J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin. A convolutional encoder model for neural machine translation. 2016. URL http://arxiv.org/abs/1611.02344.

 Gupta et al. [2015] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. Proceedings of the 32nd International Conference on Machine Learning, 2015. URL http://jmlr.org/proceedings/papers/v37/.
 Han et al. [2016] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016. URL http://arxiv.org/abs/1510.00149.
 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. ECCV, 2016. URL https://arxiv.org/abs/1603.05027.
 Huang et al. [2016] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. 2016. URL https://arxiv.org/abs/1608.06993.
 Hubara et al. [2016] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. 2016. URL http://arxiv.org/pdf/1609.07061.
 Iandola et al. [2016] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, 2016. URL http://arxiv.org/abs/1602.07360.
 Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning, 2015. URL http://jmlr.org/proceedings/papers/v37/ioffe15.pdf.
 Larsson et al. [2016] G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. 2016. URL https://arxiv.org/abs/1605.07648.
 Li and Liu [2016] F. Li and B. Liu. Ternary weight networks. 2016. URL http://arxiv.org/abs/1605.04711.
 Lin et al. [2016] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. ICLR, 2016. URL http://arxiv.org/abs/1510.03009.
 Merolla et al. [2016] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha. Deep neural networks are robust to weight binarization and other nonlinear distortions. 2016. URL http://arxiv.org/abs/1606.01981.
 Miyashita et al. [2016] D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016. URL http://arxiv.org/abs/1603.01025.

 Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning, 2010. URL http://www.icml2010.org/papers/432.pdf.
 Rastegari et al. [2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. ECCV, 2016. URL http://arxiv.org/abs/1603.05279.
 Simard and Graf [1994] P. Y. Simard and H. P. Graf. Backpropagation without multiplication. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 232–239. Morgan Kaufmann Publishers, Inc., 1994. URL https://papers.nips.cc/paper/833backpropagationwithoutmultiplication.
 Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. 2014. URL http://arxiv.org/abs/1409.1556.
 Szegedy et al. [2016] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inceptionv4, inceptionresnet and the impact of residual connections on learning. 2016. URL http://arxiv.org/abs/1602.07261.
 Tran et al. [2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015. URL https://arxiv.org/abs/1412.0767.
 Vanhoucke et al. [2011] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
 Wu et al. [2016] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. 2016. URL http://arxiv.org/abs/1609.08144.
 Xiong et al. [2016] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 conversational speech recognition system. 2016. URL http://arxiv.org/abs/1609.03528.
 Zagoruyko and Komodakis [2016] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016. URL https://arxiv.org/abs/1605.07146.
 Zhou et al. [2016] J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu. Deep recurrent models with fastforward connections for neural machine translation. TACL, 2016. URL http://arxiv.org/abs/1606.04199.