1 Motivation and Prior Work
The demonstrated success of deep learning in vision, speech and planning can be leveraged in practical applications on edge computing devices only if its demands for high power, large memory and high-end computational resources are significantly reduced. The motivation is twofold: the ever-increasing number of parameters in deep neural networks (projected to reach 1 trillion within the next decade) and the bulky, power-hungry digital multipliers they require. A common solution that circumvents all three of these issues is to replace the multiplications in neural networks with a cheap operation such as bit shifts [Marchesi et al. 1993, Kwan and Tang 1993, Courbariaux, Bengio, and David 2015, Lin et al. 2015, Miyashita, Lee, and Murmann 2016] or integer arithmetic [Raghavan et al. 2017, Wu et al. 2018]. Convolutional networks are especially expensive, since the number of multiplications grows with the number and size of the filters in addition to the size of the input.
In this paper, we focus on replacing multiplications with logical bit shift operations, which can be implemented efficiently in hardware using shift registers. While prior work [Marchesi et al. 1993, Kwan and Tang 1993, Courbariaux, Bengio, and David 2015, Lin et al. 2015, Miyashita, Lee, and Murmann 2016] has noted this possibility, all of the above are restricted to a fixed range of shifts. For example, TC [Lin et al. 2015] restricts all parameters to the values -1, 0 and +1 (ignoring the scaling factor), which corresponds to the operations of shifting by zero and inverting a sign bit, whereas [Kwan and Tang 1993] uses a maximum shift of four, and [Miyashita, Lee, and Murmann 2016] uses a maximum of five. More generally, the majority of prior work on neural network compression, e.g., [Han, Mao, and Dally 2015, Polino, Pascanu, and Alistarh 2018, Achterhold et al. 2018], chooses the number of bits of precision manually.
The proposed algorithm, Generalized Ternary Connect (GTC), simultaneously learns the number of bits and the values of the quantized weights. The weights are constrained to integer powers of two, which reduces floating-point multiplication by the weights to bit shifts. We view the range of bit shifts as the number of bits of precision. Our second contribution is a novel parameterization of such a set of weights with trainable parameters that allow the quantization function to take a variety of shapes, ranging from a hyperbola, to a Heaviside function, to a hyperbolic tangent, or simply an identity mapping. These additional trainable parameters significantly generalize TC and prior work on learning neural networks with weights that are powers of two. To the best of our knowledge, GTC is the first end-to-end algorithm that jointly optimizes learning and compression with the aim of increased performance after training.
We evaluate GTC on two tasks: supervised image classification and unsupervised image generation with a variational auto-encoder (VAE) [Kingma and Welling 2013]. We show preliminary results on the MNIST and CIFAR-10 image datasets. While prior work on low-precision deep neural networks has focused on classification problems, our experiments extend the demonstration to generative models, which can lead to different insights for future work. In addition, we demonstrate a novel variant of GTC that can be used as a generic optimizer for training neural networks, often outperforming Stochastic Gradient Descent (SGD). On the classification tasks, we compare GTC to several state-of-the-art neural network compression and low-precision approaches in Section 4.3. Among the compared methods, GTC shows the best compression with a minimal drop in accuracy while, unlike most prior work, producing a neural network whose forward pass requires no multiplications.
It is clear that (scalar) multiplications constitute the majority of the computational cost of the forward pass during inference and prediction. For brevity, we omit the bias parameters throughout the paper. In our experiments, GTC outputs weights and biases in the form of bit shifts.
Ternary Connect (TC) [Lin et al. 2015] replaces multiplications with logical bit shift operations by quantizing weights to the three values -1, 0 and +1 (up to a constant scaling factor). Our goal is to generalize TC by restricting weights to the feasible set of zero and the signed integer powers of two, {0} ∪ {±2^n : n ∈ Z}, where Z is the set of integers. The generalization to powers of two allows floating-point multiplications to be implemented as logical bit shifts: for a floating-point number x = m · 2^e with mantissa m and exponent e, its product with a weight ±2^n is ±m · 2^(e+n), i.e., the exponent is simply shifted by n.
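To make the exponent arithmetic concrete, the following minimal Python sketch (our illustration, not part of the proposed method; the helper name `shift_mul` is ours) multiplies a float by a power-of-two weight purely by shifting its exponent:

```python
import math

def shift_mul(x: float, n: int, sign: int = 1) -> float:
    """Multiply x by sign * 2**n without a multiplier.

    math.frexp splits x into mantissa m and exponent e (x = m * 2**e);
    math.ldexp reassembles it with the exponent shifted by n.
    """
    m, e = math.frexp(x)
    shifted = math.ldexp(m, e + n)
    return -shifted if sign < 0 else shifted
```

For integer activations the same product is a literal bit shift (`x << n`); the float version above just adds n to the stored exponent.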
As is apparent from the above, in this paper we do not alter the representation of the activation, although the same techniques developed herein can be extended to activations. Similarly, we do not consider the multiplications in the backward pass during gradient-based training; however, a technique similar to Quantized Back-Propagation [Lin et al. 2015] can be developed. Let ŵ denote the quantized weight corresponding to some weight w. The starting point for our quantization scheme is the identity transformation w = sign(w) · 2^(log2 |w|), where the sign(·) operator returns the sign of the operand, or zero if the operand is close to zero. Empirically, we found that the leading sign of ŵ needs to match the sign of w for good performance.
The constraint that ŵ lies in the feasible set is usually enforced through rounding or stochastic rounding [Courbariaux, Bengio, and David 2015]. In our case, either rounding operator can be applied to the exponent log2 |w|. We introduce novel parameters that are specific to each layer, and formulate the quantization function as in (7), in which a transformation g of the exponent is rounded before exponentiation. Here g can be any transformation of the exponent governed by the layer-specific parameters. Whereas prior work has predominantly used fixed quantization schemes, future work may explore schemes where the quantizer is itself an entire neural network. Static quantization schemes are used because they asymptotically minimize the error introduced by quantizing the parameters [Gish and Pierce 1968], and many previous works use quantization error as an objective criterion [Han, Mao, and Dally 2015, Raghavan et al. 2017]. However, when the number and values of the quantization levels are learned, the effect of quantization error on accuracy is not clear. Furthermore, the benefits of reduced computational and memory overhead may be worth a small loss in accuracy incurred by ignoring quantization error. In this paper, we explore a linear function as the choice of g.
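For concreteness, the unbiased stochastic rounding of [Courbariaux, Bengio, and David 2015] mentioned above, applied here to an exponent, can be sketched as follows (an illustrative snippet with names of our own choosing):

```python
import random

def stochastic_round(x: float) -> int:
    """Round x down, then up with probability equal to its fractional
    part, so the result is unbiased: E[stochastic_round(x)] = x."""
    lo = int(x // 1)  # floor; also correct for negative x
    return lo + (random.random() < x - lo)
```

Deterministic rounding simply replaces the random draw with a threshold of one half.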
Despite the simple linear form g(x) = ax + b, these parameters allow a very powerful family of transformations. Figure 1 illustrates four example instantiations of the quantizer for different settings of (a, b). The trivial case a = 1, b = 0 gives the identity transform. The case a = 0 corresponds to ternary quantization based on the sign of the weight. Other settings give a family of hyperbolas whose foci and eccentricity are controlled by (a, b), or a family of hyperbolic-tangent-like functions. To the best of our knowledge, this is a novel parameterization of the set (1). These additional trainable parameters significantly generalize TC and prior work on learning neural networks with weights that are powers of two [Miyashita, Lee, and Murmann 2016, Marchesi et al. 1993, Kwan and Tang 1993, Lin et al. 2015, Courbariaux, Bengio, and David 2015]. Next we describe a novel notion of bit precision stemming from (7).
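A minimal sketch of the parameterized quantizer, assuming the form ŵ = sign(w) · 2^round(a·log2|w| + b) (the function and parameter names here are our own shorthand):

```python
import math

def gtc_quantize(w: float, a: float, b: float) -> float:
    """Quantize w to a signed power of two via a learned linear
    transform of its log-magnitude: a=1, b=0 rounds to the nearest
    power of two in the log domain; a=0 collapses to sign-based
    (ternary-style) quantization at the fixed magnitude 2**b."""
    if w == 0.0:
        return 0.0
    exponent = round(a * math.log2(abs(w)) + b)
    return math.copysign(2.0 ** exponent, w)
```

Since the output is always 0 or ±2^n for integer n, multiplication by a quantized weight reduces to a bit shift plus a possible sign flip.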
Consider an example weight matrix quantized via (7): entries whose magnitudes fall between powers of two are quantized with large quantization error. In our experiments we find that the quantization error is not very important for high accuracy, and that the learned values of (a, b) are non-trivial.
Consider the compression of a deep neural network whose parameters are of the form of (7), using a fixed-length code. Apart from the leading sign bit, each exponent can be stored as an integer. The exponent of a 32-bit floating point number fits in 8 bits, so one might use an 8-bit integer for a 4X model compression compared to the storage of floating point parameters. Typically, however, the parameters of deep networks never come close to such extreme exponents. Thus, the natural source of compression is the range of values actually taken by the exponent over all the parameters of the network. Alternatively, one might use the cardinality of the set of values taken by the exponent, but we adopt the former due to its simplicity and differentiability. For each layer, we define the number of bits as shown in (10).
The leading term of one accounts for the sign bit, whereas the difference between the maximum and minimum exponent accounts for the range of all the exponents, sufficient to represent every exponent between the two. In some cases, this range may not include zero, whereas the quantized weight may be zero due to the sign function in (7). Therefore, we add one and take the logarithm (base 2) to finally arrive at the number of bits.
Continuing with our previous example, the range of exponents in the weight matrix determines the number of bits via (10). Note that this is not an optimal code: when only a few distinct exponents occur within the range, a shorter code could represent the same set of exponents (excluding the sign bit).
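Our reading of the bit count in (10) can be sketched as follows. The +2 inside the logarithm (the exponent range plus an extra level reserved for zero) is an assumption based on the description above; the exact constant in the paper's equation may differ:

```python
import math

def layer_bits(exponents: list) -> float:
    """Per-layer bit count: one sign bit, plus enough bits to index
    every integer exponent between the layer's minimum and maximum,
    plus one extra level reserved for an exact zero weight."""
    span = max(exponents) - min(exponents)
    return 1 + math.log2(span + 2)
```

This measure depends only on the extreme exponents, which is what makes it simple and differentiable compared to counting distinct values.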
3 Joint Learning and Compression
We use Knowledge Distillation [Hinton, Vinyals, and Dean 2015], motivated by the recent success [Polino, Pascanu, and Alistarh 2018, Mishra and Marr 2018, Mishra and Marr 2017] of knowledge distillation techniques in training low-precision networks. Similar to this prior work, we simultaneously train a standard 32-bit network, herein referred to as the teacher, and a low-precision network, herein referred to as the student (see Figure 2).
GTC differs from this prior work in three ways. First, GTC learns the number of bits as well as the quantized values. Second, whereas one goal of distillation in previous work is to reduce the number of neural network layers by distilling a compact student from a larger teacher, GTC uses identical network architectures for the teacher and the student. Finally, in prior work the teacher is trained first and then held fixed during distillation, whereas GTC learns both the student and teacher models simultaneously from scratch. The GTC optimization objective (11) is the sum of three terms.
The first term is any standard supervised or unsupervised loss function, for example, cross-entropy (negative log-likelihood) for supervised classification, or reconstruction error for unsupervised auto-encoders.
The second term is a distillation loss that is a function of the pairwise predictions of the teacher and the student. Note that in our case, since the student parameters are derived from the teacher parameters and the quantization parameters via (7), the distillation term is a function of both. The difference of logits (before softmax) between the student and the teacher was used in [Hinton, Vinyals, and Dean 2015]. Empirically, we found that using the cross-entropy (after softmax) leads to significantly better accuracy. We extended this to the unsupervised setting by applying a sigmoid squashing function before calculating the cross-entropy.
The last term penalizes values of the quantization parameters that lead to a large variation in the power-of-two weights after quantization. This notion is captured by the number of bits as shown in (10), so the penalty is simply the sum of (10) over all the neural network layers. Roughly speaking, its coefficient corresponds to the cost of bits and can be used to encode hardware constraints. As the coefficient tends to zero, GTC may use up to 32 bits; as it grows large, GTC produces a binary network. This penalty term is reminiscent of Minimum Description Length [Grünwald 2007] formulations. The regularizing effect of penalizing the number of bits was demonstrated in [Raghavan et al. 2017]. Note that in the absence of the penalty term, the student and teacher networks can be made arbitrarily close by adjusting the quantization parameters.
One can interpret this objective as the Lagrangian of a constrained optimization problem [Raghavan et al. 2017, Carreira-Perpinán 2017, Carreira-Perpinán and Idelbayev 2017] that primarily optimizes the accuracy of the standard 32-bit model, subject to the constraints that the pairwise student-teacher predictions are similar and that the student model is maximally compressed in the sense of Section 2.2.
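The three-term objective and the after-softmax distillation term can be sketched in a few lines of pure Python (an illustration only; the coefficient names `lam_d` and `lam_b` are our own):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_xent(teacher_logits, student_logits):
    """Cross-entropy between teacher and student predictions after
    softmax -- the distillation term found to beat matching raw logits."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def gtc_objective(task_loss, teacher_logits, student_logits,
                  bit_cost, lam_d, lam_b):
    """Task loss + lam_d * distillation loss + lam_b * total bit cost."""
    return (task_loss
            + lam_d * distill_xent(teacher_logits, student_logits)
            + lam_b * bit_cost)
```

The distillation term is minimized when student and teacher predictions agree, while the bit-cost term pulls the quantization parameters toward fewer levels.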
During training, we use the gradients of the objective with respect to the teacher weights and the quantization parameters to update each of them (we use the ADAM optimizer unless otherwise specified); these updates implicitly determine the (quantized) parameters of the student. Note that in general the gradients with respect to the teacher weights are not the same as the corresponding gradients in standard 32-bit training, due to the additional gradients from the distillation loss. During testing, we discard the teacher model and make predictions using the student alone. We use the identity function as the straight-through estimator (STE) [Bengio, Léonard, and Courville 2013] to propagate gradients through the round and ceiling functions.
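The identity STE can be illustrated on a toy scalar problem: the forward pass uses round(w), but the backward pass treats round as the identity, so gradients still reach w. This is an illustrative toy, not the paper's training loop, and all constants are ours:

```python
def train_scalar(target=3.3, lr=0.1, steps=200):
    """Minimize (round(w) - target)**2 with the identity STE:
    the gradient of the loss w.r.t. round(w) is applied to w
    directly, as if d round(w)/dw were 1."""
    w = 0.0
    for _ in range(steps):
        y_hat = round(w)               # quantized forward pass
        grad = 2.0 * (y_hat - target)  # dL/dy_hat
        w -= lr * grad                 # identity STE: dL/dw := dL/dy_hat
    return w
```

Without the STE, the gradient through round() would be zero almost everywhere and w would never move; with it, w settles near the target.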
It should be clear from this discussion that the above training procedure is conducted on a powerful GPU, and that the benefits of power-of-two weights are not leveraged during training. That is, training is performed offline before deployment to an edge device, where the compute, memory and power savings are realized. Future work should extend GTC to the quantization of gradients and activations, e.g., using quantized back-propagation [Lin et al. 2015] or projection-based algorithms. These are yet to be explored in the setting where the bits and levels are learned.
We evaluate GTC on two tasks, namely, supervised image classification and unsupervised image generation with a variational auto-encoder (VAE). We show results on two computer vision datasets: MNIST [LeCun et al. 1998] and CIFAR-10 [Krizhevsky 2009]. We used the ADAM optimizer [Kingma and Ba 2014] with a fixed learning rate unless otherwise specified.
Neural Architectures: For the MNIST classification problem, we used a LeNet-style architecture that consists of four layers, namely two convolutional layers (each with max pooling and ReLU) followed by two fully connected layers (with ReLU and soft-max respectively). Note that this is a simpler model than LeNet-5. For CIFAR-10 classification, we used Configuration D of VGG-16 [Simonyan and Zisserman 2014], with the addition of batch normalization and ReLU after the convolutional layers. We also added Dropout [Srivastava et al. 2014] after alternating convolutional layers to avoid overfitting.
For the MNIST unsupervised problem, we use only the images as input. The encoder part of the VAE uses three fully connected layers, each with tanh activations. The result is fed into two layers that produce the mean and variance of the latent code. The decoder consists of three fully connected layers, each with tanh activations. Finally, we use one fully connected layer with sigmoid activations to arrive at a grayscale image.
The first set of experiments demonstrates the training and convergence of GTC and the performance of the resulting GTC model. In these experiments, we compare with the corresponding standard 32-bit model trained independently of GTC (i.e., not the teacher model used in the distillation scheme of Section 3). In Section 4.3, we compare GTC with prior work as well as some natural baseline algorithms.
Figure 7 shows these experiments for supervised classification and the unsupervised VAE respectively. On MNIST (left), GTC converges to a training accuracy close to that of the standard 32-bit model in roughly the same number of iterations (10K iterations). We note some oscillations in the training accuracy of GTC; however, they remain small once the standard model converges. This shows the benefit of training with distillation: the 32-bit model guides the learning of the GTC model. Moreover, on the test set, the accuracy of GTC is almost identical to that of the standard model.
On the CIFAR-10 dataset (see Figure 7 (right)), we observed a similar pattern, with GTC achieving a training accuracy slightly below that of the standard model. The performance on the test set is lower than state-of-the-art results using VGG-16 on CIFAR-10. We believe the lower performance on CIFAR-10 is caused by the use of dropout layers, which tends to slow learning, by the limited number of training iterations, and by the absence of fine-tuning of the hyperparameters of GTC (see Section 4.4). One can see that the trend in accuracy is upward.
Figure 7 compares a VAE trained using GTC with a standard VAE on the MNIST dataset. For both variants, the reconstruction error was defined as the binary cross-entropy between the input image and the 32-bit reconstruction; for GTC, this is the first term of (11). The distillation loss (second term) was defined as the binary cross-entropy between the reconstructions of the teacher and student models. The figure shows that the VAE trained using GTC converges in terms of the reconstruction error. Moreover, in some cases the reconstructions from GTC appear less blurry than those of the 32-bit standard model (see Figure 7).
Overall, GTC naturally converged to a binary network for MNIST, and to binary weights for all but the first layer of VGG-16 on CIFAR-10, with very little loss of accuracy. GTC also converged to a binary VAE network whose output has visual quality similar to that of the standard VAE. This is the main takeaway from the learning experiments on GTC. To the best of our knowledge, these are the first experiments demonstrating generative models such as VAEs with low-precision parameters. Future work may extend this to more sophisticated generative models and draw potentially different insights than feedforward classification models.
To further showcase the compression achieved by GTC, Table 1 shows the weights learned by GTC along with the number of bits and the learned values of a and b. First, we observe that the learned values of a tend to be negative, whereas the values of b tend to be positive. That is, GTC learns S-shaped quantization functions. A special case of this is seen in the middle layers of VGG-16, where both a and b are close to zero, which corresponds to a Heaviside function (see Figure 1).
The learned weights are shown as the number of bit shifts (and flipping of sign bit) required so that the result is equivalent to a floating-point multiplication. For each number of shifts in each layer, GTC uses both the signed (with flip) and unsigned weight. The histograms show that the learned weights are almost always bi-modal with two equal modes, and never uni-modal. The weights of the middle layers of VGG are shown as zero bit shifts.
Figure 7 shows the convergence of the bit cost term of (11), summed over the layers of the network. We observe some oscillations for LeNet on MNIST, but none for VGG-16 on CIFAR-10 or the VAE on MNIST. In all three cases, the bit cost converges within an early fraction of the total training iterations.
4.3 Experimental Comparison to Prior Work
We compared GTC with the following.
Post Mortem (PM) takes the final 32-bit model and snaps each weight to the nearest integer power of two. The result requires a fixed number of bits of storage (an integer for the exponent plus one bit for the sign). In general, there is no guarantee of accuracy.
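A sketch of the PM baseline (our illustration; it snaps in the log domain, keeping the sign and mapping zero to zero):

```python
import math

def post_mortem(weights):
    """Snap each trained 32-bit weight to the nearest power of two
    (nearest in the log domain), preserving its sign."""
    snapped = []
    for w in weights:
        if w == 0.0:
            snapped.append(0.0)
        else:
            e = round(math.log2(abs(w)))   # nearest integer exponent
            snapped.append(math.copysign(2.0 ** e, w))
    return snapped
```

Because the snapping happens only after training, no gradient ever sees the quantization, which is why accuracy is unconstrained.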
Straight Through Estimator (STE) directly trains the quantized model without using distillation. The objective function is the cross-entropy of the student network. Gradients with respect to the weights are calculated using the identity STE and used to update them. There are no quantization parameters. This approach worked very well in terms of accuracy, but requires a comparatively large number of bits.
STE+Bit uses the same approach as above, with the addition of a bit penalty term as in (11). The bits are calculated using (10), without using any parameters. This approach retained the high accuracy of STE in MNIST while compressing to a binary network. It did not make any progress on the CIFAR-10 dataset and resulted in poor accuracy. Together, these indicate the necessity of the distillation approach to training.
LogNet [Miyashita, Lee, and Murmann2016, Lee et al.2017] proposed a method to train a network similar to GTC, but weights, gradients and activations of the network are quantized. The number of bits is not learned. We attempted to reproduce their algorithm but restricted the quantization to the weights of the network. We were not able to reproduce the performance reported.
Ternary Connect TC [Lin et al.2015]: We show the results from the paper. Their network architecture is slightly different from the one used for the GTC experiments. GTC is more general and shows superior compression to TC.
Bayesian Compression (BC) [Louizos, Ullrich, and Welling2017]: We show the performance as reported. BC shows lower accuracy on MNIST and higher accuracy on CIFAR-10. We note that the VGG-16 implementation therein does not use any dropout. GTC shows significantly higher compression than BC.
Variational Quantization (VQ) [Achterhold et al.2018] extends BC to ternary quantization. The number of bits is not learned. VQ shows similar accuracy and compression as GTC on MNIST.
Deep Compression (DC) [Han, Mao, and Dally2015] does not learn the number of bits, but learns the locations of the centroids. It uses quantization error as the primary criterion. GTC shows superior accuracy and compression.
Differentiable Quantization (DQ) [Polino, Pascanu, and Alistarh 2018] does not learn the number of bits, but learns the locations of the centroids. The paper used a ResNet-style architecture on CIFAR-10 with a fixed number of bits. The reported accuracy is inferior to that of GTC at a comparable number of bits.
BitNet [Raghavan et al. 2017] does learn the number of bits on a linear quantization scale, and has a learning scheme similar to GTC. However, the reported accuracy is lower, with lower compression, in comparison to GTC.
GTC (this work): Over all the compared methods, GTC shows the best compression with a minimal drop in accuracy. We believe the lower performance on CIFAR-10 is caused by the use of dropout layers, the limited number of training iterations, and the absence of fine-tuning of the hyperparameters of GTC. Moreover, with the exception of PM, STE and TC, none of the above baselines produces a neural network whose forward pass is free of multiplications.
In previous sections, we showed the results of GTC with fixed coefficients for the distillation and bit-penalty terms. We arrived at these values via a grid search on the MNIST dataset, picking the setting with the highest accuracy at the lowest number of bits. We did not perform any tuning for the CIFAR-10 dataset, which might explain its somewhat low performance. Table 3 shows the results of this grid search. As expected, for any fixed distillation coefficient, increasing the bit-penalty coefficient quickly drives the number of bits down and significantly decreases accuracy. Similarly, for any fixed bit penalty, increasing the coefficient of the distillation loss increases the number of bits and generally increases accuracy. Ultimately, hardware constraints dictate the desired number of bits, which in turn can be used to choose appropriate coefficients. Next we show an interesting use case of GTC as an optimizer by annealing the bit-penalty coefficient.
4.5 GTC for Optimization
Prior work [Raghavan et al. 2017, Lin et al. 2015] has noted that low-precision networks sometimes exhibit fast convergence in the early stages of learning. We leverage this property in a novel variant of GTC that can be used as an off-the-shelf optimizer: we initialize GTC with a high value of the bit-cost coefficient and, as training progresses, rapidly reduce it. Figure 7 shows the result of such a GTC variant, trained with Stochastic Gradient Descent (SGD) at a fixed learning rate. In the annealed variant, the coefficient is divided by a constant factor every fixed number of iterations. The figure shows that the training accuracy of both GTC variants is higher until 40K iterations. The 32-bit standard network (optimized by SGD) converges to a higher accuracy than GTC with the fixed coefficient after 40K iterations. Interestingly, the annealed variant has superior accuracy and dominates the 32-bit standard model throughout. Notably, this effect vanishes when using ADAM instead of SGD, confirming the effect of low-precision training on the effective learning rate noted in [Raghavan et al. 2017].
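The annealing schedule described above can be sketched as follows. All constants here (the initial coefficient, the division factor and the period) are illustrative, since the exact values are tuned per experiment:

```python
def annealed_coeff(step, c0=1.0, factor=10.0, period=1000):
    """Bit-cost coefficient schedule: start high, so early training
    behaves like a coarse low-precision run, then divide by `factor`
    every `period` iterations to gradually restore precision."""
    return c0 / factor ** (step // period)
```

The schedule is piecewise constant, so the objective only changes at period boundaries rather than every step.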
This paper proposed GTC, which generalizes Ternary Connect in a number of ways. GTC was demonstrated to be effective in learning and compression, and in comparison to baselines and prior work. These algorithmic advances directly translate to power and computational efficiency during inference. To quantify the gains in hardware, we offer two key measurements, on an FPGA and on a Raspberry Pi. The gains come from the compressed weights and the elimination of multiplications: instead of performing the standard multiply/add operation, we can instead do a shift/add operation. On the FPGA, the per-operation power consumption was reduced from 49mW for multiplication to 10mW for bit-shifting as in (3) (based on RTL synthesis at 800MHz on a 16nm Zynq Ultrascale+ (ZU9EG) using Xilinx Vivado tools). Similarly, on the Raspberry Pi, the per-operation time was reduced from 93ns to 67ns. These gains are significant when amortized over the millions or billions of parameters in deep networks. Additional gains are anticipated from handling, storing and retrieving compressed weights. Specialized digital hardware can implement bit-shifting directly within memory storage for further gains. Even in analog multipliers, a reduction in circuitry is realized with simpler current/voltage dividers compared to complex op-amp circuitry.
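For illustration, the shift/add inner product referred to above can be written for integer activations as follows (an illustrative snippet; we assume non-negative integer activations so that right shifts are well defined):

```python
def shift_add_dot(xs, shifts, signs):
    """Inner product with power-of-two weights: weight i equals
    signs[i] * 2**shifts[i], so each 'multiplication' is a left shift
    (shifts[i] >= 0) or right shift (shifts[i] < 0) plus a sign flip."""
    acc = 0
    for x, n, s in zip(xs, shifts, signs):
        term = x << n if n >= 0 else x >> -n
        acc += term if s >= 0 else -term
    return acc
```

Every term of the accumulation uses only a shifter and an adder, which is exactly where the measured power and latency savings come from.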
This material is based upon work supported by the Office of Naval Research (ONR) under contract N00014-17-C-1011, and NSF award #1526399. The opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research, the Department of Defense or the U.S. Government.
- [Achterhold et al.2018] Achterhold, J.; Koehler, J. M.; Schmeink, A.; and Genewein, T. 2018. Variational network quantization. In International Conference on Learning Representations.
- [Bengio, Léonard, and Courville2013] Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- [Carreira-Perpinán and Idelbayev2017] Carreira-Perpinán, M. A., and Idelbayev, Y. 2017. Model compression as constrained optimization, with application to neural nets. part ii: Quantization. arXiv preprint arXiv:1707.04319.
- [Carreira-Perpinán2017] Carreira-Perpinán, M. A. 2017. Model compression as constrained optimization, with application to neural nets. part i: General framework. arXiv preprint arXiv:1707.01209.
- [Courbariaux, Bengio, and David2015] Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, 3123–3131.
- [Gish and Pierce1968] Gish, H., and Pierce, J. 1968. Asymptotically Efficient Quantizing. IEEE Transactions on Information Theory 14(5):676–683.
- [Grünwald2007] Grünwald, P. D. 2007. The minimum description length principle. MIT press.
- [Han, Mao, and Dally2015] Han, S.; Mao, H.; and Dally, W. J. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
- [Hinton, Vinyals, and Dean2015] Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- [Krizhevsky2009] Krizhevsky, A. 2009. Learning Multiple Layers Of Features From Tiny Images.
- [Kwan and Tang1993] Kwan, H. K., and Tang, C. 1993. Multiplierless multilayer feedforward neural network design suitable for continuous input-output mapping. Electronics Letters 29(14):1259–1260.
- [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-Based Learning Applied To Document Recognition. Proceedings of the IEEE 86(11):2278–2324.
- [Lee et al.2017] Lee, E. H.; Miyashita, D.; Chai, E.; Murmann, B.; and Wong, S. S. 2017. Lognet: Energy-efficient neural networks using logarithmic computation. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 5900–5904. IEEE.
- [Lin et al.2015] Lin, Z.; Courbariaux, M.; Memisevic, R.; and Bengio, Y. 2015. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009.
- [Louizos, Ullrich, and Welling2017] Louizos, C.; Ullrich, K.; and Welling, M. 2017. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, 3288–3298.
- [Marchesi et al.1993] Marchesi, M.; Orlandi, G.; Piazza, F.; and Uncini, A. 1993. Fast neural networks without multipliers. IEEE transactions on Neural Networks 4(1):53–62.
- [Mishra and Marr2017] Mishra, A., and Marr, D. 2017. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852.
- [Mishra and Marr2018] Mishra, A., and Marr, D. 2018. Wrpn & apprentice: Methods for training and inference using low-precision numerics. arXiv preprint arXiv:1803.00227.
- [Miyashita, Lee, and Murmann2016] Miyashita, D.; Lee, E. H.; and Murmann, B. 2016. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.
- [Polino, Pascanu, and Alistarh2018] Polino, A.; Pascanu, R.; and Alistarh, D. 2018. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
- [Raghavan et al.2017] Raghavan, A.; Amer, M.; Chai, S.; and Taylor, G. 2017. Bitnet: Bit-regularized deep neural networks. arXiv preprint arXiv:1708.04788.
- [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
- [Wu et al.2018] Wu, S.; Li, G.; Chen, F.; and Shi, L. 2018. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.