1 Introduction
Deep convolutional networks have been successfully applied to a large variety of computer vision tasks, such as image recognition (He et al., 2016), object segmentation (He et al., 2017) and scene segmentation (Chen et al., 2018). These networks are large. For example, ResNet-152 has 60.2 million parameters (Zagoruyko and Komodakis, 2016) and requires 11.3 billion FLOPs (He et al., 2016). A large number of parameters results in a large memory footprint: at 32-bit floating-point precision, 229.64 MB is needed to store the ResNet-152 parameter values. In low-latency or mobile applications, lower computational complexity, a lower memory footprint and better energy efficiency are desired. Many prior works address this need for lower computational complexity. In a survey paper, Cheng et al. (2018) organize efficient computation of neural networks into four categories: network pruning, low-rank decomposition, teacher-student networks and network quantization.
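As a quick sanity check of the memory figure above (the parameter count is from the source; the arithmetic is ours):

```python
params = 60.2e6        # ResNet-152 parameter count (He et al., 2016)
bytes_per_param = 4    # 32-bit floating point is 4 bytes per weight

mib = params * bytes_per_param / 2**20
print(round(mib, 2))   # 229.64 -- the footprint quoted above

# With 1-bit binary weights, the same parameters would need only:
binary_mib = params / 8 / 2**20
print(round(binary_mib, 2))  # about 7.18
```

This 32x reduction is the memory motivation for the binary weight networks discussed below.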
Network pruning removes redundant parameters that are not sensitive to performance. Low-rank decomposition uses matrix or tensor decomposition methods to reduce the number of parameters. In a teacher-student network, knowledge transfer is exploited to train a smaller student network using a bigger teacher network. The common theme of these three categories is a reduction in the number of parameters. During forward propagation, one of the most computationally intensive operations in a neural network is the matrix multiplication of parameters with the input. With fewer parameters, FLOPs and memory footprint shrink. With fewer FLOPs, energy efficiency improves.
In the aforementioned categories, network parameters typically use floating-point precision. In the last category, network quantization, the parameters and, in some works, all computations are quantized. For many low-latency or mobile applications, we typically train offline and deploy pre-trained models. Thus, the main goal is efficiency in forward propagation, and it is acceptable to compute backward propagation and parameter updates in floating-point precision. The seminal work of Courbariaux et al. (2015) matches our scope. They quantize network weights to binary values, e.g. -1.0 and 1.0, while also keeping weight values in floating-point precision for backward propagation. During forward propagation, instead of a matrix multiplication of weights with the input, the signs of these binary weights specify addition or subtraction of inputs. The memory footprint is dramatically reduced to 1 bit per weight. Energy efficiency improves because addition is more energy efficient than multiplication (Horowitz, 2014).
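To illustrate the scheme, here is a minimal NumPy sketch of the idea (not the authors' implementation): float "shadow" weights are kept for parameter updates, while the forward pass uses only their signs, so the matrix product reduces to signed sums of inputs.

```python
import numpy as np

def binarize(w):
    """Quantize weights to {-1.0, +1.0} by sign (zero mapped to +1)."""
    return np.where(w >= 0, 1.0, -1.0)

w_float = np.array([[0.3, -0.7],
                    [-0.2, 0.5]])  # kept in float for backward propagation
w_bin = binarize(w_float)          # used during forward propagation

x = np.array([2.0, 3.0])
y = w_bin @ x  # row 0: +2 - 3 = -1; row 1: -2 + 3 = +1 -- only add/subtract
```

Each output element is computed with additions and subtractions only, which is the source of the energy savings cited above.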
Prior works in network quantization (Courbariaux et al., 2015; Li et al., 2016; Hubara et al., 2016; Zhou et al., 2016; Wu et al., 2018) typically start training by quantizing all weights in the network. Quantization creates an error, ε, which is the difference between the original value and its quantized value. In other words, the actual weight value, w, is

w = w_q + ε,    (1)

where w_q is the quantized weight value.
To reduce the impact of ε, we hypothesize that if we quantize some weights while leaving others in floating-point precision, the latter would be able to compensate for the error introduced by quantization. To reach a fully quantized network, we propose an iterative training, where we gradually quantize more and more weights. This raises two questions. First, how to choose the grouping of weights to quantize together at each iteration. Second, how to choose the quantization order across groups. A feedforward deep neural network has many layers, so one natural grouping choice is one group per layer. For the quantization order of groups, we propose a sensitivity pre-training to choose the order. A random order and other obvious orders are used for comparison.
Contributions.

- We propose an iterative training regime that gradually finds a full binary weight network starting from an initial partial binary weight network.
- We demonstrate empirically that starting from a partial binary weight network results in higher accuracy than starting from a full binary weight one.
- We demonstrate empirically that the forward order is best among the obvious orders. In addition, sensitivity pre-training can improve on it further.
- Code is available at https://github.com/rakutentech/iterative_training.
In the sections that follow, we describe the iterative training algorithm in detail. Next, we present the iterative training of fully connected networks using the MNIST dataset (Lecun et al., 1998) and of convolutional neural networks using the CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) datasets. Then we present the sensitivity pre-training for convolutional networks. Finally, we discuss related work and conclude.

2 Iterative Training
A feedforward deep neural network has many layers, say, L. We study iterative training by quantizing more and more weights layer by layer. Iterative training starts from one quantized layer while all other layers are in floating-point precision. Each iteration trains for a fixed number of epochs, E. Next, we quantize the next layer and train for another E epochs. Iterative training stops when there are no more layers to quantize. In the case of ResNet architectures, as in the original paper, we reduce the learning rate by a factor of 10 twice and continue training. Algorithm 1 illustrates the iterative training regime. As the experiments will show, this regime consistently finds fully quantized networks with better accuracies than starting from an initial fully quantized network (our baseline).
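The regime can be sketched as a schedule over layers (a paraphrase of Algorithm 1, not the paper's code; the function name is ours):

```python
def iterative_binarization_schedule(order):
    """Yield, per iteration, the set of layers whose weights are binarized.

    Each yielded state is trained for E epochs before the next layer is
    added; training stops once every layer is binarized.
    """
    binarized = []
    for layer in order:
        binarized.append(layer)
        yield sorted(binarized)

# Forward order for a 3-layer network (layers indexed from the input):
steps = list(iterative_binarization_schedule(order=[0, 1, 2]))
# steps == [[0], [0, 1], [0, 1, 2]]
```

Any permutation of layer indices can be passed as `order`, which is how the forward, reverse, random and sensitivity-based orders below are realized.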
For the quantization scheme, we follow weight binarization in Courbariaux et al. (2015), but, for simplicity, without "tricks": no weight clipping and no learning rate scaling. In addition, we use softmax instead of square hinge loss. The inner for-loop in Algorithm 1 is the same as the training regime in Courbariaux et al. (2015), except that a state variable is introduced to control whether a layer needs binarization or not. We use the PyTorch framework (Paszke et al., 2019). ImageNet results in the biggest GPU memory needs and the longest training time: about 10 GB and about one day to train one model on an Nvidia V100, respectively.

Table 1: Network architectures, datasets and training hyperparameters.

Network                   | 300-100-10     | 784-784-10     | Vgg5           | Vgg9                       | ResNet20                               | ResNet21
Convolutional layers      | -              | -              | 64, 64         | 64, 64, 128, 128, 256, 256 | 16, 3x[16, 16], 3x[32, 32], 3x[64, 64] | 64, 4x[64], 5x[128], 5x[256], 5x[512]
Fully connected layers    | 300, 100, 10   | 784, 784, 10   | 256, 256, 10   | 256, 256, 10               | 10                                     | 1000
Dataset                   | MNIST          | MNIST          | CIFAR-10       | CIFAR-10                   | CIFAR-10                               | ImageNet 2012
Train / Validation / Test | 55K / 5K / 10K | 55K / 5K / 10K | 45K / 5K / 10K | 45K / 5K / 10K             | 45K / 5K / 10K                         | 1.2M / 0 / 50K
Batch size                | 100            | 100            | 100            | 100                        | 128                                    | 256
Optimizer                 | Adam           | Adam           | Adam           | Adam                       | SGD, momentum 0.9, weight decay 1e-4   | SGD, momentum 0.9, weight decay 1e-4
Pre-training epochs       | 150            | 150            | 200            | 450                        | 300                                    | 20
Epochs per layer, E       | 150            | 150            | 150            | 150                        | 50                                     | 2
Layers, L                 | 3              | 3              | 5              | 9                          | 20                                     | 21
Total epochs              | 450            | 450            | 750            | 1350                       | 1200                                   | 67
Order count (L!)          | 6              | 6              | 120            | 362,880                    | ~2e18                                  | ~5e19
Weight count              | 266,200        | 1,237,152      | 4,300,992      | 2,261,184                  | 268,336                                | ~11e6
As shown by the order count in Table 1, there is a large number of layer binarization orders for a deep neural network. In this work, we experiment with random and obvious orders to show that starting from a partially quantized network is better than starting from a fully quantized one. In a later section, we introduce the proposed sensitivity pre-training to select a layer binarization order.
For obvious orders, we experiment with the forward order, i.e., quantizing layer by layer from the input layer towards the output layer, and the reverse order, i.e., from the output layer towards the input layer. We then compare to training where (1) all weights are quantized from the start (baseline) and (2) all weights are in floating-point precision and stay so. As the experiments will show, for bigger and deeper networks, the forward order consistently finds fully quantized networks with better accuracies than the other orders.
In the following subsections, we discuss experimental results for fully connected and convolutional networks.
2.1 Iterative Training for Fully Connected Networks
We investigate iterative training of fully connected networks with the MNIST dataset, which has 60,000 training and 10,000 test images. We use the last 5,000 images from the training set for validation and the remaining 55,000 as training images for all MNIST experiments. We use no data augmentation. We use batch normalization (Ioffe and Szegedy, 2015), no dropout and weight initialization as in He et al. (2015). We use softmax as the classifier.
For iterative training, we train for 150 epochs per layer. For each network architecture, the total number of training epochs is the number of layers multiplied by 150 epochs. Because the chosen networks have three layers, all MNIST experiments are trained for 450 epochs. For all cases, we find the best learning rate from the best error on the validation set. For layer-by-layer binarization cases, the best error is selected from the epochs in which all layers are binarized. We then use each corresponding best learning rate for the error on the test set. We vary the seed over 5 training sessions and report the learning curves of the average test errors in the figures. Table 1 reports other hyperparameters.
Table 2: Test errors for the MNIST networks and the improvement from upgrading to the bigger network.

Case    | 300-100-10 | 784-784-10 | Improvement
Binary  | 0.023      | 0.021      | 0.002
Float   | 0.015      | 0.014      | 0.001
Reverse | 0.024      | 0.018      | 0.006
Forward | 0.026      | 0.016      | 0.010
For network architectures, we study the 300-100-10 network (Lecun et al., 1998) and a bigger variant, 784-784-10. Figure 1 shows test errors for the 300-100-10 and 784-784-10 networks. The float case is training where all weights are in floating-point precision and stay so. The binary case (baseline) is training where all weights are binarized from the start. The forward case is training where layer binarization proceeds in the forward order; the reverse case, in the reverse order. The solid lines are the mean across multiple runs and the matching shaded color is one standard deviation.
For the smaller network, 300-100-10, the binary case reaches a lower error than the forward and reverse orders. Next best is the reverse order, then the forward one. This shows that the order of layer binarization matters for accuracy. On the contrary, for the bigger network, 784-784-10, the forward and reverse cases do better than the binary one. The binarization operation is not differentiable. According to Equation 1, it injects a random error signal into the network. During iterative training, some of the weights are in floating-point precision. We hypothesize that they compensate for the random error. At the same time, we think bigger networks are more robust due to having more parameters.
The error improvement from upgrading to a bigger network is given in Table 2. The forward and reverse orders have significantly higher improvement than float and binary, showing that iterative training is beneficial. In addition, the forward order has a higher improvement than reverse. We observe the same pattern for subsequent network architectures. Namely, for bigger and deeper networks, starting from a partial binary weight network instead of a full binary weight network, iterative training with the forward quantization order finds full binary weight networks with higher accuracies.
2.2 Iterative Training for Convolutional Networks
We investigate iterative training of convolutional networks with the CIFAR-10 dataset, which has 50,000 training and 10,000 test images. We randomly choose 5,000 images from the training set as the validation set and the remaining 45,000 as training images for all CIFAR-10 experiments. We use the same data augmentation as He et al. (2016): 4 pixels are padded on each side, and a 32x32 crop is randomly sampled from the padded image or its horizontal flip. We use batch normalization, no dropout and weight initialization as in He et al. (2015). We use softmax as the classifier.

We experiment with VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016) architectures. For iterative training of the VGG architectures, we train for 150 epochs per layer. For iterative training of the ResNet20 architecture, we train for 50 epochs per layer. As in the original paper, we reduce the learning rate by a factor of 10 twice, once at 1000 epochs and a second time at 1100 epochs, then stop training at 1200 epochs. Using the same methodology as the MNIST experiments, for all cases we use the validation set to tune the learning rate and the test set to report errors. Table 1 reports other hyperparameters.
For the VGG architecture, we study a shallower network, Vgg5, and a deeper one, Vgg9. As their names suggest, Vgg5 has 5 layers and Vgg9 has 9. Figure 2 shows test errors for the Vgg5 and Vgg9 networks. The float case is training where all weights are in floating-point precision and stay so. The binary case (baseline) is training where all weights are binarized from the start. The forward case is training where layer binarization proceeds in the forward order; the reverse case, in the reverse order; and the random case, in a randomly selected order.
For both network architectures, the binary case has the highest error and the float case the lowest error. In the same pattern as the larger MNIST network, starting from partial binary weight networks, iterative training finds full binary weight networks that have lower error than the binary cases. For Vgg5, the shallower network, the ascending error ranking is reverse, forward, then random. For Vgg9, the deeper network, the ranking is forward, random, then reverse. This shows again that layer binarization order matters.
Table 3: Test errors for the VGG networks and the improvement from upgrading to the deeper network.

Case    | Vgg5 | Vgg9   | Improvement
Binary  | 0.30 | 0.28   | 0.02
Float   | 0.16 | 0.08   | 0.08
Reverse | 0.22 | 0.1126 | 0.1074
Forward | 0.22 | 0.1025 | 0.1175
The error improvement from upgrading to Vgg9 from Vgg5 is summarized in Table 3. There is a small improvement for the binary case. The float case has a significantly higher improvement than binary. Next higher is the reverse case. Finally, the forward case has the highest improvement. This follows the same pattern as the MNIST experiments, favoring iterative training and the forward order.
As shown in Table 1, although Vgg9 has fewer weight parameters than Vgg5, it has more layers. Iterative training continues to be beneficial. We hypothesize that this is due to a more gradual rate of total binarization. For Vgg9, as each layer is binarized, relatively more weights stay in floating-point precision to compensate for the random noise injected by the binarization operation.
For an even deeper network, we study ResNet20 from He et al. (2016), which has 20 layers, as its name suggests. Figure 3 shows test errors for the ResNet20 network. The binary case has the highest error and the float case the lowest error. In the same pattern as the other network architectures, iterative training finds full binary weight networks that have lower error than the binary case. In increasing error order: forward, random, then reverse. Again, this shows that the order of binarization matters and that the forward order has an advantage. In the next section, we propose a sensitivity pre-training to select a binarization order.
3 Sensitivity Pretraining
In prior sections we demonstrated empirically that starting from a partial binary weight network results in higher accuracy than starting from a fully binary weight one for larger and deeper networks. In this section, we describe the proposed sensitivity pre-training to choose the binarization order.
For shallower neural networks like the 3-layer fully connected networks for the MNIST dataset, exhaustive search for the best binarization order is possible. For deeper neural networks such as Vgg5, Vgg9 and ResNet20, it is impractical to do so, as shown by the order count in Table 1. However, we can obtain a measure of error sensitivity to layer quantization, and then let the sensitivity be a guide for the binarization ordering.
Sensitivity is computed as follows. We train L models, where in each model only the weights of one layer are binarized while the others are in floating-point precision. We train for a fixed number of pre-training epochs (Table 1): 200 for Vgg5, 450 for Vgg9 and 300 for ResNet20. As before, we use the validation set to tune the learning rate to get the best validation error for each sensitivity model. For ResNet, as in the original paper, we reduce the learning rate by a factor of 10 twice, once at epoch 200 and again at epoch 250. Then we rank these best validation errors in ascending order. This becomes the ascending layer binarization order for iterative training.
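The ordering step can be sketched as follows (the helper name and the error values are hypothetical illustrations, not measured numbers from the paper):

```python
def sensitivity_order(best_val_errors):
    """Given the best validation error of each single-layer-binarized
    sensitivity model, return layer indices in ascending order of error."""
    return sorted(range(len(best_val_errors)),
                  key=best_val_errors.__getitem__)

# Hypothetical best validation errors for a 4-layer network:
errors = [0.12, 0.08, 0.15, 0.09]
order = sensitivity_order(errors)
# order == [1, 3, 0, 2]: layer 1 is binarized first, layer 2 last
```

The returned list is then fed directly to the iterative training regime as its binarization order.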
During iterative training with the ascending order, the layer that had the lowest error is binarized first, while the layer that had the highest error is binarized last, meaning the latter stays in floating-point precision the longest during training. As shown in Figure 4 for Vgg5 and Vgg9, the ascending order results in a fully binary weight network with the lowest error, beating the forward order. Also shown is the descending order, which is the reverse of the ascending one. For both networks, the descending order results in higher error than ascending, showing again that binarization order matters. In the case of Vgg5, the random order is worst, with the descending order close behind. In the case of Vgg9, the descending order is the worst of all. In short, the lower the error for one order, the higher the error of its reverse.
For ResNet20, Figure 5 shows the test errors with the ascending and descending orders. Unlike for Vgg5 and Vgg9, the forward order reaches better accuracy than both the ascending and descending orders. The proposed sensitivity pre-training considers binarization of layers independently; we hypothesize that there may be interactions between multiple layers.
For ImageNet, we experiment with ResNet18 (He et al., 2016). Since it has 21 layers by our counting, we refer to it as ResNet21. The optimizer is SGD with momentum 0.9 and weight decay 1e-4. For sensitivity pre-training, we train 20 epochs per model. For each layer, we sweep 3 learning rates and use the last-epoch errors on the test set to choose the ascending order. In the full training, we use 2 epochs per layer. The starting learning rate, 0.01, comes from the best learning rate in sensitivity pre-training. As in the original paper, we reduce the learning rate by a factor of 10 twice, after 42 epochs and again after 57 epochs, and stop training after 67 epochs. The floating-point training is a single run, while all binarization trainings use 5 random-seeded runs. Figure 6 shows the test errors with the forward and ascending orders. The ascending order has a lower mean error than the forward order; both are better than binary. Again, binarization order matters, and the ascending order is better than the forward one.
3.1 Exhaustive Search
For shallower neural networks like the 3-layer fully connected networks for the MNIST dataset, exhaustive search for the best binarization order is possible. Figure 7 shows results for all combinations of layer binarization order for the 300-100-10 and 784-784-10 networks. For the former, the smaller network, the ascending order turns out to be the same as reverse. Errors for all combinations are very close. The best order is not the ascending one, but 1-3-2 and 3-1-2, both of which are better than binary by a small margin. Here 1-3-2 means binarizing layer 1, then layer 3, then layer 2. Thus, also for 300-100-10, starting from partial weight binarization is better than starting from full weight binarization.
For the bigger network, 784-784-10, the ascending order is better than the forward and reverse ones. The descending order is the worst of all. This is consistent with the results from the convolutional networks. Here, the ascending order ties with one other order for the best accuracy.
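The exhaustive search above simply enumerates every permutation of the three layers; a sketch using the standard library:

```python
from itertools import permutations

layers = [1, 2, 3]                   # layer indices of a 3-layer network
orders = list(permutations(layers))  # all 3! = 6 binarization orders
# One iterative-training run per order, e.g. (1, 3, 2) binarizes
# layer 1, then layer 3, then layer 2. For 20+ layers this is
# infeasible (20! is about 2e18), motivating sensitivity pre-training.
```

Each tuple in `orders` would be passed as the layer binarization order of one iterative-training run.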
In summary, we proposed using sensitivity pre-training as a guide for layer binarization order. For 784-784-10, Vgg5, Vgg9 and ResNet21, we have shown empirically that better accuracies are achieved. This improvement comes at the cost of pre-training additional models.
4 Related Work
Our work introduces an iterative layer-by-layer quantization training regime. Although we demonstrated the results using weight binarization, the regime is independent of the quantization scheme. We think other schemes, e.g., Li et al. (2016) (where weights are ternary: -1.0, 0 and 1.0), may yield similar trends.
Hu et al. (2018) formulate weight binarization as a hashing problem. Like ours, their iterative method operates layer by layer, from the input layer towards the output layer. However, they start from a pre-trained network and, after weight quantization without fine-tuning, fine-tune the biases. Ours starts from an untrained network and gradually trains a full binary weight network, which we believe allows the network to adapt to the random noise created by the quantization operation. In addition, their final weights are not purely binary, but power-of-2 multiples; when constrained to purely binary weights, they report non-convergence. Our iterative training, in contrast, converges with purely binary weights. For future work, we could binarize using power-of-2 multiples.
Zhou et al. (2017) iterate over both pruning and quantization. First, weights are partitioned into two groups. Then, the weights in the first group are quantized to power-of-2 multiples or zero. Next, the weights in the second group are fine-tuned, while the first group receives no parameter updates. In the next iteration, some of the weights in the second group are assigned to the first group. The process is repeated until all weights are members of the first group. In this partitioning scheme, the first group contains weights from all layers. It is possible to merge both methods because their partitioning is orthogonal to ours. Once weights join the first group, their values stay unchanged for the rest of the fine-tuning. Because our binarization is based on Courbariaux et al. (2015), floating-point weights prior to quantization are saved for parameter updates. Thus, during iterative training of later layers, the weights of earlier layers are allowed to adapt and flip signs. The disadvantage, however, is that more memory is required during training.
In low-rank decomposition and teacher-student networks, weights are still in floating-point precision. For low-rank decomposition, the implementation requires a decomposition operation, which is computationally expensive, and factorization requires extensive model retraining to achieve convergence comparable to the original model (Cheng et al., 2018). Similarly, due to the iterative nature of our proposed training regime, training time is also lengthened.
5 Conclusions and Further Work
In this work, we proposed a simple iterative training that gradually trains a partial binary weight network into a full binary weight network, layer by layer. We showed empirically that this regime results in higher accuracy than starting training from a fully binarized network. The order of layer binarization matters: for larger and deeper neural networks, the forward order achieves better accuracies than other obvious binarization orders. We also proposed a sensitivity pre-training for selecting the binarization order. For 784-784-10, Vgg5, Vgg9 and ResNet21, this guided order achieves better accuracies than the forward order.
Iterative training has a cost: lengthened training. This tradeoff may be acceptable in many applications where pre-trained models are deployed, because only forward-propagation efficiency is needed. A binary weight neural network dramatically reduces computational complexity and memory footprint and, thus, increases energy efficiency. For future work, we would like to understand analytically why layer quantization works and what the optimal quantization order is.
References
L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff and H. Adam (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In European Conference on Computer Vision, pp. 833–851.
Y. Cheng, D. Wang, P. Zhou and T. Zhang (2018). A Survey of Model Compression and Acceleration for Deep Neural Networks. IEEE Signal Processing Magazine 35(1), pp. 126–136.
M. Courbariaux, Y. Bengio and J.-P. David (2015). BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. In Neural Information Processing Systems.
K. He, G. Gkioxari, P. Dollár and R. Girshick (2017). Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
K. He, X. Zhang, S. Ren and J. Sun (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
K. He, X. Zhang, S. Ren and J. Sun (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
M. Horowitz (2014). 1.1 Computing's Energy Problem (and What We Can Do about It). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
Q. Hu, P. Wang and J. Cheng (2018). From Hashing to CNNs: Training Binary Weight Networks via Hashing. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 3247–3254.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv and Y. Bengio (2016). Binarized Neural Networks. In Advances in Neural Information Processing Systems, Vol. 29.
S. Ioffe and C. Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, Vol. 37, pp. 448–456.
A. Krizhevsky (2009). Learning Multiple Layers of Features from Tiny Images. Technical report.
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
F. Li, B. Zhang and B. Liu (2016). Ternary Weight Networks. In NIPS 2016, 1st International Workshop on Efficient Methods for Deep Neural Networks.
A. Paszke et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
O. Russakovsky et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, pp. 211–252.
K. Simonyan and A. Zisserman (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
S. Wu, G. Li, F. Chen and L. Shi (2018). Training and Inference with Integers in Deep Neural Networks. In International Conference on Learning Representations.
S. Zagoruyko and N. Komodakis (2016). Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC), pp. 87.1–87.12.
A. Zhou, A. Yao, Y. Guo, L. Xu and Y. Chen (2017). Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. In International Conference on Learning Representations.
S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou (2016). DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv:1606.06160.