1 Introduction
Convolutional neural networks (CNNs) have been widely used in various computer vision tasks, such as image classification [9], object detection [19] and semantic segmentation [15]. These networks are often of heavy design with massive parameters and computational costs, and cannot be directly deployed on portable devices without model compression techniques such as pruning [8], knowledge distillation [10], compact model design [11, 22], and quantization [18, 25].
Among these, 1-bit quantization has recently received great attention: it represents the weights and activations in the network using only two values, -1 and +1. Binarized networks can thus be efficiently applied in a series of real-world applications (e.g., cameras and mobile phones). Nevertheless, the performance of binary neural networks (BNNs) is still far worse than that of their original models. Figure 1 summarizes the performance of state-of-the-art binarization methods [16, 13, 18, 25, 14, 5] on the ImageNet benchmark [3], including XNOR-Net [18], Bi-Real Net [14] and PCNN [5]. Although these works have made tremendous efforts to enhance the performance of BNNs, the highest top-1 accuracy, obtained by PCNN [5], is still about 12% lower than that of the baseline ResNet-18 [9]. The severe accuracy drop mentioned in Figure 1
greatly limits the practicality of BNNs, considering that a number of computer vision tasks, such as face recognition [21] and person re-identification [6], have very high precision requirements. The main reason is that the discrimination of binary features cannot match that of full-precision features with the same dimensionality. Therefore, it is necessary to find a trade-off approach that establishes compact binary networks with acceptable model sizes by increasing the number of channels in each convolutional layer. Motivated by the recent hotspot of neural architecture search (NAS) [1, 4, 23], we propose to appropriately modify the channel numbers of binarized networks and search for a new architecture with different channel numbers but high precision. In practice, the expansion ratios of all layers in the desired binary network are encoded to form the search space, and an evolutionary algorithm is utilized to effectively find the lower bound of BNNs that achieves the same performance as their full-precision versions.

2 Approach
Binarization Method.
Following the widely-used DoReFa-Net [25], in a binary layer the floating-point weights are approximated by binary weights and a floating-point scalar, while the floating-point activations are represented by binary values. The feed-forward in DoReFa-Net is defined as:

y = \frac{1}{n}\|W\|_{\ell 1}\,\big(\mathrm{sign}(W) \circledast \mathrm{sign}(a)\big),  (1)

where \frac{1}{n}\|W\|_{\ell 1} calculates the mean of the absolute values of the weights and \circledast denotes convolution with binary operands. In the back-propagation process, we adopt the "Straight-Through Estimator" method [2] to estimate the corresponding gradients. During the quantization process, we restrain the weights and activations of all convolutional and fully-connected layers to only 1 bit, except the first and last layers, following existing works [25, 14]. Such extreme binary quantization brings enormous computation acceleration and memory reduction. However, most state-of-the-art binary networks cannot match the accuracy of their full-precision counterparts. Recently, the uniform width expansion proposed in WRPN [17] expands all layers with a single hyper-parameter in multi-bit quantization networks to pursue this goal.
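As a concrete illustration, the DoReFa-style binarization and the Straight-Through Estimator described above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation; the function names and the clip range are our own assumptions.

```python
import numpy as np

def binarize_weights(w):
    # DoReFa-style 1-bit weights (Eq. 1): sign(w) scaled by the
    # mean absolute value of the full-precision weights.
    alpha = np.mean(np.abs(w))
    return alpha * np.where(w >= 0, 1.0, -1.0)

def binarize_activations(a):
    # 1-bit activations: keep only the sign.
    return np.where(a >= 0, 1.0, -1.0)

def ste_backward(grad_out, x, clip=1.0):
    # Straight-Through Estimator: pass gradients through sign()
    # unchanged, but zero them where |x| exceeds the clip range.
    return grad_out * (np.abs(x) <= clip)
```

In a full framework these would be wrapped as custom autograd functions; the sketch only shows the forward approximation and the gradient mask.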
Although widened binary networks can obtain acceptable performance, such a uniform expansion strategy obviously increases the required memory and computational complexity: the binary network after expansion is far larger than the original one. In fact, deep neural architectures often contain strong redundancy, so we do not need to expand every layer to achieve the desired performance. Thus, we propose to define a binary neural architecture search problem and utilize an evolutionary algorithm to search for optimal architectures.
Search Space.
For the search space, we focus only on the network width, i.e., the number of channels in each layer. For a given network architecture with L layers, we define a vector e = \{e_1, \dots, e_L\} to encode the expansion ratio hyper-parameter of each layer. Our goal is to search for the e that yields higher accuracy with fewer FLOPs. All other hyper-parameters and network settings, such as stride, kernel size and layer order, remain the same as in the original full-precision models.
In the uniform width expansion experiments shown in Table 2, we observe that by expanding channels by only 4 times, binary neural networks can obtain performance comparable to that of their full-precision models on the ImageNet classification task. Thus we take 4 as the empirical upper bound of the expansion ratio needed to reach full-precision accuracy. We set 4 as the largest expansion ratio, and use smaller ratios to expand or even reduce channels. In practice, each e_i is chosen from a set of candidate expansion ratios:

e_i \in \{0.25,\ 0.5,\ 1,\ 2,\ 3,\ 4\}, \quad i = 1, \dots, L.  (2)
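The encoding above can be sketched as follows. The candidate ratio set and helper names here are our own assumptions, chosen only to match the stated constraints (a maximum ratio of 4, plus ratios below 1 for narrowing):

```python
import random

# Hypothetical candidate expansion ratios: values below 1 narrow a
# layer; 4 is the empirical upper bound from uniform expansion.
RATIOS = [0.25, 0.5, 1.0, 2.0, 3.0, 4.0]

def random_individual(num_layers):
    # One individual: an expansion ratio per convolutional layer.
    return [random.choice(RATIOS) for _ in range(num_layers)]

def apply_ratios(base_channels, ratios):
    # Scale a base channel configuration by the per-layer ratios.
    return [max(1, int(c * r)) for c, r in zip(base_channels, ratios)]
```

For example, `apply_ratios([64, 128, 256], [0.5, 1.0, 4.0])` yields a network whose first layer is halved and whose last layer is widened 4 times.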
Search Algorithm.
As discussed above, we aim to search for an optimal architecture, i.e., an expansion ratio set that makes the accuracy of the binarized network similar to that of its full-precision model with as few parameters and floating-point operations (FLOPs) as possible. The overall optimization can therefore be described as:

e^* = \arg\max_{e} \mathcal{F}(e, W_e), \quad \text{s.t.}\ \ W_e = \arg\min_{W} \mathcal{L}(W; e),  (3)

where \mathcal{F} is the fitness function of the evolutionary algorithm, \mathcal{L} is the loss on the training set, and W_e is the corresponding trained weight for the expansion ratio set e. We first find an optimal e^* through the evolutionary algorithm on a training subset, and then train the corresponding binary network on the full training set to obtain the final model.
Specifically, in every generation during evolution, we maintain a population of individuals, each of which denotes a binary neural architecture encoded by an expansion ratio vector satisfying Eq. 2. These individuals are continuously updated with pre-designed operations (crossover and mutation) to obtain greater fitness. Here we have two objectives: high performance on the specific task (classification accuracy) and low computational cost (FLOPs). Thus, the fitness of an individual e is defined as:

\mathcal{F}(e) = \mathrm{Acc}(e) - \lambda \cdot \frac{\mathrm{FLOPs}(e)}{\mathrm{FLOPs}_0},  (4)

where \mathrm{Acc}(e) and \mathrm{FLOPs}(e) are the top-1 validation accuracy and FLOPs of the widened network corresponding to individual e, \mathrm{FLOPs}_0 is the FLOPs of the original binary network, and \lambda is a trade-off parameter.
Compared with full-precision layers, the FLOPs of binary layers are divided by 64, as suggested in Bi-Real Net [14]. In the calculation of fitness in Eq. 4, we divide the FLOPs of each candidate model by the FLOPs of the original binary network so that the FLOPs term has the same order of magnitude as the accuracy term. After defining the search space and fitness function, the evolutionary algorithm can effectively select individuals with higher fitness during the evolution process until convergence.
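A minimal sketch of the fitness computation (Eq. 4) and the generational loop is given below. The selection scheme, the λ value, and the helper signatures are our assumptions, not the paper's exact algorithm:

```python
import random

def fitness(acc, flops, base_flops, lam=0.1):
    # Eq. 4: reward accuracy, penalize FLOPs normalized by the FLOPs
    # of the original binary network (binary-layer FLOPs would be
    # pre-divided by 64 as in Bi-Real Net). lam is a hypothetical value.
    return acc - lam * flops / base_flops

def evolve(population, evaluate, crossover, mutate,
           generations=50, num_parents=8):
    # Generational loop: keep the fittest individuals as parents and
    # refill the population with mutated crossover children.
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[:num_parents]
        children = []
        while len(children) < len(population) - num_parents:
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=evaluate)
```

In practice `evaluate` would train each candidate for a few epochs and return the fitness of Eq. 4; the sketch leaves that expensive step abstract.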
3 Experiments
In this section, we conduct experiments to explore the empirical width lower bound of each layer in binary neural networks on two benchmark datasets, CIFAR-10 [12] and ImageNet [3]. We use two widely used network structures as baselines: VGG-Small [24] and ResNet-18 [9].
3.1 Experimental Settings
For the evolutionary search process, we search for 50 generations with 32 individuals in each generation. We train each candidate model for 10 epochs on the training set and use its accuracy on the validation set as the accuracy in Eq. 4. The trade-off parameter \lambda is set so that the accuracy and FLOPs terms remain comparable in magnitude.

CIFAR-10
On the CIFAR-10 dataset, the search takes about 12 hours on 8 V100 GPUs. We then train for 200 epochs on the full CIFAR-10 training set. The learning rate starts at 0.1 and is multiplied by 0.1 at epochs 60, 120 and 180. We simply follow the same hyper-parameter setup as in [24].
ImageNet
As the ImageNet ILSVRC 2012 dataset is very large, we do not use the whole training set in the evolution process. We randomly sample a subset of 50,000 images (50 images for each of the 1,000 classes) from the original full training set for use during evolution, which takes about 180 hours on 8 V100 GPUs. We then train for 150 epochs to check whether the searched models reach full-precision accuracy. The learning rate starts from 0.1 and decays by 0.1 at epochs 50, 100 and 135. We simply follow the same hyper-parameter setup as in [9].
Initialization
When each candidate is evaluated by training for only 10 epochs on a small subset of ImageNet, the accuracy of candidate models is especially low, which makes it difficult to distinguish better models from worse ones. Therefore, we train the model uniformly widened by 4 times on the subset for 150 epochs and use it to initialize all candidate models, simply intercepting the first corresponding channels of each layer.
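The channel-interception initialization described above can be sketched as follows; the array layout ([out, in, kh, kw]) and the function name are our own assumptions:

```python
import numpy as np

def intercept_channels(wide_weight, out_c, in_c):
    # Initialize a narrower layer from the uniformly 4x-widened,
    # pretrained layer by keeping only its first channels.
    return wide_weight[:out_c, :in_c].copy()
```

Each candidate layer thus inherits a trained slice of the widest model instead of starting from random weights, which makes short 10-epoch evaluations more discriminative.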
Table 1: Results of VGG-Small on CIFAR-10.

Models          FLOPs   Speedup  Memory  Top-1 (%)
Full-Precision  608M    -        149M    93.48
Uniform-1       13.2M   46.1x    7.3M    90.24
Uniform-2       45.3M   13.4x    23.7M   91.65
Uniform-3       96.2M   6.3x     49.3M   91.87
Uniform-4       166M    3.7x     84.1M   92.56
VGG-Auto-A      11.3M   53.6x    5.1M    92.17
VGG-Auto-B      59.3M   10.3x    23.4M   93.06
3.2 Results and Analysis
VGG-Small on CIFAR-10
VGG-Small [24] is a variant of the original VGGNet [20] designed for CIFAR-10. We compare the searched models, VGG-Auto-A and VGG-Auto-B, with the uniformly widened models in Table 1. The standard binarized VGG-Small loses only about 3.2% accuracy. As we uniformly increase the width, the accuracy increases accordingly; however, even when widened 4 times, the binarized network still does not reach the accuracy of the full-precision network. Our VGG-Auto-B model achieves higher accuracy than Uniform-4 with about 1/4 of the FLOPs and memory, and has the smallest accuracy gap to the full-precision model. Although our VGG-Auto-A model has even fewer channels than the original Uniform-1 model, it achieves higher accuracy, with about a 2% improvement. This phenomenon confirms our original intention in designing the search space: some layers need to be expanded while others need to be narrowed.
Table 2: Results of ResNet-18 on ImageNet.

Models          FLOPs  Speedup  Top-1 (%)  Top-5 (%)
Full-Precision  1820M  -        69.6       89.2
PCNN            169M   10.8x    57.3       80.0
ABC-Net         520M   3.5x     62.5       84.2
ABC-Net         785M   2.3x     65.0       85.9
Uniform-1       149M   12.2x    52.77      76.85
Uniform-2       352M   5.2x     64.0       85.45
Uniform-3       607M   3.0x     68.51      88.25
Uniform-4       915M   2.0x     70.35      89.27
Res18-Auto-A    495M   3.7x     68.64      88.46
Res18-Auto-B    660M   2.8x     69.65      89.08
ResNet-18 on ImageNet
We also conduct experiments on the large-scale ImageNet dataset. In the uniform expansion experiments, as the width increases, the top-1 accuracy gradually approaches that of the original full-precision model. From the results in Table 2, our Res18-Auto-B binarized model obtains the same performance as the full-precision model with less than 1/3 of the computational cost. With similar FLOPs, Res18-Auto-B outperforms Uniform-3 by 1.1% in top-1 accuracy and 0.8% in top-5 accuracy. Our evolutionary search thus finds more accurate widened models with as few FLOPs as possible.
We also compare our models with state-of-the-art binarization methods in Table 2. PCNN [5] does not quantize the downsample layer and adds additional shortcut connections, which inevitably increase end-to-end inference time. Compared with ABC-Net using multiple bases (i.e., 5 binary bases for weights and 3 for activations), our Uniform and Automatic models consistently perform better by a large margin.
Searched Architecture
To further analyze the searched network architecture, we show the number of output channels in each layer of two binary networks with similar accuracy, Res18-Auto-A and Uniform-3 in Table 2. From Fig. 2, we observe that compared with Uniform-3, the searched architecture Res18-Auto-A has fewer output channels in the 1st, 2nd and last stages. In addition, Res18-Auto-A needs more channels for the middle feature maps inside each block. These observations could inspire the design of blocks and architectures for more efficient convolutional neural networks.
4 Conclusion
To establish binary neural networks with higher precision and lower computational cost, this paper studies the binary neural architecture search problem. Based on an empirical study of uniform width expansion, we define a novel search space and utilize an evolutionary algorithm to adjust the number of channels in each convolutional layer after binarization. Experiments on benchmark datasets and neural architectures show that the proposed method produces binary networks with an acceptable increase in parameters and the same performance as the full-precision original network.
References

[1] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
[2] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
[3] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
[4] (2018) Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
[5] (2019) Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In AAAI.
[6] (2019) Attribute aware pooling for pedestrian attribute recognition. arXiv preprint arXiv:1907.11837.
[7] (2018) Autoencoder inspired unsupervised feature selection. In ICASSP, pp. 2941–2945.
[8] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR.
[9] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[10] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[11] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[12] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
[13] (2017) Towards accurate binary convolutional neural network. In NeurIPS, pp. 345–353.
[14] (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV.
[15] (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440.
[16] (2016) BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. CoRR.
[17] (2018) WRPN: wide reduced-precision networks. In ICLR.
[18] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pp. 525–542.
[19] (2015) Faster R-CNN: towards real-time object detection with region proposal networks.
[20] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
[21] (2014) Deep learning face representation from predicting 10,000 classes. In CVPR.
[22] (2018) Learning versatile filters for efficient convolutional neural networks. In NeurIPS, pp. 1608–1618.
[23] (2017) Towards evolutional compression. arXiv preprint arXiv:1707.08005.
[24] (2017) Deep learning with low precision by half-wave Gaussian quantization. In CVPR, pp. 5918–5926.
[25] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.