Convolutional neural networks (CNNs) have been widely used in various computer vision tasks, such as image classification, object detection, and semantic segmentation. These networks are often of heavy design with massive parameters and computational costs, and cannot be directly deployed on portable devices without model compression techniques such as pruning, knowledge distillation, compact model design [11, 22], and quantization [18, 25].
Among these, 1-bit quantization has recently received great attention: it represents the weights and activations in the network using only two values, e.g., +1 and -1. Thus, binarized networks could be efficiently applied in a series of real-world applications (e.g., cameras and mobile phones). Nevertheless, the performance of binary neural networks (BNNs) is still far worse than that of their original models. Figure 1 summarizes the performance of state-of-the-art binarization methods [16, 13, 18, 25, 14, 5]
on the ImageNet benchmark, including XNOR-Net, Bi-Real Net, and PCNN. Although these methods have made tremendous efforts to enhance the performance of BNNs, the highest top-1 accuracy, obtained by PCNN, is still considerably lower than that of the baseline ResNet-18.
The severe accuracy drop shown in Figure 1 greatly limits the practicality of BNNs, considering that a number of computer vision tasks, such as face recognition and person re-identification, have very high precision requirements. The main reason is that the discrimination of binary features cannot match that of full-precision features of the same dimensionality. Therefore, it is necessary to find a trade-off approach for establishing compact binary networks with acceptable model sizes by adjusting the number of channels in each convolutional layer. Motivated by the recent neural architecture search (NAS [1, 4, 23]) hotspot, we propose to appropriately modify the channel numbers of binarized networks and search for a new architecture with different channel numbers but high precision. In practice, the expansion ratios of all layers in the desired binary network are encoded to form the search space, and an evolutionary algorithm is utilized to effectively find the lower bound on the width of BNNs that achieves the same performance as their full-precision versions.
Following the widely-used DoReFa-Net, in each binary layer the floating-point weights are approximated by binary weights and a floating-point scalar, while the floating-point activations are represented by binary values. The feed-forward pass in DoReFa-Net is defined as:

$$w^b = \eta \cdot \mathrm{sign}(w), \quad \eta = E(|w|), \qquad a^b = \mathrm{sign}(a),$$

where $E(\cdot)$ calculates the mean of absolute values. In the back-propagation process, we adopt the "Straight-Through Estimator" method to estimate the corresponding gradients. During the quantization process, we restrict the weights and activations of all convolutional layers and fully-connected layers to only 1-bit, except the first and last layers, following existing works [25, 14].
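As a minimal sketch of this quantization scheme, the forward pass can be written in NumPy as below; the function names are illustrative, and the backward Straight-Through Estimator is only described in comments since plain NumPy has no autograd.

```python
import numpy as np

def binarize_weights(w):
    """DoReFa-style 1-bit weight quantization: binary sign values
    scaled by the mean absolute value of the real-valued weights."""
    scale = np.mean(np.abs(w))      # E(|w|), a single floating-point scalar
    return scale * np.sign(w)

def binarize_activations(a):
    """Binarize activations to {-1, +1} by their sign."""
    return np.sign(a)

# In the backward pass (not shown), the Straight-Through Estimator
# treats sign() as the identity when propagating gradients, so the
# real-valued weights can still be updated by SGD.

w = np.array([0.5, -1.0, 0.25, -0.25])
wb = binarize_weights(w)   # scale = 0.5, so wb = [0.5, -0.5, 0.5, -0.5]
```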
Such extreme 1-bit quantization brings enormous computation acceleration and memory reduction. However, most state-of-the-art binary networks cannot match the accuracy of their full-precision counterparts. Recently, the uniform width expansion proposed by WRPN expands all layers of multi-bit quantization networks with a single hyper-parameter to pursue this goal.
Although widened binary networks can obtain acceptable performance, such a uniform expansion strategy obviously increases the required memory and computational complexity; the expanded binary network is considerably larger than the original one. In fact, there is often strong redundancy in deep neural architectures, so we do not need to expand all layers to achieve the desired performance. Thus, we propose to define a binary neural architecture search problem and utilize an evolutionary algorithm to search for optimal architectures.
For the search space, we focus only on the network width, i.e., the number of channels in each layer. For a given architecture with $L$ layers, we define a code $c = \{c_1, \dots, c_L\}$ that encodes the expansion ratio hyper-parameter of each layer. Our goal is to search for the code $c$ that yields higher accuracy with fewer FLOPs. All other hyper-parameters and network settings, such as stride, kernel size, and layer order, remain the same as in the original full-precision models.
In the uniform width expansion experiments shown in Table 2, we observe that by sufficiently expanding the channels, binary neural networks can obtain performance comparable to that of their full-precision models on the ImageNet classification task. We therefore take this ratio as the empirical upper bound of the expansion ratio needed to reach full-precision accuracy. We set it as the largest expansion ratio and also use smaller ratios to expand or even reduce channels. In practice, each layer's ratio $c_i$ is selected from a small set of expansion ratio candidates (Eq. 2).
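The per-layer encoding can be sketched as follows; the candidate ratio set here is a hypothetical stand-in (the paper's exact values are given in Eq. 2), and `widen_channels` is an illustrative helper name.

```python
# Hypothetical candidate set; a stand-in for the paper's Eq. 2.
RATIO_CANDIDATES = [0.25, 0.5, 1.0, 2.0, 3.0, 4.0]

def widen_channels(base_channels, ratios):
    """Apply a per-layer expansion-ratio code to a base architecture's
    channel counts, rounding to whole channels (at least 1)."""
    assert len(base_channels) == len(ratios)
    assert all(r in RATIO_CANDIDATES for r in ratios)
    return [max(1, round(c * r)) for c, r in zip(base_channels, ratios)]

# A toy 4-layer network: some layers expanded, some narrowed.
channels = widen_channels([64, 128, 256, 512], [0.5, 2.0, 1.0, 0.25])
# channels == [32, 256, 256, 128]
```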
As discussed above, we expect to search for an optimal architecture, i.e., an expansion ratio set that makes the accuracy of the binarized network similar to that of its full-precision model with as few parameters and floating-point operations (FLOPs) as possible. Therefore, the overall optimization can be described as:

$$\max_{c}\ \mathcal{F}(c, W_c), \qquad \text{s.t.}\ W_c = \arg\min_{W}\ \mathcal{L}(W; c),$$

where $\mathcal{F}$ is the fitness function of the evolutionary algorithm, $\mathcal{L}$ is the loss on the training set, and $W_c$ is the corresponding trained weight with expansion ratio set $c$. We first find an optimal $c$ through the evolutionary algorithm on a training subset. Then we train the corresponding binary network on the full training set to obtain the final model.
Specifically, in every generation during evolution, we maintain a population of individuals, each of which denotes a binary neural architecture according to a certain expansion ratio code satisfying Eq. 2. These individuals are continuously updated with pre-designed operations (crossover and mutation) to achieve greater fitness. Here we have two objectives: high performance on the specific task (i.e., classification accuracy) and low computation cost (i.e., FLOPs). Thus, the fitness of an individual $c$ is defined as:

$$\mathcal{F}(c) = \mathrm{Acc}(c) - \lambda \cdot \frac{\mathrm{FLOPs}(c)}{\mathrm{FLOPs}_0},$$

where $\mathrm{Acc}(c)$ and $\mathrm{FLOPs}(c)$ are the top-1 validation accuracy and FLOPs of the widened network corresponding to the individual $c$, $\mathrm{FLOPs}_0$ is the FLOPs of the original binary network, and $\lambda$ is the trade-off parameter.
Compared with full-precision layers, the FLOPs of binary layers are divided by 64, as suggested in Bi-Real Net. In the calculation of fitness in Eq. 4, we divide the FLOPs of each candidate model by the FLOPs of the original binary network so that the penalty term has the same order of magnitude as the accuracy term. After defining the search space and fitness function, the evolutionary algorithm can effectively select individuals with higher fitness during the evolution process until convergence.
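The fitness computation and the two evolution operators can be sketched as below; the function names, the default trade-off `lam`, and the mutation probability are illustrative choices, not the paper's exact settings.

```python
import random

def binary_layer_flops(full_precision_flops):
    """Binary-layer operations are counted as 1/64 of the
    corresponding full-precision FLOPs, following Bi-Real Net."""
    return full_precision_flops / 64

def fitness(acc, flops, flops_base, lam=1.0):
    """Fitness of one individual (Eq. 4): accuracy minus a FLOPs
    penalty normalised by the original binary network's FLOPs."""
    return acc - lam * flops / flops_base

def mutate(code, candidates, p=0.1):
    """Resample each layer's expansion ratio with probability p."""
    return [random.choice(candidates) if random.random() < p else r
            for r in code]

def crossover(a, b):
    """Single-point crossover between two parents' ratio codes."""
    k = random.randint(1, len(a) - 1)
    return a[:k] + b[k:]
```

In each generation, individuals with the highest fitness are kept as parents, and new candidates are produced by `crossover` and `mutate` until the population is refilled.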
In this section, we conduct experiments to explore the empirical width lower bound of each layer of binary neural networks on several benchmark datasets, CIFAR-10 and ImageNet. We use two widely used network structures as baselines: VGG-small and ResNet-18.
3.1 Experimental Settings
For the evolutionary search process, we search for 50 generations with 32 individuals in each generation. We train each candidate model for 10 epochs on the training set and use its accuracy on the validation set as the accuracy term in Eq. 4. The trade-off parameter $\lambda$ is set so that the accuracy and FLOPs terms remain comparable in magnitude.
On the CIFAR-10 dataset, the search takes about 12 hours on 8 V100 GPUs. We then train for 200 epochs on the full CIFAR-10 training set. The learning rate starts at 0.1 and is multiplied by 0.1 at epochs 60, 120, and 180. We simply follow the same hyper-parameter setup as in previous work.
As the ImageNet ILSVRC2012 dataset is very large, we do not use the whole training set in the evolution process. Instead, we randomly sample a subset of 50,000 images (50 images for each of the 1,000 classes) from the original full training set; the search takes about 180 hours on 8 V100 GPUs. We then train for 150 epochs to check whether the searched models reach full-precision accuracy. The learning rate starts at 0.1 and decays by 0.1 at epochs 50, 100, and 135. We simply follow the same hyper-parameter setup as in previous work.
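The step-decay schedule above can be written as a small helper; the function name is illustrative.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(50, 100, 135), gamma=0.1):
    """Step-decay schedule for full ImageNet training: start at
    base_lr and multiply by gamma at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```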
When each candidate is trained for only 10 epochs on the small ImageNet subset, its accuracy is especially low, which makes it difficult to distinguish better models from worse ones. Therefore, we train the model uniformly widened to the largest expansion ratio on the subset for 150 epochs and use it to initialize all candidate models, simply taking the first corresponding channels of each layer.
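This channel-slicing initialization can be sketched as below for a convolutional weight tensor; the helper name and layout convention (output channels, input channels, kernel height, kernel width) are assumptions for illustration.

```python
import numpy as np

def inherit_weights(super_w, out_ch, in_ch):
    """Initialise a narrower candidate's conv weight by taking the
    first out_ch output channels and in_ch input channels of the
    widest (supernet) weight tensor of shape (O, I, kH, kW)."""
    return super_w[:out_ch, :in_ch].copy()

super_w = np.random.randn(256, 128, 3, 3)   # widest model's layer
cand_w = inherit_weights(super_w, 96, 48)   # a narrower candidate
assert cand_w.shape == (96, 48, 3, 3)
```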
3.2 Results and Analysis
VGG-small on CIFAR-10
VGG-small is a variant of the original VGG-Net designed for CIFAR-10. We compare the searched models, Automatic-A and Automatic-B, with uniformly widened models in Table 1. The standard binarized VGG-small decreases accuracy only slightly. As we uniformly increase the width, the accuracy increases accordingly. However, even with 4x widening, the accuracy of the binarized network still does not reach that of the full-precision network. Our Automatic-B model achieves higher accuracy than Uniform-4 with about 1/4 of the FLOPs and memory, and it has the smallest accuracy gap with the full-precision model. Although our Automatic-A model has even fewer channels than the original Uniform-1 model, it achieves higher accuracy, with an improvement of about 2%. This phenomenon confirms our original intention in designing the search space: some layers need to be expanded and some layers need to be narrowed.
ResNet-18 on ImageNet
We also conduct experiments on the large-scale ImageNet dataset. In the uniform expansion experiments, as the width increases, the top-1 accuracy gradually approaches that of the original full-precision model. From the results in Table 2, our Automatic-B binarized model obtains the same performance as the full-precision model with less than 1/3 of the computational cost. With similar FLOPs, Automatic-B outperforms Uniform-3 by 1.1% in top-1 accuracy and 0.8% in top-5 accuracy. Our evolutionary search thus finds more accurate widened models with as few FLOPs as possible.
We also compare our models with some state-of-the-art binarization methods in Table 2. PCNN does not quantize the downsample layer and adds additional shortcut connections, which inevitably increases end-to-end inference time. Compared with ABC-Net using multiple bases (5 binary bases for weights and 3 for activations), our Uniform and Automatic models consistently perform better by a large margin.
To further analyze the searched network architecture, we show the number of output channels in each layer of two binary networks with similar accuracy in Table 2, Res18-Auto-A and Uniform-3. From Fig. 2, we observe that, compared with Uniform-3, the searched architecture Res18-Auto-A has fewer output channels in the first, second, and last stages. In addition, Res18-Auto-A needs more channels for the middle feature maps inside each block. These observations could inspire the design of blocks or architectures for more efficient convolutional neural networks.
To establish binary neural networks with higher precision and lower computational costs, this paper studies the binary neural architecture search problem. Based on an empirical study of uniform width expansion, we define a novel search space and utilize an evolutionary algorithm to adjust the number of channels in each convolutional layer after binarization. Experiments on benchmark datasets and neural architectures show that the proposed method can produce binary networks with an acceptable increase in parameters and the same performance as the original full-precision network.
- (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
- (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
- (2018) Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
- (2019) Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In AAAI.
- (2019) Attribute aware pooling for pedestrian attribute recognition. arXiv preprint arXiv:1907.11837.
- (2018) Autoencoder inspired unsupervised feature selection. In ICASSP, pp. 2941–2945.
- (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
- (2017) Towards accurate binary convolutional neural network. In NeurIPS, pp. 345–353.
- (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV.
- (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440.
- (2016) BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. In CoRR.
- (2018) WRPN: wide reduced-precision networks. In ICLR.
- (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pp. 525–542.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS.
- (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
- (2014) Deep learning face representation from predicting 10,000 classes. In CVPR.
- (2018) Learning versatile filters for efficient convolutional neural networks. In NeurIPS, pp. 1608–1618.
- (2017) Towards evolutional compression. arXiv preprint arXiv:1707.08005.
- (2017) Deep learning with low precision by half-wave gaussian quantization. In CVPR, pp. 5918–5926.
- (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.