1 Introduction
Increasing demand for deploying high-performance visual recognition systems encourages research on efficient neural networks. The research fields include pruning [12], efficient architecture design [48, 15, 14], low-rank decomposition [16], network quantization [6, 34, 20], and knowledge distillation [13, 39]. In particular, network quantization, especially to binary or 1-bit CNNs, is known to provide extreme computational savings with a relatively small accuracy drop. In addition, the computationally expensive floating point convolutions are replaced with efficient XNOR and bit-count operations, which significantly speeds up inference [34]. Current binary neural networks, however, are based on architectures designed for floating point weights and activations [34, 28, 36]. We hypothesize that the backbone architectures used in current binary networks may not be optimal for binary parameters, as they were designed for floating point ones. Instead, we may learn better binary network architectures by exploring the space of binary network topologies. As shown in Figure 1, given the same binarization scheme, our searched architectures clearly outperform all the hand-crafted architectures, indicating the prowess of our search method in discovering better binary networks.
To discover better performing binary networks, we first apply one of the most well-known binarization schemes [34] to architectures searched by floating point NAS methods that use cell-based search spaces and gradient-based search algorithms [27, 46, 9], and train the resulting binary networks on CIFAR-10. Disappointingly, the binarized searched architectures do not perform well, as shown in Figure 2. We hypothesize two reasons for this failure. First, the set of layer types used in floating point NAS is not necessarily the best set for binary networks. For example, separable convolutions have large quantization error when binarized, since nested convolutions compound the quantization error (Section 4.1.2). Additionally, we discover that the Zeroise layer type, which harms accuracy in floating point networks, improves the accuracy of binary networks (Section 4.1.3). Second, the convolutional cell template used by floating point cell-based NAS methods is not well suited to the binary domain because of severe vanishing gradients (Section 4.2).
Based on these hypotheses and empirical observations, we formulate a cell-based search space explicitly defined for binary networks. In addition, we propose a novel diversity regularizer to encourage exploration of diverse layer types in the early stages of search. We show that the diversity regularizer helps in searching better performing binary architectures.
Contributions.
Our contributions are fourfold:


- We propose the first architecture search method for binary neural networks. The searched architectures are adjustable to various computational budgets (in FLOPs) and outperform state-of-the-art binary networks with comparable computational costs on both CIFAR-10 and ImageNet.

- We propose to add inter-cell skip-connections to the cell template to alleviate the vanishing gradient problem in binary architectures.

- We propose to use the Zeroise layer to reduce the quantization error in binary networks and show that it results in better performing binary networks.

- We propose to diversify the early stages of search and show that this also contributes to discovering better performing binary networks.
2 Related Work
2.1 Binary Neural Networks
There have been numerous attempts to improve the accuracy of low-bit precision CNNs. We categorize them into binarization schemes, architectural modifications, and training methods.
Binarization schemes.
[6] used the sign function to binarize the weights and showed good performance on CIFAR-10, but could not scale to larger datasets such as ImageNet. [34] proposed the binary weight network (BWN), which uses the sign function with a scaling factor to binarize the weights, and showed improved performance on larger datasets. Several attempts have also been made to employ multi-bit weights to enhance the representational capacity of weight-quantized networks [51, 20]. Various approaches quantize both weights and activations, offering higher memory savings and inference speed-up than their weight-quantization-only counterparts. [7] introduced binarized neural networks, proposing the sign function to binarize both the weights and the activations and the straight-through estimator (STE) to estimate the gradient. They showed good classification accuracy on relatively small datasets such as CIFAR-10, but [34] showed that this approach is not very successful on a larger dataset such as ImageNet. [34] proposed XNOR-Net, which uses the sign function with a scaling factor to binarize the weights and the activations. They showed improved performance on large-scale datasets and showed that the computationally expensive floating point convolution operations can be replaced by highly efficient XNOR and bit-counting operations in the binary domain. We use the same binarization scheme for our method, following recent binary networks [28, 24]. [49] tried combinations of 1-bit weights and multi-bit low-precision activations and gradients to obtain a better trade-off between accuracy and quantization error. [22] approximated both weights and activations as a weighted sum of multiple binary filters to improve performance. [40] employed ternary activations with binary weights for better representational capacity. [42] observed that the approximation of the binary weights and activations has sign inconsistencies that increase quantization error; they use reinforcement learning to mine channel-wise interactions that provide prior knowledge, alleviating the sign inconsistencies and preserving information, which reduces the quantization error. Recently, some new binarization schemes have also been proposed [10, 2]: [10] uses projection convolutional layers, while [2] improves upon the analytically calculated scaling factor of XNOR-Net. These different binarization schemes do not modify the backbone architecture. In contrast, we search for better backbone architectures given a binarization scheme.
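To make the "Sign + Scale" scheme concrete, below is a minimal numpy sketch of binarizing a weight tensor with the sign function and a single scaling factor, as in BWN and XNOR-Net [34]. The function name and tensor shapes are ours, for illustration only.

```python
import numpy as np

def binarize(w):
    """Approximate w by alpha * sign(w), where alpha = mean(|w|).

    alpha = mean(|w|) is the L2-optimal scalar: it minimizes
    ||w - alpha * sign(w)||_2 over all scalar scaling factors.
    """
    alpha = np.abs(w).mean()
    b = np.sign(w)
    b[b == 0] = 1.0  # map sign(0) to +1 so entries stay in {-1, +1}
    return alpha, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 3, 3, 3))   # a toy convolution weight tensor
alpha, b = binarize(w)
err = np.linalg.norm(w - alpha * b)  # quantization error of this layer
```

The optimality of alpha follows from setting the derivative of ||w - alpha * sign(w)||^2 with respect to alpha to zero, which yields alpha = mean(|w|).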
Architectural modifications.
Appropriate modifications to the backbone architecture can result in great improvements in accuracy [34, 24, 28]. [34] proposed XNOR-Net, showing that changing the order of batch normalization (BN) and the sign function is crucial for the performance of binary networks. XNOR-Net performs well on ImageNet with orders-of-magnitude faster inference, as the normal convolutions are replaced by XNOR operations. [2] also adopts the changes proposed in XNOR-Net. [18] proposed a local binary convolutional layer to approximate the response of a floating point CNN layer using a non-learnable binary sparse filter and a set of learnable linear filters. [28] connected the input floating point activations of consecutive blocks through identity connections before the sign function. Their motivation was to improve the representational capacity of binary networks by adding the floating point activation of the current block to the subsequent block. They also introduced a better approximation of the gradient of the sign function for back-propagation. [52] ensembled multiple smaller binary networks to achieve performance comparable to floating point networks. [24] used circulant binary convolutions to enhance the representational capabilities of binary networks. Note that using circulant binary convolutions increases the computational cost, but they do not report the FLOPs of their architecture, making fair comparison of algorithms with respect to FLOPs difficult. [33] proposed a modified version of separable convolutions to binarize the MobileNetV1 architecture. However, we found that their modified separable convolution modules do not generalize to architectures other than MobileNet. Please refer to the supplement for more details about [24] and [33]. [54] decomposed floating point networks into groups and approximated each group using a set of binary bases. They also introduced a method to learn this decomposition dynamically. Most recently, [36] used evolutionary algorithms to change the number of channels in each convolution layer of a binarized ResNet backbone. However, their method trades more computational cost for better performance, reducing their inference speed-up to far less than that of other binary networks.
Training methods.
Training methods designed for low-bit or 1-bit CNNs have been shown to be effective. [53] proposed methods to improve the training of low-bit networks, showing that quantized networks, when trained progressively from higher to lower bit-widths, do not get trapped in a bad local minimum. Recently, [11] proposed a training method for binary networks using two new losses: a Bayesian kernel loss and a Bayesian feature loss. They claim that their training method can be applied to all neural network architectures and show performance improvements using the ResNet18 backbone.
All these methods share one thing in common: the modifications are made without changing the convolutional layer types used or how these layers are combined, i.e., the global network topology. We instead aim to discover an entirely new backbone architecture explicitly suited to binary networks. The possibility that a different network topology can improve the performance of binary networks motivates us to search for binary networks.
2.2 Efficient Neural Architecture Search
We search architectures for binary networks by adopting ideas from neural architecture search (NAS) methods for floating point networks [56, 27, 46, 55, 32]. To reduce the severe computational cost of NAS, there are numerous proposals focused on accelerating the NAS algorithms [27, 9, 46, 3, 1, 31, 23, 43, 50, 25, 8, 26, 47, 45, 21, 32]. We categorize these attempts into cell-based search and gradient-based search algorithms.
Cell based search.
Starting from [56], many NAS methods [46, 9, 1, 31, 23, 27, 43, 50, 25, 8, 26, 47, 45, 21] have used cell-based search, where the objective of the NAS algorithm is to search for a cell, which is then stacked to form the final network. Cell-based search drastically reduces the search space from the entire network to a single cell, significantly reducing the computational cost. The cell-based search space is defined by the set of possible layer types and the convolutional cell template. Additionally, the searched cell can be stacked any number of times, depending on the computational budget.
Gradient based search algorithms.
In order to accelerate the search, methods including [27, 46, 43, 9] relax the discrete sampling of child architectures to be differentiable so that gradient descent can be used. The relaxation involves taking a weighted sum of several layer types during the search to approximate a single layer type in the final architecture. [27] uses a softmax over learnable parameters as the weights, while other methods [46, 43, 9] use the Gumbel-softmax [17] instead; both allow seamless back-propagation by gradient descent. Coupled with cell-based search, certain works have been able to drastically reduce the search time [9].
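The continuous relaxation described above can be sketched as a softmax-weighted sum of candidate layer outputs. This is a toy illustration in numpy; the candidate "ops" are stand-ins, not an actual search space.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract max for numerical stability
    return e / e.sum()

def mixed_op(x, ops, alphas):
    """DARTS-style mixed operation: a softmax-weighted sum of all
    candidate layer outputs approximates one discrete layer choice."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))

# Toy candidate "layers": identity, doubling, and a Zeroise-like op.
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0 * x]
alphas = np.zeros(3)            # uniform weights before any search
y = mixed_op(2.0, ops, alphas)  # averages the three candidate outputs
```

As the architecture parameters (alphas) are trained, the softmax concentrates on one candidate, so the mixed operation approaches a single discrete layer choice.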
We make use of both cell-based search and gradient-based search algorithms, with modifications to the cell-based search space, and introduce a new regularizer for the optimization process of the gradient-based search algorithm.
3 Binarizing Searched Architectures by NAS
It is well known that searched architectures outperform hand-crafted architectures. To obtain better binary networks, we binarize the architectures searched by cell-based, gradient-based search methods. Specifically, we apply the binarization scheme of XNOR-Net along with its architectural modifications [34] to architectures searched by DARTS, SNAS, and GDAS. We show the learning curves of the binarized searched floating point architectures on the CIFAR-10 dataset in Figure 2.
Disappointingly, GDAS and SNAS reach only a moderate test accuracy and quickly plummet, while DARTS does not train at all. This implies that floating point NAS methods cannot be trivially extended to search binary networks. We investigate the failure modes in training and find two issues: 1) separable convolutions accumulate quantization error repeatedly; 2) the cell template does not propagate gradients properly. To search binary networks, the set of layer types and the cell template should be redesigned to address the frequent gradient vanishing and the quantization error.
4 Approach
Based on a cell-based, gradient-based search method [27], we propose a binary network architecture search (BNAS) method. First, we define a set of layer types for binary networks that are resilient to quantization error. Second, we design a cell template that maintains gradient flow over multiple cells. Last, we propose a novel diversity regularizer that diversifies the search and show that it helps in finding better performing binary architectures. The searched cell is stacked to form final architectures for various computational budgets measured in FLOPs.
4.1 Set of Layer Types for Binary Networks
Unlike the layer types used in floating point NAS, the layer types used in BNAS should be tolerant of quantization error. Starting from the layer types popularly used in floating point NAS [27, 46, 9, 56], we investigate the tolerance of various convolutional layers to quantization error, including the standard convolution, the dilated convolution, and the separable convolution.
4.1.1 Convolutions and Dilated Convolutions
To investigate the resilience to quantization error, we review the binarization scheme we use [34]. Let $\mathbf{W}$ be the weights of a floating point convolution layer and $\mathbf{A}$ be its input activation. The floating point convolution can be approximated using the binary weights $\mathrm{sign}(\mathbf{W})$ and the binary input activation $\mathrm{sign}(\mathbf{A})$ as:

$\mathbf{A} \ast \mathbf{W} \approx \left( \mathrm{sign}(\mathbf{A}) \circledast \mathrm{sign}(\mathbf{W}) \right) \odot \mathbf{K}\alpha, \quad (1)$

where $\ast$ is the convolution operation, $\circledast$ is the binary convolution computed by XNOR and bit-count operations, $\odot$ is the Hadamard product (element-wise multiplication), $\alpha$ is the average of the magnitudes of $\mathbf{W}$, and $\mathbf{K}$ contains the averages of the magnitudes of the sub-tensors of $\mathbf{A}$ in each convolution window. Quantization error occurs because the average of the magnitudes of $\mathbf{W}$ ($\alpha$) and the average of the magnitudes of the sub-tensor of $\mathbf{A}$ centered at each spatial location ($\mathbf{K}$) are multiplied with the output of convolving only the signs, and in general

$\frac{\|\mathbf{a}\|_1}{n} \cdot \frac{\|\mathbf{b}\|_1}{n} \cdot \mathrm{sign}(\mathbf{a})^{\top} \mathrm{sign}(\mathbf{b}) \neq \mathbf{a}^{\top} \mathbf{b} \quad (2)$

for some vectors $\mathbf{a}$ and $\mathbf{b}$ of length $n$. Dilated convolutions are identical to standard convolutions in terms of quantization error. Since both convolutions and dilated convolutions show tolerable quantization error in binary networks [34], we include the standard convolution and the dilated convolution in our set of layer types.

4.1.2 Separable Convolutions
Separable convolutions [37] have been widely used to construct efficient network architectures for floating point networks [14]. Unlike in floating point networks, we argue that the separable convolution is not suitable for binary networks. For computational efficiency, it uses nested convolutions to approximate a single convolution. The nested binary convolution can be written as:

$\mathrm{Sep}(\mathbf{A}, \mathbf{W}) \approx \Big( \mathrm{sign}\big( (\mathrm{sign}(\mathbf{A}) \circledast \hat{\mathbf{W}}_1) \odot \mathbf{K}_1\alpha_1 \big) \circledast \hat{\mathbf{W}}_2 \Big) \odot \mathbf{K}_2\alpha_2, \quad (3)$

where $\mathrm{Sep}(\cdot)$ denotes the separable convolution, $\hat{\mathbf{W}}_1$ and $\hat{\mathbf{W}}_2$ are the binary weights for the first and second convolution operations in the separable convolution layer, and $\alpha_1, \alpha_2$ and $\mathbf{K}_1, \mathbf{K}_2$ are the scaling factors for the respective binary weights and activations. Since every scaling factor induces quantization error, the nested convolutions in separable convolutions result in more quantization error.
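To see why stacking binarized operations compounds the approximation error, consider the following toy numpy experiment (ours, not the paper's): it compares the error of applying the Sign + Scale approximation once to a composed linear map against applying it once per nested map.

```python
import numpy as np

def bin_approx(w):
    """alpha * sign(w): the Sign + Scale approximation of a weight matrix."""
    return np.abs(w).mean() * np.sign(w)

def rel_err(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
errs_single, errs_nested = [], []
for _ in range(20):
    w1 = rng.normal(size=(256, 256))
    w2 = rng.normal(size=(256, 256))
    w_eq = w2 @ w1  # the composed linear map
    # One approximation step for the composed map ...
    errs_single.append(rel_err(bin_approx(w_eq), w_eq))
    # ... versus one approximation step per nested map.
    errs_nested.append(rel_err(bin_approx(w2) @ bin_approx(w1), w_eq))

# Averaged over trials, nesting the approximation compounds the error.
```

This linear-algebra caricature mirrors the structure of equation (3): each nested stage introduces its own scaling-factor approximation, so the composed approximation drifts further from the exact map than a single approximation would.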
To empirically investigate how the quantization error affects training for the different layer types, we construct small networks formed by repeating each kind of convolutional layer three times, followed by three fully-connected layers. We train these networks on CIFAR-10 in the floating point and binary domains and summarize the results in Table 1.
Layer Type  Conv  Dil. Conv  Sep. Conv  
Kernel Size  
FP Acc. (%)  
Bin. Acc. (%) 
When binarized, both the convolution and dilated convolution layers show only a reasonable drop in accuracy, while the separable convolution layers perform no better than random guessing (10% for CIFAR-10). The observations in Table 1 confirm that the quantization error accumulated by the nested convolutions makes binary networks fail to train. This also partly explains why the binarized DARTS architecture in Figure 2 does not train: it has a large number of separable convolutions.
4.1.3 Zeroise
The Zeroise layer outputs all zeros irrespective of the input [27]. In the authors' implementation of [27] (https://github.com/quark0/darts), the Zeroise layers are removed when obtaining the final architecture. As this exclusion is not discussed in [27], we compare the accuracy with and without the Zeroise layer for DARTS in the DARTS column of Table 2 and empirically verify that the Zeroise layer is not particularly useful for floating point networks. We believe the reason is that it only reduces the network capacity.
Search Method  DARTS  BNAS  
Zeroise Layer  ✗  ✓  ✗  ✓ 
Train Acc. (%)  ()  (+)  
Test Acc. (%)  ()  (+) 
However, we argue that the Zeroise layer can play an important role in binary networks and should be included. In particular, we claim that the Zeroise layer can reduce the quantization error in binary networks, as shown in Figure 3. Empirically, we also observe a meaningful gain in performance when we include the Zeroise layers in the final architecture (see the BNAS column in Table 2).
Including the Zeroise layer in the final architecture is particularly beneficial when situations similar to Figure 3 occur frequently, as the reduction in quantization error is then significant. The degree of benefit may differ from dataset to dataset. As the dataset used for search may differ from the dataset used to train and evaluate the searched architecture, we propose to tune the probability of including the Zeroise layer. Specifically, we propose a generalized layer selection criterion that adjusts the probability of including the Zeroise layer via a transferability hyperparameter $\gamma$:

$o^{\ast} = \begin{cases} \text{Zeroise} & \text{if } \theta_z > \gamma \cdot \max_i \theta_i \\ \arg\max_{o_i} \theta_i & \text{otherwise}, \end{cases} \quad (4)$

where $\theta_z$ is the architecture parameter corresponding to the Zeroise layer and $\theta_i$ are the architecture parameters corresponding to the layers other than Zeroise. A larger $\gamma$ encourages picking the Zeroise layer only when it is substantially better than the other layers.
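A minimal sketch of how such a gamma-gated selection could be implemented; the function name and argument layout are ours, for illustration under the definitions above.

```python
import numpy as np

def select_layer(theta_zeroise, theta_others, gamma):
    """Pick Zeroise only if its architecture parameter beats the best
    non-Zeroise parameter scaled by the transferability factor gamma."""
    best = int(np.argmax(theta_others))
    if theta_zeroise > gamma * theta_others[best]:
        return "zeroise"
    return best  # index of the best non-Zeroise layer type

# gamma = 1 reduces to a plain argmax over all layer types;
# larger gamma makes Zeroise progressively harder to pick.
```

Tuning gamma thus trades off quantization-error reduction (more Zeroise) against network capacity (more learnable layers) when transferring the searched cell to a new dataset.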
With the separable convolutions excluded and the Zeroise layer type included, we summarize the defined set of layer types for BNAS in Table 3.
Layer Type  Bin Conv.  Bin Dil. Conv.  MaxPool  AvgPool  Zeroise  
Kernel Size  N/A 
Bin Conv., Bin Dil. Conv., MaxPool, and AvgPool refer to the binary convolution, binary dilated convolution, max pooling, and average pooling layers, respectively.
4.2 Inter-Cell Skip Connections in the Cell Template
With the defined set of layer types, we learn a network architecture with the convolutional cell template proposed in [56]. However, the learned architecture suffers from a severe vanishing gradient problem in the binary domain, as shown in Figure 4. The training loss remains high even after 150 epochs, implying high classification loss. The magnitudes of the gradients are proportional to the loss, but they are very small, especially in the early layers, indicating a vanishing gradient problem.
Investigating the reasons for the vanishing gradients, we observe the following. The skip-connections in the cell template proposed in [56] are confined to the inside of a single convolutional cell, i.e., they are intra-cell skip-connections. The intra-cell skip-connections do not propagate gradients outside the cell. To help convey information over multiple cells, we propose to add skip-connections between multiple cells, as illustrated in Figure 5. These inter-cell skip-connections help propagate gradients throughout the network. Note that since the hand-crafted binary networks use the ResNet family as their backbone, they already have similar residual connections, as shown in Figure 6. We empirically validate the usefulness of the inter-cell skip-connections in Section 5.5.
4.3 Diversifying Search for Binary Networks
After the binary cell-based search space is defined, we form an over-parameterized DAG containing all layer types, similar to [27]. Unfortunately, in BNAS the binary layers with learnable parameters (e.g., convolutional layers) are not selected as often in the early stages of search, as layers requiring no learning are more favorable to the search algorithm than the under-trained learnable layers. This is more prominent in the binary domain because binary layers train more slowly than floating point ones [7]. To alleviate this, we propose an exponentially annealed, entropy-based regularizer for the search, which we call the diversity regularizer. Specifically, we subtract the entropy of the architecture parameter distribution from the search objective:
$\mathcal{L}_{\mathrm{div}}(w, \theta) = \mathcal{L}(w, \theta) - \lambda\, e^{-t/\tau}\, \mathcal{H}\big(p(\theta)\big), \quad (5)$

where $\mathcal{L}(w, \theta)$ is the search objective of [27] (cross-entropy), $w$ are the DAG's parameters, $\theta$ are the architecture parameters, $\mathcal{H}(\cdot)$ is the entropy of the layer-type distribution $p(\theta)$, $\lambda$ is the scaling hyperparameter, $t$ is the epoch, and $\tau$ is the annealing hyperparameter. This encourages the architecture parameter distribution to be closer to uniform in the early stages, allowing the search to explore more diverse layer types.
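The annealed entropy term can be sketched as follows; the helper names are ours, and the cross-entropy part of the objective is omitted.

```python
import math
import numpy as np

def entropy_of_softmax(theta):
    """H(p(theta)) for p = softmax(theta); maximal when theta is uniform."""
    p = np.exp(theta - theta.max())
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def diversity_bonus(theta, lam, t, tau):
    """The term subtracted from the search loss: lambda * exp(-t/tau) * H.
    Large early in the search, exponentially annealed toward zero."""
    return lam * math.exp(-t / tau) * entropy_of_softmax(theta)

# search loss = cross_entropy - diversity_bonus(theta, lam, t, tau)
```

Since the bonus rewards high entropy only while the exponential factor is large, it pushes the search toward diverse layer types early on and then fades, letting the usual objective dominate later epochs.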
Using the diversity regularizer, we observe a relative increase in the average number of learnable layer types picked in the early epochs of the search. We also empirically validate the benefit of diversity in search on the CIFAR-10 dataset in Table 4. While the accuracy improvement from the diversity regularizer in floating point NAS methods such as DARTS [27] is marginal, the improvement for our binary networks is more meaningful.
Search Method  DARTS  BNAS  
Diversity  ✗  ✓  ✗  ✓ 
Test Acc. (%)  (+)  (+) 
5 Experiments
(We will publicly release the code and learned models soon.)
5.1 Experimental Setup
Datasets.
We use the CIFAR-10 [19] and ImageNet (ILSVRC 2012) [35] datasets to evaluate image classification accuracy. For searching binary networks, we use CIFAR-10. For training the final architectures from scratch, we use both CIFAR-10 and ImageNet. During the search, we hold out half of the CIFAR-10 training data as a validation set to evaluate the quality of the search. For the final evaluation of the searched architecture, we train it from scratch on the full training set and report top-1 (and, for ImageNet, top-5) accuracy.
Search details.
We train a small proxy network using SGD with the diversity regularizer (Section 4.3). We use momentum and an initial learning rate with cosine annealing [29], along with weight decay. We use the same architecture hyperparameters as [27], except for the additional diversity regularizer. Our cell search takes approximately 10 hours on a single NVIDIA RTX 2080 Ti GPU.
Learning details.
For CIFAR-10, we train the final networks using SGD with momentum and weight decay. We use the one-cycle learning rate scheduler [38] with the learning rate ranging from 5×10⁻² to 4×10⁻⁴. For ImageNet, we train the models using SGD with momentum, an initial learning rate, and weight decay, and use the cosine restart scheduler [29].
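For reference, cosine annealing [29], as used in our schedules, computes the learning rate at step t of a cycle of length T as lr_min + ½(lr_max − lr_min)(1 + cos(πt/T)). A sketch follows; the parameter values are placeholders, not the paper's settings.

```python
import math

def cosine_lr(t, cycle_len, lr_max, lr_min=0.0):
    """Cosine annealing: decays from lr_max at t=0 to lr_min at t=cycle_len."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

def cosine_restart_lr(epoch, cycle_len, lr_max, lr_min=0.0):
    """Warm restarts: reset t to 0 at the start of each cycle."""
    return cosine_lr(epoch % cycle_len, cycle_len, lr_max, lr_min)
```

The restart variant simply replays the cosine decay every cycle, which is what the cosine restart scheduler used for ImageNet training does.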
Final architecture configurations.
We vary the size of our BNAS variants to compare with other binary networks at different FLOPs, naming them BNAS-{Mini, A, B, C, D, E} as shown in Table 5.
BNAS  Mini  A  B  C  D  E 
# Cells  
# Chn.  
Dataset  CIFAR10  ImageNet 
Comparison with other binary networks.
In the experiments on CIFAR-10, we show the test accuracy at four different FLOPs budgets for binary networks. For XNOR-Net with different backbone architectures, we take the floating point architectures from a public source (https://github.com/kuangliu/pytorch-cifar) and apply the binarization scheme of XNOR-Net. In the experiments on ImageNet, we excerpt the test accuracy of various binary networks from their respective papers and show test accuracy at two different FLOPs budgets. Similar to how ABC-Net with a single base is used for comparison in previous work [28, 10], we compare against PCNN with a single projection kernel for both CIFAR-10 and ImageNet.
5.2 Comparing the Searched and Handcrafted Cell
We qualitatively compare our searched cell with the XNOR-Net cell, which is based on the ResNet18 architecture, in Figure 6. As shown in the figure, our searched cell has a contrasting structure to the hand-crafted ResNet18 cell. Both cells contain only two 3×3 binary convolution layer types, but the extra Zeroise layer types selected by our search algorithm help reduce the quantization error. The topology in which the Zeroise and convolution layer types are connected also contributes to the classification performance of our searched cell. In the following subsections, we show that our searched topology yields binary networks that outperform current hand-crafted binary networks. More qualitative comparisons of our searched cell can be found in the supplement.
5.3 Results on CIFAR10
We quantitatively compare our BNAS models to existing binary networks at various FLOPs on the CIFAR-10 dataset and summarize the results in Table 6. Note that the order of performance in floating point does not always carry over to binary networks: binarized DenseNet and SENet perform worse than binarized ResNet18. The architectures searched by the proposed BNAS method outperform all the hand-crafted architectures with the XNOR-Net binarization scheme at similar or smaller FLOPs. Our BNAS variants also outperform the state-of-the-art binary networks with the same FLOPs, except PCNN in one of the FLOPs budgets, where the 'Projection' binarization scheme used in PCNN has an advantage over the 'Sign + Scale' binarization scheme on CIFAR-10.
It is interesting to note that BNAS-Mini outperforms all the other networks in its FLOPs budget with far fewer FLOPs. Note that both PCNN and BNAS-C approach the floating point ResNet18's test accuracy.
FLOPs  Model (Backbone)  Binarization  Test Acc (%)
PCNN (WRN-22)  Projection
BNAS-Mini  Sign + Scale
XNOR-Net (ResNet18)  Sign + Scale
XNOR-Net (DenseNet)  Sign + Scale
XNOR-Net (NiN)  Sign + Scale
XNOR-Net (SENet)  Sign + Scale
BinaryNet (ResNet18)  Sign
BNAS-A  Sign + Scale
XNOR-Net (ResNet34)  Sign + Scale
XNOR-Net (WRN-40)  Sign + Scale
PCNN (WRN-22)  Projection
BNAS-B  Sign + Scale
XNOR-Net (ResNeXt29-64)  Sign + Scale
BNAS-C  Sign + Scale
ResNet18 (FP)  N/A
5.4 Results on ImageNet
FLOPs  Model  Binarization  Searched  Pretraining  Top-1 Acc. (%)  Top-5 Acc. (%)
BinaryNet [7]  Sign  ✗  ✗
ABC-Net [22]  Clip + Scale  ✗  ✗
XNOR-Net [34]  Sign + Scale  ✗  ✗
BNAS-D  Sign + Scale  ✓  ✗
Bi-Real [28]  Sign + Scale  ✗  ✓
XNOR-Net++ [2]  Sign + Scale*  ✗  ✗
PCNN [10]  Projection  ✗  ✓
BNAS-E  Sign + Scale  ✓  ✗
ResNet18 (FP)  N/A  ✗  ✗
We now quantitatively compare our searched binary networks to state-of-the-art binary networks on the ImageNet dataset in Table 7. As XNOR-Net proposed not to binarize the first convolution layer and the last fully-connected layer [34], a convention which subsequent binary networks [28, 22, 10, 2] have followed, we also follow it for ease of comparison. Note that Bi-Real, PCNN, and XNOR-Net++ additionally leave the down-sampling convolutions of the ResNet18 architecture unbinarized, which increases the FLOPs. We have confirmed with the authors of XNOR-Net that their reported accuracy was obtained without unbinarizing the down-sampling convolutions, which makes XNOR-Net differ in FLOPs from Bi-Real, PCNN, and XNOR-Net++. We also do not unbinarize the down-sampling convolutions in our binary networks.
To compare with XNOR-Net, ABC-Net, and BinaryNet, we train BNAS-D, which has comparable FLOPs. As shown in Table 7, with the same FLOPs, BNAS-D outperforms XNOR-Net by a large margin in both top-1 and top-5 accuracy, and outperforms BinaryNet and ABC-Net by even larger margins. Note that BNAS-D shares the same binarization scheme as XNOR-Net but has noticeably higher test accuracy, highlighting the effectiveness of the searched global topology of our architecture.
To compare with Bi-Real, XNOR-Net++, and PCNN, we train BNAS-E, which has comparable FLOPs. Note that Bi-Real and PCNN use floating point or binary-weight pre-training schemes, while XNOR-Net++ and BNAS-E do not use any pre-training. Even without pre-training, BNAS-E outperforms Bi-Real, XNOR-Net++, and PCNN in both top-1 and top-5 accuracy. With the same binarization scheme (comparing Bi-Real with BNAS-E), our BNAS-E outperforms the hand-crafted ResNet18 backbone in both top-1 and top-5 accuracy. Interestingly, even without applying the more recent binarization schemes of XNOR-Net++ and PCNN, our BNAS-E still outperforms both of them. We expect further improvements from applying the binarization schemes proposed in XNOR-Net++ and PCNN.
5.5 Ablation Studies
We perform ablation studies on the components of our full method. We use the CIFAR-10 dataset for these experiments at various FLOPs budgets and summarize the results in Table 8.
Comparing No Div with Full, the cell searched with the diversity regularizer clearly outperforms the cell searched without it for all model variants. Removing the inter-cell skip-connections (No Skip) severely degrades the performance of our binary networks. It is interesting to note that for all three variants (BNAS-A, -B, and -C), the No Skip models eventually collapse to very low training and test accuracy and exhibit gradient vanishing. Comparing No Zeroise with Full, our searched architectures with the Zeroise layers achieve better test accuracy on CIFAR-10. Interestingly, the largest model (BNAS-C) without Zeroise layers performs worse than BNAS-A and BNAS-B, due to excess complexity. Please refer to the supplement for a more detailed analysis of the ablated models.
Model  Full  No Skip  No Zeroise  No Div 
BNAS-A
BNAS-B
BNAS-C
6 Conclusion
To design better performing binary networks, we propose BNAS, a method to obtain binary networks by architecture search. BNAS searches for a cell that can be stacked to generate networks for various computational budgets. To configure the BNAS search space, we define a new set of binary layer types and modify the existing cell template. In addition, we propose to use the Zeroise layer type in the final network, whereas existing floating point NAS methods have used it only during search, and show that it improves the performance of binary networks. Moreover, we propose to diversify the selection of layers in the early stages of search and show that this helps in obtaining better binary architectures. The learned BNAS networks outperform hand-crafted binary networks at similar or smaller computational costs on both CIFAR-10 and ImageNet.
Acknowledgement
References
 [1] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018.
 [2] Adrian Bulat and Georgios Tzimiropoulos. XNOR-Net++: Improved binary neural networks. In BMVC, 2019.
 [3] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
 [4] Hanlin Chen, Li’an Zhuo, Baochang Zhang, Xiawu Zheng, Jianzhuang Liu, David S. Doermann, and Rongrong Ji. Binarized neural architecture search. ArXiv, abs/1911.10862, 2019.
 [5] Yukang Chen, Gaofeng Meng, Qian Zhang, Xinbang Zhang, Liangchen Song, Shiming Xiang, and Chunhong Pan. Joint neural architecture search and quantization. ArXiv, abs/1811.09426, 2018.
 [6] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
 [7] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [8] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures. In ECCV, 2018.
 [9] Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In CVPR, 2019.
 [10] Jiaxin Gu, Ce Li, Baochang Zhang, Jungong Han, Xianbin Cao, Jianzhuang Liu, and David Doermann. Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In AAAI, 2019.
 [11] Jiaxin Gu, Junhe Zhao, Xiaolong Jiang, Baochang Zhang, Jianzhuang Liu, Guodong Guo, and Rongrong Ji. Bayesian optimized 1-bit CNNs. In CVPR, 2019.
 [12] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
 [13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [14] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [15] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. arXiv:1602.07360, 2016.
 [16] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [17] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.
 [18] Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In CVPR, 2017.
 [19] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, 2009.
 [20] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
 [21] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638, 2019.
 [22] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In NIPS, 2017.
 [23] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
 [24] Chunlei Liu, Yue Qi, Xue Xia, Baochang Zhang, Jiaxin Gu, Jianzhuang Liu, Rongrong Ji, and David S. Doermann. Circulant binary convolutional networks: Enhancing the performance of 1-bit DCNNs with circulant back propagation. In CVPR, 2019.
 [25] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
 [26] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018.
 [27] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
 [28] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV, 2018.
 [29] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
 [30] Qian Lou, Lantao Liu, Minje Kim, and Lei Jiang. AutoQB: AutoML for network quantization and binarization on mobile devices. ArXiv, abs/1902.05690, 2019.
 [31] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and TieYan Liu. Neural architecture optimization. In NIPS. 2018.
 [32] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In ICML, 2018.
 [33] Hai Phan, Dang Huynh, Yihui He, Marios Savvides, and Zhiqiang Shen. MobiNet: A mobile binary network for image classification. arXiv preprint arXiv:1907.12629, 2019.
 [34] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 [35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
 [36] Mingzhu Shen, Kai Han, Chunjing Xu, and Yunhe Wang. Searching for accurate binary neural architectures. arXiv preprint arXiv:1909.07378, 2019.
 [37] Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for image classification.
 [38] Leslie N Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018.
 [39] Sarah Tan, Rich Caruana, Giles Hooker, Paul Koch, and Albert Gordo. Learning global additive explanations for neural nets using model distillation. arXiv preprint arXiv:1801.08640, 2018.
 [40] Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen. TBN: Convolutional neural network with ternary inputs and binary weights. In ECCV, 2018.
 [41] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
 [42] Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, and Qi Tian. Learning channel-wise interactions for binary convolutional neural networks. In CVPR, 2019.
 [43] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In CVPR, 2019.
 [44] Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision quantization of convnets via differentiable neural architecture search. ArXiv, abs/1812.00090, 2018.
 [45] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.
 [46] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture search. In ICLR, 2019.
 [47] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. In ICLR, 2019.
 [48] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
 [49] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [50] Yanqi Zhou, Siavash Ebrahimi, Sercan Ö Arık, Haonan Yu, Hairong Liu, and Greg Diamos. Resourceefficient neural architect. arXiv preprint arXiv:1806.07912, 2018.
 [51] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
 [52] Shilin Zhu, Xin Dong, and Hao Su. Binary ensemble neural network: More bits per network or more networks per bit? In CVPR, 2019.
 [53] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Towards effective low-bitwidth convolutional neural networks. In CVPR, 2018.
 [54] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Structured binary neural networks for accurate image classification and semantic segmentation. In CVPR, 2019.
 [55] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
 [56] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Appendix
We show more qualitative comparisons, additional ablations, the computational savings of our method, detailed reviews of related work, and brief remarks on quantized networks.
Appendix A More Qualitative Comparisons of Our Searched Cell
We present more qualitative comparisons of our searched cell with the binarized DARTS [27] cell, in addition to Section 5.2 where we compare it with the hand-crafted XNOR-Net cell. In Figure 7, we compare the normal cell of BNAS with the normal cell of DARTS [27]. Our cell has inter-cell skip connections which facilitate gradient propagation across multiple cells, making it less prone to vanishing gradient issues, whereas the binarized DARTS cell does not train at all (achieving only 10.01% test accuracy on CIFAR-10 in Table 11). Besides the excessive number of separable convolutions in the DARTS searched cell (Table 11), the lack of inter-cell skip connections in its cell template may also contribute to the failure of its architecture in the binary domain.
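As a toy illustration of why the inter-cell skip connections matter, the following sketch models backward gradient flow through a stack of binary cells; the per-cell attenuation factor is an illustrative assumption, not a measured value from our networks.

```python
def grad_norm_through_cells(n_cells, atten=0.5, skip=False):
    """Toy model of backward gradient flow through stacked binary cells.

    Each cell attenuates the gradient by `atten` (binarized layers pass
    gradients through a clipped straight-through estimator, which tends
    to shrink them). An inter-cell skip connection adds an identity path,
    so the per-cell backward factor becomes (atten + 1) instead of atten.
    """
    g = 1.0
    for _ in range(n_cells):
        g *= (atten + 1.0) if skip else atten
    return g

# Without skips the gradient norm decays geometrically and vanishes;
# the identity path of the skip connection keeps it from vanishing.
no_skip = grad_norm_through_cells(20, skip=False)
with_skip = grad_norm_through_cells(20, skip=True)
```

Under these assumed factors, twenty cells without skips shrink the gradient by roughly six orders of magnitude, mirroring the collapse observed in the No Skip ablation.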
We also qualitatively compare the BNAS reduction cell to the binarized DARTS reduction cell in Figure 8. Note that the BNAS reduction cell contains many Zeroise layers, which help reduce quantization error.
Appendix B More Analyses on the Ablated Models
B.1 ‘No Skip’ Setting
Our BNAS-A, -B, and -C variants show vanishing gradient problems without the proposed inter-cell skip connections (refer to the No Skip setting in Table 8). Besides the final classification accuracy, we also show the train and test accuracy curves of the No Skip ablation models in Figure 9 for a more detailed analysis. All three variants collapse to very low training and test accuracy after a reasonable number of epochs (600).
We also observe that the ablated models without inter-cell skip connections have extremely small gradients in most of the early layers relative to the later layers, even though the classification loss is very high, indicating a vanishing gradient problem (Figure 10).
B.2 ‘No Zeroise’ Setting
Using Zeroise layers in the final architecture has additional benefits beyond reducing the quantization error. Converting a floating point layer to a binary layer already yields substantial savings in memory and inference time [34, 28], whereas converting a floating point layer to a Zeroise layer eliminates that layer's memory and computation entirely, as it requires no learnable parameters and no operations. We summarize the memory savings, FLOPs, and inference speedup of our BNAS-A model in Table 9 by comparing BNAS-A with and without Zeroise layers. With the Zeroise layers, not only does the accuracy increase, but we also observe a significant increase in both memory savings and inference speedup.
| BNAS-A | w/o Zeroise | w/ Zeroise |
| --- | --- | --- |
| # Cells / # Chn. | — / — | — / — |
| Memory Savings | — | — |
| FLOPs | — | — |
| Inference Speedup | — | — |
| Test Acc. (%) | — | — |
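The accounting behind Table 9 can be sketched as follows. The layer specifications here are hypothetical and only illustrate the mechanism: a Zeroise layer contributes zero parameters and zero FLOPs to a cell.

```python
def cell_cost(layers):
    """Rough parameter/FLOP count for a cell built from layer specs.

    Each spec is (kind, c_in, c_out, k, hw): a convolution costs
    c_in * c_out * k * k parameters and the same number of
    multiply-accumulates at each of the hw * hw output positions,
    while a Zeroise layer simply outputs zeros at no cost.
    """
    params = flops = 0
    for kind, c_in, c_out, k, hw in layers:
        if kind == "zeroise":
            continue  # no learnable parameters, no computation
        params += c_in * c_out * k * k
        flops += c_in * c_out * k * k * hw * hw
    return params, flops

# Hypothetical cells: replacing half the convolutions with Zeroise
# layers halves both the parameter count and the FLOPs.
dense = [("conv", 64, 64, 3, 32)] * 4
sparse = [("conv", 64, 64, 3, 32)] * 2 + [("zeroise", 64, 64, 3, 32)] * 2
```

This is why selecting Zeroise layers in the final network increases memory savings and inference speedup on top of the usual binarization gains.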
B.3 Additional Ablation Study: No Dilated Convolution Layers
To factor out any performance improvements we might gain from more advanced layers, e.g. the dilated convolution layer type, we further ablate our model by removing it from our set of layer types. This ensures that we use the same convolution layer types as the ResNet-18 architecture. We qualitatively compare the cells searched with and without the dilated convolution layer type in Figure 11, and quantitatively compare XNOR-Net and BNAS-A* in Table 10. Our search algorithm still discovers a better architecture given the same binarization scheme and the same set of convolution layer types, indicating that the observed performance gain comes mostly from the search method and the Zeroise layers.
| FLOPs | Model | Dil. Conv. | Binarization | Test Acc. (%) |
| --- | --- | --- | --- | --- |
| — | XNOR-Net (ResNet-18) | ✗ | Sign + Scale | — |
| — | BNAS-A* | ✗ | Sign + Scale | — |
| — | BNAS-A | ✓ | Sign + Scale | — |
Appendix C Additional Discussions on Separable Convolution in the Set of Layer Types
In Section 3 of the main paper, we identify two causes for the failure of binarized DARTS, SNAS, and GDAS: first, the accumulation of quantization error due to separable convolutions, and second, the lack of inter-cell skip connections, which makes propagating gradients across multiple cells difficult. For the first issue (using separable convolutions repeatedly accumulates quantization error), we proposed to exclude separable convolutions from the set of layer types entirely. Nonetheless, we investigate the effect of keeping them in the set of layer types. We summarize the results in Table 11, along with the proportion of separable convolution layers selected in each searched architecture and the respective classification accuracy.
Since DARTS, SNAS, and GDAS search in the floating point domain, their search methods do not take quantization error into account and thus yield cells with a relatively high percentage of separable convolutions. In contrast, we search directly in the binary domain, which enables our search method to identify that separable convolutions incur high quantization error, and hence to obtain a cell that contains only one separable convolution (the proportion of separable convolutions is 12.5% for BNAS, while it is more than 36% for the others). Note that explicitly excluding separable convolutions from the set of layer types results in even better performing binary architectures, motivating us to remove them entirely for our method. The discussion of why separable convolutions fail can be found in Section 4.1.2.
| Method | Test Accuracy (%) | Proportion of Sep. Conv. (%) |
| --- | --- | --- |
| DARTS + Binarized | — | — |
| SNAS + Binarized | — | — |
| GDAS + Binarized | — | — |
| BNAS-A (w/ Sep. Conv. in search space) | — | 12.5 |
| BNAS-A (w/o Sep. Conv.) | — | 0 |
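The compounding of quantization error in nested (separable) convolutions can be illustrated numerically. This sketch applies the binarization scheme of [34] (w ≈ α·sign(w) with α = mean|w|) to random matrices standing in for depthwise and pointwise kernels; the matrix sizes are arbitrary.

```python
import numpy as np

def binarize(w):
    # XNOR-Net style binarization [34]: w ~= alpha * sign(w).
    return np.abs(w).mean() * np.sign(w)

def rel_err(target, approx):
    # Relative Frobenius-norm error of the approximation.
    return np.linalg.norm(target - approx) / np.linalg.norm(target)

rng = np.random.default_rng(0)
w_dw = rng.normal(size=(64, 64))  # stand-in for a depthwise kernel
w_pw = rng.normal(size=(64, 64))  # stand-in for a pointwise kernel

# A single binarized convolution incurs one approximation error.
single = rel_err(w_dw, binarize(w_dw))

# Nesting two binarized convolutions (as in a separable convolution)
# compounds both approximation errors in the composed operator.
nested = rel_err(w_dw @ w_pw, binarize(w_dw) @ binarize(w_pw))
```

In this toy setting the nested approximation is consistently worse than the single one, in line with the accumulated quantization error argument of Section 4.1.2.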
Appendix D Memory Saving and Inference Speedup
We additionally discuss the memory savings and inference speedup of our searched binary networks. In contrast to other binary networks that use ResNet-18 as their base architecture for classification on ImageNet, our searched architectures do not have a base architecture by definition. While other binary networks report memory savings and inference speedup with respect to their base architectures, we compute these quantities for our searched binary networks by comparing each to its own floating point counterpart. We follow [28] in computing the memory savings and inference speedup of the ImageNet models, BNAS-D and BNAS-E, with respect to their floating point counterparts.
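Concretely, the convention of [28] can be sketched as below, under the standard assumptions that full-precision parameters occupy 32 bits, binary parameters 1 bit, and a binary multiply-accumulate counts as 1/64 of a floating point operation (XNOR + bitcount on 64-bit words); the parameter and FLOP counts passed in are placeholders, not our models' actual counts.

```python
def memory_saving(fp_params, bin_params):
    """Ratio of the all-32-bit model size to the mixed-precision size,
    where binarized parameters are stored with 1 bit each."""
    full = 32.0 * (fp_params + bin_params)
    mixed = 32.0 * fp_params + 1.0 * bin_params
    return full / mixed

def inference_speedup(fp_flops, bin_flops):
    """Ratio of all-floating-point FLOPs to mixed FLOPs, where a binary
    operation counts as 1/64 of a floating point operation."""
    full = fp_flops + bin_flops
    mixed = fp_flops + bin_flops / 64.0
    return full / mixed

# A fully binarized model approaches the limiting factors of 32x and 64x.
mem = memory_saving(0, 1_000_000)
speed = inference_speedup(0, 1_000_000)
```

Real binary networks keep some layers (e.g. the first and last) in full precision, so the achieved ratios fall below these limits.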
Appendix E More Details About [33] and [24] in Literature Review
E.1 Detailed Discussion About [33]: MobiNet
MobiNet [33] binarizes the MobileNetV1 architecture by modifying the separable convolution: skip connections are added across the depthwise and pointwise convolutions, together with an extra mid-block module. The authors claim that this modification alleviates vanishing gradient issues (Section 3.1 in [33]). To investigate whether the failure of separable convolutions is due to vanishing gradients, we train a three-layer CNN with both the plain separable convolution and the modified separable convolution of [33] (denoted ‘MobiNet Sep. Conv.’), and summarize the results in Table 12. The chances of gradient vanishing are very low in such a shallow network.
| Layer Type | Sep. Conv. | MobiNet Sep. Conv. |
| --- | --- | --- |
| Kernel Size | — | — |
| FP Acc. (%) | N/A | N/A |
| Bin. Acc. (%) | — | — |
Both the binarized separable convolution and MobiNet's modified separable convolution still do not train at all (note that random guessing yields 10% in 10-way classification on CIFAR-10), even in a three-layer CNN. This implies that the failure to train may not be attributable to the vanishing gradient problem. We argue instead that the failure comes from the accumulated quantization error in both separable convolution layer types (Section 4.1.2 in the main paper). As MobiNet's separable convolution layer does not train at all, we do not include it in our set of layer types for the search.
E.2 Detailed Discussion About [24]: CBCN
The idea of CBCN is to use circulant filters, i.e., rotated versions of the original convolution kernels, to make binary networks rotation invariant and thereby improve performance [24]. The authors report ImageNet top-1 and top-5 accuracy using four circulant filters in place of each original convolution kernel. The main drawback of this idea is that using more than one circulant filter increases the FLOPs, since the number of required convolution operations grows linearly with the number of circulant filters. However, [24] reports neither the exact architecture of their binary network nor the FLOPs of the resulting architecture with circulant filters. This makes a FLOPs-based comparison with other binary networks unfair, as CBCN with four circulant filters may have more FLOPs than other binary networks based on the same ResNet-18 architecture. We therefore skip the comparison in the main paper to maintain fairness in terms of FLOPs, as we do not know the exact FLOPs budget of CBCN.
Nevertheless, using our best guess, we estimate the minimum and maximum FLOPs of CBCN on the architecture used for their ImageNet experiments (the maximum estimate was obtained from the authors' code for [24]; the minimum estimate from Figure 6 and Table 5 in [24]), scale our models to the respective FLOPs budgets (BNAS-F and BNAS-G), and report the classification accuracy in Table 13.
| FLOPs | Model | Searched | Pretraining | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| — | BNAS-F | ✓ | ✗ | — | — |
| (—) | CBCN | ✗ | ✓ | — | — |
| — | BNAS-G | ✓ | ✗ | — | — |
For the maximum FLOPs estimate, we multiply the FLOPs of Bi-Real Net [28] (roughly the same FLOPs as CBCN when only one circulant filter is used) by four, since most of the FLOPs in a CNN come from convolution layers and the FLOPs of convolution layers increase linearly with the number of circulant filters. For the minimum estimate, we assume that circulant filters are not used for most of the more expensive convolution layers, such as the first convolution layer or the downsampling convolutions in the ResNet-18 architecture, since CBCN may not use circulant filters for all convolution layers.
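The two bounds amount to simple arithmetic, sketched below; the layer names and FLOP values are hypothetical placeholders for the per-layer profile of the CBCN backbone.

```python
def cbcn_flops_max(bireal_flops, num_circulant=4):
    # Upper bound: every convolution uses num_circulant circulant
    # filters, so convolution FLOPs (the dominant cost) scale linearly.
    return num_circulant * bireal_flops

def cbcn_flops_min(per_layer_flops, single_filter_layers, num_circulant=4):
    # Lower bound: the expensive layers (e.g. the stem and downsampling
    # convolutions) keep a single filter; the rest are multiplied.
    return sum(
        flops if name in single_filter_layers else num_circulant * flops
        for name, flops in per_layer_flops.items()
    )

# Hypothetical per-layer profile (arbitrary units).
profile = {"stem": 5, "down1": 3, "block1": 2, "block2": 2}
upper = cbcn_flops_max(sum(profile.values()))       # 4 * 12 = 48
lower = cbcn_flops_min(profile, {"stem", "down1"})  # 5 + 3 + 8 + 8 = 24
```

BNAS-F and BNAS-G are then scaled to budgets matching such lower and upper estimates, respectively.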
Even without pretraining, BNAS-F performs close to CBCN, and BNAS-G outperforms CBCN by a fair margin in both top-1 and top-5 accuracy. If we assume that CBCN's FLOPs lie between those of BNAS-F and BNAS-G, we can argue that our method performs at least on par with CBCN, given the accuracy of BNAS-F and BNAS-G and the lack of pretraining.
Appendix F Remarks on Quantized but ‘Non 1-bit’ (Not Fully Binary) CNNs
Quantized networks that incorporate search are a type of efficient network that is not directly comparable to 1-bit CNNs: they cannot utilize XNOR and bitcount operations at inference, which significantly reduces their memory savings and inference speedup. Nonetheless, it is interesting to review this line of work on efficient networks with higher resource consumption, especially the recent ones. Notably, [5, 41, 44, 30] all search for multi-bit quantization policies, and only [5] also searches for network architectures. [4] searches for network architectures for binary-weight (not 1-bit) CNNs. Moreover, [41, 44, 30] search for quantization policies rather than network architectures, further differentiating them from our method.