Learning Architectures for Binary Networks

02/17/2020 ∙ by Kunal Pratap Singh, et al. ∙ Gwangju Institute of Science and Technology

Backbone architectures of most binary networks are well-known floating point architectures, such as the ResNet family. Questioning whether architectures designed for floating point networks are also the best for binary networks, we propose to search architectures for binary networks (BNAS). Specifically, based on the cell based search method, we define a new set of layer types, design a new cell template, and rediscover the utility of the Zeroise layer, which we propose to use to learn well-performing binary networks. In addition, we propose to diversify the early stages of the search to learn better performing binary architectures. We show that our searched binary networks outperform state-of-the-art binary networks on the CIFAR10 and ImageNet datasets.


1 Introduction

Increasing demand for deploying high performance visual recognition systems encourages research on efficient neural networks. The research fields include pruning [12], efficient architecture design [48, 15, 14], low-rank decomposition [16], network quantization [6, 34, 20], and knowledge distillation [13, 39]. In particular, network quantization, especially binary or 1-bit CNNs, is known to provide extreme computational savings with a relatively small accuracy drop. In addition, the computationally expensive floating point convolutions are replaced with computationally efficient XNOR and bit-count operations, which significantly speeds up inference [34].

Current binary neural networks, however, are based on architectures designed for floating point weights and activations [34, 28, 36]. We hypothesize that the backbone architectures used in current binary networks may not be optimal for binary parameters, as they were designed for floating point ones. Instead, we may learn better binary network architectures by exploring the space of binary network topologies. As shown in Figure 1, given the same binarization scheme, our searched architectures clearly outperform all the handcrafted architectures, indicating the prowess of our search method in discovering better binary networks.


Figure 1: Test accuracy (%) vs. FLOPs on CIFAR10 for various backbone architectures binarized using the XNOR-Net binarization scheme [34]. Our searched architectures outperform binarized floating point architectures. Note that our BNAS-Mini, which has far fewer FLOPs, outperforms all other binary networks except the one based on WideResNet40 (WRN40).

To discover better performing binary networks, we first apply one of the most well-known binarization schemes [34] to architectures searched by floating point NAS methods that use cell based search and gradient based search algorithms [27, 46, 9], and train the resulting binary networks on CIFAR10. Disappointingly, the binarized searched architectures do not perform well, as shown in Figure 2. We hypothesize two reasons for this failure. First, the set of layer types used in floating point NAS is not necessarily the best set for binary networks. For example, separable convolutions have large quantization error when binarized, since nested convolutions increase quantization error (Section 4.1.2). Additionally, we discover that the Zeroise layer type, which harms the accuracy of floating point networks, improves the accuracy of binary networks (Section 4.1.3). Second, the convolutional cell template used by floating point cell based NAS methods is not well suited for the binary domain because of severe vanishing gradients (Section 4.2).

Based on the above hypothesis and empirical observations, we formulate a cell based search space explicitly defined for binary networks. In addition, we propose to use a novel diversity regularizer to encourage exploration of diverse layer types in the early stages of search. We show that the diversity regularizer helps in searching better performing binary architectures.

Contributions.

Our contributions are four-fold:


  • We propose the first architecture search method for binary neural networks. The searched architectures are adjustable to various computational budgets (in FLOPs) and outperform state-of-the-art binary networks with comparable computational costs on both CIFAR10 and ImageNet datasets.

  • We propose to add inter-cell skip-connections to the cell template to alleviate the gradient vanishing problem in binary architectures.

  • We propose to use the Zeroise layer to reduce the quantization error in binary networks and show that it results in better performing binary networks.

  • We propose to diversify early stages of search and show that it also contributes to discovering better performing binary networks.

2 Related Work

2.1 Binary Neural Networks

There have been numerous attempts to improve the accuracy of low-bit precision CNNs. We categorize them into binarization schemes, architectural modifications and training methods.

Binarization schemes.

[6] used the sign function to binarize the weights and showed good performance on CIFAR10, but could not scale to larger datasets such as ImageNet. [34] proposed the binary weight network (BWN), which uses the sign function with a scaling factor to binarize the weights, and showed improved performance on larger datasets. Several attempts have also been made to employ multi-bit weights to enhance the representational capacity of weight-quantized networks [51, 20]. Various approaches quantize both weights and activations, offering higher memory savings and inference speed-up than their weight-quantization-only counterparts. [7] introduced binarized neural networks, proposing to use the sign function to binarize the weights and the activations and the straight-through estimator (STE) to estimate the gradient. They showed good classification accuracy on relatively small datasets such as CIFAR10, but [34] showed that the approach is not very successful on a larger dataset such as ImageNet. [34] proposed XNOR-Net, which uses the sign function with a scaling factor to binarize the weights and the activations. They showed improved performance on large scale datasets, and showed that the computationally expensive floating point convolution operations can be replaced by highly efficient XNOR and bit-counting operations in the binary domain. We use the same binarization scheme for our method, following recent binary networks [28, 24]. [49] tried combinations of 1-bit weights and multi-bit low precision activations and gradients to obtain a better trade-off between accuracy and quantization error. [22] approximated both weights and activations as a weighted sum of multiple binary filters to improve performance. [40] employed ternary activations with binary weights for better representation capacity. [42] showed that the approximation of the binary weights and activations suffers from sign inconsistencies, which lead to increased quantization error; they use reinforcement learning to mine channel-wise interactions that provide prior knowledge, alleviating the sign inconsistency and preserving information, which reduces the quantization error. Recently, some new binarization schemes have also been proposed [10, 2]: [10] uses projection convolutional layers, while [2] improves upon the analytically calculated scaling factor of XNOR-Net.

These different binarization schemes do not modify the backbone architecture. In contrast, we search for better backbone architectures given a binarization scheme.

Architectural modifications.

Appropriate modifications to the backbone architecture can result in great improvements in accuracy [34, 24, 28]. [34] proposed XNOR-Net, which shows that changing the order of the batch normalization (BN) and sign function is crucial for the performance of binary networks. XNOR-Net performs well on ImageNet with orders of magnitude faster inference, as the normal convolutions are replaced by XNOR operations. [2] also adopt the changes proposed in XNOR-Net. [18] proposed a local binary convolutional layer to approximate the response of a floating point CNN layer using a non-learnable binary sparse filter and a set of learnable linear filters. [28] connected the input floating point activations of consecutive blocks through identity connections before the sign function. Their motivation was to improve the representational capacity of binary networks by adding the floating point activation of the current block to the subsequent block. They also introduced a better approximation of the gradient of the sign function for back-propagation. [52] ensembled multiple smaller binary networks to achieve performance comparable to floating point networks. [24] used circulant binary convolutions to enhance the representational capabilities of binary networks. Note that using circulant binary convolutions increases the computational cost, but they do not report the FLOPs of their architecture, making a fair comparison of algorithms with respect to FLOPs difficult. [33] proposed a modified version of separable convolutions to binarize the MobileNetV1 architecture. However, we found that their modified separable convolution modules do not generalize to architectures other than MobileNet. Please refer to the supplement for more details about [24] and [33]. [54] decompose floating point networks into groups and approximate each group using a set of binary bases; they also introduced a method to learn this decomposition dynamically. Most recently, [36] used evolutionary algorithms to change the number of channels of each convolution layer of a binarized ResNet backbone. However, their method trades more computation cost for better performance, reducing their inference speed-up to far less than that of other binary networks.

Training Methods.

Training methods designed for low-bit or 1-bit CNNs have been shown to be effective. [53] proposed methods to improve the training of lower bit networks; they showed that quantized networks, when trained progressively from higher to lower bit-width, do not get trapped in a local minimum. Recently, [11] proposed a training method for binary networks using two new losses, a Bayesian kernel loss and a Bayesian feature loss. They claim that their training method can be applied to any neural network architecture and show performance improvements using the ResNet18 backbone.

All these methods have one thing in common: the modifications are made without changing the convolution layer types used or how these layers are combined, i.e., the global network topology. We instead wish to discover an entirely new backbone architecture that is explicitly suited to binary networks. The possibility that a different network topology can improve the performance of binary networks motivates us to search for binary architectures.

2.2 Efficient Neural Architecture Search

We search architectures for binary networks by adopting ideas from neural architecture search (NAS) methods for floating point networks [56, 27, 46, 55, 32]. To reduce the severe computational cost of NAS, numerous proposals focus on accelerating the NAS algorithms [27, 9, 46, 3, 1, 31, 23, 43, 50, 25, 8, 26, 47, 45, 21, 32]. We categorize these attempts into cell based search and gradient based search algorithms.

Cell based search.

Starting from [56], many NAS methods [46, 9, 1, 31, 23, 27, 43, 50, 25, 8, 26, 47, 45, 21] have used the cell based search, where the objective of the NAS algorithm is to search for a cell, which will then be stacked to form the final network. The cell based search reduces the search space drastically from the entire network to a cell, significantly reducing the computational cost. The cell based search space is formulated by the set of possible layer types and the convolutional cell template. Additionally, the searched cell can be stacked any number of times given the computational budget.
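To make the stacking concrete, here is a minimal sketch (our own illustration, not the authors' code) of how a searched cell could be repeated to meet a FLOPs budget; the `make_cell` factory and the placement of reduction cells at one third and two thirds of the depth are assumptions on our part.

```python
import torch
import torch.nn as nn

class StackedCellNet(nn.Module):
    """Illustrative only: stack one searched cell to meet a FLOPs budget.

    `make_cell` is a hypothetical factory that instantiates the searched cell
    (normal or reduction variant) for given input/output channel counts.
    """
    def __init__(self, make_cell, num_cells=8, init_channels=16, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, init_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(init_channels),
        )
        cells = []
        c = init_channels
        # Reduction cells (stride 2, doubled channels) at 1/3 and 2/3 of the depth.
        reduction_idx = {num_cells // 3, 2 * num_cells // 3}
        for i in range(num_cells):
            reduction = i in reduction_idx
            c_out = c * 2 if reduction else c
            cells.append(make_cell(c_in=c, c_out=c_out, reduction=reduction))
            c = c_out
        self.cells = nn.ModuleList(cells)
        self.classifier = nn.Linear(c, num_classes)

    def forward(self, x):
        x = self.stem(x)
        for cell in self.cells:
            x = cell(x)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.classifier(x)
```

Varying `num_cells` and `init_channels` is what lets one searched cell serve several computational budgets.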

Gradient based search algorithms.

In order to accelerate the search, methods including [27, 46, 43, 9] relax the discrete sampling of child architectures to be differentiable so that the gradient descent algorithm can be used. The relaxation involves taking a weighted sum of several layer types during the search to approximate a single layer type in the final architecture. [27] uses softmax of learnable parameters as the weights, while other methods [46, 43, 9] use the Gumbel-softmax [17] instead, both of which allow seamless back-propagation by gradient descent. Coupled with the use of the cell based search, certain works have been able to drastically reduce the search time [9].
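As a reference point, the following sketch shows the softmax-based relaxation in its simplest form; it is our own simplification with hypothetical class and attribute names, not the implementation of [27].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of a discrete layer choice (DARTS-style sketch)."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                            # candidate layer types
        self.arch_params = nn.Parameter(torch.zeros(len(ops)))   # one parameter per type

    def forward(self, x):
        # Weighted sum of all candidate layers; the weights come from a softmax over
        # the architecture parameters, so gradients flow to both the layer weights
        # and the architecture parameters.
        weights = F.softmax(self.arch_params, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After search, the discrete layer type is recovered as the argmax:
# chosen = mixed_op.ops[mixed_op.arch_params.argmax()]
```

Gumbel-softmax based methods [46, 43, 9] would replace the plain softmax with a sampled, temperature-annealed relaxation (e.g., `F.gumbel_softmax`).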

We make use of both the cell based search and gradient based search algorithms with modifications to the cell based search space and introduce a new regularizer for the optimization process of the gradient based search algorithms.

3 Binarizing Searched Architectures by NAS

It is well known that searched architectures outperform hand-crafted architectures. To obtain better binary networks, we binarize architectures searched by cell based, gradient based NAS methods. Specifically, we apply the binarization scheme of XNOR-Net along with its architectural modifications [34] to architectures searched by DARTS, SNAS, and GDAS. We show the learning curves of the binarized searched floating point architectures on the CIFAR10 dataset in Figure 2.

Figure 2: Learning curves of binarized searched architectures on CIFAR10. The XNOR-Net binarization scheme and architectural modifications are applied in all cases. In contrast to our BNAS, the binarized searched architectures fail to train well.

Disappointingly, GDAS and SNAS reach only a modest test accuracy and quickly plummet, while DARTS does not train at all. This implies that floating point NAS methods cannot be trivially extended to search binary networks. We investigate the failure modes in training and find two issues: 1) using separable convolutions accumulates quantization error repetitively, and 2) the cell template does not propagate gradients properly. To search binary networks, the set of layer types and the cell template should be redesigned to address the frequent gradient vanishing and the quantization error.

4 Approach

Based on a cell based gradient search method [27], we propose a binary network architecture search (BNAS) method to search binary networks. First, we define a set of layer types for binary networks that are resilient to quantization error. Second, we design a cell template that maintains gradients over multiple cells. Last, we propose a novel diversity regularizer to diversify the search and show that it helps to find better performing binary architectures. The searched cell is stacked to form final architectures for various computational budgets measured in FLOPs.

4.1 Set of Layer Types for Binary Networks

Unlike the layer types used in floating point NAS, the layer types used in BNAS should be tolerant of quantization error. Starting from the layer types popularly used in floating point NAS [27, 46, 9, 56], we investigate the tolerance of various convolutional layers to quantization error. The layer types we investigate include the convolution, the dilated convolution, and the separable convolution.

4.1.1 Convolutions and Dilated Convolutions

To investigate the resilience to quantization error, we review the binarization scheme we use [34]. Let $\mathbf{W}$ be the weights of a floating point convolution layer and $\mathbf{A}$ be an input activation. The floating point convolution can be approximated by the binary parameters, denoted by $\hat{\mathbf{W}}$, and the binary input activation, denoted by $\hat{\mathbf{A}}$, as:

$$\mathbf{W} * \mathbf{A} \approx \big(\hat{\mathbf{W}} \circledast \hat{\mathbf{A}}\big) \odot \alpha \mathbf{K}, \qquad (1)$$

where $*$ is the convolution operation, $\circledast$ is the convolution implemented with XNOR and bit-count operations, $\odot$ is the Hadamard product (element wise multiplication), $\hat{\mathbf{W}} = \operatorname{sign}(\mathbf{W})$, $\hat{\mathbf{A}} = \operatorname{sign}(\mathbf{A})$, $\alpha$ is the average of the magnitudes of $\mathbf{W}$, and $\mathbf{K}$ collects the scaling factors $\beta$, each of which is the average of the magnitudes of the sub-tensor of $\mathbf{A}$ centered at the corresponding location $\mathbf{x}$. The quantization error occurs because the average of the magnitudes of $\mathbf{W}$ ($\alpha$) and the average of the magnitudes of the sub-tensor of $\mathbf{A}$ centered at $\mathbf{x}$ ($\beta$) are multiplied to the output of convolving only the signs, and

$$\frac{\|\mathbf{a}\|_{\ell 1}}{n} \cdot \frac{\|\mathbf{b}\|_{\ell 1}}{n} \cdot \big(\operatorname{sign}(\mathbf{a})^{\top} \operatorname{sign}(\mathbf{b})\big) \neq \mathbf{a}^{\top} \mathbf{b} \qquad (2)$$

for some vectors $\mathbf{a}$ and $\mathbf{b}$ of length $n$. Dilated convolutions are identical to convolutions in terms of quantization error. Since both convolutions and dilated convolutions show tolerable quantization error in binary networks [34], we include the standard convolutions and dilated convolutions in our set of layer types.
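For reference, a minimal sketch of this binarization scheme as we read it (a simplification, not the authors' code) is given below: weights and activations are binarized with the sign function and a straight-through estimator, and the sign-only convolution is rescaled by the per-filter factor alpha and the spatially pooled activation magnitudes K.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """sign() in the forward pass, clipped straight-through estimator in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (the usual clipped STE).
        return grad_out * (x.abs() <= 1).float()

class BinConv2d(nn.Module):
    """XNOR-Net style binary convolution (simplified sketch, not the reference code)."""
    def __init__(self, c_in, c_out, k, stride=1, padding=0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        self.stride, self.padding = stride, padding

    def forward(self, a):
        w = self.weight
        alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)   # per-filter scaling factor
        beta = a.abs().mean(dim=1, keepdim=True)             # per-location activation magnitude
        w_bin = SignSTE.apply(w)
        a_bin = SignSTE.apply(a)
        out = F.conv2d(a_bin, w_bin, stride=self.stride, padding=self.padding)
        # Rescale the sign-only convolution by alpha and the spatially averaged |A| map K.
        avg_kernel = torch.ones(1, 1, w.shape[2], w.shape[3], device=a.device) / (w.shape[2] * w.shape[3])
        K = F.conv2d(beta, avg_kernel, stride=self.stride, padding=self.padding)
        return out * alpha.view(1, -1, 1, 1) * K
```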

4.1.2 Separable Convolutions

Separable convolutions [37] have been widely used to construct efficient network architectures for floating point networks [14]. Unlike in floating point networks, we argue that the separable convolution is not suitable for binary networks. For computational efficiency, it uses nested convolutions to approximate a single convolution. The nested convolution can be written as:

$$\operatorname{SepConv}(\mathbf{W}, \mathbf{A}) \approx \Big(\hat{\mathbf{W}}_2 \circledast \big((\hat{\mathbf{W}}_1 \circledast \hat{\mathbf{A}}) \odot \alpha_1 \boldsymbol{\beta}_1\big)\Big) \odot \alpha_2 \boldsymbol{\beta}_2, \qquad (3)$$

where $\operatorname{SepConv}(\cdot, \cdot)$ denotes the separable convolution, $\hat{\mathbf{W}}_1$ and $\hat{\mathbf{W}}_2$ are the binary weights for the first and second convolution operations in the separable convolution layer, and $\alpha_1$, $\alpha_2$ and $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$ are the scaling factors for the respective binary weights and activations. Since every scaling factor induces quantization error, the nested convolutions in separable convolutions result in more quantization error.
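To make the effect of the nesting measurable, the sketch below (our own illustration with simplified per-tensor scaling, not the authors' code) compares the relative error of a single binarized convolution with that of a binarized depthwise-plus-pointwise pair; with random tensors the nested form typically shows the larger error, though the exact numbers depend on the data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def binarize(t, dims):
    """Sign + mean-|.| scaling (forward only, for error measurement)."""
    alpha = t.abs().mean(dim=dims, keepdim=True)
    return torch.sign(t) * alpha

x = torch.randn(8, 16, 32, 32)
w_full = torch.randn(32, 16, 3, 3) * 0.1   # one ordinary 3x3 convolution
w_dw   = torch.randn(16, 1, 3, 3) * 0.1    # depthwise part of a separable convolution
w_pw   = torch.randn(32, 16, 1, 1) * 0.1   # pointwise part

# Single binarized convolution: sign+scale is applied once to inputs and weights.
y_fp  = F.conv2d(x, w_full, padding=1)
y_bin = F.conv2d(binarize(x, (1, 2, 3)), binarize(w_full, (1, 2, 3)), padding=1)

# Binarized separable convolution: the intermediate activation is binarized again,
# so the sign+scale approximation is applied twice in a row.
z_fp  = F.conv2d(F.conv2d(x, w_dw, padding=1, groups=16), w_pw)
z_dw  = F.conv2d(binarize(x, (1, 2, 3)), binarize(w_dw, (1, 2, 3)), padding=1, groups=16)
z_bin = F.conv2d(binarize(z_dw, (1, 2, 3)), binarize(w_pw, (1, 2, 3)))

err_single = ((y_bin - y_fp).norm() / y_fp.norm()).item()
err_nested = ((z_bin - z_fp).norm() / z_fp.norm()).item()
print(f"relative error, single conv:    {err_single:.3f}")
print(f"relative error, separable conv: {err_nested:.3f}")
```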

To empirically investigate how the quantization error affects training for different layer types, we construct small networks formed by repeating each kind of convolutional layer three times, followed by three fully connected layers. We train these networks on CIFAR10 in the floating point and binary domains and summarize the results in Table 1.

Layer Type Conv Dil. Conv Sep. Conv
Kernel Size
FP Acc. (%)
Bin. Acc. (%)
Table 1: Test accuracy (%) of a small CNN composed of each layer type only, in the floating point (FP Acc.) and binary (Bin. Acc.) domains, on CIFAR10. Conv, Dil. Conv, and Sep. Conv refer to the convolution, dilated convolution, and separable convolution, respectively. Notice that the performance of all layer types is worse in the binary domain.

When binarized, both the convolution and dilated convolution layers show only a reasonable drop in accuracy, while the separable convolution layers show performance equivalent to random guessing (10% for CIFAR10). The observations in Table 1 confirm that the quantization error accumulated by the nested convolutions prevents binary networks from training. This also partly explains why the binarized DARTS architecture in Figure 2 does not train, as it has a large number of separable convolutions.

4.1.3 Zeroise

The Zeroise layer outputs all zeros irrespective of the input [27]. In the authors' implementation of [27] (https://github.com/quark0/darts), the Zeroise layers are removed when obtaining the final architecture. As this exclusion is not discussed in [27], we compare the accuracy with and without the Zeroise layer for DARTS in the DARTS column of Table 2 and empirically verify that the Zeroise layer is not particularly useful for floating point networks. We believe the reason is that it only reduces the network capacity.

Search Method DARTS BNAS
Zeroise Layer
Train Acc. (%) (-) (+)
Test Acc. (%) (-) (+)
Table 2: DARTS and BNAS with and without the Zeroise layers in the final architecture on CIFAR10. Zeroise Layer indicates whether the Zeroise layers were kept in the final architecture or not. When the Zeroise layers are included, the test accuracy of DARTS drops, the train accuracy drops, and training stagnates. In contrast, the Zeroise layer improves BNAS in both train and test accuracy.

However, we argue that the Zeroise layer can play an important role in binary networks and should be included. Particularly, we claim that the Zeroise layer can reduce quantization error in binary networks as shown in Figure 3. Empirically, we also observe meaningful gain in performance when we include the Zeroise layers in the final architecture (see BNAS column in Table 2).

Figure 3: An example when the Zeroise layer is beneficial in the binary domain. Since the floating point convolution is close to zero but the binarized convolution is far greater than 0, if the search selects the Zeroise layer instead of the convolution layer, the quantization error reduces significantly.
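The situation in Figure 3 can be illustrated with a toy example; the numbers below are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical 1-D "receptive field": the floating point response is near zero,
# but the signs of w and a mostly agree.
w = np.array([0.01, 0.01, 0.01, 1.00])
a = np.array([1.00, 1.00, 1.00, -0.02])

fp_out = float(w @ a)                                    # ~0.01, essentially zero

alpha = np.abs(w).mean()                                 # weight scaling factor
beta = np.abs(a).mean()                                  # activation scaling factor
bin_out = alpha * beta * float(np.sign(w) @ np.sign(a))  # sign-and-scale approximation, ~0.39

print(f"floating point output : {fp_out:+.3f}")
print(f"binary approximation  : {bin_out:+.3f}  (error {abs(bin_out - fp_out):.3f})")
print(f"Zeroise output        : {0.0:+.3f}  (error {abs(0.0 - fp_out):.3f})")
```

Here the binary approximation overshoots the near-zero floating point response, while the Zeroise layer's constant zero is the closer approximation.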

Including the Zeroise layer in the final architecture is particularly beneficial when situations similar to Figure 3 happen frequently, as the reduction in quantization error is then significant. This degree of benefit may differ from dataset to dataset. As the dataset used for the search may differ from the dataset used to train and evaluate the searched architecture, we propose to tune the probability of including the Zeroise layer. Specifically, we propose a generalized layer selection criterion that adjusts the probability of including the Zeroise layer by a transferability hyperparameter $\gamma$:

$$\text{select} \; \begin{cases} \text{Zeroise} & \text{if } \theta_z > \gamma \cdot \max_{p \neq z} \theta_p, \\ \arg\max_{p \neq z} \theta_p & \text{otherwise,} \end{cases} \qquad (4)$$

where $\theta_z$ is the architecture parameter corresponding to the Zeroise layer and $\theta_p$ are the architecture parameters corresponding to the layers other than Zeroise. A larger $\gamma$ encourages the search to pick the Zeroise layer only when it is substantially better than the other layers.
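A small sketch of this selection rule, under our reading of Eq. (4) and with hypothetical variable names and values, is given below; it assumes the architecture parameters are the (positive) softmaxed architecture weights of an edge.

```python
def select_layer(arch_params, zeroise_idx, gamma=1.0):
    """Pick the layer type for an edge from its architecture parameters.

    The Zeroise layer is kept only if its parameter exceeds gamma times the best
    non-Zeroise parameter; a larger gamma makes keeping Zeroise harder.
    (Sketch of our reading of Eq. (4); arch_params are assumed positive.)
    """
    theta_z = arch_params[zeroise_idx]
    others = {i: p for i, p in enumerate(arch_params) if i != zeroise_idx}
    best_idx = max(others, key=others.get)
    if theta_z > gamma * others[best_idx]:
        return zeroise_idx
    return best_idx

# Example: with gamma = 2 the Zeroise parameter (0.7) must exceed twice the best
# alternative (0.3); since 0.7 > 0.6, Zeroise (index 3) is kept.
print(select_layer([0.1, 0.3, 0.2, 0.7], zeroise_idx=3, gamma=2.0))
```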

With the separable convolutions and the Zeroise layer type considered, we summarize the defined set of layer types for BNAS in Table 3.

Layer Type Bin Conv. Bin Dil. Conv. MaxPool AvgPool Zeroise
Kernel Size N/A
Table 3: The final set of layer types for BNAS. Bin Conv, Bin Dil. Conv, MaxPool, and AvgPool refer to the binary convolution, binary dilated convolution, max pooling, and average pooling layers, respectively.

4.2 Inter-Cell Skip Connection in Cell Template

With the defined set of layer types, we learn a network architecture with the convolutional cell template proposed in [56]. However, the learned architecture suffers from a severe vanishing gradient problem in the binary domain, as shown in Figure 4.

Figure 4: Gradient vanishing problem when using intra-cell skip-connections only. Both the train and test accuracies collapse after 150 epochs, implying a high classification loss. The magnitudes of the gradients are proportional to the loss, but they are very small, especially in the early layers, indicating a gradient vanishing problem.

Investigating the reasons for the vanishing gradient, we observe the following. The skip-connections in the cell template proposed in [56] are confined to the inside of a single convolutional cell, i.e., they are intra-cell skip-connections. Intra-cell skip-connections do not propagate gradients outside the cell. To help convey information over multiple cells, we propose to add skip-connections between cells, as illustrated in Figure 5 (see the sketch following the figure). These inter-cell skip-connections help propagate gradients throughout the network. Note that since hand-crafted binary networks use the ResNet family as their backbone, they already have similar residual connections, as shown in Figure 6. We empirically validate the usefulness of the inter-cell skip connections in Section 5.5.


(a) DARTS Cell Template                (b) BNAS Cell Template

Figure 5: Proposed BNAS cell template with the inter-cell skip-connections. ConvCell indicates the convolutional cell. c_(k) indicates the output of the kth cell. Red lines in (b) indicate inter-cell skip connections.
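A minimal sketch of the modified template (our own simplification; the `conv_cell` interface is hypothetical) is shown below: in addition to the cell's usual inputs, the output of the previous cell is added directly to the current cell's output, forming the red inter-cell skip connections in Figure 5 (b).

```python
import torch
import torch.nn as nn

class BNASCellTemplate(nn.Module):
    """Sketch of the BNAS cell template with an inter-cell skip connection.

    `conv_cell` is the searched convolutional cell; like DARTS cells, it takes the
    outputs of the two previous cells, c_{k-2} and c_{k-1}, as inputs.
    """
    def __init__(self, conv_cell):
        super().__init__()
        self.conv_cell = conv_cell

    def forward(self, c_prev_prev, c_prev):
        out = self.conv_cell(c_prev_prev, c_prev)
        # Inter-cell skip connection: add the previous cell's output so gradients
        # can bypass the binary layers inside the cell (assumes matching shapes).
        return out + c_prev
```

Here the skip is a plain addition; where the spatial size or channel count changes (reduction cells), a projection on the skip path would be needed, which we omit for brevity.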

4.3 Diversifying Search for Binary Networks

After the binary cell based search space is defined, we form an over-parameterized DAG containing all layer types, similar to [27]. Unfortunately, in BNAS, the binary layers with learnable parameters (e.g., convolution layers) are seldom selected in the early stages of the search, because the layers that require no learning are more favorable to the search algorithm than the under-trained layers. This is more prominent in the binary domain because binary layers train more slowly than floating point ones [7]. To alleviate this, we propose to use an exponentially annealed entropy based regularizer during the search, which we call the diversity regularizer. Specifically, we subtract the entropy of the architecture parameter distribution from the search objective as:

$$\min_{w, \theta} \; \mathcal{L}_{S}(w, \theta) - \lambda H\big(p(\theta)\big) e^{-t/\tau}, \qquad (5)$$

where $\mathcal{L}_{S}$ is the search objective of [27] (cross-entropy), $w$ are the DAG's weight parameters, $\theta$ are the architecture parameters, $p(\theta)$ is the distribution over layer types induced by $\theta$, $H(\cdot)$ is the entropy, $\lambda$ is the scaling hyperparameter, $t$ is the epoch, and $\tau$ is the annealing hyperparameter. This encourages the architecture parameter distribution to be more uniform in the early stages, allowing the search to explore more diverse layer types.
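Eq. (5) can be written in a few lines; the sketch below is our interpretation, with hypothetical hyperparameter values.

```python
import math
import torch
import torch.nn.functional as F

def diversity_regularized_loss(search_loss, arch_params, epoch, lam=0.1, tau=10.0):
    """Search loss minus an exponentially annealed entropy term (sketch of Eq. (5)).

    Subtracting the entropy of the softmax over architecture parameters pushes the
    layer-type distribution toward uniform early on, encouraging exploration of all
    layer types; the exp(-epoch / tau) factor makes the effect fade as the search
    progresses. lam and tau are hypothetical values, not the paper's settings.
    """
    probs = F.softmax(arch_params, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    return search_loss - lam * math.exp(-epoch / tau) * entropy

# Usage inside the search loop, with cross-entropy as the base search objective:
# loss = diversity_regularized_loss(F.cross_entropy(logits, targets), alphas, epoch)
```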

Using the diversity regularizer, we observed a relative increase in the average number of learnable layer types picked in the early epochs of the search. We also empirically validate the benefit of diversity in search on the CIFAR10 dataset in Table 4. While the accuracy improvement from the diversity regularizer in floating point NAS methods such as DARTS [27] is marginal, the improvement in our binary networks is more meaningful.

Search Method DARTS BNAS
Diversity
Test Acc. (%) (+) (+)
Table 4: Effect of diversity on CIFAR10. Diversity refers to whether diversity regularization was applied during the search or not. DARTS gains only a marginal improvement in test accuracy, while BNAS gains a more substantial one.

5 Experiments

(We will publicly release the code and learned models very soon.)

5.1 Experimental Setup

Datasets.

We use the CIFAR10 [19] and ImageNet (ILSVRC 2012) [35] datasets to evaluate image classification accuracy. For searching binary networks, we use the CIFAR10 dataset. For training the final architectures from scratch, we use both CIFAR10 and ImageNet. During the search, we hold out half of the training data of CIFAR10 as a validation set to evaluate the quality of the search. For the final evaluation of the searched architecture, we train it from scratch using the full training set and report Top-1 (and, for ImageNet, Top-5) accuracy.

Search details.

We train a small network using SGD with the diversity regularizer (Section 4.3). We use momentum with an initial learning rate annealed by a cosine schedule [29] and weight decay. We use the same architecture hyperparameters as [27], except for the additional diversity regularizer with its scaling ($\lambda$) and annealing ($\tau$) hyperparameters. Our cell search takes approximately 10 hours on a single NVIDIA RTX 2080TI GPU.

Learning details.

For CIFAR10, we train the final networks using SGD with momentum and weight decay, along with the one cycle learning rate scheduler [38] with the learning rate ranging from $5\times10^{-2}$ to $4\times10^{-4}$. For ImageNet, we train the models using SGD with momentum and weight decay, along with the cosine restart scheduler [29].

Final architecture configurations.

We vary the size of our BNAS variants to compare with the other binary networks with different FLOPs and name them as BNAS-{Mini, A, B, C, D, E} as shown in Table 5.

BNAS- Mini A B C D E
# Cells
# Chn.
Dataset CIFAR10 ImageNet
Table 5: Configuration details of BNAS variants. # Cells: the number of stacked cells. # Chn.: the number of output channels in the first convolution layer of the model. $\gamma$ is the transferability hyperparameter in Eq. (4).
Comparison with other binary networks.

In the experiments on CIFAR10, we show the test accuracy for four different FLOPs budgets for binary networks. For XNOR-Net with different backbone architectures, we take the floating point architectures from a public source (https://github.com/kuangliu/pytorch-cifar) and apply the binarization scheme of XNOR-Net. In the experiments on ImageNet, we take the test accuracy of the various binary networks from their respective papers and show test accuracy for two different FLOPs budgets. Similar to how ABC-Net with a single base is used for comparison in previous work [28, 10], we compare against PCNN with a single projection kernel for both CIFAR10 and ImageNet.

5.2 Comparing the Searched and Handcrafted Cell

We qualitatively compare our searched cell with the XNOR-Net cell that is based on the ResNet18 architecture in Figure 6. As shown in the figure, our searched cell has a contrasting structure to the handcrafted ResNet18 architecture. Both cells contain only two 3×3 binary convolution layer types, but the extra Zeroise layer types selected by our search algorithm help in reducing the quantization error. The topology in which the Zeroise layer types and convolution layer types are connected also contributes to improving the classification performance of our searched cell. In the following subsections, we show that our searched topology yields better binary networks that outperform current handcrafted binary networks. More qualitative comparisons of our searched cell can be found in the supplement.


(a) ResNet cell from XNOR-Net          (b) BNAS cell

Figure 6: Comparing the BNAS cell to the XNOR-Net (ResNet) cell. c_(k) denotes the output of the kth cell. The dotted lines represent the connections from the second previous cell (c_(k-2)). The red lines correspond to skip connections.

5.3 Results on CIFAR10

We quantitatively compare our BNAS models to existing binary networks with various FLOPs on the CIFAR10 dataset and summarize the results in Table 6. Note that the order of performance in floating point does not always translate to binary networks, as binarized DenseNet and SENet perform worse than ResNet18. Our architectures searched by the proposed BNAS method outperform all the handcrafted architectures with the XNOR-Net binarization scheme at similar or fewer FLOPs. Our BNAS variants also outperform the state-of-the-art binary networks with the same FLOPs, except PCNN in one FLOPs budget. The ‘Projection’ binarization scheme used in PCNN has an advantage over the ‘Sign + Scale’ binarization scheme in that FLOPs budget for CIFAR10.

It is interesting to note that BNAS-Mini outperforms all the other networks in its FLOPs budget with far fewer FLOPs. Note that both PCNN and BNAS-C approach the floating point ResNet18's test accuracy, with only a small difference.

FLOPs () Model (Backbone) Binarization Test Acc (%)
PCNN () (WRN22) Projection
BNAS-Mini Sign + Scale
XNOR-Net (ResNet18) Sign + Scale
XNOR-Net (DenseNet) Sign + Scale
XNOR-Net (NiN) Sign + Scale
XNOR-Net (SENet) Sign + Scale
BinaryNet (ResNet18) Sign
BNAS-A Sign + Scale
XNOR-Net (ResNet34) Sign + Scale
XNOR-Net (WRN40) Sign + Scale
PCNN () (WRN22) Projection
BNAS-B Sign + Scale
XNOR-Net (ResNext29-64) Sign + Scale
BNAS-C Sign + Scale
ResNet18 (FP) N/A
Table 6: Classification accuracy on CIFAR10 with various FLOPs budgets. The parentheses after various models indicate the backbone architecture. The binarization schemes are: ‘Sign’: binarizing the floating point weights and activations by a sign function. ‘Sign + Scale’: using fixed scaling factor and the sign function to binarize. ‘Projection’: projection convolutions[10].

5.4 Results on ImageNet

FLOPs () Model Binarization Searched Pretraining Top-1 Acc. (%) Top-5 Acc. (%)
BinaryNet [7] Sign
ABC-Net [22] Clip + Scale
XNOR-Net [34] Sign + Scale
BNAS-D Sign + Scale
Bi-Real [28] Sign + Scale
XNOR-Net++ [2] Sign + Scale*
PCNN [10] Projection
BNAS-E Sign + Scale
ResNet18 (FP) N/A
Table 7: Classification accuracy (Top-1 and Top-5) on ImageNet with various FLOPs budgets. The backbone architecture for other binary networks is ResNet18. The binarization schemes are: ‘Sign’: binarizing the floating point weights and activations by a sign function. ‘Clip + Scale’: binarizing using clip function with shift parameter, ‘Sign + Scale’: using fixed scaling factor and the sign function to binarize [22]. ‘Sign + Scale*’ : using learned scaling factor and the sign function to binarize [2], ‘Projection’: projection convolutions[10]. Note: our models outperform other binary networks in both FLOPs budgets even without any pretraining or recent binarization schemes.

We now quantitatively compare our searched binary networks to state-of-the-art binary networks on the ImageNet dataset in Table 7. As XNOR-Net proposed not to binarize the first convolution and the last fully-connected layer [34], a convention which subsequent binary networks [28, 22, 10, 2] have followed, we also follow it for ease of comparison. Note that Bi-Real, PCNN, and XNOR-Net++ further unbinarize the downsampling convolutions in the ResNet18 architecture, which increases the FLOPs. We have confirmed with the authors of XNOR-Net that their reported accuracy was obtained without unbinarizing the downsampling convolutions, which makes XNOR-Net differ in FLOPs from Bi-Real, PCNN, and XNOR-Net++. We also do not unbinarize the downsampling convolutions in our binary networks.

To compare with XNOR-Net, ABC-Net, and BinaryNet, we train BNAS-D, which has a comparable FLOPs budget. As shown in Table 7, with the same FLOPs, BNAS-D outperforms XNOR-Net by a large margin in both top-1 and top-5 accuracy, and outperforms BinaryNet and ABC-Net by even larger margins. Note that BNAS-D shares the same binarization scheme as XNOR-Net but has noticeably higher test accuracy, highlighting the effectiveness of the searched global topology of our architecture.

To compare with Bi-Real, XNOR-Net++, and PCNN, we train BNAS-E, which has a comparable FLOPs budget. Note that Bi-Real and PCNN use floating point or binary weight pretraining schemes, while XNOR-Net++ and BNAS-E do not use any pretraining. Even without pretraining, BNAS-E outperforms Bi-Real, XNOR-Net++, and PCNN in both top-1 and top-5 accuracy. With the same binarization scheme (comparing Bi-Real with BNAS-E), our BNAS-E outperforms the hand-crafted ResNet18 backbone in both top-1 and top-5 accuracy. Interestingly, even without applying the recent binarization schemes of XNOR-Net++ and PCNN, our BNAS-E still outperforms both of them. We expect further improvements from applying the binarization schemes proposed in XNOR-Net++ and PCNN.

5.5 Ablation Studies

We perform ablation studies on various components of our full method. We use the CIFAR10 dataset for the experiments with various FLOPs budgets and summarize the results in Table 8.

Comparing No Div with Full, the searched cell with the diversity regularizer clearly outperforms the searched cell without the diversity regularizer for all the model variants. Removal of inter-cell skip-connections (No Skip) severely degrades the performance of our binary networks. It is interesting to note that for all three BNAS variants (BNAS-A, B, and C), the models eventually collapsed to very low training and test accuracy and exhibited gradient vanishing issues for the No Skip setting. Comparing No Zeroise with Full, our searched architectures with the Zeroise layers achieve better test accuracy on CIFAR10. Interestingly, the largest model (BNAS-C) without Zeroise layers performs worse than BNAS-A and BNAS-B, due to excess complexity. Please refer to the supplement for more detailed analysis on our ablated models.

Model Full No Skip No Zeroise No Div
BNAS-A
BNAS-B
BNAS-C
Table 8: Classification accuracy (%) of ablated models on CIFAR10. Full refers to our proposed method. No Skip refers to our method without the inter-cell skip connections. No Zeroise refers to our method with explicitly discarding the Zeroise layers. No Div refers to our method without the diversity regularizer. We report the test accuracy of BNAS-A, BNAS-B, and BNAS-C.

6 Conclusion

To design better performing binary networks, we propose a method to obtain binary networks by searching, called BNAS. BNAS searches for a cell that can be stacked to generate networks for various computational budgets. To configure the BNAS search space, we define a new set of binary layer types and modify the existing cell template. In addition, we propose to use the Zeroise layer type in the final network, whereas existing floating point NAS methods have only used it during the search, and show that it improves the performance of binary networks. Moreover, we propose to diversify the selection of layers in the early stages of the search and show that it helps in obtaining better binary architectures. The learned BNAS networks outperform other hand-crafted binary networks with similar or lower computational cost on both CIFAR10 and ImageNet.

Acknowledgement

The authors would like to thank Dr. Mohammad Rastegari at XNOR.AI for valuable comments and training details of XNOR-Net and Dr. Chunlei Liu and other authors of [24] for sharing the code of [24].

References

Appendix

We show more qualitative comparisons, additional ablations, computational savings of our method, detailed reviews on related works, and brief remarks on quantized networks.

Appendix A More Qualitative Comparisons of Our Searched Cell

We present more qualitative comparisons of our searched cell with the binarized DARTS [27] cell in addition to Section 5.2 where we compare it with the hand-crafted XNOR-Net cell. In Figure 7, we compare the normal cell of BNAS with the normal cell of DARTS[27]. Our cell has inter-cell skip connections which facilitate gradient propagation amongst multiple cells, making it less prone to gradient vanishing issues, whereas the binarized DARTS cell does not train at all (achieving only 10.01% test accuracy on CIFAR10 in Table 11). Other than the excessive number of separable convolutions in the DARTS searched cell (Table 11), the lack of inter-cell skip connections in their cell template may also contribute to the failure of its architecture in the binary domain.

Figure 7: Comparing the normal cell of BNAS (a) and the normal cell of binarized DARTS (b). c_(k) indicates the output of the cell. The dotted lines represent the connections from the second previous cell (c_(k-2)). Red lines in (a) indicate the inter-cell skip connections. Note that the searched cell of binarized DARTS in (b) has only intra-cell skip connections (denoted by pink boxes), which are not as effective for gradient propagation among multiple cells as the inter-cell skip connections in (a) (see the discussion in Section 4.2).

We also qualitatively compare the BNAS reduction cell to the binarized DARTS reduction cell in Figure 8. Note that the BNAS reduction cell has many Zeroise layers, which help reduce the quantization error.

Figure 8: Comparing the reduction cell of BNAS (a) and the reduction cell of binarized DARTS (b). c_(k) indicates the output of the cell. The dotted lines represent the connections from the second previous cell (c_(k-2)). Red lines indicate the inter-cell skip connections. The intra-cell skip connections are denoted by the pink boxes. Interestingly, the BNAS reduction cell only uses the output from the second previous cell (c_(k-2)) as inputs to the max pool layers.

Appendix B More Analyses on the Ablated Models

B.1 ‘No Skip’ Setting

Our BNAS-A, B, and C variants show vanishing gradient problems without the proposed inter-cell skip connections (refer to the No Skip setting in Table 8). Besides the final classification accuracy, we also show the train and test accuracy curves for the No Skip ablation models in Figure 9 for a more detailed analysis. All three variants collapse to very low training and test accuracy after a reasonable number of epochs (600).

Figure 9: Learning curves for the ‘No Skip’ ablation. The train (a) and test (b) accuracies of all three models collapse when trained for 600 epochs. Additionally, the test accuracy curves fluctuate heavily compared to the train accuracy curves.

We also observe that the ablated models without the inter-cell skip connections have extremely small gradients for most of the early layers relative to that of later layers even though the classification loss is very high, indicating a gradient vanishing problem (Figure 10).


Figure 10: Gradient vanishing problem in the ‘No Skip’ ablation. We show the sum of gradient magnitudes for convolution layers in all three models for the No Skip setting. As it can be seen in Figure 9, the models exhibit high loss but the gradients in the early layers are very small relative to that of the later layers. As is the case with Figure 4 in the main paper, the three model variants show gradient vanishing problems.

B.2 ‘No Zeroise’ Setting

Using the Zeroise layers in the final architecture has additional benefits apart from reducing the quantization error. In theory, converting a floating point layer to a binary layer yields large savings in memory and inference time [34, 28], whereas converting a floating point layer to a Zeroise layer removes that layer's memory and computation entirely, as it requires no learnable parameters and no computation. We summarize the memory savings, FLOPs, and inference speed-up of our BNAS-A model in Table 9 by comparing BNAS-A with and without the Zeroise layers. With the Zeroise layers, not only does the accuracy increase, but we also observe a significant increase in both memory savings and inference speed-up.

BNAS-A w/o Zeroise w/ Zeroise
# Cells/# Chn. / /
Memory Savings
FLOPS ()
Inference Speed-up
Test. Acc. (%)
Table 9: Comparing our searched binary networks with and without the Zeroise layer. Test Acc. (%) indicates the test accuracy on CIFAR10. The models shown here have the exact same configuration except for the Zeroise layers, and all the other differences are consequences of including or excluding the Zeroise layers. Note that the memory savings and inference speed-up were calculated with respect to the floating point version of BNAS-A (see Section D) without Zeroise, since Zeroise layers are not used in the floating point domain.

B.3 Additional Ablation Study - No Dilated Convolution Layers

To factor out any performance improvements we might gain by using more advanced layers, e.g. the dilated convolution layer type, we further ablate our model by removing it from our set of layer types. This ensures that we use the same convolution layer types that were used in the ResNet18 architecture. We qualitatively compare the searched cells with and without the dilated convolution layer types in Figure 11. We then quantitatively compare XNOR-Net and BNAS-A* in Table 10. Our search algorithm still discovers a better architecture given the same binarization scheme and the same set of convolution layer types used, indicating that the performance gain we observe is mostly from the search method and the Zeroise layers.

Figure 11: Comparing the cells searched with and without binary dilated convolutions in the set of layer types. c_(k) indicates the output of the cell. The dotted lines represent the connections from the second previous cell (c_(k-2)). Red lines indicate the inter-cell skip connections.
FLOPs () Model Dil. Conv. Binarization Test Acc. (%)
XNOR-Net (ResNet18) Sign + Scale
BNAS-A* Sign + Scale
BNAS-A Sign + Scale
Table 10: No dilated convolutions in the set of layer types. Test Acc. (%) indicates the test accuracy on CIFAR10. The parentheses indicate the backbone architecture. The binarization schemes are: ‘Sign + Scale’: using fixed scaling factor and the sign function to binarize. BNAS-A* indicates the cell searched without dilated convolutions. Our model that was searched without dilated convolutions still outperforms XNOR-Net (ResNet18) by a fair margin.

Appendix C Additional Discussions on Separable Convolution in the Set of Layer Types

In Section 3 of the main paper, we identified two issues behind the failure of binarized DARTS, SNAS, and GDAS: first, the accumulation of quantization error due to separable convolutions, and second, the lack of inter-cell skip connections, which makes propagating gradients across multiple cells difficult. For the first issue (i.e., that using separable convolutions accumulates quantization error repetitively), we proposed to exclude the separable convolutions from the set of layer types entirely. Nonetheless, we investigate the effect of keeping the separable convolutions in the set of layer types. We summarize the results in Table 11, along with the proportion of separable convolution layers selected in the searched architecture and the respective classification accuracy.

Since DARTS, SNAS, and GDAS search in the floating point domain, their search methods do not take quantization error into account and thus result in cells that have a relatively high percentage of separable convolutions. In contrast, we search directly in the binary domain, which enables our search method to identify that separable convolutions have high quantization error and hence obtain a cell that contains only one separable convolution (note that the proportion of separable convolutions is 12.5% for BNAS while for the others it is more than 36%). Note that explicitly excluding the separable convolutions from the set of layer types does result in better performing binary architectures, motivating us to remove them entirely from the set of layer types for our method. The discussion of the failure of separable convolutions can be found in Section 4.1.2.

Method Test Accuracy (%) Proportion of Sep. Conv. (%)
DARTS + Binarized
SNAS + Binarized
GDAS + Binarized
BNAS-A
BNAS-A
Table 11: Effect of separable convolutions in the set of layer types. Test Accuracy (%) refers to the test accuracy on CIFAR10. Proportion of Sep. Conv. refers to the percentage of separable convolutions in the searched convolutional cells. BNAS-A was searched once with separable convolutions included in the set of layer types and once without them (our final setting). Note that the same binarization scheme (XNOR-Net) was applied in all cases. The models with a higher proportion of separable convolutions show lower test accuracy. For the learning curves of DARTS + Binarized, SNAS + Binarized, GDAS + Binarized, and BNAS-A, please refer to Figure 2.

Appendix D Memory Saving and Inference Speed-up

We additionally discuss the memory savings and inference speed-up of our searched binary networks. In contrast to other binary networks that use ResNet18 as their base architecture for classification on ImageNet, our searched architecture does not have a base architecture by definition. While other binary networks report memory savings and inference speed-up with respect to their base architectures, we compute the memory savings and inference speed-up of our searched binary networks by comparing them to their floating point counterparts. We follow [28] in computing the memory savings and inference speed-up with respect to the floating point counterparts of the ImageNet models, and report the resulting figures for BNAS-D and BNAS-E.
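A sketch of how such memory savings could be computed, under our assumption of the usual convention (1 bit per parameter for binarized layers, 32 bits for the layers kept in floating point), is given below; the layer list and parameter counts are hypothetical.

```python
def memory_saving(layers):
    """Ratio of floating point model size to binary model size.

    `layers` is a list of (num_params, is_binarized) pairs; binarized layers are
    counted at 1 bit per parameter, the rest (e.g. the first convolution and the
    final fully connected layer) at 32 bits. Hypothetical example values below.
    """
    fp_bits = sum(n * 32 for n, _ in layers)
    bin_bits = sum(n * (1 if is_bin else 32) for n, is_bin in layers)
    return fp_bits / bin_bits

layers = [
    (9_408, False),     # stem convolution kept in floating point (hypothetical count)
    (1_100_000, True),  # binarized cells (hypothetical count)
    (5_130, False),     # final classifier kept in floating point (hypothetical count)
]
print(f"memory saving: {memory_saving(layers):.1f}x")
```

An analogous ratio over per-layer operation counts, with binary convolutions discounted by the XNOR/bit-count speed-up, would give the inference speed-up estimate.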

Appendix E More Details About [33] and [24] in Literature Review

E.1 Detailed Discussion About [33]: MobiNet

MobiNet [33] modifies the separable convolution by adding skip connections across the depth-wise and point-wise convolutions and an extra mid-block module to binarize the MobileNet-V1 architecture. The authors claim that their modification alleviates gradient vanishing issues (Section 3.1 in [33]). To investigate whether the failure of separable convolutions is due to the gradient vanishing problem, we train a three-layer CNN with the separable convolution and with the modified separable convolution for binarized networks from [33] (denoted ‘MobiNet Sep. Conv.’), and summarize the results in Table 12. The chances of gradient vanishing are very low in such a shallow network.

Layer Type Sep. Conv MobiNet Sep. Conv
Kernel Size
FP Acc. (%) N/A N/A
Bin. Acc. (%)
Table 12: Test accuracy (%) of a small CNN composed of each layer type only, in the floating point (FP Acc.) and binary (Bin. Acc.) domains, on CIFAR10. Sep. Conv and MobiNet Sep. Conv refer to the separable convolution and the MobiNet separable convolution layers, respectively. Both the modified separable convolutions of MobiNet and the binarized separable convolutions do not train at all, i.e., they stay at chance-level test accuracy after a reasonable number of epochs (50). Note that MobiNet's proposed layer is purely for the binary domain, hence the omission of its floating point accuracy.

Both the binarized separable convolutions and the modified separable convolution of MobiNet still do not train at all (note that random guessing is 10% in 10-way classification on CIFAR10), even in a three-layer CNN. This implies that the failure in training may not be attributed to the gradient vanishing problem. We argue that the failure comes from the accumulated quantization error in both separable convolution layer types (Section 4.1.2 in the main paper). As MobiNet's separable convolutional layer does not train at all, we do not include it in our set of layer types for the search.

E.2 Detailed Discussion About [24]: CBCN

The idea of CBCN is to use circulant filters, which are rotated versions of the original convolution kernels, to make binary networks rotation invariant and improve performance [24]. The authors reported ImageNet top-1 and top-5 accuracy using four circulant filters in place of one original convolution kernel. The main drawback of this idea is that using more than one circulant filter incurs an increase in FLOPs, as the number of convolution operations required increases linearly with the number of circulant filters used. However, [24] report neither the exact architecture of their binary network nor the FLOPs of the resulting architecture with circulant filters. This makes comparison with other binary networks in terms of FLOPs unfair, as CBCN with four circulant filters may have more FLOPs than other binary networks based on the same ResNet18 architecture. Thus, we skip the comparison with it in our main paper to maintain fairness in terms of FLOPs, as we do not know the exact FLOPs budget of CBCN.

However, using our best guess, we estimate the minimum and maximum FLOPs of CBCN on the architectures used for their ImageNet experiments (the maximum estimate was acquired from checking the authors' code for [24]; the minimum estimate was acquired from Figure 6 and Table 5 in [24]), scale our models to the respective FLOPs budgets (BNAS-F and BNAS-G), and report the classification accuracy in Table 13.

FLOPs () Model Searched Pretraining Top-1 (%) Top-5 (%)
BNAS-F
() CBCN
BNAS-G
Table 13: Estimated FLOPs range for CBCN. We scale our model to the minimum and maximum limits of the possible FLOPs range for CBCN and show the top-1 and top-5 accuracy on ImageNet. The configuration of BNAS-F is # Cells: 12 and # Chn.: 84; the configuration of BNAS-G is # Cells: 16 and # Chn.: 128. Please refer to Table 5 in the main paper for more details. Note that () indicates the estimated FLOPs range of CBCN, not a specific FLOPs budget.

For the maximum FLOPs estimate, we multiply the FLOPs of Bi-Real Net [28] (roughly the same FLOPs as CBCN when only one circulant filter is used) by four, since most of the FLOPs in a CNN come from convolution layers and the FLOPs of the convolution layers increase linearly with the number of circulant filters. We calculate the minimum estimate by not using circulant filters for most of the convolution layers that are more expensive in terms of FLOPs, such as the first convolution layer or the downsampling convolutions in the ResNet18 architecture, since it may be the case that CBCN does not use circulant filters for all convolution layers.

Even without pretraining, our BNAS-F performs close to CBCN, and BNAS-G performs better than CBCN by a fair margin in both top-1 and top-5 accuracy. If we assume CBCN's FLOPs to be between those of BNAS-F and BNAS-G, we can argue that our method performs at least on par with CBCN, given the accuracy of BNAS-F and BNAS-G and the lack of pretraining.

Appendix F Remarks on Quantized but ‘Non 1-bit’ (not fully binary) CNNs

Quantized networks that incorporate search are a type of efficient network that is not directly comparable to 1-bit CNNs. The reason they are not comparable is that they cannot utilize XNOR and bit-counting operations at inference, which significantly reduces their memory savings and inference speed-up. Nevertheless, it is interesting to review this line of work on efficient networks with higher resource consumption, especially the recent ones. Notably, [5, 41, 44, 30] all search for multi-bit quantization policies, and only [5] searches for network architectures as well. [4] also search for network architectures for binary-weight (not 1-bit) CNNs. Moreover, [41, 44, 30] search for quantization policies, not network architectures, further differentiating them from our method.