MeliusNet: Can Binary Neural Networks Achieve MobileNet-level Accuracy?

01/16/2020, by Joseph Bethge et al., Hasso Plattner Institute

Binary Neural Networks (BNNs) are neural networks which use binary weights and activations instead of the typical 32-bit floating point values. They have reduced model sizes and allow for efficient inference on mobile or embedded devices with limited power and computational resources. However, the binarization of weights and activations leads to feature maps of lower quality and lower capacity and thus to a drop in accuracy compared to traditional networks. Previous work has increased the number of channels or used multiple binary bases to alleviate these problems. In this paper, we instead present MeliusNet, consisting of two alternating block designs which consecutively increase the number of features and then improve the quality of these features. In addition, we propose a redesign of those layers that use 32-bit values in previous approaches to reduce the required number of operations. Experiments on the ImageNet dataset demonstrate the superior performance of our MeliusNet over a variety of popular binary architectures with regard to both computation savings and accuracy. Furthermore, with our method we trained BNN models which, for the first time, can match the accuracy of the popular compact network MobileNet in terms of model size, number of operations, and accuracy. Our code is published online: https://github.com/hpi-xnor/BMXNet-v2


1 Introduction

The success of deep convolutional neural networks in a variety of machine learning tasks, such as image classification [14, 22], object detection [28, 29], text recognition [20], and image generation [1, 8], has led to the design of deeper, larger, and more sophisticated neural networks. However, the large size and high number of operations of these accurate models severely limit their applicability on resource-constrained platforms, such as mobile or embedded devices. Many existing works aim to solve this problem by reducing memory requirements and accelerating inference. These approaches can be roughly divided into a few research directions: network pruning techniques [12, 13], compact network designs [15, 16, 19, 30, 34], and low-bit quantization [5, 27, 35], wherein the full-precision 32-bit floating point weights (and in some cases also the activations) are replaced with lower-bit representations, e.g., 8 bits or 4 bits. The extreme case, Binary Neural Networks (BNNs), was introduced by [18, 27] and uses only 1 bit for weights and activations.

It was shown in previous work that the BNN approach is especially promising, since a binary convolution can be sped up by a factor of more than 50 while using less than 1% of the energy of a 32-bit convolution on FPGAs and ASICs [26]. This speed-up can be achieved by replacing the multiplications (and additions) in matrix multiplications with bit-wise xnor and bitcount operations [26, 27], processing up to 64 values in one operation. However, BNNs still suffer from accuracy degradation compared to their full-precision counterparts [10, 27]. To alleviate this issue, there has been work to approximate full-precision accuracy by using multiple weight bases [23, 36] or by increasing the number of channels in feature maps [26, 31]. However, these approaches come with an increase in both computational cost and model size. We briefly review the related work in more detail in Section 2.
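The arithmetic behind the xnor/bitcount speed-up mentioned above can be illustrated with a small toy example. The following sketch is ours and only meant for illustration: real implementations such as BMXNet pack 64 values per machine word and use vectorized CPU instructions, whereas here a handful of values are packed into a plain Python integer.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors with entries in {-1, +1}, each
    packed into the lowest n bits of an integer (bit 1 = +1, bit 0 = -1).
    With this encoding, dot(a, b) = 2 * popcount(xnor(a, b)) - n, since
    xnor marks the positions where the two signs agree."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask         # bit-wise xnor
    return 2 * bin(agree).count("1") - n      # bitcount (popcount)

# a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]  ->  dot product = 0
print(binary_dot(0b1011, 0b1101, 4))  # 0
```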

Prior work has mostly used full-precision architectures, e.g., AlexNet [22] and ResNet [14], without specific adaptations for BNNs. To the best of our knowledge, only two works are exceptions: Liu et al. added additional residual shortcuts to the ResNet architecture [25] and Bethge et al. adapted a DenseNet architecture with dense shortcuts for BNNs [4]. Both approaches seem to be beneficial for BNNs, but we presume for different reasons: the former improves the quality of the features, while the latter increases their capacity. We combined these aspects and developed MeliusNet, which increases both the quality and the capacity of features throughout the network (see Section 3).

Previous work also showed a large gap between the compact network structure MobileNet [16] and BNNs. Even approaches with multiple binary bases [23, 36] have so far not been able to reach similar accuracy based on the same computational budget. We identify that this is mainly due to a few layers in previous BNNs which use 32-bit instead of 1-bit values. To solve this issue, we propose a change to these layers, using multiple grouped convolutions to save operations and improve the accuracy at the same time (see Section 3.2).

We evaluated MeliusNet on the ImageNet [6] dataset and compare it with the state-of-the-art (see Section 4). To confirm the effectiveness of our methods, we also provide extensive ablation experiments. During this study, we found that our training process with Adam [21] achieves much better results than reported in previous work. To allow for a fair comparison, we also trained the original (unchanged) networks and clearly separated the accuracy gains between the different factors (also within Section 4). Finally, we conclude our work in Section 5.

Summarized, our main contributions in this work are:

  • A novel BNN architecture which counters the lower quality and lower capacity of binary feature maps efficiently.

  • A novel initial set of grouped convolution layers for all binary networks.

  • The first BNN that matches the accuracy of MobileNet 0.5, 0.75, and 1.0.

2 Related Work

Alternatives to binarization, such as compact network structures [15, 16, 19, 30, 34] and quantized approaches [5, 27, 35], have been introduced. In this section, we take a more detailed look at approaches that use BNNs with 1-bit weights and 1-bit activations. These networks were originally introduced by Courbariaux et al. [18] with Binarized Neural Networks and improved by Rastegari et al., who used channel-wise scaling factors to reduce the quantization error in their XNOR-Net [27]. The following works tried, with different techniques, to further improve the network accuracy, which was much lower than that of common 32-bit networks:

WRPN [26] and Shen et al. [31] increased the number of channels for better performance. Their work only increases the number of channels in the convolutions and the feature maps, but does not change the architecture.

Another way to increase the accuracy of BNNs was presented by ABC-Net [23] and GroupNet [36]. Instead of using a single binary convolution, they use a set of binary convolutions to approximate a 32-bit convolution (the size of this set is sometimes called the number of binary bases). This achieves higher accuracy, but increases the required memory and number of operations of each convolution by approximately that factor. These approaches optimize the network within each building block.

The two approaches most similar to our work are Bi-RealNet [25] and BinaryDenseNet [4]. They use only a single binary convolution, but adapt the network architecture compared to full-precision networks to improve the accuracy of a BNN. However, they did not test whether their proposed architecture changes are specific to BNNs or whether they would improve a 32-bit network as well.

3 MeliusNet

The motivation for MeliusNet comes from the two main disadvantages of using binary values instead of 32-bit values for weights and inputs.

On the one hand, the number of possible values per weight is reduced from up to 2^32 to only 2. This leads to a certain quantization error, which is the difference between the result of a regular 32-bit convolution and a 1-bit convolution. This error reduces the quality of the features computed by binary convolutions compared to 32-bit convolutions.

On the other hand, the value range of the inputs (for the following layer) is reduced by the same factor. This also leads to a huge reduction in the available capacity of features, since fine-grained differences between values, as in 32-bit floating point values, can no longer exist.
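As a toy illustration of these two effects (our own example, not taken from the paper), binarizing a weight vector introduces a non-zero quantization error, while a binarized feature map can only hold two distinct values per element:

```python
import torch

# Quality: binarization introduces a quantization error.
w = torch.randn(64)                # 32-bit weights
w_bin = torch.sign(w)              # binary weights in {-1, +1}
print(torch.norm(w - w_bin))       # non-zero quantization error

# Capacity: a binarized activation map only holds two distinct values per
# element, instead of up to 2**32 different 32-bit patterns.
a = torch.randn(1, 256, 8, 8)
print(torch.sign(a).unique())      # typically tensor([-1., 1.])
```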

In the following section, we describe how MeliusNet increases the quality and capacity of features efficiently. Afterwards, we describe how the number of operations in the remaining 32-bit layers of a binary network can be reduced. Finally, we show the implementation details of our BNN layers.

Figure 1: Building block of MeliusNet (c denotes the number of channels in the feature map). We first increase the feature capacity by concatenating 64 newly computed channels to the feature map (yellow area) in the Dense Block. Then, we improve the quality of those newly added channels with a residual connection (green area) in the Improvement Block. The result is a balanced increase of capacity and quality.

3.1 Improving Quality and Capacity

The core building block of MeliusNet consists of a Dense Block followed by an Improvement Block (see Figure 1). The Dense Block increases feature capacity, whereas the Improvement Block increases feature quality.

The Dense Block is the only building block of a BinaryDenseNet [4], which is a binary variant of the DenseNet architecture [17]. It consists of a binary convolution which derives 64 channels of new features from the input feature map (which has, for example, 256 channels). These features are concatenated to the feature map itself, resulting in 320 channels afterwards and thus increasing the feature capacity.

The Improvement Block increases the quality of these newly concatenated channels. It uses a binary convolution to again compute 64 channels, this time based on the input feature map of 320 channels. These 64 output channels are added to the previously computed 64 channels through a residual connection, without changing the first 256 channels of the feature map (see Figure 1). Thus, this addition improves the last 64 channels, leading to the name of our network (melius is Latin for improvement). With this approach, each section of the feature map is improved exactly once.
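A minimal sketch of one such block pair may make the data flow clearer. It is written in PyTorch-style pseudocode purely for illustration (the actual implementation is part of BMXNet); `BinaryConv2d` is a placeholder, and the 3×3 kernel size and the BatchNorm placement are assumptions on our side.

```python
import torch
import torch.nn as nn

class BinaryConv2d(nn.Conv2d):
    """Placeholder: a real binary convolution applies sign() to its weights
    and inputs with a straight-through estimator (see Section 3.3)."""
    pass

class MeliusBlock(nn.Module):
    """One Dense Block + Improvement Block pair (illustrative sketch)."""
    def __init__(self, in_channels, growth=64):
        super().__init__()
        self.growth = growth
        # Dense Block: compute 64 new channels and concatenate them.
        self.dense = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            BinaryConv2d(in_channels, growth, 3, padding=1, bias=False),
        )
        # Improvement Block: compute 64 channels from the widened feature
        # map and add them to the newly created channels only.
        self.improve = nn.Sequential(
            nn.BatchNorm2d(in_channels + growth),
            BinaryConv2d(in_channels + growth, growth, 3, padding=1, bias=False),
        )

    def forward(self, x):
        x = torch.cat([x, self.dense(x)], dim=1)   # capacity: c -> c + 64
        delta = self.improve(x)                    # quality update
        improved = x[:, -self.growth:] + delta     # residual on the last 64 channels
        return torch.cat([x[:, :-self.growth], improved], dim=1)
```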

Note that we could also use a residual connection to improve the whole feature map instead of using the proposed Improvement Block. However, with this naive approach, the number of times each section of the feature map is improved would be highly skewed towards the initially computed features. It would further incur a much higher number of operations, since the number of output channels needs to match the number of channels in the feature map. With the proposed Improvement Block, we can instead save computations and get a feature map with balanced quality improvements (the supplementary material contains some experiment data to compare the naive approach and MeliusNet).

As stated earlier, alternating between a Dense Block and an Improvement Block forms the core part of the network. Depending on how often the combination of both blocks is repeated, we can create models of different sizes and with a different number of operations. Our network progresses through similar stages as Bi-RealNet and BinaryDenseNet, with transition layers in between, which halve the height and width of the feature map with a MaxPool layer. Furthermore, the number of channels is also roughly halved in the downsampling convolution during the transition (see Table 1 for the exact factors). We show an example in Figure 2, where we repeat the blocks 4, 5, 4, and 4 times between transition layers and achieve a model which is similar to Bi-RealNet18 in terms of model size.
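For illustration, a transition between two stages could look as follows. This is a sketch under our own assumptions: the text only states that a MaxPool halves the spatial size and that a full-precision downsampling convolution roughly halves the channels; the 1×1 kernel size and the exact layer order are not taken from the paper.

```python
import torch.nn as nn

def transition(in_channels, out_channels, groups=1):
    """Transition between stages: halve height and width with a MaxPool and
    reduce the channel count with a (full-precision) downsampling
    convolution. Kernel size and layer order are illustrative choices."""
    return nn.Sequential(
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.BatchNorm2d(in_channels),
        nn.Conv2d(in_channels, out_channels, kernel_size=1,
                  groups=groups, bias=False),
    )
```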

Figure 2: A depiction of our MeliusNet22 with a configuration of 4-5-4-4 blocks between transitions. Details of the Dense Block and the Improvement Block can be seen in Figure 1.

3.2 Layers with 32-bit Convolutions

We follow previous work and do not binarize the first convolution, the final fully connected layer, and the ("downsampling") convolutions in the network to preserve accuracy [4, 25, 36]. However, since these layers contribute a large share of the operations, we propose a redesign of the first layers. (In the following, we use the accuracy and number of operations of the respective architectures on the ImageNet classification task [6].)

We compared previous BNNs to the compact network architecture MobileNet 0.5 [16], which reaches 63.7% accuracy on ImageNet with a comparatively low number of operations. We found that the closest BNN result (regarding model size and operations) is Bi-RealNet34, which achieves lower accuracy (62.2%) at a similar model size while also needing more operations. We presume that, because of this difference, compact model architectures are more popular for practical applications than BNNs, especially with more recent (and improved) compact networks appearing [15, 30]. To find a way to close this gap, we analyze the required number of operations in the following.

As described in Section 3.3, previous work [4, 25, 36] did not binarize the first convolutional layer, the final fully-connected layer, and the downsampling convolutions to prevent a large accuracy drop. Even though we agree with this decision, it also leads to a high number of operations and a large amount of memory needed for these layers.

For example, the first convolution layer in a Bi-RealNet18 alone accounts for a large share of the total operations of the whole network (this total already factors in the theoretical speed-up of binary layers). The three downsampling convolutions account for another considerable share. Since these 32-bit convolutions account for a large portion of all operations, we focused on them to reduce the operation count.

(a) The original 7×7 convolution.
(b) Our proposed grouped stem, which needs fewer operations.
Figure 3: A depiction of the two different versions of the initial layers of a network (s is the stride, g the number of groups; unless noted otherwise, we use 1 group and stride 1). Our grouped stem in (b) can be applied to all common BNN architectures, e.g., Bi-RealNet [25] and BinaryDenseNet [4], as well as our proposed MeliusNet, to save operations by replacing the expensive 7×7 convolution in the original layer configuration (a) without an increase in model size.

In previous work, the initial 32-bit 7×7 convolution uses 64 output channels. We propose to replace this convolution with three 3×3 convolutions, similar to the stem network used by Szegedy et al. [32]. In contrast to Szegedy et al., we use grouped convolutions [22] instead of regular convolutions for a reduction in operations (resulting in the name grouped stem). The first convolution has 32 output channels (with a stride of 2), the second convolution uses 4 groups and 32 output channels, and the third convolution uses 8 groups and 64 output channels (see Figure 3). We use this combination to achieve the same number of parameters (and thus model size), so it can be compared to previous architectures. Our proposed grouped stem structure needs considerably fewer operations than the original 7×7 convolution.
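In PyTorch-style pseudocode, the grouped stem could be written as follows. This is our sketch, not the reference implementation (which is part of BMXNet); the BatchNorm placement and the pooling layer after the stem are assumptions on our side.

```python
import torch.nn as nn

def grouped_stem(stem_width=32, out_channels=64):
    """Three grouped 3x3 convolutions replacing the initial 7x7 convolution:
    32 channels (stride 2), 32 channels (4 groups), 64 channels (8 groups)."""
    return nn.Sequential(
        nn.Conv2d(3, stem_width, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(stem_width),
        nn.Conv2d(stem_width, stem_width, 3, padding=1, groups=4, bias=False),
        nn.BatchNorm2d(stem_width),
        nn.Conv2d(stem_width, out_channels, 3, padding=1, groups=8, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.MaxPool2d(3, stride=2, padding=1),  # pooling as after the original 7x7 convolution
    )
```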

Even though there are certainly other ways to change the initial layers to reach an even lower number of operations, e.g., using quantization or a different set of layers, our main goal in adapting them was to see whether a BNN can reach a similar accuracy as MobileNet based on the same number of operations (see Section 4.3 for the results).

Similarly to the first layer, the downsampling convolutions can be adapted to use a certain number of groups, e.g., 2 or 4. However, since the features in the feature map are created consecutively by Dense Blocks, we add a channel shuffle operation [34] before the downsampling convolution (only if it uses groups). This allows the downsampling convolution to combine features from earlier and later layers.
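The channel shuffle operation itself is simple; the following sketch follows the ShuffleNet formulation [34] (our code, with shapes chosen purely for illustration):

```python
import torch

def channel_shuffle(x, groups):
    """Interleave the channels so that a following grouped convolution mixes
    features from all parts of the consecutively grown feature map."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())  # [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```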

3.3 Implementation Details

We follow the general principles for training binary networks presented in previous work [4, 25, 27]. Weights and activations are binarized using the sign function:

$$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x \geq 0 \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$

The non-differentiability of the sign function is handled with a Straight-Through Estimator (STE) [2], coupled with gradient clipping as introduced by Hubara et al. [18]. The forward and backward passes can therefore be described as:

$$\text{Forward:}\quad x_b = \mathrm{sign}(x) \qquad (2)$$

$$\text{Backward:}\quad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial x_b} \cdot \mathbf{1}_{|x| \leq t_{\mathrm{clip}}} \qquad (3)$$

Here, $L$ is the loss, $x$ a real-valued input, and $x_b$ the binary output. We use the same clipping threshold $t_{\mathrm{clip}}$ as [4]. Furthermore, the computational cost of binary neural networks at runtime can be greatly reduced by using the xnor and bitcount CPU instructions, as presented by Rastegari et al. [27].
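A compact PyTorch sketch of this forward/backward behavior is given below (our illustration; the clipping threshold of 1.0 used here is a placeholder, not the value from [4]):

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator (Eq. 2 and 3):
    the incoming gradient is passed through where |x| <= t_clip and set to
    zero elsewhere."""

    @staticmethod
    def forward(ctx, x, t_clip):
        ctx.save_for_backward(x)
        ctx.t_clip = t_clip
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        mask = (x.abs() <= ctx.t_clip).to(grad_output.dtype)
        return grad_output * mask, None  # no gradient for t_clip

x = torch.randn(5, requires_grad=True)
y = SignSTE.apply(x, 1.0)  # placeholder clipping threshold
y.sum().backward()
print(x.grad)              # 1.0 where |x| <= 1.0, else 0.0
```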

Previous work [25] has suggested a different backward function to approximate the sign function more closely; however, we found no performance gain during our experiments, similar to the results of [3]. Channel-wise scaling factors have been proposed to reduce the difference between a regular and a binary convolution [27]. However, it was also argued that they are mostly needed to scale the gradients [25], that a single scaling factor is sufficient [35], or that neither of them is actually needed [3]. Recent work suggests that the effect of scaling factors might be neutralized by BatchNorm layers [4]. For this reason, and since we have not observed a performance gain from using scaling factors, we did not apply them in our convolutions. We use the typical layer order (BatchNorm → sign → BinaryConv) of previous BNNs [4, 25]. Finally, as done in previous work [3, 36], we replaced the bottleneck structure consisting of a 1×1 and a 3×3 convolution, which is often used in full-precision networks, with a single 3×3 convolution.

4 Results and Discussion

We selected the challenging task of image classification on the ImageNet dataset [6] to test our new model architecture and to perform ablation studies with our proposed changes. Our implementation is based on BMXNet (https://github.com/hpi-xnor/BMXNet-v2) [33] and the model implementations of Bethge et al. [4]. Note that experiment logs, accuracy curves, and plots of the model structures for all trainings are in the supplementary material.

4.1 Grouped Stem Ablation Study and Training Details

When training models with our proposed grouped stem structure based on previous architectures, we discovered a large performance gain compared to previous networks. To verify the source of these gains, we did an ablation study on ResNetE18 [3], Bi-RealNet34 [25], BinaryDenseNet28/37 [4], and our MeliusNet22/29, each with and without our proposed grouped stem structure. We directly show the results of this study in the corresponding figures for the comparison to the state-of-the-art (see Figure 5 (a) and (b); alternatively, a table with these values can be found in the supplementary material).

On the one hand, the results show that using grouped stem instead of a regular 7×7 convolution increases the model accuracy for all tested model architectures. The actual increase from using the grouped stem structure is between 0.4% and 0.9% for each model (cf. Table 3 in the supplementary material), and at the same time we also save a constant amount of operations (as shown by the dotted lines in Figure 5). We conclude that our grouped stem structure is not only highly efficient, but also generalizes well to different BNN architectures.

On the other hand, we also recognized that our training process performs significantly better than previous training strategies. Therefore, we give a brief overview of our training configuration in the following:

For data preprocessing we use channel-wise mean subtraction, normalize the data based on the standard deviation, randomly flip the image with a probability of 0.5, and finally select a random resized crop, which is the same augmentation scheme that was used in XNOR-Net [27]. We initialize the weights with the method of [7] and train our models from scratch (without pre-training a 32-bit model) for 120 epochs. We use the RAdam optimizer proposed by Liu et al. [24] and the default ("cosine") learning rate scheduling of the GluonCV toolkit [11]. This schedule steadily decreases the learning rate $\eta_t$ from the base learning rate $\eta_0$ according to the following formula ($t$ is the current step in training and $T$ the total number of steps, with $0 \leq t \leq T$): $\eta_t = \frac{\eta_0}{2}\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$. However, we achieved similar (only slightly worse) results with the same learning rate scheduling and the Adam [21] optimizer, if we use a warm-up phase of 5 epochs in which the learning rate is linearly increased to the base learning rate. Using SGD led to the worst results overall, and even though we did some initial investigation into the differences between the optimizers (included in the supplementary material), we could not find a clear reason for the performance difference.
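For reference, the cosine schedule described above (plus the linear warm-up we use together with Adam) can be computed as in the following sketch (our code; the parameter names are illustrative):

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_steps=0):
    """Cosine decay from base_lr to 0 over total_steps, with an optional
    linear warm-up during the first warmup_steps steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```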

4.2 Ablation Study on 32-bit Networks

Figure 4: A comparison between the 32-bit versions of MeliusNet and DenseNet (best viewed in color). We tuned the number of building blocks to achieve models of similar complexity: MeliusNet uses 4-4-4-3 blocks (4.5 billion FLOPs, 20.87 MB) and DenseNet uses 6-6-6-5 blocks (4.0 billion FLOPs, 19.58 MB). We used the off-the-shelf Gluon-CV training script for ImageNet [11] with identical hyperparameters to train both models. The accuracy curves are almost indistinguishable for the whole training process, and our 32-bit MeliusNet is not able to improve the result compared to a 32-bit DenseNet, even though it uses slightly more FLOPs and memory.

We performed another ablation study to find out whether our proposed MeliusNet is indeed specifically better for a BNN or whether it would also increase the performance of a 32-bit network. Since our proposed MeliusNet without the Improvement Blocks is very similar to a DenseNet, we compared these two architectures and trained two 32-bit models based on a DenseNet and a MeliusNet. We used the off-the-shelf Gluon-CV training script for ImageNet and their DenseNet implementation as the basis for our experiment [11]. To achieve a fair comparison, we constructed two models of similar size and number of operations. We used 4-4-4-3 blocks (Dense Block and Improvement Block) between the transition stages for MeliusNet and 6-6-6-5 blocks (Dense Blocks only) for the DenseNet. The models need 4.5 billion FLOPs with 20.87 MB model size and 4.0 billion FLOPs with 19.58 MB model size, respectively. Therefore, we would expect MeliusNet to achieve a slightly better result, since it uses slightly more FLOPs and has a larger model size, unless our designed architecture is only specifically useful for BNNs. Both models were trained with SGD with momentum and equal hyperparameters for 90 epochs (with a warm-up phase of 5 epochs and "cosine" learning rate scheduling). Note that additional augmentation techniques (HSV jitter and PCA-based lighting noise) were used (in this study only), since we did not change the original Gluon-CV training script for the 32-bit models.

The result shows basically identical training curves for both models over the whole training (see Figure 4). At the end of training, the training accuracy is virtually identical for both architectures. Even though the validation accuracy does not match exactly over the whole training, this is probably caused by the randomized augmentation and shuffling of the dataset. Therefore, we conclude that using our MeliusNet architecture for 32-bit models does not lead to an improvement, and that our architecture is indeed only an improvement for BNNs.

4.3 Comparison to State-of-the-art

Name (block numbers) | Size (MB) | FLOPs (×10^8) | Top-1 (Top-5) accuracy
MeliusNet22 (4,5,4,4) | 3.9 | 2.08 | 63.6% (84.7%)
MeliusNet29 (4,6,8,6) | 5.1 | 2.14 | 65.8% (86.2%)
MeliusNet42 (5,8,14,10) | 10.1 | 3.25 | 69.2% (88.3%)
MeliusNet59 (6,12,24,12) | 17.4 | 5.25 | 70.7% (89.3%)
MeliusNet25/4 (4,5,5,6) | 4.0 | 1.62 | 63.4% (84.2%)
MeliusNet29/2 (4,6,8,6) | 5.0 | 1.96 | 65.7% (85.9%)
Table 1: Details of our different MeliusNet configurations and their accuracy on the ImageNet classification task [6] (all results use our grouped stem structure). The channel reduction factors in the transitions are chosen at specific fractions to keep the number of channels a multiple of 32. The suffixes /2 and /4 denote that the downsampling convolution uses 2 and 4 groups, respectively.
(a) Architectures with a model size of about 4.0 MB
(b) Architectures with a model size of about 5.1 MB
Figure 5: A comparison between our work and previous approaches with network architectures [4, 9, 10, 16, 23, 25, 35] of two size categories on the ImageNet dataset. All colored results are trained with our training strategy, the original authors' result is shown in black. The dotted lines represent the difference between training with and without our proposed grouped stem structure. (a) Our MeliusNet22 achieves higher accuracy than a BinaryDenseNet28 without additional operations, and by applying our optimizations we can also reach a state-of-the-art result based on the ResNetE architecture. (b) We are the first to achieve a result similar to MobileNet with a binary neural network, by applying our optimizations to a Bi-RealNet34. Our MeliusNet29 can reach a higher accuracy than even ABC-Net (5/5) while using less than one third of its operations.

To compare to other state-of-the-art networks, we created different configurations of MeliusNet with different model sizes and numbers of operations (see Table 1). Our main goal was to reach fair comparisons to previous architectures by using a similar model size and number of operations. Therefore, we chose the configurations of MeliusNet22 and MeliusNet29 to be similar to BinaryDenseNet28 and BinaryDenseNet37, respectively. We calculated the number of operations in the same way as in previous work, factoring in the speed-up factor of binary convolutions [4, 25]. To be able to compare to Bi-RealNet, we further needed to reduce the number of operations, so we used 4 and 2 groups in the downsampling convolutions for MeliusNet25/4 and MeliusNet29/2, respectively, and added a channel shuffle operation beforehand, as described in Section 3.2. Finally, we created the larger networks MeliusNet42 and MeliusNet59 to be able to compare to MobileNet 0.75 and MobileNet 1.0. This also shows that the basic network structure of MeliusNet can easily be adapted to create networks with different sizes and numbers of operations by tuning the number of blocks and using groups in the downsampling convolution. Note that after initially choosing these model configurations for comparison and obtaining our training results, we did not adapt them for further tuning.

Comparison to other binary networks (one base):

We compared our MeliusNet22/29 with the following binary network architectures: ResNetE18 [3] (which is similar to Bi-RealNet18 [25], except for the addition of a single ReLU layer and a single BatchNorm), Bi-RealNet34 [25], and BinaryDenseNet28/37 [4]. For reference, we also include ABC-Net results, which use multiple binary bases for weights and activations, even though they are not directly comparable, since they use a larger model size [23]. Since we trained the other binary network architectures with our training strategy for the grouped stem ablation study, we report these results together with the accuracy reported by the original authors. This allows for a fair comparison between the architectures, since all models are trained with our training strategy.

We divide the results into two groups: the models with a size of about 4.0 MB and those with a size of about 5.1 MB (see Figure 5).

First, we recognize that comparing our MeliusNet22 (including all optimizations) to the original result of a BinaryDenseNet28 shows a 2.9% accuracy increase together with a reduction of FLOPs. Through our ablation study, we can also see how the different factors contribute to this increase in accuracy: 1.1% comes from the architecture change itself, 0.8% (and the FLOPs reduction) from using grouped stem, and 1.0% from our training strategy.

Secondly, we can see that if we apply our grouped stem and our training strategy to a ResNetE18, the result can even surpass sophisticated training methods such as BONN or PCNN by Gu et al. [9, 10]. If we compare our MeliusNet25/4 (which has a reduced number of operations) to BONN (which is based on the Bi-RealNet18 architecture), we achieve 4.1% higher accuracy based on the same number of operations. We note that, since we do not use additional losses (e.g., those introduced in BONN), our architectural optimizations could be combined with such advanced training methods in future work, likely achieving even more accurate BNNs. Overall, our MeliusNet achieves by far the best result for a binary network with one binary base and a model size of about 4 MB (see Figure 5 (a)).

For the analysis of binary models of about 5.1 MB size, we also included the result of MobileNet 0.5 [16] for reference, even though it is not a binary approach (see Figure 5 (b)). MeliusNet29 (including all optimizations) shows a 3.3% accuracy increase over the original result of a BinaryDenseNet37, with the same reduction of FLOPs. Again, we can analyze how the different factors contribute to this increase in accuracy: 1.6% comes from the difference in architectures, 0.9% (and the FLOPs reduction) from using grouped stem, and 0.8% from our training strategy.

Additionally, we recognize that by applying our grouped stem and our training strategy to a Bi-RealNet34, we can achieve the same accuracy as MobileNet 0.5 based on a similar amount of operations and model size, which has not been achieved by any BNN before. Furthermore, our MeliusNet29 even surpasses ABC-Net with 5 binary bases for weights and activations by 0.8%, with a much lower number of operations and a smaller model size. Finally, we also compare our MeliusNet29/2 to the Bi-RealNet34 result achieved with our training, where we achieve 2.0% higher accuracy based on the same number of operations.

Comparison to other binary networks with multiple binary bases and compact networks:

For another challenging and more direct comparison, we compared our results based on Bi-RealNet34, MeliusNet29, MeliusNet42, and MeliusNet59 to the compact network architecture MobileNet [16] and to the GroupNet approach [36], which uses 5 binary bases (meaning 5 binary convolutions are used to approximate each 32-bit convolution), in Table 2. First of all, in the comparisons between MeliusNet29 and Group-Net18 as well as between MeliusNet42 and Group-Net34, our MeliusNet reaches 1.0% and 0.7% higher accuracy at a lower number of operations and a smaller model size, respectively. However, since both approaches optimize at different architectural levels, they could even be combined in future work.

Furthermore, by applying our optimizations to a Bi-RealNet34, we can reach the same accuracy as MobileNet 0.5 with almost identical model size and operations. Our MeliusNet29 and MeliusNet29/2 achieve improvements of 2.1% and 2.0%, respectively, over the result of MobileNet 0.5, although MeliusNet29 is not directly comparable, since it uses a slightly higher number of operations. However, the results are still very promising, since they are based on the same model size and show a significant increase in accuracy.

The comparisons between MeliusNet42 and MobileNet 0.75 as well as between MeliusNet59 and MobileNet 1.0 are even more direct, since we tuned both models to exactly match the respective MobileNet in operations and model size. In these comparisons, MeliusNet42 and MeliusNet59 reach 0.8% and 0.1% higher accuracy than the respective MobileNet models (note that due to its size, we trained MeliusNet59 for 150 epochs instead of 120).

Model size | Architecture | Top-1 accuracy
5.1 MB | MobileNet 0.5 [16] | 63.7% (base)
5.1 MB | Bi-RealNet34 [25] | 62.2%
5.1 MB | Bi-RealNet34 [25] * | 63.7%
5.1 MB | MeliusNet29/2 | 65.7%
5.1 MB | MeliusNet29 | 65.8%
8.7 MB | Group-Net18 (5) [36] | 64.8%
10 MB | MobileNet 0.75 [16] | 68.4% (base)
10 MB | MeliusNet42 | 69.2%
15 MB | Group-Net34 (5) [36] | 68.5%
17 MB | MobileNet 1.0 [16] | 70.6% (base)
17 MB | MeliusNet59 | 70.7%

* This result is based on our optimizations.

Table 2: Comparison of MobileNet v1 [16], the GroupNet approach [36], which uses multiple binary bases, and our results based on Bi-RealNet34 [25] and our binary MeliusNet, on the ImageNet dataset [6]. We achieve similar or better accuracy than MobileNet 0.5, 0.75, and 1.0 with different networks.

We conclude that our architectural approach is a valid alternative to the structural decomposition described in GroupNet, and that it is a very promising competitor to a 32-bit MobileNet, since it matches or even surpasses its accuracy.

5 Conclusion

Previous work has shown different techniques to increase the accuracy of BNNs, by increasing the number of channels or by replacing the binary convolutions with convolutions that use multiple binary bases. The Bi-RealNet and BinaryDenseNet approaches were the first to change the architecture of a BNN compared to a 32-bit network. In our work, we presented a novel architecture, MeliusNet, which is specifically designed to amend the disadvantages of using binary convolutions. In this architecture, we repeatedly add new features and improve them to compensate for the lower quality and lower capacity of binary feature maps. Our experiments with different model sizes on the challenging ImageNet dataset show that MeliusNet is superior to previous BNN approaches that adapted the architecture.

Further, we presented grouped stem, an optimized set of layers that can replace the first 7×7 convolution. This considerably reduces the gap between BNN results and compact networks, and with our optimization, both previous architectures and our proposed MeliusNet can reach an accuracy similar to MobileNet 0.5, 0.75, and 1.0 based on the same model size and a similar amount of operations. This provides a strong basis for BNNs to also reach the accuracy of other compact networks, such as MobileNet 0.25, in future work. The higher energy saving potential of BNNs (based on customized hardware) could then make them the favorable choice in many applications.

We also found that our architecture can reach competitive accuracy when compared to approaches with multiple binary bases. Therefore, we think that future work on BNNs could achieve further improvements by combining architectural optimizations with block-internal optimizations, such as using multiple binary bases.

References

  • [1] Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. International Conference on Learning Representations (ICLR), 2017.
  • [2] Yoshua Bengio, Nicholas Léonard, and Aaron C Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. CoRR, abs/1308.3432, 2013.
  • [3] Joseph Bethge, Marvin Bornstein, Adrian Loy, Haojin Yang, and Christoph Meinel. Training competitive binary neural networks from scratch. arXiv preprint arXiv:1812.01965, 2018.
  • [4] Joseph Bethge, Haojin Yang, Marvin Bornstein, and Christoph Meinel. BinaryDenseNet: Developing an Architecture for Binary Neural Networks. In The IEEE International Conference on Computer Vision (ICCV) Workshops, 2019.
  • [5] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [7] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
  • [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [9] Jiaxin Gu, Ce Li, Baochang Zhang, Jungong Han, Xianbin Cao, Jianzhuang Liu, and David Doermann. Projection Convolutional Neural Networks for 1-bit CNNs via Discrete Back Propagation. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8344–8351, 2019.
  • [10] Jiaxin Gu, Junhe Zhao, Xiaolong Jiang, Baochang Zhang, Jianzhuang Liu, Guodong Guo, and Rongrong Ji. Bayesian Optimized 1-Bit CNNs. In The IEEE International Conference on Computer Vision (ICCV), 2019.
  • [11] Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, and Shuai Zheng. GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing. arXiv preprint arXiv:1907.04433, 2019.
  • [12] Song Han, Huizi Mao, and William J Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In International Conference on Learning Representations (ICLR), 2016.
  • [13] Song Han, Jeff Pool, John Tran, and William Dally. Learning both Weights and Connections for Efficient Neural Networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V Le, and Hartwig Adam. Searching for MobileNetV3. In The IEEE International Conference on Computer Vision (ICCV), 2019.
  • [16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, 2017.
  • [17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 2261–2269, 2017.
  • [18] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
  • [19] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
  • [20] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep Features for Text Spotting. In Computer Vision – ECCV 2014, pages 512–528, Cham, 2014. Springer International Publishing.
  • [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [23] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards Accurate Binary Convolutional Neural Network. In Advances in Neural Information Processing Systems, number 3, pages 344–352, 2017.
  • [24] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the Variance of the Adaptive Learning Rate and Beyond. arXiv preprint arXiv:1908.03265, 2019.
  • [25] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real Net: Enhancing the Performance of 1-bit CNNs with Improved Representational Capability and Advanced Training Algorithm. In The European Conference on Computer Vision (ECCV), sep 2018.
  • [26] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. WRPN: Wide Reduced-Precision Networks. International Conference on Learning Representations (ICLR), 2018.
  • [27] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • [28] Joseph Redmon, Santosh Kumar Divvala, Ross B Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
  • [29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, pages 91–99, 2015.
  • [30] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [31] Mingzhu Shen, Kai Han, Chunjing Xu, and Yunhe Wang. Searching for Accurate Binary Neural Architectures. The IEEE International Conference on Computer Vision (ICCV) Workshops, 2019.
  • [32] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI, volume 4, page 12, 2017.
  • [33] Haojin Yang, Martin Fritzsche, Christian Bartz, and Christoph Meinel. BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1209–1212. ACM, 2017.
  • [34] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [35] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv preprint arXiv:1606.06160, 2016.
  • [36] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Supplementary material

Our supplementary material contains the following information:

  • Appendix A briefly explains the structure of the experiment data, which can be found here: https://owncloud.hpi.de/s/h5zWIepW1OS0Rs6

  • Appendix B shows a comparison between MeliusNet and the naive approach of simply alternating Residual Blocks and Dense Blocks

  • Appendix C contains data that shows some of the observed differences between the different optimizers (SGD, Adam, RAdam)

  • Table 3 contains the detailed result numbers of all our trainings and of previous work (these values are the basis for Figure 5 (a) and (b) in the paper)

Appendix A Detailed Experiment Data

We include the experiment logs (experiment.log), accuracy curves (accuracy.png) and detailed plots (network.pdf) of our model architectures in one folder per experiment result. The accuracy curves also include the model size and number of operations of the corresponding model.

Appendix B Comparing the Naive Approach and MeliusNet

(a) Naive approach
(b) MeliusNet
Figure 6: The basic building blocks of MeliusNet and of the naive approach of repeating alternating Dense Blocks and Residual Blocks (c denotes the number of channels in the feature map). (a) With the naive approach, the Residual Block has to produce c output channels instead of the constant number of output channels (64) of the Dense Block. This means the Residual Block needs between 2 and 10 times the number of operations of the Dense Block, depending on the number of layers and the depth of the layer in the network. Furthermore, the number of weights and operations grows quadratically with c, making anything except very shallow networks unfeasible. (b) Our MeliusNet for comparison. The number of operations of both blocks is similar (only the number of input channels changes slightly between the blocks).
Figure 7: A comparison between MeliusNet and the naive approach of simply alternating a Residual Block and a Dense Block (see Figure 6 (a)). Both models need about 258 million operations (factoring in the speed-up of binary operations). The 3% accuracy drop of the naive approach is too large, and the high number of operations needed for larger models makes the naive architecture unfeasible for BNNs.

The direct approach to combining residual and dense shortcut connections could lead to a design as shown in Figure 6 (a). In this case, the combination of a Dense Block and a Residual Block is repeated throughout the network. However, the residual shortcut connection requires that the feature map sizes of the input and output of the convolution match. This means the number of channels contributes quadratically to the number of operations. This makes achieving a reasonable number of operations difficult with this approach, since increasing the channel number (as is done in every Dense Block) leads to a quadratic increase of operations. Therefore, increasing the capacity of feature maps with this approach is not practical, especially for larger binary networks.

Figure 6 (b) shows MeliusNet for comparison. The design of our Improvement Block keeps the number of operations lower, since increasing the channel number with Dense Blocks only linearly increases the number of operations required for later blocks.

We also empirically evaluated both models. These experiments were trained for only 40 epochs with a different learning rate schedule (the learning rate decays in steps at epochs 35 and 37). However, since both models were trained with the same hyperparameters, this should not affect the comparison between them. Since we struggled to construct a model that matches in both model size and number of operations, we only made the number of operations equal. In the comparison we can see that the naive approach is much worse, with a 3% difference in Top-1 accuracy on ImageNet (see Figure 7). Even considering the slightly smaller model (3.3 MB instead of 4 MB), this drop in accuracy is too large compared to other binary models, e.g., Bi-RealNet or BinaryDenseNet. Therefore, we concluded that this approach is not useful for BNNs and did not pursue it further. The details of these experiments are in the "experiment_data" folder under "naive_vs_MeliusNet".

Appendix C Optimizer Comparison

As written in the paper, we found that both Adam and RAdam optimize better than SGD. We tried different learning rates and learning rate schedules; however, the accuracy on ImageNet when training with SGD was still about 1% lower than with Adam (with warm-up). Therefore, we counted the number of sign "flips" of each individual weight between batches (accumulated per epoch) for each optimizer during the training of a ResNetE18 on ImageNet (see Figure 8). If a weight was updated from −1 to +1 (or vice versa) after processing one batch, its weight flip count would increase by one. This can happen several times per epoch and intuitively reflects the "stability" of the training process regarding the binary weights.
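The flip statistic itself can be collected with a few lines of code; the following sketch shows how we interpret it (our illustration, not the original logging code):

```python
import torch

def count_sign_flips(weights_before, weights_after):
    """Number of binary weights whose sign changed between two consecutive
    updates; accumulating this per weight over an epoch gives the flip
    counts shown in Figure 8."""
    return int((torch.sign(weights_before) != torch.sign(weights_after)).sum())
```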

First of all, the data showed that, surprisingly, after about 90 epochs, 95% of all binary weights are stable within a single given epoch. Note that this does not mean that 95% of the weights are stable for the whole time after the 90th epoch, since the 95% of stable weights are not necessarily identical between different epochs.

With Adam and RAdam, the average stability increases during training, while with SGD the stability decreases after about 50 epochs. However, this decrease only occurs in the earlier layers of the network (see Figure 8 (a)) and does not apply to the later layers (see Figure 8 (b)). Although this is an indication of a more unstable training process with SGD, it does not yet conclusively explain the performance difference to RAdam and Adam.

(a) Data from the first binary convolution of the first network stage
(b) Data from the last binary convolution of the last network stage
Figure 8: We show the n-th percentile of the number of weight "flips" for each optimizer for the binary weights of two different convolution layers over the whole training process of 120 epochs of a ResNetE18. The first 5 epochs are warm-up epochs for Adam and SGD, in which the learning rate is increased linearly to the base learning rate. We can see, for example, that after the 100th epoch, 95% of the weights in these layers are stable within a single given epoch. Furthermore, for Adam and RAdam the stability increases during training. This is not the case for SGD in the earlier layers of the network (e.g., in (a)), where the number of flips increases starting around epoch 60.
Model size | Network architecture | Training procedure | Grouped stem | Top-1 accuracy
4.0 MB | MeliusNet25/4 | Ours | yes | 63.4%
4.0 MB | Bi-RealNet18 [25] | Original | no | 56.4%
4.0 MB | PCNN [9] | Original | no | 57.3%
4.0 MB | BONN [10] | Original | no | 59.3%
4.0 MB | ResNetE18 [4] | Original | no | 58.1% (base)
4.0 MB | ResNetE18 [4] | Ours | no | 60.0% (base)
4.0 MB | ResNetE18 [4] | Ours | yes | 60.6%
4.0 MB | BinaryDenseNet28 [4] | Original | no | 60.7%
4.0 MB | BinaryDenseNet28 [4] | Ours | no | 61.7% (base)
4.0 MB | BinaryDenseNet28 [4] | Ours | yes | 62.6%
4.0 MB | MeliusNet22 | Ours | no | 62.8% (base)
4.0 MB | MeliusNet22 | Ours | yes | 63.6%
5.1 MB | MobileNet 0.5 [16] | - | - | 63.7%
5.1 MB | MeliusNet29/2 | Ours | yes | 65.7%
5.1 MB | Bi-RealNet34 [25] | Original | no | 62.2% (base)
5.1 MB | Bi-RealNet34 [25] | Ours | no | 63.3% (base)
5.1 MB | Bi-RealNet34 [25] | Ours | yes | 63.7%
5.1 MB | BinaryDenseNet37 [4] | Original | no | 62.5%
5.1 MB | BinaryDenseNet37 [4] | Ours | no | 63.3% (base)
5.1 MB | BinaryDenseNet37 [4] | Ours | yes | 64.2%
5.1 MB | MeliusNet29 | Ours | no | 64.9% (base)
5.1 MB | MeliusNet29 | Ours | yes | 65.8%
8.7 MB | ABC-Net18 (5/3) [23] | - | - | 62.5%
8.7 MB | ABC-Net18 (5/5) [23] | - | - | 65.0%
7.4 MB | Group-Net18 (4) [36] | - | - | 64.2%
8.7 MB | Group-Net18 (5) [36] | - | - | 64.8%
9.2 MB | Group-Net18** (5) [36] | - | - | 67.0%
10 MB | MobileNet 0.75 [16] | - | - | 68.4%
10 MB | MeliusNet42 | Ours | yes | 69.2%
15 MB | Group-Net34 (5) [36] | - | - | 68.5%
15.3 MB | Group-Net34** (5) [36] | - | - | 70.5%
45 MB | ResNet18 (32-bit) [14] | - | - | 69.3%
84 MB | ResNet34 (32-bit) [14] | - | - | 73.3%

Table 3: Data for all our trainings and a variety of comparisons on ImageNet. This data is the basis for Figure 5 (a) and (b) in our paper. The "Grouped stem" column states whether a model uses our grouped stem structure instead of a regular 7×7 convolution. Rows marked with (base) serve as the baselines for the accuracy differences discussed in the paper (the gain of our training procedure over the original authors' results and the gain from applying grouped stem).