1 Introduction
The success of deep convolutional neural networks in a variety of machine learning tasks, such as image classification
[14, 22], object detection [28, 29], text recognition [20], and image generation [1, 8], has led to the design of deeper, larger, and more sophisticated neural networks. However, the large size and high number of operations of these accurate models severely limit their applicability on resource-constrained platforms, such as mobile or embedded devices. Many existing works aim to solve this problem by reducing memory requirements and accelerating inference. These approaches can be roughly divided into a few research directions: network pruning techniques [12, 13], compact network designs [15, 16, 19, 30, 34], and low-bit quantization [5, 27, 35], wherein the full-precision 32-bit floating point weights (and in some cases also the activations) are replaced with lower-bit representations, e.g. 8 bits or 4 bits. The extreme case, Binary Neural Networks (BNNs), was introduced by [18, 27] and uses only 1 bit for weights and activations. It was shown in previous work that the BNN approach is especially promising, since a binary convolution can be sped up by a factor higher than 50 while using less than 1% of the energy compared to a 32-bit convolution on FPGAs and ASICs [26]. This speedup can be achieved by replacing the multiplications (and additions) in matrix multiplications with bitwise xnor and bitcount operations [26, 27], processing up to 64 values in one operation. However, BNNs still suffer from accuracy degradation compared to their full-precision counterparts [10, 27]. To alleviate this issue, there has been work to approximate full-precision accuracy by using multiple weight bases [23, 36] or by increasing the number of channels in feature maps [26, 31]. However, these approaches come with an increase in both computational cost and model size. We briefly review the related work in more detail in Section 2.
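As a toy illustration of this xnor-and-bitcount trick (our sketch, not any paper's implementation; real kernels bit-pack values into 64-bit machine words and use hardware popcount):

```python
def pack_bits(v):
    # Encode a {-1, +1} vector as a bitmask: bit i is set where v[i] == +1.
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    # Matching bits contribute +1, mismatching bits -1, so the dot
    # product over {-1, +1} values is n - 2 * popcount(a XOR b).
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [1, 1, -1, -1]
b = [1, -1, -1, 1]
# Same result as the floating-point dot product of a and b:
assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == sum(x * y for x, y in zip(a, b))
```

One xor plus one popcount thus replaces up to a word's width of multiply-accumulates, which is the source of the speedup cited above.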
Prior work has used full-precision architectures, e.g., AlexNet [22] and ResNet [14], without specific adaptations for BNNs. To the best of our knowledge, only two works are exceptions: Liu et al. added additional residual shortcuts to the ResNet architecture [25] and Bethge et al. adapted a DenseNet architecture with dense shortcuts for BNNs [4]. Both approaches seem to be beneficial for BNNs, but we presume for different reasons: the former improves the quality of the features, while the latter increases their capacity. We combined these aspects and developed MeliusNet, which increases both quality and capacity of features throughout the network (see Section 3).
Previous work also showed a large gap between the compact network structure MobileNet [16] and BNNs. Even approaches with multiple binary bases [23, 36] have so far not been able to reach similar accuracy based on the same computational budget. We identify that this is mainly due to a few layers in previous BNNs which use 32 bits instead of 1 bit. To solve this issue, we propose a change to these layers, using multiple grouped convolutions to save operations and improve the accuracy at the same time (see Section 3.2).
We evaluated MeliusNet on the ImageNet [6] dataset and compare it with the state-of-the-art (see Section 4). To confirm the effectiveness of our methods, we also provide extensive ablation experiments. During this study, we found that our training process with Adam [21] achieves much better results than reported in previous work. To allow for a fair comparison, we also trained the original (unchanged) networks and clearly separated the accuracy gains between the different factors (also within Section 4). Finally, we conclude our work in Section 5.
Summarized, our main contributions in this work are:

- A novel BNN architecture which efficiently counters the lower quality and lower capacity of binary feature maps.

- A novel initial set of grouped convolution layers for all binary networks.

- The first BNN that matches the accuracy of MobileNet 0.5, 0.75, and 1.0.
2 Related Work
Alternatives to binarization, such as compact network structures [15, 16, 19, 30, 34] and quantized approaches [5, 27, 35], have been introduced. In this section, we take a more detailed look at approaches that use BNNs with 1-bit weights and 1-bit activations. These networks were originally introduced by Courbariaux et al. [18] with Binarized Neural Networks and improved by Rastegari et al., who used channel-wise scaling factors to reduce the quantization error in their XNOR-Net [27]. The following works tried to further improve the network accuracy, which was much lower than that of common 32-bit networks, with different techniques:
WRPN [26] and Shen et al. [31] increased the number of channels for better performance. Their work only increases the number of channels in the convolutions and the feature maps, but does not change the architecture.
Another way to increase the accuracy of BNNs was presented by ABC-Net [23] and GroupNet [36]. Instead of using a single binary convolution, they use a set of binary convolutions to approximate a 32-bit convolution (the size of this set is sometimes called the number of binary bases). This achieves higher accuracy, but increases the required memory and number of operations of each convolution by that factor. These approaches optimize the network within each building block.
The two approaches most similar to our work are BiRealNet [25] and BinaryDenseNet [4]. They use only a single binary convolution, but adapt the network architecture compared to full-precision networks to improve the accuracy of a BNN. However, they did not test whether their proposed architecture changes are specific to BNNs or whether they would improve a 32-bit network as well.
3 MeliusNet
The motivation for MeliusNet comes from the two main disadvantages of using binary values instead of 32bit values for weights and inputs.
On the one hand, the number of possible weight values is reduced from up to 2^32 to only 2. This leads to a certain quantization error, which is the difference between the result of a regular 32-bit convolution and a 1-bit convolution. This error reduces the quality of the features computed by binary convolutions compared to 32-bit convolutions.
On the other hand, the value range of the inputs (for the following layer) is reduced by the same factor. This leads to a huge reduction in the available capacity of features as well, since fine-grained differences between values, as in 32-bit floating point values, can no longer exist.
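A toy numpy example of this quantization error (our sketch; a plain dot product stands in for one convolution window):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64)       # full-precision weights
x = rng.standard_normal(64)       # full-precision inputs

exact = w @ x                     # result of one 32-bit "convolution" window
binary = np.sign(w) @ np.sign(x)  # result of the 1-bit version

# The binary result can only be an integer in [-64, 64], while the 32-bit
# result is a fine-grained float -- the gap is the quantization error.
assert binary == int(binary) and -64 <= binary <= 64
quantization_error = abs(exact - binary)
```

The coarse integer output also illustrates the capacity argument: the following layer only ever sees a small set of possible input values.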
In the following section, we describe how MeliusNet increases the quality and capacity of features efficiently. Afterwards, we describe how the number of operations in the remaining 32bit layers of a binary network can be reduced. Finally, we show the implementation details of our BNN layers.
3.1 Improving Quality and Capacity
The core building block of MeliusNet consists of a Dense Block followed by an Improvement Block (see Figure 1). The Dense Block increases feature capacity, whereas the Improvement Block increases feature quality.
The Dense Block is the only building block of a BinaryDenseNet [4], which is a binary variant of the DenseNet architecture [17]. It consists of a binary convolution which derives 64 channels of new features based on the input feature map, with, for example, 256 channels. These features are concatenated to the feature map itself, resulting in 320 channels afterwards, thus increasing feature capacity.
The Improvement Block increases the quality of these newly concatenated channels. It uses a binary convolution to compute 64 channels, again based on the input feature map of 320 channels. These 64 output channels are added to the previously computed 64 channels through a residual connection, without changing the first 256 channels of the feature map (see Figure 1). Thus, this addition improves the last 64 channels, leading to the name of our network (melius is Latin for improvement). With this approach, each section of the feature map is improved exactly once.
Note that we could also use a residual connection to improve the whole feature map instead of using the proposed Improvement Block. However, with this naive approach, the number of times each section of the feature map is improved would be highly skewed towards the initially computed features. It would further incur a much higher number of operations, since the number of output channels would need to match the number of channels in the feature map. With the proposed Improvement Block, we can instead save computations and get a feature map with balanced quality improvements (the supplementary material contains experiment data comparing the naive approach and MeliusNet).

As stated earlier, alternating between a Dense Block and an Improvement Block forms the core part of the network. Depending on how often the combination of both blocks is repeated, we can create models of different sizes and with different numbers of operations. Our network progresses through similar stages as a BiRealNet and a BinaryDenseNet, with transition layers in between, which halve the height and width of the feature map with a MaxPool layer. Furthermore, the number of channels is also roughly halved in the downsampling convolution during the transition (see Table 1 for the exact factors). We show an example in Figure 2, where we repeat the blocks 4, 5, 4, and 4 times between transition layers and achieve a model which is similar to BiRealNet18 in terms of model size.
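The interplay of the two blocks can be sketched at the feature-map level as follows (shapes only; the real blocks use BatchNorm, sign, and binary convolutions, for which a hypothetical random stand-in is substituted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_binary_conv(x, out_channels):
    # Hypothetical stand-in for (BatchNorm, sign, binary convolution):
    # it only mimics the output shape (out_channels, H, W).
    _, h, w = x.shape
    return rng.standard_normal((out_channels, h, w))

def dense_block(x, growth=64):
    # Dense Block: concatenate `growth` newly computed channels,
    # increasing the capacity of the feature map.
    return np.concatenate([x, fake_binary_conv(x, growth)], axis=0)

def improvement_block(x, growth=64):
    # Improvement Block: add a residual to the *last* `growth` channels
    # only, improving their quality while leaving the rest untouched.
    y = x.copy()
    y[-growth:] += fake_binary_conv(x, growth)
    return y

x = rng.standard_normal((256, 14, 14))   # input feature map: 256 channels
x = dense_block(x)                       # -> 320 channels
x = improvement_block(x)                 # still 320 channels, last 64 refined
assert x.shape == (320, 14, 14)
```

Repeating this pair of operations, with transition layers in between, yields the overall network.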
3.2 Layers with 32-bit Convolutions
We follow previous work and do not binarize the first convolution, the final fully connected layer, and the (“downsampling”) convolutions in the network to preserve accuracy [4, 25, 36]. However, since these layers contribute a large share of operations, we propose a redesign of the first layers (we use the accuracy and number of operations of the respective architectures for the ImageNet classification task [6]).
We compared previous BNNs to the compact network architecture MobileNet 0.5 [16] with respect to the total number of operations it needs and the accuracy it achieves on ImageNet. We found that the closest BNN result (regarding model size and operations) is BiRealNet34, which achieved lower accuracy with a similar model size while also needing more operations. We presume that, because of this difference, compact model architectures are more popular for practical applications than BNNs, especially with more recent (and improved) compact networks appearing [15, 30]. To find a way to close this gap, we analyze the required number of operations in the following.
As described in Section 3.3, previous work [4, 25, 36] did not binarize the first convolutional layer, the final fully-connected layer, and the downsampling convolutions to prevent a large accuracy drop. Even though we agree with this decision, it also leads to a high number of operations and a large amount of memory needed for these layers.
For example, the first convolution layer in a BiRealNet18 alone accounts for a large fraction of the total operations of the whole network (which factors in the theoretical speedup of binary layers). The three downsampling convolutions account for another significant share. Since these 32-bit convolutions together are responsible for the majority of all operations, we focused on them to reduce the number of operations.
Figure 3 depicts the two different versions of the initial layers of a network (s is the stride, g the number of groups; we use 1 group and stride 1 unless noted otherwise). Our grouped stem can be applied to all common BNN architectures, e.g., BiRealNet [25] and BinaryDenseNet [4], as well as our proposed MeliusNet, to save operations by replacing the expensive 7x7 convolution in the original layer configuration without an increase in model size.

In previous work, the 32-bit 7x7 convolution uses 64 channels. We propose to replace it with three 3x3 convolutions, similar to the stem network used by Szegedy et al. [32]. In contrast to Szegedy et al., we use grouped convolutions [22] instead of regular convolutions for a reduction in operations (resulting in the name grouped stem). The first convolution has 32 output channels (with a stride of 2), the second convolution uses 4 groups and 32 output channels, and the third convolution has 8 groups and 64 output channels (see Figure 3). We chose this combination to achieve the same number of parameters (and thus model size), so it can be compared to previous architectures. Our proposed grouped stem structure needs considerably fewer operations than the original configuration.
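A back-of-the-envelope count of multiply-accumulate operations illustrates the saving (our estimate, assuming a 224x224 input so that all stem outputs are 112x112; the exact stride/shape choices are our assumptions):

```python
def conv_macs(k, c_in, c_out, h, w, groups=1):
    # Multiply-accumulate operations of a k x k convolution
    # producing an output feature map of size h x w.
    return k * k * (c_in // groups) * c_out * h * w

# Original stem: one 7x7 stride-2 convolution, 3 -> 64 channels.
original = conv_macs(7, 3, 64, 112, 112)

# Grouped stem: three 3x3 convolutions, the first with stride 2,
# the last two with 4 and 8 groups respectively.
stem = (conv_macs(3, 3, 32, 112, 112)
        + conv_macs(3, 32, 32, 112, 112, groups=4)
        + conv_macs(3, 32, 64, 112, 112, groups=8))

assert stem < original
print(f"MACs: {original:.2e} -> {stem:.2e}")
```

Under these assumptions the grouped stem needs roughly 40% fewer multiply-accumulates than the single 7x7 convolution.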
Even though there are certainly other ways to change the initial layers to reach an even lower number of operations, e.g. using quantization or a different set of layers, our main goal was to see whether a BNN can reach a similar accuracy as a MobileNet based on the same number of operations (see Section 4.3 for the results).
The downsampling convolutions can be adapted in a similar way, using a certain number of groups, e.g., 2 or 4. However, since the features in the feature map are created consecutively with Dense Blocks, we add a channel shuffle operation [34] before the downsampling convolution (only if we use groups in our downsampling convolution). This allows the downsampling convolution to combine features from earlier and later layers.
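The channel shuffle operation can be sketched as follows (a minimal sketch; it interleaves channel blocks so that each group of the subsequent grouped convolution sees features from all parts of the feature map):

```python
import numpy as np

def channel_shuffle(x, groups):
    # Interleave the channel blocks: reshape to (groups, c/groups, ...),
    # swap the first two axes, and flatten the channels again.
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# 8 channels, channel i filled with the value i, shuffled into 4 groups:
x = np.arange(8)[:, None, None] * np.ones((8, 2, 2))
y = channel_shuffle(x, groups=4)
assert [int(y[i, 0, 0]) for i in range(8)] == [0, 2, 4, 6, 1, 3, 5, 7]
```

After the shuffle, consecutive channels originate from different stages of the densely grown feature map.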
3.3 Implementation Details
We follow the general principles to train binary networks as presented in previous work [4, 25, 27]. Weights and activations are binarized by using the sign function:
x_b = sign(x) = { +1, if x >= 0; -1, otherwise }    (1)
The non-differentiability of the sign function is solved with a Straight-Through Estimator (STE) [2] coupled with gradient clipping as introduced by Hubara et al. [18]. Therefore, the forward and backward passes can be described as:

Forward: x_b = sign(x)    (2)

Backward: dL/dx = dL/dx_b * 1_{|x| <= t_clip}    (3)

In this case, L is the loss, x a real-valued input, and x_b a binary output. We use the clipping threshold t_clip as used by [4]. Furthermore, the computational cost of binary neural networks at runtime can be highly reduced by using the xnor and popcount CPU instructions, as presented by Rastegari et al. [27].
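A minimal numpy sketch of this binarization and its straight-through gradient (the clipping threshold of 1.0 here is illustrative, not the value we use):

```python
import numpy as np

def sign_forward(x):
    # Binarize: +1 for x >= 0, -1 otherwise.
    return np.where(x >= 0, 1.0, -1.0)

def sign_backward(x, grad_out, clip=1.0):
    # Straight-through estimator: pass the incoming gradient through
    # unchanged where |x| <= clip, and zero it everywhere else.
    return grad_out * (np.abs(x) <= clip)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
assert sign_forward(x).tolist() == [-1.0, -1.0, 1.0, 1.0, 1.0]
assert sign_backward(x, np.ones_like(x)).tolist() == [0.0, 1.0, 1.0, 1.0, 0.0]
```

The clipping prevents gradients from flowing through activations that are far outside the binarization range.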
Previous work [25] has suggested a different backward function to approximate the sign function more closely; however, we found no performance gain during our experiments, similar to the results of [3]. Channel-wise scaling factors have been proposed to reduce the difference between a regular and a binary convolution [27]. However, it was also argued that they are mostly needed to scale the gradients [25], that a single scaling factor is sufficient [35], or that neither of them is actually needed [3]. Recent work suggests that the effect of scaling factors might be neutralized by BatchNorm layers [4]. For this reason, and since we have not observed a performance gain by using scaling factors, we did not apply them in our convolutions. We use the typical layer order (BatchNorm, sign, BinaryConv) of previous BNNs [4, 25]. Finally, as done in previous work [3, 36], we replaced the bottleneck structures often used in full-precision networks, which consist of a 1x1 and a 3x3 convolution, with a single 3x3 convolution.
4 Results and Discussion
We selected the challenging task of image classification on the ImageNet dataset [6] to test our new model architecture and perform ablation studies with our proposed changes. Our implementation is based on BMXNet (https://github.com/hpi-xnor/BMXNet-v2) [33] and the model implementations of Bethge et al. [4]. Note that experiment logs, accuracy curves, and plots of model structures for all trainings are in the supplementary material.
4.1 Grouped Stem Ablation Study and Training Details
When training models with our proposed grouped stem structure based on previous architectures, we discovered a large performance gain compared to previous networks. To verify the source of these gains, we did an ablation study on ResNetE18 [3], BiRealNet34 [25], BinaryDenseNet28/37 [4], and our MeliusNet22/29, each with and without our proposed grouped stem structure. We directly show the results of this study in the corresponding figures for the comparison to the state-of-the-art (see Figure 5; a table with these values can be found in the supplementary material).
On the one hand, the results show that using grouped stem instead of a regular convolution increases the model accuracy for all tested model architectures, while at the same time we also save a constant amount of operations (as shown by the dotted lines in Figure 5). We conclude that our grouped stem structure is not only highly efficient, but also generalizes well to different BNN architectures.
On the other hand, we also recognized that our training process performs significantly better than previous training strategies. Therefore, we give a brief overview of our training configuration in the following:
For data preprocessing, we use channel-wise mean subtraction, normalize the data based on the standard deviation, randomly flip the image with a probability of 0.5, and finally select a random resized crop, which is the same augmentation scheme that was used in XNOR-Net [27]. We initialize the weights with the method of [7] and train our models from scratch (without pre-training a 32-bit model) for 120 epochs with a fixed base learning rate. We use the RAdam optimizer proposed by Liu et al. [24] and the default (“cosine”) learning rate scheduling of the GluonCV toolkit [11], which steadily decreases the learning rate eta based on the following formula (t is the current step in training, with T steps in total): eta_t = eta_0 * (1 + cos(pi * t / T)) / 2. However, we achieved similar (only slightly worse) results with the same learning rate scheduling and the Adam [21] optimizer, if we use a warm-up phase of 5 epochs in which the learning rate is linearly increased to the base learning rate. Using SGD led to the worst results overall, and even though we did some initial investigation into the differences between optimizers (included in the supplementary material), we could not find a clear reason for the performance difference.

4.2 Ablation Study on 32-bit Networks
We performed another ablation study to find out whether our proposed MeliusNet is indeed specifically better for a BNN or whether it would also increase the performance of a 32-bit network. Since our proposed MeliusNet without the Improvement Blocks is very similar to a DenseNet, we compared these two architectures and trained two 32-bit models, one based on a DenseNet and one based on a MeliusNet. We used the off-the-shelf GluonCV training script for ImageNet and their DenseNet implementation as a basis for our experiment [11]. To achieve a fair comparison, we constructed two models of similar size and operations. We used 4, 4, 4, and 3 blocks (Dense Block and Improvement Block) between the transition stages for MeliusNet and 6, 6, 6, and 5 blocks (Dense Blocks only) for the DenseNet. The models need 4.5 billion FLOPs with 20.87 MB model size and 4.0 billion FLOPs with 19.58 MB model size, respectively. Therefore, we would expect MeliusNet to achieve a slightly better result, since it uses slightly more FLOPs and has a larger model size, unless our designed architecture is only specifically useful for BNNs. Both models were trained with SGD with momentum and equal hyperparameters for 90 epochs (with a warm-up phase of 5 epochs and “cosine” learning rate scheduling). Note that additional augmentation techniques (HSV jitter and PCA-based lighting noise) were used (in this study only), since we did not change the original GluonCV training script for the 32-bit models.
The result shows basically identical training curves for both models over the whole training (see Figure 4). At the end of training, the training accuracy is even between both architectures. Even though the validation accuracy does not match over the whole training, this is probably caused by randomized augmentation and shuffling of the dataset. We therefore conclude that using our MeliusNet architecture for 32-bit models does not lead to an improvement, and that our architecture is indeed only an improvement for BNNs.
4.3 Comparison to State-of-the-art
Name (block numbers)     | Channel reduction factors in transitions | Size (MB) | FLOPs (×10^8) | Top-1 (Top-5) accuracy
MeliusNet22 (4,5,4,4)    | – | 3.9  | 2.08 | 63.6% (84.7%)
MeliusNet29 (4,6,8,6)    | – | 5.1  | 2.14 | 65.8% (86.2%)
MeliusNet42 (5,8,14,10)  | – | 10.1 | 3.25 | 69.2% (88.3%)
MeliusNet59 (6,12,24,12) | – | 17.4 | 5.25 | 70.7% (89.3%)
MeliusNet25/4 (4,5,5,6)  | – | 4.0  | 1.62 | 63.4% (84.2%)
MeliusNet29/2 (4,6,8,6)  | – | 5.0  | 1.96 | 65.7% (85.9%)
To compare to other state-of-the-art networks, we created different configurations of MeliusNet with different model sizes and numbers of operations (see Table 1). Our main goal was to enable fair comparisons to previous architectures by using a similar model size and number of operations. Therefore, we chose the configurations of MeliusNet22 and MeliusNet29 to be similar to BinaryDenseNet28 and BinaryDenseNet37, respectively. We calculated the number of operations in the same way as previous work, factoring in a speedup factor for binary convolutions [4, 25]. To be able to compare to BiRealNet, we further needed to reduce the number of operations, so we used 4 and 2 groups in the downsampling convolutions for MeliusNet25/4 and MeliusNet29/2, respectively, and added a channel shuffle operation beforehand, as described in Section 3.2. Finally, we created the larger networks MeliusNet42 and MeliusNet59 to be able to compare to MobileNet 0.75 and MobileNet 1.0. This also shows that the basic network structure of MeliusNet can easily be adapted to create networks with different sizes and numbers of operations by tuning the number of blocks and using groups in the downsampling convolution. Note that after initially choosing these model configurations for comparison and getting our training results, we did not adapt them for further tuning.
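The operation counting convention can be sketched as follows (our sketch; the factor of 64 reflects that one xnor/popcount word processes 64 binary values at once, as mentioned in the introduction):

```python
def total_flops(float_ops, binary_ops, speedup=64):
    # Binary xnor/popcount operations are counted at 1/speedup of a
    # regular floating-point operation, following prior BNN work.
    return float_ops + binary_ops / speedup

# Hypothetical layer mix: 0.3e8 floating-point ops and 11.5e8 binary ops.
assert total_flops(0.3e8, 11.5e8) == 47968750.0
```

Under this convention, the few remaining 32-bit layers dominate the operation count, which motivates the grouped stem above.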
Comparison to other binary networks (one base):
We compared our MeliusNet22/29 with the following binary network architectures: ResNetE18 [3] (which is similar to BiRealNet18 [25], except for the addition of a single ReLU layer and a single BatchNorm layer), BiRealNet34 [25], and BinaryDenseNet28/37 [4]. For reference, we also include ABC-Net results, which use multiple binary bases for weights and activations, even though they are not directly comparable, since they use a larger model size [23]. Since we trained the other binary network architectures with our training strategy for our grouped stem ablation study, we report those results together with the accuracy reported by the original authors. This allows for a fair comparison between the architectures, since all models are trained with our training strategy. We divide the results into two groups: models with a size of about 4.0 MB and those with a size of about 5.1 MB (see Figure 5).
First, we recognize that comparing our MeliusNet22 (including all optimizations) to the original result of a BinaryDenseNet28 shows an accuracy increase together with a reduction of FLOPs. Through our ablation study, we can also see how the different factors contribute to this increase in accuracy: part comes from the architecture change itself, part (together with the FLOPs reduction) from using grouped stem, and part from our training strategy.
Secondly, we can see that if we apply our grouped stem and our training strategy to a ResNetE18, the result can even surpass sophisticated training methods, such as BONN or PCNN by Gu et al. [9, 10]. If we compare our MeliusNet25/4 (which has a reduced number of operations) to BONN (which is based on the BiRealNet18 architecture), we achieve higher accuracy based on the same number of operations. We note that, since we do not use additional losses (e.g. those introduced in BONN), our architectural optimizations could be combined with such advanced training methods in future work, likely achieving even more accurate BNNs. Overall, our MeliusNet achieves by far the best result for a binary network with one binary base and a model size of 4 MB (see Figure 5a).
For the analysis of binary models of 5.1 MB size, we also included the result of MobileNet 0.5 [16] for reference, even though it is not a binary approach (see Figure 5b). MeliusNet29 (including all optimizations) shows an accuracy increase over the original result of a BinaryDenseNet37 with the same reduction of FLOPs. Again, we can analyze how the different factors contribute to this increase in accuracy: part comes from the difference in architectures, part (together with the FLOPs reduction) from using grouped stem, and part from our training strategy.
Additionally, we recognize that by applying our grouped stem and our training strategy to a BiRealNet34, we can achieve the same accuracy as MobileNet 0.5 based on a similar amount of operations and model size, which no BNN had achieved before. Furthermore, our MeliusNet29 even surpasses ABC-Net with 5 binary bases for weights and activations, despite a much lower number of operations and a smaller model size. Finally, we also compare our MeliusNet29/2 to the BiRealNet34 result achieved with our training, where we achieve higher accuracy based on the same number of operations.
Comparison to other binary networks with multiple binary bases and compact networks:
For another challenging and more direct comparison, we compare our results based on BiRealNet34, MeliusNet29, MeliusNet42, and MeliusNet59 to the compact network architecture MobileNet [16] and the GroupNet approach [36], which uses 5 binary bases (i.e., 5 binary convolutions to approximate each 32-bit convolution), in Table 2. First of all, in the comparisons between MeliusNet29 and GroupNet18 and between MeliusNet42 and GroupNet34, our MeliusNet reaches higher accuracy at a lower number of operations and a smaller model size. However, since both approaches optimize at a different architecture level, they could even be combined in future work.
Furthermore, by applying our optimizations to a BiRealNet34, we can reach the same accuracy as MobileNet 0.5 with almost identical model size and operations. Our MeliusNet29 and MeliusNet29/2 achieve improvements over the result of a MobileNet 0.5, although the comparison is not exact, since they use a slightly higher amount of operations. However, the results are still very promising, since they are based on the same model size and show a significant increase in accuracy.
The comparisons between MeliusNet42 and MobileNet 0.75, and between MeliusNet59 and MobileNet 1.0, are more direct, since we tuned both models to exactly match the respective MobileNet in operations and model size. In these comparisons, MeliusNet42 and MeliusNet59 reach higher accuracy than the respective MobileNet models (note that due to its size, we trained MeliusNet59 for 150 epochs instead of 120).
Model size | Architecture         | FLOPs | Top-1 acc.
5.1 MB     | MobileNet 0.5 [16]   | – | 63.7% (base)
5.1 MB     | BiRealNet34 [25]     | – | 62.2%
5.1 MB     | BiRealNet34 [25] *   | – | 63.7%
5.1 MB     | MeliusNet29/2        | – | 65.7%
5.1 MB     | MeliusNet29          | – | 65.8%
8.7 MB     | GroupNet18 (5) [36]  | – | 64.8%
10 MB      | MobileNet 0.75 [16]  | – | 68.4% (base)
10 MB      | MeliusNet42          | – | 69.2%
15 MB      | GroupNet34 (5) [36]  | – | 68.5%
17 MB      | MobileNet 1.0 [16]   | – | 70.6% (base)
17 MB      | MeliusNet59          | – | 70.7%

* This result is based on our optimizations.
We conclude that our architectural approach is a valid alternative to the structural decomposition described in GroupNet, and that it is comparable to a 32-bit MobileNet, since it matches or even surpasses its accuracy.
5 Conclusion
Previous work has shown different techniques to increase the accuracy of BNNs, such as increasing the channel numbers or replacing the binary convolutions with convolutions using multiple binary bases. The BiRealNet and BinaryDenseNet approaches were the first to change the architecture of a BNN compared to a 32-bit network. In our work, we presented a novel architecture, MeliusNet, which is specifically designed to remedy the disadvantages of binary convolutions. In this architecture, we repeatedly add new features and improve them to compensate for the lower quality and lower capacity of binary feature maps. Our experiments with different model sizes on the challenging ImageNet dataset show that MeliusNet is superior to previous BNN approaches that adapted the architecture.
Further, we presented grouped stem, an optimized set of layers that can replace the first convolution. It considerably reduces the gap between BNN results and compact networks: with our optimization, both previous architectures and our proposed MeliusNet can reach an accuracy similar to MobileNet 0.5 and MobileNet 0.75 based on the same model size and a similar amount of operations. This provides a strong basis for BNNs to reach the same accuracy as MobileNet 0.25 and 1.0 in future work. The higher energy saving potential of BNNs (based on customized hardware) could then make them the favorable choice in many applications.
We also found that our architecture can reach competitive accuracy when compared to approaches with multiple binary bases. Therefore, we think that future work on BNNs could achieve further improvements by combining architectural optimizations with block-internal optimizations, such as using multiple binary bases.
References
[1] Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. International Conference on Learning Representations (ICLR), 2017.
[2] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. CoRR, abs/1308.3, 2013.
[3] Joseph Bethge, Marvin Bornstein, Adrian Loy, Haojin Yang, and Christoph Meinel. Training competitive binary neural networks from scratch. arXiv preprint arXiv:1812.01965, 2018.
[4] Joseph Bethge, Haojin Yang, Marvin Bornstein, and Christoph Meinel. BinaryDenseNet: Developing an Architecture for Binary Neural Networks. In The IEEE International Conference on Computer Vision (ICCV) Workshops, 2019.
[5] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[7] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [9] Jiaxin Gu, Ce Li, Baochang Zhang, Jungong Han, Xianbin Cao, Jianzhuang Liu, and David Doermann. Projection Convolutional Neural Networks for 1bit CNNs via Discrete Back Propagation. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8344–8351, 2019.
 [10] Jiaxin Gu, Junhe Zhao, Xiaolong Jiang, Baochang Zhang, Jianzhuang Liu, Guodong Guo, and Rongrong Ji. Bayesian Optimized 1Bit CNNs. In The IEEE International Conference on Computer Vision (ICCV), 2019.
 [11] Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, and Shuai Zheng. GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing. arXiv preprint arXiv:1907.04433, 2019.
 [12] Song Han, Huizi Mao, and William J Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In International Conference on Learning Representations (ICLR), 2016.
 [13] Song Han, Jeff Pool, John Tran, and William Dally. Learning both Weights and Connections for Efficient Neural Networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [15] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V Le, and Hartwig Adam. Searching for MobileNetV3. In The IEEE International Conference on Computer Vision (ICCV), 2019.
 [16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, 2017.
 [17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
 [18] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.
 [19] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
 [20] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep Features for Text Spotting. In Computer Vision – ECCV 2014, pages 512–528, Cham, 2014. Springer International Publishing.
 [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [23] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards Accurate Binary Convolutional Neural Network. In Advances in Neural Information Processing Systems, number 3, pages 344–352, 2017.
 [24] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the Variance of the Adaptive Learning Rate and Beyond. arXiv preprint arXiv:1908.03265, 2019.
 [25] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real Net: Enhancing the Performance of 1-bit CNNs with Improved Representational Capability and Advanced Training Algorithm. In The European Conference on Computer Vision (ECCV), 2018.
 [26] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. WRPN: Wide Reduced-Precision Networks. In International Conference on Learning Representations (ICLR), 2018.
 [27] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [28] Joseph Redmon, Santosh Kumar Divvala, Ross B Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
 [29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, pages 91–99, 2015.
 [30] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and LiangChieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [31] Mingzhu Shen, Kai Han, Chunjing Xu, and Yunhe Wang. Searching for Accurate Binary Neural Architectures. The IEEE International Conference on Computer Vision (ICCV) Workshops, 2019.
 [32] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI, volume 4, page 12, 2017.
 [33] Haojin Yang, Martin Fritzsche, Christian Bartz, and Christoph Meinel. BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1209–1212. ACM, 2017.
 [34] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [35] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv preprint arXiv:1606.06160, 2016.
 [36] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Supplementary material
Our supplementary material contains the following information:

Appendix A briefly explains the structure of the experiment data, which can be found here: https://owncloud.hpi.de/s/h5zWIepW1OS0Rs6

Appendix B shows a comparison between MeliusNet and the naive approach of simply alternating Residual Blocks and Dense Blocks

Appendix C contains data that shows some of the observed differences between the different optimizers (SGD, Adam, RAdam)
Appendix A Detailed Experiment Data
We include the experiment logs (experiment.log), accuracy curves (accuracy.png) and detailed plots (network.pdf) of our model architectures in one folder per experiment result. The accuracy curves also include the model size and number of operations of the corresponding model.
Appendix B Comparing the Naive Approach and MeliusNet
The direct approach to combining residual and dense shortcut connections could lead to a result as shown in (a). In this case, the combination of a Dense Block and a Residual Block is repeated throughout the network. However, a residual shortcut connection requires that the feature map sizes of the input and output of the convolution match, which means the number of channels contributes quadratically to the number of operations. This makes achieving a reasonable number of operations difficult, since increasing the channel number (as is done in every Dense Block) leads to a quadratic increase in operations. Therefore, increasing the capacity of feature maps with this approach is not practical, especially for larger binary networks.
(b) shows MeliusNet for comparison. The design of our Improvement Block keeps the number of operations lower, since increasing the channel number with Dense Blocks only linearly increases the number of operations required for later blocks.
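The difference between quadratic and linear cost scaling can be sketched with a small calculation (an illustrative sketch with assumed kernel and feature map sizes, not the exact MeliusNet configuration):

```python
def conv_ops(c_in, c_out, k=3, h=14, w=14):
    """Multiply-accumulate operations of a k x k convolution
    applied to an h x w feature map (padding and bias ignored)."""
    return c_in * c_out * k * k * h * w

# A residual shortcut requires c_in == c_out, so the cost grows
# quadratically with the channel count: doubling it quadruples the cost.
print(conv_ops(128, 128) // conv_ops(64, 64))  # -> 4

# With dense shortcuts, appending 32 channels to a 64-channel input
# only adds a cost term linear in the number of appended channels
# for a later convolution to a fixed output width:
print(conv_ops(64 + 32, 128, k=1) - conv_ops(64, 128, k=1))
```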
We also empirically evaluated both models. These experiments were trained for only 40 epochs with a different learning rate schedule (base learning rate is , decaying by at epochs 35 and 37). However, since both models were trained with the same hyperparameters, this should not affect the comparison between them. Since we struggled to construct a model that could match in both model size and number of operations, we only made the number of operations equal. In the comparison we can see that the naive approach performs much worse, with a 3% difference in Top-1 accuracy on ImageNet (see Figure 7). Even though the model is slightly smaller (3.3 MB instead of 4 MB), this drop in accuracy is too large compared to other binary models, e.g., Bi-Real Net or BinaryDenseNet. We therefore concluded that this approach is not useful for BNNs and have not pursued it further. The details of these experiments are in the “experiment_data” folder under “naive_vs_MeliusNet”.
Appendix C Optimizer Comparison
As written in the paper, we found that both Adam and RAdam optimize better than SGD. We tried different learning rates and learning rate schedules; however, the accuracy on ImageNet when training with SGD was still about 1% lower than with Adam (with warmup). Therefore, we counted the number of sign “flips” of each individual weight between batches (accumulated per epoch) for each optimizer during the training of ResNetE18 on ImageNet (see Figure 8). If a weight was updated from +1 to −1 (or vice versa) after processing one batch, its flip count would increase by one. This can happen several times per epoch and intuitively reflects the “stability” of the training process regarding the binary weights.
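This flip count could be computed per batch with a helper along the following lines (a hypothetical sketch, not the instrumentation used for the paper):

```python
import numpy as np

def count_sign_flips(w_before, w_after):
    """Count weights whose sign changed between two consecutive
    updates; summing this over all batches of an epoch gives a
    per-epoch flip count per weight tensor."""
    return int(np.sum(np.sign(w_before) != np.sign(w_after)))

# Hypothetical latent weights before and after one batch update:
w_before = np.array([0.30, -0.10, 0.05, -0.70])
w_after  = np.array([0.20,  0.40, -0.02, -0.60])
print(count_sign_flips(w_before, w_after))  # weights 1 and 2 flipped -> 2
```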
First of all, the data surprisingly showed that after about 90 epochs, 95% of all binary weights are stable during a single given epoch. Note that this does not mean that 95% of the weights are stable for the entire time after the 90th epoch, since the 95% of stable weights are not necessarily identical across epochs.
When training with Adam and RAdam, the average stability increases over the course of training, while for SGD the stability decreases after about 50 epochs. However, this only holds for the earlier layers in the network (see (a)) and does not apply to later layers (see (b)). Although this indicates a less stable training process with SGD, it does not yet conclusively explain the performance difference to Adam and RAdam.
Model size | Network architecture | Training procedure | Grouped stem | FLOPs | Top-1 accuracy | Δ of method
4.0 MB | MeliusNet25/4 | Ours | ✓ | | 63.4% |
| Bi-Real Net18 [25] | Original | ✗ | | 56.4% |
| PCNN [9] | | ✗ | | 57.3% |
| BONN [10] | | ✗ | | 59.3% |
| ResNetE18 [4] | Original | ✗ | | 58.1% | (base)
| | Ours | ✗ | | 60.0% | (base)
| | Ours | ✓ | | 60.6% |
| BinaryDenseNet28 [4] | Original | ✗ | | 60.7% |
| | Ours | ✗ | | 61.7% | (base)
| | Ours | ✓ | | 62.6% |
| MeliusNet22 | Ours | ✗ | | 62.8% | (base)
| | Ours | ✓ | | 63.6% |
5.1 MB | MobileNet 0.5 [16] | | | | 63.7% |
| MeliusNet29/2 | Ours | ✓ | | 65.7% |
| Bi-Real Net34 [25] | Original | ✗ | | 62.2% | (base)
| | Ours | ✗ | | 63.3% | (base)
| | Ours | ✓ | | 63.7% |
| BinaryDenseNet37 [4] | Original | ✗ | | 62.5% |
| | Ours | ✗ | | 63.3% | (base)
| | Ours | ✓ | | 64.2% |
| MeliusNet29 | Ours | ✗ | | 64.9% | (base)
| | Ours | ✓ | | 65.8% |
8.7 MB | ABC-Net18 (5/3) [23] | | | | 62.5% |
| ABC-Net18 (5/5) [23] | | | | 65.0% |
7.4 MB | GroupNet18 (4) [36] | | | | 64.2% |
8.7 MB | GroupNet18 (5) [36] | | | | 64.8% |
9.2 MB | GroupNet18** (5) [36] | | | | 67.0% |
10 MB | MobileNet 0.75 [16] | | | | 68.4% |
| MeliusNet42 | | | | 69.2% |
15 MB | GroupNet34 (5) [36] | | | | 68.5% |
15.3 MB | GroupNet34** (5) [36] | | | | 70.5% |
45 MB | ResNet18 (32-bit) [14] | | | | 69.3% |
84 MB | ResNet34 (32-bit) [14] | | | | 73.3% |