1 Introduction
Deep Neural Networks (DNNs) have had a great impact on a broad range of disciplines in academia and industry [57, 38]. Recently, the deployment of DNNs has been shifting from high-end cloud servers to low-end devices such as mobile phones and embedded chips, serving the general public with many real-time applications such as drones, miniature robots, and augmented reality. Unfortunately, these devices typically have limited computing power and memory, and thus cannot afford the significant matrix computation and memory usage that DNNs need for important tasks like object recognition.
Binary Neural Networks (BNNs) are among the most promising techniques to meet the desired computation and memory requirements. BNNs [31] are deep neural networks whose weights and activations have only two possible values (e.g., $-1$ and $+1$) and can be represented by a single bit. Beyond the obvious advantage of saving storage and memory space, the binarized architecture admits only bitwise operations, which can be computed extremely fast by digital logic units [20] such as the arithmetic-logic unit (ALU), with much less power consumption than a floating-point unit (FPU).
Despite the significant gain in speed and storage, current BNNs suffer from notable accuracy degradation when applied to challenging tasks such as ImageNet classification. To close the gap, previous research on BNNs has focused on designing more effective optimization algorithms to find better local minima of the quantized weights. However, the task is highly nontrivial, since the gradient-based optimization that is effective for training DNNs becomes tricky to implement.
Instead of tweaking network optimizers, we investigate BNNs systematically in terms of representation power, speed, bias, variance, stability, and robustness. We find that BNNs suffer from severe intrinsic instability and non-robustness regardless of network parameter values. This observation implies that the performance degradation of BNNs is unlikely to be resolved solely by improving the optimization techniques; instead, it is necessary to fix the BNN function itself, particularly by reducing the prediction variance and improving its robustness to noise.
Inspired by this analysis, we propose the Binary Ensemble Neural Network (BENN). Though the basic idea is as straightforward as aggregating multiple BNNs by boosting or bagging, we show that the statistical properties of the ensembled classifiers become much nicer: not only are the bias and variance reduced but, more importantly, BENN’s robustness to noise at test time is significantly improved. All the experiments suggest that BNNs and ensemble methods are a natural fit. Using architectures of the same connectivity (a compact Network in Network [42]), we find that boosting BNNs alone can even surpass the baseline DNN with real weights in the best case. In addition, our initial exploration of applying BENN to ImageNet recognition using AlexNet [38] and ResNet [27] also shows a large gain. These are by far the fastest, most accurate, and most robust results achieved by binarized networks (Fig. 1).
To the best of our knowledge, this is the first work to bridge BNNs with ensemble methods. Unlike traditional BNN improvements, which have a computational complexity of $O(K^2)$ because they use $K$ bits per weight [65] or $K$ bases in total [43], BENN reduces the complexity to $O(K)$ with $K$ ensembles. Compared with [65, 43], BENN also enjoys better parallelizability of bitwise operations: with trivial parallelization, the complexity can be reduced to $O(1)$. We believe that BENN can shed light on further research along this direction to achieve extremely fast yet robust computation by networks.
2 Related Work
Quantized and binary neural networks: It has been found that full-precision parameters and activations are not necessary, and that the accuracy of a neural network can be preserved using k-bit fixed-point numbers, as stated by [19, 23, 61, 8, 40, 41, 48, 56, 49]. The first approach is to use low-bitwidth numbers to approximate real ones, which is called quantized neural networks (QNNs) [32]. [66, 64] also proposed ternary neural networks. Although recent advances such as [65] can achieve competitive performance compared with full-precision models, they cannot be fully sped up, because parallelized bitwise operations are not possible with bitwidths larger than one. [31] is a recent work that binarizes all the weights and activations, which was the birth of the BNN. It demonstrated the power of BNNs in terms of speed, memory use, and power consumption. But recent works such as [58, 11, 21, 10] also reveal strong accuracy degradation and a mismatch issue during training when BNNs are applied to complicated tasks such as ImageNet [12] recognition, especially when the activation is binarized. Although some works like [43, 50, 13] have offered reasonable solutions to approximate full-precision neural networks, they still need much more computation and hyperparameter tuning than BENN. Since they use either $K$-bit quantization or $K$ binary bases, their computational complexity cannot fall below $O(K^2)$ if a single 1-bit BNN requires $O(1)$, while BENN achieves $O(K)$, or even $O(1)$ if the ensemble members run in parallel on multiple threads. Also, much of the current literature tries to minimize the distance between binary and real-valued parameters. But empirical assumptions such as a Gaussian parameter distribution are usually required in order to get a prior for each BNN or to just keep the sign the same, as suggested by [43]; otherwise the nonconvex optimization is hard to deal with. By contrast, BENN can be a general framework to achieve the goal and has strong potential to work even better than full-precision networks, without involving more hyperparameters than a single BNN.
Ensemble techniques: Rather than relying on a single powerful classifier, the ensemble strategy improves the accuracy of a given learning algorithm by combining multiple weak classifiers, as summarized by [6, 9, 47]. The two most common strategies are bagging [5] and boosting [51, 17, 53, 26], which were proposed many years ago and have a strong statistical foundation. They have roots in the theoretical PAC model of [59], which was the first to pose the question of whether weak learners can be ensembled into a strong learner. Bagging predictors are proven to reduce variance, while boosting can reduce both bias and variance, and their effectiveness has been established by many theoretical analyses. Traditionally, ensembles were used with decision trees, decision stumps, and random forests, and achieved great success thanks to their desirable statistical properties. Recently, ensembles have been used to increase the generalization ability of deep CNNs [24], to boost CNNs and perform architecture selection [45], and to boost over features [30]. But ensemble techniques have not received enough attention, because a neural network is no longer a weak classifier, so an ensemble can unnecessarily increase the model complexity. However, when applied to weak binary neural networks, we found that ensembling generates new insights and hopes, and BENN is a natural outcome of this combination. In this work, we build our BENN on top of variants of bagging, AdaBoost [15, 52], and LogitBoost [17], and it can be extended to many more variants of traditional ensemble algorithms. We hope this work can revive these intelligent approaches and bring them back to life in modern neural networks.
3 Why Making BNNs Work Well is Challenging?
Despite the speed and space advantages of BNNs, their performance is still far inferior to that of their real-valued counterparts. There are at least two possible reasons: first, the functions representable by BNNs may have some inherent flaws; second, current optimization algorithms may still not be able to find good minima. While most researchers have been working on developing better optimization methods, we suspect that BNNs have some fundamental flaws. The following investigation reveals the fundamental limitations of BNN-representable functions experimentally.
Because all weights and activations are binary, an obvious fact is that BNNs can only represent a subset of discrete functions, being strictly weaker than real-valued networks, which are universal continuous function approximators [29]. What is less obvious are two serious limitations of BNNs: the robustness issue w.r.t. input perturbations, and the stability issue w.r.t. network parameters. Classical learning theory tells us that both robustness and stability are closely related to the generalization error of a model [62, 4]. A more detailed theoretical analysis of the problems of BNNs is given in the supplementary material.
Robustness Issue: In practice, we observe more severe overfitting in BNNs than in real-valued networks. Robustness is defined as the property that if a testing population is “similar” to a training population, then the testing error is close to the training error [62]. To verify this point, we experiment in a random network setting and a trained network setting.
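Random Network Setting. We compute the following quantity to compare a 32-bit real-valued DNN, a BNN, a QNN, and our BENN model (Sec. 4) on the Network-In-Network (NIN) architecture:
$$\mathbb{E}_{x,\,\epsilon}\big[\,\| f(W;\, x + \epsilon) - f(W;\, x) \|\,\big] \qquad (1)$$
where $f$ is the network, $W$ represents the network weights, $x$ is an input example, and $\epsilon$ is the input perturbation.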
We randomly sample real-valued weights as suggested in the literature to get a DNN, and binarize them to get a BNN with binary weights. We also independently sample and binarize weights to generate multiple BNNs with the same architecture, which together simulate a BENN. The QNN is obtained by quantizing the DNN to low-bitwidth weights (W) and activations (A). Each input image in CIFAR-10 is normalized to a fixed range.
Then we perturb each input example with Gaussian noise of different variances, run a forward pass on each network, and measure the expected norm of the change in the output distribution. This norm for the DNN, BNN, QNN, and BENN, averaged over 1000 sampling rounds, is shown in Fig. 2 (left) for perturbation variance 0.01.
The results show that BNNs always have a larger output variation, suggesting that they are more susceptible to input perturbation, and that a BNN does worse than a QNN with more bits. We also observe that having more bits in the activations improves the robustness of a BNN significantly, while having more bits in the weights yields only a marginal improvement (Fig. 2 (left)). Therefore, activation binarization seems to be the bottleneck.
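
The random-network measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the experimental code of the paper: the two-layer toy network, the layer sizes, the Gaussian weight scale, the input range, and the choice of the Euclidean norm are all hypothetical stand-ins for the NIN architecture and CIFAR-10 inputs.

```python
import math
import random


def sign(v):
    """Deterministic binarization to +1 / -1."""
    return 1.0 if v >= 0 else -1.0


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(W1, W2, x, binary):
    """Two-layer toy network; binary=True binarizes weights and hidden activations."""
    if binary:
        W1 = [[sign(w) for w in row] for row in W1]
        W2 = [[sign(w) for w in row] for row in W2]
    h = matvec(W1, x)
    h = [sign(v) for v in h] if binary else [max(0.0, v) for v in h]
    return softmax(matvec(W2, h))


def expected_output_change(W1, W2, inputs, sigma, binary, rounds, rng):
    """Monte Carlo estimate of E||f(x + eps) - f(x)||_2 with eps ~ N(0, sigma^2)."""
    total = 0.0
    for _ in range(rounds):
        x = rng.choice(inputs)
        x_noisy = [v + rng.gauss(0.0, sigma) for v in x]
        p = forward(W1, W2, x, binary)
        q = forward(W1, W2, x_noisy, binary)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return total / rounds


rng = random.Random(0)
dim, hidden, classes = 16, 32, 10  # hypothetical sizes
W1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
W2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(classes)]
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(20)]

sigma = math.sqrt(0.01)  # perturbation variance 0.01
change_real = expected_output_change(W1, W2, inputs, sigma, False, 1000, rng)
change_binary = expected_output_change(W1, W2, inputs, sigma, True, 1000, rng)
print(change_real, change_binary)
```

The same function can be called with an averaged ensemble of binarized networks to mimic the BENN curve; the toy numbers it prints say nothing about the magnitudes reported in Fig. 2.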
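Trained Network Setting. To further consolidate the discovery, we also train a real-valued DNN and a BNN using XNOR-Net [50] rather than sampling the weights directly. We also include our BENN in the comparison. Then we apply the same Gaussian input perturbation, run a forward pass, and calculate the change in classification error on CIFAR-10 as:
$$\Delta \mathrm{Err} = \mathrm{Err}_{\text{perturbed}} - \mathrm{Err}_{\text{clean}} \qquad (2)$$
i.e., the classification error on the perturbed inputs minus the error on the original inputs.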
The results in Fig. 2 (middle) indicate that BNNs are still more sensitive to noise even when they are well optimized. Although it has been shown that the weights of a BNN still have nice statistical properties [1], the conclusion can change dramatically if both weights and activations are binarized while the input is perturbed.
Stability Issue: BNNs are known to be hard to optimize due to problems such as gradient mismatch and the non-smoothness of the activation function. While [40] has shown that, assuming a convex error surface, stochastic rounding converges in expectation to within an accuracy of the minimizer that depends on the quantization resolution, the community has not fully understood the nonconvex error surface of BNNs and how it interacts with different optimizers such as SGD or ADAM [37].
To compare the stability of different networks (their sensitivity to network parameters during optimization), we measure the accuracy fluctuation after a large number of training steps. Fig. 2 (right) shows the accuracy oscillation over the last 20 training steps after training the BNN and QNN for 300 epochs. The results show that a QNN needs at least 4-bit weights and 4-bit activations to stabilize the network.
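
One simple way to quantify the accuracy oscillation over the last 20 training steps is the standard deviation of the logged accuracies. The sketch below assumes a population standard deviation and a hypothetical `acc_history` list of per-step test accuracies; the exact statistic used for Fig. 2 is not specified here.

```python
import statistics


def accuracy_oscillation(acc_history, last=20):
    """Population standard deviation of the accuracy over the final `last` steps."""
    tail = acc_history[-last:]
    return statistics.pstdev(tail)


# Hypothetical accuracy logs (in percent) for illustration only.
stable_run = [85.0] * 30
oscillating_run = [80.0, 82.0] * 15

print(accuracy_oscillation(stable_run))
print(accuracy_oscillation(oscillating_run))
```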
One explanation of this instability is the non-smoothness of the function output w.r.t. the binary network parameters. Note that, being the output of the activation function in the previous layer, the inputs to each layer of a BNN are binarized numbers. In other words, each function is non-smooth not only w.r.t. the input but also w.r.t. the learned parameters. By comparison, BENN with 5 and 32 ensembles (denoted as BENN-05/32 in Fig. 2) already achieves far better stability empirically.
4 Binary Ensemble Neural Network
In this section, we illustrate our BENN using bagging and boosting strategies, respectively. In all experiments, we adopt the widely used deterministic binarization
$$x_b = \mathrm{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases}$$
for network weights and activations, which is preferred in order to leverage hardware acceleration. However, backpropagation becomes challenging, since the derivative is zero almost everywhere except at the step point. In this work, we borrow the common strategy called the “straight-through estimator” (STE) [28] during backpropagation, defined as $\frac{\partial x_b}{\partial x} = \mathbf{1}_{|x| \le 1}$.
4.1 BENN-Bagging
The key idea of bagging is to average weak classifiers that are trained on i.i.d. samples of the training set. To train each BNN classifier, we sample $M$ examples independently with replacement from the training set of $M$ examples. We do this $K$ times to get $K$ BNNs. Sampling with replacement ensures that each BNN sees roughly 63% (about $1 - 1/e$) of the entire training set.
At test time, we aggregate the opinions of these classifiers and decide among the classes. We compare two ways of aggregating the outputs. One is to choose the label that most BNNs agree on (hard decision), while the other is to choose the best label after aggregating their softmax probabilities (soft decision).
The main advantage brought by bagging is to reduce the variance of a single classifier. This is known to be extremely effective for deep decision trees which suffer from high variance, but only marginally helpful to boost the performance of neural networks, since networks are generally quite stable. Interestingly, though less helpful to realvalued networks, bagging is effective to improve BNNs since the instability issue is severe for BNNs due to gradient mismatch and strong discretization noise as stated in Sec. 3.
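
The bagging procedure of this subsection can be sketched as below: bootstrap sampling for training, then hard or soft aggregation at test time. This is a minimal sketch; the function names are hypothetical, and the per-BNN predictions are hard-coded placeholders rather than outputs of trained networks.

```python
import math
import random
from collections import Counter


def bootstrap_indices(m, rng):
    """Draw m example indices with replacement from a training set of size m."""
    return [rng.randrange(m) for _ in range(m)]


def hard_vote(predicted_labels):
    """Hard decision: the label that most BNNs agree on."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Soft decision: the best label after averaging per-BNN softmax vectors."""
    n_classes = len(probabilities[0])
    avg = [sum(p[c] for p in probabilities) / len(probabilities)
           for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])


rng = random.Random(0)
m = 10000
seen = len(set(bootstrap_indices(m, rng))) / m
print(seen, 1 - 1 / math.e)  # fraction of distinct examples vs. 1 - 1/e

# Three hypothetical BNNs on one test example with two classes.
probs = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
labels = [max(range(2), key=lambda c: p[c]) for p in probs]
print(hard_vote(labels))  # 1: two of the three BNNs pick class 1
print(soft_vote(probs))   # 0: the averaged probability favors class 0
```

The last two lines show that the two aggregation rules can disagree when one confident member outweighs a lukewarm majority.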
4.2 BENN-Boosting
Boosting is another important tool to ensemble classifiers. Instead of just aggregating the predictions from multiple independently trained BNNs, boosting combines multiple weak classifiers in a sequential manner and can be viewed as a stagewise gradient descent method optimized in the function space. Boosting is able to reduce both bias and variance of individual classifiers.
There are many variants of boosting algorithms, and we choose the AdaBoost [15] algorithm for its popularity. Suppose classifier $k$ has hypothesis $h_k(x)$, weight $\alpha_k$, and output distribution $p_k(y \mid x)$. We can denote the aggregated classifier as $H_K(x) = \sum_{k=1}^{K} \alpha_k h_k(x)$ and its aggregated output distribution as $P_K(y \mid x)$. Then AdaBoost minimizes the following exponential loss:
$$L(H_K) = \sum_{i} \exp\big(-y_i H_K(x_i)\big)$$
where $y_i \in \{-1, +1\}$ and $i$ denotes the index of the training example.
Reweighting Principle
The key idea of boosting is to have the current classifier pay more attention to the samples misclassified by previous classifiers. Reweighting is the most common way of budgeting attention based on the historical results. There are essentially two ways to accomplish this goal:
Reweighting on sampling probabilities: Suppose each training example is initially assigned a uniform weight, so each sample gets an equal chance of being picked. After each round, we reweight the sampling probability according to the classification confidence.
Reweighting on loss/gradient: We may also incorporate the example weight into the gradient, so that a BNN updates its parameters with a larger step size on misclassified examples and vice versa. For example, the update for each example can be scaled by its weight in addition to the learning rate. However, we observe experimentally that this approach is less effective for BNNs, and we conjecture that it exaggerates the gradient mismatch problem.
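
Reweighting on sampling probabilities can be illustrated with one round of textbook discrete AdaBoost for binary labels. Note the substitution: the text above reweights according to classification confidence, whereas this sketch uses only correct/incorrect outcomes, and the function names are hypothetical.

```python
import math
import random


def adaboost_round(weights, correct):
    """One discrete AdaBoost update.

    weights: current sampling probabilities (summing to 1).
    correct: per-example booleans, True if the current classifier was right.
    Returns the classifier weight alpha and the renormalized probabilities.
    """
    err = sum(w for w, c in zip(weights, correct) if not c)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new)
    return alpha, [w / z for w in new]


def sample_training_set(weights, size, rng):
    """Draw example indices for the next BNN according to the current weights."""
    return rng.choices(range(len(weights)), weights=weights, k=size)


weights = [0.25, 0.25, 0.25, 0.25]   # uniform initial assignment
correct = [True, True, True, False]  # the last example was misclassified
alpha, weights = adaboost_round(weights, correct)
print(alpha, weights)

next_indices = sample_training_set(weights, 8, random.Random(0))
print(next_indices)
```

After the update the misclassified example carries half of the total probability mass, so the next BNN is much more likely to see it.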
4.3 Test-Time Complexity
A 1-bit BNN with the same connectivity as the original full-precision 32-bit DNN saves 32x memory. In practice, a BNN can achieve a 58x speedup on the current generation of 64-bit CPUs [50], and this may be further improved with special hardware such as FPGAs. Some existing works binarize only the weights and leave the activations in full precision, which in practice yields only a 2x speedup. As for BENN with $K$ ensembles, each BNN’s inference is independent, so the total memory saving is $32/K$x. As for boosting, we can further compress each BNN to save more computation and memory. Besides, existing approaches have $O(K^2)$ complexity with a $K$-bit QNN [65] or with $K$ binary bases [43], because they cannot avoid the bit-collection operation needed to generate a number, although their fixed-point computation is much more efficient than floating-point computation. If $O(1)$ is the time complexity of a Boolean operation, then BENN reduces the quadratic complexity to linear, i.e., $O(K)$ with $K$ ensembles, while still maintaining the accuracy and stability stated above. We can even run inference in $O(1)$ for BENN if multiple threads are supported. A complete comparison is shown in Table 1.
4.4 Stability Analysis
Given a full-precision real-valued DNN with a set of parameters, a BNN with binarized parameters, an input vector (after batch normalization) with a perturbation, and a BENN with multiple ensembles, we want to compare their stability and robustness w.r.t. the network parameters and the input perturbation. Here we analyze the variance of the output change before and after perturbation, which echoes Eq. 1 in Sec. 3. This is because the output change has zero mean, so its variance reflects the distribution of the output variation. More specifically, a larger variance means increased variation of the output w.r.t. the input perturbation.
Consider the outputs before the nonlinear activation function of a single neuron in a one-layer network. For the real-valued DNN, the output variation is the inner product of the weights and the input perturbation, and the variance of its distribution depends on the number of input connections of this neuron. Some modern nonlinear activation functions such as ReLU do not change the inequality between the variances, so we omit them to keep the analysis simple.
For a BNN with both weights and activations binarized, the above formulation is rewritten in terms of the binarized quantities, which yields the variance for the BNN. For BENN-Bagging, the variance is reduced with the number of ensembles, since bagging effectively reduces variance. BENN-Boosting can reduce both bias and variance at the same time; however, for boosting the analysis of bias and variance becomes much more difficult, and there is still some debate in the literature [7, 17]. With these Gaussian assumptions and some numerical experiments (detailed analysis and theorems can be found in the supplementary material), we can verify the large stability gain of BENN over BNN relative to the floating-point DNN. As for robustness, the same analysis principle can be applied by perturbing the weights instead of the input used in the stability analysis.
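
As a worked illustration of why averaging reduces the variance of the output change, consider the following simplified setting. The assumptions are illustrative and not taken from the analysis in the supplementary material: a single linear neuron with $n$ inputs, an i.i.d. Gaussian input perturbation with variance $\sigma^2$, and $K$ binary neurons whose $\pm 1$ weights are drawn independently and uniformly.

```latex
% Assumptions (illustrative only):
%   \Delta x \sim \mathcal{N}(0, \sigma^2 I_n), one linear neuron with n inputs,
%   w_1, \dots, w_K \in \{-1, +1\}^n drawn independently and uniformly.
\begin{align*}
\Delta a &= w^\top \Delta x,
  & \operatorname{Var}(\Delta a \mid w) &= \sigma^2 \lVert w \rVert_2^2, \\
\text{single binary neuron:}\quad & w \in \{-1,+1\}^n
  & \Rightarrow\ \operatorname{Var}(\Delta a) &= n \sigma^2, \\
\text{average of } K \text{ neurons:}\quad & \bar{w} = \tfrac{1}{K} \sum_{k=1}^{K} w_k
  & \Rightarrow\ \mathbb{E}_{w}\!\left[\sigma^2 \lVert \bar{w} \rVert_2^2\right] &= \frac{n \sigma^2}{K}.
\end{align*}
```

The last line follows because each coordinate of $\bar{w}$ is the mean of $K$ independent $\pm 1$ values and therefore has second moment $1/K$. Under these assumptions the expected variance shrinks linearly in the number of ensemble members, which is the qualitative behavior attributed to bagging above.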
Network  Weights  Activation  Operations  Memory Saving  Computation Saving 

Standard DNN  F  F  +, -, ×  1  1 
[10, 33, 39, 66, 64],…  B  F  +, -  32x  2x 
[65, 32, 61, 2],…  +, -, ×  x  x  
[43],…  +, -, XNOR, bitcount  x  x  
[50] and ours  B  B  XNOR, bitcount  32x  58x 
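
The XNOR and bitcount operations listed in the last row of the table replace the multiply-accumulate of a dot product. The sketch below shows the idea on Python integers used as bit vectors; the packing convention (bit set means $+1$) and the function names are assumptions made for illustration, and a real implementation would use machine words and a hardware popcount.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int: bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and bitcount."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask  # XNOR: 1 where the signs match
    matches = bin(agree).count("1")    # bitcount
    return 2 * matches - n             # matches minus mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
reference = sum(x * y for x, y in zip(a, b))
print(binary_dot(pack_bits(a), pack_bits(b), len(a)), reference)  # 0 0
```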
5 Independent and Warm-Restart Training for BENNs
We train our BENN with two different methods. The first is the traditional way: initialize each new classifier independently and retrain it. To accelerate the training of a new weak classifier in BENN, we can also initialize its weights by cloning the weights of the most recently trained classifier. We call this scheme warm-restart training, and we conjecture that knowledge of the data unseen by the new classifier is transferred through the inherited weights and helps increase the discriminability of the new classifier. Interestingly, we observe that for a small network and dataset, such as Network-In-Network [42] on CIFAR-10, warm-restart training yields better accuracy. However, independent training is better when BENN is applied to large networks and datasets, such as AlexNet [38] and ResNet [27] on ImageNet, since an overfitting problem emerges. More discussion can be found in Sec. 6 and Sec. 7.
Implementation Details
We train BENN on the image classification task with a CNN block structure containing a batch normalization layer, a binary activation layer, a binary convolution layer, a non-binary activation layer (e.g., sigmoid, ReLU), and a pooling layer, as used by many recent works [50, 65]. To compute the gradient of the step function, we use the STE. When updating parameters, we use real-valued weights as [50] suggests; otherwise the tiny updates could be killed by deterministic binarization and training could not move on. In this work, we train each BNN using standard independent and warm-restart training. Unlike previous works, which always keep the first and last layers in full precision, we test 7 different BNN architecture configurations, as shown in Table 2, and use them as ingredients for the ensemble in BENN.
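
The reason for keeping real-valued weights during updates can be seen in a few lines. The sketch below assumes a commonly used clipped form of the STE (the gradient passes through where the latent weight has magnitude at most 1) and plain SGD; the function names, learning rate, and gradient values are hypothetical.

```python
def sign(x):
    """Deterministic binarization used in the forward pass."""
    return 1.0 if x >= 0 else -1.0


def ste_grad(x, upstream):
    """Straight-through estimator: pass the gradient where |x| <= 1, else block it."""
    return upstream if abs(x) <= 1.0 else 0.0


def sgd_step(latent_weights, grads_wrt_binary, lr):
    """Update the real-valued (latent) weights; the forward pass uses their signs."""
    return [w - lr * ste_grad(w, g) for w, g in zip(latent_weights, grads_wrt_binary)]


latent = [0.01]
signs = []
for _ in range(3):
    latent = sgd_step(latent, [1.0], lr=0.004)
    signs.append(sign(latent[0]))

# Each step is too small to flip the sign alone, but the latent weight
# accumulates the updates and the binarized weight flips on the third step.
print(signs)  # [1.0, 1.0, -1.0]
```

Had the update been applied to the binarized weight directly and re-binarized after each step, every one of these small steps would have been discarded.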
Weak BNN Configuration/Type (T)  Weight  Activation  Size  Params 

SB (Semi-BNN)  First and last layer: 32-bit  First and last layer: 32-bit  100%  100% 
AB (All-BNN)  All layers: 1-bit  All layers: 1-bit  100%  100% 
WQB (Weight-Quantized-BNN)  All layers: Q-bit  All layers: 1-bit  100%  100% 
AQB (Activation-Quantized-BNN)  All layers: 1-bit  All layers: Q-bit  100%  100% 
IB (Except-Input-BNN)  All layers: 1-bit  First layer: 32-bit  100%  100% 
SB/AB/IB-Tiny (Tiny-Compress-BNN)  -  -  50%  25% 
SB/AB/IB-Nano (Nano-Compress-BNN)  -  -  10%  1% 
6 Experimental Results
We evaluate BENN on the CIFAR-10 and ImageNet datasets with a self-designed compact Network-In-Network (NIN) [42], the standard AlexNet [38], and ResNet-18 [27], respectively. Table 2 summarizes the configurations of all BNN variants. More detailed specifications of the networks can be found in the supplementary material. For each type of BNN, we obtain the converged single BNN (e.g., SB) when training is done. We also store the BNN after each training step and obtain the best BNN along the way by picking the one with the highest test accuracy (e.g., Best SB). We use BENN-T-R to denote the BENN that aggregates R BNNs of configuration T (e.g., BENN-SB-32). We also use Bag/Boost-Indep and Bag/Boost-Seq to denote bagging/boosting with standard independent training and with warm-restart sequential training (Sec. 5), respectively. All ensembled BNNs in this paper share the same network architecture as their real-valued DNN counterpart, although studying multi-model ensembles is interesting future work. The code for all our experiments will be made public online.
Network  Ensemble Method  Ensemble  STD 

SB  -  1  2.94 
Best SB  -  1  1.40 
BENN-SB  Bag-Seq  5  0.31 
BENN-SB  Boost-Seq  5  0.24 
BENN-SB  Bag-Seq  32  0.03 
BENN-SB  Boost-Seq  32  0.02 
6.1 Insights Generated from CIFAR-10
In this section, we show the large performance gain of BENN on CIFAR-10 and summarize some insights. Each BNN is initialized from a model pretrained with XNOR-Net [50] and then retrained for 100 epochs to reach convergence before ensembling. Each full-precision DNN counterpart is trained for 300 epochs to obtain the best accuracy for reference. The learning rate is set to 0.001 and the ADAM optimizer is used. Here, we use a compact Network-In-Network (NIN) for CIFAR-10. We first present some significant independent comparisons and then summarize the insights we found.
Single BNN versus BENN: We found that BENN achieves much better accuracy and stability than a single BNN with a negligible sacrifice in speed. Experiments across all BNN configurations show that BENN gains accuracy over a single BNN on CIFAR-10. If each BNN is weak (e.g., AB), the gain of BENN increases, as shown in Fig. 3 (right). This verifies that a BNN is indeed a good weak classifier for ensembling. Surprisingly, BENN-SB outperforms the full-precision DNN after 32 ensembles with either bagging or boosting (Fig. 3 (left)). Note that, in order to have the same memory usage as a 32-bit DNN, we limit the ensemble to 32 rounds if no network compression is involved. With more ensembles, we observe a further performance boost, but the accuracy gain eventually flattens.
We also compare BENN-SB-5 (i.e., 5 ensembles) with WQB (Q=5, 5-bit weights and 1-bit activations), which has the same number of parameters (measured in bits). WQB reaches its accuracy only unstably, while our ensemble network reaches higher accuracy and remains stable.
We also measure the accuracy variation of the classifier over the last 20 training steps for all BNN configurations. The results in Fig. 3 indicate that BENN reduces the variance of a BNN after 5 ensemble rounds and reduces it further after 32 rounds. Moreover, picking the BNN with the highest test accuracy, instead of the BNN obtained when training is done, also reduces the oscillation. This is because the statistical property of the ensemble framework (Sec. 3 and Sec. 4.4) makes BENN a graceful way to ensure high stability.
Bagging versus boosting: It is known that bagging can only reduce the variance of the predictor, while boosting can reduce both bias and variance. Fig. 3 (right), Fig. 4, and Table 4 show that boosting outperforms bagging, especially after the BNN is compressed to 50% of the network size (Tiny config) or to 10% (Nano config), and the gain increases from 5 to 32 ensembles. This verifies that boosting is a better choice if the model does not overfit much.
Standard independent training versus warm-restart training: Standard ensemble techniques use independent training, while warm-restart training enables new classifiers to learn faster. Fig. 3 (left) shows that warm-restart training performs better for both bagging and boosting after the same number of training epochs. This means that gradually adapting to more examples might be a better choice for CIFAR-10. However, this does not hold for the ImageNet task because of slight overfitting with warm restart (Sec. 6.2). We believe that this is an interesting phenomenon, but it needs more justification through a study of the theory of convergence.
Network  Ensemble Method  Ensemble  Accuracy 

Best SB  -  1  84.91% 
BENN-SB  Bag-Seq  32  89.12% 
BENN-SB  Boost-Seq  32  89.00% 
Best SB-Tiny  -  1  77.20% 
BENN-SB-Tiny  Bag-Seq  32  84.09% 
BENN-SB-Tiny  Boost-Seq  32  84.32% 
Best SB-Nano  -  1  40.70% 
BENN-SB-Nano  Bag-Seq  500  57.12% 
BENN-SB-Nano  Boost-Seq  500  63.11% 
The impact of compressing BNN: The model complexity of a BNN largely affects its bias and variance. If each weak BNN has enough complexity, with low bias but high variance, then bagging is more favorable than boosting due to its simplicity. However, if each BNN is small and has a large bias, boosting becomes a much better choice. To verify this, we compress each BNN in Table 2 by naively reducing the number of channels and neurons in each layer. The results in Table 4 show that BENN-SB can maintain reasonable performance even after naive compression, and that boosting gains more over bagging under severe compression (Nano config).
We also found that BENN is less sensitive to network size. Table 4 shows how much compression reduces the accuracy of a single BNN under the Tiny and Nano configs. After ensembling, the performance loss caused by compression decreases for both configs. Surprisingly, we observe that compression reduces the accuracy of the full-precision DNN only slightly under the Tiny and Nano configs. So it is necessary to have not-too-weak BNNs in order to build a BENN that can compete with the full-precision DNN. In the future, better pruning algorithms can be combined with BENN in place of naive compression to allow smaller networks to be ensembled.
The effect of bit width: A higher bitwidth results in lower variance and lower bias at the same time. This can be seen in Fig. 4, where we make the activations 2-bit in BENN-AQB (Q=2). BENN-AQB (Q=2) and BENN-IB have comparable accuracy after 32 ensembles, which is much better than BENN-AB and worse than BENN-SB. We also observe that activation binarization results in a much more unstable model than weight binarization. This indicates that the gain of having more bits is mostly due to better features from the input image, since input binarization is a real handicap for neural networks. Surprisingly, BENN-AB still retains considerable accuracy despite this handicap.
The effect of binarizing first and last layer: Almost all existing works on BNNs keep the first and last layers in full precision, since binarizing these two layers causes severe accuracy degradation. But we found that BENN is less affected, as shown by BENN-AB, BENN-SB, and BENN-IB in Fig. 4. For BENN with 32 ensembles, the accuracy loss due to binarizing these two special layers is smaller than for a single BNN.
In summary, our main insights about BNN and BENN are: (1) Ensemble methods such as bagging and boosting greatly relieve the problems of BNNs in terms of representation power, stability, and robustness. (2) Boosting gains an advantage over bagging in most cases, and warm-restart training is often a better choice. (3) The configuration of the weak BNN (i.e., size, bitwidth, first and last layer) is essential for building a well-functioning BENN that matches the full-precision DNN in practice.
6.2 Exploration on Applying BENN to ImageNet Recognition
We believe BENN is one of the best neural network structures for inference acceleration. To demonstrate the effectiveness of BENN, we compare our algorithm with state-of-the-art methods on the ImageNet recognition task (ILSVRC2012) using AlexNet [38] and ResNet-18 [27]. Specifically, we compare our BENN-SB with independent training (Sec. 5) against the full-precision DNN [38, 50], DoReFa-Net (k-bit quantized weights and activations) [65], XNOR-Net (binary weights and activations) [50], BNN (binary weights and activations) [31], and BinaryConnect (binary weights) [10]. We also tried ABC-Net (k binary bases for weights and activations) [43], but unfortunately the network did not converge well. Note that the accuracies of BNN and BinaryConnect on AlexNet are those reported by [50] rather than by the original authors. For DoReFa-Net and ABC-Net, we use the best accuracy reported by the original authors with 1-bit weights and 1-bit activations. For XNOR-Net, we report the number from our own retrained model. Our BENN starts from a model pretrained with XNOR-Net for 100 epochs until convergence, and we retrain each BNN for 80 epochs before ensembling. As shown in Tables 5 and 6, BENN-SB is the best among all the state-of-the-art BNN architectures, even with only 3 ensembles run in parallel on 3 threads. Meanwhile, although we do observe continued gains with 5 and 8 ensembles, we found that BENN with more ensembles on the ImageNet task can be unstable in terms of accuracy and needs further investigation of the overfitting issue; otherwise the rapid gain is not always guaranteed. However, we believe our initial exploration along this direction has shown the potential of BENN to catch up with the full-precision DNN and even surpass it with more base BNN classifiers. In fact, how to optimize BENN on large and diverse datasets is still an interesting open problem.
Method  W  A  Top-1 

Full-Precision DNN [38, 50]  32  32  56.6% 
XNOR-Net [50]  1  1  44.0% 
DoReFa-Net [65]  1  1  43.6% 
BinaryConnect [10, 50]  1  32  35.4% 
BNN [31, 50]  1  1  27.9% 
BENN-SB-3, Bagging (ours)  1  1  48.8% 
BENN-SB-3, Boosting (ours)  1  1  50.2% 
BENN-SB-6, Bagging (ours)  1  1  52.0% 
BENN-SB-6, Boosting (ours)  1  1  54.3% 
Method  W  A  Top-1 

Full-Precision DNN [27, 43]  32  32  69.3% 
XNOR-Net [50]  1  1  48.6% 
ABC-Net [43]  1  1  42.7% 
BNN [31, 50]  1  1  42.2% 
BENN-SB-3, Bagging (ours)  1  1  53.4% 
BENN-SB-3, Boosting (ours)  1  1  53.6% 
BENN-SB-6, Bagging (ours)  1  1  57.9% 
BENN-SB-6, Boosting (ours)  1  1  61.0% 
7 Discussion
More bits per network or more networks per bit? We believe this paper brings up this important question. As for biological neural networks such as our brain, the signal between two neurons is more like a spike instead of highrange realvalue signal. This implies that it may not be necessary to use realvalued numbers, while involve a lot of redundancies and can waste significant computing power. Our work converts the direction of “how many bits per network” into “how many networks per bit”. BENN provides a hierarchical view, i.e., we build weak classifiers by groups of neurons, and build a strong classifier by ensembling the weak classifiers. We have shown that this hierarchical approach is more intuitive and natural to represent knowledge. Although the optimal ensemble structure is beyond the scope of this paper, we believe that some structure searching or metalearning techniques can be applied. Moreover, the improvement on single BNN such as studying the error surface and resolving the curse of activation/gradient binarization is still essential for the success of BENN.
BENN is hardware friendly: Using BENN with K ensembles is better than using one K-bit classifier. Firstly, K-bit quantization still cannot get rid of fixed-point multiplication, while BENN supports bitwise operations. BNNs have also been shown to run faster on FPGAs than on modern CPUs [63, 18]. Secondly, the complexity of a multiplier is known to be proportional to the square of the bit width, so BENN simplifies the hardware design. Thirdly, BENN can use spike signals on chip instead of keeping the signal real-valued all the time, which can save a lot of energy. Finally, unlike recent literature requiring quadratic time to compute, BENN can be better parallelized on chip due to its linear time complexity.
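The bitwise-operation argument can be illustrated with the standard XNOR-and-popcount formulation of a binary dot product. The sketch below is only an illustration in plain Python, assuming vectors with entries in {-1, +1} packed into integers; the helper names `pack_bits` and `binary_dot` are hypothetical and do not come from any BNN library.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("expected only +1 or -1 entries")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors given as packed ints."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: bit is 1 where the signs match
    matches = bin(agree).count("1")     # popcount of the agreeing positions
    # Each match contributes +1 and each mismatch -1 to the dot product.
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
dot_bitwise = binary_dot(pack_bits(a), pack_bits(b), len(a))
dot_reference = sum(x * y for x, y in zip(a, b))
```

No multiplier is involved: the whole inner product reduces to one XNOR, one popcount, and one shift-and-subtract, which is what makes the binary case cheap in digital logic.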
Current limitations: It is well known that ensemble methods can overfit, and we observed similar problems on CIFAR-10 and ImageNet as the number of ensembles keeps increasing. An interesting next step is to analyze the properties of the decision boundary of BENN on different datasets and track its evolution in high-dimensional feature space. Also, training takes longer if many ensembles are needed (especially on large datasets like ImageNet), which slows down design iterations. Finally, BENN needs to be further optimized for large networks such as AlexNet and ResNet in order to show its full power, for example by picking the best ensemble rule and base classifier.
8 Conclusion and Future Work
In this paper, we proposed BENN, a novel neural network architecture that marries BNN with ensemble methods. The experiments showed a large performance gain in terms of accuracy, robustness, and stability. Our experiments also reveal some insights about trade-offs among bit width, network size, number of ensembles, etc. We believe that by leveraging specialized hardware such as FPGAs, BENN can be a new dawn for deploying large DNNs on mobile and embedded systems. This work also indicates that the properties of a single BNN are still essential, so progress is needed in both directions. In the future we will explore the power of BENN to reveal more insights about network bit representation and minimal network architecture (e.g., combining BENN with pruning), BENN and hardware co-optimization, and the statistics of BENN's decision boundary.
References
 [1] A. G. Anderson and C. P. Berg. The high-dimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199, 2017.
 [2] S. Anwar, K. Hwang, and W. Sung. Fixed point optimization of deep convolutional neural networks for object recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 1131–1135. IEEE, 2015.
 [3] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

 [4] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
 [5] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
 [6] L. Breiman. Bias, variance, and arcing classifiers. 1996.
 [7] P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, pages 477–505, 2007.
 [8] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. arXiv preprint arXiv:1702.00953, 2017.
 [9] J. G. Carney, P. Cunningham, and U. Bhagwan. Confidence and prediction intervals for neural network ensembles. In Neural Networks, 1999. IJCNN’99. International Joint Conference on, volume 2, pages 1215–1218. IEEE, 1999.
 [10] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
 [11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [13] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li. Gated xnor networks: Deep neural networks with ternary weights and activations under a unified discretization framework. arXiv preprint arXiv:1705.09283, 2017.
 [14] P. Domingos. A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning, pages 231–238, 2000.

 [15] Y. Freund and R. E. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pages 23–37. Springer, 1995.
 [16] Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In Icml, volume 96, pages 148–156. Bari, Italy, 1996.

 [17] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000.
 [18] C. Fu, S. Zhu, H. Su, C.-E. Lee, and J. Zhao. Towards fast and energy-efficient binarized neural network inference on fpga. arXiv preprint arXiv:1810.02068, 2018.
 [19] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

 [20] G. Govindu, L. Zhuo, S. Choi, and V. Prasanna. Analysis of high-performance floating-point arithmetic on fpgas. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, page 149. IEEE, 2004.
 [21] Y. Guo, A. Yao, H. Zhao, and Y. Chen. Network sketching: Exploiting binary structure in deep cnns. arXiv preprint arXiv:1706.02021, 2017.
 [22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
 [23] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [24] S. Han, Z. Meng, A.S. Khan, and Y. Tong. Incremental boosting convolutional neural network for facial action unit recognition. In Advances in Neural Information Processing Systems, pages 109–117, 2016.
 [25] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [26] T. Hastie, S. Rosset, J. Zhu, and H. Zou. Multiclass adaboost. Statistics and its Interface, 2(3):349–360, 2009.
 [27] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [28] G. Hinton. Neural networks for machine learning. In Coursera, 2012.
 [29] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
 [30] F. Huang, J. Ash, J. Langford, and R. Schapire. Learning deep resnet blocks sequentially using boosting theory. arXiv preprint arXiv:1706.04964, 2017.
 [31] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
 [32] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 [33] K. Hwang and W. Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pages 1–6. IEEE, 2014.
 [34] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
 [35] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [36] M. Kim and P. Smaragdis. Bitwise neural networks. arXiv preprint arXiv:1601.06071, 2016.
 [37] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [38] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [39] F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
 [40] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pages 5813–5823, 2017.
 [41] D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
 [42] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
 [43] X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 344–352, 2017.
 [44] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
 [45] M. Moghimi, S. J. Belongie, M. J. Saberian, J. Yang, N. Vasconcelos, and L.-J. Li. Boosted convolutional neural networks. In BMVC, 2016.
 [46] J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902, 2016.
 [47] N. C. Oza and S. Russell. Online ensemble learning. University of California, Berkeley, 2001.
 [48] E. Park, J. Ahn, and S. Yoo. Weighted-entropy-based quantization for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [49] A. Polino, R. Pascanu, and D. Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
 [50] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [51] R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
 [52] R. E. Schapire. Explaining adaboost. In Empirical inference, pages 37–52. Springer, 2013.
 [53] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine learning, 37(3):297–336, 1999.

 [54] D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, pages 963–971, 2014.
 [55] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [56] W. Sung, S. Shin, and K. Hwang. Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488, 2015.
 [57] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. Cvpr, 2015.
 [58] W. Tang, G. Hua, and L. Wang. How to train a compact binary neural network with high accuracy? In AAAI, pages 2625–2631, 2017.
 [59] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
 [60] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
 [61] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
 [62] H. Xu and S. Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
 [63] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang. Accelerating binarized convolutional neural networks with software-programmable fpgas. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 15–24. ACM, 2017.
 [64] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
 [65] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [66] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
9 Supplementary Material: Detailed Analysis on DNN, BNN, and BENN
Given a full-precision real-valued DNN with a set of parameters , a BNN with binarized parameters , input vector (after Batch Normalization) and perturbation , and a BENN with ensembles, we want to compare their robustness w.r.t. the input perturbation. Here we analyze the variance of the output change before and after perturbation, which echoes Eq. 1 in Sec. 3 of the main paper. This is because the output change has zero mean, so its variance reflects the distribution of the output variation. More specifically, a larger variance means a larger variation of the output w.r.t. the input perturbation.
Assume are the outputs before the nonlinear activation function of a single neuron in a one-layer network. Then the output variation of the real-valued DNN is:
whose distribution has variance , where denotes the number of input connections for this neuron and denotes the inner product. This is because the variance of a sum of independent random variables (arising from the inner product ) is the sum of their variances. Some modern nonlinear activation functions such as ReLU do not change the ordering of variances (i.e., if , then ), so we omit them from the analysis to keep it simple.
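The variance statement for the real-valued case can be checked with a short derivation. The notation below is an assumption introduced only for illustration: $w \in \mathbb{R}^n$ is the weight vector of the neuron, $x$ is its input, $n$ is the number of input connections, and the perturbation entries $\Delta x_i$ are taken to be independent with zero mean and variance $\sigma^2$.

```latex
\begin{align*}
\Delta y &= w \cdot (x + \Delta x) - w \cdot x
          = w \cdot \Delta x
          = \sum_{i=1}^{n} w_i \, \Delta x_i, \\
\mathbb{E}[\Delta y] &= \sum_{i=1}^{n} w_i \, \mathbb{E}[\Delta x_i] = 0, \\
\operatorname{Var}(\Delta y) &= \sum_{i=1}^{n} w_i^2 \operatorname{Var}(\Delta x_i)
          = \sigma^2 \sum_{i=1}^{n} w_i^2 .
\end{align*}
```

Under these assumptions the output change has zero mean, and its variance is a sum of $n$ per-connection terms, so it grows linearly with the number of input connections when the weights have comparable magnitude.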
9.1 Activation Binarization
Suppose is real-valued and only the input is binarized (denoted as ), where the activation binarization (-1 and +1) has threshold 0. Then the output variation is:
whose distribution has variance . This is because so the inner product is just the summation of independent distributions, each having variance . Note that only has three possible values, namely 0, -2 and +2. We compute each of them as follows:
and its variance can be computed by:
since . Unfortunately, this integral is too complicated to solve analytically, so we use a numerical method to obtain . Therefore, the variance is:
where () and () can be found in Table 7. When , the robustness of BNN is worse than that of DNN. As for BENN-Bagging with () ensembles, the output change has variance:
thus BENN-Bagging has better robustness than BNN. If , then BENN-Bagging can have even better robustness than DNN.
| Perturbation std. dev. | B | R |
| --- | --- | --- |
| 1.5 | 1.25 | 2.25 |
| 1.0 | 1.0 | 1.0 |
| 0.5 | 0.59 | 0.25 |
| 0.1 | 0.13 | 0.01 |
| 0.01 | 0.013 | 0.0001 |
| 0.001 | 0.0013 | 0.000001 |
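The values in Table 7 can be cross-checked numerically. The sketch below assumes a standard normal input x ~ N(0, 1) (plausible after Batch Normalization) and an independent perturbation dx ~ N(0, sigma^2); these distributions and the function names are assumptions made for this illustration. Under them, sign(x + dx) - sign(x) takes the values -2, 0, +2, has zero mean, and its variance equals 4 times the probability of a sign flip, which for two jointly Gaussian variables evaluates to 4 * arctan(sigma) / pi. The Monte Carlo estimate is included as an independent check of that expression.

```python
import math
import random


def flip_variance_closed_form(sigma):
    """Var[sign(x + dx) - sign(x)] for independent x ~ N(0, 1), dx ~ N(0, sigma^2)."""
    # P(sign flip) = arccos(rho) / pi with rho = 1 / sqrt(1 + sigma^2),
    # which simplifies to arctan(sigma) / pi; each flip contributes (+-2)^2 = 4.
    return 4.0 * math.atan(sigma) / math.pi


def flip_variance_monte_carlo(sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the same variance (the mean is zero by symmetry)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        dx = rng.gauss(0.0, sigma)
        diff = math.copysign(1.0, x + dx) - math.copysign(1.0, x)
        total += diff * diff
    return total / n_samples


# Compare the binarized-activation variance with the real-valued one (sigma^2).
comparison = [
    (sigma, flip_variance_closed_form(sigma), sigma ** 2)
    for sigma in (1.5, 1.0, 0.5, 0.1, 0.01, 0.001)
]
```

With these assumptions the closed-form values agree with the middle column of the table to the printed precision, and the binarized variance exceeds sigma^2 exactly when sigma is below 1, which is the regime in which the text states that BNN is less robust than DNN.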
9.2 Weight Binarization
If we binarize to but keep the activation real-valued, the output variation follows:
with variance . Thus whether weight binarization hurts robustness depends on whether holds. In particular, the robustness will not decrease if . BENN-Bagging has variance . So if , then BENN-Bagging is better than DNN.
9.3 Binarization of Both Weight and Activation
If both activation and weight are binarized (denoted as ), the output variation:
has variance , which is just the combination of Sec. 9.1 and Sec. 9.2. BENN-Bagging has variance , which is more robust than DNN when .
The above analysis results in the following theorem:
Theorem 1
Given an activation-binarized, weight-binarized, or extremely binarized one-layer network as introduced above, with input perturbation , the output variation obeys:

If only the activation is binarized, BNN has worse robustness than DNN when perturbation . BENN-Bagging is guaranteed to be more robust than BNN. BENN-Bagging with ensembles is more robust than DNN when .

If only the weight is binarized, BNN has worse robustness than DNN when . BENN-Bagging is guaranteed to be more robust than BNN. BENN-Bagging with ensembles is more robust than DNN when .

If both weight and activation are binarized, BNN has worse robustness than DNN when and perturbation . BENN-Bagging is guaranteed to be more robust than BNN. BENN-Bagging with ensembles is more robust than DNN when .
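The bagging statements in Theorem 1 rest on the usual variance-reduction effect of averaging. The sketch below illustrates it under an idealized assumption that is not guaranteed in practice: the output variations of the k ensemble members are independent and identically distributed with zero mean, so their average has 1/k of the single-member variance. The function names and the Gaussian stand-in for a member's output variation are hypothetical.

```python
import random
import statistics


def averaged_variation_variance(k, member_std=1.0, n_trials=20_000, seed=0):
    """Empirical variance of the mean of k i.i.d. zero-mean Gaussian variations."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, member_std) for _ in range(k)) / k
        for _ in range(n_trials)
    ]
    return statistics.pvariance(means)


def smallest_ensemble_size(bnn_variance, dnn_variance):
    """Smallest k with bnn_variance / k < dnn_variance (idealized i.i.d. members)."""
    if bnn_variance <= 0 or dnn_variance <= 0:
        raise ValueError("variances must be positive")
    # k must be strictly larger than the ratio bnn_variance / dnn_variance.
    return int(bnn_variance // dnn_variance) + 1


single = averaged_variation_variance(1)
bagged = averaged_variation_variance(4)
# Per-connection variances 0.59 (binarized) and 0.25 (real-valued) from Table 7.
needed = smallest_ensemble_size(0.59, 0.25)
```

In this idealized setting, averaging 4 members cuts the variance to roughly a quarter, and three members would already bring the binarized per-connection variance of 0.59 below the real-valued 0.25. Correlated members would reduce the variance by less.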
9.4 Multiple Layers Scenario
All of the above analysis is for one-layer models before and after the activation function. The same conclusion can be extended to the multi-layer scenario with Theorem 2.
Theorem 2
Given an activation-binarized, weight-binarized, or extremely binarized L-layer network (without batch normalization for generalization) as introduced above, with input perturbation , the accumulated perturbation of the ultimate network output obeys:

For DNN, the ultimate output variation is .

For activation-binarized BNN, the ultimate output variation is .

For weight-binarized BNN, the ultimate output variation is .

For extremely binarized BNN, the ultimate output variation is .

Theorem 1 also holds in the multi-layer scenario.
The effect of variance reduction in boosting algorithms is not yet fully understood, and some debate still exists in the literature [7, 17], given that the classifiers are not independent of each other. However, our experiments show that BENN-Boosting can also reduce variance in our setting, which is consistent with [16, 17]. The theoretical analysis of BENN-Boosting is left for future work.
If we switch and , and replace the input perturbation with a parameter perturbation in the above analysis, then the same conclusion holds for parameter perturbation (the stability issue). To sum up, BNN can often be worse than DNN in terms of robustness and stability, and our method BENN can cure these problems.
10 Supplementary Material: Training Process of BENN
11 Supplementary Material: Network Architectures Used in the Paper
In this section we provide network architectures used in the experiments of our main paper.
11.0.1 Self-Designed Network-In-Network (NIN)

| Layer Index | Type | Parameters |
| --- | --- | --- |
| 1 | Conv | |
| 2 | BatchNorm | ε: 0.0001, Momentum: 0.1 |
| 3 | ReLU | - |
| 4 | BatchNorm | ε: 0.0001, Momentum: 0.1 |
| 5 | Dropout | p: 0.5 |
| 6 | Conv | Depth: 96, Kernel Size: 1x1, Stride: 1, Padding: 0 |
| 7 | ReLU | - |
| 8 | MaxPool | Kernel: 3x3, Stride: 2, Padding: 1 |
| 9 | BatchNorm | ε: 0.0001, Momentum: 0.1 |
| 10 | Dropout | p: 0.5 |
| 11 | Conv | Depth: 192, Kernel Size: 5x5, Stride: 1, Padding: 2 |
| 12 | ReLU | - |
| 13 | BatchNorm | ε: 0.0001, Momentum: 0.1 |
| 14 | Dropout | p: 0.5 |
| 15 | Conv | Depth: 192, Kernel Size: 1x1, Stride: 1, Padding: 0 |
| 16 | ReLU | - |
| 17 | AvgPool | Kernel: 3x3, Stride: 2, Padding: 1 |
| 18 | BatchNorm | ε: 0.0001, Momentum: 0.1 |
| 19 | Dropout | p: 0.5 |
| 20 | Conv | Depth: 192, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 21 | ReLU | - |
| 22 | BatchNorm | ε: 0.0001, Momentum: 0.1 |
| 23 | Conv | Depth: 192, Kernel Size: 1x1, Stride: 1, Padding: 0 |
| 24 | ReLU | - |
| 25 | BatchNorm | ε: 0.0001, Momentum: 0.1 |
| 26 | Conv | Depth: 192, Kernel Size: 1x1, Stride: 1, Padding: 0 |
| 27 | ReLU | - |
| 28 | AvgPool | Kernel: 8x8, Stride: 1, Padding: 0 |
| 29 | FC | Width: 1000 |
11.0.2 AlexNet
| Layer Index | Type | Parameters |
| --- | --- | --- |
| 1 | Conv | Depth: 96, Kernel Size: 11x11, Stride: 4, Padding: 0 |
| 2 | ReLU | - |
| 3 | MaxPool | Kernel: 3x3, Stride: 2 |
| 4 | BatchNorm | - |
| 5 | Conv | Depth: 256, Kernel Size: 5x5, Stride: 1, Padding: 2 |
| 6 | ReLU | - |
| 7 | MaxPool | Kernel: 3x3, Stride: 2 |
| 8 | BatchNorm | - |
| 9 | Conv | Depth: 384, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 10 | ReLU | - |
| 11 | Conv | Depth: 384, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 12 | ReLU | - |
| 13 | Conv | Depth: 256, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 14 | ReLU | - |
| 15 | MaxPool | Kernel: 3x3, Stride: 2 |
| 16 | Dropout | p: 0.5 |
| 17 | FC | Width: 4096 |
| 18 | ReLU | - |
| 19 | Dropout | p: 0.5 |
| 20 | FC | Width: 4096 |
| 21 | ReLU | - |
| 22 | FC | Width: 1000 |
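As a sanity check on the AlexNet table, the spatial size after each convolution and pooling layer can be traced with the standard output-size formula floor((n + 2p - k) / s) + 1. The sketch below makes two assumptions that are not stated in the table: a 227x227 RGB input, and zero padding for the max-pooling layers. The names `conv_out`, `ALEXNET_SPATIAL_LAYERS`, and `trace_shapes` are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square conv/pool window (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# (type, depth, kernel, stride, padding), following the rows of the AlexNet table.
# Pooling padding is assumed to be 0 because the table does not list it.
ALEXNET_SPATIAL_LAYERS = [
    ("conv", 96, 11, 4, 0),
    ("pool", None, 3, 2, 0),
    ("conv", 256, 5, 1, 2),
    ("pool", None, 3, 2, 0),
    ("conv", 384, 3, 1, 1),
    ("conv", 384, 3, 1, 1),
    ("conv", 256, 3, 1, 1),
    ("pool", None, 3, 2, 0),
]


def trace_shapes(input_size=227, input_channels=3):
    """Return (type, channels, side length) after every conv/pool layer."""
    size, channels = input_size, input_channels
    shapes = []
    for kind, depth, kernel, stride, padding in ALEXNET_SPATIAL_LAYERS:
        size = conv_out(size, kernel, stride, padding)
        if kind == "conv":
            channels = depth  # pooling keeps the channel count unchanged
        shapes.append((kind, channels, size))
    return shapes


shapes = trace_shapes()
_, final_channels, final_size = shapes[-1]
fc_input_features = final_channels * final_size * final_size
```

Under these assumptions the feature map entering the first fully connected layer is 256 x 6 x 6, i.e. 9216 input features for the FC layer of width 4096.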
11.0.3 ResNet-18

| Layer Index | Type | Parameters |
| --- | --- | --- |
| 1 | Conv | Depth: 64, Kernel Size: 7x7, Stride: 2, Padding: 3 |
| 2 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 3 | ReLU | - |
| 4 | MaxPool | Kernel: 3x3, Stride: 2 |
| 5 | Conv | Depth: 64, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 6 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 7 | ReLU | - |
| 8 | Conv | Depth: 64, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 9 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 10 | Conv | Depth: 64, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 11 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 12 | ReLU | - |
| 13 | Conv | Depth: 64, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 14 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 15 | Conv | Depth: 128, Kernel Size: 3x3, Stride: 2, Padding: 1 |
| 16 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 17 | ReLU | - |
| 18 | Conv | Depth: 128, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 19 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 20 | Conv | Depth: 128, Kernel Size: 1x1, Stride: 2 |
| 21 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 22 | Conv | Depth: 128, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 23 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 24 | ReLU | - |
| 25 | Conv | Depth: 128, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 26 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 27 | Conv | Depth: 256, Kernel Size: 3x3, Stride: 2, Padding: 1 |
| 28 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 29 | ReLU | - |
| 30 | Conv | Depth: 256, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 31 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 32 | Conv | Depth: 256, Kernel Size: 1x1, Stride: 2 |
| 33 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 34 | Conv | Depth: 256, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 35 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 36 | ReLU | - |
| 37 | Conv | Depth: 256, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 38 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 39 | Conv | Depth: 512, Kernel Size: 3x3, Stride: 2, Padding: 1 |
| 40 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 41 | ReLU | - |
| 42 | Conv | Depth: 512, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 43 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 44 | Conv | Depth: 512, Kernel Size: 1x1, Stride: 2 |
| 45 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 46 | Conv | Depth: 512, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 47 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 48 | ReLU | - |
| 49 | Conv | Depth: 512, Kernel Size: 3x3, Stride: 1, Padding: 1 |
| 50 | BatchNorm | ε: 0.00001, Momentum: 0.1 |
| 51 | AvgPool | - |
| 52 | FC | Width: 1000 |