## 1 Introduction

Deep convolutional neural networks (DCNNs) have been successfully demonstrated on many computer vision tasks such as object detection and image classification. DCNNs deployed in practical environments, however, still face many challenges. They usually involve millions of parameters and billions of FLOPs during computation. This is critical because models of vision applications may consume very large amounts of memory and computation, making them impractical for most embedded platforms.

Binary filters, used in place of full-precision filter weights, have been investigated as a way to compress DCNNs and address these problems. Many works quantize the weights of a network while keeping the activations (feature maps) as 32-bit floating-point values [zhou2017incremental, zhu2016trained, Wang_2018_CVPR]. Although this scheme loses little accuracy relative to its full-precision counterpart, it still needs a substantial amount of computation to handle the full-precision activations. Therefore, the so-called 1-bit DCNNs, which train networks with both 1-bit quantized weights and 1-bit activations, have become more promising and significant in the field of DCNN compression. As presented in [rastegari2016xnor], by reconstructing the full-precision filters with a single scaling factor, XNOR provides an efficient implementation of convolutional operations. More recently, Bi-Real Net [liu2018bi] explores a new variant of residual structure to preserve the real activations before the sign function. The researchers in [hou2016loss] propose a new value approximation method that considers the effect of binarization on the loss to further obtain binarized weights. PCNN [Gu2019P] learns a set of diverse quantized kernels by exploiting multiple projections with discrete back propagation.

The investigation into prior art reveals that how to use the full-precision models is the key issue in obtaining optimized binary CNNs (BCNNs). Most existing methods use the full-precision models as an initialization [rastegari2016xnor] [liu2018bi], or for kernel approximation [Gu2019P] [rastegari2016xnor]. Besides, knowledge distillation uses a teacher model (e.g., a full-precision model) to guide the quantization of the network [Polino2018Model, zhuang2018towards, Mishra2017Apprentice]. While these methods generally use a regularization term to minimize the difference between the student’s and teacher’s posterior probabilities or intermediate feature representations, they fail to consider the full-precision feature maps (activations) in a comprehensive way. This might be the reason why knowledge distillation methods have not yet been employed to obtain the extreme 1-bit CNNs. To narrow the performance gap between a BCNN and its full-precision model, we propose that the full-precision kernels and feature maps should be considered in a more comprehensive way, in order to fully exploit the multi-cue information.

In this paper, we introduce a rectified binary convolutional network (RBCN) to obtain an optimized BCNN, in which a novel learning architecture combines the full-precision feature maps and the kernel approximation in an end-to-end manner. Based on the powerful probability fitting ability of the generative adversarial network (GAN), we find that training a BCNN with a GAN yields better performance by fitting the distribution of the feature maps of the 1-bit binary network to that of the full-precision network. In this way, the GAN is used to distill the RBCN from the full-precision network by exploiting its full-precision feature maps. To the best of our knowledge, we are the first to use a GAN to do binary approximation of the full-precision model. The whole process is illustrated in Fig. 1, where the full-precision model and the 1-bit binary model (generator) respectively provide “real” and “fake” feature maps to the discriminators. The discriminators try to distinguish the “real” from the “fake”, and the generator tries to make the discriminators unable to work well. By repeating this process, the multi-cue information (full-precision kernels and feature maps) is sufficiently employed in the training process to enhance the representational ability of the 1-bit binary model. Besides, kernel (filter) approximation (RBConv in Fig. 1) is integrated in the framework. Also, multiple discriminators are used to further improve the performance of RBCN. This process involving the GAN and the kernel approximation is a rectified process, which can lead to a unique architecture with a more precise estimation of the full-precision model. The contributions of this paper are summarized as follows.

(1) A novel BCNN learning architecture, referred to as rectified binary convolutional network (RBCN), is proposed, which employs the full-precision kernels and feature maps to rectify the binarization process in a comprehensive framework.

(2) To the best of our knowledge, we are the first to use a GAN to calculate a BCNN. Besides, we discover that using multiple discriminators in the GAN can significantly improve the performance of the 1-bit binary model.

(3) Extensive experiments demonstrate the superior performance of the proposed RBCNs over state-of-the-art BCNNs on the object classification and tracking tasks.

loss function | binarized filters | feature maps from RBCN to |
---|---|---|
learned filters | learnable matrices | feature maps from the full-precision model |
discriminators | gradient of | |
filter index | learning rate | feature maps before and after convolution in RBCN |
iteration | number of layers | filters of the discriminators |
layer index | gradient of | |

## 2 Rectified Binary Convolutional Networks (RBCNs)

We design RBCNs via kernel approximation and training with GANs to rectify BCNNs in a unified framework.
During this process, the multi-cue information of the full-precision feature maps and kernels (in this paper, the terms “filter” and “kernel” are interchangeable) is exploited to improve the performance degraded by binarization.
The rectified convolutional layers are generic and flexible, which can be easily incorporated into existing CNNs, such as WideResNets and ResNets.
First of all, Table 1 gives the main notation used in this paper.

### 2.1 Loss Function of RBCNs

The rectified process combines the full-precision kernels and feature maps to rectify the binarization process. It includes kernel approximation and adversarial learning. This learnable kernel approximation can lead to a unique architecture with a more precise estimation of the original convolutional filters through minimizing a kernel loss. The discriminators with filters are introduced to distinguish the feature maps of the full-precision model from those of RBCN. The generator (RBCN) with filters and learnable matrices is learned together with by using the knowledge from the supervised feature maps . Therefore, , and are learned by solving the following optimization problem:

(1) |

where is the adversarial loss:

(2) |

where consists of four basic blocks, each of which has a linear layer and a LeakyReLU layer.
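The discriminator structure described above (four blocks, each a linear layer followed by a LeakyReLU) can be sketched in plain Python as follows. This is only an illustrative sketch: the layer widths, the LeakyReLU slope of 0.2, the random initialization, and the names `make_discriminator` and `discriminator_forward` are assumptions, not details given in the paper.

```python
import random


def leaky_relu(x, slope=0.2):
    # LeakyReLU: identity for positive inputs, small slope otherwise.
    # The slope value is an assumption for illustration.
    return x if x > 0 else slope * x


def linear(vec, weight, bias):
    # Fully connected layer: one dot product per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def make_discriminator(sizes, seed=0):
    # Build one (weight, bias) pair per block; sizes are hypothetical.
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_out)]
        layers.append((weight, [0.0] * n_out))
    return layers


def discriminator_forward(layers, features):
    # Apply the blocks in sequence: linear layer, then LeakyReLU.
    out = list(features)
    for weight, bias in layers:
        out = [leaky_relu(v) for v in linear(out, weight, bias)]
    return out


# Five sizes give four blocks; a flattened feature map of length 16 is assumed.
layers = make_discriminator([16, 8, 8, 4, 1])
score = discriminator_forward(layers, [0.5] * 16)
```

In practice one such discriminator would be attached to each feature map being compared, and its output would be turned into a real/fake decision by the adversarial loss.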

In addition, is the kernel loss between the learned full-precision filters and the binarized filters , which is expressed by MSE:

(3) |
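As a rough illustration of an MSE-style kernel loss, the sketch below measures the squared difference between a full-precision filter and its binarized version rescaled by a learnable matrix. The element-wise rescaling, the factor of one half, the `weight` coefficient, and the function names are assumptions made for the example; the exact form is the one given in Equ. 3.

```python
def sign(x):
    # Binarize to +1 or -1; mapping 0 to +1 is an assumed convention.
    return 1.0 if x >= 0 else -1.0


def kernel_loss(full_precision, scale, weight=1.0):
    # 0.5 * weight * sum_i (w_i - c_i * sign(w_i))^2 over a flattened filter.
    total = 0.0
    for w, c in zip(full_precision, scale):
        residual = w - c * sign(w)
        total += residual * residual
    return 0.5 * weight * total


w = [0.5, -0.25, 0.75, -1.0]      # flattened full-precision filter
c_exact = [0.5, 0.25, 0.75, 1.0]  # scale that reproduces |w| exactly
c_ones = [1.0, 1.0, 1.0, 1.0]     # no rescaling at all

loss_exact = kernel_loss(w, c_exact)  # 0.0
loss_ones = kernel_loss(w, c_ones)    # 0.4375
```

The loss is zero when the rescaled binary filter reproduces the full-precision filter, and grows as the two drift apart.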

Finally, is a traditional problem-dependent loss such as the softmax loss.

For simplicity, the update of the discriminators is omitted in the following description until Algorithm 1. Besides, we find that the in Equ. 2 has little effect during training and so it is omitted too. Then, based on the Lagrangian method, the optimization problem in Equ. 1 is rewritten as:

(4) |

In Equ. 4, the target is to obtain , and with fixed, which is why the term in Equ. 2 can be ignored. The update of can be found in Algorithm 1. The advantage of our formulation in Equ. 4 lies in that the loss function is trainable, meaning that it can be easily incorporated into existing learning frameworks.

### 2.2 Forward Propagation in RBCNs

In RBCNs, a binary filter is calculated as:

(5) |

where is the corresponding full-precision filter, and the values of are or . Both and are jointly obtained in the end-to-end learning.
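A minimal sketch of this binarization step, assuming the binary filter is obtained with an element-wise sign function (the text states only that the resulting values are +1 or -1); the name `binarize` and the convention of mapping 0 to +1 are illustrative assumptions:

```python
def binarize(filter_weights):
    # Map every full-precision weight to +1 or -1 by its sign.
    # Sending exact zeros to +1 is an assumed convention.
    return [[1 if w >= 0 else -1 for w in row] for row in filter_weights]


full_precision = [[0.3, -0.7],
                  [0.0, -0.1]]
binary = binarize(full_precision)  # [[1, -1], [1, -1]]
```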

In RBCNs, the convolution is implemented based on and to calculate the feature maps :

(6) |

where denotes the convolution operation implemented as a new module, and are the feature maps before and after the convolution, respectively, and is the element-by-element product. Note that is binary after the sign operation (see Fig. 1), and is actually , which will be elaborated at the end of Section 2.3.
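The forward pass described above can be sketched for a single channel as follows: the input feature map is binarized by the sign operation, the binary filter is multiplied element-wise by the learnable matrix, and an ordinary convolution is applied. The single-channel "valid" convolution, the example values, and the names `rbconv` and `conv2d_valid` are assumptions chosen to keep the sketch short.

```python
def sign(x):
    # +1 for non-negative inputs, -1 otherwise (assumed convention at 0).
    return 1.0 if x >= 0 else -1.0


def conv2d_valid(image, kernel):
    # Plain single-channel 2D cross-correlation without padding.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def rbconv(features, full_precision_filter, scale_matrix):
    # 1) Binarize the incoming feature map.
    binary_input = [[sign(v) for v in row] for row in features]
    # 2) Binarize the filter and rescale it element-wise.
    rectified = [[sign(w) * c for w, c in zip(w_row, c_row)]
                 for w_row, c_row in zip(full_precision_filter, scale_matrix)]
    # 3) Convolve the binary input with the rectified filter.
    return conv2d_valid(binary_input, rectified)


features = [[0.2, -0.4, 0.1],
            [-0.3, 0.5, -0.6],
            [0.7, -0.8, 0.9]]
w = [[0.3, -0.2],
     [-0.5, 0.4]]
c = [[0.5, 0.5],
     [0.5, 0.5]]
out = rbconv(features, w, c)  # [[2.0, -2.0], [-2.0, 2.0]]
```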

### 2.3 Backward Propagation in RBCNs

In RBCNs, the parameters to be learned and updated are the full-precision filters and the learnable matrices . These two sets of parameters are jointly learned. In each convolutional layer, an RBCN updates first and then .

First, we update the full-precision filters . Let be the gradient of the full-precision filter . During backpropagation, the gradients pass to first and then to . Thus:

(7) |

where

(8) |

which is an approximation of the Dirac delta function [liu2018bi]. Furthermore,
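Since the sign function has zero gradient almost everywhere, backpropagation needs a surrogate derivative. The sketch below uses the piecewise-linear surrogate from Bi-Real Net [liu2018bi], which is the derivative of a piecewise-polynomial approximation of sign; that Equ. 8 takes exactly this form is an assumption, and the function names are illustrative.

```python
def sign_grad_approx(w):
    # Triangle-shaped surrogate for d sign(w) / dw:
    # peaks at 2 for w = 0 and falls linearly to 0 at |w| = 1.
    if -1.0 <= w < 0.0:
        return 2.0 + 2.0 * w
    if 0.0 <= w < 1.0:
        return 2.0 - 2.0 * w
    return 0.0


def backprop_through_sign(upstream_grads, weights):
    # Chain rule: gradient w.r.t. the full-precision weight is the
    # upstream gradient times the surrogate derivative of sign.
    return [g * sign_grad_approx(w) for g, w in zip(upstream_grads, weights)]


grads = backprop_through_sign([1.0, 2.0, 1.0], [0.5, 2.0, -0.5])
# grads == [1.0, 0.0, 1.0]
```

Weights far from zero (here 2.0) receive no gradient through the sign, which concentrates updates on weights near the binarization threshold.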

(9) |

and

(10) |

where is a learning rate. Then:

(11) |

(12) |

We further update the learnable matrix with fixed. Let be the gradient of . Then we have:

(13) |

and

(14) |

where is another learning rate. Further,

(15) |

(16) |

The above derivations show that the rectified process is trainable in an end-to-end manner. The complete training process is summarized in Algorithm 1, including the update of the discriminators. Besides, in the implementation, the batch normalization (BN) layers are updated with and fixed after each epoch.

We note that in our implementation, the value of will be replaced by its average during the forward process, resulting in a new matrix denoted by (its elements are all equal). By doing so, only a scalar instead of a matrix is involved in the convolution, which speeds up the calculation.
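The effect of this averaging can be checked on a single convolution window: once every element of the learnable matrix equals the same value, the scale factors out of the sum, so the binary dot product is computed first and multiplied by one scalar afterwards. The values and helper names below are hypothetical.

```python
def average_matrix(scale_matrix):
    # Mean of all elements of the learnable matrix.
    values = [c for row in scale_matrix for c in row]
    return sum(values) / len(values)


def window_response(patch, kernel):
    # Dot product between one input patch and one filter.
    return sum(p * k
               for p_row, k_row in zip(patch, kernel)
               for p, k in zip(p_row, k_row))


binary_filter = [[1.0, -1.0], [-1.0, 1.0]]
patch = [[1.0, -1.0], [1.0, 1.0]]
scale = [[0.25, 0.75], [0.5, 0.5]]

mean = average_matrix(scale)  # 0.5

# Route 1: build the constant matrix and rescale the filter element-wise.
rescaled = [[b * mean for b in row] for row in binary_filter]
full = window_response(patch, rescaled)

# Route 2: binary dot product first, one scalar multiplication afterwards.
fast = mean * window_response(patch, binary_filter)
```

Both routes give the same response, but the second keeps the inner loop purely binary.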

## 3 Experiments

Our RBCNs are evaluated first on object classification using the MNIST [L1998Gradient], CIFAR10/100 [Krizhevsky2009Learning] and ILSVRC12 ImageNet [Russakovsky2015ImageNet] datasets, and then on object tracking. For object classification, WideResNet (WRN) [Zagoruyko2016Wide] and ResNet [He2016Deep] are employed as the backbone networks to build our RBCNs. Also, binarizing the neuron activations is carried out in all of our experiments.

### 3.1 Datasets and Implementation Details

#### MNIST.

The MNIST [L1998Gradient] dataset is composed of a training set of 60,000 and a testing set of 10,000 grayscale images of hand-written digits from 0 to 9.

#### CIFAR.

CIFAR10 [Krizhevsky2009Learning] is a natural image classification dataset containing a training set of 50,000 and a testing set of 10,000 32×32 color images across the following 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, while CIFAR100 consists of 100 classes.

#### ImageNet.

ImageNet object classification dataset [Russakovsky2015ImageNet] is more challenging due to its large scale and greater diversity. There are 1000 classes and 1.2 million training images and 50k validation images in it. We compare our method with the state-of-the-art on the ImageNet dataset, and we adopt ResNet18 to validate the superiority and effectiveness of RBCNs.

#### WRN Backbone.

WRN is a network structure similar to ResNet, with a depth factor that controls the expansion of the feature map depth dimension through 3 stages, within which the dimensions remain unchanged. For simplicity we fix the depth factor to 1. Each WRN has a parameter which indicates the channel dimension of the first stage, and we set it to 16, leading to a network structure ---. The training details are the same as in [Zagoruyko2016Wide]. and are set as 0.01 with a degradation of 10% for every 60 epochs before reaching the maximum epoch of 200 for CIFAR10/100. For example, WRN22 is a network with 22 convolutional layers and similarly for WRN40.

#### ResNet18 Backbone.

SGD is used as the optimization algorithm with a momentum of and a weight decay of 1e-4. and are set as 0.1 with a degradation of 10% for every 20 epochs before reaching the maximum epoch of 70 on ImageNet, while on CIFAR10/100, and are set as 0.01 with a degradation of 10% for every 60 epochs before reaching the maximum epoch of 200.
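The step-wise learning-rate schedules above can be sketched as a small helper. The phrase "a degradation of 10%" is read here as multiplying the rate by 0.1 at each step, which is an assumption; the decay factor is therefore left as a parameter, and the function name `step_lr` is illustrative.

```python
def step_lr(base_lr, epoch, step_size, factor=0.1):
    # Multiply the base rate by `factor` once per completed step.
    # factor=0.1 is an assumed reading of "a degradation of 10%".
    return base_lr * factor ** (epoch // step_size)


# ImageNet setting from the text: start at 0.1, step every 20 epochs, 70 epochs.
imagenet_rates = [step_lr(0.1, e, 20) for e in range(70)]

# CIFAR10/100 setting from the text: start at 0.01, step every 60 epochs, 200 epochs.
cifar_rates = [step_lr(0.01, e, 60) for e in range(200)]
```

Under this reading, the ImageNet run uses four distinct rates over 70 epochs and the CIFAR run uses four over 200 epochs.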

### 3.2 Ablation Study

In this section, we study the performance contributions of the components in RBCNs, which include kernel approximation, GAN, and the update of the BN layers. CIFAR100 and ResNet18 with different kernel stages are used in this experiment. The details are given below.

Model | Kernel Stage | Bi | R | R+G | R+G+B |
---|---|---|---|---|---|
RBCN | 32-32-64-128 | 54.92 | 56.54 | 59.13 | 61.64 |
RBCN | 32-64-128-256 | 63.11 | 63.49 | 64.93 | 65.38 |
RBCN | 64-64-128-256 | 63.81 | 64.13 | 65.02 | 66.27 |

1) We only replace the convolution in Bi-Real Net with our kernel approximation () and compare the results. As shown in the R column in Table 2, RBCN achieves a 1.62% accuracy improvement over Bi-Real Net (56.54% vs. 54.92%) using the same network structure as in ResNet18 with 32-32-64-128. This significant improvement verifies the effectiveness of the learnable matrices.

2) In RBCNs, if we use the GAN to help binarization, we can find a more significant improvement from 56.54% to 59.13% with the kernel stage of 32-32-64-128, which shows that the GAN can really enhance the binarized networks.

3) We find that a training trick can also improve RBCNs, which is to update the BN layers with and fixed after each epoch (line 17 in Algorithm 1). This trick improves RBCN by 2.51% (61.64% vs. 59.13%) on CIFAR100 with 32-32-64-128.

### 3.3 Accuracy Comparison with State-of-the-Art

Model | Kernel Stage | CIFAR-10 | CIFAR-100 |
---|---|---|---|
ResNet18 | 32-32-64-128 | 92.67 | 67.07 |
ResNet18 | 32-64-128-256 | 93.88 | 72.51 |
ResNet18 | 64-64-128-256 | 94.57 | 72.89 |
RBCN (ResNet18) | 32-32-64-128 | 89.03 | 61.09 |
RBCN (ResNet18) | 32-64-128-256 | 90.67 | 65.38 |
RBCN (ResNet18) | 64-64-128-256 | 90.40 | 66.27 |
WRN22 | 64-64-128-256 | 95.19 | 76.38 |
WRN40 | 64-64-128-256 | 94.92 | 74.91 |
RBCN (WRN22) | 64-64-128-256 | 93.28 | 72.06 |
RBCN (WRN40) | 64-64-128-256 | 93.69 | 73.08 |
XNOR (ResNet18) | 32-32-64-128 | 71.01 | 43.58 |
XNOR (WRN22) | 64-64-128-256 | 86.90 | 58.05 |
Bi-Real (ResNet18) | 32-32-64-128 | 85.34 | 54.92 |
Bi-Real (WRN22) | 64-64-128-256 | 90.65 | 68.51 |
PCNN (ResNet18) | 32-32-64-128 | 85.50 | 55.66 |
PCNN (WRN22) | 64-64-128-256 | 91.62 | 70.32 |
Scheme-A (ResNet18) | 32-64-128-256 | 75.45 | 46.32 |
Scheme-A (WRN22) | 64-64-128-256 | 87.83 | 59.54 |

#### CIFAR10/100.

The same parameter settings are used in RBCNs on both CIFAR10 and CIFAR100. We first compare our RBCNs with the original ResNet18 with different stage kernels, followed by a comparison with the original WRNs with the initial channel dimension in Table 3. Thanks to the rectified process, our results on both datasets are close to those of the full-precision networks ResNet18 and WRN40. Then, we compare our results with other state-of-the-art methods such as Bi-Real Net [liu2018bi], PCNN [Gu2019P], Scheme-A [Mishra2017Apprentice] and XNOR [rastegari2016xnor]. All these BCNNs have both binary filters and binary activations. On CIFAR100 with the ResNet18 backbone, RBCN gains 6.17% (61.09% vs. 54.92%) over Bi-Real Net, and the margin over XNOR in the same setting is larger (61.09% vs. 43.58%).

#### ImageNet.

Five state-of-the-art methods on ImageNet are chosen for comparison: Bi-Real Net [liu2018bi], BinaryNet [Courbariaux2016Binarized], XNOR [rastegari2016xnor], PCNN [Gu2019P] and ABC-Net [lin2017towards]. Again, these networks are representative methods of binarizing both network weights and activations and achieve state-of-the-art results. All the methods in Table 4 perform the binarization of ResNet18. The results in Table 4 are quoted directly from their papers, except that the result of BinaryNet is from [lin2017towards]. The comparison clearly indicates that the proposed RBCN outperforms the five binary networks by a considerable margin in terms of both the top-1 and top-5 accuracies. Specifically, for top-1 accuracy, RBCN outperforms BinaryNet and ABC-Net with a gap over 16%, achieves 8.3% improvement over XNOR, 3.1% over the very recent Bi-Real Net, and 2.2% over the latest PCNN. In Fig. 2, we plot the training and testing loss curves of XNOR and RBCN. It clearly shows that using our rectified process, RBCN converges faster than XNOR.

Model | Metric | Full-Precision | XNOR | ABC-Net | BinaryNet | Bi-Real | PCNN | RBCN |
---|---|---|---|---|---|---|---|---|
ResNet18 | Top-1 | 69.3 | 51.2 | 42.7 | 42.2 | 56.4 | 57.3 | 59.5 |
ResNet18 | Top-5 | 89.2 | 73.2 | 67.6 | 67.1 | 79.5 | 79.8 | 81.6 |

Dataset | Index | SiamFC | XNOR | RB-SF |
---|---|---|---|---|
GOT-10K | AO | 0.348 | 0.251 | 0.327 |
GOT-10K | SR | 0.383 | 0.230 | 0.343 |
OTB50 | Precision | 0.761 | 0.457 | 0.706 |
OTB50 | SR | 0.556 | 0.323 | 0.496 |
OTB100 | Precision | 0.808 | 0.541 | 0.786 |
OTB100 | SR | 0.602 | 0.394 | 0.572 |
UAV123 | Precision | 0.745 | 0.547 | 0.688 |
UAV123 | SR | 0.528 | 0.374 | 0.497 |

### 3.4 Experiments on Object Tracking

The key message conveyed in the proposed method is that although the conventional binary training method has a limited model capability, the proposed rectified process can lead to a powerful model. In this section, we show that this framework can also be used in object tracking. In particular, we consider the problem of tracking an arbitrary object in videos, where the object is identified solely by a rectangle in the first frame. For object tracking, it is necessary to update the weights of the network online, severely compromising the speed of the system. To directly apply the proposed framework to this application, we can construct a binary convolution with the same structure to reduce the convolution time. In this way, RBCN can be used to binarize the network further to guarantee the tracking performance.

In this paper, we use the SiamFC network as the backbone for object tracking. We binarize SiamFC as the Rectified Binary Convolutional SiamFC Network (RB-SF). We evaluate RB-SF on four datasets, GOT-10K [huang2018got], OTB50 [wu2013online], OTB100 [wu2015object], and UAV123 [mueller2016benchmark], using average overlap (AO), precision, and success rate (SR). The results are shown in Table 5. Our framework achieves about 7% AO improvement over XNOR on GOT-10K (0.327 vs. 0.251), with both using the same network architecture as SiamFC. Further, our framework brings so much benefit that RB-SF performs almost as well as the full-precision SiamFC network.

### 3.5 Efficiency Analysis

The memory usage is computed as the summation of 32 bits times the number of real-valued parameters and 1 bit times the number of binary parameters in the network. Further, we use FLOPs to measure the speed. The results are given in Table 6. The FLOPs are calculated as the number of real-valued floating-point multiplications plus 1/64 of the number of 1-bit multiplications [liu2018bi]. As shown in Table 6, the proposed RBCN, along with XNOR, reduces the memory usage of the full-precision ResNet18 by 11.10 times. For efficiency, both RBCN and XNOR gain speedup over ResNet18. Note that the computational and storage costs introduced by the learnable scalar are negligible.
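The two accounting rules above translate directly into code. The sketch below implements them as stated in the text; the parameter and multiplication counts in the example are hypothetical placeholders, not the ResNet18 figures from Table 6.

```python
def memory_bits(n_real_params, n_binary_params):
    # 32 bits per real-valued parameter plus 1 bit per binary parameter.
    return 32 * n_real_params + 1 * n_binary_params


def effective_flops(n_real_mults, n_binary_mults):
    # Real-valued multiplications plus 1/64 of the 1-bit multiplications.
    return n_real_mults + n_binary_mults / 64


# Hypothetical network: 1,000 real-valued and 64,000 binary parameters.
binary_model_bits = memory_bits(1_000, 64_000)   # 96000
full_model_bits = memory_bits(65_000, 0)         # 2080000
memory_saving = full_model_bits / binary_model_bits

# Hypothetical layer: 100 real-valued and 6,400 binary multiplications.
flops = effective_flops(100, 6_400)              # 200.0
```

Layers kept in full precision contribute to the real-valued counts, which is why the overall saving stays below the 32x bound for a fully binary network.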

Metric | RBCN | XNOR-Net | ResNet18 |
---|---|---|---|
Memory usage | Mbits | Mbits | Mbits |
Memory saving | 11.10× | 11.10× | - |
FLOPs | | | |
Speedup | | | - |

## 4 Conclusion

In this paper, we introduce rectified binary convolutional networks (RBCNs), towards optimized BCNNs, by exploiting the full-precision kernels and feature maps in an end-to-end manner. In particular, we use a GAN to train the 1-bit binary network with the guidance of its corresponding full-precision model, which significantly improves the performance of the BCNN. Furthermore, as a general model, RBCNs can be used not only in object classification but also in other tasks such as object tracking. The experiments on both object classification and object tracking demonstrate the superior performance of the proposed RBCNs over state-of-the-art binary models.

## Acknowledgments

The work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB0502600) and the Natural Science Foundation of China under Contract 61672079. Also, it is in part supported by the Fundamental Research Funds for the Central Universities. Baochang Zhang is the corresponding author.