Deep convolutional neural networks (DCNNs) have been successfully demonstrated on many computer vision tasks such as object detection and image classification. DCNNs deployed in practical environments, however, still face many challenges. They usually involve millions of parameters and billions of FLOPs during computation. This is critical, because deep models may consume very large amounts of memory and computation, making them impractical for most embedded platforms for vision applications.
Binary filters (kernels), used in place of full-precision filter weights, have been investigated in DCNNs to compress deep models and address the aforementioned problems. Many works quantize the weights of a network while keeping the activations (feature maps) as 32-bit floating-point values [25, 20]. Although this scheme loses less accuracy relative to its full-precision counterpart, it still needs substantial computational resources to handle the full-precision activations. Therefore, the so-called 1-bit DCNNs, also called binary CNNs (BCNNs), which train networks with both 1-bit quantized weights and 1-bit activations, have become more promising and significant in the field of DCNN compression. By reconstructing the full-precision filters with a single scaling factor, XNOR provides an efficient implementation of convolutional operations. More recently, Bi-Real Net explores a new variant of residual structure to preserve the real activations before the sign function. Loss-aware binarization proposes a value approximation method that considers the effect of binarization on the loss to further obtain binarized weights. To make a trade-off between accuracy and complexity, network sketching recursively performs residual quantization and yields a series of binary tensors with decreasing magnitude scales. PCNN learns a set of diverse quantized kernels by exploiting multiple projections with discrete back propagation. Our investigation of prior art reveals that obtaining binary kernels that accurately approximate their full-precision counterparts is the key to optimized BCNNs. However, the intrinsic factors that limit BCNN performance are 1) the vanishing gradient issue, which hampers the convergence of BCNNs, and 2) the limited representational ability caused by the extreme quantization (binarization) of both kernels and activations. Therefore, the potential of back-propagation needs to be explored further.
In this paper, we introduce a genetic binary convolutional network (GBCN) to train a 1-bit DCNN, in which a novel learning architecture with a balanced genetic algorithm (BGA) helps the network escape poor local minima caused by the binarization process. Genetic algorithms (GAs) are randomized search and optimization techniques guided by the principles of evolution and natural genetics, and can perform heuristic search in complex, large and multimodal landscapes. Inspired by the powerful search ability of GAs, we use their two main operations (crossover and mutation) in our GBCNs to obtain more diverse representations. In addition, the phenomenon that most kernels remain unchanged after binarization becomes more severe when the distribution of feature maps or weights is unbalanced. To address this issue, a simple but effective normalization method is introduced into our BGA module so that 1-bit DCNNs are better behaved during training. As a result, the optimizer is less likely to get stuck in local minima. The framework is illustrated in Fig. 1, where Genetic Binary Convolution (GBConv) integrated with BGA calculates the kernels end-to-end, which largely improves their diversity. Through the convolutional layer (GBConv) and the activation layer (BGA), the enriched diverse information (kernels and feature maps) is exploited more fully in the training process to enhance the representational ability of the 1-bit DCNN model. The contributions of this paper are summarized as follows.
(1) A novel 1-bit DCNN learning architecture, referred to as genetic binary convolutional network (GBCN), is proposed to increase the representational ability of 1-bit DCNNs.
(2) A new module BGA is presented to obtain more diverse representation by fully balancing the distributions of data, which can help minimize the performance gap between full-precision and 1-bit DCNNs.
(3) Extensive experiments demonstrate the superior performance of the proposed GBCNs over state-of-the-art BCNNs on object classification, face recognition and person re-identification tasks.
Genetic Binary Convolutional Networks (GBCNs)
We design GBCNs via kernel approximation and discrete optimization to optimize BCNNs in a unified framework. During this process, the representational ability in 1-bit DCNNs can be exploited more fully to improve the performance degraded by binarization. The proposed convolutional layers and balanced GA layers are generic and flexible, which can be easily incorporated into existing DCNNs, such as WideResNets and ResNets. First of all, Table 1 gives the main notation used in this paper.
|loss function||binarized filters||gradient of||feature maps|
|learned filters||learnable matrices||gradient of||input of BGA|
|filter index||learning rate||
|learnable regularization parameters||mutation probability|
Balanced GA (BGA)
In general, a DCNN updates its full-precision kernels by gradient descent: at each layer, the gradient of the kernels, scaled by the learning rate, is subtracted from the current kernel values. Because the kernel values are usually much larger in magnitude than their corresponding gradients, most kernels remain unchanged after binarization; a small variation of a value usually does not change its sign. This phenomenon limits the representational ability of BCNNs and causes substantial accuracy loss. To alleviate this problem, we design a new activation layer, called balanced GA (BGA), which is implemented as a new module. Its input is either the feature maps or the kernels, and a pair of learnable parameters ensures that the transformation inserted in the network can represent the identity transform.
A balanced distribution of inputs has positive consequences for BCNNs because it helps make gradient propagation better behaved. We therefore normalize the input to a balanced distribution, adding a small value to the denominator to avoid division by zero.
Fig. 2 shows the main implementation process of GA in BGA. The GA process includes two operations, crossover and mutation. For crossover, we randomly select pairs of kernels from the layer, with each kernel represented as a vector. Two corresponding fragments of the two vectors are exchanged, and the position that determines the fragments is chosen at random. In mutation, each binary component of a vector is flipped with a given mutation probability. We empirically set the crossover and mutation parameters to 0.1 and 0.3, respectively.
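The crossover and mutation operations described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`crossover`, `mutate`, `ga_step`), the use of a single random cut point, the number of pairs, and the choice to mutate every kernel are all assumptions made here for illustration.

```python
import random

def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two equal-length vectors."""
    assert len(a) == len(b) and len(a) >= 2
    point = rng.randrange(1, len(a))  # random cut position, never at either end
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(v, p_mutation, rng):
    """Flip each binary (+1/-1) component independently with probability p_mutation."""
    return [-x if rng.random() < p_mutation else x for x in v]

def ga_step(kernels, n_pairs, p_mutation, rng):
    """Cross over n_pairs randomly chosen kernel pairs, then mutate every kernel.

    Assumption: pairs are drawn uniformly at random with no fitness-based
    selection, matching the text's remark that no fitness concept is used.
    """
    kernels = [list(k) for k in kernels]
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(kernels)), 2)
        kernels[i], kernels[j] = crossover(kernels[i], kernels[j], rng)
    return [mutate(k, p_mutation, rng) for k in kernels]

rng = random.Random(0)
# Eight hypothetical 3x3 binary kernels, each flattened to a vector of length 9.
kernels = [[rng.choice((-1, 1)) for _ in range(9)] for _ in range(8)]
new_kernels = ga_step(kernels, n_pairs=2, p_mutation=0.3, rng=rng)
```

Both operations keep every component in {-1, +1} and preserve the kernel shape, so the output can be used directly as a set of binary kernels.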
The normalization in Eq. 2 is in effect a balancing of the input (feature maps or kernels), which makes it easier for the binarized kernels or feature maps to change their signs during learning. It should also be emphasized that, although we borrow the two main operations, crossover and mutation, from traditional GAs, our GA (or BGA) is quite different; for example, it does not use the concept of fitness to choose kernels or feature maps for crossover. In addition, Fig. 2 demonstrates the BGA process on kernels; in fact, it is also performed on feature maps (see the BGA module in Fig. 1). Through the BGA process, the representational ability of GBCNs can be greatly enhanced.
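The balancing effect can be illustrated with a small experiment. The sketch below assumes that the normalization is a zero-mean, unit-variance standardization with a small constant in the denominator; the learnable scale and shift are omitted, and the names `balance` and `sign` as well as the synthetic input are hypothetical.

```python
import random
import statistics

def sign(values):
    """Binarize to +1/-1 (zero maps to +1 in this sketch)."""
    return [1 if v >= 0 else -1 for v in values]

def balance(values, eps=1e-5):
    """Standardize a flat list of values to zero mean and unit variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

rng = random.Random(0)
# Synthetic input whose distribution is shifted toward positive values.
x = [rng.gauss(0.5, 1.0) for _ in range(10_000)]

frac_pos_plain = sign(x).count(1) / len(x)              # sign only
frac_pos_balanced = sign(balance(x)).count(1) / len(x)  # normalize, then sign
```

With the shifted input, the plain sign function yields clearly more +1 than -1 values, while the standardized input splits close to evenly, so small updates near zero are more likely to flip a sign.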
In GBCNs, the convolution is implemented based on the binarized filters and the learnable matrices to calculate the output feature maps. The convolution operation is implemented as a new module (GBConv) that maps the feature maps before the convolution to those after it, and the binarized filters are combined with the learnable matrices through an element-by-element product. Note that the input feature maps are binary after the BGA operation (see Fig. 1). In GBCNs, what need to be learned and updated are the full-precision filters and the learnable matrices. These two sets of parameters are jointly learned. In each convolutional layer, a GBCN updates the full-precision filters first and then the learnable matrices.
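A single-channel sketch of this convolution follows. It is a simplified reading under stated assumptions: the input feature map and the filter are binarized with a plain sign function (the GA step and the normalization inside BGA are left out), the binarized filter is multiplied element by element with the learnable matrix, and a plain "valid" cross-correlation is applied. The names `modulate`, `conv2d_valid` and `gbconv_sketch` are hypothetical.

```python
def sign(v):
    """Scalar binarization to +1/-1 (zero maps to +1 in this sketch)."""
    return 1 if v >= 0 else -1

def modulate(kernel, matrix):
    """Element-by-element product of the binarized kernel with a learnable matrix."""
    return [[sign(k) * m for k, m in zip(krow, mrow)]
            for krow, mrow in zip(kernel, matrix)]

def conv2d_valid(feature, kernel):
    """Plain 'valid' 2-D cross-correlation for one channel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(feature) - kh + 1, len(feature[0]) - kw + 1
    return [[sum(feature[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def gbconv_sketch(feature, kernel, matrix):
    """Binarize the input, modulate the binarized kernel, then convolve."""
    binary_feature = [[sign(v) for v in row] for row in feature]
    return conv2d_valid(binary_feature, modulate(kernel, matrix))

feature = [[0.2, -1.0, 0.5], [1.5, 0.1, -0.3], [-0.7, 0.4, 0.9]]
kernel = [[0.3, -0.2], [-0.8, 0.6]]   # full-precision filter
matrix = [[0.5, 0.5], [0.5, 0.5]]     # hypothetical learnable matrix
out = gbconv_sketch(feature, kernel, matrix)
```

Because both operands of the inner products are sign patterns up to the matrix scaling, the multiplications can in principle be replaced by cheap bitwise operations, which is the motivation for binarizing both kernels and activations.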
Updating the full-precision filters: During backpropagation, the gradients pass to the binarized filters first and then to the full-precision filters, so the gradient of a full-precision filter is obtained from the loss function by the chain rule. The loss function combines the cross-entropy loss with a second term scaled by a weighting coefficient. The remaining derivative is computed at each location of a kernel, and the full-precision filters are then updated using the resulting gradient and a learning rate.
Updating the learnable matrices: We further update the learnable matrices with the full-precision filters fixed. The gradient of the learnable matrices is computed from the same loss function, and the matrices are updated with another learning rate.
The above derivations show that the BGA and GBCNs are trainable in an end-to-end manner. The complete training procedure is summarized in Algorithm 1. Note that in our implementation, all the values of a learnable matrix are replaced by their average during the forward pass. In this case, only a scalar instead of a matrix is involved in the inference, which speeds up the computation.
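The effect of replacing a learnable matrix by its average can be seen in a short sketch. It assumes the matrix modulates a binarized kernel by an element-by-element product; the function names and the example values are hypothetical.

```python
def mean_scalar(matrix):
    """Average of all entries of a learnable matrix."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)

def dot_with_matrix(patch, binary_kernel, matrix):
    """Inner product with a per-element real-valued modulation."""
    return sum(p * k * m
               for prow, krow, mrow in zip(patch, binary_kernel, matrix)
               for p, k, m in zip(prow, krow, mrow))

def dot_with_scalar(patch, binary_kernel, scalar):
    """Purely binary inner product followed by one real multiplication."""
    binary_dot = sum(p * k
                     for prow, krow in zip(patch, binary_kernel)
                     for p, k in zip(prow, krow))
    return scalar * binary_dot

patch = [[1, -1], [1, 1]]             # binarized input patch
binary_kernel = [[1, -1], [-1, 1]]    # binarized filter
matrix = [[0.25, 0.75], [0.5, 0.5]]   # hypothetical learnable matrix

s = mean_scalar(matrix)
uniform = [[s, s], [s, s]]            # matrix with every entry set to the average
```

With the averaged matrix, `dot_with_matrix(patch, binary_kernel, uniform)` and `dot_with_scalar(patch, binary_kernel, s)` give the same value, but the second form needs only one real-valued multiplication per output instead of one per kernel element.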
Our GBCNs are evaluated first on object classification using the CIFAR10/100 and ILSVRC12 ImageNet datasets, and then on face recognition and person re-identification. WideResNet (WRN) and ResNet are employed as the backbone networks to build our GBCNs.
Datasets and Implementation Details
CIFAR10 is a natural image classification dataset containing a training set of 50,000 and a testing set of 10,000 color images across the following 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, while CIFAR100 consists of 100 classes.
The ImageNet object classification dataset is more challenging due to its large scale and greater diversity. It contains 1000 classes, 1.2 million training images, and 50k validation images. We compare our method with the state of the art on the ImageNet dataset, and we adopt ResNet18 to validate the superiority and effectiveness of GBCNs.
WRN Backbone: WRN is a network structure similar to ResNet, with a depth factor that controls the expansion of the feature map depth dimension through 3 stages, within which the dimensions remain unchanged. For simplicity we fix the depth factor to 1. Each WRN has a parameter that indicates the channel dimension of the first stage; we set it to 16, leading to a network structure of 16-16-32-64. For example, WRN22 is a network with 22 convolutional layers. The training details are the same as in the original WRN work. is set to 0.001. The two learning rates are both set to 0.01, with a degradation of 10% every 60 epochs before reaching the maximum epoch of 200 for CIFAR10/100.
ResNet18 Backbone: SGD is used as the optimization algorithm with a momentum of 0.9 and a weight decay of 1e-4. is set to 0.001. On ImageNet, the two learning rates are both set to 0.1, with a degradation of 10% at epoch 20 and epoch 35 before reaching the maximum epoch of 70. On CIFAR10/100, the two learning rates are both set to 0.01, with a degradation of 10% every 60 epochs before reaching the maximum epoch of 200.
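The stepped schedules above can be written as a small helper. This is a sketch only: the text's "degradation of 10%" is read here as multiplying the learning rate by a factor at each milestone, and the factor 0.1 used in the example is an assumption, as are the function and variable names.

```python
def stepped_lr(base_lr, epoch, milestones, factor):
    """Learning rate after multiplying by `factor` at every milestone already reached."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** n_decays

# ImageNet: decay at epochs 20 and 35, training stops at epoch 70.
imagenet_milestones = [20, 35]
# CIFAR10/100: decay every 60 epochs, training stops at epoch 200.
cifar_milestones = list(range(60, 200, 60))

ASSUMED_FACTOR = 0.1  # assumption: the source does not state the multiplier explicitly
imagenet_lrs = [stepped_lr(0.1, e, imagenet_milestones, ASSUMED_FACTOR) for e in range(70)]
cifar_lrs = [stepped_lr(0.01, e, cifar_milestones, ASSUMED_FACTOR) for e in range(200)]
```

Keeping the factor as an explicit argument makes it easy to swap in a different reading of the schedule, for example 0.9 if the learning rate is meant to shrink by 10% at each step.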
In this section, we first calculate the proportion of kernel weights that change in the XNOR binarization process, as shown in Fig. 3 (a). We can see clearly that after 200 epochs of training, no more than 10% are changed in every layer of ResNet18 with the kernel stage of 16-16-32-64. However, in Fig. 3 (b), our corresponding GBCN has more than 40% of weights changed. We also evaluate the influence of BGA acting as a binarization operator. When we use BGA for binarization, with feature maps as input, the numbers of the two binary values are 1052347 and 1044805, respectively. When only the sign function is used for binarization, the numbers are 721194 and 1375958, respectively. BGA clearly makes the data distribution more balanced. Then, in Table 2, we study the performance contributions of the components in GBCNs, using CIFAR10/100 and ResNet18 with different kernel stages. The details are given below.
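The balance claim can be checked directly from the reported counts. The short computation below uses only the four numbers given in the text; the helper name `minority_share` is introduced here for illustration.

```python
def minority_share(count_a, count_b):
    """Fraction of elements that fall in the less frequent of the two binary values."""
    return min(count_a, count_b) / (count_a + count_b)

with_bga = (1052347, 1044805)    # counts of the two binary values with BGA
sign_only = (721194, 1375958)    # counts with the sign function only

total_with_bga = sum(with_bga)
total_sign_only = sum(sign_only)
share_with_bga = minority_share(*with_bga)
share_sign_only = minority_share(*sign_only)
```

Both settings binarize the same number of elements (2,097,152). With BGA the minority value covers about 49.8% of them, against about 34.4% with the sign function alone, where a perfectly balanced split would be 50%.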
1) We only replace the convolution in Bi-Real Net with our GBConv, but without BGA inside (simply using the sign function as the binarization operator), and compare the results. As shown in Table 2, GBCN achieves an accuracy improvement of roughly 6% over Bi-Real Net (78.85% vs. 72.79%) using the same network structure as in ResNet18 with 16-16-32-64 on CIFAR10. This significant improvement verifies the effectiveness of the learnable matrices.
2) In GBCNs, if we use GBConv with BGA inside to help binarization, we find a further improvement from 45.32% to 47.96% in ResNet18 with the kernel stage of 16-16-32-64 on CIFAR100, which shows that BGA does enhance the binarized networks.
Accuracy Comparison with State-of-the-Art
CIFAR10/100: The same parameter settings are used in GBCNs on both CIFAR10 and CIFAR100. We first compare our GBCNs with the original ResNet18 with different stage kernels, followed by a comparison with the original WRNs with the initial channel dimension 64 in Table 3. Thanks to the whole BGA process, our results on both datasets are close to those of the full-precision networks ResNet18 and WRN22. Then, we compare our results with other state-of-the-art methods such as Bi-Real Net, PCNN, and XNOR. All these BCNNs have both binary filters and binary activations. We observe an accuracy improvement of up to 6.55% (47.96% vs. 41.41%) with our GBCN when compared with PCNN on CIFAR100.
ImageNet: Five state-of-the-art methods on ImageNet are chosen for comparison: Bi-Real Net, BinaryNet, XNOR, PCNN and ABC-Net. Again, these networks are representative methods that binarize both network weights and activations and achieve state-of-the-art results. All the methods in Table 4 binarize ResNet18. The results in Table 4 are quoted directly from the respective papers, except that the result of BinaryNet is quoted from another work. On ImageNet, we apply more batch normalization layers and center loss to fine-tune our models. The comparison clearly indicates that the proposed GBCN outperforms the other binary networks by a considerable margin in terms of both top-1 and top-5 accuracy. Specifically, for top-1 accuracy, GBCN outperforms BinaryNet and ABC-Net by a gap of over 15%, and achieves a 6.6% improvement over XNOR, 1.4% over the very recent Bi-Real Net, and 0.5% over the latest PCNN. In Fig. 4, we plot the training and testing loss curves of XNOR and GBCN, which clearly show that GBCN converges faster than XNOR.
Experiments on Face Recognition
In this section, we examine the effectiveness of GBCNs for face recognition (FR), which can help us understand how binary networks work in the fine-grained object recognition task. Despite the high accuracy in FR benchmarks, FR models still hardly meet the requirements in resource-limited applications because of the heavy memory and computation cost. Therefore, in this paper, we use our proposed GBCNs to compress FR models. To the best of our knowledge, we are the first to use binary networks in FR. In the following, we will briefly introduce the datasets and backbones used in the experiments.
Training Dataset: We use the publicly available web-collected training dataset CASIA-WebFace to train our GBCN models. CASIA-WebFace has 494,414 face images belonging to 10,575 different individuals. These face images are horizontally flipped for data augmentation. Note that the scale of our training data (0.49M) is relatively small, especially compared to the private datasets used in DeepFace (4M), VGGFace (2M) and FaceNet (200M).
Testing Dataset: The LFW dataset consists of 13,233 web photos of 5,749 celebrities, which are divided into 6,000 face pairs in 10 splits.
Celebrities in Frontal-Profile (CFP) contains 7000 images of 500 subjects. The dataset is used for evaluating how face verification approaches handle pose variation. Hence, it consists of 5000 images in frontal view and 2000 images in extreme profile. The data are organized into 10 splits, each containing the same number of frontal-frontal and frontal-profile comparisons.
AgeDB contains 16,488 images of various famous people, such as actors/actresses, writers, scientists, politicians, etc. Every image is annotated with respect to the identity, age and gender attribute. There are 568 distinct subjects.
We compare our GBCN with the popular and state-of-the-art methods XNOR, PCNN and CBCN. As shown in Table 5, our framework achieves 16.55% precision improvement over XNOR, both using the same network architecture as in ResNet18 on AgeDB. Also, GBCN outperforms XNOR with a large gap of 17.65% in ResNet50 on AgeDB, which indicates that GBCN can also be effective in compressing deeper networks. Further, our framework brings so much benefit that GBCN performs almost as well as the full-precision ResNet18. Among the methods using a single model and public training data, our model achieves cutting-edge performance.
Experiments on Person Re-identification
The task of person re-identification is to judge whether two person images belong to the same subject or not. In practical applications, the two images are usually captured by two cameras with disjoint views. The performance of person re-identification is closely related to many other applications, such as cross-camera tracking, behaviour analysis, object retrieval and so on. Despite their good performance on many benchmarks, re-identification models are hard to apply in the real world because of their heavy memory and computation cost. Therefore, in this paper, we use our proposed GBCNs to compress the reID models. To the best of our knowledge, we are the first to use binary networks in reID. In the following, we briefly introduce the datasets used in these experiments.
Market-1501 is currently the largest image-based reID benchmark dataset. It contains 32,668 labeled bounding boxes of 1,501 identities captured from 6 different viewpoints. The bounding boxes are detected using the Deformable Part Model (DPM). The dataset is split into two parts: 12,936 images with 751 identities for training and 19,732 images with 750 identities for testing. In testing, 3,368 hand-drawn images with 750 identities are used as the probe set to identify the correct identities on the testing set.
DukeMTMC-reID is a subset of DukeMTMC for image-based re-identification, in the format of the Market-1501 dataset. The original dataset contains eight 85-minute high-resolution videos from eight different cameras. Hand-drawn pedestrian bounding boxes are available.
The iLIDS dataset was constructed from video images captured in a busy airport arrival hall. It features 119 pedestrians, with 476 images normalized to 128 × 64 pixels. The images come from non-overlapping cameras, and are subject to quite large illumination changes and occlusions. On average, there are four images of each individual pedestrian.
As shown in Table 6, our framework achieves up to 10% precision improvement over XNOR on iLIDS, both using the same ResNet50 network architecture. Also, GBCN outperforms PCNN by a gap of 6.1% with ResNet50 on Market-1501, which confirms its potential in recognition tasks.
The memory usage is computed as the sum of 32 bits times the number of real-valued parameters and 1 bit times the number of binary parameters in the network. Further, we use FLOPs to measure the speed. The results are given in Table 7. The FLOPs are calculated as the number of real-valued floating-point multiplications plus 1/64 of the number of 1-bit multiplications. As shown in Table 7, the proposed GBCN, along with XNOR, reduces the memory usage of the full-precision ResNet18 by 11.10 times. For efficiency, both GBCN and XNOR gain a speedup over ResNet18. Note that the computational and storage costs brought by the learnable matrices are negligible.
|Memory usage (Mbits)|
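The two cost measures defined above translate directly into code. The sketch below implements the stated formulas; the function names and the parameter and multiplication counts in the example are hypothetical and are not taken from Table 7.

```python
def memory_bits(n_real_params, n_binary_params):
    """Memory usage in bits: 32 bits per real-valued parameter, 1 bit per binary one."""
    return 32 * n_real_params + 1 * n_binary_params

def effective_flops(n_real_mults, n_binary_mults):
    """Real-valued multiplications plus 1/64 of the 1-bit multiplications."""
    return n_real_mults + n_binary_mults / 64

# Hypothetical network: 1M parameters kept real-valued, 10M parameters binarized.
n_real, n_binary = 1_000_000, 10_000_000
binary_net_bits = memory_bits(n_real, n_binary)
full_precision_bits = memory_bits(n_real + n_binary, 0)
compression_ratio = full_precision_bits / binary_net_bits

# Hypothetical layer: 1e6 real multiplications and 64e6 binary multiplications.
layer_flops = effective_flops(1_000_000, 64_000_000)
```

The compression ratio depends on how many parameters stay real-valued (for example in the first and last layers), which is why the saving for a whole network stays well below the 32 times bound of a fully binarized one.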
In this paper, we have proposed genetic binary convolutional networks (GBCNs), towards optimal BCNNs, by exploiting more diverse kernels and feature maps for better representational ability in an end-to-end manner. In particular, we use the crossover and mutation operations of GAs to make BCNNs learn more effectively, which significantly improves their performance. Furthermore, as a general model, GBCNs can be used not only in object classification but also in other fine-grained tasks such as face recognition and person re-identification. The experiments on object classification, face recognition and person re-identification demonstrate the superior performance of the proposed GBCNs over state-of-the-art binary models.
The work was supported in part by National Natural Science Foundation of China under Grants 61672079, 61473086, 61773117, 614730867. This work is supported by Shenzhen Science and Technology Program KQTD2016112515134654. Baochang Zhang is also with Shenzhen Academy of Aerospace Technology, Shenzhen 100083, China.
- (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
- (2010) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
- (2019) Projection convolutional neural networks. In AAAI.
- (2017) Network sketching: exploiting binary structure in deep CNNs. pp. 5955–5963.
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2016) Loss-aware binarization of deep networks. arXiv preprint arXiv:1611.01600.
- (2007-10) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.
- (2009) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html.
- (2012) ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pp. 1097–1105.
- (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353.
- (2019) Circulant binary convolutional networks: enhancing the performance of 1-bit DCNNs with circulant back propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699.
- (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision, pp. 722–737.
- (2017) AgeDB: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 51–59.
- (2015) Deep face recognition. In British Machine Vision Conference.
- (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
- (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
- (2016) Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9.
- (2014) DeepFace: closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
- (2014) Person re-identification by video ranking. In European Conference on Computer Vision, pp. 688–703.
- (2018-06) Modulated convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition.
- (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923.
- (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.
- (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124.
- Unlabeled samples generated by GAN improve the person re-identification baseline in vitro.
- (2017) Incremental network quantization: towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044.