Deep convolutional neural networks (DCNNs) have been successfully demonstrated on many computer vision tasks such as object detection and image classification. DCNNs deployed in practical environments, however, still face many challenges, particularly when portability and real-time performance are required. Models for vision applications can require very large amounts of memory, making them impractical for most embedded platforms. In addition, floating-point inputs and network weights, together with the forward and backward data flow, can result in a significant computational burden.
Binary filters, used in place of full-precision weights, have been investigated in DCNNs to compress deep models and address the aforementioned problems. Such networks are also called 1-bit DCNNs, as each weight parameter and activation can be represented by a single bit. As presented in [Rastegari2016XNOR], XNOR approximates both the convolution weights and the inputs to the convolution with binary values, providing an efficient implementation of convolutional operations, particularly by reconstructing the unbinarized filters with a single scaling factor. More recently, Bi-Real Net [Liu2018Bi] explores a new variant of the residual structure to preserve the real activations before the sign function, and TBN [Wan2018TBN] replaces the sign function with a threshold-based ternary function to obtain a ternary input tensor. Both provide a favorable tradeoff among memory, efficiency and performance. A warm-restart learning-rate schedule is adopted in [Mark2018Training] to accelerate the training of 1-bit-per-weight networks. Furthermore, a method called WAGE [Wu2018Training] is proposed to discretize both training and inference, where not only weights and activations but also gradients and errors are quantized. In these previous methods, however, the binarization of the filters often degrades the representational ability of the models with respect to rotation variations in objects.
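The single-scaling-factor reconstruction mentioned above can be illustrated with a minimal sketch. It assumes the commonly used XNOR-style choice in which a real-valued filter is approximated by its sign pattern multiplied by the mean absolute value of its weights; the function names are hypothetical and the filter is flattened to a list for brevity.

```python
def binarize_with_scale(weights):
    """Approximate a real-valued filter by alpha * sign(weights)."""
    # alpha: mean absolute value of the weights (assumed XNOR-style scaling factor).
    alpha = sum(abs(w) for w in weights) / len(weights)
    # sign() mapped to {-1, +1}; zero is sent to +1 by convention in this sketch.
    binary = [1 if w >= 0 else -1 for w in weights]
    return alpha, binary


def reconstruct(alpha, binary):
    """Rebuild the approximate real-valued filter from its 1-bit form."""
    return [alpha * b for b in binary]


w = [0.5, -0.25, 0.75, -1.0]
alpha, b = binarize_with_scale(w)
approx = reconstruct(alpha, b)
print(alpha, b, approx)
```

Only the sign pattern and one scalar per filter need to be stored, which is where the memory saving comes from.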
Inspired by Oriented Response Networks [Zhou2017ORN], which have already shown a strong ability to achieve within-class rotation invariance, we employ a similar approach to enhance the representational ability that is degraded by the binarization process. To the best of our knowledge, we are the first to use the idea of orientation in binary networks in an asynchronous way, opening up a promising direction for pursuing efficient networks based on virtual shared filter banks. In this paper, we introduce circulant filters (CiFs) and the circulant binary convolution (CBConv) to actively calculate diverse feature maps, which can improve the representational ability of the resulting binarized DCNNs. The key insight of producing CiFs to help back propagation is shown in Fig. 1. Compared to previous binarized DCNN filters, CiFs are defined based on a circulant operation on each learned filter. A new circulant back propagation (CBP) algorithm is also introduced to develop an end-to-end trainable DCNN. During the convolution, CiFs are used to produce diverse feature maps, which provide the binarized DCNNs with the ability to capture variations previously unseen. Instead of introducing extra functional modules or new network topologies, our method implements CBConv on the most basic element of DCNNs, the convolution operator. Thus, it can be naturally fused with modern DCNN architectures, upgrading them to more expressive and compact Circulant Binary Convolutional Networks (CBCNs) for resource-limited applications. We design a simple and unique variation process, which is deployed at each layer and can be solved within the same pipeline of the new CBP algorithm. In addition, we consider the center loss to further enhance the performance of CBCNs, as shown in Fig. 2. Thanks to the low model complexity, such an architecture is less prone to over-fitting and suitable for resource-constrained environments.
Our CBCNs reduce the required storage space of full-precision models by a factor of 32, while achieving better performance than existing DCNNs based on binarized filters. The contributions of this paper are summarized as follows:
(1) CiFs are used to obtain more robust feature representations, e.g., with respect to orientation variations in objects, which narrows the performance gap between full-precision DCNNs and binarized DCNNs.
(2) We develop a CBP algorithm to reduce the loss during back propagation and make convolutional networks more compact and efficient. Experimental results show that CBP is not only effective but also converges quickly.
(3) The presented circulant convolution is generic, and can be easily used on existing DCNNs, such as ResNets and conventional DCNNs. Our highly compressed models outperform state-of-the-art binarized models by a large margin on MNIST, CIFAR and ImageNet databases.
Table 1: Main notation used in this paper.
|circulant transfer matrix||circulant filter||learned filter||feature map|
|the gradient of the learned filter||inverse circulant transfer matrix||binarized filter|
|number of orientations for each filter||filter index||orientation index||layer index|
|input feature map index||output feature map index||learning rate|
2 Related Work
DCNNs with a large number of parameters consume considerable computational resources. From our practical study, about half of the power consumption of a DCNN is related to the model size. Many attempts have been made to accelerate and simplify DCNNs without increasing their error. Network binarization is one of the most popular approaches and is briefly reviewed below.
The research in [Lai2017Deep] demonstrates that the storage of real-valued deep neural networks such as AlexNet [Krizhevsky2012ImageNet], GoogLeNet [Szegedy2014Going] and VGG-16 [Simonyan2014Very] can be reduced significantly when their 32-bit parameters are quantized to 1-bit. Expectation BackPropagation (EBP) in [Soudry2014Expectation] uses a variational Bayesian approach to achieve high performance with a network with binary weights and activations. BinaryConnect (BC) [Courbariaux2015BinaryConnect] extends the idea of EBP by employing 1-bit precision weights (1 and -1). Later, Courbariaux et al. [Courbariaux2016Binarized] propose BinaryNet (BN), an extension of BC that further constrains activations to +1 and -1, binarizing the input (except in the first layer) and the output of each layer. BC and BN both achieve sufficiently high accuracy on small datasets such as MNIST, CIFAR10 and SVHN. According to Rastegari et al. [Rigamonti2013Learning], BC and BN are not very successful on large-scale datasets. XNOR [Rastegari2016XNOR] uses a different binarization method and network architecture. Both the weights and the inputs to the convolution are approximated with binary values, which results in an efficient implementation of the convolutional operations by reconstructing the unbinarized filters with a single scaling factor. Recent studies such as MCN [Wang2018Modulated] and Bi-Real Net [Liu2018Bi] explore new network structures and training techniques for binarizing both weights and activations while minimizing accuracy degradation, using a concept similar to XNOR. MCN introduces modulated filters to recover the unbinarized filters and leads to a new architecture to calculate the network model. Bi-Real Net connects the real activations to the activations of consecutive blocks through an identity shortcut.
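The storage reduction obtained by quantizing 32-bit parameters to 1 bit can be illustrated with a small bit-packing sketch. The bit order and the function names below are illustrative assumptions, not the storage format of any of the cited methods.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 weights into one integer, one bit per weight."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:                 # +1 is stored as bit 1, -1 as bit 0
            word |= 1 << i
    return word


def unpack_signs(word, n):
    """Recover the n packed +1/-1 weights."""
    return [1 if (word >> i) & 1 else -1 for i in range(n)]


signs = [1, -1, -1, 1] * 8         # 32 binary weights
packed = pack_signs(signs)         # fits in a single 32-bit word

bits_full_precision = 32 * len(signs)   # 32-bit floats
bits_binary = 1 * len(signs)            # 1 bit per weight
ratio = bits_full_precision // bits_binary
print(ratio)
```

Thirty-two binary weights occupy the space of one full-precision weight, which corresponds to the factor of 32 in storage.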
The results of these studies are encouraging, but due to the weight binarization process, the representational ability of the networks can be degraded. This inspires us to seek a way to increase the filter variations in order to increase the network representation ability. In particular, for the first time, we use the circulant matrix to build CiFs for our binarized CNNs. We also develop a CBP algorithm to make the DCNNs more compact and effective in an end-to-end framework.
We design a specific architecture in CBCNs based on CiFs, and train it with a new BP algorithm. Attempting to increase the representational ability reduced by the binarization process, CiFs are designed to enrich the binarized filters for the enhancement of the network performance. As shown in the experiments, the performance drop is marginal even when the learned network parameters are highly compressed. First of all, Table 1 gives the main notation used in this paper.
3.1 Circulant Transfer Matrix
A circulant matrix is defined by a single vector in the first column, with cyclic permutations of the vector with offset equal to the column index in the remaining columns. An important property of the circulant matrix is that it can produce different representations using simple vectors or matrices. With this unique characteristic, we define the circulant transfer matrix of columns as , :
where and vector rotations are used to form . The first column corresponds to the numbers in Fig. 3, and the other columns are obtained by a counter-clockwise rotation of the numbers. Each column of represents one rotation angle , , , , , , , }. We set to correspond to the learned filter and to the derived rotated versions of .
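The construction of the circulant transfer matrix can be sketched as follows. The sketch assumes a 3x3 filter stored in row-major order, a clockwise listing of its eight outer positions, and one 45-degree counter-clockwise rotation per column; the position numbering of Fig. 3 may differ, and `RING`, `rotate_ccw` and `transfer_matrix` are hypothetical names.

```python
# Outer positions of a row-major 3x3 filter, listed clockwise from the top-left.
# This ordering is an assumption; Fig. 3 may number the positions differently.
RING = [0, 1, 2, 5, 8, 7, 6, 3]


def rotate_ccw(filter9, steps):
    """Rotate the 8 outer weights counter-clockwise by steps * 45 degrees."""
    out = list(filter9)                  # the center (index 4) is never moved
    for p, src in enumerate(RING):
        dst = RING[(p - steps) % 8]      # counter-clockwise = backwards along RING
        out[dst] = filter9[src]
    return out


def transfer_matrix(num_orientations, step=1):
    """Return one column of position labels 1..9 per orientation."""
    labels = list(range(1, 10))
    return [rotate_ccw(labels, m * step) for m in range(num_orientations)]


columns = transfer_matrix(8)
for column in columns:
    print(column)
```

The first column is the unrotated labelling, and every other column is a permutation of it in which the central label stays fixed.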
3.2 Circulant Filters (CiFs)
We now design the specific convolutional filters, CiFs, used in our CBCNs. A CiF is a tensor of size , generated based on a learned filter and . These CiFs are deployed across all convolutional layers. As shown in Fig. 4(a), the original learned filter is modified to by replicating it three times and concatenating them to obtain the CiF. For , every channel of the network input is replicated as a group of four elements. By doing so, we can easily implement our CBCNs using PyTorch. One example of the CBConv is illustrated in Fig. 4(b).
To facilitate the math description below, we represent a learned filter as a vector of size so that its corresponding CiF can be represented using a matrix of size (see Fig. 4(a)). Let be such a matrix representing the CiF. Then
where is a vector containing the learned filter’s weights (including the unchanged central one during rotation), and denotes the rotation operation of with (see Fig. 3). corresponds to the learned filter and the other columns of are introduced to increase the representational ability.
3.3 Forward Propagation of CBCNs based on the CBConv Module
In CBCNs, a binary CiF denoted by is calculated as:
where is the corresponding full-precision CiF, and the values of are 1 or -1. Both and are jointly obtained in the end-to-end learning framework. Let the set of all the binarized filters in the layer be . Then the output feature maps are obtained by:
where is the convolution operation implemented as a new module including the CiF generation process (the blue part in Fig. 2). Fig. 4(b) shows a simple example of a forward convolution where there is one input feature map with one generated output feature map. In the CBConv, the channels of an output feature map are generated as follows:
where denotes the convolution operation. is the channel of the feature map, and denotes the feature map of the input in the convolutional layer. In Fig. 4(b), and , where after the CBConv with one binary CiF, the number of the channels of the output feature map is the same as that of the input feature map.
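The forward pass described in this subsection can be sketched in plain Python under several simplifying assumptions: the learned filter has one 3x3 slice per input channel, the sub-filter for each orientation is obtained by rotating every slice by the same number of 45-degree steps, binarization is a plain sign function, and each output channel sums the valid cross-correlations over the input channels. The names and the `step` parameter are hypothetical, and this is not the PyTorch module used in the experiments.

```python
RING = [0, 1, 2, 5, 8, 7, 6, 3]    # assumed clockwise order of the outer 3x3 positions


def rotate_ccw(filter9, steps):
    """Rotate the 8 outer weights counter-clockwise by steps * 45 degrees."""
    out = list(filter9)
    for p, src in enumerate(RING):
        out[RING[(p - steps) % 8]] = filter9[src]
    return out


def sign(x):
    return 1 if x >= 0 else -1


def corr2d_valid(image, kernel9):
    """Valid 3x3 cross-correlation of a 2D list with a flattened kernel."""
    h, w = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * kernel9[3 * i + j]
                 for i in range(3) for j in range(3))
             for c in range(w - 2)]
            for r in range(h - 2)]


def cbconv_forward(feature_map, learned_filter, step=1):
    """One binary CiF applied to an input with len(learned_filter) channels."""
    num_orientations = len(learned_filter)
    outputs = []
    for m in range(num_orientations):           # one output channel per orientation
        acc = None
        for g in range(num_orientations):       # sum over input channels
            binary = [sign(v) for v in rotate_ccw(learned_filter[g], m * step)]
            y = corr2d_valid(feature_map[g], binary)
            if acc is None:
                acc = y
            else:
                acc = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(acc, y)]
        outputs.append(acc)
    return outputs


M = 4
learned = [[0.3, -0.2, 0.5, 0.1, 0.7, -0.4, 0.2, -0.6, 0.9] for _ in range(M)]
x = [[[1.0] * 5 for _ in range(5)] for _ in range(M)]
y = cbconv_forward(x, learned)
print(len(y), len(y[0]), len(y[0][0]))
```

With four input channels and four orientations, the output again has four channels, which matches the statement that the channel count is preserved by the CBConv with one binary CiF.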
3.4 Circulant Back Propagation (CBP)
During the back-propagation, what needs to be learned and updated are the learned filters only. And the inverse transformation of the circulant transfer matrix is involved in the process of BP to further enhance the representational ability of our CBCNs. To facilitate the math description below, we define the inverse circulant matrix of columns as , , where and vector inverse rotations are used to form . Let be the gradient of a learned filter . Note that we need to sum up the gradients of the sub-filters in the corresponding CiF, . Thus:
where is the network loss function, and is the learning rate. Note that since the central weights of CiFs are not rotated, their gradients are obtained as in the common BP procedure and are not presented in Eq. 7. The circulant operation is thus involved in our BP process, which makes CBP adaptive to orientation variations in objects.
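The gradient aggregation described above can be sketched as follows, assuming the same hypothetical 3x3 ring layout and 45-degree step as a rotation convention: the gradient of each sub-filter is rotated back to the orientation of the learned filter, the results are summed, and a plain gradient-descent step is applied. The gradient of the sign function is deliberately left out of this sketch.

```python
RING = [0, 1, 2, 5, 8, 7, 6, 3]    # assumed clockwise order of the outer 3x3 positions


def rotate_ccw(filter9, steps):
    """Rotate the 8 outer weights counter-clockwise; negative steps rotate clockwise."""
    out = list(filter9)
    for p, src in enumerate(RING):
        out[RING[(p - steps) % 8]] = filter9[src]
    return out


def learned_filter_gradient(sub_filter_grads, step=1):
    """Sum the inverse-rotated gradients of the sub-filters of one CiF."""
    total = [0.0] * 9
    for m, grad in enumerate(sub_filter_grads):
        back = rotate_ccw(grad, -m * step)      # undo the rotation of sub-filter m
        total = [t + b for t, b in zip(total, back)]
    return total


def sgd_update(weights, grad, lr):
    """Plain gradient-descent update of the learned filter."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Hypothetical gradients for M = 4 sub-filters.
grads = [[float((m + 1) * (i + 1)) for i in range(9)] for m in range(4)]
delta = learned_filter_gradient(grads)
new_w = sgd_update([0.0] * 9, delta, lr=0.01)
print(delta)
```

Because the center position is unaffected by the rotation, its gradient is simply the sum of the center gradients of the sub-filters, consistent with the remark about the central weights.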
For the gradient of the sign function, special treatment is necessary because the function is discontinuous. In [Courbariaux2016Binarized] and [Liu2018Bi], the sign function is approximated by the clip function and the piecewise polynomial function, respectively, as shown in Fig. 5, where their corresponding derivatives are also given. Since the derivative of the sign function (an impulse) can be represented as , in this work, we use this Gaussian function (Fig. 5) as the approximation of the gradient:
are the amplitude gain and variance of the Gaussian function, respectively, which are determined empirically. In our experiments, we find that our approximation is better than the other two shown in Fig. 5. From the equations above, we can see that the BP process can be easily implemented. Because only the learned convolution filters are updated, with the help of CiFs, our CBCNs are compact and efficient, reducing memory storage by a factor of 32. Finally, the learning algorithm to train CBCNs is given in Algorithm 1.
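The three gradient approximations of the sign function discussed above can be compared with a short sketch. The clip and piecewise polynomial derivatives follow their usual definitions; the Gaussian is written here as an unnormalized bell curve with an amplitude and a width parameter, which is an assumed form, and the value of `sigma` used below is purely illustrative.

```python
import math


def clip_grad(x):
    """Derivative of clip(x, -1, 1): 1 inside [-1, 1], 0 outside."""
    return 1.0 if abs(x) <= 1 else 0.0


def polynomial_grad(x):
    """Derivative of the piecewise polynomial approximation of sign."""
    if x < -1 or x >= 1:
        return 0.0
    return 2 + 2 * x if x < 0 else 2 - 2 * x


def gaussian_grad(x, amplitude, sigma):
    """Assumed Gaussian surrogate: amplitude * exp(-x^2 / (2 * sigma^2))."""
    return amplitude * math.exp(-x * x / (2 * sigma * sigma))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, clip_grad(x), polynomial_grad(x), gaussian_grad(x, 1.0, 1.0))
```

Unlike the other two surrogates, the Gaussian is smooth and never exactly zero, so weights far from the threshold still receive a small gradient.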
|Dataset||Backbone||kernel stage||original (%)||rot (%)|
Our CBCNs are evaluated on object classification using the MNIST [L1998Gradient], CIFAR10/100 [Krizhevsky2009Learning] and ILSVRC12 ImageNet [Russakovsky2015ImageNet] datasets. LeNet [L1998Gradient], WideResNet (WRN) [Zagoruyko2016Wide] and ResNet18 [He2016Deep] are employed as the backbone networks to build our CBCNs simply by replacing the full-precision convolution with CBConv. The neuron activations are also binarized in all of our experiments.
4.1 Datasets and Implementation Details
Datasets: The MNIST [L1998Gradient] dataset is composed of a training set of 60,000 and a testing set of 10,000 grayscale images of hand-written digits from 0 to 9. Each sample is randomly rotated in yielding MNIST-rot.
CIFAR10 [Krizhevsky2009Learning] is a natural image classification dataset containing a training set of 50,000 and a testing set of 10,000 color images across the following 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, while CIFAR100 consists of 100 classes. We randomly rotate each sample in the CIFAR10 dataset between to yield CIFAR10-rot.
The ImageNet object classification dataset [Russakovsky2015ImageNet] is more challenging due to its large scale and greater diversity. It contains 1000 classes, 1.2 million training images and 50k validation images. We compare our method with the state-of-the-art on the ImageNet dataset, adopting ResNet18 to validate the superiority and effectiveness of CBCNs.
In the implementation, LeNet, WRN, and ResNet18 backbone networks are used to build CBCNs. We simply replace the full-precision convolution with CBConv and keep the other components unchanged. The parameters and for the Gaussian function in Eq. 9 are set to 1 and , respectively. More details are elaborated below.
LeNet contains four simple convolutional layers. We adopt max-pooling and ReLU after each convolutional layer, and a dropout layer after the fully connected layer to avoid over-fitting. The initial learning rate is 0.01 and is kept constant until the maximum epoch of 50 is reached for MNIST and MNIST-rot.
WRN Backbone: WRN is a network structure similar to ResNet with a depth factor to control the feature map depth dimension expansion through 3 stages, within which the dimensions remain unchanged. For simplicity we fix the depth factor to 1. Each WRN has a parameter which indicates the channel dimension of the first stage, and we set it to 16, leading to a network structure ---. The training details are the same as in [Zagoruyko2016Wide]. The initial learning rate is 0.01 with a degradation of 10% for every 60 epochs before it reaches the maximum epoch of 200 for CIFAR10/100 and CIFAR10-rot. For example, WRN22 is a network with 22 convolutional layers, and similarly for WRN40.
ResNet18 Backbone: Fig. 6 respectively illustrates the architectures of ResNet18, XNOR and CBCNs. SGD is used as the optimization algorithm with a momentum of and a weight decay 1e-4. The initial learning rate is 0.01 with a degradation of 10% for every 20 epochs before reaching the maximum epoch of 70 on ImageNet, while on CIFAR10/100, the initial rate is 0.01 with a degradation of 10% for every 60 epochs before reaching the maximum epoch of 200.
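The step-wise learning-rate schedules used for the backbones can be sketched as below. The phrase "a degradation of 10%" admits two readings, multiplying the rate by 0.1 or by 0.9 at each step, so the decay factor is left as an explicit argument; the function name is hypothetical.

```python
def step_lr(epoch, initial_lr, step_size, factor):
    """Learning rate after applying `factor` once every `step_size` epochs."""
    return initial_lr * factor ** (epoch // step_size)


# ImageNet setting from the text: initial rate 0.01, one step every 20 epochs, 70 epochs.
for epoch in (0, 19, 20, 40, 69):
    reading_a = step_lr(epoch, 0.01, 20, 0.1)   # rate reduced to 10% at each step
    reading_b = step_lr(epoch, 0.01, 20, 0.9)   # rate reduced by 10% at each step
    print(epoch, reading_a, reading_b)
```

The CIFAR10/100 setting is obtained by changing `step_size` to 60 and running for 200 epochs.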
4.2 Rotation Invariance
With LeNet and ResNet18 backbones, we build XNOR and CBCNs and compare them on MNIST, MNIST-rot, CIFAR10, and CIFAR10-rot. is set to in CBCNs.
Table 2 gives the results in terms of error rates, and ‘fp’ represents the full-precision results. The state-of-the-art XNOR has a dramatic performance drop on the more challenging rotated datasets. On MNIST-rot, with the kernel stage ---, CBCN shows an 11.5% improvement over XNOR, while a 1.85% improvement is achieved on MNIST. On CIFAR10-rot, with the kernel stage ---, CBCN has about 20% improvement over XNOR. From Table 2, we can also see that on CIFAR10-rot, the performance gap between CBCN and XNOR decreases from about 20% to 17% to 6% as the kernel stage (number of parameters) increases, meaning that the improvement of CBCN over XNOR is more significant when the networks have fewer parameters. The results in Table 2 confirm that, with the improved representational ability from the proposed CiFs, CBCNs are more robust than conventional binarization methods to rotation variations of input images.
4.3 Ablation Study
In this section, we study the performance contributions of the components in CBCNs, which include CBConv, center loss, additional shortcuts (Fig. 6), and the Gaussian gradient function (Eq. 9). CIFAR10 and ResNet18 with kernel stage --- are used in this experiment. The details are given below.
1) We only replace the convolution BConv in XNOR with our CBConv convolution and compare the results. As shown in the ConvComp column in Table 3, CBCN (=4) achieves about 8% accuracy improvement over XNOR (84.79% vs. 76.3%) using the same network structure and shortcuts as in ResNet18. This significant improvement verifies the effectiveness of our CBConv.
2) In CBCNs, if we double the shortcuts (Fig. 6), we also find a decent improvement from 84.79% to 89.10% (see the column under S in Table 3), which shows that increasing the number of shortcuts can also enhance binarized deep networks.
3) Fine-tuning CBCN with the center loss further improves its performance by 0.5%, as shown in the column under S+C in Table 3.
4) Replacing the piecewise polynomial function in [Liu2018Bi] with the Gaussian function for back propagation, CBCN obtains a 1.12% improvement (90.22% vs. 89.10%), which shows that the gradient function we use is a better choice.
5) From the column under S in Table 3, we can see that CBCN performs better when more orientations are used in CiFs. More orientations can better deal with the problem of degraded representation caused by network binarization.
4.4 Accuracy Comparison with State-of-the-Art
CIFAR10/100: The same parameter settings are used in CBCNs on both CIFAR10 and CIFAR100. We first compare our CBCNs with the original ResNet18 with stage kernels as --- and ---, followed by a comparison with the original WRNs with the initial channel dimension in Table 4. Then, we compare our results with other state-of-the-art methods such as BNN [Courbariaux2015BinaryConnect], BWN [Rastegari2016XNOR], and XNOR [Rastegari2016XNOR]. It is observed that an accuracy improvement of at least 1.84% (= 93.42%-91.58%) is gained with our CBCN, and in other cases larger margins are achieved. We also plot the training and testing loss curves of XNOR and CBCN in Fig. 7, which clearly show that CBCN (CBP) converges faster than XNOR (BP).
ImageNet: Four state-of-the-art methods on ImageNet are chosen for comparison: Bi-Real Net [Liu2018Bi], BinaryNet [Courbariaux2016Binarized], XNOR [Rastegari2016XNOR] and ABC-Net [Lin2017Towards]. These four networks are representative methods that binarize both network weights and activations and achieve state-of-the-art results. All the methods in Table 5 perform the binarization of ResNet18. For a fair comparison, our CBCN contains the same number of learned filters as ResNet18. The comparative results in Table 5 are quoted directly from the references, except that the result of BinaryNet is from [Lin2017Towards]. The comparison clearly indicates that the proposed CBCN outperforms the four binary networks by a considerable margin in terms of both the top-1 and top-5 accuracies. Specifically, for top-1 accuracy, CBCN outperforms BinaryNet and ABC-Net with a gap of over 18%, achieves about 10% improvement over XNOR, and about 5% over the latest Bi-Real Net. In Fig. 8, we plot the training and testing loss curves of XNOR and CBCN. They clearly show that, using our CBP algorithm, CBCN converges faster than XNOR.
In this paper, we have proposed new circulant binary convolutional networks (CBCNs) that are implemented by a set of binary circulant filters (CiFs). The proposed CiFs and circulant binary convolution (CBConv) are used to enhance the representational ability of binary networks. CBCNs can be trained end-to-end with the developed circulant BP (CBP) algorithm. Our extensive experiments demonstrate that CBCNs outperform state-of-the-art binary networks and obtain results that are closer to those of the full-precision backbone networks, ResNets and WRNs, with a storage reduction of about 32 times. As a generic convolutional layer, CBConv can also be used in various other tasks, which we leave as future work.
The work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB0502602) and the National Key R&D Plan (2017YFC0821102). Baochang Zhang is the corresponding author.