1 Introduction
Deep convolutional neural networks (DCNNs) have been successfully demonstrated on many computer vision tasks such as object detection and image classification. DCNNs deployed in practical environments, however, still face many challenges, particularly when portability and real-time performance are required. This is critical because models for vision applications can require very large memory, making them impractical for most embedded platforms. Besides, floating-point inputs and network weights, along with forward and backward data flow, can result in a significant computational burden.
Binary filters, used instead of full-precision weights, have been investigated in DCNNs to compress deep models and address the aforementioned problems. Such networks are also called 1-bit DCNNs, as each weight parameter and activation can be represented by a single bit. In XNOR [Rastegari2016XNOR], both the convolution weights and the inputs to the convolution are approximated with binary values, providing an efficient implementation of convolutional operations, particularly by reconstructing the unbinarized filters with a single scaling factor. More recently, Bi-Real Net [Liu2018Bi] explores a new variant of the residual structure to preserve the real activations before the sign function, and TBN [Wan2018TBN] replaces the sign function with a threshold-based ternary function to obtain a ternary input tensor. Both provide an optimal trade-off among memory, efficiency and performance. A warm-restart learning-rate schedule is adopted in [Mark2018Training] to accelerate the training of 1-bit-per-weight networks. Furthermore, a method called WAGE [Wu2018Training] is proposed to discretize both training and inference, where not only weights and activations but also gradients and errors are quantized. In these previous methods, however, the binarization of the filters often degrades the representational ability of the models for rotation variations in objects.
Inspired by Oriented Response Networks [Zhou2017ORN], which have already shown a powerful ability to achieve within-class rotation invariance, we employ a similar approach to restore the representational ability that is destroyed by the binarization process. To the best of our knowledge, we are the first to use the idea of orientation in binary networks in an asynchronous way, opening up a promising direction for pursuing efficient networks based on virtual shared filter banks.
In this paper, we introduce circulant filters (CiFs) and the circulant binary convolution (CBConv) to actively calculate diverse feature maps, which can improve the representational ability of the resulting binarized DCNNs. The key insight of producing CiFs to help back propagation is shown in Fig. 1. Compared to previous binarized DCNN filters, CiFs are defined based on a circulant operation on each learned filter. A new circulant back propagation (CBP) algorithm is also introduced to develop an end-to-end trainable DCNN. During the convolution, CiFs are used to produce diverse feature maps, which provide the binarized DCNNs with the ability to capture previously unseen variations. Instead of introducing extra functional modules or new network topologies, our method implements CBConv on the most basic element of DCNNs, the convolution operator. Thus, it can be naturally fused with modern DCNN architectures, upgrading them to more expressive and compact Circulant Binary Convolutional Networks (CBCNs) for resource-limited applications.
We design a simple and unique variation process, which is deployed at each layer and can be solved within the same pipeline as the new CBP algorithm. In addition, we consider the center loss to further enhance the performance of CBCNs, as shown in Fig. 2. Thanks to the low model complexity, such an architecture is less prone to overfitting and is suitable for resource-constrained environments. Our CBCNs reduce the required storage space of full-precision models by a factor of 32, while achieving better performance than existing DCNNs based on binarized filters. The contributions of this paper are summarized as follows:
(1) CiFs are used to obtain more robust feature representations, e.g., for orientation variations in objects, which narrows the performance gap between full-precision DCNNs and binarized DCNNs.
(2) We develop a CBP algorithm to reduce the loss during back propagation and make convolutional networks more compact and efficient. Experimental results show that CBP is not only effective but also converges quickly.
(3) The presented circulant convolution is generic and can be easily used in existing DCNNs, such as ResNets and conventional DCNNs. Our highly compressed models outperform state-of-the-art binarized models by a large margin on the MNIST, CIFAR and ImageNet datasets.
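The storage reduction by a factor of 32 follows from replacing 32-bit floating-point weights with 1-bit weights. The sketch below works this out for a hypothetical convolutional layer; the layer shape is an illustrative assumption rather than a value from this paper, and any parameters kept in full precision are ignored.

```python
def storage_bytes(num_weights, bits_per_weight):
    """Bytes needed to store num_weights values at the given bit width."""
    return num_weights * bits_per_weight / 8


# Hypothetical layer: 64 input channels, 64 output channels, 3x3 kernels.
num_weights = 64 * 64 * 3 * 3

full_precision = storage_bytes(num_weights, 32)  # 32-bit floating point
binarized = storage_bytes(num_weights, 1)        # 1 bit per weight

compression = full_precision / binarized
print(full_precision, binarized, compression)
```

The ratio depends only on the two bit widths, so it is the same for any layer shape.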
Table 1: Main notation used in this paper: circulant transfer matrix; circulant filter; learned filter; feature map; gradient of the learned filter; inverse circulant transfer matrix; binarized filter; number of orientations for each filter; filter index; orientation index; layer index; input feature map index; output feature map index; learning rate.
2 Related Work
DCNNs with a large number of parameters consume considerable computational resources. From our practical study, about half of the power consumption of a DCNN is related to the model size. Many attempts have been made to accelerate and simplify DCNNs while avoiding the increase of the errors. Network binarization is one of the most popular approaches, which is briefly reviewed below.
The research in [Lai2017Deep] demonstrates that the storage of real-valued deep neural networks such as AlexNet [Krizhevsky2012ImageNet], GoogLeNet [Szegedy2014Going] and VGG-16 [Simonyan2014Very] can be reduced significantly when their 32-bit parameters are quantized to 1 bit. Expectation BackPropagation (EBP) in [Soudry2014Expectation] uses a variational Bayesian approach to achieve high performance with a network with binary weights and activations. BinaryConnect (BC) [Courbariaux2015BinaryConnect] extends the idea of EBP by employing 1-bit precision weights (+1 and -1). Later, Courbariaux et al. [Courbariaux2016Binarized] propose BinaryNet (BN), an extension of BC that further constrains activations to +1 and -1, binarizing the input (except for the first layer) and the output of each layer. BC and BN both achieve sufficiently high accuracy on small datasets such as MNIST, CIFAR-10 and SVHN. According to Rastegari et al. [Rigamonti2013Learning], BC and BN are not very successful on large-scale datasets. XNOR [Rastegari2016XNOR] uses a different binarization method and network architecture. Both the weights and the inputs to the convolution are approximated with binary values, which results in an efficient implementation of the convolutional operations by reconstructing the unbinarized filters with a single scaling factor. Recent studies such as MCN [Wang2018Modulated] and Bi-Real Net [Liu2018Bi] explore new network structures and training techniques for binarizing both weights and activations while minimizing accuracy degradation, using a concept similar to XNOR. MCN introduces modulated filters to recover the unbinarized filters and leads to a new architecture to calculate the network model. Bi-Real Net connects the real activations to the activations of consecutive blocks through an identity shortcut.
The results of these studies are encouraging, but due to the weight binarization process, the representational ability of the networks can be degraded. This inspires us to seek a way to increase the filter variations in order to increase the representational ability of the network. In particular, for the first time, we use the circulant matrix to build CiFs for our binarized CNNs. We also develop a CBP algorithm to make the DCNNs more compact and effective in an end-to-end framework.
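The single-scaling-factor reconstruction mentioned above for XNOR-style binarization can be illustrated with a minimal sketch. It assumes the commonly described form in which a real-valued filter is approximated by a scaling factor (the mean absolute weight) times the sign of each weight; the function name and the example values are hypothetical.

```python
def binarize_with_scale(weights):
    """Approximate real-valued weights as alpha * sign(w).

    alpha is the mean absolute value of the weights. sign(0) is taken
    as +1 here, which is an assumption of this sketch.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


weights = [0.5, -1.5, 1.0, -1.0]
alpha, signs = binarize_with_scale(weights)
reconstructed = [alpha * s for s in signs]
print(alpha, signs, reconstructed)
```

Only the signs and one scalar per filter need to be stored, which is where the compression comes from.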
3 Methodology
We design a specific architecture in CBCNs based on CiFs, and train it with a new BP algorithm. Attempting to increase the representational ability reduced by the binarization process, CiFs are designed to enrich the binarized filters for the enhancement of the network performance. As shown in the experiments, the performance drop is marginal even when the learned network parameters are highly compressed. First of all, Table 1 gives the main notation used in this paper.
3.1 Circulant Transfer Matrix
A circulant matrix is defined by a single vector in the first column, with the remaining columns being cyclic permutations of that vector with an offset equal to the column index. An important property of the circulant matrix is that it can produce different representations using simple vectors or matrices. With this unique characteristic, we define the circulant transfer matrix as:
(1)
where vector rotations are used to form the matrix. The first column corresponds to the numbers in Fig. 3 and represents the learned filter; the other columns, obtained by a counterclockwise rotation of the numbers, represent its derived rotated versions. Each column of the circulant transfer matrix represents one rotation angle.
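As an illustration of the circulant structure described above, the following sketch builds the columns of a circulant matrix from its first column by cyclic shifts. The shift direction and the 0-based position labels are assumptions made for illustration; the numbering of Fig. 3 is not reproduced here.

```python
def circulant_columns(first_column):
    """Return the columns of the circulant matrix defined by first_column.

    Column j is the first column cyclically shifted down by j positions.
    """
    n = len(first_column)
    return [
        [first_column[(row - col) % n] for row in range(n)]
        for col in range(n)
    ]


# Eight hypothetical position labels, e.g. the outer ring of a 3x3 filter.
columns = circulant_columns(list(range(8)))
for col in columns:
    print(col)
```

Every column contains the same entries as the first one, only in a shifted order, so the whole matrix is determined by a single vector.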
3.2 Circulant Filters (CiFs)
We now design the specific convolutional filters, CiFs, used in our CBCNs. A CiF is a tensor generated based on a learned filter and the circulant transfer matrix. These CiFs are deployed across all convolutional layers. As shown in Fig. 4(a), the original learned filter is replicated three times and the copies are concatenated to obtain the CiF. With four orientations, every channel of the network input is replicated as a group of four elements. By doing so, we can easily implement our CBCNs using PyTorch. One example of the CBConv is illustrated in Fig. 4(b).
To facilitate the math description below, we represent a learned filter as a vector so that its corresponding CiF can be represented using a matrix (see Fig. 4(a)). This matrix representing the CiF is given by:
(2) 
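where the learned filter is written as a vector containing its weights (including the central one, which is unchanged during rotation), and each column of the CiF matrix is obtained by applying the rotation operation given by the corresponding column of the circulant transfer matrix (see Fig. 3). The first column corresponds to the learned filter, and the other columns are introduced to increase the representational ability.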
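The generation of a CiF from a learned filter can be sketched for a single 3x3 filter as follows. The ring ordering, the function names and the choice of evenly spaced orientations are assumptions made for illustration; the sketch only shows the idea of rotating the outer weights counterclockwise while keeping the central weight fixed.

```python
# Outer positions of a 3x3 filter, listed in counterclockwise order
# starting from the top-left corner (an illustrative convention).
RING = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]


def rotate_filter(filt, steps):
    """Rotate the outer ring counterclockwise by steps * 45 degrees."""
    out = [row[:] for row in filt]  # the central weight stays in place
    for p, (r, c) in enumerate(RING):
        rq, cq = RING[(p + steps) % len(RING)]
        out[rq][cq] = filt[r][c]
    return out


def make_cif(filt, num_orientations):
    """Stack evenly spaced rotated copies of one learned filter."""
    step = len(RING) // num_orientations
    return [rotate_filter(filt, k * step) for k in range(num_orientations)]


learned = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
for sub_filter in make_cif(learned, 4):
    print(sub_filter)
```

With four orientations, each step is a 90-degree rotation; the first sub-filter is the learned filter itself.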
3.3 Forward Propagation of CBCNs based on the CBConv Module
In CBCNs, a binary CiF is calculated as:
(3)
where the binary CiF is computed from the corresponding full-precision CiF, and its values are +1 or -1. Both the full-precision CiF and the binary CiF are jointly obtained in the end-to-end learning framework. Given the set of all the binarized filters in a layer, the output feature maps are obtained by:
(4) 
where CBConv is the convolution operation implemented as a new module including the CiF generation process (the blue part in Fig. 2). Fig. 4(b) shows a simple example of a forward convolution where there is one input feature map with one generated output feature map. In the CBConv, the channels of an output feature map are generated as follows:
(5) 
(6) 
where denotes the convolution operation. is the channel of the feature map, and denotes the feature map of the input in the convolutional layer. In Fig. 4(b), and , where after the CBConv with one binary CiF, the number of the channels of the output feature map is the same as that of the input feature map.
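The forward pass described in this subsection can be sketched for one input channel and one CiF. This is a simplified illustration under stated assumptions: the sub-filters are binarized with a sign function (with sign(0) taken as +1), the input is kept real-valued, each orientation produces one output channel, and the exact channel indexing of Eqs. 5 and 6 is not reproduced. All names and values are hypothetical.

```python
def binarize(filt):
    """Map every weight to +1 or -1 (sign(0) is taken as +1 here)."""
    return [[1 if w >= 0 else -1 for w in row] for row in filt]


def correlate_valid(image, kernel):
    """2D cross-correlation without padding, as used in DCNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(ow)
        ]
        for i in range(oh)
    ]


def cbconv_one_input(image, cif):
    """One output channel per binarized sub-filter of the CiF."""
    return [correlate_valid(image, binarize(f)) for f in cif]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# A hypothetical CiF with two orientations; the second sub-filter is the
# first one rotated counterclockwise by 90 degrees.
cif = [
    [[0.3, 0.1, -0.2], [-0.4, 0.5, 0.6], [-0.7, -0.8, 0.9]],
    [[-0.2, 0.6, 0.9], [0.1, 0.5, -0.8], [0.3, -0.4, -0.7]],
]
print(cbconv_one_input(image, cif))
```

The two orientations respond differently to the same input, which is the source of the more diverse feature maps.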
3.4 Circulant Back Propagation (CBP)
During backpropagation, only the learned filters need to be learned and updated. The inverse transformation of the circulant transfer matrix is involved in the BP process to further enhance the representational ability of our CBCNs. To facilitate the math description below, we define the inverse circulant transfer matrix, which is formed by inverse vector rotations. To obtain the gradient of a learned filter, we need to sum up the gradients of the sub-filters in the corresponding CiF. Thus:
(7) 
(8) 
where the gradient is taken with respect to the network loss function and the update is scaled by the learning rate. Note that since the central weights of CiFs are not rotated, their gradients are obtained as in the common BP procedure and are not presented in Eq. 7. As shown, the circulant operation is involved in our BP process, which makes CBP adaptive to orientation variations in objects.
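The idea behind CBP can be sketched for a single 3x3 learned filter: the gradient of each sub-filter is rotated back to the orientation of the learned filter, the aligned gradients are summed, and a gradient-descent step is taken. This is a minimal sketch, not the exact form of Eqs. 7 and 8: the ring ordering and function names are hypothetical, and the central gradients are simply summed here, which is an assumption.

```python
# Outer positions of a 3x3 filter in counterclockwise order (illustrative).
RING = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]


def rotate(filt, steps):
    """Rotate the outer ring by steps * 45 degrees (negative = inverse)."""
    out = [row[:] for row in filt]
    for p, (r, c) in enumerate(RING):
        rq, cq = RING[(p + steps) % len(RING)]
        out[rq][cq] = filt[r][c]
    return out


def learned_filter_gradient(subfilter_grads):
    """Inverse-rotate each sub-filter gradient and sum the results."""
    step = len(RING) // len(subfilter_grads)
    total = [[0.0] * 3 for _ in range(3)]
    for k, grad in enumerate(subfilter_grads):
        aligned = rotate(grad, -k * step)  # undo the k-th rotation
        for r in range(3):
            for c in range(3):
                total[r][c] += aligned[r][c]
    return total


def sgd_step(filt, grad, lr):
    """Plain gradient-descent update of the learned filter."""
    return [[w - lr * g for w, g in zip(wr, gr)]
            for wr, gr in zip(filt, grad)]


base = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
grads = [rotate(base, 2 * k) for k in range(4)]  # four orientations
total = learned_filter_gradient(grads)
updated = sgd_step([[0.0] * 3 for _ in range(3)], total, lr=0.5)
print(total)
print(updated)
```

Only the learned filter is stored and updated; the rotated sub-filters are regenerated from it at each forward pass.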
For the gradient of the sign function, special treatment is necessary due to its discontinuity. In [Courbariaux2016Binarized] and [Liu2018Bi], the sign function is approximated by the clip function and the piecewise polynomial function, respectively, as shown in Fig. 5, where their corresponding derivatives are also given. Since the derivative of the sign function (an impulse) can be represented as the limit of a Gaussian function, in this work we use a Gaussian function (Fig. 5) as the approximation of the gradient:
(9) 
where the amplitude gain and the variance of the Gaussian function are determined empirically. In our experiments, we find that our Gaussian approximation performs better than the clip and piecewise polynomial approximations in Fig. 5. From the equations above, we can see that the BP process can be easily implemented. By updating only the learned convolution filters with the help of CiFs, our CBCNs are significantly compact and efficient, reducing the memory storage by a factor of 32. Finally, the learning algorithm to train CBCNs is given in Algorithm 1.
Table 2: Error rates (%) on the original and rotated (rot) datasets; fp denotes full precision.
Dataset    Backbone    Kernel stage     Original (%)              Rot (%)
                                        fp      XNOR    CBCN      fp      XNOR    CBCN
MNIST      LeNet       5-10-20-40       0.91    3.76    1.91      2.77    17.26   5.76
MNIST      LeNet       10-20-40-80      0.69    1.50    1.24      1.89    7.77    4.95
CIFAR-10   ResNet-18   16-16-32-64      8.94    22.88   10.9      19.07   40.75   19.68
CIFAR-10   ResNet-18   32-32-64-128     6.63    15.55   8.13      12.96   33.69   16.2
CIFAR-10   ResNet-18   32-64-128-256    5.27    13.43   8.09      10.47   21.93   15.11
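The Gaussian approximation of the sign-function gradient discussed in Section 3.4 can be sketched as below, next to the clip-based alternative. The exact form of Eq. 9 is not reproduced here: the sketch assumes a normalized Gaussian density scaled by an amplitude gain, and the default parameter values are placeholders rather than the settings used in the experiments.

```python
import math


def gaussian_sign_grad(x, amplitude=1.0, variance=1.0):
    """Gaussian-shaped surrogate for the derivative of sign(x).

    Assumed form: amplitude times a zero-mean Gaussian density with the
    given variance. Both defaults are placeholders for illustration.
    """
    density = math.exp(-x * x / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance)
    return amplitude * density


def clip_sign_grad(x):
    """Derivative of the clip approximation: 1 inside [-1, 1], else 0."""
    return 1.0 if abs(x) <= 1.0 else 0.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, gaussian_sign_grad(x), clip_sign_grad(x))
```

Unlike the clip derivative, the Gaussian surrogate is smooth and never exactly zero, so weights far from zero still receive a small gradient.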
4 Experiments
Our CBCNs are evaluated on object classification using the MNIST [L1998Gradient], CIFAR-10/100 [Krizhevsky2009Learning] and ILSVRC12 ImageNet [Russakovsky2015ImageNet] datasets. LeNet [L1998Gradient], Wide-ResNet (WRN) [Zagoruyko2016Wide] and ResNet-18 [He2016Deep] are employed as the backbone networks to build our CBCNs simply by replacing the full-precision convolution with CBConv. The neuron activations are also binarized in all of our experiments.
4.1 Datasets and Implementation Details
Datasets: The MNIST [L1998Gradient] dataset is composed of a training set of 60,000 and a testing set of 10,000 grayscale images of handwritten digits from 0 to 9. Each sample is randomly rotated to yield MNIST-rot.
CIFAR-10 [Krizhevsky2009Learning] is a natural image classification dataset containing a training set of 50,000 and a testing set of 10,000 color images across the following 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, while CIFAR-100 consists of 100 classes. We also randomly rotate each sample in the CIFAR-10 dataset to yield CIFAR-10-rot.
The ImageNet object classification dataset [Russakovsky2015ImageNet] is more challenging due to its large scale and greater diversity. It contains 1000 classes, 1.2 million training images and 50k validation images. We compare our method with the state of the art on the ImageNet dataset, adopting ResNet-18 to validate the superiority and effectiveness of CBCNs.
In the implementation, LeNet, WRN, and ResNet-18 backbone networks are used to build CBCNs. We simply replace the full-precision convolution with CBConv, and keep the other components unchanged. The parameters and for the Gaussian function in Eq. 9 are set to 1 and , respectively. More details are elaborated below.
LeNet Backbone: LeNet contains four simple convolutional layers. We adopt max-pooling and ReLU after each convolution layer, and a dropout layer after the fully connected layer to avoid overfitting. The initial learning rate is 0.01 with no degradation before reaching the maximum epoch of 50 for MNIST and MNIST-rot.
WRN Backbone: WRN is a network structure similar to ResNet, with a depth factor that controls the expansion of the feature map depth dimension through 3 stages, within which the dimensions remain unchanged. For simplicity, we fix the depth factor to 1. Each WRN has a parameter that indicates the channel dimension of the first stage; we set it to 16, leading to the network structure 16-16-32-64. The training details are the same as in [Zagoruyko2016Wide]. The initial learning rate is 0.01 with a degradation of 10% for every 60 epochs before it reaches the maximum epoch of 200 for CIFAR-10/100 and CIFAR-10-rot. For example, WRN-22 is a network with 22 convolutional layers, and similarly for WRN-40.

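Table 3: Ablation study on CIFAR-10 with ResNet-18 (accuracy, %). S: doubled shortcuts; C: center loss; G: Gaussian gradient function.
Model                    ConvComp   S       S+C     S+G     S+C+G
XNOR                     76.3       80.53   80.97   81.65   82.32
CBCN (2 orientations)    81.84      85.79   86.23   86.67   87.56
CBCN (4 orientations)    84.79      89.10   89.6    90.22   90.83
CBCN (8 orientations)    86.79      90.80   91.27   91.53   92.02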
ResNet-18 Backbone: Fig. 6 illustrates the architectures of ResNet-18, XNOR and CBCNs, respectively. SGD with momentum is used as the optimization algorithm, with a weight decay of 1e-4. The initial learning rate is 0.01 with a degradation of 10% for every 20 epochs before reaching the maximum epoch of 70 on ImageNet, while on CIFAR-10/100, the initial rate is 0.01 with a degradation of 10% for every 60 epochs before reaching the maximum epoch of 200.
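The step schedules described above can be sketched with a small helper. The text leaves the exact meaning of "a degradation of 10%" open; this sketch assumes the learning rate is multiplied by 0.1 at each step, and the factor is exposed as a parameter so that the alternative reading (multiplying by 0.9) can be tried as well. The function name is hypothetical.

```python
def step_learning_rate(epoch, initial_lr=0.01, factor=0.1, step_size=60):
    """Learning rate at a given epoch under a step schedule.

    factor=0.1 is an assumed reading of "a degradation of 10%".
    """
    return initial_lr * factor ** (epoch // step_size)


# CIFAR-style setting: a step every 60 epochs, 200 epochs in total.
for epoch in (0, 59, 60, 120, 199):
    print(epoch, step_learning_rate(epoch))

# ImageNet-style setting: a step every 20 epochs, 70 epochs in total.
for epoch in (0, 20, 40, 69):
    print(epoch, step_learning_rate(epoch, step_size=20))
```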
4.2 Rotation Invariance
With LeNet and ResNet18 backbones, we build XNOR and CBCNs and compare them on MNIST, MNISTrot, CIFAR10, and CIFAR10rot. is set to in CBCNs.
Table 2 gives the results in terms of error rates, where 'fp' denotes the full-precision results. The state-of-the-art XNOR has a dramatic performance drop on the more challenging rotated datasets. On MNIST-rot, with the kernel stage 5-10-20-40, CBCN shows an impressive performance improvement of 11.5% over XNOR, while a 1.85% improvement is achieved on MNIST. On CIFAR-10-rot, with the kernel stage 16-16-32-64, CBCN has about 20% improvement over XNOR. From Table 2, we can also see that on CIFAR-10-rot, the performance gap between CBCN and XNOR decreases from about 20% to 17% to 6% as the kernel stage (number of parameters) increases, meaning that the improvement of CBCN over XNOR is more significant when the networks have fewer parameters. The results in Table 2 confirm that, with the improved representational ability from the proposed CiFs, CBCNs are more robust than conventional binarization methods to rotation variations of the input images.
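The improvements quoted above can be recomputed from the error rates reported in the rotation table. The short sketch below does this arithmetic; the dictionary layout and names are only an illustrative way of organizing the reported numbers.

```python
# Reported error rates (%) of XNOR and CBCN, keyed by (dataset, kernel stage).
error_rates = {
    ("MNIST", "5-10-20-40"): {"XNOR": 3.76, "CBCN": 1.91},
    ("MNIST-rot", "5-10-20-40"): {"XNOR": 17.26, "CBCN": 5.76},
    ("CIFAR-10-rot", "16-16-32-64"): {"XNOR": 40.75, "CBCN": 19.68},
    ("CIFAR-10-rot", "32-32-64-128"): {"XNOR": 33.69, "CBCN": 16.2},
    ("CIFAR-10-rot", "32-64-128-256"): {"XNOR": 21.93, "CBCN": 15.11},
}

# Gap = XNOR error minus CBCN error, in percentage points.
gaps = {
    key: round(rates["XNOR"] - rates["CBCN"], 2)
    for key, rates in error_rates.items()
}

for key, gap in gaps.items():
    print(key, gap)
```

The gap on CIFAR-10-rot shrinks as the kernel stage grows, which matches the trend described in the text.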
4.3 Ablation Study
In this section, we study the performance contributions of the components in CBCNs, which include CBConv, center loss, additional shortcuts (Fig. 6), and the Gaussian gradient function (Eq. 9). CIFAR-10 and ResNet-18 with kernel stage 16-16-32-64 are used in this experiment. The details are given below.
Table 4: Classification accuracy (%) on CIFAR-10/100.
Model              Kernel stage     CIFAR-10   CIFAR-100
BNN                -                89.85      -
BWN                -                90.12      -
XNOR (ResNet-18)   64-64-128-256    87.1       66.08
XNOR (WRN-40)      64-64-128-256    91.58      73.18
ResNet-18          16-16-32-64      94.84      75.37
CBCN               16-16-32-64      90.22      69.97
CBCN               32-64-128-256    91.60      70.07
WRN-40             16-16-32-64      95.8       79.41
WRN-22             16-16-32-64      90.32      67.19
CBCN               16-16-32-64      93.42      74.80
Table 5: Classification accuracy (%) on ImageNet with ResNet-18.
Metric   Full-Precision   XNOR   ABC-Net   BinaryNet   Bi-Real   CBCN
Top-1    69.3             51.2   42.7      42.2        56.4      61.4
Top-5    89.2             73.2   67.6      67.1        79.5      82.80
1) We only replace the convolution BConv in XNOR with our CBConv convolution and compare the results. As shown in the ConvComp column in Table 3, CBCN with four orientations achieves about 8% accuracy improvement over XNOR (84.79% vs. 76.3%) using the same network structure and shortcuts as in ResNet-18. This significant improvement verifies the effectiveness of our CBConv.
2) In CBCNs, if we double the shortcuts (Fig. 6), we also find a decent improvement from 84.79% to 89.10% (see the column under S in Table 3), which shows that increasing the number of shortcuts can also enhance binarized deep networks.
3) Fine-tuning CBCN with the center loss can also improve the performance of CBCN by 0.5%, as shown in the column under S+C in Table 3.
4) When the piecewise polynomial function in [Liu2018Bi] is replaced with the Gaussian function for back propagation, CBCN obtains a 1.12% improvement (90.22% vs. 89.10%), which shows that the gradient function we use is a better choice.
5) From the column under S in Table 3, we can see that CBCN performs better when using more orientations in CiFs. More orientations can better deal with the problem of degraded representation caused by network binarization.
4.4 Accuracy Comparison with State-of-the-Art
CIFAR-10/100: The same parameter settings are used in CBCNs on both CIFAR-10 and CIFAR-100. We first compare our CBCNs with the original ResNet-18 with kernel stages 16-16-32-64 and 32-64-128-256, followed by a comparison with the original WRNs with the initial channel dimension 16 in Table 4. Then, we compare our results with other state-of-the-art methods such as BNN [Courbariaux2015BinaryConnect], BWN [Rastegari2016XNOR], and XNOR [Rastegari2016XNOR]. It is observed that at least 1.84% (= 93.42% - 91.58%) accuracy improvement is gained with our CBCN, and in other cases, larger margins are achieved. We also plot the training and testing loss curves of XNOR and CBCN in Fig. 7, which clearly show that CBCN (CBP) converges faster than XNOR (BP).
ImageNet: Four state-of-the-art methods on ImageNet are chosen for comparison: Bi-Real Net [Liu2018Bi], BinaryNet [Courbariaux2016Binarized], XNOR [Rastegari2016XNOR] and ABC-Net [Lin2017Towards]. These four networks are representative methods of binarizing both network weights and activations and achieve state-of-the-art results. All the methods in Table 5 perform the binarization of ResNet-18. For a fair comparison, our CBCN contains the same number of learned filters as ResNet-18. The comparative results in Table 5 are quoted directly from the references, except that the result of BinaryNet is from [Lin2017Towards]. The comparison clearly indicates that the proposed CBCN outperforms the four binary networks by a considerable margin in terms of both the top-1 and top-5 accuracies. Specifically, for top-1 accuracy, CBCN outperforms BinaryNet and ABC-Net by a gap of over 18%, achieves about 10% improvement over XNOR, and about 5% over the latest Bi-Real Net. In Fig. 8, we plot the training and testing loss curves of XNOR and CBCN. It clearly shows that, using our CBP algorithm, CBCN converges faster than XNOR.
5 Conclusion
In this paper, we have proposed new circulant binary convolutional networks (CBCNs) that are implemented by a set of binary circulant filters (CiFs). The proposed CiFs and circulant binary convolution (CBConv) are used to enhance the representational ability of binary networks. CBCNs can be trained end-to-end with the developed circulant BP (CBP) algorithm. Our extensive experiments demonstrate that CBCNs are superior to state-of-the-art binary networks and obtain results that are closer to those of the full-precision backbone networks, ResNets and WRNs, with a storage reduction of about 32 times. As a generic convolutional layer, CBConv can also be used in various other tasks, which we leave as future work.
6 Acknowledgment
The work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB0502602) and the National Key R&D Plan (2017YFC0821102). Baochang Zhang is the corresponding author.