1 Introduction
With the rapid development of deep learning, Convolutional Neural Networks (CNNs) LeCun et al. (1989) have become ubiquitous in tasks ranging from image classification LeCun et al. (1998); Krizhevsky et al. (2012b) to semantic segmentation Long et al. (2015); Chen et al. (2016) and object detection Girshick (2015); Ren et al. (2015); Liu et al. (2016), since they effectively extract valuable and abstract features. Although deep models Krizhevsky et al. (2012b); Simonyan and Zisserman (2014a); Szegedy et al. (2016); He et al. (2016); Huang et al. (2016) are very powerful, their large number of learnable parameters leads to heavy computation and memory consumption on devices. For example, ResNet101 He et al. (2016) and DenseNet100 Huang et al. (2016) have 39M and 27M parameters, respectively. As a result, CNNs are hard to deploy on many low-power devices. Evidently, deep neural networks would be used more widely if their computational cost and storage requirements could be significantly reduced.
Recently, an increasing line of effort has been devoted to compressing CNN models. Most existing approaches either design delicate structures or prune redundant connections of networks. Kernel factorization and weight pruning are two representative branches of approaches for reducing the model size of deep neural networks. Kernel factorization, as in InceptionV3 Szegedy et al. (2016), uses asymmetric kernels like 1×3 and 3×1 and intersects them perpendicularly to replace one normal square kernel. Since asymmetric kernels have line segment shapes, this kind of two-layer solution is cheaper than a normal square kernel for the same number of output channels. Weight pruning Han et al. (2015); Venkatesh et al. (2016) is a kind of post-processing method. It deletes the connections whose weights are smaller than set thresholds. Both kinds of methods assume that the generic architecture of CNNs has much parameter redundancy, and that kernels even smaller than the original square kernels are competent for feature extraction. However, existing methods have some limitations. Asymmetric kernel factorization only has two kinds of angles, i.e., vertical and horizontal, which somewhat restricts the modeling capability of convolution. Common weight pruning cannot be completed in one step, as it needs to wait until the whole kernels have been pretrained and then prune and finetune them. Moreover, common weight pruning neither reduces computing cost nor speeds up the original models, because the pruned models need to be saved with complicated hashing techniques.
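The threshold-based weight pruning described above can be illustrated with a minimal sketch. The kernel values, the threshold and the helper names `magnitude_prune` and `to_sparse` are hypothetical and chosen only for illustration; they are not taken from any of the cited methods. The sketch shows why the surviving weights land at irregular positions, so that their indices must be stored alongside their values.

```python
# Hypothetical 3x3 kernel; the weight values are made up for illustration.
kernel = [
    [0.02, -0.61, 0.05],
    [0.48, 0.01, -0.03],
    [-0.04, 0.07, 0.55],
]

def magnitude_prune(weights, threshold):
    """Zero every weight whose absolute value is below `threshold`."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def to_sparse(weights):
    """Keep only the surviving weights together with their (row, col) indices."""
    return [(r, c, w)
            for r, row in enumerate(weights)
            for c, w in enumerate(row)
            if w != 0.0]

pruned = magnitude_prune(kernel, threshold=0.1)
sparse = to_sparse(pruned)  # three survivors at scattered positions
```

Each surviving weight costs one value plus one position entry, and the positions follow no regular pattern, which is the bookkeeping overhead that dense libraries cannot exploit.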
In this work, we aim to significantly reduce the number of parameters of CNN models without the additional computing cost caused by complicated hashing techniques, while maintaining acceptable performance. To achieve these targets, we have to face several challenges. First, the redundant parameters in CNN models are usually distributed stochastically. Removing such redundant parameters results in sparse parameter matrices, which leads to additional hashing cost when saving and loading models. Second, most acceleration libraries for deep neural networks, e.g., cuDNN Chetlur et al. (2014) and OpenBLAS Xianyi et al. (2014), are developed for dense matrices. It is a challenging task to leverage these libraries to speed up pruned networks whose weight matrices are sparse. Finally, as Figure 1 shows, the key that distinguishes CNN models from other models in computer vision is the convolutional operation, like 3×3 convolution, which can effectively extract and assemble edges, angles and shapes from low-level to high-level features by learning the assembly patterns of convolutional kernels layer by layer. Intuitively, we can abstract the “principal components” from convolutional kernels to achieve an efficient assembly. As shown in Figure 2(a), an irregular shape can be modeled by assembling two “principal components” of squares. However, as in Figure 2(b), feature maps are usually diverse. Thus, as shown in Figure 2(c), the trained convolutional kernels should have various “principal components”, which causes lots of redundant parameters. It is very challenging to maintain the powerful ability of CNN kernels to extract and assemble features while also finding a general pattern that efficiently expresses the “principal components”.
To address these problems, we propose a simple, efficient, yet effective method, known as Rotated Convolution (RotateConv), to make up for the deficiencies of traditional kernel factorization and weight pruning. Traditional convolutional kernels, whose shapes are usually square or rectangular, learn the patterns in feature maps by overlapped scanning. As shown in Figure 2(c), AlexNet has learned a variety of frequency- and orientation-selective kernels. These kernels have clear skeletons composed of “black grids”, and most of them appear as line segments. Inspired by such evidence, our RotateConv is shown as the last one in Figure 3. Aiming at assembling “principal components”, the basic shape of a RotateConv kernel is a line segment, which means that it only has 2 or 3 weights for convolution. Besides weights, a RotateConv kernel has an additional learnable variable, namely, the angle, which makes RotateConv much more flexible than the asymmetric kernels of kernel factorization, since its direction can take a continuous range of angles instead of only the horizontal and vertical ones. Note that the kernel is made rotatable by making the angle variable continuous and learnable.
Thanks to the angle, these line segment kernels can cover a larger receptive field, so their modeling capacity can still be guaranteed. In addition, to avoid the computing cost caused by complicated hashing techniques when loading a pruned model, we leverage interpolation-based algorithms instead of hashing techniques to store and restore pruned models efficiently. Considering different emphases on model size and performance, we propose two variants of rotatable kernels. One has 4 parameters, consisting of 3 weights and an angle value, for each convolutional kernel. The other has only 3 parameters for each convolutional kernel and one additional parameter for each convolutional filter. Accordingly, the two variants yield different compression ratios for 3×3 convolution.
Compared with the existing work, our contributions in this work can be summarised as follows.

Compared with the asymmetric kernels of kernel factorization, the proposed RotateConv kernels are much more flexible, since they can take any angle in a continuous range instead of only the horizontal and vertical directions. Compared with previous weight pruning methods, our RotateConv kernels do not need pretraining: they not only have far fewer weights than square kernels, but are also born with a line segment shape that can be trained end to end.

Compared with previous methods, which store and restore pruned models by hashing techniques, we propose learnable, interpolation-based methods to store and restore pruned models efficiently. Thanks to this advantage, our pruned CNN models have great potential to run faster on low-power devices, such as ARM CPUs and FPGAs.

We conduct a series of experiments to validate the proposed methods. The results show that our final pruned models achieve competitive performance, e.g., for SSD Liu et al. (2016), we can reduce the number of parameters by 80% and save more than 60% of the FLOPs without a slump in accuracy.
2 Related Work
Since RotateConv aims to achieve a more compressed convolutional kernel, related work can be divided into two groups, i.e., convolution kernel design and model compression.
2.1 Convolution kernel design
The most representative work on kernel design in deep CNNs is the Inception series Szegedy et al. (2015); Ioffe and Szegedy (2015b); Szegedy et al. (2016). InceptionV1 uses multi-scale kernels to extract scale-invariant features. InceptionV2 stacks more small kernels in place of one bigger kernel so as to increase the depth of the network and reduce the number of parameters. InceptionV3 makes the kernels even smaller by using asymmetric kernels like 1×3 and 3×1. As mentioned above, although these asymmetric kernels reduce parameters considerably, their fixed vertical and horizontal angles limit their capacity to model more orientation-flexible patterns. Dilated convolution Yu and Koltun (2015) is another widely used method of convolution kernel design, which aims at solving the resolution reduction problem of feature maps in forward propagation. It is a dilated variant of traditional compact kernels, which gives the kernel a larger receptive field without increasing parameters. However, dilated kernels usually cause memory cache misses and suffer from unexpected speed bottlenecks.
Recently, some novel deformable kernel designs have emerged. Deformable Convolutional Networks (DCN) Dai et al. (2017) is a recently proposed work that learns irregular kernels. DCN shares a similar idea with the Region Proposal Network Ren et al. (2015): it applies a usual convolution to the input feature and outputs the new kernel shape for the following deformable convolution layer. Irregular Convolutional Neural Networks (ICNN) Ma et al. (2017) is another work that learns irregular kernels. Different from DCN, ICNN directly models the kernel’s shape attributes as learnable variables and learns the shapes in the same way as kernel weights. Although these two methods expand the capacity of convolutional kernels, they use extra parameters and make the calculation more complicated. Unlike these methods, ours changes the shape of the convolutional kernels in order to maintain the capacity of CNN models with fewer parameters.
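The statement that a dilated kernel enlarges the receptive field without adding parameters can be made concrete with a small sketch. The function names `effective_kernel_size` and `dilated_taps` are hypothetical helpers introduced here for illustration; the formula is the standard relation between kernel size, dilation and spatial extent.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one axis of a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def dilated_taps(kernel_size, dilation):
    """Offsets, relative to the centre, sampled along one axis."""
    half = kernel_size // 2
    return [dilation * (i - half) for i in range(kernel_size)]

# A 3-tap kernel with dilation 2 still has 3 weights per axis,
# but it spans 5 positions and skips every other pixel.
extent = effective_kernel_size(3, 2)
taps = dilated_taps(3, 2)
```

The skipped positions are also why the memory accesses of a dilated kernel are strided rather than contiguous.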
2.2 Model compression
There has been growing interest in model compression due to the demands of device-limited applications. In terms of their potential to reduce storage requirements and accelerate deep models at inference time, related works on model compression can be grouped into low-rank decomposition, weight sparsifying, structured pruning and quantization representation.
Low-rank decomposition methods approximate the weight matrices of neural networks with low-rank techniques like Singular Value Decomposition Tai et al. (2015) or CP-decomposition Lebedev et al. (2015). These methods usually perform well on “big layers”, such as fully-connected layers and convolution layers using big kernels like 5×5 and 7×7. Thus, they can yield significant model compression and acceleration on AlexNet Krizhevsky et al. (2012a) with a small compromise on accuracy. However, these methods cannot achieve notable compression on modern CNN architectures, since their convolutional layers tend to use small kernels like 3×3 and 1×1.
Weight sparsifying methods attempt to prune the unimportant connections in pretrained neural networks. The resulting network’s weights are mostly zeros, so the storage space can be reduced by storing the model in sparse formats. Han et al. (2015) alternately prune the unimportant connections with given thresholds and finetune the pruned networks, reducing the number of parameters by 9× and 13× for AlexNet and VGG16, respectively. However, this kind of method can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware, since the sparse weights need auxiliary operations to be retrieved at inference time. Srinivas et al. (2017) overcome the limitation of generating sparse weights by setting thresholds: they explicitly impose sparse constraints over each weight tensor and achieve high compression rates with more efficient training. However, this method still suffers from the same drawback as the work of Han et al. (2015), i.e., it easily obtains small models, but can hardly speed up networks on general computing devices like CPUs, since sparse matrix processing is not CPU-friendly.
Structured pruning works prune redundant structures, such as channels and layers, in trained deep models. Channel pruning methods Li et al. (2016); Liu et al. (2017) are the most representative work in this branch. Li et al. (2016) prune the input channels of each layer as indicated by the weights of convolutional kernels, while Liu et al. (2017) learn a scaling factor that indicates the importance of each output channel, and then heuristically prune the output channels with the learned scaling factors. Changpinyo et al. (2017) introduce sparsity on both input and output channels by randomly deactivating input-output channel-wise connections. To achieve a full-scale solution, Wen et al. (2016) utilize a Structured Sparsity Learning (SSL) method to sparsify different levels of structures (e.g., filters, channels or layers) in CNNs. He et al. (2017) effectively prune each layer by channel selection based on LASSO regression Hans (2009) and least square reconstruction of the output. All of these methods utilize sparsity regularization during training to obtain structured sparsity. Compared with weight sparsifying, these channel pruning works overcome the need to restore compact matrices or to rely on dedicated speed-up libraries.
Quantization representation methods quantize network weights from floating point (usually 32 bits) to a few bits. BinaryNet Hubara et al. (2016) and XNOR-Net Rastegari et al. (2016) can achieve 32× compression and 58× speed-up on deep models by using just one bit to store each weight together with bitwise operations, but often notably sacrifice accuracy. HashNet Chen et al. (2015) quantizes the network weights by a hash strategy, while Han et al. (2015) quantize the network weights by a clustering strategy. These methods assign the network weights to different groups, and the weight value is shared within each group. In this way, only the shared weights and the indices of weights need to be stored, so a large amount of storage space can be saved. However, these techniques can save neither runtime memory nor inference time unless they are aided by special devices, since the shared weights need to be restored to their original positions before calculation.
3 Mathematical Derivation of Rotated Convolution
Early works Han et al. (2015); Chen et al. (2015); He et al. (2017) have shown that, by reserving only the “principal components” of CNN models, i.e., a few important weights, the models can still maintain their ability of feature extraction. In this section, we aim to prune and compress standard convolution networks to such “principal components” by elaborating the Rotated Convolution Network (RCN). Accordingly, we first formulate the rotated convolution kernel of RCN, and present in detail how we efficiently obtain an RCN with both a high compression rate and high accuracy from standard CNNs. Then we explain the mathematical derivation of RotateConv.
3.1 Rotated convolution kernel
Different from a traditional CNN model, whose convolutional kernels have fixed shapes and can be expressed as matrices, e.g., a standard 3×3 kernel is expressed as a 3×3 matrix, our RCN model is formed by rotated convolution kernels, each of which consists of 3 weights placed on a straight line as shown in Figure 3. Accordingly, instead of using a matrix, we formulate the rotated convolution kernel by building a coordinate system on its corresponding standard kernel.
(1) 
where denotes the number of input channels and is the number of output channels. is the set of kernel weights and each kernel has 3 weights. is the set of kernel angles . is defined as the included angle between the horizontal line and the kernel line as shown in Figure 3, which is range in to . For a RotateConv kernel shown in Figure 4(a), the output can be calculated as:
(2) 
where is the weighted summation in the th output channel. Inputs , and correspond to the weights , and , respectively. Note that for RotateConv, , are sampled by .
3.2 Interpolation based on angle
The rotated convolution kernel can be viewed as the “principal components” of its corresponding standard kernel. Hence the convolution only needs to process the pixels corresponding to the “principal components”. However, since the angle is defined to be continuous, the two end weights generally do not fall on pixel positions of the feature map unless the angle is an exact multiple of the angular spacing between adjacent grid positions.
To solve this problem, we first build a coordinate system on the related standard kernel, as shown in Figure 4(b). Then, we have two choices for matching input pixels with weights: we can either combine the input pixels corresponding to the two end weights, or split the two end weights onto the adjacent grid positions, as shown in Figure 4(c). Most deep learning frameworks use the im2col algorithm to transform convolution into dense matrix computation. If we combined the input pixels, im2col would no longer apply, because the angle is a variable. Therefore, in this work, we adopt the second method to avoid the extra time cost in the practical convolution implementation. As a result, a RotateConv kernel associated with a 3×3 square kernel is equivalent to keeping 5 weights, as in Figure 4(b). The split process can be calculated as:
(3) 
where each of the two end weights is split into a pair of weights placed on its two adjacent grid positions. The split sizes are determined by a ratio, which is simply defined as the included angle divided by the angular spacing between adjacent grid positions.
To this end, the convolution is a weighted summation after inverse interpolation based on the angle:
(4) 
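The split of the two end weights onto the grid, followed by an ordinary dense convolution, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a 3×3 grid whose eight outer positions are 45 degrees apart, a linear split by the angle ratio, a hypothetical ordering `RING` of those positions, and that the first and third weights sit on opposite ends of the segment. The names `split_to_3x3` and `conv_at_centre` are introduced here for illustration.

```python
# Assumed layout: outer positions of a 3x3 kernel as (row, col), where
# entry k lies at angle k * 45 degrees (0 = right, 90 = top).
RING = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]

def split_to_3x3(w1, w2, w3, theta_deg, step=45.0):
    """Spread a 3-weight line-segment kernel onto a dense 3x3 kernel."""
    kernel = [[0.0] * 3 for _ in range(3)]
    kernel[1][1] = w2  # the middle weight stays at the centre
    # Assumption: w1 lies at angle theta, w3 at the opposite end.
    for weight, angle in ((w1, theta_deg), (w3, theta_deg + 180.0)):
        angle %= 360.0
        k = int(angle // step)                 # lower adjacent grid position
        alpha = (angle - k * step) / step      # ratio of the included angle
        r0, c0 = RING[k % 8]
        r1, c1 = RING[(k + 1) % 8]
        kernel[r0][c0] += weight * (1.0 - alpha)
        kernel[r1][c1] += weight * alpha
    return kernel

def conv_at_centre(patch, kernel):
    """Weighted summation of one 3x3 input patch with a dense 3x3 kernel."""
    return sum(patch[r][c] * kernel[r][c] for r in range(3) for c in range(3))

patch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = split_to_3x3(w1=1.0, w2=2.0, w3=3.0, theta_deg=22.5)
output = conv_at_centre(patch, kernel)
nonzero = sum(1 for row in kernel for w in row if w != 0.0)
```

With an angle halfway between two grid positions, each end weight is shared equally by its two neighbours, so the dense kernel carries five non-zero entries and can be fed to any standard dense convolution routine.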
3.3 Back propagation
There are three kinds of variables that receive gradients, i.e., inputs, weights and angles, and one kind of intermediate variable, namely the split weights. The gradients for the inputs are calculated in the same way as in traditional convolution, and the gradients for the split weights and the center weight can also be calculated in the traditional way, since they are combined with zeros into a usual kernel as shown in Figure 4(b). The key point is to compute the gradients for the two end weights and the angles from those of the intermediate variables.
(5) 
The weights are updated in the same way as standard weights, but the angles are not. For a given angle, since its backward gradients are supplied by the split weights on its two adjacent grid positions, the updated angle should not exceed the boundaries defined by these two positions by too much.
(6) 
where the margin is a small positive value that allows the angle to move beyond its last adjacent boundaries, but not by too much. Note that weights and angles can both be initialized from a random distribution or from pretrained kernels, while the angles should be initialized within their valid range.
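The backward step described in this subsection can be sketched under explicit assumptions. The sketch assumes the linear split w_a = w(1 - alpha), w_b = w alpha with alpha equal to the included angle divided by a 45-degree spacing, and it assumes that the bounded update is a simple clamp to the adjacent boundaries widened by a small margin `eps`. The function names, the margin value and the clamp form are hypothetical; the original update rule may differ.

```python
import math

def split_backward(weight, alpha, grad_a, grad_b, step=45.0):
    """Chain rule through the assumed linear split of one end weight.

    grad_a and grad_b are the gradients of the two split weights
    w_a = weight * (1 - alpha) and w_b = weight * alpha.
    """
    grad_weight = (1.0 - alpha) * grad_a + alpha * grad_b
    grad_alpha = weight * (grad_b - grad_a)
    grad_theta = grad_alpha / step  # since alpha = (theta - lower) / step
    return grad_weight, grad_theta

def update_angle(theta, grad_theta, lr, eps=1.0, step=45.0):
    """Gradient step on the angle, clamped near its adjacent boundaries."""
    lower = math.floor(theta / step) * step
    upper = lower + step
    new_theta = theta - lr * grad_theta
    return min(max(new_theta, lower - eps), upper + eps)

grad_w, grad_t = split_backward(weight=2.0, alpha=0.25, grad_a=1.0, grad_b=3.0)
small_step = update_angle(30.0, grad_theta=0.1, lr=1.0)
clamped = update_angle(30.0, grad_theta=-1000.0, lr=1.0)
```

In a full kernel the angle would accumulate the contributions of both end weights; the sketch shows a single end for brevity.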
4 Arithmetic interpolation
Although we derived RotateConv kernels in the section above by pruning 3×3 convolutional kernels, the pruning method can be applied to convolutional kernels of arbitrary size. Meanwhile, we have noticed that present-day CNN models are built from a large number of 3×3 convolutional layers, but RotateConv kernels with angle-based interpolation can only achieve a limited compression ratio on 3×3 convolutional kernels. Considering different emphases on model size and performance, in this section we further propose a more efficient interpolation approach aiming at a higher compression ratio, i.e., RotateConv based on Arithmetic Interpolation (AIRotateConv).
4.1 Formulation of RotateConv based Arithmetic Interpolation
Different from RotateConv based on angle interpolation, which has an additional parameter, i.e., the angle, for each convolutional kernel, RotateConv based on arithmetic interpolation treats the weights of the “principal components” of a convolution layer as an arithmetic progression, which only requires one tolerance (common difference) per convolution layer for the interpolation. Here, we redefine RotateConv based on arithmetic interpolation as
(7)  
where denotes the parameter set of a pruned convolution layer. is the minimum weight in reserved weights and
is the estimated tolerance of arithmetic progression. The reserved weights of pruned convolution layer are treated as an arithmetic progression.
is the point set indicating the “principal components” of , i.e., the positions of top biggest weights of . is an ordered point set which are ordered by their associated weights, and their associated weights are ranged into an arithmetic progression. is the th convolution kernel belongs th channel, and is the number of weights of a convolution kernel.
4.2 Interpolation based on arithmetic interpolation
To extract the “principal components” from regular convolutional kernels, we initialize the shape of each pruned convolutional kernel with the “principal components” of its regular convolutional kernel. Specifically, we sort the weights of the regular kernel by absolute value and select the top positions, i.e., those whose absolute weights are larger than the rest, as its reserved point set.
Similar to RotateConv based on angle interpolation, we utilize an interpolation algorithm to allocate the weights indicated by the reserved point sets. Here, we first gather all the top weights from every kernel and sort them to generate the ordered point set. Then, we recalculate the weights of the points in the ordered set through arithmetic interpolation. Specifically, owing to the properties of an arithmetic progression, the weights of each pruned kernel can be interpolated as
(8) 
where indicates the th point in pruned kernel , and denotes an index function which returns the order of point in ordered set . is the estimated tolerance of arithmetic progression which is made up of reserved weights indicating by . Therefore, the estimated tolerance can be calculated by
(9) 
where is the number of points in . denotes the weight of point , and .
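The selection and interpolation steps of this subsection can be sketched as follows. This is a sketch under assumptions rather than the exact procedure: it assumes kernels are stored as flat lists, that the reserved points are ordered by their signed weights, and that the tolerance is estimated as the mean gap between consecutive ordered weights, i.e., (max - min) / (count - 1). The helper names and the example weights are hypothetical.

```python
def top_k_positions(kernel, k):
    """Positions of the k weights with the largest absolute value."""
    order = sorted(range(len(kernel)), key=lambda i: abs(kernel[i]), reverse=True)
    return sorted(order[:k])

def arithmetic_interpolate(kernels, k):
    """Replace the reserved weights of a layer by an arithmetic progression."""
    # Gather (kernel index, position, weight) for every reserved point.
    reserved = [(n, p, kern[p])
                for n, kern in enumerate(kernels)
                for p in top_k_positions(kern, k)]
    reserved.sort(key=lambda item: item[2])  # assumed: order by signed weight
    count = len(reserved)
    w_min = reserved[0][2]
    # Assumed tolerance estimate: mean gap between consecutive ordered weights.
    tol = (reserved[-1][2] - w_min) / (count - 1) if count > 1 else 0.0
    pruned = [[0.0] * len(kern) for kern in kernels]
    for rank, (n, p, _) in enumerate(reserved):
        pruned[n][p] = w_min + rank * tol
    return pruned, w_min, tol

# Two hypothetical flattened 3x3 kernels.
layer = [
    [0.1, -0.9, 0.0, 0.5, 0.05, 0.0, 0.0, 0.3, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.7, 0.0, -0.4, 0.0, 0.01],
]
pruned, w_min, tol = arithmetic_interpolate(layer, k=3)
```

Under these assumptions the layer is described by the minimum weight, the tolerance, and the rank of each reserved position, rather than by one free value per weight.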
4.3 Learning
Given a pruned convolution layer, three kinds of variables need to be learned, i.e., the reserved point set of each regular kernel, the minimum weight among the reserved points, and the tolerance. To learn them from the regular convolution kernels, we jointly train the network weights with a sparsity regularization imposed on them. Then, we iteratively select kernels with thresholds, take the top points of the reserved kernels, i.e., those whose absolute weights are larger than the rest, as the reserved point sets, and recalculate the weights of these points through arithmetic interpolation. Specifically, the training objective of our approach is given by
(10) 
where denotes the trained input and target, denotes the trainable weights of regular convolution, the first sumterm corresponds to the normal training loss of a CNN. is the set of layers which will be pruned, is a sparsityinduced penalty on the weights of a pruned convolution layer, and balances the two terms. In this work, we choose , which is known as
norm and widely used to feature selection.
Specifically, as shown in Algorithm 1, we first train the source network normally in each iteration, and then update the pruned layer based on the latest regular kernels with the methods described in subsection 4.2. Note that, to train our pruned network with existing training tools (e.g., Caffe, PyTorch), at the end of each iteration we first set all weights of each regular kernel to zero, and then update the reserved weights with the weights of the pruned layer.
5 Experiment
To evaluate the effectiveness of the proposed methods, we study the performance of the pruned models they generate on the tasks of classification and object detection. In this section, we first introduce the datasets, baseline models and experimental settings. Then, we evaluate the performance of RotateConv and AIRotateConv models, which interpolate based on angle and arithmetic progression, respectively.
5.1 Datasets
CIFAR Krizhevsky et al. (2009) consists of colored natural images with 32×32 pixels. CIFAR10 consists of images drawn from 10 classes and CIFAR100 from 100 classes. The training and testing sets contain 50,000 and 10,000 images respectively, and we hold out 5,000 training images as a validation set. In our experiments, the input data is normalized using channel means and standard deviations without any data augmentation. For the final run, we report the test error at the end of training.
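The per-channel normalization mentioned above can be sketched as follows. The nested-list image layout `image[channel][row][col]` and the function names are assumptions made for a dependency-free illustration; in practice the statistics would be computed once over the training set and reused for validation and test images.

```python
import math

def channel_stats(images):
    """Per-channel mean and standard deviation over a list of images."""
    num_channels = len(images[0])
    means, stds = [], []
    for c in range(num_channels):
        values = [v for img in images for row in img[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds

def normalize(image, means, stds):
    """Subtract the channel mean and divide by the channel std."""
    return [[[(v - means[c]) / stds[c] for v in row] for row in channel]
            for c, channel in enumerate(image)]

# One tiny hypothetical image with 2 channels of size 1x2.
images = [[[[0.0, 2.0]], [[1.0, 3.0]]]]
means, stds = channel_stats(images)
normalized = normalize(images[0], means, stds)
```

The sketch does not guard against a zero standard deviation, which would need a small constant in the denominator for constant channels.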
SVHN Netzer et al. (2011) is a real-world digit image dataset for developing machine learning and object recognition algorithms. It is obtained from house numbers in Google Street View images. The task is to classify the digit centered in each image. It has 10 classes, with 73,257 digits for training and 26,032 digits for testing. We use the CIFAR-like version for experiments, in which each image has a 32×32 spatial size and is centered around a single digit, although many examples contain some distractors at the sides.
PASCAL VOC Everingham et al. (2010) is a real-world image dataset for developing object detection and localization algorithms. Our training dataset consists of images from the training sets of VOC2007 and VOC2012, and the testing dataset (4952 images) is the testing set of VOC2007. Each image has an annotation file giving a bounding box and an object class label for each object of the twenty classes present in the image. We adopt the same configurations as SSD Liu et al. (2016) in training and testing, and report the mAP of SSD300 (the size of input images is 300×300 pixels).
5.2 Deep models
To study the performance of our methods on both light and heavy models, we select three kinds of network structures for our experiments, i.e., ResNet He et al. (2016), DenseNet Huang et al. (2016) and VGGNet Simonyan and Zisserman (2014b).
ResNet is a kind of popular network structure in modern CNNs and makes great contribution in deep learning. Using shortcut connections and deeper networks, it massively improves the performance in various learning tasks while maintaining the efficiency in the model size.
DenseNets are improved from ResNet. They connect each layer to every other layer in a feed-forward fashion. In this way, DenseNets alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. In our experiments, we reproduce a 40-layer DenseNet as illustrated in Huang et al. (2016) with growth rate 60.
VGGNet is a neural network that performed very well in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014. It scored first place on the image localization task and second place on the image classification task. Only 3×3 convolution and 2×2 pooling are used throughout the whole network. VGG also shows that the depth of the network plays an important role and that deeper networks give better results. In our experiments, we also adopt the SSD300 architecture Liu et al. (2016) to evaluate the performance of our methods on the task of object detection. SSD300 uses VGG16 Simonyan and Zisserman (2014b) as its base architecture, from which only the dropout layers and the fc8 layer are removed. To obtain detection predictions at multiple scales, SSD300 adds convolutional feature layers, which decrease in size progressively, to the end of the truncated base network.
5.3 Experimental setting
All the network structures are trained using stochastic gradient descent (SGD). On CIFAR, we train the baselines, i.e., ResNet and DenseNet, using batch size 64 for 300 epochs without data augmentation. The initial learning rate is set to 0.01 and is divided by 10 at 50% and 75% of the total number of training epochs. We use weight decay and a Nesterov momentum Sutskever et al. (2013) of 0.9 without dampening. The weight initialization introduced by He et al. (2015) is adopted. On PASCAL VOC, we report the mAP of the SSD300 model provided by the authors. The backbone of SSD300 is VGG16, which is pretrained on the ILSVRC CLS-LOC dataset Russakovsky et al. (2015a) with initial learning rate 0.001, 0.9 momentum, 0.0005 weight decay, and batch size 32.
We train pruned networks in which the 3×3 convolution kernels are set to be pruned. A Batch Normalization Ioffe and Szegedy (2015a) layer is added after each pruned convolution layer. On CIFAR, we train the pruned ResNet and DenseNet with a separately chosen initial learning rate, keeping the other settings the same as their baselines. On PASCAL VOC, we train the pruned SSD300 following the same settings as in baseline training.
5.4 Results and analysis
In this section, we first analyze the performance of the RotateConv models generated by the method described in section 3, as well as the improved models produced by the AIRotateConv approach described in section 4. Then, we observe the effect of applying AIRotateConv to different layers and of the number of pruned layers, in order to explore an appropriate pruning rule.
5.4.1 Effectiveness of approach
The number of parameters of deep neural networks can be reduced substantially by our proposed approaches while their performance remains acceptable. In Table 1, Version 9 denotes the basic network in which most convolution layers are 3×3 convolutions, Version 4 and Version 3 denote the 4-parameter and 3-parameter models pruned by RotateConv, and the last version denotes the 3-parameter model pruned by AIRotateConv with a threshold. Taking ResNet20 as an example, the performance of the various versions is similar, while the basic model has 16.62M parameters and the pruned models, i.e., Version 4, Version 3 and the AIRotateConv version, have only 9.23M, 5.54M and 4.71M parameters, respectively. Moreover, according to the experimental results in Table 1, Version 4 even outperforms Version 9 in some cases. This indicates that line segment kernels still have a powerful capability for feature extraction, and sometimes even generalize better than traditional square kernels, because they are deformable.
Model       Version               #Params  CIFAR10  CIFAR100  SVHN
ResNet20    9 He et al. (2016)    16.62M   91.70    52.84     95.82
ResNet20    4                     9.23M    90.78    53.45     96.01
ResNet20    3                     5.54M    84.83    50.39     95.17
ResNet20    3 (AIRotateConv)      4.71M    90.28    52.75     95.69
VGG         9 He et al. (2016)    19.09M   92.52    56.63     96.02
VGG         4                     10.61M   91.83    56.93     96.40
VGG         3                     6.36M    83.35    51.32     96.19
VGG         3 (AIRotateConv)      5.63M    91.06    56.12     95.68
DenseNet40  9 Huang et al. (2016) 4.03M    92.63    72.45     97.15
DenseNet40  4                     2.24M    92.68    69.99     96.84
DenseNet40  3                     1.35M    81.24    50.64     90.69
DenseNet40  3 (AIRotateConv)      1.22M    90.89    68.64     96.79
However, the experimental results in Table 1 also indicate that there is a performance gap between Version 4 and Version 3. This shows that weight diversity is helpful for feature extraction and has a significant impact on performance. For example, Version 4 can directly model a triangle with a line segment kernel, but Version 3 cannot. Thus, the performance of Version 3 usually declines more than that of Version 4. In addition, the performance of the AIRotateConv version is comparable to that of Version 4, while it has even fewer parameters than Version 3. This is because AIRotateConv not only learns the key positions in convolution kernels through the sparsity norm and thresholds, but also uses arithmetic interpolation to maintain weight diversity as far as possible.
Figure 5 shows the accuracy curves of the various versions of ResNet20 on CIFAR100. As we can see in Figure 5(a), the accuracies of the Baseline, RotateConv and AIRotateConv fluctuate over the iterations and finally reach almost equal performance. The accuracy of AIRotateConv rises a little more slowly than the others when it is trained from scratch. This may indicate that the convergence rate of AIRotateConv relies more on parameter initialization, because it needs to select the key positions of convolutional kernels. Therefore, we usually initialize the pruned AIRotateConv models from trained basic models, and then finetune them with a relatively small learning rate, e.g., 0.001. In this way, we can quickly obtain an acceptable pruned model with AIRotateConv, as shown in Figure 5(b).
Model                     err(%)  pruned params  pruned FLOPs
Baseline                  7.37    –              –
NS (40% Pruned)           6.11    35.7%          28.4%
NS (70% Pruned)           5.19    65.2%          55.0%
version 4                 7.32    44.35%         38.15%
version 3 (AIRotateConv)  9.11    69.72%         59.67%
One purpose of this work is to reduce the amount of computing resources needed. From Table 2, we can observe that, on DenseNet, when roughly 44% to 69% of the convolution parameters are pruned, our approaches, i.e., version4 and version, achieve acceptable performance compared with the original model and other pruning methods. For example, when 69.72% of the convolution parameters and 59.67% of the FLOPs are pruned, version still achieves a test error of 9.11% on CIFAR10. Although our methods are inferior in terms of test error, they achieve a higher compression ratio and less computation than the slimming model Liu et al. (2017).
5.4.2 Analysis of working mechanism
To examine the working mechanism of line segment kernels, we choose the model without loss of generality, and train it on CIFAR100 with 4-parameter RotateConv kernels. We then observe the angle distributions of the last convolutional layer , which has 64 output channels and 64 input channels. Here, we analyze two distributions, one along the input channels and one along the output channels.
On the one hand, we give the angle distribution for one output channel. Each output channel has 64 spatial kernels, corresponding to 64 RotateConv angles applied along the input channels. Figure 6(a) shows the angle distribution of the first output channel, which is denoted as before. We find that different input features are processed with different RotateConv angles, which agrees with the common intuition that different features represent different patterns.
On the other hand, we give the angle distribution for one input channel. Each input channel is reused by 64 different output channels, which likewise have 64 different RotateConv angles. Figure 6(b) shows the angle distribution of the first input channel, which is denoted as before. We find that a single input feature is repeatedly processed with different RotateConv angles. A plausible explanation is that one feature map always contains various patterns, and the later operations need to select these patterns separately for further processing.
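The two views described above can be sketched as histograms over the rows and columns of an angle matrix. This is only an illustrative sketch: the `(out_channels, in_channels)` storage layout, the use of degrees in [0, 360), the bin count, and the function name `angle_histograms` are assumptions, and the angles below are synthetic random values rather than the learned angles reported in Figure 6.

```python
import numpy as np

def angle_histograms(angles, num_bins=12):
    """Histogram RotateConv angles per output channel and per input channel.

    angles: array of shape (out_channels, in_channels), in degrees
            (assumed layout; one angle per spatial kernel).
    """
    angles = np.asarray(angles, dtype=float) % 360.0
    bins = np.linspace(0.0, 360.0, num_bins + 1)
    # Row i holds the angles applied along the input channels of output channel i.
    per_output = np.stack([np.histogram(row, bins=bins)[0] for row in angles])
    # Column j holds the angles that the different output channels apply to input channel j.
    per_input = np.stack([np.histogram(col, bins=bins)[0] for col in angles.T])
    return per_output, per_input

# Synthetic stand-in for a 64 x 64 layer (NOT the paper's learned angles).
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 360.0, size=(64, 64))
per_output, per_input = angle_histograms(angles)
print(per_output[0])  # distribution for the first output channel
print(per_input[0])   # distribution for the first input channel
```

Each row of `per_output` and `per_input` sums to 64, since every output channel reads 64 input channels and every input channel is read by 64 output channels.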
5.4.3 Effect of pruning different layers
To study the effect of pruning different layers, we observe how image classification accuracy varies with the number of layers pruned by AIRotateConv. Without loss of generality, we prune DenseNet in the bottom-to-up and up-to-bottom directions, respectively. As shown in Figure 7(a), accuracy usually declines as the number of pruned layers increases, but at the initial stage it declines more quickly in the bottom-to-up direction than in the up-to-bottom direction. For example, at the stage of pruning to layers, the performance of bottom-to-up degrades significantly faster than that of up-to-bottom. This indicates that low-level features seemingly require relatively powerful convolutional kernels for feature extraction, because low-level features usually carry detailed information. Therefore, in practice, we usually leave the first few layers unpruned to balance accuracy against resource reduction. Besides, we can observe from Figure 7(b) that the high layers seem to be more sensitive to pruning than the low layers. For example, when the low (conv2 to conv10), middle (conv16 to conv25), and high (conv29 to conv38) layers are pruned separately, pruning the lower layers seems to outperform pruning the higher ones. This is because the high layers usually process abstract information that is crucial to classification. Therefore, to balance accuracy against resource reduction, we usually leave the first and last convolution layers unpruned.
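As a rough illustration of the two pruning directions compared in Figure 7(a), the sketch below selects which layers are pruned at a given stage. The helper `pruning_order`, the direction strings, and the layer names `conv1` to `conv38` are hypothetical; the sketch assumes that "bottom" refers to the layers closest to the input.

```python
def pruning_order(layer_names, num_pruned, direction):
    """Return the layers pruned after `num_pruned` steps in the given direction.

    layer_names: layers ordered from the input (bottom) to the output (up).
    direction:   "bottom-to-up" prunes the lowest layers first,
                 "up-to-bottom" prunes the highest layers first.
    """
    if not 0 <= num_pruned <= len(layer_names):
        raise ValueError("num_pruned must be between 0 and the number of layers")
    if direction == "bottom-to-up":
        return layer_names[:num_pruned]
    if direction == "up-to-bottom":
        return layer_names[len(layer_names) - num_pruned:]
    raise ValueError(f"unknown direction: {direction!r}")

# Hypothetical layer names for illustration only.
layers = [f"conv{i}" for i in range(1, 39)]
low_first = pruning_order(layers, 3, "bottom-to-up")
high_first = pruning_order(layers, 3, "up-to-bottom")
print(low_first)
print(high_first)
```

Evaluating a model once per value of `num_pruned` in each direction would give two accuracy curves of the kind the text compares.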
Model  mAP  Threshold  Param pruned  FLOP pruned 

Baseline  74.0  –  –  – 
Model1  67.1  0.001  74.7%  79.3% 
Model2  73.8  0.001  74.7%  79.1% 
Model3  73.2  0.0015  77.2%  83.1% 
Model4  70.8  0.0020  79.5%  83.7% 
Model5  66.7  0.0035  86.3%  86.2% 
Keeping the observations from Figure 7 in mind, we further study the performance of the AIRotateConv approach on an object detection network, i.e., SSD300. As shown in Table 3, applying AIRotateConv to SSD300 reduces the number of parameters while the performance of the pruned models remains acceptable. Moreover, because the first convolution layer is left unpruned, the mAP of Model2 exceeds that of Model1. This result reconfirms that low-level features seemingly require relatively powerful convolutional kernels, and in practice we usually leave the first convolution layer unpruned to balance accuracy against resource reduction. To obtain more condensed models, we further remove the kernels whose absolute weights are smaller than given thresholds, i.e., for each layer we ignore the associated input channels when the weights of their convolutional kernels are close to 0. The experimental results are shown by Model3, Model4 and Model5. They indicate that as the threshold increases, the parameter reduction ratio increases while the performance, measured by mAP, can still be acceptable.
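The thresholding step described above can be sketched as follows. The source does not state exactly how a kernel's magnitude is scored, so the mean absolute weight per input channel used here is an assumption, as are the function name `prune_input_channels`, the `(out_channels, in_channels, kh, kw)` weight layout, and the synthetic weights.

```python
import numpy as np

def prune_input_channels(weights, threshold):
    """Drop input channels whose kernels are close to zero.

    weights:   array of shape (out_channels, in_channels, kh, kw) (assumed layout).
    threshold: channels whose mean absolute weight falls below this are removed
               (the scoring rule is an assumption, not taken from the paper).
    Returns the pruned weights and the indices of the retained input channels.
    """
    weights = np.asarray(weights, dtype=float)
    # Score each input channel by the mean |w| over all kernels that read it.
    score = np.abs(weights).mean(axis=(0, 2, 3))
    keep = np.flatnonzero(score >= threshold)
    return weights[:, keep], keep

# Synthetic weights for illustration: input channels 1 and 4 are made nearly zero.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(8, 6, 3, 3))
weights[:, [1, 4]] *= 1e-4
pruned, keep = prune_input_channels(weights, threshold=0.001)
print(keep)          # indices of the retained input channels
print(pruned.shape)  # shape after removing the near-zero channels
```

In a real network, the output channels of the preceding layer that feed the removed input channels would have to be dropped as well so that tensor shapes stay consistent; that bookkeeping is omitted here.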
6 Conclusion
The aim of this work is to reduce the computing resource requirements of CNNs while maintaining their performance. To this end, we propose a kind of convolutional kernel with an extremely simple shape, a line segment, and equip it with the ability to rotate so that it can model diverse features. The rotation is achieved by inverse interpolation, which makes the angles continuous, differentiable and learnable. In this paper, we use RotateConv and AIRotateConv to significantly reduce the number of model parameters while maintaining model accuracy. The difference between these variants lies in the interpolation method: RotateConv interpolates based on angles, while AIRotateConv performs inverse interpolation with arithmetic interpolation. In the experiments, three kinds of network structures, i.e., ResNet20, VGG and DenseNet40, are pruned for efficiency analysis and for exploring pruning strategies.
In the future, we will address the following problems. Firstly, the proposed approaches should be validated on larger-scale datasets such as ImageNet Russakovsky et al. (2015b) and COCO Lin et al. (2014). Secondly, RotateConv is achieved by inverse interpolation with an angle parameter, but this brings RotateConv kernels back to their original shape for computation and therefore does not reduce computation. We hope that more effort can be devoted to accelerating RotateConv. Lastly, although AIRotateConv has the potential to reduce the number of FLOPs and speed up inference, it still requires modifying existing software frameworks such as Caffe. Since FPGAs are also major platforms for model inference, we plan to integrate AIRotateConv into high-performance frameworks, e.g., ncnn, for low-power CPUs and FPGAs.
References
 [1] (2017) The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257. Cited by: §2.2.
 [2] (2016) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915. Cited by: §1.
 [3] (2015) Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294. Cited by: §2.2, §3.
 [4] (2014) Cudnn: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759. Cited by: §1.
 [5] (2017) Deformable convolutional networks. arXiv preprint arXiv:1703.06211. Cited by: §2.1.
 [6] (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §5.1.
 [7] (2015) Fast rcnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448. Cited by: §1.
 [8] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR. Cited by: §1, §2.2, §2.2, §3.
 [9] (2009) Bayesian lasso regression. Biometrika 96 (4), pp. 835–845. Cited by: §2.2.
 [10] (2015) Delving deep into rectifiers: surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §5.3.
 [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §5.2, §5.2, Table 1.
 [12] (2017) Channel pruning for accelerating very deep neural networks. arXiv preprint arXiv:1707.06168. Cited by: §2.2, §3.
 [13] (2016) Densely connected convolutional networks. arXiv preprint arXiv:1608.06993. Cited by: §1, §5.2, Table 1.
 [14] (2016) Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115. Cited by: §2.2.
 [15] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: §5.3.
 [16] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.1.
 [17] (2009) Cifar10 and cifar100 datasets. URL: https://www.cs.toronto.edu/~kriz/cifar.html (visited on Mar. 1, 2016). Cited by: §5.1.
 [18] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.2.
 [19] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
 [20] (2015) Speedingup convolutional neural networks using finetuned cpdecomposition. ICLR. Cited by: §2.2.
 [21] (1989) Backpropagation applied to handwritten zip code recognition. Neural computation 1 (4), pp. 541–551. Cited by: §1.
 [22] (1998) Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
 [23] (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §2.2.
 [24] (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §6.
 [25] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: item 3, §1, §5.1, §5.2.
 [26] (2017) Learning efficient convolutional networks through network slimming. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2755–2763. Cited by: §2.2, §5.4.1, Table 2.
 [27] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §1.
 [28] (2017) Irregular convolutional neural networks. arXiv preprint arXiv:1706.07966. Cited by: §2.1.
 [29] (2011) Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, Vol. 2011, pp. 5. Cited by: §5.1.
 [30] (2016) Xnornet: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2.2.
 [31] (2015) Faster rcnn: towards realtime object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.1.
 [32] (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §5.3.
 [33] (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §6.
 [34] (2014) Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
 [35] (2014) Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.2, §5.2.
 [36] (2017) Training sparse neural networks. In Computer Vision and Pattern Recognition Workshops, pp. 455–462. Cited by: §2.2.
 [37] (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §5.3.
 [38] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §2.1.
 [39] (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §1, §1, §2.1.
 [40] (2015) Convolutional neural networks with lowrank regularization. ICLR. Cited by: §2.2.
 [41] (2016) Accelerating deep convolutional networks using lowprecision and sparsity. arXiv preprint arXiv:1610.00324. Cited by: §1.
 [42] (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082. Cited by: §2.2.
 [43] (2014) OpenBLAS. URL: http://xianyi.github.io/OpenBLAS. Cited by: §1.
 [44] (2015) Multiscale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §2.1.