Deep convolutional neural networks (CNNs) have significantly improved performance on visual recognition problems since the famous work by Krizhevsky et al. in 2012. Thanks to their deep structure, vision-oriented layer designs, and efficient training schemes, recent CNN models from Google and MSRA obtain better-than-human accuracy on the ImageNet ILSVRC dataset.
The computational complexity of state-of-the-art models is extremely high for both training and inference, requiring several GPUs or a cluster of CPUs. The most time-consuming building block of a CNN, the convolutional layer, convolves the 3D input data with a series of 3D kernels. Its computational complexity is quadratic in both the kernel size and the number of channels. To achieve state-of-the-art performance, the number of channels needs to be a few hundred, especially for the layers with smaller spatial input dimensions, and the kernel size is generally no less than $3 \times 3$.
Several attempts have been made to reduce the amount of computation and parameters in both convolutional layers and fully connected layers. Low rank decomposition has been extensively explored in various fashions  to obtain moderate efficiency improvement. Sparse decomposition based methods  achieve higher theoretical reduction of complexity, while the actual speedup is bounded by the efficiency of sparse multiplication implementations. Most of these decomposition-based methods start from a pre-trained model, and perform decomposition and fine-tuning based on it, while trying to maintain similar accuracy. This essentially precludes the option of improving efficiency by designing and training new CNN models from scratch.
On the other hand, recent state-of-the-art deep CNN models adopt several heuristics to alleviate the burden of heavy computation. In , the number of channels is reduced by a linear projection before the actual convolutional layer; in , the authors utilize a bottleneck structure, in which both the input and the output channels are reduced by linear projection; in , asymmetric convolutions are adopted to achieve larger kernel sizes. While these strategies help to some extent in designing moderately efficient deep models in practice, they do not provide a comprehensive analysis of how to optimize the efficiency of the convolutional layer.
In this work, we propose several schemes to improve the efficiency of convolutional layers. In standard convolutional layers, the 3D convolution can be considered as performing intra-channel spatial convolution and linear channel projection simultaneously, leading to highly redundant computation. These two operations are first unraveled into a set of 2D convolutions in each channel and a subsequent linear channel projection. Then, we further modify the layer to perform the 2D convolutions sequentially rather than in parallel. In this way, we obtain a single intra-channel convolutional (SIC) layer that involves only one filter for each input channel followed by a linear channel projection, thus achieving significantly reduced complexity. By stacking multiple SIC layers, we can train models that are several times more efficient, with similar or higher accuracy, than models based on standard convolutional layers.
In a SIC layer, linear channel projection consumes the majority of the computation. To reduce its complexity, we propose a topological subdivisioning framework between the input channels and output channels as follows: the input channels and the output channels are first rearranged into a multi-dimensional tensor, then each output channel is connected only to the input channels within its local neighborhood. Such a framework leads to a regular sparsity pattern of the convolutional kernels, which our experiments show to have a better performance/cost ratio than the standard convolutional layer.
Furthermore, we design a spatial “bottleneck” structure to take advantage of the local correlation of adjacent pixels in the input. The spatial dimensions are first reduced by a strided intra-channel convolution, then recovered by a deconvolution with the same stride after the linear channel projection. Such a design reduces the complexity of the linear channel projection without sacrificing spatial resolution.
The above three schemes (SIC layer, topological subdivisioning and spatial “bottleneck” structure) attempt to improve the efficiency of traditional CNN models from different perspectives, and can be easily combined together to achieve lower complexity as demonstrated thoroughly in the remainder of this paper. Each of these schemes will be explained in detail in Section 2, evaluated against traditional CNN models, and analyzed in Section 3.
In this section, we first review the standard convolutional layer, then introduce the proposed schemes. For the purpose of easy understanding, the first two schemes are explained with mathematical equations and pseudo-code, as well as illustrated with graphical visualization in Figure 5.
We assume that the number of output channels is equal to the number of input channels, and that the input is padded so that the spatial dimensions of the output are the same as those of the input. We also assume that the residual learning technique is applied to each convolutional layer, namely the input is directly added to the output since they have the same dimensions.
2.1 Standard Convolutional Layer
Consider the input data $\mathbf{I} \in \mathbb{R}^{h \times w \times n}$, where $h$, $w$ and $n$ are the height, width and number of channels of the input feature maps, and the convolutional kernel $\mathbf{K} \in \mathbb{R}^{k \times k \times n \times n}$, where $k$ is the size of the convolutional kernel and $n$ is the number of output channels. The operation of a standard convolutional layer is given by Algorithm 1. The complexity of a convolutional layer measured by the number of multiplications is
$$n^2 k^2 h w.$$
Since the complexity is quadratic in the kernel size, most recent CNN models limit the kernel size to $3 \times 3$ to control the overall running time.
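The standard layer described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' Algorithm 1: the names `standard_conv` and `standard_conv_mults`, the `(h, w, n)` array layout, and the use of zero padding with an odd kernel size are all assumptions made for the example, and normalization and non-linearities are omitted.

```python
import numpy as np

def standard_conv(x, kernels):
    """Naive 'same'-padded convolution over an (h, w, n) input.

    kernels has shape (k, k, n, n_out) with k odd (assumed layout).
    The input is added to the output when the channel counts match,
    following the residual-learning assumption in the text.
    """
    h, w, n = x.shape
    k, _, _, n_out = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, n_out))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]  # (k, k, n) window
            # Contract the window against every 3D kernel at once.
            out[i, j, :] = np.tensordot(patch, kernels, axes=3)
    if n_out == n:
        out = out + x  # residual connection
    return out

def standard_conv_mults(h, w, n, k, n_out=None):
    """Multiplications of one standard layer: h * w * k^2 * n * n_out."""
    n_out = n if n_out is None else n_out
    return h * w * k * k * n * n_out
```

Under this count, moving from a 3-wide to a 5-wide kernel multiplies the cost by 25/9, which is the quadratic growth referred to above.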
2.2 Single Intra-Channel Convolutional Layer
In standard convolutional layers, the output features are produced by convolving a group of 3D kernels with the input features along the spatial dimensions. Such a 3D convolution operation can be considered as a combination of 2D spatial convolution inside each channel and linear projection across channels. For each output channel, a spatial convolution is performed on each input channel. The spatial convolution is able to capture local structural information, while the linear projection transforms the feature space for learning the necessary non-linearity in the neuron layers. When the number of input and output channels is large, typically hundreds, such a 3D convolutional layer requires an exorbitant amount of computation.
A natural idea is to unravel the 2D spatial convolution and the linear channel projection and perform them separately. Each input channel is first convolved with $b$ 2D filters, generating intermediate features that have $b$ times the channels of the input. Then the output is generated by linear channel projection. Unravelling these two operations provides more freedom in model design by tuning both $b$ and $k$. The complexity of such a layer is
$$b\,(k^2 + n)\,n h w,$$
where $h \times w$ is the spatial size, $n$ the number of channels and $k$ the kernel size.
Typically, $k$ is much smaller than $n$. The complexity is approximately linear in $b$. When $b = k^2$, this is equivalent to a linear decomposition of the standard convolutional layers . When $b < k^2$, the complexity is lower than that of the standard convolutional layer, in a low-rank fashion.
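To make the comparison concrete, the two multiplication counts can be evaluated side by side. The sketch below assumes the notation `h`, `w` (spatial size), `n` (input and output channels), `k` (kernel size) and `b` (2D filters per input channel); the function names are hypothetical.

```python
def standard_mults(h, w, n, k):
    """Standard layer: every output channel convolves every input channel."""
    return h * w * n * n * k * k

def unraveled_mults(h, w, n, k, b):
    """Unraveled layer: b 2D filters per channel, then a channel projection."""
    spatial = h * w * n * b * k * k    # intra-channel 2D convolutions
    projection = h * w * (b * n) * n   # b*n intermediate channels -> n outputs
    return spatial + projection

def relative_cost(n, k, b, h=1, w=1):
    """Cost of the unraveled layer relative to the standard layer."""
    return unraveled_mults(h, w, n, k, b) / standard_mults(h, w, n, k)

# The ratio simplifies to b / n + b / k**2, so for a few hundred channels
# it is dominated by b / k**2 (e.g. roughly 4/9 for b = 4, k = 3).
ratio = relative_cost(n=256, k=3, b=4)
```

The spatial size cancels in the ratio, which is why `relative_cost` can default `h` and `w` to 1.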
Our key observation is that instead of convolving $b$ 2D filters with each input channel simultaneously, we can perform the convolutions sequentially. The above convolutional layer with $b$ filters can be transformed into a framework that has $b$ layers. In each layer, each input channel is first convolved with a single 2D filter, then a linear projection is applied to all the input channels to generate the output channels. In this way, the number of channels stays the same throughout all layers. Algorithm 2 formally describes this framework.
When we consider each of the $b$ layers, only one $k \times k$ kernel is convolved with each input channel. This seems to be a risky choice. Convolving with only one filter cannot preserve all the information from the input data, and leaves very little freedom to learn all the useful local structures. In fact, it will probably lead to a low-pass filter, which is somewhat equivalent to the first principal component of the image. However, the residual learning module helps to overcome this disadvantage. With residual learning, the input is added to the output. The subsequent layers thus receive information from both the initial input and the output of the preceding layers. Figure 5 presents a visual comparison between the proposed method and the standard convolutional layer.
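A single SIC layer and a stack of such layers can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the authors' Algorithm 2: the function names, the `(h, w, n)` layout, zero padding with an odd kernel size, and the omission of batch normalization and ReLU are choices made for the example.

```python
import numpy as np

def intra_channel_conv(x, filters):
    """Convolve each channel of x (h, w, n) with its own 2D filter.

    filters has shape (k, k, n): one k x k filter per channel, k odd,
    with zero padding so the spatial size is preserved.
    """
    h, w, n = x.shape
    k = filters.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, n))
    for di in range(k):
        for dj in range(k):
            # Shifted copy of the input, weighted per channel.
            out += xp[di:di + h, dj:dj + w, :] * filters[di, dj, :]
    return out

def sic_layer(x, filters, projection):
    """One SIC layer: intra-channel conv, channel projection, residual add."""
    y = intra_channel_conv(x, filters)
    y = y @ projection  # (h, w, n) @ (n, n) mixes channels per pixel
    return x + y

def sic_stack(x, layers):
    """Apply several SIC layers in sequence; layers is a list of
    (filters, projection) pairs."""
    for filters, projection in layers:
        x = sic_layer(x, filters, projection)
    return x
```

Because of the residual addition, a layer whose projection is all zeros passes its input through unchanged, which illustrates how later layers still see the original signal.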
2.3 Topological Subdivisioning
Given that the standard convolutional layer boils down to single intra-channel convolution and linear projection in the SIC layer, we make a further attempt to reduce the complexity of the linear projection. In , the authors showed that extremely high sparsity could be achieved without sacrificing accuracy. While that sparsity was obtained by fine-tuning and did not possess any structure, we study how to build sparsity with more regularity. Inspired by the topological ICA framework in , we propose an $m$-dimensional topological subdivisioning between the input and output channels in the convolutional layers. Assuming the number of input channels and output channels are both $n$, we first arrange the input and output channels as an $m$-dimensional tensor $[d_1, d_2, \ldots, d_m]$ with
$$\prod_{i=1}^{m} d_i = n.$$
Each output channel is only connected to its local neighbors in the tensor space rather than all input channels. The size of the local neighborhood is defined by another $m$-dimensional tensor, $[c_1, c_2, \ldots, c_m]$ with $c_i \le d_i$, and the total number of neighbors for each output channel is
$$c = \prod_{i=1}^{m} c_i.$$
The complexity of the proposed topologically subdivisioned convolutional layers compared to the standard convolutional layers can be simply measured by $c / n$. Figure 2 illustrates the 2D and 3D topological subdivisioning between the input channels and the output channels. A formal description of this layer is presented in Algorithm 3.
When $k = 1$, the algorithm is suitable for the linear projection layer, and can be directly embedded into Algorithm 2 to further reduce the complexity of the SIC layer.
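The connection pattern can be illustrated by building an explicit connectivity mask. The sketch below is not the authors' Algorithm 3: the function name `topological_mask`, the arguments `dims` (tensor shape of the channel arrangement) and `neigh` (neighborhood size per dimension), and in particular the wrap-around treatment of neighborhoods at the tensor boundary are assumptions made for the example.

```python
import itertools
import numpy as np

def topological_mask(dims, neigh):
    """Boolean (n_out, n_in) mask, with n = prod(dims) channels.

    Output channel at tensor coordinate p is connected to the input
    channels whose coordinates lie in a window of size neigh around p.
    Windows wrap around the tensor boundary (an assumption), so every
    output channel has exactly prod(neigh) connections.
    """
    n = int(np.prod(dims))
    mask = np.zeros((n, n), dtype=bool)
    offsets = list(itertools.product(*[range(c) for c in neigh]))
    for out_idx in range(n):
        coord = np.unravel_index(out_idx, dims)
        for off in offsets:
            nb = tuple((p + o - c // 2) % d
                       for p, o, c, d in zip(coord, off, neigh, dims))
            mask[out_idx, np.ravel_multi_index(nb, dims)] = True
    return mask

# 64 channels arranged as an 8 x 8 grid with 4 x 4 neighbourhoods:
# each output channel keeps 16 of 64 inputs, a quarter of the weights.
mask_2d = topological_mask((8, 8), (4, 4))
density = mask_2d.mean()
```

A dense projection matrix of shape `(n_out, n_in)` can then be multiplied elementwise by the mask; an efficient implementation would store only the connected weights rather than masking a dense matrix.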
2.4 Spatial “Bottleneck” Structure
In the design of traditional CNN models, there has always been a trade-off between the spatial dimensions and the number of channels. While high spatial resolution is necessary to preserve detailed local information, a large number of channels produces high-dimensional feature spaces and learns more complex representations. The complexity of one convolutional layer is determined by the product of these two factors. To maintain an acceptable complexity, the spatial dimensions are reduced by max pooling or strided convolution while the number of channels is increased.
On the other hand, the adjacent pixels in the input of each convolutional layer are correlated, in a similar fashion to the image domain, especially when the spatial resolution is high. While reducing the resolution by simple sub-sampling will obviously lead to a loss of information, such correlation presents considerable redundancy that can be taken advantage of.
In this section, we introduce a spatial “bottleneck” structure that reduces the amount of computation without decreasing either the spatial resolution or the number of channels by exploiting the spatial redundancy of the input.
Consider the 3D input data $\mathbf{I} \in \mathbb{R}^{h \times w \times n}$. We first apply a single intra-channel convolution to each input channel, as introduced in Section 2.2: a $k \times k$ kernel is convolved with each input channel with stride $k$, so that the output dimension is reduced to $\frac{h}{k} \times \frac{w}{k} \times n$. Then a linear projection layer is applied. Finally, we perform a $k \times k$ intra-channel deconvolution with stride $k$ to recover the spatial resolution.
|2||max pooling , stride 3|
|max pooling , stride 2|
|max pooling , stride 3|
|average pooling, stride 6|
|fully connected, 2048|
|fully connected, 1000|
Figure 3 illustrates the proposed spatial “bottleneck” framework. The spatial resolution of the data is first reduced, then expanded, forming a bottleneck structure. In this 3-phase structure, the linear projection phase, which consumes most of the computation, is $k^2$ times more efficient than plain linear projection on the original input, where $k$ is the stride. The intra-channel convolution and deconvolution phases learn to capture the local correlation of adjacent pixels, while maintaining the spatial resolution of the output.
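The three phases can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the authors' implementation: the kernel size is taken equal to the stride `k`, the spatial size is assumed divisible by `k`, no padding or residual connection is applied, and the function names are hypothetical.

```python
import numpy as np

def spatial_bottleneck(x, down, proj, up):
    """x: (h, w, n); down, up: (k, k, n) per-channel kernels; proj: (n, n)."""
    h, w, n = x.shape
    k = down.shape[0]
    # Phase 1: intra-channel convolution with a k x k kernel and stride k,
    # i.e. a weighted sum over non-overlapping k x k blocks.
    blocks = x.reshape(h // k, k, w // k, k, n)
    low = np.einsum('ipjqc,pqc->ijc', blocks, down)   # (h/k, w/k, n)
    # Phase 2: linear channel projection at the reduced resolution.
    low = low @ proj
    # Phase 3: intra-channel deconvolution with stride k; each low-resolution
    # pixel is expanded into a k x k block weighted by the kernel.
    high = np.einsum('ijc,pqc->ipjqc', low, up)
    return high.reshape(h, w, n)

def bottleneck_mults(h, w, n, k):
    """Multiplications per phase: (convolution, projection, deconvolution)."""
    positions = (h // k) * (w // k)
    conv = positions * n * k * k
    proj = positions * n * n
    return conv, proj, conv
```

With these counts the projection costs `h * w * n * n / k**2` multiplications instead of `h * w * n * n`, while the convolution and deconvolution each cost only `h * w * n`.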
We evaluate the performance of our method on the ImageNet ILSVRC 2012 dataset, which contains 1000 categories, with 1.2M training images, 50K validation images, and 100K test images. We use Torch to train the CNN models in our framework. Our method is implemented with CUDA and Lua based on the Torch platform. During training, the images are first resized, then randomly cropped and flipped horizontally. Batch normalization is placed after each convolutional layer and before the ReLU layer. We also adopt the dropout strategy with a ratio of 0.2 during training. Standard stochastic gradient descent with mini-batches of 256 images is used to train the model. We start with a learning rate of 0.1 and divide it by a factor of 10 every 30 epochs. Each model is trained for 100 epochs. For batch normalization, we use an exponential moving average to calculate the batch statistics, as implemented in cuDNN. The code is run on a server with 4 Pascal Titan X GPUs. For all the models evaluated below, the top-1 and top-5 errors on the validation set with central cropping are reported.
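The step schedule described above can be written as a one-line function. The function name and the zero-based epoch index are assumptions; the constants (initial rate 0.1, division by 10 every 30 epochs, 100 epochs in total) come from the text.

```python
def learning_rate(epoch, base_lr=0.1, drop_every=30, factor=10.0):
    """Step schedule: divide the rate by `factor` every `drop_every` epochs.

    `epoch` is assumed to be zero-based, so epochs 0-29 use base_lr.
    """
    return base_lr / factor ** (epoch // drop_every)

# Rates for the 100 training epochs: three drops, at epochs 30, 60 and 90.
schedule = [learning_rate(e) for e in range(100)]
```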
We evaluate the performance and efficiency of a series of models designed using the proposed efficient convolutional layer. To make cross reference easier and help the readers keep track of all the models, each model is indexed with a capital letter.
|Model||kernel size||# layers||Top-1 err.||Top-5 err.||Complexity|
We compare our method with a baseline CNN model that is built from standard convolutional layers. The details of the baseline model are given in Table 1. The convolutional layers are divided into stages according to their spatial dimensions. Inside each stage, the convolutions are performed with padding so that the output has the same spatial dimensions as the input. Across the stages, the spatial dimensions are reduced by max pooling and the number of channels is doubled by a convolutional layer. One fully connected layer with dropout is added before the logistic regression layer for final classification. Residual learning is added after every convolutional layer with the same number of input and output channels.
We evaluate the performance of our method by substituting the standard convolutional layers in the baseline model with the proposed Single Intra-Channel Convolutional (SIC) layers. We leave the convolutional layer in the first stage and the convolutional layers across stages unchanged, and substitute only the remaining convolutional layers. In the following sections, the relative complexities are also measured with respect to these layers.
3.1 Single Intra-Channel Convolutional Layer
We first substitute the standard convolutional layer with the unraveled convolution configuration in model B. Each input channel is convolved with 4 filters, so that the complexity of B is approximately of the baseline model A. In model C , we use two SIC layers to replace one standard convolutional layer. Even though our model C has more layers than the baseline model A, its complexity is only of model A. In model E, we increase the number of SIC layers from 4 in model C to 6 in model E. The complexity of model E is only of the baseline. Due to the extremely low complexity of the SIC layer, we can easily increase the model depth without too much increase of the computation.
Table 2 lists the distribution of computation between the intra-channel convolution and linear channel projection of each SIC layer in model C. The intra-channel convolution generally consumes less than 10% of the total layer computation. Thanks to this advantage, we can use a larger kernel size with only a small sacrifice of efficiency. Model D is obtained by setting the kernel size of model C to 5.
Table 3 lists the top-1 and top-5 errors and the complexity of models A to E. Comparing models A and B, which have the same number of layers, model B matches the accuracy of model A with less than half the computation. Compared with model B, the SIC-based model C reduces the top-1 error by 1% with half the complexity. This verifies the superior efficiency of the proposed SIC layer. With $5 \times 5$ kernels, model D obtains a 0.5% accuracy gain with as little as a 5% increase in complexity on average. This demonstrates that increasing the kernel size in the SIC layer provides another way to improve the accuracy/complexity ratio.
3.2 Topological Subdivisioning
We first compare the performance of two different topological configurations against the baseline model. Model F adopts 2D topology and for both dimensions, which leads to a reduction of complexity by a factor of 4. In Model G, we use 3D topology and set and , so that the complexity is reduced by a factor of 4.27. The details of the network configuration are listed in Table 4. The number of topological layers is twice the number of standard convolutional layers in the baseline model, so the overall complexity per stage is reduced by a factor of 2.
|Stage||#Channels||2D topology||3D topology|
As a comparison, we also train a model H using the straightforward grouping strategy introduced in . Both the input and output channels are divided into 4 groups. The output channels in each group depend only on the input channels in the corresponding group. The complexity is also reduced by a factor of 4 in this manner. Table 5 lists the top-1 and top-5 error rates and complexities of models F to H. Both the 2D and the 3D topology models outperform the grouping method, with lower error rates at the same complexity. Compared with the baseline model, both topology models achieve similar top-1 and top-5 error rates with half the computation.
Finally, we apply the topological subdivisioning to the SIC layer in model I. We choose 2D topology based on the results in Table 5. In model I, there are 8 convolutional layers for each stage, due to the layer doubling caused by both the SIC layer and the topological subdivisioning. The complexity of each layer is, however, approximately as low as of a standard convolutional layer. Compared to the baseline model, 2D topology together with SIC layer achieves similar error rate while being 9 times faster.
3.3 Spatial “Bottleneck” Structure
In our evaluation of layers with the spatial “bottleneck” structure, both the kernel size and the stride of the intra-channel convolution and deconvolution are set to 2. The complexity of such a configuration is a quarter of that of a SIC layer. Both model J and model K are modified from model C by replacing SIC layers with spatial “bottleneck” layers. One SIC layer is substituted with two spatial “bottleneck” layers, the first one with no padding and the second one with one pixel of padding, leading to a 50% complexity reduction. In model J, every other SIC layer is substituted; in model K, all SIC layers are substituted. Table 6 compares their performance with the baseline model and the SIC-based model. Compared to the SIC model C, model J reduces the complexity by 25% with no loss of accuracy; model K reduces the complexity by 50% with a slight drop in accuracy. Compared to the baseline model A, model K achieves a 9 times speedup with similar accuracy.
|Model||#layers||Top-1 err.||Top-5 err.||Complexity|
3.4 Comparison with standard CNN models
In this section, we increase the depth of our models to compare with recent state-of-the-art CNN models. To go deeper without increasing the complexity too much, we adopt a channel-wise bottleneck structure similar to the one introduced in . In each channel-wise bottleneck structure, the number of channels is first reduced by half by the first layer, then recovered by the second layer. Such a two-layer bottleneck structure has almost the same complexity as a single layer with the same input and output channels, and thus increases the overall depth of the network.
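The claim that the two-layer bottleneck costs about as much as one full layer can be checked by counting only the linear-projection multiplications per spatial position. In the sketch below, $n$ is an assumed symbol for the number of input and output channels, and the much smaller intra-channel convolution cost is ignored.

```latex
\underbrace{n \cdot \tfrac{n}{2}}_{\text{reduce: } n \to n/2}
\;+\;
\underbrace{\tfrac{n}{2} \cdot n}_{\text{recover: } n/2 \to n}
\;=\; n^2
\;=\;
\underbrace{n \cdot n}_{\text{single layer: } n \to n}
```

Under this count the projection cost is unchanged while the depth doubles; the remaining difference comes from the intra-channel convolutions, which the text reports as a small fraction of each layer's computation.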
We gradually increase the number of SIC layers with channel-wise bottleneck structure in each stage from 8 to 40, and compare their complexity to recent CNN models with similar accuracies. Models L, M, N, and O correspond to 8, 12, 24, and 40 layers, respectively. Due to training memory limitations, only the SIC layer is used in the models in this section. While model and have the same spatial dimensions and stage structures as in Table 1, model and adopt the same structure as in . They have different pooling strides and one more stage right after the first convolutional layer. The detailed model configurations are given in the supplemental materials.
|Model|Top-1 err.|Top-5 err.|# of Multiplications|
|---|---|---|---|
|Our Model L|28.29%|9.9%|381M|
|Our Model M|27.07%|8.93%|490M|
|Our Model N|24.76%|7.58%|845M|
|Our Model O|23.99%|7.12%|1172M|

Table 7: Top-1 and top-5 error rates (single model, single center crop) and number of multiplications of our models and several previous works. For AlexNet and GoogLeNet, the top-1 error is missing in the original paper and we use the number from Caffe’s implementation. For ResNet-34, we use the number from Facebook’s implementation.
Figure 4 compares the accuracy and complexity of our models L to O with several previous works as a scatter plot, in which the red marks represent our models; Table 7 lists the detailed results. All of our models demonstrate remarkably lower complexity while being as accurate. Compared to VGG, Resnet-34, Resnet-50 and Resnet-101 models, our models are , , , more efficient respectively with similar or lower top-1 or top-5 error.
3.5 Visualization of filters
Given the exceptionally good performance of the proposed methods, one might wonder what type of kernels are actually learned and how they compare with the ones in traditional convolutional layers. We randomly choose some kernels from the single intra-channel convolutional layers and the traditional convolutional layers, and visualize them side by side in Figure 5 for an intuitive comparison. Both $3 \times 3$ kernels and $5 \times 5$ kernels are shown in the figure. The kernels learned by the proposed method show a much more regular structure, while the kernels in standard convolutional layers exhibit more randomness. We attribute this to the stronger regularization caused by the reduction in the number of filters.
3.6 Discussion on implementation details
In both the SIC layer and the spatial “bottleneck” structure, most of the computation is consumed by the linear channel projection, which is basically a matrix multiplication. The 2D spatial convolution in each channel has a complexity similar to that of a max pooling layer. Because it involves so little computation, its running time is dominated by memory access. The efficiency of our CUDA-based implementation is similar to that of open source libraries like Caffe and Torch. We believe higher efficiency can be easily achieved with an expert-level GPU implementation like the one in cuDNN. The topological subdivisioning layer resembles the structure of 2D and 3D convolution. Unlike the sparsity-based methods, the regular connection pattern from topological subdivisioning makes efficient implementation possible. Currently, our implementation simply discards all the non-connected weights in a convolutional layer.
This work introduces a novel design of efficient convolutional layers in deep CNNs that involves three specific improvements: (i) a single intra-channel convolutional (SIC) layer; (ii) a topological subdivisioning scheme; and (iii) a spatial “bottleneck” structure. As we demonstrated, they are all powerful schemes that contribute in different ways to a new design of the convolutional layer with higher efficiency and equal or better accuracy compared to classical designs. While the numbers of input and output channels remain the same as in the classical models, both the convolutions and the number of connections can be optimized against accuracy in our model: (i) reduces complexity by unraveling convolution, (ii) uses topology to make connections in the convolutional layer sparse while maintaining local regularity, and (iii) uses a conv-deconv bottleneck to reduce convolution while maintaining resolution. Although CNNs have been exceptionally successful in terms of recognition accuracy, it is still not clear what architecture is optimal and learns the visual information most effectively. The methods presented herein attempt to answer this question by focusing on improving the efficiency of the convolutional layer. We believe this work will inspire more comprehensive studies in the direction of optimizing convolutional layers in deep CNNs.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, 2014.
-  Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proc. BMVC, 2014.
-  Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.
-  Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
-  Cheng Tai, Tong Xiao, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.
-  Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
-  Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015.
-  Aapo Hyvärinen, Patrik Hoyer, and Mika Inki. Topographic independent component analysis. Neural computation, 13(7):1527–1558, 2001.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
-  Sam Gross and Michael Wilber. Resnet training in torch. https://github.com/charlespwd/project-title, 2016.