1 Introduction and related work
Several efficient implementations of convolutions have been proposed. For instance, 2D convolution can be efficiently implemented as matrix multiplication by converting the convolution kernel to a Toeplitz matrix. However, this procedure requires replicating the kernel values multiple times across different columns of the Toeplitz matrix, which increases the memory requirements. Implementing convolutions via the im2col approach is also memory intensive due to the space required for building the column matrix. The space requirements of these approaches may be far too large to fit in the memory of mobile or embedded devices, hindering the deployment of CNNs on resource-limited platforms.
In general, most existing attempts at efficient convolutions are isolated and there is currently no unified framework to study them. In particular, we are interested in two different branches, which we review hereafter. First, approaches that leverage tensor methods for efficient convolutions, either to compress them or to reformulate them for speed. Second, approaches that directly formulate efficient neural architectures, e.g. using separable convolutions.
Tensor methods for deep learning
The properties of tensor methods [Kolda and Bader, 2009] made them a prime choice for deep neural networks. Besides the theoretical study of the properties of deep neural networks [Cohen et al., 2016], they have been especially studied in the context of reparametrizing existing layers. One goal of such reparametrization is parameter space savings. Novikov et al. [2015], for instance, proposed to reshape the weight matrix of fully-connected layers into high-order tensors and apply tensor decomposition (specifically Tensor-Train (TT) [Oseledets, 2011]) on the resulting tensor. In a follow-up work [Garipov et al., 2016], the same strategy is applied to both fully connected and convolutional layers. Fully connected layers and flattening layers can be removed altogether and replaced with tensor regression layers [Kossaifi et al., 2018]. These express outputs through a low-rank multi-linear mapping from a high-order activation tensor to an output tensor of arbitrary order. Parameter space savings can also be obtained, while maintaining multi-linear structure, by applying tensor contraction [Kossaifi et al., 2017].
Another advantage of tensor reparametrization is computational speed-up. In particular, tensor decomposition is an efficient way to obtain separable filters from convolutional kernels. These separable convolutions were proposed in computer vision by Rigamonti et al. [2013] in the context of filter banks. Jaderberg et al. [2014] first applied this concept to deep learning and proposed leveraging redundancies across channels using separable convolutions. This is optimized layer by layer via alternating least-squares by minimizing the reconstruction error between the pretrained weights and the corresponding approximation. Lebedev et al. [2015] proposed to apply CP decomposition directly to the (4-dimensional) kernels of pretrained 2D convolutional layers, making them separable. The resulting formulation speeds up the convolution by rewriting it as a series of smaller convolutions. The incurred loss in performance is compensated by fine-tuning. A similar efficient rewriting of convolutions was proposed by Kim et al. [2016], using Tucker decomposition instead of CP to decompose the convolutional layers of a pre-trained network. This allows rewriting the convolution as a $1 \times 1$ convolution, followed by a regular convolution with a smaller kernel and, finally, another $1 \times 1$ convolution. Note that in this case, the spatial dimensions of the convolutional kernel are left untouched and only the input and output channels are compressed. Again, the loss in performance is compensated by fine-tuning the whole network. CP decomposition is also used by Astrid and Lee [2017] to reparametrize the layers of deep convolutional neural networks, but the tensor power method is used instead of an alternating least-squares algorithm. The network is then iteratively fine-tuned to restore performance. Similarly, Tai et al. [2015] propose to use tensor decomposition to remove redundancy in convolutional layers and express these as the composition of two convolutional layers with fewer parameters. Each 2D filter is approximated by a sum of rank-1 matrices. Thanks to this restricted setting, a closed-form solution can be readily obtained with SVD. This is done for each convolutional layer with a kernel of size larger than 1.
Here, we unify the above works and propose a generalisation of separable convolutions to higher orders.
Efficient neural networks
While concepts such as separable convolutions have been studied since the early successes of deep learning using tensor decompositions, they have only relatively recently been “rediscovered” and proposed as standalone, end-to-end trainable efficient neural network architectures. First attempts in the direction of neural network architecture optimization were proposed early on in the ground-breaking VGG network [Simonyan and Zisserman, 2014], where the large convolutional kernels used in AlexNet [Krizhevsky et al., 2012] were replaced with a series of smaller ones that have an equivalent receptive field size: i.e. a convolution with a $5 \times 5$ kernel can be replaced by two consecutive convolutions of size $3 \times 3$. In parallel, the idea of decomposing larger kernels into a series of smaller ones is explored in the multiple iterations of the Inception block [Szegedy et al., 2015, 2016, 2017], where a convolutional layer with a $k \times k$ kernel is approximated with two $k \times 1$ and $1 \times k$ kernels. He et al. [2016] introduced the so-called bottleneck module, which reduces the number of channels on which the convolutional layer with the larger kernel size ($3 \times 3$) operates by projecting the features back and forth using two convolutional layers with $1 \times 1$ filters. Xie et al. [2017] expand upon this by replacing the $3 \times 3$ convolution with a grouped convolutional layer that can further reduce the complexity of the model while increasing its representational power. Recently, Howard et al. [2017] introduced the MobileNet architecture, in which they proposed to replace the $3 \times 3$ convolutions with a depth-wise separable module: a depth-wise $3 \times 3$ convolution (the number of groups is equal to the number of channels) followed by a $1 \times 1$ convolutional layer that aggregates the information. This type of structure was shown to offer a good balance between performance and computational cost. Sandler et al. [2018] go one step further and incorporate the idea of using separable convolutions in an inverted bottleneck module. The proposed module uses the $1 \times 1$ layers to expand and then contract the channels (hence inverted bottleneck) while using separable convolutions for the $3 \times 3$ convolutional layer.
In this work, we show that many of the aforementioned architectural improvements, such as MobileNet or ResNet’s Bottleneck blocks, are in fact drawn from the same larger family of tensor decomposition methods, and we propose a new, efficient higher-order convolution based on tensor factorization. Specifically, we make the following contributions:
- We review convolutions under the lens of tensor decomposition and establish the link between various tensor decompositions and efficient convolutional blocks (Section 2).
- We propose a general framework unifying tensor decomposition and efficient architectures, showing how these efficient architectures can be derived from a regular convolution to which tensor decomposition has been applied (Section 2.5).
- Based on this framework, we propose a higher-order CP convolutional layer for convolutions of any order, which is both memory and computationally efficient (Section 3).
- We demonstrate the performance of our approach on both 2D and 3D data, and show that it offers better performance while being more computationally and memory efficient (Section 4).
2 Tensor factorization for efficient convolutions
Here, we explore the relationship between tensor methods and deep neural networks’ convolutional layers. Without loss of generality, we omit the batch size in all the following formulas.
2.1 Mathematical background and notation
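We denote $1^{\text{st}}$-order tensors (vectors) as $\mathbf{v}$, $2^{\text{nd}}$-order tensors (matrices) as $\mathbf{M}$ and tensors of order $\geq 3$ as $\mathcal{X}$. We write the regular convolution of $\mathcal{X}$ with $\mathcal{W}$ as $\mathcal{X} \star \mathcal{W}$. In the case of a 1-D convolution, we write the convolution of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a vector $\mathbf{v} \in \mathbb{R}^{K}$ along the $n$-th mode as $\mathcal{X} \star_n \mathbf{v}$. Note that in practice, as done in current deep learning frameworks [Paszke et al., 2017], we use cross-correlation, which differs from convolution by the flipping of the kernel. This does not impact the results since the weights are learned end-to-end. In other words, $(\mathcal{X} \star_n \mathbf{v})_{i_1, \ldots, i_N} = \sum_{k=1}^{K} \mathbf{v}_k \, \mathcal{X}_{i_1, \ldots, i_{n-1}, i_n + k, i_{n+1}, \ldots, i_N}$.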
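As an illustration of this notation, the following is a minimal NumPy sketch of a 1-D cross-correlation along one mode of a tensor. The helper name `conv1d_along_mode` and the "valid" boundary handling are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def conv1d_along_mode(x, v, mode):
    """Valid 1-D cross-correlation of tensor `x` with vector `v` along axis `mode`.

    Hypothetical helper: the kernel is not flipped, matching the convention
    used by deep learning frameworks.
    """
    k = v.shape[0]
    out_len = x.shape[mode] - k + 1
    out = 0
    for offset in range(k):
        # Window of x shifted by `offset` along the chosen mode
        window = np.take(x, np.arange(offset, offset + out_len), axis=mode)
        out = out + v[offset] * window
    return out

x = np.arange(24, dtype=float).reshape(2, 3, 4)
v = np.array([1.0, -1.0])
y = conv1d_along_mode(x, v, mode=2)

# Each mode-2 fiber agrees with NumPy's own cross-correlation
fiber_matches = np.allclose(y[0, 0], np.correlate(x[0, 0], v, mode="valid"))
```

Because the entries of `x` increase by one along the last axis, every output entry of this toy example equals $-1$.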
2.2 $1 \times 1$ convolutions and tensor contraction
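Next, we show that $1 \times 1$ convolutions are equivalent to a tensor contraction with the kernel of the convolution along the channels dimension. Let’s consider a $1 \times 1$ convolution $\Phi$, defined by kernel $\mathcal{W} \in \mathbb{R}^{T \times C \times 1 \times 1}$ and applied to an activation tensor $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$. We denote the squeezed version of $\mathcal{W}$ as $\mathbf{W} \in \mathbb{R}^{T \times C}$.
The tensor contraction of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with matrix $\mathbf{M} \in \mathbb{R}^{J \times I_n}$, along the $n$-th mode ($n \in \{1, \ldots, N\}$), known as n–mode product, is defined as $\mathcal{P} = \mathcal{X} \times_n \mathbf{M}$, with:
$$\mathcal{P}_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} \mathbf{M}_{j, i_n} \, \mathcal{X}_{i_1, \ldots, i_N}. \tag{1}$$
By plugging (1) into the expression of $\Phi(\mathcal{X})$, we can observe that the $1 \times 1$ convolution is equivalent to an n-mode product between $\mathcal{X}$ and the matrix $\mathbf{W}$ along the channel mode. This can readily be seen by writing:
$$\Phi(\mathcal{X})_{t, y, x} = (\mathcal{X} \star \mathcal{W})_{t, y, x} = \sum_{s=1}^{C} \mathbf{W}_{t, s} \, \mathcal{X}_{s, y, x} = (\mathcal{X} \times_1 \mathbf{W})_{t, y, x}. \tag{2}$$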
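The equivalence stated above can be checked numerically. The sketch below is an illustration only: the sizes are arbitrary, the variable names are assumptions, and the mode product is written with `np.tensordot` rather than with any specific tensor library.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H, W = 4, 3, 5, 6                       # arbitrary toy sizes
kernel = rng.standard_normal((T, C, 1, 1))    # kernel of a 1x1 convolution
x = rng.standard_normal((C, H, W))            # activation tensor (batch omitted)

# Explicit 1x1 convolution: each output channel is a weighted sum of input channels
conv_out = np.zeros((T, H, W))
for t in range(T):
    for s in range(C):
        conv_out[t] += kernel[t, s, 0, 0] * x[s]

# Mode product along the channel mode with the squeezed kernel (a T x C matrix)
w_mat = kernel[:, :, 0, 0]
mode_product = np.tensordot(w_mat, x, axes=([1], [0]))   # shape (T, H, W)

same = np.allclose(conv_out, mode_product)
```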
2.3 Separable convolutions
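Here we show how separable convolutions can be obtained by applying CP decomposition to the kernel of a regular convolution [Lebedev et al., 2015]. We consider a convolution defined by its kernel weight tensor $\mathcal{W} \in \mathbb{R}^{T \times C \times K_H \times K_W}$, applied on an input of size $C \times H \times W$. Let $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$ be an arbitrary activation tensor. If we define the resulting feature map as $\mathcal{F} = \mathcal{X} \star \mathcal{W}$, we have:
$$\mathcal{F}_{t, y, x} = \sum_{s=1}^{C} \sum_{j=1}^{K_H} \sum_{i=1}^{K_W} \mathcal{W}_{t, s, j, i} \, \mathcal{X}_{s, y + j, x + i}. \tag{3}$$
Assuming a low-rank Kruskal structure of rank $R$ on the kernel $\mathcal{W}$ (which can be readily obtained by applying CP decomposition), with factors $\mathbf{U}^{(T)} \in \mathbb{R}^{T \times R}$, $\mathbf{U}^{(C)} \in \mathbb{R}^{C \times R}$, $\mathbf{U}^{(H)} \in \mathbb{R}^{K_H \times R}$ and $\mathbf{U}^{(W)} \in \mathbb{R}^{K_W \times R}$, we can write:
$$\mathcal{W}_{t, s, j, i} = \sum_{r=1}^{R} \mathbf{U}^{(T)}_{t, r} \, \mathbf{U}^{(C)}_{s, r} \, \mathbf{U}^{(H)}_{j, r} \, \mathbf{U}^{(W)}_{i, r}. \tag{4}$$
Plugging (4) into (3) gives:
$$\mathcal{F}_{t, y, x} = \sum_{s=1}^{C} \sum_{j=1}^{K_H} \sum_{i=1}^{K_W} \sum_{r=1}^{R} \mathbf{U}^{(T)}_{t, r} \, \mathbf{U}^{(C)}_{s, r} \, \mathbf{U}^{(H)}_{j, r} \, \mathbf{U}^{(W)}_{i, r} \, \mathcal{X}_{s, y + j, x + i}, \tag{5}$$
and, by rearranging the terms:
$$\mathcal{F}_{t, y, x} = \sum_{r=1}^{R} \mathbf{U}^{(T)}_{t, r} \left[ \sum_{j=1}^{K_H} \mathbf{U}^{(H)}_{j, r} \left[ \sum_{i=1}^{K_W} \mathbf{U}^{(W)}_{i, r} \left[ \sum_{s=1}^{C} \mathbf{U}^{(C)}_{s, r} \, \mathcal{X}_{s, y + j, x + i} \right] \right] \right]. \tag{6}$$
Reading the brackets from the inside out, this is a $1 \times 1$ convolution from $C$ to $R$ channels, followed by two depthwise 1-D convolutions along the width and the height, and a final $1 \times 1$ convolution from $R$ to $T$ channels.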
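To make the link between the Kruskal structure and separable convolutions concrete, the following NumPy sketch compares a regular convolution using the full kernel with the chain of small convolutions obtained from the factors. The helper `conv2d`, the toy sizes, and the "valid" padding are assumptions chosen for the example.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D cross-correlation; x: (C, H, W), w: (T, C, KH, KW)."""
    T, C, KH, KW = w.shape
    out_h, out_w = x.shape[1] - KH + 1, x.shape[2] - KW + 1
    out = np.zeros((T, out_h, out_w))
    for j in range(KH):
        for i in range(KW):
            out += np.einsum("ts,syx->tyx", w[:, :, j, i],
                             x[:, j:j + out_h, i:i + out_w])
    return out

rng = np.random.default_rng(0)
C, T, R, KH, KW, H, W = 3, 4, 5, 3, 3, 8, 7
U_T = rng.standard_normal((T, R))
U_C = rng.standard_normal((C, R))
U_H = rng.standard_normal((KH, R))
U_W = rng.standard_normal((KW, R))
x = rng.standard_normal((C, H, W))

# Regular convolution with the kernel rebuilt from its rank-R Kruskal form
w = np.einsum("tr,sr,jr,ir->tsji", U_T, U_C, U_H, U_W)
full = conv2d(x, w)

# Separable version: 1x1 conv (C -> R), two depthwise 1-D convs, 1x1 conv (R -> T)
z = np.einsum("sr,syx->ryx", U_C, x)
out_h, out_w = H - KH + 1, W - KW + 1
z = sum(U_H[j][:, None, None] * z[:, j:j + out_h, :] for j in range(KH))
z = sum(U_W[i][:, None, None] * z[:, :, i:i + out_w] for i in range(KW))
separable = np.einsum("tr,ryx->tyx", U_T, z)

same = np.allclose(full, separable)
```

The two outputs agree up to floating-point error, while the separable path never materialises the full $T \times C \times K_H \times K_W$ kernel.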
2.4 Bottleneck layer
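As previously, we consider the convolution $\mathcal{F} = \mathcal{X} \star \mathcal{W}$. However, instead of a Kruskal structure, we now assume a low-rank Tucker structure on the kernel $\mathcal{W}$ (which can be readily obtained by applying Tucker decomposition), which yields an efficient formulation [Kim et al., 2016]. With a core $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times R_3 \times R_4}$ and factors $\mathbf{U}^{(T)}$, $\mathbf{U}^{(C)}$, $\mathbf{U}^{(H)}$ and $\mathbf{U}^{(W)}$, we can write:
$$\mathcal{W}_{t, s, j, i} = \sum_{k_1=1}^{R_1} \sum_{k_2=1}^{R_2} \sum_{k_3=1}^{R_3} \sum_{k_4=1}^{R_4} \mathcal{G}_{k_1, k_2, k_3, k_4} \, \mathbf{U}^{(T)}_{t, k_1} \, \mathbf{U}^{(C)}_{s, k_2} \, \mathbf{U}^{(H)}_{j, k_3} \, \mathbf{U}^{(W)}_{i, k_4}. \tag{7}$$
Plugging back into a convolution, we get:
$$\mathcal{F}_{t, y, x} = \sum_{s=1}^{C} \sum_{j=1}^{K_H} \sum_{i=1}^{K_W} \sum_{k_1=1}^{R_1} \sum_{k_2=1}^{R_2} \sum_{k_3=1}^{R_3} \sum_{k_4=1}^{R_4} \mathcal{G}_{k_1, k_2, k_3, k_4} \, \mathbf{U}^{(T)}_{t, k_1} \, \mathbf{U}^{(C)}_{s, k_2} \, \mathbf{U}^{(H)}_{j, k_3} \, \mathbf{U}^{(W)}_{i, k_4} \, \mathcal{X}_{s, y + j, x + i}. \tag{8}$$
We can further absorb the factors along the spatial dimensions into the core by writing:
$$\tilde{\mathcal{G}}_{k_1, k_2, j, i} = \sum_{k_3=1}^{R_3} \sum_{k_4=1}^{R_4} \mathcal{G}_{k_1, k_2, k_3, k_4} \, \mathbf{U}^{(H)}_{j, k_3} \, \mathbf{U}^{(W)}_{i, k_4}. \tag{9}$$
In that case, the expression above simplifies to:
$$\mathcal{F}_{t, y, x} = \sum_{s=1}^{C} \sum_{j=1}^{K_H} \sum_{i=1}^{K_W} \sum_{k_1=1}^{R_1} \sum_{k_2=1}^{R_2} \tilde{\mathcal{G}}_{k_1, k_2, j, i} \, \mathbf{U}^{(T)}_{t, k_1} \, \mathbf{U}^{(C)}_{s, k_2} \, \mathcal{X}_{s, y + j, x + i}. \tag{10}$$
In other words, this is equivalent to first transforming the number of channels, then applying a (small) convolution before returning from the rank to the target number of channels. This can be seen by rearranging the terms from equation 10:
$$\mathcal{F}_{t, y, x} = \sum_{k_1=1}^{R_1} \mathbf{U}^{(T)}_{t, k_1} \left[ \sum_{j=1}^{K_H} \sum_{i=1}^{K_W} \sum_{k_2=1}^{R_2} \tilde{\mathcal{G}}_{k_1, k_2, j, i} \left[ \sum_{s=1}^{C} \mathbf{U}^{(C)}_{s, k_2} \, \mathcal{X}_{s, y + j, x + i} \right] \right]. \tag{11}$$
In short, this simplifies to the following expression, also illustrated in figure 2:
$$\mathcal{F} = \left( \left( \mathcal{X} \times_1 \left(\mathbf{U}^{(C)}\right)^{\top} \right) \star \tilde{\mathcal{G}} \right) \times_1 \mathbf{U}^{(T)}, \tag{12}$$
where $\times_1$ denotes the mode product along the channel mode.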
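The bottleneck structure derived above can also be verified numerically. The following is a minimal NumPy sketch under assumed toy sizes and ranks; the helper `conv2d` and all variable names are illustrative and not part of the paper.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D cross-correlation; x: (C, H, W), w: (T, C, KH, KW)."""
    T, C, KH, KW = w.shape
    out_h, out_w = x.shape[1] - KH + 1, x.shape[2] - KW + 1
    out = np.zeros((T, out_h, out_w))
    for j in range(KH):
        for i in range(KW):
            out += np.einsum("ts,syx->tyx", w[:, :, j, i],
                             x[:, j:j + out_h, i:i + out_w])
    return out

rng = np.random.default_rng(0)
C, T, KH, KW, H, W = 6, 8, 3, 3, 7, 7
R1, R2, R3, R4 = 4, 3, 2, 2                    # assumed Tucker ranks
G = rng.standard_normal((R1, R2, R3, R4))      # Tucker core
U_T = rng.standard_normal((T, R1))
U_C = rng.standard_normal((C, R2))
U_H = rng.standard_normal((KH, R3))
U_W = rng.standard_normal((KW, R4))
x = rng.standard_normal((C, H, W))

# Regular convolution with the kernel rebuilt from its Tucker form
w = np.einsum("abcd,ta,sb,jc,id->tsji", G, U_T, U_C, U_H, U_W)
full = conv2d(x, w)

# Bottleneck: absorb the spatial factors into the core, then
# 1x1 conv (C -> R2), small regular conv (R2 -> R1), 1x1 conv (R1 -> T)
core_spatial = np.einsum("abcd,jc,id->abji", G, U_H, U_W)   # (R1, R2, KH, KW)
z = np.einsum("sb,syx->byx", U_C, x)
z = conv2d(z, core_spatial)
bottleneck = np.einsum("ta,ayx->tyx", U_T, z)

same = np.allclose(full, bottleneck)
```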
2.5 Efficient convolutional blocks as a tensorized convolution
While tensor decompositions have been explored in the context of deep learning for years, and in the mathematical context for decades, they are regularly rediscovered and re-introduced in different forms. Here, we revisit popular deep neural network architectures under the lens of tensor factorization. Specifically, we show how these blocks can be obtained from a regular convolution by applying tensor decomposition to its kernel. In practice, batch-normalisation layers and non-linearities are inserted in between the intermediary convolutions to facilitate learning from scratch.
ResNet Bottleneck block: He et al. [2016] introduced a block, coined the Bottleneck block, in their seminal work on deep residual networks. It consists of a $1 \times 1$ convolution to reduce the number of channels, a smaller regular ($3 \times 3$) convolution, and another $1 \times 1$ convolution to restore the rank to the desired number of output channels. Based on the equivalence derived in section 2.4, it is straightforward to see this as applying Tucker decomposition to the kernel of a regular convolution.
ResNext and Xception: ResNext [Xie et al., 2017] builds on this bottleneck architecture, which, as we have shown, is equivalent to applying Tucker decomposition to the convolutional kernel. In order to reduce the rank further, the output is expressed as a sum of such bottlenecks, each with a lower rank. This can be reformulated efficiently using grouped convolutions [Xie et al., 2017]. In parallel, a similar approach was proposed by Chollet [2017], but without the $1 \times 1$ convolution following the grouped depthwise convolution.
MobileNet v1: MobileNet v1 [Howard et al., 2017] uses building blocks made of a depthwise separable convolution (spatial part of the convolution) followed by a $1 \times 1$ convolution to adjust the number of output channels. This can be readily obtained from a CP decomposition (section 2.3) as follows: first we write the convolutional weight tensor as detailed in equation 4, with a rank equal to the number of input channels, i.e. $R = C$. The first depthwise-separable convolution can be obtained by combining the two spatial 1D convolutions $\mathbf{U}^{(H)}$ and $\mathbf{U}^{(W)}$. This results in a single spatial factor $\mathbf{U}^{(S)} \in \mathbb{R}^{K_H \times K_W \times R}$, such that $\mathbf{U}^{(S)}_{j, i, r} = \mathbf{U}^{(H)}_{j, r} \, \mathbf{U}^{(W)}_{i, r}$. The $1 \times 1$ convolution is then given by the matrix product of the remaining factors, $\mathbf{U}^{(T)} \left(\mathbf{U}^{(C)}\right)^{\top} \in \mathbb{R}^{T \times C}$. This is illustrated in Figure 3.
MobileNet v2: MobileNet v2 [Sandler et al., 2018] employs a similar approach by grouping the spatial factors into one spatial factor $\mathbf{U}^{(S)} \in \mathbb{R}^{K_H \times K_W \times R}$, as explained previously for the case of MobileNet. However, the other factors are left untouched. The rank of the decomposition, in this case, corresponds, for each layer, to the expansion factor times the number of input channels. This results in two $1 \times 1$ convolutions and a $3 \times 3$ depthwise separable convolution. Finally, the kernel weight tensor (displayed graphically in figure 4) is expressed as:
$$\mathcal{W}_{t, s, j, i} = \sum_{r=1}^{R} \mathbf{U}^{(T)}_{t, r} \, \mathbf{U}^{(C)}_{s, r} \, \mathbf{U}^{(S)}_{j, i, r}. \tag{13}$$
In practice, MobileNet-v2 also includes batch-normalisation layers and non-linearities in between the convolutions, as well as a skip connection to facilitate learning.
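As a rough illustration of this factorized view of a MobileNet-v2 block, the sketch below builds a kernel from a channel-expansion factor, a spatial (depthwise) factor and a projection factor, and checks that the three-step computation matches a regular convolution with that kernel. It covers only the linear part: batch-normalisation, non-linearities and the skip connection are deliberately left out, and all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, expansion, K, H, W = 3, 4, 2, 3, 6, 6
R = expansion * C                        # rank = expansion factor x input channels
U_C = rng.standard_normal((C, R))        # 1x1 expansion (C -> R)
U_S = rng.standard_normal((K, K, R))     # depthwise K x K spatial factor
U_T = rng.standard_normal((T, R))        # 1x1 projection (R -> T)
x = rng.standard_normal((C, H, W))
out_h, out_w = H - K + 1, W - K + 1

# Full kernel implied by the three factors, and the regular convolution with it
w = np.einsum("tr,sr,jir->tsji", U_T, U_C, U_S)
reference = np.zeros((T, out_h, out_w))
for j in range(K):
    for i in range(K):
        reference += np.einsum("ts,syx->tyx", w[:, :, j, i],
                               x[:, j:j + out_h, i:i + out_w])

# Factorized path: expand, depthwise convolution, project
z = np.einsum("sr,syx->ryx", U_C, x)
depthwise = np.zeros((R, out_h, out_w))
for j in range(K):
    for i in range(K):
        depthwise += U_S[j, i][:, None, None] * z[:, j:j + out_h, i:i + out_w]
factorized = np.einsum("tr,ryx->tyx", U_T, depthwise)

same = np.allclose(reference, factorized)
full_params = T * C * K * K              # parameters of the dense kernel
factor_params = R * (C + T + K * K)      # parameters of the three factors
```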
3 Efficient N-D convolutions via higher-order factorization
We propose to generalize the framework introduced above to convolutions of any arbitrary order. Specifically, we express, in the general case, separable ND-convolutions as a series of 1D convolutions, and show how this is derived from a CP convolution on the N–dimensional kernel.
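In particular, here, we consider an $(N+1)^{\text{th}}$-order input activation tensor $\mathcal{X} \in \mathbb{R}^{C \times D_1 \times \cdots \times D_N}$ with $C$ channels. We define a general, higher-order separable convolution $\Phi$ defined by a kernel $\mathcal{W} \in \mathbb{R}^{T \times C \times K_1 \times \cdots \times K_N}$, and expressed as a Kruskal tensor of rank $R$, i.e. $\mathcal{W} = [\![\mathbf{U}^{(T)}, \mathbf{U}^{(C)}, \mathbf{U}^{(K_1)}, \ldots, \mathbf{U}^{(K_N)}]\!]$. We can then write:
$$\Phi(\mathcal{X})_{t, i_1, \ldots, i_N} = \sum_{r=1}^{R} \sum_{s=1}^{C} \sum_{k_1=1}^{K_1} \cdots \sum_{k_N=1}^{K_N} \mathbf{U}^{(T)}_{t, r} \, \mathbf{U}^{(C)}_{s, r} \, \mathbf{U}^{(K_1)}_{k_1, r} \cdots \mathbf{U}^{(K_N)}_{k_N, r} \, \mathcal{X}_{s, i_1 + k_1, \ldots, i_N + k_N}. \tag{14}$$
By rearranging the terms, this expression can be rewritten as:
$$\Phi(\mathcal{X})_{t, i_1, \ldots, i_N} = \sum_{r=1}^{R} \mathbf{U}^{(T)}_{t, r} \left[ \sum_{k_1=1}^{K_1} \mathbf{U}^{(K_1)}_{k_1, r} \left[ \cdots \left[ \sum_{k_N=1}^{K_N} \mathbf{U}^{(K_N)}_{k_N, r} \left[ \sum_{s=1}^{C} \mathbf{U}^{(C)}_{s, r} \, \mathcal{X}_{s, i_1 + k_1, \ldots, i_N + k_N} \right] \right] \right] \right]. \tag{15}$$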
Tensor decompositions (and, in particular, decomposed convolutions) are notoriously hard to train end-to-end [Jaderberg et al., 2014, Lebedev et al., 2015, Tai et al., 2015]. As a result, most existing approaches rely on first training an uncompressed network, then decomposing the convolutional kernels before replacing the convolutions with their efficient rewriting and fine-tuning to recover lost performance. However, this approach is not suitable for higher-order convolutions, where it might not be practical to train the full N-D convolution.
Instead, we propose to facilitate training by adding non-linearities (e.g. batch normalisation combined with ReLU), leading to the following expression:
Skip connection can also be added by introducing an additional factor and using . This results in an efficient higher-order CP convolution, detailed in algorithm 1.
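The linear part of the factorized higher-order convolution described in this section can be sketched in NumPy as a $1 \times 1$ convolution, a series of depthwise 1-D convolutions (one per spatial mode) and a final $1 \times 1$ convolution. This is an illustration under stated assumptions, not the paper's algorithm 1: non-linearities and the skip connection are omitted so that the result can be compared exactly with a regular N-D convolution, and the function names `conv_nd` and `ho_cp_conv` are hypothetical.

```python
import numpy as np
from itertools import product

def conv_nd(x, w):
    """Valid N-D cross-correlation; x: (C, D1..DN), w: (T, C, K1..KN)."""
    ksize = w.shape[2:]
    out_shape = tuple(d - k + 1 for d, k in zip(x.shape[1:], ksize))
    out = np.zeros((w.shape[0],) + out_shape)
    for offsets in product(*(range(k) for k in ksize)):
        window = x[(slice(None),) + tuple(slice(o, o + n)
                                          for o, n in zip(offsets, out_shape))]
        taps = w[(slice(None), slice(None)) + offsets]          # (T, C)
        out += np.tensordot(taps, window, axes=([1], [0]))
    return out

def ho_cp_conv(x, U_T, U_C, spatial_factors):
    """Linear factorized convolution built from 1x1 and depthwise 1-D convolutions."""
    z = np.tensordot(U_C.T, x, axes=([1], [0]))                 # C -> R channels
    for axis, U_K in enumerate(spatial_factors, start=1):
        k = U_K.shape[0]
        out_len = z.shape[axis] - k + 1
        acc = 0
        for offset in range(k):
            window = np.take(z, np.arange(offset, offset + out_len), axis=axis)
            # Each rank channel r is scaled by its own kernel tap U_K[offset, r]
            scale = U_K[offset].reshape((-1,) + (1,) * (z.ndim - 1))
            acc = acc + scale * window
        z = acc
    return np.tensordot(U_T, z, axes=([1], [0]))                # R -> T channels

rng = np.random.default_rng(0)
C, T, R = 3, 4, 5
x = rng.standard_normal((C, 6, 7, 8))                           # 3-D input (N = 3)
U_T, U_C = rng.standard_normal((T, R)), rng.standard_normal((C, R))
U_1, U_2, U_3 = (rng.standard_normal((3, R)) for _ in range(3))

w = np.einsum("tr,sr,ar,br,cr->tsabc", U_T, U_C, U_1, U_2, U_3)
same = np.allclose(conv_nd(x, w), ho_cp_conv(x, U_T, U_C, [U_1, U_2, U_3]))
```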
This formulation is significantly more efficient than that of a regular convolution. Let’s consider an N-dimensional convolution, with $C$ input channels and $T$ output channels, i.e. a weight of size $T \times C \times K_1 \times \cdots \times K_N$. Then a regular N-D convolution has $C \, T \prod_{k=1}^{N} K_k$ parameters. By contrast, our HO-CP convolution with rank $R$ has only $R \left(C + T + \sum_{k=1}^{N} K_k\right)$ parameters. For instance, for a 3D convolution with a cubic kernel (of size $K \times K \times K$), a regular 3D convolution would have $C \, T \, K^3$ parameters, versus only $R \, (C + T + 3K)$ for our proposed HO-CP convolution.
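As a small worked example of this comparison, the sketch below counts the weights of a dense kernel and of the factor matrices of a rank-$R$ factorization. The function names and the layer sizes are hypothetical, and biases, batch-normalisation parameters and any skip-connection factor are ignored.

```python
from math import prod

def regular_conv_params(c_in, c_out, kernel_size):
    """Weights of a dense N-D convolution kernel (bias ignored)."""
    return c_in * c_out * prod(kernel_size)

def ho_cp_conv_params(c_in, c_out, kernel_size, rank):
    """Weights of the CP factors: one (dim x rank) matrix per mode of the kernel."""
    return rank * (c_in + c_out + sum(kernel_size))

# Hypothetical 3-D layer: 64 -> 64 channels, 3 x 3 x 3 kernel, rank 64
regular = regular_conv_params(64, 64, (3, 3, 3))             # 64 * 64 * 27
factorized = ho_cp_conv_params(64, 64, (3, 3, 3), rank=64)   # 64 * (64 + 64 + 9)
```

For these assumed sizes, the dense kernel has 110592 weights and the factors have 8768.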
This reduction in the number of parameters translates into much more efficient operations in terms of floating point operations (FLOPs). We show, in figure 5, a visualisation of the number of Giga FLOPs (GFLOPs, with 1 GFLOP = $10^9$ FLOPs), for both a regular 3D convolution and our proposed approach, for an input of size , varying the number of input and output channels, with a kernel size of .
4 Experimental setting
In this section, we introduce the experimental setting and the databases used. We then detail the implementation details and results obtained with our method.
We empirically assess the performance of our model on the popular task of image classification for both the 2D and 3D case, on two popular datasets, namely CIFAR-10 and 20BN-Jester.
CIFAR-10 [Krizhevsky and Hinton, 2009] is a dataset for image classification composed of 10 classes with 6000 images each, divided into 5000 images per class for training and 1000 images per class for testing, on which we report the results.
20BN-Jester Dataset v1 is a dataset (available at https://www.twentybn.com/datasets/jester/v1) of videos, each representing one of hand gestures (e.g. swiping left, thumb up, etc). Each video contains a person performing one of gestures in front of a web camera. Out of the videos are used for training, for validation on which we report the results.
4.2 Implementation details
For the CIFAR-10 experiments, we used a MobileNet-v2 as our baseline. For our approach, we simply replaced the full MobileNet-v2 blocks with ours (which, in the 2D case, differs from MobileNet-v2 by the use of two separable convolutions along the spatial dimensions instead of a single 2D kernel).
For the 20BN-Jester dataset, we used a convolutional column composed of convolutional blocks with kernel size , with respective input and output of channels: and , followed by two fully-connected layers to reduce the dimensionality to first, and finally to the number of classes. Between each convolution we added a batch-normalisation layer, non-linearity (ELU) and max-pooling. The full architecture is detailed in the supplementary materials. For our approach, we used the same setting but replaced the 3D convolutions with our proposed block and used, for each layer, for the rank of the HO-CP convolution. The dataset was processed by batches of sequences of RGB images, with a temporal resolution of frames and a size of . We validated the learning rate in the range with decay on plateau by a factor of 10. In all cases we report average performance on the over runs.
|Network||# parameters||Accuracy (%)|
Here, we present the performance of our approach for the 2D and 3D case. While our method is general and applicable to data of any order, these are the most popular cases and the ones for which most of the existing work, both in terms of software support and algorithm development, has been done. We therefore focus our experiments on these two cases.
Results for 2D convolutional networks. We compared our method with a MobileNet-v2 with a comparable number of parameters (Table 1). Unsurprisingly, both approaches yield similar results since, in the 2D case, the two network architectures are similar. It is worth noting that our method has marginally fewer parameters than MobileNet-v2, for the same number of channels, even though that network is already optimized for efficiency.
|HO-CP ConvNet (Ours)||M|
|HO-CP ConvNet-S (Ours)||M|
Results for 3D convolutional networks. For the 3D case, we compare our Higher-Order CP convolution with a regular 3D convolution in a simple neural network architecture, in the same setting. Our approach is more computationally efficient and achieves better performance (Table 2). In particular, the basic version without skip connection and with ReLU (HO-CP ConvNet) has million fewer parameters in the convolutional layers compared to the regular 3D network, and yet converges to better Top-1 and Top-5 accuracy. The version with skip connection and PReLU (HO-CP ConvNet-S) beats all approaches.
In this paper, we established the link between tensor factorization and efficient convolutions, in a unified framework. We showed how efficient architectures can be directly derived from this framework. Building upon these findings, we proposed a new efficient convolution, defined for any arbitrary number of dimensions. We theoretically derived an efficient algorithm for computing these N-D convolutions for any N by leveraging higher-order tensor factorization. Specifically, we express the convolution as a superposition of rank-1 tensors, allowing us to reformulate it efficiently. We empirically demonstrated that our approach is efficient on both 2D and 3D tasks using existing frameworks, in terms of performance (accuracy), memory requirements and computational burden.
- Astrid and Lee  Marcella Astrid and Seung-Ik Lee. Cp-decomposition with tensor power method for convolutional neural networks compression. CoRR, abs/1701.07148, 2017.
- Chollet  François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1251–1258, 2017.
- Cohen et al.  Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
- Garipov et al.  Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. NIPS workshop: Learning with Tensors: Why Now and How?, 2016.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016. doi: 10.1109/CVPR.2016.90.
- Howard et al.  Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
- Jaderberg et al.  M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, 2014.
- Kim et al.  Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. ICLR, 05 2016.
- Kolda and Bader  Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM REVIEW, 51(3):455–500, 2009.
- Kossaifi et al.  Jean Kossaifi, Aran Khanna, Zachary Lipton, Tommaso Furlanello, and Anima Anandkumar. Tensor contraction layers for parsimonious deep nets. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1940–1946. IEEE, 2017.
- Kossaifi et al.  Jean Kossaifi, Zachary C. Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. CoRR, abs/1707.08308, 2018.
- Kossaifi et al.  Jean Kossaifi, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. Tensorly: Tensor learning in python. Journal of Machine Learning Research, 20(26):1–6, 2019. URL http://jmlr.org/papers/v20/18-277.html.
- Krizhevsky and Hinton  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Lebedev et al.  Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In ICLR, 2015.
- Lecun et al.  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.
- Novikov et al.  Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 442–450, 2015.
- Oseledets  I. V. Oseledets. Tensor-train decomposition. SIAM J. Sci. Comput., 33(5):2295–2317, September 2011.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- Poggio and Liao  T Poggio and Q Liao. Theory i: Deep networks and the curse of dimensionality. Bulletin of the Polish Academy of Sciences: Technical Sciences, pages 761–773, 01 2018. doi: 10.24425/bpas.2018.125924.
- Rigamonti et al.  R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013. doi: 10.1109/CVPR.2013.355.
- Sandler et al.  Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Szegedy et al.  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- Szegedy et al.  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
- Szegedy et al.  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Tai et al.  Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank regularization. CoRR, abs/1511.06067, 2015.
- Xie et al.  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.