1 Introduction and related work
Multidimensional convolution arises in several mathematical models across different fields and is the cornerstone of Convolutional Neural Networks (CNNs) [Krizhevsky et al., 2012, Lecun et al., 1998]. Indeed, the convolution operator is the key layer of CNNs, enabling them to learn effectively from high-dimensional data by mitigating the curse of dimensionality [Poggio and Liao, 2018]. However, CNNs are computationally demanding models, with the cost of the convolutions dominating both training and inference. There is therefore increasing interest in improving the efficiency of multidimensional convolutions.
Several efficient implementations of convolutions have been proposed. For instance, 2D convolution can be implemented as a matrix multiplication by converting the convolution kernel to a Toeplitz matrix. However, this procedure replicates the kernel values multiple times across different columns of the Toeplitz matrix and hence increases the memory requirements. Implementing convolutions via the im2col approach is also memory intensive, due to the space required for building the column matrix. The space requirements of these approaches may be far too large to fit in the memory of mobile or embedded devices, hindering the deployment of CNNs on resource-limited platforms.
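To make the memory overhead concrete, the following sketch implements 2D convolution via im2col with NumPy (the function names are illustrative, not from any particular library): for a $k_h \times k_w$ kernel, every input value is copied into up to $k_h k_w$ columns of the column matrix.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold an H x W input into a (kh*kw, n_positions) column matrix.
    Each output position gets its own column holding a copy of its
    receptive field -- values are duplicated up to kh*kw times."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_im2col(x, kernel):
    """'Valid' 2D cross-correlation expressed as one matrix multiplication."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (kernel.ravel() @ im2col(x, kh, kw)).reshape(oh, ow)
```

For a 5×5 input and a 3×3 kernel, the column matrix already holds 81 values for 25 input pixels, illustrating the blow-up discussed above.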
In general, most existing attempts at efficient convolutions are isolated, and there is currently no unified framework to study them. We are particularly interested in two branches of work, which we review hereafter: first, approaches that leverage tensor methods for efficient convolutions, either to compress them or to reformulate them for speed; second, approaches that directly formulate efficient neural architectures, e.g. using separable convolutions.
Tensor methods for deep learning
The properties of tensor methods [Kolda and Bader, 2009] make them a prime choice for deep neural networks. Besides the theoretical study of the properties of deep neural networks [Cohen et al., 2016], they have been especially studied in the context of reparametrizing existing layers. One goal of such reparametrization is parameter space savings. Novikov et al. [2015], for instance, proposed to reshape the weight matrices of fully-connected layers into high-order tensors and apply tensor decomposition (specifically Tensor-Train (TT) [Oseledets, 2011]) to the resulting tensors. In a follow-up work [Garipov et al., 2016], the same strategy is applied to both fully-connected and convolutional layers. Fully-connected layers and flattening layers can be removed altogether and replaced with tensor regression layers [Kossaifi et al., 2018]. These express outputs through a low-rank multilinear mapping from a high-order activation tensor to an output tensor of arbitrary order. Parameter space savings can also be obtained, while maintaining multilinear structure, by applying tensor contraction [Kossaifi et al., 2017].
Another advantage of tensor reparametrization is computational speed-up. In particular, tensor decomposition is an efficient way to obtain separable filters from convolutional kernels. These separable convolutions were proposed in computer vision by Rigamonti et al. [2013] in the context of filter banks. Jaderberg et al. [2014] first applied this concept to deep learning and proposed leveraging redundancies across channels using separable convolutions. The decomposition is optimized layer by layer via alternating least-squares, by minimizing the reconstruction error between the pre-trained weights and the corresponding approximation.
Lebedev et al. [2015] proposed to apply CP decomposition directly to the (4-dimensional) kernels of pre-trained 2D convolutional layers, making them separable. The resulting formulation allows the convolution to be sped up by rewriting it as a series of smaller convolutions. The incurred loss in performance is compensated by fine-tuning. A similar efficient rewriting of convolutions was proposed by Kim et al. [2016], using Tucker decomposition instead of CP to decompose the convolutional layers of a pre-trained network. This allows the convolution to be rewritten as a 1×1 convolution, followed by a regular convolution with a smaller kernel and, finally, another 1×1 convolution. Note that in this case, the spatial dimensions of the convolutional kernel are left untouched and only the input and output channels are compressed. Again, the loss in performance is compensated by fine-tuning the whole network. CP decomposition is also used by Astrid and Lee [2017] to reparametrize the layers of deep convolutional neural networks, but the tensor power method is used instead of an alternating least-squares algorithm. The network is then iteratively fine-tuned to restore performance. Similarly, Tai et al. [2015] propose to use tensor decomposition to remove redundancy in convolutional layers and express these as the composition of two convolutional layers with fewer parameters. Each 2D filter is approximated by a sum of rank-1 matrices. Thanks to this restricted setting, a closed-form solution can be readily obtained with SVD. This is done for each convolutional layer with a kernel of size larger than 1. Here, we unify the above works and propose a generalisation of separable convolutions to higher orders.
Efficient neural networks
While concepts such as separable convolutions have been studied since the early successes of deep learning using tensor decompositions, they have only relatively recently been "rediscovered" and proposed as standalone, end-to-end trainable efficient neural network architectures. First attempts in the direction of neural network architecture optimization were made early on in the groundbreaking VGG network [Simonyan and Zisserman, 2014], where the large convolutional kernels used in AlexNet [Krizhevsky et al., 2012] were replaced with a series of smaller ones that have an equivalent receptive field size: e.g. a convolution with a 5×5 kernel can be replaced by two consecutive convolutions with 3×3 kernels. In parallel, the idea of decomposing larger kernels into a series of smaller ones is explored in the multiple iterations of the Inception block [Szegedy et al., 2015, 2016, 2017], where a convolutional layer with a 7×7 kernel is approximated with 1×7 and 7×1 kernels. He et al. [2016] introduced the so-called bottleneck module, which reduces the number of channels on which the convolutional layer with larger kernel size (3×3) operates by projecting the features back and forth with two convolutional layers with 1×1 filters. Xie et al. [2017] expand upon this by replacing the 3×3 convolution with a grouped convolutional layer that further reduces the complexity of the model while increasing its representational power. More recently, Howard et al. [2017] introduced the MobileNet architecture, in which the convolutions are replaced with a depthwise separable module: a depthwise convolution (the number of groups equal to the number of channels) followed by a 1×1 convolutional layer that aggregates the information. This type of structure was shown to offer a good balance between performance and computational cost. Sandler et al. [2018] go one step further and incorporate the idea of separable convolutions into an inverted bottleneck module. The proposed module uses 1×1 layers to expand and then contract the channels (hence, inverted bottleneck), while using separable convolutions for the larger spatial kernel.
In this work, we show that many of the aforementioned architectural improvements, such as MobileNet or ResNet's bottleneck blocks, are in fact drawn from the same larger family of tensor decomposition methods, and we propose a new, efficient higher-order convolution based on tensor factorization. Specifically, we make the following contributions:

We review convolutions through the lens of tensor decomposition and establish the link between various tensor decompositions and efficient convolutional blocks (Section 2).

We propose a general framework unifying tensor decomposition and efficient architectures, showing how these efficient architectures can be derived from regular convolution to which tensor decomposition has been applied (Section 2.5).

Based on this framework, we propose a higher-order CP convolutional layer for convolutions of any order, which is both memory and computationally efficient (Section 3).

We demonstrate the performance of our approach on both 2D and 3D data, and show that it offers better performance while being more computationally and memory efficient (Section 4).
2 Tensor factorization for efficient convolutions
Here, we explore the relationship between tensor methods and deep neural networks’ convolutional layers. Without loss of generality, we omit the batch size in all the following formulas.
2.1 Mathematical background and notation
We denote 1st-order tensors (vectors) as $\mathbf{v}$, 2nd-order tensors (matrices) as $\mathbf{M}$, and tensors of order $\geq 3$ as $\mathcal{T}$. We write the regular convolution of $\mathcal{X}$ with $\mathcal{W}$ as $\mathcal{X} * \mathcal{W}$. In the case of a 1-D convolution, we write the convolution of a tensor $\mathcal{X}$ with a vector $\mathbf{v}$ along the $n$-th mode as $\mathcal{X} *_n \mathbf{v}$. Note that in practice, as done in current deep learning frameworks [Paszke et al., 2017], we use cross-correlation, which differs from convolution only by a flipping of the kernel. This does not impact the results since the weights are learned end-to-end.
2.2 1×1 convolutions and tensor contraction
Next, we show that a 1×1 convolution is equivalent to a tensor contraction with the kernel of the convolution along the channel dimension. Let us consider a 1×1 convolution defined by a kernel $\mathcal{W} \in \mathbb{R}^{T \times C \times 1 \times 1}$ and applied to an activation tensor $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$. We denote the squeezed version of $\mathcal{W}$ along the first mode as $\mathbf{W} \in \mathbb{R}^{T \times C}$.
The tensor contraction of a tensor $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$ with a matrix $\mathbf{M} \in \mathbb{R}^{J \times I_n}$, along the $n$-th mode ($0 \leq n \leq N$), known as the n-mode product, is defined as $\mathcal{F} = \mathcal{X} \times_n \mathbf{M}$, with:

$$\mathcal{F}_{i_0, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=0}^{I_n - 1} \mathbf{M}_{j, i_n}\, \mathcal{X}_{i_0, \ldots, i_N} \qquad (1)$$
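Equation (1) can be checked numerically. The sketch below implements the n-mode product with NumPy; the function name `mode_dot` mirrors the one in the TensorLy library [Kossaifi et al., 2019], but this standalone version assumes nothing beyond NumPy:

```python
import numpy as np

def mode_dot(tensor, matrix, mode):
    """n-mode product: contract a (J x I_n) matrix with `tensor` along
    its `mode`-th axis, replacing that axis of size I_n by one of size J."""
    # tensordot sums over tensor's `mode` axis and matrix's second axis,
    # appending the new J axis at the end; moveaxis puts it back in place.
    out = np.tensordot(tensor, matrix, axes=(mode, 1))
    return np.moveaxis(out, -1, mode)
```

Contracting a $3 \times 4 \times 5$ tensor with a $7 \times 4$ matrix along mode 1, for example, yields a $3 \times 7 \times 5$ tensor.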
By plugging (1) into the expression of the convolution, we can observe that a 1×1 convolution is equivalent to an n-mode product between $\mathcal{X}$ and the matrix $\mathbf{W}$. This can readily be seen by writing:

$$\mathcal{F}_{t,y,x} = \sum_{s=0}^{C-1} \mathbf{W}_{t,s}\, \mathcal{X}_{s,y,x} = \left( \mathcal{X} \times_0 \mathbf{W} \right)_{t,y,x} \qquad (2)$$
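The equivalence in (2) is easy to verify numerically; in this sketch the shapes (16 channels in, 32 out) are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8, 8))      # activation: C x H x W
W = rng.standard_normal((32, 16, 1, 1))  # 1x1 kernel: T x C x 1 x 1

# explicit 1x1 convolution: each spatial position is treated independently
F_conv = np.einsum('tsjk,syx->tyx', W, X)
# 0-mode product of X with the squeezed (T x C) kernel matrix
F_mode = np.tensordot(W[:, :, 0, 0], X, axes=(1, 0))
```

The two results coincide exactly, since a 1×1 kernel mixes channels but never spatial positions.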
2.3 Separable convolutions
Here we show how separable convolutions can be obtained by applying a CP decomposition to the kernel of a regular convolution [Lebedev et al., 2015]. We consider a convolution defined by its kernel weight tensor $\mathcal{W} \in \mathbb{R}^{T \times C \times K_H \times K_W}$, applied to an activation tensor $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$. If we define the resulting feature map as $\mathcal{F} = \mathcal{X} * \mathcal{W}$, we have:

$$\mathcal{F}_{t,y,x} = \sum_{s=0}^{C-1} \sum_{j=0}^{K_H - 1} \sum_{k=0}^{K_W - 1} \mathcal{W}_{t,s,j,k}\, \mathcal{X}_{s,\, y+j,\, x+k} \qquad (3)$$
Assuming a low-rank Kruskal structure on the kernel (which can be readily obtained by applying a CP decomposition), we can write:

$$\mathcal{W}_{t,s,j,k} = \sum_{r=0}^{R-1} \mathbf{U}^{(T)}_{t,r}\, \mathbf{U}^{(C)}_{s,r}\, \mathbf{U}^{(H)}_{j,r}\, \mathbf{U}^{(W)}_{k,r} \qquad (4)$$
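The rewriting implied by (3) and (4) can be sketched as follows: the dense convolution with the CP-structured kernel factorises exactly into a 1×1 convolution, two depthwise 1D convolutions, and a final 1×1 convolution (all sizes below are arbitrary; 'valid' padding is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, K, R = 4, 6, 3, 5            # in/out channels, kernel size, CP rank
X = rng.standard_normal((C, 8, 8))
# CP factors of the kernel, one per mode (Eq. 4)
UT = rng.standard_normal((T, R))   # output channels
UC = rng.standard_normal((C, R))   # input channels
UH = rng.standard_normal((K, R))   # height
UW = rng.standard_normal((K, R))   # width

# dense kernel reconstructed from the factors, then convolved directly
Wfull = np.einsum('tr,sr,jr,kr->tsjk', UT, UC, UH, UW)
oh = ow = 8 - K + 1
F_dense = np.zeros((T, oh, ow))
for j in range(K):
    for k in range(K):
        F_dense += np.einsum('ts,syx->tyx', Wfull[:, :, j, k],
                             X[:, j:j + oh, k:k + ow])

# same result as a sequence of four cheap convolutions
Z = np.einsum('sr,syx->ryx', UC, X)                                  # 1x1: C -> R
Z = sum(UH[j][:, None, None] * Z[:, j:j + oh, :] for j in range(K))  # depthwise, height
Z = sum(UW[k][:, None, None] * Z[:, :, k:k + ow] for k in range(K))  # depthwise, width
F_sep = np.einsum('tr,ryx->tyx', UT, Z)                              # 1x1: R -> T
```

The separable path never materialises the dense $T \times C \times K \times K$ kernel, which is where the savings come from.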
2.4 Bottleneck layer
As previously, we consider the convolution $\mathcal{F} = \mathcal{X} * \mathcal{W}$. However, instead of a Kruskal structure, we now assume a low-rank Tucker structure on the kernel (which can be readily obtained by applying a Tucker decomposition), which yields an efficient formulation [Kim et al., 2016]. We can write:

$$\mathcal{W}_{t,s,j,k} = \sum_{r_0=0}^{R_0-1} \sum_{r_1=0}^{R_1-1} \sum_{r_2=0}^{R_2-1} \sum_{r_3=0}^{R_3-1} \mathcal{G}_{r_0,r_1,r_2,r_3}\, \mathbf{U}^{(T)}_{t,r_0}\, \mathbf{U}^{(C)}_{s,r_1}\, \mathbf{U}^{(H)}_{j,r_2}\, \mathbf{U}^{(W)}_{k,r_3} \qquad (7)$$
Plugging back into a convolution, we get:

$$\mathcal{F}_{t,y,x} = \sum_{s=0}^{C-1} \sum_{j=0}^{K_H-1} \sum_{k=0}^{K_W-1} \left[ \sum_{r_0=0}^{R_0-1} \sum_{r_1=0}^{R_1-1} \sum_{r_2=0}^{R_2-1} \sum_{r_3=0}^{R_3-1} \mathcal{G}_{r_0,r_1,r_2,r_3}\, \mathbf{U}^{(T)}_{t,r_0}\, \mathbf{U}^{(C)}_{s,r_1}\, \mathbf{U}^{(H)}_{j,r_2}\, \mathbf{U}^{(W)}_{k,r_3} \right] \mathcal{X}_{s,\, y+j,\, x+k} \qquad (8)$$
We can further absorb the factors along the spatial dimensions into the core by writing:

$$\mathcal{H}_{r_0,r_1,j,k} = \sum_{r_2=0}^{R_2-1} \sum_{r_3=0}^{R_3-1} \mathcal{G}_{r_0,r_1,r_2,r_3}\, \mathbf{U}^{(H)}_{j,r_2}\, \mathbf{U}^{(W)}_{k,r_3} \qquad (9)$$
In that case, the expression above simplifies to:

$$\mathcal{F}_{t,y,x} = \sum_{s=0}^{C-1} \sum_{j=0}^{K_H-1} \sum_{k=0}^{K_W-1} \sum_{r_0=0}^{R_0-1} \sum_{r_1=0}^{R_1-1} \mathcal{H}_{r_0,r_1,j,k}\, \mathbf{U}^{(T)}_{t,r_0}\, \mathbf{U}^{(C)}_{s,r_1}\, \mathcal{X}_{s,\, y+j,\, x+k} \qquad (10)$$
In other words, this is equivalent to first transforming the number of channels, then applying a (small) convolution, before returning from the rank to the target number of output channels. This can be seen by rearranging the terms of equation 10:

$$\mathcal{F}_{t,y,x} = \sum_{r_0=0}^{R_0-1} \mathbf{U}^{(T)}_{t,r_0} \left[ \sum_{j=0}^{K_H-1} \sum_{k=0}^{K_W-1} \sum_{r_1=0}^{R_1-1} \mathcal{H}_{r_0,r_1,j,k} \left( \sum_{s=0}^{C-1} \mathbf{U}^{(C)}_{s,r_1}\, \mathcal{X}_{s,\, y+j,\, x+k} \right) \right] \qquad (11)$$
In short, this simplifies to the following expression, also illustrated in figure 2:

$$\mathcal{F} = \left( \left( \mathcal{X} \times_0 {\mathbf{U}^{(C)}}^{\top} \right) * \mathcal{H} \right) \times_0 \mathbf{U}^{(T)} \qquad (12)$$
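The chain in (12) can be verified in the same way as the CP case: with the spatial factors absorbed into the core as in (9), the Tucker-structured convolution factorises exactly into a 1×1 convolution, a small regular convolution at low rank, and another 1×1 convolution (shapes below are arbitrary; 'valid' padding):

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, K = 4, 6, 3                          # in/out channels, kernel size
R0, R1 = 5, 3                              # Tucker ranks (output / input channels)
X = rng.standard_normal((C, 8, 8))
UT = rng.standard_normal((T, R0))          # output-channel factor
UC = rng.standard_normal((C, R1))          # input-channel factor
G = rng.standard_normal((R0, R1, K, K))    # core with spatial factors absorbed (Eq. 9)

# dense kernel and direct convolution
Wfull = np.einsum('tr,sq,rqjk->tsjk', UT, UC, G)
oh = ow = 8 - K + 1
F_dense = np.zeros((T, oh, ow))
for j in range(K):
    for k in range(K):
        F_dense += np.einsum('ts,syx->tyx', Wfull[:, :, j, k],
                             X[:, j:j + oh, k:k + ow])

# bottleneck: 1x1 down, small KxK convolution at low rank, 1x1 up
Z = np.einsum('sq,syx->qyx', UC, X)        # 1x1: C -> R1
H = np.zeros((R0, oh, ow))
for j in range(K):
    for k in range(K):
        H += np.einsum('rq,qyx->ryx', G[:, :, j, k], Z[:, j:j + oh, k:k + ow])
F_bottleneck = np.einsum('tr,ryx->tyx', UT, H)   # 1x1: R0 -> T
```

Only the middle convolution touches spatial positions, and it operates on $R_1 \to R_0$ channels rather than $C \to T$.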
2.5 Efficient convolutional blocks as a tensorized convolution
While tensor decompositions have been explored in the context of deep learning for years, and in the mathematical context for decades, they are regularly rediscovered and reintroduced in different forms. Here, we revisit popular deep neural network architectures through the lens of tensor factorization. Specifically, we show how these blocks can be obtained from a regular convolution by applying tensor decomposition to its kernel. In practice, batch-normalisation layers and non-linearities are inserted between the intermediate convolutions to facilitate learning from scratch.
ResNet Bottleneck block: He et al. [2016] introduced a block, coined the Bottleneck block, in their seminal work on deep residual networks. It consists of a series of a 1×1 convolution to reduce the number of channels, a smaller regular (3×3) convolution, and another 1×1 convolution to restore the rank to the desired number of output channels. Based on the equivalence derived in section 2.4, it is straightforward to see this as applying a Tucker decomposition to the kernel of a regular convolution.
ResNext and Xception: ResNext [Xie et al., 2017] builds on this bottleneck architecture, which, as we have shown, is equivalent to applying a Tucker decomposition to the convolutional kernel. In order to reduce the rank further, the output is expressed as a sum of such bottlenecks with a lower rank. This can be reformulated efficiently using grouped convolutions [Xie et al., 2017]. In parallel, a similar approach was proposed by Chollet [2017], but without the 1×1 convolution following the grouped depthwise convolution.
MobileNet v1: MobileNet v1 [Howard et al., 2017] uses building blocks made of a depthwise separable convolution (the spatial part of the convolution) followed by a 1×1 convolution to adjust the number of output channels. This can be readily obtained from a CP decomposition (section 2.3) as follows: first we write the convolutional weight tensor as detailed in equation 4, with a rank equal to the number of input channels, i.e. $R = C$. The depthwise convolution can be obtained by combining the two spatial 1D convolutions with $\mathbf{U}^{(H)}$ and $\mathbf{U}^{(W)}$. This results in a single spatial factor $\mathcal{S}$, such that $\mathcal{S}_{r,j,k} = \mathbf{U}^{(H)}_{j,r}\, \mathbf{U}^{(W)}_{k,r}$. The 1×1 convolution is then given by the matrix product of the remaining factors, $\mathbf{U}^{(T)} {\mathbf{U}^{(C)}}^{\top}$. This is illustrated in Figure 3.
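The merging of the two spatial CP factors into a single depthwise filter bank can be sketched in one line: each rank $r$ yields one $K \times K$ filter equal to the outer product of the corresponding factor columns (sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, R = 3, 8
UH = rng.standard_normal((K, R))   # height factor of the CP kernel
UW = rng.standard_normal((K, R))   # width factor of the CP kernel

# one K x K depthwise filter per rank: S[r, j, k] = UH[j, r] * UW[k, r]
S = np.einsum('jr,kr->rjk', UH, UW)
```

Each of the $R$ resulting filters is rank-1 by construction, which is precisely the restriction the CP structure imposes on the spatial part of the kernel.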
MobileNet v2: MobileNet v2 [Sandler et al., 2018] employs a similar approach, grouping the spatial factors into one spatial factor $\mathcal{S}$, as explained previously for the case of MobileNet. However, the other factors are left untouched. The rank of the decomposition, in this case, corresponds, for each layer, to the expansion factor times the number of input channels. This results in two 1×1 convolutions and a depthwise separable convolution. Finally, the kernel weight tensor (displayed graphically in figure 4) is expressed as:

$$\mathcal{W}_{t,s,j,k} = \sum_{r=0}^{R-1} \mathbf{U}^{(T)}_{t,r}\, \mathbf{U}^{(C)}_{s,r}\, \mathcal{S}_{r,j,k} \qquad (13)$$
In practice, MobileNet v2 also includes batch-normalisation layers and non-linearities in between the convolutions, as well as a skip connection to facilitate learning.
3 Efficient N-D convolutions via higher-order factorization
We propose to generalize the framework introduced above to convolutions of any arbitrary order. Specifically, we express, in the general case, separable N-D convolutions as a series of 1D convolutions, and show how this follows from a CP decomposition of the N-dimensional kernel.
In particular, here, we consider an input activation tensor $\mathcal{X} \in \mathbb{R}^{C \times D_0 \times \cdots \times D_{N-1}}$ with $C$ channels and $N$ spatial dimensions. We define a general, higher-order separable convolution defined by a kernel $\mathcal{W} \in \mathbb{R}^{T \times C \times K_0 \times \cdots \times K_{N-1}}$, expressed as a Kruskal tensor, i.e. $\mathcal{W}_{t,s,j_0,\ldots,j_{N-1}} = \sum_{r=0}^{R-1} \mathbf{U}^{(T)}_{t,r}\, \mathbf{U}^{(C)}_{s,r}\, \mathbf{U}^{(0)}_{j_0,r} \cdots \mathbf{U}^{(N-1)}_{j_{N-1},r}$. We can then write:

$$\mathcal{F}_{t,\, i_0,\ldots,i_{N-1}} = \sum_{s=0}^{C-1} \sum_{j_0=0}^{K_0-1} \cdots \sum_{j_{N-1}=0}^{K_{N-1}-1} \sum_{r=0}^{R-1} \mathbf{U}^{(T)}_{t,r}\, \mathbf{U}^{(C)}_{s,r}\, \mathbf{U}^{(0)}_{j_0,r} \cdots \mathbf{U}^{(N-1)}_{j_{N-1},r}\, \mathcal{X}_{s,\, i_0+j_0,\, \ldots,\, i_{N-1}+j_{N-1}} \qquad (14)$$
By rearranging the terms, this expression can be rewritten as a series of tensor contractions and 1D convolutions:

$$\mathcal{F} = \left( \left( \mathcal{X} \times_0 {\mathbf{U}^{(C)}}^{\top} \right) *_1 \mathbf{U}^{(0)} \cdots *_N \mathbf{U}^{(N-1)} \right) \times_0 \mathbf{U}^{(T)} \qquad (15)$$

where $*_n \mathbf{U}^{(n-1)}$ denotes, for each of the $R$ components, a 1D convolution along the $n$-th mode with the corresponding column of the factor.
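The factorisation in (15) can be sketched for arbitrary N. The helper below is illustrative ('valid' padding, no nonlinearities): it applies the input-channel contraction, one depthwise 1D convolution per spatial mode, and the final output-channel contraction:

```python
import numpy as np

def hocp_conv(X, UC, UT, spatial_factors):
    """Higher-order CP convolution: X is (C, d_1, ..., d_N); UC is (C, R),
    UT is (T, R); spatial_factors[n] is (K_n, R), one bank of R 1D filters
    per spatial mode. Returns a (T, ...) tensor ('valid' padding)."""
    Z = np.tensordot(UC.T, X, axes=(1, 0))            # (R, d_1, ..., d_N)
    for n, U in enumerate(spatial_factors, start=1):
        K = U.shape[0]
        o = Z.shape[n] - K + 1                        # output length on mode n
        acc = 0.0
        for j in range(K):                            # depthwise 1D conv on mode n
            idx = [slice(None)] * Z.ndim
            idx[n] = slice(j, j + o)
            acc = acc + U[j].reshape((-1,) + (1,) * (Z.ndim - 1)) * Z[tuple(idx)]
        Z = acc
    return np.tensordot(UT, Z, axes=(1, 0))           # (T, ...)
```

For N = 3, the result matches a direct 3D convolution with the dense kernel assembled from the same factors.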
Tensor decompositions (and, in particular, decomposed convolutions) are notoriously hard to train end-to-end [Jaderberg et al., 2014, Lebedev et al., 2015, Tai et al., 2015]. As a result, most existing approaches rely on first training an uncompressed network, then decomposing the convolutional kernels, before replacing the convolutions with their efficient rewriting and fine-tuning to recover lost performance. However, this approach is not suitable for higher-order convolutions, where it might not be practical to train the full N-D convolution in the first place.
Instead, we propose to facilitate training by adding non-linearities $\Phi$ (e.g. batch normalisation combined with ReLU) between the intermediate convolutions, leading to the following expression:

$$\mathcal{F} = \Phi\left( \cdots \Phi\left( \Phi\left( \mathcal{X} \times_0 {\mathbf{U}^{(C)}}^{\top} \right) *_1 \mathbf{U}^{(0)} \right) \cdots *_N \mathbf{U}^{(N-1)} \right) \times_0 \mathbf{U}^{(T)} \qquad (16)$$
A skip connection can also be added by introducing an additional factor that maps the input directly to the output, which is then summed with the result of the factorized convolution. This results in an efficient higher-order CP convolution, detailed in algorithm 1.
This formulation is significantly more efficient than that of a regular convolution. Let us consider an N-dimensional convolution with $C$ input channels and $T$ output channels, i.e. a weight of size $T \times C \times K_0 \times \cdots \times K_{N-1}$. For a cubic kernel of size $K$ in every dimension, a regular N-D convolution has $T \times C \times K^N$ parameters. By contrast, our HO-CP convolution of rank $R$ has only $R(T + C + NK)$ parameters. For instance, for a 3D convolution with a cubic kernel, a regular convolution requires $T \times C \times K^3$ parameters, versus only $R(T + C + 3K)$ for our proposed HO-CP convolution.
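The counts above follow directly from the shapes of the kernel and of its factors; as a sketch (the example channel counts and rank below are arbitrary, not the ones used in our experiments):

```python
def regular_conv_params(C, T, K, N):
    """Parameters of a dense N-D convolution: a T x C x K x ... x K kernel."""
    return T * C * K ** N

def hocp_conv_params(C, T, K, N, R):
    """Parameters of the rank-R HO-CP factorization: one (T, R) and one
    (C, R) channel factor, plus N spatial (K, R) factors."""
    return R * (T + C + N * K)

# e.g. a 3D convolution with 64 -> 64 channels and a cubic 3x3x3 kernel
dense = regular_conv_params(64, 64, 3, N=3)    # 64 * 64 * 27 = 110592
hocp = hocp_conv_params(64, 64, 3, N=3, R=64)  # 64 * (64 + 64 + 9) = 8768
```

Note that the dense count grows exponentially with N while the factorized count grows only linearly.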
This reduction in the number of parameters translates into much more efficient operations in terms of floating-point operations (FLOPs). We show, in figure 5, a visualisation of the number of Giga-FLOPs (GFLOPs, with 1 GFLOP = $10^9$ FLOPs) for both a regular 3D convolution and our proposed approach, for a fixed input and kernel size, varying the number of input and output channels.
4 Experimental setting
In this section, we introduce the experimental setting and the datasets used. We then detail the implementation and the results obtained with our method.
4.1 Datasets
We empirically assess the performance of our model on the popular task of image classification for both the 2D and 3D case, on two popular datasets, namely CIFAR-10 and 20BN-Jester.
CIFAR-10 [Krizhevsky and Hinton, 2009] is a dataset for image classification composed of 10 classes of 32×32 colour images, divided into 5,000 images per class for training and 1,000 images per class for testing, on which we report the results.
20BN-Jester Dataset v1 is a dataset of videos (available at https://www.twentybn.com/datasets/jester/v1), each representing one of 27 hand gestures (e.g. swiping left, thumb up, etc.). Each video contains a person performing one of the gestures in front of a web camera. The videos are split into a training set and a validation set, on which we report the results.
4.2 Implementation details
For the CIFAR-10 experiments, we used a MobileNet-v2 as our baseline. For our approach, we simply replaced the full MobileNet-v2 blocks with ours (which, in the 2D case, differs from MobileNet-v2 by the use of two separable convolutions along the spatial dimensions instead of a single 2D kernel).
For the 20BN-Jester dataset, we used a convolutional column composed of a series of 3D convolutional blocks, followed by two fully-connected layers: the first reduces the dimensionality and the second maps to the number of classes. Between each convolution we added a batch-normalisation layer, a non-linearity (ELU) and max-pooling. The full architecture is detailed in the supplementary material. For our approach, we used the same setting but replaced the 3D convolutions with our proposed block, with a fixed rank for the HO-CP convolution in each layer. The dataset was processed in batches of sequences of RGB images, downsampled in both temporal and spatial resolution. We validated the learning rate, with decay on plateau by a factor of 10. In all cases we report the average performance over several runs.
4.3 Results
Network  # parameters  Accuracy (%)
MobileNet-v2
HOS-Conv (Ours)
Here, we present the performance of our approach in the 2D and 3D cases. While our method is general and applicable to data of any order, these are the most popular cases and the ones on which existing work, both in terms of software support and algorithmic development, focuses. We therefore concentrate our experiments on these two cases.
Results for 2D convolutional networks We compared our method with a MobileNet-v2 with a comparable number of parameters (Table 1). Unsurprisingly, both approaches yield similar results since, in the 2D case, the two network architectures are similar. It is worth noting that our method has marginally fewer parameters than MobileNet-v2 for the same number of channels, even though that network is already optimized for efficiency.
Network  # conv parameters (M)  Top-1 Accuracy (%)  Top-5 Accuracy (%)
3D-ConvNet
HO-CP ConvNet (Ours)
HO-CP ConvNet-S (Ours)
Results for 3D convolutional networks For the 3D case, we compare our Higher-Order CP convolution with a regular 3D convolution in a simple neural network architecture, trained in the same setting, in order to be able to compare them directly. Our approach is more computationally efficient and achieves better performance (Table 2). In particular, the basic version without skip connection and with ReLU (HO-CP ConvNet) has several million fewer parameters in the convolutional layers than the regular 3D network, and yet converges to better Top-1 and Top-5 accuracy. The version with skip connection and PReLU (HO-CP ConvNet-S) beats all other approaches.
5 Conclusions
In this paper, we established the link between tensor factorization and efficient convolutions in a unified framework. We showed how efficient architectures can be directly derived from this framework. Building on these findings, we proposed a new efficient convolution, defined for any arbitrary number of dimensions. We derived an efficient algorithm for computing N-D convolutions for any N by leveraging higher-order tensor factorization: specifically, we express the convolutional kernel as a sum of rank-1 tensors, allowing us to reformulate the convolution efficiently. We empirically demonstrated that our approach is efficient on both 2D and 3D tasks using existing frameworks, in terms of performance (accuracy), memory requirements and computational burden.
References
 Astrid and Lee [2017] Marcella Astrid and Seung-Ik Lee. CP-decomposition with tensor power method for convolutional neural networks compression. CoRR, abs/1701.07148, 2017.

 Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1251–1258, 2017.
 Cohen et al. [2016] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
 Garipov et al. [2016] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. NIPS workshop: Learning with Tensors: Why Now and How?, 2016.
 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016. doi: 10.1109/CVPR.2016.90.
 Howard et al. [2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
 Jaderberg et al. [2014] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, 2014.
 Kim et al. [2016] YongDeok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. ICLR, 05 2016.
 Kolda and Bader [2009] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM REVIEW, 51(3):455–500, 2009.
 Kossaifi et al. [2017] Jean Kossaifi, Aran Khanna, Zachary Lipton, Tommaso Furlanello, and Anima Anandkumar. Tensor contraction layers for parsimonious deep nets. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1940–1946. IEEE, 2017.
 Kossaifi et al. [2018] Jean Kossaifi, Zachary C. Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. CoRR, abs/1707.08308, 2018.

 Kossaifi et al. [2019] Jean Kossaifi, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. TensorLy: Tensor learning in Python. Journal of Machine Learning Research, 20(26):1–6, 2019. URL http://jmlr.org/papers/v20/18-277.html.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Lebedev et al. [2015] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In ICLR, 2015.
 Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. ISSN 00189219. doi: 10.1109/5.726791.
 Novikov et al. [2015] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 442–450, 2015.
 Oseledets [2011] I. V. Oseledets. Tensor-train decomposition. SIAM J. Sci. Comput., 33(5):2295–2317, September 2011.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 Poggio and Liao [2018] T Poggio and Q Liao. Theory i: Deep networks and the curse of dimensionality. Bulletin of the Polish Academy of Sciences: Technical Sciences, pages 761–773, 01 2018. doi: 10.24425/bpas.2018.125924.
 Rigamonti et al. [2013] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013. doi: 10.1109/CVPR.2013.355.
 Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and LiangChieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

 Szegedy et al. [2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Tai et al. [2015] Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank regularization. CoRR, abs/1511.06067, 2015.
 Xie et al. [2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.