Introduction
Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on a wide range of computer vision tasks such as image classification (He et al., 2016), video recognition (Feichtenhofer et al., 2019) and object detection (Ren et al., 2015). Despite achieving remarkably low generalization errors, modern CNN architectures are typically overparameterized and consist of millions of parameters. As the size of state-of-the-art CNN architectures continues to grow, it becomes more challenging to deploy these models on resource-constrained edge devices that are limited in both memory and energy. Motivated by studies demonstrating that there is significant redundancy in CNN parameters (Denil et al., 2013), model compression techniques such as pruning, quantization, tensor decomposition, and knowledge distillation have emerged to address this problem.
Decomposition methods have gained more attention in recent years as they can achieve higher compression rates in comparison to other approaches. Namely, Tucker (Kim et al., 2016), CP (Lebedev et al., 2015), Tensor-Train (Garipov et al., 2016) and Tensor-Ring (Wang et al., 2018) decompositions have been widely studied for DNNs. However, these methods still suffer significant accuracy loss for computer vision tasks.
Kronecker Product Decomposition (KPD) is another decomposition method that has recently been shown to be very effective when applied to RNNs (Thakker et al., 2019). KPD leads to model compression via replacing a large matrix with two smaller Kronecker factor matrices that best approximate the original matrix. In this work, we generalize KPD to tensors, yielding the Generalized Kronecker Product Decomposition (GKPD), and use it to decompose convolution tensors. GKPD involves finding the summation of Kronecker products between factor tensors that best approximates the original tensor. We call this the Multidimensional Nearest Kronecker Product Problem and provide a solution to it. By formulating the convolution operation directly in terms of the Kronecker factors, we show that we can avoid reconstruction at runtime and thus obtain a significant reduction in memory footprint and floating-point operations (FLOPs). Once all convolution tensors in a pretrained CNN have been replaced by their compressed counterparts, we retrain the network. If a pretrained network is not available, we show that we are still able to train our compressed network from a random initialization. Furthermore, we show that these randomly initialized networks retain universal approximation capability by building on Hornik (1991) and Zhou (2020).
Applying GKPD to an arbitrary tensor leads to multiple possible decompositions, one for each configuration of Kronecker factors. As shown in Figure 1, we find that for any given compression factor, choosing a decomposition that consists of a larger summation of smaller Kronecker factors (as opposed to a smaller summation of larger Kronecker factors) leads to a lower reconstruction error as well as improved model accuracy.
To summarize, the main contributions of this paper are:

Generalizing the Kronecker Product Decomposition to multidimensional tensors

Introducing the Multidimensional Nearest Kronecker Product Problem and providing a solution
Related Work on DNN Model Compression
Quantization methods focus on reducing the precision of parameters and/or activations into lower-bit representations. For example, the work in (Han et al., 2015a) quantizes the parameter precision from 32 bits to 8 bits or lower. Model weights have been quantized even further into binary (Courbariaux et al., 2015; Rastegari et al., 2016; Courbariaux et al., 2016; Hubara et al., 2017), and ternary (Li et al., 2016; Zhu et al., 2016) representations. In these methods, choosing between a uniform (Jacob et al., 2018) or non-uniform (Han et al., 2015a; Tang et al., 2017; Zhang et al., 2018b) quantization interval affects the compression rate and the acceleration.
Pruning methods began by exploring unstructured network weights and deactivating small weights through applying sparsity regularization to the weight parameters (Liu et al., 2015; Han et al., 2015a, b) or considering statistical information from layers to guide the parameter selection, as in ThiNet (Luo et al., 2017). Unstructured pruning results in irregularities in the weight parameters which impact the expected acceleration rate of the pruned network. Hence, several works aim at zeroing out structured groups of the convolutional filters through group sparsity regularization (Zhou et al., 2016; Wen et al., 2016; Alvarez and Salzmann, 2016). Sparsity regularization has been combined with other forms of regularizers such as low-rank (Alvarez and Salzmann, 2017), ordered weighted (Zhang et al., 2018a), and out-in-channel sparsity (Li et al., 2019) regularizers to further improve the pruning performance.
Decomposition methods factorize the original weight matrix or tensor into lightweight representations. This results in far fewer parameters and consequently fewer computations. Applying truncated singular value decomposition (SVD) to compress the weight matrix of fully-connected layers is one of the earliest works in this category (Denton et al., 2014). In the same vein, canonical polyadic (CP) decomposition of the kernel tensors was proposed in (Lebedev et al., 2015). This work uses nonlinear least squares to decompose the original convolution kernel into a set of rank-1 tensors (vectors). An alternative tensor decomposition approach to convolution kernel compression is Tucker decomposition (Tucker, 1963). Tucker decomposition has been shown to provide more flexible interaction between the factor matrices through a core tensor. The idea of reshaping weights of fully-connected layers into high-dimensional tensors and representing them in Tensor-Train format (Oseledets, 2011) was extended to CNNs in (Garipov et al., 2016). Tensor-Ring decomposition has also become another popular option to compress CNNs (Wang et al., 2018). For multidimensional data completion with the same intermediate rank, Tensor-Ring (TR) can be far more expressive than Tensor-Train (TT) (Wang et al., 2017). Kronecker factorization was also used to replace the weight matrices and weight tensors within fully-connected and convolution layers (Zhou et al., 2015). This work, however, limited the representation to a single Kronecker product and trained the model with random initialization. As shown in Fig. 1 and in the next sections of this paper, summation can significantly improve the representation power of the network and thus leads to a performance increase.
Other forms of model compression can also be achieved through sharing convolutional weight matrices in a more structured manner, as in ShaResNet (Boulch, 2018), which reuses convolutional mappings within the same scale level, or FSNet (Yang et al., 2020), which shares filter weights across spatial locations. NNs can also be compressed using Knowledge Distillation (KD), where a large pretrained (teacher) network is used to train a smaller (student) network (Mirzadeh et al., 2020; Heo et al., 2019). Designing lightweight CNNs such as MobileNet (Sandler et al., 2018) and SqueezeNet (Iandola et al., 2016) is another form of model compression.
Preliminaries
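Given matrices $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$, their Kronecker product is the $m_1 m_2 \times n_1 n_2$ matrix
$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1 n_1} B \\ \vdots & \ddots & \vdots \\ a_{m_1 1} B & \cdots & a_{m_1 n_1} B \end{bmatrix}. \tag{1}$$
As shown in Van Loan (2000), any matrix $W \in \mathbb{R}^{m_1 m_2 \times n_1 n_2}$ can be decomposed into a sum of Kronecker products as
$$W = \sum_{r=1}^{R} A_r \otimes B_r, \tag{2}$$
where
$$R = \operatorname{rank}\big(\mathcal{R}(W)\big) \le \min(m_1 n_1,\; m_2 n_2) \tag{3}$$
is the rank of a reshaped version $\mathcal{R}(W) \in \mathbb{R}^{m_1 n_1 \times m_2 n_2}$ of matrix $W$. We call this the Kronecker rank of $W$. Note that the Kronecker rank is not unique, and is dependent on the dimensions of factors $A$ and $B$.
To compress a given $W$, we can represent it using a small number $\hat{R} \le R$ of Kronecker products that best approximate the original matrix. The factors are found by solving the Nearest Kronecker Product problem
$$\min_{\{A_r\}, \{B_r\}} \Big\| W - \sum_{r=1}^{\hat{R}} A_r \otimes B_r \Big\|_F^2. \tag{4}$$
As this approximation replaces a large matrix with a sequence of two smaller ones, memory consumption is reduced by a factor of
$$\frac{m_1 m_2 n_1 n_2}{\hat{R}\,(m_1 n_1 + m_2 n_2)}. \tag{5}$$
Furthermore, if a matrix $W$ is decomposed into its Kronecker factors then the projection $Wx$ can be performed without explicit reconstruction of $W$. Instead, the factors can be used directly to perform the computation as a result of the following equivalency relationship:
$$y = (A \otimes B)\,x \iff Y = B X A^{\top}, \tag{6}$$
where $x = \operatorname{vec}(X)$, $y = \operatorname{vec}(Y)$, and $\operatorname{vec}(\cdot)$ vectorizes matrices $X \in \mathbb{R}^{n_2 \times n_1}$ and $Y \in \mathbb{R}^{m_2 \times m_1}$ by stacking their columns.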
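The two facts used in this section can be checked numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation: it verifies that applying a Kronecker product to a column-stacked vector matches multiplying by the two factors directly, and it solves the Nearest Kronecker Product problem with the rearrangement-plus-SVD approach attributed to Van Loan and Pitsianis. The function name `nearest_kronecker_product`, the factor shapes, and the symbols `A0`, `B0`, `X`, `W` are assumptions chosen for the example.

```python
import numpy as np


def nearest_kronecker_product(W, shape_a, shape_b, rank):
    """Best sum of `rank` Kronecker products approximating W (Frobenius norm).

    Each (m2 x n2) block of W becomes one row of a rearranged matrix; a
    truncated SVD of that matrix yields the Kronecker factors.
    """
    (m1, n1), (m2, n2) = shape_a, shape_b
    assert W.shape == (m1 * m2, n1 * n2)
    # Row (i1, j1) holds the block W[i1*m2:(i1+1)*m2, j1*n2:(j1+1)*n2], flattened.
    rearranged = (
        W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    )
    U, s, Vt = np.linalg.svd(rearranged, full_matrices=False)
    # Split each singular value evenly between the two factors.
    A = [np.sqrt(s[r]) * U[:, r].reshape(m1, n1) for r in range(rank)]
    B = [np.sqrt(s[r]) * Vt[r].reshape(m2, n2) for r in range(rank)]
    return A, B, s


rng = np.random.default_rng(0)

# Projection without reconstruction: (A kron B) vec(X) equals vec(B X A^T),
# where vec stacks columns (order="F").
A0 = rng.standard_normal((3, 4))
B0 = rng.standard_normal((2, 5))
X = rng.standard_normal((5, 4))
y_explicit = np.kron(A0, B0) @ X.flatten(order="F")
y_factored = (B0 @ X @ A0.T).flatten(order="F")
vec_identity_holds = np.allclose(y_explicit, y_factored)

# Nearest Kronecker Product approximation of a random 6 x 20 matrix.
W = rng.standard_normal((6, 20))
rank = 2
A, B, s = nearest_kronecker_product(W, (3, 4), (2, 5), rank)
W_hat = sum(np.kron(a, b) for a, b in zip(A, B))
error = np.linalg.norm(W - W_hat)
# The error equals the energy in the discarded singular values.
tail = np.sqrt(np.sum(s[rank:] ** 2))
compression = W.size / (rank * (A[0].size + B[0].size))
```

Because the rearrangement only permutes entries, the Frobenius error of the Kronecker approximation coincides with the low-rank approximation error of the rearranged matrix, which is why `error` and `tail` agree.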
Method
In this section, we extend KPD to tensors yielding GKPD. First, we define the multidimensional Kronecker product, then we introduce the Multidimensional Nearest Kronecker Product problem and its solution. Finally, we describe our KroneckerConvolution module that uses GKPD to compress convolution tensors and avoids reconstruction at runtime.
Generalized Kronecker Product Decomposition
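We now turn to generalizing the Kronecker product to operate on tensors. Let $\mathcal{A} \in \mathbb{R}^{a_1 \times a_2 \times \cdots \times a_N}$ and $\mathcal{B} \in \mathbb{R}^{b_1 \times b_2 \times \cdots \times b_N}$ be two given tensors. Intuitively, tensor $\mathcal{A} \otimes \mathcal{B} \in \mathbb{R}^{a_1 b_1 \times a_2 b_2 \times \cdots \times a_N b_N}$ is constructed by moving around tensor $\mathcal{B}$ in a non-overlapping fashion, and at each position scaling it by a corresponding element of $\mathcal{A}$, as shown in Figure 2. Formally, the Multidimensional Kronecker product is defined as follows
$$(\mathcal{A} \otimes \mathcal{B})_{i_1 i_2 \cdots i_N} \triangleq \mathcal{A}_{j_1 j_2 \cdots j_N}\, \mathcal{B}_{k_1 k_2 \cdots k_N}, \tag{7}$$
where, for zero-based indices,
$$j_n = \left\lfloor \frac{i_n}{b_n} \right\rfloor, \qquad k_n = i_n \bmod b_n \tag{8}$$
represent the integer quotient and the remainder term of $i_n$ with respect to divisor $b_n$, respectively.
As with matrices, any multidimensional tensor $\mathcal{W} \in \mathbb{R}^{a_1 b_1 \times \cdots \times a_N b_N}$ can be decomposed into a sum of Kronecker products
$$\mathcal{W} = \sum_{r=1}^{R} \mathcal{A}_r \otimes \mathcal{B}_r, \tag{9}$$
where
$$R \le \min(a_1 a_2 \cdots a_N,\; b_1 b_2 \cdots b_N) \tag{10}$$
denotes the Kronecker rank of tensor $\mathcal{W}$. Thus, we can approximate $\mathcal{W}$ using GKPD by solving the Multidimensional Nearest Kronecker Product problem
$$\min_{\{\mathcal{A}_r\}, \{\mathcal{B}_r\}} \Big\| \mathcal{W} - \sum_{r=1}^{\hat{R}} \mathcal{A}_r \otimes \mathcal{B}_r \Big\|_F^2, \tag{11}$$
where $\hat{R} \le R$. For the case of matrices (2D tensors), Van Loan and Pitsianis (1992) solved this problem using SVD. We now extend their approach to the multidimensional setting. Our strategy will be to define rearrangement operators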
and solve
(12) 
instead. By carefully defining the rearrangement operators, the sum of squares in (12) is kept identical to that in (11). The former corresponds to finding the best low-rank approximation, which has a well-known solution using SVD. We define the rearrangement operators as follows:
where
extracts nonoverlapping patches of shape from tensor , flattens its input into a vector, tensor has the same number of dimensions as with each dimension equal to unity and is a vector describing the shape of tensor . While the ordering of patch extraction and flattening is not important, it must remain consistent across the rearrangement operators.
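To make the rearrangement idea concrete, the following NumPy sketch implements one possible pair of rearrangement operators and solves the multidimensional problem with an SVD. It is an illustration under stated assumptions rather than the paper's exact operators: the names `rearrange`, `gkpd` and `reconstruct` are hypothetical, patches are flattened in row-major order, and each singular value is split evenly between the two factors. It relies on `np.kron`, which NumPy defines for N-dimensional arrays with the same quotient/remainder indexing as the multidimensional Kronecker product.

```python
import numpy as np


def rearrange(W, shape_a, shape_b):
    """Map W to a matrix whose rows index A-entries and columns index B-entries."""
    n = W.ndim
    # Split every axis of size a_n * b_n into (a_n, b_n): i_n = j_n * b_n + k_n.
    split = W.reshape([d for pair in zip(shape_a, shape_b) for d in pair])
    # Bring all quotient axes (j_1..j_N) first, then all remainder axes (k_1..k_N).
    perm = list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2))
    return split.transpose(perm).reshape(
        int(np.prod(shape_a)), int(np.prod(shape_b))
    )


def gkpd(W, shape_a, shape_b, rank):
    """Factors of the best sum of `rank` Kronecker products approximating W."""
    assert W.shape == tuple(a * b for a, b in zip(shape_a, shape_b))
    U, s, Vt = np.linalg.svd(rearrange(W, shape_a, shape_b), full_matrices=False)
    A = np.stack([np.sqrt(s[r]) * U[:, r].reshape(shape_a) for r in range(rank)])
    B = np.stack([np.sqrt(s[r]) * Vt[r].reshape(shape_b) for r in range(rank)])
    return A, B, s


def reconstruct(A, B):
    """Sum of Kronecker products of the stacked factors."""
    return sum(np.kron(a, b) for a, b in zip(A, B))


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6, 3, 3))      # e.g. a small convolution tensor
shape_a, shape_b = (4, 3, 3, 1), (2, 2, 1, 3)

# Full Kronecker rank (here min(36, 12) = 12) reproduces W exactly.
A_full, B_full, s = gkpd(W, shape_a, shape_b, rank=12)
exact = np.allclose(reconstruct(A_full, B_full), W)

# A truncated decomposition leaves exactly the discarded singular-value energy.
A4, B4, _ = gkpd(W, shape_a, shape_b, rank=4)
err4 = np.linalg.norm(W - reconstruct(A4, B4))
tail4 = np.sqrt(np.sum(s[4:] ** 2))

# Element-wise check of the quotient/remainder definition on one index.
i = (5, 4, 2, 1)
j = tuple(i_n // b_n for i_n, b_n in zip(i, shape_b))
k = tuple(i_n % b_n for i_n, b_n in zip(i, shape_b))
definition_holds = np.isclose(np.kron(A4[0], B4[0])[i], A4[0][j] * B4[0][k])
```

Any consistent flattening order would work equally well, as long as the same order is used when reshaping the singular vectors back into factor tensors.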
KroneckerConvolution Layer
The convolution operation in CNNs between a weight tensor and an input is a multilinear map that can be described in scalar form as
(13) 
Assuming can be decomposed to KPD factors and , we can rewrite (13) as
(14) 
Due to the structure of tensor , we do not need to explicitly reconstruct it to carry out the summation in (14). Instead, we can carry out the summation by directly using elements of tensors and as shown in Lemma 1. This key insight leads to a large reduction in both memory and FLOPs. Effectively, this allows us to replace a large convolutional layer (with a large weight tensor) with two smaller ones, as we demonstrate in the rest of this section.
Lemma 1.
Suppose tensor can be decomposed into KPD factors such that . Then, the multilinear map involving can be written directly in terms of its factors and as follows
where, is an input tensor, is a reindexing function; and are as defined in (8). The equality also holds for any valid offsets to the input’s indices,
where .
Proof.
See Appendix. ∎
Applying Lemma 1 to the summation in (14) yields
where indices enumerate over elements in tensor and enumerate over elements in tensor . Finally, we can separate the convolution operation into two steps by exchanging the order of summation as follows:
(15) 
The inner summation in (15) corresponds to a 3D convolution and the outer summation corresponds to multiple 2D convolutions, as visualized in Fig. 3 for the special case of .
Overall, (15) can be carried out efficiently in tensor form using Algorithm 1. Effectively, the input is collapsed in two stages instead of one as in the multidimensional convolution operation. Convolving a multichannel input with a single filter in yields a scalar value at a particular output location. This is done by first scaling all elements in the corresponding multidimensional patch, then collapsing it by means of summation. Since tensor is comprised of multidimensional patches scaled by elements in , we can equivalently collapse each subpatch in the input using tensor followed by a subsequent collapsing using tensor to obtain the same scalar value.
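The two-stage evaluation described above can be checked with a small NumPy sketch. It is not a reproduction of Algorithm 1; it is one way to realize the same idea under explicit assumptions: weights are laid out as (filters, channels, height, width), the convolution is a stride-1 cross-correlation without padding, and the names `conv2d` and `kronecker_conv2d` are hypothetical. Stage one convolves each group of input channels with the smaller factor `B`; stage two combines those responses with the factor `A`, sampling them at offsets that are multiples of the spatial size of `B`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(W, X):
    """Valid, stride-1 cross-correlation. W: (F, C, Kh, Kw), X: (C, H, Wd)."""
    patches = sliding_window_view(X, W.shape[2:], axis=(1, 2))  # (C, Ho, Wo, Kh, Kw)
    return np.einsum("fcij,cxyij->fxy", W, patches)


def kronecker_conv2d(A, B, X):
    """Convolve X with kron(A, B) without forming the full weight tensor."""
    F1, C1, K1h, K1w = A.shape
    F2, C2, K2h, K2w = B.shape
    C, H, Wd = X.shape
    assert C == C1 * C2
    Ho, Wo = H - K1h * K2h + 1, Wd - K1w * K2w + 1

    # Stage 1: channel c = c1 * C2 + c2, so group the channels and convolve
    # every group c1 with all F2 filters of B.
    groups = X.reshape(C1, C2, H, Wd)
    patches = sliding_window_view(groups, (K2h, K2w), axis=(2, 3))
    Z = np.einsum("gckl,ecuvkl->geuv", B, patches)  # (F2, C1, U, V)

    # Stage 2: combine stage-1 responses with A; spatial tap (i1, j1) of A
    # reads Z at an offset of (i1 * K2h, j1 * K2w).
    Y = np.zeros((F1, F2, Ho, Wo))
    for i1 in range(K1h):
        for j1 in range(K1w):
            shifted = Z[:, :, i1 * K2h : i1 * K2h + Ho, j1 * K2w : j1 * K2w + Wo]
            Y += np.einsum("fe,gexy->fgxy", A[:, :, i1, j1], shifted)
    # Output filter index is f = f1 * F2 + f2.
    return Y.reshape(F1 * F2, Ho, Wo)


rng = np.random.default_rng(0)
rank = 2
A = rng.standard_normal((rank, 4, 2, 3, 1))   # factors with 3x1 spatial extent
B = rng.standard_normal((rank, 2, 3, 1, 3))   # factors with 1x3 spatial extent
X = rng.standard_normal((6, 10, 10))

W = sum(np.kron(a, b) for a, b in zip(A, B))  # full (8, 6, 3, 3) weight tensor
Y_direct = conv2d(W, X)
Y_factored = sum(kronecker_conv2d(a, b, X) for a, b in zip(A, B))
outputs_match = np.allclose(Y_direct, Y_factored)
```

The explicit loop over the spatial taps of `A` keeps the sketch readable; in a deep learning framework the second stage would more naturally be expressed as a dilated convolution.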
Complexity of KroneckerConvolution
The GKPD of a convolution layer is not unique. Different configurations of Kronecker factors will lead to different reductions in memory and number of operations. Namely, for a KroneckerConvolution layer using Kronecker products with factors and the memory reduction is
(16) 
whereas the reduction in FLOPs is
(17) 
For the special case of using separable filters, and the reduction in FLOPs becomes
(18) 
implying that and should be sufficiently large in order to obtain a reduction in FLOPs. In contrast, memory reduction is unconditional in the KroneckerConvolution layer.
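As a rough illustration of how the choice of factors affects cost, the sketch below counts parameters and multiply-accumulate operations per output location for a layer evaluated in two stages. The counting model is an assumption made for this example (boundary effects are ignored and both stages are assumed to produce the same number of spatial positions), and `kronecker_conv_savings` is a hypothetical helper, so the numbers should be read as estimates rather than as the paper's exact expressions.

```python
def kronecker_conv_savings(shape_a, shape_b, rank):
    """Estimated (memory reduction, FLOPs reduction) of a Kronecker conv layer.

    shape_a = (F1, C1, K1h, K1w) and shape_b = (F2, C2, K2h, K2w) are the factor
    shapes; the full layer has F1*F2 filters, C1*C2 channels and a
    (K1h*K2h) x (K1w*K2w) kernel.
    """
    f1, c1, h1, w1 = shape_a
    f2, c2, h2, w2 = shape_b

    full_params = (f1 * f2) * (c1 * c2) * (h1 * h2) * (w1 * w2)
    kron_params = rank * (f1 * c1 * h1 * w1 + f2 * c2 * h2 * w2)

    # Multiply-accumulates per output location (assumed counting model).
    full_macs = full_params
    stage1 = f2 * (c1 * c2) * h2 * w2    # every input channel meets every B filter
    stage2 = (f1 * f2) * c1 * h1 * w1    # every output filter sums over A
    kron_macs = rank * (stage1 + stage2)

    return full_params / kron_params, full_macs / kron_macs


# A 64-filter, 64-channel, 3x3 layer split into separable 3x1 and 1x3 factors.
wide_memory, wide_flops = kronecker_conv_savings((8, 8, 3, 1), (8, 8, 1, 3), rank=4)

# The same split on a tiny 4-filter, 4-channel layer.
tiny_memory, tiny_flops = kronecker_conv_savings((2, 2, 3, 1), (2, 2, 1, 3), rank=4)
```

Under this counting model the wide layer is 24 times smaller and needs 3 times fewer operations, while the tiny layer is 1.5 times smaller but needs more operations than the original (a ratio of 0.75), which matches the observation that the factor dimensions must be large enough before FLOPs are reduced.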
Universal Approximation via Kronecker Products
Universal approximation results for shallow networks have been around for a long time (Hornik, 1991; Ripley, 1996, pp. 173–180), whilst such studies for deep networks are more recent (Zhou, 2020). In this section, we build on these foundations to show that neural networks whose weight tensors are represented using low Kronecker rank summations of Kronecker products remain universal approximators. For brevity, we refer to such networks as “Kronecker networks”.
First, we show that a shallow Kronecker network is a universal approximator. For simplicity, this is shown only for one output. Then, we can generalize the resulting approximator via treating each output dimension separately.
Consider a single layer neural network constructed using hidden units and an
Lipschitz activation function
that is defined on a compacta in . As shown in Hornik (1991), such a network serves as a universal approximator, i.e., for a given positive number there exists an such that
(19) 
Similarly, a shallow Kronecker network consisting of hidden units
(20) 
is comprised of a weight matrix made of a summation of Kronecker products between factors and . From (20) we can see that any shallow neural network with hidden units can be represented exactly using a Kronecker network with a full Kronecker rank . Thus, shallow Kronecker networks with full Kronecker rank also serve as universal approximators. In Theorem 1 we show that a similar result holds for shallow Kronecker networks , with low Kronecker ranks , provided that the
smallest singular values of the reshaped matrix
of the approximating neural network are small enough.
Theorem 1.
Any shallow Kronecker network with a low Kronecker rank and hidden units defined on a compacta with Lipschitz activation is dense in the class of continuous functions , for a large enough given
where is the singular value of the reshaped version of the weight matrix , in an approximating neural network with hidden units satisfying , for .
Proof.
See Appendix. ∎
In Theorem 2, we extend the preceding result to deep convolutional neural networks, where each convolution tensor is represented using a summation of Kronecker products between factor tensors.
Theorem 2.
Any deep Kronecker convolution network with Kronecker rank in layer on compacta with Lipschitz activation, is dense in the class of continuous functions for a large enough number of layers , given
where is the singular value of the matrix of the reshaped weight tensor in the layer of an approximating convolutional neural network.
Proof.
See Appendix. ∎
The result is achieved by extending the recent universal approximation bound of Zhou (2020) to GKPD networks. One can derive the convergence rates using (Zhou, 2020, Theorem 2) as well. These results assure that the performance degradation of Kronecker networks is small, in comparison to uncompressed networks, for an appropriate choice of Kronecker rank.
Configuration Setting
As GKPD provides us with a set of possible decompositions for each layer in a network, a selection strategy is needed. For a given compression rate, there is a tradeoff between using a larger number of terms in the GKPD summation (11) together with a more compressed configuration, and using a smaller number of terms together with a less compressed configuration. To guide our search, we select the decomposition that best approximates the original uncompressed tensor obtained from a pretrained network. This means different layers in a network will be approximated by a different number of Kronecker products. Before searching for the best decomposition, we limit our search space to configurations that satisfy a desired reduction in FLOPs. Unless otherwise stated, all GKPD experiments use this approach.
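A selection strategy of this kind can be sketched as a small search over factor shapes. The code below is a simplified stand-in rather than the procedure used in the experiments: it filters configurations by a parameter-compression target instead of a FLOPs target, uses a random tensor in place of pretrained weights, and the names `rearrange`, `divisors` and `select_configuration` are hypothetical. For every way of splitting each axis into two factors, it takes the largest number of Kronecker terms that still meets the target and scores the configuration by its reconstruction error, which can be read off the discarded singular values.

```python
import itertools

import numpy as np


def rearrange(W, shape_a, shape_b):
    """Matrix whose rows index entries of A and whose columns index entries of B."""
    n = W.ndim
    split = W.reshape([d for pair in zip(shape_a, shape_b) for d in pair])
    perm = list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2))
    return split.transpose(perm).reshape(
        int(np.prod(shape_a)), int(np.prod(shape_b))
    )


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def select_configuration(W, min_compression):
    """Return (error, shape_a, shape_b, rank) with the lowest reconstruction error."""
    best = None
    for shape_a in itertools.product(*(divisors(d) for d in W.shape)):
        shape_b = tuple(w // a for w, a in zip(W.shape, shape_a))
        size_a, size_b = int(np.prod(shape_a)), int(np.prod(shape_b))
        # Largest number of terms that still satisfies the compression target.
        rank = int(W.size // (min_compression * (size_a + size_b)))
        rank = min(rank, size_a, size_b)
        if rank < 1:
            continue
        s = np.linalg.svd(rearrange(W, shape_a, shape_b), compute_uv=False)
        error = float(np.sqrt(np.sum(s[rank:] ** 2)))
        if best is None or error < best[0]:
            best = (error, shape_a, shape_b, rank)
    return best


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8, 3, 3))  # stand-in for a pretrained conv tensor
error, shape_a, shape_b, rank = select_configuration(W, min_compression=4)
achieved = W.size / (rank * (int(np.prod(shape_a)) + int(np.prod(shape_b))))
```

Swapping the parameter-based filter for an estimate of the FLOPs reduction would bring the sketch closer to the strategy described in the text.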
Experiments
To validate our method, we provide model compression experimental results for image classification tasks using a variety of popular CNN architectures such as ResNet (He et al., 2016), and SEResNet, which benefits from squeeze-and-excitation blocks (Hu et al., 2018). We also choose to apply our compression method to MobileNetV2 (Sandler et al., 2018), a model that is optimized for efficient inference on embedded vision applications through depthwise separable convolutions and inverted residual blocks. We provide our implementation details in the appendix.
Table 1 shows the top-1 accuracy on the CIFAR-10 (Krizhevsky, 2009) dataset using compressed ResNet18 and SEResNet50. For each architecture, the compressed models obtained using the proposed GKPD are named with the “Kronecker” prefix added to the original model’s name. The configuration of each compressed model is selected such that the number of parameters is similar to MobileNetV2. We observe that for ResNet18 and SEResNet50, the number of parameters and FLOPs can be substantially reduced at the expense of a small decrease in accuracy. Specifically, KroneckerResNet18 achieves a compression of 5× and a 4.7× reduction in FLOPs with only a 0.12% drop in accuracy. KroneckerSEResNet50 obtains a compression rate of 9.4× and a 9.7× reduction in FLOPs with only a 0.5% drop in accuracy.
Moreover, we see that applying the proposed GKPD method to higher-capacity architectures such as ResNet18 and SEResNet50 can lead to higher accuracy than a hand-crafted efficient network such as MobileNetV2. Specifically, with the same number of parameters as that of MobileNetV2, we achieve a compressed ResNet18 (KroneckerResNet18) and a compressed SEResNet50 (KroneckerSEResNet50) with 0.70% and 0.27% higher accuracy than MobileNetV2.
Table 2 shows the performance of GKPD when used to achieve extreme compression rates. The same baseline architectures are compressed using different configurations. We also use GKPD to compress the already efficient MobileNetV2. When targeting very small models (e.g., 0.29M parameters), compressing MobileNetV2 with a compression factor of 7.9× outperforms extreme compression of SEResNet50 with a compression factor of 73×.
In the following subsections, we present comparative assessments using different model compression methods.
Table 1: Top-1 accuracy on CIFAR-10 for the baseline models and their GKPD-compressed counterparts.
Model                    Params (M)   FLOPs (M)   Accuracy (%)
MobileNetV2 (Baseline)   2.30         96          94.18
ResNet18 (Baseline)      11.17        557         95.05
KroneckerResNet18        2.2          117         94.97
SEResNet50 (Baseline)    21.40        1163        95.15
KroneckerSEResNet50      2.30         120         94.45
Table 2: Top-1 accuracy of GKPD-compressed models at extreme compression rates.
Model                    Params (M)   Compression (×)   Accuracy (%)
KroneckerResNet18        0.48         23.1              92.62
KroneckerSEResNet50      0.93         22.96             93.66
KroneckerSEResNet50      0.29         73.30             91.85
KroneckerMobileNetV2     0.73         3.14              93.80
KroneckerMobileNetV2     0.29         7.79              93.01
KroneckerMobileNetV2     0.18         12.43             91.48
Comparison with Decomposition-based Methods
In this section, we compare GKPD to other tensor decomposition compression methods. We use a classification model pretrained on CIFAR-10 and apply model compression methods based on Tucker (Kim et al., 2016), Tensor-Train (Garipov et al., 2016), and Tensor-Ring (Wang et al., 2018), along with our proposed GKPD method. We choose the ResNet32 architecture in this set of experiments since it has been reported to be effectively compressed using Tensor-Ring in (Wang et al., 2018).
The model compression results obtained using different decomposition methods aiming for a 5× compression rate are shown in Table 3. As this table suggests, GKPD outperforms all other decomposition methods for a similar compression factor. We attribute the performance of GKPD to its higher representation power. This is reflected in its ability to better reconstruct weight tensors in a pretrained network in comparison to other decomposition methods. Refer to the Appendix for a comparative assessment of reconstruction errors for different layers of the ResNet architecture.
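Table 3: Top-1 accuracy of ResNet32 on CIFAR-10 compressed with different decomposition methods, targeting a 5× compression rate.
Model                    Params (M)   Compression (×)   Accuracy (%)
ResNet32                 0.46         1                 92.55
TuckerResNet32           0.09         5                 87.7
TensorTrainResNet32      0.096        4.8               88.3
TensorRingResNet32       0.09         5                 90.6
KroneckerResNet32        0.09         5                 91.52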
Table 4: Top-1 accuracy of ResNet26 compressed with KD-based methods and with GKPD at two compression rates.
Model                    Params (M)   Compression (×)   Accuracy (%)
ResNet26                 0.37         1                 92.94
Mirzadeh et al. (2020)   0.17         2.13              91.23
Heo et al. (2019)        0.17         2.13              90.34
KroneckerResNet26        0.14         2.69              93.16
Mirzadeh et al. (2020)   0.075        4.88              88.0
Heo et al. (2019)        0.075        4.88              87.32
KroneckerResNet26        0.069        5.29              91.28
Table 5: Top-1 accuracy of ResNet50 on ImageNet and of its compressed variants.
Model                    Params (M)   Compression (×)   Accuracy (%)
ResNet50                 25.6         1                 75.99
FSNet                    13.9         2.0               73.11
ThiNet                   12.38        2.0               71.01
KroneckerResNet50        12.0         2.13              73.95
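Table 6: KroneckerResNet18 variants with a similar number of parameters, used to study the effect of the Kronecker rank.
Model                    Params (M)   FLOPs (G)   Accuracy (%)
ResNet18                 11.17        0.58        95.05
KroneckerResNet18        1.41         0.17        92.96
KroneckerResNet18        1.42         0.16        93.74
KroneckerResNet18        1.44         0.26        94.30
KroneckerResNet18        1.51         0.32        94.58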
Comparison with other Compression Methods
We compare our proposed model compression method with two state-of-the-art KD-based compression methods: (Mirzadeh et al., 2020) and (Heo et al., 2019). These methods are known to be very effective on relatively small networks such as ResNet26. Thus, we apply our compression method to the ResNet26 architecture in these experiments. Table 4 presents the top-1 accuracy obtained for different compressed models with two different compression rates. As this table suggests, the proposed method results in improvements of greater than 2% and 3.7% in top-1 accuracy when targeting compression rates of 2× and 5×, respectively, compared to the KD-based model compression methods.
Model Compression with Random Initialization
To study the effect of replacing weight tensors in neural networks with a summation of Kronecker products, we conduct experiments using randomly initialized Kronecker factors as opposed to performing GKPD on a pretrained network. By replacing all weight tensors in a predefined network architecture with a randomly initialized summation of Kronecker products, we obtain a compressed model. To this end, we run assessments on a higher-capacity architecture, i.e., ResNet50, on a larger-scale dataset, i.e., ImageNet (Krizhevsky et al., 2012). Table 5 lists the top-1 accuracy for the ResNet50 baseline and its compressed variant. We achieve a compression rate of 2.13× with a 2.6% accuracy drop compared to the baseline model.
We also perform model compression using two state-of-the-art model compression methods: ThiNet (Luo et al., 2017) and FSNet (Yang et al., 2020). ThiNet and FSNet are based on pruning and filter sharing techniques, respectively. Both reportedly achieve good accuracy on large datasets. Table 5 also lists the top-1 accuracy for ResNet50 compressed using these two methods. As the table shows, our proposed method outperforms the other two techniques for a 2× compression rate. Note that the performance obtained using our method is based on a random initialization, while the compression achieved with ThiNet benefits from a pretrained model. These results indicate that the proposed GKPD can lead to high performance even if a pretrained model is not available.
Experimental Analysis of Kronecker Rank
Using a higher Kronecker rank can increase the representation power of a network. This is reflected by the ability of GKPD to better reconstruct weight tensors using a larger number of Kronecker products in (11). In Table 6 we study the effect of using a larger Kronecker rank in Kronecker networks while keeping the overall number of parameters constant. We find that using a larger Kronecker rank does indeed improve performance.
Conclusion
In this paper, we propose GKPD, a generalization of Kronecker Product Decomposition to multidimensional tensors for compression of deep CNNs. In the proposed GKPD, we extend the Nearest Kronecker Product problem to the multidimensional setting and use it for optimal initialization from a baseline model. We show that for a fixed number of parameters, using a summation of Kronecker products can significantly increase the representation power in comparison to a single Kronecker product. We use our approach to compress a variety of CNN architectures and show the superiority of GKPD to some state-of-the-art compression methods. GKPD can be combined with other compression methods like quantization and knowledge distillation to further improve the compression-accuracy tradeoff. Designing new architectures that can benefit most from Kronecker product representation is an area for future work.
References

Alvarez and Salzmann (2016). Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pp. 2270–2278.
Alvarez and Salzmann (2017). Compression-aware training of deep networks. Advances in Neural Information Processing Systems 30, pp. 856–867.
Boulch (2018). Reducing parameter number in residual networks by sharing weights. Pattern Recognition Letters 103, pp. 53–59.
Courbariaux et al. (2015). BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
Courbariaux et al. (2016). Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
Denil et al. (2013). Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156.
Denton et al. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
Feichtenhofer et al. (2019). SlowFast networks for video recognition. In IEEE/CVF International Conference on Computer Vision, pp. 6202–6211.
Garipov et al. (2016). Ultimate tensorization: compressing convolutional and FC layers alike. arXiv preprint arXiv:1611.03214.
Han et al. (2015a). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
Han et al. (2015b). Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626.
He et al. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Heo et al. (2019). Knowledge distillation with adversarial samples supporting decision boundary. In AAAI Conference on Artificial Intelligence, pp. 3771–3778.
Hornik (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257.
Hu et al. (2018). Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
Hubara et al. (2017). Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898.
Iandola et al. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
Jacob et al. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
Kim et al. (2016). Compression of deep convolutional neural networks for fast and low power mobile applications. In International Conference on Learning Representations.
Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1106–1114.
Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report.
Lebedev et al. (2015). Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations.
Li et al. (2016). Ternary weight networks. arXiv preprint arXiv:1605.04711.
Li et al. (2019). OICSR: out-in-channel sparsity regularization for compact deep neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7046–7055.
Liu et al. (2015). Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814.
Van Loan (2000). The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics 123 (1), pp. 85–100.
Luo et al. (2017). ThiNet: a filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, pp. 5068–5076.
Mirzadeh et al. (2020). Improved knowledge distillation via teacher assistant. In AAAI Conference on Artificial Intelligence, pp. 5191–5198.
Oseledets (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317.
Rastegari et al. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
Ren et al. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28, pp. 91–99.
Ripley (1996). Pattern recognition and neural networks. Cambridge University Press.
Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
Tang et al. (2017). How to train a compact binary neural network with high accuracy?. In AAAI Conference on Artificial Intelligence.
Thakker et al. (2019). Compressing RNNs for IoT devices by 15-38x using Kronecker products. arXiv preprint arXiv:1906.02876.
Tucker (1963). Implications of factor analysis of three-way matrices for measurement of change. Problems in measuring change 15 (122-137), pp. 3.
Van Loan and Pitsianis (1992). Approximation with Kronecker products. In Linear Algebra for Large Scale and Real Time Applications, pp. 293–314.
Wang et al. (2017). Efficient low rank tensor ring completion. In IEEE International Conference on Computer Vision, pp. 5697–5705.
Wang et al. (2018). Wide compression: tensor ring nets. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9329–9338.
Wen et al. (2016). Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems 29, pp. 2074–2082.
Yang et al. (2020). FSNet: compression of deep convolutional neural networks by filter summary. In International Conference on Learning Representations.
Zhang et al. (2018a). Learning to share: simultaneous parameter tying and sparsification in deep learning. In International Conference on Learning Representations.
Zhang et al. (2018b). LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In European Conference on Computer Vision, pp. 365–382.
Zhou (2020). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis 48 (2), pp. 787–794.
Zhou et al. (2016). Less is more: towards compact CNNs. In European Conference on Computer Vision, pp. 662–677.
Zhou et al. (2015). Exploiting local structures with the Kronecker layer in convolutional networks. arXiv preprint arXiv:1512.09194.
Zhu et al. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064.
Appendix
Implementation Details
All experiments on CIFAR-10 are run for 200 epochs using Stochastic Gradient Descent (SGD). We use a batch size of 128, weight decay of 0.0001, momentum of 0.1 and an initial learning rate of 0.1 that is subsequently reduced by a factor of 10 at epochs 100 and 150. Similarly, experiments on ImageNet are run for 100 epochs using SGD with a batch size of 256, weight decay of 0.0001, momentum of 0.1 and an initial learning rate of 0.1 that is subsequently reduced by a factor of 10 at epochs 30, 60 and 90. Eight NVIDIA Tesla V100 SXM2 32 GB GPUs were used to run all of our experiments.
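The step schedules described above can be written as a small helper. This is a framework-independent sketch: the function name `step_lr` is hypothetical, and it assumes that epochs are counted from zero and that a reduction takes effect at the start of the listed epoch.

```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), gamma=0.1):
    """Learning rate at `epoch` for a schedule that scales by `gamma` at each milestone."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma**drops


# CIFAR-10 schedule: 200 epochs, reductions at epochs 100 and 150.
cifar_schedule = [step_lr(e) for e in range(200)]

# ImageNet schedule: 100 epochs, reductions at epochs 30, 60 and 90.
imagenet_schedule = [step_lr(e, milestones=(30, 60, 90)) for e in range(100)]
```

In a training loop, the value for the current epoch would be assigned to the optimizer's learning rate before the first batch of that epoch.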
Theorem Proofs
See Lemma 1.
Proof.
By definition the terms in tensor are given by
(21) 
where
See Theorem 1.
Proof.
We drop the bias term for simplicity of notation. We need to bound
(24)  
(26)  
The full rank version is dense in according to Hornik (1991), therefore (24) is bounded by . It is only required to show that (26) is also bounded
where is the reshaping operation in (12). Thus, the second term (26) is bounded by if
Note that the last term (26) is consequently bounded by the Cauchy-Schwarz inequality. ∎
See Theorem 2.
Proof.
The proof follows a similar proof sketch as in Theorem 1. Define the convolution layer as
where , is the total number of layers and of size is a Toeplitz type matrix that transforms a convolution to a matrix multiplication operation. We note that is collection of such ’s in a layer.
For a CNN of depth , the hypothesis space is the set of all functions defined by
parametrized by
According to (Zhou, 2020, Theorem 1), is dense in in , so it is also dense in . In other words, for a given there exists an approximating convolutional neural network , such that
(27) 
Building off of this result, it is sufficient to bound a Kronecker convolutional neural network with a low Kronecker rank in its hidden layer as follows:
expanding on inner layers gives
and therefore the low-rank network is bounded by given
∎
Reconstruction Error of ResNet18
We further study GKPD by plotting the reconstruction errors achieved when compressing a ResNet18 model that is pretrained on ImageNet. We observe in Figure 4 that GKPD generally achieves a lower reconstruction error in comparison with Tucker decomposition.