
Convolutional Neural Network Compression through Generalized Kronecker Product Decomposition

Modern Convolutional Neural Network (CNN) architectures, despite their superiority in solving various problems, are generally too large to be deployed on resource-constrained edge devices. In this paper, we reduce the memory usage and floating-point operations required by convolutional layers in CNNs. We compress these layers by generalizing the Kronecker Product Decomposition to apply to multidimensional tensors, leading to the Generalized Kronecker Product Decomposition (GKPD). Our approach yields a plug-and-play module that can be used as a drop-in replacement for any convolutional layer. Experimental results for image classification on the CIFAR-10 and ImageNet datasets using ResNet, MobileNetV2 and SENet architectures substantiate the effectiveness of our proposed approach. We find that GKPD outperforms state-of-the-art decomposition methods, including Tensor-Train and Tensor-Ring, as well as other relevant compression methods such as pruning and knowledge distillation.


Introduction

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on a wide range of computer vision tasks such as image classification (He et al., 2016), video recognition (Feichtenhofer et al., 2019) and object detection (Ren et al., 2015). Despite achieving remarkably low generalization errors, modern CNN architectures are typically over-parameterized and consist of millions of parameters. As the size of state-of-the-art CNN architectures continues to grow, it becomes more challenging to deploy these models on resource-constrained edge devices that are limited in both memory and energy. Motivated by studies demonstrating that there is significant redundancy in CNN parameters (Denil et al., 2013), model compression techniques such as pruning, quantization, tensor decomposition, and knowledge distillation have emerged to address this problem.

Figure 1: The same compression rate is achieved for an arbitrary tensor from the first layer of ResNet18 using (a) SVD (Tucker) and the proposed GKPD in (b) GKPD-1 and (c) GKPD-8. A larger summation, GKPD-8, achieves a lower reconstruction error in comparison with both a smaller summation, GKPD-1, and the SVD (Tucker) decomposition.

Decomposition methods have gained more attention in recent years as they can achieve higher compression rates in comparison to other approaches. Namely, Tucker (Kim et al., 2016), CP (Lebedev et al., 2015), Tensor-Train (Garipov et al., 2016) and Tensor-Ring (Wang et al., 2018) decompositions have been widely studied for DNNs. However, these methods still suffer significant accuracy loss for computer vision tasks.

Kronecker Product Decomposition (KPD) is another decomposition method that has recently been shown to be very effective when applied to RNNs (Thakker et al., 2019). KPD compresses a model by replacing a large matrix with the two smaller Kronecker factor matrices that best approximate the original matrix. In this work, we generalize KPD to tensors, yielding the Generalized Kronecker Product Decomposition (GKPD), and use it to decompose convolution tensors. GKPD involves finding the summation of Kronecker products between factor tensors that best approximates the original tensor. We introduce this problem, which we call the Multidimensional Nearest Kronecker Product Problem, and provide a solution to it. By formulating the convolution operation directly in terms of the Kronecker factors, we show that we can avoid reconstruction at runtime and thus obtain a significant reduction in memory footprint and floating-point operations (FLOPs). Once all convolution tensors in a pre-trained CNN have been replaced by their compressed counterparts, we retrain the network. If a pretrained network is not available, we show that we are still able to train our compressed network from a random initialization. Furthermore, we show that these randomly initialized networks retain universal approximation capability by building on Hornik (1991) and Zhou (2020).

Applying GKPD to an arbitrary tensor leads to multiple possible decompositions, one for each configuration of Kronecker factors. As shown in Figure 1, we find that for any given compression factor, choosing a decomposition that consists of a larger summation of smaller Kronecker factors (as opposed to a smaller summation of larger Kronecker factors) leads to a lower reconstruction error as well as improved model accuracy.

To summarize, the main contributions of this paper are:

  • Generalizing the Kronecker Product Decomposition to multidimensional tensors

  • Introducing the Multidimensional Nearest Kronecker Product Problem and providing a solution

  • Providing experimental results for image classification on CIFAR-10 and ImageNet using compressed ResNet (He et al., 2016), MobileNetV2 (Sandler et al., 2018) and SENet (Hu et al., 2018) architectures.

Related Work on DNN Model Compression

Quantization methods focus on reducing the precision of parameters and/or activations into lower-bit representations. For example, the work in (Han et al., 2015a) quantizes the parameter precision from 32 bits to 8 bits or lower. Model weights have been quantized even further into binary (Courbariaux et al., 2015; Rastegari et al., 2016; Courbariaux et al., 2016; Hubara et al., 2017), and ternary (Li et al., 2016; Zhu et al., 2016) representations. In these methods, choosing between a uniform (Jacob et al., 2018) or nonuniform (Han et al., 2015a; Tang et al., 2017; Zhang et al., 2018b) quantization interval affects the compression rate and the acceleration.

Pruning methods began by exploring unstructured network weights, deactivating small weights through sparsity regularization on the weight parameters (Liu et al., 2015; Han et al., 2015a, b) or using layer statistics to guide parameter selection, as in ThiNet (Luo et al., 2017). Unstructured pruning results in irregularities in the weight parameters which limit the acceleration achievable by the pruned network. Hence, several works aim at zeroing out structured groups of convolutional filters through group sparsity regularization (Zhou et al., 2016; Wen et al., 2016; Alvarez and Salzmann, 2016). Sparsity regularization has been combined with other forms of regularizers, such as low-rank (Alvarez and Salzmann, 2017), ordered-weighted (Zhang et al., 2018a), and out-in-channel sparsity (Li et al., 2019) regularizers, to further improve pruning performance.

Decomposition methods factorize the original weight matrix or tensor into lightweight representations, resulting in far fewer parameters and consequently fewer computations. Applying truncated singular value decomposition (SVD) to compress the weight matrices of fully-connected layers is one of the earliest works in this category (Denton et al., 2014). In the same vein, canonical polyadic (CP) decomposition of the kernel tensors was proposed in (Lebedev et al., 2015), using nonlinear least squares to decompose the original convolution kernel into a set of rank-1 tensors. An alternative tensor decomposition approach to convolution kernel compression is Tucker decomposition (Tucker, 1963), which has been shown to provide more flexible interaction between the factor matrices through a core tensor. The idea of reshaping the weights of fully-connected layers into high-dimensional tensors and representing them in Tensor-Train format (Oseledets, 2011) was extended to CNNs in (Garipov et al., 2016). Tensor-Ring (TR) decomposition has become another popular option for compressing CNNs (Wang et al., 2018); for multidimensional data completion with the same intermediate rank, TR can be far more expressive than Tensor-Train (TT) (Wang et al., 2017). Kronecker factorization has also been used to replace the weight matrices and weight tensors within fully-connected and convolution layers (Zhou et al., 2015). That work, however, limited the representation to a single Kronecker product and trained the model from a random initialization. As shown in Fig. 1 and in the following sections, a summation of Kronecker products can significantly improve the representation power of the network and thus leads to better performance.

Other forms of model compression can be achieved by sharing convolutional weight matrices in a more structured manner, as in ShaResNet (Boulch, 2018), which reuses convolutional mappings within the same scale level, or FSNet (Yang et al., 2020), which shares filter weights across spatial locations. Neural networks can also be compressed using Knowledge Distillation (KD), where a large pre-trained (teacher) network is used to train a smaller (student) network (Mirzadeh et al., 2020; Heo et al., 2019). Designing lightweight CNNs such as MobileNet (Sandler et al., 2018) and SqueezeNet (Iandola et al., 2016) is another form of model compression.

Preliminaries

Given matrices $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$, their Kronecker product $A \otimes B$ is the $m_1 m_2 \times n_1 n_2$ matrix

$$A \otimes B = \begin{bmatrix} a_{1,1}B & \cdots & a_{1,n_1}B \\ \vdots & \ddots & \vdots \\ a_{m_1,1}B & \cdots & a_{m_1,n_1}B \end{bmatrix}. \qquad (1)$$

As shown in Loan (2000), any matrix $W \in \mathbb{R}^{m_1 m_2 \times n_1 n_2}$ can be decomposed into a sum of Kronecker products as

$$W = \sum_{r=1}^{\widehat{R}} A_r \otimes B_r, \qquad (2)$$

where

$$\widehat{R} = \operatorname{rank}\big(\mathcal{R}(W)\big) \qquad (3)$$

is the rank of a reshaped version $\mathcal{R}(W)$ of matrix $W$. We call this the Kronecker rank of $W$. Note that the Kronecker rank is not unique, and is dependent on the dimensions of the factors $A_r$ and $B_r$.

To compress a given $W$, we can represent it using a small number $R$ of Kronecker products that best approximate the original matrix. The factors are found by solving the Nearest Kronecker Product problem

$$\min_{\{A_r, B_r\}_{r=1}^{R}} \Big\| W - \sum_{r=1}^{R} A_r \otimes B_r \Big\|_F^2. \qquad (4)$$

As this approximation replaces a large matrix with a sequence of two smaller ones, memory consumption is reduced by a factor of

$$\frac{m_1 n_1 m_2 n_2}{R\,(m_1 n_1 + m_2 n_2)}. \qquad (5)$$

Furthermore, if a matrix $W = A \otimes B$ is decomposed into its Kronecker factors, then the projection $Wx$ can be performed without explicit reconstruction of $W$. Instead, the factors can be used directly to perform the computation as a result of the following equivalence:

$$(A \otimes B)\,x = \operatorname{vec}\!\big(B\,X\,A^{\top}\big), \qquad (6)$$

where $x = \operatorname{vec}(X)$, and $\operatorname{vec}(\cdot)$ vectorizes a matrix by stacking its columns.
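To make the matrix case concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the Nearest Kronecker Product problem (4) solved via the SVD of a rearranged matrix, in the spirit of Van Loan and Pitsianis; the function names and block-shape arguments are illustrative.

```python
import numpy as np

def rearrange(W, m1, n1, m2, n2):
    """Stack each m2 x n2 block of W (flattened) as one row, so that the
    Kronecker structure of W becomes low-rank structure of the result."""
    rows = []
    for i in range(m1):
        for j in range(n1):
            rows.append(W[i*m2:(i+1)*m2, j*n2:(j+1)*n2].reshape(-1))
    return np.stack(rows)                     # shape (m1*n1, m2*n2)

def nearest_kronecker(W, m1, n1, m2, n2, R=1):
    """Best sum of R Kronecker products A_r (m1 x n1) ⊗ B_r (m2 x n2), via SVD."""
    U, S, Vt = np.linalg.svd(rearrange(W, m1, n1, m2, n2), full_matrices=False)
    A = [np.sqrt(S[r]) * U[:, r].reshape(m1, n1) for r in range(R)]
    B = [np.sqrt(S[r]) * Vt[r].reshape(m2, n2) for r in range(R)]
    return A, B

# reconstruction error shrinks as R grows
W = np.random.randn(6, 8)
A, B = nearest_kronecker(W, 2, 2, 3, 4, R=2)
print(np.linalg.norm(W - sum(np.kron(a, b) for a, b in zip(A, B))))
```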

Method

In this section, we extend KPD to tensors yielding GKPD. First, we define the multidimensional Kronecker product, then we introduce the Multidimensional Nearest Kronecker Product problem and its solution. Finally, we describe our KroneckerConvolution module that uses GKPD to compress convolution tensors and avoids reconstruction at runtime.

Generalized Kronecker Product Decomposition

We now turn to generalizing the Kronecker product to operate on tensors. Let $\mathcal{A} \in \mathbb{R}^{a_1 \times \cdots \times a_N}$ and $\mathcal{B} \in \mathbb{R}^{b_1 \times \cdots \times b_N}$ be two given tensors. Intuitively, the tensor $\mathcal{A} \otimes \mathcal{B}$ is constructed by moving tensor $\mathcal{B}$ around in a non-overlapping fashion and, at each position, scaling it by the corresponding element of $\mathcal{A}$, as shown in Figure 2. Formally, the multidimensional Kronecker product is defined as

$$(\mathcal{A} \otimes \mathcal{B})_{i_1, \ldots, i_N} = \mathcal{A}_{q_1, \ldots, q_N}\,\mathcal{B}_{p_1, \ldots, p_N}, \qquad (7)$$

where

$$q_n = \lfloor i_n / b_n \rfloor, \qquad p_n = i_n \bmod b_n \qquad (8)$$

represent the integer quotient and the remainder of $i_n$ with respect to divisor $b_n$, respectively.
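As an illustration of definition (7)-(8), the following small NumPy sketch (names are illustrative, not from the paper) builds the multidimensional Kronecker product of two 4D tensors by pairing each block index with its intra-block index.

```python
import numpy as np

def kron4d(A, B):
    """Multidimensional Kronecker product of 4D tensors per (7)-(8):
    (A ⊗ B)[i1,..,i4] = A[i1//b1,..,i4//b4] * B[i1%b1,..,i4%b4]."""
    a1, a2, a3, a4 = A.shape
    b1, b2, b3, b4 = B.shape
    # outer product, then interleave each block index with its intra-block index
    C = np.einsum("ijkl,pqrs->ipjqkrls", A, B)
    return C.reshape(a1*b1, a2*b2, a3*b3, a4*b4)

# agrees with the matrix Kronecker product on a 2D slice
A = np.random.randn(2, 3, 1, 1); B = np.random.randn(4, 5, 1, 1)
assert np.allclose(kron4d(A, B)[:, :, 0, 0], np.kron(A[:, :, 0, 0], B[:, :, 0, 0]))
```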

Figure 2: Illustration of Kronecker Decomposition of a single convolution filter (with spatial dimensions equal to one for simplicity).

As with matrices, any multidimensional tensor $\mathcal{W}$ can be decomposed into a sum of Kronecker products

$$\mathcal{W} = \sum_{r=1}^{\widehat{R}} \mathcal{A}_r \otimes \mathcal{B}_r, \qquad (9)$$

where

$$\widehat{R} = \operatorname{rank}\big(\mathcal{R}(\mathcal{W})\big) \qquad (10)$$

denotes the Kronecker rank of tensor $\mathcal{W}$. Thus, we can approximate $\mathcal{W}$ using GKPD by solving the Multidimensional Nearest Kronecker Product problem

$$\min_{\{\mathcal{A}_r, \mathcal{B}_r\}_{r=1}^{R}} \Big\| \mathcal{W} - \sum_{r=1}^{R} \mathcal{A}_r \otimes \mathcal{B}_r \Big\|_F^2, \qquad (11)$$

where $R \leq \widehat{R}$. For the case of matrices (2D tensors), Van Loan and Pitsianis (1992) solved this problem using the SVD. We now extend their approach to the multidimensional setting. Our strategy is to define rearrangement operators that map $\mathcal{W}$, $\mathcal{A}_r$ and $\mathcal{B}_r$ to a matrix $\widetilde{W}$ and vectors $a_r$, $b_r$, and to solve

$$\min_{\{a_r, b_r\}_{r=1}^{R}} \Big\| \widetilde{W} - \sum_{r=1}^{R} a_r b_r^{\top} \Big\|_F^2 \qquad (12)$$

instead. By carefully defining the rearrangement operators, the sum of squares in (12) is kept identical to that in (11). The former corresponds to finding the best low-rank approximation of $\widetilde{W}$, which has a well-known solution via the SVD. We define the rearrangement operators as follows:

$$\widetilde{W} = \operatorname{extract}_{\operatorname{shape}(\mathcal{B})}(\mathcal{W}), \qquad a_r = \operatorname{flatten}\big(\operatorname{extract}_{\mathbf{1}}(\mathcal{A}_r)\big), \qquad b_r = \operatorname{flatten}(\mathcal{B}_r),$$

where $\operatorname{extract}_{s}(\cdot)$ extracts non-overlapping patches of shape $s$ from its input tensor and stacks their vectorizations as rows, $\operatorname{flatten}(\cdot)$ flattens its input into a vector, $\mathbf{1}$ is a tensor with the same number of dimensions as $\mathcal{A}_r$ in which each dimension equals unity, and $\operatorname{shape}(\mathcal{B})$ is a vector describing the shape of tensor $\mathcal{B}$. While the ordering of patch extraction and flattening is not important, it must remain consistent across the rearrangement operators.
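The following hedged NumPy sketch mirrors this strategy for a 4D convolution tensor: rearrange $\mathcal{W}$ into a matrix whose rows index non-overlapping patches of the shape of $\mathcal{B}$, take a truncated SVD, and fold the singular vectors back into Kronecker factors. The routine name `gkpd` and the shape arguments are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gkpd(W, a_shape, b_shape, R=1):
    """Approximate the 4D tensor W by a sum of R Kronecker products A_r ⊗ B_r,
    with A_r of shape a_shape and B_r of shape b_shape (sketch of (11)-(12))."""
    assert all(a*b == w for a, b, w in zip(a_shape, b_shape, W.shape))
    a1, a2, a3, a4 = a_shape
    b1, b2, b3, b4 = b_shape
    # rearrangement: one row per non-overlapping patch of shape b_shape
    W_tilde = (W.reshape(a1, b1, a2, b2, a3, b3, a4, b4)
                 .transpose(0, 2, 4, 6, 1, 3, 5, 7)
                 .reshape(a1*a2*a3*a4, b1*b2*b3*b4))
    U, S, Vt = np.linalg.svd(W_tilde, full_matrices=False)
    A = [np.sqrt(S[r]) * U[:, r].reshape(a_shape) for r in range(R)]
    B = [np.sqrt(S[r]) * Vt[r].reshape(b_shape) for r in range(R)]
    return A, B

# e.g. decompose a 64x64x3x3 convolution tensor (shapes are illustrative)
W = np.random.randn(64, 64, 3, 3)
A, B = gkpd(W, a_shape=(8, 8, 3, 1), b_shape=(8, 8, 1, 3), R=4)
```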

Figure 3: Illustration of the KroneckerConvolution operation: (a) Conv2d, (b) KroneckerConv2d. Although (a) and (b) result in identical outputs, the latter is more efficient in terms of memory and FLOPs.

KroneckerConvolution Layer

The convolution operation in CNNs between a weight tensor $\mathcal{W} \in \mathbb{R}^{F \times C \times K_1 \times K_2}$ and an input $\mathcal{X}$ is a multilinear map that can be described in scalar form as

$$\mathcal{Y}_{f, x, y} = \sum_{c} \sum_{i} \sum_{j} \mathcal{W}_{f, c, i, j}\,\mathcal{X}_{c,\,x+i,\,y+j}. \qquad (13)$$

Assuming $\mathcal{W}$ can be decomposed into KPD factors $\{\mathcal{A}_r\}$ and $\{\mathcal{B}_r\}$, i.e. $\mathcal{W} = \sum_{r=1}^{R} \mathcal{A}_r \otimes \mathcal{B}_r$, we can rewrite (13) as

$$\mathcal{Y}_{f, x, y} = \sum_{r=1}^{R} \sum_{c} \sum_{i} \sum_{j} \big(\mathcal{A}_r \otimes \mathcal{B}_r\big)_{f, c, i, j}\,\mathcal{X}_{c,\,x+i,\,y+j}. \qquad (14)$$

Due to the structure of tensor , we do not need to explicitly reconstruct it to carry out the summation in (14). Instead, we can carry out the summation by directly using elements of tensors and as shown in Lemma 1. This key insight leads to a large reduction in both memory and FLOPs. Effectively, this allows us to replace a large convolutional layer (with a large weight tensor) with two smaller ones, as we demonstrate in the rest of this section.

Lemma 1.

Suppose tensor $\mathcal{W} \in \mathbb{R}^{F \times C \times K_1 \times K_2}$ can be decomposed into KPD factors $\mathcal{A} \in \mathbb{R}^{f_1 \times c_1 \times h_1 \times w_1}$ and $\mathcal{B} \in \mathbb{R}^{f_2 \times c_2 \times h_2 \times w_2}$ such that $\mathcal{W} = \mathcal{A} \otimes \mathcal{B}$. Then, the multilinear map involving $\mathcal{W}$ can be written directly in terms of its factors $\mathcal{A}$ and $\mathcal{B}$ as follows:

$$\sum_{c, i, j} \mathcal{W}_{f, c, i, j}\,\mathcal{X}_{c,\,x+i,\,y+j} \;=\; \sum_{c', i', j'} \sum_{c'', i'', j''} \mathcal{A}_{\lfloor f/f_2 \rfloor,\,c',\,i',\,j'}\;\mathcal{B}_{f \bmod f_2,\,c'',\,i'',\,j''}\;\mathcal{X}_{g(c', c''),\; x + i' h_2 + i'',\; y + j' w_2 + j''},$$

where $\mathcal{X}$ is an input tensor, $g(c', c'') = c'\,c_2 + c''$ is a re-indexing function, and $\lfloor \cdot \rfloor$ and $\bmod$ are as defined in (8). The equality also holds for any valid offsets to the input's indices,

$$\mathcal{X}_{g(c', c''),\; x + i' h_2 + i'' + \Delta_x,\; y + j' w_2 + j'' + \Delta_y},$$

where $\Delta_x, \Delta_y$ are integer offsets.

Proof.

See Appendix. ∎

Applying Lemma 1 to the summation in (14) yields

$$\mathcal{Y}_{f, x, y} = \sum_{r=1}^{R} \sum_{c', i', j'} \sum_{c'', i'', j''} (\mathcal{A}_r)_{\lfloor f/f_2 \rfloor, c', i', j'}\,(\mathcal{B}_r)_{f \bmod f_2, c'', i'', j''}\,\mathcal{X}_{g(c', c''),\, x + i' h_2 + i'',\, y + j' w_2 + j''},$$

where indices $(c', i', j')$ enumerate over elements in tensor $\mathcal{A}_r$ and $(c'', i'', j'')$ enumerate over elements in tensor $\mathcal{B}_r$. Finally, we can separate the convolution operation into two steps by exchanging the order of summation as follows:

$$\mathcal{Y}_{f, x, y} = \sum_{r=1}^{R} \sum_{c', i', j'} (\mathcal{A}_r)_{\lfloor f/f_2 \rfloor, c', i', j'} \underbrace{\sum_{c'', i'', j''} (\mathcal{B}_r)_{f \bmod f_2, c'', i'', j''}\,\mathcal{X}_{g(c', c''),\, x + i' h_2 + i'',\, y + j' w_2 + j''}}_{\text{3D convolution}}. \qquad (15)$$

The inner summation in (15) corresponds to a 3D convolution and the outer summation corresponds to multiple 2D convolutions, as visualized in Fig. 3 for a special case.

Algorithm 1: KroneckerConvolution Forward Pass
Input: input tensor $\mathcal{X}$, Kronecker factors $\{\mathcal{A}_r\}_{r=1}^{R}$ and $\{\mathcal{B}_r\}_{r=1}^{R}$, stride $s$ of the original convolution
Output: output tensor $\mathcal{Y}$
1. Reshape the input so that its channel dimension is split into blocks matching the channel dimension of $\mathcal{B}_r$;
2. 3D convolution of the reshaped input with $\mathcal{B}_r$;  /* inner summation of (15) */
3. Batched 2D convolution with $\mathcal{A}_r$ using the appropriate stride and dilation; note that we perform multiple 2D convolutions along the first dimension using the same weight kernel;  /* outer summation of (15) */
4. Sum over the $R$ terms and reshape to obtain $\mathcal{Y}$.

Overall, (15) can be carried out efficiently in tensor form using Algorithm 1. Effectively, the input is collapsed in two stages instead of one as in the multidimensional convolution operation. Convolving a multi-channel input with a single filter in yields a scalar value at a particular output location. This is done by first scaling all elements in the corresponding multidimensional patch, then collapsing it by means of summation. Since tensor is comprised of multidimensional patches scaled by elements in , we can equivalently collapse each sub-patch in the input using tensor followed by a subsequent collapsing using tensor to obtain the same scalar value.
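For intuition, here is a hedged PyTorch sketch of the two-stage computation for a single Kronecker term in the simplified case where the $\mathcal{A}$ factor has 1x1 spatial support; in this case the inner step reduces to batched 2D convolutions with $\mathcal{B}$ (the same kernel reused across input-channel blocks) and the outer step to a pointwise contraction with $\mathcal{A}$. The general case additionally requires the stride and dilation handling of Algorithm 1, and all names below are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kron_conv2d(x, A, B, stride=1, padding=0):
    """One Kronecker term with A of shape (f1, c1) (1x1 spatial) and
    B of shape (f2, c2, kh, kw); equivalent to conv2d with W = A ⊗ B of
    shape (f1*f2, c1*c2, kh, kw), without materializing W."""
    N, C, H_in, W_in = x.shape
    f1, c1 = A.shape
    f2, c2, kh, kw = B.shape
    assert C == c1 * c2
    # stage 1: apply B to every c2-channel block (the c1 blocks are folded into the batch)
    z = F.conv2d(x.reshape(N * c1, c2, H_in, W_in), B, stride=stride, padding=padding)
    z = z.reshape(N, c1, f2, z.shape[-2], z.shape[-1])
    # stage 2: collapse the c1 blocks with A (a pointwise contraction in this special case)
    y = torch.einsum("tc,ncfhw->ntfhw", A, z)
    return y.reshape(N, f1 * f2, y.shape[-2], y.shape[-1])

# consistency check against an explicit reconstruction of W = A ⊗ B
A, B = torch.randn(2, 3), torch.randn(4, 5, 3, 3)
W_full = torch.einsum("tc,fdhw->tfcdhw", A, B).reshape(8, 15, 3, 3)
x = torch.randn(1, 15, 16, 16)
assert torch.allclose(kron_conv2d(x, A, B, padding=1),
                      F.conv2d(x, W_full, padding=1), atol=1e-4)
```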

Complexity of KroneckerConvolution

The GKPD of a convolution layer is not unique. Different configurations of Kronecker factors will lead to different reductions in memory and number of operations. Namely, for a KroneckerConvolution layer using $R$ Kronecker products with factors $\mathcal{A}_r \in \mathbb{R}^{f_1 \times c_1 \times h_1 \times w_1}$ and $\mathcal{B}_r \in \mathbb{R}^{f_2 \times c_2 \times h_2 \times w_2}$, the memory reduction is

$$\frac{f_1 f_2\,c_1 c_2\,h_1 h_2\,w_1 w_2}{R\,(f_1 c_1 h_1 w_1 + f_2 c_2 h_2 w_2)}, \qquad (16)$$

whereas the reduction in FLOPs is

$$\frac{f_1 f_2\,c_1 c_2\,h_1 h_2\,w_1 w_2}{R\,(f_2 c_1 c_2 h_2 w_2 + f_1 f_2 c_1 h_1 w_1)}. \qquad (17)$$

For the special case of using separable filters in the Kronecker factors, the reduction in FLOPs becomes

(18)

implying that the corresponding factor dimensions should be sufficiently large in order to obtain a reduction in FLOPs. In contrast, the memory reduction of the KroneckerConvolution layer is unconditional.
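As a rough sanity check of these savings, the helper below (an assumption-laden sketch, not the paper's exact expressions) counts parameters directly and estimates FLOPs as the multiply-accumulates of the two-stage forward pass under a stride-1, same-resolution assumption.

```python
def kron_conv_savings(a_shape, b_shape, R, out_hw):
    """Memory and (rough) FLOP reduction of a KroneckerConvolution with R terms,
    A_r of shape a_shape=(f1, c1, h1, w1) and B_r of shape b_shape=(f2, c2, h2, w2),
    versus the dense (f1*f2) x (c1*c2) x (h1*h2) x (w1*w2) convolution."""
    f1, c1, h1, w1 = a_shape
    f2, c2, h2, w2 = b_shape
    H, W = out_hw
    dense_params = f1*f2 * c1*c2 * h1*h2 * w1*w2
    kron_params = R * (f1*c1*h1*w1 + f2*c2*h2*w2)
    dense_flops = dense_params * H * W
    # stage 1: B_r on every c1 block; stage 2: A_r collapses the blocks (stride-1 estimate)
    kron_flops = R * H * W * (f2*c1*c2*h2*w2 + f1*f2*c1*h1*w1)
    return dense_params / kron_params, dense_flops / kron_flops

# hypothetical 256x256x3x3 layer split as (16,16,3,1) ⊗ (16,16,1,3) with R=4
print(kron_conv_savings((16, 16, 3, 1), (16, 16, 1, 3), R=4, out_hw=(14, 14)))
```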

Universal Approximation via Kronecker Products

Universal approximation results for shallow networks have been around for a long time (Hornik, 1991; Ripley, 1996, pp. 173-180), whilst such studies for deep networks are more recent (Zhou, 2020). In this section, we build on these foundations to show that neural networks whose weight tensors are represented using low-Kronecker-rank summations of Kronecker products remain universal approximators. For brevity, we refer to such networks as "Kronecker networks".

First, we show that a shallow Kronecker network is a universal approximator. For simplicity, this is shown only for a single output; the result generalizes by treating each output dimension separately.

Consider a single-layer neural network constructed using $n$ hidden units and an $L$-Lipschitz activation function defined on a compacta. As shown in Hornik (1991), such a network serves as a universal approximator, i.e., for a given positive number $\epsilon$ there exists an $n$ such that

(19)

Similarly, a shallow Kronecker network consisting of $n$ hidden units

(20)

is comprised of a weight matrix made of a summation of Kronecker products between factors $A_r$ and $B_r$. From (20) we can see that any shallow neural network with $n$ hidden units can be represented exactly using a Kronecker network with full Kronecker rank $\widehat{R}$. Thus, shallow Kronecker networks with full Kronecker rank also serve as universal approximators. In Theorem 1 we show that a similar result holds for shallow Kronecker networks with low Kronecker rank $R < \widehat{R}$, provided that the smallest singular values of the reshaped weight matrix of the approximating neural network are small enough.

Theorem 1.

Any shallow Kronecker network with a low Kronecker rank $R$ and $n$ hidden units, defined on a compacta with an $L$-Lipschitz activation, is dense in the class of continuous functions for a large enough $n$, given

where $\sigma_r$ denotes the $r$-th singular value of the reshaped version of the weight matrix in an approximating neural network with $n$ hidden units satisfying (19).

Proof.

See Appendix. ∎

In Theorem 2, we extend the preceding result to deep convolutional neural networks, where each convolution tensor is represented using a summation of Kronecker products between factor tensors.

Theorem 2.

Any deep Kronecker convolution network with Kronecker rank $R_l$ in layer $l$, on a compacta with an $L$-Lipschitz activation, is dense in the class of continuous functions for a large enough number of layers, given

where $\sigma^{(l)}_r$ denotes the $r$-th singular value of the matrix of the reshaped weight tensor in the $l$-th layer of an approximating convolutional neural network.

Proof.

See Appendix. ∎

The result is achieved by extending the recent universal approximation bound of Zhou (2020) to GKPD networks. One can derive convergence rates using (Zhou, 2020, Theorem 2) as well. These results assure that the performance degradation of Kronecker networks relative to uncompressed networks is small for an appropriate choice of Kronecker rank $R$.

Configuration Setting

As GKPD provides a set of possible decompositions for each layer in a network, a selection strategy is needed. For a given compression rate, there is a trade-off between using a larger number of terms R in the GKPD summation (11) together with more compressed (smaller) Kronecker factors, and a smaller R with less compressed (larger) factors. To guide our search, we select the decomposition that best approximates the original uncompressed tensor obtained from a pretrained network. This means different layers in a network will be approximated by a different number of Kronecker products. Before searching for the best decomposition, we limit our search space to configurations that satisfy a desired reduction in FLOPs. Unless otherwise stated, all GKPD experiments use this approach.
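A hedged sketch of such a selection loop is given below; `flop_reduction_fn` and `gkpd_fn` are illustrative stand-ins for a FLOP estimator and a GKPD routine (e.g., the sketches given earlier), not functions from the paper.

```python
import numpy as np

def choose_config(W, candidates, min_flop_reduction, flop_reduction_fn, gkpd_fn):
    """Among candidate (a_shape, b_shape, R) triples that meet the target FLOP
    reduction, pick the one whose GKPD best reconstructs the pretrained tensor W."""
    best, best_err = None, np.inf
    for a_shape, b_shape, R in candidates:
        if flop_reduction_fn(W.shape, a_shape, b_shape, R) < min_flop_reduction:
            continue                  # discard configurations that are not compressed enough
        W_hat = gkpd_fn(W, a_shape, b_shape, R)   # reconstructed approximation of W
        err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
        if err < best_err:
            best, best_err = (a_shape, b_shape, R), err
    return best, best_err
```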

Experiments

To validate our method, we provide model compression experimental results for image classification tasks using a variety of popular CNN architectures, namely ResNet (He et al., 2016) and SEResNet, which benefits from squeeze-and-excitation blocks (Hu et al., 2018). We also apply our compression method to MobileNetV2 (Sandler et al., 2018), a model optimized for efficient inference in embedded vision applications through depthwise separable convolutions and inverted residual blocks. Implementation details are provided in the appendix.

Table 1 shows the top-1 accuracy on the CIFAR-10 (Krizhevsky, 2009) dataset using compressed ResNet18 and SEResNet50. For each architecture, the compressed models obtained using the proposed GKPD are named with the "Kronecker" prefix added to the original model's name. The configuration of each compressed model is selected such that the number of parameters is similar to MobileNetV2. We observe that for ResNet18 and SEResNet50, the number of parameters and FLOPs can be substantially reduced at the expense of a small decrease in accuracy. Specifically, KroneckerResNet18 achieves a 5× compression and a 4.7× reduction in FLOPs with only a 0.12% drop in accuracy. KroneckerSEResNet50 obtains a 9.4× compression rate and a 9.7× reduction in FLOPs with only a 0.5% drop in accuracy.

Moreover, we see that applying the proposed GKPD method on higher-capacity architectures such as ResNet18 and SEResNet50 can lead to higher accuracy than a hand-crafted efficient network such as MobileNetV2. Specifically, with the same number of parameters as that of MobileNetV2, we achieve a compressed ResNet18 (KroneckerResNet18) and a compressed SEResNet50 (KroneckerSEResNet50) with 0.70% and 0.27% higher accuracy than MobileNetV2.

Table 2 shows the performance of GKPD when used to achieve extreme compression rates. The same baseline architectures are compressed using different configurations. We also use GKPD to compress the already efficient MobileNetV2. When targeting very small models (e.g., 0.29M parameters), compressing MobileNetV2 with a compression factor of 7.9× outperforms extreme compression of SEResNet50 with a compression factor of 73×.

In the following subsections, we present comparative assessments using different model compression methods.

Model Params (M) FLOPs (M) Accuracy(%)
MobileNetV2 (Baseline) 2.30 96 94.18
ResNet18 (Baseline) 11.17 557 95.05
KroneckerResNet18 2.2 117 94.97
SEResNet50 (Baseline) 21.40 1163 95.15
KroneckerSEResNet50 2.30 120 94.45
Table 1: Top-1 accuracy measured on CIFAR-10 for the baseline models MobileNetV2, ResNet18 and SEResNet50, as well as their compressed versions obtained using GKPD. The number of parameters in the compressed models is approximately matched with that of MobileNetV2.
Model Params (M) Compression Accuracy(%)
KroneckerResNet18 0.48 23.1 92.62
KroneckerSEResNet50 0.93 22.96 93.66
KroneckerSEResNet50 0.29 73.30 91.85
KroneckerMobileNetV2 0.73 3.14 93.80
KroneckerMobileNetV2 0.29 7.79 93.01
KroneckerMobileNetV2 0.18 12.43 91.48
Table 2: Top-1 accuracy measured on CIFAR-10 for highly compressed ResNet18 (He et al., 2016), MobileNetV2 (Sandler et al., 2018) and SEResNet50 (Hu et al., 2018).

Comparison with Decomposition-based Methods

In this section, we compare GKPD to other tensor-decomposition-based compression methods. We use a classification model pretrained on CIFAR-10 and apply model compression methods based on Tucker (Kim et al., 2016), Tensor-Train (Garipov et al., 2016) and Tensor-Ring (Wang et al., 2018) decompositions, along with our proposed GKPD method. We choose the ResNet32 architecture in this set of experiments since it has been reported to be effectively compressed using Tensor-Ring in (Wang et al., 2018).

The model compression results obtained using different decomposition methods, each aiming for a 5× compression rate, are shown in Table 3. As the table suggests, GKPD outperforms all other decomposition methods at a similar compression factor. We attribute the performance of GKPD to its higher representation power. This is reflected in its ability to better reconstruct the weight tensors of a pretrained network in comparison to other decomposition methods. Refer to the Appendix for a comparative assessment of reconstruction errors across layers of the ResNet architecture.

Model Params (M) Compression Accuracy (%)
ResNet32 0.46 1 92.55
TuckerResNet32 0.09 5 87.7
TensorTrainResNet32 0.096 4.8 88.3
TensorRingResNet32 0.09 5 90.6
KroneckerResNet32 0.09 5 91.52
Table 3: Top-1 Accuracy on CIFAR-10 of compressed ResNet32 models using various decomposition approaches.
Model Params (M) Compression Accuracy (%)
ResNet26 0.37 1 92.94
Mirzadeh et al. (2020) 0.17 2.13 91.23
Heo et al. (2019) 0.17 2.13 90.34
KroneckerResNet26 0.14 2.69 93.16
Mirzadeh et al. (2020) 0.075 4.88 88.0
Heo et al. (2019) 0.075 4.88 87.32
KroneckerResNet26 0.069 5.29 91.28
Table 4: Top-1 accuracy measured on CIFAR-10 for the baseline model ResNet26 and its compressed versions obtained using the KD-based methods of Mirzadeh et al. (2020) and Heo et al. (2019), and the proposed GKPD method.
Model Params (M) Compression Accuracy (%)
ResNet50 25.6 1 75.99
FSNet 13.9 2.0 73.11
ThiNet 12.38 2.0 71.01
KroneckerResNet50 12.0 2.13 73.95
Table 5: Top-1 accuracy measured on ImageNet for the baseline model ResNet50 and its compressed versions obtained using ThiNet (Luo et al., 2017), FSNet (Yang et al., 2020), and the proposed GKPD method.
Model Params (M) FLOPs (G) Accuracy (%)
ResNet18 11.17 0.58 95.05
KroneckerResNet18 1.41 0.17 92.96
KroneckerResNet18 1.42 0.16 93.74
KroneckerResNet18 1.44 0.26 94.30
KroneckerResNet18 1.51 0.32 94.58
Table 6: Top-1 image classification accuracy of compressed ResNet18 on CIFAR-10, where R denotes the number of Kronecker products used in the GKPD of each layer.

Comparison with other Compression Methods

We compare our proposed model compression method with two state-of-the-art KD-based compression methods: (Mirzadeh et al., 2020) and (Heo et al., 2019). These methods are known to be very effective on relatively small networks such as ResNet26. We therefore apply our compression method to the ResNet26 architecture in these experiments. Table 4 presents the top-1 accuracy obtained for the different compressed models at two different compression rates. As the table suggests, the proposed method yields more than 2% and 3.7% improvements in top-1 accuracy over the KD-based model compression methods when aiming for compression rates of roughly 2× and 5×, respectively.

Model Compression with Random Initialization

To study the effect of replacing weight tensors in neural networks with a summation of Kronecker products, we conduct experiments using randomly initialized Kronecker factors, as opposed to performing GKPD on a pretrained network. By replacing all weight tensors in a predefined network architecture with a randomly initialized summation of Kronecker products, we obtain a compressed model. To this end, we run assessments on a higher-capacity architecture, ResNet50, on a larger-scale dataset, ImageNet (Krizhevsky et al., 2012). Table 5 lists the top-1 accuracy for the ResNet50 baseline and its compressed variant. We achieve a compression rate of 2.13× with a 2.6% drop in accuracy compared to the baseline model.

We also perform model compression using two state-of-the-art model compression methods, ThiNet (Luo et al., 2017) and FSNet (Yang et al., 2020), which are based on pruning and filter-sharing techniques, respectively. Both reportedly achieve good accuracy on large datasets. Table 5 also lists the top-1 accuracy for ResNet50 compressed using these two methods. As the table shows, our proposed method outperforms both techniques at a 2× compression rate. Note that the performance obtained using our method is based on a random initialization, while the compression achieved with ThiNet benefits from a pretrained model. These results indicate that the proposed GKPD can achieve high performance even when a pretrained model is not available.

Experimental Analysis of Kronecker Rank

Using a higher Kronecker rank can increase the representation power of a network. This is reflected in the ability of GKPD to better reconstruct weight tensors using a larger number of Kronecker products in (11). In Table 6 we study the effect of using a larger R in Kronecker networks while keeping the overall number of parameters constant. We find that using a larger R does indeed improve performance.

Conclusion

In this paper we propose GKPD, a generalization of the Kronecker Product Decomposition to multidimensional tensors for the compression of deep CNNs. In the proposed GKPD, we extend the Nearest Kronecker Product problem to the multidimensional setting and use it for optimal initialization from a baseline model. We show that for a fixed number of parameters, using a summation of Kronecker products can significantly increase representation power in comparison to a single Kronecker product. We use our approach to compress a variety of CNN architectures and show the superiority of GKPD over several state-of-the-art compression methods. GKPD can be combined with other compression methods, such as quantization and knowledge distillation, to further improve the compression-accuracy trade-off. Designing new architectures that can benefit most from Kronecker product representations is an area for future work.

References

Appendix

Implementation Details

All experiments on CIFAR-10 are run for 200 epochs using Stochastic Gradient Descent (SGD). We use a batch size of 128, weight decay of 0.0001, momentum of 0.1 and an initial learning rate of 0.1 that is subsequently reduced by a factor of 10 at epochs 100 and 150. Similarly, experiments on ImageNet are run for 100 epochs using SGD with a batch size of 256, weight decay of 0.0001, momentum of 0.1 and an initial learning rate of 0.1 that is subsequently reduced by a factor of 10 at epochs 30, 60 and 90. Eight NVIDIA Tesla V100 SXM2 32 GB GPUs were used to run all of our experiments.
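For concreteness, a hedged PyTorch sketch of the CIFAR-10 schedule described above is shown below; `model` and `train_loader` are assumed to be defined elsewhere, and the momentum value simply follows the text.

```python
import torch

# `model` and `train_loader` are assumed to be defined elsewhere
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.1,  # momentum as stated above
                            weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```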

Theorem Proofs

Lemma 1 (restated).

Proof.

By definition the terms in tensor are given by

(21)

where

Since and decompose into an integer quotient and a remainder term (with respect to divisor ), it follows that

(22)

Therefore,

(23)

Combining (21) and (23) completes the proof. ∎

Theorem 1 (restated).

Proof.

We drop the bias term for simplicity of notation. We need to bound

(24)
(26)

The full rank version is dense in according to Hornik (1991), therefore (24) is bounded by . It is only required to show that (26) is also bounded

where is the reshaping operation in (12). Thus, the second term (26) is bounded by if

Note that the last term (26) is consequently bounded by the Cauchy-Schwarz inequality. ∎

Theorem 2 (restated).

Proof.

The proof follows a sketch similar to that of Theorem 1. Define the convolution layer as

where $J$ is the total number of layers and the Toeplitz-type matrix transforms a convolution into a matrix multiplication operation. We note that the weight of a layer is a collection of such matrices.

For a CNN of depth , the hypothesis space is the set of all functions defined by

parametrized by

According to (Zhou, 2020, Theorem 1), the hypothesis space is dense in the class of continuous functions, so it is also dense on the compacta of interest. In other words, for a given $\epsilon > 0$ there exists an approximating convolutional neural network such that

(27)

Building on this result, it is sufficient to bound a Kronecker convolutional neural network with a low Kronecker rank in its hidden layers as follows:

Expanding over the inner layers gives

and therefore the low-rank network is bounded, given

Reconstruction Error of ResNet18

We further study GKPD by plotting the reconstruction errors achieved when compressing a ResNet18 model pretrained on ImageNet. We observe in Figure 4 that GKPD generally achieves a lower reconstruction error in comparison with Tucker decomposition.

Figure 4: Reconstruction error between convolution tensors in a pretrained ResNet18 model and their compressed representations at a given compression rate. GKPD consistently yields a lower reconstruction error than Tucker decomposition.
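For reference, below is a hedged sketch of how such a per-layer study could be reproduced for GKPD alone, using torchvision's pretrained ResNet18 (torchvision >= 0.13 assumed); the chosen factor shapes and the relative-error metric, computed from the discarded singular values of the rearranged tensor, are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
import torch
from torchvision.models import resnet18, ResNet18_Weights

def gkpd_rel_error(W, a_shape, b_shape, R):
    """Relative GKPD reconstruction error from the discarded singular values of
    the rearranged weight tensor (no explicit reconstruction needed)."""
    a1, a2, a3, a4 = a_shape
    b1, b2, b3, b4 = b_shape
    W_tilde = (W.reshape(a1, b1, a2, b2, a3, b3, a4, b4)
                 .transpose(0, 2, 4, 6, 1, 3, 5, 7)
                 .reshape(a1*a2*a3*a4, -1))
    s = np.linalg.svd(W_tilde, compute_uv=False)
    return np.sqrt(np.sum(s[R:] ** 2)) / np.linalg.norm(W)

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)   # pretrained on ImageNet
for name, m in model.named_modules():
    if isinstance(m, torch.nn.Conv2d) and m.kernel_size == (3, 3) and m.in_channels >= 64:
        W = m.weight.detach().numpy()
        f, c, kh, kw = W.shape
        a_shape, b_shape = (f // 2, c // 2, 1, 1), (2, 2, kh, kw)   # illustrative split
        print(f"{name}: relative error {gkpd_rel_error(W, a_shape, b_shape, R=8):.3f}")
```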