Structured Convolutions for Efficient Neural Network Design

08/06/2020 · Yash Bhalgat et al. · Qualcomm

In this work, we tackle model efficiency by exploiting redundancy in the implicit structure of the building blocks of convolutional neural networks. We start our analysis by introducing a general definition of Composite Kernel structures that enable the execution of convolution operations in the form of efficient, scaled, sum-pooling components. As a special case, we propose Structured Convolutions and show that these allow decomposition of the convolution operation into a sum-pooling operation followed by a convolution with significantly lower complexity and fewer weights. We show how this decomposition can be applied to 2D and 3D kernels as well as to fully-connected layers. Furthermore, we present a Structural Regularization loss that promotes neural network layers to adopt this desired structure in a way that, after training, they can be decomposed with negligible performance loss. By applying our method to a wide range of CNN architectures, we demonstrate "structured" versions of the ResNets that are up to 2× smaller and a new Structured-MobileNetV2 that is more efficient while staying within an accuracy loss of 1%. We also show similar structured versions of EfficientNet on ImageNet and of the HRNet architecture for semantic segmentation on the Cityscapes dataset. Our method performs on par with or better than the existing tensor decomposition and channel pruning methods in terms of complexity reduction.


1 Introduction

Deep neural networks deliver outstanding performance across a variety of use-cases but quite often fail to meet the computational budget requirements of mainstream devices. Hence, model efficiency plays a key role in bridging deep learning research into practice. Various model compression techniques rely on the key assumption that deep networks are over-parameterized, meaning that a significant proportion of the parameters are redundant. This redundancy can appear either explicitly or implicitly. In the former case, several structured He et al. (2017); Li et al. (2017) as well as unstructured Han et al. (2015a, b); Manessi et al. (2018); Zhang et al. (2018) pruning methods have been proposed to systematically remove redundant components in the network and improve run-time efficiency. On the other hand, tensor-decomposition methods based on singular values of the weight tensors, such as spatial SVD or weight SVD, exploit the more implicit redundancy by constructing low-rank decompositions of the weight tensors for efficient inference Denton et al. (2014); Jaderberg et al. (2014); Kuzmin et al. (2019).

Redundancy in deep networks can also be seen as the network weights possessing unnecessarily many degrees of freedom (DOF). Alongside various regularization methods Krogh and Hertz (1992); Srivastava et al. (2014) that impose constraints to avoid overfitting, another approach to reducing the DOF is to decrease the number of learnable parameters. To this end, Jaderberg et al. (2014); Qiu et al. (2018); Tayyab and Mahalanobis (2019) propose using certain basis representations for the weight tensors. In these methods, the basis vectors are fixed and only their coefficients are learnable. Thus, by using fewer coefficients than the size of the weight tensors, the DOF can be effectively restricted. Note, however, that this is useful only during training, since the original, larger number of parameters is used during inference. Qiu et al. (2018) show that systematically choosing the basis (e.g. the Fourier-Bessel basis) can lead to model size shrinkage and FLOP reduction even during inference.

In this work, we explore restricting the degrees of freedom of convolutional kernels by imposing a structure on them. This structure can be thought of as constructing the convolutional kernel by superimposing several constant-height kernels. A few examples are shown in Fig. 1, where a kernel is constructed via superimposition of M linearly independent binary masks β_1, ..., β_M with associated constant scalars α_1, ..., α_M, hence leading to M degrees of freedom for the kernel. The fact that the basis elements are binary masks enables efficient execution of the convolution operation, as explained in Sec. 3.1.
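To make the construction concrete, the following minimal NumPy sketch (with an illustrative, hypothetical choice of masks) builds a 3×3 kernel from M = 4 binary masks, so only 4 scalar coefficients are learnable instead of 9 kernel entries:

```python
import numpy as np

# Minimal sketch: a 3x3 kernel built by superimposing M = 4 binary masks,
# each scaled by one coefficient alpha_m (4 degrees of freedom instead of 9).
# The masks below are illustrative 2x2 patches of ones at the four corners.
betas = np.zeros((4, 3, 3))
betas[0, :2, :2] = 1
betas[1, :2, 1:] = 1
betas[2, 1:, :2] = 1
betas[3, 1:, 1:] = 1

alphas = np.array([0.5, -1.0, 2.0, 0.25])      # the only learnable values

W = np.einsum('m,mij->ij', alphas, betas)      # W = sum_m alpha_m * beta_m
print(W)
```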

In Sec. 4, we introduce Structured Convolutions as a special case of this superimposition and show that it leads to a decomposition of the convolution operation into a sum-pooling operation and a significantly smaller convolution operation. We show how this decomposition can be applied to convolutional layers as well as fully connected layers. We further propose a regularization method named Structural Regularization that promotes the normal convolution weights to have the desired structure that facilitates our proposed decomposition. Overall, our key contributions in this work are:

  1. We introduce Composite Kernel structure, which accepts an arbitrary basis in the kernel formation, leading to an efficient convolution operation. Sec. 3 provides the definition.

  2. We propose Structured Convolutions, a realization of the composite kernel structure. We show that a structured convolution can be decomposed into a sum-pooling operation followed by a much smaller convolution operation. A detailed analysis is provided in Sec. 4.1.

  3. Finally, we design Structural Regularization, an effective training method to enable the structural decomposition with minimal loss of accuracy. Our process is described in Sec. 5.1.

Figure 1: A composite kernel constructed as a superimposition of different underlying structures. Kernels in (a) and (b) possess 4 degrees of freedom whereas the kernel in (c) has 6 degrees of freedom.
Color combinations are chosen to reflect the summations; this figure is best viewed in color.

2 Related Work

The existing literature on exploiting redundancy in deep networks can be broadly studied as follows.

Tensor Decomposition Methods. The work in Zhang et al. (2016a) proposed a Generalized SVD approach to decompose a convolution with kernel size k×k, c input channels, and d output channels into a k×k convolution with fewer output channels followed by a 1×1 convolution. Likewise, Jaderberg et al. (2014) introduced Spatial SVD to decompose a k×k kernel into k×1 and 1×k kernels. Tai et al. (2015) further developed a non-iterative method for such low-rank decomposition. CP-decomposition Kolda and Bader (2009); Lebedev et al. (2014) and tensor-train decomposition Oseledets (2011); Su et al. (2018); Yang et al. (2017) have been proposed to decompose high-dimensional tensors. In our method, we too aim to decompose the regular convolution into computationally lightweight units.

Structured Pruning. He et al. (2018, 2017); Li and Liu (2016) presented channel pruning methods where redundant channels in every layer are removed. The selection process for the redundant channels is unique to each method; for instance, He et al. (2017) addressed the channel selection problem using lasso regression. Similarly, Wen et al. (2016) used group lasso regularization to penalize and prune unimportant groups at different levels of granularity. We refer readers to Kuzmin et al. (2019) for a survey of structured pruning and tensor decomposition methods. In contrast, the method proposed in this paper does not explicitly prune; instead, our structural regularization loss imposes a structural form on the convolution kernels.

Semi-structured and Unstructured Pruning. Other works Lebedev and Lempitsky (2016); Liu et al. (2018); Elsen et al. (2019) employed block-wise sparsity (also called semi-structured pruning), which operates at a finer level than channels. Unstructured pruning methods Azarian et al. (2020); Han et al. (2015a); Kusupati et al. (2020); Zhang et al. (2018) prune at the parameter level, yielding higher compression rates. However, their unstructured nature makes it difficult to deploy them on most hardware platforms.

Using Prefixed Basis. Several works Qiu et al. (2018); Tayyab and Mahalanobis (2019) applied basis representations in deep networks. Seminal works Mallat (2012); Sifre and Mallat (2013) used wavelet bases as feature extractors. The choice of the basis is important; for example, Qiu et al. (2018) used the Fourier-Bessel basis, which led to a reduction in computational complexity. In general, tensor decomposition can be seen as basis representation learning. We propose using structured binary masks as our basis, which leads to an immediate reduction in the number of multiplications.

Orthogonal to structured compression, Lin et al. (2019); Wu et al. (2018); Zhong et al. (2018) utilized shift-based operations to reduce the overall computational load. Given the high computational cost of multiplications compared to additions Horowitz (2014), Chen et al. (2019) proposed networks where the majority of the multiplications are replaced by additions.

3 Composite Kernels

We first give a definition that encompasses a wide range of structures for convolution kernels.

Definition 1.

For given dimensions k_1×k_2×k_3, a Composite Basis β = {β_1, β_2, ..., β_M} is a linearly independent set of M binary tensors of dimension k_1×k_2×k_3 as its basis elements. That is, |β| = M, each entry of every β_m is either 0 or 1, and β_1, ..., β_M are linearly independent when viewed as vectors.

The linear independence condition implies that M ≤ k_1·k_2·k_3. Hence, the basis spans an M-dimensional subspace of the space of all k_1×k_2×k_3 tensors. The special property of the Composite Basis is that its elements are binary, which leads to an immediate reduction in the number of multiplications involved in the convolution operation.

Definition 2.

A kernel W of dimension k_1×k_2×k_3 is a Composite Kernel if it lies in the subspace spanned by the Composite Basis β. That is, it can be constructed as a linear combination of the elements of β: W = Σ_{m=1}^{M} α_m β_m, for some scalars α_1, ..., α_M.

Note that the binary structure of the underlying Composite Basis elements defines the structure of the Composite Kernel. Fig. 1 shows a 3×3 Composite Kernel constructed using different examples of a Composite Basis. In general, the underlying basis elements could have a far less regular structure than the examples shown in Fig. 1.

Conventional kernels (with no restrictions on the DOF) are just special cases of Composite Kernels, where M = k_1·k_2·k_3 and each basis element has only one nonzero entry in its k_1×k_2×k_3 grid.

3.1 Convolution with Composite Kernels

Consider a convolution with a Composite Kernel W of size C×N×N, where N is the spatial size and C is the number of input channels. To compute one output element, this kernel is convolved with a C×N×N volume of the input feature map. Let's call this volume X. The output at this point will be:

    X ⊛ W = X ⊛ (Σ_{m=1}^{M} α_m β_m) = Σ_{m=1}^{M} α_m (X ⊛ β_m) = Σ_{m=1}^{M} α_m · sum(X ⊙ β_m),    (1)

where '⊛' denotes convolution and '⊙' denotes element-wise multiplication. Since β_m is a binary tensor, X ⊛ β_m is the same as adding up the elements of X wherever β_m = 1, so no multiplications are needed for it. Ordinarily, the convolution X ⊛ W would involve CN² multiplications and CN²-1 additions. In our method, we trade multiplications for additions: from (1), we can see that we only need M multiplications, and the total number of additions becomes

    Σ_{m=1}^{M} (‖β_m‖₁ - 1) + (M - 1) = Σ_{m=1}^{M} ‖β_m‖₁ - 1,    (2)

where ‖β_m‖₁ counts the number of 1's in β_m. Depending on the structure, the number of additions per output can be larger than CN²-1. For example, in Fig. 1(b), where M = 4 and each β_m contains four 1's, we get 15 additions per output compared to 8 for an ordinary 3×3 convolution. In Sec. 4.2, we show that for structured kernels the number of additions can be amortized to as low as approximately M per output.
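As a sanity check of Eq. (1), the following NumPy sketch uses the same illustrative four-mask basis as in Fig. 1(b): the composite computation needs only M = 4 multiplications per output and matches the ordinary convolution result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative composite basis: four 2x2 binary patches inside a 3x3 grid.
betas = np.zeros((4, 3, 3))
betas[0, :2, :2] = 1
betas[1, :2, 1:] = 1
betas[2, 1:, :2] = 1
betas[3, 1:, 1:] = 1
alphas = rng.standard_normal(4)
W = np.einsum('m,mij->ij', alphas, betas)      # composite 3x3 kernel

X = rng.standard_normal((3, 3))                # one 3x3 receptive-field volume

# Eq. (1): X * W = sum_m alpha_m (X * beta_m). Because beta_m is binary, each
# inner term is just a sum of entries of X (additions only); the only
# multiplications left are the M = 4 scalings by alpha_m.
partial_sums = np.array([X[b > 0].sum() for b in betas])
out_composite = alphas @ partial_sums

out_direct = (X * W).sum()                     # ordinary convolution at this location
assert np.isclose(out_composite, out_direct)
```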

4 Structured Convolutions

Figure 2: A 4×3×3 structured kernel constructed with 8 basis elements, each having a 3×2×2 cuboid of 1's. Figure best viewed in color. Figure 3: A structured convolution is equivalent to a sum-pooling followed by a smaller convolution.
Definition 3.

A C×N×N kernel is a Structured Kernel if it is a Composite Kernel with M = (C-c+1)(N-n+1)² basis elements for some c ≤ C and n ≤ N, and if each basis tensor β_m is made of a c×n×n cuboid of 1's, with the rest of its entries being 0.

A Structured Kernel is thus characterized by its dimensions C×N×N and its underlying parameters (c, n). Convolutions performed using Structured Kernels are called Structured Convolutions.

Fig. 1(b) depicts a 2D case of a structured kernel (C = c = 1), where N = 3 and n = 2. As shown, there are (N-n+1)² = 4 basis elements and each element has a 2×2 patch of 1's.

Fig. 2 shows a 3D case, where C = 4, N = 3, c = 3 and n = 2. Here, there are (C-c+1)(N-n+1)² = 8 basis elements and each element has a 3×2×2 cuboid of 1's. Note how these cuboids of 1's (shown in colors) cover the entire 4×3×3 grid.
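The basis of a structured kernel can be enumerated directly: every basis element is a c×n×n cuboid of 1's placed at one of the (C-c+1)(N-n+1)² possible offsets inside the C×N×N grid. A small sketch (sizes chosen to match the 8-element example above):

```python
import numpy as np

def structured_basis(C, N, c, n):
    """All (C-c+1)*(N-n+1)**2 shifted placements of a c x n x n cuboid of
    ones inside a C x N x N grid, i.e. the basis of a structured kernel."""
    basis = []
    for dc in range(C - c + 1):
        for dh in range(N - n + 1):
            for dw in range(N - n + 1):
                b = np.zeros((C, N, N))
                b[dc:dc + c, dh:dh + n, dw:dw + n] = 1.0
                basis.append(b)
    return np.stack(basis)                    # shape: (M, C, N, N)

B = structured_basis(C=4, N=3, c=3, n=2)
print(B.shape)                                 # (8, 4, 3, 3): 8 basis elements
```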

4.1 Decomposition of Structured Convolutions

A major advantage of defining Structured Kernels this way is that all the basis elements are just shifted versions of each other (see Fig. 2 and Fig. 1(b)). This means that, in Eq. (1), if we consider the convolution over the entire feature map X, the summed outputs X ⊛ β_m for all m are the same map up to a shift (except at the edges of X). As a result, all of them can be computed using a single sum-pooling operation on X with a kernel size of c×n×n. Fig. 3 shows a simple example of how a convolution with a structured kernel can be broken into a sum-pooling followed by a convolution with a kernel made of the α's.

Furthermore, consider a convolutional layer of size D×C×N×N, i.e. a layer with D kernels of size C×N×N. In our design, the same underlying basis is used for the construction of all kernels in the layer. Take any two structured kernels in this layer with coefficients {α_m} and {α'_m}, i.e. W = Σ_m α_m β_m and W' = Σ_m α'_m β_m. Their convolution outputs are Σ_m α_m (X ⊛ β_m) and Σ_m α'_m (X ⊛ β_m), respectively. We can see that the X ⊛ β_m computation is common to all the kernels of this layer. Hence, the sum-pooling operation only needs to be computed once and is then reused across all D kernels.

A Structured Convolution can thus be decomposed into a sum-pooling operation and a smaller convolution operation with a kernel composed of the α's. Fig. 4 shows the decomposition of a general structured convolution layer of size D×C×N×N.

Notably, standard convolution (C×N×N), depthwise convolution (1×N×N), and pointwise convolution (C×1×1) kernels can all be constructed as 3D structured kernels, which means that this decomposition can be widely applied to existing architectures. See the supplementary material for details on applying the decomposition to convolutions with arbitrary stride, padding, and dilation.
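The decomposition can be verified numerically. The PyTorch sketch below (with small, assumed sizes) builds a structured C×N×N kernel from random α's, convolves an input with it directly, and checks that the result equals a c×n×n sum-pooling (implemented here as a 3D convolution with a kernel of ones) followed by the small convolution whose weights are the α's:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, N, c, n = 4, 3, 3, 2                        # kernel C x N x N, parameters (c, n)
Cc, Nn = C - c + 1, N - n + 1

alpha = torch.randn(Cc, Nn, Nn)                # the only learnable values

# Build the full structured kernel W = sum_m alpha_m * beta_m.
W = torch.zeros(C, N, N)
for a in range(Cc):
    for b in range(Nn):
        for d in range(Nn):
            W[a:a + c, b:b + n, d:d + n] += alpha[a, b, d]

x = torch.randn(1, C, 8, 8)

# (1) Direct convolution with the structured kernel (one output channel).
out_direct = F.conv2d(x, W.unsqueeze(0))

# (2) Decomposition: 3D sum-pooling with a c x n x n window of ones, followed
#     by a much smaller convolution whose weights are just the alpha's.
pooled = F.conv3d(x.unsqueeze(1), torch.ones(1, 1, c, n, n)).squeeze(1)
out_decomposed = F.conv2d(pooled, alpha.unsqueeze(0))

assert torch.allclose(out_direct, out_decomposed, atol=1e-5)
```

Because the sum-pooled map depends only on the input, a layer with D output channels shares it across all D kernels, which is what makes the decomposition pay off.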

Figure 4: Decomposition of Structured Convolution. On the left, the conventional operation of a structured convolutional layer of size D×C×N×N is shown. On the right, we show that it is equivalent to performing a 3D sum-pooling of size c×n×n followed by a convolutional layer of size D×(C-c+1)×(N-n+1)×(N-n+1).

4.2 Reduction in Number of Parameters and Multiplications/Additions

The sum-pooling component after decomposition requires no parameters. Thus, the total number of parameters in a convolution layer is reduced from D·C·N² (before decomposition) to D·(C-c+1)(N-n+1)² (after decomposition). The sum-pooling component is also free of multiplications. Hence, only the smaller convolution contributes multiplications after decomposition.

Before decomposition, computing every element of the output feature map Y (of spatial size H×W with D channels) involves C·N² multiplications and C·N²-1 additions. Hence, the total numbers of multiplications and additions are D·H·W·C·N² and D·H·W·(C·N²-1), respectively.

After decomposition, computing every element of Y involves (C-c+1)(N-n+1)² multiplications and (C-c+1)(N-n+1)²-1 additions. Hence, the totals for computing Y are D·H·W·(C-c+1)(N-n+1)² multiplications and D·H·W·[(C-c+1)(N-n+1)²-1] additions, respectively. In addition, computing every element of the intermediate sum-pooled output involves c·n²-1 additions. Hence, the overall number of additions can be written as:

    Total additions ≈ H·W·[ (C-c+1)(c·n²-1) + D·((C-c+1)(N-n+1)² - 1) ].

We can see that the number of parameters and the number of multiplications have both been reduced by a factor of C·N² / [(C-c+1)(N-n+1)²]. In the expression above, if D is large enough, the first term inside the brackets gets amortized and the number of additions ≈ D·H·W·(C-c+1)(N-n+1)². As a result, the number of additions is also reduced by approximately the same proportion. We will refer to C·N² / [(C-c+1)(N-n+1)²] as the compression ratio from now on.

Due to amortization, the additions per output element are approximately (C-c+1)(N-n+1)² + (C-c+1)(c·n²-1)/D, which is essentially (C-c+1)(N-n+1)², i.e. the same as the number of multiplications, since D is typically large.
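As a quick illustration, a small helper computing the compression ratio for assumed example sizes:

```python
def compression_ratio(C, N, c, n):
    """Ratio of parameters (and multiplications) before vs. after
    decomposition of a C x N x N structured kernel with parameters (c, n)."""
    before = C * N * N                          # one weight per kernel entry
    after = (C - c + 1) * (N - n + 1) ** 2      # one alpha per basis element
    return before / after

# Example with assumed sizes: a 64 x 3 x 3 kernel with (c, n) = (33, 2) keeps
# 32 * 4 = 128 alphas instead of 576 weights.
print(compression_ratio(64, 3, 33, 2))          # 4.5
```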

4.3 Extension to Fully Connected layers

For image classification networks, the last fully connected layer (sometimes called the linear layer) dominates w.r.t. the number of parameters, especially if the number of classes is high. The structural decomposition can be easily extended to linear layers by noting that a matrix multiplication is the same as performing a number of convolutions on the input. Consider a weight matrix W of size R×C and an input vector x of length C. The linear operation W·x is mathematically equivalent to convolving the input, viewed as a C×1×1 volume, with R kernels of size C×1×1, where the r-th kernel is the r-th row of W. In other words, each row of W can be considered a convolution kernel of size C×1×1.

Now, if each of these kernels (of size C×1×1) is structured with underlying parameter c (where c ≤ C), then the matrix multiplication operation can be structurally decomposed as shown in Fig. 5.

Figure 5: Structural decomposition of a matrix multiplication.

As before, we get a reduction in both the number of parameters and the number of multiplications by a factor of C/(C-c+1), and the number of additions is reduced by approximately the same factor.
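The fully-connected case can be checked the same way. In the PyTorch sketch below (assumed sizes), every row of the weight matrix is a structured 1D kernel with parameter c, so the layer reduces to a 1D sum-pooling of the input followed by a much smaller matrix multiplication with the α's:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, R, c = 16, 10, 4                      # input dim, output dim, structure parameter
L = C - c + 1                            # number of alphas per output row

alpha = torch.randn(R, L)                # the only learnable weights after decomposition

# Each row of the full R x C weight matrix is a structured 1D kernel: a
# superimposition of L shifted windows of c ones, scaled by the alphas.
W = torch.zeros(R, C)
for i in range(L):
    W[:, i:i + c] += alpha[:, i:i + 1]

x = torch.randn(C)

out_direct = W @ x                                          # ordinary linear layer
pooled = F.avg_pool1d(x.view(1, 1, C), c, stride=1) * c     # 1D sum-pooling, length L
out_decomposed = alpha @ pooled.view(L)

assert torch.allclose(out_direct, out_decomposed, atol=1e-5)
```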

5 Imposing Structure on Convolution Kernels

To apply the structural decomposition, we need the weight tensors to be structured. In this section, we propose a method to impose the desired structure on the convolution kernels via training.

From the definition, W = Σ_{m=1}^{M} α_m β_m, we can simply define a matrix A of size CN²×M such that its m-th column is the vectorized form of β_m. Hence, vec(W) = A·α, where α = [α_1, ..., α_M]^T.

Another way to see this is from the structural decomposition. We may note that the sum-pooling can also be seen as a convolution with a c×n×n kernel of all 1's; we refer to this kernel as 1_{c×n×n}. Hence, the structural decomposition is:

    X ⊛ W = (X ⊛ 1_{c×n×n}) ⊛ α.

That implies W = 1_{c×n×n} ⊛ α (with appropriate zero-padding). Since the stride of the sum-pooling involved is 1, this can be written in terms of a matrix multiplication with a Toeplitz matrix Strang (1986):

    vec(W) = toep(1_{c×n×n}) · vec(α).

Hence, the structure matrix A referred to above is basically toep(1_{c×n×n}).
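For concreteness, the structure matrix A can be materialized directly from the basis elements. The sketch below builds A for an assumed small kernel shape and checks that a kernel of the form vec(W) = A·α satisfies α = A⁺·vec(W):

```python
import numpy as np

def basis_matrix(C, N, c, n):
    """Matrix A whose m-th column is the vectorized basis element beta_m of
    a C x N x N structured kernel with parameters (c, n)."""
    cols = []
    for a in range(C - c + 1):
        for b in range(N - n + 1):
            for d in range(N - n + 1):
                beta = np.zeros((C, N, N))
                beta[a:a + c, b:b + n, d:d + n] = 1.0
                cols.append(beta.reshape(-1))
    return np.stack(cols, axis=1)               # shape: (C*N*N, M)

A = basis_matrix(C=4, N=3, c=3, n=2)
print(A.shape)                                   # (36, 8)

# A kernel W is structured iff vec(W) lies in the column space of A.
alpha = np.random.default_rng(0).standard_normal(A.shape[1])
w = A @ alpha                                    # vec(W) of a structured kernel
alpha_recovered = np.linalg.pinv(A) @ w          # A^+ vec(W) recovers the alphas
assert np.allclose(alpha_recovered, alpha)
```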

5.1 Training with Structural Regularization

Now, for a structured kernel W characterized by (C, N, c, n), there exists a vector α of length (C-c+1)(N-n+1)² such that vec(W) = A·α. Hence, a structured kernel satisfies the property vec(W) = A·A⁺·vec(W), where A⁺ is the Moore-Penrose inverse Ben-Israel and Greville (2003) of A. Based on this, we propose training the deep network with a Structural Regularization loss that gradually pushes the network's kernels to become structured during training:

    L_SR = Σ_l ‖ W_l - A A⁺ W_l ‖_F / ‖ W_l ‖_F,    (3)

where ‖·‖_F denotes the Frobenius norm, l is the layer index, and W_l collects layer l's kernels (vectorized as columns). The total training objective is the task loss plus λ·L_SR. To ensure that the regularization is applied uniformly to all layers, we use the ‖W_l‖_F normalization in the denominator. It also stabilizes the performance of the decomposition w.r.t. λ. The overall proposed training recipe is as follows:
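A minimal sketch of how such a loss could be implemented in PyTorch, assuming the projection matrices P_l = I - A_l·A_l⁺ have been precomputed for each layer's kernel shape (an illustration of Eq. (3), not the exact training code):

```python
import torch

def structural_reg_loss(weights, projections, eps=1e-8):
    """Structural Regularization sketch (Eq. 3), assuming projections[l]
    holds the precomputed matrix I - A_l @ pinv(A_l) for layer l.

    Penalizes the component of each kernel outside the column space of A_l,
    normalized by the layer's Frobenius norm so all layers are weighted
    uniformly.
    """
    loss = 0.0
    for w, P in zip(weights, projections):
        v = w.reshape(w.shape[0], -1)            # one row per output-channel kernel
        residual = v @ P.T                       # off-subspace component
        loss = loss + torch.linalg.norm(residual) / (torch.linalg.norm(v) + eps)
    return loss

# Used as: total_loss = task_loss + lam * structural_reg_loss(weights, projections)
```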

Proposed Training Scheme:

  • Step 1: Train the original architecture with the Structural Regularization loss.

    • After Step 1, all weight tensors in the deep network will be almost structured.

  • Step 2: Apply the decomposition on every layer and compute α = A⁺·vec(W).

    • This results in a smaller and more efficient decomposed architecture with the α's as the weights. Note that every convolution / linear layer from the original architecture is now replaced with a sum-pooling layer and a smaller convolution / linear layer. A sketch of this step is given after this list.
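A sketch of Step 2 for a single stride-1, unpadded Conv2d layer, assuming A_pinv is the precomputed Moore-Penrose inverse of the basis matrix for that kernel shape (an illustration, not the exact conversion code):

```python
import torch
import torch.nn as nn

def decompose_conv(conv, c, n, A_pinv):
    """Replace a stride-1, unpadded nn.Conv2d whose C x N x N kernels are
    (almost) structured with parameters (c, n) by an equivalent
    sum-pooling + smaller convolution."""
    O, C, N, _ = conv.weight.shape
    Cc, Nn = C - c + 1, N - n + 1

    # alpha = A^+ vec(W) for every output-channel kernel.
    alpha = conv.weight.detach().reshape(O, -1) @ A_pinv.T      # (O, Cc*Nn*Nn)

    small_conv = nn.Conv2d(Cc, O, Nn, bias=conv.bias is not None)
    small_conv.weight.data = alpha.reshape(O, Cc, Nn, Nn)
    if conv.bias is not None:
        small_conv.bias.data = conv.bias.detach().clone()

    # Parameter-free 3D sum-pooling with a c x n x n window of ones,
    # shared across all O kernels of the layer.
    sum_pool = nn.Conv3d(1, 1, (c, n, n), bias=False)
    sum_pool.weight.data.fill_(1.0)
    sum_pool.weight.requires_grad_(False)

    class Decomposed(nn.Module):
        def __init__(self):
            super().__init__()
            self.sum_pool, self.conv = sum_pool, small_conv

        def forward(self, x):
            pooled = self.sum_pool(x.unsqueeze(1)).squeeze(1)
            return self.conv(pooled)

    return Decomposed()
```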

The proposed scheme trains the architecture with the original kernels in place but with the structural regularization loss. The structural regularization loss restricts the degrees of freedom during training, but in a soft or gradual manner (depending on λ):

  1. If λ = 0, it is the same as normal training, with no structure imposed.

  2. If λ is very high, the regularization loss will be heavily minimized in early training iterations. Thus, the weights will effectively be optimized in a restricted M-dimensional subspace of the full kernel space.

  3. Choosing a moderate λ gives the best tradeoff between structure and model performance.

We discuss the training implementation details needed for reproduction, such as hyperparameters and training schedules, in the supplementary material, where we also show that our method is robust to the choice of λ.

6 Experiments

We apply structured convolutions to a wide range of architectures and analyze the performance and complexity of the decomposed architectures. We evaluate our method on ImageNet Russakovsky et al. (2015) and CIFAR-10 Krizhevsky et al. (2009) benchmarks for image classification and Cityscapes Cordts et al. (2016) for semantic segmentation.

Table 1: Results: ResNets on CIFAR-10
Architecture   Adds   Mults   Params   Acc. (in %)
ResNet56
Struct-56-A (ours)
Struct-56-B (ours)
Ghost-Res56 Han et al. (2019)
ShiftRes56-6 Wu et al. (2018)
AMC-Res56 He et al. (2018)
ResNet32
Struct-32-A (ours)
Struct-32-B (ours)
ResNet20
Struct-20-A (ours)
Struct-20-B (ours)
ShiftRes20-6 Wu et al. (2018)
Table 2: Results: MobileNetV2 on ImageNet
Architecture   Adds   Mults   Params   Acc. (in %)
MobileNetV2
Struct-V2-A (ours)
Struct-V2-B (ours)
AMC-MV2 He et al. (2018)
ChPrune-MV2-1.3x
Slim-MV2 Yu et al. (2018)
WeightSVD 1.3x
ChPrune-MV2-2x
Table 3: Results: ResNets on ImageNet
Architecture   Adds   Mults   Params   Acc. (in %)
ResNet50
Struct-50-A (ours)
Struct-50-B (ours)
ChPrune-R50-2x He et al. (2017)
WeightSVD-R50 Zhang et al. (2016b)
Ghost-R50 (s=2) Han et al. (2019)
Versatile-v2-R50 Wang et al. (2018)
ShiftResNet50 Wu et al. (2018)
Slim-R50 x Yu et al. (2018)
ResNet34
Struct-34-A (ours)
Struct-34-B (ours)
ResNet18
Struct-18-A (ours)
Struct-18-B (ours)
WeightSVD-R18 Zhang et al. (2016b)
ChPrune-R18-2x He et al. (2017)
ChPrune-R18-4x

Entries are shortened, e.g. ‘Channel Pruning’ as ‘ChPrune’. Results for He et al. (2017); Zhang et al. (2016b) are obtained from Kuzmin et al. (2019).

Table 4: Results: EfficientNet on ImageNet
Architecture   Adds   Mults   Params   Acc. (in %)
EfficientNet-B1 Tan and Le (2019)
Struct-EffNet (ours)
EfficientNet-B0 Tan and Le (2019)

6.1 Image Classification

We present results for ResNets He et al. (2016) in Tables 1 and 3. To demonstrate the efficacy of our method on modern networks, we also show results on MobileNetV2 Sandler et al. (2018) and EfficientNet Tan and Le (2019) in Tables 2 and 4. Note that our EfficientNet reproduction of the baselines (B0 and B1) gives results slightly inferior to Tan and Le (2019); our Struct-EffNet is created on top of this EfficientNet-B1 baseline.

To provide a comprehensive analysis, for each baseline architecture we present two structured counterparts: version "A" is designed to deliver similar accuracy, and version "B" targets extreme compression ratios. By using different per-layer configurations, we obtain structured versions with varying levels of reduction in model size and multiplications/additions (please see the supplementary material for details). For the "A" versions of the ResNets, we set the compression ratio to 2 for all layers. For the "B" versions of the ResNets, we use nonuniform compression ratios per layer; specifically, we compress stages 3 and 4 drastically (4×) and stages 1 and 2 less aggressively. Since MobileNetV2 is already a compact model, we design its "A" version to be moderately smaller and its "B" version to be substantially smaller.

We note that, on low-level hardware, additions are much more power-efficient and faster than multiplications Chen et al. (2019); Horowitz (2014). Since actual inference time depends on how software optimizations and scheduling are implemented, for the most objective comparison we report the numbers of additions and multiplications as well as the model sizes. Considering that sum-pooling can be executed efficiently on dedicated hardware units Young and Gulland (2018), our structured convolutions can be easily adapted for memory- and compute-limited devices.

Compared to the baseline models, the Struct-A versions of the ResNets are substantially smaller while maintaining only a small loss in accuracy. The more aggressive Struct-B ResNets achieve larger model-size reductions at the cost of a somewhat higher accuracy drop. Compared to other methods, Struct-56-A is more accurate than AMC-Res56 He et al. (2018) at similar complexity, and Struct-20-A exceeds ShiftResNet20-6 Wu et al. (2018) in accuracy while being significantly smaller. Similar trends are observed with Struct-Res18 and Struct-Res50 on ImageNet. Struct-56-A and Struct-50-A achieve competitive performance compared to the recent GhostNets Han et al. (2019). For MobileNetV2, which is already designed to be efficient, Struct-MV2-A achieves a further reduction in multiplications and model size with state-of-the-art performance compared to other methods; see Table 2. Applying structured convolutions to EfficientNet-B1 results in Struct-EffNet, which has performance comparable to EfficientNet-B0, as can be seen in Table 4.

Figure 6: Comparison with structured compression methods from Kuzmin et al. (2019) shows ‘ImageNet top-1 vs model size’ trade-off as well as ‘ImageNet top-1 vs multiplications’ trade-off.

The ResNet Struct-A versions have a similar number of additions and multiplications (except ResNet50) because, as noted in Sec. 4.2, the sum-pooling contribution is amortized. However, the sum-pooling starts dominating as the compression gets more aggressive, as can be seen in the number of additions for the Struct-B versions. Notably, both the "A" and "B" versions of MobileNetV2 show a dominance of the sum-pooling component. This is because the number of output channels is not large enough to amortize the sum-pooling component resulting from the decomposition of the pointwise (1×1 conv) layers.

Fig. 6 compares our method with state-of-the-art structured compression methods: WeightSVD Zhang et al. (2016b), Channel Pruning He et al. (2017), and Tensor-train Su et al. (2018). Note that these results were obtained from Kuzmin et al. (2019). Our proposed method achieves a clear improvement over the second-best method for both ResNet18 and MobileNetV2. Especially for MobileNetV2, this improvement is valuable since our method significantly outperforms all the other methods (see Struct-V2-A in Table 2).

6.2 Semantic Segmentation

Table 5: Evaluation of proposed method on Cityscapes Cordts et al. (2016) using HRNetV2-W18-Small-v2.
Architecture   #adds   #mults   #params   Mean IoU (in %)
Original
Struct-HR-A

After demonstrating the effectiveness of our method on image classification, we evaluate it on semantic segmentation, which requires reproducing fine details around object boundaries. We apply our method to the recently developed state-of-the-art HRNet Wang et al. (2019). Table 5 shows that structured convolutions can significantly improve segmentation model efficiency: the HRNet model is reduced by 50% in size and by 30% in the number of additions and multiplications, while incurring only a 1.5% drop in mIoU. More results can be found in the supplementary material.

7 Conclusion

In this work, we propose Composite Kernels and Structured Convolutions in an attempt to exploit redundancy in the implicit structure of convolution kernels. We show that Structured Convolutions can be decomposed into a computationally cheap sum-pooling component followed by a significantly smaller convolution, and that this structure can be imposed during training using an intuitive structural regularization loss. The effectiveness of the proposed method is demonstrated via extensive experiments on image classification and semantic segmentation benchmarks. Sum-pooling relies purely on additions, which are known to be extremely power-efficient. Hence, our method shows promise for deploying deep models on low-power devices. Since our method preserves the convolutional structure, it also allows the integration of further model compression schemes, which we leave as future work.

Acknowledgements

We would like to thank our Qualcomm AI Research colleagues for their support and assistance, in particular that of Andrey Kuzmin, Tianyu Jiang, Khoi Nguyen, Kwanghoon An and Saurabh Pitre.

References

  • [1] K. Azarian, Y. Bhalgat, J. Lee, and T. Blankevoort (2020) Learned threshold pruning. arXiv preprint arXiv:2003.00075. Cited by: §2.
  • [2] A. Ben-Israel and T. N. Greville (2003) Generalized inverses: theory and applications. Vol. 15, Springer Science & Business Media. Cited by: §5.1.
  • [3] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu (2019) AdderNet: do we really need multiplications in deep learning?. arXiv preprint arXiv:1912.13200. Cited by: §2, §6.1.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 5, §6.
  • [5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 1269–1277. External Links: Link Cited by: §1.
  • [6] E. Elsen, M. Dukhan, T. Gale, and K. Simonyan (2019) Fast sparse convnets. arXiv preprint arXiv:1911.09723. Cited by: §2.
  • [7] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu (2019) GhostNet: more features from cheap operations. arXiv preprint arXiv:1911.11907. Cited by: §6.1, Table 4.
  • [8] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2.
  • [9] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. CoRR abs/1506.02626. External Links: 1506.02626 Cited by: §1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. External Links: Document Cited by: §6.1.
  • [11] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: automl for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, Lecture Notes in Computer Science, Vol. 11211, pp. 815–832. External Links: Link, Document Cited by: §2, §6.1, Table 4.
  • [12] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 1398–1406. External Links: Document Cited by: §1, §2, §6.1, Table 4.
  • [13] M. Horowitz (2014) 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. Cited by: §2, §6.1.
  • [14] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §1, §1, §2.
  • [15] T. G. Kolda and B. W. Bader (2009) Tensor decompositions and applications. SIAM review 51 (3), pp. 455–500. Cited by: §2.
  • [16] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §6.
  • [17] A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957. Cited by: §1.
  • [18] A. Kusupati, V. Ramanujan, R. Somani, M. Wortsman, P. Jain, S. Kakade, and A. Farhadi (2020) Soft threshold weight reparameterization for learnable sparsity. arXiv preprint arXiv:2002.03231. Cited by: §2.
  • [19] A. Kuzmin, M. Nagel, S. Pitre, S. Pendyam, T. Blankevoort, and M. Welling (2019) Taxonomy and evaluation of structured compression of convolutional neural networks. arXiv preprint arXiv:1912.09802. Cited by: §1, §2, Figure 6, §6.1, Table 4.
  • [20] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2014) Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553. Cited by: §2.
  • [21] V. Lebedev and V. Lempitsky (2016) Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564. Cited by: §2.
  • [22] F. Li and B. Liu (2016) Ternary weight networks. arXiv preprint arxiv:1605.04711. External Links: 1605.04711 Cited by: §2.
  • [23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §1.
  • [24] J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7083–7093. Cited by: §2.
  • [25] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270. Cited by: §2.
  • [26] S. Mallat (2012) Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398. Cited by: §2.
  • [27] F. Manessi, A. Rozza, S. Bianco, P. Napoletano, and R. Schettini (2018) Automated pruning for deep neural network compression. In 24th International Conference on Pattern Recognition, ICPR 2018, Beijing, China, August 20-24, 2018, pp. 657–664. External Links: Link, Document Cited by: §1.
  • [28] I. V. Oseledets (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §2.
  • [29] Q. Qiu, X. Cheng, R. Calderbank, and G. Sapiro (2018) Dcfnet: deep neural network with decomposed convolutional filters. arXiv preprint arXiv:1802.04145. Cited by: §1, §2.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §6.
  • [31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.1.
  • [32] L. Sifre and S. Mallat (2013) Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1233–1240. Cited by: §2.
  • [33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1.
  • [34] G. Strang (1986) A proposal for toeplitz matrix calculations. Studies in Applied Mathematics 74 (2), pp. 171–176. Cited by: §5.
  • [35] J. Su, J. Li, B. Bhattacharjee, and F. Huang (2018) Tensorized spectrum preserving compression for neural networks. arXiv preprint arXiv:1805.10352. Cited by: §2, §6.1.
  • [36] C. Tai, T. Xiao, Y. Zhang, X. Wang, et al. (2015) Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067. Cited by: §2.
  • [37] M. Tan and Q. V. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §6.1, Table 4, footnote 1.
  • [38] M. Tayyab and A. Mahalanobis (2019) BasisConv: a method for compressed representation and learning in cnns. arXiv preprint arXiv:1906.04509. Cited by: §1, §2.
  • [39] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao (2019) Deep high-resolution representation learning for visual recognition. TPAMI. Cited by: §A.3, Table 7, §6.2.
  • [40] Y. Wang, C. Xu, X. Chunjing, C. Xu, and D. Tao (2018) Learning versatile filters for efficient convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1608–1618. Cited by: Table 4.
  • [41] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082. Cited by: §2.
  • [42] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer (2018) Shift: a zero flop, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9127–9135. Cited by: §2, §6.1, Table 4.
  • [43] Y. Yang, D. Krompass, and V. Tresp (2017) Tensor-train recurrent neural networks for video classification. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3891–3900. Cited by: §2.
  • [44] R. C. Young and W. J. Gulland (2018-July 24) Performing average pooling in hardware. Google Patents. Note: US Patent 10,032,110 Cited by: §6.1.
  • [45] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928. Cited by: Table 4.
  • [46] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, Lecture Notes in Computer Science, Vol. 11212, pp. 191–207. External Links: Link, Document Cited by: §1, §2.
  • [47] X. Zhang, J. Zou, K. He, and J. Sun (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. 38 (10), pp. 1943–1955. External Links: Document Cited by: §2.
  • [48] X. Zhang, J. Zou, K. He, and J. Sun (2016) Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 1943–1955. Cited by: §6.1, Table 4.
  • [49] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §A.2, §A.3, Table 7.
  • [50] H. Zhong, X. Liu, Y. He, and Y. Ma (2018) Shift-based primitives for efficient convolutional neural networks. arXiv preprint arXiv:1809.08458. Cited by: §2.

Appendix A Appendix

A.1 Structured Convolutions with arbitrary Padding, Stride and Dilation

In the main paper, we showed that a Structured Convolution can be decomposed into a sum-pooling component followed by a smaller convolution operation with a kernel composed of the α's. In this section, we discuss how to calculate the equivalent stride, padding and dilation needed for the resulting decomposed sum-pooling and convolution operations.

A.1.1 Padding

The easiest of these three attributes is padding. Fig. 7 shows an example of the decomposition of a padded structured convolution into a sum-pooling operation followed by a smaller convolution. As shown in the figure, to preserve the same output after the decomposition, the sum-pooling component should use the same padding as the original convolution, whereas the smaller convolution is performed without padding.

This leads to a more general rule: if the original convolution uses a padding of p, then, after the decomposition, the sum-pooling should be performed with padding p and the smaller convolution (with the α's) should be performed without padding.

Figure 7: Decomposition of a Structured Convolution with padding. Top shows the conventional operation of the convolution. Bottom shows the equivalent operation using sum-pooling.

A.1.2 Stride

The above rule can be simply extended to the case where the original structured convolution has a stride associated with it. The general rule is: if the original convolution uses a stride of s, then, after the decomposition, the sum-pooling should be performed with a stride of 1 and the smaller convolution (with the α's) should be performed with a stride of s.

A.1.3 Dilation

Dilated or atrous convolutions are prominent in semantic segmentation architectures. Hence, it is important to consider how we can decompose dilated structured convolutions. Fig. 8 shows an example of a dilated structured convolution. As can be seen in the figure, to preserve the same output after decomposition, both the sum-pooling component and the smaller convolution (with the α's) have to be performed with the same dilation factor as the original convolution.

Figure 8: Decomposition of a dilated Structured Convolution. Top shows the conventional operation of the convolution. Bottom shows the equivalent operation using sum-pooling.

Fig. 9 summarizes the aforementioned rules regarding padding, stride and dilation.
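These rules can be checked numerically. The PyTorch sketch below (assumed small sizes) verifies the padding and stride rules by comparing a strided, padded structured convolution with its decomposed counterpart:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, N, c, n = 4, 3, 3, 2
stride, pad = 2, 1
Cc, Nn = C - c + 1, N - n + 1

# Build a structured C x N x N kernel from random alphas.
alpha = torch.randn(Cc, Nn, Nn)
W = torch.zeros(C, N, N)
for a in range(Cc):
    for b in range(Nn):
        for d in range(Nn):
            W[a:a + c, b:b + n, d:d + n] += alpha[a, b, d]

x = torch.randn(1, C, 9, 9)

# Original structured convolution with stride s and padding p.
out_direct = F.conv2d(x, W.unsqueeze(0), stride=stride, padding=pad)

# Decomposed form: sum-pooling with the original (spatial) padding and
# stride 1, then the small alpha-convolution with stride s and no padding.
pooled = F.conv3d(x.unsqueeze(1), torch.ones(1, 1, c, n, n),
                  padding=(0, pad, pad)).squeeze(1)
out_decomposed = F.conv2d(pooled, alpha.unsqueeze(0), stride=stride)

assert torch.allclose(out_direct, out_decomposed, atol=1e-5)
```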

A.2 Training Implementation Details

Image Classification. For both the ImageNet and CIFAR-10 benchmarks, we train all the ResNet architectures from scratch with the Structural Regularization (SR) loss, keeping λ fixed throughout training for both the Struct-A and Struct-B versions. For MobileNetV2 and EfficientNet-B0, we first train the network from scratch without the SR loss (i.e. λ = 0) to obtain pretrained weights, and then continue training with the SR loss applied for further epochs.

For CIFAR-10, we train the ResNets with SGD using a step-decayed learning rate schedule and weight decay throughout training. On ImageNet, we use a cosine learning rate schedule with an SGD optimizer for training all architectures, again with weight decay applied throughout.

Figure 9: Decomposition of a general structured convolution with stride, padding and dilation. The blocked arrows indicate the dimensions of the input and output tensors. Top shows the conventional operation of the convolution. Bottom shows the equivalent operation using sum-pooling.

For MobileNetV2, we use weight decay and a fixed batch size throughout training; in the first phase (with λ = 0) we use one cosine learning rate schedule, and in the second phase we start a new cosine schedule for the remaining epochs. We train EfficientNet-B0 in the same two-phase manner, additionally using AutoAugment.

Semantic Segmentation. For training Struct-HRNet-A on Cityscapes, we start from a pretrained HRNet model and train it with the structural regularization loss, using a cosine learning rate schedule. We train at the original image resolution for 90000 iterations using a batch size of 4.

We show additional results with PSPNet in Sec. A.3 below. We follow a similar training process for training Struct-PSPNet-A where we start from a pre-trained PSPNet101 [49].

A.3 Additional results on Semantic Segmentation

In Tables 6 and 7, we present additional results for HRNetV2-W18-Small-v1 [39] (note that this is different from the HRNetV2-W18-Small-v2 reported in the main paper) and PSPNet101 [49] on the Cityscapes dataset.

Table 6: Evaluation of proposed method on Cityscapes using HRNetV2-W18-Small-v1 [39].
Architecture   #adds   #mults   #params   mIoU (in %)
Original
Struct-HR-A-V1
Table 7: Evaluation of proposed method on Cityscapes using PSPNet101 [49].
Architecture   #adds   #mults   #params   mIoU (in %)
Original
Struct-PSP-A   76.6

A.4 Layer-wise compression ratios for compared architectures

As mentioned in the Experiments section of the main paper, we use a non-uniform selection of the per-layer compression ratios for MobileNetV2 and EfficientNet-B0, as well as for HRNet for semantic segmentation. Tables 10 and 11 show the layerwise parameters for each layer of the Struct-MV2-A and Struct-MV2-B architectures. Table 12 shows these per-layer parameters for Struct-EffNet.

For Struct-HRNet-A, we apply structured convolutions only in the spatial dimensions, i.e. we use c = 1, hence there is no decomposition across the channel dimension. For 3×3 convolutional kernels, we use n = 2, which means a 3×3 convolution is decomposed into a 2×2 sum-pooling followed by a 2×2 convolution. For 1×1 convolutions, where N = 1, we use n = 1, which is the only possibility since n ≤ N. We do not use structured convolutions in the initial two convolution layers and the last convolution layer.

For Struct-PSPNet, similar to Struct-HRNet-A, we use structured convolutions in all the convolution layers except the first and last layers, with separate (c, n) settings for the 3×3 and 1×1 convolutions.

A.5 Sensitivity of Structural Regularization w.r.t. λ

In Sec. 5.1, we introduced the Structural Regularization (SR) loss and proposed to train the network using this regularization with a weight λ. In this section, we investigate the variation in the final performance of the model (after decomposition) when trained with different values of λ.

We trained Struct-Res18-A and Struct-Res18-B with different values of λ. Note that when training both the "A" and "B" versions, we start with the original ResNet18 architecture and train it from scratch with the SR loss. After this first step, we then decompose the weights using α = A⁺·vec(W) to obtain the decomposed architecture. Tables 8 and 9 show the accuracy of Struct-Res18-A and Struct-Res18-B both pre-decomposition and post-decomposition.

Table 8: ImageNet performance of Struct-Res18-A trained with different λ
λ   Top-1 Acc. (before decomposition)   Top-1 Acc. (after decomposition)
Table 9: ImageNet performance of Struct-Res18-B trained with different λ
λ   Top-1 Acc. (before decomposition)   Top-1 Acc. (after decomposition)

From Table 8, we can see that the accuracy after decomposition is not affected much by the choice of λ: the post-decomposition accuracy changes only marginally over a wide range of λ. Similar trends are observed in Table 9, where we compress more aggressively, although the sensitivity of the performance w.r.t. λ is slightly higher for the "B" version. We can also see that when λ is very small, the difference between pre-decomposition and post-decomposition accuracy is significant: the Structural Regularization loss then does not impose the desired structure on the convolution kernels effectively, and the decomposition leads to a loss in accuracy.

Idx Dimension
1 3 3
2 1 3
3 32 1
4 16 1
5 1 3
6 96 1
7 24 1
8 1 3
9 144 1
10 24 1
11 1 3
12 144 1
13 32 1
14 1 3
15 192 1
16 32 1
17 1 3
18 192 1
19 32 1
20 1 3
21 192 1
22 64 1
23 1 3
24 384 1
25 64 1
26 1 3
27 384 1
28 64 1
29 1 3
30 384 1
31 64 1
32 1 3
33 384 1
34 96 1
35 1 3
36 576 1
37 96 1
38 1 3
39 576 1
40 96 1
41 1 3
42 576 1
43 160 1
44 1 3
45 960 1
46 160 1
47 1 3
48 960 1
49 160 1
50 1 3
51 840 1
52 160 1
classifier 640 1
Table 11: Layerwise configuration for Struct-MV2-B architecture
Idx Dimension
1 3 3
2 1 3
3 32 1
4 16 1
5 1 3
6 48 1
7 12 1
8 1 3
9 72 1
10 12 1
11 1 3
12 72 1
13 16 1
14 1 3
15 96 1
16 16 1
17 1 2
18 96 1
19 16 1
20 1 2
21 96 1
22 32 1
23 1 2
24 192 1
25 32 1
26 1 2
27 192 1
28 32 1
29 1 2
30 192 1
31 32 1
32 1 2
33 192 1
34 48 1
35 1 2
36 288 1
37 48 1
38 1 2
39 288 1
40 48 1
41 1 2
42 288 1
43 80 1
44 1 2
45 480 1
46 80 1
47 1 2
48 480 1
49 80 1
50 1 3
51 480 1
52 160 1
classifier 560 1
Table 10: Layerwise configuration for Struct-MV2-A architecture
Idx Dimension
1 3 3
2 1 3
3 32 1
4 1 3
5 16 1
6 16 1
7 1 3
8 96 1
9 24 1
10 1 3
11 144 1
12 24 1
13 1 3
14 144 1
15 24 1
16 1 5
17 144 1
18 40 1
19 1 5
20 240 1
21 40 1
22 1 5
23 240 1
24 40 1
25 1 3
26 240 1
27 64 1
28 1 3
29 360 1
30 64 1
31 1 3
32 360 1
33 64 1
34 1 3
35 360 1
36 64 1
37 1 5
38 360 1
39 80 1
40 1 5
41 560 1
42 96 1
43 1 5
44 560 1
45 96 1
46 1 5
47 560 1
48 96 1
49 1 5
50 560 1
51 100 1
52 1 5
53 640 1
54 100 1
55 1 5
56 640 1
57 100 1
58 1 5
59 640 1
60 100 1
61 1 5
62 576 1
63 160 1
64 1 3
65 576 1
66 160 1
67 1 3
68 960 1
69 160 1
classifier 480 1
Table 12: Layerwise configuration for Struct-EffNet architecture