1 Introduction
Deep neural networks deliver outstanding performance across a variety of use cases but often fail to meet the computational budget requirements of mainstream devices. Hence, model efficiency plays a key role in bridging deep learning research into practice. Various model compression techniques rely on the key assumption that deep networks are overparameterized, meaning that a significant proportion of the parameters are redundant. This redundancy can appear either explicitly or implicitly. In the former case, several structured He et al. (2017); Li et al. (2017), as well as unstructured Han et al. (2015a, b); Manessi et al. (2018); Zhang et al. (2018), pruning methods have been proposed to systematically remove redundant components in the network and improve runtime efficiency. On the other hand, tensor-decomposition methods based on singular values of the weight tensors, such as spatial SVD or weight SVD, exploit the implicit redundancy of the weight tensors to construct low-rank decompositions for efficient inference Denton et al. (2014); Jaderberg et al. (2014); Kuzmin et al. (2019).

Redundancy in deep networks can also be seen as the network weights possessing unnecessarily many degrees of freedom (DOF). Alongside various regularization methods Krogh and Hertz (1992); Srivastava et al. (2014) that impose constraints to avoid overfitting, another approach for reducing the DOF is to decrease the number of learnable parameters. To this end, Jaderberg et al. (2014); Qiu et al. (2018); Tayyab and Mahalanobis (2019) propose using certain basis representations for weight tensors. In these methods, the basis vectors are fixed and only their coefficients are learnable. Thus, by using a smaller number of coefficients than the size of the weight tensors, the DOF can be effectively restricted. Note, however, that this is useful only during training, since the original, larger number of parameters is used during inference.
Qiu et al. (2018) show that systematically choosing the basis (e.g. the Fourier-Bessel basis) can lead to model size shrinkage and FLOP reduction even during inference.

In this work, we explore restricting the degrees of freedom of convolutional kernels by imposing a structure on them. This structure can be thought of as constructing the convolutional kernel by superimposing several constant-height kernels. A few examples are shown in Fig. 1, where a kernel is constructed via superimposition of $M$ linearly independent binary masks $\{\beta_m\}$ with associated constant scalars $\{\alpha_m\}$, hence leading to $M$ degrees of freedom for the kernel. The very nature of the basis elements as binary masks enables efficient execution of the convolution operation, as explained in Sec. 3.1.
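To make the superimposition concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) that builds a 3x3 kernel from four binary 2x2 masks and four scalars, so the kernel has only 4 degrees of freedom; the mask layout and the scalar values are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical 2D example: build a 3x3 kernel from M = 4 binary masks, each a
# 2x2 patch of ones placed at one of the four corners of the 3x3 grid.
masks = []
for di in (0, 1):
    for dj in (0, 1):
        b = np.zeros((3, 3))
        b[di:di + 2, dj:dj + 2] = 1.0
        masks.append(b)

alphas = np.array([0.5, -1.0, 2.0, 0.25])      # one scalar per mask: M = 4 DOF
W = sum(a * b for a, b in zip(alphas, masks))  # the superimposed 3x3 kernel
print(W)                                       # 9 entries, but only 4 free parameters
```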
In Sec. 4, we introduce Structured Convolutions as a special case of this superimposition and show that it leads to a decomposition of the convolution operation into a sum-pooling operation and a significantly smaller convolution operation. We show how this decomposition can be applied to convolutional layers as well as fully connected layers. We further propose a regularization method named Structural Regularization that promotes the normal convolution weights to have the desired structure that facilitates our proposed decomposition. Overall, our key contributions in this work are:

We introduce Composite Kernel structure, which accepts an arbitrary basis in the kernel formation, leading to an efficient convolution operation. Sec. 3 provides the definition.

We propose Structured Convolutions, a realization of the composite kernel structure. We show that a structured convolution can be decomposed into a sum-pooling operation followed by a much smaller convolution operation. A detailed analysis is provided in Sec. 4.1.

Finally, we design Structural Regularization, an effective training method to enable the structural decomposition with minimal loss of accuracy. Our process is described in Sec. 5.1.
2 Related Work
The existing literature on exploiting redundancy in deep networks can be broadly studied as follows.
Tensor Decomposition Methods. The work in Zhang et al. (2016a) proposed a Generalized SVD approach to decompose a $C_{out} \times C_{in} \times k \times k$ convolution (where $C_{out}$ and $C_{in}$ are the output and input channels, and $k$ is the spatial size) into a $k \times k$ convolution with fewer output channels followed by a $1 \times 1$ convolution. Likewise, Jaderberg et al. (2014) introduced Spatial SVD to decompose a $k \times k$ kernel into $k \times 1$ and $1 \times k$ kernels. Tai et al. (2015) further developed a non-iterative method for such low-rank decomposition. CP-decomposition Kolda and Bader (2009); Lebedev et al. (2014) and tensor-train decomposition Oseledets (2011); Su et al. (2018); Yang et al. (2017) have been proposed to decompose high-dimensional tensors. In our method, we too aim to decompose regular convolutions into computationally lightweight units.
Structured Pruning. He et al. (2018, 2017); Li and Liu (2016) presented channel pruning methods where redundant channels in every layer are removed. The selection process for the redundant channels is unique to every method; for instance, He et al. (2017) addressed the channel selection problem using lasso regression. Similarly, Wen et al. (2016) used group lasso regularization to penalize and prune unimportant groups at different levels of granularity. We refer readers to Kuzmin et al. (2019) for a survey of structured pruning and tensor decomposition methods. To our advantage, the proposed method in this paper does not explicitly prune; instead, our structural regularization loss imposes a form on the convolution kernels.

Semi-structured and Unstructured Pruning. Other works Lebedev and Lempitsky (2016); Liu et al. (2018); Elsen et al. (2019) employed block-wise sparsity (also called semi-structured pruning), which operates on a finer level than channels. Unstructured pruning methods Azarian et al. (2020); Han et al. (2015a); Kusupati et al. (2020); Zhang et al. (2018) prune at the parameter level, yielding higher compression rates. However, their unstructured nature makes it difficult to deploy them on most hardware platforms.
Using Prefixed Basis. Several works Qiu et al. (2018); Tayyab and Mahalanobis (2019) applied basis representations in deep networks. Seminal works Mallat (2012); Sifre and Mallat (2013) used wavelet bases as feature extractors. The choice of basis is important; for example, Qiu et al. (2018) used the Fourier-Bessel basis, which led to a reduction in computational complexity. In general, tensor decomposition can be seen as basis representation learning. We propose using structured binary masks as our basis, which leads to an immediate reduction in the number of multiplications.
Orthogonal to structured compression, Lin et al. (2019); Wu et al. (2018); Zhong et al. (2018) utilized shift-based operations to reduce the overall computational load. Given the high computational cost of multiplications compared to additions Horowitz (2014), Chen et al. (2019) proposed networks where the majority of the multiplications are replaced by additions.
3 Composite Kernels
We first give a definition that encompasses a wide range of structures for convolution kernels.
Definition 1.
For $M \le CN^2$, a Composite Basis $\beta = \{\beta_1, \beta_2, \dots, \beta_M\}$ is a linearly independent set of $M$ binary tensors of dimension $C \times N \times N$ as its basis elements. That is, each element $\beta_m \in \{0,1\}^{C \times N \times N}$, and the $\beta_m$'s are linearly independent.
The linear independence condition implies that $M \le CN^2$. Hence, the basis spans an $M$-dimensional subspace of $\mathbb{R}^{C \times N \times N}$. The speciality of the Composite Basis is that the basis elements are binary, which leads to an immediate reduction in the number of multiplications involved in the convolution operation.
Definition 2.
A kernel $W \in \mathbb{R}^{C \times N \times N}$ is a Composite Kernel if it lies in the subspace spanned by the Composite Basis $\beta$. That is, it can be constructed as a linear combination of the elements of $\beta$: $W = \sum_{m=1}^{M} \alpha_m \beta_m$ such that $\alpha_m \in \mathbb{R}$.
Note that the binary structure of the underlying Composite Basis elements defines the structure of the Composite Kernel. Fig. 1 shows a Composite Kernel constructed using different examples of a Composite Basis. In general, the underlying basis elements could have a more random structure than those demonstrated in the examples shown in Fig. 1.
Conventional kernels (with no restrictions on DOF) are just special cases of Composite Kernels, where $M = CN^2$ and each basis element has only one non-zero entry in its $C \times N \times N$ grid.
3.1 Convolution with Composite Kernels
Consider a convolution with a Composite Kernel $W$ of size $C \times N \times N$, where $N$ is the spatial size and $C$ is the number of input channels. To compute one output element, this kernel is convolved with a $C \times N \times N$ volume of the input feature map. Let us call this volume $X$. Therefore, the output at this point is:

$X * W \;=\; X * \Big(\sum_{m=1}^{M} \alpha_m \beta_m\Big) \;=\; \sum_{m=1}^{M} \alpha_m \,(X * \beta_m) \;=\; \sum_{m=1}^{M} \alpha_m \,\mathrm{sum}(X \odot \beta_m) \qquad (1)$

where '$*$' denotes convolution and '$\odot$' denotes element-wise multiplication. Since $\beta_m$ is a binary tensor, $X * \beta_m$ is the same as adding the elements of $X$ wherever $\beta_m = 1$; thus no multiplications are needed. Ordinarily, the convolution $X * W$ would involve $CN^2$ multiplications and $CN^2 - 1$ additions. In our method, we can trade multiplications for additions. From (1), we can see that we only need $M$ multiplications, and the total number of additions becomes:

$\sum_{m=1}^{M} \big(\lVert \beta_m \rVert_0 - 1\big) \;+\; (M - 1) \qquad (2)$
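As a sanity check of Eq. (1), the following NumPy sketch (an illustration under a small 2D composite basis of our own choosing, not reference code) verifies that convolving one input patch with a composite kernel equals summing the input over each mask's support and multiplying by the corresponding alpha, i.e. only $M$ multiplications are needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2D composite basis (C = 1, N = 3): four 2x2 patches of ones inside a 3x3 grid.
masks = [np.pad(np.ones((2, 2)), ((di, 1 - di), (dj, 1 - dj))) for di in (0, 1) for dj in (0, 1)]
alphas = rng.standard_normal(4)
W = sum(a * b for a, b in zip(alphas, masks))   # composite 3x3 kernel
X = rng.standard_normal((3, 3))                 # one input patch

direct = float(np.sum(X * W))                   # ordinary dot product: 9 multiplications
# Eq. (1): each X * beta_m is a plain sum over the mask support (additions only),
# followed by M = 4 multiplications with the alphas and M - 1 additions to combine.
composite = float(sum(a * X[b.astype(bool)].sum() for a, b in zip(alphas, masks)))

assert np.isclose(direct, composite)
```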
4 Structured Convolutions
Definition 3.
A kernel in $\mathbb{R}^{C \times N \times N}$ is a Structured Kernel if it is a Composite Kernel with $M = cn^2$ for some $c \le C$ and $n \le N$, and if each basis tensor $\beta_m$ is made of a cuboid of $1$'s of dimensions $(C-c+1) \times (N-n+1) \times (N-n+1)$, with the rest of its entries being $0$.
A Structured Kernel is characterized by its dimensions $C \times N \times N$ and its underlying parameters $(c, n)$. Convolutions performed using Structured Kernels are called Structured Convolutions.
Fig. 1(b) depicts a 2D case of a structured kernel of size $N \times N$ with parameter $n$. As shown, there are $n^2$ basis elements and each element has an $(N-n+1) \times (N-n+1)$ sized patch of $1$'s.
Fig. 2 shows a 3D case of size $C \times N \times N$ with parameters $(c, n)$. Here, there are $cn^2$ basis elements and each element has a $(C-c+1) \times (N-n+1) \times (N-n+1)$ cuboid of $1$'s. Note how these cuboids of $1$'s (shown in colors) cover the entire $C \times N \times N$ grid.
4.1 Decomposition of Structured Convolutions
A major advantage of defining Structured Kernels this way is that all the basis elements are just shifted versions of each other (see Fig. 2 and Fig. 1(b)). This means that in Eq. (1), if we consider the convolution over the entire feature map $X$, the summed outputs $X * \beta_m$ for all $m$'s are actually the same (except on the edges of $X$). As a result, these outputs can be computed using a single sum-pooling operation on $X$ with a kernel size of $(C-c+1) \times (N-n+1) \times (N-n+1)$. Fig. 3 shows a simple example of how a convolution with a structured kernel can be broken into a sum-pooling followed by a convolution with a kernel made of the $\alpha_m$'s.
Furthermore, consider a convolutional layer of size $C_{out} \times C \times N \times N$ that has $C_{out}$ kernels of size $C \times N \times N$. In our design, the same underlying basis is used for the construction of all kernels in the layer. Consider any two structured kernels in this layer with coefficients $\{\alpha_m\}$ and $\{\alpha'_m\}$, i.e. $W = \sum_m \alpha_m \beta_m$ and $W' = \sum_m \alpha'_m \beta_m$. The convolution outputs with these two kernels are, respectively, $\sum_m \alpha_m (X * \beta_m)$ and $\sum_m \alpha'_m (X * \beta_m)$. We can see that the $X * \beta_m$ computation is common to all the kernels of this layer. Hence, the sum-pooling operation only needs to be computed once and then reused across all the kernels.
A Structured Convolution can thus be decomposed into a sum-pooling operation and a smaller convolution operation with a kernel composed of the $\alpha_m$'s. Fig. 4 shows the decomposition of a general structured convolution layer of size $C_{out} \times C \times N \times N$ into a sum-pooling of kernel size $(C-c+1) \times (N-n+1) \times (N-n+1)$ followed by a convolution of size $C_{out} \times c \times n \times n$.
Notably, standard convolution ($C \times N \times N$), depthwise convolution ($1 \times N \times N$), and pointwise convolution ($C \times 1 \times 1$) kernels can all be constructed as 3D structured kernels, which means that this decomposition can be widely applied to existing architectures. See the supplementary material for more details on applying the decomposition to convolutions with arbitrary stride, padding, and dilation.
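The following PyTorch sketch (our own minimal example with arbitrary sizes, not the authors' implementation) checks the decomposition numerically for a 2D structured kernel with $C = 1$, $N = 3$, $n = 2$: the ordinary convolution equals a 2x2 sum-pooling (stride 1) followed by a 2x2 convolution with the alphas.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, N = 1, 3            # 2D case: one input channel, 3x3 kernel
c, n = 1, 2            # structure parameters -> basis cuboids of size 1 x 2 x 2

# Build a random structured kernel by superimposing the n*n shifted binary cuboids.
alphas = torch.randn(n, n)
W = torch.zeros(1, C, N, N)
for p in range(n):
    for q in range(n):
        W[0, 0, p:p + N - n + 1, q:q + N - n + 1] += alphas[p, q]

X = torch.randn(1, C, 8, 8)

# (a) ordinary convolution with the structured kernel
y_direct = F.conv2d(X, W)

# (b) decomposition: 2x2 sum-pooling (stride 1), then a 2x2 convolution with the alphas
sum_pooled = F.avg_pool2d(X, kernel_size=N - n + 1, stride=1) * (N - n + 1) ** 2
y_decomposed = F.conv2d(sum_pooled, alphas.view(1, 1, n, n))

assert torch.allclose(y_direct, y_decomposed, atol=1e-5)
```

The sum-pooled map depends only on the input, so for a layer with many output kernels it is computed once and shared across all of them, exactly as described above.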
4.2 Reduction in Number of Parameters and Multiplications/Additions
The sum-pooling component after decomposition requires no parameters. Thus, the total number of parameters in a convolution layer is reduced from $C_{out} \cdot CN^2$ (before decomposition) to $C_{out} \cdot cn^2$ (after decomposition). The sum-pooling component is also free of multiplications. Hence, only the smaller convolution contributes multiplications after decomposition.
Before decomposition, computing every element of the output feature map $Y$ (of size $C_{out} \times H' \times W'$) involves $CN^2$ multiplications and $CN^2 - 1$ additions. Hence, the total number of multiplications is $C_{out} H' W' \cdot CN^2$ and the total number of additions is $C_{out} H' W' \cdot (CN^2 - 1)$.
After decomposition, computing every output element of $Y$ from the intermediate sum-pooled output involves $cn^2$ multiplications and $cn^2 - 1$ additions. Hence, the total numbers of multiplications and additions involved in computing $Y$ from the sum-pooled output are $C_{out} H' W' \cdot cn^2$ and $C_{out} H' W' \cdot (cn^2 - 1)$, respectively. Now, computing every element of the intermediate sum-pooled output involves $(C-c+1)(N-n+1)^2 - 1$ additions, and this intermediate output has approximately $c \cdot H' W'$ elements. Hence, the overall total number of additions can be written as:

$\text{Total additions} \;\approx\; c\,H'W'\big[(C-c+1)(N-n+1)^2 - 1\big] \;+\; C_{out}\,H'W'\,(cn^2 - 1),$

where the first term is the (shared) sum-pooling cost and the second term is the cost of the smaller convolutions. We can see that the number of parameters and the number of multiplications have both been reduced by a factor of $\frac{CN^2}{cn^2}$. In the expression above, if $C_{out}$ is large enough, the first (sum-pooling) term gets amortized and the number of additions becomes $\approx C_{out} H' W' \cdot cn^2$. As a result, the number of additions is also reduced by approximately the same proportion. We will refer to $\frac{CN^2}{cn^2}$ as the compression ratio from now on.
Due to amortization, the number of additions per output element is $\approx cn^2$, which is essentially the same as the number of multiplications, since the sum-pooling cost is shared across the $C_{out}$ output kernels.
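For concreteness, here is a small back-of-the-envelope computation of the reduction factors for one layer; the layer and structure sizes are hypothetical and only meant to illustrate the formulas above.

```python
# Illustrative counts for a single layer with hypothetical sizes (not from the paper).
C_out, C, N = 64, 64, 3          # layer of size C_out x C x N x N
c, n = 48, 2                     # chosen structure parameters

params_before = C_out * C * N * N                   # 36864
params_after = C_out * c * n * n                    # 12288
compression_ratio = params_before / params_after    # = C*N^2 / (c*n^2) = 3.0

mults_per_output_before = C * N * N                 # 576
mults_per_output_after = c * n * n                  # 192: sum-pooling adds no multiplications

print(compression_ratio, mults_per_output_before, mults_per_output_after)
```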
4.3 Extension to Fully Connected layers
For image classification networks, the last fully connected layer (sometimes called the linear layer) dominates w.r.t. the number of parameters, especially if the number of classes is high. The structural decomposition can be easily extended to linear layers by noting that a matrix multiplication is the same as performing a number of convolutions on the input. Consider a weight matrix $W \in \mathbb{R}^{P \times Q}$ and an input vector $x \in \mathbb{R}^{Q}$. The linear operation $Wx$ is mathematically equivalent to the convolution $\hat{x} * \hat{W}$, where $\hat{x}$ is the same as $x$ but with dimensions $Q \times 1 \times 1$ and $\hat{W}$ is the same as $W$ but with dimensions $P \times Q \times 1 \times 1$. In other words, each row of $W$ can be considered a convolution kernel of size $Q \times 1 \times 1$.
Now, if each of these kernels (of size $Q \times 1 \times 1$) is structured with underlying parameter $q$ (where $q \le Q$), then the matrix multiplication operation can be structurally decomposed as shown in Fig. 5.
As before, we get a reduction in both the number of parameters and the number of multiplications by a factor of $\frac{Q}{q}$, as well as a reduction in the number of additions by approximately the same factor.
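The sketch below (again a toy example of our own, with an assumed weight size and structure parameter) shows the fully connected case: a $P \times Q$ matrix whose rows are built from $q$ shifted runs of ones, applied to an input vector, reduces to a 1D sum-pooling followed by a much smaller $P \times q$ matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, q = 10, 16, 4                    # hypothetical sizes: W is P x Q, structure parameter q

# Structured rows: each row of W is a combination of q shifted runs of ones of length Q - q + 1.
A_small = rng.standard_normal((P, q))  # the coefficients (alphas) that remain after decomposition
basis = np.stack([np.pad(np.ones(Q - q + 1), (m, q - 1 - m)) for m in range(q)])  # q x Q
W = A_small @ basis                    # the structured P x Q weight matrix

x = rng.standard_normal(Q)

y_direct = W @ x                                              # P*Q multiplications
s = np.convolve(x, np.ones(Q - q + 1), mode="valid")          # 1D sum-pooling of x, length q
y_decomposed = A_small @ s                                    # only P*q multiplications

assert np.allclose(y_direct, y_decomposed)
```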
5 Imposing Structure on Convolution Kernels
To apply the structural decomposition, we need the weight tensors to be structured. In this section, we propose a method to impose the desired structure on the convolution kernels via training.
From the definition $W = \sum_{m=1}^{M} \alpha_m \beta_m$, we can simply define a matrix $A$ such that its $m$-th column is the vectorized form of $\beta_m$. Hence, $\mathrm{vec}(W) = A\alpha$, where $\alpha = [\alpha_1, \dots, \alpha_M]^{\top}$.
Another way to see this is from the structural decomposition. We may note that the sum-pooling can also be seen as a convolution with a kernel of all $1$'s; we refer to this kernel as $\mathbf{1}$ (of size $(C-c+1) \times (N-n+1) \times (N-n+1)$). Hence, the structural decomposition is:

$X * W = (X * \mathbf{1}) * W_{\alpha},$

where $W_{\alpha} \in \mathbb{R}^{c \times n \times n}$ is the smaller kernel made of the $\alpha_m$'s. This implies $W = \mathbf{1} * W_{\alpha}$. Since the stride of the sum-pooling involved is $1$, this can be written in terms of a matrix multiplication with a Toeplitz matrix Strang (1986):

$\mathrm{vec}(W) = \mathrm{Toeplitz}(\mathbf{1}) \cdot \mathrm{vec}(W_{\alpha}).$

Hence, the structure matrix $A$ referred to above is basically $\mathrm{Toeplitz}(\mathbf{1})$.
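To illustrate the structure matrix, the following sketch (with arbitrary small sizes) builds $A$ column by column from the vectorized shifted binary cuboids and checks that, for an exactly structured kernel, $\mathrm{vec}(W)$ lies in the column space of $A$, i.e. $(I - AA^{+})\,\mathrm{vec}(W) = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, c, n = 4, 3, 2, 2                 # hypothetical kernel and structure sizes

# Columns of A are the vectorized shifted binary cuboids of size (C-c+1) x (N-n+1) x (N-n+1).
cols = []
for dc in range(c):
    for di in range(n):
        for dj in range(n):
            b = np.zeros((C, N, N))
            b[dc:dc + C - c + 1, di:di + N - n + 1, dj:dj + N - n + 1] = 1.0
            cols.append(b.ravel())
A = np.stack(cols, axis=1)              # shape (C*N*N, c*n*n) = (36, 8)

alpha_true = rng.standard_normal(c * n * n)
w = A @ alpha_true                      # vec(W) of an exactly structured kernel
alpha = np.linalg.pinv(A) @ w           # recover the coefficients
assert np.allclose(A @ alpha, w)        # (I - A A^+) vec(W) = 0 for structured kernels
```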
5.1 Training with Structural Regularization
Now, for a structured kernel characterized by $(C, N, c, n)$, there exists an $\alpha$ of length $cn^2$ such that $\mathrm{vec}(W) = A\alpha$. Hence, a structured kernel satisfies the property $(I - AA^{+})\,\mathrm{vec}(W) = 0$, where $A^{+}$ is the Moore-Penrose inverse Ben-Israel and Greville (2003) of $A$. Based on this, we propose training a deep network with a Structural Regularization loss that gradually pushes the deep network's kernels to be structured via training:

$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda \sum_{l} \frac{\big\lVert (I - A_l A_l^{+})\,\mathrm{vec}(W_l) \big\rVert_F}{\big\lVert \mathrm{vec}(W_l) \big\rVert_F} \qquad (3)$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm and $l$ is the layer index. To ensure that the regularization is applied uniformly to all layers, we use the normalization in the denominator; it also stabilizes the performance of the decomposition w.r.t. $\lambda$. The overall proposed training recipe is as follows (a minimal code sketch of the regularizer is given below):
Proposed Training Scheme:

Step 1: Train the original architecture with the Structural Regularization loss. After Step 1, all weight tensors in the deep network will be almost structured.

Step 2: Apply the decomposition on every layer and compute $\alpha_l = A_l^{+}\,\mathrm{vec}(W_l)$. This results in a smaller and more efficient decomposed architecture with the $\alpha$'s as the weights. Note that every convolution / linear layer from the original architecture is now replaced with a sum-pooling layer and a smaller convolution / linear layer.

The proposed scheme trains the architecture with the original kernels in place but with the structural regularization loss. The structural regularization loss restricts the degrees of freedom during training, but in a soft or gradual manner (depending on $\lambda$):

If $\lambda = 0$, it is the same as normal training with no structure imposed.

If $\lambda$ is very high, the regularization loss will be heavily minimized in early training iterations. Thus, the weights will be optimized in a restricted $cn^2$-dimensional subspace of $\mathbb{R}^{CN^2}$.

Choosing a moderate $\lambda$ gives the best trade-off between structure and model performance.
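As a rough illustration of Step 1's regularizer and Step 2's extraction of the coefficients, here is a minimal PyTorch sketch for a single layer. The function names, the per-layer structure matrix A, and the weight lambda_sr are our own assumptions for the example, not the authors' reference implementation, and the exact normalization may differ slightly from the paper's.

```python
import torch

def structural_reg(weight: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    # Per-layer term of Eq. (3): ||(I - A A^+) vec(W)||_F / ||vec(W)||_F,
    # applied jointly to all output kernels of the layer.
    w = weight.reshape(weight.shape[0], -1)    # rows are vec(W_k), k = 1..C_out
    proj = A @ torch.linalg.pinv(A)            # projector onto the structured subspace
    residual = w - w @ proj                    # (I - A A^+) vec(W_k) for every kernel
    return residual.norm() / w.norm()

def extract_alphas(weight: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    # Step 2: alpha_k = A^+ vec(W_k) for every output kernel; result is C_out x (c*n*n).
    w = weight.reshape(weight.shape[0], -1)
    return w @ torch.linalg.pinv(A).T

# Hypothetical use inside a training step (A_l built from the layer's shifted binary cuboids):
# loss = task_loss + lambda_sr * sum(structural_reg(m.weight, A_l) for (m, A_l) in layers)
```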
We discuss training implementation details for reproduction, such as hyperparameters and training schedules, in the supplementary material, where we also show that our method is robust to the choice of $\lambda$.
6 Experiments
We apply structured convolutions to a wide range of architectures and analyze the performance and complexity of the decomposed architectures. We evaluate our method on the ImageNet Russakovsky et al. (2015) and CIFAR-10 Krizhevsky et al. (2009) benchmarks for image classification, and on Cityscapes Cordts et al. (2016) for semantic segmentation.
Architecture  Adds ()  Mults ()  Params ()  Acc. (in %) 
ResNet56  
Struct56A (ours)  
Struct56B (ours)  
GhostRes56 Han et al. (2019)  
ShiftRes566 Wu et al. (2018)  
AMCRes56 He et al. (2018)  –  
ResNet32  
Struct32A (ours)  
Struct32B (ours)  
ResNet20  
Struct20A (ours)  
Struct20B (ours)  
ShiftRes206 Wu et al. (2018) 
Architecture  Adds ()  Mults ()  Params ()  Acc. (in %) 
MobileNetV2  
StructV2A (ours)  
StructV2B (ours)  
AMCMV2 He et al. (2018)  –  
ChPruneMV21.3x  
SlimMV2 Yu et al. (2018)  
WeightSVD 1.3x  
ChPruneMV22x 
Architecture  Adds ()  Mults ()  Params ()  Acc. (in %) 
ResNet50  
Struct50A (ours)  
Struct50B (ours)  
ChPruneR502x He et al. (2017)  
WeightSVDR50 Zhang et al. (2016b)  
GhostR50 (s=2) Han et al. (2019)  
Versatilev2R50 Wang et al. (2018)  
ShiftResNet50 Wu et al. (2018)  –  –  
SlimR50 x Yu et al. (2018)  
ResNet34  
Struct34A (ours)  
Struct34B (ours)  
ResNet18  
Struct18A (ours)  
Struct18B (ours)  
WeightSVDR18 Zhang et al. (2016b)  
ChPruneR182x He et al. (2017)  
ChPruneR184x 
Entries are shortened, e.g. ‘Channel Pruning’ as ‘ChPrune’. Results for He et al. (2017); Zhang et al. (2016b) are obtained from Kuzmin et al. (2019).
Architecture  Adds ()  Mults ()  Params ()  Acc. (in %) 
EfficientNetB1 Tan and Le (2019)  
StructEffNet (ours)  
EfficientNetB0 Tan and Le (2019) 
6.1 Image Classification
We present results for ResNets He et al. (2016) in Tables 4 and 4. To demonstrate the efficacy of our method on modern networks, we also show results on MobileNetV2 Sandler et al. (2018) and EfficientNet Tan and Le (2019) in Tables 4 and 4. (Our EfficientNet reproductions of the baselines (B0 and B1) give results slightly inferior to Tan and Le (2019); our StructEffNet is built on top of this EfficientNet-B1 baseline.)
To provide a comprehensive analysis, for each baseline architecture we present structured counterparts: version "A" is designed to deliver similar accuracy, and version "B" targets extreme compression ratios. Using different per-layer configurations, we obtain structured versions with varying levels of reduction in model size and multiplications/additions (please see the supplementary material for details). For the "A" versions of the ResNets, we set the compression ratio $\frac{CN^2}{cn^2}$ to 2 for all layers. For the "B" versions of the ResNets, we use non-uniform compression ratios per layer; specifically, we compress stages 3 and 4 drastically (4x) and stages 1 and 2 more mildly. Since MobileNetV2 is already a compact model, we design its "A" version to be moderately compressed and its "B" version to be compressed more aggressively.
We note that, on low-level hardware, additions are much more power-efficient and faster than multiplications Chen et al. (2019); Horowitz (2014). Since the actual inference time depends on how software optimizations and scheduling are implemented, for the most objective comparison we report the number of additions/multiplications and the model sizes. Considering that sum-pooling can be executed on dedicated hardware units Young and Gulland (2018), our structured convolutions can be easily adapted for memory- and compute-limited devices.
Compared to the baseline models, the StructA versions of the ResNets are considerably smaller while maintaining only a small loss in accuracy. The more aggressive StructB ResNets achieve a larger reduction in model size with a moderate accuracy drop. Compared to other methods, Struct56A is better than AMCRes56 He et al. (2018) at similar complexity, and Struct20A exceeds ShiftResNet206 Wu et al. (2018) while being significantly smaller. Similar trends are observed with StructRes18 and StructRes50 on ImageNet. Struct56A and Struct50A achieve competitive performance compared to the recent GhostNets Han et al. (2019). For MobileNetV2, which is already designed to be efficient, StructMV2A achieves a further reduction in multiplications and model size with state-of-the-art performance compared to other methods; see Table 4. Applying structured convolutions to EfficientNetB1 results in StructEffNet, which has performance comparable to EfficientNetB0, as can be seen in Table 4.
The ResNet StructA versions have similar numbers of additions and multiplications (except ResNet50) because, as noted in Sec. 4.2, the sum-pooling contribution is amortized. However, sum-pooling starts dominating as the compression gets more aggressive, as can be seen in the number of additions for the StructB versions. Notably, both the "A" and "B" versions of MobileNetV2 observe a dominance of the sum-pooling component. This is because the number of output channels is not large enough to amortize the sum-pooling component resulting from the decomposition of the pointwise ($1 \times 1$ conv) layers.
Fig. 6 compares our method with state-of-the-art structured compression methods: WeightSVD Zhang et al. (2016b), Channel Pruning He et al. (2017), and Tensor-train Su et al. (2018). Note that these results were obtained from Kuzmin et al. (2019). Our proposed method achieves a clear improvement over the second-best method for both ResNet18 and MobileNetV2. Especially for MobileNetV2, this improvement is valuable since our method significantly outperforms all the other methods (see StructMV2A in Table 4).
6.2 Semantic Segmentation
HRNetV2W18 Smallv2  #adds ()  #mults ()  #params ()  Mean IoU (in %) 
Original  
StructHRA 
After demonstrating the effectiveness of our method on image classification, we evaluate it on semantic segmentation, which requires reproducing fine details around object boundaries. We apply our method to the recently developed state-of-the-art HRNet Wang et al. (2019). Table 5 shows that structured convolutions can significantly improve the efficiency of our segmentation model: the HRNet model is reduced by 50% in size and by 30% in the number of additions and multiplications, while incurring only a 1.5% drop in mIoU. More results can be found in the supplementary material.
7 Conclusion
In this work, we propose Composite Kernels and Structured Convolutions in an attempt to exploit redundancy in the implicit structure of convolution kernels. We show that a Structured Convolution can be decomposed into a computationally cheap sum-pooling component followed by a significantly smaller convolution, and that this structure can be induced by training the model with an intuitive structural regularization loss. The effectiveness of the proposed method is demonstrated via extensive experiments on image classification and semantic segmentation benchmarks. Sum-pooling relies purely on additions, which are known to be extremely power-efficient. Hence, our method shows promise for deploying deep models on low-power devices. Since our method keeps the convolutional structures, it allows integration of further model compression schemes, which we leave as future work.
Acknowledgements
We would like to thank our Qualcomm AI Research colleagues for their support and assistance, in particular that of Andrey Kuzmin, Tianyu Jiang, Khoi Nguyen, Kwanghoon An and Saurabh Pitre.
References
[1] (2020) Learned threshold pruning. arXiv preprint arXiv:2003.00075.
[2] (2003) Generalized inverses: theory and applications. Vol. 15, Springer Science & Business Media.
[3] (2019) AdderNet: do we really need multiplications in deep learning? arXiv preprint arXiv:1912.13200.
[4] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montreal, Quebec, Canada, pp. 1269–1277.
[6] (2019) Fast sparse convnets. arXiv preprint arXiv:1911.09723.
[7] (2019) GhostNet: more features from cheap operations. arXiv preprint arXiv:1911.11907.
[8] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[9] (2015) Learning both weights and connections for efficient neural networks. CoRR abs/1506.02626.
[10] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 27–30, 2016, pp. 770–778.
[11] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In Computer Vision – ECCV 2018 – 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VII, Lecture Notes in Computer Science, Vol. 11211, pp. 815–832.
[12] (2017) Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, October 22–29, 2017, pp. 1398–1406.
[13] (2014) 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
[14] (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
[15] (2009) Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500.
[16] (2009) Learning multiple layers of features from tiny images.
[17] (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950–957.
[18] (2020) Soft threshold weight reparameterization for learnable sparsity. arXiv preprint arXiv:2002.03231.
[19] (2019) Taxonomy and evaluation of structured compression of convolutional neural networks. arXiv preprint arXiv:1912.09802.
[20] (2014) Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553.
[21] (2016) Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564.
[22] (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711.
[23] (2017) Pruning filters for efficient convnets. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, April 24–26, 2017, Conference Track Proceedings.
[24] (2019) TSM: temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7083–7093.
[25] (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
[26] (2012) Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398.
[27] (2018) Automated pruning for deep neural network compression. In 24th International Conference on Pattern Recognition (ICPR 2018), Beijing, China, August 20–24, 2018, pp. 657–664.
[28] (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317.
[29] (2018) DCFNet: deep neural network with decomposed convolutional filters. arXiv preprint arXiv:1802.04145.
[30] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
[31] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] (2013) Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1233–1240.
[33] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[34] (1986) A proposal for Toeplitz matrix calculations. Studies in Applied Mathematics 74 (2), pp. 171–176.
[35] (2018) Tensorized spectrum preserving compression for neural networks. arXiv preprint arXiv:1805.10352.
[36] (2015) Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067.
[37] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
[38] (2019) BasisConv: a method for compressed representation and learning in CNNs. arXiv preprint arXiv:1906.04509.
[39] (2019) Deep high-resolution representation learning for visual recognition. TPAMI.
[40] (2018) Learning versatile filters for efficient convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1608–1618.
[41] (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
[42] (2018) Shift: a zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9127–9135.
[43] (2017) Tensor-train recurrent neural networks for video classification. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 3891–3900.
[44] (2018) Performing average pooling in hardware. Google Patents. US Patent 10,032,110 (July 24, 2018).
[45] (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928.
[46] (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In Computer Vision – ECCV 2018 – 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VIII, Lecture Notes in Computer Science, Vol. 11212, pp. 191–207.
[47] (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. 38 (10), pp. 1943–1955.
[48] (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955.
[49] (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.
[50] (2018) Shift-based primitives for efficient convolutional neural networks. arXiv preprint arXiv:1809.08458.
Appendix A Appendix
A.1 Structured Convolutions with Arbitrary Padding, Stride and Dilation
In the main paper, we showed that a Structured Convolution can be decomposed into a sum-pooling component followed by a smaller convolution operation with a kernel composed of the $\alpha_m$'s. In this section, we discuss how to calculate the equivalent stride, padding, and dilation needed for the resulting decomposed sum-pooling and convolution operations.
A.1.1 Padding
The easiest of these three attributes is padding. Fig. 7 shows an example of a structured convolution whose kernel is decomposed into a sum-pooling operation followed by a smaller convolution. As shown in the figure, to preserve the same output after the decomposition, the sum-pooling component should use the same padding as the original convolution, whereas the smaller convolution is performed without padding.
This leads us to a more general result: if the original convolution uses a padding of $p$, then, after the decomposition, the sum-pooling should be performed with padding $p$ and the smaller convolution (with the $\alpha_m$'s) should be performed without padding.
A.1.2 Stride
The above rule can be simply extended to the case where the original structured convolution has a stride associated with it. The general rule is: if the original convolution uses a stride of $s$, then, after the decomposition, the sum-pooling should be performed with a stride of $1$ and the smaller convolution (with the $\alpha_m$'s) should be performed with a stride of $s$.
A.1.3 Dilation
Dilated or atrous convolutions are prominent in semantic segmentation architectures. Hence, it is important to consider how we can decompose dilated structured convolutions. Fig. 8 shows an example of a structured convolution with a dilation of $d$. As can be seen in the figure, to preserve the same output after decomposition, both the sum-pooling component and the smaller convolution (with the $\alpha_m$'s) have to be performed with the same dilation factor $d$ as the original convolution.
Fig. 9 summarizes the aforementioned rules regarding padding, stride, and dilation.
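The sketch below (our own check with arbitrary hypothetical settings, implementing the sum-pooling as a convolution with an all-ones kernel so that dilation is supported) verifies the three rules together for a 2D structured kernel: padding goes to the sum-pooling, stride goes to the smaller convolution, dilation goes to both, and the sum-pooling itself uses stride 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, n = 3, 2
stride, padding, dilation = 2, 2, 2       # hypothetical settings of the original convolution

# A random 2D structured kernel (C = 1) built from the n*n shifted patches of ones.
alphas = torch.randn(n, n)
W = torch.zeros(1, 1, N, N)
for p in range(n):
    for q in range(n):
        W[0, 0, p:p + N - n + 1, q:q + N - n + 1] += alphas[p, q]

X = torch.randn(1, 1, 11, 11)
y_direct = F.conv2d(X, W, stride=stride, padding=padding, dilation=dilation)

# Decomposed path: padding and dilation on the sum-pooling (stride 1),
# stride and dilation on the smaller convolution, no padding on the latter.
ones = torch.ones(1, 1, N - n + 1, N - n + 1)
S = F.conv2d(X, ones, stride=1, padding=padding, dilation=dilation)   # sum-pooling
y_decomposed = F.conv2d(S, alphas.view(1, 1, n, n), stride=stride, dilation=dilation)

assert torch.allclose(y_direct, y_decomposed, atol=1e-5)
```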
A.2 Training Implementation Details
Image Classification. For both the ImageNet and CIFAR-10 benchmarks, we train all the ResNet architectures from scratch with the Structural Regularization (SR) loss, keeping $\lambda$ fixed throughout training (with separate values for the StructA and StructB versions). For MobileNetV2, we first train the deep network from scratch without the SR loss (i.e. $\lambda = 0$) for a number of epochs to obtain pretrained weights and then apply the SR loss for further epochs. For EfficientNetB0, we likewise first train without the SR loss and then apply the SR loss for further epochs.
For CIFAR-10, we train the ResNets with an initial learning rate that is decayed by a fixed factor at scheduled epochs, and we apply weight decay throughout training. On ImageNet, we use a cosine learning rate schedule with an SGD optimizer for training all architectures, again with weight decay.
For MobileNetV2, we use weight decay and a fixed batch size throughout training; for the second, SR-regularized phase we start a new cosine learning rate schedule with a smaller initial learning rate. We train EfficientNetB0 using AutoAugment, weight decay, and a fixed batch size, again starting a new cosine schedule for the second phase.
Semantic Segmentation. For training StructHRNetA on Cityscapes, we start from a pretrained HRNet model and train with the structural regularization loss, using a fixed $\lambda$. We use a cosine learning rate schedule and train at the original image resolution. We train for 90000 iterations using a batch size of 4.
A.3 Additional Results on Semantic Segmentation
In Tables 6 and 7, we present additional results for HRNetV2-W18-Small-v1 [39] (note this is different from the HRNetV2-W18-Small-v2 reported in the main paper) and PSPNet101 [49] on the Cityscapes dataset.
HRNetV2W18 Smallv1  #adds ()  #mults ()  #params ()  mIoU (in %) 
Original  
StructHRAV1 
PSPNet101  #adds ()  #mults ()  #params ()  mIoU (in %) 
Original  
StructPSPA  76.6 
A.4 Layer-wise Compression Ratios for Compared Architectures
As mentioned in the Experiments section of the main paper, we use a non-uniform selection of the per-layer compression ratios for MobileNetV2 and EfficientNetB0, as well as for the HRNet used for semantic segmentation. Tables 10 and 11 show the layer-wise parameters for each layer of the StructMV2A and StructMV2B architectures. Table 12 shows these per-layer parameters for StructEffNet.
For StructHRNetA, we apply Structured Convolutions only in the spatial dimension, i.e. we use $c = C$, hence there is no decomposition across the channel dimension. For $3 \times 3$ convolutional kernels, we use $n < N = 3$, which means a $3 \times 3$ convolution is decomposed into a sum-pooling followed by a smaller $n \times n$ convolution. And for $1 \times 1$ convolutions, where $N = 1$, we use $n = 1$, which is the only possibility for $n$ since $n \le N$. We do not use Structured Convolutions in the initial two convolution layers and the last convolution layer.
For StructPSPNet, similar to StructHRNetA, we apply structured convolutions in all the convolution layers except the first and last layers. For the $3 \times 3$ convolutions and for the $1 \times 1$ convolutions, the structured convolutions use separately chosen values of the parameters $(c, n)$.
A.5 Sensitivity of Structural Regularization w.r.t. $\lambda$
In Sec. 5.1, we introduced the Structural Regularization (SR) loss and proposed to train the network using this regularization with a weight $\lambda$. In this section, we investigate the variation in the final performance of the model (after decomposition) when trained with different values of $\lambda$.
We trained StructRes18A and StructRes18B with different values of $\lambda$. Note that when training both the "A" and "B" versions, we start with the original ResNet18 architecture and train it from scratch with the SR loss. After this first step, we then decompose the weights using $A^{+}$ to get the decomposed architecture. Tables 8 and 9 show the accuracy of StructRes18A and StructRes18B both pre-decomposition and post-decomposition.
$\lambda$  Top-1 Acc. (before decomposition)  Top-1 Acc. (after decomposition)
$\lambda$  Top-1 Acc. (before decomposition)  Top-1 Acc. (after decomposition)
From Table 8, we can see that the accuracy after decomposition is not affected much by the choice of $\lambda$: across the range of $\lambda$ values tested, the post-decomposition accuracy changes only marginally. Similar trends are observed in Table 9 when we compress more aggressively, although the sensitivity of the performance w.r.t. $\lambda$ is slightly higher for the "B" version. Also, we can see that when $\lambda$ is very small, the difference between pre-decomposition and post-decomposition accuracy is significant. Since $\lambda$ is very small in this case, the Structural Regularization loss does not impose the desired structure on the convolution kernels effectively. As a result, the decomposition leads to a loss in accuracy.
Idx  Dimension  
1  3  3  
2  1  3  
3  32  1  
4  16  1  
5  1  3  
6  96  1  
7  24  1  
8  1  3  
9  144  1  
10  24  1  
11  1  3  
12  144  1  
13  32  1  
14  1  3  
15  192  1  
16  32  1  
17  1  3  
18  192  1  
19  32  1  
20  1  3  
21  192  1  
22  64  1  
23  1  3  
24  384  1  
25  64  1  
26  1  3  
27  384  1  
28  64  1  
29  1  3  
30  384  1  
31  64  1  
32  1  3  
33  384  1  
34  96  1  
35  1  3  
36  576  1  
37  96  1  
38  1  3  
39  576  1  
40  96  1  
41  1  3  
42  576  1  
43  160  1  
44  1  3  
45  960  1  
46  160  1  
47  1  3  
48  960  1  
49  160  1  
50  1  3  
51  840  1  
52  160  1  
classifier  640  1 
Idx  Dimension  
1  3  3  
2  1  3  
3  32  1  
4  16  1  
5  1  3  
6  48  1  
7  12  1  
8  1  3  
9  72  1  
10  12  1  
11  1  3  
12  72  1  
13  16  1  
14  1  3  
15  96  1  
16  16  1  
17  1  2  
18  96  1  
19  16  1  
20  1  2  
21  96  1  
22  32  1  
23  1  2  
24  192  1  
25  32  1  
26  1  2  
27  192  1  
28  32  1  
29  1  2  
30  192  1  
31  32  1  
32  1  2  
33  192  1  
34  48  1  
35  1  2  
36  288  1  
37  48  1  
38  1  2  
39  288  1  
40  48  1  
41  1  2  
42  288  1  
43  80  1  
44  1  2  
45  480  1  
46  80  1  
47  1  2  
48  480  1  
49  80  1  
50  1  3  
51  480  1  
52  160  1  
classifier  560  1 
Idx  Dimension  
1  3  3  
2  1  3  
3  32  1  
4  1  3  
5  16  1  
6  16  1  
7  1  3  
8  96  1  
9  24  1  
10  1  3  
11  144  1  
12  24  1  
13  1  3  
14  144  1  
15  24  1  
16  1  5  
17  144  1  
18  40  1  
19  1  5  
20  240  1  
21  40  1  
22  1  5  
23  240  1  
24  40  1  
25  1  3  
26  240  1  
27  64  1  
28  1  3  
29  360  1  
30  64  1  
31  1  3  
32  360  1  
33  64  1  
34  1  3  
35  360  1  
36  64  1  
37  1  5  
38  360  1  
39  80  1  
40  1  5  
41  560  1  
42  96  1  
43  1  5  
44  560  1  
45  96  1  
46  1  5  
47  560  1  
48  96  1  
49  1  5  
50  560  1  
51  100  1  
52  1  5  
53  640  1  
54  100  1  
55  1  5  
56  640  1  
57  100  1  
58  1  5  
59  640  1  
60  100  1  
61  1  5  
62  576  1  
63  160  1  
64  1  3  
65  576  1  
66  160  1  
67  1  3  
68  960  1  
69  160  1  
classifier  480  1 