1 Introduction
As part of the great progress made in deep learning, deep neural network (DNN) models with higher performance have been proposed for various machine-learning tasks (LeCun et al., 2015). However, these performance improvements require a higher number of parameters and greater computational complexity. Therefore, it is important to compress the models without sacrificing performance so that they can run on resource-constrained devices.
Han et al. (2016) reduced the memory requirement for devices by pruning and quantizing weight coefficients after training the models. Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019) used factorized operations called depthwise and pointwise convolutions in a proposal for lightweight models suited to mobile devices. However, these methods require predefined network structures and pruning of the model weights after training. Recently, automated frameworks, such as the so-called neural architecture search (NAS) (Zoph & Le, 2017), have been proposed. Tan et al. (2019) proposed a NAS method to accelerate the inference speed on smartphones by incorporating resource-related constraints into the objective function. Stamoulis et al. (2019) significantly reduced the search costs for NAS by applying a gradient-based search scheme with a superkernel that shares weights for multiple convolutional kernels.
However, the models trained by these methods are dedicated to specific devices, and thus do not possess the ability to be reconfigured for use on different devices. In order to change the model size, it is necessary to retrain them according to the resources of the target devices. For example, it has been reported that the inference speed when operating the same model on different devices differs according to the computing performance and memory capacity of the hardware accelerator (Ignatov et al., 2018). Therefore, it is desirable that the model size can be flexibly changed according to the resources of the target devices without retraining the model, which we refer to as scalability in this paper.
To this end, Yu et al. (2019) introduced switchable batch normalization (BN) (Ioffe & Szegedy, 2015), which switches BN layers according to predefined widths, and proposed “slimmable” networks whose width can be changed after training. Moreover, Yu & Huang (2019) proposed universally slimmable networks (USNets) that extend slimmable networks to arbitrary widths. However, since these methods directly reduce the width (i.e., dimensionality) in each layer, the principal components are not taken into account. In addition, they reduce the width uniformly across all layers, which ignores differences in the importance of different layers.
In this paper, we propose a novel method that enables DNNs to flexibly change their size after training. We factorize a weight matrix in each layer into two low-rank matrices after training the DNNs via singular value decomposition (SVD). By changing the rank in each layer, our method can scale the model to an arbitrary size (Figure 1). Our contributions are as follows.

We do not directly reduce the width but instead reduce the redundant basis in the column space of the weight matrix, which prevents the feature map in each layer from losing important features.

We introduce simple criteria that characterize the importance of each basis and layer, namely, the error- and complexity-based criteria. These criteria enable us to reduce the complexity of the models effectively while introducing as little error as possible.

We improve the performance of low-rank networks with the following methods: a learning procedure that simultaneously minimizes losses for both the full- and low-rank networks, and a mean and variance correction for each BN layer according to the given rank.
In the experiments on image-classification tasks of the CIFAR-10/100 (Krizhevsky, 2009) datasets using deep convolutional neural networks (CNNs), our method exhibits better performance for up to approximately compressed models than slimmable networks and USNets. In the following, we first describe the details of our method (Section 2) and briefly review related works (Section 3). Then, we give some experimental results (Section 4) and conclude the paper (Section 5).
2 Methods
In this section, we first give an overview and then describe the details of the inference and learning methods.
2.1 Overview
For a layer in the network, let $\mathbf{y} = W^\top \mathbf{x} \in \mathbb{R}^{n}$ be an output vector given by a linear transformation of an input vector $\mathbf{x} \in \mathbb{R}^{m}$ with a weight matrix $W \in \mathbb{R}^{m \times n}$, where $m$ and $n$ are the numbers of input and output nodes, respectively. Let $R$ be the rank of the weight matrix, with $R \le \min(m, n)$. Given $U \in \mathbb{R}^{m \times R}$ and $V \in \mathbb{R}^{n \times R}$ as matrices that have left and right singular vectors (i.e., bases) in columns, and $\Sigma \in \mathbb{R}^{R \times R}$ as a matrix composed of singular values in diagonal components, we can formulate the SVD as $W = U \Sigma V^\top$.
An example of our scalable neural networks with fully connected layers is shown in Figure 1. After the training, each weight matrix in the network is factorized into two matrices of rank $r$ via SVD, and we control this value to change the model size. This can be viewed as inserting a sublayer between the original layers and changing its width $r$. For the convolutional tensor $\mathcal{W} \in \mathbb{R}^{K_w \times K_h \times C_{in} \times C_{out}}$ of kernel width $K_w$, kernel height $K_h$, input channels $C_{in}$, and output channels $C_{out}$, we transform it to the matrix form $W \in \mathbb{R}^{K_w K_h C_{in} \times C_{out}}$ and apply SVD as in Zhang et al. (2016); Wen et al. (2017). This yields two layers with a $K_w \times K_h \times C_{in} \times r$ tensor and a $1 \times 1 \times r \times C_{out}$ tensor. The number of parameters in each layer becomes $(K_w K_h C_{in} + C_{out})\, r$ by this factorization. Thus, we can compress the network to an arbitrary size by changing the rank within the range $1 \le r \le R$.
Associated with changing the rank, the monotonicity of approximation error holds for each layer.
Proposition 2.1.
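Let $\hat{W}_r$ be a rank-$r$ approximation using the truncated SVD for the weight matrix $W$ of rank $R$, and let $\hat{\mathbf{y}}_r = \hat{W}_r^\top \mathbf{x}$. Then, $\|\mathbf{y} - \hat{\mathbf{y}}_1\| \ge \|\mathbf{y} - \hat{\mathbf{y}}_2\| \ge \cdots \ge \|\mathbf{y} - \hat{\mathbf{y}}_R\| = 0$.
The proof is given in Appendix A. According to the above, errors between an original output and its approximation monotonically decrease as the rank increases. Hence, it can be expected that the performance of the entire network will scale with the model size, which is controlled by the rank in our method.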
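The factorization and the monotonicity stated in Proposition 2.1 can be checked numerically. The following is a minimal NumPy sketch on a random matrix; the sizes, the random data, and the helper name `low_rank_output` are assumptions made for illustration and are not taken from our implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 32                      # input / output nodes (illustrative sizes)
W = rng.standard_normal((m, n))    # stand-in for a trained weight matrix
x = rng.standard_normal(m)
y = W.T @ x                        # full-rank output

# Thin SVD: W = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = len(s)

def low_rank_output(r):
    """Output of the two thin layers obtained by keeping the top-r bases."""
    A = U[:, :r] * s[:r]           # (m, r) first factor
    B = Vt[:r, :]                  # (r, n) second factor
    return B.T @ (A.T @ x)

# Approximation error for every rank from 1 to R.
errors = [np.linalg.norm(y - low_rank_output(r)) for r in range(1, R + 1)]

params_full = m * n                # dense layer
params_rank8 = (m + n) * 8         # factorized layer at rank 8
```

In this sketch the list `errors` is non-increasing and reaches zero (up to rounding) at full rank, while the factorized layer at rank 8 stores 768 parameters instead of 2048.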
2.2 Inference
2.2.1 Rank selection
Given a target size for a model, we select the rank of each layer by reference to the following criteria.
Error-based criterion. According to Eq. (A) in Appendix A, the error associated with a rank-1 decrease is given by . This implies that the error depends on the singular value and the cosine similarity between an input vector and the corresponding left singular vector. Based on this, we consider how to compress the model with as little error as possible by reducing the bases that induce lower errors. It has been reported that networks with BN layers and ReLUs (rectified linear units) (Glorot et al., 2011) possess the scale-invariance property (Arora et al., 2019). Thus, the error should be normalized by the scale of in each layer. Exploiting the fact that , we normalize it as , where is the spectral norm of (i.e., the maximum singular value).
Computing the cosine similarities is costly because it requires the whole input over the dataset in each layer. Therefore, we omit it and simply use the following criterion for selecting the rank:
(1) 
where is a layer index. This is equivalent to keeping small in each layer. We consider this a simple but effective criterion for the following reasons. First, Arora et al. (2018) have reported that the subspace spanned by each layer’s weight vectors and the subspace spanned by their input vectors both become implicitly low rank and correlated after training. In other words, there should be many small singular values in each layer’s weight matrix. Second, the principal directions of the weights are correlated with those of the inputs. Thus, by reducing the bases that correspond to smaller singular values, we can remove a large number of bases without significantly increasing the errors. Moreover, the cosine similarities are expected to be higher for large singular values, meaning that our method can reflect the principal directions of the data distribution even if we only use the singular values of the weight matrices as the criterion.
Complexity-based criterion. We achieve a high compression rate by reducing the rank in layers that have a large number of parameters and multiply-accumulate operations (MACs). For convolutional layers, the numbers of parameters (excluding biases) and MACs are given by and for a feature map of height and width , respectively. We use the following as a complexity-based criterion.
(2) 
By coupling the above two criteria, we reduce the bases with lower values of across the entire network. In practice, we compute the criterion for all bases after training. Then, we sort them in ascending order and store them as a list. The only step necessary for selection is to remove the first bases in the list, where is determined by the target model size. The algorithm is given in Appendix B.
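The sort-and-remove procedure described above can be sketched as follows. Because the exact forms of the two criteria are not reproduced here, the scores in this sketch are assumptions: the error score is taken as the squared singular value normalized by the largest one in the layer, the complexity score as the layer's share of a given per-basis cost, and the coupled score as their ratio. The function name `select_ranks` and its arguments are likewise hypothetical.

```python
import numpy as np

def select_ranks(singular_values, layer_costs, num_remove):
    """Return the rank kept in each layer after removing `num_remove` bases.

    singular_values: list of 1-D arrays (descending order), one per layer.
    layer_costs: cost saved by removing one basis in each layer (assumed given).
    """
    total_cost = float(sum(layer_costs))
    entries = []
    for layer, (s, cost) in enumerate(zip(singular_values, layer_costs)):
        err = (s / s[0]) ** 2          # assumed error score per basis
        cplx = cost / total_cost       # assumed complexity score per layer
        # Skip index 0 so that every layer keeps at least one basis.
        for k in range(1, len(s)):
            entries.append((float(err[k] / cplx), layer, k))
    entries.sort()                     # ascending: least important first
    ranks = [len(s) for s in singular_values]
    for _, layer, _ in entries[:num_remove]:
        ranks[layer] -= 1
    return ranks

sv = [np.array([4.0, 2.0, 1.0, 0.1]), np.array([3.0, 2.9, 2.8])]
costs = [10.0, 30.0]
ranks_after_two = select_ranks(sv, costs, 2)
ranks_after_four = select_ranks(sv, costs, 4)
```

Within a layer the assumed score decreases with the basis index, so removing the lowest-scored entries always truncates from the smallest singular values, which is consistent with the truncated SVD.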
2.2.2 BN correction
As pointed out by Yu et al. (2019), the means and variances of the BN layers should be corrected when the model size is changed. Suppose that a BN layer is inserted right after the convolutional layer, and that the mean and variance of are normalized in the BN layer. Then, we should correct those values according to the rank approximation of (i.e., ). Because , lies in the rank subspace spanned by the columns of . Hence, letting and be, respectively, the population mean and covariance matrix for , we can exactly compute their projection onto the subspace as and (note that diagonal components are extracted from for the BN layer). For practical reasons, we compute and for each layer after training (Ioffe & Szegedy, 2015). Because has extra parameters to store, we keep instead, where is the maximum rank to be used, reducing the number of parameters to . At the time of inference, we can correct the mean and variance according to the ranks in each layer. On the other hand, if a list of candidate model sizes is available in advance, we can retain the means and variances for those models as in Yu & Huang (2019). We compare both methods in Section 4.
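The projection-based correction can be illustrated with a short NumPy sketch. It assumes that the layer output is $\mathbf{y} = W^\top \mathbf{x}$ and that the projector onto the rank-$r$ subspace is built from the top-$r$ right singular vectors; the variable names and the random data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, num_samples, r = 20, 12, 5000, 5

W = rng.standard_normal((m, n))
X = rng.standard_normal((num_samples, m)) + 0.5   # inputs with non-zero mean
Y = X @ W                                         # full-rank outputs (one row per sample)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
V_r = Vt[:r].T                                    # (n, r) right singular vectors kept

# Statistics of the full-rank outputs, computed once after training.
mu = Y.mean(axis=0)
S = np.cov(Y, rowvar=False, bias=True)

# Assumed form of the correction: project the statistics onto the subspace.
P = V_r @ V_r.T
mu_r = P @ mu
var_r = np.diag(P @ S @ P)                        # BN only needs the diagonal

# Reference: statistics measured directly on the rank-r outputs.
W_r = (U[:, :r] * s[:r]) @ Vt[:r]
Y_r = X @ W_r
```

Under these assumptions, `mu_r` and `var_r` agree with the mean and variance measured on `Y_r`, so no additional pass over the data is needed when the rank of this layer changes.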
2.3 Learning
Although our scalable neural networks can operate regardless of the learning method, we propose a method to gain better performance. We simultaneously minimize losses for both the full-rank and the low-rank networks as follows.
(3) 
Here, is a loss function, is a set of training samples in a mini-batch, is the batch size, and is a hyperparameter for balancing between the two losses. For this, , , and are sets of weight matrices, their low-rank approximations, and other trainable parameters (e.g., biases), respectively. We additionally propagate each mini-batch to a low-rank network in which the weights are generated by SVD. Because , the gradient with respect to can be computed as follows (footnote 1: in fact, depends on , but we treat it as constant for simplicity):
(4)
is shared between the full- and low-rank networks, so the gradients are simply computed from the weighted average of those of both networks. At each iteration step, we randomly select the model size for the low-rank network by sampling a global rank ratio from a uniform distribution with . Then, letting be the rank of , we reduce bases across the entire network using the criterion described in Section 2.2.1. In Section 4, we experimentally investigate the effects of the parameters , , and .
Arora et al. (2018); Suzuki (2019) derived generalization error bounds for DNNs under the condition that the trained network has near low-rank weight matrices. They proved that this condition contributes not only to a better generalization error bound for the non-compressed network but also to efficient compression of the network. That motivates our approach: a learning procedure that aims to improve the performance of the low-rank networks as well as that of the full-rank network, and a compression scheme that reduces the redundant bases obtained via SVD.
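One training step of the learning procedure can be sketched for a single linear layer as follows. This is an illustration under stated assumptions rather than our implementation: the loss is a squared error, the two losses are combined with weights `1 - lam` and `lam`, the low-rank weights are written as a projection of the full weights with the projector treated as a constant, and the names `lam`, `lo`, `hi`, and `loss_and_grad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, batch = 16, 8, 32
W = rng.standard_normal((m, n)) * 0.1
X = rng.standard_normal((batch, m))
T = rng.standard_normal((batch, n))      # regression targets (illustrative)

lam = 0.5                                # assumed balance between the two losses
lo, hi = 0.25, 1.0                       # assumed range of the global rank ratio

def loss_and_grad(weights):
    """Squared-error loss of X @ weights against T and its gradient."""
    err = X @ weights - T
    return 0.5 * np.mean(np.sum(err ** 2, axis=1)), X.T @ err / batch

# Sample a rank ratio and derive the rank of this (single) layer.
z = rng.uniform(lo, hi)
r = max(1, int(round(z * min(m, n))))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
P = U[:, :r] @ U[:, :r].T                # projector, treated as a constant
W_low = P @ W                            # equals the rank-r truncated SVD of W

loss_full, g_full = loss_and_grad(W)
loss_low, g_low_hat = loss_and_grad(W_low)
g_low = P @ g_low_hat                    # chain rule through W_low = P @ W

loss = (1 - lam) * loss_full + lam * loss_low
grad = (1 - lam) * g_full + lam * g_low  # weighted average of both gradients
W_new = W - 0.1 * grad                   # plain gradient step for illustration
```

In the full method the rank is selected across all layers with the criteria of Section 2.2.1; the single-layer rank used here is a simplification.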
3 Related work
Low-rank approximation & regularization. Compression methods based on low-rank approximation have been proposed in the literature. Denil et al. (2013); Tai et al. (2016); Ioannou et al. (2016) trained networks after factorizing the weight matrix into a low-rank form. Ioannou et al. (2016) achieved a high compression rate by factorizing a convolutional kernel of into and . Denton et al. (2014); Lebedev et al. (2015); Kim et al. (2016) proposed methods that use tensor factorization without rearranging the convolutional tensor into the matrix form. Yu et al. (2017) further improved the compression rate by incorporating sparseness into the low-rank constraint. Zhang et al. (2016); Li & Shi (2018) took resource-related constraints into account to automatically select an appropriate rank. Each of these methods either trains a network with predefined ranks or compresses a redundant network by applying complicated optimizations under a given target size for the model. That is, those methods would require retraining to reconfigure the models for different devices.
Kliegl et al. (2017); Xu et al. (2019) utilized trace-norm regularization as a low-rank constraint in learning the network. Wen et al. (2017) proposed a novel method called force regularization to obtain low-rank weights. The performance of these methods depends on a hyperparameter that adjusts the strength of regularization. It is difficult to decide on an appropriate range for the hyperparameter in advance, meaning that selection requires trial and error to achieve a particular model size.
Scalable neural networks. Chen et al. (2018) represented the data flow in ResNet-type structures (He et al., 2016) as ordinary differential equations (ODEs), and proposed Neural ODEs, which can be used to arbitrarily control the computational cost in the depth direction. Zhang et al. (2019) also obtained scalability in the depth direction by allowing predefined intermediate layers to be bypassed.
Yu et al. (2019); Yu & Huang (2019) proposed slimmable networks and USNets, which are scalable in the width direction. Their works are closely related to ours but differ in some aspects. First, since their methods directly and uniformly reduce the width for every layer, the principal components are not taken into account, and the relative importance of each layer is not considered. Second, for USNets in particular, they introduced a “sandwich rule” to keep the performance for an arbitrary width. However, this rule does not guarantee monotonicity of the error with increasing layer width. In the next section, we compare our method with them.
4 Experiments
We evaluate our methods on the image-classification tasks of the CIFAR-10/100 (Krizhevsky, 2009) datasets using deep CNNs. The CIFAR-10/100 datasets contain $32 \times 32$ images for object recognition including 10 and 100 classes, respectively. Each dataset contains 50K images for training and 10K images for validation. We implement our method with TensorFlow (Abadi et al., 2015).
4.1 Ablation study
We test each component in our method on the CIFAR-10 dataset. We use the same baseline setup as in Zagoruyko & Komodakis (2016), which is summarized in Table 1 in Appendix C. Unless otherwise specified, we report the average result from 5 trials with different random seeds. We adopt a VGG-like network with 15 layers (Zagoruyko, 2015; Liu et al., 2017), which we refer to as VGG-15 below. (Footnote 2: Since the VGG networks are originally designed for classifying the ImageNet dataset (Deng et al., 2009), we use a smaller type than the original for the CIFAR datasets, as used by Liu et al. (2017).)
Firstly, we evaluate our learning method for various values of the parameters and , fixing . Our method requires SVD at each iteration step, which makes it costly. To address this, we suppose that the weight subspaces are not drastically changed at each step and recompute the SVD after every two steps, reusing the results to speed up the training. We illustrate the validation accuracy of a full-rank model for different values of (resp., ) with (resp., ) fixed, on the left (resp., center) of Figure 2. It can be observed that smaller values of and larger values of are better. This can be interpreted as indicating that it is better for a full-rank model to learn with various low-rank models than to learn with models biased to a specific range of ranks. Thus, we set and for the other experiments described below. On the right side of Figure 2, we show the maximum singular value for each basis index in a full-rank model. (Footnote 3: We let be a singular value for a basis in a layer and then compute . For layers with lower ranks, we simply fill the missing part with zeros.) We can see that our learning method obtains smaller singular values than the baseline. This implies that our learning method has an effect similar to trace-norm regularization (Kliegl et al., 2017), suggesting that we can suppress the errors produced by reducing the bases.
Next, we evaluated the performance of our inference method for various model sizes. In Figure 3, we illustrate the inference results for validation data with various numbers of parameters and MACs. In the figure, “infer (uni)” indicates the results obtained by uniformly reducing the basis in each layer. Concretely, with a global rank ratio , we reduce bases in order from the one corresponding to the smallest singular value. Despite the method being simple, the accuracy changes almost smoothly, and it can be confirmed that the accuracy scales with changes in the model size. This can be attributed to the monotonicity of errors, which is formalized in Proposition 2.1. Additionally, the performance is also improved with our learning method by applying uniform rank selection and by using our BN correction. Furthermore, the performance with respect to the parameters is improved when we apply the error- and complexity-based criteria for rank selection to both inference and learning (in the figure, “c” and “cc”). However, the performance with respect to the MACs drops when the rank selection is changed from uniform (“uni”) to error-based (“c”). As shown on the left side of Figure 4, reducing the parameters in shallower layers, which have large feature maps, is more effective for decreasing MACs. However, the error-based criterion tends to reduce the parameters in deeper layers because those tend to be low rank. When both criteria are applied (in the figure, “cc”), the performance is also improved for the MACs. We show the rank-selection results for different criteria on the right side of Figure 4. It can be confirmed that, relative to the case with only the error-based criterion (“c”), the ranks are decreased for layers 4, 6, 7, 9, and 10, which have large MACs, in the case with both criteria (“cc”). For the BN correction, our method is effective, but a method that recomputes the means and variances for given ranks (“bnRe”) performs better. Because our method is a layer-by-layer correction, this probably occurs because it cannot fully correct for the inter-layer gap, with the statistics of the deep layers changing due to the reduction of rank in the shallow layers.
Additionally, we investigate the effect of a parameter . We evaluate the validation accuracy with respect to the number of parameters for with VGG-15 and ResNet-34 on the CIFAR-10/100 datasets. The results are shown in Figure 7 in Appendix D. We consider that there is a trade-off between the performance of full- and low-rank models, which depends on .
4.2 Comparison with slimmable networks
We compare our method with slimmable networks (Yu et al., 2019) and USNets (Yu & Huang, 2019) in terms of performance on the CIFAR-10/100 datasets. We adopt VGG-15 and ResNet-34 (He et al., 2016). We implement the models based on Yu’s code, available at https://github.com/JiahuiYu/slimmable_networks (written in PyTorch (Paszke et al., 2017)). USNets are trained with 2 random widths between the lower and upper widths and with in-place distillation (Yu & Huang, 2019); BN calibration (Yu & Huang, 2019) is then applied to each of the slimmable networks and USNets after training. For our method, we incorporate all components into the comparisons and adopt BN correction with recomputation. We train the models using and the same setup as in the previous subsection. In the following, we report the results for models after the last iteration in training.
First, we compare the scalability of ResNet-34 on the CIFAR-100 dataset. We illustrate the inference results over various numbers of parameters for 5 models trained with different random seeds in Figure 5. The results in the figure show that USNets are unstable, which is a problem for practical use. This instability arises because USNets do not have monotonic error changes in each layer, a property that our method ensures. Next, we show the results for comparison of VGG-15 on CIFAR-10 and ResNet-34 on CIFAR-100 in Figure 6. The notations “base (Yu’s code)” and “base (our code)” indicate the baseline results obtained by Yu’s code and our code with the same setup. Our baseline is slightly better than Yu’s baseline. We consider this to be due to differences in the framework. Comparing the results with those for VGG-15 on CIFAR-10, our method tends to be more accurate in terms of the number of parameters than in terms of the number of MACs. Since deep layers have more parameters than shallow layers, the rank in deep layers tends to be lower than in shallow layers, resulting in more parameters being reduced in deep layers by our method. In contrast, USNets reduce the width uniformly across layers, which may contribute to reducing the number of MACs. However, depending on the target device, reducing the number of MACs does not necessarily yield a significant reduction in the inference cost (Yang et al., 2018).
Although we only consider the number of parameters and MACs as the complexity metrics in this paper, other metrics such as memory footprint, memory access cost, and runtime latency should be taken into account for validating the effectiveness in practical use (Tan et al., 2019; Sandler et al., 2018; Dai et al., 2019).
We can see that the accuracy of our method is lower than that of USNets when the compression rate is extremely high. Our method uses SVD and reduces the bases, which means it does not change the number of inputs and outputs (i.e., the input and output dimensionalities). Because the number of parameters in each layer is , it decreases linearly with respect to the rank. USNets reduce both input and output dimensionality, meaning that the number of parameters is decreased at a quadratic rate. This makes it easier for USNets to achieve extremely high compression. However, our method is better in the regimes of larger model sizes. In particular, for ResNet-34 on CIFAR-100, the performance of slimmable networks and USNets on the full-size model is decreased, while our method does not decrease performance much. We illustrate additional comparison results in Figure 8 in Appendix D and give an analysis of per-layer error in Appendix E.
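The difference between the linear and quadratic rates can be made concrete with a small calculation. The sketch below assumes a dense layer with `m` inputs and `n` outputs, a factorized layer made of an `m x r` and an `r x n` matrix, and a slimmed layer whose input and output widths are both scaled by `alpha`; the function names are hypothetical.

```python
def low_rank_params(m, n, r):
    """Parameters of the two factors of a rank-r layer (biases excluded)."""
    return (m + n) * r

def slimmed_params(m, n, alpha):
    """Parameters of a layer whose input and output widths are scaled by alpha."""
    return int(alpha * m) * int(alpha * n)

m, n = 512, 512
full = m * n

# Halving the rank halves the parameters of the factorized layer ...
ratio_rank = low_rank_params(m, n, 64) / low_rank_params(m, n, 128)
# ... whereas halving the width quarters the parameters of the slimmed layer.
ratio_width = slimmed_params(m, n, 0.25) / slimmed_params(m, n, 0.5)
```

With these sizes, rank 128 gives 131072 parameters (half of the dense layer) and rank 64 gives 65536, while width ratios of 0.5 and 0.25 give 65536 and 16384 parameters, respectively.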
5 Conclusions
We proposed a novel method that enables DNNs to flexibly change their size after training. We used SVD to factorize the weight matrix for each layer into two low-rank matrices after training the DNNs. By changing the rank in each layer, our method can scale the model to an arbitrary size. We introduced simple criteria for characterizing the importance of each basis and layer; these are the error- and complexity-based criteria. Those criteria enabled effectively compressing models without introducing much error. In experiments on multiple image-classification tasks using deep CNNs, our method exhibited good performance relative to that of other methods.
References
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), pp. 254–263, 2018.
 Arora et al. (2019) Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. In International Conference on Learning Representations (ICLR), 2019.
 Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6571–6583. 2018.

 Dai et al. (2019) Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and Niraj K. Jha. ChamNet: Towards efficient network design through platform-aware model adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11398–11407, 2019.
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2148–2156. 2013.
 Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1269–1277. 2014.

 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 315–323, 2011.
 Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, 2015.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
 Howard et al. (2019) Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. In IEEE International Conference on Computer Vision (ICCV), 2019.
 Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Ignatov et al. (2018) Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. AI benchmark: Running deep neural networks on android smartphones. arXiv preprint arXiv:1810.01109, 2018.
 Ioannou et al. (2016) Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training CNNs with low-rank filters for efficient image classification. In International Conference on Learning Representations (ICLR), 2016.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456, 2015.
 Kim et al. (2016) Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In International Conference on Learning Representations (ICLR), 2016.
 Kliegl et al. (2017) Markus Kliegl, Siddharth Goyal, Kexin Zhao, Kavya Srinet, and Mohammad Shoeybi. Trace norm regularization and faster inference for embedded speech recognition RNNs. arXiv preprint arXiv:1710.09026, 2017.
 Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
 Lebedev et al. (2015) Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations (ICLR), 2015.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521:436–444, 2015.
 Li & Shi (2018) Chong Li and C. J. Richard Shi. Constrained optimization based low-rank approximation of deep neural networks. In European Conference on Computer Vision (ECCV), pp. 746–761, 2018.
 Liu et al. (2017) Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763, 2017.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520, 2018.
 Stamoulis et al. (2019) Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-Path NAS: Designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
 Suzuki (2019) Taiji Suzuki. Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. arXiv preprint arXiv:1909.11274, 2019.
 Tai et al. (2016) Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank regularization. In International Conference on Learning Representations (ICLR), 2016.
 Tan et al. (2019) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 Wen et al. (2017) Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Coordinating filters for faster deep neural networks. In IEEE International Conference on Computer Vision (ICCV), pp. 658–666, 2017.
 Xu et al. (2019) Yuhui Xu, Yuxi Li, Shuai Zhang, Wei Wen, Botao Wang, Wenrui Dai, Yingyong Qi, Yiran Chen, Weiyao Lin, and Hongkai Xiong. Trained rank pruning for efficient deep neural networks. In Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC), 2019.
 Yang et al. (2018) Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In European Conference on Computer Vision (ECCV), pp. 289–304, 2018.
 Yu & Huang (2019) Jiahui Yu and Thomas Huang. Universally slimmable networks and improved training techniques. IEEE International Conference on Computer Vision (ICCV), 2019.
 Yu et al. (2019) Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. In International Conference on Learning Representations (ICLR), 2019.
 Yu et al. (2017) Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 67–76, 2017.

 Zagoruyko (2015) Sergey Zagoruyko. 92.45% on CIFAR-10 in Torch, 2015. URL http://torch.ch/blog/2015/07/30/cifar.html.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In the British Machine Vision Conference (BMVC), 2016.
 Zhang et al. (2019) Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. SCAN: A scalable neural networks framework towards compact and efficient models. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
 Zhang et al. (2016) Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2016.

 Zoph & Le (2017) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.
Appendices
Appendix A Proof of Proposition 2.1
As by definition, . To prove , we show that for .
Proof.
(5) 
∎
Appendix B An algorithm for rank selection
Appendix C Baseline setup of experiments on CIFAR-10/100 datasets
| Item | Setting |
| --- | --- |
| Preprocess | Per-channel standardization (mean, std.) |
| | CIFAR-10: (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616) |
| | CIFAR-100: (0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762) |
| Data augmentation | Random cropping 32 × 32 after zero-padding 4 pixels |
| | Random horizontal flipping () |
| Batch size / Epochs | 128 / 200 |
| Optimizer | SGD with Nesterov momentum ( ) |
| Learning rate | Initialized to 0.1, multiplied by 0.2 at 60, 120, and 160 epochs |
| L2 regularization | 0.0005 |
| Initializer | HeNormal (He et al., 2015) for weights, 0 for biases |
| BN | . Initialize and |
| GPUs | 1 |
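As a rough illustration of the setup above, the following NumPy sketch shows per-channel standardization, the crop-and-flip augmentation, and the step learning-rate schedule. It is not the code used in the experiments. The function names, the assumption that pixel values are scaled to [0, 1], and the flip probability of 0.5 are ours; the flip probability in particular is not recoverable from the table.

```python
import numpy as np

# Per-channel statistics for CIFAR-10, copied from the setup table.
CIFAR10_MEAN = np.array([0.4914, 0.4822, 0.4465])
CIFAR10_STD = np.array([0.2470, 0.2435, 0.2616])


def standardize(images, mean, std):
    """Per-channel standardization of images shaped (N, H, W, 3).

    Assumes pixel values were already scaled to [0, 1].
    """
    return (images - mean) / std


def random_crop_and_flip(image, rng, pad=4, size=32, flip_prob=0.5):
    """Zero-pad by `pad` pixels, take a random size x size crop, then flip at random.

    flip_prob=0.5 is an assumption, not a value taken from the table.
    """
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + size, left:left + size]
    if rng.random() < flip_prob:
        crop = crop[:, ::-1]  # horizontal flip along the width axis
    return crop


def learning_rate(epoch, base_lr=0.1, factor=0.2, milestones=(60, 120, 160)):
    """Step schedule: start at base_lr and multiply by factor at each milestone."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * factor ** passed
```

With these defaults, the learning rate is 0.1 for epochs 0 to 59 and is multiplied by 0.2 at epochs 60, 120, and 160.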
Appendix D Additional results for VGG-15 and ResNet-34 on CIFAR datasets
Appendix E An analysis of per-layer error
We train VGG-15 on the CIFAR-10 dataset with USNets (Yu & Huang, 2019) and with our method. First, we illustrate the learned weights in Figure 9. For comparison, we show a weight matrix for USNets and its factorized form obtained via SVD for our method. The rank of the weight matrix in the first convolutional layer is 27, and thus only 27 convolutional tensors are shown for our method in the upper part of the figure. Since the weight matrix becomes low-rank after training, there are only a few dominant bases, which correspond to the larger singular values. For USNets in particular, we can see that weight coefficients with large absolute values are concentrated in the lower channels. This suggests that USNets implicitly gather important features into the lower channels, because USNets remove channels in order starting from the highest indices. Therefore, the errors induced by reducing channels can be expected to be partially suppressed for USNets.
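The rank of 27 follows from the shape of the first layer: with 3 input channels and 3 × 3 filters, each filter has 27 coefficients. The sketch below illustrates this with a random kernel rather than trained weights. The kernel shape, the row-wise flattening convention, and the helper name `truncate` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-layer kernel: 64 output channels, 3 input channels, 3x3 filters.
kernel = rng.standard_normal((64, 3, 3, 3))

# Assumed convention: flatten each filter into a row, giving a 64 x 27 matrix.
weight = kernel.reshape(64, -1)

# Thin SVD: weight = u @ diag(s) @ vt, singular values sorted in descending order.
u, s, vt = np.linalg.svd(weight, full_matrices=False)
rank = np.linalg.matrix_rank(weight)  # bounded by min(64, 27) = 27


def truncate(u, s, vt, k):
    """Rebuild the weight matrix from the k bases with the largest singular values."""
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k]


# Share of the squared singular values captured by the leading k bases.
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
```

For a trained low-rank weight matrix, `energy` would be expected to approach 1 after only a few bases, whereas the random kernel used here has a much flatter spectrum.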
Next, we compute the sum of squared errors over the validation dataset between the output feature map of each layer for the full-size network and that for the compressed network. As the entire network is compressed in terms of the total number of parameters, we compute the squared error with respect to the number of parameters in each layer, as shown in Figure 10. For our method, we reduce the bases of the weight matrices (i.e., columns of the factor matrix obtained via SVD) according to the criterion and compute the output feature map with a low-rank weight matrix in each layer. For USNets, since the dimensionality of the output feature map decreases as channels are reduced in every layer, we fill the missing part with zeros. Figure 10 confirms that the squared errors of our method are smaller than those of USNets. We attribute this to the fact that our method does not directly remove channels as USNets do but instead removes redundant bases in the column space of the weight matrix, which prevents the feature map in each layer from losing important features (i.e., principal components).
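The per-layer comparison can be sketched for a single linear layer as follows. This is a toy illustration under our own assumptions, not a reproduction of Figure 10: the layer is fully connected, the inputs are random, the weight matrix is synthetic with a decaying spectrum, and the two compression schemes are compared at the same k rather than at a matched parameter count. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 256 validation inputs of dimension 27, 64 output channels.
inputs = rng.standard_normal((256, 27))

# Synthetic weight matrix with a decaying spectrum so that a few bases dominate.
u, _ = np.linalg.qr(rng.standard_normal((27, 27)))
v, _ = np.linalg.qr(rng.standard_normal((64, 27)))
spectrum = 0.7 ** np.arange(27)
weight = u @ np.diag(spectrum) @ v.T  # shape (27, 64)

full_output = inputs @ weight


def squared_error(reference, compressed):
    """Sum of squared differences over all samples and output elements."""
    return float(np.sum((reference - compressed) ** 2))


def low_rank_output(inputs, weight, k):
    """Layer output when only the k leading singular bases of the weight are kept."""
    uu, s, vt = np.linalg.svd(weight, full_matrices=False)
    return inputs @ (uu[:, :k] * s[:k]) @ vt[:k]


def channel_reduced_output(inputs, weight, k):
    """Layer output when only the first k channels are kept; the rest are zero-filled."""
    out = np.zeros((inputs.shape[0], weight.shape[1]))
    out[:, :k] = inputs @ weight[:, :k]
    return out


k = 8
err_low_rank = squared_error(full_output, low_rank_output(inputs, weight, k))
err_channels = squared_error(full_output, channel_reduced_output(inputs, weight, k))
```

In this synthetic setting the low-rank error is far smaller, because the discarded bases carry little energy while the discarded channels do not. A trained USNet concentrates important weights in the lower channels, so its actual gap would differ from this toy case.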