1 Introduction
Deep convolutional neural networks have dominated the area of applied computer vision. The network architectures used for image analysis have grown in terms of performance along with their numbers of layers and parameters over the years. Nevertheless, as the ubiquitous use of these networks now extends to resource-limited areas such as edge computing, optimizing architectures to minimize computational requirements has become essential. Furthermore, reducing FLOPs at inference time directly impacts the power consumption of large-scale, consumer-facing applications of artificial intelligence. As a result, advocates of green AI recommend using network size and the number of FLOPs as important performance evaluation metrics for neural networks, along with accuracy [23].
To achieve the goal of architectural efficiency, significant progress has been made in the area of pruning. Pruning is the process of finding architectural components of a network that can be removed without a large loss of performance. Since the early days of convolutional neural networks, it has been shown that removing unimportant components of neural networks delivers benefits in generalization and computational efficiency [12].
Similar to [21], in this work, we focus on pruning in the context of transfer learning. While domain adaptation has been the focus of most work in transfer learning, we focus on the neglected problem of efficient inference in the target domain through pruning. Transfer learning is necessary for domains in which large-scale and well-annotated datasets are scarce due to the cost of data acquisition and annotation [25]. In transfer learning, as the source dataset usually contains features that are not present in the target dataset, using off-the-shelf features or fine-tuning without pruning can result in unnecessarily large models. In contrast, transfer learning with pruning can achieve substantial reductions in parameters and FLOPs, even more than pruning on the same dataset, as we found in our experiments.
Existing network pruning algorithms can be categorized in different ways. Pruning can be achieved by removing unstructured weights and connections [4, 16], or by removing structural contents such as filters or layers [6, 13, 15, 17, 19, 20, 21, 26, 29]. While most algorithms perform pruning directly on the convolutional weight matrix, some try to reconstruct the weight matrix or its output features by low-rank approximation to reduce inference time [2, 8, 30]. Some frameworks perform pruning without considering image data [8, 13], while most works use image data for better pruning ratios and accuracy [6, 17, 19, 29].
Although these frameworks provide promising results, there are several limitations. As discussed in [9, 22], since the filters within a layer are linearly dependent, pruning in the original filter space can be less effective. On the other hand, low-rank approximations require additional optimizations apart from backpropagation to perform filter or feature reconstruction. Furthermore, fine-tuning of the entire network after pruning is required in most frameworks, which may not be desirable for transfer learning with limited data.
In view of these issues, here we propose a framework that fine-tunes and prunes the orthogonal bases obtained by applying singular value decomposition (SVD) on the convolutional weight matrices. Our contributions are as follows:

We propose a basis pruning algorithm that prunes convolutional layers in orthogonal subspaces regardless of the network architecture. As the basis vectors are non-trainable in our framework to facilitate transfer learning, we introduce basis scaling factors that are responsible for both the importance estimation and the fine-tuning of the basis vectors. These basis scaling factors are trainable by backpropagation and contribute only a very small number of trainable parameters.

After basis pruning, as the numbers of input and output channels of the original convolutional layers remain unchanged, other pruning algorithms can be further applied for double pruning. By combining the advantages of basis pruning and other pruning algorithms, we can achieve larger pruning ratios that cannot be achieved by either alone. This provides a new approach that can amplify existing pruning mechanisms.
We tested our framework by transferring the features of ImageNet pretrained models to classify the CIFAR-10, MNIST, and Fashion-MNIST datasets. With less than 1% reduction in classification accuracy, we can achieve pruning ratios up to 98.9% in parameters and 98.6% in FLOPs. Experiments on same-dataset pruning were also performed on CIFAR-10 for comparison.
2 Related Work
Channel/Filter Pruning. Network pruning can be achieved by pruning individual weights or entire channels/filters. Although pruning individual weights can achieve high compression ratios because of its flexibility [4, 16], as discussed in [26], the practical speedup can be limited given the irregular weight sparsity, unless specialized software or hardware is utilized. In contrast, channel pruning utilizes structured sparsity [6, 13, 15, 17, 19, 20, 21, 26, 29]. Although channel pruning is less flexible than weight-level pruning, the dense matrix structures are maintained after pruning and significant practical speedup can be achieved with off-the-shelf libraries. In [13], the L1 norm of each filter was computed as its relative importance within a convolutional layer, and filters with smaller L1 norms were pruned. In [6, 19], the channels were pruned while minimizing the reconstruction error between the original and modified feature maps in each layer. In [17, 29], the scaling factors in the batch normalization (BN) layers were used for channel pruning. While only backpropagation was used in [17], the framework in [29] utilized an additional optimizer to update the scaling factors during training. In [20], the importance of a parameter was quantified by the error induced by removing it. Using Taylor expansions to approximate such errors with the gradients of the mini-batches, less important filters were pruned.
Pruning and Matrix Factorization.
Matrix factorization techniques such as SVD and principal component analysis (PCA) have been applied to deep learning. A convolutional weight matrix or a feature tensor can be factorized into a specified canonical form to reveal properties that cannot be observed in the original space. This transformation also enables special operations that lead to higher computational efficiency or accuracy. In [2, 8, 30], pretrained convolutional filters were approximated by low-rank basis vectors to reduce inference time, which can be viewed as low-rank matrix approximations by SVD. In [28], without pretraining, the weight matrices were factorized by SVD. SVD training was then performed on the decomposed weights with orthogonality regularization and sparsity-inducing regularizers, and the resulting network was pruned by the singular values and fine-tuned. In [3], PCA was applied on the feature maps to analyze the network structure for the optimal layer-wise widths and number of layers, and the resulting structure was then trained from scratch. In [15], SVD was used to obtain the ranks of feature maps. Filters with lower ranks were pruned before fine-tuning.
Relation to Our Work. Similar to most frameworks, channel pruning is used in our work. Nevertheless, as our goal is efficient transfer learning from one dataset to another, potentially much smaller, dataset, we try to minimize the number of trainable parameters during importance estimation and fine-tuning. To this end, we adopt scaling factors which allow us to perform filter-based fine-tuning with far fewer trainable parameters. Furthermore, inspired by [9, 22], we found that pruning the linearly independent filters obtained by SVD can be more effective. Therefore, we combine these advantages by proposing basis vector rescaling and pruning, and present a double pruning framework for improving pruning ratios. Note that although SVD was also used in [28] for weight factorization, their goal was to train a model from scratch and thus orthogonality regularization was necessary. In contrast, the basis vectors are non-trainable in our framework, so orthogonality is naturally preserved.
3 Methodology
The goal of our framework is to represent the convolutional weights with orthogonal bases that allow more effective network pruning. As discussed in [9, 22], the features of a layer are distributed among the linearly dependent filters, and their representations differ with different initializations. By representing the features with orthogonal bases obtained by SVD or PCA, far fewer channels are required to achieve the same accuracy. Furthermore, it was shown in [9] that the principal components of features trained from different random initializations are similar, which means that the features are more uniquely represented in such subspaces. Therefore, network pruning in orthogonal subspaces can be more effective. In addition, as shown in [2, 8], approximating the weights with low-rank tensor approximations can reduce computational complexity. In view of these advantages, we propose a framework that utilizes SVD/PCA for network pruning.
3.1 Convolutional Weights Representation in Orthogonal Subspaces
Let $\mathbf{W} \in \mathbb{R}^{h \times w \times c \times m}$ be the 4D weight matrix of a convolutional layer, where $h$ and $w$ are the kernel height and width, and $c$ and $m$ are the numbers of input and output channels, respectively. For efficient transfer learning, we assume that the convolutional weights are pretrained and non-trainable. $\mathbf{W}$ can be reshaped into a 2D matrix $\mathbf{W} \in \mathbb{R}^{n \times m}$ (with $n = hwc$) for further processing, which can be factorized by compact SVD as:

$\mathbf{W} = \mathbf{U} \mathbf{S} \mathbf{V}^{\top}$  (1)

where $\mathbf{U} \in \mathbb{R}^{n \times r}$ contains the columns of left-singular vectors, $\mathbf{V} \in \mathbb{R}^{m \times r}$ contains the columns of right-singular vectors, and $\mathbf{S} \in \mathbb{R}^{r \times r}$ is a diagonal matrix of singular values in descending order. $r \le \min(n, m)$ is the maximum rank of $\mathbf{W}$. As the columns of $\mathbf{U}$ yield an orthonormal basis, like those of $\mathbf{V}$, we have $\mathbf{U}^{\top}\mathbf{U} = \mathbf{V}^{\top}\mathbf{V} = \mathbf{I}$, with $\mathbf{I}$ an $r \times r$ identity matrix. With SVD, we can perform rescaling and pruning in the subspaces of $\mathbf{U}$ and $\mathbf{V}$.

To transform the weight matrix with PCA, we can view the rows and columns of $\mathbf{W}$ as samples and features, respectively. To use PCA, we compute the symmetric covariance matrix as:

$\mathbf{W}^{\top}\mathbf{W} = \mathbf{V}\mathbf{S}\mathbf{U}^{\top}\mathbf{U}\mathbf{S}\mathbf{V}^{\top} = \mathbf{V}\mathbf{S}^{2}\mathbf{V}^{\top}$  (2)

by using (1) and $\mathbf{U}^{\top}\mathbf{U} = \mathbf{I}$. Therefore, the columns of $\mathbf{V}$ are the eigenvectors of $\mathbf{W}^{\top}\mathbf{W}$ corresponding to the nonzero eigenvalues. $\mathbf{W}$ can be projected onto $\mathbf{V}$ as:

$\mathbf{W}\mathbf{V} = \mathbf{U}\mathbf{S}$  (3)

as $\mathbf{V}^{\top}\mathbf{V} = \mathbf{I}$. Thus, the columns of $\mathbf{W}\mathbf{V}$ are the left-singular vectors rescaled by the singular values. Therefore, PCA and SVD are equivalent in factorizing $\mathbf{W}$.
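As a numerical sanity check, the factorization in (1) and its PCA equivalence in (2) and (3) can be verified with NumPy; the shapes below are arbitrary illustrative choices, not those of any tested model:

```python
import numpy as np

# Illustrative shapes: a 3x3 kernel with c = 4 input and m = 8 output channels.
rng = np.random.default_rng(0)
h, w, c, m = 3, 3, 4, 8
W4d = rng.standard_normal((h, w, c, m))

# Reshape the 4D weights into a 2D matrix of shape (n, m) with n = h*w*c.
n = h * w * c
W = W4d.reshape(n, m)

# Compact SVD (1): W = U S V^T with r = min(n, m) singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = s.size

# Orthonormality: U^T U = V^T V = I_r.
assert np.allclose(U.T @ U, np.eye(r))
assert np.allclose(Vt @ Vt.T, np.eye(r))

# PCA equivalence (2): W^T W = V S^2 V^T, so the columns of V are the
# eigenvectors of the covariance matrix, with eigenvalues s**2.
assert np.allclose(W.T @ W, Vt.T @ np.diag(s**2) @ Vt)

# Projection (3): W V = U S, the left-singular vectors rescaled by the
# singular values.
assert np.allclose(W @ Vt.T, U @ np.diag(s))
```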
3.2 Convolutional Layer Decomposition
Using SVD, the convolutional weights can be represented by the orthonormal bases in $\mathbf{U}$ and $\mathbf{V}$. Although the contributions of the basis vectors are proportional to the corresponding singular values, most singular values are of similar magnitudes, and choosing which to remove is non-trivial, especially without considering the image data. As we also want to preserve the original weights as much as possible while pruning, we introduce the basis-scaling convolutional (BasisScalingConv) layer to measure the importance of the basis vectors.
Given a pretrained convolutional layer with non-trainable convolutional weights $\mathbf{W}$ and bias $\mathbf{b}$, we let $\mathbf{x}$ be a column vector of length $n = hwc$ which contains the features input to the convolutional layer at a spatial location. The output features $\mathbf{y}$ at the same spatial location can be obtained as:

$\mathbf{y} = \mathbf{W}^{\top}\mathbf{x} + \mathbf{b} = \mathbf{V}\mathbf{S}\mathbf{U}^{\top}\mathbf{x} + \mathbf{b}$  (4)

by using (1). To rescale the basis vectors by their importance, we introduce a vector $\mathbf{s}$ of nonnegative scaling factors, and (4) is modified as:

$\mathbf{y} = \mathbf{V}\tilde{\mathbf{S}}\mathbf{S}\mathbf{U}^{\top}\mathbf{x} + \mathbf{b}$  (5)

with $\tilde{\mathbf{S}} = \mathrm{diag}(\mathbf{s})$ a diagonal matrix of $\mathbf{s}$. When $\mathbf{s} = \mathbf{1}$, (4) and (5) are identical. Using (5), we can decompose the convolutional layer into two consecutive layers (Fig. 1). The first layer is a regular convolutional layer with $\mathbf{U}\mathbf{S}$ as the convolutional weights and no bias. The second layer is the BasisScalingConv layer comprising $\mathbf{s}$, $\mathbf{V}$, and $\mathbf{b}$, where $\mathbf{V}$ is used as the convolutional weights to transform the outputs from the previous layer back to the original space. Only $\mathbf{s}$ is trainable in the decomposed layers. When $\mathbf{s}$ is updated in each step (batch), each scalar in $\mathbf{s}$ rescales the corresponding basis vector. In fact, the scaling of the basis can be viewed as basis fine-tuning for improving accuracy. Instead of using (5) as a single convolutional layer with $\mathbf{U}\mathbf{S}\tilde{\mathbf{S}}\mathbf{V}^{\top}$ as the kernel, dividing it into two layers reduces the number of weights, and thus the computational complexity, after the basis vectors are pruned. Although more weights are introduced before pruning, as compact SVD is used, the increase in the total number of weights is less than 22% with our tested models.
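The decomposition in (5) can be sketched numerically as follows; the shapes are illustrative, and the scaling factors are fixed at 1 so the decomposed layers should reproduce (4) exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 36, 8                      # n = h*w*c flattened input size (illustrative)
W = rng.standard_normal((n, m))   # reshaped pretrained weights
b = rng.standard_normal(m)        # bias
x = rng.standard_normal(n)        # input features at one spatial location

U, sv, Vt = np.linalg.svd(W, full_matrices=False)
S = np.diag(sv)

# Original layer (4): y = W^T x + b.
y_orig = W.T @ x + b

# Decomposed layers (5): the first layer applies the weights US (no bias),
# giving z = (US)^T x; the BasisScalingConv layer rescales z by s and maps
# back to the original space with V, adding the bias b.
s = np.ones(sv.size)              # basis scaling factors (s = 1 here)
z = (U @ S).T @ x                 # first layer output (basis coordinates)
y_dec = Vt.T @ (s * z) + b        # second (BasisScalingConv) layer

# With s = 1, (4) and (5) are identical.
assert np.allclose(y_orig, y_dec)
```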
3.3 Basis Pruning with First-Order Taylor Approximation of Importance
Our goal is to apply features trained from one dataset (e.g., ImageNet) to other datasets (e.g., CIFAR-10). Given a pretrained model, we keep all layers up to and including the last convolutional layer and the associated BN and activation layers, and add a global average pooling layer and a final fully-connected layer for classification. For transfer learning with basis pruning, we first decompose every convolutional layer as presented in Section 3.2. As BN layers are important for domain adaptation [14], they are trainable during transfer learning and are introduced after each convolutional layer if not present (e.g., VGGNet). Therefore, only the BN layers, the vector $\mathbf{s}$ in each BasisScalingConv layer, and the final fully-connected layer are trainable.

Although the magnitudes of the basis scaling factors can be used to indicate the relative importance of basis vectors, regularization is required to enhance sparsity for larger pruning ratios [17]. Finding the optimal regularization parameter that balances sparsity and accuracy is non-trivial and model dependent. Smaller learning rates and more epochs are also required. These largely reduce the automation and efficiency of the framework.
To avoid these limitations, we found that importance estimation by the first-order Taylor approximation (Taylor FO) is a good alternative [20]. In [20], the importance of a network parameter is quantified by the error induced by removing it. Using Taylor FO, the importance score $\mathcal{I}(s_i)$ of a scaling factor $s_i$ can be approximated by:
$\mathcal{I}(s_i) = \left( g_{s_i}\, s_i \right)^{2}$  (6)

with $g_{s_i}$ the gradient of the loss function with respect to $s_i$. Regardless of how the importance scores are computed, they are normalized in each layer by the maximum importance score in that layer. Fig. 2 shows that although the scaling factors tend to be smaller with small singular values, the fluctuations are large, especially in the deeper layers. In contrast, Taylor FO provides better distinctions of filter importance, which are highly correlated with the singular values.

After training with enough epochs for the desired classification accuracy, the less important basis vectors are pruned (Fig. 3(a)). Let $\tilde{r}$ be the number of scaling factors that remain after pruning; then $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}$ in (5) become $\mathbf{U} \in \mathbb{R}^{n \times \tilde{r}}$, $\mathbf{S} \in \mathbb{R}^{\tilde{r} \times \tilde{r}}$, and $\mathbf{V} \in \mathbb{R}^{m \times \tilde{r}}$, respectively. As the sizes of $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{b}$ are unaltered, basis pruning only affects the convolutional layer being pruned but not the subsequent layers. Therefore, basis pruning can be applied to virtually all network architectures. In contrast, pruning in the original space requires pruning of the subsequent convolutional layer, which becomes complicated when skip connections are involved. If an entire layer is pruned (i.e., $\tilde{r} = 0$), all subsequent layers are removed.
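A minimal sketch of this scoring and pruning step, assuming the squared gradient-times-parameter form of the Taylor FO importance from [20]; the layer size and the 50% removal fraction are illustrative:

```python
import numpy as np

def taylor_fo_importance(s, grad_s):
    """Taylor FO importance of each scaling factor: (dL/ds_i * s_i)^2."""
    return (grad_s * s) ** 2

def normalize_per_layer(scores):
    """Normalize the scores of a layer by the layer's maximum score."""
    return scores / scores.max()

rng = np.random.default_rng(2)
s = rng.uniform(0.1, 1.0, size=16)   # scaling factors of one layer
grad_s = rng.standard_normal(16)     # gradients accumulated over mini-batches

scores = normalize_per_layer(taylor_fo_importance(s, grad_s))

# Prune the basis vectors whose normalized scores fall below a threshold,
# here chosen to remove roughly half of the basis vectors in this layer.
threshold = np.quantile(scores, 0.5)
keep = scores > threshold
```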
3.4 Double Pruning
In basis pruning, only the output channels of the first decomposed layer ($\mathbf{U}\mathbf{S}$) and the corresponding input channels of the BasisScalingConv layer are pruned (Fig. 3(a)). Larger pruning ratios can be achieved by also pruning the input channels of $\mathbf{U}\mathbf{S}$ and the output channels of the BasisScalingConv layer (Fig. 3(b)). This can be achieved by pruning the output channels of each BasisScalingConv layer, as the input channels of the subsequent layer are pruned accordingly. In fact, after basis pruning, the BasisScalingConv layers can be treated as regular convolutional layers and any existing pruning algorithm can be applied.
Although double pruning can further increase the pruning ratios, it also inherits the common problems of pruning in the original space related to branching and merging (Fig. 4). For skip connections through concatenations (e.g., DenseNet [7]), the problems are less complicated as the channel positions after concatenation are trackable. For skip connections through elementwise merging (e.g., ResNet [5]), the convolutional layers whose outputs are merged are not pruned (e.g., the second convolutional layer in Fig. 4). This may reduce the pruning ratios but is more adaptive to complicated architectures.
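A shape-level sketch of double pruning under these conventions; the sizes are illustrative, and we assume a 3x3 kernel in the next layer with channel-fastest row ordering after reshaping:

```python
import numpy as np

rng = np.random.default_rng(4)
r_kept, m = 6, 8                           # basis vectors kept, output channels
Vt = rng.standard_normal((r_kept, m))      # BasisScalingConv weights after basis pruning
U_next = rng.standard_normal((9 * m, 16))  # next layer's US, with n' = 3*3*m rows

# Double pruning: remove output channels of the BasisScalingConv layer...
out_keep = np.array([True] * 6 + [False] * 2)  # keep 6 of the 8 output channels
Vt_pruned = Vt[:, out_keep]

# ...and the matching input channels of the next layer. With channel-fastest
# ordering, each of the 9 kernel positions repeats the channel mask.
in_keep = np.tile(out_keep, 9)
U_next_pruned = U_next[in_keep, :]
```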
3.5 Pruning Procedure

1. Given a pretrained model, keep all layers up to and including the last convolutional layer and the associated BN and activation layers. Add a global average pooling layer and a final fully-connected layer for classification. Insert BN layers if needed.

2. Decompose each convolutional layer into a convolutional layer and a BasisScalingConv layer (Section 3.2).

3. Train the model, with only the BN layers, the basis scaling factors, and the final fully-connected layer trainable, for the desired number of epochs.
4. Compute the importance score of each basis scaling factor and prune the basis vectors whose normalized scores are below a threshold (Section 3.3).

5. Further prune the output channels of the BasisScalingConv layers with an existing pruning algorithm for double pruning (Section 3.4).

Multiple iterations can be performed in Steps 4 and 5, though we found that one iteration is enough, especially for simpler problems (e.g., MNIST).
Regardless of how the importance scores are computed, it is important to normalize them within each layer. This is because importance scores can have different ranges in different layers. For example, HRank scores [15] are matrix ranks which depend on the spatial sizes of the feature tensors, so filters in layers after downsampling have smaller maximum scores. Normalizing the importance scores allows them to be compared within a layer or across the entire model.
To determine the thresholds in Steps 4 and 5, we decide the percentage of basis vectors or filters to be removed from the entire model and compute the corresponding threshold.
To minimize manual involvement and improve training efficiency, different from [13], no manual layer selections are performed. Moreover, we do not use gradual filter removal [20] or filter-by-filter pruning [15]. Only the percentage of basis vectors or filters to be removed is needed.
In Step 3, each scaling factor in the BasisScalingConv or BN layers modifies the weights of a basis vector or a filter as a whole. We regard this as basis- or filter-based tuning, which is more efficient than fine-tuning individual weights.
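The threshold computation described above can be sketched as follows: pool the per-layer normalized scores and take the quantile matching the desired removal percentage. The layer sizes and scores here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
layer_scores = [rng.uniform(0.0, 1.0, size=k) for k in (16, 32, 64)]

# Normalize the scores in each layer by the layer's maximum score.
normalized = [sc / sc.max() for sc in layer_scores]

# One global threshold from the desired removal percentage.
remove_fraction = 0.3                       # remove ~30% of basis vectors/filters
all_scores = np.concatenate(normalized)
threshold = np.quantile(all_scores, remove_fraction)

# Per-layer keep masks derived from the single global threshold.
masks = [sc > threshold for sc in normalized]
removed = sum(int((~mask).sum()) for mask in masks)
```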
4 Experiments
4.1 Models and Datasets
To study the characteristics of our framework, we performed transfer learning with three ImageNet [1] pretrained models on three other datasets. ImageNet was used as the source dataset because of its abundant features trained from 1.2 million images. The three models correspond to the architectures of VGG16 [24], DenseNet121 [7], and ResNet50 [5]. VGG16 has a relatively simple architecture. DenseNet121 and ResNet50 have skip connections realized by tensor concatenation and addition, respectively.
The three datasets are CIFAR-10 [10], MNIST [11], and Fashion-MNIST [27]. The CIFAR-10 dataset consists of 32×32 colour images in 10 classes of animals and vehicles, with 50k training images and 10k test images. The MNIST dataset of handwritten digits (0 to 9) has 60k 28×28 grayscale training images and 10k test images in 10 classes. The Fashion-MNIST dataset has 60k 28×28 grayscale training images and 10k test images in 10 classes of fashion categories, and can be used as a drop-in replacement for MNIST. Each set of training images was split into 90% for training and 10% for validation. Only the results on the test images are reported.
4.2 Tested Frameworks
Given a pretrained model, we kept all layers up to and including the last convolutional layer and the associated BN and activation layers, and added a global average pooling layer and a final fully-connected layer. The BN layers were trainable along with the final fully-connected layer, while the other layers were frozen. Different frameworks were tested based on this configuration:

Baseline: No layer decompositions and pruning.

L1 [13]: No layer decompositions. Pruning using the L1 norms of filters as the importance scores.

Taylor FO [20]: No layer decompositions. Pruning by the Taylor FO importance scores. The gradients were computed from the validation dataset.

HRank [15]: No layer decompositions. Pruning by the HRank importance scores which are the matrix ranks of the feature maps of the validation dataset.

Basis: All convolutional layers were decomposed. Pruning by the Taylor FO importance scores of the basis scaling factors (Section 3.3).

Basis + Taylor FO: Double pruning by the Taylor FO importance scores (Section 3.4).

Basis + HRank: Double pruning by the HRank importance scores (Section 3.4).
For the frameworks without layer decompositions, the pruning procedure in Section 3.5 was applied without Steps 2 and 4, and the target layers in Step 5 became the convolutional layers. For all frameworks, only one iteration was performed in Steps 4 and 5. As the L1 framework was less effective in our experiments, we did not use it for double pruning. Note that our goal is to study the pruning capabilities of our frameworks, not to compete for the best accuracy on specific datasets.
4.3 Training Strategy
For transfer learning, as the network architectures of the ImageNet pretrained models (VGG16, DenseNet121, and ResNet50) were created for the image size of 224×224, directly applying them to the target datasets of much smaller image sizes leads to insufficient spatial sizes (e.g., feature maps of size 1×1) in the deeper layers and thus poor performance. Therefore, we enlarged the image size by four times in each dimension, i.e., 128×128 for CIFAR-10 and 112×112 for MNIST and Fashion-MNIST. Image augmentation was used to reduce overfitting, with 15% of shifting in height and width for all datasets, random horizontal flipping for CIFAR-10 and Fashion-MNIST, and 15° of rotation for MNIST. Every image was zero-centered in intensity. Dropout with rate 0.5 was applied before the final fully-connected layer. Stochastic gradient descent (SGD) with cosine annealing [18] was used as the learning rate scheduler without restarts, with the minimum and maximum learning rates as and , respectively. The SGD optimizer was used with momentum of 0.9 and a batch size of 128. There were 100 epochs for each training. All scaling factors were initialized to 0.5 and constrained to be nonnegative. The same settings were applied to same-dataset pruning, except that no dropout was used and the batch size became 64.

When training models from scratch for same-dataset pruning, as the network architectures (VGG14, DenseNetBC100, and ResNet56) are tailored for CIFAR-10, the original image size was used. The same image augmentation and preprocessing as in transfer learning were applied. A weight decay of was applied on the convolutional weights, without dropout. SGD with momentum of 0.9 was used with warm restarts by cosine annealing, with the minimum and maximum learning rates as and , respectively. There were 300 epochs for each training, and the learning rate scheduler restarted after 150 epochs with a batch size of 64.
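The cosine annealing schedule of [18], without restarts, can be written as below; the lr_min and lr_max values are placeholders, since the exact rates are elided in the text:

```python
import math

def cosine_annealing(epoch, total_epochs, lr_min, lr_max):
    """Cosine-annealed learning rate without restarts [18]."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

# The rate decays smoothly from lr_max at epoch 0 to lr_min at the last epoch.
lr_start = cosine_annealing(0, 100, lr_min=1e-5, lr_max=1e-2)
lr_end = cosine_annealing(100, 100, lr_min=1e-5, lr_max=1e-2)
```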
The implementation was in Keras with the TensorFlow backend. Each experiment was performed on an NVIDIA Tesla V100 GPU with 16 GB of memory.
Table 1: Pruning performance on CIFAR-10 with transfer learning from ImageNet (PR = pruning ratio).
Framework  Accuracy  Parameters (PR)  FLOPs (PR)

VGG16  90.9%  14.74M  (0.0%)  5.03G  (0.0%) 
L1 [13]  88.0%  13.42M  (8.9%)  3.15G  (37.5%) 
Taylor FO [20]  90.0%  6.57M  (55.4%)  2.48G  (50.8%) 
HRank [15]  90.9%  11.67M  (20.8%)  4.11G  (18.4%) 
Basis  91.2%  7.99M  (45.8%)  3.21G  (36.2%) 
Basis + Taylor FO  90.5%  5.42M  (63.2%)  2.45G  (51.2%) 
Basis + HRank  90.9%  7.13M  (51.6%)  2.95G  (41.3%) 
DenseNet121  95.2%  7.05M  (0.0%)  0.93G  (0.0%) 
L1 [13]  93.6%  6.02M  (14.5%)  0.70G  (24.7%) 
Taylor FO [20]  95.1%  4.15M  (41.2%)  0.57G  (38.8%) 
HRank [15]  94.3%  5.41M  (23.2%)  0.78G  (15.9%) 
Basis  94.4%  4.75M  (32.6%)  0.65G  (29.9%) 
Basis + Taylor FO  94.7%  4.43M  (37.2%)  0.60G  (35.7%) 
Basis + HRank  94.1%  4.20M  (40.4%)  0.60G  (35.5%) 
ResNet50  92.4%  23.61M  (0.0%)  1.29G  (0.0%) 
L1 [13]  92.1%  23.15M  (1.9%)  1.18G  (8.3%) 
Taylor FO [20]  91.4%  10.89M  (53.9%)  0.75G  (42.0%) 
HRank [15]  91.8%  17.67M  (25.2%)  1.00G  (22.5%) 
Basis  92.2%  7.14M  (69.7%)  0.61G  (52.6%) 
Basis + Taylor FO  91.6%  6.00M  (74.6%)  0.52G  (59.8%) 
Basis + HRank  92.1%  6.39M  (72.9%)  0.55G  (57.0%) 
Table 2: Pruning performance on MNIST with transfer learning from ImageNet (PR = pruning ratio).
Framework  Accuracy  Parameters (PR)  FLOPs (PR)

VGG16  99.5%  14.74M  (0.0%)  3.85G  (0.0%) 
L1 [13]  99.3%  4.45M  (69.8%)  0.55G  (85.8%) 
Taylor FO [20]  99.3%  0.73M  (95.1%)  0.29G  (92.4%) 
HRank [15]  99.4%  3.71M  (74.8%)  1.51G  (60.8%) 
Basis  99.4%  0.64M  (95.6%)  0.35G  (90.8%) 
Basis + Taylor FO  99.3%  0.16M  (98.9%)  0.12G  (97.0%) 
Basis + HRank  99.3%  0.35M  (97.6%)  0.25G  (93.5%) 
DenseNet121  99.6%  7.05M  (0.0%)  0.71G  (0.0%) 
L1 [13]  99.2%  0.11M  (98.5%)  0.01G  (98.6%) 
Taylor FO [20]  99.3%  0.12M  (98.2%)  0.01G  (98.3%) 
HRank [15]  99.1%  0.30M  (95.8%)  0.11G  (83.8%) 
Basis  99.1%  0.37M  (94.8%)  0.03G  (96.1%) 
Basis + Taylor FO  99.4%  0.19M  (97.4%)  0.01G  (98.3%) 
Basis + HRank  99.0%  0.32M  (95.4%)  0.02G  (96.7%) 
ResNet50  99.4%  23.61M  (0.0%)  1.05G  (0.0%) 
L1 [13]  99.0%  6.14M  (74.0%)  0.24G  (76.9%) 
Taylor FO [20]  99.2%  5.74M  (75.7%)  0.23G  (77.7%) 
HRank [15]  99.3%  6.54M  (72.3%)  0.44G  (57.5%) 
Basis  99.1%  0.55M  (97.7%)  0.05G  (95.1%) 
Basis + Taylor FO  99.1%  0.48M  (98.0%)  0.04G  (95.9%) 
Basis + HRank  99.0%  0.50M  (97.9%)  0.04G  (95.7%) 
Table 3: Pruning performance on Fashion-MNIST with transfer learning from ImageNet (PR = pruning ratio).
Framework  Accuracy  Parameters (PR)  FLOPs (PR)

VGG16  93.2%  14.74M  (0.0%)  3.85G  (0.0%) 
L1 [13]  92.9%  13.42M  (8.9%)  2.41G  (37.4%) 
Taylor FO [20]  92.5%  2.64M  (82.1%)  1.04G  (73.1%) 
HRank [15]  92.2%  3.54M  (76.0%)  1.66G  (56.9%) 
Basis  92.3%  1.37M  (90.7%)  0.68G  (82.3%) 
Basis + Taylor FO  92.4%  0.90M  (93.9%)  0.51G  (86.6%) 
Basis + HRank  92.0%  0.98M  (93.4%)  0.54G  (85.9%) 
DenseNet121  93.7%  7.05M  (0.0%)  0.71G  (0.0%) 
L1 [13]  93.1%  2.54M  (63.9%)  0.17G  (76.6%) 
Taylor FO [20]  92.8%  0.94M  (86.6%)  0.11G  (85.0%) 
HRank [15]  92.7%  1.06M  (85.0%)  0.28G  (60.0%) 
Basis  93.4%  1.13M  (84.0%)  0.14G  (80.0%) 
Basis + Taylor FO  92.7%  0.63M  (91.0%)  0.07G  (90.6%) 
Basis + HRank  93.5%  0.83M  (88.3%)  0.11G  (85.1%) 
ResNet50  94.0%  23.61M  (0.0%)  1.05G  (0.0%) 
L1 [13]  93.5%  22.04M  (6.6%)  0.86G  (18.1%) 
Taylor FO [20]  93.0%  9.12M  (61.4%)  0.47G  (55.1%) 
HRank [15]  93.2%  4.73M  (80.0%)  0.28G  (73.4%) 
Basis  93.3%  1.99M  (91.6%)  0.21G  (79.6%) 
Basis + Taylor FO  93.4%  1.47M  (93.8%)  0.16G  (85.1%) 
Basis + HRank  93.4%  1.46M  (93.8%)  0.15G  (85.3%) 
4.4 Pruning Performance on Transfer Learning
Tables 1, 2, and 3 show the comparisons among the frameworks. To compare the pruning capabilities, we obtained the largest pruning ratios while keeping the accuracy reductions less than 1%. On a few occasions, this 1% requirement could not be achieved even with only 10% of the channels removed. For MNIST, we kept the accuracies at 99% or above.
The pruning ratios were inversely related to the difficulty of the classification problems. The largest parameter pruning ratios on CIFAR-10, MNIST, and Fashion-MNIST were 74.6%, 98.9%, and 93.9%, respectively, all achieved by our Basis + Taylor FO framework. For ResNet50, our frameworks outperformed the others by a large margin on all datasets. This is because basis pruning can be applied to all convolutional layers regardless of the existence of elementwise merging. Our frameworks were also dominant on VGG16. Although they were less effective with DenseNet121 on CIFAR-10 and MNIST, our best results were lower than the best ones by less than 1.1% in parameters and 3.1% in FLOPs.
In seven out of the nine combinations (3 architectures × 3 datasets), our double pruning frameworks outperformed the others. This was mostly (6 out of 7) achieved by Basis + Taylor FO and once by Basis + HRank. In fact, even without basis pruning, Taylor FO outperformed L1 and HRank in our results. Note that Taylor FO utilizes information from both the data and the weights, whereas L1 and HRank only depend on the weight magnitudes and the feature maps, respectively. This could explain why using Taylor FO in our double pruning approach often outperformed the other frameworks.
Table 4: Same-dataset pruning performance on CIFAR-10 (PR = pruning ratio).
Framework  Accuracy  Parameters (PR)  FLOPs (PR)

VGG14  93.4%  14.74M  (0.0%)  0.45G  (0.0%) 
L1 [13]  92.9%  12.79M  (13.2%)  0.29G  (35.7%) 
Taylor FO [20]  92.7%  12.40M  (15.8%)  0.28G  (38.3%) 
HRank [15]  90.7%  12.64M  (14.2%)  0.26G  (41.2%) 
Basis  92.5%  2.20M  (85.1%)  0.13G  (71.0%) 
Basis + Taylor FO  92.5%  2.08M  (85.9%)  0.12G  (74.1%) 
Basis + HRank  91.9%  2.02M  (86.3%)  0.12G  (73.1%) 
DenseNetBC100  94.3%  0.79M  (0.0%)  0.29G  (0.0%) 
L1 [13]  93.2%  0.66M  (17.1%)  0.20G  (31.9%) 
Taylor FO [20]  93.3%  0.60M  (23.8%)  0.18G  (39.4%) 
HRank [15]  91.3%  0.63M  (20.8%)  0.27G  (7.6%) 
Basis  93.7%  0.57M  (28.2%)  0.19G  (37.0%) 
Basis + Taylor FO  93.6%  0.55M  (30.8%)  0.17G  (42.3%) 
Basis + HRank  91.5%  0.54M  (31.9%)  0.18G  (38.1%) 
ResNet56  93.0%  0.86M  (0.0%)  0.13G  (0.0%) 
L1 [13]  90.6%  0.83M  (3.5%)  0.11G  (17.3%) 
Taylor FO [20]  91.0%  0.83M  (3.3%)  0.11G  (15.8%) 
HRank [15]  90.3%  0.66M  (23.8%)  0.11G  (19.3%) 
Basis  91.1%  0.71M  (17.9%)  0.10G  (28.7%) 
Basis + Taylor FO  90.1%  0.67M  (21.8%)  0.09G  (33.7%) 
Basis + HRank  90.3%  0.68M  (21.0%)  0.09G  (32.4%) 
4.5 Performance on SameDataset Pruning
Although our frameworks are built for transfer learning, they are also applicable to same-dataset pruning. Table 4 shows the results of same-dataset pruning on CIFAR-10. Similar to the transfer learning results, the Basis + Taylor FO framework performed best in general. Nevertheless, most pruning ratios were smaller compared with transfer learning. This is because the architectures tailored for CIFAR-10 were much smaller in terms of model parameters. DenseNetBC100 for CIFAR-10 is around nine times smaller than DenseNet121 for ImageNet [7]. ResNet56 for CIFAR-10 is around 27 times smaller than ResNet50 for ImageNet [5]. As VGG14 was highly overparameterized, the pruning ratios were still large. In contrast, for ResNet56, even though we only removed 10% of the basis vectors or filters, the reductions in accuracy were larger than 1% for all frameworks. As the pruning procedure in Section 3.5 was applied, there were no manual layer selections or layer-by-layer pruning. These may be applied to improve the pruning ratios, but would largely reduce the automation and computational feasibility.
4.6 Characteristics of Basis Vectors
We studied the characteristics of the basis vectors. Fig. 5 compares different methods of importance estimation for basis pruning on CIFAR-10. For all network architectures, Random and Reverse obtained the worst results. In contrast, Taylor FO and Singular Values resulted in much higher accuracy, with Taylor FO performing better on VGG16 and ResNet50. These results show that basis vectors with small singular values should be pruned, and that the Taylor FO importance scores are preferable.
4.7 Trainable Model Parameters
Table 5 shows the numbers of total and trainable model parameters after layer decomposition and before pruning. Although the total numbers of parameters became larger because of the layer decomposition, the trainable parameters were less than 1.2%, and could be as low as 0.1%, of the total numbers of parameters. Therefore, our framework is advantageous for transfer learning with limited data.
Table 5: Numbers of total and trainable model parameters after layer decomposition and before pruning.
Model  Total  Trainable
VGG16  16.55M  17.77k 
DenseNet121  8.40M  104.04k 
ResNet50  28.78M  86.86k 
5 Conclusion
In this paper, we present the basis and double pruning approaches for efficient transfer learning. Using singular value decomposition, a convolutional layer can be decomposed into two consecutive layers with the basis vectors as their convolutional weights. With the introduced basis scaling factors, the basis vectors can be fine-tuned and pruned to reduce the network size and inference time, regardless of the existence of skip connections. Basis pruning can be further combined with other pruning algorithms for double pruning to obtain pruning ratios that cannot be achieved by either alone. Experimental results show that basis pruning outperformed pruning in the original feature space, and the performance was even more distinctive with double pruning. Without fine-tuning of individual weights, the number of trainable model parameters can be very small. Hence, our framework is ideal for transfer learning with limited data.
References

 [1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
 [2] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
 [3] Isha Garg, Priyadarshini Panda, and Kaushik Roy. A low effort approach to structured CNN design using PCA. IEEE Access, 8:1347–1360, 2020.
 [4] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [6] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
 [7] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
 [8] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, 2014.
 [9] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529, 2019.
 [10] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [12] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
 [13] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
 [14] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. In International Conference on Learning Representations (Workshop), 2017.
 [15] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1529–1538, 2020.
 [16] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
 [17] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
 [18] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
 [19] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
 [20] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.
 [21] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017.
 [22] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
 [23] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI. arXiv:1907.10597 [cs.CY], 2019.
 [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
 [25] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279, 2018.
 [26] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [27] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 [cs.LG], 2017.
 [28] Huanrui Yang, Minxue Tang, Wei Wen, Feng Yan, Daniel Hu, Ang Li, Hai Li, and Yiran Chen. Learning low-rank deep neural networks via singular vector orthogonality regularization and singular value sparsification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 678–679, 2020.
 [29] Jianbo Ye, Xin Lu, Zhe Lin, and James Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations, 2018.
 [30] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2015.