
Basis Scaling and Double Pruning for Efficient Transfer Learning

08/06/2021
by   Ken C. L. Wong, et al.

Transfer learning allows the reuse of deep learning features on new datasets with limited data. However, the resulting models can be unnecessarily large and thus inefficient. Although network pruning can be applied to improve inference efficiency, existing algorithms usually require fine-tuning and may not be suitable for small datasets. In this paper, we propose an algorithm that transforms the convolutional weights into the subspaces of orthonormal bases where a model is pruned. Using singular value decomposition, we decompose a convolutional layer into two layers: a convolutional layer with the orthonormal basis vectors as the filters, and a layer that we name "BasisScalingConv", which is responsible for rescaling the features and transforming them back to the original space. As the filters in each transformed layer are linearly independent with known relative importance, pruning can be more effective and stable, and fine-tuning of individual weights is unnecessary. Furthermore, as the numbers of input and output channels of the original convolutional layer remain unchanged, basis pruning is applicable to virtually all network architectures. Basis pruning can also be combined with existing pruning algorithms for double pruning to further increase the pruning capability. With less than 1% reduction in the classification accuracy, we can achieve pruning ratios up to 98.9% in parameters and 98.6% in FLOPs.



1 Introduction

Deep convolutional neural networks have dominated the area of applied computer vision. The network architectures used for image analysis have grown in terms of performance along with their numbers of layers and parameters over the years. Nevertheless, as the ubiquitous use of these networks is now extended to resource-limited areas such as edge computing, optimization of the architectures to minimize computational requirements has become essential. Furthermore, the reduction in FLOPs at inference time directly impacts the power consumption of large-scale consumer-facing applications of artificial intelligence. As a result, the advocates of green AI recommend the use of network size and the number of FLOPs as important performance evaluation metrics for neural networks, along with accuracy [23].

To achieve the goal of architectural efficiency, a significant amount of progress has been made in the area of pruning. Pruning is the process of finding architectural components of a network that can be removed without a large loss of performance. Since the early days of convolutional neural networks, it has been shown that removing unimportant components of neural networks delivers benefits in generalization and computational efficiency [12].

Similar to [21], in this work we focus on pruning in the context of transfer learning. While domain adaptation has been the focus of most work in transfer learning, we focus on the neglected problem of efficient inference in the target domain through pruning. Transfer learning is necessary for domains in which large-scale and well-annotated datasets are scarce due to the cost of data acquisition and annotation [25]. In transfer learning, as the source dataset usually contains features that are not present in the target dataset, using off-the-shelf features or fine-tuning without pruning can result in unnecessarily large models. In contrast, transfer learning with pruning can achieve substantial reductions in parameters and FLOPs, potentially even larger than those of same-dataset pruning, as we found in our experiments.

Existing network pruning algorithms can be categorized in different ways. Pruning can be achieved by removing unstructured weights and connections [4, 16], or by removing structural contents such as filters or layers [6, 13, 15, 17, 19, 20, 21, 26, 29]. While most algorithms perform pruning directly on the convolutional weight matrix, some try to reconstruct the weight matrix or its output features by low-rank approximation to reduce inference time [2, 8, 30]. Some frameworks perform pruning without considering image data [8, 13], while most works use image data for better pruning ratios and accuracy [6, 17, 19, 29].

Although these frameworks provide promising results, there are several limitations. As discussed in [9, 22], since the filters in a layer are linearly dependent, pruning in the original filter space can be less effective. On the other hand, low-rank approximations require additional optimizations apart from backpropagation to perform filter or feature reconstruction. Furthermore, fine-tuning of the entire network after pruning is required in most frameworks, which may not be desirable for transfer learning with limited data.

In view of these issues, here we propose a framework that fine-tunes and prunes the orthogonal bases obtained by applying singular value decomposition (SVD) on the convolutional weight matrices. Our contributions are as follows:

  • We propose a basis pruning algorithm that prunes convolutional layers in orthogonal subspaces regardless of the network architecture. As the basis vectors are non-trainable in our framework to facilitate transfer learning, we introduce the basis scaling factors which are responsible for both importance estimation and fine-tuning of basis vectors. These basis scaling factors are trainable by backpropagation and contribute only a very small number of trainable parameters.

  • After basis pruning, as the numbers of input and output channels of the original convolutional layers remain unchanged, other pruning algorithms can be further applied for double pruning. By combining the advantages of basis pruning and other pruning algorithms, we can achieve larger pruning ratios that cannot be achieved by either alone. This provides a new approach that can amplify existing pruning mechanisms.

We tested our framework by transferring the features of ImageNet pre-trained models to classify the CIFAR-10, MNIST, and Fashion-MNIST datasets. With less than 1% reduction in the classification accuracy, we can achieve pruning ratios up to 98.9% in parameters and 98.6% in FLOPs. Experiments on same-dataset pruning were also performed on CIFAR-10 for comparisons.

2 Related Work

Channel/Filter Pruning. Network pruning can be achieved by pruning individual weights or entire channels/filters. Although pruning individual weights can achieve high compression ratios because of its flexibility [4, 16], as discussed in [26], the practical speedup can be limited given the irregular weight sparsity, unless specialized software or hardware is utilized. In contrast, channel pruning utilizes structured sparsity [6, 13, 15, 17, 19, 20, 21, 26, 29]. Although channel pruning is less flexible than weight-level pruning, the dense matrix structures are maintained after pruning and significant practical speedup can be achieved with off-the-shelf libraries. In [13], the L1 norm of each filter was computed as its relative importance within a convolutional layer, and filters with smaller L1 norms were pruned. In [6, 19], the channels were pruned while minimizing the reconstruction error between the original and modified feature maps in each layer. In [17, 29], the scaling factors in the batch normalization (BN) layers were used for channel pruning. While only backpropagation was used in [17], the framework in [29] utilized an additional optimizer to update the scaling factors during training. In [20], the importance of a parameter was quantified by the error induced by removing it. Using Taylor expansions to approximate such errors with the gradients of minibatches, less important filters were pruned.

Pruning and Matrix Factorization. Matrix factorization techniques such as SVD and principal component analysis (PCA) have been applied to deep learning. A convolutional weight matrix or a feature tensor can be factorized into a specified canonical form to reveal properties that cannot be observed in the original space. This transformation also enables special operations that lead to higher computational efficiency or accuracy. In [2, 8, 30], pre-trained convolutional filters were approximated by low-rank basis vectors to reduce inference time, which can be viewed as low-rank matrix approximation by SVD. In [28], without pre-training, the weight matrices were factorized by SVD. SVD training was then performed on the decomposed weights with orthogonality regularization and sparsity-inducing regularizers, and the resulting network was pruned by the singular values and fine-tuned. In [3], PCA was applied on the feature maps to analyze the network structure for the optimal layer-wise widths and number of layers, and the resulting structure was then trained from scratch. In [15], SVD was used to obtain the ranks of feature maps. Filters with lower ranks were pruned before fine-tuning.

Relation to Our Work. Similar to most frameworks, channel pruning is used in our work. Nevertheless, as our goal is efficient transfer learning from one dataset to another, potentially much smaller, dataset, we try to minimize the number of trainable parameters during importance estimation and fine-tuning. To this end, we adopt scaling factors which allow us to perform filter-based fine-tuning with far fewer trainable parameters. Furthermore, inspired by [9, 22], we found that pruning the linearly independent filters obtained by SVD can be more effective. Therefore, we combine these advantages by proposing basis vector rescaling and pruning, and present a double pruning framework for improving pruning ratios. Note that although SVD was also used in [28] for weight factorization, their goal was to train a model from scratch and thus orthogonality regularization was necessary. In contrast, the basis vectors are non-trainable in our framework and thus orthogonality is naturally preserved.

3 Methodology

The goal of our framework is to represent the convolutional weights with orthogonal bases that allow more effective network pruning. As discussed in [9, 22], the features of a layer are distributed among the linearly dependent filters, and their representations are different with different initializations. By representing the features with orthogonal bases obtained by SVD or PCA, much fewer channels are required to achieve the same accuracy. Furthermore, it was shown in [9] that the principal components of features trained from different random initializations are similar, which means that the features are more uniquely represented in such subspaces. Therefore, network pruning in orthogonal subspaces can be more effective. In addition, as shown in [2, 8], approximating the weights with low-rank tensor approximations can reduce computational complexity. In view of these advantages, here we propose a framework that utilizes SVD/PCA for network pruning.

3.1 Convolutional Weights Representation in Orthogonal Subspaces

Let $\mathbf{W} \in \mathbb{R}^{h \times w \times m \times n}$ be the 4-D weight matrix of a convolutional layer, where $h$ and $w$ are the kernel height and width, and $m$ and $n$ are the numbers of input and output channels, respectively. For efficient transfer learning, we assume that the convolutional weights are pre-trained and non-trainable. $\mathbf{W}$ can be reshaped into a 2-D matrix $\widetilde{\mathbf{W}} \in \mathbb{R}^{hwm \times n}$ for further processing, which can be factorized by compact SVD as:

$$\widetilde{\mathbf{W}} = \mathbf{U} \mathbf{S} \mathbf{V}^{T} \quad (1)$$

where $\mathbf{U} \in \mathbb{R}^{hwm \times r}$ contains the columns of left-singular vectors, $\mathbf{V} \in \mathbb{R}^{n \times r}$ contains the columns of right-singular vectors, and $\mathbf{S} \in \mathbb{R}^{r \times r}$ is a diagonal matrix of singular values in descending order. $r \leq \min(hwm, n)$ is the maximum rank of $\widetilde{\mathbf{W}}$. As the columns of $\mathbf{U}$ yield an orthonormal basis, like those of $\mathbf{V}$, we have $\mathbf{U}^{T}\mathbf{U} = \mathbf{V}^{T}\mathbf{V} = \mathbf{I}$, with $\mathbf{I}$ an identity matrix. With SVD, we can perform rescaling and pruning in the subspaces of $\mathbf{U}$ and $\mathbf{V}$.

To transform the weight matrix with PCA, we can view the rows and columns of $\widetilde{\mathbf{W}}$ as samples and features, respectively. To use PCA, we compute the symmetric covariance matrix as:

$$\mathbf{C} = \widetilde{\mathbf{W}}^{T} \widetilde{\mathbf{W}} = \mathbf{V} \mathbf{S}^{2} \mathbf{V}^{T} \quad (2)$$

by using (1) and $\mathbf{U}^{T}\mathbf{U} = \mathbf{I}$. Therefore, the columns of $\mathbf{V}$ are the eigenvectors of $\mathbf{C}$ corresponding to the nonzero eigenvalues. $\widetilde{\mathbf{W}}$ can be projected onto $\mathbf{V}$ as:

$$\widetilde{\mathbf{W}} \mathbf{V} = \mathbf{U} \mathbf{S} \mathbf{V}^{T} \mathbf{V} = \mathbf{U} \mathbf{S} \quad (3)$$

as $\mathbf{V}^{T}\mathbf{V} = \mathbf{I}$. Thus, the columns of $\mathbf{U}\mathbf{S}$ are the left-singular vectors rescaled by the singular values. Therefore, PCA and SVD are equivalent in factorizing $\widetilde{\mathbf{W}}$.
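To make the factorization concrete, here is a minimal NumPy sketch (illustrative shapes and random weights standing in for a pre-trained layer, not the authors' code) that reshapes a 4-D convolutional weight, computes the compact SVD of (1), and checks the PCA relations (2) and (3):

```python
import numpy as np

h, w, m, n = 3, 3, 64, 128                 # kernel size, input/output channels (assumed)
W = np.random.randn(h, w, m, n)            # stands in for pre-trained weights

W2d = W.reshape(h * w * m, n)              # 2-D weight matrix of shape (hwm, n)
U, s, Vt = np.linalg.svd(W2d, full_matrices=False)   # compact SVD: W2d = U S V^T
S = np.diag(s)
r = len(s)                                 # rank of the compact factorization

assert np.allclose(U @ S @ Vt, W2d)                        # Eq. (1)
assert np.allclose(U.T @ U, np.eye(r))                     # orthonormal left basis
assert np.allclose(Vt @ Vt.T, np.eye(r))                   # orthonormal right basis

# PCA view: C = W2d^T W2d = V S^2 V^T (Eq. (2)), and projecting W2d onto V gives U S (Eq. (3)).
C = W2d.T @ W2d
assert np.allclose(C, Vt.T @ S**2 @ Vt)
assert np.allclose(W2d @ Vt.T, U @ S)
```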

3.2 Convolutional Layer Decomposition

Figure 1: Decomposition of a convolutional layer. Only the vector of scaling factors $\mathbf{s}$ is trainable during transfer learning.

Using SVD, the convolutional weights can be represented by the orthonormal bases in and . Although the contributions of the basis vectors are proportional to the corresponding singular values, most singular values are of similar magnitudes and choosing which to be removed is nontrivial especially without considering the image data. As we also want to preserve the original weights as much as possible while pruning, here we introduce the basis-scaling convolutional (BasisScalingConv) layer to measure the importance of the basis vectors.

Figure 2: The magnitudes of the basis scaling factors (red) and the first-order Taylor approximation of importance (blue) of three different models on CIFAR-10 (panels show selected layers of VGG-16, DenseNet-121, and ResNet-50). The scaling factors in each layer are arranged in descending order of the singular values. Note that ResNet-50 actually has 53 convolutional layers because of the four convolutional shortcuts.

Given a pre-trained convolutional layer with non-trainable convolutional weights $\widetilde{\mathbf{W}}$ and bias $\mathbf{b}$, we let $\mathbf{x}$ be a column vector of length $hwm$ which contains the features input to the convolutional layer at a spatial location. The output features at the same spatial location can be obtained as:

$$\mathbf{y} = \widetilde{\mathbf{W}}^{T} \mathbf{x} + \mathbf{b} = \mathbf{V} \mathbf{S} \mathbf{U}^{T} \mathbf{x} + \mathbf{b} \quad (4)$$

by using (1) with $\widetilde{\mathbf{W}}^{T} = \mathbf{V} \mathbf{S} \mathbf{U}^{T}$. To rescale the basis vectors by their importance, we introduce a vector $\mathbf{s}$ of non-negative scaling factors, and (4) is modified as:

$$\mathbf{y} = \mathbf{V} \mathbf{S} \, \mathrm{diag}(\mathbf{s}) \, \mathbf{U}^{T} \mathbf{x} + \mathbf{b} \quad (5)$$

with $\mathrm{diag}(\mathbf{s})$ a diagonal matrix formed from $\mathbf{s}$. When $\mathbf{s} = \mathbf{1}$, (4) and (5) are identical. Using (5), we can decompose the convolutional layer into two consecutive layers (Fig. 1). The first layer is a regular convolutional layer with the basis vectors in $\mathbf{U}$ as the filters and no bias. The second layer is the BasisScalingConv layer comprising $\mathbf{s}$, $\mathbf{S}\mathbf{V}^{T}$, and $\mathbf{b}$; the rescaled $\mathbf{S}\mathbf{V}^{T}$ is used as the convolutional weights to transform the outputs of the previous layer back to the original space. Only $\mathbf{s}$ is trainable in the decomposed layers. When $\mathbf{s}$ is updated in each step (batch), each scalar in $\mathbf{s}$ rescales the corresponding row in $\mathbf{S}\mathbf{V}^{T}$. In fact, the scaling of the basis can be viewed as basis fine-tuning for improving accuracy. Instead of using (5) as a single convolutional layer with the recombined $\mathbf{U} \mathbf{S} \, \mathrm{diag}(\mathbf{s}) \, \mathbf{V}^{T}$ as the kernel, dividing it into two layers reduces the number of weights and thus the computational complexity after the basis vectors are pruned. Although more weights are introduced before pruning, as compact SVD is used, the increase in the total number of weights is less than 22% with our tested models.
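The equivalence of (4) and (5) when all scaling factors are 1 can be checked with a small NumPy sketch; convolution is evaluated at a single spatial location, where it reduces to a matrix-vector product, and `scale` stands in for the trainable vector $\mathbf{s}$ (shapes are illustrative assumptions):

```python
import numpy as np

h, w, m, n = 3, 3, 64, 128
W2d = np.random.randn(h * w * m, n)          # stands in for a pre-trained 2-D weight matrix
b = np.random.randn(n)
U, s, Vt = np.linalg.svd(W2d, full_matrices=False)
S = np.diag(s)

x = np.random.randn(h * w * m)               # input patch flattened at one spatial location
y_orig = W2d.T @ x + b                       # original convolutional layer, Eq. (4)

# Layer 1: convolution with the basis vectors in U as filters (no bias).
z = U.T @ x
# Layer 2 (BasisScalingConv): rescale each basis response, then map back with S V^T and add b.
scale = np.ones(len(s))                      # trainable scaling factors, all ones here
y_two = (np.diag(scale) @ S @ Vt).T @ z + b  # Eq. (5)

assert np.allclose(y_orig, y_two)            # identical when all scaling factors are 1
```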

3.3 Basis Pruning with First-Order Taylor Approximation of Importance

Our goal is to apply features trained from one dataset (e.g., ImageNet) to other datasets (e.g., CIFAR-10). Given a pre-trained model, we keep all layers up to and including the last convolutional layer and the associated BN and activation layers, and add a global average pooling layer and a final fully-connected layer for classification. For transfer learning with basis pruning, we first decompose every convolutional layer as presented in Section 3.2. As BN layers are important for domain adaptation [14], they are trainable during transfer learning and are introduced after each convolutional layer if not present (e.g., VGGNet). Therefore, only the BN layers, the vector $\mathbf{s}$ in each BasisScalingConv layer, and the final fully-connected layer are trainable.
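A hypothetical Keras sketch of this setup is shown below; the backbone, input size, and class count are placeholders rather than the authors' exact configuration, and the layer decomposition itself is omitted:

```python
import tensorflow as tf

# Keep the ImageNet convolutional backbone and append global average pooling plus a classifier.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(128, 128, 3))
for layer in base.layers:
    # Freeze everything except BN layers (VGG-16 has none, so in the full framework
    # BN layers would be inserted after each convolution; omitted in this sketch).
    layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)  # e.g., 10 classes for CIFAR-10
model = tf.keras.Model(base.input, outputs)
```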

Although the magnitudes of the basis scaling factors can be used to indicate the relative importance of basis vectors, L1 regularization is required to enhance sparsity for larger pruning ratios [17]. Finding the optimal L1 regularization parameter that balances sparsity and accuracy is nontrivial and model dependent. Smaller learning rates and more epochs are also required. These largely reduce the automation and efficiency of the framework.

To avoid these limitations, we found that importance estimation by the first-order Taylor approximation (Taylor FO) is a good alternative [20]. In [20], the importance of a network parameter is quantified by the error induced by removing it. Using Taylor FO, the importance score of a scaling factor $s_i$ can be approximated by:

$$\mathcal{I}(s_i) = \left( \frac{\partial \mathcal{L}}{\partial s_i} \, s_i \right)^{2} \quad (6)$$

with $\frac{\partial \mathcal{L}}{\partial s_i}$ the gradient of the loss function $\mathcal{L}$ with respect to $s_i$. Regardless of how the importance scores are computed, they are normalized in each layer by the maximum importance score in that layer. Fig. 2 shows that although the scaling factors tend to be smaller with small singular values, the fluctuations are large, especially in deeper layers. In contrast, Taylor FO provides better distinctions of filter importance which are highly correlated with the singular values.
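The following TensorFlow sketch approximates (6) for the scaling factors of a single minibatch; `layer.scale` is a hypothetical attribute name for the trainable vector in a BasisScalingConv layer, and in practice the scores would be accumulated over several minibatches:

```python
import tensorflow as tf

def taylor_fo_importance(model, images, labels, loss_fn):
    """Per-layer Taylor FO scores of the basis scaling factors, normalized by the layer maximum."""
    scales = [l.scale for l in model.layers if hasattr(l, "scale")]  # hypothetical attribute
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=False))
    grads = tape.gradient(loss, scales)
    scores = [tf.square(g * s) for g, s in zip(grads, scales)]       # Eq. (6)
    return [sc / tf.reduce_max(sc) for sc in scores]                 # per-layer normalization
```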

After training with enough epochs for the desired classification accuracy, the less important basis vectors are pruned (Fig. 3(a)). Let $\hat{r} \leq r$ be the number of scaling factors remaining after pruning; then $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}$ in (5) become $\mathbf{U} \in \mathbb{R}^{hwm \times \hat{r}}$, $\mathbf{S} \in \mathbb{R}^{\hat{r} \times \hat{r}}$, and $\mathbf{V} \in \mathbb{R}^{n \times \hat{r}}$, respectively. As the sizes of $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{b}$ are unaltered, basis pruning only affects the convolutional layer being pruned but not the subsequent layers. Therefore, basis pruning can be applied to virtually all network architectures. In contrast, pruning in the original space requires pruning of the subsequent convolutional layer, which becomes complicated when skip connections are involved. If an entire layer is pruned (i.e., $\hat{r} = 0$), all subsequent layers are removed.
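A minimal sketch of this step for one decomposed layer, assuming boolean masking with a user-chosen threshold, is:

```python
import numpy as np

def prune_basis(U, S, Vt, scale, importance, threshold=0.3):
    """Keep basis vectors whose normalized importance reaches the threshold (an assumed value)."""
    keep = importance >= threshold          # mask over the r basis vectors
    return (U[:, keep],                     # first conv layer: fewer filters
            S[np.ix_(keep, keep)],          # reduced diagonal of singular values
            Vt[keep, :],                    # BasisScalingConv still maps back to n channels
            scale[keep])                    # surviving scaling factors
```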

(a) Basis pruning.

  

(b) Double pruning.

Figure 3: Basis pruning and double pruning. (a) Pruning by removing the same number of basis vectors (blue) from $\mathbf{U}$ and $\mathbf{V}$. (b) With additional input pruning (red) and output pruning (green).

3.4 Double Pruning

In basis pruning, only the output channels of $\mathbf{U}$ and the corresponding input channels of the BasisScalingConv layer are pruned (Fig. 3(a)). Larger pruning ratios can be achieved by also pruning the input channels of $\mathbf{U}$ and the output channels of the BasisScalingConv layer (Fig. 3(b)). This can be achieved by pruning the output channels of each BasisScalingConv layer, as the input channels of the subsequent convolutional layer are pruned accordingly. In fact, after basis pruning, the BasisScalingConv layers can be treated as regular convolutional layers and any existing pruning algorithm can be applied.
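As an illustration, a sketch of one double-pruning step on a plain sequential pair of decomposed layers might look as follows; the channel mask is assumed to come from any channel-importance criterion (e.g., Taylor FO or HRank), and any BN layer between the two layers, whose per-channel parameters would be pruned the same way, is omitted:

```python
import numpy as np

def prune_output_channels(SVt, bias, next_U, kernel_hw, keep):
    """keep: boolean mask over this layer's n output channels."""
    SVt_p = SVt[:, keep]                     # BasisScalingConv loses output channels
    bias_p = bias[keep]
    h, w = kernel_hw                         # kernel size of the *next* convolutional layer
    # next_U has h*w*m rows; after reshaping (h, w, m, n) in C order the channel index
    # varies fastest, so drop every row that reads one of the removed channels.
    next_U_p = next_U[np.tile(keep, h * w), :]
    return SVt_p, bias_p, next_U_p
```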

Although double pruning can further increase the pruning ratios, it is also subject to the common problems of pruning in the original space related to branching and merging (Fig. 4). For skip connections through concatenation (e.g., DenseNet [7]), the problems are less complicated as the channel positions after concatenation are trackable. For skip connections through element-wise merging (e.g., ResNet [5]), the convolutional layers whose outputs are merged are not pruned (e.g., the second convolutional layer in Fig. 4). This may reduce the pruning ratios but is more adaptive to complicated architectures.

3.5 Pruning Procedure

  1. Given a pre-trained model, keep all layers up to and including the last convolutional layer and the associated BN and activation layers. Add a global average pooling layer and a final fully-connected layer for classification. Insert BN layers if needed.

  2. Decompose each convolutional layer into a convolutional layer and a BasisScalingConv layer (Section 3.2).

  3. Train the model for the desired number of epochs with only the BN layers, the basis scaling factors, and the final fully-connected layer trainable.

  4. Perform basis pruning by removing the basis vectors whose normalized importance scores are lower than a given threshold (Section 3.3). Train the pruned model as in Step 3. Repeat if needed.

  5. Perform pruning by removing the output channels of the BasisScalingConv layers whose normalized importance scores are lower than a given threshold (Section 3.4). Train the pruned model as in Step 3. Repeat if needed.

Multiple iterations can be performed in Steps 4 and 5, though we found that one iteration is enough, especially for simpler problems (e.g., MNIST).

Figure 4: Branching and merging can cause problems when pruning in the original space. For example, if feature tensors TA and TB have different numbers of channels after pruning, they cannot be merged by element-wise merging operations (e.g., Add).

Regardless of how the importance scores are computed, it is important to normalize them in each layer. This is because some importance scores have different ranges in different layers. For example, HRank scores [15] are matrix ranks which are dependent on the spatial size of feature tensors, thus filters in layers after downsampling have smaller maximum scores. Normalizing importance scores allows them to be used within a layer or in the entire model.

To determine the thresholds in Steps 4 and 5, we decide the percentage of basis vectors or filters to be removed from the entire model and compute the corresponding threshold.
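A sketch of this computation, assuming per-layer normalized scores and a user-chosen removal fraction, is:

```python
import numpy as np

def global_threshold(normalized_scores_per_layer, remove_fraction=0.3):
    """Threshold such that roughly `remove_fraction` of all basis vectors/filters fall below it."""
    all_scores = np.concatenate([np.ravel(s) for s in normalized_scores_per_layer])
    return np.percentile(all_scores, 100.0 * remove_fraction)
```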

To minimize manual involvement and improve training efficiency, unlike [13], no manual layer selection is performed. Moreover, we do not use gradual filter removal [20] or filter-by-filter pruning [15]. Only the percentage of basis vectors or filters to be removed is needed.

In Step 3, each scaling factor in the BasisScalingConv or BN layers modifies the weights of a basis vector or a filter as a whole. We regard this as basis- or filter-based tuning which is more efficient than fine-tuning individual weights.

4 Experiments

4.1 Models and Datasets

To study the characteristics of our framework, we performed transfer learning with three ImageNet [1] pre-trained models on three other datasets. ImageNet was used as the source dataset because of its abundant features trained from 1.2 million images. The three models correspond to the architectures of VGG-16 [24], DenseNet-121 [7], and ResNet-50 [5]. VGG-16 has a relatively simple architecture. DenseNet-121 and ResNet-50 have skip connections realized by tensor concatenation and addition, respectively.

The three datasets include CIFAR-10 [10], MNIST [11], and Fashion-MNIST [27]. The CIFAR-10 dataset consists of 32×32 colour images in 10 classes of animals and vehicles, with 50k training images and 10k test images. The MNIST dataset of handwritten digits (0 to 9) has 60k 28×28 grayscale training images and 10k test images in 10 classes. The Fashion-MNIST dataset has 60k 28×28 grayscale training images and 10k test images in 10 classes of fashion categories, and can be used as a drop-in replacement for MNIST. Each set of training images was split into 90% for training and 10% for validation. Only the results on the test images are reported.

We also trained VGG-14 [13], DenseNet-BC-100 [7], and ResNet-56 [5] from scratch on CIFAR-10 to test our framework on same-dataset pruning.

4.2 Tested Frameworks

Given a pre-trained model, we kept all layers up to and including the last convolutional layer and the associated BN and activation layers, and added a global average pooling layer and a final fully-connected layer. The BN layers were trainable together with the final fully-connected layer while the other layers were frozen. Different frameworks were tested based on this configuration:

  • Baseline: No layer decompositions and pruning.

  • L1 [13]: No layer decompositions. Pruning using the L1 norms of filters as the importance scores.

  • Taylor FO [20]: No layer decompositions. Pruning by the Taylor FO importance scores. The gradients were computed from the validation dataset.

  • HRank [15]: No layer decompositions. Pruning by the HRank importance scores which are the matrix ranks of the feature maps of the validation dataset.

  • Basis: All convolutional layers were decomposed. Pruning by the Taylor FO importance scores of the basis scaling factors (Section 3.3).

  • Basis + Taylor FO: Double pruning by the Taylor FO importance scores (Section 3.4).

  • Basis + HRank: Double pruning by the HRank importance scores (Section 3.4).

For the frameworks without layer decompositions, the pruning procedure in Section 3.5 was applied without Steps 2 and 4, and the target layers in Step 5 became the convolutional layers. For all frameworks, only one iteration was performed in Steps 4 and 5. As the L1 framework was less effective in our experiments, we did not use it for double pruning. Note that our goal is to study the pruning capabilities of our frameworks, not to compete for the best accuracy on specific datasets.

4.3 Training Strategy

For transfer learning, as the network architectures of the ImageNet pre-trained models (VGG-16, DenseNet-121, and ResNet-50) were created for the image size of 224×224, directly applying them to target datasets of much smaller image sizes leads to insufficient spatial sizes (e.g., feature maps of size 1×1) in the deeper layers and thus poor performance. Therefore, we enlarged the image size by four times in each dimension, i.e., 128×128 for CIFAR-10 and 112×112 for MNIST and Fashion-MNIST. Image augmentation was used to reduce overfitting, with 15% shifting in height and width for all datasets, random horizontal flipping for CIFAR-10 and Fashion-MNIST, and 15° of rotation for MNIST. Every image was zero-centered in intensity. Dropout with rate 0.5 was applied before the final fully-connected layer. Stochastic gradient descent (SGD) with cosine annealing [18] was used as the learning rate scheduler without restarts, with the minimum and maximum learning rates as and , respectively. The SGD optimizer was used with momentum of 0.9 and a batch size of 128. There were 100 epochs for each training. All scaling factors were initialized to 0.5 and constrained to be non-negative. The same settings were applied to same-dataset pruning except that no dropout was used and the batch size became 64.

When training models from scratch for same-dataset pruning, as the network architectures (VGG-14, DenseNet-BC-100, and ResNet-56) are tailored for CIFAR-10, the original image size was used. The same image augmentation and image preprocessing as transfer learning were applied. A weight decay of was applied on the convolutional weights without dropout. SGD with momentum of 0.9 was used with warm restarts by cosine annealing, with the minimum and maximum learning rates as and , respectively. There were 300 epochs for each training, and the learning rate scheduler restarted after 150 epochs with a batch size of 64.
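For reference, a hypothetical Keras optimizer setup matching the transfer-learning description above might look as follows; the learning-rate values and loss are placeholders rather than the values elided in the text, and `model` is assumed to be built as in the Section 3.3 sketch:

```python
import tensorflow as tf

steps_per_epoch = 45000 // 128                          # e.g., 90% of CIFAR-10 with batch size 128
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-2,                         # placeholder maximum learning rate
    decay_steps=100 * steps_per_epoch,                  # 100 epochs, no restarts
    alpha=1e-3)                                         # placeholder minimum-LR fraction
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```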

The implementation was in Keras with a TensorFlow backend. Each experiment was performed on an NVIDIA Tesla V100 GPU with 16 GB memory.

Framework Accuracy Parameters (PR) FLOPs (PR)
VGG-16 90.9% 14.74M (0.0%) 5.03G (0.0%)
L1 [13] 88.0% 13.42M (8.9%) 3.15G (37.5%)
Taylor FO [20] 90.0% 6.57M (55.4%) 2.48G (50.8%)
HRank [15] 90.9% 11.67M (20.8%) 4.11G (18.4%)
Basis 91.2% 7.99M (45.8%) 3.21G (36.2%)
Basis + Taylor FO 90.5% 5.42M (63.2%) 2.45G (51.2%)
Basis + HRank 90.9% 7.13M (51.6%) 2.95G (41.3%)
DenseNet-121 95.2% 7.05M (0.0%) 0.93G (0.0%)
L1 [13] 93.6% 6.02M (14.5%) 0.70G (24.7%)
Taylor FO [20] 95.1% 4.15M (41.2%) 0.57G (38.8%)
HRank [15] 94.3% 5.41M (23.2%) 0.78G (15.9%)
Basis 94.4% 4.75M (32.6%) 0.65G (29.9%)
Basis + Taylor FO 94.7% 4.43M (37.2%) 0.60G (35.7%)
Basis + HRank 94.1% 4.20M (40.4%) 0.60G (35.5%)
ResNet-50 92.4% 23.61M (0.0%) 1.29G (0.0%)
L1 [13] 92.1% 23.15M (1.9%) 1.18G (8.3%)
Taylor FO [20] 91.4% 10.89M (53.9%) 0.75G (42.0%)
HRank [15] 91.8% 17.67M (25.2%) 1.00G (22.5%)
Basis 92.2% 7.14M (69.7%) 0.61G (52.6%)
Basis + Taylor FO 91.6% 6.00M (74.6%) 0.52G (59.8%)
Basis + HRank 92.1% 6.39M (72.9%) 0.55G (57.0%)
Table 1: Pruning results on CIFAR-10 with ImageNet pre-trained models. PR stands for pruning ratio. Proposed frameworks are in bold and the best results are in blue. The results that were worse than all proposed frameworks are in red. The images were upsampled to 128×128.
Framework Accuracy Parameters (PR) FLOPs (PR)
VGG-16 99.5% 14.74M (0.0%) 3.85G (0.0%)
L1 [13] 99.3% 4.45M (69.8%) 0.55G (85.8%)
Taylor FO [20] 99.3% 0.73M (95.1%) 0.29G (92.4%)
HRank [15] 99.4% 3.71M (74.8%) 1.51G (60.8%)
Basis 99.4% 0.64M (95.6%) 0.35G (90.8%)
Basis + Taylor FO 99.3% 0.16M (98.9%) 0.12G (97.0%)
Basis + HRank 99.3% 0.35M (97.6%) 0.25G (93.5%)
DenseNet-121 99.6% 7.05M (0.0%) 0.71G (0.0%)
L1 [13] 99.2% 0.11M (98.5%) 0.01G (98.6%)
Taylor FO [20] 99.3% 0.12M (98.2%) 0.01G (98.3%)
HRank [15] 99.1% 0.30M (95.8%) 0.11G (83.8%)
Basis 99.1% 0.37M (94.8%) 0.03G (96.1%)
Basis + Taylor FO 99.4% 0.19M (97.4%) 0.01G (98.3%)
Basis + HRank 99.0% 0.32M (95.4%) 0.02G (96.7%)
ResNet-50 99.4% 23.61M (0.0%) 1.05G (0.0%)
L1 [13] 99.0% 6.14M (74.0%) 0.24G (76.9%)
Taylor FO [20] 99.2% 5.74M (75.7%) 0.23G (77.7%)
HRank [15] 99.3% 6.54M (72.3%) 0.44G (57.5%)
Basis 99.1% 0.55M (97.7%) 0.05G (95.1%)
Basis + Taylor FO 99.1% 0.48M (98.0%) 0.04G (95.9%)
Basis + HRank 99.0% 0.50M (97.9%) 0.04G (95.7%)
Table 2: Pruning results on MNIST with ImageNet pre-trained models. PR stands for pruning ratio. Proposed frameworks are in bold and the best results are in blue. The results that were worse than all proposed frameworks are in red. The images were upsampled to 112×112.
Framework Accuracy Parameters (PR) FLOPs (PR)
VGG-16 93.2% 14.74M (0.0%) 3.85G (0.0%)
L1 [13] 92.9% 13.42M (8.9%) 2.41G (37.4%)
Taylor FO [20] 92.5% 2.64M (82.1%) 1.04G (73.1%)
HRank [15] 92.2% 3.54M (76.0%) 1.66G (56.9%)
Basis 92.3% 1.37M (90.7%) 0.68G (82.3%)
Basis + Taylor FO 92.4% 0.90M (93.9%) 0.51G (86.6%)
Basis + HRank 92.0% 0.98M (93.4%) 0.54G (85.9%)
DenseNet-121 93.7% 7.05M (0.0%) 0.71G (0.0%)
L1 [13] 93.1% 2.54M (63.9%) 0.17G (76.6%)
Taylor FO [20] 92.8% 0.94M (86.6%) 0.11G (85.0%)
HRank [15] 92.7% 1.06M (85.0%) 0.28G (60.0%)
Basis 93.4% 1.13M (84.0%) 0.14G (80.0%)
Basis + Taylor FO 92.7% 0.63M (91.0%) 0.07G (90.6%)
Basis + HRank 93.5% 0.83M (88.3%) 0.11G (85.1%)
ResNet-50 94.0% 23.61M (0.0%) 1.05G (0.0%)
L1 [13] 93.5% 22.04M (6.6%) 0.86G (18.1%)
Taylor FO [20] 93.0% 9.12M (61.4%) 0.47G (55.1%)
HRank [15] 93.2% 4.73M (80.0%) 0.28G (73.4%)
Basis 93.3% 1.99M (91.6%) 0.21G (79.6%)
Basis + Taylor FO 93.4% 1.47M (93.8%) 0.16G (85.1%)
Basis + HRank 93.4% 1.46M (93.8%) 0.15G (85.3%)
Table 3: Pruning results on Fashion-MNIST with ImageNet pre-trained models. PR stands for pruning ratio. Proposed frameworks are in bold and the best results are in blue. The results that were worse than all proposed frameworks are in red. The images were upsampled to 112×112.

4.4 Pruning Performance on Transfer Learning

Tables 1, 2, and 3 show the comparisons among frameworks. To compare the pruning capabilities, we obtained the largest pruning ratios while keeping the accuracy reductions less than 1%. On a few occasions, this 1% requirement could not be achieved even when only 10% of the channels were removed. For MNIST, we kept the accuracies ≥ 99%.

The pruning ratios were inversely related to the difficulty of the classification problem. The largest parameter pruning ratios on CIFAR-10, MNIST, and Fashion-MNIST were 74.6%, 98.9%, and 93.9%, respectively, all achieved by our Basis + Taylor FO framework. For ResNet-50, our frameworks outperformed the others by a large margin on all datasets. This is because basis pruning can be applied to all convolutional layers regardless of the existence of element-wise merging. Our frameworks were also dominant on VGG-16. Although they were less effective with DenseNet-121 on CIFAR-10 and MNIST, our best results trailed the overall best by less than 1.1% in parameters and 3.1% in FLOPs.

In seven out of the nine combinations (3 architectures × 3 datasets), our double pruning frameworks outperformed the others. This was often (6 out of 7) achieved by Basis + Taylor FO and once by Basis + HRank. In fact, even without basis pruning, Taylor FO outperformed L1 and HRank in our results. Note that Taylor FO utilizes information from both the data and the weights, whereas L1 and HRank only depend on the weight magnitudes and feature maps, respectively. This could explain why using Taylor FO in our double pruning approach often outperformed the other frameworks.

Framework Accuracy Parameters (PR) FLOPs (PR)
VGG-14 93.4% 14.74M (0.0%) 0.45G (0.0%)
L1 [13] 92.9% 12.79M (13.2%) 0.29G (35.7%)
Taylor FO [20] 92.7% 12.40M (15.8%) 0.28G (38.3%)
HRank [15] 90.7% 12.64M (14.2%) 0.26G (41.2%)
Basis 92.5% 2.20M (85.1%) 0.13G (71.0%)
Basis + Taylor FO 92.5% 2.08M (85.9%) 0.12G (74.1%)
Basis + HRank 91.9% 2.02M (86.3%) 0.12G (73.1%)
DenseNet-BC-100 94.3% 0.79M (0.0%) 0.29G (0.0%)
L1 [13] 93.2% 0.66M (17.1%) 0.20G (31.9%)
Taylor FO [20] 93.3% 0.60M (23.8%) 0.18G (39.4%)
HRank [15] 91.3% 0.63M (20.8%) 0.27G (7.6%)
Basis 93.7% 0.57M (28.2%) 0.19G (37.0%)
Basis + Taylor FO 93.6% 0.55M (30.8%) 0.17G (42.3%)
Basis + HRank 91.5% 0.54M (31.9%) 0.18G (38.1%)
ResNet-56 93.0% 0.86M (0.0%) 0.13G (0.0%)
L1 [13] 90.6% 0.83M (3.5%) 0.11G (17.3%)
Taylor FO [20] 91.0% 0.83M (3.3%) 0.11G (15.8%)
HRank [15] 90.3% 0.66M (23.8%) 0.11G (19.3%)
Basis 91.1% 0.71M (17.9%) 0.10G (28.7%)
Basis + Taylor FO 90.1% 0.67M (21.8%) 0.09G (33.7%)
Basis + HRank 90.3% 0.68M (21.0%) 0.09G (32.4%)
Table 4: Same-dataset pruning results on CIFAR-10. PR stands for pruning ratio. Proposed frameworks are in bold and the best results are in blue. The results that were worse than all proposed frameworks are in red. The original image size of 32×32 was used. For VGG-14 and DenseNet-BC-100, the results of the Basis + HRank frameworks are not highlighted in blue as their reductions in accuracy were more than 1%.

4.5 Performance on Same-Dataset Pruning

Although our frameworks are built for transfer learning, they are also applicable to same-dataset pruning. Table 4 shows the results of same-dataset pruning on CIFAR-10. Similar to the transfer learning results, the Basis + Taylor FO framework performed best in general. Nevertheless, most pruning ratios were smaller compared with transfer learning. This is because the architectures tailored for CIFAR-10 were much smaller in terms of model parameters: DenseNet-BC-100 for CIFAR-10 is around nine times smaller than DenseNet-121 for ImageNet [7], and ResNet-56 for CIFAR-10 is around 27 times smaller than ResNet-50 for ImageNet [5]. As VGG-14 was highly over-parameterized, its pruning ratios were still large. In contrast, for ResNet-56, even when we only removed 10% of the basis vectors or filters, the reductions in accuracy were larger than 1% for all frameworks. As the pruning procedure in Section 3.5 was applied, no manual layer selection or layer-by-layer pruning was performed. These could be applied to improve the pruning ratios but would largely reduce automation and computational feasibility.

4.6 Characteristics of Basis Vectors

We studied the characteristics of the basis vectors. Fig. 5 shows the comparisons among different methods of importance estimation for basis pruning on CIFAR-10. For all network architectures, Random and Reverse obtained the worst results. In contrast, Taylor FO and Singular Values resulted in much higher accuracy, with Taylor FO performing better on VGG-16 and ResNet-50. These results show that basis vectors with small singular values should be pruned, and that the Taylor FO importance scores are preferable.

Figure 5: Comparisons among different methods of importance estimation on basis pruning (CIFAR-10). Half of the basis vectors were removed from each decomposed layer pair. Taylor FO: using Taylor FO on basis scaling factors. Singular Values: removing basis vectors of small singular values. Random: random removal. Reverse: removing basis vectors of large singular values.

4.7 Trainable Model Parameters

Table 5 shows the numbers of total and trainable model parameters after layer decomposition and before pruning. Although the total number of parameters became larger because of layer decomposition, the trainable parameters were less than 1.2%, and could be as low as 0.1%, of the total number of parameters. Therefore, our framework is advantageous for transfer learning with limited data.

Model Parameters
Total Trainable
VGG-16 16.55M 17.77k
DenseNet-121 8.40M 104.04k
ResNet-50 28.78M 86.86k
Table 5: Numbers of total and trainable parameters before pruning.

5 Conclusion

In this paper, we present basis pruning and double pruning for efficient transfer learning. Using singular value decomposition, a convolutional layer can be decomposed into two consecutive layers with the basis vectors as their convolutional weights. With the introduced basis scaling factors, the basis vectors can be fine-tuned and pruned to reduce the network size and inference time, regardless of the existence of skip connections. Basis pruning can be further combined with other pruning algorithms for double pruning to obtain pruning ratios that cannot be achieved by either alone. Experimental results show that basis pruning outperformed pruning in the original feature space, and the advantage was even more distinctive with double pruning. Without fine-tuning of individual weights, the number of trainable model parameters can be very small. Hence, our framework is ideal for transfer learning with limited data.

References

  • [1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [2] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
  • [3] Isha Garg, Priyadarshini Panda, and Kaushik Roy. A low effort approach to structured CNN design using PCA. IEEE Access, 8:1347–1360, 2020.
  • [4] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [6] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • [7] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
  • [8] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, 2014.
  • [9] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529, 2019.
  • [10] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [12] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
  • [13] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
  • [14] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. In International Conference on Learning Representations (Workshop), 2017.
  • [15] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1529–1538, 2020.
  • [16] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
  • [17] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
  • [18] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • [19] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
  • [20] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.
  • [21] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017.
  • [22] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
  • [23] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI. arXiv:1907.10597 [cs.CY], 2019.
  • [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [25] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279, 2018.
  • [26] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
  • [27] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 [cs.LG], 2017.
  • [28] Huanrui Yang, Minxue Tang, Wei Wen, Feng Yan, Daniel Hu, Ang Li, Hai Li, and Yiran Chen. Learning low-rank deep neural networks via singular vector orthogonality regularization and singular value sparsification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 678–679, 2020.
  • [29] Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations, 2018.
  • [30] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2015.