A Highly Effective Low-Rank Compression of Deep Neural Networks with Modified Beam-Search and Modified Stable Rank

11/30/2021
by   Moonjung Eo, et al.
Seoul National University

Compression has emerged as one of the essential deep learning research topics, especially for edge devices that have limited computation power and storage capacity. Among the main compression techniques, low-rank compression via matrix factorization has been known to have two problems. First, extensive tuning is required. Second, the resulting compression performance is typically not impressive. In this work, we propose a low-rank compression method that utilizes a modified beam-search for an automatic rank selection and a modified stable rank for a compression-friendly training. The resulting BSR (Beam-search and Stable Rank) algorithm requires only a single hyperparameter to be tuned for the desired compression ratio. In terms of the accuracy versus compression ratio trade-off curve, BSR turns out to be superior to the previously known low-rank compression methods. Furthermore, BSR can perform on par with or better than the state-of-the-art structured pruning methods. As with pruning, BSR can be easily combined with quantization for additional compression.


1 Introduction

Figure 1: Comparison with the baseline algorithms of CA [alvarez2017compression] and LC [idelbayev2020low]. (a) Rank selection: a base neural network (ResNet56 on CIFAR-100) was truncated by the selected ranks and no further fine-tuning was applied. Modified beam-search (mBS) clearly outperforms the other two. (b) Singular value distribution after rank-regularized training: a base neural network (ResNet56 on CIFAR-100) was regularized by each compression-friendly training method. For a target rank of five, the modified stable rank (mSR) is the only one that clearly suppresses the singular values other than the top five.

As deep learning becomes widely adopted by the industry, the demand for compression techniques that are highly effective and easy-to-use is sharply increasing. The most popular compression methods include quantization [rastegari2016xnor, wu2016quantized], pruning of redundant parameters [han2015deep, lebedev2016fast, srinivas2015data], knowledge distillation from a large network to a small one [hinton2015distilling, kim2018paraphrasing, romero2014fitnets, zagoruyko2016paying], and network factorization [alvarez2017compression, denton2014exploiting, masana2017domain, xue2013restructuring]. In this work, we focus on a low-rank compression that is based on a matrix factorization and low-rank approximation of weight matrices.

A typical low-rank compression is based on a direct application of SVD (Singular Value Decomposition). For a trained neural network, layer $l$'s weight matrix $W_l$ of size $m_l \times n_l$ is decomposed and only the top $r_l$ dimensions are kept. This reduces the computation and memory requirements from $m_l n_l$ to $r_l(m_l + n_l)$, and the reduction can be large when a small $r_l$ is chosen.
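To make the mechanics concrete, the sketch below factorizes a single fully-connected layer with a truncated SVD. It is a minimal PyTorch illustration of the idea rather than the paper's implementation; the helper name, the example sizes, and the use of torch.linalg.svd are ours.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, r: int) -> nn.Sequential:
    """Replace an (m x n) linear layer by a rank-r cascade of two linear layers.

    The weight count drops from m*n to r*(m + n), which is a large saving
    whenever r is much smaller than min(m, n).
    """
    W = layer.weight.data                                # shape (m, n)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, r, bias=False)  # applies V_r
    second = nn.Linear(r, layer.out_features,
                       bias=layer.bias is not None)      # applies U_r * Sigma_r
    first.weight.data = Vh[:r, :]                        # (r, n)
    second.weight.data = U[:, :r] * S[:r]                # (m, r), singular values folded in
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# Example: a 512 -> 256 layer truncated to rank 8 keeps 8*(512+256) = 6,144
# weights instead of 512*256 = 131,072 (roughly a 21x reduction).
compressed = factorize_linear(nn.Linear(512, 256), r=8)
```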

The traditional approaches can be divided into two main streams. One considers only a post-application of SVD on a fully trained neural network, and the other additionally performs a compression-friendly training before SVD.

For the first stream, the main research problem is the selection of $r_l$ [jaderberg2014speeding, denton2014exploiting, tai2015convolutional, zhang2015accelerating, wen2017coordinating]. To be precise, the problem is to select $r_l$ for all $L$ layers such that a high compression ratio can be achieved without harming the performance too much. Because the rank selection is a nonlinear optimization problem, a variety of heuristics have been studied, where typically a threshold is introduced and adjusted. These works, however, failed to achieve a competitive effectiveness because a trained neural network is unlikely to maintain its performance when a small $r_l$ is selected.

For the other stream, several works introduced an additional step of compression-friendly training [alvarez2017compression, idelbayev2020low, li2018constrained]. This can be a promising strategy because it is common sense to train a large network to secure desirable learning dynamics [luo2017thinet, carreira2018learning], while at the same time the network can be regularized in many different ways without harming the performance [choi2021statistical]. For rank selection, however, most of them still used heuristics and failed to achieve a competitive performance. Recently, LC [idelbayev2020low] achieved a state-of-the-art performance for low-rank compression by calculating the target ranks with a variant of the Eckart-Young theorem and by performing a regularized training with respect to the weight matrices truncated according to the target rank vector $\mathbf{r}$. The existing compression-friendly works, however, require re-calculations of $\mathbf{r}$ during the training, train weight matrices that are not truly low-rank, and demand extensive tuning, especially for obtaining a compressed network of a desired compression ratio.

In this work, we enhance the existing low-rank compression methods in three different aspects. First, we adopt a modified beam-search with a performance validation for selecting the target rank vector $\mathbf{r}^*$. Compared to the previous works, our method allows a joint search of rank assignments over all the layers, and the effect is shown in Figure 1(a). The previous works relied on a simple heuristic (e.g., all layers should be truncated to keep the same portion of energy [alvarez2017compression]) or an implicit connection of all layers (e.g., through a common penalty parameter [idelbayev2020low]). Secondly, we adopt a modified stable rank for the rank-regularized training. Because our modified stable rank does not rely on any instance of weight matrix truncation, it can be continuously enforced without any update as long as the target rank vector remains constant. The previous works used a forced truncation in the middle of training [alvarez2017compression] or a norm distance regularization with the truncated weight matrices [idelbayev2020low]. In both methods, iteration between the weight truncation step and the training step was necessary. In our method, we calculate and set the target rank vector $\mathbf{r}^*$ only once. Our modified stable rank turns out to be very effective at controlling the singular values, as can be seen in Figure 1(b). As a result of the first and second aspects, our low-rank compression method can achieve a performance on par with or even better than the latest pruning methods. Because low-rank compression can be easily combined with quantization just like pruning, a very competitive overall compression performance can be achieved. Thirdly, our method requires only one hyperparameter to be tuned. For a given desired compression ratio $c^*$, all that needs to be tuned is the regularization strength $\lambda$. While a low-rank compression method like LC [idelbayev2020low] accomplished an outstanding performance, it requires an extensive tuning to identify a compressed network of a desired compression ratio. Such a difficulty in tuning can be a major drawback for usability. We believe our simplification in tuning is a contribution that is as important as, or perhaps even more important than, the performance improvement.

2 Related works

2.1 DNN compression methods

In the past decade, tremendous progress has been made in the research field of DNN compression. Comprehensive surveys can be found in [lebedev2018speeding, cheng2017survey, deng2020model, choudhary2020comprehensive, nan2019deep]. Besides the low-rank compression discussed in Section 1, the main algorithmic categories are quantization, pruning, and knowledge distillation. Among them, quantization and pruning are known as the most competitive compression schemes.

Quantization reduces the number of data bits and parameter bits, and it is an actively studied area with a broad adoption in the industry [courbariaux2016binarized, gupta2015deep, han2015learning, rastegari2016xnor, wu2016quantized]. There are many flavors of quantization including binary parameterization [courbariaux2016binarized, rastegari2016xnor], low-precision fixed-point [gupta2015deep, lin2016fixed], and mixed-precision training [yang2021bsq, bulat2021bit]. While many of them require a dedicated hardware-level support, we limit our focus to the pure algorithmic solutions, and consider combining our method with an algorithmic quantization in Section 6.

Network pruning includes weight pruning [han2015deep, han2015learning] and filter pruning [chin2020towards, luo2017thinet, he2017channel, zhuang2018discrimination, he2018amc, he2018soft, liu2018rethinking, wang2020pruning]. Weight pruning selects individual weights to be pruned. Because of the unstructured selection patterns, it requires customized GPU kernels or specialized hardware [han2015deep, zhao2019efficient]. Filter pruning selects entire filters only. Because of the structured selection patterns, the resulting compression can be easily implemented with off-the-shelf CPUs/GPUs [sui2021chip]. In Section 6, our results are compared with the state-of-the-art filter pruning methods because both are software-only solutions that reduce the number of weight parameters.

2.2 Rank selection

In the past decade, a variety of rank selection methods have been studied for the low-rank compression of deep neural networks. In the early studies of low-rank compression, rank selection itself was not the main focus of the research, and it was performed by humans through repeated experiments [jaderberg2014speeding, denton2014exploiting, tai2015convolutional]. Then, a correlation between the sum of singular values (called energy) and DNN performance was observed, and rank selection methods that minimally affect the energy were subsequently proposed [zhang2015accelerating, alvarez2017compression, wen2017coordinating, li2018constrained, Yuhuirankpruning2018]. The following works formulated and solved rank selection as optimization problems in which the singular values appear in the formulations [kim2019efficient, idelbayev2020low]. In our modified beam-search method, we neither use the concept of energy nor utilize the singular values. Instead, we directly perform a search over the space of rank vectors using a validation dataset.

2.3 Beam search

Beam search is a technique for searching a tree, especially when the solution space is vast [xu2007learning, antoniol1995language, furcy2005limited]. It is based on a heuristic of developing $K$ solutions in parallel while repeatedly inspecting the adjacent nodes of the solutions in the tree. $K$ is commonly referred to as the beam size, where $K = 1$ corresponds to the greedy search [huang2012structured] and $K = \infty$ corresponds to the breadth-first search [meister2020if]. Obviously, beam search is a compromise between the two, where $K$ is the control parameter. Beam search has been widely adopted, especially for natural language processing tasks such as speech recognition [lowerre1976speech], neural machine translation [lowerre1976speech, boulanger2013audio], scheduling [habenicht2002scheduling], and the generation of natural language adversarial attacks [tengfei2021adversarial]. In our work, beam search is slightly modified to allow a search of depth-$s$ descendant nodes instead of the children nodes, and the modified beam search is employed for a joint search of the weight matrix ranks over the $L$ layers.

2.4 Stable rank and rank regularization

Formally speaking, the stable rank of a matrix is defined as the ratio between the squared Frobenius norm and the squared spectral norm [rudelson2007sampling]. Simply speaking, the definition boils down to the ratio between the sum of the squared singular values and the square of the largest singular value. Therefore, a smaller stable rank implies a relatively larger portion of energy in the largest singular value. In deep learning research, stable rank normalization was studied for improving generalization [sanyal2019stable] and a stable rank approximation was utilized for performance tuning [choi2021statistical]. In our work, we modify the stable rank's denominator to the sum of the top-$r$ singular values so that we can concentrate the activation signals in the top $r$ dimensions. While our approach requires only the target rank to be specified, the previous compression-friendly training methods calculated the truncated matrices and directly used them for the regularization. Therefore, the previous methods required a frequent update of the truncated matrices during the training. In our BSR method, we calculate the target rank vector only once in the beginning and keep it fixed.

3 Low-rank compression

Figure 2: Overall process of BSR algorithm.

3.1 The basic process

A typical process of low-rank compression consists of four steps: 1) train a deep neural network, 2) select rank assignments over layers, 3) factorize weight matrices using truncated SVD [denton2014exploiting, masana2017domain, xue2013restructuring] according to the selected ranks, and 4) fine-tune the truncated model to recover the performance as much as possible. In our work, we mainly focus on the rank selection step and an additional step of compression-friendly training. The additional step is placed between step 2 and step 3.

3.2 Compression ratio

Consider an $L$-layer neural network with $W_l \in \mathbb{R}^{m_l \times n_l}$, where $l = 1, \dots, L$, as its weight matrices. Without a low-rank compression, the rank vector $\mathbf{r} = (r_1, \dots, r_L)$ corresponds to the full rank vector $\mathbf{r}^{full}$, where $r_l^{full} = \min(m_l, n_l)$. The rank selection is performed over the set $\{\mathbf{r} \mid 1 \le r_l \le r_l^{full},\ l = 1, \dots, L\}$, and the selected rank vector is denoted as $\mathbf{r}^*$. With $\mathbf{r}^*$, we can perform a truncated SVD. For the $l$th layer's weight matrix $W_l$, we keep only the $r_l^*$ largest singular values to obtain $\hat{W}_l = U_l \Sigma_l V_l^\top$, where $U_l \in \mathbb{R}^{m_l \times r_l^*}$, $\Sigma_l \in \mathbb{R}^{r_l^* \times r_l^*}$, and $V_l \in \mathbb{R}^{n_l \times r_l^*}$. Then, the compression is achieved by replacing $W_l$ with a cascade of two matrices: $U_l \Sigma_l$ and $V_l^\top$. Obviously, the computational benefit stems from the reduction in the matrix multiplication loads: $m_l n_l$ for the original $W_l$ and $r_l^*(m_l + n_l)$ for $U_l \Sigma_l$ and $V_l^\top$. Similar results hold for convolutional layers. Finally, the compression ratio for a selected rank vector $\mathbf{r}$ can be calculated as

$$C(\mathbf{r}) = 1 - \frac{\sum_{l=1}^{L} \rho_l}{\sum_{l=1}^{L} m_l n_l}, \qquad (1)$$

where $\rho_l$ is a simplified notation of $\rho_l(r_l)$ that is defined as $r_l(m_l + n_l)$ if $r_l(m_l + n_l) < m_l n_l$ and $m_l n_l$ otherwise.
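Equation 1 can be computed directly from the layer shapes and a candidate rank vector. The helper below is a small sketch under the notation above (layer shapes given as $(m_l, n_l)$ pairs) and is not taken from the paper.

```python
def layer_cost(m: int, n: int, r: int) -> int:
    """rho_l(r_l): parameter count of layer l after a rank-r_l truncation.

    Falls back to the dense count m*n when factorization would not help,
    i.e. when r*(m + n) >= m*n.
    """
    return min(r * (m + n), m * n)

def compression_ratio(shapes, ranks) -> float:
    """C(r) = 1 - (truncated parameter count) / (original parameter count)."""
    original = sum(m * n for m, n in shapes)
    truncated = sum(layer_cost(m, n, r) for (m, n), r in zip(shapes, ranks))
    return 1.0 - truncated / original

# e.g. two layers of shape (256, 512) and (512, 512) truncated to ranks 8 and 16:
# compression_ratio([(256, 512), (512, 512)], [8, 16]) is roughly 0.94
```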

4 Methodology

Figure 3: Illustration of the mBS search process for an example setting of the desired compression ratio $c^*$, the relaxation constant $\epsilon$, the beam size $K$, and the level step size $s$.

4.1 Overall process

The overall process of our BSR low-rank compression is shown in Figure 2. Starting from a fully trained network, phase one of BSR performs rank selection using the mBS algorithm, which requires only a desired compression ratio $c^*$ as the input. Once phase one is completed, $\mathbf{r}^*$ is fixed and rank-regularized training is performed in phase two. The strength of mSR is controlled by $\lambda$, which is gradually increased in a scheduled manner. Upon the completion of phase two, the trained network is truncated using singular value decomposition according to $\mathbf{r}^*$. Then, a final fine-tuning is performed to complete the compression.

4.2 Modified beam-search (mBS) for rank selection

When a neural network with weight matrices $\{W_l\}_{l=1}^{L}$ is truncated according to the rank vector $\mathbf{r}$, the accuracy can be evaluated with a validation dataset and is denoted as $A(\mathbf{r})$. The corresponding compression ratio can be calculated as $C(\mathbf{r})$ in Equation 1. Our goal of rank selection is to find the $\mathbf{r}$ with the highest accuracy while the compression ratio is sufficiently close to the desired compression ratio $c^*$. This problem can be formulated as below.

$$\mathbf{r}^* = \operatorname*{arg\,max}_{\mathbf{r}} \; A(\mathbf{r}) \quad \text{subject to} \quad |C(\mathbf{r}) - c^*| \le \epsilon \qquad (2)$$

Note that we have introduced a small constant $\epsilon$ for relaxing the desired compression ratio. This relaxation forces the returned solution to have a compression ratio close to $c^*$, and it is also utilized as a part of the exit criteria.

The problem in Equation 2 is a combinatorial optimization problem [reeves@1993comb], and a simple greedy algorithm can be applied as in [zhang2015accelerating]. Because the cardinality of the search space is extremely large, however, greedy algorithms hardly produce good results in a reasonable computation time. On the other hand, a full search is also unacceptable because of its long search time. As a compromise, we adopt a beam-search framework and make adequate adjustments. Before presenting the details of mBS, a simple illustration of how Equation 2 can be solved with our modified beam search is presented in Figure 3.

The details of mBS can be explained as the following.

  • Stage 1: Initialize the level, $t = 0$. Initialize the top-$K$ set, $S = \{\mathbf{r}^{full}\}$ (full rank assignments).

  • Stage 2: Move to the next level by adding the pre-chosen step size $s$, $t \leftarrow t + s$. For each element in $S$, find all of its depth-$s$ descendants (each obtained by reducing one layer's rank by $s$) and add them to the candidate set $\mathcal{C}$. Exclude the descendants that perform too much compression by checking the condition $C(\mathbf{r}) \le c^* + \epsilon$.

  • Stage 3: Calculate the new top-$K$ set $S$ at the new level by finding the top $K$ elements of $\mathcal{C}$ ordered by the validation accuracy $A(\cdot)$.

  • Stage 4: Repeat Stage 2 and Stage 3 until $C(\mathbf{r}^{best}) \ge c^* - \epsilon$ is satisfied for the best element $\mathbf{r}^{best} \in S$. If the condition is satisfied, return $\mathbf{r}^{best}$ as $\mathbf{r}^*$.

An implementation of this process is provided in Algorithm 1. The main modification we make is the introduction of the level step size $s$. For rank selection in low-rank compression, a weight matrix's rank needs to be reduced by a sufficient amount, at least meeting the minimum condition $r_l(m_l + n_l) < m_l n_l$, to achieve a positive compression effect. Furthermore, most of the weight matrices can tolerate a significant amount of rank reduction without harming the performance in the beginning of the search. Therefore, we typically set $s$ between three and ten to make the search faster. As the search progresses, we reduce $s$ whenever no candidate can be found at the next level, thereby improving the search resolution. Besides $s$, the beam size $K$ is another important parameter that determines the trade-off between speed and search resolution. Instead of performing a fine and slow search with a small $s$ and a large $K$, we perform a fast search a few times with different configurations and choose the best $\mathbf{r}$. The configuration details can be found in Section 5, and ablation studies for $s$ and $K$ can be found in Section 6.

Input: desired compression ratio $c^*$; validation data $D_{val}$; beam size $K$; level step size $s$
Output: selected rank $\mathbf{r}^*$
Required: ratio function $C(\cdot)$; base network with rank $\mathbf{r}^{full}$; evaluation function $A(\cdot)$
Initialize: $\mathbf{r}^{best} \leftarrow \mathbf{r}^{full}$;
top-$K$ rank set $S \leftarrow \{\mathbf{r}^{full}\}$

1:  while ($S$ is changed) and ($C(\mathbf{r}^{best}) < c^* - \epsilon$) do
2:      $\mathcal{C} \leftarrow \emptyset$
3:      for $\mathbf{r}$ in $S$ do
4:          for $l = 1$ to $L$ do
5:              $\mathbf{r}' \leftarrow \mathbf{r}$
6:              $r'_l \leftarrow \max(r'_l - s, 1)$
7:              compute $C(\mathbf{r}')$
8:              if $C(\mathbf{r}') \le c^* + \epsilon$ then
9:                  compute $A(\mathbf{r}')$ on $D_{val}$
10:                 $\mathcal{C} \leftarrow \mathcal{C} \cup \{\mathbf{r}'\}$
11:              end if
12:          end for
13:      end for
14:      $S \leftarrow$ top-$K$ elements of $\mathcal{C}$ ordered by $A(\cdot)$
15:      $\mathbf{r}^{best} \leftarrow \operatorname*{arg\,max}_{\mathbf{r} \in S} A(\mathbf{r})$
16:  end while
17:  return $\mathbf{r}^* \leftarrow \mathbf{r}^{best}$
Algorithm 1 modified Beam Search (mBS) for rank selection
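The following is a plain-Python sketch of Algorithm 1. It assumes two callables, `evaluate(ranks)` returning the validation accuracy and `compression_ratio(ranks)` implementing Equation 1 (for instance, the earlier helper with the layer shapes bound in); halving the step size when no admissible descendant exists is one possible realization of the adaptive reduction described above, not a detail stated in the paper.

```python
def modified_beam_search(full_ranks, evaluate, compression_ratio,
                         c_target, eps=0.02, beam_size=5, step=5):
    """mBS sketch: repeatedly lower one layer's rank by `step`, keep the K most
    accurate candidates, and stop once the best one is compressed enough."""
    beam = [tuple(full_ranks)]                 # top-K set, initialized at full rank
    best = beam[0]
    while compression_ratio(best) < c_target - eps:
        candidates = {}
        for ranks in beam:
            for l in range(len(ranks)):        # one depth-`step` descendant per layer
                child = list(ranks)
                child[l] = max(1, child[l] - step)
                child = tuple(child)
                # discard descendants that already compress past c_target + eps
                if child not in candidates and compression_ratio(child) <= c_target + eps:
                    candidates[child] = evaluate(child)
        if not candidates:
            if step == 1:
                break                          # nothing admissible remains
            step = max(1, step // 2)           # refine the level step size and retry
            continue
        # new top-K set: the K candidates with the highest validation accuracy
        new_beam = sorted(candidates, key=candidates.get, reverse=True)[:beam_size]
        if new_beam == beam:                   # top-K set unchanged: stop (cf. line 1)
            break
        beam = new_beam
        best = beam[0]
    return best
```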

4.3 Modified stable rank (mSR) for regularized training

For a weight matrix $W \in \mathbb{R}^{m \times n}$, the stable rank is defined as

$$\mathrm{sr}(W) = \frac{\|W\|_F^2}{\|W\|_2^2} = \frac{\sum_{i=1}^{\min(m,n)} \sigma_i^2}{\sigma_1^2}, \qquad (3)$$

where $\sigma_i$ is the $i$th largest singular value of $W$. Because our goal of compression-friendly training is to have almost no energy in the dimensions other than the top $r$ dimensions, we modify the stable rank as below.

$$\mathrm{mSR}(W, r) = \frac{\sum_{i=r+1}^{\min(m,n)} \sigma_i}{\sum_{i=1}^{r} \sigma_i} \qquad (4)$$

The modified stable rank mSR is different from the stable rank in four ways. First, it depends on the input parameter $r$. Second, the summation in the denominator is performed over the $r$ largest singular values. Third, the $r$ largest singular values are excluded from the numerator's summation. Fourth, the singular values are not squared. The third and fourth differences make mSR regularization leave less energy in the undesired dimensions, as shown in Figure 1(b). The compression-friendly training is performed by minimizing the loss $\mathcal{L} = \mathcal{L}_{task} + \lambda \sum_{l=1}^{L} \mathrm{mSR}(W_l, r_l^*)$, where the first term is the original loss of the learning task and the second term is the mSR penalty loss.
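Equation 4 translates directly into a differentiable penalty. The snippet below is a straightforward sketch using a full SVD; the faster randomized variant actually used in training is discussed next.

```python
import torch

def msr_penalty(W: torch.Tensor, r: int) -> torch.Tensor:
    """mSR(W, r): singular-value mass outside the top-r dimensions divided by
    the mass kept inside them (Equation 4). Differentiable with respect to W."""
    s = torch.linalg.svdvals(W)      # singular values in descending order
    return s[r:].sum() / s[:r].sum()

# Compression-friendly loss for one mini-batch (schematic):
# loss = task_loss + lam * sum(msr_penalty(W_l, r_l) for W_l, r_l in zip(weights, ranks))
```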

Through our empirical evaluations, we have confirmed that mSR can stably affect the weight matrices. In fact, the gradient of mSR can be easily derived. To do so, we decompose $W$ into two parts by allocating the first $r$ dimensions into $W_{\le r}$ and the remaining dimensions into $W_{>r}$, and apply the quotient rule together with $\partial \sigma_i / \partial W = u_i v_i^\top$. The details of the derivation are deferred to Appendix A.

A crucial issue with the above mSR regularization is its computational overhead. Use of mSR requires a repeated calculation of the singular value decomposition (SVD), which is computationally intensive. To deal with this issue, we adopted the randomized matrix decomposition method in [erichson2016randomized]. Because the target rank is typically chosen to be small, this practical choice of implementation has almost no effect on the regularization while significantly reducing the computational burden. Furthermore, we obtain an additional reduction by calculating the SVD only once every 64 iterations. Consequently, the training time of mSR-regularized training remains almost the same as that of un-regularized training.
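One way this overhead control might look in code is sketched below, with torch.svd_lowrank standing in for the randomized decomposition of [erichson2016randomized] and the penalty re-evaluated once every 64 steps. The exact caching scheme and the oversampling amount are not spelled out above, so the structure here is our assumption.

```python
import torch

def msr_lowrank(W: torch.Tensor, r: int, oversample: int = 10) -> torch.Tensor:
    """Cheap surrogate for mSR(W, r): only the top r + oversample singular
    values are estimated with a randomized SVD, so the far tail is ignored."""
    q = min(r + oversample, min(W.shape))
    _, s, _ = torch.svd_lowrank(W, q=q)
    return s[r:].sum() / s[:r].sum()

def regularization_term(step, weights, ranks, lam):
    """Add the mSR penalty only every 64 iterations (2-D weight matrices
    assumed; convolution kernels would be reshaped to matrices first)."""
    if step % 64 != 0:
        return 0.0
    return lam * sum(msr_lowrank(W, r) for W, r in zip(weights, ranks))
```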

5 Experiments

5.1 Experimental setting

5.1.1 Baseline models and datasets

To investigate the effectiveness and generalizability of BSR, we evaluate its compression performance for a variety of models and datasets. We mainly followed the experimental settings of LC [idelbayev2020low]: LeNet5 on MNIST, ResNet32 and ResNet56 on CIFAR-10, ResNet56 on CIFAR-100, and AlexNet on large-scale ImageNet (ILSVRC 2012).

5.1.2 Rank selection configuration

BSR performs rank selection only once. Therefore, it is important to find a rank vector $\mathbf{r}^*$ that can result in a high performance after compression-friendly training. To improve the search speed and to make the search algorithm mBS robust, we use three settings of the search parameters $(s, K)$. The search for $\mathbf{r}^*$ is performed three times with the three settings, and the best performing one is selected as the final solution (see the sketch below). Compression-friendly training is performed only after the final selection. Because we choose an $s$ that is larger than one, mBS might fail to find a solution. When this happens, the level step size $s$ is multiplied by a reduction factor and the search is continued from the last candidates.
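The three-configuration procedure can be expressed in a few lines on top of the modified_beam_search sketch from Section 4.2, reusing the same evaluate and compression_ratio callables; the (s, K) pairs and the target ratio below are placeholders rather than the settings used in the paper.

```python
# Placeholder (step, beam) settings -- the paper's exact values are not restated here.
configs = [(10, 5), (5, 5), (3, 5)]

candidates = [
    modified_beam_search(full_ranks, evaluate, compression_ratio,
                         c_target=0.80, beam_size=K, step=s)
    for s, K in configs
]
best_ranks = max(candidates, key=evaluate)   # keep the most accurate of the three
```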

5.1.3 Rank regularized training configuration

We considered several values for the initial learning rate, and a cosine annealing method was used as the learning rate scheduler. We used Nesterov's accelerated gradient method with momentum 0.9 on mini-batches of size 128. A regularization strength schedule was introduced for stable regularized training, where $\lambda$ was gradually increased. To be specific, we used a schedule of the form $\lambda = \lambda_0 \cdot b^{t}$, where $t$ is incremented every 15 epochs.
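A schematic training loop implied by these settings is shown below. The momentum (0.9), mini-batch size (128), cosine annealing, and 15-epoch schedule follow the text; the initial learning rate, base regularization strength, growth factor b, and epoch budget are placeholders, and `model`, `loader`, `criterion`, `weights`, and `ranks` stand for the usual training objects (`msr_penalty` is the sketch from Section 4.3).

```python
import torch

# Placeholder hyperparameters -- only momentum, batch size, the cosine schedule,
# and the 15-epoch lambda schedule are taken from the text.
lr0, lam0, b, epochs = 0.1, 1e-4, 2.0, 180

optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    lam = lam0 * (b ** (epoch // 15))        # strengthen the regularizer every 15 epochs
    for x, y in loader:                      # mini-batches of size 128
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss = loss + lam * sum(msr_penalty(W, r) for W, r in zip(weights, ranks))
        loss.backward()
        optimizer.step()
    scheduler.step()
```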

5.2 Experimental results

We have evaluated the performance in terms of compression ratio vs. test accuracy, and the results are provided here. We have additionally evaluated FLOPs vs. test accuracy, and the results can be found in the appendix.

5.2.1 MNIST

The results for LeNet5 on MNIST are shown in Figure 4. Both BSR and LC clearly outperform CA. Between BSR and LC, BSR outperforms LC over the entire evaluation range. It can be seen that BSR's test accuracy is hardly reduced until the compression ratio reaches 0.97.

Figure 4: Comparison of BSR with CA and LC for LeNet5 on MNIST.

5.2.2 CIFAR-10 and CIFAR-100

The results for ResNet32 and ResNet56 on CIFAR-10 and CIFAR-100 are shown in Figure 5. For ResNet, BSR performs a low-rank compression over the convolutional layers and achieves a conspicuous improvement in compression performance. It can also be noted that BSR can even improve on the no-compression accuracy when the compression ratio is not too large. For instance, the test accuracy in Figure 5(b) is improved from 0.92 to 0.94 by BSR when the compression ratio is around 0.55. LC is also able to improve the accuracy, but the gain is smaller than BSR's. This implies that the regularized training methods can have a positive influence on learning dynamics. As on MNIST, the performance gap between BSR and the other methods becomes larger as the compression ratio increases.

For CA and LC, the compression ratio cannot be fully controlled, and the final compression ratio depends on the hyperparameter setting. We tried many different hyperparameter settings to generate the CA and LC results. For BSR, the compression ratio can be fully controlled because it is an input parameter. Therefore, we first generated the LC results and then chose the compression ratios for the BSR evaluations to match what LC ended up with.

(a) ResNet32 on CIFAR-10
(b) ResNet56 on CIFAR-10
(c) ResNet56 on CIFAR-100
Figure 5: Comparison of BSR with CA and LC for (a) ResNet32 on CIFAR-10, (b) ResNet56 on CIFAR-10, and (c) ResNet56 on CIFAR-100.

5.2.3 ImageNet

The results for AlexNet on ImageNet are shown in Figure 6. We used the pre-trained PyTorch AlexNet network as the base network. The network achieves an accuracy of 56.55% for top-1 and 79.19% for top-5. ImageNet is a much more realistic dataset than MNIST or CIFAR. The performance curves show similar patterns as in Figure 4 and Figure 5, confirming that BSR works well for realistic datasets.

Figure 6: Comparison of BSR with CA and LC for AlexNet on ImageNet (Top-1 performance).

6 Discussion

6.1 Ablation test

The effect of level step size $s$:

We explore the effect of $s$ on the performance of mBS by evaluating the selected rank's quality as a function of $s$ when the beam size $K$ is fixed to one. The results are shown in Figure 7(a), and we can observe that a smaller $s$ provides a better accuracy performance. On the contrary, a smaller $s$ makes the search time exponentially larger, especially for a very small $s$. Because of this trade-off, we use $s$ between three and ten with an adaptive reduction of $s$ when no solution is found.

The effect of beam size $K$:

We explore the effect of $K$ on the performance of mBS by evaluating the selected rank's quality as a function of $K$ when the level step size $s$ is fixed to one. The results are shown in Figure 7(b), and we can observe that a larger $K$ provides a better accuracy performance in general. The accuracy curve, however, exhibits a large variance, and the average performance even deteriorates when $K$ is increased from one to three. This can be attributed to the nature of the rank selection problem: because it is a non-convex problem, it is difficult to say what to expect. Compared to the accuracy curve, the search time curve shows a monotonic behavior where the search time increases as $K$ is increased. Based on the results in Figure 7(b), we have chosen $K$ to be five.

(a) Effect of level step size $s$
(b) Effect of beam size $K$
Figure 7: The effect of the $s$ and $K$ parameters on mBS's performance: a base neural network (ResNet56 on CIFAR-100) was truncated by the selected ranks and no further fine-tuning was applied for this analysis. Performance and search speed change (a) as $s$ increases when $K$ is fixed to 1, and (b) as $K$ increases when $s$ is fixed to 1.
Update of $\mathbf{r}^*$ during training:

In the previous works of CA and LC, the rank vector is a moving target in the sense that CA performs truncated SVD multiple times during the compression-friendly training and LC updates the target weight matrices multiple times during the compression-friendly training. To investigate whether BSR can benefit from updating $\mathbf{r}^*$, we have compared three different scenarios: $\mathbf{r}^*$ is calculated only once in the beginning ("once"), $\mathbf{r}^*$ is additionally updated once just before the decomposition and final fine-tuning (i.e., just before phase three in Figure 2; "again before decomposition"), and $\mathbf{r}^*$ is updated every 30 epochs ("multiple times"). The results are shown in Figure 8(a). Interestingly, BSR performs best when $\mathbf{r}^*$ is determined only once in the beginning and never changed until the completion of the compression. This is closely related to the characteristics of mSR. As already mentioned, regularization through the modified stable rank does not rely on any particular instance of weight matrices. In fact, it only needs to know the target rank vector to be effective. Therefore, there is no need for any update during the training, and BSR can smoothly fine-tune the neural network to have the desired $\mathbf{r}^*$. Clearly, the regularization method of mSR has a positive effect on simplifying how $\mathbf{r}^*$ should be used.

Scheduled strengthening of $\lambda$:

The loss during compression-friendly training is given by $\mathcal{L} = \mathcal{L}_{task} + \lambda \sum_{l} \mathrm{mSR}(W_l, r_l^*)$. As the training continues, we can expect the weight matrices to become increasingly compliant with $\mathbf{r}^*$, thanks to the accumulated effect of the mSR regularization. Then, a weak regularization might not be sufficient as the training continues. A comparison between a fixed $\lambda$ and a scheduled $\lambda$ (according to the explanation in Section 5.1.3) is shown in Figure 8(b). As expected, the scheduled strengthening of $\lambda$ is helpful for improving the compression performance.


(a) Rank selection updates
(b) Scheduled increase of $\lambda$
Figure 8: Performance of BSR for ResNet32 on CIFAR-10: (a) For rank selection, the best performance is achieved when the rank vector $\mathbf{r}^*$ is set once and not updated; in the multiple-times setting, the target rank vector was updated every 30 epochs. (b) For the scheduling of the strength of $\lambda$, the performance is improved by a scheduled increase in strength.
Method | Test acc. (%) | Δ Test acc. (%) | MFLOPs (rate %)

Pruning
Uniform | 92.80 → 89.80 | -3.00 | 62.7 (50%)
LEGR [chin2020towards] | 93.90 → 93.70 | -0.20 | 58.9 (47%)
ThiNet [luo2017thinet] | 93.80 → 92.98 | -0.82 | 62.7 (50%)
CP [he2017channel] | 93.80 → 92.80 | -1.00 | 62.7 (50%)
DCP [zhuang2018discrimination] | 93.80 → 93.49 | -0.31 | 62.7 (50%)
AMC [he2018amc] | 92.80 → 91.90 | -0.90 | 62.7 (50%)
SFP [he2018soft] | 93.59 → 93.35 | -0.24 | 62.7 (50%)
Rethink [liu2018rethinking] | 93.80 → 93.07 | -0.73 | 62.7 (50%)
PFS [wang2020pruning] | 93.23 → 93.05 | -0.18 | 62.7 (50%)
CHIP [sui2021chip] | 93.26 → 92.05 | -1.21 | 34.8 (27%)

Low-rank
CA [alvarez2017compression] | 92.73 → 91.13 | -1.60 | 51.4 (41%)
LC [idelbayev2020low] | 92.73 → 93.10 | +0.37 | 55.7 (44%)
BSR (ours) | 92.73 → 93.53 | +0.80 | 55.7 (44%)
BSR (ours) | 92.73 → 92.51 | -0.22 | 32.1 (26%)

Table 1: Our method is compared with various state-of-the-art network pruning methods for ResNet56 on CIFAR-10. In the last column, "rate" stands for the remaining MFLOPs as a percentage of the uncompressed model; a smaller rate indicates a more efficient model. "Δ Test acc." stands for the difference in test accuracy between the baseline and the pruned or low-rank compressed model (larger is better). For this commonly used benchmark, BSR achieves the best performance.

6.2 Low-rank compression as a highly effective tool

Low-rank compression is only one of many ways of compressing a neural network. We first compare our results with the state-of-the-art pruning algorithms. Then, we show that our low-rank compression can be easily combined with quantization just like pruning can be.

6.2.1 Comparison with filter pruning

Both pruning and low-rank compression are techniques that are applied to the weight parameters. Therefore, they cannot be used together in general. Besides, filter pruning (structured pruning) is an actively studied topic with many known high-performance algorithms. We compared BSR with other pruning methods, including naive uniform channel number shrinkage (Uniform), ThiNet [luo2017thinet], Channel Pruning (CP) [he2017channel], Discrimination-aware Channel Pruning (DCP) [zhuang2018discrimination], Soft Filter Pruning (SFP) [he2018soft], rethinking the value of network pruning (Rethink) [liu2018rethinking], Automatic Model Compression (AMC) [he2018amc], and channel-independence-based pruning (CHIP) [sui2021chip]. Table 1 summarizes the results. We compared the performance drop of each method under similar FLOPs reduction rates. A smaller accuracy drop indicates a better pruning method. When FLOPs are reduced to about 50%, all of the state-of-the-art pruning algorithms exhibited performance drops; our method (BSR), however, showed a performance improvement (92.73 → 93.53). We also compared our method to the most recent powerful pruning algorithm, CHIP [sui2021chip]. CHIP can further reduce the cost to 34.8 MFLOPs with a 1.21% performance drop, while our method showed just a 0.22% performance drop at a similar cost (32.1 MFLOPs).

Model | 32-bit (acc. / MB) | 16-bit (acc. / MB) | 8-bit (acc. / MB) | 4-bit (acc. / MB)
Base network | 0.92 / 3.41 | 0.92 / 1.71 | 0.92 / 0.85 | 0.32 / 0.43
BSR | 0.94 / 1.98 | 0.94 / 0.99 | 0.94 / 0.49 | 0.32 / 0.25
BSR | 0.94 / 1.50 | 0.94 / 0.75 | 0.93 / 0.38 | 0.32 / 0.19
BSR | 0.94 / 1.27 | 0.94 / 0.64 | 0.93 / 0.32 | 0.31 / 0.16
BSR | 0.93 / 0.79 | 0.93 / 0.40 | 0.92 / 0.20 | 0.33 / 0.10
BSR | 0.91 / 0.54 | 0.91 / 0.27 | 0.90 / 0.14 | 0.32 / 0.07
BSR | 0.89 / 0.35 | 0.89 / 0.17 | 0.83 / 0.09 | 0.31 / 0.04
Table 2: Quantization results of ResNet56 on CIFAR-10. Each cell reports the test accuracy and the weight memory (in MB) as the bit width changes. Each BSR row corresponds to a different desired compression ratio $c^*$, increasing from top to bottom.

6.2.2 Combined use with quantization

As is well known, low-rank compression can be used together with quantization. The performance when using both together is shown in Table 2. Even without a performance loss, we can observe that memory usage can be additionally reduced to one-fifth compared to BSR alone. In particular, when comparing the case with the same accuracy as the base network (when $c^* = 0.77$), the memory usage was reduced to 1/17 (0.20 MB) compared to the base network (3.41 MB). By using BSR together with quantization, we were also able to reach a range of memory usage that could not be attained by quantization alone. Note that BSR allows any compression ratio while quantization does not. This feature can make using deep learning models on edge devices much easier. Quantization and our low-rank compression are extremely easy to use and can be applied in practically any situation.
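The memory column of Table 2 can be approximated with a simple bookkeeping helper like the one below, which only counts weight storage at a given bit width for dense or factorized parameter counts; it is an illustration of the arithmetic, not the measurement script behind the table.

```python
def weight_memory_mb(shapes, ranks=None, bits=32):
    """Approximate weight storage (in megabytes) for a dense model (ranks=None)
    or a rank-truncated one, at the given bit width."""
    def params(m, n, r):
        return m * n if r is None else min(r * (m + n), m * n)
    ranks = ranks or [None] * len(shapes)
    total = sum(params(m, n, r) for (m, n), r in zip(shapes, ranks))
    return total * bits / 8 / 1e6   # bits -> bytes -> megabytes

# Halving the bit width (32 -> 16 -> 8) halves the estimate each time, and the
# low-rank reduction from BSR multiplies with it.
```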

6.3 Limitations and future works

It might be possible to improve BSR in a few different ways. While our modified stable rank works well, it might be possible to identify a rank surrogate with better learning dynamics. While we choose only one $\mathbf{r}^*$ and perform only a single compression-friendly training, it might be helpful to choose multiple rank vectors, train them all, and choose the best. This can be an obvious way of improving performance at the cost of extra computation. Pruning and low-rank compression cannot be used at the same time, but it might be helpful to apply them sequentially. In general, combining multiple compression techniques to generate a synergy remains as future work.

7 Conclusion

We have introduced a new low-rank compression method called BSR. Its main improvements compared to the previous works are in the rank selection algorithm and the rank-regularized training. The design of the modified beam-search is based on the idea that beam-search is a superior way of balancing search performance and search speed. Modifications such as the introduction of the level step size and the compression-ratio constraint play important roles. The design of the modified stable rank is based on a careful analysis of how a weight matrix's rank is regularized. To the best of our knowledge, our modified stable rank is the first regularization method that truly controls the rank. As a result, BSR performs very well.

References

Appendix A Derivative of mSR

The gradient of mSR is derived by first decomposing $W$ into two parts, allocating the first $r$ dimensions into $W_{\le r}$ and the remaining dimensions into $W_{>r}$:

$$W = W_{\le r} + W_{>r}, \qquad W_{\le r} = \sum_{i=1}^{r} \sigma_i u_i v_i^\top, \qquad W_{>r} = \sum_{i=r+1}^{\min(m,n)} \sigma_i u_i v_i^\top,$$

where $u_i$ and $v_i$ are the left and right singular vectors of $W$. Then, using $\partial \sigma_i / \partial W = u_i v_i^\top$ and the quotient rule, the gradient can be written as

$$\frac{\partial\, \mathrm{mSR}(W, r)}{\partial W} = \frac{\sum_{i>r} u_i v_i^\top}{\sum_{i=1}^{r} \sigma_i} \;-\; \frac{\big(\sum_{i>r} \sigma_i\big) \sum_{i=1}^{r} u_i v_i^\top}{\big(\sum_{i=1}^{r} \sigma_i\big)^2}.$$

Appendix B Experimental results for FLOPs

B.1 Number of floating point operations (FLOPs) computation

In the literature, there is no clear consensus on how to compute the total number of floating point operations (FLOPs) in the forward pass of neural network. While some authors define this number as the total number of multiplications and additions [ye2018rethinking], others assume that multiplications and additions can be fused and compute one multiplication and addition as a single operation [he2016deep]. In this work, we use the second definition of FLOPs.

B.1.1 FLOPs in a fully-connected layer

For a fully-connected layer with weight $W \in \mathbb{R}^{m \times n}$ and bias $b \in \mathbb{R}^{m}$, the FLOPs can be calculated as $\mathrm{FLOPs}_{fc} = mn$ under the fused multiply-add convention; after a rank-$r$ factorization, this becomes $r(m+n)$.

B.1.2 FLOPs in a convolutional layer

For a convolutional layer with parameters $W \in \mathbb{R}^{c_{out} \times c_{in} \times k_h \times k_w}$ and an output feature map of size $H_{out} \times W_{out}$, the linear mapping is applied $H_{out} W_{out}$ times. Thus, the FLOPs can be calculated as $\mathrm{FLOPs}_{conv} = c_{out}\, c_{in}\, k_h\, k_w\, H_{out} W_{out}$.

We exclude batch normalization (BN) and concatenation/copy operations. For BN layers, the BN parameters can be fused into the weights and biases of the preceding layer, and therefore no special treatment is required. For concatenations, copy operations, and non-linearities, we assume zero FLOPs because of their negligible cost.
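Under this fused multiply-add convention, the per-layer counts can be sketched as follows; the factorized branches assume the fully-connected scheme of Section 3.2 and a reshaping of the convolution kernel to a $c_{out} \times (c_{in} k_h k_w)$ matrix before truncation, which is one common choice rather than a detail stated here.

```python
def fc_flops(m, n, r=None):
    """Fused multiply-adds of an m x n fully-connected layer, optionally
    factorized into a rank-r cascade of two layers."""
    return m * n if r is None else r * (m + n)

def conv_flops(c_out, c_in, k_h, k_w, h_out, w_out, r=None):
    """Fused multiply-adds of a convolutional layer: the linear mapping is
    applied once per output spatial position (h_out * w_out times)."""
    per_position = c_out * c_in * k_h * k_w if r is None else r * (c_out + c_in * k_h * k_w)
    return per_position * h_out * w_out
```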

B.2 MNIST

The results for LeNet5 on MNIST are shown in Figure 9. When FLOPs are considered instead of the compression ratio, the resulting evaluation range of each approach varies. Nonetheless, it is not difficult to confirm that both BSR and LC outperform CA. Between BSR and LC, BSR outperforms LC over the entire BSR evaluation range.

Figure 9: Comparison of BSR with CA and LC for LeNet5 on MNIST.

B.3 CIFAR-10 and CIFAR-100

The results for ResNet32/ResNet56 on CIFAR-10 and the results for ResNet56 on CIFAR-100 are shown in Figure 10. For ResNet, BSR performs a low-rank compression over the convolutional layers. While LC directly considered FLOPs when selecting ranks, we selected ranks by considering only the validation accuracy. Nonetheless, we achieve higher test accuracies. BSR can even improve on the no-compression accuracy when the FLOPs are not too small. For instance, the test accuracy in Figure 10(b) is improved from 0.92 to 0.94 by BSR when the cost is around 140 MFLOPs. LC is also able to improve the accuracy, but the gain is smaller than BSR's. This implies that the regularized training methods can have a positive influence on learning dynamics. Similarly, in Figure 10(a) and Figure 10(c), the test accuracy is improved over no-compression when the FLOPs are not too small.

(a) ResNet32 on CIFAR-10
(b) ResNet56 on CIFAR-10
(c) ResNet56 on CIFAR-100
Figure 10: Comparison of BSR with CA and LC for (a) ResNet32 on CIFAR-10, (b) ResNet56 on CIFAR-10, and (c) ResNet56 on CIFAR-100.

B.4 ImageNet

The results for AlexNet on ImageNet are shown in Figure 11. As in the compression ratio vs. test accuracy evaluation, we used the pre-trained PyTorch AlexNet network as the base network. ImageNet is a much more complex dataset than MNIST or CIFAR. The performance curves show similar patterns as in Figure 10, confirming that BSR works well for complex datasets.

Figure 11: Comparison of BSR with CA and LC for AlexNet on ImageNet (Top-1 performance).

Appendix C Quantization results of AlexNet on ImageNet

In this section, we combine low-rank compression and quantization for ImageNet. The performance when using both is shown in Table 3. Although there can be a slight performance loss, we can observe a further reduction in memory usage by using quantization in addition to BSR. For a given $c^*$, we can save more than half of the memory by reducing the quantization from 32 bits to 16 bits. Similar to the ResNet56 results, by using BSR together with quantization we were also able to reach a range of memory usage that could not be attained by quantization alone. Even with more realistic datasets and larger networks, using BSR and quantization together is effective in reducing memory usage. This feature can make it much easier to use deep learning models on edge devices. Quantization and our low-rank compression are extremely easy to use and can be applied in practically any situation.

Model | 32-bit (acc. / MB) | 16-bit (acc. / MB) | 8-bit (acc. / MB) | 4-bit (acc. / MB)
Base network | 0.56 / 244.40 | 0.56 / 122.20 | 0.54 / 61.10 | 0.04 / 30.55
BSR | 0.55 / 108.17 | 0.55 / 54.09 | 0.52 / 27.04 | 0.03 / 13.52
BSR | 0.55 / 91.46 | 0.54 / 45.72 | 0.52 / 22.86 | 0.03 / 11.43
BSR | 0.55 / 71.50 | 0.55 / 35.75 | 0.51 / 17.88 | 0.03 / 8.94
BSR | 0.54 / 60.34 | 0.54 / 30.17 | 0.51 / 15.09 | 0.03 / 7.54
BSR | 0.54 / 38.38 | 0.52 / 19.19 | 0.48 / 9.59 | 0.03 / 4.80
BSR | 0.48 / 22.06 | 0.45 / 11.03 | 0.43 / 5.52 | 0.02 / 2.76
Table 3: Quantization results of AlexNet on ImageNet. Each cell reports the test accuracy and the weight memory (in MB) as the bit width changes. Each BSR row corresponds to a different desired compression ratio $c^*$, increasing from top to bottom.

Appendix D Additional ablation tests

Additional results are provided for adjusting $s$ and $K$.

D.1 The effect of level step size $s$

We checked the effect of the level step size $s$ for a fixed beam size $K$. The analysis was repeated for a range of fixed $K$ values, and the results are shown in Figure 12. From the results, it can be confirmed that it is desirable to choose an $s$ that is not too large.

Figure 12: The effect of the $s$ parameter on mBS's performance for a fixed $K$: a base neural network (ResNet56 on CIFAR-100) was truncated by the selected ranks and no further fine-tuning was applied for this analysis. Performance and search speed change as a function of $s$: (a) $K$ is fixed at 1, (b) $K$ is fixed at 3, (c) $K$ is fixed at 5, (d) $K$ is fixed at 10, (e) $K$ is fixed at 20.
Figure 13: The effect of the $K$ parameter on mBS's performance for a fixed $s$: a base neural network (ResNet56 on CIFAR-100) was truncated by the selected ranks and no further fine-tuning was applied for this analysis. Performance and search speed change as a function of $K$: (a) $s$ is fixed at 1, (b) $s$ is fixed at 3, (c) $s$ is fixed at 5, (d) $s$ is fixed at 10, (e) $s$ is fixed at 20.

D.2 The effect of beam size $K$

We checked the effect of the beam size $K$ for a fixed level step size $s$. As can be seen in Figure 13(a), Figure 13(b), Figure 13(c), and Figure 13(d), we can confirm that the test accuracy generally improves as the beam size increases. This is because more candidates can be searched with a larger $K$. However, in Figure 13(e), where $s$ is very large, it can be observed that the performance deteriorates as $K$ is increased.