1 Introduction
Figure 1: (a) Performance of rank selection methods: a base neural network (ResNet56 on CIFAR-100) was truncated to the selected ranks and no further fine-tuning was applied. Modified beam-search (mBS) clearly outperforms the other two. (b) Effectiveness of rank-regularized training: a base neural network (ResNet56 on CIFAR-100) was regularized by each compression-friendly training method. For a target rank of five, modified stable rank (mSR) is the only method that clearly minimizes the singular values other than the top five.

As deep learning becomes widely adopted by industry, the demand for compression techniques that are highly effective and easy to use is sharply increasing. The most popular compression methods include quantization [rastegari2016xnor, wu2016quantized], pruning of redundant parameters [han2015deep, lebedev2016fast, srinivas2015data], knowledge distillation from a large network to a small one [hinton2015distilling, kim2018paraphrasing, romero2014fitnets, zagoruyko2016paying], and network factorization [alvarez2017compression, denton2014exploiting, masana2017domain, xue2013restructuring]. In this work, we focus on low-rank compression, which is based on matrix factorization and low-rank approximation of the weight matrices.
A typical low-rank compression is based on a direct application of SVD (Singular Value Decomposition). For a trained neural network, the $l$-th layer's weight matrix of size $m_l \times n_l$ is decomposed and only the top $r_l$ dimensions are kept. This reduces the computation and memory requirements from $O(m_l n_l)$ to $O((m_l + n_l) r_l)$, and the reduction can be large when a small $r_l$ is chosen.

The traditional approaches can be divided into two main streams. One considers only a post-hoc application of SVD on a fully trained neural network, and the other additionally performs a compression-friendly training before SVD.
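As a concrete illustration of the factorization described above, the truncation of a single weight matrix can be sketched in NumPy (the sizes `m`, `n`, `r` below are arbitrary illustrations, not values from the paper):

```python
import numpy as np

# Illustrative layer sizes; any trained weight matrix works the same way.
m, n, r = 256, 512, 32
W = np.random.randn(m, n)

# Truncated SVD: keep only the top-r singular dimensions.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]          # m x r  (singular values absorbed into one factor)
B = Vt[:r, :]                 # r x n
W_approx = A @ B              # rank-r approximation of W

# Parameter count drops from m*n to (m + n)*r.
params_before = m * n         # 131072
params_after = (m + n) * r    # 24576
```

Replacing `W` with the cascade `A`, `B` is exactly the compression step: the layer now performs two thin matrix multiplications instead of one large one.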
For the first stream, the main research problem is the selection of the ranks $r_l$ [jaderberg2014speeding, denton2014exploiting, tai2015convolutional, zhang2015accelerating, wen2017coordinating]. To be precise, the problem is to select $r_l$ for all $L$ layers such that a high compression ratio can be achieved without harming the performance too much. Because rank selection is a nonlinear optimization problem, a variety of heuristics have been studied, typically introducing and adjusting a threshold. These works, however, failed to achieve a competitive effectiveness because a trained neural network is unlikely to maintain its performance when a small $r_l$ is selected.

For the other stream, several works introduced an additional step of compression-friendly training [alvarez2017compression, idelbayev2020low, li2018constrained]. This can be a promising strategy because it is common practice to train a large network to secure desirable learning dynamics [luo2017thinet, carreira2018learning], while at the same time the network can be regularized in many different ways without harming the performance [choi2021statistical]. For rank selection, however, most of these works still used heuristics and failed to achieve a competitive performance. Recently, idelbayev2020low
achieved a state-of-the-art performance for low-rank compression by calculating target ranks with a variant of the Eckart-Young theorem and by performing a regularized training with respect to the weight matrices truncated according to the target rank vector $\hat{r}$. The existing compression-friendly works, however, require recalculations of the truncated matrices during training, train weight matrices that are not truly low-rank, and demand extensive tuning, especially for obtaining a compressed network of a desired compression ratio.

In this work, we enhance the existing low-rank compression methods in three different aspects. First, we adopt a modified beam-search with a performance validation for selecting the target rank vector $\hat{r}$. Compared to the previous works, our method allows a joint search of rank assignments over all the layers, and the effect is shown in Figure 1(a). The previous works relied on a simple heuristic (e.g., all layers should be truncated to keep the same portion of energy [alvarez2017compression]) or an implicit connection of all layers (e.g., through a common penalty parameter [idelbayev2020low]).

Secondly, we adopt a modified stable rank for the rank-regularized training. Because our modified stable rank does not rely on any particular instance of weight matrix truncation, it can be continuously enforced without any update as long as the target rank vector remains constant. The previous works used a forced truncation in the middle of training [alvarez2017compression] or a norm-distance regularization with the truncated weight matrices [idelbayev2020low]. In both methods, iterating between a weight truncation step and a training step was necessary. In our method, we calculate and set the target rank vector only once. Our modified stable rank turns out to be very effective at controlling the singular values, as can be seen in Figure 1(b). As a result of the first and second aspects, our low-rank compression method can achieve a performance on par with or even better than the latest pruning methods. Because low-rank compression can be easily combined with quantization just like pruning, a very competitive overall compression performance can be achieved.

Thirdly, our method requires only one hyperparameter to be tuned.
For a given desired compression ratio $\rho^*$, all that needs to be tuned is the regularization strength $\lambda$. While a low-rank compression method like LC [idelbayev2020low] accomplished an outstanding performance, it requires extensive tuning to identify a compressed network of a desired compression ratio. Such a difficulty in tuning can be a major drawback for usability. We believe our simplification of tuning is a contribution that is as important as, or perhaps even more important than, the performance improvement.
2 Related works
2.1 DNN compression methods
In the past decade, tremendous progress has been made in the research field of DNN compression. Comprehensive surveys can be found in [lebedev2018speeding, cheng2017survey, deng2020model, choudhary2020comprehensive, nan2019deep]. Besides the low-rank compression discussed in Section 1, the main algorithmic categories are quantization, pruning, and knowledge distillation. Among them, quantization and pruning are known as the most competitive compression schemes.
Quantization reduces the number of data bits and parameter bits, and it is an actively studied area with broad adoption in the industry [courbariaux2016binarized, gupta2015deep, han2015learning, rastegari2016xnor, wu2016quantized]. There are many flavors of quantization, including binary parameterization [courbariaux2016binarized, rastegari2016xnor], low-precision fixed-point [gupta2015deep, lin2016fixed], and mixed-precision training [yang2021bsq, bulat2021bit]. While many of them require a dedicated hardware-level support, we limit our focus to the pure algorithmic solutions, and consider combining our method with an algorithmic quantization in Section 6.
Network pruning includes weight pruning [han2015deep, han2015learning] and filter pruning [chin2020towards, luo2017thinet, he2017channel, zhuang2018discrimination, he2018amc, he2018soft, liu2018rethinking, wang2020pruning]. Weight pruning selects individual weights to be pruned. Because of the unstructured selection patterns, it requires customized GPU kernels or specialized hardware [han2015deep, zhao2019efficient]. Filter pruning selects entire filters only. Because of the structured selection patterns, the resulting compression can be easily implemented with off-the-shelf CPUs/GPUs [sui2021chip]. In Section 6, our results are compared with the state-of-the-art filter pruning methods because both are software-only solutions that reduce the number of weight parameters.
2.2 Rank selection
In the past decade, a variety of rank selection methods have been studied for low-rank compression of deep neural networks. In the early studies of low-rank compression, rank selection itself was not the main focus of the research, and it was performed manually through repeated experiments [jaderberg2014speeding, denton2014exploiting, tai2015convolutional]. Then, a correlation between the sum of singular values (called energy) and DNN performance was observed, and subsequently rank selection methods that minimally affect the energy were proposed [zhang2015accelerating, alvarez2017compression, wen2017coordinating, li2018constrained, Yuhuirankpruning2018]. The following works formulated and solved rank selection as optimization problems in which the singular values appear explicitly [kim2019efficient, idelbayev2020low]. In our modified beam-search method, we neither use the concept of energy nor utilize the singular values. Instead, we directly perform a search over the space of rank vectors using a validation dataset.
2.3 Beam search
Beam search is a technique for searching a tree, especially when the solution space is vast [xu2007learning, antoniol1995language, furcy2005limited]. It is based on the heuristic of developing $K$ solutions in parallel while repeatedly inspecting the adjacent nodes of the solutions in the tree. $K$ is commonly referred to as the beam size; $K = 1$ corresponds to the greedy search [huang2012structured] and $K = \infty$ corresponds to the breadth-first search [meister2020if]. Obviously, beam search is a compromise between the two, where $K$ is the control parameter. Beam search has been widely adopted, especially for natural language processing tasks such as speech recognition [lowerre1976speech, boulanger2013audio], scheduling [habenicht2002scheduling], and generation of natural language adversarial attacks [tengfei2021adversarial]. In our work, beam search is slightly modified to allow a search of depth-$s$ descendant nodes instead of the children nodes, and the modified beam search is employed for a joint search of the weight matrix ranks over all $L$ layers.
2.4 Stable rank and rank regularization
Formally speaking, the stable rank of a matrix is defined as the ratio between its squared Frobenius norm and its squared spectral norm [rudelson2007sampling]. Simply speaking, the definition boils down to the ratio between the sum of squared singular values and the squared value of the largest singular value. Therefore, a smaller stable rank implies a relatively larger portion of energy in the largest singular value. In deep learning research, stable rank normalization was studied for improving generalization [sanyal2019stable], and a stable rank approximation was utilized for performance tuning [choi2021statistical]. In our work, we modify the stable rank's denominator to the sum of the top $r$ singular values such that we can concentrate the activation signals in the top $r$ dimensions. While our approach requires only the target rank to be specified, the previous compression-friendly training methods calculated the truncated matrices and directly used them for the regularization. Therefore, the previous methods required frequent updates of the truncated matrices during training. In our BSR method, we calculate the target rank vector $\hat{r}$ only once in the beginning and keep it fixed.
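For reference, the (unmodified) stable rank defined above can be computed directly from the singular values; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def stable_rank(W):
    """Stable rank: squared Frobenius norm over squared spectral norm,
    i.e. sum of squared singular values over the largest one squared."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return (s ** 2).sum() / s[0] ** 2
```

For the identity matrix all singular values are equal, so the stable rank equals the full rank; for a rank-one matrix it is exactly one.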
3 Low-rank compression
3.1 The basic process
A typical process of low-rank compression consists of four steps: 1) train a deep neural network, 2) select rank assignments over the layers, 3) factorize the weight matrices using truncated SVD [denton2014exploiting, masana2017domain, xue2013restructuring] according to the selected ranks, and 4) fine-tune the truncated model to recover the performance as much as possible. In our work, we mainly focus on the rank selection step and an additional step of compression-friendly training. The additional step is placed between step 2 and step 3.
3.2 Compression ratio
Consider an $L$-layer neural network with $W_1, \ldots, W_L$, where $W_l \in \mathbb{R}^{m_l \times n_l}$, as its weight matrices. Without a low-rank compression, the rank vector $r = (r_1, \ldots, r_L)$ corresponds to the full rank vector where $r_l = \min(m_l, n_l)$. The rank selection is performed over the set of rank vectors with $r_l \leq \min(m_l, n_l)$, and the selected rank vector is denoted as $\hat{r}$. With $\hat{r}$, we can perform a truncated SVD. For the $l$-th layer's weight matrix $W_l$, we keep only the $\hat{r}_l$ largest singular values to obtain $W_l \approx U_l \Sigma_l V_l^T$ where $U_l \in \mathbb{R}^{m_l \times \hat{r}_l}$, $\Sigma_l \in \mathbb{R}^{\hat{r}_l \times \hat{r}_l}$, and $V_l \in \mathbb{R}^{n_l \times \hat{r}_l}$. Then, the compression is achieved by replacing $W_l$ with a cascade of two matrices: $U_l \Sigma_l$ and $V_l^T$. Obviously, the computational benefit stems from the reduction in the matrix multiplication loads: $m_l n_l$ for the original $W_l$ and $(m_l + n_l)\,\hat{r}_l$ for $U_l \Sigma_l$ and $V_l^T$. Similar results hold for convolutional layers. Finally, the compression ratio for the selected rank vector $\hat{r}$ can be calculated as
$$\rho(\hat{r}) = 1 - \frac{\sum_{l=1}^{L} (m_l + n_l)\,\tilde{r}_l}{\sum_{l=1}^{L} m_l n_l} \qquad (1)$$
where $\tilde{r}_l$ is a simplified notation of $\tilde{r}(\hat{r}_l, m_l, n_l)$ that is defined as $\hat{r}_l$ if $(m_l + n_l)\,\hat{r}_l < m_l n_l$ and $\min(m_l, n_l)$ otherwise.
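The compression-ratio computation can be sketched as follows; this assumes the convention that $\rho$ is the fraction of parameters removed (consistent with high ratios meaning heavy compression elsewhere in the paper), and the function names are ours:

```python
def clipped_rank(r, m, n):
    # r-tilde: use the selected rank only if factorizing actually saves parameters.
    return r if (m + n) * r < m * n else min(m, n)

def compression_ratio(ranks, shapes):
    """rho(r) = 1 - (compressed params / original params), per Eq. (1)."""
    orig = sum(m * n for m, n in shapes)
    comp = sum((m + n) * clipped_rank(r, m, n) for r, (m, n) in zip(ranks, shapes))
    return 1.0 - comp / orig
```

For example, two layers of shapes (256, 512) and (512, 512) truncated to ranks 32 and 64 keep 90112 of 393216 parameters, a compression ratio of about 0.77.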
4 Methodology
4.1 Overall process
The overall process of our BSR low-rank compression is shown in Figure 2. Starting from a fully trained network, phase one of BSR performs rank selection using the mBS algorithm, which requires only a desired compression ratio $\rho^*$ as the input. Once phase one is completed, $\hat{r}$ is fixed and rank-regularized training is performed in phase two. The strength of mSR is controlled by $\lambda$, which is gradually increased in a scheduled manner. Upon the completion of phase two, the trained network is truncated using singular value decomposition according to $\hat{r}$. Then, a final fine-tuning is performed to complete the compression.
4.2 Modified beam-search (mBS) for rank selection
When a neural network with weight matrices $W_1, \ldots, W_L$ is truncated according to the rank vector $r$, the accuracy can be evaluated with a validation dataset and is denoted as $A(r)$. The corresponding compression ratio can be calculated as $\rho(r)$ in Equation 1. Our goal of rank selection is to find the $r$ with the highest accuracy while the compression ratio is sufficiently close to the desired compression ratio $\rho^*$. This problem can be formulated as below.
$$\hat{r} = \operatorname*{argmax}_{r} \; A(r) \quad \text{subject to} \quad |\rho(r) - \rho^*| \leq \epsilon \qquad (2)$$
Note that we have introduced a small constant $\epsilon$ for relaxing the desired compression ratio. This relaxation forces the returned solution to have a compression ratio close to $\rho^*$, and it is also utilized as a part of the exit criteria.
The problem in Equation 2 is a combinatorial optimization problem [reeves1993comb], and a simple greedy algorithm can be applied as in [zhang2015accelerating]. Because the cardinality of the search space is extremely large, however, greedy algorithms hardly produce good results in a reasonable computation time. On the other hand, a full search is also unacceptable because of its long search time. As a compromise, we adopt a beam-search framework and make adequate adjustments. Before presenting the details of mBS, a simple illustration of how Equation 2 can be solved with our modified beam search is presented in Figure 3. The details of mBS can be explained as follows.

Stage 1: Initialize the level, $l = 0$. Initialize the top-K set, $T_0 = \{r_{\max}\}$, where $r_{\max}$ is the full rank assignment.

Stage 2: Move to the next level by adding the pre-chosen step size $s$: $l \leftarrow l + s$. For each element in $T$, find all of its descendants at level $l$ and add them to the candidate set $C$. Exclude the descendants that perform too much compression by checking the condition $\rho(c) \leq \rho^* + \epsilon$.

Stage 3: Calculate the new top-K set $T$ at level $l$ by finding the top $K$ elements of $C$ ordered by validation accuracy $A(\cdot)$.

Stage 4: Repeat Stage 2 and Stage 3 until $\rho(r^*) \geq \rho^* - \epsilon$ is satisfied for the best element $r^* \in T$. If the condition is satisfied, return $r^*$ as $\hat{r}$.
An implementation of this process is provided in Algorithm 1. The main modification we make is the introduction of the level step size $s$. For rank selection of low-rank compression, a weight matrix's rank needs to be reduced by a sufficient amount and meet the minimum condition of $(m_l + n_l)\,r_l < m_l n_l$ to achieve a positive compression effect. Furthermore, most of the weight matrices can tolerate a significant amount of rank reduction without harming the performance in the beginning of the search. Therefore, we typically set $s$ between three and ten to make the search faster. As the search progresses, we reduce $s$ whenever no candidate can be found at the next level, thereby improving the search resolution. Besides $s$, the beam size $K$ is another important parameter that determines the trade-off between speed and search resolution. Instead of performing a fine and slow search with a small $s$ and a large $K$, we perform a fast search a few times with different configurations and choose the best result. The configuration details can be found in Section 5, and ablation studies for $s$ and $K$ can be found in Section 6.
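The four stages above can be sketched as follows. This is an illustrative simplification, not the paper's Algorithm 1: `accuracy_fn` stands in for the validation accuracy of the truncated network, each descendant reduces a single layer's rank by `step`, and the step is halved when no candidate survives the compression-ratio filter (the paper's exact reduction rule is not stated, so the halving is our assumption):

```python
def mbs(full_ranks, shapes, accuracy_fn, rho_target, eps=0.02, K=5, step=5):
    """Modified beam-search sketch over rank vectors (hypothetical helper)."""
    def rho(ranks):  # compression ratio, as in Eq. (1) without the clipping
        orig = sum(m * n for m, n in shapes)
        comp = sum((m + n) * r for r, (m, n) in zip(ranks, shapes))
        return 1.0 - comp / orig

    beam = [tuple(full_ranks)]
    while step >= 1:
        # Descendants: reduce one layer's rank by `step` for each beam member.
        cand = {tuple(r - step if i == j else r for j, r in enumerate(v))
                for v in beam for i in range(len(v))}
        # Drop invalid ranks and candidates that compress too much (Stage 2).
        cand = {c for c in cand if min(c) >= 1 and rho(c) <= rho_target + eps}
        if not cand:
            step //= 2            # refine the search resolution when stuck
            continue
        # Keep the top-K candidates by (validation) accuracy (Stage 3).
        beam = sorted(cand, key=accuracy_fn, reverse=True)[:K]
        if rho(beam[0]) >= rho_target - eps:  # exit criterion (Stage 4)
            return beam[0]
    return beam[0]
```

With a toy accuracy proxy that simply prefers larger ranks, the search descends until the compression ratio first enters the $\epsilon$-window around the target.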
4.3 Modified stable rank (mSR) for regularized training
For a weight matrix $W$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots$, the stable rank is defined as
$$\text{srank}(W) = \frac{\|W\|_F^2}{\|W\|_2^2} = \frac{\sum_i \sigma_i^2}{\sigma_1^2} \qquad (3)$$
where $\sigma_i$ is the $i$-th singular value of $W$. Because our goal of compression-friendly training is to have almost no energy in the dimensions other than the top $r$ dimensions, we modify the stable rank as below.
$$\text{mSR}(W; r) = \frac{\sum_{i > r} \sigma_i}{\sum_{i \leq r} \sigma_i} \qquad (4)$$
The modified stable rank mSR differs from the stable rank in four ways. First, it depends on the input parameter $r$. Second, the summation in the denominator is performed over the $r$ largest singular values. Third, the $r$ largest singular values are excluded from the numerator's summation. Fourth, the singular values are not squared. The third and fourth differences make mSR regularization leave less energy in the undesired dimensions, as shown in Figure 1(b). The compression-friendly training is performed by minimizing a loss of the form $\mathcal{L} + \lambda \sum_l \text{mSR}(W_l; \hat{r}_l)$, where the first term is the original loss of the learning task and the second term is the mSR penalty.
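Equation 4 can be computed directly from the singular values. A minimal sketch (an exact SVD is used here for clarity; the paper uses a randomized decomposition for speed, and the function name is ours):

```python
import numpy as np

def msr(W, r):
    """Modified stable rank (Eq. 4): singular-value mass outside the top-r
    dimensions divided by the mass inside them (values not squared)."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return s[r:].sum() / s[:r].sum()
```

Driving `msr(W, r)` toward zero pushes all of the energy into the top `r` dimensions, which is exactly the compression-friendly objective; for an exactly rank-`r` matrix the penalty is already zero.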
Through our empirical evaluations, we have confirmed that mSR can stably affect the weight matrices. In fact, the gradient of mSR can be easily derived. To do so, we decompose $W$ into two parts by allocating the first $r$ dimensions into $W_{\leq r}$ and the remaining dimensions into $W_{> r}$.
The derivative then follows; the details of the derivation are deferred to Appendix A.
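The derivative elided here can be reconstructed from Equation 4 using the standard identity $\partial \sigma_i / \partial W = u_i v_i^T$ (valid when the singular values are distinct); the shorthand symbols $N$ and $D$ are introduced for brevity and are not from the original text:

```latex
N = \sum_{i > r} \sigma_i, \qquad D = \sum_{i \leq r} \sigma_i,
\qquad
\frac{\partial\, \mathrm{mSR}(W; r)}{\partial W}
  = \frac{1}{D} \sum_{i > r} u_i v_i^T \;-\; \frac{N}{D^2} \sum_{i \leq r} u_i v_i^T ,
```

which is the quotient rule applied to Equation 4, with the first term driving the tail singular values down and the second term rewarding growth of the top-$r$ singular values.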
A crucial issue with the above mSR regularization is its computational overhead. Use of mSR requires a repeated calculation of the singular value decomposition (SVD), which is computationally intensive. To deal with this issue, we adopted the randomized matrix decomposition method of [erichson2016randomized]. Because the target rank is typically chosen to be small, this practical choice of implementation has almost no effect on the regularization while significantly reducing the computational burden. Furthermore, we obtain an additional reduction by calculating the SVD only once every 64 iterations. Consequently, the training time of mSR-regularized training remains almost the same as that of unregularized training.
5 Experiments
5.1 Experimental setting
5.1.1 Baseline models and datasets
To investigate the effectiveness and generalizability of BSR, we evaluate its compression performance for a variety of models and datasets. We mainly followed the experimental settings of LC [idelbayev2020low]: LeNet5 on MNIST; ResNet32 and ResNet56 on CIFAR-10; ResNet56 on CIFAR-100; and AlexNet on the large-scale ImageNet (ILSVRC 2012).
5.1.2 Rank selection configuration
BSR performs rank selection only once. Therefore, it is important to find a rank vector that can result in a high performance after compression-friendly training. To improve the search speed and to make the search algorithm mBS robust, we use three settings of $(s, K)$. The search for $\hat{r}$ is performed three times with the three settings, and the best performing one is selected as the final solution. Compression-friendly training is performed only after the final selection. Because we choose a step size $s$ that is larger than one, mBS might fail to find a solution. When this happens, the level step size $s$ is reduced and the search is continued from the last candidates.
5.1.3 Rank regularized training configuration
We tuned the initial learning rate and used a cosine annealing method as the learning rate scheduler. We used Nesterov's accelerated gradient method with momentum 0.9 on mini-batches of size 128. A regularization strength schedule was introduced for a stable regularized training, where $\lambda$ was gradually increased. To be specific, we used a schedule of $\lambda = \lambda_0 \cdot b^t$, where $t$ is incremented every 15 epochs.
5.2 Experimental results
We have evaluated the performance in terms of compression ratio vs. test accuracy, and the results are provided here. We have additionally evaluated FLOPs vs. test accuracy, and the results can be found in the appendix.
5.2.1 MNIST
The results for LeNet5 on MNIST are shown in Figure 4.
Both BSR and LC clearly outperform CA. Between BSR and LC, BSR outperforms LC over the entire evaluation range. It can be seen that BSR's test accuracy is hardly reduced until the compression ratio reaches 0.97.
5.2.2 CIFAR-10 and CIFAR-100
The results for ResNet32 and ResNet56 on CIFAR-10 and CIFAR-100 are shown in Figure 5. For ResNet, BSR performs a low-rank compression over the convolutional layers and achieves a conspicuous improvement in compression performance. It can also be noted that BSR can even improve upon the no-compression accuracy when the compression ratio is not too large. For instance, the test accuracy in Figure 5(b) is improved from 0.92 to 0.94 by BSR when the compression ratio is around 0.55. LC is also able to improve the accuracy, but the gain is smaller than BSR's. This implies that the regularized training methods can have a positive influence on the learning dynamics. As on MNIST, the performance gap between BSR and the other methods becomes larger as the compression ratio increases.
For CA and LC, the compression ratio cannot be fully controlled, and the final compression ratio depends on the hyperparameter setting. We tried many different hyperparameter settings to generate the CA and LC results. For BSR, the compression ratio can be fully controlled because it is an input parameter. Therefore, we first generated the LC results and then chose the compression ratios of the BSR evaluations to match what LC ended up with.
5.2.3 ImageNet
The results for AlexNet on ImageNet are shown in Figure 6. We used the pretrained PyTorch AlexNet network as the base network. The network achieves an accuracy of 56.55% for top-1 and 79.19% for top-5. ImageNet is a much more realistic dataset than MNIST or CIFAR. The performance curves show similar patterns as in Figure 4 and Figure 5, confirming that BSR works well for realistic datasets.
6 Discussion
6.1 Ablation test
The effect of level step size $s$:
We explore the effect of $s$ on the performance of mBS by evaluating the selected rank's quality as a function of $s$ for a fixed beam size $K$. The results are shown in Figure 7(a), and we can observe that a smaller $s$ provides a better accuracy. On the other hand, a smaller $s$ makes the search time exponentially larger, especially for a very small $s$. Because of this trade-off, we use $s$ between three and ten with an adaptive reduction of $s$ when no solution is found.
The effect of beam size $K$:
We explore the effect of $K$ on the performance of mBS by evaluating the selected rank's quality as a function of $K$ for a fixed level step size $s$. The results are shown in Figure 7(b), and we can observe that a larger $K$ generally provides a better accuracy. The accuracy curve, however, exhibits a large variance, and the average performance even deteriorates when $K$ is increased from one to three. This can be attributed to the nature of the rank selection problem: because it is non-convex, it is difficult to say what to expect. Compared to the accuracy curve, the search time curve shows a monotonic behavior where the search time increases as $K$ is increased. Based on the results in Figure 7(b), we have chosen $K$ to be five.
Update of $\hat{r}$ during training:
In the previous works of CA and LC, the rank vector is a moving target in the sense that CA performs truncated SVD multiple times during the compression-friendly training and LC updates the target weight matrices multiple times during the compression-friendly training. To investigate whether BSR can benefit from updating $\hat{r}$, we compared three different scenarios: $\hat{r}$ is calculated only once in the beginning ("once"), $\hat{r}$ is additionally updated once just before the decomposition and final fine-tuning (i.e., just before phase three in Figure 2; "again before decomposition"), and $\hat{r}$ is updated every 30 epochs ("multiple times"). The results are shown in Figure 8(a). Interestingly, BSR performs best when $\hat{r}$ is determined only once in the beginning and never changed until the completion of the compression. This is closely related to the characteristics of mSR. As already mentioned, regularization through the modified stable rank does not rely on any particular instance of weight matrices. In fact, it only needs to know the target rank vector to be effective. Therefore, there is no need for any update during the training, and BSR can smoothly fine-tune the neural network to have the desired ranks. Clearly, the regularization method of mSR has a positive effect on simplifying how $\hat{r}$ should be used.
Scheduled strengthening of $\lambda$:
The loss during compression-friendly training is given by $\mathcal{L} + \lambda \sum_l \text{mSR}(W_l; \hat{r}_l)$. As the training continues, we can expect the weight matrices to become increasingly compliant with $\hat{r}$, thanks to the accumulated effect of the mSR regularization. Then, a weak regularization might not be sufficient as the training continues. A comparison between a fixed $\lambda$ and a scheduled $\lambda$ (according to the explanation in Section 5.1.3) is shown in Figure 8(b). As expected, the scheduled strengthening of $\lambda$ is helpful for improving the compression performance.
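The geometric schedule from Section 5.1.3 can be sketched as a one-liner; `lam0` and `b` are illustrative assumptions, not the paper's actual values:

```python
def lam_schedule(epoch, lam0=1e-4, b=2.0, period=15):
    """Regularization strength lambda = lam0 * b**t, with t incremented
    every `period` epochs (lam0 and b are placeholder values)."""
    return lam0 * (b ** (epoch // period))
```

The strength stays constant within each 15-epoch window and is multiplied by `b` at each window boundary, so the mSR penalty tightens as the weights become more compliant.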
Table 1. Comparison with filter pruning methods under similar FLOPs reduction rates.

Category | Method | Accuracy (base → compressed) | Acc. change | MFLOPs (rate %)
Pruning | Uniform | 92.80 → 89.80 | -3.00 | 62.7 (50%)
Pruning | LEGR [chin2020towards] | 93.90 → 93.70 | -0.20 | 58.9 (47%)
Pruning | ThiNet [luo2017thinet] | 93.80 → 92.98 | -0.82 | 62.7 (50%)
Pruning | CP [he2017channel] | 93.80 → 92.80 | -1.00 | 62.7 (50%)
Pruning | DCP [zhuang2018discrimination] | 93.80 → 93.49 | -0.31 | 62.7 (50%)
Pruning | AMC [he2018amc] | 92.80 → 91.90 | -0.90 | 62.7 (50%)
Pruning | SFP [he2018soft] | 93.59 → 93.35 | -0.24 | 62.7 (50%)
Pruning | Rethink [liu2018rethinking] | 93.80 → 93.07 | -0.73 | 62.7 (50%)
Pruning | PFS [wang2020pruning] | 93.23 → 93.05 | -0.18 | 62.7 (50%)
Pruning | CHIP [sui2021chip] | 93.26 → 92.05 | -1.21 | 34.8 (27%)
Low-rank | CA [alvarez2017compression] | 92.73 → 91.13 | -1.60 | 51.4 (41%)
Low-rank | LC [idelbayev2020low] | 92.73 → 93.10 | +0.37 | 55.7 (44%)
Low-rank | BSR (ours) | 92.73 → 93.53 | +0.80 | 55.7 (44%)
Low-rank | BSR (ours) | 92.73 → 92.51 | -0.22 | 32.1 (26%)
6.2 Low-rank compression as a highly effective tool
Low-rank compression is only one of many ways of compressing a neural network. We first compare our results with the state-of-the-art pruning algorithms. Then, we show that our low-rank compression can be easily combined with quantization just like pruning can be.
6.2.1 Comparison with filter pruning
Both pruning and low-rank compression are techniques that are applied to the weight parameters; therefore, they cannot be used together in general. Besides, filter pruning (structured pruning) is an actively studied topic with many known high-performance algorithms. We compared BSR with other pruning methods including naive uniform channel number shrinkage (Uniform), ThiNet [luo2017thinet], Channel Pruning (CP) [he2017channel], Discrimination-aware Channel Pruning (DCP) [zhuang2018discrimination], Soft Filter Pruning (SFP) [he2018soft], Rethinking the Value of Network Pruning (Rethink) [liu2018rethinking], Automatic Model Compression (AMC) [he2018amc], and channel independence-based pruning (CHIP) [sui2021chip]. Table 1 summarizes the results. We compared the performance drop of each method under similar FLOPs reduction rates; a smaller accuracy drop indicates a better pruning method. When FLOPs are reduced to about 50%, all state-of-the-art pruning algorithms exhibited performance drops, whereas our method (BSR) showed a performance improvement (92.73 → 93.53). We also compared our method to the most recent powerful pruning algorithm, CHIP [sui2021chip]. CHIP can further reduce the MFLOPs to 34.8 with a 1.21 performance drop, while our method shows just a 0.22 performance drop at a similar 32.1 MFLOPs.
Table 2. Combined use of BSR and quantization for ResNet56: test accuracy and memory (MB) at each bit width. Each row corresponds to one BSR compression setting.

32-bit Acc. / MB | 16-bit Acc. / MB | 8-bit Acc. / MB | 4-bit Acc. / MB
0.92 / 3.41 | 0.92 / 1.71 | 0.92 / 0.85 | 0.32 / 0.43
0.94 / 1.98 | 0.94 / 0.99 | 0.94 / 0.49 | 0.32 / 0.25
0.94 / 1.50 | 0.94 / 0.75 | 0.93 / 0.38 | 0.32 / 0.19
0.94 / 1.27 | 0.94 / 0.64 | 0.93 / 0.32 | 0.31 / 0.16
0.93 / 0.79 | 0.93 / 0.40 | 0.92 / 0.20 | 0.33 / 0.10
0.91 / 0.54 | 0.91 / 0.27 | 0.90 / 0.14 | 0.32 / 0.07
0.89 / 0.35 | 0.89 / 0.17 | 0.83 / 0.09 | 0.31 / 0.04
6.2.2 Combined use with quantization
As is well known, low-rank compression can be used together with quantization. The performance of using both together is shown in Table 2. Even without a performance loss, we can observe that the memory usage can be reduced by an additional factor of five compared to BSR alone. In particular, when comparing the case with the same accuracy as the base network (when $\rho = 0.77$), the memory usage was reduced to 1/17 (0.2 MB) of that of the base network (3.41 MB). By using BSR together with quantization, we were also able to reach a range of memory usage that could not be attained by quantization alone. Note that BSR allows any compression ratio while quantization does not. This feature can make using deep learning models on edge devices much easier. Quantization and our low-rank compression are extremely easy to use and can be applied in practically any situation.
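As an illustration of the combination, a simple uniform symmetric quantizer can be applied to the low-rank factors after BSR (this is a generic scheme for illustration, not necessarily the exact quantizer used in the experiments):

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization to 2**(bits-1) - 1 levels per sign
    (a generic scheme, assumed here for illustration)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

# Quantizing both low-rank factors shrinks memory by a further 32/bits
# on top of the (m + n) * r / (m * n) reduction from the factorization.
```

The rounding error per entry is bounded by half the quantization step, which is why 16-bit and 8-bit settings in Table 2 lose little or no accuracy while 4 bits is too coarse for this naive scheme.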
6.3 Limitations and future works
It might be possible to improve BSR in a few different ways. While our modified stable rank works well, it might be possible to identify a rank surrogate with better learning dynamics. While we choose only one $\hat{r}$ and perform only a single compression-friendly training, it might be helpful to choose multiple rank vectors, train them all, and choose the best. This can be an obvious way of improving performance at the cost of extra computation. Pruning and low-rank compression cannot be used at the same time, but it might be helpful to apply them sequentially. In general, combining multiple compression techniques to generate a synergy remains future work.
7 Conclusion
We have introduced a new low-rank compression method called BSR. Its main improvements over previous works are in the rank selection algorithm and the rank-regularized training. The design of the modified beam-search is based on the idea that beam-search is a superior way of balancing search performance and search speed; modifications such as the introduction of the level step size and the compression ratio constraint play important roles. The design of the modified stable rank is based on a careful analysis of how a weight matrix's rank is regularized. To the best of our knowledge, our modified stable rank is the first regularization method that truly controls the rank. As a result, BSR performs very well.
References
Appendix A Derivative of mSR
The gradient of mSR is derived by first decomposing $W$ into two parts, allocating the first $r$ dimensions into $W_{\leq r} = \sum_{i \leq r} \sigma_i u_i v_i^T$ and the remaining dimensions into $W_{> r} = \sum_{i > r} \sigma_i u_i v_i^T$. Then, using $\partial \sigma_i / \partial W = u_i v_i^T$ and writing $N = \sum_{i > r} \sigma_i$ and $D = \sum_{i \leq r} \sigma_i$, the gradient follows from the quotient rule:
$$\frac{\partial\, \text{mSR}(W; r)}{\partial W} = \frac{1}{D} \sum_{i > r} u_i v_i^T - \frac{N}{D^2} \sum_{i \leq r} u_i v_i^T.$$
Appendix B Experimental results for FLOPs
B.1 Number of floating point operations (FLOPs) computation
In the literature, there is no clear consensus on how to compute the total number of floating point operations (FLOPs) in the forward pass of a neural network. While some authors define this number as the total number of multiplications and additions [ye2018rethinking], others assume that multiplications and additions can be fused and count one multiplication and one addition as a single operation [he2016deep]. In this work, we use the second definition of FLOPs.
B.1.1 FLOPs in a fully-connected layer
For a fully-connected layer with weight $W \in \mathbb{R}^{m \times n}$ and bias, the FLOPs equal the number of weight entries, $m \cdot n$, under the fused multiply-add convention.
B.1.2 FLOPs in a convolutional layer
For a convolutional layer with a $c_{in} \times c_{out}$ kernel of spatial size $k \times k$, the linear mapping is applied once per output position, i.e., $H_{out} \cdot W_{out}$ times. Thus, the FLOPs can be calculated as $c_{in} \cdot c_{out} \cdot k^2 \cdot H_{out} \cdot W_{out}$.
We exclude batch-normalization (BN) and concatenation/copy operations. For BN layers, the BN parameters can be fused into the weights and biases of the preceding layer, and therefore no special treatment is required. For concatenations, copy operations, and nonlinearities, we assume zero FLOPs because of their negligible cost.
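The two counting rules above can be written as one-line helpers (function names are ours; one fused multiply-add is counted as a single FLOP, as stated in B.1):

```python
def fc_flops(m, n):
    """Fully-connected layer: one fused multiply-add per weight entry."""
    return m * n

def conv_flops(c_in, c_out, k, h_out, w_out):
    """Convolutional layer: the k x k linear map applied at every
    output spatial position."""
    return c_in * c_out * k * k * h_out * w_out
```

For example, a 512-to-10 classifier head costs 5120 FLOPs, and a 3-to-64 channel 3x3 convolution on a 32x32 output map costs about 1.77 MFLOPs.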
B.2 MNIST
The results for LeNet5 on MNIST are shown in Figure 9. By considering FLOPs instead of the compression ratio, the resulting range of each approach varies. Nonetheless, it is not difficult to confirm that both BSR and LC outperform CA. Between BSR and LC, BSR outperforms LC over the entire BSR evaluation range.
B.3 CIFAR-10 and CIFAR-100
The results for ResNet32/ResNet56 on CIFAR-10 and the results for ResNet56 on CIFAR-100 are shown in Figure 10. For ResNet, BSR performs a low-rank compression over the convolutional layers. While LC directly considered FLOPs when selecting ranks, we selected ranks by considering only the validation accuracy. Nonetheless, we achieve higher test accuracies. BSR can even improve upon the no-compression accuracy when the FLOPs are not too small. For instance, the test accuracy in Figure 10(b) is improved from 0.92 to 0.94 by BSR when the computation is around 140 MFLOPs. LC is also able to improve the accuracy, but the gain is smaller than BSR's. This implies that the regularized training methods can have a positive influence on the learning dynamics. Similarly, in Figure 10(a) and Figure 10(c), the test accuracy is improved over no-compression when the FLOPs are not too small.
B.4 ImageNet
The results for AlexNet on ImageNet are shown in Figure 11. As in the compression ratio vs. test accuracy evaluation, we used the pretrained PyTorch AlexNet network as the base network. ImageNet is a much more complex dataset than MNIST or CIFAR. The performance curves show similar patterns as in Figure 10, confirming that BSR works well for complex datasets.
Appendix C Quantization results of AlexNet on ImageNet
In this section, we combine low-rank compression and quantization for ImageNet. The performance of using both is shown in Table 3. Although there can be a slight performance loss, we can observe a reduction in memory usage by using quantization in addition to BSR. At a fixed compression ratio, we can save half of the memory by reducing the quantization from 32 bits to 16 bits. Similar to the results for ResNet56, we were also able to reach a range of memory usage that could not be attained by quantization alone by using BSR together. Even with more realistic datasets and larger networks, using BSR and quantization together is effective at reducing memory usage. This feature can make it much easier to use deep learning models on edge devices. Quantization and our low-rank compression are extremely easy to use and can be applied in practically any situation.
Table 3. Combined use of BSR and quantization for AlexNet on ImageNet: top-1 accuracy and memory (MB) at each bit width. Each row corresponds to one BSR compression setting.

32-bit Acc. / MB | 16-bit Acc. / MB | 8-bit Acc. / MB | 4-bit Acc. / MB
0.56 / 244.40 | 0.56 / 122.20 | 0.54 / 61.10 | 0.04 / 30.55
0.55 / 108.17 | 0.55 / 54.09 | 0.52 / 27.04 | 0.03 / 13.52
0.55 / 91.46 | 0.54 / 45.72 | 0.52 / 22.86 | 0.03 / 11.43
0.55 / 71.50 | 0.55 / 35.75 | 0.51 / 17.88 | 0.03 / 8.94
0.54 / 60.34 | 0.54 / 30.17 | 0.51 / 15.09 | 0.03 / 7.54
0.54 / 38.38 | 0.52 / 19.19 | 0.48 / 9.59 | 0.03 / 4.80
0.48 / 22.06 | 0.45 / 11.03 | 0.43 / 5.52 | 0.02 / 2.76
Appendix D Additional ablation tests
Additional results are provided for adjusting $s$ and $K$.
D.1 The effect of level step size $s$
We checked the effect of the level step size $s$ for a fixed beam size $K$. The analysis was repeated for a range of settings, and the results are shown in Figure 12. From the results, it can be confirmed that it is desirable to choose an $s$ that is not too large.
D.2 The effect of beam size $K$
We checked the effect of the beam size $K$ for a fixed level step size $s$. As can be seen in Figures 13(a) through 13(d), the test accuracy generally improves as the beam size increases, because more candidates can be searched with a larger $K$. However, in Figure 13(e), where $s$ is very large, it can be observed that the performance deteriorates as $K$ is increased.