Due to the tremendous success of deep learning, neural networks can now be found in applications everywhere. Running in the cloud, on-device, or even on dedicated chips, large deep learning networks now form the foundation for many real-world applications. They are found in voice assistants, medical image analyzers, automatic translation tools, software that enhances photographs, and many other applications.
In these real-world applications, the performance of neural networks is an important topic. Well-performing deep neural networks are large and expensive to execute, restricting their use in, e.g., mobile applications with limited compute. Even for large-scale cloud-based solutions, such as services that process millions of images or translations, neural network efficiency directly impacts compute and power costs.
Alongside quantization (Krishnamoorthi, 2018) and optimized kernels for efficient deep learning execution (Chetlur et al., 2014), neural network compression is an effective way to make the run-time of these models more efficient. By compression, we mean improving the run-time of models, as opposed to reducing the size of the network for storage purposes. In this paper, we describe and compare several methods for compressing large deep-learning architectures for improved run-time.
Even for architectures that were designed to be efficient, such as MobileNetV2 (Sandler et al., 2018) and EfficientNet (Tan and Le, 2019), it is still helpful to apply neural network compression (Liu et al., 2019; He et al., 2017). There has been a debate in the deep-learning literature on the efficacy of compression. Liu et al. (2018) argue that network compression does not help, and that one could instead have trained the resulting smaller architecture from scratch. However, the lottery-ticket hypothesis of Frankle and Carbin (2018) provides arguments that it is better to train a large network and compress it than to train a smaller model from scratch. Our results section provides more evidence for the latter, indicating that it helps to compress networks after training, as opposed to starting with a more efficient architecture.
In this paper, we systematically categorize the many different compression methods that have been published and test all of them on a large-scale image classification task. We group methods by their practical usage into three levels. Level 1: methods that do not use data. Level 2: methods that use data but do not use back-propagation. Level 3: methods that use a training procedure. Within these categories, we look at several different ways of doing neural network compression, including tensor decomposition, channel pruning, and several Bayesian-inspired approaches. Specifically, we look only at structured pruning approaches, where the tensors of the network decrease in size. This is opposed to unstructured pruning methods, such as Han et al. (2015) and Molchanov et al. (2017), that remove individual weights from the network. Unstructured pruning methods require specific hardware to obtain speed-ups, whereas structured pruning methods more directly provide improved speed on most devices.
2 Related work
SVD decomposition was first used by Denil et al. (2013) to demonstrate redundancy in weight parameters in deep neural networks. Following this approach, several works employ low-rank filter approximation (Jaderberg et al., 2014; Denton et al., 2014) to reduce inference time for pre-trained CNN models. One of the first methods for accelerating convolutional layers by applying low-rank approximation to the kernel tensors is Denton et al. (2014). The authors suggest several decomposition approaches applied to parts of the kernel tensor obtained by bi-clustering. The spatial decomposition method from Jaderberg et al. (2014) decomposes a $d \times d$ filter into a $d \times 1$ and a $1 \times d$ filter while exploiting redundancy among multiple channels. Another notable improvement of SVD compression is reducing the error introduced by filter approximation based on input data. The approach suggested by Zhang et al. (2016) uses per-layer minimization of errors in activations for compressed layers.
Tensor decomposition-based methods.
Several approaches for structured CNN compression based on tensor decomposition applied to 4D convolutional kernels were suggested. An overview of tensor decomposition techniques is given in Kolda and Bader (2009). The authors of Lebedev et al. (2014) apply CP-decomposition to compress a kernel of a convolutional filter. The work of Kim et al. (2015) suggests a CNN compression approach based on the Tucker decomposition. The authors also suggest employing analytic solutions for variational Bayesian matrix factorization (VBMF) by Nakajima et al. (2013) for the rank selection. Another tensor decomposition approach that was applied to model compression is the tensor-train decomposition (Oseledets, 2011). It is used in Novikov et al. (2015) for compression of fully-connected layers, and Garipov et al. (2016) applies it for convolutional layers.
Another direction in convolutional layer compression is to increase the dimensionality by reshaping a kernel into a higher-dimensional tensor (Su et al., 2018; Novikov et al., 2015). For example, a convolutional kernel with 64 input and 64 output channels can be represented as a 6-dimensional tensor instead of a 4-dimensional one by splitting each channel dimension into two factors, for instance $8 \times 8$, which corresponds to one way of factorizing 64. Using any other way of factorizing 64 in combination with any of the three tensor decomposition techniques yields a new compression technique. An extensive study of applying CP-decomposition, Tucker decomposition, and tensor-train decomposition in combination with factorizing kernel dimensions was published by Su et al. (2018). They consider compression of both fully-connected and convolutional layers.
One of the ways to reduce inference time for pre-trained models is to prune redundant channels. The work of Li et al. (2016) is focused on using channel norm magnitude as a criterion for pruning. Another approach is to use a lasso feature selection framework for choosing redundant channels while minimizing reconstruction error for the output activation based on input data (He et al., 2017).
Compression ratio selection methods.
As every layer of a neural network has different sensitivity to compression, any SVD or tensor decomposition technique can be further improved by optimizing per-layer compression ratios. The methods of Kim et al. (2019) and Kim and Kyung (2018) suggest efficient search strategies for the corresponding discrete optimization problem. A learning-based strategy based on reinforcement learning is suggested in He et al. (2018).
While compression and pruning methods reduce complexity, most of the methods assume equal importance of every model parameter for the accuracy of the final model. One way to improve compression methods is to estimate the importance of each of the weights, and use this information while pruning. Several methods suggest introducing importance based on the increase of the loss function (Wang et al., 2019; Gao et al., 2018; LeCun et al., 1990; Hassibi et al., 1993). The increase of the loss function is often estimated based on first- or second-order approximations.
Another family of methods from the literature suggests adding a term to a loss function that controls the complexity of the model. Typically, a collection of stochastic gates is included in a network, which determines which weights are to be set to zero. Methods following this approach include Louizos et al. (2018); Neklyudov et al. (2017); Dai et al. (2018), and a recent survey is provided in Gale et al. (2019).
Efficient architecture design.
Several works aim at finding the optimal trade-off between model efficiency and prediction accuracy. MobileNet V1 (Howard et al., 2017) is based on depth-wise separable convolutions, which combine depth-wise and point-wise convolutions to reduce the number of FLOPs. MobileNet V2 (Sandler et al., 2018) is based on the linear bottleneck and inverted residual structure and further improves the efficiency of the model. MnasNet (Tan et al., 2019) is based on a combination of squeeze and excitation blocks. Another efficient architecture (Zhang et al., 2018) leverages group convolution and channel shuffle operations. Some of the more recent architectures (Howard et al., 2019; Wu et al., 2019) are based on combining efficient handcrafted layers with neural architecture search.
2.1 Levels of compression solutions
To facilitate a comparison of the methods proposed in the literature, we refer to practical use cases of model compression. The following levels of compression solutions are introduced in a way similar to Nagel et al. (2019). The definition of each level depends on the amount of training data and computational resources available when using a compression method.
Level 1. Data-free compression. No data or training pipeline is available in this case. Nevertheless, the goal is to produce an efficient model with the predictions as close to the original model as possible.
Level 2. Data-optimized compression. A limited number of batches of the training data are used to guide the compression method with no ground-truth labels being used. In this case, layer-wise optimization of the parameters of the compressed model is used to improve the predictions. No back-propagation is used at this level.
Level 3. Full data compression. This level corresponds to fine-tuning of the compressed model using the full training set or training an efficient model from scratch using the full amount of data. Full back-propagation is used in this case so that the computational complexity is comparable to the complexity of the original model training procedure.
Different from Nagel et al. (2019), in the current work we do not introduce an additional level for methods that change the architecture, as compression is complementary to architecture search methods and yields further performance improvements even when applied to handcrafted or learned efficient architectures (He et al., 2018; Liu et al., 2019).
The compression levels are summarized in figure 1. Using the levels formulation, all the compression methods can be categorized and compared in a similar setting. The practical choice of compression level depends on the specific envisioned use case.
3 Structured compression methods overview
To define a quantitative measure of compression, we use the number of multiply-accumulate operations (MAC units, or MACs) used by a neural network at inference time. Given a network with $L$ layers with $c_l$ operations in layer $l$, the total computational complexity is expressed as:
$$C = \sum_{l=1}^{L} c_l \tag{1}$$
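Assuming that a compression technique reduces the number of operations per layer to $c'_l$, the per-layer compression ratio can be computed as:
$$\alpha_l = \frac{c_l}{c'_l} \tag{2}$$
The whole-model compression rate can be defined in a similar way:
$$\alpha = \frac{\sum_{l=1}^{L} c_l}{\sum_{l=1}^{L} c'_l} = \frac{C}{C'} \tag{3}$$
where $C'$ is the total number of operations in the compressed model.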
In practice, the model’s accuracy has a different sensitivity to the compression of different layers. The problem of selecting an optimal compression ratio for each of the layers given the target whole-model compression ratio is considered in section 3.4.
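As a small illustration of these definitions, the sketch below computes per-layer and whole-model ratios from lists of MAC counts. It assumes the ratio is taken as original over compressed MACs, in line with the 1.33x to 4x figures quoted in the experiments; the function name and the MAC counts are hypothetical.

```python
def compression_ratios(original_macs, compressed_macs):
    """Return per-layer and whole-model ratios as original / compressed MACs."""
    per_layer = [c / c_new for c, c_new in zip(original_macs, compressed_macs)]
    whole_model = sum(original_macs) / sum(compressed_macs)
    return per_layer, whole_model


# Hypothetical three-layer network (MAC counts are made up for illustration).
original_macs = [100, 400, 500]
compressed_macs = [50, 100, 350]
per_layer, whole_model = compression_ratios(original_macs, compressed_macs)
print(per_layer, whole_model)
```

Note that the whole-model ratio is weighted by the MAC count of each layer, so it is in general not the average of the per-layer ratios.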
3.1 Level 1. Data-free compression methods
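A convolutional layer is specified by the kernel $\mathcal{K} \in \mathbb{R}^{T \times S \times d \times d}$, where $S$ is the number of input channels, $T$ is the number of output channels and the spatial size of the filter is $d \times d$. The kernel is assumed to be square of odd size $d$ for simplicity, and $\delta = (d - 1)/2$ denotes its half-width. A convolution is a linear transformation of the feature map $X \in \mathbb{R}^{S \times H \times W}$ into an output tensor $Y \in \mathbb{R}^{T \times H \times W}$. We assume the spatial dimensions of the input and output feature maps are equal in order to avoid notational clutter. The convolution is defined as follows:
$$Y(t, h, w) = \sum_{s=1}^{S} \sum_{i=-\delta}^{\delta} \sum_{j=-\delta}^{\delta} \mathcal{K}(t, s, i, j)\, X(s, h + i, w + j) \tag{4}$$
We omit the bias term for notational simplicity. The number of MACs in a convolutional layer is $d^2 S T H W$.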
3.1.1 SVD methods
To leverage low-rank matrix approximation for the compression of a convolutional layer, the kernel tensor is transformed into a matrix. In this case, the dimensions of a tensor are referred to as modes. There are seven types of possible matricizations of a 4-dimensional tensor. Two of these are used in the compression methods that are introduced in the following paragraphs.
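Weight SVD. This method is based on reshaping the kernel tensor $\mathcal{K}$ into a matrix $K \in \mathbb{R}^{T \times d^2 S}$ followed by a low-rank approximation, where $S$ and $T$ are the numbers of input and output channels and $d \times d$ is the filter size. This type of matricization corresponds to merging three of the four original modes into a single supermode.
The approximate kernel of rank $R$ is expressed as follows:
$$\mathcal{K}(t, s, i, j) \approx \sum_{r=1}^{R} K^{(2)}(t, r)\, \mathcal{K}^{(1)}(r, s, i, j) \tag{5}$$
The schematic diagram of the summation is given in figure 2(b). The factors can be obtained from the truncated SVD of $K$, $K \approx U_R \Sigma_R V_R^\top$, by assigning $U_R$ to $K^{(2)}$ and $V_R^\top$, reshaped to $R \times S \times d \times d$, to $\mathcal{K}^{(1)}$, with the singular values $\Sigma_R$ absorbed into the factors. The first factor $\mathcal{K}^{(1)}$ corresponds to a convolution with filter size $d \times d$, $S$ input channels and $R$ output channels, whereas the second factor $K^{(2)}$ corresponds to a $1 \times 1$ convolution with $R$ input channels and $T$ output channels. The total number of MACs in the decomposed layer equals $(d^2 S R + R T) H W$, where $H \times W$ is the spatial size of the feature map. The compression ratio of the decomposed layer is fully determined by the rank $R$.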
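The weight SVD factorization can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the experiments: the kernel layout `(T, S, d, d)`, the function name `weight_svd`, and the choice to split the singular values evenly between the two factors are all assumptions.

```python
import numpy as np


def weight_svd(kernel, rank):
    """Factor a (T, S, d, d) kernel into a d x d conv (rank, S, d, d) and a 1x1 conv (T, rank)."""
    T, S, d, _ = kernel.shape
    mat = kernel.reshape(T, S * d * d)            # merge the (s, i, j) modes into one supermode
    U, sigma, Vt = np.linalg.svd(mat, full_matrices=False)
    sqrt_s = np.sqrt(sigma[:rank])                # assumption: split singular values evenly
    second = U[:, :rank] * sqrt_s                 # weights of the 1x1 convolution
    first = (sqrt_s[:, None] * Vt[:rank]).reshape(rank, S, d, d)
    return first, second


rng = np.random.default_rng(0)
kernel = rng.standard_normal((8, 4, 3, 3))        # T=8, S=4, d=3
first, second = weight_svd(kernel, rank=8)        # rank 8 is the full rank of the 8 x 36 matrix
approx = np.einsum("tr,rsij->tsij", second, first)
print(np.allclose(approx, kernel))
```

At full rank the product of the two factors reproduces the kernel; lowering `rank` trades approximation error for fewer MACs.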
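Spatial SVD. This method is based on reshaping the kernel $\mathcal{K}$ to a matrix $K \in \mathbb{R}^{S d \times d T}$, where $S$ and $T$ are the numbers of input and output channels and $d$ is the filter size. The corresponding low-rank approximation of rank $R$ can be expressed as (see figure 2(c)):
$$\mathcal{K}(t, s, i, j) \approx \sum_{r=1}^{R} \mathcal{K}^{(v)}(r, s, i)\, \mathcal{K}^{(h)}(t, r, j) \tag{6}$$
The factor $\mathcal{K}^{(v)}$ corresponds to a convolution with a vertical filter of size $d \times 1$ and the factor $\mathcal{K}^{(h)}$ corresponds to a horizontal convolution with a filter of size $1 \times d$. The total number of MACs is $d R (S + T) H W$, where $H \times W$ is the spatial size of the feature map. The trade-off between the computational complexity and approximation error is defined by the rank $R$.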
The decomposition was introduced in Jaderberg et al. (2014). In the original paper, an iterative optimization algorithm based on conjugate gradient descent was used to calculate the factorization. In a subsequent work of Tai et al. (2015), the iterative scheme was replaced by a closed-form solution based on SVD decomposition.
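The closed-form SVD variant of the spatial decomposition can be sketched as follows. This is an illustrative sketch under assumptions: the kernel layout `(T, S, d, d)`, the function name `spatial_svd`, and the even split of singular values between the factors are not taken from the cited works.

```python
import numpy as np


def spatial_svd(kernel, rank):
    """Factor a (T, S, d, d) kernel into a vertical (rank, S, d) and a horizontal (T, rank, d) filter."""
    T, S, d, _ = kernel.shape
    # Rows are indexed by (s, i), columns by (j, t).
    mat = kernel.transpose(1, 2, 3, 0).reshape(S * d, d * T)
    U, sigma, Vt = np.linalg.svd(mat, full_matrices=False)
    sqrt_s = np.sqrt(sigma[:rank])                # assumption: split singular values evenly
    vertical = (U[:, :rank] * sqrt_s).reshape(S, d, rank).transpose(2, 0, 1)
    horizontal = (sqrt_s[:, None] * Vt[:rank]).reshape(rank, d, T).transpose(2, 0, 1)
    return vertical, horizontal


rng = np.random.default_rng(0)
kernel = rng.standard_normal((8, 4, 3, 3))        # T=8, S=4, d=3
vertical, horizontal = spatial_svd(kernel, rank=12)   # 12 = min(S*d, d*T) is the full rank
approx = np.einsum("rsi,trj->tsij", vertical, horizontal)
print(np.allclose(approx, kernel))
```

The two factors correspond to a $d \times 1$ convolution followed by a $1 \times d$ convolution, so the decomposed layer never applies a full $d \times d$ filter.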
3.1.2 Tensor decompositions
In addition to matricization of the convolutional kernel, several compression techniques based on tensor decompositions were suggested in Lebedev et al. (2014), Kim et al. (2015), and Su et al. (2018), where the kernel is directly treated as a 4-dimensional tensor and decomposed. In this case, different orderings of the kernel dimensions yield different factorizations.
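CP-decomposition. For a kernel $\mathcal{K} \in \mathbb{R}^{T \times S \times d \times d}$, the CP-decomposition of rank $R$ is defined as follows (Kolda and Bader, 2009):
$$\mathcal{K}(t, s, i, j) \approx \sum_{r=1}^{R} K^{(1)}(s, r)\, K^{(2)}(i, r)\, K^{(3)}(j, r)\, K^{(4)}(t, r) \tag{7}$$
With $\delta = (d - 1)/2$ the half-width of the filter, the convolution can then be computed in four steps:
$$X^{(1)}(r, h, w) = \sum_{s=1}^{S} K^{(1)}(s, r)\, X(s, h, w) \tag{8}$$
$$X^{(2)}(r, h, w) = \sum_{i=-\delta}^{\delta} K^{(2)}(i, r)\, X^{(1)}(r, h + i, w) \tag{9}$$
$$X^{(3)}(r, h, w) = \sum_{j=-\delta}^{\delta} K^{(3)}(j, r)\, X^{(2)}(r, h, w + j) \tag{10}$$
$$Y(t, h, w) = \sum_{r=1}^{R} K^{(4)}(t, r)\, X^{(3)}(r, h, w) \tag{11}$$
where $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$ are tensors which are intermediate results after each step. The diagram of the decomposition and the indices used in the summations is given in figure 3(a). Computing $X^{(1)}$ and $Y$ corresponds to convolutions with filter size $1 \times 1$. The steps of computing $X^{(2)}$ and $X^{(3)}$ in turn correspond to convolutions with vertical and horizontal filters, respectively. The total number of MACs in the decomposed layer is $R (S + 2d + T) H W$. Thus, it depends solely on the value of the rank $R$, which controls the approximation error and computational complexity.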
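Tucker decomposition. For a kernel $\mathcal{K} \in \mathbb{R}^{T \times S \times d \times d}$, a partial Tucker decomposition (Kolda and Bader, 2009) is defined as:
$$\mathcal{K}(t, s, i, j) \approx \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \mathcal{C}(r_2, r_1, i, j)\, U^{(1)}(s, r_1)\, U^{(2)}(t, r_2) \tag{12}$$
where $\mathcal{C}$ is a core tensor of size $R_2 \times R_1 \times d \times d$ and $U^{(1)} \in \mathbb{R}^{S \times R_1}$ and $U^{(2)} \in \mathbb{R}^{T \times R_2}$ are the factor matrices. Computation of the convolution can be decomposed into the following three steps (Kim et al., 2015):
$$X^{(1)}(r_1, h, w) = \sum_{s=1}^{S} U^{(1)}(s, r_1)\, X(s, h, w) \tag{13}$$
$$X^{(2)}(r_2, h, w) = \sum_{r_1=1}^{R_1} \sum_{i=-\delta}^{\delta} \sum_{j=-\delta}^{\delta} \mathcal{C}(r_2, r_1, i, j)\, X^{(1)}(r_1, h + i, w + j) \tag{14}$$
$$Y(t, h, w) = \sum_{r_2=1}^{R_2} U^{(2)}(t, r_2)\, X^{(2)}(r_2, h, w) \tag{15}$$
where the steps in Eqn. 13 and Eqn. 15 correspond to convolutions with filter size $1 \times 1$, and the step in Eqn. 14 corresponds to a convolution with the original filter size $d \times d$ with $R_1$ input channels and $R_2$ output channels. The total number of MACs is $(S R_1 + d^2 R_1 R_2 + R_2 T) H W$ and is defined by two ranks $R_1$ and $R_2$, so that there is one degree of freedom when selecting the ranks given a predefined compression ratio. The original work by Kim et al. (2015) suggests using variational Bayesian matrix factorization (Nakajima et al., 2013) for rank selection.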
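One standard way to compute a partial Tucker decomposition over the two channel modes is a truncated higher-order SVD, sketched below. This is an assumption made for illustration; the cited work may compute the factors differently, and the function name `tucker2_hosvd` and the kernel layout `(T, S, d, d)` are hypothetical.

```python
import numpy as np


def tucker2_hosvd(kernel, rank_in, rank_out):
    """Partial Tucker decomposition of a (T, S, d, d) kernel over the channel modes only."""
    T, S, d, _ = kernel.shape
    # Leading left singular vectors of the output-channel and input-channel unfoldings.
    U_out, _, _ = np.linalg.svd(kernel.reshape(T, -1), full_matrices=False)
    U_in, _, _ = np.linalg.svd(kernel.transpose(1, 0, 2, 3).reshape(S, -1), full_matrices=False)
    U_out, U_in = U_out[:, :rank_out], U_in[:, :rank_in]
    # Core tensor: project the kernel onto both factor matrices.
    core = np.einsum("tsij,tq,sp->qpij", kernel, U_out, U_in)
    return core, U_in, U_out


rng = np.random.default_rng(0)
kernel = rng.standard_normal((8, 4, 3, 3))        # T=8, S=4, d=3
core, U_in, U_out = tucker2_hosvd(kernel, rank_in=4, rank_out=8)   # full ranks
approx = np.einsum("qpij,tq,sp->tsij", core, U_out, U_in)
print(np.allclose(approx, kernel))
```

Here `U_in` and `U_out` play the role of the two $1 \times 1$ convolutions and `core` is the $d \times d$ convolution between them.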
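Tensor-train decomposition. After reordering the modes as $(s, i, j, t)$, a tensor-train decomposition for the kernel is defined as the following sequence of matrix products (Oseledets, 2011):
$$\mathcal{K}(s, i, j, t) \approx G^{(1)}[s]\, G^{(2)}[i]\, G^{(3)}[j]\, G^{(4)}[t] \tag{16}$$
where $G^{(1)}[s] \in \mathbb{R}^{1 \times R_1}$, $G^{(2)}[i] \in \mathbb{R}^{R_1 \times R_2}$, $G^{(3)}[j] \in \mathbb{R}^{R_2 \times R_3}$, and $G^{(4)}[t] \in \mathbb{R}^{R_3 \times 1}$. With $\delta = (d - 1)/2$ the half-width of the filter, the convolution can be computed in four steps:
$$X^{(1)}(r_1, h, w) = \sum_{s=1}^{S} G^{(1)}[s](r_1)\, X(s, h, w) \tag{17}$$
$$X^{(2)}(r_2, h, w) = \sum_{r_1=1}^{R_1} \sum_{i=-\delta}^{\delta} G^{(2)}[i](r_1, r_2)\, X^{(1)}(r_1, h + i, w) \tag{18}$$
$$X^{(3)}(r_3, h, w) = \sum_{r_2=1}^{R_2} \sum_{j=-\delta}^{\delta} G^{(3)}[j](r_2, r_3)\, X^{(2)}(r_2, h, w + j) \tag{19}$$
$$Y(t, h, w) = \sum_{r_3=1}^{R_3} G^{(4)}[t](r_3)\, X^{(3)}(r_3, h, w) \tag{20}$$
The steps in Eqn. 17 and Eqn. 20 correspond to $1 \times 1$ convolutions and the steps in Eqn. 18 and Eqn. 19 correspond to convolutions with vertical and horizontal filters, respectively. The total number of MACs is $(S R_1 + d R_1 R_2 + d R_2 R_3 + R_3 T) H W$. The decomposition has three ranks $R_1$, $R_2$, and $R_3$ that determine the approximation error and the computational complexity of the compressed layer.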
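|Method|Num. parameters|Comp. complexity|
|---|---|---|
|(a) Weight SVD|$R (d^2 S + T)$|$R (d^2 S + T) H W$|
|(b) Spatial SVD|$d R (S + T)$|$d R (S + T) H W$|
|(c) CP-decomposition|$R (S + 2d + T)$|$R (S + 2d + T) H W$|
|(d) Tucker decomposition|$S R_1 + d^2 R_1 R_2 + R_2 T$|$(S R_1 + d^2 R_1 R_2 + R_2 T) H W$|
|(e) Tensor-train decomposition|$S R_1 + d R_1 R_2 + d R_2 R_3 + R_3 T$|$(S R_1 + d R_1 R_2 + d R_2 R_3 + R_3 T) H W$|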
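The MAC counts of the five level 1 decompositions can be compared directly with a few helper functions. The sketch below follows the layer structure described in this section ($S$ input channels, $T$ output channels, filter size $d$, feature map size $H \times W$); the function names and the example sizes are hypothetical.

```python
def conv_macs(S, T, d, H, W):
    return d * d * S * T * H * W


def weight_svd_macs(S, T, d, H, W, R):
    return (d * d * S * R + R * T) * H * W


def spatial_svd_macs(S, T, d, H, W, R):
    return d * R * (S + T) * H * W


def cp_macs(S, T, d, H, W, R):
    return R * (S + 2 * d + T) * H * W


def tucker_macs(S, T, d, H, W, R1, R2):
    return (S * R1 + d * d * R1 * R2 + R2 * T) * H * W


def tensor_train_macs(S, T, d, H, W, R1, R2, R3):
    return (S * R1 + d * R1 * R2 + d * R2 * R3 + R3 * T) * H * W


# Hypothetical layer: 64 -> 64 channels, 3x3 filter, 56x56 feature map, all ranks 16.
layer = dict(S=64, T=64, d=3, H=56, W=56)
baseline = conv_macs(**layer)
for name, macs in [
    ("weight SVD", weight_svd_macs(**layer, R=16)),
    ("spatial SVD", spatial_svd_macs(**layer, R=16)),
    ("CP", cp_macs(**layer, R=16)),
    ("Tucker", tucker_macs(**layer, R1=16, R2=16)),
    ("tensor-train", tensor_train_macs(**layer, R1=16, R2=16, R3=16)),
]:
    print(f"{name}: {baseline / macs:.2f}x fewer MACs")
```

Equal ranks do not mean equal approximation quality across methods, so this comparison only illustrates how the cost of each architecture scales with its ranks.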
3.2 Level 2. Data driven compression methods
3.2.1 Per-layer data-optimized SVD methods
All level 1 compression methods minimize the kernel approximation error. This does not use any information about the actual data that is processed by the layer of the network. One way to improve level 1 methods is to formulate a method that minimizes the error in the activations produced by the compressed layer. A method based on minimizing the error of the output for the specific data allows one to significantly decrease the loss in accuracy after compression. This section describes multiple approaches for per-layer data-optimized SVD compression.
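Data SVD. Given a kernel tensor reshaped into a matrix $\mathbf{W}$ of shape $T \times d^2 S$ and an input vector $x \in \mathbb{R}^{d^2 S}$, the response $y \in \mathbb{R}^{T}$ is given by:
$$y = \mathbf{W} x \tag{21}$$
Given the output data, the optimal projection matrix $M$ is given as a solution of the following optimization problem:
$$\min_{M} \sum_{i=1}^{n} \big\| (y_i - \bar{y}) - M (y_i - \bar{y}) \big\|_2^2 \quad \text{s.t.} \quad \operatorname{rank}(M) \le R \tag{22}$$
where $y_i$ are outputs sampled from the training set, $\bar{y}$ is the sample mean, and $n$ is the number of samples. The solution is given by principal component analysis (PCA) as follows (Golub and Van Loan, 1996). Let $Y \in \mathbb{R}^{T \times n}$ be a matrix which concatenates the entries of $y_i - \bar{y}$. Given the eigendecomposition of the covariance matrix $Y Y^\top = U \Lambda U^\top$, the values of $M$ are given by:
$$M = U_R U_R^\top \tag{23}$$
where the columns of $U_R$ are the first $R$ eigenvectors. This solution for $M$ can be used to approximate the original layer. Under the low rank assumption for vector $y$, the output can be expressed as:
$$y \approx M \mathbf{W} x + b \tag{24}$$
where $x$ is the input vector and $b = \bar{y} - M \bar{y}$ is the bias. Using Eqn. 23, the original kernel can be approximated as $\mathbf{W} \approx \mathbf{W}^{(2)} \mathbf{W}^{(1)}$, where $\mathbf{W}^{(1)} = U_R^\top \mathbf{W}$, and $\mathbf{W}^{(2)} = U_R$. This method corresponds to a data-optimized version of the weight SVD decomposition (Eqn. 5).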
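The data SVD procedure can be sketched with NumPy as follows. This is a minimal illustration assuming the layer is already written as a matrix product and the sampled responses are stored as columns; the function name `data_svd` and the synthetic data are hypothetical.

```python
import numpy as np


def data_svd(weight, outputs, rank):
    """weight: (T, D) layer matrix; outputs: (T, n) sampled responses y_i = W x_i as columns."""
    mean = outputs.mean(axis=1, keepdims=True)
    centered = outputs - mean
    # Eigenvectors of the covariance Y Y^T are the left singular vectors of Y.
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    U_r = U[:, :rank]
    first = U_r.T @ weight                           # (rank, D): first, narrower layer
    second = U_r                                     # (T, rank): second layer
    bias = (mean - U_r @ (U_r.T @ mean)).ravel()     # b = y_mean - M y_mean
    return first, second, bias


rng = np.random.default_rng(0)
weight = rng.standard_normal((6, 10))
# Inputs confined to a 3-dimensional affine subspace, so the centered responses have rank <= 3.
inputs = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 200)) + rng.standard_normal((10, 1))
outputs = weight @ inputs
first, second, bias = data_svd(weight, outputs, rank=3)
approx = second @ (first @ inputs) + bias[:, None]
print(np.allclose(approx, outputs))
```

In this synthetic case the responses are exactly low-rank, so the rank-3 layer reproduces them even though the weight matrix itself has full rank, which is the situation the data-optimized formulation is designed to exploit.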
Asymmetric data SVD. One of the main issues in neural network compression is the accumulation of error when compressing a deep model. Since every layer is compressed subsequently, compressed layers could take into account the error introduced by previous layers in their decomposition for better performance. An asymmetric formulation was introduced to do this in Zhang et al. (2016). As opposed to optimizing the reconstruction error for the approximated layer based on the original input data, the asymmetric formulation is based on the input data from the previous approximated layer. This approach allows one to significantly reduce the full-model accuracy drop in level 2 settings using a limited amount of input data at the cost of solving a more general optimization problem.
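Given the output of the previous compressed layer $\hat{x}$, the activations are given by:
$$\hat{y} = \mathbf{W} \hat{x} \tag{25}$$
where $\mathbf{W}$ is the matrix of the original, uncompressed layer. In order to minimize the error introduced by compression, the following optimization problem is solved:
$$\min_{M} \sum_{i=1}^{n} \big\| (y_i - \bar{y}) - M (\hat{y}_i - \bar{\hat{y}}) \big\|_2^2 \quad \text{s.t.} \quad \operatorname{rank}(M) \le R \tag{26}$$
The problem is based on minimizing the same error as in Eqn. 22, but it depends on both the original layer outputs $y_i$ and the compressed layer outputs $\hat{y}_i$, with sample means $\bar{y}$ and $\bar{\hat{y}}$. After combining the mean-subtracted responses $y_i$ and $\hat{y}_i$ into matrices $Y$ and $\hat{Y}$, the minimization can be written as:
$$\min_{M} \big\| Y - M \hat{Y} \big\|_F^2 \quad \text{s.t.} \quad \operatorname{rank}(M) \le R \tag{27}$$
The problem has a closed form solution for $M$ based on generalized SVD (Takane and Jung, 2006). The new bias for the compressed layer can be computed as $b = \bar{y} - M \bar{\hat{y}}$.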
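The rank-constrained least-squares problem above can be solved in closed form by reduced-rank regression, sketched below. We assume here that this solve yields the same minimizer as the generalized SVD route cited in the text; the function name `asymmetric_data_svd` and the synthetic data are hypothetical.

```python
import numpy as np


def asymmetric_data_svd(Y, Y_hat, rank):
    """Y: (T, n) reference outputs; Y_hat: (T, n) outputs computed from compressed inputs."""
    y_mean = Y.mean(axis=1, keepdims=True)
    yh_mean = Y_hat.mean(axis=1, keepdims=True)
    Yc, Yhc = Y - y_mean, Y_hat - yh_mean
    # Step 1: unconstrained least-squares map from Y_hat to Y.
    M_ls = Yc @ np.linalg.pinv(Yhc)
    # Step 2: project the fitted values onto their top-`rank` left singular vectors.
    U, _, _ = np.linalg.svd(M_ls @ Yhc, full_matrices=False)
    U_r = U[:, :rank]
    M = U_r @ U_r.T @ M_ls
    bias = (y_mean - M @ yh_mean).ravel()            # b = y_mean - M y_hat_mean
    return M, bias


rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 300))
Y_hat = Y + 0.3 * rng.standard_normal((5, 300))      # stand-in for error from earlier layers
M, bias = asymmetric_data_svd(Y, Y_hat, rank=2)
print(np.linalg.matrix_rank(M))
```

Because the symmetric PCA projection of the previous paragraph is also a feasible rank-$R$ matrix, the asymmetric solution can never have a larger reconstruction error on the sampled data.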
The reconstruction error of the asymmetric data SVD can be further improved by incorporating the activation function $f$ into the formulation in Eqn. 27:
$$\min_{M, b} \big\| f(Y') - f(M \hat{Y}' + b \mathbf{1}^\top) \big\|_F^2 \quad \text{s.t.} \quad \operatorname{rank}(M) \le R \tag{28}$$
where $Y'$ and $\hat{Y}'$ are matrices concatenating the entries of $y_i$ and $\hat{y}_i$ with no mean subtracted, and $b$ is a new bias. This problem is solved using the following relaxation:
$$\min_{M, b, Z} \big\| f(Y') - f(Z) \big\|_F^2 + \lambda \big\| Z - (M \hat{Y}' + b \mathbf{1}^\top) \big\|_F^2 \quad \text{s.t.} \quad \operatorname{rank}(M) \le R \tag{29}$$
where $Z$ is an auxiliary variable, and $\lambda$ is a penalty parameter. The second term of the objective is equivalent to Eqn. 27. The first term can be minimized using SGD for any activation function, or in the case of the ReLU function, it can be solved analytically (Zhang et al., 2016). Minimization of the objective in Eqn. 29 is performed by using alternating minimization. The first sub-problem corresponds to fixing $Z$ and solving for $M$ and $b$, and vice versa for the second sub-problem. Increasing values of the parameter $\lambda$ are used through the iterations of the method.
Asym3D. The authors of Zhang et al. (2016) further propose to use the formulation in Eqn. 27 to perform a double decomposition based on the spatial and data SVD methods. Given two spatial SVD layers $\mathcal{K}^{(v)}$ and $\mathcal{K}^{(h)}$, the formulation in Eqn. 27 can be applied in order to perform a further decomposition of the second layer $\mathcal{K}^{(h)}$. The trade-off between accuracy and the computational complexity in this case is determined by two ranks: $R$ is the rank of the original spatial SVD decomposition and $R'$ is the rank of the data-optimized decomposition applied to the factor $\mathcal{K}^{(h)}$. The final decomposed architecture consists of a $d \times 1$ filter with $R$ output channels followed by a $1 \times d$ filter with $R'$ output channels and a $1 \times 1$ convolutional layer with $T$ output channels.
Data optimized spatial SVD. In addition to the Asym3D method, the framework for per-layer optimization (Eqn. 26) can be used to obtain a data-optimized version of the spatial SVD method. If we consider the optimization problem in Eqn. 26 without the constraint on the rank:
$$\min_{M} \sum_{i=1}^{n} \big\| (y_i - \bar{y}) - M (\hat{y}_i - \bar{\hat{y}}) \big\|_2^2 \tag{30}$$
the solution for $M$ can be used to improve the predictions by refining weights of a compressed network layer based on some input and output data. Consider a convolutional layer decomposed using the spatial SVD decomposition (Eqn. 6). Given the original weights $\mathbf{W}$, written as a linear operator, the layer can be decomposed into two layers:
$$\mathbf{W} \approx \mathbf{W}^{(h)} \mathbf{W}^{(v)} \tag{31}$$
Given an input vector $x$, the output is given by:
$$\hat{y} = \mathbf{W}^{(h)} \mathbf{W}^{(v)} x \tag{32}$$
After solving Eqn. 30 for $\hat{y}$ above and the reference output $y$, the data-optimized version of the weights is given as:
$$\tilde{\mathbf{W}}^{(h)} = M \mathbf{W}^{(h)} \tag{33}$$
In practice, the refined value $\tilde{\mathbf{W}}^{(h)}$ can be used instead of $\mathbf{W}^{(h)}$ for the second layer.
3.2.2 Channel pruning
Some compression methods introduced in the literature are based on pruning channels of a convolutional filter based on different channel importance criteria. In particular, the method suggested in Li et al. (2016) is based on the weight magnitudes. Another pruning method which is optimized for data was introduced in He et al. (2017). This method uses lasso feature selection to find the set of channels to prune.
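In order to formulate the pruning method as an optimization, the authors consider computing the output of a convolutional layer with a kernel $\mathcal{K}$ of size $T \times S \times d \times d$ on $N$ input volumes of size $S \times d \times d$ sampled from the feature map of the uncompressed model, where $N$ is the number of samples. The corresponding output volume $Y$ is a matrix of shape $N \times T$. The original number of channels $S$ is reduced to $S'$ ($0 \le S' \le S$) in a way that the reconstruction error for the output volume is minimized. The objective function is:
$$\min_{\beta, \mathcal{K}} \frac{1}{2N} \Big\| Y - \sum_{i=1}^{S} \beta_i X_i K_i^\top \Big\|_F^2 \quad \text{s.t.} \quad \|\beta\|_0 \le S'$$
where $X_i \in \mathbb{R}^{N \times d^2}$ is the $i$-th channel of the input concatenated for multiple data samples, and $K_i \in \mathbb{R}^{T \times d^2}$ is the $i$-th channel of the filter; both are reshaped into matrices. The vector $\beta \in \mathbb{R}^{S}$ is the coefficient vector for channel selection. If the value $\beta_i = 0$, then the corresponding channel can be pruned. In order to solve the problem, the $\ell_0$ norm is relaxed to $\ell_1$ and the minimization is formulated as follows, with penalty coefficient $\lambda$:
$$\min_{\beta, \mathcal{K}} \frac{1}{2N} \Big\| Y - \sum_{i=1}^{S} \beta_i X_i K_i^\top \Big\|_F^2 + \lambda \|\beta\|_1$$
The minimization is performed in two steps by fixing $\beta$ or $\mathcal{K}$ and solving the corresponding sub-problems.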
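The channel-selection sub-problem (filters fixed, solve for the coefficient vector) is an ordinary lasso regression and can be sketched with a proximal gradient (ISTA) loop. This is an illustrative solver choice, not necessarily the one used by the cited authors; the function name, array layouts, penalty value and iteration count are assumptions, and the alternating filter-update step is omitted.

```python
import numpy as np


def lasso_channel_selection(X, K, Y, lam, n_iter=500):
    """X: (N, S, k) per-channel inputs; K: (T, S, k) per-channel filters; Y: (N, T) outputs."""
    N, S, _ = X.shape
    # Column i of Z is the flattened contribution X_i K_i^T of channel i to the output.
    Z = np.einsum("nsk,tsk->nts", X, K).reshape(-1, S)
    y = Y.reshape(-1)
    beta = np.ones(S)
    step = N / np.linalg.norm(Z, 2) ** 2             # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = Z.T @ (Z @ beta - y) / N              # gradient of (1/2N) ||y - Z beta||^2
        beta = beta - step * grad
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)   # soft-threshold
    return beta


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4, 9))                 # N=200 samples, S=4 channels, 3x3 patches
K = rng.standard_normal((6, 4, 9))                   # T=6 output channels
K[:, [1, 3], :] *= 1e-3                              # channels 1 and 3 barely contribute
Y = np.einsum("nsk,tsk->nt", X, K)
beta = lasso_channel_selection(X, K, Y, lam=0.5)
print(beta)
```

Channels whose coefficient is driven to zero are candidates for pruning; in this synthetic example those are the two channels with near-zero filters.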
3.3 Level 3. Compression based on training
Some compression methods require full training of the model, either by fine-tuning an already trained model for a few training epochs or by training the model entirely from scratch. All of the procedures in the previous paragraphs can be extended this way into an iterative compression and fine-tuning scheme. Here we focus on probabilistic compression methods that need fine-tuning or training from scratch.
3.3.1 Probabilistic compression
Several methods have been proposed in the literature that add a, potentially probabilistic, multiplicative factor to each channel in the convolutional network, such that for a single layer with input $x$, weight matrix $W$ and output $y$ we have:
$$y = z \odot (W x)$$
with $z$ of the same dimensionality as the output $y$, and $\phi$ one or more learnable parameters that control the gate. The idea is that when $z_i$ equals 0, the output channel $i$ is off and can be removed from the network. The factor $z$ can also be interpreted as a gate that is on or off. Similarly, in the probabilistic setting, if the gate is sampled close to 0 with a high likelihood or has a very high variance, the channel can be removed. This multiplicative factor is regularized by a penalty term in the loss function, such that during training the network optimizes for the trade-off between the loss function and the model complexity as follows:
$$\mathcal{L} = \mathcal{L}_{task} + \lambda\, \mathcal{L}_{c}(\phi)$$
where $\mathcal{L}_{task}$ is the original loss function, $\mathcal{L}_{c}$ a differentiable function of the complexity of the network, parametrized by (learnable) parameters $\phi$ that control the gates, and $\lambda$ a trade-off factor between the two loss functions. In all methods, $\lambda$ is a hyperparameter that is set by the user.
The technique from Louizos et al. (2018) applies the $L_0$-norm to channels in the neural network. The $L_0$-norm is defined as $\|z\|_0 = \sum_i \mathbb{1}[z_i \neq 0]$, the number of non-zero entries in a vector $z$. Generally, this norm cannot be optimized directly, but the paper extends the continuous relaxation trick from Maddison et al. (2016) and Jang et al. (2017) to optimize the gates. Louizos et al. (2018) introduces the hard-concrete distribution for the gate, which is a clipped version of the concrete distribution:
$$u \sim \mathcal{U}(0, 1), \quad s = \operatorname{sigmoid}\big((\log u - \log(1 - u) + \log \alpha) / \beta\big), \quad \bar{s} = s (\zeta - \gamma) + \gamma, \quad z = \min(1, \max(0, \bar{s}))$$
Each channel in a convolutional network is multiplied with one gate $z$, which is parametrized by parameter $\alpha$. In the forward pass, a sample is drawn from the hard-concrete distribution for each gate, creating a stochastic optimization procedure. $\beta$ is the temperature parameter, set as $2/3$ in the paper, which controls the skew of the sigmoid. Parameters $\gamma$ and $\zeta$ are stretching factors for clipping some values to actual 0s and 1s, which are set to $-0.1$ and $1.1$, respectively. The method penalizes the probability that each gate is sampled as non-zero. Channels corresponding to gates that have a low probability of being active can be removed from the network. This corresponds to a small parameter $\alpha$. The regularization factor chosen here is:
$$\mathcal{L}_{c} = \sum_{g=1}^{G} \operatorname{sigmoid}\Big(\log \alpha_g - \beta \log \frac{-\gamma}{\zeta}\Big)$$
where $G$ is the total number of gates in the network.
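Sampling hard-concrete gates and evaluating the corresponding penalty can be sketched as follows. The constants follow the values reported by Louizos et al. (2018) as we recall them (temperature $2/3$, stretch interval $[-0.1, 1.1]$); the function names are hypothetical and the sketch uses NumPy without gradients, so it only illustrates the forward computation.

```python
import numpy as np

BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1   # temperature and stretch limits (assumed values)


def sample_hard_concrete(log_alpha, rng):
    """Draw one gate per entry of log_alpha; gates lie in [0, 1] with point masses at 0 and 1."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / BETA))
    s_bar = s * (ZETA - GAMMA) + GAMMA               # stretch to (GAMMA, ZETA)
    return np.clip(s_bar, 0.0, 1.0)                  # clip to get exact zeros and ones


def prob_active(log_alpha):
    """Probability that a gate is sampled as non-zero."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - BETA * np.log(-GAMMA / ZETA))))


def l0_penalty(log_alpha):
    """Expected number of active gates, used as the complexity term."""
    return prob_active(log_alpha).sum()


rng = np.random.default_rng(0)
gates_on = sample_hard_concrete(np.full(1000, 10.0), rng)
gates_off = sample_hard_concrete(np.full(1000, -10.0), rng)
print(gates_on.mean(), gates_off.mean(), l0_penalty(np.array([-10.0, 0.0, 10.0])))
```

Gates with a large `log_alpha` are almost always sampled as 1, while gates with a strongly negative `log_alpha` are almost always 0 and mark channels that can be removed.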
Variational Information Bottleneck.
Dai et al. (2018) introduces a Gaussian gate that is multiplied with each channel in the network. In the forward pass, a sample is drawn from the Gaussian by using the reparametrization trick from Kingma et al. (2015). This corresponds to gates $z$ such that:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
where $\mu$ and $\sigma$ are learnable parameters, corresponding to the mean and standard deviation of the Gaussian. The corresponding regularization factor is derived to be
$$\mathcal{L}_{c} = \sum_{g=1}^{G} \log\Big(1 + \frac{\mu_g^2}{\sigma_g^2}\Big)$$
where again, $G$ is the number of gates in the network. The channels that have a small ratio $\mu^2 / \sigma^2$ can be removed from the network, as they are either multiplied with a small mean value or have a very large variance.
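The Gaussian gates can be sketched in the same style. This is a forward-only NumPy illustration; the function names and the pruning threshold are hypothetical, and the penalty is written as a sum of $\log(1 + \mu^2/\sigma^2)$ terms, which is our reading of the regularizer in Dai et al. (2018) up to constant factors.

```python
import numpy as np


def sample_vib_gate(mu, sigma, rng):
    """Reparametrized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)


def vib_penalty(mu, sigma):
    """Complexity term: sum over gates of log(1 + mu^2 / sigma^2)."""
    return np.log1p(mu ** 2 / sigma ** 2).sum()


def prunable(mu, sigma, threshold):
    """Mark gates whose ratio mu^2 / sigma^2 falls below a user-chosen threshold."""
    return mu ** 2 / sigma ** 2 < threshold


rng = np.random.default_rng(0)
mu = np.array([1.0, 0.01])
sigma = np.array([0.1, 1.0])
z = sample_vib_gate(mu, sigma, rng)
print(z, vib_penalty(mu, sigma), prunable(mu, sigma, threshold=0.01))
```

The first gate has a large mean relative to its noise and is kept; the second is dominated by noise and is flagged for removal.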
3.4 Compression ratio selection for whole-model compression
Per layer compression ratio selection is one of the important aspects of neural network compression. In this section we introduce two different methods for compression ratio selection which we used for our experiments.
3.4.1 Equal accuracy loss
To compare different SVD and tensor decomposition based compression techniques in similar settings, we suggest using the following ratio selection method. The main advantage of this method is that it can be defined for any decomposition approach in a similar way.
To introduce the rank selection method, we first define a layer-wise accuracy metric based on a verification set. The verification set is a subset of the training set used for the rank selection method to avoid using the validation set. For a layer $l$, the accuracy $P_l(r_l)$ is obtained by compressing the layer using a vector of ranks $r_l$, while the rest of the network remains uncompressed. The network with the single compressed layer is evaluated on the verification set to calculate the value $P_l(r_l)$. In order to avoid extra computational overhead, in practice the layer-wise accuracy metric is calculated only for some values of $r_l$, e.g., values of $r_l$ that correspond to a fixed set of per-layer compression ratios.
We denote the combination of all rank values for all the layers as $r = (r_1, \dots, r_L)$, where each rank $r_l$ is a scalar in case of SVD decomposition, and a vector in case of high-dimensional tensor decomposition techniques. The set of ranks can be calculated as the solution to the following optimization problem. The input consists of the per-layer accuracy-based metric $P_l$, the full compressed model complexity $C'(r)$, the original model accuracy $P_{orig}$ and the original model complexity $C$:
$$\min_{r}\; C'(r) \quad \text{s.t.} \quad P_{orig} - P_l(r_l) \le \tau, \quad l = 1, \dots, L$$
where $\tau$ is the tolerance in per-layer accuracy metric decrease. The tolerance value is iteratively adjusted to meet the desired full model compression ratio $C / C'(r)$.
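One plausible reading of this procedure is sketched below: for a given tolerance, each layer takes the most aggressive candidate whose single-layer accuracy drop stays within the tolerance, and the tolerance is then found by bisection. The candidate grid of kept-MAC fractions, the bisection, and all names and numbers are assumptions made for illustration.

```python
def pick_fractions(layer_acc, fractions, base_acc, tau):
    """For each layer, pick the smallest kept-MAC fraction whose accuracy drop is within tau."""
    picks = []
    for accs in layer_acc:
        ok = [f for f, a in zip(fractions, accs) if base_acc - a <= tau]
        picks.append(min(ok) if ok else 1.0)
    return picks


def model_ratio(layer_macs, picks):
    """Whole-model compression ratio: original MACs over compressed MACs."""
    return sum(layer_macs) / sum(m * f for m, f in zip(layer_macs, picks))


def equal_accuracy_loss(layer_acc, fractions, layer_macs, base_acc, target_ratio, n_steps=40):
    """Bisect on the tolerance tau until the target whole-model ratio is just reached."""
    lo, hi = 0.0, base_acc
    for _ in range(n_steps):
        tau = 0.5 * (lo + hi)
        picks = pick_fractions(layer_acc, fractions, base_acc, tau)
        if model_ratio(layer_macs, picks) >= target_ratio:
            hi = tau
        else:
            lo = tau
    return pick_fractions(layer_acc, fractions, base_acc, hi), hi


# Hypothetical two-layer example: accuracy of the network with only that layer compressed.
fractions = [0.25, 0.5, 1.0]
layer_acc = [[0.60, 0.68, 0.70], [0.40, 0.55, 0.70]]
picks, tau = equal_accuracy_loss(layer_acc, fractions, [100, 100], base_acc=0.70, target_ratio=1.5)
print(picks, round(tau, 3), model_ratio([100, 100], picks))
```

In the example the first layer tolerates compression much better than the second, so it absorbs all of the compression at the tolerance that first reaches the target ratio.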
3.4.2 Greedy algorithm based on singular values
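To facilitate comparison of data-optimized SVD methods, we use the following method introduced in Zhang et al. (2016). The method is based on the assumption that the whole-model performance is related to the following PCA energy:
$$E(r) = \prod_{l=1}^{L} \sum_{k=1}^{r_l} \sigma_{l,k}$$
where $\sigma_{l,k}$ are the singular values of layer $l$ and $r_l$ is the rank selected for that layer. To choose the ranks for SVD decomposition, the energy is maximized subject to the constraint on the total number of MACs in the compressed model:
$$\max_{r}\; E(r) \quad \text{s.t.} \quad \sum_{l=1}^{L} c'_l(r_l) \le C'$$
where $c'_l(r_l)$ is the number of MACs of layer $l$ compressed with rank $r_l$ and $C'$ is the MAC budget of the compressed model. To optimize the objective, the greedy strategy of Zhang et al. (2016) is used. This approach has a relatively low computational cost and does not require using the validation set.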
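A greedy strategy of this kind can be sketched as follows: starting from full rank, repeatedly drop the singular value whose removal gives the smallest relative energy decrease per MAC saved, until the budget is met. The sketch assumes that the MAC count of each layer is linear in its rank (as for weight SVD and spatial SVD); names and numbers are hypothetical.

```python
import numpy as np


def greedy_rank_selection(singular_values, macs_per_rank, mac_budget):
    """singular_values[l]: descending values of layer l; macs_per_rank[l]: MACs per unit of rank."""
    ranks = [len(s) for s in singular_values]
    while sum(r * m for r, m in zip(ranks, macs_per_rank)) > mac_budget:
        best, best_cost = None, np.inf
        for l, s in enumerate(singular_values):
            if ranks[l] <= 1:
                continue                              # keep at least rank 1 in every layer
            # Relative drop of the product energy when removing the smallest kept value.
            rel_drop = s[ranks[l] - 1] / s[:ranks[l]].sum()
            cost = rel_drop / macs_per_rank[l]        # energy lost per MAC saved
            if cost < best_cost:
                best, best_cost = l, cost
        if best is None:
            break                                     # budget cannot be met
        ranks[best] -= 1
    return ranks


singular_values = [np.array([5.0, 3.0, 1.0, 0.1]), np.array([4.0, 4.0, 4.0, 4.0])]
ranks = greedy_rank_selection(singular_values, macs_per_rank=[10, 10], mac_budget=60)
print(ranks)
```

The first layer has a fast-decaying spectrum and gives up two ranks, while the layer with a flat spectrum is left untouched.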
4 Experiments
To evaluate the performance of different compression techniques at different levels, we used a set of models from the PyTorch (Paszke et al., 2017) model zoo, including Resnet18, Resnet50, VGG16, InceptionV3, and MobileNetV2, trained on the ImageNet data set. For every model we used 1.33x, 2x, 3x, and 4x compression ratios in terms of MACs, which serve as a proxy for run-time.
4.1 Level 1 compression
To compare the performance of level 1 compression techniques, we used Resnet18, VGG16, Resnet50, and InceptionV3 models; no fine-tuning or data-aware optimization was used. Five different compression techniques were evaluated, including spatial SVD, weight SVD, Tucker decomposition, tensor-train decomposition, and CP-decomposition. To compute compression ratios per layer, the method based on equal accuracy loss was used for all the methods.
For decomposition approaches that have a single rank value, including spatial SVD, weight SVD, and CP-decomposition, the rank value is fully determined by the compression ratio. For Tucker decomposition, we add an additional constraint $R_1 / R_1^{\max} \approx R_2 / R_2^{\max}$ to calculate the ranks, where $R_1^{\max}$ and $R_2^{\max}$ are the maximum values of ranks $R_1$ and $R_2$, respectively (the definition of ranks for Tucker decomposition is given in Eqn. 12), and the equality is approximate due to integer values of the ranks. In a similar way, for tensor-train decomposition we use the pair of constraints $R_1 / R_1^{\max} \approx R_2 / R_2^{\max} \approx R_3 / R_3^{\max}$ to determine the set of three ranks based on the compression ratio value, where $R_1^{\max}$, $R_2^{\max}$, $R_3^{\max}$ are maximum values of the ranks $R_1$, $R_2$, $R_3$, respectively (the ranks for the tensor-train decomposition are defined in Eqn. 16).
The results are shown in figure 5. The best accuracy versus compression ratio is achieved by the method based on CP-decomposition (Lebedev et al., 2014) across all four models. The second best method across all the considered models is the spatial SVD decomposition (Jaderberg et al., 2014). We conjecture that the good performance of both methods is due to the highly efficient final architecture, which is based on horizontal and vertical filters that require few MACs. In the case of CP-decomposition, the resulting CNN architecture is based on depth-wise separable convolutions, which results in even more savings in computational complexity.
The ranking of the other three methods depends on the model. Thus, choosing the optimal method requires empirical validation. The results show that using higher-order tensor decompositions such as Tucker or tensor-train does not necessarily lead to better performance compared to approaches based on matricization such as weight SVD or spatial SVD.
4.2 Level 2 compression
In this section, we present the results of the ablation study for Level 2 methods from Zhang et al. (2016), and compare them to channel pruning suggested by He et al. (2017) for Resnet18 and VGG16 models pre-trained on ImageNet. For data-aware reconstruction, we use 5000 images. For each image, ten feature map patches at random locations were sampled.
For the Resnet18 model, five methods were evaluated, including data SVD, asymmetric data SVD, channel pruning, Asym3D, and data-optimized spatial SVD. The best performance for lower compression ratios such as 1.33x and 2x compression is achieved with data-optimized spatial SVD, whereas for higher compression ratios including 3x and 4x compression, better accuracy is achieved using Asym3D (see figure 6 on the left).
The data-optimized spatial SVD method can be seen as three improvements on top of the most basic Level 1 weight SVD compression. The first step is using data for per-layer optimization of the compressed model (Eqn. 22), the second is asymmetric formulation (Eqn. 26), and finally, some improvement is obtained by using efficient spatial SVD architecture (Eqn. 33). In order to compare improvements due to each step, we performed the following ablation study. As results in figure 6 suggest, all three steps are equally important for the compressed model performance.
The accuracy of channel pruning is mostly on par with the data SVD method (figure 6), as both methods use data-aware reconstruction based on the same amount of data without leveraging the asymmetric formulation in Eqn. 27.
As the VGG16 model has many convolutional layers followed by ReLU non-linearities without batch normalization in between, this model allows adding the activation function into the data-aware reconstruction for methods such as asymmetric data SVD, Asym3D, and data-optimized spatial SVD. The results with the activation function included in the formulation are presented in figure 7. The most important part of the methods is using the ReLU function in the optimization, which is necessary for the performance of both methods.
Overall, for the VGG16 model, similar to the Resnet18 results, the three improvements on top of level 1 compression (data-aware optimization, the asymmetric formulation, and the efficient spatial SVD architecture) are equally crucial for the accuracy of the compressed model. For this model, channel pruning demonstrates poor performance, which is comparable to level 1 compression using the weight SVD method.
The MobileNet V2 architecture is based on depth-wise separable convolutions; therefore, spatial SVD is not possible, and the set of applicable compression methods is restricted to variants of data-optimized SVD and channel pruning. As the SVD decomposition can only be used for 1x1 convolutional layers and is not applicable to depth-wise separable convolutions, the data-optimized SVD methods, including data SVD and asymmetric data SVD, demonstrate poor performance (figure 8), although they still perform better than the data-free weight SVD method. In contrast, channel pruning is applicable to both types of layers, which leads to better accuracy of the compressed model.
4.3 Level 3 compression
4.3.1 Fine-tuned SVD and tensor decompositions
To recover the performance of compressed models, we used the same fine-tuning scheme for all compression methods. All the models were fine-tuned using SGD with 0.9 momentum for 20 epochs, with the learning rate dropped at epochs 10 and 15. The model-specific hyperparameters, including learning rate, batch size, and weight decay, are given in table 2.
Table 2: Fine-tuning hyperparameters per model (columns: model, learning rate, batch size, weight decay).
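The step schedule used for fine-tuning can be sketched as a small function. The milestones (epochs 10 and 15) come from the text; the drop factor of 0.1 and the base learning rate in the example are assumptions, since the fine-tuning drop factor is not stated and the per-model values of table 2 are not reproduced here.

```python
def step_lr(base_lr, epoch, milestones=(10, 15), gamma=0.1):
    """Learning rate at a given (0-indexed) epoch for a step schedule that
    multiplies the rate by gamma at every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Hypothetical base learning rate; gamma = 0.1 is an assumed drop factor.
schedule = [step_lr(0.01, epoch) for epoch in range(20)]
```

In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[10, 15]` wrapped around a `torch.optim.SGD` optimizer with `momentum=0.9`.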
In figure 9 we show the results for level 3 compression of Resnet18, Resnet50, VGG16, and InceptionV3. The best accuracy for all models is achieved using spatial SVD decomposition. The CP-decomposition shows the best results before fine-tuning. However, the fine-tuning scheme used for all the other methods does not recover the accuracy of the CP-decomposed models after compression. We were not able to find any fine-tuning hyperparameters that would allow us to recover the model accuracy. This observation agrees with the results from the original paper by Lebedev et al. (2014). The results for level 3 compression of MobileNetV2 are given in figure 10. Only two methods are applicable, and channel pruning outperforms fine-tuned weight SVD across all the compression ratios.
4.3.2 Fine-tuning data-optimized SVD
The following experiment was performed to estimate the potential benefit of combining data-aware optimization with full data fine-tuning for SVD methods. We compressed the Resnet18 network using the level 1 spatial SVD method and level 2 data-optimized spatial SVD. The two methods used the same SVD rank values provided by the greedy method based on singular values so that the resulting network architectures are identical. We fine-tuned both models using the same fine-tuning scheme used for level 3 compression.
The results are shown in figure 11. Despite the substantial difference in accuracy between level 1 and level 2 methods, the difference becomes negligible after the networks are fine-tuned. Therefore, we conclude that there is no benefit in using data-optimized compression if the network is fine-tuned after compression.
4.3.3 Probabilistic compression
Contrary to previously discussed methods, methods based on probabilistic compression usually train the network from scratch using a special regularization instead of starting from a pre-trained model. (Footnote 1: Probabilistic compression can also be used in combination with a pre-trained model. However, in most cases, this results in lower performance than starting from a randomly initialized model, especially when targeting a high compression rate.) Since the compression is enforced indirectly through a regularization term, it is not possible to target a specific compression rate directly. However, by a line search over the regularization strength, we can achieve compression rates comparable to those in the other experiments.
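One simple way to realize such a search is bisection on a logarithmic scale, sketched below. This is an illustration rather than the procedure used in the experiments: it assumes that the achieved compression rate does not decrease as the regularization strength grows, and `compression_at` is a hypothetical callable that, in practice, would train a model with the given strength and return its compression rate.

```python
def search_strength(compression_at, target, lo, hi, tol=0.05, max_iter=30):
    """Bisect the regularization strength on a log scale until the achieved
    compression rate is within tol of the target. Requires 0 < lo < hi and
    assumes compression_at is non-decreasing in the strength."""
    mid = (lo * hi) ** 0.5
    achieved = compression_at(mid)
    for _ in range(max_iter):
        if abs(achieved - target) <= tol:
            break
        if achieved < target:
            lo = mid  # too little compression: increase the strength
        else:
            hi = mid  # too much compression: decrease the strength
        mid = (lo * hi) ** 0.5
        achieved = compression_at(mid)
    return mid, achieved


# Toy stand-in for "train a model and measure its compression rate".
toy_compression = lambda strength: 1.0 + 100.0 * strength
strength, achieved = search_strength(toy_compression, target=3.0, lo=1e-4, hi=1.0)
```

Since every evaluation corresponds to a full training run, the tolerance and the number of iterations would need to be chosen with the training budget in mind.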
Similar to the original model, we train models with probabilistic compression using SGD with a learning rate of 0.1, momentum of 0.9, and a weight decay of for 120 epochs. We drop the learning rate by a factor of 0.1 at epochs 30, 60, and 90. The regularization strength depends on the method applied; for the variational information bottleneck (VIBNet) we used values between and to achieve a compression of approximately 1.3x to 3x. For L0-based regularization we used to resulting in similar compression rates. In case the architecture has residual connections, we add a gate to the input of the first convolution of each residual block. Thus, we can prune the input and output channels of each convolution to achieve an optimal compression rate. Note that in a chain-like CNN, pruning the input of a convolution is done implicitly, since it depends only on the output of the previous convolution.
The results for probabilistic compression of Resnet18 are shown in figure 12. We observe that VIBNet consistently outperforms L0 by a small margin. Compared to the previous best level 3 decomposition method, fine-tuned spatial SVD, VIBNets have a slight edge for lower compression rates but perform worse for very high compression rates. The latter might be due to the fact that spatially decomposing the convolutional filter can lead to a more efficient architecture than only pruning channels. Both VIBNets and L0 consistently outperform fine-tuned channel pruning, which can lead to the same architectures.
4.3.4 Combining probabilistic compression and channel pruning with SVD compression
We found that different level 3 compression approaches are complementary. In fact, spatial SVD can be combined with channel pruning or probabilistic compression, which yields better model accuracy compared to compression using a single full-data method.
The results for the combinations of the methods are given in figure 13. In the first case, spatial SVD was applied after probabilistic compression with the VIBNet approach. The VIBNet compressed model was trained from scratch, then spatial SVD was applied to the resulting model, and finally, the compressed model was fine-tuned using the scheme from table 2.
In the second case, channel pruning was applied after the spatial SVD. After each compression step, we fine-tuned the network with the scheme from table 2. The combination of the VIBNet approach and the spatial SVD achieves the best results and significantly improves on the spatial SVD method alone.
4.3.5 Compression versus training from scratch
One of the important questions related to compression is whether a compressed model gives better performance than training the same architecture from scratch. In order to answer this question, we performed the following experiment. We compressed Resnet18 and VGG16 pre-trained on ImageNet using spatial SVD and channel pruning and then compared the accuracy of the fine-tuned models to the models trained from scratch. The architecture for the models trained from scratch was identical to the architecture obtained by applying the compression techniques.
The level 3 fine-tuning schemes (table 2) were used for fine-tuning the compressed models. For training from scratch, we used 90 epochs for Resnet18 with similar parameters, including a starting learning rate of 0.1 dropped at epochs 30, 60, and 90, and 62 epochs for VGG16 with a learning rate of 0.01 dropped at epochs 30 and 60. Training the uncompressed models from scratch with these parameters gives accuracy equal to that of the corresponding pre-trained models, which were used for compression.
The results are shown in figure 14. For channel pruning, compression always gives better results than training from scratch. For spatial SVD, compression outperforms training from scratch for lower compression rates, but training from scratch gives better performance for more aggressive compression. We conjecture that more aggressive compression effectively leaves little information in the pre-trained model. In such cases, training from scratch with random initialization is often better. Our results for lower compression rates agree with the lottery ticket hypothesis (Frankle and Carbin (2018)), which claims that better accuracy can be achieved by training and pruning a larger model than by training a smaller model directly.
4.4 Compression ratio selection
One of the important aspects of compression methods is the per-layer compression ratio selection. As the layers of a network have different sensitivity to compression, different choices of compression ratios can improve or deteriorate the accuracy of the compressed model. The problem of compression ratio selection can be regarded as a discrete optimization problem. Specifying the full model compression ratio beforehand results in a constraint imposed on the solution.
The choice of the objective function for the optimization corresponds to several different practical use cases of model compression. Besides the obvious choice of maximizing the accuracy of the compressed model, compression ratio selection can be used to minimize the inference time of the compressed model on specific hardware leading to hardware-optimized compression. In addition to inference time, the objective function can be based on the memory footprint of the model at inference time as well as use any combination of the quantities mentioned above.
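The constrained discrete optimization described above can be illustrated with an exhaustive search over a small set of candidate per-layer ratios. This is only a toy formulation under stated assumptions: the cost of a layer is taken to be its MAC count divided by its compression ratio, `score` is a hypothetical objective (higher is better) that could stand for accuracy, negative latency, or negative memory footprint, and the layer sizes and sensitivities are made up.

```python
from itertools import product


def select_ratios(layer_macs, candidates, score, target_ratio):
    """Return the per-layer ratios with the highest score among all
    combinations whose total cost is at most total_macs / target_ratio."""
    budget = sum(layer_macs) / target_ratio
    best, best_score = None, float("-inf")
    for ratios in product(candidates, repeat=len(layer_macs)):
        cost = sum(m / r for m, r in zip(layer_macs, ratios))
        if cost > budget:
            continue  # violates the full-model compression constraint
        s = score(ratios)
        if s > best_score:
            best, best_score = ratios, s
    return best


# Made-up example: the first layer is five times more sensitive than the second.
sensitivity = [5.0, 1.0]
toy_score = lambda ratios: -sum(s * (1 - 1 / r) for s, r in zip(sensitivity, ratios))
chosen = select_ratios([100, 300], (1.0, 2.0, 4.0), toy_score, target_ratio=2.0)
```

The search space grows exponentially with the number of layers, so this form is only practical for very small problems; it is meant to show the structure of the objective and the constraint.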
Practical usage of the compression ratio optimization faces a challenge related to the need for time-consuming model fine-tuning to recover the compressed model accuracy. Using model accuracy after fine-tuning as an objective function for optimization is prohibitively expensive in this case. One way to alleviate this problem used in the literature (e.g., He et al. (2018)) is the following: a model is compressed using a set of compression ratios and evaluated on the validation set without fine-tuning. This accuracy value is then used in the optimization as a proxy for the accuracy of the compressed model after fine-tuning. In this case, it is assumed that a better compressed model accuracy before fine-tuning leads to a better compressed model accuracy after fine-tuning.
To quantitatively validate this assumption, we performed the following experiment. First, we compressed the Resnet18 model with a 2x compression ratio; the compression ratios per layer were selected using the greedy method based on singular values. Second, we randomly perturbed the compression ratios in such a way that the full model complexity is preserved under the perturbations. This way, we obtained 50 different compressed Resnet18 models of the same computational complexity. To verify whether the model accuracy before fine-tuning is a suitable proxy for the model accuracy after fine-tuning, we fine-tuned all the models using the same fine-tuning scheme that was used for level 3 compression (see table 2). Figure 15 shows the results as a scatter plot, with the horizontal axis corresponding to the model accuracy before fine-tuning and the vertical axis corresponding to the accuracy after fine-tuning. As the results suggest, there is no correlation between the two accuracy values. This does not agree with the assumption made above and leaves the problem of practical compression ratio optimization for architecture search methods wide open.
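The two ingredients of this experiment, a complexity-preserving perturbation and a correlation measure, can be sketched as follows. The sketch is an assumption-laden illustration: it represents each layer by the fraction of MACs that is kept (the inverse of its compression ratio), moves complexity between two randomly chosen layers, and uses the Pearson coefficient as the correlation measure. The exact perturbation scheme and statistic used in the experiment are not specified above.

```python
import random


def perturb_keep_fractions(layer_macs, keep, delta, rng):
    """Increase the kept fraction of one random layer by delta and decrease
    another layer by the MAC-equivalent amount, so sum(m_i * f_i) is unchanged.
    Returns the original fractions if the move leaves the range (0, 1]."""
    i, j = rng.sample(range(len(keep)), 2)
    new = list(keep)
    new[i] += delta
    new[j] -= delta * layer_macs[i] / layer_macs[j]
    if not all(0.0 < f <= 1.0 for f in new):
        return list(keep)
    return new


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Made-up layer sizes; every layer initially keeps half of its MACs.
macs = [100, 200, 400]
perturbed = perturb_keep_fractions(macs, [0.5, 0.5, 0.5], 0.05, random.Random(0))
total_after = sum(m * f for m, f in zip(macs, perturbed))
```

Applying `pearson` to the lists of accuracies before and after fine-tuning of the perturbed models would give a single number summarizing a scatter plot such as the one in figure 15.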
5 Conclusion
In this paper, we performed an extensive experimental evaluation of different neural network compression techniques. We considered several methods, including methods based on SVD, tensor factorization, channel pruning, and probabilistic compression methods.
We introduced a methodology for the comparison of different compression techniques based on levels of compression solutions. Level 1 corresponds to data-free compression with no fine-tuning or optimization used to improve the compressed model. Level 2 corresponds to data-optimized compression based on a limited number of training data batches used to improve the predictions by performing layer-wise optimization of the parameters of the compressed model. No back-propagation is used at this level. Level 3 corresponds to fine-tuning the compressed model on the full training set using back-propagation. We hope these levels help distinguish between different types of compression methods more clearly, as the vocabulary is adopted.
Experimental evaluation of the considered methods shows that the performance ranking of the considered methods depends on the level chosen for experiments. At level 1, CP-decomposition shows the best accuracy for most of the models. The ranking of the other methods depends on the model.
The best results for Level 2 compression are achieved using per-layer optimization based on the asymmetric formulation (Zhang et al. (2016)). Our experiments show, however, that applying the GSVD method for optimizing the second factor of spatial SVD decomposition yields better results than the original Asym3D double decomposition approach from the same paper.
For level 3 compression, the best performance is given by VIBNet and L0 methods for moderate compression, and by spatial SVD for higher compression ratios. In additional experiments, we show that SVD compression is complementary to channel pruning and probabilistic pruning approaches, so that the combination of VIBNet and spatial SVD gives the best overall performance of all the considered compression techniques.
In further experiments, we demonstrate that level 3 compression of a larger network achieves better performance compared to training a smaller network from scratch, both for SVD-based compression, and pruning methods. These results are in agreement with the lottery ticket hypothesis (Frankle and Carbin (2018)) and indicate that compression methods should be applied after training, and are not just a way of doing neural architecture search.
We would like to thank Arash Behboodi, Christos Louizos and Roberto Bondesan for their helpful discussions and valuable feedback.
- CuDNN: efficient primitives for deep learning. CoRR abs/1410.0759. Cited by: §1.
- Compressing neural networks using the variational information bottleneck. arXiv preprint arXiv:1802.10399. Cited by: §2, §3.3.1.
- Predicting parameters in deep learning. In Advances in neural information processing systems, pp. 2148–2156. Cited by: §2.
- Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: §2.
- The lottery ticket hypothesis: training pruned neural networks. CoRR abs/1803.03635. Cited by: §1, §4.3.5, §5.
- The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Cited by: §2.
- Rate distortion for model compression: from theory to practice. arXiv preprint arXiv:1810.06401. Cited by: §2.
- Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214. Cited by: §2.
- Matrix computations. The Johns Hopkins University Press, Baltimore and London. Cited by: §3.2.1.
- Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1.
- Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp. 293–299. Cited by: §2.
- Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §2, §2.1, §4.4.
- Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §1, §2, §3.2.2, §4.2.
- Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.
- Searching for mobilenetv3. CoRR abs/1905.02244. Cited by: §2.
- Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §2, §3.1.1, §4.1.
- Categorical reparametrization with gumbel-softmax. In Proceedings International Conference on Learning Representations 2017. Cited by: §3.3.1.
- Efficient neural network compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12569–12577. Cited by: §2.
- Automatic rank selection for high-speed convolutional neural network. arXiv preprint arXiv:1806.10821. Cited by: §2.
- Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530. Cited by: §2, §3.1.2, §3.1.2.
- Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2575–2583. Cited by: §3.3.1.
- Tensor decompositions and applications. SIAM review 51 (3), pp. 455–500. Cited by: §2, §3.1.2, §3.1.2.
- Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1.
- Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553. Cited by: §2, §3.1.2, §3.1.2, §4.1, §4.3.1.
- Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §2.
- Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §2, §3.2.2.
- MetaPruning: meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258. Cited by: §1, §2.1.
- Rethinking the value of network pruning. CoRR abs/1810.05270. Cited by: §1.
- Bayesian compression for deep learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3288–3298. Cited by: §3.3.1.
- Learning sparse neural networks through l0 regularization. In International Conference on Learning Representations. Cited by: §2, §3.3.1.
- The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §3.3.1.
- Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507. Cited by: §1.
- Data-free quantization through weight equalization and bias correction. arXiv preprint arXiv:1906.04721. Cited by: §2.1, §2.1.
- Global analytic solution of fully-observed variational bayesian matrix factorization. Journal of Machine Learning Research 14 (Jan), pp. 1–37. Cited by: §2, §3.1.2.
- Structured bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6775–6784. Cited by: §2, §3.3.1.
- Tensorizing neural networks. In Advances in neural information processing systems, pp. 442–450. Cited by: §2, §2.
- Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §2, §3.1.2.
- Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.
- Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §1, §2.
- Tensorized spectrum preserving compression for neural networks. arXiv preprint arXiv:1805.10352. Cited by: §2, §3.1.2, §3.1.2.
- Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067. Cited by: §3.1.1.
- Generalized constrained redundancy analysis. Behaviormetrika 33 (2), pp. 179–192. Cited by: §3.2.1.
- Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §2.
- EfficientNet: rethinking model scaling for convolutional neural networks. CoRR abs/1905.11946. Cited by: §1.
- EigenDamage: structured pruning in the kronecker-factored eigenbasis. arXiv preprint arXiv:1905.05934. Cited by: §2.
- Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §2.
- Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.
- Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 1943–1955. Cited by: §2, §3.2.1, §3.2.1, §3.2.1, §3.4.2, §4.2, §5.