A Survey of Model Compression and Acceleration for Deep Neural Networks

10/23/2017 ∙ by Yu Cheng, et al. ∙ Huazhong University of Science & Technology ∙ IBM ∙ Tsinghua University

Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep CNN models are computationally expensive and memory intensive, hindering their deployment on devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey recent advanced techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, followed by the other techniques. For each scheme, we provide analysis of the performance, related applications, advantages, and drawbacks. We then go through a few very recent additional successful methods, for example, dynamic networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude the paper and discuss remaining challenges and possible directions in this topic.




I Introduction

In recent years, deep neural networks have received a lot of attention, been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of GPUs with very high computation capability plays a key role in their success. For example, the work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers. Usually, it takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. As another example, the top face verification results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2, 3]. It is also very time-consuming to train such a model to get reasonable performance. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions [4].

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications such as online learning and incremental learning. In addition, recent years witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle fundamental challenges in deploying deep learning systems to portable devices with limited resources (e.g. memory, CPU, energy, bandwidth). Efficient deep learning methods can have significant impacts on distributed systems, embedded devices, and FPGAs for artificial intelligence. For example, ResNet-50 [5] with 50 convolutional layers needs over 95MB of memory for storage and over 3.8 billion floating-point multiplications to process each image. After discarding some redundant weights, the network still works as usual but saves more than 75% of the parameters and 50% of the computational time. For devices like cell phones and FPGAs with only several megabytes of resources, how to compact the models used on them is also important.

Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent works on compressing and accelerating deep neural networks, which have attracted a lot of attention from the deep learning community and have already achieved much progress in the past years.

We classify these approaches into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The parameter pruning and sharing based methods explore the redundancy in the model parameters and try to remove the redundant and uncritical ones. Low-rank factorization based techniques use matrix/tensor decomposition to estimate the informative parameters of the deep CNNs. The transferred/compact convolutional filters based approaches design special structural convolutional filters to reduce the storage and computation complexity. The knowledge distillation methods learn a distilled model and train a more compact neural network to reproduce the output of a larger network.

In Table I, we briefly summarize these four types of methods. Generally, the parameter pruning & sharing, low-rank factorization and knowledge distillation approaches can be used in DNNs with fully connected layers and convolutional layers, achieving comparable performances. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filter based approaches provide an end-to-end pipeline and can be implemented straightforwardly in a CPU/GPU environment, while parameter pruning & sharing uses different methods such as vector quantization, binary coding and sparse constraints to perform the task, and usually takes several steps to achieve the goal.

Regarding the training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained ones or trained from scratch, while the transferred/compact filter and knowledge distillation models can only be trained from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning & sharing can be used together, and model quantization & binarization can be used together with low-rank approximations to achieve further speedup. We will describe the details of each theme, their properties, strengths and drawbacks in the following sections.

| Theme Name | Description | Applications | More details |
|---|---|---|---|
| Parameter pruning and sharing | Reducing redundant parameters which are not sensitive to the performance | Convolutional layer and fully connected layer | Robust to various settings, can achieve good performance, can support both train from scratch and pre-trained model |
| Low-rank factorization | Using matrix/tensor decomposition to estimate the informative parameters | Convolutional layer and fully connected layer | Standardized pipeline, easy to implement, can support both train from scratch and pre-trained model |
| Transferred/compact convolutional filters | Designing special structural convolutional filters to save parameters | Only for convolutional layer | Algorithms are dependent on applications, usually achieve good performance, only support train from scratch |
| Knowledge distillation | Training a compact neural network with distilled knowledge of a large model | Convolutional layer and fully connected layer | Model performances are sensitive to applications and network structure, only support train from scratch |

TABLE I: Summary of different approaches for network compression.

II Parameter Pruning and Sharing

Early works showed that network pruning is effective in reducing network complexity and addressing the over-fitting problem [6]. Although pruning was originally introduced to reduce the structure of neural networks and hence improve generalization, it has since been widely studied as a way to compress DNN models by removing parameters that are not crucial to model performance. These techniques can be further classified into three categories: model quantization and binarization, parameter sharing, and structural matrices.

II-A Quantization and Binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speed-up with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic rounding based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy.
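To make the idea of low-bit weight storage concrete, the following is a minimal sketch of 8-bit quantization. It uses a generic min/max affine mapping, which is an assumption made for illustration and not necessarily the scheme used in [8]; the function names are hypothetical.

```python
import numpy as np

def quantize_uint8(w):
    """Affine (min/max) quantization of a float array to 8-bit codes."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0  # fall back to 1.0 if w is constant
    codes = np.round((w - w_min) / scale).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes, scale, w_min):
    """Map 8-bit codes back to approximate float weights."""
    return codes.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 32)).astype(np.float32)
codes, scale, w_min = quantize_uint8(weights)
restored = dequantize(codes, scale, w_min)
max_err = np.abs(weights - restored).max()  # bounded by about scale / 2
```

The codes occupy one quarter of the memory of the 32-bit floats, and the per-weight error is bounded by roughly half a quantization step.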

The method proposed in [10] first pruned the unimportant connections and retrained the sparsely connected networks. Then it quantized the link weights using weight sharing, and then applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it started by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network was retrained to learn the final weights for the remaining sparse connections. This work achieved state-of-the-art performance among all parameter quantization based methods. The work in [11] showed that Hessian weight could be used to measure the importance of network parameters, and proposed to minimize Hessian-weighted quantization errors on average when clustering network parameters.
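The first two stages of such a pipeline (magnitude pruning followed by weight sharing through k-means) can be sketched as below. This is an illustrative simplification under stated assumptions: retraining and Huffman coding are omitted, the sparsity level and the number of clusters are arbitrary, and the helper names are hypothetical rather than taken from [10].

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > threshold
    return w * mask, mask

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's k-means on scalars with linearly spaced initial centroids."""
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(assign == c):          # leave empty clusters where they are
                centroids[c] = values[assign == c].mean()
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))
w_pruned, mask = prune_by_magnitude(w, sparsity=0.8)
codebook, idx = kmeans_1d(w_pruned[mask], k=16)  # 16 shared values, i.e. 4-bit indices
w_shared = np.zeros_like(w)
w_shared[mask] = codebook[idx]                   # each kept weight becomes its centroid
```

After these steps only the sparse mask, the small codebook, and a 4-bit index per surviving weight need to be stored.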

Fig. 1: The three-stage compression method proposed in [10]: pruning, quantization and encoding. The input is the original model and the output is the compressed model.

In the extreme case of 1-bit representation of each weight, that is, binary weight neural networks, there are also many works that directly train CNNs with binary weights, for instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks [14]. The main idea is to directly learn binary weights or activations during the model training. The systematic study in [15] showed that networks trained with backpropagation could be resilient to specific weight distortions, including binary weights.
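As a rough illustration of what a 1-bit weight representation means at inference time, the sketch below approximates a weight matrix by a single scale times its sign pattern. The scaling factor `alpha` (mean absolute weight) is an assumption in the spirit of scaled binary-weight schemes; it only covers the forward approximation, not the training procedures of [12, 13, 14].

```python
import numpy as np

def binarize(w):
    """Approximate w by alpha * sign(w), with alpha the mean absolute weight."""
    alpha = np.abs(w).mean()
    b = np.where(w >= 0, 1.0, -1.0)  # map zero to +1 so every entry is +/-1
    return alpha, b

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8))
x = rng.normal(size=8)
alpha, b = binarize(w)
y_full = w @ x            # full-precision layer output
y_bin = alpha * (b @ x)   # products with +/-1 reduce to sign changes and additions
```

For a fixed sign pattern, the mean absolute weight is the scale that minimizes the squared approximation error, which is why it is used here.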

Drawbacks: the accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of these binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss. To address this issue, the work in [16] proposed a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [17] reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to sign changes.

II-B Pruning and Sharing

Network pruning and sharing has been used both to reduce network complexity and to address the over-fitting issue. An early approach to pruning was the Biased Weight Decay [18]. The Optimal Brain Damage [19] and the Optimal Brain Surgeon [20] methods reduced the number of connections based on the Hessian of the loss function, and their work suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. Those methods were trained from scratch.

A recent trend in this direction is to prune redundant, non-informative weights in a pre-trained CNN model. For example, Srinivas and Babu [21] explored the redundancy among neurons, and proposed a data-free pruning method to remove redundant neurons. Han et al. [22] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [23] proposed a HashedNets model that used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights, and then used Huffman coding to encode the quantized weights. In [24], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re-)training procedure. It is worth noting that the above pruning schemes typically prune individual connections in CNNs.

There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are typically introduced in the optimization problem as $l_0$ or $l_1$-norm regularizers. The work in [25] imposed a group sparsity constraint on the convolutional filters to achieve structured brain damage, i.e., pruning entries of the convolution kernels in a group-wise fashion. In [26], a group-sparse regularizer on neurons was introduced during the training stage to learn compact CNNs with reduced filters. Wen et al. [27] added a structured sparsity regularizer on each layer to reduce trivial filters, channels or even layers. In filter-level pruning, all the above works used $l_{2,1}$-norm regularizers. The work in [28] used the $l_1$-norm to select and prune unimportant filters.
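The two filter-level ingredients mentioned above can be sketched as follows: a group-sparsity penalty that sums per-filter $l_2$ norms (the $l_{2,1}$ norm, taken here as an assumption about the regularizer's form), and an $l_1$-norm ranking used to drop whole filters. The tensor layout and function names are illustrative choices, not those of [25, 26, 27, 28].

```python
import numpy as np

def l21_norm(filters):
    """Sum over filters of each filter's l2 norm (a group-sparsity penalty)."""
    flat = filters.reshape(filters.shape[0], -1)
    return np.sqrt((flat ** 2).sum(axis=1)).sum()

def prune_filters_l1(filters, keep):
    """Keep the `keep` filters with the largest l1 norm; return them and their indices."""
    scores = np.abs(filters).reshape(filters.shape[0], -1).sum(axis=1)
    kept = np.sort(np.argsort(scores)[-keep:])
    return filters[kept], kept

rng = np.random.default_rng(0)
conv = rng.normal(size=(8, 3, 3, 3))  # assumed layout: (out_channels, in_channels, kH, kW)
conv[[1, 5]] *= 0.01                  # two nearly-zero filters, as a regularizer might produce
penalty = l21_norm(conv)              # term that would be added to the training loss
pruned, kept = prune_filters_l1(conv, keep=6)
```

Because entire output filters are removed, the pruned layer stays dense and needs no sparse-matrix support, unlike connection-level pruning.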

Drawbacks: there are some potential issues with the pruning and sharing works. First, pruning with $l_1$ or $l_2$ regularization requires more iterations to converge. Furthermore, all pruning criteria require manual setup of sensitivity for layers, which demands fine-tuning of the parameters and could be cumbersome for some applications.

II-C Designing Structural Matrix

In architectures that contain only fully-connected layers, the number of parameters can grow up to billions [4]. Thus it is critical to explore this redundancy of parameters in fully-connected layers, which is often the bottleneck in terms of memory consumption. These network layers use the nonlinear transforms $f(x, M) = \sigma(Mx)$, where $\sigma(\cdot)$ is an element-wise nonlinear operator, $x$ is the input vector, and $M$ is the $m \times n$ matrix of parameters. When $M$ is a large general dense matrix, the cost of storing $mn$ parameters and computing matrix-vector products is $O(mn)$ in space and time, respectively. Thus, an intuitive way to prune parameters is to impose $M$ as a parameterized structural matrix. An $m \times n$ matrix that can be described using much fewer parameters than $mn$ is called a structured matrix. Typically, the structure should not only reduce the memory cost, but also dramatically accelerate the inference and training stage via fast matrix-vector multiplication and gradient computations.

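Following this direction, the work in [29, 30] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector $r = (r_0, r_1, \dots, r_{d-1})$, a circulant matrix $R \in \mathbb{R}^{d \times d}$ is defined as:

$$
R = \operatorname{circ}(r) :=
\begin{bmatrix}
r_0 & r_{d-1} & \dots & r_2 & r_1 \\
r_1 & r_0 & r_{d-1} & & r_2 \\
\vdots & r_1 & r_0 & \ddots & \vdots \\
r_{d-2} & & \ddots & \ddots & r_{d-1} \\
r_{d-1} & r_{d-2} & \dots & r_1 & r_0
\end{bmatrix}
$$

Thus the memory cost becomes $O(d)$ instead of $O(d^2)$. This circulant structure also enables the use of Fast Fourier Transform (FFT) to speed up the computation. Given a $d$-dimensional vector $r$, the above 1-layer circulant neural network in Eq. 1 has time complexity of $O(d \log d)$.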
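The FFT speed-up follows from the fact that multiplying by a circulant matrix is a circular convolution. The sketch below, with hypothetical helper names, checks the fast product against an explicitly built circulant matrix whose first column is the defining vector.

```python
import numpy as np

def circulant(r):
    """Dense d x d circulant matrix whose first column is r (O(d^2) storage)."""
    d = len(r)
    return np.array([[r[(i - j) % d] for j in range(d)] for i in range(d)])

def circulant_matvec(r, x):
    """Compute circulant(r) @ x in O(d log d) via the convolution theorem."""
    return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

rng = np.random.default_rng(0)
d = 16
r = rng.normal(size=d)   # the only d parameters of the layer
x = rng.normal(size=d)
y_dense = circulant(r) @ x
y_fast = circulant_matvec(r, x)
```

Only the vector `r` has to be stored and learned; the dense matrix is built here purely to verify the result.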

In [31], a novel Adaptive Fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The Adaptive Fastfood transform matrix $R \in \mathbb{R}^{n \times d}$ was defined as:

$$
R = S H G \Pi H B
$$

Here, $S$, $G$ and $B$ are random diagonal matrices. $\Pi \in \{0, 1\}^{d \times d}$ is a random permutation matrix, and $H$ denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected layer with $d$ inputs and $n$ outputs using the Adaptive Fastfood transform reduces the storage and the computational costs from $O(nd)$ to $O(n)$ and from $O(nd)$ to $O(n \log d)$, respectively.
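A minimal sketch of applying a product of diagonal, permutation, and Walsh-Hadamard factors without ever forming a dense matrix is shown below. It covers only a single square $d \times d$ block with $d$ a power of two, uses an unnormalized Hadamard transform, and the function names are hypothetical; it is not the implementation of [31].

```python
import numpy as np

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = v.astype(float).copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v

def fastfood_apply(x, S, G, B, perm):
    """Apply S H G Pi H B to x with O(d) parameters and O(d log d) operations."""
    return S * fwht(G * fwht(B * x)[perm])

rng = np.random.default_rng(0)
d = 8
S, G = rng.normal(size=d), rng.normal(size=d)   # diagonals stored as vectors
B = rng.choice([-1.0, 1.0], size=d)
perm = rng.permutation(d)                       # permutation stored as an index array
x = rng.normal(size=d)
y = fastfood_apply(x, S, G, B, perm)
```

The three diagonals and the permutation are all length-$d$ vectors, which is where the linear storage cost comes from.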

The work in [32] showed the effectiveness of the new notion of parsimony in the theory of structured matrices. Their proposed method can be extended to various other structured matrix classes, including block and multi-level Toeplitz-like [33] matrices related to multi-dimensional convolution [34].

Drawbacks: one potential problem of this kind of approach is that the structural constraint will cause a loss in accuracy, since the constraint might bring bias to the model. In addition, finding a proper structural matrix is difficult, and there is no theoretical way to derive it.

III Low-rank Factorization and Sparsity

Convolution operations contribute the bulk of all computations in deep CNNs, thus reducing the cost of the convolution layers would improve the compression rate as well as the overall speedup. The convolution kernels can be viewed as a 4D tensor. Ideas based on tensor decomposition are driven by the intuition that there is a significant amount of redundancy in the 4D tensor, which makes decomposition a particularly promising way to remove that redundancy. The fully-connected layer can be viewed as a 2D matrix, and low-rankness can also help there.

Fig. 2: A typical framework of the low-rank regularization method. The left is the original convolutional layer and the right is the low-rank constraint convolutional layer with rank-K.

Low-rank filters have long been used to accelerate convolution; for example, high-dimensional DCT (discrete cosine transform) and wavelet systems are constructed using tensor products from the 1D DCT transform and 1D wavelets, respectively. Learning separable 1D filters was introduced by Rigamonti et al. [35], following the dictionary learning idea. For some simple DNN models, a few low-rank approximation and clustering schemes for the convolutional kernels were proposed in [36]. They achieved 2× speedup for a single convolutional layer with 1% drop in classification accuracy. The work in [37] proposed using different tensor decomposition schemes, reporting a 4.5× speedup with 1% drop in accuracy in text recognition. The low-rank approximation was done layer by layer. The parameters of one layer were fixed after it was done, and the layers above were fine-tuned based on a reconstruction error criterion. These are typical low-rank methods for compressing 2D convolutional layers, as described in Figure 2. Following this direction, Canonical Polyadic (CP) decomposition was proposed for the kernel tensors in [38]. Their work used nonlinear least squares to compute the CP decomposition. In [39], a new algorithm for computing the low-rank tensor decomposition for training low-rank constrained CNNs from scratch was proposed. It used Batch Normalization (BN) to transform the activations of the internal hidden units.

In general, both the CP and the BN decomposition schemes in [39] (BN Low-rank) can be used to train CNNs from scratch. However, there are a few differences between them. For example, finding the best low-rank approximation in CP decomposition is an ill-posed problem, and the best rank-K (K is the rank number) approximation may not exist sometimes. For the BN scheme, in contrast, the decomposition always exists. We also compare the performance of both methods in Table II. The actual speedup and the compression rates are used to measure their performance. In Table II, the CP version gives higher speedup and compression rates, while the BN method retains slightly higher Top-5 accuracy.

| Model | Top-5 Accuracy | Speed-up | Compression Rate |
|---|---|---|---|
| AlexNet | 80.03% | 1× | 1× |
| BN Low-rank | 80.56% | 1.09× | 4.94× |
| CP Low-rank | 79.66% | 1.82× | 5× |
| VGG-16 | 90.60% | 1× | 1× |
| BN Low-rank | 90.47% | 1.53× | 2.72× |
| CP Low-rank | 90.31% | 2.05× | 2.75× |
| GoogleNet | 92.21% | 1× | 1× |
| BN Low-rank | 91.88% | 1.08× | 2.79× |
| CP Low-rank | 91.79% | 1.20× | 2.84× |

TABLE II: Comparisons between the low-rank models and their baselines on ILSVRC-2012.

Note that the fully connected layers can be viewed as a 2D matrix and thus the above mentioned methods can also be applied there. There are several classical works on exploiting low-rankness in fully connected layers. For instance, Denil et al. [40] reduced the number of dynamic parameters in deep models using the low-rank method. The work in [41] explored a low-rank matrix factorization of the final weight layer in a DNN for acoustic modeling.
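For the fully connected case, the idea can be illustrated with a truncated SVD that replaces one $m \times n$ weight matrix by two thin factors. This is a generic sketch, not the specific procedure of [40] or [41]; the layer sizes and rank are arbitrary, and the weight matrix is synthetically constructed to be close to low rank.

```python
import numpy as np

def low_rank_factors(W, k):
    """Best rank-k approximation W ~= A @ B (in Frobenius norm) from a truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # shape (m, k)
    B = Vt[:k, :]          # shape (k, n)
    return A, B

rng = np.random.default_rng(0)
m, n, k = 256, 128, 16
W = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))  # synthetic near-rank-k weights
W += 0.01 * rng.normal(size=(m, n))
A, B = low_rank_factors(W, k)

x = rng.normal(size=n)
y = A @ (B @ x)                              # two thin products instead of one dense one
params_before, params_after = m * n, k * (m + n)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

With these sizes the factorized layer stores 6,144 values instead of 32,768, and the multiply count drops by the same ratio; real trained weights are usually only approximately low rank, so some fine-tuning is typically needed afterwards.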

Drawbacks: low-rank approaches are straightforward for model compression and acceleration. The idea complements recent advances in deep learning, such as dropout, rectified units and maxout. However, the implementation is not that easy since it involves a decomposition operation, which is computationally expensive. Another issue is that current methods perform low-rank approximation layer by layer, and thus cannot perform global parameter compression, which is important as different layers hold different information. Finally, factorization requires extensive model retraining to achieve convergence when compared to the original model.

IV Transferred/Compact Convolutional Filters

CNNs are parameter efficient because they exploit the translation invariance of the representations with respect to the input image, which is the key to the success of training very deep models without severe over-fitting. Although a strong theory is currently missing, a large amount of empirical evidence supports the notion that both the translation invariant property and the convolutional weight sharing are important for good predictive performance. The idea of using transferred convolutional filters to compress CNN models is motivated by recent works in [42], which introduced the equivariant group theory. Let $x$ be an input, $\Phi(\cdot)$ be a network or layer and $\mathcal{T}(\cdot)$ be the transform matrix. The concept of equivariance is defined as:

$$
\mathcal{T}' \Phi(x) = \Phi(\mathcal{T} x)
$$

which says that transforming the input $x$ by the transform $\mathcal{T}(\cdot)$ and then passing it through the network or layer $\Phi(\cdot)$ should give the same result as first mapping $x$ through the network and then transforming the representation. Note that in Eq. (10), the transforms $\mathcal{T}(\cdot)$ and $\mathcal{T}'(\cdot)$ are not necessarily the same as they operate on different objects. According to this theory, it is reasonable to apply transforms to layers or filters $\Phi(\cdot)$ to compress the whole network model. From empirical observation, deep CNNs also benefit from using a large set of convolutional filters obtained by applying a certain transform $\mathcal{T}(\cdot)$ to a small set of base filters, since it acts as a regularizer for the model.

Following this trend, many recent works build a convolutional layer from a set of base filters [43, 44, 45, 42]. What they have in common is that the transform $\mathcal{T}(\cdot)$ lies in the family of functions that only operate in the spatial domain of the convolutional filters. For example, the work in [44] found that the lower convolution layers of CNNs learned redundant filters to extract both positive and negative phase information of an input signal, and defined $\mathcal{T}(\cdot)$ to be the simple negation function:

$$
\mathcal{T}(W_x) = W_x^-
$$

Here, $W_x$ is the basis convolutional filter and $W_x^-$ is the filter consisting of the shifts whose activation is opposite to that of $W_x$ and selected after the max-pooling operation. By doing this, the work in [44] can easily achieve a 2× compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer to improve the classification accuracy. The intuition is that the learning algorithm with a pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones.
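The saving from the negation transform can be seen in a simplified setting where a dense matrix stands in for a convolution (an assumption made to keep the example short). The responses of the negated filters are obtained by reusing the base responses, so only the base filters are stored and multiplied.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def crelu_responses(W, x):
    """Responses of base filters W and their negations -W from a single product W @ x."""
    z = W @ x
    return np.concatenate([relu(z), relu(-z)])

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 10))   # 4 base filters (dense stand-in for convolution kernels)
x = rng.normal(size=10)
out = crelu_responses(W, x)    # 8 output channels from 4 stored filters
```

This yields twice as many output channels as stored filters, which corresponds to the factor-of-two compression of the layer.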

In [45], it was observed that magnitudes of the responses from convolutional kernels had a wide diversity of pattern representations in the network, and it was not proper to discard weaker signals with a single threshold. Thus a multi-bias nonlinearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform $\mathcal{T}(\cdot)$ was defined as:

$$
\mathcal{T}' \Phi(x) = W_x + \delta
$$

where $\delta$ were the multi-bias factors. The work in [46] considered a combination of rotation by a multiple of $90^\circ$ and horizontal/vertical flipping with:

$$
\mathcal{T}' \Phi(x) = W^{T_\theta}
$$

where $W^{T_\theta}$ was the transformation matrix which rotated the original filters with angle $\theta \in \{90, 180, 270\}$. In [42], the transform was generalized to any angle learned from data, and $\theta$ was directly obtained from data. Both works [46] and [42] can achieve good classification performance.
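A filter bank built from rotated and flipped copies of a few base filters can be sketched as follows. The rotations are by multiples of 90 degrees, as provided by `np.rot90`; the array layout `(n, kH, kW)` and the function names are assumptions for illustration rather than the construction used in [46].

```python
import numpy as np

def rotated_bank(base):
    """Expand base filters of shape (n, kH, kW) with their 90/180/270 degree rotations."""
    rots = [np.rot90(base, k=q, axes=(1, 2)) for q in range(4)]
    return np.concatenate(rots, axis=0)

def with_flips(bank):
    """Additionally append a horizontally flipped copy of every filter."""
    return np.concatenate([bank, bank[:, :, ::-1]], axis=0)

rng = np.random.default_rng(0)
base = rng.normal(size=(2, 3, 3))   # only these 2 filters are stored and learned
bank = rotated_bank(base)           # 8 effective filters
full = with_flips(bank)             # 16 effective filters
```

All derived filters share the parameters of their base filter, so the number of learned weights stays at the size of `base`.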

The work in [43] defined $\mathcal{T}(\cdot)$ as the set of translation functions applied to 2D filters:

$$
\mathcal{T}' \Phi(x) = T(\cdot, x, y)_{x, y \in \{-k, \dots, k\},\ (x, y) \neq (0, 0)}
$$

where $T(\cdot, x, y)$ denoted the translation of the first operand by $(x, y)$ along its spatial dimensions, with proper zero padding at borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy as a regularized version of maxout networks, and 2) achieve parameter efficiency by flexibly varying their architectures to compress networks.

Table III briefly compares the performance of different methods with transferred convolutional filters, using VGGNet (16 layers) as the baseline model. The results are reported on CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is observed that they can achieve reduction in parameters with little or no drop in classification accuracy.

| Model | CIFAR-100 | CIFAR-10 | Compression Rate |
|---|---|---|---|
| VGG-16 | 34.26% | 9.85% | 1× |
| MBA [45] | 33.66% | 9.76% | 2× |
| CRELU [44] | 34.57% | 9.92% | 2× |
| CIRC [42] | 35.15% | 10.23% | 4× |
| DCNN [43] | 33.57% | 9.65% | 1.62× |

TABLE III: Comparisons of different approaches based on transferred convolutional filters on CIFAR-10 and CIFAR-100.

Drawbacks: there are a few issues to be addressed for approaches that apply transfer information to convolutional filters. First, these methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not narrow/special ones (like GoogleNet and Residual Net). Secondly, the transfer assumptions are sometimes too strong to guide the algorithm, making the results unstable on some datasets.

Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace the loose and over-parametric filters with compact blocks to improve the speed, which significantly accelerates CNNs on several benchmarks. Decomposing a $3 \times 3$ convolution into two $1 \times 1$ convolutions was used in [47], which achieved state-of-the-art acceleration performance on object recognition. SqueezeNet [48] was proposed to replace $3 \times 3$ convolution with $1 \times 1$ convolution, which created a compact neural network with about 50× fewer parameters and comparable accuracy when compared to AlexNet.

V Knowledge Distillation

To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was first proposed by Caruana et al. [49]. They trained a compressed model on pseudo-data labeled by an ensemble of strong classifiers, so that it reproduced the output of the original larger model. However, the work is limited to shallow models. The idea has been recently adopted in [50] as Knowledge Distillation (KD) to compress deep and wide networks into shallower ones, where the compressed model mimicked the function learned by the complex model. The main idea of KD based approaches is to shift knowledge from a large teacher model into a small one by learning the class distributions output via softened softmax.
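The softened softmax and a typical distillation objective can be sketched as below. The temperature `T`, the mixing weight `lam`, and the $T^2$ scaling of the soft term are commonly used conventions adopted here as assumptions; they are hypothetical hyperparameters, not values taken from [50].

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T gives a softer distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    """lam * CE(labels, student) + (1 - lam) * T^2 * CE(teacher at T, student at T)."""
    n = len(labels)
    hard = -np.log(softmax(student_logits)[np.arange(n), labels]).mean()
    soft_targets = softmax(teacher_logits, T)
    soft = -(soft_targets * np.log(softmax(student_logits, T))).sum(axis=1).mean()
    return lam * hard + (1.0 - lam) * T * T * soft

teacher_logits = np.array([[6.0, 2.0, 0.5], [1.0, 5.0, 0.2]])
student_logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

Raising the temperature exposes the teacher's relative confidence in the non-target classes, which is the extra signal the student learns from beyond the hard labels.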

The work in [51] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm, in which the student was penalized according to a softened version of the teacher's output. The framework compressed an ensemble of deep networks (teacher) into a student network of similar depth. To do so, the student was trained to predict the output of the teacher, as well as the true classification labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [52] aimed to address the network compression problem by taking advantage of deep neural networks. It proposed an approach to train thin but deep networks, called FitNets, to compress wide and shallower (but still deep) networks. The method extended the idea to allow for thinner and deeper student models. In order to learn from the intermediate representations of the teacher network, FitNet made the student mimic the full feature maps of the teacher. However, such assumptions are too strict since the capacities of teacher and student may differ greatly. In certain circumstances, FitNet may adversely affect the performance and convergence. All the above methods are validated on MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark datasets, and simulation results show that these methods match or outperform the teacher's performance, while requiring notably fewer parameters and multiplications.

There are several extensions along this direction of knowledge distillation. The work in [53] trained a parametric student model to approximate a Monte Carlo teacher. The proposed framework used online training, and used deep neural networks for the student model. Different from previous works which represented the knowledge using the softened label probabilities, [54] represented the knowledge by using the neurons in the higher hidden layer, which preserved as much information as the label probabilities but were more compact. The work in [55] accelerated the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. The techniques are based on the concept of function-preserving transformations between neural network specifications. Zagoruyko et al. [56] proposed Attention Transfer (AT) to relax the assumption of FitNet. They transferred the attention maps that are summaries of the full activations.

Drawbacks: KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One of them is that KD can only be applied to classification tasks with a softmax loss function, which hinders its usage. Another drawback is that the model assumptions are sometimes too strict to make the performance competitive with other types of approaches.

VI Other Types of Approaches

We first summarize the works utilizing attention-based methods. Note that attention-based systems [57] can reduce computations significantly by learning to selectively focus or “attend” to a few, task-relevant input regions. The work in [57] introduced the dynamic capacity network (DCN) that combined two types of modules: small sub-networks with low capacity, and large ones with high capacity. The low-capacity sub-networks were active on the whole input to first find the task-relevant areas in the input, and then the attention mechanism was used to direct the high-capacity sub-networks to focus on the task-relevant regions in the input. By doing this, the size of the CNN model could be significantly reduced.

Following this direction, the work in [58] introduced the conditional computation idea, which only computes the gradient for some important neurons. It proposed a new type of general purpose neural network component: a sparsely-gated mixture-of-experts (MoE) layer. The MoE consisted of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selected a sparse combination of the experts to process each input. In [59], dynamic deep neural networks (D2NN) were introduced, which were a type of feed-forward deep neural network that selected and executed a subset of D2NN neurons based on the input.
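The sparse gating of a mixture-of-experts layer can be illustrated with a small top-k router. This is a simplified sketch under stated assumptions: the experts are single linear maps rather than feed-forward networks, the gate is a plain linear layer without noise or load balancing, and all names are hypothetical rather than taken from [58].

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    """Route x to the top-k experts chosen by a linear gate and mix their outputs."""
    scores = gate_W @ x
    top = np.argsort(scores)[-k:]                # indices of the k largest gate scores
    g = np.exp(scores[top] - scores[top].max())
    g = g / g.sum()                              # softmax over the selected experts only
    y = sum(w * (experts[i] @ x) for w, i in zip(g, top))
    return y, top

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 4, 6
experts = [rng.normal(size=(d_out, d_in)) for _ in range(n_experts)]  # linear experts
gate_W = rng.normal(size=(n_experts, d_in))
x = rng.normal(size=d_in)
y, used = moe_forward(x, gate_W, experts, k=2)
```

Only the selected experts are evaluated for a given input, so the computation per example grows with `k` rather than with the total number of experts.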

There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling [60, 43]. Network architectures such as GoogleNet or Network in Network can achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them to new tasks, is more difficult with this approach. This problem was noted by Szegedy et al. [60] and motivated them to add a linear layer on the top of their networks to enable transfer learning.

The work in [61] targeted the Residual Network based model with a spatially varying computation time, called stochastic depth, which enabled the seemingly contradictory setup of training short networks and using deep networks at test time. It started with very deep networks, while during training, for each mini-batch, it randomly dropped a subset of layers and bypassed them with the identity function. This model is end-to-end trainable, deterministic and can be viewed as a black-box feature extractor. Following this direction, the work in [62] proposed a pyramidal residual network with stochastic depth.
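The layer-dropping rule can be sketched on a toy residual stack. The per-block survival probabilities, the rescaling of each residual branch by its survival probability at test time, and the tiny `tanh` blocks are assumptions made for illustration; the function and variable names are hypothetical.

```python
import numpy as np

def residual_stack(x, blocks, survival, rng=None):
    """Training mode (rng given): keep block l with probability survival[l].
    Test mode (rng is None): keep every block, scaled by its survival probability."""
    for f, p in zip(blocks, survival):
        if rng is None:
            x = x + p * f(x)
        elif rng.random() < p:
            x = x + f(x)
        # otherwise the block is skipped and only the identity path is used
    return x

init = np.random.default_rng(0)
mats = [0.1 * init.normal(size=(4, 4)) for _ in range(5)]
blocks = [lambda v, W=W: np.tanh(W @ v) for W in mats]  # toy residual branches
survival = np.linspace(1.0, 0.5, len(blocks))           # deeper blocks are dropped more often

x0 = np.ones(4)
y_train = residual_stack(x0, blocks, survival, rng=np.random.default_rng(1))
y_test = residual_stack(x0, blocks, survival)
```

During training the expected depth is shorter than the full network, while inference always runs the full stack.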

Other approaches to reduce the convolutional overheads include using FFT-based convolutions [63] and fast convolution using the Winograd algorithm [64]. Zhai et al. [65] proposed a strategy called stochastic spatial sampling pooling, which sped up the pooling operations by a more general stochastic version. Those works only aim to speed up the computation but do not reduce the memory storage.
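The idea behind FFT-based convolution is that convolution in the spatial domain becomes element-wise multiplication in the frequency domain. The sketch below checks this numerically on a single-channel example; it is a generic illustration rather than the method of [63], and it computes a true (flipped-kernel) full convolution, whereas CNN layers usually compute cross-correlation.

```python
import numpy as np

def direct_conv2d(image, kernel):
    """Full 2D linear convolution computed by shifting and accumulating."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    for i in range(kh):
        for j in range(kw):
            out[i:i + ih, j:j + iw] += kernel[i, j] * image
    return out

def fft_conv2d(image, kernel):
    """Same result via zero-padded FFTs and a pointwise product."""
    shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
    spectrum = np.fft.rfft2(image, s=shape) * np.fft.rfft2(kernel, s=shape)
    return np.fft.irfft2(spectrum, s=shape)

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))
kernel = rng.normal(size=(5, 5))
direct = direct_conv2d(image, kernel)
via_fft = fft_conv2d(image, kernel)
```

The two results agree up to floating-point error; the FFT route tends to pay off as the kernel grows, because its cost depends on the padded output size rather than on the kernel area.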

VII Benchmarks, Evaluation and Databases

In the past five years the deep learning community has made great efforts in benchmarking models. One of the most well-known models used in compression and acceleration for CNNs is Alexnet [1], which has been occasionally used for assessing the performance of compression. Other popular standard models include LeNets [66], All-CNN-nets [67] and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons respectively. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers. Recently, more and more state-of-the-art architectures have been used as baseline models in many works, including network in network (NIN) [68], VGG nets [69] and residual networks (ResNet) [70]. Table IV summarizes the baseline models commonly used in several typical compression methods.

Baseline Models | Representative Works
Alexnet [1] | structural matrix [31, 29, 32]; low-rank factorization [39]
Network in network [68] | low-rank factorization [39]
VGG nets [69] | transferred filters [43]; low-rank factorization [39]
Residual networks [70] | compact filters [48], stochastic depth [61]; parameter sharing [24]
All-CNN-nets [67] | transferred filters [44]
LeNets [66] | parameter sharing [24]; parameter pruning [22, 20]
TABLE IV: Summary of baseline models used in different representative works of network compression.

The standard criteria to measure the quality of model compression and acceleration are the compression and the speedup rates. Assume that $a$ is the number of the parameters in the original model $M$ and $a^*$ is that of the compressed model $M^*$; then the compression rate $\alpha(M, M^*)$ of $M^*$ over $M$ is

$$\alpha(M, M^*) = \frac{a}{a^*}.$$

Another widely used measurement is the index space saving defined in several papers [29, 71] as

$$\beta(M, M^*) = \frac{a - a^*}{a^*},$$

where $a$ and $a^*$ are the number of the dimension of the index space in the original model and that of the compressed model, respectively.

Similarly, given the running time $s$ of $M$ and $s^*$ of $M^*$, the speedup rate $\delta(M, M^*)$ is defined as:

$$\delta(M, M^*) = \frac{s}{s^*}.$$

Most works used the average training time per epoch to measure the running time, while in [29, 71], the average testing time was used. Generally, the compression rate and speedup rate are highly correlated, as smaller models often result in faster computation for both the training and the testing stages.
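As a small worked example, the sketch below computes the two main criteria, assuming the compression rate is the ratio of original to compressed parameter counts and the speedup rate is the ratio of original to compressed running times. The parameter counts and timings are hypothetical values, not measurements from any cited work.

```python
def compression_rate(orig_params, compressed_params):
    """Ratio of parameter counts: original model over compressed model."""
    return orig_params / compressed_params

def speedup_rate(orig_time, compressed_time):
    """Ratio of running times: original model over compressed model."""
    return orig_time / compressed_time

# Hypothetical model: 60M parameters compressed to 6M,
# with per-epoch time dropping from 3.0 to 1.5 (arbitrary units).
rate = compression_rate(60_000_000, 6_000_000)
speedup = speedup_rate(3.0, 1.5)
```

In this made-up case a 10× reduction in parameters gives only a 2× speedup, which illustrates that the two rates, although correlated, need not move in proportion.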

Good compression methods are expected to achieve almost the same performance as the original model with far fewer parameters and less computational time. However, for different applications with different CNN designs, the relation between parameter size and computational time may be different. For example, it is observed that for deep CNNs with fully connected layers, most of the parameters are in the fully connected layers; while for image classification tasks, floating-point operations are mainly in the first few convolutional layers since each filter is convolved with the whole image, which is usually very large at the beginning. Thus compression and acceleration of the network should focus on different types of layers for different applications.
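The mismatch between where the parameters sit and where the computation happens can be checked with simple counting. The layer shapes below (a 3×3 convolution with 64 input and 64 output channels on a 224×224 map, and a 25088-to-4096 fully connected layer) are hypothetical, VGG-like values chosen for illustration; computation is counted in multiply-accumulate operations (MACs).

```python
def conv_layer_cost(in_ch, out_ch, kernel, out_h, out_w):
    """Parameters (weights + biases) and MACs of a square-kernel conv layer."""
    weights = in_ch * out_ch * kernel * kernel
    params = weights + out_ch
    macs = weights * out_h * out_w     # every filter is applied at every position
    return params, macs

def fc_layer_cost(in_features, out_features):
    """Parameters (weights + biases) and MACs of a fully connected layer."""
    weights = in_features * out_features
    return weights + out_features, weights

# Hypothetical early conv layer and first fully connected layer.
conv_params, conv_macs = conv_layer_cost(64, 64, 3, 224, 224)
fc_params, fc_macs = fc_layer_cost(25088, 4096)
```

Under these assumed shapes the fully connected layer holds thousands of times more parameters, while the convolutional layer performs roughly 18 times more MACs.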

VIII Discussion and Challenges

In this paper, we summarized recent works on compressing and accelerating deep neural networks (DNNs). Here we discuss more details about how to choose different compression approaches, and possible challenges/solutions in this area.

VIII-A General Suggestions

There is no golden rule to measure which one of the four kinds of approach is the best. How to choose the proper approach really depends on the applications and requirements. Here are some general suggestions we can provide:

  • If the applications need compacted models from pre-trained models, you can choose either pruning & sharing or low rank factorization based methods. If you need end-to-end solutions for your problem, the low rank and transferred convolutional filters approaches are preferred.

  • For applications in some specific domains, methods with human priors (like the transferred convolutional filters and structural matrix) sometimes have benefits. For example, when doing medical image classification, transferred convolutional filters should work well, as medical images (such as images of organs) do have the rotation transformation property.

  • Usually the approaches of pruning & sharing can give a reasonable compression rate without hurting the accuracy. Thus for applications which require stable model accuracy, it is better to utilize pruning & sharing.

  • If your problem involves small or medium-sized datasets, you can try the knowledge distillation approaches. The compressed student model can benefit from the knowledge transferred from the teacher model, making it robust on datasets which are not large.

  • As we mentioned in Section I, techniques of the four themes are orthogonal. It makes sense to combine two or three of them to maximize the compression/speedup rates. For some specific applications, like object detection, which requires both convolutional and fully connected layers, you can compress the convolutional layers with low rank factorization and the fully connected layers with a pruning method.
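The last suggestion, combining low rank factorization for convolutional layers with pruning for fully connected layers, can be sketched as below. The choices here are assumptions for illustration only: the 4D convolution kernel is simply reshaped into a matrix and factorized by truncated SVD, and the fully connected weights are pruned by magnitude. The rank, the sparsity level, and the layer shapes are hypothetical, and no fine-tuning step is shown.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Truncated SVD: weight (m x n) is approximated by a (m x r) @ b (r x n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]         # absorb the singular values into a
    b = vt[:rank, :]
    return a, b

def magnitude_prune(weight, sparsity):
    """Zero out the given fraction of entries with the smallest magnitude."""
    k = int(sparsity * weight.size)
    if k == 0:
        return weight.copy()
    threshold = np.sort(np.abs(weight), axis=None)[k - 1]
    return np.where(np.abs(weight) > threshold, weight, 0.0)

rng = np.random.default_rng(0)
conv = rng.normal(size=(64, 32, 3, 3))         # (out_ch, in_ch, kh, kw)
fc = rng.normal(size=(256, 128))

# Convolutional layer: factorize the kernel reshaped to (out_ch, in_ch*kh*kw).
conv_a, conv_b = low_rank_factors(conv.reshape(64, -1), rank=16)
# Fully connected layer: keep only the 10% largest-magnitude weights.
fc_pruned = magnitude_prune(fc, sparsity=0.9)
```

Here the two factors store 5632 values instead of the 18432 in the original kernel, and the pruned matrix keeps about a tenth of its entries; in practice both steps would usually be followed by retraining to recover accuracy.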

VIII-B Technique Challenges

Techniques for deep model compression and acceleration are still in the early stage and the following challenges still need to be addressed.

  • Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle more complicated tasks, future methods should provide more plausible ways to configure the compressed models.

  • Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging because removing channels might dramatically change the input of the following layer. It is important to work out how to address this issue.

  • As we mentioned before, methods of structural matrix and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of the imposed prior knowledge.

  • The methods of knowledge distillation (KD) provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worth developing KD-based approaches and exploring how to improve their performance.

  • Hardware constraints in various small platforms (e.g., mobile, robotic, self-driving car) are still a major problem hindering the extension of deep CNNs. How to make full use of the limited computational resources available and how to design special compression methods for such platforms are still challenges that need to be addressed.

VIII-C Possible Solutions

To solve the hyper-parameter configuration problem, we can rely on the recent learning-to-learn strategy [72, 73]. This framework provides a mechanism allowing the algorithm to automatically learn how to exploit structure in the problem of interest. There are two different ways to combine the learning-to-learn module with model compression. The first designs compression and learning-to-learn simultaneously, while the second first configures the model with learning-to-learn and then prunes the parameters.

Channel pruning provides an efficiency benefit on both CPU and GPU because no special implementation is required. But it is also challenging to handle the input configuration. One possible solution is to use the training-based channel pruning methods [74], which focus on imposing sparse constraints on weights during training and could adaptively determine hyper-parameters. However, training from scratch for such methods is costly for very deep CNNs. In [75], the authors provided an iterative two-step algorithm to effectively prune channels in each layer.
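The input-configuration issue can be seen in a minimal sketch: removing output channels from one convolutional layer forces the matching input channels of the next layer to be removed as well. The selection rule used here, ranking filters by the L1 norm of their weights, is an assumed heuristic for illustration and is neither the sparsity-constrained training of [74] nor the two-step algorithm of [75]; the layer shapes are hypothetical.

```python
import numpy as np

def prune_channels(w1, w2, keep):
    """Keep the `keep` filters of w1 with the largest L1 norm.

    w1: (out1, in1, k, k) weights of the layer being pruned.
    w2: (out2, out1, k, k) weights of the following layer.
    """
    norms = np.abs(w1).sum(axis=(1, 2, 3))     # one L1 norm per output filter
    kept = np.sort(np.argsort(norms)[-keep:])  # surviving channel indices, in order
    # The next layer loses the input channels that no longer exist.
    return w1[kept], w2[:, kept], kept

rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 3, 3, 3))
w2 = rng.normal(size=(16, 8, 3, 3))
w1_pruned, w2_pruned, kept = prune_channels(w1, w2, keep=4)
```

Both layers remain ordinary dense convolutions after pruning, which is why no special implementation is needed, but the second layer now sees a different input and typically has to be re-tuned.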

Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the KD approaches. Instead of directly reducing and transferring parameters from the teacher models, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task. The intuition is that if a neuron is activated in certain regions or samples, that implies these regions or samples share some common properties that may relate to the task. Performing such steps is time-consuming, thus efficient implementation is important.

For methods with the convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to belong to a set of predefined transformations, let it be the whole family of spatial transformations applied on 2D filters or matrices, and 2) learn the transformation jointly with all the model parameters.

Regarding the use of CNNs in small platforms, proposing some general/unified approaches is one direction. Yuhen et al. [76] presented a feature map dimensionality reduction method by excavating and removing redundancy in feature maps generated by different filters, which could also preserve intrinsic information of the original network. The idea can be extended to make CNNs more applicable for different platforms. The work in [77] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning to make deep CNNs work in mobile devices. On the systems side, Facebook released the Caffe2 platform, which employed a particularly lightweight and modular framework and included mobile-specific optimizations based on the hardware design. Caffe2 can help developers and researchers train large machine learning models and deliver AI on mobile devices.

IX Acknowledgments

The authors would like to thank the reviewers and broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying the paper. This research is supported by National Science Foundation of China with Grant number 61401169.