A Survey of Pruning Methods for Efficient Person Re-identification Across Domains

07/04/2019, by Hugo Masson, et al.

Recent years have witnessed a substantial increase in the deep learning architectures proposed for visual recognition tasks like person re-identification, where individuals must be recognized over multiple distributed cameras. Although deep Siamese networks have greatly improved state-of-the-art accuracy, the computational complexity of the CNNs used for feature extraction remains an issue, hindering their deployment on platforms with limited resources, or in applications with real-time constraints. Thus, there is an obvious advantage to compressing these architectures without significantly decreasing their accuracy. This paper provides a survey of state-of-the-art pruning techniques that are suitable for compressing deep Siamese networks applied to person re-identification. These techniques are analysed according to their pruning criteria and strategy, and according to different design scenarios for exploiting pruning methods to fine-tune networks for target applications. Experimental results obtained using Siamese networks with ResNet feature extractors, and multiple benchmark re-identification datasets, indicate that pruning can considerably reduce network complexity while maintaining a high level of accuracy. In scenarios where pruning is performed with large pre-training or fine-tuning datasets, the number of FLOPS required by the ResNet feature extractor is reduced by half, while maintaining a comparable rank-1 accuracy (within 1% of the original model). Pruning while training a larger CNN can also provide significantly better performance than fine-tuning a smaller one.


1 Introduction

Deep learning architectures like the convolutional neural network (CNN) have achieved state-of-the-art accuracy across a wide range of visual recognition tasks, at the expense of growing complexity (deeper and wider networks) that require more training samples and computational resources. In order to deploy these architectures on compact platforms with limited resources (e.g., embedded systems, mobile phones, portable devices), and for real-time processing (e.g., video surveillance and monitoring, virtual reality), their time and memory complexity and energy consumption should be reduced SpeedAndAcc . Consequently, there is a growing interest in effective methods able to accelerate and compress deep networks.

Providing a reasonable trade-off between accuracy and efficiency has become an important concern in person re-identification (ReID), a key function in several monitoring and surveillance applications ranging from video surveillance to sports video analytics. Systems for person ReID typically seek to recognize the same individuals that previously appeared over a non-overlapping network of video surveillance cameras (see illustration in Fig. 1). These systems face many challenges in real-world applications that are related either to the data or to the network architecture. Data-related challenges that affect ReID accuracy include the limited amount of annotated training data, ambiguous annotations, limitations of person detection and tracking techniques, variations in pose, scale and illumination, occlusions, and low-resolution images.

Figure 1: Illustration of a typical person re-identification system.

To address these issues, most state-of-the-art approaches for person reID rely on deep Siamese networks (that integrate CNNs for feature extraction) to learn an embedding where similar image pairs (with the same identity) are close to each other, while dissimilar image pairs (with different identities) are distant from each other  ahmed2015improved ; hermans2017defense ; varior2016gated ; chen2017beyond ; geng2016deep ; cheng2016person ; liu2017end . While these networks can provide a high level of accuracy, achieving this performance comes at the cost of millions or even billions of parameters, a challenging training procedure, and the requirement for GPU acceleration. For instance, the ResNet50 CNN he2016deep , with its 50 convolutional layers, contains about 23.5M parameters (stored in 85.94MB of memory), and requires 6.3 billion floating point operations (FLOPs) to process a single color image. Such complex networks are unrealistic for many real-time applications. Accordingly, the computer vision and machine learning communities have shown a great deal of interest in compressing such networks without compromising their predictive accuracy.

Several approaches have been proposed for the compression of deep neural networks. These approaches can be divided into five classes: low-rank factorization, transferred convolutional channels, knowledge distillation, quantization, and pruning. Low-rank factorization approaches lu2017fully ; rigamonti2013learning ; denton2014exploiting ; jaderberg2014speeding ; lebedev2014speeding ; tai2015convolutional perform matrix decomposition to estimate the informative parameters of a network. However, low-rank approaches suffer from a number of issues – computationally expensive matrix decomposition, layer-by-layer low-rank approximation that prevents global compression, and extensive model retraining to achieve convergence. Some network compression approaches cohen2016group ; van2016deep ; shang2016understanding ; dieleman2016exploiting , categorized as transferred convolutional channels, design special structural convolutional channels to reduce the parameter space, which makes them computationally efficient but application dependent and trainable only from scratch. Knowledge distillation approaches buciluǎ2006model ; hinton2015distilling ; luo2016face ; chen2015net2net train a smaller or shallower network with the distilled knowledge of a large deep network, but are also sensitive to the application and only support training from scratch. In quantization approaches, each parameter of a network is represented with a reduced bit rate, either by reducing the precision, employing a look-up table, or combining similar values. Most quantization-based compression approaches han2015deep ; gong2014compressing ; chen2015compressing require extra computational time to access a look-up table or to undo an encoding in order to restore the original value. In contrast, pruning seeks to reduce the number of connections or channels by analyzing the network for the weights or channels with the least impact on performance, removing them, and then retraining either the whole network or part of it. Thus, in addition to producing a compressed network, pruning can reduce over-fitting, is not sensitive to the application, and supports training both from pre-trained networks and from scratch. Pruning approaches have therefore drawn a great deal of attention from the network compression community. Challenges of pruning include the lack of data for pruning during the fine-tuning phase, the computational complexity associated with retraining after a pruning phase, and the reduced learning capacity of the model, which can impact accuracy when the learning step is done on the pruned model.

Pruning techniques  lecun1990optimal ; hassibi1993second ; han2015learning ; Molchanov ; HaoLi are among the most widely used compression methods for deep neural networks, and have been shown to effectively prune well-known CNNs on general image classification problems like CIFAR10, MNIST and ImageNet. This paper provides a survey of state-of-the-art pruning techniques that are suitable for compressing deep Siamese networks for person re-identification applications. These techniques are categorized according to their pruning criterion for selecting channels, and to their strategy for reducing channels. In particular, we distinguish techniques whose criterion is based on weights from those based on feature maps. We also differentiate techniques according to their pruning strategy: 1) prune once and then fine-tune, 2) prune iteratively on a trained model, 3) prune using regularization, 4) prune by minimizing the reconstruction error, and 5) prune progressively.

In addition to the survey, we propose several design pipelines or scenarios to leverage state-of-the-art pruning methods during pre-training and/or fine-tuning. A typical design scenario consists of four stages: (1) train a larger network on a large dataset from the source domain (i.e., ImageNet), (2) prune the trained large model based on some pruning criterion to select channels, (3) re-train the pruned network to regain accuracy, and finally (4) fine-tune the re-trained network using a limited dataset from the target application. There are two common assumptions behind this design scenario. First, training a large and over-parameterized network on a large dataset is crucial to provide a strong feature representation; the pruning process then identifies a set of redundant channels whose removal does not significantly reduce accuracy. Under this assumption, pruning a large deep Siamese network should therefore outperform training a smaller network from scratch HaoLi ; Entropy_Pruning ; ChannelPruning ; Pro_Soft_Pruning . Second, both the pruned architecture and its corresponding weights are essential for obtaining an efficient final network han2015learning . Thus, most approaches in the literature tend to prune the channels of a fine-tuned network rather than of a pre-trained one. This paper presents other design scenarios that apply when pruning networks that have been pre-trained on a large dataset and that require fine-tuning to a target application.

Finally, this paper presents an extensive experimental evaluation and analysis of different pruning techniques and relevant design scenarios on three benchmark datasets for person reID – Market-1501 market1501 , CUHK03-NP cuhk03 , and DukeMTMC-reID DukeMTMC . Pruning techniques are compared using Siamese networks with ResNet feature extractors, in terms of accuracy and complexity, with person reID applications in mind.

The rest of the paper is organized as follows. Section 2 provides some background on the baseline deep Siamese network for person ReID with triplet loss. Section 3 provides a taxonomy and description of state-of-the-art pruning methods. Finally, Sections 4 and 5 describe the experimental methodology (benchmark datasets, protocol and performance measures) and the comparative results, respectively.

2 Deep Siamese Networks for Person ReID

The idea of using deep Siamese networks for biometric authentication and verification originates from Bromley et al. bromley1994signature , where two sub-networks with shared weights encode feature embeddings for pair-wise matching between query and reference (gallery) images. These networks were first applied to person ReID in yi2014deep , which employs three feature extraction sub-networks for deep feature learning. Since then, various deep learning architectures have been proposed to learn discriminative feature embeddings. Most of these architectures ahmed2015improved ; cheng2016person ; varior2016gated ; chen2017beyond ; liu2017end employ end-to-end training, where both the feature embedding and the metric are learned as a joint optimisation problem.

2.1 Network architecture:

A Siamese network is trained to encode a feature embedding for pair-wise matching. Deep Siamese networks are designed using two or more identical CNN feature extractors. These feature extractors share the same parameters and weights, and are commonly implemented with CNNs suitable for classification. Once the features are extracted for a pair of images, the network then uses a matcher to produce a similarity score that determines whether the pair of images comes from the same or from different classes. This similarity measure is often the Euclidean or cosine distance between the two feature representations. Similarity can also be assessed using a softmax layer with two classes, where one neuron represents the same class and the other a different class. For a given query image, the system provides a ranked list of matching scores for every reference image in the gallery, where the highest score represents the network's prediction for the input image. A threshold is commonly applied on the predictions in order to reject a prediction when the highest similarity score is too low.
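As a concrete illustration of this matching step, the following is a minimal sketch (not the exact pipeline used in this paper) of ranking a gallery against a query with cosine similarity, assuming a hypothetical backbone module that maps image tensors to embedding vectors:

import torch
import torch.nn.functional as F

def rank_gallery(backbone, query_img, gallery_imgs):
    """Minimal sketch of pair-wise matching: embed a query and a gallery with
    the shared-weight backbone, then rank gallery entries by cosine similarity
    to the query (highest score first)."""
    with torch.no_grad():
        q = F.normalize(backbone(query_img.unsqueeze(0)), dim=1)   # (1, d)
        g = F.normalize(backbone(gallery_imgs), dim=1)              # (N, d)
    scores = (g @ q.t()).squeeze(1)                                 # cosine similarities, (N,)
    ranked = torch.argsort(scores, descending=True)                 # ranked gallery indices
    return ranked, scores

A rejection threshold, as described above, would simply discard the prediction when scores[ranked[0]] falls below a chosen value.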

An important step in designing a Siamese network is to choose an appropriate feature extractor sub-network, which is the backbone of the overall Siamese architecture. It is trained with labelled data to extract features from input images of individuals and perform pair-wise matching. Any CNN, such as VGG, Inception, ResNet or DenseNet, can be used as the feature extractor. It has been shown experimentally that deeper CNNs, e.g., ResNet50, ResNet101 and DenseNet, better address the challenges of ReID, but their computational complexity and tendency to over-fit make them unsuitable for deployment in real-time applications. In contrast, shallower CNNs, e.g., ResNet18 and ResNet34, provide lower re-identification accuracy. In this paper, we focus on using state-of-the-art pruning techniques to compress the larger CNNs, and thereby achieve a better trade-off between accuracy and complexity. We also explore different design scenarios for pruning the network when fine-tuning is required on target operational data.

CNNs for feature extraction can either be (1) initialized using pre-trained parameters learned on a large dataset (i.e., ImageNet) and then fine-tuned on the target operational domain data, or (2) trained from scratch directly on the target operational domain data. Using pre-trained networks is usually the more viable solution, since CNN feature extractors need many training samples in order to learn parameters in an optimal region. However, pre-training is not always a good solution when the target operational dataset differs significantly from the pre-training dataset. In person ReID applications, most state-of-the-art methods use pre-trained CNNs, since they outperform CNN feature extractors trained from scratch.

Figure 2: Triplet training architecture. The anchor and positive samples are of the same individual, whereas the negative sample is of a different individual. Each triplet is fed through three identical networks. The triplet loss function optimizes the network parameters so as to minimize the intra-class distances while maximizing the inter-class distances.

2.2 Triplet loss function:

In a Siamese architecture, two identical networks share the same weights while processing the feature vectors of two different input images. During training, the network tries to minimize the intra-class distances and maximize the inter-class distances, most often using a pair-wise loss such as the softmax loss varior2016gated . However, the number of parameters grows linearly with the number of identities, which makes learning an N-way softmax classifier impractical. In contrast, with a triplet loss on a Siamese architecture (i.e., three identical networks for three inputs, as shown in Figure 2), a deep feature embedding is learned that naturally handles a large number of identities. The triplet loss is therefore very efficient and has become one of the most widely used losses for many applications. During training, for a given mini-batch with labels, we randomly sample triplets (x_a, x_p, x_n), where (x_a, x_p) is a pair of images of the same individual and (x_a, x_n) is a pair of images of different individuals. The corresponding features from the backbone networks are f(x_a), f(x_p) and f(x_n). The most common form of the triplet loss is as follows:

L_triplet = Σ_{i=1}^{N} [ d(f(x_a^i), f(x_p^i)) − d(f(x_a^i), f(x_n^i)) + m ]_+        (1)

where m denotes a margin, N represents the number of all possible triplets in the mini-batch, and d(·,·) is the Euclidean distance. Different variants of the triplet loss have already been used with moderate success. Hermans et al. hermans2017defense define a hard-triplet-loss-based metric embedding that considers the triplets within the mini-batch. The core idea is to form batches by randomly sampling P persons, and then sampling K images of each person. For each sample in the batch, the hardest positive and the hardest negative samples within the batch are selected when forming the triplets for computing the loss:

L_BH = Σ_{i=1}^{P} Σ_{a=1}^{K} [ m + max_{p=1..K} d(f(x_a^i), f(x_p^i)) − min_{j=1..P, j≠i; n=1..K} d(f(x_a^i), f(x_n^j)) ]_+        (2)

where the max and min operators select, for each anchor, the hardest positive and the hardest negative, i.e., the set of all hard triplets in the batch.
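For illustration, the following is a minimal PyTorch sketch of the batch-hard variant of Equation 2, assuming embeddings is a (B, d) tensor produced by the backbone and labels holds the person identities; the margin value is a placeholder, not necessarily the one used in this paper:

import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss (Hermans et al.): for each anchor in the batch,
    take the hardest positive and hardest negative, then apply a hinge with margin."""
    # pairwise Euclidean distance matrix (B, B)
    dist = torch.cdist(embeddings, embeddings, p=2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)              # (B, B) boolean
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same & ~eye                                         # positives, excluding self
    neg_mask = ~same
    hardest_pos = (dist * pos_mask.float()).max(dim=1).values
    # ignore non-negatives by setting them to +inf before taking the min
    hardest_neg = dist.masked_fill(~neg_mask, float('inf')).min(dim=1).values
    return torch.relu(margin + hardest_pos - hardest_neg).mean()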

3 Techniques for Pruning CNNs

In the literature, pruning techniques currently operate at two different levels. First, techniques for weight-level pruning focus on pruning individual weights of a network. In contrast, techniques for channel-level pruning prune entire output and input channels of convolutional layers. While weight pruning techniques can achieve a high compression rate and good acceleration, their performance depends on an efficient sparse convolution algorithm, which is either unavailable or does not perform well on all platforms. For this survey, we focus on channel pruning techniques, which do not rely on such algorithms and have been extensively studied in the literature. This section presents a survey of channel pruning techniques and a summary of experimental results from the literature.

3.1 Channel Pruning Taxonomy:

The objective of pruning is to remove unnecessary parameters from a neural network. For channel pruning, the objective is to remove all the parameters of a channel (output or input). Removing these parameters reduces the complexity of the network while trying to maintain a comparable accuracy. There are several ways of categorizing the different methods; one of them is described in Table 1.

Pruning Strategy | Methods
Prune Once | Hao Li HaoLi ; Redundant Channels Redundant_Pruning ; Entropy Entropy_Pruning
Iterative Pruning | Molchanov Molchanov ; Play and Prune PlayAndPrune ; FPGM FPGM
Pruning using regularization | Auto-Balanced AutoBalanced ; Play and Prune PlayAndPrune
Pruning by minimizing reconstruction error | ThiNet ThiNet ; Channel Pruning ChannelPruning
Progressive Pruning | PSFP Pro_Soft_Pruning
Table 1: A taxonomy of techniques according to the pruning strategy used to reduce channels.

In this taxonomy, Prune Once refers to techniques that prune the network once and then fine-tune it HaoLi ; Redundant_Pruning ; Entropy_Pruning . Iterative Pruning is done iteratively on a trained model, alternating between pruning and fine-tuning Molchanov . Pruning by regularization is usually done by adding a regularization term to the original loss function, leaving the pruning process to the optimization AutoBalanced ; PlayAndPrune . Pruning by minimizing the reconstruction error is a family of algorithms that tries to minimize the difference between the outputs of the pruned and the original models. Progressive pruning, while very similar to iterative pruning, differs in that it can start directly from an untrained model and progressively prune it during training.

In order to facilitate the analysis of the different pruning methods, we also categorise techniques according to the type of pruning criterion, as presented in Table 2.

Based on Weights | Based on Feature Maps
Hao Li HaoLi | ThiNet ThiNet
Auto-Balanced AutoBalanced | Entropy Entropy_Pruning
Redundant Channel Redundant_Pruning | Channel Pruning ChannelPruning
PSFP Pro_Soft_Pruning | Molchanov Molchanov
FPGM FPGM |
Play and Prune PlayAndPrune |
Table 2: A taxonomy of techniques according to the pruning criteria used to select channels.

With this taxonomy, algorithms are divided into two categories. The first contains weight-based methods, which apply the evaluation criterion directly to the weights of the channel. For feature map based techniques, the pruning criterion is applied to the feature maps produced by a convolutional layer.

Pruning neural networks comes with many challenges. The first major challenge is the pruning criterion: it needs to discern the parameters that contribute to the accuracy from the ones that do not. The second major challenge is finding an optimal compression ratio: a compromise must be found between the reduction in model complexity and the loss of accuracy. The third and last challenge is the retraining and pruning schedule of the model. Pruning could be done in one iteration, but the damage done to the network would be considerable. On the other hand, pruning and retraining iteratively reduces the damage done at each iteration but takes much longer to apply. The retraining of the pruned network can also be problematic, since the model can over-fit or get caught in local minima.

3.2 Description of methods:

This subsection presents and explains different pruning algorithms for each family of the taxonomy. For ease of notation, we refer to a convolution tensor as W, with C_in the number of input channels, C_out the number of output channels, and k the kernel size, so that W has dimensions C_out × C_in × k × k. An output channel tensor (filter) is defined as W_j, and an individual weight is defined as w. For feature maps, H represents the output of a convolution layer and H_j represents the j-th output channel of that feature map. For ease of notation, we do not mention the layer index of W unless necessary; therefore W or H can refer to any convolution layer or feature map at any index.

3.2.1 Criteria based on weights:

The Hao Li HaoLi pruning algorithm is a layer-by-layer method, meaning that it prunes the network one layer at a time. Its pruning criterion is simple and can be applied in three steps for each layer:

  1. For each output channel j of layer i, calculate the absolute sum of its weights, i.e., the L1 norm of the channel:

     s_j = Σ_{w ∈ W_j} |w|        (3)

  2. Sort the channels from the smallest score to the largest.

  3. Prune the m channels with the smallest scores; the corresponding input channels in the next convolutional layer are also removed.

The retraining can be done in two different ways:

  1. Prune once over multiple layers and retrain (more suitable for resilient layers)

  2. Prune channels one by one and retrain each time (more suitable for layers that are less resilient)

For this algorithm, we chose to mix the two retraining methods, pruning N channels before each retraining.
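A minimal sketch of this criterion (Equation 3), assuming a PyTorch convolution weight tensor of shape (C_out, C_in, k, k); the surgery that actually removes the selected channels and the matching input channels of the next layer is omitted:

import torch

def l1_filter_scores(conv_weight):
    """Hao Li criterion: score each output channel by the sum of the absolute
    values of its weights, i.e. the L1 norm over the (C_in, k, k) kernel."""
    # conv_weight has shape (C_out, C_in, k, k)
    return conv_weight.abs().sum(dim=(1, 2, 3))

def channels_to_prune(conv_weight, m):
    """Return the indices of the m lowest-scoring output channels."""
    scores = l1_filter_scores(conv_weight)
    return torch.argsort(scores)[:m]

# usage sketch: idx = channels_to_prune(conv_layer.weight.data, m=8)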

The second weight-based method presented here is redundant channel pruning Redundant_Pruning . The idea of this method is to prune channels that are similar to the ones that are kept. To do so, the authors propose to group the channels of a layer into clusters whenever their similarity score is higher than a preassigned threshold. To determine the similarity between channels, the authors use the cosine similarity between the weights of the channels:

     sim(W_a, W_b) = cos(θ_ab)        (4)

with the cosine of the angle between two output channels given below:

     cos(θ_ab) = ( vec(W_a) · vec(W_b) ) / ( ||vec(W_a)||_2 ||vec(W_b)||_2 )        (5)

Equation 5 determines the similarity between two channels by computing the cosine of the angle between two vectors of dimension n = C_in × k × k. The pruning of one specific layer can then be done in two steps:

  1. Group channels into the same cluster if their similarity from Equation 4 is above the threshold.

  2. Randomly sample one channel in each cluster and prune the remaining channels of that cluster.

The threshold acts as the compression ratio in this pruning algorithm: a low threshold means a high compression rate, and vice versa.
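A minimal sketch of this selection, assuming a PyTorch weight tensor of shape (C_out, C_in, k, k); for simplicity, this greedy variant keeps the first channel of each cluster rather than sampling one at random as the authors do:

import torch
import torch.nn.functional as F

def cluster_similar_channels(conv_weight, threshold=0.9):
    """Redundant-channel selection sketch: flatten each output channel, compute
    pairwise cosine similarity (Eq. 5), and greedily group channels whose
    similarity to a kept representative exceeds the threshold. One channel per
    cluster is kept; the rest can be pruned."""
    flat = F.normalize(conv_weight.flatten(start_dim=1), dim=1)   # (C_out, C_in*k*k)
    sim = flat @ flat.t()                                         # cosine similarities
    keep, prune = [], []
    for j in range(sim.shape[0]):
        if any(sim[j, k] > threshold for k in keep):
            prune.append(j)           # redundant with an already kept channel
        else:
            keep.append(j)            # new cluster representative
    return keep, prune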

The third weight-based method is Auto-Balanced pruning AutoBalanced , which uses the same pruning criterion as the Hao Li algorithm, the L1 norm of the weight kernels, to rank the channels. However, this method adds a regularization term during training to transfer the representational capacity of the channels to be pruned to the remaining ones. To compute this transfer, the authors propose to separate the channels into two subsets at the beginning of each pruning iteration. Channels are assigned to their subset according to the L1 norm of their weights: the weight tensor of channel j in layer i is flattened into a vector with vec(·), and its L1 norm is used as the importance metric:

     L(i, j) = ||vec(W_j^i)||_1        (6)

Once the L1 score has been calculated for each channel, the channels are assigned to one of the two subsets according to a threshold, which is fixed depending on the desired number of remaining channels per layer.

(7)
(8)

The channels in subset R (remaining) and subset P (to be pruned) are then adjusted with an L2 regularization term. The following equations are used to compute this L2 adjustment factor:

(9)
(10)
(11)
(12)
(13)

The cost function for training is modified as in Equation 14, which adds this regularization to the original cost function:

(14)
(15)

This enables the model to penalize the weak channels and stimulate the strong ones. The method adds two hyperparameters to the training: a regularization factor and a vector giving the target number of remaining channels in each layer.
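A hedged sketch of the Auto-Balanced idea follows, assuming a PyTorch weight tensor and L1-based ranking as above; the exact balancing coefficients of Equations 9-13 are not reproduced here, only the general scheme of penalizing weak channels and rewarding strong ones:

import torch

def autobalanced_penalty(conv_weight, n_keep, lam=1e-4):
    """Sketch of an Auto-Balanced-style regularizer for one layer: split the
    channels into a remaining subset R (highest L1 norms) and a subset P to be
    pruned (lowest L1 norms), then build an L2 term that penalizes the weak
    channels and rewards the strong ones, transferring capacity toward R."""
    l1 = conv_weight.abs().flatten(start_dim=1).sum(dim=1)     # per-channel L1 norm
    order = torch.argsort(l1, descending=True)
    keep, prune = order[:n_keep], order[n_keep:]
    l2 = conv_weight.flatten(start_dim=1).pow(2).sum(dim=1)    # per-channel squared L2
    # positive penalty on weak channels, negative (reward) on strong channels
    return lam * (l2[prune].sum() - l2[keep].sum())

# usage sketch: total_loss = task_loss + autobalanced_penalty(conv_layer.weight, n_keep=32)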

The last weight-based method is Progressive Soft Filter Pruning Pro_Soft_Pruning , whose pruning criterion is the same as the Hao Li method (the norm of the weights). The main difference is an interesting pruning scheme that allows pruning during the fine-tuning step. The authors propose soft pruning: instead of removing the channels during pruning, they zeroize the weights and allow these channels to be updated during the retraining phase. This pruning scheme is very interesting since the model keeps its original dimensions during the retraining phase. The authors also propose a progressive pruning scheme where, at each pruning iteration, the compression ratio is increased in order to obtain a shallower network. Once these iterations of pruning and retraining are complete, a last channel ranking is performed using the pruning criterion, and the lowest-ranked channels are discarded according to the compression ratio. The pseudo-code for the progressive soft pruning scheme is given in Algorithm 1. In the article, the L1 or L2 norm of the weights is used as the pruning criterion, which is why this method is categorized as weight-based.

1: Input: training data X
2: Input: pruning rate P_i, pruning rate decay D
3: Input: the model with parameters M = {W^i, 0 ≤ i ≤ L}
4:   Initialize the model parameters M
5:   for epoch = 1, ..., epoch_max do
6:     Update the model parameters M based on X
7:     for i = 1, ..., L do
8:          Calculate the norm of each output channel of layer i
9:          Calculate the pruning rate P_i' at this epoch using P_i and D
10:         Select the lowest-norm channels according to the pruning rate P_i'
11:         Zeroize the weights of the selected channels
12:     end for
13:   end for
14:   Obtain the compact model with parameters M' from M
15: Output: Compact model with parameters M'
Algorithm 1 Algorithm Description of PSFP

Here, L represents the number of layers in the model, i the layer index, W^i the weights of layer i, and N_i the number of channels to prune in that layer. The pruning rate P_i' is calculated at each epoch using the pruning rate goal P_i for the corresponding layer and the pruning rate decay D. To calculate the pruning rate, we can use Equation 16:

(16)

The values a, b and k can be obtained by solving Equation 17:

(17)
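A minimal sketch of one soft-pruning step, assuming per-layer L2 ranking as described above; since Equations 16-17 are not reproduced here, the progressive schedule is replaced by a simple linear placeholder:

import torch

def soft_prune_layer(conv_weight, rate):
    """Zeroize (do not remove) the lowest L2-norm output channels of one
    convolution layer; the layer keeps its original shape, so the zeroized
    channels may be updated again at the next epoch (soft pruning)."""
    norms = conv_weight.flatten(start_dim=1).norm(p=2, dim=1)   # per-channel L2 norm
    n_prune = int(rate * conv_weight.shape[0])
    idx = torch.argsort(norms)[:n_prune]
    with torch.no_grad():
        conv_weight[idx] = 0.0
    return idx

def progressive_rate(epoch, target_rate, total_epochs):
    # placeholder schedule: ramp the pruning rate linearly toward the target;
    # the paper uses an exponential schedule (Eq. 16) whose constants a, b, k
    # are solved from Eq. 17
    return target_rate * min(1.0, (epoch + 1) / total_epochs)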

FPGM FPGM is a recent technique that uses the geometric median to prune output channels. The geometric median is defined as follows: given a set of points x_1, ..., x_n with x_i ∈ R^d, find a point x* that minimizes the sum of the Euclidean distances to them:

     x* = argmin_{x ∈ R^d} Σ_{i=1}^{n} ||x − x_i||_2        (18)

Using Equation 18, a geometric median of all the filters of a layer can be found:

     W_GM = argmin_{x} Σ_{j=1}^{C_out} ||x − vec(W_j)||_2        (19)

In order to select unimportant output channels, the authors propose to find the channels whose value is the same as, or similar to, the geometric median, which translates to:

     j* = argmin_j ||vec(W_j) − W_GM||_2        (20)

Since computing the geometric median is a non-trivial and computationally intensive problem, the authors propose to relax it by assuming that:

     ||vec(W_{j*}) − W_GM||_2 = 0        (21)

This transforms Equation 19 into:

     j* = argmin_j Σ_{j'=1}^{C_out} ||vec(W_j) − vec(W_{j'})||_2        (22)

The FPGM algorithm, used with the progressive soft pruning scheme, can be summarised as in Algorithm 2:

1: Input: training data X
2: Input: pruning rate P_i
3: Input: the model with parameters M = {W^i, 0 ≤ i ≤ L}
4:   Initialize the model parameters M
5:   for epoch = 1, ..., epoch_max do
6:     Update the model parameters M based on X
7:     for i = 1, ..., L do
8:          Select the fraction P_i of channels that satisfy Equation 22
9:          Zeroize the selected channels
10:     end for
11:   end for
12:   Obtain the compact model with parameters M' from M
13: Output: Compact model with parameters M'
Algorithm 2 Algorithm Description of FPGM
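A minimal sketch of the relaxed criterion of Equation 22, assuming a PyTorch weight tensor of shape (C_out, C_in, k, k); the soft-pruning loop around it is the same as in Algorithm 2:

import torch

def fpgm_select(conv_weight, n_prune):
    """FPGM-style selection sketch: flatten each filter, compute the sum of
    Euclidean distances from each filter to all the others (the relaxed
    geometric-median criterion of Eq. 22), and mark the filters with the
    smallest distance sums, i.e. those closest to the geometric median and
    hence the most replaceable."""
    flat = conv_weight.flatten(start_dim=1)        # (C_out, C_in*k*k)
    dists = torch.cdist(flat, flat, p=2)           # pairwise distances between filters
    scores = dists.sum(dim=1)                      # distance of each filter to all others
    return torch.argsort(scores)[:n_prune]         # filters nearest the geometric median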

Play and Prune PlayAndPrune is an adaptive output-channel pruning technique that, instead of focusing on a criterion, tries to find the optimal number of output channels that can be pruned away given an error tolerance rate. The technique is a min-max game between two modules, the Adaptive Filter Pruning (AFP) module and the Pruning Rate Controller (PRC). The goal of the AFP is to minimize the number of output channels in the model, while the PRC tries to maximize the accuracy of the remaining set of output channels. The technique considers that a model can be partitioned into a set of important channels I and a set of unimportant channels U:

(23)

where U^i represents the unimportant channels of layer i, selected by taking a fraction of the channels with the lowest L1 norm after sorting the output channels of the layer. Once U is selected, the authors propose to add an additional penalty to the original loss function in order to prune without loss of accuracy while helping the pruning process. The loss function then becomes:

(24)

where the first term is the original cost function, optimizing the original model parameters, and the second is the regularization term. While this optimization helps push the unimportant channels toward a zero sum of absolute weights, it can take several epochs; the authors therefore propose an adaptive weight threshold for each layer. Any channel with an L1 norm below this threshold is removed. This threshold is provided by the PRC; for the first epoch, it is found by a binary search on the histogram of the sums of absolute weights. The AFP minimizes the number of output channels in the model using Equation 24, and can be summarized as follows: 1) select the fraction of output channels of lowest importance, 2) partition them into U and I, 3) optimize Equation 24 with the regularization weight given by the PRC, and 4) remove unimportant channels using the threshold given by the PRC. The loss function of the AFP can be written as:

(25)

For the PRC, the adaptive threshold is updated as follows:

(26)

where a constant is used to increase or decrease the pruning rate; it is calculated as follows:

(27)

Here, the quantities involved are the accuracy of the unpruned network, the tolerance error, and the accuracy of the model with the remaining filters. The regularization constant is also computed as follows:

(28)

starting from an initial regularization constant. By alternating between the AFP and the PRC, the authors obtain a system that prunes at each epoch in an adaptive and iterative way.

3.2.2 Criteria based on feature maps:

In the feature map based approach, we present four algorithms: ThiNet ThiNet , Channel Pruning ChannelPruning , Entropy Pruning Entropy_Pruning , and Molchanov's Taylor-expansion criterion Molchanov . The first two algorithms share the same underlying idea, minimising the difference in the activation maps, but diverge in their minimisation technique. ThiNet's goal is to find a subset of channels that minimises the difference in the output at layer i+1 (feature map).

ThiNet uses a greedy algorithm to find which subset of channels to eliminate while keeping the input of layer i+2 almost intact. To find the subset of channels to prune, the authors compute a value for each channel in a layer and assign the channel with the lowest value to the subset, repeating this until the pruning subset respects the defined compression ratio. The input of the feature map in layer i+2 can be computed with Equation 29:

(29)

where the indices denote the layer, the channel index and the kernel size of the channel. To compute the value of a channel, the authors propose to use Equation 30:

(30)

where the quantity in Equation 30 corresponds to the channel-wise contribution defined in Equation 29. This greedy method is repeated for each layer to be pruned in the model.
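A minimal sketch of this greedy selection, assuming a hypothetical precomputed matrix contrib of shape (n_samples, C_in) whose entries are the per-channel contributions to sampled output values of layer i+2 (the quantity of Equation 29):

import numpy as np

def thinet_greedy_prune(contrib, n_prune):
    """Greedy selection sketch for ThiNet. contrib[s, c] is the contribution of
    input channel c to a sampled output value s of the next layer, so that the
    output is (approximately) contrib.sum(axis=1). Channels are added to the
    pruned subset one at a time, each time picking the channel whose removal
    (together with the already-selected ones) removes the least output energy."""
    n_channels = contrib.shape[1]
    pruned = []
    for _ in range(n_prune):
        best_c, best_err = None, None
        for c in range(n_channels):
            if c in pruned:
                continue
            removed = pruned + [c]
            # squared magnitude of the contribution that would be removed
            err = np.square(contrib[:, removed].sum(axis=1)).sum()
            if best_err is None or err < best_err:
                best_c, best_err = c, err
        pruned.append(best_c)
    return pruned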

The Channel Pruning method also has the goal of minimising the difference in the output (feature map), but it finds the subset of channels to keep with a LASSO regression:

(31)

Here, the mask variable decides whether each channel is pruned or not: if its entry for a channel is 0, that channel is no longer useful. The compression ratio is defined by the ratio between the number of remaining channels and the original number of channels. During the pruning iterations, the weights in Equation 31 are fixed, which leaves only one variable to minimize, the channel mask. The LASSO regression is used to find the mask that minimizes the difference in the output. As in the ThiNet method, these steps must be repeated for every layer to be pruned.
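A minimal sketch of this selection, reusing the same hypothetical contrib matrix as above together with scikit-learn's Lasso; in the original method the remaining weights are also reconstructed by least squares after the mask is found, which is omitted here:

import numpy as np
from sklearn.linear_model import Lasso

def lasso_channel_mask(contrib, target_channels, alpha=1e-4):
    """LASSO-based channel selection sketch. contrib[s, c] is the contribution
    of channel c to a sampled output value s, and y is the original output.
    A LASSO regression finds a sparse per-channel mask beta so that
    contrib @ beta reproduces y; channels with beta == 0 are pruned."""
    y = contrib.sum(axis=1)                       # original layer output (all channels on)
    lasso = Lasso(alpha=alpha, fit_intercept=False)
    lasso.fit(contrib, y)
    beta = lasso.coef_
    keep = np.flatnonzero(beta != 0)
    # in practice alpha is tuned (e.g. by bisection) until len(keep) == target_channels
    return beta, keep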

The entropy pruning Entropy_Pruning method is also a layer-by-layer algorithm, but instead of trying to minimize the difference in the output like the two methods above, it uses a different criterion based on the entropy of the feature maps produced by the channels. The idea behind this criterion is that a channel whose feature maps have low entropy will most likely be less important in the decision of the network. The pruning algorithm for a layer can be done in four steps:

  1. Convert the activation maps of size c × h × w into a vector of dimension c by using global average pooling, where c represents the number of output channels of the observed layer and h and w are the height and width of the activation maps.

  2. Repeat step 1 for n images in order to build a matrix of dimension n × c, where n is the number of samples presented. The samples can be the whole training set or a subset.

  3. For each channel, divide the distribution of pooled values into m bins and calculate the entropy using Equation 32:

     H_j = − Σ_{i=1}^{m} p_i log(p_i)        (32)

  4. Rank the channels from the smallest entropy value to the largest. Prune the smallest channels according to the desired compression ratio. Repeat steps 1 to 4 for the other layers in the model.
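A minimal sketch of this entropy criterion, assuming pooled is the (n, c) matrix of globally average-pooled activations built in steps 1-2:

import numpy as np

def entropy_scores(pooled, n_bins=100):
    """Entropy criterion sketch. For each channel, the distribution of pooled
    activation values over the n sample images is histogrammed into n_bins and
    the entropy of Eq. (32) is computed; low-entropy channels are considered
    less informative and are pruned first."""
    scores = []
    for c in range(pooled.shape[1]):
        hist, _ = np.histogram(pooled[:, c], bins=n_bins)
        p = hist / hist.sum()
        p = p[p > 0]                               # avoid log(0)
        scores.append(-(p * np.log(p)).sum())
    return np.asarray(scores)                      # rank ascending, prune the smallest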

Molchanov’s Molchanov pruning algorithm as the goal to minimise the cost function. So, their technique is to approximate the change in the cost function if the channel is pruned. Equation 33 demonstrates this idea.

     |ΔC(H_j)| = |C(D, H_j = 0) − C(D, H_j)|        (33)

where C represents the cost function and D the dataset; C(D, H_j = 0) is the cost value if channel j is pruned. The idea is to find a subset of channels to prune while minimizing the difference with the original cost function in which these channels were used. This is represented in the equation by the difference between the cost function with the channel excluded and the cost function with the channel included. Using a Taylor expansion to approximate this difference, the authors show that the change in the cost function when a channel is pruned can be estimated from the activation (feature map) and the gradient of the channel, both of which can be obtained during backpropagation:

     Θ(H_j) = | (1/M) Σ_m (∂C/∂H_{j,m}) · H_{j,m} |        (34)

where M is the number of elements of the feature map H_j and the sum runs over these elements.

Each channel's ranking value is normalized using an L2 normalization. This normalization is done on each layer individually in order to facilitate the comparison between layers, since this method ranks channels across all layers.
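A minimal sketch of this Taylor criterion, assuming the activation and its gradient for one layer have been captured (e.g., with forward/backward hooks):

import torch

def taylor_scores(feature_map, feature_grad):
    """Molchanov Taylor criterion sketch. feature_map is the activation H of
    shape (B, C, h, w) and feature_grad is dC/dH captured during
    backpropagation. The importance of channel c is the absolute value of the
    average of (activation * gradient), as in Eq. (34)."""
    prod = (feature_map * feature_grad).mean(dim=(0, 2, 3))   # average over batch and spatial dims
    scores = prod.abs()
    # per-layer L2 normalization so that scores are comparable across layers
    return scores / (scores.norm(p=2) + 1e-8)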

3.3 Summary of experimental results in literature:

Dataset CIFAR10
Feature Extractor ResNet56
Algorithm Original Pruned
R-1 (%) GFLOPS Parameters (M) R-1 (%) GFLOPS Parameters (M)
Hao Li HaoLi 93.04 0.125 0.85 93.06 0.091 0.73
Auto-Balanced AutoBalanced 93.93 0.142 N/D 92.94 0.055 N/D
Redundant channel Redundant_Pruning 93.39 0.125 0.85 93.12 0.091 0.65
PP PlayAndPrune 93.39 0.125 0.85 93.09 0.039 N/D
FPGM FPGM 93.39 0.125 0.85 92.73 0.059 N/D
Dataset ImageNet
Feature Extractor VGG16
Algorithm Original Pruned
R-1 (%) GFLOPS Parameters (M) R-1 (%) GFLOPS Parameters (M)
ThiNet ThiNet 90.01 30.94 138.34 89.41 9.58 131.44
Molchanov Molchanov 89.30 30.96 N/D 87 11.5 N/D
HaoLi HaoLi 90.01 30.94 138.34 89.13 9.58 130.87
Channel Pruning ChannelPruning 90.01 30.94 138.34 88.1 7.03 131.44
Dataset ImageNet
Feature Extractor ResNet50
Algorithm Original Pruned
R-1 (%) GFLOPS Parameters (M) R-1 (%) GFLOPS Parameters (M)
Entropy Entropy_Pruning 72.88 3.86 25.56 70.84 2.52 17.38
ThiNet ThiNet 75.30 7.72 25.56 72.03 3.41 138
FPGM FPGM 75.30 7.72 25.56 74.83 3.58 N/D
Table 3: Comparison of rank-1 accuracy and network complexity, in terms of GFLOPS and parameters, taken from the literature.

For Channel Pruning ChannelPruning , the FLOPS are not specified; instead, the authors report the time required for one forward pass. They found that the network could be sped up by 2.5 times with an increased error of 1% on the COCO dataset using VGG16. To allow an easier comparison, we include the results produced by ThiNet for the Channel Pruning method.

If we compare Hao Li HaoLi , which is weight-based, against Auto-Balanced AutoBalanced , which is also weight-based but uses a regularization term, we can see that higher compression ratios are obtained when using regularization. If we compare weight-based and feature map based approaches, the feature map based ones seem to give higher compression while maintaining similar accuracy.

Comparing the two feature map based state-of-the-art methods, ThiNet ThiNet and Molchanov Molchanov , their performance is roughly similar. The difference in FLOPS between these two methods can probably be explained by the way they were calculated.

3.4 Critical analysis of pruning methods:

The main difference between weight-based and feature map based methods is that weight-based methods do not depend on a dataset, since weight statistics do not depend on the output of the network. Feature map based methods, in contrast, need a dataset in order to compute either the output of a convolutional layer or its gradients.

The choice of criterion is usually a trade-off between the simplicity of the pruning steps, at the cost of some accuracy, and more complex criteria that require far more computation in order to keep a high accuracy. If training and pruning time is an issue, e.g., in an environment that requires fast deployment, simple criteria like the L1 and L2 norms are better suited. If there are no time constraints, more complex pruning criteria, like the minimization of the difference in activations or in the cost function, tend to outperform the simple criteria, but require much more computation and time.

Some techniques also differ in how channels are pruned: some prune layer by layer HaoLi ; Pro_Soft_Pruning and others prune across layers Molchanov . One difference between across-layer and layer-by-layer pruning is the balance of the pruning. Across-layer pruning does not prune each layer evenly, and may prune lower-level layers more than higher-level ones, or vice versa. Depending on how the CNN architecture is built and on the pruning algorithm, pruning across layers may therefore not achieve the desired reduction, whereas layer-by-layer pruning guarantees that every layer is pruned, and thus a more even reduction at each layer.

Recently, some techniques FPGM ; Pro_Soft_Pruning have also adopted a new pruning scheme called soft pruning. Soft pruning differs from hard pruning in that the pruned channels are only reset to zero instead of being completely removed, which means that soft-pruned channels have a chance to recover. These techniques have been shown to achieve state-of-the-art performance.

3.5 Design scenarios with pruning:

Most deep learning person ReID applications use pre-training and then fine-tune the model to the task domain. Pre-training is done on a very large dataset in order to bring the CNN parameters closer to an optimal region. In most cases, the CNNs are pre-trained on ImageNet, since this dataset has a large number of training samples from many different classes, which improves the model's generalization ability. In person ReID, pre-trained models have proven more successful than models trained from scratch directly on the task dataset. Once the model has been pre-trained, the next step is fine-tuning, which maps the model's parameters from the pre-training domain to the task domain. For this to work well, the task dataset should not differ too much from the pre-training dataset. Fine-tuning is a complex step, which is why Finetune_Best proposed best fine-tuning practices depending on the differences between the pre-training dataset and the domain dataset. In that article, the authors compute a similarity score between the pre-training set and the task dataset in order to guide the fine-tuning from one domain to another. They propose two metrics to measure the similarity: the first is the cosine distance and the second is the maximum mean discrepancy (MMD) based on the Euclidean distance. To calculate these metrics, the feature embeddings of each dataset are averaged and the metrics are computed between the two resulting vectors. Using these metrics and the number of samples per class in the domain dataset, the authors propose to either train the whole network or freeze the feature extractor and fine-tune only the classifier. Based on these measurements, we chose to fine-tune all the layers for all the datasets.

Figure 3: Scenarios for pruning and training a CNN.

Pruning a neural network can be done at either training step (pre-training or fine-tuning) of our models. We identify four possible scenarios for pruning, as shown in Figure 3. The first scenario, which is the one used in our main experiments, is to prune the CNN on the pre-training dataset; the idea is to use a much larger dataset to guide the selection of the least important channels. The second scenario is to prune on the pre-training set, fine-tune until the performance metrics are in an acceptable region, and then prune again on the task dataset; the goal of this strategy is to remove channels that do not contribute to the task while keeping the ones that do. The third scenario prunes only on the task dataset, after fine-tuning; the objective is to accelerate the training time, since pruning and retraining on a large dataset is time consuming. Finally, the last scenario is to prune on the task dataset before fine-tuning, again with the goal of accelerating training. Our goal is also to determine which scenario best reduces the complexity of the model while maintaining good performance on the task.

The progressive soft pruning method is a very interesting pruning scheme, since the model can be pruned while performing the fine-tuning steps. This scheme is advantageous in terms of training time, since the prune, retrain and fine-tune steps are combined into one. In Figure 3, progressive soft pruning would be represented by combining the prune and retrain boxes into one for Scenario 1; for Scenarios 2 and 3, the fine-tune, prune and retrain boxes would be combined into one. Scenario 4 is not applicable to PSFP: since pruning, retraining and fine-tuning form a single step, it is impossible to prune the network by ranking the channels with the task dataset and only then fine-tune it.

4 Experimental Methodology

In this section, we present the experimental methodology used to validate the pruning methods. Our experiments are divided into two main parts. First, we experiment on a large-scale dataset, i.e., ImageNet, in order to find the best pruning methods under the same experimental protocol. Second, we test the pruning algorithms on a person re-identification problem to assess the advantage of using a pruned model compared to a smaller model. The following subsections present the datasets, the evaluation metrics and the experimental protocol. The results for pruning on the ImageNet dataset and on the ReID datasets are then presented.

4.1 Datasets:

Four publicly available datasets are considered for the experiments, namely ImageNet ILSVRC12 , Market-1501 market1501 , DukeMTMC-reID DukeMTMC and CUHK03-NP cuhk03 . ImageNet, a large-scale dataset, is used for pre-training, and the other (small-scale) datasets are used for the person re-identification experiments.

  • ImageNet (ILSVRC2012) ILSVRC12 is composed of two parts. The first part is used for training the model and the second part is used for validation/testing. There are 1.2M images for training and 50k for validation. The ILSVRC2012 dataset contains 1000 classes of natural images.

  • Market-1501 market1501 is one of the largest public benchmark datasets for person re-identification. It contains 1501 identities which are captured by six different cameras, and 32,668 pedestrian image bounding-boxes obtained using the Deformable Part Models (DPM) pedestrian detector. Each person has 3.6 images on average at each viewpoint. The dataset is split into two parts: 750 identities are utilized for training and the remaining 751 identities are used for testing. We follow the official testing protocol where 3,368 query images are selected as probe set to find the correct match across 19,732 reference gallery images.

  • CUHK03-NP cuhk03 consists of 14,096 images of 1,467 identities. Each person is captured using two cameras on the CUHK campus, and has an average of 4.8 images in each camera. The dataset provides both manually labeled bounding boxes and DPM-detected bounding boxes. In this paper, both experimental results on ‘labeled’ and ‘detected’ data are presented. We follow the new training protocol proposed in zhong2017re , similar to partitions of Market1501 dataset. The new protocol splits the dataset into training and testing sets, which consist of 767 and 700 identities, respectively.

  • DukeMTMC-reID DukeMTMC is constructed from the multi-camera tracking dataset– DukeMTMC. It contains 1,812 identities. We follow the standard splitting protocol proposed in zheng2017unlabeled where 702 identities are used as the training set and the remaining 1,110 identities as the testing set.

4.2 Pruning methods:

For our experiments, we compare five pruning methods in order to determine which technique gives the best compression ratio while maintaining good performance on the person ReID task. Our choice was based on the following criteria: the results reported in the original articles, the coverage of most families of the taxonomy, and the complexity of the ranking criterion and its implementation. We selected Hao Li HaoLi and Entropy Entropy_Pruning as representatives of techniques that prune the network only once and then fine-tune it. Although Molchanov Molchanov is an iterative pruning technique, we chose it for its theoretical grounding and because it requires a single compression ratio. We experiment with the Auto-Balanced algorithm AutoBalanced because its pruning is done by adding a regularization term to the original loss function, leaving the pruning process to the optimization. We also evaluate the Progressive Soft Pruning Pro_Soft_Pruning method, since it can prune directly from scratch and progressively prune during training, which makes it suitable to test on our target operational domain.

Implementation Details. For all the datasets, images are resized to the same input size. Like many state-of-the-art ReID approaches hermans2017defense ; chen2017beyond ; geng2016deep ; cheng2016person ; liu2017end , we use ResNet50 he2016deep as the backbone architecture, where the final layer is removed to obtain a 2048-dimensional feature representation. We apply all the pruning methods to the ResNet50 architecture. In order to compare the algorithms more easily, we adopted a pruning schedule that is the same for all methods. First, we prune around 5% of the total number of channels at each iteration. This number was chosen to accelerate the pruning iterations by pruning a substantial number of channels, while remaining small enough not to prune too much of the model at once, which would greatly affect the accuracy and the retraining time. We also prune only the first two layers of every residual block. The last layer of each convolutional block is not pruned, since the dimension of the residual connection would otherwise change, and this residual part cannot be pruned with the same indices as the last convolutional layer of the block. For the layer-by-layer methods, we chose to use a single compression rate for every layer in order to simplify our experiments and the comparison between methods. For each pruning iteration, we use 1 epoch to rank the channels and 4 epochs of retraining before moving to the next iteration.

This pruning schedule was used for every method on ImageNet in order to produce the pruned models used in the person ReID experiments. We discarded the pruning iterations where the accuracy was too low, since there was no advantage in using these networks for our task. Once the pruning was done for every method, we retrained every model on ImageNet for around 25 epochs to regain the accuracy lost by pruning. Each of the pruned models was then fine-tuned on the ReID datasets. We also fine-tuned pre-trained ResNet18 and ResNet34 models on these ReID datasets in order to assess the advantages of using pruned models compared to shallower networks.

4.3 Performance metrics:

Following the common evaluation practice hermans2017defense ; chen2017beyond ; geng2016deep ; cheng2016person ; liu2017end , we use the rank-1 accuracy of the cumulative matching characteristics (CMC) and the mean average precision (mAP) to evaluate ReID accuracy. The CMC represents the expectation of finding a correct match in the top n ranks. When multiple ground-truth matches are available, the CMC cannot measure how well the gallery images are ranked, so we also report the mAP scores.
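For reference, a minimal sketch of both metrics on a precomputed query-gallery distance matrix; the junk/camera filtering of the official protocols is omitted for brevity:

import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """Sketch of the two ReID metrics used here: rank-1 accuracy of the CMC
    curve and mean average precision. dist[i, j] is the distance between
    query i and gallery j."""
    rank1_hits, aps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # gallery ranked by increasing distance
        matches = (g_ids[order] == q_ids[i]).astype(float)
        rank1_hits.append(matches[0])
        if matches.sum() == 0:
            continue
        precision_at_hit = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision_at_hit * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(aps))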

As in the state-of-the-art pruning methods  HaoLi ; Entropy_Pruning ; ChannelPruning ; Pro_Soft_Pruning , the FLOPS metric is used to measure model complexity in terms of computational operations. To compare the different models in our experiments, we calculate the number of FLOPS necessary to process one image through the model. We compare FLOPS rather than processing time, since the latter depends on the hardware used. FLOPS is also a better metric than the number of pruned channels, because a pruned channel at the beginning of the network reduces the total number of FLOPS considerably more than a channel in a later layer, as the spatial dimensions are reduced throughout the network. We also use the number of parameters to compare models in terms of the memory needed to store the trained model. This metric is computed by summing the number of weights throughout the model.

5 Experimental Results and Discussion

5.1 Pruning on pre-training data:

Performance measures
Networks Rank-1 (%) GFLOPS Parameters (M)
ResNet50 76.01 6.32 23.48
ResNet34 73.27 6.67 21.28
ResNet18 69.64 3.09 11.12
HaoLi 71.85 2.96 11.90
Molchanov 71.65 3.21 12.09
AutoBalanced 71.97 2.96 11.90
Entropy 71.46 2.96 11.90
PSFP 71.57 2.96 11.90
Table 4: Rank-1 accuracy and complexity of baseline and pruned ResNet CNNs on ImageNet data imagenet_cvpr09 .

Table 4 shows the performance of baseline and pruned ResNet CNNs (the ResNet50 backbone and the smaller ResNet18 and ResNet34 networks) on the ImageNet dataset. Results in this table provide an indication of the benefits of pruning only on a large source-domain pre-training dataset. They suggest that there is little variation between the results of the different pruning techniques. This similarity indicates that efficient pruning relies heavily on the availability of large-scale datasets: since a large-scale dataset was used in this experimental evaluation, the ranking produced by each technique is probably very similar even though the ranking metrics differ. Pruning results with ResNet50 at half the FLOPS are very similar to the state of the art without pruning. For example, ThiNet ThiNet provides a compressed network that yields a rank-1 accuracy of around 72% for half the FLOPS. The pruned networks provide better accuracy than the smaller ResNet18 while having similar computational complexity and memory consumption.

For the person reID datasets, we attempt to preserve the same pruning compression ratio for a fair comparison. The compression (around 50%) was chosen by taking the highest compression level that minimizes the difference in results between the baseline and the pruned models. For the layer-by-layer methods, we prune the same number of filters per layer, with the same number of pruning iterations and fine-tuning iterations: 5% of the filters are pruned at each iteration, and the stopping condition is that 50% of the filters have been pruned away. Methods that prune across layers do not remove the same number of filters in every layer, but reach the same overall 50% reduction.

Table 5 reports the results for Market-1501, DukeMTMC-reID and CUHK03-NP re-identification. The reported results are for Scenario 1. Molchanov has a higher number of FLOPS and parameters than the other methods, which would likely lead to a slower and more memory-consuming model. Out of the five methods, the Hao Li method seems to work best, with the best or close to the best results on the three datasets.

Market-1501 DukeMTMC CUHK03-NP
Networks Parameters GFLOPS mAP R-1 mAP R-1 mAP R-1
ResNet50 23.48 6.32 69.16 85.07 59.46 76.39 47.57 48.43
ResNet34 21.28 6.67 67.44 84.09 58.36 75.45 45.51 47.14
ResNet18 11.12 3.09 61.23 81.18 52.07 71.63 38.27 39.57
HaoLi 11.90 2.96 67.04 84.71 57.51 75.00 44.08 46.50
Molchanov 12.09 3.21 66.35 84.44 57.90 75.72 44.40 46.21
AutoBalanced 11.90 2.96 65.46 83.64 56.45 74.64 41.85 44.21
Entropy 11.90 2.96 65.16 82.39 56.64 74.64 42.44 44.07
PSFP 11.90 2.96 65.92 83.72 56.96 74.66 42.38 45.58
Table 5: Accuracy and complexity of baseline and pruned Siamese networks on the ReID datasets. Mean average precision (mAP) and rank-1 accuracy (R-1) are shown in percent (%).
Figure 4: Comparative ReID performance analysis of the pruning methods for all the ReID datasets: (a,c,e) mAP vs GFLOPS and (b,d,f) Parameters vs Rank-1.

The pruned models also suffer only a small loss in accuracy while considerably reducing the number of FLOPS and parameters. They are faster than the backbone ResNet50 network while having similar performance (within around 1%). Moreover, the pruned models have a number of FLOPS and parameters similar to ResNet18 while obtaining better results on the three performance metrics. This means that pruning a larger model is more advantageous than using a shallower model.

To get a more global view of these results, we refer the reader to the supplementary material for the complete result tables for each pruning iteration. In addition, the plots in Figure 4 show visually which models are better, where the optimal placement is the top right and the worst is the bottom left. There are two plots for each dataset: the first presents mAP vs GFLOPS and the second presents rank-1 vs parameters.

5.2 Pruning on target application data:

Table 6 reports the experimental evaluation of all the scenarios. For a fair comparison, we chose to keep the compression ratio at 5% of the total number of channels per pruning iteration. The Scenario 2 results are produced using the Hao Li Iteration 3 model as the model pruned on the pre-training dataset. Using the same model for the four techniques gives a better idea of which pruning technique is best when pruning directly on the task dataset.

Market-1501 DukeMTMC-reID CUHK03-NP
Methods mAP R-1 GFLOPS Param. | mAP R-1 GFLOPS Param. | mAP R-1 GFLOPS Param.
Scenario 1
HaoLi 67.04 84.71 2.96 11.90 | 57.51 75.00 2.96 11.90 | 44.08 46.50 2.96 11.90
Molchanov 66.35 84.44 3.21 12.09 | 57.90 75.72 3.21 12.09 | 44.40 46.21 3.21 12.09
Entropy 65.16 82.39 2.96 11.90 | 56.64 74.64 2.96 11.90 | 42.44 44.07 2.96 11.90
Auto-Balanced 65.46 83.64 2.96 11.90 | 56.45 74.64 2.96 11.90 | 41.85 44.21 2.96 11.90
PSFP 65.92 83.72 2.96 11.90 | 56.96 74.66 2.96 11.90 | 42.38 45.58 2.96 11.90
Scenario 2
HaoLi 49.18 70.67 2.09 8.95 | 40.91 61.67 2.09 8.95 | 26.29 27.36 2.09 8.95
Molchanov 32.03 53.92 2.11 8.31 | 23.54 40.44 2.02 7.99 | 13.61 13.14 2.05 8.09
Entropy 7.38 17.31 2.09 8.95 | 2.34 6.78 2.09 8.95 | 7.83 7.29 2.09 8.95
Auto-Balanced 47.36 68.74 2.09 8.95 | 43.19 64.14 2.09 8.95 | 26.69 27.36 2.09 8.95
PSFP 65.03 80.85 2.09 8.95 | 53.05 73.20 2.09 8.95 | 42.19 43.86 2.09 8.95
Scenario 3
HaoLi 67.44 84.23 5.08 19.17 | 57.16 74.28 5.08 19.17 | 44.73 47.86 5.08 19.17
Molchanov 63.29 81.38 5.28 19.60 | 44.14 63.33 4.98 18.53 | 36.91 38.86 4.96 18.46
Entropy 60.27 79.99 5.08 19.17 | 52.62 71.72 5.08 19.17 | 38.50 40.14 5.08 19.17
Auto-Balanced 67.49 84.26 5.08 19.17 | 58.14 75.27 5.08 19.17 | 46.54 48.07 5.08 19.17
PSFP 67.68 84.78 5.08 19.17 | 57.51 74.87 5.08 19.17 | 46.25 48.20 5.08 19.17
Scenario 4
HaoLi 31.41 55.70 5.08 19.17 | 27.94 48.79 5.08 19.17 | 14.97 16.07 5.08 19.17
Molchanov 6.39 12.89 5.14 19.10 | 1.18 2.29 4.80 17.89 | 6.28 5.71 4.93 18.35
Entropy 25.41 46.35 5.08 19.17 | 21.08 41.16 5.08 19.17 | 11.31 11.64 5.08 19.17
Auto-Balanced 59.27 78.80 5.08 19.17 | 49.64 67.77 5.08 19.17 | 36.67 38.79 5.08 19.17
PSFP N/A N/A N/A N/A | N/A N/A N/A N/A | N/A N/A N/A N/A
Table 6: Comparison of network complexity (Parameters & GFLOPS), mAP and rank-1 accuracy with different pruning Scenarios on all the ReID datasets.

As Table 6 shows, pruning directly in the target operational domain does not perform as well as pruning on the pre-training dataset. Several observations can be made from these results: (1) pruning and fine-tuning should be done in the same domain, as in Scenarios 1 and 3, regardless of whether that domain is the source or the target operational domain; (2) the lack of data in the target domain limits the pruned network's ability to recover the information lost by pruning the weak channels; (3) with the large-scale source dataset and the Hao Li method, the model could be pruned to the same number of FLOPS as in Scenario 2 (2.09 GFLOPS) while reaching a rank-1 accuracy of 81.95% instead of 70.67%. The Hao Li method also seems better suited to pruning directly on the task dataset than Molchanov and Entropy, which may be because few samples per person are available: the Molchanov and Entropy criteria rely on a subset of samples to decide which channels to prune, whereas the Hao Li method ranks channels by their weights; (4) Scenario 4 is not viable, since the performance of all methods drops drastically; (5) the Auto-Balanced and PSFP techniques outperform the other methods in this setting. For Auto-Balanced, this can be explained by the modified loss, which transfers the information of the pruned channels to the remaining ones; this scheme helps considerably when the number of samples is limited. PSFP appears to be the best-suited algorithm for pruning models on a limited dataset, probably because the pruned channels are only zeroized: the model architecture is preserved, which allows some soft-pruned channels to recover during the fine-tuning phase. The PSFP results are also very similar to those obtained with the Scenario 1 scheme, where the models are pruned on the large-scale source dataset and then fine-tuned on the target domain. The great advantage of this method is that pruning and fine-tuning can be done in a single step, while also skipping the slow step of pruning on the very large ImageNet dataset.
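To illustrate why keeping the architecture intact matters, the sketch below shows a minimal version of progressive soft filter pruning in PyTorch: the weakest filters are only zeroized after each training epoch, so they keep receiving gradient updates and can recover later. The `train_one_epoch` routine and the ramp-up schedule are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def soft_prune(model: nn.Module, rate: float):
    """Zero out (but do not remove) the weakest `rate` fraction of filters in
    every conv layer, ranked by the L2 norm of their weights. The architecture
    is unchanged, so zeroized filters can still be updated and 'recover'."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            norms = module.weight.detach().pow(2).sum(dim=(1, 2, 3)).sqrt()
            k = int(rate * module.out_channels)
            if k == 0:
                continue
            drop = torch.argsort(norms)[:k]
            with torch.no_grad():
                module.weight[drop] = 0.0
                if module.bias is not None:
                    module.bias[drop] = 0.0

# Progressive schedule sketch: the pruning rate ramps up towards the target
# over the first `ramp_epochs`, and the weakest filters are re-zeroized after
# every epoch (`train_one_epoch` is a user-provided training routine).
# for epoch in range(num_epochs):
#     train_one_epoch(model, train_loader)
#     soft_prune(model, rate=target_rate * min(1.0, epoch / ramp_epochs))
```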

To compare the scenarios further, we used two compression levels: around half the FLOPS of the original ResNet50 model (C1) and around one third (C2). For the first compression level, the Scenario 2 model uses the second-iteration Hao Li model as the model pruned on ImageNet; for the second level, it uses the third iteration. The results of these experiments are given in Tables 7, 8 and 9.

                                      HaoLi              Auto-Balanced      PSFP
Scenarios        Param. (M)  GFLOPS   mAP (%)  R-1 (%)   mAP (%)  R-1 (%)   mAP (%)  R-1 (%)
Scenario 1 (C1)  11.90       2.96     67.04    84.71     65.46    83.64     65.92    83.72
Scenario 2 (C1)  11.90       2.96     47.46    69.36     47.57    70.46     65.55    82.69
Scenario 3 (C1)  11.90       2.96     53.22    72.21     54.73    74.05     65.88    82.19
Scenario 1 (C2)   8.95       2.09     63.63    81.95     63.16    81.29     63.58    82.47
Scenario 2 (C2)   8.95       2.09     49.18    70.67     47.36    68.74     65.03    80.85
Scenario 3 (C2)   8.95       2.09     41.93    62.62     48.30    69.00     65.88    82.91
Table 7: Comparison of network complexity (Parameters & GFLOPS), mAP and rank-1 accuracy with different pruning compression ratios of different scenarios on Market1501 dataset.
                                      HaoLi              Auto-Balanced      PSFP
Scenarios        Param. (M)  GFLOPS   mAP (%)  R-1 (%)   mAP (%)  R-1 (%)   mAP (%)  R-1 (%)
Scenario 1 (C1)  11.90       2.96     57.51    75.00     56.64    74.64     56.96    74.66
Scenario 2 (C1)  11.90       2.96     40.60    59.47     41.09    61.09     53.71    71.90
Scenario 3 (C1)  11.90       2.96     46.88    65.93     45.54    66.16     56.62    74.09
Scenario 1 (C2)   8.95       2.09     55.35    74.69     55.10    72.94     55.22    73.57
Scenario 2 (C2)   8.95       2.09     40.91    61.67     43.19    64.14     53.05    73.20
Scenario 3 (C2)   8.95       2.09     39.17    58.71     34.80    54.58     56.77    73.38
Table 8: Comparison of network complexity (Parameters & GFLOPS), mAP and rank-1 accuracy with different pruning compression ratios of different scenarios on DukeMTMC-reID dataset.
                                      HaoLi              Auto-Balanced      PSFP
Scenarios        Param. (M)  GFLOPS   mAP (%)  R-1 (%)   mAP (%)  R-1 (%)   mAP (%)  R-1 (%)
Scenario 1 (C1)  11.90       2.96     44.08    46.50     41.85    44.21     42.38    45.58
Scenario 2 (C1)  11.90       2.96     27.23    28.43     27.44    29.57     40.47    45.00
Scenario 3 (C1)  11.90       2.96     33.57    36.07     33.34    35.29     40.66    44.57
Scenario 1 (C2)   8.95       2.09     37.83    39.71     38.51    40.57     38.76    40.52
Scenario 2 (C2)   8.95       2.09     26.29    27.36     26.69    27.36     42.19    43.86
Scenario 3 (C2)   8.95       2.09     26.79    28.29     26.50    27.14     40.31    40.14
Table 9: Comparison of network complexity (Parameters & GFLOPS), mAP and rank-1 accuracy with different pruning compression ratios of different scenarios on CUHK03-NP dataset.

Tables 7, 8 and 9 show that Scenario 1 is clearly the best one, since its results outperform the other scenarios for every method and every dataset. Comparing Scenarios 2 and 3 is harder, because Scenario 2 can be run with many configurations to reach a model similar to that of Scenario 3: one can prune more or less on the large-scale source dataset. The Scenario 2 results also depend on which pruned model is chosen on the large-scale source set. With the first compression level, using the second iteration of the Hao Li method, the Scenario 2 results are worse than pruning only on the target operational dataset (Scenario 3); but with the third iteration, as shown by the second compression level, the Scenario 2 results are better than those of Scenario 3.

To further analyze pruning in the target operational domain, we apply the best fine-tuning practices proposed by Chu et al. (Finetune_Best, ), presented in Section 3.5. We computed their domain-similarity metrics (cosine distance and MMD) between ImageNet and Market-1501. Given these metrics, and the fact that Market-1501 has fewer than 20 samples per class, the authors' guidelines suggest freezing the feature extractor during fine-tuning to avoid overfitting, since our task dataset is small and close to our large-scale pre-training dataset. Because the problem of a small dataset appeared during the retraining phase of the pruned network, we tried pruning one layer at a time while freezing the others during retraining. The goal of this strategy is to force the pruned layer to relearn the lost information while keeping the other layers in the same optimal region as the baseline model. This strategy was tried for Scenario 2 with the Hao Li method: we pruned layer 5 of ResNet50 while freezing the rest of the network. The model was pruned to 2.61 GFLOPS and reached a rank-1 accuracy of 76.10%. This experiment shows that the effects of pruning can be limited by using a layer-by-layer approach and freezing the other layers to regain accuracy. The drawback of this scheme is its time cost: pruning and retraining each layer to the desired compression ratio is a long and tedious task compared to processing the whole model in one pass.
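A minimal sketch of this freeze-all-but-one-layer strategy is given below. It assumes torchvision's ResNet-50, where the last residual stage (conv5_x, what we call layer 5 above when counting the stem as the first layer) is named `layer4`; the stage mapping and optimizer settings are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

def freeze_all_but(model: nn.Module, trainable_prefix: str) -> None:
    """Freeze every parameter except those whose name starts with the given
    prefix, so only the pruned stage is updated during retraining."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefix)

model = models.resnet50()          # backbone of the Siamese feature extractor
freeze_all_but(model, "layer4")    # retrain only the pruned (last) stage

# The optimizer should only see the unfrozen parameters:
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
```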

6 Conclusion

In this paper, we provide a survey of state-of-the-art pruning approaches that are suitable for compressing Siamese networks for person ReID applications, in terms of the criteria used to select channels and the strategies used to reduce them. In addition, we propose different scenarios, or pipelines, for leveraging a pruning method during the deployment of a network for a target application. Experimental evaluations on multiple benchmark source and target datasets indicate that pruning can considerably reduce network complexity (number of FLOPS and parameters) while maintaining a high level of accuracy. They also suggest that pruning a larger CNN can provide significantly better performance than fine-tuning a smaller one. A key observation from the scenario-based evaluations is that pruning and fine-tuning should be performed in the same domain.

Future experiments could use smaller pruning steps per iteration in order to reduce the impact of pruning on knowledge corruption. The retraining of the pruned networks could also be improved by adding a learning rate decay. Layer-by-layer methods with a different compression ratio for each layer could improve the results, since some layers are more resilient to pruning than others. Techniques that freeze parts of the network can improve accuracy, but drastically increase the time complexity of the pruning and retraining phases. The soft pruning method could also benefit from a better selection criterion, e.g., a gradient-based approach instead of the norm of the channel weights (sketched below). Finally, another interesting future experiment would be to avoid the costly pruning on the large pre-training dataset and rely only on the progressive soft pruning scheme, to see whether similar results can be obtained with higher compression ratios.
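As a sketch of what such a gradient-based criterion could look like (in the spirit of Molchanov et al.'s Taylor-expansion criterion, not a definitive implementation), one can accumulate |activation x gradient| per output channel over a few held-out batches and soft-prune the channels with the lowest scores. The class below is a hypothetical helper built on PyTorch hooks.

```python
import torch
import torch.nn as nn

class TaylorScore:
    """Accumulate a per-filter importance score |a * dL/da|, averaged over
    batch and spatial dimensions, for one convolutional layer."""
    def __init__(self, conv: nn.Conv2d):
        self.scores = torch.zeros(conv.out_channels)
        conv.register_forward_hook(self._on_forward)

    def _on_forward(self, module, inputs, output):
        act = output.detach()

        def _accumulate(grad):
            # Called with dL/d(output) during backward; returns None so the
            # gradient itself is left unchanged.
            self.scores += (act * grad).abs().mean(dim=(0, 2, 3)).cpu()

        output.register_hook(_accumulate)

# Usage sketch: attach a TaylorScore to every conv layer, run a few
# forward/backward passes on held-out ReID data, then soft-prune the filters
# with the lowest accumulated scores instead of the lowest weight norms.
# scorers = {n: TaylorScore(m) for n, m in model.named_modules()
#            if isinstance(m, nn.Conv2d)}
```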

References

  • (1) J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, K. Murphy, Speed/accuracy trade-offs for modern convolutional object detectors, in: CVPR, 2017.
  • (2) E. Ahmed, M. Jones, T. K. Marks, An improved deep learning architecture for person re-identification, in: CVPR, 2015.
  • (3) A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv, 2017.
  • (4) R. R. Varior, M. Haloi, G. Wang, Gated siamese convolutional neural network architecture for human re-identification, in: ECCV, 2016.
  • (5) W. Chen, X. Chen, J. Zhang, K. Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, in: CVPR, 2017.
  • (6) M. Geng, Y. Wang, T. Xiang, Y. Tian, Deep transfer learning for person re-identification, arXiv, 2016.
  • (7) D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, Person re-identification by multi-channel parts-based cnn with improved triplet loss function, in: CVPR, 2016.
  • (8) H. Liu, J. Feng, M. Qi, J. Jiang, S. Yan, End-to-end comparative attention networks for person re-identification, IEEE Trans. on TIP 26 (7) (2017) 3492–3506.
  • (9) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
  • (10) Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, R. Feris, Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification, in: CVPR, 2017.
  • (11) R. Rigamonti, A. Sironi, V. Lepetit, P. Fua, Learning separable filters, in: CVPR, 2013.
  • (12) E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, R. Fergus, Exploiting linear structure within convolutional networks for efficient evaluation, in: NIPS, 2014.
  • (13) M. Jaderberg, A. Vedaldi, A. Zisserman, Speeding up convolutional neural networks with low rank expansions, BMVC, 2014.
  • (14) V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, V. Lempitsky, Speeding-up convolutional neural networks using fine-tuned cp-decomposition, arXiv, 2014.
  • (15) C. Tai, T. Xiao, Y. Zhang, X. Wang, et al., Convolutional neural networks with low-rank regularization, arXiv, 2015.
  • (16) T. Cohen, M. Welling, Group equivariant convolutional networks, in: ICML, 2016.
  • (17) H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double q-learning, in: AAAI, 2016.
  • (18) W. Shang, K. Sohn, D. Almeida, H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in: ICML, 2016.
  • (19) S. Dieleman, J. De Fauw, K. Kavukcuoglu, Exploiting cyclic symmetry in convolutional neural networks, arXiv, 2016.
  • (20) C. Buciluǎ, R. Caruana, A. Niculescu-Mizil, Model compression, in: ACM SIGKDD, 2006.
  • (21) G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv, 2015.
  • (22) P. Luo, Z. Zhu, Z. Liu, X. Wang, X. Tang, Face model compression by distilling knowledge from neurons, in: AAAI, 2016.
  • (23) T. Chen, I. Goodfellow, J. Shlens, Net2net: Accelerating learning via knowledge transfer, arXiv, 2015.
  • (24) S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, ICLR.
  • (25) Y. Gong, L. Liu, M. Yang, L. Bourdev, Compressing deep convolutional networks using vector quantization, ICLR.
  • (26) W. Chen, J. Wilson, S. Tyree, K. Weinberger, Y. Chen, Compressing neural networks with the hashing trick, in: ICML, 2015.
  • (27) Y. LeCun, J. S. Denker, S. A. Solla, Optimal brain damage, in: NIPS, 1990.
  • (28) B. Hassibi, D. G. Stork, Second order derivatives for network pruning: Optimal brain surgeon, in: NIPS, 1993.
  • (29) S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in: NIPS, 2015.
  • (30) P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient transfer learning, arXiv, 2016.
  • (31) H. Li, A. Kadav, I. Durdanovic, H. Samet, H. P. Graf, Pruning filters for efficient convnets, arXiv, 2016.
  • (32) J.-H. Luo, J. Wu, An entropy-based pruning method for cnn compression, arXiv, 2017.
  • (33) Y. He, X. Zhang, J. Sun, Channel pruning for accelerating very deep neural networks, in: ICCV, 2017.
  • (34) Y. He, X. Dong, G. Kang, Y. Fu, Y. Yang, Progressive deep neural networks acceleration via soft filter pruning, arXiv, 2018.
  • (35) L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: ICCV, 2015.
  • (36) W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: Deep filter pairing neural network for person re-identification, in: CVPR, 2014.
  • (37) E. Ristani, F. Solera, R. Zou, R. Cucchiara, C. Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, in: ECCV-WK, 2016.
  • (38) J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, in: NIPS, 1994.
  • (39) D. Yi, Z. Lei, S. Liao, S. Z. Li, Deep metric learning for person re-identification, in: ICPR, 2014.
  • (40) B. O. Ayinde, J. M. Zurada, Building efficient convnets using redundant feature pruning, arXiv, 2018.
  • (41) P. Singh, V. K. Verma, P. Rai, V. P. Namboodiri, Play and prune: Adaptive filter pruning for deep model compression, arXiv, 2019.
  • (42) Y. He, P. Liu, Z. Wang, Y. Yang, Pruning filter via geometric median for deep convolutional neural networks acceleration, arXiv, 2018.
  • (43) X. Ding, G. Ding, J. Han, S. Tang, Auto-balanced filter pruning for efficient convolutional neural networks, AAAI, 2018.
  • (44) J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, W. Lin, Thinet: pruning cnn filters for a thinner net, TPAMI, 2018.
  • (45) B. Chu, V. Madhavan, O. Beijbom, J. Hoffman, T. Darrell, Best practices for fine-tuning visual classifiers to new domains, ECCV, 2016.
  • (46) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, IJCV, 2015.
  • (47) Z. Zhong, L. Zheng, D. Cao, S. Li, Re-ranking person re-identification with k-reciprocal encoding, in: CVPR, 2017.
  • (48) Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by gan improve the person re-identification baseline in vitro, in: ICCV, 2017.
  • (49) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, in: CVPR09, 2009.