1 Introduction
In this paper, we explore the premise that less useful structures/features (including their possible redundancies) in overparameterized deep nets can be pruned away to boost efficiency and accuracy. We argue that optimal deep features should be task-dependent. Prior to deep learning, features were usually hand-engineered with domain-specific knowledge. With the success of deep learning, we no longer need to handcraft features, but people still handcraft various architectures, which impacts both the quality and quantity of the features to be learned. Some features learned via arbitrary architectures may be less useful for the task at hand. Such features (parameters) not only add to the storage and computational burden but may also skew the data analysis (e.g. image classification) or result in overfitting. Most of today's successful deep architectures are hand-designed for ImageNet. Thus, they may not produce optimal features for other tasks, despite having large enough capacity. Instead of assuming fixed nets' generalizability to various tasks, in this paper we attempt to address the problem through task-specific pruning (feature selection), generating a range of deep models well-adapted to the current task.
Filter- or neuron-level pruning has its advantages. Deep networks learn to construct hierarchical representations. Moving up the layers, high-level motifs that are more global, abstract, and disentangled can be built from simpler low-level patterns [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai, Zeiler and Fergus(2014)]. In this process, the critical building block is the filter, which, through learning, is capable of capturing patterns at a certain scale of abstraction. Higher layers are agnostic as to how the patterns are activated (w.r.t. weights, input, activation details). Single-weight-based approaches run the risk of destroying crucial patterns. Given non-negative inputs, many small negative weights may jointly counteract large positive weights, resulting in a dormant neuron state. Single-weight magnitude-based pruning would discard all the small negative weights before reaching the large positive ones, reversing the state. This issue becomes serious at high pruning rates. Also, instead of setting zeros in weight matrices, filter/neuron pruning removes rows/columns/depths in weight/convolution matrices, leading to direct space and computation savings.
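To make the dormant-neuron argument concrete, consider a small numeric sketch (the values are illustrative and not from the paper):

```python
import numpy as np

# Toy illustration: one neuron with non-negative inputs whose many small
# negative weights jointly cancel one large positive weight.
w = np.array([3.0, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5])
x = np.ones(7)                      # non-negative input pattern

pre_act = w @ x                     # 3.0 - 6*0.5 = 0.0 -> dormant neuron
assert np.isclose(pre_act, 0.0)

# Magnitude-based single-weight pruning removes the small |w| entries first,
# leaving only the large positive weight: the neuron's state is reversed.
mask = np.abs(w) >= 1.0             # prune everything with |w| < 1.0
pre_act_pruned = (w * mask) @ x     # = 3.0 -> strongly active
assert np.isclose(pre_act_pruned, 3.0)
```

Pruning the whole neuron (all seven weights together) avoids this state reversal, which is the filter/neuron-level argument made above.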
In this paper, we develop a Fisher Linear Discriminant Analysis (LDA) [Fisher(1936)] based neuron/filter-level pruning framework that is aware of both the final classification utility and its holistic dependency. Aside from efficiency gains, the pruning approach strategically selects deep features from a dimension-reduction perspective, which potentially leads to accuracy boosts. Key novel contributions that distinguish our approach from others (e.g. [Han et al.(2015b)Han, Pool, Tran, and Dally, Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Tian et al.(2017)Tian, Arbel, and Clark]) include: (1) Our pruning is task-dependent. It has an LDA-based neuron utility measure for pruning that is directly related to final task-specific classification. The LDA-based utility is calculated, unravelled, and traced backwards from the final (hidden) layer, where the linear assumption of LDA is more reasonable and variances are more disentangled [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai]. These two factors make our LDA-based pruning directly along neuron dimensions well-grounded (we show this in Sec. 3.1 through solving a generalized eigenvalue problem). In contrast, most previous pruning approaches have task-blind utility measures (e.g. magnitudes of weights, variances, activations). They pay enough attention to the complexity itself, but less attention to whether the complexity change follows a task-desirable direction. (2) Through deep discriminant analysis, our approach determines how many filters, and of what types, are appropriate in a given layer. By pruning deep modules, it provides a novel strategy for architecture design. This is different from popular compact structures that employ random 1×1 filter blocks to reduce the data dimension to some size k. A small k may cut the information flow to higher layers, while a large k may lead to redundancy/overfitting/interference. Such arbitrariness also exists for other filter types. (3) The approach presented here handles a wide variety of structures, such as convolutional, fully-connected, modular, and hybrid ones, and prunes a full network in an end-to-end manner. Parameters easily explode when neurons are fully connected (e.g. over 90% of the weights are FC weights in AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and VGG16 [Simonyan and Zisserman(2015)]). By strategically pruning FC weights, we offer a new alternative to weight sharing for parameter reduction, one which can preserve useful location information contributing to task utility. (4) The proposed method derives a feature subspace that is invariant and robust to task-unrelated factors (e.g. lighting). In our experiments on general and domain-specific datasets (CIFAR100, Adience, and LFWA), we show how the proposed method leads to great complexity reductions. It brings down the total VGG16 size by 98–99% and that of the compact GoogLeNet by up to 82% without much accuracy loss (<1%). The corresponding FLOP reduction rates are as high as 83% and 62%, respectively. Additionally, we can derive more accurate models at lower complexities. Taking age recognition on Adience as an example, one model is over 3% more accurate than the original net but only about 1/3 of its size. Also, we compare our method with MobileNet, SqueezeNet, [Han et al.(2015b)Han, Pool, Tran, and Dally], and [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf], and show our task-specific Fisher pruning's value.

2 Related Work
Early approaches to neural network pruning date back to the late 1980s. Some pioneering examples include [Pratt(1989), LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel, Hassibi and Stork(1993)]. In recent years, increasing network depths have brought more complexity, which reignited research into network pruning. Han et al. [Han et al.(2015b)Han, Pool, Tran, and Dally] discard weights of small magnitudes. Similarly, approaches that sparsify networks by setting weights to zero include [Srinivas and Babu(2015), Mariet and Sra(2016), Jin et al.(2016)Jin, Yuan, Feng, and Yan, Guo et al.(2016)Guo, Yao, and Chen, Hu et al.(2016)Hu, Peng, Tai, and Tang, Sze et al.(2017)Sze, Yang, and Chen]. With further compression, this sparsity is desirable for storage and transfer purposes. Anwar et al. [Anwar et al.(2015)Anwar, Hwang, and Sung] introduced structured sparsity at different scales. More recently, neuron/filter/channel pruning has gained popularity [Polyak and Wolf(2015), Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Tian et al.(2017)Tian, Arbel, and Clark, Luo et al.(2017)Luo, Wu, and Lin, He et al.(2017)He, Zhang, and Sun], as it directly produces hardware-friendly architectures that not only reduce the requirements on storage space and transmission bandwidth, but also bring down the initially large amount of computation in conv layers. Furthermore, with fewer intermediate feature maps generated and consumed, the number of slow and energy-intensive memory accesses is decreased, rendering the pruned nets more amenable to implementation on mobile devices. Despite the promising pruning rates achieved, most previous methods possess one or both of the following drawbacks: (1) the utility measure for pruning, such as magnitudes of weights, variances, or activations, is task-blind, or at least not directly related to task demands; (2) utilities are often computed locally (relationships within a layer/filter or across all layers may be missed).
In addition to pruning, some approaches constrain space and/or computational complexity through compression, such as Huffman encoding, weight quantization [Han et al.(2015a)Han, Mao, and Dally], and bit-width reduction (e.g. XNOR-Net and BWN [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]). Some decompose filters under a low-rank assumption [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus, Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman, Zhang et al.(2016)Zhang, Zou, He, and Sun]. Another popular method is to adopt compact layers/modules with a random set of 1×1 filters to reduce dimension (e.g. InceptionNet [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich], ResNet [He et al.(2016)He, Zhang, Ren, and Sun], SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer], MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam], and NiN [Lin et al.(2014)Lin, Chen, and Yan]), which may impede information flow or increase redundancy. Most popular compact/pruned architectures focus on popular datasets like ImageNet. However, computer vision tasks are diverse in nature, difficulty, and dataset size. Task-specific pruning can play a crucial role in various applications, especially those with limited data and strict timing requirements (e.g. car forward-collision warning). Inspiration can be drawn from neuroscience findings [Mountcastle et al.(1957), Valiant(2006)] showing that, despite the massive number of neurons in the cerebral cortex, each neuron typically receives inputs from a small task-specific set of other neurons.

3 Task-dependent Deep Fisher Pruning
In this paper, we propose a task-dependent Fisher Linear Discriminant Analysis (LDA) based pruning approach at the neuron/filter level that is aware of the final task-specific classification utility and its holistic cross-layer dependency. We treat pruning as dimensionality reduction in the deep feature space: we disentangle factors of variation and discard those of little or even harmful/interfering utility. The base net is fully pretrained with cross-entropy loss, L2 regularization, and Dropout (which penalizes co-adaptations). The method starts pruning by unravelling useful variances in the last hidden layer, then traces the utility backwards through deconvolution across all layers to weigh the usefulness of each neuron/filter. By abandoning less useful neurons/filters, our approach is capable of deriving optimal structures for a given task, with potential accuracy boosts.
3.1 Task Utility Unravelling from Final Layer
The pruning begins from the final (hidden) layer of the well-trained base net for a number of reasons: (1) This is the only place where task-specific distinguishing utility can be accurately and directly measured; all previous information feeds into this layer. (2) Data in this layer are more likely to be linearly separable (the LDA assumption). (3) Pre-decision neuron activations representing different motifs are shown empirically to fire in a more decorrelated manner than in earlier layers; we will see how this helps later. For each image, an $M$-dimensional firing vector can be calculated in the final hidden layer, which is called a firing instance ($M = 4096$ for VGG16 and $M = 1024$ for GoogLeNet; pooling is applied when necessary). By stacking all such instances from a set of images, the firing data matrix $F$ for that set is obtained (useless zero-variance/duplicate columns are pre-removed). Our aim here is to abandon dimensions of $F$ that possess low or even negative task utility. Inspired by [Belhumeur et al.(1997)Belhumeur, Hespanha, and Kriegman, Li et al.(1999)Li, Kittler, and Matas, Bekios-Calfa et al.(2011)Bekios-Calfa, Buenaposada, and Baumela, Tian et al.(2017)Tian, Arbel, and Clark], Fisher's LDA [Fisher(1936)] is adopted to quantify this utility. The goal is to maximize class separation by finding:

$$W^{*} = \arg\max_{W} \frac{\left|W^{T} S_{B} W\right|}{\left|W^{T} S_{W} W\right|} \qquad (1)$$
where

$$S_{W} = \sum_{c} \widetilde{X}_{c}^{T} \widetilde{X}_{c}, \qquad S_{B} = \widetilde{X}^{T} \widetilde{X} - S_{W} \qquad (2)$$

with $X_{c}$ being the set of observations obtained in the last hidden layer for category $c$, $X$ the full set of observations, and $W$ a matrix that linearly projects the data to a new space spanned by its columns. The tilde sign ($\sim$) denotes a centering operation; for data $X$:

$$\widetilde{X} = X - \frac{1}{n} \mathbf{1}_{n} \mathbf{1}_{n}^{T} X \qquad (3)$$

where $n$ is the number of observations in $X$ and $\mathbf{1}_{n}$ denotes an $n \times 1$ vector of ones. Finding $W^{*}$ involves solving a generalized eigenvalue problem:

$$S_{B} \, w_{i} = \lambda_{i} \, S_{W} \, w_{i} \qquad (4)$$
where $(\lambda_{i}, w_{i})$ represents a generalized eigenpair of the matrix pencil $(S_{B}, S_{W})$, with $w_{i}$ a column of $W^{*}$. If we only consider active neurons with non-duplicate pattern motifs, we find that in the final hidden layer most off-diagonal values in $S_{W}$ and $S_{B}$ are very small. In other words, aside from noisy or meaningless neurons, the firings of neurons representing different motifs in the pre-decision layer are highly decorrelated (disentanglement of latent-space variances [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai, Zeiler and Fergus(2014)]). This corresponds to the intuition that, unlike common primitive features in lower layers, higher layers capture high-level abstractions of various aspects (e.g. car wheel, dog nose, flower petals). The chances of them firing simultaneously are relatively low. In fact, different filter 'motifs' tend to be progressively more global and decorrelated when navigating from low to high layers. The decorrelation trend is caused by the fact that coincidences/agreements in high dimensions can hardly happen by chance. Thus, we assume that $S_{W}$ and $S_{B}$ tend to be diagonal in the top layer. Since inactive neurons are not considered here, Eq. 4 becomes:

$$S_{W}^{-1} S_{B} \, w_{i} = \lambda_{i} \, w_{i} \qquad (5)$$
According to Eq. 5, the columns $w_{i}$ of $W^{*}$ (where $i = 1, \ldots, M$) are the eigenvectors of $S_{W}^{-1} S_{B}$ (diagonal), thus they are standard basis vectors (i.e. the columns of $W^{*}$ and the original neuron dimensions are aligned). The $\lambda_{i}$s are the corresponding eigenvalues with:

$$\lambda_{i} = \frac{\sigma_{B,i}^{2}}{\sigma_{W,i}^{2}} \qquad (6)$$

where $\sigma_{W,i}^{2}$ and $\sigma_{B,i}^{2}$ are the within-class and between-class variances along the $i$th neuron dimension. In other words, the optimal columns that maximize the class separation (Eq. 1) are aligned with (a subset of) the original neuron dimensions. It turns out that, when pruning, we can directly discard neuron dimensions $i$ with small $\lambda_{i}$ (little contribution to Eq. 1) without much information loss.
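Under the diagonal-scatter assumption, the per-neuron utility reduces to a ratio of variances along each neuron dimension and needs no eigendecomposition. A minimal sketch of this computation (function name, shapes, and the small epsilon guard are our own, not the paper's):

```python
import numpy as np

def lda_neuron_utility(F, labels):
    """Per-neuron Fisher utility lambda_i = sigma2_between / sigma2_within,
    assuming (as in Sec. 3.1) that S_B and S_W are effectively diagonal in
    the final hidden layer.  F: (n_samples, M) firing data matrix."""
    F = np.asarray(F, dtype=float)
    mu = F.mean(axis=0)                          # global mean per neuron
    s_w = np.zeros(F.shape[1])                   # within-class scatter (diag)
    s_b = np.zeros(F.shape[1])                   # between-class scatter (diag)
    for c in np.unique(labels):
        Fc = F[labels == c]
        mu_c = Fc.mean(axis=0)
        s_w += ((Fc - mu_c) ** 2).sum(axis=0)
        s_b += len(Fc) * (mu_c - mu) ** 2
    return s_b / np.maximum(s_w, 1e-12)          # guard near-inactive neurons

# Neuron 0 separates the two classes; neuron 1 fires as pure noise.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
F = rng.normal(size=(100, 2))
F[:, 0] += 5.0 * y                               # class-dependent shift
util = lda_neuron_utility(F, y)
assert util[0] > util[1]                         # neuron 1 would be pruned
```

Sorting the neurons by this utility and discarding the low-λ dimensions is exactly the top-layer selection step described above.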
3.2 Utility Tracing for Cross-Layer Pruning
After unravelling the twisted threads of deep variances and selecting dimensions of high task utility, the next step is to propagate the utility across all previous layers to guide pruning. Unlike local approaches, our pruning unit is concerned with a filter's/neuron's contribution to the final utility. In signal processing, deconvolution is used to reverse an unknown filter's effect and recover corrupted sources [Haykin(1994)]. Inspired by this, to recover each neuron's/filter's utility, we trace the final discriminability backwards via deconvolution (from an easily unravelled end) across all layers. Fig. 1 demonstrates our cross-layer task utility tracing. The purpose of recovering/reconstructing the sources contributing to the final task utility in the decision-making layer is different from capturing a certain-order parameter/filter dependency, such as the 1st-order gradient. At convergence, most parameters have zero or small gradients; that does not mean these parameters are useless. Also, gradient dependency is calculated locally in a greedy manner, and structures pruned away based on a local dependency measure can never recover. There are many algorithms to compute or learn deconvolution. Here we use a version for convnets [Zeiler et al.(2011)Zeiler, Taylor, and Fergus, Zeiler and Fergus(2014)]. As an inverse process of convolution (incl. non-linearity and pooling), the unit deconv procedure is composed of unpooling (using max-location switches), non-linear rectification, and reversed convolution (a transpose of the convolution Toeplitz-like matrix under an orthonormal assumption):

$$\widehat{F}^{\,l-1} = (C^{l})^{T} F^{l} \qquad (7)$$

where $l$ indicates the layer, $C^{l}$ is layer $l$'s Toeplitz-like convolution matrix, the $j$th column of $F^{l}$ is the feature vector converted from the layer-$l$ feature maps w.r.t. input $j$, and the $j$th column of $\widehat{F}^{\,l-1}$ is the corresponding reconstructed vector of layer-$(l-1)$ contributing sources to the final utilities. On the channel level:
$$\widehat{f}^{\,l-1}_{c} = \sum_{j=1}^{N} \sum_{k=1}^{K} d^{\,l}_{k,c} * f^{\,l}_{k,j} \qquad (8)$$

where '$*$' means convolution, $c$ indicates a channel, $N$ is the number of training images, $K$ is the feature-map number, and $d^{\,l}_{k,c}$ is a deconv filter piece (determined after pretraining). Our calculated dependency here is data-driven and is pooled over the training set, which models the established phenomenon in neuroscience whereby multiple exposures strengthen the relevant connections (synapses) in the brain (the Hebbian theory
[Hebb(2005)]). We also extend the deconv idea to FC structures, which can be viewed as special conv structures where a layer's input and weights are treated as stacks of conv feature maps and filters that overlap completely (Fig. 2). For modular structures, the idea is the same, except that we need to trace dependencies, i.e. apply deconvolution, for different scales in a group-wise manner. Our full-net pruning, (re)training, and testing are done end-to-end.
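The channel-level tracing of Eq. (8) can be illustrated with a toy 1-D analogue (our own sketch: unpooling and rectification are omitted, and names and shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def trace_utility_1d(util_maps_l, filters):
    """Toy 1-D analogue of Eq. (8): reconstruct layer (l-1) contributing
    sources by convolving layer-l maps with reversed (transposed) filters
    and summing over the output channels k.
    util_maps_l: (K, T) utility-weighted feature maps of layer l
    filters:     (K, C, S) conv filters of layer l (K out, C in, size S)."""
    K, C, S = filters.shape
    T = util_maps_l.shape[1]
    out = np.zeros((C, T + S - 1))
    for k in range(K):
        for c in range(C):
            # reversed convolution = full convolution with the flipped kernel
            out[c] += np.convolve(util_maps_l[k], filters[k, c][::-1])
    return out

rng = np.random.default_rng(1)
maps_l = rng.normal(size=(4, 10))    # K=4 layer-l maps of length 10
w = rng.normal(size=(4, 3, 3))       # K=4 filters, C=3 input channels, size 3
src = trace_utility_1d(maps_l, w)
assert src.shape == (3, 12)          # one source map per layer-(l-1) channel
```

Summing the recovered source maps per input channel (and pooling over training images, as in Eq. (8)) yields the per-channel utility scores used for pruning.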
With all neurons'/filters' utilities for the final discriminability known, the pruning process involves discarding structures that are less useful to the final classification (e.g. structures colored white in Figs. 1, 2). Since feature maps (neuron outputs) correspond to next-layer filter depths (neuron weights), our pruning leads to filter-wise and channel-wise savings simultaneously. When pruning, layer neurons with an LDA-deconv utility score smaller than a threshold are deleted. In an overparameterized model, the number of 'random', noisy, and irrelevant structures/sources explodes exponentially with depth, while the pretrained cross-layer dependencies of the final task utility are sparse. Unlike noise or random patterns, constructing a 'meaningful' motif requires following a specific path(s). It is this cross-layer sparsity of usefulness (related to task difficulty) that greatly contributes to pruning, not just the top layer. Useful neurons'/filters' utilities are high in most large-net layers. To get rid of massive numbers of useless neurons quickly while being cautious in high-utility regions, we set the threshold for layer $l$ as:
$$T^{l} = \bar{u}^{l} + \beta \, \sigma^{l}, \qquad \bar{u}^{l} = \frac{1}{N^{l}} \sum_{i=1}^{N^{l}} u^{l}_{i}, \qquad \sigma^{l} = \sqrt{\frac{1}{N^{l}} \sum_{i=1}^{N^{l}} \left(u^{l}_{i} - \bar{u}^{l}\right)^{2}} \qquad (9)$$

where $\bar{u}^{l}$ is the average utility of the layer-$l$ activations, $u^{l}_{i}$ is the utility score of the $i$th activation, and $N^{l}$ is the total number of layer-$l$ activations (space aware). The assumption is that the utility scores in a certain layer follow a Gaussian-like distribution. The pruning-time hyperparameter $\beta$ is constant over all layers and is directly related to the pruning rate. We can set it either to squeeze the net as much as possible without obvious accuracy loss, to find the 'most accurate' model, or to reach any possible pruning rate according to the resources available and the accuracies expected. In other words, rather than a fixed compact model like SqueezeNet or MobileNet, we offer the flexibility to create models customized to different needs. Generalizability may be sacrificed with reduced capacity. The 'generic' fixed nets follow an ad-hoc direction by using random dimension-reducing filters, while our pruned models are 'adapted' to the current task demands and invariant to unwanted factors. After pruning, retraining with the surviving parameters is needed; doing so iteratively improves convergence.
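A sketch of the layer-wise thresholding under our reading of Eq. (9), i.e. keep a neuron only if its score exceeds the layer mean plus β standard deviations (the exact threshold form and all names here are our assumptions, not the paper's code):

```python
import numpy as np

def prune_mask(utilities, beta):
    """Layer-wise pruning mask: keep neuron i iff u_i >= mean + beta * std,
    assuming a Gaussian-like distribution of utility scores in the layer."""
    u = np.asarray(utilities, dtype=float)
    thresh = u.mean() + beta * u.std()
    return u >= thresh

# Illustrative utility scores for one layer: only two neurons carry utility.
u = np.array([0.1, 0.2, 0.15, 5.0, 4.0, 0.05])
keep = prune_mask(u, beta=0.5)
assert keep.tolist() == [False, False, False, True, True, False]
```

A larger β deletes more neurons; sweeping β produces the family of models of different complexities discussed above.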
4 Experiments and Results
In this paper, we use both conventional and module-based deep nets, e.g. VGG16 [Simonyan and Zisserman(2015)] and the compact Inception-V1, a.k.a. GoogLeNet [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich], to illustrate our task-dependent pruning (feature selection) method. One general object dataset, CIFAR100 [Krizhevsky and Hinton(2009)], as well as two domain-specific datasets of facial traits, Adience [Eidinger et al.(2014)Eidinger, Enbar, and Hassner] and LFWA [Liu et al.(2015)Liu, Luo, Wang, and Tang], are chosen. The most frequently explored attributes, such as age group, gender, and smile/no smile, are selected from the latter two. Base models are pretrained on ILSVRC12 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and FeiFei]. The suggested splits in [Krizhevsky and Hinton(2009), Levi and Hassner(2015), Liu et al.(2015)Liu, Luo, Wang, and Tang] are adopted. For CIFAR100, we use the last 20% of the original training images in each of the 100 categories for validation. For Adience, we use the first three folds for training, and the 4th and 5th folds for validation and testing. All images are pre-resized to 224×224.
4.1 Accuracy vs. Pruning Rates
Fig. 3 panels: (a) CIFAR100, GoogLeNet; (b) Adience Age, GoogLeNet; (c) LFWA Gender, VGG16; (d) LFWA Smile, VGG16.
Fig. 3 demonstrates the relationship between accuracy change and the fraction of parameters pruned. For comparison with our method, we include in the figures two other pruning approaches (Han et al. [Han et al.(2015b)Han, Pool, Tran, and Dally] and Li et al. [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf]) as well as modern compact structures (SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] and MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam]). CIFAR100 accuracy here is Top-1 accuracy.
According to Fig. 3, even at large pruning rates (98–99% for the VGG16 cases, 57–82% for the GoogLeNet cases), our approach still maintains accuracies comparable to the original models (loss <1%). The other two methods suffer from earlier performance degradation, primarily due to their less accurate utility measures (single weights for Han et al. [Han et al.(2015b)Han, Pool, Tran, and Dally] and sums of filter weights for Li et al. [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf]). Additionally, for [Han et al.(2015b)Han, Pool, Tran, and Dally], inner-filter relationships are vulnerable to pruning, especially when the pruning rate is large. This also explains why Li et al. [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] performs slightly better than Han et al. [Han et al.(2015b)Han, Pool, Tran, and Dally] at large pruning rates.
It is worth noting that during the pruning process, the proposed method obtains structures that are more accurate yet lighter than the original net. For instance, in the age case, a model of 1/3 the original size is 3.8% more accurate than the original GoogLeNet. For CIFAR100, we achieve a nearly 2% accuracy boost using 80% of the parameters. Similarly, in the smile case, a 5× smaller model achieves 1.5% higher accuracy than the unpruned VGG16 net. That is to say, in addition to boosting efficiency, our approach provides a way to find high-performance deep models while being mindful of the resources available. Compared to the fixed compact nets, i.e. SqueezeNet and MobileNet, our pruning approach generally enjoys better performance at similar complexities, because dimension reduction in the feature space with a task-aware utility measure is superior to reducing dimension with an arbitrary number of 1×1 filters. This supports our claim that pruning, or feature selection, should be task-specific. Even in the only pruning-time exception, where our approach has a slightly lower accuracy at a size similar to SqueezeNet's, much higher accuracies can be gained by simply adding back a few more parameters.
We also compare our approach with [Tian et al.(2017)Tian, Arbel, and Clark], which applies linear discriminant analysis to intermediate conv features. The comparison (Fig. 4) is in terms of accuracy vs. saved computation (FLOPs) on the LFWA data. As in [Han et al.(2015b)Han, Pool, Tran, and Dally], both multiplications and additions each count as one FLOP.
LFWA Gender:
FLOPs   Param#   Acc Chg
16B     49M      +0.4%
11B     19M      -0.2%
9.5B    13M      -0.7%
8.2B    8.6M     +0.5%
7.5B    6.9M     -0.1%
7.4B    6.5M     -0.7%
7.2B    6.1M     -0.9%
5.2B    3.1M     -1.0%

LFWA Smile:
FLOPs   Param#   Acc Chg
13B     18M      +1.3%
12B     13M      +0.8%
10B     9.6M     +0.4%
8.3B    5M       -0.1%
6.9B    2.7M     +0.2%
6.0B    2.5M     -0.5%
5.5B    1.8M     -0.5%
5.4B    1.7M     -1.5%

Note: FLOPs are shared by both methods; Param# and Acc Chg are for our method. The left and right tables report results on the LFWA gender and smile/not-smile traits, respectively. Low pruning rates, where the performance gap is small, are skipped.
According to Fig. 4, our method enjoys up to 6% higher accuracy than [Tian et al.(2017)Tian, Arbel, and Clark] at large pruning rates, because our LDA pruning measure is computed where it is directly related to the final task utility, where the linear assumption is more easily met, and where the variances are more disentangled (so that directly abandoning neuron dimensions is justified; see Sec. 3.1).
To assess generalization ability on unseen data, we report in Table 1 the testing set performance of two of our models for each task: one achieves the highest validation accuracy (‘accuracy first’ or AF) and the other is the lightest model that maintains <1% validation accuracy loss (‘parameter first’ or PF). The competing structures are also included. We try to make competing pruned models of similar complexities (last row).
Methods & Acc | CIFAR100 (GoogLeNet, 78%) | Adience Age (GoogLeNet, 55%) | LFWA Gender (VGG, 91%) | LFWA Smile (VGG, 91%)
              | AF | PF | AF | PF | AF | PF | AF | PF
MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] | 76% | 49% | 89% | 87%
SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] | 71% | 50% | 90% | 88%
Han et al. [Han et al.(2015b)Han, Pool, Tran, and Dally] | 78% | 73% | 56% | 43% | 89% | 83% | 91% | 81%
Li et al. [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] | 78% | 74% | 56% | 46% | 88% | 85% | 91% | 83%
Our approach | 80% | 77% | 58% | 54% | 93% | 92% | 93% | 90%
(Param#, FLOPs) | (4.8M, 2.9B) | (2.6M, 2.1B) | (2.3M, 1.8B) | (1.1M, 1.1B) | (6.5M, 7.4B) | (3.1M, 5.2B) | (18M, 13B) | (1.8M, 5.5B)

Note: our method's Param# and FLOPs (last row) are respectively shared with [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Han et al.(2015b)Han, Pool, Tran, and Dally] and [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Tian et al.(2017)Tian, Arbel, and Clark]. [Han et al.(2015b)Han, Pool, Tran, and Dally] has the same FLOPs as the base nets. The base nets and their testing accuracies are given in the Row-1 parentheses. The original Param# and FLOPs for VGG16, GoogLeNet, MobileNet, and SqueezeNet are about 138M, 6.0M, 4.3M, 1.3M and 31B, 3.2B, 1.1B, 1.7B, respectively (M = 10^6, B = 10^9).
From Table 1, it is evident that our approach generalizes well to unseen data (highest accuracies in most cases). Its superiority is more obvious in the 'parameter first' case. This agrees with the previous validation results. Additionally, although MobileNet and SqueezeNet perform similarly on Adience and LFWA, MobileNet performs clearly better on CIFAR100, mainly because its larger capacity suits that particular task (without overfitting). This also indicates the superiority of providing a range of task-dependent models over fixed ones. Generally speaking, the gaps between the proposed and the fixed nets are wider in the GoogLeNet cases, because the proposed method can strategically select both filter types and filter numbers according to task demands.
4.2 Layer-wise Complexity Analysis
In this section, we provide a layer-by-layer complexity analysis of our pruned nets in terms of parameters and computation. The net we select for each case is the smallest one that preserves accuracy comparable to the original net. Figs. 5, 6, and 7 demonstrate layer-wise complexity reductions for the CIFAR100, Adience, and LFWA cases, respectively. The base structure is GoogLeNet for the first two datasets. As Figs. 5 and 6 show, in each Inception module, different kinds of filters are pruned differently. This is determined by the scale at which more task utility lies. For easy tasks, more parameters can be pruned away, which helps alleviate overfitting. When task-specific difficulty increases (e.g. CIFAR100), the parameter numbers grow, but in a task-desirable direction. By choosing the kinds of filters, and also the filter number for each kind, our approach provides a task-dependent manner of designing deep architectures. In the pruned models, most parameters in the middle layers have been discarded. In fact, our method can collapse such layers to reduce the network depth. In our experiments, when pruning reaches a threshold, all filters left in certain modules are of size 1×1. They can be viewed as simple filter selectors (by weight assignment) and thus can be combined and merged into the previous module's concatenation to form a weighted summation. Such 'skipping' modules pass feature representations to higher layers without incrementing the features' abstraction level. Inception-V1 is chosen as an example because it offers more filter-type choices; however, the proposed approach can be used to prune other modular structures, such as ResNet, where the final summation in a unit module can be modeled as a concatenation followed by convolution. Fig. 7 shows the LFWA cases with VGG16 as the base. Since FC structures dominate VGG16, we add a separate conv-layer parameter analysis. The results show that our approach leads to significant parameter and FLOP reductions over the layers. Specifically, the method effectively prunes away almost all the dominating FC parameters. In Figs. 5, 6, and 7, the first few layers are not pruned very much. This is because earlier layers correspond to primitive patterns (e.g. edges) that are commonly useful. Early layers also provide robustness to unimportant/noisy pixel-space statistics. Despite its data-dependent nature, the proposed approach does not depend much on training 'pixels', but focuses more on the deep abstract manifolds learned and generalized from the training instances. It generalizes well to unseen data due to its invariance to task-irrelevant factors (e.g. illumination). The pruned models are very light: with 32-bit parameters, they are respectively 10MiB, 4.1MiB, 11.9MiB, and 6.7MiB. They can possibly fit into computer/cellphone memories or even caches (with superlinear efficiency boosts).

Figs. 5–7 panels: (a) Conv Param; (b) All Param; (c) FLOPs; (d) Conv Param; (e) All Param; (f) FLOPs.
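The module-collapsing observation, i.e. a module whose surviving filters are all 1×1 acts as a per-pixel weighted summation of the previous module's concatenated maps and can therefore be folded away, can be sketched as follows (shapes and names are illustrative, not from the paper):

```python
import numpy as np

# A surviving module of only 1x1 filters is a per-pixel linear map over
# channels, so it equals a weighted summation of the previous module's
# concatenated feature maps -- the layer can be skipped/merged.
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8, 8))          # previous concat: 6 channels, 8x8
W1 = rng.normal(size=(2, 6))            # two surviving 1x1 filters

# (a) applied as a 1x1 convolution layer
conv_out = np.einsum('oc,chw->ohw', W1, X)

# (b) folded into the previous concatenation as a direct weighted sum
folded = np.tensordot(W1, X, axes=([1], [0]))

assert conv_out.shape == (2, 8, 8)
assert np.allclose(conv_out, folded)    # identical outputs: depth reduced
```

Because the two computations are identical, removing such a module changes neither the features passed upward nor their abstraction level, matching the 'skipping' module description above.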
5 Future Work and Discussion
In concurrent work, we attempt to derive task-optimal architectures by proactively pushing useful deep discriminants into a condensed subset of neurons before deconv-based pruning. This is achieved by simultaneously including a deep LDA utility and a covariance penalty in the objective function. However, compared to the simple pruning method presented here, proactive eigendecomposition and training are more computationally expensive and sometimes numerically unstable. Another possible direction is to apply the deep discriminant/component analysis idea to unsupervised scenarios. For example, deep ICA dimension reduction could be performed by minimizing dependence in the latent space before deconv pruning. This would condense the information flow and reduce redundancy and interference. Thus, it has potential for applications such as automatic structure design of autoencoders, efficient image retrieval, and reconstruction.
6 Conclusion
This paper proposes a task-specific, neuron-level, end-to-end pruning approach with an LDA-deconv utility that is aware of both the final classification and its holistic cross-layer dependency. This is different from task-blind approaches and those with local (individual weights, or within 1–2 layers) utility measures. The proposed approach is able to prune convolutional, fully connected, modular, and hybrid deep structures, and it is useful for designing deep models by finding both the desired types of filters and the number for each kind. Compared to fixed nets, the method offers a range of models that are adapted to the inference task in question. On the general object dataset CIFAR100 and the domain-specific LFWA and Adience datasets, the approach achieves better performance and greater complexity reductions than competing methods. The high pruning rates and global task-utility awareness of the approach offer great potential for deployment on mobile devices in many real-world applications.
References
 [Anwar et al.(2015)Anwar, Hwang, and Sung] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.
 [Bekios-Calfa et al.(2011)Bekios-Calfa, Buenaposada, and Baumela] Juan Bekios-Calfa, Jose M Buenaposada, and Luis Baumela. Revisiting linear discriminant techniques in gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):858–864, 2011.
 [Belhumeur et al.(1997)Belhumeur, Hespanha, and Kriegman] Peter N. Belhumeur, João P Hespanha, and David J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.
 [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai] Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep representations. In International Conference on Machine Learning, pages 552–560, 2013.
 [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
 [Eidinger et al.(2014)Eidinger, Enbar, and Hassner] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12):2170–2179, 2014.
 [Fisher(1936)] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
 [Guo et al.(2016)Guo, Yao, and Chen] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
 [Han et al.(2015a)Han, Mao, and Dally] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015a.
 [Han et al.(2015b)Han, Pool, Tran, and Dally] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015b.
 [Hassibi and Stork(1993)] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. Morgan Kaufmann, 1993.
 [Haykin(1994)] Simon S Haykin. Blind deconvolution. Prentice Hall, 1994.
 [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [He et al.(2017)He, Zhang, and Sun] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
 [Hebb(2005)] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.
 [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [Hu et al.(2016)Hu, Peng, Tai, and Tang] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
 [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
 [Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [Jin et al.(2016)Jin, Yuan, Feng, and Yan] Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.
 [Krizhevsky and Hinton(2009)] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel] Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In NIPS, volume 2, pages 598–605, 1989.
 [Levi and Hassner(2015)] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–42, 2015.
 [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 [Li et al.(1999)Li, Kittler, and Matas] Yongping Li, Josef Kittler, and Jiri Matas. Effective implementation of linear discriminant analysis for face recognition and verification. In Computer Analysis of Images and Patterns, page 234. Springer, 1999.
 [Lin et al.(2014)Lin, Chen, and Yan] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. ICLR, 2014.
 [Liu et al.(2015)Liu, Luo, Wang, and Tang] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
 [Luo et al.(2017)Luo, Wu, and Lin] JianHao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
 [Mariet and Sra(2016)] Zelda Mariet and Suvrit Sra. Diversity networks. ICLR, 2016.
 [Mountcastle et al.(1957)] Vernon B Mountcastle et al. Modality and topographic properties of single neurons of cat’s somatic sensory cortex. J neurophysiol, 20(4):408–434, 1957.
 [Polyak and Wolf(2015)] Adam Polyak and Lior Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
 [Pratt(1989)] Lorien Y Pratt. Comparing biases for minimal network construction with backpropagation, volume 1. Morgan Kaufmann Pub, 1989.
 [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
 [Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR) 2015, 2015.
 [Srinivas and Babu(2015)] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
 [Sze et al.(2017)Sze, Yang, and Chen] Vivienne Sze, Tien-Ju Yang, and Yu-Hsin Chen. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5687–5695, 2017.
 [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [Tian et al.(2017)Tian, Arbel, and Clark] Qing Tian, Tal Arbel, and James J Clark. Deep LDA-pruned nets for efficient facial gender classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 10–19, 2017.
 [Valiant(2006)] Leslie G Valiant. A quantitative theory of neural computation. Biological cybernetics, 95(3):205–211, 2006.
 [Zeiler and Fergus(2014)] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
 [Zeiler et al.(2011)Zeiler, Taylor, and Fergus] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In 2011 International Conference on Computer Vision, pages 2018–2025. IEEE, 2011.
 [Zhang et al.(2016)Zhang, Zou, He, and Sun] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016.