Fisher Pruning of Deep Nets for Facial Trait Classification

by   Qing Tian, et al.

Although deep nets have achieved high accuracies on various visual tasks, their computational and space requirements are prohibitively high for deployment on devices without high-end GPUs. In this paper, we introduce a neuron/filter level pruning framework based on Fisher's LDA which maintains high accuracies on a wide array of facial trait classification tasks while significantly reducing space/computational complexities. The approach is general and can be applied to convolutional, fully-connected, and module-based deep structures, in all cases leveraging the high decorrelation of neuron activations found in the pre-decision layer and cross-layer deconv dependency. Experimental results on binary and multi-category facial traits from the LFWA and Adience datasets illustrate the framework's comparable or better performance relative to state-of-the-art pruning approaches and compact structures (e.g. SqueezeNet, MobileNet). Ours successfully maintains comparable accuracies even after discarding most parameters (up to 98%-99% parameter reductions, with up to 83% FLOP reductions).




1 Introduction

In this paper, we explore the premise that less useful structures/features (including their possible redundancies) in overparameterized deep nets can be pruned away to boost efficiency and accuracy. We argue that optimal deep features should be task-dependent. Prior to deep learning, features were usually hand-engineered with domain-specific knowledge. With the success of deep learning, we no longer need to handcraft features, but people still handcraft various architectures, which impacts both the quality and quantity of the features to be learned. Some features learned via arbitrary architectures may be less useful for the task at hand. Such features (parameters) not only add to the storage and computational burden but may also skew the data analysis (e.g. image classification) or result in over-fitting. Most of today's successful deep architectures are hand-designed for ImageNet; they may therefore fail to produce optimal features for other tasks, despite having sufficient capacity. Instead of assuming fixed nets generalize to various tasks, in this paper we address the problem through task-specific pruning (feature selection), generating a range of deep models well-adapted to the task at hand.

Filter or neuron level pruning has its advantages. Deep networks learn to construct hierarchical representations. Moving up the layers, high-level motifs that are more global, abstract, and disentangled can be built from simpler low-level patterns [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai, Zeiler and Fergus(2014)]. In this process, the critical building block is the filter, which, through learning, is capable of capturing patterns at a certain scale of abstraction. Higher layers are agnostic as to how those patterns are activated (w.r.t. weights, input, activation details). Single-weight-based approaches run the risk of destroying crucial patterns. Given non-negative inputs, many small negative weights may jointly counteract a few large positive weights, resulting in a dormant neuron state. Single-weight magnitude-based pruning would discard all the small negative weights before reaching the large positive ones, reversing that state. This issue becomes serious at high pruning rates. Also, instead of setting zeros in weight matrices, filter/neuron pruning removes rows/columns/depths in weight/convolution matrices, leading to direct space and computation savings.
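As a toy illustration of this failure mode (the numbers are ours, not from the paper), consider a ReLU neuron kept dormant by several small negative weights:

```python
import numpy as np

# A neuron with one large positive weight counteracted by many small
# negative weights: with non-negative inputs it stays dormant (<= 0).
w = np.array([2.0, -0.5, -0.5, -0.5, -0.6])
x = np.ones(5)                       # non-negative input
print(float(w @ x))                  # pre-activation is -0.1 -> dormant under ReLU

# Magnitude-based weight pruning removes the small negative weights
# first, leaving only the large positive one and waking the neuron up.
w_pruned = np.where(np.abs(w) >= 1.0, w, 0.0)
print(float(w_pruned @ x))           # 2.0 -> the neuron's state is reversed

# Filter/neuron-level pruning instead keeps or drops the whole neuron,
# preserving its learned on/off behavior.
```
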

In this paper, we develop a Fisher Linear Discriminant Analysis (LDA) [Fisher(1936)] based neuron/filter level pruning framework that is aware of both the final classification utility and its holistic dependency. Aside from efficiency gains, the pruning approach strategically selects deep features from a dimension reduction perspective, which potentially leads to accuracy boosts. Key novel contributions that distinguish our approach from others (e.g. [Han et al.(2015b)Han, Pool, Tran, and Dally, Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Tian et al.(2017)Tian, Arbel, and Clark]) include: (1) Our pruning is task-dependent. It has an LDA-based neuron utility measure for pruning that is directly related to the final task-specific classification. The LDA-based utility is calculated, unraveled, and traced backwards from the final (hidden) layer, where the linear assumption of LDA is more reasonable and variances are more disentangled [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai]. These two factors make our LDA-based pruning directly along neuron dimensions well-grounded (we show this in Sec. 3.1 by solving a generalized eigenvalue problem). In contrast, most previous pruning approaches use task-blind utility measures (e.g. magnitudes of weights, variances, activations). They pay close attention to the complexity itself, but little to whether the complexity change follows a task-desirable direction. (2) Through deep discriminant analysis, our approach determines how many filters, and of what types, are appropriate in a given layer. By pruning deep modules, it provides a novel strategy for architecture design. This differs from popular compact structures that employ random 1*1 filter blocks to reduce the data dimension to some size $K$: a small $K$ may cut the information flow to higher layers, while a large $K$ may lead to redundancy, overfitting, or interference. Such arbitrariness also exists for other filter types. (3) The approach presented here handles a wide variety of structures, including convolutional, fully-connected, modular, and hybrid ones, and prunes a full network in an end-to-end manner. Parameters easily explode when neurons are fully connected (e.g. over 90% of the weights of AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and VGG16 [Simonyan and Zisserman(2015)] are in FC layers). By strategically pruning FC weights, we offer a new alternative to weight sharing for parameter reduction, one that can preserve useful location information contributing to task utility. (4) The proposed method derives a feature subspace that is invariant and robust to task-unrelated factors (e.g. lighting). In our experiments on general and domain-specific datasets (CIFAR100, Adience, and LFWA), we show how the proposed method leads to large complexity reductions. It brings down the total VGG16 size by 98%-99% and that of the compact GoogLeNet by up to 82% without much accuracy loss (<1%). The corresponding FLOP reduction rates are as high as 83% and 62%, respectively. Additionally, we can derive more accurate models at lower complexities: taking age recognition on Adience as an example, one model is over 3% more accurate than the original net but only about 1/3 of its size. Also, we compare our method with MobileNet, SqueezeNet, [Han et al.(2015b)Han, Pool, Tran, and Dally], and [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] to show our task-specific Fisher pruning's value.
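The FC-dominance claim in contribution (3) can be verified with back-of-the-envelope arithmetic. The sketch below counts VGG16 weights only (biases omitted), using the standard layer configuration:

```python
# Rough VGG16 parameter count (weights only, biases ignored), showing
# why FC layers dominate. Conv layers listed as (in_ch, out_ch, kernel).
convs = [(3, 64, 3), (64, 64, 3),
         (64, 128, 3), (128, 128, 3),
         (128, 256, 3), (256, 256, 3), (256, 256, 3),
         (256, 512, 3), (512, 512, 3), (512, 512, 3),
         (512, 512, 3), (512, 512, 3), (512, 512, 3)]
fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(i * o * k * k for i, o, k in convs)
fc_params = sum(i * o for i, o in fcs)
total = conv_params + fc_params
print(f"conv: {conv_params/1e6:.1f}M, fc: {fc_params/1e6:.1f}M, "
      f"fc share: {fc_params/total:.0%}")   # FC layers hold roughly 90%
```
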

2 Related Work

Early approaches to neural network pruning date back to the late 1980s; pioneering examples include [Pratt(1989), LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel, Hassibi and Stork(1993)]. In recent years, with increasing network depths came more complexity, which reignited research into network pruning. Han et al [Han et al.(2015b)Han, Pool, Tran, and Dally] discard weights of small magnitudes. Similarly, approaches that sparsify networks by setting weights to zero include [Srinivas and Babu(2015), Mariet and Sra(2016), Jin et al.(2016)Jin, Yuan, Feng, and Yan, Guo et al.(2016)Guo, Yao, and Chen, Hu et al.(2016)Hu, Peng, Tai, and Tang, Sze et al.(2017)Sze, Yang, and Chen]. With further compression, this sparsity is desirable for storage and transfer purposes. Anwar et al [Anwar et al.(2015)Anwar, Hwang, and Sung] introduced structured sparsity at different scales. More recently, neuron/filter/channel pruning has gained popularity [Polyak and Wolf(2015), Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Tian et al.(2017)Tian, Arbel, and Clark, Luo et al.(2017)Luo, Wu, and Lin, He et al.(2017)He, Zhang, and Sun], as such methods directly produce hardware-friendly architectures that not only reduce the requirements on storage space and transmission bandwidth, but also bring down the initially large amount of computation in conv layers. Furthermore, with fewer intermediate feature maps generated and consumed, the number of slow and energy-intensive memory accesses decreases, rendering the pruned nets more amenable to implementation on mobile devices. Despite the promising pruning rates achieved, most previous methods suffer from one or both of the following drawbacks: (1) the utility measure for pruning, such as magnitudes of weights, variances, or activations, is task-blind, or at least not directly related to task demands; (2) utilities are often computed locally (relationships within a layer/filter or across all layers may be missed).
In addition to pruning, some approaches constrain space and/or computational complexity through compression, such as Huffman encoding and weight quantization [Han et al.(2015a)Han, Mao, and Dally], or bitwidth reduction (e.g. XNOR-Net and BWN [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]). Some decompose filters under a low-rank assumption [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus, Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman, Zhang et al.(2016)Zhang, Zou, He, and Sun]. Another popular strategy is to adopt compact layers/modules with a random set of 1*1 filters to reduce dimension (e.g. InceptionNet [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich], ResNet [He et al.(2016)He, Zhang, Ren, and Sun], SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer], MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam], and NiN [Lin et al.(2014)Lin, Chen, and Yan]), which may impede information flow or increase redundancy. Most popular compact/pruned architectures target popular datasets like ImageNet. However, computer vision tasks are diverse in nature, difficulty, and dataset size. Task-specific pruning can play a crucial role in various applications, especially those with limited data and strict timing requirements (e.g. car forward-collision warning). Inspiration can be drawn from neuroscience findings [Mountcastle et al.(1957), Valiant(2006)] showing that, despite the massive number of neurons in the cerebral cortex, each neuron typically receives inputs from a small task-specific set of other neurons.

3 Task-dependent Deep Fisher Pruning

In this paper, we propose a task-dependent Fisher Linear Discriminant Analysis (LDA) based pruning approach at the neuron/filter level that is aware of the final task-specific classification utility and its holistic cross-layer dependency. We treat pruning as dimensionality reduction in the deep feature space: disentangling factors of variation and discarding those of little, or even harmful/interfering, utility. The base net is fully pre-trained with cross-entropy loss, L2 regularization, and Dropout (which penalizes co-adaptations). The method starts pruning by unravelling useful variances from the last hidden layer, then traces the utility backwards through deconvolution across all layers to weigh the usefulness of each neuron/filter. By abandoning less useful neurons/filters, our approach is capable of deriving optimal structures for a given task, with potential accuracy boosts.

3.1 Task Utility Unravelling from Final Layer

The pruning begins from the final (hidden) layer of the well-trained base net for a number of reasons: (1) This is the only place where task-specific distinguishing utility can be accurately and directly measured; all previous information feeds into this layer. (2) Data in this layer are more likely to be linearly separable (the LDA assumption). (3) Pre-decision neuron activations representing different motifs are empirically shown to fire in a more decorrelated manner than in earlier layers; we will see how this helps later. For each image, an $M$-dimensional firing vector can be calculated in the final hidden layer, which is called a firing instance ($M = 4096$ for VGG16 and $M = 1024$ for GoogLeNet; pooling is applied when necessary). By stacking all such instances from a set of images, the firing data matrix $X$ for that set is obtained (useless 0-variance/duplicate columns are pre-removed). Our aim here is to abandon dimensions of $X$ that possess low or even negative task utility. Inspired by [Belhumeur et al.(1997)Belhumeur, Hespanha, and Kriegman, Li et al.(1999)Li, Kittler, and Matas, Bekios-Calfa et al.(2011)Bekios-Calfa, Buenaposada, and Baumela, Tian et al.(2017)Tian, Arbel, and Clark], Fisher's LDA [Fisher(1936)] is adopted to quantify this utility. The goal is to maximize class separation by finding:

$$W^{\ast} = \operatorname*{arg\,max}_{W} \; \mathrm{tr}\!\left( (W^{T}\Sigma_{w}W)^{-1}\, W^{T}\Sigma_{b}W \right) \qquad (1)$$

with $X_{c}$ being the set of observations obtained in the last hidden layer for category $c$; $W$ linearly projects the data to a new space spanned by its columns. The tilde sign ($\,\tilde{}\,$) denotes a centering operation; for data $X$:

$$\tilde{X} = X - \frac{1}{n}\mathbf{1}_{n}\mathbf{1}_{n}^{T}X \qquad (2)$$

where $n$ is the number of observations in $X$ and $\mathbf{1}_{n}$ denotes an $n \times 1$ vector of ones. The within-class and between-class scatter matrices are:

$$\Sigma_{w} = \sum_{c}\tilde{X}_{c}^{T}\tilde{X}_{c}, \qquad \Sigma_{b} = \sum_{c} n_{c}\,(\mu_{c}-\mu)(\mu_{c}-\mu)^{T} \qquad (3)$$

where $\mu_{c}$ is the mean firing instance of category $c$, $\mu$ the overall mean, and $n_{c}$ the number of observations in category $c$. Finding $W$ involves solving a generalized eigenvalue problem:

$$\Sigma_{b}\,w_{i} = \lambda_{i}\,\Sigma_{w}\,w_{i} \qquad (4)$$

where $(\lambda_{i}, w_{i})$ represents a generalized eigenpair of the matrix pencil $(\Sigma_{b}, \Sigma_{w})$ with $w_{i}$ as a column of $W$. If we only consider active neurons with non-duplicate pattern motifs, we find that in the final hidden layer, most off-diagonal values in $\Sigma_{w}$ and $\Sigma_{b}$ are very small. In other words, aside from noisy or meaningless neurons, the firings of neurons representing different motifs in the pre-decision layer are highly decorrelated (disentanglement of latent-space variances [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai, Zeiler and Fergus(2014)]). This corresponds to the intuition that, unlike common primitive features in lower layers, higher layers capture high-level abstractions of various aspects (e.g. car wheel, dog nose, flower petals), and the chances of these firing simultaneously are relatively low. In fact, different filter 'motifs' tend to be progressively more global and decorrelated when navigating from low to high layers. The decorrelation trend arises because coincidences/agreements in high dimensions can hardly happen by chance. Thus, we assume that $\Sigma_{w}$ and $\Sigma_{b}$ tend to be diagonal in the top layer. Since inactive neurons are not considered here, Eq. 4 becomes:

$$\Sigma_{w}^{-1}\Sigma_{b}\,w_{i} = \lambda_{i}\,w_{i} \qquad (5)$$


According to Eq. 5, $W$'s columns $w_{i}$ (where $i = 1, \dots, M$) are the eigenvectors of $\Sigma_{w}^{-1}\Sigma_{b}$ (diagonal), thus they are standard basis vectors (i.e. $W$'s columns and the original neuron dimensions are aligned). The $\lambda_{i}$ are the corresponding eigenvalues, with:

$$\lambda_{i} = \frac{\sigma_{b,i}^{2}}{\sigma_{w,i}^{2}} \qquad (6)$$

where $\sigma_{w,i}^{2}$ and $\sigma_{b,i}^{2}$ are the within-class and between-class variances along the $i$th neuron dimension. In other words, the optimal $W$ columns that maximize the class separation (Eq. 1) are aligned with (a subset of) the original neuron dimensions. It turns out that when pruning, we can directly discard neuron dimensions $i$ with small $\lambda_{i}$ (little contribution to Eq. 1) without much information loss.
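Under the diagonal-scatter assumption, the per-neuron utility reduces to a ratio of variances and can be computed directly. A minimal numpy sketch (the function name and toy data are ours, for illustration only):

```python
import numpy as np

def neuron_fisher_utility(F, y):
    """Per-neuron LDA utility: between-class variance over within-class
    variance along each final-layer neuron dimension, valid under the
    (near-)diagonal scatter assumption."""
    classes = np.unique(y)
    mu = F.mean(axis=0)
    sw = np.zeros(F.shape[1])       # within-class sum of squares
    sb = np.zeros(F.shape[1])       # between-class sum of squares
    for c in classes:
        Fc = F[y == c]
        mu_c = Fc.mean(axis=0)
        sw += ((Fc - mu_c) ** 2).sum(axis=0)
        sb += len(Fc) * (mu_c - mu) ** 2
    return sb / np.maximum(sw, 1e-12)

# Toy firing matrix: neuron 0 separates the two classes, neuron 1 is noise.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
F = np.stack([np.where(y == 0, -1.0, 1.0) + 0.1 * rng.standard_normal(100),
              rng.standard_normal(100)], axis=1)
util = neuron_fisher_utility(F, y)
print(util)   # utility of neuron 0 is much larger than that of neuron 1
```

Neurons whose utility falls below a threshold would then be discarded, as described above.
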

3.2 Utility Tracing for Cross-Layer Pruning

Note: useful (cyan) neuron outputs/feature maps that contribute to final utility (task-specific LDA) through corresponding (green) next layer weights/filter pieces, only depend on previous layers’ (cyan) counterparts via deconv.
Figure 1: Full-net LDA-Deconv utility tracing on the neuron/filter level.

After unravelling the twisted threads of deep variances and selecting the dimensions of high task utility, the next step is to propagate the utility through all previous layers to guide pruning. Unlike local approaches, our pruning unit is concerned with a filter's/neuron's contribution to the final utility. In signal processing, deconvolution is used to reverse an unknown filter's effect and recover corrupted sources [Haykin(1994)]. Inspired by this, to recover each neuron's/filter's utility, we trace the final discriminability backwards via deconvolution (from an easily unravelled end) across all layers. Fig. 1 illustrates our cross-layer task utility tracing. The purpose of recovering/reconstructing the sources contributing to the final task utility in the decision-making layer is different from capturing a fixed-order parameter/filter dependency, such as the 1st-order gradient. At convergence, most parameters have zero or small gradients, which does not mean that these parameters are useless. Also, gradient dependency is calculated locally in a greedy manner; structures pruned away based on a local dependency measure can never recover. There are many algorithms to compute or learn deconvolution; here we use a version for convnets [Zeiler et al.(2011)Zeiler, Taylor, and Fergus, Zeiler and Fergus(2014)]. As an inverse process of convolution (incl. nonlinearity and pooling), the unit deconv procedure is composed of unpooling (using max location switches), nonlinear rectification, and reversed convolution (a transpose of the convolution Toeplitz-like matrix under an orthonormal assumption):

$$\hat{F}_{k}^{(l-1)} = D^{(l)}\,\hat{F}_{k}^{(l)} \qquad (7)$$


where $l$ indicates the layer, the $k$th column $F_{k}^{(l)}$ of $F^{(l)}$ is the feature vector converted from layer $l$'s feature maps w.r.t. input $k$, $\hat{F}_{k}^{(l-1)}$ is the corresponding reconstructed vector of layer $l-1$'s sources contributing to the final utilities, and $D^{(l)}$ denotes the layer-$l$ deconv operation. On the channel level:

$$\hat{f}_{k,c}^{(l-1)} = \sum_{j=1}^{C_{l}} \hat{f}_{k,j}^{(l)} \ast d_{j,c}^{(l)}, \qquad k = 1, \dots, N \qquad (8)$$

where '$\ast$' means convolution, $c$ indicates a channel, $N$ is the number of training images, $C_{l}$ is the number of layer-$l$ feature maps, and $d_{j,c}^{(l)}$ is a deconv filter piece (determined after pre-training). Our calculated dependency here is data-driven and pooled over the training set, which models the established phenomenon in neuroscience that multiple exposures strengthen the relevant connections (synapses) in the brain (the Hebbian theory [Hebb(2005)]). We also extend the deconv idea to FC structures, which can be considered special conv structures where a layer's input and weights are viewed as stacks of conv feature maps and filters that overlap completely (Fig. 2).
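The unpooling-with-switches step of the unit deconv procedure can be sketched as follows (reversed convolution is omitted for brevity; all names are ours):

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """2x2 max pooling that records the argmax 'switch' locations."""
    h, w = x.shape[0] // k, x.shape[1] // k
    pooled = np.zeros((h, w))
    switches = np.zeros((h, w, 2), dtype=int)
    for i in range(h):
        for j in range(w):
            patch = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(patch.argmax(), patch.shape)
            pooled[i, j] = patch[r, c]
            switches[i, j] = (i*k + r, j*k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Unpooling: place each value back at its recorded max location."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 3., 0., 2.],
              [4., 2., 1., 0.],
              [0., 1., 5., 1.],
              [2., 0., 1., 2.]])
p, s = max_pool_with_switches(x)
u = np.maximum(unpool(p, s, x.shape), 0)   # unpooling + rectification
print(p)
```
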

Note: each FC neuron is a stack of 1*1 filters with one 1*1 output feature map.
Figure 2: Deconv-based dependency tracing in FC structures.

For modular structures, the idea is the same except that we need to trace dependencies, i.e. apply deconvolution, at different scales in a group-wise manner. Our full-net pruning, (re)training, and testing are done end-to-end.
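Because feature maps (neuron outputs) correspond to next-layer filter depths, removing one filter saves parameters in two adjacent layers at once. A shape-only sketch with made-up layer sizes:

```python
import numpy as np

# Dropping filter 10 in layer l removes both that filter's weights and
# the matching input channel of every layer-(l+1) filter.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((64, 32, 3, 3))    # layer l:   64 filters, 32 in-ch
W2 = rng.standard_normal((128, 64, 3, 3))   # layer l+1: 128 filters, 64 in-ch

keep = np.arange(64) != 10                  # prune filter 10 of layer l
W1p, W2p = W1[keep], W2[:, keep]

saved = (W1.size - W1p.size) + (W2.size - W2p.size)
print(W1p.shape, W2p.shape, saved)          # savings appear in both layers
```
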

With all neurons’/filters’ utility for final discriminability known, the pruning process involves discarding structures that are less useful to final classification (e.g. structures colored white in Fig 1,2). Since feature maps (neuron outputs) correspond to next-layer filter depths (neuron weights), our pruning leads to filter-wise and channel-wise savings simultaneously. When pruning, layer neurons with a LDA-deconv utility score () smaller than a threshold are deleted. In an over-parameterized model, the number of ‘random’, noisy, and irrelevant structures/sources explodes exponentially with depth, while pre-trained cross-layer dependencies of final task-utility are sparse. Unlike noise or random patterns, to construct a ‘meaningful’ motif, we need to follow a specific path(s). It is this cross-layer sparsity of usefulness (task difficulty related) that greatly contributes to pruning, not just the top layer. Useful neurons’/filters’ utilities are high in most large net layers. To get rid of massive numbers of useless neurons quickly while being cautious in high utility regions, we set the threshold for layer as:


where $\bar{u}_{l}$ is the average utility of layer $l$'s activations, $u_{i}^{(l)}$ is the utility score of the $i$th activation, and $N_{l}$ is the total number of layer-$l$ activations (space aware). The assumption is that the utility scores in a certain layer follow a Gaussian-like distribution. The pruning-time hyper-parameter $\beta$ is constant over all layers and is directly related to the pruning rate. We can set it either to squeeze the net as much as possible without obvious accuracy loss, to find the 'most accurate' model, or to reach any pruning rate appropriate to the resources available and accuracies expected. In other words, rather than a fixed compact model like SqueezeNet or MobileNet, we offer the flexibility to create models customized to different needs. Generalizability may be sacrificed with reduced capacity: the 'generic' fixed nets follow an ad-hoc direction by using random dimension-reducing filters, while our pruned models are 'adapted' to current task demands and invariant to unwanted factors. After pruning, retraining with the surviving parameters is needed; doing so iteratively improves convergence.
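Assuming a mean-scaled threshold $\theta_{l} = \beta\,\bar{u}_{l}$ (our reading of the thresholding rule; the scores are toy values), the per-layer pruning decision is a one-liner:

```python
import numpy as np

def prune_mask(utilities, beta):
    """Keep activations whose LDA-deconv utility reaches
    theta_l = beta * mean(utilities); beta is shared across layers."""
    theta = beta * np.mean(utilities)
    return utilities >= theta

u = np.array([0.01, 0.02, 0.03, 0.9, 1.1, 1.3])   # toy utility scores
print(prune_mask(u, beta=0.5))   # the three low-utility neurons are dropped
```
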

4 Experiments and Results

In this paper, we use both conventional and module-based deep nets, i.e. VGG16 [Simonyan and Zisserman(2015)] and the compact InceptionV1, a.k.a. GoogLeNet [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich], to illustrate our task-dependent pruning (feature selection) method. One general object dataset, CIFAR100 [Krizhevsky and Hinton(2009)], as well as two domain-specific facial-trait datasets, Adience [Eidinger et al.(2014)Eidinger, Enbar, and Hassner] and LFWA [Liu et al.(2015)Liu, Luo, Wang, and Tang], are chosen. Some of the most frequently explored attributes, such as age group, gender, and smile/no smile, are selected from the latter two. Base models are pretrained on ILSVRC12 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei]. The suggested splits in [Krizhevsky and Hinton(2009), Levi and Hassner(2015), Liu et al.(2015)Liu, Luo, Wang, and Tang] are adopted. For CIFAR100, we use the last 20% of the original training images in each of the 100 categories for validation. For Adience, we use the first three folds for training and the 4th and 5th folds for validation and testing, respectively. All images are pre-resized to 224*224.
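The per-class CIFAR100 validation split described above can be reproduced in a few lines (a sketch; the helper name is ours, and class indices are assumed to follow list order):

```python
from collections import defaultdict

def per_class_val_split(labels, val_frac=0.2):
    """Hold out the last val_frac of training indices within each class
    for validation, keeping the original index order per class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val = [], []
    for idxs in by_class.values():
        cut = int(round(len(idxs) * (1 - val_frac)))
        train += idxs[:cut]
        val += idxs[cut:]
    return sorted(train), sorted(val)

labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # toy: 5 images per class
tr, va = per_class_val_split(labels)
print(tr, va)   # the last image of each class goes to validation
```
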

4.1 Accuracy vs. Pruning Rates

(a) CIFAR100, GoogLeNet (b) Adience Age, GoogLeNet (c) LFWA Gender, VGG16 (d) LFWA Smile, VGG16
Figure 3: Accuracy change vs. parameter savings of ours (blue), Han et al [Han et al.(2015b)Han, Pool, Tran, and Dally] (red), Li et al [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] (orange), SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer], and MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam]. In our implementation of [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf], we adopt the same pruning rate as ours in each layer (not empirically determined as in [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf]).

Fig. 3 plots accuracy change vs. the fraction of parameters pruned. For comparison with our method, we include in the figures two other pruning approaches (Han et al [Han et al.(2015b)Han, Pool, Tran, and Dally] and Li et al [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf]) as well as modern compact structures (SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] and MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam]). CIFAR100 accuracy here is Top-1 accuracy. According to Fig. 3, even at large pruning rates (98-99% for the VGG16 cases, 57-82% for the GoogLeNet cases), our approach still maintains accuracies comparable to the original models (loss <1%). The other two methods suffer from earlier performance degradation, primarily due to their less accurate utility measures (single weights for Han et al [Han et al.(2015b)Han, Pool, Tran, and Dally] and the sum of filter weights for Li et al [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf]). Additionally, for [Han et al.(2015b)Han, Pool, Tran, and Dally], inner-filter relationships are vulnerable to pruning, especially when the pruning rate is large; this also explains why Li et al [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] performs slightly better than Han et al [Han et al.(2015b)Han, Pool, Tran, and Dally] at large pruning rates. It is worth noting that during the pruning process, the proposed method obtains structures that are more accurate yet lighter than the original net. For instance, in the age case, a model of 1/3 the original size is 3.8% more accurate than the original GoogLeNet. For CIFAR100, we achieve a nearly 2% accuracy boost using 80% of the parameters. Similarly, in the smile case, a 5x smaller model achieves 1.5% higher accuracy than the unpruned VGG16 net.
That is to say, in addition to boosting efficiency, our approach provides a way to find high-performance deep models while being mindful of the resources available. Compared to the fixed compact nets, i.e. SqueezeNet and MobileNet, our pruning approach generally enjoys better performance at similar complexities, because dimension reduction in the feature space with a task-aware utility measure is superior to reducing dimension with an arbitrary number of 1*1 filters. This supports our claim that pruning, or feature selection, should be task-specific. Even in the only pruning-time exception, where our approach has a slightly lower accuracy at a size similar to SqueezeNet's, much higher accuracies can be regained by simply adding back a few more parameters. Also, we compare our approach with [Tian et al.(2017)Tian, Arbel, and Clark], which applies linear discriminant analysis to intermediate conv features. The comparison (Fig. 4) is in terms of accuracy vs. saved computation (FLOPs) on the LFWA data. As in [Han et al.(2015b)Han, Pool, Tran, and Dally], both multiplication and addition each account for 1 FLOP.
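Under this FLOP convention (each multiplication and each addition counts as 1 FLOP), a conv layer's cost is straightforward to tally. A sketch with our own helper name:

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of one conv layer, counting each multiplication and each
    addition as 1 FLOP (i.e. 2 FLOPs per multiply-accumulate)."""
    macs = h_out * w_out * c_out * (c_in * k * k)
    return 2 * macs

# e.g. VGG16 conv1_1 on a 224x224 input:
print(conv_flops(224, 224, 3, 64, 3) / 1e9)   # roughly 0.17 GFLOPs
```
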
LFWA Gender                      LFWA Smile
FLOPs   Param#   Acc Chg         FLOPs   Param#   Acc Chg
16B     49M      +0.4%           13B     18M      +1.3%
11B     19M      -0.2%           12B     13M      +0.8%
9.5B    13M      -0.7%           10B     9.6M     +0.4%
8.2B    8.6M     +0.5%           8.3B    5M       -0.1%
7.5B    6.9M     -0.1%           6.9B    2.7M     +0.2%
7.4B    6.5M     -0.7%           6.0B    2.5M     -0.5%
7.2B    6.1M     -0.9%           5.5B    1.8M     -0.5%
5.2B    3.1M     -1.0%           5.4B    1.7M     -1.5%

Note: FLOPs are shared by both methods; Param# and Acc Change are ours. The left and right results are reported on the LFWA gender and smile traits, respectively. Low pruning rates, where the performance gap is small, are skipped.
Figure 4: Accuracy change vs. FLOP savings of ours (blue) and [Tian et al.(2017)Tian, Arbel, and Clark] (red).

According to Fig. 4, our method enjoys up to 6% higher accuracy than [Tian et al.(2017)Tian, Arbel, and Clark] at large pruning rates, since our LDA pruning measure is computed where it is directly related to final task utility, the linear assumption is more easily met, and the variances are more disentangled (so that direct neuron abandonment is justified, Sec. 3.1). To assess generalization ability on unseen data, we report in Table 1 the testing-set performance of two of our models for each task: one achieves the highest validation accuracy ('accuracy first' or AF) and the other is the lightest model that maintains <1% validation accuracy loss ('parameter first' or PF). The competing structures are also included. We try to make competing pruned models of similar complexities (last row).

Methods & Acc | CIFAR100 (GoogLeNet, 78%) | Adience Age (GoogLeNet, 55%) | LFWA Gender (VGG, 91%) | LFWA Smile (VGG, 91%)
MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] | 76% | 49% | 89% | 87%
SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] | 71% | 50% | 90% | 88%
Han et al [Han et al.(2015b)Han, Pool, Tran, and Dally] | 78% / 73% | 56% / 43% | 89% / 83% | 91% / 81%
Li et al [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] | 78% / 74% | 56% / 46% | 88% / 85% | 91% / 83%
Our approach | 80% / 77% | 58% / 54% | 93% / 92% | 93% / 90%
(Param#, FLOP) | (4.8M, 2.9B) / (2.6M, 2.1B) | (2.3M, 1.8B) / (1.1M, 1.1B) | (6.5M, 7.4B) / (3.1M, 5.2B) | (18M, 13B) / (1.8M, 5.5B)

Note: our method's param# and FLOPs (last row) are respectively shared by [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Han et al.(2015b)Han, Pool, Tran, and Dally] and [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Tian et al.(2017)Tian, Arbel, and Clark]; [Han et al.(2015b)Han, Pool, Tran, and Dally] has the same FLOPs as the bases. The bases and their testing accuracies are given in the header-row parentheses. The original param# and FLOPs for VGG16, GoogLeNet, MobileNet, and SqueezeNet are about 138M, 6.0M, 4.3M, 1.3M and 31B, 3.2B, 1.1B, 1.7B, respectively. M = 10^6, B = 10^9.

Table 1: Testing accuracies. 'AF': accuracy first, 'PF': param# first, when selecting models; paired entries are AF / PF.

From Table 1, it is evident that our approach generalizes well to unseen data (highest accuracies in most cases). Its superiority is more obvious in the 'parameter first' case, which agrees with the previous validation results. Additionally, although MobileNet and SqueezeNet perform similarly on Adience and LFWA, MobileNet performs clearly better on CIFAR100, mainly due to its suitably large capacity (without overfitting) for that particular task. This also indicates the superiority of providing a range of task-dependent models over fixed ones. Generally speaking, the gaps between the proposed and the fixed nets are wider in the GoogLeNet cases because the proposed method can strategically select both filter types and filter numbers according to task demands.

4.2 Layerwise Complexity Analysis

In this section, we provide a layer-by-layer complexity analysis of our pruned nets in terms of parameters and computation. The net we select for each case is the smallest one that preserves accuracy comparable to the original net. Figures 5, 6, and 7 show the layer-wise complexity reductions for the CIFAR100, Adience, and LFWA cases, respectively. The base structure is GoogLeNet for the first two datasets. As Figs. 5 and 6 show, within each Inception module, different kinds of filters are pruned to different degrees, determined by the scale at which more task utility lies. For easy tasks, more parameters can be pruned away, which also helps alleviate over-fitting. When task-specific difficulty increases (e.g. CIFAR100), the number of retained parameters grows, but in a task-desirable direction. By choosing both the kinds of filters and the number of filters of each kind, our approach provides a task-dependent way to design deep architectures. In the pruned models, most parameters in the middle layers have been discarded. In fact, our method can collapse such layers to reduce the network depth: in our experiments, once pruning passes a threshold, all filters left in certain modules are of size 1×1. These can be viewed as simple filter selectors (by weight assignment) and can therefore be merged into the previous module’s concatenation to form a weighted summation. Such ‘skipping’ modules pass feature representations to higher layers without incrementing the features’ abstraction level. InceptionV1 is chosen as the example because it offers more filter-type choices; however, the proposed approach can also prune other modular structures, such as ResNet, where the final summation in a unit module can be modeled as a concatenation followed by convolution.

Fig. 7 shows the LFWA cases with VGG16 as the base. Since FC layers dominate VGG16’s parameters, we add a separate analysis of the conv-layer parameters. The results show that our approach yields significant parameter and FLOP reductions across the layers; in particular, it effectively prunes away almost all of the dominating FC parameters. In Figs. 5–7, the first few layers are not pruned very much. This is because earlier layers correspond to primitive patterns (e.g. edges) that are commonly useful; they also provide robustness to unimportant or noisy pixel-space statistics. Despite its data-dependent nature, the proposed approach does not depend much on the training ‘pixels’ themselves, but rather on the deep abstract manifolds learned and generalized from the training instances; it generalizes well to unseen data thanks to its invariance to task-irrelevant factors (e.g. illumination). The pruned models are very light: with 32-bit parameters, they occupy 10 MiB, 4.1 MiB, 11.9 MiB, and 6.7 MiB, respectively, and can fit into computer/cellphone memories or even caches (with super-linear efficiency boosts).
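The module-collapsing step above relies on the fact that a 1×1 convolution is just a per-pixel linear combination of input channels, so a stack of 1×1 maps composes into a single linear map (biases and nonlinearities are omitted here for clarity; the paper merges into the previous module's concatenation in the same spirit). A minimal NumPy sketch:

```python
import numpy as np

def conv1x1(x, w):
    """Apply a 1x1 convolution: x is (C_in, H, W), w is (C_out, C_in).
    Each output pixel is a linear combination of the input channels."""
    c_in, h, width = x.shape
    return (w @ x.reshape(c_in, -1)).reshape(w.shape[0], h, width)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # 8 channels, 4x4 feature map
w1 = rng.standard_normal((6, 8))     # surviving 1x1 filters of a 'skipping' module
w2 = rng.standard_normal((3, 6))     # 1x1 filters of the following layer

# Two stacked 1x1 convs collapse into one with the merged weight w2 @ w1.
two_step = conv1x1(conv1x1(x, w1), w2)
merged = conv1x1(x, w2 @ w1)
assert np.allclose(two_step, merged)
```

The merge removes a whole layer of activations from the forward pass, which is exactly why an all-1×1 module can be folded away without changing the function the network computes (up to the omitted nonlinearity).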

From left to right, the conv layers in an Inception module are (1×1), (1×1, 3×3), (1×1, 5×5), and (1×1 after pooling). Green: pruned; blue: remaining.
Figure 5: Layerwise parameter (left) and FLOPs (right) savings (CIFAR100, GoogLeNet).
From left to right, the conv layers in an Inception module are (1×1), (1×1, 3×3), (1×1, 5×5), and (1×1 after pooling). Green: pruned; blue: remaining.
Figure 6: Layerwise parameter (left) and FLOPs (right) savings (Adience age, GoogLeNet).
(a) Conv Param (b) All Param (c) FLOPs (d) Conv Param (e) All Param (f) FLOPs
Figure 7: Layerwise complexity reductions, LFWA gender (a,b,c) and smile (d,e,f), VGG16.
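Per-layer counts of the kind plotted in Figs. 5–7 follow the standard closed-form formulas for a convolutional layer. The sketch below uses hypothetical layer dimensions (not taken from the figures) and counts FLOPs as multiply-accumulates:

```python
def conv_stats(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameters and FLOPs (counted as multiply-accumulates)
    of a k x k convolutional layer."""
    params = c_out * (c_in * k * k + (1 if bias else 0))
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs

# Illustrative: the 3x3 branch of an Inception-style module before
# and after pruning half of its 192 filters (hypothetical numbers).
before = conv_stats(96, 192, 3, 28, 28)
after = conv_stats(96, 96, 3, 28, 28)
print(before, after)  # removing filters cuts params and MACs in half
```

Because both counts scale linearly in the number of output filters, filter-level pruning reduces parameters and computation in the pruned layer by the same fraction, and additionally shrinks `c_in` of the following layer.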

5 Future Work and Discussion

In our concurrent work, we attempt to derive task-optimal architectures by proactively pushing useful deep discriminants into a condensed subset of neurons before deconv-based pruning. This is achieved by including both a deep LDA utility and a covariance penalty in the objective function. However, compared to the simple pruning method presented here, proactive eigen-decomposition and training are more computationally expensive and sometimes numerically unstable. Another possible direction is to apply the deep discriminant/component analysis idea to unsupervised scenarios. For example, deep ICA dimension reduction could be performed by minimizing dependence in the latent space before deconv pruning. This would condense the information flow and reduce redundancy and interference, giving the idea potential for applications such as automatic structure design of auto-encoders and efficient image retrieval and reconstruction.
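The deep LDA utility referred to above is, at its core, Fisher's criterion evaluated on neuron activations: between-class scatter relative to within-class scatter. The sketch below uses the trace-ratio form of that criterion on hypothetical activation matrices; the paper's exact objective and penalty may differ:

```python
import numpy as np

def fisher_utility(acts, labels):
    """Trace-ratio form of Fisher's LDA criterion on neuron activations.
    acts: (n_samples, n_neurons), labels: (n_samples,)."""
    mu = acts.mean(axis=0)
    n = acts.shape[1]
    sb = np.zeros((n, n))  # between-class scatter
    sw = np.zeros((n, n))  # within-class scatter
    for c in np.unique(labels):
        xc = acts[labels == c]
        d = (xc.mean(axis=0) - mu)[:, None]
        sb += len(xc) * (d @ d.T)
        sw += (xc - xc.mean(axis=0)).T @ (xc - xc.mean(axis=0))
    return np.trace(sb) / np.trace(sw)

# Activations of well-separated classes score far higher than
# activations whose class-conditional distributions overlap.
rng = np.random.default_rng(1)
separated = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
overlapping = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(0.1, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
assert fisher_utility(separated, y) > fisher_utility(overlapping, y)
```

A pruning criterion built on this quantity keeps neurons whose removal would most reduce class separability, rather than neurons with merely large weights or activations.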

6 Conclusion

This paper proposes a task-specific, neuron-level, end-to-end pruning approach with an LDA-Deconv utility that is aware of both the final classification objective and its holistic cross-layer dependency. This differs from task-blind approaches and from those with local utility measures (individual weights, or spans of only one or two layers). The proposed approach can prune convolutional, fully connected, modular, and hybrid deep structures, and it is useful for designing deep models by finding both the desired types of filters and the number of each kind. Compared to fixed nets, the method offers a range of models adapted to the inference task in question. On the general object dataset CIFAR100 and the domain-specific LFWA and Adience datasets, the approach achieves better performance and greater complexity reductions than competing methods. Its high pruning rates and global task-utility awareness offer great potential for deployment on mobile devices in many real-world applications.



  • [Anwar et al.(2015)Anwar, Hwang, and Sung] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.
  • [Bekios-Calfa et al.(2011)Bekios-Calfa, Buenaposada, and Baumela] Juan Bekios-Calfa, Jose M Buenaposada, and Luis Baumela. Revisiting linear discriminant techniques in gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):858–864, 2011.
  • [Belhumeur et al.(1997)Belhumeur, Hespanha, and Kriegman] Peter N. Belhumeur, João P Hespanha, and David J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on pattern analysis and machine intelligence, 19(7):711–720, 1997.
  • [Bengio et al.(2013)Bengio, Mesnil, Dauphin, and Rifai] Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep representations. In International Conference on Machine Learning, pages 552–560, 2013.
  • [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
  • [Eidinger et al.(2014)Eidinger, Enbar, and Hassner] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12):2170–2179, 2014.
  • [Fisher(1936)] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, 1936.
  • [Guo et al.(2016)Guo, Yao, and Chen] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
  • [Han et al.(2015a)Han, Mao, and Dally] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015a.
  • [Han et al.(2015b)Han, Pool, Tran, and Dally] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015b.
  • [Hassibi and Stork(1993)] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. Morgan Kaufmann, 1993.
  • [Haykin(1994)] Simon S Haykin. Blind deconvolution. Prentice Hall, 1994.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [He et al.(2017)He, Zhang, and Sun] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • [Hebb(2005)] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.
  • [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [Hu et al.(2016)Hu, Peng, Tai, and Tang] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
  • [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
  • [Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
  • [Jin et al.(2016)Jin, Yuan, Feng, and Yan] Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.
  • [Krizhevsky and Hinton(2009)] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel] Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In NIPS, volume 2, pages 598–605, 1989.
  • [Levi and Hassner(2015)] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–42, 2015.
  • [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
  • [Li et al.(1999)Li, Kittler, and Matas] Yongping Li, Josef Kittler, and Jiri Matas. Effective implementation of linear discriminant analysis for face recognition and verification. In Computer Analysis of Images and Patterns, page 234. Springer, 1999.
  • [Lin et al.(2014)Lin, Chen, and Yan] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. ICLR, 2014.
  • [Liu et al.(2015)Liu, Luo, Wang, and Tang] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [Luo et al.(2017)Luo, Wu, and Lin] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
  • [Mariet and Sra(2016)] Zelda Mariet and Suvrit Sra. Diversity networks. ICLR, 2016.
  • [Mountcastle et al.(1957)] Vernon B Mountcastle et al. Modality and topographic properties of single neurons of cat’s somatic sensory cortex. J neurophysiol, 20(4):408–434, 1957.
  • [Polyak and Wolf(2015)] Adam Polyak and Lior Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
  • [Pratt(1989)] Lorien Y Pratt. Comparing biases for minimal network construction with back-propagation, volume 1. Morgan Kaufmann Pub, 1989.
  • [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • [Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR) 2015, 2015.
  • [Srinivas and Babu(2015)] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
  • [Sze et al.(2017)Sze, Yang, and Chen] Vivienne Sze, Tien-Ju Yang, and Yu-Hsin Chen. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5687–5695, 2017.
  • [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • [Tian et al.(2017)Tian, Arbel, and Clark] Qing Tian, Tal Arbel, and James J Clark. Deep lda-pruned nets for efficient facial gender classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 10–19, 2017.
  • [Valiant(2006)] Leslie G Valiant. A quantitative theory of neural computation. Biological cybernetics, 95(3):205–211, 2006.
  • [Zeiler and Fergus(2014)] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
  • [Zeiler et al.(2011)Zeiler, Taylor, and Fergus] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In 2011 International Conference on Computer Vision, pages 2018–2025. IEEE, 2011.
  • [Zhang et al.(2016)Zhang, Zou, He, and Sun] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016.