Growing Efficient Deep Networks by Structured Continuous Sparsification

07/30/2020 ∙ by Xin Yuan, et al. ∙ The University of Chicago, Toyota Technological Institute at Chicago

We develop an approach to training deep networks while dynamically adjusting their architecture, driven by a principled combination of accuracy and sparsity objectives. Unlike conventional pruning approaches, our method adopts a gradual continuous relaxation of discrete network structure optimization and then samples sparse subnetworks, enabling efficient deep networks to be trained in a growing and pruning manner. Extensive experiments across CIFAR-10, ImageNet, PASCAL VOC, and Penn Treebank, with convolutional models for image classification and semantic segmentation, and recurrent models for language modeling, show that our training scheme yields efficient networks that are smaller and more accurate than those produced by competing pruning methods.


1 Introduction

Deep neural networks (DNNs) have achieved dramatic accuracy improvements in a variety of machine learning tasks such as image classification [26, 45], object detection [7, 32], semantic segmentation [33, 1] and language modeling [49]. Even though DNNs are typically overparameterized, recent work [14, 48, 20] shows that their performance on numerous tasks can be further improved by increasing their depth and width. Despite their success on benchmark datasets, the training and deployment of DNNs in many real-world applications is limited by their large number of parameters and computational costs. To address this, model compression and architecture search methods that learn more efficient DNN models have been proposed, yielding faster training and inference.

Efficiency improvements to DNNs have been extensively studied in previous works [23, 18, 19, 50]. For example, [22, 41] propose the use of binary weights and activations, benefiting from reduced storage costs and efficient computation through bit-counting operations. Other prominent approaches focus on finding efficient alternatives to standard spatial convolutions, e.g. depth-wise separable convolutions [44], which apply a separate convolutional kernel to each channel followed by a point-wise convolution over all channels [3, 18, 50]. Pruning methods [11, 12, 10] aim to generate a lightweight version of a given network architecture by removing individual weights [12, 11, 38] or structured parameter sets [28, 15, 35] (e.g. filters in convolutional layers).

However, existing methods typically rely on retraining or fine-tuning phases after reducing the number of parameters so that accuracy is maintained, resulting in significant computational costs. Moreover, the majority of these methods train the full-sized model prior to pruning and do not aim to diminish train-time computational costs. Recently proposed network architecture search (NAS) methods [52, 51, 31, 36, 40, 42] utilize AutoML techniques to design efficient architectures under practical resource constraints. Nonetheless, most NAS methods operate on a large “supernet” architecture, yielding a computationally expensive search phase. In addition, few of the recently proposed methods are one-shot; most require additional computation to retrain the final architecture in order to achieve high performance for deployment.

We propose a method to dynamically grow deep networks by continuously sparsifying structured parameter sets, resulting in efficient architectures and decreasing the computational cost not only of inference, but also of training. Unlike existing pruning or architecture search schemes that maintain a full-sized network or a “supernet”, we implicitly adapt architectures during training with different structured sparsity levels. More specifically, we first build a discrete space to maintain and explore adaptive train-time architectures of different complexities in a growing and pruning manner. To overcome the hardness of optimizing over a discrete space, we perform learning via continuation methods, approximating a discrete operation through a scaled smooth function. We design a bandwidth scheduler to control the optimization hardness during this low-cost training procedure. The framework is illustrated in Figure 1. We conduct extensive experiments on classification tasks (CIFAR-10, ImageNet), semantic segmentation (PASCAL VOC) and word-level language modeling (PTB) to demonstrate the effectiveness of our method for both convolutional neural network (CNN) and recurrent neural network (RNN) architectures.

Figure 1: Framework of our proposed method. (Left) Our learning-by-continuation method comprises three main components: (1) To tackle the optimization hardness of the discrete growing and pruning space, we follow [43] and replace the sign operation with a gradually sharpened smooth function $\bar{\sigma}_\beta$; the sign function is shown in blue, while red, green, cyan and yellow show $\bar{\sigma}_\beta$ with increasing bandwidths $\beta$. (2) $\beta$ is controlled by a carefully designed bandwidth scheduler that drives $\bar{\sigma}_\beta$ towards the sign operation. (3) A binary stochastic auxiliary variable $q$, sampled according to $\bar{\sigma}_\beta(s)$, is introduced to implicitly reduce computational cost during the progressive training stage. (Right) The relaxation can be applied to both CNN layers and RNN cells on various computer vision and natural language processing tasks. Best viewed in color.

2 Related Work

Network Pruning: Network pruning methods can be split into two groups: those that prune individual weights and those that prune structured components. For individual weight-based pruning, elements of the weight matrices are removed based on some criterion. For example, [12] propose to prune network weights with small magnitude, and build a deep compression pipeline [11]. Sparse VD [38] yields extremely sparse solutions in both fully-connected and convolutional layers by using variational dropout. [34] learns sparse networks by approximating $\ell_0$-regularization with a stochastic reparameterization. [39] presents a magnitude-based pruning approach for RNNs where the top-k elements of the weights are set to 0 at each iteration. However, methods that produce sparse weight matrices only lead to speedups on dedicated hardware with supporting libraries.

In structured methods, pruning is applied at the level of neurons, channels, or even layers. For example, L1-pruning [28] removes channels based on the $\ell_1$ norm of their filters. [15] uses group sparsity to smooth the pruning process after training. ThiNet [35] greedily prunes the channel that has the smallest effect on the next layer’s activation values. MorphNet [8] regularizes weights towards zero until they are small enough that the corresponding output channels are marked for removal from the network. Intrinsic Structured Sparsity (ISS) [46] works on LSTMs [17] by collectively removing the columns and rows of the weight matrices via group LASSO.

Our work is more closely related to structured pruning methods, in the sense that a slim architecture is generated at the end of training. In addition, our work focuses on adapting train-time structured sparsification within a discrete growing and pruning space.

Lottery Ticket Hypothesis and Continuous Sparsification: The Lottery Ticket Hypothesis [6] conjectures that sparse sub-networks with their randomly initialized weights can obtain accuracy comparable to the original network when trained in isolation. [43] further proposes Continuous Sparsification, a method to speed up ticket search, which approaches a complex optimization problem by relaxing the original objective into an intermediate problem that is easier to optimize. By gradually increasing the difficulty of the underlying objective during training, it produces a sequence of optimization problems converging to the original, intractable objective. In our method, we directly adopt Continuous Sparsification [43] to formulate a gradual relaxation scheme in the context of structured pruning.

3 Method

3.1 Discrete Growing and Pruning Space

Given a network topology, we build a discrete space to maintain adaptive train-time architectures of different complexities in a growing and pruning manner. A network topology can be seen as a directed acyclic graph consisting of an ordered sequence of nodes. Each node is an input feature and each edge is a computation cell with structured hyperparameters (e.g. the number of filters in convolutional layers or the number of hidden neurons in recurrent cells). The discrete growing and pruning space can be parameterized by associating a binary mask variable $m \in \{0, 1\}$ with each computation cell (edge), which enables train-time pruning ($m = 0$) and growing ($m = 1$) dynamics.

For a convolutional layer with $C_{in}$ input channels, $C_{out}$ output channels (filters) and $k \times k$ sized kernels, the $j$-th output feature map is computed based on the $j$-th filter, i.e. for $j = 1, \dots, C_{out}$:

$$Y_j = m_j \cdot \left( W_j * X \right) \tag{1}$$

where $m_j \in \{0, 1\}$ is the mask associated with the $j$-th filter $W_j$, $X$ is the input feature, and $*$ denotes convolution. For a recurrent cell, without loss of generality, we focus on LSTMs [17] with $h$ hidden neurons, a common variant of RNNs that learns long-term dependencies (the proposed growing space can be readily applied to the compression of GRUs [2] and vanilla RNNs):

$$
\begin{aligned}
i_t &= \sigma\!\left(m \odot \left(W_i x_t + U_i h_{t-1}\right)\right), \quad
f_t = \sigma\!\left(m \odot \left(W_f x_t + U_f h_{t-1}\right)\right), \\
o_t &= \sigma\!\left(m \odot \left(W_o x_t + U_o h_{t-1}\right)\right), \quad
g_t = \tanh\!\left(m \odot \left(W_g x_t + U_g h_{t-1}\right)\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \qquad
h_t = o_t \odot \tanh(c_t)
\end{aligned}
\tag{2}
$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication and $\tanh$ is the hyperbolic tangent function. $x_t$ denotes the input vector at time-step $t$, $h_t$ denotes the current hidden state, and $c_t$ denotes the long-term memory cell state. $W_i, W_f, W_o, W_g$ denote the input-to-hidden weight matrices and $U_i, U_f, U_o, U_g$ denote the hidden-to-hidden weight matrices. The mask $m \in \{0, 1\}^h$ is shared across all the gates to control the sparsity of hidden neurons.
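To make the masked computation concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' released code) of a convolutional layer whose output filters are gated by per-filter binary masks as in Eq. 1; the module name `MaskedConv2d` and the storage of masks as a buffer are our own choices. For the LSTM case of Eq. 2, the same idea applies with a single length-$h$ mask shared across the four gates.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Module):
    """Convolution whose output filters are gated by binary masks m_j (Eq. 1)."""

    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        # One binary mask per output filter; 1 = grown (kept), 0 = pruned.
        self.register_buffer("mask", torch.ones(c_out))

    def forward(self, x):
        y = self.conv(x)                         # (N, c_out, H, W)
        return y * self.mask.view(1, -1, 1, 1)   # zero out pruned filters

# Example: prune filters 2 and 5 of an 8-filter layer, then grow filter 2 back.
layer = MaskedConv2d(3, 8, 3)
layer.mask[[2, 5]] = 0.0   # pruning: m_j = 0
layer.mask[2] = 1.0        # growing: m_j = 1
out = layer(torch.randn(1, 3, 32, 32))
```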

We can optimize the trade-off between model performance and structured sparsification by considering the training objective

$$\min_{W,\, m \in \{0,1\}^d} \; \mathcal{L}\big(f(x;\, W, m)\big) + \lambda \lVert m \rVert_0 \tag{3}$$

where $f$ can be the operation of the convolutional layers in Eq. 1 or the LSTM cells in Eq. 2 with trainable weights $W$, $m$ collects the masks over all structured parameter sets in our proposed space, and $\mathcal{L}$ corresponds to a loss function (e.g. cross-entropy loss for classification). The term $\lVert m \rVert_0$ penalizes the number of non-zero mask values, thus encouraging sparsity, and $\lambda$ is a trade-off parameter between $\mathcal{L}$ and the penalty. In the growing and pruning space, a model is optimal if it minimizes the combined cost of the description of model complexity and the loss between the model and the data. However, optimizing Eq. 3 directly is computationally intractable due to the combinatorial nature of the binary states.
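As a small illustration of Eq. 3 (our sketch, with hypothetical helper names), the discrete objective simply adds $\lambda$ times the count of active masks to the task loss; exactly minimizing it over all $2^d$ binary mask configurations is what makes the problem intractable.

```python
def discrete_objective(loss_fn, model_out, targets, masks, lam):
    """Eq. 3: task loss plus lambda times the number of non-zero masks."""
    task_loss = loss_fn(model_out, targets)
    num_active = sum(m.sum() for m in masks)  # masks are 0/1 tensors, so sum == count
    return task_loss + lam * num_active
```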

3.2 Continuous Relaxation and Optimization

Learning by Continuation: To make the search space continuous and the optimization feasible, we adopt the framework proposed in [43], used to derive Continuous Sparsification.

First, we reparameterize $m$ as the binarized sign of a continuous variable $s \in \mathbb{R}$:

$$m = \mathbb{1}[s > 0] \tag{4}$$

and rewrite the objective in Eq. 3 as

$$\min_{W,\, s} \; \mathcal{L}\big(f(x;\, W, \mathbb{1}[s > 0])\big) + \lambda \big\lVert \mathbb{1}[s > 0] \big\rVert_0 \tag{5}$$

Following [43], we attack the hard and discontinuous optimization problem in Eq. 5 by starting with an easier objective that becomes harder as training proceeds. As in [43], we use a sequence of functions whose limit is the sign operation. Instead of using the sigmoid activation function, we adopt a hard sigmoid function $\bar{\sigma}_\beta$ with bandwidth parameter $\beta > 0$ (the original hard sigmoid function is defined as in [34]). Similar to the sigmoid relaxation, we have that for any $s \neq 0$, $\bar{\sigma}_\beta(s) \to \mathbb{1}[s > 0]$ as $\beta \to \infty$.
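The exact parameterization of the bandwidth-controlled hard sigmoid is not reproduced above; the sketch below assumes the form $\bar{\sigma}_\beta(x) = \mathrm{clamp}(\beta x + 0.5,\, 0,\, 1)$, which is consistent with the stated limit behavior but is our assumption rather than the paper's definition.

```python
import torch

def hard_sigmoid(x, beta):
    """Assumed form: clamp(beta * x + 0.5, 0, 1).
    For any x != 0 it approaches the 0-1 step function as beta -> infinity."""
    return torch.clamp(beta * x + 0.5, min=0.0, max=1.0)

# beta = 1 gives a soft gate; beta = 100 is nearly a hard 0/1 decision.
s = torch.tensor([-0.3, -0.01, 0.01, 0.3])
print(hard_sigmoid(s, 1.0))    # graded values in (0, 1)
print(hard_sigmoid(s, 100.0))  # approximately [0, 0, 1, 1]
```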

Auxiliary Discrete Variable: Using continuation methods, we can express our final objective as:

$$\min_{W,\, s}\; \mathbb{E}_{q \sim \mathrm{Bernoulli}(\bar{\sigma}_\beta(s))}\Big[\mathcal{L}\big(f(x;\, W, q)\big)\Big] + \lambda \big\lVert \bar{\sigma}_\beta(s) \big\rVert_1 \tag{6}$$

where the binary auxiliary variable $q$ is sampled from $\mathrm{Bernoulli}(\bar{\sigma}_\beta(s))$. By increasing $\beta$, Eq. 6 becomes harder to optimize while the objective converges to the original discrete one. Different from [43], we introduce a 0-1 sampled auxiliary variable $q$ based on the probability value $\bar{\sigma}_\beta(s)$. Thus we (1) effectively reduce training computational cost, since any train-time architecture is sampled as a structured sparse one; and (2) avoid using a suboptimal thresholding criterion to generate the inference architecture at the end of training.
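A minimal sketch of the auxiliary-variable sampling follows; the Bernoulli draw from $\bar{\sigma}_\beta(s)$ matches the description above, while the straight-through gradient path to $s$ is our assumption, since the paper's exact estimator is not reproduced here.

```python
import torch

def sample_mask(s, beta):
    """Draw q ~ Bernoulli(hard_sigmoid(beta * s)) per filter / hidden unit.

    Assumption: gradients flow to s through the probabilities via a
    straight-through estimator (the hard sample is used in the forward pass).
    """
    probs = torch.clamp(beta * s + 0.5, 0.0, 1.0)
    q_hard = torch.bernoulli(probs)
    # Straight-through: forward uses q_hard, backward uses d(probs)/d(s).
    return q_hard + probs - probs.detach()

s = torch.zeros(8, requires_grad=True)   # mask scores initialized at s = 0
q = sample_mask(s, beta=1.0)             # roughly half of the filters are active
active = q.detach().bool()               # indices actually computed at this step
```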

Bandwidth Scheduler: We start training deep networks using Eq. 6 with $\beta = \beta_0$, where the initial value $\beta_0$ is set as 1. We adapt the bandwidth to control the optimization difficulty by instantiating a bandwidth scheduler in two ways: globally, or structure-wise separately. A global bandwidth scheduler is called at the end of each training epoch and updates $\beta$ on all activation functions following

(7)

where $\beta_0$ is the initial bandwidth, which is set as 1, n_iters is the number of training iterations so far, and $\beta_{\max}$ is used to constrain $\beta$ to a certain range. In our experiments, we set $\beta_{\max}$ as 100. Constants $\gamma$ and $\alpha$ are hyperparameters that govern the increasing speed of the bandwidth during the progressive training procedure. Note that such an adaptive control scheme can be customized for different resource requirements (e.g. training computation cost) by tuning $\gamma$ and $\alpha$. A structure-wise separate bandwidth scheduler requires one additional step: for each mask variable, instead of using a global counter n_iters, we keep a separate counter n_sampled_iters, which is increased only when its associated mask value is sampled as 1 in Eq. 6. Similarly, we instantiate this scheduler with

(8)

Intuitively, the structure-wise separate bandwidth scheduler is more compelling because it allows the bandwidth to increase at different rates for different mask variables: masks that are sampled more frequently, indicating a higher probability of not being pruned, become more stable due to the higher optimization difficulty, while masks that are rarely sampled at early stages may still have the chance to be grown under a relatively lower bandwidth. The global scheduler may fail to handle such cases. In our experiments, we report performance using the structure-wise scheduler, and we also compare the two alternatives during training.
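Because the exact update rules of Eqs. 7 and 8 are not reproduced above, the sketch below illustrates the two scheduler variants under an assumed monotone schedule $\beta = \min(\beta_{\max},\, \beta_0 (1 + \gamma n)^{\alpha})$, where $n$ is either the global iteration counter or a per-mask sampled-iteration counter; the functional form is a placeholder, and only the qualitative behavior (the bandwidth grows with the counter and is capped at $\beta_{\max}$) is taken from the text.

```python
import torch

def global_bandwidth(n_iters, beta0=1.0, beta_max=100.0, gamma=5e-4, alpha=0.7):
    """Global scheduler: one beta shared by every mask (assumed schedule form)."""
    return min(beta_max, beta0 * (1.0 + gamma * n_iters) ** alpha)

def structurewise_bandwidth(n_sampled_iters, beta0=1.0, beta_max=100.0,
                            gamma=5e-4, alpha=0.7):
    """Structure-wise scheduler: each mask's beta grows only with the number of
    iterations in which that mask was sampled as 1 (assumed schedule form)."""
    beta = beta0 * (1.0 + gamma * n_sampled_iters.float()) ** alpha
    return torch.clamp(beta, max=beta_max)

# Frequently sampled masks get a larger (harder) beta; rarely sampled ones
# keep a small beta and can still be grown later.
counters = torch.tensor([50_000, 500])
print(structurewise_bandwidth(counters))
```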

In summary, Algorithm 1 shows full details of our optimization procedure with the structure-wise separate bandwidth scheduler.

  Input: Training set and label set: $X = \{x_i\}$, $Y = \{y_i\}$
  Output: Target efficient model $S$
  Initialize: $W$ as random weights and $s$ as 0; $\gamma$, $\alpha$ as float constants; $\beta_0$ as 1; $\beta_{\max}$ as 100; update interval $\Delta$; n_sampled_iters as all-1 vectors associated with each $\bar{\sigma}_\beta$ function.
  for $t = 1$ to $T$ do
     Sample a random mini-batch from $(X, Y)$
     Sample $q \sim \mathrm{Bernoulli}(\bar{\sigma}_\beta(s))$ and record the indices where the value is 1.
     Train $W$ and $s$ using Eq. 6 with SGD.
     Update n_sampled_iters at the recorded indices.
     if $t \bmod \Delta = 0$ then
        Update $\beta$ using Eq. 8
     end if
  end for
  return $S$
Algorithm 1: Optimization
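Putting the pieces together, the following sketch mirrors Algorithm 1 for a model with a single set of filter masks, reusing the illustrative helpers `sample_mask` and `structurewise_bandwidth` defined above; the interface `model(x, q)`, the attribute `model.num_filters`, and the finalization by the sign of $s$ are our simplifications rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs, lam=0.01, update_interval=100):
    """Sketch of Algorithm 1 with the structure-wise separate bandwidth scheduler."""
    s = torch.zeros(model.num_filters, requires_grad=True)  # mask scores, initialized to 0
    n_sampled = torch.ones(model.num_filters)                # per-mask sample counters
    beta = torch.ones(model.num_filters)                     # initial bandwidth beta_0 = 1
    opt = torch.optim.SGD(list(model.parameters()) + [s], lr=0.1, momentum=0.9)
    step = 0
    for _ in range(epochs):
        for x, y in loader:
            q = sample_mask(s, beta)                         # binary masks (Eq. 6)
            logits = model(x, q)                             # forward pass on the sampled structure
            probs = torch.clamp(beta * s + 0.5, 0.0, 1.0)    # assumed hard sigmoid form
            loss = F.cross_entropy(logits, y) + lam * probs.sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
            n_sampled += (q.detach() > 0.5).float()          # count masks sampled as 1
            step += 1
            if step % update_interval == 0:
                beta = structurewise_bandwidth(n_sampled)    # Eq. 8 (assumed form)
    return model, (s.detach() > 0).float()                   # final compact structure
```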

4 Experiments

4.1 Experimental Setup

Datasets: Evaluation is conducted on various tasks to demonstrate the effectiveness of our proposed method. For image classification, we use CIFAR-10 [25] and ImageNet [4]: CIFAR-10 consists of 60,000 images of 10 classes, with 6,000 images per class. The train and test sets contain 50,000 and 10,000 images respectively. ImageNet is a large dataset for visual recognition which contains over 1.2M images in the training set and 50K images in the validation set covering 1,000 categories. For semantic segmentation, we use the PASCAL VOC 2012 [5] benchmark which contains 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level labeled images for training, validation, and testing, respectively. The dataset is augmented with the extra annotations provided by [13], resulting in 10,582 training images. For language modeling, we use the word-level Penn Treebank (PTB) dataset [37], which consists of 929k training words, 73k validation words and 82k test words, with 10,000 unique words in its vocabulary.

Unpruned Baseline Models: For CIFAR-10, we use VGG-16 [45] with BatchNorm [24], ResNet-20 [14] and WideResNet-28-10 [48] as baselines. We adopt a standard data augmentation scheme (shifting/mirroring) following [30, 21], and normalize the input data with channel means and standard deviations. Note that we use the CIFAR versions of ResNet-20, VGG-16, and WideResNet-28-10. VGG-16, ResNet-20, and WideResNet-28-10 are trained for 160, 160 and 200 epochs, respectively, with a batch size of 128 and an initial learning rate of 0.1. For VGG-16 and ResNet-20 we divide the learning rate by 10 at epochs 80 and 120, with weight decay and a momentum of 0.9. For WideResNet-28-10, the learning rate is divided by 5 at epochs 60, 120 and 160, likewise with weight decay and a momentum of 0.9. For ImageNet, we train the baseline ResNet-50 and MobileNetV1 models following the respective papers. We adopt the same data augmentation scheme as in [9] and report top-1 validation accuracy. For semantic segmentation, performance is measured in terms of pixel intersection-over-union (IOU) averaged across the 21 classes (mIOU). We use Deeplab-v3-ResNet-101 (https://github.com/chenxi116/DeepLabv3.pytorch) [1] as the baseline model, following the training details in [1]. For language modeling, we use a vanilla two-layer stacked LSTM [49] as a baseline. The dropout keep ratio is 0.35 for the baseline model. The vocabulary size, embedding size, and hidden size of the stacked LSTMs are set as 10,000, 1,500, and 1,500, respectively, which is consistent with the settings in [49].

Implementation Details: There are two kinds of trainable variables in our method, denoted as model weights and mask weights. As a one-shot method, for model weights we adopt the same hyperparameters as the corresponding unpruned baseline models, except that the dropout keep ratio for language modeling is set as 0.5. For mask variables, we initialize the weights as 0 and use SGD training with an initial learning rate of 0.1, weight decay of 0 and momentum of 0.9 on all datasets. The learning rate scheduler is the same as that of the corresponding model weights. The trade-off parameter $\lambda$ is set as 0.01 on classification and semantic segmentation tasks, and 0.1 for language modeling tasks. For the bandwidth scheduler, we report model performance trained with the structure-wise separate scheduler, where $(\gamma, \alpha)$ are set as (0.0005, 0.7) for classification and segmentation models, and (0.0005, 1.2) for language modeling, respectively. In all experiments, the initial mask weights and $\beta_{\max}$ are set as 0 and 100. We also conduct a parameter sensitivity analysis of sparsity and accuracy in terms of $\gamma$ and $\alpha$.
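For concreteness, the stated settings translate into a configuration along the following lines (a sketch; the dictionary layout and the naming of the scheduler constants as `gamma` and `alpha` are our own):

```python
# Mask-variable optimization (shared across all datasets).
mask_optimizer = {"type": "SGD", "lr": 0.1, "weight_decay": 0.0, "momentum": 0.9}

# Trade-off parameter lambda and structure-wise bandwidth scheduler constants.
task_configs = {
    "classification_and_segmentation": {"lam": 0.01, "gamma": 0.0005, "alpha": 0.7},
    "language_modeling":               {"lam": 0.10, "gamma": 0.0005, "alpha": 1.2},
}

s_init, beta_max = 0.0, 100.0   # initial mask weights and bandwidth cap
```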

Model Method Val Acc(%) Params(M) FLOPs(%)

VGG-16 [45] Original 92.9 (+0.0) 14.99 (100%) 100
L1 [28] 91.8 (-1.1) 2.98 (19.9%) 19.9
SoftNet [15] 92.1 (-0.8) 5.40 (36.0%) 36.1
ThiNet [35] 90.8 (-2.1) 5.40 (36.0%) 36.1
Provable [29] 92.4 (-0.5) 0.85 (5.7%) 15.0
Ours 92.9 (-0.0) 1.50 (10.0%) 16.5
ResNet-20 [14] Original 91.3 (+0.0) 0.27 (100%) 100
L1 [28] 90.9 (-0.4) 0.15 (55.6%) 55.4
SoftNet [15] 90.8 (-0.5) 0.14 (53.6%) 50.6
ThiNet [35] 89.2 (-2.1) 0.18 (67.1%) 67.3
Provable [29] 90.8 (-0.5) 0.10 (37.3%) 54.5
Ours 91.1 (-0.2) 0.11 (39.1%) 59.8
WideResNet-28-10 [48] Original 96.2 (+0.0) 36.5 (100%) 100
L1 [28] 95.2 (-1.0) 7.6 (20.8%) 49.5
BAR (16x V) [27] 92.0 (-4.2) 2.3 (6.3%) 1.5
Ours 95.6 (-0.6) 2.6 (7.1%) 18.6
Table 1: Overview of the pruning performance of each algorithm for various CNN architectures on CIFAR-10. For each algorithm and network architecture, the table reports the retained params and ratio (Params, M, %), and retained FLOPs ratio (FLOPs, %) of pruned models.

4.2 Results

VGG-16, ResNet-20, and WideResNet-28-10 on CIFAR-10: Table 1 shows the pruning results in terms of validation accuracy, retained parameters, and FLOPs of VGG-16, ResNet-20, and WideResNet-28-10 on CIFAR-10. We compare with various pruning algorithms that we implement and run alongside our algorithm. Our method achieves either a larger pruning ratio or less degradation in accuracy. Our pruned VGG-16 and ResNet-20 achieve parameter and FLOPs reductions comparable to the recently proposed Provable [29] method while outperforming it by 0.5 and 0.3 points in validation accuracy, respectively. For the very aggressively pruned WideResNet-28-10, we observe that the model pruned by BAR [27] might not retain enough capacity to achieve a negligible accuracy drop, even with knowledge distillation [16] during the training process.

Model Method Top-1 Val Acc(%) Params(M) FLOPs(%)

ResNet-50 [14] Original 76.1 (+0.0) 23.0 (100%) 100
L1 [28] 74.7 (-1.4) 19.6 (85.2%) 77.5
SoftNet [15] 74.6 (-1.5) N/A 58.2
Provable [29] 75.2 (-0.9) 15.2 (65.9%) 70.0
Ours 75.2 (-0.9) 16.5 (71.7%) 64.5
MobileNetV1 (128) [18] Original (25%) 45.1 (+0.0) 0.47 (100%) 100
MorphNet [8] 46.0 (+0.9) N/A 110
NetAdapt [47] 46.3 (+1.2) N/A 81
Ours 46.0 (+0.9) 0.41 (87.2%) 70
Table 2: Overview of the pruning performance of each algorithm for various CNN architectures on ImageNet. For each algorithm and network architecture, the table reports the retained params and ratio (Params, M, %), and retained FLOPs ratio (FLOPs, %) of pruned models.

ResNet-50 and MobileNetV1 on ImageNet: To validate the effectiveness of the proposed method on large-scale datasets, we further prune the widely used ResNet-50 and MobileNetV1 (128 × 128 resolution) on ImageNet and compare the performance of our method to the results reported directly in the respective papers, as shown in Table 2. In the MobileNetV1 experiments, following the same setting as NetAdapt [47], we apply our method to MobileNetV1 with a 0.5 width multiplier while setting the original model's multiplier to 0.25 for comparison. Note that 50%-MobileNetV1 (128) is one of the most compact networks, and thus is more challenging to simplify than larger networks. Our method can still generate a sparser MobileNetV1 model compared with competing methods.

Deeplab-v3-ResNet-101 on PASCAL VOC 2012: We also test the effectiveness of our proposed method on the semantic segmentation task by pruning the Deeplab-v3-ResNet-101 model on the PASCAL VOC 2012 dataset. We apply our method to both the ResNet-101 backbone and the ASPP module. Compared to the baseline, our pruned network reduces the FLOPs by 54.5% and the parameter count by 41.8% while approximately maintaining mIOU (76.5% to 76.2%). See Table 3.

2-Stacked-LSTMs on PTB: We compare our proposed method with ISS using a vanilla two-layer stacked LSTM. As shown in Table 4, our method finds a very compact model structure while achieving similar perplexity on both validation and test sets. Specifically, our method achieves a 3.2× model size reduction and a 7.8× FLOPs reduction from the baseline model. Note that for a fair comparison, we only prune the LSTM structure while keeping the embedding layer unchanged, following the same setting as ISS. Our method achieves a more compact structure than ISS, further reducing the hidden units from (373, 315) to (319, 285). These improvements may be due to the fact that our method dynamically grows and prunes the hidden neurons towards a better trade-off between model complexity and performance than ISS, which simply uses group LASSO to penalize the norms of all groups collectively for compactness.

Model Method mIOU Params(M) FLOPs(%)

Deeplab-v3-ResNet-101 [1] Original 76.5 (+0.0) 58.0 (100%) 100
L1 [28] 75.1 (-1.4) 45.7 (78.8%) 62.5
Ours 76.2 (-0.3) 33.8 (58.2%) 45.5
Table 3: Results on the PASCAL VOC dataset, showing mIOU, retained parameters and ratio (M, %), and retained FLOPs ratio (%).

4.3 Analysis

Method Perplexity (val,test) Final Structure Weight(M) FLOPs(%)
Original [49] (82.57, 78.57) (1500, 1500) 66M (100%) 100
ISS [46] (82.59, 78.65) (373, 315) 21.8M (33.1%) 13.4
Ours (82.16, 78.67) (319, 285) 20.9M (31.7%) 12.8
Table 4: Results on the PTB dataset, showing validation and test perplexity, final sparse LSTM structure, retained parameters and ratio (M, %), and retained FLOPs ratio (%).

Dynamic Train-time Cost: One advantage of our method over conventional pruning methods is that we effectively decrease the computational cost not only of inference but also of training via structured continuous sparsification. Figure 2 shows the dynamics of train-time layer-wise FLOPs and stage-wise retained filter ratios of VGG-16 and ResNet-20 on CIFAR-10, respectively. From Figure 2(b) and 2(d) we see that our method preserves more filters in the earlier stages (1 and 2) of VGG-16 and in the earlier layers within each stage of ResNet-20. Also, the layer-wise final sparsity of ResNet-20 is more uniform due to the residual connections.

Ours as Structured Regularization: We investigate the value of our proposed automatic pruning method as a more efficient training method with structured regularization. We re-initialize the pruned ResNet-20 and two-layer stacked LSTM and re-train them from scratch on CIFAR-10 and PTB, respectively. Compared with the reported pruned model performance, we notice a performance degradation on both ResNet-20 (accuracy 91.1% → 90.8% (-0.3)) and the LSTM model (test perplexity 78.67 → 86.22 (+7.55)). Our method appears to have a positive effect in terms of regularization or optimization dynamics, which is lost if one attempts to directly train the final compact structure.

Parameter Sensitivity: We analyze the sensitivity of sparsity and performance to the bandwidth scheduler (structure-wise separate) parameters. We measure performance with respect to combinations of $\gamma$ and $\alpha$. Specifically, we measure the normalized parameter sparsity and validation accuracy of ResNet-20 on the CIFAR-10 dataset, as shown in Figure 3. From Figure 3(a) we can acquire some knowledge on how to localize hyperparameters $\gamma$ and $\alpha$ to achieve highly sparse networks. Figure 3(b) shows that when $\gamma$ and $\alpha$ lie in a relatively large range (e.g. the bottom-right region), the validation accuracy is robust to changes in these hyperparameters.

Figure 2: Track of train-time FLOPs and channel kept ratios. (a) VGG-16 layer-wise FLOPs ratios; (b) VGG-16 stage-wise channel kept ratios; (c) ResNet-20 layer-wise FLOPs ratios; (d) ResNet-20 stage-wise channel kept ratios.
Figure 3: Parameter sensitivity of the bandwidth scheduler hyperparameters $\gamma$ and $\alpha$. (a) Normalized sparsity; (b) normalized accuracy.

Investigation of the Bandwidth Scheduler: We investigate the effect of the global scheduler and the structure-wise separate scheduler by conducting experiments on CIFAR-10 using VGG-16, ResNet-20, and WideResNet-28-10. The results using the structure-wise separate scheduler are as reported in Table 1. For the global scheduler, we note that to achieve similar sparsity, the pruned models suffer accuracy drops of 0.5%, 0.2%, and 1.2%, respectively. With the global scheduler, optimization of all filters' masks stops at very early epochs and the remaining epochs of training are equivalent to directly training a stabilized compact structure. This may lock the network into a suboptimal structure, compared to our separate scheduler, which dynamically grows and prunes over a longer duration.

5 Conclusion

In this paper, we propose a simple yet effective method to grow efficient deep networks via structured continuous sparsification, which decreases the computational cost not only of inference but also of training. The method is simple to implement and quick to execute, and aims to automate the network structure sparsification process for general purposes. The pruning results for widely used deep networks on various computer vision and language modeling tasks show that our method consistently generates smaller and more accurate networks compared to competing methods.

There are many interesting directions for further investigation. For example, while our current sparsification process is designed with a generic objective, it would be interesting to incorporate model size and FLOPs constraints into the training objective in order to target a particular resource budget. Additionally, our method's growing and pruning space is anchored to a given network topology. Future work could explore an architectural design space in which large subcomponents of the network are themselves candidates for pruning.

References

  • [1] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587. Cited by: §1, §4.1, Table 3.
  • [2] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: footnote 1.
  • [3] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, Cited by: §1.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §4.1.
  • [5] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2015) The PASCAL visual object classes challenge: A retrospective. IJCV. Cited by: §4.1.
  • [6] J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR, Cited by: §2.
  • [7] R. B. Girshick (2015) Fast R-CNN. In ICCV, Cited by: §1.
  • [8] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In CVPR, Cited by: §2, Table 2.
  • [9] S. Gross and M. Wilber (2016) Training and investigating residual nets. http://torch.ch/blog/2016/02/04/resnets.html. Cited by: §4.1.
  • [10] Y. Guo, A. Yao, and Y. Chen (2016) Dynamic network surgery for efficient DNNs. In NIPS, Cited by: §1.
  • [11] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. ICLR. Cited by: §1, §2.
  • [12] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. NIPS. Cited by: §1, §2.
  • [13] B. Hariharan, P. Arbelaez, L. D. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In ICCV, Cited by: §4.1.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §4.1, Table 1, Table 2.
  • [15] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, Cited by: §1, §2, Table 1, Table 2.
  • [16] G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop. Cited by: §4.2.
  • [17] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation. Cited by: §2, §3.1.
  • [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Cited by: §1, Table 2.
  • [19] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger (2018) CondenseNet: an efficient DenseNet using learned group convolutions. CVPR. Cited by: §1.
  • [20] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §1.
  • [21] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In ECCV, Cited by: §4.1.
  • [22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In NIPS, Cited by: §1.
  • [23] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360. Cited by: §1.
  • [24] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §4.1.
  • [25] A. Krizhevsky, V. Nair, and G. Hinton (2014) The CIFAR-10 dataset. http://www.cs.toronto.edu/~kriz/cifar.html. Cited by: §4.1.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
  • [27] C. Lemaire, A. Achkar, and P. Jodoin (2019) Structured pruning of neural networks with budget-aware regularization. In CVPR, Cited by: §4.2, Table 1.
  • [28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient ConvNets. In ICLR, Cited by: §1, §2, Table 1, Table 2, Table 3.
  • [29] L. Liebenwein, C. Baykal, H. Lang, D. Feldman, and D. Rus (2020) Provable filter pruning for efficient neural networks. In ICLR, Cited by: §4.2, Table 1, Table 2.
  • [30] M. Lin, Q. Chen, and S. Yan (2013) Network in network. arXiv:1312.4400. Cited by: §4.1.
  • [31] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. ICLR. Cited by: §1.
  • [32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In ECCV, Cited by: §1.
  • [33] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1.
  • [34] C. Louizos, M. Welling, and D. P. Kingma (2018) Learning sparse neural networks through L0 regularization. ICLR. Cited by: §2, footnote 2.
  • [35] J. Luo, J. Wu, and W. Lin (2017) ThiNet: A filter level pruning method for deep neural network compression. In ICCV, Cited by: §1, §2, Table 1.
  • [36] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In NIPS, Cited by: §1.
  • [37] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of english: the penn treebank. Computational Linguistics. Cited by: §4.1.
  • [38] D. Molchanov, A. Ashukha, and D. P. Vetrov (2017) Variational dropout sparsifies deep neural networks. In ICML, Cited by: §1, §2.
  • [39] S. Narang, G. Diamos, S. Sengupta, and E. Elsen (2017) Exploring sparsity in recurrent neural networks. In ICLR, Cited by: §2.
  • [40] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1.
  • [41] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-Net: imagenet classification using binary convolutional neural networks. In ECCV, Cited by: §1.
  • [42] P. Savarese and M. Maire (2019) Learning implicitly recurrent CNNs through parameter sharing. In ICLR, Cited by: §1.
  • [43] P. Savarese, H. Silva, and M. Maire (2019) Winning the lottery with continuous sparsification. arXiv:1912.04427. Cited by: Figure 1, §2, §3.2, §3.2, §3.2.
  • [44] L. Sifre and P. Mallat (2014) Rigid-motion scattering for image classification. Ph.D. Thesis, Ecole Polytechnique, CMAP. Cited by: §1.
  • [45] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §1, §4.1, Table 1.
  • [46] W. Wen, Y. He, S. Rajbhandari, M. Zhang, W. Wang, F. Liu, B. Hu, Y. Chen, and H. Li (2018) Learning intrinsic sparse structures within long short-term memory. In ICLR, Cited by: §2, Table 4.
  • [47] T. Yang, A. G. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In ECCV, Cited by: §4.2, Table 2.
  • [48] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In BMVC, Cited by: §1, §4.1, Table 1.
  • [49] W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. arXiv:1409.2329. Cited by: §1, §4.1, Table 4.
  • [50] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. CVPR. Cited by: §1.
  • [51] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv:1611.01578. Cited by: §1.
  • [52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. CVPR. Cited by: §1.