1 Introduction
Deep neural networks (DNNs) have achieved dramatic accuracy improvements in a variety of machine learning tasks such as image classification
[26, 45], object detection [7, 32], semantic segmentation [33, 1] and language modeling [49]. Even though DNNs are typically overparameterized, recent work [14, 48, 20] shows that their performance on numerous tasks can be further improved by increasing their depth and width. Despite their success on benchmark datasets, the training and deployment of DNNs in many real-world applications is limited by their large number of parameters and computational costs. To address this, model compression and architecture search methods that learn more efficient DNN models have been proposed, yielding faster training and inference.

Efficiency improvements to DNNs have been extensively studied in previous works [23, 18, 19, 50]. For example, [22, 41] propose the use of binary weights and activations, benefiting from reduced storage costs and efficient computation through bit-counting operations. Other prominent approaches focus on finding efficient alternatives to standard spatial convolutions, e.g. depthwise separable convolutions [44], which apply a separate convolutional kernel to each channel followed by a pointwise convolution over all channels [3, 18, 50]. Pruning methods [11, 12, 10] aim to generate a lightweight version of a given network architecture by removing individual weights [12, 11, 38] or structured parameter sets [28, 15, 35] (e.g. filters in convolutional layers).
However, existing methods typically rely on retraining or fine-tuning phases after reducing the number of parameters so that accuracy is maintained, resulting in significant computational costs. Moreover, the majority of these methods train the full-sized model prior to pruning and do not aim to diminish train-time computational costs. Recently proposed network architecture search (NAS) methods [52, 51, 31, 36, 40, 42] utilize AutoML techniques to design efficient architectures under practical resource constraints. Nonetheless, most NAS methods operate on a large “supernet” architecture, yielding a computationally expensive search phase. In addition, few of the recently proposed methods are one-shot; hence, most require additional computation to retrain the final architecture in order to achieve high performance for deployment.
We propose a method to dynamically grow deep networks by continuously sparsifying structured parameter sets, resulting in efficient architectures and decreasing the computational cost not only of inference, but also of training. Unlike existing pruning or architecture search schemes that maintain a full-sized network or a “supernet”, we implicitly adapt architectures during training with different structured sparsity levels. More specifically, we first build a discrete space to maintain and explore adaptive train-time architectures of different complexities in a growing and pruning manner. To overcome the hardness of optimizing over a discrete space, we perform learning via continuation methods by approximating a discrete operation through a scaled smooth function. We design a bandwidth scheduler that is used to control the optimization hardness during this low-cost training procedure. The framework is illustrated in Figure 1. We conduct extensive experiments on classification tasks (CIFAR-10, ImageNet), semantic segmentation (PASCAL VOC) and word-level language modeling (PTB) to demonstrate the effectiveness of our method for both convolutional neural network (CNN) and recurrent neural network (RNN) architectures.
2 Related Work
Network Pruning: Network pruning methods can be split into two groups: those that prune individual weights and those that prune structured components. For individual weight-based pruning, elements of the weight matrices are removed based on some criterion. For example, [12] proposes to prune network weights with small magnitude, building a deep compression pipeline [11]. Sparse VD [38] yields extremely sparse solutions in both fully-connected and convolutional layers by using variational dropout. [34] learns sparse networks by approximating L0 regularization with a stochastic reparameterization. [39] presents a magnitude-based pruning approach for RNNs where all but the top-k largest-magnitude elements of the weights are set to 0 at each iteration. However, these methods produce sparse weight matrices and thus only lead to speedups on dedicated hardware with supporting libraries.
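As a concrete illustration of the magnitude criterion behind these methods, the following is a minimal numpy sketch under our own simplifying assumptions (the function name and example values are hypothetical, not taken from the referenced implementations):

```python
import numpy as np

def magnitude_prune(weights, k):
    """Zero out all but the k largest-magnitude entries of a weight array
    (a sketch of the magnitude criterion, not any paper's actual code)."""
    flat = np.abs(weights).ravel()
    if k >= flat.size:
        return weights.copy()
    threshold = np.partition(flat, -k)[-k]  # magnitude of the k-th largest entry
    return weights * (np.abs(weights) >= threshold)

W = np.array([[0.9, -0.05, 0.3],
              [-0.7, 0.01, -0.2]])
W_pruned = magnitude_prune(W, k=3)
print(W_pruned)  # only 0.9, -0.7 and 0.3 survive
```

Note that the resulting zeros are scattered through the matrix, which is exactly why such unstructured sparsity only pays off on hardware with sparse-kernel support.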
In structured methods, pruning is applied at the level of neurons, channels, or even layers. For example, L1 pruning [28] removes channels based on the L1 norm of their filters. [15] uses group sparsity to smooth the pruning process after training. ThiNet [35] greedily prunes the channel that has the smallest effect on the next layer’s activation values. MorphNet [8] regularizes weights towards zero until they are small enough that the corresponding output channels are marked for removal from the network. Intrinsic Structured Sparsity (ISS) [46] works on LSTMs [17] by collectively removing the columns and rows of the weight matrices via group LASSO. Our work is more related to structured pruning methods in the sense that a slim architecture is generated at the end of training. In addition, our work also focuses on adapting the train-time structured sparsification in a discrete growing and pruning space.
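The channel-ranking step of L1-norm filter pruning can be sketched as follows (an illustrative numpy fragment with assumed shapes and function names; it omits the actual removal and fine-tuning stages of the referenced pipelines):

```python
import numpy as np

def l1_filter_scores(conv_weight):
    """Per-filter L1 norms for a conv weight of shape (C_out, C_in, k, k)."""
    return np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)

def select_filters(conv_weight, keep_ratio):
    """Indices (sorted) of the filters retained under L1-norm ranking."""
    scores = l1_filter_scores(conv_weight)
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    return np.sort(np.argsort(scores)[-n_keep:])

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))      # a toy layer with 8 filters
kept = select_filters(W, keep_ratio=0.5)
print(kept.size)  # 4 filters retained
```

Because whole filters are removed, the surviving layer is a dense, smaller convolution, which yields speedups on ordinary hardware, unlike the unstructured case above.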
Lottery Ticket Hypothesis and Continuous Sparsification: The Lottery Ticket Hypothesis [6] conjectures that sparse subnetworks, together with their randomly initialized weights, can obtain accuracy comparable to the original network when trained in isolation. [43] further proposes Continuous Sparsification, a method to speed up ticket search, which relaxes the original discrete objective into an intermediate problem that is easier to optimize. By gradually increasing the difficulty of the underlying objective during training, it produces a sequence of optimization problems converging to the original, intractable objective. In our method, we directly adopt Continuous Sparsification [43] to formulate a gradual relaxation scheme in the context of structured pruning.
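The continuation idea is easy to visualize: a scaled smooth gate converges pointwise to a binary step as the scale grows, so each intermediate problem is a smoothed version of the discrete one (a toy numpy demonstration; the symbols here are ours, not the papers'):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

s = np.array([-0.5, -0.1, 0.1, 0.5])   # continuous surrogates of binary masks
for beta in (1.0, 10.0, 100.0):
    print(beta, np.round(sigmoid(beta * s), 3))
# As beta grows, sigmoid(beta * s) approaches the hard 0/1 mask given by sign(s),
# so a sequence of smooth problems converges to the discrete pruning objective.
```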
3 Method
3.1 Discrete Growing and Pruning Space
Given a network topology, we build a discrete space to maintain adaptive train-time architectures of different complexities in a growing and pruning manner. A network topology can be seen as a directed acyclic graph consisting of an ordered sequence of nodes, where each node is an input feature and each edge is a computation cell with structured hyperparameters (e.g. filter numbers in convolutional layers or hidden neuron numbers in recurrent cells). The discrete growing and pruning space can be parameterized by associating a binary mask variable $q \in \{0, 1\}$ with each computation cell (edge), which enables train-time pruning ($q: 1 \to 0$) and growing ($q: 0 \to 1$) dynamics.
For a convolutional layer with $C_{in}$ input channels, $C_{out}$ output channels (filters) and $k \times k$ sized kernels, the $j$-th output feature is computed based on the $j$-th filter, i.e. for $j = 1, \dots, C_{out}$:

$$y_j = q_j \cdot (W_j * x), \qquad q_j \in \{0, 1\} \tag{1}$$

where $W_j \in \mathbb{R}^{C_{in} \times k \times k}$ and $*$ denotes convolution. For a recurrent cell, without loss of generality, we focus on LSTMs [17] with $h$ hidden neurons, a common variant of RNNs that learns long-term dependencies (the proposed growing space can be readily applied to the compression of GRUs [2] and vanilla RNNs):
$$f_t = \sigma\big(q \odot (W_f x_t + U_f h_{t-1})\big), \qquad i_t = \sigma\big(q \odot (W_i x_t + U_i h_{t-1})\big),$$
$$o_t = \sigma\big(q \odot (W_o x_t + U_o h_{t-1})\big), \qquad g_t = \tanh\big(q \odot (W_g x_t + U_g h_{t-1})\big),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t) \tag{2}$$
where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication and $\tanh$ is the hyperbolic tangent function. $x_t$ denotes the input vector at timestep $t$, $h_t$ denotes the current hidden state, and $c_t$ denotes the long-term memory cell state. $W_f, W_i, W_o, W_g$ denote the input-to-hidden weight matrices and $U_f, U_i, U_o, U_g$ denote the hidden-to-hidden weight matrices. The mask $q \in \{0, 1\}^h$ is shared across all the gates to control the sparsity of hidden neurons.

We can optimize the trade-off between model performance and structured sparsification by considering the training objective
$$\min_{W,\, q \in \{0,1\}^m}\; \mathcal{L}\big(f(x;\, q \odot W)\big) + \lambda\, \|q\|_0 \tag{3}$$
where $f$ can be the operation of convolutional layers in Eq. 1 or LSTM cells in Eq. 2 with trainable weights $W$, $q \odot W$ is a general expression of structured sparsified weight matrices in our proposed space, and $\mathcal{L}$ corresponds to a loss function (e.g. cross-entropy loss for classification). The term $\|q\|_0$ penalizes the number of non-zero mask values, thus encouraging sparsity, and $\lambda$ is a trade-off parameter between $\mathcal{L}$ and the penalty. In the growing and pruning space, a model is optimal if it minimizes the combined cost of the description of model complexity and the loss between the model and the data. However, optimizing Eq. 3 is computationally intractable due to the combinatorial nature of the binary states.

3.2 Continuous Relaxation and Optimization
Learning by Continuation: To make the search space continuous and the optimization feasible, we adopt the framework proposed in [43], used to derive Continuous Sparsification.
First, we reparameterize $q$ as the binary sign of a continuous variable $s$:

$$q = \frac{\operatorname{sign}(s) + 1}{2}, \qquad s \in \mathbb{R} \tag{4}$$
and rewrite the objective in Eq. 3 as

$$\min_{W,\, s}\; \mathcal{L}\Big(f\big(x;\; \tfrac{\operatorname{sign}(s)+1}{2} \odot W\big)\Big) + \lambda\, \Big\|\tfrac{\operatorname{sign}(s)+1}{2}\Big\|_1 \tag{5}$$
Following [43], we attack the hard and discontinuous optimization problem in Eq. 5 by starting with an easier objective which becomes harder as training proceeds. As in [43], we use a sequence of functions whose limit is the sign operation. Instead of the sigmoid activation function, we adopt the hard sigmoid function $\bar{\sigma}$ (note that the original hard sigmoid function in [34] is defined slightly differently). Similar to the sigmoid function, we have that for any $s \neq 0$, $\lim_{\beta \to \infty} \bar{\sigma}(\beta s) = \frac{\operatorname{sign}(s)+1}{2}$, where $\beta > 0$ is a bandwidth parameter.

Auxiliary Discrete Variable: Using continuation methods, we can express our final objective as:
$$\min_{W,\, s}\; \mathbb{E}_{q}\Big[\mathcal{L}\big(f(x;\, q \odot W)\big)\Big] + \lambda\, \big\|\bar{\sigma}(\beta s)\big\|_1, \qquad q \sim \mathrm{Bernoulli}\big(\bar{\sigma}(\beta s)\big) \tag{6}$$
where $q$ is sampled from $\mathrm{Bernoulli}(\bar{\sigma}(\beta s))$. By increasing $\beta$, Eq. 6 becomes harder to optimize while the objective converges to the original discrete one. Different from [43], we introduce a 0-1 sampled auxiliary variable $q$ based on the probability value $\bar{\sigma}(\beta s)$. Thus we (1) effectively reduce training computational cost, since any train-time architecture is sampled as a structured sparse one; and (2) avoid using a suboptimal thresholding criterion to generate the inference architecture at the end of training.

Bandwidth Scheduler: We start training deep networks using Eq. 6 with $\beta = \beta_0$, where the initial value is set as 1. We adapt the bandwidth to control the optimization difficulty by instantiating a bandwidth scheduler in two ways: globally, and structure-wise separately. A global bandwidth scheduler is called at the end of each training epoch and updates $\beta$ on all activation functions following

$$\beta = \min\big(\beta_{\max},\; \beta_0 + (\gamma \cdot \text{n\_iters})^{\delta}\big) \tag{7}$$

where $\beta_0$ is the initial bandwidth, which is set as 1, n_iters is the number of training iterations so far, and $\beta_{\max}$ is used to constrain $\beta$ to a certain range; in our experiments, we set $\beta_{\max}$ as 100. Constants $\gamma$ and $\delta$ are hyperparameters that govern the increasing speed of the bandwidth during the progressive training procedure. Note that such an adaptive control scheme can be customized for different resource requirements (e.g. training computation cost) by tuning $\gamma$ and $\delta$. A structure-wise separate bandwidth scheduler requires one additional step: for each mask variable, instead of using the global counter n_iters, we keep a separate counter n_sampled_iters which is increased only when its associated mask value is sampled as 1 in Eq. 6. Similarly, we instantiate this scheduler with
$$\beta = \min\big(\beta_{\max},\; \beta_0 + (\gamma \cdot \text{n\_sampled\_iters})^{\delta}\big) \tag{8}$$
Intuitively, the structure-wise separate bandwidth scheduler is more compelling because it allows the bandwidth to increase at different rates for different mask variables: masks that are sampled more frequently, indicating a higher probability of not being pruned, become more stable due to the higher optimization difficulty, while masks sampled less often at early stages may still have a chance to be grown under a relatively lower $\beta$. The global scheduler may fail to handle such cases. In our experiments, we report performance using the structure-wise scheduler, and we also compare the two alternatives during training.
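The two schedulers can be sketched compactly as follows. The power-law form and constants below are assumptions loosely consistent with the hyperparameters reported in Section 4.1; the paper's exact update rule may differ:

```python
import numpy as np

def bandwidth(n, beta0=1.0, gamma=0.0005, delta=0.7, beta_max=100.0):
    """Clipped, monotonically increasing bandwidth schedule (assumed form):
    grows with the counter n and is capped at beta_max."""
    return min(beta_max, beta0 + (gamma * n) ** delta)

class StructureWiseScheduler:
    """Structure-wise variant: each mask keeps its own n_sampled_iters counter,
    incremented only on iterations where that mask is sampled as 1."""
    def __init__(self, n_masks):
        self.counters = np.zeros(n_masks, dtype=np.int64)

    def step(self, sampled_masks):
        self.counters += np.asarray(sampled_masks, dtype=np.int64)
        return np.array([bandwidth(int(c)) for c in self.counters])

sched = StructureWiseScheduler(n_masks=3)
betas = sched.step([1, 0, 1])   # masks 0 and 2 were sampled this iteration
print(np.round(betas, 4))       # mask 1 keeps the initial bandwidth
```

The global scheduler corresponds to calling `bandwidth(n_iters)` once per epoch and broadcasting the result to all masks.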
In summary, Algorithm 1 gives the full details of our optimization procedure with the structure-wise separate bandwidth scheduler.
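The loop structure of Algorithm 1 can be sketched end-to-end in a toy setting. This is our own simplification: the task-loss gradient is omitted, and the hard-sigmoid form and schedule constants are assumed, so only the sampling and hardening dynamics are illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_sigmoid(x):
    # One concrete choice of clipped-linear gate (assumed form):
    # hard_sigmoid(beta * s) converges to the 0/1 step of sign(s) as beta grows.
    return np.clip(x + 0.5, 0.0, 1.0)

n_filters = 8
s = rng.normal(scale=0.25, size=n_filters)   # continuous mask variables
beta = np.ones(n_filters)                    # structure-wise bandwidths
lam, lr = 0.01, 0.1                          # sparsity weight, mask learning rate

for it in range(200):
    p = hard_sigmoid(beta * s)                       # soft keep-probabilities
    q = (rng.random(n_filters) < p).astype(float)    # sampled 0/1 masks
    # The forward/backward pass with q-masked filters would go here; as a
    # stand-in we apply only the sparsity penalty's gradient, which is
    # lam * beta inside the linear region of the hard sigmoid.
    in_linear = (np.abs(beta * s) < 0.5).astype(float)
    s -= lr * lam * beta * in_linear
    # Structure-wise schedule: a mask's bandwidth grows only when sampled.
    beta = np.minimum(100.0, beta + 0.05 * q)

q_final = (hard_sigmoid(beta * s) > 0).astype(int)
print(int(q_final.sum()), "of", n_filters, "filters retained")
```

Frequently sampled masks see their bandwidth rise until they leave the linear region and stabilize at 1, while rarely sampled masks drift down and are eventually pruned, mirroring the grow-and-prune dynamics described above.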
4 Experiments
4.1 Experimental Setup
Datasets: Evaluation is conducted on various tasks to demonstrate the effectiveness of our proposed method. For image classification, we use CIFAR-10 [25] and ImageNet [4]: CIFAR-10 consists of 60,000 images of 10 classes, with 6,000 images per class. The train and test sets contain 50,000 and 10,000 images, respectively. ImageNet is a large dataset for visual recognition which contains over 1.2M images in the training set and 50K images in the validation set, covering 1,000 categories. For semantic segmentation, we use the PASCAL VOC 2012 [5] benchmark, which contains 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level labeled images for training, validation, and testing, respectively. The dataset is augmented by the extra annotations provided by [13], resulting in 10,582 training images. For language modeling, we use the word-level Penn Treebank (PTB) dataset [37], which consists of 929k training words, 73k validation words and 82k test words, with 10,000 unique words in its vocabulary.
Unpruned Baseline Models: For CIFAR-10, we use VGG16 [45] with BatchNorm [24], ResNet-20 [14] and WideResNet-28-10 [48] as baselines. We adopt a standard data augmentation scheme (shifting/mirroring) following [30, 21], and normalize the input data with channel means and standard deviations. Note that we use the CIFAR versions of ResNet-20, VGG16, and WideResNet-28-10. VGG16, ResNet-20, and WideResNet-28-10 are trained for 160, 160 and 200 epochs, respectively, with a batch size of 128 and an initial learning rate of 0.1. For VGG16 and ResNet-20 we divide the learning rate by 10 at epochs 80 and 120 and set the weight decay and momentum as and 0.9. For WideResNet-28-10, the learning rate is divided by 5 at epochs 60, 120 and 160; the weight decay and momentum are set to and 0.9. For ImageNet, we train the baseline ResNet-50 and MobileNetV1 models following the respective papers. We adopt the same data augmentation scheme as in [9] and report top-1 validation accuracy. For semantic segmentation, performance is measured in terms of pixel intersection-over-union averaged across the 21 classes (mIOU). We use Deeplabv3-ResNet101 (https://github.com/chenxi116/DeepLabv3.pytorch) [1] as the baseline model, following the training details in [1]. For language modeling, we use a vanilla two-layer stacked LSTM [49] as the baseline. The dropout keep ratio is 0.35 for the baseline model. The vocabulary size, embedding size, and hidden size of the stacked LSTMs are set as 10,000, 1,500, and 1,500, respectively, consistent with the settings in [49].

Implementation Details: There are two kinds of trainable variables in our method, denoted as model weights and mask weights. As a one-shot method, for model weights we adopt the same hyperparameters as the corresponding unpruned baseline models, except that the dropout keep ratio for language modeling is set as 0.5. For mask variables, we initialize the weights as 0 and use SGD training with an initial learning rate of 0.1, weight decay of 0 and momentum of 0.9 on all datasets. The learning rate schedule is the same as that of the corresponding model weights. The trade-off parameter $\lambda$ is set as 0.01 on classification and semantic segmentation tasks, and 0.1 for language modeling. For the bandwidth scheduler, we report model performance trained with the structure-wise separate scheduler, where ($\gamma$, $\delta$) are set as (0.0005, 0.7) for classification and segmentation models, and (0.0005, 1.2) for language modeling, respectively. All and are set as 0 and 100. We also conduct a parameter sensitivity analysis of sparsity and accuracy in terms of $\gamma$ and $\delta$.
Table 1: Pruning results of VGG16, ResNet-20, and WideResNet-28-10 on CIFAR-10.

| Model | Method | Val Acc (%) | Params (M) | FLOPs (%) |
| --- | --- | --- | --- | --- |
| VGG16 [45] | Original | 92.9 (+0.0) | 14.99 (100%) | 100 |
| | L1 [28] | 91.8 (-1.1) | 2.98 (19.9%) | 19.9 |
| | SoftNet [15] | 92.1 (-0.8) | 5.40 (36.0%) | 36.1 |
| | ThiNet [35] | 90.8 (-2.1) | 5.40 (36.0%) | 36.1 |
| | Provable [29] | 92.4 (-0.5) | 0.85 (5.7%) | 15.0 |
| | Ours | 92.9 (-0.0) | 1.50 (10.0%) | 16.5 |
| ResNet-20 [14] | Original | 91.3 (+0.0) | 0.27 (100%) | 100 |
| | L1 [28] | 90.9 (-0.4) | 0.15 (55.6%) | 55.4 |
| | SoftNet [15] | 90.8 (-0.5) | 0.14 (53.6%) | 50.6 |
| | ThiNet [35] | 89.2 (-2.1) | 0.18 (67.1%) | 67.3 |
| | Provable [29] | 90.8 (-0.5) | 0.10 (37.3%) | 54.5 |
| | Ours | 91.1 (-0.2) | 0.11 (39.1%) | 59.8 |
| WideResNet-28-10 [48] | Original | 96.2 (+0.0) | 36.5 (100%) | 100 |
| | L1 [28] | 95.2 (-1.0) | 7.6 (20.8%) | 49.5 |
| | BAR (16x V) [27] | 92.0 (-4.2) | 2.3 (6.3%) | 1.5 |
| | Ours | 95.6 (-0.6) | 2.6 (7.1%) | 18.6 |
4.2 Results
VGG16, ResNet-20, and WideResNet-28-10 on CIFAR-10: Table 1 shows the pruning results in terms of validation accuracy, retained parameters, and FLOPs of VGG16, ResNet-20, and WideResNet-28-10 on CIFAR-10. We compare with various pruning algorithms that we implement and run alongside our algorithm. We can see that ours achieves either a larger pruning ratio or less degradation in accuracy. Our pruned VGG16 and ResNet-20 achieve parameter and FLOPs reductions comparable to the recently proposed Provable [29] method while outperforming it by 0.5 and 0.3 in validation accuracy. For the very aggressively pruned WideResNet-28-10, we observe that the model produced by BAR [27] might not retain enough capacity to achieve a negligible accuracy drop, even with knowledge distillation [16] during the training process.
Table 2: Pruning results of ResNet-50 and MobileNetV1 on ImageNet.

| Model | Method | Top-1 Val Acc (%) | Params (M) | FLOPs (%) |
| --- | --- | --- | --- | --- |
| ResNet-50 [14] | Original | 76.1 (+0.0) | 23.0 (100%) | 100 |
| | L1 [28] | 74.7 (-1.4) | 19.6 (85.2%) | 77.5 |
| | SoftNet [15] | 74.6 (-1.5) | N/A | 58.2 |
| | Provable [29] | 75.2 (-0.9) | 15.2 (65.9%) | 70.0 |
| | Ours | 75.2 (-0.9) | 16.5 (71.7%) | 64.5 |
| MobileNetV1 (128) [18] | Original (25%) | 45.1 (+0.0) | 0.47 (100%) | 100 |
| | MorphNet [8] | 46.0 (+0.9) | N/A | 110 |
| | NetAdapt [47] | 46.3 (+1.2) | N/A | 81 |
| | Ours | 46.0 (+0.9) | 0.41 (87.2%) | 70 |
ResNet-50 and MobileNetV1 on ImageNet: To validate the effectiveness of the proposed method on large-scale datasets, we further prune the widely used ResNet-50 and MobileNetV1 (128 × 128 resolution) on ImageNet and compare the performance of our method to the results reported directly in the respective papers, as shown in Table 2. In the MobileNetV1 experiments, following the same setting as NetAdapt [47], we apply our method to MobileNetV1 with a 0.5 multiplier while setting the original model's multiplier as 0.25 for comparison. Note that 50%-MobileNetV1 (128) is one of the most compact networks, and thus more challenging to simplify than larger networks. Our method can still generate a sparser MobileNetV1 model compared with competing methods.
Deeplabv3-ResNet101 on PASCAL VOC 2012: We also test the effectiveness of our proposed method on the semantic segmentation task by pruning the Deeplabv3-ResNet101 model on the PASCAL VOC 2012 dataset. We apply our method to both the ResNet-101 backbone and the ASPP module. Compared to the baseline, our pruned network reduces the FLOPs by 54.5% and the parameter count by 41.8% while approximately maintaining mIoU (76.5% to 76.2%). See Table 3.
2-Stacked-LSTMs on PTB: We compare our proposed method with ISS on a vanilla two-layer stacked LSTM. As shown in Table 4, our method finds a very compact model structure while achieving similar perplexity on both validation and test sets. Specifically, our method achieves a 3.2× model size reduction and a 7.8× FLOPs reduction from the baseline model. Note that, for fair comparison, we only prune the LSTM structure while keeping the embedding layer unchanged, following the same setting as ISS. Our method achieves a more compact structure than ISS, further reducing the hidden units from (373, 315) to (319, 285). These improvements may be due to the fact that our method dynamically grows and prunes the hidden neurons towards a better trade-off between model complexity and performance, whereas ISS simply uses the group LASSO to penalize the norms of all groups collectively for compactness.
4.3 Analysis
Table 4: Results of two-layer stacked LSTMs on PTB.

| Method | Perplexity (val, test) | Final Structure | Weights (M) | FLOPs (%) |
| --- | --- | --- | --- | --- |
| Original [49] | (82.57, 78.57) | (1500, 1500) | 66M (100%) | 100 |
| ISS [46] | (82.59, 78.65) | (373, 315) | 21.8M (33.1%) | 13.4 |
| Ours | (82.16, 78.67) | (319, 285) | 20.9M (31.7%) | 12.8 |
Dynamic Train-time Cost: One advantage of our method over conventional pruning methods is that we effectively decrease the computational cost not only of inference but also of training via structured continuous sparsification. Figure 3 shows the dynamics of train-time layer-wise FLOPs and stage-wise retained-filter ratios of VGG16 and ResNet-20 on CIFAR-10, respectively. From Figures 3(b) and 3(d) we see that our method preserves more filters in the earlier stages (1 and 2) of VGG16 and in the earlier layers within each stage of ResNet-20. Also, the layer-wise final sparsity of ResNet-20 is more uniform due to the residual connections.
Ours as Structured Regularization: We investigate the value of our proposed automatic pruning method as a more efficient training method with structured regularization. We re-initialize the pruned ResNet-20 and two-layer stacked LSTM and retrain them from scratch on CIFAR-10 and PTB, respectively. Compared with their reported pruned-model performance, we notice a performance degradation on both the ResNet-20 (accuracy 91.1% → 90.8% (-0.3)) and LSTM models (test perplexity 78.67 → 86.22 (+7.55)). Our method appears to have a positive effect in terms of regularization or optimization dynamics, which is lost if one attempts to directly train the final compact structure.
Parameter Sensitivity: We analyze the sparsity and performance sensitivity with respect to the parameters of the (structure-wise separate) bandwidth scheduler. We measure performance for combinations of $\gamma$ and $\delta$; specifically, we measure the normalized parameter sparsity and validation accuracy of ResNet-20 on the CIFAR-10 dataset, as shown in Figure 3. From Figure 3(a) we can see how to choose the hyperparameters $\gamma$ and $\delta$ to achieve highly sparse networks. Figure 3(b) shows that when $\gamma$ and $\delta$ lie in a relatively large range (e.g. the bottom-right region), the validation accuracy is robust to changes in these hyperparameters.
Investigation on Bandwidth Schedulers: We investigate the effect of the global scheduler and the structure-wise separate scheduler by conducting experiments on CIFAR-10 using VGG16, ResNet-20, and WideResNet-28-10. The results using the structure-wise separate scheduler are as reported in Table 1. For the global scheduler, we note that to achieve similar sparsity, the pruned models suffer accuracy drops of 0.5%, 0.2%, and 1.2%, respectively. With the global scheduler, optimization of all filters' masks stops at very early epochs, and the remaining epochs of training are equivalent to directly training a stabilized compact structure. This may lock the network into a suboptimal structure, compared to our separate scheduler, which dynamically grows and prunes over a longer duration.
5 Conclusion
In this paper, we propose a simple yet effective method to grow efficient deep networks via structured continuous sparsification, which decreases the computational cost not only of inference but also of training. The method is simple to implement and quick to execute, and aims to automate the network structure sparsification process for general purposes. The pruning results for widely used deep networks on various computer vision and language modeling tasks show that our method consistently generates smaller and more accurate networks compared to competing methods.
There are many interesting directions for further investigation. For example, while our current sparsification process is designed with a generic objective, it would be interesting to incorporate model size and FLOPs constraints into the training objective in order to target a particular resource budget. Additionally, our method's growing and pruning space is anchored to a given network topology; future work could explore an architectural design space in which large subcomponents of the network are themselves candidates for pruning.
References
 [1] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587. Cited by: §1, §4.1, Table 3.
 [2] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: footnote 1.

 [3] (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, Cited by: §1.
 [4] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §4.1.
 [5] (2015) The PASCAL visual object classes challenge: A retrospective. IJCV. Cited by: §4.1.
 [6] (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR, Cited by: §2.
 [7] (2015) Fast R-CNN. In ICCV, Cited by: §1.
 [8] (2018) MorphNet: fast & simple resourceconstrained structure learning of deep networks. In CVPR, Cited by: §2, Table 2.
 [9] (2016) Training and investigating residual nets. http://torch.ch/blog/2016/02/04/resnets.html. Cited by: §4.1.
 [10] (2016) Dynamic network surgery for efficient DNNs. In NIPS, Cited by: §1.
 [11] (2016) Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. ICLR. Cited by: §1, §2.
 [12] (2015) Learning both weights and connections for efficient neural networks. NIPS. Cited by: §1, §2.
 [13] (2011) Semantic contours from inverse detectors. In ICCV, Cited by: §4.1.
 [14] (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §4.1, Table 1, Table 2.
 [15] (2018) Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, Cited by: §1, §2, Table 1, Table 2.
 [16] (2015) Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop. Cited by: §4.2.
 [17] (1997) Long shortterm memory. Neural Computation. Cited by: §2, §3.1.
 [18] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Cited by: §1, Table 2.
 [19] (2018) CondenseNet: an efficient DenseNet using learned group convolutions. CVPR. Cited by: §1.
 [20] (2017) Densely connected convolutional networks. In CVPR, Cited by: §1.
 [21] (2016) Deep networks with stochastic depth. In ECCV, Cited by: §4.1.
 [22] (2016) Binarized neural networks. In NIPS, Cited by: §1.
 [23] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360. Cited by: §1.
 [24] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §4.1.
 [25] (2014) The CIFAR10 dataset. http://www.cs.toronto.edu/~kriz/cifar.html. Cited by: §4.1.
 [26] (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
 [27] (2019) Structured pruning of neural networks with budgetaware regularization. In CVPR, Cited by: §4.2, Table 1.
 [28] (2017) Pruning filters for efficient ConvNets. In ICLR, Cited by: §1, §2, Table 1, Table 2, Table 3.
 [29] (2020) Provable filter pruning for efficient neural networks. In ICLR, Cited by: §4.2, Table 1, Table 2.
 [30] (2013) Network in network. arXiv:1312.4400. Cited by: §4.1.
 [31] (2019) DARTS: differentiable architecture search. ICLR. Cited by: §1.
 [32] (2016) SSD: single shot multibox detector. In ECCV, Cited by: §1.
 [33] (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1.
 [34] (2018) Learning sparse neural networks through L0 regularization. ICLR. Cited by: §2, footnote 2.
 [35] (2017) ThiNet: A filter level pruning method for deep neural network compression. In ICCV, Cited by: §1, §2, Table 1.
 [36] (2018) Neural architecture optimization. In NIPS, Cited by: §1.
 [37] (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics. Cited by: §4.1.
 [38] (2017) Variational dropout sparsifies deep neural networks. In ICML, Cited by: §1, §2.
 [39] (2017) Exploring sparsity in recurrent neural networks. In ICLR, Cited by: §2.
 [40] (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1.
 [41] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, Cited by: §1.
 [42] (2019) Learning implicitly recurrent CNNs through parameter sharing. In ICLR, Cited by: §1.
 [43] (2019) Winning the lottery with continuous sparsification. arXiv:1912.04427. Cited by: Figure 1, §2, §3.2, §3.2, §3.2.
 [44] (2014) Rigid-motion scattering for image classification. Ph.D. Thesis, Ecole Polytechnique, CMAP. Cited by: §1.
 [45] (2015) Very deep convolutional networks for largescale image recognition. ICLR. Cited by: §1, §4.1, Table 1.
 [46] (2018) Learning intrinsic sparse structures within long shortterm memory. In ICLR, Cited by: §2, Table 4.
 [47] (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In ECCV, Cited by: §4.2, Table 2.
 [48] (2016) Wide residual networks. In BMVC, Cited by: §1, §4.1, Table 1.
 [49] (2014) Recurrent neural network regularization. arXiv:1409.2329. Cited by: §1, §4.1, Table 4.
 [50] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. CVPR. Cited by: §1.

 [51] (2016) Neural architecture search with reinforcement learning. arXiv:1611.01578. Cited by: §1.
 [52] (2018) Learning transferable architectures for scalable image recognition. CVPR. Cited by: §1.