Efficiency improvements to DNNs have been extensively studied in previous works [23, 18, 19, 50]. For example, [22, 41] propose the use of binary weights and activations, benefiting from reduced storage costs and efficient computation through bit-counting operations. Other prominent approaches focus on finding efficient alternatives to standard spatial convolutions, e.g. depth-wise separable convolutions, which apply a separate convolutional kernel to each channel followed by a point-wise convolution over all channels [3, 18, 50]. Pruning methods [11, 12, 10] aim to generate a lightweight version of a given network architecture by removing individual weights [12, 11, 38] or structured parameter sets [28, 15, 35] (e.g. filters in convolutional layers).
However, existing methods typically rely on retraining or fine-tuning phases after reducing the number of parameters so that accuracy is maintained, resulting in significant computational costs. Moreover, the majority of these methods train the full-sized model prior to pruning and do not aim to diminish train-time computational costs. Recently proposed network architecture search (NAS) methods [52, 51, 31, 36, 40, 42] utilize AutoML techniques to design efficient architectures under practical resource constraints. Nonetheless, most NAS methods operate on a large “supernet” architecture, yielding a computationally expensive search phase. In addition, few of the recently proposed methods are one-shot; most require additional computation to retrain the final architecture in order to achieve high performance for deployment.
We propose a method to dynamically grow deep networks by continuously sparsifying structured parameter sets, resulting in efficient architectures and decreasing the computational cost not only of inference, but also of training. Unlike existing pruning or architecture search schemes that maintain a full-sized network or a “supernet”, we implicitly adapt architectures during training with different structured sparsity levels. More specifically, we first build a discrete space to maintain and explore adaptive train-time architectures of different complexities in a growing and pruning manner. To overcome the hardness of optimizing over a discrete space, we perform learning via continuation methods by approximating a discrete operation through a scaled smooth function. We design a bandwidth scheduler that is used to control the optimization hardness during this low-cost training procedure. The framework is illustrated in Figure 1. We conduct extensive experiments on classification tasks (CIFAR-10, ImageNet), semantic segmentation (PASCAL VOC) and word-level language modeling (PTB) to demonstrate the effectiveness of our method for both convolutional neural network (CNN) and recurrent neural network (RNN) architectures.
2 Related Work
Network Pruning: Network pruning methods can be split into two groups: those that prune individual weights and those that prune structured components. For individual weight-based pruning, elements of the weight matrices are removed based on some criterion. For example, magnitude-based approaches prune network weights with small magnitude and build them into a deep compression pipeline. Sparse VD yields extremely sparse solutions in both fully-connected and convolutional layers by using variational dropout. Another line of work learns sparse networks by approximating $L_0$-regularization with a stochastic reparameterization, or applies magnitude-based pruning to RNNs by setting all but the top-k elements of the weights to 0 at each iteration. However, methods that produce sparse weight matrices only lead to speedups on dedicated hardware with supporting libraries.
In structured methods, pruning is applied at the level of neurons, channels, or even layers. For example, L1-pruning removes channels based on the $L_1$ norm of their filters. Group-sparsity regularization can be used to smooth the pruning process after training. ThiNet greedily prunes the channel that has the smallest effect on the next layer’s activation values. MorphNet regularizes weights towards zero until they are small enough that the corresponding output channels are marked for removal from the network. Intrinsic Structured Sparsity (ISS) operates on LSTMs by collectively removing the columns and rows of the weight matrices via group LASSO.
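As a concrete illustration of the structured criterion behind L1-pruning, the following sketch (in NumPy, with hypothetical function names, not any paper's reference implementation) ranks the filters of a convolutional layer by the $L_1$ norm of their weights and discards the smallest:

```python
import numpy as np

def l1_filter_prune(conv_weight, keep_ratio):
    """Keep only the top `keep_ratio` fraction of filters, ranked by the
    L1 norm of their weights (smallest-norm filters are removed).

    conv_weight: array of shape (out_channels, in_channels, k, k).
    Returns the pruned weight tensor and the indices of kept filters.
    """
    out_channels = conv_weight.shape[0]
    norms = np.abs(conv_weight).reshape(out_channels, -1).sum(axis=1)
    n_keep = max(1, int(round(out_channels * keep_ratio)))
    kept = np.sort(np.argsort(norms)[::-1][:n_keep])  # largest-norm filters
    return conv_weight[kept], kept

# Toy layer: 4 filters; filter 2 has the largest L1 norm, filter 0 the smallest.
w = np.zeros((4, 3, 3, 3))
w[0] += 0.01; w[1] += 0.5; w[2] += 1.0; w[3] += 0.2
pruned, kept = l1_filter_prune(w, keep_ratio=0.5)
# kept -> filters 1 and 2; pruned.shape -> (2, 3, 3, 3)
```

Note that removing a filter here also removes the corresponding input channel of the next layer, which is why structured pruning yields speedups on generic hardware.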
Our work is most related to structured pruning methods in the sense that a slim architecture is generated at the end of training. In addition, our work also focuses on adapting the train-time structured sparsification within a discrete growing and pruning space.
Lottery Ticket Hypothesis and Continuous Sparsification: The Lottery Ticket Hypothesis conjectures that sparse sub-networks, together with their randomly initialized weights, can obtain accuracy comparable to the original network when trained in isolation. Continuous Sparsification was subsequently proposed as a method to speed up ticket search: it approaches a complex optimization problem by relaxing the original objective, turning it into an intermediate problem that is easier to optimize. By gradually increasing the difficulty of the underlying objective during training, it produces a sequence of optimization problems converging to the original, intractable objective. In our method, we directly adopt Continuous Sparsification to formulate a gradual relaxation scheme in the context of structured pruning.
3.1 Discrete Growing and Pruning Space
Given a network topology, we build a discrete space to maintain adaptive train-time architectures of different complexities in a growing and pruning manner. A network topology can be seen as a directed acyclic graph consisting of an ordered sequence of nodes. Each node is an input feature and each edge is a computation cell with structured hyperparameters (e.g. the number of filters in convolutional layers or the number of hidden neurons in recurrent cells). The discrete growing and pruning space can be parameterized by associating a binary mask variable $m$ with each computation cell (edge), which enables train-time pruning ($m = 0$) and growing ($m = 1$) dynamics.
For a convolutional layer with $c$ input channels, $n$ output channels (filters) and $k \times k$ sized kernels, the $j$-th output feature is computed based on the $j$-th filter, i.e. for $j = 1, \dots, n$:

$$\mathbf{o}_j = m_j \cdot \left( \mathbf{W}_j * \mathbf{x} \right), \qquad (1)$$

where $m_j \in \{0, 1\}$ is the mask associated with the $j$-th filter, $\mathbf{W}_j \in \mathbb{R}^{c \times k \times k}$, and $*$ denotes convolution. For a recurrent cell, without loss of generality, we focus on LSTMs with $h$ hidden neurons, a common variant of RNNs that learns long-term dependencies (the proposed growing space can be readily applied to the compression of GRUs and vanilla RNNs):
$$\begin{aligned} \mathbf{i}_t &= \mathbf{m} \odot \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1}), & \mathbf{f}_t &= \mathbf{m} \odot \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1}), \\ \mathbf{o}_t &= \mathbf{m} \odot \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1}), & \mathbf{g}_t &= \mathbf{m} \odot \tanh(\mathbf{W}_g \mathbf{x}_t + \mathbf{U}_g \mathbf{h}_{t-1}), \\ \mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t, & \mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t). \end{aligned} \qquad (2)$$

Here $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $\tanh$ is the hyperbolic tangent function. $\mathbf{x}_t$ denotes the input vector at the $t$-th time-step, $\mathbf{h}_t$ denotes the current hidden state, and $\mathbf{c}_t$ denotes the long-term memory cell state. $\mathbf{W}_i, \mathbf{W}_f, \mathbf{W}_o, \mathbf{W}_g$ denote the input-to-hidden weight matrices and $\mathbf{U}_i, \mathbf{U}_f, \mathbf{U}_o, \mathbf{U}_g$ denote the hidden-to-hidden weight matrices. The mask $\mathbf{m} \in \{0, 1\}^h$ is shared across all the gates to control the sparsity of hidden neurons.
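The shared-mask LSTM step described above can be sketched as follows. This is an illustrative NumPy implementation under our own assumptions (gate stacking order, mask applied to every gate); it is not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_lstm_step(x, h, c, W, U, b, m):
    """One LSTM step with a binary mask m (shape (hidden,)) shared across
    all gates, so a masked-out hidden neuron is inactive in every gate.

    W: (4*hidden, input_dim), U: (4*hidden, hidden), b: (4*hidden,),
    stacked in the order [input, forget, output, cell-candidate] gates.
    """
    hid = h.shape[0]
    z = W @ x + U @ h + b
    i = m * sigmoid(z[0*hid:1*hid])      # input gate
    f = m * sigmoid(z[1*hid:2*hid])      # forget gate
    o = m * sigmoid(z[2*hid:3*hid])      # output gate
    g = m * np.tanh(z[3*hid:4*hid])      # cell candidate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
hid, inp = 6, 4
W = rng.normal(size=(4*hid, inp))
U = rng.normal(size=(4*hid, hid))
b = np.zeros(4*hid)
m = np.array([1, 1, 0, 1, 0, 1.0])       # hidden neurons 2 and 4 pruned
h, c = np.zeros(hid), np.zeros(hid)
h, c = masked_lstm_step(rng.normal(size=inp), h, c, W, U, b, m)
# masked neurons stay exactly zero in both h and c
```

Because the mask multiplies every gate, a pruned neuron contributes nothing to subsequent time-steps, which is what makes the corresponding rows and columns of the weight matrices removable.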
We can optimize the trade-off between model performance and structured sparsification by considering the training objective

$$\min_{\mathbf{w},\, \mathbf{m} \in \{0,1\}^M} \; L(\mathbf{w}, \mathbf{m}) + \lambda \|\mathbf{m}\|_1, \qquad (3)$$

where $\mathbf{m} \in \{0,1\}^M$ collects all $M$ mask variables, $L$ corresponds to a loss function (e.g. cross-entropy loss for classification), the term $\|\mathbf{m}\|_1$ penalizes the number of non-zero mask values, thus encouraging sparsity, and $\lambda$ is a trade-off parameter between $L$ and the penalty. In the growing and pruning space, a model is optimal if it minimizes the combined cost of the description of model complexity and the loss between the model and the data. However, optimizing Eq. 3 is computationally intractable due to the combinatorial nature of binary states.
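A toy numeric instance of the objective in Eq. 3, assuming a squared-error loss and a masked linear model (all names here are hypothetical illustrations, not part of the method):

```python
import numpy as np

def objective(w, m, x, y, lam):
    """Eq. 3 for a toy linear model: squared-error loss of the masked model
    plus lam * ||m||_1, which for binary m counts the active structures."""
    pred = x @ (w * m)                  # masked weights
    loss = np.mean((pred - y) ** 2)
    penalty = lam * np.abs(m).sum()
    return loss + penalty

w = np.array([1.0, -2.0, 0.5])
m = np.array([1.0, 0.0, 1.0])           # second structure pruned
x = np.eye(3)
y = np.array([1.0, 0.0, 0.5])           # realizable by the masked model
val = objective(w, m, x, y, lam=0.01)
# loss is 0 and the penalty counts the two active masks: val == 0.02
```

The penalty changes in discrete jumps of size $\lambda$ as individual masks flip, which is exactly the combinatorial structure that makes Eq. 3 intractable to optimize directly.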
3.2 Continuous Relaxation and Optimization
Learning by Continuation: To make the search space continuous and the optimization feasible, we adopt the framework used to derive Continuous Sparsification.
First, we reparameterize each binary mask $m_j$ as the binary sign of a continuous variable $s_j \in \mathbb{R}$:

$$m_j = H(s_j), \qquad H(s) = \begin{cases} 1, & s > 0 \\ 0, & s \le 0 \end{cases} \qquad (4)$$

and rewrite the objective in Eq. 3 as

$$\min_{\mathbf{w},\, \mathbf{s}} \; L(\mathbf{w}, H(\mathbf{s})) + \lambda \|H(\mathbf{s})\|_1. \qquad (5)$$

To approximate $H$, we use a sequence of functions whose limit is the sign operation. Instead of using the sigmoid activation function, we adopt the hard sigmoid function $\bar{\sigma}(s) = \min(\max(s, 0), 1)$ (following the hard sigmoid used in prior work on $L_0$ regularization). Similar to the sigmoid function, we have that for any $s \neq 0$, $\lim_{\beta \to \infty} \bar{\sigma}(\beta s) = H(s)$, where $\beta > 0$ is a bandwidth parameter.
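The continuation can be sketched numerically, assuming the clamp-to-[0, 1] form of the hard sigmoid given above (function names are illustrative):

```python
import numpy as np

def hard_sigmoid(s):
    """Hard sigmoid: clamps s to the interval [0, 1]."""
    return np.minimum(np.maximum(s, 0.0), 1.0)

def soft_mask(s, beta):
    """Continuation of the binary step: hard_sigmoid(beta * s) approaches
    the 0/1 step function H(s) as the bandwidth beta grows."""
    return hard_sigmoid(beta * s)

s = np.array([-0.4, -0.01, 0.01, 0.4])
print(soft_mask(s, beta=1.0))     # smooth values in [0, 1]
print(soft_mask(s, beta=1000.0))  # ~ binary: [0, 0, 1, 1]
```

At small $\beta$ the mask values are soft and gradients flow through them; as $\beta$ grows, the values saturate towards 0 or 1 and the relaxed objective approaches the discrete one.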
Auxiliary Discrete Variable: Using continuation methods, we can express our final objective as:

$$\min_{\mathbf{w},\, \mathbf{s}} \; L(\mathbf{w}, \mathbf{z}) + \lambda \|\bar{\sigma}(\beta \mathbf{s})\|_1, \qquad z_j \sim \mathrm{Bernoulli}\big(\bar{\sigma}(\beta s_j)\big), \qquad (6)$$

where each auxiliary variable $z_j$ is sampled from a Bernoulli distribution with probability $\bar{\sigma}(\beta s_j)$. By increasing $\beta$, Eq. 6 becomes harder to optimize while the objective converges to the original discrete one. Different from Continuous Sparsification, we introduce the 0-1 sampled auxiliary variable $\mathbf{z}$ based on the probability value. Thus we (1) effectively reduce training computational cost, since any train-time architecture is sampled as a structured sparse one; and (2) avoid using a suboptimal thresholding criterion to generate the inference architecture at the end of training.
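A minimal sketch of the auxiliary-variable sampling (helper names are hypothetical; the hard sigmoid is assumed to be the clamp-to-[0, 1] form):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_masks(s, beta):
    """Sample 0-1 auxiliary variables z_j ~ Bernoulli(hard_sigmoid(beta*s_j)).
    Only structures with z_j = 1 are computed in the current forward pass,
    so every train-time architecture is itself structurally sparse."""
    probs = np.clip(beta * s, 0.0, 1.0)      # hard sigmoid of beta * s
    z = (rng.random(s.shape) < probs).astype(np.int8)
    return z, probs

s = np.array([-0.2, 0.001, 0.05, 0.9])
z, probs = sample_masks(s, beta=10.0)
# probs -> [0, 0.01, 0.5, 1]; z[0] is always 0 and z[3] is always 1
```

Structures whose probability has saturated at 0 or 1 are deterministically pruned or kept, while intermediate probabilities keep both growing and pruning possible.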
Bandwidth Scheduler: We start training deep networks using Eq. 6 with $\beta = \beta_0$, where the initial value is set as 1. We adapt the bandwidth to control the optimization difficulty by instantiating a bandwidth scheduler in two ways: globally and structure-wise separately. A global bandwidth scheduler is called at the end of each training epoch and updates $\beta$ on all activation functions following

$$\beta \leftarrow \min\big(\beta_{\max},\; \beta_0 + \gamma \cdot (\text{n\_iters})^{\alpha}\big), \qquad (7)$$

where $\beta_0$ is the initial bandwidth, which is set as 1, n_iters is the number of training iterations so far, and $\beta_{\max}$ is used to constrain $\beta$ to a certain range. In our experiments, we set $\beta_{\max}$ as 100. Constants $\gamma$ and $\alpha$ are hyperparameters that govern the increasing speed of the bandwidth during the progressive training procedure. Note that such an adaptive control system can be customized for different resource requirements (e.g. training computation cost) by tuning $\gamma$ and $\alpha$. A structure-wise separate bandwidth scheduler requires one additional step: for each mask variable, instead of using the global counter n_iters, we maintain a separate counter n_sampled_iters, which is increased only when its associated mask value is sampled as 1 in Eq. 6. Similarly, we instantiate this scheduler with

$$\beta_j \leftarrow \min\big(\beta_{\max},\; \beta_0 + \gamma \cdot (\text{n\_sampled\_iters}_j)^{\alpha}\big). \qquad (8)$$
Intuitively, the structure-wise separate bandwidth scheduler is more compelling because it allows the bandwidth to increase at different rates for different mask variables: more frequently sampled masks, indicative of a higher probability of not being pruned, become more stable due to the higher optimization difficulty, while less-sampled masks at early stages may still have a chance to be grown under a relatively lower $\beta$. The global scheduler may fail to handle such cases. In our experiments, we report performance using the structure-wise scheduler, and we also compare the two alternatives during training.
In summary, Algorithm 1 shows full details of our optimization procedure with the structure-wise separate bandwidth scheduler.
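Algorithm 1 is not reproduced here, but the scheduler's bookkeeping can be sketched as follows. The power-law update rule used below ($\beta = \min(\beta_{\max}, \beta_0 + \gamma \cdot n^{\alpha})$) is an assumed form consistent with the description above (a per-mask counter drives a capped, monotonically increasing bandwidth), not necessarily the exact formula:

```python
import numpy as np

class StructureWiseBandwidthScheduler:
    """Per-mask bandwidth schedule: beta_j grows only with the number of
    iterations in which mask j was sampled active, capped at beta_max."""

    def __init__(self, n_masks, beta0=1.0, beta_max=100.0,
                 gamma=0.0005, alpha=0.7):
        self.beta = np.full(n_masks, beta0)
        self.n_sampled_iters = np.zeros(n_masks, dtype=np.int64)
        self.beta0, self.beta_max = beta0, beta_max
        self.gamma, self.alpha = gamma, alpha

    def step(self, z):
        """z: 0-1 array of sampled mask values for this iteration."""
        self.n_sampled_iters += z.astype(np.int64)
        self.beta = np.minimum(
            self.beta_max,
            self.beta0 + self.gamma * self.n_sampled_iters ** self.alpha,
        )
        return self.beta

sched = StructureWiseBandwidthScheduler(n_masks=3)
for _ in range(1000):
    sched.step(np.array([1, 1, 0]))   # mask 2 is never sampled
# beta[0] == beta[1] > beta[2] == beta0: a never-sampled mask keeps a low
# bandwidth, so it can still be grown later
```

This captures the intuition from the text: frequently sampled masks harden quickly, while rarely sampled masks stay in the easy, low-bandwidth regime.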
4.1 Experimental Setup
Datasets: Evaluation is conducted on various tasks to demonstrate the effectiveness of our proposed method. For image classification, we use CIFAR-10 and ImageNet: CIFAR-10 consists of 60,000 images of 10 classes, with 6,000 images per class. The train and test sets contain 50,000 and 10,000 images respectively. ImageNet is a large dataset for visual recognition which contains over 1.2M images in the training set and 50K images in the validation set covering 1,000 categories. For semantic segmentation, we use the PASCAL VOC 2012 benchmark which contains 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level labeled images for training, validation, and testing, respectively. The dataset is augmented by extra annotations, resulting in 10,582 training images. For language modeling, we use the word-level Penn Treebank (PTB) dataset which consists of 929k training words, 73k validation words and 82k test words, with 10,000 unique words in its vocabulary.
Unpruned Baseline Models: For CIFAR-10, we use VGG-16 with BatchNorm, ResNet-20, and WideResNet-28-10 as baselines. We adopt a standard data augmentation scheme (shifting/mirroring) following [30, 21], and normalize the input data with channel means and standard deviations. Note that we use the CIFAR versions of ResNet-20, VGG-16, and WideResNet-28-10. VGG-16, ResNet-20, and WideResNet-28-10 are trained for 160, 160, and 200 epochs with a batch size of 128 and an initial learning rate of 0.1. For VGG-16 and ResNet-20 we divide the learning rate by 10 at epochs 80 and 120 and set the weight decay and momentum as $10^{-4}$ and 0.9. For WideResNet-28-10, the learning rate is divided by 5 at epochs 60, 120, and 160; the weight decay and momentum are set to $5 \times 10^{-4}$ and 0.9. For ImageNet, we train the baseline ResNet-50 and MobileNetV1 models following the respective papers. We adopt the same data augmentation scheme as in prior work and report top-1 validation accuracy. For semantic segmentation, performance is measured in terms of pixel intersection-over-union (IOU) averaged across the 21 classes (mIOU). We use Deeplab-v3-ResNet-101 (https://github.com/chenxi116/DeepLabv3.pytorch) as the baseline model, following the training details in the original work. For language modeling, we use a vanilla two-layer stacked LSTM as a baseline. The dropout keep ratio is 0.35 for the baseline model. The vocabulary size, embedding size, and hidden size of the stacked LSTMs are set as 10,000, 1,500, and 1,500, respectively, consistent with prior settings.
Implementation Details: There are two kinds of trainable variables in our method, denoted as model weights and mask weights. As a one-shot method, for model weights we adopt the same hyperparameters as the corresponding unpruned baseline models, except that the dropout keep ratio for language modeling is set as 0.5. For mask variables, we initialize the weights as 0 and use SGD training with an initial learning rate of 0.1, weight decay of 0, and momentum of 0.9 on all datasets. The learning rate schedule is the same as for the corresponding model weights. The trade-off parameter $\lambda$ is set as 0.01 on classification and semantic segmentation tasks, and 0.1 for language modeling tasks. For the bandwidth scheduler, we report model performance trained with the structure-wise separate scheduler, where $(\gamma, \alpha)$ are set as (0.0005, 0.7) for classification and segmentation models, and (0.0005, 1.2) for language modeling, respectively. All initial mask weights and $\beta_{\max}$ are set as 0 and 100, respectively. We also conduct a parameter sensitivity analysis of sparsity and accuracy in terms of $\gamma$ and $\alpha$.
Table 1: Pruning results of VGG-16, ResNet-20, and WideResNet-28-10 on CIFAR-10.

| Model | Method | Val Acc (%) | Params (M) | FLOPs (%) |
|---|---|---|---|---|
| VGG-16 | Original | 92.9 (+0.0) | 14.99 (100%) | 100 |
| | L1 | 91.8 (-1.1) | 2.98 (19.9%) | 19.9 |
| | SoftNet | 92.1 (-0.8) | 5.40 (36.0%) | 36.1 |
| | ThiNet | 90.8 (-2.1) | 5.40 (36.0%) | 36.1 |
| | Provable | 92.4 (-0.5) | 0.85 (5.7%) | 15.0 |
| | Ours | 92.9 (-0.0) | 1.50 (10.0%) | 16.5 |
| ResNet-20 | Original | 91.3 (+0.0) | 0.27 (100%) | 100 |
| | L1 | 90.9 (-0.4) | 0.15 (55.6%) | 55.4 |
| | SoftNet | 90.8 (-0.5) | 0.14 (53.6%) | 50.6 |
| | ThiNet | 89.2 (-2.1) | 0.18 (67.1%) | 67.3 |
| | Provable | 90.8 (-0.5) | 0.10 (37.3%) | 54.5 |
| | Ours | 91.1 (-0.2) | 0.11 (39.1%) | 59.8 |
| WideResNet-28-10 | Original | 96.2 (+0.0) | 36.5 (100%) | 100 |
| | L1 | 95.2 (-1.0) | 7.6 (20.8%) | 49.5 |
| | BAR (16x V) | 92.0 (-4.2) | 2.3 (6.3%) | 1.5 |
| | Ours | 95.6 (-0.6) | 2.6 (7.1%) | 18.6 |
VGG-16, ResNet-20, and WideResNet-28-10 on CIFAR-10: Table 1 shows the pruning results in terms of validation accuracy, retained parameters, and FLOPs for VGG-16, ResNet-20, and WideResNet-28-10 on CIFAR-10. We compare with various pruning algorithms that we implement and run alongside our own. Our method achieves either a larger pruning ratio or a smaller degradation in accuracy. Our pruned VGG-16 and ResNet-20 achieve parameter and FLOPs reductions comparable to the recently proposed Provable method while outperforming it by 0.5 and 0.3 points in validation accuracy. For the very aggressively pruned WideResNet-28-10, we observe that BAR might not have enough capacity to achieve a negligible accuracy drop, even with knowledge distillation during the training process.
Table 2: Pruning results of ResNet-50 and MobileNetV1 (128×128) on ImageNet.

| Model | Method | Top-1 Val Acc (%) | Params (M) | FLOPs (%) |
|---|---|---|---|---|
| ResNet-50 | Original | 76.1 (+0) | 23.0 (100%) | 100 |
| | L1 | 74.7 (-1.4) | 19.6 (85.2%) | 77.5 |
| | SoftNet | 74.6 (-1.5) | N/A | 58.2 |
| | Provable | 75.2 (-0.9) | 15.2 (65.9%) | 70.0 |
| | Ours | 75.2 (-0.9) | 16.5 (71.7%) | 64.5 |
| MobileNetV1 (128) | Original (25%) | 45.1 (+0) | 0.47 (100%) | 100 |
| | MorphNet | 46.0 (+0.9) | N/A | 110 |
| | NetAdapt | 46.3 (+1.2) | N/A | 81 |
| | Ours | 46.0 (+0.9) | 0.41 (87.2%) | 70 |
ResNet-50 and MobileNetV1 on ImageNet: To validate the effectiveness of the proposed method on large-scale datasets, we further prune the widely used ResNet-50 and MobileNetV1 (128×128 resolution) on ImageNet and compare the performance of our method to the results reported in the respective papers, as shown in Table 2. In the MobileNetV1 experiments, following the same setting as NetAdapt, we apply our method to MobileNetV1 with a 0.5 width multiplier while setting the original model's multiplier to 0.25 for comparison. Note that 50%-MobileNetV1(128) is one of the most compact networks, and is thus more challenging to simplify than other, larger networks. Our method still generates a sparser MobileNetV1 model than competing methods.
Deeplab-v3-ResNet-101 on PASCAL VOC 2012: We also test the effectiveness of our proposed method on the semantic segmentation task by pruning the Deeplab-v3-ResNet-101 model on the PASCAL VOC 2012 dataset. We apply our method to both the ResNet-101 backbone and the ASPP module. Compared to the baseline, our pruned network reduces the FLOPs by 54.5% and the parameter count by 41.8% while approximately maintaining mIOU (76.5% to 76.2%). See Table 3.
2-Stacked-LSTMs on PTB: We compare our proposed method with ISS on a vanilla two-layer stacked LSTM. As shown in Table 4, our method finds a very compact model structure while achieving similar perplexity on both validation and test sets. Specifically, our method achieves a 3.2× model-size reduction and a 7.8× FLOPs reduction from the baseline model. Note that for fair comparison, we only prune the LSTM structure while keeping the embedding layer unchanged, following the same setting as ISS. Our method achieves a more compact structure than ISS, further reducing the hidden units from (373, 315) to (319, 285). These improvements may be due to the fact that our method dynamically grows and prunes the hidden neurons towards a better trade-off between model complexity and performance than ISS, which simply uses group LASSO to penalize the norms of all groups collectively for compactness.
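As a sanity check, the model sizes in Table 4 can be approximately reproduced from the reported hidden sizes under a standard LSTM parameter accounting (biases included; the exact accounting below is our assumption, not taken from the paper):

```python
def lstm_params(embed, vocab, hiddens):
    """Parameter count for a word-level language model with an embedding
    layer, stacked LSTM layers, and an output projection (biases included)."""
    total = vocab * embed                        # embedding (kept unpruned)
    in_dim = embed
    for h in hiddens:
        total += 4 * ((in_dim + h) * h + h)      # 4 gates per LSTM layer
        in_dim = h
    total += in_dim * vocab + vocab              # output softmax layer
    return total

base = lstm_params(1500, 10000, (1500, 1500))    # ~66.0M, matching "Original"
ours = lstm_params(1500, 10000, (319, 285))      # ~20.9M, matching "Ours"
# base / ours ~ 3.2x model-size reduction, consistent with the text
```

The agreement with the reported 66M and 20.9M figures also confirms that shrinking the hidden sizes removes both rows and columns of the recurrent weight matrices, which is where most of the reduction comes from.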
Table 4: Results of two-layer stacked LSTMs on PTB.

| Method | Perplexity (val, test) | Final Structure | Weights (M) | FLOPs (%) |
|---|---|---|---|---|
| Original | (82.57, 78.57) | (1500, 1500) | 66M (100%) | 100 |
| ISS | (82.59, 78.65) | (373, 315) | 21.8M (33.1%) | 13.4 |
| Ours | (82.16, 78.67) | (319, 285) | 20.9M (31.7%) | 12.8 |
Dynamic Train-time Cost: One advantage of our method over conventional pruning methods is that we effectively decrease the computational cost not only of inference but also of training via structured continuous sparsification. Figure 3 shows the dynamics of train-time layer-wise FLOPs and stage-wise retained-filter ratios of VGG-16 and ResNet-20 on CIFAR-10, respectively. From Figures 3(b) and 3(d), we see that our method preserves more filters in the earlier stages (1 and 2) of VGG-16 and in the earlier layers within each stage of ResNet-20. Also, the layer-wise final sparsity of ResNet-20 is more uniform due to the residual connections.
Ours as Structured Regularization: We investigate the value of our proposed automatic pruning method serving as a more efficient training method with structure regularization. We re-initialize the pruned ResNet-20 and two-layer stacked LSTM and re-train them from scratch on CIFAR-10 and PTB, respectively. Compared with the pruned model performance reported above, we notice a performance degradation on both the ResNet-20 (accuracy 91.1% → 90.8% (-0.3)) and LSTM models (test perplexity 78.67 → 86.22 (+7.55)). Our method appears to have a positive effect in terms of regularization or optimization dynamics, which is lost if one attempts to directly train the final compact structure.
Parameter Sensitivity: We analyze the sparsity and performance sensitivity relative to the parameters of the bandwidth scheduler (structure-wise separate). We measure the performance with respect to combinations of $\gamma$ and $\alpha$. Specifically, we measure the normalized parameter sparsity and validation accuracy of ResNet-20 on the CIFAR-10 dataset, as shown in Figure 3. From Figure 3(a) we can gain insight into how to choose the hyperparameters $\gamma$ and $\alpha$ to achieve highly sparse networks. Figure 3(b) shows that when $\gamma$ and $\alpha$ lie in a relatively large range (e.g. bottom-right), the validation accuracy is robust to changes in these hyperparameters.
Investigation on Bandwidth Scheduler: We investigate the effect of the global scheduler and the structure-wise separate scheduler by conducting experiments on CIFAR-10 using VGG-16, ResNet-20, and WideResNet-28-10. The results using the structure-wise separate scheduler are as reported in Table 1. For the global scheduler, we note that to achieve similar sparsity, the pruned models suffer accuracy drops of 0.5%, 0.2%, and 1.2%, respectively. With the global scheduler, optimization of all filters' masks stops at very early epochs, and the following epochs of training are equivalent to directly training a stabilized compact structure. This may lock the network into a suboptimal structure, compared to our separate scheduler, which dynamically grows and prunes over a longer duration.
In this paper, we propose a simple yet effective method to grow efficient deep networks via structured continuous sparsification, which decreases the computational cost not only of inference but also of training. The method is simple to implement and quick to execute, and aims to automate the network structure sparsification process for general purposes. The pruning results for widely used deep networks on various computer vision and language modeling tasks show that our method consistently generates smaller and more accurate networks compared to competing methods.
There are many interesting directions to be investigated further. For example, while our current sparsification process is designed with a generic objective, it would be interesting to incorporate model size and FLOPs constraints into the training objective in order to target a particular resource budget. Additionally, our method's growing and pruning space is anchored to a given network topology. Future work could explore an architectural design space in which large subcomponents of the network are themselves candidates for pruning.
References

- (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587.
- (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078.
- (2017) Xception: deep learning with depthwise separable convolutions. In CVPR.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- (2015) The PASCAL visual object classes challenge: a retrospective. IJCV.
- (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR.
- (2015) Fast R-CNN. In ICCV.
- (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In CVPR.
- (2016) Training and investigating residual nets. http://torch.ch/blog/2016/02/04/resnets.html.
- (2016) Dynamic network surgery for efficient DNNs. In NIPS.
- (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR.
- (2015) Learning both weights and connections for efficient neural networks. In NIPS.
- (2011) Semantic contours from inverse detectors. In ICCV.
- (2016) Deep residual learning for image recognition. In CVPR.
- (2018) Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI.
- (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
- (1997) Long short-term memory. Neural Computation.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
- (2018) CondenseNet: an efficient DenseNet using learned group convolutions. In CVPR.
- (2017) Densely connected convolutional networks. In CVPR.
- (2016) Deep networks with stochastic depth. In ECCV.
- (2016) Binarized neural networks. In NIPS.
- (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.
- (2014) The CIFAR-10 dataset. http://www.cs.toronto.edu/~kriz/cifar.html.
- (2012) ImageNet classification with deep convolutional neural networks. In NIPS.
- (2019) Structured pruning of neural networks with budget-aware regularization. In CVPR.
- (2017) Pruning filters for efficient ConvNets. In ICLR.
- (2020) Provable filter pruning for efficient neural networks. In ICLR.
- (2013) Network in network. arXiv:1312.4400.
- (2019) DARTS: differentiable architecture search. In ICLR.
- (2016) SSD: single shot multibox detector. In ECCV.
- (2015) Fully convolutional networks for semantic segmentation. In CVPR.
- (2018) Learning sparse neural networks through L0 regularization. In ICLR.
- (2017) ThiNet: a filter level pruning method for deep neural network compression. In ICCV.
- (2018) Neural architecture optimization. In NIPS.
- (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics.
- (2017) Variational dropout sparsifies deep neural networks. In ICML.
- (2017) Exploring sparsity in recurrent neural networks. In ICLR.
- (2018) Efficient neural architecture search via parameter sharing. arXiv:1802.03268.
- (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV.
- (2019) Learning implicitly recurrent CNNs through parameter sharing. In ICLR.
- (2019) Winning the lottery with continuous sparsification. arXiv:1912.04427.
- (2014) Rigid-motion scattering for image classification. Ph.D. thesis, Ecole Polytechnique, CMAP.
- (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
- (2018) Learning intrinsic sparse structures within long short-term memory. In ICLR.
- (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In ECCV.
- (2016) Wide residual networks. In BMVC.
- (2014) Recurrent neural network regularization. arXiv:1409.2329.
- (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR.
- (2016) Neural architecture search with reinforcement learning. arXiv:1611.01578.
- (2018) Learning transferable architectures for scalable image recognition. In CVPR.