Neural networks are used in an increasingly wide variety of applications on a diverse set of hardware architectures, ranging from laptops to phones to embedded sensors. This wide variety of deployment settings means that inference time and model size are becoming as important as prediction accuracy when assessing model quality. However, currently these three dimensions, prediction accuracy, inference time, and model size, are optimized independently, often with sub-optimal results.
Our approach to optimizing the three dimensions also stands in contrast to existing techniques, which can be categorized into two general approaches: (1) quantization Jouppi:2017:IPA:3079856.3080246 and code compilation, techniques that can be applied to any network, and (2) techniques which analyze the structure of the network and systematically prune connections or neurons han2015deepcompression ; Cun . While the first category is useful, it has limited impact on the network size. The second category can reduce the model size much more but has several drawbacks. First, those techniques often negatively impact model quality. Second, they can also (surprisingly) negatively impact inference time, as they transform dense matrix operations into sparse ones, which can be substantially slower to execute on GPUs that do not efficiently support sparse linear algebra han2015deepcompression . Third, these techniques generally start by optimizing a particular architecture for prediction performance and then, as a post-processing step, apply compression to generate a smaller model that meets the resource constraints of the deployment setting. Because the network architecture is essentially fixed during this post-processing, model architectures that work better in small settings may be missed; this is especially true in large networks like many-layered CNNs, where it is infeasible to explore even a small fraction of possible network configurations.
In contrast, in this paper we present a new and surprisingly simple method to simultaneously optimize network size and model performance. The key idea is to learn the right network size at the same time that we optimize for prediction performance. Our approach, called Smallify, starts with an over-sized network and dynamically shrinks it during training by eliminating unimportant neurons, i.e., those that do not contribute to prediction performance. We achieve this by introducing a new layer, called SwitchLayer, which can switch neurons on and off and is co-optimized while training the neural net. Furthermore, the layer-based approach makes it easy not only to implement Smallify in various neural net frameworks, but also to use it as part of existing network architectures. Smallify has two main benefits. First, it explores the architecture of models that are both small and perform well, rather than starting with a high-performing model and making it small. Smallify accomplishes this goal by using a single new hyperparameter that effectively models the target network size. Second, in contrast to existing neural network compression techniques Aghasi2016 ; han2015deepcompression , our approach results in models that are not only small, but where the weight matrices are dense, leading to better inference time.
In summary, our contributions are as follows:
1. We propose a novel technique based on dynamically switching on and off neurons, which allows us to optimize the network size while the network is trained.
2. We extend our technique to remove entire neurons, leading not only to smaller networks, but also dense matrices, which yield improved inference times as networks shrink. Furthermore, our switching layers used during training can be safely removed before the model is used for inference, meaning they add no additional overhead at inference time.
3. We show that our technique is a relaxation of group LASSO Yuan2006 and prove that our problem admits many global minima.
4. We evaluate Smallify with both fully connected and convolutional neural networks. For CIFAR10, we achieve the same accuracy as a traditionally trained network while reducing the network size. Further, while sacrificing just 1% of performance, Smallify finds networks that are 35X smaller. All in all, this leads to speedups in inference time of up to 6X.
2 Related Work
There are several lines of work related to optimizing network structure.
Hyperparameter optimization techniques: One way to optimize network architecture is to use hyperparameter optimization. Although many methods have been proposed for hyperparameter optimization, simple techniques such as randomized search have been shown to work surprisingly well in practice BergstraJAMESBERGSTRA2012 ; Snoek12 . More advanced alternatives include Bayesian techniques and various bandit algorithms (e.g., li2016hyperband ; jamieson2016 ). Although these methods can be used to tune the size of each layer in a network, related work presents limited experimental evidence of this in practice, likely because treating each layer as a hyperparameter would lead to an excessively large search space. In contrast, with Smallify, the size of the network can be tuned with a single parameter. Recently, methods based on reinforcement learning have been proposed ( Zoph2017b ; Zoph2017a ) and shown to generate very accurate networks (NAS-Net). However, as stated in Zoph2017b , they still used, without challenging it, the popular heuristic that doubles the number of channels every time the dimension of features is reduced.
Model Compression: Model compression techniques focus on reducing the model size after training, in contrast to Smallify, which reduces it while training. Optimal brain damage Cun identifies connections in a network that are unimportant and then prunes these connections. DeepCompression han2015deepcompression takes this one step further and in addition to pruning connections, it quantizes weights to make inference extremely efficient. A different vein of work such as romero2014fitnets ; hinton2015distilling proposes techniques for distilling a network into a simpler network or a different model. Because these techniques work after training, they are orthogonal and complementary to Smallify. Further, some of these techniques, e.g., Han2015 ; Cun , produce sparse matrices that are not likely to improve inference times even though they reduce network size.
Dynamically Sizing Networks
The techniques closest to our proposed method are those based on group sparsity such as Scardapane2017 ; Alvarez2016a , nuclear norm Alvarez2017a , low-rank constraints Zhou2016 , exclusive sparsity Yoon , and even physics-inspired methods Wen2017 . In Wen2016 , the authors look beyond removing channels and experiment with shape and depth. In Philipp , the authors propose a method called Adaptive Radial-Angular Gradient Descent that adds and removes neurons on the fly via a regularization penalty. This approach requires a new optimizer and takes longer to converge compared to Smallify. Liu is similar to Smallify in that both scale each channel/neuron by a scalar. Our approach is more general since it can be used with any architecture and does not depend on batch normalization layers; in contrast to Liu , we also propose implementation details in Section 4 that make the framework more practical. Most of these methods train for sparsity and deactivate neurons at the end of the training process, except Alvarez2017a , which performs a single step of garbage collection at epoch 15. Our pipeline allows early detection of the least important neurons/channels and takes advantage of it to speed up training.
3 The Smallify Approach
In this section we describe the Smallify approach. We first discuss the new SwitchLayers, which are used to deactivate neurons, followed by a description of how we adapt the training loss function.
At a high-level, our approach consists of two interconnected stages. The first one identifies neurons that do not improve the prediction accuracy of the network and deactivates them. The second stage then removes neurons from the network (explicitly shrinking weight matrices and updating optimizer state) thus leading to smaller networks and faster inference.
Deactivating Neurons On-The-Fly: During the first stage, Smallify applies an on/off switch to every neuron of an initially over-sized network. We model the on/off switches by multiplying each input (or output) of each layer by a parameter $\beta \in \{0, 1\}$. A value of $0$ will deactivate the neuron, while $1$ will let the signal go through. These switches are part of a new layer, called the SwitchLayer; this layer applies to fully connected as well as convolutional layers.
Our objective is to minimize the number of “on” switches to reduce the model size as much as possible while preserving prediction accuracy. This can be achieved by jointly minimizing the training loss of the network and applying an $\ell_0$ norm to the parameters of the SwitchLayer. Since minimizing the $\ell_0$ norm is an NP-hard problem, we instead relax the constraint to an $\ell_1$ norm by constraining $\beta$ to be a real number instead of a binary value.
Neuron Removal: During this stage, the neurons that are deactivated by the switch layers are actually removed from the network, effectively shrinking the network size. This step improves inference times. We choose to remove neurons at training time because we have observed that this allows the remaining active neurons to adapt to the new network architecture and we can avoid a post-training step to prune deactivated neurons.
Next we describe in detail the switch layer as well as the training process for Smallify, and then describe the removal process in Section 4.
3.2 The Switch Layer
Let $L$ be a layer in a neural network that takes an input tensor $x$ and produces an output tensor $y$ of shape $(c \times d_1 \times \dots \times d_n)$, where $c$ is the number of neurons in that layer. For instance, for fully connected layers, $n = 0$ and the output is a single-dimensional vector of size $c$ (ignoring batch size for now), while for a 2-D convolutional layer, $n = 2$ and $c$ is the number of output channels or feature maps.
We want to tune the size of $L$ by applying a SwitchLayer, $S$, containing $c$ switches. The SwitchLayer is parametrized by a vector $\beta \in \mathbb{R}^c$ such that the result of applying $S$ to $L(x)$ is also a tensor of size $(c \times d_1 \times \dots \times d_n)$ such that:
$$S_\beta(L(x))_{i, \dots} = \beta_i \, L(x)_{i, \dots} \quad \forall i \in \{1, \dots, c\}$$
Once passed through the switch layer, each output channel $i$ produced by $L$ is scaled by the corresponding $\beta_i$. Note that when $\beta_i = 0$, channel $i$ is multiplied by zero and will not contribute to any computation after the switch layer. If this happens, we say the switch layer has deactivated the neuron corresponding to channel $i$ of layer $L$.
We place SwitchLayer after each layer whose size we wish to tune; these are typically fully connected and convolutional layers. We discuss next how to train Smallify.
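The channel-wise scaling performed by a switch layer can be illustrated with a minimal sketch in plain Python. This is not the authors' PyTorch implementation; the names `switch_forward`, `active_channels`, and `beta` are hypothetical, and each channel is represented as a flat list of activations purely for illustration.

```python
from typing import List, Sequence


def switch_forward(channels: Sequence[Sequence[float]],
                   beta: Sequence[float]) -> List[List[float]]:
    """Scale channel i of a layer's output by beta[i].

    channels[i] holds the flattened activations of channel (neuron) i.
    A fully connected layer is the special case of one value per channel.
    """
    if len(channels) != len(beta):
        raise ValueError("need exactly one switch per channel")
    return [[b * v for v in ch] for ch, b in zip(channels, beta)]


def active_channels(beta: Sequence[float]) -> List[int]:
    """Indices of channels whose switch is non-zero (still 'on')."""
    return [i for i, b in enumerate(beta) if b != 0.0]


# Three channels; the second switch is exactly zero, so that channel
# contributes nothing to any computation after the switch layer.
example_beta = [1.0, 0.0, 0.5]
example_out = switch_forward([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], example_beta)
```

In this toy example the second channel is zeroed out entirely, which is the situation the text refers to as a deactivated neuron.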
3.3 Training Smallify
For training, we need to account for the effect of the SwitchLayers on the loss function. The effect of SwitchLayers can be expressed in terms of a sparsity constraint that pushes values in the vector $\beta$ to 0. In this way, given a neural network parameterized by weights $\theta$ and switch layer parameters $\beta$, we optimize the Smallify loss as:
$$\mathcal{L}_S(x, y; \theta, \beta) = \mathcal{L}(x, y; \theta, \beta) + \lambda \|\beta\|_1 + \lambda_2 \|\theta\|_p^p$$
This expression augments the regular training loss with a regularization term for the switch parameters and another on the network weights.
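As a rough illustration of how the three terms combine, the sketch below adds an $\ell_1$ penalty on the switches and a $p$-norm penalty on the weights to a precomputed task loss. The function name `smallify_loss` and the argument names `lam`, `lam2`, and `p` are our own labels and are assumptions, not identifiers from the authors' library.

```python
from typing import Sequence


def smallify_loss(task_loss: float,
                  beta: Sequence[float],
                  weights: Sequence[float],
                  lam: float,
                  lam2: float,
                  p: float = 2.0) -> float:
    """Task loss plus a sparsity term on switches and a norm term on weights.

    lam  weighs the l1 norm of the switch parameters (drives them to 0);
    lam2 weighs the sum of |w|^p over the network weights (weight decay).
    """
    switch_penalty = lam * sum(abs(b) for b in beta)
    weight_penalty = lam2 * sum(abs(w) ** p for w in weights)
    return task_loss + switch_penalty + weight_penalty


# 0.5 + 0.1 * (1.0 + 0.5 + 0.0) + 0.01 * (4.0 + 1.0)
example_loss = smallify_loss(0.5, [1.0, -0.5, 0.0], [2.0, -1.0],
                             lam=0.1, lam2=0.01, p=2.0)
```

Raising `lam` makes every active switch more expensive, which is how a single hyperparameter can steer the final network size.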
Interestingly, there exists a connection between Smallify and group sparsity regularization (LASSO) which we will discuss in the following subsection.
3.4 Relation to Group Sparsity (LASSO)
Smallify removes neurons, i.e., inputs/outputs of layers. For a fully connected layer defined as:
$$f_{A,b}(x) = Ax + b,$$
where $A$ represents the connections and $b$ the bias, removing input neuron $j$ is equivalent to having $(A^T)_j = 0$, i.e., zeroing the $j$-th column of $A$. Removing output neuron $i$ is the same as setting $A_i = 0$ and $b_i = 0$. Solving optimization problems while trying to set entire groups of parameters to zero is the goal of group sparsity regularization Scardapane2017 . For any partitioning of the set of parameters $\theta$ defining a model into $G$ groups, $\theta = \bigcup_{g=1}^{G} \theta_g$, the group sparsity penalty is defined as:
$$\Omega_\lambda^{gp}(\theta) = \lambda \sum_{g=1}^{G} \sqrt{|\theta_g|}\, \|\theta_g\|_2,$$
with $\lambda$ being the regularization parameter. In fully connected layers, the groups are either columns of $A$ if we want to remove inputs, or rows of $A$ and the corresponding entry in $b$ if we want to remove outputs. For simplicity, we focus our analysis on the simple one-layer case. As filtering outputs does not make sense in this case, we only consider removing inputs. The group sparsity regularization then becomes (when $\sqrt{|\theta_g|}$ is folded into $\lambda$):
$$\Omega_\lambda^{gp}(A) = \lambda \sum_{j} \|(A^T)_j\|_2$$
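Interestingly, group sparsity and Smallify try to achieve the same goal and are closely related. First, let us recall the two problems. In the context of approximating $y \in \mathbb{R}^n$ with a linear regression from $m$ features $x \in \mathbb{R}^m$, the two problems are:
$$\min_{A} \; \|y - Ax\|_2^2 + \lambda \sum_{j=1}^{m} \|(A^T)_j\|_2$$
$$\min_{A, \beta} \; \|y - A\,\mathrm{diag}(\beta)\,x\|_2^2 + \lambda \|\beta\|_1$$
We can prove that, under the condition that every column of $A$ in the second problem has unit norm, $\|(A^T)_j\|_2 = 1$ for all $j$, the two problems are equivalent by taking $\beta_j$ to be the norm of the $j$-th column of the matrix in the first problem and replacing that matrix by $A\,\mathrm{diag}(\beta)$. However, if we relax this constraint then Smallify becomes non-convex and has no global minimum. The latter is true because one can always lower the objective by dividing $\beta$ by an arbitrarily large constant and multiplying $A$ by the same value. Fortunately, by adding an extra term to the Smallify regularization we can avoid that problem and prove that:
$$\min_{A, \beta} \; \|y - A\,\mathrm{diag}(\beta)\,x\|_2^2 + \lambda \|\beta\|_1 + \lambda_2 \|A\|_p^p$$
has global minima for all $p > 0$. More specifically, there are at least $2^k$, where $k$ is the total number of components in $\beta$. Indeed, for any solution, one can obtain the same output by flipping any sign in $\beta$ and the corresponding entries in $A$. This is the reason we defined the regularized Smallify penalty above in Eq. 4. In practice, we observed that $p = 2$ or $p = 1$ are good choices; note that the latter will also introduce additional sparsity into the parameters because the $\ell_1$ norm is the best convex approximation of the $\ell_0$ norm.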
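The rescaling and sign-flip arguments can be checked numerically on a tiny one-layer model. The sketch below is only an illustration under our own naming (`predict`, `l1`, `weight_penalty`, and the matrix `A` and switch vector `beta` are hypothetical labels); it computes $A\,\mathrm{diag}(\beta)\,x$ before and after each transformation.

```python
def predict(A, beta, x):
    """Compute A * diag(beta) * x for a one-layer linear model."""
    gated = [b * v for b, v in zip(beta, x)]
    return [sum(a * g for a, g in zip(row, gated)) for row in A]


def l1(v):
    """l1 norm of a vector (the switch penalty, up to its weight)."""
    return sum(abs(t) for t in v)


def weight_penalty(A, p=2):
    """Sum of |a|^p over all entries of A (the extra regularization term)."""
    return sum(abs(a) ** p for row in A for a in row)


A = [[1.0, -2.0], [0.5, 4.0]]
beta = [2.0, -1.0]
x = [3.0, 1.0]

# Rescaling: divide beta by c and multiply A by c. The output is unchanged,
# the l1 penalty on beta shrinks, but the weight penalty grows.
c = 4.0
A_scaled = [[a * c for a in row] for row in A]
beta_scaled = [b / c for b in beta]

# Sign flip: negate beta[0] and column 0 of A. The output and both
# penalties are unchanged, giving a second, distinct solution.
A_flipped = [[-row[0]] + row[1:] for row in A]
beta_flipped = [-beta[0]] + beta[1:]
```

Without the weight term, the rescaled pair would always be strictly better, so no minimizer could exist; with it, the growth of `weight_penalty` eventually outweighs the gain on `l1(beta)`.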
4 Smallify in Practice
In this section we discuss practical aspects of Smallify, including neuron removal and several optimizations.
On-The-Fly Neuron Removal. Switch layers are initialized with randomly sampled weights; their values change as part of the training process so as to switch neurons on or off. Using gradient descent, it is very unlikely that the unimportant components of $\beta$ will ever be exactly $0$. In most cases, the switches of irrelevant neurons will oscillate close to 0 without ever reaching it, influenced solely by the $\ell_1$ penalty. Our goal is to detect this situation and force these switches to $0$ to deactivate the neurons. We evaluated multiple screening strategies, but the most efficient and flexible one was the sign variance strategy: at each update we measure the sign of each component of $\beta$ ($-1$ or $1$). We maintain two metrics: the exponential moving averages (EMA) of the sign's mean and variance. When the variance exceeds a predefined threshold, we assume that the neuron does not contribute significantly to the output, so we deactivate it. This strategy is parametrized by two hyperparameters: the threshold and the momentum of the statistics we keep.
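One possible reading of the sign variance strategy is sketched below for a single switch. The exact update rule is not given in the text, so the EMA formulas, the class name `SignVarianceScreen`, and the default `momentum` and `threshold` values are all assumptions made for illustration.

```python
class SignVarianceScreen:
    """Track EMAs of the sign of one switch and flag it when the sign is unstable."""

    def __init__(self, momentum: float = 0.9, threshold: float = 0.5) -> None:
        self.momentum = momentum    # assumed EMA momentum
        self.threshold = threshold  # assumed variance threshold
        self.mean = None            # EMA of the sign, set on the first update
        self.var = 0.0              # EMA of the squared deviation from the mean

    def update(self, beta_value: float) -> bool:
        """Record the current switch value; return True if it should be deactivated."""
        sign = 1.0 if beta_value >= 0.0 else -1.0
        if self.mean is None:
            self.mean = sign
            return False
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * sign
        self.var = m * self.var + (1.0 - m) * (sign - self.mean) ** 2
        return self.var > self.threshold


def run(screen: SignVarianceScreen, values) -> bool:
    """Feed a sequence of switch values and return the last decision."""
    flagged = False
    for v in values:
        flagged = screen.update(v)
    return flagged


# A switch oscillating around zero versus one that stays clearly positive.
oscillating = [0.01 * (-1) ** t for t in range(50)]
steady = [0.5] * 50
oscillating_flagged = run(SignVarianceScreen(), oscillating)
steady_screen = SignVarianceScreen()
steady_flagged = run(steady_screen, steady)
```

A switch whose sign keeps flipping accumulates a large sign variance and is flagged, while a switch with a stable sign keeps a variance near zero regardless of its magnitude.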
Preparing for Inference. With Smallify we obtain reduced-size networks during training, which is the first step towards faster inference. These networks are readily available for inference. However, because they include switch layers (and therefore more parameters), they introduce unnecessary overhead at inference time. To avoid this overhead, we reduce the network parameters by combining each switch layer with its respective network layer, multiplying the respective parameters before emitting the final trained network. As a result, the final network is a dense network without any switching layers.
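For a fully connected layer followed by a switch on its outputs, folding amounts to scaling each row of the weight matrix and each bias entry by the matching switch value. The sketch below illustrates this under our own naming (`dense`, `fold_switch`, `W`, `b`, `beta` are hypothetical) and checks that the folded layer reproduces the switched output.

```python
def dense(W, b, x):
    """Fully connected layer: W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def fold_switch(W, b, beta):
    """Merge an output-side switch into the layer: row i and bias i are scaled by beta[i]."""
    W_folded = [[s * w for w in row] for row, s in zip(W, beta)]
    b_folded = [s * bias for s, bias in zip(beta, b)]
    return W_folded, b_folded


W = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -1.0]
beta = [2.0, 0.5]
x = [1.0, -1.0]

# Layer followed by the switch, as used during training.
switched = [s * y for s, y in zip(beta, dense(W, b, x))]

# Single dense layer with the switch folded in, as emitted for inference.
W_folded, b_folded = fold_switch(W, b, beta)
folded = dense(W_folded, b_folded, x)
```

Because the folded layer has the same shape as the original one, the switch adds no parameters or operations at inference time.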
Neural Garbage Collection. Smallify decides on-the-fly which neurons to deactivate. Since Smallify deactivates a large fraction of neurons, we must dynamically remove these neurons at runtime so that they do not unnecessarily impact network training time. We implemented a neural garbage collection method as part of our library, which takes care of updating the necessary network layers as well as the optimizer state to reflect the neuron removal.
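The effect of garbage collection on two consecutive fully connected layers can be sketched as follows. This is a simplified illustration, not the library's method: the names `collect_garbage` and `forward` are hypothetical, and optimizer state is only mentioned in a comment rather than modeled.

```python
def forward(W1, b1, beta, W2, x):
    """Layer 1, then the switch, then layer 2 (no bias or activation, for brevity)."""
    hidden = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W1, b1)]
    gated = [s * h for s, h in zip(beta, hidden)]
    return [sum(w * g for w, g in zip(row, gated)) for row in W2]


def collect_garbage(W1, b1, beta, W2):
    """Physically remove hidden neurons whose switch is exactly zero.

    Rows of W1, entries of b1 and beta, and columns of W2 are dropped together.
    The returned `keep` indices could also be used to slice any per-parameter
    optimizer buffers so they stay aligned with the shrunken tensors.
    """
    keep = [i for i, s in enumerate(beta) if s != 0.0]
    W1_new = [W1[i] for i in keep]
    b1_new = [b1[i] for i in keep]
    beta_new = [beta[i] for i in keep]
    W2_new = [[row[i] for i in keep] for row in W2]
    return W1_new, b1_new, beta_new, W2_new, keep


W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
beta = [1.0, 0.0, 2.0]   # the middle neuron has been deactivated
W2 = [[1.0, 1.0, 1.0]]
x = [1.0, 2.0]

before = forward(W1, b1, beta, W2, x)
W1_s, b1_s, beta_s, W2_s, keep = collect_garbage(W1, b1, beta, W2)
after = forward(W1_s, b1_s, beta_s, W2_s, x)
```

The shrunken network computes the same output with smaller, still dense, matrices, which is why removal during training can reduce the cost of the remaining epochs.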
5 Evaluation
The goal of our evaluation is to explore (1) whether, by varying its single hyperparameter, Smallify can efficiently explore (in terms of number of training runs) the spectrum of high-accuracy models from small to large, on both CNNs and fully connected networks. Our results show that, for each network size, we obtain models that perform as well as or better than Static Networks trained via traditional hyperparameter optimization; (2) whether, because these smaller networks are dense, they result in improved inference times on both CPUs and GPUs; and (3) whether the Smallify approach results in network architectures that are substantially different from the best network architectures (in terms of relative number of neurons per layer) identified in the literature.
We implemented SwitchLayers and the associated training procedure as a library in PyTorch paszke2017automatic . The layer can be freely mixed with other popular layers such as convolutional layers, batchnorm layers, and fully connected layers, and used with all the traditional optimizers. We use our implementation to evaluate Smallify throughout the evaluation section.
5.1 Can Smallify achieve good accuracy?
To answer this question we compare Smallify with a traditional network. In both cases, we need to perform hyperparameter optimization to explore different network sizes. We perform random search, which is an effective technique for this purpose BergstraJAMESBERGSTRA2012 . We evaluate Smallify on two architectures: one for which it is not possible to explore the entire space of network architectures (VGG) and one for which it is possible to do so (a three-layer perceptron).
We assume no prior knowledge of the optimal batch size, learning rate, or weight decay. Instead, we trained a number of models, randomly and independently selecting the values of these parameters from ranges commonly used in practice. Training is done using the Adam optimizer Kingma2015a . We start with a randomly sampled learning rate and divide it by a fixed factor after a fixed number of consecutive epochs without improvement. We stop when the learning rate falls below a minimum threshold. We pick the epoch with the best validation accuracy after the size of the network has converged and report the corresponding test accuracy. We also measure the total size, in terms of the number of floating-point parameters, excluding the SwitchLayers because, as described in Section 4, these are eliminated after training.
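The learning-rate schedule and stopping rule described above can be sketched by replaying a validation-accuracy curve. The concrete factor, patience, and minimum learning rate are not stated here, so `factor`, `patience`, and `min_lr` are placeholder parameters, and `replay_schedule` is a hypothetical helper rather than part of the authors' code.

```python
def replay_schedule(val_accuracies, lr, factor, patience, min_lr):
    """Return (learning rate used at each epoch, epoch at which training stops).

    The learning rate is divided by `factor` after `patience` consecutive
    epochs without a new best validation accuracy; training stops once the
    learning rate drops below `min_lr`. The stop epoch is None if the curve
    ends before that happens.
    """
    best = float("-inf")
    stale = 0
    history = []
    for epoch, acc in enumerate(val_accuracies):
        history.append(lr)
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale == patience:
                lr /= factor
                stale = 0
        if lr < min_lr:
            return history, epoch
    return history, None


# Placeholder values chosen only to make the behaviour visible.
curve = [0.5, 0.6, 0.6, 0.6, 0.61, 0.61, 0.61]
lr_history, stop_epoch = replay_schedule(curve, lr=1.0, factor=10.0,
                                         patience=2, min_lr=0.05)
```

In this toy run the rate drops once after two stale epochs, and a second drop pushes it below the minimum, ending training.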
5.1.1 Large Network Setting: Cifar10
CIFAR10 is an image classification dataset containing 60,000 color images of size 32×32, belonging to 10 different classes. We use it with the VGG16 network Srivastava2014 . We applied Smallify to the VGG16 network by adding SwitchLayers after each BatchNorm and each fully connected layer (except for the last layer). Recall that Smallify assumes that the starting size of the network is an upper bound on the optimal size. Thus, we started with a network with 2x the original size for each layer.
As the baseline we use a fixed-size network, whose architecture is configured by 13 size parameters for the convolutional layers plus those for the fully connected layers. Smallify effectively fuses all these parameters into a single hyperparameter. However, for conventional architectures where all of these parameters are free, it is infeasible to obtain a reasonable sample of such a large search space. To obtain a baseline, we therefore use the same conventional heuristic that the original VGG architecture and many other CNNs use, which doubles the number of channels after every MaxPool layer. For Static Networks we sample the size as a multiple of the size of the original network, which was designed for ImageNet. We report the same numbers as we did for Smallify and we compare the two distributions.
The results are shown in the top figure of Fig. 1, with blue dots indicating models produced by Smallify and orange dots indicating static networks. For each model, we plot its accuracy and model size. The lines show the Pareto frontier of models in each of the two optimization settings. Smallify explores the trade-off between model size and accuracy more effectively. Note that the best-performing Smallify model matches the accuracy of the best static network while being smaller. In addition, if we give up just 1% error, Smallify finds a model that is 35.5 times smaller than any static network that performs as well.
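A Pareto frontier over (size, accuracy) pairs can be computed with a single sorted pass. The sketch below uses made-up numbers, not results from the paper, and the helper name `pareto_frontier` is our own.

```python
def pareto_frontier(models):
    """Return the models not dominated by a smaller-or-equal, more accurate model.

    `models` is an iterable of (size, accuracy) pairs. After sorting by size
    (ties broken by higher accuracy first), a model is on the frontier exactly
    when it is more accurate than every model seen so far.
    """
    frontier = []
    best_acc = float("-inf")
    for size, acc in sorted(models, key=lambda m: (m[0], -m[1])):
        if acc > best_acc:
            frontier.append((size, acc))
            best_acc = acc
    return frontier


# Illustrative values only.
toy_models = [(100, 0.90), (50, 0.85), (80, 0.84),
              (200, 0.93), (150, 0.90), (50, 0.80)]
toy_frontier = pareto_frontier(toy_models)
```

Comparing the frontiers of two training approaches, rather than their individual runs, shows which approach gives the better size for each achievable accuracy.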
5.1.2 Small Network Setting: Covertype
The COVERTYPE Blackard:1998:CNN:928509 dataset contains descriptions of geographical areas (elevation, inclination, etc.) and the goal is to predict the type of forest growing in each area. We picked this dataset for two reasons. First, it is simple, such that we can reach good accuracy with only a few fully connected layers. This is important because we want to show that Smallify finds sizes as good as Static Networks, even if we are sampling the entire space of possible network sizes. Second, Scardapane et al. Scardapane2017 perform their evaluation on this dataset, which allows us to compare the results obtained by our method with the method in Scardapane2017 . We compare Smallify against the same architecture used in Scardapane2017 , i.e., a network with three fully connected layers, no Dropout Srivastava2014 and no BatchNorm. In this case, for the Static Networks, we independently sample the sizes of the three different layers to explore all possible architectures.
The results are shown in the top figure of Fig. 2. Here, the Static method finds models that perform well at a variety of sizes, because it is able to explore the entire parameter space. This is as expected; the fact that Smallify performs as well as Static indicates that Smallify is doing an effective job of exploring the parameter space using just the single parameter. Note that the best-performing Smallify model is more accurate than the best static model while also being smaller. In addition, if we give up just 0.5% error, Smallify finds a model that is 38.6X smaller than any static network with equivalent accuracy.
5.2 Can Smallify speed up inference?
The previous experiment showed that Smallify finds networks of similar or better accuracy than static networks that are much smaller. As noted in the introduction, for some applications, compact models that offer fast inference times are as important as absolute accuracy. In this section, we study the relationship between accuracy, network size and inference time. To do this, we select the smallest model that achieves a given accuracy for both the Smallify and Static approaches. For each model, we measure the time to run inference with the model. We then compute the ratio of the network size and inference time between Smallify and Static at each accuracy level, and plot them at the bottom of Figures 1 and 2. We limit our plots to the accuracy range that we consider to be practically useful.
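The per-accuracy comparison described above can be sketched as follows. The numbers are invented for illustration, the helper names `smallest_at` and `size_ratio` are hypothetical, and the ratio is taken as Static over Smallify so that values above 1 mean the Smallify model is smaller (our assumption about the direction of the ratio).

```python
def smallest_at(models, target_acc):
    """Size of the smallest model reaching at least `target_acc`, or None."""
    sizes = [size for size, acc in models if acc >= target_acc]
    return min(sizes) if sizes else None


def size_ratio(static_models, smallify_models, target_acc):
    """Static size divided by Smallify size at a given accuracy level."""
    static_size = smallest_at(static_models, target_acc)
    smallify_size = smallest_at(smallify_models, target_acc)
    if static_size is None or smallify_size is None:
        return None
    return static_size / smallify_size


# Illustrative (size, accuracy) pairs only.
static_runs = [(400, 0.90), (1000, 0.92), (300, 0.85)]
smallify_runs = [(100, 0.90), (250, 0.92), (60, 0.85)]
ratio_at_90 = size_ratio(static_runs, smallify_runs, 0.90)
ratio_at_95 = size_ratio(static_runs, smallify_runs, 0.95)
```

The same selection can be reused for inference time by replacing the size field with a measured latency.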
The middle plot in each figure shows the ratio of model size between Smallify and Static (values greater than 1 mean Smallify models are smaller) at different accuracy levels. These figures show that size improvements are particularly significant for CIFAR10. In the range of accuracies we are interested in, improvements in size go from 4x to 40x. The fact that the COVERTYPE networks are not dramatically smaller is expected: as the distribution at the top of Figure 2 shows, the static method is able to explore most of the parameter search space.
For speedup, we experimented with both CPUs and GPUs. For each data set/GPU/CPU combination, we show results with batch size 1, as well as with a batch size large enough to fully utilize the hardware on each dataset and hardware configuration. Note that when using a batch size of 1 on GPU, we do not expect to (and do not) observe any improvement because inference times are very small (typically about 10 ), such that setup time dominates overall runtime.
The bottom four graphs in each figure show the results. Again, the CIFAR10 results show the benefit of the Smallify approach most dramatically. On CPU, speedups range up to 6x depending on the batch size, with many models exceeding 3x speedup. In general, speedups are less than compression ratios, due to overheads in problem setup, invocation, and result generation in Python/PyTorch. On GPU, the speedups are less substantial because the CUDA benchmarking utility that we use for evaluation can choose better algorithms for larger matrices which masks some of our benefit, although they are still often 1.5x–2x faster for large batch sizes.
A key takeaway of these speedup results is that, unlike local sparsity compression methods, our methods’ improvement on size translates directly to higher throughput at inference time Han2015 .
5.3 Architectures obtained after convergence
Smallify effectively explores the frontier of model size and accuracy. For a given target accuracy, the size needed is significantly smaller than when we use the "channel doubling" heuristic commonly used to size convolutional neural networks. This suggests that this conventional heuristic may not in fact be optimal, especially when looking for smaller models. Empirically we observed this to often be the case. For example, during our experiments on the MNIST Lecun1998 and FashionMNIST Xiao2017 datasets (not reported here due to space constraints), we observed that even though these datasets have the same number of classes, input features, and output distributions, for a fixed hyperparameter setting Smallify converged to considerably bigger networks in the case of FashionMNIST. This evidence shows that the optimal architecture not only depends on the output distribution or shape of the data but actually reflects the dataset. This makes sense, as MNIST is a much easier problem than FashionMNIST.
To illustrate this point on a larger dataset, we show two examples of architectures learned by Smallify in Figure 3. In the plot, the dashed line shows the number of neurons in each layer of the original VGG net, and the shaded regions show the size of the Smallify network as it converges (with the darkest region representing the fully converged network). Observe that the final trained network looks quite different in the two cases, with the best-performing network appearing similar to the original VGG net, whereas the shrunken network allocates many fewer neurons to the middle layers and then additional neurons to the final few layers.
We presented Smallify, an approach to learn deep network sizes while training. Smallify employs a SwitchLayer, which deactivates neurons, as well as a method to remove them, which reduces network sizes and leads to faster inference times. We demonstrated these claims on two well-known datasets, on which we achieved networks of the same accuracy as traditional neural networks, but up to 35X smaller, with inference speedups of up to 6X.
First, we prove that there is at least one global minimum. Then, we show how to construct distinct solutions from a single global minimum. In order to prove this second statement, we first show that for any solution to the first problem, there exists a solution to the second with the exact same value, and vice versa.
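Assume we have a potential solution $A$ for the first problem. We define $\beta$ and $A'$ such that $\beta_j = \|(A^T)_j\|_2$ and $(A'^T)_j = (A^T)_j / \beta_j$. It is easy to see that the unit-norm constraint on the columns of $A'$ is satisfied by construction. Now:
$$\|y - Ax\|_2^2 + \lambda \sum_j \|(A^T)_j\|_2 = \|y - A'\,\mathrm{diag}(\beta)\,x\|_2^2 + \lambda \sum_j |\beta_j| = \|y - A'\,\mathrm{diag}(\beta)\,x\|_2^2 + \lambda \|\beta\|_1$$
Assuming we take an $A'$ that satisfies the constraint and a $\beta$, we can define $A = A'\,\mathrm{diag}(\beta)$. We can apply the same operations in reverse order and obtain an instance of the first problem with the same value.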
There is no way these two problems have different minima, because we are able to construct a solution to a problem from the solution of the other while preserving the value of the objective. ∎
is not convex in and .
To prove this we will take the simplest instance of the problem, where everything is a scalar. We have . For simplicity we take and . If we consider two candidates and , we have . However , which breaks the convexity property. Since we showed that a particular case of the problem is non-convex, the general case cannot be convex. ∎
has no solution if .
Let’s assume this problem has a minimum . Let’s consider . Trivially the first component of the sum is identical for the two solutions, however . Therefore cannot be the minimum. We conclude that this problem has no solution. ∎
For this proposition we will not restrict ourselves to a single layer but consider the composition of an arbitrarily large number of layers. Suppose the entire network is denoted by a single function of the input. For $\lambda > 0$, $\lambda_2 > 0$ and $p > 0$, the regularized Smallify objective has at least $2^k$ global minima, where $k$ is the total number of switch components.
We split this proof into two parts. First we show that there is at least one global minimum, then we will show how to construct other distinct solutions with the same objective.
Part 1: The two components of the expression are always non-negative, so we know that this problem is bounded below by $0$. The regularization term is trivially coercive. Since we have a sum of terms that are all bounded below by $0$ and one of them is coercive, the entire function admits at least one global minimum.
Part 2: Let's consider one global minimum. Take any component of the switch vector $\beta$ of some layer. Negating it and negating the corresponding column of the weight matrix does not change the first part of the objective because the two sign changes cancel each other. The two norms do not change either because, by definition, a norm is independent of sign. As a result, these two sets of parameters have the same objective value, and by extension the new one is also a global minimum. Starting from this global minimum, we can decide whether or not to negate each element of each switch vector. We have a binary choice for each parameter and there are $k$ parameters, so we have at least $2^k$ global minima.
-  Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 1–12, New York, NY, USA, 2017. ACM.
-  Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
-  Yann Le Cun, John S Denker, and Sara a Solla. Optimal Brain Damage. Advances in Neural Information Processing Systems, 2(1):598–605, 1990.
-  Alireza Aghasi, Afshin Abdi, Nam Nguyen, and Justin Romberg. Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee. 2016.
-  Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 68(1):49–67, 2006.
-  James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13:281–305, 2012.
-  Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
-  Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
-  Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016.
-  Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. In ICLR, pages 976–981, nov 2017.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning Transferable Architectures for Scalable Image Recognition. jul 2017.
-  Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
-  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  Song Han, Huizi Mao, and William J Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. 2015.
-  Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
-  Jose M Alvarez and Mathieu Salzmann. Learning the Number of Neurons in Deep Networks. Neural Information Processing Systems, 2016.
-  Jose M Alvarez and Mathieu Salzmann. Compression-aware Training of Deep Networks. In Neural Information Processing Systems, pages 1–10, 2017.
-  Hao Zhou, Jose M Alvarez, and Fatih Porikli. Less Is More: Towards Compact CNNs. Computer Vision – ECCV 2016, pages 662–677, 2016.
-  Jaehong Yoon and Sung Ju Hwang. Combined Group and Exclusive Sparsity for Deep Neural Networks. Proceedings of the 34th International Conference on Machine Learning, 70:3958–3966, 2017.
-  Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Coordinating Filters for Faster Deep Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision, volume 2017-Octob, pages 658–666, 2017.
-  Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning Structured Sparsity in Deep Neural Networks. Neural Information Processing Systems, 2016.
-  George Philipp and Jaime G Carbonell. Nonparametric Neural Network. In Proc. International Conference on Learning Representations, number 2016, pages 1–27, 2017.
-  Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning Efficient Convolutional Networks through Network Slimming. ICCV 2017, 2017.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
-  Diederik P. Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 2015, pages 1–15, 2015.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
-  Jock A. Blackard. Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types. PhD thesis, Fort Collins, CO, USA, 1998. AAI9921979.
-  Y LeCun, L Bottou, Yoshua Bengio, and P Haffner. Gradient-Based Learning Applied to Document Recognition. In Intelligent Signal Processing, pages 306–351, 2001.
-  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. 2017.