1 Introduction
"Everything should be made as simple as possible, but not simpler." - Einstein
For large-scale tasks like image classification, the general practice in recent times has been to train large networks with many millions of parameters (see [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Simonyan and Zisserman(2015), Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich]). Looking at these models, it is natural to ask: are so many parameters really needed for good performance? In other words, are these models as simple as they can be? A smaller model has the advantage of being faster to evaluate and easier to store, both of which are crucial for real-time and embedded applications. In this work, we consider the problem of automatically building smaller networks that achieve performance levels similar to larger networks.
Regularizers are often used to encourage learning simpler models. These usually restrict the magnitude ($\ell_2$) or the sparsity ($\ell_1$) of weights. However, to restrict the computational complexity of neural networks, we need a regularizer which restricts the width and depth of the network. Here, the width of a layer refers to the number of neurons in that layer, while depth simply corresponds to the total number of layers. Generally speaking, the greater the width and depth, the larger the number of neurons and the more computationally complex the model. Naturally, one would want to restrict the total number of neurons as a means of controlling the computational complexity of the model. However, the number of neurons is an integer, making it difficult to optimize over. This work aims at making this problem easier to solve.
The overall contributions of the paper are as follows.

We propose novel trainable parameters which are used to restrict the total number of neurons in a neural network model, thus effectively selecting width and depth (Section 2).

We analyse the behaviour of our method experimentally (Section 4).

We use our method to perform architecture selection and learn models with a considerably smaller number of parameters (Section 4).
2 Complexity as a regularizer
In general, the term ‘architecture’ of a neural network can refer to aspects of a network other than width and depth (like filter size, stride, etc). However, here we use that word to simply mean width and depth. Given that we want to reduce the complexity of the model, let us formally define our notions of complexity and architecture.
Notation.
Let $\mathbf{h} = [n_1, n_2, \ldots, n_m, 0, 0, \ldots]$ be an infinite-dimensional vector whose first $m$ components are positive integers, while the rest are zeros. This represents an $m$-layer neural network architecture with $n_i$ neurons in the $i^{\text{th}}$ layer. We call $\mathbf{h}$ the architecture of a neural network.
For these vectors, we define an associated norm which corresponds to our notion of architectural complexity of the neural network. Our notion of complexity is simply the total number of neurons in the network.
The true measure of computational complexity of a neural network would be the total number of weights or parameters. However, if we consider a neural network with a single hidden layer, this is proportional to the number of neurons in the hidden layer. Even though this equivalence breaks down for multi-layered neural networks, we nevertheless use the neuron count for the sake of simplicity.
Definition.
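The complexity of an $m$-layer neural network with architecture $\mathbf{h}$ is given by $\|\mathbf{h}\| = \sum_{i=1}^{m} n_i$, where $n_i$ is the number of neurons in the $i^{\text{th}}$ layer.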
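To make the difference between the two complexity measures concrete, the following minimal sketch counts neurons and weights for small fully connected networks. The function names and the example widths are illustrative assumptions, and biases are ignored.

```python
def neuron_complexity(widths):
    """Architectural complexity: the total number of neurons."""
    return sum(widths)


def weight_count(input_dim, widths):
    """Number of weights (biases ignored) in a fully connected network."""
    dims = [input_dim] + list(widths)
    return sum(a * b for a, b in zip(dims[:-1], dims[1:]))


# One hidden layer: doubling the hidden width doubles the weight count.
print(weight_count(100, [50, 10]), weight_count(100, [100, 10]))

# Deeper networks: same neuron count, different weight counts.
for widths in ([60, 40, 10], [40, 60, 10]):
    print(widths, neuron_complexity(widths), weight_count(100, widths))
```

With one hidden layer the weight count scales linearly with the hidden width, whereas the last two architectures share the same neuron count but differ in their number of weights.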
Our overall objective can hence be stated as the following optimization problem.
$$\hat{\theta}, \hat{\mathbf{h}} = \arg\min_{\theta, \mathbf{h}} \; \ell(\hat{y}(\theta, \mathbf{h}), y) + \lambda \|\mathbf{h}\| \qquad (1)$$
where $\theta$ denotes the weights of the neural network, and $\mathbf{h}$ the architecture. $\ell(\hat{y}(\theta, \mathbf{h}), y)$ denotes the loss function, which depends on the underlying task to be solved. For example, squared-error loss functions are generally used for regression problems and cross-entropy loss for classification. In this objective, there exists the classical trade-off between model complexity and loss, which is handled by the $\lambda$ parameter. Note that we learn both the weights ($\theta$) as well as the architecture ($\mathbf{h}$) in this problem. We term any algorithm which solves the above problem an Architecture-Learning (AL) algorithm.
We observe that the task defined above is very difficult to solve, primarily because $\|\mathbf{h}\|$ is an integer. This makes it an integer programming problem. Hence, we cannot use gradient-based techniques to optimize for this. The main contribution of this work is the reformulation of this optimization problem so that Stochastic Gradient Descent (SGD) and backpropagation may be used.
2.1 A Strategy for a trainable regularizer
We require a strategy to automatically select a neural network’s architecture, i.e., the width of each layer and the depth of the network. One way to select the width of a layer is to introduce additional learnable parameters which multiply every neuron’s output, as shown in Figure 1(a). If these new parameters are restricted to be binary, then those neurons with a zero parameter can simply be removed. In the figure, the trainable parameters corresponding to two of the neurons are zero, nullifying their contribution. Thus, the sum of these binary trainable parameters will be equal to the effective width of the network. For convolutional layers with feature map outputs, we have additional parameters that select a subset of the feature maps. A single additional parameter multiplies an entire feature map, either making it zero or preserving it. After all, filters are analogous to neurons for convolutional layers.
Figure 1: (a) Strategy for selecting width and depth, as described in the text. (b) Graph of the $\ell_2$ regularizer and the binarizing regularizer in 1D.
To further reduce the complexity of the network, we also strive to reduce the network’s depth. It is well known that two neural network layers without any nonlinearity between them are equivalent to a single layer, whose parameters are given by the matrix product of the weight matrices of the original two layers. This is shown on the right of Figure 1(a). We can therefore consider a trainable nonlinearity, which prefers ‘linearity’ over ‘nonlinearity’. Wherever linearity is selected, the corresponding layer can be combined with the next layer. Hence, the total complexity of the neural network would be the number of parameters in layers with a nonlinearity.
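The equivalence used for depth reduction can be checked numerically. The sketch below is illustrative only: the layer sizes are arbitrary, and the bias terms are an addition not discussed in the text (the merged bias is the second layer applied to the first bias, plus the second bias).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two consecutive fully connected layers with no nonlinearity in between.
W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=8)   # maps 5 -> 8
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # maps 8 -> 3


def merge_linear(W1, b1, W2, b2):
    """Collapse y = W2 (W1 x + b1) + b2 into a single layer y = W x + b."""
    return W2 @ W1, W2 @ b1 + b2


W, b = merge_linear(W1, b1, W2, b2)

x = rng.normal(size=5)
two_layer = W2 @ (W1 @ x + b1) + b2
one_layer = W @ x + b
print(W.shape, np.allclose(two_layer, one_layer))
```

The merged layer maps the 5 inputs directly to the 3 outputs, so the intermediate width of 8 no longer contributes to the model.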
In this work, we combine both these intuitive observations into one single framework. This is captured in our definition of the tri-state ReLU which follows.
2.1.1 Definition: Tri-state ReLU
We define a new trainable nonlinearity which we call the tri-state ReLU (tsReLU) as follows:
$$\mathrm{tsReLU}(x) = \begin{cases} w x, & x \geq 0 \\ w d x, & \text{otherwise} \end{cases} \qquad (2)$$
This reduces to the usual ReLU for $w = 1$ and $d = 0$. For a fixed $w = 1$ and a trainable $d$, this turns into parametric ReLU [He et al.(2015)He, Zhang, Ren, and Sun]. For us, both $w$ and $d$ are trainable. However, we restrict both these parameters to take only binary values. As a result, three possible states exist for this function. For $w = 0$, this function always returns zero. For $w = 1$ and $d = 0$ it behaves like ReLU, while for $w = 1$ and $d = 1$ it reduces to the identity function.
Here, the parameter $w$ selects the width of the layer, while $d$ decides depth. While the parameter $w$ is different across channels of a layer, the parameter $d$ is tied to the same value across all channels. If $d = 1$, we can combine that layer with the next to yield a single layer. If $w = 0$ for any channel, we can simply remove that neuron as well as the corresponding weights in the next layer.
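The three states can be illustrated with a small NumPy sketch. It assumes the tsReLU takes the form w·x for non-negative inputs and w·d·x otherwise, with one width gate per channel and one depth gate per layer; the function name `ts_relu` and the array shapes are illustrative choices rather than the authors' implementation.

```python
import numpy as np


def ts_relu(x, w, d):
    """Tri-state ReLU sketch: w * x for x >= 0, w * d * x otherwise.

    x: activations of shape (batch, channels)
    w: width gates, one per channel, values in {0, 1}
    d: depth gate shared by the whole layer, value in {0, 1}
    """
    return np.where(x >= 0, w * x, w * d * x)


x = np.array([[-2.0, 3.0, -1.0]])
all_on = np.ones(3)

print(ts_relu(x, all_on, 0))                     # behaves like ReLU
print(ts_relu(x, all_on, 1))                     # identity
print(ts_relu(x, np.array([1.0, 0.0, 1.0]), 1))  # second channel switched off
```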
Thus, our objective while using the tristate ReLU is
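$$\min_{\theta,\, w_{ij},\, d_i} \; \ell(\hat{y}(\theta, \mathbf{w}, \mathbf{d}), y) \qquad (3)$$
such that $w_{ij}, d_i \in \{0, 1\}$ for all $i, j$.
We remind the reader that here $i$ denotes the layer number, while $j$ denotes the $j^{\text{th}}$ neuron in a layer. Note that for $\lambda = 0$, it converts the objective in Equation 1 from an integer programming problem to that of binary programming.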
2.1.2 Learning binary parameters
Given the definition of tri-state ReLU (tsReLU) above, we require a method to learn binary parameters for $w$ and $d$. To this end, we use a regularizer given by [Murray and Ng(2010)], namely $w(1 - w)$. This regularizer encourages binary values for parameters, if they are constrained to lie in $[0, 1]$.
Henceforth, we shall refer to this as the binarizing regularizer. Murray and Ng [Murray and Ng(2010)] showed that this regularizer does indeed converge to binary values given a large number of iterations. For the 1D case, this function is a downward-facing parabola which, on $[0, 1]$, has its minima at $0$ and $1$, as shown in Figure 1(b). As a result, weights “fall” to $0$ or $1$ at convergence. In contrast, the $\ell_2$ regularizer is an upward-facing parabola with a minimum at $0$, which causes it to push weights to be close to zero.
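The contrast between the two regularizers can be seen from a single gradient step. This is a minimal sketch assuming the binarizing regularizer has the form w(1 - w); the step size and the starting values are arbitrary.

```python
import numpy as np


def binarizing_grad(w):
    """Gradient of the binarizing regularizer w * (1 - w)."""
    return 1.0 - 2.0 * w


def l2_grad(w):
    """Gradient of the l2 regularizer w ** 2."""
    return 2.0 * w


w = np.array([0.2, 0.4, 0.6, 0.8])
lr = 0.1

after_binarizing = w - lr * binarizing_grad(w)
after_l2 = w - lr * l2_grad(w)

print(after_binarizing)  # every value moves away from 0.5
print(after_l2)          # every value shrinks towards 0
```

A value starting exactly at 0.5 has zero gradient under the binarizing regularizer, so in practice the loss gradient or the initialization has to break that tie.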
With this intuition, we now state our tsReLU optimization objective.
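$$\hat{\theta}, \hat{\mathbf{w}}, \hat{\mathbf{d}} = \arg\min_{\theta, \mathbf{w}, \mathbf{d}} \; \ell(\hat{y}(\theta, \mathbf{w}, \mathbf{d}), y) + \lambda_1 \sum_{i} \sum_{j} w_{ij}(1 - w_{ij}) + \lambda_2 \sum_{i} d_i(1 - d_i) \qquad (4)$$
Note that $\lambda_1$ is the regularization constant for the width-limiting term, while $\lambda_2$ is for the depth-limiting term. This objective can be solved using the usual backpropagation algorithm. As indicated earlier, this binarizing regularizer works only if the $w_{ij}$’s and $d_i$’s are guaranteed to be in $[0, 1]$. To enforce the same, we perform clipping after each parameter update.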
After optimization, even though the final parameters are expected to be close to binary, they are still real numbers close to $0$ or $1$. Let $w$ be the parameter obtained during the optimization. The tsReLU function uses a binarized version of this variable, $w' = \mathbb{1}(w \geq 0.5)$, during the feed-forward stage. Note that $w$ slowly changes during training, while $w'$ only reflects the changes made to $w$. A similar equation holds for $d$.
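The update, clipping and binarization steps can be sketched as follows. This is a toy illustration under stated assumptions: only the gradient of the binarizing term is used (the task loss is omitted), the 0.5 threshold is assumed, and the helper names and hyperparameter values are hypothetical.

```python
import numpy as np


def sgd_step_with_clipping(w, grad, lr):
    """Gradient step followed by clipping back into [0, 1]."""
    return np.clip(w - lr * grad, 0.0, 1.0)


def binarize(w, threshold=0.5):
    """Binary gates used in the forward pass (threshold is an assumption)."""
    return (w >= threshold).astype(w.dtype)


w = np.array([0.45, 0.55, 0.9, 0.1])   # real-valued gates kept by the optimizer
lam1, lr = 1.0, 0.1

for _ in range(100):
    grad = lam1 * (1.0 - 2.0 * w)      # gradient of lam1 * w * (1 - w) only
    w = sgd_step_with_clipping(w, grad, lr)

print(w)            # gates have settled at the ends of [0, 1]
print(binarize(w))  # values the forward pass would use
```

Keeping the real-valued gate for the update while using its binarized copy in the forward pass means the network output only changes when a gate crosses the threshold.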
2.2 Adding model complexity
So far, we have considered the problem of solving Equation 1 with . As a result, the objective function described above does not necessarily select for smaller models. Let correspond to the complexity of a layer. The model complexity term is given by
This is formulated such that for , the complexity in a layer is just , while for (nonlinearity absent), the complexity is . Overall, it counts the total number of weights in the model at convergence.
We now add a regularizer analogous to model complexity (defined above) in our optimization objective in Equation 4 . Let us call the regularizer corresponding to model complexity as , which is given by
(5) 
The first term in the above equation limits the complexity of each layer’s width, while the second term limits the network’s depth by encouraging linearity. Note that the first term becomes zero when a nonlinearity is absent. Also note that the indicator function in the first term is nondifferentiable. As a result, we simply treat that term as a constant with respect to .
3 Related Work
There have been many works which look at performing compression of a neural network. Weight-pruning techniques were popularized by Optimal Brain Damage [LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel] and Optimal Brain Surgeon [Hassibi et al.(1993)Hassibi, Stork, et al.]. Recently, [Srinivas and Babu(2015)] proposed a neuron pruning technique, which relied on neuronal similarity. Our work, on the other hand, performs neuron pruning based on learning, rather than hand-crafted rules. Our learning objective can thus be seen as performing pruning and learning together, unlike the work of Han et al. [Han et al.(2015)Han, Pool, Tran, and Dally], who perform both operations alternately.
Learning neural network architecture has also been explored to some extent. Cascade-correlation [Fahlman and Lebiere(1989)] proposed a novel learning rule to ‘grow’ the neural network. However, it was shown for only a single-layer network, and it is hence not clear how to scale it to large deep networks. Our work is inspired by the recent work of Kulkarni et al. [Kulkarni et al.(2015)Kulkarni, Zepeda, Jurie, Pérez, and Chevallier], who proposed to learn the width of neural networks in a way similar to ours. Specifically, they proposed to learn a diagonal matrix that multiplies the neurons of a layer, such that the product represents that layer’s neurons. However, instead of imposing a binary constraint (like ours), they learn real weights and impose an $\ell_1$-based sparsity-inducing regularizer on the diagonal matrix to encourage zeros. By imposing a binary constraint, we are able to directly regularize for the model complexity. Recently, Bayesian Optimization-based algorithms [Snoek et al.(2012)Snoek, Larochelle, and Adams] have also been proposed for automatically learning hyperparameters of neural networks. However, for the purpose of selecting architecture, these typically require training multiple models with different architectures, while our method selects the architecture in a single run. A large number of evolutionary algorithms (see [Yao(1999), Stanley and Miikkulainen(2002), Stanley et al.(2009)Stanley, D’Ambrosio, and Gauci]) also exist for the task of finding neural network architectures. However, these are typically evaluated on small-scale problems, often not relating to pattern recognition tasks.
Many methods have been proposed to train models that are deep, yet have a lower parameterisation than conventional networks. Collins and Kohli [Collins and Kohli(2014)] propose a sparsity-inducing regulariser for backpropagation which promotes many weights to have zero magnitude. They achieve reduction in memory consumption when compared to traditionally trained models. In contrast, our method promotes neurons to have a zero magnitude. As a result, our overall objective function is much simpler to solve. Denil et al. [Denil et al.(2013)Denil, Shakibi, Dinh, de Freitas, et al.] demonstrate that most of the parameters of a model can be predicted given only a few parameters. At training time, they learn only a few parameters and predict the rest. Yang et al. [Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang] propose an Adaptive Fastfood transform, which is an efficient re-parametrization of fully-connected layer weights. This results in a reduction of complexity for weight storage and computation.
Some recent works have also focussed on using approximations of weight matrices to perform compression. Jaderberg et al. [Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman] and Denton et al. [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus] use SVD-based low-rank approximations of the weight matrix. Gong et al. [Gong et al.(2014)Gong, Liu, Yang, and Bourdev] use a clustering-based product quantization approach to build an indexing scheme that reduces the space occupied by the matrix on disk.
4 Experiments
In this section, we perform experiments to analyse the behaviour of our method. In the first set of experiments, we evaluate performance on the MNIST dataset. Later, we look at a case study on the ILSVRC 2012 dataset. Our experiments are performed using the Theano Deep Learning Framework [Bergstra et al.(2010)Bergstra, Breuleux, Bastien, Lamblin, Pascanu, Desjardins, Turian, WardeFarley, and Bengio].
4.1 Compression performance
We evaluate our method on the MNIST dataset, using a LeNet-like [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] architecture. The network consists of two convolutional layers with 20 and 50 filters, and two fully connected layers with 500 and 10 (output layer) neurons. We use this architecture as a starting point to learn smaller architectures. First, we learn using our additional parameters and regularizers. Second, we remove neurons with zero gate values and collapse depth for linearities wherever it is advantageous. For example, it might not be advantageous to remove depth in a bottleneck layer (like in autoencoders). Thus, the second part of the process is human-guided.
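The neuron-removal step can be sketched for a fully connected layer as follows. This is an illustrative example rather than the authors' code: the helper `prune_fc_layer`, the layer sizes and the gate values are all hypothetical, and a ReLU state is assumed for the surviving neurons.

```python
import numpy as np


def prune_fc_layer(W_in, b_in, W_out, gates):
    """Drop hidden neurons whose gate is zero.

    W_in:  (hidden, inputs) weights into the gated layer
    b_in:  (hidden,) biases of the gated layer
    W_out: (outputs, hidden) weights of the next layer
    gates: (hidden,) binary width gates
    """
    keep = gates.astype(bool)
    return W_in[keep], b_in[keep], W_out[:, keep]


rng = np.random.default_rng(0)
W_in, b_in = rng.normal(size=(6, 4)), rng.normal(size=6)
W_out = rng.normal(size=(3, 6))
gates = np.array([1, 0, 1, 1, 0, 1])

x = rng.normal(size=4)
gated = W_out @ (gates * np.maximum(W_in @ x + b_in, 0.0))

W_in_p, b_in_p, W_out_p = prune_fc_layer(W_in, b_in, W_out, gates)
pruned = W_out_p @ np.maximum(W_in_p @ x + b_in_p, 0.0)

print(W_in_p.shape, W_out_p.shape, np.allclose(gated, pruned))
```

The pruned network is a smaller dense network that computes the same output as the gated one, which is why no sparse kernels are needed afterwards.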
Starting from a baseline architecture, we learn smaller architectures with variations of our method. Note that there is max-pooling applied after each of the convolutional layers, which rules out depth selection for those two layers. We compare the proposed method against baselines of directly training a neural network (NN) on the final architecture, and our method of learning a fixed final width (FFW) for various layers. In Table 1, the Layers Learnt column has binary elements which denote whether width ($w$) or depth ($d$) are learnt for each layer in the baseline network. As an example, the second row shows a method where only the width is learnt in the first two layers, and depth is also learnt in the third layer. This table shows that all considered models, large and small, perform more or less equally well in terms of accuracy. This empirically shows that the small models discovered by AL preserve accuracy.
We also compare the compression performance of our AL method against SVD-based compression of the weight matrix in Table 2. Here we only compress layer 3 (which has weights) using SVD. The results show that learning a smaller network is beneficial over learning a large network and then performing SVD-based compression.
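Table 1: Architectures learnt on MNIST and the corresponding accuracies.
| Method | | Layers Learnt | Architecture | AL (%) | NN (%) |
| Baseline | N/A | (0,x)(0,x)(0,0) | 20-50-500-10 | N/A | 99.3 |
| AL | | (1,x)(1,x)(1,1) | 16-26-10 | 99.07 | 99.08 |
| AL | | (1,x)(1,x)(1,0) | 20-50-20-10 | 99.07 | 99.14 |
| AL | | (1,x)(1,x)(1,1) | 16-40-10 | 99.22 | 99.25 |
| AL | | (1,x)(1,x)(1,0) | 20-50-70-10 | 99.19 | 99.21 |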
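Table 2: Comparison of AL against SVD-based compression.
| Method | Params | Accuracy (%) |
| Baseline | 431K | 99.3 |
| SVD (rank-10) | 43.6K | 98.47 |
| AL | 40.9K | 99.07 |
| SVD (rank-40) | 83.1K | 99.06 |
| AL | 82.3K | 99.19 |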
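As a sanity check on the parameter counts, the sketch below counts weights and biases for a LeNet-like layout. Several details are assumptions not stated in the text: a 28×28 single-channel input, 5×5 kernels with stride 1 and no padding, and 2×2 pooling after each convolution. Under these assumptions the baseline comes to 431,080 parameters and the 20-50-70-10 architecture to 82,350, in line with the 431K and 82.3K entries of Table 2.

```python
def lenet_like_param_count(conv1, conv2, fc, classes=10,
                           kernel=5, in_channels=1, in_size=28, pool=2):
    """Weights plus biases of a conv-conv-fc-fc network (assumed layout)."""
    size = (in_size - kernel + 1) // pool   # spatial size after conv1 + pooling
    size = (size - kernel + 1) // pool      # spatial size after conv2 + pooling
    flat = conv2 * size * size              # inputs to the first FC layer

    total = kernel * kernel * in_channels * conv1 + conv1
    total += kernel * kernel * conv1 * conv2 + conv2
    total += flat * fc + fc
    total += fc * classes + classes
    return total


def low_rank_weights(rows, cols, rank):
    """Weights needed to store a rank-`rank` factorisation of a rows x cols matrix."""
    return rank * (rows + cols)


baseline = lenet_like_param_count(20, 50, 500)
learnt = lenet_like_param_count(20, 50, 70)
# Replace the 800 x 500 weight matrix of layer 3 by a rank-40 factorisation.
svd_rank40 = baseline - 800 * 500 + low_rank_weights(800, 500, 40)

print(baseline, learnt, svd_rank40)  # 431080 82350 83080
```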
4.2 Analysis
We now perform a few more experiments to further analyse the behaviour of our method. In all cases, we train AL-like models, and consider the third layer for evaluation. We start learning with the baseline architecture considered above.
First, we look at the effects of using different hyperparameters . From Figure 2(a) , we observe that (i) increasing encourages the method to prune more, and (ii) decreasing encourages the method to learn the architecture for an extended amount of time. In both cases, we see that the architecture stays moreorless constant after a large enough number of iterations.
Second, we look at the learnt architectures for different amounts of data complexity. Intuitively, simpler data should lead to smaller architectures. A simple way to obtain data of differing complexity is to simply vary the number of classes in a multiclass problem like MNIST. We hence vary the number of classes from , and run our method for each case without changing any hyperparameters. As seen in Figure 2(b) , we see an almost monotonic increase in both architectural complexity and error rate, which confirms our hypothesis.
Third, we look at the depth-selection capabilities of our method. We used models with various initial depths and observed the depths of the resultant models. We used an initial architecture of 20 - 50 - (75 × n) - 10, where layers with width 75 are repeated to obtain a network of desired depth. We see that for small changes in the initial depth, the final learnt depth stays more or less constant. (Footnote 1: For theoretical analysis of our method, please see the Supplementary material.)
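Table 3: Depth selection for various initial depths.
| Initial Depth | Final Depth | Learnt Architecture | Error (%) |
| 6 | 5 | 18-31-32-24-10 | 1.02 |
| 8 | 6 | 17-37-39-29-21-10 | 0.99 |
| 10 | 6 | 17-34-32-21-21-10 | 0.97 |
| 12 | 6 | 18-34-30-21-17-10 | 1.04 |
| 15 | 8 | 16373525202210 | 0.93 |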
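Figure 2: (a) Effect of different hyperparameters on the learnt architecture over training iterations, as described in the text. (b) Plot of the number of neurons learnt for MNIST with various numbers of classes. We see that both the neuron count and the error rate increase with an increase in the number of classes.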
4.3 Architecture Selection
In recent times, Bayesian Optimization (BO) has emerged as a compelling option for hyperparameter optimization. In this set of experiments, we compare the architecture-selection capabilities of our method against BO. In particular, we use the Spearmint-lite software package [Snoek et al.(2012)Snoek, Larochelle, and Adams] with default parameters for our experiments.
We use BO to first determine the width of the last FC layer (a single scalar), and later, the width of all three layers (3 scalars). For comparison, we use the same objective function for both BO and ArchitectureLearning. This means that we use for AL, while we externally compute the cost after every training run for BO. Figure 3 shows that BO typically needs multiple runs to discover networks which perform close to AL. Performing such multiple runs is often prohibitive for large networks. Even for a small network like ours, training took
30 minutes on a TitanX GPU for 300 epochs. Training with AL does not change the training time, whereas using BO we spent 10 hours for completing 20 runs. Further, AL directly optimizes the cost function as opposed to BO, which performs a black-box optimization.
Given that we perform architecture selection, what hyperparameters does AL need? We notice that we only need to decide four quantities, namely the regularization constants. If our objective is to only decide widths, we need to decide only two of these quantities. Thus, for an $m$-layer neural network, we are able to decide $m$ (or $2m$) numbers (widths and depths) based on only two (or four) global hyperparameters. In the Appendix, we shall look at heuristics for setting these hyperparameters.
4.4 Case study: AlexNet
For the experiments that follow, we use an AlexNet-like [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] model, called CaffeNet, provided with the Caffe Deep Learning framework. It is very similar to AlexNet, except that the order of max-pooling and normalization have been interchanged. We use the ILSVRC 2012 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and FeiFei] validation set to compute the accuracies in Table 4. Unlike the experiments performed previously, we start with a pretrained model and then perform architecture learning (AL) on the learnt weights. We see that our method performs almost as well as the state-of-the-art compression methods. This means that one can simply use a smaller neural network instead of using weight re-parameterization techniques (FastFood, SVD) on a large network.
Further, many compression methods are formulated keeping only fully-connected layers in mind. For tasks like Semantic Segmentation, networks with only convolutional layers are used [Long et al.(2015)Long, Shelhamer, and Darrell]. Our results show that the proposed method can successfully prune both fully connected neurons and convolutional filters. Further, ours (along with SVD) is among the few compression methods that can utilize dense matrix computations, whereas all other methods require specialized kernels for sparse matrix computations [Han et al.(2015)Han, Pool, Tran, and Dally] or custom implementations for diagonal matrix multiplication [Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang], etc.
Table 4: Compression performance on CaffeNet.
| Method | Params | Accuracy (%) | Compression (%) |
| Reference Model (CaffeNet) | 60.9M | 57.41 | 0 |
| Neuron Pruning ([Srinivas and Babu(2015)]) | 39.6M | 55.60 | 35 |
| SVD-quarter-F ([Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang]) | 25.6M | 56.19 | 58 |
| Adaptive FastFood 16 ([Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang]) | 18.7M | 57.10 | 69 |
| AL-conv-fc | 19.6M | 55.90 | 68 |
| AL-fc | 19.8M | 54.30 | 68 |
| AL-conv | 47.8M | 55.87 | 22 |
Table 5: Architectures learnt for CaffeNet.
| Method | Layers Learnt | Architecture |
| Baseline | N/A | 96 256 384 384 256 4096 4096 1000 |
| AL-fc | fc[6,7] | 96 256 384 384 256 1536 1317 1000 |
| AL-conv | conv[1,2,3,4,5] | 80 127 264 274 183 4096 4096 1000 |
| AL-conv-fc | conv[5], fc[6,7] | 96 256 384 384 237 1761 1661 1000 |
5 Conclusions
We have presented a method to learn a neural network’s architecture along with its weights. Rather than directly selecting the width and depth of networks, we introduced a small number of real-valued hyperparameters which select width and depth for us. We also saw that we get smaller architectures for the MNIST and ImageNet datasets that perform on par with the large architectures. Our method is simple and straightforward, and can be applied to any neural network. It can also be used as a tool to further explore the dependence of architecture on the optimization and convergence of neural networks.
References
 [Bergstra et al.(2010)Bergstra, Breuleux, Bastien, Lamblin, Pascanu, Desjardins, Turian, WardeFarley, and Bengio] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David WardeFarley, and Yoshua Bengio. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, page 3. Austin, TX, 2010.
 [Collins and Kohli(2014)] Maxwell D. Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. CoRR, abs/1412.1442, 2014. URL http://arxiv.org/abs/1412.1442.
 [Denil et al.(2013)Denil, Shakibi, Dinh, de Freitas, et al.] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
 [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
 [Fahlman and Lebiere(1989)] Scott E Fahlman and Christian Lebiere. The cascadecorrelation learning architecture. 1989.
 [Gong et al.(2014)Gong, Liu, Yang, and Bourdev] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
 [Han et al.(2015)Han, Pool, Tran, and Dally] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626, 2015.
 [Hassibi et al.(1993)Hassibi, Stork, et al.] Babak Hassibi, David G Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems, pages 164–164, 1993.
 [He et al.(2015)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.

 [Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
 [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [Kulkarni et al.(2015)Kulkarni, Zepeda, Jurie, Pérez, and Chevallier] Praveen Kulkarni, Joaquin Zepeda, Frederic Jurie, Patrick Pérez, and Louis Chevallier. Learning the structure of deep architectures using l1 regularization. In Mark W. Jones Xianghua Xie and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 23.1–23.11. BMVA Press, September 2015. ISBN 1901725537. doi: 10.5244/C.29.23. URL https://dx.doi.org/10.5244/C.29.23.
 [LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel] Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, volume 2, pages 598–605, 1989.
 [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
 [Murray and Ng(2010)] Walter Murray and Kien-Ming Ng. An algorithm for nonlinear optimization problems with binary variables. Computational Optimization and Applications, 47(2):257–288, 2010.
 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and FeiFei] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015. doi: 10.1007/s11263-015-0816-y.
 [Simonyan and Zisserman(2015)] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In International Conference on Learning Representations, 2015.
 [Snoek et al.(2012)Snoek, Larochelle, and Adams] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 [Srinivas and Babu(2015)] Suraj Srinivas and R. Venkatesh Babu. Datafree parameter pruning for deep neural networks. In Mark W. Jones Xianghua Xie and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 31.1–31.12. BMVA Press, September 2015. ISBN 1901725537. doi: 10.5244/C.29.31. URL https://dx.doi.org/10.5244/C.29.31.
 [Stanley and Miikkulainen(2002)] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
 [Stanley et al.(2009)Stanley, D’Ambrosio, and Gauci] Kenneth O Stanley, David B D’Ambrosio, and Jason Gauci. A hypercubebased encoding for evolving largescale neural networks. Artificial life, 15(2):185–212, 2009.
 [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.
 [Yao(1999)] Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.
Properties of the method
Here we identify a few properties of our architecture selection method.

Non-redundancy of architecture: The learnt final architecture must not have any redundant neurons. Removing neurons should necessarily degrade performance.

Local optimality of weights: The performance of the learnt final architecture must at least be equal to that of a trained neural network initialized with this final architecture.

Mirroring data complexity: A ‘harder’ dataset should result in a larger model than an ‘easier’ dataset.
We intuitively observe that all these properties would automatically hold if a ‘master’ property which requires both the architecture and the weights be globally optimal holds. Given that the optimization objective of neural networks is highly nonconvex, global optimality cannot be guaranteed. As a result, we restrict ourselves to studying the three properties listed.
In the text that follows, we provide statements that hold for our method. These are obtained by analysing widths of each layer of a neural network assuming that depth is never collapsed. In other words, these hold for neural networks with a single hidden layer. Proofs are provided in a later section.
Non-redundancy of architecture
This is an important property that forms the main motivation for doing architecture learning. Such a procedure can replace the node-pruning techniques that are used to compress neural networks.
Proposition 1.
At convergence, the loss () of the proposed method over the train set satisfies
This statement implies that change in architecture is inversely proportional to change in loss. In other words, if the architecture grows smaller, the loss must increase. While there isn’t a strict relationship between loss and accuracy, a high loss generally indicates worse accuracy.
Local Optimality of weights
The proposed method learns both architecture and weights. What would happen if we initialized a neural network with this learnt architecture, and proceeded to learn only the weights? This property ensures that in both cases we fall into a local minimum with architecture .
Proposition 2.
Let be the loss over the train set at convergence obtained by training a neural network on data with a fixed architecture . Let be the loss at convergence when the neural network is trained with the proposed method on data such that it results in the same final architecture . Then, and for any .
Mirroring data complexity
Characterizing data complexity has traditionally been hard. Here, we consider the following approach.
Proposition 3.
Let and be two datasets which produce train losses and upon training with a fixed architecture such that . When trained with the proposed method, the final architectures and (corresponding to and ) satisfy the relation at convergence.
Here, is the ‘harder’ dataset because it produces a higher loss on the same neural network architecture. As a result, the ‘harder’ dataset always produces a larger final architecture. We do not provide a proof for this statement. Instead, we experimentally verify this in Section 4.2.
Proofs of Propositions
Let be total objective function, where is the binarizing regularizer, is the model complexity term. At convergence, we assume that
as the corresponding weights are all binary or close to binary. Let the maximum step size (due to gradient clipping) for w and d be .
Proof of proposition 1.
At convergence, we assume
, for some .
for some sufficiently small.
∎
Proof of proposition 2.
Let at iteration with architecture . Let be the architecture at iteration such that at iterations , architecture is .
an iteration such that , being the maximum step size.
Let . Let be parameterized by as follows.
Hence, for large enough , . After iterations, we have
(6) 
for some . However, if , then , such that .
Without loss of generality, let us assume that neurons corresponding to first weights are selected for, while the rest are inactive. As a result, , for . Hence, the following holds . This, along with equation 6, proves the assertion.
∎
Hyperparameter selection
For effective usage of our method, we need a good set of $\lambda$s. Here, we describe how to choose them in practice.
First, we set to a low value based on the initial widths and loss values. Recall that this value multiplies with the number of neurons in the cost function. That is, if a network has a layer with neurons, we get . Hence, if multiplies by 10, divides by 10, so that the regularizer value remains the same. We used for MNISTnetwork and for AlexNet. For a given initial architecture, a large places more emphasis on getting small models than reducing loss.
Second, we set to be about times . Using a positive shifts the curve in Fig. 1(b) to the right. By letting , the curve shifts to the extreme right with the peak at . Hence if , we set .
We simply set and to of and respectively.
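The shift of the curve can be worked out explicitly. The sketch below assumes that a width gate sees the sum of a binarizing term and a linear complexity term, lam_bin · w(1 - w) + lam_cplx · w; the names `lam_bin` and `lam_cplx` are placeholders for the corresponding regularization constants, and the numeric values are illustrative. Setting the derivative lam_bin · (1 - 2w) + lam_cplx to zero gives a peak at w = 0.5 + lam_cplx / (2 · lam_bin), capped at 1.

```python
def gate_regularizer(w, lam_bin, lam_cplx):
    """Binarizing term plus linear complexity term for one gate (assumed form)."""
    return lam_bin * w * (1.0 - w) + lam_cplx * w


def peak_location(lam_bin, lam_cplx):
    """Location of the maximum of gate_regularizer on [0, 1], for lam_bin > 0."""
    return min(1.0, 0.5 + lam_cplx / (2.0 * lam_bin))


print(peak_location(1.0, 0.0))   # no complexity term: peak stays at 0.5
print(peak_location(10.0, 1.0))  # small complexity term: peak moves slightly right
print(peak_location(1.0, 1.0))   # equal constants: peak reaches the right edge
```

Under this form, gates that start to the left of the peak are pushed towards zero by the regularizer alone, so a larger complexity constant relative to the binarizing constant prunes more aggressively.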