Learning Neural Network Architectures using Backpropagation

11/17/2015 ∙ by Suraj Srinivas, et al. ∙ 0

Deep neural networks with millions of parameters are at the heart of many state of the art machine learning models today. However, recent works have shown that models with much smaller number of parameters can also perform just as well. In this work, we introduce the problem of architecture-learning, i.e; learning the architecture of a neural network along with weights. We introduce a new trainable parameter called tri-state ReLU, which helps in eliminating unnecessary neurons. We also propose a smooth regularizer which encourages the total number of neurons after elimination to be small. The resulting objective is differentiable and simple to optimize. We experimentally validate our method on both small and large networks, and show that it can learn models with a considerably small number of parameters without affecting prediction accuracy.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Everything should be made as simple as possible, but not simpler -  Einstein

For large-scale tasks like image classification, the general practice in recent times has been to train large networks with many millions of parameters (see [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Simonyan and Zisserman(2015), Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich]). Looking at these models, it is natural to ask - are so many parameters really needed for good performance? In other words, are these models as simple as they can be? A smaller model has the advantage of being faster to evaluate and easier to store - both of which are crucial for real-time and embedded applications. In this work, we consider the problem of automatically building smaller networks that achieve performance levels similar to larger networks.

Regularizers are often used to encourage learning simpler models. These usually restrict the magnitude () or the sparsity () of weights. However, to restrict the computational complexity of neural networks, we need a regularizer which restricts the width and depth of network. Here, width of a layer refers to the number of neurons in that layer, while depth simply corresponds to the total number of layers. Generally speaking, the greater the width and depth, the more are the number of neurons, the more computationally complex the model is. Naturally, one would want to restrict the total number of neurons as a means of controlling the computational complexity of the model. However, the number of neurons is an integer, making it difficult to optimize over. This work aims at making this problem easier to solve.

The overall contributions of the paper are as follows.

  • We propose novel trainable parameters which are used to restrict the total number of neurons in a neural network model - thus effectively selecting width and depth (Section 2)

  • We perform experimental analysis of our method to analyze the behaviour of our method. (Section 4)

  • We use our method to perform architecture selection and learn models with considerably small number of parameters (Section 4)

2 Complexity as a regularizer

In general, the term ‘architecture’ of a neural network can refer to aspects of a network other than width and depth (like filter size, stride, etc). However, here we use that word to simply mean width and depth. Given that we want to reduce the complexity of the model, let us formally define our notions of complexity and architecture.



be an infinite-dimensional vector whose first

components are positive integers, while the rest are zeros. This represents an -layer neural network architecture with neurons for the layer. We call as the architecture of a neural network.

For these vectors, we define an associated norm which corresponds to our notion of architectural complexity of the neural network. Our notion of complexity is simply the total number of neurons in the network.

The true measure of computational complexity of a neural network would be the total number of weights or parameters. However, if we consider a single layer neural network, this is proportional to the number of neurons in the hidden layer. Even though this equivalence breaks down for multi-layered neural networks, we nevertheless use the same for want of simplicity.


The complexity of a -layer neural network with architecture is given by .

Our overall objective can hence be stated as the following optimization problem.


where denotes the weights of the neural network, and the architecture.

denotes the loss function, which depends on the underlying task to be solved. For example, squared-error loss functions are generally used for regression problems and cross-entropy loss for classification. In this objective, there exists the classical trade-off between model complexity and loss, which is handled by the

parameter. Note that we learn both the weights () as well as the architecture () in this problem. We term any algorithm which solves the above problem as an Architecture-Learning (AL) algorithm.

We observe that the task defined above is very difficult to solve, primarily because

is an integer. This makes it an integer programming problem. Hence, we cannot use gradient-based techniques to optimize for this. The main contribution of this work is the re-formulation of this optimization problem so that Stochastic Gradient Descent (SGD) and back-propagation may be used.

2.1 A Strategy for a trainable regularizer

We require a strategy to automatically select a neural network’s architecture, i.e; the width of each layer and depth of the network. One way to select for width of a layer is to introduce additional learnable parameters which multiply with every neuron’s output, as shown in Figure 1(a). If these new parameters are restricted to be binary, then those neurons with a zero-parameter can simply be removed. In the figure, the trainable parameters corresponding to neurons with values and are zero, nullifying their contribution. Thus, the sum of these binary trainable parameters will be equal to the effective width of the network. For convolutional layers with feature map outputs, we have additional parameters that select a subset of the feature maps. A single additional parameter multiplies with an entire feature map either making it zero or preserving it. After all, filters are analogous to neurons for convolutional layers.

(a)                  (b)
Figure 1: (a) Our strategy for selecting width and depth. Left: Grey blobs denote neurons, coloured blobs denote the proposed additional trainable parameters. Right: Purple bars denote weight-matrices.
(b) Graph of the

regularizer and the binarizing regularizer in 1-D.

To further reduce the complexity of network, we also strive to reduce the network’s depth. It is well known that two neural network layers without any non-linearity between them is equivalent to a single layer, whose parameters are given by the matrix product of the weight matrices of the original two layers. This is shown on the right of Figure 1(a). We can therefore consider a trainable non-linearity, which prefers ‘linearity’ over ‘non-linearity’. Wherever linearity is selected, the corresponding layer can be combined with the next layer. Hence, the total complexity of the neural network would be the number of parameters in layers with a non-linearity.

In this work, we combine both these intuitive observations into one single framework. This is captured in our definition of the tri-state ReLU which follows.

2.1.1 Definition: Tri-state ReLU

We define a new trainable non-linearity which we call the tri-state ReLU (tsReLU) as follows:


This reduces to the usual ReLU for and . For a fixed and a trainable , this turns into parametric ReLU [He et al.(2015)He, Zhang, Ren, and Sun]. For us, both and are trainable. However, we restrict both these parameters to take only binary values. As a result, three possible states exist for this function. For , this function is always returns zero. For and it behaves similar to ReLU, while for it reduces to the identity function.

Here, parameter selects for the width of the layer, while decides depth. While the parameter is different across channels of a layer, the parameter is tied to the same value across all channels. If , we can combine that layer with the next to yield a single layer. If for any channel, we can simply remove that neuron as well as the corresponding weights in the next layer.

Thus, our objective while using the tri-state ReLU is

such that

We remind the reader that here denotes the layer number, while denotes the neuron in a layer. Note that for , it converts the objective in Equation 1 from an integer programming problem to that of binary programming.

2.1.2 Learning binary parameters

Given the definition of tri-state ReLU (tsReLU) above, we require a method to learn binary parameters for and . To this end, we use a regularizer given by [Murray and Ng(2010)]. This regularizer encourages binary values for parameters, if they are constrained to lie in .

Henceforth, we shall refer to this as the binarizing regularizer. Murray and Ng [Murray and Ng(2010)] showed that this regularizer does indeed converge to binary values given a large number of iterations. For the 1-D case, this function is an downward-facing parabola with minima at and , as shown in Figure 1(b). As a result, weights “fall” to or at convergence. In contrast, the regularizer is an upward facing parabola with a minimum at , which causes it to push weights to be close to zero.

With this intuition, we now state our tsReLU optimization objective.


Note that is the regularization constant for the width-limiting term, while is for the depth-limiting term. This objective can be solved using the usual back-propagation algorithm. As indicated earlier, this binarizing regularizer works only if ’s and ’s are guaranteed to be in . To enforce the same, we perform clipping after parameter update.

After optimization, even though the final parameters are expected to be close to binary, they are still real numbers close to or . Let be the parameter obtained during the optimization. The tsReLU function uses a binarized version of this variable

during the feedforward stage. Note that slowly changes during training, while only reflects the changes made to . A similar equation holds for .

2.2 Adding model complexity

So far, we have considered the problem of solving Equation 1 with . As a result, the objective function described above does not necessarily select for smaller models. Let correspond to the complexity of a layer. The model complexity term is given by

This is formulated such that for , the complexity in a layer is just , while for (non-linearity absent), the complexity is . Overall, it counts the total number of weights in the model at convergence.

We now add a regularizer analogous to model complexity (defined above) in our optimization objective in Equation 4 . Let us call the regularizer corresponding to model complexity as , which is given by


The first term in the above equation limits the complexity of each layer’s width, while the second term limits the network’s depth by encouraging linearity. Note that the first term becomes zero when a non-linearity is absent. Also note that the indicator function in the first term is non-differentiable. As a result, we simply treat that term as a constant with respect to .

3 Related Work

There have been many works which look at performing compression of a neural network. Weight-pruning techniques were popularized by Optimal Brain Damage [LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel] and Optimal Brain Surgery [Hassibi et al.(1993)Hassibi, Stork, et al.]. Recently, [Srinivas and Babu(2015)] proposed a neuron pruning technique, which relied on neuronal similarity. Our work, on the other hand, performs neuron pruning based on learning, rather than hand-crafted rules. Our learning objective can thus be seen as performing pruning and learning together, unlike the work of Han et al. [Han et al.(2015)Han, Pool, Tran, and Dally], who perform both operations alternately.

Learning neural network architecture has also been explored to some extent. The Cascade-correlation [Fahlman and Lebiere(1989)] proposed a novel learning rule to ‘grow’ the neural network. However, it was shown for only a single layer network and is hence not clear how to scale to large deep networks. Our work is inspired from the recent work of Kulkarni et al. [Kulkarni et al.(2015)Kulkarni, Zepeda, Jurie, Pérez, and Chevallier] who proposed to learn the width of neural networks in a way similar to ours. Specifically, they proposed to learn a diagonal matrix along with neurons , such that represents that layer’s neurons. However, instead of imposing a binary constraint (like ours), they learn real-weights and impose an -based sparsity-inducing regularizer on to encourage zeros. By imposing a binary constraint, we are able to directly regularize for the model complexity. Recently, Bayesian Optimization-based algorithms [Snoek et al.(2012)Snoek, Larochelle, and Adams]

have also been proposed for automatically learning hyper-parameters of neural networks. However, for the purpose of selecting architecture, these typically require training multiple models with different architectures - while our method selects the architecture in a single run. A large number of evolutionary algorithms (see

[Yao(1999), Stanley and Miikkulainen(2002), Stanley et al.(2009)Stanley, D’Ambrosio, and Gauci]

) also exist for the task of finding Neural Network architectures. However, these are typically evaluated on small scale problems, often not relating to pattern recognition tasks.

Many methods have been proposed to train models that are deep, yet have a lower parameterisation than conventional networks. Collins and Kohli [Collins and Kohli(2014)] propose a sparsity inducing regulariser for backpropogation which promotes many weights to have zero magnitude. They achieve reduction in memory consumption when compared to traditionally trained models. In contrast, our method promotes neurons to have a zero-magnitude. As a result, our overall objective function is much simpler to solve. Denil et al. [Denil et al.(2013)Denil, Shakibi, Dinh, de Freitas, et al.] demonstrate that most of the parameters of a model can be predicted given only a few parameters. At training time, they learn only a few parameters and predict the rest. Yang et al. [Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang] propose an Adaptive Fastfood transform, which is an efficient re-parametrization of fully-connected layer weights. This results in a reduction of complexity for weight storage and computation.

Some recent works have also focussed on using approximations of weight matrices to perform compression. Jaderberg et al. [Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman] and Denton et al. [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus] use SVD-based low rank approximations of the weight matrix. Gong et al. [Gong et al.(2014)Gong, Liu, Yang, and Bourdev] use a clustering-based product quantization approach to build an indexing scheme that reduces the space occupied by the matrix on disk.

4 Experiments

In this section, we perform experiments to analyse the behaviour of our method. In the first set of experiments, we evaluate performance on the MNIST dataset. Later, we look at a case study on ILSVRC 2012 dataset. Our experiments are performed using the Theano Deep Learning Framework

[Bergstra et al.(2010)Bergstra, Breuleux, Bastien, Lamblin, Pascanu, Desjardins, Turian, Warde-Farley, and Bengio].

4.1 Compression performance

We evaluate our method on the MNIST dataset, using a LeNet-like [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] architecture. The network consists of two convolutional layers with 20 and 50 filters, and two fully connected layers with 500 and 10 (output layer) neurons. We use this architecture as a starting point to learn smaller architectures. First, we learn using our additional parameters and regularizers. Second, we remove neurons with zero gate values and collapse depth for linearities wherever is it advantageous. For example, it might not be advantageous to remove depth in a bottleneck layer (like in auto-encoders). Thus, the second part of the process is human-guided.

Starting from a baseline architecture, we learn smaller architectures with variations of our method. Note that there is max-pooling applied after each of the convolutional layers, which rules out depth selection for those two layers. We compare the proposed method against baselines of directly training a neural network (NN) on the final architecture, and our method of learning a fixed final width (FFW) for various layers. In Table

1, the Layers Learnt column has binary elements which denotes whether width() or depth() are learnt for each layer in the baseline network. As an example, the second row shows a method where only the width is learnt in the first two layers, and depth also learnt in the third layer. This table shows that all considered models - large and small - perform more or less equally well in terms of accuracy. This empirically shows that the small models discovered by AL preserve accuracy.

We also compare the compression performance of our AL method against SVD-based compression of the weight matrix in Table 2. Here we only compress layer 3 (which has weights) using SVD. The results show that learning a smaller network is beneficial over learning a large network and then performing SVD-based compression.

Method Layers Learnt Architecture AL (%) NN (%)
Baseline N/A (0,x)-(0,x)-(0,0) 20-50-500-10 N/A 99.3
AL (1,x)-(1,x)-(1,1) 16-26-10 99.07 99.08
AL (1,x)-(1,x)-(1,0) 20-50-20-10 99.07 99.14
AL (1,x)-(1,x)-(1,1) 16-40-10 99.22 99.25
AL (1,x)-(1,x)-(1,0) 20-50-70-10 99.19 99.21
Table 1: Architecture learning performance of our method on a LeNet-like baseline. The Layers Learnt column has binary elements which denotes whether width() or depth() are learnt for each layer in the baseline network. AL = Architecture Learning, NN = Neural Network trained w/o AL
Method Params Accuracy (%)
Baseline 431K 99.3
SVD (rank-10) 43.6K 98.47
AL 40.9K 99.07
SVD (rank-40) 83.1K 99.06
AL 82.3K 99.19
Table 2: Comparison of compression performance of proposed method against SVD-based weight-matrix compression.

4.2 Analysis

We now perform a few more experiments to further analyse the behaviour of our method. In all cases, we train ‘AL’-like models, and consider the third layer for evaluation. We start learning with the baseline architecture considered above.

First, we look at the effects of using different hyper-parameters . From Figure 2(a) , we observe that (i) increasing encourages the method to prune more, and (ii) decreasing encourages the method to learn the architecture for an extended amount of time. In both cases, we see that the architecture stays more-or-less constant after a large enough number of iterations.

Second, we look at the learnt architectures for different amounts of data complexity. Intuitively, simpler data should lead to smaller architectures. A simple way to obtain data of differing complexity is to simply vary the number of classes in a multi-class problem like MNIST. We hence vary the number of classes from , and run our method for each case without changing any hyper-parameters. As seen in Figure 2(b) , we see an almost monotonic increase in both architectural complexity and error rate, which confirms our hypothesis.

Third, we look at the depth-selection capabilities of our method. We used models with various initial depths and observed the depths of the resultant models. We used an initial architecture of 20 - 50 - (75 n) - 10, where layers with width 75 are repeated to obtain a network of desired depth. We see that for small changes in the initial depth, the final learnt depth stays more or less constant. 111For theoretical analysis of our method, please see the Supplementary material.

Initial Final Learnt Architecture Error (%)
Depth Depth
6 5 18-31-32-24-10 1.02
8 6 17-37-39-29-21-10 0.99
10 6 17-34-32-21-21-10 0.97
12 6 18-34-30-21-17-10 1.04
15 8 16-37-35-25-20-22-10 0.93
Table 3: Performance of the proposed method on networks of increasing depth.
(a) (b)
Figure 2: (a) Plot of the architecture learnt against the number of iterations. We see that affects convergence rate while affects amount of pruning.
(b) Plot of the no. of neurons learnt for MNIST with various number of classes. We see that both the neuron count and the error rate increase with increase in number of classes.

4.3 Architecture Selection

In recent times, Bayesian Optimization (BO) has emerged as a compelling option for hyper-parameter optimization. In these set of experiments, we compare the architecture-selection capabilities of our method against BO. In particular, we use the Spearmint-lite software package [Snoek et al.(2012)Snoek, Larochelle, and Adams] with default parameters for our experiments.

We use BO to first determine the width of the last FC layer (a single scalar), and later, the width of all three layers (3 scalars). For comparison, we use the same objective function for both BO and Architecture-Learning. This means that we use for AL, while we externally compute the cost after every training run for BO. Figure 3 shows that BO typically needs multiple runs to discover networks which perform close to AL. Performing such multiple runs is often prohibitive for large networks. Even for a small network like ours, training took

30 minutes on a TitanX GPU for 300 epochs. Training with AL does not change the training time, whereas using BO we spent

10 hours for completing 20 runs. Further, AL directly optimizes the cost function as opposed to BO, which performs a black-box optimization.

Given that we perform architecture selection, what hyper-parameters does AL need? We notice that we only need to decide four quantities - . If our objective is to only decide widths, we need to decide only two quantities - and . Thus, for a -layer neural network, we are able to decide (or

) numbers (widths and depths) based on only two (or four) global hyper-parameters. In the Appendix, we shall look at heuristics for setting these hyper-parameters.

(a) (b)
Figure 3: (a) Comparison against Bayesian Optimization for the case of learning the width of only third layer. (b) Similar comparison for learning the widths of all three layers. ( in both cases)

4.4 Case study: AlexNet

For the experiments that follow, we use an AlexNet-like [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]

model, called CaffeNet, provided with the Caffe Deep Learning framework. It is very similar to AlexNet, except that the order of max-pooling and normalization have been interchanged. We use the ILSVRC 2012

[Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] validation set to compute accuracies in the Table 4. Unlike the experiments performed previously, we start with a pre-trained model and then perform architecture learning (AL) on the learnt weights. We see that our method performs almost as well as the state of the art compression methods. This means that one can simply use a smaller neural network instead of using weight re-parameterization techniques (FastFood, SVD) on a large network.

Further, many compression methods are formulated keeping only fully-connected layers in mind. For tasks like Semantic Segmentation, networks with only convolutional layers are used [Long et al.(2015)Long, Shelhamer, and Darrell]. Our results show that the proposed method can successfully prune both fully connected neurons and convolutional filters. Further, ours (along with SVD) is among the few compression methods that can utilize dense matrix computations, whereas all other methods require specialized kernels for sparse matrix computations [Han et al.(2015)Han, Pool, Tran, and Dally] or custom implementations for diagonal matrix multiplication [Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang], etc.

Method Params Accuracy (%) Compression (% )
Reference Model (CaffeNet) 60.9M 57.41 0
Neuron Pruning ([Srinivas and Babu(2015)]) 39.6M 55.60 35
SVD-quarter-F ([Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang]) 25.6M 56.19 58
Adaptive FastFood 16 ([Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang]) 18.7M 57.10 69
AL-conv-fc 19.6M 55.90 68
AL-fc 19.8M 54.30 68
AL-conv 47.8M 55.87 22
Table 4: Compression performance on CaffeNet.
Method Layers Learnt Architecture
Baseline N/A 96  256  384  384  256  4096  4096  1000
AL-fc fc[6,7] 96  256  384  384  256  1536  1317  1000
AL-conv conv[1,2,3,4,5] 80  127  264  274  183  4096  4096  1000
AL-conv-fc conv[5] - fc[6,7] 96  256  384  384  237  1761  1661  1000
Table 5: Architectures learnt by our method whose performance is given in Table 4.

5 Conclusions

We have presented a method to learn a neural network’s architecture along with weights. Rather than directly selecting width and depth of networks, we introduced a small number of real-valued hyper-parameters which selected width and depth for us. We also saw that we get smaller architectures for MNIST and ImageNet datasets that perform on par with the large architectures. Our method is very simple and straightforward, and can be suitably applied to any neural network. This can also be used as a tool to further explore the dependence of architecture on the optimization and convergence of neural networks.


  • [Bergstra et al.(2010)Bergstra, Breuleux, Bastien, Lamblin, Pascanu, Desjardins, Turian, Warde-Farley, and Bengio] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, page 3. Austin, TX, 2010.
  • [Collins and Kohli(2014)] Maxwell D. Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. CoRR, abs/1412.1442, 2014. URL http://arxiv.org/abs/1412.1442.
  • [Denil et al.(2013)Denil, Shakibi, Dinh, de Freitas, et al.] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
  • [Denton et al.(2014)Denton, Zaremba, Bruna, LeCun, and Fergus] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
  • [Fahlman and Lebiere(1989)] Scott E Fahlman and Christian Lebiere. The cascade-correlation learning architecture. 1989.
  • [Gong et al.(2014)Gong, Liu, Yang, and Bourdev] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
  • [Han et al.(2015)Han, Pool, Tran, and Dally] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626, 2015.
  • [Hassibi et al.(1993)Hassibi, Stork, et al.] Babak Hassibi, David G Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems, pages 164–164, 1993.
  • [He et al.(2015)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.
  • [Jaderberg et al.(2014)Jaderberg, Vedaldi, and Zisserman] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman.

    Speeding up convolutional neural networks with low rank expansions.

    In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
  • [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [Kulkarni et al.(2015)Kulkarni, Zepeda, Jurie, Pérez, and Chevallier] Praveen Kulkarni, Joaquin Zepeda, Frederic Jurie, Patrick Pérez, and Louis Chevallier. Learning the structure of deep architectures using l1 regularization. In Mark W. Jones Xianghua Xie and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 23.1–23.11. BMVA Press, September 2015. ISBN 1-901725-53-7. doi: 10.5244/C.29.23. URL https://dx.doi.org/10.5244/C.29.23.
  • [LeCun et al.(1989)LeCun, Denker, Solla, Howard, and Jackel] Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, volume 2, pages 598–605, 1989.
  • [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 3431–3440, 2015.
  • [Murray and Ng(2010)] Walter Murray and Kien-Ming Ng.

    An algorithm for nonlinear optimization problems with binary variables.

    Computational Optimization and Applications, 47(2):257–288, 2010.
  • [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015. doi: 10.1007/s11263-015-0816-y.
  • [Simonyan and Zisserman(2015)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [Snoek et al.(2012)Snoek, Larochelle, and Adams] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • [Srinivas and Babu(2015)] Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. In Mark W. Jones Xianghua Xie and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 31.1–31.12. BMVA Press, September 2015. ISBN 1-901725-53-7. doi: 10.5244/C.29.31. URL https://dx.doi.org/10.5244/C.29.31.
  • [Stanley and Miikkulainen(2002)] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
  • [Stanley et al.(2009)Stanley, D’Ambrosio, and Gauci] Kenneth O Stanley, David B D’Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial life, 15(2):185–212, 2009.
  • [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [Yang et al.(2014)Yang, Moczulski, Denil, de Freitas, Smola, Song, and Wang] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.
  • [Yao(1999)] Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.

Properties of the method

Here we identify a few properties of our architecture selection method.

  1. Non-redundancy of architecture: The learnt final architecture must not have any redundant neurons. Removing neurons should necessarily degrade performance.

  2. Local-optimality of weights: The performance of the learnt final architecture must at least be equal to a trained neural network initialized with this final architecture.

  3. Mirroring data-complexity: A ‘harder’ dataset should result in a larger model than an ‘easier’ dataset.

We intuitively observe that all these properties would automatically hold if a ‘master’ property which requires both the architecture and the weights be globally optimal holds. Given that the optimization objective of neural networks is highly non-convex, global optimality cannot be guaranteed. As a result, we restrict ourselves to studying the three properties listed.

In the text that follows, we provide statements that hold for our method. These are obtained by analysing widths of each layer of a neural network assuming that depth is never collapsed. In other words, these hold for neural networks with a single hidden layer. Proofs are provided in a later section.

Non-redundancy of architecture

This is an important property that forms the main motivation for doing architecture-learning. Such a procedure can replace the node-pruning techniques that are used to compress neural networks.

Proposition 1.

At convergence, the loss () of the proposed method over the train set satisfies

This statement implies that change in architecture is inversely proportional to change in loss. In other words, if the architecture grows smaller, the loss must increase. While there isn’t a strict relationship between loss and accuracy, a high loss generally indicates worse accuracy.

Local Optimality of weights

The proposed method learns both architecture and weights. What would happen if we initialized a neural network with this learnt architecture, and proceeded to learn only the weights? This property ensures that in both cases we fall into a local minimum with architecture .

Proposition 2.

Let be the loss over the train set at convergence obtained by training a neural network on data with a fixed architecture . Let be the loss at convergence when the neural network is trained with the proposed method on data such that it results in the same final architecture . Then, and for any .

Mirroring data-complexity

Characterizing data-complexity has traditionally been hard. Here, we consider the following approach.

Proposition 3.

Let and be two datasets which produce train losses and upon training with a fixed architecture such that . When trained with the proposed method, the final architectures and (corresponding to and ) satisfy the relation at convergence.

Here, is the ‘harder’ dataset because it produces a higher loss on the same neural network architecture. As a result, the ‘harder’ dataset always produces a larger final architecture. We do not provide a proof for this statement. Instead, we experimentally verify this in Section 4.2.

Proofs of Propositions

Let be total objective function, where is the binarizing regularizer, is the model complexity term. At convergence, we assume that

as the corresponding weights are all binary or close to binary. Let the maximum step size (due to gradient clipping) for

w and d be .

Proof of proposition 1.

At convergence, we assume
, for some .

for some sufficiently small.

Proof of proposition 2.

Let at iteration with architecture . Let be the architecture at iteration such that at iterations , architecture is .

an iteration such that , being the maximum step size.

Let . Let be parameterized by as follows.

Hence, for large enough , . After iterations, we have


for some . However, if , then , such that .

Without loss of generality, let us assume that neurons corresponding to first weights are selected for, while the rest are inactive. As a result, , for . Hence, the following holds . This, along with equation 6, proves the assertion.

Hyper-parameter selection

For effective usage of our method, we need a good set of s. Here, we describe how to do so practically.

First, we set to a low value based on the initial widths and loss values. Recall that this value multiplies with the number of neurons in the cost function. That is, if a network has a layer with neurons, we get . Hence, if multiplies by 10, divides by 10, so that the regularizer value remains the same. We used for MNIST-network and for AlexNet. For a given initial architecture, a large places more emphasis on getting small models than reducing loss.

Second, we set to be about times . Using a positive shifts the curve in Fig. 1(b) to the right. By letting , the curve shifts to the extreme right with the peak at . Hence if , we set .

We simply set and to of and respectively.