1 Introduction
Estimating, and consequently adequately setting, representational capacity in deep neural networks for any given task has been a long-standing challenge. Fundamental understanding still seems to be insufficient to rapidly decide on suitable network sizes and architecture topologies. While widely adopted convolutional neural networks (CNNs) such as those proposed by Krizhevsky et al. (2012); Simonyan & Zisserman (2015); He et al. (2016); Zagoruyko & Komodakis (2016) demonstrate high accuracies on a variety of problems, their memory footprint and computational complexity vary. An increasing amount of recent work is already providing valuable insights and proposing new methodology to address these points. For instance, Baker et al. (2016) propose a reinforcement learning based meta-learning approach to have an agent select potential CNN layers in a greedy, yet iterative fashion. Other suggested architecture selection algorithms draw their inspiration from evolutionary synthesis concepts (Shafiee et al., 2016; Real et al., 2017). Although these methods are capable of evolving architectures that rival those crafted by human design, this is currently only achievable at the cost of navigating large search spaces and hence excessive computation and time. As a trade-off in present deep neural network design processes it thus seems plausible to consider layer types or depth of a network to be selected by an experienced engineer based on prior knowledge and former research. A variety of techniques therefore focus on improving already well-established architectures. Procedures ranging from distillation of one network’s knowledge into another (Hinton et al., 2014), compressing and encoding learned representations (Han et al., 2016), pruning alongside potential retraining of networks (Han et al., 2015, 2017; Shrikumar et al., 2016; Hao et al., 2017) and the employment of different regularization terms during training (He et al., 2015; Kang et al., 2016; Rodriguez et al., 2017; Alvarez & Salzmann, 2016), are just a fraction of recent efforts in pursuit of reducing representational complexity while attempting to retain accuracy. Underlying mechanisms rely on a multitude of criteria such as activation magnitudes (Shrikumar et al., 2016) and small weight values (Han et al., 2015) that are used as pruning metrics for either single neurons or complete feature maps, in addition to further combination with regularization and penalty terms.
Common to these approaches is the necessity of training networks with large parameter quantities for maximum representational capacity to full convergence and the lack of early identification of insufficient capacity. In contrast, this work proposes a bottom-up approach with the following contributions:

We introduce a computationally efficient, intuitive metric to evaluate feature importance at any point of training a neural network. The measure is based on feature time evolution, specifically the normalized cross-correlation of each feature with its initialization state.

We propose a bottom-up greedy algorithm to automatically expand fixed-depth networks that start with one feature per layer until adequate representational capacity is reached. We base addition of features on our newly introduced metric due to its computationally efficient nature, while in principle a family of similarly constructed metrics is imaginable.

We revisit popular CNN architectures and compare them to automatically expanded networks. We show how our architectures systematically scale in terms of complexity of different datasets and either maintain their reference accuracy at a reduced number of parameters or achieve better results through increased network capacity.

We provide insights on how evolved network topologies differ from their reference counterparts where conventional design commonly increases the number of features monotonically with increasing network depth. We observe that expanded architectures exhibit increased feature counts at early to intermediate layers and then proceed to decrease in complexity.
2 Building neural networks bottom-up feature by feature
While the choice and size of a deep neural network model indicate its representational capacity and thus determine which functions can be learned to improve training accuracy, training of neural networks is further complicated by the complex interplay of choice of optimization algorithm and model regularization. Together, these factors define the effective capacity. This makes training of deep neural networks particularly challenging. One practical way of addressing this challenge is to boost model sizes at the cost of increased memory and computation times and then to apply strong regularization to avoid overfitting and minimize generalization error. However, this approach seems unnecessarily cumbersome and relies on the assumption that optimization difficulties are not encountered. We draw inspiration from this challenge and propose a bottom-up approach to increase capacity in neural networks along with a new metric to gauge the effective capacity in the training of (deep) neural networks with stochastic gradient descent (SGD) algorithms.
2.1 Normalized weight-tensor cross-correlation as a measure for neural network effective capacity
In SGD the objective function $J(\Theta)$ is commonly equipped with a penalty on the parameters $R(\Theta)$, yielding a regularized objective function:
$$\hat{J}(\Theta) = J(\Theta) + \alpha R(\Theta) \qquad (1)$$
Here, $\alpha$ weights the contribution of the penalty. The regularization term $R(\Theta)$ is typically chosen as an $L^2$-norm, coined weight-decay, to decrease model capacity or an $L^1$-norm to enforce sparsity. Methods like dropout (Srivastava et al., 2014) and batch normalization (Ioffe & Szegedy, 2015) are typically employed as further implicit regularizers.
In principle, our rationale is inspired by earlier works of Hao et al. (2017) who measure a complete feature’s importance by taking the $L^1$-norm of the corresponding weight tensor instead of operating on individual weight values. In the same spirit we assign a single importance value to each feature based on its values. However, we do not use the weight magnitude directly and instead base our metric on the following hypothesis: While a feature’s absolute magnitude or relative change between two subsequent points in time might not be adequate measures for direct importance, the relative amount of change a feature experiences with respect to its original state provides an indicator for how many times and how much a feature is changed when presented with data. Intuitively we suggest that features that experience high structural changes must play a more vital role than any feature that is initialized and does not deviate from its original state’s structure. There are two potential reasons why a feature that has randomly been initialized does not change in structure: The first is that its form is already initialized so well that it does not need to be altered and can serve either as is or after some scalar rescaling or shift in order to contribute. The second possibility is that too high representational capacity, the nature of the cost function, too large regularization or the type of optimization algorithm prohibit the feature from being learned, ultimately rendering it obsolete. As deep neural networks are commonly initialized using a distribution over high-dimensional space, the first possibility seems unlikely (Goodfellow et al., 2016).
As one way of measuring the effective capacity at a given state of learning, we propose to monitor the time evolution of the normalized cross-correlation for all weights with respect to their state at initialization. For a convolutional neural network composed of $L$ layers $l = 1, 2, \ldots, L$ and complementary weight-tensors $W^l_{f^l f^{l+1} j k}$ with spatial dimensions $j \times k$ defining a mapping from an input feature-space $f^l = 1, 2, \ldots, F^l$ onto the output feature space $f^{l+1} = 1, 2, \ldots, F^{l+1}$ that serves as input to the next layer, we define the following metric:
$$c^l_{f^{l+1}, t} = 1 - \frac{\sum_{f^l, j, k} \left[ \left( W^l_{f^l f^{l+1} j k, t_0} - \bar{W}^l_{f^{l+1}, t_0} \right) \circ \left( W^l_{f^l f^{l+1} j k, t} - \bar{W}^l_{f^{l+1}, t} \right) \right]}{\left\| W^l_{f^{l+1}, t_0} \right\|_2 \cdot \left\| W^l_{f^{l+1}, t} \right\|_2} \qquad (2)$$
which is a measure of self-resemblance. In this equation, $W^l_{f^l f^{l+1} j k, t}$ is the state of a layer’s weight-tensor at time $t$ or the initial state after initialization $t_0$. $\bar{W}^l_{f^{l+1}, t}$ is the mean taken over spatial and input feature dimensions. $\circ$ depicts the Hadamard product that we use in an extended fashion from matrices to tensors where each dimension is multiplied in an element-wise fashion analogously. Similarly the terms in the denominator are defined as the $L^2$-norm of the weight-tensor taken over said dimensions and thus resulting in a scalar value. The above equation can be defined in an analogous way for multi-layer perceptrons by omitting spatial dimensions.
The metric is easily interpretable: no structural change of a feature leads to a value of zero, and importance approaches unity the more a feature deviates in structure. The usage of normalized cross-correlation with the $L^2$-norm in the denominator has the advantage of an inherent invariance to effects such as translations or rescaling of weights stemming from various regularization contributions. Therefore the contribution of the sum term in equation 1 does not change the value of the metric if the gradient term vanishes. This is in contrast to the measure proposed by Hao et al. (2017), as absolute weight magnitudes are affected by rescaling, which makes it more difficult to interpret the metric in an absolute way and to find corresponding thresholds.
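As a minimal sketch of how equation 2 could be evaluated, the following NumPy function computes one importance value per output feature of a convolutional weight tensor. The function name `feature_importance`, the tensor layout (output features, input features, height, width) and the guard value `tiny` are assumptions made for illustration. We also assume here that the norms in the denominator are taken over the mean-centred tensors, so that the stated invariance to shifts and rescaling holds exactly.

```python
import numpy as np


def feature_importance(w_init, w_now, tiny=1e-12):
    """Return one importance value per output feature (sketch of equation 2).

    Both tensors have shape (out_features, in_features, height, width).
    """
    out_features = w_init.shape[0]
    # Flatten input-feature and spatial dimensions for each output feature.
    a = w_init.reshape(out_features, -1)
    b = w_now.reshape(out_features, -1)
    # Subtract the per-feature mean (assumption: norms are taken over the
    # mean-centred tensors so that constant shifts cancel).
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    numerator = (a * b).sum(axis=1)  # element-wise product, then sum
    denominator = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - numerator / np.maximum(denominator, tiny)


rng = np.random.default_rng(0)
w0 = rng.normal(size=(4, 3, 3, 3))
# A feature that was only rescaled and shifted keeps its structure.
unchanged = feature_importance(w0, 0.5 * w0 + 0.1)
# A feature replaced by unrelated values has lost its resemblance.
relearned = feature_importance(w0, rng.normal(size=w0.shape))
```

In this sketch a pure rescaling, such as the shrinkage caused by weight decay when the gradient term vanishes, leaves the value at zero, whereas weights that no longer resemble their initial state yield clearly non-zero values.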
2.2 Bottom-up construction of neural network representational capacity
We propose a new method to converge to architectures that encapsulate necessary task complexity without the necessity of training huge networks in the first place. Starting with one feature in each layer, we expand our architecture as long as the effective capacity as estimated through our metric is not met and all features experience structural change. In contrast to methods such as Baker et al. (2016); Shafiee et al. (2016); Real et al. (2017) we do not consider flexible depth and treat the number of layers in a network as a prior based on the belief of hierarchical composition of the underlying factors. Our method, shown in algorithm 1, can be summarized as follows:

For a given network arrangement in terms of function type, depth and a set of hyperparameters: initialize each layer with one feature and proceed with (minibatch) SGD.

After each update step evaluate equation 2 independently per layer and increase feature dimensionality by (one or higher if a complexity prior exists) if all currently present features in respective layer are differing from their initial state by more than a constant .

Reinitialize all parameters if architecture has expanded.
The constant is a numerical stability parameter that we set to a small value such as , but could in principle as well be used as a constraint. We have decided to include the re-initialization in step 3 (lines ) to avoid the pitfalls of falling into local minima (footnote 1: we have empirically observed promising results even without re-initialization, but deeper analysis of stability, e.g. expansion speed vs. training rate, and of the initialization of new features during training, according to the chosen scheme or aligned with already learned representations, is required). Despite this sounding like a major detriment to our method, we show that networks nevertheless rapidly converge to a stable architectural solution that comes at less computational overhead than one might expect and with the benefit of avoiding training of overly large architectures. Naturally at least one form of explicit or implicit regularization has to be present in the learning process in order to prevent infinite expansion of the architecture. We would like to emphasize that we have chosen the metric defined in equation 2 as a basis for the decision of when to expand an architecture, but in principle a family of similarly constructed metrics is imaginable. We have chosen this particular metric because it does not directly depend on gradient or higher-order term calculation and only requires multiplication of weights with themselves. Thus, a major advantage is that computation of equation 2 can be parallelized completely and therefore executed at less cost than a regular forward pass through the network.
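The three steps summarized above can be sketched in a framework-independent way as follows. This is an illustrative sketch and not a reproduction of algorithm 1: the callbacks `init_fn` (returns freshly initialized weight tensors for given layer widths) and `sgd_step_fn` (performs one mini-batch update), as well as the names `eps` and `step`, are hypothetical, and the fixed-size output layer is left out for brevity.

```python
import numpy as np


def importance(w_init, w_now, tiny=1e-12):
    """Per-feature value of equation 2 for tensors of shape (out, in, h, w)."""
    a = w_init.reshape(w_init.shape[0], -1)
    b = w_now.reshape(w_now.shape[0], -1)
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - (a * b).sum(axis=1) / np.maximum(denom, tiny)


def expand_widths(widths, init_weights, weights, eps, step=1):
    """Return the new layer widths and whether any layer was expanded.

    A layer grows by `step` features only if every one of its current
    features differs from its initial state by more than `eps`.
    """
    new_widths = list(widths)
    for layer, (w0, w) in enumerate(zip(init_weights, weights)):
        if np.all(importance(w0, w) > eps):
            new_widths[layer] += step
    return new_widths, new_widths != list(widths)


def train_with_expansion(init_fn, sgd_step_fn, num_steps, depth, eps, step=1):
    """Greedy bottom-up expansion loop following the three summarized steps."""
    widths = [1] * depth                       # step 1: one feature per layer
    init_weights = init_fn(widths)
    weights = [w.copy() for w in init_weights]
    for _ in range(num_steps):
        weights = sgd_step_fn(weights)         # one (mini-batch) SGD update
        # Step 2: evaluate the metric independently per layer.
        widths, expanded = expand_widths(widths, init_weights, weights, eps, step)
        if expanded:                           # step 3: re-initialize all parameters
            init_weights = init_fn(widths)
            weights = [w.copy() for w in init_weights]
    return widths, weights
```

Keeping a copy of the initial weights next to the current weights is the only additional state the procedure needs, which reflects the remark that the metric does not depend on gradients or activations.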
3 Revisiting popular architectures with architecture expansion
We revisit some of the most established architectures, “GFCNN” (Goodfellow et al., 2013), “VGG-A & E” (Simonyan & Zisserman, 2015) and “Wide Residual Network: WRN” (Zagoruyko & Komodakis, 2016) (see appendix for architectural details), with batch normalization (Ioffe & Szegedy, 2015). We compare the number of learnable parameters and achieved accuracies with those obtained through expanded architectures that started from a single feature in each layer. For each architecture we include all-convolutional variants (Springenberg et al., 2015) that are similar to WRNs (minus the skip-connections), where all pooling layers are replaced by convolutions with larger stride. All fully-connected layers are furthermore replaced by a single convolution (affine, no activation function) that maps directly onto the space of classes. Even though the value of more complex types of subsampling functions has already empirically been demonstrated (Lee et al., 2015), the number of features of the replaced layers has been constrained to match in dimensionality with the preceding convolution layer. We would thus like to further extend and analyze the role of layers involving subsampling by decoupling the dimensionality of these larger stride convolutional layers.
We consider these architectures as some of the best CNN architectures as each of them has been chosen and tuned carefully according to extensive amounts of hyperparameter search. As we would like to demonstrate how representational capacity in our automatically constructed networks scales with increasing task difficulty, we perform experiments on the MNIST (LeCun et al., 1998) and CIFAR10 & 100 (Krizhevsky, 2009) datasets that intuitively represent little to high classification challenge. We also show some preliminary experiments on the ImageNet (Russakovsky et al., 2015) dataset with “Alexnet” (Krizhevsky et al., 2012) to conceptually show that the algorithm is applicable to large scale challenges. All training is closely inspired by the procedure specified in Zagoruyko & Komodakis (2016) with the main difference of avoiding heavy preprocessing. We preprocess all data using only train-set mean and standard deviation (see appendix for exact training parameters). Although we are in principle able to achieve higher results with different sets of hyperparameters and preprocessing methods, we limit ourselves to this training methodology to provide a comprehensive comparison and avoid masking of our contribution. We train all architectures five times on each dataset using an Intel i7-6800K CPU (data loading) and a single NVIDIA TitanX GPU. Code has been written in both Torch7 (Collobert et al., 2011) and PyTorch (http://pytorch.org/) and will be made publicly available.
3.1 The top-down perspective: Feature importance for pruning
We first provide a brief example for the use of equation 2 through the lens of pruning to demonstrate that our metric adequately measures feature importance. We evaluate the contribution of the features by pruning the weight-tensor feature by feature in ascending order of feature importance values and re-evaluating the remaining architecture. We compare our normalized cross-correlation metric (equation 2) to the weight norm metric introduced by Hao et al. (2017) and ranked mean activations evaluated over an entire epoch. In figure 1 we show the pruning of a trained GFCNN, expecting that such a network will be too large for the easier MNIST and too small for the difficult CIFAR100 task. For all three metrics pruning any feature from the architecture trained on CIFAR100 immediately results in loss of accuracy, whereas the architecture trained on MNIST can be pruned to a smaller set of parameters by greedily dropping the next feature with the currently lowest feature importance value. We notice how all three metrics perform comparably. However, in contrast to the other two metrics, our normalized cross-correlation captures whether a feature is important on an absolute scale. For MNIST the curve is very close to zero, whereas the metric is close to unity for all CIFAR100 features. Ultimately this is the reason our metric, in the way formulated in equation 2, is used for algorithm 1, as it does not require a difficult process to determine individual layer threshold values. Nevertheless it is imaginable that similar metrics based on other tied quantities (gradients, activations) can be formulated in analogous fashion.
As our main contribution lies in the bottom-up widening of architectures we do not go into more detailed analysis and comparison of pruning strategies. We also remark that in contrast to a bottom-up approach to finding suitable architectures, pruning seems less desirable: it requires convergent training of a huge architecture with lots of regularization before complexity can be introduced; pruning is not capable of adding complexity if representational capacity is lacking; pruning percentages are difficult to interpret and compare (i.e. pruning percentage is 0 if the architecture is adequate); a majority of parameters are pruned only in the last “fully-connected” layers (Han et al., 2015); and pruning strategies as suggested by Han et al. (2015, 2017); Shrikumar et al. (2016); Hao et al. (2017) tend to require many cross-validation runs with consecutive fine-tuning steps. We thus continue with the bottom-up perspective of expanding architectures from low to high representational capacity.
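The pruning evaluation described in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions: pruning is emulated by zeroing a feature's weights, `evaluate` is a hypothetical callback that returns the accuracy of the network when the given tensor is used, and `l1_importance` stands in for a weight-norm ranking in the spirit of Hao et al. (2017).

```python
import numpy as np


def prune_in_ascending_order(weights, importances, evaluate):
    """Zero out features one by one, least important first.

    `weights` has shape (out_features, ...). Returns the value reported by
    `evaluate` after each pruning step; the input tensor is left untouched.
    """
    pruned = weights.copy()
    results = []
    for feature in np.argsort(importances):   # ascending importance
        pruned[feature] = 0.0                 # remove this feature's weights
        results.append(evaluate(pruned))
    return results


def l1_importance(weights):
    """Sum of absolute weights per output feature (weight-norm ranking)."""
    return np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
```

Any per-feature ranking, including the values of equation 2, can be passed as `importances`, which makes it straightforward to compare several metrics on the same trained network.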
3.2 The bottom-up perspective: Expanding architectures
We use the described training procedure in conjunction with algorithm 1 to expand representational complexity by adding features to architectures that started with just one feature per layer, with the following additional settings:
Architecture expansion settings and considerations:
Our initial experiments added one feature at a time, but large speed-ups can be introduced by means of adding stacks of features. Initially, we avoided suppression of late re-initialization to analyze the possibility that rarely encountered worst-case behavior of restarting on an almost completely trained architecture provides any benefit. After some experimentation our final report used a stability parameter ending the network expansion if half of the training has been stable (no further change in architecture) and added and features per expansion step for MNIST and CIFAR10 & 100 experiments respectively.
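One possible reading of the stability setting is sketched below. The function name, its arguments and the interpretation of "half of the training" as half of the total number of epochs are assumptions made for illustration.

```python
def expansion_finished(epoch, last_change_epoch, total_epochs, stable_fraction=0.5):
    """Return True once the architecture has been unchanged for the given
    fraction of the overall training duration (measured in epochs)."""
    return (epoch - last_change_epoch) >= stable_fraction * total_epochs


# Hypothetical schedule: 100 epochs in total, last expansion at epoch 30.
still_expanding = expansion_finished(epoch=60, last_change_epoch=30, total_epochs=100)
frozen = expansion_finished(epoch=80, last_change_epoch=30, total_epochs=100)
```

Under this reading, expansion checks are skipped once the function returns True, which suppresses a late re-initialization of an almost completely trained architecture.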
We show an exemplary architecture expansion of the GFCNN architecture’s layers for MNIST and CIFAR100 datasets in figure 2 and the evolution of the overall amount of parameters for five different experiments. We observe that layers expand independently at different points in time and more features are allocated for CIFAR100 than for MNIST. When comparing the five different runs we can identify that all architectures converge to a similar amount of network parameters, however at different points in time. A good example to see this behavior is the solid (green) curve in the MNIST example, where the architecture at first seems to converge to a state with lower amount of parameters and after some epochs of stability starts to expand (and reinitialize) again until it ultimately converges similarly to the other experiments.
We continue to report results obtained for the different datasets and architectures in table 1. The table lists the mean and standard deviation values for error, total number of parameters and the mean overall time taken for five runs of algorithm 1 (deviation can be fairly large due to the behavior observed in figure 2). We make the following observations:
[Table 1: error [%], number of parameters and time [h] for the original and expanded GFCNN, VGG-A, VGG-E and WRN-28-10 architectures, each in a standard and an all-convolutional variant, on MNIST, CIFAR10+ and CIFAR100+. The MNIST standard error row contains an “overfit” entry.]

Without any prior on layer widths, expanding architectures converge to states with at least similar accuracies to the reference at a reduced number of parameters, or better accuracies by allocating more representational capacity.

For each architecture type there is a clear trend in network capacity that is increasing with dataset complexity from MNIST to CIFAR10 to CIFAR100 (footnote 2: for the WRN CIFAR100 architecture the marked table entry signifies hardware memory limitations due to the arrangement of architecture topology, and thus expansion is limited. This is because an increased number of early-layer features requires more memory in contrast to late layers, which is particularly intense for the coupled WRN architecture).

Even though we have introduced re-initialization of the architecture, the time taken by algorithm 1 is much less than one would invest when doing a manual, grid or random search.

The large reference VGG-E (lower accuracy than VGG-A on CIFAR) and WRN-28-10 (complete overfit on MNIST) seem to run into optimization difficulties for these datasets. However, the expanded alternate architectures clearly perform significantly better.
In general we observe that these benefits are due to the unconventional, yet always coinciding, network topology of our expanded architectures. These topologies suggest that there is more to CNNs than simply following the rule of thumb of increasing the number of features with increasing architectural depth. Before proceeding with more detail on these alternate architecture topologies, we want to again emphasize that, for reasons of clarity, we do not report experiments containing extended methodology such as excessive preprocessing, data augmentation, the oscillating learning rates proposed in Loshchilov & Hutter (2017) or better sets of hyperparameters, even though accuracies rivaling state-of-the-art performances can be achieved in this way.
3.3 Alternate formation of deep neural network topologies
Almost all popular convolutional neural network architectures follow a design pattern of monotonically increasing the number of features with increasing network depth (LeCun et al., 1998; Goodfellow et al., 2013; Simonyan & Zisserman, 2015; Springenberg et al., 2015; He et al., 2016; Zagoruyko & Komodakis, 2016; Loshchilov & Hutter, 2017; Urban et al., 2017). For the results presented in table 1 all automatically expanded network topologies present alternatives to this pattern. In figure 3, we illustrate exemplary mean topologies for a VGG-E and VGG-E all-convolutional network as constructed by our expansion algorithm in five runs on the three datasets. Apart from noticing the systematic variations in representational capacity with dataset difficulty, we furthermore find topological convergence with small deviations from one training to another. We observe the highest feature dimensionality in early to intermediate layers with generally decreasing dimensionality towards the end of the network, differing from conventional CNN design patterns. Even if the expanded architectures sometimes do not deviate much from the reference parameter count, accuracy seems to be improved through this topological rearrangement. For architectures where pooling has been replaced with larger stride convolutions we also observe that the dimensionality of layers with subsampling changes independently of the prior and following convolutional layers, suggesting that highly complex subsampling operations are learned. This is an extension to the proposed all-convolutional variant of Springenberg et al. (2015), where the introduced additional convolutional layers were constrained to match the dimensionality of the previously present pooling operations.
If we view the deep neural network as being able to represent any function that is limited rather by concepts of continuity and boundedness instead of a specific form of parameters, we can view the minimization of the cost function as learning a functional mapping instead of merely adopting a set of parameters (Goodfellow et al., 2016). We hypothesize that evolved network topologies containing a higher number of features in early to intermediate layers generally follow a process of first mapping into higher dimensional space to effectively separate the data into many clusters. The network can then more readily aggregate specific sets of features to form clusters distinguishing the class subsets.
Empirically this behavior finds confirmation in all our evolved network topologies that are visualized in the appendix. Similar formation of topologies, restricted by the dimensionality constraint of the identity mappings, can be found in the trained residual networks.
While He et al. (2015) have shown that deep VGG-like architectures do not perform well, an interesting question for future research could be whether plainly stacked architectures can perform similarly to residual networks if the arrangement of feature dimensionality differs from the conventional design of monotonic increase with depth.
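To make comparisons between topologies concrete, the parameter count of a plain stack of convolutions can be computed directly from its layer widths. The following sketch is illustrative only: the two width profiles are hypothetical and not taken from our experiments, and batch normalization parameters as well as the classification layer are ignored.

```python
def conv_stack_parameters(widths, in_channels=3, kernel=3):
    """Number of weights and biases in a plain stack of kernel x kernel convolutions."""
    total, previous = 0, in_channels
    for width in widths:
        total += previous * width * kernel * kernel + width  # weights + biases
        previous = width
    return total


# Hypothetical width profiles for a four-layer network.
conventional = [64, 128, 256, 512]   # monotonically increasing with depth
rearranged = [128, 512, 256, 64]     # widest at early to intermediate layers

params_conventional = conv_stack_parameters(conventional)
params_rearranged = conv_stack_parameters(rearranged)
```

Because each layer's cost is the product of its own width and that of its predecessor, moving capacity between layers changes the total count, so reported parameter numbers should always be read together with the topology that produced them.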
3.4 An outlook to ImageNet
We show two first experiments on the ImageNet dataset using an all-convolutional Alexnet to show that our methodology can readily be applied to large scale. The results for the two runs can be found in table 2 and the corresponding expanded architectures are visualized in the appendix. We observe that the experiments seem to follow the general pattern and again observe that topological rearrangement of the architecture yields substantial benefits. In the future we would like to extend experimentation to more promising ImageNet architectures such as deep VGG and residual networks. However, these architectures already require 4 to 8 GPUs and large amounts of time in their baseline evaluation, which is why we presently are not capable of evaluating these architectures and keep this section at a very brief proof of concept level.
Table 2:
run | type | top-1 error | top-5 error | params | time
Alexnet - 1 | original | 43.73 % | 20.11 % | 35.24 | 27.99 h
Alexnet - 1 | expanded | 37.84 % | 15.88 % | 34.76 | 134.21 h
Alexnet - 2 | original | 43.73 % | 20.11 % | 35.24 | 27.99 h
Alexnet - 2 | expanded | 38.47 % | 16.33 % | 32.98 | 118.73 h
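The trade-off reported in table 2 can be made explicit with a few lines of arithmetic. The numbers below are copied from the table; the helper name `compare` and the dictionary keys are chosen for illustration.

```python
# Values copied from table 2 (all-convolutional Alexnet, two runs).
original = {"top1": 43.73, "top5": 20.11, "time_h": 27.99}
expanded_runs = [
    {"top1": 37.84, "top5": 15.88, "time_h": 134.21},
    {"top1": 38.47, "top5": 16.33, "time_h": 118.73},
]


def compare(reference, candidate):
    """Error reduction in percentage points and wall-clock overhead factor."""
    return {
        "top1_gain": reference["top1"] - candidate["top1"],
        "top5_gain": reference["top5"] - candidate["top5"],
        "time_factor": candidate["time_h"] / reference["time_h"],
    }


for run, expanded in enumerate(expanded_runs, start=1):
    result = compare(original, expanded)
    print(
        f"run {run}: top-1 -{result['top1_gain']:.2f} points, "
        f"top-5 -{result['top5_gain']:.2f} points, "
        f"time x{result['time_factor']:.2f}"
    )
```

For the first run this gives a top-1 error reduction of 5.89 percentage points at roughly 4.8 times the training time of the reference; for the second run the reduction is 5.26 percentage points at roughly 4.2 times the reference time.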
4 Conclusion
In this work we have introduced a novel bottom-up algorithm to start neural network architectures with one feature per layer and widen them until a task-dependent, suitable representational capacity is achieved. For the use in this framework we have presented one potential computationally efficient and intuitive metric to gauge feature importance. The proposed algorithm is capable of expanding architectures that provide either a reduced number of parameters or improved accuracies through a higher number of representations. This advantage seems to be gained through alternative network topologies with respect to commonly applied designs in current literature. Instead of increasing the number of features monotonically with increasing depth of the network, we empirically observe that expanded neural network topologies have a high number of representations in early to intermediate layers.
Future work could include a re-evaluation of plainly stacked deep architectures with new insights on network topologies. We have furthermore started to replace the currently present re-initialization step in the proposed expansion algorithm by keeping learned filters. In principle this approach looks promising but needs further systematic analysis of new feature initialization with respect to the already learned feature subset and an accompanying investigation of orthogonality to avoid falling into local minima.
Acknowledgements
This work has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 687384. Kishora Konda and Tobias Weis received funding from Continental Automotive GmbH. We would like to further thank Anjaneyalu Thippaiah for help with execution of ImageNet experiments.
References
 Alvarez & Salzmann (2016) Jose M. Alvarez and Mathieu Salzmann. Learning the Number of Neurons in Deep Networks. In NIPS, 2016.
 Ba & Caruana (2014) Lei J. Ba and Rich Caruana. Do Deep Nets Really Need to be Deep? arXiv preprint arXiv:1312.6184, 2014.
 Baker et al. (2016) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing Neural Network Architectures using Reinforcement Learning. arXiv preprint arXiv:1611.02167, 2016.

 Collobert et al. (2011) Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like Environment for Machine Learning. In BigLearn, NIPS Workshop, 2011.
 Goodfellow et al. (2013) Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout Networks. In ICML, 2013.
 Goodfellow et al. (2016) Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 Han et al. (2015) Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both Weights and Connections for Efficient Neural Networks. In NIPS, 2015.
 Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
 Han et al. (2017) Song Han, Huizi Mao, Enhao Gong, Shijian Tang, William J. Dally, Jeff Pool, John Tran, Bryan Catanzaro, Sharan Narang, Erich Elsen, Peter Vajda, and Manohar Paluri. DSD: Dense-Sparse-Dense Training For Deep Neural Networks. In ICLR, 2017.
 Hao et al. (2017) Li Hao, Asim Kadav, Hanan Samet, Igor Durdanovic, and Hans Peter Graf. Pruning Filters For Efficient Convnets. In ICLR, 2017.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
 Hinton et al. (2014) Geoffrey E. Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning Workshop, 2014.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167, 2015.
 Kang et al. (2016) Guoliang Kang, Jun Li, and Dacheng Tao. Shakeout: A New Regularized Deep Neural Network Training Scheme. In AAAI, 2016.
 Krizhevsky (2009) Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, Toronto, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc., 2012.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
 Lee et al. (2015) Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree. 2015.
 Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent With Warm Restarts. In ICLR, 2017.
 Real et al. (2017) Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-Scale Evolution of Image Classifiers. arXiv preprint arXiv:1703.01041, 2017.
 Rodriguez et al. (2017) Pau Rodriguez, Jordi González, Guillem Cucurull, Josep M. Gonfaus, and Xavier Roca. Regularizing CNNs With Locally Constrained Decorrelations. In ICLR, 2017.

 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 Shafiee et al. (2016) Mohammad J. Shafiee, Akshaya Mishra, and Alexander Wong. EvoNet: Evolutionary Synthesis of Deep Neural Networks. arXiv preprint arXiv:1606.04393, 2016.
 Shrikumar et al. (2016) Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not Just a Black Box: Interpretable Deep Learning by Propagating Activation Differences. In ICML, 2016.
 Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
 Springenberg et al. (2015) Jost T. Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for Simplicity: The All Convolutional Net. In ICLR, 2015.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 2014.
 Urban et al. (2017) Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Abdelrahman Mohamed, Matthai Philipose, Matthew Richardson, and Rich Caruana. Do Deep Convolutional Nets Really Need To Be Deep And Convolutional? In ICLR, 2017.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In BMVC, 2016.
Appendix A Appendix
A.1 Datasets

MNIST (LeCun et al., 1998): 50000 training images of hand-drawn digits of spatial size 28 × 28, each belonging to one of 10 equally sampled classes.

CIFAR-10 & 100 (Krizhevsky, 2009): 50000 natural training images of spatial size 32 × 32, each containing one object belonging to one of 10/100 equally sampled classes.

ImageNet (Russakovsky et al., 2015): Approximately 1.2 million training images of objects belonging to one of 1000 classes. Classes are not equally sampled, with 732 to 1300 images per class. The dataset contains 50000 validation images, 50 per class. The scale of objects and the size of images vary.
A.2 Training hyperparameters
All training is closely inspired by the procedure specified in Zagoruyko & Komodakis (2016), with the main difference of avoiding heavy preprocessing. Independent of dataset, we preprocess all data using only the train-set mean and standard deviation. All training has been conducted using cross-entropy as the loss function and weight initialization following the normal distribution as proposed by He et al. (2015). All architectures are trained with batch normalization with a constant of , a batch size of , a weight decay of , a momentum of and Nesterov momentum.
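The preprocessing step described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes per-channel statistics and a small hypothetical data layout (a list of images, each a list of channels, each a flat list of pixel values), and the function names `channel_stats` and `normalize` are introduced here for illustration only.

```python
import math

def channel_stats(images):
    """Per-channel mean and standard deviation over a list of images.

    Assumed layout (hypothetical): images[i][c] is a flat list of pixel
    values for channel c of image i.
    """
    num_channels = len(images[0])
    means, stds = [], []
    for c in range(num_channels):
        values = [v for img in images for v in img[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds

def normalize(images, means, stds):
    """Subtract the train-set mean and divide by the train-set std."""
    return [
        [[(v - means[c]) / stds[c] for v in img[c]] for c in range(len(img))]
        for img in images
    ]

# Statistics are computed on the training split only and reused for test data.
train = [[[0.0, 2.0], [1.0, 3.0]], [[4.0, 6.0], [5.0, 7.0]]]
test = [[[3.0, 3.0], [4.0, 4.0]]]
means, stds = channel_stats(train)
train_norm = normalize(train, means, stds)
test_norm = normalize(test, means, stds)
```

The point of the sketch is that the test split never contributes to the statistics; whether the statistics are taken per channel or over all pixels is an assumption made here.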
Small datasets:
We use initial learning rates of and for the CIFAR and MNIST datasets respectively. We have rescaled MNIST images to 32 × 32 (CIFAR size) and repeat the image across color channels in order to use architectures without modifications. CIFAR-10 & 100 are trained for 200 epochs and the learning rate is reduced by a factor of 5 at every multiple of 60 epochs. MNIST is trained for 60 epochs and the learning rate is reduced by a factor of 5 once, after 30 epochs. We augment the CIFAR-10 & 100 training data with horizontal flips and small translations of up to 4 pixels. No data augmentation has been applied to the MNIST dataset.
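The step schedules described above can be written down as a short function. The sketch below is illustrative and makes two assumptions: epochs are counted from 0, so that the CIFAR drops take effect at epochs 60, 120 and 180, and `BASE_LR` is a placeholder of 1.0 rather than the initial learning rate actually used. The name `step_lr` is introduced here and is not part of any library.

```python
def step_lr(epoch, initial_lr, drop_every, factor=5.0, max_drops=None):
    """Learning rate at a given (0-indexed) epoch under a step schedule.

    The rate is divided by `factor` each time `epoch` passes a multiple of
    `drop_every`; `max_drops` optionally caps the number of reductions.
    """
    drops = epoch // drop_every
    if max_drops is not None:
        drops = min(drops, max_drops)
    return initial_lr / (factor ** drops)

# Placeholder only: this is not the initial learning rate from the text.
BASE_LR = 1.0

# CIFAR-style schedule: 200 epochs, reduced by 5 at every multiple of 60.
cifar_schedule = [step_lr(e, BASE_LR, drop_every=60) for e in range(200)]

# MNIST-style schedule: 60 epochs, reduced by 5 once after 30 epochs.
mnist_schedule = [
    step_lr(e, BASE_LR, drop_every=30, max_drops=1) for e in range(60)
]
```

Under these assumptions the CIFAR schedule passes through four learning rate levels and the MNIST schedule through two.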
ImageNet:
We use the single-crop technique, where we rescale the image such that the shorter side is equal to 224 and take a centered crop of spatial size 224 × 224. In contrast to Krizhevsky et al. (2012), we limit preprocessing to subtraction of the train-set mean and division by the train-set standard deviation and do not include local response normalization layers. We augment the training data with random horizontal flips. We set an initial learning rate of and follow the learning rate schedule proposed in Krizhevsky et al. (2012) that drops the learning rate by a factor of every 30 epochs and train for a total of 74 epochs.
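The geometry of the single-crop procedure can be made concrete with a little arithmetic. The following sketch only computes target sizes and crop coordinates; it assumes that the crop side equals the 224-pixel shorter side and that the longer side is rounded to the nearest integer. Both helper names are hypothetical, and the actual resampling would be done by an image library.

```python
def resize_shorter_side(width, height, target=224):
    """Scaled (width, height) such that the shorter side equals `target`."""
    scale = target / min(width, height)
    # Rounding the longer side is an assumption; the shorter side is exact.
    if width <= height:
        return target, max(target, round(height * scale))
    return max(target, round(width * scale)), target

def center_crop_box(width, height, crop=224):
    """(left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

# Example: a 500 x 375 landscape image.
w, h = resize_shorter_side(500, 375)
box = center_crop_box(w, h)
```

For the example image the rescaled size is 299 × 224 and the crop window spans columns 37 to 261 over the full height.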
The number of epochs for the expansion of architectures is larger due to the re-initialization. For these architectures, the stated number of epochs corresponds to training during stable conditions, i.e. no further expansion. The procedure is thus equivalent to training the converged architecture from scratch.
A.3 Architectures

(Goodfellow et al., 2013) Three-convolution-layer network with larger filters, followed by two fully-connected layers, but without “maxout”. The exact sequence of operations is:

Convolution 2: with padding → batch normalization → ReLU → max pooling with stride .

Convolution 3: with padding → batch normalization → ReLU → max pooling with stride .

Fully-connected 1: → batch normalization → ReLU.

Fully-connected 2: classes.
Represents the family of rather shallow “deep” networks.

(Simonyan & Zisserman, 2015) “VGG-A” (8 convolutions) and “VGG-E” (16 convolutions) networks. Both architectures include three fully-connected layers. We set the MLP to 512 features per layer instead of 4096 because, on small datasets, the last convolutional layer of these architectures already produces outputs of spatial size 1 × 1 (in contrast to 7 × 7 on ImageNet). Batch normalization is used before the activation functions. These are examples of stacking convolutions that do not alter spatial dimensionality to create deeper architectures.

(Zagoruyko & Komodakis, 2016) Wide Residual Network architecture: We use a depth of 28 convolutional layers (each block completely coupled, no bottlenecks) and a width factor of 10 as reference. When we expand these networks, this implies an inherent coupling of layer blocks due to dimensional consistency constraints with outputs from identity mappings.

(Krizhevsky et al., 2012) We use the all-convolutional variant, where we replace the first large fully-connected layer with a convolution of corresponding spatial filter size and filters and drop all further fully-connected layers. The rationale behind this decision is that previous experiments, namely our own pruning experiments and those of Hao et al. (2017); Han et al. (2015), indicate that the original fully-connected layers are largely obsolete.
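The replacement of a fully-connected layer by a convolution of corresponding spatial filter size, as used in the all-convolutional variant above, rests on a simple equivalence: a convolution whose kernel covers the entire input feature map has a single output position per filter and therefore computes the same weighted sum as a dense unit. The sketch below illustrates this on a toy input in plain Python. All names and sizes are hypothetical, biases are omitted, and, as in common CNN frameworks, "convolution" here means cross-correlation without kernel flipping.

```python
def fully_connected(feature_map, weights):
    """Dense layer applied to the flattened (channel, row, column) map."""
    flat = [v for channel in feature_map for row in channel for v in row]
    return [sum(w * v for w, v in zip(unit, flat)) for unit in weights]

def reshape_to_kernels(weights, channels, height, width):
    """View each dense weight vector as a (channels, height, width) kernel."""
    return [
        [[[unit[(c * height + i) * width + j] for j in range(width)]
          for i in range(height)]
         for c in range(channels)]
        for unit in weights
    ]

def full_size_convolution(feature_map, kernels):
    """Unpadded convolution with a kernel as large as the feature map.

    There is exactly one valid output position, so each filter yields
    a single number.
    """
    outputs = []
    for kernel in kernels:
        total = 0.0
        for c, channel in enumerate(feature_map):
            for i, row in enumerate(channel):
                for j, v in enumerate(row):
                    total += kernel[c][i][j] * v
        outputs.append(total)
    return outputs

# Toy input: 2 channels of spatial size 2 x 2, and 3 output units.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
dense_weights = [
    [1.0] * 8,
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
]
dense_out = fully_connected(feature_map, dense_weights)
conv_out = full_size_convolution(
    feature_map, reshape_to_kernels(dense_weights, 2, 2, 2)
)
```

Both paths produce identical outputs, so the swap changes how the layer is expressed rather than what it can compute at this input size.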
A.4 Automatically expanded architecture topologies
In addition to Figure 3, we show mean evolved topologies including standard deviation for all architectures and datasets reported in Tables 1 and 2. Figures 4 and 5 show all shallow and VGG-A architectures and their respective all-convolutional variants. Figure 6 shows the constructed wide residual 28-layer network architectures, where blocks of layers are coupled due to the identity mappings. Figure 7 shows the two expanded AlexNet architectures as trained on ImageNet.
As explained in the main section, all evolved architectures feature topologies with large dimensionality in early to intermediate layers rather than in the highest layers of the architecture, as is usual in conventional CNN design.
For architectures where pooling has been replaced with larger-stride convolutions, we also observe that the dimensionality of layers with sub-sampling changes independently of the preceding and following convolutional layers, suggesting that highly complex pooling operations are learned. This is an extension of the all-convolutional variant proposed by Springenberg et al. (2015), where the additional convolutional layers were constrained to match the dimensionality of the pooling operations they replaced.