Neural Architecture Construction using EnvelopeNets

03/18/2018 ∙ by Purushotham Kamath, et al. ∙ Cisco

In recent years, advances in the design of convolutional neural networks have resulted in significant improvements on the image classification and object detection problems. One of the advances is networks built by stacking complex cells, seen in such networks as InceptionNet and NasNet. These cells are either constructed by hand, generated by generative networks or discovered by search. Unlike conventional networks (where layers consist of a convolution block, sampling and non linear unit), the new cells feature more complex designs consisting of several filters and other operators connected in series and parallel. Recently, several cells have been proposed or generated that are supersets of previously proposed custom or generated cells. Influenced by this, we introduce a network construction method based on EnvelopeNets. An EnvelopeNet is a deep convolutional neural network of stacked EnvelopeCells. EnvelopeCells are supersets (or envelopes) of previously proposed handcrafted and generated cells. We propose a method to construct improved network architectures by restructuring EnvelopeNets. The algorithm restructures an EnvelopeNet by rearranging blocks in the network. It identifies blocks to be restructured using metrics derived from the featuremaps collected during a partial training run of the EnvelopeNet. The method requires less computation resources to generate an architecture than an optimized architecture search over the entire search space of blocks. The restructured networks have higher accuracy on the image classification problem on a representative dataset than both the generating EnvelopeNet and an equivalent arbitrary network.

1 Introduction

In recent years, several bespoke neural networks (InceptionNet (39), DenseNet (18), and ResNet (17)) have shown significant improvements on the image classification and object detection problems. More recently, search algorithms and recurrent networks have found network architectures that outperform these bespoke architectures.

The time needed to discover a network by these algorithms is fundamentally limited by the need to run a full training and evaluation cycle for every iteration of the construction algorithm. It is known that some networks converge faster than others depending on the structure, hyperparameters and other factors (4). Hence search methods have to wait for accuracy to converge before comparing networks. The search space for construction of these networks is exponential in the number of operators and each iteration requires waiting for this convergence. The search algorithms address the long search and evaluation times by various methods including cell search with stacked cell design (rather than a network search) (45), parameter prediction and sharing (33; 6), shorter training runs (43) and other methods.

This work proposes construction methods that do not depend on running a full training and evaluation cycle. They are based on intuition from previous work that indicates different stages of the network play different roles in the overall classification task. Zeiler et al. (42) show that shallower layers of a network extract fine features while deeper layers extract grosser features. Li et al. (23) show that pruning can be effective in reducing the size of a network without significant degradation in accuracy. Both of these indicate that, at different layers of a network, some filters are more important than others.

This work shows that statistics obtained from the featuremaps at the outputs of the filters of a network during training, i.e. featuremap statistics, can be used to compare the utility of filters within a network. These statistics reach a state where the utility of different filters within a network, and hence their relative importance to the classification (or other) task, can be evaluated. Experimentally, we find that the time needed for filters to reach this state is much less than the time needed for the accuracy of a network to converge, i.e. the time needed for an accurate comparison of the performance of two networks. Therefore the pruning and expansion can be done without waiting for a full training cycle to complete, which speeds up construction. EnvelopeNet construction exploits this property for the iterative construction of convolutional neural networks.

The pruning and expansion algorithm fits in well with intuition from previous work that indicates different layers of the network play different roles in the overall classification task (42; 31). The layers closer to the head of the network extract gross features (edges, boundaries, shapes) while deeper layers compose these into more abstract features (such as facial features). Further, (15) indicates that each stage of a network iteratively refines its estimates of the same features. In InceptionNet (8), the 3 parallel paths with different filters were shown to extract features at different levels. After training, it was found that for most layers, one of the paths dominated the others, indicating that one path was primarily activated at each layer.

The algorithm also mirrors theories on the ontogenesis of neurons in the brain. Brain development is believed to consist of neurogenesis (32), where the neural structure initially develops, gradually followed by apoptosis (10), where neural cells are eliminated, introduction of more neurons by hippocampal neurogenesis (12) and synaptic pruning (19), where synapses are eliminated. The Neural Architecture Construction (NAC) algorithm consists of analogous steps run in iterations: model initialization with a prior (neurogenesis), a truncated training cycle, pruning filters (apoptosis), adding new cells (hippocampal neurogenesis) and pruning of skip connections (synaptic pruning).

2 Related work

Recent work in automated neural network architecture design/search can be broadly classified into three categories.

  • Network design: Networks are designed through combinatorial search, evolutionary algorithms or recurrent networks. E.g. Neuroevolution (14), AmoebaNet (34), Bayesian Optimization (21), MetaQNN (3), Genetic CNN (41).

  • Cell or block design: Search, evolutionary, recurrent networks or other methods are used to find a cell (or block) of operators. Multiple cells are stacked in series to form a network. E.g. NasNet (45), BlockQNN (43) and Dutta et al. (9).

  • Optimization based methods: E.g. DARTS (26), NAO (27).

Neuroevolution methods (14) encompass a range of evolutionary algorithms/techniques that discover network architectures. Real et al. (35) proposed an evolutionary algorithm to pick a combination of architecture and hyperparameters. A regularized version of this technique called AmoebaNet (34) improves performance and starts the neuroevolution from a prior. Elsken et al. (11) use hill climbing to incrementally build neural networks. Traditionally, hyperparameter tuning has used a variety of blackbox techniques such as grid search, Bayesian optimization and random search (5). These techniques are effective for continuous valued parameters such as the number of layers and filter sizes, but are hard to apply to network architecture. Kandasamy et al. (21) propose a distance metric to apply Bayesian optimization to the architecture search. Neural Architecture Search (44) uses an RNN that generates the number of filters, filter size, and stride for a convolution network. Super neural networks use a top-down approach: they design a large network fabric, based on previously proposed networks, and recover architectures by selecting a path over an ensemble of architectures. Convolutional Neural Fabrics (38), PathNet (13) and Budgeted SuperNets (40) adopt this strategy.

The large search space of the neuroevolution methods led to the search for cell architectures. Cell based design was influenced by bespoke hand designed networks such as InceptionNet (39) and XceptionNet (8). These networks have repeated structures of cells and/or connections and/or repeated design motifs. NasNet (45) uses recurrent networks with reinforcement learning to generate and optimize cell designs composed from a fixed set of operators (blocks). Subsequent work (45) showed the transferability of these methods, addressing concerns that construction methods are susceptible to overfitting. The paper observed that the cells they generate are often envelopes over a broad class of human invented architectures, motivating the use of envelopes in our construction method. Progressive NAS (24) extends the work by searching for a good cell composed from a limited set of blocks using a sequential model-based optimization strategy.

The performance of both cell and network based designs has exceeded that of the state of the art bespoke networks. However, both the network and the cell construction methods need computation resources for the search phase as well as the evaluation phase. Several methods have been proposed to reduce the computation requirements. Efficient NAS (33) reduces the computation resources needed for NasNet through parameter sharing across iterations of generation and by reducing the search space. BlockQNN (43) uses an early stop mechanism to stop training early, with the reward being a function of the accuracy at early stop as well as the model complexity. Architecture search by network transformation (7) is another method that uses parameter inheritance along with network transformation. DeepArchitect (30) and Liu et al. (25) use efficient search methods by representing the search space in a structured manner. SMASH (6) avoids the full training of candidate models by generating their weights from Hypernets (16) and predicting accuracy.

Outside the field of architecture search, model reduction methods that analyze filter statistics have been used to reduce model size for inference on resource constrained platforms, without substantial loss of accuracy. Li et al. (23) showed that filter pruning of convolutional networks can reduce parameters, training and inference times without significant degradation in accuracy. Roy et al. (36) and Molchanov et al. (29) also use pruning to remove duplicates and to reduce model size. Mittal et al. (28) surveyed several metrics for pruning and showed that random pruning can be as effective as algorithmic pruning when performed on a hand designed network (such as ResNet (17)).

Most of the cell and network search methods are bottom up methods, agnostic to filter statistics, while model reduction has traditionally been top down, using filter statistics. EnvelopeNets integrates techniques from both domains (evolution and pruning) to enable construction using featuremap statistics.

3 Motivation and Hypothesis

The motivation of this work is to answer three questions:

  • Do there exist featuremap statistics, (any statistic extracted from the time series of the featuremaps of the filters in the network during training) that reach a state, featuremap stability, when the performance of the filters can be compared?

  • Can networks with improved accuracy be constructed automatically using featuremap statistics to control the construction process? Do the constructed networks perform significantly better than equivalent arbitrarily constructed or randomly generated networks?

  • Do the featuremap statistics reach stability significantly faster than the network accuracy converges, i.e. faster than the time needed to make a reliable comparison between the accuracy of two networks?

Our construction method is based on EnvelopeNets. An EnvelopeNet is a deep convolutional neural network of stacked EnvelopeCells. EnvelopeCells are supersets (or envelopes) of previously proposed handcrafted and/or generated cells constructed from basic operators or blocks. The network is structured in stages. A stage is a sequence of layers separated by widening cells, which are layers of cells that increase the channel width of the image. The construction method iteratively restructures the EnvelopeNet based on the utility of the filters within the network. Note that the definition of featuremap stability does not imply convergence of the featuremap statistic.

Our hypothesis, that EnvelopeNets can be restructured to yield higher performing networks, is based on intuition from related work around cell construction, network search and pruning filters and connections. Part of the intuition behind the algorithm comes from visualization techniques that indicate what the individual layers of a neural network do (42; 31). The studies indicate that after training a network, the layers closer to the head of the network extract gross features (edges, boundaries, shapes) while deeper layers compose these into more abstract features or objects (such as meshes, facial features). Their results also show that visual inspection of filter performance can be used for architecture selection and can have a significant impact on performance. The reasoning is that after a reasonable amount of training is complete, filters generally identify the scale of the features which they extract. The filters that extract less important features would be better suited to a layer where they can contribute more to the classification task. This hypothesis is supported by the results of Zeiler et al. (42), which show that, after training, some filters in a deep network end up with featuremaps with "dead" features while others have cleaner, distinctive features. Further, (15) indicates that each stage of a network iteratively refines its estimates of the same features. This has influenced the design of the EnvelopeNet, which is structured in stages, and the construction, which is done independently on each stage.

The algorithms described in this work are not optimal. However, we show empirically that they generate constructions whose performance exceeds that of the EnvelopeNet and of several arbitrarily constructed and randomly generated networks of the same network complexity (same depth, same blocks and approximately the same number of parameters).

4 Construction using EnvelopeNets

Figure 1: Neural Architecture Construction (NAC) using EnvelopeNets

Construction has three components: the design of the EnvelopeNet, the choice of the featuremap statistic and the construction algorithm.

4.1 EnvelopeNet

The EnvelopeCell is a set of convolution blocks connected in parallel. E.g. one of the EnvelopeCells used in this work has 6 convolution blocks connected in parallel: 1x1 convolution, 3x3 convolution, 3x3 separable convolution, 5x5 convolution, 5x5 separable convolution and 7x7 separable convolution. This is a subset of the blocks used in the cell discovered in (33). Each block consists of a convolution block, a ReLU unit and a batch normalization. The network also has three other types of cells: wideners (a maxpool unit and 3x3 convolution filter connected in parallel), which downsample the image dimensions by a factor of two and double the channel width; a stem (initial cell) for the network, which increases the input channel width for the EnvelopeNet; and a classification cell consisting of an average pooling block, a fully connected layer with dropout and softmax.

The EnvelopeNet consists of a number of EnvelopeCells stacked in series and organized into stages of layers, with a widener separating each stage from the next. The stem and classification blocks are placed at the head and tail of the network.
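As a rough illustration of this layout, the sketch below lists the cells of an EnvelopeNet in order and tracks the channel width, which doubles at each widener. The helper name `envelopenet_layout` and the string labels are hypothetical and are not taken from the original implementation; the sketch also assumes that wideners appear only between consecutive stages.

```python
def envelopenet_layout(stem_channels, layers_per_stage):
    """Return (cell_type, channels) pairs for an EnvelopeNet-style stack.

    Hypothetical helper: the names and labels are illustrative only.
    """
    layout = [("stem", stem_channels)]
    channels = stem_channels
    for stage_index, num_layers in enumerate(layers_per_stage):
        # Each stage is a run of EnvelopeCells at a fixed channel width.
        layout.extend(("envelope_cell", channels) for _ in range(num_layers))
        # A widener separates consecutive stages and doubles the channel width.
        if stage_index < len(layers_per_stage) - 1:
            channels *= 2
            layout.append(("widener", channels))
    layout.append(("classification", channels))
    return layout


layout = envelopenet_layout(128, [2, 2, 2])
for cell_type, channels in layout:
    print(cell_type, channels)
```

With three stages of two layers each, the listing contains one stem, six EnvelopeCells, two wideners and one classification cell.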

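We refer to a network using the notation $C / n_1\text{-}n_2\text{-}\ldots\text{-}n_S / F$, where $C$ is the number of input channels to the EnvelopeNet (output of the stem cell), $n_i$ is the number of layers in stage $i$ and $F$ is the number of filters in an EnvelopeCell. E.g. in this work one of the EnvelopeNets we use is a 128/10-1-1-1/6 EnvelopeNet, i.e. 128 input channels, four stages of 10, 1, 1 and 1 layers, and 6 filters per EnvelopeCell.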
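The notation can be unpacked mechanically. The following minimal sketch splits a specification string into its three parts; the function name `parse_envelopenet_spec` and the dictionary keys are assumptions made for illustration.

```python
def parse_envelopenet_spec(spec):
    """Parse a spec such as '128/10-1-1-1/6' into its three parts."""
    channels_text, stages_text, filters_text = spec.split("/")
    return {
        "input_channels": int(channels_text),
        "layers_per_stage": [int(n) for n in stages_text.split("-")],
        "filters_per_cell": int(filters_text),
    }


spec = parse_envelopenet_spec("128/10-1-1-1/6")
print(spec)
# Total number of filters in the initial EnvelopeNet: layers times filters per cell.
print(sum(spec["layers_per_stage"]) * spec["filters_per_cell"])
```

For the 128/10-1-1-1/6 example this gives 13 layers and 78 filters before any restructuring.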

4.2 Featuremap statistic

Figure 2: (a) is the EnvelopeNet $N_e$; (b) and (c) are networks $N_i$ and $N_j$ obtained by removing filters $f_i$ and $f_j$ respectively.

Figure 2(a) shows a single hidden layer of network $N_e$ with convolution blocks $f_1, \ldots, f_n$ connected in parallel, whose outputs are concatenated. Figure 2(b) and Figure 2(c) show networks $N_i$ and $N_j$ constructed from the original network $N_e$ by pruning convolutional blocks $f_i$ and $f_j$ respectively. In order to retain the dimensionality for doing analysis we use $f_0$, a convolution filter of all zeros with the same dimensions as $f_i$ or $f_j$, as a replacement for the empty cell.

We assume the network $N_e$ to be fully trained and its parameters to be inherited by networks $N_i$ and $N_j$. The output of the networks for an input image $x$ can be written as:

$$y_e = [F_1 x, \ldots, F_i x, \ldots, F_j x, \ldots, F_n x]$$
$$y_i = [F_1 x, \ldots, 0, \ldots, F_j x, \ldots, F_n x]$$
$$y_j = [F_1 x, \ldots, F_i x, \ldots, 0, \ldots, F_n x]$$

where $F_i$ and $F_j$ are Toeplitz matrices corresponding to the convolution operations $f_i$ and $f_j$ respectively. We define the relative mean square error of the constructed networks $N_i$ and $N_j$ as the MSE of the output of the layer relative to the output of the EnvelopeNet $N_e$. Without loss of generality, assume that network $N_j$ has higher relative MSE than $N_i$, i.e. $f_i$ is the correct filter to prune. The relative MSE between $N_e$ and $N_i$ is less than the error between $N_e$ and $N_j$:

$$\frac{1}{M}\sum_{m=1}^{M} \lVert y_e^{(m)} - y_i^{(m)} \rVert_2^2 < \frac{1}{M}\sum_{m=1}^{M} \lVert y_e^{(m)} - y_j^{(m)} \rVert_2^2 \;\Longrightarrow\; \sum_{m=1}^{M} \lVert F_i x^{(m)} \rVert_2^2 < \sum_{m=1}^{M} \lVert F_j x^{(m)} \rVert_2^2$$

Here, $M$ is the number of images over which the relative MSE is calculated. $\sum_{m=1}^{M} \lVert F_i x^{(m)} \rVert_2^2$, the sum of the squared $\ell_2$-norm over a given training or validation set, is a featuremap statistic.

This implies that if $N_i$ has lower relative MSE than $N_j$, then the featuremap statistic must be lower for $f_i$, under the assumptions made. We can use the squared $\ell_2$ norm as the featuremap statistic to identify the filter in $N_e$ to prune, i.e. to choose between networks $N_i$ and $N_j$. It can be shown that the same featuremap statistic can be used to choose between any number of filters in parallel in the same layer of a network. While the analysis is subject to a number of strong assumptions (linear model, single layer) that do not hold in practice, it provides a starting point for the exploration of featuremap statistics. We conjecture that this metric can be used for more complex networks, and use it in our implementation along with other metrics. While a real implementation must maximize accuracy rather than minimize the relative MSE, in practice we find accuracy improves when using this metric as the basis for construction. Other differences between this analysis and the implemented algorithm include the calculation of the featuremap statistic during training (not after training, as in the analysis), and retraining the generated networks rather than using parameter sharing.
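The argument can be checked numerically under the same linear, single-layer assumptions. The sketch below is illustrative only: it uses small random matrices in place of the Toeplitz matrices, and all function names are hypothetical. It verifies that the relative MSE caused by zeroing one filter equals that filter's squared-norm statistic divided by the number of images, so ranking by the statistic and ranking by the relative MSE pick the same filter.

```python
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def layer_output(filters, image, pruned=None):
    """Concatenate parallel linear filter outputs; a pruned filter outputs zeros."""
    output = []
    for index, matrix in enumerate(filters):
        featuremap = matvec(matrix, image)
        if index == pruned:
            featuremap = [0.0] * len(featuremap)
        output.extend(featuremap)
    return output


def relative_mse(filters, images, pruned):
    """Mean squared error of the pruned layer output relative to the full layer."""
    total = 0.0
    for image in images:
        full = layer_output(filters, image)
        reduced = layer_output(filters, image, pruned)
        total += sum((a - b) ** 2 for a, b in zip(full, reduced))
    return total / len(images)


def featuremap_statistic(matrix, images):
    """Sum over images of the squared l2 norm of one filter's featuremap."""
    return sum(sum(v * v for v in matvec(matrix, image)) for image in images)


rng = random.Random(0)
num_filters, in_dim, out_dim, num_images = 4, 5, 3, 20
# Filters get different scales so that their featuremap norms differ.
filters = [
    [[rng.gauss(0, 0.5 * (k + 1)) for _ in range(in_dim)] for _ in range(out_dim)]
    for k in range(num_filters)
]
images = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(num_images)]

stats = [featuremap_statistic(matrix, images) for matrix in filters]
errors = [relative_mse(filters, images, k) for k in range(num_filters)]
to_prune = min(range(num_filters), key=lambda k: stats[k])
print("filter to prune:", to_prune)
```

The equality holds here only because the filter outputs are concatenated and the model is linear; it is not a claim about deep nonlinear networks.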

4.3 Construction algorithm

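// Neurogenesis: Set network prior
network ← EnvelopeNet; iteration ← 0
while iteration < restructure_iterations do
       // Learning: Truncated training
       featuremap_stats ← train(network, training_steps)
       for stage in network do
              sorted_filters ← sort(featuremap_stats, stage)
              // Apoptosis: Prune sorted filters per stage, subject to constraints: Do not prune a cell if it is the last cell in a layer and limit number of pruned filters per stage
              prune_filters(network, stage, sorted_filters)
              // Synaptic pruning: Prune skip connections
              prune_skip_connections(network, stage)
              // Hippocampal neurogenesis: Add envelopecell to the tail of stage
              add_cell(network, stage, envelopecell)
       end for
       iteration ← iteration + 1
end while
return network
ALGORITHM 1 Neural Architecture Construction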

The construction algorithm is shown in Algorithm 1. It starts by training an EnvelopeNet for a fixed number of steps (neurogenesis). During the training, statistics from the featuremaps at the outputs of the filters are collected. The metric (squared $\ell_2$-norm of the elements of the featuremap of each filter) is calculated over the training set. After each truncated training run, the filters within each stage are sorted in order of the metric and a fixed number of filters with the lowest value of the metric are removed (apoptosis), subject to the constraint that every layer must have at least one filter. Other constraints may be applied, e.g. the number of filters pruned may be adjusted based on the number of filters in the layer, or the construction may be enabled on a subset of the stages. An EnvelopeCell is then added to the tail end (deepest end) of the stage (hippocampal neurogenesis). The reason for adding an EnvelopeCell to the end of the stage is that we do not know, a priori, which filters will improve the performance of the network, so we add an EnvelopeCell with the understanding that subsequent iterations will remove the unnecessary filters. We applied DenseNet (18) style skip connections by doing depthwise concatenation on all inputs followed by a 1x1 convolution filter to control the number of output channels. We assign a scalar weight to all incoming skip connections for every layer. These weights are trained along with the whole network, and during the pruning phase we halve the number of skip connections by eliminating the connections with lower weight (synaptic pruning). The construction algorithm is run for a fixed number of iterations. As a result, the network narrows and deepens while the overall parameter count stays approximately the same.

The parameters (number of layers per stage, envelope cell, number of stages) and constraints (such as the limit on filters pruned per stage) used for construction are the hyperparameters of the construction algorithm.
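The prune-and-grow loop can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: `truncated_training_stats` is a placeholder that returns random numbers instead of running truncated training, skip connections and their pruning are omitted, and all names are hypothetical rather than taken from the authors' code.

```python
import itertools
import random

ENVELOPE_CELL = ("conv1x1", "conv3x3", "sep3x3", "conv5x5", "sep5x5", "sep7x7")
_filter_ids = itertools.count()


def new_cell():
    """One layer: the six parallel blocks of an EnvelopeCell, each with a unique id."""
    return [(next(_filter_ids), block) for block in ENVELOPE_CELL]


def truncated_training_stats(network, rng):
    """Placeholder for truncated training: a fake statistic per filter id."""
    return {fid: rng.random() for stage in network for layer in stage for fid, _ in layer}


def prune_stage(stage, stats, max_pruned):
    """Remove up to max_pruned lowest-statistic filters, keeping one filter per layer."""
    candidates = sorted((stats[fid], fid) for layer in stage for fid, _ in layer)
    pruned = 0
    for _, fid in candidates:
        if pruned == max_pruned:
            break
        for layer in stage:
            # Skip the filter if it is the last one left in its layer.
            if any(f == fid for f, _ in layer) and len(layer) > 1:
                layer[:] = [(f, b) for f, b in layer if f != fid]
                pruned += 1
    return pruned


def construct(layers_per_stage, iterations, max_pruned, seed=0):
    rng = random.Random(seed)
    # Prior: every layer starts as a full EnvelopeCell.
    network = [[new_cell() for _ in range(n)] for n in layers_per_stage]
    for _ in range(iterations):
        stats = truncated_training_stats(network, rng)
        for stage in network:
            prune_stage(stage, stats, max_pruned)
            stage.append(new_cell())  # grow: add an EnvelopeCell at the stage tail
    return network


network = construct([2, 2, 2], iterations=5, max_pruned=6)
print([[len(layer) for layer in stage] for stage in network])
```

Because each iteration removes six filters and adds a six-filter cell, the number of filters per stage stays constant in this sketch while the stage deepens, which mirrors the narrowing and deepening described above.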

The algorithm uses the squared $\ell_2$-norm as the metric. Another metric that was considered was the running feature map variance (filters which have consistently low variance in the distribution of their output featuremap over the training intuitively contribute less to the classifier's output). The feature map variance performed close to, but lower than, the squared $\ell_2$ norm.
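A running variance can be maintained without storing past featuremaps. The sketch below uses Welford's online algorithm, which is a standard technique and an assumption here, since the text does not say how the running variance was computed; it also assumes each training step contributes one scalar summary of a filter's featuremap.

```python
class RunningVariance:
    """Welford's online algorithm for the mean and variance of a stream of values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.count if self.count else 0.0


tracker = RunningVariance()
for activation in [1.0, 2.0, 3.0, 4.0]:
    tracker.update(activation)
print(tracker.mean, tracker.variance)  # prints: 2.5 1.25
```

One tracker per filter would be enough to rank filters by variance at the end of the truncated training run.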

The construction time is the sum of the time required for the algorithm to run, the time for training to extract the featuremap statistics in each iteration and the time for the training and evaluation of the final network, i.e. $t_a + n \rho\, t_t + (t_t + t_e)$, where $t_a$ is the algorithm run time, $t_t$ is the time of a full training cycle, $t_e$ is the evaluation time, $\rho$ is the ratio of the time needed to extract featuremap statistics to a full training cycle and $n$ is the number of iterations. For our experiments $\rho$ was 10 epochs/100 epochs = 0.1 and $n$ was 5. The algorithm run time and evaluation time are negligible, so the total time is approximately $n \rho\, t_t + t_t$, i.e. the search itself is $O(n \rho\, t_t)$. This compares favorably with evolutionary methods, where the total time is $c\, t_t$, where $c$ is the number of combinations explored, i.e. $O(c\, t_t)$. It also compares well with generative methods, where the total time is $g\, t_t$, i.e. $O(g\, t_t)$, where $g$ is the number of iterations needed to train the generating network until it generates the final network. $\rho$ represents the reduction in construction time obtained by using the featuremap statistics to reduce training time.
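A short worked calculation makes the comparison concrete. The sketch below expresses times in units of one full training cycle and uses the values stated above (ratio 0.1, 5 iterations); the function names and the figure of 20 fully trained candidates are hypothetical and chosen only for illustration.

```python
def construction_time(full_training_time, stat_ratio, iterations):
    """Return (search time, search plus final training), ignoring negligible overheads."""
    search = iterations * stat_ratio * full_training_time
    return search, search + full_training_time


def exhaustive_time(full_training_time, candidates):
    """Time to fully train every candidate, as in a search that compares accuracies."""
    return candidates * full_training_time


search, total = construction_time(full_training_time=1.0, stat_ratio=0.1, iterations=5)
baseline = exhaustive_time(full_training_time=1.0, candidates=20)  # hypothetical count
print(search, total, baseline)
```

Under these assumptions the search costs about half a training cycle, compared with one cycle per candidate for a method that trains every candidate to convergence.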

5 Results

The algorithm was evaluated on the image classification problem using the CIFAR-10 (22) and ImageNet (37) datasets. Both construction and the evaluation of the generated networks used a common set of hyperparameters. No hyperparameter tuning was done on the EnvelopeNets or the generated networks. The training used preprocessing techniques such as random cropping and varying brightness and contrast. Training used an SGD optimizer with an exponentially decaying learning rate; separate batch sizes were used for the CIFAR-10 and ImageNet experiments. The number of restructuring iterations was 5. The number of training steps for the construction algorithm was 10 epochs. The number of filters to be pruned in an iteration was 6, subject to a cap on the fraction of the filters in a stage that may be pruned. Training and evaluation of the base and generated networks ran for at least 100 epochs. There was no parameter sharing across iterations of the construction algorithm. The experiments were run using AMLA (20) and the results and hyperparameters are available along with the source code (2).

Figure 3: Running featuremap statistic of individual filters at different layers in a network vs. training iterations for a 128/10-1-1-1/6 EnvelopeNet. The graphs show the squared norm, normalized by the featuremap size collected over the 10 epochs of training.

| Network | Dataset | Params | Test error (%) | Search time (GPU days) |
|---|---|---|---|---|
| NAS-v3 | CIFAR-10 | 37.4M | 3.65 | 1800 |
| Block-QNN | CIFAR-10 | 39.8M | 3.54 | 96 |
| AmoebaNet-B | CIFAR-10 | 34.9M | 2.13 | 3150 |
| PNAS | CIFAR-10 | 3.2M | 3.41 | 225 |
| ENAS | CIFAR-10 | 4.6M | 3.54 | 0.45 |
| DARTS | CIFAR-10 | 4.6M | 2.76 | 4 |
| NAO | CIFAR-10 | 128M | 2.07 | 200 |
| NAC (128/7-6-2/6) | CIFAR-10 | 10M | 3.33 | 0.25 |
| NASNet-A | ImageNet | 4.9M | 8.4 | 1800 |
| AmoebaNet-A | ImageNet | 6.4M | 7.6 | 3150 |
| DARTS | ImageNet | 4.7M | 8.7 | 4 |
| NAC (64/7-6-2/6) | ImageNet | 9.9M | 11.77 | 0.25 |

Table 1: Accuracy, search time and number of parameters for NAC construction using EnvelopeNets vs. other methods. State of the art numbers are indicated in bold. The NAC experiments were run using a single Nvidia V100 GPU. The search time reported here is the time required for the algorithm to output the final architecture; it does not include the time for the training of the final network.

Figure 3 shows the running squared norm for filters in different layers of a network vs. training steps. After 10 epochs the squared norms of the filters show reasonable separation from one another, although they have not yet converged. The graphs show that the squared norms of the feature maps reach this state within 10 epochs for CIFAR-10, substantially fewer than the number of epochs required to train the network to convergence on the same dataset (usually 100 epochs). Typically in our experiments the ratio of the featuremap stability time to convergence time was 0.1, making the construction time 10% of the time that would be needed were the candidate networks fully trained to convergence, although the benefit would be smaller if parameter sharing or early stopping were used.

Figure 4 and Table 1 show performance for two sample networks.

Table 1 shows the performance of a network generated from a 128/2-2-2/6 EnvelopeNet with 23.8M parameters, run for 5 iterations to generate a 128/7-6-2/6 Constructed Network for CIFAR-10. We use the same network to train on ImageNet and compare it with other methods, with state of the art performance from other algorithms indicated.

EnvelopeNet A, the Constructed Network A and 10 equivalent random networks (10) were evaluated on the image classification task using CIFAR-10. The random networks were generated by fixing the depth of the stages equal to the stages in Constructed Network B and adding the same number of blocks, chosen randomly at each stage, subject to a minimum of one block per layer. Each network was trained on the CIFAR-10 dataset for 100 epochs with performance on the test set evaluated every 5 epochs. In addition, a worst case network was also constructed using the same construction algorithm, except that it pruned the best performing filters. Figure 4 shows the constructed network clearly outperforming the original EnvelopeNet, the worst case network and the sample average (with standard deviations) of 10 randomly generated networks (it outperformed 9 out of 10 randomly generated networks). The performance of the constructed network A is approximately one standard deviation higher than the average random network performance. Roughly half of the randomly generated networks had lower performance than the EnvelopeNet, indicating that structure is critical for incremental construction: a network with fewer filters and parameters (the EnvelopeNet) can do better than a network with more filters and parameters. This indicates that the structure of the generated network is responsible for some of the gain, and that the gains do not come entirely from deepening the network or increasing the parameters.
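The random baseline described above can be sketched as follows. This is one plausible reading, not the authors' generator: every layer first receives one random block, and the remaining blocks of the stage are assigned to randomly chosen layers. The function names, the block labels and the per-stage block counts in the example are hypothetical.

```python
import random

BLOCKS = ("conv1x1", "conv3x3", "sep3x3", "conv5x5", "sep5x5", "sep7x7")


def random_stage(num_layers, num_blocks, rng):
    """Place num_blocks random blocks into num_layers layers, at least one per layer."""
    if num_blocks < num_layers:
        raise ValueError("need at least one block per layer")
    layers = [[rng.choice(BLOCKS)] for _ in range(num_layers)]
    for _ in range(num_blocks - num_layers):
        rng.choice(layers).append(rng.choice(BLOCKS))
    return layers


def random_network(stage_depths, blocks_per_stage, seed):
    """Build one random network with the given stage depths and block counts."""
    rng = random.Random(seed)
    return [random_stage(d, b, rng) for d, b in zip(stage_depths, blocks_per_stage)]


# Stage depths follow the 7-6-2 constructed network; the block counts are made up.
network = random_network(stage_depths=[7, 6, 2], blocks_per_stage=[12, 12, 12], seed=0)
print([[len(layer) for layer in stage] for stage in network])
```

Generating several such networks with different seeds gives a baseline population with the same depth and block budget as the constructed network.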

Generated Network B from EnvelopeNet B was evaluated on the image classification task using CIFAR-10; it was run for 240 epochs and reached a test error rate of 5.57%. Generated Network A (generated using CIFAR-10) was also run on ImageNet. The stem cell for ImageNet was modified by adding 3 layers of convolution filters (3x3, 5x5, 7x7) in parallel to downsample the 299x299 input image before passing it to the network. Table 1 shows the number of filters for each layer, the parameters and flops and accuracy on all datasets for the EnvelopeNets, the generated networks as well as for 2 randomly generated networks that are equivalent to Constructed Network A. The construction time in Table 1 is the sum of the time required for the algorithm to run, the time for training to extract the featuremap statistics in each iteration and the time for the training and evaluation of the final network as described in the previous section.

Figure 4: Accuracy vs. training iterations for the EnvelopeNet, the Constructed Net (NAC) and random networks on the CIFAR-10 data set (100 epochs (100K steps))
Figure 5: Average width of a layer with standard deviations at different depths for repeated constructions of a 15-6-4-4 network from the same 10-1-1-1 EnvelopeNet

Figure 5 shows the average width of the generated network with standard deviation at different depths, for multiple runs of the construction algorithm, constructing different 128/16-5-4-4/6 networks from the same 128/10-1-1-1/6 EnvelopeNet. Each construction run generates a different sample of featuremap statistics because of the random initialization of the filter weights and preprocessing of the images. Despite this, the structure of the constructed networks is similar. The graph shows that the standard deviation of the width distribution at each layer is on the order of a single filter. This indicates that the algorithm tends to prune filters consistently at particular layers within particular stages, which is a strong indication that the algorithm identifies structural improvements.

6 Discussion

Figure 4 indicates that there is a consistent positive difference between the accuracy of the generated network and that of the other networks. The root cause for this may lie in the structure of the network. Chollet (8) indicates that different filter types can extract different characteristics. By providing 3 parallel paths in InceptionNet, with different filters, each inception cell was shown to extract features at different levels. After training, it was found that for most layers, one of the paths dominated the others, indicating that one path was primarily activated at each layer. However, at the outset it is unclear which paths need activation and which do not, leading a human designer to make a choice, overprovision the network with all possible paths and prune, or do an architecture search with no prior. This led us to the approach of EnvelopeNets and pruning.

The EnvelopeNet provides a strong prior to the network construction procedure by setting up the initial layout of the architecture. This helps the algorithm restructure the network rather than use resources for the discovery of a base network structure. The base network structure uses known design practices, e.g. setting the initial number of layers per stage to reasonable values to limit the parameters, and reducing the featuremap's dimension and increasing the number of channels periodically after a certain number of layers. The prior is provided as a set of hyperparameters to the construction algorithm.

Previous studies have indicated that neural networks exhibit a form of plasticity, allowing random pruning of filters from a network with little degradation of accuracy, provided the fraction of filters pruned is reasonably small (23). However, a counterintuitive result shown recently indicates that random pruning is as effective as algorithmic pruning when it comes to the accuracy of the final trained model (28). The key difference between our work and these studies is that they are based on pruning a hand crafted network that has already gone through an optimized design, unlike an EnvelopeNet, which is an over provisioned network. As we prune from a large EnvelopeNet to an intermediate network (e.g. a handcrafted network) to a fully pruned network, the benefit of pruning may decrease, possibly hitting a knee around the intermediate network. This would fit in well with our observations as well as results from pruning studies.

Another construction method could be to reduce a large (deeper/wider) supernet like EnvelopeNet, rather than to apply reduction and addition starting from a small EnvelopeNet. However, training a large network increases construction time. Also, the reduction/addition method spreads the restructuring over multiple iterations rather than a single iteration, introducing a form of regularization over the architecture search that prevents the structure from overfitting on the artifacts of a single iteration. In this regard, the pruning part of the restructuring algorithm bears some resemblance to the dropout regularization method. While the motivation behind each method is different, both remove filters: dropout does so probabilistically during training, whereas construction removes them physically and permanently. The EnvelopeNet construction method lies between bottom-up incremental methods of construction and top-down reduction methods, and provides a reasonable compromise that allows generation of a network without overfitting the structure.

The primary limitation of the restructuring method is that the gains from structure appear to diminish as the number of network parameters increases. In our results we see this when the number of parameters is extremely large, e.g. if we use a classification block with several fully connected blocks or as the number of wideners (number of output channels) increases. Note that in this regime, networks take much larger computation resources to train, and any possible gain in the accuracy of the network comes from a larger number of parameters rather than intelligent structure design.

7 Conclusions

It appears that neural networks exhibit properties that may allow simple heuristic-based construction techniques using internal statistics derived during training. We have exploited these properties in a construction method that can design networks with lower construction time than a search method that evaluates the accuracy of networks. The generated networks show close to state-of-the-art performance and can outperform most equivalent randomly generated networks and handcrafted networks with an equivalent number of parameters and blocks. Future directions for work include restructuring algorithms that perform closer to optimal, a deeper understanding of the relationship between structure and performance, and the application of this method to other networks.

References

Appendix A Hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch size | 50 |
| Optimizer | Momentum, 0.9 |
| Learning rate | 0.04 |
| Learning rate schedule | Exponential decay per 2 epochs |
| Learning rate decay factor | 0.999 |
| Dropout | Keep probability of 0.8 |

Table 2: Hyperparameters for the candidate networks during construction

| Hyperparameter | Value |
|---|---|
| Batch size | 64 |
| Optimizer | Momentum, 0.9 |
| Learning rate | max: 0.05, min: 0.001 |
| Learning rate schedule | Cosine decay per 20 epochs |
| Regularization | norm based, gradient clipping norm based, 5.0 |
| Dropout | Keep probability of 0.8 |
| Data augmentation | random crop, flip, cutout |
| Stem cell | 3x3 convolution, 128 output channels |

Table 3: Hyperparameters for CIFAR-10 final network

| Hyperparameter | Value |
|---|---|
| Batch size | 128 |
| Optimizer | Momentum, 0.9 |
| Learning rate | 0.1 |
| Learning rate schedule | Exponential decay per 25 epochs |
| Learning rate decay factor | 0.97 |
| Regularization | norm based, gradient clipping norm based, 5.0 |
| Dropout | Keep probability of 0.7 |
| Data augmentation | random flip |
| Stem cell | 3x3 convolution, 32 output channels, stride 2; 3x3 convolution, 64 output channels, stride 2; 3x3 convolution, 64 output channels, stride 2 |

Table 4: Hyperparameters for the ImageNet final network

For training of candidate networks we use the hyperparameters in Table 2, and for the CIFAR-10 and ImageNet final networks we use the hyperparameters in Table 3 and Table 4 respectively. Apart from these, the NAC algorithm requires certain hyperparameters of its own, described below. Some of them are common to most neural architecture search methods and the rest are specific to our method. The hyperparameters described below are specific to the task of image classification. All hyperparameters are available in configuration files in the code (2).
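The two learning rate schedules named in Tables 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names are hypothetical, the exponential decay is assumed to be applied as a staircase (once every 2 epochs), and the cosine schedule is assumed to restart every 20 epochs.

```python
import math


def exponential_decay_lr(epoch, base_lr=0.04, decay_factor=0.999, epochs_per_decay=2):
    """Staircase exponential decay (assumed): the rate is multiplied by
    decay_factor once every epochs_per_decay epochs. Defaults follow Table 2."""
    return base_lr * decay_factor ** (epoch // epochs_per_decay)


def cosine_decay_lr(epoch, max_lr=0.05, min_lr=0.001, period=20):
    """Cosine decay from max_lr towards min_lr over `period` epochs.
    Restarting at the start of each period is an assumption. Defaults follow Table 3."""
    progress = (epoch % period) / period
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these assumptions, the cosine schedule starts each 20-epoch period at the maximum rate and is halfway between the maximum and minimum at epoch 10.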

a.1 Envelope cell

The EnvelopeCell is an overprovisioned cell that uses a set of convolution filters with the following kernels: 1x1, 3x3, 3x3sep, 5x5, 5x5sep and 7x7sep. This set is a subset of the kernel search space commonly used in architecture search methods.
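To give a feel for how overprovisioned such a cell is, the sketch below counts the weights of each block in one cell. It assumes that the "sep" suffix denotes a depthwise separable convolution, that every block sees the same number of input channels and produces the same number of output channels, and that biases are ignored; the names and the channel count of 64 are illustrative only.

```python
def conv_params(kernel, c_in, c_out, separable=False):
    """Weight count (biases ignored) of a kernel x kernel convolution."""
    if separable:
        # Depthwise kernel per input channel, followed by a 1x1 pointwise convolution.
        return kernel * kernel * c_in + c_in * c_out
    return kernel * kernel * c_in * c_out


# (name, kernel size, separable?) for the six kernels listed in the text.
ENVELOPE_KERNELS = [
    ("1x1", 1, False),
    ("3x3", 3, False),
    ("3x3sep", 3, True),
    ("5x5", 5, False),
    ("5x5sep", 5, True),
    ("7x7sep", 7, True),
]


def envelope_cell_params(c_in, c_out):
    """Per-block weight counts for one cell containing all six kernels."""
    return {name: conv_params(k, c_in, c_out, sep) for name, k, sep in ENVELOPE_KERNELS}


params = envelope_cell_params(64, 64)
total = sum(params.values())
```

Under these assumptions the dense 5x5 block dominates the cell's weight count, while the three separable blocks together contribute only a small fraction of it.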

a.2 Reduction Cell (Widener)

The primary purpose of the reduction cell (also called the widener) is to downscale the image dimensions and increase the number of channels in the featuremap. We use a simple factorized reduction cell that concatenates the outputs of one 3x3 convolution filter with stride 2 and one max pooling filter with stride 2, connected in parallel. Although many architecture search methods search for the reduction cell, we find that the overall performance is not sensitive to tuning of this reduction cell.
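The effect of this cell on the featuremap shape can be worked out with a few lines of arithmetic. The sketch below assumes "same" padding on both branches, concatenation along the channel axis, and a max pooling branch that preserves the input channel count; the function and parameter names are hypothetical.

```python
import math


def reduction_cell_output_shape(height, width, channels, conv_channels):
    """Shape after concatenating a stride-2 3x3 convolution (conv_channels outputs)
    with a stride-2 max pooling branch (keeps `channels` channels)."""
    out_height = math.ceil(height / 2)  # stride 2 with "same" padding (assumed)
    out_width = math.ceil(width / 2)
    return out_height, out_width, conv_channels + channels


shape = reduction_cell_output_shape(32, 32, 128, 128)
```

For a 32x32 featuremap with 128 channels and a 128-channel convolution branch, the spatial dimensions halve and the channel count doubles.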

a.3 Number of stages

This hyperparameter controls the number of stages in the network. The appropriate number of stages depends on the dataset used for network design. Consecutive stages are separated by a reduction cell, so the dimensions of the input images limit the number of stages (most architecture search methods fix the number of stages by fixing the number of reduction cells in their architecture). The networks constructed in this work use 3 stages.
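The limit imposed by the input dimensions can be illustrated by tracking the featuremap side length per stage. This sketch assumes one stride-2 reduction between consecutive stages (so a network with n stages has n - 1 reductions), "same" padding, and that the stem cell does not downscale; `stage_resolutions` and `min_size` are hypothetical names.

```python
def stage_resolutions(input_size, num_stages, min_size=1):
    """Side length of the featuremap in each stage, halving between stages."""
    sizes = [input_size]
    for _ in range(num_stages - 1):
        next_size = (sizes[-1] + 1) // 2  # ceiling division: stride 2, "same" padding
        if next_size < min_size or next_size == sizes[-1]:
            raise ValueError(
                f"{num_stages} stages are too many for a {input_size}x{input_size} input"
            )
        sizes.append(next_size)
    return sizes


cifar_sizes = stage_resolutions(32, 3)
```

For a 32x32 input, three stages operate at side lengths 32, 16 and 8; under these assumptions a seventh stage would be impossible because the featuremap can no longer be halved.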

a.4 Maximum layers per stage

This hyperparameter enables construction of a network with a different number of layers in each stage. It is an array of integers that indicates the maximum number of layers in each stage. The length of the array is equal to the number of stages. Our experiments set this hyperparameter to [7,6,2].

a.5 Stage Construction

The stage construction hyperparameter allows construction to be restricted to a subset of the stages. It is a boolean array whose length equals the number of stages; each value indicates whether construction is applied to the corresponding stage. In our experiments we set this hyperparameter to [true, true, false] for a 3 stage network.

a.6 Maximum filters to prune

This hyperparameter limits the number of blocks pruned from each stage. It has a substantial effect on the sparsity of the final model and hence on the total number of parameters in the final model. This hyperparameter is set to 6 in our experiments.
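The construction hyperparameters of sections a.3 to a.6 could be grouped and checked for consistency as sketched below. The class and field names are hypothetical and are not taken from the configuration files in the code (2); only the default values come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConstructionConfig:
    """Hypothetical container for the construction hyperparameters described above."""
    num_stages: int = 3
    max_layers_per_stage: List[int] = field(default_factory=lambda: [7, 6, 2])
    construct_stage: List[bool] = field(default_factory=lambda: [True, True, False])
    max_filters_to_prune: int = 6

    def validate(self):
        # Both per-stage arrays must have exactly one entry per stage.
        if len(self.max_layers_per_stage) != self.num_stages:
            raise ValueError("max_layers_per_stage needs one entry per stage")
        if len(self.construct_stage) != self.num_stages:
            raise ValueError("construct_stage needs one entry per stage")
        if self.max_filters_to_prune < 0:
            raise ValueError("max_filters_to_prune must be non-negative")

    def stages_under_construction(self):
        """Zero-based indices of the stages that construction is applied to."""
        return [i for i, enabled in enumerate(self.construct_stage) if enabled]


config = ConstructionConfig()
config.validate()
```

With the values used in the experiments, construction is applied to the first two stages only, and the third stage keeps its initial layout.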

a.7 Stem Cell

The stem cell is the initial network that preprocesses the image before it is passed to the constructed network. It is used to increase the number of channels and downscale the image dimensions to the required size. We use a single 3x3 convolution layer with 128 output channels as the stem cell during architecture search.

a.8 Classification Cell

The classification cell is the final part of the network architecture. It requires one fully connected layer at the end, with the number of output units equal to the number of labels in the image classification task. We use four layers in all our experiments: the first layer is a global average pooling layer, followed by a flattening layer that converts a batch of 3D featuremaps to a batch of vectors. The third layer adds dropout and the fourth layer is the final fully connected layer.
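The layer sequence of the classification cell can be sketched in plain Python on nested lists. This is an illustration of the data flow only, not the authors' implementation: the function names are hypothetical, featuremaps are assumed to be laid out as [batch][height][width][channel], and dropout is assumed to be the inverted variant that is active only during training.

```python
import random


def global_average_pool(batch):
    """Average each channel over height and width: [N][H][W][C] -> [N][1][1][C]."""
    pooled = []
    for image in batch:
        pixels = [pixel for row in image for pixel in row]
        means = [sum(p[c] for p in pixels) / len(pixels) for c in range(len(pixels[0]))]
        pooled.append([[means]])
    return pooled


def flatten(batch):
    """Convert a batch of 3D featuremaps to a batch of vectors."""
    return [[v for row in image for pixel in row for v in pixel] for image in batch]


def dropout(vectors, keep_prob, training, rng):
    """Inverted dropout (assumed): zero units with probability 1 - keep_prob
    and rescale the survivors; identity at inference time."""
    if not training:
        return vectors
    return [[v / keep_prob if rng.random() < keep_prob else 0.0 for v in vec]
            for vec in vectors]


def fully_connected(vectors, weights, biases):
    """weights is [num_inputs][num_labels]; returns one logit per label."""
    return [[sum(x * w[j] for x, w in zip(vec, weights)) + biases[j]
             for j in range(len(biases))] for vec in vectors]


def classification_cell(batch, weights, biases, keep_prob=0.8, training=False, rng=None):
    rng = rng or random.Random(0)
    x = global_average_pool(batch)
    x = flatten(x)
    x = dropout(x, keep_prob, training, rng)
    return fully_connected(x, weights, biases)


# One 2x2 image with 2 channels, classified into 3 labels.
batch = [[[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
biases = [0.0, 0.0, 0.5]
logits = classification_cell(batch, weights, biases)
```

In this toy example the two channel means are 4.0 and 5.0, so the three logits are 4.0, 5.0 and 9.5.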