EAS
Efficient Architecture Search by Network Transformation, in AAAI 2018
Techniques for automatically designing deep neural network architectures, such as reinforcement learning based approaches, have recently shown promising results. However, their success is based on vast computational resources (e.g. hundreds of GPUs), making them difficult to use widely. A noticeable limitation is that they still design and train each network from scratch during the exploration of the architecture space, which is highly inefficient. In this paper, we propose a new framework toward efficient architecture search that explores the architecture space based on the current network and reuses its weights. We employ a reinforcement learning agent as the meta-controller, whose action is to grow the network depth or layer width with function-preserving transformations. As such, previously validated networks can be reused for further exploration, saving a large amount of computational cost. We apply our method to explore the architecture space of plain convolutional neural networks (no skip-connections, branching, etc.) on image benchmark datasets (CIFAR-10, SVHN) with restricted computational resources (5 GPUs). Our method can design highly competitive networks that outperform existing networks using the same design scheme. On CIFAR-10, our model without skip-connections achieves a 4.23% test error rate, exceeding a vast majority of modern architectures and approaching DenseNet. Furthermore, by applying our method to explore the DenseNet architecture space, we are able to achieve more accurate networks with fewer parameters.
The great success of deep neural networks in various challenging applications [Krizhevsky, Sutskever, and Hinton 2012, Bahdanau, Cho, and Bengio 2014, Silver et al. 2016] has led to a paradigm shift from feature designing to architecture designing, which still remains a laborious task that requires human expertise. In recent years, many techniques for automating the architecture design process have been proposed [Snoek, Larochelle, and Adams 2012, Bergstra and Bengio 2012, Baker et al. 2017, Zoph and Le 2017, Real et al. 2017, Negrinho and Gordon 2017], and promising results of designing models competitive with human-designed ones have been reported on some benchmark datasets [Zoph and Le 2017, Real et al. 2017]. Despite these promising results, the success of such methods relies on vast computational resources (e.g. hundreds of GPUs), making them difficult to use in practice for individual researchers, small companies, or university research teams. Another key drawback is that they still design and train each network from scratch while exploring the architecture space, without leveraging previously explored networks, which wastes significant computational resources.
In fact, during the architecture design process, many slightly different networks are trained for the same task. Apart from their final validation performances, which are used to guide exploration, we also have access to their architectures, weights, training curves, etc., which contain abundant knowledge and can be leveraged to accelerate the architecture design process, just as human experts do [Chen, Goodfellow, and Shlens 2015, Klein et al. 2017]. Furthermore, there are typically many well-designed architectures, produced by humans or by automatic architecture design methods, that have achieved good performance on the target task. Under restricted computational resources, instead of totally neglecting these existing networks and exploring the architecture space from scratch (which is not guaranteed to yield better-performing architectures), a more economical and efficient alternative is to explore the architecture space starting from these successful networks and reusing their weights.
In this paper, we propose a new framework, called EAS, Efficient Architecture Search, where the meta-controller explores the architecture space by network transformation operations, such as widening a certain layer (more units or filters), inserting a layer, adding skip-connections, etc., given an existing network trained on the same task. To reuse weights, we consider the class of function-preserving transformations [Chen, Goodfellow, and Shlens 2015], which initialize the new network to represent the same function as the given network but with a different parameterization, to be further trained to improve the performance. This can significantly accelerate the training of the new network, especially for large networks. Furthermore, we combine our framework with recent advances in reinforcement learning (RL) based automatic architecture design methods [Baker et al. 2017, Zoph and Le 2017], and employ an RL based agent as the meta-controller.
Our experiments exploring the architecture space of plain convolutional neural networks (CNNs), which consist purely of convolutional, fully-connected and pooling layers without skip-connections, branching, etc., on image benchmark datasets (CIFAR-10, SVHN), show that EAS with limited computational resources (5 GPUs) can design competitive architectures. The best plain model designed by EAS on CIFAR-10 with standard data augmentation achieves a 4.23% test error rate, better even than many modern architectures that use skip-connections. We further apply our method to explore the DenseNet [Huang et al. 2017] architecture space, and achieve a 4.66% test error rate on CIFAR-10 without data augmentation and 3.44% with standard data augmentation, surpassing the best results given by the original DenseNet while using fewer parameters.
There is a long-standing line of work on automatic architecture design. Neuro-evolution algorithms, which mimic evolutionary processes in nature, are among the earliest automatic architecture design methods [Miller, Todd, and Hegde 1989, Stanley and Miikkulainen 2002]. The authors of [Real et al. 2017] used neuro-evolution algorithms to explore a large CNN architecture space and obtained networks that match the performance of human-designed models. In parallel, automatic architecture design has also been studied in the context of Bayesian optimization [Bergstra and Bengio 2012, Domhan, Springenberg, and Hutter 2015, Mendoza et al. 2016]. Recently, reinforcement learning has been introduced into automatic architecture design and has shown strong empirical results. The authors of [Baker et al. 2017] presented a Q-learning agent that sequentially picks CNN layers; the authors of [Zoph and Le 2017] used an auto-regressive recurrent network to generate a variable-length string specifying the architecture of a neural network, and trained the recurrent network with policy gradient.
As the above solutions design and train networks from scratch, significant computational resources are wasted during the search. In this paper, we aim to address this efficiency problem. Technically, we reuse existing networks trained on the same task and take network transformation actions; function-preserving transformations combined with an RL based meta-controller are used to explore the architecture space. Moreover, we note that complementary techniques for improving efficiency, such as learning curve prediction [Klein et al. 2017], can be combined with our method.
Generally, any modification to a given network can be viewed as a network transformation operation. In this paper, since our aim is to utilize knowledge stored in previously trained networks, we focus on identifying network transformation operations that are able to reuse pre-existing models. The idea of reusing pre-existing models, or knowledge transfer between neural networks, has been studied before. The Net2Net technique introduced in [Chen, Goodfellow, and Shlens 2015] describes two specific function-preserving transformations, namely Net2WiderNet and Net2DeeperNet, which respectively initialize a wider or deeper student network to represent the same function as the given teacher network, and were shown to significantly accelerate the training of the student network, especially for large networks. Similar function-preserving schemes have also been proposed in ResNet, particularly for training very deep architectures [He et al. 2016a]. Additionally, the network compression technique presented in [Han et al. 2015] prunes less important (low-weight) connections to shrink the size of neural networks without reducing their accuracy.
In this paper, we instead focus on utilizing such network transformations to reuse pre-existing models so as to efficiently and economically explore the architecture space for automatic architecture design.
Our meta-controller in this work is based on RL [Sutton and Barto 1998], a family of techniques for training an agent to maximize cumulative reward when interacting with an environment [Cai et al. 2017]. We use the REINFORCE algorithm [Williams 1992], similar to [Zoph and Le 2017], for updating the meta-controller, while other advanced policy gradient methods [Kakade 2002, Schulman et al. 2015] can be applied analogously. Our action space is, however, different from that of [Zoph and Le 2017] or other RL based approaches [Baker et al. 2017]: our actions are network transformation operations such as widening and deepening, while theirs are specific configurations of a newly created network layer on top of the preceding layers. Specifically, we model the automatic architecture design procedure as a sequential decision-making process, where the state is the current network architecture and the action is the corresponding network transformation operation. After a number of network transformation steps, the final network architecture, along with its weights transferred from the initial input network, is trained on the real data to obtain the validation performance used to calculate the reward signal, which in turn updates the meta-controller via policy gradient so as to maximize the expected validation performance of the designed networks.
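As a minimal sketch of this update rule (our own illustrative code, not the authors' implementation — the actual meta-controller uses recurrent actor networks over architecture strings), a REINFORCE step for a simple softmax policy over transformation actions can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a softmax policy over 3 transformation actions
# (e.g. widen, deepen, stop), parameterized by a logit vector theta.
theta = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(action, reward, baseline, lr=0.1):
    """REINFORCE: theta += lr * (reward - baseline) * grad log pi(action)."""
    global theta
    pi = softmax(theta)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0   # gradient of log softmax(theta)[action]
    theta = theta + lr * (reward - baseline) * grad_log_pi

# Rewarding action 0 repeatedly should raise its sampling probability.
for _ in range(50):
    a = rng.choice(3, p=softmax(theta))
    r = 1.0 if a == 0 else 0.0
    reinforce_step(a, r, baseline=0.5)
```

Here the constant baseline of 0.5 stands in for the moving-average baseline described later in the paper; the point is only that actions followed by above-baseline rewards become more probable.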
In this section, we first introduce the overall framework of our meta-controller, and then show how each specific network transformation decision is made under it. We later extend the function-preserving transformations to the DenseNet [Huang et al. 2017] architecture space, where directly applying the original Net2Net operations can be problematic since the output of a layer is fed to all subsequent layers.
We consider learning a meta-controller to generate network transformation actions given the current network architecture, which is specified with a variable-length string [Zoph and Le 2017]. To generate various types of network transformation actions while keeping the meta-controller simple, we use an encoder network to learn a low-dimensional representation of the given architecture, which is then fed into each separate actor network to generate a certain type of network transformation action. Furthermore, to handle variable-length network architectures as input and take the whole input architecture into consideration when making decisions, the encoder network is implemented as a bidirectional recurrent network [Schuster and Paliwal 1997] with an input embedding layer. The overall framework is illustrated in Figure 1, and is an analogue of end-to-end sequence-to-sequence learning [Sutskever, Vinyals, and Le 2014, Bahdanau, Cho, and Bengio 2014].
Given the low-dimensional representation of the input architecture, each actor network makes the decisions needed for a certain type of network transformation action. In this work, we introduce two specific actor networks, namely the Net2Wider actor and the Net2Deeper actor, which correspond to Net2WiderNet and Net2DeeperNet respectively.
The Net2WiderNet operation allows replacing a layer with a wider layer, meaning more units for fully-connected layers or more filters for convolutional layers, while preserving the functionality. For example, consider a convolutional layer with kernel $K_l$ whose shape is $(k_w^l, k_h^l, f_i^l, f_o^l)$, where $k_w^l$ and $k_h^l$ denote the filter width and height, while $f_i^l$ and $f_o^l$ denote the number of input and output channels. To replace this layer with a wider layer that has $\hat{f}_o^l$ ($> f_o^l$) output channels, we should first introduce a random remapping function $G_l$, which is defined as

$$G_l(j) = \begin{cases} j & 1 \le j \le f_o^l \\ \text{random sample from } \{1, \cdots, f_o^l\} & f_o^l < j \le \hat{f}_o^l \end{cases} \qquad (1)$$

With the remapping function $G_l$, we have the new kernel $\hat{K}_l$ for the wider layer with shape $(k_w^l, k_h^l, f_i^l, \hat{f}_o^l)$:

$$\hat{K}_l[x, y, i, j] = K_l[x, y, i, G_l(j)] \qquad (2)$$

As such, the first $f_o^l$ entries in the output channel dimension of $\hat{K}_l$ are directly copied from $K_l$, while the remaining $\hat{f}_o^l - f_o^l$ entries are created by choosing randomly as defined in $G_l$. Accordingly, the new output of the wider layer is $\hat{O}_l$ with $\hat{O}_l(j) = O_l(G_l(j))$, where $O_l$ is the output of the original layer and we only show the channel dimension to simplify the notation.

To preserve the functionality, the kernel $K_{l+1}$ of the next layer should also be modified due to the replication in its input. The new kernel $\hat{K}_{l+1}$ with shape $(k_w^{l+1}, k_h^{l+1}, \hat{f}_o^l, f_o^{l+1})$ is given as

$$\hat{K}_{l+1}[x, y, j, k] = \frac{K_{l+1}\big[x, y, G_l(j), k\big]}{\big|\{z \mid G_l(z) = G_l(j)\}\big|} \qquad (3)$$
For further details, we refer to the original Net2Net work [Chen, Goodfellow, and Shlens2015].
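To make the operation concrete, here is a minimal NumPy sketch of Eqs. (1)–(3) (our own illustrative code, not the authors' implementation; function and variable names are assumptions, and indices are 0-based):

```python
import numpy as np

def net2wider(kernel, next_kernel, new_width, rng=np.random.default_rng(0)):
    """Widen a conv layer from f_o to new_width output channels (Eqs. 1-3).

    kernel:      (k_w, k_h, f_i, f_o)    kernel of the layer being widened
    next_kernel: (k_w', k_h', f_o, f_o') kernel of the following layer
    Returns the widened kernel, the compensated next-layer kernel, and the
    remapping array g (a 0-indexed version of Eq. 1).
    """
    f_o = kernel.shape[3]
    assert new_width > f_o
    # Eq. (1): identity on existing channels, random replication for new ones
    g = np.concatenate([np.arange(f_o), rng.integers(0, f_o, new_width - f_o)])
    # Eq. (2): copy/replicate output channels according to g
    wider = kernel[:, :, :, g]
    # Eq. (3): divide each replicated input channel of the next layer by its
    # replication count so the summed contribution stays unchanged
    counts = np.bincount(g, minlength=f_o)        # |{z : g(z) = g(j)}|
    next_wider = next_kernel[:, :, g, :] / counts[g][None, None, :, None]
    return wider, next_wider, g
```

With ReLU between the two layers, the widened pair computes the same function as the original pair, since each replicated channel is an exact copy of an original channel.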
In our work, to be flexible and efficient, the Net2Wider actor simultaneously determines whether each layer should be widened. Specifically, for each layer, this decision is made by a shared sigmoid classifier given the hidden state of the layer learned by the bidirectional encoder network. Moreover, we follow previous work and search the number of filters for convolutional layers and the number of units for fully-connected layers in a discrete space. Therefore, if the Net2Wider actor decides to widen a layer, the number of filters or units of the layer increases to the next discrete level, e.g. from 32 to 64. The structure of the Net2Wider actor is shown in Figure 2.
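A rough sketch of this per-layer decision (illustrative assumptions only: 9 layers, 100-dimensional encoder states, and a randomly initialized shared classifier standing in for the trained one):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden state per layer from the bidirectional encoder (assumed shapes)
hidden = rng.standard_normal((9, 100))
w, b = rng.standard_normal(100) * 0.01, 0.0   # shared sigmoid classifier

probs = sigmoid(hidden @ w + b)   # P(widen layer i), one value per layer
widen = rng.random(9) < probs     # sample a binary widen decision per layer
```

Because the classifier parameters are shared across layers, the number of parameters is independent of the network depth, which is what lets the actor handle variable-length architectures.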
The Net2DeeperNet operation allows inserting a new layer, initialized as an identity mapping between two layers, so as to preserve the functionality. For a new convolutional layer, the kernel is set to identity filters, while for a new fully-connected layer, the weight matrix is set to the identity matrix. Thus the new layer starts with the same number of filters or units as the layer below, and can be further widened when a Net2WiderNet operation is later performed on it. To fully preserve the functionality, the Net2DeeperNet operation has a constraint on the activation function $\phi$: it must satisfy $\phi(\mathbf{I}\,\phi(\mathbf{v})) = \phi(\mathbf{v})$ for all vectors $\mathbf{v}$. This property holds for the rectified linear activation (ReLU) but fails for sigmoid and tanh activations. However, we can still reuse the weights of existing networks with sigmoid or tanh activation, which can be useful compared to random initialization. Additionally, when using batch normalization [Ioffe and Szegedy 2015], we need to set the output scale and output bias of the batch normalization layer so as to undo the normalization, rather than initialize them as ones and zeros. Further details about the Net2DeeperNet operation are provided in the original paper [Chen, Goodfellow, and Shlens 2015].

The structure of the Net2Deeper actor is shown in Figure 3; it is a recurrent network whose hidden state is initialized with the final hidden state of the encoder network. Similar to previous work [Baker et al. 2017], we allow the Net2Deeper actor to insert one new layer at each step. Specifically, we divide a CNN architecture into several blocks according to the pooling layers, and the Net2Deeper actor sequentially determines which block to insert the new layer in, a specific index within the block, and the parameters of the new layer. For a new convolutional layer, the agent needs to determine the filter size and the stride, while for a new fully-connected layer no parameter prediction is needed. In CNN architectures, any fully-connected layer should sit on top of all convolutional and pooling layers. To avoid unreasonable architectures, if the Net2Deeper actor decides to insert a new layer after a fully-connected layer or the final global average pooling layer, the new layer is restricted to be a fully-connected layer; otherwise it must be a convolutional layer.
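The identity initialization can be sketched as follows (a NumPy illustration under our own naming, with a naive 'same'-padded cross-correlation standing in for a framework's conv op):

```python
import numpy as np

def identity_conv_kernel(channels, k=3):
    """Net2DeeperNet init: identity filters. Only the center tap of the
    matching input channel is 1, so the layer copies its input."""
    K = np.zeros((k, k, channels, channels))
    K[k // 2, k // 2, np.arange(channels), np.arange(channels)] = 1.0
    return K

def conv2d_same(x, K):
    """Naive 'same'-padded cross-correlation; x: (H, W, C_in)."""
    k = K.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, K.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]   # (k, k, C_in) receptive field
            out[i, j] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return out
```

Because the previous layer's ReLU output is non-negative, applying ReLU after the inserted identity layer leaves it unchanged — exactly the $\phi(\mathbf{I}\,\phi(\mathbf{v})) = \phi(\mathbf{v})$ condition above.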
The original Net2Net operations proposed in [Chen, Goodfellow, and Shlens 2015] are discussed for networks arranged layer by layer, i.e. where the output of a layer is fed only to its next layer. As such, in modern CNN architectures where the output of a layer is fed to multiple subsequent layers, such as DenseNet [Huang et al. 2017], directly applying the original Net2Net operations can be problematic. In this section, we introduce several extensions to the original Net2Net operations that enable function-preserving transformations for DenseNet.
Different from a plain CNN, in DenseNet the $l$-th layer receives the outputs of all preceding layers as input, concatenated along the channel dimension and denoted as $[O_0, O_1, \cdots, O_{l-1}]$, while its output $O_l$ is fed to all subsequent layers.
Denote the kernel of the $l$-th layer as $K_l$ with shape $(k_w^l, k_h^l, f_o^{0:l}, f_o^l)$, where $f_o^{0:l} = \sum_{v=0}^{l-1} f_o^v$ is the number of input channels of the $l$-th layer. To replace the $l$-th layer with a wider layer that has $\hat{f}_o^l$ output channels while preserving the functionality, the creation of the new kernel $\hat{K}_l$ in the $l$-th layer is the same as in the original Net2WiderNet operation (see Eq. (1) and Eq. (2)). As such, the new output of the wider layer is $\hat{O}_l$ with $\hat{O}_l(j) = O_l(G_l(j))$, where $G_l$ is the random remapping function defined in Eq. (1). Since the output of the $l$-th layer is fed to all subsequent layers in DenseNet, the replication in $\hat{O}_l$ results in replication in the inputs of all layers after the $l$-th layer. As such, instead of only modifying the kernel of the next layer as done in the original Net2WiderNet operation, we need to modify the kernels of all subsequent layers in DenseNet. For the $m$-th layer where $m > l$, its input becomes $[O_0, \cdots, O_{l-1}, \hat{O}_l, O_{l+1}, \cdots, O_{m-1}]$ after widening the $l$-th layer; thus from the perspective of the $m$-th layer, the equivalent random remapping function $\hat{G}_m$ can be written as

$$\hat{G}_m(j) = \begin{cases} j & 1 \le j \le f_o^{0:l} \\ f_o^{0:l} + G_l(j - f_o^{0:l}) & f_o^{0:l} < j \le f_o^{0:l} + \hat{f}_o^l \\ j - \hat{f}_o^l + f_o^l & f_o^{0:l} + \hat{f}_o^l < j \end{cases} \qquad (4)$$

where the first part corresponds to $[O_0, \cdots, O_{l-1}]$, the second part corresponds to $\hat{O}_l$, and the last part corresponds to $[O_{l+1}, \cdots, O_{m-1}]$. As a simple instance of $\hat{G}_m$, take $f_o^{0:l} = 2$, $f_o^l = 2$, $\hat{f}_o^l = 3$ and $G_l = (1, 2, 1)$; then $\hat{G}_m$ maps $(1, 2, 3, 4, 5, 6, \cdots)$ to $(1, 2, 3, 4, 3, 5, \cdots)$. Accordingly, the new kernel of the $m$-th layer can be given by Eq. (3) with $G_l$ replaced with $\hat{G}_m$.
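Eq. (4) amounts to building one index array per subsequent layer; a 0-indexed NumPy sketch (our own illustrative code and naming):

```python
import numpy as np

def dense_wider_remap(g_l, f_o, pre, post):
    """Equivalent remapping (Eq. 4) seen by a DenseNet layer m > l.

    g_l:  0-indexed remapping of the widened l-th layer (first f_o entries
          are the identity, the rest are random replications)
    f_o:  original output width of layer l
    pre:  f_o^{0:l}, channels produced by layers before l
    post: channels produced by layers between l and m (unchanged widths)
    """
    part1 = np.arange(pre)                  # [O_0, ..., O_{l-1}]: identity
    part2 = pre + np.asarray(g_l)           # \hat{O}_l: shifted G_l
    part3 = pre + f_o + np.arange(post)     # [O_{l+1}, ...]: shifted identity
    return np.concatenate([part1, part2, part3])
```

For instance, with `pre = 2`, `f_o = 2` and `g_l = [0, 1, 0]`, the returned map is `[0, 1, 2, 3, 2, 4, 5]`, the 0-indexed counterpart of $\hat{G}_m = (1, 2, 3, 4, 3, 5, 6)$.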
To insert a new layer in DenseNet, suppose the new layer is inserted after the $l$-th layer. Denote the output of the new layer as $O_{new}$; its input is $[O_0, O_1, \cdots, O_l]$. Therefore, for the $m$-th layer where $m > l$, its new input after the insertion is $[O_0, \cdots, O_l, O_{new}, O_{l+1}, \cdots, O_{m-1}]$. To preserve the functionality, similar to the Net2WiderNet case, $O_{new}$ should be a replication of some entries in $[O_0, \cdots, O_l]$. This is possible, since the input of the new layer is exactly $[O_0, \cdots, O_l]$. Each filter in the new layer can be represented as a tensor, denoted $F$, with shape $(k_w^{new}, k_h^{new}, f_o^{0:l+1})$, where $k_w^{new}$ and $k_h^{new}$ denote the width and height of the filter, and $f_o^{0:l+1}$ is the number of input channels. To make the output of $F$ a replication of the $n$-th entry in $[O_0, \cdots, O_l]$, we can set $F$ (using the special case $k_w^{new} = k_h^{new} = 3$ for illustration) as

$$F[\cdot, \cdot, n] = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad (5)$$

while all other values in $F$ are set to 0. Note that $n$ can be chosen randomly from $\{1, \cdots, f_o^{0:l+1}\}$ for each filter. After all filters in the new layer are set, we can form an equivalent random remapping function for all subsequent layers, as in Eq. (4), and modify their kernels accordingly.
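The filter construction of Eq. (5) reduces to placing a single center tap per filter (NumPy sketch with assumed names, 0-indexed):

```python
import numpy as np

def dense_identity_filters(in_channels, out_channels, k=3,
                           rng=np.random.default_rng(0)):
    """Eq. (5): each filter of the inserted DenseNet layer replicates one
    randomly chosen input channel n; only the center tap of channel n is 1,
    all other kernel entries are 0."""
    picks = rng.integers(0, in_channels, out_channels)  # n for each filter
    F = np.zeros((k, k, in_channels, out_channels))
    F[k // 2, k // 2, picks, np.arange(out_channels)] = 1.0
    return F, picks
```

The `picks` array is exactly the information needed to form the equivalent remapping function for the subsequent layers.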
Table 1: The start point network for architecture search on C10+, which achieves 87.07% accuracy on the held-out validation set. (The architecture rows were lost in extraction; SM($n$) denotes a softmax layer with $n$ output units.)

In line with previous work [Baker et al. 2017, Zoph and Le 2017, Real et al. 2017], we apply the proposed EAS on image benchmark datasets (CIFAR-10 and SVHN) to explore high performance CNN architectures for the image classification task. (Experiment code and discovered top architectures along with weights: https://github.com/han-cai/EAS.) Notice that the performance of the final designed models largely depends on the architecture space and the computational resources. In our experiments, we evaluate EAS in two different settings; in all cases, we use restricted computational resources (5 GPUs), compared to previous work such as [Zoph and Le 2017] that used 800 GPUs. In the first setting, we apply EAS to explore the plain CNN architecture space, which consists purely of convolutional, pooling and fully-connected layers, while in the second setting we apply EAS to explore the DenseNet architecture space.
The CIFAR-10 dataset [Krizhevsky and Hinton 2009] consists of 50,000 training images and 10,000 test images. We use a standard data augmentation scheme that is widely used for CIFAR-10 [Huang et al. 2017], and denote the augmented dataset as C10+ while the original dataset is denoted as C10. For preprocessing, we normalize the images using the channel means and standard deviations. Following previous work [Baker et al. 2017, Zoph and Le 2017], we randomly sample 5,000 images from the training set to form a validation set, using the remaining 45,000 images for training while exploring the architecture space.

The Street View House Numbers (SVHN) dataset [Netzer et al. 2011] contains 73,257 images in the original training set, 26,032 images in the test set, and 531,131 additional images in the extra training set. For preprocessing, we divide the pixel values by 255 and do not perform any data augmentation, as is done in [Huang et al. 2017]. We follow [Baker et al. 2017] and use the original training set during the architecture search phase, with 5,000 randomly sampled images as the validation set, while training the final discovered architectures using all the training data, including the original training set and the extra training set.
For the meta-controller, we use a one-layer bidirectional LSTM with 50 hidden units as the encoder network (Figure 1), with an embedding size of 16, and train it with the ADAM optimizer [Kingma and Ba 2015].

At each step, the meta-controller samples 10 networks by taking network transformation actions. Since the sampled networks are not trained from scratch but reuse the weights of the given network in our scenario, they are then trained for 20 epochs, a relatively small number compared to the 50 epochs in [Zoph and Le 2017]. Besides, we use a smaller initial learning rate for the same reason. Other settings for training networks on CIFAR-10 and SVHN are similar to [Huang et al. 2017, Zoph and Le 2017]. Specifically, we use SGD with a Nesterov momentum [Sutskever et al. 2013] of 0.9, a weight decay of 0.0001, and a batch size of 64. The initial learning rate is 0.02 and is further annealed with a cosine learning rate decay [Gastaldi 2017]. The accuracy on the held-out validation set is used to compute the reward signal for each sampled network. Since the gain from improving the accuracy from 90% to 91% should be much larger than from 60% to 61%, instead of directly using the validation accuracy $acc_v$ as the reward, as done in [Zoph and Le 2017], we perform a non-linear transformation on $acc_v$, i.e. $\tan(acc_v \times \pi / 2)$, and use the transformed value as the reward. Additionally, we use an exponential moving average of previous rewards, with a decay of 0.95, as the baseline function to reduce the variance.
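The reward shaping and baseline can be sketched as (illustrative code; the names are ours):

```python
import math

def transformed_reward(val_acc):
    """tan(v * pi/2) grows steeply as v -> 1, so a 90% -> 91% improvement
    earns far more reward than a 60% -> 61% improvement."""
    return math.tan(val_acc * math.pi / 2)

class EMABaseline:
    """Exponential moving average of past rewards (decay 0.95), used as
    the policy-gradient baseline to reduce variance."""
    def __init__(self, decay=0.95):
        self.decay = decay
        self.value = None

    def advantage(self, reward):
        """Update the moving average and return reward minus baseline."""
        if self.value is None:
            self.value = reward
        else:
            self.value = self.decay * self.value + (1 - self.decay) * reward
        return reward - self.value
```

The advantage (reward minus baseline) is what multiplies the log-probability gradient in the REINFORCE update.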
We start by applying EAS to explore the plain CNN architecture space. Following previous automatic architecture design methods [Baker et al. 2017, Zoph and Le 2017], EAS searches layer parameters in a discrete and limited space. For every convolutional layer, the filter size is chosen from {1, 3, 5} and the number of filters is chosen from a predefined discrete set of widths, while the stride is fixed to 1 [Baker et al. 2017]. For every fully-connected layer, the number of units is likewise chosen from a discrete set. Additionally, we use ReLU and batch normalization for each convolutional or fully-connected layer. For SVHN, we add a dropout layer after each convolutional layer (except the first layer) and use a dropout rate of 0.2 [Huang et al. 2017].
We begin the exploration on C10+ using a small network (see Table 1), which achieves 87.07% accuracy on the held-out validation set, as the start point. Different from [Zoph and Le 2017, Baker et al. 2017], EAS is not restricted to start from scratch and can flexibly use any discovered architecture as a new start point. To take advantage of this flexibility, and also to reduce the search space to save computational resources and time, we divide the whole architecture search process into two stages, where we allow the meta-controller to take 5 steps of Net2Deeper actions and 4 steps of Net2Wider actions in the first stage. After 300 networks are sampled, we take the network that currently performs best and train it for a longer period of time (100 epochs), to be used as the start point for the second stage. Similarly, in the second stage, we also allow the meta-controller to take 5 steps of Net2Deeper actions and 4 steps of Net2Wider actions, and stop exploration after 150 networks are sampled.
The progress of the two-stage architecture search is shown in Figure 4, where we find that EAS gradually learns to pick high performance architectures at each stage. Since EAS uses function-preserving transformations to explore the architecture space, the sampled architectures consistently perform better than the start point network at each stage; thus it is usually "safe" to explore the architecture space with EAS. We take the top networks discovered during the second stage and further train them for 300 epochs using the full training set. Finally, the best model achieves 95.11% test accuracy (i.e. a 4.89% test error rate). Furthermore, to test the transferability of the discovered networks, we train the top architecture (95.11% test accuracy) on SVHN from random initialization for 40 epochs using the full training set; it achieves 98.17% test accuracy (i.e. a 1.83% test error rate), better than both human-designed and automatically designed architectures in the plain CNN architecture space (see Table 2).
We would like to emphasize that the computational resources required to achieve this result are much smaller than those required in [Zoph and Le 2017, Real et al. 2017]. Specifically, it takes less than 2 days on 5 GeForce GTX 1080 GPUs, with 450 networks trained in total, to achieve a 4.89% test error rate on C10+ starting from a small network.
Table 2: Test error rate (%) comparison with plain CNN architectures on C10+ and SVHN. (Columns: Model, C10+, SVHN; the individual rows were lost in extraction.)
Table 3: Test error rate (%) comparison with state-of-the-art architectures on C10+. (Columns: Model, Depth, Params, C10+; the individual rows were lost in extraction.)
To search for better architectures in the plain CNN architecture space, in the second experiment we use the top architectures discovered in the first experiment as start points to explore a larger architecture space on C10+ and SVHN. This experiment takes around 2 days on 5 GPUs for each dataset.
The summarized results of the comparison with human-designed and automatically designed architectures that use a similar design scheme (plain CNNs) are reported in Table 2, where we find that the top model designed by EAS in the plain CNN architecture space outperforms all similar models by a large margin. Specifically, compared to human-designed models, the test error rate drops from 7.25% to 4.23% on C10+ and from 2.35% to 1.73% on SVHN. Compared to MetaQNN, the Q-learning based automatic architecture design method, EAS achieves a relative test error rate reduction of 38.9% on C10+ and 16.0% on SVHN. We also notice that the best model designed by MetaQNN on C10+ only has a depth of 7, though the maximum is set to 18 in the original paper [Baker et al. 2017]. We suspect this is because they trained each designed network from scratch with an aggressive training strategy to accelerate training, which caused many networks, especially deep ones, to underperform. Since we reuse the weights of pre-existing networks, deep networks are validated more accurately in EAS, and we can thus design deeper and more accurate networks than MetaQNN.
We also report the comparison with state-of-the-art architectures that use advanced techniques such as skip-connections, branching, etc., on C10+ in Table 3. Though this is not a fair comparison, since we do not incorporate such advanced techniques into the search space in this experiment, we still find that the top model designed by EAS is highly competitive even against these state-of-the-art modern architectures. Specifically, the 20-layer plain CNN with 23.4M parameters outperforms ResNet, its stochastic depth variant, and its pre-activation variant. It also approaches the best result given by DenseNet. When compared to automatic architecture design methods that incorporate skip-connections into their search space, our 20-layer plain model beats most of them except NAS with post-processing, which is much deeper and has more parameters than our model. Moreover, we only use 5 GPUs and train hundreds of networks, while they use 800 GPUs and train tens of thousands of networks.
Model | Depth | Params | C10 | C10+
DenseNet (L=100, k=24) | 100 | 27.2M | 5.83 | 3.74
DenseNet-BC (L=250, k=24) | 250 | 15.3M | 5.19 | 3.62
DenseNet-BC (L=190, k=40) | 190 | 25.6M | — | 3.46
NAS (post-processing) | 39 | 37.4M | — | 3.65
EAS (DenseNet on C10) | 70 | 8.6M | 4.66 | —
EAS (DenseNet on C10+) | 76 | 10.7M | — | 3.44
Our framework is not restricted to the RL based meta-controller. Besides RL, one can also take network transformation actions to explore the architecture space by random search, which can be effective in some cases [Bergstra and Bengio 2012]. In this experiment, we compare the performance of the RL based meta-controller and a random search meta-controller in the architecture space used in the above experiments. Specifically, we use the network in Table 1 as the start point and let the meta-controller take 5 steps of Net2Deeper actions and 4 steps of Net2Wider actions. The result is reported in Figure 5, which shows that the RL based meta-controller can effectively focus on the right search direction while random search cannot (left plot), and thus finds high performance architectures more efficiently than random search.
We also apply EAS to explore the DenseNet architecture space, using a DenseNet-BC network as the start point. The growth rate, i.e. the width of the non-bottleneck layers, is chosen from a discrete candidate set, and the result is reported in Table 4. We find that by applying EAS to explore the DenseNet architecture space, we achieve a test error rate of 4.66% on C10, better than the best result given by the original DenseNet (5.19%) while having 43.79% fewer parameters. On C10+, we achieve a test error rate of 3.44%, also outperforming the best result given by the original DenseNet (3.46%) while having 58.20% fewer parameters.
In this paper, we presented EAS, a new framework toward economical and efficient architecture search, where the meta-controller is implemented as an RL agent that learns to take network transformation actions to explore the architecture space. By starting from an existing network and reusing its weights via function-preserving transformation operations, EAS is able to utilize the knowledge stored in previously trained networks and take advantage of existing successful architectures on the target task to explore the architecture space efficiently. Our experiments have demonstrated EAS's outstanding performance and efficiency compared with several strong baselines. For future work, we would like to explore more network transformation operations and apply EAS to different objectives, such as searching for networks that not only have high accuracy but also strike a balance between size and performance.
This research was sponsored by Huawei Innovation Research Program, NSFC (61702327) and Shanghai Sailing Program (17YF1428200).
References (recoverable entries):
Domhan, T.; Springenberg, J. T.; and Hutter, F. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI.
Mendoza, H.; Klein, A.; Feurer, M.; Springenberg, J. T.; and Hutter, F. 2016. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning.
Miller, G. F.; Todd, P. M.; and Hegde, S. U. 1989. Designing neural networks using genetic algorithms. In ICGA. Morgan Kaufmann Publishers Inc.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.