Deep neural networks have enabled advances in image recognition, sequential pattern recognition, recommendation systems, and various other tasks in the past decades. However, selecting a suitable neural architecture is frequently arduous because classical and new neural architectures emerge daily. In general, manually designing a network architecture for each use case is achievable, but hyperparameter tuning and architecture engineering through manual selection require considerable time. Furthermore, manually designing a neural network architecture requires substantial experience in deep learning.
Given the aforementioned reasons, neural architecture search (NAS), an automated form of architecture engineering, has been successful in the past years. NAS algorithms can be characterized along three dimensions, namely search space, search strategy, and performance estimation strategy [Elsken2019].
Search Strategy. The search strategy defines how to explore the search space; examples include random search, Bayesian optimization, evolutionary methods, reinforcement learning, and gradient-based methods.
Search Space. The search space defines attributes such as the number of layers, the type of operation, the hyperparameters of each operation, and the connections between layers. The NAS search space is large and complex.
Performance Estimation Strategy. The performance estimation strategy defines how the performance of an architecture on unseen data is estimated. Fully training an architecture usually requires considerable GPU time, which necessitates a cheaper strategy for estimating its performance.
Outstanding results have been achieved using NAS with the reinforcement learning search strategy [zoph2017]. Here, a recurrent network was used to generate a string that forms a child network. However, this approach exhibits two problems: a noncontinuous search space and a high-dimensional search space. The long action strings produced by the recurrent network and the discrete space make optimization difficult. The critical contribution of this study is an improvement in the dimension and quality of the search space, which provides a more efficient framework for solving these two problems of architecture search.
If a vector that can represent a network architecture without discrete values is determined, then the aforementioned disadvantage of a noncontinuous space can be addressed. We proposed the NAS in embedding space (NASES) method, which involves mapping origin architecture to architecture-embedding by using an architecture encoder. The embedding space is lower-dimensional and continuous, which considerably alleviates the difficulty of optimization in the NAS procedure with reinforcement learning. To learn and search on the embedding space, we developed a mechanism that generates an architecture encoder and decoder so that the origin architecture can communicate with the embedding space; an autoencoder network was used in this mechanism [hinton2006]. The architecture simulator simulates the origin architecture space, which assists the learning of the real architecture encoder. The decoder captures the relationship between origin architecture and architecture-embedding and maps the architecture-embedding back to the origin architecture.
The NASES procedure was implemented in two stages. In the first stage, we obtained a pretrained architecture decoder and a pretrained architecture simulator and examined the relationship between the compression rate (embedding size) and the testing loss. In the second stage, we used the NASES procedure for image classification on CIFAR-10 by using the networks pretrained in the first stage. The experimental results indicated that NASES was highly efficient: it reduced the number of searched architectures to 100, found in approximately 12 GPU hours, while achieving results comparable with those of other popular NAS methods.
Reinforcement Learning with Action Embedding
Reinforcement learning is a general approach that can be applied broadly to various areas. However, a large and discrete action space causes problems in function approximation. The majority of studies have focused on two approaches. One approach factorizes the action space into binary subspaces [pazis2011, dulac2012]. The other approach involves embedding discrete actions into a continuous space, determining optimal actions in the continuous space, and selecting the nearest discrete action, which reduces the dependence on the number of actions [dulac2015, hasselt2009].
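The second approach can be illustrated with a minimal sketch, assuming Euclidean distance as the notion of "nearest". The function name, the two-dimensional embeddings, and the proto-action below are hypothetical and are not taken from the cited works.

```python
import math

def nearest_discrete_action(proto_action, action_embeddings):
    """Return the index of the discrete action whose embedding is closest
    (in Euclidean distance) to the continuous proto-action."""
    best_index, best_dist = -1, math.inf
    for index, embedding in enumerate(action_embeddings):
        dist = math.dist(proto_action, embedding)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

# Hypothetical 2-D embeddings for four discrete actions.
ACTION_EMBEDDINGS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

# The policy outputs a continuous point; we snap it to the nearest action.
chosen = nearest_discrete_action((0.9, 0.2), ACTION_EMBEDDINGS)
```

A linear scan is used for clarity; for very large action sets, an approximate nearest-neighbor index would typically replace it.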
Search Strategy with Reinforcement Learning
Our search strategy is based on reinforcement learning. The work of [zoph2017] provided a novel NAS framework that incorporated reinforcement learning with two agents: a child network and a recurrent controller network. The neural architecture generated for the child network can be considered the action of the controller network. Unlike the policy gradient [will1992] used in [zoph2017], the method of [baker2016] used Q-learning to update the weights of the network.
NAS with Continuous Vector
Most NAS procedures use a discrete search space. Unlike approaches that learn over a discrete and nondifferentiable search space, [liu2018] proposed differentiable architecture search (DARTS), which is based on a continuous relaxation of the architecture representation. On the basis of DARTS, [hundt2019] proposed sharpDARTS, which is a more general, balanced, and consistent design. The closest concept to NASES is the approach proposed in [luo2018], in which an encoder and a decoder were used to map neural architectures into a continuous space for gradient-based optimization, and a predictor was used to estimate accuracy from the embedding.
In this section, to elucidate the NASES procedure, we followed the aforementioned three dimensions: search strategy, search space, and performance estimation strategy.
The search strategy is a search method for fast and accurate exploration of the space of neural architectures and involves techniques such as reinforcement learning, evolutionary algorithm, and gradient-based method. These are popular strategies in NAS.
NAS with reinforcement learning is generally designed with two components, the controller network and the child network (Figure 1). In each NAS iteration, the controller network generates a string that specifies the child network architecture, and the child network constructs the neural architecture from this controller output. The controller network is updated with the policy gradient computed from the validation performance of the child network. Thus, a severe penalty is imposed when performance is low. In NASES, the controller is constructed using a multilayer perceptron rather than recurrent layers and generates continuous values rather than a string. This is discussed in the next subsection.
The main contribution of NASES lies in the search space, where it resolves the two aforementioned problems of reinforcement learning, namely the noncontinuous and high-dimensional space; these problems make optimization difficult.
NASES is similar to the general NAS procedure in that it includes a child network and a controller network, and accuracy is likewise maximized with a policy gradient method. However, our method differs from the general NAS procedure in two respects. First, we developed an architecture encoder as the controller network, which controls the architecture of the child network and projects origin architecture into architecture-embedding. Second, we devised the architecture decoder network, which decodes the architecture-embedding produced by the controller network into the origin architecture so that the child network can understand the architecture-embedding and generate a network from it. That is, the architecture decoder functions as a translator that converts the low-dimensional embedding into a high-dimensional vector for smooth communication between the child and controller networks. Furthermore, the search space in this study was bounded using macro search. Thus, we did not apply the cell-based trick of hierarchical representation because we attempted to search over the complete architecture and not only over a cell.
Three Principal Functions of NASES.
NASES has three principal functions. This section describes the functions of the architecture decoder, architecture simulator, and controller network.
The Architecture Decoder. To obtain the architecture-embedding decoder, we first created an approximator of a virtual distribution transformation that projects the low-dimensional space into the high-dimensional space. That is, we transformed architecture-embedding into origin architecture.
Here, the decode function, which has its own learnable parameters, maps the architecture-embedding space to the origin architecture space; that is, it takes a set of architecture-embeddings to a set of origin architectures. This function frees the controller network from the origin architecture space, so the controller can operate in another space that is not bounded by it. The function is efficient and acts as a distribution transformer. To develop this approximator, an architecture simulator is required, which is discussed in the next subsection.
Architecture Simulator. An architecture simulator is an approximator of the virtual distribution transformation; its purpose is to create a function that simulates a distribution in order to obtain architecture-embedding. Furthermore, the simulated distribution is not limited to a discrete or continuous space.
Here, the simulator is a function with its own learnable parameters, and U is a uniform distribution.
Obtaining the Architecture Decoder and Architecture Simulator. To obtain the architecture decoder and simulator, we used the autoencoder network [hinton2006]. The autoencoder network is an unsupervised algorithm for distribution transformation and dimension reduction. The autoencoder generates a reduced encoding from which a representation as close as possible to the original input is reconstructed. Therefore, the architecture simulator and decoder were assembled into the autoencoder network. Before training the controller, we pretrained an autoencoder network in which the input space and target space had the same distribution; we used a uniform distribution. For convenient policy gradient learning, the activation function of the middle layer was a sigmoid, so that the output can be treated as following a Bernoulli distribution (Figure 3).
The Controller. A controller controls the child neural architecture. The neural architecture and hyperparameters of the controller are copied from the architecture simulator. The input is the state of the child network, and the output is an architecture-embedding with a Bernoulli distribution.
Here, the encode function, which has its own learnable parameters, maps the origin architecture space to the architecture-embedding space; that is, it takes a set of origin architectures to a set of architecture-embeddings. We did not train the controller from scratch during the NASES procedure; we only fine-tuned its weights. The controller network can explore the architecture-embedding space if its weights are initialized from the pretrained simulator of the autoencoder network, because the pretrained simulator already projects a uniform distribution into architecture-embedding. Another advantage is fast searching: the controller network is not required to learn the projection from the uniform distribution into architecture-embedding again and can focus on searching architectures.
Moreover, we devised a new reward function that uses the accuracy of the child network on both the training and validation sets. In past NAS with reinforcement learning, the reward is obtained only from the accuracy on the validation set. In this study, the reward function not only considers the accuracy on the validation set but also estimates the generalization error (Eq. 1).
The child model receives continuous vectors from the controller and generates a neural architecture by using the pretrained decoder model. Here, we describe the NASES mechanism for creating a network architecture. Each layer requires four hyperparameters, namely number of filters, filter size, kernel type, and connection coefficient.
The Operations: The 12 operations were provided to the controller through two hyperparameters, namely filter size and kernel type. The filter size determines how much neighboring information is used during the processing of the current layer, and the kernel type determines the component of the neural network: convolution layer, depthwise-separable convolution layer, maximum pooling layer, or average pooling layer. The child network receives these hyperparameters as continuous vectors, and we developed a rule that transforms them into discrete vectors. For example, for the kernel type, the rule uses 7 as the cut point for depthwise-separable convolution and the cut points 7 and 15 for convolution. For filter size, the rule uses 10 as the cut point for the 3 × 3 filter size. According to this principle, the operations available for the child network are convolution with filter sizes 3 × 3, 5 × 5, and 7 × 7; depthwise-separable convolution with filter sizes 3 × 3, 5 × 5, and 7 × 7; maximum pooling with filter sizes 3 × 3, 5 × 5, and 7 × 7; and average pooling with filter sizes 3 × 3, 5 × 5, and 7 × 7. In addition, each depthwise-separable convolution was applied twice [zoph2018].
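The discretization rule can be sketched as follows. This is only an illustration under stated assumptions: it uses equal-width bins over the interval [0, 30], which gives cut points close to, but not identical with, the ones quoted above, and the ordering of the pooling types is assumed. All names are hypothetical.

```python
KERNEL_TYPES = ["depthwise_separable_conv", "conv", "max_pool", "avg_pool"]
FILTER_SIZES = [3, 5, 7]

def to_bin(value, num_bins, low=0.0, high=30.0):
    """Map a continuous value in [low, high] to an equal-width bin index.
    Equal-width bins are an assumption made for this sketch."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # value == high falls in the last bin

def to_operation(kernel_value, filter_value):
    """Turn two continuous hyperparameters into a discrete operation."""
    kernel = KERNEL_TYPES[to_bin(kernel_value, len(KERNEL_TYPES))]
    size = FILTER_SIZES[to_bin(filter_value, len(FILTER_SIZES))]
    return kernel, size

# Four kernel types times three filter sizes give the 12 operations.
ALL_OPERATIONS = [(k, s) for k in KERNEL_TYPES for s in FILTER_SIZES]
```

For instance, `to_operation(5.0, 8.0)` selects a depthwise-separable convolution with a 3 × 3 filter under these assumed bins.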
Skip connections are essential connections that run from early layers to later layers through addition or direct concatenation. The reasoning behind skip connections is that they provide an uninterrupted gradient flow from the first layer to the last layer, which mitigates the vanishing gradient problem.
In this study, the connection coefficient hyperparameter determines which early layers connect to later layers: layers with close connection coefficients are connected, and the connected layers are concatenated in the channel dimension at the end of the layer. We therefore created another rule: layers are connected when their connection coefficients lie within a range of three; for example, layers one and two are connected when their connection coefficients both lie between two and five.
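One possible reading of this rule is sketched below: two layers are linked when their connection coefficients differ by at most three. The pairwise-difference interpretation, the function name, and the example coefficients are assumptions made for illustration.

```python
def skip_connections(coefficients, window=3.0):
    """Return (earlier, later) layer index pairs whose connection
    coefficients differ by at most `window`."""
    pairs = []
    for later in range(len(coefficients)):
        for earlier in range(later):
            if abs(coefficients[later] - coefficients[earlier]) <= window:
                pairs.append((earlier, later))
    return pairs

# Hypothetical coefficients for four layers (indices 0..3).
example = skip_connections([2.0, 5.0, 12.0, 4.5])
```

With these values, layers 0 and 1 (coefficients 2 and 5) are linked, matching the "between two and five" example, while layer 2 stays unconnected.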
The Order of the Blocks in Each Layer:
Performance is affected by the order of blocks in each layer. We applied the order ReLU-conv-batchnorm [ioffe2015]. Moreover, convolution filters with a kernel size of 1 × 1 can be applied to change the dimensionality in the filter space. We applied 1 × 1 convolution layers, also in the ReLU-conv-batchnorm order, before each convolution layer except the first layer.
Global Average Pooling. In NASES, we employed global average pooling [lin2013], an operation that calculates the average output of each feature map in the final convolution layer, to reduce the number of parameters in the fully connected layer.
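As a minimal illustration of the operation, the sketch below reduces each two-dimensional feature map to a single scalar. It uses plain nested lists rather than a tensor library, and the example values are made up.

```python
def global_average_pool(feature_maps):
    """Average each 2-D feature map (a list of rows) down to one scalar."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two hypothetical 2 x 2 feature maps.
maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
pooled = global_average_pool(maps)
```

The output has one value per feature map, so the size of the following fully connected layer depends only on the number of channels and not on the spatial resolution.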
Performance Estimation Strategy
To reduce the computational burden, we used two lower-fidelity approaches to estimate performance. First, the learning rate schedule follows cosine annealing with a maximum learning rate of 0.05, a minimum learning rate of 0.001, and a period equal to the number of epochs [loshchilov2017]. Second, each candidate architecture was trained for 10 epochs in the search phase, and the final architecture was trained for 700 epochs. The details are presented in Algorithm 1.
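The cosine annealing schedule can be sketched as below, assuming a single cosine cycle without warm restarts whose length equals the training length. The function and parameter names are illustrative.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=0.05, lr_min=0.001):
    """Learning rate at `step` out of `total_steps` for one cosine cycle."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Schedule for a 10-epoch search run, evaluated at epoch boundaries.
schedule = [cosine_annealed_lr(epoch, 10) for epoch in range(11)]
```

The rate starts at the maximum, passes through the midpoint of the two bounds halfway through training, and ends at the minimum.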
|Origin Size||Embedding Size||Compression ratio||Training Loss||Testing Loss|
We describe the two stages of the experiment in this section. The first stage involves training an architecture decoder and an architecture simulator network and selecting an appropriate compression ratio for the architecture-embedding. The second stage involves applying the result of the first stage to discover novel neural architectures for image classification on CIFAR-10 [krizhevsky2009learning] by using NASES.
First Stage: Pretraining Architecture Decoder and Simulator Network
In the first stage of the NASES experiment, our goal was to map origin architecture to architecture-embedding. We assembled the architecture simulator and decoder into an autoencoder network and trained this autoencoder network instead of training the simulator and decoder separately. To mimic the origin architecture space, we sampled 300000 vectors from a uniform distribution as the training set and 100000 vectors as the testing set. In this case, we sampled the uniform distribution on the interval [0, 30]; these samples serve as the simulation of the origin architecture space.
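The sampling step can be sketched as follows, assuming each simulated architecture is a vector of 60 independent draws (15 layers with four hyperparameters each, as used later in the experiment). The sample counts are reduced here to keep the example light, and all names are illustrative.

```python
import random

ORIGIN_SIZE = 60        # assumed: 15 layers x 4 hyperparameters
LOW, HIGH = 0.0, 30.0   # sampling interval stated in the text

def sample_architectures(count, rng):
    """Draw `count` simulated origin architectures from U(LOW, HIGH)."""
    return [[rng.uniform(LOW, HIGH) for _ in range(ORIGIN_SIZE)]
            for _ in range(count)]

rng = random.Random(0)
train_set = sample_architectures(300, rng)  # 300000 in the experiment
test_set = sample_architectures(100, rng)   # 100000 in the experiment
```

Because the autoencoder reconstructs its input, each sampled vector serves as both the input and the target during pretraining.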
The autoencoder network was optimized by using an Adam [kingma2015] optimizer with a learning rate of 0.00001. During the training of the autoencoder network, the schedule increased the batch size instead of decaying the learning rate [smith2018], and we saved the weights with the lowest test loss. Policy gradient methods update the probability distribution of actions so that, for an observed state, the controller's actions with high expected reward receive a high probability. Therefore, a sigmoid activation function was used in the middle hidden layer of the autoencoder network; it allows the hidden output to be treated as following a Bernoulli distribution, which is suitable for computing the policy gradient to update the controller network.
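To illustrate why sigmoid outputs are convenient for policy gradients, the sketch below treats each sigmoid output as the success probability of an independent Bernoulli unit and computes the REINFORCE gradient with respect to the pre-sigmoid logits. This is a generic textbook formulation under that assumption, not necessarily the exact update used in NASES; the logits and reward are made up.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_bernoulli(probs, rng):
    """Sample one binary value per unit with the given probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def log_prob(bits, probs):
    """Log-probability of `bits` under independent Bernoulli units."""
    return sum(math.log(p) if b else math.log(1.0 - p)
               for b, p in zip(bits, probs))

def reinforce_logit_gradient(bits, probs, reward):
    """Gradient of reward * log_prob w.r.t. the logits: reward * (b - p)."""
    return [reward * (b - p) for b, p in zip(bits, probs)]

rng = random.Random(0)
logits = [0.0, 2.0, -2.0]
probs = [sigmoid(z) for z in logits]
bits = sample_bernoulli(probs, rng)
grad = reinforce_logit_gradient(bits, probs, reward=1.0)
```

The closed form `b - p` follows from differentiating the Bernoulli log-likelihood through the sigmoid, so no automatic differentiation is needed for this part.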
Other details regarding the experimental procedures are as follows. Three fully connected layers with 1000, 500, and 100 units were used for each of the simulator and decoder networks; the first layer used the ReLU activation function, and the second and third layers used the tanh activation function. The architecture decoder network had the same neural architecture and hyperparameter settings as the architecture simulator network. The loss function was the least square error.
|Block||Method||GPUs||Time (days)||Parameters||Error (%)|
|1||Macro NAS with Q-Learning [Zhong2018]||10||8-10||11.2m||6.92|
|2||Net Transformation [cai2018]||5||2||19.7m||5.70|
|2||NASES + more filters (100 added per layer)||1||0.5||20.4m||3.93|
|3||NAS + more filters||800||21-28||37.4m||3.65|
|3||ENAS + more filters||1||0.32||38.0m||3.87|
|3||NASES + more filters (150 added per layer)||1||0.5||28.4m||3.71|
To understand the loss of information in the embedding space, we evaluated compression ratios of 0.83, 0.67, 0.5, 0.33, 0.17, 0.08, and 0.02 to examine the utility of compression. We set an upper bound on the testing loss as a baseline: if the decoder is incapable of decoding the architecture-embedding, the best strategy is always to predict the average value (15 in this case).
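The value of this baseline can be worked out under the assumption that the loss is the mean squared error per element: always predicting the mean of a uniform distribution on [0, 30] yields an expected loss equal to the variance of that distribution, (30 - 0)^2 / 12 = 75. The sketch below computes this value and checks it by simulation; the sample count and seed are arbitrary.

```python
import random

LOW, HIGH = 0.0, 30.0
mean_value = (LOW + HIGH) / 2.0               # 15.0, the constant prediction
analytic_baseline = (HIGH - LOW) ** 2 / 12.0  # variance of U(0, 30)

# Monte Carlo check of the same quantity.
rng = random.Random(0)
samples = [rng.uniform(LOW, HIGH) for _ in range(100_000)]
empirical_baseline = sum((x - mean_value) ** 2 for x in samples) / len(samples)
```

A trained decoder is useful only if its testing loss falls clearly below this constant-prediction level.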
The effect of the compression ratio for different embedding sizes, obtained by using the autoencoder network, is presented in Table 1. The testing loss represents the loss of information after compression by the autoencoder network, and the loss of information increases as the compression becomes stronger, that is, as the embedding becomes smaller.
Figure 4 illustrates the double-axis plot of the testing loss and embedding size (according to Table 1). As illustrated in Figure 4, a definite trade-off exists between loss and compression: stronger compression (a smaller embedding) leads to higher information loss. In this case, we suggest a range of 20-30 as the appropriate embedding size, which corresponds to the region where the testing loss curve and the embedding size curve cross in the double-axis plot. In this range, the dimensionality is reduced by at least half and the information loss is considerably low. For image classification (the second experiment stage), we followed the results of the first stage and set the compression ratio at 0.33. That is, the origin size of 60 was projected into the embedding size of 20.
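The relation between embedding size and compression ratio is simple arithmetic, sketched below. The list of embedding sizes is an assumption: it is inferred from the evaluated ratios and the origin size of 60 and is not quoted from Table 1.

```python
ORIGIN_SIZE = 60
# Assumed embedding sizes, inferred from the ratios reported in the text.
EMBEDDING_SIZES = [50, 40, 30, 20, 10, 5, 1]

# Compression ratio = embedding size / origin size, rounded to two decimals.
ratios = [round(size / ORIGIN_SIZE, 2) for size in EMBEDDING_SIZES]
```

Under this assumption, the chosen embedding size of 20 gives 20 / 60, which rounds to the ratio of 0.33 used in the second stage.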
Second Stage: Image Classification on CIFAR-10
The second stage of the experiment is multiclass classification, which assigns a class to each image. The CIFAR-10 [krizhevsky2009learning] data set consists of 60000 color (RGB) images of 32 × 32 pixels in 10 classes. Each class has 6000 images, of which 5000 are training data and 1000 are testing data. We applied only three standard preprocessing and data augmentation techniques: (1) subtracting the mean and then dividing by the standard deviation, which ensures that all variables have zero mean and unit standard deviation; (2) centrally padding the training images to 40 × 40 and randomly cropping them back to 32 × 32; and (3) randomly flipping images horizontally.
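Techniques (2) and (3) can be sketched for a single-channel image stored as a list of rows. Zero padding and a flip probability of 0.5 are assumptions, since the text does not state them; standardization is omitted for brevity, and the helper names are illustrative.

```python
import random

def pad_center(image, target):
    """Zero-pad a square 2-D image (list of rows) to target x target."""
    size = len(image)
    before = (target - size) // 2
    after = target - size - before
    rows = [[0.0] * before + list(row) + [0.0] * after for row in image]
    top = [[0.0] * target for _ in range(before)]
    bottom = [[0.0] * target for _ in range(after)]
    return top + rows + bottom

def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]

def random_flip(image, rng):
    """Mirror the image horizontally with probability 0.5 (assumed)."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image

def augment(image, rng, padded=40, out=32):
    return random_flip(random_crop(pad_center(image, padded), out, rng), rng)

rng = random.Random(0)
image = [[float(r * 32 + c) for c in range(32)] for r in range(32)]
augmented = augment(image, rng)
```

Padding by four pixels on each side and cropping back to 32 × 32 shifts the image content by up to four pixels in each direction.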
The training/validation split ratio was 0.9; that is, during the neural search procedure, we randomly split the 50000 training images into 45000 images for training and 5000 images for validation. When the NASES search procedure was complete, we used all 50000 images for training and the 10000 images for testing. Each candidate architecture was trained for 10 epochs in the search phase, and the final architecture was trained for 700 epochs.
The child network is described in the paragraph following the method section. The hyperparameter settings of the child network take ENAS [Pham2018] as a reference. The network was trained with Nesterov momentum [nesterov1983] with a momentum of 0.9. The learning rate followed cosine annealing for each batch (maximum learning rate of 0.05, minimum learning rate of 0.001, and a period equal to the number of epochs) [loshchilov2017], with a batch size of 128 and a weight decay of 1e-4. We initialized the child network with He initialization [he2015]. We designed a 15-layer convolutional architecture by using 60 hyperparameters (each layer has four hyperparameters: number of filters, filter size, kernel type, and connection coefficient). The NASES mechanism is effective here because, with the 0.33 compression ratio, the controller only needs to search over the 20 values of the embedding vector.
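The layout of the 60 hyperparameters can be sketched as a flat vector that is split into 15 groups of four. The ordering of the fields inside the vector is an assumption made for illustration, as are the names.

```python
FIELDS = ("num_filters", "filter_size", "kernel_type", "connection_coefficient")
NUM_LAYERS = 15

def split_into_layers(vector):
    """Group a flat decoded vector into one dict of hyperparameters per layer."""
    expected = NUM_LAYERS * len(FIELDS)
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    step = len(FIELDS)
    return [dict(zip(FIELDS, vector[i:i + step]))
            for i in range(0, len(vector), step)]

# Placeholder vector standing in for a decoded origin architecture.
layers = split_into_layers([float(i) for i in range(60)])
```

In NASES, the controller works on the 20-value embedding, and a vector of this 60-value form would be obtained only after decoding.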
For the controller network, we took the pretrained parameters of the architecture simulator from the first stage of the experiment. The hyperparameter settings and neural architecture of the controller also followed the first stage of the experiment.
For the reward function, we used Eq. 1; this function not only considers the accuracy on the validation set but also estimates the generalization error. We stored the reward in a reward pool in each NASES iteration. Furthermore, we normalized and updated the rewards of the reward pool at the end of each iteration. To reduce the dependence of the final architecture on the initial architectures, we first evaluated ten random architectures to collect rewards without updating the parameters of the controller network. We sampled three architecture examples from the reward pool to update the controller network in each NASES iteration. Furthermore, we applied the epsilon-greedy approach [watkins1989], which acts randomly with probability epsilon; that is, with a probability of 10%, an architecture was generated randomly in order to explore outside the reward pool as well.
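The bookkeeping around the reward pool can be sketched as follows. Several details are assumptions because the text does not specify them: rewards are normalized to zero mean and unit variance, the three examples are drawn uniformly without replacement, and epsilon is 0.1. The reward values are made up, and Eq. 1 itself is not reproduced here.

```python
import random
import statistics

def normalize(rewards):
    """Standardize rewards to zero mean and unit variance (assumed scheme)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def sample_batch(pool, rng, batch_size=3):
    """Draw (architecture, normalized reward) pairs for a controller update."""
    normalized = normalize([reward for _, reward in pool])
    indices = rng.sample(range(len(pool)), batch_size)
    return [(pool[i][0], normalized[i]) for i in indices]

def choose_source(rng, epsilon=0.1):
    """Epsilon-greedy choice between a random and a controller architecture."""
    return "random" if rng.random() < epsilon else "controller"

rng = random.Random(0)
# Hypothetical pool of (architecture id, reward) pairs.
pool = [(f"arch_{i}", r) for i, r in enumerate([0.61, 0.70, 0.55, 0.66, 0.72])]
batch = sample_batch(pool, rng)
```

Normalizing within the pool keeps the scale of the policy gradient stable as the raw accuracies of the collected architectures improve.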
We ran the NASES procedure five times by using different random seeds on a single Nvidia V100 GPU, and each run required approximately 12 hours to determine the final architecture. Furthermore, the average number of architectures searched before reaching the final architecture was 100; this number is considerably lower than that of past NAS approaches.
To determine the performance of the architecture, we evaluated the child network by using the final architecture network on the CIFAR-10 test dataset. Table 2 summarizes the performance of NASES and other NAS approaches by using the macro search algorithms. This final architecture is presented in Figure 5.
In Table 2, the approaches have been divided into three levels based on the number of parameters. The first block of Table 2 presents the performance of NAS approaches; the NASES final architecture achieves a 4.07% error rate on the testing set with only 8 million parameters, which is comparable with other NAS approaches. To compare with more approaches and models, we added filters to each layer of the final architecture. The second block of Table 2 presents the performance of NAS approaches when the number of parameters is approximately 20 million; the NASES final architecture improves to a 3.93% error rate when 100 filters are added to each layer. Finally, to evaluate a high-parameter network, we added 150 filters to each layer. Notably, the NASES final architecture that achieved a 3.71% error rate used only 28.4 million parameters, fewer than the approximately 38 million parameters used by ENAS [Pham2018] and NAS [zoph2017]. NASES required approximately half a GPU day to discover the final architecture. The performance and effectiveness of NASES were impressive given that only architecture-embedding search and a pretrained controller were applied, without other NAS tricks such as parameter sharing [Pham2018].
NAS with reinforcement learning is a powerful and novel framework for the automatic discovery of neural architectures. Here, we designed a novel NAS framework that alleviates the two problems of NAS with reinforcement learning, namely the noncontinuous and high-dimensional search space. We named this framework NASES; in it, the controller searches on the embedding space by using the architecture decoder and architecture simulator. We achieved favorable results for image classification on CIFAR-10; NASES exhibited efficient performance and high effectiveness even though the number of searched architectures was reduced to 100. We also proposed a simple method to estimate the compression ratio of architecture-embedding. The code of NASES for running on CIFAR-10 can be found at https://github.com/jimliu741523.