1 Introduction
In the era of deep learning, designing a reasonable network architecture for a specific problem is a challenging task. Designing architectures with state-of-the-art performance typically requires substantial effort from human experts. In order to eliminate such exhausting engineering, much research has been devoted to automatically searching architectures, namely neural architecture search (NAS), which has achieved significant successes in a multitude of fields, including image classification (Zoph et al., 2018; Real et al., 2018; Liu et al., 2018a, b; Xie et al., 2018), semantic segmentation (Chen et al., 2018), and object detection (Zoph et al., 2018).
So far, three basic frameworks have gained growing interest, i.e., evolution-based NAS, reinforcement learning-based NAS, and gradient-based NAS. Primitively, evolution-based NAS employs evolutionary algorithms to jointly learn architectures and network parameters, such as NEAT (Stanley & Miikkulainen, 2002). Due to the uneconomical search strategy of evolutionary algorithms, evolution-based NAS inherently requires tremendous time and resources. For instance, it takes 3150 GPU days for AmoebaNet (Real et al., 2018) to achieve state-of-the-art performance compared with human-designed architectures. To accelerate the process, reinforcement learning-based NAS using back-propagation is a natural and plausible choice. Compared with evolution-based NAS, many reinforcement learning-based NAS methods like ENAS (Pham et al., 2018)
can dramatically reduce the time and resource consumption despite the similar optimization mechanism. However, since reinforcement learning-based NAS is generally formulated as a Markov decision process, temporal-difference learning is utilized to make structural decisions. As a result, the reward in NAS only becomes observable after the architecture is chosen and the network is evaluated for accuracy, so the search is always subject to the delayed-reward problem of temporal-difference learning (Arjona-Medina et al., 2018). To eliminate this deficiency, gradient-based NAS methods, including DARTS (Liu et al., 2018b) and SNAS (Xie et al., 2018), have recently been presented. They convert the discrete search space into a continuous one, so that architectures and parameters can be jointly optimized by gradient descent.
In this work, we develop an effective and efficient NAS method, Differentiable ARchiTecture Search with Ensemble Gumbel-Softmax (DARTS-EGS), which is capable of discovering more diversified network architectures while maintaining the differentiability of the NAS pipeline. In order to guarantee the diversity of network architectures along the search path, we represent the whole search space with binary codes. Benefiting from such modeling, all feed-forward networks consisting of a predefined number of ordered nodes are included in our search space. To maintain the prerequisite efficiency in searching, we develop ensemble Gumbel-Softmax to replace the traditional softmax, yielding a differentiable end-to-end mechanism. That is, our ensemble Gumbel-Softmax can perform arbitrary structural decisions like policy gradient in reinforcement learning, but more efficiently than temporal-difference learning with its delayed rewards.
To sum up, the main contributions of this work are:
By generalizing the traditional Gumbel-Softmax, we develop an ensemble Gumbel-Softmax, which provides a successful attempt to effectively and efficiently perform arbitrary structural decisions, like policy gradient in reinforcement learning but with higher efficiency.
Benefiting from the ensemble Gumbel-Softmax, the search space can be dramatically enlarged while maintaining the requisite search efficiency, which yields an end-to-end mechanism to identify network architectures under a given computational budget.
Extensive experiments verify that our model outperforms current models in searching high-performance convolutional architectures for image classification and recurrent architectures for language modeling.
2 Related Work
Recently, discovering neural architectures automatically has raised great interest in both academia and industry (Bello et al., 2016; Baker et al., 2017; Brock et al., 2017; Tran et al., 2017; Suganuma et al., 2018; Veit & Belongie, 2018). Neural architecture search methods can be roughly divided into three classes according to the search strategy (Elsken et al., 2018a), i.e., evolution-based NAS, reinforcement learning-based NAS, and gradient-based NAS.
Evolution-based neural architecture search methods (Real et al., 2017; Miikkulainen et al., 2017; Real et al., 2018; Elsken et al., 2018b; Kamath et al., 2018) utilize evolutionary algorithms to generate neural architectures automatically. In (Real et al., 2017, 2018; Elsken et al., 2018b), a large CNN architecture space is explored, and modifications like inserting a layer, adjusting filter sizes, and adding identity mappings are designed as mutations in evolution. Despite the remarkable achievements, these methods require huge computational resources and are less practical at large scale.
Reinforcement learning-based neural architecture search methods are prevalent in recent work (Zoph & Le, 2016; Bello et al., 2017; Zoph et al., 2018; Zhong et al., 2018; Pham et al., 2018). In the pioneering work (Zoph & Le, 2016), an RNN is utilized as the controller to decide the types and parameters of layers sequentially. The controller is trained by reinforcement learning, with the accuracy of the generated architecture as the reward. Although it achieves impressive results, the searching process is computationally hungry, requiring 800 GPUs. Based on (Zoph & Le, 2016), several methods have been proposed to accelerate the search process. Specifically, (Zoph et al., 2018; Zhong et al., 2018) diminish the search space by searching the architecture of a block and then stacking the searched block to generate the final network. In (Pham et al., 2018), the weights of the network are shared among child models, saving search time by reducing the cost of training and evaluating each child model. Additionally, a series of well-performing methods have also been explored, including progressive search (Liu et al., 2018a) and multi-objective optimization (Tan et al., 2018; Hsu et al., 2018).
Contrary to treating architecture search as a black-box optimization problem, gradient-based neural architecture search methods utilize the gradients obtained in the training process to optimize the neural architecture (Shin et al., 2018; Luo et al., 2018; Liu et al., 2018b; Xie et al., 2018). Typically, NAO (Luo et al., 2018) utilizes RNNs as an encoder and decoder to map architectures into a continuous embedding space and conducts optimization in this space with gradient-based methods. Another typical method, DARTS (Liu et al., 2018b), chooses the best connection between nodes from a candidate primitive set by employing a softmax classifier. Although DARTS achieves impressive results, the discretization step is not fully differentiable, and the connection between two nodes is limited to a single operation from the candidate primitive set.
3 Method
In the following, Section 3.1 introduces our notation, defines the search space, and presents our strategy of encoding network architectures with binary codes. Section 3.2 introduces a conceptually intuitive yet powerful relaxation for searching in the discrete search space. Section 3.3 elaborates the proposed ensemble Gumbel-Softmax estimator, which plays a key role in constructing our scheme for jointly optimizing the architecture and its weights. Finally, Section 3.4 details the objective in our model.
3.1 Search Space
To balance the optimality and efficiency in NAS, we search for computation cells that constitute the whole network, following prior works (Zoph et al., 2018; Liu et al., 2018a; Real et al., 2018; Pham et al., 2018; Liu et al., 2018b; Xie et al., 2018; Cai et al., 2018). In essence, every such cell is a sub-network, which can be naturally considered as a directed acyclic graph (DAG) consisting of an ordered sequence of nodes. To cover abundant network architectures, our search space is set as the whole space of DAGs with a predefined number of nodes.
Without loss of generality and for simplicity, we denote a cell with $N$ nodes as $\{E_{i,j}\}$, where $E_{i,j}$ indicates a directed edge from the $i$-th node to the $j$-th node. Corresponding to each directed edge $E_{i,j}$, there is a set of candidate primitive operations $\mathcal{O} = \{o^{(1)}, \dots, o^{(K)}\}$, such as convolution, pooling, identity, and zero ("zero" means no connection between two nodes), as defined in DARTS (Liu et al., 2018b). With these operations, the output $x_j$ at the $j$-th node can be formulated as

$x_j = \sum_{i < j} f_{i,j}(x_i), \quad (1)$

where $x_i$ denotes the input from the $i$-th node, and $f_{i,j}$ is a function applied to $x_i$ which can be decomposed into a superposition of primitive operations in $\mathcal{O}$, i.e.,

$f_{i,j}(x_i) = \sum_{k=1}^{K} c_{i,j}^{(k)} \, o^{(k)}(x_i), \quad (2)$

where $o^{(k)}$ is the $k$-th candidate primitive operation in $\mathcal{O}$, and $c_{i,j}^{(k)} \in \{0, 1\}$ signifies a binary and thus discrete weight indicating whether the operation is utilized on the edge $E_{i,j}$. For clarity, we introduce the binary set $C = \{c_{i,j}\}$ as the network architecture code to represent a cell, and $c_{i,j} = (c_{i,j}^{(1)}, \dots, c_{i,j}^{(K)})$ as the edge architecture code to represent the structure of the edge $E_{i,j}$. To reveal the capability of the above encoding for representing cells, we have the following proposition.
Proposition 1. For an arbitrary feed-forward network consisting of a limited number of ordered nodes and candidate operations in $\mathcal{O}$, there is one and only one architecture code $C$ that corresponds to it.
Proposition 1 guarantees the uniqueness of our model encoding. Meanwhile, it also implies that our search space includes the whole set of feed-forward networks and guarantees the expressive capability of the formulation in Eq. (2). With such modeling, many previously defined search spaces can be regarded as subsets of ours. For example, our modeling degrades into that of DARTS (Liu et al., 2018b) and SNAS (Xie et al., 2018) when $\sum_{k} c_{i,j}^{(k)} = 1$ is further introduced as a constraint. Limited by this constraint, only one operation can be chosen on each edge, which reduces searching to a one-category problem.
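To make the encoding concrete, the following sketch builds edge and cell codes for a toy cell. The four-operation primitive set and all function names are illustrative stand-ins, not the paper's implementation.

```python
import itertools

# Hypothetical candidate primitive set O (names for illustration only).
PRIMITIVES = ["conv3x3", "maxpool", "identity", "zero"]

def edge_code(chosen_ops):
    """Binary edge code c_{i,j}: entry k is 1 iff the k-th primitive is used."""
    return tuple(int(op in chosen_ops) for op in PRIMITIVES)

def cell_code(n_nodes, edges):
    """Architecture code C: one edge code per ordered pair (i, j), i < j.

    `edges` maps a pair (i, j) to the set of operations applied on it;
    absent pairs get the all-zero code (no connection).
    """
    return {(i, j): edge_code(edges.get((i, j), set()))
            for i, j in itertools.combinations(range(n_nodes), 2)}

# A residual-style edge uses two compatible operations at once, which a
# single one-hot choice per edge (as in DARTS/SNAS) cannot express.
code = cell_code(3, {(0, 1): {"conv3x3"}, (0, 2): {"conv3x3", "identity"}})
```

Every distinct feed-forward wiring of the three nodes maps to a distinct `code` dictionary, matching the uniqueness claim of Proposition 1.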
3.2 Relaxation of The Search Space
Benefiting from the uniqueness property of our architecture code $C$, as verified in Proposition 1, the task of learning the cell can therefore be converted to approaching the optimal code. Yet, as mentioned, one major obstacle to efficient NAS is the difficulty of optimization in the discrete space. In this subsection, we introduce our strategy of searching in a continuous space as a proxy. Without loss of generality, we elaborate the strategy on the edge $E_{i,j}$.
To make the search space continuous, we relax the categorical choice of a set of particular operations as follows:

$f_{i,j}(x_i) = \sum_{k=1}^{K} g\left(p_{i,j}^{(k)}\right) o^{(k)}(x_i), \quad (3)$

where $p_{i,j}^{(k)}$ denotes the probability of choosing the $k$-th operation on the edge $E_{i,j}$, and $g(\cdot)$ represents a binary function that suffices to map a probability to a binary code and pass gradients in a continuous manner. We discuss it further in Section 3.3 and explain how we manage to pass gradients through it. Specifically, $g(\cdot)$ is chosen to be monotonically increasing in our method, i.e., a larger probability $p_{i,j}^{(k)}$ makes the outcome $g(p_{i,j}^{(k)}) = 1$ more likely.
By substituting $g(p_{i,j}^{(k)})$ for $c_{i,j}^{(k)}$ and considering $p_{i,j}$ instead as the variable to be optimized, we achieve a continuous relaxation. Benefiting from the flexibility of this formulation, it is capable of modeling multi-category problems and incorporating resource constraints seamlessly, as discussed below.
3.2.1 Modeling Multi-Category Problems
Different from a traditional probability vector, which can only manage one-category problems, the formulation in Eq. (3) is able to solve multi-category problems. This superiority inherently comes from the monotonicity of the function $g(\cdot)$, which endows the formulation with the capability of modeling more general relationships between different categories, rather than mutual exclusivity alone.
3.2.2 Modeling Resource Constraint
In NAS, desired methods should not only be effective (i.e., achieve high test-set accuracy) but also efficient (i.e., incur low search cost). Aiming at a high-performing NAS framework, we integrate the two parts and represent the probability as

$p_{i,j}^{(k)} = \lambda \, a_{i,j}^{(k)} + (1 - \lambda) \, e_{i,j}^{(k)}, \quad (5)$

where $a_{i,j}^{(k)}$ and $e_{i,j}^{(k)}$ represent the credits in terms of effectiveness and efficiency when choosing the $k$-th operation on the edge $E_{i,j}$, and $\lambda$ is a hyper-parameter for balancing the two parts. In practice, the validation-set accuracy and search time are employed to measure the effectiveness and efficiency of our model respectively, and $\lambda$ is fixed throughout this work.
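As a small numerical illustration of blending the two credits: the linear form, the credit values, and the default balance weight below are assumptions for illustration, not the paper's exact formulation or setting.

```python
def combined_credit(acc_credit, eff_credit, lam=0.5):
    """Blend per-operation effectiveness credits (e.g., validation accuracy)
    and efficiency credits (e.g., inverse search time) into one probability
    score per candidate operation; `lam` balances the two parts.
    The 0.5 default is an illustrative placeholder, not the paper's value."""
    return [lam * a + (1.0 - lam) * e for a, e in zip(acc_credit, eff_credit)]

# A cheap-but-weaker op can outscore an accurate-but-slow one under this blend.
scores = combined_credit([0.9, 0.7], [0.2, 0.8], lam=0.5)
```

Tuning `lam` toward 1 recovers a purely accuracy-driven search, while `lam` near 0 favors lightweight operations.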
3.3 Optimization with Ensemble Gumbel-Softmax
Although the relaxation presented in Section 3.2 makes the search space continuous, how to define the binary function $g(\cdot)$ that maps each probability to a binary code remains to be sorted out. In principle, it is still a problem of learning to take concrete decisions, which is unfortunately non-differentiable and has no gradients almost everywhere. To leverage gradient information as in generic deep learning, we introduce an ensemble Gumbel-Softmax estimator to optimize the problem with a principled approximation. As such, the back-propagation algorithm can be directly adopted in an end-to-end manner, yielding an efficient and effective searching mechanism.
3.3.1 Gumbel-Softmax
A natural formulation for representing discrete variables is the categorical distribution. However, partially due to the inability to back-propagate gradients through samples, it is rarely applied in deep learning. In this work, we resort to the Gumbel-Max trick (Gumbel, 1954) to enable back-propagation, representing the process of taking a decision as sampling from a categorical distribution, in order to perform NAS in a principled way. Specifically, given a probability vector $p = (p_1, \dots, p_K)$
and a discrete random variable $Z$ with $\Pr(Z = k) \propto p_k$, we sample from the discrete variable $Z$ by introducing Gumbel random variables. More specifically, we let

$Z = \arg\max_{k} \left( \log p_k + g_k \right), \quad (6)$

where $g_1, \dots, g_K$ is a sequence of i.i.d. standard Gumbel random variables, typically sampled as $g_k = -\log(-\log u_k)$ with $u_k \sim \mathrm{Uniform}(0, 1)$. An obstacle to directly using this approach in our problem is that the argmax operation is not continuous. A straightforward way of dealing with this problem is to replace the argmax with a softmax (Jang et al., 2016; Maddison et al., 2016). Formally, the Gumbel-Softmax (GS) estimation can be expressed as

$y_k = \frac{\exp\left( (\log p_k + g_k)/\tau \right)}{\sum_{k'=1}^{K} \exp\left( (\log p_{k'} + g_{k'})/\tau \right)}, \quad k = 1, \dots, K, \quad (7)$

where $y_k$ indicates the probability that $k$ is the maximal entry, and $\tau$ is a temperature. When $\tau \to 0$, the vector $y = (y_1, \dots, y_K)$ converges to a one-hot vector, and in the other extreme ($\tau \to \infty$) it becomes a discrete uniform distribution with $y_k = 1/K$.
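The sampling procedure above can be sketched in a few lines of NumPy. This is a generic Gumbel-Softmax implementation under the stated definitions, not code from the paper; the function name is our own.

```python
import numpy as np

def gumbel_softmax(p, tau, rng):
    """Draw one relaxed sample y from probabilities p via Gumbel-Softmax;
    tau is the temperature controlling how close y is to one-hot."""
    g = -np.log(-np.log(rng.uniform(size=len(p))))  # standard Gumbel noise
    logits = (np.log(p) + g) / tau
    y = np.exp(logits - logits.max())               # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.7])
y_sharp = gumbel_softmax(p, tau=0.01, rng=rng)   # near one-hot
y_smooth = gumbel_softmax(p, tau=100.0, rng=rng) # near uniform
```

With a small temperature the argmax of the relaxed sample follows the categorical distribution `p`, which is what makes the relaxation a faithful proxy for discrete sampling.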
From the expression in Eq. (7), we see that the traditional Gumbel-Softmax pertains solely to problems in which only one category needs to be determined. In NAS, however, an optimal architecture may require multiple operations on an edge, considering their practical significance (He et al., 2016; Szegedy et al., 2016). For instance, the residual module in ResNets (He et al., 2016) consists of two operations: a learnable mapping $\mathcal{F}(x)$ and the identity $x$. That is, choosing different operations in $\mathcal{O}$ may not be mutually exclusive but compatible. One direct way of handling this limitation is to map all possible operation combinations to $2^K$-dimensional one-hot vectors, where $K$ is the number of candidate operations in $\mathcal{O}$. However, it is difficult to search architectures efficiently in this way when there are many candidate operations (i.e., $K$ is large).
3.3.2 Ensemble Gumbel-Softmax
In order to address the aforementioned limitation of the traditional Gumbel-Softmax, we propose our Ensemble Gumbel-Softmax (EGS) estimator. With the assistance of this estimator, we are able to choose multiple operations on each edge by sampling more diversified codes, not limited to one-hot codes merely. To this end, we start by recoding binary codes. For the code $c_{i,j}$ on an edge $E_{i,j}$, the whole architecture information included in $c_{i,j}$ can be recoded into a superposition of one-hot vectors, i.e.,

$c_{i,j} = \sum_{k:\, c_{i,j}^{(k)} = 1} h^{(k)}, \quad (8)$

where $h^{(k)}$ is a $K$-dimensional one-hot vector that uniquely corresponds to the operation $o^{(k)}$. This equivalence relationship reveals that compositing results sampled from Gumbel-Softmax may be a feasible way to sample $c_{i,j}$ directly, since Gumbel-Softmax is capable of sampling each $h^{(k)}$, as described in Section 3.3.1.
For an effective and efficient sampler, we composite multiple such sampling results. That is, an ensemble of multiple Gumbel-Softmax samplers suffices to sample diversified binary codes. For a $K$-dimensional probability vector $p$ and $M$ one-hot vectors $z_1, \dots, z_M$ sampled from the Gumbel-Softmax, the $K$-dimensional binary code $b$ sampled with ensemble Gumbel-Softmax is

$b^{(k)} = 1 - \prod_{m=1}^{M} \left( 1 - z_m^{(k)} \right), \quad k = 1, \dots, K, \quad (9)$

where $M$ is the number of sampling times, $b^{(k)}$ is the $k$-th element in $b$, and $z_m^{(k)}$ indicates the $k$-th element in $z_m$.
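A minimal sketch of this sampling step, using hard Gumbel-Max draws and an element-wise OR to superpose them. The differentiable softmax relaxation and straight-through trick are omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def ensemble_gumbel_softmax_code(p, m, rng):
    """Sample a K-dim binary code: take M Gumbel-Max draws from p and
    superpose the resulting one-hot vectors (element-wise OR).
    Hard sampling only; the gradient-carrying relaxation is omitted."""
    k = len(p)
    code = np.zeros(k, dtype=int)
    for _ in range(m):
        g = -np.log(-np.log(rng.uniform(size=k)))  # standard Gumbel noise
        code[np.argmax(np.log(p) + g)] = 1         # one Gumbel-Max draw
    return code

rng = np.random.default_rng(0)
code = ensemble_gumbel_softmax_code(np.full(4, 0.25), m=3, rng=rng)
```

Because draws may repeat, the resulting code has between 1 and `m` ones, matching Eq. (9): an entry is set whenever at least one of the M draws selects it.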
Table 1: Comparison with state-of-the-art image classifiers on CIFAR-10.

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Ops | Search Method |
|---|---|---|---|---|---|
| DenseNet-BC (Huang et al., 2017) | 3.46 | 25.6 | - | - | manual |
| NASNet-A + cutout (Zoph et al., 2018) | 2.65 | 3.3 | 2000 | 13 | RL |
| BlockQNN (Zhong et al., 2018) | 3.54 | 39.8 | 96 | 8 | RL |
| ENAS + cutout (Pham et al., 2018) | 2.89 | 4.6 | 0.5 | 6 | RL |
| AmoebaNet-A (Real et al., 2018) | 3.34 | 3.2 | 3150 | 19 | evolution |
| AmoebaNet-B + cutout (Real et al., 2018) | 2.55 | 2.8 | 3150 | 19 | evolution |
| Hierarchical evolution (Liu et al., 2017) | 3.75 | 15.7 | 300 | 6 | evolution |
| PNAS (Liu et al., 2018a) | 3.41 | 3.2 | 225 | 8 | SMBO |
| DARTS (first order) + cutout (Liu et al., 2018b) | 3.00 | 3.3 | 1.5 | 7 | gradient-based |
| DARTS (second order) + cutout (Liu et al., 2018b) | 2.76 | 3.3 | 4 | 7 | gradient-based |
| SNAS + mild + cutout (Xie et al., 2018) | 2.98 | 2.9 | 1.5 | - | gradient-based |
| SNAS + moderate + cutout (Xie et al., 2018) | 2.85 | 2.8 | 1.5 | - | gradient-based |
| SNAS + aggressive + cutout (Xie et al., 2018) | 3.10 | 2.3 | 1.5 | - | gradient-based |
| Random search baseline + cutout | 3.29 | 3.2 | 4 | 7 | random |
3.3.3 Understanding Our EGS
To reveal the serviceability and sampling capability of our ensemble Gumbel-Softmax, two basic propositions follow from its definition.

First, whether an element in the sampled binary code is one depends on the probability at the corresponding location.
Proposition 2. For an arbitrary probability vector $p$ and number of sampling times $M$, we have

$\Pr\left( b^{(k)} = 1 \right) = 1 - \left( 1 - \pi_k \right)^{M},$

where $\pi_k$ is the probability of $z_m^{(k)} = 1$ in a single Gumbel-Softmax draw. Furthermore, $\Pr(b^{(k)} = 1)$ increases monotonically with $p_k$.
Second, the capability of sampling binary codes is determined by $M$, i.e., the number of one-hot vectors sampled from Gumbel-Softmax. Specifically, we have

Proposition 3. For an arbitrary probability vector $p$ and number of sampling times $M$, the ensemble Gumbel-Softmax is capable of sampling $\sum_{m=1}^{M} \binom{K}{m}$ different binary codes, namely all binary codes with at least one and at most $M$ ones.
Proposition 3 indicates that the sampling capability of EGS increases exponentially with $M$. In practice, a larger $M$ can be employed to handle more complex tasks effectively, while a smaller one can be utilized to search more lightweight networks efficiently.
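The count in Proposition 3 is easy to check numerically for small $K$; the helper names below are our own.

```python
from itertools import product
from math import comb

def egs_capacity(k, m):
    """Number of distinct binary codes reachable with M draws over K ops:
    all K-bit codes with at least one and at most min(M, K) ones."""
    return sum(comb(k, i) for i in range(1, min(m, k) + 1))

def brute_force(k, m):
    """Exhaustive check: enumerate all K-bit codes and count those with
    between 1 and M ones."""
    return sum(1 for bits in product((0, 1), repeat=k) if 1 <= sum(bits) <= m)
```

For example, with $K = 8$ operations and $M = 3$ draws the estimator can reach $8 + 28 + 56 = 92$ distinct edge codes, versus only 8 one-hot codes for the plain Gumbel-Softmax.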
In summary, our ensemble Gumbel-Softmax is not only applicable to searching network architectures but also possesses the sampling capability to guarantee performance in theory. In Figure 1, a visualized comparison between Gumbel-Softmax and ensemble Gumbel-Softmax intuitively shows that our ensemble Gumbel-Softmax surpasses the traditional Gumbel-Softmax in terms of both sampling capability and practical rationality.
3.4 Parameter Learning and Architecture Sampling
To avoid compromising the NAS pipeline, the parameters and architectures are simultaneously optimized in terms of ensemble Gumbel-Softmax. Analogous to architecture search in (Zoph & Le, 2016; Zoph et al., 2018; Pham et al., 2018; Liu et al., 2017; Real et al., 2018; Liu et al., 2018b), the validation-set performance is considered as the reward in our model, but in an end-to-end differentiable manner. Denote the training loss as $\mathcal{L}_{train}(w, C)$, where $w$ denotes the network weights and $C$ the architecture code. The goal in architecture search is to find a high-performance architecture, i.e.,

$\min_{p, w} \; \mathbb{E}_{C \sim \mathrm{EGS}(p)} \left[ \mathcal{L}_{train}(w, C) \right].$
The main process of optimizing this objective is to minimize the expected loss of architectures sampled with $\mathrm{EGS}(p)$. That is, a network is first sampled with $\mathrm{EGS}(p)$. Afterward, the loss on the training dataset is calculated by forward propagation. Relying on this loss, the gradients with respect to the architecture parameters $p$ and the network weights $w$ are computed to update both. Because of the differentiability, our model can be trained end-to-end by the standard back-propagation algorithm. In the end, the network architecture is identified by sampling with $\mathrm{EGS}(p)$, and the network weights are estimated by retraining on the training set. A conceptual visualization of our DARTS-EGS model is illustrated in Figure 2.
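The alternating procedure can be caricatured with a toy numerical sketch. The dummy loss, the surrogate update rule, and all constants below are invented for illustration; they do not reflect the paper's gradient estimator, only the overall loop structure (sample a code, evaluate a loss, update architecture probabilities and weights, then read off the architecture).

```python
import numpy as np

rng = np.random.default_rng(0)

def train_loss(code, w):
    """Dummy stand-in loss: rewards including op 2, penalizes extras."""
    return 1.0 - 0.9 * code[2] + 0.1 * (code[0] + code[1]) + 0.01 * w ** 2

def sample_code(p, m):
    """Hard ensemble Gumbel-Max sampling of a binary code (see Section 3.3.2)."""
    code = np.zeros(len(p), dtype=int)
    for _ in range(m):
        g = -np.log(-np.log(rng.uniform(size=len(p))))
        code[np.argmax(np.log(p) + g)] = 1
    return code

p = np.full(3, 1.0 / 3.0)  # architecture probabilities for one edge
w = 1.0                    # a single stand-in network weight
for step in range(200):
    code = sample_code(p, m=2)
    loss = train_loss(code, w)
    # Surrogate update: pull p toward sampled codes, weighted by how good
    # they were (1.5 upper-bounds the dummy loss). Purely illustrative.
    p = p + 0.05 * (code - p) * (1.5 - loss)
    p = np.clip(p, 1e-3, None)
    p /= p.sum()
    w -= 0.1 * 0.02 * w    # gradient step on the dummy weight term
```

After the loop, sampling with the learned `p` plays the role of the final architecture-identification step, and the weight would then be retrained from scratch.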
4 Experiments
In this section, we carry out extensive experiments to verify the capability of our model in discovering high-performance convolutional networks for image classification and recurrent networks for language modeling. For each task, the experiments consist of two stages, following the previous work (Liu et al., 2018b; Xie et al., 2018). First, we search for cell architectures based on our ensemble Gumbel-Softmax and find the best cells according to their validation performance. Second, the transferability of the best cells learned on CIFAR-10 (Krizhevsky & Hinton, 2009) and Penn Treebank (PTB) (Taylor et al., 2003) is investigated by employing them on larger datasets, i.e., classification on ImageNet (Deng et al., 2009) and language modeling on WikiText-2 (WT2) (Merity et al., 2016), respectively. As our work directly improves upon DARTS, the experimental settings are inherited from it, except for some special settings in each experiment.
4.1 Image Classification
Table 2: Comparison with state-of-the-art classifiers on ImageNet (mobile setting).

| Architecture | Top-1 Error (%) | Top-5 Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| Inception-v1 (Szegedy et al., 2015) | 30.2 | 10.1 | 6.6 | - | manual |
| MobileNet (Howard et al., 2017) | 29.4 | 10.5 | 4.2 | - | manual |
| ShuffleNet-v1 2x (Zhang et al., 2018) | 26.3 | - | 5 | - | manual |
| ShuffleNet-v2 2x (Ma et al., 2018) | 25.1 | - | 5 | - | manual |
| AmoebaNet-A (Real et al., 2018) | 25.5 | 8.0 | 5.1 | 3150 | evolution |
| AmoebaNet-B (Real et al., 2018) | 26.0 | 8.5 | 5.3 | 3150 | evolution |
| AmoebaNet-C (Real et al., 2018) | 24.3 | 7.6 | 6.4 | 3150 | evolution |
| PNAS (Liu et al., 2018a) | 25.8 | 8.1 | 5.1 | 225 | SMBO |
| DARTS (searched on CIFAR-10) (Liu et al., 2018b) | 26.7 | 8.7 | 4.7 | 4 | gradient-based |
| SNAS (mild constraint) (Xie et al., 2018) | 27.3 | 9.2 | 4.3 | 1.5 | gradient-based |
| NASNet-A (Zoph et al., 2018) | 26.0 | 8.4 | 5.3 | 2000 | RL |
| NASNet-B (Zoph et al., 2018) | 27.2 | 8.7 | 5.3 | 2000 | RL |
| NASNet-C (Zoph et al., 2018) | 27.5 | 9.0 | 4.9 | 2000 | RL |
4.1.1 Architecture Search on CIFAR-10
In our experiments, eight typical operations are included in the candidate primitive set $\mathcal{O}$: 3x3 and 5x5 separable convolutions, 3x3 and 5x5 dilated separable convolutions, 3x3 max pooling, 3x3 average pooling, identity, and zero. In order to preserve the spatial resolution, all operations are of stride one, and the convolutional feature maps are padded if necessary. In ensemble Gumbel-Softmax, the number of sampling times $M$ is set to yield a rich search space. Following the settings in the previous work (Liu et al., 2018a, b; Xie et al., 2018), the ReLU-Conv-BN order is utilized for all convolution operations, and every separable convolution is always applied twice.
The settings of the nodes in our convolutional cell also follow the previous work (Zoph et al., 2018; Real et al., 2018; Liu et al., 2018a, b). Specifically, every cell consists of a fixed number of ordered nodes, among which the output node is defined as the depthwise concatenation of all the intermediate nodes. Larger networks are built by stacking multiple cells together. In the $k$-th cell, the first and second nodes are set to the outputs of the $(k-2)$-th and $(k-1)$-th cells respectively, with 1x1 convolutions inserted as necessary. Furthermore, reduction cells, with their own architecture code, are utilized at 1/3 and 2/3 of the total depth of the network; the remaining cells are normal cells sharing a separate architecture code.
4.1.2 Architecture Evaluation on CIFAR-10
To evaluate the selected architecture, a large network of 20 cells is trained from scratch for 600 epochs with batch size 96, and its performance is reported on the test set. For a fair comparison, we add additional enhancements following existing works, including cutout with size 16, path dropout with probability 0.2, and auxiliary towers with weight 0.4. We report the mean of 4 independent runs for our full model.
Figure 3 and Table 1 give the searched architectures and classification results on CIFAR-10, which show that our DARTS-EGS achieves results comparable to the state of the art with fewer computational resources. Such performance verifies that DARTS-EGS can effectively and efficiently search worthy architectures for classification. Furthermore, in DARTS-EGS, higher accuracy is yielded with a larger number of sampling times $M$. This is in accordance with our motivation that richer search spaces are beneficial for searching architectures.
Table 3: Comparison with state-of-the-art language models on PTB.

| Architecture | Valid Perplexity | Test Perplexity | Params (M) | Search Cost (GPU days) | Ops | Search Method |
|---|---|---|---|---|---|---|
| Variational RHN (Zilly et al., 2016) | 67.9 | 65.4 | 23 | - | - | manual |
| LSTM (Merity et al., 2017) | 60.7 | 58.8 | 24 | - | - | manual |
| LSTM + skip connections (Melis et al., 2017) | 60.9 | 58.3 | 24 | - | - | manual |
| LSTM + 15 softmax experts (Yang et al., 2017) | 58.1 | 56.0 | 22 | - | - | manual |
| DARTS (first order) (Liu et al., 2018b) | 60.2 | 57.6 | 23 | 0.5 | 4 | gradient-based |
| DARTS (second order) (Liu et al., 2018b) | 58.1 | 55.7 | 23 | 1 | 4 | gradient-based |
| NAS (Zoph & Le, 2016) | - | 64.0 | 25 | 1e4 CPU days | 4 | RL |
| ENAS (Pham et al., 2018) | 68.3 | 63.1 | 24 | 0.5 | 4 | RL |
| Random search baseline | 61.8 | 59.4 | 23 | 2 | 4 | random |
Table 4: Comparison with state-of-the-art language models on WT2.

| Architecture | Valid Perplexity | Test Perplexity | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| LSTM + augmented loss (Inan et al., 2016) | 91.5 | 87.0 | 28 | - | manual |
| LSTM + cache pointer (Grave et al., 2016) | - | 68.9 | - | - | manual |
| LSTM (Merity et al., 2017) | 69.1 | 66.0 | 33 | - | manual |
| LSTM + skip connections (Melis et al., 2017) | 69.1 | 65.9 | 24 | - | manual |
| LSTM + 15 softmax experts (Yang et al., 2017) | 66.0 | 63.3 | 33 | - | manual |
| DARTS (searched on PTB) (Liu et al., 2018b) | 69.5 | 66.9 | 33 | 1 | gradient-based |
| ENAS (searched on PTB) (Pham et al., 2018) | 72.4 | 70.4 | 33 | 0.5 | RL |
Table 5: Sensitivity to the number of sampling times M on CIFAR-10.

| Number of Sampling Times (M) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Test Error (%) | 3.38 | 3.22 | 3.09 | 3.05 | 3.01 | 2.84 | 2.79 | 2.74 | 2.73 |
4.1.3 Transferability Evaluation on ImageNet
We apply the mobile setting, where the input image size is 224x224 and the number of multiply-add operations of the model is restricted to be under 600M. A network of 14 cells is trained for 250 epochs with batch size 128, weight decay 3e-5, and a poly learning rate scheduler with initial learning rate 0.1. Label smoothing (Szegedy et al., 2016) and an auxiliary loss (Lee et al., 2015) are used during training. Other hyperparameters follow (Liu et al., 2018b).
In Table 2, we report the quantitative results on ImageNet. Note that the cell searched on CIFAR-10 can be smoothly employed to deal with the large-scale classification task. Furthermore, larger margins over the other gradient-based NAS methods are observed on ImageNet. A possible reason is that DARTS-EGS can search more complex architectures because of its larger search space, and such architectures handle the more complex ImageNet task better.
4.2 Language Modeling
4.2.1 Architecture Search on PTB
In the language modeling task, our model is employed to search for suitable activation functions between nodes. Following the settings in (Zoph et al., 2018; Pham et al., 2018; Liu et al., 2018b), five popular functions are considered in the candidate primitive set $\mathcal{O}$: sigmoid, tanh, ReLU, identity, and zero. In addition, the recurrent cell consists of a fixed number of nodes, and the number of sampling times $M$ in ensemble Gumbel-Softmax is set to yield a rich search space. Similar to ENAS (Pham et al., 2018) and DARTS (Liu et al., 2018b),
in cells, the very first intermediate node is obtained by linearly transforming the two input nodes, adding up the results, and then passing them through the tanh function; the rest of the activation functions are learned with our model and enhanced with the highway bypass (Zilly et al., 2017). Batch normalization (Ioffe & Szegedy, 2015) is applied in each node to prevent gradient explosion during architecture search, and disabled during architecture evaluation. Our recurrent network consists of only a single cell, i.e., we do not assume any repetitive patterns within the recurrent architecture.
4.2.2 Architecture Evaluation on PTB
We train a single-layer recurrent network with the searched cell for 1600 epochs with batch size 64 using averaged SGD. Both the embedding and the hidden sizes are set to 850 to ensure our model size is comparable with the other baselines. Other hyper-parameters are set following (Liu et al., 2018b). Note that the model is not fine-tuned at the end of the optimization, nor do we use any additional enhancements, for a fair comparison.
Table 3 lists the results of this experiment. From the table, we observe that DARTS-EGS is also able to search recurrent architectures effectively. It empirically shows that the back-propagation algorithm can guide DARTS-EGS to a preferable architecture in a larger search space, while maintaining the requisite efficiency. Similar to the conclusion in Section 4.1.2, lower perplexity is achieved when a larger $M$ is employed, which verifies that richer search spaces are also valuable for recurrent architectures.
4.2.3 Transferability Evaluation on WT2
Different from the setting on PTB, we apply embedding and hidden sizes of 700, weight decay 5e-7, and hidden-node variational dropout 0.15. Other hyperparameters remain the same as in the PTB experiment. In Table 4, the results on WT2 indicate that the transferability also holds for recurrent architectures. Conclusively, the consistent results in Sections 4.1.3 and 4.2.3 strongly support the transferability of both convolutional and recurrent architectures.
4.3 Ablation study
4.3.1 Sensitivity to Number of Sampling Times
We perform experiments on CIFAR-10 to analyze the sensitivity to the number of sampling times $M$. Table 5 gives the results of this experiment. From this table, it can be observed that a larger $M$ yields higher performance, while more parameters are introduced as $M$ increases. This is in accordance with Proposition 3, i.e., more capable networks can be found with a larger $M$ to obtain higher performance.
4.3.2 Performance on Semantic Segmentation
We also validate the capability of our method on a more complex task, i.e., semantic segmentation on VOC-2012. Compared with DARTS, DARTS-EGS achieves better performance, with a larger margin than on ImageNet classification. Such results demonstrate that DARTS-EGS may have more prominent superiority on more complex tasks, not just toy tasks.
5 Conclusion
We present a powerful framework for searching network architectures, DARTS-EGS, which is capable of covering more diversified network architectures while maintaining the differentiability of the NAS pipeline. For this purpose, network architectures are represented with arbitrary discrete binary codes, guaranteeing the coverage of the search space. To ensure efficiency in searching, ensemble Gumbel-Softmax is developed to search architectures in a differentiable, end-to-end manner. By searching with standard back-propagation, DARTS-EGS is able to outperform state-of-the-art architecture search methods on various tasks, with remarkable efficiency.
Future work may include searching whole networks with our ensemble Gumbel-Softmax and injecting the ensemble Gumbel-Softmax into deep models to handle other machine learning tasks. For the former, the sampling capability of ensemble Gumbel-Softmax guarantees the practicability of searching arbitrary networks, but how to improve the efficiency remains to be solved. For the latter, the differentiability of ensemble Gumbel-Softmax indicates that it can be utilized anywhere in networks. Relying on this insight, an interesting direction is to recast the clustering process into our ensemble Gumbel-Softmax. By aggregating the inputs in each cluster, a general pooling for both deep networks and deep graph networks (Battaglia et al., 2018; Chang et al., 2018) can be developed to deal with Euclidean and non-Euclidean structured data uniformly.
- Arjona-Medina et al. (2018) Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., and Hochreiter, S. RUDDER: return decomposition for delayed rewards. CoRR, abs/1806.07857, 2018.
- Baker et al. (2017) Baker, B., Gupta, O., Raskar, R., and Naik, N. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823, 2017.
- Battaglia et al. (2018) Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V. F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gülçehre, Ç., Song, F., Ballard, A. J., Gilmer, J., Dahl, G. E., Vaswani, A., Allen, K., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
- Bello et al. (2016) Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016.
- Bello et al. (2017) Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 459–468, 2017.
- Brock et al. (2017) Brock, A., Lim, T., Ritchie, J. M., and Weston, N. SMASH: one-shot model architecture search through hypernetworks. CoRR, abs/1708.05344, 2017.
- Cai et al. (2018) Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. CoRR, abs/1812.00332, 2018.
- Chang et al. (2018) Chang, J., Gu, J., Wang, L., Meng, G., Xiang, S., and Pan, C. Structure-aware convolutional neural networks. In NeurIPS, pp. 11–20, 2018.
- Chen et al. (2018) Chen, L., Collins, M. D., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., Adam, H., and Shlens, J. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pp. 8713–8724, 2018.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255, 2009.
- Elsken et al. (2018a) Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. CoRR, abs/1808.05377, 2018a.
- Elsken et al. (2018b) Elsken, T., Metzen, J. H., and Hutter, F. Efficient multi-objective neural architecture search via lamarckian evolution. CoRR, abs/1804.09081, 2018b.
- Grave et al. (2016) Grave, E., Joulin, A., and Usunier, N. Improving neural language models with a continuous cache. CoRR, abs/1612.04426, 2016.
- Gumbel (1954) Gumbel, E. J. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office, 1954.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778, 2016.
- Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
- Hsu et al. (2018) Hsu, C., Chang, S., Juan, D., Pan, J., Chen, Y., Wei, W., and Chang, S. MONAS: multi-objective neural architecture search using reinforcement learning. CoRR, abs/1806.10332, 2018.
- Huang et al. (2017) Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2261–2269, 2017.
- Inan et al. (2016) Inan, H., Khosravi, K., and Socher, R. Tying word vectors and word classifiers: A loss framework for language modeling. CoRR, abs/1611.01462, 2016.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456, 2015.
- Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. CoRR, abs/1611.01144, 2016.
- Kamath et al. (2018) Kamath, P., Singh, A., and Dutta, D. Neural architecture construction using envelopenets. CoRR, abs/1803.06744, 2018.
- Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Master’s Thesis, Department of Computer Science, University of Toronto, 2009.
- Lee et al. (2015) Lee, C., Xie, S., Gallagher, P. W., Zhang, Z., and Tu, Z. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.
- Liu et al. (2018a) Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L., Fei-Fei, L., Yuille, A. L., Huang, J., and Murphy, K. Progressive neural architecture search. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pp. 19–35, 2018a.
- Liu et al. (2017) Liu, H., Simonyan, K., Vinyals, O., Fernando, C., and Kavukcuoglu, K. Hierarchical representations for efficient architecture search. CoRR, abs/1711.00436, 2017.
- Liu et al. (2018b) Liu, H., Simonyan, K., and Yang, Y. DARTS: differentiable architecture search. CoRR, abs/1806.09055, 2018b.
- Luo et al. (2018) Luo, R., Tian, F., Qin, T., Chen, E., and Liu, T. Neural architecture optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pp. 7827–7838, 2018.
- Ma et al. (2018) Ma, N., Zhang, X., Zheng, H., and Sun, J. Shufflenet V2: practical guidelines for efficient CNN architecture design. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pp. 122–138, 2018.
- Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016.
- Melis et al. (2017) Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589, 2017.
- Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016.
- Merity et al. (2017) Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182, 2017.
- Miikkulainen et al. (2017) Miikkulainen, R., Liang, J. Z., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Duffy, N., and Hodjat, B. Evolving deep neural networks. CoRR, abs/1703.00548, 2017.
- Pham et al. (2018) Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4092–4101, 2018.
- Real et al. (2017) Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., and Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2902–2911, 2017.
- Real et al. (2018) Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
- Shin et al. (2018) Shin, R., Packer, C., and Song, D. Differentiable neural network architecture search. 2018.
- Stanley & Miikkulainen (2002) Stanley, K. O. and Miikkulainen, R. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
- Suganuma et al. (2018) Suganuma, M., Shirakawa, S., and Nagao, T. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 5369–5373, 2018.
- Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1–9, 2015.
- Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818–2826, 2016.
- Tan et al. (2018) Tan, M., Chen, B., Pang, R., Vasudevan, V., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018.
- Taylor et al. (2003) Taylor, A., Marcus, M., and Santorini, B. The penn treebank: an overview. In Treebanks, pp. 5–22. 2003.
- Tran et al. (2017) Tran, D., Ray, J., Shou, Z., Chang, S., and Paluri, M. Convnet architecture search for spatiotemporal feature learning. CoRR, abs/1708.05038, 2017.
- Veit & Belongie (2018) Veit, A. and Belongie, S. J. Convolutional networks with adaptive inference graphs. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pp. 3–18, 2018.
- Xie et al. (2018) Xie, S., Zheng, H., Liu, C., and Lin, L. SNAS: stochastic neural architecture search. CoRR, abs/1812.09926, 2018.
- Yang et al. (2017) Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. CoRR, abs/1711.03953, 2017.
- Zhang et al. (2018) Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 6848–6856, 2018.
- Zhong et al. (2018) Zhong, Z., Yan, J., Wu, W., Shao, J., and Liu, C. Practical block-wise neural network architecture generation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 2423–2432, 2018.
- Zilly et al. (2016) Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. CoRR, abs/1607.03474, 2016.
- Zilly et al. (2017) Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 4189–4198, 2017.
- Zoph & Le (2016) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.
- Zoph et al. (2018) Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8697–8710, 2018.