1 Introduction
In the era of deep learning, designing a suitable network architecture for a specific problem is a challenging task. Architectures with state-of-the-art performance typically require substantial effort from human experts. To eliminate such exhausting engineering, much research has been devoted to automatically searching for architectures, namely neural architecture search (NAS), which has achieved significant success in a multitude of fields, including image classification
(Zoph et al., 2018; Real et al., 2018; Liu et al., 2018a, b; Xie et al., 2018), semantic segmentation (Chen et al., 2018) and object detection (Zoph et al., 2018). So far, three basic frameworks have gained growing interest: evolution-based NAS, reinforcement learning-based NAS, and gradient-based NAS. Primitively, evolution-based NAS employs evolutionary algorithms to jointly learn architectures and network parameters, as in NEAT
(Stanley & Miikkulainen, 2002). Due to the uneconomical search strategy of evolutionary algorithms, evolution-based NAS inherently requires tremendous time and resources. For instance, it takes 3150 GPU days for AmoebaNet (Real et al., 2018) to achieve state-of-the-art performance compared with human-designed architectures. To speed up the process, reinforcement learning-based NAS using backpropagation seems a natural and plausible choice. Compared with evolution-based NAS, many reinforcement learning-based methods such as ENAS (Pham et al., 2018) dramatically reduce time and resource consumption despite a similar optimization mechanism. However, since reinforcement learning-based NAS is generally formulated as a Markov decision process, temporal-difference learning is used to take structural decisions. As a result, the reward is only observable after the architecture is chosen and the network is evaluated for accuracy, so the search is subject to the delayed-reward problem of temporal-difference learning
(Arjona-Medina et al., 2018). To eliminate this deficiency, gradient-based NAS methods including DARTS (Liu et al., 2018b) and SNAS (Xie et al., 2018) have recently been presented. They convert the discrete search space into a continuous one, so that architectures and parameters can be jointly optimized by gradient descent. In this work, we develop an effective and efficient NAS method, Differentiable ARchiTecture Search with Ensemble Gumbel-Softmax (DARTS-EGS), which is capable of discovering more diversified network architectures while maintaining the differentiability of a promising NAS pipeline. To guarantee the diversity of network architectures along the search path, we represent the whole search space with binary codes. Benefiting from this modeling, all feed-forward networks consisting of a predefined number of ordered nodes are included in our search space. To maintain the prerequisite searching efficiency, we develop an ensemble Gumbel-Softmax that replaces the traditional softmax and yields a differentiable end-to-end mechanism. That is, our ensemble Gumbel-Softmax can perform arbitrary structural decisions, like policy gradient in reinforcement learning, but more efficiently than temporal-difference learning with its delayed rewards.
To sum up, the main contributions of this work are:

By generalizing the traditional Gumbel-Softmax, we develop an ensemble Gumbel-Softmax that can effectively perform arbitrary structural decisions, like policy gradient in reinforcement learning, but with higher efficiency.

Benefiting from the ensemble Gumbel-Softmax, the search space can be dramatically enlarged while maintaining the requisite searching efficiency, yielding an end-to-end mechanism to identify network architectures with the desired computational complexity.

Extensive experiments verify that our model outperforms current models in searching for high-performance convolutional architectures for image classification and recurrent architectures for language modeling.
2 Related Work
Recently, discovering neural architectures automatically has raised great interest in both academia and industry (Bello et al., 2016; Baker et al., 2017; Brock et al., 2017; Tran et al., 2017; Suganuma et al., 2018; Veit & Belongie, 2018). Neural architecture search methods can be roughly divided into three classes according to the search method (Elsken et al., 2018a): evolution-based NAS, reinforcement learning-based NAS, and gradient-based NAS.
Evolution-based neural architecture search methods (Real et al., 2017; Miikkulainen et al., 2017; Real et al., 2018; Elsken et al., 2018b; Kamath et al., 2018) utilize evolutionary algorithms to generate neural architectures automatically. In (Real et al., 2017, 2018; Elsken et al., 2018b), a large CNN architecture space is explored, and modifications such as inserting a layer, adjusting a filter size and adding an identity mapping are designed as mutations. Despite the remarkable achievements, these methods require huge computational resources and are less practical at large scale.
Reinforcement learning-based neural architecture search methods are prevalent in recent work (Zoph & Le, 2016; Bello et al., 2017; Zoph et al., 2018; Zhong et al., 2018; Pham et al., 2018). In the pioneering work (Zoph & Le, 2016), an RNN is utilized as a controller that decides the type and parameters of layers sequentially. The controller is trained by reinforcement learning, with the accuracy of the generated architecture as the reward. Although it achieves impressive results, the search process is computationally hungry and requires 800 GPUs. Based on (Zoph & Le, 2016), several methods have been proposed to accelerate the search. Specifically, (Zoph et al., 2018; Zhong et al., 2018) shrink the search space by searching for the architecture of a block and then stacking the searched block into the final network. In (Pham et al., 2018), network weights are shared among child models, saving search time by reducing the cost of generating and evaluating each child model. Additionally, a series of well-performing methods have been explored, including progressive search (Liu et al., 2018a) and multi-objective optimization (Tan et al., 2018; Hsu et al., 2018).
Contrary to treating architecture search as a black-box optimization problem, gradient-based neural architecture search methods utilize gradients obtained during training to optimize the architecture (Shin et al., 2018; Luo et al., 2018; Liu et al., 2018b; Xie et al., 2018). Typically, NAO (Luo et al., 2018) utilizes RNNs as an encoder and a decoder to map architectures into a continuous embedding space and performs gradient-based optimization in that space. Another typical method, DARTS (Liu et al., 2018b), chooses the best connection between nodes from a candidate primitive set by employing a softmax classifier. Although DARTS achieves impressive results, its discretization step is not fully differentiable, and the connection between two nodes is limited to a single operation from the candidate primitive set.
3 Methodology
In the following, Section 3.1 introduces our notations, defines the search space and presents our strategy of encoding network architectures with binary codes. Section 3.2 introduces a conceptually intuitive yet powerful relaxation for searching in the discrete search space. Section 3.3 elaborates the proposed ensemble GumbelSoftmax estimator, which plays a key role in constructing our scheme for jointly optimizing the architecture and its weights. Finally, Section 3.4 details the objective in our model.
3.1 Search Space
To balance the optimality and efficiency in NAS, we search for computation cells that constitute the whole network, following prior works (Zoph et al., 2018; Liu et al., 2018a; Real et al., 2018; Pham et al., 2018; Liu et al., 2018b; Xie et al., 2018; Cai et al., 2018). In essence, every such cell is a subnetwork, which can be naturally considered as a directed acyclic graph (DAG) consisting of an ordered sequence of nodes. To cover abundant network architectures, our search space is set as the whole space of DAGs with a predefined number of nodes.
Without loss of generality and for simplicity, we denote a cell with $N$ ordered nodes as a DAG whose directed edge $e_{ij}$ connects the $i$th node to the $j$th node ($i < j$). Corresponding to each directed edge $e_{ij}$, there is a set of $K$ candidate primitive operations $\mathcal{O} = \{o_1, \dots, o_K\}$, such as convolution, pooling, identity, and zero ("zero" means no connection between two nodes), as defined in DARTS (Liu et al., 2018b). With these operations, the output at the $j$th node can be formulated as
$x_j = \sum_{i<j} o^{(i,j)}(x_i)$,  (1)
where $x_i$ denotes the input from the $i$th node, and $o^{(i,j)}$ is a function applied to $x_i$ which can be decomposed into a superposition of primitive operations in $\mathcal{O}$, i.e.,
$o^{(i,j)}(x) = \sum_{k=1}^{K} c_k^{(i,j)}\, o_k(x)$,  (2)
where $o_k$ is the $k$th candidate primitive operation in $\mathcal{O}$, and $c_k^{(i,j)} \in \{0,1\}$ signifies a binary, and thus discrete, weight indicating whether the operation $o_k$ is utilized on the edge $e_{ij}$. For clarity, we introduce the binary set $\mathcal{C} = \{C^{(i,j)}\}$ as the network architecture code representing a cell, and $C^{(i,j)} = [c_1^{(i,j)}, \dots, c_K^{(i,j)}]$ as the edge architecture code representing the structure of the edge $e_{ij}$. To reveal the capability of the above encoding for representing cells, we have the following proposition.
Proposition 1.
For an arbitrary feed-forward network consisting of a limited number of ordered nodes and candidate operations in $\mathcal{O}$, there is one and only one architecture code $\mathcal{C}$ that corresponds to it.
Proposition 1 guarantees the uniqueness of our encoding. Meanwhile, it also implies that our search space includes the whole set of feed-forward networks, which guarantees the expressive capability of the formulation in Eq. (2). With such modeling, many previously defined search spaces can be regarded as subsets of ours. For example, our modeling degrades into that of DARTS (Liu et al., 2018b) and SNAS (Xie et al., 2018) when the constraint $\sum_k c_k^{(i,j)} = 1$ is further introduced. Limited by this constraint, only one operation can be chosen on each edge, which can be considered a one-category problem.
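As a concrete illustration (a sketch of ours, with a hypothetical cell size not taken from the paper), the binary coding of Eq. (2) can be written down directly:

```python
# Hypothetical illustration of the binary architecture coding in Eq. (2):
# a cell with 4 ordered nodes and K = 3 candidate operations, one binary
# code per directed edge (i, j) with i < j.
from itertools import combinations

K = 3          # candidate operations per edge
NUM_NODES = 4  # ordered nodes 0..3

edges = [(i, j) for i, j in combinations(range(NUM_NODES), 2)]

# Start from the empty cell ("zero" on every edge), then switch on
# operations 0 and 2 on edge (0, 1) and operation 1 on edge (1, 3).
arch_code = {e: [0] * K for e in edges}
arch_code[(0, 1)] = [1, 0, 1]   # two compatible operations on one edge
arch_code[(1, 3)] = [0, 1, 0]   # a single operation on another edge

# Proposition 1: the mapping between cells and such codes is one-to-one,
# so enumerating codes enumerates all feed-forward cells of this size.
print(len(edges))  # 6 directed edges in a 4-node cell
```

Note that, unlike the one-hot codes of DARTS/SNAS, the code on edge (0, 1) above has two ones, which is exactly the extra freedom our search space provides.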
3.2 Relaxation of The Search Space
Benefiting from the uniqueness of our architecture code $\mathcal{C}$, as verified in Proposition 1, the task of learning the cell can be converted into approaching the optimal code. Yet, as mentioned, one major obstacle to efficient NAS is the difficulty of optimization in the discrete space. In this subsection, we introduce our strategy of searching in a continuous space as a proxy. Without loss of generality, we elaborate the strategy on the edge $e_{ij}$.
To make the search space continuous, we relax the categorical choice of a set of particular operations as follows:
$C^{(i,j)} = g\big(P^{(i,j)}\big)$,  (3)
where $P^{(i,j)} = [p_1^{(i,j)}, \dots, p_K^{(i,j)}]$ and $p_k^{(i,j)}$ denotes the probability of choosing the $k$th operation on the edge $e_{ij}$, and $g(\cdot)$ represents a binary function that suffices to map a probability vector to a binary code while passing gradients in a continuous manner. We will discuss it further in Section 3.3 and explain how we manage to pass gradients through it. Specifically, $g(\cdot)$ is chosen to be monotonically increasing in our method, i.e.,
$p_k^{(i,j)} \le p_l^{(i,j)} \;\Longrightarrow\; \Pr\big(c_k^{(i,j)} = 1\big) \le \Pr\big(c_l^{(i,j)} = 1\big)$.  (4)
By substituting each $c_k^{(i,j)}$ with the corresponding entry of $g(P^{(i,j)})$ and treating $P^{(i,j)}$ as the variable to be optimized, we achieve a continuous relaxation. Benefiting from the flexibility of this formulation, it is capable of modeling multi-category problems and incorporating resource constraints seamlessly, as will be discussed.
3.2.1 Modeling Multi-Category Problems
Different from a traditional probability vector, which solely manages a one-category problem, the formulation in Eq. (3) can solve multi-category problems. This superiority inherently comes from the monotonicity of the function $g(\cdot)$, which endows the formulation with the capability of modeling more general relationships between categories, not limited purely to incompatibility.
3.2.2 Modeling Resource Constraint
In NAS, a desirable method should not only be effective (i.e., in test-set accuracy) but also efficient (i.e., in search cost). Aiming at a high-performing NAS framework, we integrate the two parts and represent $p_k^{(i,j)}$ as
$p_k^{(i,j)} = \frac{\exp\big(a_k^{(i,j)} + \lambda\, b_k^{(i,j)}\big)}{\sum_{l=1}^{K} \exp\big(a_l^{(i,j)} + \lambda\, b_l^{(i,j)}\big)}$,  (5)
where $a_k^{(i,j)}$ and $b_k^{(i,j)}$ represent the credits in terms of effectiveness and efficiency of choosing the $k$th operation on the edge $e_{ij}$, and $\lambda$ is a hyperparameter balancing the two parts. In practice, the validation-set accuracy and the search time are employed to measure the effectiveness and efficiency of our model respectively, and $\lambda$ is kept fixed in this work.
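The exact parameterization of Eq. (5) is not fully recoverable from the extracted text; a plausible reading, with the softmax normalization assumed by us, is the following sketch:

```python
# Hypothetical sketch of Eq. (5): turn per-operation credits into choice
# probabilities, trading off effectiveness (a) against efficiency (b)
# with a balance hyperparameter lam. Softmax normalization is assumed.
import math

def choice_probs(a, b, lam):
    """a[k]: effectiveness credit, b[k]: efficiency credit (ours)."""
    logits = [ak + lam * bk for ak, bk in zip(a, b)]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Example credits (made up): operation 1 is accurate but costly.
p = choice_probs(a=[1.0, 2.0, 0.5], b=[-0.1, -0.8, -0.05], lam=1.0)
```

With a larger `lam`, the efficiency credits dominate and cheap operations receive more probability mass, which is the intended trade-off.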
3.3 Optimization with Ensemble Gumbel-Softmax
Although the relaxation presented in Section 3.2 makes the search space continuous, the binary function $g(\cdot)$ that maps each probability vector to a binary code remains to be defined. In principle, this is still a problem of learning to take discrete decisions, which is unfortunately non-differentiable and has zero gradients almost everywhere. To leverage gradient information as in generic deep learning, we introduce an ensemble Gumbel-Softmax estimator that optimizes the problem with a principled approximation. As a result, the backpropagation algorithm can be adopted directly in an end-to-end manner, yielding an efficient and effective search mechanism.
3.3.1 Gumbel-Softmax
A natural formulation for representing discrete variables is the categorical distribution. However, partially due to the inability to backpropagate through samples, it is rarely applied in deep learning. In this work, we resort to the Gumbel-Max trick (Gumbel, 1954) to enable backpropagation, representing the process of taking a decision as sampling from a categorical distribution, so as to perform NAS in a principled way. Specifically, given a probability vector $P = [p_1, \dots, p_K]$ and a discrete random variable $Z$ with $\Pr(Z = k) \propto p_k$, we sample from $Z$ by introducing Gumbel random variables. To be more specific, we let
$Z = \arg\max_k \big(\log p_k + G_k\big)$,  (6)
where $\{G_k\}_{k=1}^{K}$ is a sequence of standard Gumbel random variables, typically sampled as $G_k = -\log(-\log U_k)$ with $U_k \sim \mathrm{Uniform}(0,1)$. An obstacle to directly using this approach in our problem is that the argmax operation is not continuous. A straightforward remedy is to replace the argmax with a softmax (Jang et al., 2016; Maddison et al., 2016). Formally, the Gumbel-Softmax (GS) estimator can be expressed as
$y_k = \frac{\exp\big((\log p_k + G_k)/\tau\big)}{\sum_{l=1}^{K} \exp\big((\log p_l + G_l)/\tau\big)}, \quad k = 1, \dots, K$,  (7)
where $y_k$ indicates the probability that $k$ is the maximal entry of $[\log p_1 + G_1, \dots, \log p_K + G_K]$, and $\tau$ is a temperature. When $\tau \to 0$, the vector $y = [y_1, \dots, y_K]$ converges to a one-hot vector; in the other extreme $\tau \to \infty$, it becomes a discrete uniform distribution. From the expression in Eq. (7), we see that the traditional Gumbel-Softmax pertains solely to problems in which only one category needs to be determined. In NAS, however, an optimal architecture may require multiple operations on an edge, considering their practical significance (He et al., 2016; Szegedy et al., 2016). For instance, the residual module in ResNets (He et al., 2016) consists of two operations: a learnable mapping $f(x)$ and the identity $x$. That is, choosing different operations in $\mathcal{O}$ may not be mutually exclusive but compatible. One direct way of handling this limitation is to map all possible operation combinations to $2^K$-dimensional vectors, where $K$ is the number of candidate operations in $\mathcal{O}$. However, it is difficult to search architectures efficiently in this way when there are many candidate operations (i.e., $K$ is large).
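The sampler of Eqs. (6)-(7) can be sketched in a few lines of plain Python. This is our own minimal sketch in the spirit of Jang et al. (2016), not the authors' code:

```python
# Minimal Gumbel-Softmax sampler: perturb log-probabilities with Gumbel
# noise and apply a temperature-controlled softmax, as in Eq. (7).
import math
import random

def gumbel_noise():
    # Standard Gumbel variable: G = -log(-log U), U ~ Uniform(0, 1).
    u = random.random()
    return -math.log(-math.log(u + 1e-20) + 1e-20)

def gumbel_softmax(probs, tau):
    """Relaxed (soft) one-hot sample from a categorical distribution."""
    scores = [(math.log(p + 1e-20) + gumbel_noise()) / tau for p in probs]
    m = max(scores)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

y = gumbel_softmax([0.7, 0.2, 0.1], tau=0.1)  # typically near one-hot
```

At low temperature the output concentrates on a single entry, which is exactly the one-category behavior the following subsection sets out to generalize.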
3.3.2 Ensemble Gumbel-Softmax
In order to address the aforementioned limitation of the traditional Gumbel-Softmax, we propose our Ensemble Gumbel-Softmax (EGS) estimator. With its assistance, we are able to choose multiple operations on each edge by sampling more diversified codes, not limited to one-hot codes. To this end, we start by recoding binary codes. For the code $C^{(i,j)}$ on an edge $e_{ij}$, the whole architecture information included in $C^{(i,j)}$ can be recoded into a superposition of one-hot vectors, i.e.,
$C^{(i,j)} = \sum_{k=1}^{K} c_k^{(i,j)}\, v_k$,  (8)
where $v_k$ is a $K$-dimensional one-hot vector that uniquely corresponds to the operation $o_k$. This equivalence reveals that compositing results sampled from the Gumbel-Softmax may be a feasible way to sample $C^{(i,j)}$ directly, since the Gumbel-Softmax is capable of sampling each $v_k$, as described in Section 3.3.1.
For an effective and efficient sampler, we composite the sampling results over multiple draws. That is, an ensemble of multiple Gumbel-Softmax samplers suffices to sample diversified binary codes:
Definition 1.
For a $K$-dimensional probability vector $P$ and $M$ one-hot vectors $h_1, \dots, h_M$ sampled from $\mathrm{GS}(P)$, the $K$-dimensional binary code sampled with the ensemble Gumbel-Softmax is
$c_k = \min\big(1, \sum_{m=1}^{M} h_{m,k}\big), \quad k = 1, \dots, K$,
where $M$ is the number of sampling times, $c_k$ is the $k$th element in $C^{(i,j)}$, and $h_{m,k}$ indicates the $k$th element in $h_m$.
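Definition 1 translates directly into code. The sketch below (ours) implements only the hard, discrete forward pass; in the actual estimator the backward pass would flow through the relaxed softmax form of Eq. (7):

```python
# Sketch of the Ensemble Gumbel-Softmax sampler (Definition 1): draw M
# one-hot vectors with the Gumbel-Max trick and superpose them into a
# binary code via c_k = min(1, sum_m h_{m,k}).
import math
import random

def sample_one_hot(probs):
    # Gumbel-Max trick: argmax of Gumbel-perturbed log-probabilities.
    g = [math.log(p + 1e-20) - math.log(-math.log(random.random() + 1e-20))
         for p in probs]
    k = max(range(len(probs)), key=lambda i: g[i])
    return [1 if i == k else 0 for i in range(len(probs))]

def ensemble_gumbel_softmax(probs, M):
    samples = [sample_one_hot(probs) for _ in range(M)]
    # An operation is kept if any of the M samplers picked it.
    return [min(1, sum(h[k] for h in samples)) for k in range(len(probs))]

code = ensemble_gumbel_softmax([0.5, 0.3, 0.2], M=3)
```

With M = 1 this reduces to ordinary Gumbel-Max sampling of a one-hot code, recovering the DARTS/SNAS single-operation setting.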
Table 1: Classification results on CIFAR-10.

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Ops | Search Method |
|---|---|---|---|---|---|
| DenseNet-BC (Huang et al., 2017) | 3.46 | 25.6 | – | – | manual |
| NASNet-A + cutout (Zoph et al., 2018) | 2.65 | 3.3 | 2000 | 13 | RL |
| BlockQNN (Zhong et al., 2018) | 3.54 | 39.8 | 96 | 8 | RL |
| ENAS + cutout (Pham et al., 2018) | 2.89 | 4.6 | 0.5 | 6 | RL |
| AmoebaNet-A (Real et al., 2018) | 3.34 | 3.2 | 3150 | 19 | evolution |
| AmoebaNet-B + cutout (Real et al., 2018) | 2.55 | 2.8 | 3150 | 19 | evolution |
| Hierarchical evolution (Liu et al., 2017) | 3.75 | 15.7 | 300 | 6 | evolution |
| PNAS (Liu et al., 2018a) | 3.41 | 3.2 | 225 | 8 | SMBO |
| DARTS (1st order) + cutout (Liu et al., 2018b) | 3.00 | 3.3 | 1.5 | 7 | gradient-based |
| DARTS (2nd order) + cutout (Liu et al., 2018b) | 2.76 | 3.3 | 4 | 7 | gradient-based |
| SNAS + mild + cutout (Xie et al., 2018) | 2.98 | 2.9 | 1.5 | – | gradient-based |
| SNAS + moderate + cutout (Xie et al., 2018) | 2.85 | 2.8 | 1.5 | – | gradient-based |
| SNAS + aggressive + cutout (Xie et al., 2018) | 3.10 | 2.3 | 1.5 | – | gradient-based |
| Random search baseline + cutout | 3.29 | 3.2 | 4 | 7 | random |
| DARTS-EGS () | 3.01 | 2.6 | 1 | 7 | gradient-based |
| DARTS-EGS () | 2.79 | 2.9 | 1 | 7 | gradient-based |
3.3.3 Understanding Our EGS
To reveal the serviceability and sampling capability of our ensemble Gumbel-Softmax, two basic propositions follow from the definition.
First, whether an element in the sampled binary code is one depends on the probability at the corresponding location.
Proposition 2.
For an arbitrary probability vector $P$ and number of sampling times $M$, we have
$\Pr\big(c_k = 1\big) = 1 - (1 - p_k)^M$,
where $\Pr(c_k = 1)$ denotes the probability that the $k$th element of the sampled code equals one. Furthermore, $\Pr(c_k = 1) \le \Pr(c_l = 1)$ whenever $p_k \le p_l$, where $k, l \in \{1, \dots, K\}$.
Proposition 2 implies that our ensemble Gumbel-Softmax is monotonically increasing in terms of probability, and is thus in a position to act as the function $g(\cdot)$ in Eq. (3).
Second, the capability of sampling binary codes is determined by $M$, i.e., the number of one-hot vectors sampled from the Gumbel-Softmax. Specifically, we have
Proposition 3.
For an arbitrary probability vector $P$ and number of sampling times $M$, the ensemble Gumbel-Softmax is capable of sampling $\sum_{m=1}^{\min(M,K)} \binom{K}{m}$ different binary codes, namely all binary codes with at least one and at most $M$ ones.
Proposition 3 indicates that the sampling capability of the estimator increases exponentially with $M$. In practice, a larger $M$ can be employed to handle more complex tasks effectively, while a smaller one can be utilized to search more lightweight networks efficiently.
In summary, our ensemble Gumbel-Softmax is not only serviceable for searching network architectures but also has the sampling capability to guarantee performance in theory. In Figure 1, a visualized comparison intuitively shows that our ensemble Gumbel-Softmax surpasses the traditional Gumbel-Softmax in terms of both sampling capability and practical rationality.
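The closed form in Proposition 2, as we read it from the surrounding text, can be checked empirically: with $M$ independent categorical draws, the $k$th bit is one exactly when at least one draw picks $k$.

```python
# Monte Carlo check (ours) of Pr[c_k = 1] = 1 - (1 - p_k)^M for the
# hard ensemble sampler: M i.i.d. categorical draws per trial.
import random

random.seed(0)
p, M, trials = [0.5, 0.3, 0.2], 3, 200_000
hits = [0] * len(p)
for _ in range(trials):
    # The set of distinct operations picked across the M draws.
    drawn = set(random.choices(range(len(p)), weights=p, k=M))
    for k in drawn:
        hits[k] += 1

estimates = [h / trials for h in hits]
targets = [1 - (1 - pk) ** M for pk in p]  # 0.875, 0.657, 0.488
```

The estimates agree with the closed form to within Monte Carlo error, and they are visibly monotone in $p_k$, matching the second half of Proposition 2.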
3.4 Parameter Learning and Architecture Sampling
To avoid compromising the NAS pipeline, the parameters and architectures are optimized simultaneously in terms of the ensemble Gumbel-Softmax. Analogous to architecture search in (Zoph & Le, 2016; Zoph et al., 2018; Pham et al., 2018; Liu et al., 2017; Real et al., 2018; Liu et al., 2018b), the validation-set performance is considered as the reward in our model, but in an end-to-end differentiable manner. Denote the training loss as $\mathcal{L}_{\mathrm{train}}$. The goal of architecture search is to find a high-performance architecture, i.e.,
$\min_{P,\,\theta} \; \mathbb{E}_{\mathcal{C} \sim \mathrm{EGS}(P)}\big[\mathcal{L}_{\mathrm{train}}(\mathcal{C}, \theta)\big]$.  (9)
The main process of optimizing this objective is to minimize the expected loss of architectures sampled with $\mathrm{EGS}(P)$. That is, a network is first sampled with $\mathrm{EGS}(P)$. Afterward, the loss on the training set is calculated by forward propagation. From this loss, the gradients with respect to the architecture parameters $P$ and the network parameters $\theta$ are obtained and used to update both. Because of the differentiability, our model can be trained end-to-end by standard backpropagation. In the end, the network architecture is identified by sampling with $\mathrm{EGS}(P)$, and the network parameters are estimated by retraining on the training set. A conceptual visualization of our DARTS-EGS model is illustrated in Figure 2.
4 Experiments
In this section, we carry out extensive experiments to verify the capability of our model in discovering high-performance convolutional networks for image classification and recurrent networks for language modeling. For each task, the experiments consist of two stages, following previous work (Liu et al., 2018b; Xie et al., 2018). First, we search for cell architectures based on our ensemble Gumbel-Softmax and select the best cells according to their validation performance. Second, the transferability of the best cells learned on CIFAR-10 (Krizhevsky & Hinton, 2009) and Penn Tree Bank (PTB) (Taylor et al., 2003) is investigated by employing them on larger datasets, i.e., classification on ImageNet (Deng et al., 2009) and language modeling on WikiText-2 (WT2) (Merity et al., 2016), respectively. As our method builds directly on DARTS, the experimental settings are inherited from it, except for settings specific to each experiment.
4.1 Image Classification
Table 2: Classification results on ImageNet (mobile setting).

| Architecture | Top-1 Error (%) | Top-5 Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| Inception-v1 (Szegedy et al., 2015) | 30.2 | 10.1 | 6.6 | – | manual |
| MobileNet (Howard et al., 2017) | 29.4 | 10.5 | 4.2 | – | manual |
| ShuffleNet-v1 2× (Zhang et al., 2018) | 26.3 | – | 5 | – | manual |
| ShuffleNet-v2 2× (Ma et al., 2018) | 25.1 | – | 5 | – | manual |
| AmoebaNet-A (Real et al., 2018) | 25.5 | 8.0 | 5.1 | 3150 | evolution |
| AmoebaNet-B (Real et al., 2018) | 26.0 | 8.5 | 5.3 | 3150 | evolution |
| AmoebaNet-C (Real et al., 2018) | 24.3 | 7.6 | 6.4 | 3150 | evolution |
| PNAS (Liu et al., 2018a) | 25.8 | 8.1 | 5.1 | 225 | SMBO |
| DARTS (searched on CIFAR-10) (Liu et al., 2018b) | 26.7 | 8.7 | 4.7 | 4 | gradient-based |
| SNAS (mild constraint) (Xie et al., 2018) | 27.3 | 9.2 | 4.3 | 1.5 | gradient-based |
| NASNet-A (Zoph et al., 2018) | 26.0 | 8.4 | 5.3 | 2000 | RL |
| NASNet-B (Zoph et al., 2018) | 27.2 | 8.7 | 5.3 | 2000 | RL |
| NASNet-C (Zoph et al., 2018) | 27.5 | 9.0 | 4.9 | 2000 | RL |
| DARTS-EGS () | 25.7 | 8.5 | 4.3 | 1.5 | gradient-based |
| DARTS-EGS () | 24.9 | 8.1 | 4.7 | 1.5 | gradient-based |
4.1.1 Architecture Search on CIFAR10
In our experiments, eight typical operations are included in the candidate primitive set $\mathcal{O}$: 3×3 and 5×5 separable convolutions, 3×3 and 5×5 dilated separable convolutions, 3×3 max pooling, 3×3 average pooling, identity, and zero. To preserve the spatial resolution, all operations are of stride one, and the convolutional feature maps are padded where necessary. In the ensemble Gumbel-Softmax, the number of sampling times $M$ is set to yield a rich search space. Following the settings in previous work (Liu et al., 2018a, b; Xie et al., 2018), the ReLU-Conv-BN order is used for all convolution operations, and every separable convolution is always applied twice.
The node settings of our convolutional cell also follow previous work (Zoph et al., 2018; Real et al., 2018; Liu et al., 2018a, b). Specifically, every cell consists of $N$ nodes, among which the output node is defined as the depthwise concatenation of all intermediate nodes. Larger networks are built by stacking multiple cells together. In the $k$th cell, the first and second nodes are set to the outputs of the $(k-2)$th and $(k-1)$th cells respectively, with 1×1 convolutions inserted as necessary. Furthermore, reduction cells are placed at 1/3 and 2/3 of the total network depth; the remaining cells are normal cells, and the reduction and normal cells each have their own architecture code.
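The cell-stacking rule can be sketched in a few lines. This assumes the common DARTS convention of reduction cells at 1/3 and 2/3 of the depth; the helper name is ours:

```python
# Sketch of the cell layout: reduction cells at 1/3 and 2/3 of the
# network depth, normal cells everywhere else.
def cell_types(num_cells):
    reduction_at = {num_cells // 3, (2 * num_cells) // 3}
    return ["reduction" if i in reduction_at else "normal"
            for i in range(num_cells)]

layout = cell_types(20)  # e.g. a 20-cell evaluation network
```

For 20 cells this places the two reduction cells at indices 6 and 13, so the feature-map resolution halves twice over the depth of the network.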
4.1.2 Architecture Evaluation on CIFAR10
To evaluate the selected architecture, a large network of 20 cells is trained from scratch for 600 epochs with batch size 96, and its performance on the test set is reported. Following existing work, we add enhancements including cutout with size 16, path dropout with probability 0.2 and auxiliary towers with weight 0.4 for fair comparison. We report the mean of 4 independent runs of our full model.
Figure 3 and Table 1 give the searched architectures and classification results on CIFAR-10, showing that our DARTS-EGS achieves results comparable with the state of the art using fewer computational resources. This performance verifies that DARTS-EGS can effectively and efficiently search worthy architectures for classification. Furthermore, within DARTS-EGS, a larger number of sampling times $M$ yields higher accuracy, in accordance with our motivation that a richer search space is beneficial for searching architectures.
Table 3: Language modeling results on PTB.

| Architecture | Valid Perplexity | Test Perplexity | Params (M) | Search Cost (GPU days) | Ops | Search Method |
|---|---|---|---|---|---|---|
| Variational RHN (Zilly et al., 2016) | 67.9 | 65.4 | 23 | – | – | manual |
| LSTM (Merity et al., 2017) | 60.7 | 58.8 | 24 | – | – | manual |
| LSTM + skip connections (Melis et al., 2017) | 60.9 | 58.3 | 24 | – | – | manual |
| LSTM + 15 softmax experts (Yang et al., 2017) | 58.1 | 56.0 | 22 | – | – | manual |
| DARTS (first order) (Liu et al., 2018b) | 60.2 | 57.6 | 23 | 0.5 | 4 | gradient-based |
| DARTS (second order) (Liu et al., 2018b) | 58.1 | 55.7 | 23 | 1 | 4 | gradient-based |
| NAS (Zoph & Le, 2016) | – | 64.0 | 25 | 1e4 CPU days | 4 | RL |
| ENAS (Pham et al., 2018) | 68.3 | 63.1 | 24 | 0.5 | 4 | RL |
| Random search baseline | 61.8 | 59.4 | 23 | 2 | 4 | random |
| DARTS-EGS () | 58.3 | 56.2 | 22 | 0.5 | 4 | gradient-based |
| DARTS-EGS () | 57.1 | 55.3 | 23 | 0.5 | 4 | gradient-based |
Table 4: Language modeling results on WT2.

| Architecture | Valid Perplexity | Test Perplexity | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| LSTM + augmented loss (Inan et al., 2016) | 91.5 | 87.0 | 28 | – | manual |
| LSTM + cache pointer (Grave et al., 2016) | – | 68.9 | – | – | manual |
| LSTM (Merity et al., 2017) | 69.1 | 66.0 | 33 | – | manual |
| LSTM + skip connections (Melis et al., 2017) | 69.1 | 65.9 | 24 | – | manual |
| LSTM + 15 softmax experts (Yang et al., 2017) | 66.0 | 63.3 | 33 | – | manual |
| DARTS (searched on PTB) (Liu et al., 2018b) | 69.5 | 66.9 | 33 | 1 | gradient-based |
| ENAS (searched on PTB) (Pham et al., 2018) | 72.4 | 70.4 | 33 | 0.5 | RL |
| DARTS-EGS () | 67.3 | 64.6 | 33 | 1 | gradient-based |
| DARTS-EGS () | 66.5 | 64.2 | 33 | 1 | gradient-based |
Table 5: Sensitivity to the number of sampling times M on CIFAR-10.

| Number of Sampling Times (M) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Test Error (%) | 3.38 | 3.22 | 3.09 | 3.05 | 3.01 | 2.84 | 2.79 | 2.74 | 2.73 |
| Params (M) | 2.29 | 2.51 | 2.56 | 2.60 | 2.72 | 2.84 | 2.79 | 2.90 | 3.02 |
4.1.3 Transferability Evaluation on ImageNet
We apply the mobile setting, where the input image size is 224×224 and the number of multiply-add operations of the model is restricted to be under 600M. A network of 14 cells is trained for 250 epochs with batch size 128, weight decay 3×10⁻⁵ and a poly learning rate scheduler with initial learning rate 0.1. Label smoothing (Szegedy et al., 2016) and an auxiliary loss (Lee et al., 2015) are used during training. Other hyperparameters follow (Liu et al., 2018b). In Table 2, we report the quantitative results on ImageNet. Note that the cell searched on CIFAR-10 can be smoothly employed for this large-scale classification task. Moreover, compared with other gradient-based NAS methods, greater margins are obtained on ImageNet. A possible reason is that more complex architectures can be found by DARTS-EGS thanks to its larger search space, and such architectures handle the more complex ImageNet task better.
4.2 Language Modeling
4.2.1 Architecture Search on PTB
In the language modeling task, our model is employed to search for suitable activation functions between nodes. Following the settings in (Zoph et al., 2018; Pham et al., 2018; Liu et al., 2018b), five popular functions are considered in the candidate primitive set $\mathcal{O}$: sigmoid, tanh, relu, identity, and zero. In addition, there are $N$ nodes in the recurrent cell, and the number of sampling times $M$ in the ensemble Gumbel-Softmax is set to yield a rich search space. Similar to ENAS (Pham et al., 2018) and DARTS (Liu et al., 2018b), the very first intermediate node in a cell is obtained by linearly transforming the two input nodes, adding the results and passing them through a tanh function; the remaining activation functions are learned with our model and enhanced with the highway bypass (Zilly et al., 2017). Batch normalization (Ioffe & Szegedy, 2015) is applied in each node to prevent gradient explosion during architecture search, and disabled during architecture evaluation. Our recurrent network consists of only a single cell, i.e., we do not assume any repetitive patterns within the recurrent architecture.
4.2.2 Architecture Evaluation on PTB
We train a single-layer recurrent network with the searched cell for 1600 epochs with batch size 64 using averaged SGD. Both the embedding and hidden sizes are set to 850 so that our model size is comparable with the baselines. Other hyperparameters follow (Liu et al., 2018b). Note that the model is not fine-tuned at the end of optimization, nor do we use any additional enhancements, for a fair comparison.
Table 3 lists the results of this experiment. We observe that DARTS-EGS is also able to search recurrent architectures effectively. This empirically shows that backpropagation can guide DARTS-EGS to a preferable architecture in a larger search space while maintaining the requisite efficiency. Similar to the conclusion in Section 4.1.2, lower perplexity is achieved with a larger $M$, which verifies that a richer search space is also valuable for recurrent architectures.
4.2.3 Transferability Evaluation on WT2
Different from the PTB setting, we apply an embedding and hidden size of 700, weight decay 5×10⁻⁷, and hidden-node variational dropout 0.15. Other hyperparameters remain the same as in the PTB experiment. In Table 4, the results on WT2 indicate that the transferability also holds for recurrent architectures. Together, the consistent results in Sections 4.1.3 and 4.2.3 support the transferability of both convolutional and recurrent architectures.
4.3 Ablation study
4.3.1 Sensitivity to Number of Sampling Times
We perform experiments on CIFAR-10 to analyze the sensitivity to the number of sampling times $M$. Table 5 gives the results. We observe that a larger $M$ leads to higher performance, while more parameters are introduced as $M$ increases. This accords with Proposition 3: more capable networks can be found with a larger $M$, yielding higher performance.
4.3.2 Performance on Semantic Segmentation
We also validate the capability of our method on a more complex task, semantic segmentation on VOC2012. Compared with DARTS, DARTS-EGS achieves better performance, with a larger margin than in ImageNet classification. These results suggest that DARTS-EGS has an even more prominent advantage on more complex tasks, not just toy ones.
5 Conclusion
We present a powerful framework for searching network architectures, DARTS-EGS, which is capable of covering more diversified network architectures while maintaining the differentiability of the NAS pipeline. For this purpose, network architectures are represented with arbitrary discrete binary codes, guaranteeing the coverage of the search space. To ensure search efficiency, the ensemble Gumbel-Softmax is developed to search architectures in a differentiable, end-to-end manner. By searching with standard backpropagation, DARTS-EGS outperforms state-of-the-art architecture search methods on various tasks, with remarkable efficiency.
Future work includes searching whole networks with our ensemble Gumbel-Softmax and injecting ensemble Gumbel-Softmax into deep models to handle other machine learning tasks. For the former, the sampling capability of ensemble Gumbel-Softmax guarantees the practicability of searching arbitrary networks, but how to improve the efficiency remains open. For the latter, the differentiability of ensemble Gumbel-Softmax means that it can be utilized anywhere in a network. Relying on this insight, an interesting direction is to recast the clustering process into our ensemble Gumbel-Softmax. By aggregating the inputs in each cluster, a general pooling for both deep networks and deep graph networks (Battaglia et al., 2018; Chang et al., 2018) could be developed to deal with Euclidean and non-Euclidean structured data uniformly.
References
 Arjona-Medina et al. (2018) Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., and Hochreiter, S. RUDDER: return decomposition for delayed rewards. CoRR, abs/1806.07857, 2018.
 Baker et al. (2017) Baker, B., Gupta, O., Raskar, R., and Naik, N. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823, 2017.
 Battaglia et al. (2018) Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V. F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gülçehre, Ç., Song, F., Ballard, A. J., Gilmer, J., Dahl, G. E., Vaswani, A., Allen, K., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
 Bello et al. (2016) Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016.
 Bello et al. (2017) Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 459–468, 2017.
 Brock et al. (2017) Brock, A., Lim, T., Ritchie, J. M., and Weston, N. SMASH: one-shot model architecture search through hypernetworks. CoRR, abs/1708.05344, 2017.
 Cai et al. (2018) Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. CoRR, abs/1812.00332, 2018.

 Chang et al. (2018) Chang, J., Gu, J., Wang, L., Meng, G., Xiang, S., and Pan, C. Structure-aware convolutional neural networks. In NeurIPS, pp. 11–20, 2018.
 Chen et al. (2018) Chen, L., Collins, M. D., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., Adam, H., and Shlens, J. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 8713–8724, 2018.

 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248–255, 2009.
 Elsken et al. (2018a) Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. CoRR, abs/1808.05377, 2018a.
 Elsken et al. (2018b) Elsken, T., Metzen, J. H., and Hutter, F. Efficient multi-objective neural architecture search via Lamarckian evolution. 2018b.
 Grave et al. (2016) Grave, E., Joulin, A., and Usunier, N. Improving neural language models with a continuous cache. CoRR, abs/1612.04426, 2016.
 Gumbel (1954) Gumbel, E. J. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office, 1954.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778, 2016.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
 Hsu et al. (2018) Hsu, C., Chang, S., Juan, D., Pan, J., Chen, Y., Wei, W., and Chang, S. MONAS: multi-objective neural architecture search using reinforcement learning. CoRR, abs/1806.10332, 2018.
 Huang et al. (2017) Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2261–2269, 2017.
 Inan et al. (2016) Inan, H., Khosravi, K., and Socher, R. Tying word vectors and word classifiers: A loss framework for language modeling. CoRR, abs/1611.01462, 2016.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456, 2015.
 Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. CoRR, abs/1611.01144, 2016.
 Kamath et al. (2018) Kamath, P., Singh, A., and Dutta, D. Neural architecture construction using envelopenets. CoRR, abs/1803.06744, 2018.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Master’s Thesis, Department of Computer Science, University of Toronto, 2009.

 Lee et al. (2015) Lee, C., Xie, S., Gallagher, P. W., Zhang, Z., and Tu, Z. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.
 Liu et al. (2018a) Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L., Fei-Fei, L., Yuille, A. L., Huang, J., and Murphy, K. Progressive neural architecture search. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pp. 19–35, 2018a.
 Liu et al. (2017) Liu, H., Simonyan, K., Vinyals, O., Fernando, C., and Kavukcuoglu, K. Hierarchical representations for efficient architecture search. CoRR, abs/1711.00436, 2017.
 Liu et al. (2018b) Liu, H., Simonyan, K., and Yang, Y. DARTS: differentiable architecture search. CoRR, abs/1806.09055, 2018b.
 Luo et al. (2018) Luo, R., Tian, F., Qin, T., Chen, E., and Liu, T. Neural architecture optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 7827–7838, 2018.
 Ma et al. (2018) Ma, N., Zhang, X., Zheng, H., and Sun, J. Shufflenet V2: practical guidelines for efficient CNN architecture design. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pp. 122–138, 2018.
 Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016.
 Melis et al. (2017) Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589, 2017.
 Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016.
 Merity et al. (2017) Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182, 2017.
 Miikkulainen et al. (2017) Miikkulainen, R., Liang, J. Z., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Duffy, N., and Hodjat, B. Evolving deep neural networks. CoRR, abs/1703.00548, 2017.
 Pham et al. (2018) Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4092–4101, 2018.
 Real et al. (2017) Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., and Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2902–2911, 2017.
 Real et al. (2018) Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
 Shin et al. (2018) Shin, R., Packer, C., and Song, D. Differentiable neural network architecture search. 2018.
 Stanley & Miikkulainen (2002) Stanley, K. O. and Miikkulainen, R. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.

 Suganuma et al. (2018) Suganuma, M., Shirakawa, S., and Nagao, T. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 5369–5373, 2018.
 Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1–9, 2015.
 Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818–2826, 2016.
 Tan et al. (2018) Tan, M., Chen, B., Pang, R., Vasudevan, V., and Le, Q. V. Mnasnet: Platformaware neural architecture search for mobile. CoRR, abs/1807.11626, 2018.
 Taylor et al. (2003) Taylor, A., Marcus, M., and Santorini, B. The Penn Treebank: an overview. In Treebanks, pp. 5–22. 2003.
 Tran et al. (2017) Tran, D., Ray, J., Shou, Z., Chang, S., and Paluri, M. Convnet architecture search for spatiotemporal feature learning. CoRR, abs/1708.05038, 2017.
 Veit & Belongie (2018) Veit, A. and Belongie, S. J. Convolutional networks with adaptive inference graphs. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pp. 3–18, 2018.
 Xie et al. (2018) Xie, S., Zheng, H., Liu, C., and Lin, L. SNAS: stochastic neural architecture search. CoRR, abs/1812.09926, 2018.
 Yang et al. (2017) Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. CoRR, abs/1711.03953, 2017.
 Zhang et al. (2018) Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 6848–6856, 2018.
 Zhong et al. (2018) Zhong, Z., Yan, J., Wu, W., Shao, J., and Liu, C. Practical block-wise neural network architecture generation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 2423–2432, 2018.
 Zilly et al. (2016) Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. CoRR, abs/1607.03474, 2016.
 Zilly et al. (2017) Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 611 August 2017, pp. 4189–4198, 2017.
 Zoph & Le (2016) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.
 Zoph et al. (2018) Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8697–8710, 2018.