Differentiable Architecture Search with Ensemble Gumbel-Softmax

by   Jianlong Chang, et al.

For network architecture search (NAS), it is crucial but challenging to guarantee both effectiveness and efficiency simultaneously. Towards this goal, we develop a differentiable NAS solution, where the search space includes arbitrary feed-forward networks with a predefined number of connections. Benefiting from a proposed ensemble Gumbel-Softmax estimator, our method optimizes both the architecture of a deep network and its parameters in the same round of backward propagation, yielding an end-to-end mechanism for searching network architectures. Extensive experiments on a variety of popular datasets provide strong evidence that our method is capable of discovering high-performance architectures while guaranteeing the requisite efficiency during searching.



1 Introduction

In the era of deep learning, designing a reasonable network architecture for a specific problem is a challenging task. Designing architectures with state-of-the-art performance typically requires substantial effort from human experts. To eliminate such exhausting engineering, much research has been devoted to automatically searching for architectures, namely neural architecture search (NAS), which has achieved significant success in a multitude of fields, including image classification (Zoph et al., 2018; Real et al., 2018; Liu et al., 2018a, b; Xie et al., 2018), semantic segmentation (Chen et al., 2018) and object detection (Zoph et al., 2018).

So far, three basic frameworks have gained growing interest, i.e., evolution-based NAS, reinforcement learning-based NAS, and gradient-based NAS. Primitively, evolution-based NAS employs evolutionary algorithms to jointly learn the architectures and parameters of networks, such as NEAT (Stanley & Miikkulainen, 2002). Due to the uneconomical search strategy of evolutionary algorithms, evolution-based NAS inherently consumes tremendous time and resources. For instance, it takes 3150 GPU days for AmoebaNet (Real et al., 2018) to reach state-of-the-art performance in comparison with human-designed architectures. To speed up the process, reinforcement learning-based NAS using back-propagation seems to be a natural and plausible choice. Compared with evolution-based NAS, many reinforcement learning-based NAS methods such as ENAS (Pham et al., 2018) can dramatically reduce the time and resource consumption despite the similar optimization mechanism. However, since reinforcement learning-based NAS is generally formulated as a Markov decision process, temporal-difference learning is used to make structural decisions. As a result, the reward in NAS is observable only after the architecture has been chosen and the network has been tested for accuracy, so the search is always subject to delayed rewards in temporal-difference learning (Arjona-Medina et al., 2018). To eliminate this deficiency, gradient-based NAS methods including DARTS (Liu et al., 2018b) and SNAS (Xie et al., 2018) have recently been presented. They try to convert the discrete search space into a continuous one, so that architectures and parameters can be optimized by gradient descent.

In this work, we develop an effective and efficient NAS method, Differentiable ARchiTecture Search with Ensemble Gumbel-Softmax (DARTS-EGS), which is capable of discovering more diversified network architectures while maintaining the differentiability of a promising NAS pipeline. To guarantee the diversity of network architectures along the search path, we represent the whole search space with binary codes. With this modeling, all feed-forward networks consisting of a predefined number of ordered nodes are included in the search space of our model. To maintain the prerequisite efficiency in searching, we develop ensemble Gumbel-Softmax to replace the traditional softmax, yielding a differentiable end-to-end mechanism. That is, our ensemble Gumbel-Softmax can make arbitrary structural decisions, like policy gradient in reinforcement learning, but more efficiently than temporal-difference learning, which suffers from delayed rewards.

To sum up, the main contributions of this work are:

  • By generalizing the traditional Gumbel-Softmax, we develop an ensemble Gumbel-Softmax, which can make arbitrary structural decisions, like policy gradient in reinforcement learning, but with higher efficiency.

  • Benefiting from the ensemble Gumbel-Softmax, the search space can be dramatically increased while maintaining the requisite efficiency in searching, which yields an end-to-end mechanism to identify the network architecture with requisite computational complexity.

  • Extensive experiments verify that our model outperforms current models in searching high-performance convolutional architectures for image classification and recurrent architectures for language modeling.

2 Related Work

Recently, discovering neural architectures automatically has raised great interest in both academia and industry (Bello et al., 2016; Baker et al., 2017; Brock et al., 2017; Tran et al., 2017; Suganuma et al., 2018; Veit & Belongie, 2018). Neural architecture search methods can be roughly divided into three classes according to the search method (Elsken et al., 2018a), i.e., evolution-based NAS, reinforcement learning-based NAS, and gradient-based NAS.

Evolution-based neural architecture search methods (Real et al., 2017; Miikkulainen et al., 2017; Real et al., 2018; Elsken et al., 2018b; Kamath et al., 2018) utilize evolutionary algorithms to generate neural architectures automatically. In (Real et al., 2017, 2018; Elsken et al., 2018b), a large CNN architecture space is explored, and modifications such as inserting a layer, adjusting the filter size and adding an identity mapping are designed as mutations in evolution. Despite their remarkable achievements, these methods require huge computational resources and are less practical at large scale.

Reinforcement learning-based neural architecture search methods are prevalent in recent work (Zoph & Le, 2016; Bello et al., 2017; Zoph et al., 2018; Zhong et al., 2018; Pham et al., 2018). In the pioneering work (Zoph & Le, 2016), an RNN is utilized as the controller to decide the type and parameters of layers sequentially. The controller is trained by reinforcement learning, with the accuracy of the generated architecture serving as the reward. Although it achieves impressive results, the search process is computationally hungry and requires 800 GPUs. Based on (Zoph & Le, 2016), several methods have been proposed to accelerate the search process. Specifically, (Zoph et al., 2018; Zhong et al., 2018) shrink the search space by searching for the architecture of a block and then stacking the searched block to generate the final network. In (Pham et al., 2018), the weights of the network are shared among child models, which saves search time by reducing the cost of obtaining and evaluating each child model. Additionally, a series of well-performing methods have been explored, including progressive search (Liu et al., 2018a) and multi-objective optimization (Tan et al., 2018; Hsu et al., 2018).

In contrast to treating architecture search as a black-box optimization problem, gradient-based neural architecture search methods utilize the gradient obtained in the training process to optimize the neural architecture (Shin et al., 2018; Luo et al., 2018; Liu et al., 2018b; Xie et al., 2018). Typically, NAO (Luo et al., 2018) utilizes RNNs as the encoder and decoder to map architectures into a continuous network embedding space and conducts optimization in this space with a gradient-based method. Another typical method, DARTS (Liu et al., 2018b), chooses the best connection between nodes from a candidate primitive set by employing a softmax classifier. Although DARTS achieves impressive results, the discretization process is not totally differentiable, and the connection between two nodes is limited to a single operation in the candidate primitive set.

3 Methodology

In the following, Section 3.1 introduces our notations, defines the search space and presents our strategy of encoding network architectures with binary codes. Section 3.2 introduces a conceptually intuitive yet powerful relaxation for searching in the discrete search space. Section 3.3 elaborates the proposed ensemble Gumbel-Softmax estimator, which plays a key role in constructing our scheme for jointly optimizing the architecture and its weights. Finally, Section 3.4 details the objective in our model.

3.1 Search Space

To balance optimality and efficiency in NAS, we search for computation cells that constitute the whole network, following prior works (Zoph et al., 2018; Liu et al., 2018a; Real et al., 2018; Pham et al., 2018; Liu et al., 2018b; Xie et al., 2018; Cai et al., 2018). In essence, every such cell is a sub-network, which can be naturally considered as a directed acyclic graph (DAG) consisting of an ordered sequence of nodes. To cover abundant network architectures, our search space is set as the whole space of DAGs with a predefined number of nodes.

Without loss of generality and for simplicity, we denote a cell with $n$ nodes as $\mathcal{C} = \{e_{i,j} \mid 1 \le i < j \le n\}$, where $e_{i,j}$ indicates a directed edge from the $i$-th node to the $j$-th node. Corresponding to each directed edge $e_{i,j}$, there is a set of candidate primitive operations $\mathcal{O} = \{o_1, \dots, o_K\}$, such as convolution, pooling, identity, and zero (“zero” means no connection between two nodes), as defined in DARTS (Liu et al., 2018b). With these operations, the output at the $j$-th node can be formulated as

$$x_j = \sum_{i < j} f_{i,j}(x_i), \tag{1}$$

where $x_i$ denotes the input from the $i$-th node, and $f_{i,j}(\cdot)$ is a function applied to $x_i$ which can be decomposed into a superposition of primitive operations in $\mathcal{O}$, i.e.,

$$f_{i,j}(x_i) = \sum_{k=1}^{K} A_{i,j}^{k}\, o_k(x_i), \quad \text{s.t.}\ A_{i,j}^{k} \in \{0, 1\}, \tag{2}$$

where $o_k$ is the $k$-th candidate primitive operation in $\mathcal{O}$, and $A_{i,j}^{k}$ signifies a binary and thus discrete weight to indicate whether the operation $o_k$ is utilized on the edge $e_{i,j}$. For clarity, we introduce the binary set $A = \{A_{i,j}^{k}\}$ as the network architecture code to represent a cell and $A_{i,j} = [A_{i,j}^{1}, \dots, A_{i,j}^{K}]$ as the edge architecture code to represent the structure of the edge $e_{i,j}$. To reveal the capability of the above encoding for representing cells, we have the following proposition.

Proposition 1.

For an arbitrary feed-forward network consisting of a limited number of ordered nodes and candidate operations in $\mathcal{O}$, there is one and only one architecture code $A$ that corresponds to it.

Proposition 1 guarantees the uniqueness of our model encoding. It also implies that our search space includes the whole set of feed-forward networks, and it guarantees the expressive capability of the formulation in Eq. (2). With such modeling, many previously defined search spaces can be regarded as subsets of ours. For example, our modeling degrades into that of DARTS (Liu et al., 2018b) and SNAS (Xie et al., 2018) when the constraint $\sum_{k} A_{i,j}^{k} = 1$ on the binary weights of each edge is further introduced. Under this constraint, only one operation is chosen on each edge, which can be considered as a one-category problem.
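The binary encoding of Eqs. (1) and (2) can be illustrated with a toy cell. The sketch below is only an illustration under stated assumptions: the three scalar functions stand in for real primitive operations, nodes are indexed from 0 rather than 1, and the names `OPS`, `edge_output`, `cell_forward` and `num_architecture_codes` are hypothetical, not taken from the paper.

```python
# Hypothetical scalar stand-ins for candidate primitive operations o_1..o_K.
OPS = [
    ("identity", lambda x: x),
    ("double", lambda x: 2.0 * x),
    ("square", lambda x: x * x),
]

def edge_output(edge_code, x):
    """Eq. (2): sum the operations whose binary weight is 1."""
    return sum(op(x) for bit, (_, op) in zip(edge_code, OPS) if bit)

def cell_forward(arch_code, x0, n_nodes):
    """Eq. (1): node j sums the edge outputs from every earlier node i < j."""
    x = [x0]
    for j in range(1, n_nodes):
        x.append(sum(edge_output(arch_code[(i, j)], x[i]) for i in range(j)))
    return x

def num_architecture_codes(n_nodes, n_ops):
    """Number of distinct binary codes: one bit per (edge, operation) pair."""
    n_edges = n_nodes * (n_nodes - 1) // 2
    return 2 ** (n_ops * n_edges)

arch_code = {
    (0, 1): [1, 1, 0],  # identity + double on the same edge
    (0, 2): [0, 0, 0],  # no operation selected: no connection
    (1, 2): [1, 0, 0],  # identity only
}
outputs = cell_forward(arch_code, 1.0, n_nodes=3)
n_codes = num_architecture_codes(n_nodes=3, n_ops=3)
```

Restricting every edge code to exactly one active bit would recover the one-operation-per-edge setting described above; the unrestricted code allows several operations on the same edge.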

3.2 Relaxation of The Search Space

Benefiting from the uniqueness property of our architecture code $A$, as stated in Proposition 1, the task of learning the cell can be converted to approaching the optimal code. Yet, as mentioned, one major obstacle to efficient NAS is the difficulty of optimization in the discrete space. In this subsection, we introduce our strategy for searching in the continuous space as a proxy. Without loss of generality, we elaborate the strategy on a single edge $e_{i,j}$ from the $i$-th node to the $j$-th node.

To make the search space continuous, we relax the categorical choice of a set of particular operations as follows:

$$f_{i,j}(x_i) = \sum_{k=1}^{K} g(p_{i,j}^{k})\, o_k(x_i), \quad \text{s.t.}\ \sum_{k=1}^{K} p_{i,j}^{k} = 1,\ p_{i,j}^{k} \ge 0, \tag{3}$$

where $p_{i,j}^{k}$ denotes the probability of choosing the $k$-th operation $o_k$ on the edge $e_{i,j}$, and $g(\cdot)$ represents a binary function that suffices to map a probability vector to a binary code and pass gradients in a continuous manner. We discuss it further in Section 3.3 and explain how we manage to pass gradients through it. Specifically, $g$ is chosen to be a monotonic increasing function in our method, i.e.,

$$P\bigl(g(p_{i,j}^{k_1}) = 1\bigr) \le P\bigl(g(p_{i,j}^{k_2}) = 1\bigr) \quad \text{whenever}\ p_{i,j}^{k_1} \le p_{i,j}^{k_2}. \tag{4}$$

By substituting the binary weight $A_{i,j}^{k}$ with $g(p_{i,j}^{k})$ and considering $p_{i,j}^{k}$ instead as the variable to be optimized, we achieve a continuous relaxation. Thanks to the flexibility of this formulation, it is capable of modeling multi-category problems and incorporating resource constraints seamlessly, as discussed next.

Figure 1: A visualized comparison between Gumbel-Softmax (left) and ensemble Gumbel-Softmax (right). For a probability vector $p = [0.5, 0.5]$, Gumbel-Softmax can sample only two binary codes with the same probability, i.e., $[1, 0]$ and $[0, 1]$. In contrast, our ensemble Gumbel-Softmax is capable of sampling more diversified binary codes, i.e., $[1, 0]$, $[0, 1]$, and $[1, 1]$. Furthermore, the probabilities of sampling these binary codes are logical. Typically, it is conceptually intuitive that the probability of sampling $[1, 1]$ is larger than the probabilities of sampling the others, since the probabilities in $p$ are equal to each other.

3.2.1 Modeling Multi-Category Problems

Different from the traditional probability vector, which can only handle one-category problems, the formulation in Eq. (3) is in a position to solve multi-category problems. This advantage inherently comes from the monotonicity of the binary function $g$. It endows the formulation with the capability of modeling more general relationships between different categories, not limited to mutual exclusion.

3.2.2 Modeling Resource Constraint

In NAS, desired methods should not only show effectiveness (i.e., test-set accuracy) but also possess superior efficiency (i.e., search cost). Aiming at a high performing NAS framework, we integrate the two core parts together and represent as


where and represent the credits in terms of effectiveness and efficiency when choosing the -th operation on the edge , and is a hyper-parameter for balancing the two parts. In practice, the validation-set accuracy and search time are employed to represent the effectiveness and efficiency of our model respectively, and is always fixed in this work.

3.3 Optimization with Ensemble Gumbel-Softmax

Although the relaxation presented in Section 3.2 makes the search space continuous, it remains to define the binary function $g$ that maps each of the probabilities to a binary code. In principle, this is still a problem of learning to make discrete decisions, which is unfortunately non-differentiable and has no useful gradients almost everywhere. To leverage gradient information as in generic deep learning, we introduce an ensemble Gumbel-Softmax estimator to optimize the problem with a principled approximation. As such, the back-propagation algorithm can be directly adopted in an end-to-end manner, yielding an efficient and effective search mechanism.

Figure 2: A conceptual visualization of the search process within DARTS-EGS. (a) First, a cell (i.e., a directed acyclic graph) consisting of four ordered nodes is predefined. (b) During forward propagation, with three candidate primitive operations (i.e., green, orange and cyan lines), our ensemble Gumbel-Softmax is employed to sample a network in a differentiable manner. During backward propagation, the standard back-propagation algorithm is utilized to simultaneously calculate the gradients of both the architecture and the network parameters. (c) Finally, the details of the cell can be sampled with our ensemble Gumbel-Softmax and utilized to handle specific tasks.

3.3.1 Gumbel-Softmax

A natural formulation for representing discrete variables is the categorical distribution. However, partially due to the inability to back-propagate information through samples, it is rarely applied in deep learning. In this work, we resort to the Gumbel-Max trick (Gumbel, 1954) for enabling back-propagation and representing the process of making decisions as sampling from a categorical distribution, in order to perform NAS in a principled way. Specifically, given a probability vector $p = [p_1, \dots, p_K]$ and a discrete random variable $L$ with $P(L = k) \propto p_k$, we sample from the discrete variable by introducing Gumbel random variables. To be more specific, we let

$$L = \arg\max_{k \in \{1, \dots, K\}} \bigl(\log p_k + G_k\bigr), \tag{6}$$

where $\{G_k\}_{k \le K}$ is a sequence of standard Gumbel random variables, typically sampled as $G = -\log(-\log U)$ with $U \sim \mathrm{Uniform}(0, 1)$. An obstacle to directly using this approach in our problem is that the argmax operation is not continuous. One straightforward way of dealing with this problem is to replace the argmax operation with a softmax (Jang et al., 2016; Maddison et al., 2016). Formally, the Gumbel-Softmax (GS) estimation can be expressed as

$$\tilde{L}_k = \frac{\exp\bigl((\log p_k + G_k)/\tau\bigr)}{\sum_{k'=1}^{K} \exp\bigl((\log p_{k'} + G_{k'})/\tau\bigr)}, \tag{7}$$

where $\tilde{L}_k$ is a relaxed indicator that the $k$-th perturbed entry $\log p_k + G_k$ is the maximal one, and $\tau$ is a temperature. When $\tau \to 0$, $\tilde{L} = [\tilde{L}_1, \dots, \tilde{L}_K]$ converges to a one-hot vector, and in the other extreme, $\tau \to +\infty$, it becomes a discrete uniform distribution.
From the expression in Eq. (7), we see that the traditional Gumbel-Softmax can only deal with problems in which a single category has to be determined. In NAS, however, an optimal architecture may require multiple operations on an edge, considering their practical significance (He et al., 2016; Szegedy et al., 2016). For instance, the residual module in ResNets (He et al., 2016) consists of two operations, a learnable mapping $F(x)$ and the identity $x$. That is, different operations in the candidate set may not be mutually exclusive but compatible. One direct way of handling this limitation is to map all possible operation combinations to $2^K$-dimensional vectors, where $K$ is the number of candidate operations. However, it seems difficult to search architectures efficiently when there are many candidate operations (i.e., when $2^K$ is large).
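The Gumbel-Softmax relaxation described in this subsection can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names `sample_gumbel` and `gumbel_softmax` are hypothetical, and all entries of `p` are assumed to be strictly positive so that their logarithms exist.

```python
import math
import random

def sample_gumbel(rng):
    """Standard Gumbel noise: G = -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, which would give log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_softmax(p, tau, rng):
    """Relaxed one-hot sample: softmax of (log p_k + G_k) / tau."""
    logits = [(math.log(pk) + sample_gumbel(rng)) / tau for pk in p]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
soft_sample = gumbel_softmax([0.2, 0.3, 0.5], tau=0.5, rng=rng)
flat_sample = gumbel_softmax([0.2, 0.3, 0.5], tau=1e6, rng=rng)
```

A small temperature pushes the output towards a one-hot vector, while a very large temperature flattens it towards the uniform distribution, matching the two limits described above.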

3.3.2 Ensemble Gumbel-Softmax

To address the aforementioned limitation of the traditional Gumbel-Softmax, we propose our Ensemble Gumbel-Softmax (EGS) estimator. With this estimator, we are able to choose multiple operations on each edge by sampling more diversified codes, not limited to one-hot codes. To this end, we start by recoding binary codes. For the edge architecture code $A_{i,j} = [A_{i,j}^{1}, \dots, A_{i,j}^{K}]$ on an edge $e_{i,j}$, the whole architecture information included in $A_{i,j}$ can be recoded into a superposition of one-hot vectors, i.e.,

$$A_{i,j} = \sum_{k:\, A_{i,j}^{k} = 1} a_k, \tag{8}$$

where $a_k$ is a $K$-dimensional one-hot vector that uniquely corresponds to the operation $o_k$. This equivalence reveals that compositing the results sampled from Gumbel-Softmax is a straightforward way to sample $A_{i,j}$, although Gumbel-Softmax by itself is capable of sampling only the one-hot vectors $a_k$, as described in Section 3.3.1.

For an effective and efficient sampler, we composite all the sampling results. That is, an ensemble of multiple Gumbel-Softmax samplers is well suited for sampling diversified binary codes, i.e.,

Definition 1.

For a $K$-dimensional probability vector $p$ and $M$ one-hot vectors $z^{(1)}, \dots, z^{(M)}$ sampled from $\mathrm{GS}(p)$, the $K$-dimensional binary code $A$ sampled with ensemble Gumbel-Softmax is

$$A = [A_1, \dots, A_K], \qquad A_k = \max_{1 \le m \le M} z_k^{(m)},$$

where $M$ is the number of sampling times, $A_k$ is the $k$-th element in $A$, and $z_k^{(m)}$ indicates the $k$-th element in $z^{(m)}$.
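The sampling procedure of Definition 1 can be sketched as follows, reading the composition of the one-hot samples as an element-wise maximum (a logical OR). This is a sketch under assumptions rather than the authors' code: each one-hot vector is drawn with the hard Gumbel-Max trick (the zero-temperature limit), whereas a differentiable implementation would work with relaxed Gumbel-Softmax samples, which is not shown here. The names `sample_one_hot` and `ensemble_gumbel_softmax` are hypothetical, and all probabilities are assumed strictly positive.

```python
import math
import random

def sample_one_hot(p, rng):
    """Gumbel-Max trick: one-hot vector at argmax_k (log p_k + G_k)."""
    scores = []
    for pk in p:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        scores.append(math.log(pk) - math.log(-math.log(u)))
    winner = max(range(len(p)), key=scores.__getitem__)
    return [1 if k == winner else 0 for k in range(len(p))]

def ensemble_gumbel_softmax(p, num_samples, rng):
    """Element-wise maximum over num_samples one-hot draws gives a binary code."""
    draws = [sample_one_hot(p, rng) for _ in range(num_samples)]
    return [max(column) for column in zip(*draws)]

rng = random.Random(0)
single_code = ensemble_gumbel_softmax([0.25, 0.25, 0.25, 0.25], 1, rng)
multi_code = ensemble_gumbel_softmax([0.25, 0.25, 0.25, 0.25], 3, rng)
```

With a single draw the result is an ordinary one-hot code; with several draws the code may contain more than one active bit, but never more than the number of draws.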

Figure 3: Three cells learned by DARTS-EGS (). (a) Normal cell learned on CIFAR-10. (b) Reduction cell learned on CIFAR-10. (c) Recurrent cell learned on PTB. Specifically, the green and yellow nodes indicate the input and output, respectively. The solid and dashed lines mean the learnable and predefined operations, respectively.

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Ops | Search Method |
|---|---|---|---|---|---|
| DenseNet-BC (Huang et al., 2017) | 3.46 | 25.6 | - | - | manual |
| NASNet-A + cutout (Zoph et al., 2018) | 2.65 | 3.3 | 2000 | 13 | RL |
| BlockQNN (Zhong et al., 2018) | 3.54 | 39.8 | 96 | 8 | RL |
| ENAS + cutout (Pham et al., 2018) | 2.89 | 4.6 | 0.5 | 6 | RL |
| AmoebaNet-A (Real et al., 2018) | 3.34 | 3.2 | 3150 | 19 | evolution |
| AmoebaNet-B + cutout (Real et al., 2018) | 2.55 | 2.8 | 3150 | 19 | evolution |
| Hierarchical evolution (Liu et al., 2017) | 3.75 | 15.7 | 300 | 6 | evolution |
| PNAS (Liu et al., 2018a) | 3.41 | 3.2 | 225 | 8 | SMBO |
| DARTS (1st order) + cutout (Liu et al., 2018b) | 3.00 | 3.3 | 1.5 | 7 | gradient-based |
| DARTS (2nd order) + cutout (Liu et al., 2018b) | 2.76 | 3.3 | 4 | 7 | gradient-based |
| SNAS + mild + cutout (Xie et al., 2018) | 2.98 | 2.9 | 1.5 | - | gradient-based |
| SNAS + moderate + cutout (Xie et al., 2018) | 2.85 | 2.8 | 1.5 | - | gradient-based |
| SNAS + aggressive + cutout (Xie et al., 2018) | 3.10 | 2.3 | 1.5 | - | gradient-based |
| Random search baseline + cutout | 3.29 | 3.2 | 4 | 7 | random |
| DARTS-EGS () | 3.01 | 2.6 | 1 | 7 | gradient-based |
| DARTS-EGS () | 2.79 | 2.9 | 1 | 7 | gradient-based |

Table 1: Comparison with state-of-the-art image classifiers on CIFAR-10 (lower error rate is better).

3.3.3 Understanding Our EGS

To reveal the serviceability and sampling capability of our ensemble Gumbel-Softmax, from the definition, two basic propositions are given in the following.

First, whether an element in the sampled binary code is one depends on the probability at the corresponding location.

Proposition 2.

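For an arbitrary probability vector $p = [p_1, \dots, p_K]$ and number of sampling times $M$, we have

$$P(A_k = 1) = 1 - (1 - p_k)^M,$$

where $P(A_k = 1)$ is the probability that the $k$-th element $A_k$ of the sampled binary code equals one. Furthermore,

$$P(A_{k_1} = 1) \le P(A_{k_2} = 1),$$

where $p_{k_1} \le p_{k_2}$.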

Proposition 2 implies that our ensemble Gumbel-Softmax is a monotonic increasing function in terms of probability, and is thus in a position to act as the binary function $g$ in Eq. (3).
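The first statement of Proposition 2 can be checked with a short derivation. The steps below assume that the one-hot vectors $z^{(1)}, \dots, z^{(M)}$ are drawn independently and that each draw selects index $k$ with probability $p_k$ (the hard, zero-temperature case), with the sampled bit $A_k$ taken as the element-wise maximum over the draws; these assumptions are stated here for the sketch and are not quoted from the paper.

```latex
\begin{aligned}
P(A_k = 1) &= 1 - P\bigl(z^{(1)}_k = 0, \dots, z^{(M)}_k = 0\bigr) \\
           &= 1 - \prod_{m=1}^{M} P\bigl(z^{(m)}_k = 0\bigr) \\
           &= 1 - (1 - p_k)^{M}.
\end{aligned}
```

Since $1 - (1 - p_k)^M$ is non-decreasing in $p_k$ on $[0, 1]$, a larger probability never yields a smaller chance of the corresponding bit being one, which is the monotonicity property.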

Second, the capability of sampling binary codes is determined by $M$, i.e., the number of one-hot vectors sampled from Gumbel-Softmax. Specifically, we have

Proposition 3.

For an arbitrary probability vector $p$ of dimension $K$ and number of sampling times $M$, the ensemble Gumbel-Softmax is capable of sampling $\sum_{m=1}^{\min(M, K)} \binom{K}{m}$ different binary codes, which include all binary codes with at least one and up to $M$ ones.

Proposition 3 indicates that the sampling capability of ensemble Gumbel-Softmax increases exponentially with the number of sampling times $M$. In practice, a larger $M$ is employed to deal with more complex tasks for effectiveness, and a smaller one can be utilized to search more lightweight networks for efficiency.
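Both propositions can be checked numerically by enumerating every possible sequence of draws. The sketch below assumes independent hard one-hot draws, each selecting index k with probability p_k, composed by an element-wise maximum; the function name `egs_distribution` and the example values of `p` and `M` are illustrative choices, not values from the paper.

```python
from itertools import product
from math import comb, prod

def egs_distribution(p, num_samples):
    """Exact distribution over binary codes for num_samples independent draws."""
    dist = {}
    for draws in product(range(len(p)), repeat=num_samples):
        code = tuple(1 if k in draws else 0 for k in range(len(p)))
        dist[code] = dist.get(code, 0.0) + prod(p[k] for k in draws)
    return dist

# Two equally likely operations, two draws (an illustrative setting).
two_op_dist = egs_distribution([0.5, 0.5], 2)

# A four-operation example with three draws.
p = [0.1, 0.2, 0.3, 0.4]
M = 3
dist = egs_distribution(p, M)
n_codes = len(dist)
expected_codes = sum(comb(len(p), m) for m in range(1, min(M, len(p)) + 1))
marginals = [sum(pr for code, pr in dist.items() if code[k]) for k in range(len(p))]
```

In the two-operation case the code with both bits set receives probability 0.5 and each one-hot code receives 0.25. In the four-operation case the number of reachable codes matches the count in Proposition 3, and each marginal matches the closed form in Proposition 2.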

In summary, our ensemble Gumbel-Softmax is not only applicable to searching network architectures but also has, in theory, sufficient sampling capability to guarantee performance. In Figure 1, a visualized comparison between Gumbel-Softmax and ensemble Gumbel-Softmax intuitively shows that our ensemble Gumbel-Softmax is superior to the traditional Gumbel-Softmax in terms of both sampling capability and rationality in practice.

3.4 Parameter Learning and Architecture Sampling

To eliminate the compromise of the NAS pipeline, the parameters and architectures are simultaneously optimized by means of ensemble Gumbel-Softmax. Analogous to architecture search in (Zoph & Le, 2016; Zoph et al., 2018; Pham et al., 2018; Liu et al., 2017; Real et al., 2018; Liu et al., 2018b), the validation set performance is considered as the reward in our model, but in an end-to-end differentiable manner. Denote the training loss as $\mathcal{L}$, the network parameter as $\theta$, and the architecture parameter (the probabilities on all edges) as $p$. The goal in architecture search is to find a high-performance architecture, i.e.,

$$\min_{\theta,\, p}\ \mathbb{E}_{A \sim \mathrm{EGS}(p)}\bigl[\mathcal{L}(\theta, A)\bigr].$$

The main process of optimizing this objective is to minimize the expected loss of architectures sampled with ensemble Gumbel-Softmax. That is, the network is first sampled with ensemble Gumbel-Softmax. Afterward, the loss on the training dataset is calculated by forward propagation. Based on this loss, the gradients with respect to the architecture parameter $p$ and the network parameter $\theta$ are computed to update these parameters. Because of the differentiability, our model can be trained end-to-end by the standard back-propagation algorithm. In the end, the network architecture is identified by sampling with ensemble Gumbel-Softmax, and the network parameter is estimated by retraining on the training set. A conceptual visualization of our DARTS-EGS model is illustrated in Figure 2.

4 Experiments

In this section, we carry out extensive experiments to verify the capability of our model in discovering high-performance convolutional networks for image classification and recurrent networks for language modeling. For each task, the experiments consist of two stages, following previous work (Liu et al., 2018b; Xie et al., 2018). First, we search for the cell architectures based on our ensemble Gumbel-Softmax and find the best cells according to their validation performance. Second, the transferability of the best cells learned on CIFAR-10 (Krizhevsky & Hinton, 2009) and Penn Tree Bank (PTB) (Taylor et al., 2003) is investigated by employing them on large datasets, i.e., classification on ImageNet (Deng et al., 2009) and language modeling on WikiText-2 (WT2) (Merity et al., 2016), respectively. Since our method builds on DARTS, the experimental settings are inherited from it, except for some special settings in each experiment.

4.1 Image Classification

| Architecture | Top-1 Test Error (%) | Top-5 Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| Inception-v1 (Szegedy et al., 2015) | 30.2 | 10.1 | 6.6 | - | manual |
| MobileNet (Howard et al., 2017) | 29.4 | 10.5 | 4.2 | - | manual |
| ShuffleNet-v1 2× (Zhang et al., 2018) | 26.3 | - | 5 | - | manual |
| ShuffleNet-v2 2× (Ma et al., 2018) | 25.1 | - | 5 | - | manual |
| AmoebaNet-A (Real et al., 2018) | 25.5 | 8.0 | 5.1 | 3150 | evolution |
| AmoebaNet-B (Real et al., 2018) | 26.0 | 8.5 | 5.3 | 3150 | evolution |
| AmoebaNet-C (Real et al., 2018) | 24.3 | 7.6 | 6.4 | 3150 | evolution |
| PNAS (Liu et al., 2018a) | 25.8 | 8.1 | 5.1 | 225 | SMBO |
| DARTS (searched on CIFAR-10) (Liu et al., 2018b) | 26.7 | 8.7 | 4.7 | 4 | gradient-based |
| SNAS (mild constraint) (Xie et al., 2018) | 27.3 | 9.2 | 4.3 | 1.5 | gradient-based |
| NASNet-A (Zoph et al., 2018) | 26.0 | 8.4 | 5.3 | 2000 | RL |
| NASNet-B (Zoph et al., 2018) | 27.2 | 8.7 | 5.3 | 2000 | RL |
| NASNet-C (Zoph et al., 2018) | 27.5 | 9.0 | 4.9 | 2000 | RL |
| DARTS-EGS () | 25.7 | 8.5 | 4.3 | 1.5 | gradient-based |
| DARTS-EGS () | 24.9 | 8.1 | 4.7 | 1.5 | gradient-based |

Table 2: Comparison with state-of-the-art image classifiers on ImageNet in the mobile setting (lower error rate is better).

4.1.1 Architecture Search on CIFAR-10

In our experiments, eight typical operations are included in the candidate primitive set: $3 \times 3$ and $5 \times 5$ separable convolutions, $3 \times 3$ and $5 \times 5$ dilated separable convolutions, $3 \times 3$ max pooling, $3 \times 3$ average pooling, identity, and zero. To preserve spatial resolution, all operations are of stride one, and the convolutional feature maps are padded if necessary. In ensemble Gumbel-Softmax, the number of sampling times is set for a rich search space. Following the settings in previous work (Liu et al., 2018a, b; Xie et al., 2018), the ReLU-Conv-BN order is used in all convolution operations, and every separable convolution is always applied twice.

The node settings of our convolutional cell also follow previous work (Zoph et al., 2018; Real et al., 2018; Liu et al., 2018a, b). Specifically, every cell consists of a predefined number of nodes, among which the output node is defined as the depthwise concatenation of all the intermediate nodes. Larger networks are established by stacking multiple cells together. In the $k$-th cell, the first and second nodes are set equal to the outputs of the $(k-2)$-th and $(k-1)$-th cells respectively, with $1 \times 1$ convolutions inserted as necessary. Furthermore, reduction cells with the architecture coding $A_{\text{reduce}}$ are placed at 1/3 and 2/3 of the total depth of the network. The remaining cells are normal cells with the architecture coding $A_{\text{normal}}$.
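The stacking scheme can be made concrete with a small helper that lists the cell type at each depth. This sketch assumes the convention of the DARTS line of work, in which reduction cells sit at one third and two thirds of the total depth; the integer-division rounding, the 0-based indexing and the function name `cell_layout` are assumptions made for illustration.

```python
def cell_layout(num_cells):
    """Return the cell type at each depth, with reduction cells at 1/3 and 2/3."""
    reduction_positions = {num_cells // 3, 2 * num_cells // 3}
    return [
        "reduction" if index in reduction_positions else "normal"
        for index in range(num_cells)
    ]

layout = cell_layout(20)
reduction_indices = [i for i, kind in enumerate(layout) if kind == "reduction"]
```

For a 20-cell network this places the two reduction cells at 0-based positions 6 and 13, with the remaining 18 cells being normal cells.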

4.1.2 Architecture Evaluation on CIFAR-10

To evaluate the selected architecture, a large network of 20 cells is trained from scratch for 600 epochs with batch size 96, and its performance on the test set is reported. Additional enhancements include cutout with size 16, path dropout with probability 0.2 and auxiliary towers with weight 0.4, following existing works for a fair comparison. We report the mean of 4 independent runs for our full model.

Figure 3 and Table 1 give the searched architectures and classification results on CIFAR-10, which show that our DARTS-EGS achieves results comparable with the state of the art while using fewer computational resources. This performance verifies that DARTS-EGS can effectively and efficiently search for worthy architectures for classification. Furthermore, DARTS-EGS yields higher accuracy with the larger number of sampling times than with the smaller one. This is in accordance with our motivation that a richer search space is beneficial for searching architectures.

| Architecture | Valid Perplexity | Test Perplexity | Params (M) | Search Cost (GPU days) | Ops | Search Method |
|---|---|---|---|---|---|---|
| Variational RHN (Zilly et al., 2016) | 67.9 | 65.4 | 23 | - | - | manual |
| LSTM (Merity et al., 2017) | 60.7 | 58.8 | 24 | - | - | manual |
| LSTM + skip connections (Melis et al., 2017) | 60.9 | 58.3 | 24 | - | - | manual |
| LSTM + 15 softmax experts (Yang et al., 2017) | 58.1 | 56.0 | 22 | - | - | manual |
| DARTS (first order) (Liu et al., 2018b) | 60.2 | 57.6 | 23 | 0.5 | 4 | gradient-based |
| DARTS (second order) (Liu et al., 2018b) | 58.1 | 55.7 | 23 | 1 | 4 | gradient-based |
| NAS (Zoph & Le, 2016) | - | 64.0 | 25 | 1e4 CPU days | 4 | RL |
| ENAS (Pham et al., 2018) | 68.3 | 63.1 | 24 | 0.5 | 4 | RL |
| Random search baseline | 61.8 | 59.4 | 23 | 2 | 4 | random |
| DARTS-EGS () | 58.3 | 56.2 | 22 | 0.5 | 4 | gradient-based |
| DARTS-EGS () | 57.1 | 55.3 | 23 | 0.5 | 4 | gradient-based |

Table 3: Comparison with state-of-the-art language models on PTB (lower perplexity is better).

| Architecture | Valid Perplexity | Test Perplexity | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| LSTM + augmented loss (Inan et al., 2016) | 91.5 | 87.0 | 28 | - | manual |
| LSTM + cache pointer (Grave et al., 2016) | - | 68.9 | - | - | manual |
| LSTM (Merity et al., 2017) | 69.1 | 66.0 | 33 | - | manual |
| LSTM + skip connections (Melis et al., 2017) | 69.1 | 65.9 | 24 | - | manual |
| LSTM + 15 softmax experts (Yang et al., 2017) | 66.0 | 63.3 | 33 | - | manual |
| DARTS (searched on PTB) (Liu et al., 2018b) | 69.5 | 66.9 | 33 | 1 | gradient-based |
| ENAS (searched on PTB) (Pham et al., 2018) | 72.4 | 70.4 | 33 | 0.5 | RL |
| DARTS-EGS () | 67.3 | 64.6 | 33 | 1 | gradient-based |
| DARTS-EGS () | 66.5 | 64.2 | 33 | 1 | gradient-based |

Table 4: Comparison with state-of-the-art language models on WT2 (lower perplexity is better).

| Number of Sampling Times | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Test Error (%) | 3.38 | 3.22 | 3.09 | 3.05 | 3.01 | 2.84 | 2.79 | 2.74 | 2.73 |
| Params (M) | 2.29 | 2.51 | 2.56 | 2.60 | 2.72 | 2.84 | 2.79 | 2.90 | 3.02 |

Table 5: Sensitivity to the number of sampling times on CIFAR-10 (lower error rate is better).

4.1.3 Transferability Evaluation on ImageNet

We apply the mobile setting, where the input image size is $224 \times 224$ and the number of multiply-add operations of the model is restricted to be under 600M. A network of 14 cells is trained for 250 epochs with batch size 128, weight decay $3 \times 10^{-5}$, and a poly learning rate scheduler with initial learning rate 0.1. Label smoothing (Szegedy et al., 2016) and an auxiliary loss (Lee et al., 2015) are used during training. Other hyperparameters follow (Liu et al., 2018b).

In Table 2, we report the quantitative results on ImageNet. Note that the cell searched on CIFAR-10 can be smoothly employed for the large-scale classification task. Furthermore, compared with other gradient-based NAS methods, greater margins are obtained on ImageNet. A possible reason is that more complex architectures can be found by DARTS-EGS because of the larger search space, and such architectures handle the more complex task on ImageNet better.

4.2 Language Modeling

4.2.1 Architecture Search on PTB

In the language modeling task, our model is employed to search for suitable activation functions between nodes. Following the setting in (Zoph et al., 2018; Pham et al., 2018; Liu et al., 2018b), five popular functions are considered in the candidate primitive set , namely sigmoid, tanh, relu, identity, and zero. In addition, there are nodes in the recurrent cell, and the number of sampling times is set in ensemble Gumbel-Softmax for a rich search space. Similar to ENAS (Pham et al., 2018) and DARTS (Liu et al., 2018b), the very first intermediate node in a cell is obtained by linearly transforming the two input nodes, adding up the results, and passing the sum through the tanh function; the remaining activation functions are learned with our model and enhanced with the highway bypass (Zilly et al., 2017). We enable batch normalization (Ioffe & Szegedy, 2015) in each node to prevent gradient explosion during architecture search, and disable it during architecture evaluation. Our recurrent network consists of only a single cell, i.e., we do not assume any repetitive patterns within the recurrent architecture.
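
The structure of the recurrent cell described above can be illustrated with a scalar sketch. It is a simplification under stated assumptions, not the authors' implementation: real cells use weight matrices and vectors, the gated form of the highway bypass below follows the common `s_prev + c * (h - s_prev)` formulation, and all function and parameter names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


# The five candidate activation functions named in the text.
ACTIVATIONS = {
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "relu": lambda v: max(0.0, v),
    "identity": lambda v: v,
    "zero": lambda v: 0.0,
}


def first_node(x, h_prev, w_x, w_h):
    """First intermediate node: linear maps of both inputs, summed, then tanh."""
    return math.tanh(w_x * x + w_h * h_prev)


def highway_node(s_prev, w_gate, w_act, activation):
    """Later node: a searched activation wrapped in a highway bypass.

    The gate c in (0, 1) mixes the new value h with the predecessor s_prev
    (assumed form of the bypass).
    """
    c = sigmoid(w_gate * s_prev)
    h = ACTIVATIONS[activation](w_act * s_prev)
    return s_prev + c * (h - s_prev)
```

For example, `highway_node(first_node(0.2, 0.1, 1.0, 1.0), 0.5, 1.0, "relu")` chains the fixed first node with one searched node; the search decides which key of `ACTIVATIONS` each such node uses.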

4.2.2 Architecture Evaluation on PTB

We train a single-layer recurrent network with the searched cell for 1600 epochs with batch size 64 using averaged SGD. Both the embedding and the hidden sizes are set to 850 so that our model size is comparable with other baselines. Other hyperparameters are set following (Liu et al., 2018b). For a fair comparison, the model is not fine-tuned at the end of the optimization, nor do we use any additional enhancements.

Table 3 lists the results of this experiment. They show that DARTS-EGS can also search recurrent architectures effectively: back-propagation can guide DARTS-EGS to a preferable architecture in a larger search space, while maintaining the requisite efficiency. Similar to the conclusion in Section 4.1.2, lower perplexity is achieved when a larger M is employed, which verifies that richer search spaces are also valuable for recurrent architectures.
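
For reference, the perplexity metric reported in these tables is the exponential of the average negative log-likelihood per token. The snippet below is a generic sketch of that definition, with a hypothetical function name; it is not taken from the authors' evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to the target tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 1/4 to every target token has perplexity 4, so lower values mean the model is less uncertain about the next token.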

4.2.3 Transferability Evaluation on WT2

Different from the setting on PTB, we apply embedding and hidden sizes of 700, weight decay 5, and hidden-node variational dropout 0.15. Other hyperparameters remain the same as in the experiment on PTB. In Table 4, the results on WT2 indicate that transferability is also retained for recurrent architectures. Taken together, the consistent results in Sections 4.1.3 and 4.2.3 support the transferability of the searched cells for both convolutional and recurrent architectures.

4.3 Ablation Study

4.3.1 Sensitivity to Number of Sampling Times

We perform experiments on CIFAR-10 to analyze the sensitivity to the number of sampling times M. Table 5 gives the results of this experiment. A larger M yields a lower test error, while the number of parameters generally grows as M increases. This is in accordance with Proposition 3, i.e., more capable networks might be found with larger M, leading to higher performance.
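
The role of M can be made concrete with a small sampling sketch. It assumes that ensemble Gumbel-Softmax draws M Gumbel-Softmax samples over the candidate operations and merges them with an element-wise maximum; this aggregation rule, the temperature `tau`, and all function names are assumptions made for illustration, not details confirmed by this section.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """One relaxed (soft) one-hot sample over the candidate operations."""
    scores = [(logit + gumbel_noise(rng)) / tau for logit in logits]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_gumbel_softmax(logits, num_samples, tau=1.0, rng=None):
    """Merge M soft samples by element-wise max (assumed aggregation)."""
    if rng is None:
        rng = random.Random()
    code = [0.0] * len(logits)
    for _ in range(num_samples):
        sample = gumbel_softmax(logits, tau, rng)
        code = [max(c, s) for c, s in zip(code, sample)]
    return code


def sample_binary_code(logits, num_samples, rng=None):
    """Hard variant: union of M Gumbel-max draws, giving a 0/1 code."""
    if rng is None:
        rng = random.Random()
    code = [0] * len(logits)
    for _ in range(num_samples):
        scores = [logit + gumbel_noise(rng) for logit in logits]
        code[scores.index(max(scores))] = 1
    return code
```

Under this reading, M = 1 selects exactly one operation per edge, whereas a larger M can switch on up to min(M, K) of the K candidates. That is one plausible explanation for why both accuracy and parameter count tend to rise with M in Table 5.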

4.3.2 Performance on Semantic Segmentation

We also validate our method on a more complex task, i.e., semantic segmentation on VOC-2012. Compared with DARTS (), DARTS-EGS achieves better performance () with a larger margin than in classification on ImageNet. Such results suggest that the advantage of DARTS-EGS may be more pronounced on more complex tasks, not just toy tasks.

5 Conclusion

We present DARTS-EGS, a framework for searching network architectures that covers more diversified architectures while maintaining the differentiability of the NAS pipeline. To this end, network architectures are represented with arbitrary discrete binary codes, which guarantees the breadth of the search space. To ensure efficiency in searching, ensemble Gumbel-Softmax is developed to search architectures in a differentiable end-to-end manner. By searching with standard back-propagation, DARTS-EGS is able to outperform state-of-the-art architecture search methods on various tasks, with remarkable efficiency.

Future work may include searching whole networks with our ensemble Gumbel-Softmax and injecting it into deep models to handle other machine learning tasks. For the former, the sampling capability of ensemble Gumbel-Softmax guarantees the practicability of searching any network, but how to improve the efficiency remains to be solved. For the latter, the differentiability of ensemble Gumbel-Softmax indicates that it can be utilized anywhere in networks. Building on this insight, an interesting direction is to recast the clustering process with our ensemble Gumbel-Softmax. By aggregating the inputs in each cluster, a general pooling for both deep networks and deep graph networks (Battaglia et al., 2018; Chang et al., 2018) could be developed to deal with Euclidean and non-Euclidean structured data uniformly.