Recently, a substantial number of studies [28, 34, 67, 12] have shown that automatically discovered architectures are able to achieve highly competitive performance compared to hand-crafted architectures. However, NAS based architecture design methods have some limitations. Since the search space is extremely large [34, 67] (e.g., billions of candidate architectures), these methods are hard to train and often produce sub-optimal architectures, leading to limited representation performance or substantial computational cost. Thus, even for the architectures searched by NAS methods, it is still necessary to optimize their redundant operations to achieve better performance and/or reduce the computational cost.
To optimize the architectures, Luo et al. recently proposed a Neural Architecture Optimization (NAO) method. Specifically, NAO first encodes an architecture into an embedding in the continuous space and then conducts gradient descent to obtain a better embedding. After that, it uses a decoder to map the embedding back to obtain an optimized architecture. However, NAO has its own set of limitations. First, NAO often produces a totally different architecture from the input architecture, making it hard to analyze the relationship between the optimized and the original architectures (See Fig. 1). Second, NAO may improve the architecture design at the expense of introducing extra parameters or computational cost. Third, similar to the NAS methods, NAO has a very large search space, which may not be necessary for architecture optimization and may make the optimization problem very expensive to solve. An illustrative comparison between our methods and NAO can be found in Fig. 1.
Unlike existing methods that design/find neural architectures, we have proposed a Neural Architecture Transformer (NAT) method to automatically optimize neural architectures to achieve better performance and/or lower computational cost. To this end, NAT replaces the expensive operations or redundant modules in an architecture with more efficient operations. Note that NAT can be used as a general architecture optimizer that takes any architecture as input and outputs an optimized architecture. NAT has shown great performance in optimizing various architectures on several benchmark datasets. However, NAT only considers three operation transitions, i.e., remaining unchanged, replacing with null connection, and replacing with skip connection. Such a small search/transition space may hamper the performance of architecture optimization. Thus, it is important and necessary to enlarge the search space of architecture optimization.
In this paper, based on NAT, we propose a Neural Architecture Transformer++ (NAT++) method which considers a larger search space to conduct architecture optimization in a finer manner. To this end, we present a two-level transition rule to simultaneously change both the type and the kernel size of an operation in architecture optimization. Specifically, NAT++ encourages operations to have more efficient types (e.g., convolution $\rightarrow$ separable convolution) or smaller kernel sizes. For convenience, we use valid transitions to denote those transitions that do not increase the computational cost. Note that different operations may have different valid transitions. To make NAT++ accommodate all the considered operations, we propose a Binary-Masked Softmax (BMSoftmax) layer to omit all the invalid transitions that violate the transition rule. In this way, NAT++ is able to predict the optimal transitions for the operations with different valid transitions simultaneously. Extensive experiments show that our NAT++ significantly outperforms existing methods.
The contributions of this paper are summarized as follows.
We propose a Neural Architecture Transformer (NAT) method which optimizes arbitrary architectures for better performance and/or less computational cost. To this end, NAT either removes the redundant operations or replaces them with skip connections. To better exploit the adjacency information of operations in an architecture, we propose to exploit graph convolutional network (GCN) to build the architecture optimization model.
Based on NAT, we propose a Neural Architecture Transformer++ (NAT++) method which considers a larger search space for architecture optimization. Specifically, NAT++ presents a two-level transition rule which encourages operations to have a more efficient type and/or a smaller kernel size. Thus, NAT++ is able to automatically obtain the valid transitions (i.e., the transitions to more efficient operations).
To accommodate the operations which may have different valid transitions, we propose a Binary-Masked Softmax (BMSoftmax) layer to build a general NAT++ model which predicts the optimal transitions for all the operations simultaneously.
Extensive experiments on several benchmark datasets show that our NAT and NAT++ consistently improve the design of various architectures, including both hand-crafted and NAS based architectures. Compared to the original architectures, the optimized architectures tend to yield significantly better performance and/or lower computational cost.
This paper extends our preliminary version in several aspects. 1) We propose an advanced version NAT++ by enlarging the search space to improve the performance of architecture optimization. 2) We present a two-level transition rule to automatically obtain the valid transitions for each operation on both the operation type level and the kernel size level. 3) We propose a Binary-Masked Softmax (BMSoftmax) layer to omit all the invalid transitions. 4) We compare the computational cost of different operations and analyze the effect of the transitions among them on our method. 5) We provide more analysis about the impact of different operations on the convergence speed of architectures. 6) We investigate the possible bias towards the architectures with too many skip connections in the proposed method. 7) We provide more empirical results to show the effectiveness of NAT and NAT++ based on various architectures.
2 Related Work
2.1 Hand-crafted Architecture Design
Many studies have proposed a series of deep neural architectures, such as AlexNet, VGG, and so on. Based on these models, many efforts have been made to further increase the representation ability of deep networks. Szegedy et al. propose GoogLeNet, which consists of a set of convolutions with different kernel sizes. He et al. propose the residual network (ResNet) by introducing residual shortcuts between different layers. To design more compact models, MobileNet [18, 40] employs depthwise separable convolution to reduce model size and computational overhead. ShuffleNet [61, 31] exploits pointwise group convolution and channel shuffle to significantly reduce computational cost while maintaining comparable accuracy. However, the manual design process often requires substantial human effort and cannot fully explore the whole architecture space.
2.2 Neural Architecture Search
One approach is to use a recurrent neural network as the controller to construct each convolution by determining the optimal stride and the number and shape of filters. Pham et al. propose a weight sharing technique to significantly improve search efficiency. Liu et al. propose a differentiable NAS method, called DARTS, which relaxes the search space to be continuous. Recently, Luo et al. propose the Neural Architecture Optimization (NAO) method to perform architecture search on continuous space by exploiting an encoding-decoding technique. Unlike these methods, our method optimizes architectures without introducing extra computational cost (See comparisons in Fig. 1).
2.3 Architecture Adaptation and Model Compression
Several methods have been proposed to adapt architectures to some specific platform or compress some existing architectures. To obtain compact models, [58, 24, 9, 6] adapt architectures to more compact ones by learning the optimal settings of each convolution. One can also exploit model compression methods [25, 17, 29, 66] to remove the redundant channels to obtain compact models. Recently, ESNAC uses Bayesian optimization techniques to search for a compressed network via layer removal, layer shrinkage, and adding skip connections. APS proposes an affine parameter sharing method to search for the optimal channel numbers of each layer to optimize architectures. Nevertheless, these methods have to learn a compressed model for a specific architecture and have limited generalization ability to different architectures. Unlike these methods, we seek to learn a general optimizer for arbitrary architectures.
3 Neural Architecture Transformer
3.1 Problem Definition
Following [34, 28], we consider a cell as the basic block to build the entire network. Given a cell based architecture space $\Omega$, we can represent an architecture $\alpha \in \Omega$ as a directed acyclic graph (DAG), i.e., $\alpha = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of nodes that denote the feature maps in DNNs and $\mathcal{E}$ is an edge set [67, 34, 28], as shown in Fig. 2. Here, a DAG contains two input nodes that denote the outputs of two previous cells, a set of intermediate nodes, and an output node that concatenates all intermediate nodes. Each intermediate node is able to connect with all previous nodes. The directed edge $e_{ij} \in \mathcal{E}$ denotes some operation (e.g., convolution or max pooling) that transforms the feature map from node $v_i$ to node $v_j$. For convenience, we divide the edges in $\mathcal{E}$ into three categories, namely, $S$, $N$, and $O$, as shown in Fig. 2. Here, $S$ denotes the skip connection, $N$ denotes the null connection (i.e., no edge between two nodes), and $O$ denotes the operations other than skip connection or null connection (e.g., convolution or max pooling). Note that different operations have different costs. Specifically, let $c(\cdot)$ be a function to evaluate the computational cost. Obviously, we have $c(O) > c(S) > c(N)$.
In this paper, we propose an architecture optimization method, called Neural Architecture Transformer (NAT), to optimize any given architecture to achieve better performance and/or less computational cost. To avoid introducing extra computational cost, an intuitive way is to make the original operation have less computational cost, e.g., replacing operations with skip or null connection. Although skip connection has a slightly higher cost than null connection, it often can significantly improve the performance [15, 16]. Thus, we enable the transition from null connection to skip connection to increase the representation ability of deep networks. In summary, we constrain the possible transitions among the three categories (operation $O$, skip connection $S$, and null connection $N$) in Fig. 2 in order to reduce the computational cost.
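The transition scheme described above can be illustrated with a minimal Python sketch. The labels `O`, `S`, `N`, the integer cost ranks, and the helpers `apply_transition` and `increases_cost` are hypothetical names chosen for illustration; the ranks only encode the ordering of the costs, not measured values.

```python
# Edge categories from the text: generic operation, skip connection, null connection.
O, S, N = "O", "S", "N"

# Illustrative cost ranks (assumption): they only encode c(O) > c(S) > c(N).
COST_RANK = {O: 2, S: 1, N: 0}

# The three candidate transitions available on every edge.
ACTIONS = ("unchanged", "to_null", "to_skip")


def apply_transition(edge, action):
    """Return the edge category after applying one of the three transitions."""
    if action == "unchanged":
        return edge
    if action == "to_null":
        return N
    if action == "to_skip":
        return S
    raise ValueError(f"unknown action: {action}")


def increases_cost(edge, action):
    """True only if the transition moves to a strictly more expensive category."""
    return COST_RANK[apply_transition(edge, action)] > COST_RANK[edge]


# Enumerate every (category, transition) pair and flag cost-increasing ones.
for edge in (O, S, N):
    for action in ACTIONS:
        print(edge, action, apply_transition(edge, action), increases_cost(edge, action))
```

In this sketch only the null-to-skip transition raises the cost rank, which corresponds to the single exception the text makes in favor of skip connections.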
3.2 Markov Decision Process for NAT
In this paper, we seek to learn a general architecture optimizer which takes any given architecture as input and outputs the corresponding optimized architecture. Let $\alpha$ be the input architecture which follows some distribution $p(\cdot)$, e.g., a multivariate uniform discrete distribution. We seek to obtain the optimized architecture $\beta$ by learning the mapping $\beta = \mathrm{NAT}(\alpha; \theta)$, where $\theta$ denotes the learnable parameters. Let $w_\alpha$ and $w_\beta$ be the well-learned model parameters of architectures $\alpha$ and $\beta$, respectively. We measure the performance of $\alpha$ and $\beta$ by some metric $R(\alpha, w_\alpha)$ and $R(\beta, w_\beta)$, e.g., accuracy. For convenience, we define the performance improvement between $\alpha$ and $\beta$ by $R(\beta|\alpha) = R(\beta, w_\beta) - R(\alpha, w_\alpha)$. To illustrate our method, we first discuss the architecture optimization problem for a specific architecture and then generalize it to the problem for different architectures.
To learn a good architecture transformer to optimize a specific $\alpha$, we can maximize the performance improvement $R(\beta|\alpha)$. However, simply maximizing $R(\beta|\alpha)$ may easily find an architecture $\beta$ with much higher computational cost than the input counterpart $\alpha$. Instead, we seek to obtain the optimized architectures with better performance without introducing additional computational cost. To this end, we introduce a constraint $c(\beta) \le c(\alpha)$ to encourage the optimized architecture to have lower computational cost than the input one. Moreover, it is worth mentioning that directly obtaining the optimal $\beta$ w.r.t. the input architecture $\alpha$ is non-trivial. Following [67, 34], we instead learn a policy $\pi(\cdot|\alpha; \theta)$ and use it to produce an optimized architecture, i.e., $\beta \sim \pi(\cdot|\alpha; \theta)$. To learn the policy, we seek to solve the following optimization problem:
$$\max_{\theta} \; \mathbb{E}_{\beta \sim \pi(\cdot|\alpha;\theta)} \left[ R(\beta|\alpha) \right], \quad \text{s.t.} \;\; c(\beta) \le c(\alpha), \tag{1}$$
where $\mathbb{E}_{\beta \sim \pi(\cdot|\alpha;\theta)}[\cdot]$ denotes the expectation operation over $\beta$.
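However, the optimization problem in Eqn. (1) only focuses on a single input architecture. To learn a general architecture transformer that is able to optimize any given architecture, we maximize the expectation of performance improvement over the distribution of input architecture $p(\cdot)$. Formally, the expected performance improvement over different input architectures can be formulated by $\mathbb{E}_{\alpha \sim p(\cdot)} \left[ \mathbb{E}_{\beta \sim \pi(\cdot|\alpha;\theta)} \left[ R(\beta|\alpha) \right] \right]$. Consequently, the optimization problem becomes
$$\max_{\theta} \; \mathbb{E}_{\alpha \sim p(\cdot)} \left[ \mathbb{E}_{\beta \sim \pi(\cdot|\alpha;\theta)} \left[ R(\beta|\alpha) \right] \right], \quad \text{s.t.} \;\; c(\beta) \le c(\alpha). \tag{2}$$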
Unlike conventional neural architecture search (NAS) methods that design/find an architecture from scratch [34, 28], we hope to optimize any given architecture by replacing redundant operations (e.g., convolution) in the input architecture with more efficient ones (e.g., skip connection). Since we only allow the transitions that do not increase the computational cost (also called valid transitions) in Fig. 2, compared to the input architecture $\alpha$, the optimized architecture $\beta$ would have less or at least the same computational cost. Thus, the proposed method can naturally satisfy the cost constraint $c(\beta) \le c(\alpha)$.
As mentioned above, our NAT only takes a single architecture as input to predict the optimized architectures. However, one may obtain a better optimized architecture by also considering the previous successful and failed optimization results/records of other architectures. In this case, the optimization problem would be extremely complicated and hard to solve. To alleviate the training difficulty of the optimization problem, we formulate it as a Markov Decision Process (MDP). Specifically, we exploit the Markov property to optimize the current architecture without considering the previous optimization results (similar to the MDP formulation in the multi-armed bandit problem [51, 1]). In this way, the MDP formulation greatly simplifies the decision process. We put more discussions on our MDP formulation in the supplementary.
MDP formulation details. A typical MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, q, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ is the state transition distribution, $R$ is the reward function, $q$ is the distribution of initial state, and $\gamma$ is a discount factor. Here, we define an architecture as a state and a transformation mapping as an action, and we use the accuracy improvement on the validation set as the reward. Since the problem is a one-step MDP, we can omit the discount factor $\gamma$. Based on the problem definition, we transform any $\alpha$ into an optimized architecture $\beta$ with the policy $\pi(\cdot|\alpha; \theta)$.
3.3 Policy Learning by Graph Convolutional Network
As mentioned in Section 3.2, NAT takes an architecture graph as input and outputs the optimization policy $\pi$. To learn the optimal policy, since the optimization of an operation/edge in the architecture graph depends on the adjacent nodes and edges, we consider both the current edge and its neighbors. Therefore, we build the controller model with a graph convolutional network (GCN) to exploit the adjacency information of the operations in the architecture. Here, an architecture graph can be represented by a data pair $(A, X)$, where $A$ denotes the adjacency matrix of the graph and $X$ denotes the attributes of the nodes together with their two input edges. We put more details in the supplementary.
Note that a graph convolutional layer is able to extract features by aggregating the information from the neighbors of each node (i.e., one-hop neighbors). Nevertheless, building the model with too many graph convolutional layers (i.e., a high-order model) may introduce redundant information and hamper the performance (See results in Fig. 7(a)). In practice, we build our NAT with a two-layer GCN, which can be formulated as
$$Z = f(X, A) = \mathrm{Softmax}\left( A \, \sigma\left( A X W^{(0)} \right) W^{(1)} W^{\mathrm{FC}} \right), \tag{3}$$
where $W^{(0)}$ and $W^{(1)}$ denote the weights of two graph convolution layers, $W^{\mathrm{FC}}$ denotes the weight of the fully-connected layer, $\sigma$ is a non-linear activation function, $\mathrm{Softmax}$ denotes the softmax layer, and $Z$ refers to the probability distribution over 3 transitions on the edges, i.e., “remaining unchanged”, “replacing with null connection”, and “replacing with skip connection”.
It is worth mentioning that the controller model is essentially a 3-class GCN based classifier. Given the edges in an architecture, NAT outputs a probability distribution over the 3 transitions for each edge. For convenience, we denote $\theta = \{W^{(0)}, W^{(1)}, W^{\mathrm{FC}}\}$ as the parameters of the controller model of NAT.
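As a rough illustration of the two-layer GCN classifier described above, the following standard-library Python sketch computes a forward pass of the form Softmax(A ReLU(A X W0) W1 Wfc). The use of ReLU as the activation, the helper names, and all numerical values are assumptions made for illustration; how the rows of X map to edges is left to the supplementary in the text, so the sketch simply treats each row as one decision.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU (assumed activation)."""
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_policy(A, X, W0, W1, Wfc):
    """Two graph convolutions followed by a fully-connected layer and softmax."""
    hidden = relu(matmul(matmul(A, X), W0))
    logits = matmul(matmul(matmul(A, hidden), W1), Wfc)
    return softmax_rows(logits)


# Toy input (all values are made up): 3 rows, 2 features, 2 hidden units, 3 transitions.
A = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[0.2, -0.1], [0.4, 0.3]]
W1 = [[0.5, 0.1], [-0.3, 0.2]]
Wfc = [[0.1, 0.0, -0.1], [0.3, -0.2, 0.0]]
Z = gcn_policy(A, X, W0, W1, Wfc)
for row in Z:
    print([round(p, 3) for p in row])
```

Each row of `Z` is a distribution over the three transitions, so the classifier can be read as making one 3-way decision per row.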
3.4 Training Method for NAT
As shown in Fig. 3, given an architecture $\alpha$ as input, NAT outputs the policy/distribution $\pi(\cdot|\alpha; \theta)$ over different candidate transitions. Based on $\pi$, we conduct sampling to obtain the optimized architecture $\beta$. After that, we compute the reward $R(\beta|\alpha)$ to guide the search process. To learn NAT, we first update the supernet parameters $w$ and then update the architecture transformer parameters $\theta$ in each iteration. We show the detailed training procedure in Algorithm 1.
Training the parameters of the supernet $w$. Given any $\theta$, we need to update the supernet parameters $w$ based on the training data. To accelerate the training process, we adopt the parameter sharing technique. Then, we can use the shared parameters $w$ to represent the parameters for different architectures. For any architecture $\beta$, let $\mathcal{L}(\beta, w)$ be the loss function, e.g., the cross-entropy loss. Then, given any $m$ sampled architectures $\{\beta_j\}_{j=1}^{m}$, the updating rule for $w$ with parameter sharing can be given by $w \leftarrow w - \eta \frac{1}{m} \sum_{j=1}^{m} \nabla_w \mathcal{L}(\beta_j, w)$, where $\eta$ is the learning rate.
Training the parameters of the controller model $\theta$.
We train the transformer with reinforcement learning (i.e., policy gradient) for several reasons. First, from Eqn. (2), there are no supervision signals (i.e., “ground-truth” better architectures) to train the model in a supervised manner. Second, the metrics of both accuracy and computational cost are non-differentiable. As a result, gradient-based methods cannot be directly used for training. To address these issues, we use reinforcement learning to train our model by maximizing the expected reward over the optimization results of different architectures.
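To encourage exploration, we use an entropy regularization term in the objective to prevent the transformer from converging to a local optimum too quickly, e.g., selecting the “original” option for all the operations. The objective can be formulated as
$$J(\theta) = \mathbb{E}_{\alpha \sim p(\cdot)} \Big[ \mathbb{E}_{\beta \sim \pi(\cdot|\alpha;\theta)} \big[ R(\beta|\alpha) \big] + \lambda H\big(\pi(\cdot|\alpha;\theta)\big) \Big] = \sum_{\alpha} p(\alpha) \Big[ \sum_{\beta} \pi(\beta|\alpha;\theta) R(\beta|\alpha) + \lambda H\big(\pi(\cdot|\alpha;\theta)\big) \Big], \tag{4}$$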
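where $p(\alpha)$ is the probability to sample some architecture $\alpha$ from the distribution $p(\cdot)$, $\pi(\beta|\alpha;\theta)$ is the probability to sample some architecture $\beta$ from the distribution $\pi(\cdot|\alpha;\theta)$, $H(\cdot)$ evaluates the entropy of the policy, and $\lambda$ controls the strength of the entropy regularization term. For each input architecture, we sample $m$ optimized architectures $\{\beta_j\}_{j=1}^{m}$ from the distribution $\pi(\cdot|\alpha;\theta)$ in each iteration. Thus, given $n$ sampled input architectures $\{\alpha_i\}_{i=1}^{n}$, the gradient of Eqn. (4) w.r.t. $\theta$ becomes
$$\nabla_\theta J(\theta) \approx \frac{1}{n} \sum_{i=1}^{n} \Big[ \frac{1}{m} \sum_{j=1}^{m} \nabla_\theta \log \pi(\beta_j|\alpha_i;\theta) \, R(\beta_j|\alpha_i) + \lambda \nabla_\theta H\big(\pi(\cdot|\alpha_i;\theta)\big) \Big]. \tag{5}$$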
The regularization term encourages the distribution to have high entropy, i.e., high diversity in the decisions on the edges. Thus, the decisions for some operations would be encouraged to choose the “skip” or “null” operations during training. In this sense, NAT is able to explore the whole search space to find the optimal architecture.
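The policy-gradient update with entropy regularization can be sketched for a single categorical decision as follows. This is a simplified illustration, not the training code of the method: it differentiates with respect to the logits of one 3-way decision rather than the GCN parameters, and the reward function, the step sizes, and the names `policy_grad_step` and `entropy_grad` are assumptions.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad(probs):
    """Gradient of the entropy w.r.t. the logits: -p_k * (log p_k + H)."""
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


def policy_grad_step(logits, reward_fn, lam=0.01, lr=0.1, rng=random):
    """One gradient-ascent step on reward * log pi(a) + lam * H(pi)."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs)[0]
    reward = reward_fn(action)
    # d log pi(a) / d logit_k = 1[k == a] - p_k
    score = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    ent = entropy_grad(probs)
    return [z + lr * (reward * s + lam * e) for z, s, e in zip(logits, score, ent)]


# Toy reward (made up): transition 2 improves accuracy, the others do not.
rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = policy_grad_step(logits, lambda a: 1.0 if a == 2 else 0.0, rng=rng)
print(softmax(logits))
```

The entropy gradient vanishes at the uniform distribution and otherwise pulls the policy back toward it, which is how the regularizer keeps the less frequently chosen transitions from being ruled out too early.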
3.5 Inferring the Optimized Architectures
After the training process in Algorithm 1, we obtain the parameters $\theta$ of the architecture transformer model. Based on the NAT model, we take any given architecture $\alpha$ as input and output the architecture optimization policy $\pi(\cdot|\alpha; \theta)$. Then, we conduct sampling according to the learned policy to obtain the optimized architecture, i.e., $\beta \sim \pi(\cdot|\alpha; \theta)$. Specifically, we predict the optimal transition among three candidate transitions (i.e., “remaining unchanged”, “replacing with null connection”, and “replacing with skip connection”) for each edge in the architecture graph. Note that the sampling method is not an iterative process and we perform sampling once for each operation/edge. We can also obtain the optimized architecture by selecting the operation with the maximum probability, which, however, tends to reach a local optimum and yields worse results than the sampling based method (See results in the supplementary).
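The two inference strategies mentioned above (one sampling pass per edge versus picking the most probable transition) can be contrasted with a short sketch. The policy values and the function names are made up for illustration.

```python
import random

TRANSITIONS = ("unchanged", "null", "skip")


def sample_transitions(policy, rng):
    """Draw one transition per edge from its probability row (single pass)."""
    return [rng.choices(TRANSITIONS, weights=row)[0] for row in policy]


def argmax_transitions(policy):
    """Deterministic alternative: pick the most probable transition per edge."""
    return [TRANSITIONS[max(range(len(row)), key=row.__getitem__)] for row in policy]


# Made-up policy for an architecture with three edges.
policy = [[0.7, 0.1, 0.2], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1]]
print(sample_transitions(policy, random.Random(0)))
print(argmax_transitions(policy))
```

Sampling can return different optimized architectures for the same input across runs, whereas the argmax variant always returns the same one.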
4 Neural Architecture Transformer++
As mentioned in Section 3, NAT replaces the redundant operations in the input architecture with null connections or skip connections according to the transition scheme in Fig. 2. However, NAT still has several limitations. First, merely replacing an operation with the null or skip connection makes the search space very small and may hamper the performance of architecture optimization. Second, when we divide the generic operation category (i.e., the operations other than skip or null connection) into more specific operations, the number of transitions between every two categories would significantly increase. As a result, it is non-trivial to manually design valid transitions for each operation using NAT. Third, since operations may have different valid transitions to reduce the computational cost, it is hard to build a general GCN based classifier to predict the optimal transitions for all the operations.
To address the above limitations, we further consider more possible operation transitions to enlarge the search space and develop more flexible operation transition rules. The proposed method is called Neural Architecture Transformer++ (NAT++), whose operation transition scheme is shown in Fig. 4. In NAT++, we propose a two-level transition rule which encourages operations to have more efficient types or smaller kernel sizes to produce more compact architectures. Note that different operations may have different valid transitions. To predict the optimal transitions for the operations with different valid transitions, we propose a Binary-Masked Softmax (BMSoftmax) layer to build the NAT++ model. We describe NAT++ in the following.
4.1 Operation Transition Scheme for NAT++
Note that NAT only considers three operation transitions, i.e., remaining unchanged, replacing with null connection, and replacing with skip connection. As a result, the search space may be very limited and may hamper the performance of architecture optimization. To consider a larger search space, we propose a two-level transition scheme which encourages operations to have more efficient types and/or smaller kernel sizes (See Fig. 4).
4.1.1 Two-level Transition Scheme
In NAT++, we consider a larger search space to enable more possible transitions for architecture optimization. Specifically, we allow the transitions among six operation types, namely standard convolution, separable convolution, dilated separable convolution, max/average pooling, skip connection, and null connection. For each operation type, we consider three kernel sizes (we put the details about all the considered operations in the supplementary). To optimize both the type and kernel size of operations, we design a type transition rule and a kernel transition rule, respectively.
Type Transition: We seek to reduce the computational cost by changing operation into a more computationally efficient one. According to Fig. 4, we use the following rule:
where $\rightarrow$ denotes the transition direction. Since max pooling has a similar computational cost to average pooling, we enable the transition between max pooling and average pooling.
Kernel Transition: Given a specific operation type, one can also adjust the kernel size to change the operation. In general, a larger kernel would induce higher computational cost. Thus, to make sure that all the transitions can reduce the computational cost, we consider the following rule:
It is worth noting that using only one of the two rules cannot guarantee a reduction in computational cost. Specifically, according to Fig. 4, if we only focus on the rule on operation type, there may still exist some transitions that increase the computational cost by changing the operation type to a more efficient one but increasing the kernel size, e.g., from a standard convolution to a separable convolution with a larger kernel. Similarly, if we only reduce the kernel size, there may also exist some transitions that introduce extra computational cost by changing the operation type to a more expensive one, e.g., from a separable convolution to a standard convolution with a smaller kernel. Thus, in practice, we make all the transitions meet the above two rules simultaneously to avoid increasing the computational cost. With the proposed two-level transition rule, unlike NAT, our NAT++ is able to automatically obtain the valid transitions for all the operations.
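A minimal sketch of how the two rules can be combined into a validity check is given below. The type ranks follow the order in which the operation types are listed in the text, the kernel sizes 3 and 5 are placeholder values, and the names `TYPE_RANK`, `is_valid`, and `binary_mask` are hypothetical. The sketch ranks null above skip, so it does not model the null-to-skip transition that NAT permits.

```python
# Hypothetical type ranks: a larger rank means a more efficient type.
# Max and average pooling share a rank because transitions between
# them are allowed in both directions.
TYPE_RANK = {
    "conv": 0,
    "sep_conv": 1,
    "dil_sep_conv": 2,
    "max_pool": 3,
    "avg_pool": 3,
    "skip": 4,
    "null": 5,
}


def is_valid(src, dst):
    """Both rules must hold: the type is no less efficient and the kernel is no larger."""
    (src_type, src_k), (dst_type, dst_k) = src, dst
    type_ok = TYPE_RANK[dst_type] >= TYPE_RANK[src_type]
    kernel_ok = dst_k <= src_k
    return type_ok and kernel_ok


def binary_mask(src, candidates):
    """1 for valid transitions from src, 0 for invalid ones."""
    return [1 if is_valid(src, dst) else 0 for dst in candidates]


# Placeholder kernel sizes (3 and 5); skip and null are given kernel size 0.
candidates = [
    ("conv", 5), ("conv", 3),
    ("sep_conv", 5), ("sep_conv", 3),
    ("skip", 0), ("null", 0),
]
print(binary_mask(("sep_conv", 3), candidates))
```

Under these assumptions, a separable convolution with kernel size 3 may stay as it is or become a skip or null connection, while both counterexamples from the text (a more efficient type with a larger kernel, and a smaller kernel with a more expensive type) are rejected.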
4.1.2 Search Space of NAT++
NAT++ has more possible transitions than NAT and thus has a larger search space. Given a cell structure, we consider 13 operations/states in total (See more details in Fig. 4 and the supplementary). For a specific input architecture, the largest search space of NAT++ is larger than the largest search space of NAT. Therefore, NAT++ has the ability to find architectures with better performance and lower computational cost than NAT (See results in Section 5). Note that NAT++ also allows the transitions considered by NAT, i.e., from an operation to a skip connection, from an operation to a null connection, and from a null connection to a skip connection. Hence, the search space of NAT is a proper subset of the search space of NAT++.
4.1.3 Complexity Analysis of Different Operations
Note that our NAT and NAT++ seek to replace operations with the more efficient ones to avoid introducing additional computation cost. To determine which operations are more efficient, we compare the computational cost of different operations in terms of the number of multiply-adds (MAdds) and the number of parameters.
In Fig. 4, we sort the operations according to the number of parameters and MAdds in descending order. From Fig. 4, we draw the following observations. First, given a fixed kernel size, different operation types have different computational cost. Specifically, separable and dilated separable convolution have lower computational cost than the standard convolution. The max/average pooling, skip connection, and null connection have less or even no computational cost. Second, when we fix the operation type, the kernel size is also an important factor that affects the computational cost of operations. In general, a smaller kernel tends to have a lower computational cost.
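To make the comparison concrete, the following sketch counts parameters and multiply-adds using the standard formulas for a k x k convolution and for a single depthwise-plus-pointwise separable block. The channel and feature-map sizes are made up, biases are ignored, and the operations used in the paper may be defined differently (e.g., by stacking several such blocks), so the numbers only illustrate the trend.

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def sep_conv_params(k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def madds(params, height, width):
    """Multiply-adds when every weight is applied at each output position."""
    return params * height * width


# Made-up setting: 16 input and output channels on a 32 x 32 feature map.
for k in (3, 5):
    standard = conv_params(k, 16, 16)
    separable = sep_conv_params(k, 16, 16)
    print(k, standard, separable, madds(standard, 32, 32), madds(separable, 32, 32))
```

With these numbers, both changing the type (standard to separable) and shrinking the kernel (5 to 3) reduce the cost, which is consistent with the two observations above.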
4.2 Policy Learning for NAT++
To learn the optimal policy for NAT++, we also use a GCN based classifier to predict the optimal transition for each operation/edge. However, it is hard to directly apply the GCN based classifier in NAT to predict the optimal transitions for the operations with different valid transitions. Note that, in NAT, all the operations share the same valid transitions, i.e., remaining unchanged, replacing with null connection, replacing with skip connection. However, in NAT++, each operation has its own valid transitions and these transitions directly determine the considered classes of the GCN based classifier. As a result, we may have to design a GCN classifier for each operation, which, however, is very expensive in practice.
To address this issue, we make the following changes to build the GCN model of NAT++.
First, we increase the number of output channels of the final FC layer to match all the considered operations. In this way, NAT++ is able to consider more possible transitions than NAT.
Second, according to the transition scheme in Fig. 4, we replace the standard softmax layer in Eqn. (3) with a Binary-Masked Softmax (BMSoftmax) layer to omit all the invalid transitions that violate the two-level transition rule.
Specifically, given different operations, we represent the transitions for each operation as a binary mask (1 for valid transitions and 0 for invalid transitions). To omit the invalid transitions, NAT++ only computes the probabilities of all the valid transitions and leaves the probabilities of the invalid ones to be zero.
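Let $b \in \{0, 1\}^{K}$ be the binary mask of an operation and $z \in \mathbb{R}^{K}$ be the predicted logits by NAT++ over $K$ transitions. We compute the probability for the $i$-th transition by
$$p_i = \frac{b_i \exp(z_i)}{\sum_{j=1}^{K} b_j \exp(z_j)}.$$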
Based on BMSoftmax, NAT++ is able to determine the optimal transition for the operations with different valid transitions.
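A masked softmax of this kind can be sketched in a few lines of standard-library Python. The function name `bm_softmax` and the example logits are illustrative assumptions; the released code may implement the layer differently.

```python
import math


def bm_softmax(logits, mask):
    """Softmax restricted to entries whose mask is 1; masked-out entries get probability 0."""
    if not any(mask):
        raise ValueError("at least one transition must be valid")
    # Subtract the largest valid logit for numerical stability.
    top = max(z for z, m in zip(logits, mask) if m)
    exps = [math.exp(z - top) if m else 0.0 for z, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits over four transitions, of which only the first two are valid.
probs = bm_softmax([2.0, 1.0, 0.5, 3.0], [1, 1, 0, 0])
print([round(p, 4) for p in probs])
```

The largest logit belongs to an invalid transition here, yet it receives zero probability, and the remaining mass is renormalized over the valid transitions only.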
4.3 Possible Bias Risk of NAT and NAT++
As shown in Figs. 2 and 4, both NAT and NAT++ seek to replace redundant operations with skip connections when optimizing architectures. However, the architectures with more skip connections tend to converge faster than other architectures [59, 7]. As a result, the competition between skip connections and other operations may easily become unfair and mislead the search process. Consequently, the NAS methods may incur a bias towards those architectures which converge faster but may yield poor generalization performance [44, 59, 7]. More analysis on the bias issue can be found in the supplementary.
To address the bias issue, Zhou et al. introduce a binary gate to each operation and propose a path-depth-wise regularization method to encourage the gates along the long paths in the supernet. Such a regularization forces NAS methods to explore the architectures with slow convergence speed. It is worth mentioning that, based on NAT and NAT++, we can alleviate the bias issue without the need for complex regularization. As shown in Algorithm 1, unlike ENAS and DARTS, we decouple the supernet training from architecture search by sampling architectures from a uniform distribution rather than the learned policy. Since all the operations have the same probability to be sampled, we provide an equal opportunity to train the architectures with different operations. In this sense, we can alleviate the possible bias issue (See results in Section 6.5). More critically, our methods are able to find better architectures than the architecture searched by Zhou et al. on ImageNet (See Table III).
| Model | Method | #Params (M) | #MAdds (M) | Acc. (%) | Acc. (%) | Model | Method | #Params (M) | #MAdds (M) | Acc. (%) | Acc. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | ESNAC | 14.6 | 295 | 95.26 | 74.43 | | ESNAC | 133 | 14523 | 73.6 | 91.5 |
| | APS | 15.0 | 305 | 95.53 | 74.79 | | APS | 137 | 15220 | 73.9 | 91.7 |
| | NAO | 0.4 | 61 | 92.44 | 71.22 | | NAO | 17.9 | 2246 | 70.8 | 89.7 |
| | ESNAC | 0.3 | 40 | 92.87 | 71.58 | | ESNAC | 11.2 | 1544 | 71.0 | 89.9 |
| | APS | 0.3 | 42 | 93.14 | 71.84 | | APS | 11.2 | 1547 | 70.9 | 90.0 |
| | NAO | 1.3 | 199 | 95.27 | 74.25 | | NAO | 34.8 | 4505 | 77.4 | 93.2 |
| | ESNAC | 0.8 | 125 | 95.33 | 74.30 | | ESNAC | 25.0 | 3484 | 77.4 | 93.3 |
| | APS | 0.8 | 123 | 94.54 | 73.58 | | APS | 24.9 | 3461 | 77.6 | 93.4 |
| | NAO | 1.4 | 251 | 93.16 | 72.04 | | NAO | 3.5 | 217 | 68.2 | 86.5 |
| | ESNAC | 0.8 | 153 | 93.21 | 72.14 | | ESNAC | 2.2 | 131 | 68.4 | 86.6 |
| | APS | 0.9 | 161 | 93.47 | 72.40 | | APS | 2.4 | 138 | 68.9 | 87.0 |
| | NAO | 2.9 | 131 | 94.75 | 73.79 | | NAO | 4.5 | 513 | 72.2 | 90.6 |
| | ESNAC | 2.1 | 84 | 94.87 | 73.94 | | ESNAC | 3.1 | 277 | 72.4 | 90.8 |
| | APS | 2.3 | 90 | 95.03 | 74.14 | | APS | 3.4 | 303 | 72.3 | 90.6 |
We apply our method to optimize some well-designed architectures, including hand-crafted architectures and NAS based architectures. We have released the code for both NAT (https://github.com/guoyongcs/NAT) and NAT++ (https://github.com/guoyongcs/NATv2).
5.1 Implementation Details
| Model | Method | #Params (M) | #MAdds (M) | Acc. (%) | Acc. (%) | Model | Method | #Params (M) | #MAdds (M) | Acc. (%) | Acc. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AmoebaNet | / | 3.2 | - | 96.73 | - | AmoebaNet | / | 5.1 | 555 | 74.5 | 92.0 |
| PNAS | / | 3.2 | - | 96.67 | 81.13 | PNAS | / | 5.1 | 588 | 74.2 | 91.9 |
| SNAS | / | 2.9 | - | 97.08 | 82.47 | SNAS | / | 4.3 | 522 | 72.7 | 90.8 |
| GHN | / | 5.7 | - | 97.22 | - | GHN | / | 6.1 | 569 | 73.0 | 91.3 |
| PR-DARTS | / | 3.4 | - | 97.68 | 83.55 | PR-DARTS | / | 5.0 | 543 | 75.9 | 92.7 |
| ENAS | / | 4.6 | 804 | 97.11 | 82.87 | ENAS | / | 5.6 | 607 | 73.8 | 91.7 |
| | NAO | 4.5 | 763 | 97.05 | 82.57 | | NAO | 5.5 | 589 | 73.7 | 91.7 |
| | ESNAC | 4.1 | 717 | 97.13 | 83.15 | | ESNAC | 5.0 | 542 | 73.5 | 91.4 |
| | APS | 4.4 | 744 | 97.26 | 83.45 | | APS | 5.5 | 591 | 74.0 | 91.9 |
| DARTS | / | 3.3 | 533 | 97.06 | 83.03 | DARTS | / | 4.7 | 574 | 73.1 | 91.0 |
| | NAO | 3.5 | 577 | 97.09 | 83.12 | | NAO | 5.0 | 621 | 73.3 | 91.1 |
| | ESNAC | 2.8 | 457 | 97.21 | 83.36 | | ESNAC | 4.0 | 494 | 73.5 | 91.2 |
| | APS | 3.2 | 515 | 97.25 | 83.44 | | APS | 4.5 | 539 | 73.3 | 91.2 |
| NAONet | / | 128 | 66016 | 97.89 | 84.33 | NAONet | / | 11.3 | 1360 | 74.3 | 91.8 |
| | NAO | 143 | 73705 | 97.91 | 84.42 | | NAO | 11.8 | 1417 | 74.5 | 92.0 |
| | ESNAC | 107 | 55187 | 97.98 | 84.49 | | ESNAC | 9.5 | 1139 | 74.6 | 92.1 |
| | APS | 125 | 63468 | 97.96 | 84.47 | | APS | 11.0 | 1286 | 74.5 | 92.1 |
| PC-DARTS | / | 3.6 | 570 | 97.43 | 84.21 | PC-DARTS | / | 5.3 | 597 | 75.8 | 92.7 |
| | NAO | 4.7 | 725 | 97.49 | 84.30 | | NAO | 6.7 | 706 | 76.0 | 92.8 |
| | ESNAC | 3.3 | 503 | 97.44 | 84.20 | | ESNAC | 4.7 | 529 | 75.9 | 92.7 |
| | APS | 3.4 | 529 | 97.47 | 84.28 | | APS | 5.0 | 557 | 76.0 | 92.7 |
We build the supernet by stacking 8 cells with the initial channel number of 20. We train the transformer for 200 epochs. Following the setting of, we set , , and in Eqn. (5). To cover all possible architectures, we set to be a uniform distribution. For the evaluation of networks, we replace the original cells with the optimized cells and train the models from scratch. For all the considered architectures, we follow the same settings of the original papers, i.e., we build the models with the same number of layers and channels as the original ones. We only apply cutout to the NAS based architectures on CIFAR.
5.2 Results on Hand-crafted Architectures
In this experiment, we apply both NAT and NAT++ to four popular hand-crafted architectures, namely VGG, ResNet, ShuffleNet and MobileNetV2. To make all architectures share the same graph representation method defined in Section 3.2, we add null connections into the hand-crafted architectures to ensure that each node has two input nodes (see Fig. 5). Note that each hand-crafted architecture may have multiple graph representations. However, our methods yield stable results on different graph representations (see the results in the supplementary).
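The padding step described above can be sketched as follows. This is a minimal illustration, assuming a cell is stored as a mapping from each node index to a list of (input node, operation) pairs; the function `pad_with_null`, the `"null"` label and the toy chain are hypothetical and are not taken from the released code.

```python
def pad_with_null(cell, num_inputs=2, null_op="null"):
    """Return a copy of `cell` in which every node has `num_inputs` incoming edges.

    `cell` maps a node index to a list of (source_node, operation) pairs.
    Missing edges are filled with a null connection from an earlier node
    that is not yet used as an input of that node.
    """
    padded = {}
    for node, edges in cell.items():
        edges = list(edges)
        used = {src for src, _ in edges}
        # Candidate sources are all earlier nodes that are not yet connected.
        candidates = [src for src in range(node) if src not in used]
        while len(edges) < num_inputs and candidates:
            edges.append((candidates.pop(0), null_op))
        padded[node] = edges
    return padded


# Toy chain (assumed example): nodes 0 and 1 are cell inputs,
# node 2 only receives node 1, and node 3 only receives node 2.
chain = {2: [(1, "conv_3x3")], 3: [(2, "conv_3x3")]}
padded_chain = pad_with_null(chain)
```

Picking a different unused source node for the null connection gives another valid representation of the same network, which is one way to see why a hand-crafted architecture may admit multiple graph representations.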
5.2.1 Quantitative Results
From Table I, our NAT based models consistently outperform the original models by a large margin with approximately the same computational cost. Compared to NAT, NAT++ produces better optimized architectures with higher accuracy and lower computational cost. These results show that, by enlarging the search space, NAT++ is able to further improve the performance of architecture optimization. Moreover, compared to existing methods (i.e., NAO, ESNAC and APS), NAT++ produces architectures with higher accuracy and lower computational cost. Note that NAT and NAT++ yield the same results when optimizing MobileNetV2. The main reason is that the operations in MobileNetV2 are either conv_ or sep_conv_, which are already very efficient operations. Thus, it is hard to benefit from the extended transition scheme of NAT++ when there are very few valid operation transitions.
We also evaluate our method on face recognition tasks. In this experiment, we consider three benchmark datasets (i.e., LFW, CFP-FP and AgeDB-30) and two baselines (i.e., LResNet34E-IR and MobileFaceNet). We adopt the same settings as that in . More training details can be found in the supplementary. From Table II, the models optimized by NAT consistently outperform the original models without introducing extra computational cost. Moreover, NAT++ yields the best optimization results w.r.t. both architectures on all datasets.
5.2.2 Visualization of the Optimized Architectures
In this section, we visualize the original and optimized hand-crafted architectures in Fig. 5. From Fig. 5, NAT is able to introduce additional skip connections to the architecture to improve the architecture design. Unlike NAT, NAT++ conducts architecture optimization in a finer manner. Specifically, NAT++ replaces some standard convolutions with separable convolutions for VGG and ResNet. In this way, NAT++ not only reduces the number of parameters and computational cost but also further improves the performance (See Table I).
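To make the parameter saving concrete, the following sketch compares the weight count of a standard convolution with that of a separable convolution. It assumes a single depthwise convolution followed by a 1x1 pointwise convolution, ignores biases and normalization layers, and uses an illustrative layer size; the exact sep_conv block used in the paper may differ (for example, it may stack the pair twice).

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def sep_conv_params(c_in, c_out, k):
    """Weights of one depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


c_in, c_out, k = 64, 64, 3  # assumed layer size, for illustration only
standard = conv_params(c_in, c_out, k)       # 9 * 64 * 64 = 36864
separable = sep_conv_params(c_in, c_out, k)  # 9 * 64 + 64 * 64 = 4672
ratio = separable / standard                 # roughly 0.13
```

Under these assumptions the separable variant needs about one eighth of the weights, which is consistent with the reduction in parameters reported for the optimized VGG and ResNet models.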
5.3 Results on NAS Based Architectures
We also apply the proposed methods to the automatically searched architectures. In this experiment, we consider four state-of-the-art NAS based architectures, namely DARTS, ENAS, NAONet and PC-DARTS. Moreover, we compare our optimized architectures with other NAS based architectures, including AmoebaNet, PNAS, SNAS, GHN, and PR-DARTS.
From Table III, given different input architectures, the architectures obtained by NAT consistently yield higher accuracy than their original counterparts and the architectures optimized by existing methods. For example, given DARTS as input, NAT not only reduces the number of parameters by 15% and the computational cost by 23% but also improves the Top-1 accuracy on ImageNet by 0.6%. For NAONet, NAT reduces both the number of parameters and the computational cost by approximately 25%, and improves the Top-1 accuracy by 0.5%. We also evaluate the architectures optimized by NAT++. As shown in Table III, equipped with the extended transition scheme, NAT++ is able to find better architectures with higher accuracy and lower computational cost than the architectures found by NAT and existing methods. Due to the page limit, we show the visualization results of the optimized architectures in the supplementary. These results show the effectiveness of the proposed method.
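The relative reductions quoted above are computed as (original - optimized) / original. As a small worked example, the sketch below applies this formula to the DARTS and ESNAC rows of Table III, reading the second group of columns as the ImageNet results; this reading of the flattened table is an assumption, and the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(original, optimized):
    """Fraction by which `optimized` is smaller than `original`."""
    return (original - optimized) / original


# Values read from Table III (second column group, assumed to be ImageNet).
darts_params, darts_madds = 4.7, 574.0  # original DARTS: #Params (M), #MAdds (M)
esnac_params, esnac_madds = 4.0, 494.0  # DARTS optimized by ESNAC

param_cut = 100 * relative_reduction(darts_params, esnac_params)
madds_cut = 100 * relative_reduction(darts_madds, esnac_madds)
print(f"params: -{param_cut:.1f}%, MAdds: -{madds_cut:.1f}%")
```

The same helper can be applied to any pair of rows in Tables I to III to compare methods on equal terms.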
6 Further Experiments
6.1 Results on Randomly Sampled Architectures
We apply our NAT and NAT++ to 20 randomly sampled architectures from the whole architecture space. We train all architectures using momentum SGD with a batch size of 128 for 600 epochs. From Table IV and Fig. 6, the architectures optimized by NAT surpass the original ones in terms of both accuracy and computational cost. Moreover, equipped with the two-level transition scheme, NAT++ further improves the architecture optimization results. To better illustrate this, we exhibit the result of each architecture in Fig. 6, which shows that the models optimized by NAT++ achieve higher accuracy with fewer parameters than NAT. In this sense, our method has good generalizability on a wide range of architectures, making it possible to be applied in real-world applications.
|Test accuracy (%)||95.83 ± 1.08||96.56 ± 0.47||96.79 ± 0.32|
6.2 Effect of the Number of Layers in GCN
We investigate the effect of the number of layers in GCN on the performance of our method. Specifically, we apply both NAT and NAT++ to optimize 20 randomly sampled architectures. We build 4 GCN models with 1, 2, 5, and 10 layers, respectively. Note that a graph convolutional layer aims to extract features by aggregating the information from the neighbors of each node (i.e., one-hop neighbors). A GCN with multiple layers is able to exploit the information from multi-hop neighbors in a graph [26, 11].
From Fig. 7(a), when we build a single-layer GCN, the model yields very poor performance since a single-layer model cannot handle the information from nodes that are more than one hop away. However, if we build the GCN model with 5 or 10 layers, the larger models also hamper the performance since models with too many graph convolutional layers (i.e., high-order models) may introduce redundant information. To learn a good policy, we build a two-layer GCN in practice.
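The link between the number of layers and the number of hops can be illustrated without any learned weights. The sketch below only tracks which nodes can influence a given node after repeated one-hop aggregation; the function `receptive_field` and the path graph are hypothetical and serve purely as an illustration.

```python
def receptive_field(adjacency, node, num_layers):
    """Nodes whose features can influence `node` after `num_layers` rounds
    of one-hop aggregation (each round also keeps the node itself)."""
    reached = {node}
    for _ in range(num_layers):
        reached = reached | {nbr for v in reached for nbr in adjacency[v]}
    return reached


# Path graph 0 - 1 - 2 - 3 - 4 (assumed toy example).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
one_layer = receptive_field(path, 0, 1)   # {0, 1}
two_layers = receptive_field(path, 0, 2)  # {0, 1, 2}
```

With one layer, node 0 only sees its direct neighbor, whereas a second layer already brings in the two-hop neighbor. On small cell graphs, many layers make every node see almost the whole graph, which is one intuition for the redundancy mentioned above.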
6.3 Effect of in Eqn. (4)
In this part, we investigate the effect of (which makes a trade-off between the reward and the entropy term in Eqn. (4)) on the performance of architecture optimization. We train NAT and NAT++ with and report the average accuracy over the optimization results of 20 randomly sampled architectures. From Fig. 7(b), when we increase from to , the entropy term gradually becomes more important and encourages the model to explore the search space. In this way, it prevents the model from converging to a local optimum and helps find better optimized architectures. If we further increase to , the entropy term would overwhelm the objective function and hamper the performance. When we use a very large , the search process becomes approximately the same as random search and yields the architectures even worse than the original counterparts. In practice, we set .
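The trade-off can be sketched numerically. The example below assumes that the objective takes the common form of an expected reward plus a weighted policy entropy; since Eqn. (4) is not reproduced here, this form, the name `weight` for the trade-off coefficient, and the reward values are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_objective(probs, rewards, weight):
    """Expected reward plus `weight` times the policy entropy."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + weight * entropy(probs)


rewards = [1.0, 0.9, 0.2]        # hypothetical rewards of three transitions
greedy = [1.0, 0.0, 0.0]         # always pick the best transition
uniform = [1 / 3, 1 / 3, 1 / 3]  # behaves like random search

small_weight = [regularized_objective(p, rewards, 0.0) for p in (greedy, uniform)]
large_weight = [regularized_objective(p, rewards, 1.0) for p in (greedy, uniform)]
```

With a zero weight the greedy policy scores higher, while with a weight of 1.0 the uniform policy scores higher because the entropy term dominates. This mirrors the observation that a very large coefficient makes the search behave like random search.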
6.4 Effect of and in Eqn. (5)
In this section, we investigate the effect of the hyper-parameters and on the performance of our method. When we gradually increase during training, more architectures have to be evaluated via additional forward propagations through the supernet to compute the reward. The search cost would significantly increase with the increase of . From Table V, we do not observe obvious performance improvement when we consider a large . One possible reason is that, based on the uniform distribution , even sampling one architecture in each iteration has provided sufficient diversity of the input architectures to train our model. Thus, we set in practice.
We also investigate the effect of the hyper-parameter which controls the number of sampled optimized architectures for each input architecture. When we consider a large , we have to evaluate more optimized architectures to compute the reward in each iteration, yielding significantly increased search cost. As shown in Table V, similar to , our model only achieves marginal performance improvement with the increase of . In practice, works well in NAT and NAT++. The main reason is that most of the sampled architectures can be very similar based on a fixed policy/distribution . As a result, increasing the number of sampled optimized architectures may provide limited benefits for the training process. Actually, a similar phenomenon is also observed in ENAS .
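The growth of the search cost can be sketched as a nested sampling loop. The example below assumes that each iteration samples a number of input architectures from a uniform distribution and then draws several optimized candidates per input, each of which would require a forward pass through the supernet. The function `sample_iteration`, the operation names and the random placeholder policy are hypothetical; the actual policy in the paper is a learned GCN.

```python
import random


def sample_iteration(num_inputs, num_outputs, operations, num_edges, rng):
    """Sample input architectures uniformly, then draw optimized candidates
    for each of them; return all (input, candidate) pairs."""
    candidates = []
    for _ in range(num_inputs):
        arch = [rng.choice(operations) for _ in range(num_edges)]
        for _ in range(num_outputs):
            # Placeholder policy: keep each operation, or replace it with a
            # skip connection or a null connection, chosen at random.
            optimized = [rng.choice([op, "skip", "null"]) for op in arch]
            candidates.append((arch, optimized))
    return candidates


rng = random.Random(0)
ops = ["conv_3x3", "sep_conv_3x3", "max_pool_3x3", "skip", "null"]  # assumed names
cheap = sample_iteration(1, 1, ops, num_edges=8, rng=rng)
costly = sample_iteration(5, 4, ops, num_edges=8, rng=rng)
```

Under these assumptions the number of candidates to evaluate per iteration is the product of the two hyper-parameters, so `costly` requires 20 evaluations where `cheap` requires one.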
6.5 Discussions on the Possible Bias Risk
In this section, we investigate whether our methods are biased towards architectures that converge fast in the early stage but generalize poorly. In this experiment, we randomly collect a set of architectures and use NAT and NAT++ to optimize them. Then, we compare the convergence curves of the original architectures and the optimized architectures on CIFAR-10. From Fig. 8, some of the original architectures suffer from the issue of “fast convergence in the early stage but with poor generalization performance”, e.g., Arch2 and Arch4. In contrast, all of the architectures optimized by NAT and NAT++ have a relatively stable convergence speed and yield better generalization performance than their original counterparts. From these results, the bias problem is not obvious in our methods. The main reason is that, in NAT and NAT++, all the operations have the same probability of being sampled, so architectures with different operations have an equal opportunity to be trained. In this sense, we are able to alleviate the overly fast convergence issue incurred by skip connections. Due to the page limit, we put the convergence curves of more architectures in the supplementary.
7 Conclusion
In this paper, we have proposed a novel Neural Architecture Transformer (NAT) for the task of architecture optimization. To solve this problem, we seek to replace the existing operations with more computationally efficient operations. Specifically, NAT replaces the redundant or non-significant operations with skip connections or null connections. Moreover, we design an advanced NAT++ to further enlarge the search space. To be specific, we present a two-level transition rule which encourages operations to have a more efficient type or a smaller kernel size to produce more compact architectures. To verify the proposed method, we apply NAT and NAT++ to optimize both hand-crafted architectures and Neural Architecture Search (NAS) based architectures. Extensive experiments on several benchmark datasets demonstrate the effectiveness of the proposed method in improving the accuracy and compactness of neural architectures.
References
-  (1987) Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: iid rewards. IEEE Transactions on Automatic Control 32 (11), pp. 968–976. Cited by: §3.2.
-  (2017) Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, Cited by: §2.2.
-  (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations, Cited by: §2.2.
-  (2019) Learnable embedding space for efficient neural architecture compression. In International Conference on Learning Representations, Cited by: §2.3, TABLE I, TABLE II, TABLE III.
-  (2018) Mobilefacenets: efficient cnns for accurate real-time face verification on mobile devices. In Chinese Conference on Biometric Recognition, pp. 428–438. Cited by: TABLE II, §5.2.1.
-  (2016) Net2net: accelerating learning via knowledge transfer. In International Conference on Learning Representations, Cited by: §2.3.
-  Stabilizing differentiable architecture search via perturbation-based regularization. In International Conference on Machine Learning, Cited by: §4.3.
-  (2019) Fairnas: rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845. Cited by: §4.3.
-  (2019) Chamnet: towards efficient network design through platform-aware model adaptation. In , pp. 11398–11407. Cited by: §2.3.
-  (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: TABLE II, §5.2.1.
-  (2019) Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §6.2.
-  (2020) Breaking the curse of space explosion: towards efficient nas with curriculum search. In International Conference on Machine Learning, pp. 3822–3831. Cited by: §1.
-  Double forward propagation for memorized batch normalization. In AAAI Conference on Artificial Intelligence, pp. 3134–3141. Cited by: §1.
-  (2019) NAT: neural architecture transformer for accurate and compact architectures. In Advances in Neural Information Processing Systems, pp. 735–747. Cited by: §1, §1, §4.1.
-  (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: Fig. 2, §1, §2.1, §3.1, §5.2.
-  (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §3.1.
-  (2017) Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, pp. 1398–1406. Cited by: §2.3.
-  Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.1.
-  (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §5.2.1.
-  (2017) Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence, pp. 1965–1972. Cited by: §1.
-  (2016) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §3.3.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1, §2.1.
-  (1989) Backpropagation Applied to Handwritten zip Code Recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §1.
-  (2019) Structured pruning of neural networks with budget-aware regularization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116. Cited by: §2.3.
-  (2017) Pruning filters for efficient convnets. In International Conference on Learning Representations, Cited by: §2.3.
-  Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, Brussels, Belgium, October 31-November 4, 2018, Cited by: §6.2.
-  (2018) Progressive neural architecture search. In European Conference on Computer Vision, pp. 19–34. Cited by: §5.3, TABLE III.
-  (2019) Darts: differentiable architecture search. In International Conference on Learning Representations, Cited by: §1, §2.2, §3.1, §3.2, §4.3, §5.1, §5.3, TABLE III.
-  (2019) ThiNet: pruning CNN filters for a thinner net. In IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, pp. 2525–2538. Cited by: §2.3.
-  (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems, pp. 7816–7827. Cited by: Fig. 1, §1, §2.2, TABLE I, TABLE II, §5.3, TABLE III.
-  (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In European Conference on Computer Vision, pp. 116–131. Cited by: §2.1.
-  (2017) Agedb: the first manually collected, in-the-wild age database. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 51–59. Cited by: §5.2.1.
-  (2010) Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, pp. 807–814. Cited by: §3.3.
-  (2018) Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pp. 4095–4104. Cited by: §1, §2.2, §3.1, §3.2, §3.2, §3.4, §4.3, §5.3, TABLE III, §6.4.
-  Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 121–135. Cited by: §1.
-  (2018) Runtime network routing for efficient image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
-  (2019) Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §5.3, TABLE III.
-  (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §1.
-  (2016) Object Detection Networks on Convolutional Feature Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
-  (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §2.1, §5.2.
-  (2015) Facenet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §1.
-  (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §3.2.
-  (2016) Frontal to profile face verification in the wild. In IEEE Winter Conference on Applications of Computer Vision, pp. 1–9. Cited by: §5.2.1.
-  (2019) Understanding architectures learnt by cell-based neural architecture search. In International Conference on Learning Representations, Cited by: §4.3.
-  (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §1, §2.1, §5.2.
-  (2019) The evolved transformer. In International Conference on Machine Learning, Cited by: §2.2.
-  (2015) Training Very Deep Networks. In Advances in Neural Information Processing Systems, pp. 2377–2385. Cited by: §1.
-  (2015) Deeply Learned Face Representations are Sparse, Selective, and Robust. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2892–2900. Cited by: §1.
-  (2015) Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §2.1.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2.2.
-  (2005) Multi-armed bandit algorithms and empirical evaluation. In European conference on machine learning, pp. 437–448. Cited by: §3.2.
-  (2020) Revisiting parameter sharing for automatic neural channel number search. Advances in Neural Information Processing Systems 33. Cited by: §2.3, TABLE I, TABLE II, TABLE III.
-  (2015) HCP: a flexible cnn framework for multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9), pp. 1901–1907. Cited by: §1.
-  (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §3.4.
-  (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §3.3, §6.2.
-  (2019) SNAS: Stochastic neural architecture search. In International Conference on Learning Representations, Cited by: §5.3, TABLE III.
-  (2020) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. In International Conference on Learning Representations, Cited by: §5.3, TABLE III.
-  (2018) Netadapt: platform-aware neural network adaptation for mobile applications. In European Conference on Computer Vision, pp. 285–300. Cited by: §2.3.
-  (2019) Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations, Cited by: §4.3.
-  (2019) Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, Cited by: §5.3, TABLE III.
-  (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.1, §5.2.
-  (2016) Accelerating Very Deep Convolutional Networks for Classification and Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955. Cited by: §1.
-  (2018) Practical block-wise neural network architecture generation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. Cited by: §2.2.
-  (2020) Theory-inspired path-regularized differential network architecture search. In Advances in Neural Information Processing Systems, Cited by: §4.3, §5.3, TABLE III.
-  (2019) Multi-hop convolutions on weighted graphs. arXiv preprint arXiv:1911.04978. Cited by: §3.3, §6.2.
-  (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886. Cited by: §2.3.
-  (2017) Neural architecture search with reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §2.2, §3.1, §3.2.
-  (2018) Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §3.4.