DAAS
'Discretization-Aware Architecture Search' alleviates the discretization gap in one-shot differentiable NAS. DAAS has been accepted by PR (2021).
The search cost of neural architecture search (NAS) has been largely reduced by weight-sharing methods. These methods optimize a super-network with all possible edges and operations, and determine the optimal sub-network by discretization, i.e., pruning off weak candidates. The discretization process, performed on either operations or edges, incurs significant inaccuracy and thus the quality of the final architecture is not guaranteed. This paper presents discretization-aware architecture search (DA2S), with the core idea being adding a loss term to push the super-network towards the configuration of desired topology, so that the accuracy loss brought by discretization is largely alleviated. Experiments on standard image classification benchmarks demonstrate the superiority of our approach, in particular, under imbalanced target network configurations that were not studied before.
Network architecture search (NAS) is a research topic aiming to explore the design of neural networks in a large space that is not well covered by human expertise. To alleviate the computational burden of the reinforcement-based [38, 39] and evolutionary [26, 32, 25] algorithms that evaluate sampled architectures individually, researchers proposed one-shot search methods [2], which first optimize a super-network with all possible architectures included, and then sample sub-networks from it for evaluation [24]. By sharing computation, this kind of method accelerates NAS by orders of magnitude.
A representative example of one-shot search is differentiable architecture search (DARTS [20]), which formulates the super-network into a differentiable form with respect to a set of architectural parameters, e.g., operations and connections, so that the entire NAS process can be optimized in an end-to-end manner. DARTS does not require an explicit process for evaluating each sub-network, but performs a standalone discretization process to determine the optimal sub-architecture, on which re-training is performed. With such an efficient search strategy, the search cost does not increase dramatically with the size of the search space, so the space can be much larger than in other NAS approaches.
Despite its efficiency, DARTS is believed to suffer from the gap between the optimized super-network and the sampled sub-networks. In particular, as illustrated in [5], the difference in the number of cells can cause a ‘depth gap’, and the search performance is largely stabilized by alleviating this gap. In this paper, we point out another gap, potentially more important, caused by the process of discretizing the architectural weights of the super-network. To be specific, DARTS combines candidate operations and edges with a weighted sum (the weights are learnable), preserves a fixed number of candidates with strong weights, and discards the others. However, there is no guarantee that the discarded weights are relatively small; if they are not, this discretization process can introduce significant inaccuracy in the neural responses of each cell. Such inaccuracy accumulates, so that a well-optimized super-network does not necessarily generate high-quality sub-networks, in particular (i) when the discarded candidates still have moderate weights; and/or (ii) when the number of pruned edges is relatively small compared to that in the super-network. Figure 1 shows a cell optimized by DARTS. One can see that discretization causes the super-network accuracy to drop dramatically, which also harms the performance of the searched architecture in the re-training stage.
To alleviate the above issue, we propose discretization-aware architecture search (DA^{2}S). The main idea is to introduce an additional term to the loss function, so that the architectural parameters of the super-network are gradually pushed towards the desired configuration during the search process. To be specific, we formulate the new loss term as an entropy function, based on the property that minimizing the entropy of a system drives the elements (weights) in the system towards sparsity and discretization. The objective of the entropy term is to enforce each weight to get close to either 0 or 1, with the number of 1’s determined by the desired configuration, so that the discretization process, which removes candidates with weights close to 0, does not incur significant accuracy loss. Being differentiable with respect to the architectural parameters, the entropy function can be freely plugged into the system for SGD optimization.
We perform experiments on two standard image classification benchmarks, namely CIFAR10 and ImageNet, based on PC-DARTS [34], an efficient variant of DARTS. Note that two sets of architectural parameters exist in PC-DARTS, controlling the operations in an edge and the edges that sum into a node, respectively, and they are potentially equipped with different loss terms. We evaluate different configurations (i.e., varying from each other in the number of preserved edges for each node), most of which have not been studied before. When each search process reaches the end, the super-network converges into a discretization-friendly form, and the discretization process causes a much smaller accuracy drop than that reported without the entropy loss. Consequently, the searched architecture, under any configuration, enjoys higher and more stable performance, and the advantage is more significant as the configuration becomes more imbalanced, where the original search method suffers a larger ‘discretization gap’.
The rapid development of deep learning
[16], in particular convolutional neural networks, has largely changed the way of designing computational models in computer vision. Recent years have witnessed a trend of stacking more and more convolutional layers into a deep network [1, 27, 10, 14] so that more trainable parameters are included and higher recognition accuracy is achieved.
Going one step further, researchers started to consider the possibility of designing deep networks automatically, and thereby created a new research area termed neural architecture search (NAS) [38]. NAS defines a sub-field of automated machine learning (AutoML) [13] and has attracted increasing attention in both academia and industry. The idea is to construct a sufficiently large space and thus enable the architecture to adjust according to training data, simulating the process of evolutionary computation. With carefully monitored search strategies, NAS has claimed better performance compared to hand-designed networks in a wide range of applications including image classification [38], object detection [8], and semantic segmentation [17].
The early efforts of NAS mainly involved heuristic search in a very large space, and the sampled architectures were often evaluated individually. Representative examples include using reinforcement learning (RL) to formulate network or block designs
[38, 39, 18], applying evolutionary algorithms (EA) to force the network to evolve throughout iterations [26, 32], or simply performing guided random search to find competitive solutions [25]. These methods often require a vast amount of computation, e.g., thousands of GPU-days. To accelerate the search process, one-shot architecture search was proposed to share computation among architectures with similar building blocks [2].
One-shot architecture search was later developed into weight-reusing [3] and weight-sharing [24] methods, which can reduce the search cost by orders of magnitude. Beyond this point, researchers proposed to improve the search stability using better sampling methods [28, 7], explored the importance of the search space [12], and tried to integrate hardware consumption such as latency as additional evaluation metrics [31]. These efforts eventually led to powerful architectures that achieve state-of-the-art performance on ImageNet [3] with moderate computational overhead.
A special family of one-shot architecture search formulates the search space as a super-network which can adjust itself in a continuous space [21]. Based on this, the network and architectural parameters can be jointly optimized, which leads to a differentiable approach for architecture search. DARTS [20], a representative differentiable framework, designed an over-parameterized super-network which contains exponentially many sub-networks with shared weights. It performed bi-level optimization to update network weights and architectural weights alternately and, at the end of the search stage, used a greedy algorithm to prune off the operations and edges with lower weights. Partially-Connected DARTS [34] pursued a more efficient search by sampling a small part of the super-network to reduce the redundancy in exploring the network space.
Recent DARTS methods [5, 34] have achieved success in both architecture quality and search efficiency. Nevertheless, few researchers have noticed that the discretization process incurs a significant accuracy loss, which makes it difficult to obtain a high-quality sub-network from the optimal super-network [35]. This paper investigates this problem, inherent to DARTS methods, in a systematic way, with the goal of searching discretization-aware architectures from the perspective of model regularization.
DARTS [20] designs a cell-based search space to facilitate efficient differentiable architecture search. Each cell is represented as a directed acyclic graph with $N$ nodes, where each node defines a network layer. There is a pre-defined space of operations denoted by $\mathcal{O}$, where each element, $o(\cdot)$, denotes a fixed operation. Commonly used operations include identity connection and convolution performed at a network layer.
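Within a cell, the searching goal is to choose one operation from $\mathcal{O}$ for each pair of nodes. Let $(i, j)$ denote a pair of nodes, where $0 \leq i < j \leq N-1$. The primary idea of DARTS is to formulate the information and gradient propagated from $i$ to $j$ as a weighted sum over the operations in $\mathcal{O}$, as $f_{i,j}(z_i) = \sum_{o \in \mathcal{O}} a^{o}_{i,j} \cdot o(z_i)$, where $a^{o}_{i,j} = \exp(\alpha^{o}_{i,j}) / \sum_{o' \in \mathcal{O}} \exp(\alpha^{o'}_{i,j})$ and $z_i$ denotes the output of the $i$-th node, and $\alpha$ is a set of architectural parameters to weight operations within each edge, with $\alpha^{o}_{i,j}$ determining the weight of $o(z_i)$ in edge $(i, j)$. Following PC-DARTS [34], we introduce an extra set of architectural parameters ($\beta$) in our DA^{2}S to determine the weight of each edge. Thus, the output of a node is the sum of all input flows, i.e., $z_j = \sum_{i<j} b_{i,j} \cdot f_{i,j}(z_i)$, where $b_{i,j}$ is the weight of edge $(i, j)$ computed from $\beta_{i,j}$. The output of the entire cell is formed by concatenating the output of all prior nodes, i.e., $\mathrm{concat}(z_2, z_3, \ldots, z_{N-1})$. Note that the first two nodes, $z_0$ and $z_1$, are input nodes to a cell, which are fixed during the search procedure.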
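The weighted-sum formulation described above can be illustrated with a minimal sketch. It assumes scalar node outputs, three toy operations, and a softmax over the operation parameters of each edge; the names `OPS`, `mixed_edge`, and `node_output` are hypothetical and are not taken from the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy candidate operations acting on a scalar (assumption: real operations
# would be convolutions, pooling, skip-connect, and so on).
OPS = {
    "skip": lambda z: z,
    "double": lambda z: 2.0 * z,
    "negate": lambda z: -z,
}

def mixed_edge(z_i, alpha_ij):
    """Output of one edge: softmax-weighted sum of all candidate operations."""
    weights = softmax(alpha_ij)
    return sum(w * op(z_i) for w, op in zip(weights, OPS.values()))

def node_output(inputs, alphas, edge_weights):
    """Output of one node: sum of incoming edge flows scaled by edge weights."""
    return sum(b * mixed_edge(z, a)
               for z, a, b in zip(inputs, alphas, edge_weights))

# Uniform operation parameters: every operation contributes equally.
uniform = mixed_edge(1.0, [0.0, 0.0, 0.0])
# One dominant parameter: the edge behaves almost like a single operation.
dominant = mixed_edge(1.0, [50.0, 0.0, 0.0])
```

With uniform parameters the edge averages all candidates, whereas a strongly dominant parameter makes the edge nearly equivalent to the single operation that discretization would keep.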
This design makes the entire framework differentiable to both the layer weights and the architectural hyper-parameters $\alpha$ and $\beta$, so that it is possible to perform architecture search in an end-to-end fashion. After the search process is completed, on each edge $(i, j)$, the operation with the largest $\alpha^{o}_{i,j}$ value is preserved, and each node $j$ is connected to the two precedents $i < j$ with the largest weights. The overall super-network is parameterized by both the layer weights and the architectural parameters, and the learning procedure of DARTS optimizes the image classification loss to determine both, as
(1) |
where the loss is computed on a batch of training samples with corresponding class labels.
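The pruning step described above (keep the strongest operation on each edge, then keep a fixed number of incoming edges per node) can be sketched as follows. This is only an illustration under stated assumptions: edge strength is read directly from a per-edge weight, and the dictionaries `alpha` and `beta` as well as the function `discretize` are hypothetical names.

```python
def discretize(alpha, beta, edges_per_node=2):
    """Return {edge: kept operation} after pruning.

    alpha: {(i, j): {operation name: weight}}  operation weights per edge
    beta:  {(i, j): weight}                    edge weights
    """
    # Step 1: on every edge, keep the operation with the largest weight.
    best_op = {edge: max(ops, key=ops.get) for edge, ops in alpha.items()}
    kept = {}
    # Step 2: for every node j, keep the strongest incoming edges.
    for j in sorted({edge[1] for edge in beta}):
        incoming = [edge for edge in beta if edge[1] == j]
        strongest = sorted(incoming, key=lambda e: beta[e], reverse=True)
        for edge in strongest[:edges_per_node]:
            kept[edge] = best_op[edge]
    return kept

# Toy cell: nodes 0 and 1 are inputs, nodes 2 and 3 are intermediate.
alpha = {
    (0, 2): {"skip": 0.1, "conv": 0.9},
    (1, 2): {"skip": 0.6, "conv": 0.4},
    (0, 3): {"skip": 0.3, "conv": 0.7},
    (1, 3): {"skip": 0.5, "conv": 0.5},
    (2, 3): {"skip": 0.8, "conv": 0.2},
}
beta = {(0, 2): 0.5, (1, 2): 0.5, (0, 3): 0.5, (1, 3): 0.1, (2, 3): 0.4}
kept = discretize(alpha, beta)
```

Note that edge `(2, 3)` survives with a weight of 0.4 while the discarded edge `(1, 3)` is the only one with a small weight; nothing in this procedure requires discarded candidates to be weak.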
It is well acknowledged that DARTS-based approaches suffer from limited stability, i.e., when the same search procedure is run several times independently, the searched architectures can report varying performance during the re-training stage. For this reason, the original DARTS [20] evaluated the architectures found in four individual search phases on the validation set and picked the best one, which multiplies the search cost. More importantly, as the search space gets enlarged, the number of trials required to find a high-quality architecture may also increase, and eventually the DARTS-based approaches may lose their advantage in efficiency.
An important insight that our work delivers is that the instability is partly caused by the discretization loss. Here, by discretization we mean the process that picks the best operation and/or edge and discards the others according to the architectural weights of the super-network, i.e., the continuous parameters, $\alpha$ and $\beta$, are discretized so that a pre-defined number of elements are pushed towards 1 and the others towards 0. This obviously introduces inaccuracy to the well-trained super-network. To show this, we follow DARTS to train a super-network on CIFAR10, which reports an accuracy of on the validation set. Then, we investigate the impact of discretization by replacing the corresponding part with the trained weights, e.g., on each edge, keeping the dominating operation (using a weight of 1) with its parameters (e.g., convolutional weights) unchanged. Results are shown in Figure 1. The accuracy drop is dramatic, e.g., under the setting of DARTS (each node has two edges preserved), the validation accuracy drops from to . If we investigate a more imbalanced discretization (the first two nodes have edge each and the last two nodes have edges each), the validation accuracy drops to , which is even close to a random guess. This is unexpected and violates the design nature of one-shot NAS, since it suggests that dramatically bad sub-networks can be sampled from a well-trained super-network. Consequently, there is no guarantee that architectures found in this way can eventually report good performance, even after a complete re-training process has been performed.
We argue that such a gap arises because the training process is not aware that a discretization process will be performed afterwards. For example, when operations are competing in an edge $(i, j)$, they ‘assume’ that the input, $z_i$, is a weighted sum of the outputs of all nodes prior to $i$. When discretization is performed, $z_i$ is modified into the output of the dominating node, but the weights on edge $(i, j)$ may not match the new input. Such inaccuracy accumulates throughout the entire network and eventually leads to a catastrophic accuracy drop. Therefore, the key to alleviating the gap is to make the search process aware of discretization, as well as of the topology of the final architecture. We will elaborate our solution in the next part.
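A toy calculation may help make this argument concrete. The sketch below compares the output of a weighted sum with the output after keeping only the strongest candidate, once for soft weights and once for nearly one-hot weights. All numbers and the function names are hypothetical and serve only to illustrate how the error depends on the discarded weight mass.

```python
def mixed_output(weights, op_outputs):
    """Weighted sum of candidate outputs, as computed by the super-network."""
    return sum(w * y for w, y in zip(weights, op_outputs))

def discretization_error(weights, op_outputs):
    """Absolute change in output when only the strongest candidate is kept."""
    keep = max(range(len(weights)), key=lambda k: weights[k])
    return abs(mixed_output(weights, op_outputs) - op_outputs[keep])

# Hypothetical outputs of three candidate operations on the same input.
op_outputs = [1.0, -1.0, 5.0]

# Discarded candidates still carry moderate weight: large change.
soft_error = discretization_error([0.40, 0.35, 0.25], op_outputs)
# Weights are already close to one-hot: small change.
sharp_error = discretization_error([0.98, 0.01, 0.01], op_outputs)
```

In this toy setting the error shrinks from about 0.3 to about 0.02 once the weights are close to one-hot, which is the situation that a discretization-aware loss tries to reach before pruning.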
Figure 2 shows the overall framework of discretization-aware search. The main idea is to use the topology constraint to guide the optimization process, so that the super-network eventually gets close to a sub-network that is allowed to appear as the final architecture. This is achieved by adding a loss function that measures the minimal distance between the current super-network and any acceptable sub-network. Specifically, we introduce an entropy-based loss function for each set of architectural parameters to fulfil this goal.
Below we elaborate the details of applying this methodology to the two sets of parameters, $\alpha$ (operation) and $\beta$ (edge), followed by discussions on the priority of discretization and the relationship between prior works and our DA^{2}S.
Discretization of $\alpha$ and $\beta$
We start with discretizing $\alpha$. To guarantee that only one operation dominates on each edge when the search process ends, we compute the following loss for each edge $(i, j)$:
(2) |
Summing the loss term over all edges yields the operation loss:
(3) |
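As an illustration of an entropy-based operation loss, the following sketch computes the entropy of the softmax over the operation parameters of each edge and sums it over all edges. The exact form and normalization used in the paper's equations are not reproduced here, so this is an assumption consistent with the verbal description, and `operation_loss` is a hypothetical name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def operation_loss(alpha):
    """Sum of per-edge entropies; alpha maps each edge to its logits."""
    return sum(entropy(softmax(logits)) for logits in alpha.values())

# One edge with 7 candidate operations (the count used in the experiments).
undecided = operation_loss({(0, 2): [0.0] * 7})            # uniform weights
decided = operation_loss({(0, 2): [20.0] + [0.0] * 6})     # nearly one-hot
```

The uniform case attains the maximum entropy, ln 7, while the nearly one-hot case is close to zero, so minimizing this quantity rewards distributions in which a single operation dominates.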
Note that Eq. (2) is an entropy-based loss function on the softmax of $\alpha$, i.e., the probability of choosing each candidate operation on a given edge. Minimizing it pushes the weights of all operations to a one-hot distribution, i.e., the probability of one operation is close to 1 while those of the others are close to 0. Note that $\alpha$ is jointly optimized with the network parameters, implying that when the search process is complete, the network parameters have been adjusted according to the one-hot $\alpha$; consequently, the inaccuracy introduced by discretization is much smaller.
Things become a bit different when we try to discretize $\beta$, because the configuration often requires preserving more than one candidate, e.g., according to the standard DARTS formulation, each node receives input from two previous nodes. To handle this, we add an extra term to the previous entropy loss and constrain the maximum value of any edge weight to 1, and the overall loss is:
(4) |
Note that the required sum of the edge weights can be changed according to the search configuration. In the experimental part, we will show how this formulation generalizes to other types of desired topology, e.g., preserving a different number of edges for each node.
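One possible way to realize such an edge loss is sketched below. The specific form is an assumption made for illustration and is not the paper's Eq. (4): each edge weight is a sigmoid of its parameter (so no weight exceeds 1), the entropy-like term vanishes when every weight is 0 or 1, and an absolute-value penalty ties the sum of weights to the number of edges to preserve. The names `edge_loss` and `num_keep` are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function; maps any real parameter into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def edge_loss(beta_j, num_keep=2):
    """Illustrative loss for the incoming edges of one node.

    beta_j:   list of edge parameters for the edges entering node j
    num_keep: number of edges the target topology preserves for this node
    """
    weights = [sigmoid(b) for b in beta_j]
    # Entropy-like term: close to zero when each weight is near 0 or 1.
    entropy_term = -sum(w * math.log(w) for w in weights)
    # Count term: zero when the weights sum to the desired number of edges.
    count_term = abs(sum(weights) - num_keep)
    return entropy_term + count_term

undecided = edge_loss([0.0, 0.0, 0.0, 0.0])        # all weights equal 0.5
two_kept = edge_loss([8.0, 8.0, -8.0, -8.0])       # two edges near 1
three_kept = edge_loss([8.0, 8.0, 8.0, -8.0])      # one edge too many
```

Changing `num_keep` is what would let the same loss target other topologies, for example a node that keeps a single incoming edge.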
Similarly, summing this term over all nodes yields the edge loss:
(5) |
and the discretization-aware objective function for architecture search is:
(6) |
Discretization Priority
Edge discretization and operation discretization each depend on the performance estimated by the other. This circular dependency has perplexed the community for a long time, and can be eased by independently enforcing additional regularization on $\alpha$ and $\beta$. Moreover, exploring the priority between operation discretization and edge discretization further narrows the discretization gap. By introducing regularization control functions, the discretization-aware objective function for architecture search can be improved as:
(7) |
where the three regularization control factors are related to classification accuracy, operation discretization and edge discretization, respectively.
Considering the dynamic change of node connections, operation weights, and network parameters during the search process, the regularization factors are defined as functions of training epochs, and simplified to be chosen from five representative increasing functions, as shown in Figure LABEL:fig:dimension, to reveal the regular pattern of optimization priority. At early training epochs, the network is not well trained, and the regularization factors are small so that training focuses on network parameters. As the optimization process continues, the network gets better trained and more attention is paid to selecting operations and edges.
There exist prior works that push the architectural parameters towards either 0 or 1 so as to align with the requirement of discretization.
For example, FairDARTS [6] introduced the zero-one loss to quantize the architectural parameters by using individual sigmoid functions rather than softmax. In addition, by considering NAS as an annealing process in which the system converges to a less chaotic status, XNAS [23] proposed to reduce the temperature term of the cross-entropy loss so that weaker candidates get eliminated. However, FairDARTS was not able to control the exact number of preserved candidates: sometimes multiple weights are pushed towards 1 but only one of them is allowed to be kept. On the other hand, XNAS cannot support preserving more than one candidate, which limits its flexibility in multi-choice scenarios. In comparison, our approach can adjust the loss function according to the desired topology; we will show a variety of examples in Table 3. If needed, it can freely generalize to choosing multiple operations for each edge.
In this section, we first describe the experimental settings. We then validate the effect of our discretization-aware search approach. We also report the performance of our approach on balanced and imbalanced configurations, and compare it with the state of the art.
Dataset. The commonly used CIFAR10 and ImageNet datasets are used to evaluate our network architecture search approach. CIFAR10 consists of 60K images with a low spatial resolution of 32×32. The images are equally distributed over 10 classes, with 50K training and 10K testing images. ImageNet contains 1,000 object categories and consists of 1.3M high-resolution training images and 50K validation images. The images are almost equally distributed over all classes. Following the commonly used settings, we apply the mobile setting where the input image size is fixed to 224×224 and the number of multiply-add operations does not exceed 600M in the testing stage [34].
Implementation Details. Following DARTS as well as conventional architecture search approaches, we use an individual stage for architecture search, and after the optimal architecture is obtained, we use an additional process to train the classification model from scratch. During the search stage, the goal is to find the optimal $\alpha$ and $\beta$ under the entropy-based discretization regularization in an end-to-end manner. We search architectures on CIFAR10 and then transfer them to ImageNet.
During the search procedure, we split the training data into two parts, one for each stage of the search process. As for the search space, we follow DARTS but exclude the zero operation, because when zero has an advantage on an edge, forming a standard cell requires choosing a low-weight operation instead. There are in total 7 options, including 3×3 and 5×5 separable convolution, 3×3 and 5×5 dilated separable convolution, 3×3 max-pooling, 3×3 average-pooling, and skip-connect.
When searching, the over-parameterized super-network is constructed by stacking cells ( normal cells and reduction cells) with the initial number of channels , and each cell consists of N = nodes. The 50K training set of CIFAR10 is split into two subsets of equal size, with one subset used for training network weights and the other used for architecture hyper-parameters.
We train the super-network for epochs, and the super-network weights are optimized by the momentum SGD algorithm, with a momentum of and a weight decay of . The learning rate is reduced progressively to zero following a cosine schedule from an initial learning rate of without restart. We use an Adam optimizer [15] for both $\alpha$ and $\beta$, both with a fixed learning rate of , a momentum of and a weight decay of [34]. The memory cost of our implementation is smaller than GB so that it can be trained on most modern GPUs.
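For readers unfamiliar with a cosine schedule without restart, the sketch below shows the standard half-cosine decay from an initial learning rate to zero. The epoch count and initial learning rate used in the example are placeholders chosen for illustration and are not the values used in the experiments; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(epoch, total_epochs, initial_lr):
    """Half-cosine decay from initial_lr at epoch 0 to zero at total_epochs."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Placeholder settings (assumptions, not the experimental values).
TOTAL_EPOCHS = 50
INITIAL_LR = 0.1
schedule = [cosine_lr(e, TOTAL_EPOCHS, INITIAL_LR) for e in range(TOTAL_EPOCHS + 1)]
```

The rate changes slowly near the start and the end of training and fastest around the midpoint, where it equals half of the initial value.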
In Table 1, we compare the proposed approach with the state-of-the-art approaches. It can be seen that our approach outperforms the baseline DARTS method by a large margin (an error rate of 2.42% vs. 2.76%), and outperforms recent gradient-based methods including P-DARTS [5], PC-DARTS [34], and BayesNAS [37]. Note that the significant performance gains are achieved with a moderate parameter size (3.4 M) and computational cost (0.3 GPU days). The performance gains validate the effectiveness of our entropy-based regularization method and the importance of discretization-aware search itself.
In this part, we first apply our approach to search standard cells (selecting edges under a balanced configuration), and then, to further illustrate its effectiveness, we search non-standard cells (imbalanced configurations).
Operation and Edge Discretization. There are 7 operations in total for all cells (the ‘none’ operation is not used). Each cell has 14 edges and the network consists of two kinds of cells, normal cells and reduction cells, so the network architecture depends on the search of 28 edges. That is to say, the operation loss is the sum of 28 operation entropy losses, and the edge loss is the sum of all edge entropy losses, Eq. (4). In Eq. (7), we experimentally define that , and . Then we evaluate the results with and under different setting of functions shown in the Figure LABEL:fig:dimension.
In Figure 3, we present the evolution of the softmax of operation weights on CIFAR10 with edges in a normal cell. It can be seen that after about training epochs, the softmax of the operation weights begins to differentiate significantly. At the final epoch, a single large weight (towards 1) is obtained, with the rest taking small values (towards 0), which clearly demonstrates the effect of operation discretization. In Figure 4, we present the softmax evolution of $\beta$, which validates the effect of edge discretization. Note that two edges are selected at the same time for each node, which shows the effectiveness of the connection constraints in Eq. (4).
Discretization Priority. The entropy loss function inevitably interferes with the search procedure of DARTS, particularly at the early epochs. Therefore, we propose to progressively increase the regularization factors using monotonic functions as shown in Figure LABEL:fig:dimension. In Table 2, we fix and as ‘const’ (equals ). It can be seen that the fast increasing functions, such as ‘linear’ and ‘log’, outperform slow ones for regularization factor , while ‘linear’ achieves the best performance. A possible explanation is that a moderately quick (linear) enhancement of the regularization on the classification loss has the smallest interference with the search procedure.
In Table 4.2, we test the priority of operation and edge discretization using different regularization control functions with set to ‘linear’ as default. We fix as ‘const’ and evaluate using different control functions since the epochs before which is fixed as 0. This means that the priority of operation discretization is higher than that of edge discretization. Under this setting, the best performance (2.49%) is achieved by the ‘step’ function. On the other hand, we fix and change under the same conditions. The best performance (2.42%) is achieved by the ‘log’ function.
The higher performance obtained by fixing shows that when the edge discretization dominates the search procedure, quick convergence of the cell topology helps the operation discretization-aware search converge better when a fast increasing regularization control function (‘log’) is used.
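The role of the regularization control functions can be sketched as follows. The text names ‘const’, ‘linear’, ‘log’, and ‘step’ among the five functions; their exact definitions are not given here, so the normalized forms below are assumptions chosen so that each function reaches 1 at the final epoch. The weighted objective mirrors the verbal description of Eq. (7), and all function and parameter names are hypothetical.

```python
import math

def const(t, T):
    """Constant factor: regularization is fully active from the start."""
    return 1.0

def linear(t, T):
    """Grows proportionally with the epoch index."""
    return t / T

def log(t, T):
    """Grows quickly at early epochs and saturates later."""
    return math.log(1 + t) / math.log(1 + T)

def step(t, T, start=0.5):
    """Inactive until a fraction `start` of training has passed."""
    return 0.0 if t < start * T else 1.0

SCHEDULES = {"const": const, "linear": linear, "log": log, "step": step}

def weighted_objective(cls_loss, op_loss, edge_loss, t, T,
                       names=("linear", "const", "log")):
    """Combine the three losses with epoch-dependent control factors."""
    lam_c, lam_o, lam_e = (SCHEDULES[n](t, T) for n in names)
    return lam_c * cls_loss + lam_o * op_loss + lam_e * edge_loss

# Example: total loss halfway through a hypothetical 50-epoch search.
halfway = weighted_objective(1.0, 2.0, 3.0, t=25, T=50)
```

Under these assumed forms, ‘log’ lies above ‘linear’ for most of training, which matches its description as a fast increasing function, while ‘step’ delays a loss term entirely until its start epoch.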
Imbalanced Configurations. In the above settings, each node in a cell has two inputs and the optimization objective is to select 8 out of 14 edges. This constraint largely reduces the difficulty of search; even a random search can find architectures of moderate accuracy. To further validate the effectiveness and generalization of our approach, we search architectures with imbalanced configurations. Specifically, we break the setting of choosing 8 from 14 and choose fewer edges to magnify the gap between architectures before and after discretization.
Four configurations, namely, preserving 3 to 6 out of 14 edges, are used to validate our approach and compare it with DARTS. For DARTS, we use the default searched architecture and select 3, 4, 5, or 6 edges according to the weights of operations. For our approach, to select 3 edges, we pose the edge entropy loss jointly on node2 and node3 and select the largest one, and pose the edge entropy loss on node4 and node5 to select one on each. To select 4 edges, we pose the edge entropy loss on the four inner nodes so that each of them has a single edge. For 5 edges, we select two on node5 and one on each of the other 3 nodes. For 6 edges, we select two on each of node4/node5 and one on each of node2/node3.
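The four configurations can be written down as a small table to check that the per-node choices add up. The sketch assumes a 6-node cell in which nodes 0 and 1 are inputs and node j can receive an edge from every earlier node, which gives the 14 candidate edges; the 3-edge case is read as sharing one edge between node2 and node3. The names `CANDIDATES` and `CONFIGS` are hypothetical.

```python
# Candidate incoming edges per inner node: node j has j predecessors.
CANDIDATES = {j: j for j in (2, 3, 4, 5)}

# Each entry lists (group of nodes, number of edges kept within the group).
CONFIGS = {
    3: [((2, 3), 1), ((4,), 1), ((5,), 1)],
    4: [((2,), 1), ((3,), 1), ((4,), 1), ((5,), 1)],
    5: [((2,), 1), ((3,), 1), ((4,), 1), ((5,), 2)],
    6: [((2,), 1), ((3,), 1), ((4,), 2), ((5,), 2)],
}

def preserved_edges(config):
    """Total number of edges kept under one configuration."""
    return sum(kept for _, kept in config)

def is_feasible(config):
    """Every group must keep no more edges than it has candidates."""
    return all(kept <= sum(CANDIDATES[n] for n in nodes)
               for nodes, kept in config)

total_candidates = sum(CANDIDATES.values())
totals = {name: preserved_edges(cfg) for name, cfg in CONFIGS.items()}
```

Compared with the standard setting of 8 out of 14 edges, these configurations discard a larger share of the candidates, which is why they magnify the gap before and after discretization.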
In Table 3, the performance of DARTS and our approach under imbalanced configurations is compared. Under imbalanced configurations, the accuracy of DARTS drops dramatically, by a large margin in the range [77.75, 78.00], which demonstrates that the discretization process does bring a significant gap before and after pruning. Such a gap has an unpredictable impact on the searched architecture. In contrast, with the discretization-aware constraint, our approach achieves relatively stable performance, and the classification accuracy drop is significantly reduced to the range [0.21, 21.29]. For each configuration, it outperforms DARTS by significant margins (2.16%, 1.75%, 0.51%, 0.27%) after re-training.
In this part, we use the large-scale ImageNet dataset to test the transferability of the cells searched on CIFAR10, shown in Figure 5. The same configuration as DARTS is adopted, i.e., the entire network is constructed by stacking cells with an initial channel number of . We train the network for epochs from scratch with batch size on Tesla V100 GPUs. An SGD optimizer is used for optimizing the network parameters with an initial learning rate of (decayed linearly after each epoch), a momentum of 0.9 and a weight decay of . Other enhancements including label smoothing [30] and auxiliary loss are used during training, and learning rate warmup [9] is applied for the first epochs.
In Table 4, we evaluate the proposed approach and compare the result with the state-of-the-art approaches under the mobile setting (the FLOPs do not exceed 600M). DA^{2}S outperforms the direct baseline, DARTS, by a significant margin of (an error rate of vs. ). DA^{2}S also produces competitive performance among some recently published work including P-DARTS, PC-DARTS, and BayesNAS, when the network architecture is searched on CIFAR and transferred to ImageNet. This further verifies the superiority of our DA^{2}S in mitigating the discretization gap in the differentiable architecture search framework.
In this paper, we propose a discretization-aware NAS method, which works by introducing an entropy-based loss term to push the super-network towards a discretization-friendly status according to the pre-defined target. This strategy can be applied to either selecting an operator for each edge, or selecting a fixed number of edges for each node. Experiments on standard image classification benchmarks demonstrate the superiority of our approach, in particular, under some imbalanced configurations which were not studied before.
This work provides further evidence that one-shot neural architecture search can benefit from shrinking the gap between the super-network and sub-networks. As the search space becomes more complicated in the future, we expect our approach to serve as a standard tool to alleviate the discretization gap. We also look forward to investigating some open problems, e.g., whether discretization can be done in a gradual manner so as to further reduce the error.
Regularized evolution for image classifier architecture search. In AAAI, pp. 4780–4789.