arch2vec
[NeurIPS 2020] "Does Unsupervised Architecture Representation Learning Help Neural Architecture Search?" by Shen Yan, Yu Zheng, Wei Ao, Xiao Zeng, Mi Zhang
Existing Neural Architecture Search (NAS) methods either encode neural architectures using discrete encodings that do not scale well, or adopt supervised learning-based methods to jointly learn architecture representations and optimize architecture search on such representations, which incurs search bias. Despite their widespread use, architecture representations learned in NAS are still poorly understood. We observe that the structural properties of neural architectures are hard to preserve in the latent space if architecture representation learning and search are coupled, resulting in less effective search performance. In this work, we find empirically that pre-training architecture representations using only neural architectures, without their accuracies as labels, considerably improves the downstream architecture search efficiency. To explain these observations, we visualize how unsupervised architecture representation learning better encourages neural architectures with similar connections and operators to cluster together. This helps to map neural architectures with similar performance to the same regions in the latent space and makes the transition of architectures in the latent space relatively smooth, which considerably benefits diverse downstream search strategies.
Unsupervised representation learning has been successfully used in a wide range of domains including natural language processing Mikolov et al. (2013); Devlin et al. (2019); Radford et al. (2018), computer vision Oord et al. (2016); He et al. (2019), robotic learning Finn et al. (2016); Jang et al. (2018), and network analysis Perozzi et al. (2014); Grover and Leskovec (2016). Although the data types differ, the root of this success across domains is learning good data representations that are independent of the specific downstream task. In this work, we investigate unsupervised representation learning in the domain of neural architecture search (NAS), and demonstrate how NAS search spaces encoded through unsupervised representation learning could benefit downstream search strategies.

Standard NAS methods encode the search space with the adjacency matrix and focus on designing different downstream search strategies based on reinforcement learning Williams (1992), evolutionary algorithms Real et al. (2017), and Bayesian optimization Falkner et al. (2018) to perform architecture search in discrete search spaces. This encoding is a simple and natural choice since neural architectures are discrete by nature. However, the size of the adjacency matrix grows quadratically as the search space scales up, making downstream architecture search less efficient in large search spaces Elsken et al. (2019). To reduce the computational overhead, recent NAS methods employ dedicated networks to learn continuous representations of neural architectures and perform architecture search in the continuous search space Luo et al. (2018); Liu et al. (2019); Xie et al. (2019); Shi et al. (2019). In these methods, architecture representations and search strategies are jointly optimized in a supervised manner, guided by the accuracies of architectures selected by the search strategies. However, these methods are biased towards weight-free operations (e.g. identity, max-pooling) which are often preferred early on in the search, resulting in lower final accuracies Guo et al. (2019); Shu et al. (2020); Zela et al. (2020b, a).

In this work, we propose arch2vec, a simple yet effective unsupervised architecture representation learning method for neural architecture search. arch2vec uses a variational graph isomorphism autoencoder to injectively capture the local structural information of neural architectures in the latent space and map distinct architectures into unique embeddings. It circumvents the bias caused by joint optimization by decoupling architecture representation learning and architecture search into two separate processes. By learning architecture representations using only neural architectures without their accuracies, it helps to model a more smoothly-changing architecture performance surface in the latent space. Such smoothness greatly helps the downstream search since architectures with similar performance tend to be located near each other in the latent space instead of being scattered randomly. We visualize the learned architecture representations in §4.1. The visualization shows that architecture representations learned by arch2vec better preserve the structural similarity of local neighborhoods than their supervised architecture representation learning counterparts. In particular, arch2vec can capture topology (e.g. skip connections or straight networks) and operation similarity, which helps to cluster architectures with similar accuracy.

We follow the NAS best practices checklist Lindauer and Hutter (2019) to conduct our experiments. We validate the performance of arch2vec on three commonly used NAS search spaces, NAS-Bench-101 Ying et al. (2019), NAS-Bench-201 Dong and Yang (2020) and DARTS Liu et al. (2019), and two search strategies based on reinforcement learning (RL) and Bayesian optimization (BO). Our results show that, with the same downstream search strategy, arch2vec consistently outperforms its discrete encoding and supervised architecture representation learning counterparts across all three search spaces.
Our contributions are summarized as follows:
Existing NAS methods typically use discrete encodings that do not scale well, or joint optimization that leverages accuracy as supervision signal for architecture representation learning. We demonstrate that pre-training architecture representations without using their accuracies helps to build a smoother latent space w.r.t. architecture performance.
We propose arch2vec, a simple yet effective unsupervised architecture representation learning method that injectively maps distinct architectures to unique representations in the latent space. By decoupling architecture representation learning and architecture search into two separate processes, arch2vec is able to construct less biased architecture representations, and thus benefits the architecture sampling process in terms of efficiency and robustness.
The pre-trained architecture representations considerably benefit the downstream architecture search. This finding is consistent across three search spaces, two search strategies and two datasets, demonstrating the importance of unsupervised architecture representation learning for neural architecture search.
Unsupervised Representation Learning of Graphs. Our work is closely related to unsupervised representation learning of graphs. In this domain, some methods learn representations using local random walk statistics and matrix factorization-based learning objectives Perozzi et al. (2014); Grover and Leskovec (2016); Tang et al. (2015); Wang et al. (2016); others either reconstruct a graph’s adjacency matrix by predicting edge existence Kipf and Welling (2016); Hamilton et al. (2017) or maximize the mutual information between local node representations and a pooled graph representation Veličković et al. (2019). The expressiveness of Graph Neural Networks (GNNs) is studied in Xu et al. (2019) in terms of their ability to distinguish any two graphs. That work also introduces Graph Isomorphism Networks (GINs), which are proved to be as powerful as the Weisfeiler-Lehman test Weisfeiler and Lehman (1968) for graph isomorphism. Zhang et al. (2019) propose an asynchronous message passing scheme to encode DAG computations using RNNs. In contrast, we injectively encode architecture structures using GINs, and we show strong pre-training performance based on their highly expressive aggregation scheme.

Regularized Autoencoders. Autoencoders can be seen as energy-based models trained with reconstruction energy LeCun et al. (2006). Our goal is to encode neural architectures with similar performance into the same regions of the latent space, and to make the transition of architectures in the latent space relatively smooth. To prevent a degenerate mapping where the latent space is free of any structure, there is a rich literature on restricting the low-energy area for data points on the manifold Kavukcuoglu et al. (2010); Vincent et al. (2008); Kingma and Welling (2014); Makhzani et al. (2016); Ghosh et al. (2020). Here we adopt the popular variational autoencoder framework Kingma and Welling (2014); Kipf and Welling (2016) to optimize the variational lower bound w.r.t. the variational parameters, which, as we show in our experiments, is a simple yet effective regularization.

Neural Architecture Search (NAS). As mentioned in the introduction, typical NAS methods are built upon discrete encodings Zoph and Le (2017); Baker et al. (2017); Falkner et al. (2018); Real et al. (2019); Kandasamy et al. (2018), which face a scalability challenge in large search spaces. To address this challenge, recent NAS methods shift from conducting architecture search in discrete spaces to continuous spaces Luo et al. (2018); Shi et al. (2019); White et al. (2019); Wen et al. (2019) using different architecture encoders such as MLP, LSTM Hochreiter and Schmidhuber (1997) or GCN Kipf and Welling (2017). However, what these methods have in common is that the architecture representation and search direction are jointly optimized by the supervision signal (e.g. accuracies of the selected architectures), which could bias the architecture representation learning and search direction. Concurrent work Liu et al. (2020) shows that architectures searched without using labels are competitive with their counterparts searched with labels. Different from their approach, which performs pretext tasks using image statistics, we use an architecture reconstruction objective to preserve the local structure relationship in the latent space.
In this section, we describe the details of arch2vec, followed by two downstream architecture search strategies we use in this work.
We restrict our search space to cell-based architectures. Following the configuration in NAS-Bench-101 Ying et al. (2019), each cell is a labeled DAG $G = (V, E)$, with $V$ as a set of $N$ nodes and $E$ as a set of edges. Each node is associated with a label chosen from a set of $K$ predefined operations. A natural encoding scheme of cell-based neural architectures is an upper triangular adjacency matrix $A \in \mathbb{R}^{N \times N}$ and a one-hot operation matrix $X \in \mathbb{R}^{N \times K}$. This discrete encoding is not unique, as permuting the adjacency matrix along with the operation matrix leads to the same graph, which is known as graph isomorphism Weisfeiler and Lehman (1968).
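The non-uniqueness of this encoding can be illustrated with a small sketch. The label names in `OPS`, the helper names, and the 4-node example cell are illustrative assumptions, not taken from the paper or its code.

```python
# Hypothetical label set; the first and last entries mark the cell input and output.
OPS = ["input", "conv1x1", "conv3x3", "maxpool3x3", "output"]

def encode_cell(edges, labels):
    """Return (A, X): upper-triangular adjacency and one-hot operation matrix."""
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for src, dst in edges:
        if not src < dst:
            raise ValueError("nodes must be topologically ordered (src < dst)")
        A[src][dst] = 1
    X = []
    for label in labels:
        if label not in OPS:
            raise ValueError(f"unknown operation: {label}")
        X.append([1 if op == label else 0 for op in OPS])
    return A, X

def permute(A, X, perm):
    """Relabel nodes so that new node i is old node perm[i]."""
    n = len(perm)
    A_p = [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    X_p = [X[perm[i]] for i in range(n)]
    return A_p, X_p

# A 4-node cell: the input feeds two parallel operations, which both feed the output.
labels = ["input", "conv3x3", "maxpool3x3", "output"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
A, X = encode_cell(edges, labels)

# Swapping the two intermediate nodes describes the same graph with a different (A, X).
A_swapped, X_swapped = permute(A, X, [0, 2, 1, 3])
```

Here the adjacency matrix happens to be unchanged by the swap while the operation matrix is not, so two distinct encodings describe one architecture.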
In order to learn a continuous representation that is invariant to isomorphic graphs, we leverage Graph Isomorphism Networks (GINs) Xu et al. (2019) to encode the graph-structured architectures given their better expressiveness. We augment the adjacency matrix as $\tilde{A} = A + A^{T}$ to transform the original directed graphs into undirected graphs, allowing bi-directional information flow. Similar to Kipf and Welling (2016), the inference model, i.e. the encoding part of the model, is defined as:

$$q(Z \mid X, \tilde{A}) = \prod_{i=1}^{N} q(z_i \mid X, \tilde{A}), \quad \text{with } q(z_i \mid X, \tilde{A}) = \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right). \tag{1}$$

We use an $L$-layer GIN to get the node embedding matrix $H$:

$$H^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1 + \epsilon^{(k)}\right) \cdot H^{(k-1)} + \tilde{A} H^{(k-1)}\right), \quad k = 1, 2, \ldots, L, \tag{2}$$

where $H^{(0)} = X$ and $\epsilon$ is a trainable bias. The MLP here is a multi-layer perceptron where each layer is a linear-batchnorm-ReLU triplet. Then, the node embedding matrix $H^{(L)}$ is fed to two fully-connected layers to get the mean $\mu$ and the variance $\sigma^2$ of the posterior approximation in Eq. (1), respectively. During inference, the architecture representation is derived by summing the representation vectors of all nodes.
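A minimal sketch of one GIN update from Eq. (2) followed by the sum readout is given below, in plain Python. As simplifying assumptions, a single linear layer with ReLU stands in for the linear-batchnorm-ReLU MLP, the weights are random rather than trained, and the function names are illustrative.

```python
import random

def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def symmetrize(A):
    """A_tilde = A + A^T: treat the DAG as an undirected graph."""
    n = len(A)
    return [[A[i][j] + A[j][i] for j in range(n)] for i in range(n)]

def gin_layer(H, A_tilde, W, eps=0.0):
    """One GIN update; a single linear+ReLU layer stands in for the MLP (assumption)."""
    neighbor_sum = matmul(A_tilde, H)
    mixed = [[(1.0 + eps) * h + s for h, s in zip(h_row, s_row)]
             for h_row, s_row in zip(H, neighbor_sum)]
    return [[max(0.0, v) for v in row] for row in matmul(mixed, W)]

def sum_readout(H):
    """Graph-level representation: sum of all node vectors."""
    return [sum(col) for col in zip(*H)]

# Toy 3-node cell with one-hot node features and random (untrained) weights.
A = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
embedding = sum_readout(gin_layer(X, symmetrize(A), W, eps=0.1))
```

Because the layer only sums over neighbors and the readout sums over nodes, relabeling the nodes leaves `embedding` unchanged up to floating-point rounding, which is the invariance to isomorphic encodings that the text asks for.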
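Our decoder is a generative model aiming at reconstructing $\hat{A}$ and $\hat{X}$ from the latent variables $Z$:

$$p(\hat{A} \mid Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(\hat{A}_{ij} \mid z_i, z_j), \quad \text{with } p(\hat{A}_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^{T} z_j), \tag{3}$$

$$p(\hat{X} = [k_1, \ldots, k_N]^{T} \mid Z) = \prod_{i=1}^{N} p(\hat{X}_i = k_i \mid z_i) = \prod_{i=1}^{N} \mathrm{softmax}(W_o Z + b_o)_{i, k_i}, \tag{4}$$

where $\sigma(\cdot)$ is the sigmoid activation, softmax(·) is the softmax activation applied row-wise, and $k_n \in \{1, 2, \ldots, K\}$ indicates the operation selected from the predefined set of $K$ operations at the n-th node. $W_o$ and $b_o$ are learnable weights and biases of the decoder.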
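The two decoder heads in Eqs. (3) and (4) can be sketched in plain Python as follows. The latent vectors, weights, and function names are made-up illustrations; with one latent vector per row of `Z`, the affine map is computed as `Z W_o + b_o`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def decode_adjacency(Z):
    """Edge probabilities sigmoid(z_i^T z_j) for every node pair (Eq. 3)."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in Z] for zi in Z]

def decode_operations(Z, W_o, b_o):
    """Row-wise softmax over operations for each node (Eq. 4)."""
    logits = [[sum(z * w for z, w in zip(zi, col)) + b
               for col, b in zip(zip(*W_o), b_o)]
              for zi in Z]
    return [softmax(row) for row in logits]

# Three nodes, 2-dimensional latents, three candidate operations (toy values).
Z = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]
W_o = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
b_o = [0.0, 0.0, 0.0]
A_prob = decode_adjacency(Z)
X_prob = decode_operations(Z, W_o, b_o)
```

Since the inner product is symmetric, `A_prob` is symmetric as well, matching the undirected augmentation used by the encoder.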
In practice, our variational graph isomorphism autoencoder consists of five-layer GINs and two-layer MLPs with hidden dimension 128 for each layer. We set the dimensionality of the embedding to 16. During training, model weights are learned by iteratively maximizing a tractable variational lower bound:

$$\mathcal{L} = \mathbb{E}_{q(Z \mid X, \tilde{A})}\left[\log p(\hat{X}, \hat{A} \mid Z)\right] - \mathcal{D}_{KL}\left(q(Z \mid X, \tilde{A}) \,\|\, p(Z)\right), \tag{5}$$

where $p(\hat{X}, \hat{A} \mid Z) = p(\hat{A} \mid Z)\, p(\hat{X} \mid Z)$, as we assume that the adjacency matrix and the operation matrix are conditionally independent given the latent variable $Z$. The second term on the right hand side of Eq. (5) denotes the Kullback-Leibler divergence Kullback and Leibler (1951), which measures the difference between the posterior distribution $q(Z \mid X, \tilde{A})$ and the prior distribution $p(Z)$. Here we choose a Gaussian prior $p(Z)$ due to its simplicity. We use the reparameterization trick Kingma and Welling (2014) for training since it can be thought of as injecting noise into the code layer. Random noise injection has been shown to be effective for regularizing neural networks Sietsma and Dow (1991); An (1996); Kingma and Welling (2014). The loss is optimized using mini-batch gradient descent over neural architectures.

We use reinforcement learning (RL) and Bayesian optimization (BO) as two representative search algorithms to evaluate arch2vec on the downstream architecture search.
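For the training objective in Eq. (5), a single-sample estimate can be sketched as below. This assumes a standard normal prior and a log-variance parameterization of the posterior; both are common choices but are assumptions here, as are all function names.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def edge_log_likelihood(A, A_prob, tiny=1e-12):
    """Bernoulli log-likelihood of the adjacency matrix under edge probabilities."""
    total = 0.0
    for a_row, p_row in zip(A, A_prob):
        for a, p in zip(a_row, p_row):
            total += a * math.log(p + tiny) + (1 - a) * math.log(1.0 - p + tiny)
    return total

def op_log_likelihood(op_indices, X_prob, tiny=1e-12):
    """Categorical log-likelihood of the chosen operation at each node."""
    return sum(math.log(X_prob[i][k] + tiny) for i, k in enumerate(op_indices))

def lower_bound(A, op_indices, A_prob, X_prob, mu_rows, log_var_rows):
    """Reconstruction term minus KL term, summed over nodes."""
    recon = edge_log_likelihood(A, A_prob) + op_log_likelihood(op_indices, X_prob)
    kl = sum(kl_to_standard_normal(mu, lv) for mu, lv in zip(mu_rows, log_var_rows))
    return recon - kl

# Sampling a latent vector: the noise term is what acts as the regularizer.
z = reparameterize([0.0, 1.0], [0.0, 0.0], random.Random(0))
```

In a real training loop `A_prob` and `X_prob` would come from decoding the sampled `z`, and the negative of `lower_bound` would be minimized by gradient descent.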
We use REINFORCE Williams (1992) as our RL-based search strategy as it has been shown to converge better than more advanced RL methods such as PPO Schulman et al. (2017) for neural architecture search. We use a single-layer LSTM as the controller, which outputs a 16-dimensional vector as the mean of a Gaussian policy with a fixed identity covariance matrix. We use the validation accuracy of the sampled architecture as the reward. The sampled architecture representation is decoded to a valid architecture by finding its nearest neighbor, measured by L2 distance, in the pre-trained latent space.
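The sampling and nearest-neighbor decoding step can be sketched as follows. The embeddings are toy 2-dimensional values rather than the 16-dimensional pre-trained ones, and the function names are hypothetical.

```python
import random

def sample_from_policy(mean, rng):
    """Draw a latent point from a Gaussian policy with identity covariance."""
    return [m + rng.gauss(0.0, 1.0) for m in mean]

def nearest_architecture(z, embeddings):
    """Index of the pre-trained embedding closest to z in squared L2 distance."""
    def squared_l2(e):
        return sum((a - b) ** 2 for a, b in zip(z, e))
    return min(range(len(embeddings)), key=lambda i: squared_l2(embeddings[i]))

# Toy latent space with three known architectures.
embeddings = [[0.0, 0.0], [1.0, 1.0], [3.0, -1.0]]
sample = sample_from_policy([1.0, 1.0], random.Random(0))
chosen = nearest_architecture(sample, embeddings)
```

A linear scan like this is enough for illustration; for a large pool of embeddings an approximate nearest-neighbor index would be the usual substitute.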
We use DNGO Snoek et al. (2015) as our BO-based search strategy. We use a one-layer adaptive basis regression network with hidden dimension 128 to model distributions over functions. It serves as an alternative to a Gaussian process in order to avoid cubic scaling Garnett et al. (2014). We use expected improvement (EI) Mockus (1977) as the acquisition function, which is widely used in NAS Kandasamy et al. (2018); White et al. (2019); Shi et al. (2019). During the search process, the best performing architectures are selected and added to the pool. The network is retrained in the next iteration using samples in the updated pool. This process is iterated until the maximum estimated wall-clock time is reached.
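The acquisition step can be illustrated with the standard closed form of expected improvement for a Gaussian predictive distribution (maximization form). The predictive means and standard deviations below are made-up numbers standing in for the regression network's output, and `select_top_k` is a hypothetical helper.

```python
import math

def expected_improvement(mean, std, best):
    """EI of a Gaussian prediction N(mean, std^2) over the incumbent value `best`."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf

def select_top_k(means, stds, best, k):
    """Indices of the k candidates with the highest expected improvement."""
    scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Made-up predictions for three candidate architectures.
picked = select_top_k([0.90, 0.94, 0.96], [0.01, 0.01, 0.01], best=0.95, k=2)
```

The selected candidates would then be evaluated and added to the pool before the surrogate is retrained.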
We validate arch2vec on three commonly used NAS search spaces. The details of the hyperparameters we used for pre-training and search on each search space are included in Appendix A.

NAS-Bench-101. NAS-Bench-101 Ying et al. (2019) is the first rigorous NAS dataset designed for benchmarking NAS methods. It targets the cell-based search space used in many popular NAS methods Zoph et al. (2018); Liu et al. (2018, 2019) and contains 423,624 unique neural architectures. Each architecture comes with pre-computed validation and test accuracies on CIFAR-10. The cell consists of 7 nodes and can take on any DAG structure from the input to the output with at most 9 edges, with the first node as input and the last node as output. The intermediate nodes can be either 1×1 convolution, 3×3 convolution or 3×3 max pooling. We split the dataset into 90% training and 10% held-out test sets for arch2vec pre-training.

NAS-Bench-201. Different from NAS-Bench-101, the cell-based search space in NAS-Bench-201 Dong and Yang (2020) is represented as a DAG with nodes representing sums of feature maps and edges associated with operation transforms. Each DAG is generated by 4 nodes and 5 associated operations: 1×1 convolution, 3×3 convolution, 3×3 average pooling, skip connection and zero, resulting in a total of $5^6 = 15{,}625$ unique neural architectures. The training details for each architecture candidate are provided for three datasets: CIFAR-10, CIFAR-100 and ImageNet-16-120 Chrabaszcz et al. (2017). We use the same data split as used in NAS-Bench-101.

DARTS search space. The DARTS search space Liu et al. (2019) is a popular search space for large-scale NAS experiments. The search space consists of two cells: a convolutional cell and a reduction cell, each with six nodes. For each cell, the first two nodes are the outputs from the previous two cells. Each of the next four nodes takes two edges as input, creating a DAG. The network is then constructed by stacking the cells. Following Liu et al. (2018), we use the same cell for both normal and reduction cell, allowing roughly $10^9$ DAGs without considering graph isomorphism. We randomly sample 600,000 unique architectures in this search space following the mobile setting Liu et al. (2019). We use the same data split as used in NAS-Bench-101.
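The sizes of the last two search spaces can be checked with a short calculation. The NAS-Bench-201 count places one of 5 operations on each of the 6 node pairs of a 4-node DAG. The DARTS estimate below follows the usual counting argument and is an assumption about how the figure is derived: each of the four intermediate nodes picks 2 distinct predecessors and one of 7 non-zero operations per edge.

```python
from math import comb

# NAS-Bench-201: 5 operations on each of the C(4, 2) = 6 edges.
nasbench201_size = 5 ** comb(4, 2)

# DARTS cell (assumed counting): intermediate node k has k + 1 candidate
# predecessors, chooses 2 of them, and assigns one of 7 operations to each edge.
darts_cell_count = 1
for k in range(1, 5):
    darts_cell_count *= comb(k + 1, 2) * 7 ** 2
```

This gives 15,625 for NAS-Bench-201 and about 1.04 × 10^9 for a single DARTS cell, in line with the order of magnitude quoted for the shared normal/reduction cell.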
In the following, we first evaluate the pre-training performance of arch2vec (§4.1) and then the neural architecture search performance based on its pre-trained representations (§4.2).
Observation (1): We compare arch2vec with two popular baselines, GAE and VGAE Kipf and Welling (2016), using three metrics suggested by Zhang et al. (2019): 1) Reconstruction Accuracy (reconstruction accuracy of the held-out test set), 2) Validity (how often a random sample from the prior distribution can generate a valid architecture), and 3) Uniqueness (unique architectures out of valid generations). As shown in Table 1, arch2vec achieves the highest reconstruction accuracy, validity, and uniqueness in all three search spaces. Encoding with GINs outperforms GCNs in reconstruction accuracy due to its better neighbor aggregation scheme. The variational formulation acts as an effective regularizer that leads to better generative performance, including validity and uniqueness. Given its superior performance, we use arch2vec for the remainder of our experiments.
Observation (2): We compare arch2vec with its supervised architecture representation learning counterpart on the predictive performance of the latent representations. This metric measures how well the latent representations can predict the corresponding architectures’ performance. Being able to accurately predict the performance of architectures based on the latent representation makes it easier to search for high-performance points in the latent space. We train a Gaussian Process model with 250 sampled data points to predict all data and report the results across 10 different seeds. We use RMSE and the Pearson correlation coefficient (Pearson’s r) to evaluate points with test accuracy larger than 0.8. Figure 1 compares the predictive performance between arch2vec and its supervised architecture representation learning counterpart on NAS-Bench-101. As shown, arch2vec outperforms its supervised learning counterpart, indicating that arch2vec is able to better capture the local structure relationship of the input space and is thus more informative for guiding the search optimization. (The RMSE and Pearson’s r are 0.038±0.025 / 0.53±0.09 for the supervised architecture representation learning, and 0.018±0.001 / 0.67±0.02 for arch2vec. A smaller RMSE and a larger Pearson’s r indicate better predictive performance.)
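The two evaluation metrics used here have standard definitions, sketched below in plain Python. The threshold helper mirrors the restriction to points with test accuracy above 0.8; the function names and sample numbers are illustrative only.

```python
import math

def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def evaluate_above_threshold(pred, target, threshold=0.8):
    """Compute both metrics only on points whose true accuracy exceeds the threshold."""
    kept = [(p, t) for p, t in zip(pred, target) if t > threshold]
    p, t = zip(*kept)
    return rmse(p, t), pearson_r(p, t)

# Made-up predicted and true accuracies; the first point is filtered out.
err, corr = evaluate_above_threshold([0.50, 0.85, 0.90, 0.95], [0.10, 0.82, 0.90, 0.94])
```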
Observation (3): In Figure 4, we plot the relationship between the L2 distance in the latent space and the edit distance of the corresponding DAGs between two architectures. For arch2vec, the L2 distance grows monotonically with increasing edit distance, indicating that arch2vec can preserve the closeness between two architectures measured by edit distance, which potentially benefits the effectiveness of the downstream search. In contrast, such closeness is not well captured by supervised architecture representation learning.
Observation (4): In Figure 4, we visualize the latent spaces of NAS-Bench-101 learned by arch2vec (left) and its supervised architecture representation learning counterpart (right) in 2-dimensional space using t-SNE. As shown, for arch2vec, the embeddings of architectures span the whole latent space, and architectures with similar accuracies are clustered together. Conducting architecture search on a latent space where performance changes this smoothly is much easier. In contrast, for the supervised counterpart, the embeddings are discontinuous in the latent space, and the transition of accuracy is non-smooth. This indicates that joint optimization guided by accuracy cannot injectively encode architecture structures. As a result, an architecture does not have a unique embedding in the latent space, making the task of architecture search more challenging.
Observation (5): To provide a closer look at the learned latent space, Figure 2 visualizes the architecture cells decoded from the latent space of arch2vec (upper) and supervised architecture representation learning (lower). For arch2vec, adjacent architectures change smoothly and share similar connections and operations. This indicates that unsupervised architecture representation learning helps to model a smoothly-changing structure surface. As we show in the next section, such smoothness greatly helps the downstream search since architectures with similar performance tend to be located near each other in the latent space instead of being scattered randomly. In contrast, the supervised counterpart does not group similar connections and operations well and has much higher edit distances between adjacent architectures. This biases the search direction since dependencies between architecture structures cannot be captured.
NAS results on NAS-Bench-101. For fair comparison, we reproduced the NAS methods which use the adjacency matrix-based encoding in Ying et al. (2019), including Random Search (RS) Bergstra and Bengio (2012), Regularized Evolution (RE) Real et al. (2019), REINFORCE Williams (1992) and BOHB Falkner et al. (2018). For the supervised learning-based search methods, the hyperparameters are the same as arch2vec, except that the architecture representation learning and search are jointly optimized. Figure 5 and Table 2 summarize our results.
Observation (1): BOHB and RE are the two best-performing search methods using the adjacency matrix-based encoding. However, as shown in Figure 5, they perform slightly worse than supervised architecture representation learning because the relatively high-dimensional input tends to require more observations for the optimization. In contrast, supervised architecture representation learning focuses on low-dimensional continuous optimization and thus makes the search more efficient.
Observation (2): As shown in Figure 5 (left), arch2vec considerably outperforms its supervised counterpart and the adjacency matrix-based encoding after wall clock seconds. Figure 5 (right) further shows that arch2vec is able to robustly achieve the lowest final test regret after seconds across 500 independent runs.
Observation (3): Table 2 shows the search performance comparison in terms of number of architecture queries. Notably, while RL-based search using discrete encoding suffers from the scalability issue, arch2vec encodes architectures into a lower dimensional continuous space and is able to achieve competitive RL-based search performance with only a simple 1-layer LSTM controller.
NAS results on NAS-Bench-201.
For CIFAR-10, we follow the same implementation established in NAS-Bench-201 by searching based on the validation accuracy obtained after 12 training epochs with converged learning rate scheduling. The search budget is set to
seconds. The NAS experiments on CIFAR-100 and ImageNet-16-120 are conducted with a budget that corresponds to the same number of queries as used in CIFAR-10. As shown in Table 3, searching with arch2vec leads to better validation and test accuracy as well as reduced variability among different runs on all datasets.

NAS results on DARTS search space. Similar to White et al. (2019), we set the computational budget to 100 queries in this search space. In each query, a sampled architecture is trained for 50 epochs and the average validation error of the last 5 epochs is computed. To ensure a fair comparison with the same hyperparameter setup, we re-trained the architectures from papers that use exactly the DARTS search space and reported the final architecture. As shown in Table 4, arch2vec generally leads to competitive search performance among different cell-based NAS methods with comparable model parameters. The best-performing cells and transfer learning results on ImageNet Deng et al. (2009) are included in Appendix C.

arch2vec is a simple yet effective unsupervised architecture representation learning method for neural architecture search. By learning architecture representations without using their accuracies, it helps to model a more smoothly-changing architecture performance surface in the latent space compared to its supervised architecture representation learning counterpart, which further benefits different downstream search strategies. We have demonstrated its effectiveness on three NAS search spaces. We suggest that it is desirable to take a closer look at architecture representation learning for neural architecture search. It is possible that designing neural architecture search using arch2vec with a better search strategy in continuous space will give better results.
An (1996). The effects of adding noise during backpropagation training on a generalization performance. In Neural Computation.
Real et al. (2019). Regularized evolution for image classifier architecture search. In AAAI.
Vincent et al. (2008). Extracting and composing robust features with denoising autoencoders. In ICML.

As described in §3, we use the adjacency matrix and the operation matrix as inputs to our neural architecture encoder (§3.1). In this section, we present pre-training and search details for the NAS-Bench-101 [59], NAS-Bench-201 [7] and DARTS [31] search spaces.
We followed the encoding scheme in NAS-Bench-101 [59]. Specifically, a cell in NAS-Bench-101 is represented as a directed acyclic graph (DAG) where nodes represent operations and edges represent data flow. A 7×7 upper-triangular binary matrix is used to encode edges. A 7×5 operation matrix is used to encode operations, input, and output, with the order as {input, 1×1 conv, 3×3 conv, 3×3 max-pool (MP), output}. For cells with fewer than 7 nodes, their adjacency and operator matrices are padded with trailing zeros. Figure 6 shows an example of a 7-node cell in the NAS-Bench-101 search space and its corresponding adjacency and operation matrices.

We use a five-layer Graph Isomorphism Network (GIN) with hidden sizes {128, 128, 128, 128, 16} as the encoder and two-layer MLPs with hidden dimension 128 for each layer as the decoder. The adjacency matrix is preprocessed as an undirected graph to allow bi-directional information flow. After forwarding the inputs to the model, the reconstruction error is minimized using Adam optimizer [21] with learning rate . We train the model with batch size 32 and the training loss converges well after 20 epochs. After training, we extract the architecture embeddings from the encoder for downstream architecture search.
For RL-based search, we use REINFORCE [56] as the search strategy. We use a single-layer LSTM with hidden dimension 128 as the controller, which outputs a 16-dimensional vector as the mean of a Gaussian policy with a fixed identity covariance matrix. The controller is optimized using Adam optimizer [21] with learning rate . The number of sampled architectures in each episode is set to 16 and the discount factor is set to 0.8. The baseline value is set to 0.95. The maximum estimated wall-clock time for each run is set to seconds.
For BO-based search, we use DNGO [47] as the search strategy. We use a one-layer fully connected network with hidden dimension 128 to perform adaptive basis function regression. We randomly sample 16 architectures at the beginning, and select the top 5 best-performing architectures to the architecture pool in each architecture sampling iteration. The network is optimized using selected architecture samples in the pool using Adam optimizer with learning rate and trained for 100 epochs in each architecture sampling iteration. The best function value of expected improvement (EI) is set to 0.95. We use the same time budget used in RL-based search.
Different from NAS-Bench-101, NAS-Bench-201 [7] employs a fixed cell-based DAG representation of neural architectures, where nodes represent the sum of feature maps and edges are associated with operations that transform the feature maps from the source node to the destination node. To represent the architectures in NAS-Bench-201 with discrete encoding that is compatible with our neural architecture encoder, we first transform the original DAG in NAS-Bench-201 into a DAG with nodes representing operations and edges representing data flow as the ones in NAS-Bench-101. We then use the same discrete encoding scheme in NAS-Bench-101 to encode each cell into an adjacency matrix and operation matrix. An example is shown in Figure 7. The hyperparameters we used for pre-training on NAS-Bench-201 are the same as described in §A.1.
For RL-based search, the search is stopped when it meets the time budget , , seconds for CIFAR-10, CIFAR-100, and ImageNet-16-120, respectively. For CIFAR-10, we follow the same implementation established in NAS-Bench-201 by searching based on the validation accuracy obtained after 12 training epochs with converged learning rate scheduling. The discount factor and the baseline value are set to 0.4. All the other hyperparameters are the same as described in §A.1.
For BO-based search, we initially sampled 16 architectures and select the best performing architecture to the pool in each iteration. The best function value of EI is set to 1.0 for all datasets. We use the same search budget as used in RL-based search. All the other hyperparameters are the same as described in §A.1.
The cell in the DARTS search space has the following properties: two input nodes are from the output of two previous cells; each intermediate node is connected to two predecessors, with each connection associated with one operation; the output node is the concatenation of all of the intermediate nodes within the cell [31].
Based on these properties, an 11×11 upper-triangular binary matrix is used to encode edges and an 11×11 operation matrix is used to encode operations, with the order as {$c_{k-2}$, $c_{k-1}$, zero, 3×3 max-pool, 3×3 average-pool, identity, 3×3 separable conv, 5×5 separable conv, 3×3 dilated conv, 5×5 dilated conv, $c_k$}. An example is shown in Figure 8. Following [30], we use the same cell for both normal and reduction cell, allowing roughly $10^9$ DAGs without considering graph isomorphism. We randomly sample 600,000 unique architectures in this search space following the mobile setting [31]. The hyperparameters we used for pre-training on the DARTS search space are the same as described in §A.1.
We set the computational budget to 100 architecture queries in this search space. In each query, a sampled architecture is trained for 50 epochs and the average validation accuracy of the last 5 epochs is computed. All the other hyperparameters we used for RL-based search and BO-based search are the same as described in §A.1.
We split the dataset into 90% training and 10% held-out test sets for arch2vec pre-training on each search space. In §4.1, we evaluate the pre-training performance of arch2vec using three metrics suggested by [62]: 1) Reconstruction Accuracy (reconstruction accuracy of the held-out test set), which measures how well the embeddings can be remapped to the original structures without error; 2) Validity (how often a random sample from the prior distribution can generate a valid architecture), which measures the generative ability of the model; and 3) Uniqueness (unique architectures out of valid generations), which measures the smoothness and diversity of the generated samples.
To compute Reconstruction Accuracy, we report the proportion of decoded neural architectures of the held-out test set that are identical to the inputs. To compute Validity, we randomly pick 10,000 points z generated by the Gaussian prior and then apply z ⊙ std(Z_train) + mean(Z_train), where Z_train are the encoded means of the training data. This scales the sampled points and shifts them to the center of the embeddings of the training set. We report the proportion of the decoded architectures that are valid in the search space. To compute Uniqueness, we report the proportion of unique architectures out of valid decoded architectures.
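The three metrics can be sketched in a few lines. The example below is a simplified, dependency-free illustration and not the authors' evaluation code: it assumes the scale and shift are applied per latent dimension using the population standard deviation, that architectures are represented as hashable objects (for example tuples), and that a search-space-specific `is_valid` predicate is supplied by the caller. All function names are hypothetical.

```python
import random


def sample_prior(n, dim, seed=0):
    """Draw n points from a standard Gaussian prior N(0, I)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def rescale_prior_samples(samples, train_means):
    """Apply z * std + mean per dimension, using statistics of the training embeddings."""
    n, dim = len(train_means), len(train_means[0])
    mean = [sum(row[d] for row in train_means) / n for d in range(dim)]
    std = [
        (sum((row[d] - mean[d]) ** 2 for row in train_means) / n) ** 0.5
        for d in range(dim)
    ]
    return [[z[d] * std[d] + mean[d] for d in range(dim)] for z in samples]


def reconstruction_accuracy(inputs, decoded):
    """Fraction of held-out architectures that are decoded exactly."""
    return sum(a == b for a, b in zip(inputs, decoded)) / len(inputs)


def validity_and_uniqueness(decoded, is_valid):
    """Validity: valid / all decoded. Uniqueness: unique valid / valid."""
    valid = [arch for arch in decoded if is_valid(arch)]
    validity = len(valid) / len(decoded)
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness
```

In practice the rescaled samples would be passed through the trained decoder before `validity_and_uniqueness` is called; the decoder is omitted here because it depends on the model.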
The validity check criteria vary across search spaces. For NAS-Bench-101 and NAS-Bench-201, we use the official NAS-Bench-101 (https://github.com/google-research/nasbench/blob/master/nasbench/api.py) and NAS-Bench-201 (https://github.com/D-X-Y/NAS-Bench-201/blob/master/nas_201_api/api.py) APIs to verify whether a decoded architecture is valid in the search space. For the DARTS search space, a decoded architecture has to pass the following validity checks: 1) the first two nodes must be the input nodes c_{k-2} and c_{k-1}; 2) the last node must be the output node c_k; 3) except for the two input nodes, no node may lack a predecessor; 4) except for the output node, no node may lack a successor; 5) each intermediate node must have two edges from the previous nodes; and 6) the adjacency matrix must be an upper-triangular binary matrix (representing a DAG).
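The six DARTS checks above can be written as a single predicate over an adjacency matrix and a list of per-node operation labels. This is a sketch under assumptions: the labels `input_1`, `input_2`, and `output` are hypothetical stand-ins for the input and output node types, check 5 is read literally as "in-degree exactly two for every intermediate node", and the function name `is_valid_darts_cell` is not from the arch2vec code.

```python
def is_valid_darts_cell(adj, node_ops,
                        input_ops=("input_1", "input_2"), output_op="output"):
    """Return True if (adj, node_ops) passes the six validity checks from the text."""
    n = len(adj)
    if n < 4 or len(node_ops) != n or any(len(row) != n for row in adj):
        return False
    # 6) strictly upper-triangular binary matrix, i.e. a DAG in topological order
    for i in range(n):
        for j in range(n):
            if adj[i][j] not in (0, 1) or (j <= i and adj[i][j] != 0):
                return False
    # 1) the first two nodes are the input nodes
    if tuple(node_ops[:2]) != tuple(input_ops):
        return False
    # 2) the last node is the output node
    if node_ops[-1] != output_op:
        return False
    in_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    out_deg = [sum(row) for row in adj]
    # 3) every node except the two inputs has at least one predecessor
    if any(in_deg[j] == 0 for j in range(2, n)):
        return False
    # 4) every node except the output has at least one successor
    if any(out_deg[i] == 0 for i in range(n - 1)):
        return False
    # 5) every intermediate node has exactly two incoming edges
    if any(in_deg[j] != 2 for j in range(2, n - 1)):
        return False
    return True
```

The matrix-shape and triangularity checks run first so that the degree computations below them can assume a well-formed input.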
Figure 9 shows the best cells found by arch2vec using the RL-based and BO-based search strategies. As observed in [40], the shapes of the normalized empirical distribution functions (EDFs) for NAS design spaces on ImageNet [5] match their CIFAR-10 counterparts. This suggests that NAS design spaces developed on CIFAR-10 are transferable to ImageNet [40]. Therefore, we evaluate on ImageNet the best cell found by arch2vec on CIFAR-10. For a fair comparison, we consider the mobile setting [64, 41, 31], where the number of multiply-add operations of the model is restricted to be less than 600M. We follow [27] and use exactly the same training hyperparameters as in the DARTS paper [31]. Table 5 shows the transfer learning results on ImageNet. With comparable computational complexity, arch2vec-RL and arch2vec-BO outperform the DARTS [31] and SNAS [57] methods in the DARTS search space, and are competitive among all cell-based NAS methods under this setting.
NAS-Bench-101. In Figure 10, we visualize three randomly selected pairs of sequences of architecture cells decoded from the learned latent space of arch2vec (upper) and supervised architecture representation learning (lower) on NAS-Bench-101. Each pair starts from the same point, and each architecture is the closest point to the previous one in the latent space, excluding previously visited ones. As shown, architecture representations learned by arch2vec capture topology and operation similarity better than their supervised counterparts. In particular, Figure 10 (a) and (b) show that arch2vec is better able to cluster straight networks, while supervised learning encodes straight networks and networks with skip connections together in the latent space.
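The way each sequence is built, moving repeatedly to the closest unvisited point, can be sketched as a greedy nearest-neighbor walk over the embeddings. Euclidean distance is assumed here because the text does not name the metric, and the function name `nearest_neighbor_walk` is hypothetical.

```python
import math


def nearest_neighbor_walk(embeddings, start, steps):
    """Greedy walk: from `start`, repeatedly move to the closest unvisited embedding.

    embeddings: list of equal-length coordinate lists.
    Returns the visited indices in order, beginning with `start`.
    """
    visited = [start]
    seen = {start}
    current = start
    for _ in range(steps):
        best_idx, best_dist = None, math.inf
        for idx, vec in enumerate(embeddings):
            if idx in seen:
                continue
            dist = math.dist(embeddings[current], vec)  # Euclidean distance (assumed)
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        if best_idx is None:  # every point has been visited
            break
        visited.append(best_idx)
        seen.add(best_idx)
        current = best_idx
    return visited
```

Decoding the architecture at each returned index yields a sequence like those shown in the figures; the scan is linear per step, which is adequate for a handful of steps but would call for an index structure on large embedding sets.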
NAS-Bench-201. Similarly, Figure 11 shows the visualization of five randomly selected pairs of sequences of decoded architecture cells using arch2vec (upper) and supervised architecture representation learning (lower) on NAS-Bench-201. The red mark denotes the change of operations between consecutive samples. Note that the edge flow in NAS-Bench-201 is fixed; only the operator associated with each edge can be changed. As shown, arch2vec leads to a smoother local change of operations than its supervised architecture representation learning counterpart.
DARTS Search Space. For the DARTS search space, we can only visualize the decoded architecture cells using arch2vec since there is no architecture accuracy recorded in this large-scale search space. Figure 12 shows an example of the sequence of decoded neural architecture cells using arch2vec. As shown, the edge connections of each cell remain unchanged in the decoded sequence, and the operation associated with each edge is gradually changed. This indicates that arch2vec preserves the local structural similarity of neighborhoods in the latent space.