This repo contains the PyTorch implementation of the SNAS-Series papers
Controversy exists on whether differentiable neural architecture search methods discover wiring topology effectively. To understand how wiring topology evolves, we study the underlying mechanism of several existing differentiable NAS frameworks. Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are preferred over deeper ones; 3) no edges are selected in bi-level optimization. To anatomize these phenomena, we propose a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization. Based on this reformulation, we conduct empirical and theoretical analyses, revealing implicit inductive biases in the cost's assignment mechanism and evolution dynamics that cause the observed phenomena. These biases indicate strong discrimination towards certain topologies. To this end, we pose questions that future differentiable methods for neural wiring discovery need to confront, hoping to evoke a discussion and rethinking on how much bias has been enforced implicitly in existing NAS methods.
Neural architecture search (NAS) aims to design neural architectures in an automatic manner, eliminating efforts in heuristics-based architecture design. Its paradigm in the past was mainly black-box optimization (Stanley and Miikkulainen, 2002; Zoph and Le, 2016; Kandasamy et al., 2018). Failing to utilize the derivative information in the objective makes these methods computationally demanding. Recently, differentiable architecture search (Liu et al., 2018; Xie et al., 2018; Cai et al., 2018) came in as a much more efficient alternative, gaining popularity as soon as it was proposed. These methods can be seamlessly plugged into the computational graph for any generic loss, for which the convenience of automated differentiation infrastructure can be leveraged. They have also exhibited impressive versatility in searching not only operation types but also kernel size, channel number (Peng et al., 2019), and even wiring topology (Liu et al., 2018; Xie et al., 2018).
The effectiveness of wiring discovery in differentiable architecture search, however, is controversial. Opponents argue that the topology discovered is strongly constrained by the manual design of the supernetwork (Xie et al., 2019). And it becomes more speculative if one takes a close look at the evolution process, as shown in Fig.1. From the view of networks with the largest probability, all edges tend to be dropped in the beginning, after which some edges recover. The recovery seems slower in depth-encouraging edges, and final cells show a preference towards width. These patterns occur under variation of search space and search algorithms. A natural question to ask is, if there exists such a universal regularity, why bother searching the topology?
In this work, we study the underlying mechanism of this wiring evolution. We start from unifying existing differentiable NAS frameworks, transferring NAS to a local cost minimization problem at each edge. As the decision variable at each edge is enumerable, this reformulation makes an analytical treatment possible. Based on it, we introduce a scalable Monte Carlo estimator of the cost that gives us a microscopic view of the evolution. Inspired by observations at this level, we conduct an empirical and theoretical study on the cost assignment mechanism and the cost evolution dynamics, revealing their relation with the surrogate loss. Based on the results and analysis, we conclude that the patterns above are caused (at least partially) by inductive biases in the training/searching scheme of existing NAS frameworks. In other words, there exists implicit discrimination towards certain types of wiring topology that are not necessarily worse than others if trained alone. This discovery poses challenges to future proposals of differentiable neural wiring discovery. To this end, we want to evoke a discussion and rethinking on how much bias has been enforced implicitly in existing neural architecture search methods.
The recent trend to automatically search for neural architectures starts from Zoph and Le (2016). They proposed to model the process of constructing a neural network as a Markov Decision Process, whose policy is optimized towards architectures with high accuracy. Then moving their focus to the design of search space, they further proposed ENAS in Pham et al. (2018). Particularly, they introduced a Directed Acyclic Graph (DAG) representation of the search space, which is followed by later cell-based NAS frameworks (Liu et al., 2018; Xie et al., 2018). Since this DAG is one of the main controversial points on the effectiveness of differentiable neural wiring discovery, we provide a comprehensive introduction to fill readers in with the background. Fig.2 illustrates a minimal cell containing all distinct components.
Nodes in this DAG represent feature maps. Two different types of nodes are colored differently in Fig.2. Input nodes and output nodes (blue) are nodes indicating the inter-connection between cells. Here for instance, represents both the input to this cell and the output from the previous cell. Similarly, represents both the input to this cell and the output from the one before the previous cell. In symmetry, is the output of this cell, and is also the input to some next cells. Between input nodes and output nodes, there are intermediate nodes (orange). The number of nodes of each type is a hyper-parameter of this design scheme.
Edges represent information flows between nodes and . Every intermediate node is connected to all previous nodes with smaller indices. If the edge is between an input node and an intermediate node, it is called an input edge. If its vertices are both intermediate nodes, it is called an intermediate edge. The rest are output edges, concatenating all intermediate nodes together to the output node. Notice that all edges except output edges are compounded with a list of operation candidates . Normally, candidates include convolution, pooling and skip-connection, augmented with a ReLU-Conv-BN or Pool-BN stacking.
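The connectivity rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name build_cell_edges and the node numbering (input nodes first, then intermediate nodes, then a single output node) are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_cell_edges(num_input: int, num_intermediate: int) -> Dict[str, List[Tuple[int, int]]]:
    """Enumerate the (source, target) edges of one cell, grouped by edge type.

    Assumed numbering: input nodes first, then intermediate nodes,
    then one output node with the largest index.
    """
    first_intermediate = num_input
    output = num_input + num_intermediate

    edges: Dict[str, List[Tuple[int, int]]] = {"input": [], "intermediate": [], "output": []}
    for target in range(first_intermediate, output):
        # Every intermediate node is connected to all previous nodes.
        for source in range(target):
            kind = "input" if source < first_intermediate else "intermediate"
            edges[kind].append((source, target))
        # Output edges concatenate all intermediate nodes into the output node.
        edges["output"].append((target, output))
    return edges


# A cell with two input nodes and two intermediate nodes.
minimal_cell = build_cell_edges(num_input=2, num_intermediate=2)
for kind, edge_list in minimal_cell.items():
    print(kind, edge_list)
```

With two input nodes and two intermediate nodes, this enumeration yields four input edges, a single intermediate edge and two output edges, which matches the one-intermediate-edge cell used in the later analysis.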
Model-free policy search methods in ENAS leave unused the derivative information in network performance evaluators and knowledge about the transition model, two important factors that can boost the optimization efficiency. Building upon this insight, differentiable NAS frameworks are proposed (Liu et al., 2018; Xie et al., 2018). Their key contributions are novel differentiable instantiations of the DAG, representing not only the search space but also the network construction process. Notice that the operation selection process can be materialized as multiplying a one-hot random variable to each edge . And if we further replace the non-differentiable network accuracy with a differentiable surrogate loss, e.g. training loss or testing loss, the NAS objective would become
where and are parameters of architecture distribution and operations respectively, and is the surrogate loss, which can be calculated on either training sets as in single-level optimization or testing sets in bi-level optimization. For classification,
where denotes dataset, is -th pair of input and label, is the output of the sampled network, is the cross entropy loss.
DARTS (Liu et al., 2018) and SNAS (Xie et al., 2018) deviate from each other on how to make in differentiable. DARTS relaxes Eq.1 continuously with deterministic weights . SNAS keeps the network sampling process represented by Eq.1, while making it differentiable with the softened one-hot random variable from the gumbel-softmax technique:
where is the temperature of the softmax, is the -th Gumbel random variable, is a uniform random variable. A detailed discussion on SNAS’s benefits over DARTS from this sampling process is provided in Xie et al. (2018).
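The two relaxations can be contrasted with a small sketch. It assumes the standard forms of both techniques: a softmax over architecture parameters for the deterministic DARTS weights, and a softmax over log-probabilities perturbed by Gumbel noise and divided by the temperature for the SNAS sample. The function names and the example parameter values are hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(log_probs: List[float], temperature: float,
                          rng: random.Random) -> List[float]:
    """Draw one softened one-hot vector (assumed standard gumbel-softmax form)."""
    perturbed = []
    for log_p in log_probs:
        # Clamp the uniform sample away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((log_p + gumbel) / temperature)
    return softmax(perturbed)


# Hypothetical architecture parameters for three candidates on one edge.
alpha = [0.0, 0.5, -0.5]

# DARTS-style deterministic weights: the same vector on every forward pass.
darts_weights = softmax(alpha)

# SNAS-style stochastic sample: a different softened one-hot vector per draw.
rng = random.Random(0)
log_probs = [math.log(p) for p in darts_weights]
snas_sample = gumbel_softmax_sample(log_probs, temperature=0.5, rng=rng)

print(darts_weights)
print(snas_sample)
```

Lowering the temperature pushes each sample closer to a strictly one-hot vector, whereas the deterministic weights stay a fixed mixture of all candidates.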
The sophisticated design scheme in Sec.2.1 is expected to serve the purpose of neural wiring discovery, more specifically, simultaneous search of operation and topology, two crucial aspects in neural architecture design (He et al., 2016). Back to the example in Fig.2, if the optimized architecture does not pick any candidate at all input edges from certain input nodes, e.g. and , it can be regarded as the discovery of a simplified inter-cell wiring structure. Similarly, if the optimized architecture does not pick any candidate at certain intermediate edges, a simplified intra-cell wiring structure is discovered. In differentiable NAS frameworks (Liu et al., 2018; Xie et al., 2018), this is achieved by adding a None operation to the candidate list. The hope is that finally some None operations will be selected.
However, the effectiveness of differentiable neural wiring discovery has become speculative recently. On the one hand, searching only for operations in the single-path setting, as in ProxylessNAS (Cai et al., 2018) and SPOS (Guo et al., 2019), reaches state-of-the-art performance without significant modification of the overall training scheme. Xie et al. (2019) search solely for topology with random graph generators and achieve even better performance. On the other hand, post hoc analysis has shown that networks with width-preferred topology similar to those discovered in DARTS and SNAS are easier to optimize compared with their depth-preferred counterparts (Shu et al., 2020). Prior to these works, there has been a discussion on the same topic, although from a low-level view. Xie et al. (2018) point out that the wiring discovery in DARTS is not a direct derivation from the optimized operation weight. Rather, None operation is excluded during final operation selection and a post hoc scheme to select edges with top- weight from is designed for the wiring structure. They further illustrate the bias of this derivation scheme. Although this bias exists, some recent works have reported impressive results based on DARTS (Xu et al., 2019; Chu et al., 2019), especially in wiring discovery (Wortsman et al., 2019).
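The post hoc derivation scheme attributed to DARTS above can be sketched as follows. This is an illustration under assumptions rather than the DARTS code: the function name, the nested-dictionary layout of the weights, the parameter k and the example weights are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed layout: weights[target_node][source_node][operation_name] = operation weight.
EdgeWeights = Dict[int, Dict[int, Dict[str, float]]]


def derive_wiring(weights: EdgeWeights, k: int, none_name: str = "none") -> Dict[int, List[Tuple[int, str]]]:
    """Exclude None, then keep the k incoming edges with the largest remaining weight."""
    derived: Dict[int, List[Tuple[int, str]]] = {}
    for target, incoming in weights.items():
        ranked = []
        for source, ops in incoming.items():
            # None is removed before the strongest operation on the edge is chosen.
            candidates = {op: w for op, w in ops.items() if op != none_name}
            best_op = max(candidates, key=candidates.get)
            ranked.append((candidates[best_op], source, best_op))
        ranked.sort(key=lambda item: item[0], reverse=True)
        derived[target] = [(source, op) for _, source, op in ranked[:k]]
    return derived


# Hypothetical weights in which None carries the largest weight on every edge.
example_weights: EdgeWeights = {
    3: {
        0: {"none": 0.70, "conv": 0.20, "skip": 0.10},
        1: {"none": 0.50, "conv": 0.10, "skip": 0.40},
        2: {"none": 0.90, "conv": 0.06, "skip": 0.04},
    }
}
derived_wiring = derive_wiring(example_weights, k=2)
print(derived_wiring)
```

In this example None is the heaviest candidate on every edge, yet two edges are still kept, so the resulting topology is fixed by the derivation rule rather than by the optimized weights alone.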
In this work, we conduct a genetic study on differentiable neural wiring discovery.
Three evolution patterns occur in networks with the largest probability when we replicate experiments of DARTS, SNAS, and DSNAS (Fig. 3). For DSNAS, we extend the single-path DSNAS of Hu et al. (2020) to cells; all settings are the same as in Xie et al. (2018), except that there is no resource constraint. The patterns are:
(P1) Growing tendency: Even though starting from a fully-connected super-network, the wiring evolution with single-level optimization is not a pruning process. Rather, all edges are dropped in the beginning (the drop is soft, as it describes the probability of operations). Some of them then gradually recover.
(P2) Width preference: Intermediate edges barely recover from the initial drop, even in single-level optimization, unless the search space is extremely small.
(P3) Catastrophic failure: If bi-level optimization is used, where architecture parameters are updated with a held-out set, no edge can recover from the initial drop.
These patterns are universal in the sense that they robustly hold with variation of random seeds, learning rate, the number of cells, the number of nodes in a cell, and the number of operation candidates at each edge. More evolution details are provided in Appx.A.
The universal phenomena reported above motivate us to seek a unified framework. In Sec. 2.2, we unify the formulation of DARTS and SNAS. However, this high-level view does not provide us with much insight into the underlying mechanism of architecture evolution. Apart from this high-level unification, Xie et al. (2018) also provide a microscopic view by deriving the policy gradient equivalent of SNAS’s search gradient
where is the gumbel-softmax random variable, denotes that is a cost independent from for gradient calculation, highlights the approximation introduced by . Zooming from the global-level objective to local-level cost minimization, it naturally divides the intractable NAS problem into tractable ones, because we only have a manageable set of operation candidates. In this section we generalize it to unify other differentiable NAS frameworks.
Recently, Hu et al. (2020) propose to approximate it at the discrete limit , and the gradient above becomes
where is a strictly one-hot random variable, is the th element in it. Exploiting the one-hot nature of , i.e. only on edge is 1, others i.e. are , they further reduce the cost function to
We can also derive the cost function of DARTS and ProxylessNAS in a similar form:
where and highlight the loss is an approximation of . highlights the attention-based estimator, also referred to as the straight-through technique (Bengio et al., 2013). is inherited from above. Details are provided in Appx.B.
Seeing the consistency in these gradients, we reformulate architecture optimization in existing differentiable NAS frameworks as local cost optimization with:
Previously, Xie et al. (2018) interpreted this cost as the first-order Taylor decomposition (Montavon et al., 2017) of the loss with a convoluted notion, effective layer. Aiming at a cleaner perspective, we expand the point-wise form of the cost, with denoting the feature map from edge , denoting marginalizing out all other edges except , denoting the width, height, channel and batch size:
It is not straightforward to see how this cost minimization (Eq.4) achieves wiring topology discovery. In this section, we conduct an empirical and theoretical study on the assignment and the dynamics of the cost with the training of operation parameters and how it drives the evolution of wiring topology. Our analysis focuses on the appearance and disappearance of None operation because the wiring evolution is a manifestation of choosing None or not at edges. We start by digging into statistical characteristics of the cost behind the phenomena in Sec.3 (Sec.5.1). We then derive the cost assignment mechanism from output edges to input edges and intermediate edges (Sec. 5.2) and introduce the dynamics of the total cost, a strong indicator for the growing tendency (Sec. 5.3). By exhibiting the unequal assignment of cost (Sec. 5.4), we explain the underlying cause of width preference. Finally, we discuss the exceptional cost tendency in bi-level optimization, underpinning catastrophic failure (Sec. 5.5). Essentially, we reveal that the aforementioned patterns are rooted in some inductive biases in existing differentiable NAS frameworks.
Since the cost of None is fixed at 0, the network topology is determined by the signs of the costs of other operations, which may change as operation parameters are updated. Though we cannot analytically calculate the expected cost of other operations, we can estimate it with Monte Carlo simulation.
To simplify the analysis without loss of generality, we consider DSNAS on one minimal cell with only one intermediate edge (Fig.2). Later we will show why conclusions on this cell are generalizable. One cell is sufficient since the cost at the last cell dominates, being larger by at least one order of magnitude, which can be explained by the vanishing gradient phenomenon. All experiments reported below are on the CIFAR-10 dataset. The Monte Carlo simulation is conducted in this way. After initialization (the constant in the batch normalization is set to be 1e-5), subnetworks are sampled uniformly with weight parameters and architecture parameters fixed. The cost of each sampled operation at each edge is stored. After 50 epochs of sampling, statistics of the cost are calculated.
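The sampling procedure just described can be sketched as below. This is only a skeleton under assumptions: cost_fn stands in for whatever routine evaluates the cost of the sampled operation on each edge (in practice it would come from back-propagation through the sampled subnetwork), and toy_cost_fn is a made-up placeholder with no relation to measured values.

```python
import random
import statistics
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int]
Architecture = Dict[Edge, str]


def monte_carlo_cost_stats(edges: List[Edge], ops: List[str],
                           cost_fn: Callable[[Architecture], Dict[Edge, float]],
                           num_samples: int, rng: random.Random
                           ) -> Dict[Tuple[Edge, str], Tuple[float, float]]:
    """Sample subnetworks uniformly and collect cost statistics per (edge, operation)."""
    records: Dict[Tuple[Edge, str], List[float]] = defaultdict(list)
    for _ in range(num_samples):
        # Uniform sampling: one operation per edge; all parameters stay fixed.
        arch = {edge: rng.choice(ops) for edge in edges}
        costs = cost_fn(arch)
        for edge, op in arch.items():
            records[(edge, op)].append(costs[edge])
    # Mean and (population) standard deviation of the stored costs.
    return {key: (statistics.mean(vals), statistics.pstdev(vals))
            for key, vals in records.items()}


def toy_cost_fn(arch: Architecture) -> Dict[Edge, float]:
    """Placeholder only: None costs 0, every other operation costs 1."""
    return {edge: 0.0 if op == "none" else 1.0 for edge, op in arch.items()}


cell_edges = [(0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
candidate_ops = ["none", "conv", "skip"]
stats = monte_carlo_cost_stats(cell_edges, candidate_ops, toy_cost_fn,
                               num_samples=200, rng=random.Random(0))
```

Keeping the estimator separate from cost_fn means the same loop applies to larger cells, since only the edge list and the cost routine change.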
Surprisingly, for all operations except None, cost is inclined towards positive at initialization (Fig.4(a)). Similarly, we estimate the cost mean statistics after updating weight parameters for 150 epochs with architecture parameters still fixed (here we strictly follow the training setup in Liu et al. (2018), with BN affine (Ioffe and Szegedy, 2015) disabled). As shown in Fig.4(b), most of the cost becomes negative. It then becomes apparent that None operations are preferred in the beginning as they minimize these costs, while after training the cost minimizer would prefer operations with the most negative cost. We formalize this hypothesis as:
Hypothesis: The cost of operations except None is positive near initialization. It eventually turns negative with the training of operation parameters. Cell topology thus exhibits a tendency of growing.
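The selection rule behind this hypothesis can be illustrated with a toy example. It assumes that the cost of None is zero, as the surrounding discussion suggests, and that each edge simply picks the candidate with the lowest cost; the cost values are made up for illustration and are not measurements.

```python
from typing import Dict


def select_operation(costs: Dict[str, float]) -> str:
    """Return the candidate with the lowest cost, with None's cost assumed to be 0."""
    all_costs = dict(costs, none=0.0)
    return min(all_costs, key=all_costs.get)


# Made-up costs near initialization: every real operation has a positive cost.
near_init = {"conv": 0.30, "pool": 0.10, "skip": 0.20}

# Made-up costs after training: some operations now have a negative cost.
after_training = {"conv": -0.40, "pool": 0.05, "skip": -0.10}

print(select_operation(near_init))       # none
print(select_operation(after_training))  # conv
```

Under this rule an edge is dropped as long as all of its real candidates have positive cost, and it recovers as soon as one of them turns negative.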
Though Fig.4(a)(b) show statistics of the minimal cell, this Monte Carlo estimation can scale up to more cells with more nodes. In our analysis method, it works as a microscope to the underlying mechanism. We provide its instantiation in original training setting (Xie et al., 2018) i.e. stacking 8 cells with 4 intermediate nodes in Appx.C, whose inclination is consistent with the minimal cell here.
To theoretically validate this hypothesis, we analyze the sign-changing behavior of the cost (Eq.5). We start by listing the circumscription, eliminating possible confusion in derivation below.
Search space base architecture (DAG) A: A can be constructed as in Sec. 2.1.
Although Eq.5 generally applies to search spaces such as cell-based, single-path or modular ones, in this work we showcase the wiring evolution in A without loss of generality. We recommend readers to review Sec. 2.1 before continuing. Particularly, we want to remind readers that intermediate edges are those connecting intermediate nodes.
We first study the assignment mechanism of cost, which by intuition is driven by gradient back-propagation.
There are basically two types of edges in terms of gradient back-propagation. The more complicated ones are edges pointing to nodes with outgoing intermediate edges. For instance, and in Fig. 2. Take for example, its cost considers two paths of back-propagation, i.e. path (4-2-0) and path (4-3-2-0):
where and denote the cost functions calculated from first path and second path respectively, and we have at the path and on the path, is result from edge .
The rest are edges pointing to nodes whose only outgoing edges are output edges, e.g. , and . Take for example, its cost only involves one path of back-propagation, i.e., (4-3-0):
A path does not distribute cost from its output edge after passing one intermediate edge.
(Sketch) Let denotes the Conv output on edge , we expand at path (4-3-2-0):
Expanding to the element-wise calculation, we can prove
Exploiting the property of normalization, we have
Check Appx.D for the full proof. Note that this result can be generalized to arbitrary back-propagation path involving intermediate edges. The theorem is thus proved. ∎
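One well-known property of normalization that is consistent with this sketch is scale invariance: rescaling the input of a normalization layer without affine parameters leaves its output unchanged, so the inner product between the input and the gradient of any downstream loss with respect to that input vanishes. The following numerical check illustrates the property on a single channel. That this is exactly the property used in the full proof is an assumption; the loss function, the coefficients and the input values are arbitrary choices, and the stabilizing constant eps is set to 0 so that the invariance is exact.

```python
import math
from typing import Callable, List


def batch_norm(xs: List[float], eps: float = 0.0) -> List[float]:
    """Normalize one channel to zero mean and unit variance (no affine parameters)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def loss_after_bn(xs: List[float], coeffs: List[float]) -> float:
    """An arbitrary smooth loss applied to the normalized values."""
    return sum(c * math.tanh(y) for c, y in zip(coeffs, batch_norm(xs)))


def numerical_grad(f: Callable[[List[float]], float], xs: List[float],
                   h: float = 1e-6) -> List[float]:
    """Central finite-difference gradient of f at xs."""
    grads = []
    for i in range(len(xs)):
        up = list(xs)
        up[i] += h
        down = list(xs)
        down[i] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads


xs = [0.5, -1.2, 2.0, 0.3, -0.7, 1.1]
coeffs = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25]

grads = numerical_grad(lambda v: loss_after_bn(v, coeffs), xs)
# Sum over elements of gradient times input, taken at the normalization input.
cost_at_bn_input = sum(g * x for g, x in zip(grads, xs))

print(grads)
print(cost_at_bn_input)
```

The individual gradients are non-zero, so parameters behind the normalization still receive a training signal, while the summed gradient-times-input quantity is zero up to finite-difference error.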
By Thm.1, cost on is only distributed from its subsequent output . Subsequent intermediate does not contribute to it. As illustrated in Fig.5(a)(b), Thm.1 reveals the difference between the cost assignment and gradient back-propagation. It thus implies that the notion of multiple effective layers in Xie et al. (2018) is an illusion in search space ; an edge is only involved in one layer of cost assignment. With this theorem, the second term in Eq.6 can be dropped. Eq.6 is thus in the same form as Eq.7. That is, this theorem brings about a universal form of cost on all edges:
Total cost distributes to edges in the same cell as
where is the output node, intermediate nodes, all nodes.
For example, in the minimal cell above, we have
In the remaining parts of this paper, we would refer to with the cost at output nodes, and refer to with the cost at intermediate nodes. Basically these are cost sums of edges pointing to nodes. Even though all cells have this form of cost assignment, we can prove that the total cost in each cell except the last one sums up to 0. More formally:
In cells except the last, for intermediate nodes that are connected to the same output node, the cost of edges pointing to them sums up to 0.
For all search spaces in , input nodes of the last cell e.g , are the output node of previous cells, if they exist (Sec.2.1). Consider the output node of the second last cell. The cost sum at this node is distributed from the last cell along paths with at least one intermediate node, e.g. (4-2-0) and (4-3-2-0), as illustrated in Fig.6. By Thm.1, this cost sum is 0. The same claim for output nodes in other cells can be proved by induction. Check Fig.7 for a validation. ∎
Let us take a close look at the total cost of the last cell.
Cost at output edges of the last cell has the form . It is negatively related to classification accuracy. It tends to be positive at low accuracy, negative at high accuracy.
(Sketch) Negatively related: We can first prove that for the last cell’s output edges, cost of one batch with sampled architecture has an equivalent form:
where is Eq.3, is the entropy of network output. Obviously, the cost sum is positively correlated to the loss, thus negatively correlated to the accuracy. With denoting the output of -th node from -th image, we have
Positive at low accuracy: Exploiting normalization and weight initialization, we have:
Negative at high accuracy: With operation parameters updated towards convergence, the probability of the -th image being classified to the correct label increases towards 1. Since , we have
Derivation details can be found in Appx.E. ∎
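The sign behaviour claimed in Thm.2 can be checked on a single prediction. The sketch assumes, as the proof sketch states, that the cost sum equals the classification loss minus the entropy of the network output; reducing the batch to one sample is a simplification, and the probability vectors are made up for illustration.

```python
import math
from typing import List


def cost_sum(probs: List[float], label: int) -> float:
    """Cross entropy of the true label minus the entropy of the predicted distribution."""
    cross_entropy = -math.log(probs[label])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return cross_entropy - entropy


num_classes = 10
confident = [0.9] + [0.1 / 9] * 9          # 90% of the mass on class 0
uniform = [1.0 / num_classes] * num_classes

cost_wrong = cost_sum(confident, label=1)   # confident but misclassified
cost_right = cost_sum(confident, label=0)   # confident and correct
cost_uniform = cost_sum(uniform, label=3)   # uninformative prediction

print(cost_wrong, cost_right, cost_uniform)
```

A confident misclassification gives a positive cost, a confident correct prediction gives a negative one, and a uniform output sits at the boundary between the two, which agrees with the trend from positive to negative as accuracy improves.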
As shown in Fig.8, cost of output edges at the last cell starts positive and eventually becomes negative. Intuitively, because and are both optimized to minimize Eq.2, training is also minimizing the cost. More formally, Eq.8 bridges the dynamics of cost sum with the learning dynamics (Saxe et al., 2014; Liao and Couillet, 2018) of and :
The cost sum in one batch decreases locally if , increases locally if . By Thm.2, the global trend of the cost sum at the last cell should be decreasing.
If the trends of cost at all edges are consistent with Thm.2, eventually the growing tendency (P1) occurs.
If all edges are born equally, then by the cost assignment mechanism in Sec.5.2, what edges are finally picked would be dependent mainly on the task specification (dataset and objective) and randomness (lottery ticket hypothesis (Frankle and Carbin, 2018)), possibly also affected by the base architecture. But it does not explain the width preference. The width preference implies the distinction of intermediate edges. To analyze it, we further simplify the minimal cell assuming the symmetry of and , as shown in Fig.9.
Fig.10(a) shows the cost estimated with the Monte Carlo estimation, averaged over operations. Cost at intermediate is higher than cost of the same operation at , no matter whether there is sampling on or not, or how many operation candidates on each edge. We describe these various settings in Appx.F.
We conjecture that the distinctiveness of cost at intermediate edges is associated with the fact that it is less trained than input edges. It may be their lag in training that induces a lag in the cost-decreasing process, with a higher loss than frequently-trained counterparts. Why are they less trained? Note that in every input must be followed by an output edge. Reflected in the simplified cell, and are always trained as long as they are not sampled as None. Particularly, is updated with gradients from two paths (3-2-1-0) and (3-1-0). When None is sampled on , can be updated with gradient from path (3-1-0). However, when a None is sampled on , cannot be updated because its input is zero. Even if None is not included in , there are more model instances on path (3-2-1-0) than path (3-2-0) and (3-1-0) that share the training signal.
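The argument about unequal training frequency can be made concrete with a small simulation of the simplified cell. The edge names are assumptions inferred from the paths mentioned above: node 0 is the input, nodes 1 and 2 are intermediate nodes and node 3 is the output, so (0,1) and (0,2) are input edges and (1,2) is the intermediate edge. An input edge is counted as trained whenever it is not sampled as None; the intermediate edge is counted only when it is not None and its source node is non-zero.

```python
import random
from typing import Dict, List


def count_updates(ops: List[str], num_samples: int, rng: random.Random) -> Dict[str, int]:
    """Count how often each edge receives a useful training signal under uniform sampling."""
    counts = {"input_01": 0, "input_02": 0, "intermediate_12": 0}
    for _ in range(num_samples):
        op_01, op_02, op_12 = (rng.choice(ops) for _ in range(3))
        if op_01 != "none":
            counts["input_01"] += 1
        if op_02 != "none":
            counts["input_02"] += 1
        # Edge (1,2) is trained only if it is active AND node 1 carries a non-zero signal.
        if op_12 != "none" and op_01 != "none":
            counts["intermediate_12"] += 1
    return counts


candidate_ops = ["none", "conv", "pool", "skip"]
num_samples = 20000
update_counts = count_updates(candidate_ops, num_samples, random.Random(0))

for name, count in update_counts.items():
    print(name, count / num_samples)
```

With one None among four candidates, an input edge is trained in about three quarters of the samples and the intermediate edge in about nine sixteenths, which is the kind of lag the conjecture refers to.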
We design a controlled experiment by deleting the output and fixing the operation at , as shown in Fig.9(b). This is no longer an instance of since one intermediate node is not connected to the output node. But in this base architecture, path (3-2-1-0) can be trained with equal frequency as (3-2-0). As shown in Fig.9(d), the cost bias on is resolved. Interestingly, path (3-2-1-0) is deeper than (3-2-0), but the cost on becomes positive and the edge is dropped. The preference of the search algorithm is altered to be depth over width. Hence the distinction of intermediate edges is likely due to unequal training between edges caused by base architecture design in , subnet sampling and weight sharing.
In , with subnet sampling and weight sharing, for edges pointing to the same node, cost at intermediate edges is higher than input edges with the same operation.
Given a system in following Remark 1, the cost sum eventually decreases to become negative. Since the cost at intermediate edges tends to be higher than the cost on input edges, if there is at least one edge with positive cost in the middle of evolution, this one must be an intermediate edge. It then becomes clear why intermediate edges recover from the initial drop much later than input edges.
Until now we have not distinguished between single-level and bi-level optimization on . This is because the cost minimization formulation Eq.4 generally applies to both. However, that every edge drops and almost none of them finally recovers in DARTS’s bi-level version seems exceptional. As shown in Fig.11(a), the cost sum of edges even increases, which is the opposite of Thm.2 and Remark 2. Because the difference between bi-level optimization and the single-level one is that the cost is calculated on a held-out search set, we scrutinize its classification loss and output entropy , whose difference by Thm.2 is the cost sum.
Fig.11(b) shows the comparison of and in the training set and the search set. For correct classification, and are almost comparable in the training set and the search set. But for data classified incorrectly, the classification loss is much larger in the search set. That is, data in the search set are classified poorly. This can be explained by overfitting. Apart from that, in the search set is much lower than its counterpart in the training set. This discrepancy in was previously discussed in the Bayesian deep learning literature: the softmax output of deep neural nets under-estimates the uncertainty. We would like to direct readers who are interested in a theoretical derivation to Gal and Ghahramani (2016). In sum, subnetworks are erroneously confident in the held-out set, on which their larger actually indicates their misclassification. As a result, the cost sum in bi-level optimization becomes more and more positive. None operation is chosen at all edges.
Cost from the held-out set may incline towards positive due to false classification and erroneously captured uncertainty. Hereby catastrophic failures (P3) occur in bi-level optimization.
In this work we study the underlying mechanism of wiring discovery in differentiable neural architecture search, motivated by some universal patterns in its evolution. We introduce a formulation that transfers NAS to a tractable local cost minimization problem, unifying existing frameworks. Since this cost is not static, we further investigate its assignment mechanism and learning dynamics. We discover some implicit inductive biases in existing frameworks, namely
The cost assignment mechanism for architecture is non-trivially different from the credit assignment in gradient back-propagation for parameters . Exaggerating this discrepancy, base architectures and training schemes, which are expected to facilitate the search on intermediate edges, turn out to backfire on them.
During training, cost decreases from positive and eventually turns negative, promoting the tendency of topology growth. If this cost-decreasing process is hindered or reversed, as in bi-level optimization, there will be a catastrophic failure in wiring discovery.
In sum, some topologies are chosen by existing differentiable NAS methods not because they are generally better, but because they fit these methods better. The observed regularity is a product of dedicated base architectures and training schemes in these methods, rather than global optimality. This conclusion invites some deeper questions: Are there any ways to circumvent these biases without sacrificing generality? Are there any other implicit biases in existing NAS methods? To this end, we hope our work can evoke an open discussion on them. We also hope our analytical framework can equip others for future theoretical study of differentiable neural architecture search.
Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
Liao, Z. and Couillet, R. (2018). The dynamics of learning: a random matrix approach. In International Conference on Machine Learning, pp. 3072–3081.
Zoph, B. and Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
ProxylessNAS (Cai et al., 2018) inherits DARTS’s learning objective, and introduces the BinaryConnect technique to empirically save memory. To achieve that, they propose the following approximation:
A path does not distribute cost from its output edge after passing one intermediate edge.
For each intermediate edge, three operations (ReLU, Conv/pooling, BN) are sequentially ordered as shown in Fig. 19. To prove Thm. 1, we analyse the effect of each operation on the cost assignment in the intermediate edge. Let denote the input of the Conv operation on edge , denote the Conv operation output, and be the filter weight in the Conv operation. We do not consider the ReLU operation (before Conv) here, since the ReLU operation obviously satisfies the proof; Thm. 1 can be easily generalized to pooling operations. The full proof consists of the following three steps.
Step 1: By expanding at path (4-3-2-0), we have:
Step 2: Utilizing the linear property of Conv operation, we can simplify the above equation as:
Proof of Step 2: To derive Eq.(11), we first calculate the gradient w.r.t Conv input (black-underlined part). Note that (), (), () and denote the index of width, height, channel and batch dimensions of Conv input (output), and is the Conv filter width and height.
where the second equality holds by changing the order of indexes , , , . Using the linear property of Conv operation, the red-underlined part in the fourth equality can be derived due to:
Step 3: We have shown that Conv/pooling operations do not change the cost value in the intermediate edge, so next we analyse the effect of batch normalization (Ioffe and Szegedy, 2015) on the cost assignment. Exploiting the property of batch normalization, we have:
Proof of Step 3: Similar to Step 2, we compute the gradient w.r.t. the Batch Normalization input (black-underlined part). Before we start, we first show the batch normalization process as below:
where and denote the input and output of the BN operation, and are the mean and variance statistics of the -th channel. Note that d denotes the spatial size . Then, the gradients with respect to the output (black-underlined part) are calculated based on the chain rule: