1 Introduction
Deep networks have been applied to many applications, where proper architectures are extremely important to ensure good performance. Recently, neural architecture search (NAS) Zoph and Le (2017); Baker et al. (2017) has been developed as a promising approach to replace human experts in designing architectures; it can find networks with fewer parameters and better performance than handcrafted ones Yao et al. (2018); Hutter et al. (2018). NASNet Zoph and Le (2017) is the pioneering work along this direction: it models the design of convolutional neural networks (CNNs) as a multi-step decision problem and solves it with reinforcement learning Sutton and Barto (1998). However, since the search space is discrete and extremely large, NASNet requires a month with hundreds of GPUs to obtain a satisfying architecture. Later, observing the good transferability of networks from small to large ones Yosinski et al. (2014), NASNet-A Zoph et al. (2017) proposed to cut the networks into blocks, so that the search only needs to be carried out within such a block or cell. The identified cell is then used as a building block to assemble large networks. Such a two-stage search strategy dramatically reduces the size of the search space, and subsequently leads to significant speedups of various previous search algorithms (e.g., evolution algorithm Liu et al. (2018b); Real et al. (2018), greedy search Liu et al. (2018a), and reinforcement learning Zhong et al. (2018)).
Although the size of the search space is reduced, the search space is still discrete, which is generally hard to search efficiently Bertsekas (1997). More recent endeavors focus on how to change the landscape of the search space from a discrete to a differentiable one Luo et al. (2018); Liu et al. (2019); Xie et al. (2019). The benefit of this idea is that a differentiable space enables the computation of gradient information, which can speed up the convergence of the underlying optimization algorithm Bertsekas (1997). Various techniques have been proposed, e.g., DARTS Liu et al. (2019) smooths design choices with softmax and trains an ensemble of networks; SNAS Xie et al. (2019) enhances reinforcement learning with a smooth sampling scheme; NAO Luo et al. (2018) maps the search space into a new differentiable space with an auto-encoder.
Among all these works (Table 1), the state-of-the-art is DARTS Liu et al. (2019), as it combines the best of both worlds, i.e., fast gradient descent (differentiable computation) within a cell (small search space). However, its search efficiency and the performance of the identified architectures are still not satisfying. From the computational perspective, all operations need to be forward- and backward-propagated during gradient descent, while only one operation will be selected. From the perspective of performance, operations typically correlate with each other Xie et al. (2019); Guo et al. (2019), e.g., a 7x7 convolution filter can cover a 3x3 one as a special case Shin et al. (2018). When updating a network's weights, the ensemble constructed by DARTS during the search may therefore lead to inferior architectures being discovered. Moreover, as mentioned in Xie et al. (2019), DARTS is not complete (Table 1), i.e., the final structure needs to be re-identified after the search. This causes a bias between the searched and the final architecture, and might degrade the performance of the final architecture.
Method | Space: diff | Space: cell | Complete | Architecture constraint | Search algorithm
NASNet Zoph and Le (2017); Baker et al. (2017) |  |  |  | none | reinforcement learning
NASNet-A Zoph et al. (2017) |  |  |  | none | reinforcement learning
AmoebaNet Real et al. (2018) |  |  |  | none | evolution algorithm
SNAS Xie et al. (2019) |  |  |  | none | reinforcement learning
DARTS Liu et al. (2019) |  |  |  | none | gradient descent
NASP (proposed) |  |  |  | discrete | proximal algorithm
In this work, we propose NAS with proximal iterations (NASP) to improve the efficiency and performance of DARTS. Beyond the widely discussed and used perspectives of i) cutting the search space by cell, ii) changing the discrete search space into a differentiable one, and iii) a complete search process, we introduce a new perspective into NAS, i.e., a constraint during the search. Specifically, we keep the architecture differentiable, but constrain only one of all possible operations to be actually employed during forward and backward propagation. However, such a discrete constraint is hard to optimize, and we propose a new strategy derived from the proximal algorithm Parikh and Boyd (2013) to solve it. Compared with DARTS, our NASP is not only ten times faster but can also discover better architectures. Experiments demonstrate that NASP obtains state-of-the-art performance in both test accuracy and computational efficiency.
2 Preliminaries
2.1 Differentiable Architecture Search (DARTS)
We first introduce DARTS Liu et al. (2019). Specifically, a cell (Figure 1(a)) is a directed acyclic graph consisting of an ordered sequence of nodes, and it has two input nodes and a single output node Zoph et al. (2017). For convolutional cells, the input nodes are defined as the cell outputs of the previous two layers. For recurrent cells, they are defined as the input at the current step and the state carried from the previous step. The output of the cell is obtained by concatenating all the intermediate nodes.
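Within a cell, each node $x^{(i)}$ is a latent representation and each directed edge $(i, j)$ is associated with some operation $O^{(i,j)}$ that transforms $x^{(i)}$ to $x^{(j)}$. Thus, each intermediate node is computed from all of its predecessors (Figure 1(a)), i.e., $x^{(j)} = \sum_{i<j} O^{(i,j)}(x^{(i)})$. However, such a search space is discrete. DARTS Liu et al. (2019) uses a softmax relaxation to turn discrete choices into smooth ones (Figure 1(b)), i.e., each $O^{(i,j)}$ is replaced by $\bar{O}^{(i,j)}$ as

$$\bar{O}^{(i,j)}(x^{(i)}) = \frac{1}{C} \sum_{m=1}^{|\mathcal{O}|} \exp\big(a^{(i,j)}_m\big) \, O_m(x^{(i)}), \qquad (1)$$

where $C = \sum_{n=1}^{|\mathcal{O}|} \exp(a^{(i,j)}_n)$ is a normalization term and $O_m$ denotes the $m$-th operation in the search space $\mathcal{O}$. Thus, the choice of operation for an edge $(i, j)$ is replaced by a real vector $a^{(i,j)} = [a^{(i,j)}_k] \in \mathbb{R}^{|\mathcal{O}|}$, and all choices in a cell can be represented by a matrix $A = [a^{(i,j)}]$ (see Figure 1(d)). Vectors are denoted by lowercase boldface and matrices by uppercase boldface in this paper.

With such a differentiable relaxation, the search problem in DARTS is formulated as

$$\min_{A} \; \mathcal{L}_{\text{val}}(w^*, A), \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{\text{train}}(w, A), \qquad (2)$$

where $\mathcal{L}_{\text{val}}$ (resp. $\mathcal{L}_{\text{train}}$) is the loss on the validation (resp. training) set, and gradient descent is used for the optimization. Let $\bar{w} = w - \varepsilon \nabla_{w} \mathcal{L}_{\text{train}}(w, A)$; the gradient w.r.t. $A$ is given by

$$\nabla_{A} \mathcal{L}_{\text{val}}(\bar{w}, A) - \varepsilon \, \nabla^2_{A, w} \mathcal{L}_{\text{train}}(w, A) \, \nabla_{\bar{w}} \mathcal{L}_{\text{val}}(\bar{w}, A), \qquad (3)$$

where $\varepsilon > 0$ is the step-size and a second-order derivative, i.e., $\nabla^2_{A, w}(\cdot)$, is involved. However, the evaluation of the second-order term is extremely expensive: it requires two extra computations of the gradient w.r.t. $w$ and two forward passes of $A$. Finally, a final architecture needs to be discretized from the relaxed $A$. The overall procedure is summarized in Appendix LABEL:darts:alg.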
Due to the differentiable relaxation in (1), an ensemble of operations is maintained, and all operations in the search space need to be forward- and backward-propagated when updating $w$; moreover, the evaluation of the second-order term in (3) is very expensive and is known as a computational bottleneck of DARTS Xie et al. (2019); Noy et al. (2019). Besides, the performance obtained by DARTS is not as good as desired, due to possible correlations among operations Guo et al. (2019) and the need to derive a new architecture after the search (i.e., lack of completeness) Xie et al. (2019).
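To make the cost of the softmax relaxation concrete, the following is a minimal sketch of a relaxed edge. It assumes scalar inputs and three hypothetical stand-in operations (`ops`); these names and values are illustrative and are not taken from the DARTS code base.

```python
import math

def softmax(a):
    """Numerically stable softmax over a list of floats."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, a):
    """Relaxed edge: softmax-weighted sum over every candidate operation."""
    weights = softmax(a)
    # Every operation is evaluated, even though only one is kept in the end.
    return sum(w * op(x) for w, op in zip(weights, ops))

# Hypothetical scalar stand-ins for an identity, a scaling op, and a zero op.
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]
a = [0.0, 0.0, 0.0]        # uniform architecture weights for one edge
y = mixed_op(3.0, ops, a)  # averages the three outputs 3, 6 and 0
```

With uniform weights the edge output is simply the average of all operation outputs, which shows why each search step pays for the full set of operations.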
2.2 Proximal Algorithm (PA)
The proximal algorithm (PA) Parikh and Boyd (2013) is a popular optimization technique in machine learning for handling constrained optimization problems of the form $\min_{x} f(x), \; \text{s.t.} \; x \in \mathcal{S}$, where $f$ is a smooth objective and $\mathcal{S}$ is a constraint set. The crux of PA is the proximal step:

$$x = \text{prox}_{\mathcal{S}}(z) = \arg\min_{y} \tfrac{1}{2} \| y - z \|_2^2, \quad \text{s.t.} \quad y \in \mathcal{S}. \qquad (4)$$

Closed-form solutions for the PA update exist for many constraint sets in (4), such as $\ell_1$- and $\ell_2$-norm balls Duchi et al. (2008). Then, PA generates a sequence of $x_t$ by

$$x_{t+1} = \text{prox}_{\mathcal{S}}\big(x_t - \varepsilon \nabla f(x_t)\big), \qquad (5)$$

where $\varepsilon$ is the learning rate. PA is guaranteed to obtain the critical points of $f$ when $\mathcal{S}$ is a convex constraint, and produces limit points when the proximal step can be exactly computed Yao et al. (2017). Due to its nice theoretical guarantees and good empirical performance, it has been applied to many deep learning problems, e.g., network binarization Bai et al. (2018) and sparsification Wen et al. (2016). Another variant of PA with a lazy proximal step Xiao (2010), which maintains two copies of $x$ during the iterations, i.e.,

$$\bar{x}_t = \text{prox}_{\mathcal{S}}(x_t), \qquad x_{t+1} = x_t - \varepsilon \nabla f(\bar{x}_t), \qquad (6)$$

is also popularly used in deep learning for network quantization Courbariaux et al. (2015); Hou et al. (2017). It does not have a convergence guarantee in the nonconvex case, but empirically performs well on network quantization tasks. Finally, neither (5) nor (6) has been introduced into NAS.
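The two update rules can be compared on a small example. The sketch below assumes a toy objective $f(x) = \tfrac{1}{2}\|x - c\|_2^2$ and a box constraint $[0, 1]$, for which the proximal step is a simple clip; the objective, the target `c`, and the function names are illustrative choices, not part of the original text.

```python
def prox_box(x, lo=0.0, hi=1.0):
    """Proximal step for a box constraint: clip every coordinate."""
    return [min(max(v, lo), hi) for v in x]

def grad(x, c):
    """Gradient of the toy objective f(x) = 0.5 * ||x - c||^2."""
    return [xi - ci for xi, ci in zip(x, c)]

def pa_standard(x, c, eps=0.1, steps=200):
    """Standard proximal iteration: gradient step, then projection."""
    for _ in range(steps):
        g = grad(x, c)
        x = prox_box([xi - eps * gi for xi, gi in zip(x, g)])
    return x

def pa_lazy(x, c, eps=0.1, steps=200):
    """Lazy variant: gradient taken at the projected copy, x left unprojected."""
    for _ in range(steps):
        x_bar = prox_box(x)
        g = grad(x_bar, c)
        x = [xi - eps * gi for xi, gi in zip(x, g)]
    return x, prox_box(x)

c = [1.5, -0.5, 0.3]  # toy target, partly outside the box
x_std = pa_standard([0.0, 0.0, 0.0], c)
x_lazy, x_bar = pa_lazy([0.0, 0.0, 0.0], c)
```

In this toy run both variants end with the same projected point, but the unprojected copy kept by the lazy variant drifts outside the box, since nothing constrains it.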
3 Our Approach: NASP
As introduced in Section 2.1, DARTS maintains an ensemble of networks and derives the final architecture using the differentiable relaxation in (1). This enables the use of gradient descent, but brings a high computational cost and degrades the performance of the final architectures. Recall that in earlier NAS works, e.g., NASNet Baker et al. (2017); Zoph and Le (2017) and GeNet Xie and Yuille (2017), architectures are discrete when updating the networks' parameters. Compared with DARTS, such discretization naturally avoids the problems of completeness and of correlations among operations. Thus, can we search differentiable architectures but keep them discrete when updating the network's parameters? If this can be done, the above problems of DARTS may be addressed.
3.1 A New Relaxation: Discrete Constraint on $A$
As NAS can be seen as a black-box optimization problem Yao et al. (2018); Hutter et al. (2018), we bring the wisdom of constrained optimization Bertsekas (1997) to deal with the NAS problem. Specifically, we still keep $A$ continuous, which allows the use of gradient descent, but constrain the values of $A$ to be discrete. Thus, we propose to use the following relaxation instead of (1) on architectures:

$$\bar{O}^{(i,j)}(x^{(i)}) = \sum_{k=1}^{|\mathcal{O}|} a^{(i,j)}_k \, O_k(x^{(i)}), \quad \text{s.t.} \quad a^{(i,j)} \in \mathcal{C}, \qquad (7)$$

where the constraint set is defined as $\mathcal{C} = \{ a \mid \|a\|_0 = 1 \text{ and } 0 \le a_k \le 1 \}$. Thus, while $a^{(i,j)}$ is continuous, the constraint $\mathcal{C}$ keeps its choices discrete, and only one operation is actually activated for each edge while training the network parameter $w$, as illustrated in Figure 1(c). Finally, the NAS problem under our new relaxation (7) becomes

$$\min_{A} \; \mathcal{L}_{\text{val}}(w^*, A), \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{\text{train}}(w, A), \quad a^{(i,j)} \in \mathcal{C}. \qquad (8)$$

In the literature, learning with a discrete constraint has only been explored for parameters, e.g., deep network compression with binary weights Courbariaux et al. (2015), discrete matrix factorization Zhang et al. (2007) and gradient quantization Alistarh et al. (2017), but not in hyperparameter or architecture optimization. Meanwhile, other constraints have been considered in NAS, e.g., memory cost and latency He et al. (2018); Tan et al. (2018); Cai et al. (2019). We are the first to introduce constraints on the searched architecture into NAS (Table 1).
3.2 Neural Architecture Search with Proximal Iterations
Compared with the original search problem (2) in DARTS, the only difference is the newly introduced constraint $a^{(i,j)} \in \mathcal{C}$. A direct solution would be the PA of Section 2.2; the architecture can then be updated either by (5), i.e.,

$$A_{t+1} = \text{prox}_{\mathcal{C}}\big(A_t - \varepsilon \nabla_{A_t} \mathcal{L}_{\text{val}}(w(A_t), A_t)\big), \qquad (9)$$

or by the lazy proximal step (6), i.e.,

$$\bar{A}_t = \text{prox}_{\mathcal{C}}(A_t), \qquad A_{t+1} = A_t - \varepsilon \nabla_{\bar{A}_t} \mathcal{L}_{\text{val}}(w(\bar{A}_t), \bar{A}_t), \qquad (10)$$

where the gradient can be evaluated by (3) and the second-order approximation is still required. Let $\mathcal{C}_1 = \{ a \mid \|a\|_0 = 1 \}$ and $\mathcal{C}_2 = \{ a \mid 0 \le a_k \le 1 \}$ (i.e., $\mathcal{C} = \mathcal{C}_1 \cap \mathcal{C}_2$). The closed-form solution of the proximal step is given in Proposition 1 (proofs in Appendix LABEL:app:proposition1).
Proposition 1.
$\text{prox}_{\mathcal{C}}(a) = \text{prox}_{\mathcal{C}_2}\big(\text{prox}_{\mathcal{C}_1}(a)\big)$.
However, solving (8) is not easy. Due to the discrete nature of the constraint set, it is hard for the proximal iteration (9) to obtain a good solution Courbariaux et al. (2015). Besides, while (6) empirically leads to better performance than (5) in binary networks Courbariaux et al. (2015); Hou et al. (2017); Bai et al. (2018), the lazy update (10) does not succeed here either. The reason is that, as in DARTS Liu et al. (2019), $A$ is naturally in the range $[0, 1]$, but (10) cannot guarantee that. This in turn has a negative impact on the search performance.
Instead, motivated by Proposition 1, we keep $A$ optimized as continuous variables but constrained by $\mathcal{C}_2$. Similar box constraints have been explored in sparse coding and non-negative matrix factorization Lee and Seung (1999), where they help to improve the discriminative ability of the learned factors. Here, as demonstrated in the experiments, the box constraint helps to identify better architectures. Then, we also introduce another, discrete $\bar{A}$ constrained by $\mathcal{C}_1$, which is derived from $A$ during the iterations. Note that $\bar{A} \in \mathcal{C}$ is then guaranteed. The proposed procedure is described in Algorithm 1.
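The composition of the two proximal steps can be illustrated for a single edge. The sketch below assumes one set requires exactly one nonzero entry and the other requires all entries to lie in $[0, 1]$, and it assumes non-negative inputs, so that the largest entry is also the largest in magnitude; the function names are hypothetical.

```python
def prox_one_hot(a):
    """Keep only the largest entry and zero the rest (non-negative input assumed)."""
    k = max(range(len(a)), key=lambda i: a[i])
    return [v if i == k else 0.0 for i, v in enumerate(a)]

def prox_box(a):
    """Clip every entry to the range [0, 1]."""
    return [min(max(v, 0.0), 1.0) for v in a]

def prox_discrete(a):
    """Composition in the order used by Proposition 1: select first, then clip."""
    return prox_box(prox_one_hot(a))

a = [0.2, 0.7, 1.3, 0.1]   # architecture weights of one edge (toy values)
a_bar = prox_discrete(a)   # -> [0.0, 0.0, 1.0, 0.0]
```

The result has a single active operation whose weight lies in $[0, 1]$, which is what the discrete constraint on each edge asks for.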
Compared with DARTS, NASP also alternately updates the architecture $A$ (step 4) and the network parameters $w$ (step 6). However, note that $A$ is discretized at steps 3 and 5. Specifically, in step 3, the discretized version of the architecture is more stable than the continuous one in DARTS, as it is less likely for subsequent updates of $w$ to change $\bar{A}$. Thus, we can take $w_t$ (step 4) as a constant w.r.t. $\bar{A}$, which lets us remove the second-order approximation in (3) and significantly speeds up architecture updates. In step 5, network weights only need to be propagated through the selected operation. This reduces the models' training time and decouples operations when training networks. Finally, we do not need an extra step to discretize the architecture from a continuous one as in DARTS, since a discrete architecture is already maintained during the search. This reduces the gap between search and fine-tuning, which leads to better architectures being identified.
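The steps discussed above can be sketched on a toy problem. This is an illustrative reading of steps 3 to 6, not the authors' implementation: it assumes a single edge, scalar operations of the form `w_k * x`, a squared loss, and the same sample for training and validation; all names and numbers are hypothetical.

```python
def prox_c1(a):
    """Keep only the largest entry (assumes non-negative entries)."""
    k = max(range(len(a)), key=lambda i: a[i])
    return [v if i == k else 0.0 for i, v in enumerate(a)]

def prox_c2(a):
    """Clip every entry to [0, 1]."""
    return [min(max(v, 0.0), 1.0) for v in a]

def predict(a_bar, w, x):
    """Only operations with a non-zero architecture weight are evaluated."""
    return sum(ak * wk * x for ak, wk in zip(a_bar, w) if ak != 0.0)

def nasp_toy(A, w, x=1.0, y=2.0, eps=0.1, steps=100):
    for _ in range(steps):
        a_bar = prox_c1(A)                                       # step 3
        err = predict(a_bar, w, x) - y
        # First-order gradient w.r.t. the architecture, evaluated at a_bar.
        grad_A = [2.0 * err * wk * x for wk in w]
        A = prox_c2([ak - eps * g for ak, g in zip(A, grad_A)])  # step 4
        a_bar = prox_c1(A)                                       # step 5
        err = predict(a_bar, w, x) - y
        # Step 6: only the selected operation receives a weight update.
        w = [wk - eps * 2.0 * err * ak * x for wk, ak in zip(w, a_bar)]
    return prox_c1(A), w

a_bar, w = nasp_toy(A=[0.5, 0.5, 0.5], w=[0.5, 1.5, -1.0])
```

In this toy run the second operation is selected after the first iteration and stays selected, and only its weight is trained; the weights of the unselected operations never change.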
Finally, unlike DARTS and PA with lazyupdates, the convergence of the proposed NASP can be guaranteed in Theorem 2.
Theorem 2.
Assume is differentiable and , then sequence generated by Algorithm 1 has limit points.
Note that previous analyses cannot be applied, as the algorithm steps differ from all previous works (i.e., Hou et al. (2017); Bai et al. (2018)), and it is the first time that PA is introduced into NAS. While two assumptions are made in Theorem 2, smoothness of the objective can be satisfied by using proper loss functions, e.g., the cross-entropy in this paper, and the second assumption holds empirically in our experiments.
3.3 Regularization on Model Complexity
We have proposed an efficient algorithm derived from proximal iterations for NAS. However, we may also want to regularize model parameters to trade off between accuracy and model complexity Cai et al. (2019); Xie et al. (2019). Specifically, in the search space of NAS, different operations have distinct numbers of parameters. For example, the number of parameters of "sep_conv_7x7" is ten times that of "conv_1x1".
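Here, we introduce a novel regularizer on $A$ to fulfill the above goal. Recall that one column in $A$ denotes one possible operation (Figure 1(d)), and whether an operation is selected depends on its value (in a row of $A$). Thus, if we suppress the values of a specific column in $A$, its operation will be less likely to be selected in Algorithm 1, due to the proximal step on $\mathcal{C}_1$. This motivates us to introduce a regularizer as

$$\mathcal{R}(A) = \sum_{k=1}^{|\mathcal{O}|} \frac{p_k}{\bar{p}} \, \| \dot{a}_k \|_2^2, \qquad (11)$$

where $\dot{a}_k$ is the $k$-th column in $A$, the number of parameters of the $i$-th operation is $p_i$, and $\bar{p} = \sum_{i=1}^{|\mathcal{O}|} p_i$. The objective for NAS now becomes

$$\min_{A} \; \mathcal{L}_{\text{val}}(w^*, A) + \eta \, \mathcal{R}(A), \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{\text{train}}(w, A), \quad a^{(i,j)} \in \mathcal{C}, \qquad (12)$$

where $\eta \ge 0$ balances complexity and accuracy, and a larger $\eta$ leads to smaller architectures. As $\mathcal{R}$ is smooth, it is easy to see that (12) can still be optimized with the proposed Algorithm 1.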
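A regularizer of this kind can be sketched in a few lines. The form used below is an assumption: each column of the architecture matrix contributes its squared norm, weighted by that operation's share of the total parameter count. The matrix `A` and the counts in `params` are made-up toy values.

```python
def complexity_regularizer(A, params):
    """Assumed form: sum over operations of (p_k / p_total) * ||column k of A||^2."""
    p_total = sum(params)
    reg = 0.0
    for k, p_k in enumerate(params):
        col_sq_norm = sum(row[k] ** 2 for row in A)  # squared norm of column k
        reg += (p_k / p_total) * col_sq_norm
    return reg

# Toy architecture: 3 edges (rows) and 2 candidate operations (columns).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 1.0]]
params = [1, 3]  # hypothetical parameter counts per operation
R = complexity_regularizer(A, params)  # 0.25 * 1 + 0.75 * 2 = 1.75
```

Because the heavier operation carries the larger weight, moving an edge from the second column to the first lowers the penalty, which is the intended pressure towards smaller models.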
4 Experiments
In this section, we first perform experiments on searching CNNs in Section 4.1 and RNNs in Section 4.2. Then, we show how NASP is balanced with regularization on model size in Section 4.3. Finally, detailed comparisons with DARTS and PA are given in Section 4.4. Four datasets, i.e., CIFAR-10, Tiny ImageNet, PTB, and WT2, are used in our experiments (details are in Appendix B.1).
4.1 Architecture Search for CNN
Searching Cells on CIFAR-10. As in Zoph and Le (2017); Zoph et al. (2017); Liu et al. (2019); Xie et al. (2019); Luo et al. (2018), we search architectures on the CIFAR-10 dataset (Krizhevsky (2009)). Following Liu et al. (2019); Xie et al. (2019), the convolutional cell consists of nodes, and the network is obtained by stacking cells for times; in the search process, we train a small network of 8 stacked cells for 50 epochs. More training details can be found in Appendix B.2.
Two different search spaces are considered here. The first one is the same as in DARTS and contains 7 operations. The second one is larger and contains 12 operations; details can be found in Appendix B.3. Besides, our search spaces for the normal cell and the reduction cell are different. For the normal cell, the search space only consists of identity and convolutional operations; for the reduction cell, the search space only consists of identity and pooling operations.
Results compared with state-of-the-art NAS methods can be found in Table 2, and the searched cells are in Figure 77 (Appendix B.4). Note that ProxylessNAS Cai et al. (2019), MnasNet Tan et al. (2018), and Single Path One-Shot Guo et al. (2019) are not compared, as their codes are not available and they focus on NAS for mobile devices; GeNet Xie and Yuille (2017) is not compared, as its performance is much worse than ResNet. Note that we remove the extra data augmentation Cubuk et al. (2018) for ASAP and only adopt cutout, for a fair comparison. We can see that in the same space (with 7 operations), NASP has comparable performance with DARTS (2nd-order) and is much better than DARTS (1st-order). In the larger space (with 12 operations), NASP is still much faster than DARTS, with a much lower test error than all other state-of-the-art methods. Note that NASP on the larger space also yields larger models; as will be detailed in Section 4.4, this is because NASP can find operations giving lower test error, while others cannot.
Architecture | Test Error (%) | Params (M) | Ops | Search Cost (GPU days)
DenseNet-BC Huang et al. (2017) | 3.46 | 25.6 |  | 
NASNet-A + ct Zoph et al. (2017) | 2.65 | 3.3 | 13 | 1800
AmoebaNet-A + ct Real et al. (2018) | 3.34 ± 0.06 | 3.2 | 19 | 3150
AmoebaNet-B + ct Real et al. (2018) | 2.55 ± 0.05 | 2.8 | 19 | 3150
PNAS Liu et al. (2018a) | 3.41 ± 0.09 | 3.2 | 8 | 225
ENAS Pham et al. (2018) | 2.89 | 4.6 | 6 | 0.5
One-shot Small (F=128) Bender et al. (2018) | 3.9 ± 0.2 | 19.3 | 7 | 
Random search + ct Liu et al. (2019) | 3.29 ± 0.15 | 3.2 | 7 | 4
DARTS (1st-order) + ct Liu et al. (2019) | 3.00 ± 0.14 | 3.3 | 7 | 1.5
DARTS (2nd-order) + ct Liu et al. (2019) | 2.76 ± 0.09 | 3.3 | 7 | 4
SNAS + ct Xie et al. (2019) | 2.98 | 2.9 | 7 | 1.5
ASAP + ct Noy et al. (2019) | 3.06 | 2.6 | 7 | 0.2
NASP + ct | 2.8 | 3.3 | 7 | 0.2
NASP (more ops) + ct | 2.5 | 7.4 | 12 | 0.3
Table 2: Classification errors of NASP and state-of-the-art image classifiers on CIFAR-10. "ct" denotes cutout; "Ops" denotes the number of operations in the search space.
Transferring to Tiny ImageNet. Transferability of the searched cells to other datasets is important Zoph et al. (2017). To explore the transferability of our searched cells, we stack the searched cells 14 times on Tiny ImageNet and train the network for 250 epochs. Results are in Table 3. We can see that NASP exhibits good transferability, and its performance is better than that of all other methods except NASNet-A, while NASP is much faster than NASNet-A.
4.2 Architecture Search for RNN
Searching Cells on PTB. Following the setting of DARTS Liu et al. (2019), the recurrent cell consists of nodes; the first intermediate node is obtained by linearly transforming the two input nodes, adding up the results and then passing them through a tanh activation function; the result of the first intermediate node is then transformed by an activation function. The activation functions used are tanh, relu, sigmoid and identity. In the search process, we train a small network with sequence length 35 for 50 epochs. To evaluate the performance of the searched cells on PTB, a single-layer recurrent network with the discovered cell is trained for at most 8000 epochs until convergence with batch size 64. Results can be seen in Table 4, and the searched cells are in Figure 7 (Appendix B.4). Again, we can see that DARTS (2nd-order) is much slower than DARTS (1st-order), and that NASP is not only much faster than DARTS but also achieves test performance comparable with other state-of-the-art methods.
Architecture | Test Accuracy (%), top-1 | Test Accuracy (%), top-5 | Params (M) | Search Cost (GPU days)
ResNet18 He et al. (2016) | 52.67 | 76.77 | 11.7 | 
NASNet-A Zoph et al. (2017) | 58.99 | 77.85 | 4.8 | 1800
AmoebaNet-A Real et al. (2018) | 57.16 | 77.62 | 4.2 | 3150
ENAS Pham et al. (2018) | 57.81 | 77.28 | 4.6 | 0.5
DARTS Liu et al. (2019) | 57.42 | 76.83 | 3.9 | 4
SNAS Xie et al. (2019) | 57.81 | 76.93 | 3.3 | 1.5
ASAP Noy et al. (2019) | 54.21 | 74.67 | 3.3 | 0.2
NASP | 58.12 | 77.62 | 4.0 | 0.2
NASP (more ops) | 58.32 | 77.54 | 8.9 | 0.3
Architecture | Perplexity, valid | Perplexity, test | Params (M) | Search Cost (GPU days)
Variational RHN Zilly et al. (2017) | 67.9 | 65.4 | 23 | 
LSTM Merity et al. (2017) | 60.7 | 58.8 | 24 | 
LSTM + skip connections Melis et al. (2017) | 60.9 | 58.3 | 24 | 
LSTM + 15 softmax experts Yang et al. (2017) | 58.1 | 56.0 | 22 | 
NAS Zoph and Le (2017) |  | 64.0 | 25 | 10,000
ENAS Pham et al. (2018) | 68.3 | 63.1 | 24 | 0.5
Random search Liu et al. (2019) | 61.8 | 59.4 | 23 | 2
DARTS (1st-order) Liu et al. (2019) | 60.2 | 57.6 | 23 | 0.5
DARTS (2nd-order) Liu et al. (2019) | 59.7 | 56.4 | 23 | 1
NASP | 59.9 | 57.3 | 23 | 0.1
Transferring to WikiText-2. Following Liu et al. (2019), we test the transferability of the RNN cell on the WikiText-2 (WT2) Pham et al. (2018) dataset. We train a single-layer recurrent network with the cells searched on PTB for at most 8000 epochs. Results can be found in Table 4.2. Unlike the previous case with Tiny ImageNet, the performance obtained by NAS methods is not better than that of human-designed ones. This is because WT2 is harder to transfer to, which is also observed in Liu et al. (2019). However, NASP is much more efficient than DARTS, with comparable performance.
4.3 Regularization on Model Complexity
In the above experiments, we have set the regularization weight $\eta = 0$ in (12). Here, we vary $\eta$, and the results are shown in Figure 2. Tiny ImageNet is employed here. We can see that the model size gets smaller with larger $\eta$. However, the testing accuracy first increases a little and then decreases, which may be because the introduced regularization can somewhat prevent overfitting.
4.4 Comparison with DARTS
In Section 4.1, we have shown an overall comparison between DARTS and NASP. Here, we show detailed comparisons on updating the network's parameters (i.e., $w$) and architectures (i.e., $A$). Timing results and searched performance are in Table 6. First, NASP removes much of the computational cost, as there is no 2nd-order approximation for $A$ and $w$ is propagated only through the selected operations. This clearly justifies our motivation in Section 3.1. Second, the discretized $\bar{A}$ helps to decouple operations when updating $w$, which helps NASP find better operations in the larger search space.
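Computational time is in seconds.
Ops | Method | Update $A$ (validation), 1st-order | Update $A$ (validation), 2nd-order | Update $w$ (training), forward | Update $w$ (training), backward | Total | Error (%) | Params (M)
7 | DARTS | 270 | 1315 | 103 | 162 | 1693 | 2.76 | 3.3
7 | NASP | 176 |  | 25 | 31 | 343 | 2.8 | 3.3
12 | DARTS | 489 | 2381 | 187 | 293 | 3060 | 3.0 | 8.4
12 | NASP | 303 |  | 32 | 15 | 483 | 2.5 | 7.4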
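As a small worked example, the speedup implied by the "total" column of the timing table can be computed directly. The numbers below are copied from that column; the dictionary layout is just an illustrative choice.

```python
# Total computational time in seconds, keyed by search-space size (Ops).
totals = {
    7: {"DARTS": 1693, "NASP": 343},
    12: {"DARTS": 3060, "NASP": 483},
}

# Ratio of DARTS time to NASP time for each search space.
speedup = {ops: t["DARTS"] / t["NASP"] for ops, t in totals.items()}

for ops, ratio in speedup.items():
    print(f"{ops} operations: NASP is about {ratio:.1f}x faster in total time")
```

On these totals the ratio grows with the size of the search space, which is consistent with NASP propagating only the selected operation.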
We conduct experiments to compare the search time and validation accuracy in Figure 3(a) and Figure 3(b). We can see that, for the same search time, NASP obtains higher accuracy, and for the same accuracy, NASP costs less time. This further verifies the efficiency of NASP over DARTS.
Figure 3: panels (a) and (b) compare with DARTS; panels (c) and (d) compare with PA variants.
Finally, we illustrate why the second-order approximation is needed for DARTS but not for NASP. Recall from Section 2.1 that, as $A$ changes continuously during the iterations, the second-order approximation is used to better capture the impact of $w$ on $A$. Then, in Section 3.2, we argue that, since $\bar{A}$ is discrete, the impact of $w$ will not lead to frequent changes in $\bar{A}$. This removes the need to capture future dynamics using the second-order approximation. We plot $A$ for DARTS and $\bar{A}$ for NASP in Figure 4. In Figure 4, the x-axis represents the training epochs, while the y-axis represents the operations (five operations are shown in our figure). There are 14 connections between nodes, so there are 14 subfigures in both Figure 4(a) and Figure 4(b). Indeed, $\bar{A}$ in NASP is more stable than $A$ in DARTS, which verifies the correctness of our motivation.
4.5 Comparison with Standard PA
Finally, we demonstrate the need for our designs in Section 3.2 for NASP. CIFAR-10 with the small search space is used here. Three algorithms are compared: 1) PA (standard), which is given by (9); 2) PA (lazy-update), which is given by (10); and 3) the proposed NASP. Results are in Figure 3(c) and Figure 3(d). First, good performance cannot be obtained from a direct proximal step, which is due to the discrete constraint. The same observation was previously made for binary networks Courbariaux et al. (2015). Second, PA (lazy-update) is much better than PA (standard) but still worse than NASP. This verifies the need to keep $A$ within $[0, 1]$, as it can encourage better operations.
5 Conclusion
We introduce NASP, a fast and differentiable neural architecture search method based on proximal iterations. Compared with DARTS, our method is more efficient and obtains better performance. The key contribution of NASP is the use of proximal iterations in the search process. With this approach only one operation is updated at a time, which saves much time and makes it possible to use a larger search space. Besides, NASP eliminates the correlation among operations. Experiments demonstrate that NASP runs faster and obtains better performance than the baselines. As future work, we plan to apply our algorithm to ImageNet to further verify its effectiveness.
References
[1] Alistarh et al. (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In NeurIPS, pp. 1709–1720.
[2] Bai et al. (2018) ProxQuant: quantized neural networks via proximal operators. In ICLR.
[3] Baker et al. (2017) Designing neural network architectures using reinforcement learning. In ICLR.
[4] Bender et al. (2018) Understanding and simplifying one-shot architecture search. In ICML.
[5] Bertsekas (1997) Nonlinear programming. Taylor & Francis.
[6] Cai et al. (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR.
[7] Courbariaux et al. (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In NeurIPS, pp. 3123–3131.
[8] Cubuk et al. (2018) AutoAugment: learning augmentation policies from data. CoRR abs/1805.09501.
[9] Devries and Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv: Computer Vision and Pattern Recognition.
[10] Duchi et al. (2008) Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pp. 272–279.
[11] Guo et al. (2019) Single path one-shot neural architecture search with uniform sampling. Technical report, arXiv.
[12] He et al. (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[13] He et al. (2018) AMC: AutoML for model compression and acceleration on mobile devices. In ECCV.
[14] Hou et al. (2017) Loss-aware binarization of deep networks. In ICLR.
[15] Huang et al. (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708.
[16] F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2018) Automated machine learning: methods, systems, challenges. Springer.
[17] Inan et al. (2016) Tying word vectors and word classifiers: a loss framework for language modeling. arXiv preprint arXiv:1611.01462.
[18] Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
[19] Le and Yang (2015) Tiny ImageNet visual recognition challenge. CS 231N.
[20] Lee and Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401, pp. 788–791.
[21] Liu et al. (2018a) Progressive neural architecture search. In ECCV.
[22] Liu et al. (2018b) Hierarchical representations for efficient architecture search. In ICLR.
[23] Liu et al. (2019) DARTS: differentiable architecture search. In ICLR.
[24] Luo et al. (2018) Neural architecture optimization. In NeurIPS.
[25] Melis et al. (2017) On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.
[26] Merity et al. (2017) Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.
[27] Noy et al. (2019) ASAP: architecture search, anneal and prune. Technical report, arXiv preprint arXiv:1904.04123.
[28] Parikh and Boyd (2013) Proximal algorithms. Foundations and Trends in Optimization 1 (3), pp. 123–231.
[29] Pham et al. (2018) Efficient neural architecture search via parameter sharing. arXiv preprint.
[30] Real et al. (2018) Regularized evolution for image classifier architecture search. arXiv.
[31] Shin et al. (2018) Differentiable neural network architecture search. ICLR Workshop.
[32] Sutton and Barto (1998) Reinforcement learning: an introduction. MIT Press.
[33] Tan et al. (2018) MnasNet: platform-aware neural architecture search for mobile. Technical report, arXiv.
[34] Wen et al. (2016) Learning structured sparsity in deep neural networks. In NeurIPS, pp. 2074–2082.
[35] Xiao (2010) Dual averaging methods for regularized stochastic learning and online optimization. JMLR 11 (Oct), pp. 2543–2596.
[36] Xie and Yuille (2017) Genetic CNN. In ICCV.
[37] Xie et al. (2019) SNAS: stochastic neural architecture search. In ICLR.
[38] Yang et al. (2017) Breaking the softmax bottleneck: a high-rank RNN language model. arXiv preprint arXiv:1711.03953.
[39] Yao et al. (2017) Efficient inexact proximal gradient algorithm for nonconvex problems. In IJCAI, pp. 3308–3314.
[40] Yao et al. (2018) Taking human out of learning applications: a survey on automated machine learning. Technical report, arXiv preprint.
[41] Yosinski et al. (2014) How transferable are features in deep neural networks?. In NeurIPS.
[42] Zhang et al. (2007) Binary matrix factorization with applications. In ICDM, pp. 391–400.
[43] Zhong et al. (2018) Practical block-wise neural network architecture generation. In CVPR.
[44] Zilly et al. (2017) Recurrent highway networks. In ICML, pp. 4189–4198.
[45] Zoph and Le (2017) Neural architecture search with reinforcement learning. In ICLR.
[46] Zoph et al. (2017) Learning transferable architectures for scalable image recognition. In CVPR.
Appendix A Complexity Analysis
In this section, we compare the time complexity of NASP with that of other methods. We consider a search space with a fixed number of candidate operations and assume that each operation has the same time cost. For DARTS, the time cost per step scales with the number of operations, because all operations need to be forward- and backward-propagated during gradient descent. For NASP, the time cost is that of a single operation, since NASP only propagates the selected operation. The time complexity of SNAS is the same as that of DARTS. Because ASAP prunes operations while training, its time complexity lies between those of NASP and DARTS. ENAS's time complexity is low because of parameter sharing. For NASNet-A and other search methods without gradient information, the time complexity is extremely high because of the large number of trials.
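The counting argument can be written down as a tiny helper. It is only a sketch under the stated assumption of equal cost per operation; the function name is hypothetical, and the 14 edges and the search-space sizes of 7 and 12 are the figures quoted in the experiments.

```python
def ops_per_step(num_edges, num_ops, method):
    """Number of operation evaluations per propagation step for one cell."""
    if method == "DARTS":
        return num_edges * num_ops  # every candidate on every edge
    if method == "NASP":
        return num_edges            # only the selected candidate per edge
    raise ValueError(f"unknown method: {method}")

for num_ops in (7, 12):
    darts = ops_per_step(14, num_ops, "DARTS")
    nasp = ops_per_step(14, num_ops, "NASP")
    print(f"{num_ops} ops: DARTS evaluates {darts}, NASP evaluates {nasp}")
```

The count for NASP does not depend on the size of the search space, which is why enlarging the space from 7 to 12 operations is comparatively cheap for it.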
Appendix B Experiment Details
B.1 Datasets
CIFAR-10: CIFAR-10 Krizhevsky (2009) is a basic dataset for image classification, which consists of 50,000 training images and 10,000 testing images. CIFAR-10 can be downloaded from "http://www.cs.toronto.edu/~kriz/cifar.html". Half of the CIFAR-10 training images are used as the validation set. Data augmentation such as cutout Devries and Taylor (2017) and HorizontalFlip is used in our experiments. After training, we test the model on the test set and report the accuracy.
Tiny ImageNet: Tiny ImageNet Le and Yang (2015) contains a training set of 100,000 images and a testing set of 10,000 images. These images are sourced from 200 different classes of objects from ImageNet. Tiny ImageNet can be downloaded from "http://tiny-imagenet.herokuapp.com/". Note that, due to the small number of training images per class and the low resolution of the images, Tiny ImageNet is harder to train on than the original ImageNet yao2015tiny. Data augmentation such as RandomRotation and RandomHorizontalFlip is used. After training, we test the model on the test set and report the accuracy.
PTB: PTB is an English corpus used for probabilistic language modeling, which consists of approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. PTB can be downloaded from "http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz". We choose the model with the best performance on the validation set and test it on the test set.
WT2: Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 (WT2) is over 2 times larger. WT2 features a far larger vocabulary and retains the original case, punctuation and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies. WT2 can be downloaded from "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip". We choose the model with the best performance on the validation set and test it on the test set.
B.2 Training details
For training on CIFAR-10, the convolutional cell consists of nodes, and the network is obtained by stacking cells for times; in the search process, we train a small network of 8 stacked cells for 50 epochs. SGD is used to optimize the network's weights, and Adam is used for the parameters of the network architecture. To evaluate the performance of the searched cells, the searched cells are stacked 20 times, and the network is fine-tuned for 600 epochs with batch size 96. Additional enhancements such as path dropout (with probability 0.2) and auxiliary towers (with weight 0.4) are also used. We run our experiments three times and report the mean.
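The settings listed above can be collected in a small configuration sketch. The class and field names are hypothetical; only the values are taken from the text, and unspecified settings (such as learning rates) are deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchConfig:
    """Search phase on CIFAR-10, as described in the training details."""
    num_cells: int = 8             # small network of 8 stacked cells
    epochs: int = 50
    weight_optimizer: str = "SGD"  # for the network weights
    arch_optimizer: str = "Adam"   # for the architecture parameters

@dataclass(frozen=True)
class EvalConfig:
    """Evaluation phase for the searched cells."""
    num_cells: int = 20
    epochs: int = 600
    batch_size: int = 96
    path_dropout_prob: float = 0.2
    auxiliary_weight: float = 0.4
    num_runs: int = 3              # the mean over runs is reported

search_cfg = SearchConfig()
eval_cfg = EvalConfig()
```

Keeping the two phases in separate frozen dataclasses makes it harder to accidentally reuse a search-phase value, such as the number of cells, during evaluation.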
B.3 Search Space
NASP's search space:
identity, 1x3 then 3x1 convolution, 3x3 dilated convolution, 3x3 average pooling, 3x3 max pooling, 5x5 max pooling, 7x7 max pooling, 1x1 convolution, 3x3 convolution, 3x3 depthwise-separable conv, 5x5 depthwise-separable conv, 7x7 depthwise-separable conv.