Meta-Learning of Neural Architectures for Few-Shot Learning

11/25/2019 ∙ by Thomas Elsken, et al. ∙ University of Freiburg

Recent progress in neural architecture search (NAS) has allowed scaling the automated design of neural architectures to real-world domains such as object detection and semantic segmentation. However, one prerequisite for applying NAS is the availability of large amounts of labeled data and compute resources. This renders its application challenging in few-shot learning scenarios, where many related tasks need to be learned, each with limited amounts of data and compute time. Thus, few-shot learning is typically done with a fixed neural architecture. To improve upon this, we propose MetaNAS, the first method which fully integrates NAS with gradient-based meta-learning. MetaNAS optimizes a meta-architecture along with the meta-weights during meta-training. During meta-testing, architectures can be adapted to a novel task with a few steps of the task optimizer; that is, task adaptation becomes computationally cheap and requires only little data per task. Moreover, MetaNAS is agnostic in that it can be used with arbitrary model-agnostic meta-learning algorithms and arbitrary gradient-based NAS methods. Empirical results on standard few-shot classification benchmarks show that MetaNAS with a combination of DARTS and REPTILE yields state-of-the-art results.




1 Introduction

Figure 1: Illustration of our proposed method MetaNAS and related work. Gray highlights task learning, blue meta-learning, and orange NAS components. Top: gradient-based meta-learning with fixed architecture such as MAML [17] or REPTILE [35]. Middle: applying NAS to meta-learning such as AutoMeta [25]. Bottom: Proposed joint meta-learning of architecture and weights with MetaNAS. Since architectures are adapted during task learning, the proposed method can learn task-specific architectures.

Neural architecture search (NAS) [15] has seen remarkable progress on various computer vision tasks, such as image classification [55, 38, 8], object detection [21], semantic segmentation [9, 29, 34], and disparity estimation [40]. One key prerequisite for this success is the availability of large and diverse (labeled) data sets for the respective task. Furthermore, NAS requires considerable compute resources for optimizing the neural architecture for the target task.

This makes it difficult to apply NAS in use-cases where one does not focus on a single task but is interested in a large set (distribution) of tasks. To be effective in this setting, learning must not require large amounts of data and compute for every task but should, like humans, be able to rapidly adapt to novel tasks by building upon experience from related tasks [27]. This concept of learning from experience and related tasks is known as meta-learning or learning to learn [42, 46, 22]. Here, we consider the problem of few-shot learning, i.e., learning new tasks from just a few examples. Prior work has proposed meta-learning methods for this problem that are model-agnostic [17, 35] and allow meta-learning weights of fixed neural architectures (see Figure 1, top).

In this work, we fully integrate meta-learning with NAS by proposing MetaNAS. MetaNAS allows adapting architectures to novel tasks based on few datapoints with just a few steps of a gradient-based task optimizer. This allows MetaNAS to generate task-specific architectures that are adapted to every task separately (but from a joint meta-learned meta-architecture). This is in contrast to prior work that applied NAS to multi-task or few-shot learning, where a single neural architecture is optimized to work well on average across all tasks [25, 36] (see Figure 1, middle). Moreover, our method directly provides trained weights for these task-specific architectures, without requiring them to be meta re-trained as in concurrent work [1]. A conceptual illustration of our method is shown in Figure 1, bottom, and in Figure 3.

The key contributions of MetaNAS are as follows:

  1. We show that model-agnostic, gradient-based meta-learning methods (such as [17]) can very naturally be combined with recently proposed gradient-based NAS methods, such as DARTS [31]. This allows for joint meta-learning of not only the weights (for a given, fixed architecture) but also the architecture itself (Section 3, see Figure 1 for an illustration).

  2. We propose MetaNAS, a meta-learning algorithm that can quickly adapt the meta-architecture to a task-dependent architecture. This optimization of the architecture for the task can be conducted with few labeled datapoints and only a few steps of the task optimizer (see Figure 3 for an illustration).

  3. We extend DARTS such that task-dependent architectures need not be (meta) re-trained, which would be infeasible in the few-shot learning setting with task-dependent architectures for hundreds of tasks (requiring hundreds of re-trainings).

    We achieve this by introducing a novel soft-pruning mechanism based on temperature annealing into DARTS (see Figure 2). This mechanism lets architecture parameters converge to the architectures obtained by the hard-pruning at the end of DARTS, while giving the weights time to adapt to this pruning. Because of this, pruning no longer results in significant drops in accuracy, which might be of independent interest. We give more details in Section 4.

MetaNAS is agnostic in the sense that it is compatible with arbitrary gradient-based model-agnostic meta-learning algorithms and arbitrary gradient-based NAS methods employing a continuous relaxation of the architecture search space. Already in combination with the very simple meta-learning algorithm REPTILE [35] and the NAS algorithm DARTS [31], MetaNAS yields state-of-the-art results on the standard few-shot classification benchmarks Omniglot and MiniImagenet.

This paper is structured as follows: in Section 2, we review related work on few-shot learning and neural architecture search. In Section 3, we show that model-agnostic, gradient-based meta-learning can naturally be combined with gradient-based NAS. The soft-pruning strategy to obtain task-dependent architectures without the need for re-training is introduced in Section 4. We conduct experiments on standard few-shot learning data sets in Section 5 and conclude in Section 6.

2 Related Work

Few-Shot Learning via Meta-Learning

Few-shot learning refers to the problem of learning to solve a task (e.g., a classification problem) from just a few training examples. This problem is challenging in combination with deep learning as neural networks tend to be highly over-parameterized and therefore prone to overfitting when only very little data is available. Prior work [37, 17, 19, 20] often approaches few-shot learning via meta-learning or learning to learn [42, 46, 22], where one aims at learning from a variety of learning tasks in order to learn new tasks much faster than otherwise possible [47].

There are various approaches to few-shot learning, e.g., learning to compare new samples to previously seen ones [44, 48] or meta-learning a subset of weights that is shared across tasks but fixed during task learning [54, 20].

In this work, we focus on a particular class of approaches denoted as model-agnostic meta-learning  [17, 18, 35, 2, 19]. These methods meta-learn an initial set of weights for neural networks that can be quickly adapted to a new task with just a few steps of gradient descent. For this, the meta-learning objective is designed to explicitly reward quick adaptability by incorporating the task training process into the meta-objective. This meta-objective is then usually optimized via gradient-based methods. Our method extends these approaches to not only meta-learn an initial set of weights for a given, fixed architecture but to also meta-learn the architecture itself. As our method can be combined with any model-agnostic meta-learning method, future improvements in these methods can be directly utilized in our framework.

Neural Architecture Search

Neural Architecture Search (NAS), the process of automatically designing neural network architectures [15], has recently become a popular approach in deep learning as it can replace the cumbersome manual design of architectures while at the same time achieving state-of-the-art performance on a variety of tasks [55, 38, 10, 40]. We briefly review the main approaches and refer to the recent survey by Elsken et al. [15] for a more thorough literature overview. Researchers often frame NAS as a reinforcement learning problem [3, 55, 53, 56] or employ evolutionary algorithms [45, 33, 39, 38]. Unfortunately, most of these methods are very expensive, as they require training hundreds or even thousands of architectures from scratch. Therefore, most recent work focuses on developing more efficient methods, e.g., via network morphisms [13, 7, 14, 43], weight sharing [41, 6, 5], or multi-fidelity optimization [4, 16, 28, 52]; however, they are often still restricted to relatively small problems.

To overcome this problem, Liu et al. [31] proposed a continuous relaxation of the architecture search space that allows optimizing the architecture via gradient-based methods. This is achieved by using a weighted sum of possible candidate operations for each layer, where the real-valued weights then effectively parametrize the network’s architecture as follows:

$$x^{(j)} = \sum_{i<j} \sum_{o \in \mathcal{O}} \hat{\alpha}_o^{i,j} \, o\big(x^{(i)}, w_o^{i,j}\big), \qquad \hat{\alpha}_o^{i,j} = \frac{\exp(\alpha_o^{i,j})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{i,j})} \qquad (1)$$

where $\hat{\alpha}_o^{i,j}$ are normalized mixture weights that sum to 1, $x^{(i)}$ and $x^{(j)}$ represent feature maps in the network, $\mathcal{O}$ denotes a set of candidate operations (e.g., convolutions with different kernel sizes, average pooling, …) for transforming previous feature maps to new ones, $w_o^{i,j}$ denotes the regular weights of the operations, and $\alpha = \{\alpha_o^{i,j}\}$ serves as a real-valued, unconstrained parameterization of the architecture. The mixture of candidate operations is denoted as mixed operation and the model containing all the mixed operations is often referred to as the one-shot model.

DARTS then optimizes both the weights $w$ of the one-shot model and the architectural parameters $\alpha$ by alternating gradient descent on the training and validation loss, respectively.
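The continuous relaxation can be illustrated with a minimal sketch. The scalar "operations" below are hypothetical stand-ins for real candidate operations such as convolutions or pooling, and the function names are chosen only for this illustration; the point is merely that every edge computes a softmax-weighted sum over all candidates and every node sums over its incoming edges.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scalar stand-ins for candidate operations o(x, w).
CANDIDATE_OPS = {
    "identity": lambda x, w: x,
    "scale": lambda x, w: w * x,
    "zero": lambda x, w: 0.0,
}

def mixed_op(x, alpha, weights):
    """Softmax-weighted sum of all candidate operations on one edge."""
    names = list(CANDIDATE_OPS)
    mix = softmax([alpha[n] for n in names])
    return sum(a * CANDIDATE_OPS[n](x, weights[n]) for a, n in zip(mix, names))

def node_output(inputs, alphas, weights):
    """Sum of the mixed operations over all incoming edges of one node."""
    return sum(mixed_op(x, a, w) for x, a, w in zip(inputs, alphas, weights))
```

With all architecture parameters equal, every candidate contributes with the same mixture weight; increasing one parameter shifts the mixture toward the corresponding operation.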

After the search phase, a discrete architecture is obtained by choosing a predefined number (usually two) of most important incoming operations (those with the highest normalized mixture weight) for each intermediate node while all others are pruned. This hard-pruning deteriorates performance: e.g., in the experiment conducted by Xie et al. [50] on CIFAR-10, the validation accuracy drops substantially when going from the one-shot model to the pruned model. Thus, the pruned model requires retraining. We address this shortcoming in Section 4.
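The hard-pruning step can be sketched as follows. This is an illustrative simplification, assuming one retained operation per edge and a dictionary layout (`mix_weights`) invented for this example; it is not the reference DARTS implementation.

```python
def hard_prune(mix_weights, num_keep=2):
    """Keep the `num_keep` strongest incoming edges of one intermediate node.

    mix_weights: {input_node: {op_name: normalized mixture weight}}
    Returns a list of (input_node, op_name), strongest first.
    """
    best_per_edge = []
    for src, ops in mix_weights.items():
        # Strongest operation on the edge coming from `src`.
        op, weight = max(ops.items(), key=lambda kv: kv[1])
        best_per_edge.append((weight, src, op))
    best_per_edge.sort(key=lambda t: t[0], reverse=True)
    return [(src, op) for _, src, op in best_per_edge[:num_keep]]
```

Everything that is not returned is discarded at once, which is why the weights of the one-shot model have no opportunity to adapt to the pruned architecture.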

In our work, we choose DARTS for neural architecture search because of conceptual similarities to gradient-based meta-learning, such as MAML [17], which will allow us to combine the two kinds of methods.

Neural Architecture Search for Meta-Learning

There has been some recent work on combining NAS and meta-learning. Wong et al. [49] train an automated machine learning (AutoML) system via reinforcement learning on multiple tasks and then use transfer learning to speed up the search for hyperparameters and architecture via the learned system on new tasks. Their work is more focused on hyperparameters rather than the architecture; the considered architecture search space is limited to choosing one of few pretrained architectures.

Closest to our work are [25, 1]. Kim et al. [25] wrap neural architecture search around meta-learning as illustrated in Figure 1 (middle). They apply progressive neural architecture search [30] to few-shot learning, but this is inefficient, since it requires running the entire meta-training from scratch in every iteration of the NAS algorithm; therefore, their approach incurs large computational costs of more than a hundred GPU days. The approach is also limited to searching for a single architecture suitable for few-shot learning rather than learning task-dependent architectures, which our method supports. In concurrent work [1], anonymous authors proposed to combine gradient-based NAS and meta-learning for finding task-dependent architectures, similar to our work. However, as they employ the hard-pruning strategy from DARTS with significant drops in performance, they require re-running meta-training for every task-dependent architecture (potentially hundreds), rendering the evaluation of novel tasks expensive. In contrast, our method does not require re-training task-dependent architectures and thus a single run of meta-training suffices.

3 Marrying Gradient-based Meta-Learning and Gradient-based NAS

Our goal is to build a meta-learning algorithm that yields a meta-learned architecture $\alpha_{meta}$ with corresponding meta-learned weights $w_{meta}$. Then, given a new task $\mathcal{T}_i$, both $\alpha_{meta}$ and $w_{meta}$ shall quickly adapt to $\mathcal{T}_i$ based on few labeled samples. To solve this problem, we now derive MetaNAS, a method that naturally combines gradient-based meta-learning methods with gradient-based NAS and allows meta-learning $\alpha_{meta}$ along with $w_{meta}$. In Section 4, we will then describe how the meta-architectures encoded by $\alpha_{meta}$ can be quickly specialized to a new task without requiring re-training of the weights.

Figure 2: Illustration of sparsity of the one-shot model after search. Left: vanilla DARTS (no sparsity enforced at all). Middle: enforcing sparsity over mixture of operations. Right: additionally enforcing sparsity over input nodes (here only a single input per node).

3.1 Problem Setup for Few-Shot Classification

In the classic supervised deep learning setting, the goal is to find optimal weights $w$ of a neural network by minimizing a loss function $\mathcal{L}_{\mathcal{T}}(w)$ given a single, large task $\mathcal{T}$ with corresponding training and test data. In contrast, in few-shot learning, we are given a distribution $p^{train}$ over comparably small training tasks and a distribution $p^{test}$ over test tasks. We usually consider $N$-way, $K$-shot tasks, meaning each task is a classification problem with $N$ classes (typically 5 or 20) and $K$ (typically 1 or 5) training examples per class. In combination with meta-learning, the training tasks are used to meta-learn how to improve learning of new tasks from the test task distribution.
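To make the task construction concrete, the following sketch samples one few-shot classification task from a labeled pool. The dataset layout (a dictionary from class label to examples) and the function name are assumptions made for this illustration only.

```python
import random

def sample_task(dataset, n_way, k_shot, n_test=1, rng=None):
    """Sample one n_way-way, k_shot-shot classification task.

    dataset: {class_label: [examples]}
    Returns (train, test), each a list of (example, task_label) pairs,
    where task_label is in 0..n_way-1 and is local to the task.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(dataset), n_way)
    train, test = [], []
    for task_label, cls in enumerate(classes):
        examples = rng.sample(dataset[cls], k_shot + n_test)
        train += [(x, task_label) for x in examples[:k_shot]]
        test += [(x, task_label) for x in examples[k_shot:]]
    return train, test
```

Each call draws new classes and relabels them, so a model cannot memorize a global label assignment and must instead learn how to adapt from the few training examples of the task.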

3.2 Gradient-based Meta-Learning of Neural Architectures

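Similar to MAML’s meta-learning strategy in the weight space [17], our goal is to meta-learn an architecture with corresponding weights which is able to quickly adapt to new tasks. We do so by minimizing the meta-objective

$$\mathcal{L}_{meta}(w, \alpha, p^{train}, \Phi^k) = \sum_{\mathcal{T}_i \sim p^{train}} \mathcal{L}_{\mathcal{T}_i}\big(\Phi^k(w, \alpha, \mathcal{D}_{train}^{\mathcal{T}_i}), \mathcal{D}_{test}^{\mathcal{T}_i}\big) \qquad (2)$$

with respect to a real-valued parametrization $\alpha$ of the neural network architecture and corresponding weights $w$. $\mathcal{T}_i$ denotes a training task sampled from the training task distribution $p^{train}$, $\mathcal{L}_{\mathcal{T}_i}$ the corresponding task loss, and $\Phi^k$ the task learning algorithm or simply task learner, where $k$ refers to the number of iterations of learning/weight updates. Prior work [17, 35, 25] considered a fixed, predefined architecture and chose $\Phi^k$ to be an optimizer like SGD for the weights:

$$\Phi^k(w, \mathcal{D}_{train}^{\mathcal{T}}) = \Phi\big(\Phi^{k-1}(w, \mathcal{D}_{train}^{\mathcal{T}}), \mathcal{D}_{train}^{\mathcal{T}}\big) \qquad (3)$$

with the one-step updates $\Phi(w, \mathcal{D}_{train}^{\mathcal{T}}) = w - \lambda_{task} \nabla_w \mathcal{L}_{\mathcal{T}}(w, \mathcal{D}_{train}^{\mathcal{T}})$ and $\Phi^0(w, \mathcal{D}_{train}^{\mathcal{T}}) = w$. In contrast, we choose $\Phi^k$ to be $k$ steps of gradient-based neural architecture search inspired by DARTS [31] with weight learning rate $\lambda_{task}$ and architecture learning rate $\xi_{task}$:

$$\Phi(w, \alpha, \mathcal{D}_{train}^{\mathcal{T}}) = \begin{pmatrix} w - \lambda_{task} \nabla_w \mathcal{L}_{\mathcal{T}}(w, \alpha, \mathcal{D}_{train}^{\mathcal{T}}) \\ \alpha - \xi_{task} \nabla_\alpha \mathcal{L}_{\mathcal{T}}(w, \alpha, \mathcal{D}_{train}^{\mathcal{T}}) \end{pmatrix} \qquad (4)$$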
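Therefore, $\Phi^k$ does not only optimize task weights $w_{\mathcal{T}}$ but also optimizes task architecture $\alpha_{\mathcal{T}}$. As we use a real-valued parametrization of $\alpha$ and a gradient-based task optimizer, the meta-objective (Equation 2) is differentiable with respect to $w$ and $\alpha$. This means we can use any gradient-based meta-learning algorithm not only for $w$ but also for the architecture $\alpha$. For the case of MAML [17], by using SGD on the meta-objective, this yields

$$\begin{pmatrix} w_{meta} \\ \alpha_{meta} \end{pmatrix} \leftarrow \begin{pmatrix} w_{meta} - \lambda_{meta} \nabla_w \mathcal{L}_{meta}(w_{meta}, \alpha_{meta}, p^{train}, \Phi^k) \\ \alpha_{meta} - \xi_{meta} \nabla_\alpha \mathcal{L}_{meta}(w_{meta}, \alpha_{meta}, p^{train}, \Phi^k) \end{pmatrix} \qquad (5)$$

or in the case of REPTILE [35] simply

$$\begin{pmatrix} w_{meta} \\ \alpha_{meta} \end{pmatrix} \leftarrow \begin{pmatrix} w_{meta} + \lambda_{meta} \sum_{\mathcal{T}_i} (w_{\mathcal{T}_i}^{*} - w_{meta}) \\ \alpha_{meta} + \xi_{meta} \sum_{\mathcal{T}_i} (\alpha_{\mathcal{T}_i}^{*} - \alpha_{meta}) \end{pmatrix} \qquad (6)$$

where $(w_{\mathcal{T}_i}^{*}, \alpha_{\mathcal{T}_i}^{*}) = \Phi^k(w_{meta}, \alpha_{meta}, \mathcal{D}_{train}^{\mathcal{T}_i})$ denotes the task-adapted weights and architecture, and $\lambda_{meta}$ and $\xi_{meta}$ are the meta-learning rates for weights and architecture.

Note that in principle one could also use different meta-learning algorithms for $w$ and $\alpha$. However, we restrict ourselves to the same meta-learning algorithm for both for simplicity. We refer to Algorithm 1 for a generic template of our proposed framework for meta-learning neural architectures and to Algorithm 2 for a concrete implementation using DARTS as task learning and REPTILE as a meta-learning algorithm.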

By incorporating a NAS algorithm directly into the meta-learning algorithm, we can search for architectures with a single run of the meta-learning algorithm, while prior work [25] required full meta-learning of hundreds of proposed architectures during the architecture search process. We highlight that Algorithm 1 is agnostic in that it can be combined with any gradient-based NAS method and any gradient-based meta-learning method.
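The interplay of task learner and meta-learner can be sketched on a toy problem. The example below is a minimal sketch under strong assumptions: weights and architecture parameters are single floats, each task is a hypothetical quadratic loss with analytic gradients, the task learner takes plain gradient steps on both quantities, and the REPTILE-style meta-update averages (rather than sums) over tasks. All names and learning rates are chosen for illustration only.

```python
def task_learner(w, alpha, grad_fn, k, lr_w, lr_alpha):
    """k gradient steps on both the weights and the architecture parameters."""
    for _ in range(k):
        g_w, g_alpha = grad_fn(w, alpha)
        w, alpha = w - lr_w * g_w, alpha - lr_alpha * g_alpha
    return w, alpha

def reptile_step(w_meta, alpha_meta, task_grad_fns, k=5,
                 lr_w=0.1, lr_alpha=0.1, meta_lr_w=0.5, meta_lr_alpha=0.5):
    """One meta-update: move the meta-parameters toward the task-adapted ones."""
    adapted = [task_learner(w_meta, alpha_meta, g, k, lr_w, lr_alpha)
               for g in task_grad_fns]
    n = len(adapted)
    w_meta += meta_lr_w * sum(w - w_meta for w, _ in adapted) / n
    alpha_meta += meta_lr_alpha * sum(a - alpha_meta for _, a in adapted) / n
    return w_meta, alpha_meta

def quadratic_task(w_target, alpha_target):
    """Toy task: loss 0.5*(w - w_target)**2 + 0.5*(alpha - alpha_target)**2."""
    return lambda w, alpha: (w - w_target, alpha - alpha_target)

tasks = [quadratic_task(1.0, -1.0), quadratic_task(3.0, 1.0)]
w_meta, alpha_meta = 0.0, 0.0
for _ in range(200):
    w_meta, alpha_meta = reptile_step(w_meta, alpha_meta, tasks)
```

On this toy problem the meta-parameters settle between the task optima, from where a few task-learner steps reach either task; the architecture parameters are handled exactly like the weights, which is the central observation of this section.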

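Algorithm 1 MetaNAS: Meta-Learning of Neural Architectures
1:  Input: distribution over tasks $p^{train}$, task learner $\Phi^k$ (e.g., DARTS [31]), meta-learner $\Psi$ (e.g., REPTILE [35])
2:  Initialize $w_{meta}$, $\alpha_{meta}$
3:  while not converged do
4:     Sample tasks $\mathcal{T}_1, \dots, \mathcal{T}_n$ from $p^{train}$
5:     for all $\mathcal{T}_i$ do
6:        $(w_{\mathcal{T}_i}^{*}, \alpha_{\mathcal{T}_i}^{*}) \leftarrow \Phi^k(w_{meta}, \alpha_{meta}, \mathcal{D}_{train}^{\mathcal{T}_i})$
7:     end for
8:     $w_{meta} \leftarrow \Psi^{w}(w_{meta}, \{w_{\mathcal{T}_i}^{*}\}_{i=1}^{n})$
9:     $\alpha_{meta} \leftarrow \Psi^{\alpha}(\alpha_{meta}, \{\alpha_{\mathcal{T}_i}^{*}\}_{i=1}^{n})$
10:  end while
11:  return $w_{meta}$, $\alpha_{meta}$

Algorithm 2 Meta-Learning of Neural Architectures with DARTS and REPTILE
1:  Input: distribution over tasks $p^{train}$, task loss function $\mathcal{L}_{\mathcal{T}}$
2:  Initialize $w_{meta}$, $\alpha_{meta}$
3:  while not converged do
4:     Sample tasks $\mathcal{T}_1, \dots, \mathcal{T}_n$ from $p^{train}$
5:     for all $\mathcal{T}_i$ do
6:        $w_{\mathcal{T}_i} \leftarrow w_{meta}$
7:        $\alpha_{\mathcal{T}_i} \leftarrow \alpha_{meta}$
8:        for $j = 1, \dots, k$ do
9:           $w_{\mathcal{T}_i} \leftarrow w_{\mathcal{T}_i} - \lambda_{task} \nabla_w \mathcal{L}_{\mathcal{T}_i}(w_{\mathcal{T}_i}, \alpha_{\mathcal{T}_i}, \mathcal{D}_{train}^{\mathcal{T}_i})$
10:           $\alpha_{\mathcal{T}_i} \leftarrow \alpha_{\mathcal{T}_i} - \xi_{task} \nabla_\alpha \mathcal{L}_{\mathcal{T}_i}(w_{\mathcal{T}_i}, \alpha_{\mathcal{T}_i}, \mathcal{D}_{train}^{\mathcal{T}_i})$
11:        end for
12:     end for
13:     $w_{meta} \leftarrow w_{meta} + \lambda_{meta} \sum_{i} (w_{\mathcal{T}_i} - w_{meta})$
14:     $\alpha_{meta} \leftarrow \alpha_{meta} + \xi_{meta} \sum_{i} (\alpha_{\mathcal{T}_i} - \alpha_{meta})$
15:  end while
16:  return $w_{meta}$, $\alpha_{meta}$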
Figure 3: Conceptual illustration of the architectures at different stages of MetaNAS. Left: after initializing the one-shot model. Middle: meta-learned architecture. Right: architecture adapted to respective tasks based on meta-architecture. Colours of the edges (red, blue, green) denote different operations (Conv3x3, Conv5x5 and MaxPooling, respectively). Line-width of edges visualizes size of architectural weights (i.e., a large line-width corresponds to large values).

4 Task-dependent Architecture Adaptation

Using a NAS algorithm as a task optimizer does not only allow directly incorporating architecture search into the meta-training loop, but it also allows an adaptation of the found meta-architecture after meta-learning to new tasks (i.e., during meta-testing). That is, it allows in principle finding a task-dependent architecture (see Algorithm 3). This is in contrast to prior work where the architecture is always fixed during meta-testing [17, 35]. Prior work using NAS for meta-learning [25] also searches for a single architecture that is then shared across all tasks.

Unfortunately, the task-dependent architecture obtained by the DARTS task optimizer is non-sparse, that is, the architectural parameters $\alpha_{\mathcal{T}}$ do not lead to mixture weights being strictly 0 or 1 (see Figure 2, left, for an illustration). As discussed in Section 2, DARTS addresses this issue with a hard-pruning strategy at the end of the architecture search to obtain the final architecture from the one-shot model (line 8 in Algorithm 3). Since this hard-pruning deteriorates performance heavily, the pruned architectures require retraining. This is particularly problematic in a few-shot learning setting as it requires meta re-training all the task-dependent architectures. This is the approach followed by [1], but it unfortunately increases the cost of a single task training during meta-testing from a few steps of the task optimizer to essentially a full meta-training run of MAML/REPTILE with a fixed architecture.

We now remove the need for re-training by proposing two modifications to DARTS that essentially reparametrize the search space and substantially alleviate the drop in performance resulting from hard-pruning. This is achieved by enforcing the mixture weights of the mixed operations to slowly converge to 0 or 1 during task training while giving the operation weights time to adapt to this soft pruning.

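Algorithm 3 Learning of new task after meta-learning (i.e., meta-testing) with DARTS.
1:  Input: new task $\mathcal{T} = (\mathcal{D}_{train}^{\mathcal{T}}, \mathcal{D}_{test}^{\mathcal{T}})$, meta-learned architecture and weights $\alpha_{meta}$, $w_{meta}$
2:  $w_{\mathcal{T}} \leftarrow w_{meta}$
3:  $\alpha_{\mathcal{T}} \leftarrow \alpha_{meta}$
4:  for $j = 1, \dots, k$ do
5:     $w_{\mathcal{T}} \leftarrow w_{\mathcal{T}} - \lambda_{task} \nabla_w \mathcal{L}_{\mathcal{T}}(w_{\mathcal{T}}, \alpha_{\mathcal{T}}, \mathcal{D}_{train}^{\mathcal{T}})$
6:     $\alpha_{\mathcal{T}} \leftarrow \alpha_{\mathcal{T}} - \xi_{task} \nabla_\alpha \mathcal{L}_{\mathcal{T}}(w_{\mathcal{T}}, \alpha_{\mathcal{T}}, \mathcal{D}_{train}^{\mathcal{T}})$
7:  end for
8:  $\bar{\alpha}_{\mathcal{T}} \leftarrow \mathrm{PRUNE}(\alpha_{\mathcal{T}})$
9:  Evaluate $\mathcal{D}_{test}^{\mathcal{T}}$ with $(w_{\mathcal{T}}, \bar{\alpha}_{\mathcal{T}})$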

4.1 Soft-Pruning of Mixture over Operations

The first modification sparsifies the mixture weights of the operations forming a mixed operation that transforms node $i$ to node $j$. We achieve this by changing the normalization of the mixture weights $\hat{\alpha}_o^{i,j}$ from Equation 1 to become increasingly sparse, that is, more similar to a one-hot encoding for every pair $i, j$. To achieve this, we add a temperature $\tau_\alpha$ that is annealed to 0 over the course of a task training:

$$\hat{\alpha}_o^{i,j} = \frac{\exp(\alpha_o^{i,j} / \tau_\alpha)}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{i,j} / \tau_\alpha)} \qquad (7)$$

A similar approach was proposed by [50] and [12] in the context of relaxing discrete distributions via the Gumbel distribution [24, 32]. Note that this results in (approximate) one-hot mixture weights in a single mixed operation (see Figure 2, middle, for an illustration); however, the sum over all possible input nodes in Equation 1 is still non-sparse, meaning node $j$ is still connected to all prior nodes. As DARTS selects only the top $m$ ($m = 2$ in the default) input nodes, we additionally need to sparsify across the possible input nodes (that is, soft-prune them) rather than just summing them up (as done in Equation 1), as illustrated in Figure 2 (right).
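The effect of the temperature can be checked numerically with a small sketch. The linear schedule and its start and end values are hypothetical choices for this illustration; the text above only states that the temperature is annealed toward 0 during task training.

```python
import math

def tempered_softmax(alpha, temperature):
    """Softmax of alpha / temperature; lower temperatures give sparser weights."""
    scaled = [a / temperature for a in alpha]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def linear_anneal(step, num_steps, tau_start=1.0, tau_end=0.05):
    """Hypothetical schedule: linear decay from tau_start to tau_end."""
    frac = step / max(num_steps - 1, 1)
    return tau_start + frac * (tau_end - tau_start)

alpha = [1.0, 2.0, 0.5]
early = tempered_softmax(alpha, linear_anneal(0, 10))   # still a soft mixture
late = tempered_softmax(alpha, linear_anneal(9, 10))    # close to one-hot
```

Because the mixture weights sharpen gradually rather than in a single pruning step, the operation weights can follow the architecture as it becomes sparse.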

Parameters | Architecture | Omniglot 1-shot, 20-way | Omniglot 5-shot, 20-way | MiniImagenet 1-shot, 5-way | MiniImagenet 5-shot, 5-way

Table 1: Results (mean ± standard deviation for 3 independent runs) on different data sets and different few-shot tasks. For all architectures, REPTILE was used as the meta-learning algorithm and all results were obtained using the same training pipeline to ensure a fair comparison. Accuracy in %.

4.2 Soft-Pruning of Mixture over Input Nodes

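A natural choice to also sparsify the inputs would be to introduce weights $\beta^{i,j}$ of the inputs and sparsify them the same way as the operations’ weights by annealing a temperature $\tau_\beta$ to 0 over the course of a task training:

$$x^{(j)} = \sum_{i<j} \hat{\beta}^{i,j} \sum_{o \in \mathcal{O}} \hat{\alpha}_o^{i,j} \, o\big(x^{(i)}, w_o^{i,j}\big), \qquad \hat{\beta}^{i,j} = \frac{\exp(\beta^{i,j} / \tau_\beta)}{\sum_{i'<j} \exp(\beta^{i',j} / \tau_\beta)} \qquad (8)$$

Unfortunately, this would result in selecting exactly one input rather than a predefined number of inputs (e.g., the literature default 2 [56, 50, 31]). Instead, we weight every combination of $m$ inputs to allow an arbitrary number of inputs $m$:

$$x^{(j)} = \sum_{\mathbf{i} \in \mathcal{I}} \hat{\beta}^{\mathbf{i},j} \sum_{i \in \mathbf{i}} \sum_{o \in \mathcal{O}} \hat{\alpha}_o^{i,j} \, o\big(x^{(i)}, w_o^{i,j}\big), \qquad \hat{\beta}^{\mathbf{i},j} = \frac{\exp(\beta^{\mathbf{i},j} / \tau_\beta)}{\sum_{\mathbf{i}' \in \mathcal{I}} \exp(\beta^{\mathbf{i}',j} / \tau_\beta)} \qquad (9)$$

where $\mathcal{I}$ denotes the set of all combinations of inputs of length $m$. This introduces $|\mathcal{I}|$ additional parameters per node (one per combination), which is negligible for practical settings with only a few nodes per cell. The input weights $\beta$ are optimized along with the operations’ weights $\alpha$. Note that we simply subsume $\alpha$ and $\beta$ into $\alpha$ in Algorithm 3.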
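Weighting combinations of inputs rather than single inputs can be sketched as follows. The dictionary keyed by input tuples and the function names are assumptions for this illustration, and the per-edge mixed-operation outputs are passed in as plain numbers.

```python
import math
from itertools import combinations

def input_combination_weights(beta, temperature):
    """Tempered softmax over all input combinations of one node.

    beta: {tuple_of_input_nodes: unnormalized weight}
    """
    keys = list(beta)
    scaled = [beta[key] / temperature for key in keys]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return {key: e / total for key, e in zip(keys, exps)}

def combine_inputs(edge_outputs, beta, temperature):
    """Weighted sum over combinations; each combination sums its member edges.

    edge_outputs: {input_node: output of the mixed operation on that edge}
    """
    weights = input_combination_weights(beta, temperature)
    return sum(wt * sum(edge_outputs[i] for i in combo)
               for combo, wt in weights.items())

prior_nodes = [0, 1, 2]
# One parameter per pair of inputs: 3 choose 2 = 3 parameters for this node.
beta = {combo: 0.0 for combo in combinations(prior_nodes, 2)}
```

As the temperature decreases, the weight mass concentrates on a single combination, so the node ends up connected to exactly the inputs in that combination.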

With these two modifications, we can now not only find task-dependent optimal weights (given meta-weights) but also find task-dependent architectures (given a meta-architecture) that can be hard-pruned without a notable drop in performance, and thus without retraining. While in theory we can now enforce a one-hot encoding of the mixture over operations as well as over the input nodes, we empirically found that it is sometimes helpful not to choose the minimal temperature too small but rather to allow a few (usually not more than two) weights larger than 0 instead of strictly enforcing a one-hot encoding. At the end of each task learning, we then simply keep all operations and input nodes with corresponding weight larger than some threshold, while all others are pruned.
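The final thresholding step is simple enough to state directly. The threshold value and the operation names in the usage below are hypothetical and serve only as an illustration.

```python
def prune_by_threshold(weights, threshold):
    """Keep every operation (or input) whose normalized weight exceeds threshold."""
    return {name: w for name, w in weights.items() if w > threshold}

# Hypothetical mixture weights after annealing; 0.05 is an illustrative threshold.
kept = prune_by_threshold({"conv3x3": 0.93, "skip": 0.06, "max_pool": 0.01}, 0.05)
```

Since annealing has already pushed most weights close to 0, the entries removed here contribute very little to the output, which is consistent with pruning no longer causing a notable drop in accuracy.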

5 Experiments

We evaluate our proposed method on the standard few-shot image recognition benchmarks Omniglot [26] and MiniImagenet (as proposed by [37]) in the $N$-way, $K$-shot setting (as proposed by [48]), meaning that a few-shot learning task is generated by randomly sampling $N$ classes from either Omniglot or MiniImagenet and $K$ examples for each of the $N$ classes. We refer to [48] for more details.

5.1 Comparison under the same meta-learning algorithm.

We first compare against the architectures from the original REPTILE [35] paper and from AutoMeta [25] when training all models with the same meta-learning algorithm, namely REPTILE. This ensures a fair comparison, so that differences in performance can be clearly attributed to differences in the architecture. We re-train the architectures from REPTILE and AutoMeta with our own training pipeline for the same number of meta epochs (which we found to be sufficient to approximately reproduce results from the REPTILE paper) to further ensure that all architectures are trained under identical conditions. A detailed description of the experimental setup including all hyperparameters can be found in Appendix A.1 in the supplementary material.

(a) Normal cell.
(b) Reduction cell.
Figure 4: The most common normal and reduction cell found by MetaNAS that are used for the evaluation in Section 5.2.

For our method, we consider the following search space based on DARTS and AutoMeta: we search for a normal and a reduction cell (which is common practice in the NAS literature [56, 31, 50, 8, 51]). Both cells are composed of three intermediate nodes (i.e., hidden states). The set of candidate operations is MaxPool3x3, AvgPool3x3, SkipConnect, Conv1x5-5x1, Conv3x3, SepConv3x3, DilatedConv3x3. Our models are composed of 4 cells, with the first and third cells being normal cells and the second and fourth cells being reduction cells. The number of filters is constant throughout the whole network (rather than doubling the filters whenever the spatial resolution decreases); it is set so that the pruned models match the size (in terms of number of parameters) of the REPTILE and AutoMeta models to ensure a fair comparison. We consider models in a small and in a large parameter regime. Note that, in contrast to DARTS, we optimize the architectural parameters also on training data (rather than validation data), since the very limited amount of data in the few-shot setting does not allow a validation split per task.

The results are summarized in Table 1. In the 1-shot, 20-way Omniglot experiment, MetaNAS achieves superior performance in the case of small models, while in the case of large models MetaNAS is on par with AutoMeta and both outperform REPTILE. On Omniglot 5-shot, 20-way, all methods perform similarly well, with MetaNAS and AutoMeta having slight advantages over REPTILE. On MiniImagenet 1-shot, 5-way, both AutoMeta and MetaNAS outperform REPTILE. In the 5-shot, 5-way setting, MetaNAS also outperforms AutoMeta in the case of larger models while being slightly worse for small models. In summary, MetaNAS nearly always outperforms the original REPTILE model while it is on par with or slightly outperforms AutoMeta in almost all cases. We highlight that MetaNAS achieves this while being 10x more efficient than AutoMeta; the AutoMeta authors report computational costs in the order of 100 GPU days while MetaNAS was run for approximately one week on a single GPU. Moreover, MetaNAS finds task-specific architectures; see Figure 4 for the most common cells and Figure 5 for two other commonly used (reduction) cells, with considerable differences in the operations and connectivity.

5.2 Scaling up architectures and comparison to other meta-learning algorithms.

We now compare to other meta-learning algorithms in the fixed architecture setting; that is, we use MetaNAS not with task-dependent architectures but with a single fixed architecture extracted after running MetaNAS. For this, we extract the most commonly used task-dependent architecture (see Figure 4), scale it up by using more channels and cells, and retrain it for more meta-epochs with stronger regularization (weight decay and DropPath [56]), which is common practice in the NAS literature [56, 38, 31, 50, 8]. Note that naively enlarging models for few-shot learning without regularization does not improve performance due to overfitting, as reported by [37, 17]. Results are presented in Table 2. Again, MetaNAS improves over the standard REPTILE architecture and AutoMeta. It also significantly outperforms all other methods on MiniImagenet and achieves new state-of-the-art performance. On Omniglot, MetaNAS is on par with MAML++. As MAML++ outperforms REPTILE as a meta-learning algorithm, it is likely that using MAML++ in combination with MetaNAS would further improve our results.

Method | MiniImagenet 1-shot, 5-way | MiniImagenet 5-shot, 5-way | Omniglot 1-shot, 20-way
CAVIA [54] -
T-NAS++ [1] -
Warp-MAML [20] -
AutoMeta [25] -
Table 2: Comparison to other meta-learning algorithms. Here we list the numbers stated in other papers. MetaNAS denotes the results of our proposed method after increasing model size, regularization and a longer meta-training period. Accuracy in %.

6 Conclusion

We have proposed MetaNAS, the first method which fully integrates gradient-based meta-learning with neural architecture search. MetaNAS allows meta-learning a neural architecture along with the weights and adapting it to task-dependent architectures based on few labeled datapoints and with only a few steps of gradient descent. We have also proposed an extension of DARTS [31] that reduces the performance drop incurred during hard-pruning, which might be of independent interest. Empirical results on standard few-shot learning benchmarks show that MetaNAS is superior to the simple CNNs mostly used in few-shot learning so far. MetaNAS is on par with or better than other methods applying NAS to few-shot learning while being significantly more efficient. After scaling the found architectures up, MetaNAS significantly improves the state-of-the-art on MiniImagenet in both the 1-shot, 5-way and the 5-shot, 5-way settings.

As our framework is agnostic with respect to the meta-learning algorithm as well as to the differentiable architecture search method, our empirical results can likely be improved by using more sophisticated meta-learning methods such as MAML++ [2] and more sophisticated differentiable architecture search methods such as ProxylessNAS [8]. In the future, we plan to extend our framework beyond few-shot classification to other multi-task problems.

Figure 5: Two other commonly used reduction cells.


  • [1] Anonymous. Towards fast adaptation of neural architectures with meta learning. In Submitted to International Conference on Learning Representations, 2020. under review.
  • [2] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your MAML. In International Conference on Learning Representations, 2019.
  • [3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2017.
  • [4] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating Neural Architecture Search using Performance Prediction. In NIPS Workshop on Meta-Learning, 2017.
  • [5] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, 2018.
  • [6] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
  • [7] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI, 2018.
  • [8] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.
  • [9] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8713–8724. Curran Associates, Inc., 2018.
  • [10] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [11] Tristan Deleu, Tobias Würfl, Mandana Samiei, Joseph Paul Cohen, and Yoshua Bengio. Torchmeta: A Meta-Learning library for PyTorch, 2019.
  • [12] Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1761–1770, 2019.
  • [13] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Simple And Efficient Architecture Search for Convolutional Neural Networks. In NIPS Workshop on Meta-Learning, 2017.
  • [14] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-objective neural architecture search via lamarckian evolution. In International Conference on Learning Representations, 2019.
  • [15] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
  • [16] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1437–1446, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [18] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9516–9527. Curran Associates, Inc., 2018.
  • [19] Sebastian Flennerhag, Pablo Garcia Moreno, Neil Lawrence, and Andreas Damianou. Transferring knowledge across learning processes. In International Conference on Learning Representations, 2019.
  • [20] Sebastian Flennerhag, Andrei A. Rusu, Razvan Pascanu, Hujun Yin, and Raia Hadsell. Meta-learning with warped gradient descent. arXiv preprint, 2019.
  • [21] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [22] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In Georg Dorffner, Horst Bischof, and Kurt Hornik, editors, Artificial Neural Networks — ICANN 2001, pages 87–94, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
  • [23] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors. Automated Machine Learning: Methods, Systems, Challenges. Springer, 2018. In press, available at
  • [24] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 2017.
  • [25] Jaehong Kim, Youngduck Choi, Moonsu Cha, Jung Kwon Lee, Sangyeul Lee, Sungwan Kim, Yongseok Choi, and Jiwon Kim. Auto-meta: Automated gradient based meta learner search. arXiv preprint, 2018.
  • [26] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [27] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
  • [28] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In Proceedings of the International Conference on Learning Representations (ICLR’17), 2017. Published online:
  • [29] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [30] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive Neural Architecture Search. In European Conference on Computer Vision, 2018.
  • [31] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  • [32] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
  • [33] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving Deep Neural Networks. In arXiv:1703.00548, Mar. 2017.
  • [34] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [35] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint, 2018.
  • [36] Ramakanth Pasunuru and Mohit Bansal. Continual and multi-task architecture search. CoRR, abs/1906.05226, 2019.
  • [37] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
  • [38] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Aging Evolution for Image Classifier Architecture Search. In AAAI, 2019.
  • [39] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2902–2911, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [40] Tonmoy Saikia, Yassine Marrakchi, Arber Zela, Frank Hutter, and Thomas Brox. Autodispnet: Improving disparity estimation with automl. 2019.
  • [41] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4053–4061. Curran Associates, Inc., 2016.
  • [42] Jürgen Schmidhuber. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-meta…-hook. Master’s thesis, Technische Universität München, Germany, 1987.
  • [43] Christoph Schorn, Thomas Elsken, Sebastian Vogel, Armin Runge, Andre Guntoro, and Gerd Ascheid. Automated design of error-resilient and hardware-efficient deep neural networks. arXiv preprint, 2019.
  • [44] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. arXiv preprint, 2017.
  • [45] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10:99–127, 2002.
  • [46] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science+Business Media, 1998.
  • [47] Joaquin Vanschoren. Meta-Learning, pages 35–61. Springer International Publishing, Cham, 2019.
  • [48] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3630–3638. Curran Associates, Inc., 2016.
  • [49] Catherine Wong, Neil Houlsby, Yifeng Lu, and Andrea Gesmundo. Transfer learning with neural automl. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8356–8365. Curran Associates, Inc., 2018.
  • [50] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture search. In International Conference on Learning Representations, 2019.
  • [51] Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. Understanding and robustifying differentiable architecture search. arXiv preprint, 2019.
  • [52] Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. In ICML 2018 Workshop on AutoML (AutoML 2018), 2018.
  • [53] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2423–2432, 2018.
  • [54] Luisa Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7693–7702, Long Beach, California, USA, 2019. PMLR.
  • [55] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR) 2017 Conference Track, 2017.
  • [56] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Conference on Computer Vision and Pattern Recognition, 2018.

Appendix A Appendix

A.1 Detailed experimental setup and hyperparameters

Our implementation is based on the REPTILE [35] and DARTS [31] code. The data loaders and data splits are adopted from Torchmeta [11]. The overall evaluation setup is the same as in REPTILE.

Hyperparameters are listed in Table 3. The hyperparameters were determined by random search centered around default values from REPTILE on a validation split of the training data.

For the experiments in Section 5.1, all models were trained for 30,000 meta epochs. For MetaNAS, we did not adapt the architectural parameters for the first 15,000 meta epochs in order to warm up the model. Such a warm-up phase is commonly employed as it helps avoid unstable behaviour in gradient-based NAS [8, 40, 51].
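The warm-up phase described above can be illustrated with a minimal sketch. The function name `trainable_groups` and the group labels are hypothetical and not taken from the released code; the sketch only assumes that architectural updates are switched off for the first 15,000 of the 30,000 meta epochs.

```python
# Illustrative sketch only: gate architectural updates behind a warm-up phase.
# The names below are assumptions, not identifiers from the MetaNAS code base.

def trainable_groups(meta_epoch, warmup_epochs=15_000):
    """Return the parameter groups that are updated in a given meta epoch.

    meta_epoch is 0-based. During warm-up only the weights are trained;
    afterwards both weights and architectural parameters are trained.
    """
    if meta_epoch < warmup_epochs:
        return ("weights",)
    return ("weights", "architecture")


if __name__ == "__main__":
    total_meta_epochs = 30_000
    with_arch = sum(
        "architecture" in trainable_groups(epoch)
        for epoch in range(total_meta_epochs)
    )
    print(f"meta epochs with architecture updates: {with_arch}")
```

With these settings, exactly half of the meta epochs update the architectural parameters.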

Hyperparameter                                  Value
----------------------------------------------  -------
batch size                                      20
meta batch size                                 10
shots during meta training                      15 / 10
task training steps (during meta training)      5
task training steps (during meta testing)       50+50
task learning rate (weights)
task learning rate (architecture)               /
task optimizer (weights)                        Adam
task optimizer (architecture)                   Adam
meta learning rate (weights)                    1.0
meta learning rate (architecture)               0.6
meta optimizer (weights)                        SGD
meta optimizer (architecture)                   SGD
weight decay (weights)                          0.0
weight decay (architecture)
Table 3: Hyperparameters of MetaNAS for the experiments of Section 5.1. Hyperparameters are the same across n-shot, k-way settings. They are also the same for MiniImagenet and Omniglot except for rows with two values separated by a slash; in that case, the first value is for MiniImagenet and the second for Omniglot. By 50+50 we mean that for the first 50 steps both the weights and the architecture are adapted, while for the latter 50 steps only the weights are adapted.
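To make the 50+50 schedule at meta-test time concrete, the following sketch enumerates which parameter groups are adapted at each task training step. The function `meta_test_plan` and its dictionary keys are hypothetical names chosen for illustration; only the step counts come from Table 3.

```python
# Illustrative sketch of the "50+50" task adaptation schedule at meta-test time.
# Function and key names are assumptions made for this example.

def meta_test_plan(joint_steps=50, weights_only_steps=50):
    """List, per task training step, which parameter groups are adapted."""
    plan = []
    for step in range(joint_steps + weights_only_steps):
        plan.append({
            "step": step,
            "update_weights": True,
            # The architecture is adapted only during the first phase.
            "update_architecture": step < joint_steps,
        })
    return plan


if __name__ == "__main__":
    plan = meta_test_plan()
    arch_steps = sum(entry["update_architecture"] for entry in plan)
    print(f"{len(plan)} steps in total, {arch_steps} with architecture updates")
```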

For the experiment in Section 5.2, we make the following changes relative to Section 5.1: we meta-train for 100,000 meta epochs instead of 30,000, increase the weight decay on the weights from 0.0 to and use DropPath [55] with probability 0.2. We increase the number of channels per layer to from 14 (30k parameter models) / 28 (100k parameter models) and use 5 cells instead of 4, where again the second and fourth cells are reduction cells while all others are normal cells.
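The cell arrangement can be written down as a small sketch. The helper `cell_layout` and the string labels are hypothetical; the sketch only encodes the statement that the second and fourth cells are reduction cells and all others are normal cells.

```python
# Illustrative sketch: order of normal and reduction cells in the network.
# `cell_layout` is a hypothetical helper, not part of the released code.

def cell_layout(num_cells, reduction_positions=(2, 4)):
    """Return the cell type for each position (positions are 1-based)."""
    return [
        "reduction" if position in reduction_positions else "normal"
        for position in range(1, num_cells + 1)
    ]


if __name__ == "__main__":
    print(cell_layout(4))  # setup of Section 5.1
    print(cell_layout(5))  # setup of Section 5.2
```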

A.2 Motivation of meta-learning algorithm

In Section 3.2, we proposed two meta-learning updates, Equations (8) and (9). Equation (8) extends the MAML update, which is simply one step of SGD on the meta-objective (Equation 2), while Equation (9) extends the REPTILE update, which was shown to maximize the inner product between gradients of different batches for the same task, resulting in improved generalization [35]. Equations (8) and (9) constitute a simple heuristic that performs the same updates on the architectural parameters as well.
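The idea of applying a REPTILE-style update to both weights and architectural parameters can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the parameters are scalars, each task has a quadratic loss with its own optimum, and the task optimizer is plain gradient descent rather than Adam. It is not the authors' implementation; it only shows meta-parameters being moved toward the average of the task-adapted parameters, with separate meta learning rates for weights and architecture.

```python
# Toy sketch of a REPTILE-style meta-update applied jointly to weights (w)
# and architectural parameters (alpha). Scalar parameters, quadratic task
# losses and plain gradient descent are simplifying assumptions.

def task_adapt(w, alpha, target_w, target_alpha, lr_w, lr_alpha, steps):
    """Gradient descent on 0.5*(w - target_w)^2 + 0.5*(alpha - target_alpha)^2."""
    for _ in range(steps):
        w = w - lr_w * (w - target_w)
        alpha = alpha - lr_alpha * (alpha - target_alpha)
    return w, alpha


def reptile_meta_step(w_meta, alpha_meta, tasks, meta_lr_w, meta_lr_alpha,
                      lr_w=0.1, lr_alpha=0.1, steps=5):
    """Move the meta-parameters toward the mean of the task-adapted parameters."""
    delta_w, delta_alpha = 0.0, 0.0
    for target_w, target_alpha in tasks:
        w_task, alpha_task = task_adapt(
            w_meta, alpha_meta, target_w, target_alpha, lr_w, lr_alpha, steps
        )
        delta_w += w_task - w_meta
        delta_alpha += alpha_task - alpha_meta
    n = len(tasks)
    return (w_meta + meta_lr_w * delta_w / n,
            alpha_meta + meta_lr_alpha * delta_alpha / n)


if __name__ == "__main__":
    # Two toy tasks, each given by its optimum (target_w, target_alpha).
    tasks = [(1.0, -1.0), (3.0, 1.0)]
    w, alpha = 0.0, 0.5
    for _ in range(100):
        w, alpha = reptile_meta_step(w, alpha, tasks,
                                     meta_lr_w=1.0, meta_lr_alpha=0.6)
    print(f"w_meta = {w:.4f}, alpha_meta = {alpha:.4f}")
```

In this toy setting the meta-parameters converge to the mean of the task optima; with real networks the task-adapted parameters come from a few optimizer steps on each task's training data.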