Efficient Automatic Meta Optimization Search for Few-Shot Learning

09/06/2019 ∙ by Xinyue Zheng, et al. ∙ Lenovo

Previous works on meta-learning either relied on elaborately hand-designed network structures or adopted learning rules specialized to a particular domain. We propose a universal framework that optimizes the meta-learning process automatically by adopting the neural architecture search (NAS) technique. NAS automatically generates and evaluates the meta-learner's architecture for few-shot learning problems, while the meta-learner uses a meta-learning algorithm to optimize its parameters based on the distribution of learning tasks. Parameter sharing and experience replay are adopted to accelerate the architecture search process, so it takes only 1-2 GPU days to find good architectures. Extensive experiments on Mini-ImageNet and Omniglot show that our algorithm excels in few-shot learning tasks. The best architecture found on Mini-ImageNet achieves competitive results when transferred to Omniglot, which shows the high transferability of architectures among different computer vision problems.




1 Introduction

Many meta-learning methods [6, 16, 22] have achieved success in the “$K$-shot, $N$-way” scenario. In this scenario, each task is an $N$-class classification problem, and the learner only sees $K$ training instances from each class. After training on these instances, the learner is able to classify new images in the test set. Finn et al. [2] proposed model-agnostic meta-learning (MAML). Its key contribution is an initialization technique: the learner is trained repeatedly on sampled tasks so that its parameters are set at a starting point from which it adapts well. Compared to MAML, which needs to calculate second derivatives in back-propagation, Reptile [12] uses only first-order derivatives and is therefore more efficient and less demanding of computational resources. However, Reptile optimizes the meta-learner only at the parameter level, and the learner's model is a simple yet powerful network structure that is manually designed. Designing architectures is a time-consuming process that often requires rich expert knowledge and many experimental comparisons. Therefore, we propose a novel joint optimization scheme that combines a model-agnostic meta-learning algorithm with automatic architecture design to improve few-shot learning.

As shown in Figure 1, our scheme employs the neural architecture search technique to automate the architecture design process. The components contribute as follows: the controller is trained with policy gradient to sample the meta-learner's architecture from a component library; the meta-learner uses Reptile to seek highly adaptive initial parameters for different tasks; and experience replay and parameter sharing speed up the meta-learner search by learning from historical knowledge.

To be specific, our model search method is based on ENAS [13], which improves the efficiency of NAS by allowing parameter sharing among generated models. Figure 1 shows the controller, a recurrent network that outputs a variable-length string to define a child model, including configurable model depth, stochastic skip connections, and different combinations of convolution cells. The parameters of the child model are trained by Reptile [12]. After a period of Reptile training, the child model returns its accuracy as a reward, which is used to evaluate and adjust the controller's architecture-generation policy. To speed up the model search procedure, we apply experience replay to reduce the number of controller interactions with the environment and to encourage the controller to learn fully from its accumulated experience in the changing environment. Ultimately, the controller optimizes its policy and yields the best model architecture. We retrain this model from scratch using Reptile, and it can generalize across tasks with only a small number of gradient steps and few samples per task.

Figure 1: An overview of efficient automatic meta-learning method

We make the following contributions:

  • We are the first to propose an automatic meta-optimization system that applies the neural architecture search technique to a meta-learning method.

  • A series of experiments show that the joint automatic optimization method can ensure that the training model has rich expression ability and high cross-task generalization ability. On Mini-ImageNet benchmark, it achieves excellent performance with accuracy.

  • We achieve remarkable meta-learner search efficiency (-shot with GPU hours; -shot with GPU hours). This is due to the incorporation of parameter sharing and experience replay techniques in the search process.

  • The algorithm shows high transferability among different computer vision problems. The best architecture found on Mini-ImageNet achieves competitive results on Omniglot tasks, and the searched models in -shot, -way classification are transferable to -shot, -way scenario.

2 Related Work

2.1 Meta Learning

Meta-learning trains learners on a variety of similar tasks and expects them to generalize quickly to previously unseen tasks. There are several ways to realize meta-learning. Memory-based methods [11, 16] adjust bias through weight updates and generate outputs by learning from memories. Santoro et al. [16] make use of the external memory introduced by the Neural Turing Machine [3] to realize short-term memory and build connections between labels and input images, so that later inputs can be compared with related images in memory to achieve better predictions. Gradient-based methods [1, 6] train an LSTM optimizer to learn the parameter optimization rules of the learner network. While [1] targets large-scale classification, [6] focuses on few-shot learning and learns both optimization rules and weight initialization. Recent work such as the relation network [20] and the matching network [22] employs ideas from metric learning: instead of using manually designed metrics, these methods rely entirely on neural networks to learn a deep distance metric. The Simple Neural Attentive Learner (SNAIL) [10] uses temporal convolution to collect past experience and soft attention to pinpoint specific details. Object-level representation learning [8] decomposes images into objects and applies object-level relations learned from the source dataset to the target dataset.

Although the existing approaches have achieved impressive results, they either introduce extra parameters, which need more storage space, or place constraints on the model architecture. MAML [2] is well accepted for its simplicity and model-agnosticism. It learns highly adaptive parameters to initialize the neural network, so that only a small number of gradient updates are required for fast learning on a new task. Recently, OpenAI proposed a similar method, Reptile [12], which, unlike MAML, does not require differentiating through the optimization process.

2.2 Neural Architecture Search

Human-designed networks usually perform only specific tasks. An automated method that generates an appropriate architecture with adaptive model parameters and hyperparameters for any given task is desirable. Many hyperparameter optimization methods have been studied [19, 7, 9, 4]. These algorithms can select and fine-tune model hyperparameters automatically and can surpass expert-level manual tuning. However, they are not flexible and are often limited to generating fixed-length configurations for networks.

In recent years, evolutionary algorithms and reinforcement learning algorithms have been adopted for neural architecture search and have achieved promising performance. Neuro-evolution methods [15, 14] use mutation operations to explore large search spaces, which incurs expensive evaluation costs and requires heuristic algorithms. The reinforcement learning approach is more feasible and achieves better results. Zoph et al. [23, 24] use a recurrent network to generate a “child model” and use the accuracy of the child model on the validation set as the reward signal to train this RNN. Efficient Neural Architecture Search (ENAS) [13] speeds up the training process by allowing parameter sharing among child models. Another efficient method, Differentiable Architecture Search (DARTS) [18], constructs a continuous search space and optimizes the architecture in a differentiable manner.

3 Method

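In meta-learning, learners make progress at the task level rather than the data-point level. For example, MAML [2] takes in a distribution of tasks $p(\tau)$, where each task $\tau$ is an independent learning problem. In order to lower the loss on task $\tau$, we need to solve the following problem:

$$\min_{\theta_{\alpha}} \; \mathbb{E}_{\tau \sim p(\tau)} \left[ L_{D^{\tau}_{test}} \left( U_{D^{\tau}_{train}} (\theta_{\alpha}) \right) \right] \tag{1}$$

where $D^{\tau}_{train}$ and $D^{\tau}_{test}$ represent the training set and test set of task $\tau$ respectively, $U_{D^{\tau}_{train}}$ is the training procedure acting on $D^{\tau}_{train}$, and the loss $L_{D^{\tau}_{test}}$ is computed on the updated parameters with the test samples $D^{\tau}_{test}$. $\alpha$ represents the model architecture of the meta-learner, and $\theta_{\alpha}$ are the model parameters under this architecture. In the classic meta-learning setting, $\alpha$ is fixed and we only optimize $\theta_{\alpha}$. In our proposed scheme, $\alpha$ and $\theta_{\alpha}$ are jointly optimized in an alternating training manner.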

There are two stages in each meta-optimization search step. First, the controller's policy is trained to sample an architecture, taking into account the reward output by the meta-learner, which is a random value at the first step. Second, using this architecture, the meta-learner is trained with the Reptile algorithm. Reptile [12] is a first-order counterpart of MAML. Using the same principle, it seeks an initialization of the model parameters that can be fine-tuned easily, so that the trained learner achieves high performance on previously unseen tasks. The score on the validation set is fed back to the controller as the reward for reinforcement learning. These two stages are trained alternately until some good architecture candidates are generated. After all search steps have finished, we retrain these candidates to obtain the architecture with the highest score on the meta-test dataset.

Since the discrete-domain search over architectures can be transformed into a continuous-domain optimization of the controller network in our method, formula (1) can be rewritten in a differentiable form that can be optimized with end-to-end training:


3.1 Generating Transferable Architecture by Controller

We use an LSTM as the controller to generate a variable-length string that specifies the model architecture. The controller receives a randomly initialized variable as input at the very beginning; afterwards, the input at time step $t$ is the embedding of the decision sampled at time step $t-1$.

As shown in Figure 2, the controller aims to generate a four-layer child model architecture. It makes two sets of decisions according to the current generation policy: 1) which operations to apply and 2) which previous layers to concatenate. The available operations are ordinary convolutions, depthwise-separable convolutions, average pooling and max pooling. After selecting the operation, the controller decides which previous layers to use as skip connections. Take layer $k$ as an example: each of the $k-1$ previous layers is either selected or not, which leads to $2^{k-1}$ possible choices. In the example of Figure 2, the outputs of the previous layers whose indices the controller selects are concatenated along the depth dimension and sent to the current layer.

Figure 2: Left: The prediction string made by controller. Right: Connect prediction string to build a complete network.
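The decision sequence described in this section can be illustrated with a minimal sketch. It assumes that each earlier layer is independently kept or dropped as a skip input, and it replaces the LSTM controller with uniform random sampling purely for illustration; the operation names in `OPERATIONS` and both function names are hypothetical and are not taken from the paper.

```python
import random

# ASSUMPTION: illustrative operation names; kernel sizes are not taken from the paper.
OPERATIONS = ["conv", "sep_conv", "avg_pool", "max_pool"]


def sample_architecture(num_layers, rng):
    """Sample one child architecture as a list of (operation, skip_indices)."""
    layers = []
    for k in range(1, num_layers + 1):
        # Decision 1: which operation to apply at layer k.
        op = rng.choice(OPERATIONS)
        # Decision 2: which of the k-1 earlier layers to concatenate.
        # Uniform coin flips stand in for the learned controller policy.
        skips = [j for j in range(1, k) if rng.random() < 0.5]
        layers.append((op, skips))
    return layers


def search_space_size(num_layers, num_ops=len(OPERATIONS)):
    """Count architectures: num_ops operations and 2**(k-1) skip subsets per layer."""
    total = 1
    for k in range(1, num_layers + 1):
        total *= num_ops * 2 ** (k - 1)
    return total


arch = sample_architecture(4, random.Random(0))
for index, (op, skips) in enumerate(arch, start=1):
    print(f"layer {index}: op={op}, inputs from layers {skips}")
print("architectures with 4 layers:", search_space_size(4))
```

Under these assumptions the count grows quickly with depth, which is one reason an exhaustive evaluation of child models is impractical and a learned sampling policy is used instead.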

3.2 Training Controller with Reinforcement Learning

3.2.1 Policy gradient

When training the controller, we freeze the parameters of the child model and only update the controller's policy parameters $\theta_c$ (see Algorithm 1). The actions $a_{1:T}$ are the decisions made by the controller's policy over the time series: selecting an operation and the layers for skip connections. We use policy gradient [21] to train the controller to maximize the expected reward $J(\theta_c)$. The traditional policy gradient formula is:

$$\nabla_{\theta_c} J(\theta_c) \approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta_c} \log P(a_t \mid a_{(t-1):1}; \theta_c) \, R_k \tag{3}$$

where $m$ stands for the number of architectures sampled by the current policy, $T$ denotes the number of predictions made by the controller, and the reward $R_k$ of the $k$-th architecture controls the parameter update direction and step size. Equation (3) aims to increase the generation probability of high-reward models and to reduce that of low-reward ones. We use the empirical average over these $m$ architectures to approximate the policy gradient.

The above estimate is unbiased but has high variance. As proposed in [23], we introduce a baseline $b$ in the reward to reduce the variance. $b$ is defined as the exponential moving average of previous architecture accuracies. By subtracting the baseline, we measure the improvement of a model over an average one:

$$\nabla_{\theta_c} J(\theta_c) \approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta_c} \log P(a_t \mid a_{(t-1):1}; \theta_c) \, (R_k - b) \tag{4}$$

The advantage $R_k - b$ helps to update the policy parameters in a more specific direction.
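As a rough illustration of policy gradient with a moving-average baseline, the sketch below trains a one-step softmax policy over three candidate "architectures" with fixed rewards. This is a toy stand-in under stated assumptions, not the paper's controller: the LSTM is replaced by a single vector of logits, one architecture is sampled per update, and `train_controller`, the reward values, the learning rate and the decay factor are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_controller(rewards, steps=2000, lr=0.1, decay=0.9, seed=0):
    """REINFORCE with an exponential-moving-average baseline on a one-step policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Advantage: how much better this sample is than the running average.
        advantage = reward - baseline
        for i in range(len(logits)):
            # Gradient of log softmax w.r.t. logit i: one_hot(action)[i] - probs[i].
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
        # Baseline is updated after the policy step, as a moving average of rewards.
        baseline = decay * baseline + (1.0 - decay) * reward
    return softmax(logits)


# ASSUMPTION: made-up validation accuracies for three candidate architectures.
final_probs = train_controller([0.2, 0.5, 0.8])
print("final sampling probabilities:", [round(p, 3) for p in final_probs])
```

Samples with a reward below the baseline push their own probability down, while samples above it push it up, so probability mass drifts toward the highest-reward candidate.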

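1:  Randomly initialize the policy parameters and the input state
2:  for each search step do
3:     Observe the state and generate an action stream to form a child model
4:     Train the child model with Algorithm 2 on the meta-training set to get a reward
5:     Perform a policy gradient (PG) update of the policy with this experience (state, actions, reward)
6:     Store the experience in the replay buffer if its storage condition holds, or otherwise with the storage probability of Section 3.2.2
7:     if the current step is a replay step then
8:        for each replay update do
9:           Uniformly sample a transition from the replay buffer
10:           Update the policy parameters with the sampled transition
11:        end for
12:     end if
13:  end for
Algorithm 1 Automatic architecture search with experience replay.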

3.2.2 Experience Replay

Policy gradient is a stochastic gradient algorithm. The controller is updated using only one sample architecture generated by the current policy, and the sample is discarded after a single update. Therefore, the controller tends to forget its past experience, which leads to oscillation. We address this issue by adopting experience replay [17]. To perform experience replay, we store each transition in a replay buffer; a transition consists of the architecture string selected by the controller, the input state of the controller, and the accuracy computed on the validation set. Since not all experiences are worth learning from more than once, an experience is stored in the buffer with probability:


The criterion for measuring the importance of a transition is its reward, which indicates how good this architecture is compared with the current baseline.
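A replay buffer of this kind can be sketched in a few lines. The class and function names below are hypothetical, and `placeholder_keep_probability` is only a stand-in storage rule (keep above-baseline transitions always, others half the time); it is an assumption made for illustration and is not the storage probability used in the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, architecture, reward) transitions."""

    def __init__(self, capacity, rng=None):
        # A bounded deque silently evicts the oldest transition when full.
        self.transitions = deque(maxlen=capacity)
        self.rng = rng or random.Random()

    def maybe_store(self, transition, keep_probability):
        """Store the transition with the given probability; report whether it was kept."""
        if self.rng.random() < keep_probability:
            self.transitions.append(transition)
            return True
        return False

    def sample(self, batch_size):
        """Uniformly sample stored transitions without replacement."""
        size = min(batch_size, len(self.transitions))
        return self.rng.sample(list(self.transitions), size)


def placeholder_keep_probability(reward, baseline):
    """ASSUMPTION: illustrative rule only, not the paper's formula."""
    return 1.0 if reward > baseline else 0.5


buffer = ReplayBuffer(capacity=3, rng=random.Random(0))
for step, reward in enumerate([0.30, 0.45, 0.50, 0.62, 0.41]):
    keep = placeholder_keep_probability(reward, baseline=0.4)
    buffer.maybe_store(("state", f"arch_{step}", reward), keep)
print("stored:", len(buffer.transitions), "replayed:", buffer.sample(2))
```

Bounding the buffer keeps replayed transitions reasonably close to the current policy, which matters because the reward of an old architecture was measured against older shared weights.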

3.3 Training Child Model with Reptile

When training the child model (see Algorithm 2), we freeze the controller's policy parameters. Child models are required to learn from a limited number of images, so we build our work on the scalable meta-learning algorithm Reptile. Let $p(\tau)$ be the probability distribution over tasks; we sample a batch of tasks $\tau_i$ from $p(\tau)$. The standard cross-entropy loss $L_{\tau_i}$ serves as the task-specific loss function. In order to make the model parameters $\theta$ sensitive to each task, we run $k$ gradient steps on the loss $L_{\tau_i}$ for each task and obtain the final parameter vector $W_i$. Meta-optimization across tasks is performed with the Adam algorithm, where $\theta - W_i$ is treated as the gradient. In addition, running the child model on the validation set yields an accuracy $R$, which is returned as the reward to scale the gradients of the controller.

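1:  Randomly initialize $\theta$
2:  repeat
3:     Sample a batch of tasks $\tau_1, \ldots, \tau_n \sim p(\tau)$
4:     for each $\tau_i$ in the batch do
5:        Compute adapted parameters with $k$ gradient steps: $W_i = \mathrm{SGD}(L_{\tau_i}, \theta, k)$
6:     end for
7:     Update $\theta$ with Adam, treating $\frac{1}{n} \sum_{i=1}^{n} (\theta - W_i)$ as the gradient
8:  until convergence
9:  return the accuracy on the validation set as the reward $R$
Algorithm 2 Reptile at training time.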
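The Reptile loop can be made concrete with a toy one-parameter example. This is a minimal sketch under stated assumptions: each task is a quadratic loss with its own target value instead of an image classification problem, the outer update uses plain SGD rather than Adam, and all function names and hyperparameters are hypothetical.

```python
import random


def inner_sgd(theta, target, steps, lr):
    """Run `steps` SGD steps on the task loss 0.5 * (w - target)**2, starting from theta."""
    w = theta
    for _ in range(steps):
        w -= lr * (w - target)  # gradient of the quadratic loss is (w - target)
    return w


def reptile(num_iterations=500, meta_batch=5, inner_steps=3,
            inner_lr=0.1, outer_lr=0.1, seed=0):
    """Toy Reptile: learn an initialization that adapts quickly to sampled tasks."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(num_iterations):
        # ASSUMPTION: a task is defined by a target drawn uniformly from [1, 3].
        targets = [rng.uniform(1.0, 3.0) for _ in range(meta_batch)]
        adapted = [inner_sgd(theta, t, inner_steps, inner_lr) for t in targets]
        # Reptile treats the averaged difference (theta - W_i) as the meta-gradient.
        meta_grad = sum(theta - w for w in adapted) / meta_batch
        # Plain SGD outer step here; the paper uses Adam for this update.
        theta -= outer_lr * meta_grad
    return theta


theta = reptile()
print("learned initialization:", round(theta, 3))
```

In this toy setting the learned initialization settles near the centre of the task distribution, the point from which a few gradient steps reach any sampled task most easily. No second derivatives are computed at any point, which is the efficiency argument made for Reptile over MAML.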

4 Experiment

In this section, we evaluate the automatic meta-learning method on two important benchmarks, Mini-ImageNet and Omniglot, and compare our results against strong baselines. All of our experiments consider the $K$-shot, $N$-way learning problem. For each $K$-shot, $N$-way classification task, the learner trains on $N$ related classes with $K$ examples each. We first sample $N$ classes from the meta-dataset and then select $K+1$ examples for each class. We then split these examples into training and test sets, where the training set contains $K$ examples for each class and the test set contains the remaining sample. In other words, we use $K \times N$ examples ($K$ images for each of the $N$ classes) to train the learner and $N$ additional examples to test the model.
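The task construction can be sketched as follows. The sketch reads the text's "remaining sample" as one held-out example per class, so it draws one more example per class than the number of shots; that reading, the `sample_episode` helper and the toy dataset of string identifiers are all assumptions made for illustration.

```python
import random


def sample_episode(dataset, n_way, k_shot, rng):
    """Build one few-shot task from a dict mapping class name -> list of examples."""
    classes = rng.sample(sorted(dataset), n_way)
    train, test = [], []
    for label, name in enumerate(classes):
        # Draw k_shot training examples plus one held-out test example per class.
        examples = rng.sample(dataset[name], k_shot + 1)
        train += [(x, label) for x in examples[:k_shot]]
        test.append((examples[k_shot], label))
    return train, test


# ASSUMPTION: a toy meta-dataset of 10 classes with 20 string "images" each.
toy_dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
train_set, test_set = sample_episode(toy_dataset, n_way=5, k_shot=5, rng=random.Random(0))
print(len(train_set), "training examples,", len(test_set), "test examples")
```

Labels are re-indexed from zero inside each episode, so the learner cannot rely on a fixed mapping between class identity and output unit across tasks.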

4.1 Few-shot Learning Datasets

Mini-ImageNet is created by randomly sampling classes from ImageNet and selecting 600 examples for each class. Training set has images with classes, test set consists of images with classes, and validation set contains images with classes.

Omniglot consists of characters from different alphabets. We randomly select 1200 characters for training and use the remaining character classes for testing. As proposed by Santoro et al. [16], we augment the dataset with rotations by multiples of degrees. Omniglot was proposed by Lake et al. and used in their 2015 Science paper [5].

4.2 Implementation Details

The controller is a one-layer LSTM with 100 hidden units, whose goal is to search for -layer child models. Operations can be selected from: x , x , and x convolutions; x , x , and x depthwise-separable convolutions; and x average pooling and max pooling. We perform experience replay on the controller every steps and transitions each time. A global average pooling layer is added before the fully connected layer, and dropout layers with drop rates are added after each layer. These tricks reduce the number of parameters and avoid overfitting during training. We use GPU for - days to search for the top- architectures for the meta-learner, and each architecture takes GPU hours to retrain from scratch. Table 1 presents the parameter settings in the final retraining process.

Parameters -shot -way -shot -way
Adam learning rate
Outer iterations
Outer step size
Meta batch size
Inner iterations
Inner batch size
Train shots
Eval. inner iterations
Eval. inner batch size
Table 1: Parameters for final-retrain on Mini-ImageNet

4.3 Evaluation

Algorithm Transduction -shot -way -shot -way
Reptile[12] N
Reptile Y
Matching Nets[22] N
Relation Nets[20] N
Ours N
Ours Y
Ours (Transfer) N
Ours (Transfer) Y
Table 2: Results on Mini-ImageNet

As shown in Table 2, our method achieves competitive results on Mini-ImageNet. In transductive mode, the trained model classifies all the samples in the test set at once, so information is allowed to leak between test samples through batch normalization [12]. As expected, the transductive experiments achieve high accuracy on the -shot -way task.

The automatic search process learns directly from the task distribution of the dataset, and we show that it enables some degree of transferability. In our experiments, we transfer the model constructed under the -shot -way configuration to the -shot -way configuration for the final retraining. It achieves accuracy, which is still beyond the original Reptile performance on Mini-ImageNet.

Although some methods [20, 10] achieve performance competitive with ours, our method deliberately sacrifices some accuracy in exchange for time and space efficiency. For example, we only select the top 3 searched architectures for retraining; selecting the top 10 or top 100 architectures might improve the accuracy. In addition, the network could automatically select the number of layers, the number of feature maps, and so on, which would make it easier to find the most powerful architecture than manual setting does, but this procedure is very time-consuming. Moreover, experience replay encourages the learner to learn from past experience but reduces the number of times new architectures are explored. Although this greatly improves efficiency, it reduces the chance of finding the best architecture to some extent. Therefore, if we focus only on accuracy, there is considerable room for improvement, but we believe an efficient algorithm with competitive results is more valuable. Compared to the original NAS technique, which takes 32,400-43,200 GPU hours, our algorithm can search for good architectures within 48 GPU hours.

Algorithm Transduction -shot -way -shot -way
Matching Nets[22] N
order MAML[2] Y
Reptile[12] N
Reptile Y
Ours N
Ours (Transfer) N
Table 3: Results on Omniglot

On Omniglot, we run a distant transfer experiment to test the generalization performance of the searched architecture. Here, we transfer only the model architecture from Mini-ImageNet, and all the weights are retrained from scratch. From Table 3, we find that the transferred architecture generalizes well to Omniglot problems and even exceeds the accuracy of the original Reptile method in the transductive setting. The automatic meta-learning method thus demonstrates not only within-task generalization but also cross-task generalization.

Figure 3 shows that experience replay contributes to the understanding of new tasks and helps the learner discover better architectures at lower computational cost. Note that the y-axis is the moving average of previous architecture accuracy, so it reflects only the average value, and many architectures achieve much higher accuracy. Figure 4 shows the good architectures we discovered.

Figure 3: Training curves for the architecture search procedure: exponential moving average architecture accuracy over K iterations.
Figure 4: High-accuracy architectures found by -shot, -way classification on Mini-ImageNet, which can be transferred to other classification scenarios.

5 Conclusion

In this paper, we introduce an efficient automatic meta-optimization search for few-shot learning problems. Extensive experiments show that the proposed algorithm achieves competitive performance in few-shot learning tasks. Our work offers a few key insights. First, the proposed framework is universal, because the architecture search discovers scalable architectures for the meta-learner and can be easily combined with any model-agnostic meta-learning algorithm. Second, parameter sharing and experience replay greatly reduce the computational cost and improve the efficiency of our approach. Lastly, we show both the within-task and the cross-task generalization of the learner's architecture; this transferability is a desirable characteristic and deserves further study.