Rethink and Redesign Meta-Learning

12/11/2018 ∙ by Yunxiao Qin, et al. ∙ JD.com, Inc.

Recently, meta-learning has been shown to be a promising way to improve the ability to learn from few data in computer vision. However, previous meta-learning approaches expose the following problems: 1) they ignore the importance of the attention mechanism for the meta-learner, so the meta-learner is distracted by unimportant information; 2) they ignore the importance of past knowledge, which can help the meta-learner accurately understand the input data and express it as high-level representations, and instead train the meta-learner to solve few-shot learning tasks directly on the raw input data rather than on such representations; 3) they suffer from a problem we name task-over-fitting (TOF), which is probably caused by solving few-shot learning tasks on the original high-dimensional input data, whose redundant information makes the meta-learner more prone to TOF. In this paper, we rethink the meta-learning algorithm and argue that the attention mechanism and past knowledge are crucial for the meta-learner, which should use its past knowledge to express the input data as high-level representations and solve few-shot learning tasks on those representations. Moreover, the meta-learning approach should be free of the TOF problem. Based on these arguments, we redesign the meta-learning algorithm to solve the three aforementioned problems and propose three methods. Extensive experiments demonstrate the effectiveness of our design and methods, with state-of-the-art performance on several few-shot learning benchmarks. The source code of our proposed methods will be released soon.


1 Introduction

The development of deep learning has brought remarkable progress to many tasks[1, 2, 3, 4, 5]. To achieve these results, large amounts of labeled data, often thousands or even millions of examples, are required for deep learning approaches to obtain satisfactory performance. However, annotating abundant data is notoriously expensive. Furthermore, it is challenging to adapt a neural network trained for one task to a new problem with few labeled data, as a severe over-fitting issue arises.

Recently, meta-learning algorithms[6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], which learn a meta-learner on a distribution of similar tasks instead of on data, have shown promising results in enabling few-shot learning for many computer vision tasks. However, existing meta-learners still lag behind human vision considerably.

Learning from few data is challenging for many computer vision problems. In comparison, we human beings can rapidly learn new categories from very few examples, thanks to our ability to accurately understand image content by leveraging past knowledge. Furthermore, benefiting from our powerful attention mechanism, we can precisely locate and extract critical features from images. This allows us to efficiently narrow the memory space of patterns, so that we can distinguish image categories from few examples.

An example of human few-shot learning is illustrated in Fig.1. In this example, each image contains several different features, such as a plant, an animal, a tree, or a table. However, only the tree or table feature is useful for recognizing these two classes of images. To solve this task, we quickly adjust ourselves to pay attention to the critical feature and neglect the others, which manifests the critical role of the attention mechanism.

Figure 1: Example of a few-shot image classification task. The six images come from two classes; four labeled ones are training data and the two unlabeled ones are for testing. We are asked to classify the two test images. Clearly, image (c) belongs to class 1, which contains a table, while image (f) is associated with class 2, which contains a tree.

It should be noted that we are intelligent enough to leverage our learned knowledge about the world to understand these images and extract compact feature representations from them. Meanwhile, in the above task, we only need to quickly adjust our attention and decision logic to fit the task, whereas the representations of these images remain constant and stable.

Therefore, we can summarize the human few-shot learning process as follows: 1) first, use past knowledge to understand the RGB image and extract compact feature representations from it; 2) second, quickly adjust attention to extract the critical features from the compact representations while discarding irrelevant ones; 3) finally, use the extracted attentive features to perform the few-shot learning task.

It is evident that two main modules enable human few-shot learning: a representation module that utilizes past knowledge to understand the image and express it as compact feature representations, and an attention-based decision logic module that adapts rapidly on top of the compact feature representations.

From the analysis above, and rethinking existing meta-learning approaches, which train meta-learners to solve few-shot learning tasks as shown in Fig.2(a), we can see why existing meta-learning approaches lag behind human vision: 1) Their meta-learners are affected by many features irrelevant to the few-shot learning task. In the absence of an attention mechanism[18, 19, 20, 21, 22, 23], the meta-learners are unable to focus on the most distinguishable features of the input images. 2) The lack of past knowledge makes the meta-learners unable to precisely learn compact representations of the input images in a lower-dimensional feature space, so they have to solve few-shot learning tasks in the inefficient space of the original images.

(a) Meta-learner A
(b) Meta-learner B
Figure 2: (a) Meta-learner A solves a few-shot learning task by adjusting its entire network according to the original input data. (b) Meta-learner B is separated into a representation module and an Attention Augmented Output (AAO) module. The representation module is responsible for using its past knowledge to extract compact representations from the input data; the AAO module is responsible for adjusting itself to solve the few-shot learning task in the compact feature space.

Moreover, existing meta-learning approaches suffer from the Task-Over-Fitting (TOF) problem. For example, a meta-learner trained on 5-way 1-shot tasks is not as capable as one trained on 5-way 5-shot tasks when both are tested on 5-way 5-shot tasks, and vice versa.

However, in practical applications, it is uncertain how much data, and how many shots, will be available to the meta-learner. Therefore, to meet the demands of practical applications, the meta-learner must be capable of working well on all K-shot learning tasks, where K is any small positive integer. SNAIL[13], which trains the meta-learner on N-way random K-shot tasks (K ∈ {1,2,3,4,5}, with K changing randomly from epoch to epoch), partly alleviates the TOF problem through this training trick. However, SNAIL neither defined the TOF problem nor analyzed its cause.

The likely reason for the TOF problem is, again, that existing meta-learners lack both prior knowledge and an attention mechanism. They cannot precisely understand images and are adversely affected by useless features irrelevant to the presented task. This redundant information makes the meta-learner more prone to TOF.

In this paper, we rethink the meta-learning algorithm and propose to learn both the attention mechanism and the prior knowledge. The meta-learner leverages its prior knowledge to learn compact representations of the input data in an efficient, lower-dimensional feature space. Meanwhile, as shown in Fig.2(b), we reduce the burden on the meta-learner: it only needs to rapidly adapt its Attention Augmented Output (AAO) module to solve the few-shot learning task in the compact feature space.

To this end, we redesign the meta-learning algorithm, propose three methods step by step that leverage the attention mechanism and past knowledge in the meta-learning process, and present a metric, Cross-Entropy across Tasks (CET), to measure how much a meta-learning approach is affected by the TOF problem. Here, we briefly introduce the proposed methods: 1) To enable the meta-learner to utilize an attention mechanism, we embed an attention model into the meta-learner's network. We call this Attention augmented Meta-Learning (AML). 2) To allow the meta-learner to utilize the attention mechanism as well as prior knowledge, we separate its network into two modules, a representation module and an AAO module, as shown in Fig.2(b). The representation module, which corresponds to the same module in human vision, learns prior knowledge in a supervised fashion and is responsible for understanding the image and extracting compact feature representations from it. The AAO module plays the same role as the attention-based decision logic module of human vision. It utilizes the attention mechanism to make the meta-learner smarter, so that the meta-learner can adjust its attention, focus on the most discriminative features of the input images, and solve the few-shot learning task with better performance. We call this method Representation based and Attention augmented Meta-Learning (RAML). 3) Supervised learning requires a large amount of labeled data, yet far more unlabeled data exist in the real world. To take full advantage of unlabeled data, we design a novel method that trains the meta-learner's representation module in an unsupervised fashion[24, 25, 26, 27, 28, 29]. We call this method Unsupervised Representation module based and Attention augmented Meta-Learning (URAML). With URAML, our experiments show that both growth in the amount of unlabeled data and progress in unsupervised learning appreciably improve performance.

The main contributions of our work are:

  • We rethink the meta-learning algorithm and propose that the attention mechanism and past knowledge are both crucial for the meta-learner. Besides, the meta-learner should be trained to solve few-shot learning tasks in a compact representation space instead of on the original image data.

  • We redesign the meta-learning algorithm and propose three methods step by step that leverage the attention mechanism and past knowledge: AML, RAML, and URAML.

  • Through extensive experiments, we show that the proposed methods achieve state-of-the-art performance on several few-shot learning benchmarks.

  • We define the TOF problem of meta-learning and design a novel metric, Cross-Entropy across Tasks (CET), to measure how much a meta-learning approach suffers from it. The cross-testing experiments (introduced in the experiments section) show that, compared to other meta-learning methods, the proposed methods are less sensitive to the TOF problem, especially RAML and URAML.

2 Related Work

2.1 Meta-learning

An N-way, K-shot learning task provides the meta-learner with a support set and a query set. The support set contains K examples for each of the N classes, and the query set contains L examples for each of the N classes. Most meta-learning approaches train a meta-learner on N-way, K-shot tasks as follows: first, the meta-learner inner-updates itself according to the support set; second, it examines the effect of the inner-update by computing its loss on the query set; third, by minimizing the loss on the query set, the meta-learner learns a good weight initializer (initial weights that are easily inner-updated by simple gradient descent to perform well on the query set[9]), or a skillful weight updater (one that accurately inner-updates the meta-learner's weights[11, 17]), or both[10], or it learns to memorize the information of the support set so as to perform well on the query set from memory[13].
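As a concrete illustration, the following is a minimal sketch of one such training step in the Meta-SGD style, written with TensorFlow; the functional model interface, the task tuple, and all names here are illustrative assumptions, not the authors' released code:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def meta_sgd_step(model, theta, alpha, task, meta_opt):
    """One meta-training step on a single N-way K-shot task (Meta-SGD style).

    model(x, weights) is a functional network; theta are the meta-learned
    initial weights and alpha the meta-learned per-parameter inner learning
    rates (both lists of tf.Variable), as in Meta-SGD[10].
    """
    (x_s, y_s), (x_q, y_q) = task                      # support / query sets
    with tf.GradientTape() as outer_tape:
        with tf.GradientTape() as inner_tape:
            loss_s = loss_fn(y_s, model(x_s, theta))   # loss on support set
        grads = inner_tape.gradient(loss_s, theta)
        # Inner update: one gradient step adapts theta to this task.
        theta_prime = [w - a * g for w, a, g in zip(theta, alpha, grads)]
        # Outer objective: how well the adapted weights do on the query set.
        loss_q = loss_fn(y_q, model(x_q, theta_prime))
    meta_grads = outer_tape.gradient(loss_q, theta + alpha)
    meta_opt.apply_gradients(zip(meta_grads, theta + alpha))
```

Minimizing the query loss through the inner update is what distinguishes this train-on-tasks scheme from ordinary training on data.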

Besides these, some other meta-learning approaches also work well. LLAMA[30], which builds on MAML, uses a local Laplace approximation to model the task parameters. Similar to MAML, Reptile[12] also trains the meta-learner to learn a weight initializer, but considers only first-order gradients. MetaGAN[16] couples meta-learning with a generative adversarial network (GAN) and uses the fake samples produced by the generator to help the classifier learn better decision boundaries in few-shot learning tasks.

In our paper, we use the Meta-SGD approach[10] as our base meta-learning approach because of its excellent performance and flexibility.

2.2 Metric learning

Some researchers have tried to solve the few-shot learning problem by metric learning[31, 32, 33]. The principle of these approaches is straightforward: train a non-linear mapping function that represents images in an embedding space such that, after training, the embeddings of images belonging to different classes can easily be distinguished by a simple nearest-neighbor or linear classifier. Matching network[31] trains the mapping function with a neural network and assigns a test image to the class whose images it is closest to, measuring distance by the cosine distance between embeddings. Similarly, Prototypical Network[32] is another metric-learning-based few-shot learning approach; it uses the Euclidean distance to measure the similarity between embeddings. Compared to meta-learning approaches, the disadvantage of these metric-learning-based approaches is obvious: they are not readily applicable to other domains, such as reinforcement learning[34, 35] and regression.

2.3 Attention mechanism

In recent years, the attention mechanism[18, 19, 20, 21] has been widely used in computer vision, machine translation, and natural language processing systems. Several forms of attention have been proposed, such as soft attention[18, 19], hard attention[20], and self-attention[21]. Soft attention simulates the attention mechanism by multiplying neural units by weights, so that the network pays more attention to the units multiplied by larger weights. SENet[19] took advantage of the soft attention mechanism to win the image classification task of ILSVRC-2017[36]. Hard attention[20] can be seen as a module that selects a block region of the input image to be visible to the network, leaving the rest invisible. Self-attention[21] improves machine translation by training a network to find the inner dependencies of the input and of the output. In this paper, we use soft attention as the meta-learner's attention mechanism.

2.4 Unsupervised representation and feature learning

It is costly to train a deep neural network by supervised learning, as we have to collect enough data and annotate it carefully. Considering this obvious shortcoming of supervised learning, several unsupervised learning approaches[24, 25, 26, 27, 28, 29] have been proposed. A well-known approach trains a neural network to reconstruct the original input through an encoder-decoder architecture, as in the Auto-Encoder[24] and Variational Auto-Encoder (VAE)[25]. The input data is gradually compressed by the encoder into features that are then used to reconstruct the original input through the decoder. The features learned by these methods are suitable for reconstructing the input image, but they do not carry enough semantic information to be used for classification or other tasks.

Similar to the Auto-Encoder[24] and VAE[25], the Context Auto-Encoder[26] forces the network to understand the input image by training it to predict the contents of an arbitrarily masked image region from the surrounding region. Colorization[27] uses Lab images to train a network that generates the unseen ab channels from the input L channel. Building on Colorization, Split-Brain[29] trains two separate networks simultaneously: one generates the ab channels from the L channel, and the other generates the L channel from the ab channels. A possible reason for the success of Colorization and Split-Brain is that they force the network to first predict the semantic information of every pixel from the input channel, and then generate the pixel values of the unseen channel according to that semantic information and the provided channel.

Differently from these methods, DeepCluster[28] couples deep learning with clustering algorithms[37, 38] and achieves state-of-the-art performance in unsupervised learning. However, DeepCluster may be troubled by the fact that many images contain complex or redundant semantic information and thus are not suitable to be assigned to any single cluster.

3 Method

3.1 Problem of learning from few-data

Learning from few data is extremely difficult for deep learning models. One reason is that the original input data commonly lives in a high-dimensional space, often with tens or hundreds of thousands of dimensions. For image classification, for example, the original image is commonly stored in a large-dimensional space (a 224x224 RGB image has 150528 dimensions), and it is difficult for a few images of one category, in such a large space, to accurately reflect the character of that category.

Humans, however, learn from few data by first expressing the input data as high-level representations and then paying attention to the critical features of those compact representations. These two steps gradually reduce the dimensionality of the input data while maintaining its principal, discriminative components, allowing us to grasp the characteristics of a category from few images.

Existing meta-learning approaches help deep learning models learn from few data considerably. However, they train the meta-learner to quickly adjust its network to fit a few-shot learning task directly on the few original high-dimensional inputs, ignoring the importance of the attention mechanism and past knowledge. Besides, they are troubled by the TOF problem.

In this paper, to address the problems exposed by existing meta-learning approaches, we redesign the meta-learning algorithm and propose three methods step by step: Attention augmented Meta-Learning (AML), Representation based and Attention augmented Meta-Learning (RAML), and Unsupervised Representation module based and Attention augmented Meta-Learning (URAML).

3.2 AML

The AML method equips the meta-learner with the power of the attention mechanism by embedding an attention model into the meta-learner's network. The network architecture of AML is shown in Fig.3. An attention model is inserted explicitly, and the forward computation is given in (1). The CNN output f is first fed into the attention model; the attended feature f' is the channel-wise multiplication of the attention mask m and the original feature f; and the classifier outputs the final prediction p based on the feature f'.

f = F(x; w_F),  m = A(f; w_A),  f' = M(m, f),  p = C(f'; w_C)    (1)

Where F, A, M, and C are the functions of the CNN, the attention model, the channel-wise multiplication, and the classification layer respectively; w_F, w_A, and w_C are the weights of the CNN, the attention model, and the classification layer respectively; x is the input data and p is the output prediction.

In this paper, we use the soft attention mechanism to build the attention model. Although soft attention is not identical to the attention mechanism of human vision, it plays a similar role and helps the meta-learner pay attention to key features. Fig.5(b) illustrates how the soft attention mechanism operates for the meta-learner.

The inner structure of the attention model is shown in Fig.4, and its computation process is given in (2). The feature f is first global-average-pooled to obtain the feature f_g, and then a convolution layer coupled with a sigmoid activation layer computes the attention mask m from f_g.

f_g = G(f),  m = σ(Conv(f_g; w_A))    (2)
Figure 3: Network structure of the proposed method AML. There is an attention model inserted explicitly in the meta-learner’s network.
Figure 4: Inner network structure of the attention model. The shape of the feature map f is (b,w,h,c), shown at the left of the figure, where b, w, h, and c are the batch size, width, height, and number of channels of f; the shapes of f_g and m are both (b,1,1,c).

Where G is the global-average-pooling function and σ is the sigmoid activation function.

With the embedded attention model, the meta-learner solves few-shot learning tasks better and more easily, because it can tune its attention toward the most useful features during the few-shot learning process. The corresponding experiments confirm the positive effect of the attention mechanism.
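For concreteness, here is a minimal TensorFlow/Keras sketch of the attention model in (2); the layer class and its name are our own illustrative assumptions:

```python
import tensorflow as tf

class SoftAttention(tf.keras.layers.Layer):
    """Soft channel attention: m = sigmoid(Conv1x1(GAP(f))), f' = m * f."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution mapping the pooled feature to a per-channel mask.
        self.conv = tf.keras.layers.Conv2D(channels, 1, activation="sigmoid")

    def call(self, f):
        # f: (b, h, w, c) -> f_g: (b, 1, 1, c), as in Fig.4.
        f_g = tf.reduce_mean(f, axis=[1, 2], keepdims=True)
        m = self.conv(f_g)            # attention mask, shape (b, 1, 1, c)
        return m * f                  # channel-wise re-weighting, eq. (2)
```

The mask broadcasts over the spatial dimensions, so each channel of f is simply scaled up or down according to its importance.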

Figure 5: (a) Network structure of the proposed RAML. The meta-learner is composed of a representation module and an AAO module. The Pre-trained Classification (PC) module is used to help the meta-learner to learn the past knowledge. (b) Example that interprets the principle of soft attention mechanism for few-shot learning.
Figure 6: Network structure of the proposed URAML. The meta-learner is composed of a representation module and an AAO module. The Decoder module is used to help the meta-learner to learn the past knowledge.

3.3 RAML

The RAML method equips the meta-learner not only with the attention mechanism but also with the ability to use past learned knowledge. To achieve this, we design the meta-learner's network as in Fig.5.

In Fig.5 there are three modules, and the meta-learner is composed of the representation module and the AAO module. The Pre-trained Classification (PC) module does not belong to the meta-learner; it is only used to help the representation module learn knowledge in a supervised way.

The representation module is responsible for learning and leveraging knowledge to help the meta-learner understand the input image. The AAO module, which also embeds an attention model, is the part the meta-learner adjusts efficiently to fit each new few-shot learning task.

The training process is separated into two stages: the pre-training stage and the meta-training stage. In the pre-training stage, both the representation module and the PC module are trained on the MiniImagenet-900 dataset (a dataset we organized to pre-train the representation module; details are given in the experiments section).

After the pre-training stage, the representation module has learned knowledge from the MiniImagenet-900 dataset. In the meta-training stage, to keep the meta-learner from forgetting this learned knowledge, we freeze the pre-trained representation module entirely. The meta-learner then only needs to learn to solve few-shot learning tasks by quickly adjusting its AAO module in the representation space, which is a simpler job than that of the meta-learner in AML.

It should be noted that the dataset used in the pre-training stage differs from that used in the meta-training stage. In the meta-training stage, the meta-learner is trained on the MiniImagenet dataset, whereas in the pre-training stage the representation module is trained on the MiniImagenet-900 dataset, and there is no image-class overlap between the two datasets.
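A minimal sketch of this two-stage recipe follows; the standard Keras ResNet-50 stands in for our modified version, and all names are illustrative:

```python
import tensorflow as tf

# Pre-training stage: representation module plus the PC head, trained as an
# ordinary 900-way supervised classifier on MiniImagenet-900.
representation = tf.keras.applications.ResNet50(
    include_top=False, weights=None, pooling="avg", input_shape=(84, 84, 3))
pc_head = tf.keras.layers.Dense(900, activation="softmax")
pretrain_model = tf.keras.Sequential([representation, pc_head])
pretrain_model.compile(optimizer="adam",
                       loss="sparse_categorical_crossentropy")
# pretrain_model.fit(miniimagenet_900, epochs=...)  # supervised pre-training

# Meta-training stage: freeze the representation module; only the AAO module
# (attention model + classifier) is meta-trained, e.g. with Meta-SGD, on the
# compact features: features = representation(task_images)
representation.trainable = False
```

Freezing the representation module both preserves the learned knowledge and shrinks the set of weights the meta-learner must adapt per task.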

3.4 URAML

As the representation module can also learn knowledge in an unsupervised manner, we design the URAML method. Its network structure is shown in Fig.6. The meta-learner is composed of the representation module and the AAO module. The decoder module does not belong to the meta-learner; it is only used to help the representation module learn knowledge in an unsupervised way.

In the pre-training stage, the representation module and the decoder module are trained by an unsupervised learning algorithm, the Split-Brain auto-encoder[29]. The Split-Brain auto-encoder trains two paths simultaneously. One path predicts the ab channels of the input Lab image given the L channel: the encoder E_L with weights w_EL computes the feature f_L of the L channel, and the decoder D_ab with weights w_Dab predicts the ab channels p_ab, as in (3). Similarly, the other path predicts the L channel given the ab channels, as in (4). The losses of these two paths are computed by (5). In our paper, we use the L2 regression loss as the loss function ℓ.

f_L = E_L(x_L; w_EL),   p_ab = D_ab(f_L; w_Dab)    (3)
f_ab = E_ab(x_ab; w_Eab),   p_L = D_L(f_ab; w_DL)    (4)
loss_ab = ℓ(p_ab, x_ab),   loss_L = ℓ(p_L, x_L)    (5)
f = concat(G(f_L), G(f_ab))    (6)

Where E, D, ℓ, and G are the encoder, decoder, loss, and global-average-pooling functions respectively, and x_L and x_ab are the L and ab channels of the input Lab image.

In the meta-training stage, the AAO module is trained by the Meta-SGD approach, and the training process is the same as in RAML and AML. The input of the AAO module is the compact feature vector f, obtained by concatenating the global-average-pooled features of f_L and f_ab, as in (6).
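A minimal sketch of the Split-Brain pre-training loss of (3)-(5) and the feature extraction of (6), with generic encoder/decoder stand-ins (the real modules are the halved ResNet-50s and the decoder of Tab.1):

```python
import tensorflow as tf

def split_brain_loss(x_lab, enc_l, dec_ab, enc_ab, dec_l):
    """L2 cross-channel prediction losses of eqs. (3)-(5).

    x_lab: batch of Lab images, shape (b, h, w, 3); channel 0 is L, 1:3 are ab.
    """
    x_l, x_ab = x_lab[..., :1], x_lab[..., 1:]
    f_l = enc_l(x_l)                   # eq. (3): encode the L channel
    p_ab = dec_ab(f_l)                 #          predict the ab channels
    f_ab = enc_ab(x_ab)                # eq. (4): encode the ab channels
    p_l = dec_l(f_ab)                  #          predict the L channel
    # eq. (5): L2 regression losses (targets resized to the decoder output).
    loss_ab = tf.reduce_mean(
        tf.square(p_ab - tf.image.resize(x_ab, p_ab.shape[1:3])))
    loss_l = tf.reduce_mean(
        tf.square(p_l - tf.image.resize(x_l, p_l.shape[1:3])))
    return loss_ab + loss_l

def representation_feature(x_lab, enc_l, enc_ab):
    """Eq. (6): concatenate the global-average-pooled features of both paths."""
    gap = lambda f: tf.reduce_mean(f, axis=[1, 2])
    return tf.concat([gap(enc_l(x_lab[..., :1])),
                      gap(enc_ab(x_lab[..., 1:]))], axis=-1)
```

After pre-training, only `representation_feature` is kept as the meta-learner's representation module; the decoders are discarded.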

The difference between URAML and RAML is that the representation module is trained by unsupervised learning, so no labeled data is needed in the pre-training stage. This characteristic brings two advantages to URAML:

  • Increasing the number of unlabeled images used in the pre-training stage improves the performance of the representation module and, in turn, that of the meta-learner;

  • Progress in unsupervised learning algorithms likewise improves the performance of the representation module and that of the meta-learner.

4 Experiments

In this section, we first present the datasets used in our experiments, and then the details and results of the experiments. All our experimental code is written with the TensorFlow library[39].

4.1 Dataset

We used several datasets in our experiments: MiniImagenet[11], Omniglot[40], MiniImagenet-900, Places2[41], COCO[42], and OpenImages-300.

4.1.1 MiniImagenet

MiniImagenet[11] is popularly used for evaluating the performance of meta-learning algorithms. It contains 100 image classes, split into 64 training classes, 16 validation classes, and 20 testing classes. Each class, with 600 images, is sampled from the ImageNet dataset[43].

4.1.2 Omniglot

Omniglot[40] is another widely used dataset for meta-learning. It contains 50 different alphabets and 1623 characters from these alphabets, and each character has 20 images hand-drawn by 20 different people.

4.1.3 MiniImagenet-900

The MiniImagenet-900 dataset is designed to pre-train the representation module in the RAML and URAML methods, and it is composed of 900 image classes. Each class and its images are collected from the original ImageNet dataset, and each class contains about 1300 images. It is worth noting that no class in MiniImagenet-900 coincides with any class of the MiniImagenet dataset.

4.1.4 Other datasets

As the representation module of URAML is trained by unsupervised learning, we take full advantage of this characteristic by training the representation module not only on MiniImagenet-900 but also on other datasets: Places2[41], COCO2017[42], and OpenImages-300.

The dataset OpenImages-300 is a subset of the OpenImages-V4 dataset[44]. The total OpenImages-V4 dataset contains 9 million images, and we randomly downloaded 3 million images from the OpenImages-V4 website to form the OpenImages-300 dataset.

For the COCO2017 and OpenImages-300 datasets, we center-crop from each training image a square whose side length is min(height, width). For a fair comparison with previous few-shot learning and meta-learning methods, we resize all Omniglot images to 28x28 resolution and all MiniImagenet images to 84x84 resolution. Finally, each image of the datasets used to pre-train the representation module in the RAML and URAML experiments is resized to 84x84 resolution, and is additionally converted into the Lab color space when used in the URAML experiment.
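A minimal sketch of this preprocessing, assuming eager TensorFlow and scikit-image (the function names are illustrative):

```python
import tensorflow as tf
from skimage import color

def center_square_crop_resize(image, size=84):
    """Center-crop a min(h, w) square from the image, then resize it."""
    h, w = image.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = image[top:top + s, left:left + s]
    return tf.image.resize(crop, (size, size)).numpy()

def to_lab(image_rgb):
    """Convert an RGB image with values in [0, 1] to the Lab color space
    (applied only to the URAML pre-training data)."""
    return color.rgb2lab(image_rgb)
```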

4.2 Experiments on MiniImagenet

On the MiniImagenet dataset, we test all three of our methods. Here we give the details of these experiments.

4.2.1 AML experiment on MiniImagenet

On the MiniImagenet dataset, we test AML on 5-way 1-shot and 5-way 5-shot tasks. With AML, we improve the meta-learner's ability by explicitly embedding the attention mechanism into its network, and we train it with the Meta-SGD approach. The structure of the meta-learner's network is shown in Fig.3. Both the attention model and the classifier are simple fully-connected layers, and 4 Convolution-ReLU-Batch-Normalization (C-R-B) blocks are used to extract features from the input image.

We train the meta-learner on 200000 randomly generated tasks for 60000 iterations, setting the meta-learner's learning rate to 0.001 and decaying it to 0.0001 after 30000 iterations. Moreover, we set the hyper-parameter L (the number of query examples per class) to 15, and use Dropout together with L1 and L2 regularization to prevent the meta-learner from over-fitting.

The experimental results of AML on the MiniImagenet dataset are shown in Tab.2; we attain the state of the art on the 5-way 5-shot image classification task with 69.46% accuracy (compared to the original Meta-SGD, we raise the meta-learner's performance by 8.5%).

4.2.2 RAML experiment on MiniImagenet

In the RAML experiment, we equip the meta-learner with the ability to utilize both past knowledge and the attention mechanism. The representation module is a modified ResNet-50[45] fed with 84x84 RGB images, and the PC module is a simple fully-connected layer followed by a softmax output layer.

In the pre-training stage, we set the batch size to 256 and the learning rate to 0.001, decay the learning rate to 0.0001 after 30000 iterations, and use L2 regularization and dropout to prevent over-fitting. The representation module together with the PC module reaches 57.39% accuracy on the 900-way classification task of the MiniImagenet-900 dataset.

After the pre-training stage, in the meta-training stage, we fix the weights of the representation module and train the AAO module with the Meta-SGD approach. In the RAML experiment, the attention model contains one convolution layer with 1x1 kernels and 2048 filters, and the classifier is composed of two fully-connected layers (one with 2048 hidden units and one with 5 output units). Batch normalization and ReLU activation are used in every hidden layer.

The results are shown in Tab.2. Compared to AML, RAML improves the meta-learner's performance more significantly: accuracy on the 5-way 1-shot task rises from 52.25% to 63.66%, and on the 5-way 5-shot task from 69.46% to 80.49%.

The most likely reason RAML performs so well is that, before the meta-training stage, the representation module has already learned the knowledge and the ability to understand the input image, and it provides high-level, meaningful representations and features. In the meta-training stage, because the representation module is fixed, the meta-learner's job becomes easier: it only needs to learn how to quickly adjust the AAO module according to the compact features the representation module provides, without handling the original high-dimensional input data. The meta-learner of AML works harder than that of RAML, as it has to adjust its entire network to fit new few-shot learning tasks from the original input data.

4.2.3 URAML experiment on MiniImagenet

The difference between URAML and RAML is that the representation module of URAML is pre-trained by an unsupervised learning algorithm, Split-Brain. We also use ResNet-50 as the basic network of the representation module. As shown in Fig.6, two independent ResNet-50 networks are trained simultaneously, and the features these two networks output are concatenated to form the output feature vector of the representation module. In the URAML experiment, we halve the filter counts throughout each ResNet-50 so that the representation module outputs a feature vector with a dimension of 2048, the same as in RAML.

The structure of the decoder network is detailed in Tab.1. We use deconvolution[46] layers to upsample the features. The number of filters of the last Conv layer is 1 or 2 depending on whether the network is recovering the L channel or the ab channels. It should be noted that, to save training cost, the decoder module recovers the ab and L channels at 11x11 resolution, not the original 84x84; accordingly, to compute the loss in (5), we resize the ab and L channels of the input Lab image to 11x11 resolution.

Layer | Number of filters | Kernel size
CONV_ReLU_BN | 1024 | 5
DeCONV_ReLU_BN | 512 | 3
DeCONV_ReLU_BN | 256 | 3
CONV | 1 or 2 | 1
Table 1: Detailed structure of the decoder module in the URAML experiment.

The hyper-parameters of the pre-training process are the same as in the RAML experiment. Inspired by the context encoder[26], we randomly drop several patches of each image used in the pre-training stage, forcing the network not only to recover the invisible image channel from the given channel but also to recover the invisible patches. Some processed images used in the pre-training stage are shown in Fig.7.

Figure 7: Processed images used in the pre-training stage of the URAML experiment, with the corresponding original images.
Method | Venue | 5-way 1-shot | 5-way 5-shot
Matching nets FCE[31] | NIPS-16 | 44.20% | 57.00%
Meta-learner LSTM[11] | ICLR-17 | 43.44±0.77% | 60.60±0.71%
MAML[9] | ICML-17 | 48.70±1.84% | 63.11±0.92%
Prototypical Nets[32] | NIPS-17 | 49.42±0.78% | 68.20±0.66%
Meta-SGD[10] | / | 50.47±1.87% | 64.03±0.94%
Reptile+Transduction[12] | / | 49.97±0.32% | 65.99±0.58%
LLAMA[30] | ICLR-18 | 49.40±1.83% | /
Relation Net[47] | CVPR-18 | 51.38±0.82% | 67.07±0.69%
GNN[48] | ICLR-18 | 50.33±0.36% | 66.41±0.63%
AML(ours) | / | 52.25±0.85% | 69.46±0.68%
SNAIL[13] | ICLR-18 | 55.71±0.99% | 68.88±0.92%
TADAM[49] | NIPS-18 | 58.50±0.30% | 76.70±0.30%
MetaGAN+RN[16] | NIPS-18 | 52.71±0.64% | 68.63±0.67%
RAML(ours) | / | 63.66±0.85% | 80.49±0.45%
URAML(ours) | / | 49.56±0.79% | 63.42±0.76%

Table 2: Few-shot learning performance on the MiniImagenet dataset. Methods colored blue use a deep network (ResNet) to extract features, while the others use a shallow network (4 cascaded convolution layers). The accuracy is averaged over 600 few-shot classification tasks randomly generated from the test set of the MiniImagenet dataset, with 95% confidence intervals. For each task, we separately highlight the best result among the shallow-network methods and among the deep-network methods. We also highlight the result of URAML, even though it is not as strong.

4.3 Experiments on Omniglot

As Omniglot is a much easier dataset, on which existing meta-learners easily achieve accuracies above 95% on most tasks, we test only the AML method, on 5-way 1-shot, 5-way 5-shot, 20-way 1-shot, and 20-way 5-shot tasks. Following MAML[9], the network architecture used in the Omniglot experiments is similar to that in the MiniImagenet experiments; the main difference is that downsampling is performed by convolution layers with stride 2 instead of max-pooling layers.

As in the MiniImagenet experiments, we train the meta-learner on 200000 randomly generated tasks for 60000 iterations with a learning rate of 0.001. The experimental results are shown in Tab.3.

The proposed AML method attains state-of-the-art performance on 2 of these 4 few-shot image classification tasks. On the 5-way 1-shot task, MetaGAN+RN performs slightly better than AML; however, MetaGAN+RN uses a much deeper network. On the 20-way 1-shot task, AML surpasses the other methods by a large margin (compared to the original Meta-SGD, AML improves the meta-learner's performance from 95.93% to 98.48%).

Method | Venue | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
Matching nets FCE[31] | NIPS-16 | 98.10% | 98.90% | 93.80% | 98.50%
MAML[9] | ICML-17 | 98.70±0.40% | 99.90±0.10% | 95.80±0.30% | 98.90±0.20%
Prototypical Nets[32] | NIPS-17 | 98.80% | 99.70% | 96.00% | 98.90%
Meta-SGD[10] | / | 99.53±0.26% | 99.93±0.09% | 95.93±0.38% | 98.97±0.19%
Reptile+Transduction[12] | / | 97.68±0.04% | 99.48±0.06% | 89.43±0.14% | 97.12±0.32%
Relation Net[47] | CVPR-18 | 99.60±0.20% | 99.80±0.10% | 97.60±0.20% | 99.10±0.10%
GNN[48] | ICLR-18 | 99.20% | 99.70% | 97.40% | 99.00%
SNAIL[13] | ICLR-18 | 99.07±0.16% | 99.78±0.09% | 97.64±0.30% | 99.36±0.18%
MetaGAN+MAML[16] | NIPS-18 | 99.10±0.30% | 99.70±0.21% | 96.40±0.27% | 98.90±0.18%
MetaGAN+RN[16] | NIPS-18 | 99.67±0.18% | 99.86±0.11% | 97.64±0.17% | 99.21±0.10%
AML(ours) | / | 99.65±0.10% | 99.85±0.04% | 98.48±0.09% | 99.55±0.06%
Table 3: Few-shot learning performance on the Omniglot dataset. Methods colored blue use a deep network (ResNet) to extract features, while the others use a shallow network (4 cascaded convolution layers). The accuracy is evaluated the same way as in MAML[9].

4.4 Ablation study

4.4.1 Ablation study about the attention mechanism

To confirm the benefit of the attention mechanism for meta-learning approaches, we run many experiments comparing the performance of a meta-learner equipped with the attention model against a counterpart without it. The experimental results are shown in Tab.4 and Tab.5. Results marked with * are our own re-implementations; they differ somewhat from the results in the corresponding papers, probably because of different hyper-parameters or experimental settings (all methods in this ablation use convolution layers with 32 filters). The comparison reveals that in most cases the attention mechanism improves the meta-learner by a clear margin, demonstrating the soundness of our idea and of the AML method.

As the attention mechanism adds weights to the meta-learner's network, we run another experiment to validate that the improvement of AML is caused by the attention mechanism rather than by the growth in the number of weights. Since the attention model is mainly a convolution layer with 1x1 kernels, we remove the attention model and instead place a 1x1 convolution layer on top of the feature f (see Fig.3). We call the meta-learner with this network Ordinary Meta-Learning (OML); its number of weights is the same as AML's. The corresponding result is shown in Tab.6: OML clearly lags behind AML, which shows that the improvement of AML comes from the attention mechanism, not from the extra weights.

Method | 5-way 1-shot | 5-way 5-shot
MAML* | 48.03±0.83% | 64.11±0.73%
MAML+attention | 48.52±0.85% | 64.94±0.69%
Reptile* | 48.23±0.43% | 63.69±0.49%
Reptile+attention | 48.30±0.45% | 64.22±0.39%
Meta-SGD* | 48.15±0.93% | 63.73±0.85%
Meta-SGD+attention | 49.11±0.94% | 65.54±0.84%
Table 4: Results of the ablation experiments on the attention mechanism, on the MiniImagenet dataset.
Method | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
MAML* | 97.40±0.27% | 99.71±0.05% | 93.37±0.23% | 97.46±0.11%
MAML+attention | 97.41±0.28% | 99.48±0.12% | 92.99±0.25% | 97.94±0.10%
Meta-SGD* | 98.94±0.17% | 99.51±0.07% | 95.82±0.21% | 98.40±0.09%
Meta-SGD+attention | 99.26±0.15% | 99.79±0.04% | 97.94±0.14% | 98.99±0.10%
Table 5: Results of the ablation experiments on the attention mechanism, on the Omniglot dataset.

4.4.2 Generalization to other meta-learning approaches

To test whether the proposed methods transfer to other meta-learning approaches, we change the meta-learning approach in all our methods from Meta-SGD to MAML, keeping all networks and hyper-parameters constant. We denote the variants AML-MAML, RAML-MAML, and URAML-MAML, and test them on the MiniImagenet dataset. The corresponding results are shown in Tab.6. Although the performance of AML-MAML and URAML-MAML drops slightly, they remain comparable, which indicates that all our proposed methods generalize well to different meta-learning approaches.

An interesting phenomenon is that RAML-MAML performs better than the original RAML, especially on the 5-way 5-shot task. We see two possible reasons. One is that the compact feature representations output by the pre-trained representation module are already distinguishable enough to solve the classification task. The other is that, compared to MAML, Meta-SGD doubles the number of weights in the meta-learner's AAO module. In short, on top of the distinguishable high-level representations provided by the pre-trained representation module, the original RAML is more prone to over-fitting than RAML-MAML.

Method | 5-way 1-shot | 5-way 5-shot
OML | 51.27±0.78% | 67.73±0.65%
AML | 52.25±0.85% | 69.46±0.68%
AML-MAML | 50.65±0.92% | 68.95±0.69%
RAML | 63.66±0.85% | 80.49±0.45%
RAML-MAML | 64.23±0.85% | 83.76±0.49%
RAML-Places2 | 58.82±0.89% | 74.09±0.76%
URAML | 49.56±0.79% | 63.42±0.76%
URAML-MAML | 49.12±0.85% | 62.93±0.49%
Table 6: Results of several ablation experiments.

4.4.3 The effect of the pre-training dataset

We run experiments to test how the dataset used in the pre-training stage affects the meta-learner in the RAML and URAML methods.

a) Effect on RAML: We pre-train the representation module on the Places2[41] dataset, keeping all other experimental settings and hyper-parameters the same as in the original RAML; we denote this variant RAML-Places2. The corresponding result is shown in Tab.6. Clearly, the pre-training dataset affects the meta-learner. The likely reason is that different pre-training datasets lead the representation module to learn different knowledge and features of the input data[50]. Places2 is commonly used for scene classification, so the representation module learns knowledge about scenes, and the features it outputs are more suitable for the scene classification task than for the object classification task.

b) Effect on URAML: We run an experiment to test how the quantity of unlabeled images used in the pre-training stage affects the performance of URAML.

We design two reduced versions of URAML for this ablation: URAML-V1 and URAML-V2. The representation module of URAML-V1 is trained only on the MiniImagenet-900 dataset. URAML-V2 additionally uses Places2 and COCO2017. In contrast, the representation module of the full URAML is trained on all of MiniImagenet-900, Places2, COCO2017, and OpenImages-300. Details are given in Tab.7.

Version | Datasets | Number of images
URAML-V1 | MiniImagenet-900 | 1.15 million
URAML-V2 | MiniImagenet-900, Places2, COCO2017 | 4.10 million
URAML | MiniImagenet-900, Places2, COCO2017, OpenImages-300 | 7.10 million
Table 7: Details of the datasets used in the URAML experiments.

Experimental results are shown in Tab.8. Clearly, URAML > URAML-V2 > URAML-V1, indicating that growth in the amount of unlabeled training data used in the pre-training stage boosts the performance of URAML, and there remains large room for improvement, as still more data could be used.

4.4.4 The effect of the unsupervised learning algorithm

Moreover, the choice of unsupervised learning algorithm also matters greatly to URAML. We verify this by using the Auto-Encoder approach[24] to pre-train the meta-learner's representation module; we name this version URAML-AE. Its few-shot learning performance on the MiniImagenet dataset is shown in Tab.8, revealing that the unsupervised learning algorithm affects the meta-learner significantly: the better the unsupervised learning algorithm, the better the meta-learner performs. Perhaps the most promising way to enhance URAML is to develop better unsupervised learning algorithms and collect enough unlabeled data.

Method | 5-way 1-shot | 5-way 5-shot
URAML-AE | 33.29±0.71% | 43.60±0.66%
URAML-V1 | 45.91±0.79% | 61.04±0.71%
URAML-V2 | 48.82±0.79% | 62.84±0.78%
URAML | 49.56±0.79% | 63.42±0.76%
Table 8: Results of the ablation experiments on URAML, on the MiniImagenet dataset.

4.5 Cross-testing experiments

In existing meta-learning approaches, a meta-learner to be tested on 5-way 1-shot classification tasks must be trained on 5-way 1-shot tasks rather than on other tasks, and similarly, a meta-learner to be tested on 5-way 5-shot tasks must be trained on 5-way 5-shot tasks. This is because these approaches suffer from the TOF problem, a glaring issue for meta-learning.

We run extensive cross-testing experiments to evaluate how MAML, Meta-SGD, AML, RAML, and URAML fare on the TOF problem; the experimental results show that our RAML and URAML approaches suffer little from it.

In the cross-testing experiments, we train meta-learners with each meta-learning approach on 5-way K-shot image classification tasks on the MiniImagenet dataset, with K ∈ {1,3,5,7,9}, and test each meta-learner's performance on 5-way J-shot tasks, with J ∈ {1,3,5,7,9}. For example, we train a meta-learner by MAML on 5-way 3-shot tasks and test its performance on all 5-way J-shot tasks. The experimental results are shown in Fig.8.

Figure 8: Results of the cross-testing experiments among the MAML, Meta-SGD, AML, RAML, and URAML methods. We train the meta-learner on all 5-way K-shot training tasks, where K ∈ {1,3,5,7,9}, and test it on all 5-way J-shot testing tasks, where J ∈ {1,3,5,7,9}; all tasks are image classification tasks on the MiniImagenet dataset. Each row corresponds to a specific J-shot test task, and each column corresponds to a meta-learner trained by a specific meta-learning approach on specific K-shot tasks; each value is that meta-learner's testing accuracy. For example, the meta-learner trained by MAML on 9-shot tasks attains 39.69% accuracy on the 1-shot testing tasks, and the meta-learner trained by URAML on 1-shot tasks attains 65.52% accuracy on the 7-shot testing tasks.
dist_i = softmax(accs_i / max(accs_i))
d_ij = −Σ_k dist_i(k) · log(dist_j(k))
D = Σ_i Σ_j d_ij,   i, j ∈ {1,3,5,7,9}    (7)

Obviously, the meta-learner trained by MAML suffers seriously from the TOF problem, because the meta-learner that performs best on K-shot tasks does not perform well on J-shot tasks with K ≠ J. The meta-learner trained by URAML is troubled little by the TOF problem, because the meta-learner that performs best on K-shot tasks also performs best on J-shot tasks, for K, J ∈ {1,5,7,9}.

We design a metric, Cross-Entropy across Tasks (CET), to quantify how vulnerable a meta-learning approach is to the TOF problem. The calculation is shown in (7), where i ∈ {1,3,5,7,9} and boldface indicates a vector. accs_i is the vector of testing accuracies on i-shot tasks of the five meta-learners trained on 1-, 3-, 5-, 7-, and 9-shot tasks, and dist_i is the corresponding accuracy distribution. The distance d_ij measures the similarity between the accuracy distribution vectors dist_i and dist_j, and the total distance D aggregates the similarities between dist_i and dist_j over all i, j ∈ {1,3,5,7,9} for a given approach.

For example, the testing-accuracy vector accs_3 (testing accuracies on 3-shot tasks) of the different Meta-SGD meta-learners is [58.24%, 59.18%, 58.90%, 58.75%, 59.15%]. So we get dist_3 = softmax([58.24%, 59.18%, 58.90%, 58.75%, 59.15%] / 59.18%) = [0.116, 0.255, 0.202, 0.178, 0.249]; for another shot number j, dist_j = [0.122, 0.206, 0.255, 0.233, 0.184], giving d_3j = 1.603; summing over all pairs gives D = 34.22 for Meta-SGD.

It is clear that the smaller the total distance D, the less the meta-learning approach suffers from the TOF problem.
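A minimal NumPy sketch of the CET computation in (7), following the normalization described in the worked example above (the matrix layout and pairing of the distributions are our reading of that example):

```python
import numpy as np

def cet(acc):
    """Cross-Entropy across Tasks (CET), eq. (7).

    acc[i, j]: testing accuracy (in percent) on i-shot tasks of the
    meta-learner trained on j-shot tasks, for shots {1, 3, 5, 7, 9}.
    """
    # dist_i = softmax(accs_i / max(accs_i)): one distribution per test shot i.
    logits = acc / acc.max(axis=1, keepdims=True)
    e = np.exp(logits)
    dist = e / e.sum(axis=1, keepdims=True)
    # d_ij = cross-entropy between dist_i and dist_j; D sums over all pairs.
    return sum(-np.sum(dist[i] * np.log(dist[j]))
               for i in range(len(dist)) for j in range(len(dist)))
```

Note that the cross-entropy of the two example distributions above is indeed 1.603, matching the worked value of d_3j.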

Method | MAML | Meta-SGD | AML | RAML | URAML
CET | 57.19 | 34.22 | 33.35 | 32.13 | 32.16
Table 9: Performance of different meta-learning methods on the CET metric.

Tab.9 shows the different meta-learning approaches' scores on the CET metric. RAML and URAML perform best, while MAML performs worst.

The possible reasons for this result are:

  • The representation modules of RAML and URAML are pre-trained to learn and store knowledge that applies to most tasks, so the representation module is not biased toward any specific K-shot learning task.

  • Meta-learners trained by existing meta-learning approaches need to update the entire network given few data, while meta-learners trained by RAML and URAML only need to update the AAO module on top of the compact feature representations output by the representation module, which is an easier job.

An interesting phenomenon in Fig.8 is that the meta-learner trained by RAML on 5-way 9-shot tasks performs best on most of the test tasks, while the meta-learner trained by URAML on 5-way 1-shot tasks performs best on most of the test tasks. The possible reason is that the representation module of RAML learns knowledge by supervised learning while that of URAML learns by unsupervised learning, which makes the output features of the two kinds of representation module different.

4.6 Feature analysis

To understand the effect of the attention mechanism, we reduce the features f and f' (shown in Fig.3, Fig.5, and Fig.6) to a 2-dimensional space with the Principal Component Analysis (PCA) algorithm and visualize them on a 2D plane. As shown in Fig.9, we visualize f and f' for the meta-learners trained on 5-way 1-shot and 5-shot tasks with the AML, RAML, and URAML methods. The 500 feature points in each picture represent the f or f' of the images from the query set of a task, where the task is a 5-way 1-shot or 5-shot task randomly generated from the test set of the MiniImagenet dataset.

It is clear that the standard deviation of the inner-class distance of f' is smaller than that of f, and the standard deviation of the inter-class distance of f' is larger than that of f, indicating that across image classes the distribution of f' is more distinguishable than that of f. The reason for this phenomenon is simple: the attention mechanism makes the meta-learner pay more attention to the critical features, and the critical features affect f' more, which makes f' more discriminative than f for differentiating images of different classes.
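A minimal sketch of this visualization with scikit-learn and matplotlib (names are illustrative):

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_features_2d(feats, labels, title):
    """Project query-set features (f or f') to 2D with PCA and scatter-plot
    them, one color per class, as in Fig.9."""
    pts = PCA(n_components=2).fit_transform(feats)   # (500, d) -> (500, 2)
    plt.scatter(pts[:, 0], pts[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.show()
```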

Figure 9: Visualization of the feature-point distributions of the features f and f' for all three of our methods. We use the PCA algorithm to show f and f' in a 2D space and color the feature points with 5 colors, one per image class of the 5-way K-shot image classification task. D1 and D2 are the standard deviations of the inner-class distance and the inter-class distance respectively. It is clear that the inner-class distance becomes smaller and the inter-class distance larger after the feature passes through the attention model.

4.7 Heat-maps of f and f'

To further analyze how the attention mechanism affects the meta-learner, we visualize the heat-maps of f and f' in Fig.10. We show the meta-learner the support set of a random 5-way 1-shot classification task created from the test set of the MiniImagenet dataset, then feed it the corresponding query set, and finally obtain the heat-maps of f and f' for the query-set images.

From the heat-maps shown in Fig.10, it is clear that f', compared to f, is more sensitive to the distinguishable parts of the input image, revealing that the meta-learner shifts its attention toward the discriminative parts of the image. For example, the first row of Fig.10 shows a lion. Besides the lion's head, f is also sensitive to meaningless regions of the image, such as the background; the meta-learner, however, shrinks its attention region so that f' is sensitive almost only to the lion's face.

Through the visualization and analysis of the heat-maps of f and f', it is clear that the attention mechanism helps the meta-learner narrow its attention to the most distinguishable parts of the image, which in turn helps it perform better on few-shot classification tasks.

Figure 10: Images sampled from the query set of a 5-way 1-shot classification task, with the corresponding heat-maps of f and f'.

5 Conclusion

In this paper, we rethink the meta-learning algorithm. By briefly analyzing the human few-shot learning process, we identify the importance of the attention mechanism and of past knowledge for the meta-learner: the meta-learner should extract compact feature representations from the input data by making good use of past knowledge, and should solve few-shot learning tasks in the compact feature space rather than in the original image space. Moreover, we find that existing meta-learning approaches suffer from the TOF problem, which is unfriendly to practical applications.

We redesign the meta-learning algorithm and propose three methods: AML, RAML, and URAML. All of them achieve state-of-the-art performance on several few-shot learning benchmarks. Moreover, compared to MAML and Meta-SGD, all the proposed methods suffer less from the TOF problem, especially RAML and URAML, supporting our viewpoints and methods.

Although URAML does not perform as well as RAML, we consider it the most promising method, because there is ample room to improve it, which will also be the direction of our future work. From the results of our ablation study, two routes could improve the performance of URAML significantly. One is to develop better unsupervised or self-supervised learning algorithms. URAML works much better than URAML-AE solely because of the disparity between the two unsupervised learning approaches, Split-Brain and Auto-Encoder. Furthermore, RAML performs better than URAML, revealing that current unsupervised learning still falls behind supervised learning; bridging this gap would, with high probability, raise the performance of URAML toward that of RAML. The other route is to use more unlabeled data in the pre-training stage of URAML. Although 7.1 million unlabeled images are used to pre-train the representation module, this still falls dramatically short of the images humans have ever seen, in both quantity and quality. As for quantity, if a person watched 1 image per second for 15 hours a day, he or she would see about 100 million images in 5 years. As for quality, humans see the world in a multimodal way: we can not only see an object but also touch it and move around it, which helps us understand the world more accurately than current AI. In a word, both developing unsupervised or self-supervised learning algorithms and collecting more unlabeled images will help URAML perform well.

Acknowledgment

At the end of our paper, we would like to sincerely thank those who helped us with the experiments and the writing, especially Na Liu, Zichang Tan, Jun Wan, Junliang Xing, and Tao Mei; their advice helped us a lot.

References