Meta-Meta-Classification for One-Shot Learning

04/17/2020 ∙ by Arkabandhu Chowdhury, et al. ∙ IBM, Rensselaer Polytechnic Institute, Rice University

We present a new approach, called meta-meta-classification, to learning in small-data settings. In this approach, one uses a large set of learning problems to design an ensemble of learners, where each learner has high bias and low variance and is skilled at solving a specific type of learning problem. The meta-meta classifier learns how to examine a given learning problem and combine the various learners to solve the problem. The meta-meta-learning approach is especially suited to solving few-shot learning tasks, as it is easier to learn to classify a new learning problem with little data than it is to apply a learning algorithm to a small data set. We evaluate the approach on a one-shot, one-class-versus-all classification task and show that it is able to outperform traditional meta-learning as well as ensembling approaches.


1 Introduction

Meta-learning is often defined informally as “learning to learn” (Thrun & Pratt, 1998; Rendell et al., 1987). That is, rather than learning to solve a particular learning problem (such as an image classification problem) the goal in meta-learning is to solve many learning problems in an attempt to learn how to learn to solve a particular class of problems.

Meta-learning is a compelling approach for solving very small-data learning problems, such as one-shot or few-shot learning (Fei-Fei et al., 2006). One can generate a data set that consists of a large number of learning problems, where each problem has just a few training examples, and then use that set to learn how to solve learning problems with just a few examples. This contrasts with competing approaches such as transfer learning (Torrey & Shavlik, 2010), where one solves one or more learning problems and then adapts those solutions to a new, small-data learning problem. Meta-learning learns the learning process, rather than how to re-purpose an existing learner.

In this paper, we introduce a new approach to meta-learning, called meta-meta classification. Here, we use a large set of learning problems to design a set of different learners, each of which has high bias and low variance, so that it is skilled at solving a specific type of learning problem. Further, the meta-meta classifier also learns how to examine a new learning problem and select which of the learners should be used to solve that particular learning problem.

We call the method meta-meta classification to distinguish it from meta-classification, a term commonly used in ensemble methods (Dietterich, 2000). In ensembling, a meta-classifier is a classifier that aggregates the output from a family of learned scoring functions. For example, in bagging (Breiman, 1996), a meta-classifier may average the scores output from a family of scoring functions. In more sophisticated methods, the meta-classifier may itself be trained so that it learns to produce an accurate output from a set of less accurate scoring functions.

Figure 1: An aggregate scoring function realized via meta-meta-classification. The meta-meta-classifier uses the training set $D_{train}$ to select from among the parameterized learners $f_1$, $f_2$, and so on, to realize an aggregate scoring function $F$.

In contrast, by training over a corpus of learning problems rather than a single problem, a meta-meta-classifier designs a set of learners, while at the same time learning how to examine a new problem and choose which learners are best to solve that problem. Ultimately, given a new learning problem, the output of the meta-meta classifier is a problem-specific meta-classifier defined over the set of scoring functions produced by the learners. Note that while a meta-meta-classifier learns how to produce a meta-classifier, it is not itself a meta-classifier.

Meta-meta-classification is particularly natural for very small-data learning problems. The underlying assumption here is that it is easier to classify a new learning problem with little data than it is to solve the new learning problem with little data. Intuitively, this may be the case: learned scoring functions are successfully used all the time to look at a particular object and predict its label. It does not seem to be inherently more difficult to look at a single object and its label (or small set of labeled objects in the case of few-shot learning) and identify which learners may apply to solving the problem. If it is possible to look at a restricted number of training examples and choose an appropriately biased, low-variance learner that best applies to the learning task, then the variance reduction realized by choosing a learner that is highly biased for the problem may result in very low error, even on highly data-restricted problems.

Our contributions. We define a new meta-learning strategy called meta-meta-classification, in which a meta-meta-classifier is trained to recognize the type of learning task at hand, and to use that recognition to choose a biased, low-variance learner appropriate for the task. We show how this strategy can be used to learn a highly accurate aggregate scoring function, even for one-shot learning problems. For example, on a one-shot, one-class-versus-all classification task defined over the ImageNet corpus, meta-meta-classification is able to achieve greater than 82% test accuracy, compared to less than 61% test accuracy for the baseline meta-learning approach, and less than 67% for a comparatively-sized ensemble of meta-learners.

2 Background and Problem Definition

2.1 Meta-meta-classification: Overview

Meta-meta-classification is an approach to supervised learning that is particularly relevant to the problem of one-shot or few-shot learning, as it relies on learning a set of learners designed specifically to have high inductive bias as a way to prevent over-fitting, as well as how to apply those learners when a new learning problem is encountered.

Specifically, for input (feature) domain $\mathcal{X}$ and output (label) domain $\mathcal{Y}$, a meta-meta classifier takes as input a training set $D_{train}$ (a multi-set drawn from $\mathcal{X} \times \mathcal{Y}$), and then returns an aggregate scoring function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that combines the output of the learners (in the context of ensemble-based learning, this aggregate scoring function is sometimes referred to as a meta-classifier). As in all forms of supervised learning, the goal is to produce an aggregate scoring function that gives relatively high values to pairs from $\mathcal{X} \times \mathcal{Y}$ that tend to occur together.

In contrast to classical ensemble approaches (such as stacking (Wolpert, 1992)), in meta-meta classification, the aggregate scoring function is constructed without examining how well the individual scoring functions output by the learners perform on the training set (or on a test set). Instead, the meta-meta classifier learns through experience how the learners should be combined for different types of problems. This makes meta-meta classification particularly attractive for few-shot learning problems, as there is no need to have enough data to test the accuracy of the output of the learners.

A meta-meta classifier has two parts: a set of learners, and a meta-aggregation function.

The learners. In classical supervised learning, we have a single scoring function and a learning algorithm. But in meta-meta-classification, we instead assume an ensemble of $k$ learners, from which we wish to build an aggregate scoring function. The $i$th learner consists of a scoring function $f_i$, as well as a training algorithm $g_i$.

Let $\mathcal{D}$ be the set of all multi-sets drawn from $\mathcal{X} \times \mathcal{Y}$. The training algorithm $g_i$ maps a set of training examples drawn from $\mathcal{D}$ to a particular value for $\theta_i$. As is typical, the scoring function $f_i$ is parameterized on the parameter set $\theta_i$ chosen from parameter space $\Theta_i$ by the training algorithm. More atypically, the training algorithm $g_i$ is itself parameterized on a parameter set $\phi_i$. This parameter set can contain any parameters that control the learning process: the learning rate, the number of learning iterations, the set of parameters to initialize the learning algorithm, etc.

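The meta-aggregation function. The goal is to learn, by looking at a set of learning problems, how to examine a new problem and combine those learners to create a problem-specific meta-classifier $F$. The meta-aggregation function $h$ is given this task.

For a function $f$, let $f[a_1, \ldots, a_j]$ denote the function resulting from currying $f$ with respect to the first $j$ inputs, and then evaluating the resulting curried function at $a_1, \ldots, a_j$. Then $f_i[g_i(\phi_i, D_{train})]$ is the result of applying the training algorithm in learner $i$ (parameterized with $\phi_i$) to training set $D_{train}$, and then “pre-loading” the scoring function $f_i$ with the resulting $\theta_i$.

A meta-aggregation function examines $D_{train}$, and then conditioned on that $D_{train}$, combines each of the scoring functions to create a new, more accurate aggregate scoring function.

Formally, a meta-aggregation function is a function:

$$h : \Psi \times \mathcal{D} \times (\mathcal{X} \times \mathcal{Y} \to \mathbb{R})^k \to (\mathcal{X} \times \mathcal{Y} \to \mathbb{R})$$

where $\Psi$ is the parameter space of $h$, $\mathcal{D}$ is the set of all multi-sets drawn from $\mathcal{X} \times \mathcal{Y}$, and $k$ is the number of learners. By allowing the meta-aggregation function to examine the set $D_{train}$ and aggregate the scoring functions created by the learners, we obtain the aggregate scoring function

$$F = h\big(\psi, D_{train}, f_1[g_1(\phi_1, D_{train})], \ldots, f_k[g_k(\phi_k, D_{train})]\big)$$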

A depiction of how the learners and the meta-aggregation function together produce an aggregate scoring function is given in Figure 1.
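The composition described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Learner`, `toy_f`, `toy_g`, and `aggregate`, the one-dimensional inputs, and the distance-based weighting are all hypothetical choices made for the example. The text only requires that the meta-aggregation function examine the training set and combine the "pre-loaded" scoring functions.

```python
from dataclasses import dataclass
from functools import partial
from math import exp
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]               # (x, y) pair with y in {0, 1}
ScoreFn = Callable[[float, int], float]   # maps (x, y) to a real-valued score


@dataclass
class Learner:
    """One learner: a scoring function f(theta, x, y) and a training algorithm g(phi, D)."""
    f: Callable[[float, float, int], float]
    g: Callable[[float, Sequence[Example]], float]
    phi: float  # training-algorithm parameters (hypothetical: a prior center)

    def fit(self, d_train: Sequence[Example]) -> ScoreFn:
        theta = self.g(self.phi, d_train)   # run the training algorithm
        return partial(self.f, theta)       # "pre-load" f with theta (currying)


def toy_f(theta: float, x: float, y: int) -> float:
    # Score is high for y=1 when x is near theta, and high for y=0 otherwise.
    closeness = exp(-(x - theta) ** 2)
    return closeness if y == 1 else 1.0 - closeness


def toy_g(phi: float, d_train: Sequence[Example]) -> float:
    # Highly biased training: shrink the mean of the positive examples toward phi.
    pos = [x for x, y in d_train if y == 1]
    return 0.5 * phi + 0.5 * (sum(pos) / len(pos)) if pos else phi


def aggregate(d_train: Sequence[Example], learners: List[Learner]) -> ScoreFn:
    """Hypothetical meta-aggregation: weight each learner by how close its phi is
    to the positive training examples, then mix the curried scoring functions."""
    pos = [x for x, y in d_train if y == 1]   # assumes at least one positive example
    center = sum(pos) / len(pos)
    raw = [exp(-(center - learner.phi) ** 2) for learner in learners]
    weights = [r / sum(raw) for r in raw]
    scorers = [learner.fit(d_train) for learner in learners]
    return lambda x, y: sum(w * s(x, y) for w, s in zip(weights, scorers))


learners = [Learner(toy_f, toy_g, phi) for phi in (-3.0, 0.0, 3.0)]
F = aggregate([(2.9, 1)], learners)   # one-shot: a single positive example
print(F(3.0, 1) > F(-3.0, 1))
```

Here the weighting is fixed by hand; in the approach described above, the aggregation function and the learners' training parameters would themselves be learned from a corpus of problems.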

2.2 Intuition: Why Meta-meta-classification?

If the training set $D_{train}$ is large, it is unclear that there is much benefit to meta-meta-classification. For large $D_{train}$, we may choose a general-purpose learner with small inductive bias that works well regardless of the problem at hand. However, if $D_{train}$ is small, as in the case of one-shot learning, there may be a significant benefit to the introduction of a set of learners and a meta-meta-classifier. If sufficient information about the problem-generating distribution is available through past experience, then we may learn a high-quality meta-meta classifier. After learning the meta-meta classifier, a tiny training set $D_{train}$ may give enough information as to the exact nature of the classification task that the meta-aggregation function can accurately select an appropriate learner. This learner will ideally have high inductive bias, and be tailored to the specific learning problem. At the same time, it will hopefully have low variance, and will be accurate even when the learner has been trained on a very small $D_{train}$.

In fact, this is the benefit of meta-meta classification: it allows for the use of a set of highly biased, low-variance learners, each of which covers a small subset of the classification problems that are expected to be encountered.

For this to work, a key assumption is that the task of recognizing which type of learning problem we are faced with is less data-intensive than the task of actually solving the learning problem. Hence, faced with limited training data, we use that data to first determine which type of learning problem we are faced with, and then use a high-bias learner that has been designed to perform well on that specific class of problem.

2.3 Relationship to Other Approaches

Meta-meta-classification is related to several other ideas in machine learning. For example, consider neural architecture search (Zoph & Le, 2016; Pham et al., 2018) and related ideas. Both approaches effectively appeal to a meta-meta-classifier that attempts to choose the best learner for a given task. The key difference, however, is that neural architecture search typically assumes large $D_{train}$, so that the meta-meta-classifier is trivial: when evaluating a learner, one simply checks how accurate the learner is on a holdout set. If the learned model is accurate on the holdout set, the learner is a good choice. In meta-meta-classification, the assumption is that there is little data available to evaluate the accuracy of a constructed classifier, and so the meta-meta-classifier is introduced as an alternative to an accuracy test over a holdout set.

There is an obvious relationship between meta-meta-classification and boosting, bagging (Quinlan et al., 1996), and other ensemble methods. The aggregate scoring function enabled by the meta-meta-classifier is effectively controlling the use of an ensemble of learners. In ensemble methods, the function that aggregates the output from an ensemble of learners is often called a meta-classifier. However, the difference is that a meta-meta-classifier is trained to produce a task-specific meta-classifier; it is not itself a meta-classifier. By looking at a large number of learning problems, the meta-meta-classifier learns how to select appropriate, high-bias, low-variance learners from a set of learners, few of which are useful for any particular classification task.

Meta-meta-classification is related to other meta-learning approaches, for example, (Finn et al., 2017), as they also assume a distribution of learning tasks $\mathbb{P}$, and apply meta-learning to try to solve the one-shot learning problem. The key difference is that Finn et al.’s approach can be seen as trying to design a single learner (scoring function plus training algorithm) that works well for small-sized $D_{train}$, for any data-generating distribution $P$ sampled according to $\mathbb{P}$, rather than attempting to match the present learning task with an appropriate classifier.

3 Related Work

Meta-meta-classification broadly falls under the meta-learning or “learning to learn” paradigm (Hinton & Plaut, 1987; Thrun & Pratt, 1998; Bengio et al., 1992) which has been shown to produce promising results on few-shot classification problems. Meta-learning methods can be divided into three categories.

First are metric-based methods (Koch et al., 2015; Hadsell et al., 2006; Fink, 2005; Schroff et al., 2015; Shyam et al., 2017; Snell et al., 2017; Goldberger et al., 2005; Vinyals et al., 2016; Taigman et al., 2015) which aim to learn a similarity function or a distance metric between a pair of different samples. Neighborhood Components Analysis (NCA) (Goldberger et al., 2005) learns a Mahalanobis distance to maximize K-nearest-neighbors (KNN) leave-one-out accuracy. Siamese networks (Koch et al., 2015) use a pairwise verification loss to perform nearest-neighbours classification. Matching Networks (Vinyals et al., 2016) combine both embedding and classification to form an end-to-end differentiable nearest neighbours classifier. Prototypical Networks (Snell et al., 2017) apply an inductive bias in the form of class prototypes without full context embeddings.

Second are memory-augmented methods (Munkhdalai & Yu, 2017; Mishra et al., 2017; Duan et al., 2016; Wang et al., 2018; Santoro et al., 2016; Oreshkin et al., 2018) that learn to adjust model states using memory-augmented recurrent networks. For example, (Santoro et al., 2016) represents entries from a sample set in an external memory, AdaResNet (Munkhdalai et al., 2017) uses memory and the sample set to produce conditionally shifted neuron coefficients for the query set, and SNAIL (Mishra et al., 2017) uses an explicit attention mechanism to leverage specific information from past experience.

Third are optimization-based methods (Finn et al., 2017, 2018; Yoon et al., 2018; Lee & Choi, 2018; Grant et al., 2018; Nichol & Schulman, 2018; Rusu et al., 2018; Rothfuss et al., 2018; Ravi & Larochelle, 2016; Zhang et al., 2018) that learn a network initialization that can quickly adapt to new tasks within a distribution of tasks with very few steps of regular gradient descent. MAML (Finn et al., 2017) backpropagates the meta-loss through an inner learning loop, Reptile (Nichol & Schulman, 2018) incorporates an L2 loss that updates the meta-model parameters towards the task-specific models, and (Lee & Choi, 2018) learns a layer-wise subspace where gradient-based adaptation is done. However, since all of these meta-learners sample a task from a task-distribution to learn the initial parameters, they can be prone to overfitting (Mishra et al., 2017).

4 Learning a Meta-meta-classifier

4.1 Background

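Assume a universe of probability distributions $\mathcal{U}$, each defined over the domain $\mathcal{X} \times \mathcal{Y}$, and a distribution $\mathbb{P}$ defined over this universe. Hence $\mathbb{P}$ is a distribution of distributions. Now, consider the following hierarchical stochastic process for generating a triple $(D_{train}, x, y)$ from $\mathbb{P}$:

  1. Sample $P \sim \mathbb{P}$

  2. Sample $D_{train} \sim P$

  3. Sample $(x, y) \sim P$

Here, $D_{train}$ is a training data set, and $(x, y)$ is a test pair.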
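The three sampling steps above can be sketched as follows. This is an illustrative toy under stated assumptions: the family of problem distributions (two Gaussians whose separation varies from problem to problem) and the names `sample_problem_distribution` and `sample_triple` are hypothetical and do not come from the paper.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[float, int]
Distribution = Callable[[random.Random], Pair]   # draws one (x, y) pair


def sample_problem_distribution(rng: random.Random) -> Distribution:
    """Step 1: draw a data-generating distribution from the distribution of
    distributions. Hypothetical family: positives ~ N(mu, 1), negatives ~ N(-mu, 1)."""
    mu = rng.uniform(0.5, 3.0)

    def draw(r: random.Random) -> Pair:
        y = r.randint(0, 1)
        x = r.gauss(mu if y == 1 else -mu, 1.0)
        return (x, y)

    return draw


def sample_triple(rng: random.Random, n_train: int) -> Tuple[List[Pair], float, int]:
    p = sample_problem_distribution(rng)          # 1. sample a problem distribution
    d_train = [p(rng) for _ in range(n_train)]    # 2. sample the training set from it
    x_test, y_test = p(rng)                       # 3. sample a test pair from it
    return d_train, x_test, y_test


rng = random.Random(0)
d_train, x_test, y_test = sample_triple(rng, n_train=5)
print(len(d_train), y_test in (0, 1))   # prints: 5 True
```

The important property is that the training set and the test pair come from the same sampled problem distribution, while different triples generally come from different ones.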

Assume some loss function $\ell$. That is, $\ell$ takes as an argument a scoring function defined over domain $\mathcal{Y}$, a “true” value for the output selected from $\mathcal{Y}$, and scores how accurately the scores reflect the “true” output. Generally, any loss function can be used for $\ell$: squared error if $\mathcal{Y}$ is the set of real numbers, cross-entropy if $\mathcal{Y}$ is a set of categories, etc. For example, for a scoring function , the squared error loss function is:

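The goal when learning a meta-meta classifier is to choose $\psi, \phi_1, \ldots, \phi_k$ from the parameter space so as to minimize the expected loss of the meta-meta classifier (or the “meta-loss”):

$$\mathbb{E}_{(D_{train}, x, y)}\big[\ell(F[x], y)\big]$$

where the expectation is over triples generated by the process above, $F$ is the aggregate scoring function built from $D_{train}$, and $F[x]$ is $F$ with its first input fixed to $x$.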

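  Meta-Learn ($\mathbb{P}$, $b$, $n$, $m$)
  // $\mathbb{P}$: Distribution of distributions to learn from
  // $b$: Meta-learning batch size (# of problems)
  // $n$: # of training instances in a learning problem
  // $m$: # of test instances to evaluate a scoring function
  Initialize $\psi, \phi_1, \ldots, \phi_k$
  while loss decreases do
     for $j = 1$ to $b$ do
        Sample $P_j \sim \mathbb{P}$
        Sample $D_{train,j}$ of size $n$ from $P_j$
        Sample $D_{test,j}$ of size $m$ from $P_j$
     end for
     $(\psi, \phi_1, \ldots, \phi_k) \leftarrow (\psi, \phi_1, \ldots, \phi_k) - \beta \nabla \sum_{j} \sum_{(x, y) \in D_{test,j}} \ell(F_j[x], y)$
     // $F_j$: aggregate scoring function built from $D_{train,j}$; $\beta$: meta-learning rate
  end while
  return $\psi, \phi_1, \ldots, \phi_k$
Algorithm 1 End-to-End Gradient Descent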

There are many possible instantiations of this idea. We now briefly describe a couple of them.

4.2 Example: End-to-End Gradient Descent

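Assume that each of the learners utilizes gradient descent, and that $f_i$ is differentiable with respect to $\theta_i$. Further, assume that $g_i$ performs one gradient update at learning rate $\alpha$ using $\phi_i$ as the initialization of the gradient descent, so that $\Phi_i = \Theta_i$ and:

$$g_i(\phi_i, D_{train}) = \phi_i - \alpha \nabla_{\theta_i} L_i(\phi_i)$$

Here, $\nabla_{\theta_i} L_i(\phi_i)$ denotes “the gradient of the $i$th loss function with respect to parameter set $\theta_i$, evaluated at $\phi_i$.”

Then, letting $\omega = (\psi, \phi_1, \ldots, \phi_k)$, we can run a gradient descent algorithm to learn the meta-aggregation function parameters as well as each of the parameters for the various learners. Assuming meta-learning rate $\beta$, we repeatedly sample $(D_{train}, x, y)$ and for each sample, apply the following update rule:

$$\omega \leftarrow \omega - \beta \nabla_{\omega} \ell(F[x], y)$$

where $F$ is the aggregate scoring function built from $D_{train}$ under the current $\omega$.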

Note that it is easily possible to extend this to training algorithms that perform more than a single gradient update; this merely requires expanding the expression computed by for an appropriate number of gradient steps. In practice, however, only a small number of gradient updates will be used in a small-data setting; a large number of steps will typically result in over-fitting.

Also, in practice, it may make sense to back-propagate the meta-loss from more than a single test pair, as more test pairs may give a more stable estimate of the meta-loss and decrease time-until-convergence.

Finally, there is nothing preventing the use of a batch of learning problems during each iteration of gradient descent. Again, this may result in a more stable algorithm that takes less time to converge.

The full algorithm for end-to-end gradient descent, which uses a batch of learning problems as well as an arbitrarily-sized test set for back-propagation is in Algorithm 1.
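To make the idea of back-propagating a meta-loss through a one-step training algorithm concrete, here is a deliberately small sketch. It is not the paper's setup: it assumes a single learner with a scalar parameter, a linear scoring function `theta * x`, a squared-error loss, and no meta-aggregation function, so that the meta-gradient can be written out by hand. The constants `ALPHA` and `BETA` and the task family in `sample_task` are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[float, float]
Task = Tuple[List[Pair], Pair]
ALPHA = 0.05   # inner learning rate (assumed value)
BETA = 0.01    # meta-learning rate (assumed value)


def inner_update(phi: float, d_train: List[Pair]) -> float:
    """One gradient step from initialization phi on the mean squared error
    of the scoring function theta * x over the training set."""
    grad = sum(2.0 * (phi * x - y) * x for x, y in d_train) / len(d_train)
    return phi - ALPHA * grad


def meta_grad(phi: float, d_train: List[Pair], test: Pair) -> float:
    """Gradient of the test loss (theta' * x - y)^2 with respect to phi,
    back-propagated through the inner update theta' = inner_update(phi, D)."""
    theta = inner_update(phi, d_train)
    x_t, y_t = test
    dtheta_dphi = 1.0 - ALPHA * sum(2.0 * x * x for x, _ in d_train) / len(d_train)
    return 2.0 * (theta * x_t - y_t) * x_t * dtheta_dphi


def sample_task(rng: random.Random) -> Task:
    w = rng.gauss(2.0, 0.1)                        # task-specific slope
    x_train, x_test = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    return [(x_train, w * x_train)], (x_test, w * x_test)   # one-shot train set, one test pair


def meta_loss(phi: float, tasks: List[Task]) -> float:
    return sum((inner_update(phi, d) * x - y) ** 2 for d, (x, y) in tasks) / len(tasks)


rng = random.Random(0)
held_out = [sample_task(rng) for _ in range(200)]
phi = 0.0
before = meta_loss(phi, held_out)
for _ in range(2000):
    d_train, test = sample_task(rng)
    phi -= BETA * meta_grad(phi, d_train, test)   # meta-update on one sampled problem
after = meta_loss(phi, held_out)
print(after < before)
```

With several learners and a learned aggregation function, the same chain rule applies to every parameter at once; an automatic differentiation library would normally compute it rather than a hand-written derivative.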

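  Meta-Learn ($\mathbb{P}$, $e$, $b$, $n$, $m$)
  // $\mathbb{P}$: Distribution of distributions to learn from
  // $e$: Embedding function for problem instance
  // $b$: Meta-learning batch size (# of problems)
  // $n$: # of training instances in a learning problem
  // $m$: # of test instances to evaluate a scoring function
  Initialize $\psi, \phi_1, \ldots, \phi_k$
   // Cluster a set of $N$ sampled problem instances
  $S \leftarrow \{\}$
  for $j = 1$ to $N$ do
     Sample $P_j \sim \mathbb{P}$
     Add $e(P_j)$ to $S$
  end for
  Run $k$-means on $S$ to obtain centroids $c_1, \ldots, c_k$
   // Create and partition a set of training distributions
  $\mathcal{T}_i \leftarrow \{\}$ for $i = 1$ to $k$
  for $j = 1$ to $N$ do
     Sample $P_j \sim \mathbb{P}$
     Add $P_j$ to $\mathcal{T}_i$ for $i = \arg\min_{i'} \lVert e(P_j) - c_{i'} \rVert$
  end for
   // Learn each of the training algorithms
  for $i = 1$ to $k$ do
     while loss decreases do
        for $j = 1$ to $b$ do
           Sample $P_j$ from $\mathcal{T}_i$
           Sample $D_{train,j}$ of size $n$ from $P_j$
           Sample $D_{test,j}$ of size $m$ from $P_j$
        end for
        $\phi_i \leftarrow \phi_i - \beta \nabla_{\phi_i} \sum_{j} \sum_{(x, y) \in D_{test,j}} \ell(f_i[g_i(\phi_i, D_{train,j}), x], y)$
     end while
  end for
   // Now, learn $\psi$
  while loss decreases do
     for $j = 1$ to $b$ do
        Sample $P_j \sim \mathbb{P}$
        Sample $D_{train,j}$ of size $n$ from $P_j$
        Sample $D_{test,j}$ of size $m$ from $P_j$
     end for
     $\psi \leftarrow \psi - \beta \nabla_{\psi} \sum_{j} \sum_{(x, y) \in D_{test,j}} \ell(F_j[x], y)$
     // $F_j$: aggregate scoring function built from $D_{train,j}$
  end while
  return $\psi, \phi_1, \ldots, \phi_k$
Algorithm 2 Three-Step-Meta-Learning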

4.3 Example: Clustering Plus Gradient Descent

Unfortunately, the algorithm from the previous subsection may not work well in practice. Note that while the meta-meta classifier is being trained in a supervised manner (the goal is to learn a meta-meta classifier that can generate an accurate meta-classifier), in one important sense, the algorithm is unsupervised.

Ultimately, the meta-aggregation function $h$ must look at a specific training set $D_{train}$ and determine which of the learners is most appropriate for the underlying problem. If, at the time that $h$ is being learned, the learners themselves are being learned, this may be viewed as an unsupervised task; it is unclear how to segment the possible problems in $\mathbb{P}$ into categories so that a reasonable learner or learners can be designed for each category.

In practice, unsupervised learning tasks are notoriously sensitive to initialization. Few machine learning practitioners running a $k$-means algorithm would sample the initial means from a Normal distribution, for example, as this would likely produce terrible results. Instead, the initial means may be sampled from the data set to be clustered.

Unfortunately, learning a meta-meta classifier consisting of a number of neural network learners via full gradient descent (Algorithm 1), starting with a typical, random neural-network initialization for individual learning parameters $\phi_i$, is akin to initializing a $k$-means algorithm poorly. In practice, all $\phi_i$ values will be terrible, but one will be slightly less terrible than the others, and the meta-aggregation function will learn to route most problems to the corresponding learner. As a result, the other learners are starved of training data and ignored, and the learned solution is equivalent to what would have been returned from the MAML method (Finn et al., 2017).

One way around this is to sample a large number of distributions from $\mathbb{P}$ and explicitly cluster those distributions as a separate step. This requires having some way to cluster distributions of problems; we assume a problem-specific embedding function $e$ that is able to map problem distributions (possibly non-deterministically) into a high-dimensional space, where they can be clustered using a $k$-means algorithm (here $k$ is the number of learners that are to be meta-learned).

A procedure that uses such an explicit clustering step is depicted in Algorithm 2. The procedure is depicted pictorially in Figure 2. After first producing the clusters of problem distributions, one learner is meta-learned per distribution cluster. Then, in a final step, the procedure trains the meta-aggregation function so that it is able to combine the output of the learners.

Finally, we point out that Algorithm 1 and Algorithm 2 can be used together. Algorithm 2 could be used to produce a high-quality initialization that is refined using Algorithm 1; the combined procedure is likely to outperform either individual methodology.
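The clustering and partitioning steps of the three-step procedure can be sketched with a small, dependency-free $k$-means. This is an assumption-laden illustration: the two-dimensional synthetic embeddings stand in for the output of the embedding function, and the helper names `k_means` and `nearest` are hypothetical. Consistent with the remark on initialization above, the initial means are sampled from the data.

```python
import random
from typing import List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(point: Sequence[float], centroids: List[Vector]) -> int:
    """Index of the closest centroid, i.e. which learner's cluster a problem falls into."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def k_means(points: List[Vector], k: int, rng: random.Random, iters: int = 20) -> List[Vector]:
    centroids = [list(p) for p in rng.sample(points, k)]   # initial means drawn from the data
    for _ in range(iters):
        buckets: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:   # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(bucket) for c in zip(*bucket)]
    return centroids


# Hypothetical embeddings of sampled problems: two well-separated groups in 2-D.
rng = random.Random(0)
embeddings = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(50)]
embeddings += [[rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)] for _ in range(50)]

centroids = k_means(embeddings, k=2, rng=rng)
partition: List[List[int]] = [[] for _ in centroids]
for idx, emb in enumerate(embeddings):
    partition[nearest(emb, centroids)].append(idx)   # problems routed to learner i
print([len(part) for part in partition])
```

Each entry of `partition` would then serve as the pool of problems on which one learner is meta-learned, before the aggregation function is trained over all problems.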

Figure 2: Learning a meta-meta classifier utilizing a pre-clustering of learning problems.

k   Whole data        Whole data        MM-classifier     Nearest cluster   Meta-meta
    hard bagging      soft bagging      on whole data                       classifier
2   61.87 (0.0022)    62.27 (0.0024)    62.79 (0.0022)    61.71 (0.0025)    66.26 (0.0020)
4   62.48 (0.0023)    61.61 (0.0024)    63.74 (0.0023)    69.53 (0.0022)    74.02 (0.0017)
8   62.82 (0.0024)    62.40 (0.0025)    64.28 (0.0023)    74.45 (0.002)     77.92 (0.0017)
16  63.12 (0.0024)    63.34 (0.0025)    66.11 (0.0024)    74.70 (0.0022)    82.49 (0.0016)
Table 1: ImageNet ILSVRC2012 results. Each entry is the observed test accuracy (%), with the 95% confidence interval, computed over 10,000 problems, in parentheses. $k$ denotes the number of models trained.

k   Whole data        Whole data        MM-classifier     Nearest cluster   Meta-meta
    hard bagging      soft bagging      on whole data                       classifier
2   63.60 (0.0022)    64.38 (0.0025)    64.63 (0.0025)    71.53 (0.0020)    70.87 (0.0020)
4   66.36 (0.0022)    66.27 (0.0023)    66.99 (0.0023)    69.76 (0.0022)    72.44 (0.0017)
8   66.94 (0.0024)    67.04 (0.0025)    67.52 (0.0025)    74.29 (0.0015)    77.98 (0.0014)
16  67.21 (0.0025)    67.72 (0.0026)    69.61 (0.0026)    84.04 (0.0012)    85.67 (0.0011)
Table 2: Cross-domain results (meta-learning on ImageNet, test on CUB2011). Entries are test accuracy (%) with the 95% confidence interval in parentheses; $k$ is the number of models trained.

5 Experimental Evaluation


k   Whole data        Whole data        MM-classifier     Nearest cluster   Meta-meta
    hard bagging      soft bagging      on whole data                       classifier
2   65.39 (0.0098)    65.66 (0.0094)    69.57 (0.0077)    68.88 (0.0097)    70.65 (0.0082)
4   70.62 (0.0085)    71.03 (0.0083)    73.00 (0.0066)    71.72 (0.0087)    76.05 (0.0072)
8   71.84 (0.0083)    72.23 (0.0085)    75.93 (0.0061)    73.35 (0.0085)    78.61 (0.0072)
Table 3: Aircraft data set results. Entries are test accuracy (%) with the 95% confidence interval in parentheses; $k$ is the number of models trained.

k   Whole data        Whole data        MM-classifier     Nearest cluster   Meta-meta
    hard bagging      soft bagging      on whole data                       classifier
2   71.24 (0.011)     70.69 (0.0092)    73.26 (0.011)     73.57 (0.0103)    78.70 (0.0097)
4   73.83 (0.011)     77.32 (0.0092)    79.16 (0.0067)    77.07 (0.0088)    85.27 (0.0058)
8   77.70 (0.0099)    77.61 (0.0088)    85.25 (0.0062)    80.15 (0.0084)    90.87 (0.0047)
16  79.38 (0.0088)    79.56 (0.0097)    88.04 (0.0059)    82.02 (0.0082)    92.07 (0.0044)
Table 4: Omniglot data set results. Entries are test accuracy (%) with the 95% confidence interval in parentheses; $k$ is the number of models trained.

We evaluate the utility of meta-meta classification for a series of one-shot image classification tasks, where the goal is to recognize, given a single example, members of a single class which are mixed in with a number of other, “background” classes. We wish to answer two key questions. First, does increasing $k$ (the number of learners) actually increase classification accuracy? Second, does meta-meta classification outperform a simple ensemble of meta-learners? That is, does the biased ensembling of meta-meta classification outperform the simple tactic of just using a number of independent meta-learners?

Meta-Meta Image Classification. We consider several different image classification tasks, but the first is to learn to classify images from the ImageNet database. We used the ILSVRC2012 dataset (Russakovsky et al., 2015), the most popular flavor of ImageNet data. For each learning problem we have one “positive” class and 50 “negative” classes selected from ILSVRC2012, and we are given one positive example as well as 50 negative examples sampled from the mixture of 50 negative classes (some negative classes may have multiple samples, and some may not be represented in the sample set). We hold back 10% of the 1000 ILSVRC2012 classes for testing, and 90% of the classes are available for meta-learning.

Meta-learning relies on being able to generate a distribution of learning problems. To generate a learning problem, we sample 51 classes from the ILSVRC2012 classes available for meta-learning, and one is randomly designated as a “positive” class. When learning the meta-meta classifier, the training set $D_{train}$ is generated by sampling one image from the positive class selected, and 50 images from the 50 negative classes.
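The problem-generation step just described can be sketched as follows. The catalog of class names and image identifiers is a hypothetical stand-in for the ILSVRC2012 classes reserved for meta-learning, and `sample_problem` is an illustrative helper rather than code from the paper.

```python
import random
from typing import Dict, List, Tuple


def sample_problem(
    images_by_class: Dict[str, List[str]],
    rng: random.Random,
    n_negative_classes: int = 50,
    n_negative_images: int = 50,
) -> Tuple[str, List[Tuple[str, int]]]:
    """Return the positive class and a training set of (image_id, label) pairs."""
    classes = rng.sample(sorted(images_by_class), n_negative_classes + 1)
    positive, negatives = classes[0], classes[1:]   # sample order is already random
    d_train = [(rng.choice(images_by_class[positive]), 1)]   # one positive example
    for _ in range(n_negative_images):
        # Draw from the mixture of negative classes: a class may repeat or be absent.
        cls = rng.choice(negatives)
        d_train.append((rng.choice(images_by_class[cls]), 0))
    return positive, d_train


# Hypothetical stand-in for the 900 meta-learning classes, with 20 image ids each.
catalog = {f"class_{c:03d}": [f"class_{c:03d}/img_{i}" for i in range(20)] for c in range(900)}
positive, d_train = sample_problem(catalog, random.Random(0))
print(len(d_train), sum(label for _, label in d_train))   # prints: 51 1
```

Sampling the negative images from the mixture, rather than one per class, matches the earlier remark that some negative classes may contribute several images and others none.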

Each $f_i$ is the convolutional network architecture used by (Finn et al., 2017), which has 4 modules, each with 3×3 convolutions and 32 filters, a ReLU nonlinearity, and 2×2 max-pooling. The scoring function is realized using a fully connected layer after the convolutions, and the last layer is fed into a softmax. Each $\phi_i$ is the initial set of weights used when training the $i$th network. During training, five iterations of gradient descent are performed.

The meta-aggregation function $h$ is realized by a simple, fully-connected neural network with two 256-neuron hidden layers. As input, this network accepts:

  1. $f_i(\theta_i, x, \text{no})$ for $i$ in $\{1, \ldots, k\}$ (that is, the “no” score each learner gives to the test image $x$)

  2. $f_i(\theta_i, x, \text{yes})$ for $i$ in $\{1, \ldots, k\}$ (the “yes” score that each learner gives to the test image)

  3. The 512-dimensional output of a ResNet network (He et al., 2016), where the final classification layers have been dropped, applied to the positive image in $D_{train}$. This encoding allows the meta-aggregation function to classify the classification problem.

Here, $\psi$ consists of the weights used in the fully-connected neural network, as well as the ResNet network used to encode $D_{train}$.
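As a quick check on the dimensions involved, the three inputs listed above can be concatenated into a single vector. The helper `aggregation_input` and the placeholder score values are hypothetical; only the counts (one “no” and one “yes” score per learner, plus a 512-dimensional embedding) come from the text.

```python
from typing import List, Sequence


def aggregation_input(
    no_scores: Sequence[float],
    yes_scores: Sequence[float],
    problem_embedding: Sequence[float],
) -> List[float]:
    """Concatenate the "no" scores, the "yes" scores, and the embedding of the
    positive training image into one input vector for the aggregation network."""
    if len(no_scores) != len(yes_scores):
        raise ValueError("each learner must supply one 'no' and one 'yes' score")
    return list(no_scores) + list(yes_scores) + list(problem_embedding)


k = 16   # number of learners
features = aggregation_input([0.4] * k, [0.6] * k, [0.0] * 512)
print(len(features))   # prints: 544, i.e. 2 * 16 + 512
```

For the largest configuration evaluated, the first hidden layer of the aggregation network would therefore see a 544-dimensional input under this layout.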

When using the three-step training process, our embedding function $e$ samples a training set from the distribution, and pushes the positive training instance in that set through a pre-trained ResNet network. We trained a modified ResNet-152 classifier on the classes reserved for meta-learning and used the penultimate layer for feature extraction. We changed the number of output channels of the convolutions from [64, 128, 256, 512] to [64, 64, 128, 256] and the block expansion from 4 to 2. This was done just to decrease the extracted feature size from the usual 2048 to 512.

Finally, each $\phi_i$ is the starting parameters of the gradient descent used by the $i$th learner. Hence, in this instantiation of meta-meta classification, we are learning a set of MAML learners (Finn et al., 2017).

Additional One-Shot Learning Problems. We test three additional one-shot learning problems.

(1) Meta-learn on ImageNet ILSVRC2012, test on the CUB2011 Birds data set (Welinder et al., 2010). In this task, meta-learning is performed exactly as described above, on 900 classes selected from the ImageNet ILSVRC2012 data set. However, the testing distribution is different. Each positive class for testing is selected from among the CUB2011 Birds data set, and the negative classes are selected from among the 100 classes held back from the ILSVRC2012 data set. The goal is to perform cross-domain testing.

(2) Meta-learn on 87 classes from the Aircraft data set (Maji et al., 2013), test on 15 classes. During testing, one of the 15 test classes is chosen as the positive class, the other 14 classes are the negative classes. One training image is available from the positive class, and 50 from the 14 negative classes. The goal is to perform fine-grained testing.

(3) Meta-learn on 1200 characters from the Omniglot data set (Lake et al., 2015), test on 423 characters. During testing, a letter from the testing set is selected as the positive class, and 50 other test letters are selected as negative classes. Again, one image from the positive class is available, and 50 images of the other letters are available.

Competitive Methods Tested. To evaluate the efficacy of our ideas, we compare meta-meta classification against ensembles of meta-learners. In our experiments, the individual meta-learners in the ensemble are MAML learners (Finn et al., 2017). While a number of improvements to MAML have been suggested in the last couple of years (several of which were described in the Related Work section of this paper), we use MAML as a comparison point because our meta-meta classifier is effectively learning a set of MAML models. This facilitates an apples-to-apples comparison, though we note that MAML (both in meta-meta classification, and in the ensemble) could be replaced with any reasonable alternative.

Overall, we evaluate the following five classifiers: (1) Whole-data hard bagging: this is hard bagging over an ensemble of MAML models all trained on the entire data set. (2) Whole-data soft bagging: soft bagging over an ensemble of MAML models. (3) Meta-meta classifier on whole data: here we first learn a set of MAML models, each on the whole data, but then learn a meta-meta classifier (step three of three-step meta-learning) on the MAML models. This is useful for testing the utility of segmenting the data. (4) Nearest cluster: this is essentially the first two steps of three-step meta-learning, with the final classifier replaced with a simple nearest neighbor classifier on the ResNet features. (5) Meta-meta classifier: this is the full three-step meta-learning.
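The text does not spell out how hard and soft bagging are computed. One common reading, assumed here purely for illustration, is that hard bagging takes a majority vote over each model's thresholded decision, while soft bagging averages the models' scores before thresholding. The function names and the 0.5 threshold are hypothetical.

```python
from typing import Sequence


def soft_bagging(yes_scores: Sequence[float]) -> int:
    """Average the ensemble's "yes" probabilities, then threshold at 0.5."""
    return int(sum(yes_scores) / len(yes_scores) > 0.5)


def hard_bagging(yes_scores: Sequence[float]) -> int:
    """Threshold each model's score first, then take a majority vote."""
    votes = sum(1 for s in yes_scores if s > 0.5)
    return int(votes > len(yes_scores) / 2)


# Three models are mildly negative and one is confidently positive.
scores = [0.45, 0.45, 0.45, 0.99]
print(hard_bagging(scores), soft_bagging(scores))   # prints: 0 1
```

The example shows how the two rules can disagree: a single confident model can sway the averaged score but not the vote. Neither rule looks at the training set, which is the difference from the meta-aggregation function.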

Results. For each data set and each of the five competitive methods, we test a variety of different $k$ values ($k = 2$, $4$, $8$, and $16$, though due to the small number of classes in the Aircraft data set, we omit the size-16 model there). In each case, we randomly generate 10,000 learning problems to evaluate each method, and the method is scored using accuracy on 50 positive and 50 negative examples. All results (including average accuracy and 95% confidence interval width) are given in Tables 1, 2, 3, and 4. For comparison, a single MAML model achieved 60.78% accuracy on ImageNet ILSVRC2012, 62.37% accuracy on the cross-domain bird recognition problem, and 64.92% and 65.95% accuracy on the Aircraft and Omniglot problems, respectively.
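The paper does not state how its confidence intervals were computed. One standard possibility, shown here only as an assumption, is a normal-approximation interval over the per-problem accuracies. The simulated accuracies below are synthetic and are not the paper's data.

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy_with_ci(per_problem_accuracy: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Mean accuracy and the half-width of a normal-approximation 95% interval."""
    n = len(per_problem_accuracy)
    return mean(per_problem_accuracy), z * stdev(per_problem_accuracy) / sqrt(n)


# Synthetic per-problem accuracies: 10,000 problems, 100 test examples each,
# with every test example classified correctly with probability 0.8.
rng = random.Random(0)
accs = [sum(rng.random() < 0.8 for _ in range(100)) / 100 for _ in range(10_000)]
avg, half_width = accuracy_with_ci(accs)
print(f"{avg:.4f} +/- {half_width:.4f}")
```

Because the interval shrinks with the square root of the number of problems, evaluating on 10,000 problems yields narrow intervals even when individual problems vary considerably.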

Discussion. Across all of the learning tasks, the meta-meta classifier consistently had the best accuracy—often considerably higher than the other options, and much higher than a single MAML model. For example, on the ILSVRC2012 data set, a meta-meta classifier with 16 classes obtains more than 82% accuracy, compared to just under 61% accuracy with a single MAML model.

We close by considering the question: when will meta-meta classification fail? Meta-meta classification relies on finding problems at deployment that are similar to those encountered during meta-learning. In our implementation, “similarity” among problems is determined by looking at the positive example. In a situation where the positive image may not be enough to classify the problem, meta-meta classification will fail. However, extending the notion of similarity to take into account both positive and negative classes is not necessarily easy, and there is a concern that if both positive and negative examples are considered, problems will not segment as easily. This is a question for future work.
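
The failure mode described above can be made concrete with a small sketch. It assumes that a problem is routed to a cluster by the Euclidean distance between the feature vector of its positive example and a set of cluster centroids; the centroids, the two-dimensional features, and the function name are hypothetical stand-ins for the ResNet features used in our implementation.

```python
import numpy as np

def route(problem, centroids):
    """Pick a cluster index using only the positive example's features."""
    dists = np.linalg.norm(centroids - problem["positive"], axis=1)
    return int(np.argmin(dists))

# Hypothetical cluster centroids in a 2-D feature space.
centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])

# Two problems share a positive example but have very different negatives.
problem_a = {"positive": np.array([9.0, 8.0]),
             "negatives": np.array([[0.0, 1.0]])}
problem_b = {"positive": np.array([9.0, 8.0]),
             "negatives": np.array([[9.5, 8.5]])}

cluster_a = route(problem_a, centroids)
cluster_b = route(problem_b, centroids)
print(cluster_a, cluster_b)
```

Both problems are routed to the same cluster, even though the second one has negatives that lie very close to the positive example and is arguably a different kind of problem. This is the situation in which the positive image alone may not be enough to classify the problem.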

6 Conclusion

We have explored a new type of meta-learning, called meta-meta classification. The idea in meta-meta classification is to learn a set of meta-learners tailored to different problem types, as well as a function called a “meta-meta classifier” that is able to look at a particular problem and decide how to combine the meta-learners to solve that problem. Thus, a meta-meta classifier itself meta-learns to produce a meta-classifier over the output of the meta-learners. Meta-meta classification is predicated on the assumption that it is easier to classify a problem (and choose an appropriate set of meta-learners) than it is to learn to solve the problem with little data. We have shown through a series of experiments that meta-meta classification can have much higher accuracy than a standard meta-learner or even an ensemble of such meta-learners.

References

  • Bengio et al. (1992) Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Univ. of Texas, 1992.
  • Breiman (1996) Breiman, L. Bagging predictors. Machine learning, 24(2):123–140, 1996.
  • Dietterich (2000) Dietterich, T. G. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000.
  • Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • Fei-Fei et al. (2006) Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
  • Fink (2005) Fink, M. Object classification from a single example utilizing class relevance metrics. In Advances in neural information processing systems, pp. 449–456, 2005.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR.org, 2017.
  • Finn et al. (2018) Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527, 2018.
  • Goldberger et al. (2005) Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdinov, R. R. Neighbourhood components analysis. In Advances in neural information processing systems, pp. 513–520, 2005.
  • Grant et al. (2018) Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
  • Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 1735–1742. IEEE, 2006.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Hinton & Plaut (1987) Hinton, G. E. and Plaut, D. C. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pp. 177–186, 1987.
  • Koch et al. (2015) Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Lee & Choi (2018) Lee, Y. and Choi, S. Gradient-based meta-learning with learned layerwise metric and subspace. arXiv preprint arXiv:1801.05558, 2018.
  • Maji et al. (2013) Maji, S., Kannala, J., Rahtu, E., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. Technical report, 2013.
  • Mishra et al. (2017) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.
  • Munkhdalai & Yu (2017) Munkhdalai, T. and Yu, H. Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2554–2563. JMLR.org, 2017.
  • Munkhdalai et al. (2017) Munkhdalai, T., Yuan, X., Mehri, S., and Trischler, A. Rapid adaptation with conditionally shifted neurons. arXiv preprint arXiv:1712.09926, 2017.
  • Nichol & Schulman (2018) Nichol, A. and Schulman, J. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
  • Oreshkin et al. (2018) Oreshkin, B., López, P. R., and Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731, 2018.
  • Pham et al. (2018) Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
  • Quinlan et al. (1996) Quinlan, J. R. et al. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, pp. 725–730, 1996.
  • Ravi & Larochelle (2016) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. 2016.
  • Rendell et al. (1987) Rendell, L. A., Sheshu, R., and Tcheng, D. K. Layered concept-learning and dynamically variable bias management. In IJCAI, pp. 308–314, 1987.
  • Rothfuss et al. (2018) Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. Promp: Proximal meta-policy search. arXiv preprint arXiv:1810.06784, 2018.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • Rusu et al. (2018) Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
  • Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850, 2016.
  • Schroff et al. (2015) Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.
  • Shyam et al. (2017) Shyam, P., Gupta, S., and Dukkipati, A. Attentive recurrent comparators. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3173–3181. JMLR.org, 2017.
  • Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
  • Taigman et al. (2015) Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. Web-scale training for face identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2746–2754, 2015.
  • Thrun & Pratt (1998) Thrun, S. and Pratt, L. Learning to learn: Introduction and overview. In Learning to learn, pp. 3–17. Springer, 1998.
  • Torrey & Shavlik (2010) Torrey, L. and Shavlik, J. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. IGI Global, 2010.
  • Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638, 2016.
  • Wang et al. (2018) Wang, J. X., Kurth-Nelson, Z., Kumaran, D., Tirumala, D., Soyer, H., Leibo, J. Z., Hassabis, D., and Botvinick, M. Prefrontal cortex as a meta-reinforcement learning system. Nature neuroscience, 21(6):860, 2018.
  • Welinder et al. (2010) Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • Wolpert (1992) Wolpert, D. H. Stacked generalization. Neural networks, 5(2):241–259, 1992.
  • Yoon et al. (2018) Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 7332–7342, 2018.
  • Zhang et al. (2018) Zhang, R., Che, T., Ghahramani, Z., Bengio, Y., and Song, Y. Metagan: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems, pp. 2365–2374, 2018.
  • Zoph & Le (2016) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.