
Hierarchically Structured Meta-learning

by   Huaxiu Yao, et al.

In order to learn quickly with few samples, meta-learning utilizes prior knowledge learned from previous tasks. However, a critical challenge in meta-learning is task uncertainty and heterogeneity, which cannot be handled by globally sharing knowledge among tasks. In this paper, building on gradient-based meta-learning, we propose a hierarchically structured meta-learning (HSML) algorithm that explicitly tailors the transferable knowledge to different clusters of tasks. Inspired by the way human beings organize knowledge, we resort to a hierarchical task clustering structure to cluster tasks. As a result, the proposed approach not only addresses the challenge by customizing knowledge to different clusters of tasks, but also preserves knowledge generalization within a cluster of similar tasks. In addition, to handle changing task relationships, we extend the hierarchical structure to a continual learning environment. The experimental results show that our approach achieves state-of-the-art performance in both toy-regression and few-shot image classification problems.



1 Introduction

Learning quickly with a few samples is one of the key characteristics of human intelligence, while it remains a daunting challenge for artificial intelligence. Learning to learn (a.k.a., meta-learning) (Braun et al., 2010), as a common practice to address this challenge, leverages the transferable knowledge learned from previous tasks to improve the learning effectiveness in a new task. There have been several lines of meta-learning algorithms, including recurrent network based methods (Ravi & Larochelle, 2016), optimizer based methods (Andrychowicz et al., 2016), nearest neighbours based methods (Snell et al., 2017; Vinyals et al., 2016) and gradient descent based methods (Finn et al., 2017), which instantiate the transferable knowledge as latent representations, an optimizer, a metric space, and parameter initialization, respectively.

Despite their early success in few-shot image classification (Ravi & Larochelle, 2016) and machine translation (Gu et al., 2018), most of the existing meta-learning algorithms assume the transferable knowledge to be globally shared across all tasks. As a consequence, they struggle to handle a sequence of tasks originating from different distributions. At the other end of the spectrum, a few recent works (Finn et al., 2018; Yoon et al., 2018a; Lee & Choi, 2018) try to fix the problem by tailoring the transferable knowledge to each task. Yet the downside of such methods lies in the impaired knowledge generalization among closely correlated tasks (e.g., the tasks sampled from the same distribution).

Hence we are motivated to pursue a meta-learning framework that effectively balances generalization and customization. The inspiration comes from a hypothesis which has been formulated and tested by psychological researchers (Gershman et al., 2010, 2014). The hypothesis suggests that the key to human beings' capability of solving a task with little training data is the way human beings organize the knowledge learned from tasks. As bits of tasks impinge on us, we human beings cluster the tasks into several states based on task similarity, so that learning occurs within each cluster instead of across cluster boundaries. Thus, when a new task arrives, it can either quickly take advantage of the knowledge learned within the cluster it belongs to or initiate a new cluster if it is wildly different from any existing cluster.

Inspired by this, we propose a novel meta-learning framework called Hierarchically Structured Meta-Learning (HSML). The key idea of HSML is to enhance the meta-learning effectiveness by promoting knowledge customization to different clusters of tasks while simultaneously preserving knowledge generalization among a cluster of closely related tasks. In this paper, without loss of generality, we ground HSML on a gradient-based meta-learning algorithm (Finn et al., 2017) with the transferable knowledge instantiated as parameter initializations. Specifically, first, HSML resorts to a hierarchical clustering structure to perform soft clustering on tasks. The representation of each task is learned from either of the two proposed candidate aggregators, i.e., pooling autoencoder aggregator and recurrent autoencoder aggregator, and is passed to the hierarchical clustering structure to obtain the clustering result of this task. The sequentially incoming tasks, in turn, update the clustering structure. In particular, if the existing structure does not fit the task, we dynamically expand the structure. Secondly, a globally shared parameter initialization is tailored to each cluster via a parameter gate, to serve as the initialization for all tasks belonging to the cluster.

We highlight the contributions of the proposed HSML: 1) it achieves a better balance between generalization and customization of the transferable knowledge, so that it empirically outperforms state-of-the-art meta-learning algorithms in both toy regression and few-shot image classification problems; 2) it is interpretable in terms of the task relationship; 3) it has been theoretically proved to be superior to existing gradient-based meta-learning algorithms.

2 Related Work

Figure 1: Pictorial illustration of the difference between the proposed HSML and the other two representative lines of gradient-based meta-learning algorithms: (a) globally shared, (b) task specific, (c) HSML.

Meta-learning, allowing machines to learn new skills or adapt to new environments rapidly with a few training examples, has demonstrated success in both supervised learning such as few-shot image classification and reinforcement learning settings. There are four common approaches: 1) use a recurrent neural network equipped with either external or internal memory storing and querying meta-knowledge (Munkhdalai & Yu, 2017; Santoro et al., 2016; Munkhdalai et al., 2018; Mishra et al., 2018); 2) learn a meta-optimizer which can quickly optimize the model parameters (Ravi & Larochelle, 2016; Andrychowicz et al., 2016; Li & Malik, 2016); 3) learn an effective distance metric between examples (Snell et al., 2017; Vinyals et al., 2016; Yang et al., 2018); 4) learn an appropriate initialization from which the model parameters can be updated within a few gradient steps (Finn et al., 2017, 2018; Lee & Choi, 2018).

HSML falls into the fourth category above, known as gradient-based meta-learning. Most of the gradient-based meta-learning algorithms (Finn et al., 2017; Li et al., 2017; Flennerhag et al., 2018) assume a globally shared initialization across all tasks, as shown in Figure 1(a). To accommodate dynamically changing tasks, as illustrated in Figure 1(b), recent studies tailor the globally shared initialization to each task by taking advantage of probabilistic models (Finn et al., 2018; Grant et al., 2018; Yoon et al., 2018a) and incorporating task-specific information (Lee & Choi, 2018; Vuorio et al., 2018). In contrast, our proposed HSML, outlined in Figure 1(c), customizes the globally shared initialization to each cluster using a hierarchical clustering structure, which enjoys not only knowledge customization but also generalization (e.g., between task 1 and 3). Better yet, HSML is well suited to a continual learning scenario with evolving clustering structures.

3 Preliminaries

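The Meta-Learning Problem. Suppose that a sequence of tasks $\{\mathcal{T}_1, \dots, \mathcal{T}_{N_t}\}$ is sampled from an environment $\mathcal{E}$, which is a probability distribution on tasks (Baxter, 1998). In each task $\mathcal{T}_i \sim \mathcal{E}$, we have a few examples $\{(\mathbf{x}_{i,j}, \mathbf{y}_{i,j})\}_{j=1}^{n^{tr}}$ to constitute the training set $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and the rest as the test set $\mathcal{D}^{te}_{\mathcal{T}_i}$. Given a base learner $f$ with $\theta$ as parameters, the optimal parameters $\theta_i$ are learned to make accurate predictions, i.e., $f_{\theta_i}(\mathbf{x}_{i,j}) \rightarrow \mathbf{y}_{i,j}$. The effectiveness of such a base learner on $\mathcal{D}^{tr}_{\mathcal{T}_i}$ is evaluated by the loss function $\mathcal{L}(f_{\theta_i}, \mathcal{D}^{tr}_{\mathcal{T}_i})$, which equals the mean square error for regression problems or the cross entropy loss for classification problems.

The goal of meta-learning is to learn from previous tasks a well-generalized meta-learner $\mathcal{M}(\cdot)$ which can facilitate the training of the base learner in a future task with a few examples. In fulfillment of this, meta-learning involves two stages, i.e., meta-training and meta-testing. During meta-training, the parameters of the base learner for all tasks, i.e., $\{\theta_i\}_{i=1}^{N_t}$, and the meta-learner $\mathcal{M}(\cdot)$ are optimized alternatingly. In virtue of $\mathcal{M}$, the parameters $\{\theta_i\}_{i=1}^{N_t}$ are learned to minimize the expected empirical loss over training sets of all $N_t$ historical tasks, i.e., $\min_{\{\theta_i\}} \sum_{i=1}^{N_t} \mathcal{L}(\mathcal{M}(f_{\theta_i}), \mathcal{D}^{tr}_{\mathcal{T}_i})$. In turn, a well-generalized $\mathcal{M}$ can be obtained by minimizing the expected empirical loss over test sets, i.e., $\min_{\mathcal{M}} \sum_{i=1}^{N_t} \mathcal{L}(\mathcal{M}(f_{\theta_i}), \mathcal{D}^{te}_{\mathcal{T}_i})$. When it comes to the meta-testing phase, provided with a future task $\mathcal{T}_t$, the learning effectiveness and efficiency are improved by applying the meta-learner $\mathcal{M}$ and solving $\min_{\theta_t} \mathcal{L}(\mathcal{M}(f_{\theta_t}), \mathcal{D}^{tr}_{\mathcal{T}_t})$.

Gradient-based Meta-Learning. Here we give an overview of the representative algorithm, model-agnostic meta-learning (MAML) (Finn et al., 2017). MAML instantiates the meta-learner $\mathcal{M}$ as a well-generalized initialization $\theta_0$ for the parameters of a base learner, from which a few gradient descent steps can be performed to reach the optimal $\theta_i$ for the task $\mathcal{T}_i$, which means $\mathcal{M}(f_{\theta_i}) = f_{\theta_0 - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}^{tr}_{\mathcal{T}_i})}$. As a result, the optimization of $\mathcal{M}$ during meta-training is formulated as (one gradient step as exemplary):

$$\min_{\theta_0} \sum_{i=1}^{N_t} \mathcal{L}\left(f_{\theta_0 - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}^{tr}_{\mathcal{T}_i})}, \mathcal{D}^{te}_{\mathcal{T}_i}\right). \qquad (1)$$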
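As an illustration of the one-step MAML objective described here, the following is a minimal sketch on a deliberately tiny problem. It assumes a scalar linear base learner $f_\theta(x) = \theta x$ with mean square error, so that the gradient through the inner step can be written analytically; the function names, step sizes and the two toy tasks are hypothetical and are not taken from the paper.

```python
def mse(theta, data):
    """Mean square error of the scalar model f(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad_mse(theta, data):
    """Derivative of mse with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)

def inner_update(theta0, train, alpha):
    """One gradient step from the shared initialization on the training set."""
    return theta0 - alpha * grad_mse(theta0, train)

def meta_gradient(theta0, train, test, alpha):
    """d/d theta0 of the test loss after one inner step (chain rule)."""
    theta_i = inner_update(theta0, train, alpha)
    hessian = sum(2.0 * x * x for x, _ in train) / len(train)
    return grad_mse(theta_i, test) * (1.0 - alpha * hessian)

def meta_train(tasks, theta0=0.0, alpha=0.05, beta=0.1, steps=200):
    """Gradient descent on the average post-adaptation test loss."""
    for _ in range(steps):
        g = sum(meta_gradient(theta0, tr, te, alpha) for tr, te in tasks)
        theta0 -= beta * g / len(tasks)
    return theta0

# Two hypothetical tasks y = 1 * x and y = 3 * x, each as (train, test).
toy_tasks = [
    ([(1.0, 1.0), (2.0, 2.0)], [(-1.0, -1.0), (0.5, 0.5)]),
    ([(1.0, 3.0), (2.0, 6.0)], [(-1.0, -3.0), (0.5, 1.5)]),
]
learned_init = meta_train(toy_tasks)
```

In this toy setting the learned initialization settles between the two task slopes, which is the single globally shared starting point that the cluster-specific adaptation in Section 4 is meant to refine.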


4 Methodology

Figure 2: The framework of the proposed HSML involving three essential stages. (a) Task representation learning: we learn the representation for the task $\mathcal{T}_i$ using an autoencoder aggregator (e.g., pooling aggregator, recurrent aggregator). (b) Hierarchical task clustering: provided with the task representation, we learn the soft clustering assignment with this differentiable hierarchical clustering structure. Darker nodes signify more likely assigned clusters (e.g., the cluster 1 in the first level and the cluster B in the second level). (c) Knowledge adaptation: we next use a parameter gate to adapt the transferable knowledge ($\theta_0$) to a cluster-specific initialization ($\theta_{0i}$) from which only a few gradient descent steps are required to achieve the optimal parameters $\theta_i$.

In this section, we detail the proposed HSML algorithm, whose framework is presented in Figure 2. HSML aims to adapt the transferable knowledge learned from previous tasks, namely the initialization for parameters of the base learner in gradient-based meta-learning ($\theta_0$ here), to the task $\mathcal{T}_i$ in a cluster-specific manner, so that the optimal parameters $\theta_i$ can be achieved in as few gradient descent steps as possible. As shown in part (c) of Figure 2, the possible adaptations are completely dictated by the hierarchical task clustering structure in part (b), and the eventual path for adaptation follows the clustering result on the task $\mathcal{T}_i$. By this means, HSML balances customization and generalization: the transferable knowledge is adapted to different clusters of tasks, while it is still shared among closely related tasks pertaining to the same cluster. To perform hierarchical clustering on tasks, we learn the representation of a task using the proposed task embedding network, i.e., part (a). Next we introduce the three stages, i.e., task representation learning, hierarchical task clustering, and knowledge adaptation, respectively.

4.1 Task Representation Learning

Learning the representation of a task $\mathcal{T}_i$ with the whole training set $\mathcal{D}^{tr}_{\mathcal{T}_i}$ as input is much more challenging than the common representation learning over examples, which bears a striking similarity to the connection between sentence embeddings and word embeddings in natural language processing. Inspired by common practices in learning sentence embeddings (Conneau et al., 2017), we tackle the challenge by aggregating representations of all examples $\{(\mathbf{x}_{i,j}, \mathbf{y}_{i,j})\}_{j=1}^{n^{tr}}$. The desiderata of an ideal aggregator include 1) high representational capacity, and 2) permutational invariance to its inputs. In light of these, we propose two candidate aggregators, i.e., pooling autoencoder aggregator (PAA) and recurrent autoencoder aggregator (RAA).
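Pooling Autoencoder Aggregator. To meet the first desideratum, foremost, we resort to an autoencoder that learns a highly effective representation for each example. The reconstruction loss for training the autoencoder is as follows,

$$\min \mathcal{L}_r(\mathcal{D}^{tr}_{\mathcal{T}_i}) = \min \sum_{j=1}^{n^{tr}} \left\| \mathrm{AE}_{dec}(\mathbf{g}_{i,j}) - \mathcal{F}(\mathbf{x}_{i,j}, \mathbf{y}_{i,j}) \right\|_2^2, \qquad (2)$$

where $\mathbf{g}_{i,j} = \mathrm{AE}_{enc}(\mathcal{F}(\mathbf{x}_{i,j}, \mathbf{y}_{i,j}))$ is the representation for the $j$-th example. In order to characterize the joint distribution instead of the marginal distribution only, we use $\mathcal{F}(\cdot, \cdot)$ to preliminarily embed both features and predictions of an example. The definition of $\mathcal{F}$ varies from dataset to dataset, which we will detail in supplementary material C. $\mathrm{AE}_{enc}$ and $\mathrm{AE}_{dec}$ stand for the encoder composed of a stack of fully connected layers and the decoder consisting of two fully connected layers with ReLU activation, respectively. Consequently, the aggregation satisfying the permutational invariance follows,

$$\mathbf{g}_i = \mathrm{Pool}_{j=1}^{n^{tr}}(\mathbf{g}_{i,j}), \qquad (3)$$

where $\mathbf{g}_i$ is the desired representation of task $\mathcal{T}_i$. $\mathrm{Pool}$ denotes a max or mean pooling operator over examples.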
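The pooling step can be sketched in a few lines. The snippet below is only an illustration of coordinate-wise max or mean pooling over per-example vectors, assuming the example representations have already been produced by some encoder; the function name `pool` and the plain-list vector format are hypothetical choices, not the authors' implementation.

```python
def pool(example_reprs, mode="mean"):
    """Aggregate per-example vectors into one task vector, coordinate-wise."""
    if not example_reprs:
        raise ValueError("need at least one example representation")
    columns = list(zip(*example_reprs))  # one tuple per coordinate
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")

# Three hypothetical 2-dimensional example representations.
reprs = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
task_mean = pool(reprs, "mean")
task_max = pool(reprs, "max")
```

Because both operators treat the examples as a set, reordering `reprs` leaves the task representation unchanged, which is the permutational invariance required of the aggregator.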
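Recurrent Autoencoder Aggregator. Motivated by recent success of the recurrent embedding aggregation in order-invariant problems such as graph embedding (Hamilton et al., 2017), we also consider a recurrent autoencoder aggregator which demonstrates more remarkable expressivity especially for a task with few examples. Different from the pooling autoencoder, examples are sequentially fed into the recurrent autoencoder, i.e.,

$$\mathcal{F}(\mathbf{x}_{i,1}, \mathbf{y}_{i,1}) \rightarrow \mathbf{g}_{i,1} \rightarrow \cdots \rightarrow \mathbf{g}_{i,n^{tr}} \rightarrow \mathbf{d}_{i,n^{tr}} \rightarrow \cdots \rightarrow \mathbf{d}_{i,1}, \qquad (4)$$

where $\mathbf{g}_{i,j} = \mathrm{RNN}_{enc}(\mathcal{F}(\mathbf{x}_{i,j}, \mathbf{y}_{i,j}), \mathbf{g}_{i,j-1})$ and $\mathbf{d}_{i,j} = \mathrm{RNN}_{dec}(\mathbf{d}_{i,j+1})$ represent the learned representation and the reconstruction of the $j$-th example, respectively. Here $\mathrm{RNN}_{enc}$ and $\mathrm{RNN}_{dec}$ stand for a recurrent encoder (LSTM or GRU) and a recurrent decoder, respectively. The reconstruction loss is similar to Eqn. (2), except that the decoder output $\mathrm{AE}_{dec}(\mathbf{g}_{i,j})$ is replaced with $\mathbf{d}_{i,j}$. Thereupon, the task representation is aggregated over representations of all examples, i.e.,

$$\mathbf{g}_i = \frac{1}{n^{tr}} \sum_{j=1}^{n^{tr}} \mathbf{g}_{i,j}. \qquad (5)$$

Regrettably, the sequential feeding of examples makes the final task representation permutation sensitive, which violates the second prerequisite of an ideal aggregator. We address the problem by applying the recurrent aggregator to random permutations of examples (Hamilton et al., 2017).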
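To make the permutation issue concrete, the sketch below uses a toy scalar recurrence as a stand-in for the recurrent encoder and averages its output over random orderings. The recurrence $g_j = \tanh(0.5\, g_{j-1} + x_j)$, the function names and the number of permutations are all assumptions made for illustration; they are not the LSTM or GRU encoder of the paper, and the text does not state how many permutations are used.

```python
import math
import random

def recurrent_encode(examples, decay=0.5):
    """Toy order-sensitive encoder: g_j = tanh(decay * g_{j-1} + x_j).

    Returns the mean of the hidden states, mirroring the averaging
    of per-example representations into a task representation.
    """
    state, states = 0.0, []
    for x in examples:
        state = math.tanh(decay * state + x)
        states.append(state)
    return sum(states) / len(states)

def permutation_averaged(examples, n_permutations, rng):
    """Average the encoder output over random orderings of the examples."""
    total = 0.0
    for _ in range(n_permutations):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        total += recurrent_encode(shuffled)
    return total / n_permutations

order_a = [0.1, 2.0, -1.0]
order_b = [-1.0, 2.0, 0.1]  # same examples, different order
```

A single pass gives different values for `order_a` and `order_b`, while the permutation-averaged estimate depends on the ordering only through sampling noise, which shrinks as more permutations are drawn.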

4.2 Hierarchical Task Clustering

Given the representation of a task, we propose a hierarchical task clustering structure to locate the cluster the task belongs to. Before proceeding to detail the structure, we first explicate why the hierarchical clustering is preferred over flat clustering: a single level of task groups is likely insufficient to model complex task relationship in real-world applications; for example, to identify the cross-talks between gene expressions of multiple species, the study (Kim & Xing, 2010) suggests multi-level clustering of such gene interaction.

The hierarchical clustering, following the tradition of clustering, proceeds by alternating between two steps, i.e., assignment step and update step, in a layer-wise manner.
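Assignment step: Each task receives a cluster assignment score on each hierarchical level, and the assignment that it receives in a particular level is a function of its representation in the previous level. Thus, we assign a task represented in the $k^l$-th cluster of the $l$-th level, i.e., $\mathbf{h}_i^{k^l}$, to the $k^{l+1}$-th cluster in the $(l+1)$-th level. Note that we conduct soft assignment for the following two reasons: (1) task groups have been demonstrated to overlap, since there is always a continuum in the sharing between tasks (Kumar & Daumé III, 2012); (2) the soft instead of hard assignment guarantees the differentiability, so that the full HSML framework can still be trained in an end-to-end fashion. In particular, for each task $\mathcal{T}_i$, the soft-assignment $p_i^{k^l \rightarrow k^{l+1}}$ is computed by applying softmax over Euclidean distances between $\mathbf{h}_i^{k^l}$ and the learnable cluster centers $\{\mathbf{c}_{k^{l+1}}\}_{k^{l+1}=1}^{K^{l+1}}$, i.e.,

$$p_i^{k^l \rightarrow k^{l+1}} = \frac{\exp\left(-\left\|(\mathbf{h}_i^{k^l} - \mathbf{c}_{k^{l+1}})/\sigma^l\right\|^2/2\right)}{\sum_{k^{l+1}=1}^{K^{l+1}} \exp\left(-\left\|(\mathbf{h}_i^{k^l} - \mathbf{c}_{k^{l+1}})/\sigma^l\right\|^2/2\right)}, \qquad (6)$$

where $\sigma^l$ is a scaling factor in the $l$-th level and $K^{l+1}$ denotes the number of clusters in the $(l+1)$-th level.

Update step: As a result of assignment, the representation of a task in the $k^{l+1}$-th cluster of the $(l+1)$-th level, i.e., $\mathbf{h}_i^{k^{l+1}}$, can be updated with the following weighted average,

$$\mathbf{h}_i^{k^{l+1}} = \sum_{k^l=1}^{K^l} p_i^{k^l \rightarrow k^{l+1}} \tanh\left(\mathbf{W}^{k^{l+1}} \mathbf{h}_i^{k^l} + \mathbf{b}^{k^{l+1}}\right), \qquad (7)$$

where $\mathbf{W}^{k^{l+1}}$ and $\mathbf{b}^{k^{l+1}}$ are learned to transform from representations of the $l$-th to those of the $(l+1)$-th level.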
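One level of the assignment and update steps can be sketched as follows. This is a minimal illustration under stated assumptions: vectors are plain Python lists, the cluster centers, weight matrices and biases are passed in as fixed values rather than learned, and the names `soft_assign`, `affine_tanh` and `cluster_layer` are hypothetical. The assignment is a softmax over negative scaled squared distances and the update is the assignment-weighted sum of tanh-transformed representations.

```python
import math

def soft_assign(h, centers, sigma=1.0):
    """Softmax over negative scaled squared distances from h to each center."""
    logits = [-sum(((a - b) / sigma) ** 2 for a, b in zip(h, c)) / 2.0
              for c in centers]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def affine_tanh(W, b, h):
    """tanh(W h + b) for a list-of-rows matrix W."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)]

def cluster_layer(h_prev, centers, Ws, bs, sigma=1.0):
    """Map the K^l representations of one task to K^{l+1} representations.

    h_prev: list of K^l vectors; centers, Ws, bs: one entry per next-level
    cluster. Returns (h_next, probs) with probs[k][k_next] the assignment.
    """
    probs = [soft_assign(h, centers, sigma) for h in h_prev]
    h_next = []
    for k_next, (W, b) in enumerate(zip(Ws, bs)):
        acc = [0.0] * len(b)
        for k, h in enumerate(h_prev):
            transformed = affine_tanh(W, b, h)
            acc = [a + probs[k][k_next] * v for a, v in zip(acc, transformed)]
        h_next.append(acc)
    return h_next, probs

# One task representation, two hypothetical next-level clusters.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_next, probs = cluster_layer(
    h_prev=[[1.0, 0.0]],
    centers=[[1.0, 0.0], [4.0, 0.0]],
    Ws=[identity, identity],
    bs=[[0.0, 0.0], [0.0, 0.0]],
)
```

Here the task sits exactly on the first center, so almost all of its assignment mass, and hence most of its transformed representation, flows to the first next-level cluster.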

The full pipeline of clustering starts from $l = 0$, where the initialization for $\mathbf{h}_i^{k^0}$ equals the task representation $\mathbf{g}_i$ and $K^0 = 1$, and ends at $K^L = 1$. The cluster centers deserve particular attention. The meta-learning scenario where training tasks come sequentially poses a unique challenge, which requires the hierarchical clustering structure to be online accordingly. Therefore, the cluster centers are parameterized and learned as the learning proceeds. Each center is randomly initialized.

4.3 Knowledge Adaptation

The final representation $\mathbf{h}_i^{L}$, which encodes the hierarchical clustering result, is believed to be cluster specific. Previous works (Xu et al., 2015; Lee & Choi, 2018) suggest that similar tasks activate similar meta-parameters (e.g., initialization) while different tasks trigger disparate ones. Inspired by this finding, we design a cluster-specific parameter gate,

$$\mathbf{o}_i = \mathrm{FC}^{\sigma}_{\mathbf{W}_g}(\mathbf{g}_i \oplus \mathbf{h}_i^{L}), \qquad (8)$$

where the fully connected layer $\mathrm{FC}^{\sigma}_{\mathbf{W}_g}$ is parameterized by $\mathbf{W}_g$ and activated by a sigmoid function $\sigma$. It is worth mentioning here that concatenating the task representation $\mathbf{g}_i$ together with $\mathbf{h}_i^{L}$ not only preserves but also reinforces the cluster-specific property of the parameter gate. Most importantly, the globally transferable knowledge, i.e., the initial parameters $\theta_0$, is adapted to the cluster-specific initial parameters $\theta_{0i}$ via the parameter gate, i.e., $\theta_{0i} = \theta_0 \circ \mathbf{o}_i$.
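The gating step amounts to a sigmoid-activated fully connected layer followed by an element-wise product with the shared initialization. The sketch below illustrates this on plain lists; the bias term `b_g`, the function names and the toy numbers are assumptions added for illustration and are not specified in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def parameter_gate(g, h_final, W_g, b_g):
    """Sigmoid-activated fully connected layer on the concatenation [g ; h_final].

    W_g has one row per gated parameter; b_g is an assumed bias term.
    """
    z = g + h_final  # list concatenation of task and cluster representations
    return [sigmoid(sum(w * x for w, x in zip(row, z)) + b)
            for row, b in zip(W_g, b_g)]

def adapt_initialization(theta0, gate):
    """Element-wise product of the shared initialization with the gate."""
    return [t * o for t, o in zip(theta0, gate)]

# Hypothetical sizes: 1-d task vector, 1-d cluster vector, 2 base parameters.
gate = parameter_gate(g=[1.0], h_final=[2.0],
                      W_g=[[0.0, 0.0], [0.0, 0.0]], b_g=[0.0, 0.0])
theta_cluster = adapt_initialization([2.0, -4.0], gate)
```

With all-zero gate weights every gate value is 0.5, so the shared initialization is simply halved; as the gate saturates towards 1 the adapted initialization approaches the shared one, which corresponds to using a single global initialization.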

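Recalling the objectives for a meta-learning algorithm in Section 3, we reach the optimization problem for HSML:

$$\min_{\Theta} \sum_{i=1}^{N_t} \mathcal{L}\left(f_{\theta_{0i} - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}^{tr}_{\mathcal{T}_i})}, \mathcal{D}^{te}_{\mathcal{T}_i}\right) + \xi \mathcal{L}_r(\mathcal{D}^{tr}_{\mathcal{T}_i}), \qquad (9)$$

where $\mathcal{L}$ defined in Section 3 measures the empirical risk over $\mathcal{D}^{te}_{\mathcal{T}_i}$ and $\mathcal{L}_r$ measures the reconstruction error as defined in Eqn. (2). $\xi$ is used to balance the importance of these two terms. $\Theta$ represents all learnable parameters including the global transferable initialization $\theta_0$, the parameters for clustering, and those for knowledge adaptation (i.e., $\mathbf{W}_g$).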

Continual Adaptation. We pay particular attention to the case where a new task does not fit any of the learned task clusters, which implies that additional clusters should be introduced to the hierarchical clustering structure. Incrementally adding model capacity (Yoon et al., 2018b; Daniely et al., 2015) has been the common practice to handle distribution drift without initially introducing excessive parameters. The key lies in the criterion for when to expand the clustering structure. Since the loss values of $\mathcal{L}$ fluctuate across different tasks during the online meta-training process, using the raw loss value as the threshold would be futile. Instead, for every $Q$ training tasks, we compute the average loss value $\bar{\mathcal{L}}$. If the new average value $\bar{\mathcal{L}}_{new}$ is more than $\mu$ times the previous value $\bar{\mathcal{L}}_{old}$ (i.e., $\bar{\mathcal{L}}_{new} > \mu \bar{\mathcal{L}}_{old}$), the number of clusters will be increased, and the parameters for new clusters are randomly initialized. The whole algorithm of our proposed model is detailed in Alg. 1.

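Require: $\mathcal{E}$: distribution over tasks; $\{K^1, \dots, K^L\}$: # of clusters in each layer; $\alpha$, $\beta$: stepsizes; $\mu$: threshold
1:  Randomly initialize $\Theta$
2:  while not done do
3:     if $\bar{\mathcal{L}}_{new} > \mu \bar{\mathcal{L}}_{old}$ then
4:        Increase the number of clusters
5:     end if
6:     Sample a batch of tasks $\mathcal{T}_i \sim \mathcal{E}$
7:     for all $\mathcal{T}_i$ do
8:        Sample $\mathcal{D}^{tr}_{\mathcal{T}_i}$, $\mathcal{D}^{te}_{\mathcal{T}_i}$ from $\mathcal{T}_i$
9:        Compute $\mathbf{g}_i$ in Eqn. (3) or Eqn. (5), $\mathbf{h}_i^{L}$ in Eqn. (7), and reconstruction error $\mathcal{L}_r(\mathcal{D}^{tr}_{\mathcal{T}_i})$
10:        Compute $\mathbf{o}_i$ in Eqn. (8) and evaluate $\mathcal{L}(f_{\theta_{0i}}, \mathcal{D}^{tr}_{\mathcal{T}_i})$
11:        Update parameters with gradient descent (taking one step as an example): $\theta_i = \theta_{0i} - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}^{tr}_{\mathcal{T}_i})$
12:     end for
13:     Update $\Theta \leftarrow \Theta - \beta \nabla_\Theta \sum_{i} \left[\mathcal{L}(f_{\theta_i}, \mathcal{D}^{te}_{\mathcal{T}_i}) + \xi \mathcal{L}_r(\mathcal{D}^{tr}_{\mathcal{T}_i})\right]$
14:     Compute $\bar{\mathcal{L}}_{new}$ and save $\bar{\mathcal{L}}_{old}$ for every $Q$ rounds
15:  end while
Algorithm 1 Meta-training of HSML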
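The expansion criterion used in the continual adaptation setting can be sketched as a small loss monitor. The class below is a hypothetical helper, not part of the paper: it averages the loss over a fixed window of tasks and signals expansion when the new average exceeds a multiple of the previous one. The names `window` and `mu` and the example values are assumptions chosen for illustration.

```python
class ClusterExpansionMonitor:
    """Signals when the windowed average loss grows by more than a factor mu."""

    def __init__(self, window, mu):
        self.window = window            # number of tasks per average
        self.mu = mu                    # growth threshold
        self._losses = []
        self._previous_average = None

    def observe(self, loss):
        """Record one task loss; return True when a cluster should be added."""
        self._losses.append(loss)
        if len(self._losses) < self.window:
            return False
        new_average = sum(self._losses) / len(self._losses)
        self._losses = []
        old_average, self._previous_average = self._previous_average, new_average
        return old_average is not None and new_average > self.mu * old_average

monitor = ClusterExpansionMonitor(window=2, mu=1.25)
signals = [monitor.observe(loss) for loss in [1.0, 1.0, 1.0, 1.2, 2.0, 2.0]]
```

In this run the third window's average (2.0) exceeds 1.25 times the previous average (1.1), so only the last observation triggers an expansion. Comparing window averages rather than single losses keeps ordinary task-to-task fluctuation from adding clusters.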

5 Analysis

The core of HSML is to adapt a globally shared initialization of stochastic gradient descent (SGD) to be cluster specific via the proposed hierarchical clustering structure. Hence, in this section, we theoretically analyze the advantage of such adaptation in terms of the generalization bound.

For a task , we assume both training and testing examples are i.i.d. drawn from a distribution , i.e., and . According to Theorem 2 in (Kuzborskij & Lampert, 2017), a base learner is -on-average stable if its generalization is bounded by , i.e., . is the initialization of SGD to reach , and and denote the expected and empirical risk on , respectively.

Transferring the globally shared initialization (i.e., MAML) to the target task is equivalent to transferring a hypothesis learned from meta-training tasks like (Kuzborskij & Orabona, 2017). For HSML, the initialization can be represented as , which we demonstrate in the supplementary material A. In the following two theorems, provided with an initialization , we derive according to (Kuzborskij & Lampert, 2017) the generalization bounds of the base learner when the loss is convex and non-convex, respectively.

Theorem 1

Assume that is convex and optimized using SGD is -on-average stable. Then is bounded by,

Theorem 2

Assume that is -smooth and has a -Lipschitz Hessian. The step size at the -step satisfying with total steps and and then is bounded by,


Though some standard base learners (e.g., 4 convolutional layers in few-shot image classification (Finn et al., 2017)) with ReLU do not meet the property of Lipschitz Hessian, following (Nguyen & Hein, 2018), a softplus function can arbitrarily well approximate ReLU by adjusting and thus Theorem 2 holds. In both cases, MAML can be regarded as the special case of HSML, i.e., , where

is an identity matrix. Remarkably, by proving

, we conclude that HSML achieves a tighter generalization bound than MAML and thereby is much more favored. Consider the optimization process starting from , through the negative gradient direction, and . Thus, we can find a . We provide more details about analysis in supplementary material A.

6 Experiments

In this section, we evaluate the effectiveness of HSML. The goal of our experimental evaluation is to answer the following questions: (1) Can our approach outperform other meta-learning algorithms in toy regression and few-shot image classification tasks? (2) Can our approach discover reasonable task clusters? (3) Can our approach update the clustering structure in the continual learning manner and achieve better performance?

We study these questions on toy regression and few-shot image classification problems. For gradient-based meta-learning algorithms, we select the following as baselines: (1) globally shared models including MAML (Finn et al., 2017) and Meta-SGD (Li et al., 2017); (2) task specific models including MT-Net (Lee & Choi, 2018), BMAML (Yoon et al., 2018a) and MUMOMAML (Vuorio et al., 2018). The empirical results indicate that the recurrent autoencoder aggregator (RAA) is on average better than PAA for task representation, so RAA is used as the default aggregator. We also provide a comparison of RAA and PAA on the few-shot classification problem in supplementary material G. All the baselines use the same neural network structure (base learner). For hierarchical task clustering, like (Ying et al., 2018), the number of clusters in a higher layer is half of that in the layer immediately below it. We specify the hyperparameters for meta-training in supplementary material C.

6.1 Toy Regression

Dataset and Experimental Settings In the toy regression problem, different tasks are sampled from different family of functions. In this paper, the underlying family functions are (1) Sinusoids: , , and ; (2) Line: , and ; (3) Cubic: , , , and ; (4) Quadratic: , , and .

represents a uniform distribution. Each individual is randomly sampled from one of the four underlying functions. The input

for both training and testing tasks. We train all models for 5-shot and 10-shot regression. Mean square error (MSE) is used as evaluation metric. In hierarchical clustering, we set the number of layers to be three with 4, 2, 1 clusters in each layer, respectively.

Results of Regression Performance The results of 5-shot and 10-shot regression are shown in Table 1. HSML improves the performance of global models (e.g., MAML) and task specific models (e.g., MUMOMAML), indicating the effectiveness of task clustering.

Model 5-shot 10-shot
HSML (ours)
Table 1: Performance of MSE ± 95% confidence intervals on toy regression tasks, averaged over 4,000 tasks. Both 5-shot and 10-shot results are reported.

Task Clustering Analysis in Toy Regression

Figure 3: (a) The visualization of soft-assignment in Eqn. (6) of six selected tasks. Darker color represents higher probability. (b) The corresponding fitting curves. The ground truth underlying a function is shown in red line with data samples marked as green stars. C1-4 mean cluster 1-4, respectively.

In order to show the power of HSML for detecting task clusters, we randomly select six tasks (more results are shown in supplementary material I) of the 5-shot regression scenario and show soft-assignment values in Figure 3(a), i.e., the value of the soft-assignment probability in Eqn. (6). Darker color stands for higher probability. The qualitative results of each task are shown in Figure 3(b). The ground truth underlying functions and the data samples are shown as red lines and green stars, respectively. Qualitative results of MAML, MUMOMAML (best baseline), and HSML are shown in different colors.

As shown in the heatmap, sinusoids and linear functions with positive slope activate cluster 1 and 3, respectively. Both quadratic 1 and 2 activate cluster 2, while quadratic 1 also activates cluster 1 and quadratic 2 also activates cluster 3. From the qualitative results, we can see that the shape of quadratic 2 is similar to that of a linear function with positive slope, while quadratic 1 has more apparent curvature. Similar findings also hold for the cubic cases. The shape of cubic 2 is very similar to sinusoids, thus cluster 1 is activated. Different from cubic 2, cubic 1 mainly activates cluster 4, whose shape is similar to a linear function with negative slope. The results indicate that the main clustering criterion of HSML is the shape of a task rather than its underlying function family. Furthermore, according to the qualitative results, HSML fits better than the other baselines.

Results of Continual Adaptation. To demonstrate the effectiveness of HSML under the continual learning scenario (HSML-D), we add more underlying functions during meta-training. First, we generate tasks from sinusoids and linear functions; quadratic and cubic functions are then added after 15,000 and 30,000 training rounds, respectively. For comparison, one baseline is HSML with 2 fixed clusters (HSML-S(2C)), and the other is HSML with 10 fixed clusters and thus much more representational capability (HSML-S(10C)). The meta-training loss curve and the meta-testing performance (MSE) are shown in Figure 4. We can see that HSML-D outperforms both baselines, as expected. In particular, HSML-D performs better than HSML-S(10C), which is prone to overfitting and to getting stuck in local optima at early stages.

Model HSML-S (2C) HSML-S (10C) HSML-D
MSE 95% CI
Figure 4: The performance comparison for the 5-shot toy regression problem in the continual adaptation scenario. The curve of MSE in meta-training process is shown in the top figure and the performance of meta-testing is reported in the bottom table.

6.2 Few-shot Classification

Dataset and Experimental Settings. In the few-shot classification problem, we construct a new benchmark which currently consists of four image classification datasets: Caltech-UCSD Birds-200-2011 (Bird) (Wah et al., 2011), Describable Textures Dataset (Texture) (Cimpoi et al., 2014), Fine-Grained Visual Classification of Aircraft (Aircraft) (Maji et al., 2013), and FGVCx-Fungi (Fungi) (Fun, 2018) (see supplementary material B for detailed descriptions of this benchmark). Similar to the preprocessing of MiniImagenet (Vinyals et al., 2016), we divide each dataset into meta-training, meta-validation and meta-testing classes. Following the protocol in (Finn et al., 2017; Ravi & Larochelle, 2016; Vinyals et al., 2016), we adopt N-way classification with K-shot samples. Each task samples classes from one of the four datasets. Compared with previous benchmarks (e.g., MiniImagenet) in which tasks are constructed within a single dataset, the new benchmark is more heterogeneous and closer to real-world image classification. Like (Finn et al., 2017), the base learner is a standard four-block convolutional architecture. The number of layers in the hierarchical clustering structure is set to 3, with 4, 2, 1 clusters in each layer. Note that, in this section, for the tables without confidence intervals, we provide the full results in supplementary material F. In addition, we provide a comparison on the MiniImagenet benchmark in supplementary material D. Note that the tasks sampled from MiniImagenet do not have obvious heterogeneity and uncertainty; on that benchmark, our approach achieves results comparable to other gradient-based meta-learning methods.
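The task construction described here, where each N-way K-shot task draws all of its classes from one of several datasets, can be sketched as an episode sampler. The data layout (a dictionary mapping dataset name to class name to a list of items), the function name `sample_episode` and the query-set size are assumptions for illustration; the paper's actual data pipeline is not shown in the text.

```python
import random

def sample_episode(datasets, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task whose classes all come from one dataset.

    datasets: {dataset_name: {class_name: [item, ...]}}
    Returns (dataset_name, support, query) with (item, label) pairs,
    where labels are re-indexed to 0..n_way-1 within the episode.
    """
    name = rng.choice(sorted(datasets))
    classes = rng.sample(sorted(datasets[name]), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(datasets[name][cls], k_shot + n_query)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return name, support, query

# Hypothetical stand-in data: two datasets, six classes, ten items per class.
toy_data = {
    name: {f"{name}_c{c}": [f"{name}_c{c}_img{i}" for i in range(10)]
           for c in range(6)}
    for name in ("bird", "texture")
}
episode = sample_episode(toy_data, n_way=5, k_shot=1, n_query=3,
                         rng=random.Random(0))
```

Drawing the dataset first and the classes second is what makes the resulting task distribution heterogeneous across episodes while keeping each individual task within a single domain.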
Results of Classification Performance

Model Bird Texture Aircraft Fungi Average
5-way 1-shot MAML
HSML (ours)
5-way 5-shot MAML
HSML (ours)
Table 2: Comparison between HSML and other gradient-based meta-learning methods on the 5-way, 1-shot/5-shot image classification problem, averaged over 1000 tasks for each dataset. Accuracy confidence intervals are reported.

For each dataset, we report the averaged accuracy over 1000 tasks of 5-way 1-shot/5-shot classification in Table 2. HSML consistently outperforms the other baselines on each dataset, which demonstrates the power of modeling the hierarchical clustering structure. To verify the effectiveness of our three proposed components (i.e., task representation, hierarchical task clustering, knowledge adaptation), we also propose some variants of HSML. The detailed description of these variants and their corresponding results can be found in supplementary material H, which further confirm the contribution of each component. In addition, we design another challenging leave-one-out experiment on this benchmark. We use one dataset for meta-testing and the remaining three for meta-training. The results are reported in supplementary material E, and HSML still achieves the best performance.
Task Clustering Analysis in Few-shot Classification Like the analysis of toy regression, we select four tasks in 5-way 1-shot classification and show their soft-assignment in Figure 5 (more results are shown in the supplementary material J). Darker color means higher probability. Furthermore, in Figure 5, we show the learned hierarchical clustering of each task. In each layer, the top activated clusters are shown in darker color and then the activation paths are generated.

Figure 5: (a) The visualization of soft-assignment in Eqn. (6) of four selected tasks. (b) Learned hierarchical structure of each task. In each layer, top activated cluster is shown in dark color.

From Figure 5, we can see that different datasets mainly activate different clusters: bird → cluster 2, texture → cluster 4, aircraft → cluster 1, fungi → cluster 3. It is also interesting to find clustering across different tasks via the second most activated cluster, which further promotes knowledge transfer between tasks. The correlation may represent the similarity of shape (bird and aircraft), environment (fungi and bird), or surface texture (texture and fungi). Note that aircraft is correlated with texture because the classification of aircraft variants is mainly based on their shape and texture. The clustering can be further verified in the learned activation path. In the second layer, the left node, which may represent the environment, is activated by cluster 2 (activated by bird) and 3 (activated by fungi). The right node, which reflects surface texture, is activated by cluster 1 (activated by aircraft) and 4 (activated by texture). In addition, in Figure 6, we randomly select 1000 tasks from each dataset and show the t-SNE (Maaten & Hinton, 2008) visualization of the gated weight in Eqn. (9). Compared with MUMOMAML, the results indicate that our clustering structure is able to identify the tasks in different clusters.

Figure 6: t-SNE visualization of the gated weight in Eqn. (9). Panel (b): HSML (Ours).

Results of Continual Adaptation For the few-shot classification task, we conduct the continual adaptation experiments in the 5-way 1-shot scenario. Initially, the tasks are generated from the bird and texture datasets. Then, the aircraft and fungi datasets are added after approximately meta-training rounds 15000 and 25000, respectively. We show the average meta-training accuracy curve and the meta-testing accuracy in Figure 7, where MUMOMAML, HSML-S(2C) and HSML-S(10C) are used as baselines. As shown in Figure 7, HSML-D consistently achieves better performance.

| Model | Bird | Texture | Aircraft | Fungi |
| --- | --- | --- | --- | --- |
| MUMOMAML | 56.66% | 33.68% | 45.73% | 40.38% |
| HSML-S (2C) | 60.77% | 33.41% | 51.28% | 40.78% |
| HSML-S (10C) | 59.16% | 34.48% | 52.30% | 40.56% |
| HSML-D | 61.16% | 34.53% | 54.50% | 41.66% |
Figure 7: The performance comparison for the 5-way 1-shot few-shot classification problem in the continual adaptation scenario. The top figure and bottom table show the meta-training accuracy curves and the meta-testing accuracy, respectively.
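The task stream used in the continual adaptation experiment can be sketched as a simple schedule. The round numbers 15000 and 25000 are the approximate values stated in the text; the function name `active_datasets` and the schedule layout are hypothetical.

```python
# (start_round, datasets added at that round); start rounds are approximate.
SCHEDULE = (
    (0, ("bird", "texture")),
    (15000, ("aircraft",)),
    (25000, ("fungi",)),
)

def active_datasets(round_idx, schedule=SCHEDULE):
    """Return the datasets from which tasks may be sampled at a given round."""
    active = []
    for start, names in schedule:
        if round_idx >= start:
            active.extend(names)
    return active
```

A training loop would call `active_datasets(round_idx)` before sampling each meta-batch, so that tasks from new underlying groups appear only after their start round.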

Effect of Cluster Numbers We further analyze the effect of the number of clusters. The results are shown in Table 3. The cluster numbers from the bottom layer to the top layer are given as a tuple. We can see that too few clusters may not be enough to learn the task clustering characteristics (e.g., case (2,2,1)). On this dataset, adding layers (e.g., case (8,4,4,1)) achieves performance similar to case (4,2,1), while introducing more parameters.

Num. of Clu. Bird Texture Aircraft Fungi
Table 3: Comparison of different cluster numbers. The numbers in the first column represent the number of clusters from the bottom layer to the top layer. Accuracies for 5-way 1-shot classification are reported.
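As a rough illustration of why a larger configuration is more expensive, the sketch below counts cluster nodes and inter-layer links for the tuples mentioned above. It assumes every cluster is connected to every cluster in the next layer and uses these counts only as a proxy; the actual parameter count depends on the per-cluster parameterization, which is not given in this section.

```python
def hierarchy_size(clusters_per_layer):
    """Count cluster nodes and links between adjacent layers (assumed fully connected)."""
    nodes = sum(clusters_per_layer)
    links = sum(a * b for a, b in zip(clusters_per_layer[:-1], clusters_per_layer[1:]))
    return nodes, links

for config in [(2, 2, 1), (4, 2, 1), (8, 4, 4, 1)]:
    print(config, hierarchy_size(config))
```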

7 Conclusion and Discussion

In this paper, we introduce HSML to improve the effectiveness of meta-learning. HSML simultaneously customizes task knowledge and preserves knowledge generalization via a hierarchical clustering structure. Compared with several baselines, experiments demonstrate the effectiveness and interpretability of our algorithm in both toy regression and few-shot classification problems.

Although our method is widely applicable, there are some limitations and interesting future directions. (1) In this paper, we provide a simple version for continual learning, where tasks from new underlying groups are added continually. However, to construct a more reliable lifelong learning system, it will be necessary to consider more complex evolution relations between tasks (e.g., relationship forgetting). (2) Another interesting direction is to combine active learning with task relation learning to automatically explore evolutionary task relations.


Appendix A Detailed Theoretical Analysis

Proof of Theorem 1

Assuming a task is sampled from , its training and testing samples are i.i.d. drawn from distribution , i.e., and . According to Theorem 3 in (Kuzborskij & Lampert, 2017), if is convex, the base learner SGD is -on-average-stable with


where .

For a new task , we first prove that the initialization can be approximately represented as . Without loss of generality, here we consider a hierarchy in HSML.


where and . Note that the first equality holds by converting the Hadamard product into matrix multiplication, and the first and the second approximations come from the first-order Taylor series of the sigmoid and hyperbolic functions. In addition, in the hierarchical structure, , .

From Eqn. 12, we can see that depends . As in (Kuzborskij & Lampert, 2017), when the optimization process for task starts from the equivalent form that , we can bound using the Hoeffding bound as:


Thus, we reach the conclusion.
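For reference, the standard two-sided form of Hoeffding's inequality is stated below for independent random variables bounded in an interval. The symbols X_j, a, b, n, t and δ are generic and are not the paper's notation; how the bound is instantiated in the proof depends on the equations omitted above.

```latex
% Generic statement: X_1, ..., X_n independent with a <= X_j <= b.
\Pr\left( \left| \frac{1}{n}\sum_{j=1}^{n} X_j
      - \mathbb{E}\left[ \frac{1}{n}\sum_{j=1}^{n} X_j \right] \right| \ge t \right)
\le 2\exp\left( -\frac{2 n t^{2}}{(b-a)^{2}} \right)
% Equivalently, with probability at least 1 - \delta:
\left| \frac{1}{n}\sum_{j=1}^{n} X_j
      - \mathbb{E}\left[ \frac{1}{n}\sum_{j=1}^{n} X_j \right] \right|
\le (b-a)\sqrt{\frac{\ln(2/\delta)}{2n}}
```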

Proof of Theorem 2

In the non-convex case, we assume is -smooth and has -Lipschitz Hessian. According to Corollary 1 and Proposition 1 in (Kuzborskij & Lampert, 2017), for task , we define:




Then, we apply Hoeffding's inequality and obtain


Finally, let , can be bounded as:


Thus, we reach our conclusion.

Existence of

Here, we provide more details on the analysis of the existence of , i.e., . Through descent along the negative gradient, we can get


Then, we can find a . It can also be verified in Figure 8. Assuming is in the red contour, we can find a better parameter inside the contour by following its negative gradient direction.

Figure 8: Illustration of Existence of .
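The existence argument relies on the fact that a sufficiently small step along the negative gradient lowers the loss. A toy numerical check is sketched below; the quadratic loss, its minimizer, the starting point and the step size are all hypothetical and stand in for the quantities omitted above.

```python
import numpy as np

TARGET = np.array([1.0, -2.0])  # hypothetical minimizer of the toy loss

def loss(theta):
    """Toy convex loss, not the paper's objective."""
    return 0.5 * np.sum((theta - TARGET) ** 2)

def grad(theta):
    """Gradient of the toy loss."""
    return theta - TARGET

theta = np.array([3.0, 3.0])            # a starting point on some contour
step = 0.1                              # small step size
theta_new = theta - step * grad(theta)  # move along the negative gradient
```

Here `loss(theta_new)` is smaller than `loss(theta)`, i.e., the updated parameter lies strictly inside the contour that passes through the starting point.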

Appendix B Detailed Description of the New Few-shot Classification Benchmark

The new benchmark consists of four image classification datasets. All images are resized to . Here, we briefly introduce each of them as follows:

  • Caltech-UCSD Birds-200-2011 (CUB-200-2011) (Wah et al., 2011) is a bird image dataset which contains 11,788 photos of 200 bird species. In this paper, we randomly select 100 species with 60 photos in each species. We split the meta-training/meta-validation/meta-testing sets as 64/16/20 species.

    • Meta-training: Savannah Sparrow, Dark eyed Junco, Black footed Albatross, Henslow Sparrow, Cape Glossy Starling, Black throated Sparrow, Northern Waterthrush, Hooded Warbler, Baltimore Oriole, Scarlet Tanager, Cerulean Warbler, Downy Woodpecker, Black and white Warbler, Tropical Kingbird, Canada Warbler, Blue Jay, Elegant Tern, Groove billed Ani, Mallard, European Goldfinch, Red breasted Merganser, Geococcyx, Red winged Blackbird, Ringed Kingfisher, Prairie Warbler, Florida Jay, Hooded Oriole, American Redstart, Western Wood Pewee, Sayornis, Myrtle Warbler, Yellow Warbler, Tree Swallow, Rufous Hummingbird, Fish Crow, Bewick Wren, Seaside Sparrow, Vesper Sparrow, American Crow, Eared Grebe, Blue headed Vireo, White necked Raven, Frigatebird, Horned Lark, Tree Sparrow, Red bellied Woodpecker, Pacific Loon, Caspian Tern, Anna Hummingbird, Olive sided Flycatcher, Common Tern, Cedar Waxwing, Great Crested Flycatcher, Blue Grosbeak, White breasted Kingfisher, White eyed Vireo, Purple Finch, Cliff Swallow, Scissor tailed Flycatcher, Harris Sparrow, Western Grebe, Gadwall, American Goldfinch, Pine Warbler.

    • Meta-validation: Mockingbird, Vermilion Flycatcher, Cape May Warbler, Prothonotary Warbler, White crowned Sparrow, Ovenbird, Pomarine Jaeger, Indigo Bunting, Blue winged Warbler, Chipping Sparrow, Horned Grebe, Fox Sparrow, Green Violetear, Nashville Warbler, Least Tern, Marsh Wren.

    • Meta-testing: Rose breasted Grosbeak, Nighthawk, Long tailed Jaeger, Bronzed Cowbird, California Gull, Ivory Gull, Northern Fulmar, Brown Pelican, Ring billed Gull, Great Grey Shrike, White breasted Nuthatch, Mourning Warbler, Sage Thrasher, Horned Puffin, Pied Kingfisher, Shiny Cowbird, Scott Oriole, Red eyed Vireo, Song Sparrow, Winter Wren.

  • Describable Textures Dataset (DTD) (Cimpoi et al., 2014) is a texture image dataset which contains 5640 images from 47 classes. Each class contains 120 images. Meta-training/Meta-validation/Meta-testing contains 30/7/10 classes respectively.

    • Meta-training: pitted, woven, crosshatched, crystalline, sprinkled, lacelike, bubbly, marbled, dotted, bumpy, striped, zigzagged, lined, smeared, pleated, stratified, waffled, knitted, gauzy, porous, spiralled, grooved, banded, potholed, stained, veined, swirly, frilly, freckled, studded.

    • Meta-validation: wrinkled, grid, perforated, cobwebbed, honeycombed, cracked, blotchy.

    • Meta-testing: fibrous, matted, scaly, chequered, flecked, paisley, braided, polka-dotted, interlaced, meshed.

  • Fine-Grained Visual Classification of Aircraft (FGVC-Aircraft) (Maji et al., 2013) is an image dataset for fine-grained visual categorization of aircraft. The dataset contains 102 different aircraft variants. In this paper, we randomly select 100 variants with 100 images in each variant. We split them into meta-training/meta-validation/meta-testing sets of 64/16/20 variants, respectively.

    • Meta-training: MD-90, 737-600, A310, An-12, DR-400, Falcon-900, DC-3, Challenger-600, Fokker-70, Cessna-172, 747-400, ERJ-145, Dornier-328, A330-300, A319, Model-B200, E-170, A340-500, BAE-125, Metroliner, 747-300, C-130, DH-82, Hawk-T1, 727-200, 767-300, DC-10, Spitfire, E-195, BAE-146-300, F-16A-B, Beechcraft-1900, 747-200, Boeing-717, Falcon-2000, 777-300, Cessna-560, DHC-8-100, Cessna-525, 737-200, DC-8, Global-Express, DHC-1, CRJ-200, A340-300, DC-9-30, CRJ-900, A320, 737-300, Eurofighter-Typhoon, SR-20, E-190, Saab-340, C-47, Il-76, MD-87, 757-300, DHC-6, Tu-154, 777-200, 767-200, A318, 757-200, A300B4.

    • Meta-validation: 737-900, A340-600, 737-800, 737-400, L-1011, A330-200, Gulfstream-V, 737-500, A340-200, ATR-72, MD-11, CRJ-700, EMB-120, Fokker-100, DC-6, 737-700.

    • Meta-testing: 707-320, PA-28, Cessna-208, F-A-18, DHC-8-300, ERJ-135, Tornado, BAE-146-200, A321, ATR-42, Saab-2000, Tu-134, Fokker-50, A380, MD-80, Gulfstream-IV, Yak-42, 747-100, 767-400, Embraer-Legacy-600.

  • FGVCx-Fungi (Fungi) (Fun, 2018) contains over 100,000 fungi images of nearly 1,500 wild mushroom species. We first filter out the species with fewer than 150 images and then randomly select 100 species with 150 images in each species. We split them into meta-training/meta-validation/meta-testing sets of 64/16/20 species, respectively.

    • Meta-training: Suillus granulatus, Phaeolus schweinitzii, Cystoderma amianthinum, Pycnoporellus fulgens, Psathyrella candolleana, Meripilus giganteus, Phellinus pomaceus, Laccaria laccata, Laccaria proxima, Amanita excelsa, Ganoderma pfeifferi, Clitopilus prunulus, Agaricus arvensis, Hericium coralloides, Plicatura crispa, Agrocybe praecox, Steccherinum ochraceum, Hypholoma fasciculare, Xerocomellus pruinatus, Xerocomellus chrysenteron, Crepidotus cesatii, Auricularia auricula-judae, Heterobasidion annosum, Entoloma clypeatum, Cortinarius torvus, Mycena tintinnabulum, Laetiporus sulphureus, Datronia mollis, Pholiota squarrosa, Cerioporus squamosus, Tricholoma terreum, Coprinellus micaceus, Cylindrobasidium laeve, Dacrymyces stillatus, Gloeophyllum sepiarium, Lycoperdon perlatum, Hygrophorus pustulatus, Clavulina coralloides, Xerocomus ferrugineus, Cortinarius alboviolaceus, Byssomerulius corium, Boletus edulis, Hymenopellis radicata, Basidioradulum radula, Cortinarius elatior, Schizophyllum commune, Cortinarius malicorius, Suillellus luridus, Ganoderma applanatum, Oligoporus guttulatus, Tubaria furfuracea, Cortinarius largus, Pleurotus ostreatus, Stereum hirsutum, Xylodon raduloides, Peniophora incarnata, Sutorius luridiformis, Flammulina velutipes var. velutipes, Phlebia radiata, Hygrocybe conica, Chlorophyllum olivieri, Armillaria ostoyae, Peniophora quercina, Mycena galericulata

    • Meta-validation: Agaricus impudicus, Daedaleopsis confragosa, Fomitopsis pinicola, Cortinarius anserinus, Mucidula mucida, Trametes versicolor, Stropharia cyanea, Ramaria stricta, Radulomyces confluens, Gliophorus psittacinus, Psathyrella spadiceogrisea, Coprinopsis lagopus, Daedalea quercina, Amanita muscaria, Armillaria lutea, Vuilleminia comedens

    • Meta-testing: Hygrocybe ceracea, Trametes hirsuta, Polyporus tuberaster, Lacrymaria lacrymabunda, Fistulina hepatica, Gymnopus dryophilus, Amanita rubescens, Fuscoporia ferrea, Craterellus undulatus, Tricholoma scalpturatum, Mycena pura, Russula depallens, Bjerkandera adusta, Trametes gibbosa, Tremella mesenterica, Cerioporus varius, Amanita fulva, Xylodon paradoxus, Cuphophyllus virgineus, Cortinarius flexipes
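The 64/16/20 class-level splits listed above can be produced with a seeded shuffle. The sketch below is only illustrative: the function name `split_classes`, the seed, and the placeholder class names are assumptions, and it will not reproduce the exact splits listed in this appendix.

```python
import random

def split_classes(class_names, n_train=64, n_val=16, n_test=20, seed=0):
    """Randomly partition classes into disjoint meta-train/val/test sets."""
    names = sorted(class_names)
    if len(names) < n_train + n_val + n_test:
        raise ValueError("not enough classes for the requested split")
    rng = random.Random(seed)
    rng.shuffle(names)
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Hypothetical usage with 100 placeholder class names.
train, val, test = split_classes([f"species_{i:03d}" for i in range(100)])
```

Splitting at the class level (rather than the image level) keeps meta-testing classes unseen during meta-training.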

Appendix C Hyperparameters & Additional Experiment Settings

We summarize the hyperparameters in this paper in Table 4. Like (Finn et al., 2017), we compute the full Hessian-vector products for MAML. All cluster centers are randomly initialized. Note that, in few-shot classification problem, we use the change of averaged training accuracy to determine whether to increase clusters. Thus, in this problem. For toy regression task, the pre-aggregator embedding is a fully connected layer. Following (Finn et al., 2017), the base learner has two hidden layers with 40 neurons in each. For few-shot image classification task, the pre-aggregator embedding is a block of two convolutional layers with two fully connected layers. The base learner is a standard base learner with 4 standard convolutional blocks. For continual scenario, we add one cluster every time. All the experiments are implemented using TensorFlow (Abadi et al., 2016).
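The toy-regression base learner described above (two hidden layers with 40 neurons each) can be sketched as a plain forward pass. This is a NumPy illustration rather than the TensorFlow implementation used in the paper; the ReLU activation, the scalar input and output, and the initialization scheme are assumptions.

```python
import numpy as np

def init_params(rng, sizes=(1, 40, 40, 1)):
    """One (weight, bias) pair per layer: two hidden layers of 40 units."""
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass; ReLU on hidden layers (assumed), linear output layer."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
params = init_params(rng)
x = np.linspace(-5.0, 5.0, 10).reshape(-1, 1)  # 10 scalar inputs
y_hat = forward(params, x)
```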

| Hyperparameters | Toy Regression | miniImageNet | Multi-Datasets (New Benchmark) |
| --- | --- | --- | --- |
| Input Scale (only for image data) | / | | |
| Meta-batch Size (task batch size) | 25 | 4 | 4 |
| Inner loop learning rate | 0.001 | 0.001 | 0.001 |
| Outer loop learning rate | 0.001 | 0.01 | 0.01 |
| Filters of CNN (only for image data) | / | 32 | 32 |
| Meta-training adaptation steps | 5 | 5 | 5 |
| Task representation size | 40 | 128 | 128 |
| Reconstruction loss weight | 0.01 | 0.01 | 0.01 |
| Image Embedding Size (before aggregator) | / | 64 | 64 |
| Continual Training Threshold | 1.25 | / | 0.85 |
| # epoch (Q) for computing loss | 1000 | / | 100 |
Table 4: Hyperparameter summary

Appendix D Results of MiniImagenet

In this part, we present an additional comparison on the MiniImagenet dataset. Similar to the analysis in (Finn et al., 2018), the sampled tasks in this benchmark do not have obvious heterogeneity and uncertainty. Thus, the goal is to compare our approach with gradient-based meta-learning methods and other previous models. The expressive capacity of each model is controlled by using 4 standard convolutional layers, and the results are shown in Table 5. With the same expressive capacity, our model achieves performance comparable to MAML-based models and other previous models in the meta-learning field.

MiniImagenet 5-way 1-shot Accuracy
Matching Nets (Vinyals et al., 2016)
meta-learner LSTM (Ravi & Larochelle, 2016)
Prototypical Network (Snell et al., 2017)
SNAIL (Mishra et al., 2018)
mAP-DLM (Triantafillou et al., 2017)
Relation Net (Yang et al., 2018)
GNN (Garcia & Bruna, 2017)
MAML (Finn et al., 2017)
LLAMA (Finn & Levine, 2017)
BMAML (Yoon et al., 2018a)
MT-Net (Lee & Choi, 2018)
MUMOMAML (Vuorio et al., 2018)
Reptile (Nichol & Schulman, 2018)
MetaSGD (Li et al., 2017)
PLATIPUS (Finn et al., 2018)
HSML (ours)
Table 5: Comparison between our approach and prior few-shot learning techniques on the 5-way, 1-shot MiniImagenet benchmark. For MT-Net (Lee & Choi, 2018), we remove the T-block since it introduces several convolutional layers, which increase the expressive capacity of the base learner (Lin et al., 2013). For BMAML (Yoon et al., 2018a), 24 classes are used for meta-testing in their original paper, while other methods use 20 classes. Since they have not released their code, we cannot determine which classes were used. Thus, we implement it and report its performance on the standard classes (i.e., 20 classes for testing). Like (Finn et al., 2018), we bold the methods whose confidence intervals overlap with that of the highest score.
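The bolding rule in the caption compares confidence intervals of per-task accuracies. A minimal sketch of that comparison is given below, assuming a normal-approximation 95% interval (half-width 1.96 times the standard error); the paper does not state here how its intervals were computed, and both function names are hypothetical.

```python
import math

def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of an assumed normal-approximation 95% CI."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width

def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True if [mean_a ± half_a] and [mean_b ± half_b] intersect."""
    return abs(mean_a - mean_b) <= half_a + half_b
```

A method would be bolded when `intervals_overlap` returns true for its interval and that of the best-scoring method.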

Appendix E Leave-one-out Experiments on Few-shot Image Classification

In this part, we design a more difficult experiment for few-shot image classification. In each run, we use three datasets for meta-training and the remaining dataset for meta-testing. For example, we use the texture, bird and aircraft datasets for meta-training, and the fungi dataset for meta-testing. Unlike previous meta-learning settings, which only use different classes for meta-testing, the leave-one-out experiment uses an entirely different dataset to test the generalization performance, which is more challenging.
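The leave-one-out protocol amounts to enumerating one split per held-out dataset. A minimal sketch follows; the function name and the dictionary keys are hypothetical.

```python
DATASETS = ("bird", "texture", "aircraft", "fungi")

def leave_one_out_splits(datasets=DATASETS):
    """One split per dataset: hold it out for meta-testing, train on the rest."""
    return [
        {
            "meta_train": tuple(d for d in datasets if d != held_out),
            "meta_test": held_out,
        }
        for held_out in datasets
    ]

splits = leave_one_out_splits()
```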

The results of 5-way 1-shot classification are shown in Table 6. We compare our method with MAML and MUMOMAML (the best baseline in few-shot classification). All results are significantly worse than those obtained without the leave-one-out setting, which shows the difficulty of this experiment. However, by capturing the task clustering structure, our method still achieves better performance than MAML and MUMOMAML.

Model Bird Texture Aircraft Fungi Average
Table 6: Comparison of leave-one-out experiments on 5-way 1-shot classification. 4000 tasks are used to test the performance. For each dataset, the performance is reported when this dataset is used for meta-testing.

Appendix F Additional Results of Few-shot Classification

Table 7 and Table 8 contain the full results (accuracy with confidence interval) of few-shot image classification. Table 7 shows the full results of the bottom table in Figure 7 (in paper). Table 8 contains the full results of Table 3 (in paper).

Model Bird Texture Aircraft Fungi Average
HSML-Static (2C)
HSML-Static (10C)
Table 7: Comparison of online update results on few-shot image classification 5-way 1-shot scenario (Full Table).
Num. of Clus. Bird Texture Aircraft Fungi Average

Table 8: Comparison of different cluster numbers (Full Table).

Appendix G Effect of Different Aggregator

In our experiments, we found that the recurrent aggregator performs best. To give more quantitative insight into the choice of aggregator, we compare the mean pooling and recurrent aggregators with different numbers of shots in Table 9. We can see that the recurrent aggregator significantly outperforms the mean pooling aggregator in the 1-shot scenario. As the number of training samples increases, the performance of the two aggregators becomes more similar. Therefore, compared with the recurrent aggregator, training a good mean pooling aggregator may require more data.
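The difference between the two aggregators can be sketched as follows. Mean pooling averages the sample embeddings directly, while the recurrent variant here runs a plain tanh RNN over the samples and averages its hidden states. The RNN cell, the read-out, the random weights, and the sizes (64-dimensional embeddings, 128-dimensional task representation) are assumptions for illustration; the paper's exact aggregator definitions are not given in this section.

```python
import numpy as np

def mean_pool_aggregate(embeddings):
    """Order-invariant aggregation: average of the sample embeddings."""
    return embeddings.mean(axis=0)

def recurrent_aggregate(embeddings, w_in, w_rec, b):
    """Assumed recurrent aggregation: tanh RNN over samples, hidden states averaged."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x in embeddings:
        h = np.tanh(x @ w_in + h @ w_rec + b)
        states.append(h)
    return np.mean(states, axis=0)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 64))                  # 5 support samples (5-way 1-shot)
w_in = rng.normal(scale=0.1, size=(64, 128))    # hypothetical input weights
w_rec = rng.normal(scale=0.1, size=(128, 128))  # hypothetical recurrent weights
b = np.zeros(128)
g_mean = mean_pool_aggregate(emb)
g_rec = recurrent_aggregate(emb, w_in, w_rec, b)
```

In this sketch the mean pooling output is unchanged when the samples are reordered, whereas the recurrent output depends on the order in which samples are presented.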

Model Bird Texture Aircraft Fungi Average
1-shot HSML-MPAA
3-shot HSML-MPAA