
Meta-learning Amidst Heterogeneity and Ambiguity

by   Kyeongryeol Go, et al.
KAIST Department of Mathematical Sciences

Meta-learning aims to learn a model that can handle multiple tasks generated from an unknown but shared distribution. However, typical meta-learning algorithms have assumed the tasks to be similar enough that a single meta-learner suffices to aggregate the variations in all aspects. In addition, uncertainty has received little consideration when only limited information is given as context. In this paper, we devise a novel meta-learning framework, called Meta-learning Amidst Heterogeneity and Ambiguity (MAHA), that outperforms previous works in prediction owing to its ability to identify tasks. Through extensive experiments in regression and classification, we demonstrate the validity of our model, which turns out to be robust to both task heterogeneity and ambiguity.



1 Introduction

Although deep learning models have shown remarkable performance in various domains, they have consistently been criticized for their sensitivity to the amount of data (Chen and Lin, 2014; Najafabadi et al., 2015; Cho et al., 2015; Sun et al., 2017; Hestness et al., 2017). Despite all the publicly available data, the data scarcity issue is still not negligible. In many cases, the data actually worth analyzing is quite limited for many different reasons, for example, concerns about data privacy (Liu et al., 2020) and noisy data with anomalies (Sanders and Saxe, 2017).

Along with transfer learning, few-shot learning, and multi-task learning, meta-learning has recently been highlighted as a way to overcome this deficiency through its adaptive behavior using only a few data points (Vanschoren, 2018; Hospedales et al., 2020).

Meta-learning aims to handle multiple tasks by efficiently organizing the acquired knowledge. However, typical algorithms have been assessed under a strong assumption that lacks representativeness in real-world scenarios. Among the many challenges (Triantafillou et al., 2019; Lee et al., 2019a), we focus on the following two assumptions. First, the tasks are regarded as similar enough that a single meta-learner suffices to aggregate the variations in all aspects. This implies there has been little effort to compactly abstract notions of heterogeneity, one of the essential factors characterizing human intelligence, which is advantageous in decision-making for querying the information needed to solve a problem. In addition, uncertainty in identifying a particular task from a few data points has received little consideration. It is therefore not easy to analyze or transfer the model's acquired knowledge, which is critical in growing AI industries such as medical diagnosis (Ahmad et al., 2018; Challen et al., 2019; Vellido, 2019) and autonomous vehicles (Kim and Canny, 2017; Shafaei et al., 2018; Chen et al., 2020), because a certain level of interpretability is required for greater safety.

In this respect, we hypothesize that disentanglement in the task representation is advantageous; such disentanglement frequently appears in studies analyzing the inherent factors of variation within a dataset. The goals are to i) uncover the distinctive properties as a tool for interpretability and ii) explicitly separate the dataset into several clusters that would be detrimental to train altogether. However, as a trade-off for interpretability, the overconfident nature of deep learning may strictly assign tasks to certain clusters without considering ambiguity, which requires additional treatment to cope with anomalies.

Figure 1: Heterogeneity and ambiguity in the task distribution. These are not independent concepts; ambiguity naturally follows from heterogeneity.

To this end, we propose a new meta-learning framework, Meta-learning Amidst Heterogeneity and Ambiguity (MAHA), that performs robustly against the following two hurdles. Task heterogeneity: there is no clear discrimination between tasks sampled from faraway modes of the task distribution (Vuorio et al., 2018; Yao et al., 2019, 2020a). Task ambiguity: too few data points are given to infer the task identity (Finn et al., 2018; Rusu et al., 2018). Specifically, we devise a pre-task built upon the neural processes (Garnelo et al., 2018a, b; Kim et al., 2019) to obtain a well-clustered and interpretable representation. Then, agglomerative clustering is applied to the representation without any external knowledge such as the number of clusters, and a different model is trained separately for each cluster. Please refer to Figure 7 for the overall training process of MAHA.

To summarize, the main contributions of this paper are four-fold:

  • We propose a simple yet powerful architecture design for the neural processes to better leverage the latent variables and be applicable in classification. (See Section 5.1)

  • We resolve the information asymmetry in the neural processes and construct well-clustered and interpretable representations. (See Section 5.2)

  • We validate MAHA through both regression and classification, by which the experimental results demonstrate its ability to cope with the heterogeneity and ambiguity. (See Section 6)

  • We devise an additional regularization term for the low-shot regime that distills knowledge obtainable from settings with relatively more training samples and variation. (See Appendix B)

2 Related work

Gradient-based meta-learning, represented by MAML (Finn et al., 2017), aims to learn prior parameters that can quickly adapt to a given task through several gradient steps. It consists of an inner loop for task adaptation and an outer loop for the meta-update over tasks. Many variants have emerged to balance generalization and customization in a task-adaptive manner. On the generalization side, Finn et al. (2018) and Kim et al. (2018) suggested probabilistic extensions through a hierarchical Bayesian model and Stein variational gradient descent (SVGD) (Liu and Wang, 2016). In addition, Rusu et al. (2018) conducted the inner loop on a low-dimensional latent embedding space, and Yin et al. (2019) proposed a meta-regularization built on information theory. On the customization side, Lee and Choi (2018) divided the parameters into two categories, one shared across tasks and the other modulated task-specifically. Zintgraf et al. (2019) introduced layer-wise adaptive units, and (Vuorio et al., 2018; Yao et al., 2019, 2020a, 2020b) considered auxiliary networks that modulate the initial parameters before the inner loop.

The family of neural processes, also known as contextual meta-learning, is devised to imitate the flexibility of the Gaussian process (Rasmussen, 2003) while resolving its scalability issue. Rather than explicitly modeling a kernel to conduct Bayesian inference as in (Wilson et al., 2016), it learns an implicit kernel directly from data, which removes the design restrictions. Task-specific information is extracted from a subset of the data through an encoder and aggregated for use in the decoder, which predicts the corresponding outputs of the remaining data. Starting from the conditional neural process (CNP) (Garnelo et al., 2018a), built solely on a deterministic path, the neural process (NP) (Garnelo et al., 2018b) adds a stochastic path. The attentive neural process (ANP) (Kim et al., 2019) further applies an attention mechanism to resolve the underfitting issue in NP by enlarging its locally adaptive behavior. More complex modules, such as a graph structure (Louizos et al., 2019) and recurrent neural networks (Kumar et al., 2018; Singh et al., 2019), were further considered to capture dependencies among latent variables and complex temporal dynamics.

However, many problems remain unsolved. First, the neural processes still rely on a complex feature extractor to enable task-specific modulation, which requires various regularization techniques with additional hyperparameters (Requeima et al., 2019). Furthermore, although the neural processes obtain an explicit task representation, existing approaches have investigated little regarding interpretability. Finally, performance analysis has focused mainly on regression (Le et al., 2018; Kim et al., 2019; Singh et al., 2019; Suresh and Srinivasan; Gordon et al., 2020), and some methods are not even directly applicable to classification (Lee et al., 2020).

3 Problem setting

Let C be the context set, and let T be the target set, where both C and T are sampled from the same task τ. A common goal in meta-learning is to devise an algorithm for the model f that appropriately uses the model parameter θ to obtain a task-specific parameter θ_τ from the input-output pairs in C, such that when a target input x_T is given, the target output y_T can be accurately estimated with high confidence. For example, in MAML (Finn et al., 2017), the task-specific parameter is computed by a gradient step θ_τ = θ − α ∇_θ L_C(f_θ). On the other hand, in CNP (Garnelo et al., 2018a), θ and θ_τ no longer share the same parameter space. Here, the model parameter is divided into an encoder part θ_enc and a decoder part θ_dec, and the task-specific parameter is computed as the encoder output r = f_{θ_enc}(C). Hereafter, we omit θ for brevity.

For model training, θ is iteratively updated using batches. Here, each batch is constructed from multiple tasks characterized by way and shot. If there are N classes, each containing K input-output pairs, we call it an N-way K-shot problem. In classification, the class labels are shuffled whenever a task instance is created, which encourages a meta-learning algorithm to learn how to classify images even when an unseen configuration of classes occurs.
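The episodic N-way K-shot construction described above can be sketched as follows; the sampler below is a minimal illustration with hypothetical names (`sample_task`, the toy dataset), not code from the paper. The key detail is that class labels are re-assigned per episode, so the model must infer the label configuration from the context set.

```python
import numpy as np

def sample_task(dataset, n_way=5, k_shot=1, q_shot=5, rng=None):
    """Build one N-way K-shot episode from a class-indexed dataset.

    `dataset` maps a class name to an array of examples. Labels 0..N-1 are
    freshly assigned to the sampled classes in every episode, so global
    class identities are never exposed to the learner.
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(dataset), size=n_way, replace=False)
    context, target = [], []
    for label, cls in enumerate(classes):  # per-episode label assignment
        idx = rng.choice(len(dataset[cls]), size=k_shot + q_shot, replace=False)
        examples = dataset[cls][idx]
        context += [(x, label) for x in examples[:k_shot]]
        target += [(x, label) for x in examples[k_shot:]]
    return context, target

# Toy dataset: 10 classes, 20 one-dimensional "examples" each.
data = {f"class_{i}": np.random.randn(20, 1) for i in range(10)}
C, T = sample_task(data, n_way=5, k_shot=1, q_shot=5)
```

A 5-way 1-shot episode thus yields 5 context pairs and, here, 25 target pairs per task.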

4 Preliminary : (Attentive) Neural Process

Figure 2: Graphical model of the related baselines. Circles denote random variables, whereas diamonds denote deterministic variables. Shaded variables are observed during the test phase, and every edge in between is implemented as a neural network.

In Figure 2, we summarize how the basic family of neural processes has evolved in terms of its graphical model. The encoder comprises a deterministic path and a stochastic path, computing the task-specific parameters of the variational distributions, which we denote by r and z, respectively. (Note that r is deterministic with zero variance.) Here, each conditioning variable indicates a set of input-output pairs, and a reparameterization trick is applied at the end of the stochastic path for differentiable non-centered parameterization (Kingma and Welling, 2014).

For both paths, NP is constructed by:

r = MeanPool_shot( rFF(x ⊕ y) ),    (μ_z, σ_z) = rFF( MeanPool_shot( rFF(x ⊕ y) ) ),

where MeanPool_shot(·) is a mean-pooling operation along the subscripted dimension, rFF(·) can be any row-wise feedforward layer, such as a Multi-Layer Perceptron (MLP), and ⊕ denotes concatenation. On the other hand, ANP exploits multi-head attention, connecting the target input to r in the graphical model, and self-attention, both proposed in (Vaswani et al., 2017). As in NP, the value of z is the same for every shot of the target set; however, based on the attention score with each element of the context set, r is now computed in a shot-dependent manner.
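The two encoder paths above can be sketched numerically; the following is a minimal numpy illustration (all weights and dimensions are illustrative, and `rff`/`np_encoder` are hypothetical names), showing the shot-wise mean pooling of the deterministic path and the reparameterization trick at the end of the stochastic path.

```python
import numpy as np

def rff(xy, w, b):
    """Row-wise feedforward layer applied independently to each (x, y) pair."""
    return np.tanh(xy @ w + b)

def np_encoder(x, y, params, rng):
    """Sketch of the NP encoder for x of shape [shots, dx], y of [shots, dy].

    Deterministic path: r = MeanPool_shot(rFF(x ⊕ y)).
    Stochastic path: z sampled from N(mu, sigma^2) via the reparameterization
    trick, so that gradients could flow through the sampling step.
    """
    xy = np.concatenate([x, y], axis=-1)
    r = rff(xy, params["w1"], params["b1"]).mean(axis=0)   # shot-independent
    s = rff(xy, params["w2"], params["b2"]).mean(axis=0)
    mu, log_sigma = np.split(s @ params["w3"], 2, axis=-1)
    sigma = 0.1 + 0.9 / (1 + np.exp(-log_sigma))           # bounded positive std
    z = mu + sigma * rng.standard_normal(mu.shape)         # reparameterization
    return r, z, mu, sigma

rng = np.random.default_rng(0)
params = {"w1": rng.standard_normal((2, 8)), "b1": np.zeros(8),
          "w2": rng.standard_normal((2, 8)), "b2": np.zeros(8),
          "w3": rng.standard_normal((8, 4))}
x = rng.standard_normal((5, 1))
y = np.sin(x)
r, z, mu, sigma = np_encoder(x, y, params, rng)
```

Because both paths pool over the shot dimension, r and z are invariant to the order of the context pairs, matching the permutation invariance the model requires.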

Then, conditioned on the encoder outputs r and z, with the target input x_T, the decoder computes the parameters of the predictive distribution on the target output y_T:

(μ_y, σ_y) = rFF( x_T ⊕ r ⊕ z ),

where the predictive distribution is expressed as p(y_T | x_T, r, z) = N(y_T; μ_y, σ_y²). Eventually, relying on variational inference, one can obtain a loss function that approximates the negative ELBO by replacing the intractable posterior with the variational distribution, following (Garnelo et al., 2018b):

L = − E_{q(z|T)} [ log p(y_T | x_T, r_C, z) ] + KL( q(z|T) ‖ q(z|C) ).    (2)
As a result, based on the Kolmogorov extension and de Finetti theorems, the neural processes define a stochastic process satisfying exchangeability and consistency (Garnelo et al., 2018b). However, when trained alongside the deterministic path, neural processes with latent variables are empirically shown to have difficulty capturing the variability of the stochastic process (Le et al., 2018); the causes are investigated and resolved in Section 5.2.

5 Meta-learning Amidst Heterogeneity and Ambiguity

This section describes our algorithm MAHA, whose primary focus is to devise a pre-task that copes with task heterogeneity and ambiguity in meta-learning. We first introduce the encoder-decoder pipeline of MAHA, namely FELD, whose effects are examined by substituting the corresponding components within NP in Section 6. Then, a dimension-wise pooling and an auto-encoding structure are proposed to obtain a well-clustered and interpretable representation. Finally, the training process of MAHA is described, which applies to both regression and classification.

5.1 Encoder-decoder pipeline

Flexible Encoder

Although the attention mechanism proposed in ANP was key to resolving the underfitting in NP, it gives less incentive to focus on the task identity that is shared across shots. As a result, in Figure 3, ANP appears to fit the given input-output pairs too strongly, which leads to a wiggly prediction. Particularly under task heterogeneity and ambiguity, where the prediction space is prone to be highly variable, the wiggly prediction of ANP leads to poor generalization performance (see Figure 8). Therefore, the graphical model of NP is instead adopted in MAHA, since its latent variables are shot-independent. Then, based on the analysis of (Cremer et al., 2018), the problematic underfitting is addressed by substituting the encoder with the flexible and permutation-invariant Set Transformer (ST) (Lee et al., 2019b). Note that the Set Transformer can subsume the rFF(·) and mean-pooling operations in the encoder of NP. See Appendix A for a more detailed explanation of the modules in the Set Transformer.

Figure 3: Qualitative comparison between NP, ANP, and NP with the flexible encoder (NP+FE) on functions generated from a Gaussian process. The shaded areas correspond to 2 standard deviations. The prediction of ANP turns out to be wiggly, while NP and NP+FE are relatively smooth, following Occam's razor. A quantitative comparison can be found in Table 1.


Linear Decoder

We avoid using a complex decoder such as (Oord et al., 2016) and instead apply feature-wise linear modulation to the target input x_T. Inspired by (Zhao et al., 2017a), we composite the latent variables using a skip connection. Among the many normalization techniques, layer normalization (Ba et al., 2016) is applied, since its statistics are computed independently for each batch instance, so that only the latent variables can capture the heterogeneity, in accordance with the pooling proposed in Section 5.2.


Here, g(·) denotes any feature extractor, LN(·) indicates layer normalization, and the transpose operation T permutes the last two dimensions of the tensor. This is aligned with previous approaches (Bowman et al., 2015; Semeniuta et al., 2017; Yang et al., 2017; He et al., 2019) that weaken the decoder to allow the latent variables to be appropriately leveraged. It also relates to studies on few-shot classification (Gordon et al., 2018; Requeima et al., 2019) in which each column of the modulation is computed from shots within the same way. However, when accompanied by the pooling in Section 5.2, the columns are no longer independent of one another and share information across ways.
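As a rough numerical picture of such a linear decoder, the sketch below layer-normalizes the extracted target features and then modulates them feature-wise with the latent variables. The composite used here (r + z as scale, z as shift) and all names are hypothetical placeholders; the paper's exact composition may differ, and the point is only that per-instance normalization leaves the latents to carry the heterogeneity.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each instance independently over its feature dimension."""
    return (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + eps)

def linear_decoder(x_t, r, z, g):
    """Feature-wise linear modulation of normalized target features.

    A skip-connection-style composite of the latents (assumed form) produces
    the scale and shift; the decoder itself stays shallow and linear.
    """
    h = layer_norm(g(x_t))     # per-instance statistics only
    return h * (r + z) + z

rng = np.random.default_rng(0)
W = rng.standard_normal((1, 8))
g = lambda x: np.tanh(x @ W)           # toy feature extractor
x_t = rng.standard_normal((10, 1))     # 10 target inputs
r, z = rng.standard_normal(8), rng.standard_normal(8)
out = linear_decoder(x_t, r, z, g)
```

Since LN(·) removes per-instance scale and shift, any task-level variation in the output must come through r and z, which is the stated design intent.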

Figure 4: Prediction of the output distribution. Superscripts index batch instances.

5.2 Inducing disentanglement on the latent variables

Figure 5: Stacked bar plot of the weight norms of the decoding layer.
Figure 6: Computational diagram for r and z. For visual comfort, every block of the encoder outputs in regression is reshaped accordingly. In classification, the shot dimension is divided along ways, denoted by subscripts.

For NP and ANP trained on functions generated from a GP, we illustrate the weight norm of the decoding layer immediately after the latent variables in Figure 5. The sparsely coded decoder implies redundancy of the stochastic path due to the component-collapsing behavior referred to in (Nalisnick and Smyth, 2016; Joo et al., 2020). This phenomenon can be explained by the information preference problem (Chen et al., 2016; Zhao et al., 2017b), where the information flow concentrates on the deterministic path with a tendency to ignore the stochastic path.

To handle this information asymmetry, several solutions have been proposed in studies on generative models, such as KL annealing schedules (Bowman et al., 2015; Fu et al., 2019) and expressive posterior approximations (Rezende and Mohamed, 2015; Kingma et al., 2016), but these are generally not robust to changes in model architecture. Instead, we propose a simple method that avoids redundancy of the stochastic path by encouraging it to acquire multi-modality under heterogeneity and ambiguity.

Dimension-wise pooling

We explicitly capture the distinct variations within the information flow by pooling each path across a different dimension, batch for r and way for z:

r ← MeanPool_batch(r),    (μ_z, σ_z) ← MeanPool_way(μ_z, σ_z).

Then, the deterministic representation becomes identical not only across shots but also across the batch. Whenever it is insufficient to handle all variations across tasks within the same batch, i.e., when facing task heterogeneity, the model must resort to the stochastic representation, since the deterministic representation captures only the average properties. On the other hand, the stochastic representation allows the different ways to share information and becomes class-invariant. We illustrate how the latent variables r and z are computed in Figure 6. Note that the value of way is set to 1 in regression, so the pooling on z is negligible.
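The dimension-wise pooling itself is a small tensor operation; a sketch under assumed shapes (r of [batch, d]; the stochastic parameters of [batch, way, d]) is:

```python
import numpy as np

def dimension_wise_pool(r, mu, sigma):
    """Pool the deterministic path over the batch (task) dimension and the
    stochastic parameters over the way (class) dimension, broadcasting the
    averages back to the original shapes. Shapes are assumed:
    r [batch, d]; mu, sigma [batch, way, d]."""
    r_pooled = r.mean(axis=0, keepdims=True).repeat(r.shape[0], axis=0)
    mu_pooled = mu.mean(axis=1, keepdims=True).repeat(mu.shape[1], axis=1)
    sigma_pooled = sigma.mean(axis=1, keepdims=True).repeat(sigma.shape[1], axis=1)
    return r_pooled, mu_pooled, sigma_pooled

r = np.arange(6.0).reshape(3, 2)            # 3 tasks in a batch
mu = np.arange(12.0).reshape(2, 3, 2)       # 2 tasks, 3 ways
sigma = np.ones((2, 3, 2))
rp, mp, sp = dimension_wise_pool(r, mu, sigma)
```

After pooling, r carries only what is common to every task in the batch, and mu/sigma carry only what is common to every way, which is exactly the division of labor the text describes.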

Auto-encoding structure

Empirically, we observe that KL collapse (Bowman et al., 2015; Alemi et al., 2016; Sønderby et al., 2016; Zhao et al., 2017b) does not occur whenever the pooling operations are used (see Appendix D). This implies that the posterior does not simply converge to the approximate prior, so the decoder remains dependent on the stochastic path. However, there is still an incentive for z to be underutilized during decoding, because it is inferred from the small context set rather than the large target set (Hewitt et al., 2018), and neural networks exploiting set representations are known to perform poorly in the low-shot regime (Edwards and Storkey, 2016; Zaheer et al., 2017), i.e., when facing task ambiguity.

Thereby, we resort to the conditional auto-encoding structure (Sohn et al., 2015) on top of the dimension-wise pooling to cope with the lack of training samples. As a result, the following loss function is derived, which differs from Equation 2 in i) whether the pooling operations are used and ii) which set is used to compute the deterministic representation, corresponding respectively to the dimension-wise pooling and the auto-encoding structure:

L = − E_{q(z|T)} [ log p(y_T | x_T, r_T, z) ] + KL( q(z|T) ‖ q(z|C) ),    (5)

where the deterministic representation r_T is now computed from the target set and both paths are pooled as above.

5.3 Training process

The overall procedure is shown in Figure 7. Initially, the dimension-wise pooling and the auto-encoding structure proposed in Section 5.2 are used along with FELD to minimize the loss function in Equation 5. Next, agglomerative clustering is applied to the disentangled representation from the stochastic path to estimate the number of clusters with the highest purity value. (For a homogeneous dataset, a single cluster suffices, and the previous steps can be omitted.) Finally, a separate FELD is trained from scratch for each cluster using Equation 2, where the tasks are no longer uniformly sampled but statistically skewed based on the ratio of heterogeneous tasks within the cluster. At evaluation, the FELD corresponding to the closest cluster, measured by Euclidean distance to the cluster centers, is exploited.
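The purity criterion used to choose the number of clusters can be computed directly; the helper below is a minimal sketch (the name `purity` and the toy labels are illustrative, and the clustering step itself is left to any off-the-shelf agglomerative routine).

```python
import numpy as np

def purity(cluster_ids, true_ids):
    """Cluster purity: each cluster votes for its majority reference label,
    and purity is the fraction of points matching their cluster's vote.
    Evaluated here against known dataset-of-origin labels, as a sketch of
    the cluster-number selection criterion."""
    cluster_ids = np.asarray(cluster_ids)
    true_ids = np.asarray(true_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        labels = true_ids[cluster_ids == c]
        correct += np.bincount(labels).max()  # majority vote within cluster
    return correct / len(true_ids)
```

A perfect partition scores 1.0, and lumping two reference groups into one cluster is penalized, so sweeping candidate cluster counts and keeping the highest-purity one recovers the estimate used in the pipeline.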

Figure 7: Overall training process of MAHA, where the estimated clusters partition the meta-train set.

6 Experiment

We first experiment on benchmark datasets that frequently appear in meta-learning and investigate the role of the encoder-decoder pipeline (FELD) by gradually adjusting NP. These datasets are generally regarded as homogeneous, so MAHA is equivalent to FELD when assuming a single cluster, as noted in Section 5.3. After that, MAHA is evaluated on heterogeneous datasets following the experimental setting of (Yao et al., 2019), with the dimension-wise pooling and the auto-encoding structure from Section 5.2, whose roles are examined both quantitatively and qualitatively. Please refer to Appendix C for details on the data split, architecture design, and hyperparameter search.

Overall, we aim to answer the following three questions:

  • Does MAHA outperform the previous baselines in terms of prediction? (See Table 1 to 5)

  • What are the benefits of using the flexible encoder and the linear decoder? (See Section 6.1)

  • How do the dimension-wise pooling and the auto-encoding structure contribute to obtaining a well-clustered representation within heterogeneity? (See Section 6.2)

6.1 Homogeneous dataset

Model   MSE
NP      0.166 ± 0.002
ANP     0.142 ± 0.002
NP+FE   0.138 ± 0.002
NP+LD   0.312 ± 0.002
FELD    0.130 ± 0.002
Table 1: MSE on Gaussian Process

Gaussian Process

Following the basic neural processes (Garnelo et al., 2018a, b; Kim et al., 2019), we consider functions generated from a GP with the squared exponential kernel k(x, x') = σ² exp(−(x − x')² / (2ℓ²)). The experimental results in Table 1 show that although ANP performs better than NP in terms of flexibility, this dominance no longer holds once NP is equipped with the flexible encoder. However, performance degrades when the linear decoder alone is added to NP. This is empirical evidence that NP strongly relies on the complexity of the decoder in regression, by which the model is prone to ignore the latent variables (Chen et al., 2016; Zhao et al., 2017b). By exploiting the flexible encoder to obtain latent variables informative enough that the (shallow) linear decoder suffices for prediction, FELD performs better than any of the models with the (deep) conventional decoder. We find the Set Transformer to be an apt choice, whose improvement cannot be matched by simply stacking MLPs. Moreover, it is notable that FELD outperforms NP+FE despite its decreased model capacity.
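The GP task generation above can be reproduced in a few lines; the sketch below samples random functions from a zero-mean GP with the squared exponential kernel (the hyperparameters and function names are illustrative, not the paper's exact configuration).

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, length=0.5):
    """Squared exponential kernel k(x, x') = sigma^2 exp(-(x-x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-d**2 / (2 * length**2))

def sample_gp_function(n=50, rng=None):
    """Draw one random function from a zero-mean GP on a 1D grid, as a sketch
    of how the regression tasks are generated."""
    rng = rng or np.random.default_rng()
    x = np.linspace(-2, 2, n)
    K = se_kernel(x, x) + 1e-6 * np.eye(n)   # jitter for numerical stability
    y = rng.multivariate_normal(np.zeros(n), K)
    return x, y

x, y = sample_gp_function(30, np.random.default_rng(0))
```

Each draw is one task; a random subset of the (x, y) pairs serves as the context set and the rest as targets.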

Model          5-way 1-shot     5-way 5-shot
Matching Net   43.40 ± 0.78%    51.09 ± 0.71%
Meta-LSTM      43.44 ± 0.77%    60.60 ± 0.71%
MAML           48.70 ± 1.84%    63.11 ± 0.92%
ProtoNet       49.42 ± 0.78%    68.20 ± 0.66%
REPTILE        49.97 ± 0.32%    65.99 ± 0.58%
Relation Net   50.44 ± 0.82%    65.32 ± 0.70%
CAVIA          51.82 ± 0.65%    65.85 ± 0.55%
VERSA          53.40 ± 1.82%    67.37 ± 0.86%
TPN            55.51 ± 0.86%    69.86 ± 0.65%
Meta-SGD       54.24 ± 0.03%    70.86 ± 0.04%
SNAIL          55.71 ± 0.99%    68.88 ± 0.92%
NP+LD          57.30 ± 0.06%    75.10 ± 0.04%
TADAM          58.50 ± 0.30%    76.70 ± 0.30%
LEO            61.76 ± 0.08%    77.59 ± 0.12%
FELD           62.77 ± 0.05%    81.15 ± 0.03%
Table 2: Accuracy on mini-ImageNet

Mini-ImageNet, Tiered-ImageNet

A similar tendency can be observed in classification. We consider mini-ImageNet (Vinyals et al., 2016) and tiered-ImageNet (Ren et al., 2018), two frequently used large-scale datasets for few-shot image classification. For mini-ImageNet, we follow the split of Ravi and Larochelle (2016), which assigns 64 classes to the meta-train set, 16 to the meta-valid set, and 20 to the meta-test set. For tiered-ImageNet, 608 classes are first grouped into 34 higher-level nodes, which are divided into 20, 6, and 8 nodes to construct the meta-train, meta-valid, and meta-test sets. We use the features provided by (Rusu et al., 2018), obtained by pre-training a deep residual network in a supervised manner as in (Gidaris and Komodakis, 2018; Oreshkin et al., 2018; Qiao et al., 2018). However, unlike (Qiao et al., 2018; Rusu et al., 2018), the meta-valid set is used only for early stopping and hyperparameter search, not to update the parameters.

Model          5-way 1-shot     5-way 5-shot
MAML           51.67 ± 1.81%    70.30 ± 0.08%
ProtoNet       53.31 ± 0.89%    72.69 ± 0.74%
Relation Net   54.48 ± 0.93%    71.32 ± 0.78%
Warp-MAML      57.20 ± 0.90%    74.10 ± 0.70%
TPN            57.41 ± 0.94%    71.55 ± 0.74%
Meta-SGD       62.95 ± 0.03%    79.34 ± 0.06%
NP+LD          63.36 ± 0.06%    80.50 ± 0.04%
LEO            66.33 ± 0.05%    81.44 ± 0.09%
FELD           66.87 ± 0.06%    83.54 ± 0.04%
Table 3: Accuracy on tiered-ImageNet

In Tables 2 and 3, the accuracy on mini-ImageNet and tiered-ImageNet is reported. We collect the scores of various baselines that use either convolutional networks or deep residual networks and do not exploit any data augmentation, for a fair comparison. While NP performs no better than random guessing when following (Garnelo et al., 2018a), NP+LD achieves a score comparable to recent gradient-based meta-learning models, verifying the validity of the linear decoder in classification. FELD achieves even better performance than the state of the art, which is remarkable given that the attention modules in the Set Transformer cannot be fully utilized in the low-shot regime.

6.2 Heterogeneous dataset

Sine Polynomial

To verify performance on a family of functions, we experiment on toy 1D regression as in (Vuorio et al., 2018; Yao et al., 2019, 2020a). In particular, we follow the exact setting of (Yao et al., 2019), where each task is randomly chosen to be one of the following one-dimensional function families, sine, line, quadratic, or cubic, with coefficients uniformly sampled from the prefixed intervals summarized in Appendix C.1. A small number of data points are given as context, requiring the model to appropriately interpolate and extrapolate in a highly variable prediction space.
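A task generator for this heterogeneous family can be sketched as follows; the coefficient intervals and domain below are illustrative placeholders, not the exact values from Appendix C.1, and `sample_hetero_task` is a hypothetical name.

```python
import numpy as np

def sample_hetero_task(k_shot=5, rng=None):
    """Sample one task from a heterogeneous family of 1D functions
    (sine / line / quadratic / cubic), mirroring the multi-modal task
    distribution of the toy regression benchmark."""
    rng = rng or np.random.default_rng()
    family = rng.choice(["sine", "line", "quad", "cubic"])
    a, b, c, d = rng.uniform(-1, 1, size=4)   # illustrative intervals
    fns = {
        "sine":  lambda x: a * np.sin(x) + b,
        "line":  lambda x: a * x + b,
        "quad":  lambda x: a * x**2 + b * x + c,
        "cubic": lambda x: a * x**3 + b * x**2 + c * x + d,
    }
    x = rng.uniform(-5, 5, size=k_shot)       # context inputs
    return family, x, fns[family](x)

fam, x, y = sample_hetero_task(5, np.random.default_rng(1))
```

Because the family identity is hidden from the learner, a handful of context points must simultaneously disambiguate the family (heterogeneity) and its coefficients (ambiguity).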

Model       5-shot           10-shot
BMAML       2.435 ± 0.130    0.967 ± 0.056
MAML        2.205 ± 0.121    0.761 ± 0.068
Meta-SGD    2.053 ± 0.117    0.836 ± 0.065
MT-NET      2.016 ± 0.019    0.698 ± 0.054
MUMOMAML    1.096 ± 0.085    0.256 ± 0.028
HSML        0.856 ± 0.073    0.161 ± 0.021
NP          0.514 ± 0.051    0.089 ± 0.015
ANP         0.415 ± 0.046    0.058 ± 0.016
FELD        0.118 ± 0.015    0.008 ± 0.002
MAHA        0.077 ± 0.006    0.003 ± 0.001
MAHA*       0.056 ± 0.003    0.002 ± 0.001
Table 4: MSE on Sine Polynomial

In Table 4, the MSE over 4000 tasks is presented with 95% confidence intervals. Generally, all the gradient-based meta-learning algorithms are outperformed by the neural processes, and a noticeable gain is again observed from the encoder-decoder pipeline alone, FELD. Adjusting FELD to MAHA by task clustering, and MAHA to MAHA* by knowledge distillation, yields a monotonic improvement. (The knowledge distillation handles the overconfident nature of deep learning to better cope with ambiguity; please refer to Appendix B for a more detailed explanation.)

In Figure 8, we illustrate the interpolation and extrapolation of MAHA in comparison to ANP. As noted in Section 5.1, ANP focuses mainly on fitting the context points and performs poorly in predicting target outputs whose inputs lie farther away from the context points. This tendency appears in both interpolation and extrapolation, leading to wiggly predictions with significant variance. By contrast, MAHA correctly infers the functional shape, as confirmed by its consistently low variance.

Figure 8: Qualitative comparison of ANP and MAHA on various function types. The context points are selected from 40% of the entire domain for extrapolation.
POOL AE 1-shot 5-shot
0.8020 0.9957
0.7455 0.9145
0.9035 0.9930
0.9560 0.9992
Figure 9: t-SNE of the stochastic representation and the estimated purity values


Four distinct fine-grained image classification datasets are combined to construct the multi-dataset proposed in (Yao et al., 2019): (Bird) CUB-200-2011, (Texture) the Describable Textures Dataset, (Aircraft) FGVC-Aircraft, and (Fungi) FGVCx-Fungi. Compared to a homogeneous setting, this is more challenging, since overfitting to a particular dataset can critically harm performance. For the feature extractor, we follow (Yao et al., 2019), where 2-Conv blocks are used for task clustering and 4-Conv blocks for prediction.

In Figure 9, for the 1-shot setting, the mean of the variational distribution is visualized through t-SNE (Van der Maaten and Hinton, 2008). Without external knowledge such as the number of true clusters, the embeddings become interpretable when both the dimension-wise pooling and the auto-encoding structure are used. Without either of them, the distinct datasets are no longer clearly discriminated, which is quantitatively demonstrated by the estimated purity values in the accompanying table. Note that the validity of the methodology stands out particularly in the low-shot regime, which reflects the difficulty of task identification within ambiguity.

This tendency is also visible in the performance measures presented in Table 5. Compared to the 1-shot setting, where task clustering yields a noticeable gain, in the 5-shot setting there is almost no difference between FELD and MAHA. This is because the models can clearly identify the tasks regardless of whether the pooling or the auto-encoding structure is used, as demonstrated by the uniformly high purity values. Accordingly, the knowledge distillation, fundamentally devised to appropriately regularize the model within ambiguity, shows a worthwhile improvement from MAHA to MAHA*, particularly in the 1-shot setting. Eventually, MAHA (and MAHA*) beats all previous works by a fairly large margin and achieves state-of-the-art performance.

Model          Bird            Texture         Aircraft        Fungi           Average
5-way 1-shot
MAML           53.94 ± 1.45%   31.66 ± 1.31%   51.37 ± 1.38%   42.12 ± 1.36%   44.77%
Meta-SGD       55.58 ± 1.43%   32.38 ± 1.32%   52.99 ± 1.36%   41.74 ± 1.34%   45.67%
MT-NET         58.72 ± 1.43%   32.80 ± 1.35%   47.72 ± 1.46%   43.11 ± 1.42%   45.59%
BMAML          54.89 ± 1.48%   32.53 ± 1.33%   53.63 ± 1.37%   42.50 ± 1.33%   45.89%
MUMOMAML       56.82 ± 1.49%   33.81 ± 1.36%   53.14 ± 1.39%   42.22 ± 1.40%   46.50%
HSML           60.98 ± 1.50%   35.01 ± 1.36%   57.38 ± 1.40%   44.02 ± 1.39%   49.35%
ARML           62.33 ± 1.47%   35.65 ± 1.40%   58.56 ± 1.41%   44.82 ± 1.38%   50.34%
FELD           56.17 ± 0.64%   35.86 ± 0.41%   53.03 ± 0.58%   45.41 ± 0.58%   47.61%
MAHA           63.89 ± 0.34%   37.22 ± 0.23%   58.90 ± 0.44%   47.95 ± 0.34%   51.99%
MAHA*          64.45 ± 0.36%   37.83 ± 0.23%   59.18 ± 0.43%   48.33 ± 0.33%   52.41%
5-way 5-shot
MAML           68.52 ± 0.79%   44.56 ± 0.68%   66.18 ± 0.71%   51.85 ± 0.85%   57.78%
Meta-SGD       67.87 ± 0.74%   45.49 ± 0.68%   66.84 ± 0.70%   52.51 ± 0.81%   58.18%
MT-NET         69.22 ± 0.75%   46.57 ± 0.70%   63.03 ± 0.69%   53.49 ± 0.83%   58.08%
BMAML          69.01 ± 0.74%   46.06 ± 0.69%   65.74 ± 0.67%   52.43 ± 0.84%   58.31%
MUMOMAML       70.49 ± 0.76%   45.89 ± 0.69%   67.31 ± 0.68%   53.96 ± 0.82%   59.41%
HSML           71.68 ± 0.73%   48.08 ± 0.69%   73.49 ± 0.68%   56.32 ± 0.80%   62.39%
ARML           73.34 ± 0.70%   49.67 ± 0.67%   74.88 ± 0.64%   57.55 ± 0.82%   63.86%
FELD           77.63 ± 0.46%   55.80 ± 0.38%   75.88 ± 0.41%   63.68 ± 0.50%   68.24%
MAHA           75.04 ± 0.26%   54.39 ± 0.21%   79.98 ± 0.20%   65.09 ± 0.25%   68.62%
MAHA*          75.82 ± 0.26%   54.28 ± 0.22%   79.91 ± 0.19%   65.18 ± 0.25%   68.79%
Table 5: Accuracy on multi-dataset

7 Conclusion

This paper proposes a new meta-learning framework, MAHA, that performs robustly amidst heterogeneity and ambiguity. We disentangle the stochastic representation through the dimension-wise pooling and the auto-encoding structure, built on a newly devised encoder-decoder pipeline that better leverages the latent variables. With the multi-step training process, comprehensive experiments are conducted on regression and classification. In the end, we argue that the proposed model captures the task identity with lower variance, leading to a noticeable improvement in performance. A potential limitation of MAHA is the additional computational cost of the flexible encoder, which is composed of multiple attention modules. However, by applying it orthogonally to existing work, its compatibility and necessity are empirically verified. Interesting future work would be to apply our model to reinforcement learning; in particular, training a policy directly from well-clustered representations for sample-efficient exploration seems promising in environments with sparse rewards.

Broader Impact

Training a meta-learning model involves a customization process tailored to the problem at hand. Outside of the benchmark datasets that frequently appear in academia, it is often unclear to what extent distinct datasets should be combined if the model is expected to remain versatile across every possible task generation. In this respect, MAHA can guide a practitioner in analyzing the available data and partitioning it into separate clusters. Moreover, because MAHA can infer the global context even from a small amount of information, it chiefly benefits future AI industries in which communication between decentralized servers is limited. We therefore do not anticipate any negative societal impacts, and we believe MAHA carries many implications for more realistic scenarios.


  • M. A. Ahmad, C. Eckert, and A. Teredesai (2018) Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics, pp. 559–560. Cited by: §1.
  • A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2016) Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §5.2.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §5.1.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §5.1, §5.2, §5.2.
  • R. Challen, J. Denny, M. Pitt, L. Gompels, T. Edwards, and K. Tsaneva-Atanasova (2019) Artificial intelligence, bias and clinical safety. BMJ Quality & Safety 28 (3), pp. 231–237. Cited by: §1.
  • J. Chen, S. E. Li, and M. Tomizuka (2020) Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. arXiv preprint arXiv:2001.08726. Cited by: §1.
  • X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731. Cited by: §5.2, §6.1.
  • X. Chen and X. Lin (2014) Big data deep learning: challenges and perspectives. IEEE access 2, pp. 514–525. Cited by: §1.
  • J. Cho, K. Lee, E. Shin, G. Choy, and S. Do (2015) How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?. arXiv preprint arXiv:1511.06348. Cited by: §1.
  • C. Cremer, X. Li, and D. Duvenaud (2018) Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558. Cited by: §5.1.
  • H. Edwards and A. Storkey (2016) Towards a neural statistician. arXiv preprint arXiv:1606.02185. Cited by: §5.2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400. Cited by: §2, §3.
  • C. Finn, K. Xu, and S. Levine (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527. Cited by: §1, §2.
  • H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin (2019) Cyclical annealing schedule: a simple approach to mitigating kl vanishing. arXiv preprint arXiv:1903.10145. Cited by: §5.2.
  • M. Garnelo, D. Rosenbaum, C. J. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and S. Eslami (2018a) Conditional neural processes. arXiv preprint arXiv:1807.01613. Cited by: §1, §2, §3, §6.1, §6.1.
  • M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh (2018b) Neural processes. arXiv preprint arXiv:1807.01622. Cited by: §1, §2, §4, §4, §6.1.
  • S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375. Cited by: §6.1.
  • J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. E. Turner (2018) Meta-learning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921. Cited by: §5.1.
  • J. Gordon, W. Bruinsma, A. Y. Foong, J. Requeima, Y. Dubois, and R. E. Turner (2020) Convolutional conditional neural processes. Cited by: §2.
  • J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick (2019) Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534. Cited by: §5.1.
  • J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. Patwary, M. Ali, Y. Yang, and Y. Zhou (2017) Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409. Cited by: §1.
  • L. B. Hewitt, M. I. Nye, A. Gane, T. Jaakkola, and J. B. Tenenbaum (2018) The variational homoencoder: learning to learn high capacity generative models from few examples. arXiv preprint arXiv:1807.08919. Cited by: §5.2.
  • T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey (2020) Meta-learning in neural networks: a survey. arXiv preprint arXiv:2004.05439. Cited by: §1.
  • W. Joo, W. Lee, S. Park, and I. Moon (2020) Dirichlet variational autoencoder. Pattern Recognition 107, pp. 107514. Cited by: §5.2.
  • H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh (2019) Attentive neural processes. arXiv preprint arXiv:1901.05761. Cited by: §1, §2, §2, §6.1.
  • J. Kim and J. Canny (2017) Interpretable learning for self-driving cars by visualizing causal attention. In Proceedings of the IEEE international conference on computer vision, pp. 2942–2950. Cited by: §1.
  • T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn (2018) Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836. Cited by: §2.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934. Cited by: §5.2.
  • D. Kingma and M. Welling (2014) Efficient gradient-based inference through transformations between bayes nets and neural nets. In International Conference on Machine Learning, pp. 1782–1790. Cited by: §4.
  • A. Kumar, S. Eslami, D. J. Rezende, M. Garnelo, F. Viola, E. Lockhart, and M. Shanahan (2018) Consistent generative query networks. arXiv preprint arXiv:1807.02033. Cited by: §2.
  • T. A. Le, H. Kim, M. Garnelo, D. Rosenbaum, J. Schwarz, and Y. W. Teh (2018) Empirical evaluation of neural process objectives. In NeurIPS workshop on Bayesian Deep Learning, Cited by: §2, §4.
  • H. B. Lee, H. Lee, D. Na, S. Kim, M. Park, E. Yang, and S. J. Hwang (2019a) Learning to balance: bayesian meta-learning for imbalanced and out-of-distribution tasks. arXiv preprint arXiv:1905.12917. Cited by: §1.
  • J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh (2019b) Set transformer: a framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753. Cited by: §5.1.
  • J. Lee, Y. Lee, J. Kim, E. Yang, S. J. Hwang, and Y. W. Teh (2020) Bootstrapping neural processes. arXiv preprint arXiv:2008.02956. Cited by: §2.
  • Y. Lee and S. Choi (2018) Gradient-based meta-learning with learned layerwise metric and subspace. arXiv preprint arXiv:1801.05558. Cited by: §2.
  • B. Liu, M. Ding, S. Shaham, W. Rahayu, F. Farokhi, and Z. Lin (2020) When machine learning meets privacy: a survey and outlook. arXiv preprint arXiv:2011.11819. Cited by: §1.
  • Q. Liu and D. Wang (2016) Stein variational gradient descent: a general purpose bayesian inference algorithm. Advances in neural information processing systems 29, pp. 2378–2386. Cited by: §2.
  • C. Louizos, X. Shi, K. Schutte, and M. Welling (2019) The functional neural process. In Advances in Neural Information Processing Systems, pp. 8746–8757. Cited by: §2.
  • M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic (2015) Deep learning applications and challenges in big data analytics. Journal of big data 2 (1), pp. 1–21. Cited by: §1.
  • E. Nalisnick and P. Smyth (2016) Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197. Cited by: §5.2.
  • A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759. Cited by: §5.1.
  • B. N. Oreshkin, P. Rodriguez, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. arXiv preprint arXiv:1805.10123. Cited by: §6.1.
  • S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238. Cited by: §6.1.
  • C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: §2.
  • S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. Cited by: §6.1.
  • M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676. Cited by: §6.1.
  • J. Requeima, J. Gordon, J. Bronskill, S. Nowozin, and R. E. Turner (2019) Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems, pp. 7959–7970. Cited by: §2, §5.1.
  • D. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. Cited by: §5.2.
  • A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960. Cited by: §1, §2, §6.1.
  • H. Sanders and J. Saxe (2017) Garbage in, garbage out: how purportedly great ml models can be screwed up by bad data. Proceedings of Blackhat 2017. Cited by: §1.
  • S. Semeniuta, A. Severyn, and E. Barth (2017) A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390. Cited by: §5.1.
  • S. Shafaei, S. Kugele, M. H. Osman, and A. Knoll (2018) Uncertainty in machine learning: a safety perspective on autonomous driving. In International Conference on Computer Safety, Reliability, and Security, pp. 458–464. Cited by: §1.
  • G. Singh, J. Yoon, Y. Son, and S. Ahn (2019) Sequential neural processes. In Advances in Neural Information Processing Systems, pp. 10254–10264. Cited by: §2, §2.
  • K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. Advances in neural information processing systems 28, pp. 3483–3491. Cited by: §5.2.
  • C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther (2016) Ladder variational autoencoders. In Advances in neural information processing systems, pp. 3738–3746. Cited by: §5.2.
  • C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pp. 843–852. Cited by: §1.
  • A. Suresh and S. Srinivasan. Improved attentive neural processes. Cited by: §2.
  • E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, et al. (2019) Meta-dataset: a dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096. Cited by: §1.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of machine learning research 9 (11). Cited by: §6.2.
  • J. Vanschoren (2018) Meta-learning: a survey. arXiv preprint arXiv:1810.03548. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.
  • A. Vellido (2019) The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural computing and applications, pp. 1–15. Cited by: §1.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. arXiv preprint arXiv:1606.04080. Cited by: §6.1.
  • R. Vuorio, S. Sun, H. Hu, and J. J. Lim (2018) Toward multimodal model-agnostic meta-learning. arXiv preprint arXiv:1812.07172. Cited by: §1, §2, §6.2.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In Artificial intelligence and statistics, pp. 370–378. Cited by: §2.
  • Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick (2017) Improved variational autoencoders for text modeling using dilated convolutions. In International conference on machine learning, pp. 3881–3890. Cited by: §5.1.
  • H. Yao, Y. Wei, J. Huang, and Z. Li (2019) Hierarchically structured meta-learning. arXiv preprint arXiv:1905.05301. Cited by: §1, §2, §6.2, §6.2, §6.
  • H. Yao, X. Wu, Z. Tao, Y. Li, B. Ding, R. Li, and Z. Li (2020a) Automated relational meta-learning. arXiv preprint arXiv:2001.00745. Cited by: §1, §2, §6.2.
  • H. Yao, Y. Zhou, M. Mahdavi, Z. Li, R. Socher, and C. Xiong (2020b) Online structured meta-learning. arXiv preprint arXiv:2010.11545. Cited by: §2.
  • M. Yin, G. Tucker, M. Zhou, S. Levine, and C. Finn (2019) Meta-learning without memorization. arXiv preprint arXiv:1912.03820. Cited by: §2.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola (2017) Deep sets. arXiv preprint arXiv:1703.06114. Cited by: §5.2.
  • S. Zhao, J. Song, and S. Ermon (2017a) Learning hierarchical features from generative models. arXiv preprint arXiv:1702.08396. Cited by: §5.1.
  • S. Zhao, J. Song, and S. Ermon (2017b) Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658. Cited by: §5.2, §5.2, §6.1.
  • L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson (2019) Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702. Cited by: §2.