1 Introduction
Although deep learning models have shown remarkable performance in various domains, they have consistently been criticized because of their sensitivity to the amount of data
(Chen and Lin, 2014; Najafabadi et al., 2015; Cho et al., 2015; Sun et al., 2017; Hestness et al., 2017). Despite all available public data, the data scarcity issue is still not negligible. In many cases, the actual data that is worth analyzing is quite limited for many different reasons, for example, concerns about data privacy (Liu et al., 2020) and noisy data with anomalies (Sanders and Saxe, 2017). Along with transfer learning, few-shot learning, and multi-task learning, meta-learning has recently been highlighted as a way to overcome this deficiency with its adaptive behavior using a few data points
(Vanschoren, 2018; Hospedales et al., 2020).

Meta-learning aims to handle multiple tasks by efficiently organizing the acquired knowledge. However, typical algorithms have been assessed under a strong assumption that lacks representative power in real-world scenarios. Among the many challenges raised (Triantafillou et al., 2019; Lee et al., 2019a), we mainly focus on the following two assumptions. First, the tasks are assumed to be similar enough that a single meta-learner suffices to aggregate the variations in all aspects. This implies that little effort has been made to compactly abstract notions under heterogeneity, one of the essential factors characterizing human intelligence, which is advantageous in decision-making for querying the information associated with the problem at hand. In addition, less consideration has been given to the uncertainty of identifying a particular task from a few data points. It is therefore not easy to analyze or transfer the acquired knowledge of the model, which is critical in growing AI industries such as medical diagnosis (Ahmad et al., 2018; Challen et al., 2019; Vellido, 2019) and autonomous vehicles (Kim and Canny, 2017; Shafaei et al., 2018; Chen et al., 2020), because a certain level of interpretability is required for greater safety.
In this respect, we hypothesize that disentanglement in the task representation is advantageous, as frequently appears in studies analyzing the inherent factors of variation within a dataset. This serves to i) uncover the distinctive properties as a tool for interpretability and ii) explicitly separate the dataset into several clusters that would otherwise be detrimental if trained altogether. However, as a trade-off for interpretability, the overconfident nature of deep learning may strictly assign tasks to certain clusters without considering ambiguity, which requires additional treatment to cope with anomalies.
To this end, we propose a new meta-learning framework, Meta-learning Amidst Heterogeneity and Ambiguity (MAHA), that performs robustly against the following two hurdles. Task heterogeneity: there is no clear discrimination between tasks that are sampled from far-apart modes of the task distribution (Vuorio et al., 2018; Yao et al., 2019, 2020a). Task ambiguity: too few data points are given to infer the task identity (Finn et al., 2018; Rusu et al., 2018). Specifically, we devise a pre-task built upon the neural processes (Garnelo et al., 2018a, b; Kim et al., 2019) to obtain a well-clustered and interpretable representation. Then, an agglomerative clustering is applied to the representation without any external knowledge, such as the number of clusters, and a different model is separately trained for each cluster. Please refer to Figure 7 for the overall training process of MAHA.
To summarize, the main contributions of this paper are four-fold:

We propose a simple yet powerful architecture design for the neural processes to better leverage the latent variables and be applicable to classification. (See Section 5.1)

We resolve the information asymmetry in the neural processes and construct well-clustered and interpretable representations. (See Section 5.2)

We validate MAHA through both regression and classification, where the experimental results demonstrate its ability to cope with heterogeneity and ambiguity. (See Section 6)

We devise an additional regularization term for the low-shot regime that distills obtainable knowledge from comparatively abundant training samples and variations. (See Appendix B)
2 Related work
Gradient-based meta-learning, represented by MAML (Finn et al., 2017), aims to learn prior parameters that can quickly adapt to a given task through several gradient steps. It consists of the inner loop for task adaptation and the outer loop for the meta-update over tasks. Many variants have emerged to balance generalization and customization in a task-adaptive manner. From the generalization perspective, (Finn et al., 2018; Kim et al., 2018) suggested probabilistic extensions through the hierarchical Bayesian model and Stein variational gradient descent (SVGD) (Liu and Wang, 2016). In addition, Rusu et al. (2018) conducted the inner loop on a low-dimensional latent embedding space, and (Yin et al., 2019) proposed a meta-regularization built on information theory. From the customization perspective, (Lee and Choi, 2018) divided the parameters into two categories, one of which is shared across tasks while the other can be modulated task-specifically. (Zintgraf et al., 2019) introduced layer-wise adaptive units, and (Vuorio et al., 2018; Yao et al., 2019, 2020a, 2020b) considered auxiliary networks that modulate the initial parameters before the inner loop.
The family of neural processes, also known as contextual meta-learning, is devised to imitate the flexibility of the Gaussian process (Rasmussen, 2003) while resolving its scalability issue. Rather than explicitly modeling the kernel to conduct Bayesian inference as in (Wilson et al., 2016), it learns an implicit kernel directly from data, which overcomes the design restrictions. Task-specific information is extracted from a subset of the data through an encoder and aggregated for use in the decoder, which predicts the corresponding outputs of the remaining data. Starting from the conditional neural process (CNP) (Garnelo et al., 2018a), which was built solely on a deterministic path, the neural process (NP) (Garnelo et al., 2018b) adds a stochastic path. The attentive neural process (ANP) (Kim et al., 2019) further applies an attention mechanism to resolve the underfitting issue in NP by enlarging the locally adaptive behavior. More complex modules, such as a graph structure (Louizos et al., 2019) and sequential extensions (Kumar et al., 2018; Singh et al., 2019), were further considered to capture the dependencies among latent variables and complex temporal dynamics. However, many problems remain unsolved. First, the neural processes still rely on a complex feature extractor to enable task-specific modulation, which requires various regularization techniques with additional hyperparameters (Requeima et al., 2019). Furthermore, whereas the neural processes are able to obtain an explicit task representation, existing approaches have investigated little regarding interpretability. Finally, the performance analysis has mainly focused on regression (Le et al., 2018; Kim et al., 2019; Singh et al., 2019; Suresh and Srinivasan; Gordon et al., 2020), and some approaches are not even directly applicable to classification (Lee et al., 2020).

3 Problem setting
Let $C = \{(x_i, y_i)\}_{i=1}^{m}$ be the context set, and let $T = \{(x_i, y_i)\}_{i=m+1}^{n}$ be the target set, where both $C$ and $T$ are sampled from the same task $\tau$. A common goal in meta-learning is to devise an algorithm for the model that appropriately uses the model parameter $\theta$ to obtain the task-specific parameter $\theta_\tau$ according to the input-output pairs in $C$ such that, when $x_T$ is given, $y_T$ can be accurately estimated with high confidence. For example, in MAML (Finn et al., 2017), a task-specific parameter can be computed by a gradient step $\theta_\tau = \theta - \alpha \nabla_\theta \mathcal{L}_C(\theta)$. On the other hand, in CNP (Garnelo et al., 2018a), $\theta$ and $\theta_\tau$ no longer share the same parameter space. Here, the model parameter is divided into an encoder part $\theta_{enc}$ and a decoder part $\theta_{dec}$, and the task-specific parameter can be computed by the encoder output $\theta_\tau = f_{\theta_{enc}}(C)$. Hereafter, we omit the task index $\tau$ for brevity.

For model training, $\theta$ is iteratively updated using batches. Here, each batch is constructed from multiple tasks that are characterized by way and shot. If there are N classes, each of which contains K input-output pairs, we call it an N-way K-shot problem. The class labels are shuffled in classification whenever a task instance is created, which encourages a meta-learning algorithm to learn how to classify images even when an unseen configuration of classes occurs.
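The episode construction with shuffled class labels can be sketched as follows. This is a minimal illustration under our own assumptions (the dataset layout as a dict of class-to-examples, the function name, and the query count are not from the paper):

```python
import numpy as np

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one N-way K-shot task from `dataset`, a dict mapping each
    class label to a list of examples. Class labels are re-assigned to
    0..N-1 in a shuffled order for every episode."""
    rng = rng or np.random.default_rng()
    chosen = rng.choice(sorted(dataset), size=n_way, replace=False)
    support, query = [], []
    # Shuffling the label assignment forces the model to infer class
    # identity from the support set instead of memorizing fixed labels.
    for new_label, cls in zip(rng.permutation(n_way), chosen):
        idx = rng.choice(len(dataset[cls]), size=k_shot + n_query, replace=False)
        examples = [dataset[cls][i] for i in idx]
        support += [(x, int(new_label)) for x in examples[:k_shot]]
        query += [(x, int(new_label)) for x in examples[k_shot:]]
    return support, query
```

Each call yields `n_way * k_shot` support pairs (the context) and `n_way * n_query` query pairs (the target) with a fresh, random label permutation.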
4 Preliminary: (Attentive) Neural Process
In Figure 2, we summarize how the basic family of neural processes has evolved in terms of the graphical model. The encoder comprises a deterministic path and a stochastic path computing the task-specific parameters of the variational distributions, which we denote by $r$ and $z$, respectively.^{1} Here, each path takes a set of input-output pairs as input, and a reparameterization trick is applied at the end of the stochastic path for a differentiable non-centered parameterization (Kingma and Welling, 2014).

^{1} Note that $r$ is deterministic with zero variance.

For both paths, NP is constructed by:
$r = \mathrm{MeanPool}_{shot}\big(\mathrm{rFF}(x \oplus y)\big), \qquad (\mu_z, \sigma_z) = \mathrm{rFF}\Big(\mathrm{MeanPool}_{shot}\big(\mathrm{rFF}(x \oplus y)\big)\Big), \quad z \sim \mathcal{N}(\mu_z, \sigma_z^2),$

where MeanPool(·) is a mean-pooling operation along the subscripted dimension, rFF(·) can be any row-wise feed-forward layer, such as a multi-layer perceptron (MLP), and $\oplus$ denotes concatenation. On the other hand, ANP exploits multihead attention, connecting the target input to $r$ in the graphical model, and self-attention, both of which are proposed in (Vaswani et al., 2017). As in NP, the value of $z$ is the same for every shot of $T$; however, based on the attention score with each element of $C$, $r$ is now computed in a shot-dependent manner.

Then, conditioned on the encoder outputs $r$ and $z$, together with the target input $x_T$, the decoder computes the parameters of the predictive distribution on the target output $y_T$:

$(\mu, \sigma) = f_{\theta_{dec}}(x_T, r, z), \qquad (1)$
where the predictive distribution is expressed as $p(y_T \mid x_T, r, z) = \mathcal{N}(y_T; \mu, \sigma^2)$. Eventually, relying on variational inference, one can obtain a loss function that approximates the negative ELBO by replacing the intractable posterior with the variational distribution, following (Garnelo et al., 2018b):

$\mathcal{L} = -\,\mathbb{E}_{q(z \mid C \cup T)}\big[\log p(y_T \mid x_T, r, z)\big] + \mathrm{KL}\big(q(z \mid C \cup T) \,\|\, q(z \mid C)\big). \qquad (2)$
As a result, based on the Kolmogorov extension and de Finetti theorems, the neural processes become stochastic processes that satisfy exchangeability and consistency (Garnelo et al., 2018b). However, when trained using the deterministic path, neural processes with latent variables are empirically shown to have difficulty capturing the variability of the stochastic process (Le et al., 2018); the causes of this are investigated and resolved in Section 5.2.
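For diagonal Gaussian choices of the distributions involved, both terms of this objective have simple closed forms. The sketch below is our own minimal illustration of the two pieces of Equation 2 (not the authors' implementation), with a single sample of $z$ assumed to be already folded into the decoder outputs:

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Reconstruction term: negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((y - mu) / sigma) ** 2)

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal
    Gaussians, i.e. the regularizer between q(z | C ∪ T) and q(z | C)."""
    var_q, var_p = sigma_q**2, sigma_p**2
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def np_loss(y_t, mu_y, sigma_y, mu_q, sigma_q, mu_p, sigma_p):
    """One-sample Monte-Carlo approximation of the negative ELBO."""
    return gaussian_nll(y_t, mu_y, sigma_y) + gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
```

The KL term vanishes when the posterior inferred from $C \cup T$ matches the prior inferred from $C$ alone, which is exactly the collapse failure mode discussed in Section 5.2.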
5 Meta-learning Amidst Heterogeneity and Ambiguity
This section describes our algorithm MAHA, whose primary focus is a pre-task devised to cope with task heterogeneity and ambiguity in meta-learning. We first introduce the encoder-decoder pipeline of MAHA, namely FELD, whose effects are examined by substituting the corresponding components within NP in Section 6. Then, a dimension-wise pooling and an autoencoding structure are proposed to obtain well-clustered and interpretable representations. Finally, the training process of MAHA is described, which applies to both regression and classification.
5.1 Encoder-decoder pipeline
Flexible Encoder
Although the attention mechanism proposed in ANP was key to resolving the underfitting in NP, it gives $r$ less incentive to focus on the task identity that is shared across shots. As a result, in Figure 3, ANP appears to fit the given input-output pairs too strongly, which leads to a wiggly prediction. Particularly under task heterogeneity and ambiguity, where the prediction space is prone to be highly variable, the wiggly prediction of ANP leads to poor generalization performance (see Figure 8). Therefore, the graphical model of NP is instead adopted in MAHA since its latent variables are shot-independent. Then, based on the analysis in (Cremer et al., 2018), the problematic underfitting is dealt with by substituting the encoder with the flexible and permutation-invariant Set Transformer (ST) (Lee et al., 2019b). Note that the Set Transformer can subsume the rFF(·) and MeanPool(·) modules in the encoder of NP. See Appendix A for a more detailed explanation of the modules in the Set Transformer.
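A rough single-head sketch of why such an attention-based set encoder remains permutation-invariant is given below; the random weights and the pooling-by-seed design are our own simplification of the Set Transformer idea, not its actual multihead modules:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def set_encode(pairs, w_feat, seed):
    """Encode a set of (x, y) rows: row-wise features, self-attention across
    shots for interactions, then pooling by attending from a learned seed."""
    h = pairs @ w_feat              # row-wise feed-forward
    h = attention(h, h, h)          # interactions between shots
    return attention(seed, h, h)    # (1, d) representation of the whole set
```

Reordering the shots permutes the rows of `h` but leaves the seed-pooled output unchanged, so the encoding depends only on the set, not the order of its elements.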
Linear Decoder
We avoid using a complex decoder such as (Oord et al., 2016) and instead apply feature-wise linear modulation to the target input $x_T$. Inspired by (Zhao et al., 2017a), we combine the latent variables through a skip connection. Among the many normalization techniques, layer normalization (Ba et al., 2016) is applied since its statistics are computed independently for each batch instance, such that only $z$ can capture the heterogeneity in accordance with the pooling proposed in Section 5.2.
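A minimal sketch of such a feature-wise linear modulation with layer normalization and a skip connection is shown below; the shapes, and the assumption that the latent variables enter as a scale and a shift, are illustrative simplifications rather than the exact form of the equation that follows:

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each instance with its own statistics (per-row mean/std)."""
    mu = h.mean(axis=-1, keepdims=True)
    sd = h.std(axis=-1, keepdims=True)
    return (h - mu) / (sd + eps)

def linear_decoder(feat_t, scale, shift):
    """Feature-wise linear modulation of target features by parameters
    derived from the latent variables, plus a skip connection."""
    return feat_t + scale * layer_norm(feat_t) + shift
```

With zero scale and shift the decoder reduces to the identity on the target features, which is what makes this decoder deliberately weak: all task-specific behavior must come through the latent variables.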
(3) 
Here, g(·) denotes any feature extractor, LN(·) indicates layer normalization, and the transpose operation T permutes the last two dimensions of the tensor. This is aligned with previous approaches (Bowman et al., 2015; Semeniuta et al., 2017; Yang et al., 2017; He et al., 2019) that weaken the decoder to allow the latent variables to be appropriately leveraged. It also relates to studies on few-shot classification (Gordon et al., 2018; Requeima et al., 2019) where each column of $r$ is computed from shots within the same way. However, when accompanied by the pooling in Section 5.2, the columns are no longer independent of one another and share information across ways.

5.2 Inducing disentanglement on $z$
For NP and ANP trained on functions generated from a GP, we illustrate the weight norm of the decoding layer directly behind the latent variables in Figure 5. The sparsely coded decoder implies the redundancy of the stochastic path due to the component-collapsing behavior referred to in (Nalisnick and Smyth, 2016; Joo et al., 2020). This phenomenon can be explained by the information preference problem (Chen et al., 2016; Zhao et al., 2017b), where the information flow is concentrated on the deterministic path with a tendency to ignore the stochastic path.

In order to handle this information asymmetry, several solutions have been proposed in studies on generative models, such as the KL annealing schedule (Bowman et al., 2015; Fu et al., 2019) and expressive posterior approximations (Rezende and Mohamed, 2015; Kingma et al., 2016), but these are generally not robust to changes in model architecture. Instead, we propose a simple method to avoid redundancy of the stochastic path by encouraging it to acquire multimodality under heterogeneity and ambiguity.
Dimension-wise pooling
We explicitly capture the distinct variations within the information flow by pooling each path across a different dimension: batch for $r$ and way for $z$:
$\bar{r} = \mathrm{MeanPool}_{batch}(r), \qquad \bar{z} = \mathrm{MeanPool}_{way}(z). \qquad (4)$
Then, the deterministic representation becomes identical not only across shots but also across the batch. Whenever it is insufficient to handle all the variations across tasks within the same batch, i.e., when facing task heterogeneity, the model must resort to the stochastic representation, since the deterministic representation only captures the average properties. On the other hand, the stochastic representation allows the different ways to share information and becomes class-invariant. We illustrate how the latent variables $r$ and $z$ are computed in Figure 6. Note that the value of way is set to 1 in regression, such that the pooling on $z$ is negligible.
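The two pooling operations can be sketched as follows; the array layout `(batch, way, dim)` is an assumption made for illustration:

```python
import numpy as np

def dimensionwise_pool(r, z):
    """r, z: arrays of shape (batch, way, dim). Pool the deterministic
    representation r over the batch dimension and the stochastic
    representation z over the way dimension, broadcasting the pooled
    values back so they are shared along the pooled axis."""
    r_pooled = np.broadcast_to(r.mean(axis=0, keepdims=True), r.shape)
    z_pooled = np.broadcast_to(z.mean(axis=1, keepdims=True), z.shape)
    return r_pooled, z_pooled
```

After pooling, every task in the batch sees the same deterministic representation, while every way within a task sees the same stochastic representation.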
Autoencoding structure
Empirically, we observe that the KL collapse (Bowman et al., 2015; Alemi et al., 2016; Sønderby et al., 2016; Zhao et al., 2017b) does not occur whenever the pooling operations are used (see Appendix D). This implies that the posterior does not simply converge to the approximate prior, so the decoder becomes dependent on the stochastic path. However, there is still an incentive for $z$ to be underutilized during decoding because it is inferred from the small context set rather than the large target set (Hewitt et al., 2018), and neural networks exploiting set representations are known to perform poorly in the low-shot regime (Edwards and Storkey, 2016; Zaheer et al., 2017), i.e., when facing task ambiguity.
We therefore adopt the conditional autoencoding structure (Sohn et al., 2015) on top of the dimension-wise pooling to cope with the lack of training samples. As a result, the following loss function is derived, which differs from Equation 2 in i) whether the pooling operations are used and ii) which set is used to compute the deterministic representation, resulting from the dimension-wise pooling and the autoencoding structure, respectively:
(5) 
5.3 Training process
See Figure 7. Initially, the dimension-wise pooling and the autoencoding structure proposed in Section 5.2 are used along with FELD to minimize the loss function in Equation 5. Next, an agglomerative clustering is applied to the disentangled representation from the stochastic path to estimate the number of clusters with the highest purity value.^{2} Finally, for each cluster, a separate FELD is trained from scratch by Equation 2, where the tasks are no longer uniformly sampled but statistically skewed based on the ratio of heterogeneous tasks within the cluster. For evaluation, the FELD corresponding to the closest cluster, measured by the Euclidean distance to the cluster centers, is exploited.

^{2} For a homogeneous dataset, a single cluster suffices, such that the previous steps can be omitted.
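The clustering step can be sketched with a minimal single-linkage agglomerative procedure; the distance-based stopping threshold below is an illustrative stand-in for the purity-based model selection described above:

```python
import numpy as np

def agglomerative_cluster(embeddings, merge_threshold):
    """Minimal single-linkage agglomerative clustering: start with one
    cluster per task embedding and repeatedly merge the closest pair of
    clusters until the smallest inter-cluster distance exceeds the
    threshold, so the number of clusters is discovered rather than given."""
    clusters = [[i] for i in range(len(embeddings))]

    def dist(a, b):  # single linkage: distance of the closest pair of members
        return min(np.linalg.norm(embeddings[i] - embeddings[j])
                   for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > merge_threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters
```

Two well-separated groups of task embeddings are recovered as two clusters without specifying their number in advance.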
6 Experiment
We first experiment on benchmark datasets that frequently appear in meta-learning and investigate the role of the encoder-decoder pipeline (FELD) by gradually adjusting NP. These datasets are generally regarded as homogeneous, such that MAHA is equivalent to FELD when assuming a single cluster, as noted in Section 5.3. After that, MAHA is evaluated on heterogeneous datasets following the experimental setting of (Yao et al., 2019), with the dimension-wise pooling and the autoencoding structure of Section 5.2, whose roles are examined in both a quantitative and qualitative manner. Please refer to Appendix C for details about the data split, architecture design, and hyperparameter search.
Overall, we are to answer the following three questions:
6.1 Homogeneous dataset
| Model | FE | LD | MSE |
|---|---|---|---|
| NP | | | 0.166 ± 0.002 |
| ANP | | | 0.142 ± 0.002 |
| NP+FE | ✓ | | 0.138 ± 0.002 |
| NP+LD | | ✓ | 0.312 ± 0.002 |
| FELD | ✓ | ✓ | 0.130 ± 0.002 |
Gaussian Process
Following the basic neural processes (Garnelo et al., 2018a, b; Kim et al., 2019), we consider functions generated from a GP with a squared exponential kernel $k(x, x') = \sigma^2 \exp\!\big(-(x - x')^2 / (2\ell^2)\big)$. The experimental results in Table 1 show that although ANP performs better than NP in terms of flexibility, this dominance no longer holds once NP is equipped with the flexible encoder. However, a degradation in performance appears when the linear decoder alone is used in NP. This is empirical evidence that NP strongly relies on the complexity of the decoder in regression, by which the model is prone to ignore the latent variables (Chen et al., 2016; Zhao et al., 2017b). By exploiting the flexible encoder to obtain latent variables that are informative enough by themselves, such that the (shallow) linear decoder suffices for prediction, FELD performs better than any other model with the (deep) conventional decoder. We find that the Set Transformer is a particularly good choice, whose improvement cannot be matched by simply stacking MLPs. Moreover, it is noticeable that FELD outperforms NP+FE despite a decreased model capacity.
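Such regression tasks can be generated as below; the kernel hyperparameters and the input range are illustrative assumptions, not the exact values of the benchmark:

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, lengthscale=0.4):
    """Squared exponential kernel k(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2))."""
    sq = (x1[:, None] - x2[None, :]) ** 2
    return sigma ** 2 * np.exp(-0.5 * sq / lengthscale ** 2)

def sample_gp_task(n_points=50, rng=None):
    """Draw one random function from a zero-mean GP, as used to generate
    regression tasks for the neural-process benchmarks."""
    rng = rng or np.random.default_rng()
    x = np.sort(rng.uniform(-2.0, 2.0, n_points))
    cov = se_kernel(x, x) + 1e-6 * np.eye(n_points)  # jitter for stability
    y = rng.multivariate_normal(np.zeros(n_points), cov)
    return x, y
```

A random subset of the sampled points serves as the context set and the rest as the target set for each training iteration.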
Table 2: Accuracy on miniImageNet.

| Model | 5-way 1-shot | 5-way 5-shot |
|---|---|---|
| Matching Net | 43.40 ± 0.78% | 51.09 ± 0.71% |
| Meta-LSTM | 43.44 ± 0.77% | 60.60 ± 0.71% |
| MAML | 48.70 ± 1.84% | 63.11 ± 0.92% |
| ProtoNet | 49.42 ± 0.78% | 68.20 ± 0.66% |
| REPTILE | 49.97 ± 0.32% | 65.99 ± 0.58% |
| Relation Net | 50.44 ± 0.82% | 65.32 ± 0.70% |
| CAVIA | 51.82 ± 0.65% | 65.85 ± 0.55% |
| VERSA | 53.40 ± 1.82% | 67.37 ± 0.86% |
| TPN | 55.51 ± 0.86% | 69.86 ± 0.65% |
| Meta-SGD | 54.24 ± 0.03% | 70.86 ± 0.04% |
| SNAIL | 55.71 ± 0.99% | 68.88 ± 0.92% |
| NP+LD | 57.30 ± 0.06% | 75.10 ± 0.04% |
| TADAM | 58.50 ± 0.30% | 76.70 ± 0.30% |
| LEO | 61.76 ± 0.08% | 77.59 ± 0.12% |
| FELD | 62.77 ± 0.05% | 81.15 ± 0.03% |
MiniImageNet, TieredImageNet
A similar tendency can be observed in classification. We consider miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018), which are frequently used large-scale datasets for few-shot image classification. For miniImageNet, we follow the split of Ravi and Larochelle (2016), which assigns 64 classes to the meta-train set, 16 classes to the meta-valid set, and 20 classes to the meta-test set. For tieredImageNet, 608 classes are first grouped into 34 higher-level nodes, which are divided into 20, 6, and 8 nodes to construct the meta-train, meta-valid, and meta-test sets. We use the features provided by (Rusu et al., 2018), which are obtained by pretraining a deep residual network in a supervised manner as in (Gidaris and Komodakis, 2018; Oreshkin et al., 2018; Qiao et al., 2018). However, unlike (Qiao et al., 2018; Rusu et al., 2018), the meta-valid set is used for early stopping and hyperparameter search but is not utilized to update the parameters.

Table 3: Accuracy on tieredImageNet.

| Model | 5-way 1-shot | 5-way 5-shot |
|---|---|---|
| MAML | 51.67 ± 1.81% | 70.30 ± 0.08% |
| ProtoNet | 53.31 ± 0.89% | 72.69 ± 0.74% |
| Relation Net | 54.48 ± 0.93% | 71.32 ± 0.78% |
| Warp-MAML | 57.20 ± 0.90% | 74.10 ± 0.70% |
| TPN | 57.41 ± 0.94% | 71.55 ± 0.74% |
| Meta-SGD | 62.95 ± 0.03% | 79.34 ± 0.06% |
| NP+LD | 63.36 ± 0.06% | 80.50 ± 0.04% |
| LEO | 66.33 ± 0.05% | 81.44 ± 0.09% |
| FELD | 66.87 ± 0.06% | 83.54 ± 0.04% |
In Tables 2 and 3, the accuracy on miniImageNet and tieredImageNet is reported. We collect the scores of various baselines that use either convolutional networks or deep residual networks and do not exploit any data augmentation, for a fair comparison. While NP performs no better than a random guess when following (Garnelo et al., 2018a), NP+LD achieves a score comparable to recent models in gradient-based meta-learning, verifying the validity of the linear decoder in classification. FELD achieves even better performance than the state of the art, which is remarkable in the sense that the attention modules in the Set Transformer cannot be fully utilized in the low-shot regime.
6.2 Heterogeneous dataset
Sine Polynomial
To verify the performance on a family of functions, we experiment on toy 1D regression as in (Vuorio et al., 2018; Yao et al., 2019, 2020a). In particular, we follow the exact setting of (Yao et al., 2019), where each task is randomly chosen to be one of the following one-dimensional function families: sine, line, quadratic, and cubic, with coefficients uniformly sampled from the prefixed intervals summarized in Appendix C.1. A small number of data points are given as context, requiring the model to appropriately interpolate and extrapolate in a highly variable prediction space.
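The heterogeneous task generation can be sketched as follows; the coefficient ranges and the input interval are illustrative assumptions, not the prefixed intervals of the original setup:

```python
import numpy as np

def sample_heterogeneous_task(rng=None):
    """Draw one task from a mixture of 1-D function families, so that tasks
    from different families lie in far-apart modes of the task distribution."""
    rng = rng or np.random.default_rng()
    family = rng.choice(["sine", "line", "quad", "cubic"])
    a, b, c, d = rng.uniform(-1.0, 1.0, size=4)
    fns = {
        "sine":  lambda x: a * np.sin(b * x),
        "line":  lambda x: a * x + b,
        "quad":  lambda x: a * x**2 + b * x + c,
        "cubic": lambda x: a * x**3 + b * x**2 + c * x + d,
    }
    x = rng.uniform(-5.0, 5.0, size=10)
    return family, x, fns[family](x)
```

Because a few context points from, say, a shallow cubic can look nearly identical to those from a line, this mixture exhibits both task heterogeneity (distinct families) and task ambiguity (few shots per task).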
Table 4: MSE on the toy 1D regression over 4000 tasks with 95% confidence intervals.

| Model | 5-shot | 10-shot |
|---|---|---|
| BMAML | 2.435 ± 0.130 | 0.967 ± 0.056 |
| MAML | 2.205 ± 0.121 | 0.761 ± 0.068 |
| Meta-SGD | 2.053 ± 0.117 | 0.836 ± 0.065 |
| MT-NET | 2.016 ± 0.019 | 0.698 ± 0.054 |
| MUMOMAML | 1.096 ± 0.085 | 0.256 ± 0.028 |
| HSML | 0.856 ± 0.073 | 0.161 ± 0.021 |
| NP | 0.514 ± 0.051 | 0.089 ± 0.015 |
| ANP | 0.415 ± 0.046 | 0.058 ± 0.016 |
| FELD | 0.118 ± 0.015 | 0.008 ± 0.002 |
| MAHA | 0.077 ± 0.006 | 0.003 ± 0.001 |
| MAHA* | 0.056 ± 0.003 | 0.002 ± 0.001 |
In Table 4, the MSE over 4000 tasks is presented with 95% confidence intervals. Generally, all the gradient-based meta-learning algorithms are outperformed by the neural processes, and a noticeable gain is again observed by solely exploiting the encoder-decoder pipeline, FELD. By adjusting FELD to MAHA via task clustering and MAHA to MAHA* via knowledge distillation, a monotonic improvement is observed.^{3}

In Figure 8, we illustrate the interpolation and extrapolation of MAHA in comparison to ANP. As noted in Section 5.1, ANP mainly focuses on fitting the context points and performs poorly in predicting the target outputs whose corresponding inputs are located farther away from the context points. This tendency can be observed during both interpolation and extrapolation, leading to a wiggly prediction with significant variance. By contrast, MAHA can correctly infer the functional shape, which is confirmed by its consistently low variance.

^{3} We handle the overconfident nature of deep learning to better cope with the ambiguity by distilling obtainable knowledge from comparatively abundant training samples. Please refer to Appendix B for a more detailed explanation.
Estimated purity values on the multi-dataset.

| POOL | AE | 1-shot | 5-shot |
|---|---|---|---|
| | | 0.8020 | 0.9957 |
| ✓ | | 0.7455 | 0.9145 |
| | ✓ | 0.9035 | 0.9930 |
| ✓ | ✓ | 0.9560 | 0.9992 |
Multi-dataset
Four distinct fine-grained image classification datasets are combined to construct the multi-dataset proposed in (Yao et al., 2019): (Bird) CUB-200-2011, (Texture) Describable Textures Dataset, (Aircraft) FGVC of Aircraft, and (Fungi) FGVCx-Fungi. Compared to a homogeneous setting, this is more challenging since overfitting to a particular dataset can critically harm the performance. For the feature extractor, we follow (Yao et al., 2019), where 2-Conv blocks are used for task clustering and 4-Conv blocks are used for prediction.

In Figure 9, for the 1-shot setting, the mean value of the variational distribution is visualized through t-SNE (Van der Maaten and Hinton, 2008). Without external knowledge, such as the number of true clusters, the embeddings become interpretable when using both the dimension-wise pooling and the autoencoding structure. The distinct datasets are no longer clearly discriminated without either of them, which is quantitatively demonstrated by the estimated purity values in the bottom table. Note that the validity of the methodologies stands out particularly in the low-shot regime, which indicates the difficulty of task identification under ambiguity.
The same tendency can be observed in the performance measures presented in Table 5. Compared to the 1-shot setting, where a noticeable gain occurs from task clustering, in the 5-shot setting there is almost no difference between FELD and MAHA. This is because the models can clearly identify the tasks regardless of whether the pooling or the autoencoding structure is used, as demonstrated by the high purity values. Accordingly, the knowledge distillation, which is fundamentally devised to appropriately regularize the model under ambiguity, shows a worthwhile improvement from MAHA to MAHA*, particularly in the 1-shot setting. Eventually, MAHA (and MAHA*) beats all the previous works by a fairly large margin and achieves state-of-the-art performance.
Table 5: Accuracy on the multi-dataset.

| | Model | Bird | Texture | Aircraft | Fungi | Average |
|---|---|---|---|---|---|---|
| 5-way 1-shot | MAML | 53.94 ± 1.45% | 31.66 ± 1.31% | 51.37 ± 1.38% | 42.12 ± 1.36% | 44.77% |
| | Meta-SGD | 55.58 ± 1.43% | 32.38 ± 1.32% | 52.99 ± 1.36% | 41.74 ± 1.34% | 45.67% |
| | MT-NET | 58.72 ± 1.43% | 32.80 ± 1.35% | 47.72 ± 1.46% | 43.11 ± 1.42% | 45.59% |
| | BMAML | 54.89 ± 1.48% | 32.53 ± 1.33% | 53.63 ± 1.37% | 42.50 ± 1.33% | 45.89% |
| | MUMOMAML | 56.82 ± 1.49% | 33.81 ± 1.36% | 53.14 ± 1.39% | 42.22 ± 1.40% | 46.50% |
| | HSML | 60.98 ± 1.50% | 35.01 ± 1.36% | 57.38 ± 1.40% | 44.02 ± 1.39% | 49.35% |
| | ARML | 62.33 ± 1.47% | 35.65 ± 1.40% | 58.56 ± 1.41% | 44.82 ± 1.38% | 50.34% |
| | FELD | 56.17 ± 0.64% | 35.86 ± 0.41% | 53.03 ± 0.58% | 45.41 ± 0.58% | 47.61% |
| | MAHA | 63.89 ± 0.34% | 37.22 ± 0.23% | 58.90 ± 0.44% | 47.95 ± 0.34% | 51.99% |
| | MAHA* | 64.45 ± 0.36% | 37.83 ± 0.23% | 59.18 ± 0.43% | 48.33 ± 0.33% | 52.41% |
| 5-way 5-shot | MAML | 68.52 ± 0.79% | 44.56 ± 0.68% | 66.18 ± 0.71% | 51.85 ± 0.85% | 57.78% |
| | Meta-SGD | 67.87 ± 0.74% | 45.49 ± 0.68% | 66.84 ± 0.70% | 52.51 ± 0.81% | 58.18% |
| | MT-NET | 69.22 ± 0.75% | 46.57 ± 0.70% | 63.03 ± 0.69% | 53.49 ± 0.83% | 58.08% |
| | BMAML | 69.01 ± 0.74% | 46.06 ± 0.69% | 65.74 ± 0.67% | 52.43 ± 0.84% | 58.31% |
| | MUMOMAML | 70.49 ± 0.76% | 45.89 ± 0.69% | 67.31 ± 0.68% | 53.96 ± 0.82% | 59.41% |
| | HSML | 71.68 ± 0.73% | 48.08 ± 0.69% | 73.49 ± 0.68% | 56.32 ± 0.80% | 62.39% |
| | ARML | 73.34 ± 0.70% | 49.67 ± 0.67% | 74.88 ± 0.64% | 57.55 ± 0.82% | 63.86% |
| | FELD | 77.63 ± 0.46% | 55.80 ± 0.38% | 75.88 ± 0.41% | 63.68 ± 0.50% | 68.24% |
| | MAHA | 75.04 ± 0.26% | 54.39 ± 0.21% | 79.98 ± 0.20% | 65.09 ± 0.25% | 68.62% |
| | MAHA* | 75.82 ± 0.26% | 54.28 ± 0.22% | 79.91 ± 0.19% | 65.18 ± 0.25% | 68.79% |
7 Conclusion
This paper proposes a new meta-learning framework, MAHA, that performs robustly amidst heterogeneity and ambiguity. We aim to disentangle the stochastic representation through the dimension-wise pooling and the autoencoding structure, built on the newly devised encoder-decoder pipeline, to better leverage the latent variables. With the multi-step training process, comprehensive experiments are conducted on regression and classification. In the end, we argue that the proposed model captures the task identity with lower variance, leading to a noticeable improvement in performance. A potential limitation of MAHA is the additional computational cost of the flexible encoder composed of multiple attention modules. However, by applying it orthogonally to existing work, its compatibility and necessity are empirically verified. An interesting direction for future work is to apply our model to reinforcement learning. In particular, training a policy directly from well-clustered representations for sample-efficient exploration seems promising in environments with sparse rewards.
Broader Impact
When training meta-learning models, a customization process arises based on the problem at hand. When not using the benchmark datasets that frequently appear in academia, it becomes unclear to what extent distinct datasets should be combined when expecting the model to be versatile over every possible task generation. MAHA, in this respect, can guide a human to analyze and cluster the available data into separate clusters. Moreover, MAHA mainly benefits future AI industries where only limited communication between decentralized servers is available, as it can infer the global context even with a small amount of information. As a result, we do not expect any negative societal impacts, but we believe that MAHA possesses many implications for more realistic scenarios.
References

Interpretable machine learning in healthcare
. In Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics, pp. 559–560. Cited by: §1.  Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §5.2.
 Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §5.1.
 Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §5.1, §5.2, §5.2.
 Artificial intelligence, bias and clinical safety. BMJ Quality & Safety 28 (3), pp. 231–237. Cited by: §1.
 Interpretable endtoend urban autonomous driving with latent deep reinforcement learning. arXiv preprint arXiv:2001.08726. Cited by: §1.

Variational lossy autoencoder
. arXiv preprint arXiv:1611.02731. Cited by: §5.2, §6.1.  Big data deep learning: challenges and perspectives. IEEE access 2, pp. 514–525. Cited by: §1.
 How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?. arXiv preprint arXiv:1511.06348. Cited by: §1.
 Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558. Cited by: §5.1.
 Towards a neural statistician. arXiv preprint arXiv:1606.02185. Cited by: §5.2.
 Modelagnostic metalearning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400. Cited by: §2, §3.
 Probabilistic modelagnostic metalearning. In Advances in Neural Information Processing Systems, pp. 9516–9527. Cited by: §1, §2.
 Cyclical annealing schedule: a simple approach to mitigating kl vanishing. arXiv preprint arXiv:1903.10145. Cited by: §5.2.
 Conditional neural processes. arXiv preprint arXiv:1807.01613. Cited by: §1, §2, §3, §6.1, §6.1.
 Neural processes. arXiv preprint arXiv:1807.01622. Cited by: §1, §2, §4, §4, §6.1.

Dynamic fewshot visual learning without forgetting.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 4367–4375. Cited by: §6.1.  Metalearning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921. Cited by: §5.1.
 Convolutional conditional neural processes. Cited by: §2.
 Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534. Cited by: §5.1.
 Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409. Cited by: §1.
 The variational homoencoder: learning to learn high capacity generative models from few examples. arXiv preprint arXiv:1807.08919. Cited by: §5.2.
 Meta-learning in neural networks: a survey. arXiv preprint arXiv:2004.05439. Cited by: §1.
 Dirichlet variational autoencoder. Pattern Recognition 107, pp. 107514. Cited by: §5.2.
 Attentive neural processes. arXiv preprint arXiv:1901.05761. Cited by: §1, §2, §2, §6.1.
 Interpretable learning for self-driving cars by visualizing causal attention. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2942–2950. Cited by: §1.
 Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836. Cited by: §2.
 Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934. Cited by: §5.2.
 Efficient gradient-based inference through transformations between Bayes nets and neural nets. In International Conference on Machine Learning, pp. 1782–1790. Cited by: §4.
 Consistent generative query networks. arXiv preprint arXiv:1807.02033. Cited by: §2.
 Empirical evaluation of neural process objectives. In NeurIPS Workshop on Bayesian Deep Learning. Cited by: §2, §4.
 Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. arXiv preprint arXiv:1905.12917. Cited by: §1.
 Set transformer: a framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753. Cited by: §5.1.
 Bootstrapping neural processes. arXiv preprint arXiv:2008.02956. Cited by: §2.
 Gradient-based meta-learning with learned layerwise metric and subspace. arXiv preprint arXiv:1801.05558. Cited by: §2.
 When machine learning meets privacy: a survey and outlook. arXiv preprint arXiv:2011.11819. Cited by: §1.
 Stein variational gradient descent: a general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems 29, pp. 2378–2386. Cited by: §2.
 The functional neural process. In Advances in Neural Information Processing Systems, pp. 8746–8757. Cited by: §2.
 Deep learning applications and challenges in big data analytics. Journal of Big Data 2 (1), pp. 1–21. Cited by: §1.
 Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197. Cited by: §5.2.
 Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759. Cited by: §5.1.
 TADAM: task dependent adaptive metric for improved few-shot learning. arXiv preprint arXiv:1805.10123. Cited by: §6.1.
 Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238. Cited by: §6.1.
 Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: §2.
 Optimization as a model for few-shot learning. Cited by: §6.1.
 Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676. Cited by: §6.1.
 Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems, pp. 7959–7970. Cited by: §2, §5.1.
 Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. Cited by: §5.2.
 Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960. Cited by: §1, §2, §6.1.
 Garbage in, garbage out: how purportedly great ML models can be screwed up by bad data. Proceedings of Blackhat 2017. Cited by: §1.
 A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390. Cited by: §5.1.
 Uncertainty in machine learning: a safety perspective on autonomous driving. In International Conference on Computer Safety, Reliability, and Security, pp. 458–464. Cited by: §1.
 Sequential neural processes. In Advances in Neural Information Processing Systems, pp. 10254–10264. Cited by: §2, §2.
 Learning structured output representation using deep conditional generative models. Advances in Neural Information Processing Systems 28, pp. 3483–3491. Cited by: §5.2.
 Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pp. 3738–3746. Cited by: §5.2.
 Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852. Cited by: §1.
 Improved attentive neural processes. Cited by: §2.
 Meta-dataset: a dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096. Cited by: §1.
 Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11). Cited by: §6.2.
 Meta-learning: a survey. arXiv preprint arXiv:1810.03548. Cited by: §1.
 Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §4.
 The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural Computing and Applications, pp. 1–15. Cited by: §1.
 Matching networks for one shot learning. arXiv preprint arXiv:1606.04080. Cited by: §6.1.
 Toward multimodal model-agnostic meta-learning. arXiv preprint arXiv:1812.07172. Cited by: §1, §2, §6.2.
 Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378. Cited by: §2.
 Improved variational autoencoders for text modeling using dilated convolutions. In International Conference on Machine Learning, pp. 3881–3890. Cited by: §5.1.
 Hierarchically structured meta-learning. arXiv preprint arXiv:1905.05301. Cited by: §1, §2, §6.2, §6.2, §6.
 Automated relational meta-learning. arXiv preprint arXiv:2001.00745. Cited by: §1, §2, §6.2.
 Online structured meta-learning. arXiv preprint arXiv:2010.11545. Cited by: §2.
 Meta-learning without memorization. arXiv preprint arXiv:1912.03820. Cited by: §2.
 Deep sets. arXiv preprint arXiv:1703.06114. Cited by: §5.2.
 Learning hierarchical features from generative models. arXiv preprint arXiv:1702.08396. Cited by: §5.1.
 Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658. Cited by: §5.2, §5.2, §6.1.
 Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702. Cited by: §2.