1 Introduction
Few-shot learning (FSL) (Miller et al., 2000; Li et al., 2006; Lake et al., 2015) aims to learn classifiers from few examples per class. Recently, deep learning has been successfully exploited for FSL by learning meta-models from a large number of meta-training tasks. These meta-models can then be used for rapid adaptation to the target/meta-testing tasks, which have only a few training examples. Examples of such meta-models include: (1) metric-/similarity-based models, which learn contextual and task-specific similarity measures (Koch, 2015; Vinyals et al., 2016; Snell et al., 2017); and (2) optimization-based models, which receive gradients from an FSL task as input and predict either model parameters or parameter updates (Ravi and Larochelle, 2017; Munkhdalai and Yu, 2017; Finn et al., 2017; Wang et al., 2017).
In the past, FSL has mainly considered image domains, where all tasks are often sampled from one huge collection of data, such as Omniglot (Lake et al., 2011) and ImageNet (Vinyals et al., 2016), so that the tasks come from a single domain and are thus related. Due to this simplified setting, almost all previous works employ a common meta-model (metric-/optimization-based) for all few-shot tasks. However, this setting is far from the realistic scenarios in many real-world applications of few-shot text classification. For example, on an enterprise AI cloud service, many clients submit various tasks to train text classification models for business-specific purposes. The tasks could be classifying customers' comments or opinions on different products/services, monitoring public reactions to different policy changes, or determining users' intents in different types of personal assistant services. As most of the clients cannot collect enough data, their submitted tasks form a few-shot setting. Also, these tasks are significantly diverse, so a common metric is insufficient to handle all of them.
We consider a more realistic FSL setting where tasks are diverse. In such a scenario, the optimal meta-model may vary across tasks. Our solution is based on the metric-learning approach (Snell et al., 2017), and the key idea is to maintain multiple metrics for FSL. The meta-learner selects and combines multiple metrics for learning the target task, using task clustering on the meta-training tasks. During meta-training, we first partition the meta-training tasks into clusters, so that the tasks in each cluster are likely to be related. Then, within each cluster, we train a deep embedding function as the metric. This ensures that a common metric is shared only across tasks within the same cluster. Further, during meta-testing, each target FSL task is assigned a task-specific metric, which is a linear combination of the metrics defined by the different clusters. In this way, the diverse few-shot tasks can derive different metrics from the previous learning experience.
The key of the proposed FSL framework is the task clustering algorithm. Previous works (Kumar and Daume III, 2012; Kang et al., 2011; Crammer and Mansour, 2012; Barzilai and Crammer, 2015) mainly focused on convex objectives, and assumed that the number of classes is the same across different tasks (e.g. binary classification is often considered). To make task clustering (i) compatible with deep networks and (ii) able to handle tasks with varying numbers of labels, we propose a matrix-completion based task clustering algorithm. The algorithm utilizes task similarity measured by cross-task transfer performance, denoted by the matrix S. The $(i,j)$-entry of S is the estimated accuracy obtained by adapting the representations learned on the $i$-th (source) task to the $j$-th (target) task. We rely on matrix completion to deal with missing and unreliable entries in S, and finally apply spectral clustering to generate the task partitions.
To the best of our knowledge, our work is the first one addressing the diverse few-shot learning problem and reporting results on real-world few-shot text classification problems. The experimental results show that the proposed algorithm provides significant gains on few-shot sentiment classification and dialog intent classification tasks. They support the idea of using multiple meta-models (metrics) to handle diverse FSL tasks, as well as the proposed task clustering algorithm for automatically detecting related tasks.
2 Problem Definition
Few-Shot Learning
Since we focus on diverse metric-based FSL, the problem can be formulated in two stages: (1) meta-training, where a set of metrics $\mathcal{M} = \{\Lambda_1, \ldots, \Lambda_K\}$ is learned on the meta-training tasks $\mathcal{T}$. Each $\Lambda_k$ maps two inputs $(x_1, x_2)$ to a scalar similarity score. Here $\mathcal{T} = \{T_1, T_2, \ldots, T_N\}$ is a collection of $N$ tasks, and $K$ is a pre-defined number (usually $K \ll N$). Each task $T_i$ consists of a training, a validation, and a testing set, denoted as $\{D_i^{train}, D_i^{valid}, D_i^{test}\}$, respectively. Note that the definition of $\mathcal{T}$ is a generalized version of $\mathscr{D}_{meta\text{-}train}$ in (Ravi and Larochelle, 2017), since each task $T_i$ can be either few-shot (where $D_i^{valid}$ is empty) or regular (for example, the methods in Triantafillou et al. (2017) can be viewed as training meta-models from any sampled batches from one single meta-training dataset). (2) meta-testing, where the trained metrics in $\mathcal{M}$ are applied to the meta-testing tasks, denoted as $\mathcal{T}' = \{T'_1, \ldots, T'_{N'}\}$, where each $T'_i$ is a few-shot learning task consisting of both training and testing data $\{D_i'^{train}, D_i'^{test}\}$. $D_i'^{train}$ is a small labeled set for generating the prediction model $M'_i$ for each $T'_i$. Specifically, the $M'_i$s are kNN-based predictors built upon the metrics in $\mathcal{M}$. We will detail the construction of $M'_i$ in Section 3, Eq. (6). It is worth mentioning that the definition of $\mathcal{T}'$ is the same as $\mathscr{D}_{meta\text{-}test}$ in (Ravi and Larochelle, 2017). The performance of few-shot learning is the macro-average of the accuracy of the $M'_i$s on all the testing sets $D_i'^{test}$.
Our definitions can be easily generalized to other meta-learning approaches (Ravi and Larochelle, 2017; Finn et al., 2017; Mishra et al., 2017). The motivation for employing multiple metrics is that when the tasks are diverse, one metric model may not be sufficient. Note that previous metric-based FSL methods can be viewed as a special case of our definition where $\mathcal{M}$ contains only a single $\Lambda$, as shown in the two base model examples below.
Base Model: Matching Networks
In this paper we use the metric-based model Matching Network (MNet) (Vinyals et al., 2016) as the base metric model. The model (Figure 1b) consists of a neural network as the embedding function (encoder) and an augmented memory. The encoder, $f(\cdot)$, maps an input $x$ to a vector of length $d$. The learned metric is thus the similarity between the encoded vectors, $\Lambda(x_1, x_2) = f(x_1)^\top f(x_2)$, i.e. the metric $\Lambda$ is modeled by the encoder $f$. The augmented memory stores a support set $S = \{(x_i, y_i)\}_{i=1}^{|S|}$, where $x_i$ is a supporting instance and $y_i$ is its corresponding label in a one-hot format. The MNet explicitly defines a classifier $M$ conditioned on the supporting set $S$. For any new data $\hat{x}$, $M$ predicts its label via a similarity function $\alpha(\cdot, \cdot)$ between the test instance $\hat{x}$ and the support set $S$:
$$y = P(\cdot \mid \hat{x}, S) = \sum_{i=1}^{|S|} \alpha(\hat{x}, x_i; \theta)\, y_i, \qquad (1)$$
where we define $\alpha(\cdot, \cdot)$ to be a softmax distribution given $\Lambda(\hat{x}, x_i)$, where $x_i$ is a supporting instance, i.e., $\alpha(\hat{x}, x_i; \theta) = \exp\big(f(\hat{x})^\top f(x_i)\big) / \sum_{j=1}^{|S|} \exp\big(f(\hat{x})^\top f(x_j)\big)$, where $\theta$ are the parameters of the encoder $f$. Thus, $y$ is a valid distribution over the supporting set's labels $\{y_i\}_{i=1}^{|S|}$. To adapt the MNet to text classification, we choose the encoder $f$ to be a convolutional neural network (CNN) following Kim (2014) and Johnson and Zhang (2016). Figure 1 shows the MNet with the CNN architecture. Following (Collobert et al., 2011; Kim, 2014), the model consists of a convolution layer and a max-pooling operation over the entire sentence.
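The prediction rule of a matching network can be illustrated with a small, self-contained sketch. It assumes an inner-product similarity followed by a softmax over the support set; the function names (`matching_predict`, `encode`) and the toy identity encoder in the usage note are hypothetical and stand in for the CNN encoder described above.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))

def matching_predict(encode, x, support):
    """Return a label distribution for x given a support set.

    encode  : callable mapping an input to a list of floats (stand-in encoder)
    support : list of (x_i, y_i) pairs, where y_i is a one-hot list
    """
    q = encode(x)
    scores = [dot(q, encode(x_i)) for x_i, _ in support]
    m = max(scores)                       # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    attn = [w / z for w in weights]       # softmax attention over support instances
    n_labels = len(support[0][1])
    # Attention-weighted sum of the one-hot labels.
    return [sum(a * y[c] for a, (_, y) in zip(attn, support))
            for c in range(n_labels)]
```

For example, with an identity encoder and the support set `[([1.0, 0.0], [1, 0]), ([0.0, 1.0], [0, 1])]`, the query `[2.0, 0.0]` receives most of its probability mass on the first label, because its inner product with the first supporting instance is larger.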
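To train the MNets, we first sample the training dataset $D$ for task $T_i$ from all tasks $\mathcal{T}$, with notation simplified as $D \sim \mathcal{T}$. For each class in the sampled dataset $D$, we sample $k$ random instances in that class to construct a support set $S$, and sample a batch of training instances $B$ as training examples, i.e., $B, S \sim D$. The training objective is to minimize the prediction error of the training samples given the supporting set (with regard to the encoder parameters $\theta$) as follows:
$$\min_{\theta} \; \mathbb{E}_{D \sim \mathcal{T}} \Big[ \mathbb{E}_{B, S \sim D} \Big[ -\sum_{(x, y) \in B} \log P(y \mid x, S; \theta) \Big] \Big]. \qquad (2)$$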
Base Model: Prototypical Networks
3 Methodology
We propose a task-clustering framework to address the diverse few-shot learning problem stated in Section 2. The FSL algorithm is summarized in Algorithm 1, and Figure 2 gives an overview of our idea. The initial step of the algorithm is a novel task clustering algorithm based on matrix completion, which is described in Section 3.1. The few-shot learning method based on task clustering is then introduced in Section 3.2.
3.1 Robust Task Clustering by Matrix Completion
Our task clustering algorithm is shown in Algorithm 2. The algorithm first evaluates the transfer performance by applying a single-task model to another task (Section 3.1.1), which results in a (partially observed) cross-task transfer performance matrix S. The matrix S is then cleaned and completed, giving a symmetric task similarity matrix Y for spectral clustering (Ng et al., 2002).
3.1.1 Estimation of Cross-Task Transfer Performance
Using single-task models, we can compute performance scores $S_{ij}$ by adapting each model $M_i$ to each task $T_j$ ($j \neq i$). This forms an $n \times n$ pairwise classification performance matrix S, called the transfer-performance matrix. Note that S is asymmetric, since usually $S_{ij} \neq S_{ji}$.
Ideally, the transfer performance could be estimated by training an MNet on task $i$ and directly evaluating it on task $j$. However, the limited training data usually leads to generally low transfer performance of single-task MNets. As a result, we adopt the following approach to estimate S:
We train a CNN classifier $M_i$ (Figure 1(a)) on task $i$, then take only the encoder from $M_i$ and freeze it to train a classifier on task $j$. This gives us a new model for task $j$, and we test this model on the validation set of task $j$ to get the accuracy as the transfer performance $S_{ij}$. The score shows how well the representations learned on task $i$ can be adapted to task $j$, thus indicating the similarity between the tasks.
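The estimation procedure can be summarized as a short driver loop. The sketch below is illustrative only: `train_encoder` and `adapt_and_score` are hypothetical callables standing in for "train a CNN on the source task and keep its encoder" and "train a classifier on the target task on top of the frozen encoder, then return its validation accuracy". The optional `num_pairs` argument is an assumption added to show how only a random subset of task pairs could be evaluated when the number of tasks is large.

```python
import random

def estimate_transfer_scores(tasks, train_encoder, adapt_and_score,
                             num_pairs=None, seed=0):
    """Return {(i, j): score} for source task i and target task j (i != j).

    train_encoder(task)            -> encoder trained on the source task (hypothetical)
    adapt_and_score(encoder, task) -> accuracy on `task` with the encoder frozen (hypothetical)
    num_pairs                      -> number of unordered pairs to sample; None means all
    """
    n = len(tasks)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if num_pairs is not None:
        pairs = random.Random(seed).sample(pairs, min(num_pairs, len(pairs)))
    encoders = {}   # cache: each source encoder is trained only once
    scores = {}
    for i, j in pairs:
        for src, tgt in ((i, j), (j, i)):   # evaluate both directions of a pair
            if src not in encoders:
                encoders[src] = train_encoder(tasks[src])
            scores[(src, tgt)] = adapt_and_score(encoders[src], tasks[tgt])
    return scores
```

Evaluating both directions of every sampled pair keeps the two scores of a pair available together, which is what a later consistency check between the two directions needs.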
Remark: Out-of-Vocabulary Problem
In text classification tasks, transferring an encoder with fine-tuned word embeddings from one task to another is difficult, as there can be a significant difference between the two vocabularies. Hence, while learning the single-task CNN classifiers, we always keep the word embeddings fixed.
3.1.2 Task Clustering Method
Directly using the transfer performance for task clustering may suffer from both efficiency and accuracy issues. First, evaluating all entries in the matrix S involves conducting the source-target transfer learning $O(n^2)$ times, where $n$ is the number of meta-training tasks. For a large number of diverse tasks, where $n$ can be larger than 1,000, evaluation of the full matrix is unacceptable (over 1M entries to evaluate). Second, the estimated cross-task performance (i.e. some $S_{ij}$ or $S_{ji}$ scores) is often unreliable due to small data size or label noise. When the number of uncertain values is large, they can collectively mislead the clustering algorithm into outputting an incorrect task partition. To address these challenges, we propose a novel task clustering algorithm based on the theory of matrix completion (Candès and Tao, 2010). Specifically, we deal with the huge number of entries by randomly sampling task pairs for which to evaluate the $S_{ij}$ and $S_{ji}$ scores. Besides, we deal with the unreliable entries and the asymmetry issue by keeping only task pairs $(i, j)$ with consistent $S_{ij}$ and $S_{ji}$ scores, as will be introduced in Eq. (4). Below, we describe our method in detail.
Score Filtering
First, we use only reliable task pairs to generate a partially-observed similarity matrix Y. Specifically, if $S_{ij}$ and $S_{ji}$ are high enough, then it is likely that tasks $\{i, j\}$ belong to the same cluster and share significant information. Conversely, if $S_{ij}$ and $S_{ji}$ are low enough, then they tend to belong to different clusters. To this end, we need to design a mechanism to determine whether a performance is high or low enough. Since different tasks may vary in difficulty, a fixed threshold is not suitable. Hence, we define a dynamic threshold using the mean and standard deviation of the target task performance, i.e., $\mu_j = \mathrm{mean}(S_{:j})$ and $\sigma_j = \mathrm{std}(S_{:j})$, where $S_{:j}$ is the $j$-th column of S. We then introduce two positive parameters $p_1$ and $p_2$, and define high and low performance as $S_{ij}$ greater than $\mu_j + p_1 \sigma_j$ or lower than $\mu_j - p_2 \sigma_j$, respectively. When both $S_{ij}$ and $S_{ji}$ are high enough or both are low enough, we set the pairwise similarity to $1$ or $0$, respectively. Other task pairs are treated as uncertain, are marked as unobserved, and do not influence our clustering method. This leads to a partially-observed symmetric matrix Y, i.e.,
$$Y_{ij} = Y_{ji} = \begin{cases} 1 & \text{if } S_{ij} > \mu_j + p_1 \sigma_j \text{ and } S_{ji} > \mu_i + p_1 \sigma_i, \\ 0 & \text{if } S_{ij} < \mu_j - p_2 \sigma_j \text{ and } S_{ji} < \mu_i - p_2 \sigma_i, \\ \text{unobserved} & \text{otherwise.} \end{cases} \qquad (4)$$
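The filtering rule can be made concrete with a few lines of standard-library Python. This is a minimal sketch under several assumptions that the text does not fix: unevaluated pairs are represented by `None`, column statistics exclude the diagonal and use the population standard deviation, the diagonal of Y is set to 1, and the default values of `p1` and `p2` are purely illustrative.

```python
import statistics

def filter_scores(S, p1=0.5, p2=0.5):
    """Build the partially observed similarity matrix Y from transfer scores S.

    S[i][j] is the score of source task i on target task j, or None if the
    pair was never evaluated. Entries of Y are 1, 0, or None (unobserved).
    The defaults for p1 and p2 are illustrative, not taken from the paper.
    """
    n = len(S)
    mu, sigma = [], []
    for j in range(n):
        # Column j holds the scores obtained on target task j.
        col = [S[i][j] for i in range(n) if i != j and S[i][j] is not None]
        mu.append(statistics.mean(col))
        sigma.append(statistics.pstdev(col))
    Y = [[None] * n for _ in range(n)]
    for i in range(n):
        Y[i][i] = 1   # assumption: a task is maximally similar to itself
        for j in range(i + 1, n):
            a, b = S[i][j], S[j][i]
            if a is None or b is None:
                continue
            if a > mu[j] + p1 * sigma[j] and b > mu[i] + p1 * sigma[i]:
                Y[i][j] = Y[j][i] = 1      # both directions are high
            elif a < mu[j] - p2 * sigma[j] and b < mu[i] - p2 * sigma[i]:
                Y[i][j] = Y[j][i] = 0      # both directions are low
    return Y
```

Because an entry is written only when both directions agree, the resulting matrix is symmetric by construction, and a pair whose two scores disagree stays unobserved.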
Matrix Completion
Given the partially observed matrix Y, we then reconstruct the full similarity matrix $X \in \mathbb{R}^{n \times n}$. We first note that the similarity matrix X should be of low rank (proof deferred to the appendix). Additionally, since the observed entries of Y are generated based on high enough and low enough performance, it is safe to assume that most observed entries are correct and only a few may be incorrect. Therefore, we introduce a sparse matrix E to capture the observed incorrect entries in Y. Combining the two observations, Y can be decomposed into the sum of two matrices X and E, where X is a low-rank matrix storing the similarities between task pairs, and E is a sparse matrix that captures the errors in Y. The matrix completion problem can be cast as the following convex optimization problem:
$$\min_{X, E} \; \|X\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad P_\Omega(X + E) = P_\Omega(Y), \qquad (5)$$
where $\|\cdot\|_*$ denotes the matrix nuclear norm, the convex surrogate of the rank function, and $\lambda$ weights the sparsity penalty on E. $\Omega$ is the set of observed entries in Y, and $P_\Omega: \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is a matrix projection operator defined as $[P_\Omega(A)]_{ij} = A_{ij}$ if $(i, j) \in \Omega$, and $0$ otherwise.
Finally, we apply spectral clustering on the matrix X to get the task clusters.
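The text does not say which solver is used for the convex problem above. As one possible illustration, the sketch below applies an inexact augmented-Lagrangian scheme with singular value thresholding, a common choice for nuclear-norm plus L1 objectives; it is not claimed to be the authors' implementation. The function name `robust_complete`, the default `lam = 1/sqrt(n)`, and the step-size schedule (`mu`, `rho`, `mu_max`) are all assumptions.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entry-wise shrinkage, the proximal operator of the L1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def robust_complete(Y, mask, lam=None, mu=1.0, rho=1.2, mu_max=1e6, n_iter=200):
    """Sketch of  min ||X||_* + lam * ||E||_1  s.t.  P_Omega(X + E) = P_Omega(Y).

    Y    : (n, n) array; values outside the mask are ignored
    mask : (n, n) boolean array, True where Y is observed
    Returns the low-rank estimate X and the sparse error estimate E.
    """
    n = Y.shape[0]
    lam = 1.0 / np.sqrt(n) if lam is None else lam   # assumed default
    Yo = np.where(mask, Y, 0.0)
    X = np.zeros_like(Yo)
    E = np.zeros_like(Yo)
    L = np.zeros_like(Yo)                            # Lagrange multipliers
    for _ in range(n_iter):
        # X-step: singular value thresholding with threshold 1/mu.
        U, s, Vt = np.linalg.svd(Yo - E + L / mu, full_matrices=False)
        X = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # E-step: shrink observed entries; unobserved entries are unpenalized,
        # so E simply absorbs the residual there.
        G = Yo - X + L / mu
        E = np.where(mask, soft_threshold(G, lam / mu), G)
        # Dual update and gradual increase of the penalty parameter.
        L = L + mu * (Yo - X - E)
        mu = min(rho * mu, mu_max)
    return X, E
```

The returned X can then be passed to any spectral clustering routine as a precomputed affinity matrix. Letting E be unpenalized outside the observed set is a standard device for expressing the projection constraint with a single equality.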
Remark: Sample Efficiency
In Appendix A, we present Theorem 7.1 together with its proof, which implies that, under mild conditions, the matrix completion problem above can perfectly recover the underlying similarity matrix provided that enough correct entries are observed (the bound is stated in the theorem). This theoretical guarantee implies that, for a large number of training tasks, only a tiny fraction of all task pairs is needed to reliably infer the similarities over all task pairs.
3.2 Few-Shot Learning with Task Clusters
3.2.1 Training Cluster Encoders
For each cluster $C_k$, we train a multi-task MNet model (Figure 1(b)) with all tasks in that cluster to encourage parameter sharing. The result, denoted as $f_k$, is called the cluster-encoder of cluster $C_k$. The $k$-th metric, that of cluster $C_k$, is thus $\Lambda_k(x_1, x_2) = f_k(x_1)^\top f_k(x_2)$.
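3.2.2 Adapting Multiple Metrics for Few-Shot Learning
To build a predictor $M'$ with access to only a limited number of training samples, we form the prediction probability by linearly combining the predictions from the learned cluster-encoders:
$$p(y \mid x) = \sum_{k=1}^{K} \alpha_k \, P(y \mid x; f_k), \qquad (6)$$
where $f_k$ is the learned (and frozen) encoder of the $k$-th cluster, and $\{\alpha_k\}_{k=1}^{K}$ are adaptable parameters trained with the few-shot training examples. The predictor from each cluster is
$$P(y = y_l \mid x; f_k) = \frac{\exp\big(f_k(x_l)^\top f_k(x)\big)}{\sum_{i} \exp\big(f_k(x_i)^\top f_k(x)\big)}, \qquad (7)$$
where $x_l$ is the corresponding training sample of label $y_l$.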
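To make the combination step concrete, the following sketch computes a per-cluster prediction as a softmax over inner products with the few labeled examples and then mixes the per-cluster predictions linearly. It assumes the combination weights have already been fitted; how they are trained is not shown. All names (`cluster_predict`, `combined_predict`, the toy encoders in the usage note) are hypothetical.

```python
import math

def cluster_predict(encode, x, support):
    """Softmax over inner products between x and the labeled examples.

    encode  : callable mapping an input to a list of floats (one cluster-encoder)
    support : list of (x_l, y_l) pairs; returns {label: probability}
    """
    q = encode(x)
    scores = [sum(a * b for a, b in zip(encode(x_l), q)) for x_l, _ in support]
    m = max(scores)                               # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = {}
    for e, (_, y) in zip(exps, support):
        probs[y] = probs.get(y, 0.0) + e / z      # accumulate if a label repeats
    return probs

def combined_predict(encoders, alphas, x, support):
    """Linear combination of the per-cluster predictions with weights alphas."""
    combined = {}
    for f_k, a_k in zip(encoders, alphas):
        for y, p in cluster_predict(f_k, x, support).items():
            combined[y] = combined.get(y, 0.0) + a_k * p
    best = max(combined, key=combined.get)
    return best, combined
```

With one informative encoder and one uninformative encoder (for instance, one that maps every input to a zero vector), the uninformative encoder contributes a uniform distribution, so the combined prediction is driven by the informative cluster in proportion to its weight.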
Remark: Joint Method versus Pipeline Method
End-to-end joint optimization on training data has become a popular methodology for deep learning systems, but it is not directly applicable to diverse FSL. One main reason is that deep networks could easily fit any task partition if we optimize on the training loss only, so that the learned metrics do not generalize, as discussed in Section 6. As a result, this work adopts a pipeline training approach and employs validation sets for task clustering. Combining reinforcement learning with meta-learning could be a potential way to enable end-to-end training in future work.
4 Tasks and Data Sets
We test our methods by conducting experiments on two text classification data sets. We used the NLTK toolkit (http://www.nltk.org/) for tokenization. The tasks are divided into meta-training tasks and meta-testing tasks (target tasks), where the meta-training tasks are used for clustering and cluster-encoder training. The meta-testing tasks are few-shot tasks, which are used for evaluating the method in Eq. (6).
4.1 Amazon Review Sentiment Classification
First, following Barzilai and Crammer (2015), we construct multiple tasks with the multi-domain sentiment classification (Blitzer et al., 2007) data set. The dataset consists of Amazon product reviews for 23 types of products (see Appendix D for the details). For each product domain, we construct three binary classification tasks with different thresholds on the ratings: the tasks consider a review as positive if it belongs to one of the following buckets: $= 5$ stars, $\geq 4$ stars, or $\geq 2$ stars. (The data was downloaded from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/, in which the 3-star samples were unavailable due to their ambiguous nature (Blitzer et al., 2007).) These buckets then form the basis of the task setup, giving us $23 \times 3 = 69$ tasks in total. For each domain we distribute the reviews uniformly to the 3 tasks. For evaluation, we select 12 ($4 \times 3$) tasks from 4 domains (Books, DVD, Electronics, Kitchen) as the meta-testing (target) tasks out of all 23 domains. For the target tasks, we create 5-shot learning problems.
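The task construction can be sketched as follows. The sketch assumes that the task suffixes `t2`, `t4`, and `t5` correspond to the "at least 2 stars", "at least 4 stars", and "exactly 5 stars" buckets, that reviews are split across the three tasks of a domain in round-robin order, and that domains are identified by lowercase strings; none of these details is confirmed by the text.

```python
# Hypothetical mapping from task suffix to the rule that marks a review positive.
RULES = {
    "t2": lambda stars: stars >= 2,
    "t4": lambda stars: stars >= 4,
    "t5": lambda stars: stars == 5,
}
# Domains held out as meta-testing (target) tasks; the lowercase keys are an assumption.
TARGET_DOMAINS = {"books", "dvd", "electronics", "kitchen"}

def build_tasks(reviews_by_domain):
    """Map {domain: [(text, stars), ...]} to {(domain, rule): [(text, label), ...]}."""
    tasks = {}
    names = list(RULES)
    for domain, reviews in reviews_by_domain.items():
        for name in names:
            tasks[(domain, name)] = []
        for idx, (text, stars) in enumerate(reviews):
            name = names[idx % len(names)]   # round-robin split across the 3 tasks
            tasks[(domain, name)].append((text, int(RULES[name](stars))))
    return tasks

def split_meta(tasks):
    """Separate the tasks into meta-training and meta-testing groups by domain."""
    target = {k: v for k, v in tasks.items() if k[0] in TARGET_DOMAINS}
    train = {k: v for k, v in tasks.items() if k[0] not in TARGET_DOMAINS}
    return train, target
```

With 23 domains this yields 69 tasks, of which the 12 tasks from the four held-out domains become target tasks and the remaining 57 are used for meta-training.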
4.2 Real-World Tasks: User Intent Classification for Dialog System
The second dataset is from an online service which trains and serves intent classification models to various clients. The dataset comprises recorded conversations between human users and dialog systems in various domains, ranging from personal assistants to complex service-ordering or customer-service request scenarios. During classification, intent-labels (in conversational dialog systems, intent-labels are used to guide the dialog-flow) are assigned to user utterances (sentences). We use a total of 175 tasks from different clients, and randomly sample 10 tasks from them as our target tasks. For each meta-training task, we randomly sample 64% of the data into a training set and 16% into a validation set, and use the rest as the test set. The number of labels for these tasks varies a lot (from 2 to 100, see Appendix D for details), so regular $k$-shot settings are not necessarily limited-resource problems (e.g., 5-shot on 100 classes gives a sizable 500 training instances). Hence, to adapt this to an FSL scenario, for target tasks we keep one example for each label (one-shot), plus 20 randomly picked labeled examples, to create the training data. We believe this is a fairly realistic estimate of the labeled examples one client could provide easily.
Remark: Evaluation of the Robustness of Algorithm 2
Our matrix-completion method can handle a large number of tasks via task-pair sampling. However, the numbers of tasks in the above two few-shot learning datasets are not too large, so evaluation of the whole task-similarity matrix is still tractable. In our experiments, the incomplete matrices mainly come from the score-filtering step (see Eq. 4). Thus there is limited randomness involved in the generation of task clusters.
To strengthen the conclusion, we evaluate our algorithm on an additional dataset with a much larger number of tasks. The results are reported in the multi-task learning setting instead of the few-shot learning setting that this paper focuses on. We therefore put the results in a non-archival version of this paper (https://arxiv.org/pdf/1708.07918.pdf) for further reference.
5 Experiments
5.1 Experiment Setup
Baselines
We compare our method to the following baselines: (1) Single-task CNN: training a CNN model for each task individually; (2) Single-task FastText: training one FastText model (Joulin et al., 2016) with fixed embeddings for each individual task; (3) Fine-tuned holistic MTL-CNN: a standard transfer-learning approach, which trains one MTL-CNN model on all the training tasks offline, then fine-tunes the classifier layer (i.e. Figure 1(a)) on each target task; (4) Matching Network: a metric-learning based few-shot learning model trained on all training tasks; (5) Prototypical Network: a variation of the matching network with a different prediction function, as in Eq. (3); (6) Convex combination of all single-task models: training one CNN classifier on each meta-training task individually and taking its encoder, then, for each target task, training a linear combination of all these single-task encoders with Eq. (6). This baseline can be viewed as a variation of our method without task clustering. We initialize all models with pre-trained 100-dim GloVe embeddings (trained on a 6B corpus) (Pennington et al., 2014).
| Model | Avg Acc (Sentiment) | Avg Acc (Intent) |
|---|---|---|
| (1) Single-task CNN w/ pre-trained emb | 65.92 | 34.46 |
| (2) Single-task FastText w/ pre-trained emb | 63.05 | 23.87 |
| (3) Fine-tuned holistic MTL-CNN | 76.56 | 30.36 |
| (4) Matching Network (Vinyals et al., 2016) | 65.73 | 30.42 |
| (5) Prototypical Network (Snell et al., 2017) | 68.15 | 31.51 |
| (6) Convex combination of all single-task models | 78.85 | 34.43 |
| RobustTC-FSL | 83.12 | 37.59 |
| Adaptive RobustTC-FSL | - | 42.97 |
Hyper-Parameter Tuning
In all experiments, we set the $p_1$ and $p_2$ parameters in Eq. (4) to the same value, chosen to strike a balance between obtaining enough observed entries in Y and ensuring that most of the retained similarities are consistent with the cluster membership. The window/hidden-layer sizes of the CNN and the initialization of embeddings (random or pre-trained) are tuned during the cluster-encoder training phase, with the validation sets of the meta-training tasks. The resulting CNN has a window size of 5 and 200 hidden units. The single-metric FSL baselines have 400 hidden units in the CNN encoders. On sentiment classification, all cluster-encoders use randomly initialized word embeddings, whereas on intent classification they use GloVe embeddings as initialization, which is likely because the training sets of the intent tasks are usually small.
Since all the sentiment classification tasks are binary classification tasks by our dataset construction, a CNN classifier with a binary output layer can also be trained as the cluster-encoder for each task cluster. We therefore compared the CNN classifier, matching network, and prototypical network on Amazon review, and found that the CNN classifier performs similarly well to the prototypical network. Since some of the Amazon review data sets are quite large, which makes the computation of supporting sets more difficult, we finally use binary CNN classifiers as cluster-encoders in all the sentiment classification experiments.
Selecting the learning rate and the number of training epochs for FSL settings, i.e., for fitting the $\alpha$s in Eq. (6), is more difficult, since there is no validation data in few-shot problems. Thus we pre-select a subset of meta-training tasks as meta-validation tasks and tune the two hyper-parameters on the meta-validation tasks.
5.2 Experimental Results
Table 1 shows the main results on (i) the 12 few-shot product sentiment classification tasks, leveraging the knowledge learned from the 57 previously observed tasks from other product domains; and (ii) the 10 few-shot dialog intent classification tasks, leveraging the 165 previously observed tasks from other clients' data.
Due to the limited training resources, all the supervised-learning baselines perform poorly. The two state-of-the-art metric-based FSL approaches, matching network (4) and prototypical network (5), do not perform better than the other baselines, since a single metric is not sufficient for all the diverse tasks. On intent classification, where the tasks are more diverse, all the single-metric or single-model methods (3-5) perform worse than the single-task CNN baseline (1). The convex combination of all the single training task models is the best performing baseline overall. However, on intent classification it only performs on par with the single-task CNN (1), which does not use any meta-learning or transfer learning techniques, mainly for two reasons: (i) as the number of meta-training tasks grows, the model parameters grow linearly, making the number of parameters in Eq. (6) (165 in this case) too large for the few-shot tasks to fit; (ii) the meta-training tasks in intent classification usually contain less training data, so the single-task encoders do not generalize well.
In contrast, our RobustTC-FSL gives consistently better results than all the baselines. It outperforms the baselines from previous work (1-5) by a large margin of more than 6% on the sentiment classification tasks, and more than 3% on the intent classification tasks. It is also significantly better than our proposed baseline (6), showing the advantage of using task clustering.
Adaptive RobustTC-FSL
Although RobustTC-FSL improves over the baselines on intent classification, the margin is smaller than that on sentiment classification, because the intent classification tasks are more diverse in nature. This is also demonstrated by the training accuracy on the target tasks, where several tasks fail to find any cluster that could provide a metric that suits their training examples. To deal with this problem, we propose an improved algorithm that automatically discovers whether a target task belongs to none of the task clusters. If the task does not belong to any of the clusters, it cannot benefit from any previous knowledge and thus falls back to the single-task CNN. The target task is treated as "out-of-clusters" when none of the clusters achieves higher than 20% accuracy (a threshold selected on meta-validation tasks) on its training data. We call this method Adaptive RobustTC-FSL, which gives a performance boost of more than 5% over the best RobustTC-FSL result on intent classification. Note that the adaptive approach makes no difference on the sentiment tasks, because they are more closely related, so reusing cluster-encoders always achieves better results than single-task CNNs.
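The fallback rule amounts to a single comparison. The sketch below is a hypothetical helper: the function name and the returned string labels are illustrative, while the 20% threshold is the value quoted in the text.

```python
def choose_predictor(cluster_train_acc, threshold=0.20):
    """Decide whether a target task should reuse the cluster metrics.

    cluster_train_acc : accuracy of each cluster's metric on the target
                        task's few-shot training data
    Returns "cluster-metrics" if at least one cluster exceeds the threshold,
    and "single-task-cnn" (the out-of-clusters fallback) otherwise.
    """
    if max(cluster_train_acc) > threshold:
        return "cluster-metrics"
    return "single-task-cnn"
```

For instance, `choose_predictor([0.05, 0.18])` selects the single-task fallback, whereas `choose_predictor([0.05, 0.45])` keeps the combined cluster metrics.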
5.3 Analysis
Effect of the number of clusters
Figure 3 shows the effect of the number of clusters on the two tasks. RobustTC achieves the best performance with 5 clusters on sentiment analysis (SA) and 20 clusters on intent classification (Intent). All clustering results significantly outperform the single-metric baselines (#cluster=1 in the figure).
Clus0  Clus1  Clus2  Clus3  Clus4  Clus5  Clus6  Clus7  Clus8  Clus9  
automotive.t2  apparel.t2  baby.t5  automotive.t5  apparel.t5  beauty.t4  camera.t4  gourmet.t5  cell.t4  apparel.t4  
camera.t2  automotive.t4  magazines.t5  baby.t4  camera.t5  beauty.t5  software.t2  magazines.t4  software.t5  toys.t2  
health.t2  baby.t2  sports.t5  health.t4  grocery.t5  cell.t5  software.t4  music.t4  toys.t4  
magazines.t2  cell.t2  toys.t5  health.t5  jewelry.t5  gourmet.t2  music.t5  
office.t2  computer.t2  video.t5  gourmet.t4  video.t4  
outdoor.t2  computer.t4  grocery.t2  
sports.t2  computer.t5  grocery.t4  
sports.t4  jewelry.t4  office.t4  
music.t2  outdoor.t4  
video.t2  
dvdt4  0.4844  0.4416  0.4625  0.7843  0.7970  0.7196  0.8952  0.3763  0.7155  0.6315 
dvdt5  0.0411  0.2493  0.5037  0.3567  0.1686  0.0355  0.4150  0.2603  0.0867  0.0547 
kitchent4  0.6823  0.7268  0.7929  1.2660  1.1119  0.7255  1.2196  0.7065  0.6625  1.0945 
Effect of the clustering algorithms
Compared to previous task clustering algorithms, our RobustTC is the only one that can cluster tasks with varying numbers of class labels (e.g. in intent classification tasks). Moreover, we show that even in the setting where all tasks are binary classification tasks (e.g. the sentiment-analysis tasks), which previous task clustering research works on, our RobustTC is still slightly better for the diverse FSL problems. Figure 3 compares it with a state-of-the-art logistic regression based task clustering method (ASAP-MTLR) (Barzilai and Crammer, 2015). Our RobustTC clusters give slightly better FSL performance (e.g. 83.12 vs. 82.65 when #cluster=5).
Visualization of Task Clusters
The top rows of Table 2 show the ten clusters used to generate the sentiment classification results in Figure 3. From the results, we can see that tasks with the same thresholds are usually grouped together, and that tasks in similar domains also tend to appear in the same clusters, even when the thresholds are slightly different (e.g. t2 vs t4 and t4 vs t5).
The bottom of the table shows the weights $\alpha$ in Eq. (6) for the target tasks with the largest improvement. It confirms that our RobustTC-FSL algorithm accurately adapts multiple metrics for the target tasks.
6 Related Work
Few-Shot Learning
FSL (Miller et al., 2000; Li et al., 2006; Lake et al., 2015) aims to learn classifiers for new classes with only a few training examples per class. Recent deep learning based FSL approaches mainly fall into two categories: (1) metric-based approaches (Koch, 2015; Vinyals et al., 2016; Snell et al., 2017), which aim to learn generalizable metrics and corresponding matching functions from multiple training tasks. These approaches essentially learn one metric for all tasks, which is sub-optimal when the tasks are diverse. (2) optimization-based approaches (Ravi and Larochelle, 2017; Munkhdalai and Yu, 2017; Finn et al., 2017), which aim to learn to optimize model parameters (by either predicting the parameter updates or directly predicting the model parameters) given the gradients computed from few-shot examples.
Previous FSL research usually adopts the $k$-shot, $N$-way setting, where all the few-shot tasks have the same number $N$ of class labels, and each label has $k$ training instances. Moreover, these few-shot tasks are usually constructed by sampling from one huge dataset, so all the tasks are guaranteed to be related to each other. However, in real-world applications, the few-shot learning tasks can be diverse: there are different tasks with varying numbers of class labels, and they are not guaranteed to be related to each other. As a result, a single meta-model or metric-model is usually not sufficient to handle all the few-shot tasks.
Task Clustering
Previous task clustering methods measure the task relationships in terms of similarities among single-task model parameters (Kumar and Daume III, 2012; Kang et al., 2011), or jointly assign task clusters and train model parameters for each cluster to minimize the overall training loss (Crammer and Mansour, 2012; Barzilai and Crammer, 2015; Murugesan et al., 2017). These methods usually work on convex models but do not fit deep networks, mainly because (i) the parameters of deep networks are very high-dimensional, and their similarities are not necessarily related to the functional similarities; and (ii) deep networks have flexible representation power, so they may overfit to arbitrary cluster assignments if we consider the training loss alone. Moreover, these methods require identical class label sets across different tasks, which does not hold in most realistic settings.
7 Conclusion
We propose a few-shot learning approach for diverse tasks based on task clustering. The proposed method can use multiple metrics, and performs significantly better than previous single-metric methods when the few-shot tasks come from diverse domains. Future work includes applying the task-clustering idea to other FSL algorithms (Ravi and Larochelle, 2017; Finn et al., 2017; Cheng et al., 2017), and exploring more advanced composition methods of cluster-encoders beyond linear combination (Chang et al., 2013; Andreas et al., 2016).
References

Andreas et al. (2016)
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016.
Neural module networks.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 39–48.  Barzilai and Crammer (2015) Aviad Barzilai and Koby Crammer. 2015. Convex multitask learning by clustering. In AISTATS.
 Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447.
 Candès and Tao (2010) Emmanuel J Candès and Terence Tao. 2010. The power of convex relaxation: Nearoptimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
 Chandrasekaran et al. (2011) Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. 2011. Ranksparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596.
 Chang et al. (2013) Shiyu Chang, GuoJun Qi, Jinhui Tang, Qi Tian, Yong Rui, and Thomas S Huang. 2013. Multimedia lego: Learning structured model by probabilistic logic ontology tree. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 979–984. IEEE.
 Cheng et al. (2017) Yu Cheng, Mo Yu, Xiaoxiao Guo, and Bowen Zhou. 2017. Fewshot learning with meta metric learners. In NIPS 2017 Workshop on MetaLearning.

Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
Crammer and Mansour (2012) Koby Crammer and Yishay Mansour. 2012. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems, pages 1475–1483.
Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
Johnson and Zhang (2016) Rie Johnson and Tong Zhang. 2016. Supervised and semi-supervised text categorization using one-hot LSTM for region embeddings. stat, 1050:7.
Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Kang et al. (2011) Zhuoliang Kang, Kristen Grauman, and Fei Sha. 2011. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 521–528.
 Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.
Koch (2015) Gregory Koch. 2015. Siamese neural networks for one-shot image recognition. Ph.D. thesis, University of Toronto.
Kumar and Daume III (2012) Abhishek Kumar and Hal Daume III. 2012. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).
 Lake et al. (2011) Brenden M Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B Tenenbaum. 2011. One shot learning of simple visual concepts. In CogSci, volume 172, page 2.
Lake et al. (2015) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
Li et al. (2006) Fei-Fei Li, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611.
 Miller et al. (2000) Erik G Miller, Nicholas E Matsakis, and Paul A Viola. 2000. Learning from one example through shared densities on transforms. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 1, pages 464–471. IEEE.
Mishra et al. (2017) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. A simple neural attentive meta-learner. In NIPS 2017 Workshop on Meta-Learning.
 Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Meta networks. arXiv preprint arXiv:1703.00837.
Murugesan et al. (2017) Keerthiram Murugesan, Jaime Carbonell, and Yiming Yang. 2017. Co-clustering for multitask learning. arXiv preprint arXiv:1703.00994.

Ng et al. (2002) Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856.
Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.
Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In International Conference on Learning Representations, volume 1, page 6.
Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S Zemel. 2017. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.
Triantafillou et al. (2017) Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. 2017. Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pages 2252–2262.
 Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.
Wang et al. (2017) Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. 2017. Learning to model the tail. In Advances in Neural Information Processing Systems 30, pages 7032–7042.
Appendix A: Perfect Recovery Guarantee for the Problem (3.1.2)
The following theorem shows the perfect recovery guarantee for the problem (3.1.2). Appendix C provides the proof for completeness.
Theorem 7.1.
Let $X \in \mathbb{R}^{n \times n}$ be a rank-$r$ matrix with a singular value decomposition $X = U \Sigma V^\top$, where $U$ and $V$ are the left and right singular vectors of X, respectively. Similar to many related works on matrix completion, we assume that the following two assumptions are satisfied:
The row and column spaces of X have coherence bounded above by a positive number $\mu_0$.

The maximum absolute value of the entries of the matrix $U V^\top$ is bounded above by $\mu_1 \sqrt{r} / n$ for a positive number $\mu_1$.
Suppose that entries of are observed with their locations sampled uniformly at random, and among the observed entries, randomly sampled entries are corrupted. Using the resulting partially observed matrix as the input to the problem (3.1.2), then with a probability at least , the underlying matrix can be perfectly recovered, given

,

,

,
where is a positive constant; and denotes the lowrank and sparsity incoherence (Chandrasekaran et al., 2011).
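To make the setting of the theorem concrete, the following Python sketch simulates it on a toy problem: a block-structured similarity matrix is partially observed, a few observed entries are flipped, and a low-rank matrix plus a sparse error term is fitted to the observations. It assumes that problem (3.1.2) has the usual robust matrix completion form (a nuclear norm plus a weighted l1 norm, subject to agreement on the observed entries). The solver (an inexact augmented Lagrangian iteration), the function names, the weight 1/sqrt(n), and the sampling rates are illustrative choices and are not taken from the paper.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entry-wise shrinkage (proximal operator of tau * l1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Singular value shrinkage (proximal operator of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def robust_complete(Y, mask, lam, rho=1.1, n_iter=300, mu_max=1e7):
    """Hypothetical solver for: min ||X||_* + lam * ||E||_1 over observed
    entries, subject to X + E = Y on observed entries (mask == True)."""
    Y = np.where(mask, Y, 0.0)
    X = np.zeros_like(Y)
    E = np.zeros_like(Y)
    Lam = np.zeros_like(Y)                 # Lagrange multipliers
    mu = 1.25 / np.linalg.norm(Y, 2)       # common heuristic starting penalty
    for _ in range(n_iter):
        X = svd_threshold(Y - E + Lam / mu, 1.0 / mu)
        T = Y - X + Lam / mu
        # Observed entries are shrunk; unobserved entries act as free slack.
        E = np.where(mask, soft_threshold(T, lam / mu), T)
        Lam = Lam + mu * (Y - X - E)
        mu = min(mu * rho, mu_max)
    return X, E

# Toy data: 12 tasks in 3 clusters, so the true similarity matrix has rank 3.
rng = np.random.default_rng(0)
truth = np.repeat(np.arange(3), 4)
X_true = (truth[:, None] == truth[None, :]).astype(float)
n = X_true.shape[0]

upper = np.triu(rng.random((n, n)) < 0.7)          # observe ~70% of pairs
mask = upper | upper.T                             # symmetric observation set
flip = np.triu(rng.random((n, n)) < 0.05, 1)       # corrupt a few pairs
flip = (flip | flip.T) & mask
Y = np.where(flip, 1.0 - X_true, X_true)           # symmetric noisy input

X_hat, E_hat = robust_complete(Y, mask, lam=1.0 / np.sqrt(n))
residual = np.max(np.abs(np.where(mask, X_hat + E_hat - Y, 0.0)))
rel_error = np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)
print("constraint residual:", residual, "relative error:", rel_error)
```

The script prints the constraint residual on observed entries and the relative recovery error, so the effect of changing the observation rate or the corruption rate can be checked directly. The theorem itself is a statement about the exact optimizer, which this iteration only approximates.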
Theorem 7.1 implies that even if some of the observed entries computed by (4) are incorrect, problem (3.1.2) can still perfectly recover the underlying similarity matrix if the number of observed correct entries is at least . For MATL with a large number of tasks, this implies that only a tiny fraction of all task pairs is needed to reliably infer similarities over all task pairs. Moreover, the completed similarity matrix X is symmetric, due to the symmetry of the input matrix Y. This enables analysis by similarity-based clustering algorithms, such as spectral clustering.
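As an illustration of the last point, the sketch below clusters tasks from a symmetric similarity matrix using a simplified variant of the normalized spectral clustering recipe of Ng et al. (2002). The input is a synthetic noisy block matrix standing in for a completed matrix X. The function name, the noise level, and the deterministic farthest-point initialization of k-means are assumptions made for this example, not details from the paper.

```python
import numpy as np

def spectral_clustering(S, k, n_iter=20):
    """Cluster n items from a symmetric (n, n) similarity matrix S into k groups."""
    S = np.clip((S + S.T) / 2.0, 0.0, None)        # symmetric, non-negative affinity
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]   # D^{-1/2} S D^{-1/2}
    _, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
    Y = vecs[:, -k:]                               # top-k eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)    # row-normalize

    # Deterministic farthest-point initialization for k-means.
    centers = [Y[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)

    # Plain Lloyd iterations on the embedded rows.
    for _ in range(n_iter):
        dists = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

# Synthetic stand-in for a completed similarity matrix: 3 clusters of 3 tasks.
rng = np.random.default_rng(0)
truth = np.repeat(np.arange(3), 3)
S = (truth[:, None] == truth[None, :]).astype(float)
noise = rng.normal(0.0, 0.05, size=S.shape)
S_noisy = S + (noise + noise.T) / 2.0              # keep the matrix symmetric

labels = spectral_clustering(S_noisy, k=3)
print(labels)
```

Cluster indices returned by such a procedure are arbitrary, so only the induced partition of tasks is meaningful.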
Appendix B: Proof of Lowrankness of Matrix X
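We first prove that the full similarity matrix X is of low rank. To see this, let $A = (a_1, \ldots, a_k)$ be the underlying perfect clustering result, where $k$ is the number of clusters and $a_i \in \{0, 1\}^n$ is the membership vector for the $i$-th cluster. Given A, the similarity matrix X is computed as
$$X = \sum_{i=1}^{k} a_i a_i^\top = \sum_{i=1}^{k} B_i,$$
where $B_i = a_i a_i^\top$ is a rank-one matrix. Using the facts that $\mathrm{rank}(B_i) = 1$ and $\mathrm{rank}(B + B') \le \mathrm{rank}(B) + \mathrm{rank}(B')$ for any two matrices $B$ and $B'$, we have $\mathrm{rank}(X) \le \sum_{i=1}^{k} \mathrm{rank}(B_i) = k$, i.e., the rank of the similarity matrix X is upper bounded by the number of clusters. Since the number of clusters is usually small, the similarity matrix X should be of low rank.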
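The rank bound can be checked numerically on a small example. The sketch below builds a 0/1 membership matrix for a hypothetical assignment of eight tasks to three clusters, forms the similarity matrix as a sum of rank-one outer products, and compares its rank with the number of clusters. The assignment vector and the helper name are made up for illustration.

```python
import numpy as np

def similarity_from_clusters(assignments, num_clusters):
    """Return the membership matrix A (one 0/1 column per cluster) and X = A A^T."""
    n = len(assignments)
    A = np.zeros((n, num_clusters))
    A[np.arange(n), assignments] = 1.0
    X = A @ A.T
    return A, X

# Hypothetical perfect clustering of 8 tasks into 3 clusters.
assignments = np.array([0, 0, 1, 2, 1, 0, 2, 2])
num_clusters = 3
A, X = similarity_from_clusters(assignments, num_clusters)

# X equals the sum of the rank-one matrices a_i a_i^T, one per cluster.
rank_one_sum = sum(np.outer(A[:, i], A[:, i]) for i in range(num_clusters))
rank_X = np.linalg.matrix_rank(X)
print("rank of X:", rank_X, "number of clusters:", num_clusters)
```

Because every cluster in this example is non-empty and the membership vectors are mutually orthogonal, the bound is attained with equality here.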
Appendix C: Proof of Theorem 7.1
We now prove our main theorem. First, we define notation that is used throughout the proof. Let $X = U \Sigma V^\top$ be the singular value decomposition of matrix X, where $U = (u_1, \ldots, u_r) \in \mathbb{R}^{n \times r}$ and $V = (v_1, \ldots, v_r) \in \mathbb{R}^{n \times r}$ are the left and right singular vectors of matrix X, respectively. Similar to many related works on matrix completion, we assume that the following two assumptions are satisfied:

A1: the row and column spaces of X have coherence bounded above by a positive number $\mu_0$, i.e., $\frac{n}{r} \max_{i} \|P_U e_i\|_2^2 \le \mu_0$ and $\frac{n}{r} \max_{i} \|P_V e_i\|_2^2 \le \mu_0$, where $P_U = U U^\top$, $P_V = V V^\top$, and $e_i$ is the standard basis vector, and

A2: the matrix $U V^\top$ has a maximum entry bounded by $\mu_1 \sqrt{r} / n$ in absolute value for a positive number $\mu_1$.
Let $T$ be the space spanned by the elements of the form $u_k y^\top$ and $x v_k^\top$, for $1 \le k \le r$, where $x$ and $y$ are arbitrary $n$-dimensional vectors. Let $T^\perp$ be the orthogonal complement to the space $T$, and let $P_T$ be the orthogonal projection onto the subspace $T$ given by
$$P_T(Z) = P_U Z + Z P_V - P_U Z P_V.$$
The following proposition shows that any matrix $Z \in T$ is the zero matrix if sufficiently many of its entries are zero.
Proposition 1.
Let be a set of entries sampled uniformly at random from , and projects matrix Z onto the subset . If , where with and being a positive constant, then for any with , we have with probability .
Proof.
In the following, we will develop a theorem for the dual certificate that guarantees the unique optimal solution to the following optimization problem
s.t. 
Theorem 1.
Suppose we observe entries of X with locations sampled uniformly at random, denoted by . We further assume that entries randomly sampled from observed entries are corrupted, denoted by . Suppose that and the number of observed correct entries . Then, for any , with a probability at least , the underlying true matrices is the unique optimizer of (Appendix C: Proof of Theorem 7.1) if both assumptions A1 and A2 are satisfied and there exists a dual such that (a) , (b) , (c) , (d) , and (e) .
Proof.
First, the existence of Q satisfying the conditions (a) to (e) ensures that is an optimal solution. We only need to show its uniqueness and we prove it by contradiction. Assume there exists another optimal solution , where . Then we have
where and satisfying , , and . As a result, we have
We then choose and to be such that and . We thus have
Since is also an optimal solution, we have , leading to , or . Since , we have , where and . Hence, , where . Since , according to Proposition 1, we have, with a probability , . Besides, since and , we have . Since , we have , which leads to the contradiction. ∎
Given Theorem 1, we are now ready to prove Theorem 7.1.
Proof.
The key to the proof is to construct the matrix Q that satisfies the conditions (a)-(e) specified in Theorem 1. First, according to Theorem 1, when , with a probability at least , mapping is a one-to-one mapping and therefore its inverse mapping, denoted by , is well defined. Similar to the proof of Theorem 2 in Chandrasekaran et al. (2011), we construct the dual certificate Q as follows
where and . We further define
H  
F 
Evidently, we have since , and therefore the condition (a) is satisfied. To satisfy the conditions (b)-(e), we need
(10)  
(11)  
(12)  
(13) 
Below, we will first show that there exist solutions and that satisfy conditions (10) and (12). We will then bound , , , and to show that with sufficiently small and , and appropriately chosen , conditions (11) and (13) can be satisfied as well.
First, we show the existence of and that obey the relationships in (10) and (12). It is equivalent to show that there exists that satisfies the following relation
or
where indicates the complement set of set in and denotes its cardinality. Similar to the previous argument, when , with a probability , is a one-to-one mapping, and therefore is well defined. Using this result, we have the following solution to the above equation
We now bound and . Since , we bound instead. First, according to Corollary 3.5 in Candès and Tao (2010), when , with a probability , for any , we have
Using this result, we have
In the last step, we use the fact that if . We then proceed to bound as follows
Combining the above two inequalities together, we have
which leads to