On the Importance of Attention in Meta-Learning for Few-Shot Text Classification

06/03/2018 · Xiang Jiang, et al. · Imagia Cybernetics Inc. · Dalhousie University

Current deep learning based text classification methods are limited in their ability to achieve fast learning and generalization when data is scarce. We address this problem by integrating a meta-learning procedure that uses the knowledge learned across many tasks as an inductive bias towards better natural language understanding. Based on the Model-Agnostic Meta-Learning framework (MAML), we introduce the Attentive Task-Agnostic Meta-Learning (ATAML) algorithm for text classification. The essential difference between MAML and ATAML is the separation of task-agnostic representation learning and task-specific attentive adaptation. The proposed ATAML is designed to encourage task-agnostic representation learning by way of task-agnostic parameterization and to facilitate task-specific adaptation via attention mechanisms. We provide evidence to show that the attention mechanism in ATAML has a synergistic effect on learning performance. In comparisons with models trained from random initialization, pretrained models and meta-trained MAML, our proposed ATAML method generalizes better on single-label and multi-label classification tasks on the miniRCV1 and miniReuters-21578 datasets.


1 Introduction

Deep neural networks have shown great success in learning representations from data, but effective training of a deep neural network requires a large number of training examples and many gradient-based optimization steps. This is mainly owing to a lack of prior knowledge when solving a new task. Meta-learning or “learning to learn” Schmidhuber [1987], Mitchell and Thrun [1993], Vilalta and Drissi [2002] addresses this limitation by acquiring meta-knowledge from the learning experience across many tasks. The knowledge acquired by the meta-learner provides inductive bias Thrun [1998] that gives rise to sample-efficient fast learning algorithms.

Previous work on deep learning based meta-learning can be summarized into four categories: learning representations that encourage fast adaptation on new tasks Finn et al. [2017a, b], learning universal learning procedure approximators by supplying training examples to the meta-learner that outputs predictions on the testing examples Hochreiter et al. [2001], Vinyals et al. [2016], Santoro et al. [2016], Mishra et al. [2017], learning to generate model parameters conditioned on training examples Gomez and Schmidhuber [2005], Munkhdalai and Yu [2017], Ha et al. [2016], and learning optimization algorithms to exploit structures in related problems Bengio et al. [1992], Ravi and Larochelle [2016], Andrychowicz et al. [2016], Li and Malik [2017].

Although considerable research has been devoted to meta-learning, research until now has tended to focus on image classification and reinforcement learning, while less attention has been paid to text classification. In this work, we propose a meta-learning algorithm specifically designed for text classification.

The proposed method is based on Model-Agnostic Meta-Learning [MAML; see Finn et al., 2017a] that explicitly guides optimization towards adaptive representations. While MAML does not discriminate different levels of representations and adapts all parameters for a new task, we introduce Attentive Task-Agnostic Meta-Learner (ATAML) that learns task-agnostic representation while fast-adapting attention parameters to distinguish different tasks.

In effect, ATAML involves two levels of learning: a meta-learner that learns across many tasks to obtain task-agnostic representations in the form of a convolutional or recurrent network, and a base-learner that optimizes the attention parameters of each task for fast adaptation. Crucially, ATAML takes into account the importance of attention in document classification and aims to encourage task-specific attentive adaptation while learning task-agnostic text representations.

We introduce smaller versions of the RCV1 and Reuters-21578 datasets, miniRCV1 and miniReuters-21578, tailored to few-shot text classification, and we show that on these datasets our method leads to significant improvements when compared with randomly initialized, pretrained and MAML-learned models. We also analyze the impact of architectural choices for representation learning and show the effectiveness of dilated convolutional networks for few-shot text classification. Furthermore, the findings on both datasets support our claim on the importance of attention in text-based meta-learning.

The contribution of this work is threefold:

  • We propose a new meta-learning algorithm ATAML for text classification that separates task-agnostic representation learning and task-specific attentive adaptation.

  • We show that an attentive base-learner together with a task-agnostic meta-learner generalizes better.

  • We provide evidence as to how attention helps representation learning in ATAML.

2 Model-Agnostic Meta-Learning

In this section, we follow the meta-learning problem formulation of Ravi and Larochelle [2016] and revisit the MAML algorithm Finn et al. [2017a], which we adapt for text classification in Section 4.

Model-Agnostic Meta-Learning [Finn et al., 2017a] is a meta-learning algorithm that aims to learn representations that encourage fast adaptation across different tasks. The meta-learner and base-learner share the same network structure, and the parameters learned by the meta-learner are used to initialize the base-learner on any given task.

To form an “episode” [Vinyals et al., 2016] to optimize the meta-learner, we first sample a set of tasks $\{\mathcal{T}_i\}$ from the meta-training set $\mathcal{D}_{\text{meta-train}}$, where $\mathcal{T}_i \sim p(\mathcal{T})$. For a meta-learner parameterized by $\theta$, we compute its adapted parameters $\theta_i'$ for each sampled task $\mathcal{T}_i$:

$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})$   (1)

where $\alpha$ is the step size of the gradient. The adapted parameters $\theta_i'$ are task-specific and tell us the effectiveness of $\theta$ as to whether it can achieve generalization through one or a few additional gradient steps. The meta-learner’s objective is hence to minimize the generalization error of $f_{\theta_i'}$ across all tasks:

$\min_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) = \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})}\big)$   (2)

Note that the meta-learner is not aimed at explicitly optimizing the task-specific parameters $\theta_i'$. Rather, the objective of the meta-learner is to optimize the representation $\theta$ so that it can lead to good task-specific adaptations with a few gradient steps. In other words, the goal of fast learning is integrated into the meta-learner’s objective.

The meta-learner is optimized by backpropagating the error through the task-specific parameters $\theta_i'$ to their common pre-update parameters $\theta$. The gradient-based updating rule is:

$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$   (3)

where $\beta$ is the learning rate of the meta-learner. The meta-learner performs slow learning at the meta-level across many tasks to support fast learning on new tasks. At meta-test time, we initialize the base-learner’s parameters from the meta-learned representation $\theta$ and fine-tune the base-learner using gradient descent on the training set $D^{\text{train}}_{\mathcal{T}}$ of a new task $\mathcal{T}$. The meta-learner is evaluated on the corresponding test set $D^{\text{test}}_{\mathcal{T}}$.
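To make the update rules concrete, the following is a minimal PyTorch-style sketch of one MAML meta-iteration corresponding to Eqs. (1)-(3). The helpers `forward`, `loss_fn` and the task-batch format are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import torch

def maml_step(theta, tasks, forward, loss_fn, alpha=0.01, beta=0.001):
    """One meta-iteration of MAML over a batch of sampled tasks.

    theta:   dict name -> leaf tensor with requires_grad=True
    tasks:   iterable of ((x_train, y_train), (x_test, y_test)) pairs
    forward: forward(theta, x) -> predictions; loss_fn(pred, y) -> scalar loss
    """
    meta_loss = 0.0
    for (x_tr, y_tr), (x_te, y_te) in tasks:
        # Inner step, Eq. (1): adapt theta on the task's training split.
        inner_loss = loss_fn(forward(theta, x_tr), y_tr)
        grads = torch.autograd.grad(inner_loss, list(theta.values()),
                                    create_graph=True)
        theta_i = {k: p - alpha * g
                   for (k, p), g in zip(theta.items(), grads)}
        # Eq. (2): generalization error of the adapted parameters.
        meta_loss = meta_loss + loss_fn(forward(theta_i, x_te), y_te)
    # Outer step, Eq. (3): update the shared pre-update parameters theta.
    meta_grads = torch.autograd.grad(meta_loss, list(theta.values()))
    with torch.no_grad():
        for p, g in zip(theta.values(), meta_grads):
            p -= beta * g
    return float(meta_loss)
```

The `create_graph=True` flag keeps the inner-step computation differentiable, so the outer gradient in Eq. (3) flows through the adaptation itself.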

MAML works with any differentiable neural network structure and has been applied to various tasks including regression, image classification, reinforcement learning and imitation learning. Extensions of MAML include learning the base-learner’s learning rate Li et al. [2017] and applying a bias transformation to concatenate a vector of parameters to the hidden layer of the base-learner Finn et al. [2017b]. It is also theorized that MAML has the same expressive power as other universal learning procedure approximators and generalizes well to out-of-distribution tasks Finn and Levine [2017].

3 Few-Shot Text Classification

Few-shot learning is commonly characterized as $N$-way $K$-shot, or $N$-class $K$-shot learning, which contains $N$ classes with only $K$ examples for each class. Few-shot learning is often accomplished by making use of the knowledge learned from a collection of tasks at an earlier time, and it has made rapid progress on image classification problems Li et al. [2006], Lake et al. [2015], Koch et al. [2015], often represented by the Omniglot Lake et al. [2011] and MiniImageNet Vinyals et al. [2016], Ravi and Larochelle [2016] datasets.

We extend few-shot learning from image classification to the text classification domain, with the goal of learning a text classification model from a few examples. Many important problems require learning text classification models from small amounts of data. As an example, predicting if a person is likely to commit suicide could save many lives, but it is often difficult to collect psychiatric case histories and suicide notes Shneidman and Farberow [1956], Matykiewicz and Pestian [2012]. Furthermore, in the biomedical domain, active learning can be used to classify clinical text data so as to reduce the burden of annotation Figueroa et al. [2012]. The ability to achieve fast and effective learning from a few annotated examples can jump-start the active learning process and improve the convergence of active learning, thereby maximizing the efficiency of human involvement.

A great body of research in natural language processing emphasizes the importance of attention in a variety of tasks Shen et al. [2018], Lin et al. [2017], Vaswani et al. [2017]. These papers show that attention can retrieve a task-specific representation of the input from a sequence of text encodings produced by a CNN or LSTM. Attention can help decompose the contents of a document into “subproblems” Parikh et al. [2016], thus producing task-specific representations; this ability to decompose text encodings also allows us to learn shared representations across tasks.

In the context of meta-learning for few-shot text classification, we empirically show that there is a synergistic relationship between meta-learning a shared text embedding across tasks and learning task-specific representations through attention. Intuitively, by constraining the text embedding parameters to be shared across different tasks in an episode, the attention learns to be more task-specific and to better decompose the document according to the task at hand.

4 Attentive Task-Agnostic Meta-Learning

4.1 The Attentive Base Learner

The base-learner is a neural network trained on each text classification task $\mathcal{T}$ under a loss function $\mathcal{L}_{\mathcal{T}}$. The base-learner reads the $T$-word input document $x = (x_1, \dots, x_T)$, where $x_t$ denotes the $t$-th word,

$h_1, \dots, h_T = f_{\theta_E}(x_1, \dots, x_T)$   (4)

The base-learner in (4) encodes the input sequence into a corresponding sequence of states $h_1, \dots, h_T$, where $f_{\theta_E}$ can take the form of a recurrent or convolutional network with parameters $\theta_E$.

We then apply a content-based attention mechanism Bahdanau et al. [2014], Hermann et al. [2015], Graves et al. [2014], Sukhbaatar et al. [2015] that enables the model to focus on different aspects of the document. The specific attention formulation used here is defined in (5) and belongs to a type of feedforward attention Raffel and Ellis [2015],

$\gamma_t = \theta_{\text{ATT}}^{\top} h_t, \qquad c = \frac{1}{T} \sum_{t=1}^{T} \gamma_t h_t$   (5)

where $\theta_{\text{ATT}}$ represents the attention parameter vector. For each memory state $h_t$, we calculate its inner product with the attention parameter, resulting in a scalar $\gamma_t$. The scalar rescales each state into $\gamma_t h_t$, and these are averaged to obtain the final representation $c$ of a document. The attention retrieves relevant information from a document and offers interpretability into the model behavior by explaining the importance of each word, through the attention weight $\gamma_t$, that contributes to the final prediction.

Once an input document is encoded into the vectorized representation $c$, we apply a softmax classifier parameterized by $\theta_{\text{OUT}}$ to obtain the predictions $\hat{y}$. The softmax classifier is replaced by a set of sigmoid classifiers if the labels are not mutually exclusive in multi-label classification,

$\hat{y} = \mathrm{softmax}(\theta_{\text{OUT}}\, c)$   (6)
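The attentive base-learner of Eqs. (4)-(6) can be sketched as a small PyTorch module; the encoder is passed in as a black box, and the names (`encoder`, `att`, `clf`) and dimensions are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentiveBaseLearner(nn.Module):
    def __init__(self, encoder, hidden_dim, num_classes, multi_label=False):
        super().__init__()
        self.encoder = encoder                             # f_{theta_E}: (B, T, E) -> (B, T, H)
        self.att = nn.Parameter(torch.randn(hidden_dim))   # attention vector theta_ATT
        self.clf = nn.Linear(hidden_dim, num_classes)      # output classifier theta_OUT
        self.multi_label = multi_label

    def forward(self, embedded_docs):
        h = self.encoder(embedded_docs)                    # Eq. (4): states h_1 .. h_T
        gamma = torch.einsum('bth,h->bt', h, self.att)     # Eq. (5): gamma_t = <theta_ATT, h_t>
        c = (gamma.unsqueeze(-1) * h).mean(dim=1)          # rescale each state, then average
        logits = self.clf(c)
        # Eq. (6): softmax for single-label, element-wise sigmoids for multi-label.
        return torch.sigmoid(logits) if self.multi_label else logits.softmax(-1)
```

Keeping the attention vector and classifier as parameters separate from the encoder is what makes them natural candidates for the task-specific parameter set in ATAML, described next, while the encoder parameters stay shared.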

4.2 The Attentive Task-Agnostic Meta-Learner

ATAML learns to obtain common representations that can be shared across different tasks while having the fast learning ability to quickly adapt to new tasks. In contrast with MAML, which does not make any distinction between different parameters in the meta-learner, the proposed ATAML splits all parameters $\theta$ into two disjoint sets, shared parameters $\theta_S$ and task-specific parameters $\theta_T$, and employs discriminative strategies in the meta-training and meta-testing phases. The shared parameters are aimed at representation learning, while the task-specific parameters are aimed at capturing task-specific information for classification.

4.2.1 Meta Training

1: Require $\mathcal{D}_{\text{meta-train}}$: the meta-train set
2: Require $N$-way $K$-shot learning
3: Require $M$: classification tasks for each training episode
4: Require $\alpha, \beta$: task- and meta-level learning rates
5: Require $\theta_S$: shared parameters for representation learning
6: Require $\theta_T$: parameters to be adapted at the task level
7: randomly initialize $\theta_S$ and $\theta_T$ ▷ Initialize all parameters
8: while not done do
9:     Sample $M$ tasks $\{\mathcal{T}_i\} \sim \mathcal{D}_{\text{meta-train}}$ ▷ Sample tasks for meta-training
10:     for all $\mathcal{T}_i$ do
11:         $\theta'_{T,i} \leftarrow \theta_T - \alpha \nabla_{\theta_T} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_S, \theta_T})$ ▷ Get task-specific parameters
12:     $\mathcal{L}_{\text{meta}} \leftarrow \sum_i \mathcal{L}_{\mathcal{T}_i}(f_{\theta_S, \theta'_{T,i}})$ ▷ Get loss of the meta-learner
13:     $\theta_T \leftarrow \theta_T - \beta \nabla_{\theta_T} \mathcal{L}_{\text{meta}}$ ▷ Update task-specific parameters
14:     $\theta_S \leftarrow \theta_S - \beta \nabla_{\theta_S} \mathcal{L}_{\text{meta}}$ ▷ Update shared parameters
Algorithm 1 Attentive Task-Agnostic Meta-Learner

The Attentive Task-Agnostic Meta-Learning training algorithm is described in Algorithm 1. We use $\theta$ to denote all parameters of the model ($\theta = \{\theta_S, \theta_T\}$), which is divided into shared parameters $\theta_S$ and task-specific parameters $\theta_T$, where $\theta_S \cap \theta_T = \emptyset$.

To create one meta-training “episode” [Vinyals et al., 2016], we sample $M$ tasks from $\mathcal{D}_{\text{meta-train}}$ and optimize the model towards fast learning across all sampled tasks $\{\mathcal{T}_i\}$. As we are sampling random tasks from $\mathcal{D}_{\text{meta-train}}$ in each meta-training iteration, the goal of the meta-learner is to obtain task-agnostic representations that are reusable across different tasks.

For every task $\mathcal{T}_i$ in the meta-training iteration, we only update the task-specific parameters in the base-learner, which are initialized with $\theta_T$ and updated to $\theta'_{T,i}$ using the task-specific gradients $\nabla_{\theta_T} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_S, \theta_T})$. We further calculate the expected loss according to the post-update parameters, composed of the task-specific fast weights $\theta'_{T,i}$ and the shared slow weights $\theta_S$,

$\mathcal{L}_{\text{meta}} = \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta_S,\, \theta'_{T,i}}\big)$   (7)

The resulting loss $\mathcal{L}_{\text{meta}}$ can be understood as the loss of the meta-learner. It gives us an evaluation measure on how well the task-specific parameters $\theta_T$ can adapt across all the sampled tasks $\{\mathcal{T}_i\}$, together with a measure on how well the shared parameters $\theta_S$ can be reused across all tasks. The meta-optimization therefore consists of minimizing $\mathcal{L}_{\text{meta}}$ with respect to all parameters $\theta$, towards optimizing the model’s adaptability and reusability across different tasks. The meta-training iterations are repeated until the model converges, and the resulting parameters are then used as initialization at meta-test time.
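A matching sketch of one ATAML meta-training iteration (Algorithm 1 and Eq. (7)), under the assumption that `theta_S` and `theta_T` are dictionaries of shared and task-specific tensors and that `forward` and `loss_fn` are the same hypothetical helpers used in the MAML sketch above.

```python
import torch

def ataml_step(theta_S, theta_T, tasks, forward, loss_fn, alpha=0.01, beta=0.001):
    meta_loss = 0.0
    for (x_tr, y_tr), (x_te, y_te) in tasks:
        # Inner loop: adapt only the task-specific weights (e.g., attention and classifier).
        inner_loss = loss_fn(forward(theta_S, theta_T, x_tr), y_tr)
        grads = torch.autograd.grad(inner_loss, list(theta_T.values()),
                                    create_graph=True)
        theta_T_i = {k: p - alpha * g
                     for (k, p), g in zip(theta_T.items(), grads)}
        # Eq. (7): meta-loss uses fast weights theta_T_i and shared slow weights theta_S.
        meta_loss = meta_loss + loss_fn(forward(theta_S, theta_T_i, x_te), y_te)
    # Outer loop: update both the shared and the task-specific parameters.
    all_params = list(theta_S.values()) + list(theta_T.values())
    meta_grads = torch.autograd.grad(meta_loss, all_params)
    with torch.no_grad():
        for p, g in zip(all_params, meta_grads):
            p -= beta * g
    return float(meta_loss)
```

The only structural difference from the MAML sketch is that the inner gradient is taken with respect to `theta_T` alone, while the outer update still reaches `theta_S` through the shared forward pass.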

4.2.2 Meta Testing

Meta-testing involves evaluating the meta-learned model on the meta-test set $\mathcal{D}_{\text{meta-test}}$ by fine-tuning on $D^{\text{train}}_{\mathcal{T}}$ and testing on $D^{\text{test}}_{\mathcal{T}}$, where $\mathcal{T} \sim \mathcal{D}_{\text{meta-test}}$. We introduce a meta-testing approach that freezes the shared representation learning parameters $\theta_S$ and only applies gradient updates to the task-specific parameters $\theta_T$. In contrast to fine-tuning all parameters for a new task, this discriminative meta-testing procedure is more coherent with the stratified meta-training strategy. It also regularizes few-shot learning, which improves generalization.
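The discriminative meta-testing step then reduces to fine-tuning only $\theta_T$ while $\theta_S$ stays frozen; below is a sketch with the same assumed helpers as above, not the reference code.

```python
import torch

def meta_test_finetune(theta_S, theta_T, train_set, forward, loss_fn,
                       alpha=0.01, steps=5):
    x, y = train_set                       # the K-shot training split of the new task
    for _ in range(steps):
        loss = loss_fn(forward(theta_S, theta_T, x), y)
        grads = torch.autograd.grad(loss, list(theta_T.values()))
        with torch.no_grad():
            for p, g in zip(theta_T.values(), grads):
                p -= alpha * g             # theta_S is never touched
    return theta_T
```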

5 Experiments

We provide three sets of empirical evaluations, on the single-label miniRCV1, multi-label miniRCV1 and multi-label miniReuters-21578 datasets, to analyze the proposed meta-learning framework.

5.1 The Base Learners

We use Temporal Convolutional Networks (TCN), a type of dilated convolutional network Van Den Oord et al. [2016], as the base learner. We have also conducted experiments with a bidirectional LSTM Schuster and Paliwal [1997] as the base learner. Details on those experiments, as well as the LSTM architecture, are included in the Appendix due to lack of space.

The TCN contains two layers of dilated causal convolutions with filter size 3 and dilation rate 3. Each convolutional layer is followed by a Leaky Rectified Linear Unit Maas et al. [2013] with negative slope rate 0.01, which is followed by 50% dropout Srivastava et al. [2014]. For word representation, we use 300-dimensional Glove embeddings Pennington et al. [2014]. For optimization, we use the Adam optimizer Kingma and Ba [2014]. For the loss function, we use the categorical cross-entropy error when each document contains only one label, and the sigmoid cross-entropy error when each document may contain multiple labels. Although it is common to use threshold calibration algorithms for multi-label classification, we use the constant 0.5 as the prediction threshold in order to reduce the impact of external algorithms.
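For illustration, here is a rough PyTorch sketch of such a TCN encoder (two dilated causal convolutions with filter size 3 and dilation rate 3, LeakyReLU with slope 0.01, 50% dropout) operating on 300-dimensional GloVe embeddings; the channel width and module names are assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=3):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-pad so outputs stay causal
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (B, C, T)
        return self.conv(F.pad(x, (self.pad, 0)))

class TCNEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, dropout=0.5):
        super().__init__()
        self.block = nn.Sequential(
            CausalConv1d(emb_dim, hidden), nn.LeakyReLU(0.01), nn.Dropout(dropout),
            CausalConv1d(hidden, hidden), nn.LeakyReLU(0.01), nn.Dropout(dropout),
        )

    def forward(self, embedded):                         # embedded: (B, T, emb_dim)
        h = self.block(embedded.transpose(1, 2))         # convolve over the time axis
        return h.transpose(1, 2)                         # back to (B, T, hidden) states
```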

5.2 Data

Reuters Corpus Volume I (RCV1) is an archive of news stories for research on text categorization Lewis et al. [2004]. We create two versions of the miniRCV1 dataset by selecting a subset from the full RCV1 dataset to study the effect of few-shot learning in text classification:

  1. miniRCV1 for single-label classification, consisting of the 55 second-level topics as target classes. We sample 20 documents from each class, which are further divided into a training set that contains 5 documents and a testing set that contains 15 documents. Documents with overlapping topics are removed to ensure each document contains a single label.

  2. miniRCV1 for multi-label classification, consisting of 102 out of 103 non-mutually exclusive labels. Each document is associated with a set of labels, and we exclude one label that only appears once in the corpus. We sample about 20 documents for each class and divide them into training and testing sets in a similar manner. It is worthwhile to mention that, due to the inherent properties of multi-labeled data Zhang and Zhou [2014], some classes may contain more examples than other classes.

Similar to miniRCV1, we create a smaller version of the Reuters-21578 dataset by selecting about 20 examples for each label.
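The subsampling procedure described above could be sketched as follows; `docs_by_class` is a hypothetical mapping from class label to candidate document ids, and the per-class counts follow the 20-document, 5-train/15-test split used for miniRCV1.

```python
import random

def build_mini_split(docs_by_class, per_class=20, n_train=5, seed=0):
    """Sample roughly per_class documents per label and split them 5/15."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, doc_ids in docs_by_class.items():
        sampled = rng.sample(doc_ids, min(per_class, len(doc_ids)))
        train[label] = sampled[:n_train]     # few-shot training documents
        test[label] = sampled[n_train:]      # remaining documents for testing
    return train, test
```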

5.3 Few-shot Learning Setup

At the meta-level, we divide all classes into meta-train, meta-validation and meta-test sets. In the $N$-way $K$-shot setup, during meta-training, we randomly sample $N$ classes from the meta-training set, where each class contains $K$ training examples. At meta-test time, we randomly sample $N$ classes from the meta-test set and calculate evaluation statistics across many runs. We evaluate 5-way 1-shot, 5-way 5-shot, 10-way 1-shot and 10-way 5-shot learning for both single-label and multi-label classification. The single-label classification task is evaluated on classification accuracy; the multi-label classification task is evaluated on micro and macro F1-scores, which measure the average F1-scores across all labels. They differ in that micro-averaging gives equal weight to each example regardless of label imbalance, whereas macro-averaging treats different labels equally.
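A small sketch of the episode sampling and of the micro/macro F1 evaluation follows; the scikit-learn calls are standard, while the episode data structures are illustrative assumptions.

```python
import random
from sklearn.metrics import f1_score

def sample_episode(class_splits, n_way=5, k_shot=1, seed=None):
    """Pick N classes and K documents per class from a {label: [doc ids]} split."""
    rng = random.Random(seed)
    classes = rng.sample(list(class_splits), n_way)
    return {c: rng.sample(class_splits[c], k_shot) for c in classes}

def evaluate_multilabel(y_true, y_pred):
    """y_true, y_pred: binary indicator arrays of shape (num_examples, num_labels)."""
    # Micro-F1 pools all (example, label) decisions; macro-F1 averages per-label F1 scores.
    return (f1_score(y_true, y_pred, average='micro', zero_division=0),
            f1_score(y_true, y_pred, average='macro', zero_division=0))
```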

5.4 Results and Discussion

As with other meta-learning paradigms, we consider two baselines: models trained from random initialization, i.e., “random”, and models pretrained across many sampled meta-train tasks, i.e., “pretrained”. In addition, we also compare our proposed ATAML framework with MAML under a similar architecture. Our experiments show that while MAML achieves better accuracies compared to the aforementioned baselines, ATAML significantly outperforms MAML in all 1-shot learning experiments. Table 5, Table 6 and Table 3 summarize these results on the single-label miniRCV1, multi-label miniRCV1 and multi-label miniReuters-21578 experiments, wherein “Meta” denotes the type of meta learner, “Base” denotes the type of base learner, “(A)” denotes models trained with attention and the bold numbers highlight the best performing ones at the 95% confidence interval.

The difficulty of few-shot learning.

Few-shot text classification is a challenging task, as text data contain rich information from various aspects that is difficult to ascertain from a few training examples. This difficulty is manifested in the poor testing performance when training from random initialization. Meanwhile, in both multi-label classification tasks, the TCN models perform much better when we increase the number of training examples from 1 to 5 per class. Furthermore, we show in the Appendix that classic machine learning algorithms, such as support vector machines, naive Bayes multinomial and K-nearest neighbors, as well as document embedding algorithms, such as doc2vec Levine and Haus [1985] and doc2vecC Chen [2017], also suffer from data scarcity in few-shot learning.

The difficulty of pretraining in few-shot learning.

We empirically find it generally ineffective to make use of pretrained models in few-shot learning. This can be explained by the “contradictory outputs” of the pretraining tasks Finn et al. [2017a]. Put differently, as each task contains a small number of examples, when we pretrain the model on many tasks from the meta-training set, the sampled tasks provide contradictory supervisory signals to the classifier, making it difficult to pretrain effectively.

Why do pretrained $N$-way $K$-shot TCN models perform so poorly?

In multi-label classification tasks, some labels appear less frequently in the training data. This label imbalance causes uncalibrated output probabilities when using the constant 0.5 as the prediction threshold. Some pretrained models perform worse than random guessing because their output probabilities are not well distributed.

The effect of meta learning.

From all three experiments, the empirical results demonstrate that the basic MAML with attentive base learners performs notably better than the non-meta-learned baselines. More importantly, the proposed ATAML algorithm offers further improvements that are statistically significant in all the 1-shot learning experiments. These empirical findings support the need for meta-learning in few-shot text classification, and they further support the importance of learning task-agnostic representations together with task-specific attentive adaptations.

Method 5-way Accuracy 10-way Accuracy
Meta Base 1-shot 5-shot 1-shot 5-shot
random TCN (A) 41.52% 65.64% 28.32% 45.12%
pretrained TCN (A) 24.06% 57.08% 18.60% 45.85%
MAML TCN (A) 47.09% 72.65% 31.57% 62.75%
ATAML TCN (A) 54.05% 72.79% 39.48% 61.74%
Table 1: Comparing single-label classification accuracies between baselines and ATAML on miniRCV1
Method 5-way Micro-F1 10-way Micro-F1 5-way Macro-F1 10-way Macro-F1
Meta Base 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
random TCN (A) 38.9% 60.9% 40.6% 45.6% 31.4% 55.7% 22.9% 33.1%
pretrained TCN (A) 26.9% 55.8% 33.5% 52.1% 17.0% 51.5% 14.9% 41.4%
MAML TCN (A) 52.3% 69.1% 44.9% 58.6% 43.2% 64.3% 27.7% 48.4%
ATAML TCN (A) 59.7% 71.1% 50.7% 61.3% 54.3% 65.0% 38.5% 49.2%
Table 2: Comparing multi-label classification outcomes between baselines and ATAML on miniRCV1
Method 5-way Micro-F1 10-way Micro-F1 5-way Macro-F1 10-way Macro-F1
Meta Base 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
random TCN (A) 38.2% 66.0% 25.1% 44.9% 30.6% 55.0% 17.9% 33.6%
pretrained TCN (A) 23.5% 50.3% 18.4% 49.1% 16.4% 37.8% 12.0% 37.3%
MAML TCN (A) 52.4% 74.1% 38.1% 61.2% 44.3% 64.3% 29.9% 51.2%
ATAML TCN (A) 66.3% 76.5% 42.6% 60.8% 60.9% 69.4% 34.9% 51.2%
Table 3: Comparing multi-label classification between baselines and ATAML on miniReuters-21578

5.5 The Importance of Attention

The importance of attention lies in its synergistic effect on the meta learners. Under the same meta-learning framework, introducing attention to the base learner leads to improved generalization when compared with non-attentive base learners. From the empirical results shown in Table 4, we find that MAML trained with attention performs better than MAML without attention. We have similar findings in the miniRCV1 experiments detailed in the Appendix.

Furthermore, the proposed ATAML framework implicitly evaluates the learned representations by freezing the meta-learned representation parameters when fine-tuning on a new task. Under those circumstances, a well-trained representation will facilitate the learning of a new task, while a poorly trained representation will hinder effective adaptation. In our empirical studies, ATAML performs the best across all experiments when we use TCN as the base learner.

Focusing only on base learners that are equipped with attention mechanisms, we find that although MAML provides reasonable improvements over the baselines, models trained with ATAML offer substantial improvements in generalization when compared with the rest of the models. This hints at the benefits brought by shared representation learning and discriminative fine-tuning.

Putting all these together, we empirically find that both the attention mechanism and the meta learner are crucial components for good generalization in few-shot text classification. To better understand the representation learning procedure as well as the role of attention in meta training, we undertake ablation studies to provide further insights into ATAML.

Method 5-way Micro-F1 10-way Micro-F1 5-way Macro-F1 10-way Macro-F1
Meta Base 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
random E (A) 36.7% 66.1% 25.2% 49.1% 29.2% 55.0% 18.2% 36.8%
MAML E (A) 44.9% 72.3% 26.4% 59.2% 35.6% 61.7% 19.6% 47.4%
MAML TCN 26.4% 65.7% 11.4% 44.5% 19.1% 52.7% 7.6% 31.2%
MAML TCN (A) 52.4% 74.1% 38.1% 61.2% 44.3% 64.3% 29.9% 51.2%
ATAML TCN (A) 66.3% 76.5% 42.6% 60.8% 60.9% 69.4% 34.9% 51.2%
ATAML TCN (-) 62.7% 77.5% 49.5% 63.7% 58.3% 71.1% 41.6% 54.2%
Table 4: Ablation studies on miniReuters-21578 for multi-label classification

5.6 Ablation Studies

With ablation studies, we offer evidence of the need to learn text in a structured manner as opposed to making classifications at the word level alone. We use “E (A)” to denote a base learner where an attention model is applied directly to the word embeddings. The goal of this model is to extract individual words to make predictions, providing a measure of classification performance when only individual word-level representations are taken into account. The empirical results in Table 4 suggest that classifying from word embeddings is inferior to the proposed ATAML model, indicating the need to learn text structures, such as phrase- or sentence-level representations. Moreover, learning from only a few examples exacerbates the effect of over-fitting, as spurious correlations are more likely at the word level than at the phrase or sentence level. It is therefore desirable to have the ability to learn text structures.

To analyze the role of attention in meta-training, we construct an attention-based meta-training strategy in which the attention parameters are not updated in each meta-training iteration. Although the attention parameters are not updated during meta-training, they take task-specific fast weights as in regular ATAML, and these fast weights have direct influence over the gradients of the TCN layers. The goal of this model is to exploit the fast weights of the attention parameters and examine whether this could produce well-trained representations without learning the attention parameters in meta-learning. This model, denoted as “TCN (-)”, performs similarly to the regular ATAML models in Table 4. Thus, the role of attention in meta-training is to facilitate the learning of shared representations, rather than to learn the attention parameters themselves. In addition, we show in the Appendix that the proposed ATAML works better than document embedding approaches, which further confirms its ability to aggregate information from substructures.

5.7 Visualizing Learned Attentions

Figure 1: Visualizing attentions learned by MAML TCN(A).
Figure 2: Visualizing attentions learned by ATAML TCN(A).

Figure 1 and Figure 2 are visualizations of the attention learned by MAML and ATAML, respectively. The density of the blue color indicates the weight, or importance, of each word for the model's prediction. The target label for this document is “REGULATION/POLICY”, and both models make correct predictions for this training example. Additional visualizations are provided in the Appendix.

The MAML model illustrated in Figure 1 over-fits the training data and only searches for repetitive words, such as “tobacco” and “drug”, that are merely spurious correlations. On the other hand, the proposed ATAML suffers less from over-fitting and searches for phrases, such as “accept regulation of” and “recommendation that”, which are relevant to “REGULATION/POLICY” for prediction. This suggests that the proposed ATAML is able to discover local text substructures via attention from shared representation learning, which has a regularization effect when adapting to new tasks.

5.8 The Impact of Base Learner

Current research on meta-learning typically uses an LSTM as the meta-learner, while we experimented with both LSTM and TCN as the base learner. Although meta-learning works with both LSTM and TCN, and both provide improvements over randomly initialized and pretrained models, it is worthwhile to highlight their different properties. Overall, TCN trains faster and generalizes better than LSTM. One main problem when using LSTM as the base learner is that, in meta-training, the LSTM saturates at a very early stage owing to difficulties in optimization, preventing the meta-learner from obtaining sharable representations across different tasks. Detailed quantitative comparisons are included in the Appendix.

6 Conclusion

We propose a meta-learning approach that enables the development of text classification models from only a few training examples. The proposed Attentive Task-Agnostic Meta-Learner encourages the learning of shared representations across different tasks, and its attention mechanism is capable of decomposing text into substructures for task-specific adaptation. We also found that attention facilitates learning text representations that can be shared across different tasks. The importance of attention in meta-learning for few-shot text classification is clearly supported by our empirical studies on the miniRCV1 and miniReuters-21578 datasets. We also provided ablation analyses and visualizations to gain insight into how the different components of the model work together. To the best of our knowledge, this is the first work to raise the question of few-shot text classification. Future work should further characterize what makes a good few-shot text classification algorithm.

References

Appendix A Training Details

We choose two commonly used text classification models as base learners for empirical analysis. The first base learner is a bidirectional LSTM Schuster and Paliwal [1997] that contains 128 hidden nodes in each direction. The second base learner is a dilated convolutional network Van Den Oord et al. [2016] that contains two layers of dilated causal convolutions Bai et al. [2018] with filter size 3 and dilation rate 3. Each convolutional layer is followed by leaky rectified linear units Maas et al. [2013] and 50% dropconnect Wan et al. [2013]. We use residual blocks as described in Bai et al. [2018]. For word representation, we use Glove embeddings Pennington et al. [2014]. For optimization, we use the Adam optimizer Kingma and Ba [2014] and clip the gradients to a maximum L2-norm of 1.0. For the loss function, we use the categorical cross-entropy error when each document contains only one label, and the sigmoid cross-entropy error when each document may contain multiple labels. When the Temporal Convolutional Network (TCN) is used as the base learner, we experimented with a three-layer architecture, but it did not work as well as a two-layer model. We also tried batch normalization, but it did not provide performance improvements. We find that taking more than one fast gradient step in meta-training improves learning, and we use 5 gradient steps in our experiments.
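As a sketch of these training details, the inner loop below takes 5 fast gradient steps on the task-specific parameters, and the meta-gradient is rescaled to a maximum global L2 norm of 1.0 before being applied; the helper names mirror the earlier sketches and remain hypothetical assumptions rather than the authors' code.

```python
import torch

def inner_adapt(theta_S, theta_T, x, y, forward, loss_fn, alpha=0.01, steps=5):
    # Take several fast gradient steps on the task-specific parameters only.
    fast = dict(theta_T)
    for _ in range(steps):
        loss = loss_fn(forward(theta_S, fast, x), y)
        grads = torch.autograd.grad(loss, list(fast.values()), create_graph=True)
        fast = {k: p - alpha * g for (k, p), g in zip(fast.items(), grads)}
    return fast

def clip_and_apply(params, grads, max_norm=1.0, beta=0.001):
    # Rescale the meta-gradients to a maximum global L2 norm before applying them.
    total = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = (max_norm / (total + 1e-6)).clamp(max=1.0)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= beta * scale * g
```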

Appendix B Additional Empirical Results

B.1 The Importance of Attention

In this section, we include additional empirical results for the single-label and multi-label miniRCV1 experiments in Table 5 and Table 6 to show the importance of attention, wherein “Meta” denotes the type of meta learner, “Base” denotes the type of base learner, “random” denotes models trained from random initialization, “pretrained” denotes models initialized from a model pretrained on the meta-training set, “(A)” denotes models trained with attention and the bold numbers highlight the best performing ones at the 95% confidence interval.

The empirical results suggest that attention provides performance improvements regardless of which meta learner or base learner is used. Given the same meta-learning algorithm, adding attention to the base learner always improves model performance.

Method 5-way Accuracy 10-way Accuracy
Meta Base 1-shot 5-shot 1-shot 5-shot
random TCN 26.70% 55.43% 17.64% 41.81%
random TCN (A) 41.52% 65.64% 28.32% 45.12%
pretrained TCN 22.38% 37.17% 10.67% 27.76%
pretrained TCN (A) 24.06% 57.08% 18.60% 45.85%
MAML TCN 33.86% 61.44% 22.55% 41.94%
MAML TCN (A) 47.09% 72.65% 31.57% 62.75%
ATAML TCN (A) 54.05% 72.79% 39.48% 61.74%
Table 5: miniRCV1 single-label classification accuracies
Method 5-way Micro-F1 10-way Micro-F1 5-way Macro-F1 10-way Macro-F1
Meta Base 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
random TCN 18.7% 40.6% 30.2% 40.9% 11.3% 36.4% 9.9% 23.6%
random TCN (A) 38.9% 60.9% 40.6% 45.6% 31.4% 55.7% 22.8% 33.1%
pretrained TCN 25.1% 36.2% 28.2% 35.2% 17.0% 30.1% 9.1% 20.7%
pretrained TCN (A) 26.9% 55.8% 33.5% 52.1% 17.0% 51.5% 14.9% 41.4%
MAML TCN 35.7% 45.6% 20.5% 40.2% 22.9% 41.9% 7.6% 27.7%
MAML TCN (A) 52.3% 69.1% 44.9% 58.6% 43.2% 64.3% 27.7% 48.4%
ATAML TCN (A) 59.6% 71.1% 50.7% 61.3% 54.3% 65.0% 38.5% 49.2%
Table 6: miniRCV1 multi-label classification

B.2 The Impact of Base Learner

Table 7 shows the empirical comparison between bidirectional LSTM and TCN when ATAML is used as the meta learner. The results suggest that TCN performs better than bidirectional LSTM across all experiments on miniReuters-21578.

Method 5-way Micro-F1 10-way Micro-F1 5-way Macro-F1 10-way Macro-F1
Meta Base 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
ATAML LSTM (A) 38.0% 62.3% 27.1% 33.7% 30.3% 50.2% 18.8% 21.2%
ATAML TCN (A) 59.8% 71.1% 50.7% 61.3% 54.3% 65.0% 38.5% 49.2%
Table 7: Comparing bidirectional LSTM and TCN as base learners on miniReuters-21578

B.3 Other Baseline Methods

Table 8 shows the comparison between the proposed ATAML and classic machine learning methods, i.e., SVM, naive Bayes multinomial and KNN, which use tf-idf features as model inputs. The results suggest that SVM and naive Bayes multinomial severely overfit the training data and generalize poorly at evaluation time. The K-nearest neighbor classifier performs better than SVM and naive Bayes multinomial, mainly because it is a nonparametric, distance-based algorithm. The proposed ATAML is significantly better than KNN on the Micro-F1 measure, and ATAML performs at least as well as KNN on the Macro-F1 measure.

Method 5-way Micro-F1 10-way Micro-F1 5-way Macro-F1 10-way Macro-F1
1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
SVM 3.8% 35.8% 0.3% 18.8% 3.3% 25.1% 0.2% 12.6%
Naive Bayes Multinomial 0.5% 7.7% 0.0% 0.0% 0.2% 3.4% 0.0% 0.0%
KNN 46.7% 54.4% 39.4% 57.3% 43.8% 37.3% 37.4% 52.5%
ATAML, TCN (A) 59.8% 71.1% 50.7% 61.3% 54.3% 65.0% 38.5% 49.2%
Table 8: Comparing ATAML with SVM, Naive Bayes Multinomial and KNN on miniReuters-21578

Table 9 summarizes the comparison between the proposed ATAML and document embedding approaches, i.e., doc2vec Levine and Haus [1985] and doc2vecC Chen [2017]. In contrast to ATAML, which uses attention to aggregate information from substructures of the text input, the document embedding approaches directly encode each document into one embedding vector, and another classifier, such as KNN or SVM, is applied to the document embeddings for classification.

The empirical results suggest the document embedding approaches are not as effective as the proposed ATAML method. This finding confirms the need to apply attention to substructures of text data, rather than treating each document as a static embedding vector.

Method 5-way Micro-F1 10-way Micro-F1 5-way Macro-F1 10-way Macro-F1
1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
Doc2Vec, KNN 31.4% 42.0% 19.4% 32.9% 18.5% 28.9% 10.1% 22.5%
Doc2Vec, SVM 27.4% 59.1% 11.4% 44.3% 19.9% 44.6% 8.5% 31.0%
Doc2VecC, KNN 42.8% 62.6% 30.2% 50.0% 34.9% 53.2% 23.9% 42.2%
Doc2VecC, SVM 33.7% 58.4% 18.6% 42.7% 25.8% 46.0% 12.5% 30.3%
ATAML, TCN (A) 59.6% 71.1% 50.7% 61.3% 54.3% 65.0% 38.5% 49.2%
Table 9: Comparing ATAML with document embeddings methods on miniReuters-21578

Appendix C Visualizing Attention

Figure 3 and Figure 4 illustrate the same training example after the meta-learner is trained with MAML and ATAML, respectively. The target label for this document is “INTERNATIONAL RELATIONS”, and both models make correct predictions for this training example. Whereas the MAML model illustrated in Figure 3 over-fits to the keyword “president”, the proposed ATAML model in Figure 4 identifies multiple key phrases, such as “talk with”, “agreed upon” and “negotiation with”, that are important for the classification of “INTERNATIONAL RELATIONS”. Learning meaningful phrase-level representations regularizes a model against over-fitting to spurious correlations in the training examples.

Figure 3: Visualizing attentions learned by MAML TCN(A).
Figure 4: Visualizing attentions learned by ATAML TCN(A).