On Training Instance Selection for Few-Shot Neural Text Generation

07/07/2021 ∙ Ernie Chang et al. ∙ Universität Saarland

Large-scale pretrained language models have led to dramatic improvements in text generation. Impressive performance can be achieved by finetuning only on a small number of instances (few-shot setting). Nonetheless, almost all previous work simply applies random sampling to select the few-shot training instances. Little to no attention has been paid to the selection strategies and how they would affect model performance. In this work, we present a study on training instance selection in few-shot neural text generation. The selection decision is made based only on the unlabeled data so as to identify the most worthwhile data points that should be annotated under some budget of labeling cost. Based on the intuition that the few-shot training instances should be diverse and representative of the entire data distribution, we propose a simple selection strategy with K-means clustering. We show that even with this naive clustering-based approach, the generation models consistently outperform random sampling on three text generation tasks: data-to-text generation, document summarization and question generation. We hope that this work will draw more attention to this largely unexplored area.


1 Introduction

Few-shot text generation is an important research topic since obtaining large-scale training data for each individual downstream task is prohibitively expensive. Recently, pretraining large neural networks with a language modeling objective has led to significant improvements across different few-shot text generation tasks radford2019language; lewis-etal-2020-bart, and many techniques have been proposed on top of them chen2019few; schick2020few; zhang2020pegasus; kale2020text; chang2020dart; chang2021neural; li2021prefix; chang2021jointly; chang2021does; su2020moviechats. However, all previous work simulates the few-shot scenario by randomly sampling a subset from the full training data. Little to no attention has been paid to the selection strategies.

Figure 1: Training scenario: U represents unlabeled data and L indicates labeled instances. The annotation budget only allows selecting K data points for which the reference text will be annotated.

In this work, we present a preliminary study on searching for an optimal strategy to select the few-shot training instances. Studying the selection strategy is motivated by two rationales. First, random sampling leads to a large variance in model performance zhang2020pegasus; schick2020few; schick2020s. Yet current works sample their own training data, which makes it difficult to compare across different models: one cannot be sure whether an improvement should really be ascribed to the model or to the randomness of sampling. Using a stable selection strategy to find the most informative few-shot instances can provide a fair platform and better benchmark different few-shot generative models. Second, in practical applications, e.g. document summarization, the training data is usually obtained by manually annotating the summaries for some selected documents. In Figure 1, we illustrate the typical training scenario for text generation, where the annotation budget only allows annotating a limited amount of data. Studying the optimal selection strategy can help make the most of the annotation budget. Specifically, we focus on the label-free setting where the selection can only condition on the unannotated data. Although leveraging the reference text might benefit the selection strategy, it conflicts with the realistic setting where we must first select the data and only then obtain its annotated reference text.

The selection task resembles the theme of active learning balcan2007margin, where the model keeps identifying the most informative instances to be labeled. Existing active learning approaches can be roughly divided into uncertainty-based sampling and representative sampling settles2009active. Uncertainty-based sampling selects samples that maximally reduce the uncertainty of the model tur2005combining. This, however, requires a well-trained model with decent confidence estimates in order to perform well. Therefore, in this paper, we opt for representative sampling, where the selected training instances are expected to be dissimilar to each other and representative enough to cover all important patterns in the whole data distribution agarwal2005geometric; wei2015submodularity. This naturally matches the objective of K-means clustering, which minimizes the within-cluster variance while maximizing the between-cluster variance, encouraging the diversity and representativeness of each cluster krishna1999genetic; kanungo2002efficient. As has been shown in image classification tasks, data points closer to the cluster centroids are usually the most important, while faraway points can even be safely removed without hurting model performance kaushal2018learning; birodkar2019semantic. Inspired by this, we propose a simple selection strategy that first clusters the whole unlabeled dataset with the K-means algorithm and then, from each cluster, selects the data point that is closest to the cluster centroid.

We conduct experiments on three popular text generation tasks: data-to-text, document summarization and question generation. The proposed selection strategy consistently outperforms random sampling and exhibits much smaller variance.

Contribution.

We present a preliminary study on training instance selection for few-shot text generation and propose a selection strategy based on K-means clustering. The proposed method shows consistently superior performance over random sampling and can be used to make the most of the annotation budget in practical applications. Meanwhile, the selected training instances can serve as a better benchmark for few-shot text generation, since they are not biased towards specific generative methods and do not suffer from the large variance found with random sampling. We further perform a set of ablation studies to analyze what contributes to a good selection. Our findings can also benefit research in active learning konyushkova2017learning, since identifying the most informative training instances is a critical step before collecting more annotations through active learning.

2 Problem Formulation

Following the training scenario shown in Figure 1, we denote the unlabeled data as U = {x_1, x_2, ..., x_n}, where n is the data size. Depending on the downstream task, “data” can mean unlabeled structured data, documents, or paragraphs in the context of data-to-text, document summarization, and question generation, respectively. We will select K instances from the whole unlabeled dataset, annotate them with reference text, and then train a neural generative model on the annotated data. K is defined based on the annotation budget. In this work, since we focus on the few-shot scenario, K is set to be small (at most 100 in our experiments). The goal is to find the K most representative instances that lead to optimal performance when the model is trained on them.

3 Selection by K-means Clustering

The general idea of our proposed method is to first split the whole unlabeled data into K clusters, then select one data point from each cluster. Specifically, we first map each data point into a vector, then cluster the vectors with the K-means algorithm. The objective is the sum of squared errors (SSE), also called the cluster inertia:

\mathrm{SSE} = \sum_{i=1}^{n} \sum_{j=1}^{K} w_{ij} \, \lVert e_i - \mu_j \rVert^2        (1)

where \mu_j is the centroid of the j-th cluster, e_i is the embedding vector of x_i, and w_{ij} = 1 if x_i belongs to cluster j and 0 otherwise. We optimize the objective function with the EM algorithm dempster1977maximum, which iteratively assigns each data point to its closest cluster centroid. The initial centroid points are chosen based on the K-means++ algorithm arthur2007k. The first cluster center is chosen uniformly at random from the data points, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point’s closest existing cluster center. By this means, we maximize the chance of spreading out the initial cluster centers. We use 10 random seeds for selecting initial centers, and the clustering with the minimum SSE is chosen.
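
The clustering step can be sketched as follows, assuming scikit-learn as the K-means implementation (the paper does not name a library); with K-means++ initialization and n_init=10, scikit-learn runs ten differently seeded initializations and keeps the run with the lowest SSE, mirroring the procedure above. Function and variable names are illustrative, not taken from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_unlabeled(embeddings: np.ndarray, k: int) -> KMeans:
        """Cluster the embeddings of the unlabeled data into k clusters."""
        km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
        km.fit(embeddings)
        # km.inertia_ is the SSE of Eq. (1); km.cluster_centers_ holds the centroids.
        return km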

              E2E                                     CNNDM                                   SQUAD
              K=10        K=50        K=100           K=10        K=50        K=100           K=10        K=50        K=100
Random        4.38±7.12   11.57±4.29  26.22±2.58      13.51±6.47  24.81±3.77  35.24±2.89      1.23±6.22   3.33±5.89   7.65±3.61
IC-Random     2.15±4.58   9.80±2.62   24.71±2.71      12.30±3.89  24.71±2.45  33.29±1.92      1.34±3.23   1.79±3.77   6.97±2.55
K-means       6.22±2.33   11.89±1.39  27.13±2.22      14.28±2.35  25.19±3.28  36.31±1.08      1.56±2.34   4.77±3.61   9.33±2.15
Table 1: Comparison of random sampling, within-cluster random sampling (IC-Random), and K-means selection on the E2E, CNNDM, and SQUAD corpora (BLEU-4, reported as mean±deviation over 10 trials).
Figure 2: Ablation studies on the SQUAD corpus. Performance in BLEU-4 with increasing K for different variants of K-means selection, where the selected point within each cluster is the closest point, a random point, or the farthest point from the centroid.

After splitting the data into K clusters, we pick from each cluster the data point that is closest to the cluster center. We use the Euclidean distance for this selection, the same metric used for K-means clustering. The intuition is that test performance usually depends on the nearest neighbor in the training set khandelwal2019generalization; rajani2020explaining. Ideally, the data points closest to the cluster centers are the most representative samples; selecting them maximizes the chance that a similar sample will be found in the training set.
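
The in-cluster selection can then be sketched as below, reusing the fitted centroids from the clustering sketch above; pairwise_distances_argmin returns, for each centroid, the index of its nearest embedding under the Euclidean distance. Again, names are illustrative rather than the authors' code.

    import numpy as np
    from sklearn.metrics import pairwise_distances_argmin

    def closest_to_centroids(centers: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
        """Return one index per cluster: the data point nearest to each centroid."""
        return pairwise_distances_argmin(centers, embeddings)  # shape (k,)

    # Usage with the `km` object from the clustering sketch:
    #   idx = closest_to_centroids(km.cluster_centers_, embeddings)
    #   texts_to_annotate = [unlabeled_texts[i] for i in idx]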

4 Experiments

We perform our experiments on the following three representative datasets which cover three different text generation tasks:

  1. Data-to-text: We use the dataset for the E2E challenge novikova2017e2e, which contains 50,602 data-text pairs with 8 unique slots in the restaurant domain.

  2. Document Summarization: We use the CNN/Dailymail dataset (non-anonymized version) hermann2015teaching which contains 312,084 document-summary pairs.

  3. Question generation: We use the SQuAD dataset rajpurkar2016squad with over 100k questions. Following du2017learning, we focus on the answer-independent scenario and directly generate questions from passages.

Embedding     E2E              CNNDM            SQUAD
              Mean     Sum     Mean     Sum     Mean     Sum
BART          26.28    25.59   34.30    34.46   8.89     8.56
BART-FT       26.46    26.32   36.31    34.18   9.55     8.12
GloVe         25.18    23.36   33.59    31.45   7.99     7.56
FastText      27.13    24.85   33.23    34.30   9.33     9.42
Table 2: Finetuned BART generation performance on E2E, CNNDM, and SQUAD for various embedding options used in the K-means selection with K=100.

For all experiments, we finetune the open-sourced BART model lewis-etal-2020-bart as our generative model. BART is pretrained with a denoising autoencoder objective on a large amount of text data and has achieved state-of-the-art results on many text generation tasks. To extract the vectors used for clustering, we finetune the BART model with its original self-supervised objective on the unlabeled data, then apply mean pooling over the last hidden states of the encoder.
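
As an illustration, the vector extraction could look roughly like the sketch below, assuming the HuggingFace transformers implementation of BART; the checkpoint name is a placeholder and the task-specific denoising finetuning step is omitted for brevity.

    import torch
    from transformers import BartTokenizer, BartModel

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartModel.from_pretrained("facebook/bart-base").eval()

    @torch.no_grad()
    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        # Run only the encoder; the decoder is not needed for clustering vectors.
        hidden = model.get_encoder()(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        ).last_hidden_state                                  # (batch, seq_len, dim)
        mask = batch["attention_mask"].unsqueeze(-1)         # mask out padding tokens
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling -> (batch, dim)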

In the following sections, we first compare the model performance of our proposed selection strategy with that of random sampling, then analyze their variance. Finally, we perform ablation studies on the effects of in-cluster selection and embedding choices.

Comparison of Selection Strategies.

In Table 1, we compare the model performance under different selection strategies. Apart from random sampling and our proposed method, we also compare with a lower bound where all instances are randomly sampled from a single cluster (within-cluster random, IC-Random). This comparison is added to illustrate the importance of selecting diverse samples across different clusters. The performance scores are averaged over 10 trials for each selection strategy. As can be seen, the K-means based selection consistently outperforms the others. Within-cluster random sampling performs the worst, underscoring the importance of having diverse samples in the training set. However, it is worth noting that although random sampling underperforms K-means selection on average, its upper bound is much higher, suggesting that the proposed K-means selection is by no means optimal; there is still much room for improvement.

Variance of Model Performance.

Table 1 also shows the variance of model performance under different selection strategies. The variance is computed over 10 runs. For within-cluster random sampling, the variance comes from both the choice of the cluster and the in-cluster sampling. For K-means selection, the variance comes from the choice of initial center points. We can see that random sampling and within-cluster random sampling have a very large variance, up to 7.12 BLEU for K=10. This further suggests that comparing few-shot models based on random sampling is prone to variability and prevents drawing reliable conclusions. K-means based selection, on the contrary, is rather robust to random seeds. Therefore, for future work on few-shot text generation, we suggest that models be tested on instances selected by our proposed strategy for a fair comparison.

Effects of In-cluster Selection.

In Figure 2, we show the effects of the in-cluster selection method. In our proposed method, within each cluster we select the data point that is closest to the cluster center. To see whether selecting the closest point matters, we compare with two variants in which, within each cluster, we select (1) a data point randomly sampled from the cluster, or (2) the data point that is farthest from the cluster center. We observe that the choice of in-cluster selection has a big impact on model performance. Choosing the data points farthest from the cluster centers leads to the worst performance. This is consistent with previous findings kaushal2018learning; birodkar2019semantic that data points farthest from cluster centers are usually outliers and less representative; selecting them might mislead the model into capturing non-generic patterns and thereby generalizing poorly. In contrast, choosing data points closest to cluster centers performs slightly better than random selection. However, random selection has a much larger variance than closest/farthest point selection (shown as the shaded region in Figure 2).

Effects of Embedding Methods.

As the K-means clustering is performed on top of the embedding vectors of the unlabeled data, the choice of embedding method can affect which data points are selected and thus the downstream performance. In Table 2, we show the effects of different embedding methods. Apart from the finetuned BART, we compare with embeddings extracted from (1) BART without finetuning on the task-specific data, (2) GloVe pennington2014glove, and (3) FastText bojanowski2017enriching, both finetuned on the task-specific data. For each embedding method, we compare mean pooling and sum pooling for extracting the final vector representation. The results show that finetuned BART generally outperforms the other embedding choices. We attribute this to the fact that the selection is performed in an embedding space similar to that of the BART generation model. Moreover, FastText offers a strong baseline, as it performs relatively well on E2E and SQUAD. Further, we observe that mean pooling is generally better than summing word vectors, which was also observed in chen2018enhancing.
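
For reference, the word-vector baselines with mean versus sum pooling could be sketched as follows, assuming gensim for training FastText on the task text; this is a simplified stand-in for the finetuning setup described above, and all names are illustrative.

    import numpy as np
    from gensim.models import FastText

    def fasttext_embeddings(tokenized_texts, pooling="mean"):
        """Train FastText on the task text and pool token vectors per instance."""
        ft = FastText(sentences=tokenized_texts, vector_size=300, min_count=1, epochs=10)
        pooled = []
        for tokens in tokenized_texts:
            token_vecs = np.stack([ft.wv[t] for t in tokens])
            pooled.append(token_vecs.mean(axis=0) if pooling == "mean"
                          else token_vecs.sum(axis=0))
        return np.stack(pooled)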

Human Evaluation.

To obtain further insights into the generation outputs, five annotators were instructed to evaluate samples for each of the three tasks, judging (1) whether the text is fluent (score 0-5, with 5 being fully fluent) and (2) whether it contains relevant information about its input source (adequacy). These scores are averaged and presented in Table 3. For random selection, we sampled 10 outputs from each of the 10 trials to obtain 100 samples, and the same holds for IC-Random. We observe that the K-means algorithm selects better subsets of the training samples, which allow for better generalization to unseen inputs; in particular, the outputs are generally more adequate. However, the fluency of the outputs remains relatively similar across selection strategies.

              E2E           CNNDM         SQUAD
Random        4.08 / 4.15   4.55 / 3.27   4.62 / 3.84
IC-Random     4.32 / 3.54   3.62 / 3.01   4.23 / 2.74
K-means       4.12 / 4.24   4.32 / 3.66   4.51 / 3.98
Table 3: Human evaluation of finetuned BART outputs (100 samples per strategy) on E2E, CNNDM, and SQUAD. Scores are presented as fluency / adequacy.

5 Conclusion

In this work, we target the unexplored problem of training instance selection for few-shot text generation. We show that random sampling can lead to large variance and suboptimal performance. To address this problem, we propose a selection strategy based on K-means clustering and demonstrate that it consistently outperforms random sampling and has much lower variance. We further perform a set of ablation studies to analyze the effects of data size, embedding, and selection methods, showing that there is still much room for improvement. Future work can consider other clustering methods.

Acknowledgements

This research was funded in part by the German Research Foundation (DFG) as part of SFB 248 “Foundations of Perspicuous Software Systems”. We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.

References