1 Introduction
Few-shot text generation is an important research topic since obtaining large-scale training data for each individual downstream task is prohibitively expensive. Recently, pre-training large neural networks with a language modeling objective has led to significant improvements across different few-shot text generation tasks
radford2019language; lewisetal2020bart, and many techniques have been proposed based on them chen2019few; schick2020few; zhang2020pegasus; kale2020text; chang2020dart; chang2021neural; li2021prefix; chang2021jointly; chang2021does; su2020moviechats. However, all previous works simulate the few-shot scenario by randomly sampling a subset from the full training data; little to no attention has been paid to the selection strategy. In this work, we present a preliminary study on searching for an optimal strategy to select the few-shot training instances. Studying the selection strategy is motivated by two rationales. First, random sampling leads to a large variance in model performance
zhang2020pegasus; schick2020few; schick2020s. Yet current works sample their own training data, which makes it difficult to compare across different models: one cannot be sure whether an improved performance should really be ascribed to the model or to the randomness of sampling. Using a stable selection strategy to find the most informative few-shot instances can provide a fair platform and better benchmark different few-shot generative models. Second, in practical applications, e.g. document summarization, the training data is usually obtained by manually annotating the summaries for some selected documents. In Figure 1, we illustrate the typical training scenario for text generation, where the annotation budget only allows annotating a limited amount of data. Studying the optimal selection strategy can help make the most of our annotation budget. Specifically, we focus on the label-free setting where the selection can only condition on the unannotated data. Although leveraging the reference text might benefit the selection strategy, it conflicts with the realistic setting where we must first select the data and only then obtain its annotated reference text. The selection task resembles the theme of active learning
balcan2007margin, where the model keeps identifying the most informative instances to get labeled. Existing active learning approaches can be roughly divided into uncertainty-based sampling and representative sampling settles2009active. Uncertainty-based sampling selects samples that maximally reduce the uncertainty of the model tur2005combining. This, however, requires a well-trained model with decent confidence-score estimates in order to perform well. Therefore, in this paper, we opt for representative sampling, where the selected training instances are expected to be dissimilar to each other yet representative enough to cover all important patterns in the whole data distribution
agarwal2005geometric; wei2015submodularity. This naturally matches the objective of K-means clustering, which minimizes the within-cluster variance while maximizing the between-cluster variance to encourage the diversity and representativeness of each cluster krishna1999genetic; kanungo2002efficient. As has been shown in image classification tasks, data points closer to the cluster centroids are usually the most important, while other faraway points can even be safely removed without hurting model performance kaushal2018learning; birodkar2019semantic. Inspired by this, we propose a simple selection strategy which first clusters the whole unlabeled dataset with the K-means algorithm, and then, from each cluster, selects the data point that is closest to the cluster centroid. We conduct experiments on three popular text generation tasks: data-to-text, document summarization and question generation. The proposed selection strategy consistently outperforms random sampling and exhibits a much smaller variance.
Contribution.
We present a preliminary study on training instance selection for few-shot text generation and propose a selection strategy based on K-means clustering. The proposed method shows consistently superior performance over random sampling, which can be used to make the most of the annotation budget in practical applications. Meanwhile, the selected training instances can serve as a better benchmark for few-shot text generation since they are not biased towards specific generative methods and do not suffer from the large variance found in random sampling. We further perform a set of ablation studies to analyze what contributes to a good selection. Our findings can also benefit research in active learning konyushkova2017learning, since identifying the most informative training instances is a critical step before collecting more annotations through active learning.
2 Problem Formulation
Following the training scenario shown in Figure 1, we denote the unlabeled data as X = {x_1, ..., x_n}, where n is the data size. Depending on the downstream task, "data" can mean unlabeled structured data, documents or paragraphs in the context of data-to-text, document summarization and question generation, respectively. We will select K instances from the whole unlabeled dataset, annotate them with reference text, and then train a neural generative model on the annotated data. K is defined based on the annotation budget. In this work, since we focus on the few-shot scenario, K is set to be small (at most 100 in our experiments). The goal is to find the K most representative instances that lead to the optimal performance when the model is trained on them.
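The scenario above can be sketched as a small budgeted pipeline (a minimal illustration; the function names and signatures are ours, not from the paper or any library):

```python
from typing import Callable, List, Tuple

def fewshot_pipeline(
    unlabeled: List[str],
    select: Callable[[List[str], int], List[int]],
    annotate: Callable[[str], str],
    train: Callable[[List[Tuple[str, str]]], object],
    budget: int,
):
    """Select `budget` instances from the unlabeled pool, annotate only
    those, and train a generative model on the resulting pairs."""
    chosen = select(unlabeled, budget)  # label-free: sees only the inputs
    labeled = [(unlabeled[i], annotate(unlabeled[i])) for i in chosen]
    return train(labeled)
```

Note that `select` never sees reference text, matching the label-free setting described above.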
3 Selection by K-means Clustering
The general idea of our proposed method is to first split the whole unlabeled data into K clusters, then select one data point from each cluster. Specifically, we first map each data point into a vector, then cluster the vectors with the K-means algorithm. The objective is the sum of squared errors (SSE), also called the cluster inertia:
\mathrm{SSE} = \sum_{i=1}^{n} \sum_{j=1}^{K} w_{ij} \, \lVert x_i - \mu_j \rVert^2    (1)

where \mu_j is the centroid of the j-th cluster, x_i is the embedding vector of the i-th data point, and w_{ij} = 1 if x_i belongs to cluster j and 0 otherwise. We optimize the objective function with the EM algorithm dempster1977maximum, which iteratively assigns each data point to its closest cluster centroid. The initial centroids are chosen with the K-means++ algorithm arthur2007k
. The first cluster center is chosen uniformly at random from the data points, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance to its closest existing cluster center. In this way, we maximize the chance of spreading out the
initial cluster centers. We use 10 random seeds for selecting the initial centers, and the clustering with the minimum SSE is chosen.

Table 1: Performance under different selection strategies (mean±variance over 10 runs) for K = 10, 50, 100.

            E2E                                 CNN/DM                              SQuAD
            K=10        K=50        K=100       K=10        K=50        K=100       K=10        K=50        K=100
Random      4.38±7.12   11.57±4.29  26.22±2.58  13.51±6.47  24.81±3.77  35.24±2.89  1.23±6.22   3.33±5.89   7.65±3.61
IC-Random   2.15±4.58   9.80±2.62   24.71±2.71  12.30±3.89  24.71±2.45  33.29±1.92  1.34±3.23   1.79±3.77   6.97±2.55
K-means     6.22±2.33   11.89±1.39  27.13±2.22  14.28±2.35  25.19±3.28  36.31±1.08  1.56±2.34   4.77±3.61   9.33±2.15
After splitting the data into K clusters, we pick from each cluster the data point that is closest to the cluster center, measured with the Euclidean distance, the same metric used for K-means clustering. The intuition is that test performance usually depends on the nearest neighbor in the training set khandelwal2019generalization; rajani2020explaining. Ideally, the data points closest to the cluster centers are the most representative samples, so selecting them maximizes the chance that a similar sample can be found in the training dataset.
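The whole selection procedure can be sketched with scikit-learn (a minimal illustration of the method described above, not the authors' released code; `n_init=10` mirrors the practice of trying 10 random seeds and keeping the clustering with minimum SSE):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_kmeans(embeddings: np.ndarray, k: int, seed: int = 0) -> list:
    """Cluster the unlabeled embeddings into k groups and return, for each
    cluster, the index of the point closest (Euclidean) to its centroid."""
    # K-means++ initialization; of 10 random restarts, the run with the
    # lowest inertia (SSE, Eq. 1) is kept automatically by scikit-learn.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    labels = km.fit_predict(embeddings)
    selected = []
    for j in range(k):
        members = np.where(labels == j)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[j], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return selected
```

The returned indices identify the K instances to send for annotation; each comes from a different cluster, so the selection is diverse by construction.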
4 Experiments
We perform our experiments on the following three representative datasets which cover three different text generation tasks:

Data-to-text: We use the dataset of the E2E challenge novikova2017e2e, which contains 50,602 data-text pairs with 8 unique slots in the restaurant domain.

Document summarization: We use the CNN/DailyMail dataset (non-anonymized version) hermann2015teaching, which contains 312,084 document-summary pairs.

Question generation: We use the SQuAD dataset rajpurkar2016squad with over 100k questions. Following du2017learning, we focus on the answer-independent scenario and directly generate questions from passages.
Table 2: Effects of embedding methods with mean vs. sum pooling.

Embedding   E2E             CNN/DM          SQuAD
            Mean    Sum     Mean    Sum     Mean    Sum
BART        26.28   25.59   34.30   34.46   8.89    8.56
BART-FT     26.46   26.32   36.31   34.18   9.55    8.12
GloVe       25.18   23.36   33.59   31.45   7.99    7.56
FastText    27.13   24.85   33.23   34.30   9.33    9.42
For all experiments, we fine-tune the open-sourced BART model lewisetal2020bart as our generative model. BART is pre-trained with a denoising autoencoder objective on a large amount of text data and has achieved state-of-the-art results on many text generation tasks. To extract the vectors used for clustering, we fine-tune the BART model with its original self-supervised objective on the unlabeled data, then apply mean pooling over the last hidden states of the encoder.
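The mean-pooling step can be sketched as follows (a NumPy illustration of masked mean pooling; the commented transformers calls assume a standard BART checkpoint and are not the paper's exact code):

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling over the sequence dimension.

    hidden: (batch, seq_len, dim) last hidden states of the encoder
    mask:   (batch, seq_len) attention mask, 1 for real tokens, 0 for padding
    """
    m = mask[..., None].astype(hidden.dtype)     # (batch, seq_len, 1)
    summed = (hidden * m).sum(axis=1)            # (batch, dim)
    counts = np.clip(m.sum(axis=1), 1e-9, None)  # (batch, 1); avoid div by 0
    return summed / counts

# With Hugging Face transformers (checkpoint name is an assumption), roughly:
#   enc = model.get_encoder()(input_ids=ids, attention_mask=am)
#   vectors = mean_pool(enc.last_hidden_state.numpy(), am.numpy())
```

Masking matters here: without it, padding tokens would dilute the document representation and distort the clustering.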
In the following sections, we first compare model performance under our proposed selection strategy and random sampling, then analyze their variance. Finally, we perform an ablation study on the effects of in-cluster selection and embedding choices.
Comparison of Selection Strategies.
In Table 1, we compare the model performance under different selection strategies. Apart from random sampling and our proposed method, we also compare with a lower bound where all instances are randomly sampled from a single cluster (in-cluster random, IC-Random). This comparison is intended to illustrate the importance of selecting diverse samples across different clusters. The performance scores are averaged over 10 trials for each selection strategy. As can be seen, the K-means-based selection consistently outperforms the others. In-cluster random sampling performs the worst, confirming the importance of having diverse samples among the training instances. However, it is worth noting that although random sampling underperforms K-means selection on average, its upper bound is much higher, suggesting the proposed K-means selection is by no means optimal; there is still much room for improvement.
Variance of Model Performance.
Table 1 also shows the variance of model performance under different selection strategies, computed over 10 runs. For in-cluster random sampling, the variance comes from both the choice of the cluster and the in-cluster sampling; for K-means selection, it comes from the choice of the initial center points. We can see that random sampling and in-cluster random sampling have a very large variance, up to 7.12 for K = 10. This further suggests that comparing few-shot models based on random sampling is prone to variability and prevents drawing reliable conclusions. K-means-based selection, on the contrary, is rather robust to random seeds. Therefore, for future work on few-shot text generation, we suggest that models be tested on instances selected by our proposed strategy for a fair comparison.
Effects of In-cluster Selection.
In Figure 2, we show the effects of the in-cluster selection method. In our proposed method, within each cluster we select the data point that is closest to the cluster center. To see whether selecting the closest point matters, we compare with two variants that, within each cluster, select (1) a data point sampled uniformly at random, or (2) the data point that is farthest from the cluster center. We observe that the choice of in-cluster selection does have a big impact on model performance. Choosing the data points farthest from the cluster centers leads to the worst performance. This is consistent with previous findings kaushal2018learning; birodkar2019semantic that data points farthest from cluster centers are usually outliers and less representative; selecting them might mislead the model into capturing non-generic patterns and thereby generalizing poorly. In contrast, choosing the data points closest to the cluster centers performs slightly better than random selection. However, random selection has a much larger variance than closest/farthest-point selection (shown as shadow in Figure 2).
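The three in-cluster variants can be sketched as follows (illustrative code, not the authors' implementation; the `mode` strings are our naming):

```python
import numpy as np

def pick_from_cluster(members, embeddings, centroid, mode, rng):
    """Pick one index from a cluster under the three in-cluster variants
    compared in Figure 2: 'closest', 'random', or 'farthest'."""
    if mode == "random":
        return int(rng.choice(members))
    dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
    pos = np.argmin(dists) if mode == "closest" else np.argmax(dists)
    return int(members[pos])
```

Only the 'closest' variant is used in the proposed method; the other two serve as ablation baselines.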
Effects of Embedding Methods.
As the K-means clustering is performed on top of the embedding vectors of the unlabeled data, the choice of embedding method can affect the performance on the selected points. In Table 2, we show the effects of different embedding methods. Apart from the fine-tuned BART (BART-FT), we compare with embeddings extracted from (1) BART without fine-tuning on the task-specific data, (2) GloVe pennington2014glove and (3) FastText bojanowski2017enriching, the latter two fine-tuned on the task-specific data. For each embedding method, we compare mean pooling and sum pooling for extracting the final vector representation. The results show that fine-tuned BART generally outperforms the other embedding choices. We attribute this to the shared embedding space between the BART-based selection and the BART generation model. Moreover, FastText offers a strong baseline, achieving the best score in one pooling configuration on each of E2E and SQuAD. Further, we observe that mean pooling is generally better than summing the word vectors, as also observed in chen2018enhancing.
Human Evaluation.
To obtain further insights into the generation outputs, five annotators were instructed to evaluate 100 samples for each of the three tasks, judging (1) whether the text is fluent (score 0-5, with 5 being fully fluent) and (2) whether it contains relevant information about its input source (adequacy). The scores are averaged and presented in Table 3. For random selection, we sampled 10 outputs from each of the 10 trials to obtain 100 samples, and the same goes for IC-Random. We observe that the K-means algorithm selects better subsets of the training samples that allow for better generalization to unseen input sources; in particular, the outputs are generally more adequate. The fluency of the outputs, however, remains relatively similar across strategies.
Table 3: Human evaluation results (fluency/adequacy).

            E2E         CNN/DM      SQuAD
Random      4.08/4.15   4.55/3.27   4.62/3.84
IC-Random   4.32/3.54   3.62/3.01   4.23/2.74
K-means     4.12/4.24   4.32/3.66   4.51/3.98
5 Conclusion
In this work, we target the unexplored problem of training instance selection for few-shot text generation. We show that random sampling can lead to a large variance and suboptimal performance. To address this problem, we propose a selection strategy based on K-means clustering and demonstrate that it consistently outperforms random sampling with a much lower variance. We further perform a set of ablation studies to analyze the effects of data size, embeddings and selection methods, showing that there is still much room for improvement. Future work can consider other clustering methods.
Acknowledgements
This research was funded in part by the German Research Foundation (DFG) as part of SFB 248 “Foundations of Perspicuous Software Systems”. We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.