C2C-GenDA: Cluster-to-Cluster Generation for Data Augmentation of Slot Filling

12/13/2020 ∙ by Yutai Hou, et al. ∙ Harbin Institute of Technology

Slot filling, a fundamental module of spoken language understanding, often suffers from insufficient quantity and diversity of training data. To remedy this, we propose a novel Cluster-to-Cluster generation framework for Data Augmentation (DA), named C2C-GenDA. It enlarges the training set by reconstructing existing utterances into alternative expressions while preserving their semantics. Unlike previous DA works that reconstruct utterances one by one independently, C2C-GenDA jointly encodes multiple existing utterances with the same semantics and simultaneously decodes multiple unseen expressions. Jointly generating multiple new utterances lets the model consider the relations between generated instances, which encourages diversity. In addition, encoding multiple existing utterances gives C2C a wider view of existing expressions, which helps reduce generations that duplicate existing data. Experiments on the ATIS and Snips datasets show that instances augmented by C2C-GenDA improve slot filling by 7.99 (11.9%) and 5.76 (13.6%) F-score points respectively when there are only hundreds of training utterances.





Slot filling is a fundamental module of Spoken Language Understanding (SLU) in task-oriented dialogue systems young2013pomdp. The “inputs” in Figure 1 show examples of slot filling, where key entities within user utterances are tagged with slot labels. Because manual annotation is costly and dialogue domains change rapidly, slot filling often suffers from a lack of quantity and diversity in its training data. This insufficiency makes it hard for slot-filling models to handle the myriad ways in which users express their demands.

Data augmentation (DA), which improves the diversity and quantity of training data with synthetic instances, offers an appealing solution to the data scarcity problem of SLU. Data augmentation has been applied successfully to a wide range of problems, including computer vision NIPS2012_4824, speech recognition DBLP:journals/corr/HannunCCCDEPSSCN14, text classification zhang2015character, and question answering fader-zettlemoyer-etzioni:2013:ACL2013.

Figure 1: Examples of sequence-to-sequence data augmentation and cluster-to-cluster data augmentation. 2 denotes novel utterance. -1 denotes duplication to existing utterance. 0 denotes duplication to other generated utterances.

For slot filling, state-of-the-art data augmentation works focus on generative methods shin2019vae. A typical idea is to generate new utterances by reconstructing existing utterances into alternative expressions while keeping the semantics. Previous works learn a Sequence-to-Sequence (Seq2Seq) model to reconstruct each existing utterance one by one yoo2020deep; hou2018coling; DBLP:conf/interspeech/KurataXZ16. However, these methods tend to generate duplicated utterances, because they can only consider the expression variance between one input-output pair at a time. For example, in Figure 1, each new utterance is only generated to be different from the corresponding input utterance, and thus often unintentionally duplicates other generated utterances (0) or other input utterances (-1). Such duplication hinders the effectiveness of data augmentation. We argue that these defects can be avoided by breaking away from the current one-by-one augmentation paradigm and considering the relations among many instances during generation.

In this paper, we propose a novel Cluster-to-Cluster Generation framework for Data Augmentation of slot filling, named C2C-GenDA. As shown in Figure 1, unlike previous works that augment each utterance one by one independently, we jointly generate multiple new instances by reconstructing a cluster of existing utterances with the same semantics. Such cluster-to-cluster generation allows the model to account for duplication between generated utterances and to be aware of more existing expressions in the original data. These advantages of C2C-GenDA remedy the aforementioned defects of Seq2Seq DA and help to improve generation diversity. To encourage diversity and quality of generation, we propose the Duplication-aware Attention and Diverse-Oriented Regularization mechanisms, both of which promote diverse decoding. To learn to generate diverse new utterances, we train the C2C-GenDA model with cluster-to-cluster ‘paraphrasing’ pairs, and introduce a Dispersed Cluster Pairing algorithm to extract these cluster pairs from existing data.

Experiments on the ATIS and Snips datasets show that the proposed method significantly improves the performance of slot-filling systems. Case studies and analysis of the augmented data also confirm that our method generates diverse utterances. Our contributions can be summarized as follows: (1) We propose a novel Cluster-to-Cluster generation framework for data augmentation of slot filling, which can remedy the duplication problem of existing one-by-one generation methods. (2) We propose the Duplication-aware Attention and Diverse-Oriented Regularization mechanisms to improve the diversity of the augmented utterances. (3) We introduce a Dispersed Cluster Pairing algorithm to extract cluster-to-cluster ‘paraphrasing’ pairs for training the data augmentation model.

Problem Description

In this paper, we study the data augmentation for slot filling task that maps utterances into semantic frames (slot type and slot value pairs). Slot filling is commonly treated as a sequence labeling problem, where slot type labels are assigned to contiguous sequences of words indicating these sequences are the corresponding slot values.

We define data augmentation (DA) for slot filling as exploiting existing training instances to generate new expressions for each semantic frame. Given a semantic frame and the existing utterances in the training data that correspond to it, DA generates a set of new utterances with unseen expressions. DA then constructs new training instances by associating the new utterances with the semantic frame. Finally, DA takes the union of all new instances as additional data to reinforce model training.

Proposed Framework

In this section, we present an overview of our data augmentation framework, and introduce the Cluster2Cluster generation model. Then, we discuss how to extract cluster-to-cluster paraphrasing data for generation model training.


Here, we introduce the overview of the proposed cluster-to-cluster data augmentation framework for slot filling. For each semantic frame, we use a Cluster2Cluster (C2C) model to generate new expressions from existing utterances. The input of our framework is a cluster of existing instances for a certain semantic frame, and the output is a cluster of generated new instances with unseen expressions.

Following hou2018coling, we perform delexicalized generation. Specifically, both the inputs and outputs of the C2C generation model are delexicalized utterances, in which slot value tokens are replaced by slot label tokens. For the example in Figure 1, C2C takes in “show me the distance pos” and reconstructs the expression as “please list all the pos distance ”. Delexicalization focuses the model on generating diverse expressions rather than slot values, and it reduces the vocabulary size. After generation, we recover the delexicalized utterances by filling the slots with context-suitable slot values. Delexicalization is important because it allows us to generate both the utterance and accurate slot annotations simultaneously.
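To make the delexicalize-then-refill step concrete, the following is a minimal Python sketch. It assumes BIO-style slot tags and a `<slot_name>` spelling for slot label tokens. Both are illustrative assumptions, and the function names `delexicalize` and `relexicalize` are hypothetical rather than taken from the authors' code.

```python
from typing import Dict, List, Tuple


def delexicalize(tokens: List[str], tags: List[str]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Replace every slot-value span with one slot-label token.

    Assumes BIO tags (B-x / I-x / O) and a "<x>" token format; both are
    illustrative assumptions, not the paper's confirmed conventions.
    """
    delex: List[str] = []
    values: Dict[str, List[str]] = {}
    for token, tag in zip(tokens, tags):
        if tag == "O":
            delex.append(token)
            continue
        slot = tag[2:]
        if tag.startswith("I-") and slot in values:
            # Continue the most recent value of this slot.
            values[slot][-1] += " " + token
        else:
            # A B- tag (or a stray I- tag) opens a new slot value.
            delex.append(f"<{slot}>")
            values.setdefault(slot, []).append(token)
    return delex, values


def relexicalize(delex: List[str], fillers: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Fill slot tokens with concrete values and emit matching BIO tags."""
    tokens: List[str] = []
    tags: List[str] = []
    for token in delex:
        if token.startswith("<") and token.endswith(">"):
            slot = token[1:-1]
            words = fillers[slot].split()
            tokens.extend(words)
            tags.extend([f"B-{slot}"] + [f"I-{slot}"] * (len(words) - 1))
        else:
            tokens.append(token)
            tags.append("O")
    return tokens, tags
```

Because the slot positions are explicit in the delexicalized utterance, refilling them yields the token-level annotation for free, which is the property the paragraph above relies on.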

To learn the ability of generating diverse and new expressions, we construct cluster-to-cluster paraphrasing pairs from original training data with the Dispersed Cluster Pairing algorithm, which simulates the data augmentation process of generating novel expressions from existing expressions for a specific semantic frame.

Figure 2: Cluster2Cluster generation model

Cluster2Cluster Generation Model

The Cluster2Cluster (C2C) model is the generation model at the core of our C2C-GenDA framework. It aims to reconstruct input utterances into alternative expressions while preserving their semantics. As shown in Figure 2, the C2C model first encodes the input cluster of utterances for a given semantic frame, then jointly decodes a new cluster of utterances with different expressions. The input cluster and the output cluster each have their own size.

To further encourage the diversity of the generated utterances, we propose two novel mechanisms: (1) Duplication-aware Attention that attends to the existing expressions to avoid duplicated generation for each decoding step. (2) Diverse-Oriented Regularization that guides the synchronized decoding of multiple utterances to improve the internal diversity of the generated cluster.

Cluster Encoder

We jointly encode multiple input utterances by concatenating them and representing the whole sequence with a multi-layer transformer Transformer. (We separate input utterances with special tokens SEP.)

The encoder takes the package of all input token embeddings as its input, and the outputs of its final layer serve as the final representations of the input tokens. Each layer combines layer normalization, a multi-head self-attention function that operates on vector packages of queries and values (the values are also used as keys), and a position-wise feed-forward network.
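The concatenation that feeds the cluster encoder can be sketched in a few lines of Python. The separator spelling `[SEP]` and the helper name `build_cluster_input` are assumptions for illustration; the text only states that a special SEP token separates the input utterances.

```python
from typing import List

SEP = "[SEP]"  # assumed spelling of the special separator token


def build_cluster_input(cluster: List[List[str]]) -> List[str]:
    """Concatenate the utterances of one input cluster into a single
    token sequence, placing SEP between consecutive utterances."""
    sequence: List[str] = []
    for index, utterance in enumerate(cluster):
        if index > 0:
            sequence.append(SEP)
        sequence.extend(utterance)
    return sequence
```

Encoding the cluster as one sequence lets self-attention relate tokens across utterances, which is how the encoder obtains its wider view of existing expressions.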

Cluster Decoder with Duplication-aware Attention

To reduce duplication and encourage diversity, we propose a cluster decoder with the Duplication-aware Attention (DAA) mechanism. It decodes each new utterance while being aware of the existing expressions in both the input cluster and the other generated utterances.

Intuitively, we decode the target utterance depending on both input cluster and other output utterances . We also incorporate the diversity rank token # hou2018coling as generation conditions to encourage diversity and distinguish different output utterances. Details of the diverse rank in C2C will be introduced in a later section. Then the C2C model is formalized as:

However, it is unrealistic to decode a target utterance conditioned on all the other complete target utterances, because we decode all target utterances jointly and the other target utterances are not yet finished. Therefore, we approximate the dependence between target utterances by conditioning the decoding on the already generated tokens of all target utterances. At each step, we simultaneously decode one token for every target utterance, and each of these tokens depends on all previously decoded tokens:

where is the number of decoding steps.

We calculate the decoding probability for each step of each target utterance from a hidden state that combines feature representations of the generation conditions described above.

Here, we obtain this hidden state with DAA, which contains two terms. The first term mainly records the information of which token should be generated. To achieve this, it encodes the previously decoded tokens of the current utterance and the semantic information from the input cluster. Since it encodes the existing expressions in the input cluster, it also helps reduce generations that duplicate existing expressions. For each target utterance, we compute the first term with a multi-layer transformer decoder, which takes the package of all decoded token embeddings together with the input cluster representation from the encoder and produces a package of hidden states for all decoding steps.

The second term mainly records the duplicated expressions that should not be generated: it encodes the expressions generated by the other target utterances.

Finally, the hidden state for decoding combines the two terms by subtraction, weighted by a balance factor. The subtraction makes the hidden state different for each target utterance and can implicitly penalize the decoding of commonly shared words.
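The subtraction in the two-term hidden state can be illustrated with a toy Python sketch. The exact form of the second term is not recoverable from this text, so the sketch substitutes a plain average of the other utterances' first-term states. That stand-in, the name `duplication_aware_states`, and the parameter `lam` for the balance factor are all assumptions made for illustration, not the authors' formulation.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def duplication_aware_states(first_terms: List[Vector], lam: float) -> List[Vector]:
    """Combine, for each target utterance at one decoding step, its own
    state with a penalty built from the other utterances' states.

    Assumption: the second term is the mean of the other utterances'
    first-term states; the paper's actual second term may differ.
    """
    combined: List[Vector] = []
    for i, own in enumerate(first_terms):
        others = [state for j, state in enumerate(first_terms) if j != i]
        if not others:
            combined.append(list(own))  # nothing to subtract for a single output
            continue
        second = mean_vector(others)
        combined.append([a - lam * b for a, b in zip(own, second)])
    return combined
```

In this toy setting, a dimension that is active in every utterance is reduced for all of them, while a dimension active in only one utterance keeps its full value there, which mirrors the intuition of penalizing commonly shared words.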

Figure 3: Construct training instance for our Cluster2Cluster model with Dispersed Cluster Pairing.

Model Training with Diverse-Oriented Regularization

We train the C2C model with a Diverse-Oriented Regularization (DOR) to encourage internal diversity within the generated utterance cluster.

To achieve this, we propose to enlarge the distance between the distributions of utterances in the output cluster. However, the distribution of an utterance is hard to estimate during the decoding process. Thus, we approximately enlarge the distance between two utterances’ distributions by encouraging the divergence of their token distributions: we train the model to enlarge the Kullback-Leibler (KL) divergence between the decoding distributions of different output utterances at each step. Formally, we define the distance between two output utterances in terms of the KL divergence between their token distributions at each decoding step, and we define the Diverse-Oriented Regularization of generation in terms of these distances.

Overall, we train the C2C model to minimize the generation loss combined with the Diverse-Oriented Regularization, weighted by a balancing factor.
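A small Python sketch may help fix ideas about the KL-based regularizer. It assumes that the distance between two output utterances is the sum of step-wise KL divergences, that the regularizer is the negative mean pairwise distance, and that the total loss adds it to a generation loss scaled by a factor `beta`. These choices, and all function names, are assumptions for illustration; the exact equations are not recoverable from this text.

```python
import math
from typing import List

Distribution = List[float]      # token probabilities at one decoding step
Utterance = List[Distribution]  # one distribution per decoding step


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def utterance_distance(a: Utterance, b: Utterance) -> float:
    """Assumed distance: sum of step-wise KL divergences over shared steps."""
    return sum(kl_divergence(p, q) for p, q in zip(a, b))


def diverse_oriented_regularizer(cluster: List[Utterance]) -> float:
    """Negative mean pairwise distance, so that minimizing it pushes the
    decoding distributions of different outputs apart (assumed form)."""
    pairs = [(i, j) for i in range(len(cluster)) for j in range(len(cluster)) if i != j]
    if not pairs:
        return 0.0
    total = sum(utterance_distance(cluster[i], cluster[j]) for i, j in pairs)
    return -total / len(pairs)


def total_loss(generation_loss: float, cluster: List[Utterance], beta: float) -> float:
    """Generation loss plus the weighted regularizer (assumed combination)."""
    return generation_loss + beta * diverse_oriented_regularizer(cluster)
```

Identical output distributions give a regularizer of zero, while diverging distributions make it negative and therefore lower the total loss, which is the direction the training objective is meant to reward.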

Generation Pre-training

When training data is insufficient, the data augmentation model itself is often poorly trained because the training data contains only limited expressions. To remedy this, we initialize the transformer encoder/decoder with the pre-trained language model GPT-2.


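Input: original training data
Initialize the set of cluster-to-cluster pairs as empty
Collect all semantic frames
Delexicalize all training data
for each semantic frame do
    Split the utterances of the frame into input clusters by lexical clustering
    for each input cluster do
        Initialize the target cluster as empty
        while the target cluster has not reached the preset size do
            Update the target cluster with the candidate utterance that has the highest diversity score
        Update the set of cluster-to-cluster pairs with the (input cluster, target cluster) pair
Algorithm 1 Dispersed Cluster Pairing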

Cluster-to-Cluster Data Construction

To learn to generate diverse new utterances, we train the C2C model with cluster-to-cluster ‘paraphrasing’ pairs extracted from existing training data, and propose a Dispersed Cluster Pairing algorithm to construct these pairs.

We want the cluster-to-cluster generation pairs to simulate the data augmentation process, in which we generate diverse new utterances from limited expressions. Therefore, given all utterances with the same semantics, we gather similar utterances as an input cluster and pick the utterances with the most different expressions as the output cluster. For each semantic frame, we construct the input cluster with lexical clustering and construct the output cluster with a furthest including mechanism.

Figure 3 and Algorithm 1 present the workflow of the cluster-to-cluster data construction. First, we perform lexical clustering on the utterances with the K-Medoids clustering method FastKM. Each lexical cluster contains similar utterances and is used as an input cluster.

Then, for each source cluster, we sample target utterances according to a furthest including principle. Each time, we pick the candidate utterance that has the highest diversity score and include it in the target cluster.

We compute the diversity score between a candidate utterance and the union of the source cluster and the current target cluster. Notice that maximizing the diversity score between the candidate and the source cluster increases the target cluster’s novelty against the source cluster, while the diversity score between the candidate and the target cluster helps to avoid duplication within the target cluster.
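The furthest including selection can be sketched as a greedy loop in Python. The diversity score formula is not recoverable from this text, so the sketch assumes it is the token-level edit distance to the closest utterance in the union of the source cluster and the current target cluster. That choice and the helper names are assumptions for illustration only.

```python
from typing import List, Sequence

Utterance = Sequence[str]


def edit_distance(a: Utterance, b: Utterance) -> int:
    """Token-level Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def diversity_score(candidate: Utterance, references: List[Utterance]) -> int:
    """Assumed score: distance from the candidate to its closest reference."""
    return min(edit_distance(candidate, reference) for reference in references)


def furthest_including(source_cluster: List[Utterance],
                       candidates: List[Utterance],
                       target_size: int) -> List[Utterance]:
    """Greedily build a target cluster for a non-empty source cluster.

    The order in which utterances are picked can serve as a diversity rank.
    """
    target: List[Utterance] = []
    remaining = [c for c in candidates if c not in source_cluster]
    while remaining and len(target) < target_size:
        references = list(source_cluster) + target
        best = max(remaining, key=lambda c: diversity_score(c, references))
        target.append(best)
        remaining.remove(best)
    return target
```

Scoring against the union of both clusters means that a candidate close to an already selected target utterance scores low in later rounds, so near-duplicates are unlikely to enter the same target cluster.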

Diversity Rank

As mentioned in the decoder section, we adopt the diversity rank to encourage diversity and to distinguish sentences in the output cluster. Accordingly, we incorporate the diversity rank into the training data of the C2C model by associating each output utterance with a diversity rank token (see the examples in Figure 3). Since the output cluster utterances are greedily picked by diversity score, we naturally use this greedy picking order as the diversity rank, which models the novelty of an output utterance. When augmenting new data, we generate new utterances at every rank from 1 up to the preset size of the output cluster.

Cross Expansion

After training the C2C model, we generate unseen new utterances from the constructed input clusters. To keep the new utterances from overfitting to the original output utterances seen in C2C training, we perform data augmentation with a Cross Expansion mechanism. We partition all the cluster-to-cluster pairs into training pairs and reserved pairs. We then train the C2C model only on the training pairs and generate new utterances from the input clusters of the reserved pairs. To make full use of existing utterances, we repeat this partition in a crossing manner, so that different pairs are reserved in different rounds.
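The crossing partition can be sketched in Python as a fold-style split. The number of folds and the round-robin assignment of pairs to folds are assumptions for illustration; the text only states that the partition into training and reserved pairs is repeated in a crossing manner.

```python
from typing import List, Sequence, Tuple, TypeVar

Pair = TypeVar("Pair")


def cross_expansion_splits(pairs: Sequence[Pair], num_folds: int) -> List[Tuple[List[Pair], List[Pair]]]:
    """Return (training, reserved) partitions in which every
    cluster-to-cluster pair is reserved exactly once."""
    if num_folds < 2:
        raise ValueError("need at least two folds so the training part is not empty")
    # Round-robin assignment of pairs to folds (an illustrative choice).
    folds = [list(pairs[k::num_folds]) for k in range(num_folds)]
    splits = []
    for k, reserved in enumerate(folds):
        training = [pair for j, fold in enumerate(folds) if j != k for pair in fold]
        splits.append((training, reserved))
    return splits
```

In each round, a C2C model would be trained on the training part and would generate only from the input clusters of the reserved part, so it never generates from an input cluster whose target utterances it saw during training.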

Model | ATIS Full | ATIS Medium | ATIS Small | Snips Full | Snips Medium | Snips Small
Baseline | 94.93 | 85.85 | 67.33 | 89.30 | 64.84 | 42.33
+ NoiseSeq2Seq DBLP:conf/interspeech/KurataXZ16 | 94.61 | 87.34 | 67.93 | - | - | -
+ Slot Expansion shin2019vae | 94.67 | 87.58 | 74.83 | - | - | -
+ Rel-Seq2Seq hou2018coling | 94.82 | 88.72 | 73.71 | - | - | -
+ C-VAE shin2019vae | 95.04 | 88.82 | 71.97 | 90.93 | 65.13 | 38.46
+ Ours w/o pre-train | 95.06 | 90.87 | 75.21 | 90.33 | 67.49 | 46.94
+ Ours | 95.29 | 90.95 | 75.32 | 91.01 | 67.90 | 48.09
Table 1: Comparison of data augmentation methods for slot filling on ATIS and Snips datasets. Results marked with + are results of the same Bi-LSTM trained with different data augmentation methods. w/o pre-train initializes C2C with random parameters rather than pretrained GPT. The Snips results are re-implemented. * indicates that the result is statistically significant over the strongest data augmentation baseline under a t-test.



We evaluate the proposed data augmentation method on two slot filling datasets. (We focus only on DA for the sequence-labeling problem of slot filling, so the results may be lower than those of some joint SLU models, which perform slot filling using additional information from intents peng2020gpt; louvan2020simple.)


We conduct experiments on the ATIS and Snips datasets. ATIS atis is extensively used for slot filling and provides a well-founded comparison for data augmentation methods. It contains 4,978 training utterances and 893 testing utterances. To simulate data-insufficient situations, we follow chen2016syntax; hou2018coling; shin2019vae and evaluate our model on two reduced proportions of the training data: a small proportion (1/40 of the original training set, with 129 instances) and a medium proportion (1/10 of the original training set, with 515 instances). We use a development set of 500 instances.

Snips Snips dataset is collected from the Snips personal voice assistant. There are 13,084 training utterances and 700 testing utterances. We use another 700 utterances as the development set. We also split the Snips training set into a small proportion (1/100 of the original training set, with 130 instances) and a medium proportion (1/20 of the original training set, with 654 instances).


Following previous works shin2019vae; hou2018coling, we compute the F1-score as the evaluation metric with the conlleval script (www.clips.uantwerpen.be/conll2000/chunking/conlleval.txt).
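For readers unfamiliar with span-level scoring, the following Python sketch shows the kind of F1 that conlleval reports for slot filling: a predicted slot counts as correct only if its boundaries and type both match the gold annotation. This is a simplified re-implementation under the assumption of BIO tags, with hypothetical function names; it is not the conlleval script itself and omits its other tagging schemes and reports.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, slot type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect slot spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, slot = None, None
    for index, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and slot == tag[2:]
        if slot is not None and not continues:
            spans.add((start, index, slot))
            start, slot = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and slot is None):
            # A B- tag, or an I- tag with no open span, starts a new span.
            start, slot = index, tag[2:]
    return spans


def span_f1(gold: List[List[str]], predicted: List[List[str]]) -> float:
    """Micro-averaged span-level F1 over a corpus, in the range [0, 1]."""
    correct = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        gold_spans, pred_spans = extract_spans(gold_tags), extract_spans(pred_tags)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if correct == 0:
        return 0.0
    precision, recall = correct / n_pred, correct / n_gold
    return 2 * precision * recall / (precision + recall)
```

The scores in the result tables are on a 0 to 100 scale, so a value returned by this sketch would be multiplied by 100 for comparison.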


We built our Cluster2Cluster model with the transformer implemented by Wolf2019HuggingFacesTS. For pre-trained parameters, we used the GPT-2, which has 12 layers, 110M parameters and the hidden state dimension of 768. We used AdamW AdmaW optimizer with initial learning rate 6.25e-5 or 5e-5 for training. We varied in {0.1, 0.02, 0.01, 0.002, 0.001} and set as 1.0.

Following previous works shin2019vae; hou2018coling, we conduct experiments with a Bi-LSTM as the slot-filling model and train it with both the original training data and the data generated by different data augmentation methods. We use the same Bi-LSTM implementation as previous work (github.com/AtmaHou/Bi-LSTM_PosTagger). The dimensions of word embeddings and hidden states were set to 300 and 128, respectively. We used GloVe pennington2014glove to initialize the word embeddings. We varied the training batch size in {16, 128}, set the dropout rate to 0.5, and trained the model with Adam as suggested by kingma2014adam.

For all models, best hyperparameter settings are determined on the development set. We report the average of 5 differently-seeded runs for each result.

Main Results for Data Augmentation

Table 1 shows the evaluation results of data augmentation methods on two slot filling datasets: ATIS and Snips. To simulate data-insufficient situations, we compare the proposed method with previous data augmentation methods on different proportions of the training data, following previous works chen2016syntax; hou2018coling; shin2019vae. Baseline results are obtained with a Bi-LSTM slot-filling model trained on the original training data. The results of each data augmentation method are obtained with Bi-LSTM models that have the same architecture as the baseline but are trained with both the original data and the generated data.

On the ATIS dataset, our model significantly outperforms the baseline model by 5.10 and 7.99 F-score points on the medium and small proportions respectively. There are similar improvements on the Snips dataset. These improvements show the effectiveness of our augmentation method in data-insufficient scenarios. When tested in data-sufficient scenarios on the full proportions, our model also brings improvements over the baseline models, although the improvements are narrower than those in the data scarcity settings. We attribute this to the fact that the full ATIS and Snips training sets are large enough for slot filling, which limits the effect of additional synthetic data. When we augment new data without generation pre-training, performance drops but still improves over the baseline in all settings, which shows the effectiveness of both pre-training and the C2C structure. We will discuss pre-training in detail later.
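The gains quoted above follow directly from Table 1, and the arithmetic can be checked with a short Python sketch. The scores are copied from the Baseline and "+ Ours" rows of Table 1; the dictionary layout and the helper name `gains` are illustrative choices.

```python
from typing import Tuple

# F1 scores copied from Table 1: (baseline, ours).
TABLE_1 = {
    ("ATIS", "Medium"): (85.85, 90.95),
    ("ATIS", "Small"): (67.33, 75.32),
    ("Snips", "Medium"): (64.84, 67.90),
    ("Snips", "Small"): (42.33, 48.09),
}


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in F1 points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


for (dataset, proportion), (baseline, ours) in TABLE_1.items():
    absolute, relative = gains(baseline, ours)
    print(f"{dataset}-{proportion}: +{absolute:.2f} F1 ({relative:.1f}% relative)")
```

The ATIS entries give 5.10 and 7.99 points for the medium and small proportions, and the corresponding Snips gains are 3.06 and 5.76 points, which makes the "similar improvements" statement concrete.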

We compare our method to two kinds of popular data augmentation methods for slot filling: rephrasing-based and sampling-based methods. Similar to our method, rephrasing-based data augmentation methods reconstruct existing data into alternative expressions. (Traditional paraphrasing and back-translation methods are not compared here, because they cannot generate token-level annotations for sequence labeling problems.) Among methods of this kind, NoiseSeq2Seq DBLP:conf/interspeech/KurataXZ16 and Rel-Seq2Seq hou2018coling learn seq2seq models to reconstruct the existing utterances. To generate unseen expressions, NoiseSeq2Seq introduces noise into decoding, and Rel-Seq2Seq considers the relation between expression alternatives. Slot Expansion shin2019vae generates new data by randomly replacing the slot values of existing utterances. These methods augment each new utterance independently, and thus often generate duplicated expressions that do not help slot-filling training. Our C2C model mitigates this by jointly encoding and decoding multiple utterances and considering the relations among many instances. These advantages result in higher diversity and help to achieve better performance.

For the second type of data augmentation, we compare with the sampling-based data augmentation method C-VAE shin2019vae. C-VAE leverages a conditioned VAE model to sample new utterances and generates the corresponding annotations at the same time. It also faces the diversity problem, since it samples each new utterance independently. Our method outperforms this strong baseline in all six slot-filling settings. The improvements come from the better diversity and fluency of the proposed Cluster2Cluster generation. Notably, we gain improvements of 9.63 and 3.35 F1 points on Snips-small and ATIS-small respectively, which shows that our method is more effective in data scarcity situations.


Model | Full | Medium | Small
Ours | 91.01 | 67.90 | 48.09
- cluster-wise gen. | 90.28 | 66.23 | 45.93
- diverse reg. | 90.32 | 66.11 | 44.32
- dup. attention | 90.16 | 66.43 | 44.37
Table 2: The results of the ablation test.

Model | TF Layers | Full | Medium | Small
w/ pre-train | 12 | 91.01 | 67.90 | 48.09
w/o pre-train | 12 | 90.42 | 67.32 | 44.96
w/o pre-train | 2 | 90.33 | 67.49 | 46.94
w/o pre-train | 1 | 90.10 | 66.45 | 46.59
Table 3: Effect analysis of generation pre-training.

Ablation Test

We perform an ablation study to evaluate the importance of each component in the C2C framework. Table 2 shows the results on Snips. For the model without cluster-wise generation, we directly fine-tune GPT to generate new data in a seq-to-seq manner. The drops in F1-score demonstrate the superiority of cluster-wise generation. Removing either Diverse-Oriented Regularization or Duplication-aware Attention from the model also lowers performance. This shows that both mechanisms help to improve slot filling by encouraging diversity.

Effects of Generation Pre-training

We analyze the impact of initializing C2C model with pre-trained language model. We randomly initialize C2C model and vary the model sizes to avoid overfitting caused by large model sizes. As shown in Table 3, the pre-training helps to improve the effects of data augmentation on all settings. We attribute this to the fact that pre-training can improve generation fluency. However, as revealed in both Table 3 and Table 1, the drops are limited compared to the overall improvements, which shows the inherent effectiveness of C2C model.

Model | ATIS Full | ATIS Medium | ATIS Small | Snips Full | Snips Medium | Snips Small
Baseline | 94.93 | 85.85 | 67.33 | 89.30 | 64.84 | 42.33
+ BERT | 95.53 | 91.18 | 82.56 | 96.63 | 86.34 | 66.43
+ BERT + Ours | 95.45 | 92.13 | 85.92 | 96.12 | 88.63 | 73.37
Table 4: Analysis of data augmentation effects with deep pre-trained embeddings

Effects over Deep Pre-trained Embeddings

For data scarcity problems, deep pre-trained embeddings such as BERT BERT have also been demonstrated to be an effective solution wang2020static. To see whether DA is still effective when using deep pre-trained embeddings, we conduct DA experiments over a BERT-based slot-filling model. (We fine-tune the BERT-uncased-base model with the AdamW optimizer and an initial learning rate of 5e-5.) As shown in Table 4, although BERT greatly improves slot-filling performance, our method still achieves improvements on the Medium and Small proportions. This shows the effectiveness of our DA method for data scarcity problems. Our augmentation method slightly lags behind the BERT-only model on the Full proportion. We attribute this to the fact that the full data is large enough for slot filling and that BERT can be misled by the noise within the generated data.

Evaluation for Generation Diversity

Increasing the diversity of generation is one of the essential goals of the data augmentation methods. Following shin2019vae, we evaluate the diversity of generated data from two aspects: Inter and Intra. Inter: ratio of utterances that did not appear in the original training set. Intra: ratio of unique utterances among all generated new data.

Such metrics only measure diversity at the whole-sentence level and fail to measure expression diversity at the token level. To remedy this, we introduce a token-level diversity metric: Minimum Edit Distance (MED). For each generated utterance, we calculate its MED to a set of utterances as the smallest edit distance between it and any utterance in the set. MED measures the token-level novelty of a sentence compared with a set of existing sentences. We report the average MED of each generated utterance to the original training set (Inter) and to the other generated utterances (Intra).
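The diversity metrics can be written down compactly in Python. The sketch below computes the Inter and Intra ratios as defined in the text and a Minimum Edit Distance based on token-level Levenshtein distance. The use of Levenshtein distance with unit costs and the function names are assumptions for illustration, since the exact edit-distance definition is not recoverable from this text.

```python
from typing import List, Sequence

Utterance = Sequence[str]


def edit_distance(a: Utterance, b: Utterance) -> int:
    """Token-level Levenshtein distance with unit costs (assumed definition)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def med(utterance: Utterance, references: List[Utterance]) -> int:
    """Minimum edit distance from one utterance to a set of utterances."""
    return min(edit_distance(utterance, reference) for reference in references)


def inter_ratio(generated: List[Utterance], training: List[Utterance]) -> float:
    """Share of generated utterances that do not appear in the training set."""
    seen = {tuple(u) for u in training}
    return sum(tuple(u) not in seen for u in generated) / len(generated)


def intra_ratio(generated: List[Utterance]) -> float:
    """Share of unique utterances among all generated utterances."""
    return len({tuple(u) for u in generated}) / len(generated)


def average_inter_med(generated: List[Utterance], training: List[Utterance]) -> float:
    """Average MED of each generated utterance to the training set."""
    return sum(med(u, training) for u in generated) / len(generated)


def average_intra_med(generated: List[Utterance]) -> float:
    """Average MED of each generated utterance to the other generated ones
    (requires at least two generated utterances)."""
    total = 0
    for i, utterance in enumerate(generated):
        others = generated[:i] + generated[i + 1:]
        total += med(utterance, others)
    return total / len(generated)
```

An exact duplicate contributes a MED of 0, so the averages fall as duplication rises, while a paraphrase that changes several tokens raises them; this is what the ratio metrics alone cannot capture.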

Table 5 shows the evaluation of generation diversity on ATIS-Full. For Inter diversity, our method significantly outperforms all previous methods on both the Ratio and the average MED metrics. We note that our method achieves the best diversity even when evaluated on the generated delexicalized utterances. This shows the strong ability of the C2C model to generate unseen expressions, which is mainly because the cluster-wise encoding mechanism makes the model aware of more existing expressions during generation.

For Intra Diversity, our method also achieves the best performances over the previous works. These improvements show that considering relations between generated utterances can significantly reduce duplication.

Diversity Analysis

To understand how the proposed method enhances expression diversity, we investigate the diversity distribution of generated delexicalized utterances on ATIS-full. We measure diversity with Inter MED. As shown in Figure 4, Seq2Seq generation yields more existing expressions, and its scores are mostly distributed in low-value areas. Compared with Seq2Seq, the Cluster2Cluster model generally has higher scores. This demonstrates the intrinsic advantage of cluster-wise generation in producing new expressions.

When the Cluster2Cluster model is trained with Diverse-Oriented Regularization and Duplication-aware Attention, there are far fewer existing expressions within the generated utterances, and the distribution drifts continuously towards higher diversity. This shows that the proposed mechanisms help to generate more diverse utterances.

Also, we conduct case studies to see how C2C model generates unseen expressions (See Appendix).

Model | Inter Ratio | Inter MED | Intra Ratio | Intra MED
NoiseSeq2Seq | 74% | 1.20 | 86% | 1.03
Rel-Seq2Seq | 96% | 3.16 | 90% | 1.81
C-VAE | 23% | 0.62 | 11% | 0.42
Ours (delexicalized) | 96% | 5.88 | 92% | 5.55
Ours | 100% | 9.03 | 95% | 4.85
Table 5: Diversity evaluation of utterance generation.

Related Work

Data augmentation (DA) addresses data scarcity problems by enlarging the size of the training data fader-zettlemoyer-etzioni:2013:ACL2013; zhang2015character; daforslu; daforslu2; dafordst; li2019insufficient. Previous DA works propose back-translation methods backtranslate; backtranslate1 and paraphrasing methods paraphrase; paraphrase18; paraphrase19; gao2020paraphrase to generate semantically similar sentences. However, these DA methods are not applicable to the sequence labeling problem of slot filling, because slot filling requires token-level annotations of the semantic frame, while these methods can only provide sentence-level labels.

Spoken language understanding, including the slot filling and intent detection tasks, has drawn a lot of research attention recently yao2013recurrent; 7078572; DBLP:conf/interspeech/MesnilHDB13; DBLP:journals/taslp/MesnilDYBDHHHTY15; chen2016syntax; contextualslu; slusota2; slusota; slusota3. In this paper, we focus only on the slot filling task. For data augmentation of slot filling, previous works focus on generation-based methods. DBLP:conf/interspeech/KurataXZ16; hou2018coling; peng2020gpt augment the training data with a Sequence-to-Sequence model. shin2019vae; yoo2019joint introduce a Variational Auto-Encoder Kingma2014vae and jointly generate new utterances and predict the labels. louvan2020simple introduce simple rules to generate new utterances. Unlike our C2C framework, these methods augment each instance independently and often unintentionally generate duplicated expressions.

Figure 4: Diversity distribution of generated expressions.


In this paper, we study the data augmentation problem for slot filling and propose a novel data augmentation framework, C2C-GenDA, which generates new instances from existing training data in a cluster-to-cluster manner. C2C-GenDA improves generation diversity by considering the relations between generated utterances and capturing more existing expressions. To further encourage diversity, we propose the Duplication-aware Attention and Diverse-Oriented Regularization mechanisms. We introduce a Dispersed Cluster Pairing algorithm to construct cluster-to-cluster paraphrasing pairs for C2C-DA training. Experiments show that the proposed framework improves slot filling by generating diverse new training data and outperforms existing data augmentation systems for slot filling.



Appendix A Case Study of Generation

Type | Utterance pair
Replace Phrases | Existing: show me the flights from from_city to to_city with stop in stop_city
 | Generated: give me flights from from_city toto_city with stopover in stop_city
Enrich Info | Existing: i ’d like information on all the flights from from_city to to_city on depart_date
 | Generated: i ’m sorry to see all the flights that i take from from_city to to_city on depart_date
Change Syntax | Existing: how much is a flight from from_city to to_city
 | Generated: how much does a flight cost from from_city to to_city
Change Semantics | Existing: show all airlines with flights between from_city and to_city
 | Generated: show me more airlines with seats between from_city and to_city
Table 6: Case study of new expressions. For each generated example, we find the most similar existing utterance and compare the differences.

We conduct a case study to see how the proposed Cluster2Cluster model generates unseen expressions. Specifically, we randomly pick a generated utterance and search the original training set for the utterance most similar to it. To focus on changes in expression, we perform the experiments on delexicalized utterances.

By comparing each generated utterance with its most similar existing utterance, we find that new expressions arise in several interesting ways: Replace Phrases, Enrich Info, Change Syntax, and Change Semantics. The examples are listed in Table 6. We find that the most common new expressions come from replacing phrases and enriching information. As the given examples show, the Cluster2Cluster model can replace semantically similar phrases and bring in additional information. The phrase-replacing ability is mainly learned from the cluster-to-cluster training data, and the enriched information often comes from the pre-training process. As for Change Syntax, it is interesting to see that Cluster2Cluster can yield new expressions by using alternative syntax, which is very close to how humans paraphrase.