A Single Example Can Improve Zero-Shot Data Generation

08/16/2021
by   Pavel Burnyshev, et al.

Sub-tasks of intent classification, such as robustness to distribution shift, adaptation to specific user groups and personalization, and out-of-domain detection, require extensive and flexible datasets for experiments and evaluation. As collecting such datasets is time- and labor-consuming, we propose to use text generation methods to gather datasets. The generator should be trained to generate utterances that belong to the given intent. We explore two approaches to generating task-oriented utterances. In the zero-shot approach, the model is trained to generate utterances from seen intents and is further used to generate utterances for intents unseen during training. In the one-shot approach, the model is presented with a single utterance from a test intent. We perform a thorough automatic and human evaluation of the datasets generated using the two proposed approaches. Our results reveal that the attributes of the generated data are close to those of the original test sets, collected via crowd-sourcing.


1 Introduction

Training dialogue systems used by virtual assistants in task-oriented applications requires large annotated datasets. The core machine learning task of every dialogue system is intent detection, which aims to determine the user's intention. New intents emerge when new applications, supported by the dialogue systems, are launched. However, an extension to new intents may require annotating additional data, which may be time-consuming and costly. What is more, when developing a new dialogue system, one may face the cold start problem if little training data is available. Open sources provide general domain annotated datasets, primarily collected via crowd-sourcing or released from commercial systems, such as the Snips NLU benchmark coucke2018snips. However, it is usually problematic to gather more specific data from any source, including user logs, which are protected by privacy policies in real-life settings.

For all these reasons, we suggest a learnable approach to create training data for intent detection. We simulate a real-life situation in which no annotated data but rather only a short description of a new intent is available. To this end, we propose to use methods for zero-shot conditional text generation to generate plausible utterances from intent descriptions. The generated utterances should be in line with the intent’s meaning.

Our contributions are:

  1. We propose a zero-shot generation method to generate a task-oriented utterance from an intent description;

  2. We evaluate the generated utterances and compare them to the original crowd-sourced datasets. The proposed zero-shot method achieves high scores in fluency and diversity as per our human evaluation;

  3. We provide experimental evidence of a semantic shift when generating utterances for unseen classes using the zero-shot approach;

  4. We apply reinforcement learning for the one-shot generation to eliminate the semantic shift problem. The one-shot approach retains semantic accuracy without sacrificing fluency and diversity.

2 Related work

Conditional language modelling generalizes the task of language modelling. Given some conditioning context, it assigns probabilities to a sequence of tokens mikolov2012context. Machine translation sutskever2014sequence; cho2014learning and image captioning you2016image are seen as typical conditional language modelling tasks. More sophisticated tasks include abstractive text summarization nallapati2017summarunner; narayan2019article and simplification zhang2017sentence, generating textual comments to source code richardson2017code2text and dialogue modelling lowe2017training. Structured data may act as a conditioning context as well. Knowledge base (KB) entries vougiouklis2018neural or DBPedia triples colin2016webnlg serve as conditions to generate plausible factual sentences. Neural models for conditional language modelling rely on encoder-decoder architectures and can be learned either jointly from scratch vaswani2017attention or by fine-tuning pre-trained encoder and decoder models budzianowski2019hello; lewis2019bart.

Zero-shot learning (ZSL) has become a recognized training paradigm as neural models have grown more capable in the majority of downstream tasks. In the NLP domain, the ZSL scenario aims at assigning a label to a piece of text based on the label description. The learned classifier becomes able to assign class labels that were unseen during training. The classification task is then reformulated in the form of question answering levy2017zero or textual entailment yin2019benchmarking. Other techniques for ZSL leverage metric learning and make use of capsule networks du2019investigating and prototyping networks yu2019episodebased.

Zero-shot conditional text generation implies that the model is trained in such a way that it can generalize to an unseen condition, for which only a description is provided. A few recent works in this direction showcase dialog generation from unseen domains zhao2018zero and question generation from KBs for unseen predicates and entity types elsahar2018zero. CTRL keskar2019ctrl, pre-trained on so-called control codes, which can be combined to govern style, content, and surface form, provides for zero-shot generation for unseen code combinations. PPLM dathathri2019plug uses signals representing the class, e.g., bag-of-words, during inference, and can generate examples with given semantic attributes without pre-training.

Training data generation can be treated as a form of data augmentation, a research direction that is increasingly in demand. It enlarges datasets for training neural models and helps avoid labor-intensive and costly manual annotation. Common techniques for textual data augmentation include back-translation sennrich2016improving, sampling from latent distributions xia2021pseudo, simple heuristics, such as synonym replacement wei2019eda, and oversampling chawla2002smote. Few-shot text generation has been applied to natural language generation from structured data, such as tables chen2020few, and to intent detection data augmentation xia2021pseudo. However, these methods are incompatible with ZSL, as they require at least a few labeled examples for the class being augmented. An alternative approach suggests using a model to generate data for the target class based on task-specific world knowledge chen2017automatically and linguistic features iyyer2018adversarial.

Deep reinforcement learning (RL) methods prove to be effective in a variety of NLP tasks. Early works approach the tasks of machine translation grissom2014don, image captioning rennie2017self and abstractive summarization paulus2017deep, which are assessed with non-differentiable metrics. wu2020textgail tries to improve the quality of transformer-derived pre-trained models for generation by leveraging proximal policy optimization. Other applications of deep RL include dialogue modeling li2016deep and open-domain question answering wang2018r.

3 Methods

Our main goal is to generate plausible and coherent utterances, which relate to unseen intents, leveraging the description of the intent only. These utterances should clearly express the desired intent. For example, if conditioned on the intent “delivery from the grocery store” the model should generate an utterance close to “Hi! Please bring me milk and eggs from the nearest convenience store” or similar.

Two scenarios can be used to achieve this goal. In the zero-shot scenario, we train the model on a set of seen intents to generate utterances. If the generation model generalizes well, the utterances generated for unseen intents are diverse and fluent and retain intents’ semantics. In the one-shot scenario, we utilize one utterance per unseen intent to train the generation model and learn the semantics of this particular intent.

3.1 Zero-shot generation

Our model (depicted in Figure 1) aims to generate plausible utterances conditioned on the intent description. We fine-tune the GPT-2 medium model radford2019language on task-oriented utterances, collected from several NLU benchmarks (see Section 5.1 for more details on the dataset).

Figure 1: Training setup. The input is an intent description and an utterance concatenated; the output is the utterance.

Our approach to fine-tuning the GPT-2 model follows budzianowski2019hello. Two pieces of information, the intent description and the utterance, are concatenated to form the input. More precisely, the input has the following format: [intent description] utterance. During the training phase, the model is presented with the output obtained from the input by masking the intent description. The output has the following format: <MASK>, ..., <MASK> utterance. The full list of intents is provided in Table 4 in Appendix.
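The input and output formats described above can be illustrated with a minimal sketch. It assumes whitespace tokenization in place of the GPT-2 tokenizer, and the function name `build_example` and the example intent are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

MASK = "<MASK>"  # marks positions that are excluded from the loss


def build_example(intent_description: str, utterance: str) -> Tuple[List[str], List[str]]:
    """Return (input_tokens, target_tokens) for one training example.

    Assumption: whitespace tokenization stands in for the real tokenizer.
    """
    intent_tokens = ["["] + intent_description.split() + ["]"]
    utterance_tokens = utterance.split()
    # Input: the bracketed intent description followed by the utterance.
    input_tokens = intent_tokens + utterance_tokens
    # Target: the same sequence with every intent position masked, so that
    # only utterance tokens contribute to the training loss.
    target_tokens = [MASK] * len(intent_tokens) + utterance_tokens
    return input_tokens, target_tokens


inputs, targets = build_example("Music Play song", "play some jazz please")
print(inputs)
print(targets)
```

In a real fine-tuning pipeline the masked positions would typically be mapped to an ignore index of the loss function rather than to a literal token.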

Such input allows the model to pay attention to intent tokens while generating. The standard language modeling objective, negative log-likelihood loss, is used to train the model:
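A standard form of the negative log-likelihood for this setup is sketched below. The notation is an assumption introduced here: $d$ denotes the tokens of the intent description, $u_1, \dots, u_T$ the tokens of the utterance, and $\theta$ the model parameters. The sum runs over utterance tokens only, which matches the masking of the intent description in the output.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(u_t \mid d,\, u_{<t}\right)
```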

We fine-tuned the model for one epoch to avoid over-fitting. Otherwise, the model tends to repeat redundant semantic constructions of the input utterances. At the same time, a bias towards the words from the training set gets formed. The parameters of the training used were set to the following values: batch size equals to

, learning rate equals to e-, the optimizer chosen is Adam kingma2015adam with default parameters.

3.2 One-shot Generation

Motivation. The zero-shot approach to conditional generation may degrade or even fail if (i) the intent description is too short to properly reflect the semantics of the intent, (ii) the intent description is ambiguous or contains ambiguous words. Produced utterances may distort the initial meaning of the intent or be entirely meaningless. The model may generate an utterance “Count the number of people in the United States” for the intent “calculator”, or “Add a book by Shakespeare to the calendar” for a “book reading” service. Although such examples can be treated not as outliers but rather as real-life whimsical utterances, this is not the desired behavior for the generation model. We refer to this phenomenon as Semantic Shift and provide experimental evidence of it in Section 5.4.

Based on these observations, we hypothesize that the problem could be solved if we provide a single training example to improve the model's generalization abilities. A single example can give the model a clue about what the virtual assistant can do with books and which entities our calculator is designed to calculate, thereby supplying better world knowledge. For this purpose, we move from the zero-shot to the one-shot setting. We propose a method for improving zero-shot generation by leveraging just one example.

Our approach is inspired by the recent TextGAIL wu2020textgail approach. It addresses the problem of exposure bias in pre-trained language models and proposes a GAN-style scheme for fine-tuning GPT-2 to produce appropriate story endings using a reinforcement learning algorithm. As a reward, TextGAIL uses the output of a discriminator trained to distinguish real samples from generated samples. As we are limited in using learnable discriminators because of the lack of training data, we propose an objective function based on a similarity score. Our objective function encourages the model to produce utterances that are close to the reference example. At the same time, it forces the model to generate more diverse and plausible utterances. Table 5 in Appendix provides reference examples used for the one-shot generation method.

Method. After zero-shot fine-tuning, we perform a one-shot model update for each intent separately. We perform several steps of the Proximal Policy Optimization algorithm schulman2017proximal with the objective function described further.

Reward. Our reward function is based on BERTScore zhang2019bertscore, which serves as the measure of contextual similarity between generated sentences and the reference example. BERTScore correlates better with human judgments than other existing metrics, used to control semantics of generated texts and detect paraphrases. Given a reference and a candidate sentence, we embed them using RoBERTa model liu2019roberta. The BERTScore F1 calculated on top of these embeddings is used as a part of the final reward.

It is not enough to reward the model only for the similarity of the generated utterance to the reference one. If so, the model tends to repeat the reference example and receives the maximal reward. We add the negative sum of frequencies of all n-grams in the utterance to the reward function, forcing the model to generate less frequent sequences.

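Given an intent and a reference example, the reward for a generated sentence combines the BERTScore F1 between the sentence and the reference example with the negative sum of the frequencies of the n-grams in the sentence, where the n-gram frequency is calculated from all the generated utterances inside one batch.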
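A minimal sketch of such a reward is given below, under several assumptions that are not taken from the paper: a toy token-overlap F1 (`token_f1`) stands in for BERTScore F1, n-gram frequencies are relative frequencies within the batch, and the weighting factor `penalty_weight` is a hypothetical parameter.

```python
from collections import Counter
from typing import Callable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def token_f1(candidate: List[str], reference: List[str]) -> float:
    """Toy stand-in for BERTScore F1 (assumption): F1 over shared tokens."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def batch_rewards(
    batch: List[str],
    reference: str,
    n: int = 2,
    penalty_weight: float = 1.0,  # hypothetical weighting, not from the paper
    similarity: Callable[[List[str], List[str]], float] = token_f1,
) -> List[float]:
    """Reward = similarity to the reference minus summed n-gram frequencies."""
    tokenized = [utterance.split() for utterance in batch]
    ref_tokens = reference.split()
    # n-gram counts are collected over all utterances of the batch.
    counts = Counter(g for toks in tokenized for g in ngrams(toks, n))
    total = sum(counts.values()) or 1
    rewards = []
    for toks in tokenized:
        freq_sum = sum(counts[g] / total for g in ngrams(toks, n))
        rewards.append(similarity(toks, ref_tokens) - penalty_weight * freq_sum)
    return rewards


batch = ["buy a train ticket", "buy a train ticket", "book me a seat on the train"]
print(batch_rewards(batch, "buy a train ticket for a specific date"))
```

Passing a BERTScore-based callable as `similarity` would bring the sketch closer to the described setup; the frequency penalty is independent of that choice.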

Objective function. First, we plug this reward into the standard PPO objective function, obtaining an intent-specific term. Following the TextGAIL approach, we add a divergence term with the model without zero-shot fine-tuning to prevent forgetting the information from the pre-trained model. We also add an entropy regularizer, which makes the distribution smoother and leads to more diverse and fluent sentences. According to our experiments, this term helps avoid similar prefixes for all generated sentences, an issue that the n-gram reward alone does not resolve. The final generator objective, maximized in the one-shot scenario for each intent, combines these three terms.

The conditional distribution is derived from the model with updates from the PPO policy and is conditioned on the intent description; the unconditional LM distribution is calculated by the GPT-2 language model without fine-tuning. The entropy and the divergence are calculated per token, while the PPO term is calculated for the whole sentence.
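How the three terms could be combined is sketched below. This is an illustration under stated assumptions, not the exact objective: the divergence is taken to be the KL divergence from the policy to the reference distribution, the weights `kl_weight` and `entropy_weight` are hypothetical, and the sentence-level PPO term is passed in as a precomputed number.

```python
import math
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (natural log) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); assumes q is positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def one_shot_objective(
    ppo_term: float,                      # sentence-level PPO term
    policy_dists: List[Sequence[float]],  # per-token policy distributions
    reference_dists: List[Sequence[float]],  # per-token reference LM distributions
    kl_weight: float = 0.1,        # hypothetical coefficient
    entropy_weight: float = 0.01,  # hypothetical coefficient
) -> float:
    """Objective to maximize: PPO term - KL penalty + entropy bonus."""
    kl = sum(kl_divergence(p, q) for p, q in zip(policy_dists, reference_dists))
    ent = sum(entropy(p) for p in policy_dists)
    return ppo_term - kl_weight * kl + entropy_weight * ent


policy = [[0.5, 0.5], [0.9, 0.1]]
reference = [[0.5, 0.5], [0.6, 0.4]]
print(one_shot_objective(1.0, policy, reference))
```

The divergence and entropy are summed over token positions, while the PPO term enters once per sentence, mirroring the per-token versus per-sentence distinction made in the text.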

3.3 Decoding strategies

Recent studies show that a properly chosen decoding strategy significantly improves consistency and diversity metrics and human scores of generated samples for multiple generation tasks, such as story generation holtzman2019curious, open-domain dialogues, and image captioning Ippolito2019comparison. However, to the best of our knowledge, no method proved to be a one-size-fits-all one. We perform experiments with several decoding strategies, which improve diversity while preserving the desired meaning. We perform an experimental evaluation of different decoding parameters.

Beam Search, a standard decoding mechanism, keeps a fixed number of top partial hypotheses (the beam size) at every time step and eventually chooses the hypothesis that has the overall highest probability.

Random Sampling (top-k) fan2018hierarchical samples at each time step one of the top-k most likely tokens in the distribution.

Nucleus Sampling (top-p) holtzman2019curious samples from the most likely tokens whose cumulative probability does not exceed p.

Post Decoding Clustering Ippolito2019comparison (i) clusters generated samples using BERT-based similarity and (ii) selects samples with the highest probability from each cluster. It can be combined with any decoding strategy.
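The two sampling strategies listed above can be sketched as filters over a next-token distribution. The sketch assumes a toy distribution given as a dictionary; the function names are illustrative. The nucleus filter follows the usual formulation, keeping the smallest set of most likely tokens whose cumulative probability reaches p.

```python
import random
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(q for _, q in kept)
    return {tok: q / total for tok, q in kept}


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the (filtered) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


next_token = {"show": 0.5, "read": 0.3, "open": 0.15, "delete": 0.05}
rng = random.Random(0)
print(sample(top_k_filter(next_token, 2), rng))
print(sample(nucleus_filter(next_token, 0.9), rng))
```

Smaller values of k or p concentrate sampling on the most likely tokens, which tends to lower diversity; larger values do the opposite.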

4 Performance evaluation

We use several quality metrics to assess the generated data: (i) we use multiple fluency and diversity metrics, (ii) we account for the performance of the classifiers trained on the generated data.

Fluency. We consider fluency dependent upon the number of spelling and grammar mistakes: the utterance is treated as a fluent one if there are no misspellings and no grammar mistakes. We utilize LanguageTool milkowski2010developing, a free and open-source grammar checker, to check spelling and correct grammar mistakes.

Diversity. Following Ippolito2019comparison, we consider two types of diversity metrics:

Dist-n li2016diversity is the total number of distinct n-grams divided by the total number of produced tokens in all of the utterances for an intent;

Ent-n zhang2018generating is the entropy of the n-gram distribution. This metric takes into consideration that infrequent n-grams contribute more to diversity than frequent ones.
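Both diversity metrics can be computed in a few lines. The sketch below assumes whitespace tokenization and the natural logarithm for the entropy; the function names `distinct_n` and `entropy_n` are illustrative.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(utterances: List[str], n: int) -> float:
    """Number of distinct n-grams divided by the total number of tokens."""
    total_tokens = sum(len(u.split()) for u in utterances)
    distinct = {g for u in utterances for g in ngrams(u.split(), n)}
    return len(distinct) / total_tokens if total_tokens else 0.0


def entropy_n(utterances: List[str], n: int) -> float:
    """Entropy of the empirical n-gram distribution over all utterances."""
    counts = Counter(g for u in utterances for g in ngrams(u.split(), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


generated = ["show me my messages", "show me my mail"]
print(distinct_n(generated, 2))
print(entropy_n(generated, 2))
```

Both functions are meant to be applied to the set of utterances generated for a single intent, as in the definitions above.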

Accuracy. After we obtain a large amount of generated data, we train a RoBERTa-based classifier liu2019roberta to distinguish between different intents, based on the generated utterances. As usual, we split the generated data into two parts so that the first part is used for training, and the second part serves as the held-out validation set to compute the classification accuracy. High accuracy values mean that the intents are well distinguishable, and the utterances that belong to the same intent are semantically consistent.

Human evaluation. We perform two crowd-sourcing studies to evaluate the quality of generated utterances, which aim at the evaluation of semantic correctness and fluency.

First, we asked crowd workers to evaluate semantic correctness. We gave crowd workers an utterance and asked them to assign one of the four provided intent descriptions; a correct option was among them (i.e., the one used to generate this very utterance). For the sake of completeness, we added a fifth option, “none of above”. We assess the results of this study by two metrics. The first, accuracy, measures the proportion of correct answers, while the second measures the proportion of answers that differ from the last “none of above” option.

Second, we asked crowd workers to evaluate the fluency of generated utterances. Crowd workers were provided with an utterance and were asked to score it on a Likert-type scale from 1 to 5, where (5) means that the utterance sounds natural, (3) means that the utterance contains some errors, (1) means that it is hard or even impossible to understand the utterance. We assess the results of this study by computing the average score.
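The two metrics of the semantic-correctness study can be sketched as follows. The data layout (pairs of the option chosen by a crowd worker and the intent used for generation) and the function name `human_scores` are assumptions made for illustration.

```python
from typing import List, Tuple

NONE_OPTION = "none of above"  # the fifth answer option


def human_scores(answers: List[Tuple[str, str]]) -> Tuple[float, float]:
    """answers: (chosen_option, correct_intent) pairs, one per judgment.

    Returns (accuracy, share of answers other than the "none" option).
    """
    n = len(answers)
    accuracy = sum(chosen == correct for chosen, correct in answers) / n
    not_none = sum(chosen != NONE_OPTION for chosen, _ in answers) / n
    return accuracy, not_none


judgments = [
    ("Music Play song", "Music Play song"),
    ("Music Play song", "Music Create playlist"),
    ("Calculator Find sum", "Calculator Find sum"),
    (NONE_OPTION, "Wallpapers Put default wallpaper"),
]
print(human_scores(judgments))
```

By construction the second value is never smaller than the first, since every correct answer is also an answer other than “none of above”.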

5 Zero-shot generation experiments

5.1 Data preparation

Data for fine-tuning. We combined two NLU datasets, namely The Schema-Guided Dialogue Dataset (SGD) rastogi2019towards and Natural Language Understanding Benchmark (NLU-bench) coucke2018snips for the fine-tuning stage. Both datasets have a two-level hierarchical structure: they are organized according to services (in SGD) or scenarios (in NLU-Bench). Each service/scenario contains several intents, typically 2-5 intents per high-level class. For example, the service Buses_1 is divided into two intents FindBus and BuyBusTickets.

SGD dataset consists of multi-turn task-oriented dialogues between user and system; each user utterance is labeled by service and intent. We adopted only those utterances from each dialog in which a new intent arose, which means the user clearly announced a new intention. This is a common technique to remove sentences that do not express any intents. As a result, we got three utterances per dialog on average.
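The filtering step can be sketched as below. The turn structure (dictionaries with `speaker`, `utterance`, and `intent` keys) is a simplified, hypothetical layout and not the actual SGD schema.

```python
from typing import Dict, List, Optional, Tuple

Turn = Dict[str, Optional[str]]


def new_intent_utterances(dialog: List[Turn]) -> List[Tuple[str, str]]:
    """Keep only user utterances in which a new intent arises."""
    selected = []
    active_intent = None
    for turn in dialog:
        # Skip system turns and user turns that carry no intent label.
        if turn["speaker"] != "USER" or turn["intent"] is None:
            continue
        if turn["intent"] != active_intent:
            selected.append((turn["intent"], turn["utterance"]))
            active_intent = turn["intent"]
    return selected


dialog = [
    {"speaker": "USER", "utterance": "find me a bus to Boston", "intent": "FindBus"},
    {"speaker": "SYSTEM", "utterance": "when do you want to leave?", "intent": None},
    {"speaker": "USER", "utterance": "tomorrow morning", "intent": "FindBus"},
    {"speaker": "USER", "utterance": "book two tickets", "intent": "BuyBusTickets"},
]
print(new_intent_utterances(dialog))
```

Follow-up turns such as “tomorrow morning” are dropped because they continue the active intent without announcing a new one.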

As NLU-Bench consists of user utterances, each marked up with a scenario and intent label, we used it without filtering. Summary statistics of the dataset used are provided in Table 1.

                     SGD      NLU-bench   Total
No. of utterances    49986    25607       75593
No. of services      32       18          50
No. of intents       67       68          135
Total tokens         550k     170k        720k
Unique tokens        10.8k    8.3k        17.4k

Table 1: The total number of utterances, intents, services and words across datasets and final statistics of our fine-tuning data.

                             Automated metrics                          Human evaluation
Decoding strategy            Accuracy  Distinct n-grams  n-gram entropy  Accuracy  Not "none of above"  Fluency score

Zero-shot generation
Random Sampling ()           0.82      0.50              6.20            0.63      0.87                 4.77
Nucleus Sampling () + PDC    0.82      0.40              5.77            0.68      0.85                 4.95
Beam Search () + PDC         0.85      0.22              4.92            0.67      0.85                 4.88
Beam Search ()               0.88      0.15              4.76            0.60      0.80                 4.76
Nucleus Sampling ()          0.89      0.25              4.95            0.72      0.90                 4.81

One-shot generation
Nucleus Sampling ()          0.94      0.39              5.88            0.78      0.91                 4.86

Table 2: Decoding strategies for zero-shot and one-shot generation. PDC stands for Post Decoding Clustering.

                 Semantic metrics    Diversity metrics
SGD+NLU-bench    0.83      0.95      0.53      5.92

Table 3: Evaluation of the test dataset, created by merging and re-splitting two datasets under consideration.

Intent set for generation. For the evaluation of our generation methods, we created a set of 38 services and 105 intents (the full list of services and intents in both sets is presented in the Appendix), covering the most common requirements of a typical user of a modern dialogue system. The set includes services dedicated to browsing the Internet, adjusting mobile device settings, searching for vehicles, and others. To adopt a zero-shot setup, we split the data into train and test sets in the following way. Some of the services are unseen, i.e., are present in the test set only. There are no seen services in the train set related to unseen services. The rest of the services are seen, i.e., present in both the train and test sets, but with different intents placed in each. For example, Flight services are present in the train data and the Plane service is used in the test set; from Music services, the intents Lookup song and Play song were used for training, and Create playlist and Turn on music for testing. To form the intent description for fine-tuning and generation, we join service and intent labels.
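The split and the construction of intent descriptions can be sketched as follows. Joining the labels with a single space is an assumption, as are the function names; the choice of Wallpapers as an unseen service in the example is purely illustrative.

```python
from typing import List, Set, Tuple


def intent_description(service: str, intent: str) -> str:
    """Join service and intent labels (assumption: separated by a space)."""
    return f"{service} {intent}"


def zero_shot_split(
    pairs: List[Tuple[str, str]],
    unseen_services: Set[str],
    held_out_intents: Set[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Send unseen services and held-out intents of seen services to test."""
    train, test = [], []
    for service, intent in pairs:
        in_test = service in unseen_services or (service, intent) in held_out_intents
        (test if in_test else train).append(intent_description(service, intent))
    return train, test


pairs = [
    ("Music", "Lookup song"),
    ("Music", "Play song"),
    ("Music", "Create playlist"),
    ("Music", "Turn on music"),
    ("Wallpapers", "Put default wallpaper"),
]
train, test = zero_shot_split(
    pairs,
    unseen_services={"Wallpapers"},
    held_out_intents={("Music", "Create playlist"), ("Music", "Turn on music")},
)
print(train)
print(test)
```

With this layout, a seen service such as Music contributes different intents to the two sets, while an unseen service appears in the test set only.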

5.2 Evaluation

We generated 100 examples per intent using different decoding strategies and their parameters. For the more detailed evaluation, we picked the generation methods of different decoding strategies that achieved good scores on the automated metrics. For these utterances, we performed a human evaluation of semantic correctness and fluency; Table 2 compares the decoding strategies according to various quality metrics. For a more detailed evaluation of decoding strategies, see Table 2 in Appendix.

To compare the diversity of human-generated utterances to our generated utterances, we evaluate the fine-tuning dataset with the two diversity metrics. The semantics of this data is assessed with two semantic metrics. We present metrics for this dataset in Table 3.

Beam Search (3)
  - i need to know what’s going on with my phone
  - i want you show me the message from my phone
  - i want you show me my messages on my phone
  - i want you to show my messages on my smart phone
  - i want to read a new message from my friend

Random Sampling (3)
  - i want to see my messages in the phone book
  - show me my most recent messages from my phone number
  - show me the messages from the device i was using
  - show me the message from my friend jane that i sent to her
  - can you please show me the messages from my phone

Nucleus Sampling (0.98)
  - show me a message from jean lee for my favorite apple company
  - how can you tell me mike with the message
  - could you check to see if my friends are in a group that is gossiping
  - list all messages in my bbq menu from ausy
  - just turn on the smart mute this monday night

Table 4: Utterances, generated by different decoding strategies and the diversity scores of the decoding strategies.

5.3 Analysis and model comparison

Fluency.

Spell checking results reveal the following issues in the generated utterances. The major issues are related to casing: an utterance may start in lower case, and the first-person singular personal pronoun “I” is frequently generated in lower case, too. Punctuation issues include missing quotes, question marks, periods, or repeated punctuation marks. Common mistakes are omitting the hyphen in the words “Wi-Fi” and “e-mail” and confusing definite and indefinite articles, as well as confusing “a”/“an”. These issues are more or less natural to humans and thus do not prevent further use of generated utterances. The only unnatural issues found by LanguageTool are phrase repetitions, which occur in small numbers. For examples of fluency issues in generated data, see Table 1 in Appendix.

Diversity.

Table 4 shows examples of the phrases generated by means of different decoding strategies, conditioning on the intent Show message, along with the diversity metrics. Higher diversity scores indeed correspond to a more diverse decoding strategy. At the same time, extremely high diversity may produce utterances that are unrelated to the intent, have unclear meaning, and lack common sense.

Diversity / Accuracy trade-off.

Figure 2 shows the trade-off between the diversity and the accuracy of the generated data.

Figure 2: The trade-off between diversity and accuracy.

Every point corresponds to sentences generated using different zero-shot strategies. The human level stands for the diversity and accuracy metrics computed for the test set as is. The beam search scores are mainly in the top-left corner of the plane, corresponding to high accuracy and low diversity values. The top-k Random Sampling strategy does not achieve the highest levels of accuracy. Nucleus Sampling can generate datasets with a large range of diversity and accuracy scores, depending on the chosen parameter. Post-decoding clustering increases diversity for low-diversity decoding strategies and decreases it for high-diversity ones, moving the generator closer to the human level.

Two ways to assess accuracy.

Table 2 shows that there is no clear correspondence between automated accuracy and human accuracy. Therefore, automated accuracy cannot serve as the final measure for the semantic consistency of the generator. The semantic shift problem cannot be captured by the automated accuracy: the model generates examples which are consistent inside each class, and classes are well-separated, but the generated examples do not correspond well to the intent descriptions.

Intent description: Train Buy train ticket
Reference: Make a purchase of the train ticket, not bus. Buy a train ticket for a specific date to some location
Undesirable meaning: Get bus ticket
Example: I need a bus to go there. I need to leave on the 3rd of this month.
Zero-shot: 97    One-shot: 23

Intent description: Wallpapers Put default wallpaper
Reference: Change the background picture of the device display to the default one. Replace current background on the device with the default one
Undesirable meaning: Put new wall cover in a house
Example: I want to put the wallpaper for my bedroom on the wall.
Zero-shot: 74    One-shot: 1

Intent description: Calculator Find sum
Reference: Compute, calculate the sum of the given numbers. Open the calculator and compute the sum of the following numbers
Undesirable meaning: Find some amount of money
Example: I need to find the average price of a house.
Zero-shot: 57    One-shot: 0

Table 5: Evaluation of semantic shift reduction by one-shot generation. For each intent, the table lists the intent description and the reference utterances used for one-shot generation, an example of a typical undesirable meaning, and the percentage of examples with the given incorrect meaning among 100 utterances generated by zero-shot and one-shot generation. Nucleus sampling () is used for both methods.

5.4 Semantic shift problem

Semantic consistency is crucial: how well do the generated utterances correspond to the intent description? In most cases, zero-shot generation is quite reliable. However, for some intents the generated utterances are distinguishable from other classes but do not fully correspond to the intent description. Several generated utterances below illustrate this issue.

Intent: Buy train tickets
Utterance: I want to buy a bus ticket. I want to leave on the 12th of this month.
Intent: Put default wallpapers
Utterance: Put the default wallpaper for the bedroom. I want to see it on the wall.
Intent: Calculator Find sum
Utterance: I need to find a calculator. I need to know the value of one dollar.

The bias in the fine-tuning data causes this issue. For example, travel-related intents mainly correspond to bus travel, so the model confuses buses and trains. In other cases, the model misinterprets the intent description due to the lack of world knowledge. E.g., the generated phrases for Wallpaper may be related to wallpapers in a house; utterances for Calculator may be related to finding some numbers like the average price of houses in the area.

6 One-shot generation experiments

Based on human evaluation of zero-shot generated data, we select Nucleus Sampling () as the best decoding strategy and apply it further in the one-shot scenario. Indeed, Table 2 confirms that the one-shot generation improves all evaluation metrics, both human and automated. The resulting one-shot utterances are more fluent than zero-shot utterances. The classifier trained on one-shot utterances has higher accuracy values when compared to the one trained on zero-shot utterances.

At the same time, one-shot generation restricts the semantics of the generated utterances and reduces the semantic shift. To illustrate how the problem of semantic shift diminishes, we study several cases where the zero-shot model tends to generate utterances with undesirable meaning (see Section 5.4): bus instead of train; wallpaper as a wall cover instead of background picture; sum as amount of money instead of number. Table 5 shows that after one-shot fine-tuning, the number of utterances with undesirable meaning becomes drastically lower; for more examples, see Table 3 in Appendix.

7 Conclusion

In this paper, we have introduced zero-shot and one-shot methods for generating utterances from intent descriptions. We ensure the high quality of the generated dataset by a range of different measures for diversity, fluency, and semantic correctness, including a crowd-sourcing study. We show that the one-shot generation outperforms the zero-shot one based on all metrics considered. Using only a single utterance for an unseen intent to fine-tune the model increases diversity and fluency. Moreover, fine-tuning on a single utterance diminishes the semantic shift problem and helps the model gain better world knowledge.

Virtual assistants in real-life setups should be highly adaptive. In some tasks, we need much more data than is currently available: exploring model robustness to distribution change, finding the best architecture, dealing with a fast-growing set of intents (the number of intents could be in the thousands). If the intents to support come from different providers, they exhibit diverse semantics, style, and noise. Adaptation to different user groups and individual users, who have different intent usage distributions, is another crucial problem. We need large-scale and flexible datasets to approach these tasks, which can hardly be collected via crowd-sourcing from external sources.

Zero- or one-shot generation is an appealing technique. The model obtains the background knowledge about the world and the domain during pre-training. Next, only small amounts of data are needed to fine-tune the model. State-of-the-art pre-trained language models, fine-tuned in a zero- or one-shot fashion, generate fluent and diverse phrases close to real-life utterances. The meaning of the intent and essential details, such as book titles, movie genres, expression of speech acts, or emoticons, are preserved. What is more, manipulating a decoding strategy makes it possible to balance the generated utterances’ diversity, semantic consistency, and correctness.

Directions for future work include assessing the downstream performance of the proposed generation methods in an end-user application and evaluating slot-filling performance. The proposed approach could also be tested on generating utterances specific to particular interest groups.

Acknowledgements

Ekaterina Artemova is partially supported by the framework of the HSE University Basic Research Program.


