Characterizing Variation in Crowd-Sourced Data for Training Neural Language Generators to Produce Stylistically Varied Outputs

09/14/2018
by Juraj Juraska, et al.

One of the biggest challenges of end-to-end language generation from meaning representations in dialogue systems is making the outputs more natural and varied. Here we take a large corpus of 50K crowd-sourced utterances in the restaurant domain and develop text analysis methods that systematically characterize types of sentences in the training data. We then automatically label the training data to allow us to conduct two kinds of experiments with a neural generator. First, we test the effect of training the system with different stylistic partitions and quantify the effect of smaller, but more stylistically controlled training data. Second, we propose a method of labeling the style variants during training, and show that we can modify the style of the generated utterances using our stylistic labels. We contrast and compare these methods that can be used with any existing large corpus, showing how they vary in terms of semantic quality and stylistic control.
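The style-labeling approach described above can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' implementation: it assumes a seq2seq-style generator whose input is a linearized meaning representation (MR), and shows one common way to condition generation on a style, by prepending a special style token to the input sequence. The function names and token format are hypothetical.

```python
def linearize_mr(mr):
    """Flatten a slot-value meaning representation into a token sequence,
    e.g. {"food": "French"} -> ["food[French]"]."""
    return [f"{slot}[{value}]" for slot, value in mr.items()]

def add_style_token(mr_tokens, style_label):
    """Prepend a special style token so the encoder can condition on it
    during training, and so decoding can be steered at test time."""
    return [f"<style:{style_label}>"] + mr_tokens

# Example: an E2E-style restaurant MR labeled with a (hypothetical) style tag.
mr = {"name": "The Eagle", "food": "French", "priceRange": "cheap"}
tokens = add_style_token(linearize_mr(mr), "contrast")
print(" ".join(tokens))
# -> <style:contrast> name[The Eagle] food[French] priceRange[cheap]
```

At training time, each reference utterance would carry the style label assigned by the automatic text analysis; at generation time, swapping the token changes the style of the output without changing the semantic content of the MR.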


Related research

- 05/22/2018: Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators
- 06/04/2019: Curate and Generate: A Corpus and Method for Joint Control of Semantics and Style in Neural NLG
- 08/01/2016: Crowd-sourcing NLG Data: Pictures Elicit Better Data
- 10/26/2019: ViGGO: A Video Game Corpus for Data-To-Text Generation in Open-Domain Conversation
- 11/08/2019: A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models
- 06/28/2017: The E2E Dataset: New Challenges For End-to-End Generation
- 09/30/2020: Learning from Mistakes: Combining Ontologies via Self-Training for Dialogue Generation
