Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity

09/18/2018 ∙ by Xinnuo Xu, et al. ∙ Heriot-Watt University

We present three enhancements to existing encoder-decoder models for open-domain conversational agents, aimed at effectively modeling coherence and promoting output diversity: (1) We introduce a measure of coherence as the GloVe embedding similarity between the dialogue context and the generated response, (2) we filter our training corpora based on the measure of coherence to obtain topically coherent and lexically diverse context-response pairs, (3) we then train a response generator using a conditional variational autoencoder model that incorporates the measure of coherence as a latent variable and uses a context gate to guarantee topical consistency with the context and promote lexical diversity. Experiments on the OpenSubtitles corpus show a substantial improvement over competitive neural models in terms of BLEU score as well as metrics of coherence and diversity.




1 Introduction

End-to-end neural response generation methods are promising for developing open-domain dialogue systems, as they allow learning from very large unlabeled datasets Shang et al. (2015); Sordoni et al. (2015); Vinyals and Le (2015). However, these models have also been shown to generate generic, uninformative, and non-coherent replies (e.g., “I don’t know.” in Figure 1), mainly because neural systems tend to settle for the most frequent options, thus penalizing length and favoring high-frequency word sequences Sountsov and Sarawagi (2016); Wei et al. (2017).

To address these problems, Li et al. (2016a) and Li et al. (2017a) attempt to promote diversity by improving the objective function, but do not model diversity explicitly. Serban et al. (2017) focus on model structure without any upgrades to the objective function. Other works control the style of the output by leveraging external resources (Hu et al., 2017: sentiment classifier, time annotation; Zhao et al., 2017: dialogue acts) or focus on well-structured input such as paragraphs Li and Jurafsky (2017).

This paper extends previous attempts to model diversity and coherence by enhancing all three aspects of the learning process: the data, the model, and the objective function. While previous research has addressed these aspects individually, this paper is the first to address all three in a unified framework. Instead of using existing linguistic knowledge or labeled datasets, we aim to control for coherence by learning directly from data, using a fully unsupervised approach. This is also the first work encoding and evaluating coherence explicitly in the dialogue generation task, as opposed to using diversity, style, or other properties of responses as a proxy.

Conversational history        Response
A: You stay out of this.      B-Coh: Well, I got water.
B: So you want water, huh?    B-Incoh: I don’t know.
A: That’s right.

A: Where do we start?         B-Coh: Specifically the stove.
B: Kitchen.                   B-Incoh: Let’s go for a walk.
A: Definitely the kitchen.
Figure 1: Examples of conversational history (left) with two alternative responses to follow it (right): (B-Coh) a more coherent, topical utterance, and (B-Incoh) a generic, inconsistent response.

In this work, given a dialogue history, we regard as a coherent response an utterance that is thematically correlated and naturally continuing from the previous turns, as well as lexically diverse. For example, in Figure 1 the response “Specifically the stove.” is a very natural and coherent response, elaborating on the topic of kitchen introduced in the previous two utterances and containing rich thematic words, whereas the response “Let’s go for a walk.” is unrelated and uninteresting.

In order to obtain coherent responses, we present three generic enhancements to existing encoder-decoder (E-D) models:

  1. We define a measure of coherence simply as the averaged word embedding similarity between the words of the context and the response computed using GloVe vectors Pennington et al. (2014).

  2. We filter a corpus of conversations based on our measure of coherence, which leaves us with context-response pairs that are both topically coherent and lexically diverse.

  3. We train an E-D generator recast as a conditional Variational Autoencoder (cVAE; Zhao et al., 2017) model that incorporates two latent variables, one for encoding the context and another for conditioning on the measure of coherence, trained jointly as in Hu et al. (2017). We then decode using a context gate Tu et al. (2017) to control the generation of words that directly relate to the most topical words of the context and promote coherence.

Experiments on the OpenSubtitles Lison and Meena (2016) corpus demonstrate the effectiveness of the overall approach. Our models achieve a substantial improvement over competitive neural models. We provide an ablation analysis, quantifying the contribution of explicitly modeling coherence in our models. All our experimental code is freely available on GitHub.

2 Coherence-based Dialogue Generation

Our model aims to generate responses given a dialogue context, incorporating measures of coherence estimated purely from the training data. We propose the following enhancements to the attention-based E-D architecture Bahdanau et al. (2015); Luong et al. (2015):

  • We introduce a stochastic latent variable conditioned on previous dialogue context to store the global information about the conversation Bowman et al. (2016); Chung et al. (2015); Li and Jurafsky (2017); Hu et al. (2017).

  • We force the model to condition on the measure of coherence explicitly by encoding a latent variable (code) learned from data.

  • We incorporate a context gate Tu et al. (2017) that dynamically controls the ratio at which the generated words in the response derive directly from the coherence-enhanced dialogue context or the previously generated parts of the response.

In the rest of this section, we introduce the measure of coherence (Section 2.1), we present an overview of our model (Section 2.2), and finally describe the model in detail (Sections 2.3-2.4).

2.1 Measure of Dialogue Coherence

Semantic vector space models of language represent each word with a real-valued word embedding vector Pennington et al. (2014). By simply taking a weighted average of all its word embeddings, a whole sentence can be mapped into the semantic vector space. We define the coherence of a dialogue as the average distance between semantic vectors of preceding dialogue context and its response.

Let $x = [w^x_1, \ldots, w^x_M]$ represent a dialogue context and $y = [w^y_1, \ldots, w^y_N]$ a response. $M$ and $N$ are the numbers of words in the dialogue context and its response, respectively. Semantic vector space models map each word $w^x_i$ into an embedding $e^x_i$, and each word $w^y_j$ into $e^y_j$. The semantic representation of a dialogue context $x$ is then $e^x = \frac{1}{M}\sum_{i=1}^{M} \alpha_i e^x_i$; for a response $y$, it is $e^y = \frac{1}{N}\sum_{j=1}^{N} \beta_j e^y_j$. Here, $\alpha_i$ and $\beta_j$ are importance weights for each word in the sentence. We set the importance weights to 0 for a list of stop words (high-frequency words such as articles and prepositions, names, punctuation marks), and to 1 otherwise. The measure of coherence $c$ is then defined as the cosine distance of the two semantic vectors of the dialogue context and its response:

$$c = \cos(e^x, e^y) \tag{1}$$
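The coherence measure of Eq (1) can be illustrated with a minimal Python sketch. It assumes a plain dictionary from tokens to embedding vectors; the function names, the toy two-dimensional vectors, the stop-word set, and the decision to skip tokens without an embedding are illustrative assumptions rather than details of the released code.

```python
import numpy as np

def sentence_vector(tokens, embeddings, stop_words):
    """Weighted average of word embeddings: weight 0 for stop words, 1 otherwise."""
    dim = len(next(iter(embeddings.values())))
    total = np.zeros(dim)
    for token in tokens:
        # Assumption: tokens without an embedding are skipped like stop words.
        if token in stop_words or token not in embeddings:
            continue
        total += np.asarray(embeddings[token], dtype=float)
    return total / max(len(tokens), 1)

def coherence(context_tokens, response_tokens, embeddings, stop_words=frozenset()):
    """Cosine between the context vector and the response vector."""
    e_x = sentence_vector(context_tokens, embeddings, stop_words)
    e_y = sentence_vector(response_tokens, embeddings, stop_words)
    norm = np.linalg.norm(e_x) * np.linalg.norm(e_y)
    if norm == 0.0:
        return 0.0  # assumption: an undefined cosine is reported as 0
    return float(e_x @ e_y / norm)

# Hypothetical 2-d embeddings, for illustration only.
toy = {"kitchen": [1.0, 0.0], "stove": [0.9, 0.1], "walk": [0.0, 1.0]}
stop = {"the", "a", "for", "go"}
context = ["definitely", "the", "kitchen"]
topical = coherence(context, ["specifically", "the", "stove"], toy, stop)
generic = coherence(context, ["let's", "go", "for", "a", "walk"], toy, stop)
```

With these toy vectors, the topical response scores higher than the generic one, mirroring the contrast between the two responses in Figure 1. Because the cosine is invariant to scaling, the normalization used in the averaging step does not affect the score.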


2.2 Model Overview

End-to-end response generation for dialogue can be formalized as follows: Given a dialogue context $x$, a dialogue generator generates the next utterance $y$. During the training process, the aim for a dialogue generator is to maximize the probability $p(y \mid x)$ over the training dataset. To encode dialogue contexts that adequately incorporate coherence information, we build our generator based on the cVAE model of Hu et al. (2017), which has been used to control text generation with respect to linguistic properties, such as tense or sentiment.

In our model, the response $y$ is generated conditioned on the previous conversation $x$, a diversity-promoting latent variable $z$, and a latent variable $c$ indicating dialogue coherence; $z$ and $c$ are independent. The generation probability is defined as:

$$p(y \mid x, c) = \int p_\theta(y \mid x, z, c)\, p_\theta(z \mid x)\, dz \tag{2}$$

Unfortunately, optimizing Eq (2) during training is intractable; therefore, we apply variational inference and optimize instead the variational lower bound:

$$\mathcal{L}(\theta, \phi; x, y, c) = \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z, c)\big] - \mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big) \tag{3}$$

where $p_\theta(y \mid x, z, c)$ is the probability of generating utterance $y$ given $x$, $z$, and $c$; $q_\phi(z \mid x, y)$ stands for the approximate posterior distribution of the latent variable $z$ conditioned on dialogue context $x$ and the gold response $y$; $c$ is the measure of coherence between context $x$ and response $y$; $p_\theta(z \mid x)$ is the true prior distribution of $z$ conditioned only on dialogue context $x$; $\mathrm{KL}$ denotes the KL-divergence. We assume that both $q_\phi(z \mid x, y)$ and $p_\theta(z \mid x)$ are Gaussian with mean vectors $\mu$, $\mu'$ and covariance matrices $\Sigma$, $\Sigma'$.
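Since both the approximate posterior and the prior are assumed Gaussian, the KL term of the lower bound has a closed form. The sketch below shows that computation in Python under an additional assumption that the source does not state: the covariance matrices are diagonal and parameterized by log-variances. The function names are hypothetical.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )."""
    mu_q, logvar_q, mu_p, logvar_p = (
        np.asarray(a, dtype=float) for a in (mu_q, logvar_q, mu_p, logvar_p)
    )
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    terms = logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return 0.5 * float(np.sum(terms))

def lower_bound(log_likelihood, mu_q, logvar_q, mu_p, logvar_p):
    """Single-sample estimate of the bound: log-likelihood term minus the KL term."""
    return log_likelihood - kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

The KL term is zero when posterior and prior coincide and grows as their means or variances diverge, so maximizing the bound pulls the posterior toward the context-only prior while rewarding accurate reconstruction of the gold response.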

2.2.1 Model Details

Optimizing Eq (3) consists of two parts: (1) minimizing the KL-divergence between the approximate posterior distribution $q_\phi(z \mid x, y)$ and the true prior distribution $p_\theta(z \mid x)$ of $z$, (2) maximizing the probability of generating the gold response $y$ conditioned on dialogue context $x$ and coherence factors $z$ and $c$. Figure 2 shows the pipeline of the training procedure.

Figure 2: The training process of the generative model. First, the dialogue context $x$ is encoded: $h^x$ is the final hidden state of the context encoder. Then we derive the diversity-promoting latent variable $z$. Next, we compute the latent variable $c$ that corresponds to the measure of coherence between the dialogue context $x$ and the generated response $y$. We concatenate all three vectors into $s_0$ to feed the decoder. The attention matrix is calculated for every time step of the decoding process.

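First, we encode a dialogue context $x$ into a hidden state $h^x$ using the context encoder, which is based on Recurrent Neural Networks (RNNs). Then the posterior network encodes both dialogue context $x$ and gold response $y$ into a hidden state $h$, followed by two linear transformations $f_\mu$ and $f_\Sigma$ to map $h$ into mean vector $\mu$ and covariance matrix $\Sigma$. The latent variable $z$ can be sampled from the distribution $\mathcal{N}(\mu, \Sigma)$:

$$\mu = f_\mu(h), \quad \Sigma = f_\Sigma(h), \quad z \sim \mathcal{N}(\mu, \Sigma) \tag{4}$$

The prior network in Figure 2 takes a form similar to the posterior network:

$$\mu' = f'_\mu(h'), \quad \Sigma' = f'_\Sigma(h'), \quad z \sim \mathcal{N}(\mu', \Sigma') \tag{5}$$

where $h'$ is the final hidden state of an RNN encoding only the dialogue context $x$, and $f'_\mu$, $f'_\Sigma$ are linear transformations. Code $c$ is given by the coherence measure from Eq (1).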


We build an attention-based decoder Bahdanau et al. (2015); Luong et al. (2015) using RNNs to generate responses conditioned on encoded dialogue context $h^x$, diversity signal $z$, and coherence signal $c$. We concatenate the latent variables $z$ and $c$ to the context encoder hidden state $h^x$ and feed them into the decoder as the initial hidden state $s_0$, similar to Hu et al. (2017).
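The sampling of the diversity latent variable and the construction of the decoder's initial state can be sketched as follows. This is a minimal illustration under assumptions not stated in the text: sampling uses the standard reparameterization (mean plus scaled Gaussian noise) with a diagonal covariance given as log-variances, and the coherence code is appended as a single scalar. The function names are hypothetical; the sizes 128 and 20 follow the settings reported in Section 4.1.

```python
import numpy as np

def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(logvar, dtype=float))
    return mu + sigma * rng.standard_normal(mu.shape)

def initial_decoder_state(h_x, z, c):
    """Concatenate context state, latent variable, and scalar coherence code."""
    return np.concatenate(
        [np.asarray(h_x, dtype=float), np.asarray(z, dtype=float), [float(c)]]
    )

rng = np.random.default_rng(0)
h_x = np.zeros(128)                                   # stand-in context encoder state
z = sample_latent(np.zeros(20), np.zeros(20), rng)    # latent variable of size 20
s_0 = initial_decoder_state(h_x, z, c=0.95)           # shape (149,)
```

In an actual model the concatenated vector would typically be projected to the decoder's hidden size; that projection is omitted here.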

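During the decoding process, tokens are generated sequentially under the following probability distribution:

$$p(y \mid x, z, c) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x, z, c) \tag{6}$$

where $T$ is the length of the produced response and each factor is computed from $s_t$, the hidden state of the decoder at time step $t$. This state is produced by an RNN $f$ conditioned on the previously generated token $y_{t-1}$, the previous hidden state $s_{t-1}$, and the weighted attention vector $a_t$:

$$s_t = f(y_{t-1}, s_{t-1}, a_t) \tag{7}$$

$$a_t = \sum_{i=1}^{M} \alpha_{ti} h_i \tag{8}$$

where $M$ is the number of tokens of the dialogue context; $h_i$ is the hidden state of the encoder; the attention weight $\alpha_{ti}$ of each context hidden state is computed following Luong et al. (2015).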
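The weighted attention vector can be illustrated with a short Python sketch. Luong et al. (2015) describe several scoring functions and the text does not say which one is used, so the dot-product score below is an assumption, as are the function name and the toy vectors.

```python
import numpy as np

def attention_vector(decoder_state, encoder_states):
    """Return (a_t, weights) using dot-product scores and a softmax."""
    encoder_states = np.asarray(encoder_states, dtype=float)   # shape (M, d)
    decoder_state = np.asarray(decoder_state, dtype=float)     # shape (d,)
    scores = encoder_states @ decoder_state                    # shape (M,)
    scores = scores - scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ encoder_states, weights

# Three hypothetical encoder hidden states of dimension 2.
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
a_t, alpha = attention_vector(np.array([2.0, 0.0]), encoder_states)
```

The weights sum to one, and encoder states that align with the decoder state receive more mass, so the attention vector is a convex combination of the context hidden states.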

Context Gate:

To increase the influence of code $c$, we introduce a context gate. Unlike Tu et al. (2017), whose context gate assigns an element-wise weight to the input signal deriving from the encoder RNN, we build the context gate conditioned only on the coherence signal:



Here, the gate uses the sigmoid function and a bias term, which is set empirically against the development set. It depends on the target value of the measure of coherence, calculated as in Section 2.1, and on the measure of coherence between the dialogue context and the generated prefix sentence at the current time step, calculated in the same way. Now Eq (7) with the context gate applied can be rewritten as:


where denotes element-wise multiplication.

The coherence-informed context gate aims to dynamically control the ratio at which preceding dialogue context and previously generated tokens of the current response contribute to the generation of the next token in the response.

2.3 Training

Our generator is trained similarly to Hu et al. (2017). The objective function is a weighted combination of three losses (generation, coherence, and diversity):


To teach the generator to produce responses close to the training data, we maximize the generation probability of the training response $y$ given the dialogue context $x$ according to Eq (2). During training, we set $c$ to the measure of coherence between $x$ and the training response $y$ and minimize the following:


Apart from the generation loss, the coherence measure provides an extra learning signal which pushes the generator to produce responses that match the coherence signal given by the latent variable $c$.


In Eq (13), $p(c)$ is the prior distribution of the coherence variable $c$. To ensure that the loss is differentiable, we cannot sample words from the response vocabulary. Instead we define the soft response as the sequence of output word probability distributions. Its coherence is predicted by the coherence measure defined in Eq (1) with the response embedding set as:


where the word embedding matrix is trained using GloVe (Section 2.1).
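One way to obtain a differentiable response embedding from output distributions is to take, at each time step, the expected embedding under the predicted word distribution. The sketch below illustrates that idea only; the exact form used in the paper is not recoverable from this text, and the uniform averaging over time steps, the omission of stop-word weights, the function name, and the toy matrices are all assumptions.

```python
import numpy as np

def soft_response_vector(token_probs, embedding_matrix):
    """Average over time steps of the expected embedding under each word distribution."""
    token_probs = np.asarray(token_probs, dtype=float)            # shape (T, V)
    embedding_matrix = np.asarray(embedding_matrix, dtype=float)  # shape (V, d)
    expected = token_probs @ embedding_matrix                     # shape (T, d)
    return expected.mean(axis=0)

# Hypothetical 3-word vocabulary with 2-d embeddings and a 2-step response.
E = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
probs = np.array([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])
e_soft = soft_response_vector(probs, E)
```

When every distribution is one-hot, this reduces to the plain average of the selected word embeddings, so the soft version agrees with the discrete sentence vector in the limit of confident predictions.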

The last component in Eq (11) is the independent constraint that forces the soft distribution over the generated response to be diverse, so that it is able to faithfully reproduce the latent variable $z$:


where the latent variable is predicted by the posterior network with the output word probability distribution as the soft input to the RNN encoder at each time step.

2.4 Inference

Figure 3 shows the inference process of the generative model. Given a dialogue context $x$ and an expected coherence value $c$, the context encoder first encodes the dialogue context into a hidden state $h^x$. The prior network then generates a sample $z$ conditioned on the dialogue context. The decoder is initialized with $s_0$, i.e., the concatenation of $h^x$, $z$, and $c$. During decoding, the next word is generated via the context gate modulating between the attention-reweighted context and the previously generated words of the response.

Figure 3: The inference process of the generative model, where the latent variable $c$ is given as an input.

3 Dataset and Filtering

Dataset for Generator

We train and evaluate our models on the OpenSubtitles corpus Lison and Tiedemann (2016) with automatic dialogue turn segmentation Lison and Meena (2016). A training pair consists of a dialogue context and a corresponding response. We consider three consecutive turns as the dialogue context and the following turn as the response. From a total of 65M instances, we select those that have context and response lengths of less than 120 and 30 words, respectively. We create two datasets:

  1. OST (plain OpenSubtitles) consists of 2M/4K/4K instances as our training/development/test sets, selected randomly from the whole corpus;

  2. fOST (filtered OpenSubtitles) contains the same number of instances, but randomly selected only among those whose coherence score is above a cut-off. The coherence score is calculated as shown in Eq (1). We observed that the scores on the training set follow a normal distribution with a slight tail on the negatively correlated side, so we fit a normal distribution to the data and derive the cut-off from its parameters. A histogram of coherence scores is shown in Figure 5 in Supplemental Material A.
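The pair construction described above (three consecutive turns as context, the following turn as response, with length limits of 120 and 30 words) could be sketched as follows. The function name, the sliding-window enumeration over every turn, and whitespace tokenization are assumptions for illustration; the actual preprocessing may differ.

```python
def make_pairs(turns, context_size=3, max_context_words=120, max_response_words=30):
    """Build (context, response) pairs from a list of dialogue turns."""
    pairs = []
    for i in range(context_size, len(turns)):
        context = turns[i - context_size:i]
        response = turns[i]
        # Assumption: words are counted by splitting on whitespace.
        n_context = sum(len(turn.split()) for turn in context)
        n_response = len(response.split())
        if n_context < max_context_words and n_response < max_response_words:
            pairs.append((context, response))
    return pairs

turns = [
    "i don't even know where to start.",
    "kitchen.",
    "definitely the kitchen.",
    "specifically the stove.",
]
pairs = make_pairs(turns)
```

Here the four turns yield a single pair whose response is the last turn; shortening the limits removes pairs whose context or response is too long.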

Filtering of the OpenSubtitles corpus is motivated by the fact that by removing the video and audio modalities which the subtitles originally accompanied, we are very often left with incomplete and incoherent dialogues. Therefore, by keeping dialogues with high coherence scores, we aim at building a high quality corpus with (1) more semantically coherent and topically related contexts and responses, and (2) fewer general and dull responses. Table 1 shows the coherence and diversity metrics (cf. Section 4.2) between OST and fOST. Unsurprisingly, coherence for fOST is much higher than OST, with a slightly higher diversity. We list dialogue examples for different coherence scores in Supplemental Material B.

Dataset  Coh    D-1%  D-2%  D-Sent%
OST      0.390  14.3  57.9  83.8
fOST     0.801  15.5  62.9  89.3
Table 1: Coherence and diversity metrics for the OST and fOST datasets (see Section 3 for the datasets and Section 4.2 for metrics definition). Note that Distinct-1 and Distinct-2 are computed on randomly selected subsets of 4k responses.

Dataset for Coherence Measure

In order to accurately measure coherence on our domain using the semantic distance as defined in Section 2.1, we train GloVe embeddings on the full OpenSubtitles corpus (i.e. 100K movies).

4 Experiments

Our generator model, ablative variants, and baselines are implemented using the publicly available OpenNMT-py framework Klein et al. (2017) based on Bahdanau et al. (2015) and Luong et al. (2015). We used the publicly available glove-python package to implement our coherence measure.

We experiment on two versions of our model: (1) cVAE with the coherence context gate as described in Section 2.3 (cVAE-XGate), (2) cVAE with the original context gate implementation of Tu et al. (2017) (cVAE-CGate). For each of these, we consider the main variant where the input coherence measure is preset to a fixed ideal value as estimated on development data (1.0 for OST and 0.95 for fOST), as well as an oracle variant where we use the true coherence measure between the context and the gold-standard response in the test set (indicated with “(C)” in Tables 2 and 3).

We compare against two baseline models: (1) a vanilla E-D with attention (Attention) Luong et al. (2015); (2) an enhancement where output beams are rescored using the maximum mutual information anti-language model (MMI-antiLM) of Li et al. (2016a) (MMI).

4.1 Parameter Settings

We set our model parameters based on preliminary experiments on the development data.

We use 2-layer RNNs with LSTM cells Hochreiter and Schmidhuber (1997) with input/hidden dimension of 128 for both the context encoder and the decoder. The dropout rate is set to 0.2 and the Adam optimizer Kingma and Ba (2015) is used to update the parameters. A vocabulary of 25,000 words is shared between the encoder and the decoder.

Both the posterior network and prior network for the latent variable learning are built with 2-layer LSTM RNNs with input/hidden dimension of 64. The dimension of the latent variable is set to 20. Same as for the encoder and decoder, the dropout rate is 0.2 and the Adam optimizer is used to update the parameters.

The window size for GloVe computation in our coherence measure is set to 10.
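For reference, the settings above can be collected in one place. The following Python fragment is only a convenience sketch: the class and field names are hypothetical and do not correspond to OpenNMT-py options, while the values are those stated in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneratorConfig:
    # Context encoder and decoder
    rnn_layers: int = 2
    rnn_cell: str = "LSTM"
    enc_dec_hidden_size: int = 128
    vocab_size: int = 25_000
    # Posterior and prior networks
    latent_net_layers: int = 2
    latent_net_hidden_size: int = 64
    latent_size: int = 20
    # Optimization
    dropout: float = 0.2
    optimizer: str = "adam"
    # GloVe training for the coherence measure
    glove_window: int = 10

config = GeneratorConfig()
```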

4.2 Evaluation metrics

We use a number of metrics to evaluate the outputs of our models:

  • BLEU, B1, B2, B3 – the word-overlap score against gold-standard responses Papineni et al. (2002) used by the vast majority of recent dialogue generation works Zhao et al. (2017); Yao et al. (2017); Li et al. (2017a, 2016c); Sordoni et al. (2015); Li et al. (2016a); Ghazvininejad et al. (2017). BLEU in this paper refers to the default BLEU-4, but we also report on lower $n$-gram scores (B1, B2, B3). We use the Multi-BLEU script from OpenNMT to measure BLEU scores.

  • Coh – our novel GloVe-based coherence score calculated using Eq (1) showing the semantic distance of dialogue contexts and generated responses.

  • D-1, D-2, D-Sent – common metrics used to evaluate the diversity of generated responses (e.g. Li et al., 2016a; Xu et al., 2017; Xing et al., 2017; Dhingra et al., 2017): the proportion of distinct unigrams, bigrams, and sentences in the outputs.
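The diversity metrics in the list above can be sketched in a few lines of Python. This is a minimal illustration that assumes pre-tokenized responses and takes the proportion over all n-grams (or sentences) in the output set; the function names are hypothetical.

```python
def distinct_n(responses, n):
    """Proportion of distinct n-grams among all n-grams in the responses."""
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in responses
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def distinct_sent(responses):
    """Proportion of distinct sentences among all responses."""
    if not responses:
        return 0.0
    return len({tuple(tokens) for tokens in responses}) / len(responses)

outputs = [
    ["i", "don't", "know"],
    ["i", "don't", "know"],
    ["well", "i", "got", "water"],
]
d1 = distinct_n(outputs, 1)      # 6 distinct unigrams out of 10
d2 = distinct_n(outputs, 2)      # 5 distinct bigrams out of 7
d_sent = distinct_sent(outputs)  # 2 distinct sentences out of 3
```

A system that repeats a generic reply such as "i don't know" scores low on all three measures, which is why they complement BLEU and Coh.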

5 Results

Training data  Model               BLEU%  B1%    B2%    B3%   Coh    D-1%  D-2%  D-Sent%
OST            Attention           1.32   10.92  3.85   2.10  0.293  3.4   14.2  25.6
OST            MMI                 1.31   11.06  3.88   2.09  0.284  3.3   14.6  28.2
OST            cVAE-CGate (C)      1.58   11.86  4.45   2.48  0.311  4.1   15.0  28.2
OST            cVAE-XGate (C)      1.51   13.38  4.97   2.58  0.324  3.9   14.5  29.8
OST            cVAE-CGate (1.0)    1.60   17.08  5.78   2.86  0.404  5.0   27.1  79.7
OST            cVAE-XGate (1.0)    1.44   15.83  5.34   2.62  0.413  4.5   22.6  80.2
fOST           Attention           1.79   15.43  5.65   2.94  0.758  11.9  41.8  92.7
fOST           MMI                 1.99   16.24  6.06   3.22  0.764  11.9  44.5  95.8
fOST           cVAE-CGate (C)      2.10   15.98  6.05   3.35  0.728  11.9  37.6  88.4
fOST           cVAE-XGate (C)      1.85   16.44  5.94   3.07  0.706  10.3  31.2  80.4
fOST           cVAE-CGate (0.95)   2.02   15.52  5.78   3.16  0.767  10.6  44.8  98.7
fOST           cVAE-XGate (0.95)   1.64   14.43  5.20   2.70  0.745  9.0   36.9  98.7
Table 2: Evaluation results on the OST test set (see Section 4 for model description and Section 4.2 for metrics definition). Note that the cVAE-CGate (C) / cVAE-XGate (C) models use the true value of $c$ between the context and the gold response as input. Other cVAE-CGate / cVAE-XGate models use fixed values for $c$ selected on dev sets, shown in brackets. BLEU score reported here is BLEU-4; B1, B2 and B3 denote lower $n$-gram BLEU scores.

Training data  Model               BLEU%  B1%    B2%    B3%   Coh    D-1%  D-2%  D-Sent%
OST            Attention           0.86   8.34   2.79   1.45  0.284  3.6   14.6  29.4
OST            MMI                 0.89   8.47   2.89   1.48  0.278  3.7   15.3  31.5
OST            cVAE-CGate (C)      1.64   10.20  4.17   2.40  0.329  5.1   19.4  35.8
OST            cVAE-XGate (C)      1.80   11.70  4.90   2.83  0.359  5.2   19.2  39.7
OST            cVAE-CGate (1.0)    2.25   16.82  6.81   3.70  0.422  5.4   28.2  81.0
OST            cVAE-XGate (1.0)    2.41   18.62  7.56   4.09  0.434  4.8   23.4  84.0
fOST           Attention           3.84   16.65  8.72   5.54  0.803  12.8  43.4  88.7
fOST           MMI                 3.84   16.81  8.78   5.57  0.803  12.6  42.5  88.8
fOST           cVAE-CGate (C)      4.58   17.64  9.53   6.30  0.796  12.4  41.6  85.5
fOST           cVAE-XGate (C)      4.33   18.43  9.59   6.11  0.783  10.7  33.1  78.8
fOST           cVAE-CGate (0.95)   4.98   20.95  10.93  7.02  0.814  12.1  51.4  98.2
fOST           cVAE-XGate (0.95)   4.47   20.98  10.43  6.50  0.797  10.4  42.5  97.6
Table 3: Evaluation results on the fOST test set (see Section 4 and Table 2 for model description; see Section 4.2 for metrics definition). BLEU score reported here is BLEU-4; B1, B2 and B3 denote lower $n$-gram BLEU scores.

All model variants described in Section 4 are trained on both OST and fOST datasets. Tables 2 and 3 present the scores of all models tested on the OST and fOST test sets, respectively. Note that in addition to testing the models on the respective test sections of their training datasets, we also test them on the other dataset (OST-trained models on fOST and vice-versa). This way, we can observe the performance of the fOST-trained models in more noisy contexts and see how good the OST-trained models are when evaluated against coherent responses only.

Given all the evaluated model variants, we can observe the effects and contributions of the individual components of our setup:

  • Data filtering: The models trained on fOST consistently outperform the same models trained on OST – for all evaluation metrics and on both test sets. This shows that coherence-based training data filtering is generally beneficial.

  • cVAE-Context Gate models: Nearly all cVAE-based models perform markedly better than the baselines w.r.t. BLEU, coherence, and diversity. We performed paired bootstrap re-sampling for the best cVAE model and the best baseline model in each experiment set (Table 2 and Table 3), as is done for MT (Koehn, 2004); this confirmed statistical significance at the 99% confidence level for all cases except for models trained on fOST and tested on OST (bottom half of Table 2).

    If we look at models trained on OST and tested on fOST (the top half of Table 3), we can see that all cVAE-based models, especially cVAE-XGate, are able to learn to produce coherent and diverse responses even when trained on a noisy, incoherent corpus. Examples of responses generated by the baseline MMI model and by cVAE-XGate in Figure 4 show that cVAE-XGate mostly produces more diverse and coherent responses than MMI.

  • Preset vs. oracle models with gold-standard $c$: Table 2 shows that on the noisy OST test set, cVAE-based models using the gold-standard value of $c$ achieve higher BLEU scores than models using preset $c$. This is expected since many gold-standard responses in the unfiltered set have a low coherence score; the model can generate a more generic response if the gold-standard $c$ is low. The models with preset $c$ always attempt to generate coherent responses, which is apparent from the other metrics: Coh and D-Sent are consistently higher than for models using gold-standard $c$.

    On the fOST test set where only high-coherence responses are expected, models using fixed $c$ consistently reach higher scores in all metrics including BLEU (see Table 3). This shows that in general, using a preset constant value of $c$ works well, even better than using the gold-standard $c$.
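The paired bootstrap re-sampling test mentioned in the list above can be sketched as follows. Note the simplification: Koehn (2004) recomputes a corpus-level metric such as BLEU on every resampled test set, whereas this sketch sums hypothetical per-example scores as a stand-in. The function name and parameters are illustrative assumptions.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Share of resampled test sets on which system A beats system B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Resample test items with replacement; both systems share the sample.
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / n_samples
```

Under this scheme, system A would be called significantly better at the 99% level when the returned share is at least 0.99.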

In sum, using our coherence measure both for data filtering and inside the models leads to output performance improvements.

#  Dialogue context                                           Response
1  A: i have an audition at 4:00.                             B-GT: you volunteered.
   B: apparently now i’m cooking dinner.                      B-MMI: i don ’t know.
   A: how did this all happen?                                B-XGT: well , i’m going to have a dinner.

2  A: it’s not that easy.                                     B-GT: the waiting list would take too long.
   B: we can’t contact potential donors.                      B-MMI: i don’t know.
   A: if we want to save frank, his brother is his only hope  B-XGT: we can’t save him.

3  A: that fish ate my fish!                                  B-GT: you’ll wait till next year to win one of these.
   B: you cost me £30,000                                     B-MMI: i’m sorry.
   A: sorry, folks.                                           B-XGT: you owe me a little fish.

4  A: was it what you wanted?                                 B-GT: you like model planes?
   B: no.                                                     B-MMI: i wanted to know.
   A: i wanted an airplane.                                   B-XGT: i wanted a helicopter.

5  A: great.                                                  B-GT: yeah, a vanilla decaf latte.
   B: thanks so much.                                         B-MMI: no.
   A: vanilla decaf latte?                                    B-XGT: yeah , that ’s a good coffee

6  A: to make everybody happy.                                B-GT: and stick it in your mouth.
   B: which means that sometimes                              B-MMI: and you got to take a shower.
   A: you got to take a pile of poop                          B-XGT: and you got to take a walk.

7  A: now you’ll knock again and say you forgot to ask        B-GT: should i knock and make her come here for that?
   B: if she has a friend for your friend.                    B-MMI: you’re crazy!
   A: are you crazy?                                          B-XGT: i’m crazy, i’m crazy.

8  A: i knew you would say that.                              B-GT: you are all way too pretty to be arguing.
   B: mr. kerem is a charming and decent guy.                 B-MMI: he’s a very good man.
   A: and he’s really into you.                               B-XGT: i didn’t say that.
Figure 4: Generation output examples comparing two models trained on the OST dataset and tested on the fOST test set: the MMI baseline (B-MMI) and our cVAE-XGate (1.0) model (B-XGT). B-GT is the ground-truth response from the test set. The three sequential dialog turns on the left are the preceding dialogue context used to generate the responses. Corresponding topical phrases are underlined. We can see that cVAE-XGate (1.0) mostly produces markedly more coherent and specific outputs than MMI (1-5). In some cases, it is comparable with MMI (6-7) and occasionally, it is less coherent (8).

6 Related Work

Our work fits into the context of the very active area of end-to-end generative conversation models, where neural E-D approaches were first applied by Vinyals and Le (2015) and have been extended by many others since.

Many works address the lack of diversity and coherence in E-D outputs Sountsov and Sarawagi (2016); Wei et al. (2017) but do not attempt to model coherence directly, unlike our work: Li et al. (2016a) use anti-LM reranking; Li et al. (2016c) modify the beam search decoding algorithm, similar to Shao et al. (2017) in addition to using a self-attention model.

Mou et al. (2016) predict keywords for the output in a preprocessing step while Wu et al. (2018) preselect a vocabulary subset to be used for decoding. Li et al. (2016b) focus specifically on personality generation (using personality embeddings) and Wang et al. (2017) promote topic-specific outputs by language-model rescoring and sampling.

A lot of recent works explore the use of additional training signals and VAE setups in dialogue generation. In contrast to this paper, they do not focus explicitly on coherence: Asghar et al. (2017) use reinforcement learning with human-provided feedback, Li et al. (2017a) use an RL scenario with length as the reward signal, Li et al. (2017b) add an adversarial discriminator to provide RL rewards (discriminating between human and machine outputs), and Xu et al. (2017) use a full adversarial training setup. The most recent works explore the usage of VAEs: Cao and Clark (2017) explore a vanilla VAE setup conditioned on a dual encoder (for contexts and responses) during training, while the model of Serban et al. (2017) uses a VAE in a hierarchical E-D model. Shen et al. (2017) use a cVAE conditioned on sentiment and response genericity (based on a handwritten list of phrases). Shen et al. (2018) combine a cVAE with a plain VAE in an adversarial fashion.

We also draw on ideas from areas other than dialogue generation to build our models: the context gates of Tu et al. (2017) originate from machine translation and the cVAE training of Hu et al. (2017) stems from free-text generation.

7 Conclusions and Future Work

We showed that explicitly modeling coherence and optimizing towards coherence and diversity leads to better-quality outputs in dialogue response generation. We introduced three extensions to current encoder-decoder response generation models: (1) we defined a measure of coherence based on GloVe embeddings Pennington et al. (2014), (2) we filtered the OpenSubtitles training corpus Lison and Meena (2016) based on this measure to obtain coherent and diverse training instances, (3) we trained a cVAE model based on Hu et al. (2017) and Tu et al. (2017) that uses our coherence measure as one of the training signals. Our experimental results showed a considerable improvement in the output quality over competitive models, which demonstrates the effectiveness of our approach.

In future work, we plan to replace the GloVe-based measure of coherence with a trained discriminator that distinguishes between coherent and incoherent responses Li and Jurafsky (2017). This will allow us to extend the notion of coherence to account for phenomena such as topic shifts. We also plan to verify the results with a human evaluation study.


This research received funding from the EPSRC project MaDrIgAL (EP/N017536/1). The Titan Xp used for this research was donated by the NVIDIA Corporation.


Appendix A Determining Cut-off Coherence Score

As shown in Figure 5, the scores on the training set roughly follow a normal distribution with a slight tail on the negatively correlated side. We make the assumption that the data fit a normal distribution and estimate its mean and standard deviation. We set the cut-off so that it accounts for 95% of the scores and does not severely filter the number of resulting examples in the dataset.
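The cut-off procedure can be illustrated with a small Python sketch. The fitted parameter values and the exact cut-off are not given in this text, so the multiplier `k` below is a hypothetical parameter, and keeping scores at or above the cut-off is an assumption based on the goal of retaining high-coherence dialogues; the function names are illustrative.

```python
import numpy as np

def coherence_cutoff(scores, k):
    """Fit a normal distribution to the scores and return mean + k * std."""
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    return float(mu + k * sigma)

def keep_coherent(pairs, scores, cutoff):
    """Keep the pairs whose coherence score reaches the cut-off."""
    return [pair for pair, score in zip(pairs, scores) if score >= cutoff]

scores = [0.0, 1.0, 2.0, 3.0, 4.0]           # toy scores, not real data
cutoff = coherence_cutoff(scores, k=0.0)     # k is a hypothetical multiplier
kept = keep_coherent(list("abcde"), scores, cutoff)
```

Varying `k` trades dataset size against average coherence: a larger value keeps fewer but more coherent pairs.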

Figure 5: Histogram of coherence scores on the training set before filtering.

Appendix B Dialogue Examples from the Training Set

Tables 4-7 show dialogue examples from the training set with different coherence scores. When coherence scores are over 0.9 (Table 4), the instances are strongly lexically coherent, repeating the same words in the response. When scores are between 0.7 and 0.9 (Table 5), the responses use synonyms of words in the context but still remain semantically coherent. When scores are between 0.4 and 0.6 (Table 6), the responses are coherent but very dull; they become incoherent when scores are lower (Table 7).

Dialogue contexts Responses
yeah? ## you’re new in town, right? you call this a town?
so you can have marie all to yourself? ## you’re so selfish, catherine. you’ve always been selfish.
wait. ## where are you going? ## to find the president. the president, he lives!
that’s not good! ## it ’s fine. it itches a bit at first, but then it stops. of course it itches!
you stay out of this. ## so you want water, huh? ## that’s right. well, i got water.
Table 4: Dialogue examples with coherence scores over 0.9 (“##” in the context denotes turn boundaries).

Dialogue contexts Responses
not quite yet. ## call your grandfather to pick me up. ## i want to go home. grandpa ’s not here .
some kind of whisky nobody’s ever heard of. ## why don’t you bring your own bottle? give him the best bourbon you got, hot stuff, and don’t be gone too long.
put your head on my shoulder? ## denny ## i just want to remember. i don’t think my neck even bends, anymore.
i don’t even know where to start. ## kitchen. ## definitely the kitchen. specifically the stove.
the problem is, the liquid just stays in your gut. ## i don’t know what to do. well, obviously it’s not getting absorbed into the bloodstream.
Table 5: Dialogue examples with coherence scores between 0.7 and 0.9 (“##” in the context denotes turn boundaries).

Dialogue contexts Responses
i’m gonna hold it. ## take a look at it. ## make sure it was an accident. doesn’t sound right.
in fact, thank you for underplaying it. ## so the boy becomes a man it’s amazing.
oh, sting? ## little bit. ## can we stop at the drug store? oh, uh, don’t worry.
there’s no one in there! ## it’s not gonna happen. ## it’s a private party. nice going .

give me that pad. ## what are you gonna do? ## just watch the old mastermind. what are you doing?
Table 6: Dialogue examples with coherence scores between 0.4 and 0.6 (“##” in the context denotes turn boundaries).

Dialogue contexts Responses
otherwise it’s all over do whatever he wants ## alright. ## so? how much do you want?
i’ll let it go. ## don ’t worry. ## i’ve totally let it go. it’s all water under the bridge.
i’m so sorry. ## mag? ## thank you. i’m going to go for a walk.
i dont want it. ## why ## ok you want the car.
he’s thinking about leaving the theatre. ## we’ve saved you a seat here. come on. look at her hair.
Table 7: Dialogue examples with lower coherence scores (“##” in the context denotes turn boundaries).