LSTM based Conversation Models

03/31/2016 ∙ by Yi Luan, et al. ∙ Georgia Institute of Technology ∙ University of Washington

In this paper, we present a conversational model that incorporates both context and participant role for two-party conversations. Different architectures are explored for integrating participant role and context information into a Long Short-term Memory (LSTM) language model. The conversational model can function as a language model or a language generation model. Experiments on the Ubuntu Dialog Corpus show that our model can capture multiple turn interaction between participants. The proposed method outperforms a traditional LSTM model as measured by language model perplexity and response ranking. Generated responses show characteristic differences between the two participant roles.




1 Introduction

As automatic language understanding and generation technology improves, there is increasing interest in building human-computer conversational systems, which can be used for a variety of applications such as travel planning, tutorial systems or chat-based technical support. Most work has emphasized understanding or generating a word sequence associated with a single sentence or speaker turn, potentially leveraging the previous turn. Beyond local context, language use in a goal-oriented conversation reflects the global topic of discussion, as well as the respective role of each participant. In this work, we introduce a conversational language model that incorporates both local and global context together with participant role.

In particular, participant roles (or speaker roles) impact the content of a sentence in terms of both the information to be communicated and the interaction strategy, affecting both meaning and conversational structure. For example, in broadcast news, speaker roles are shown to be informative for discovering story structure [1]; they impact speaking time and turn-taking [2]; and they are associated with particular phrase patterns [3]. In online discussions, speaker role is useful for detecting authority claims [4]. Other work shows that in casual conversations, speakers with different roles are likely to use different discourse markers [5]. For the Ubuntu technical support data used in this study, Table 1 illustrates differences in the distributions of frequent words for the Poster vs. Responder roles. The Poster role tends to raise questions using words such as anyone and how. The Responder role tends to use directive words (you, you’re), hedges (may, might) and words related to problem solving (sudo, check).

Role        Top words
Poster      hi, hello, anyone, hey, guys, ideas, thanks, thank, my, how, am, ??, cannot, I’m, says
Responder   you’re, your, probably, you, may, might, sudo, ->, search, sure, ask, maybe, most, check, try

Table 1: Top 15 words based on the role likelihood ratio out of the subset with word count > 6k.
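
A ranking like the one in Table 1 can be sketched in a few lines. The text does not define the role likelihood ratio, so this sketch assumes it is the ratio of smoothed unigram probabilities p(w | role) / p(w | other role); the function name `top_role_words`, the add-one smoothing and the toy sentences are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def top_role_words(role_tokens, other_tokens, min_count=1, top_n=15):
    """Rank words by p(w | role) / p(w | other role) (an assumed definition)."""
    role_counts = Counter(role_tokens)
    other_counts = Counter(other_tokens)
    vocab = set(role_counts) | set(other_counts)
    # Add-one smoothing so words unseen for one role do not divide by zero.
    role_total = sum(role_counts.values()) + len(vocab)
    other_total = sum(other_counts.values()) + len(vocab)
    ratios = {}
    for word in vocab:
        # Keep only words that are frequent enough overall.
        if role_counts[word] + other_counts[word] < min_count:
            continue
        p_role = (role_counts[word] + 1) / role_total
        p_other = (other_counts[word] + 1) / other_total
        ratios[word] = p_role / p_other
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]

# Toy data, invented for illustration only.
poster = "hi anyone know how i fix this how".split()
responder = "you might try sudo apt-get then check this".split()
print(top_role_words(poster, responder, top_n=1))
```

In the paper's setting, `min_count` would correspond to the 6k word-count threshold mentioned in the caption.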

Specifically, we propose a neural network model that builds on recent work in response generation, integrating different methods that have been used for capturing local (previous sentence) context and more global context, and extending the network architecture to incorporate role information. The model can be used as a language model, as in speech recognition or translation, but our focus here is on response generation. Experiments are conducted with Ubuntu chat logs, using language model perplexity and response ranking, as well as qualitative analysis.

2 Related Work

Data-driven methods are now widely used for building conversation systems. With the popularity of social media, such as Twitter, Sina Weibo, and online discussion forums, it is easier to collect conversation text [6, 7]. Several different data-driven models have been proposed to build conversation systems. Ritter et al. [8] present a statistical machine translation based conversation system. Recently, neural network models have been explored. The flexibility of neural network models opens the possibility of integrating different kinds of information into the generation procedure. For example, Sordoni et al. [9] present a way to integrate contextual information via feed-forward neural networks. Li et al. propose using Maximum Mutual Information (MMI) as the objective function in neural models in order to produce more diverse and interesting responses. Shang et al. [10] introduce the attention mechanism into an encoder-decoder network for a conversation model. Most similar to our work is the Semantically Conditioned LSTM (SC-LSTM) proposed by Wen et al. [11], where a dialog-act component is introduced into the LSTM cell to guide the generated content. In this work, we utilize the role information to bias response generation without modifying LSTM cells.

Efficiently capturing local and global context remains an open problem in language modeling. Different ways of modeling document-level context have been explored in [12] and [13] based on the LSTM framework. Luan et al. [14] proposed a multi-scale recurrent architecture to incorporate both word and turn level context for spoken language understanding tasks. In this paper, we use an approach similar to [16], explicitly using Latent Dirichlet Allocation (LDA) as a global-context feature to feed into an RNNLM.

Early work on incorporating local context in conversational language modeling is described in [17], which conditions on the most recent word spoken by other speakers. Hutchinson et al. [18, 19] improve a log-bilinear language model by introducing a multi-factor sparse matrix that can capture speaker role and topic information. In addition, Huang et al. [20] show that language models with role information significantly reduce word error rate in speech recognition. Our work differs from these approaches in using an LSTM. Recently, Li et al. propose adding an additional vector to the LSTM in order to capture personal characteristics of a speaker [21]. In this work, we utilize both a global topic vector and role information, where a role-specific weight matrix biases the word distributions for different roles.

3 Model

In this section, we propose an LSTM based framework that integrates participant role and the global topic of the conversation. As discussed in Section 1, the assumption is that, given the same context, each role has its own preference in picking words to generate a response. Each generated response should be both topically related to the current conversation and coherent with the local context.

3.1 Recurrent Neural Network Language Models

We start building a response generation model [9] by using a recurrent neural network language model (RNNLM) [22]. In general, an RNNLM is a generative model of sentences. For a sentence consisting of the word sequence $w_1, \dots, w_T$, the probability of $w_t$ given $w_1, \dots, w_{t-1}$ is

$$p(w_t \mid w_1, \dots, w_{t-1}) = g_W(h_t) \tag{1}$$

where $h_t$ is the current hidden state and $g_W(\cdot)$ is the probability function parameterized by $W$:

$$g_W(h_t) = \mathrm{softmax}(W h_t) \tag{2}$$

where $W$ is the output layer parameter. The hidden state $h_t$ is computed recurrently as

$$h_t = f_\theta(h_{t-1}, w_{t-1}) \tag{3}$$

where $f_\theta(\cdot)$ is a nonlinear function parameterized by $\theta$. We use an LSTM [23] since it is good at capturing long-term dependencies, which is an objective for our conversation model.

3.2 Conversation Models with Speaker Roles

To build a conversation model with different participant roles, we extend a RNNLM in two respects. First, to capture the variability from different participant roles, we incorporate role-based information into the generation procedure. Second, to model a conversation instead of single turns, our model adjoins RNNLMs for all turns in sequence to model the whole conversation.

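More specifically, consider two adjacent turns $w_{k-1} = \{w_{k-1,1}, \dots, w_{k-1,N_{k-1}}\}$ and $w_k = \{w_{k,1}, \dots, w_{k,N_k}\}$ with participant roles $r_{k-1}$ and $r_k$ respectively, where $N_k$ is the number of words in the $k$-th turn. (In our formulation, we use one turn as the minimal unit, as multiple sentences in one turn share the same role.) To build a single model for the entire conversation, we simply concatenate the RNNLMs for all turns in order. Concatenation changes the way of computing the first hidden state in each turn (except the first turn in the conversation). Considering the two turns $w_{k-1}$ and $w_k$, after concatenation, the computation of the first hidden state in turn $w_k$, $h_{k,1}$, is

$$h_{k,1} = f_\theta(h_{k-1,N_{k-1}}, w_{k-1,N_{k-1}}) \tag{4}$$

where $f_\theta$ is the recurrent function of Section 3.1. In other words, the first hidden state of a turn is computed from the last hidden state and last word of the previous turn rather than from a fresh initial state.

As we will see in Section 4, this simple solution can capture long-term contextual information.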

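We introduce the role-based information by defining a role-dependent function $g^{(r)}(\cdot)$. For example, the probability of $w_{k,t}$, the $t$-th word of turn $k$, given $w_{k,1}, \dots, w_{k,t-1}$ and its role $r_k$ is

$$p(w_{k,t} \mid w_{k,1}, \dots, w_{k,t-1}, r_k) = g^{(r_k)}(h_{k,t}) \tag{5}$$

where $h_{k,t}$ is the corresponding hidden state and $g^{(r_k)}(\cdot)$ is also parameterized by the role $r_k$. In our implementation, we use

$$g^{(r_k)}(h_{k,t}) = \mathrm{softmax}(W_{r_k} h_{k,t}) \tag{6}$$

where $W_{r_k} \in \mathbb{R}^{V \times H}$, and $V$ and $H$ are the vocabulary size and hidden layer dimension respectively. Even though the recurrent function is shared across the entire conversation model, $W_{r_k}$ is role-specific. This linear transformation defined in Eq. 6 is easy to train in practice and appears to capture role information. This model is named the R-Conv model, as the role-based information is introduced in the output layer.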
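
The two extensions described in this section can be illustrated with a small sketch: the hidden state is carried across turn boundaries, and only the output matrix depends on the role. This is a minimal illustration under stated assumptions rather than the authors' implementation: a plain tanh recurrence stands in for the LSTM cell, and all names (`recurrent_step`, `conversation_log_prob`, `W_role`) and the toy weights are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def recurrent_step(h_prev, prev_word, W_hh, embeddings):
    # Toy tanh recurrence used here in place of an LSTM cell (assumption).
    pre = matvec(W_hh, h_prev)
    return [math.tanh(p + e) for p, e in zip(pre, embeddings[prev_word])]

def conversation_log_prob(turns, W_hh, embeddings, W_role, start_id=0):
    """Sum of log p(word | history, role) over a whole conversation.

    turns: list of (role, [word ids]). The hidden state is never reset
    between turns; only the output matrix W_role[role] changes with the role.
    """
    h = [0.0] * len(W_hh)
    prev_word = start_id
    total = 0.0
    for role, words in turns:
        for word in words:
            h = recurrent_step(h, prev_word, W_hh, embeddings)  # shared recurrence
            probs = softmax(matvec(W_role[role], h))            # role-specific output
            total += math.log(probs[word])
            prev_word = word
    return total

# Toy sizes and weights, invented for illustration.
V, H = 4, 2
W_hh = [[0.5, -0.2], [0.1, 0.3]]
embeddings = [[0.1 * (i + j) for j in range(H)] for i in range(V)]
W_role = {
    "Poster": [[0.2 * i, -0.1 * i] for i in range(V)],
    "Responder": [[-0.3 * i, 0.4 * i] for i in range(V)],
}
turns = [("Poster", [1, 2]), ("Responder", [3, 1])]
print(conversation_log_prob(turns, W_hh, embeddings, W_role))
```

Using one shared output matrix for every role recovers a plain RNNLM run over the concatenated conversation, which is the baseline used later.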

Despite the difference between the two models, they can be learned in the same way, which is similar to training an RNNLM [22]. Following the way of training a language model, the parameters can be learned by minimizing the following objective function

$$\ell = \sum_{k} \sum_{t} L(\hat{w}_{k,t}, w_{k,t}) \tag{7}$$

where the sums run over all turns $k$ and word positions $t$, and $\hat{w}_{k,t}$ is the prediction of $w_{k,t}$. $L$ can be any loss function for a classification task. We choose cross entropy [24] as the loss function $L$, because it is a popular objective function used in training neural language models.
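
As a small worked example of the cross-entropy objective, and of the perplexity metric reported in Section 4, the sketch below computes both from predicted word distributions. The helper names `cross_entropy` and `perplexity` are illustrative; perplexity is taken as the exponential of the average per-word cross entropy, which is the standard definition.

```python
import math

def cross_entropy(predicted, targets):
    """Average negative log-likelihood of the target word ids.

    predicted: one probability distribution (list of floats) per position.
    targets:   the observed word id at each position.
    """
    nll = -sum(math.log(dist[t]) for dist, t in zip(predicted, targets))
    return nll / len(targets)

def perplexity(predicted, targets):
    # Perplexity is the exponential of the average cross entropy.
    return math.exp(cross_entropy(predicted, targets))

# A model that is uniform over 4 words has perplexity 4.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
print(perplexity(uniform, [0, 1, 2]))
```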

As a final comment, if we eliminate the role information, R-Conv will be reduced to an RNNLM. To demonstrate the utility of role-based information, we will use an RNNLM over conversations as a baseline model.

3.3 Incorporating global topic context

In order to capture long-span context of the conversation, inspired by [16], we explicitly include a topic vector representing all previous dialog turns. We use Latent Dirichlet Allocation (LDA) to achieve a compact vector-space representation. This procedure maps a bag-of-words representation of a document into a low-dimensional vector which is conventionally interpreted as a topic representation. For each turn $w_k$, we compute the LDA representation for all previous turns

$$s_k = f_{LDA}(w_1, \dots, w_{k-1}) \tag{8}$$

where $f_{LDA}(\cdot)$ is the LDA inference function as in [15]. Then $s_k$ is concatenated with the hidden layer $h_{k,t}$ to predict $w_{k,t}$:

$$p(w_{k,t} \mid w_{k,1}, \dots, w_{k,t-1}, s_k) = \mathrm{softmax}(W [h_{k,t}; s_k]) \tag{9}$$

where $[h_{k,t}; s_k]$ denotes concatenation and $W$ is the output layer parameter. This model is named LDA-Conv. We assume that by including $s_k$ in the output layer, the predicted word will be more topically related to the previous turns, thus allowing the recurrent part to learn more local context information.

When incorporating both the global topic vector and the role factor, the conditional probability of $w_{k,t}$ is

$$p(w_{k,t} \mid w_{k,1}, \dots, w_{k,t-1}, r_k, s_k) = \mathrm{softmax}(W_{r_k} [h_{k,t}; s_k]) \tag{10}$$

where $W_{r_k}$ is the output weight matrix specific to role $r_k$. We call this model, illustrated in Figure 1, R-LDA-Conv.
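
The bookkeeping behind the topic feature can be sketched as follows: each turn receives a topic vector inferred from the preceding turns only, and that vector is appended to the hidden state before the output layer. The paper uses LDA inference for this step; `toy_infer_topics` below is a hypothetical keyword-based stand-in so that the example stays self-contained, and the other names are likewise illustrative.

```python
def topic_inputs(turns, infer_topics):
    """For each turn k, infer a topic vector from turns 1..k-1 only."""
    vectors, history = [], []
    for words in turns:
        vectors.append(infer_topics(history))
        history = history + list(words)
    return vectors

def output_layer_input(hidden, topic_vector):
    # The output layer sees the concatenation [h; s].
    return list(hidden) + list(topic_vector)

def toy_infer_topics(words, keywords=("disk", "partition")):
    # Hypothetical stand-in for LDA inference with two "topics".
    if not words:
        return [0.5, 0.5]
    share = sum(w in keywords for w in words) / len(words)
    return [share, 1.0 - share]

# Toy conversation, invented for illustration.
turns = [["my", "disk", "died"], ["try", "fdisk"], ["ok"]]
vectors = topic_inputs(turns, toy_infer_topics)
print(vectors)
print(output_layer_input([0.1, -0.2], vectors[1]))
```

Because the topic vector is fixed within a turn, it only needs to be computed once per turn and can be reused at every word position.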

Figure 1: The R-LDA-Conv  model. The turn-level LDA feature is concatenated with word-level hidden layer and the output weight matrix is role specific.

4 Experiments

We evaluate our model from different aspects on the Ubuntu Dialogue Corpus [7], which provides one million two-person conversations extracted from Ubuntu chat logs. The conversations are about getting technical support for various Ubuntu-related problems. In this corpus, each conversation contains two users with different roles. Poster: the user who initiates the conversation by posing a technical problem; Responder: the other user, who tries to provide technical support. For each conversation, we replace the user of each turn with the corresponding role.

4.1 Experimental setup

Our models are trained on a subset of the Ubuntu Dialogue Corpus in which each conversation contains 6 to 20 turns. The resulting data contains 216K conversations in the training set, 10K conversations in the test set and 13K conversations in the development set. We use a Twitter tokenizer [25] to parse all utterances in the conversations. The vocabulary is constructed on the training set by filtering out low-frequency tokens and replacing them with “UNKNOWN”. The vocabulary size is fixed to include the 20K most frequent words. We did not filter out emoticons; instead, we treat them as single tokens.
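
The vocabulary construction described above can be sketched as follows. This is a minimal illustration that assumes the input is already tokenized; the function names are hypothetical, and whether the “UNKNOWN” symbol counts toward the 20K limit is not stated in the text, so here it is kept as an extra entry.

```python
from collections import Counter

def build_vocab(tokenized_turns, max_size=20000, unk="UNKNOWN"):
    """Map the max_size most frequent training tokens to integer ids."""
    counts = Counter(tok for turn in tokenized_turns for tok in turn)
    vocab = {unk: 0}  # id 0 is reserved for out-of-vocabulary tokens
    for tok, _ in counts.most_common(max_size):
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(turn, vocab, unk="UNKNOWN"):
    # Tokens outside the vocabulary are replaced by the UNKNOWN id.
    return [vocab.get(tok, vocab[unk]) for tok in turn]

# Toy training data; the emoticon is kept as a single token.
train = [["sudo", "apt", "sudo"], ["apt", "sudo", ":)"]]
vocab = build_vocab(train, max_size=2)
print(vocab)
print(encode(["sudo", ":)", "apt"], vocab))
```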

The LDA model is trained using all conversations in the training data, where each conversation is treated as an individual training instance. We use Gensim [26] for both training and inference. There are three hyper-parameters in our models: the dimension of the word representation, the hidden dimension, and the number of topics in the LDA model. We use grid search over these hyper-parameters and select the best combination for each model using the perplexity on the development set. We use stochastic gradient descent to train all the models.

4.2 Evaluation Metrics

Evaluation of response generation is an emerging research field in data-driven conversation modeling. Due to the variety of possible responses for a given context, it is too conservative to only compare the generated response with the ground truth. For a reasonable evaluation, the n-gram based evaluation metrics, including BLEU [27] and ΔBLEU [28], require multiple references for one given context. On the other hand, there are indirect evaluation methods, for example, ranking-based evaluation [7, 10] or qualitative analysis [29]. In this paper, we use ranking-based evaluation (Recall@K [30]) across all models, and leave the n-gram based evaluation for future work. To compute the Recall@K metric of one model given K, the model is used to select the top-K candidates, and a test sample is counted as correct if the ground-truth response is included. In addition to Recall@K, we also evaluate the different models based on test set perplexity.
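
The Recall@K computation described above can be sketched as follows. The sketch assumes that each candidate response has already been assigned a model score (for example, its log-likelihood given the context), with higher meaning better, and that the ground-truth response is stored at index 0 of each list. These conventions and the function name are illustrative choices, not taken from the paper.

```python
def recall_at_k(score_lists, k):
    """Fraction of test samples whose ground truth is among the top-k candidates.

    score_lists: one list of candidate scores per test sample, with the
    ground-truth response at index 0. Ties are resolved in favor of the truth.
    """
    hits = 0
    for scores in score_lists:
        truth = scores[0]
        # Number of distractors that the model ranks strictly above the truth.
        better = sum(1 for s in scores[1:] if s > truth)
        if better < k:
            hits += 1
    return hits / len(score_lists)

# Three toy test samples with two distractors each.
scores = [
    [-1.0, -2.0, -3.0],  # truth ranked 1st
    [-3.0, -1.0, -2.0],  # truth ranked 3rd
    [-2.0, -1.0, -3.0],  # truth ranked 2nd
]
print(recall_at_k(scores, 1), recall_at_k(scores, 2))
```

With one ground-truth response and 9 distractors, as in the ranking task of Section 4.3.2, a random ranker would be expected to reach Recall@1 = 0.1 and Recall@2 = 0.2.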

Understanding the chat conversations requires intensive knowledge of Ubuntu, even for human readers. Therefore, the qualitative analysis focuses mainly on the capacity to capture role information, not on whether the responses are valid answers to the technical questions.

Model        Word dim.   Hidden dim.   LDA topics   Perplexity
Baseline     32          128           -            54.93
R-Conv       256         128           -            48.89
LDA-Conv     256         128           100          51.13
R-LDA-Conv   256         128           50           46.75

Table 2: The best perplexity numbers of the three models on the development set.

Metric     Baseline   R-Conv   LDA-Conv   R-LDA-Conv
Recall@1   0.12       0.15     0.13       0.16
Recall@2   0.22       0.25     0.24       0.26

Table 3: The performance of response ranking with Recall@K.

4.3 Quantitative Evaluation

Experiments in this section compare the performance of LDA-Conv, R-Conv  and R-LDA-Conv  to the baseline LSTM system.

4.3.1 Perplexity

The best perplexity numbers from the three models are shown in Table 2. R-LDA-Conv gives the lowest perplexity among the four models, an improvement of about 8 points over the baseline model. Comparing role vs. global topic, role yields the bigger improvement in perplexity: a reduction of 6.04 points (54.93 to 48.89) for role vs. 3.80 points (54.93 to 51.13) for LDA topic. Combining both leads to a reduction of 8.18 points (54.93 to 46.75). To simplify the comparison, in the following experiments, we only use the best configuration for each model.
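
The comparison can be made concrete with a few lines of arithmetic on the Table 2 values. The sketch below computes absolute and relative perplexity reductions with respect to the baseline; the helper name `reductions` is illustrative, and the numbers are the development-set values from Table 2.

```python
# Development-set perplexities from Table 2.
dev_perplexity = {
    "Baseline": 54.93,
    "R-Conv": 48.89,
    "LDA-Conv": 51.13,
    "R-LDA-Conv": 46.75,
}

def reductions(perplexities, baseline="Baseline"):
    """Absolute and relative (%) perplexity reduction of each model vs. the baseline."""
    base = perplexities[baseline]
    result = {}
    for name, value in perplexities.items():
        if name == baseline:
            continue
        result[name] = (base - value, 100.0 * (base - value) / base)
    return result

for name, (absolute, relative) in reductions(dev_perplexity).items():
    print(f"{name}: -{absolute:.2f} points ({relative:.1f}%)")
```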

4.3.2 Response ranking

The task is to rank the ground-truth response against some randomly selected sentences for a given context. For each test sample, we use the preceding sentences as context and try to select the best next sentence. We randomly select 9 turns from other conversations in the dataset, replacing their role with the ground-truth label. We noticed that sentences from the background channel, like “yes” or “thank you”, could fit almost all conversations regardless of context. To distinguish the background channel from more contentful sentences, we sample the negative examples with the ground-truth sentence length as a constraint: samples of similar length (within 2 words) are selected as negative examples.
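
The length-constrained sampling of negative examples can be sketched as follows. This is a minimal illustration; the function name, the candidate pool and the fixed random seed are assumptions made for the example, not details from the paper.

```python
import random

def sample_negatives(truth, pool, n=9, max_len_diff=2, rng=None):
    """Pick n distractor turns whose length is within max_len_diff words of the truth.

    truth: the ground-truth response as a list of tokens.
    pool:  candidate turns (lists of tokens) taken from other conversations.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the example is reproducible
    eligible = [c for c in pool if abs(len(c) - len(truth)) <= max_len_diff]
    if len(eligible) < n:
        raise ValueError("not enough candidates of similar length")
    return rng.sample(eligible, n)

# Toy pool: turns of 1 to 12 tokens, three of each length.
pool = [["w"] * length for length in range(1, 13) for _ in range(3)]
truth = ["make", "an", "iso", "on", "usb", "bootable"]
negatives = sample_negatives(truth, pool)
print(sorted(len(c) for c in negatives))
```

The length constraint prevents short generic replies from serving as distractors for long, content-heavy responses, and vice versa.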

The Recall@K results are shown in Table 3. Both R-Conv and LDA-Conv are better than the baseline, while R-LDA-Conv gives the best performance overall. Both the role factor and the topic feature contribute positively to ranking ground-truth responses. Even though no role information is explicitly used in the baseline model, the contextual information itself could be a useful hint to rank the ground-truth response higher. Therefore, the performance of the baseline model is still better than a random guess. Again, role has a bigger effect than topic, and the combination gives the best results, but the differences in Recall@K performance are small.

4.4 Qualitative Analysis

For qualitative analysis, the best R-LDA-Conv model is used to generate role-specific responses, and we examined a number of examples to determine whether the generated response fits the expected speaker role. We include only two examples, in Table 4 and Table 5, due to page limitations. For each case, we have responses generated for each of the possible roles: a further question for the Poster and a potential solution for the Responder.

As we can see from the context part of Table 4, different roles clearly have different behaviors during the conversation. Ignoring the validity of this potential solution, this generated response is consistent with our expectation of the Responder  role. The response of Poster  seems quite plausible. The reply of Responder  is clearly the right style but more domain information in the topic vector could lead to a more useful solution.

Table 5 shows another example to demonstrate the difference between the Poster and Responder roles. In this example, the response for the Responder is not a potential solution but a question to the Poster. Unlike the generated question for the Poster role in the previous example, the purpose of the Responder’s question is to ask for further details in order to provide a simpler solution. The Poster’s response also fits well in the local context as well as the global topic of Ubuntu installation, claiming the difficulty of implementing the Responder’s suggestion. At the same time, the generated responses also show the necessity of incorporating certain domain knowledge into a domain-specific conversation system, which will be explored in future work.

5 Summary

We propose an LSTM-based conversation model that incorporates a role factor and a topic feature to model different word distributions for different roles. We present three models, R-Conv, LDA-Conv and R-LDA-Conv, which incorporate role factors and topic features into the output layer. We evaluate the models using both perplexity and response ranking. Both R-Conv and LDA-Conv outperform the baseline model on all tasks. The R-LDA-Conv model gives the best performance by combining the two components. In addition, the generation results demonstrate topical coherence and differences in responses associated with different roles. Besides role and topic, our model structure can be generalized to include more supervised information. For future work, we would incorporate supervised domain knowledge into our model to improve the topic relevance of the response.

Role        Utterance
Context
Poster      hey people . i have a disk from someone who accidentally overwrote his ext3 partition with a ntfs partition …… any tips on how to recover files from the erased ext3 partition ? the ntfs partion was created using “ fast “ formatting ( i.e. only the partition table was erased )
Responder   you can rebuild a partition table .. brb
Poster      is that possible ? how ? and he also started writing to the ntfs partition , so at least some data will be lost , hopefully some is still recoverable
Responder   at that point , he might as well write it off or pay a professional to do it
Poster      ere4si : is there some guide somewhere on how to make a live cd using the minimal iso ? it doesn’t even have a squashfs root filesystem
Generated responses
Poster      can you p me to that ?
Responder   “ sudo fdisk -l “ then use the fstab of the new permissions . then use that combination of * to recover backup
Ground truth
Responder   make an iso on usb bootable

Table 4: Response generation example (providing solution)

Role        Utterance
Context
Poster      question : i am currently installing ubuntu . in the " prepare partitions " dialog box , should i check both ext3 ’ /’ and ext3 ’ /home ’ to be formatted ?
Responder   did you have a previous install of ubuntu on there
Poster      no
Responder   err kazol_ not mount , write to it i mean
Responder   then you can format them both
Generated responses
Poster      i know , i don’t know how to do this . bad , even if i try an encrypted install of ubuntu … this means roller failed to mount it , so not the default .
Responder   or something similar . are you trying to eject net crapped on there ?
Ground truth
Responder   ok thanks

Table 5: Response generation example (clarification of the problem)