Log In Sign Up

Domain Aware Neural Dialog System

by   Sajal Choudhary, et al.

We investigate the task of building a domain aware chat system which generates intelligent responses in a conversation comprising of different domains. The domain, in this case, is the topic or theme of the conversation. To achieve this, we present DOM-Seq2Seq, a domain aware neural network model based on the novel technique of using domain-targeted sequence-to-sequence models (Sutskever et al., 2014) and a domain classifier. The model captures features from current utterance and domains of the previous utterances to facilitate the formation of relevant responses. We evaluate our model on automatic metrics and compare our performance with the Seq2Seq model.


page 1

page 2

page 3

page 4


A Dual Encoder Sequence to Sequence Model for Open-Domain Dialogue Modeling

Ever since the successful application of sequence to sequence learning f...

Meet Your Favorite Character: Open-domain Chatbot Mimicking Fictional Characters with only a Few Utterances

In this paper, we consider mimicking fictional characters as a promising...

Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints

Neural conversation models tend to generate safe, generic responses for ...

Towards Exploiting Background Knowledge for Building Conversation Systems

Existing dialog datasets contain a sequence of utterances and responses ...

Topic Propagation in Conversational Search

In a conversational context, a user expresses her multi-faceted informat...

Structuring Latent Spaces for Stylized Response Generation

Generating responses in a targeted style is a useful yet challenging tas...

Two are Better than One: An Ensemble of Retrieval- and Generation-Based Dialog Systems

Open-domain human-computer conversation has attracted much attention in ...

1 Introduction

With the advent of personal assistants such as Siri and Alexa, there has been a renewed focus on dialog systems, specifically those which can hold open-domain conversations. Readily available conversations from social media such as Reddit 111

have facilitated the use of data driven models to generate open domain responses. A recurrent neural network-based sequence-to-sequence model has been successfully applied to generating responses for conversational tasks  

Vinyals and Le (2015).

Though these models generate responses that are fluent and natural, they often fail to produce domain-specific responses since they are trained on a single general data-set. Moreover, these models tend to capture information solely based on previous words in a single utterance and fail to acknowledge switching of domains in a multi-turn conversation. Recent research has aimed at solving these problems and has shown promising results. Topic-aware neural networks  Xing et al. (2017) seek to incorporate topic information in their model to generate responses relevant to the topic words in the query. However, the context inherent in the previous utterances is not taken into account. On the other hand, hierarchical neural networks Serban et al. (2016, 2017) model the interactive structure of a conversation in order to output more contextual responses. Though these models try to capture the context of previous utterances in a conversation, the responses generated might lack domain-specific relevance.

Figure 1: Two users(A and B) conversing with each other. At second turn, B still remembers the domain of the conversation and hence answers appropriately in the same domain.

We hypothesize that each utterance in a conversation belongs to a domain and that during a conversation, participants tend to switch between domains. Moreover, sometimes the domain of an utterance might be ambiguous, but incorporating the previous domains of the conversation might be helpful in inferring the precise domain. For instance in Figure 1, the first utterance by speaker A lies in the sports domain. The domain for A’s second utterance is ambiguous and could lie in either sports or music. Since the previous domains referenced by the participants is sports

, it is highly probable that the exchange is about playing “tennis” rather than playing a song named “tonight”. When B is a chatbot, this utterance might be misinterpreted as a request for playing a song. We address the above problems by modelling a conversation and taking into account the domain of the current utterance as well as domains of the previous utterances.

In this paper, we describe a novel domain aware dialog system, DOM-Seq2Seq, consisting of three main components: 1) Domain Classifier, 2) Domain specific response generators (Seq2Seq), and 3) Re-ranker to combine responses of the above two models.

2 Model

2.1 Overview

In order to maintain a smooth conversation within the domain as well as during switching of domains in a conversation, we built a model consisting of a domain classifier and a set of response generators. The overview of the model is depicted in Figure 2. An utterance is fed into the domain classifier as well as multiple response generators which are trained separately on domain-specific data. The output domain () predicted by the domain classifier and all the generated responses (, , ) are then fed to the re-ranker. Re-ranker outputs the final predicted domain () and the appropriate response (). The final predicted domain () is fed back to the domain classifier to be used in subsequent predictions.

Figure 2: High level overview of the DOM-Seq2Seq conversational model. The utterance is fed into two separate models: Domain Classifier and Response generators. Their output is fed to re-ranker. Re-ranker outputs the final predicted domain() and response(). acts as a feedback to the domain-classifier.

2.2 Domain classifier

Domain classification predicts the domain of an utterance. Firstly, we used a tf-idf based supervised SVM to capture utterance level word features. Subsequently, we employed two ways to combine predicted domain from SVM and previous set of domains in the conversation.

2.2.1 Ensemble based Domain Classifier

We applied a logistic regression model to capture the sequence of domains in the conversation. While training the model, three previous actual domains (

, , ) are taken into consideration, where is the domain of the previous utterance and so on. Along with this, the domain predicted by the SVM model () is also included as a feature. Hence, this model optimizes the cost function,


where y is the actual label (domain) of the utterance in the conversation.

2.2.2 RNN based Domain Classifier

Our previous approach (2.2.1

) is based on the assumption that the last three domains in the conversation are sufficient to predict the current domain. However, capturing long term dependencies on domains is more beneficial for this task instead of relying only on the last few domains. The best way to capture this information is to use an RNN model with one hot input vector representation of subsequent domains. This is depicted in Figure


Figure 3: RNN Based Domain Classifier: The vector representation of the output from SVM tf-idf () is concatenated with last the hidden state of RNN. These are then fed to softmax classifier.

The state at time step t for an RNN is represented as where is the hidden state of the RNN at time step and is the domain at the same time step. Thus, at each time step, the state modifies itself based on its previous state and the current input. In our case, RNN is fed with the sequence of domains. The last hidden state () captures all the necessary information about the sequence of domains. The SVM model’s prediction is also represented in a vector form (

). These two vectors are concatenated and fed to a softmax classifier to get a probability distribution over the domains as in Equation



2.3 Response Generators

For generating the output response, we used an LSTM based Seq2Seq model with attention mechanism Bahdanau et al. (2014)

. Seq2Seq consists of two recurrent neural networks: encoder and decoder. An encoder compresses the input sentence into a fixed length vector, which is then passed to the decoder. The decoder then generates the output response one word at a time. Thus a candidate response is output for each domain. During testing phase, the output of the decoder at time step t is fed to the input of decoder at time step t+1; it outputs the logit at each time step. We extract the output logit

from the last time step of the decoder and apply sigmoid function to limit its value between 0 and 1. The value of

is used in the re-ranker stage to combine it with the output of the domain classifier.

2.4 Re-ranker

The output scores from the the generators and the domain classifier are fed into a re-ranker which implements soft classification to figure out the best response from the set of candidate responses. Equation 3 returns the index of the response with the highest multiplied score. So,


where is the probability that the utterance belongs to domain and is the domain’s decoder probability which was calculated from its output logit.

3 Experiments

In this section, we describe the experiments performed for our proposed models. We present the quantitative results of our evaluations and compare them with the those of the Seq2Seq model. We also put forward some qualitative examples of conversation snippets tested on our models. Our experiments were restricted to three domains namely Movies, Gaming and Out of Domain to aid in data collection and evaluation. We introduced Out Of Domain to manage utterances that do not lie in either Movies or Gaming.

3.1 Data Sets

We collected data from the social discussion website Reddit for training our models. We gathered 1,334,119, and 542,594 query-response pairs from the Movies and Gaming Reddit categories respectively for training two Seq2Seq models. Furthermore, for responding to Out of Domain queries, we trained Seq2Seq using 301,812 query-response pairs from twitter.

The SVM classifier for the domain classifier was trained and tested on 544,630 posts from Reddit categories of Movies, Gaming and tweets from Twitter combined. We made sure that there was no overlap with the data used for training of Seq2Seq models.

User: hi how are you
Seq2seq: morning bro
DOM-Seq2Seq: hello how are you
User: what games interest you
Seq2seq: Super Life
DOM-Seq2Seq: Sonic 0, The Walking
User: want to play tonight
Seq2seq: lol this is the best
DOM-Seq2Seq: I wonder what you’re doing
User: do you like watching movies
Seq2seq: [no response]
DOM-Seq2Seq: I’m a fan of them
User: the x men movies are good
Seq2seq: i love this
DOM-Seq2Seq: i thought i saw them
User: do you want to watch a movie
Seq2seq: i’ m not sure he’s a great
DOM-Seq2Seq: I have a feeling I’d watch them in the theater
Table 1: Qualitative comparisons of responses from the Seq2Seq baseline and DOM-Seq2Seq.

Tagging Conversational Data with Domains

Conversations extracted from Reddit IAma were used as data for the model. Reddit IAma threads specially suited our purposes, as they are already tagged with a flair, which is essentially the category of the conversation. We found that participants often switch categories during the conversations as these are of the type “ask me anything”. Each utterance of conversation was labelled automatically. Latent Dirichlet Allocation Blei et al. (2003) was used to learn topics from the conversation and to label each utterance with a topic corresponding to a domain if the topic proportion for that domain was above the threshold of . We replaced any domains outside of Movies and Gaming by Out Of Domain. Finally, we weighted the previous domains with exponentially decaying weights and took an average of these to incorporate the effect of previous domains. We amassed 947 conversations.

3.2 Experimental Setup

Response Generators

We used a Seq2Seq model with 3 layers of LSTM each consisting of 1024 hidden units as encoder and decoder. The vocabulary size for the Seq2Seq for Movies, Gaming and Out Of Domain was kept at 253,468, 85,312 and 40,000 words, respectively. These values were decided after filtering out the words which occurred once in their respective domains. Early stopping was used based on perplexity values on validation set.

RNN based Domain Classifier

A single layer RNN architecture with 8 hidden units was employed for the domain classifier and trained for 50 epochs.

3.3 Evaluation

We evaluated our models along with the Seq2Seq model on two metrics 1) Domain Classification Accuracy of the domain classifier and 2) Word Embedding Greedy Match Liu et al. (2016). The GloVe word embedding metrics Pennington et al. (2014) were used to find the similarity between generated response and the ground truth. This metric signifies how semantically similar the generated response is to the true response, and by extension, related to the query. The results for each of the models are specified in the Table 2.

4 Analysis

DOM-Seq2seq models outperform Seq2Seq on the Word Embedding Greedy Match metric showing that response from these models are more suitable and similar to the ground truth. En-DOM-Seq2Seq surpasses RNN-DOM-Seq2Seq with respect to domain classification accuracy, as conversations in our data set are relatively short. Table 1 shows the responses generated by Seq2Seq and DOM-Seq2Seq. Both of the DOM-Seq2Seq models predicted the same domain in our example conversation. Thus, for brevity, we showed only one response to compare it against Seq2Seq. The responses by DOM-Seq2Seq models are more informative. The response to query “do you want to watch a movie” from DOM-Seq2Seq is more pertinent than the one from Seq2Seq.

                                            Model Domain Classifier Accuracy Word       Embedding Greedy
Seq2Seq N/A 0.760
RNN-DOM-Seq2Seq 67.8% 0.797
En-DOM-Seq2Seq 77.57% 0.801
Table 2: Results for evaluation of Seq2seq, RNN based DOM-Seq2Seq (RNN-DOM-Seq2Seq), and Ensemble based DOM-Seq2Seq model (En-DOM-Seq2Seq)

5 Conclusion

We proposed an original technique for building a domain-aware chat system that can generate more relevant responses. We put forth two models to achieve this: an ensemble that considers a fixed number of previous domains and the current utterance to generate an appropriate response, and an RNN based classifier that models the switching of domains during a conversation. The methods described here can be extended to model a user’s conversation pattern (domain shifts) and personalize the responses of the chat system.

=0mu plus 1mu