1 Introduction
Most successful conversational interfaces, such as Amazon Alexa, Apple Siri, and Google Assistant, have been designed primarily for short, task-oriented dialogs.
Task-oriented dialogs follow template-like structures and have clear success criteria. Developing conversational agents that can hold long and natural free-form interactions continues to be a challenging problem
[1]. Sustaining long free-form conversation opens the way for creating conversational agents, or chatbots, that feel natural, enjoyable, and human-like. Elements of a good free-form conversation are hard to define objectively. However, over the years there have been various attempts at defining frameworks for rating free-form conversations [2, 3, 4, 5, 1]. Accurate tracking of the conversation topic can be a valuable signal for dialog generation [6] and evaluation [1]. Previously, [7] used topic models to evaluate open-domain conversational chatbots, showing that lengthy and on-topic conversations are a good proxy for the user's satisfaction with the conversation; in this work we therefore focus on improving supervised conversational topic models. The topic models proposed by [7] are non-contextual, which prevents them from using past utterances for more accurate predictions. This work augments the supervised topic models by incorporating features that capture conversational context. The models were trained and evaluated on real user-chatbot interaction data collected during a large-scale chatbot competition known as the AlexaPrize [8].
With the goal of improving the topic model in mind, we train a separate, independent model to predict dialog acts [9] in a conversation and observe that incorporating the dialog act as an additional feature improves topic model accuracy. We evaluate three flavors of models: (1) optimized for speed (deep average networks [10] (DAN)), (2) optimized for accuracy (BiLSTMs), and (3) an interpretable model using unsupervised keyword detection (attentional deep average networks (ADAN)). We also evaluate the keywords produced by the ADAN model against a manually-labeled keyword dataset, showing that incorporating context increases the recall of keyword detection. For this work, we annotated more than 100K utterances collected during the competition with topics, dialog acts, and topical keywords. We also annotated a similarly-sized corpus of chatbot responses for coherence and engagement. To the best of our knowledge, this is the first work that uses contextual topic models for open-domain conversational agents. Our main contributions are: 1) We annotate and analyze conversational dialog data with topics, dialog acts, and conversational metrics such as coherence and engagement, and show a high correlation between topical depth and user satisfaction. 2) We show that including context and dialog acts in conversational topic models improves accuracy. 3) We provide a quantitative analysis of the keywords produced by the attentional topic models.
2 Related work
An early example of topic modeling for dialogs is [11], who define topic trees for conversational robustness. Conversational topic models for first-encounter dialogs were proposed by [12], while [13] uses Latent Dirichlet Allocation (LDA) to detect topics in conversational dialog systems. Topic detection and tracking for documents has been a long-standing research area [14]; an overview of classical approaches is provided in [15]. Topic models such as pLSA [16] and LDA [17] provide a powerful framework for extracting latent topics in text. However, researchers have found that these unsupervised models may not produce topics that conform to users' existing knowledge [18], as their objective functions often do not correlate well with human judgments [19]. This often results in nonsensical topics which cannot be used in downstream applications. There has been work on supervised topic models [20] as well as on making topic models more coherent [18]. A common idea in the literature is that human conversations are composed of dialog acts or speech acts [9]. Over the years, there has been extensive literature on both supervised and unsupervised ways to classify dialog acts [21]. In this work, we perform supervised dialog act classification, which we use along with context as additional features for topic classification. A major hurdle for open-domain dialog systems is their evaluation [22], as there are many valid responses in any given situation. There has been much recent work towards building better dialog evaluation systems [23, 24], including learning models of good dialog [5, 4], adversarial evaluation [25], and crowdsourcing [2]. Inspired by [7], who use sentence topics as a proxy for dialog evaluation, we support their claim that topical depth is predictive of user satisfaction and extend their models to incorporate context.

Table 1: Topic and dialog act categories used for annotation.
Topics | Dialog Acts |
Politics | InformationRequest |
Fashion | InformationDelivery |
Sports | OpinionRequest |
ScienceAndTechnology | OpinionExpression |
EntertainmentMusic | GeneralChat |
EntertainmentMovies | Clarification |
EntertainmentBooks | TopicSwitch |
EntertainmentGeneral | UserInstruction |
Phatic | InstructionResponse |
Interactive | Inappropriate |
Other | Other |
Inappropriate Content | FrustrationExpression |
- | MultipleGoals |
- | NotSet |
Table 2: Example dialog act annotations.
Agent | Sentence | Dialog Act
User | what are your thoughts on yankees | OpinionRequest
Chatbot | I think the new york yankees are great. Would you like to know about sports | MultipleGoals (OpinionExpression and request)
User | Yes | OpinionExpression
Table 3: Example annotations for the less obvious topics Phatic, Other, and Interactive.
Sentence | Topic | DialogAct
huh so far i am getting ready to go | Phatic | InformationDelivery
let's chat | Phatic | GeneralChat
who asked you to tell me anything | Other | FrustrationExpression
can we play a game | Interactive | UserInstruction
3 Data
The data used in this study was collected from real users during a large chatbot competition [8]. Upon initiating a conversation, users were paired with a randomly selected chatbot built by the competition participants.
At the end of the conversation, users were prompted to rate the chatbot quality from 1 to 5 and had the option to provide feedback to the team that built the chatbot. We had access to over 100k such utterances from interactions between users and chatbots collected during the 2017 competition, which we annotated for topics, dialog acts, and keywords (using all available context).
3.1 Annotation
Upon reviewing a user-chatbot interaction, annotators:
- Classify the topic of each user utterance and chatbot response using one of the available categories. Topics are organized into 12 distinct categories, as shown in Table 1. The set includes the category Other for utterances or chatbot responses that either reference multiple categories or do not fall into any of the available ones.
- Determine the goal of the user or chatbot, categorized as one of the 14 dialog acts in Table 1. Inspired by [9], we created a simplified set of dialog acts that is easy to understand and annotate. It includes the category Other for utterances which do not fall into any of the available categories, and NotSet, which means annotators did not annotate the utterance because of a potential speech recognition error. The goals are context-dependent, so the same utterance or chatbot response can be evaluated differently depending on the available context.
- Label the keywords that assist the annotator in determining the topic; e.g., in the sentence "the actor in that show is great," both the words "actor" and "show" assist the annotator when classifying the topic of this utterance.
These topics and dialog acts were selected based on the overall parameters of the competition as well as observed patterns of user interactions. As topics in conversation are strongly related to human perceptions of a coherent conversation, we hypothesize that these annotations allow improved dialog modeling and more nuanced responses. We provide some examples of less obvious topics, such as Interactive and Phatic, in Table 3.
In some cases, a user utterance or chatbot response implies multiple dialog acts; in these cases we default to MultipleGoals. If the multiple dialog acts all fall within a single topic, we categorize the utterance under that topic. Examples of annotations are shown in Table 2. The distribution of topics is shown in Figure 1 and the distribution of dialog acts in Figure 2. Our inter-annotator agreement is 89% on the topic annotation task and 70% on the dialog act annotation task. The Kappa measure [26] is 0.67 (good) on the topic annotations and 0.41 (moderate) on the dialog act annotations.
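For reference, raw agreement and chance-corrected kappa can be computed as in the following sketch; the per-utterance labels are hypothetical, and we use scikit-learn's cohen_kappa_score here, whereas [26] describes the statistic itself:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical topic labels from two annotators for the same five utterances.
annotator_a = ["Politics", "Fashion", "Fashion", "Phatic", "Other"]
annotator_b = ["Politics", "Fashion", "Sports", "Phatic", "Other"]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)  # agreement corrected for chance
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```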
Table 4: Example conversation annotated with topics, dialog acts, and keywords.
Turn | Agent | Sentence | Topic | DialogAct | Keywords
1 | User | let's talk about politics | Politics | InformationRequest | Politics (KeywordPolitics)
1 | Chatbot | ok sounds good would you like to talk about Gucci? | Fashion | TopicSwitch | Gucci (KeywordFashion)
2 | User | Yes | Fashion | InformationRequest |
2 | Chatbot | Sure! Gucci is a famous brand from Italy | Fashion | | Gucci brand Italy (KeywordFashion)
[Figure 1: Distribution of topics in the annotated data.]
[Figure 2: Distribution of dialog acts in the annotated data.]
In addition to these annotations, we also asked a separate set of annotators to rate each chatbot response as "yes" or "no" on these four questions: 1) The response is comprehensible: the information provided by the chatbot made sense with respect to the user utterance. 2) The response is relevant: if a user asks about a baseball player on the Boston Red Sox, then the chatbot should at least mention something about that baseball team. 3) The response is interesting: the chatbot provides an answer about a baseball player and adds some additional information to create a fleshed-out answer. 4) I want to continue the conversation: for example, through a question back to the user for more information about the subject. We use the sum of the first two metrics as a proxy for coherence and the sum of the last two as a proxy for engagement, converting "yes" to 1 and "no" to 0.
3.2 Topical Metrics and Evaluation Metrics
Coherence and Engagement are thus measured on a scale of 0 to 2.
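As a minimal illustration of this scoring (the question keys are our own illustrative names, not identifiers from the annotation tool):

```python
def coherence_engagement(answers):
    """Convert the four yes/no judgments for one chatbot response into scores.

    `answers` maps question keys (illustrative names) to "yes" or "no".
    """
    score = lambda key: 1 if answers[key] == "yes" else 0
    coherence = score("comprehensible") + score("relevant")   # 0, 1, or 2
    engagement = score("interesting") + score("continue")     # 0, 1, or 2
    return coherence, engagement
```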
We provide more statistics about our data in Table 5. We observe that user responses are very short and have a limited vocabulary compared to chatbot responses, which makes our task challenging and makes context crucial for effective topic identification.
Table 5: Dataset statistics.
Metric | Value
Average Conversation Length (turns) | 11.7
Median Conversation Length (turns) | 10.5
Mean User Utterance Length (words) | 4.2
Median User Utterance Length (words) | 3
Mean Chatbot Response Length (words) | 24
Median Chatbot Response Length (words) | 17
User Vocab Size (words) | 18k
ChatBot Vocab Size (words) | 85k
Following [7], we define the following terms (a short sketch computing them appears after this list):
- Topic-specific turn: a pair containing a user utterance and a chatbot response where both the utterance and the response belong to the same topic. For example, in Table 4, Turn 2 forms a topic-specific turn for Fashion.
- Length of sub-conversation: the number of topic-specific turns. In Table 4 there is a sub-conversation of length 1 for Fashion.
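A minimal sketch of how these quantities can be computed from topic-annotated turns; function and variable names are ours, and the exact aggregation used for topical depth is not restated here:

```python
from collections import Counter

def sub_conversation_lengths(turns):
    """turns: list of (user_topic, chatbot_topic) pairs for one conversation.

    A topic-specific turn is one where both sides share the same topic; the length of
    the sub-conversation for a topic is the number of such turns on that topic.
    """
    return dict(Counter(user for user, bot in turns if user == bot))

# For the conversation in Table 4:
turns = [("Politics", "Fashion"), ("Fashion", "Fashion")]
print(sub_conversation_lengths(turns))   # {'Fashion': 1}
```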
Table 6: Correlation of topical depth with coherence and engagement.
Metric | Correlation |
Coherence | 0.80 |
Engagement | 0.77 |
The correlation of topical depth in a conversation with coherence and engagement is given in Table 6. As rated by the annotators, coherence has a mean of 1.21 with a standard deviation of 0.75, and engagement has a mean of 0.81 with a standard deviation of 0.62; the mean of coherence is thus much higher than that of engagement. Both are almost equally correlated with topical depth, which implies that by making conversational chatbots stay on topic there is room for improvement in both user engagement and coherence, as was proposed in [7].

4 Models
We use DANs and ADANs as our topic classification baselines and explore various features and architectures to incorporate context. We also train BiLSTM classification models with and without context. We first describe the DAN, ADAN, and BiLSTM models, and then describe the additional features that incorporate context and dialog act information.
[Figure 3: The structure of CDAN; the context is averaged and appended to the average sentence vector.]
[Figure 4: The structure of CADAN, i.e., ADAN with contextual input.]
4.1 DAN
DAN is a bag-of-words neural model (see Figure 3 for DAN with context) that averages the word embeddings in each input utterance into a vector representation of the sentence. This average is passed through a fully connected network and fed into a softmax layer for classification. Formally, assume an input utterance of length $n$ with corresponding $d$-dimensional word embeddings $w_1, \ldots, w_n$; then the network structure is:

$$a = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad h = f(W_h a + b_h), \qquad \hat{y} = \mathrm{softmax}(W_o h + b_o).$$

To modify the network for contextual input (CDAN), we concatenate the context to the input $a$. This is detailed in Section 4.4.
In Figure 3, the output layer has size $T$, which corresponds to the number of topics we want to classify between. Due to its simplicity and lack of recurrent connections, DAN provides a fast-to-train baseline for our system.
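The following is a minimal PyTorch sketch of this architecture; the class name, dimension defaults, and the placement of the softmax inside the loss function are our choices rather than details taken from the paper:

```python
import torch
import torch.nn as nn

class DAN(nn.Module):
    """Deep average network: average word embeddings, then a feed-forward classifier."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=500, num_topics=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_topics)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); padding handling omitted for brevity
        avg = self.embedding(token_ids).mean(dim=1)   # (batch, emb_dim)
        h = torch.relu(self.hidden(avg))              # (batch, hidden_dim)
        return self.out(h)                            # logits; softmax applied in the loss
        # CDAN would concatenate an averaged context vector to `avg` before self.hidden,
        # with the hidden layer's input size enlarged accordingly.
```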
4.2 ADAN
The DAN model was extended in [7] by adding an attention mechanism that jointly learns topic-wise word saliency for each input utterance while performing the classification; keywords can thus be extracted in an unsupervised manner. Figure 4 depicts ADAN with context (CADAN); the same figure without the context layers illustrates the plain ADAN model for topic classification.
As shown in the figure, ADAN models the topic-word distribution using an attention table $A$ of size $V \times T$, where $V$ is the vocabulary size and $T$ is the number of topics in the classification task. The attention table holds the relevance weight $\alpha_{i,t}$ associated with each word-topic pair, which essentially measures the saliency of word $w_i$ given topic $t$. The utterance representation for topic $t$ is computed as a weighted average of the word embeddings and corresponds to one row of the topic-specific sentence matrix shown in Figure 4:

$$r_t = \frac{\sum_{i=1}^{n} \alpha_{i,t}\, w_i}{\sum_{i=1}^{n} \alpha_{i,t}}.$$

As in DAN, the topic-specific sentence matrix, now denoted $R = [r_1; \ldots; r_T]$, is passed through a fully connected network and fed to a softmax layer for classification. More formally,

$$h = f(W_h\, \mathrm{vec}(R) + b_h), \qquad \hat{y} = \mathrm{softmax}(W_o h + b_o).$$
To modify the network for contextual input (CADAN), we concatenate the context to the input, as detailed in Section 4.4. Overall, ADAN provides an interpretable baseline for our system.
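A rough PyTorch sketch of the attention table and the topic-specific sentence representations follows; normalizing the saliency weights over the words of the utterance with a softmax, flattening the topic-specific matrix before the hidden layer, and all names are our assumptions:

```python
import torch
import torch.nn as nn

class ADAN(nn.Module):
    """Attentional DAN: a vocabulary-by-topic table weights every word per topic."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=500, num_topics=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # word-topic saliency table A of size V x T
        self.attention = nn.Parameter(torch.randn(vocab_size, num_topics))
        self.hidden = nn.Linear(num_topics * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_topics)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)                         # (batch, seq, emb)
        # normalize saliency over the words of the utterance (our assumption)
        att = torch.softmax(self.attention[token_ids], dim=1)   # (batch, seq, topics)
        # one weighted-average sentence vector per topic
        sent = torch.einsum('bst,bse->bte', att, emb)           # (batch, topics, emb)
        h = torch.relu(self.hidden(sent.flatten(1)))
        return self.out(h)                                       # logits over topics
```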
4.3 BiLSTM for Classification
We train a simple one-layer BiLSTM model with word embeddings as input for topic classification. The contextual variant of the BiLSTM model is shown in Figure 5. For BiLSTMs, we try two different ways to include the context, which are described in detail in Section 4.4. For the final sentence representation, we use the concatenation of the final states of the forward and backward LSTM [27]. More formally, using the same notation as in Section 4.1, let $\overrightarrow{h}_n$ and $\overleftarrow{h}_1$ denote the final states of the forward and backward LSTM respectively; their concatenation is sent through a softmax layer:

$$\hat{y} = \mathrm{softmax}(W_o\, [\overrightarrow{h}_n ; \overleftarrow{h}_1] + b_o).$$
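A minimal PyTorch sketch of the non-contextual BiLSTM classifier; names and defaults are illustrative:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Single-layer BiLSTM; the final forward and backward states are concatenated."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256, num_topics=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_topics)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)             # (batch, seq, emb)
        _, (h_n, _) = self.lstm(emb)                # h_n: (2, batch, hidden)
        sent = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat forward/backward final states
        return self.out(sent)                       # logits over topics
```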
4.4 Context as Input
We consider two variants of contextual features to augment the above models for more accurate topic classification. (1) Average turn vector as context: for the current turn $t$, the context vector is obtained by averaging the previous turn vectors. (2) Dialog act as context: dialog acts can serve as a powerful feature for determining context. We train a separate CDAN model that predicts dialog acts, whose output we use as an additional input to our models.
We define a turn vector as the concatenation of a user utterance and the corresponding chatbot response; the conversation in Table 4 contains two turn vectors. We obtain a fixed-length representation of a turn vector by simply averaging the word embeddings of all the words in it. For the current turn of length $m$ with words $x_1, \ldots, x_m$, let $e_i$ be the word embedding vector corresponding to word $x_i$; the turn vector is then $c_t = \frac{1}{m}\sum_{i=1}^{m} e_i$.
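A small sketch of the turn and context vectors under these definitions; helper names, the embedding lookup, and the zero-vector fallback are illustrative:

```python
import numpy as np

def turn_vector(turn_tokens, embeddings, dim=300):
    """Average the embeddings of all words in one turn (user utterance + chatbot response)."""
    vecs = [embeddings[w] for w in turn_tokens if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def context_vector(previous_turns, embeddings, dim=300, window=5):
    """Average the turn vectors of up to the last `window` turns (the paper uses 5)."""
    turns = previous_turns[-window:]
    if not turns:
        return np.zeros(dim)
    return np.mean([turn_vector(t, embeddings, dim) for t in turns], axis=0)
```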
We concatenate the contextual feature vectors with the input of the DANs and ADANs as an additional input to the model. More specifically, for DAN the contextual feature vector is simply appended to the input embedding, as shown in Figure 3. For ADAN, the contextual feature vector is replicated $n$ times so that the contextual input can be concatenated to every word of the input utterance, as shown in Figure 4.
For BiLSTMs, we try two different ways to include the averaged context vector: (1) concatenating the context to the input word embeddings of the BiLSTM (BiLSTM-Avg), and (2) adding the context in a sequential manner, as an extension of the input sequence, rather than concatenating it to the averaged embeddings (BiLSTM-Seq), as shown in Figure 5. Additionally, we append the dialog act as context to $[\overrightarrow{h}_n ; \overleftarrow{h}_1]$, the output of the BiLSTM, before it is sent through the softmax layer, as shown in Figure 5.
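Two small sketches of how the context and dialog act can be attached to the BiLSTM under the descriptions above; tensor shapes, names, and the use of a dialog act probability vector are assumptions:

```python
import torch

def prepend_context(context_vec, word_embs):
    """BiLSTM-Seq variant: add the averaged context as an extra time step of the input."""
    # context_vec: (batch, emb_dim), word_embs: (batch, seq_len, emb_dim)
    return torch.cat([context_vec.unsqueeze(1), word_embs], dim=1)

def append_dialog_act(final_states, dialog_act_probs, output_layer):
    """Append the predicted dialog act to the BiLSTM's final states before the output layer.

    output_layer must expect the combined feature size; softmax is applied in the loss.
    """
    return output_layer(torch.cat([final_states, dialog_act_probs], dim=-1))
```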
4.5 Salient Keywords
We use our annotated keywords as a test set to evaluate our attention layer quantitatively. To choose the keywords produced by the attention layer, we first take the topic predicted by the ADAN model. From the attention table we select the entries corresponding to that topic and then pick the top $k$ keywords, where $k$ is equal to the number of keywords in the ground truth. We do this only for evaluating the keywords produced by the ADAN model. As ground-truth keywords, we use all the tokens in a sentence marked with any topic by the annotators.
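A sketch of this keyword selection step, assuming the attention table is stored as a vocabulary-by-topic tensor; the orientation of the table and all names are ours:

```python
import torch

def adan_keywords(token_ids, tokens, attention_table, predicted_topic, num_gold_keywords):
    """Return the words of the utterance with the highest saliency for the predicted topic.

    token_ids: LongTensor (seq_len,) of vocabulary indices for the utterance tokens;
    attention_table: (vocab_size, num_topics) tensor of word-topic weights;
    k is set to the number of keywords in the ground-truth annotation.
    """
    scores = attention_table[token_ids, predicted_topic]   # (seq_len,) saliency per token
    k = min(num_gold_keywords, len(tokens))
    top = torch.topk(scores, k=k).indices
    return [tokens[i] for i in top.tolist()]
```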
Table 7: Example keywords and predicted topics produced by ADAN and its contextual variants (+C: with context, +C+D: with context and dialog act).
Sentence (ground-truth topic) | ADAN | +C | +C+D
do you know hal (Sci&Tech) | hal (Sci&Tech) | hal (Sci&Tech) | hal (Sci&Tech)
i like puzzles (EntGeneral) | i (Sci&Tech) | puzzles (EntGeneral) | puzzles (EntGeneral)
can you make comments about music (EntMusic) | comments (Phatic) | you (EntBooks) | music (EntMusic)
5 Experiments
Since our data set was highly imbalanced (see Figure 1), we down-sampled the Other class. We split our annotated data into 80% train, 10% development, and 10% test, and used the dev set to roughly tune our hyper-parameters separately for each experiment. We train our networks to convergence, which we define as validation accuracy not increasing for two epochs. For DAN, ADAN, and their contextual variants, we choose an embedding size of 300, a hidden layer size of 500, and the ReLU activation function; word embeddings were initialized with GloVe [28] and fine-tuned. For BiLSTM, we choose an embedding size of 300 and a hidden layer size of 256; word embeddings were randomly initialized and learned during training. We use the Adam optimizer with a learning rate of 0.001 and default settings otherwise, and all of our models are trained with 50% dropout (a minimal training-loop sketch reflecting this setup appears after the result tables). We noticed only marginal gains from very long context windows, so to speed up training we only consider the last 5 turns as context in our models. We measure accuracy on the supervised topic classification task; for the salient keyword detection task, we measure token-level accuracy, and the results are shown in Table 10.

Table 8: Topic classification accuracy (+C: with context, +D: with dialog act).
Topic Classifier | Baseline | +C | +D | +C+D
BiLSTM-Avg | 0.55 | 0.56 | 0.59 | 0.68 |
BiLSTM-Seq | - | 0.61 | - | 0.74 |
DAN | 0.51 | 0.57 | 0.52 | 0.60 |
ADAN | 0.38 | 0.39 | 0.42 | 0.40 |
Table 9: Dialog act classification accuracy.
Dialog Act Classifier | Sentences | +C |
DAN | 0.50 | 0.69 |
Bi-LSTM | 0.50 | 0.67 |
Table 10: Keyword detection precision and recall for ADAN and its contextual variants.
Topic Classifier | Precision | Recall |
ADAN | 0.37 | 0.36 |
CADAN | 0.33 | 0.32 |
CADAN + DialogAct | 0.40 | 0.40 |
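As referenced above, the following is a minimal training-loop sketch matching the reported setup (Adam with learning rate 0.001, early stopping after two epochs without improvement in dev accuracy). The model construction and the data loaders are assumed to exist; any of the classifiers sketched in Section 4 could stand in for `model`.

```python
import torch

# Assumed to exist: `model` (a topic classifier), `train_loader`, `dev_loader`
# yielding (token_ids, topic_labels) batches.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()   # applies log-softmax to the logits internally

def dev_accuracy(model, loader):
    """Fraction of dev utterances whose topic is classified correctly."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for token_ids, labels in loader:
            preds = model(token_ids).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

best_acc, stale_epochs = 0.0, 0
while stale_epochs < 2:                   # "convergence" as defined in the text
    model.train()
    for token_ids, labels in train_loader:
        optimizer.zero_grad()
        criterion(model(token_ids), labels).backward()
        optimizer.step()
    acc = dev_accuracy(model, dev_loader)
    best_acc, stale_epochs = (acc, 0) if acc > best_acc else (best_acc, stale_epochs + 1)
```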
6 Results and Analyses
We show our main results in Table 8 and highlight a few key findings. The BiLSTM baseline performs better than the DAN baseline. ADAN performs worse than DAN and BiLSTM across the board; we believe this is because of the large number of parameters in ADAN's word-topic attention table, which requires a large amount of data for robust training, so given our dataset of 100K utterances the ADAN model may be overfitting. Adding contextual signals such as past utterances and dialog acts helps DAN and BiLSTM, as can be seen in Table 8 by comparing the baseline with the other columns.
Adding context alone helps DAN but does not significantly improve BiLSTM performance. This could be because the BiLSTM already models sequential dependencies, so context alone does not add much value. We observe that past context and dialog acts work complementarily: adding context makes the model sensitive to topic changes, while adding dialog acts makes the model more robust to contextual noise.
We also show the results of our CDAN dialog act model in Table 9. Adding context also improves dialog act classification, which in turn improves the topic model, since dialog acts are used as its input.
The best performing model is a combination of all the input signals: context and dialog acts. In the case of ADAN, where the model is overfitting because of insufficient data, adding both features improves the keyword detection metrics (Table 10). A few examples of keywords learned by the ADAN model are shown in Table 7.
7 Conclusion and future work
We focus on context-aware topic classification models for detecting topics in non-goal-oriented human-chatbot dialogs. We extend previous work on topic modeling (the DAN and ADAN models) by incorporating conversational context features, introducing the Contextual DAN, Contextual ADAN, and Contextual BiLSTM models. We describe a fast topic model (DAN), an accurate topic model (BiLSTM), and an interpretable topic model (ADAN), and show that conversational context can be added to all of these models in a simple and extensible fashion. We also show that dialog acts provide a valuable input that helps improve topic model accuracy. Furthermore, we show that topical evaluation metrics such as topical depth correlate highly with dialog evaluation metrics such as coherence and engagement, which implies that conversational topic models can play a critical role in building great conversational agents. Topical metrics such as topical depth can also be obtained automatically and used for evaluating open-domain conversational agents, which remains an unsolved problem. Unsupervised topic modeling is a future direction we plan to explore, along with other forms of context such as device state and user preferences.
References
- [1] Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, Rahul Goel, et al., “On evaluating and comparing conversational agents,” arXiv preprint arXiv:1801.03625, 2018.
- [2] Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel, “Can machine translation systems be evaluated by the crowd alone,” Natural Language Engineering, vol. 23, no. 1, pp. 3–30, 2017.
-
[3]
Dominic Espinosa, Rajakrishnan Rajkumar, Michael White, and Shoshana Berleant,
“Further meta-evaluation of broad-coverage surface realization,”
in
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
. Association for Computational Linguistics, 2010, pp. 564–574. - [4] Ryuichiro Higashinaka, Toyomi Meguro, Kenji Imamura, Hiroaki Sugiyama, Toshiro Makino, and Yoshihiro Matsuo, “Evaluating coherence in open domain conversational systems,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- [5] Ryan Lowe, Michael Noseworthy, Iulian V Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau, “Towards an automatic turing test: Learning to evaluate dialogue responses,” arXiv preprint arXiv:1708.07149, 2017.
- [6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, “A diversity-promoting objective function for neural conversation models,” arXiv preprint arXiv:1510.03055, 2015.
- [7] Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, and Ashwin Ram, “Topic-based evaluation for conversational bots,” arXiv preprint arXiv:1801.03622, 2017.
- [8] Ashwin Ram, Rohit Prasad, Chandra Khatri, and Anu Venkatesh, “Conversational ai: The science behind the alexa prize,” arXiv preprint arXiv:1801.03604, 2017.
-
[9]
Andreas Stolcke, Elizabeth Shriberg, Rebecca Bates, Noah Coccaro, Daniel
Jurafsky, Rachel Martin, Marie Meteer, Klaus Ries, Paul Taylor, Carol Van
Ess-Dykema, et al.,
“Dialog act modeling for conversational speech,”
in
AAAI Spring Symposium on Applying Machine Learning to Discourse Processing
, 1998, pp. 98–105. - [10] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III, “Deep unordered composition rivals syntactic methods for text classification,” Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.
- [11] Kristiina Jokinen, Hideki Tanaka, and Akio Yokoo, “Context management with topics for spoken dialogue systems,” in Proceedings of the 17th international conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 1998, pp. 631–637.
- [12] Trung Ngo Trong and Kristiina Jokinen, “Conversational topic modelling in first encounter dialogues,” .
- [13] Jui-Feng Yeh, Chen-Hsien Lee, Yi-Shiuan Tan, and Liang-Chih Yu, “Topic model allocation of conversational dialogue records by latent dirichlet allocation,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE, 2014, pp. 1–4.
- [14] James Allan, Jaime G Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang, “Topic detection and tracking pilot study final report,” 1998.
- [15] James Allan, “Introduction to topic detection and tracking,” in Topic detection and tracking, pp. 1–16. Springer, 2002.
-
[16]
Thomas Hofmann,
“Probabilistic latent semantic analysis,”
in
Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
. Morgan Kaufmann Publishers Inc., 1999, pp. 289–296. - [17] David M Blei, Andrew Y Ng, and Michael I Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
- [18] David Mimno, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum, “Optimizing semantic coherence in topic models,” in Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2011, pp. 262–272.
- [19] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei, “Reading tea leaves: How humans interpret topic models,” in Advances in neural information processing systems, 2009, pp. 288–296.
- [20] Jon D Mcauliffe and David M Blei, “Supervised topic models,” in Advances in neural information processing systems, 2008, pp. 121–128.
- [21] Aysu Ezen-Can and Kristy Elizabeth Boyer, “Understanding student language: An unsupervised dialogue act classification approach,” Journal of Educational Data Mining (JEDM), vol. 7, no. 1, pp. 51–78, 2015.
- [22] Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau, “How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation,” arXiv preprint arXiv:1603.08023, 2016.
- [23] Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević, “Results of the wmt16 metrics shared task,” in Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016, vol. 2, pp. 199–231.
-
[24]
Rohit Gupta, Constantin Orasan, and Josef van Genabith,
“Reval: A simple and effective machine translation evaluation metric based on recurrent neural networks.,”
2015. - [25] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky, “Adversarial learning for neural dialogue generation,” CoRR, vol. abs/1701.06547, 2017.
- [26] Douglas G. Altman, Practical Statistics for Medical Research, CRC Press, 1990.
- [27] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional lstm,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
- [28] Jeffrey Pennington, Richard Socher, and Christopher D Manning, “Glove: Global vectors for word representation.,” in Proceedings of EMNLP, 2014.