Human-computer conversation has been an important and challenging task in NLP and AI since the Turing Test was proposed in 1950 Turing (1950). Recently, with the rapid growth of social conversation data available on the Internet, data-driven chatbots have been able to learn to generate responses directly and have attracted much more attention than before Li et al. (2016a); Tian et al. (2017).
Research in this area mostly focuses on dialog with two interlocutors Maíra Gatti de Bayser et al. (2017). However, real-life interaction involves a substantial amount of Multi-Party Chatbots (MPC, such as Internet forums and chat groups), a form of conversation with multiple interlocutors Ouchi and Tsuboi (2016). For example, more than three interlocutors are involved in the conversation in Figure 1, and their roles (e.g., speaker and addressee) may change across dialog turns.
As shown in Figure 1, at each turn, the core issue of MPC is to capture who (the speaker) talks to whom (the addressee) about what (the utterance). To the best of our knowledge, previous approaches to obtaining responses in MPC usually employ a response selection paradigm, which simply selects one response from a set of existing utterances as the final response according to the context. Obviously, this paradigm cannot generate new responses and is therefore inflexible. In this study, to build a more broadly applicable system, we concentrate on producing new responses word by word, a task we name Response Generation on Multi-Party Chatbots (RGMPC).
RGMPC is a very challenging task. The primary challenge is that the generated response depends strongly on the interlocutors' roles, i.e., the speaker and the addressee. For example, in the same context of Figure 1, what a speaker says differs across addressees because different addressees have different information demands. Similarly, for the same addressee, utterances from different speakers may differ because each speaker has personal background knowledge and a personal style of speaking. Moreover, the roles of the same interlocutor may vary across dialog turns. For instance, in Figure 1, the same interlocutor plays different roles in different dialog turns: speaker in some turns and addressee in others.
Therefore, it is very important for RGMPC to capture interlocutor information. Currently, most response generation methods consider only contextual utterance information Serban et al. (2016, 2017) and neglect interlocutor information. Although some studies have exploited interlocutor information for response generation, they still suffer from certain critical limitations. Li et al. (2016b) learn a fixed vector for each person from all conversational texts in the training corpus. However, as a global representation, the fixed person vector needs to be trained from large-scale dialogue turns for each interlocutor, and it may suffer from a sparsity issue since some interlocutors have very few dialogue turns.
To address the aforementioned problems of RGMPC, this paper incorporates Interlocutor-aware Contexts into a Recurrent Encoder-Decoder model (ICRED) for RGMPC, in an end-to-end framework. Specifically, in order to capture interlocutor information, we exploit interactive interlocutor representations learned from the current dialog context rather than the fixed person vectors Li et al. (2016b) obtained from all dialogs in the training corpus. We expect the learned contextual interlocutor representation to be a good alternative to the fixed person vectors Li et al. (2016b) due to its ability to alleviate the sparsity issue. Furthermore, from the view of conversation analysis, responses are usually used to answer the addressee's question or to expand on the addressee's utterances. Therefore, we introduce an addressee memory mechanism that specifically enhances contextual information for the target addressee. Finally, both the interactive interlocutor representation and the addressee memory are utilized to decode response utterances. In particular, the addressee memory is leveraged to capture addressee information dynamically for each generated word.
In order to prove the effectiveness of the proposed model, we construct a dataset for RGMPC based on an open dataset (available at https://www.dropbox.com/s/4chh64yaxajh0j7/RGMPC.zip?dl=0). Experimental results show that the proposed model is highly competitive on both automatic and manual evaluations compared with state-of-the-art methods.
In brief, the main contributions of the paper are as follows:
(1) We propose an end-to-end response generation model called ICRED which incorporates Interlocutor-aware Contexts into Recurrent Encoder-Decoder framework for RGMPC.
(2) We leverage an addressee memory mechanism to enhance contextual interlocutor information for the addressee.
(3) We construct an open-access dataset for RGMPC. Both automatic and manual evaluations demonstrate that our model is remarkably better than strong baselines on this dataset.
2 Task Formulation
Table 1 (notation): the responding speaker and the target addressee at the response time step.
On multi-party chatbots, many interlocutors talk about one or more topics. At each dialogue turn (or time step) t, there is a speaker, who may say something to a specific addressee, while the others are observers. As shown in Table 1, given the context of previous dialog turns, the responding speaker, and the target addressee at the next time step, the task of RGMPC is to automatically generate the next utterance as the final response. Here, the context is a list ordered by the time step t, where each element records that a speaker says an utterance to an addressee at time step t, and the context length is bounded by a maximum number of previous dialog turns. Each utterance is a word sequence whose length is bounded by a maximum number of words.
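The formulation above can be sketched as a simple data structure (a minimal Python sketch; the names `Turn` and `rgmpc_input` are ours for illustration, not from the paper):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    speaker: str
    addressee: Optional[str]  # None when no explicit addressee
    utterance: List[str]      # tokenized word sequence

# A context is a time-ordered list of turns; the task is to generate the
# utterance for a given (responding speaker, target addressee) pair.
context = [
    Turn("A", "B", ["how", "do", "i", "update", "the", "kernel", "?"]),
    Turn("B", "A", ["use", "the", "package", "manager"]),
    Turn("C", None, ["hi", "all"]),
]

def rgmpc_input(context, responding_speaker, target_addressee, max_turns=5):
    """Assemble the model input: the last `max_turns` turns plus the
    roles at the response time step."""
    return {
        "context": context[-max_turns:],
        "speaker": responding_speaker,
        "addressee": target_addressee,
    }
```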
3 Approach

The overview of the proposed ICRED for RGMPC is shown in Figure 2 along with its caption. The details are as follows.
3.1 Utterance Encoder Layer
3.2 Speaker Interaction Layer
The speaker interaction layer is leveraged to obtain interlocutor information from the context. Similar to the Speaker Interaction RNNs Zhang et al. (2018), we utilize an interactive speaker encoder for RGMPC.
As shown in Figure 2, an interlocutor embedding matrix is used to record all interlocutors' representations, and it is initialized as a zero matrix. Each column corresponds to one interlocutor's embedding. The speaker interaction layer updates all interlocutors' embeddings at each time step based on their roles (speaker, addressee, or observer), using three role-differentiated GRUs: a speaker GRU, an addressee GRU, and an observer GRU.
At each time step, the new embedding of the speaker (addressee / observer) is computed by the corresponding GRU from its previous embedding and the utterance representation obtained from the utterance encoder layer. Take the first time step in Figure 2 as an example: when the speaker addresses a specific interlocutor, the speaker's embedding is updated by the speaker GRU and the addressee's embedding by the addressee GRU, while all other interlocutors' embeddings are updated by the observer GRU. Note that the addressee may be missing (such as at time step 2 in Figure 2); in that case, the embeddings of all interlocutors except the speaker are updated by the observer GRU. The interlocutor embedding matrix is updated up to the maximum time step, and the final interlocutor embedding matrix is used in decoding.
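The role-differentiated update above can be sketched as follows (a minimal numpy sketch; random linear maps stand in for the three trained GRU cells, and all names are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
interlocutors = ["A", "B", "C"]
# Interlocutor embedding matrix, initialized to zeros (one entry per person).
A_mat = {p: np.zeros(dim) for p in interlocutors}

# Stand-ins for the three role-differentiated GRUs: distinct parameters per
# role, as in the paper, but a simple tanh projection instead of a GRU cell.
W = {role: rng.standard_normal((dim, 2 * dim)) * 0.1
     for role in ("spk", "adr", "obs")}

def role_update(role, person_emb, utt_repr):
    """One step of the role-specific recurrence: new embedding from the
    previous embedding and the current utterance representation."""
    return np.tanh(W[role] @ np.concatenate([person_emb, utt_repr]))

def interaction_step(A_mat, speaker, addressee, utt_repr):
    """Update every interlocutor according to its role at this turn.
    When the addressee is missing, everyone but the speaker is an observer."""
    for p in A_mat:
        if p == speaker:
            role = "spk"
        elif addressee is not None and p == addressee:
            role = "adr"
        else:
            role = "obs"
        A_mat[p] = role_update(role, A_mat[p], utt_repr)
    return A_mat

A_mat = interaction_step(A_mat, "A", "B", rng.standard_normal(dim))
```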
3.3 Addressee Memory Layer
The interlocutor embedding matrix is updated from utterance representations and interlocutors' roles, so it captures interlocutor context at the utterance level. In fact, contextual word representations are also important for response generation. A context contains consecutive utterances, and each utterance is a word sequence; therefore, memorizing all contextual word representations in the entire context is complex and scales poorly to contexts with many utterances.
Intuitively, from the view of conversational analysis, responses are usually used to answer the addressee's question or to expand on the addressee's utterances. Therefore, we design an addressee memory layer, which only memorizes the contextual word representations of the last utterance said by the target addressee; the contextual representation of each word is obtained from the utterance encoder layer. Take the response time step in Figure 2 as an example: the addressee memory layer merely memorizes the contextual word representations of the last utterance said by the target addressee in the context, where the representations are obtained as in Section 3.1.
3.4 Decoder Layer
The decoder is responsible for generating target sequences. Different from the single contextual representation in previous work Serban et al. (2017), the speaker interaction layer is able to capture different interlocutor information from contexts (e.g., personal background knowledge and speaking style for the responding speaker, and specific information demands for the target addressee). Moreover, the addressee memory layer records contextual word representations for the target addressee. Therefore, we extract a contextual speaker vector for the responding speaker from the final interlocutor embedding matrix, and similarly a contextual addressee vector for the target addressee. However, these two vectors remain the same for every generated word. In order to capture dynamic information for different generated words, we leverage an attention mechanism that selectively reads different contextual word representations from the addressee memory. For each target word, the decoder attentively reads the contextual word representations: the attentional addressee vector is a weighted sum over the addressee memory, where the attentive strength for the k-th word is computed from its contextual word representation and the hidden state of the decoding GRU through a score function with a projection matrix. Finally, the attentional addressee vector, the contextual speaker vector, and the contextual addressee vector are all fed into the decoding GRU.
The hidden state of the decoding GRU, together with the word vector of the previously predicted target word, is used to predict the next target word; the prediction is typically performed by a classifier over a fixed vocabulary based on word embedding similarity.
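The attentive read over the addressee memory can be sketched as follows (a minimal numpy sketch of a bilinear score function; the exact score function in the paper may differ):

```python
import numpy as np

def attend(h_dec, memory, W_a):
    """Attention over the addressee memory:
    score_k = m_k^T (W_a h_dec); weights = softmax(scores);
    returns the attentional addressee vector (weighted sum of memory rows)."""
    scores = memory @ (W_a @ h_dec)          # one score per memory slot, (K,)
    scores = scores - scores.max()           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ memory                  # attentional addressee vector
```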
The proposed ICRED for RGMPC is fully differentiable, and it can be optimized in an end-to-end manner using back-propagation. Given the context, responding speaker, target addressee, and target word sequence, the objective is to minimize a loss function containing a negative log-likelihood term for the generated responses and an L2 regularization term, whose weight is a hyperparameter.
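The training objective can be written out as a short sketch (our own naming; `lam` is the L2 weight hyperparameter):

```python
import numpy as np

def rgmpc_loss(log_probs, theta, lam=1e-4):
    """Loss = negative log-likelihood of the target words
    plus L2 regularization: L = -sum_t log p(y_t) + lam * ||theta||^2."""
    nll = -np.sum(log_probs)          # NLL over the target word sequence
    l2 = lam * np.sum(theta ** 2)     # L2 penalty on model parameters
    return nll + l2
```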
4 Experiments

4.1 Dataset

Our dataset is constructed based on the Ubuntu multi-party chatbot corpus (https://github.com/hiroki13/response-ranking), which has been widely used as the evaluation dataset for the response selection task Ouchi and Tsuboi (2016); Zhang et al. (2018). The original data comes from the Ubuntu IRC chat log, where each line consists of (Time, Speaker, Utterance). If the addressee is explicitly mentioned in the utterance, it is extracted as the addressee; otherwise, all interlocutors except the speaker are observers. Considering that generating new responses is more complicated than retrieving responses, the generative task requires higher-quality data. We require that the responding speaker and target addressee have appeared in the context, where the contextual window is set to 5. Moreover, the words are tokenized by NLTK, and some general responses (e.g., those containing "i don't know" or "you are welcome") are removed by hand-crafted rules. Finally, we randomly split the dataset into Train/Dev/Test (8:1:1), and it is publicly available (the download link is given in Section 1). The detailed statistics of the dataset are shown in Table 2.
4.2 Implementation Details
In order to keep our model comparable to other typical existing methods, we keep the same parameters and experimental environments for ICRED and the comparative models. We take a maximum of 20 words per utterance. The word vector dimension is 300, initialized with the publicly released fastText vectors (https://github.com/facebookresearch/fastText) pre-trained on Wikipedia. The utterance and interlocutor are encoded by 512-dimensional and 1024-dimensional vectors, respectively. The joint loss function with an L2 weight of 0.0001 is minimized by an Adam optimizer. We implemented all the models with TensorFlow on an NVIDIA TITAN X GPU.
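For quick reference, the reported hyperparameters can be collected in one place (a sketch; the key names are our own, not the authors' code):

```python
# Hyperparameters reported in Section 4.2, gathered into a single config.
CONFIG = {
    "max_utterance_len": 20,
    "word_dim": 300,           # initialized from fastText Wikipedia vectors
    "utterance_dim": 512,
    "interlocutor_dim": 1024,
    "l2_weight": 1e-4,
    "optimizer": "adam",
}
```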
4.3 Automatic Evaluation Metrics
Automatic evaluation (AE) for Natural Language Generation (NLG) is a challenging and under-researched problem Novikova et al. (2017). Following Liu et al. (2018), we leverage two referenced metrics, BLEU Papineni et al. (2002) and ROUGE Lin (2004), implemented by https://github.com/Maluuba/nlg-eval and reported as percentages (%). Considering that current data-driven approaches tend to generate short and generic (meaningless) responses, two unreferenced ("intrinsic") metrics are also used. The first is the average length of responses, an objective surface metric reflecting the substance of responses Mou et al. (2016); He et al. (2017a). The other is the number of nouns per response Liu et al. (2018) (NLTK is utilized for part-of-speech tagging), which shows the richness of responses since nouns are usually content words. Note that the unreferenced metrics enrich the evaluation even though they are weak metrics on their own. The detailed results and analyses follow.
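The two unreferenced metrics are simple to compute (a minimal sketch; `is_noun` is a pluggable POS predicate, since the paper uses NLTK tagging but any tagger works):

```python
def avg_length(responses):
    """Average number of whitespace tokens per response."""
    return sum(len(r.split()) for r in responses) / len(responses)

def avg_noun_count(responses, is_noun):
    """Average number of nouns per response; `is_noun` is a predicate on a
    token (e.g., backed by a part-of-speech tagger)."""
    return sum(sum(is_noun(w) for w in r.split())
               for r in responses) / len(responses)
```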
4.4 The Effectiveness of ICRED for RGMPC
Comparison Methods. We compared ICRED with the following methods:
(1) Seq2Seq Sutskever et al. (2014): Seq2Seq is one of the mainstream methods for text generation. In order to capture as much information as possible, the input sequence is the concatenation of all utterances in the context, in order.
(2) Persona Model Li et al. (2016b): The persona-based model modifies Seq2Seq to encode a global vector for each interlocutor appearing in the training data, which alleviates the speaker-consistency issue in response generation.
(3) VHRED Serban et al. (2017): VHRED is essentially a conditional variational auto-encoder with hierarchical encoders, and it extends HRED Serban et al. (2016) by adding a high-dimensional latent variable for utterances.
Comparative Results. Table 3 shows the overall comparisons for ICRED. We can clearly make the following observations:
(1) ICRED obtains the highest performance on all metrics (marked in bold), which indicates that incorporating interlocutor-aware context into RGMPC contributes to generating better responses.
(2) Although the persona-based model utilizes interlocutor information, it performs poorly. The average number of dialogue turns per interlocutor is more than 5000 in Li et al. (2016a), while there are fewer than 100 dialogue turns per interlocutor in our dataset. Therefore, it is hard to learn a global vector for each interlocutor from such a sparse corpus. In contrast, our ICRED performs well on this sparse corpus (details in Section 4.5).
(3) VHRED brings slight improvements over Seq2Seq and the persona-based model. Even though VHRED enhances contextual information with a high-dimensional latent variable, it is still remarkably worse than ICRED because it neglects interlocutor information.
4.5 The Effect of Sparse Data on ICRED
Table 4 compares the persona model and ICRED (ours) across intervals of the interlocutor's dialogue turns.
Comparison Settings. The persona model Li et al. (2016b) may have a sparsity issue since some interlocutors have very few dialogue turns. To investigate whether ICRED suffers from this sparsity issue, we divide the test data into four intervals according to the number of training dialogue turns said by the target addressee (called interlocutor dialogue turns), where small counts represent sparse learning data (e.g., "[0, 100]") and large counts represent plentiful learning data (e.g., "(5000, +)").
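Bucketing the test data then reduces to an interval lookup (a sketch; the paper only names the outer intervals "[0, 100]" and "(5000, +)", so the two middle boundaries below are illustrative assumptions):

```python
def bucket(n_turns):
    """Map the number of training dialogue turns for the target addressee
    to an interval label as in Table 4. The 1000-turn and 5000-turn middle
    boundaries are assumptions for illustration."""
    if n_turns <= 100:
        return "[0, 100]"
    if n_turns <= 1000:
        return "(100, 1000]"
    if n_turns <= 5000:
        return "(1000, 5000]"
    return "(5000, +)"
```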
Comparative Results. Table 4 reports the performance of the persona model and ICRED under different numbers of interlocutor dialogue turns for learning. We can clearly see that the persona model has a sparsity issue: it performs very poorly on sparse learning data (e.g., BLEU score = 8.47 on "[0, 100]") while it achieves good performance on plentiful learning data (e.g., BLEU score = 9.51 on "(5000, +)"), which demonstrates that the fixed person vectors in the persona model need to be learned from large-scale training data for each interlocutor. In contrast, ICRED exploits interactive interlocutor representations learned from the current dialog context rather than fixed person vectors obtained from all training dialog utterances. Therefore, ICRED does not suffer from the sparsity issue and performs similarly on sparse and plentiful learning data.
4.6 Ablation Study for Model Components
Comparison Settings. In order to validate the effectiveness of the model components, we remove the main components used in decoding, as follows. (1) w/o Adr_Mem: without the addressee memory, i.e., removing the attentional addressee vector from Equations 6-7; (2) w/o Ctx_Spk_Vec: without the contextual speaker vector in Equations 6-7; (3) w/o Ctx_Adr_Vec: without the contextual addressee vector in Equations 6-7.
Comparative Results. Results of the ablation study are shown in Table 5. We can see that removing any component causes obvious performance degradation. In particular, “w/o Ctx_Adr_Vec” performs the worst on almost all of the metrics, which demonstrates the importance of contextual information for the target addressee.
4.7 The Effectiveness of Addressee Memory
Table 6 (excerpt): all utterance memory — 10.39 / 8.78 / 11.38 / 1.37.
In order to demonstrate the effectiveness of the addressee memory, we change the memory type; the attention model in Equation 5 then attends over the new memory. The comparison settings are as follows. (1) addressee memory: memorizing the contextual word representations of the last utterance said by the target addressee; (2) all utterance memory: memorizing the contextual word representations of all utterances in the context; (3) latest memory: memorizing the contextual word representations of the latest utterance in the context; (4) speaker memory: memorizing the contextual word representations of the last utterance said by the responding speaker; (5) w/o memory: no memory at all.
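The five memory variants differ only in which word representations are retained, which can be sketched as a single dispatcher (our own naming, for illustration):

```python
def build_memory(turns, word_reprs, mem_type, speaker, addressee):
    """Return the word representations to attend over, for the five memory
    variants compared in Section 4.7. `turns` lists the speaker per time
    step; `word_reprs[t]` lists that turn's word representations."""
    def last_by(person):
        for t in range(len(turns) - 1, -1, -1):
            if turns[t] == person:
                return word_reprs[t]
        return []
    if mem_type == "addressee":
        return last_by(addressee)
    if mem_type == "speaker":
        return last_by(speaker)
    if mem_type == "latest":
        return word_reprs[-1]
    if mem_type == "all":
        return [w for utt in word_reprs for w in utt]
    return []  # "none": no memory at all
```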
Comparative Results. We report the results for different memory types in Table 6. We can see that our method, the addressee memory, achieves the best or near-best performance on all metrics. Although memorizing all utterances is competitive, the complexity of the all-utterance memory is T times that of the addressee memory, where T is the number of utterances in a context. The speaker memory performs similarly to no memory at all, which indicates that not every memory improves performance.
4.8 Manual Evaluations
Besides automatic evaluations, we conduct manual evaluations (MEs), which are important for response generation. Similar to He et al. (2017b); Zhou et al. (2018), we select three metrics for MEs, measuring the following aspects. (1) Fluency: whether responses are grammatically correct. (2) Consistency: whether responses are coherent with the context. (3) Informativeness: how much informative (knowledgeable) content the responses contain.
Table 7: win rate (%) of ICRED over each baseline (Fluency / Consistency / Informativeness):
ICRED vs. Seq2Seq: 77.25 / 83.69 / 84.35
ICRED vs. Persona: 78.44 / 80.41 / 82.35
ICRED vs. VHRED: 73.20 / 81.29 / 79.47
We conduct a pair-wise comparison between the response generated by ICRED and the one generated for the same input by each of the three typical baselines. We sample 100 responses from each compared method. Two curators judge each pair (win, tie, or lose). The Cohen's kappa inter-annotator agreement is 0.750, 0.658, and 0.580 for fluency, consistency, and informativeness, respectively. As shown in Table 7, each score is the percentage of pairs that ICRED wins after removing the "tie" pairs; ICRED is significantly (sign test, p-value < 0.005) superior to all baselines on every metric. This demonstrates that our model delivers more fluent, consistent, and informative responses.
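The two quantities reported above, the win rate over non-tie pairs and Cohen's kappa, can be computed as follows (a minimal self-contained sketch):

```python
def win_rate(judgments):
    """Percentage of 'win' among non-tie judgments, as reported in Table 7."""
    decided = [j for j in judgments if j != "tie"]
    return 100.0 * sum(j == "win" for j in decided) / len(decided)

def cohen_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)
```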
4.9 Case Study
Figure 3 shows example responses from the different models for the same dialogue context. It is clearly observed that our model (ICRED) generates more fluent, consistent, and knowledgeable (underlined) responses than the baselines. In particular, the response given by ICRED, "if you want a new kernel , you can install the kernel from the kernel repo", not only explains the reason for the kernel installation but also suggests a source for the installation. It fully captures the context and then produces a fluent, consistent, and knowledgeable response, which is semantically similar to the gold one.
4.10 Interlocutor Prediction and RGMPC

The above methods assume that the responding speaker and target addressee are given for RGMPC. Though the speaker and the addressee can be obtained in some situations (e.g., extracted from chat logs), interlocutor prediction remains an open research problem. There have been studies that predict either the responding speaker or the target addressee from textual contexts or multimodal information Akhtiamov et al. (2017a); Meng et al. (2017); Akhtiamov et al. (2017b). To investigate the interaction between interlocutor prediction and RGMPC, we further design a joint model for the two tasks. Note that both the speaker and the addressee are predicted simultaneously from textual contexts. First, the responding speaker is predicted from the context:
where a summary contextual vector is obtained by max-pooling the final interlocutor embedding matrix and is combined, through a projection matrix W, with the hidden state of the last utterance. The responding speaker is predicted by a softmax classifier based on the similarity between this query and each interlocutor's embedding; the target addressee is obtained in the same way. Second, the predicted interlocutors replace the gold ones for building the addressee memory and extracting interlocutor embeddings. Finally, the interlocutor prediction loss is added to the response generation loss for training. Table 8 shows the response generation performance when the responding interlocutors are given versus predicted. We can observe that:
Table 8 (speaker prediction / addressee prediction correctness):
Gold (True / True): 10.63 / 8.73 / 11.34 / 1.68
Predicted, True / True: 10.05 / 8.36 / 12.04 / 1.43
Predicted, False / False: 9.20 / 7.41 / 12.18 / 1.47
(1) The overall performance with predicted interlocutors ("* / *" in Table 8) is slightly worse than that with gold interlocutors (the first line of Table 8). Nevertheless, "* / *" still outperforms the strongest baseline (VHRED in Table 3).
(2) The correctness of interlocutor prediction has a significant impact on response generation performance. The model performs best when both the responding speaker and the target addressee are predicted correctly, while "False / False" (both mispredicted) obtains the worst performance on the referenced metrics. These results demonstrate that both the responding speaker and the target addressee contribute to generating better responses.
(3) Surprisingly, the unreferenced metrics are good on "False / False". One possible reason is that the wrong interlocutors still capture rich contexts, so the model generates long and meaningful responses that correlate weakly with the gold interlocutors; hence the very poor performance on the referenced metrics.
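The speaker prediction step described above can be sketched as follows (a minimal numpy sketch under our own naming; the real model uses trained embeddings and a learned projection matrix):

```python
import numpy as np

def predict_speaker(A_mat, h_last, W):
    """Predict the responding speaker: max-pool the interlocutor embedding
    matrix into a summary vector, combine it with the last utterance's
    hidden state through projection W, and score every interlocutor by
    embedding similarity with a softmax classifier."""
    names = list(A_mat)
    embs = np.stack([A_mat[p] for p in names])      # (P, dim)
    summary = embs.max(axis=0)                      # max-pooled context
    query = W @ np.concatenate([summary, h_last])   # projected query vector
    scores = embs @ query                           # similarity per person
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return names[int(np.argmax(probs))], probs
```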
5 Related Work
Our work is also inspired by research on multi-party chatbots. Dielmann and Renals (2008) automatically recognize dialogue acts in multi-party speech conversations. Recently, some studies have focused on the three elements (speaker, addressee, response) of multi-party chatbots. Meng et al. (2017) introduce speaker classification as a surrogate task. Addressee selection is studied by Akhtiamov et al. (2017b). Other studies address response selection Ouchi and Tsuboi (2016); Zhang et al. (2018). However, response selection heavily relies on the candidate set and cannot generate new responses in new dialogue contexts. Response generation solves this problem. Li et al. (2016b) learn a fixed person vector for response generation. Unfortunately, it needs to be learned from large-scale dialogue turns, which causes a sparsity issue: some interlocutors have very little dialog data. In contrast, our model has no such restriction.
6 Conclusion

In this study, we formalize a novel task of Response Generation for Multi-Party Chatbots (RGMPC) and propose an end-to-end model which incorporates Interlocutor-aware Contexts into a Recurrent Encoder-Decoder framework (ICRED) for RGMPC. Specifically, we employ interactive speaker modeling to capture contextual interlocutor information. Moreover, we leverage an addressee memory mechanism to enrich contextual information. Furthermore, we propose to predict both the speaker and the addressee when generating responses. Finally, we construct a corpus for RGMPC. Experimental results demonstrate that ICRED remarkably outperforms strong baselines on both automatic and manual evaluation metrics.
Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61533018), the National Key R&D Program of China (No. 2017YFB1002101), the National Natural Science Foundation of China (No. 61702512), and the independent research project of the National Laboratory of Pattern Recognition. This work was also supported by the CCF-DiDi BigData Joint Lab.
References

- Speech and text analysis for multimodal addressee detection in human-human-computer interaction. In Proceedings of INTERSPEECH, pp. 2521–2525.
- Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
- Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP, pp. 1724–1734.
- Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of NAACL, pp. 93–98.
- Recognition of dialogue acts in multiparty meetings using a switching DBN. IEEE Transactions on Audio, Speech, and Language Processing, pp. 1303–1314.
- Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of ACL, pp. 1766–1776.
- Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of ACL, pp. 199–208.
- A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL, pp. 110–119.
- A persona-based neural conversation model. In Proceedings of ACL, pp. 994–1003.
- ROUGE: a package for automatic evaluation of summaries. In Proceedings of the ACL Workshop, pp. 10.
- Curriculum learning for natural answer generation. In Proceedings of IJCAI, pp. 4223–4229.
- A hybrid architecture for multi-party conversational systems. CoRR abs/1705.01214.
- Hierarchical RNN with static sentence-level attention for text-based speaker change detection. In Proceedings of CIKM, pp. 2203–2206.
- Sequence to backward and forward sequences: a content-introducing approach to generative short-text conversation. In Proceedings of COLING, pp. 3349–3358.
- Why we need new evaluation metrics for NLG. In Proceedings of EMNLP, pp. 2241–2252.
- Addressee and response selection for multi-party conversation. In Proceedings of EMNLP, pp. 2133–2143.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pp. 311–318.
- Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI, pp. 3776–3783.
- A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of AAAI.
- Sequence to sequence learning with neural networks. In Proceedings of NIPS, pp. 3104–3112.
- How to make context more useful? An empirical study on context-aware neural conversational models. In Proceedings of ACL, pp. 231–236.
- Computing machinery and intelligence. Mind, pp. 433–460.
- Addressee and response selection in multi-party conversations with speaker interaction RNNs. In Proceedings of AAAI.
- Commonsense knowledge aware conversation generation with graph attention. In Proceedings of IJCAI, pp. 4623–4629.