In the last two years, many works from industry and academia have focused on building dialogue systems that can converse with humans in natural language by either generating responses Vinyals and Le (2015); Sordoni et al. (2015); Serban et al. (2016); Li et al. (2016b); Wen et al. (2017); Zhang et al. (2018a) or retrieving them from a set of candidate responses Lowe et al. (2015); Wu et al. (2018, 2017); Yang et al. (2018); Zhang et al. (2018b). Although generative systems can imitate humans and generate responses word by word, their responses tend to be generic, short and lacking in diversity Li et al. (2016a); Gao et al. (2018). Retrieval-based dialogue systems, on the other hand, can produce coherent and syntactically correct responses, but they are constrained by the list of candidate responses.
In this work, we study retrieval-based dialogue systems, as they have proved their efficiency in both academia and industry products such as Alibaba’s chatbot AliMe Qiu et al. (2017) and Microsoft’s social bot XiaoIce (https://www.msxiaoice.com/) Shum et al. (2018). Our goal is to provide a short overview of the latest deep retrieval-based dialogue systems, selected according to two criteria: novelty and relevance. Moreover, we describe the largest and most used public datasets and evaluation metrics. We also point out some drawbacks that we consider important to address in future work. We believe that this work will be useful not only for researchers and developers who are new to this research area but also for researchers who are designing new systems, so that they can easily find an up-to-date list of the latest and most relevant systems.
2 Retrieval-Based Systems
The existing retrieval-based dialogue systems belong to one of the following categories according to how they match the context with the response.
2.1 Single-Turn Matching Models
Systems of this category hypothesize that the response replies to the whole context. Thus, they consider the context utterances as one single utterance to which they match the response without explicitly distinguishing the context utterances.
Dual Encoder Lowe et al. (2015). First, the context and the candidate response are represented using word embeddings and fed into an LSTM Hochreiter and Schmidhuber (1997) network. The last hidden state of the encoder is used as the vector representation of the context and, likewise, of the response. The response score is the similarity between the context and the response, computed as the product of the context vector, a matrix of learned parameters and the response vector. Some variants of the dual encoder based on CNNs LeCun et al. (1998) and bidirectional LSTMs were also explored by Kadlec et al. (2015). Other similar single-turn matching models include Attentive-LSTM Tan et al. (2015), MV-LSTM Wan et al. (2016) and Match-LSTM Wang and Jiang (2016).
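As an illustration, the scoring step described above can be sketched as follows. This is a minimal sketch and not the original implementation: the vectors `c` and `r` stand in for the final LSTM hidden states, `M` for the matrix of learned parameters, and the sigmoid that maps the score to a probability is an assumption that is not stated in the text. All values are made up.

```python
import math

def bilinear_score(c, M, r):
    """Return c^T M r for a context vector c, a matrix M and a response vector r."""
    # Compute M r first, then take the dot product with c.
    Mr = [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]
    return sum(c_i * v for c_i, v in zip(c, Mr))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy stand-ins for the encoder outputs (hypothetical values).
c = [1.0, 0.0]
r = [0.0, 2.0]
M = [[1.0, 0.5],
     [0.0, 1.0]]

score = bilinear_score(c, M, r)   # similarity between context and response
probability = sigmoid(score)      # assumed squashing into (0, 1)
```

With an identity matrix `M`, the score reduces to a plain dot product between the two vectors; the learned matrix lets the model weight interactions between different dimensions.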
ESIM The Enhanced Sequential Inference Model Chen and Wang (2019) was originally developed by Chen et al. (2018) for natural language inference. First, the context utterances are concatenated and, following the same process as Lowe et al. (2015), the context and the response are encoded using a bidirectional LSTM network. Then, a cross-attention mechanism is applied to model the semantic relation between the context and the response. Finally, max and mean pooling are applied and a multi-layer perceptron classifier transforms the output into the probability that the response is the next utterance of the given context.
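The cross-attention and pooling steps can be illustrated with the following simplified sketch. It assumes dot-product attention scores and omits the encoder, the intermediate composition layers and the final classifier; the token vectors and the helper names (`cross_attend`, `mean_max_pool`) are hypothetical and chosen only for this example.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys):
    """For each query vector, return the attention-weighted sum of the key vectors."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        attended.append([sum(w * k[d] for w, k in zip(weights, keys))
                         for d in range(dim)])
    return attended

def mean_max_pool(vectors):
    """Concatenate the element-wise mean and max over a sequence of vectors."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    maximum = [max(v[d] for v in vectors) for d in range(dim)]
    return mean + maximum

# Toy token representations (stand-ins for the BiLSTM outputs).
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
response = [[1.0, 0.0], [0.0, 1.0]]

context_aligned = cross_attend(context, response)    # context attends to response
response_aligned = cross_attend(response, context)   # response attends to context
features = mean_max_pool(context_aligned) + mean_max_pool(response_aligned)
```

The resulting `features` vector plays the role of the fixed-size input that a classifier would turn into a matching probability.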
2.2 Multi-Turn Matching Models
The main hypothesis of this category of dialogue systems is that the response replies to each utterance of the context. Thus, the candidate response is matched with every utterance of the context. Then, an aggregation function is applied to combine the different matching scores and produce a response score. In the following, we present the most recent multi-turn matching systems.
SMN The Sequential Matching Network Wu et al. (2017) first encodes each context utterance and the response with a GRU Chung et al. (2014) encoder and obtains, for each utterance and the response, a matrix (all the hidden states of the encoder). This matrix represents the sequence information of each input. Moreover, a word similarity matrix is computed as a dot product between the matrices of each utterance and the response. These two matrices are used as input channels of a Convolutional Neural Network (CNN) followed by max pooling, which computes two-level matching vectors between the response and each context turn. A second GRU network aggregates the obtained vectors and produces a response ranking score.
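The word-level matching step described above can be sketched as follows. This is a deliberately reduced illustration: it builds one similarity matrix per context turn and applies a plain max pooling, while the convolution and the aggregating GRU are left out. The word vectors and function names are hypothetical.

```python
def similarity_matrix(utterance, response):
    """M[i][j] is the dot product of the i-th utterance vector and the j-th response vector."""
    return [[sum(a * b for a, b in zip(u, r)) for r in response] for u in utterance]

def max_pool_2d(matrix, size):
    """Non-overlapping max pooling with a square window of the given size."""
    rows, cols = len(matrix), len(matrix[0])
    return [[max(matrix[i + di][j + dj]
                 for di in range(size) for dj in range(size)
                 if i + di < rows and j + dj < cols)
             for j in range(0, cols, size)]
            for i in range(0, rows, size)]

def flatten(matrix):
    return [value for row in matrix for value in row]

# Two context turns and one candidate response, each a list of toy word vectors.
context = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
response = [[1.0, 0.0], [0.0, 1.0]]

# One matching vector per context turn; a sequence model would aggregate these.
matching_vectors = [
    flatten(max_pool_2d(similarity_matrix(turn, response), size=2))
    for turn in context
]
```

Keeping one matching vector per turn, in chronological order, is what allows a recurrent aggregator to weigh the turns differently.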
DAM The Deep Attention Matching Network Zhou et al. (2018) is an extension of the SMN Wu et al. (2017). The DAM addresses the limitations of recurrent neural networks in capturing long-term and multi-grained semantic representations. This model is based entirely on the attention mechanism Bahdanau et al. (2014). It is inspired by the Transformer Vaswani et al. (2017) and ranks the response using self- and cross-attention. The first GRU encoder of the SMN model is replaced by five hierarchically stacked layers of self-attention. Five matrices of multi-grained representations of the context turns and the response are obtained, instead of one matrix in the case of SMN. As in the SMN, the response matrices are matched with the context turn matrices and stacked together in the form of a 3D image (matrix). This image contains self- and cross-attention information of the inputs. Finally, convolution and max-pooling operations are applied in succession to the image to produce the response score.
DMN The Deep Matching Network Yang et al. (2018) extends the SMN (which is called DMN in their paper) with external knowledge in two different ways. The first approach, named DMN-PRF, is based on Pseudo-Relevance Feedback Cao et al. (2008) and consists of extending the candidate response with relevant words extracted from the external knowledge (Question Answering (QA) data). The second approach, named DMN-KD, incorporates external knowledge through QA correspondence Knowledge Distillation. It adds a third input channel to the CNN of the SMN as a matrix of the Positive Pointwise Mutual Information (PPMI) between words of the response and the most relevant responses retrieved from the external knowledge.
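For readers unfamiliar with PPMI, the quantity can be computed from co-occurrence counts as PPMI(x, y) = max(0, log(p(x, y) / (p(x) p(y)))). The sketch below applies this standard definition to a toy list of word pairs; how the pairs are actually collected from the external QA data is not specified here, so the input format is an assumption.

```python
import math
from collections import Counter

def ppmi_table(pairs):
    """Compute PPMI for every observed (x, y) pair in a list of co-occurrences."""
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    total = len(pairs)
    table = {}
    for (x, y), n_xy in pair_counts.items():
        # p(x, y) / (p(x) p(y)) simplifies to n_xy * total / (n_x * n_y).
        pmi = math.log((n_xy * total) / (x_counts[x] * y_counts[y]))
        table[(x, y)] = max(0.0, pmi)  # negative associations are clipped to zero
    return table

# Hypothetical co-occurrences between response words and retrieved-answer words.
pairs = [
    ("install", "apt"),
    ("install", "apt"),
    ("install", "driver"),
    ("error", "driver"),
]
table = ppmi_table(pairs)
```

In this toy example the pair ("install", "driver") occurs less often than chance would predict, so its PPMI is clipped to zero.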
DUA The Deep Utterance Aggregation system Zhang et al. (2018b) also extends the SMN with an explicit weighting of the context utterances. The authors hypothesize that the last utterance of the context is the most relevant and thus concatenate its encoded representation with all the previous utterances in addition to the candidate response. After that, a gated self-matching attention Wang et al. (2017) is applied to remove redundant information from the obtained representations before feeding them into the CNN as in the SMN.
We summarize in Figure 1 the global architectures of single- and multi-turn systems.
3 Ensemble Systems (sometimes called hybrid systems)
In addition to retrieval-based dialogue systems, generative dialogue systems have been widely explored in the literature. Most of them are based on the sequence-to-sequence (seq2seq) architecture Sutskever et al. (2014). First, they encode the context into a vector, then decode this vector to generate the response word by word. Both categories have pros and cons. For instance, retrieval-based dialogue systems are constrained by a list of candidate responses and can only respond with one of the available responses. On the other hand, they can produce syntactically correct, diverse and long responses. Generative dialogue systems are not limited by a list of responses and can thus generate more specific responses. However, they tend to generate "safe" responses which are short and general Li et al. (2016a); Serban et al. (2016).
Recently, some studies have combined both kinds of systems. Qiu et al. (2017) built a hybrid system in which, for a given question (context), similar Question-Answer (QA) pairs are retrieved from a QA base using a retrieval system. Then, a ranker system computes a score for each retrieved answer A based on its related question Q. The responses are ranked by these scores, and the score of the top-ranked response determines whether a new response is generated. If this score is higher than a threshold, the top-ranked response is returned. Otherwise, an attentive seq2seq is used to generate a new response.
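The control flow of such a threshold-based hybrid can be sketched as follows. The retrieval, ranking and generation components are passed in as functions; the toy versions below (a fixed QA base, a word-overlap ranker and a placeholder generator) are hypothetical stand-ins and do not reflect the components used by Qiu et al. (2017).

```python
def respond(question, retrieve, rank, generate, threshold):
    """Return the best retrieved answer if it scores above the threshold, else generate."""
    candidates = retrieve(question)  # list of (Q, A) pairs
    scored = [(rank(question, q, a), a) for q, a in candidates]
    if scored:
        best_score, best_answer = max(scored, key=lambda item: item[0])
        if best_score > threshold:
            return best_answer
    return generate(question)

# Toy components, for illustration only.
QA_BASE = [
    ("how do i reset my password", "Use the reset link on the login page."),
    ("where is my order", "You can track it from your account page."),
]

def retrieve(question):
    return QA_BASE

def rank(question, q, a):
    # Jaccard overlap between the input question and the stored question.
    asked, stored = set(question.split()), set(q.split())
    return len(asked & stored) / len(asked | stored)

def generate(question):
    return "<generated response>"

in_base = respond("how do i reset my password", retrieve, rank, generate, 0.5)
out_of_base = respond("tell me a joke", retrieve, rank, generate, 0.5)
```

The threshold controls the trade-off: a higher value hands more inputs to the generator, a lower value relies more on the retrieved answers.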
The system of Song et al. (2018) first retrieves candidate responses using the same process. Then, the query and the retrieved responses are given as input to a generative system to produce a new response. Finally, the retrieved and the generated responses are ranked and the best response is returned. A retrieve-and-refine model was proposed by Weston et al. (2018). It first retrieves the best response and then provides it, concatenated with the context, to an attentive seq2seq to generate a new response.
The most recent work of Pandey et al. (2018) consists of an exemplar encoder-decoder which first constructs a list of exemplar context-response pairs that are the most similar to the given context and response. Then, each exemplar context and response are encoded in addition to the original context. The exemplar response vectors are concatenated with the original context vector and fed into the decoder to generate a response. The score of the generated response is conditioned on the similarity between the exemplar contexts and the original context.
4 Datasets
Many publicly available datasets have been used to evaluate most of the recent retrieval-based dialogue systems. In this section, we provide a non-exhaustive list of the available datasets split into two categories (see the Appendix for further details).
4.1 Negative Sampling Based Datasets
This category groups datasets where the negative candidate responses of each context were randomly selected without any human judgment. The most used dataset is the Ubuntu Dialogue Corpus (UDC) Lowe et al. (2015). It is the largest available corpus and contains Ubuntu-related chats extracted from the freenode IRC. For each context, 10 candidate responses are provided, among which one response is the ground truth and the rest are randomly sampled from the dataset. Today, three versions of UDC exist.
The third version UDC (V3) was released as part of the DSTC7 challenge with the Advising corpus Gunasekara et al. (2019). To the best of our knowledge, UDC (V3) and Advising are the only public datasets where each context has 100 candidate responses while almost all the other datasets provide 10 candidate responses. The Advising corpus contains teacher-student conversations collected at the University of Michigan with students playing teacher and student roles with simulated personas Gunasekara et al. (2019). The dataset includes additional information about preferences for workloads, class sizes, topic areas, etc.
The MSDialog Yang et al. (2018) dataset was extracted from the Microsoft Answer Community (https://answers.microsoft.com) and consists of technical support conversations. As in UDC, negative sampling was used in order to produce the 9 negative responses for each context. Following the same negative response sampling process, the E-commerce Dialogue Corpus (EDC) was constructed Zhang et al. (2018b). It is a public dataset which contains conversations between customers and customer service staff.
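The negative sampling procedure shared by these datasets can be sketched as follows. This is an illustrative sketch only: the function name, the response pool and the shuffling of the final list are assumptions, and the actual construction scripts of each corpus may differ.

```python
import random

def build_candidates(true_response, response_pool, num_candidates=10, rng=None):
    """Pair the ground-truth response with randomly sampled negatives.

    Returns (candidates, labels), where labels[i] is 1 for the ground truth
    and 0 for a sampled negative.
    """
    rng = rng or random.Random()
    # Negatives are drawn without any human judgment, only excluding the truth.
    negatives_pool = [r for r in response_pool if r != true_response]
    negatives = rng.sample(negatives_pool, num_candidates - 1)
    candidates = [true_response] + negatives
    labels = [1] + [0] * len(negatives)
    # Shuffle so that the position carries no information about the label.
    order = list(range(len(candidates)))
    rng.shuffle(order)
    return [candidates[i] for i in order], [labels[i] for i in order]

pool = [f"response {i}" for i in range(50)]
candidates, labels = build_candidates("response 7", pool, 10, random.Random(0))
```

Because the negatives are sampled blindly, a sampled response may happen to be a valid reply to the context while still being labeled as negative.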
4.2 Human-Labeled Datasets
Unlike the datasets of the first category, in order to construct the following datasets, humans were recruited to judge each candidate response and give it a label. Hence, every context may have more than one correct response.
The Douban Conversation Corpus Wu et al. (2017) contains human-human dialogues extracted from Douban (https://www.douban.com/), which is a popular social network in China. This is an open-domain public dataset where conversations concern movies, books, music, etc., in contrast to the previously described datasets, which are domain specific.
AliMe data Yang et al. (2018) is a human-machine corpus extracted from chat logs between customers and AliMe: the Alibaba chatbot Qiu et al. (2017). Analysts were asked to annotate candidate responses and assign positive labels to responses that match the context and negative labels otherwise. Unfortunately, the dataset is not public.
5 Evaluation Metrics
Evaluating dialogue systems is an open research problem Lowe et al. (2017). So far, information retrieval metrics have been widely used to evaluate retrieval-based dialogue systems. For instance, Recall@k, Precision@k, Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) evaluate the capacity of the dialogue systems to rank the correct response above the negative responses.
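These metrics can be computed from the relevance labels of the candidates once they are sorted by the score assigned by the system. The following sketch assumes binary labels (1 for a correct response, 0 otherwise) and uses two made-up rankings; MRR and MAP are the means of the per-context reciprocal rank and average precision.

```python
def recall_at_k(ranked_labels, k):
    """Fraction of the correct responses that appear in the top k."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

def precision_at_k(ranked_labels, k):
    """Fraction of the top k candidates that are correct."""
    return sum(ranked_labels[:k]) / k

def reciprocal_rank(ranked_labels):
    """Inverse of the rank of the first correct response (0 if there is none)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_labels):
    """Mean of the precision values measured at each correct response."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean(values):
    return sum(values) / len(values)

# Labels of the candidates for two contexts, sorted by decreasing system score.
rankings = [
    [0, 1, 0, 0],   # the single correct response is ranked second
    [1, 0, 1, 0],   # two correct responses, ranked first and third
]

mrr = mean([reciprocal_rank(r) for r in rankings])
map_score = mean([average_precision(r) for r in rankings])
recall_at_1 = mean([recall_at_k(r, 1) for r in rankings])
precision_at_2 = mean([precision_at_k(r, 2) for r in rankings])
```

When every context has exactly one correct response, as in the negative sampling based datasets, MAP and MRR coincide; they differ only when several candidates are labeled as correct.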
6 Discussion and Conclusion
We presented the most recent retrieval-based dialogue systems and the available datasets. Even if these systems achieve good results, we believe that there are some drawbacks that should be addressed in the future. We summarize them in the following three points.
Models The existing models simply encode the context and the response and perform a semantic matching. No explicit modeling of dialogue acts, user intent, user profile, etc. is performed. However, we believe that retrieving the next utterance of a dialogue depends on multiple parameters. Explicitly extracting and modeling this information while keeping the data-driven and end-to-end properties of the models could be of great benefit. Moreover, we invite researchers to perform a qualitative error analysis. This can help in understanding what is captured by each architecture and what is missed and remains to be addressed in the future.
The idea of ensemble systems is very interesting but, until now, it has been explored in one direction only: the retrieval systems assist the generative systems. We think that generating responses and then retrieving responses that match them can help the retrieval system find a better response. Hopefully this research direction will be explored in the future. Furthermore, reporting the number of trainable parameters of each architecture would make it possible to fairly compare two approaches in terms of their complexity. The source code of most of the state-of-the-art systems is available, but the systems are implemented with different toolkits and use different data preprocessing (see the Appendix). For this reason, we plan to build a single toolkit that implements the available approaches using the same libraries and provides a large panel of datasets and a unified choice of data preprocessing for easy reproduction and fair comparison.
Datasets Among the six datasets that we presented in Section 4, only one public dataset has been labeled by humans (Douban). We agree that human judgment is labour-intensive and subjective, but randomly sampling responses from the dataset and labeling them as negative responses is a very naive approach that may distort the system training and evaluation. Moreover, most of the available datasets limit the number of candidate responses to 10, except the latest datasets UDC (V3) and Advising released by DSTC7 Gunasekara et al. (2019), which provide 100 candidate responses for each context. In practice, a retrieval-based dialogue system has to find the best response among a set of candidate responses that is much larger than 10. Unless another system first filters the candidates down to a smaller set for the response retrieval system, the list of candidate responses should be larger.
Evaluation Metrics Recently, much effort has been devoted to building more robust and smarter systems, but less effort has been devoted to defining new evaluation metrics adapted to dialogue systems, as opposed to IR and machine translation metrics Serban et al. (2018). We hope that researchers will direct some of their attention to this problem and provide metrics that are better adapted to dialogue.
References
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Cao et al. (2008) Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’08), pages 243–250, New York, NY, USA.
- Chen and Wang (2019) Qian Chen and Wen Wang. 2019. Sequential attention-based network for noetic end-to-end response selection. In Proceedings of the 7th Dialog System Technology Challenge (DSTC7), Honolulu, HI, USA.
- Chen et al. (2018) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18), pages 2406–2417, Melbourne, Australia.
- Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Workshop on Deep Learning and Representation Learning at the 28th Annual Conference on Advances in Neural Information Processing Systems (NIPS’14), Montreal, Canada.
- Gao et al. (2018) Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational ai. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18), pages 1371–1374, Ann Arbor, MI, USA.
- Gunasekara et al. (2019) Chulaka Gunasekara, Jonathan Kummerfeld, Lazaros Polymenakos, and Walter S. Lasecki. 2019. Dstc7 task 1: Noetic end-to-end response selection. In Proceedings of the 7th Dialog System Technology Challenges (DSTC7) at the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI’19), Honolulu, HI, USA.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Kadlec et al. (2015) Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. 2015. Improved deep learning baselines for ubuntu corpus dialogs. In Workshop on Machine Learning for Spoken Language Understanding and Interaction at the 29th Annual Conference on Neural Information Processing Systems (NIPS’15), Montreal, Canada.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
- Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL’16), pages 110–119, San Diego, CA, USA.
- Li et al. (2016b) Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016b. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP’16), pages 1192–1202, Austin, TX, USA.
- Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17), pages 1116–1126, Vancouver, Canada.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL’15), pages 285–294, Prague, Czech Republic.
- Pandey et al. (2018) Gaurav Pandey, Danish Contractor, Vineet Kumar, and Sachindra Joshi. 2018. Exemplar encoder-decoder for neural conversation generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18), pages 1329–1338, Melbourne, Australia.
- Qiu et al. (2017) Minghui Qiu, Feng-Lin Li, Siyu Wang, Xing Gao, Yan Chen, Weipeng Zhao, Haiqing Chen, Jun Huang, and Wei Chu. 2017. Alime chat: A sequence to sequence and rerank based chatbot engine. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17), pages 498–503, Vancouver, Canada.
- Serban et al. (2016) Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI’16), pages 3776–3783, Phoenix, AZ, USA.
- Serban et al. (2018) Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse, 9(1):1–49.
- Shum et al. (2018) Heung-yeung Shum, Xiao-dong He, and Di Li. 2018. From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19(1):10–26.
- Song et al. (2018) Yiping Song, Cheng-Te Li, Jian-Yun Nie, Ming Zhang, Dongyan Zhao, and Rui Yan. 2018. An ensemble of retrieval-based and generation-based human-computer conversation systems. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI’18), pages 4382–4388, Stockholm, Sweden.
- Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL’15), pages 196–205, Denver, CO, USA.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 2014 conference on Advances in Neural Information Processing Systems (NIPS’14), pages 3104–3112, Montreal, Canada.
- Tan et al. (2015) Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Lstm-based deep learning models for non-factoid answer selection. In Proceedings of the International Conference on Learning Representation (ICLR’15), San Juan, Puerto Rico.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS’17), pages 5998–6008, Long Beach, CA, USA.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Workshop on Deep Learning at the 31st International Conference on Machine Learning (ICML’15), Lille, France.
- Wan et al. (2016) Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016. Match-srnn: Modeling the recursive matching structure with spatial rnn. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI’16), pages 2922–2928, New York, NY, USA.
- Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with lstm. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’16), pages 1442–1451, San Diego, CA, USA.
- Wang et al. (2017) Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17), pages 189–198, Vancouver, Canada.
- Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M. Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL’17), pages 438–449, Valencia, Spain.
- Weston et al. (2018) Jason Weston, Emily Dinan, and Alexander Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI (SCAI’18), pages 87–92, Brussels, Belgium.
- Wu et al. (2018) Yu Wu, Zhoujun Li, Wei Wu, and Ming Zhou. 2018. Response selection with topic clues for retrieval-based chatbots. Neurocomputing, 316:251–261.
- Wu et al. (2017) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17), pages 496–505, Vancouver, Canada.
- Yang et al. (2018) Liu Yang, Minghui Qiu, Chen Qu, Jiafeng Guo, Yongfeng Zhang, W. Bruce Croft, Jun Huang, and Haiqing Chen. 2018. Response ranking with deep matching networks and external knowledge in information-seeking conversation systems. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18), pages 245–254, New York, NY, USA.
- Zhang et al. (2018a) Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018a. Generating informative and diverse conversational responses via adversarial information maximization. In Proceedings of the Advances in Neural Information Processing Systems (NIPS’18), pages 1815–1825.
- Zhang et al. (2018b) Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018b. Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics (COLING’18), pages 3740–3752, Santa Fe, NM, USA.
- Zhou et al. (2018) Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18), pages 1118–1127, Melbourne, Australia.