IART: Intent-aware Response Ranking with Transformers in Information-seeking Conversation Systems

02/03/2020 ∙ by Liu Yang, et al. ∙ Ant Financial ∙ University of Massachusetts Amherst ∙ Institute of Computing Technology, Chinese Academy of Sciences ∙ Rutgers University

Personal assistant systems, such as Apple Siri, Google Assistant, Amazon Alexa, and Microsoft Cortana, are becoming ever more widely used. Understanding user intent such as clarification questions, potential answers and user feedback in information-seeking conversations is critical for retrieving good responses. In this paper, we analyze user intent patterns in information-seeking conversations and propose an intent-aware neural response ranking model "IART", which refers to "Intent-Aware Ranking with Transformers". IART is built on top of the integration of user intent modeling and language representation learning with the Transformer architecture, which relies entirely on a self-attention mechanism instead of recurrent nets. It incorporates intent-aware utterance attention to derive an importance weighting scheme of utterances in conversation context with the aim of better conversation history understanding. We conduct extensive experiments with three information-seeking conversation data sets including both standard benchmarks and commercial data. Our proposed model outperforms all baseline methods with respect to a variety of metrics. We also perform case studies and analysis of learned user intent and its impact on response ranking in information-seeking conversations to provide interpretation of results.




1. Introduction

The recent boom of artificial intelligence has witnessed the emergence and flourishing of many intelligent personal assistant systems, including Amazon Alexa, Apple Siri, Alibaba AliMe, Microsoft Cortana and Google Assistant. This trend has led to an interest in conversational search systems, where users would be able to access information with conversational interactions. Existing approaches to building conversational systems include generation-based methods (DBLP:conf/emnlp/RitterCD11; DBLP:conf/acl/ShangLL15), retrieval-based methods (DBLP:journals/corr/JiLL14; DBLP:conf/sigir/YanSW16; DBLP:conf/sigir/YanZE17), and hybrid methods (Song:2018:ERG:3304222.3304379; yang:cikm19). Significant progress has been made on the integration of conversation context by generating reformulated queries with contexts (DBLP:conf/sigir/YanSW16), enhancing context-response matching with sequential interactions (DBLP:conf/acl/WuWXZL17), and learning with external knowledge (DBLP:conf/sigir/YangQQGZCHC18). However, much less attention has been paid to user intent in conversations and to how user intent can be leveraged for response ranking in information-seeking conversations.

To illustrate user intent in information-seeking conversations, we show an example dialog from the Microsoft Answers Community (https://answers.microsoft.com) in Table 1. Microsoft Answers Community is a customer support QA forum where users can ask questions relevant to Microsoft products. Agents, such as Microsoft employees or other experienced users, reply to these questions. There can be multi-turn conversation interactions between users and agents. We define a taxonomy of user intent following previous research (DBLP:conf/sigir/QuYCTZQ18; DBLP:journals/corr/abs-1901-03489). We can observe that there are diverse user intents such as “Original Question (OQ)”, “Information Request (IR)”, “Potential Answers (PA)”, “Follow-up Questions (FQ)”, “Further Details (FD)”, etc. in an information-seeking conversation. Moreover, several transition patterns can occur between different user intents. For example, given a question from the user, an agent could provide a potential answer directly or ask for some information as clarification questions before providing answers. Users will provide further details regarding the information requests from agents. At the beginning of a conversation, the agent may greet customers or express gratitude to users before moving on to the next steps. Near the end of a conversation, the user may provide positive or negative feedback about answers from agents, or ask a follow-up question to continue the conversation.

| ID | Role | Utterances | Intent |
|---|---|---|---|
| Utterance-1 | User | Windows downloaded this update “2018-02 Cumulative Update for Windows 10 ……” But during the restart it says “we couldn’t complete the update, undoing changes”. So what can I do to stop this? Thanks | OQ |
| Utterance-2 | Agent | Is there any other pending updates? Try Download troubleshooter for Win 10. | IR/ PA |
| Utterance-3 | User | Yes, pending updates the same one. I already used the built in troubleshooter, it did fix some 3 issues, but doing a restart the problem persists. Can I stop updates from installing this particular one? Thanks. | PA/ FQ |
| Utterance-4 | User | Not sure if related but I just saw that Malicious Software Removal of March did not install …… | FD |
| Response-1 (Correct) | Agent | Try run troubleshooter and then restart your PC. If problem persist, open start and search for Feedback and open Feedback Hub app and report this issue. | PA |
| Response-2 (Wrong) | Agent | Glad to know that you fixed the issue, and as I said downloading the “Show or hide updates” troubleshooter and restarting the PC will help you. Thank you for asking questions and providing feedback here! | GG |

Table 1. An example dialog to illustrate user intent transition patterns from the Microsoft Answers Community. The user intents “OQ”, “IR”, “PA”, “FQ”, “FD”, “GG” denote “Original Question”, “Information Request”, “Potential Answer”, “Follow-up Question”, “Further Details”, “Greetings/ Gratitude” respectively. Lexical matches between utterances and response candidates are highlighted in the typeset version, which is best read in color.

Such user intent patterns can be helpful for conversation models to select good responses due to the following reasons: (1) The intent sequence in conversation context utterances can provide additional signals to promote response candidates with correct intent and demote response candidates with wrong intent. For example, in Table 1, given the intent sequence [OQ] [IR/ PA] [PA/ FQ] [FD], we know that the user is still expecting an answer to solve her question. Although both Response-1 and Response-2 show some lexical and semantic similarities with context utterances, only Response-1 has the intent “Potential Answers” (PA). In this case, the model should have the capability to promote the rank of Response-1 and demote Response-2. (2) Intent information can help the model to derive an importance weighting scheme over context utterances with attention mechanisms. In the given example dialog in Table 1, the model should learn to assign larger weights to utterances on question descriptions (OQ and FQ) and further details (FD) in order to address the information need of the user.

Most existing neural conversation models do not explicitly model user intent in conversations. More research needs to be done to understand the role of user intent in response retrieval and to develop effective models for intent-aware response ranking in information-seeking conversations, which is exactly the goal of this paper. There is some existing related work from the Dialog System Technology Challenge (formerly the Dialog State Tracking Challenge, DSTC) (https://www.microsoft.com/en-us/research/event/dialog-state-tracking-challenge/). Many DSTC tasks focus on goal-oriented conversations like restaurant reservation. These tasks are typically tackled with slot filling (Zhang:2016:JMI:3060832.3061040; DBLP:journals/csl/HoriPHHBITTYK19), which is not applicable to information-seeking conversations because of the diversity of information needs. Recently, in DSTC7 of 2018 (http://workshop.colips.org/dstc7/), an end-to-end response selection challenge was introduced, which shares a similar motivation with our work. However, the evaluation treated response selection as a classification task and there was no explicit modeling of user intent.

In this paper, we analyze user intent in information-seeking conversations and propose neural ranking models with the integration of user intent modeling. Different user intent types are defined and characterized following previous research (DBLP:conf/sigir/QuYCTZQ18; DBLP:journals/corr/abs-1901-03489). Then we propose an intent-aware neural ranking model for response retrieval, which is built on top of recent breakthroughs in natural language representation learning with Transformers (NIPS2017_Transformers; DBLP:journals/corr/abs-1810-04805). We refer to the proposed model as “IART” (pronounced “art”), which stands for “Intent-Aware Ranking with Transformers”. IART incorporates intent-aware utterance attention to derive the importance weighting scheme of utterances in conversation context towards better conversation history understanding. We conduct extensive experiments with three information-seeking conversation data sets: MSDialog (https://ciir.cs.umass.edu/downloads/msdialog/) (DBLP:conf/sigir/QuYCTZQ18), Ubuntu Dialog Corpus (UDC) (DBLP:journals/corr/LowePSP15), and a commercial customer service data set from the AliMe assistant (alime-demo) at Alibaba Group (AliMe). We compare our methods with various neural ranking models and baseline methods on response selection in multi-turn conversations, including the recently proposed Deep Attention Matching Network (DAM) (DBLP:conf/acl/WuLCZDYZL18). The results show that our methods outperform all baselines. We also perform visualization and analysis of learned user intent patterns.

Our contributions can be summarized as follows: (1) We analyze user intent in information-seeking conversations for intent-aware response ranking. To the best of our knowledge, our work is the first to explicitly define and model user intent for response ranking in information-seeking conversations. (2) We propose an intent-aware response ranking model with Transformers to utilize user intent information for response ranking. (3) Experimental results with three different conversation data sets show that our methods outperform various baselines. We also perform analysis on learned user intent and ranking examples to provide insights. The code of our model implementation will be released on GitHub (https://github.com/yangliuy/Intent-Aware-Ranking-Transformers).

2. Related Work

User Intent in Conversations. Some previous research studied utterance intent modeling in conversation systems (Stolcke2000Dialogue; DBLP:journals/corr/abs-1710-10609; bhatia2014summarizing; Shiga:2017:MIN:3077136.3080787). Stolcke2000Dialogue performed dialog act classification with a statistical approach on the SwitchBoard corpus, which consists of human-human chit-chat conversations. In this paper, we explore how to combine utterance intent modeling with response ranking in conversations, so that the learned user intent of context utterances and response candidates can help the model select better responses in information-seeking conversations.

Conversational Search. Our research is relevant to conversational search (radlinski2017theoretical; zhang2018towards; thomas2017misc; DBLP:journals/corr/YangZZGC17), which has received significant attention recently. Radlinski and Craswell described the basic features of conversational search systems (radlinski2017theoretical). Zhang et al. (zhang2018towards) introduced the System Ask User Respond (SAUR) paradigm for conversational search and recommendation. In addition to conversational search models, researchers have also studied the medium of conversational search (spina2017extracting; trippas2015towards). Our research targets response ranking in information-seeking conversations, with Transformer-based ranking models and the integration of user intent modeling.

Neural Conversational Models. There is growing interest in research about conversation response generation and ranking with deep learning and reinforcement learning (DBLP:journals/corr/abs-1809-08267). Existing work includes retrieval-based methods (DBLP:conf/acl/WuWXZL17; DBLP:conf/emnlp/ZhouDWZYTLY16; DBLP:conf/sigir/YanSW16; DBLP:conf/cikm/YanSZW16; DBLP:conf/sigir/YanZE17; DBLP:conf/sigir/YangQQGZCHC18; Tao:2019:MFN:3289600.3290985; DBLP:conf/cikm/QuYQZCCI19), generation-based methods (DBLP:conf/acl/ShangLL15; DBLP:conf/emnlp/RitterCD11; DBLP:conf/naacl/SordoniGABJMNGD15; DBLP:journals/corr/VinyalsL15; P17-1045; alime-chat), and hybrid methods (Song:2018:ERG:3304222.3304379; yang:cikm19). Our work is a retrieval-based method. DBLP:conf/acl/WuLCZDYZL18 investigated matching a response with conversation contexts with dependency information learned by Transformers. Our proposed models are also built with Transformer encoders. The main difference between our work and their research is that we explicitly define and model user intent in conversations. We show that the intent-aware attention mechanism can help improve response ranking in conversations.

Neural Ranking Models. Recent progress of research on neural approaches to IR has introduced a number of neural ranking models for information retrieval, question answering and conversation response ranking (DBLP:journals/corr/abs-1903-06902). These models include representation focused models (DBLP:conf/cikm/HuangHGDAH13) and interaction focused models (DBLP:conf/nips/HuLLC14; DBLP:conf/aaai/PangLGXWC16; alime-tl; Guo:2016:DRM:2983323.2983769; Yang:2016:ARS:2983323.2983818). The neural ranking models proposed in our research adopt Transformers, which are solely based on attention mechanisms, as the encoder to learn representations.

3. Our Approach

3.1. Problem Formulation

The research problem of response ranking in information-seeking conversations is defined as follows. We are given an information-seeking conversation data set $\mathcal{D} = \{(\mathcal{U}_i, \mathcal{R}_i, \mathcal{Y}_i)\}_{i=1}^{N}$, where $\mathcal{U}_i = \{u_i^1, u_i^2, \ldots, u_i^t\}$, in which $u_i^t$ is the utterance in the $t$-th turn of the $i$-th dialog. $\mathcal{R}_i = \{r_i^1, \ldots, r_i^k\}$ and $\mathcal{Y}_i = \{y_i^1, \ldots, y_i^k\}$ are a set of response candidates and the corresponding labels, where $y_i^k = 1$ denotes that $r_i^k$ is a true response for $\mathcal{U}_i$; otherwise $y_i^k = 0$. For user intent information, there are sequence-level user intent labels for both dialog context utterances and response candidates, $\mathcal{I} = \{(\mathcal{I}_i^u, \mathcal{I}_i^r)\}_{i=1}^{N}$, where $\mathcal{I}_i^u$ and $\mathcal{I}_i^r$ are user intent labels for context utterances and response candidates for the $i$-th dialog respectively. Our task is to learn a ranking model $f(\cdot)$ with $\mathcal{D}$ and $\mathcal{I}$. For any given $\mathcal{U}_i$, the model should be able to generate a ranking list for the candidate responses $\mathcal{R}_i$ with $f(\cdot)$. Note that in practice, $\mathcal{I}$ can come from the predicted results of user intent classifiers to reduce human annotation costs. In our paper, $\mathcal{I}$ are the predicted results of the user intent classifier (DBLP:journals/corr/abs-1901-03489) for MSDialog and Ubuntu Dialog Corpus. For AliMe data, $\mathcal{I}$ is the output of the intention classifier, which is a probabilistic distribution over 40 intention scenarios (alime-demo).
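To make the formulation concrete, the following minimal sketch ranks the candidate responses of one dialog by the score a ranking function assigns to each context/candidate pair. The helper `rank_candidates` and the toy word-overlap scorer `overlap_score` are hypothetical stand-ins for the learned model, introduced only for illustration; they are not the authors' implementation.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    context: List[str],
    candidates: List[str],
    score_fn: Callable[[List[str], str], float],
) -> List[Tuple[str, float]]:
    """Return the candidates sorted by descending score f(context, candidate)."""
    scored = [(cand, score_fn(context, cand)) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap_score(context: List[str], candidate: str) -> float:
    """Toy scorer (an assumption): share of candidate words seen in the context."""
    context_words = {w.lower() for utt in context for w in utt.split()}
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    return sum(w in context_words for w in cand_words) / len(cand_words)


# One toy dialog context with two response candidates.
context = ["the update fails on restart", "did you run the troubleshooter"]
candidates = ["glad it works now", "run the troubleshooter and restart"]
ranking = rank_candidates(context, candidates, overlap_score)
```

In the setting of this section, the scorer would be the trained ranking model, which also receives the intent information of the context utterances and of each candidate.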

3.2. Method Overview

In the following sections, we describe the proposed method for intent-aware response ranking in information-seeking conversations. The model incorporates intent-aware utterance attention to derive the importance weighting scheme of different context utterances. Given input context utterances and response candidates, we first generate representations from two different perspectives: user intent representations with a trained neural classifier and semantic information encoding with Transformers. Then self-attention and cross-attention matching are performed over the encoded representations from Transformers to extract matching features. These matching features are weighted by the intent-aware attention mechanism and aggregated into a matching tensor. Finally, a two-layer 3D convolutional neural network distills final representations over the matching tensor and generates the ranking score for the conversation context/ response candidate pair.

3.3. User Intent Taxonomy

We use the MSDialog data developed by DBLP:conf/sigir/QuYCTZQ18, which consists of technical support dialogs for Microsoft products. A sample of dialogs and their utterances was annotated with user intent on Amazon Mechanical Turk (https://www.mturk.com/). A taxonomy of 12 labels, presented in Table 2, was developed to characterize the user intent in information-seeking conversations. The user intent labels include question related labels (e.g., Original Question, Clarifying Question, etc.), answer related labels (e.g., Potential Answer, Further Details, etc.), feedback related labels (e.g., Positive Feedback, Negative Feedback) and greeting related labels (e.g., Greetings/ Gratitude), which cover most of the user intent types in information-seeking conversations. In addition to MSDialog, we also consider the Ubuntu Dialog Corpus (UDC) (DBLP:journals/corr/LowePSP15). User intent annotation is also performed for randomly sampled UDC utterances. More details can be found in DBLP:conf/sigir/QuYCTZQ18.

| Code | Label | Description |
|---|---|---|
| OQ | Original Question | The first question that initiates a QA dialog |
| RQ | Repeat Question | Questions repeating a previous question |
| CQ | Clarifying Question | Users or agents ask for clarification |
| FD | Further Details | Users or agents provide more details |
| FQ | Follow Up Question | Follow-up questions about relevant issues |
| IR | Information Request | Agents ask for information from users |
| PA | Potential Answer | A potential solution to solve the question |
| PF | Positive Feedback | Positive feedback for working solutions |
| NF | Negative Feedback | Negative feedback for useless solutions |
| GG | Greetings/Gratitude | Greet each other or express gratitude |
| JK | Junk | No useful information in the utterance |
| O | Others | Utterances that cannot be categorized |

Table 2. Descriptions of user intent taxonomy.

Figure 1. The architecture of the IART model for intent-aware conversation response ranking.

3.4. Utterance/ Response Input Representations

Given a response candidate and an utterance in the conversation context, we represent the utterance/ response pair from two different perspectives: 1) user intent representation with intent classifiers (Section 3.4.1); 2) utterance/ response semantic information encoding with Transformers (Section 3.4.2).

3.4.1. User Intent Representation

To represent user intent, we adopt the best setting of the neural classifiers, CNN-Context-Rep, proposed by DBLP:journals/corr/abs-1901-03489 for user intent classification. Specifically, given the sequences of embedding vectors for a context utterance and a response candidate, convolutional filters are applied to a window of words to produce a new feature. This operation is applied to every possible window of words in the utterance and generates a feature map. Max pooling is applied to select the most salient feature. The model uses multiple filters with varying window sizes to obtain multiple features at different granularities. These features are concatenated and flattened into an output tensor, which is projected with a fully connected layer into a tensor whose length is the number of different user intent labels (12 in our experiments for MSDialog and UDC, as presented in Section 3.3).
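The convolution, max pooling and projection steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CNN-Context-Rep classifier itself: it uses one filter per window size, a ReLU non-linearity (assumed), no bias terms, and made-up toy weights; the names `conv_max_pool` and `intent_logits` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def conv_max_pool(embeddings: Matrix, filt: Matrix) -> float:
    """Slide one filter of shape (window, dim) over the word embeddings and
    return the maximum of the resulting feature map (max pooling).
    Assumes the utterance has at least `window` words (pad otherwise)."""
    window = len(filt)
    feature_map = []
    for start in range(len(embeddings) - window + 1):
        total = 0.0
        for offset in range(window):
            word_vec = embeddings[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        feature_map.append(max(total, 0.0))  # ReLU, an assumed non-linearity
    return max(feature_map)


def intent_logits(embeddings: Matrix, filters: List[Matrix], fc: Matrix) -> Vector:
    """Concatenate the pooled features of all filters, then project them to one
    score per intent label with a fully connected layer (bias omitted)."""
    pooled = [conv_max_pool(embeddings, f) for f in filters]
    return [sum(w * p for w, p in zip(row, pooled)) for row in fc]


# Toy utterance: 4 words with embedding dimension 2 (all values are made up).
utterance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # window of 2 words
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # window of 3 words
]
fc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 hypothetical intent labels
logits = intent_logits(utterance, filters, fc)
```

With 12 intent labels, `fc` would have 12 rows, and the resulting vector serves as the intent representation of an utterance or response candidate.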

3.4.2. Utterance/ Response Encoding and Matching with Transformers

We adopt the encoder architecture in Transformers (NIPS2017_Transformers) to encode the semantic dependency information in utterance/ response pairs. Transformers are built with Scaled Dot-Product Attention, which maps a query and a set of key-value pairs to an output. Following the design of Transformers, we also add a feed-forward network FFN with ReLU activation over the layer-normalized (DBLP:journals/corr/BaKH16) sum of the attention output and the query. We refer to this module as the TransformerEncoder module, which is used as a feature extractor for utterances and responses to capture both the dependency information between words in the same sequence and the interactions between words in two different sequences. We consider both self-attention and cross-attention based interaction matching to learn representations for context utterance/ response candidate pairs.
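As an illustration of this module, the sketch below implements scaled dot-product attention, layer normalization of the sum of the attention output and the query, and a two-layer feed-forward network with ReLU, in plain Python. The function names, the omission of learned layer-norm gains and biases, and the toy weights are assumptions made for brevity; any further residual connection around the FFN is also left out.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out


def layer_norm(vec: Vector, eps: float = 1e-6) -> Vector:
    """Normalize one vector to zero mean and unit variance (no learned gain)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def matmul(rows: Matrix, weights: Matrix) -> Matrix:
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in rows]


def transformer_encoder(query: Matrix, key: Matrix, value: Matrix,
                        w1: Matrix, w2: Matrix) -> Matrix:
    """Attention, then layer norm of (output + query), then FFN with ReLU."""
    attended = attention(query, key, value)
    normed = [layer_norm([a + q for a, q in zip(att, qr)])
              for att, qr in zip(attended, query)]
    hidden = [[max(0.0, h) for h in row] for row in matmul(normed, w1)]
    return matmul(hidden, w2)


utterance = [[1.0, 0.0], [0.0, 1.0]]               # 2 words, dimension 2
response = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]    # 3 words, dimension 2
identity = [[1.0, 0.0], [0.0, 1.0]]                # toy FFN weights
self_repr = transformer_encoder(utterance, utterance, utterance, identity, identity)
cross_repr = transformer_encoder(utterance, response, response, identity, identity)
```

Calling the module with the same sequence as query, key and value gives the self-attention representation; using the utterance as query and the response as key and value gives a cross-attention representation.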

3.5. Intent-aware Attention Mechanism

Given the self-attention/ cross-attention interaction matching matrices for different utterance/ response pairs from a dialog, we first stack them into a 4D matching tensor as follows:

$$\mathcal{M} = \{\mathcal{M}_{i,j,k,l}\} \in \mathbb{R}^{d_1 \times d_2 \times d_3 \times d_4}$$

where $d_1, d_2, d_3, d_4$ are the number of utterance turns in the conversation context, the number of words in the context utterance, the number of words in the response candidate and the number of stacked layers in TransformerEncoder, and $i, j, k, l$ are indexes along these dimensions of the matching tensor.

We propose an intent-aware attention mechanism to weight the matching representations of different utterance turns in a conversation context, so that the model can learn to attend to different utterance turns in context. The motivation is to incorporate a more flexible way to weight and aggregate matching features of different turns with intent-aware attention. Specifically, let $\mathbf{I}_{u_i}$ and $\mathbf{I}_{r}$ denote the intent representation vectors defined in Section 3.4.1 for the $i$-th context utterance and the response candidate. We design three different types of intent-aware attention as follows:

Dot Product. We concatenate the two intent representation vectors of the utterance/ response pair, and compute the dot product between the parameter and the concatenated vector: $a_i = \mathrm{softmax}(\mathbf{w}^{\top} [\mathbf{I}_{u_i}; \mathbf{I}_{r}])$, where $\mathbf{w}$ is a model parameter.

Bilinear. We compute the bilinear interaction between $\mathbf{I}_{u_i}$ and $\mathbf{I}_{r}$ and then normalize the result: $a_i = \mathrm{softmax}(\mathbf{I}_{u_i}^{\top} \mathbf{W} \mathbf{I}_{r})$, where $\mathbf{W}$ is the bilinear interaction matrix to be learned.

Outer Product. We compute the outer product between $\mathbf{I}_{u_i}$ and $\mathbf{I}_{r}$ and then flatten the result matrix to a feature vector. Finally we project this feature vector into an attention score with a fully connected layer and a softmax function: $a_i = \mathrm{softmax}(\mathbf{w}^{\top} \mathrm{flat}(\mathbf{I}_{u_i} \otimes \mathbf{I}_{r}))$, where $\mathrm{flat}$ denotes the flatten layer, which transforms a matrix with shape $(n, n)$ into a vector with shape $(n^2, 1)$ for $n$ intent labels, and $\otimes$ denotes the outer product operation. $\mathbf{w}$ is a model parameter.

Note that the normalization in the softmax function is performed over all utterance turns within a conversation context. Thus the result $a_i$ is the attention weight corresponding to the $i$-th utterance turn in a conversation context. We also add masks over the padded utterance turns to avoid introducing noisy matching feature representations. With the computed attention weights over context utterance turns, we can scale the 4D matching tensor to generate a weighted matching tensor:

$$\tilde{\mathcal{M}}_{i,j,k,l} = a_i \cdot \mathcal{M}_{i,j,k,l}$$
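The three scoring variants, the masked softmax over utterance turns, and the scaling of the matching tensor can be sketched as below. This is a minimal illustration with hypothetical names (`dot_score`, `bilinear_score`, `outer_score`, `masked_softmax`, `scale`), toy two-dimensional intent vectors, and an identity bilinear matrix in place of learned parameters; it is not the authors' code.

```python
import math
from typing import List

Vector = List[float]


def dot_score(u: Vector, r: Vector, w: Vector) -> float:
    """Dot product of a parameter vector with the concatenation [u; r]."""
    return sum(wi * xi for wi, xi in zip(w, u + r))


def bilinear_score(u: Vector, r: Vector, W: List[Vector]) -> float:
    """Bilinear interaction u^T W r."""
    return sum(u[i] * W[i][j] * r[j]
               for i in range(len(u)) for j in range(len(r)))


def outer_score(u: Vector, r: Vector, w: Vector) -> float:
    """Flatten the outer product of u and r, then project it with w."""
    flat = [ui * rj for ui in u for rj in r]
    return sum(wi * fi for wi, fi in zip(w, flat))


def masked_softmax(scores: Vector, mask: List[bool]) -> Vector:
    """Softmax over utterance turns; padded turns (mask False) get weight 0.
    Assumes at least one real (unmasked) turn."""
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def scale(tensor, factor: float):
    """Multiply every entry of a nested-list tensor by `factor`."""
    if isinstance(tensor, list):
        return [scale(t, factor) for t in tensor]
    return tensor * factor


# Three turns (the last one is padding) and one response candidate.
turn_intents = [[0.9, 0.1], [0.1, 0.9], [0.0, 0.0]]
mask = [True, True, False]
response_intent = [0.2, 0.8]
W = [[1.0, 0.0], [0.0, 1.0]]  # toy bilinear matrix (learned in practice)

scores = [bilinear_score(u, response_intent, W) for u in turn_intents]
weights = masked_softmax(scores, mask)

# Toy matching tensor of shape (turns, words_u, words_r, layers) = (3, 1, 1, 1).
matching = [[[[1.0]]], [[[1.0]]], [[[1.0]]]]
weighted = [scale(turn, a) for turn, a in zip(matching, weights)]
```

In this toy case the second turn, whose intent vector is closest to that of the response, receives the largest weight, and the padded turn contributes nothing to the weighted tensor.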


Finally, IART adopts a two-layer 3D convolutional neural network (CNN) (https://www.tensorflow.org/api_docs/python/tf/nn/conv3d) to extract important matching features from this weighted matching tensor. A 3D CNN requires 5D input and filter tensors; the 5D input is obtained by adding one more dimension, corresponding to the batched training examples, to the 4D weighted matching tensor. We compute the final matching score with an MLP over the flattened output of the 3D CNN. For model training, we compute the cross-entropy loss between the predicted matching scores and the ground truth matching labels. The parameters of IART are optimized using back-propagation with the Adam algorithm (DBLP:journals/corr/KingmaB14).
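The training objective can be illustrated with a small sketch of the cross-entropy loss. It assumes, which the text does not state, that each predicted matching score is a probability in (0, 1), for example the output of a sigmoid, and that labels are 0 or 1; the function name `cross_entropy` is hypothetical.

```python
import math
from typing import List


def cross_entropy(scores: List[float], labels: List[int],
                  eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted matching probabilities
    and 0/1 ground truth labels."""
    losses = []
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# One positive and one negative context/response pair.
uninformed = cross_entropy([0.5, 0.5], [1, 0])
confident_right = cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = cross_entropy([0.1, 0.9], [1, 0])
```

The loss is lowest when true responses receive scores near 1 and wrong responses receive scores near 0, which is what pushes correct candidates to the top of the ranking.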

4. Experiments

4.1. Data Set Description

We evaluated our method with three data sets: Ubuntu Dialog Corpus (UDC), MSDialog, and a commercial data set collected from the AliMe assistant at Alibaba Group. The statistics of the experimental data sets are shown in Table 3. The Ubuntu Dialog Corpus (UDC) (DBLP:journals/corr/LowePSP15) contains multi-turn technical support conversation chat logs on the Ubuntu system. We used the data copy shared by DBLP:journals/corr/XuLWSW16, which is also used in several previous related works (DBLP:conf/acl/WuWXZL17; DBLP:conf/acl/WuLCZDYZL18; DBLP:conf/sigir/YangQQGZCHC18). The data can be downloaded from https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntu%20data.zip?dl=0. MSDialog was released in previous related work by DBLP:conf/sigir/QuYCTZQ18. It contains QA dialogs on various Microsoft products crawled from the Microsoft Answers community. The AliMe data set contains the chat logs between customers and the AliMe assistant bot at Alibaba. For each query in this data set, there are several response candidates from the chatbot engine, which are labeled by a business analyst. The details about these data sets are in DBLP:conf/sigir/YangQQGZCHC18. Note that the proposed model focuses on response re-ranking rather than one-step response retrieval.

| Items | UDC Train | UDC Valid | UDC Test | MSDialog Train | MSDialog Valid | MSDialog Test | AliMe Train | AliMe Valid | AliMe Test |
|---|---|---|---|---|---|---|---|---|---|
| # C-R pairs | 1000k | 500k | 500k | 173k | 37k | 35k | 51k | 6k | 6k |
| # Cand. per C | 2 | 10 | 10 | 10 | 10 | 10 | 15 | 15 | 15 |
| # + Cand. per C | 1 | 1 | 1 | 1 | 1 | 1 | 2.9 | 2.8 | 2.9 |
| Avg # turns per C | 10.1 | 10.1 | 10.1 | 5.0 | 4.9 | 4.4 | 2.4 | 2.1 | 2.2 |
| Avg # words per C | 116.8 | 116.3 | 116.7 | 451.3 | 435.2 | 375.1 | 38.3 | 35.3 | 34.2 |
| Avg # words per R | 22.2 | 22.2 | 22.3 | 106.1 | 107.4 | 105.5 | 4.9 | 4.7 | 4.6 |

Table 3. The statistics of experimental datasets, where C denotes context and R denotes response. # Cand. per C denotes the number of candidate responses per context, and # + Cand. per C the number of positive candidates per context. Note that we did not filter any stop words or words with low frequency for computing the average length of contexts or responses.

| Methods | UDC R10@1 | UDC R10@2 | UDC R10@5 | UDC MAP | MSDialog R10@1 | MSDialog R10@2 | MSDialog R10@5 | MSDialog MAP | AliMe R10@1 | AliMe R10@2 | AliMe R10@5 | AliMe MAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BM25 (Robertson:1994:SEA:188490.188561) | 0.5138 | 0.6439 | 0.8206 | 0.6504 | 0.2626 | 0.3933 | 0.6329 | 0.4387 | 0.2371 | 0.4204 | 0.6407 | 0.6392 |
| BM25-PRF (DBLP:conf/sigir/YangQQGZCHC18) | 0.5289 | 0.6554 | 0.8292 | 0.6620 | 0.2652 | 0.3970 | 0.6423 | 0.4419 | 0.2454 | 0.4209 | 0.6510 | 0.6412 |
| MV-LSTM (DBLP:conf/aaai/WanLGXPC16) | 0.4973 | 0.6733 | 0.8936 | 0.6611 | 0.2768 | 0.5000 | 0.8516 | 0.5059 | 0.2480 | 0.4105 | 0.7017 | 0.7734 |
| DRMM (Guo:2016:DRM:2983323.2983769) | 0.5287 | 0.6773 | 0.8776 | 0.6749 | 0.3507 | 0.5854 | 0.9003 | 0.5704 | 0.2212 | 0.3616 | 0.6575 | 0.7165 |
| Duet (Mitra:2017:LMU:3038912.3052579) | 0.4756 | 0.5592 | 0.8272 | 0.5692 | 0.2934 | 0.5046 | 0.8481 | 0.5158 | 0.2433 | 0.4088 | 0.6870 | 0.7651 |
| DMN-KD (DBLP:conf/sigir/YangQQGZCHC18) | 0.6443 | 0.7841 | 0.9351 | 0.7655 | 0.4908 | 0.7089 | 0.9304 | 0.6728 | 0.3596 | 0.5122 | 0.7631 | 0.8323 |
| DMN-PRF (DBLP:conf/sigir/YangQQGZCHC18) | 0.6552 | 0.7893 | 0.9343 | 0.7719 | 0.5021 | 0.7122 | 0.9356 | 0.6792 | 0.3601 | 0.5323 | 0.7701 | 0.8435 |
| DAM (DBLP:conf/acl/WuLCZDYZL18) | 0.7686 | 0.8739 | 0.9697 | 0.8527 | 0.7012 | 0.8527 | 0.9715 | 0.8150 | 0.3819 | 0.5567 | 0.7717 | 0.8452 |
| IART-Dot | 0.7703 | 0.8746 | 0.9688 | 0.8535 | 0.7234 | 0.8650 | 0.9772 | 0.8300 | 0.3821 | 0.5547 | 0.7802 | 0.8454 |
| IART-Outerproduct | 0.7717 | 0.8766 | 0.9691 | 0.8548 | 0.7212 | 0.8664 | 0.9749 | 0.8289 | 0.3901 | 0.5649 | 0.7812 | 0.8493 |
| IART-Bilinear | 0.7713 | 0.8747 | 0.9688 | 0.8542 | 0.7317 | 0.8752 | 0.9792 | 0.8364 | 0.3892 | 0.5592 | 0.7801 | 0.8471 |

Table 4. Comparison of different models over the Ubuntu Dialog Corpus (UDC), MSDialog, and AliMe data sets. In the typeset version, numbers in bold font mean the result is better than the best baseline DAM, and markers denote statistically significant differences over DAM as measured by the Student’s paired t-test.

| Field | Content |
|---|---|
| Context | [User] Hi, I have the new Outlook which updated a few days ago. I cannot find how to add senders to my blocked senders list manually. How do I do this on the new Outlook? Thanks [Agent] Hi, There are different ways to block senders on Outlook depending on the version of Outlook that you are using. May we know what version of Outlook are you using? [User] Hi, I’m using the desktop website beta version. Thanks. [Agent] Desktop Website beta version? Are you referring to the Outlook Web App or the Windows mail? [User] I go to Outlook.com and sign in on there. |
| Context Intent | [OQ] [IR] [PA] [IR] [FD/ OQ] |

| Method | Label | Top-1 Ranked Response |
|---|---|---|
| DAM | 0 | Thanks for the reply. Some email domain needs to be manually added to Outlook. However, it’s good to know that the issue is resolved from your end. Should you need further assistance in the future, please do let us know. [PF] |
| IART-Bilinear | 1 | In Outlook Web App …… to manually block an email address, follow these steps: …… Let us know how things go. [PA] |

Table 5. A case study and examples of Top-1 ranked responses by different methods. “Label” is the ground truth label of a response candidate (1 for a true response, 0 otherwise).

4.2. Experimental Setup

4.2.1. Baselines

We consider the baselines listed below. Note that an experimental setup in which we compare our method with baselines without user intent modeling is reasonable: user intent modeling should be added only to the treatment, not to the baselines, so that the controlled comparison shows the effectiveness of incorporating user intent.

Traditional retrieval models: these methods treat the dialog context as the query to retrieve response candidates for response selection. We consider BM25 (Robertson:1994:SEA:188490.188561) as the retrieval model. We also consider BM25-PRF (DBLP:conf/sigir/YangQQGZCHC18), which matches conversation context with the expanded responses using BM25.

Neural ranking models: we consider several representative neural ranking models: MV-LSTM (DBLP:conf/aaai/WanLGXPC16), DRMM (Guo:2016:DRM:2983323.2983769) and Duet (Mitra:2017:LMU:3038912.3052579). We also consider models based on Deep Matching Networks (DMN) with external knowledge (DBLP:conf/sigir/YangQQGZCHC18), which incorporate external knowledge with pseudo-relevance feedback (DMN-PRF) and QA correspondence knowledge distillation (DMN-KD).

Deep Attention Matching Network (DAM) (DBLP:conf/acl/WuLCZDYZL18): DAM is a strong baseline method for response ranking in multi-turn conversations, with open source code available (https://github.com/baidu/Dialogue/tree/master/DAM) at the time of this paper. DAM also represents and matches a response with its multi-turn context using dependency information learned by Transformers. It does not explicitly model user intent in conversations.

For evaluation metrics, we adopted mean average precision (MAP) and $R_n@k$, which is the recall at the top $k$ ranked responses from $n$ available candidates for a given conversation context, following previous related works (DBLP:conf/acl/WuLCZDYZL18; DBLP:conf/sigir/YangQQGZCHC18; DBLP:conf/acl/WuWXZL17; DBLP:journals/corr/LowePSP15).
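Both metrics can be computed from the ground truth labels of the candidates once they are sorted by predicted score. The sketch below is a plain-Python illustration; the function names are hypothetical, and the handling of contexts without any positive candidate (a score of 0) is an assumption.

```python
from typing import List


def recall_at_k(ranked_labels: List[int], k: int) -> float:
    """Recall at top k for one context: the share of positive candidates found
    in the top k, where ranked_labels holds the 0/1 labels of all n candidates
    in ranked order."""
    positives = sum(ranked_labels)
    if positives == 0:
        return 0.0
    return sum(ranked_labels[:k]) / positives


def average_precision(ranked_labels: List[int]) -> float:
    """Average of the precision values at the rank of each positive candidate."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_over_contexts(values: List[float]) -> float:
    return sum(values) / len(values)


# Two contexts with 10 candidates each; the true response is ranked 2nd and 1st.
rankings = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
r10_at_1 = mean_over_contexts([recall_at_k(r, 1) for r in rankings])
r10_at_2 = mean_over_contexts([recall_at_k(r, 2) for r in rankings])
mean_ap = mean_over_contexts([average_precision(r) for r in rankings])
```

Dividing by the number of positives matters for data such as AliMe, where a context can have more than one correct candidate.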

4.2.2. Parameter Settings and Implementation Details.

All models are implemented with TensorFlow (https://www.tensorflow.org/) and the MatchZoo (https://github.com/NTMC-Community/MatchZoo) toolkit. Hyper-parameters are tuned with the validation data. For the hyper-parameter settings of IART, we set the size of the convolution and pooling kernels as . The number of stacked Transformers layers is set as for UDC and for MSDialog. The batch size is for UDC and for MSDialog. All models are trained on a single Nvidia Titan X GPU. The learning rate is initialized as 1e-3 with exponential decay during the training process. The decay steps and decay rate are set as and . The maximum utterance length is for UDC and for MSDialog. The maximum number of context utterance turns is set as for UDC and for MSDialog. We padded zeros if the number of utterance turns in a context is less than the maximum number of utterance turns. For user intent labels, there are 12 different types for UDC/ MSDialog, and 40 different types for AliMe data. For the word embeddings, we trained word embeddings with the Word2Vec tool (DBLP:conf/nips/MikolovSCCD13) with the CBOW model using our training data following previous work (DBLP:conf/acl/WuWXZL17; DBLP:conf/acl/WuLCZDYZL18). The max skip length between words and the number of negative examples is set as and . The dimension of word embeddings is . Word embeddings are initialized by these pre-trained word vectors and updated during the training process.
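The learning rate schedule can be illustrated with a common continuous form of exponential decay. The initial rate of 1e-3 comes from the text; the decay steps (1000) and decay rate (0.9) used below are placeholder values chosen only for the example, and whether a staircase variant was used is not stated.

```python
def decayed_learning_rate(initial_lr: float, step: int,
                          decay_steps: int, decay_rate: float) -> float:
    """Exponential decay: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Placeholder schedule values (assumptions, not the settings of the paper).
DECAY_STEPS = 1000
DECAY_RATE = 0.9

lr_start = decayed_learning_rate(1e-3, 0, DECAY_STEPS, DECAY_RATE)
lr_after_one_period = decayed_learning_rate(1e-3, 1000, DECAY_STEPS, DECAY_RATE)
lr_after_two_periods = decayed_learning_rate(1e-3, 2000, DECAY_STEPS, DECAY_RATE)
```

Under this form the learning rate is multiplied by the decay rate once every `decay_steps` training steps, with a smooth interpolation in between.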

4.3. Evaluation Results

We present evaluation results over different methods in Table 4. We summarize our observations as follows: (1) On MSDialog, all three variations of IART with dot, outer product and bilinear based intent-aware attention mechanism show significant improvements over all baseline methods, including the recently proposed strong baseline method DAM. On UDC, IART with three different intent-aware attention mechanisms also show improvements under all metrics except for R10@5. With the comparison between the results of DAM and IART, we can find that incorporating user intent modeling and intent-aware attention weighting scheme can help improve the response ranking performance. (2) If we compare three variations of IART, we can find that the bilinear based intent-aware attention mechanism works better for MSDialog and outer product based intent-aware attention mechanism works better for UDC. The overall performances of these three model variations are close to each other. Overall our proposed model IART shows larger performance improvements on MSDialog. One possible reason is that the intent classifier on MSDialog is more accurate due to the larger annotated training data of MSDialog for user intent prediction and more formal language used in MSDialog, as shown in evaluation results by DBLP:journals/corr/abs-1901-03489. (3) On AliMe data, all three variations of IART also show comparable or better results than all baseline methods including the strong baseline DAM. These results on real product data further verify the effectiveness of our proposed methods.

4.4. Case Study and User Intent Visualization

We perform a case study in Table 5 on the top ranked responses by different methods, including the best baseline DAM and our proposed model IART with the bilinear based intent-aware attention mechanism. We show the conversation context utterances and the top-1 ranked response by each method. In this example, IART produced the correct top ranked response. We visualize the learned user intent representation of the context utterances and of the top-1 ranked responses returned by DAM and IART in Figure 2. The predicted user intent of the conversation utterances is [OQ] [IR] [PA] [IR] [FD/OQ]. The agent performed “Information Request (IR)” to confirm whether it is the Outlook Web app or the Windows desktop app. The user confirmed with “Further Details (FD)” that the problem was related to the Outlook Web app (Outlook.com). Given such a user intent pattern in the conversation context, a reasonable response would have the intent “Potential Answers (PA)”, providing potential solutions to the user’s question, which is captured by IART due to the integration of user intent modeling. The DAM model, without user intent modeling, failed in this case and selected a response candidate with “Positive Feedback (PF)” intent. The response returned by DAM assumed that “the issue is resolved”, but actually the user was expecting an answer to her unsolved technical problem. This gives an example and interpretation of why user intent modeling can be helpful for response ranking in conversations.

Figure 2. Visualization of learned user intent representation of context utterances and returned top-1 ranked response by DAM and IART from the case study in Table 5. U-0 to U-4 denote the 0-th to the 4-th utterance turns in the context. R-DAM and R-IART denote the top-1 ranked responses returned by DAM and IART respectively. Darker spots mean higher predicted probabilities.

5. Conclusions

In this paper, we analyze user intent in information-seeking conversations and propose an intent-aware neural ranking model with Transformers. We first define and characterize different user intent types, and then propose an intent-aware neural ranking model for response retrieval, which incorporates intent-aware utterance attention to derive an importance weighting scheme over the utterances in the context and thereby improve conversation history understanding. Our proposed methods outperform all baseline methods on three different data sets, including both standard benchmarks and commercial data. We also perform case studies and analysis of the learned user intent and its impact on response ranking in information-seeking conversations to provide insights.

6. Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by NSF IIS-1715095, and in part by China Postdoctoral Science Foundation (No. 2019M652038). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.