Addressee and Response Selection in Multi-Party Conversations with Speaker Interaction RNNs

09/12/2017 ∙ by Rui Zhang, et al. ∙ University of Michigan ∙ IBM ∙ Yale University

In this paper, we study the problem of addressee and response selection in multi-party conversations. Understanding multi-party conversations is challenging because of complex speaker interactions: multiple speakers exchange messages with each other, playing different roles (sender, addressee, observer), and these roles vary across turns. To tackle this challenge, we propose the Speaker Interaction Recurrent Neural Network (SI-RNN). Whereas the previous state-of-the-art system updated speaker embeddings only for the sender, SI-RNN uses a novel dialog encoder to update speaker embeddings in a role-sensitive way. Additionally, unlike the previous work that selected the addressee and response separately, SI-RNN selects them jointly by viewing the task as a sequence prediction problem. Experimental results show that SI-RNN significantly improves the accuracy of addressee and response selection, particularly in complex conversations with many speakers and responses to distant messages many turns in the past.




1 Introduction

Real-world conversations often involve more than two speakers. In the Ubuntu Internet Relay Chat channel (IRC), for example, one user can initiate a discussion about an Ubuntu-related technical issue, and many other users can work together to solve the problem. Dialogs can have complex speaker interactions: at each turn, users play one of three roles (sender, addressee, observer), and those roles vary across turns.

In this paper, we study the problem of addressee and response selection in multi-party conversations: given a responding speaker and a dialog context, the task is to select an addressee and a response from a set of candidates for the responding speaker. The task requires modeling multi-party conversations and can be directly used to build retrieval-based dialog systems [Lu and Li2013, Hu et al.2014, Ji, Lu, and Li2014, Wang et al.2015].

The previous state-of-the-art Dynamic-RNN model of Ouchi and Tsuboi (2016) maintains speaker embeddings to track each speaker's status, which changes dynamically across time steps. It then produces the context embedding from the speaker embeddings and selects the addressee and response based on embedding similarity. However, this model updates only the sender embedding with the corresponding utterance, not the embeddings of the addressee or observers, and it selects the addressee and response separately. As a result, it models only who says what and fails to capture addressee information. Experimental results show that the separate selection process often produces inconsistent addressee-response pairs.

To solve these issues, we introduce the Speaker Interaction Recurrent Neural Network (SI-RNN). SI-RNN redesigns the dialog encoder by updating speaker embeddings in a role-sensitive way. Speaker embeddings are updated in different GRU-based units depending on their roles (sender, addressee, observer). Furthermore, we note that the addressee and response are mutually dependent and view the task as a joint prediction problem. Therefore, SI-RNN models the conditional probability (of addressee given the response and vice versa) and selects the addressee and response pair by maximizing the joint probability.

On a public standard benchmark data set, SI-RNN significantly improves the addressee and response selection accuracy, particularly in complex conversations with many speakers and responses to distant messages many turns in the past. Our code and data set are available online.

2 Related Work

We follow a data-driven approach to dialog systems. Singh et al. (1999), Henderson, Lemon, and Georgila (2008), and Young et al. (2013) optimize the dialog policy using Reinforcement Learning or the Partially Observable Markov Decision Process framework. In addition, Henderson, Thomson, and Williams (2014) propose to use a predefined ontology as a logical representation for the information exchanged in the conversation. The dialog system can be divided into different modules, such as Natural Language Understanding [Yao et al.2014, Mesnil et al.2015], Dialog State Tracking [Henderson, Thomson, and Young2014, Williams, Raux, and Henderson2016], and Natural Language Generation [Wen et al.2015]. Furthermore, Wen et al. (2016) and Bordes and Weston (2017) propose end-to-end trainable goal-oriented dialog systems.

Short text conversation has recently become popular. The system receives a short dialog context and generates a response using statistical machine translation or sequence-to-sequence networks [Ritter, Cherry, and Dolan2011, Vinyals and Le2015, Shang, Lu, and Li2015, Serban et al.2016, Li et al.2016, Mei, Bansal, and Walter2017]. In contrast to response generation, the retrieval-based approach uses a ranking model to select the highest scoring response from candidates [Lu and Li2013, Hu et al.2014, Ji, Lu, and Li2014, Wang et al.2015]. However, these models are single-turn responding machines and are thus still limited to short contexts with only two speakers. For longer contexts, Lowe et al. (2015) propose the Next Utterance Classification (NUC) task for multi-turn two-party dialogs. Ouchi and Tsuboi (2016) extend NUC to multi-party conversations by integrating the addressee detection problem. Since the data is text based, they use only textual information to predict addressees, as opposed to relying on acoustic signals or gaze information in multimodal dialog systems [Jovanović, Akker, and Nijholt2006, op den Akker and Traum2009].

Furthermore, several recent papers focus on modeling role-specific information given the dialog contexts [Meng, Mou, and Jin2017, Chi et al.2017, Chen et al.2017]. For example, Meng, Mou, and Jin (2017) combine content and temporal information to predict the utterance speaker. By contrast, our SI-RNN explicitly utilizes the speaker interaction to maintain speaker embeddings and predicts the addressee and response by joint selection.

3 Preliminaries

3.1 Addressee and Response Selection

Ouchi and Tsuboi (2016) propose the addressee and response selection task for multi-party conversation. Given a responding speaker $a_{res}$ and a dialog context $\mathcal{C}$, the task is to select a response and an addressee. $\mathcal{C}$ is a list ordered by time step:

$$\mathcal{C} = \left[\left(a_{sender}^{(t)}, a_{addressee}^{(t)}, u^{(t)}\right)\right]_{t=1}^{T}$$

where $a_{sender}^{(t)}$ says $u^{(t)}$ to $a_{addressee}^{(t)}$ at time step $t$, and $T$ is the total number of time steps before the response and addressee selection. The set of speakers appearing in $\mathcal{C}$ is denoted $\mathcal{A}(\mathcal{C})$. As for the output, the addressee is selected from $\mathcal{A}(\mathcal{C})$, and the response is selected from a set of candidates $\mathcal{R}$. Here, $\mathcal{R}$ contains the ground-truth response and one or more false responses. We provide some examples in Table 4 (Section 6).
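To make the input and output of the task concrete, the following minimal Python sketch represents a context as a time-ordered list of (sender, addressee, utterance) triples and derives the set of candidate addressees from it. The class and field names (`Turn`, `Sample`, `speakers`) and the toy values are illustrative assumptions, not part of the released code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    """One time step of the context: who says what to whom."""
    sender: str
    addressee: Optional[str]  # None when no addressee is explicitly mentioned
    utterance: str


@dataclass
class Sample:
    """Input side of the addressee and response selection task."""
    responding_speaker: str
    context: List[Turn]    # ordered by time step
    candidates: List[str]  # ground-truth response plus false responses

    def speakers(self) -> List[str]:
        """Speakers appearing in the context, in order of first appearance."""
        seen: List[str] = []
        for turn in self.context:
            for name in (turn.sender, turn.addressee):
                if name is not None and name not in seen:
                    seen.append(name)
        return seen


# Toy sample with hypothetical speaker and utterance identifiers.
sample = Sample(
    responding_speaker="a1",
    context=[Turn("a2", "a1", "u1"), Turn("a1", "a3", "u2"), Turn("a3", "a2", "u3")],
    candidates=["r1", "r2"],
)
candidate_addressees = sample.speakers()
```

The model's output for such a sample is one element of `candidate_addressees` and one element of `sample.candidates`.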

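| | Data | Notation |
|---|---|---|
| Input | Responding Speaker | $a_{res}$ |
| Input | Context | $\mathcal{C}$ |
| Input | Candidate Responses | $\mathcal{R}$ |
| Output | Addressee | $a \in \mathcal{A}(\mathcal{C})$ |
| | Sender ID at time $t$ | $a_{sender}^{(t)}$ |
| | Addressee ID at time $t$ | $a_{addressee}^{(t)}$ |
| | Utterance at time $t$ | $u^{(t)}$ |
| | Utterance embedding at time $t$ | $\mathbf{u}^{(t)}$ |
| | Speaker embedding of $a_i$ at time $t$ | $\mathbf{a}_i^{(t)}$ |

Table 1: Notations for the task and model.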
Figure 1: Dialog encoders in Dynamic-RNN (Left) and SI-RNN (Right) for an example context at the top. Speaker embeddings are initialized as zero vectors and updated recurrently as hidden states along the time step. In SI-RNN, the same speaker embedding is updated in different units depending on the role ($IGRU^S$ for sender, $IGRU^A$ for addressee, $GRU^O$ for observer).

3.2 Dynamic-RNN Model

In this section, we briefly review the state-of-the-art Dynamic-RNN model [Ouchi and Tsuboi2016], which our proposed model is based on. Dynamic-RNN solves the task in two phases: 1) the dialog encoder maintains a set of speaker embeddings to track each speaker's status, which changes dynamically with the time step $t$; 2) Dynamic-RNN then produces the context embedding from the speaker embeddings and selects the addressee and response based on embedding similarity among context, speaker, and utterance.

Dialog Encoder.

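Figure 1 (Left) illustrates the dialog encoder in Dynamic-RNN on an example context. In this example, $a_2$ says $u^{(1)}$ to $a_1$, then $a_1$ says $u^{(2)}$ to $a_3$, and finally $a_3$ says $u^{(3)}$ to $a_2$. The context will be:

$$\mathcal{C} = \left[(a_2, a_1, u^{(1)}), (a_1, a_3, u^{(2)}), (a_3, a_2, u^{(3)})\right] \tag{1}$$

with the set of speakers $\mathcal{A}(\mathcal{C}) = \{a_1, a_2, a_3\}$.

For a speaker $a_i$, the bold letter $\mathbf{a}_i^{(t)}$ denotes its embedding at time step $t$. Speaker embeddings are initialized as zero vectors and updated recurrently as hidden states of GRUs [Cho et al.2014, Chung et al.2014]. Specifically, for each time step $t$ with the sender $a_{sdr}$ and the utterance $u^{(t)}$, the sender embedding $\mathbf{a}_{sdr}$ is updated recurrently from the utterance:

$$\mathbf{a}_{sdr}^{(t)} = \mathrm{GRU}\left(\mathbf{a}_{sdr}^{(t-1)}, \mathbf{u}^{(t)}\right)$$

where $\mathbf{u}^{(t)}$ is the embedding for utterance $u^{(t)}$. Other speaker embeddings are updated from $\mathbf{u}^{(t)} = \mathbf{0}$. The speaker embeddings are updated until time step $T$.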
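The sender-only update described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the original implementation: the GRU is bias-free and uses one common gating convention, and the parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`), dimensions, and toy embeddings are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def gru_step(h: Vector, x: Vector, p: Dict[str, Matrix]) -> Vector:
    """Bias-free GRU step (assumed form): returns the new hidden state."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x), matvec(p["U"], gated))]
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, cand)]


def dynamic_rnn_encode(
    context: Sequence[Tuple[str, str, str]],
    speakers: Sequence[str],
    utt_emb: Dict[str, Vector],
    params: Dict[str, Matrix],
    d_s: int,
    d_u: int,
) -> Dict[str, Vector]:
    """The sender sees the utterance embedding; everyone else sees a zero input."""
    state = {a: [0.0] * d_s for a in speakers}
    zero_input = [0.0] * d_u
    for sender, _addressee, utt_id in context:
        state = {
            a: gru_step(h, utt_emb[utt_id] if a == sender else zero_input, params)
            for a, h in state.items()
        }
    return state


rng = random.Random(0)
d_s, d_u = 3, 2
params = {
    "Wz": random_matrix(d_s, d_u, rng), "Uz": random_matrix(d_s, d_s, rng),
    "Wr": random_matrix(d_s, d_u, rng), "Ur": random_matrix(d_s, d_s, rng),
    "W": random_matrix(d_s, d_u, rng), "U": random_matrix(d_s, d_s, rng),
}
utt_emb = {"u1": [0.3, -0.2], "u2": [0.1, 0.4]}  # toy utterance embeddings
context = [("a2", "a1", "u1"), ("a1", "a3", "u2")]
final = dynamic_rnn_encode(context, ["a1", "a2", "a3"], utt_emb, params, d_s, d_u)
```

In this sketch, a speaker who never sends a message (here `a3`) keeps a zero embedding even when addressed, which illustrates why a sender-only update cannot capture addressee information.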

Selection Model.

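To summarize the whole dialog context $\mathcal{C}$, the model applies element-wise max pooling over all the speaker embeddings to get the context embedding $\mathbf{h}_{\mathcal{C}}$:

$$\mathbf{h}_{\mathcal{C}} = \max_{a_i \in \mathcal{A}(\mathcal{C})} \mathbf{a}_i^{(T)} \tag{2}$$

The probability of an addressee and a response being the ground truth is calculated based on embedding similarity. To be specific, for addressee selection, the model compares the candidate speaker $a_p$, the dialog context $\mathcal{C}$, and the responding speaker $a_{res}$:

$$P(a_p \mid \mathcal{C}) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}]^{\top} \mathbf{W}_a \mathbf{a}_p\right) \tag{3}$$

where $\mathbf{a}_{res}$ is the final speaker embedding for the responding speaker $a_{res}$, $\mathbf{a}_p$ is the final speaker embedding for the candidate addressee $a_p$, $\sigma$ is the logistic sigmoid function, $[;]$ is the row-wise concatenation operator, and $\mathbf{W}_a$ is a learnable parameter. Similarly, for response selection,

$$P(r_q \mid \mathcal{C}) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}]^{\top} \mathbf{W}_r \mathbf{r}_q\right) \tag{4}$$

where $\mathbf{r}_q$ is the embedding for the candidate response $r_q$, and $\mathbf{W}_r$ is a learnable parameter.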
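As a worked illustration of the selection model, the sketch below computes a context embedding by element-wise max pooling and scores candidate addressees with a sigmoid over a bilinear form. The embeddings, the toy parameter `W_a`, and the choice to exclude the responding speaker from the candidates are assumptions made only for this example.

```python
import math
from typing import Dict, List

Vector = List[float]


def max_pool(embeddings: List[Vector]) -> Vector:
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*embeddings)]


def bilinear_score(left: Vector, weight: List[Vector], right: Vector) -> float:
    """sigmoid(left^T W right), with W given as len(left) rows of len(right)."""
    total = sum(
        l * w * r for l, row in zip(left, weight) for w, r in zip(row, right)
    )
    return 1.0 / (1.0 + math.exp(-total))


# Toy final speaker embeddings (2-dimensional, hypothetical values).
speaker_emb: Dict[str, Vector] = {
    "a1": [0.2, -0.1],
    "a2": [0.5, 0.4],
    "a3": [0.1, 0.0],
}
responder = "a1"

h_context = max_pool(list(speaker_emb.values()))
query = speaker_emb[responder] + h_context  # concatenation of responder and context
W_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # toy 4x2 parameter

# Assumption: the responding speaker is not scored as a candidate addressee.
scores = {
    a: bilinear_score(query, W_a, emb)
    for a, emb in speaker_emb.items()
    if a != responder
}
best_addressee = max(scores, key=scores.get)
```

Response selection follows the same pattern, with candidate response embeddings in place of speaker embeddings and a separate parameter matrix.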

The model is trained end-to-end to minimize a joint cross-entropy loss for the addressee selection and the response selection with equal weights. At test time, the addressee and the response are separately selected to maximize the probability in Eq 3 and Eq 4.

4 Speaker Interaction RNN

While Dynamic-RNN can track the speaker status by capturing who says what in multi-party conversation, there are still some issues. First, at each time step, only the sender embedding is updated from the utterance. Therefore, other speakers are blind to what is being said, and the model fails to capture addressee information. Second, while the addressee and response are mutually dependent, Dynamic-RNN selects them independently. Consider a case where the responding speaker is talking to two other speakers in separate conversation threads. The choice of addressee is likely to be either of the two speakers, but the choice is much less ambiguous if the correct response is given, and vice versa. Dynamic-RNN often produces inconsistent addressee-response pairs due to the separate selection. See Table 4 for examples.

In contrast to Dynamic-RNN, the dialog encoder in SI-RNN updates the embeddings of all speakers, not only the sender, at each time step. Speaker embeddings are updated depending on their roles: the update of the sender differs from that of the addressee, which in turn differs from that of the observers. Furthermore, a speaker embedding is updated not only from the utterance but also from other speakers. These are achieved by designing variations of GRUs for different roles. Finally, SI-RNN selects the addressee and response jointly by maximizing the joint probability.

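6: // Initialize speaker embeddings
7: for each speaker $a_i \in \mathcal{A}(\mathcal{C})$ do
8:     $\mathbf{a}_i^{(0)} \leftarrow \mathbf{0}$
9: end for
10: // Update speaker embeddings
11: for $t = 1, \dots, T$ do
12:     // Update sender, addressee, observers
16:     // Compute utterance embedding
19:     // Update sender embedding
20:     $\mathbf{a}_{sdr}^{(t)} \leftarrow IGRU^S(\mathbf{a}_{sdr}^{(t-1)}, \mathbf{a}_{adr}^{(t-1)}, \mathbf{u}^{(t)})$
21:     // Update addressee embedding
22:     $\mathbf{a}_{adr}^{(t)} \leftarrow IGRU^A(\mathbf{a}_{adr}^{(t-1)}, \mathbf{a}_{sdr}^{(t-1)}, \mathbf{u}^{(t)})$
23:     // Update observer embeddings
24:     for each observer $a_{obs}$ do
25:         $\mathbf{a}_{obs}^{(t)} \leftarrow GRU^O(\mathbf{a}_{obs}^{(t-1)}, \mathbf{u}^{(t)})$
26:     end for
27: end for
28: // Return final speaker embeddings
30: return $\mathbf{a}_i^{(T)}$ for each $a_i \in \mathcal{A}(\mathcal{C})$
Algorithm 1: Dialog Encoder in SI-RNN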
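
The control flow of the dialog encoder can be sketched in Python as below. The function `encode_dialog` and the helper `bump` are hypothetical names; the three role-specific units are passed in as plain callables, and the toy stand-ins used here only count how often each speaker played each role, so they are not the GRU-based units of the model. The sketch also assumes every turn has an explicit addressee.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Turn = Tuple[str, str, Vector]  # (sender, addressee, utterance embedding)


def encode_dialog(
    context: Sequence[Turn],
    speakers: Sequence[str],
    d_s: int,
    update_sender: Callable[[Vector, Vector, Vector], Vector],
    update_addressee: Callable[[Vector, Vector, Vector], Vector],
    update_observer: Callable[[Vector, Vector], Vector],
) -> Dict[str, Vector]:
    # Initialize every speaker embedding as a zero vector.
    state = {a: [0.0] * d_s for a in speakers}
    for sender, addressee, u in context:
        prev = state  # all updates read the embeddings from the previous step
        new_state: Dict[str, Vector] = {}
        for a in speakers:
            if a == sender:
                new_state[a] = update_sender(prev[sender], prev[addressee], u)
            elif a == addressee:
                new_state[a] = update_addressee(prev[addressee], prev[sender], u)
            else:
                new_state[a] = update_observer(prev[a], u)
        state = new_state
    return state


def bump(index: int):
    """Toy stand-in for a recurrent unit: increments one coordinate."""
    def update(own: Vector, *_rest: Vector) -> Vector:
        out = list(own)
        out[index] += 1.0
        return out
    return update


toy_context = [("a2", "a1", [0.0]), ("a1", "a3", [0.0])]
role_counts = encode_dialog(
    toy_context, ["a1", "a2", "a3"], 3, bump(0), bump(1), bump(2)
)
```

With these stand-ins, each final vector reads as (times sender, times addressee, times observer), which makes the role dispatch easy to check before plugging in learned units.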

4.1 Utterance Encoder

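To encode an utterance $u = (w_1, w_2, \dots, w_N)$ of $N$ words, we use an RNN with Gated Recurrent Units [Cho et al.2014, Chung et al.2014]:

$$\mathbf{h}_j = \mathrm{GRU}\left(\mathbf{h}_{j-1}, \mathbf{x}_j\right)$$

where $\mathbf{x}_j$ is the word embedding for $w_j$, and $\mathbf{h}_j$ is the hidden state. $\mathbf{h}_0$ is initialized as a zero vector, and the utterance embedding is the last hidden state, i.e. $\mathbf{u} = \mathbf{h}_N$.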

4.2 Dialog Encoder

Figure 1 (Right) shows how SI-RNN encodes the example in Eq 1. Unlike Dynamic-RNN, SI-RNN updates all speaker embeddings in a role-sensitive manner. For example, at the first time step when $a_2$ says $u^{(1)}$ to $a_1$, Dynamic-RNN only updates $\mathbf{a}_2^{(1)}$ using $\mathbf{u}^{(1)}$, while other speakers are updated using $\mathbf{0}$. In contrast, SI-RNN updates each speaker's status with different units: $IGRU^S$ updates the sender embedding $\mathbf{a}_2^{(1)}$ from the utterance embedding $\mathbf{u}^{(1)}$ and the addressee embedding $\mathbf{a}_1^{(0)}$; $IGRU^A$ updates the addressee embedding $\mathbf{a}_1^{(1)}$ from $\mathbf{u}^{(1)}$ and $\mathbf{a}_2^{(0)}$; $GRU^O$ updates the observer embedding $\mathbf{a}_3^{(1)}$ from $\mathbf{u}^{(1)}$.

Algorithm 1 gives a formal definition of the dialog encoder in SI-RNN. The dialog encoder is a function that takes as input a dialog context $\mathcal{C}$ (lines 1-5) and returns speaker embeddings at the final time step (lines 28-30). Speaker embeddings are initialized as $d_s$-dimensional zero vectors (lines 6-9). Speaker embeddings are updated by iterating over each line in the context (lines 10-27).

4.3 Role-Sensitive Update

Figure 2: Illustration of $IGRU^S$ (upper, blue), $IGRU^A$ (middle, green), and $GRU^O$ (lower, yellow). Filled circles are speaker embeddings, which are recurrently updated. Unfilled circles are gates. Filled squares are speaker embedding proposals.

In this subsection, we explain in detail how $IGRU^S$/$IGRU^A$/$GRU^O$ update speaker embeddings according to their roles at each time step (Algorithm 1 lines 19-26).

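As shown in Figure 2, $IGRU^S$/$IGRU^A$/$GRU^O$ are all GRU-based units. $IGRU^S$ updates the sender embedding from the previous sender embedding $\mathbf{a}_{sdr}^{(t-1)}$, the previous addressee embedding $\mathbf{a}_{adr}^{(t-1)}$, and the utterance embedding $\mathbf{u}^{(t)}$:

$$\mathbf{a}_{sdr}^{(t)} \leftarrow IGRU^S\left(\mathbf{a}_{sdr}^{(t-1)}, \mathbf{a}_{adr}^{(t-1)}, \mathbf{u}^{(t)}\right)$$

The update, as illustrated in the upper part of Figure 2, is controlled by three gates. The $\mathbf{r}_S^{(t)}$ gate controls the previous sender embedding $\mathbf{a}_{sdr}^{(t-1)}$, and $\mathbf{p}_S^{(t)}$ controls the previous addressee embedding $\mathbf{a}_{adr}^{(t-1)}$. Those two gated interactions together produce the sender embedding proposal $\tilde{\mathbf{a}}_{sdr}^{(t)}$. Finally, the update gate $\mathbf{z}_S^{(t)}$ combines the proposal and the previous sender embedding to update the sender embedding $\mathbf{a}_{sdr}^{(t)}$. The computations in $IGRU^S$ (including gates $\mathbf{r}_S^{(t)}$, $\mathbf{p}_S^{(t)}$, $\mathbf{z}_S^{(t)}$, the proposal embedding $\tilde{\mathbf{a}}_{sdr}^{(t)}$, and the final updated embedding $\mathbf{a}_{sdr}^{(t)}$) are formulated as:

$$\begin{aligned}
\mathbf{r}_S^{(t)} &= \sigma\left(\mathbf{W}_S^{r}\mathbf{u}^{(t)} + \mathbf{U}_S^{r}\mathbf{a}_{sdr}^{(t-1)} + \mathbf{V}_S^{r}\mathbf{a}_{adr}^{(t-1)}\right) \\
\mathbf{p}_S^{(t)} &= \sigma\left(\mathbf{W}_S^{p}\mathbf{u}^{(t)} + \mathbf{U}_S^{p}\mathbf{a}_{sdr}^{(t-1)} + \mathbf{V}_S^{p}\mathbf{a}_{adr}^{(t-1)}\right) \\
\mathbf{z}_S^{(t)} &= \sigma\left(\mathbf{W}_S^{z}\mathbf{u}^{(t)} + \mathbf{U}_S^{z}\mathbf{a}_{sdr}^{(t-1)} + \mathbf{V}_S^{z}\mathbf{a}_{adr}^{(t-1)}\right) \\
\tilde{\mathbf{a}}_{sdr}^{(t)} &= \tanh\left(\mathbf{W}_S\mathbf{u}^{(t)} + \mathbf{U}_S\left(\mathbf{r}_S^{(t)} \odot \mathbf{a}_{sdr}^{(t-1)}\right) + \mathbf{V}_S\left(\mathbf{p}_S^{(t)} \odot \mathbf{a}_{adr}^{(t-1)}\right)\right) \\
\mathbf{a}_{sdr}^{(t)} &= \mathbf{z}_S^{(t)} \odot \mathbf{a}_{sdr}^{(t-1)} + \left(1 - \mathbf{z}_S^{(t)}\right) \odot \tilde{\mathbf{a}}_{sdr}^{(t)}
\end{aligned}$$

where $\{\mathbf{W}_S^{r}, \mathbf{W}_S^{p}, \mathbf{W}_S^{z}, \mathbf{U}_S^{r}, \mathbf{U}_S^{p}, \mathbf{U}_S^{z}, \mathbf{V}_S^{r}, \mathbf{V}_S^{p}, \mathbf{V}_S^{z}, \mathbf{W}_S, \mathbf{U}_S, \mathbf{V}_S\}$ are learnable parameters. $IGRU^A$ uses the same formulation with a different set of parameters, as illustrated in the middle of Figure 2. In addition, we update the observer embeddings from the utterance. $GRU^O$ is implemented as the traditional GRU unit in the lower part of Figure 2. Note that the parameters in $IGRU^S$/$IGRU^A$/$GRU^O$ are not shared. This allows SI-RNN to learn role-dependent features to control speaker embedding updates. The formulations of $IGRU^A$ and $GRU^O$ are similar.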
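A minimal pure-Python sketch of one such three-gate update follows. It assumes that each gate is a sigmoid of a bias-free linear map of the utterance embedding and the two speaker embeddings, and that the update gate interpolates between the previous embedding and the proposal. The function and parameter names (`igru_step`, `init_params`, `Wr`, `Up`, and so on) and the dimensions are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add3(a: Vector, b: Vector, c: Vector) -> Vector:
    return [x + y + z for x, y, z in zip(a, b, c)]


def sigmoid_vec(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def hadamard(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def igru_step(own: Vector, other: Vector, u: Vector, p: Dict[str, Matrix]) -> Vector:
    """Update `own` from its previous value, the other party, and the utterance."""
    def gate(name: str) -> Vector:
        return sigmoid_vec(add3(matvec(p["W" + name], u),
                                matvec(p["U" + name], own),
                                matvec(p["V" + name], other)))

    r, q, z = gate("r"), gate("p"), gate("z")  # q gates the other party
    proposal = [math.tanh(x) for x in add3(matvec(p["W"], u),
                                           matvec(p["U"], hadamard(r, own)),
                                           matvec(p["V"], hadamard(q, other)))]
    return [zi * oi + (1.0 - zi) * ci for zi, oi, ci in zip(z, own, proposal)]


def init_params(d_s: int, d_u: int, rng: random.Random) -> Dict[str, Matrix]:
    """Uniform initialization in [-0.01, 0.01] for all twelve matrices."""
    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]

    params: Dict[str, Matrix] = {}
    for suffix in ("r", "p", "z", ""):
        params["W" + suffix] = mat(d_s, d_u)
        params["U" + suffix] = mat(d_s, d_s)
        params["V" + suffix] = mat(d_s, d_s)
    return params


rng = random.Random(0)
sender_params = init_params(d_s=4, d_u=3, rng=rng)
addressee_params = init_params(d_s=4, d_u=3, rng=rng)  # separate, unshared set
new_sender = igru_step([0.0] * 4, [0.0] * 4, [1.0, -1.0, 0.5], sender_params)
```

Calling `igru_step` with `sender_params` for the sender and with `addressee_params` (and the argument roles swapped) for the addressee mirrors the unshared-parameter design; sharing one dictionary corresponds to the shared-parameter ablation reported in the experiments.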

4.4 Joint Selection

The dialog encoder takes the dialog context $\mathcal{C}$ as input and returns speaker embeddings at the final time step, $\mathbf{a}_i^{(T)}$. Recall from Section 3.2 that Dynamic-RNN produces the context embedding $\mathbf{h}_{\mathcal{C}}$ using Eq 2 and then selects the addressee and response separately using Eq 3 and Eq 4.

In contrast, SI-RNN performs addressee and response selection jointly: the response is dependent on the addressee and vice versa. Therefore, we view the task as a sequence prediction process: given the context and responding speaker, we first predict the addressee, and then predict the response given the addressee. (We also use the reversed prediction order as in Eq 7.)

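In addition to Eq 3 and Eq 4, SI-RNN is also trained to model the conditional probability as follows. To predict the addressee, we calculate the probability of the candidate speaker $a_p$ to be the ground-truth given the ground-truth response $r$ (available during training time):

$$P(a_p \mid \mathcal{C}, r) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}; \mathbf{r}]^{\top} \mathbf{W}_{ar} \mathbf{a}_p\right) \tag{5}$$

The key difference from Eq 3 is that Eq 5 is conditioned on the correct response $r$ with embedding $\mathbf{r}$. Similarly, for response selection, we calculate the probability of a candidate response $r_q$ given the ground-truth addressee $a$:

$$P(r_q \mid \mathcal{C}, a) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}; \mathbf{a}]^{\top} \mathbf{W}_{ra} \mathbf{r}_q\right) \tag{6}$$

At test time, SI-RNN selects the addressee-response pair from $\mathcal{A}(\mathcal{C}) \times \mathcal{R}$ to maximize the joint probability $P(r_q, a_p \mid \mathcal{C})$:

$$\hat{a}, \hat{r} = \operatorname*{argmax}_{(a_p, r_q) \in \mathcal{A}(\mathcal{C}) \times \mathcal{R}} \; P(r_q \mid \mathcal{C}) \cdot P(a_p \mid \mathcal{C}, r_q) + P(a_p \mid \mathcal{C}) \cdot P(r_q \mid \mathcal{C}, a_p) \tag{7}$$

In Eq 7, we decompose the joint probability into two terms: the first term selects the response given the context, and then selects the addressee given the context and the selected response; the second term selects the addressee and response in the reversed order. (We also considered an alternative decomposition of the joint probability, but the performance was similar to Eq 7.)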
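The joint selection step can be illustrated with a small Python sketch that scores every addressee-response pair with the two-term sum described above and compares the result with separate selection. The function name `joint_select` and all probability values are made up for illustration; in the model these probabilities would come from the trained scoring functions.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def joint_select(
    addressees: Sequence[str],
    responses: Sequence[str],
    p_adr: Callable[[str], float],                 # P(a | C)
    p_res: Callable[[str], float],                 # P(r | C)
    p_adr_given_res: Callable[[str, str], float],  # P(a | C, r)
    p_res_given_adr: Callable[[str, str], float],  # P(r | C, a)
) -> Tuple[str, str]:
    """Return the (addressee, response) pair with the highest two-term score."""
    def score(pair: Tuple[str, str]) -> float:
        a, r = pair
        return p_res(r) * p_adr_given_res(a, r) + p_adr(a) * p_res_given_adr(r, a)

    return max(product(addressees, responses), key=score)


# Toy probability tables (hypothetical values).
P_ADR = {"x": 0.6, "y": 0.4}
P_RES = {"r1": 0.45, "r2": 0.55}
P_ADR_GIVEN_RES = {("x", "r1"): 0.9, ("y", "r1"): 0.1, ("x", "r2"): 0.2, ("y", "r2"): 0.8}
P_RES_GIVEN_ADR = {("r1", "x"): 0.9, ("r2", "x"): 0.1, ("r1", "y"): 0.2, ("r2", "y"): 0.8}

addressees, responses = ["x", "y"], ["r1", "r2"]

# Separate selection: each argmax is taken independently.
separate = (max(addressees, key=P_ADR.get), max(responses, key=P_RES.get))

# Joint selection over all pairs.
joint = joint_select(
    addressees,
    responses,
    P_ADR.get,
    P_RES.get,
    lambda a, r: P_ADR_GIVEN_RES[(a, r)],
    lambda r, a: P_RES_GIVEN_ADR[(r, a)],
)
```

In this toy case separate selection returns the pair (`x`, `r2`), which the conditional tables rate as unlikely to go together, while joint selection returns (`x`, `r1`).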



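| Model | $T$ | RES-CAND = 2: Dev ADR-RES | RES-CAND = 2: Test ADR-RES | RES-CAND = 2: Test ADR | RES-CAND = 2: Test RES | RES-CAND = 10: Dev ADR-RES | RES-CAND = 10: Test ADR-RES | RES-CAND = 10: Test ADR | RES-CAND = 10: Test RES |
|---|---|---|---|---|---|---|---|---|---|
| Chance | - | 0.62 | 0.62 | 1.24 | 50.00 | 0.12 | 0.12 | 1.24 | 10.00 |
| Recent+TF-IDF | 15 | 37.11 | 37.13 | 55.62 | 67.89 | 14.91 | 15.44 | 55.62 | 29.19 |
| Direct-Recent+TF-IDF | 15 | 45.83 | 45.76 | 67.72 | 67.89 | 18.94 | 19.50 | 67.72 | 29.40 |
| Static-RNN [Ouchi and Tsuboi2016] | 5 | 47.08 | 46.99 | 60.39 | 75.07 | 21.96 | 21.98 | 60.26 | 33.27 |
| Static-RNN [Ouchi and Tsuboi2016] | 10 | 48.52 | 48.67 | 60.97 | 77.75 | 22.78 | 23.31 | 60.66 | 35.91 |
| Static-RNN [Ouchi and Tsuboi2016] | 15 | 49.03 | 49.27 | 61.95 | 78.14 | 23.73 | 23.49 | 60.98 | 36.58 |
| Static-Hier-RNN [Zhou et al.2016, Serban et al.2016] | 5 | 49.19 | 49.38 | 62.20 | 76.70 | 23.68 | 23.75 | 62.24 | 34.51 |
| Static-Hier-RNN [Zhou et al.2016, Serban et al.2016] | 10 | 51.37 | 51.76 | 64.61 | 78.28 | 25.46 | 25.83 | 64.86 | 36.94 |
| Static-Hier-RNN [Zhou et al.2016, Serban et al.2016] | 15 | 52.78 | 53.04 | 65.84 | 79.08 | 26.31 | 26.62 | 65.89 | 37.85 |
| Dynamic-RNN [Ouchi and Tsuboi2016] | 5 | 49.38 | 49.80 | 63.19 | 76.07 | 23.44 | 23.72 | 63.28 | 33.62 |
| Dynamic-RNN [Ouchi and Tsuboi2016] | 10 | 52.76 | 53.85 | 66.94 | 78.16 | 25.44 | 25.95 | 66.70 | 36.14 |
| Dynamic-RNN [Ouchi and Tsuboi2016] | 15 | 54.45 | 54.88 | 68.54 | 78.64 | 26.73 | 27.19 | 68.41 | 36.93 |
| SI-RNN (Ours) | 5 | 60.57 | 60.69 | 74.08 | 78.14 | 30.65 | 30.71 | 72.59 | 36.45 |
| SI-RNN (Ours) | 10 | 65.34 | 65.63 | 78.76 | 80.34 | 34.18 | 34.09 | 77.13 | 39.20 |
| SI-RNN (Ours) | 15 | 67.01 | 67.30 | 80.47 | 80.91 | 35.50 | 35.76 | 78.53 | 40.83 |
| SI-RNN w/ shared IGRUs | 15 | 59.50 | 59.47 | 74.20 | 78.08 | 28.31 | 28.45 | 73.35 | 36.00 |
| SI-RNN w/o joint selection | 15 | 63.13 | 63.40 | 77.56 | 80.38 | 32.24 | 32.53 | 77.61 | 39.73 |

Table 2: Addressee and response selection results on the Ubuntu Multiparty Conversation Corpus. Metrics include accuracy of addressee selection (ADR), response selection (RES), and pair selection (ADR-RES). RES-CAND: the number of candidate responses. $T$: the context length.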

| | Total | Train | Dev | Test |
|---|---|---|---|---|
| # Docs | 7,355 | 6,606 | 367 | 382 |
| # Utters | 2.4M | 2.1M | 132.4k | 151.3k |
| # Samples | - | 665.6k | 45.1k | 51.9k |
| Adr Mention Freq | - | 0.32 | 0.34 | 0.34 |
| # Speakers / Doc | 26.8 | 26.3 | 30.7 | 32.1 |
| # Utters / Doc | 326.3 | 317.9 | 360.8 | 396.1 |
| # Words / Utter | 11.1 | 11.1 | 11.2 | 11.3 |

Table 3: Data Statistics. “Adr Mention Freq” is the frequency of explicit addressee mention.

5 Experimental Setup

Data Set. We use the Ubuntu Multiparty Conversation Corpus [Ouchi and Tsuboi2016] and summarize the data statistics in Table 3. The whole data set (including the Train/Dev/Test split and the false response candidates) is publicly available. The data set is built from the Ubuntu IRC chat room where a number of users discuss Ubuntu-related technical issues. The log is organized as one file per day, and each file corresponds to a document. Each document consists of (Time, SenderID, Utterance) lines. If users explicitly mention addressees at the beginning of the utterance, the addresseeID is extracted. Then a sample, namely a unit of input (the dialog context and the current sender) and output (the addressee and response prediction) for the task, is created to predict the ground-truth addressee and response of this line. Note that samples are created only when the addressee is explicitly mentioned, so that the ground-truth labels are clear and unambiguous. False response candidates are randomly chosen from all other utterances within the same document. Therefore, distractors are likely to come from the same sub-conversation or even from the same sender at different time steps. This makes the task harder than that of Lowe et al. (2015), where distractors are randomly chosen from all documents. If no addressee is explicitly mentioned, the addressee is left blank and the line is marked as a part of the context.
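
The sample creation procedure can be sketched as follows. The exact mention format in the corpus is not specified here, so this sketch assumes that an explicit mention is the first token of the utterance, optionally followed by `:` or `,`, and that it must match a sender seen in the same document. The function names `extract_addressee` and `build_samples` and the toy log lines are hypothetical.

```python
from typing import List, Optional, Set, Tuple

LogLine = Tuple[str, str, str]  # (Time, SenderID, Utterance)
Sample = Tuple[List[LogLine], str, str, str]  # (context, sender, addressee, response)


def extract_addressee(utterance: str, known_users: Set[str]) -> Optional[str]:
    """Return the user named at the start of the utterance, if any (assumed format)."""
    tokens = utterance.split(maxsplit=1)
    if not tokens:
        return None
    first = tokens[0].rstrip(":,")
    return first if first in known_users else None


def build_samples(lines: List[LogLine], context_len: int) -> List[Sample]:
    """Create one sample per line that explicitly mentions an addressee."""
    users = {sender for _time, sender, _utt in lines}
    samples: List[Sample] = []
    for i, (_time, sender, utterance) in enumerate(lines):
        addressee = extract_addressee(utterance, users)
        if addressee is not None and addressee != sender:
            context = lines[max(0, i - context_len):i]  # preceding lines only
            samples.append((context, sender, addressee, utterance))
    return samples


toy_log = [
    ("10:00", "alice", "how do i install vim?"),
    ("10:01", "bob", "alice: sudo apt-get install vim"),
    ("10:02", "alice", "bob: thanks"),
]
samples = build_samples(toy_log, context_len=15)
```

Lines without an explicit mention, such as the first toy line, produce no sample of their own but still appear in the context of later samples.
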
Baselines. Apart from Dynamic-RNN, we also include several other baselines. Recent+TF-IDF always selects the most recent speaker (except the responding speaker $a_{res}$) as the addressee and chooses the response to maximize the tf-idf cosine similarity with the context. We improve it by using a slightly different addressee selection heuristic (Direct-Recent+TF-IDF): select the most recent speaker that directly talks to $a_{res}$ by an explicit addressee mention. We select from the previous 15 utterances, which is the longest context among all the experiments. This works much better when there are multiple concurrent sub-conversations and $a_{res}$ responds to a distant message in the context. We also include another GRU-based model, Static-RNN, from Ouchi and Tsuboi (2016). Unlike Dynamic-RNN, speaker embeddings in Static-RNN are based on the order of speakers and are fixed. Furthermore, inspired by Zhou et al. (2016) and Serban et al. (2016), we implement Static-Hier-RNN, a hierarchical version of Static-RNN. It first builds utterance embeddings from words and then uses high-level RNNs to process utterance embeddings.
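
The two addressee heuristics can be sketched in a few lines of Python. The function names are hypothetical, and the fallback to the plain recent-speaker rule when nobody addressed the responding speaker within the window is an assumption, since that case is not specified above. The toy context mirrors the sender and addressee columns of example (b) in Table 4.

```python
from typing import List, Optional, Tuple

Turn = Tuple[str, Optional[str]]  # (sender, addressee or None)


def recent_addressee(context: List[Turn], responder: str) -> Optional[str]:
    """Most recent speaker other than the responding speaker."""
    for sender, _addressee in reversed(context):
        if sender != responder:
            return sender
    return None


def direct_recent_addressee(
    context: List[Turn], responder: str, window: int = 15
) -> Optional[str]:
    """Most recent speaker who explicitly addressed the responding speaker."""
    for sender, addressee in reversed(context[-window:]):
        if sender != responder and addressee == responder:
            return sender
    return recent_addressee(context, responder)  # assumed fallback


toy_context: List[Turn] = [
    ("VeryBewitching", "nicomachus"),
    ("nicomachus", "VeryBewitching"),
    ("VeryBewitching", "nicomachus"),
    ("TechMonger", None),
    ("TechMonger", None),
    ("Ionic", None),
    ("TechMonger", None),
    ("D33p", "TechMonger"),
    ("BuzzardBuzz", None),
    ("BuzzardBuzz", None),
    ("Ionic", "BuzzardBuzz"),
    ("chingao", "TechMonger"),
]
recent = recent_addressee(toy_context, "nicomachus")
direct = direct_recent_addressee(toy_context, "nicomachus")
```

Here the plain heuristic picks the last speaker, `chingao`, whereas the direct variant goes back to `VeryBewitching`, the last speaker who addressed `nicomachus`.
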
Implementation Details. For a fair comparison, we follow the hyperparameters from Ouchi and Tsuboi (2016), which are chosen based on the validation data set. We take a maximum of 20 words for each utterance. We use 300-dimensional GloVe word vectors, which are fixed during training. SI-RNN uses 50-dimensional vectors for both speaker embeddings and hidden states. Model parameters are initialized with a uniform distribution between -0.01 and 0.01. We set the mini-batch size to 128. The joint cross-entropy loss function with 0.001 L2 weight decay is minimized by Adam [Kingma and Ba2015]. Training is stopped early if the validation accuracy does not improve for 5 consecutive epochs. All experiments are performed on a single GTX Titan X GPU. The maximum number of epochs is 30, and most models converge within 10 epochs.
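
The stopping rule can be written as a small helper. This sketch only tracks a sequence of validation accuracies, with a patience of 5 epochs and a cap of 30 epochs; the function name and the accuracy values are hypothetical stand-ins for a real training loop.

```python
from typing import Iterable, Tuple


def early_stopping_epochs(
    dev_accuracies: Iterable[float], patience: int = 5, max_epochs: int = 30
) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), both 1-indexed (0 if nothing ran)."""
    best_acc = float("-inf")
    best_epoch = 0
    last_epoch = 0
    stale = 0  # consecutive epochs without improvement
    for epoch, acc in enumerate(dev_accuracies, start=1):
        if epoch > max_epochs:
            break
        last_epoch = epoch
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, last_epoch


# Made-up validation accuracies, one per epoch.
history = [0.50, 0.55, 0.60, 0.59, 0.58, 0.60, 0.57, 0.56, 0.70]
best_epoch, last_epoch = early_stopping_epochs(history)
```

With this history, training stops after epoch 8 because epochs 4 to 8 do not exceed the accuracy reached at epoch 3, so the later value of 0.70 is never seen.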

6 Results and Discussion

For fair and meaningful quantitative comparisons, we follow the evaluation protocols of Ouchi and Tsuboi (2016). SI-RNN improves the overall accuracy on the addressee and response selection task. Two ablation experiments further analyze the contribution of role-sensitive units and joint selection respectively. We then confirm the robustness of SI-RNN with respect to the number of speakers and distant responses. Finally, in a case study we discuss how SI-RNN handles complex conversations by either engaging in a new sub-conversation or responding to a distant message.
Overall Result. As shown in Table 2, SI-RNN significantly improves upon the previous state-of-the-art. In particular, addressee selection (ADR) benefits most across different numbers of candidate responses (denoted as RES-CAND): around 12% in RES-CAND = 2 and more than 10% in RES-CAND = 10. Response selection (RES) is also improved, suggesting role-sensitive GRUs and joint selection are helpful for response selection as well. The improvement is more obvious with more candidate responses (2% in RES-CAND = 2 and 4% in RES-CAND = 10). These together result in significantly better accuracy on the ADR-RES metric as well.
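
The quoted gains can be checked directly against Table 2. The short sketch below assumes that the comparison is between the context-length-15 rows of Dynamic-RNN and SI-RNN, and that the ADR and RES values are the third and fourth numbers of each four-number group in those rows.

```python
# (ADR, RES) accuracy at context length 15, copied from Table 2,
# keyed by the number of candidate responses (RES-CAND).
dynamic_rnn = {2: (68.54, 78.64), 10: (68.41, 36.93)}
si_rnn = {2: (80.47, 80.91), 10: (78.53, 40.83)}

# Absolute gains in percentage points: (ADR gain, RES gain).
gains = {
    cand: (
        round(si_rnn[cand][0] - dynamic_rnn[cand][0], 2),
        round(si_rnn[cand][1] - dynamic_rnn[cand][1], 2),
    )
    for cand in (2, 10)
}
```

Under these assumptions the ADR gains are 11.93 and 10.12 points and the RES gains are 2.27 and 3.90 points, matching the rounded figures in the text.
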
Ablation Study. We show an ablation study in the last rows of Table 2. First, we share the parameters of $IGRU^S$/$IGRU^A$/$GRU^O$. The accuracy decreases significantly, indicating that it is crucial to learn role-sensitive units to update speaker embeddings. Second, to examine our joint selection, we fall back to selecting the addressee and response separately, as in Dynamic-RNN. We find that joint selection improves ADR and RES individually, and it is particularly helpful for pair selection ADR-RES.

Figure 3: Effect of the number of speakers in the context (Upper) and the addressee distance (Lower). Left axis: the histogram shows the number of test examples. Right axis: the curves show ADR accuracy on the test set.

Number of Speakers. More speakers create more complex dialogs and a larger set of candidate addressees, so the task becomes more challenging. In Figure 3 (Upper), we investigate how ADR accuracy changes with the number of speakers in the context of length 15, corresponding to the rows with T=15 in Table 2. Recent+TF-IDF always chooses the most recent speaker, and its accuracy drops dramatically as the number of speakers increases. Direct-Recent+TF-IDF shows better performance, and Dynamic-RNN is marginally better. SI-RNN is much more robust and remains above 70% accuracy across all bins. The advantage is more obvious for bins with more speakers.
Addressing Distance. Addressing distance is the time difference from the responding speaker to the ground-truth addressee. As the histogram in Figure 3 (Lower) shows, while the majority of responses target the most recent speaker, many responses go back five or more time steps. It is important to note that for those distant responses, Dynamic-RNN sees a clear performance decrease, even worse than Direct-Recent+TF-IDF. In contrast, SI-RNN handles distant responses much more accurately.

| Sub-conv. | Sender | Addressee | Utterance |
|---|---|---|---|
| 1 | codepython | wafflejock | thanks |
| 1 | wafflejock | codepython | yup np |
| 2 | wafflejock | theoletom | you can use sudo apt-get install packagename – reinstall, to have apt-get install reinstall some package/metapackage and redo the configuration for the program as well |
| 3 | codepython | - | i installed ubuntu on a separate external drive. now when i boot into mac, the external drive does not show up as bootable. the blue light is on. any ideas? |
| 4 | Guest54977 | - | hello there. wondering to anyone who knows, where an ubuntu backup can be retrieved from. |
| 2 | theoletom | wafflejock | it’s not a program. it’s a desktop environment. |
| 4 | Guest54977 | - | did some searching on my system and googling, but couldn’t find an answer |
| 2 | theoletom | - | be a trace of it left yet there still is. |
| 2 | theoletom | - | i think i might just need a fresh install of ubuntu. if there isn’t a way to revert to default settings |
| 5 | releaf | - | what’s your opinion on a $500 laptop that will be a dedicated ubuntu machine? |
| 5 | releaf | - | are any of the pre-loaded ones good deals? |
| 5 | releaf | - | if not, are there any laptops that are known for being oem-heavy or otherwise ubuntu friendly? |
| 3 | codepython | - | my usb stick shows up as bootable (efi) when i boot my mac. but not my external hard drive on which i just installed ubuntu. how do i make it bootable from mac hardware? |
| 3 | Jordan_U | codepython | did you install ubuntu to this external drive from a different machine? |
| 5 | Umeaboy | releaf | what country you from? |
| 5 | wafflejock | | |

| Model | Predicted Addressee | Predicted Response |
|---|---|---|
| Direct-Recent+TF-IDF | theoletom | ubuntu install fresh |
| Dynamic-RNN | codepython | no prime is the replacement |
| SI-RNN | releaf | there are a few ubuntu dedicated laptop providers like umeaboy is asking depends on where you are |

(a) SI-RNN chooses to engage in a new sub-conversation by suggesting a solution to “releaf” about Ubuntu dedicated laptops.

| Sub-conv. | Sender | Addressee | Utterance |
|---|---|---|---|
| 1 | VeryBewitching | nicomachus | anything i should be concerned about before i do it? |
| 1 | nicomachus | VeryBewitching | always back up before partitioning. |
| 1 | VeryBewitching | nicomachus | i would have assumed that, i was wondering more if this is something that tends to be touch and go, want to know if i should put coffee on : ) |
| 2 | TechMonger | - | it was hybernating. i can ping it now |
| 2 | TechMonger | - | why does my router pick up disconnected devices when i reset my device list? or how |
| 2 | Ionic | - | because the dhcp refresh interval hasn’t passed yet? |
| 2 | TechMonger | - | so dhcp refresh is different than device list refresh? |
| 2 | D33p | TechMonger | what an enlightenment @techmonger : ) |
| 2 | BuzzardBuzz | - | dhcp refresh for all clients is needed when you change your subnet ip |
| 2 | BuzzardBuzz | - | if you want them to work together |
| 2 | Ionic | BuzzardBuzz | uhm, no. |
| 2 | chingao | TechMonger | nicomachus asked this way at the beginning: is the machine that you ’re trying to ping turned on? |
| 1 | nicomachus | | |

| Model | Predicted Addressee | Predicted Response |
|---|---|---|
| Direct-Recent+TF-IDF | VeryBewitching | i have tried with this program y-ppa manager, yet still doesn’t work. |
| Dynamic-RNN | chingao | install the package “linux-generic”, that will install the kernel and the headers if they are not installed |
| SI-RNN | VeryBewitching | if it’s the last partition on the disk, it won’t take long. if gparted has to copy data to move another partition too, it can take a couple hours. |

(b) SI-RNN remembers the distant sub-conversation 1 and responds to “VeryBewitching” with a detailed answer.
Table 4: Case Study. Sub-conversations are coded with different numbers for the purpose of analysis (sub-conversation labels are not available during training or testing).

Case Study. Examples in Table 4 show how SI-RNN can handle complex multi-party conversations by selecting from 10 candidate responses. In both examples, the responding speakers participate in two or more concurrent sub-conversations with other speakers.

Example (a) demonstrates the ability of SI-RNN to engage in a new sub-conversation. The responding speaker “wafflejock” is originally involved in two sub-conversations: the sub-conversation 1 with “codepython”, and the ubuntu installation issue with “theoletom”. While it is reasonable to address “codepython” and “theoletom”, the responses from other baselines are not helpful to solve corresponding issues. TF-IDF prefers the response with the “install” key-word, yet the response is repetitive and not helpful. Dynamic-RNN selects an irrelevant response to “codepython”. SI-RNN chooses to engage in a new sub-conversation by suggesting a solution to “releaf” about Ubuntu dedicated laptops.

Example (b) shows the advantage of SI-RNN in responding to a distant message. The responding speaker “nicomachus” is actively engaged with “VeryBewitching” in the sub-conversation 1 and is also loosely involved in the sub-conversation 2: “chingao” mentions “nicomachus” in the most recent utterance. SI-RNN remembers the distant sub-conversation 1 and responds to “VeryBewitching” with a detailed answer. Direct-Recent+TF-IDF selects the ground-truth addressee because “VeryBewitching” talks to “nicomachus”, but the response is not helpful. Dynamic-RNN is biased to the recent speaker “chingao”, yet the response is not relevant.

7 Conclusion

SI-RNN jointly models who says what to whom by updating speaker embeddings in a role-sensitive way. It provides state-of-the-art addressee and response selection, which can directly benefit retrieval-based dialog systems. In future work, we plan to use SI-RNN to extract sub-conversations from unlabeled conversation corpora and to provide a large-scale disentangled multi-party conversation data set.

8 Acknowledgements

We thank the members of the UMichigan-IBM Sapphire Project and all the reviewers for their helpful feedback. This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of IBM.


  • [Bordes and Weston2017] Bordes, A., and Weston, J. 2017. Learning end-to-end goal-oriented dialog. In ICLR.
  • [Chen et al.2017] Chen, P.-C.; Chi, T.-C.; Su, S.-Y.; and Chen, Y.-N. 2017. Dynamic time-aware attention to speaker roles and contexts for spoken language understanding. In ASRU.
  • [Chi et al.2017] Chi, T.-C.; Chen, P.-C.; Su, S.-Y.; and Chen, Y.-N. 2017. Speaker role contextual modeling for language understanding and dialogue policy learning. In IJCNLP.
  • [Cho et al.2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP.
  • [Chung et al.2014] Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS 2014 Deep Learning and Representation Learning Workshop.
  • [Henderson, Lemon, and Georgila2008] Henderson, J.; Lemon, O.; and Georgila, K. 2008. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics 34(4):487–511.
  • [Henderson, Thomson, and Williams2014] Henderson, M.; Thomson, B.; and Williams, J. 2014. The second dialog state tracking challenge. In SIGDIAL.
  • [Henderson, Thomson, and Young2014] Henderson, M.; Thomson, B.; and Young, S. 2014. Word-based dialog state tracking with recurrent neural networks. In SIGDIAL.
  • [Hu et al.2014] Hu, B.; Lu, Z.; Li, H.; and Chen, Q. 2014. Convolutional neural network architectures for matching natural language sentences. In NIPS.
  • [Ji, Lu, and Li2014] Ji, Z.; Lu, Z.; and Li, H. 2014. An information retrieval approach to short text conversation. arXiv preprint arXiv:1408.6988.
  • [Jovanović, Akker, and Nijholt2006] Jovanović, N.; Akker, R. o. d.; and Nijholt, A. 2006. Addressee identification in face-to-face meetings. In EACL.
  • [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • [Li et al.2016] Li, J.; Galley, M.; Brockett, C.; Spithourakis, G.; Gao, J.; and Dolan, B. 2016. A persona-based neural conversation model. In ACL.
  • [Lowe et al.2015] Lowe, R.; Pow, N.; Serban, I.; and Pineau, J. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL.
  • [Lu and Li2013] Lu, Z., and Li, H. 2013. A deep architecture for matching short texts. In NIPS.
  • [Mei, Bansal, and Walter2017] Mei, H.; Bansal, M.; and Walter, M. R. 2017. Coherent dialogue with attention-based language models. In AAAI.
  • [Meng, Mou, and Jin2017] Meng, Z.; Mou, L.; and Jin, Z. 2017. Towards neural speaker modeling in multi-party conversation: The task, dataset, and models. arXiv preprint arXiv:1708.03152.
  • [Mesnil et al.2015] Mesnil, G.; Dauphin, Y.; Yao, K.; Bengio, Y.; Deng, L.; Hakkani-Tur, D.; He, X.; Heck, L.; Tur, G.; Yu, D.; et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. Audio, Speech, and Language Processing, IEEE/ACM Transactions on 23(3):530–539.
  • [op den Akker and Traum2009] op den Akker, R., and Traum, D. 2009. A comparison of addressee detection methods for multiparty conversations. Workshop on the Semantics and Pragmatics of Dialogue.
  • [Ouchi and Tsuboi2016] Ouchi, H., and Tsuboi, Y. 2016. Addressee and response selection for multi-party conversation. In EMNLP.
  • [Ritter, Cherry, and Dolan2011] Ritter, A.; Cherry, C.; and Dolan, W. B. 2011. Data-driven response generation in social media. In EMNLP.
  • [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.
  • [Shang, Lu, and Li2015] Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL.
  • [Singh et al.1999] Singh, S. P.; Kearns, M. J.; Litman, D. J.; and Walker, M. A. 1999. Reinforcement learning for spoken dialogue systems. In NIPS.
  • [Vinyals and Le2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. ICML Deep Learning Workshop.
  • [Wang et al.2015] Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2015. Syntax-based deep matching of short texts. In IJCAI.
  • [Wen et al.2015] Wen, T.-H.; Gašić, M.; Mrkšić, N.; Su, P.-H.; Vandyke, D.; and Young, S. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP.
  • [Wen et al.2016] Wen, T.-H.; Vandyke, D.; Mrkšić, N.; Gašić, M.; Rojas-Barahona, L. M.; Su, P.-H.; Ultes, S.; and Young, S. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
  • [Williams, Raux, and Henderson2016] Williams, J.; Raux, A.; and Henderson, M. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse 7(3):4–33.
  • [Yao et al.2014] Yao, K.; Peng, B.; Zhang, Y.; Yu, D.; Zweig, G.; and Shi, Y. 2014. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, 189–194. IEEE.
  • [Young et al.2013] Young, S.; Gašić, M.; Thomson, B.; and Williams, J. D. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179.
  • [Zhou et al.2016] Zhou, X.; Dong, D.; Wu, H.; Zhao, S.; Yu, D.; Tian, H.; Liu, X.; and Yan, R. 2016. Multi-view response selection for human-computer conversation. In EMNLP.