A Speaker-aware Parallel Hierarchical Attentive Encoder-Decoder Model for Multi-turn Dialogue Generation

10/13/2021 ∙ by Zihao Wang, et al. ∙ University of Illinois at Urbana-Champaign

This paper presents a novel open-domain dialogue generation model that emphasizes differentiating speakers in multi-turn conversations. Unlike prior work that relies solely on the content of the conversation history to generate a response, we argue that capturing the relative social relations among utterances (i.e., whether they were produced by the same speaker or by different ones) helps the model extract fine-grained context information from the conversation history and thereby improve the context coherence of the generated response. We therefore propose a speaker-aware Parallel Hierarchical Attentive Encoder-Decoder (PHAED) model that aims to model each utterance with awareness of its speaker and of its contextual associations with the same speaker's previous messages. Specifically, in a conversation involving two speakers, we regard the utterances from one speaker as responses and those from the other as queries. After understanding queries via our encoder with inner-query and inter-query encodings, our decoder reuses the hidden states of previously generated responses, instead of reconstructing them with the encoder, to generate a new response. Our empirical results show that PHAED outperforms the state-of-the-art in both automatic and human evaluations. Furthermore, our ablation study shows that dialogue models with speaker tokens are generally less likely to generate responses that are incoherent with the conversation context.


1 Introduction

Dialogue generation is a text generation task that aims to automatically generate a reasonable response given a conversation history. With the goal of providing AI-based virtual agents for various services, such as personal secretaries, emotionally supportive companions, and customer service, this task has become a popular research topic in both academia and industry (Chen et al., 2017). Traditional dialogue models are mainly task-oriented (Henderson et al., 2013; Zhao and Eskenazi, 2016; Madotto et al., 2018). To improve the generalization ability of these models, recent studies have begun to focus on the development of open-domain conversational agents (Serban et al., 2016; Xing et al., 2018; Chen et al., 2019; Bao et al., 2020).

Figure 1: A sample of the responses generated by a typical approach (Transformer) and our approach. speaker0 and speaker1 are speaker tokens. Underlined words indicate the part that is inconsistent with the speaker token.

The common practice for building a dialogue model is to train a sequence-to-sequence model that uses the conversation history to generate a context-coherent response. Considering the difficulty of understanding complex real-life scenarios such as social interaction, merchandising, and small talk, effectively understanding the conversation history is a critical challenge in the encoding process (Zhao et al., 2020; Tian et al., 2017). To address this challenge, prior work usually builds on the hierarchical recurrent encoder-decoder (HRED) framework (Serban et al., 2016), in which the model contains word-level and utterance-level encoders to understand the conversation history.

Despite the remarkable contributions of prior work on capturing context information from the conversation history (Serban et al., 2017; Tian et al., 2017; Zhang et al., 2018b; Xing et al., 2018; Zhang et al., 2019; Zhao et al., 2020), one major limitation of these studies is that they primarily focus on the content of previous utterances but ignore the social relationships between these utterances (i.e., whether they were generated by the same or different speakers) (Hovy and Yang, 2021). We argue that such missing information would help the model learn fine-grained context information by differentiating the content of previous utterances based on their speakers. Without this fine-grained context information, it is difficult for the model to capture the latent properties (e.g., the role) of the speaker it represents during response generation, and hence hard to guarantee the context coherence of the generated response. As shown in Figure 1, we can infer from the conversation history that speaker0 knows the shoes’ price from his message "These shoes are on sale". However, due to the lack of awareness of the speaker information behind utterances, the response generated by an existing dialogue model based on the Transformer (Vaswani et al., 2017) mistakenly asks "How much is it".

To address the aforementioned limitation, we propose a speaker-aware learning model that improves multi-turn dialogue coherence by making full use of the conversation history from different speakers. Instead of mixing each utterance with the whole conversation history, our approach models each utterance with awareness of its speaker and of its contextual associations with the same speaker’s previous messages. More specifically, to make the model distinguish between queries and responses, we first add different speaker tokens at the beginning of queries and responses. Then, a hierarchical attentive encoder with two-level encodings is proposed to obtain the local and global contextual representations of queries. Finally, the decoder utilizes turn-level recurrence and cross attention to take advantage of both previous responses and queries when generating the current response. Moreover, we argue that it is unnecessary to re-understand the generated responses, since the model must have understood a response before synthesizing it. Therefore, our decoder reuses the hidden states of previously generated responses instead of reconstructing them with the encoder. After considering the speaker roles, we can see from Figure 1 that our approach generates a coherent response with respect to the context of speaker0.

Our main contributions include:

  • We propose a novel dialogue generation model, PHAED, which generates context-coherent responses in multi-turn dialogues by modeling utterances with awareness of their speakers.

  • By performing experiments on three public datasets, we show that our approach outperforms the state-of-the-art in terms of response coherence and diversity.

  • We conduct a fine-grained analysis of the performance of PHAED, which deepens our understanding of the characteristics of PHAED.

2 Related Work

As multi-turn dialogue generation matches the conversational scenarios of daily life, it has gained increasing attention. Moreover, since plain text alone has clear limitations for dialogue generation, it is crucial to make the most of semi-structured data (i.e., data containing both textual and authorial information) when learning neural dialogue models.

Recent work on multi-turn dialogue generation mainly focuses on utterance-aware models that use the conversation history effectively. Early on, Serban et al. (2016) successfully apply HRED (Sordoni et al., 2015) to dialogue generation, modeling the conversation history via a hierarchical recurrent encoder and generating a response with a recurrent decoder. Most subsequent studies focus on designing exquisite attention mechanisms to detect the words or utterances relevant to response generation (Tian et al., 2017; Xing et al., 2018; Zhang et al., 2018b, 2019, 2020). Sun et al. (2021) and Xu et al. (2021) use RNN-based variational auto-encoders to generate responses that are relevant to the content of the conversation history, but they do not take into account the differences between speakers. Sankar et al. (2019) find that transformer-based models have lower test perplexities than recurrent models.

Regarding speaker-aware models, several methods have been proposed for conversational language understanding and generation tasks. Chi et al. (2017) propose a speaker role-based contextual model for language understanding and dialogue policy learning. Meng et al. (2018) propose a speaker classification task in multi-party conversation. Ma et al. (2019) study implicit discourse relation identification between different utterances. Liu et al. (2020b) consider extra fine-grained manual roles (speaker or listener) of each utterance for multi-turn dialogue generation. Besides, there are also persona-based conversation models (Li et al., 2016; Olabiyi et al., 2018; Bak and Oh, 2019; Chan et al., 2019), which extract persona characteristics from the same person’s conversations; they require a corpus with a specific identifier for each particular person. However, in general conversational datasets (Lowe et al., 2015; Li et al., 2017), speakers in different conversations can only be unified and anonymized as speaker0 and speaker1, rather than labeled with specific personal identifiers across the whole dataset. In this general setting, Liu et al. (2020a) find that a response selection model benefits from filling the gap between utterance-aware and speaker-aware representations, and Zhao and Kawahara (2019) propose a speaker-aware generative dialogue model with relative speaker modeling and relative utterance encoders. Hovy and Yang (2021) point out that a key reason for the current limitations of NLP applications is their focus on the content of information while ignoring the social factors of language.

Compared with utterance-aware dialogue generation models, our approach focuses not only on the content but also on the speaker roles of utterances. Moreover, instead of requiring extra fine-grained manual labels (Liu et al., 2020b) or extra conversations with specific personal identifiers (Li et al., 2016), we aim to improve multi-turn dialogue coherence within a conversation by being aware of which utterances come from the same speaker.

3 Approach

Figure 2: The architecture of PHAED. Given a conversation involving a query set and a response set, $\mathbf{H}_t$ and $\mathbf{G}_t$ denote the local and global contextual representations of the $t$-th query $q_t$; $\mathbf{S}_t^n$ denotes the hidden states of the $t$-th response from the $n$-th decoder block; $L$ denotes the memory length of the decoder.

Figure 2 provides an overview of our approach. Following the regular dialogue flow, we regard the utterances in each turn as a query-response pair, and the order of the two speakers in a conversation is usually consistent. With the assumption that differentiating the speakers of previous utterances should make the model more sensitive to the conversation context and thus help it generate coherent responses, our goal is to design a multi-turn dialogue generation framework (i.e., PHAED) that considers the speaker role of utterances in the process of generating a context-coherent response in each turn.

Overall, we will describe PHAED from four aspects: (1) We first formalize the problem in §3.1; (2) For modeling multi-turn conversation involving speaker roles, the input representation is designed in §3.2; (3) Given queries from speaker-Q, the hierarchical attentive encoder constructs local contextual representations (inner-query encoding) and then combines all of them to obtain global contextual representations (inter-query encoding) in §3.3; (4) After understanding queries, the decoder generates its current response based on the global contextual representations of the queries, the hidden states of its previous responses, and the local context of its partial current response in §3.4.

3.1 Problem Formalization

Suppose that we have a multi-turn dialogue dataset $\mathcal{D} = \{C^{(i)}\}_{i=1}^{M}$, where $C^{(i)}$ is the $i$-th conversation and $M$ is the number of conversations in the dataset. Each conversation involves two speakers who give queries (i.e., [Speaker-Q]) and responses (i.e., [Speaker-R]) iteratively. Hence we represent each conversation as a sequence of query-response pairs denoted as $C = \{(q_1, r_1), \ldots, (q_T, r_T)\}$, where $T$ is the number of pairs. Dialogue models often adopt an encoder-decoder framework. Here, a dialogue model aims to generate the responses $r_1, \ldots, r_T$ given the queries in order. The training criterion is to maximize the conditional log-likelihood, which can be formulated as:

$\mathcal{L}(\theta) = \sum_{t=1}^{T} \log P\left(r_t \mid q_{\leq t}, r_{<t}; \theta\right)$   (1)

where $q_{\leq t}$ refers to all queries up to the $t$-th turn, $r_{<t}$ denotes the responses prior to $r_t$, and $\theta$ denotes the parameters of the model.

3.2 Input Representation

To distinguish the speaker identities over a multi-turn conversation, we design a novel speaker-aware input representation for the words in the query and response utterances. More specifically, we first add the two speaker tokens (i.e., [Speaker-Q] and [Speaker-R]) at the beginning of all queries and responses, respectively. We then prepend a start-of-utterance token (i.e., [SOU]) and append an end-of-utterance token (i.e., [EOU]) to each utterance. Finally, we add the turn-level and token-level position embeddings to the token embedding as the input representation:

$\mathbf{x}_{t,j} = \mathrm{TE}(w_{t,j}) + \mathrm{TurnE}(t) + \mathrm{PE}(j)$   (2)

where $w_{t,j}$ is the $j$-th word in the $t$-th query (or response). $\mathrm{TE}(\cdot)$ looks up a token embedding from an embedding matrix. $\mathrm{TurnE}(t)$ is the aligned turn-level embedding indicating the position of the $t$-th utterance. $\mathrm{PE}(j)$ is the token-level embedding indicating the position of $w_{t,j}$ in the $t$-th utterance. All embeddings are learnable during training. A detailed visualization of our input representation structure is provided in Appendix A.
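To make the construction concrete, below is a minimal PyTorch sketch of this input representation (the experiments in §4.1.3 use PyTorch). The module name, hyperparameters, and tokenization details are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class SpeakerAwareEmbedding(nn.Module):
    """Token embedding + aligned turn-level embedding + token-level position embedding."""
    def __init__(self, vocab_size, d_model=512, max_turns=64, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # TE: token embedding
        self.turn = nn.Embedding(max_turns, d_model)   # TurnE: turn-level position embedding
        self.pos = nn.Embedding(max_len, d_model)      # PE: token-level position embedding

    def forward(self, token_ids, turn_ids):
        # token_ids, turn_ids: (batch, seq_len). turn_ids holds the utterance index t
        # of every token, so all tokens of one utterance share the same turn id.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.turn(turn_ids) + self.pos(positions)

# Each utterance is framed as: [SOU] [Speaker-Q] w_1 ... w_n [EOU]
# (with [Speaker-R] in place of [Speaker-Q] for response utterances).
```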

3.3 Hierarchical Encoder

We want the encoder to capture and encode all the external information passed to PHAED. In other words, the encoder is responsible for understanding all queries from the other speaker (i.e., Speaker-Q). Since there are multiple queries from the same person and each query carries its own information, we understand all queries in two steps. Figure 2 (left) shows the architecture of our encoder. We first encode each query with self-attention blocks in Inner-Query Encoding. Then, to combine the information of all queries, we propose turn-level relative attention in Inter-Query Encoding to consider all queries comprehensively.

3.3.1 Inner-Query Encoding

To summarize the information within an individual query, we apply a standard $N$-layer Transformer encoder (Vaswani et al., 2017) to encode each query. Specifically, we obtain a hidden representation matrix $\mathbf{H}_t \in \mathbb{R}^{|q_t| \times d}$ for all words in the $t$-th query from the top layer, where $|q_t|$ is the length of $q_t$ and $d$ is the hidden dimension (to simplify the notation, we omit the conversation index hereafter). Notably, we adopt pre-normalization (Bao et al., 2020), which has proven effective in stabilizing performance.
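As a sketch, the inner-query encoder can be instantiated with a stock pre-normalized Transformer encoder as follows; d_model=512 and 8 heads follow the setup in §4.1.3, while the remaining choices are assumptions for illustration.

```python
import torch.nn as nn

def build_inner_query_encoder(d_model=512, nhead=8, num_layers=6):
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=nhead, dim_feedforward=4 * d_model,
        norm_first=True,    # pre-normalization, as adopted in the paper
        batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

# H_t = encoder(x_t)  # x_t: embedded tokens of q_t; H_t: (batch, |q_t|, d_model)
```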

3.3.2 Inter-Query Encoding

As historical contexts from the query speaker are crucial for understanding the current query, we aim to combine the information from all preceding queries to obtain a global context. To this end, we introduce an inter-query encoding method that obtains a global contextual representation $\mathbf{G}_t$ for $q_t$ based on the set of preceding queries' representations $\{\mathbf{H}_1, \ldots, \mathbf{H}_t\}$, where each element is obtained as in §3.3.1. In particular, we propose a turn-level relative attention network that extends relative attention from the token level (Shaw et al., 2018) and the segment level (Zheng et al., 2020) to the turn level:

$\mathbf{G}_t = \mathrm{FFN}\left(\mathrm{TurnRelAttn}\left(\mathbf{H}_t, \{\mathbf{H}_1, \ldots, \mathbf{H}_t\}\right)\right)$   (3)

where FFN denotes a feedforward network and TurnRelAttn is our attention network that takes $\mathbf{H}_t$ as the query and $\{\mathbf{H}_1, \ldots, \mathbf{H}_t\}$ as the keys and values.

Turn-level relative attention

As the conversation goes on, each query appears in order (from $q_1$ to $q_T$). In some cases, $q_t$ focuses on queries that are closer to it. For example, $q_t$ may pay more attention to the closest query $q_{t-1}$ than to other previous queries, since $q_{t-1}$ may contain more information relevant to $q_t$ than the others.

To consider the turn-level relative position among queries in the history, we compute an attention operation from the $t$-th query to the past $k$-th query. Specifically, we first compute the attention’s query, key, and value matrices by multiplying the corresponding weight matrices, i.e., $\mathbf{Q} = \mathbf{H}_t\mathbf{W}^Q$, $\mathbf{K}_k = \mathbf{H}_k\mathbf{W}^K$, $\mathbf{V}_k = \mathbf{H}_k\mathbf{W}^V$, and then add the relative position information to the keys and values:

$\hat{\mathbf{K}}_k = \mathbf{K}_k + \left(\mathbf{A}^K_{[:,\rho(t,k)]}\,\mathbf{1}\right)^{\top}, \qquad \hat{\mathbf{V}}_k = \mathbf{V}_k + \left(\mathbf{A}^V_{[:,\rho(t,k)]}\,\mathbf{1}\right)^{\top}$   (4)

where $\rho(t,k) = \min(t-k, \rho_{\max})$ measures the relative position between $q_t$ and $q_k$ up to a pre-defined maximum number $\rho_{\max}$; $\mathbf{A}^K, \mathbf{A}^V \in \mathbb{R}^{d \times (\rho_{\max}+1)}$ are two learnable matrices that capture the relative position information for the attention’s keys and values; and $\mathbf{1} \in \mathbb{R}^{1 \times |q_k|}$ is an all-one row vector. Here we take the $\rho(t,k)$-th column vector from $\mathbf{A}^K$ and copy it $|q_k|$ times across columns by multiplying it with $\mathbf{1}$; a similar operation applies to $\mathbf{A}^V$.

We then compute the attention matrix $\mathbf{A}_t$, and obtain a global contextual representation $\mathbf{G}_t$ by a weighted sum over the values of all preceding queries, followed by a residual connection and a feedforward network as follows:

$\mathbf{A}_t = \mathrm{softmax}\left(\frac{\mathbf{Q}\,[\hat{\mathbf{K}}_1; \ldots; \hat{\mathbf{K}}_t]^{\top}}{\sqrt{d}}\right), \qquad \mathbf{G}_t = \mathrm{FFN}\left(\mathbf{H}_t + \mathbf{A}_t\,[\hat{\mathbf{V}}_1; \ldots; \hat{\mathbf{V}}_t]\right)$   (5)
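The sketch below illustrates Eqs. (4)-(5) in a single-head, unbatched form. The class and parameter names are our own illustrative choices, and the feedforward network of Eq. (5) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TurnRelAttn(nn.Module):
    def __init__(self, d_model=512, max_rel=8):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.rel_k = nn.Embedding(max_rel + 1, d_model)  # A^K: offsets for keys
        self.rel_v = nn.Embedding(max_rel + 1, d_model)  # A^V: offsets for values
        self.max_rel = max_rel

    def forward(self, h_t, history):
        # h_t: (|q_t|, d); history: [H_1, ..., H_t], each (|q_k|, d)
        q = self.wq(h_t)
        t = len(history)
        keys, values = [], []
        for k, h_k in enumerate(history, start=1):
            rho = min(t - k, self.max_rel)  # clipped turn-level relative position
            # the same offset vector is added to every token of q_k
            keys.append(self.wk(h_k) + self.rel_k.weight[rho])
            values.append(self.wv(h_k) + self.rel_v.weight[rho])
        K = torch.cat(keys, dim=0)
        V = torch.cat(values, dim=0)
        attn = F.softmax(q @ K.t() / K.size(-1) ** 0.5, dim=-1)  # A_t in Eq. (5)
        return h_t + attn @ V  # residual connection; FFN of Eq. (5) omitted
```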

3.4 Decoder with Turn-level Recurrence

PHAED understands the other speaker's queries through the encoder, but it still needs to consider its previously generated responses from Speaker-R when generating its current response. Our idea is to store the hidden states of its previous responses as memory and reuse this memory as the representation of those responses when generating the current response. For this purpose, as shown in Figure 2 (right), we adopt Transformer-XL (Dai et al., 2019; Zheng et al., 2020) with cross attention as the decoder.

Turn-level Recurrence.

In Dai et al. (2019), the decoder consists of $N$ Transformer layers, where each layer augments the attention's keys and values from the previous layer by caching the hidden states of a fixed length of previous words in memory. This caching design allows the decoder to access a longer context in the memory. Similarly, we aim to extend the decoder's context with the preceding responses and encourage the decoder to capture the turn-level relationships between responses. To this end, we cache the hidden states of the words from at most $L$ previous responses and concatenate them along the length dimension as the memory for the $n$-th layer:

$\mathbf{M}_t^n = \left[\mathbf{S}_{t-L}^n; \ldots; \mathbf{S}_{t-1}^n\right]$   (6)

where $\mathbf{S}_k^n \in \mathbb{R}^{|r_k| \times d}$ in the memory denotes the hidden states of the $k$-th response, of word size $|r_k|$, from the $n$-th transformer layer. Similar to Dai et al. (2019), we truncate the gradient from the memory, augment the attention's keys and values with the memory, and obtain the next layer's hidden states by a transformer layer:

$\mathbf{S}_t^{n+1} = \text{Transformer-Layer}\left(\mathbf{S}_t^n,\ \left[\mathrm{SG}(\mathbf{M}_t^n); \mathbf{S}_t^n\right],\ \mathbf{G}_t\right)$   (7)

where $\mathrm{SG}(\cdot)$ denotes stop-gradient, and the Transformer layer takes $\mathbf{S}_t^n$ as the queries and the memory-augmented $[\mathrm{SG}(\mathbf{M}_t^n); \mathbf{S}_t^n]$ as the keys and values, performs cross-attention over the query representations $\mathbf{G}_t$, and is followed by a feedforward network.
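A minimal sketch of the memory construction in Eqs. (6)-(7): the cached hidden states of up to L previous responses are detached from the computation graph (stop-gradient) and concatenated along the length dimension to extend the attention's keys and values. The decoder layer itself is elided, and the function name is hypothetical.

```python
import torch

def extend_with_memory(s_cur, cached_turns, max_mem_turns):
    # s_cur: (|r_t|, d), hidden states of the current response at layer n
    # cached_turns: list of per-response hidden states S_k^n from the same layer
    mem = cached_turns[-max_mem_turns:]                     # keep at most L turns
    kv = torch.cat([m.detach() for m in mem] + [s_cur], 0)  # [SG(M_t^n); S_t^n]
    return kv  # used as keys/values, while s_cur alone provides the queries
```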

Finally, we obtain the output probabilities of the words in $r_t$ by a linear layer and a softmax layer on the word representations $\mathbf{S}_t^N$ from the top decoder layer:

$P\left(r_{t,j} \mid r_{t,<j}, q_{\leq t}, r_{<t}\right) = \mathrm{softmax}\left(\mathbf{s}_{t,j}^N \mathbf{W}^O\right)$   (8)

where $\mathbf{W}^O \in \mathbb{R}^{d \times |V|}$ is a linear projection matrix for a vocabulary of size $|V|$, $\mathbf{s}_{t,j}^N$ denotes the $j$-th row vector of $\mathbf{S}_t^N$, and $j \in \{1, \ldots, |r_t|\}$ with $|r_t|$ the length of $r_t$.

4 Experiments

We conduct experiments on three datasets and compare PHAED with the state-of-the-art based on both automatic and human evaluations. Besides, we further conduct a fine-grained analysis to deepen our understanding of the characteristics of PHAED.

4.1 Experimental Setup

4.1.1 Datasets

Three popular benchmark datasets for open-domain multi-turn dialogue generation are adopted: (1) DailyDialog, (2) PersonaChat, and (3) Ubuntu v2.0. Conversations in all datasets involve two participants. Dialogues in DailyDialog (Li et al., 2017) cover various topics of daily life, such as social activities and school life. PersonaChat (Zhang et al., 2018a) is a knowledge-grounded dataset that contains dialogues and speaker profile information; following the standard practice of prior work (Zhao et al., 2020), we append the profile to the conversation history. Ubuntu v2.0 (Lowe et al., 2015) is a large multi-turn dialogue corpus extracted from the Ubuntu question-answering forum. We truncate utterances with more than 50 tokens and correct truncated utterances with abnormal endings. Table 1 provides detailed statistics for each dataset.

                             DailyDialog  PersonaChat     Ubuntu
train dialogues                   11,118        8,939  1,000,000
valid dialogues                    1,000        1,000     19,560
test dialogues                     1,000          968     18,920
avg. utterances per dialogue         7.9         14.8        4.9
avg. tokens per utterance           14.6         12.9       16.2
Table 1: Statistics of the three datasets.

4.1.2 Compared Methods

We select utterance-aware and speaker-aware state-of-the-art methods as baselines and train all models on the same preprocessed data with speaker tokens added: (1) HRAN: an HRED equipped with hierarchical attention over utterance-level and word-level representations (Xing et al., 2018); (2) DSHRED: an HRED equipped with static and dynamic attention (Zhang et al., 2018b); (3) SpkHRED: a recent speaker-aware HRED with relative speaker modeling and relative utterance encoders (Zhao and Kawahara, 2019); (4) Transformer: under the encoder-decoder framework for dialogue generation, the simplest natural baseline, which directly uses the Transformer (Vaswani et al., 2017) to encode all previous utterances and then decodes the representations to generate a response; (5) ReCoSa: a hierarchical transformer-based model that detects the relevant contexts (Zhang et al., 2019); and (6) DialoGPT: following Zhang et al. (2020), we train a multi-turn dialogue generation model from scratch on the basis of GPT-2 (Radford et al., 2019).

4.1.3 Implementation Details

The dimension of hidden states is set to 512 in HRAN, DSHRED, Transformer(N=6), ReCoSa(N=6), PHAED(N=4), and PHAED(N=6), where N denotes the number of layers. Except for DialoGPT, the head number of all transformer-based models is 8. For DialoGPT(N=6 and N=12), we use the small GPT-2 architecture with a hidden dimension of 768 and a head number of 12. We increase the hidden dimension to 560 in PHAED(N=12) so that it has the same parameter size as DialoGPT(N=12). Greedy search is taken as the decoding strategy. The Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.005 is used for training, and the batch size is 32. We train and evaluate each model on a Tesla P100 or V100 card with PyTorch.

4.1.4 Evaluation Measures

Automatic evaluation

We evaluate PHAED and the baselines based on coherence and diversity metrics. For coherence evaluation, we adopt Perplexity (Xing et al., 2018), BLEU-n for n-grams (n=1,2,3,4) (Tian et al., 2017), and three embedding-based metrics (Serban et al., 2017). The embedding-based metrics include Average (Avg), Extrema (Ext), and Greedy (Gre), using pre-trained Google News word embeddings (Mikolov et al., 2013). For diversity evaluation, we use Distinct-1 and Distinct-2, which calculate the ratio of unique unigrams and bigrams, respectively. Notably, since the probability of the speaker token ([Speaker-R]) is much higher than that of other tokens, Perplexity computed with the speaker token's probability would be much lower than Perplexity computed without it. Because we focus on the other tokens in the response, we exclude the speaker token's probability when calculating Perplexity and remove the speaker token from the generated responses before applying the other metrics.
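For reference, Distinct-n is the ratio of unique n-grams to the total number of n-grams over all generated responses. A minimal sketch of the computation (our own, not tied to any particular toolkit):

```python
def distinct_n(responses, n):
    # responses: list of tokenized responses, each a list of tokens
    total, unique = 0, set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / max(total, 1)

# distinct_n([["i", "am", "fine"], ["i", "am", "happy"]], 2)  # -> 3/4 = 0.75
```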

Human evaluation

We further conduct a manual evaluation to explicitly examine the quality of dialogue models based on human judgments. Following prior work (Xu et al., 2021; Cai et al., 2020), we randomly select 100 examples containing the conversation history and the responses generated by the baselines and PHAED as test examples. On this test data, we recruit three human annotators with good English skills to score response quality on a scale of [0, 1, 2] (indicating bad, good, and perfect responses, respectively) along four aspects: (1) Fluency, the smoothness of the response and the correctness of its grammar; (2) Coherence, whether the response is coherent with the conversation history; (3) Informativeness, the amount of information contained in the response; and (4) Overall, the general evaluation.

Dataset Model Coherence Diversity #Params
Perplexity↓ BLEU-1 / 2 / 3 / 4 ↑ Avg / Ext / Gre ↑ Distinct-1 / 2 ↑
DailyDialog HRAN 31.04 19.10 8.511 4.670 2.790 63.90 38.86 45.52 1.732 8.700 32.6M
DSHRED 31.71 18.80 8.351 4.547 2.686 63.64 38.98 45.01 1.492 7.606 34.2M
SpkHRED 34.22 19.21 8.421 4.485 2.533 63.84 38.51 45.11 1.052 4.510 40.2M
PHAED(N=4) 25.67 19.24 9.200 5.444 3.517 64.24 39.47 46.30 2.633 13.80 42.7M
Transformer(N=6) 27.34 17.72 7.205 3.722 2.092 62.85 37.88 44.58 2.398 11.85 50.2M
ReCoSa(N=6) 25.34 17.77 7.186 3.684 2.100 62.90 37.42 44.89 2.481 12.39 68.5M
DialoGPT(N=6) 29.93 17.69 8.528 5.214 3.564 64.12 40.24 45.98 1.905 12.37 57.2M
PHAED(N=6) 24.45 19.02 9.174 5.508 3.602 64.42 39.82 46.34 2.932 15.58 53.8M
DialoGPT(N=12) 27.89 18.54 9.432 6.077 4.382 64.86 40.70 46.89 2.109 14.00 99.7M
PHAED(N=12) 23.71 20.47 10.61 7.089 5.326 64.91 39.96 47.34 3.639 20.19 99.3M
PersonaChat HRAN 36.59 21.06 10.22 5.292 2.779 62.18 38.21 43.48 0.2396 0.9633 34.1M
DSHRED 36.96 21.69 10.46 5.401 2.841 62.58 38.56 44.17 0.2718 1.357 34.8M
SpkHRED 38.21 21.19 10.08 5.088 2.580 61.95 37.97 43.37 0.1856 0.6902 40.7M
PHAED(N=4) 33.13 21.48 10.51 5.603 3.101 63.60 39.75 45.16 0.5401 2.580 43.4M
Transformer(N=6) 37.09 19.55 8.807 4.027 1.959 59.84 38.05 40.19 0.1443 0.4177 51.0M
ReCoSa(N=6) 33.88 21.04 9.925 4.973 2.563 62.64 37.40 44.54 0.4417 1.703 69.4M
DialoGPT(N=6) 32.91 20.89 10.19 5.369 2.958 62.69 39.38 44.43 0.5126 2.298 57.7M
PHAED(N=6) 32.62 21.93 10.66 5.595 3.026 63.09 39.48 44.98 0.4996 2.453 54.4M
Ubuntu Transformer(N=6) 42.74 11.51 3.460 1.315 0.5546 62.35 34.08 43.64 0.08594 0.3306 82.9M
ReCoSa(N=6) 36.40 12.74 4.750 2.183 1.150 59.39 34.65 43.05 0.3970 3.441 83.8M
DialoGPT(N=6) 29.98 13.06 5.077 2.425 1.299 59.77 35.00 44.02 0.6783 5.425 81.7M
PHAED(N=6) 28.71 13.15 5.676 2.508 1.449 60.43 35.72 44.29 0.7328 5.735 86.4M
Table 2: Automatic evaluation results (%) on three datasets. Notably, we train all models on the same preprocessed data with speaker tokens added and evaluate the test results with the speaker tokens removed. N denotes the number of stacked blocks; #Params denotes the parameter size. The memory length of PHAED is L=1. The best results for each metric are highlighted in bold. “↑” means higher is better, and “↓” means lower is better.

4.2 Evaluation Results

Considering the influence of parameter size, we compare PHAED(N=4) with the RNN-based baselines and PHAED(N=6) with the Transformer-based baselines. Table 2 shows the automatic evaluation results, where we observe that PHAED outperforms the baselines on all three datasets. For PHAED with different N, a small increase in N (from 4 to 6) improves Perplexity, and with a substantial increase in N (from 4 or 6 to 12), PHAED achieves better performance on all metrics. Taking the results on DailyDialog as an example, the metric scores of PHAED(N=4) and PHAED(N=6) are better than those of the other baselines overall, and when the models' parameter sizes are the same, PHAED(N=12) outperforms DialoGPT(N=12) on most metrics. Therefore, we demonstrate that PHAED performs better than the baselines on automatic evaluation and generates high-quality responses. Moreover, given the lower Perplexity scores achieved by PHAED with larger N, we can infer that stacking more blocks helps PHAED increase the likelihood of generating coherent and diverse responses based on the conversation history.

Model Flu. Coh. Inf. Overall
HRAN 1.09 0.94 0.89 0.86
DSHRED 0.90 0.78 0.73 0.69
SpkHRED 0.89 0.77 0.69 0.66
Transformer 0.97 0.84 0.92 0.78
ReCoSa 1.00 0.84 0.86 0.77
DialoGPT 1.15 0.97 0.92 0.93
PHAED 1.28 1.19 1.21 1.14
Table 3: Human evaluation results.

We carry out the human evaluation on DailyDialog, which contains a wide variety of high-quality conversations from daily life (Cai et al., 2020). The human evaluation results are shown in Table 3. The average kappa score (Fleiss, 1971) is 41.68, which indicates moderate agreement among the three annotators. From the results, we can see that PHAED achieves better performance than the baselines on all metrics, which indicates that PHAED is preferred by humans. The coherence assessments indicate that our responses are coherent with the context. Besides, compared with the baseline methods, the fluency and informativeness scores are also higher, revealing that our approach tends to generate more fluent and informative responses.

Model Variant Coherence Diversity
Perplexity↓ BLEU-1 / 2 / 3 / 4 ↑ Avg / Ext / Gre ↑ Distinct-1 / 2 ↑
PHAED(N=4) 25.67 19.24 9.200 5.444 3.517 64.24 39.47 46.30 2.633 13.80
    w/o speaker tokens 25.77 18.05 8.416 4.855 3.045 63.86 39.57 45.58 2.364 12.73
    w/o aligned turn embedding 26.03 18.20 8.154 4.461 2.636 63.73 39.28 45.49 2.339 12.40
    w/o turn-level relative attention 26.15 17.37 7.905 4.492 2.772 62.87 38.88 44.81 2.335 12.08
    w/o turn-level recurrence 25.99 18.42 8.264 4.626 2.815 63.72 39.27 45.72 2.249 11.70
Transformer(N=6) 27.34 17.72 7.205 3.722 2.092 62.85 37.88 44.58 2.398 11.85
    w/o speaker tokens 27.93 17.48 7.078 3.620 1.966 62.51 37.78 44.44 1.967 9.483
DialoGPT(N=6) 29.93 17.69 8.528 5.214 3.564 64.12 40.24 45.98 1.905 12.37
    w/o speaker tokens 30.86 17.71 8.474 5.124 3.458 63.91 39.90 45.78 1.869 11.96
Table 4: Ablation study results (%). We also evaluate the test results with the speaker tokens removed.

4.3 Ablation Study

To provide a fine-grained analysis of the contribution of each component in PHAED (i.e., the speaker tokens and aligned turn embedding in the input representations, the turn-level relative attention in the encoder, and the turn-level recurrence in the decoder), we conduct an ablation study. Table 4 shows the results. The ablation models without speaker tokens deteriorate on the majority of metrics, such as perplexity, suggesting that adding speaker tokens to the input representations helps both PHAED and other dialogue models generate coherent and diverse responses with respect to the conversation context. Without the aligned turn embedding that encodes the order of utterances, PHAED's performance decreases on all metrics. Likewise, removing the turn-level relative attention block or the turn-level recurrence clearly decreases performance on all metrics. Therefore, it is critical to consider the utterance-level positional information and the contextual information of both queries and responses. Besides, PHAED without speaker tokens deteriorates less than the other ablation variants of PHAED on most metrics, suggesting that the components we propose in PHAED play more important roles than the speaker tokens in the input representations.

Figure 3: The impact of memory length $L$ on the performance of PHAED on PersonaChat. The range of $L$ is from 1 to 7. Δmetric represents the change in the score of a metric relative to $L$=1. Perplexity is abbreviated as PPL.

Looking into the PHAED structure in more depth, we further explore the impact of the decoder memory length $L$ (i.e., the number of previously machine-generated responses cached in memory) on PHAED's performance to identify the optimal value of $L$. According to Dai et al. (2019), the dependency length of the turn-level recurrence is the sum of the lengths of the prior responses. However, is a bigger $L$ always better? To answer this question, we re-train PHAED with different $L$. The values in Figure 3 represent the difference between the results of PHAED($L$=1) (Table 2) and the results of PHAED with larger $L$. Overall, with an increase of $L$, PHAED obtains a better perplexity, but each metric score changes only slightly. Considering that PHAED with a large memory length costs more computing resources, we empirically set $L$ to 3 in PHAED(N=4) and to 2 in PHAED(N=6).

4.4 Case Study

Figure 4: Visualization of query-to-query attention weights. The weights in each row sum to 1.

We would like to know what PHAED has learned from the conversation history. We visualize the query-to-query weights of a conversation based on the turn-level relative attention of PHAED(N=6). Formally, the weight of the $t$-th query attending to the $k$-th query is computed by $\mathrm{sum}(\mathbf{A}_{t,k}) / |q_t|$, where $\mathbf{A}_{t,k}$ is the block of the attention matrix $\mathbf{A}_t$ defined in Eq. (5) that corresponds to $q_k$, and $\mathrm{sum}(\cdot)$ adds up all elements of the input matrix. As shown in Figure 4, we observe two findings that are also common in other conversations: the first query (first column), which contains the major topic of a conversation, appears to be a highly useful context for subsequent queries; and since the representations of the current query can be passed through the residual connection, the turn-level attention seems to care less about the current query (the diagonal entries). Besides, two dialogue examples from the DailyDialog test results are provided in Appendix C.

5 Conclusion and Future Work

We have presented a novel learning model called PHAED for multi-turn dialogue generation that utilizes utterance relations based on their speakers to capture fine-grained conversation context information. Experiments on three benchmark datasets have shown that PHAED outperforms the state-of-the-art by improving the context coherence of responses. Moreover, we find that PHAED learns more from utterances containing the high-level topic information of a conversation history than from other utterances.

In the future, we will extend PHAED to multi-party conversation, where the encoder is responsible for understanding the utterances of all other speakers (i.e., Speaker1-Q, Speaker2-Q, Speaker3-Q, …), and the decoder generates the utterances of the self-speaker (i.e., Speaker-R).

References

  • J. Bak and A. Oh (2019) Variational hierarchical user-based conversation model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1941–1950.
  • S. Bao, H. He, F. Wang, H. Wu, H. Wang, W. Wu, Z. Guo, Z. Liu, and X. Xu (2020) PLATO-2: towards building an open-domain chatbot via curriculum learning. arXiv preprint arXiv:2006.16779.
  • S. Bao, H. He, F. Wang, H. Wu, and H. Wang (2020) PLATO: pre-trained dialogue generation model with discrete latent variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 85–96.
  • H. Cai, H. Chen, Y. Song, C. Zhang, X. Zhao, and D. Yin (2020) Data manipulation: towards effective instance learning for neural dialogue generation via learning to augment and reweight. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6334–6343.
  • Z. Chan, J. Li, X. Yang, X. Chen, W. Hu, D. Zhao, and R. Yan (2019) Modeling personalization in continuous space for response generation via augmented Wasserstein autoencoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1931–1940.
  • C. Chen, J. Peng, F. Wang, J. Xu, and H. Wu (2019) Generating multiple diverse responses with multi-mapping and posterior mapping selection. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pp. 4918–4924.
  • H. Chen, X. Liu, D. Yin, and J. Tang (2017) A survey on dialogue systems: recent advances and new frontiers. SIGKDD Explorations Newsletter 19(2), pp. 25–35.
  • T. Chi, P. C. Chen, S. Su, and Y. Chen (2017) Speaker role contextual modeling for language understanding and dialogue policy learning. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 163–168.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988.
  • J. L. Fleiss (1971) Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), pp. 378.
  • M. Henderson, B. Thomson, and S. Young (2013) Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pp. 467–471.
  • D. Hovy and D. Yang (2021) The importance of modeling social factors of language: theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 588–602.
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In ICLR 2015: International Conference on Learning Representations.
  • J. Li, M. Galley, C. Brockett, G. Spithourakis, J. Gao, and B. Dolan (2016) A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 994–1003.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 986–995.
  • L. Liu, Z. Zhang, H. Zhao, X. Zhou, and X. Zhou (2020a) Filling the gap of utterance-aware and speaker-aware representation for multi-turn dialogue. arXiv preprint arXiv:2009.06504.
  • Y. Liu, H. Qian, H. Xu, and J. Wei (2020b) Speaker or listener? the role of a dialogue agent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4861–4869.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 285–294.
  • M. D. Ma, K. Bowden, J. Wu, W. Cui, and M. Walker (2019) Implicit discourse relation identification for open-domain dialogues. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 666–672.
  • A. Madotto, C. Wu, and P. Fung (2018) Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1468–1478.
  • Z. Meng, L. Mou, and Z. Jin (2018) Towards neural speaker modeling in multi-party conversation: the task, dataset, and models. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR 2013).
  • O. O. Olabiyi, A. Khazane, and E. T. Mueller (2018) A persona-based multi-turn conversation model in an adversarial learning framework. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 489–494.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical report, OpenAI.
  • C. Sankar, S. Subramanian, C. Pal, S. Chandar, and Y. Bengio (2019) Do neural dialog systems use the conversation history effectively? an empirical study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 32–37.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 3776–3783.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 3295–3301.
  • P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468.
  • A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, and J. Nie (2015) A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 553–562.
  • B. Sun, S. Feng, Y. Li, J. Liu, and K. Li (2021) Generating relevant and coherent dialogue responses using self-separated conditional variational autoencoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5624–5637.
  • Z. Tian, R. Yan, L. Mou, Y. Song, Y. Feng, and D. Zhao (2017) How to make context more useful? an empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 231–236.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30, pp. 5998–6008.
  • C. Xing, W. Wu, Y. Wu, M. Zhou, Y. Huang, and W. Ma (2018) Hierarchical recurrent attention network for response generation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
  • J. Xu, Z. Lei, H. Wang, Z. Niu, H. Wu, and W. Che (2021) Discovering dialog structure graph for coherent dialog generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1726–1739.
  • H. Zhang, Y. Lan, L. Pang, J. Guo, and X. Cheng (2019) ReCoSa: detecting the relevant contexts with self-attention for multi-turn dialogue generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3721–3730.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018a) Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213.
  • W. Zhang, Y. Cui, Y. Wang, Q. Zhu, L. Li, L. Zhou, and T. Liu (2018b) Context-sensitive generation of open-domain conversational responses. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2437–2447.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020) DialoGPT: large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 270–278.
  • T. Zhao and M. Eskenazi (2016) Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 1–10.
  • T. Zhao and T. Kawahara (2019) Effective incorporation of speaker information in utterance encoding in dialog. arXiv preprint arXiv:1907.05599.
  • Y. Zhao, C. Xu, and W. Wu (2020) Learning a simple and effective model for multi-turn response generation with auxiliary tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3472–3483.
  • Z. Zheng, X. Yue, S. Huang, J. Chen, and A. Birch (2020) Towards making the most of context in neural machine translation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), pp. 3983–3989.

Appendix A Input Representation Visualization

Figure 5: A sample of the input representations.

As shown in Figure 5, the input representation of each word is the sum of the token, aligned turn, and position feature vectors of that word.

Appendix B Ablation Study of Memory Length L on DailyDialog

Figure 6: The impact of different memory lengths $L$ on the performance of PHAED on DailyDialog. The range of $L$ is from 1 to 5.

In §4.3, we re-train PHAED with different $L$ on PersonaChat to analyze the impact of memory length on the performance of PHAED. As a supplement to this analysis, we conduct the same experiment on DailyDialog and show the results in Figure 6. Since the average number of utterances per dialogue in DailyDialog is smaller than in PersonaChat, we set the range of $L$ from 1 to 5. We observe that the pattern of results on DailyDialog is similar to that on PersonaChat. With an increase of $L$, there are only small changes in all metric scores, and a larger $L$ does not guarantee that PHAED performs better on most metrics. Moreover, the appropriate memory length of PHAED on PersonaChat is also applicable to DailyDialog.

Appendix C Case Study: Two Examples

In §4.4, we visualize the query-to-query weights of a conversation based on the turn-level relative attention of PHAED. As a supplement to that case study, we show two examples from the DailyDialog test results in Table 5. In example 1, PHAED generates an appropriate and informative response, while the baselines either generate responses from the wrong speaker perspective or generate short, safe responses. In example 2, PHAED generates a response that includes clear location information, whereas DialoGPT and ReCoSa generate responses conditioned on the wrong previous query, and the other responses contain only vague location information.

Sample 1
History
Speaker-R May i help you?
Speaker-Q I ’m looking for a pair of shoes.
Speaker-R What size do you wear? these shoes are on sale.
Speaker-Q I wear size 36. this pair looks nice. i try them on?
HRAN Speaker-R Of course.
DSHRED Speaker-R Of course. it’s a bit too expensive.
SpkHRED

Speaker-R Of course. you can have a look at the moment.

Transformer Speaker-R Yes. they look nice. how much is it?
ReCoSa Speaker-R Yes, they are.
DialoGPT Speaker-R Yes, of course.
PHAED Speaker-R Of course. they are all handmade.
Sample 2
History
Speaker-S 911. how may i help you?
Speaker-R I need to report a break in.
Speaker-S When did this break in occur?
Speaker-R Last night.
Speaker-S Where did the break in happen?
HRAN Speaker-R I was in the middle of the city.
DSHRED Speaker-R It was in the middle of the city.
SpkHRED Speaker-R Well, i’m sorry, sir. we have a nice room.
Transformer Speaker-R It was a very good game.
ReCoSa Speaker-R It was in the middle of October.
DialoGPT Speaker-R It’s a couple of days ago.
PHAED Speaker-R It happened at the hilton hotel.
Table 5: Two samples from the DailyDialog test result.