Toward Interpretability of Dual-Encoder Models for Dialogue Response Suggestions

03/02/2020 ∙ by Yitong Li, et al. ∙ Google, Duke University, University of Washington

This work shows how to improve and interpret the commonly used dual encoder model for response suggestion in dialogue. We present an attentive dual encoder model that includes an attention mechanism on top of the extracted word-level features from two encoders, one for context and one for label respectively. To improve the interpretability in the dual encoder models, we design a novel regularization loss to minimize the mutual information between unimportant words and desired labels, in addition to the original attention method, so that important words are emphasized while unimportant words are de-emphasized. This can help not only with model interpretability, but can also further improve model accuracy. We propose an approximation method that uses a neural network to calculate the mutual information. Furthermore, by adding a residual layer between raw word embeddings and the final encoded context feature, word-level interpretability is preserved at the final prediction of the model. We compare the proposed model with existing methods for the dialogue response task on two public datasets (Persona and Ubuntu). The experiments demonstrate the effectiveness of the proposed model in terms of better Recall@1 accuracy and visualized interpretability.




1 Introduction

Deep learning based dialogue systems have shown promising performance in many applications such as smart reply [10], conversation semantic embedding [41], human-computer interaction [8] and others  [21, 45, 48]. Deep neural nets extract rich representations with high-level semantic information that are useful for message retrieval [41, 39] and response generation [48] in conversations.

The dual encoder model [10, 41] is widely used among various dialogue models especially for retrieving response messages, due to its simple structure and competitive computational speed. The dual encoder model consists of two separate encoders, which extract features for the dialogue context (e.g. the previous messages) and a candidate response, respectively. Then a similarity score is computed between the extracted context and candidate features. Whichever candidate from a predefined list has the highest score is selected as the best response. Such a "retrieval" based method shows many advantages over generative language models [48] in industrial applications, such as computational efficiency and preventing undesirable responses.

In this work, we focus on interpreting and improving the dual encoder model, which is normally considered a black box. Although there are many existing models with interpretability designed for question answering [27, 23, 32, 1] or textual entailment [19, 26, 46], fewer works have investigated interpretability in dialogue response generation. In this paper, we present an attentive dual encoder model, which adds an attention mechanism on top of the extracted word-level features from both encoders. With this pairwise word-level attention, not only is the prediction accuracy improved, but also the most important context and response words contributing to the decision of the model can be highlighted. Interpreting a model in terms of the relationship between inputs and outputs can greatly assist developers to debug and improve models, and help users understand why a certain result is suggested.

Figure 1: Attention learned without constraints on unimportant words. Words with higher learned weights are highlighted in darker color. This example is selected from the Persona dataset [45]. A and B denote two different people in one conversation.

There are two potential problems when directly applying the attention mechanism at the word level. First, the standard attention mechanism only emphasizes predictive words to optimize the training loss, without any constraints on attention weights for the purpose of interpretation. The example in Figure 1 shows that many unimportant words are highlighted, such as ‘about’ and ‘perhaps’. Emphasizing unimportant words muddles interpretability and may also harm model performance by over-fitting to the training set. The second problem is caused by commonly used text encoder structures. Most existing text encoders, such as LSTMs and Transformers, discard fine-grained word-level information and create representations that entangle information from across the whole sentence. While this brings advantages for sentence prediction tasks, it impedes word-level interpretation of the prediction.

In order to solve the first problem, we borrow an idea from information theory. Intuitively, prediction-related words should contain useful information while unimportant words should have little information for response retrieval. Thus, in addition to the original attention method, we design a novel regularization loss that minimizes the mutual information between unimportant words in the context and the desired response, so that important words are emphasized while unimportant words are de-emphasized. We propose an approximation method to calculate the mutual information by using a neural network. In practice, this loss can improve both the quality of interpretability as well as response retrieval accuracy.

For the second problem, we present a simple yet effective solution that uses a residual layer connecting raw word embeddings and the final encoded context feature. By tuning the weights on the raw word embeddings, we can balance the importance of the encoded contextual information (for retrieval accuracy) and individual word features (for interpretability at the word-level).

In summary, our contributions are three-fold:

  • Introduced a learnable attention mechanism between input dialogue context and response text pairs, which improves both retrieval accuracy and interpretability.

  • Proposed a regularization term that emphasizes important word pairs and penalizes unimportant word pairs, therefore improving interpretability in an unsupervised way.

  • Demonstrated that the fusion of both encoded features and word specific embeddings further improves the interpretability.

2 Related Works

Traditional dialogue systems can be roughly divided into goal-oriented and non-goal driven models [45]. Goal-oriented models target specific application circumstances [42], such as customer service [38] and computer system troubleshooting [20] dialogues. These works tend to use lexical semantics to match basic syntactic similarity [14]. In contrast, non-goal driven models focus more on data statistics, rather than using hand-coded rules. They score responses based on how well they match the dialogue context.

With the development of neural networks, recent dialogue systems are usually non-goal driven and trained in an end-to-end fashion. Among them, generative dialogue systems have attracted growing interest in recent years [48, 8, 29], especially among the research community. They are designed to learn the conditional distribution of the responses given dialogue history [40, 30].  [29] uses Determinantal Point Processes to generate responses with diversity and ground knowledge can also be utilized to generate novel results [8]. The most significant problem that hinders wide industrial use of generative conversation models is reliability. Most existing generative models suffer issues of incorrect grammar, lack of long-term coherence, and even generation of offensive responses [35].

Compared with generative conversation models, dialogue retrieval based models are more reliable and simpler in structure [10]. The well known dual encoder model [10, 41] has had success with semantic similarity and response scoring in conversations. Recent works tackle more challenging situations, like multi-party conversation recommendation [44], multi-turn response selection [49], and personalization [45, 21].

Since most existing deep learning models are black boxes, interpretability becomes a desired property to explain why the neural network gives a certain result. Interpretable neural networks have been developed for text generation [43, 47], visual question answering [34, 2] and sequential data classification [16]. [18] further extends single image question answering to a collection of images. Most existing works are based on generative models, where interpretability is usually achieved via a Variational AutoEncoder (VAE) [13, 12, 47]. To further improve interpretability, an attention mechanism [33] is integrated into most existing methods [31, 34, 2, 22]. Although interpretability has been applied in the domains mentioned above, there are few works that aim to interpret neural conversation models [23].

With the development of techniques to estimate mutual information (MI) in high dimensional data [3, 7], using knowledge from information theory to improve the performance of neural networks has received growing attention. For instance, Deep Infomax [11] learns generalized features for images by maximizing the MI between global and localized features. [17] focuses on dynamic scene navigation with MI and [25] applies MI to the graph alignment task. Mutual information has become an effective and efficient way to measure the correlation among random variables in neural networks.

3 Model

3.1 Preliminary

Given a dialogue context $x$, which is sampled from a dialogue context set $\mathcal{X}$ and contains at most $K$ messages, the response suggestion task is to retrieve the best response from a given response candidate list. Note that there could be other non-text input signals associated with each message, such as the user id. Such signals can be encoded in the same way as word embeddings. For simplicity, we leave out the user id signal in the problem formulation below. The response candidate list $\mathcal{Y} = \{y_1, \dots, y_C\}$ contains messages for possible responses. $C$ is the total number of candidates and can be on the order of tens of thousands in real-world applications. The best response message $y \in \mathcal{Y}$ corresponds to the label of the dialogue context $x$.

(a) Baseline
(b) Attentive Dual Encoder Model
Figure 2: Comparison of dual encoder model and our attentive dual encoder model.

The previous work [10] defines a dual encoder model, where the features of a dialogue context $x$ and each label $y$ are extracted by a dialogue encoder and a response encoder, respectively. The framework is given in Figure 2(a). The two encoders can be designed with partially shared or totally separate structures based on the training set sizes of the contexts and responses, while the word embedding is usually shared. Denote by $\mathbf{H}_x$ and $\mathbf{H}_y$ the corresponding encoded token-level features of the dialogue context and response, with $\mathbf{H}_x \in \mathbb{R}^{d \times T_x}$ and $\mathbf{H}_y \in \mathbb{R}^{d \times T_y}$, where $T_x$ and $T_y$ are the lengths of the dialogue context and response, respectively, and $d$ is the dimension of encoded tokens. The training objective is to maximize the similarity score of paired dialogue context-response samples while minimizing the scores from other mismatched pairs.

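The similarity score can be formulated by $s(x, y) = \cos\big(f(\mathbf{H}_x), f(\mathbf{H}_y)\big)$, where $\mathbf{H}_x$ and $\mathbf{H}_y$ are the encoded token-level features of the dialogue context and response, $\cos(\cdot, \cdot)$ denotes the cosine similarity function and $f(\cdot)$ is a function that aggregates the encoded token-level features into a fixed dimension $d$. Since the length of each sentence is varied, an average pooling function across the token length dimension can be used to get a final feature vector with fixed dimension $d$. Specifically, $f(\mathbf{H}_x) = \frac{1}{T_x} \sum_{i=1}^{T_x} \mathbf{h}^x_i$ and $f(\mathbf{H}_y) = \frac{1}{T_y} \sum_{j=1}^{T_y} \mathbf{h}^y_j$, where $\mathbf{h}^x_i$ and $\mathbf{h}^y_j$ are the individual token features and $T_x$ and $T_y$ are the sequence lengths. Although this method is simple and effective in practice, it is a black box without interpretability.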
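The baseline scoring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: token features are plain Python lists standing in for encoder outputs, and the helper names (`mean_pool`, `cosine`, `baseline_score`, `retrieve`) are hypothetical.

```python
import math

def mean_pool(tokens):
    """Average a list of token vectors (lists of floats) into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def baseline_score(context_tokens, response_tokens):
    """Baseline dual encoder score: cosine of the two mean-pooled features."""
    return cosine(mean_pool(context_tokens), mean_pool(response_tokens))

def retrieve(context_tokens, candidates):
    """Return the index of the highest-scoring candidate response."""
    scores = [baseline_score(context_tokens, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy usage: two candidates, the second points in the same direction as the context.
context = [[1.0, 0.0], [1.0, 0.0]]
candidates = [[[0.0, 1.0]], [[2.0, 0.0], [1.0, 0.0]]]
best = retrieve(context, candidates)
```

Because pooling collapses every token into one vector before the comparison, the score alone gives no indication of which words drove the match.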

On top of the dual encoder model, we introduce a more interpretable model, called the "attentive dual encoder model", in Section 3.2. In Section 3.3, we introduce a new loss term in the attentive dual encoder model to regularize the learning of the attention mask that emphasizes important tokens and de-emphasizes unimportant tokens. To improve interpretability at the encoded feature layer where word-level information is entangled with contextual information, in Section 3.4, we propose a residual layer that leverages the raw word embedding.

3.2 Attentive Dual Encoder Model

An attentive dual encoder model is introduced to learn the connection between dialogue context and response at the word level. We adopt the attention mechanism [33] on top of the standard dual encoder model, as shown in Figure 2(b). Specifically, a similarity matrix $\mathbf{S} \in \mathbb{R}^{T_x \times T_y}$ is defined on the encoded features of the dialogue context and response to measure the pairwise word relationships, where $T_x$ and $T_y$ are the numbers of words in the context and response. The $(i, j)$-th entry of $\mathbf{S}$ is given as

$$S_{ij} = \cos\big(\mathbf{h}^x_i, \mathbf{h}^y_j\big), \tag{1}$$

where $\mathbf{h}^x_i$ is the $i$-th word feature of the dialogue and $\mathbf{h}^y_j$ is the $j$-th word feature of the response. For simplicity, the similarity function is the cosine similarity [6] though other similarity functions [28] are also applicable.

Given the similarity matrix $\mathbf{S}$, the intuition is to find out if two words have strong connections, i.e. if there are any words in the dialogue context and responses that greatly influence the prediction. In the final response prediction, the dialogue context and response features are weighted according to the similarity matrix. For each word in the dialogue context, we first select the word from the candidate response with the maximum similarity, and vice versa for each word in the candidate response. Specifically, the maximum pooled attention weights for dialogue context and response can be defined as

$$\tilde{a}^x_i = \max_{j} S_{ij}, \qquad \tilde{a}^y_j = \max_{i} S_{ij}, \tag{2}$$

where $i$ and $j$ are the indexes for words in the dialogue and response, respectively. Then the final attention weights for dialogue context and response are defined as $\mathbf{a}_x = \mathrm{softmax}(\tilde{\mathbf{a}}^x)$ and $\mathbf{a}_y = \mathrm{softmax}(\tilde{\mathbf{a}}^y)$. Note that other attention mechanisms can also be adopted, like mean pooling or weighted mean pooling w.r.t. the similarity matrix [6].

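The attention weights $\mathbf{a}_x$ and $\mathbf{a}_y$ re-weight the original encoded features $\mathbf{H}_x$ and $\mathbf{H}_y$ by their importance as $\mathbf{z}_x = \mathbf{H}_x \mathbf{a}_x$ and $\mathbf{z}_y = \mathbf{H}_y \mathbf{a}_y$. The final prediction score of a dialogue context-response pair $(x, y)$ is given as

$$s(x, y) = \mathbf{z}_x^{\top} \mathbf{z}_y, \tag{3}$$

where the dot product can be replaced by other metrics [36].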
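The attention pipeline of this section (pairwise similarity, max pooling in both directions, normalization, weighted sum, dot product) can be sketched as follows. This is an illustrative sketch rather than the authors' code: it assumes a softmax normalization of the max-pooled weights so that they are positive and sum to one, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, tokens):
    """Sum of token vectors, each scaled by its attention weight."""
    dim = len(tokens[0])
    return [sum(w * t[k] for w, t in zip(weights, tokens)) for k in range(dim)]

def attentive_score(ctx, rsp):
    """ctx, rsp: lists of token vectors. Returns (score, ctx_weights, rsp_weights)."""
    # Pairwise word similarity matrix: sim[i][j] = cos(ctx[i], rsp[j]).
    sim = [[cosine(c, r) for r in rsp] for c in ctx]
    # Max pooling: best match for each context word and each response word.
    ctx_raw = [max(row) for row in sim]
    rsp_raw = [max(col) for col in zip(*sim)]
    # Assumed normalization (softmax) so the weights are positive and sum to one.
    a_ctx, a_rsp = softmax(ctx_raw), softmax(rsp_raw)
    # Attention-weighted features, compared with a dot product.
    z_ctx, z_rsp = weighted_sum(a_ctx, ctx), weighted_sum(a_rsp, rsp)
    score = sum(a * b for a, b in zip(z_ctx, z_rsp))
    return score, a_ctx, a_rsp

# Toy usage: the first context word matches the single response word exactly.
score, a_ctx, a_rsp = attentive_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

The returned weight vectors are the quantities that would be visualized as word highlights.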

To train the model, the observed pairs of context and response $(x_i, y_i)$ are considered positive pairs and should have higher scores, while all other mismatched pairs of context and response $(x_i, y_j)$, where $j \neq i$, are negative pairs that should have lower scores. However, randomly sampling negative training pairs is time consuming. Practically, we construct a mini-batch by sampling $B$ positive dialogue-response pairs, where all mismatched pairs are used as negative pairs. We therefore conduct the retrieval task as a dialogue context-response matching problem in each mini-batch. A softmax retrieval loss can be defined as:

$$\mathcal{L}_{x \to y} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\big(s(x_i, y_i) / \tau\big)}{\sum_{j=1}^{B} \exp\big(s(x_i, y_j) / \tau\big)}, \tag{4}$$

where $s(\cdot, \cdot)$ is the prediction score and $\tau$ is a temperature parameter that normalizes the context-response similarity to a proper range. Since in each mini-batch, dialogue contexts and responses for dual encoder modeling are symmetric, we can also use each response to retrieve its corresponding context. Therefore, a response-context retrieval loss can be written as:

$$\mathcal{L}_{y \to x} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\big(s(x_i, y_i) / \tau\big)}{\sum_{j=1}^{B} \exp\big(s(x_j, y_i) / \tau\big)}. \tag{5}$$

The overall retrieval loss for the proposed attentive dual encoder is:

$$\mathcal{L}_{\mathrm{ret}} = \mathcal{L}_{x \to y} + \mathcal{L}_{y \to x}. \tag{6}$$
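The in-batch retrieval losses described above can be sketched as a symmetric softmax cross-entropy over a score matrix. This is a sketch under assumptions that the text does not confirm: scores are divided by the temperature, each direction is averaged over the mini-batch, and the two directions are summed. The names `row_loss` and `retrieval_loss` are hypothetical.

```python
import math

def row_loss(scores, tau):
    """Mean negative log-probability of the diagonal entry in each row."""
    total = 0.0
    for i, row in enumerate(scores):
        scaled = [s / tau for s in row]
        m = max(scaled)
        # Log of the softmax normalizer, computed stably.
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[i]
    return total / len(scores)

def retrieval_loss(scores, tau=1.0):
    """scores[i][j]: score of context i with response j; the diagonal holds positive pairs."""
    ctx_to_rsp = row_loss(scores, tau)
    # Transposing swaps the roles: each response now retrieves its context.
    transposed = [list(col) for col in zip(*scores)]
    rsp_to_ctx = row_loss(transposed, tau)
    return ctx_to_rsp + rsp_to_ctx

# Toy usage: matched pairs scored high give a small loss, swapped pairs a large one.
good = retrieval_loss([[5.0, 0.0], [0.0, 5.0]])
bad = retrieval_loss([[0.0, 5.0], [5.0, 0.0]])
uniform = retrieval_loss([[0.0, 0.0], [0.0, 0.0]])
```

Using every other pair in the mini-batch as a negative avoids a separate negative sampling step, at the cost of tying the number of negatives to the batch size.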


3.3 Non-attention Regularization

As illustrated in Figure 1, the learned attention may place higher weights on unrelated information, due to limited training samples or to biased words that occur with high frequency in the training set. As a result, the learned attention can be noisy for the purpose of interpretation. In this section, we introduce a non-attention regularization mechanism to help the model attend to semantically important words, while ignoring unimportant words.

Recall that the attended dialogue context feature is $\mathbf{z}_x = \mathbf{H}_x \mathbf{a}_x$, where $\mathbf{H}_x$ contains the encoded word features and all entries in the attention weights $\mathbf{a}_x$ are positive and sum to one. In contrast to the attention weights, we define a non-attention weight $\bar{\mathbf{a}}_x$, which marks the non-attended words that the model should de-emphasize during prediction. By applying the non-attention weight to the encoded features, the unimportant feature for the dialogue context is defined as

$$\bar{\mathbf{z}}_x = \mathbf{H}_x \bar{\mathbf{a}}_x. \tag{7}$$

Analogously, we can derive $\bar{\mathbf{z}}_y$ for the unimportant response feature.

Ideally, the unimportant context feature $\bar{\mathbf{z}}_x$ should contain little information about the response. In order to achieve this, we adopt mutual information from information theory as the evaluation metric. In our situation, we use it to measure the uncertainty of the correct response $y$ given the unattended dialogue context feature. Thus the mutual information is used as a regularization objective and can be written as $I(\bar{\mathbf{z}}_x; y)$.

However, it is not straightforward to calculate this mutual information in a high-dimensional space. Inspired by recent work [3, 24], we adopt a neural network to approximate its value. From [5], the mutual information can be upper bounded by the following formulation, given the samples in one mini-batch.


where and are used as index of training samples. The expectation is taken over .

In practice, this is similar to the discriminator in a Generative Adversarial Network (GAN) [9]. Specifically, one distribution in Eq. (8) contains correct sample pairs (real samples) while the other is the distribution of mismatched pairs (fake samples). In the following, we suppose the scoring function in Eq. (8) is approximated by a neural network with parameter $\phi$. This network classifies true or false pairs from the given unimportant context feature and response. Note that the scoring function can also be simplified as the vector inner product without any learnable parameters. When updating Eq. (8), we use a moving average for each mini-batch to alleviate the biased gradient problem. Further discussion of this point can be found in [3, 24].
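To make the matched-versus-mismatched construction concrete, the sketch below estimates mutual information from one mini-batch with a Donsker-Varadhan style estimate, a common choice for neural mutual information estimation. This is an assumption made for illustration: it may not match the exact bound the authors use, it omits the moving average correction, and it uses the parameter-free inner product as the scoring function. The name `dv_mutual_information` is hypothetical.

```python
import math

def dot(u, v):
    """Inner product, used here as a parameter-free scoring function."""
    return sum(a * b for a, b in zip(u, v))

def dv_mutual_information(ctx_feats, rsp_feats, critic=dot):
    """Donsker-Varadhan style estimate from one mini-batch (needs at least two pairs).

    ctx_feats[i] and rsp_feats[i] form a matched pair; every (i, j) with
    i != j is treated as a mismatched pair.
    """
    n = len(ctx_feats)
    # Mean score over matched ("real") pairs.
    joint = sum(critic(ctx_feats[i], rsp_feats[i]) for i in range(n)) / n
    # Log of the mean exponentiated score over mismatched ("fake") pairs.
    neg = [critic(ctx_feats[i], rsp_feats[j])
           for i in range(n) for j in range(n) if i != j]
    m = max(neg)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in neg) / len(neg))
    return joint - log_mean_exp

# Toy usage: features that identify their partner versus features that do not.
informative = dv_mutual_information([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
uninformative = dv_mutual_information([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

In the regularizer, the context features passed in would be the unimportant (non-attended) features, and the encoder would be trained to push this estimate down.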

3.4 Combine Word Embeddings

As pointed out by [4, 27], using only the features computed after attention can lead to inaccuracies. The output of the encoder can mix representations of multiple words, even the whole sentence, depending on the encoder structure. For a standard transformer model [33], encoded word-level features after the first layer only have of the information of the original word embedding. After ten layers, this number drops down to . Therefore, the attention weights and , calculated from the deeply encoded features and , have a smoothed distribution at the word-level, undermining the interpretability of word importance during the prediction.

We use a simple yet effective method to address this issue, by adding raw word embeddings directly to the encoded features after multiple layers. As illustrated in Figure 1, this can be done by a residual layer between the raw word embeddings and the top layer of the dialogue context or response encoder. Taking the dialogue context as an example, the residual feature learned from the raw word embedding can be written as:

$$\mathbf{R}_x = g(\mathbf{E}_x), \tag{9}$$

where $\mathbf{E}_x$ contains the raw word embedding of each word in dialogue $x$ in each column. To simplify the notation, the parameters in $g$ are included in the encoder parameter sets $\theta_x$ or $\theta_y$. $g$ is implemented by a single fully connected layer in the experiment, which ensures that $\mathbf{R}_x$ and the encoded features $\mathbf{H}_x$ are of the same dimension. $\mathbf{R}_x$ can be concatenated or directly added to $\mathbf{H}_x$. In this work, the final word embedding is calculated as $\mathbf{H}_x + \alpha \mathbf{R}_x$, where $\alpha$ is determined by the validation set. Thus the effect of individual word information is explicitly considered in the final encoded feature representation. In the experiment section, applying Eq. (9) allows us to better discriminate the importance of individual words when visualizing the attention.
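The residual connection can be sketched as a per-token linear projection of the raw embedding that is scaled and added to the encoder output. This is a minimal sketch, assuming the mixing weight is applied to the projected raw embedding; the names `linear`, `add_word_residual` and `alpha` are hypothetical.

```python
def linear(vec, weight, bias):
    """Single fully connected layer; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def add_word_residual(encoded, raw_embeddings, weight, bias, alpha):
    """Return encoded[t] + alpha * linear(raw_embeddings[t]) for every token t."""
    combined = []
    for h, e in zip(encoded, raw_embeddings):
        # Project the raw embedding to the encoder dimension.
        r = linear(e, weight, bias)
        combined.append([hi + alpha * ri for hi, ri in zip(h, r)])
    return combined

# Toy usage: one token, a 3-dimensional raw embedding, a 2-dimensional encoder output.
encoded = [[1.0, 1.0]]
raw = [[1.0, 2.0, 3.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
mixed = add_word_residual(encoded, raw, weight, bias, alpha=0.5)
```

Setting `alpha` to zero recovers the plain encoder output, so the weight acts as a dial between contextual information and word-specific information.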

3.5 Train the Attentive Dual Encoder Model

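The overall training objective of the attentive dual encoder model is given as

$$\mathcal{L} = \mathcal{L}_{\mathrm{ret}} + \lambda \, \hat{I}_{\phi}(\bar{\mathbf{z}}_x; y), \tag{10}$$

where $\mathcal{L}_{\mathrm{ret}}$ is the retrieval loss, $\hat{I}_{\phi}$ is the mutual information between the unimportant context feature $\bar{\mathbf{z}}_x$ and the response $y$ as estimated by a network with parameter $\phi$, and $\lambda$ is a hyper-parameter to balance the value of the regularization term. During the experiment, $\lambda$ is set to 1 based on the validation set. The retrieval cost is used to maximize the score of correct dialogue context-response pairs within one mini-batch, while the mutual information regularization term is used to force the attention weights to highlight useful information only.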

The training objective is a min-max game between the dual encoder and the neural mutual information estimator. Note that the parameter $\phi$, which is used to estimate the mutual information, only appears in the second term of Eq. (10). The update of $\phi$ can be separated from the cross-entropy loss, while the updates of the encoder parameters $\theta_x$ and $\theta_y$ need to consider gradients from both terms.

4 Experiment

Experimental Setup:

Both dialogue context and response encoders are built upon a Transformer [33] with three layers. The dimensionality of the embedding and the number of heads are set to 128 and 4, respectively. The word embedding dimension is set to 100. We used the Adam optimizer [15] with a learning rate , and a batch size of 64. We also conduct an ablation study to show the effectiveness of each proposed component in our network. In the ablation study, DE denotes the standard dual encoder model [10]. ADE is the attentive dual encoder model with the additional attention mechanism introduced in Section 3.2. WE stands for Word Embedding, where a residual layer is connected between the raw word embedding and the output encoded features. REG represents the mutual information regularization term introduced in Section 3.3.


The model is compared with several existing works. The IR baseline [30] measures the TF-IDF weighted cosine similarity between the bag-of-words features of the dialogue and the messages in the candidate list. Similarly, Starspace [37] is trained by maximizing the similarity between the learnable word embeddings of dialogue contexts and responses. Neither of these two methods involves a text encoder. KVPM [45] uses a memory network and performs attention over dialogue contexts. It was originally designed as a personalized dialogue model, but it is used without the user profile information in the experiment.


In the experiment, we evaluate the proposed attentive dual encoder model with existing methods on two public datasets:

Ubuntu Dataset [20] contains training dialogues. The testing set contains dialogues. Each dialogue contains four to five utterances. Since most of the response messages appear at a low frequency, only the top most common messages are selected for the response candidate list. Only dialogues with responses included in this list are evaluated.

Persona Dataset [45] was initially published for developing personalized dialogue agents. It contains a total of utterances over dialogues, where dialogues with utterances are used as training and dialogues with utterances as testing. Instead of using a fixed candidate list, for each test dialogue, we randomly sample 19 responses from other dialogues and combine them with the ground truth response to evaluate scoring of the 20 candidates.


In the testing stage, we rank candidate responses from a candidate list using the score between the candidate and dialogue context (Eq. (3)). In the experiment, we only use Recall@1 as the quantitative evaluation metric, which matches real usage of the model. Recall@k is the accuracy defined as

$$\mathrm{Recall@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[r_i \le k],$$

where $N$ is the total number of evaluated instances, $r_i$ is the rank of the similarity score between the $i$-th dialogue context and its ground truth response in the final sorted list, and $\mathbb{1}[\cdot]$ equals 1 if the condition holds and 0 otherwise.
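The metric can be computed directly from per-instance candidate scores. The sketch below is illustrative; the function names are hypothetical, and ties are resolved in favor of the ground truth, which is an assumption the text does not specify.

```python
def rank_of_truth(scores, truth_index):
    """1-based rank of the ground-truth candidate under descending score order.

    Candidates tied with the ground truth do not push its rank down (assumption).
    """
    truth = scores[truth_index]
    return 1 + sum(1 for s in scores if s > truth)

def recall_at_k(all_scores, truth_indices, k=1):
    """Fraction of instances whose ground-truth response ranks within the top k."""
    hits = sum(1 for scores, t in zip(all_scores, truth_indices)
               if rank_of_truth(scores, t) <= k)
    return hits / len(all_scores)

# Toy usage: the truth is ranked first in one instance and second in the other.
all_scores = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]
truth_indices = [0, 2]
r1 = recall_at_k(all_scores, truth_indices, k=1)
r2 = recall_at_k(all_scores, truth_indices, k=2)
```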


For the Ubuntu dataset, we also use the prior knowledge of response frequencies to further remove noise caused by rarely used responses. Denote by $p(y)$ the normalized usage frequency of response $y$ in the training set. Then the prediction score with prior knowledge can be written as


which is computed for each message in the candidate list.

4.1 Quantitative Results

Model               Persona   Ubuntu
IR baseline [30]    24.1      N/A
Starspace [37]      31.8      N/A
KVPM [45]           34.9      N/A
DE                  35.2      7.6
ADE                 35.8      15.9
ADE + REG           38.0      16.0
ADE + WE            36.2      15.3
ADE + WE + REG      38.1      15.6
Table 1: Recall@1 (in percent) comparison on two dialogue datasets.

We summarize the evaluated results from the two datasets in Table 1. As shown, the attentive dual encoder (ADE) outperforms the other baselines on Recall@1 accuracy. Since neither the IR baseline [30] nor Starspace [37] has an effective text encoder, their results are not competitive with the others. Though KVPM [45] shares the text encoder for the dialogue context and the response, it does not have the non-attention mechanism. As a result, KVPM is not able to accurately select a response from a large candidate list.

We also compare the model in an ablation study, where the mutual information regularization term and word embedding layers are gradually added to the model. As can be seen, the dual encoder (DE) has the lowest result among our variants, although it is still higher than the other baselines. Adding the attention mechanism gives a small improvement on the Persona dataset (35.2 to 35.8) and a large performance gain on Ubuntu (7.6 to 15.9). This demonstrates that the attentive dual encoder model can not only help with the visualization, but can also improve the retrieval accuracy.

Adding the mutual information regularization term can further improve the results of the attentive dual encoder model. This can be attributed to the fact that irrelevant words are excluded by the explicit constraint of mutual information regularization. This not only alleviates overfitting, but also helps the visualization for interpretation. Adding the word embedding residual layer (WE) does not make a significant difference. This is reasonable since most of the textual knowledge has already been encoded in the feature vectors. However, it can help with the attention visualization, which will be shown in the next section. Additionally, compared with the number of parameters in the DE model, the number of parameters in the attention (Eq. (1)) and WE components (Eq. (9)) is negligible.

4.2 Attention Visualization

(a) Attentive dual encoder
(b) Attentive dual encoder + WE
(c) Attentive dual encoder + WE + REG
Figure 3: Visualization on Persona dataset example with different competitors. (Top) Standard dual encoder model. (Middle) Standard dual encoder model with residual layer added for word embeddings. (Bottom) The attentive dual encoder model with all components included.

In addition to the quantitative results, model interpretability is also essential. We visualize the learned attention weights in Figure 3, where darker colors indicate higher weight values. Because word-level information is entangled inside both the dialogue context and response encoders, the plain attentive dual encoder model cannot clearly distinguish the importance of different words. This can be observed by comparing Figure 3(a) and Figure 3(b), where the former has similar weights for nearly all words. In contrast, WE connects the information between the raw word embedding and the deeply encoded features, providing a finer word-level interpretation of the model prediction.

The mutual information regularization term helps to alleviate the effect of uncorrelated words. As shown in Figure 3(c), the emphasized predictive words are more reasonable and more clearly distinguished from the others. In the given example, one person is talking about being treated unfairly by his/her friends. Intuitively, the attended words are expected to be ‘buy beer’, ‘nerd’, ‘friends’ in the dialogue context and ‘nothing wrong’, ‘math’, ‘science’ in the response. For the method without mutual information regularization (Figure 3(b)), words like ‘excelling’ are emphasized more than the others. This is caused by attention on arbitrary words without any constraints. In summary, our two proposed components help to improve interpretability for dialogue context-response prediction. A more rigorous study to quantify and evaluate the attention visualization effect will be done in the future.

5 Conclusion

In this work, we presented a new interpretable model for dialogue response suggestion. The model is built upon the well-known dual encoder language model, where an attention mechanism is integrated to further improve the performance and show the word importance during the prediction. As a result, the proposed attentive dual encoder achieves better dialogue context-response prediction results on two datasets compared to existing methods. Additionally, we address two problems to further improve the attention visualization quality. First, mutual information is used to constrain unimportant words in the dialogue context to have lower weights. Second, a residual layer is added between encoded sentence features and raw word embeddings, providing more fine-grained information at the word level. With little effect on the prediction, the proposed methods further improve the word-level interpretability in the dialogue context-response prediction.