Neural conversation models have achieved promising performance in response generation. However, it is widely observed that the generated responses lack sufficient content and information li2016diversitymmi. One way to address this issue is to integrate various external information into conversation models. Examples of external information include document topics xing2017topic, commonsense knowledge graphs zhou2018commonsense, and domain-specific knowledge bases yang2019hybrid. Conversing by reading (CbR) CMRACL2019 is a recently proposed scenario where external information can be ingested into conversations. In CbR, conversations take place with reference to a document. The key problem in CbR is to learn how to integrate information from the external document into response generation on demand.
To exploit knowledge from documents for conversations, a conventional approach is to extend the sequence-to-sequence (Seq2Seq) model seq2seq with Memory Networks sukhbaatar2015end, which store knowledge representations accessible to the decoder ghazvininejad2018knowledge; parthasarathi2018extending. dinan2018wizard propose to encode the dialogue context as well as a set of retrieved knowledge with a Transformer vaswani2017attention to construct the memory. However, these methods only use sentence-level representations of the documents in the memory, which cannot pinpoint accurate token-level document information.
To discover token-level document information, researchers borrow models from other generation tasks, which are adept at extracting segments of sentences for given questions. moghe2018towards explore the pointer generator network see2017get from abstractive summarization and the bi-directional attention flow model seo2016bidirectional, a QA model, to predict a span of the document to be included in the response. CMRACL2019 follow the stochastic answer network (SAN) SAN from machine reading comprehension (MRC), integrating both context and document information to form a context-aware document memory. This approach achieves state-of-the-art performance on the CbR task.
However, we should notice the difference between existing generation tasks and CbR. Summarization, QA, and MRC require models to extract exact answers from documents, where the documents cover all requisite knowledge. In contrast, CbR expects a general utterance relevant to both the context and the document. In the example in Fig. 1, the document refers to actor, films, fans, wealthy, while the context mentions disease. The document and context discuss the same person but have no topic overlap; thus we cannot pinpoint document information from the context alone. If we use SAN as in CMRACL2019, SAN can hardly acquire helpful information from context-document interaction. To ingest useful knowledge for response generation, we argue that processing documents should consider not only the interaction between context and document but also the target response. In the example, the document should attend more to fans, wealthy by considering the response.
In this work, we propose a method to construct a response-anticipated memory to contain document information that is potentially more important in generating responses. Particularly, we construct a teacher-student framework based on CMRACL2019. The teacher model accesses the ground-truth response, context, and document. It learns to construct a weight matrix that contains information about the importance of tokens in the document to the response. The student model learns to mimic the weight matrix constructed by the teacher without access to the response. That is, the teacher learns to build a response-aware memory, while the student learns to build a response-anticipated memory. During inference on testing data, the student will be applied. Our experiments show our model exceeds all competing methods.
2 Related Work
Most neural conversation models in open domain chit-chat scenarios are based on the Seq2Seq model seq2seq; NRM_Shang15. A critical issue of these models is the safe response problem, i.e., generated responses often lack enough content and information. To address this issue, previous work encourages response diversity and informativeness by introducing new training objectives li2016deep; zhao2017CVAE, refining beam search strategies li2016diversitymmi; vijayakumar2016diverse; song2017diversifying, exploiting information from conversational contexts aaai16_hred; serban2017hierarchical; tian2017acl, or incorporating with retrieval-based conversation systems song2018ensemble; wu2019response; tian2019ACL.
Some researchers augment the information available for generating responses with external resources. zhou2018commonsense utilize a commonsense knowledge graph through their designed graph attention. agarwal2018knowledge propose a knowledge encoder to encode query-entity pairs from the knowledge base. wu2019proactive enrich response generation with knowledge triplets. These works all use knowledge in structured formats.
External unstructured text information has also been investigated to improve conversation models. Some researchers directly build a “document memory” by feeding distributed representations of the knowledge sentences into conversation models ghazvininejad2018knowledge; parthasarathi2018extending. dinan2018wizard make use of the Transformer vaswani2017attention to encode the knowledge sentences as well as the dialogue context. ren2019thinking design a knowledge selector to construct the document memory from selected knowledge information. As stated in the introduction, other researchers borrow models from other generation tasks, including abstractive summarization models moghe2018towards, QA models moghe2018towards, and MRC models meng2019refnet; CMRACL2019. In particular, CMRACL2019 achieve state-of-the-art performance. However, all of these methods construct the document memory from connections between context and document without considering the response. If the context or document contains many noisy tokens irrelevant to the response, which is indeed the case in CbR, the constructed memory may be misled by this noisy information (as in the case in Fig. 1). Therefore, we propose to take the response into account during memory construction, which helps generate a more desirable response.
In this section, we will first give an overall description of the proposed teacher-student architecture for CbR, then briefly describe the base model. The detailed teacher model and student model are presented in Sec 3.3 and 3.4. Lastly, we summarize the training updates of the two models in Sec 3.5.
3.1 Model Architecture
The CbR task provides a conversation context and a document as inputs, requiring the model to generate a response to by referring to . In the rest of the paper, we use , , and to denote the number of tokens in , , and respectively. To pinpoint accurate document information for response generation, we design a teacher-student framework to construct document memory as follows:
The teacher model learns a response-aware document memory used in our base conversation model. Specifically, we construct a response-aware weight matrix , which considers the correlation between context-aware document representations and response representations, and then impose on the memory matrix . The teacher model is optimized to reconstruct the response with the use of response-aware memory .
The teacher model and the student model are jointly optimized with training data, while only the student model is applied to testing data.
3.2 Base Model
Following CMRACL2019, we use SAN SAN as our base model, which mainly consists of three components:
Input encoder: We use two bi-directional LSTM encoders to extract token-level representations of the document and the context .
Memory construction: We build the document memory ( is the hidden size of the memory) which will be used in the decoder. A cross-attention layer is first applied to the outputs of the two encoders to integrate information from the context to the document. Then, we obtain a set of context-aware document representation . Since each corresponds to a document token, we treat it as the context-aware token representation of the -th token. Next, a self-attention layer is employed to ingest salient information of the context-aware document representations:
where the softmax conducts the normalization over each row of the matrix.
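The self-attention step of the memory construction can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes dot-product attention scores between context-aware token representations and takes the memory to be the attention-weighted sum of those representations. The function names `softmax` and `self_attention_memory` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_memory(doc_reprs):
    """Build a memory from context-aware document token representations.

    doc_reprs: list of n vectors, one per document token.
    Returns (attention, memory); attention is n x n with rows summing to 1.
    """
    # Assumed scoring: plain dot product between token representations.
    scores = [[sum(a * b for a, b in zip(di, dj)) for dj in doc_reprs]
              for di in doc_reprs]
    # Row-wise normalization, as described in the text.
    attention = [softmax(row) for row in scores]
    dim = len(doc_reprs[0])
    # Each memory row is the attention-weighted sum of all token vectors.
    memory = [[sum(w * d[k] for w, d in zip(row, doc_reprs)) for k in range(dim)]
              for row in attention]
    return attention, memory

doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention, memory = self_attention_memory(doc)
```

Each row of `attention` is one attention distribution over document tokens, which is the object that the teacher and student models later reweight.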
Output decoder: We use an attentional recurrent decoder to generate response tokens by attending to the memory . The initial hidden state is set as the summation of token-level context representations. For each decoding step , we get a hidden state :
where indicates concatenation, and the cross-attention layer here integrates information from the memory to the recurrent outputs. is the word-embedding at step . Finally, we generate a token by a softmax on .
Our model modifies the memory construction by refining its self-attention layer so that the memory represents more accurate and on-demand knowledge that helps generating the response.
3.3 Teacher Model
To ingest accurate memory information for response generation under the aforementioned base model, our teacher model builds a response-aware weight matrix given the context-aware document representation and the response , then refines the document memory with . Elements in ’s indicate the importance of tokens or token pairs in the document, with consideration of the response information.
First, we describe how to modify the memory matrix when is given. The original memory is constructed by a self-attention operation as Eq. 1. To facilitate response awareness, we update the attention weight matrix by element-wise multiplying , and then get the refined memory as
In the following, we describe two methods to construct the response-aware weight matrix : (1) We measure the response-aware token importance (RTI) considering the ground-truth response to construct . (2) We measure the response-aware pairwise importance (RPI) of each token pair , which can be directly assigned to the element in . For both methods, matrix elements can be either continuous or binary.
Response-Aware Token Importance (RTI)
We denote the response-aware token importance of document tokens as , and measure it by response and context-aware token representation . To obtain , we first apply an encoder to obtain the token-level representations of the response as and use its last hidden state as the sentence-level response representation. The response-aware token importance of token is defined as the similarity between its context-aware token representation and the response representation . Next, we adjust each attention distribution (i.e., each column of ) with each of its attention weight multiplied by the token importance . Therefore, the resulting can be obtained as:
represents an identity vector with all elements as. By plugging the above in Eq. 5, we can construct a memory matrix with plagiarized signals from the response. In this way, the self-attention distributions can adjust to emphasize important tokens, and their corresponding context-aware document token representations become more important in the memory matrix.
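To make the RTI reweighting concrete, here is a small sketch under stated assumptions: the similarity between a token representation and the response representation is taken to be a sigmoid of their dot product (the paper's similarity function is not reproduced here), and no renormalization is applied after the element-wise product. The names `token_importance`, `reweight_attention`, and `beta` are illustrative only.

```python
import math

def token_importance(doc_reprs, response_repr):
    """Assumed similarity: sigmoid of the dot product, giving values in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-sum(d * r for d, r in zip(tok, response_repr))))
            for tok in doc_reprs]

def reweight_attention(attention, importance):
    """Multiply attention weight (i, j) by the importance of attended token j.

    This equals an element-wise product with a weight matrix whose column j
    is filled with importance[j].
    """
    return [[w * importance[j] for j, w in enumerate(row)] for row in attention]

doc_reprs = [[2.0, 0.0], [0.0, 0.5], [-1.0, 0.0]]
response_repr = [1.0, 0.0]
attention = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]  # uniform, for illustration

beta = token_importance(doc_reprs, response_repr)
refined = reweight_attention(attention, beta)
```

In this toy input the first token is most similar to the response, so its column of the refined attention receives the largest weights.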
Recall that the document contains a large amount of noise information in CbR. Thus the attention distributions may become long-tailed due to the existence of many redundant document tokens. Hence, we can further construct a binary weighting vector based on the token importance. We keep the weight of each element as 1 with the probability given by its token importance calculated in Eq. 5. If the weight of a token turns to 0, this token is deactivated in calculating the attention distributions. However, the binary weight sampled from the Bernoulli distribution is not differentiable. To enable back-propagation through our model, we apply the Gumbel-Softmax jang2016categorical to approximate the Bernoulli distribution in the training phase, and sample the binary value from the Bernoulli distribution in the prediction phase as:
where is defined as:
The objective function of the teacher model is to maximize the log-likelihood of responses generated by the response-aware memory constructed with :
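The binary weighting described above relies on a Gumbel-Softmax relaxation during training and exact Bernoulli sampling at prediction time. The sketch below shows the standard two-class form of that relaxation in plain Python; the temperature value and the function names are assumptions for illustration, and a real implementation would use a differentiable tensor library rather than scalar floats.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    # Clamp away from 0 and 1 to keep both logarithms finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def relaxed_bernoulli(p, temperature, rng):
    """Two-class Gumbel-Softmax relaxation of Bernoulli(p); returns a value in [0, 1]."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    on = (math.log(p) + sample_gumbel(rng)) / temperature
    off = (math.log(1.0 - p) + sample_gumbel(rng)) / temperature
    m = max(on, off)  # subtract the max for numerical stability
    e_on, e_off = math.exp(on - m), math.exp(off - m)
    return e_on / (e_on + e_off)

def hard_bernoulli(p, rng):
    """Exact binary sample, as used at prediction time."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)
importance = [0.9, 0.5, 0.1]  # hypothetical token importances
soft = [relaxed_bernoulli(p, 0.5, rng) for p in importance]
hard = [hard_bernoulli(p, rng) for p in importance]
```

Lower temperatures push the relaxed samples closer to 0 or 1, at the cost of higher-variance gradients.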
Response-Aware Pairwise Importance (RPI)
Instead of using token importance, we can construct by the pairwise importance of token pairs. After obtaining the token representations from the response encoder similarly as in RTI, we can calculate the similarity of each towards all ’s, denoted as . Each element in can be associated with a weight defined as the inner-product between and . Thus, we can treat as the response-aware pairwise importance, and directly set each element in as :
Compared with response-aware token importance in which the designed has identical column values, response-aware pairwise importance allows different values of different index ’s in (but and have the same value since is symmetric). Thus, the space of is larger.
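The RPI construction can be illustrated with a short sketch. It assumes that the similarity between a document token and a response token is a plain dot product (the paper's similarity function may differ); the weight for a token pair is then the inner product of the two tokens' similarity profiles. The names `dot` and `pairwise_importance` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def pairwise_importance(doc_reprs, resp_reprs):
    """Response-aware weight for every pair of document tokens."""
    # Similarity profile of each document token against every response token.
    profiles = [[dot(d, r) for r in resp_reprs] for d in doc_reprs]
    # Entry (i, j) is the inner product of the profiles of tokens i and j,
    # so the resulting matrix is symmetric by construction.
    return [[dot(pi, pj) for pj in profiles] for pi in profiles]

doc_reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
resp_reprs = [[1.0, 0.0], [1.0, 1.0]]
weights = pairwise_importance(doc_reprs, resp_reprs)
```

Unlike the RTI case, the entries within a column of `weights` generally differ, which is what gives RPI its larger weight space.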
Note that the aforementioned binary processing with each can also be applied on each here and the resulting is binary. By using a binary in our model, the memory construction can be considered as passing through a Graph Attention Network (GAT) GAT2017, which also constructs a graph and updates its representations relying on the information from itself and neighbors on the graph. However, our neighborhood matrix (i.e. in our model) is not pre-defined as in GAT but dependent on the inputs ’s and ’s, which involve parameters to be estimated.
3.4 Student Model
The student model learns to construct a response-anticipated weight matrix to estimate the weight matrix in the teacher model without access to the ground-truth . If we employ RTI, the estimated target of the student model is in Eq. 5. For RPI, the estimated target is in Eq. 9.
as inputs, we apply a bilinear attention layer to obtain a hidden representation matrix; we combine two attention outputs by to estimate in the RPI:
The objective function of the student model is to maximize the log-likelihood of generating responses based on the estimated or , and diminish the gap of the weighting vector or matrix between the student model and the teacher model by a mean square loss. Taking the RTI strategy as an example, we optimize the following objective:
where denotes the operation in Eq. 11 and its preorder operations. consists of the layer parameters in . balances the two loss terms. For RPI, we replace to optimize with and .
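The two-term student objective can be sketched as a loss to minimize: the negative log-likelihood of the response plus a weighted mean square error between the student's and the teacher's weights. The function name, argument names, and the use of per-token log-probabilities are illustrative assumptions, not the paper's notation.

```python
import math

def student_loss(token_log_probs, student_weights, teacher_weights, balance=1.0):
    """Negative log-likelihood plus a balanced mean square error term.

    token_log_probs: log-probability of each ground-truth response token.
    student_weights / teacher_weights: flattened weight vectors (or matrices).
    balance: hypothetical name for the coefficient balancing the two terms.
    """
    nll = -sum(token_log_probs)
    mse = sum((s - t) ** 2 for s, t in zip(student_weights, teacher_weights))
    mse /= len(teacher_weights)
    return nll + balance * mse

loss = student_loss(
    token_log_probs=[math.log(0.5), math.log(0.25)],
    student_weights=[0.5, 0.5],
    teacher_weights=[1.0, 0.0],
)
```

For the RPI variant the same function applies if the weight matrices are flattened into lists before being passed in.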
3.5 Model Training
We first train the teacher model until it converges, and then train the student model with the use of or from the converged teacher model. Next, we repeat the above processes iteratively. In the training of the teacher model, we fix parameters in (except parameters shared with ) and train the model subject to ; for the student model, we fix and (except parameters shared with ) and train the model subject to . For inference, only the student model will be used to infer the response-anticipated weight matrix and the decoder applies it for generating the output response.
As stated in the RPI description, RPI has a higher model capacity: using a full weight matrix instead of the token importance vector of RTI enlarges the space of possible weights. In terms of optimization, however, RPI requires estimating more parameters, which makes training more difficult.
4 Experiment Setting
We use the dataset for the CbR task released by CMRACL2019. The dataset contains crawled articles and discussions about these articles from Reddit. The articles act as the documents, while the discussions serve as conversational contexts and responses. In total, we have 2.3M/13k/1.5k samples for training/testing/validation.
4.2 Implementation Details
For all methods, we set the word embedding dimension to 300 with pre-trained GloVe embeddings pennington2014glove. Following CMRACL2019, our vocabulary contains the 30k most frequent tokens. We use bi-LSTMs with a hidden dimension of 512 and a dropout rate of 0.4 in our encoders. We optimize models with Adam, using an initial learning rate of 0.0005 and a batch size of 32. All conversation contexts/responses/documents are truncated to a maximum length of 30/30/500. For training, we set the weight that balances the two terms in the student loss to 1 after tuning. For inference, we apply top-k random sampling decoding edunov2018understanding with k=20. The validation set is used for early stopping. These implementation details can be found in our code: https://github.com/tianzhiliang/RAM4CbR.
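As an illustration of the top-k random sampling used at inference, the following sketch samples the next token from the k most probable vocabulary entries, in proportion to their probabilities. It is a generic rendering of the technique rather than the released implementation; `top_k_sample` and the toy distribution are hypothetical.

```python
import random

def top_k_sample(probs, k, rng):
    """Sample an index from the k most probable entries, renormalized."""
    # Indices of the k largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
probs = [0.10, 0.50, 0.05, 0.35]  # toy next-token distribution
draws = [top_k_sample(probs, 2, rng) for _ in range(200)]
```

With k=1 the procedure reduces to greedy decoding; larger k trades some likelihood for more diverse outputs.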
4.3 Competing Methods
Seq2Seq seq2seq. The standard Seq2Seq model that leverages only the conversational context for response generation.
MemNet ghazvininejad2018knowledge. A knowledge-grounded conversation model that uses a memory network to store knowledge facts.
GLKS ren2019thinking. It applies a global knowledge selector in encoding and a local selector on every decoding step.
Conversation with Machine Reading (CMR) CMRACL2019. The state-of-the-art model on the CbR task, which is also our base model (Sec 3.2). Here, we use the full model of CMR (called CMR+w in CMRACL2019), since the full model outperforms other CMR’s variants on most metrics. We further apply the copy mechanism see2017get to this base model (CMR+Copy).
Four variants of our proposed models: RAM_T denotes our Response-Anticipated Memory-based model with RTI, and RAM_T+Copy denotes its copy version. RAM_P and RAM_P+Copy denote our model with RPI and its copy variant.
4.4 Evaluation Metrics
Following all metrics in CMRACL2019, we evaluate all methods by both automatic and human evaluations. For automatic evaluations, we evaluate the responses in three aspects:
Appropriateness. We use three metrics to evaluate the overall quality of a response: BLEU-4 papineni2002bleu, METEOR banerjee2005meteor, and NIST doddington2002automatic. NIST is a variant of BLEU that measures n-gram precision weighted by the informativeness of n-grams.
Grounding. We measure the relevance between documents and generated responses to reveal the effectiveness of responses exploiting the document information. We define as the number of non-stopword tokens in both the document and the generated response but not in contexts . We calculate the precision P and recall R as
where denotes the stopword list. F1 is the harmonic mean of precision P and recall R.
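The grounding scores can be sketched as below. The overlap definition follows the text (non-stopword tokens shared by the document and the generated response but absent from the context). The denominators are an assumption: precision is normalized by the non-stopword tokens of the response and recall by the non-stopword tokens of the document, and tokens are compared as sets. The function name and example sentences are hypothetical.

```python
def grounding_scores(document, context, response, stopwords):
    """Return (precision, recall, F1) of document grounding for one response."""
    doc = set(document) - stopwords
    resp = set(response) - stopwords
    # Tokens shared by document and response that the context did not supply.
    overlap = (doc & resp) - set(context)
    precision = len(overlap) / len(resp) if resp else 0.0
    recall = len(overlap) / len(doc) if doc else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

stopwords = {"the", "has", "is", "his", "are"}
document = "the actor has wealthy fans".split()
context = "the actor is ill".split()
response = "his fans are wealthy".split()
precision, recall, f1 = grounding_scores(document, context, response, stopwords)
```

Excluding context tokens keeps a response from being credited for merely echoing the conversation instead of using the document.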
We further propose to measure the effectiveness of exploiting the document information with respect to the ground truth. In this way, we evaluate how much ground-truth information models can exploit from the document. We define as the number of non-stopword tokens in the document , the generated response and the ground-truth but not in contexts . The precision and recall are as follows,
where is the harmonic mean of precision and recall .
Informativeness. Ent-n seq2BF measures responses’ informativeness with the entropy of the n-gram count distribution. Dist-n li2016diversitymmi evaluates the diversity of responses via the proportion of unique n-grams among all responses.
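Both informativeness metrics are computed over the n-grams of all generated responses. A minimal sketch follows, assuming whitespace-tokenized responses and natural-logarithm entropy (the log base is an assumption); the function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """Proportion of unique n-grams among all n-grams in the responses."""
    grams = [g for r in responses for g in ngrams(r, n)]
    return len(set(grams)) / len(grams) if grams else 0.0

def ent_n(responses, n):
    """Entropy of the empirical n-gram count distribution."""
    counts = Counter(g for r in responses for g in ngrams(r, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values())

responses = [["i", "like", "it"], ["i", "like", "tea"]]
dist_1, dist_2 = dist_n(responses, 1), dist_n(responses, 2)
ent_1 = ent_n(responses, 1)
```

Dist-n only counts distinct n-grams, whereas Ent-n also reflects how evenly the n-gram mass is spread, so the two can rank systems differently.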
For human evaluations, we hire five annotators from a commercial annotation company to evaluate 200 randomly selected test samples, and results from different models are shuffled. The annotators evaluate on a 5-point scale in three aspects: overall quality (H-Appr), relevance with documents (H-Ground), and informativeness (H-Info).
[Table residue: column headers "Top10 tokens" and "Top20 tokens"; table body not recovered.]
5 Experimental Results and Analysis
In this part, we first show the performance of all methods in Sec 5.1. Then, we validate the effectiveness of response anticipation on CbR in Sec 5.2 by comparing the top similar tokens with the response using their representations in the memory. We also compare more variants of our model in Sec 5.3, including the token importance versus pairwise importance, and each method with continuous weights versus their variants with binary weights. At last, we conduct a case study in Sec 5.4.
5.1 Overall Performance
Results of all models on automatic and human evaluations are shown in Table 1 and Table 3. MemNet outperforms Seq2Seq on most metrics, which validates that it is important to utilize document information in CbR. However, MemNet only slightly improves on Grounding. Both GLKS and CMR outperform MemNet on most metrics, indicating that it matters how to construct the document memory used in conversation models for CbR. Compared with CMR, CMR+Copy is more competitive on Grounding but weaker on other metrics.
Our proposed models outperform the competing methods on all metrics, in both automatic and human evaluations. For models without the copy mechanism, RAM_T performs the best. For models with copy, RAM_T+Copy and RAM_P+Copy outperform CMR+Copy on most metrics. Overall, our proposed strategy works well both with and without the copy mechanism. We compare RAM_T and RAM_P in detail in Sec 5.3.
5.2 Effectiveness of Response Anticipation
In this section, we investigate whether anticipating the response contributes to building a better document memory. We first calculate the semantic similarity between each document token and the response using their GloVe embeddings, and select top document tokens. Next, we accumulate the attention weights of each token in all attention distributions in the self-attention weights in Eq. 1, i.e. summation over each column of . Then we select the top tokens according to their accumulated attention weights. Here, we set . We apply metrics in liu2016not to calculate the similarity of the two token sets extracted above, including maximal tokens-tokens embedding similarity (Emb-M) and bag-of-word embedding similarity (Emb-B). A higher similarity score indicates that more response information is anticipated by the model. Table 4 shows the results of our two models RAM_T and RAM_P as well as CMR (for CMR, we use the original self-attention matrix for the above calculation). The results demonstrate that our model is able to output more response-anticipated self-attention distributions, which benefits generating a response close to the ground truth.
5.3 Analysis on Different Model Variants
Token importance vs Pairwise importance.
We compare our model variants with different strategies to construct the response-aware/anticipated weight matrix , i.e. RAM_T (Eq. 5) and RAM_P (Eq. 9). We not only compare their overall performance by the teacher-student framework (Eq. 8 & 3.4) but also the teacher model only (Eq. 3.4).
The first four rows in Table 2 show the results. Interestingly, RAM_P underperforms RAM_T in the full teacher-student framework, but outperforms RAM_T on most metrics when only the teacher model is used. This result is consistent with our discussion in Sec 3.5 that RAM_P has a higher capacity to carry more information in , thus its teacher model yields better performance. However, for the student model, RAM_P has more difficulty converging to a good local optimum because more parameters must be estimated, so its overall performance may not exceed that of RAM_T.
Continuous weight vs Binary weight.
We also compare the model variants with continuous weight (Eq. 5) and binary weight (Eq. 6). The last two rows in Table 2 give the results of the variants of RAM_T and RAM_P with a binary . We can see that both RAM_T and RAM_P with a binary weight matrix perform better on Appropriateness, which means a sparse on the attention matrix can help select more concise information to construct the memory. Nevertheless, models with a continuous weight matrix can generate more informative responses owing to their ability to access a broader range of information from the document.
5.4 Case Study
Table 3 shows two test samples with generated responses of all models. For Case 1, Seq2Seq and MemNet cannot generate responses relevant to either the document or context. CMR catches the topic “sports”, while GLKS and CMR+Copy use “first person” and “first one” to reflect “only two” mentioned in the document. The response of RAM_T contains information related to both document (“num teams” and “premier league”) and context (“europa”). RAM_T+Copy is also highly relevant to the document and the context, and copies “player” from the document. For Case 2, the first four methods have little relation to the document or the context. CMR+Copy mentions “played”. Our models mention “played” and “num years”. In these cases, our method shows promising improvements over existing methods. However, generation on the CbR task is very challenging and there is still considerable room for improvement.
Focusing on the CbR task, we propose a novel response-anticipated document memory to exploit and memorize the document information that is important in response generation. We construct the response-anticipated memory by a teacher-student framework. The teacher accesses the response and learns a response-aware weight matrix; the student learns to estimate the weight matrix in the teacher model and construct the response-anticipated document memory. We verify our model on both automatic and human evaluations and experimental results show our model obtains the state-of-the-art performance on the CbR task.
Research on this paper was supported by Hong Kong Research Grants Council under grants 16202118 and 16212516 and Tencent AI Lab Rhino-Bird Focused Research Program (No. GF202035).