1 Introduction
Neural conversation models have achieved promising performance in response generation. However, it is widely observed that the generated responses lack sufficient content and information (li2016diversitymmi). One way to address this issue is to integrate various kinds of external information into conversation models. Examples of external information include document topics (xing2017topic), commonsense knowledge graphs (zhou2018commonsense), and domain-specific knowledge bases (yang2019hybrid). Conversing by reading (CbR) (CMRACL2019) is a recently proposed scenario in which external information can be ingested into conversations: conversations take place with reference to a document. The key problem in CbR is to learn how to integrate information from the external document into response generation on demand.

To exploit knowledge from documents for conversations, a conventional way is to extend the sequence-to-sequence (Seq2Seq) model (seq2seq) with Memory Networks (sukhbaatar2015end), which store knowledge representations accessible to the decoder (ghazvininejad2018knowledge; parthasarathi2018extending). dinan2018wizard propose to encode the dialogue context as well as a set of retrieved knowledge with a Transformer (vaswani2017attention) to construct the memory. However, these methods use only sentence-level representations of the documents in the memory, which cannot pinpoint accurate token-level document information.
To discover token-level document information, researchers borrow models from other generation tasks that are adept at extracting segments of sentences for given questions. moghe2018towards explore the pointer-generator network (see2017get) from abstractive summarization and the bidirectional attention flow model (seo2016bidirectional), a QA model, to predict a span of the document to be contained in the response. CMRACL2019 follow the stochastic answer network (SAN) (SAN) from machine reading comprehension (MRC), integrating both context and document information to form a context-aware document memory. This approach obtains the state-of-the-art performance on the CbR task.
However, we should notice the difference between these existing generation tasks and CbR. Summarization, QA, and MRC require models to extract exact answers from documents, and the documents cover all requisite knowledge. In contrast, CbR expects a general utterance relevant to both the context and the document. In the example in Fig. 1, the document refers to actor, films, fans, and wealthy, while the context mentions disease. The document and context discuss the same person but have no topical overlap; thus we cannot pinpoint document information from the context alone. If we use SAN as in CMRACL2019, it can hardly acquire helpful information from the context-document interaction. To ingest useful knowledge for response generation, we argue that processing the document should consider not only the interaction between context and document but also the target response. In the example, the document should attend more to fans and wealthy when the response is taken into account.
In this work, we propose a method to construct a response-anticipated memory that captures the document information most important to generating the response. In particular, we build a teacher-student framework on top of CMRACL2019. The teacher model accesses the ground-truth response, context, and document, and learns to construct a weight matrix that encodes the importance of document tokens to the response. The student model learns to mimic the weight matrix constructed by the teacher without access to the response. That is, the teacher learns to build a response-aware memory, while the student learns to build a response-anticipated memory. At inference time, only the student is applied. Our experiments show that our model outperforms all competing methods.
2 Related Work
Most neural conversation models in open-domain chitchat scenarios are based on the Seq2Seq model (seq2seq; NRM_Shang15). A critical issue with these models is the safe response problem, i.e., generated responses often lack sufficient content and information. To address this issue, previous work encourages response diversity and informativeness by introducing new training objectives (li2016deep; zhao2017CVAE), refining beam search strategies (li2016diversitymmi; vijayakumar2016diverse; song2017diversifying), exploiting information from conversational contexts (aaai16_hred; serban2017hierarchical; tian2017acl), or incorporating retrieval-based conversation systems (song2018ensemble; wu2019response; tian2019ACL).
Some researchers augment the information available for response generation with external resources. zhou2018commonsense utilize the commonsense knowledge graph through a specially designed graph attention. agarwal2018knowledge propose a knowledge encoder to encode query-entity pairs from the knowledge base. wu2019proactive enrich response generation with knowledge triplets. These works all use knowledge in structured formats.
External unstructured text information has also been investigated to improve conversation models. Some researchers directly build a "document memory" from distributed representations of the knowledge sentences (ghazvininejad2018knowledge; parthasarathi2018extending). dinan2018wizard use the Transformer (vaswani2017attention) to encode the knowledge sentences as well as the dialogue context. ren2019thinking design a knowledge selector to construct the document memory from selected knowledge. As stated in the introduction, other researchers borrow models from other generation tasks, including abstractive summarization models (moghe2018towards), QA models (moghe2018towards), and MRC models (meng2019refnet; CMRACL2019). In particular, CMRACL2019 achieve the state-of-the-art performance. However, all of these methods construct the document memory relying on connections between context and document, without considering the response. If the context or document contains many tokens irrelevant to the response, which is indeed the case in CbR, the constructed memory may be misled by this noisy information (as in Fig. 1). Therefore, we propose to take the response into account in memory construction, which benefits generating a more desirable response.

3 Methodology
In this section, we first give an overall description of the proposed teacher-student architecture for CbR and then briefly describe the base model. The teacher model and student model are detailed in Secs. 3.3 and 3.4, respectively. Lastly, we summarize the training procedure of the two models in Sec. 3.5.
3.1 Model Architecture
The CbR task provides a conversation context $c$ and a document $d$ as inputs, requiring the model to generate a response $r$ to $c$ by referring to $d$. In the rest of the paper, we use $l_c$, $l_d$, and $l_r$ to denote the number of tokens in $c$, $d$, and $r$, respectively. To pinpoint accurate document information for response generation, we design a teacher-student framework to construct the document memory as follows:


The teacher model learns a response-aware document memory used in our base conversation model. Specifically, we construct a response-aware weight matrix $\Gamma$, which captures the correlation between context-aware document representations and response representations, and then impose $\Gamma$ on the memory matrix $M$. The teacher model is optimized to reconstruct the response with the use of the response-aware memory $M'$.

The student model learns to construct a response-anticipated weight matrix $\Gamma'$ to estimate the $\Gamma$ used in the teacher model, but without access to the response. It is a feed-forward neural network with the document and context as its input.
The teacher model and the student model are jointly optimized with training data, while only the student model is applied to testing data.
3.2 Base Model
Following CMRACL2019, we use SAN (SAN) as our base model, which mainly consists of three components:


Input encoder: We use two bidirectional LSTM encoders to extract token-level representations of the document, $H^d = \{h^d_1, \dots, h^d_{l_d}\}$, and the context, $H^c = \{h^c_1, \dots, h^c_{l_c}\}$.

Memory construction: We build the document memory $M \in \mathbb{R}^{l_d \times e}$ ($e$ is the hidden size of the memory), which will be used in the decoder. A cross-attention layer is first applied to the outputs of the two encoders to integrate information from the context into the document, yielding a set of context-aware document representations $\hat{H} = \{\hat{h}_1, \dots, \hat{h}_{l_d}\}$. Since each $\hat{h}_i$ corresponds to a document token, we treat it as the context-aware representation of the $i$-th token. Next, a self-attention layer is employed to ingest salient information from the context-aware document representations (see the sketch at the end of this subsection):

$$M = \mathrm{softmax}(A)\,\hat{H}, \qquad A = \hat{H}\hat{H}^\top, \tag{1}$$

where the softmax conducts the normalization over each row of the matrix.

Output decoder: We use an attentional recurrent decoder to generate response tokens by attending to the memory $M$. The initial hidden state is set to the summation of the token-level context representations. For each decoding step $t$, we obtain a hidden state $s_t$:

$$o_t = \mathrm{LSTM}\big(o_{t-1}, [e_t; s_{t-1}]\big), \tag{2}$$
$$s_t = \mathrm{CrossAttn}(o_t, M), \tag{3}$$

where $[\cdot\,;\cdot]$ indicates concatenation, the cross-attention layer integrates information from the memory into the recurrent outputs, and $e_t$ is the word embedding at step $t$. Finally, we generate a token by a softmax on $s_t$.
Our model modifies the memory construction by refining its self-attention layer, so that the memory represents more accurate and on-demand knowledge that helps generate the response.
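To make the base memory construction concrete, below is a minimal PyTorch sketch of Eq. 1 with single-head dot-product attention. The function name and the residual fusion used in the cross-attention step are our assumptions, not parts of the released model.

```python
import torch
import torch.nn.functional as F

def build_memory(h_doc, h_ctx):
    """Sketch of the base memory construction (Eq. 1).

    h_doc: (l_d, e) token-level document representations
    h_ctx: (l_c, e) token-level context representations
    Returns the context-aware representations H_hat and the memory M.
    """
    # Cross-attention: integrate context information into the document.
    cross = F.softmax(h_doc @ h_ctx.T, dim=-1)   # (l_d, l_c)
    h_hat = h_doc + cross @ h_ctx                # residual fusion (assumption)

    # Self-attention over the context-aware document representations;
    # the softmax normalizes each row of A, as stated after Eq. 1.
    A = h_hat @ h_hat.T                          # (l_d, l_d)
    M = F.softmax(A, dim=-1) @ h_hat             # (l_d, e)
    return h_hat, M
```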
3.3 Teacher Model
To ingest accurate memory information for response generation under the aforementioned base model, our teacher model builds a response-aware weight matrix $\Gamma \in \mathbb{R}^{l_d \times l_d}$ given the context-aware document representations $\hat{H}$ and the response $r$, and then refines the document memory with $\Gamma$. The elements of $\Gamma$ indicate the importance of tokens or token pairs in the document, taking the response information into consideration.
First, we describe how to modify the memory matrix when $\Gamma$ is given. The original memory is constructed by the self-attention operation in Eq. 1. To facilitate response awareness, we update the attention weight matrix $A$ by element-wise multiplication with $\Gamma$, and then obtain the refined memory as

$$M' = \mathrm{softmax}(A \odot \Gamma)\,\hat{H}. \tag{4}$$
In the following, we describe two methods to construct the response-aware weight matrix $\Gamma$: (1) we measure the response-aware token importance (RTI) of each document token with respect to the ground-truth response; (2) we measure the response-aware pairwise importance (RPI) of each token pair $(i, j)$, which can be directly assigned to the element $\Gamma_{ij}$. For both methods, the matrix elements can be either continuous or binary.
Response-Aware Token Importance (RTI)
We denote the response-aware token importance of the document tokens as $\gamma = (\gamma_1, \dots, \gamma_{l_d})$, measured from the response and the context-aware token representations. To obtain $\gamma$, we first apply an encoder to obtain the token-level representations of the response, $H^r = \{h^r_1, \dots, h^r_{l_r}\}$, and use its last hidden state $h^r$ as the sentence-level response representation. The response-aware token importance $\gamma_i$ of token $i$ is defined as the similarity between its context-aware representation $\hat{h}_i$ and the response representation $h^r$, normalized into $(0, 1)$ so that it can later serve as a probability (Eq. 6).
Next, we adjust each attention distribution (i.e., each row of $\mathrm{softmax}(A)$) by multiplying the attention weight on token $i$ by its importance $\gamma_i$. The resulting $\Gamma$ can therefore be obtained as

$$\Gamma = \mathbf{1}\,\gamma^\top, \tag{5}$$

where $\mathbf{1}$ represents a vector with all elements equal to 1. By plugging the above $\Gamma$ into Eq. 4, we construct a memory matrix with signals borrowed from the response. In this way, the self-attention distributions are adjusted to emphasize important tokens, and the corresponding context-aware document token representations become more prominent in the memory matrix.

Recall that the document contains a large amount of noisy information in CbR, so the attention distributions may become long-tailed due to the many redundant document tokens. Hence, we can further construct a binary weighting vector $\bar{\gamma}$ based on $\gamma$: we keep the weight of each element as 1 with the probability $\gamma_i$ from Eq. 5. If the weight of a token turns to 0, the token is deactivated in calculating the attention distributions. However, a binary weight sampled from the Bernoulli distribution is not differentiable. To enable backpropagation, we apply the Gumbel-Softmax (jang2016categorical) to approximate the Bernoulli distribution in the training phase, and sample the binary value from the Bernoulli distribution in the prediction phase:

$$\bar{\gamma}_i = \begin{cases} \mathrm{GS}(\gamma_i), & \text{training}, \\ z, \ z \sim \mathrm{Bernoulli}(\gamma_i), & \text{prediction}, \end{cases} \tag{6}$$

where $\mathrm{GS}(\cdot)$ is defined as

$$\mathrm{GS}(\gamma_i) = \frac{\exp\big((\log \gamma_i + g_1)/\tau\big)}{\exp\big((\log \gamma_i + g_1)/\tau\big) + \exp\big((\log(1 - \gamma_i) + g_0)/\tau\big)}, \tag{7}$$

with $g_0, g_1$ drawn i.i.d. from $\mathrm{Gumbel}(0, 1)$ and $\tau$ a temperature hyperparameter.
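The binary relaxation of Eqs. 6-7 can be written compactly as below; this is a standard binary Gumbel-Softmax (Concrete) sketch, with the small epsilon terms added for numerical stability as our own choice.

```python
import torch

def binary_weight(gamma, tau=1.0, training=True):
    """Binary weighting of Eqs. 6-7. gamma: tensor of probabilities in (0, 1)."""
    if not training:
        # Prediction phase: hard Bernoulli sample.
        return torch.bernoulli(gamma)
    # Training phase: differentiable Gumbel-Softmax approximation.
    eps = 1e-10
    g1 = -torch.log(-torch.log(torch.rand_like(gamma) + eps) + eps)
    g0 = -torch.log(-torch.log(torch.rand_like(gamma) + eps) + eps)
    keep = (torch.log(gamma + eps) + g1) / tau
    drop = (torch.log(1.0 - gamma + eps) + g0) / tau
    # Two-way softmax over {keep, drop} reduces to a sigmoid.
    return torch.sigmoid(keep - drop)
```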
The objective of the teacher model is to maximize the log-likelihood of the responses generated with the response-aware memory constructed from $\Gamma$:

$$\mathcal{L}_{tea}(\Theta_t, \Theta_g) = \log p\big(r \mid c, d, \Gamma\big), \qquad \Gamma = f_t(\hat{H}, r; \Theta_t), \tag{8}$$

where $f_t$ denotes the operations in Eq. 5 and its pre-order operations, $\Theta_t$ consists of all parameters in the layers of $f_t$, and $\Theta_g$ denotes the parameters in Eqs. 1 to 3. Both $\Theta_t$ and $\Theta_g$ are learnable parameters.
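Putting Eqs. 4-5 together, a minimal RTI sketch looks as follows; the sigmoid used to normalize the similarity into $(0,1)$ is our assumption.

```python
import torch
import torch.nn.functional as F

def rti_refined_memory(h_hat, h_resp):
    """Response-aware token importance and refined memory (Eqs. 4-5).

    h_hat:  (l_d, e) context-aware document token representations
    h_resp: (e,)     sentence-level response representation
    """
    # Token importance: similarity to the response, squashed into (0, 1).
    gamma = torch.sigmoid(h_hat @ h_resp)                 # (l_d,)
    # Gamma = 1 * gamma^T: every attention weight pointing to token j is
    # rescaled by gamma_j, so each column of Gamma holds identical values.
    Gamma = gamma.unsqueeze(0).expand(h_hat.size(0), -1)  # (l_d, l_d)
    A = h_hat @ h_hat.T
    M_refined = F.softmax(A * Gamma, dim=-1) @ h_hat      # Eq. 4
    return gamma, M_refined
```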
Response-Aware Pairwise Importance (RPI)
Instead of using token importance, we can construct $\Gamma$ from the pairwise importance of document token pairs. After obtaining the token-level response representations $H^r$ from the response encoder, as in RTI, we calculate the similarities of each $\hat{h}_i$ towards all the $h^r_k$'s, collected in a vector $u_i \in \mathbb{R}^{l_r}$ with $u_{ik} = \hat{h}_i^\top h^r_k$. Each element of $\Gamma$ can then be associated with a weight $\gamma_{ij}$ defined as the inner product between $u_i$ and $u_j$. Thus, we treat $\gamma_{ij}$ as the response-aware pairwise importance and directly set each element of $\Gamma$ as

$$\Gamma_{ij} = \gamma_{ij} = u_i^\top u_j. \tag{9}$$
Compared with response-aware token importance, in which the designed $\Gamma$ has identical values within each column, response-aware pairwise importance allows $\Gamma_{ij}$ to vary with both indices $i$ and $j$ (although $\Gamma_{ij} = \Gamma_{ji}$, since $\Gamma$ is symmetric). Thus, the space of $\Gamma$ is larger.
Notice that the aforementioned binarization of each $\gamma_i$ can also be applied to each $\gamma_{ij}$ here, yielding a binary $\Gamma$. With a binary $\Gamma$, the memory construction can be viewed as passing through a Graph Attention Network (GAT) (GAT2017), which also constructs a graph and updates its representations using information from each node and its neighbors on the graph. However, our neighborhood matrix (i.e., $\Gamma$) is not predefined as in GAT but depends on the inputs $\hat{h}_i$'s and $h^r_k$'s, which involve parameters to be estimated.
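A corresponding RPI sketch, under the same assumptions as the RTI sketch above:

```python
import torch
import torch.nn.functional as F

def rpi_refined_memory(h_hat, h_resp_tokens):
    """Response-aware pairwise importance and refined memory (Eqs. 4 and 9).

    h_hat:         (l_d, e) context-aware document token representations
    h_resp_tokens: (l_r, e) token-level response representations
    """
    # u_i: similarity profile of document token i over all response tokens.
    U = h_hat @ h_resp_tokens.T          # (l_d, l_r)
    # Pairwise importance: inner products of similarity profiles.
    Gamma = U @ U.T                      # (l_d, l_d), symmetric
    A = h_hat @ h_hat.T
    return F.softmax(A * Gamma, dim=-1) @ h_hat   # Eq. 4
```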
3.4 Student Model
The student model learns to construct a response-anticipated weight matrix to estimate the weight matrix of the teacher model without access to the ground-truth response. With RTI, the estimation target of the student model is $\gamma$ in Eq. 5; with RPI, the target is $\Gamma$ in Eq. 9.
Given $\hat{H}$ and $H^c$ as inputs, we apply a bilinear attention layer to obtain a hidden representation matrix $B$. We then apply a two-layer multi-layer perceptron (MLP) with ReLU activation to estimate $\gamma'$ for RTI, and combine two attention outputs by an inner product to estimate $\Gamma'$ for RPI:

$$B = \mathrm{softmax}\big(\hat{H} W_b (H^c)^\top\big)\, H^c, \tag{10}$$
$$\gamma' = \sigma\big(\mathrm{MLP}(B)\big), \qquad \Gamma' = \mathrm{MLP}(B)\,\mathrm{MLP}(B)^\top. \tag{11}$$
The objective of the student model is to maximize the log-likelihood of generating responses based on the estimated $\gamma'$ or $\Gamma'$, while diminishing the gap between the weighting vector or matrix of the student and that of the teacher via a mean-square loss. Taking the RTI strategy as an example, we optimize the following objective:

$$\mathcal{L}_{stu}(\Theta_s, \Theta_g) = \log p\big(r \mid c, d, \gamma'\big) - \lambda\, \|\gamma' - \gamma\|_2^2, \qquad \gamma' = f_s(\hat{H}, H^c; \Theta_s), \tag{12}$$

where $f_s$ denotes the operation in Eq. 11 and its pre-order operations, $\Theta_s$ consists of the layer parameters of $f_s$, and $\lambda$ balances the two loss terms. For RPI, we optimize the same objective with $\gamma'$ and $\gamma$ replaced by $\Gamma'$ and $\Gamma$.
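A sketch of the student estimator in Eqs. 10-11; the layer sizes, the parameter initialization, and the exact reduction of the MLP output to a scalar per token are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentWeightEstimator(nn.Module):
    """Estimates gamma' (RTI) and Gamma' (RPI) from document and context."""

    def __init__(self, e, hidden=256):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(e, e) * 0.02)   # bilinear attention
        self.mlp = nn.Sequential(nn.Linear(e, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.out = nn.Linear(hidden, 1)

    def forward(self, h_hat, h_ctx):
        # Bilinear attention between document and context tokens (Eq. 10).
        attn = F.softmax(h_hat @ self.W_b @ h_ctx.T, dim=-1)  # (l_d, l_c)
        B = self.mlp(attn @ h_ctx)                            # (l_d, hidden)
        gamma_hat = torch.sigmoid(self.out(B)).squeeze(-1)    # RTI estimate
        Gamma_hat = B @ B.T                                   # RPI estimate (Eq. 11)
        return gamma_hat, Gamma_hat
```

During training, the mean-square term of Eq. 12 compares `gamma_hat` (or `Gamma_hat`) with the converged teacher's weights, e.g., `F.mse_loss(gamma_hat, gamma_teacher.detach())`.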
Table 1: Automatic evaluation results. Appropriateness: NIST, BLEU, METEOR; Grounding: P/R/F1 w.r.t. the document (Eqs. 13-14) and P'/R'/F1' w.r.t. the document and the ground truth (Eqs. 15-16); Informativeness: Ent-4, Dist-1/2, and average response length (Len).

| Model | NIST | BLEU | METEOR | P | R | F1 | P' | R' | F1' | Ent-4 | Dist-1 | Dist-2 | Len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 2.650 | 3.13% | 8.31% | 2.89% | 0.45% | 0.78% | 0.44% | 0.09% | 0.14% | 10.445 | 0.167 | 0.670 | 18.8 |
| Seq2Seq | 2.223 | 1.09% | 7.34% | 1.20% | 0.05% | 0.10% | 0.89% | 0.05% | 0.09% | 9.745 | 0.023 | 0.174 | 15.9 |
| MemNet | 2.185 | 1.10% | 7.31% | 1.25% | 0.06% | 0.12% | 0.91% | 0.05% | 0.10% | 9.821 | 0.035 | 0.226 | 15.5 |
| GLKS | 2.413 | 1.34% | 7.61% | 2.47% | 0.13% | 0.24% | 0.84% | 0.05% | 0.10% | 9.715 | 0.034 | 0.213 | 15.3 |
| CMR | 2.238 | 1.38% | 7.46% | 3.39% | 0.20% | 0.38% | 0.91% | 0.05% | 0.10% | 9.887 | 0.052 | 0.283 | 15.2 |
| CMR+Copy | 2.155 | 1.41% | 7.39% | 5.37% | 0.28% | 0.54% | 0.92% | 0.06% | 0.11% | 9.798 | 0.044 | 0.266 | 14.4 |
| RAM_T | 2.510 | 1.43% | 7.74% | 4.46% | 0.26% | 0.49% | 1.04% | 0.08% | 0.15% | 9.900 | 0.053 | 0.290 | 15.1 |
| RAM_P | 2.353 | 1.40% | 7.59% | 3.89% | 0.21% | 0.41% | 0.97% | 0.07% | 0.13% | 9.891 | 0.049 | 0.279 | 14.9 |
| RAM_T+Copy | 2.467 | 1.41% | 7.64% | 6.14% | 0.32% | 0.61% | 0.65% | 0.04% | 0.08% | 9.813 | 0.045 | 0.265 | 14.9 |
| RAM_P+Copy | 2.342 | 1.41% | 7.51% | 5.83% | 0.30% | 0.57% | 0.84% | 0.06% | 0.10% | 9.798 | 0.045 | 0.267 | 14.6 |
3.5 Model Training
We first train the teacher model until it converges, and then train the student model using the $\gamma$ (or $\Gamma$) from the converged teacher model; we repeat these two steps iteratively. When training the teacher model, we fix the parameters in $\Theta_s$ (except those shared with $\Theta_g$) and train with respect to $\Theta_t$ and $\Theta_g$; when training the student model, we fix $\Theta_t$ (except the parameters shared with $\Theta_g$) and train with respect to $\Theta_s$ and $\Theta_g$. For inference, only the student model is used: it infers the response-anticipated weight matrix, and the decoder applies it to generate the output response.
As stated above, RPI has a larger model capacity, since the full weight matrix $\Gamma$ spans a larger space than the token importance vector $\gamma$ in RTI. In terms of optimization, however, RPI requires estimating more parameters, which makes training more difficult.
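The alternating schedule can be summarized by the sketch below. The `loss` and `weights` interfaces, the optimizer choice for each phase, and the number of alternation rounds are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def alternating_training(teacher, student, batches, rounds=3, lam=1.0):
    """Alternating teacher/student training (Sec. 3.5), a high-level sketch."""
    opt_t = torch.optim.Adam(teacher.parameters(), lr=5e-4)
    opt_s = torch.optim.Adam(student.parameters(), lr=5e-4)
    for _ in range(rounds):
        for c, d, r in batches:                  # teacher phase (Eq. 8)
            loss = -teacher.loss(c, d, r)        # negative log-likelihood
            opt_t.zero_grad(); loss.backward(); opt_t.step()
        for c, d, r in batches:                  # student phase (Eq. 12)
            gamma_t = teacher.weights(c, d, r).detach()   # target weights
            gamma_s = student.weights(c, d)
            loss = -student.loss(c, d, r, gamma_s) \
                   + lam * F.mse_loss(gamma_s, gamma_t)
            opt_s.zero_grad(); loss.backward(); opt_s.step()
```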
4 Experiment Setting
4.1 Dataset
We use the dataset for the CbR task released by CMRACL2019. The dataset contains crawled articles and discussions about these articles from Reddit. The articles act as the documents, while the discussions serve as conversational contexts and responses. In total, we have 2.3M/13k/1.5k samples for training/testing/validation.
4.2 Implementation Details
For all methods, we set the word embedding dimension to 300, initialized with pre-trained GloVe embeddings (pennington2014glove). Following CMRACL2019, our vocabulary contains the top 30k frequent tokens. We use bi-LSTMs with a hidden dimension of 512 and a dropout rate of 0.4 in our encoders. We optimize the models by Adam with an initial learning rate of 0.0005 and a batch size of 32. All conversation contexts/responses/documents are truncated to a maximum length of 30/30/500 tokens. For training, we set $\lambda$ to 1 in the student loss after tuning. For inference, we apply top-$k$ random sampling decoding (edunov2018understanding) with $k = 20$. The validation set is used for early stopping. Further implementation details can be found in our code: https://github.com/tianzhiliang/RAM4CbR
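For reference, top-$k$ random sampling restricts the softmax to the $k$ most likely tokens at each step and then samples; a minimal sketch:

```python
import torch

def top_k_sample(logits, k=20):
    """Sample the next token id from the k most likely tokens."""
    top_vals, top_idx = torch.topk(logits, k)      # logits: (vocab,)
    probs = torch.softmax(top_vals, dim=-1)        # renormalize over top-k
    return top_idx[torch.multinomial(probs, 1)]
```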
4.3 Competing Methods


Seq2Seq (seq2seq): the standard Seq2Seq model, which leverages only the conversational context for response generation.

MemNet (ghazvininejad2018knowledge): a knowledge-grounded conversation model that uses a memory network to store knowledge facts.

GLKS (ren2019thinking): applies a global knowledge selector during encoding and a local selector at every decoding step.

Conversation with Machine Reading (CMR) (CMRACL2019): the state-of-the-art model on the CbR task and also our base model (Sec. 3.2). We use the full model of CMR (called CMR+w in CMRACL2019), since it outperforms the other CMR variants on most metrics. We further apply the copy mechanism (see2017get) to this base model (CMR+Copy).

Four variants of our proposed model: RAM_T denotes our Response-Anticipated Memory model with RTI, and RAM_T+Copy denotes its copy version; RAM_P and RAM_P+Copy denote our model with RPI and its copy variant.
Table 2: Automatic evaluation of our model variants: full teacher-student models (top two rows), teacher-only models (middle two rows), and binary-weight variants (bottom two rows). Columns are as in Table 1.

| Model | NIST | BLEU | METEOR | P | R | F1 | P' | R' | F1' | Ent-4 | Dist-1 | Dist-2 | Len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RAM_T | 2.510 | 1.43% | 7.74% | 4.46% | 0.26% | 0.49% | 1.04% | 0.08% | 0.15% | 9.900 | 0.053 | 0.290 | 15.1 |
| RAM_P | 2.353 | 1.40% | 7.59% | 3.89% | 0.21% | 0.41% | 0.97% | 0.07% | 0.13% | 9.891 | 0.049 | 0.279 | 14.9 |
| RAM_T (Teacher) | 2.539 | 1.43% | 7.85% | 4.47% | 0.26% | 0.49% | 1.05% | 0.08% | 0.15% | 9.904 | 0.053 | 0.290 | 15.1 |
| RAM_P (Teacher) | 2.551 | 1.47% | 7.88% | 4.56% | 0.27% | 0.50% | 0.99% | 0.08% | 0.16% | 9.900 | 0.053 | 0.287 | 15.1 |
| RAM_T_Binary | 2.560 | 1.63% | 7.91% | 3.75% | 0.21% | 0.40% | 0.87% | 0.07% | 0.12% | 9.890 | 0.052 | 0.283 | 15.1 |
| RAM_P_Binary | 2.403 | 1.51% | 7.63% | 3.55% | 0.18% | 0.38% | 0.85% | 0.07% | 0.12% | 9.887 | 0.046 | 0.274 | 14.6 |
4.4 Evaluation Metrics
Following CMRACL2019, we evaluate all methods with both automatic and human evaluations. The automatic evaluation covers three aspects:


Appropriateness. We use three metrics to evaluate the overall quality of a response: BLEU-4 (papineni2002bleu), METEOR (banerjee2005meteor), and NIST (doddington2002automatic). NIST is a variant of BLEU that measures n-gram precision weighted by the informativeness of the n-grams.

Grounding. We measure the relevance between documents and generated responses to reveal how effectively responses exploit the document information. Let $s$ denote the stopword list, and define $\kappa = (d \cap r) \setminus (c \cup s)$ as the set of non-stopword tokens appearing in both the document and the generated response but not in the context. We calculate the precision P and recall R as

$$P = \frac{|\kappa|}{|r \setminus (c \cup s)|}, \tag{13}$$
$$R = \frac{|\kappa|}{|d \setminus (c \cup s)|}, \tag{14}$$

and F1 is the harmonic mean of P and R.

We further measure the effectiveness of exploiting the document information with respect to the ground truth, i.e., how much ground-truth information a model exploits from the document. Let $\hat{r}$ denote the ground-truth response, and define $\kappa' = (d \cap r \cap \hat{r}) \setminus (c \cup s)$ as the set of non-stopword tokens shared by the document, the generated response, and the ground truth but not in the context. The corresponding precision and recall are

$$P' = \frac{|\kappa'|}{|r \setminus (c \cup s)|}, \tag{15}$$
$$R' = \frac{|\kappa'|}{|(d \cap \hat{r}) \setminus (c \cup s)|}, \tag{16}$$

and F1' is the harmonic mean of P' and R'. (A sketch of these metrics follows this list.)

Informativeness. Ent-n (seq2BF) measures the informativeness of responses via the entropy of the n-gram count distribution. Dist-n (li2016diversitymmi) evaluates the diversity of responses via the proportion of unique n-grams among all responses. (These metrics are also covered by the sketch below.)
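A sketch of the automatic metrics above, operating on token lists; the denominators of Eqs. 13-16 follow our reading of the definitions (response tokens for precision, document tokens for recall), and the function names are ours.

```python
from collections import Counter
import math

def grounding_prf(doc, resp, ctx, gold, stop):
    """Grounding precision/recall/F1 (Eqs. 13-16) on token sets."""
    d, r, g = set(doc) - stop, set(resp) - stop, set(gold) - stop
    c = set(ctx)
    kappa = (d & r) - c                                   # shared, non-context
    p = len(kappa) / max(len(r - c), 1)                   # Eq. 13
    rec = len(kappa) / max(len(d - c), 1)                 # Eq. 14
    f1 = 2 * p * rec / max(p + rec, 1e-10)
    kappa_gt = (d & r & g) - c                            # also in ground truth
    p_gt = len(kappa_gt) / max(len(r - c), 1)             # Eq. 15
    rec_gt = len(kappa_gt) / max(len((d & g) - c), 1)     # Eq. 16
    f1_gt = 2 * p_gt * rec_gt / max(p_gt + rec_gt, 1e-10)
    return (p, rec, f1), (p_gt, rec_gt, f1_gt)

def dist_n(responses, n):
    """Dist-n: proportion of unique n-grams among all generated responses."""
    grams = [tuple(r[i:i + n]) for r in responses
             for i in range(len(r) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

def ent_n(responses, n):
    """Ent-n: entropy of the n-gram count distribution."""
    counts = Counter(tuple(r[i:i + n]) for r in responses
                     for i in range(len(r) - n + 1))
    total = sum(counts.values())
    return -sum(v / total * math.log(v / total) for v in counts.values())
```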
For human evaluation, we hire five annotators from a commercial annotation company to evaluate 200 randomly selected test samples; results from different models are shuffled. The annotators rate on a 5-point scale in three aspects: overall quality (HAppr), relevance to documents (HGround), and informativeness (HInfo).
Table 3: Human evaluation results (5-point scale).

| Model | HAppr | HGround | HInfo |
|---|---|---|---|
| Human | 2.986 | 2.521 | 3.007 |
| Seq2Seq | 1.902 | 1.564 | 2.040 |
| MemNet | 1.872 | 1.574 | 2.105 |
| GLKS | 2.073 | 1.593 | 2.071 |
| CMR | 2.188 | 1.678 | 2.219 |
| CMR+Copy | 2.063 | 1.773 | 2.075 |
| RAM_T | 2.259 | 1.714 | 2.312 |
| RAM_P | 2.213 | 1.682 | 2.231 |
| RAM_T+Copy | 2.109 | 1.861 | 2.240 |
| RAM_P+Copy | 2.114 | 1.775 | 2.115 |
Table 4: Embedding similarity (EmbM/EmbB) between the top-k document tokens ranked by response similarity and those ranked by accumulated self-attention weight (Sec. 5.2).

| Model | Top-10 EmbM | Top-10 EmbB | Top-20 EmbM | Top-20 EmbB |
|---|---|---|---|---|
| CMR | 0.482 | 0.356 | 0.571 | 0.420 |
| RAM_T_Soft | 0.745 | 0.520 | 0.867 | 0.616 |
| RAM_P_Soft | 0.518 | 0.441 | 0.634 | 0.493 |
5 Experimental Results and Analysis
In this part, we first present the performance of all methods in Sec. 5.1. Then, in Sec. 5.2, we validate the effectiveness of response anticipation on CbR by comparing the document tokens most similar to the response with those emphasized in the memory. In Sec. 5.3, we compare further variants of our model, including token importance versus pairwise importance, and continuous versus binary weights. Finally, we conduct a case study in Sec. 5.4.
5.1 Overall Performance
Results of all models on automatic and human evaluations are shown in Tables 1 and 3. MemNet outperforms Seq2Seq on most metrics, which validates the importance of utilizing document information in CbR; however, MemNet improves only slightly on Grounding. Both GLKS and CMR outperform MemNet on most metrics, indicating that how the document memory is constructed matters for CbR. Compared with CMR, CMR+Copy is more competitive on Grounding but weaker on the other metrics.
Our proposed models outperform the competing methods on all metrics, in both automatic and human evaluations. Among models without the copy mechanism, RAM_T performs best. Among models with copy, RAM_T+Copy and RAM_P+Copy exceed CMR+Copy on most metrics. Overall, our proposed strategy works well both with and without the copy mechanism. We compare RAM_T and RAM_P in detail in Sec. 5.3.
5.2 Effectiveness of Response Anticipation
In this section, we investigate whether anticipating the response contributes to building a better document memory. We first calculate the semantic similarity between each document token and the response using their GloVe embeddings and select the top $k$ document tokens. Next, we accumulate the attention weight of each token over all attention distributions in the self-attention weights of Eq. 1 (i.e., we sum over each column of the normalized attention matrix), and select the top $k$ tokens by accumulated weight. Here, we set $k \in \{10, 20\}$. We apply the metrics of liu2016not to calculate the similarity between the two token sets: maximal token-to-token embedding similarity (EmbM) and bag-of-words embedding similarity (EmbB). A higher similarity score indicates that the model anticipates more response information. Table 4 shows the results of our two models, RAM_T and RAM_P, as well as CMR (for CMR, we use the original self-attention matrix in the above calculation). The results demonstrate that our model outputs more response-anticipated self-attention distributions, which benefits generating a response close to the ground truth.
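For clarity, the set-level similarity metrics can be sketched as follows; this is our reading of EmbM (greedy max token-to-token cosine, averaged in both directions) and EmbB (cosine between averaged bags of vectors) from liu2016not.

```python
import numpy as np

def emb_similarity(tokens_a, tokens_b, emb):
    """EmbM and EmbB between two token sets; emb maps token -> vector."""
    A = np.stack([emb[t] for t in tokens_a])
    B = np.stack([emb[t] for t in tokens_b])
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = A @ B.T                                      # pairwise cosines
    emb_m = (sim.max(axis=1).mean() + sim.max(axis=0).mean()) / 2
    a, b = A.mean(axis=0), B.mean(axis=0)              # bag-of-words vectors
    emb_b = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return emb_m, emb_b
```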
5.3 Analysis on Different Model Variants
Token importance vs. pairwise importance. We compare our model variants with different strategies for constructing the response-aware/anticipated weight matrix, i.e., RAM_T (Eq. 5) and RAM_P (Eq. 9). We compare not only their overall performance under the full teacher-student framework (Eqs. 8 and 12) but also that of the teacher model alone (Eq. 8).
The first four rows of Table 2 show the results. Interestingly, RAM_P underperforms RAM_T under the full teacher-student framework but outperforms RAM_T on most metrics in the teacher-only mode. This is consistent with our discussion in Sec. 3.5: RAM_P has a higher capacity to carry more information in $\Gamma$, so its teacher model yields better performance. However, the RAM_P student is more difficult to converge to a good local optimum because more parameters need to be estimated, so its overall performance does not necessarily exceed that of RAM_T.
Continuous weight vs. binary weight. We also compare the model variants with continuous weights (Eq. 5) and binary weights (Eq. 6). The last two rows of Table 2 give the results for the variants of RAM_T and RAM_P with a binary $\Gamma$. Both RAM_T and RAM_P with a binary weight matrix perform better on Appropriateness, which suggests that a sparse mask on the attention matrix helps select more concise information for constructing the memory. Nevertheless, the models with a continuous weight matrix generate more informative responses, owing to their ability to access broader information from the document.
5.4 Case Study
Table 5 shows two test samples with the responses generated by all models. For Case 1, Seq2Seq and MemNet cannot generate responses relevant to either the document or the context. CMR catches the topic "sports", while GLKS and CMR+Copy use "first person" and "first one" to reflect "only two" mentioned in the document. The response of RAM_T contains information related to both the document ("num teams" and "premier league") and the context ("europa"). RAM_T+Copy is also highly relevant to the document and the context, and copies "player" from the document. For Case 2, the first four methods bear little relation to the document or the context. CMR+Copy mentions "played"; our models mention "played" and "num years". Examining the cases, our method shows promising improvements over existing methods. Nevertheless, generation on the CbR task is very challenging, and there is still large room for improvement.
6 Conclusion
Focusing on the CbR task, we propose a novel response-anticipated document memory to exploit and memorize the document information that is important for response generation. We construct the response-anticipated memory with a teacher-student framework: the teacher accesses the response and learns a response-aware weight matrix; the student learns to estimate this weight matrix and to construct the response-anticipated document memory. We verify our model with both automatic and human evaluations, and the experimental results show that our model achieves the state-of-the-art performance on the CbR task.
7 Acknowledgments
Research on this paper was supported by Hong Kong Research Grants Council under grants 16202118 and 16212516 and Tencent AI Lab RhinoBird Focused Research Program (No. GF202035).