CROWN: Conversational Passage Ranking by Reasoning over Word Networks

11/07/2019
by   Magdalena Kaiser, et al.
Max Planck Society

Information needs around a topic cannot be satisfied in a single turn; users typically ask follow-up questions referring to the same theme, and a system must be capable of understanding the conversational context of a request to retrieve correct answers. In this paper, we present our submission to the TREC Conversational Assistance Track 2019, in which such a conversational setting is explored. We propose a simple unsupervised method for conversational passage ranking by formulating the passage score for a query as a combination of similarity and coherence. Specifically, passages are preferred that contain words semantically similar to the words used in the question, and in which such words appear close to each other. We build a word-proximity network (WPN) from a large corpus, where words are nodes and there is an edge between two nodes if they co-occur in the same passages in a statistically significant way, within a context window. Our approach, named CROWN, improved nDCG scores over a provided Indri baseline on the CAsT training data. On the evaluation data for CAsT, our best run submission achieved above-average performance with respect to AP@5 and nDCG@1000.


1 Introduction

Information needs are usually not one-off: a user who searches for information on a specific topic usually asks several questions in a row. Previous turns have an impact on later turns, and the system's answer affects subsequent user queries as well. As a result, questions are often not well-formed and self-contained, but incomplete, with ungrammatical phrases and references to previous mentions. A key challenge is therefore to understand the context that the user leaves implicit in the current utterance. However, today's systems are not capable of answering such questions, and there are no resources appropriate for training and evaluating models for conversational search. The Conversational Assistance Track (CAsT, http://www.treccast.ai/) was organized at TREC 2019 with the goal of creating a reusable benchmark for open-domain conversational search where answers are passages from large text corpora.

In this work, we describe our submissions to TREC CAsT 2019. We propose an unsupervised method called CROWN (Conversational Passage Ranking by Reasoning Over Word Networks), in which the passage score for a query is formulated as a combination of similarity and coherence. Similarity between query terms and words in a passage is measured in terms of the cosine similarity of their word embedding vectors. In order to estimate passage coherence, we built a word-proximity network (WPN) over a large corpus. At query time, the WPN is used to rank passages, preferring those containing words semantically similar to the ones appearing in the question and those containing query-relevant term pairs that have an edge in the network. Our CROWN method was able to outperform an Indri baseline on the provided training data and achieved above-average results with respect to AP@5 and nDCG@1000 on the TREC CAsT evaluation data.

2 Related Work

Conversations in search. Conversational questions are often posed to voice assistants; however, current commercial systems cannot handle conversations with incomplete context well. Conversational search is explored from a theoretical perspective in [14]. Context-aware search is also considered in [22, 3, 20]: information about previous queries from the same session and from click logs is used to better recognize the user's information need and thus to improve document ranking. Some works focus on query suggestion by using auto-completion logs in addition to click logs [11, 1]. Since conversational queries are often incomplete, a query reformulation approach is described in [17]: if a query depends on previous context, it is reformulated taking the information from previous turns into account, in order to obtain a full-fledged query that a standard search engine can handle. However, these works usually consider a ranked list of documents as the result for a conversational query, whereas passage-level retrieval, as performed in TREC CAsT, has not been explored yet.
Conversations in reading comprehension. In machine reading comprehension, answers to questions are text spans in provided paragraphs, as in the SQuAD benchmark [15]. Several benchmarks for conversational reading comprehension are also available, such as QBLink [8], CoQA [16], QuAC [5] and ShARC [18]. A conversational machine reading model is presented, for example, in [23]: decision rules are extracted from procedural text, and the model reasons over whether these rules are already entailed by the conversational history or whether the information must be requested from the user. In [13], the pre-trained language model BERT [7] is used to encode a paragraph together with each question and answer in the conversational context, and the model predicts an answer based on this paragraph representation. However, these works differ from conversational search, since candidate paragraphs or candidate documents are given upfront.
Conversations over knowledge graphs. Initial attempts at answering conversational questions are also being made in the area of question answering over knowledge graphs (KGs). In [19], the paradigm of sequential question answering over KGs is introduced and a large benchmark, called CSQA, is created for this task. An unsupervised approach, CONVEX, which uses a graph exploration algorithm, is presented in [6], along with another benchmark, named ConvQuestions. Furthermore, an end-to-end neural model that uses dialog memory management for inferring logical forms is presented in [9]. These works differ from ours in that a knowledge graph is searched for an answer, whereas in TREC CAsT large textual corpora are used as the source of answers. Questions over knowledge graphs are mainly objective and factoid, while questions over text corpora have a broader scope. Moreover, answers cannot always be found in KGs due to their incompleteness, whereas the required information can often be located readily in web or news corpora.

3 Task Description

This is the first year of the Conversational Assistance Track at TREC. In this track, conversational search is defined as a retrieval task that takes the conversational context into account. The goal of the task is to satisfy a user's information need, which is expressed through a sequence of conversational turns (usually ranging from seven to twelve turns in the provided data). Additionally, the topic of the conversation and a description of its content are given. This year, the conversational topics and turns are specified in advance. Here is an example of such a conversation, taken from the TREC CAsT training data (Figure 1):

Title: Flowering plants for cold climates

Description: You want to buy and take care of flowering plants for cold climates

  1. What flowering plants work for cold climates?

  2. How much cold can pansies tolerate?

  3. Does it have different varieties?

  4. Can it survive frost?

  5. How do I protect my plants from cold weather?

  6. How do plants adapt to cold temperature?

  7. What is the UK hardiness rating for plants?

  8. How does it compare to the US rating?

  9. What’s the rating for pansies?

  10. What about petunias?

Figure 1: Sample conversation from the TREC CAsT 2019 data.

As can be seen in the example, subsequent questions contain references to previously mentioned entities and concepts. References like “it” in Turn 3, which refers to “pansies”, or “What about …” in the last turn, which refers to the hardiness rating, cannot be resolved easily. The response from the retrieval system is a ranked list of passages. The passages are short texts (roughly 1-3 sentences each) and thus also suitable for voice interfaces or mobile screens. They are retrieved from a combination of three standard TREC collections: MS MARCO Passage Ranking, Wikipedia (TREC CAR), and news (Washington Post).

Thirty manually created example training topics are provided by the organizers, and relevance judgments on a three-point scale (2: very relevant, 1: relevant, 0: not relevant) are given for a limited subset of around 120 questions. The evaluation is performed over 50 different provided topics. Additionally, an Indri baseline using query likelihood is provided. For this baseline run, AllenNLP coreference resolution is performed on the query and stopwords are removed using the Indri stopword list.

4 Method

We now describe CROWN, our unsupervised method for conversational passage ranking. The passage score for a query, which we maximize, is defined as a combination of similarity and coherence. Intuitively, passages are preferred that contain words semantically similar to the words used in the question and in which such words appear close to each other. Table 1 gives an overview of the notation used to describe our method.

Notation Concept
t, T: conversational turn, current turn
q_t, w_t: query at turn t (without stopwords), weight for turn t
CQ_1, CQ_2, CQ_3: conversational query sets
W_1, W_2, W_3: sets of conversational query weights
IQ_1, IQ_2, IQ_3, IQ_union: Indri query sets
WI_1, WI_2, WI_3: sets of Indri query weights
G = (N, E): word proximity network with nodes N and edges E
nw(·), ew(·, ·): node weights, edge weights
P, P_i, p_ij: set of candidate passages, passage, token in passage
v(·): word embedding of a token
cossim(·, ·): cosine similarity between word embedding vectors
NPMI(·, ·): normalized point-wise mutual information between two passage tokens
edge(·, ·): returns true if there is an edge between two tokens in the graph
score_node, score_edge, score_indri: similarity score, coherence score, score using the Indri result
n_t, e_t: threshold for node weights, threshold for edge weights
win: context window size
α, β, γ: hyperparameters for final score calculation
Table 1: Notation for key concepts in CROWN.

4.1 Building the Word Proximity Network

Word proximity networks have been widely studied in previous literature, for example in [2], where links in a network are defined as significant co-occurrences between words in the same sentence. We chose the MS MARCO Passage Ranking collection as a representative corpus to build the word proximity network for CROWN. Formally, we build the graph G = (N, E), where the nodes N are all words appearing in the collection (excluding stopwords) and there is an edge in E between two nodes if they co-occur in the same passage, within a context window, in a statistically significant way. We use NPMI (normalized pointwise mutual information) as a measure of this word association significance, as defined below:

\[ \mathrm{NPMI}(x, y) = \frac{\mathrm{PMI}(x, y)}{-\log p(x, y)}, \qquad \text{where} \qquad \mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}, \]

p(x, y) is the joint probability distribution, and p(x), p(y) are the individual distributions over the random variables x and y.

While the nodes and edges of the network are static and query-agnostic, the node and edge weights depend on the user input: the NPMI value is used as the edge weight between nodes that are similar to conversational query tokens, whereas node weights measure the similarity between conversational query tokens and tokens in the network. The following sections explain the exact weight and score calculations in more detail.
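
To make the construction concrete, the following is a minimal sketch of building such a word-proximity network with NetworkX. The toy corpus, the rough probability estimates, the window handling, and the threshold value are illustrative assumptions, not the exact setup used over MS MARCO.

```python
import math
from collections import Counter

import networkx as nx

def build_wpn(passages, window=3, npmi_threshold=0.0):
    """Build a word-proximity network: nodes are non-stopword tokens, and an
    edge connects two tokens that co-occur within a context window with an
    NPMI value above the threshold. `passages` is a list of token lists."""
    word_counts = Counter()
    pair_counts = Counter()
    n_positions = 0

    for tokens in passages:
        word_counts.update(tokens)
        for i in range(len(tokens)):
            n_positions += 1
            # co-occurrences inside a sliding window of the given size
            for j in range(i + 1, min(i + window, len(tokens))):
                if tokens[i] != tokens[j]:
                    pair_counts[tuple(sorted((tokens[i], tokens[j])))] += 1

    n_tokens = sum(word_counts.values())
    graph = nx.Graph()
    graph.add_nodes_from(word_counts)

    for (w1, w2), c in pair_counts.items():
        # rough probability estimates; the paper does not prescribe the exact estimator
        p_xy = c / n_positions
        p_x, p_y = word_counts[w1] / n_tokens, word_counts[w2] / n_tokens
        npmi = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
        if npmi > npmi_threshold:  # keep only statistically significant co-occurrences
            graph.add_edge(w1, w2, npmi=npmi)
    return graph

# toy usage with three tokenized, stopword-free passages
passages = [["pansy", "survive", "frost", "cold"],
            ["hardy", "annual", "pansy", "cold"],
            ["frost", "cold", "climate", "plant"]]
wpn = build_wpn(passages)
print(sorted(wpn.edges(data="npmi")))
```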

Figure 2: Small sample word proximity network

Figure 2 shows a small sample word proximity network. Let us assume that we are at Turn 4 of the sample conversation in Figure 1, where “it” is correctly resolved to “pansies”: “Can pansies survive frost?”. For simplicity, further information from previous turns is not used here. One candidate passage (Passage 1) is the following: “Some of the types of flowers that are considered hardy annuals include the pansy.” The non-stopwords from Passage 1 and words from two further passages (Passage 2 and Passage 3) are displayed. Note that words can appear in multiple passages, like “cold” in the example. The colored values are the node weights, where the color indicates which of the query words is closest to the word in the passage. If the similarity is below a threshold, the node weight is set to 0, as for “annual” and “types” in the example. The values at the connection between two nodes are the edge weights. There is an edge between two nodes only if their NPMI value is above a certain threshold; for example, there is no edge between “types” and “hardy”.

4.2 Formulating the Conversational Query

The three query expansion strategies that worked best in CROWN are described in the following (a sketch follows the list below). The conversational query consists of several queries q_t from different turns t. In the weighted versions, each considered query has a turn weight w_t; when calculating the node weights, the resulting similarity scores are multiplied by the respective turn weight. The conversational query also influences the calculation of the edge weights, as will be seen later. The first conversational query set, CQ_1, consists of the current query q_T and the first query q_1; no weights are used, so the set of conversational query weights W_1 is empty. The second option, CQ_2, uses a weighted version: it consists of the current, the previous and the first turn, where each turn has a weight in W_2, which is decayed only for the previous turn. The last option we consider, CQ_3, contains all previous turns, with a weight for each turn in W_3.

  • CQ_1 = {q_T, q_1}, W_1 = {}

  • CQ_2 = {q_T, q_{T-1}, q_1}, W_2 = {w_T, w_{T-1}, w_1}, where w_T = w_1 = 1.0 and w_{T-1} is decayed

  • CQ_3 = {q_1, …, q_T}, W_3 = {w_1, …, w_T}, where w_t = 1.0 if t ∈ {1, T}, and w_t is decayed otherwise
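
The sketch referenced above, assuming each turn is given as a list of non-stopword tokens; the decay constant used for non-anchor turns is a hypothetical placeholder, since the tuned value is not reproduced here.

```python
def conversational_query(turns, option="cq1", decay=0.8):
    """Return (queries, turn_weights) for the current turn T = len(turns).
    `turns` is a list of tokenized queries (stopwords removed); `decay` is an
    illustrative placeholder for the tuned decay constant."""
    T = len(turns)
    if option == "cq1":                      # current turn + first turn, unweighted
        return [turns[-1], turns[0]], {}
    if option == "cq2":                      # current, previous, first; previous decayed
        queries, weights = [turns[-1], turns[0]], {T: 1.0, 1: 1.0}
        if T > 1:
            queries.insert(1, turns[-2])
            weights[T - 1] = decay
        return queries, weights
    if option == "cq3":                      # all turns; full weight for first and current
        weights = {t: 1.0 if t in (1, T) else decay for t in range(1, T + 1)}
        return list(turns), weights
    raise ValueError(f"unknown option: {option}")

turns = [["flowering", "plants", "cold", "climates"],
         ["cold", "pansies", "tolerate"],
         ["different", "varieties"],
         ["pansies", "survive", "frost"]]
print(conversational_query(turns, "cq2"))
```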

4.3 Retrieving Candidate Passages

We used the Indri search engine [21] in CROWN to obtain a set of candidate passages P. Our Indri query also consists of a combination of queries from different turns. Furthermore, Indri supports weighting query terms. Here is an example of how the weighting of certain words can be done in an Indri query:

#weight( 1.0 #combine (survive frost) 0.8 #combine (pansy types) )

We were able to produce the best results with the following expansions (summarized in the list below, followed by a sketch): in IQ_1, the Indri query consists of the current, the previous and the first turn, and no weights are used; IQ_2 consists of the current turn, turn T-1, turn T-2 and the first turn, again without weights. The weighted version IQ_3 uses all previous turns, with the corresponding weights in WI_3. Finally, IQ_union means that three different queries (built from IQ_1, IQ_2 and IQ_3) are issued to Indri and the union of the resulting passages is used for re-ranking.

  • IQ_1 = {q_T, q_{T-1}, q_1}, unweighted

  • IQ_2 = {q_T, q_{T-1}, q_{T-2}, q_1}, unweighted

  • IQ_3 = {q_1, …, q_T}, WI_3 = {w_1, …, w_T}, where w_t = 1.0 if t ∈ {1, T}, and w_t is decayed otherwise
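
The sketch referenced above illustrates how these expansions can be translated into Indri's #weight/#combine syntax; using #weight with weights of 1.0 for the unweighted options and the decay value are our own simplifications.

```python
def indri_query(turns, option="iq1", decay=0.8):
    """Assemble an Indri query string in #weight/#combine syntax.
    Turn selection follows the IQ options described above; the decay value
    (and using #weight with 1.0 for the unweighted options) is illustrative."""
    T = len(turns)
    if option == "iq1":        # current, previous, and first turn, unweighted
        selected = {T: 1.0, T - 1: 1.0, 1: 1.0}
    elif option == "iq2":      # current, T-1, T-2, and first turn, unweighted
        selected = {T: 1.0, T - 1: 1.0, T - 2: 1.0, 1: 1.0}
    elif option == "iq3":      # all turns, decayed except the first and current
        selected = {t: 1.0 if t in (1, T) else decay for t in range(1, T + 1)}
    else:
        raise ValueError(f"unknown option: {option}")

    parts = [f"{w} #combine ( {' '.join(turns[t - 1])} )"
             for t, w in sorted(selected.items(), reverse=True) if 1 <= t <= T]
    return "#weight( " + " ".join(parts) + " )"

turns = [["flowering", "plants", "cold", "climates"],
         ["cold", "pansies", "tolerate"],
         ["pansy", "varieties"],
         ["pansies", "survive", "frost"]]
print(indri_query(turns, "iq1"))
# -> #weight( 1.0 #combine ( pansies survive frost ) 1.0 #combine ( pansy varieties )
#            1.0 #combine ( flowering plants cold climates ) )
```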

4.4 Scoring Candidate Passages

In CROWN, the final score of a passage consists of several components that will be described in the following.

Estimating similarity. The similarity score, which is built upon the node weights, is calculated in the following way:

\[ \mathit{score}_{node}(P_i) = \sum_{j} nw(p_{ij}) \]

where the node weight nw(p_{ij}) of a token and the condition c_1 are defined next:

\[ nw(p_{ij}) = \max_{k} \; \mathbb{1}_{c_1}(p_{ij}, q_k) \cdot w_{t(q_k)} \cdot \mathit{cossim}(v(p_{ij}), v(q_k)) \]

where the indicator \mathbb{1}_{c_1} maps to 1 if condition c_1 is fulfilled and to 0 otherwise; v(p_{ij}) is the word embedding vector of the j-th token in the i-th passage; v(q_k) is the word embedding vector of the k-th token in the conversational query, and w_{t(q_k)} is the weight of the turn in which q_k appeared; cossim denotes the cosine similarity between the passage token and query token embeddings. Condition c_1 is defined as

\[ c_1: \; \mathit{cossim}(v(p_{ij}), v(q_k)) > n_t \]

which means that condition c_1 is only fulfilled if the similarity between a query word and a word in the passage is above a certain threshold n_t.
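
A minimal sketch of this node-score computation, taking the best thresholded match per passage token as suggested by the example in Figure 2; `embeddings` can be any mapping from token to vector (e.g., gensim KeyedVectors), and the threshold value is a placeholder.

```python
import numpy as np

def cossim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def node_score(passage_tokens, query, embeddings, node_threshold=0.7):
    """Similarity score of a passage: sum of node weights, where a node weight
    is the turn-weighted cosine similarity to the closest query token, provided
    it exceeds the node threshold (condition c1). `query` is a list of
    (token, turn_weight) pairs; the threshold value is illustrative."""
    score = 0.0
    for tok in passage_tokens:
        if tok not in embeddings:
            continue
        node_weight = 0.0
        for q_tok, turn_weight in query:
            if q_tok in embeddings:
                sim = cossim(embeddings[tok], embeddings[q_tok])
                if sim > node_threshold:                       # condition c1
                    node_weight = max(node_weight, turn_weight * sim)
        score += node_weight
    return score

# toy embeddings for illustration
emb = {"pansy": np.array([1.0, 0.0]), "pansies": np.array([0.9, 0.1]),
       "frost": np.array([0.0, 1.0]), "cold": np.array([0.1, 0.9])}
print(node_score(["pansy", "cold"], [("pansies", 1.0), ("frost", 1.0)], emb))
```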

Estimating coherence. Coherence is expressed by term proximity, which is reflected in the edge weights. The corresponding score is calculated as follows:

\[ \mathit{score}_{edge}(P_i) = \sum_{j < l} \mathbb{1}_{c_2}(p_{ij}, p_{il}) \cdot ew(p_{ij}, p_{il}) \]

The indicator function \mathbb{1}_{c_2} maps to 1 if condition c_2 is fulfilled and to 0 otherwise. The edge weight is defined as:

\[ ew(p_{ij}, p_{il}) = \mathrm{NPMI}(p_{ij}, p_{il}) \]

The NPMI value between the tokens is calculated from MS MARCO Passage Ranking as a representative corpus. Condition c_2 is defined as

\[ c_2: \; edge(p_{ij}, p_{il}) \;\wedge\; ew(p_{ij}, p_{il}) > e_t \;\wedge\; c_3 \]

where

\[ c_3: \; \exists\, q_k, q_m,\; q_k \neq q_m: \; q_k = \arg\max_{q} \mathit{cossim}(v(p_{ij}), v(q)) \;\wedge\; q_m = \arg\max_{q} \mathit{cossim}(v(p_{il}), v(q)). \]

Condition c_2 assures that there is an edge between the two tokens in the graph and that the edge weight is above a certain threshold e_t. The second condition, c_3, states that there are two non-identical words in the conversational query such that one of them is the query word most similar to p_{ij} (more similar than any other query token, with similarity above the threshold n_t) and the other is the query word most similar to p_{il}.
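
Correspondingly, a sketch of the coherence score over a word-proximity network built as in the earlier sketch (edges carry an `npmi` attribute); the helper that picks the query token most similar to a passage token, and all variable names, are our own.

```python
import numpy as np

def closest_query_token(token, query_tokens, embeddings, node_threshold=0.7):
    """Query token most similar to `token`, or None if no similarity exceeds
    the node threshold."""
    if token not in embeddings:
        return None
    best_tok, best_sim = None, node_threshold
    for q_tok in query_tokens:
        if q_tok in embeddings:
            sim = float(np.dot(embeddings[token], embeddings[q_tok]) /
                        (np.linalg.norm(embeddings[token]) *
                         np.linalg.norm(embeddings[q_tok])))
            if sim > best_sim:
                best_tok, best_sim = q_tok, sim
    return best_tok

def edge_score(passage_tokens, query_tokens, wpn, embeddings,
               edge_threshold=0.0, node_threshold=0.7):
    """Coherence score: sum of NPMI edge weights over passage token pairs that
    are connected in the network (condition c2) and whose closest query tokens
    are two different query words (condition c3)."""
    score = 0.0
    for i, a in enumerate(passage_tokens):
        for b in passage_tokens[i + 1:]:
            if not wpn.has_edge(a, b):
                continue
            npmi = wpn[a][b]["npmi"]
            if npmi <= edge_threshold:
                continue
            q_a = closest_query_token(a, query_tokens, embeddings, node_threshold)
            q_b = closest_query_token(b, query_tokens, embeddings, node_threshold)
            if q_a is not None and q_b is not None and q_a != q_b:
                score += npmi
    return score
```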

Estimating priors. We also take the original ranking received from Indri into account. In CROWN, the score score_indri(P_i) is derived from the rank that the passage P_i received from Indri, such that passages ranked higher by Indri obtain a higher score.

Putting it together. The final score for a passage is a weighted sum of these three individual scores. More formally:

\[ \mathit{score}(P_i) = \alpha \cdot \mathit{score}_{indri}(P_i) + \beta \cdot \mathit{score}_{node}(P_i) + \gamma \cdot \mathit{score}_{edge}(P_i) \]

where α, β and γ are hyperparameters that are tuned using the provided training data.
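
Putting the pieces together, a short sketch of the weighted combination, reusing node_score and edge_score from the sketches above; the reciprocal-rank form of the Indri prior and the α, β, γ values shown here are placeholders, since the exact prior formula and the tuned weights are not reproduced in this description.

```python
def final_score(passage_tokens, indri_rank, query, wpn, embeddings,
                alpha=0.5, beta=0.3, gamma=0.2):
    """Weighted sum of the Indri prior, the node score, and the edge score.
    The reciprocal-rank prior and the alpha/beta/gamma weights are illustrative;
    node_score and edge_score are the functions sketched above."""
    query_tokens = [tok for tok, _ in query]
    indri_prior = 1.0 / indri_rank                 # assumed rank-based prior
    return (alpha * indri_prior
            + beta * node_score(passage_tokens, query, embeddings)
            + gamma * edge_score(passage_tokens, query_tokens, wpn, embeddings))
```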

5 Experimental Setup

5.1 Baseline and Metrics

The provided Indri retrieval model mentioned in Section 3 has been used as the baseline in our experiments. Since responses are assessed using graded relevance, we used nDCG [10] (normalized discounted cumulative gain) and ERR [4] (expected reciprocal rank) as metrics. Furthermore, AP (Average Precision) is reported on the evaluation data.
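
For reference, minimal implementations of the two graded-relevance metrics in their standard formulations; the official evaluation tooling may differ in details such as the gain function.

```python
import math

def ndcg(gains, k=1000):
    """Normalized discounted cumulative gain over graded judgments (0/1/2)."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gs))
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

def err(gains, k=1000, g_max=2):
    """Expected reciprocal rank for graded relevance (Chapelle et al. [4])."""
    p_continue, score = 1.0, 0.0
    for rank, g in enumerate(gains[:k], start=1):
        p_rel = (2 ** g - 1) / 2 ** g_max      # probability the result satisfies the user
        score += p_continue * p_rel / rank
        p_continue *= 1.0 - p_rel
    return score

# gains: graded judgments of the ranked passages, e.g. grade 2 at rank 1
print(ndcg([2, 0, 1, 0], k=4), err([2, 0, 1, 0], k=4))
```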

5.2 Configuration

5.2.1 Dataset.

As mentioned in Section 3, the underlying document collection consists of a combination of three standard TREC collections: MS MARCO, TREC CAR and Washington Post.

5.2.2 Initialization.

We used word2vec embeddings [12] pre-trained on the Google News dataset and obtained via the Python library gensim. Furthermore, the Python library spaCy was used for tokenization and stopword removal. As already mentioned, Indri was used for candidate passage retrieval; we set the number of passages retrieved from Indri to 1000, so as not to lose any relevant documents. For graph processing, we used the NetworkX Python library. The window size within which word co-occurrences are taken into account is set to three in our graph.
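
A sketch of this setup; the gensim downloader name for the Google News vectors and the choice of the small English spaCy model are assumptions (the paper does not name a specific spaCy pipeline), and the embedding download is large.

```python
import gensim.downloader as api
import spacy

# pre-trained word2vec vectors trained on Google News (assumed downloader name)
embeddings = api.load("word2vec-google-news-300")

# spaCy for tokenization and stopword removal (model choice is an assumption)
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Lowercased, alphabetic, non-stopword tokens of a query or passage."""
    return [tok.lower_ for tok in nlp(text) if tok.is_alpha and not tok.is_stop]

print(preprocess("How much cold can pansies tolerate?"))
# e.g. ['cold', 'pansies', 'tolerate']
```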

5.3 Submitted Runs

We submitted four runs for the TREC CAsT track. These are described below.

5.3.1 Run 1: mpi-d5_igraph (indri + graph).

For our first run, we used the unweighted conversational query CQ_1 and the first unweighted option IQ_1 for the Indri query; these options performed best in our experiments. For the definitions of CQ_1 and IQ_1, refer to Section 4.2 and Section 4.3, respectively. The node threshold n_t ensures that only nodes whose word embedding similarity to a query token exceeds it influence the score calculation. The edge threshold e_t is set to 0 to exclude negative NPMI values. The three hyperparameters α (Indri score), β (node score) and γ (edge score) are set to the values that performed best on the training data.

5.3.2 Run 2: mpi-d5_intu (indri-tuned).

In our second run, we vary the set of hyperparameters, while the rest stays the same as in Run 1: this run gives most emphasis to the Indri score (α), while coherence in our graph is not considered, by giving no weight to the edge score (γ = 0).

5.3.3 Run 3: mpi-d5_union (union of indri queries).

Here we use IQ_union, which means that we issue three separate queries to Indri and take the union of all returned passages. However, this leads to three separate Indri rankings which are not comparable; therefore, we do not consider the Indri score in our final score calculation and set α to 0. The weights for the node score (β) and the edge score (γ) are set to the values that worked best on the training data in this setting. The conversational query and the node threshold are the same as for the previous runs.

5.3.4 Run 4: mpi-d5_cqw (weighted conversational query).

In our final run, the conversational query is varied: the weighted conversational query option is used, and the node threshold n_t is set to a slightly more restrictive value. Apart from that, the parameters are set to the same values as in Run 1.

6 Results and Insights

mpi-d5_igraph mpi-d5_intu mpi-d5_union mpi-d5_cqw indri_baseline
nDCG@1000 0.322 0.341 0.195 0.317 0.293
ERR@1000 0.147 0.151 0.038 0.145 0.157
Table 2: Results on training data
Figure 3: Turn-wise results for our four runs on evaluation data.

We present the results of our four runs on the training and the evaluation data. For the training data, we compare our runs to the Indri baseline provided by the organizers (see Table 2). Note that for calculating the nDCG and ERR metrics on the training data, only the limited relevance judgments from the manually created dialogues have been used. Three of our runs were able to outperform the Indri baseline with respect to nDCG@1000.

In Table 3 and Table 4, the results on the evaluation data for the metrics AP@5 and nDCG@1000 are reported. Average values for each turn (up to Turn 8) and over all turns are displayed. The results for our four runs are reported as well as the median and best turn-wise results over all submissions to the track. Additionally, in Figure 3 the results of our four runs are visualized over eight turns for AP@5 and nDCG@1000.

There is a tendency for the results to increase for later turns (up to Turn 6 for AP@5) or not to vary much (up to Turn 7 for nDCG@1000). This suggests that our method is robust with respect to turn depth and that later turns successfully exploit the information available from previous turns. Three of our runs, namely mpi-d5_igraph, mpi-d5_intu and mpi-d5_cqw, achieve above-average performance with respect to both metrics. Our mpi-d5_union run does not achieve competitive results, probably because the candidate pool taken from the union of the three separate Indri retrievals is too large for effective re-ranking.

Table 5 shows some example queries from the training data that appear at different turns in their respective conversations, together with snippets from passages ranked in the top 5 by CROWN. Information from previous turns is required to correctly answer these questions. For example, the query “What about in the US?”, asked at Turn 5, needs the additional information “physician assistants” and “starting salary”, given at Turn 1 and Turn 4 respectively. These terms are directly matched in the correct answer, resulting in a high node score, and additionally appear next to each other, yielding a high edge score.

mpi-d5_igraph mpi-d5_intu mpi-d5_union mpi-d5_cqw median best
Turn 1 0.033 0.033 0.023 0.033 0.035 0.100
Turn 2 0.040 0.042 0.009 0.040 0.031 0.147
Turn 3 0.039 0.040 0.014 0.038 0.035 0.128
Turn 4 0.044 0.052 0.009 0.046 0.043 0.141
Turn 5 0.048 0.053 0.009 0.051 0.041 0.147
Turn 6 0.050 0.059 0.006 0.051 0.039 0.183
Turn 7 0.042 0.051 0.006 0.043 0.038 0.167
Turn 8 0.028 0.032 0.003 0.025 0.029 0.205
All 0.039 0.043 0.010 0.039 0.034 0.145

Table 3: Turn-wise results on evaluation data for AP@5.
mpi-d5_igraph mpi-d5_intu mpi-d5_union mpi-d5_cqw median best
Turn 1 0.497 0.518 0.444 0.497 0.472 0.761
Turn 2 0.448 0.480 0.330 0.446 0.367 0.759
Turn 3 0.486 0.504 0.399 0.479 0.417 0.779
Turn 4 0.438 0.456 0.350 0.436 0.382 0.778
Turn 5 0.425 0.453 0.353 0.410 0.374 0.777
Turn 6 0.454 0.494 0.329 0.458 0.364 0.821
Turn 7 0.463 0.499 0.352 0.456 0.404 0.841
Turn 8 0.374 0.420 0.296 0.376 0.309 0.810
All 0.441 0.470 0.352 0.437 0.362 0.754

Table 4: Turn-wise results on evaluation data for nDCG@1000.

7 Conclusion

In this work, we presented our unsupervised method CROWN. We showed that scoring passages by a combination of similarity and coherence is a simple estimate that works quite well in practice. A context window of size three seems to capture significant word co-occurrences successfully. In general, giving greater influence to the Indri ranking and preferring node weights over edge weights improved the results. Regarding query expansion strategies, we observed that including the previous and the first turns is beneficial; weighted turns did not improve the results significantly. In the future, we would also like to consider the positions of query terms in passages, following the intuition that passages in which the query terms appear earlier are more relevant.

Turn Query Passage Snippet
4 “What makes it (smoking, Turn 1) so addictive?” “Nicotine, the primary psychoactive chemical in cigarettes, is highly addictive.”
2 “What makes it (Uranus, Turn 1) so unusual?” “One fact that is not only unusual, but also makes Uranus markedly different from earth is the angle of its spin axis.”
3 “How much do Holsteins (cattle, Turn 1) produce (milk, Turn 2)?” “The Holstein-Friesian is the breed of dairy cow most common in […], around 22 litres per day is average.”
5 “What about (physician assistant, Turn 1, starting salary, Turn 4) in the US?” “Physician assistant’s starting salary varies from city to city. For instance, in New York [it] is around $69,559, […]”
9 ”Do NPs (nurse practitioner, Turn 8) or PAs (physician assistant, Turn 1) make more?” “The average salary among all NPs […] is $94,881.22 and the average salary among all PAs […] is $100,497.78.”
Table 5: Examples for correct answer snippets (rank 1 - 5 in CROWN) for queries from different turns taken from training conversations.

References

  • [1] Z. Bar-Yossef and N. Kraus (2011) Context-sensitive query auto-completion. In WWW, Cited by: §2.
  • [2] R. F. I. Cancho and R. V. Solé (2001) The small world of human language. Proceedings of the Royal Society of London. Series B: Biological Sciences 268 (1482). Cited by: §4.1.
  • [3] H. Cao, D. Jiang, J. Pei, E. Chen, and H. Li (2009) Towards context-aware search by learning a very large variable length hidden Markov model from search logs. In WWW, Cited by: §2.
  • [4] O. Chapelle, D. Metlzer, Y. Zhang, and P. Grinspan (2009) Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, Cited by: §5.1.
  • [5] E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer (2018) QuAC: Question answering in context. In EMNLP, Cited by: §2.
  • [6] P. Christmann, R. Saha Roy, A. Abujabal, J. Singh, and G. Weikum (2019) Look before you hop: conversational question answering over knowledge graphs using judicious context expansion. In CIKM, Cited by: §2.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §2.
  • [8] A. Elgohary, C. Zhao, and J. Boyd-Graber (2018) A dataset and baselines for sequential open-domain question answering. In EMNLP, Cited by: §2.
  • [9] D. Guo, D. Tang, N. Duan, M. Zhou, and J. Yin (2018) Dialog-to-action: conversational question answering over a large-scale knowledge base. In NeurIPS, Cited by: §2.
  • [10] K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20 (4). Cited by: §5.1.
  • [11] L. Li, H. Deng, A. Dong, Y. Chang, R. Baeza-Yates, and H. Zha (2017) Exploring query auto-completion and click logs for contextual-aware web search and query suggestion. In WWW, Cited by: §2.
  • [12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, Cited by: §5.2.2.
  • [13] Y. Ohsugi, I. Saito, K. Nishida, H. Asano, and J. Tomita (2019) A simple but effective method to incorporate multi-turn context with BERT for conversational machine comprehension. In Proceedings of the First Workshop on NLP for Conversational AI, Cited by: §2.
  • [14] F. Radlinski and N. Craswell (2017) A theoretical framework for conversational search. In CHIIR, Cited by: §2.
  • [15] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, Cited by: §2.
  • [16] S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: a conversational question answering challenge. TACL 7. Cited by: §2.
  • [17] G. Ren, X. Ni, M. Malik, and Q. Ke (2018) Conversational query understanding using sequence to sequence modeling. In WWW, Cited by: §2.
  • [18] M. Saeidi, M. Bartolo, P. Lewis, S. Singh, T. Rocktäschel, M. Sheldon, G. Bouchard, and S. Riedel (2018) Interpretation of natural language rules in conversational machine reading. In EMNLP, Cited by: §2.
  • [19] A. Saha, V. Pahuja, M. M. Khapra, K. Sankaranarayanan, and S. Chandar (2018) Complex sequential question answering: towards learning to converse over linked question answer pairs with a knowledge graph. In AAAI, Cited by: §2.
  • [20] X. Shen, B. Tan, and C. Zhai (2005) Context-sensitive information retrieval using implicit feedback. In SIGIR, Cited by: §2.
  • [21] T. Strohman, D. Metzler, H. Turtle, and W. Croft (2005-01) Indri: a language-model based search engine for complex queries. Information Retrieval - IR. Cited by: §4.3.
  • [22] A. Sun and C. Lou (2014) Towards context-aware search with right click. In SIGIR, Cited by: §2.
  • [23] V. Zhong and L. Zettlemoyer (2019) E3: entailment-driven extracting and editing for conversational machine reading. In ACL, Cited by: §2.