TopiOCQA: Open-domain Conversational Question Answering with Topic Switching

10/02/2021 ∙ by Vaibhav Adlakha, et al. ∙ Montréal Institute of Learning Algorithms 1

In a conversational question answering scenario, a questioner seeks to extract information about a topic through a series of interdependent questions and answers. As the conversation progresses, they may switch to related topics, a phenomenon commonly observed in information-seeking search sessions. However, current datasets for conversational question answering are limited in two ways: 1) they do not contain topic switches; and 2) they assume the reference text for the conversation is given, i.e., the setting is not open-domain. We introduce TopiOCQA (pronounced Tapioca), an open-domain conversational dataset with topic switches on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. TopiOCQA poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history. We evaluate several baselines by combining state-of-the-art document retrieval methods with neural reader models. Our best model achieves an F1 of 51.9 and a BLEU score of 42.1, which fall short of human performance by 18.3 points and 17.6 points respectively, indicating the difficulty of our dataset. Our dataset and code will be available at




1 Introduction

Q: when was the byzantine empire born what was it originally called?
A: 5th century AD and was called Eastern Roman Empire, or Byzantium
Topic: Byzantine Empire
. . . . .
Q: which battle or event marked the fall of this empire?
A: A six-year-long civil war followed by attack from Sultan Mehmed’s army
Topic: Byzantine Empire
Q: did he conquer other territories as well?
A: Yes. Anatolia and in Southeast Europe as far west as Bosnia
Topic: Mehmed the Conqueror
Q: where is the first area located in present day terms?
A: Turkey
Topic: Anatolia
. . . . .
Q: what is the present day capital of the country?
A: Ankara
Topic: Turkey
Q: can you name some of the other major cities here?
A: Istanbul
Topic: Turkey
Q: were any of these cities associated with the first empire you were discussing?
A: The Ottomans made the city of Ankara the capital first of the Anatolia Eyalet and then the Angora Vilayet
Topic: Ankara
Figure 1: A conversation from TopiOCQA. Our dataset has information-seeking questions with free-form answers across multiple topics (documents). The consecutive turns from the same topic (document) have been excluded for brevity.
Dataset Multi–turn Open–domain Free–form answers Information–seeking questions Topic Switching
TopiOCQA (ours)
QReCC Anantha et al. (2021) B B
OR-QuAC Qu et al. (2020)
CoQA Reddy et al. (2019)
QuAC Choi et al. (2018)
NarrativeQA Kočiský et al. (2018)
Natural Questions Kwiatkowski et al. (2019)
SQuAD 2.0 Rajpurkar et al. (2018)
Table 1: Comparison of TopiOCQA with other QA datasets. TopiOCQA incorporates topical changes, along with several best practices of previous datasets. B indicates that only a proportion of the dataset satisfies the property.

People often engage in conversations to discover new knowledge Walton (2019). In such conversations, a questioner (the seeker) asks multiple rounds of information-seeking questions to an answerer (the expert). As the conversation proceeds, the questioner becomes inquisitive of new but related topics based on the information provided in the answers Stede and Schlangen (2004). Such topic switching behaviour is natural in information-seeking conversations and is commonly observed when people seek information through search engines Spink et al. (2002).

According to Spink et al., people switch from one to ten topics with a mean of 2.11 topic switches per search session. For example, a person can start a search session about tennis, and then land on Roger Federer, and after learning a bit about him may land on his country Switzerland and spend more time learning about other Swiss athletes. Thanks to tremendous progress in question answering research Rogers et al. (2021), we are coming close to enabling information-seeking conversations with machines (as opposed to just using keyword-based search). To realize this goal, it is crucial to construct datasets that contain information-seeking conversations with topic switching, and to measure the progress of conversational models on this task. These are the two primary contributions of this work.

In the literature, a simplified setting of information-seeking conversation known as conversational question answering (CQA) has been deeply explored (Choi et al., 2018; Reddy et al., 2019). In this task, the entire conversation is based on a given reference text of a topic/entity. While the CQA task is challenging, it still falls short of the real-world setting, where the reference text is not known beforehand (first limitation) and the conversation is not restricted to a single topic (second limitation).

Qu et al. (2020) and Anantha et al. (2021) have attempted to overcome the first limitation by adapting existing CQA datasets to the open-domain setting. They do so by obtaining context-independent rewrites of the first question to make the question independent of the reference text. For example, if the reference text is about Augusto Pinochet and the conversation starts with a question "Was he known for being intelligent?", the question will be rewritten to "Was Augusto Pinochet known for being intelligent?". However, as the entire question sequence in the conversation was collected with a given reference text of a topic, all the turns in a conversation revolve around a single topic.

In this work, we present TopiOCQA (Topic switching in Open-domain Conversational Question Answering; pronounced Tapioca), a large-scale dataset for information-seeking conversations in the open-domain setting, based on the Wikipedia corpus. We consider each Wikipedia document to be a separate topic. The conversations in TopiOCQA start with a real information-seeking question from Natural Questions Kwiatkowski et al. (2019) in order to determine a seed topic (document), and then the questioner may shift to other related topics (documents) as the conversation progresses. (A portion of the training data also contains conversations where the questioner asks the first question given a seed topic.) Throughout the conversation, the questioner is never shown the content of the documents (but only the section titles) to simulate an information-seeking scenario, whereas the answerer has full access to the content along with the hyperlink structure for navigation. In each turn, both questioner and answerer use free-form text (as opposed to extractive text spans as is common for an answerer in many existing datasets).

Figure 1 shows an example of a conversation from our dataset. The first question leads to the seed topic Byzantine Empire, and after two turns switches to Mehmed the Conqueror in Q, based on part of the answer (A) that contains reference to Mehmed. Note that the answers A, A and A are free-form answers that do not occur as spans in either the seed document or the follow up document. The topic then switches immediately to Anatolia based on part of the previous answer (A). The topics change in further turns to Turkey and Ankara. Because of the conversational nature, TopiOCQA contains questions rife with complex coreference phenomena, for instance, Q relies on entities mentioned in A, A and Q.

TopiOCQA contains 3,920 conversations and 50,574 QA pairs, based on a Wikipedia corpus of 5.9 million documents. On average, a conversation in TopiOCQA has 13 question-answer turns and involves 4 topics. 28% of turns in our dataset require retrieving a document different from the previous turn. To the best of our knowledge, there is no other open-domain information-seeking CQA dataset that incorporates topical changes, along with other desirable properties (see Table 1).

To investigate the difficulty of the TopiOCQA dataset, we benchmark several strong retriever-reader neural baselines, considering both sparse and dense retrievers, as well as extractive and generative readers (Karpukhin et al., 2020; Izacard and Grave, 2021). Inspired by previous work, we explore two ways to represent the question: (1) concatenating the entire conversation history Qu et al. (2020), and (2) self-contained rewrites of the conversational question Anantha et al. (2021). The best performing model, Fusion-in-Decoder Izacard and Grave (2021) trained on concatenated conversation history, is 18.3 F1 points and 17.6 BLEU points short of human performance, indicating significant room for improvement. We also evaluate GPT-3 to estimate the performance in a closed-book few-shot setting, and its performance is 56.4 F1 and 41.2 BLEU points below the human performance.

2 Related Work

2.1 Open-Domain Question Answering

In open-domain question answering, a model has to answer natural language questions by retrieving relevant documents. This can be considered as a simplified setting of open-domain CQA, where the conversation is limited to just one turn. Several datasets have been proposed for this task. On one hand, reading comprehension datasets like SQuAD Rajpurkar et al. (2016) and TriviaQA Joshi et al. (2017), which consist of (question, document, answer) triplets, have been adapted by withholding access to the document Chen et al. (2017). While these datasets have been helpful in spurring modelling advances, they suffer from an annotator bias because they were not collected in an information-seeking setup. That is, annotators have access to the target answer and its surrounding context and therefore tend to formulate questions that have a high lexical overlap with the answer (Jia and Liang, 2017). On the other hand, web-search based datasets do not suffer from such artefacts because they are curated from real search engine queries. The WikiQA Yang et al. (2015) and MS Marco Nguyen et al. (2016) datasets contain queries from the Bing search engine, whereas Natural Questions Kwiatkowski et al. (2019) contains queries from the Google search engine.

Models for open-domain QA often follow a two-stage process: (1) a retriever selects a small collection of documents relevant to the question from a big corpus (e.g. Wikipedia), and (2) a reader extracts or generates an answer from the selected documents. While classical approaches rely on counting-based bag-of-words representations like TF-IDF or BM25 Chen et al. (2017); Wang et al. (2018); Yang et al. (2019), more recent deep learning approaches learn dense representations of the questions and documents through a dual-encoder framework Lee et al. (2019); Karpukhin et al. (2020). In such learned retriever setups, document retrieval is done efficiently using Maximum Inner Product Search (MIPS, Shrivastava and Li (2014)).

2.2 Conversational Question Answering (CQA)

CQA extends the reading comprehension task from a single turn to multiple turns. Given a reference document, a system is tasked with interactively answering a sequence of information-seeking questions about the corresponding document. This conversational extension leads to novel challenges in modeling linguistic phenomena such as anaphora (referencing previous turns) and ellipsis (omitting words from questions), as well as in performing pragmatic reasoning. Large-scale conversational datasets such as CoQA Reddy et al. (2019) and QuAC Choi et al. (2018) have facilitated much of the research in this area. The two datasets differ along several dimensions: (1) CoQA has short free-form answers, whereas QuAC has longer spans from evidence text in the reference text; (2) CoQA contains documents from several domains such as Children's Stories, Exams etc., whereas QuAC conversations are based on Wikipedia documents of the “people” category; (3) unlike CoQA, QuAC is collected in a simulated information-seeking scenario.

Models for CQA have used simple concatenation of the question-answer history Zhu et al. (2019), history turn selection Qu et al. (2019a,b), and question-rewrites Vakulenko et al. (2021). For question-rewriting, a different module is trained on self-contained rewrites of context-dependent questions. For example, a plausible rewrite of Q (Figure 1) is “Which other people Anne Boleyn are buried at the Chapel of St Peter ad Vincula?”. The rewritten question is then answered using open-domain QA systems. Two popular question-rewriting datasets for training this module are (1) CANARD Elgohary et al. (2019), which contains rewrites of roughly 50% of QuAC, and (2) QReCC Anantha et al. (2021), which contains rewrites of the entire QuAC dataset, along with rewrites of CAsT Dalton et al. (2020) and NQ-based conversations.

2.3 Open-Domain CQA

In this work, we focus on constructing a challenging benchmark for open-domain CQA. The open-domain aspect of the task requires systems to answer questions without access to a reference document. The conversational aspect enables users to ask multiple related questions, which can, in principle, span several different topics. With TopiOCQA we introduce the first open-domain CQA dataset that explicitly covers such topical switches.

Previous datasets for this task re-purpose existing CQA datasets. The OR-QuAC dataset Qu et al. (2020) is automatically constructed from QuAC Choi et al. (2018) and CANARD Elgohary et al. (2019) by replacing the first question in QuAC with context-independent rewrites from CANARD. QReCC Anantha et al. (2021) is a large-scale open-domain CQA and question rewriting dataset which contains conversations from QuAC, TREC CAsT Dalton et al. (2020) and Natural Questions (NQ; Kwiatkowski et al. 2019). All the questions in OR-QuAC and 78% of questions in QReCC are based on QuAC. As QuAC was collected with a given reference document, the conversations corresponding to these questions revolve around the topic or entity corresponding to that document. 21% of questions in QReCC are from NQ-based conversations. As NQ is not a conversational dataset, the annotators of QReCC use NQ to start a conversation. A single annotator is tasked with providing both follow-up questions and answers for a given NQ question. In contrast to QReCC, conversations in our dataset are collected in a simulated information-seeking scenario using two annotators (Section 3).

Deep learning models for this task have followed a retriever-reader setup similar to that of open-domain QA. Instead of a single question, previous works have explored feeding the entire conversation history Qu et al. (2020), or a context-independent rewritten question Anantha et al. (2021).

3 Dataset Collection

Each conversation in TopiOCQA is an interaction between two annotators – a questioner and an answerer. The details about the annotators and the software used for building the annotation interface are provided in Appendix A.

3.1 Seed topics and document collection

The seed topics essentially drive the conversation. In order to make them interesting for annotators, we select Wikipedia's Good articles (around 35k) as seed topics. However, we do not use this information during modeling, where we use all articles of Wikipedia. We used the Wikipedia dump from 10/20/2020, which consists of about 5.9 million documents. We used WikiExtractor (github:wikiextractor) to clean the text from the document dump. While pre-processing the Wikipedia documents, we retain the hyperlinks that refer to other Wikipedia documents, thus ensuring that all the documents requested by annotators (via hyperlinks) during the conversation are available in our corpus.

3.2 Simulating information-seeking scenario

Information-seeking conversations are closer to the real world if an information need can be simulated via the interface. In TopiOCQA, we achieve this by withholding the questioner’s access to the full reference text of the document. The questioner can only see the metadata (main title and the section titles) of the Wikipedia documents, whereas the answerer can access the entire text of the documents. On finding the answer, the answerer highlights a contiguous span of text as rationale, and generates a free-form answer. The answerer also has the option to mark the question as unanswerable. The conversation history is visible to both the annotators.

As a conversation starting point, the first question is sampled from a subset of Natural Questions (NQ; Kwiatkowski et al. 2019), since NQ contains genuine information-seeking questions asked on Google. We only sample those questions whose answer is contained in our seed document pool. To increase the diversity of our dataset, we also allow the questioner to formulate the first question for 28% of the conversations, ensuring that the seed topic entity is present in the question.

3.3 Enabling topic-switching

The key feature of the interface is enabling topic switching via hyperlinks. For the answerer, the text of the document includes clickable hyperlinks to other documents. On clicking these links, the current document in the answerer’s interface changes to the requested (clicked) document. This enables the answerer to search for answers in documents beyond the current one. The questioner can access the metadata of documents visited by the answerer and documents present in the rationale of the answers. For example, let us assume given the seed document Daniel Radcliffe and the first question “Where was Daniel Radcliffe born?”, the answerer selects “London, England” span as rationale and provides “London” as the answer. If London is a hyperlink in the span, then the metadata of both Daniel Radcliffe and London is available to the questioner to form the next question. If the next question is “What is its population?”, the answerer can switch the current document from Daniel Radcliffe to London by clicking on the hyperlink, and can then find and provide the answer. The conversation up till this point involves two topics – Daniel Radcliffe and London. We also provide easy navigation to previously visited documents for both the annotators. This interface design ensures that information about the new topic is semantically connected to topics of the previous turns, similar to natural human-human conversations Sacks and Jefferson (1995). Figure 6 in Appendix A shows annotation interfaces for both questioners and answerers.

3.4 Additional annotations

To account for multiple valid answers, we collected three additional annotations for answers of conversations in evaluation sets (development and test splits). For this task, at any turn, the annotator can see all the previous questions and original answers. Showing original answers of previous turns is important in a conversational setting as the subsequent questions can potentially depend on them. We also provide the list of documents corresponding to previous turns of the original conversation. This ensures that the current annotator has all the information the original answerer had while providing the answer. Similar to the answerer, the annotator then provides the rationale and the answer, or marks the question as unanswerable.

Figure 2: Analysis of the topic switches in TopiOCQA. In (a) we show the distribution of the number of topics (in percentage) for each conversation length. Longer conversations typically include more topics. In (b) we show a histogram of the topic lengths, illustrating that usually 3-4 consecutive questions stay within the same topic.

4 Dataset Analysis

Dataset | Train | Dev | Test | Overall
# Turns | 45,450 | 2,514 | 2,502 | 50,466
# Conversations | 3,509 | 205 | 206 | 3,920
# Tokens / Question | 6.91 | 6.89 | 7.11 | 6.92
# Tokens / Answer | 11.71 | 11.96 | 12.27 | 11.75
# Turns / Conversation | 13 | 12 | 12 | 13
# Topics / Conversation | 4 | 4 | 4 | 4
Table 2: Dataset statistics of TopiOCQA

We collected a total of 3,920 conversations, consisting of 50,466 turns. The annotators were encouraged to complete a minimum of 10 turns. Conversations with fewer than 5 turns were discarded. The cost of annotating a single turn was $0.3. We split the data into train, development and test splits.

Table 2 reports simple statistics of the dataset splits. On average, a conversation has 13 question-answer turns and is based on 4 documents. TopiOCQA differs from other conversational question-answering datasets by incorporating topic switches in the conversation.

Figure 3: A flow diagram of topic switches over conversations up to 15 turns. There are complex interactions between the topics, especially later in the conversation.
Figure 4: Distribution of various question types for each turn type. Questions asking about specific attributes are most common. Generic questions are likely to be observed when switching to a new topic.

Question Type | Avg Answer length
ask-generic | 22.43
ask-specific | 11.38
ask-further | 11.23
Table 3: Average answer length of different question types. Generic questions tend to have longer answers.

Q: who is mariah carey? (Turn type: no-switch; Question type: ask-generic)
A: An American singer-songwriter and actress
Topic: Mariah Carey
Q: name one of her famous songs. (Turn type: no-switch; Question type: ask-specific)
A: Oh Santa!
Topic: Mariah Carey
Q: how was it received? (Turn type: switch-to-new; Question type: ask-specific)
A: There were mixed reviews
Topic: Oh Santa!
Q: is she married? (Turn type: switch-to-old; Question type: ask-specific)
A: Yes
Topic: Mariah Carey
Q: to whom? (Turn type: no-switch; Question type: ask-further)
A: Tommy Mottola
Topic: Mariah Carey
Table 4: Example of a conversation with various turn types and question types. Random samples of each turn type are manually annotated with one of the question types.

4.1 Topic Switching

Before we start our analysis, let us first define the notion of a topic switch in TopiOCQA. Recall that answers are based on Wikipedia articles, where each document consists of several sections. While one can argue that a topic switch occurs when the answer is based on a different section of the same document, we opt for a more conservative notion and define a switch of topic if the answer is based on a different Wikipedia document.

Number of topics vs conversation length

We begin our analysis by investigating how the number of topics varies with the conversation length. In Figure 2(a) we show a heat-map of the number of topics for each conversation length, where each column is normalized by the number of conversations of that length. We observe that longer conversations usually include more topics. Most 10-turn conversations include 3 topics, 14-turn conversations include 4 topics, and 18-turn conversations include 5 topics. The conversations with fewer than 10 turns mostly include 2 topics.

Topic flow in conversation

Next, we examine how often consecutive questions stay within the same topic. To do so, we first cluster conversations into sequences of turns for which all answers are from the same document. Then, we count how many turns belong to topic clusters of a particular length. Figure 2(b) shows the distribution of topic lengths. The mode of the distribution is 3, signifying that annotators usually ask 3 questions about the same topic before switching. Asking 2 or 4 consecutive questions on the same topic is also frequently observed. However, we rarely see more than 10 consecutive turns on the same topic.
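The clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes each conversation is given as a list of answer-document titles, one per turn, and the function names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def topic_run_lengths(topics):
    """Lengths of maximal runs of consecutive turns answered from one document."""
    return [sum(1 for _ in run) for _, run in groupby(topics)]


def turns_per_run_length(conversations):
    """Count how many turns fall into runs of each length, over all conversations."""
    counts = Counter()
    for topics in conversations:
        for length in topic_run_lengths(topics):
            counts[length] += length  # a run of length L contributes L turns
    return counts


# Answer documents of the conversation in Table 4, one entry per turn.
conversation = ["Mariah Carey", "Mariah Carey", "Oh Santa!", "Mariah Carey", "Mariah Carey"]
print(topic_run_lengths(conversation))       # [2, 1, 2]
print(turns_per_run_length([conversation]))  # 4 turns in runs of length 2, 1 turn in a run of length 1
```

Note that a return to an earlier topic starts a new run, so the two Mariah Carey segments are counted separately.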

We also analyse the flow of topics throughout the conversation. Do annotators always introduce new topics or do they also go back to old ones? Figure 3 depicts a flow diagram of topics in conversations up to 15 turns. Note that we have indexed topics according to their first occurrence in the conversation. We can see that the majority of switches introduce new topics, but also that more complex topic switching emerges in later turns. Specifically, we see that, after turn 5, questioners frequently go back one or two topics in the conversation. All in all, this diagram suggests that there are complex interactions among the topics in the conversation.

Qualitative assessment of topic switching

We probe the questions in an attempt to understand the causes of a topic switch. Inspired by Stede and Schlangen (2004), we classify questions into three types: ask-generic refers to general open-ended questions, ask-specific questions ask about a specific attribute or detail of a topic, and ask-further is a question type that seeks additional details of an attribute discussed in one of the previous turns. Table 4 shows examples of each type for questions in the same conversation. We consider three types of turns for our evaluation. If the answer document of the turn is the same as that of the previous turn, we refer to it as no-switch. If a topic switch has happened, and the answer document is present in one of the previous turns, it is considered to be switch-to-old. The final category, switch-to-new, refers to turns where the current answer document has not been seen in the conversation before. These different types of topic switches are illustrated in Table 4.
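The three turn types can be derived mechanically from the sequence of answer documents. The sketch below is illustrative only: the function name is hypothetical, and labelling the first turn as no-switch is an assumption taken from the first row of Table 4.

```python
def classify_turns(topics):
    """Label each turn as no-switch, switch-to-old, or switch-to-new.

    `topics` holds the answer document of each turn, in order.
    The first turn is labelled no-switch (assumption, following Table 4).
    """
    labels = []
    seen = set()
    previous = None
    for topic in topics:
        if previous is None or topic == previous:
            labels.append("no-switch")
        elif topic in seen:
            labels.append("switch-to-old")  # document appeared in an earlier turn
        else:
            labels.append("switch-to-new")  # document not seen before in this conversation
        seen.add(topic)
        previous = topic
    return labels


# Answer documents of the conversation in Table 4.
table4_topics = ["Mariah Carey", "Mariah Carey", "Oh Santa!", "Mariah Carey", "Mariah Carey"]
print(classify_turns(table4_topics))
# ['no-switch', 'no-switch', 'switch-to-new', 'switch-to-old', 'no-switch']
```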

We sample 50 turns of each type, and manually label them with one of the three question types. Figure 4 shows the results of our evaluation. ask-specific is the most common question type across all types of turns, indicating that most of the questions in the dataset focus on specific attributes of a topic. ask-generic has a much higher proportion in switch-to-new turn types, indicating that it is more likely to see generic questions in turns that introduce a new topic in the conversation, compared to other turn types. ask-further has almost equal proportion in no-switch and switch-to-old, with switch-to-old being slightly higher. ask-further is not observed in switch-to-new as follow-up questions are generally not possible without the topic being discussed in the previous turns.

We also look at the average answer length for all three question types (Table 3). Unsurprisingly, ask-generic has a much higher answer length compared to other types, presumably due to the open-ended nature of the question.

5 Experimental Setup

The task of open-domain information-seeking conversation can be framed as follows. Given previous questions and ground-truth answers $\{q_1, a_1, q_2, a_2, \ldots, q_{n-1}, a_{n-1}\}$ and the current question $q_n$, the model has to provide the answer $a_n$. The models can optionally use a corpus of documents $C$.

5.1 Models

We consider models from two categories, based on whether they use the document corpus. The closed-book models use just the question-answer pairs, whereas open-book models use the document corpus along with the question-answer pairs. We now describe the implementation and technical details of both classes of models.

5.1.1 Closed-book

Large-scale language models often capture a lot of world knowledge during unsupervised pre-training Petroni et al. (2019); Roberts et al. (2020). These models, in principle, can answer questions without access to any external corpus. We consider GPT-3 Brown et al. (2020), an autoregressive language model with 175 billion parameters, and evaluate it on TopiOCQA. The input to GPT-3 is a prompt followed by previous question-answer pairs and the current question. Except for the first turn, other turns can be considered as being evaluated in a few-shot setting, as the gold answers of the previous turns are provided.

5.1.2 Open-book

We build on state-of-the-art QA models that adopt a two-step retriever-reader approach. For the retriever, we consider BM25 Robertson et al. (1995) and DPR Retriever Karpukhin et al. (2020). Given a query, BM25 ranks the documents based on a bag-of-words scoring function. On the other hand, DPR learns dense vector representations of document and query, and uses the dot product between them as a ranking function.
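To make the contrast between the two ranking functions concrete, here is a minimal sketch. It is not the retrieval code used in the experiments: the BM25 function uses a common Okapi formulation with assumed parameters `k1=1.5` and `b=0.75` and whitespace tokenization, and the dense vectors are hand-written toy values rather than outputs of trained DPR encoders.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def dot_scores(query_vec, doc_vecs):
    """Dense ranking: dot product between the query vector and each document vector."""
    return [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]


docs = [
    "ankara is the capital of turkey",
    "istanbul is a major city in turkey",
    "tapioca is a starch",
]
sparse = bm25_scores("capital of turkey", docs)
dense = dot_scores([1.0, 0.0], [[0.9, 0.1], [0.2, 0.8]])  # toy 2-d vectors
```

In this toy corpus the first document matches all three query terms and therefore receives the highest BM25 score, while the third document shares no terms with the query and scores zero.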

We consider two types of neural readers: (1) DPR Reader Karpukhin et al. (2020), which re-ranks the retrieved passages and selects a span from each document independently; the span with the highest span score is chosen as the answer. (2) Fusion-in-Decoder (FiD; Izacard and Grave 2021), which encodes all retrieved passages independently, and then jointly attends over all of them in the decoder to generate the answer.

Q: who is lead singer of rage against the machine?
A: Zack de la Rocha
Q: when was it formed?
A: 1991
Q: was it nominated for any award?
Original: was it nominated for any award
AllHistory: who is lead singer of rage against the machine [SEP] Zack de la Rocha [SEP] when was it formed? [SEP] 1991 [SEP] was it nominated for any award
Rewrites: was rage against the machine nominated for any award
Figure 5: A partial conversation and different question representations of the third question (Q3). The Rewrites representation is an example, not the output of our QR module.

For these models, we consider three different question representations for the question $q_n$ at turn $n$ of the conversation. Figure 5 shows an example of different question representations for the third question (Q3) of a conversation.

  • Original: This serves as a naive baseline where just the current question $q_n$ is passed to the model.

  • AllHistory: The question is represented as $q_1$ [SEP] $a_1$ [SEP] $q_2$ [SEP] $a_2$ [SEP] $\ldots$ [SEP] $q_{n-1}$ [SEP] $a_{n-1}$ [SEP] $q_n$. When constrained by the encoder input sequence length, we retain the first turn and as many turns prior to the current turn as possible, i.e. $k$ is chosen such that $q_1$ [SEP] $a_1$ [SEP] $q_{n-k}$ [SEP] $a_{n-k}$ [SEP] $\ldots$ [SEP] $q_{n-1}$ [SEP] $a_{n-1}$ [SEP] $q_n$ satisfies encoder input limits.

  • Rewrites: Given a query-rewriting module $QR$, let $q'_n = QR(q_1, a_1, \ldots, q_{n-1}, a_{n-1}, q_n)$ denote the decontextualized question, conditioned on the conversation history. $q'_n$ is then passed to the model.
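The AllHistory representation and its truncation rule can be sketched as below. This is an illustration under stated assumptions, not the authors' preprocessing code: the function name and the `max_tokens` parameter are hypothetical, and length is measured in whitespace tokens as a stand-in for the encoder's real tokenizer.

```python
SEP = "[SEP]"


def all_history(history, question, max_tokens=None):
    """Build the AllHistory input from previous (question, answer) pairs.

    If the full history exceeds `max_tokens`, keep the first turn and as many
    of the most recent turns as fit.
    """
    def render(pairs):
        parts = [text for pair in pairs for text in pair] + [question]
        return f" {SEP} ".join(parts)

    def n_tokens(text):
        return len(text.split())  # whitespace tokens: stand-in for a real tokenizer

    if max_tokens is None or n_tokens(render(history)) <= max_tokens:
        return render(history)

    first, rest = history[:1], history[1:]
    # Drop the oldest turns after the first one until the input fits.
    for start in range(1, len(rest) + 1):
        candidate = render(first + rest[start:])
        if n_tokens(candidate) <= max_tokens:
            return candidate
    return render(first)  # nothing shorter is possible; may still exceed the limit


# The partial conversation from Figure 5.
history = [
    ("who is lead singer of rage against the machine", "Zack de la Rocha"),
    ("when was it formed?", "1991"),
]
print(all_history(history, "was it nominated for any award"))
```

With a toy history of three turns and `max_tokens=9`, the second turn is dropped while the first and the most recent turns are kept.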

Wang et al. (2019) observed that fixed-length text segments perform better in both retrieval and final QA accuracy. Hence, we first pre-process the Wikipedia corpus to extract text using the Beautiful Soup library and then split each section of a Wikipedia document into multiple text blocks of at least 100 words, while preserving sentence boundaries. These text blocks, augmented with the metadata (main title and section title), are referred to as passages. Segmenting the Wikipedia corpus resulted in 25.7 million passages, which act as basic units of retrieval. To form question-passage pairs for training DPR Retriever, we select the passage from the gold answer document that contains the rationale (either entirely, or a major proportion of it).
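A greedy version of the segmentation step might look like the sketch below. Several details are assumptions not specified in the text: sentences are split with a naive regular expression, leftover sentences shorter than the minimum are merged into the previous block, and the way the title and section title are attached in `make_passage` is purely illustrative.

```python
import re


def split_into_blocks(section_text, min_words=100):
    """Split a section into blocks of at least `min_words` words at sentence boundaries."""
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", section_text.strip()) if s]
    blocks, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= min_words:
            blocks.append(" ".join(current))
            current, count = [], 0
    if current:
        if blocks:
            # Leftover is below the minimum: merge it into the previous block (assumption).
            blocks[-1] = blocks[-1] + " " + " ".join(current)
        else:
            blocks.append(" ".join(current))  # short section: keep as a single block
    return blocks


def make_passage(title, section_title, block):
    """Attach metadata to a text block (the exact format here is hypothetical)."""
    return f"{title} | {section_title} | {block}"


# Toy example with a small minimum so the behaviour is easy to follow.
text = "One two three. Four five six. Seven eight."
print(split_into_blocks(text, min_words=3))
# ['One two three.', 'Four five six. Seven eight.']
```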

Following the original works, we use BERT Devlin et al. (2019) for DPR (both Retriever and Reader) and T5 Raffel et al. (2020) for FiD as base models. Since DPR Reader requires a span from the passage for each training example, we heuristically select the span from the gold passage that has the highest lexical overlap (F1 score) with the gold answer.
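The span-selection heuristic can be illustrated with a brute-force search over candidate spans. This is a sketch rather than the authors' implementation: the token-level F1 here uses lowercased whitespace tokens without the punctuation and article normalization common in QA evaluation scripts, and `max_len` is a hypothetical cap on span length.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings (simplified: lowercase, whitespace tokens)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_span(passage, answer, max_len=30):
    """Return the passage span with the highest F1 overlap with the gold answer."""
    tokens = passage.split()
    best, best_score = (0, 0), -1.0
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            score = token_f1(" ".join(tokens[start:end]), answer)
            if score > best_score:  # strict comparison keeps the earliest best span
                best, best_score = (start, end), score
    return " ".join(tokens[best[0]:best[1]]), best_score


# Toy passage based on the Daniel Radcliffe example in Section 3.3.
span, score = best_span("Daniel Radcliffe was born in London England in 1989", "London")
print(span, score)  # London 1.0
```

Because free-form answers need not occur verbatim in the passage, the selected span is only an approximation of the gold answer whenever the best F1 is below 1.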

For the query-rewriting module QR, we fine-tune a T5 model on rewrites of QReCC Anantha et al. (2021), and use that to generate the rewrites for TopiOCQA. We refer the reader to Appendix C for more details. The hyperparameters for each baseline model are listed in the appendix.


5.2 Evaluation metrics

Previous works in open-domain QA and CQA have used Exact Match and F1 as evaluation metrics. However, these works either had short free-form answers Reddy et al. (2019) or span-based answers Rajpurkar et al. (2016); Choi et al. (2018); Kwiatkowski et al. (2019). Following works in open-domain QA that have long free-form answers Nguyen et al. (2016); Kočiský et al. (2018), a setting similar to TopiOCQA, we include BLEU-4 Papineni et al. (2002) as an evaluation metric, in addition to EM and F1.

While comparing a prediction with a reference set, the maximum score over all references is considered. Given $n$ human answers, human performance on the task is determined by considering each answer as the prediction and the other $n-1$ human answers as the reference set. This results in $n$ scores, which are averaged to give the final human performance score. Note that human performance is not necessarily an upper bound for the task, as document retrieval can potentially be performed better by the systems.

Naively comparing the system prediction with all human answers gives an unfair advantage to the systems, as each human answer is compared only with the other human answers. To mitigate this bias, the system prediction is compared with n distinct reference sets, each containing n − 1 human answers. The final system score is computed by averaging the scores from these reference sets, similar to Choi et al. (2018) and Reddy et al. (2019). For TopiOCQA, n = 4 (the original answer and three additional annotations).
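The scoring protocol can be sketched as follows. This is a minimal sketch assuming a whitespace-tokenized, lowercased token-level F1 as the metric and leave-one-out reference sets; the function names are hypothetical, and any metric with the same signature (EM, BLEU) could be passed instead.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings (whitespace tokens, lowercased)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def max_over_references(prediction, references, metric):
    """Score a prediction against a reference set: take the best match."""
    return max(metric(prediction, ref) for ref in references)


def human_score(answers, metric):
    """Each human answer is scored against the remaining human answers."""
    scores = []
    for i, answer in enumerate(answers):
        others = answers[:i] + answers[i + 1:]
        scores.append(max_over_references(answer, others, metric))
    return sum(scores) / len(scores)


def system_score(prediction, answers, metric):
    """Average over the leave-one-out reference sets, mirroring the human setup."""
    scores = []
    for i in range(len(answers)):
        reference_set = answers[:i] + answers[i + 1:]
        scores.append(max_over_references(prediction, reference_set, metric))
    return sum(scores) / len(scores)
```

Because the system is scored against reference sets of the same size as those used for humans, the two numbers are directly comparable.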

6 Results and Discussion

Model                        Question Rep   Dev EM   Dev F1   Dev BLEU   Test EM   Test F1   Test BLEU
Human                        -              40.1     70.1     59.4       40.5      70.2      59.7
GPT-3                        -              12.4     33.4     20.4       10.4      13.8      18.5
BM25 + DPR Reader            Original        0.0      3.3      0.8        0.0       3.2       0.8
BM25 + DPR Reader            AllHistory      0.0      3.2      0.8        0.0       3.3       0.8
BM25 + DPR Reader            Rewrites        0.1      3.8      1.0        0.0       3.3       0.8
BM25 + FiD                   Original        8.9     19.7     13.1       10.2      20.7      14.0
BM25 + FiD                   AllHistory     22.7     35.9     29.2       22.3      35.6      28.7
BM25 + FiD                   Rewrites       24.2     41.7     33.4       23.9      41.1      32.9
DPR Retriever + DPR Reader   Original        4.2     14.1      7.8        3.3      13.8       7.2
DPR Retriever + DPR Reader   AllHistory     18.0     40.9     29.9       17.4      39.8      29.0
DPR Retriever + DPR Reader   Rewrites       16.0     34.9     25.4       15.5      33.9      24.4
DPR Retriever + FiD          Original        7.5     20.2     12.3        8.2      20.6      12.6
DPR Retriever + FiD          AllHistory     31.0     53.4     43.7       29.8      51.9      42.1
DPR Retriever + FiD          Rewrites       22.9     41.2     32.0       23.9      42.0      32.8
Table 5: Overall performance (EM, F1, BLEU) of all model variants on the TopiOCQA development and test sets

We report the end-to-end performance of all systems in Table 5. For open-book models, we also look at the performance of their constituents. Table 6 reports the retrieval performance and Table 7 reports the reading comprehension performance of the readers, given the gold passage. Based on these results, we try to address the following research questions.

How do the models compare against humans for TopiOCQA?

We report model and human performance on the development and test sets in Table 5. Overall, model performance in all settings is significantly lower than human performance. The best performing model (DPR Retriever + FiD using the AllHistory question representation) achieves 51.9 points F1 and 42.1 points BLEU on the test set, which falls short of human performance by 18.3 points and 17.6 points respectively, indicating room for further improvement.

Which class of models performs better: closed-book or open-book?

GPT-3 is directly comparable to the AllHistory variant of open-book models, as it takes the entire conversation history as input. Apart from BM25 + DPR Reader, GPT-3 performs worse than all other AllHistory variants of open-book models. It achieves a BLEU score of 18.5 on the test set, which is less than the best performing open-book model (DPR Retriever + FiD) by 23.6 points. We find that GPT-3 hallucinates many answers. For the question "in which country did the Battle of Sio took place", it hallucinates the response "the Battle of Sio was fought in the year 1223 between the forces of Mongol Empire and the Kingdom of Georgia", whereas the ground-truth answer is Papua New Guinea in 1943. This indicates that grounding the conversation in a knowledge source is an important step for TopiOCQA.

What impact do the various question representations have on the performance of open-book models?

For all open-book models, we fine-tune on three different question representations (Section 5). From the results in Table 5, we observe that the Original representation is consistently worse than the others for all models. This highlights the importance of encoding the conversational context for TopiOCQA. Between AllHistory and Rewrites, we observe that AllHistory performs better with the dense retriever (DPR Retriever), whereas Rewrites performs better with the sparse retriever (BM25). This difference is particularly significant for DPR Retriever + FiD, where the performance of the two question representations differs by 9.3 BLEU points on the test set. To confirm that this performance difference in end-to-end systems stems from the retriever, we look at the Top-20 and Top-100 retrieval accuracy of BM25 and DPR Retriever in Table 6. Indeed, the AllHistory representation performs better than Rewrites for DPR Retriever but worse for BM25. As DPR Retriever is trained on TopiOCQA, it can probably learn how to select relevant information from the AllHistory representation, whereas for BM25, which is not a trainable retriever, the non-relevant keywords in the representation act as distractors. In general, the better performance of DPR Retriever over BM25 indicates that TopiOCQA requires learning a task-specific dense semantic encoding for superior retrieval performance.
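For reference, Top-k retrieval accuracy can be computed as in the sketch below. It assumes the usual definition (the percentage of questions whose gold passage appears among the top k retrieved passages) and hypothetical passage identifiers; the precise matching criterion used for Table 6 is not spelled out in this section.

```python
def top_k_accuracy(ranked_passage_ids, gold_passage_ids, k):
    """Percentage of questions whose gold passage is among the top-k retrieved.

    ranked_passage_ids: one ranked list of passage ids per question.
    gold_passage_ids:   one gold passage id per question (same order).
    """
    hits = sum(
        1
        for ranked, gold in zip(ranked_passage_ids, gold_passage_ids)
        if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_passage_ids)
```

For example, with two questions whose gold passages are ranked second and third respectively, `top_k_accuracy(..., k=2)` counts only the first question as a hit.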

How much are the readers constrained due to retrieved results?

Model           Question Rep   Dev Top-20   Dev Top-100   Test Top-20   Test Top-100
BM25            Original         5.2          8.9           5.9           9.8
BM25            AllHistory      21.1         34.7          20.0          32.8
BM25            Rewrites        31.9         48.5          32.5          46.5
DPR Retriever   Original         7.3         14.1           8.2          14.6
DPR Retriever   AllHistory      67.0         80.6          64.0          78.8
DPR Retriever   Rewrites        48.6         60.4          48.1          59.2
Table 6: Retrieval performance of all model variants on the TopiOCQA development and test sets
Model              Question Rep   Dev F1   Dev BLEU   Test F1   Test BLEU
Extractive Bound   -              81.1     72.7       81.0      72.4
DPR Reader         Original       48.3     37.4       48.7      37.2
DPR Reader         AllHistory     54.1     42.5       52.5      40.8
DPR Reader         Rewrites       55.3     44.1       52.6      41.3
FiD                Original       59.2     49.5       59.5      49.5
FiD                AllHistory     64.4     54.9       62.7      53.1
FiD                Rewrites       60.6     50.8       60.8      51.3
Table 7: Reader performance (F1, BLEU) of all model variants on the TopiOCQA development and test sets when provided with the gold passage

Table 6 shows retrieval results. In an end-to-end system, the reader takes as input the retrieved passages, which may or may not contain the gold passage. To estimate reader performance independently of the retriever, we experiment with directly providing only the gold passage to the readers, instead of the retrieved ones. Table 7 shows the results. This can be seen as an “Ideal Retriever” setting, in which a retriever always retrieves the correct passage as the top passage. Although we observe significant gains over end-to-end systems for all models across all variants, the best model (FiD using the AllHistory representation) still falls short of human performance by 7.5 F1 points and 6.6 BLEU points on the test set. These experiments indicate that while passage retrieval is a significant bottleneck for an effective system, technical advancements are needed for the readers as well.

While it is plausible that DPR Reader has an upper bound on performance due to its extractive nature, we show that the current performance is far from it. We calculate the extractive upper bound for TopiOCQA (reported in Table 7) by selecting the span with the best F1 overlap from the gold document. This bound is 81.0 F1 points and 72.4 BLEU points on the test set, which essentially represents the best that any extractive model can do on this task.

7 Conclusion

We introduced TopiOCQA, a novel open-domain conversational question answering dataset with topic switching. In this work, we described our data collection effort, analyzed its topic switching behaviour, and established strong neural baselines. The best performing model (DPR Retriever + FiD) is 18.3 F1 points and 17.6 BLEU points below human performance, suggesting that advances in modeling are needed. We hope our dataset will be an important resource to enable more research on conversational agents that support topic switches in information-seeking scenarios.


References

  • R. Anantha, S. Vakulenko, Z. Tu, S. Longpre, S. Pulman, and S. Chappidi (2021) Open-domain question answering goes conversational via question rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).
  • E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer (2018) QuAC: question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2174–2184.
  • J. Dalton, C. Xiong, and J. Callan (2020) TREC CAsT 2019: the conversational assistance track overview. arXiv:2003.13624.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • A. Elgohary, D. Peskov, and J. Boyd-Graber (2019) Can you unpack that? learning to rewrite questions-in-context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5918–5924.
  • G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 874–880.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2021–2031.
  • M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1601–1611.
  • V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6769–6781.
  • T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6086–6096.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: a human generated machine reading comprehension dataset.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, USA, pp. 311–318.
  • F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019) Language models as knowledge bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2463–2473.
  • C. Qu, L. Yang, C. Chen, M. Qiu, W. B. Croft, and M. Iyyer (2020) Open-retrieval conversational question answering. SIGIR ’20, New York, NY, USA, pp. 539–548. ISBN 9781450380164.
  • C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, and M. Iyyer (2019) BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ISBN 9781450361729.
  • C. Qu, L. Yang, M. Qiu, Y. Zhang, C. Chen, W. B. Croft, and M. Iyyer (2019) Attentive history selection for conversational question answering. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ISBN 9781450369763.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 784–789.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392.
  • S. Reddy, D. Chen, and C. Manning (2019) CoQA: a conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, pp. 249–266. ISSN 2307-387X.
  • A. Roberts, C. Raffel, and N. Shazeer (2020) How much knowledge can you pack into the parameters of a language model?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 5418–5426.
  • S. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford (1995) Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pp. 109–126.
  • A. Rogers, M. Gardner, and I. Augenstein (2021) QA dataset explosion: a taxonomy of NLP resources for question answering and reading comprehension. arXiv preprint arXiv:2107.12708.
  • H. Sacks and G. Jefferson (1995) Winter 1971. In Lectures on Conversation, pp. 289–331. ISBN 9781444328301.
  • A. Shrivastava and P. Li (2014) Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27.
  • A. Spink, H. C. Ozmutlu, and S. Ozmutlu (2002) Multitasking information seeking and searching processes. Journal of the American Society for Information Science and Technology 53 (8), pp. 639–652.
  • M. Stede and D. Schlangen (2004) Information-seeking chat: dialogue management by topic structure.
  • S. Vakulenko, S. Longpre, Z. Tu, and R. Anantha (2021) Question rewriting for conversational question answering. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM ’21, New York, NY, USA, pp. 355–363. ISBN 9781450382977.
  • D. Walton (2019) The new dialectic: conversational contexts of argument. University of Toronto Press. ISBN 9781442681859.
  • S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang (2018).
  • Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang (2019) Multi-passage BERT: a globally normalized BERT model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5878–5882.
  • W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin (2019) End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 72–77.
  • Y. Yang, W. Yih, and C. Meek (2015) WikiQA: a challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2013–2018.
  • C. Zhu, M. Zeng, and X. Huang (2019) SDNet: contextualized attention-based deep network for conversational question answering. arXiv:1812.03593.

Appendix A Dataset Collection Details

Each conversation in TopiOCQA is an interaction between two annotators – a questioner and an answerer. The annotators were selected from’s in-house workforce, based on their English language proficiency. Each annotator is trained for the role of both questioner and answerer. The annotators are provided the following guidelines.

Guidelines for the questioner:

  • The first question should be unambiguous and about the seed entity.

  • The follow-up questions should be contextualized and dependent on the conversation history whenever possible.

  • Avoid using same words as in section titles of the document. E.g. if the section title is “Awards”, a plausible question can be “What accolades did she receive for her work?”.

  • The conversation should involve multiple documents (topics).

Guidelines for the answerer:

  • Based on the question, identify the relevant document and section.

  • The answer should be based on the contents of the identified document.

  • The rationale should be selected such that it justifies the answer.

  • The answer should be a sub-string of the rationale whenever possible. However, answers should be edited to fit the conversational context (e.g. adding yes or no), to perform reasoning (e.g. counting), etc.

  • Personal opinions should never be included.

(a) Questioner Interface
(b) Answerer Interface
Figure 6: Annotation interface for questioners and answerers
Q: when was the byzantine empire born what was it originally called?
A: 5th century AD and was called Eastern Roman Empire, or Byzantium
Topic: Byzantine Empire
Q: and when did it fall?
A: 1453
Topic: Byzantine Empire
Q: which battle or event marked the fall of this empire?
A: A six-year-long civil war followed by attack from Sultan Mehmed’s army
Topic: Byzantine Empire
Q: did he conquer other territories as well?
A: Yes. Anatolia and in Southeast Europe as far west as Bosnia
Topic: Mehmed the Conqueror
Q: where is the first area located in present day terms?
A: Turkey
Topic: Anatolia
Q: who were the oldest known inhabitants of this region?
A: Mesopotamian-based Akkadian Empire
Topic: Anatolia
Q: what is the present day capital of the country?
A: Ankara
Topic: Turkey
Q: can you name some of the other major cities here?
A: Istanbul
Topic: Turkey
Q: were any of these cities associated with the first empire you were discussing?
A: The Ottomans made the city of Ankara the capital first of the Anatolia Eyalet and then the Angora Vilayet
Topic: Ankara
Q: what are some of the most famous landmarks in the second city?
A: The obelisk, Valens Aqueduct, Column of Constantine, Church of the Saints Sergius and Bacchus
Topic: Istanbul
Q: who was the first monument you mentioned dedicated to?
Q: and who was the third monument name after?
A: Roman emperor Constantine the Great
Topic: Column of Constantine
Q: what is it made of?
A: Porphyry and white marble
Topic: Column of Constantine
Q: how tall is it?
A: The column’s top is 34.8 m above the present-day ground level but the original height of the monument as a whole would have been nearly 50 m tall
Topic: Column of Constantine
Figure 7: A full conversation from TopiOCQA.

After providing the guidelines and a few examples, the initial annotated conversations were manually inspected by the authors. The workers who provided low-quality annotations during this inspection phase were disqualified. The final workforce consisted of 15 workers, who provided annotations for the dataset over a period of two months. Random quality checks were performed by the authors, and periodic feedback was given to the workers throughout the data collection to maintain high data quality. Figure 6 shows the annotation interfaces for the questioner and the answerer. Figure 7 shows an example from the dataset.

We also implemented several real-time checks in the questioner’s interface to encourage topic switching and use of co-reference, and to reduce the lexical overlap with the metadata of the document while forming the question.

Appendix B Question Characteristics

(a) TopiOCQA
(b) CoQA
(c) QReCC
Figure 8: Distribution of question trigrams in TopiOCQA, CoQA, and QReCC

We plot the distribution of frequent trigram prefixes of the questions in TopiOCQA in Figure 8(a). Our dataset consists of a diverse set of question types revolving around people (Who is the …?), locations (Where was it …?), dates (When was it …?), yes/no (Did he …?), etc. The distribution is similar to that of other information-seeking conversational QA datasets such as CoQA and QReCC (Figure 8).

Appendix C Query Rewriting

A query-rewriting module takes the current question and the conversation history as input and provides a decontextualized rewritten question as the output. As we don’t collect rewrites in TopiOCQA, we rely on other datasets to train our model. Two datasets that provide rewrites for information-seeking conversations are CANARD Elgohary et al. (2019) and QReCC Anantha et al. (2021). Due to its large-scale and diverse nature, we use QReCC to train our T5-based model.

To rewrite the n-th question using the conversation history, the input to T5 is q_1 [SEP] a_1 [SEP] q_2 [SEP] a_2 [SEP] … [SEP] q_{n-1} [SEP] a_{n-1} [SEP] q_n, where q_i and a_i denote the question and answer of the i-th turn. We train this model on the QReCC dataset. On the test split of QReCC, our model achieves a BLEU score of 62.74 points. We use this model to generate rewrites for TopiOCQA in our experiments.
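The input construction can be sketched as below. This is a minimal sketch assuming that the history is serialized as alternating questions and answers followed by the current question, with a literal `[SEP]` string surrounded by single spaces; the helper name `build_rewrite_input` and the exact spacing are assumptions.

```python
def build_rewrite_input(history, current_question, sep="[SEP]"):
    """Serialize (question, answer) history pairs plus the current question.

    history: list of (question, answer) tuples for the previous turns, oldest first.
    """
    turns = []
    for question, answer in history:
        turns.extend([question, answer])
    turns.append(current_question)
    # Assumption: turns are joined with the separator padded by single spaces.
    return f" {sep} ".join(turns)
```

The resulting string would then be tokenized and passed to the fine-tuned T5 model, whose output is taken as the rewritten question.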

Appendix D Hyperparameter Details

We use Lucene BM25 with the default values of its k1 and b parameters. For DPR Retriever, apart from the batch size, we use the hyperparameters suggested in their codebase. We use the base model size with a batch size of 32, dropout of 0.1, and learning rate of 2e-5. We train the DPR Retriever for 40 epochs on four 40GB NVIDIA A100 GPUs, and select the model checkpoint based on the best development set performance. DPR Reader (also using the base size) is trained with a batch size of 16, learning rate of 1e-5, and dropout of 0.1. The training is run for 20 epochs on eight 32GB Tesla V100 GPUs. The model checkpoint with the best EM score on the development set is selected.

For FiD, we re-use the hyperparameters suggested in their codebase. FiD was originally trained on 64 GPUs. We train on eight 32GB Tesla V100 GPUs with 8-step gradient accumulation to get the same effective batch size of 64. We use a learning rate of 5e-5 and train for 15000 gradient steps. For each training example, we encode the top 50 retrieved passages (the original implementation uses 100). The model checkpoint with the best EM score on the development set is selected.
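The effective batch size arithmetic can be checked with a short sketch. It assumes a per-GPU batch size of 1, which is what the numbers above imply (64 GPUs giving a batch size of 64) but is not stated explicitly; the function name is hypothetical.

```python
def effective_batch_size(num_gpus, per_gpu_batch_size, accumulation_steps):
    """Examples contributing to one optimizer update under data parallelism."""
    return num_gpus * per_gpu_batch_size * accumulation_steps


# Original FiD setup: 64 GPUs, no accumulation (per-GPU batch size of 1 assumed).
original = effective_batch_size(num_gpus=64, per_gpu_batch_size=1, accumulation_steps=1)

# Setup described above: 8 GPUs with 8-step gradient accumulation.
ours = effective_batch_size(num_gpus=8, per_gpu_batch_size=1, accumulation_steps=8)
```

Both configurations give 64 examples per update, so the optimizer sees the same batch size at the cost of eight forward and backward passes per step.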