MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

11/28/2016
by   Tri Nguyen, et al.

This paper presents our recent work on the design and development of a new, large-scale dataset for MAchine Reading COmprehension, which we name MS MARCO. The dataset aims to overcome a number of well-known weaknesses of previous publicly available datasets for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated. Finally, a subset of these queries has multiple answers. We aim to release one million queries and the corresponding answers in the dataset, which, to the best of our knowledge, is the most comprehensive real-world dataset of its kind in both quantity and quality. We are currently releasing 100,000 queries with their corresponding answers to inspire work in reading comprehension and question answering and to gather feedback from the research community.


research
02/13/2022

PQuAD: A Persian Question Answering Dataset

We present Persian Question Answering Dataset (PQuAD), a crowdsourced re...
research
05/01/2020

Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset

Machine reading comprehension has made great progress in recent years ow...
research
11/14/2017

DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

In this paper, we introduce DuReader, a new large-scale, open-domain Chi...
research
06/10/2018

Adaptations of ROUGE and BLEU to Better Evaluate Machine Reading Comprehension Task

Current evaluation metrics to question answering based machine reading c...
research
03/26/2018

CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension

We present a new dataset for machine comprehension in the medical domain...
research
10/04/2016

Embracing data abundance: BookTest Dataset for Reading Comprehension

There is a practically unlimited amount of natural language data availab...
research
03/25/2023

Thistle: A Vector Database in Rust

We present Thistle, a fully functional vector database. Thistle is an en...
