WikiQA Dataset


The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. To reflect the real information needs of general users, the questions were drawn from Bing query logs. Each question is linked to a Wikipedia page that potentially contains the answer. The dataset contains ~3k questions and ~30k sentences; through crowdsourcing, a small subset of the sentences was labeled as correct answer sentences for their corresponding questions.
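The structure described above (a question, a linked Wikipedia page, candidate sentences, and a binary correctness label) can be pictured with a minimal Python sketch. The class name SentencePair, its field names, the helper functions, and the toy records are all assumptions made for illustration; they are not the dataset's actual file format or contents.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One question/sentence pair (hypothetical field names)."""
    question: str    # query-style question
    page_title: str  # Wikipedia page linked to the question
    sentence: str    # candidate sentence taken from that page
    label: int       # 1 if labeled as a correct answer sentence, else 0


def group_by_question(pairs):
    """Collect the candidate sentences belonging to each question."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.question].append(pair)
    return dict(grouped)


def answer_sentences(pairs):
    """Return only the sentences labeled as correct answers."""
    return [p.sentence for p in pairs if p.label == 1]


# Toy records invented for this sketch; they are NOT taken from the dataset.
TOY_PAIRS = [
    SentencePair("how are glacier caves formed", "Glacier cave",
                 "A glacier cave is a cave formed within the ice of a glacier.", 1),
    SentencePair("how are glacier caves formed", "Glacier cave",
                 "Glacier caves are often called ice caves.", 0),
    SentencePair("who invented the widget", "Widget",
                 "A widget is a small device.", 0),
]

if __name__ == "__main__":
    for question, candidates in group_by_question(TOY_PAIRS).items():
        answers = answer_sentences(candidates)
        print(f"{question}: {len(candidates)} candidates, {len(answers)} labeled correct")
```

Grouping by question keeps each question's candidates together, which is the natural unit for answer sentence selection: a model scores every candidate for a question and is judged on whether the labeled sentences are ranked highly.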