DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

03/19/2022
by   Yifu Qiu, et al.

In this paper, we present DuReader_retrieval, a large-scale Chinese dataset for passage retrieval. DuReader_retrieval contains more than 90K queries and over 8M unique passages from Baidu search. To ensure the quality of our benchmark and address the shortcomings of existing datasets, we (1) reduce false negatives in the development and testing sets by pooling the results from multiple retrievers for human annotation, and (2) remove semantically similar questions shared between the training set and the development and testing sets. We further introduce two extra out-of-domain testing sets for benchmarking domain generalization capability. Our experimental results demonstrate that DuReader_retrieval is challenging and that there is still plenty of room for the community to improve, e.g., in generalization across domains, in handling salient phrase and syntax mismatch between query and paragraph, and in robustness. DuReader_retrieval will be publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval
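
The pooling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pool_candidates` and `unjudged`, the retriever names, the passage ids, and the cutoff `k` are all hypothetical. It only shows the general idea of merging the top-ranked passages from several retrievers into one candidate pool, so that annotators can judge passages that were never labeled and would otherwise count as negatives.

```python
from typing import Dict, List, Set


def pool_candidates(rankings: Dict[str, List[str]], k: int) -> List[str]:
    """Return the union of each retriever's top-k passage ids.

    Retrievers are visited in sorted name order and duplicates are
    dropped, so the output order is deterministic.
    """
    seen: Set[str] = set()
    pooled: List[str] = []
    for name in sorted(rankings):
        for pid in rankings[name][:k]:
            if pid not in seen:
                seen.add(pid)
                pooled.append(pid)
    return pooled


def unjudged(pooled: List[str], labeled_positive: Set[str]) -> List[str]:
    """Pooled passages with no existing positive label.

    These are the candidates a human annotator would need to review.
    """
    return [pid for pid in pooled if pid not in labeled_positive]


# Hypothetical ranked lists from two retrievers for a single query.
rankings = {
    "bm25": ["p1", "p2", "p3"],
    "dense": ["p2", "p4", "p5"],
}
pooled = pool_candidates(rankings, k=2)
to_annotate = unjudged(pooled, labeled_positive={"p1"})
```

In this toy example, `to_annotate` holds the passages that at least one retriever ranked highly but that carry no positive label yet; sending those to annotators is what reduces false negatives in an evaluation set.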

research
04/07/2023

T2Ranking: A large-scale Chinese Benchmark for Passage Ranking

Passage ranking involves two stages: passage retrieval and passage re-ra...
research
07/30/2021

MTVR: Multilingual Moment Retrieval in Videos

We introduce mTVR, a large-scale multilingual video moment retrieval dat...
research
10/24/2020

COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval

We present a large challenging dataset, COUGH, for COVID-19 FAQ retrieva...
research
03/07/2022

Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

Passage retrieval is a fundamental task in information retrieval (IR) re...
research
10/25/2022

Instance Segmentation for Chinese Character Stroke Extraction, Datasets and Benchmarks

Stroke is the basic element of Chinese character and stroke extraction h...
research
11/03/2020

CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search

Neural rankers based on deep pretrained language models (LMs) have been ...
research
06/08/2021

Neural Extractive Search

Domain experts often need to extract structured information from large c...
