What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams

09/28/2020
by Di Jin, et al.

Open-domain question answering (OpenQA) tasks have recently attracted growing attention from the natural language processing (NLP) community. In this work, we present MedQA, the first free-form multiple-choice OpenQA dataset for solving medical problems, collected from professional medical board exams. It covers three languages, English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Our experiments show that even the current best method achieves test accuracy of only 36.7%, 42.0%, and 70.1% on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to pose a substantial challenge to existing OpenQA systems and hope it can serve as a platform for developing much stronger OpenQA models in the NLP community.
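To make the "document retriever followed by a machine comprehension model" pipeline concrete, here is a minimal rule-based sketch. It is not the authors' implementation: the function names, the smoothed IDF weighting, the question-only retrieval step, and the three-sentence toy corpus are all assumptions chosen for illustration. The retriever ranks passages by weighted term overlap with the question, and the "reader" simply picks the option whose terms are best supported by the retrieved passages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}


def score(query, doc, idf):
    """IDF-weighted count of query terms that occur in the document."""
    doc_counts = Counter(tokenize(doc))
    return sum(doc_counts[t] * idf.get(t, 0.0) for t in set(tokenize(query)))


def retrieve(question, docs, idf, k=2):
    """Step 1 (retriever): return the k passages that best match the question."""
    return sorted(docs, key=lambda d: score(question, d, idf), reverse=True)[:k]


def read(options, evidence, idf):
    """Step 2 (rule-based reader): pick the option best supported by the evidence."""
    return max(options, key=lambda o: sum(score(o, d, idf) for d in evidence))


def answer(question, options, docs, k=2):
    """Run the retriever and the reader in sequence."""
    idf = build_idf(docs)
    evidence = retrieve(question, docs, idf, k)
    return read(options, evidence, idf)


# Toy corpus and question (hypothetical, not taken from MedQA).
DOCS = [
    "Iron deficiency anemia causes fatigue and pallor.",
    "Scurvy results from vitamin C deficiency and causes bleeding gums.",
    "Rickets results from vitamin D deficiency and causes bone deformities in children.",
]
QUESTION = "A child presents with bone deformities. Which deficiency is most likely?"
OPTIONS = ["vitamin C", "vitamin D", "iron"]

print(answer(QUESTION, OPTIONS, DOCS))  # prints "vitamin D"
```

A neural variant would keep the same two-stage shape and replace `read` with a trained machine comprehension model that scores each option against the retrieved passages.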


