Huatuo-26M, a Large-scale Chinese Medical QA Dataset

05/02/2023
by Jianquan Li, et al.

In this paper, we release the largest medical Question Answering (QA) dataset to date, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. Experimental results show that existing models perform far below expectations and that the released dataset remains challenging in the era of pre-trained language models. Moreover, we experimentally show the benefit of the proposed dataset in three respects: (i) training models that transfer to other QA datasets in a zero-shot fashion; (ii) serving as external knowledge for retrieval-augmented generation (RAG); and (iii) improving existing pre-trained language models by using the QA pairs as a pre-training corpus for continued training. We believe that this dataset will not only contribute to medical research but also benefit both patients and clinicians. See <https://github.com/FreedomIntelligence/Huatuo-26M>.

research
10/14/2021

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

With the rise of large-scale pre-trained language models, open-domain qu...
research
09/03/2023

MedChatZH: a Better Medical Adviser Learns from Better Instructions

Generative large language models (LLMs) have shown great success in vari...
research
03/19/2021

Controllable Generation from Pre-trained Language Models via Inverse Prompting

Large-scale pre-trained language models have demonstrated strong capabil...
research
01/23/2023

PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

The field of Question Answering (QA) has made remarkable progress in rec...
research
06/06/2022

Learning to Ask Like a Physician

Existing question answering (QA) datasets derived from electronic health...
research
05/18/2023

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

Vision-language pre-training (VLP) models have been demonstrated to be e...
research
05/11/2023

Long-Tailed Question Answering in an Open World

Real-world data often have an open long-tailed distribution, and buildin...
