
MedDialog: A Large-scale Medical Dialogue Dataset

by Shu Chen, et al.

Medical dialogue systems show promise for assisting telemedicine: they can increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build a large-scale medical dialogue dataset – MedDialog – that contains 1.1 million conversations between patients and doctors and 4 million utterances. To the best of our knowledge, MedDialog is the largest medical dialogue dataset to date. The dataset is available at



1 Introduction

Telemedicine refers to the practice of delivering patient care remotely, where doctors provide medical consultations to patients using HIPAA-compliant video-conferencing tools. As an important complement to traditional face-to-face medicine practiced in hospitals and clinics, telemedicine has a number of advantages. First, it increases access to care. For people living in medically under-served communities (e.g., rural areas) that have a shortage of clinicians, telemedicine enables faster and cheaper care than traveling a long distance to visit a clinician. Second, it reduces healthcare costs. A study by Jefferson Health shows that diverting patients from emergency departments with telemedicine can save more than $1,500 per visit. Third, telemedicine can improve quality of care. The study in (pande2015leveraging) shows that telemedicine patients score lower for depression, anxiety, and stress, and have 38% fewer hospital admissions. Other advantages include improved patient engagement and satisfaction and improved provider satisfaction. Please refer to (wootton2017introduction) for a more comprehensive review.

While telemedicine is promising, it has several limitations. First, it puts an additional burden on physicians. In addition to practicing face-to-face medicine, which already keeps physicians highly occupied, physicians need to provide remote consultations, which further increases the risk of physician burnout. Second, unlike in-hospital patients, whose medical conditions can be easily tracked by clinicians as they progress, remote patients are difficult to track and monitor. To address these problems, there has been increasing research interest in developing artificial intelligence (AI) methods to assist in telemedicine. In particular, medical dialogue systems are being developed to serve as “virtual doctors”. These “virtual doctors” are intended to interact with patients via natural dialogues, asking about the patients’ medical conditions and history and providing clinical advice. They can also proactively reach out to patients to ask about the progression of their conditions and provide timely interventions accordingly.

To build medical dialogue systems, a large collection of conversations between patients and doctors is needed as training data. Due to data privacy concerns, such data is very difficult to obtain. The existing medical dialogue datasets are limited in size or biased toward certain diseases, so they cannot adequately support training medical dialogue systems that achieve doctor-level intelligence and cover all specialties in medicine.

To address the limitations of existing datasets, we build a large-scale medical dialogue dataset that contains 1.1 million patient-doctor consultations and 4 million utterances. It covers almost all specialties in medicine, ranging from internal medicine to family medicine, and a wide spectrum of diseases, including cancer, pneumonia, etc. To the best of our knowledge, it is the largest medical dialogue dataset to date. The data is open to the public.

2 Dataset

The MedDialog dataset contains 1,145,231 consultations between patients and doctors. The total number of utterances is 3,959,333: 2,179,008 from doctors and 1,780,325 from patients. Each consultation consists of three parts: (1) a description of the patient’s medical condition and history; (2) the conversation between patient and doctor; (3) (optional) diagnosis and treatment suggestions given by the doctor. The description of the patient’s medical condition and history includes the following fields: present disease, detailed description of present disease, what help is needed from the doctor, duration of the disease, medications, allergies, and past disease. Figure 1 shows an exemplar consultation. In some conversations, multiple consecutive utterances come from the same person (either doctor or patient) and were posted at different time points. If we combine consecutive utterances from the same person into a single one, there are 3,209,660 utterances: 1,981,844 from doctors and 1,227,816 from patients. The data is crawled from haodf.com, an online platform of healthcare services, including medical consultation, scheduling appointments with doctors, etc.
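The merging of consecutive same-speaker utterances described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `(speaker, text)` tuple representation, the function name `merge_consecutive`, the separator, and the sample dialogue are all assumptions made for the example.

```python
from itertools import groupby


def merge_consecutive(utterances, sep=" "):
    """Merge runs of consecutive utterances from the same speaker.

    `utterances` is a chronologically ordered list of (speaker, text)
    pairs. This layout is hypothetical, not the dataset's actual format.
    `sep` joins the merged texts; for Chinese text an empty string may
    be a better choice (also an assumption).
    """
    merged = []
    # groupby only groups *adjacent* items with the same key, which is
    # exactly the "consecutive utterances" condition.
    for speaker, run in groupby(utterances, key=lambda u: u[0]):
        merged.append((speaker, sep.join(text for _, text in run)))
    return merged


# Invented sample dialogue, for illustration only.
dialogue = [
    ("patient", "I have had a cough for three days."),
    ("patient", "There is no fever."),
    ("doctor", "Is the cough dry or productive?"),
    ("patient", "Dry."),
]
merged = merge_consecutive(dialogue)
for speaker, text in merged:
    print(f"{speaker}: {text}")
```

Here the two opening patient utterances collapse into one, so four raw utterances become three merged ones. The same effect explains why the dataset's utterance count drops from 3,959,333 to 3,209,660 after merging.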

The consultations cover 29 broad categories of specialties, including internal medicine, pediatrics, dentistry, etc., and 172 fine-grained specialties, including cardiology, neurology, gastroenterology, urology, etc. The consultations were conducted from 2010 to 2020.

Figure 1: An exemplar consultation, which includes (1) description of medical conditions and history of the patient, (2) dialogue between doctor and patient, and (3) diagnosis and treatment suggestions given by the doctor.
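The three-part consultation layout can also be pictured as a small record type. The sketch below is only one possible in-memory representation: every class and field name is hypothetical, chosen to mirror the fields listed in the text, and the released files may be organized differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Description:
    """Part (1): the patient's medical condition and history."""
    present_disease: str
    detailed_description: str
    help_needed: str
    duration: str
    medications: str
    allergies: str
    past_disease: str


@dataclass
class Utterance:
    speaker: str  # "doctor" or "patient"
    text: str


@dataclass
class Consultation:
    description: Description
    # Part (2): the conversation between patient and doctor.
    dialogue: List[Utterance] = field(default_factory=list)
    # Part (3) is optional, so it defaults to None.
    diagnosis_and_suggestions: Optional[str] = None


# Invented example values, for illustration only.
example = Consultation(
    description=Description(
        present_disease="cough",
        detailed_description="dry cough for three days, no fever",
        help_needed="should I see a doctor in person?",
        duration="three days",
        medications="none",
        allergies="none reported",
        past_disease="none",
    ),
    dialogue=[
        Utterance("patient", "The cough is worse at night."),
        Utterance("doctor", "Do you have any chest pain?"),
    ],
)
```

Modeling part (3) as `Optional` reflects the statement that diagnosis and treatment suggestions are not present in every consultation.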

2.1 Advantages of the dataset

  • Large number of conversations and utterances. To the best of our knowledge, MedDialog is the largest medical dialogue dataset. It has about 1.1 million conversations and 4 million utterances.

  • Broad coverage of medical specialties. The consultations cover 29 broad categories of specialties and 172 fine-grained specialties.

  • Diversity of the patients. The patients are from 31 provincial-level administrative divisions in China, with different ethnicities, ages, genders, occupations, education levels, incomes, etc. Such diversity helps reduce population bias in the dataset.

2.2 Limitations of the dataset

  • The language is Chinese, which makes the dataset difficult for non-Chinese-speaking researchers to work with.

  • The patients are from China. The dataset may be biased toward the Chinese population.

  • The doctors are from China. The medical consultations, diagnoses, and treatment recommendations may be biased toward the practice of medicine in China.

3 Conclusions

To facilitate the research and development of medical dialogue systems that can potentially assist in telemedicine, we build a large-scale medical dialogue dataset that contains 1.1 million conversations between patients and doctors and 4 million utterances. The dataset is publicly available and is continuously growing.