PadChest: A large chest x-ray image dataset with multi-label annotated reports

We present a labeled, large-scale, high-resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set, achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is the largest public chest x-ray database suitable for training supervised models concerning radiographs, and the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded from


1 Introduction

Chest x-rays are essential for both the screening and the diagnosis of pulmonary, cardiovascular, bone and other thoracic disorders. The adequate interpretation of the radiographic findings requires medical training acquired over many years, with radiologists being the most qualified professionals in this field. Due to increasing workload pressures, many radiologists today have to read more than 100 x-ray studies daily. Therefore, automated tools trained to predict the risk of specific abnormalities given a particular x-ray image have the potential to support the reading workflow of the radiologist. Those tools could be used to enhance the confidence of the radiologist or to prioritize the reading list so that critical cases are read first. Decision support systems (DSS) designed as tools to assist in the clinical interpretation of chest x-rays would therefore fulfill an unmet need.

Deep learning techniques are currently obtaining promising results and perform extremely well in a variety of sophisticated tasks (naturedl), especially those related to computer vision, often equaling or exceeding human performance (Goodfellow2016DeepLearning). The application of deep neural networks to medical imaging, and to chest radiographs in particular, has become a growing area of research in recent years (qin2018computer). For instance, wang2017chestx trained a Convolutional Neural Network (CNN) to classify and localize 8 pathologies using the chest x-ray database (ChestX-Ray8), which comprised 108,948 frontal-view x-ray images of 32,717 different patients. Using the same repository, rajpurkar2017chexnet extended the annotations to 14 different pathologies (ChestX-Ray14) and designed a model with a deeper CNN architecture to classify images into 14 pathological entities. This method was reported to obtain greater diagnostic efficiency in the detection of pneumonia when compared to that of radiologists. guan2018diagnose proposed an attention-guided CNN to help combine global and local information in order to improve recognition performance. ChestX-Ray14 was also employed by wang2018tienet to design a network architecture that combines text and image with attention mechanisms capable of generating text that describes the image, while jing2017automatic introduced a hierarchical model of Recurrent Neural Networks (RNN) with which to generate long paragraphs from the images and obtain semantically linked phrases for the same purpose.

Despite claims that they achieve and/or surpass physician-level performance, current deep learning models for the classification of pathologies using chest x-rays are proving not to be generalizable across institutions and not yet ready for adoption in real-world clinical settings (zech2018variable). Moreover, warnings of potential unintended consequences of their use are discussed by cabitza2017unintended.

It is unclear how to extend the significant success in computer vision using deep neural networks to the medical domain (shin2017book). Open questions concerning medical radiology datasets that still need to be addressed are:

  • How to annotate the huge amount of medical images required by deep learning models (shin2017book) and meet the required quality. Large-scale crowd-sourced hand-annotation, which has proved successful in the general domain, e.g. ImageNet (deng2009imagenet), is not feasible because of the medical expertise required to carry it out. This is compounded by the fact that the semantic interpretation and extraction of medical knowledge from corpora of medical text in unstructured natural language remains a challenge (Weng_2010) and is an area of active research (bustos2018learning).

  • Which clinically relevant image labels need to be defined and which criteria should be followed to annotate them (shin2017book).

  • How to deal with uncertainties in radiology texts. Medical data is characterized by uncertainty and incompleteness, and machine learning decision support systems (ML-DSS) need to adapt to input data reflecting the nature of medical information, rather than imposing an idea of data accuracy and completeness that does not fit patient records and medical registries, for which data quality is far from optimal. In this respect, cabitza2017unintended advises caution as regards the unintended consequences of adopting ML-DSS that disregard context and ignore the fact that observer variability stems not only from interpretative deficiencies but also from intrinsic variability in the observed phenomena.

  • How to effectively control for potential confounding factors, such as the presence of tubes, catheters, the quality of the image as assessed by radiologists, patient position, etc. and unbalanced entity prevalence, which models learn to exploit as predictive features to the detriment of clinical radiological patterns.

There are a number of publicly available chest x-ray datasets that can be used for image classification and retrieval tasks. The US National Institutes of Health (NIH) repository (wang2017chestx) contains 112,120 frontal-view chest x-rays, corresponding to 30,805 different patients, and multi-labeled with 14 different thoracic diseases (rajpurkar2017chexnet). The Korean Institute of Tuberculosis dataset (ryoo2014activities) consists of 10,848 DICOMs, of which 3,828 show tuberculosis abnormalities. The Indiana University dataset (demner2015preparing) comprises 7,470 frontal and lateral chest x-ray images, corresponding to 3,955 radiology reports with disease annotations, such as cardiac hypertrophy, pulmonary edema, opacity, or pleural effusion. The JSRT dataset (shiraishi2000development) consists of 247 x-rays, of which 154 have been labeled for lung nodules (100 malignant). van2006segmentation also provides masks of the lung area for the evaluation of segmentation performance. The Shenzhen dataset (jaeger2014two) has a total of 662 images belonging to two categories (normal and tuberculosis).

With regard to the annotation methods applied, shin2016interleaved; shin2017book used a large dataset that included 780,000 documents from 216,000 images comprising CT, MR, PET and other image modalities, and automatically mined categorical semantic labels using a non-parametric topic modeling method. The resulting annotations were judged to be non-specific. In order to increase disease specificity, the authors matched frequent pathology types using a disease ontology and semantics. This method, however, assigned specific disease labels to only around 10% of the dataset. In ChestX-Ray8, wang2017chestx used MetaMap (aronson2010overview), DNorm (leaman2015challenges) and custom negation rules applied to a syntactic parser in order to label the presence or absence of 8 entities, which was further expanded to 14 entities in 125,000 images (ChestX-Ray14), also by applying analogous methods. They validated the image labeling procedure against 3,800 annotated reports of x-ray images from the OpenI Indiana database (demner2015preparing).

Outstanding questions, such as those mentioned above, still remain unaddressed by the medical image datasets currently available. This problem is compounded by the fact that the medical annotations used as a ground-truth are reduced to a small number of entities and, on many occasions, owing to the inherent limitations of the natural language processing (NLP) techniques applied to automate their extraction, these annotations contain omissions and inconsistencies and are not validated by physicians.

In this work, we propose a dataset called PadChest (PAthology Detection in Chest radiographs) which is, to the best of our knowledge, one of the largest and most exhaustively labeled public chest x-ray datasets and the only one to contain the source report excerpts, which are written in Spanish. The labels of the images are mapped onto standard Unified Medical Language System (UMLS) terminology and can therefore be used regardless of the language. They cover the full spectrum of thoracic entities, contrasting with the much more reduced number of entities annotated in previous datasets. Moreover, entities are localized with anatomical labels, images are stored in high resolution, and extensive information is provided, including the patient’s demography, type of projection and acquisition parameters for the imaging study, among others.

In contrast to previous approaches that relied solely on automatic annotation tools, in this work the labeling required for the baseline ground-truth was carried out manually by trained physicians. While tools to annotate medical text in Spanish are less extensively developed and tested than in English, the most compelling reason for opting for an annotation carried out manually by physicians was to maximize the reliability of the baseline ground-truth.

Using a laboriously developed ground-truth as a baseline, we propose a supervised annotation method based on deep neural networks and labeling resolution rules in order to extract labels from the remaining reports (73% of the samples). This pipeline is designed for large-scale exhaustive annotation of Spanish chest x-ray reports and aims to overcome some common NLP challenges in the medical domain. For instance, the proposed method deals with anaphora resolution in which findings are mentioned in different sentences or even in different studies for the same patient and whose interpretation depends upon another expression in that context. This method also deals with co-reference resolution, in which words for locations are related to the entities that they are describing, and also with hedging statements, in which uncertainty, probability and indirect expressions are to be learned by deep neural models.

Moreover, in an effort to incorporate uncertainty into the ground-truth, we differentiate radiographic findings from differential diagnoses, acknowledging the existence of two distinct image classification problems. Radiographic findings are completely observable in the images and we therefore hypothesize that state-of-the-art deep learning models could achieve reliable results for this task. However, differential diagnoses are characterized by intrinsic uncertainty and a highly multidimensional context which is not included in the image. The expectation that automated methods could obtain similar results for differential diagnoses to those of doctors using only x-rays as input may, therefore, be unrealistic. This is because diagnoses such as "pneumonia" are not based solely on the x-ray image and depend to a great extent on external factors, such as laboratory tests, physical examinations, symptoms and signs, temporal clues and clinical judgment, which may additionally vary among practitioners.

The main objectives of the proposed dataset are:

  • To broaden the scope of radiographic diagnoses and findings that trainable models can learn from chest x-ray images, including respiratory, cardiac, infectious, neoplastic, bone and soft tissue diagnoses, the positions of nasogastric tubes, endotracheal tubes, central venous catheters, electrical devices, etc.

  • To increase the disease detection capabilities of trained models by providing the localizations of the entities as regions of interest.

  • To make available all relevant metadata, along with all the entities described by radiologists, so as to help control potential confounders in predictive models. For example, projection types and patient positioning, which dictate the adequate interpretation of radiographic findings, have been identified as confounding factors in deep learning models. Indeed, the first task for radiologists is to identify the type of projection before reading the x-ray in order to correctly interpret the findings.

  • To help advance automatic clinical text annotations in Spanish. To the best of our knowledge, this is the first publicly available chest x-ray dataset containing excerpts from radiology reports in this language. Although the excerpts from the report are provided in Spanish, the labels are mapped onto controlled biomedical vocabulary unique identifiers (CUIs), thus making the dataset usable regardless of the language.

  • To make it possible, by training text models with the proposed deep learning methods and using the source code provided, to automatically label other large-scale x-ray repositories in Spanish-speaking health institutions.

The remainder of the paper is organized as follows. Section 2 describes the methodology employed to build the dataset, including the manually labeled subset and the automatic labeling of the remaining reports using deep neural networks. Section 3 shows the evaluation results of the automatic-labeling methods described in the previous section, while Section 4 details the statistics of the dataset. Finally, the concluding section addresses the discussion, conclusions and future work.

2 Material and methods

Figure 1: Dataset Building Pipeline: The PadChest dataset consists of 206,222 x-ray reports (large circle), 109,931 (middle circle) of which had their 160,868 corresponding images in DICOM format and were acquired from the years 2009 to 2017. A subsample of 27,593 reports (pale oval region) from the years 2014 to 2017 was manually labeled and further used to train a multi-label text classifier based on neural networks in order to annotate the remaining dataset with radiographic findings and differential diagnoses. Anatomical localizations were extracted using regular expressions. Differential diagnoses, radiographic findings and anatomical localizations were mapped onto the NLM Unified Medical Language System (UMLS) controlled biomedical vocabulary unique identifiers (CUIs) and organized into semantic concept trees.

The PadChest dataset consists of all the available chest x-rays that had been interpreted and reported by 18 radiologists at the Hospital Universitario de San Juan, Alicante (Spain) from Jan 2009 to Dec 2017, amounting to 109,931 studies and 168,861 different images, as shown in Tab. 7.

This project was approved by the institutional research committee, and both the images and the associated reports were made anonymous and de-identified by the Medical Image Bank of the Valencian Community at the Department of Universal Health and Public Health Services (BIMCV-CSUSP) and the Health Informatics Department at San Juan Hospital.

The PadChest dataset can be downloaded from the repository of the medical imaging bank (BIMCV - PADCHEST), enabled by the Medical Image Bank of the Valencian Community (BIMCV). The BIMCV has launched various projects regarding population medical images, whose objective is to develop and implement an infrastructure with a massive storage capacity following the RD Cloud CEIB architecture (Salinas12). One of the missions of this bank is to promote the publication of scientific knowledge as open data by its affiliated health institutions.

PadChest contains image files adding up to 1 TB, a csv file with 33 fields for each study and an instruction file containing field descriptions, examples and search information for efficient image retrieval. An example of a dataset study with two projections, along with its associated labels and additional information fields, can be found in the appendix.

The methodology employed to build PadChest comprises the following main steps:

  • Pre-processing of the images and DICOM metadata extraction.

  • Pre-processing of the reports.

  • Manual medical annotations using a hierarchical taxonomy of radiographic findings, differential diagnoses and their anatomic locations.

  • Automatic labeling of the remaining studies.

2.1 Pre-processing of the images and DICOM metadata extraction

The images were processed by rescaling the dynamic range using the DICOM window width and center, when available. They were not resized to avoid the loss of resolution.
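The rescaling step can be illustrated with a minimal sketch. It assumes a simplified linear window (the DICOM standard's exact formula differs slightly), pixel values given as a flat sequence, and a min/max fallback when no window is stored. The fallback, the function names and the output range are assumptions made for this illustration, not details taken from the PadChest code.

```python
def apply_window(pixels, center, width, out_max=255):
    """Linearly rescale raw pixel values to [0, out_max] using a window.

    Values below center - width/2 map to 0 and values above
    center + width/2 map to out_max (simplified linear windowing).
    """
    if width <= 0:
        raise ValueError("window width must be positive")
    low = center - width / 2.0
    high = center + width / 2.0
    scaled = []
    for value in pixels:
        clipped = min(max(value, low), high)
        scaled.append(round((clipped - low) / (high - low) * out_max))
    return scaled


def rescale_image(pixels, center=None, width=None, out_max=255):
    """Use the stored window when available; otherwise fall back to min/max.

    The min/max fallback is an assumption of this sketch.
    """
    if center is None or width is None:
        low, high = min(pixels), max(pixels)
        center = (low + high) / 2.0
        width = max(high - low, 1)
    return apply_window(pixels, center, width, out_max)
```

Only the intensity values change; the number of pixels is untouched, which matches the decision not to resize the images.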

Images were excluded at the outset when: 1) the DICOM pixel data were not readable; 2) the photometric interpretation was missing or was MONOCHROME1; 3) the modality was missing or was Computed Tomography (CT) rather than x-ray; 4) the projection was lateral horizontal, oblique or trans-thoracic; 5) the anatomic image protocol was other than that of the chest (e.g., humerus or abdomen); or 6) the study report was missing or the radiography interpretation was not identifiable.
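The six exclusion criteria can be read as a sequence of checks on per-image metadata. The sketch below is only illustrative: the dictionary keys (`pixel_data_readable`, `body_part`, `report_text`, and so on) are hypothetical names chosen here and do not correspond to the actual DICOM tags or to the PadChest source code.

```python
EXCLUDED_PROJECTIONS = {"lateral horizontal", "oblique", "trans-thoracic"}


def exclusion_reason(meta):
    """Return the first matching exclusion reason, or None if the image is kept.

    `meta` is a plain dict whose keys are hypothetical names for this sketch.
    """
    if not meta.get("pixel_data_readable", False):
        return "unreadable pixel data"                        # criterion 1
    if meta.get("photometric_interpretation") in (None, "MONOCHROME1"):
        return "photometric interpretation missing or MONOCHROME1"  # 2
    if meta.get("modality") in (None, "CT"):
        return "modality missing or CT"                       # criterion 3
    if meta.get("projection") in EXCLUDED_PROJECTIONS:
        return "excluded projection"                          # criterion 4
    if meta.get("body_part") != "chest":
        return "non-chest protocol"                           # criterion 5
    if not meta.get("report_text"):
        return "missing or unidentifiable report"             # criterion 6
    return None
```

Returning the reason rather than a boolean makes it easy to count how many images each criterion removes.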

In addition to the standard DICOM metadata, the information on image projection and radiographic positioning was extracted. These data were found in different DICOM fields in non-structured free text (Position View, Patient Orientation, Series Description and Code Meaning). The projections principally identified after excluding obliques were antero-posterior (AP), postero-anterior (PA) and lateral (L). The different body positions were erect (either standing or sitting), decubitus (lying down), supine (lying on the back) and lateral decubitus (right lateral and left lateral). In addition, different protocols were identified based on different clinical scenarios: standard standing PA and L, mobile x-ray (for patients unable to stand) in either AP erect in bed or AP supine, pediatric protocol for patients up to 3 years old, lordotic views, and rib and sternum views.

Figure 2: Common chest x-ray projections. Panels: (a, f) P-A; (b, g) Lateral; (c, h) Lordotic; (d, i) A-P supine; (e, j) A-P.
Figure 3: Heart size projection in Postero-Anterior (PA) vs Antero-Posterior (AP). The projection of the heart (red silhouette) illustrates that anatomical ratios depend on the plane distance to the x-ray source.
Figure 4: Pleural effusion in different projections, (a) P-A and (b) decubitus. A standing projection (a) shows the meniscus sign, in which the fluid accumulates in the subpulmonary region and ascends along the thoracic wall and through the paramedian zone. In the decubitus projection (b) there is no meniscus sign. As the liquid moves to the most dependent area, there is a diffuse increase in hemithorax density and a loss of the sharp outline of the diaphragm, with occupation of the pulmonary apex by an apical cap, costophrenic angle blunting and a thickening of the minor fissure.

The projection information is highly relevant for diagnosis. For example, AP views, which are commonly used in pediatric patients, show an enlarged heart silhouette (Fig. 2(j)) that should not be interpreted as cardiomegaly, but merely as the magnification expected when the direction of the projection is reversed (Fig. 3). Another illustrative example is the distinct pattern that pleural effusions have in the standing position (Fig. 4(a)), in which a typical meniscus sign is commonly found, as opposed to decubitus projections (Fig. 4(b)). Given that the number of different projections is unbalanced (for instance, PA followed by lateral projections typically comprise the majority of chest x-rays), there is a risk that the other projections will not have sufficient instances with which to train models capable of discriminating pathological from non-pathological patterns in the context of the projection.

There are particular radiological landmarks that differentiate projections, which radiologists are trained to identify. For instance, in the case of PA projections, these landmarks are the presence of air in the gastric chamber and the scapulae projected outside the lung fields. Although these features can be learned, models trained on unbalanced datasets with a poor representation of the different projections may not have sufficient instances to properly learn those patterns. An illustrative example is when the heart enlargement in AP projections is attributable only to the effect of the projection, while the trained model erroneously predicts cardiomegaly.

Given that there were numerous combinations of projections, positions and protocols and all this information was not uniformly reported in DICOM meta-data, we decided to group them into 6 main classes: standard PA, standard L, AP vertical or erect, AP horizontal or supine, pediatric, and rib views.

For those images without DICOM information on the type of projection (20,367 samples), we used a publicly available pre-trained model and implemented a custom method to preprocess and load PadChest images in the expected format. This model is a ResNet-50 (resnet) CNN initialized with the ImageNet (deng2009imagenet) weights and fine-tuned on chest x-rays. Note that these automatically labeled projection samples should not subsequently be used to train another classifier with radiographs as inputs and projections as outputs. These projection labels were the only ones obtained from the images using an automatic classifier, unlike all the other data provided in PadChest, which can be used to train any classifier on input images, as they were obtained from the reports.

2.2 Preprocessing of the reports

The text dataset consisted of 206,222 study reports. There was only one report for each study; if a study had two projections, such as PA and L, the radiography description included both results in the same report. Not all study reports had their x-ray images available, and conversely not all available x-rays had corresponding reports.

The ground truth was obtained from raw text reports but was not usable "as is" for the following reasons:

  • The reports were in unstructured free-text and did not use standardized dictionaries or medical codes.

  • Although the radiography description was consistently present in the report, other sections, such as the patient’s history, study reason, and diagnostic impression, were not systematically included and, when present, those sections were not clearly delimited or did not follow a consistent pattern. In particular, the content format varied considerably over the years.

  • In several cases, the radiography description contained other projection types, such as sinus and abdominal types, whose interpretation was included together with the chest x-rays in the same report.

The following pipeline was consequently implemented in order to obtain a final curated labeled dataset (the source code is publicly available):

  1. Lowercasing of the text and removal of accents.

  2. The identification and extraction of the text containing the radiography description of chest x-rays in each report was carried out using an iterative incremental approach based on regular expressions and a repeated random sampling review to improve and/or add more regular expressions. A total of 167 (0.6%) reports were excluded because the radiography description was not retrievable with the chosen expressions or because the section was empty.

  3. Removal of non alpha-numerical characters excluding dots and spaces.

  4. Removal of stopwords, with the exception of [’sin’, ’no’, ’ni’, ’con’], using Spanish stopword removal from NLTK (nltk).

  5. Spanish stemming to remove morphological affixes from words, leaving only word stems using NLTK.
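Steps 1, 3, 4 and 5 above can be sketched in a few lines. This is a simplified illustration rather than the published pipeline: the stopword list below is a tiny hand-written sample (the paper uses the Spanish list from NLTK), non-alphanumeric characters are replaced by a space so that adjacent words do not merge, and stemming is passed in as a function that defaults to the identity (with NLTK installed, one could pass `SnowballStemmer("spanish").stem`). Step 2, the extraction of the chest description with regular expressions, is not covered.

```python
import re
import unicodedata

# Words that must survive stopword removal (from step 4).
KEPT_WORDS = {"sin", "no", "ni", "con"}
# Tiny illustrative stopword sample; an assumption standing in for NLTK's list.
STOPWORDS = {"de", "la", "el", "en", "y", "se", "los", "las", "del",
             "con", "sin", "no", "ni"} - KEPT_WORDS


def strip_accents(text):
    """Remove combining marks after Unicode decomposition (also maps ñ to n)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(text, stem=lambda token: token):
    """Return the list of preprocessed tokens for one report excerpt."""
    text = strip_accents(text.lower())                          # step 1
    text = re.sub(r"[^a-z0-9. ]", " ", text)                    # step 3
    tokens = [t for t in text.split() if t not in STOPWORDS]    # step 4
    return [stem(t) for t in tokens]                            # step 5
```

Dots are preserved on purpose, since they are needed later to split the reports into sentences.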

The complete report dataset had 501,840 sentences, amounting to 3.8 million words. The vocabulary size before preprocessing was 14,234 unique words, and was downsized to 9,691 different tokens after stopword removal (310 words) and stemming.

The mean number of tokens per sentence was 7.1 and the median was 5 tokens per sentence.

2.3 Medical codes annotation

A total of 27,593 reports were manually annotated following a hierarchical taxonomy described below. The manually labeled dataset was subsequently used to train and validate a multi-label text classifier, thus enabling the automatic annotation of the remaining 82,338 reports.

2.3.1 Topic extraction

Figure 5: Pareto curve of sentence redundancy. The cumulative frequency of the 1,000 most repeated sentences in the entire set of x-ray reports amounted to 230,729 sentences (the total number of sentences in the entire dataset was 500,000).

In order to help the practitioner label the dataset with radiographic findings and diagnoses, an automatic process was performed on the pre-processed reports so as to extract a series of topics (concepts) that were subsequently used to speed up the manual labeling process.

To this end, the preprocessed radiography reports were first split into sentences, and the unique sentences were ordered by frequency.

Of the 145,134 unique sentences from a total of 500,000 sentences, the 1,000 most repeated ones accounted for 230,729 sentences (46%). Fig. 5 shows a Pareto curve of the 1,000 most frequent sentences. Note that 20% of them covered up to 76% of these 230,729 sentences. In order to exploit this redundancy, the strategy employed was to label manually at sentence level.
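The redundancy figures quoted above (for example, 230,729 / 500,000 ≈ 0.46) are the kind of quantity produced by a simple frequency count. The following sketch, with function names chosen for this illustration, computes the share of all sentences covered by the most frequent unique ones and the cumulative curve that a Pareto plot would show.

```python
from collections import Counter


def coverage_of_top_sentences(sentences, top_n):
    """Fraction of all sentences accounted for by the top_n most frequent unique ones."""
    counts = Counter(sentences)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(sentences)


def cumulative_coverage(sentences):
    """Cumulative share of sentences covered as unique sentences are added by rank."""
    counts = Counter(sentences)
    total = len(sentences)
    running, curve = 0, []
    for _, count in counts.most_common():
        running += count
        curve.append(running / total)
    return curve
```

A steep initial rise of the cumulative curve indicates that labeling a small number of unique sentences propagates labels to a large share of the reports.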

Topic extraction discovers keyphrases and concepts in each sentence based on the frequency and linguistic patterns in the text. In order to facilitate the manual annotation task, the topics of sentences were automatically extracted and then presented to physicians for evaluation. For example, the list of phrases from the topic associated with the Chronic obstructive pulmonary disease (COPD) is: “epoc sign radiolog atrap aere sugest marc dorsal import inter cardiomegalia aereo lev significativos prominent vascular apical torax escoliosis derecho".

Three different unsupervised clustering methods with which to extract the topics from the reports were evaluated:

  1. Non-negative Matrix Factorization (NMF), NMFSklearn. First, term frequency–inverse document frequency (tf-idf) statistics were extracted to reflect how important a word is in the corpus. An NMF model was then fitted with the tf-idf features using two different objective functions: the Frobenius norm, and the generalized Kullback-Leibler divergence.

  2. Latent Dirichlet Allocation (LDA), LDA. In this case, tf (raw term frequency) features were extracted to fit an LDA model.

  3. k-Means. First, Doc2Vec (le2014distributed) representations were extracted; Doc2Vec is a generalization of Word2Vec (word2vec) for learning vectors from documents. The following parameters were used to learn the representations: learning rate 0.025, size of word vectors 300, size of the context window 10, number of epochs 55, minimum number of word occurrences 5, number of negative samples 5, loss function negative sampling, sampling threshold. Once the representations had been extracted, the k-Means algorithm was used to cluster these vectors into 20 topics as shown in Tab. 1.
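As an illustration of the tf-idf features mentioned for the NMF method, the sketch below computes them for tokenized sentences using the textbook weighting tf × ln(N / df). This is an assumption made for clarity: libraries such as scikit-learn apply smoothed and normalized variants, so the values used in the paper would differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight = term count in the document * ln(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

Terms that occur in every sentence receive a weight of zero, so the factorization is driven by the terms that distinguish one group of sentences from another.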

The output from these methods is a list of topics, each represented as a list of terms. The results were evaluated qualitatively for semantic performance and homogeneity, and the 20 topics obtained with Doc2Vec and k-Means were chosen by physician consensus. Grouping sentences by topic increased efficiency, since it allowed physicians to find more easily batches of sentences to which the same labels could be propagated.

Topic | Semantic group
1 | hypoexpansion basal, laminar atelectasis, bronchiectasis, pleural effusion, pneumonia, infiltrates
2 | pleural effusion both sides, bronchovascular markings, gynecomastia, interstitial pattern bilateral, pulmonary edema, respiratory distress, heart insufficiency
3 | tracheostomy tube, endotracheal tube, NSG tube, chest drain tube
4, 12 | exclude (sentences without radiographic findings, mainly mentioning additional imaging tests recommended in the follow-up)
5 | scoliosis, kyphosis, vertebral degenerative changes, osteopenia, osteoporosis
6 | COPD signs, air trapping, chronic changes, hyperinflated lung both sides, flattened diaphragm, emphysema
7 | vertebral compression
8 | NSG tube, hiatal hernia
9, 15, 16, 18 | normal study
10 | calcified granuloma, nipple shadow
11 | callus rib fracture, humeral prosthesis, osteosynthesis material, cervical rib
13 | unchanged
14 | central venous catheter, reservoir central, pacemaker
17 | cardiomegaly, vascular hilar enlargement, pulmonary hypertension, hilar congestion, vascular redistribution
19 | surgery, sternotomy, diaphragmatic eventration, hemidiaphragm elevation, costophrenic angle blunting
Table 1: Sentence topics: Sentences were automatically assigned to 20 topics with the sole purpose of organizing unique sentences by semantic logic. This would therefore serve as a tool with which to help the physician in the manual extraction of labels. Each of these topics captured different medical semantic groups that facilitated the manual task of label extraction by ordering the semantically close sentences.

2.3.2 Manual labeling

A total of 22,120 unique sentences corresponding to 27,593 study reports were labeled and manually reviewed by trained physicians. Medical entities were extracted as per physician criteria and mapped onto Unified Medical Language System (UMLS (umls)) controlled biomedical vocabulary unique identifiers. Medical entities whose exact meaning could not be found in any term of the UMLS Metathesaurus, but that were relevant for annotation purposes, were also extracted using the physician’s labels but without assigning a code to them. Four additional labels with special coding rules were explicitly defined in order to maintain consistency in the following cases:

  • Sentences describing normality, i.e. those that either reported complete resolution, did not describe radiographic findings or negated their presence, were labeled as "Normal".

  • Sentences that qualified images as suboptimal or with deficient technique, as reported by radiologists, were labeled as "Suboptimal".

  • Sentences that were not interpretable or that did not mention any radiographic findings or diagnoses were labeled as "Exclude".

  • Sentences mentioning that there were no changes in a medical entity described previously but not included in the report were labeled as "Unchanged" (unknown entity). This procedure makes it possible to distinguish pathologic from normal studies even when the medical entity is not mentioned and only a temporal reference is given. This case is frequent in temporal series of studies performed on hospitalized patients to monitor the evolution of a pathological finding. It is important to note that, before training a model to predict pathologies from images using PadChest, labels of type "Unchanged" are intended to be replaced by the labels reported for the same patient in a prior study, if available. Otherwise, the label should be removed, as it cannot be learned from the image.
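The intended handling of "Unchanged" before image training can be sketched as follows. The record layout (keys `patient`, `date` and `labels`) is hypothetical, and the choice of inheriting from the most recent earlier study of the same patient is an assumption of this sketch; the text only says that labels from a prior study should be used when available.

```python
def resolve_unchanged(studies):
    """Replace "Unchanged" with the labels of the patient's previous study.

    `studies` is a list of dicts with hypothetical keys "patient", "date"
    (ISO string, so it sorts chronologically) and "labels" (a set).
    Returns new dicts sorted by patient and date; the input is not modified.
    """
    resolved = []
    previous = {}  # patient id -> resolved labels of the latest earlier study
    for study in sorted(studies, key=lambda s: (s["patient"], s["date"])):
        labels = set(study["labels"])
        if "Unchanged" in labels:
            labels.discard("Unchanged")
            # Inherit prior labels when available; otherwise the label is dropped.
            labels |= previous.get(study["patient"], set())
        previous[study["patient"]] = labels
        resolved.append({**study, "labels": labels})
    return resolved
```

Because studies are processed chronologically and the already resolved labels are stored, a chain of consecutive "Unchanged" studies inherits the labels of the last study that named the entity.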

Every sentence was assigned at least one label, and possibly several. When a list of potential diagnoses was reported, all differential diagnoses were labeled. Similarly, when different radiographic findings were described, all of them were labeled.

As a result, both the sentences and hence the reports (as a sequence of sentences) were multi-label. Finally, the following consecutive rules were applied to ensure consistency when obtaining the final unique set of multi-labels for each report:

  1. The set of unique labels was obtained as the union operation of all labels assigned to each report.

  2. The report was assigned a single label “Normal" only if in addition to that label, there were no other labels regarded as a radiographic finding or diagnosis.

  3. The report was assigned a single label “Exclude" if this was the only label.
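The three consecutive rules above can be sketched in a few lines. This is an illustrative reading rather than the authors' code: the function name `report_labels` is hypothetical, and treating “Suboptimal" as a label that is neither a finding nor a diagnosis is an assumption made here.

```python
from typing import List

# Assumption: labels that do not count as a radiographic finding or diagnosis.
NON_FINDING = {"Normal", "Exclude", "Suboptimal"}


def report_labels(sentence_labels: List[List[str]]) -> List[str]:
    """Combine per-sentence labels into the final label set of one report."""
    # Rule 1: union of all sentence labels, keeping first-seen order.
    union: List[str] = []
    for labels in sentence_labels:
        for label in labels:
            if label not in union:
                union.append(label)

    has_finding = any(label not in NON_FINDING for label in union)

    result: List[str] = []
    for label in union:
        # Rule 2: "Normal" survives only without findings or diagnoses.
        if label == "Normal" and has_finding:
            continue
        # Rule 3: "Exclude" survives only when it is the sole label.
        if label == "Exclude" and len(union) > 1:
            continue
        result.append(label)
    return result
```

For example, a report with one normal sentence and one sentence describing pneumonia would keep only the pneumonia label under this reading.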

2.3.3 Organization of labels in hierarchical taxonomies

The list of unique manually annotated labels was organized into three hierarchical trees (see the appendix on hierarchies) in order to facilitate the exploitation and retrieval of images by semiological groups, differential diagnoses and anatomic locations. The hierarchical organization criteria and content of the three resulting taxonomy trees were reviewed by a radiologist with 20 years of clinical experience.

The main purpose of the hierarchic organization is to allow comprehensive retrieval of the studies grouped by higher hierarchical levels, thus enabling the construction of partitions using different criteria when compiling training sets for machine learning techniques. This is particularly relevant in order to control balance and granularity when deciding the classes to be inferred by classification methods.

The hierarchies are multi-axial, that is, the same term can be found in different branches, and child-parent relationships should satisfy that a child entity “is a" parent entity. For instance, “chronic tuberculosis" is “tuberculosis". Following these criteria, each medical concept was classified and assigned a node of the radiographic finding versus the differential diagnosis tree, and spatial concepts were assigned a node in the anatomical locations tree.
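Retrieval by higher hierarchical levels in a multi-axial tree can be sketched with a child-to-parents mapping. Only the “chronic tuberculosis" is “tuberculosis" relation comes from the text; every other node name, the study identifiers and the function names are made up for illustration.

```python
from typing import Dict, List, Set

# Child -> parents. A term may have several parents because the trees are multi-axial.
# Only "chronic tuberculosis" -> "tuberculosis" is taken from the paper; the rest is hypothetical.
PARENTS: Dict[str, List[str]] = {
    "chronic tuberculosis": ["tuberculosis"],
    "tuberculosis": ["infection"],
    "tuberculosis sequelae": ["tuberculosis", "sequelae"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All nodes reachable by following "is a" links upward from term."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def studies_under(node: str, study_labels: Dict[str, List[str]],
                  parents: Dict[str, List[str]]) -> List[str]:
    """Study ids with at least one label equal to node or descending from it."""
    return sorted(
        study_id
        for study_id, labels in study_labels.items()
        if any(label == node or node in ancestors(label, parents) for label in labels)
    )
```

Querying a high-level node returns every study labeled with any of its descendants, which is how coarser classes could be assembled when building a training partition.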

The criteria applied to each of the three hierarchical trees were:

  • Radiographic Findings Tree (see the radiographic findings appendix): Findings were defined as any medical entity or diagnosis that radiologists can assert based solely on the interpretation of a chest x-ray image without any additional knowledge of the patient’s clinical information. Medical entities that can be directly diagnosed by interpreting the image are, for example, those affecting the bones (e.g. fractures, degenerative diseases, etc.) or those involving alterations in air distribution (e.g. pneumothorax, subcutaneous emphysema, pneumoperitoneum, etc.).

  • Differential Diagnosis Tree (see the differential diagnoses appendix): Diagnoses, as opposed to radiographic findings, are those entities that radiologists can propose only as a possible list of differential diagnoses because the patient’s clinical information and/or additional studies, such as other radiology tests or laboratory analyses, are required to interpret the radiographic pattern and suggest a diagnosis with certainty. The more a radiologist knows about the patient (clinical, exploration and laboratory data), the more accurate the diagnosis is. For example, an alveolar pattern is a radiographic finding that would prompt a very long list of differential diagnoses (including both infectious and non-infectious diseases such as lung edema, respiratory distress, etc.), but whenever it is present in a patient with fever, cough, leukocytosis, high CRP levels and crackles in the localization of the alteration, then it will almost certainly be pneumonia. This additional information is not always provided to the radiologist, who consequently lists the most probable differential diagnoses in the report. Up to 30% of false positive and 30% of false negative diagnoses of pneumonia based on the presentation and chest x-ray findings were reported by claessens2015early. In this work, we did not access the patients’ health records to confirm diagnoses, and those entities were, therefore, regarded as differential diagnoses.

  • Topological Tree (see the anatomical locations appendix): This tree groups anatomical or spatial qualifiers in which radiographic findings and diagnoses are located. These spatial concepts include body parts, organs or organ components, chest anatomical regions, structures, and types of tissue. These concepts were first identified by means of regular expressions (see the appendix on location regular expressions) and then assigned to labels that were hierarchically organized in the topological tree.

2.4 Automatic labeling of the remaining reports

The reports that were not labeled in the previous stage (75%) were automatically tagged using a deep neural network classifier trained with the manually labeled data; the manually annotated set of x-ray reports therefore served as the training set. The goal was to produce a large dataset of annotated reports whose labels can be used as the target output of an image classifier trained with the x-rays.

We define the task of annotating the medical entities as a multi-label text classification problem in which, for each sentence $s$, the goal is to predict $y_\ell \in \{0, 1\}$ for all $\ell \in \mathcal{L}$, where $\mathcal{L}$ is the label space that includes the different radiographic findings (see the radiographic findings appendix) and differential diagnoses (see the differential diagnoses appendix).

Our method does not employ any structured data or external information beyond the manually labeled data. The descriptive statistics of the training set and ground truth are summarized in Tab. 2.


Parameter Value
Number of training sentences 20,439
Number of validation sentences 2,271
Vocabulary size 9,691
Average (min-max) number of tokens per sentence 7.1 (1-56)
Average (min-max) number of labels per sentence 1.33 (1-9)
Label space 193
Table 2: Descriptive statistics of the report sentences used to train the models. All sentences were manually annotated by trained physicians. Note that the reports consist of a sequence of sentences.

In this method, each input instance corresponds to a sentence (denoted as $s$) composed of a sequence of pretrained word-embeddings of dimension $d_e$ for each of its tokens, and its extracted labels encoded in a one-hot vector of dimension $|\mathcal{L}|$. Word-embeddings of size $d_e$ were trained on the full corpus of Spanish reports with FastText (grave2017bag) using the following parameters: learning rate 0.025, size of the context window 5, number of epochs 55, minimum number of word occurrences 5, number of negatives sampled 5, and negative sampling as the loss function; a sampling threshold and a minimum and maximum length of char n-grams were also specified. We favored FastText over the Word2Vec method so as to be able to encode unseen words that may potentially appear at inference time.

The four evaluated models are:

  • A 1-Dimensional Convolutional Neural Network (CNN).

  • A Recurrent Neural Network (RNN).

  • A 1-Dimensional Convolutional Neural Network with attention mechanism (CNN-ATT).

  • A Recurrent Neural Network with attention mechanism (RNN-ATT).

The CNN and RNN models were regarded as baseline models in order to compare them with the same architectures including attention mechanisms. The topologies of these systems are shown in Figs. 6 and 7, respectively. All neural models were implemented using PyTorch (paszke2017automatic), and the Ignite library was used to help compact the programming code for common tasks such as the training loop with metrics, early stopping and model checkpointing.

2.4.1 CNN

In the baseline CNN, after inputting the text as a matrix of concatenated word-embeddings, the model computed a base representation of each sentence, denoted as matrix $H$, containing the horizontal concatenation of the n-gram spatial localizations with 128 features each. The sigmoid activations of a final linear layer output the binary classification labels.

As shown in Fig. 6, each sentence in the CNN is represented as a matrix of word-embeddings whose dimensions are the number of tokens and the embedding size. The first convolutional layer contains 64 filters of dimension 3x1 with stride 1 and ReLU activation function, followed by a 2x1 max pooling layer with stride 1. There is a second convolutional layer containing 128 filters with the same dimensions, stride and activation function as the first layer. In this baseline CNN model, the attention module (highlighted in gray in Fig. 6) is replaced with a fully connected layer with multi-label sigmoid activations.

2.4.2 RNN

In the baseline RNN, after inputting the text as a variable-length sequence of word-embeddings, the model computed a base representation of each sentence, denoted as matrix $H$, containing the sequence of hidden states, in which each hidden state represents one token at one time step encoded with 256 features each. As occurred with the CNN, the sigmoid activations of a final linear layer output the binary classification labels.

As shown in Fig. 7, each sentence in the RNN is input as a variable-length sequence of word-embeddings, one embedding vector per time step. There are two bi-directional LSTM layers containing 128 hidden units each. In this baseline RNN model, the attention module (highlighted in gray in Fig. 7) is removed and the hidden state of the last time step of the RNN is used as the input for a final fully connected layer with multi-label sigmoid activations.

Figure 6: Diagram of the CNN topology with attention mechanism highlighted in gray. Input is a representation of the text as a matrix of concatenated word-embeddings, and outputs are the most likely labels.
Figure 7: Diagram of the RNN topology with attention mechanism highlighted in gray. Input is a representation of the text as a sequence of variable-length word-embeddings, and outputs are the most likely labels.

2.4.3 Attention model

The attention mechanism, which is based on the work by mullenbach2018explainable, learns different text representations for each label. The underlying idea is that each snippet containing important information correlated with a label could be anywhere in the text and would differ among different labels.

In both the CNN-ATT and the RNN-ATT model, the final linear layer was replaced with an attention feed-forward module that first calculates attention vectors as a distribution of putative locations for each label in the sentence. As illustrated in Figs. 6 and 7, attention vectors are computed from the base representation $H$ of the sentence as:

$$\alpha_\ell = \mathrm{SoftMax}(H^\top u_\ell)$$

where $u_\ell$ is a vector parameter for label $\ell$ and $\mathrm{SoftMax}(x) = \exp(x) / \sum_i \exp(x)_i$, where $\exp(x)$ is the element-wise exponentiation of the vector $x$, as described in mullenbach2018explainable.

In a second step, we apply the attention vectors to the base representation of a sentence in order to compute a matrix of 128-dimensional weighted feature vectors for each label, defined as $V = [v_1, \dots, v_{|\mathcal{L}|}]$ where $v_\ell = \sum_{n=1}^{N} \alpha_{\ell,n} h_n$, with $h_n$ the $n$-th column of $H$ and $N$ the number of positions in the sentence.

Finally, following the approach of mullenbach2018explainable, after applying a linear layer consisting of vectors of weights $\beta_\ell$ and a scalar bias $b_\ell$, a sigmoid transformation is used to compute the probability for each label:

$$\hat{y}_\ell = \sigma(\beta_\ell^\top v_\ell + b_\ell)$$
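The training procedure minimized the binary cross entropy loss in the case of all four topologies:

$$L_{BCE}(y, \hat{y}) = -\sum_{\ell \in \mathcal{L}} \left[ y_\ell \log \hat{y}_\ell + (1 - y_\ell) \log (1 - \hat{y}_\ell) \right]$$

and the L2-norm of the model weights, where $y_\ell$ is the true binary label and $\hat{y}_\ell$ the predicted probability for label $\ell$ in the label space $\mathcal{L}$. The Adam optimizer (kingma2014adam) was used for the CNN and CNN-ATT models and RMSprop (tieleman2014rmsprop) for the RNN and RNN-ATT models.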
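The per-label attention and the loss can be made concrete with a small dependency-free sketch in the style of mullenbach2018explainable. It is not the PyTorch implementation used in this work: the function names, the representation of the base matrix as a list of position vectors, the summation of the loss over labels and the omission of the L2 term are all assumptions made for brevity.

```python
import math
from typing import List

Vector = List[float]


def softmax(x: Vector) -> Vector:
    """Element-wise exponentiation normalized to sum to one (shifted for stability)."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def label_probability(H: List[Vector], u: Vector, beta: Vector, b: float) -> float:
    """Probability of one label given the position vectors h_n of a sentence.

    H    : list of N position vectors, each with d features
    u    : attention parameter vector of the label (d features)
    beta : output weight vector of the label (d features)
    b    : scalar bias of the label
    """
    # Attention scores: one dot product h_n . u per position, then softmax.
    scores = [sum(h_i * u_i for h_i, u_i in zip(h, u)) for h in H]
    alpha = softmax(scores)
    # Label-specific representation: attention-weighted sum of the position vectors.
    v = [sum(alpha[n] * H[n][i] for n in range(len(H))) for i in range(len(u))]
    logit = sum(w_i * v_i for w_i, v_i in zip(beta, v)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(y_true: List[int], y_prob: List[float], eps: float = 1e-12) -> float:
    """Binary cross entropy summed over labels; eps guards against log(0)."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for y, p in zip(y_true, y_prob)
    )
```

Each label has its own `u`, `beta` and `b`, so the same sentence yields a different attention distribution, and hence a different weighted representation, for every label.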

3 Evaluation of automatic labeling

The following metrics were used to assess the performance of the automatic labeling method:

  • Accuracy: The accuracy score provides the fraction of labels that were correctly detected. In this work, as we are dealing with multi-label classification, we obtain the average accuracy per sample: If all the predicted labels for an input sentence strictly match the true set of labels, then the accuracy for that sample is 1, and is otherwise 0.0. The average accuracy of all the samples in the test set is reported as the overall accuracy.

  • MacroF1: Calculates metrics for each label, and finds their unweighted mean. This does not take label imbalance into account, as it places more emphasis on rare label predictions.

  • MicroF1: Calculates metrics globally by counting the number of true positives, false negatives and false positives.

  • WeightedF1: Calculates metrics for each label, and finds their average, weighted by support (the number of true instances for each label). This alters Macro to account for label imbalance and it may result in a score that is not between precision and recall.
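The four metrics listed above can be computed from binary indicator matrices as in the following sketch. The function names are hypothetical, and returning an F1 of zero for a label with no true or predicted instances is a convention assumed here.

```python
from typing import Dict, List, Tuple

Matrix = List[List[int]]  # rows = sentences, columns = labels, entries 0 or 1


def subset_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of samples whose predicted label set matches the true set exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def label_counts(y_true: Matrix, y_pred: Matrix, j: int) -> Tuple[int, int, int]:
    """True positives, false positives and false negatives for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty labels


def f1_summary(y_true: Matrix, y_pred: Matrix) -> Dict[str, float]:
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    per_label = [f1(*c) for c in counts]
    support = [tp + fn for tp, _, fn in counts]  # true instances per label
    # Micro: pool the counts of all labels before computing F1.
    micro = f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))
    # Macro: unweighted mean of the per-label F1 scores.
    macro = sum(per_label) / len(per_label)
    # Weighted: per-label F1 averaged with the support as weights.
    total = sum(support)
    weighted = sum(f * s for f, s in zip(per_label, support)) / total if total else 0.0
    return {"micro": micro, "macro": macro, "weighted": weighted}
```

With many rare labels, a model can reach a high MicroF1 while its MacroF1 stays low, which is the pattern visible when the two columns of a multi-label evaluation are compared.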

The four models were trained and validated on the same random partition, and early stopping was employed in order to retain the best model, with a maximum of 500 epochs. The hyper-parameters used to train the models are summarized in Tab. 3. The metrics obtained in the multi-label annotation of radiographic findings and differential diagnoses on text are summarized in Tab. 4.
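Early stopping with an epoch cap can be sketched as a loop that remembers the best validation score. The stopping criterion is not detailed in the text, so the patience-based rule, the default patience value and the names `run_with_early_stopping` and `evaluate_epoch` are assumptions; only the cap of 500 epochs comes from the description above.

```python
from typing import Callable, Tuple


def run_with_early_stopping(evaluate_epoch: Callable[[int], float],
                            max_epochs: int = 500,
                            patience: int = 20) -> Tuple[int, float]:
    """Train for at most max_epochs and return (best_epoch, best_score).

    evaluate_epoch(epoch) is assumed to run one training epoch and return the
    validation score (e.g. MicroF1), where higher is better.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, max_epochs + 1):
        score = evaluate_epoch(epoch)
        if score > best_score:
            # A real training loop would also checkpoint the model here.
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score
```

The returned epoch plays the role of the "Epochs" value reported for a model: the point at which the retained checkpoint was saved, not the total number of epochs run.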

Parameter Values CNN RNN
Batch size 1024 1024
L2 Penalty 0, , 0 0
Learning rate , , ,
Filter size 3
Number of filters second conv layer 64, 128, 256 128
Dropout probability 0.4 0.4
Number of hidden LSTM units 64, 128, 256, 512 128
Number of hidden LSTM layers 2
Table 3: Hyperparameter ranges evaluated and best values chosen for the CNN and RNN based models selected by grid-search and manual fine tuning using MicroF1 as performance measure in the validation set.
Model Epochs Accuracy MacroF1 MicroF1 WeightedF1
Validation Set
CNN 94 0.706 0.384 0.837 0.814
CNN-ATT 230 0.806 0.458 0.902 0.886
RNN 78 0.853 0.483 0.913 0.903
RNN-ATT 41 0.864 0.491 0.924 0.918
Test Set
RNN-ATT 41 0.857 0.491 0.939 0.926
Table 4: Performance metrics of all the models in a validation set of 2,271 sentences with a training set of 20,439 sentences. The RNN-ATT model obtained the best validation results on all four metrics. The best model from among the four different architectures was further tested in an independent random sample of 500 sentences drawn from the PadChest dataset and manually labeled.

Cross-validation experiments using $k$ folds were performed in order to obtain accuracy curves and compare the learning pattern up to 150 epochs for the four models (CNN, CNN-ATT, RNN and RNN-ATT). The results are shown in Fig. 8. These experiments allow us to understand the dispersion in accuracy attributable to the differences in the data distribution between the training and validation sets.

Figure 8: Learning curves, with panels including (a) CNN and (c) RNN. Each curve shows the average and standard deviation (shaded regions) of the MicroF1 score over 150 epochs for each model after cross-validation. The training and validation sets contained 20,439 and 2,271 samples respectively.

We draw the following conclusions:

  • The RNN outperformed the CNN.

  • Attention mechanisms increased the performance of both the CNN and the RNN models with the following particularities: The gain in the overall MicroF1 was more pronounced in the CNN-ATT model, but it had a slower but steadier increase when compared to the plain CNN. Conversely, the accuracy gain of the RNN-ATT versus the plain RNN was less appreciable but the model learned faster in early epochs.

  • Accuracy suffers from a higher variance in the CNN-ATT as can be seen from the wider standard deviation interval compared to that of the RNN-ATT.

  • For all four methods, in an attempt to reduce the variability in validation metrics, we experimented with reducing the size of the training set in favor of a larger validation set, but observed that the model’s predictive performance deteriorated as regards both the training and validation metrics, particularly in the CNN architecture. We consequently hypothesize that this architecture benefits more from a larger number of instances, but this could not be proven given the available size of the labeled dataset.

A random sample of 500 sentences, drawn from the portion of the PadChest dataset that was not used to train or validate the model, was further labeled manually and used as an independent test set. In order to test the suitability of the trained model as regards its generalization to the whole PadChest dataset, this test sample was drawn from x-ray reports belonging to a different period (2009 to 2013) from that of the training and validation sets (2014 to 2017).

The best model from among the four different architectures was further tested in this test set, obtaining a MicroF1 score of 0.93, as shown in Tab. 4. This result was similar to those achieved in the validation set, illustrating its suitability for use in the automatic annotation of the PadChest dataset samples that were not labeled manually. The RNN-ATT model was consequently used to extract the annotations for the remaining PadChest dataset, as shown in Tab. 7.

4 Dataset overview

Name Description
ImageID Image identifier
ImageDir Zip folder containing the image
StudyID Study identifier
PatientID Patient’s code
PatientBirth Year in format YYYY
Projection Classification of the 5 main x-ray projections: PA (Postero-Anterior standard), L (Lateral), AP (Antero-Posterior erect or vertical), AP-horizontal (Antero-Posterior horizontal), COSTAL (rib views)
Pediatric PED if the image acquisition followed a pediatric protocol
MethodProjection The method applied in order to assign a projection type: based on a manual review of DICOM fields or based on the classification output of a pre-trained ResNet50 model for images without DICOM information about the projection.
ReportID Integer identifier of the report
Report A text snippet extracted from the original report containing the radiographical interpretation. The text is preprocessed, while the words are stemmed and tokenized. Each sentence is separated by ‘.’
MethodLabel The method applied for labeling: manually by physicians (Physician) or supervised (RNN_model)
Labels A sequence of unique labels extracted from each report
Localizations A sequence of unique anatomical locations extracted from each report. Each anatomical location is always preceded by the token loc
LabelsLocalizationsBySentence Sequences of labels followed by its anatomical locations. Each single sequence corresponds to the labels and anatomical locations extracted from a sentence and repeats the pattern formed of one label followed by no or many locations for this label [ label, (0..n) loc name ]. The sequences are ordered by sentence order in the report
LabelCUIS A sequence of UMLS Metathesaurus CUIs corresponding to the extracted labels in the Labels field
LocalizationsCUIS A sequence of UMLS Metathesaurus CUIs corresponding to the extracted anatomical locations in the Localizations field
Table 5: Dataset Fields: All additional processed fields that are different from original DICOM fields. Additional information on UMLS Metathesaurus CUIs can be found at
Name DICOM tag Description
StudyDate_DICOM 0008,0020 YYYY-MM-DD Starting date of the study
PatientSex_DICOM 0010,0040 Sex of the patient. M male, F female, O other
ViewPosition_DICOM 0018,5101 Radiographic view of the image relative to the subject’s orientation
Modality_DICOM 0008,0060 Type of equipment used to acquire the image data
Manufacturer_DICOM 0008,0070 Manufacturer of the equipment that produced the images
PhotometricInterpretation_DICOM 0028,0004 Intended interpretation of the pixel data represented as a single monochrome image plane. MONOCHROME1: the minimum sample value should be displayed in white. MONOCHROME2: the minimum sample value is intended to be displayed in black
PixelRepresentation_DICOM 0028,0103 Data representation of the pixel samples, unsigned integer or 2’s complement
PixelAspectRatio_DICOM 0028,0034 Ratio of the vertical size and horizontal size of the pixels in the image specified by a pair of integer values, in which the first value is the vertical pixel size and the second value is the horizontal pixel size
SpatialResolution_DICOM 0018,1050 The inherent limiting resolution in mm of the acquisition equipment
BitsStored_DICOM 0028,0101 Number of bits stored for each pixel sample
WindowCenter_DICOM 0028,1050 Window center for display
WindowWidth_DICOM 0028,1051 Window width for display
Rows_DICOM 0028,0010 Number of rows in the image
Columns_DICOM 0028,0011 Number of columns in the image
XRayTubeCurrent_DICOM 0018,1151 X-ray Tube Current in mA
ExposureTime_DICOM 0018,1150 Duration of x-ray exposure in msec
Exposure_DICOM 0018,1152 Exposure expressed in mAs, calculated from Exposure Time and x-ray Tube Current
ExposureInuAs_DICOM 0018,1153 Exposure in µAs
RelativeXRayExposure_DICOM 0018,1405 Relative x-ray exposure on the plate
Table 6: DICOM dataset fields: DICOM® (Digital Imaging and Communications in Medicine) is the international standard employed to transmit, store, retrieve, print, process, and display medical imaging information. Detailed descriptions of DICOM standard fields can be found in DICOM.
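The pattern of the LabelsLocalizationsBySentence field in Tab. 5 (one label followed by zero or more locations) can be unpacked as in the sketch below. The in-memory representation is an assumption: each sentence is taken to be a list of strings in which locations start with the token loc followed by a space; the function name and the example values are hypothetical.

```python
from typing import List, Tuple

LOC_PREFIX = "loc "  # assumed spelling of the location marker


def split_labels_and_locations(sentence: List[str]) -> List[Tuple[str, List[str]]]:
    """Group a flat [label, loc ..., label, loc ...] sequence into (label, locations) pairs."""
    pairs: List[Tuple[str, List[str]]] = []
    for item in sentence:
        if item.startswith(LOC_PREFIX):
            # A location belongs to the closest preceding label; a location
            # with no preceding label is skipped in this sketch.
            if pairs:
                pairs[-1][1].append(item[len(LOC_PREFIX):])
        else:
            pairs.append((item, []))
    return pairs
```

For instance, a hypothetical sentence `["pleural effusion", "loc costophrenic angle", "loc left", "cardiomegaly"]` would be grouped into one label with two locations and one label with none.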

The dataset consists of 160,868 labeled chest x-ray images from 67,625 patients, acquired in a single institution between 2009 and 2017 (see Tab. 7). The patients had a mean of 1.62 chest x-ray studies performed at different time points (from 1 to 119). Each study contains one or more images corresponding to different position views, mainly P-A and lateral, and is associated with a single radiography report describing the results of all position views in a common text section. The mean number of chest x-ray images per patient was 2.37. The dataset generated provides two types of fields for each chest x-ray image: the fields with the suffix DICOM (Tab. 6) contain the values of the original field in the DICOM standard, and the remaining fields (Tab. 5) enrich the PadChest dataset with additional processed information.
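The counts reported in Tab. 7 can be cross-checked with a few lines of arithmetic. The sketch below copies the figures from that table; the variable names are illustrative, and the per-patient ratios are computed over the final labeled set.

```python
# Counts copied from Tab. 7 as (patients, reports, images).
manual = (24_491, 27_593, 39_039)
automatic = (43_134, 82_338, 121_829)
final = (67_625, 109_931, 160_868)

initial_images = 168_171
excluded_images_by_reason = {
    "Photometric interpretation": 5_652,
    "Modality": 871,
    "Study Report": 705,
    "Position View": 42,
    "Protocol": 23,
    "Pixel Data": 10,
}

# The exclusion reasons should add up to the excluded images.
excluded_images = sum(excluded_images_by_reason.values())

# Manual plus automatic labeling should give the final labeled set.
labeled_matches = tuple(m + a for m, a in zip(manual, automatic)) == final

# Initial images minus excluded images should give the final image count.
images_match = initial_images - excluded_images == final[2]

studies_per_patient = final[1] / final[0]
images_per_patient = final[2] / final[0]
manual_report_share = manual[1] / final[1]
```

The two ratios agree with the reported means of 1.62 studies and 2.37 images per patient to within 0.01, and the share of manually labeled reports is about one quarter, in line with the 75% of reports that were labeled automatically.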

Patients Reports Images
Initial Set 69,882 115,678 168,171
Excluded Set 4,744 5,864 7,303
Photometric interpretation - - 5,652
Modality - - 871
Study Report - - 705
Position View - - 42
Protocol - - 23
Pixel Data - - 10
Ground Truth
Manual Report Labeling 24,491 27,593 39,039
Automatic Report Labeling 43,134 82,338 121,829
Final Labeled Set 67,625 109,931 160,868
Table 7: PadChest global statistics.
Projection and Positioning
PA 96,010
L 51,124
AP-Horizontal 14,355
AP-Vertical 5,158
Costal 631
Pediatric 274
Male 80,923
Female 79,923
Table 8: Descriptive statistics: Categorical variables.