Dynamic Graph Correlation Learning for Disease Diagnosis with Incomplete Labels

02/26/2020 ∙ by Daizong Liu, et al. ∙ Huazhong University of Science & Technology

Disease diagnosis on chest X-ray images is a challenging multi-label classification task. Previous works generally classify the diseases independently on the input image without considering any correlation among diseases. However, such correlation actually exists; for example, Pleural Effusion is more likely to appear when Pneumothorax is present. In this work, we propose a Disease Diagnosis Graph Convolutional Network (DD-GCN) that presents a novel view of investigating the inter-dependency among different diseases by using a dynamic learnable adjacency matrix in a graph structure to improve the diagnosis accuracy. To learn a more natural and reliable correlation, we feed each node with the image-level individual feature map corresponding to each type of disease. To our knowledge, our method is the first to build a graph over the feature maps with a dynamic adjacency matrix for correlation learning. To further deal with the practical issue of incomplete labels, DD-GCN also utilizes an adaptive loss and a curriculum learning strategy to train the model on incomplete labels. Experimental results on two popular chest X-ray (CXR) datasets show that our prediction accuracy outperforms state-of-the-art methods, and that the learned graph adjacency matrix establishes correlation representations of different diseases that are consistent with expert experience. In addition, we conduct an ablation study to demonstrate the effectiveness of each component in DD-GCN.




1 Introduction

Chest radiography is the most widely available imaging examination for screening, diagnosis and management of multiple threatening diseases. However, automatic interpretation of chest X-ray images (CXRs) is currently a technically challenging task, because the complex pathologies involved depend heavily on the expertise of radiologists with years of professional experience. Although significant progress has been made on chest disease classification using Convolutional Neural Networks (CNNs), it remains a challenging multi-label task due to the combinatorial nature of the output space.

Figure 1: We build a directed graph over the diseases for chest X-ray images to model the label dependencies for multi-label disease diagnosis, where each image has some uncertain labels. In the right graph, each directed edge indicates the conditional probability, learned by our dynamic correlation matrix, of how likely one label is to appear when another label appears.

To address multi-label tasks such as Chest X-ray14 multi-disease diagnosis [37], many researchers [31, 40, 9, 39] take a naive approach: they treat the diseases in isolation, convert the multi-label problem into a set of binary classification problems, and then predict independently whether each disease of concern is present. However, they do not exploit the topology among the labels, even though disease pairs can be strongly correlated. For example, Pleural Effusion is more likely to appear when Pneumothorax is present [32]. It is therefore essential to consider label correlation in multi-disease classification for a better understanding of the inter-dependency. As graphs are widely used to explore label dependencies in vision tasks, previous graph-based methods [18, 20, 17, 5], mostly using Graph Convolutional Networks (GCNs) [17, 5], formulate the multi-label recognition task as a structural inference problem. The GCN structure can therefore also be used to learn the correlation among different diseases for disease diagnosis.

However, there exist some limitations in these GCN-based methods [17, 5]. Firstly, they use a pre-defined constant adjacency matrix, which is limited by manual definition or the database scale and thus cannot learn the natural relationship between the classes. In contrast, we develop a dynamic learnable adjacency matrix to automatically explore the inherent relationship among different diseases. Each element in our matrix represents the conditional probability of a disease pair, as illustrated in Fig. 1, and it is dynamically updated with the backpropagated gradient during the GCN training. Secondly, they feed the graph with either word embeddings [5] or the mixed feature of the whole image [17], which lacks a meaningful representation of each class. To get a more interpretable representation, we build our graph with image-level individual feature maps as the input, which leads to more distinguishable characteristics of each disease for strong correlation learning.

Moreover, learning an automatic classification model requires large datasets for training. The current largest dataset, CheXpert [14], is incompletely labeled: it contains uncertain labels (-1), as shown in Fig. 1. An uncertain label indicates that experts are not sure whether the corresponding disease exists. Since a large proportion of the labels in the dataset are uncertain, we should not ignore their potential value for improving diagnosis performance. Although many works [23, 35, 6] have tried to learn the potential information in the uncertain labels, they either directly set all uncertain labels as negative ones or ignore the correlation between certain and uncertain annotations. To handle the problem of incomplete labels, we propose an adaptive curriculum learning algorithm to relabel and reuse the uncertain labels in our framework.

In summary, we propose a novel GCN-based end-to-end network, called Disease Diagnosis GCN (DD-GCN), which utilizes a learnable adjacency matrix to dynamically learn the conditional probability between different disease pairs. Compared to a constant label correlation matrix, our correlation matrix is dynamically updated with gradient backpropagation during the GCN training, and can therefore capture natural label correlation in the multiple disease classification task. Besides, to generate more general representations as the graph input, we first extract the image-level feature of the whole CXR image, then divide it into individual feature maps for each disease. Our dynamic adjacency matrix learns from such individual class features and establishes correlation representations of different diseases that are consistent with expert experience. In addition, to handle the incomplete labels, we design an adaptive curriculum learning strategy which relabels the uncertain labels as weak positive/negative ones and uses them to finetune the pre-trained model so that it learns the potential context. Compared with state-of-the-art methods, our method is more robust and achieves the best performance on both the Chest X-ray14 [37] and CheXpert [14] datasets.

Our main contributions are summarized as follows:

  • To the best of our knowledge, we are the first to utilize a learnable correlation matrix for disease diagnosis. We propose an end-to-end trainable framework, which learns a dynamic correlation matrix with an independent feature map of each disease as the input, to explore more natural conditional probabilities between different disease pairs.

  • We design an adaptive curriculum learning strategy, which uses the uncertain labels for potential context learning, so as to handle the incomplete-label task.

  • We conduct experiments on two benchmark CXR datasets, together with an ablation study that verifies the effectiveness of each component in our model.

2 Related Work

In this section, we briefly summarize recent research related to our work, including chest disease classification, Graph Convolutional Networks (GCNs), and learning with uncertain labels.

Chest disease classification. Several works have been proposed to predict the probability of chest radiographic observations from CXRs. Wang et al. [37] implemented four basic multi-label CNN architectures to evaluate the performance on the ChestX-ray14 dataset. Rajpurkar et al. [31] found that DenseNet121 [12] can extract better embedding features from the CXRs. Yao et al. [40] built a Long Short-Term Memory (LSTM) based model to exploit the dependencies between different diseases. To guide the model to focus on the region of interest (ROI), Guan et al. [9] developed an attention-guided network to crop the ROI and then extract features. Tang et al. [34] utilized curriculum learning over data with pre-defined severity levels derived from known labels. Yan et al. [39] used the SE block [11] and the WildCat structure [8] to divide multiple feature maps into different classes to improve the performance on each disease. These methods are limited by the quality of the Chest X-ray14 dataset; Irvin et al. [14] therefore built the larger CheXpert dataset, which contains uncertain labels, to give a strong reference standard with expert human performance metrics for comparison. In this work, our proposed method achieves higher performance on Chest X-ray14 by exploring the disease correlation, and also yields a significant improvement on CheXpert with incomplete label learning.

Graph Convolutional Network (GCN). Graphs have been proven to be very effective in modeling label correlation. Many researchers [17, 18, 5] utilized a graph structure to capture the label correlation dependency with image features. In particular, Chen et al. [5] exploited a GCN whose graph nodes take word embeddings as input to propagate features between multiple labels, and then made the classification depend on a constant correlation matrix initialized by the image graph. However, word embeddings are less natural for feature learning, and a constant adjacency matrix is strictly limited by the dataset scale. Methods using a binary matrix [15] have a similar limitation. To learn a more reliable correlation, we build a dynamic graph with independent feature maps of each disease as the input of each node, so as to explore the inter-dependency between disease pairs by learning a dynamic correlation matrix.

Learning with uncertain labels. Multi-label datasets often contain uncertain ground-truth (GT) annotations such as missing or unknown labels. Contextual information is lost if such uncertain parts are not used. Many researchers [2, 4, 36, 33, 23] set missing labels as negative labels, which degrades performance because many positive labels among the missing ones may be relabeled as negative. Another method is Binary Relevance (BR) [35], which classifies each label as an independent binary class while ignoring the correlation. Although several researchers [3, 38, 4, 6, 7] have explored the correlation among certain labels to predict information about uncertain labels, they need to solve an optimization problem that is hard to apply in a mini-batch strategy. Inspired by curriculum learning [1], we propose an adaptive curriculum learning strategy to relabel the uncertain labels so that they can serve as supervision in further training. Different from methods [42, 41, 43] that adopt an alternating update process to utilize all unlabeled data, our curriculum strategy selects only weak positive/negative labels from the uncertain labels for model fine-tuning rather than using all the uncertain labels.

3 Method

3.1 Notations and Overview

For each input CXR image $x$, there may exist more than one disease. Denote the corresponding disease label as $y = [y_1, \dots, y_C]$, where $C$ is the number of diseases and $y_c \in \{-1, 0, 1\}$ (unknown, negative, or positive for disease $c$). Each index $c$ corresponds to one disease name. In our task, we aim to predict the existence of all diseases such that the prediction $\hat{y}$ is as close as possible to the ground truth $y$.

The architecture of our proposed Disease Diagnosis Graph Convolutional Network (DD-GCN) is illustrated in Fig. 2. We first utilize DenseNet121 [12] as the CNN backbone to extract the mixed embedding features of the input CXRs. Then, we divide the embedding features into individual feature maps for each disease class. To further learn the inter-dependency between different disease pairs, we build a stacked GCN to capture the disease correlation for multi-label image classification. Here, we feed the individual feature maps of each class into different nodes of a graph so that they share information, and the model then outputs the final binary classification result for each class. Note that the learnable adjacency (correlation) matrix in the graph is dynamically learned at each step. In addition, we propose an adaptive strategy for the incomplete-label task to learn representations from both the known and unknown sets. In the following subsections, we provide details of dynamic correlation matrix learning and incomplete label learning.

Figure 2: Overall architecture of the proposed DD-GCN. We first utilize a deep CNN to extract features of the multi-label CXR images. Then a transfer layer is applied to partition the feature maps into several blocks, each corresponding to a particular disease class. After obtaining the representation of each disease, we send it to the nodes of the stacked recurrent GCN to learn a dynamic correlation matrix, and then train the final inter-dependent disease classifiers.

3.2 Dynamic Correlation Matrix Learning

A Graph Convolutional Network (GCN) [15] is well suited to learning strong correlations between nodes in a graph, which can be used to explore the relationships between diseases. A GCN works by passing messages between nodes based on the correlation (adjacency) matrix, which is pre-defined in most studies. Although GCN-based approaches [17, 18, 5, 15] have made significant progress, there is still room for improvement:

  • Previous works [17, 5] generally take the mixed feature of the whole image or the word embedding [29] of each label as the graph input to learn the correlation for multi-label image classification. Such mixed features lack a distinguishable representation for each disease, while word embeddings carry no interpretable information in the vision space. To uncover a more robust inter-dependency of diseases, more meaningful and interpretable independent features of each disease need to be represented in the image space.

  • Almost all GCN-based works exploit a pre-defined correlation matrix (binary or data-driven) to map a constant conditional probability between the label pairs. This matrix is limited by human definition [15] or the database scale [5]. Meanwhile, some inherent relationships are hard to determine even for humans or from empirical statistics. Thus, a more natural correlation needs to be explored with a dynamic learning and updating strategy rather than by keeping a constant correlation matrix.

As shown in Fig. 3, unlike previous works, we extract image-level representations for individual diseases as the graph input. These individual features are more robust than mixed features or word embeddings for investigating the inter-dependency using visual contexts. To automatically learn more natural correlation maps, we further design a dynamic adjacency matrix learning strategy driven by gradient backpropagation during the GCN training.

Individual feature generation. To explore a more natural and reliable dependency among the diseases, instead of learning with the word embedding of each disease, we learn the relationship between the individual representations of each class. After extracting the deep features of shape $w \times h \times d$ using DenseNet121 [12], we adopt a transfer layer [8] to get the individual feature maps of each disease, as shown in Fig. 2. In more detail, we first transform the feature maps of size $w \times h \times d$ into feature maps of size $w \times h \times MC$ through a convolution layer, where $M$ is the number of feature maps per class and $C$ is the number of diseases. Then class-wise average pooling reduces the number of channels from $MC$ to $C$, so that we finally obtain a representation of shape $w \times h$ for each disease.
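The class-wise average pooling step can be illustrated with a minimal sketch in plain Python. It assumes that the feature maps belonging to one class sit in consecutive channels, which is an assumption of this example rather than something stated in the text; the function name and the toy sizes are hypothetical.

```python
def class_wise_average_pool(feature_maps, num_classes, maps_per_class):
    """Average each class's group of maps into a single map per class.

    feature_maps: list of M*C maps, each map a list of rows of floats.
    Assumption: the M maps of class c occupy indices c*M .. c*M + M - 1.
    """
    if len(feature_maps) != num_classes * maps_per_class:
        raise ValueError("expected num_classes * maps_per_class feature maps")
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    pooled = []
    for c in range(num_classes):
        group = feature_maps[c * maps_per_class:(c + 1) * maps_per_class]
        # Element-wise mean over the M maps that belong to class c
        class_map = [
            [sum(m[i][j] for m in group) / maps_per_class for j in range(width)]
            for i in range(height)
        ]
        pooled.append(class_map)
    return pooled


# Toy input: C = 2 classes, M = 2 maps per class, each map of size 1 x 2.
maps = [[[1.0, 3.0]], [[3.0, 5.0]], [[0.0, 2.0]], [[2.0, 6.0]]]
per_class = class_wise_average_pool(maps, num_classes=2, maps_per_class=2)
```

The channel count drops from 4 to 2 while the spatial size of each map is unchanged, which mirrors the reduction from $MC$ channels to $C$ channels described above.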

Figure 3: Compared to previous GCNs, our GCN layer adopts a dynamic correlation matrix to learn the inter-dependency among individual features by gradient updating.

Dynamic graph updating. After obtaining the feature map of each class using the transfer layer, we construct a novel GCN module followed by class-wise pooling, where the convolutional operation is taken as a spectral graph convolution [15]. We also design a dynamic adjacency matrix learning strategy to automatically learn the inter-dependency between disease pairs and thereby explore more natural relevance. At each time step of the graph update, the dynamic matrix is updated by the backpropagated gradient.

For each node $v$ in the graph of a GCN layer, we denote its representation feature as $h_v^{t} \in \mathbb{R}^{d}$ at time $t$, where $d$ is the dimension of each node. The neighbors of node $v$ are defined as $N(v)$; node $v$ collects messages from its neighbor nodes with the correlation matrix $A^{t}$ and then updates its hidden state. Our dynamic matrix $A^{t}$ is obtained from $A^{t-1}$, and is further updated with the gradient at the current time step. In detail, after the reconstruction of node $v$, matrix $A^{t}$ updates its row $A_v^{t}$ to learn the conditional relevance of other diseases for disease $v$.

The following two steps are formulated for each node $v$ to collect messages and update its hidden state, where $f(\cdot)$ is a non-linear function such as LeakyReLU [22]. Here $A_{vu}^{t}$ is the $(v, u)$ entry of matrix $A$ at time $t$, representing the conditional probability of disease $u$ when disease $v$ appears; $W$ denotes the parameters to be learned for feature embedding; and $\oplus$ denotes matrix addition. The weighted representations of the neighbor nodes $u$ of node $v$ are accumulated into the collected message $m_v^{t}$ of node $v$.

After the message is propagated in the graph, our dynamic correlation matrix updates each row $A_v^{t}$ with the backpropagated gradient. We directly compute the gradient of the loss function $L$ with respect to $A_v^{t}$, and update the row with a learning rate $\lambda$. Eq. 5 is used to normalize matrix $A$ to lie in $[0, 1]$ so that it represents the conditional probabilities of disease pairs, where $A_{vv}$ is the maximum value in row $v$ since each disease depends most on itself.
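The two propagation steps and the row update can be sketched as follows in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the graph is taken to be fully connected with self-loops, the embedding weights `W` are shared by all nodes, and the rescaling into $[0, 1]$ uses min-max normalization per row, which is a choice made for this sketch. All function names are hypothetical.

```python
def leaky_relu(x, negative_slope=0.2):
    return x if x >= 0 else negative_slope * x


def propagate(h, A, W):
    """One message-passing step over the disease graph.

    h: list of C node features, each a list of d floats
    A: C x C correlation matrix; A[v][u] weights the message from u to v
    W: d x d_out embedding weights shared by all nodes (assumed shape)
    """
    C, d, d_out = len(h), len(W), len(W[0])
    # Embed every node feature: e[u] = h[u] @ W
    e = [[sum(h[u][k] * W[k][o] for k in range(d)) for o in range(d_out)]
         for u in range(C)]
    new_h = []
    for v in range(C):
        # Collected message of node v: embeddings weighted by row v of A
        m_v = [sum(A[v][u] * e[u][o] for u in range(C)) for o in range(d_out)]
        # Hidden-state update with the non-linearity
        new_h.append([leaky_relu(x) for x in m_v])
    return new_h


def update_row(row, grad_row, lr):
    """Gradient step on one row of A, then rescale the row into [0, 1].

    Min-max rescaling is an assumption of this sketch.
    """
    stepped = [a - lr * g for a, g in zip(row, grad_row)]
    lo, hi = min(stepped), max(stepped)
    if hi == lo:
        return [1.0 for _ in stepped]
    return [(a - lo) / (hi - lo) for a in stepped]


h = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.5], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
new_h = propagate(h, A, W)
new_row = update_row([1.0, 0.4, 0.2], [0.0, -1.0, 1.0], lr=0.1)
```

In an automatic-differentiation framework the gradient of the loss with respect to the matrix would come from backpropagation; here `grad_row` is simply passed in so that the update rule stays visible.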

Details of GCN structure. In more detail, as shown in Fig. 2, we exploit a stacked GCN that shares the same dynamic correlation matrix between two graphs. The stacked GCN structure contains two GCN layers: the first GCN is fed with the independent feature maps of the different classes and outputs one feature vector per node; these features are then sent to the second graph, which outputs the prediction for each class. Before training, we initialize the correlation matrix with a data-driven matrix, which counts the occurrences of label pairs in the training set and then calculates the conditional probability in both directions. For the incomplete-label task, we ignore the uncertain labels and estimate the initial correlation matrix on certain labels only.
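A minimal sketch of the data-driven initialization, in plain Python, is given below. It estimates each entry as the fraction of images positive for disease $i$ that are also positive for disease $j$, and it skips an image for a given pair when the label for $j$ is uncertain. How exactly uncertain labels are excluded is an assumption of this sketch, and the function name is hypothetical.

```python
def init_correlation_matrix(labels, num_classes):
    """Estimate A[i][j] = P(disease j present | disease i present).

    labels: list of per-image label lists with entries 1, 0, or -1 (uncertain).
    Only images where label i is positive and label j is certain are counted.
    """
    A = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        for j in range(num_classes):
            given = [y for y in labels if y[i] == 1 and y[j] != -1]
            if not given:
                continue  # no evidence for this pair; leave the entry at 0
            both = sum(1 for y in given if y[j] == 1)
            A[i][j] = both / len(given)
    return A


toy_labels = [[1, 1, 0], [1, 0, -1], [0, 1, 1], [1, 1, 1]]
A0 = init_correlation_matrix(toy_labels, num_classes=3)
```

Because the counts are conditioned on the row disease, the matrix is generally asymmetric, which matches the idea of computing the conditional probability in both directions.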

3.3 Learning with Incomplete Label

To train an end-to-end network with incomplete labels, we develop a two-step strategy. 1) Learning with adaptive loss for incomplete labels. We first train the model only with known labels and ignore uncertain labels; 2) Adopting curriculum learning to predict uncertain labels. After getting a pre-trained model, we feed the data of uncertain labels into the model to find weak positive/negative labels for relabeling, and then we finetune the model with both known and relabeled annotations.

Algorithm 1 Curriculum learning
Input: Training dataset $D$ and our DD-GCN model $F$
Output: The finetuned model $F$
1: Initialize $D$ with known labels only (1 or 0)
2: Pre-train the model $F$ with known labels
3: repeat
4:   Run inference with $F$ on the uncertain labels in the dataset
5:   Find and relabel the easy weak labels
6:   if no easy weak label exists then
7:     break
8:   Update $D$ with the weak labels and finetune $F$ with the updated $D$
9: return the finetuned model $F$

Adaptive loss function for incomplete labels. Since each CXR image may contain both certain (0 or 1) and uncertain (-1) labels for different diseases, we need to guide the optimizer to learn only from the certain labels of each image. However, within each mini-batch the numbers of uncertain and known labels are unbalanced. If we directly use the cross-entropy function on the binary labels (0 or 1) of each mini-batch for multi-label classification, the backpropagated gradient may be small and lead to slow convergence. To solve this problem, we utilize an adaptive loss based on MultiLabelSoftMarginLoss to train our model, in which the loss of each mini-batch is normalized by the proportion of known labels in that mini-batch. This helps the optimizer to learn from the unbalanced data with better convergence, and has a similar goal to batch normalization, which normalizes the distributions of layer inputs for each mini-batch.
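One plausible reading of this adaptive loss is sketched below in plain Python: the per-label sigmoid cross-entropy (the quantity averaged by a multi-label soft margin loss) is summed over known labels only, averaged over all labels in the batch, and then divided by the proportion of known labels. The exact form used by the authors is not recoverable from the text, so the normalization shown here is an assumption, and the function names are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def adaptive_soft_margin_loss(logits, labels):
    """logits, labels: N x C nested lists; labels take values 1, 0, or -1."""
    total, known, count = 0.0, 0, 0
    for x_row, y_row in zip(logits, labels):
        for x, y in zip(x_row, y_row):
            count += 1
            if y == -1:
                continue  # uncertain labels contribute nothing to the loss
            known += 1
            # -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
            total += y * softplus(-x) + (1 - y) * softplus(x)
    if known == 0:
        return 0.0
    proportion = known / count           # share of known labels in the batch
    return (total / count) / proportion  # rescale the masked batch mean


loss_half_known = adaptive_soft_margin_loss([[0.0, 0.0]], [[1, -1]])
loss_all_known = adaptive_soft_margin_loss([[0.0, 0.0]], [[1, 0]])
```

With this normalization a batch in which half of the labels are uncertain yields the same loss scale as a fully labeled batch, so the gradient magnitude does not shrink as the share of uncertain labels grows.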

Adopting curriculum learning to predict uncertain labels. To make effective use of the uncertain labels, we aim to find the weak positive/negative labels among them and relabel them to finetune our model. Here, we propose an adaptive curriculum learning strategy [1] to relabel the uncertain parts. We first search for easy weak positive/negative samples among the uncertain labels using the pre-trained model with a threshold strategy, and then finetune the pre-trained model with these weak samples. We iterate until no more easy samples can be found with the current finetuned model.

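Assume that $F$ is our trained model, and $\tau_p$ and $\tau_n$ are the thresholds that define new weak positive and negative labels among the uncertain ones, where $\tau_n < \tau_p$. For each uncertain label of class $c$ in an image $x$, we re-mark the label as weak positive (1) if the predicted probability $F(x)_c$ reaches $\tau_p$, as weak negative (0) if it falls to $\tau_n$ or below, and leave it uncertain otherwise.

Then we add the relabeled annotations to the original known labels, and finetune the pre-trained model with the combined annotations. Details can be found in Algorithm 1.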
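The relabeling rule and the surrounding loop can be sketched as follows in plain Python. The default thresholds 0.9 and 0.1 are used purely for illustration, and `predict` and `finetune` are hypothetical callbacks standing in for model inference and fine-tuning; none of these names come from the original work.

```python
def relabel_uncertain(labels, probs, tau_pos=0.9, tau_neg=0.1):
    """Return a relabeled copy of `labels` and the number of labels changed.

    labels: N x C nested lists with entries 1, 0, or -1 (uncertain)
    probs:  N x C predicted probabilities from the current model
    """
    new_labels = [row[:] for row in labels]
    changed = 0
    for n, row in enumerate(labels):
        for c, y in enumerate(row):
            if y != -1:
                continue  # known labels are never overwritten
            if probs[n][c] >= tau_pos:
                new_labels[n][c] = 1  # weak positive
                changed += 1
            elif probs[n][c] <= tau_neg:
                new_labels[n][c] = 0  # weak negative
                changed += 1
    return new_labels, changed


def curriculum(labels, predict, finetune, max_rounds=10):
    """Relabel and finetune until no easy weak label is found.

    predict():        hypothetical callback returning N x C probabilities
    finetune(labels): hypothetical callback that updates the model
    """
    for _ in range(max_rounds):
        labels, changed = relabel_uncertain(labels, predict())
        if changed == 0:
            break
        finetune(labels)
    return labels


toy_labels = [[1, -1, -1], [-1, 0, -1]]
toy_probs = [[0.2, 0.95, 0.5], [0.05, 0.7, 0.9]]
relabeled, num_changed = relabel_uncertain(toy_labels, toy_probs)
```

Labels whose predicted probability lies between the two thresholds stay uncertain, so only confident predictions enter the fine-tuning set in each round.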

4 Experiments

In this section, we investigate the performance of the proposed DD-GCN model on two standard benchmark datasets of chest X-ray images (CXRs): Chest X-ray14 [37] and CheXpert [14]. We compare DD-GCN with state-of-the-art methods and conduct an ablation study to demonstrate the effectiveness of each component in DD-GCN. We also show the adjacency matrix learned by the GCN layer, which captures the correlation between diseases, and explain why it is consistent with expert experience.

4.1 Datasets

Chest X-ray14. Chest X-ray14 [37] is a large CXR dataset that contains 112,120 frontal-view chest X-ray images of 30,805 unique patients. Each radiograph is labeled 1 (positive) or 0 (negative) for each of 14 common thorax diseases: Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening and Hernia. We use the official patient-level data split, which contains 76,524, 10,000, and 25,596 CXRs for training, validation and testing, respectively.

CheXpert. CheXpert [14] is the largest dataset of CXRs, consisting of 224,316 chest radiographs of 65,240 patients with both frontal and lateral views. It contains 200 studies from 200 patients for validation and 500 studies from 500 patients for evaluation. Different from Chest X-ray14, CheXpert captures the uncertainties inherent in radiograph interpretation with an effective labeling strategy (0 for negative, -1 for uncertain, and 1 for positive), and contains a substantial number of uncertain labels. For example, there are 33,739 uncertain labels and 33,376 positive labels for Atelectasis. In our experiments, we only evaluate the AUC scores on the top 5 diseases (Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion) as in [14].

4.2 Experimental Setting

DD-GCN utilizes DenseNet121 [12] with drop rate 0.2 as the mixed feature extractor, which is initialized with a model pre-trained on ImageNet [16]. The transfer layer generates a fixed number of feature maps for each class. In the stacked GCN structure, we set the dimension of the feature vector in each node to 100, and adopt LeakyReLU [22] with negative slope 0.2 as the non-linear activation function after each GCN layer.

In the training phase, the input images are randomly cropped and resized, with random horizontal flips for data augmentation. In the first phase of learning from known labels, we use the Adam optimizer and train our model for 15 epochs. The learning rate is decayed with a factor of 0.9 every 3 epochs. In particular, weak positive/negative samples selected with the strict thresholds 0.9/0.1 are added to the known labels for the current round of pre-training. When finetuning the pre-trained model with the curriculum learning strategy, we set the thresholds that define weak labels, adjust the optimizer settings, and train the model for only 5 epochs at each iteration. All experiments are implemented on a single NVIDIA 1080 Ti GPU with the PyTorch [28] framework. Our full code and pre-trained models will be made publicly available in the formal version.

4.3 Comparison with State-of-the-arts

Chest X-ray14. We first investigate the performance of our baseline model (DD-GCN without adaptive loss and curriculum learning) on the official split of the Chest X-ray14 dataset with completely known labels. Table 1 shows the detailed comparison of AUC scores with state-of-the-art methods. Compared with the seven advanced methods, our model achieves the highest average AUC score of 0.8525. Our model outperforms that of Mao et al. [24] by 2.55% (0.8525 vs. 0.8270); their method simply constructs the ImageGraph with different image inputs for clustering and loses the correlation information between labels in each CXR. Although Yan et al. [39] also applied a transfer layer to divide feature maps for different classes, their method does not exploit the label correlation, and ours outperforms theirs by 2.23% (0.8525 vs. 0.8302). Our method achieves the highest score on all diseases except Cardiomegaly and Emphysema, and learns feature representations as well as the correlation matrix between the class pairs.

Disease        Wang [37]  Yao [40]  Li [21]  Guendel [10]  Raj [31]  Mao [24]  Yan [39]  DD-GCN
Atelectasis    0.7160     0.7330    0.8000   0.7670        0.7795    0.7960    0.7924    0.8328
Cardiomegaly   0.8070     0.8580    0.8700   0.8830        0.8816    0.8960    0.8814    0.8902
Effusion       0.7840     0.8060    0.8700   0.8280        0.8268    0.8730    0.8415    0.8854
Infiltration   0.6090     0.6750    0.7000   0.7090        0.6894    0.6990    0.7095    0.7120
Mass           0.7060     0.7270    0.8300   0.8210        0.8307    0.8340    0.8470    0.8756
Nodule         0.6710     0.7780    0.7500   0.7580        0.7814    0.7620    0.8105    0.8110
Pneumonia      0.6330     0.6900    0.6700   0.7310        0.7354    0.7170    0.7397    0.7627
Pneumothorax   0.8060     0.8050    0.8700   0.8460        0.8513    0.8900    0.8759    0.8938
Consolidation  0.7080     0.7170    0.8000   0.7450        0.7542    0.7880    0.7598    0.8198
Edema          0.8350     0.8060    0.8800   0.8350        0.8496    0.8890    0.8478    0.9091
Emphysema      0.8150     0.8420    0.9100   0.8950        0.9249    0.9070    0.9422    0.9297
Fibrosis       0.7690     0.7570    0.7800   0.8180        0.8219    0.8130    0.8326    0.8390
P.T            0.7080     0.7240    0.7900   0.7610        0.7925    0.7920    0.8083    0.8213
Hernia         0.7670     0.8240    0.7700   0.8960        0.9323    0.9170    0.9341    0.9526
Average        0.7381     0.7673    0.8064   0.8066        0.8180    0.8270    0.8302    0.8525
Table 1: The AUC score comparison on the official split test set of Chest X-ray14 (P.T: Pleural Thickening). DD-GCN achieves the best performance on almost all diseases.

Methods             Atelectasis  Cardiomegaly  Consolidation  Edema   Pleural Effusion  Average
U-Ignore [14]       0.818        0.828         0.938          0.934   0.928             0.8892
U-Zeros [14]        0.811        0.840         0.932          0.929   0.931             0.8886
U-Ones [14]         0.858        0.832         0.899          0.941   0.934             0.8927
U-SelfTrained [14]  0.833        0.831         0.939          0.935   0.932             0.8940
U-MultiClass [14]   0.821        0.854         0.937          0.928   0.936             0.8952
DD-GCN              0.8723       0.8570        0.9475         0.9396  0.9436            0.9120
Table 2: The AUC score comparison on CheXpert validation set. DD-GCN achieves the best performance.

CheXpert. We investigate the effectiveness of our incomplete-label learning component on the CheXpert dataset with uncertain labels. Here we add the adaptive loss function and curriculum relabeling to the baseline (DD-GCN), and Table 2 shows the comparison results on the top 5 diseases. Our model achieves the best average AUC score of 0.9120, and outperforms the five basic models (U-Ignore, U-Zeros, U-Ones, U-SelfTrained and U-MultiClass) by 2.28%, 2.34%, 1.92%, 1.80% and 1.68%, respectively. The ROC curves are visualized in Fig. 4(a)-4(e), where the blue curve of our final model lies above those of the other variants. This illustrates that the AUC score of each disease increases when we apply the adaptive loss and the relabeling trick to our baseline model, and demonstrates the strong performance of our method on the chest disease classification task.

Figure 4: ROC on CheXpert validation set, with panels (a) Atelectasis, (b) Cardiomegaly, (c) Consolidation, (d) Edema, (e) Pleural Effusion, and (f) total. (a)-(e) show the ROC of each disease with different strategies, and (f) is the ROC of model DD-GCN+adaptive loss+relabeling with the best average AUC score of 0.9120.

                  Atelectasis  Cardiomegaly  Consolidation  Edema   Pleural Effusion
Atelectasis       1.0000       0.0000        0.1554         0.0496  0.0893
Cardiomegaly      0.1159       1.0000        0.0605         0.4295  0.0000
Consolidation     0.0000       0.0514        1.0000         0.0784  0.2435
Edema             0.1045       0.0047        0.0876         1.0000  0.0000
Pleural Effusion  0.3686       0.2252        0.1258         0.0000  1.0000
Table 3: Correlation matrix learned from the GCN graph.

4.4 Analysis on the Learned Correlation Matrix

To further demonstrate that the graph learns reliable conditional probabilities between the disease pairs, we show the adjacency matrix learned on the CheXpert dataset. Table 3 shows the correlation between different disease pairs, where each value represents the corresponding conditional probability. As Melissa et al. [25] noted that Cardiomegaly can cause the same signs and symptoms as Edema, the value of 0.4295 in the Cardiomegaly row and Edema column verifies that Edema is more likely to appear with Cardiomegaly, which is consistent with expert experience. The entries in the Pleural Effusion row for Cardiomegaly (0.2252) and Atelectasis (0.3686) are also relatively high, since bilateral Pleural Effusion (P.E) is generally associated with Cardiomegaly [30] and Atelectasis may happen with a Pneumothorax or Pleural Effusion [27]. Also, the conditional probability between P.E and Consolidation fits the observation that the causes of pleural effusions also cause lung consolidation [26]. Besides, the conditional probabilities of unrelated disease pairs are much smaller than those of the other pairs. More correlation matrix results on Chest X-ray14 can be found in the supplementary material. As shown in Table 1 of the supplementary material, the learned value for the Cardiomegaly and Effusion pair does not match the expert experience that Cardiomegaly may happen with Pleural Effusion. The reason is that Chest X-ray14 only contains complete labels, and only a few samples contain both Cardiomegaly and Effusion. Overall, the results indicate that our dynamic correlation matrix can learn the correct inter-dependency between the disease pairs. This suggests that it is more robust for disease diagnosis and for exploring natural correlations, and that it can be used for disease prediction.
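As a small reading aid, the following Python snippet takes the values of Table 3 and lists, for each row disease, the off-diagonal entry with the largest learned value. The helper name is hypothetical; the numbers are copied from Table 3.

```python
names = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pleural Effusion"]
table3 = [
    [1.0000, 0.0000, 0.1554, 0.0496, 0.0893],
    [0.1159, 1.0000, 0.0605, 0.4295, 0.0000],
    [0.0000, 0.0514, 1.0000, 0.0784, 0.2435],
    [0.1045, 0.0047, 0.0876, 1.0000, 0.0000],
    [0.3686, 0.2252, 0.1258, 0.0000, 1.0000],
]


def strongest_partner(matrix, labels):
    """Map each row disease to its highest off-diagonal (disease, value) pair."""
    result = {}
    for i, row in enumerate(matrix):
        j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
        result[labels[i]] = (labels[j], row[j])
    return result


partners = strongest_partner(table3, names)
for disease, (other, value) in partners.items():
    print(f"{disease}: {other} ({value:.4f})")
```

The Cardiomegaly row points to Edema and the Pleural Effusion row points to Atelectasis, the two relations singled out in the analysis above.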

Transfer layer  GCN  Loss func  Relabeling  Atel    Card    Cons    Edema   P.E     Mean
-               -    -          -           0.8191  0.8472  0.8693  0.9439  0.9153  0.8790
✓               -    -          -           0.7906  0.8008  0.9233  0.9208  0.9222  0.8715
-               ✓    -          -           0.8339  0.8140  0.9245  0.9135  0.9210  0.8814
✓               ✓    -          -           0.8222  0.8452  0.9436  0.9096  0.9219  0.8885
✓               ✓    ✓          -           0.8411  0.8374  0.9371  0.9229  0.9223  0.8922
✓               ✓    ✓          ✓           0.8723  0.8570  0.9475  0.9396  0.9436  0.9120
Table 4: Ablation study on CheXpert validation set with AUC scores of top 5 diseases.

4.5 Ablation Study

We run an extensive ablation study to demonstrate the effect of each component in DD-GCN using CheXpert dataset, including the components of transfer layer, GCN layer, adaptive loss and curriculum learning.

Ablation study on individual components. We summarize the comparison in Table 4. The DenseNet121-only model achieves an AUC score of 0.8790, which drops to 0.8715 when we add the transfer layer, because the correlation between the feature maps of different classes is broken. The DenseNet121+transferLayer+GCN model further learns the label correlation with the graph and yields a score of 0.8885, a clear improvement. The DenseNet121+GCN model improves on DenseNet121 by only 0.24%, because only a weak inter-dependency remains in the mixed feature maps. The adaptive loss function increases the AUC score by an additional 0.37% by exploiting the label proportion information during training. Finally, the curriculum strategy for incomplete label learning extracts more of the potential context in the uncertain labels, and the model achieves the highest performance of 0.9120.

Besides, it is essential to investigate the influence of different input types for the GCN nodes. In general, works such as [17] feed the GCN with the mixed features of an image from a single model to extract the context. However, it is unclear whether the GCN part actually learns independent correlations from such mixed features in a multi-label task. Different from their method, our approach learns the relationship between individual feature maps that are separated from the mixed features. Here, the DenseNet121+GCN model yields an AUC of 0.8814, and our DenseNet121+transferLayer+GCN outperforms it by 0.71%. The result demonstrates that individual feature maps help the GCN learn more natural label correlation between the disease pairs.

Ablation study on the correlation matrix. We compare different types of correlation matrix on the same graph in Table 5. Our GCN with the dynamic correlation matrix performs best, outperforming the constant data-driven method [5] by 1.55%. The "+ ones matrix" setting assigns the same probability value to every edge between all disease pairs in the graph, which breaks the deep inter-dependency, and it only achieves an AUC score of 0.8684. This indicates that our dynamic correlation matrix is not limited by the scale of the database and can explore more natural and reliable inter-dependency between the disease pairs to improve diagnosis performance.
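A small numerical sketch can illustrate why a ones matrix breaks the inter-dependency. It uses a single linear propagation step with a row-normalized matrix, without the learned weights or nonlinearity of a real GCN layer. The `weighted` matrix is a hand-picked stand-in for a data-driven or learned matrix, not values from the paper.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(a: Matrix) -> Matrix:
    """Scale each row so that it sums to 1."""
    return [[v / sum(row) for v in row] for row in a]


def propagate(a: Matrix, h: Matrix) -> Matrix:
    """One linear propagation step: H' = normalize(A) @ H."""
    a_norm = row_normalize(a)
    n, d = len(h), len(h[0])
    return [
        [sum(a_norm[i][k] * h[k][j] for k in range(n)) for j in range(d)]
        for i in range(n)
    ]


# Three nodes with 2-dimensional features (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Same weight on every edge between all node pairs.
ones = [[1.0] * 3 for _ in range(3)]
# Stand-in for a data-driven or learned matrix with non-uniform edge weights.
weighted = [
    [1.0, 0.8, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.1, 1.0],
]

out_ones = propagate(ones, features)
out_weighted = propagate(weighted, features)
```

With the ones matrix every node receives the same average of all node features, so the nodes become identical after one step. With non-uniform weights each node aggregates a different mixture and stays distinguishable.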

Figure 5: Comparison of AUC scores on CheXpert for different relabeling thresholds.
Matrix                 Average AUC
+ ones matrix          0.8684
+ data-driven matrix   0.8965
+ dynamic matrix       0.9120
Table 5: Comparison on different correlation matrices.

GCN depth   Average AUC
            Chest X-ray14   CheXpert
2-layers    0.8525          0.9120
3-layers    0.8431          0.9010
4-layers    0.8396          0.8872
Table 6: Comparison on different number of layers in the stacked GCN module.

Ablation study on the GCN depth. To investigate the influence of the depth of the stacked GCN module, we show the results for different numbers of GCN layers in Table 6, where the feature length of each node in the hidden GCN layers is set to 100. The classification performance drops on both datasets as the number of GCN layers increases. The propagation between nodes accumulates when more GCN layers are used, resulting in over-smoothing. That is, the node features may be over-smoothed such that nodes from different classes (e.g., Atelectasis vs. Mass) become indistinguishable [19].
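The over-smoothing effect can be sketched with repeated linear propagation, again ignoring learned weights and activations. The adjacency values and features below are hypothetical; `spread` is a helper defined here that measures how far apart the node features still are.

```python
from typing import List

Matrix = List[List[float]]


def propagate(a: Matrix, h: Matrix) -> Matrix:
    """One linear propagation step with a row-normalized adjacency matrix."""
    n, d = len(h), len(h[0])
    out = []
    for i in range(n):
        row_sum = sum(a[i])
        out.append(
            [sum(a[i][k] / row_sum * h[k][j] for k in range(n)) for j in range(d)]
        )
    return out


def spread(h: Matrix) -> float:
    """Largest gap between any two nodes along any feature dimension."""
    return max(max(col) - min(col) for col in zip(*h))


# Hypothetical adjacency with positive weights on every edge.
adjacency = [
    [1.0, 0.8, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.1, 1.0],
]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Track how distinguishable the nodes remain after 1 to 4 propagation steps.
spreads = []
h = features
for depth in range(1, 5):
    h = propagate(adjacency, h)
    spreads.append(spread(h))
```

In this toy setting the spread shrinks with every additional step, which mirrors the explanation above: deeper stacks average node features more aggressively, so nodes drift toward the same representation.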

Ablation study on the relabeling thresholds. We also study the thresholds used to define the weak samples for curriculum learning. Fig. 5 shows the AUC comparison for different threshold settings on the CheXpert dataset. DD-GCN achieves 0.8922 without curriculum learning, and the score improves as the threshold increases up to the best setting shown in Fig. 5. The AUC score drops below 0.8922 when the threshold is set too high. The reason is that the weak positive/negative samples tend to be more reliable under the strict restriction of a low threshold in one round of relabeling, whereas a high threshold mixes up samples from the two clusters and produces wrong labels, which may mislead model training.
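The trade-off between strict and loose thresholds can be illustrated with a simple confidence-based rule. This is only a hypothetical sketch and not the relabeling rule of DD-GCN: it assumes a single threshold `tau`, where an uncertain sample becomes a weak positive if its predicted probability is at least `1 - tau`, a weak negative if it is at most `tau`, and stays uncertain otherwise.

```python
from typing import List, Optional


def relabel_uncertain(probs: List[float], tau: float) -> List[Optional[int]]:
    """Assign weak labels to uncertain samples (hypothetical rule).

    Returns 1 for weak positives, 0 for weak negatives, and None for
    samples that remain uncertain. A lower tau is a stricter rule.
    """
    if not 0.0 <= tau < 0.5:
        raise ValueError("tau must lie in [0, 0.5)")
    relabeled: List[Optional[int]] = []
    for p in probs:
        if p >= 1.0 - tau:
            relabeled.append(1)
        elif p <= tau:
            relabeled.append(0)
        else:
            relabeled.append(None)
    return relabeled


# Hypothetical predicted probabilities for five uncertain samples.
probs = [0.97, 0.80, 0.55, 0.30, 0.04]
strict = relabel_uncertain(probs, tau=0.05)
loose = relabel_uncertain(probs, tau=0.35)
```

Under this assumed rule, the strict setting relabels only the most confident samples, while the loose setting also relabels borderline predictions such as 0.80 and 0.30, which are more likely to receive wrong labels.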

Figure 6: Iteration steps in curriculum learning.

Ablation study on iterations of curriculum learning. To investigate how our curriculum strategy works, we take a closer look at the iteration steps in curriculum learning. As shown in Fig. 6, the curriculum algorithm relabels most of the weak samples in the first iteration, where DD-GCN achieves an AUC score of 0.9051. Since the remaining uncertain labels are hard to fit, the number of weak samples relabeled at each iteration decreases gradually in the following steps, while the AUC score increases slowly and reaches the best performance of 0.9120.

5 Conclusions

We present a GCN-based end-to-end network called DD-GCN for the multi-label chest disease diagnosis task, which uses a learnable adjacency matrix to dynamically learn the conditional probability between disease pairs. In addition, we feed each node with individual feature maps instead of general word embeddings, and we develop a curriculum learning algorithm to extract potential information from natural CXR datasets with incomplete labels. Experiments demonstrate that the proposed GCN model performs favorably against state-of-the-art methods on both the Chest X-ray14 and CheXpert datasets. We also verify that our GCN layer can capture reliable correlations among different diseases through the learnable correlation matrix, which represents the conditional probability of disease pairs. We believe that the learned representation of correlations can help to predict the trend of other diseases in future work.

References


  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 41–48. Cited by: §2, §3.3.
  • [2] S. S. Bucak, R. Jin, and A. K. Jain (2011) Multi-label learning with incomplete class assignments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2801–2808. Cited by: §2.
  • [3] R. S. Cabral, F. Torre, J. P. Costeira, and A. Bernardino (2011) Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems (NIPS), pp. 190–198. Cited by: §2.
  • [4] M. Chen, A. Zheng, and K. Weinberger (2013) Fast image tagging. In International Conference on Machine Learning (ICML), pp. 1274–1282. Cited by: §2.
  • [5] Z. Chen, X. Wei, P. Wang, and Y. Guo (2019) Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5177–5186. Cited by: §1, §1, §2, item a), item b), §3.2, §4.5.
  • [6] J. Deng, O. Russakovsky, J. Krause, M. S. Bernstein, A. Berg, and L. Fei-Fei (2014) Scalable multi-label annotation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3099–3102. Cited by: §1, §2.
  • [7] T. Durand, N. Mehrasa, and G. Mori (2019) Learning a deep convnet for multi-label classification with partial labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 647–657. Cited by: §2.
  • [8] T. Durand, T. Mordan, N. Thome, and M. Cord (2017) Wildcat: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 642–651. Cited by: §2, §3.2.
  • [9] Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, and Y. Yang (2018) Diagnose like a radiologist: attention guided convolutional neural network for thorax disease classification. arXiv preprint arXiv:1801.09927. Cited by: §1, §2.
  • [10] S. Guendel, S. Grbic, B. Georgescu, S. Liu, A. Maier, and D. Comaniciu (2018) Learning to recognize abnormalities in chest x-rays with location-aware dense networks. In Iberoamerican Congress on Pattern Recognition, pp. 757–765. Cited by: Table 1.
  • [11] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141. Cited by: §2.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708. Cited by: §2, §3.1, §3.2, §4.2.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.3.
  • [14] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Thirty-Third AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §2, §4.1, Table 2, §4.
  • [15] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §2, item b), §3.2, §3.2.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.2.
  • [17] C. Lee, W. Fang, C. Yeh, and Y. Frank Wang (2018) Multi-label zero-shot learning with structured knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1576–1585. Cited by: §1, §1, §2, item a), §3.2, §4.5.
  • [18] Q. Li, M. Qiao, W. Bian, and D. Tao (2016) Conditional graphical lasso for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2977–2986. Cited by: §1, §2, §3.2.
  • [19] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.5.
  • [20] X. Li, F. Zhao, and Y. Guo (2014) Multi-label image classification with a probabilistic label enhancement model. In UAI, Vol. 1, pp. 3. Cited by: §1.
  • [21] Z. Li, C. Wang, M. Han, Y. Xue, W. Wei, L. Li, and L. Fei-Fei (2018) Thoracic disease identification and localization with limited supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8290–8299. Cited by: Table 1.
  • [22] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML, Vol. 30, pp. 3. Cited by: §3.2, §4.2.
  • [23] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: §1, §2.
  • [24] C. Mao, L. Yao, and Y. Luo (2019) ImageGCN: multi-relational image graph convolutional networks for disease identification with chest x-rays. arXiv preprint arXiv:1904.00325. Cited by: §4.3, Table 1.
  • [25] M. Melissa Conrad Stöppler Enlarged heart: symptoms and signs. Note: https://www.medicinenet.com/enlarged_heart/symptoms.htm Cited by: §4.4.
  • [26] M. Nancy Moyer Lung consolidation: what it is and how it’s treated. Note: https://www.healthline.com/health/lung-consolidation Cited by: §4.4.
  • [27] NIH Atelectasis. Note: https://www.nhlbi.nih.gov/health-topics/atelectasis Cited by: §4.4.
  • [28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • [29] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: item a).
  • [30] J. M. Porcel (2009) Establishing a diagnosis of pleural effusion due to heart failure. Respirology 14 (4), pp. 471–473. Cited by: §4.4.
  • [31] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §1, §2, Table 1.
  • [32] J. Saini, M. Lal, and J. Rai (1980) Eosinophilic pleural effusion following pneumothorax.. The Indian journal of chest diseases & allied sciences 22 (2), pp. 133–136. Cited by: §1.
  • [33] C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 843–852. Cited by: §2.
  • [34] Y. Tang, X. Wang, A. P. Harrison, L. Lu, J. Xiao, and R. M. Summers (2018) Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In International Workshop on Machine Learning in Medical Imaging, pp. 249–258. Cited by: §2.
  • [35] G. Tsoumakas and I. Katakis (2007) Multi-label classification: an overview. International Journal of Data Warehousing and Mining (IJDWM) 3 (3), pp. 1–13. Cited by: §1, §2.
  • [36] Q. Wang, B. Shen, S. Wang, L. Li, and L. Si (2014) Binary codes embedding for fast image tagging with incomplete labels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 425–439. Cited by: §2.
  • [37] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2097–2106. Cited by: §1, §1, §2, §4.1, Table 1, §4.
  • [38] M. Xu, R. Jin, and Z. Zhou (2013) Speedup matrix completion with side information: application to multi-label learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2301–2309. Cited by: §2.
  • [39] C. Yan, J. Yao, R. Li, Z. Xu, and J. Huang (2018) Weakly supervised deep learning for thoracic disease classification and localization on chest x-rays. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 103–110. Cited by: §1, §2, §4.3, Table 1.
  • [40] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, and K. Lyman (2017) Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501. Cited by: §1, §2, Table 1.
  • [41] Y. Zhou, Z. Li, S. Bai, C. Wang, X. Chen, M. Han, E. Fishman, and A. L. Yuille (2019) Prior-aware neural network for partially-supervised multi-organ segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 10672–10681. Cited by: §2.
  • [42] Y. Zhou, Y. Wang, P. Tang, S. Bai, W. Shen, E. Fishman, and A. Yuille (2019) Semi-supervised 3d abdominal multi-organ segmentation via deep multi-planar co-training. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 121–140. Cited by: §2.
  • [43] Y. Zhou, Y. Wang, P. Tang, W. Shen, E. K. Fishman, and A. L. Yuille (2018) Semi-supervised multi-organ segmentation via multi-planar co-training. arXiv preprint arXiv:1804.02586. Cited by: §2.