ImageGCN: Multi-Relational Image Graph Convolutional Networks for Disease Identification with Chest X-rays

03/31/2019 ∙ by Chengsheng Mao, et al. ∙ Northwestern University

Image representation is a fundamental task in computer vision. However, most existing approaches to image representation ignore the relations between images and consider each input image independently. Intuitively, relations between images can help in understanding the images and in maintaining model consistency over related images. In this paper, we model image-level relations to generate more informative image representations, and propose ImageGCN, an end-to-end graph convolutional network framework for multi-relational image modeling. We also apply ImageGCN to chest X-ray (CXR) images, where rich relational information is available for disease identification. Unlike previous image representation models, ImageGCN learns the representation of an image using both its original pixel features and the features of related images. Besides learning informative representations for images, ImageGCN can also be used for object detection in a weakly supervised manner. Experimental results on the ChestX-ray14 dataset demonstrate that ImageGCN outperforms the respective baselines in both disease identification and localization tasks and achieves results comparable to, and often better than, state-of-the-art methods.




1 Introduction

Learning low-dimensional representations of images is a fundamental task in computer vision. Deep learning techniques, especially convolutional neural network (CNN) architectures, have achieved remarkable breakthroughs in learning image representations for classification [22, 14, 15]. However, most existing approaches consider each input image independently and ignore the relations between images. In reality, multiple relations can exist between images; in clinical settings especially, medical images from the same person can show pathophysiologic progression. Intuitively, related images can give insights that help to understand the current image. For example, images on the same web page can help to explain each other, and knowing a patient's other medical images can help to analyze the current image.

We model the images and the relations between them as a graph, named ImageGraph, where a node corresponds to an image and an edge between two nodes represents a relation between the two corresponding images. An ImageGraph incorporating multiple types of relations is a multigraph where multiple edges exist between two nodes. The neighborhood of an image in the ImageGraph represents the images that have close relations with it. Fig. 1(a) shows an example of ImageGraph of CXR images incorporating 3 types of relations between 5 nodes.

Figure 1: Overview of ImageGCN. (a) The ImageGraph constructed from the original images and the relations between them. Here we show 3 relations in an ImageGraph of CXR images, marked with different colors. The relations between CXR images are defined in Section 4.2. (b) Multi-layer ImageGCN to model the ImageGraph. (c) The output low-dimensional distributed representations for all images in the ImageGraph. The structure of the graph is preserved. (d) The image representations are used for downstream tasks, such as classification and clustering.

Learning an image representation incorporating both neighborhood information and the original pixel information is difficult, because the neighborhood information is unstructured and varies for different nodes. Inspired by the emerging research on graph convolutional networks (GCN) [21, 13, 4, 40] that can model graph data to learn informative representations for nodes based on the original node features and the structure information, we propose ImageGCN, an end-to-end GCN framework on ImageGraph, to learn the image representations. In ImageGCN, each image updates the information based on its own features and the images related to it. Fig. 1 shows an overview of ImageGCN, where each node in an ImageGraph is transformed into an informative representation by a number of ImageGCN layers.

There are several issues when applying the original GCN [21] to an ImageGraph. (1) The original GCN is transductive and requires all node features to be present during training, which does not scale to large ImageGraphs. (2) The original GCN is designed for simple graphs and cannot support multi-relational ImageGraphs. (3) The original GCN is effective for low-dimensional node feature vectors, and cannot be effectively extended to nodes with high-dimensional or unstructured features, as in ImageGraphs. GraphSAGE [13] addressed the first issue by enabling inductive learning for GCN, and relational GCN [40] addressed the multi-relational issue. However, the third issue, applying GCN to high-dimensional or unstructured features, remains unaddressed. ImageGCN is proposed to address this issue and, further, to incorporate the ideas of GraphSAGE and relational GCN for batch propagation on multi-relational ImageGraphs.

In this paper, for graphs with high-dimensional or unstructured features in the nodes, we propose to design flexible message passing units (MPU) to do message passing between two adjacent nodes, instead of a linear transformation in the original GCN. In the proposed ImageGCN, we use a number of MPUs equipped with a multi-layer CNN architecture for message passing between images in a multi-relational ImageGraph. We introduce partial parameter sharing between different MPUs corresponding to different relations to reduce model complexity. We also incorporate the idea of GraphSAGE and relational GCN to our ImageGCN model for inductive batch propagation on multi-relational ImageGraphs.

We evaluate ImageGCN on the ChestX-ray14 dataset [46] where rich relations are available between the Chest X-ray (CXR) images. The experimental results demonstrate that ImageGCN can outperform respective baselines in both disease identification and localization.

Besides the improved performance, the main contributions are as follows. (1) To the best of our knowledge, this is the first study to model natural image-level relations for image representation. (2) We propose ImageGCN to extend the original GCN to high-dimensional or unstructured data in an ImageGraph. (3) We incorporate the ideas of relational GCN and GraphSAGE into ImageGCN for inductive batch propagation on multi-relational ImageGraphs. (4) We introduce the partial parameter sharing scheme to reduce the model complexity of ImageGCN.

2 Related Work

Deep learning for disease identification with CXR. Since the ChestX-ray14 dataset [46] was released, an increasing number of studies on CXR image analysis have used deep neural networks for disease identification [46, 51, 23, 31, 12]. The general idea of previous work is to generate a low-dimensional representation of each image independently with a deep neural network architecture. In our work, we consider the relations between CXR images, and learn a representation based on the image itself and its neighbor images.

Relational Modeling. Previous research on relational modeling in computer vision mainly focused on pixel-level relations [30, 33], object-level relations [49, 32, 6, 56, 52] and label-level relations [25, 45]. Image-level similarity relations were also studied in the literature [10, 45]. However, few studies model the natural image-level relations for image representation.

Graph Neural Networks. Recently, inspired by the huge success of CNNs on regular Euclidean data such as images (2D grids) and text (1D sequences), a large number of studies have tried to generalize the convolution operation to non-Euclidean data such as graphs [36, 7, 33, 3, 21]. Among the pioneering studies, Kipf and Welling [21] resolved the computational bottleneck by learning polynomials of the graph Laplacian and provided fast approximate convolutions on graphs, known as Graph Convolutional Networks (GCN), which improved scalability and classification performance on large-scale graphs. GCN has had a wide range of applications across different tasks and domains, such as natural language processing [50, 1, 54, 27], recommender systems [2, 33, 53], life science and health care [18, 57, 9, 5], and combinatorial optimization [19, 28]. GCN has also been explored in several computer vision tasks, such as image classification [47, 10], scene graph generation [48, 17], semantic segmentation [24, 44], and visual reasoning [37, 35, 52]. In most previous studies, the graphs were built from a knowledge graph [47, 37, 35], object relations [48, 17], or point clouds [24, 44]. In this paper, we take into account the natural image-level relations to construct a multi-relational ImageGraph, and use GCN to model the relations and learn informative representations for the image nodes.

3 Methods

3.1 Graph Convolutional Networks

Graph convolutional networks (GCN) [21] incorporate node feature information and structure information to learn informative representations for the nodes in a graph. GCN learns node representations with the following propagation rule, derived from spectral graph convolutions for an undirected graph [21]:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{1}$$

where $\tilde{A} = A + I$ is the adjacency matrix with added self-connections, $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ can be seen as a symmetrically normalized adjacency matrix, $H^{(l)}$ and $W^{(l)}$ are the node representation matrix and the trainable linear transformation matrix in the $l$th layer, $H^{(0)} = X$ is the original feature matrix of the nodes, and $\sigma(\cdot)$ is the activation function (such as the ReLU).

The propagation rule of GCN in Eq. 1 can be interpreted as Laplacian smoothing on a graph [26]: the new feature of a node is computed as the weighted average of itself and its neighbors, followed by a linear transformation before the activation function, as in Eq. 2,

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}_i} \frac{1}{c_{ij}} h_j^{(l)} W^{(l)}\right) \tag{2}$$

where $h_i^{(l)}$ is the representation of node $v_i$ in the $l$th layer, $\mathcal{N}_i$ is the set of all nodes that have a connection with $v_i$ (self included), and $c_{ij}$ is a problem-specific normalization coefficient. It can be proven that Eq. 2 is equivalent to the original GCN Eq. 1 when $1/c_{ij}$ is the $(i, j)$ entry of the symmetrically normalized adjacency matrix $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. Eq. 2 can be read as a node accepting messages from its neighbors [11]; by adding the self-connection, a node is also considered a neighbor of itself.
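The equivalence between the matrix form and the node-wise message passing form can be checked numerically on a toy graph. The following is a minimal sketch in plain Python, assuming dense lists of rows, a ReLU activation, and the symmetric normalization of Kipf and Welling; the function names and the toy graph are illustrative and not taken from the paper.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Add self-connections, then scale entry (i, j) by 1 / sqrt(d_i * d_j)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer_matrix(adj, h, w):
    """Matrix form: ReLU(A_hat @ H @ W)."""
    out = matmul(matmul(normalized_adjacency(adj), h), w)
    return [[max(0.0, x) for x in row] for row in out]


def gcn_layer_nodewise(adj, h, w):
    """Node-wise form: every node sums weighted messages from its neighbours (self included)."""
    a_hat = normalized_adjacency(adj)
    hw = matmul(h, w)  # the same linear transformation is applied to every node
    out = []
    for i in range(len(adj)):
        agg = [0.0] * len(hw[0])
        for j, weight in enumerate(a_hat[i]):
            if weight:  # a non-zero weight means j is a neighbour of i
                agg = [a + weight * m for a, m in zip(agg, hw[j])]
        out.append([max(0.0, a) for a in agg])
    return out


# Toy path graph 0 - 1 - 2 with 2-dimensional node features (illustrative values).
ADJ = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[0.5, -1.0], [1.0, 0.5]]
```

Calling `gcn_layer_matrix(ADJ, H0, W0)` and `gcn_layer_nodewise(ADJ, H0, W0)` should give the same matrix up to floating-point rounding, which is the sense in which the two formulations agree.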

Eq. 2 can be extended to multiple relations as Eq. 3 [40], where $r$ indicates a certain relation from a set of relations $\mathcal{R}$ and $\mathcal{N}_i^r$ represents all the nodes that have relation $r$ with node $v_i$.

$$h_i^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{ij}^r} h_j^{(l)} W_r^{(l)}\right) \tag{3}$$

The relational GCN formulated by Eq. 3 can be interpreted as a node accepting messages from the nodes that have any relation with it. The message passing weights vary across relations and layers. Note that there is a special relation in $\mathcal{R}$ that deserves more attention, the self-connection (denoted by $r_0$). We have $c_{ii}^{r_0} = 1$ if we consider that each node accepts its self-contribution as is during information updating. Unlike the original GCN in Eqs. 1 and 2, where all connections, including the self-connection, are treated equally, the relational GCN designs different message passing methods for different relations, including the self-connection.

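We can also write Eq. 3 in matrix form as Eq. 4, where $\hat{A}^r$ is a normalized adjacency matrix for relation $r$; for the self-connection $r_0$, $\hat{A}^{r_0} = I$ is an identity matrix. With Eq. 4, the computational efficiency can be improved using sparse matrix multiplications.

$$H^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \hat{A}^r H^{(l)} W_r^{(l)}\right) \tag{4}$$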
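The matrix form of the relational propagation rule can be sketched as follows. This is a minimal illustration in plain Python, assuming row normalization (a mean over neighbors) for each relation, an identity adjacency for the self-connection, and dense lists of rows for readability; the names `relational_gcn_layer`, `rel_adjs` and `rel_weights` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(adj):
    """Divide each row by its sum; rows without neighbours stay all-zero."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([x / total for x in row] if total else [0.0] * len(row))
    return out


def relational_gcn_layer(rel_adjs, h, rel_weights, self_weight):
    """ReLU(H @ W_self + sum over relations r of A_hat_r @ H @ W_r)."""
    # Self-connection: the adjacency is the identity, so the message is H @ W_self.
    total = matmul(h, self_weight)
    for name, adj in rel_adjs.items():
        msg = matmul(row_normalize(adj), matmul(h, rel_weights[name]))
        total = [[t + m for t, m in zip(t_row, m_row)] for t_row, m_row in zip(total, msg)]
    return [[max(0.0, x) for x in row] for row in total]


# Two nodes linked by a single "person" relation, 1-dimensional features.
OUT = relational_gcn_layer(
    {"person": [[0, 1], [1, 0]]},
    [[1.0], [2.0]],
    {"person": [[0.5]]},
    [[1.0]],
)
```

Here node 0 keeps its own feature (1.0) and adds half of its neighbor's feature (0.5 × 2.0), so `OUT` is `[[2.0], [2.5]]`. A practical implementation would store each relation as a sparse matrix, as the text suggests.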


Note that Eqs. 3 and 4 generalize to the case of multiple relations between two nodes and to directed graphs. When multiple relations exist between two nodes, e.g., two CXR images that share the same patient and the same view position, message passing should be conducted multiple times, once for each relation. For directed graphs, a directed edge can be regarded as two relations, the in relation and the out relation; thus there should be two different message passing methods, corresponding to message passing from the head node to the tail node and from the tail to the head, respectively.

Figure 2: The propagation rule of ImageGCN. For brevity, we only show the propagation of Image 4; the other images propagate by a similar rule. A dashed box is a GCN layer that consists of a number of message passing units (one per relation) and a number of aggregators (one per image), followed by an activation function. $r_0$ is the self-connection relation; the remaining relations are those between images. Colors indicate the propagation for different relations. $h_i^{(l)}$ indicates the representation of Image $i$ in layer $l$. In the propagation, the relations are preserved.

3.2 ImageGCN

However, Eq. 4 cannot be directly extended to an ImageGraph such as Fig. 1(a), where the original feature of each image is a 3-dimensional tensor (channel × width × height). If we flattened the tensor and used a linear transformation matrix for message passing, the transformation matrix would be extremely large, leading to low efficiency and even low non-linear expressive capacity. To tackle this issue, in ImageGCN we propose to design flexible message passing methods between images as

$$H^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \hat{A}^r f_r^{(l)}\left(H^{(l)}\right)\right) \tag{5}$$

where $f_r^{(l)}$ is the kernel Message Passing Unit (MPU) corresponding to relation $r$ in layer $l$, $\hat{A}^r$ is the normalized adjacency matrix of relation $r$, $H^{(l)}$ can be a 4-dimensional tensor (image × channel × width × height) that holds the representations of all images in the $l$th layer, and $H^{(0)}$ is the original pixel-level input tensor of the images. In the last layer $L$, $H^{(L)}$ should be a matrix where each row corresponds to a distributed representation of an image. The multiplication between a matrix and a tensor in Eq. 5 is expanded correspondingly.

The propagation rule of ImageGCN is illustrated in Fig. 2: each node of the input ImageGraph gets a representation through a GCN layer, and by stacking multiple GCN layers, each node eventually obtains an informative representation.

ImageGCN Layer. An ImageGCN layer contains a number of MPUs that perform message passing between layers. Each MPU corresponds to the message passing of one type of relation. An ImageGCN layer also has an aggregator for each node to aggregate the messages received from its neighbors. An activation function (ReLU) is applied to the aggregation to enhance the non-linear expressive capacity. Though many aggregators are available for this task [13, 34], we use the mean aggregator for simplicity, as the original GCN did. In ImageGCN, MPUs can be designed as multi-layer CNN architectures in the middle ImageGCN layers to extract high-level features, and linear MPUs can be used in the last layers to generate vector representations for images.

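Propagation. For each image (e.g., Image 4 in Fig. 2), each of its neighbors is input to the corresponding MPU, and the outputs are aggregated and then activated to generate the new representation of this image in the next layer. For each image, the propagation rule is

$$h_i^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \hat{A}^r_{ij} f_r^{(l)}\left(h_j^{(l)}\right)\right) \tag{6}$$

where $\hat{A}^r_{ij}$ is the $(i, j)$ entry of the normalized adjacency matrix of relation $r$ and $f_r^{(l)}$ is the MPU of relation $r$ in layer $l$. Eq. 6 is equivalent to Eq. 5 and can be seen as a generalization of Eq. 3.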
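The per-image propagation described above can be sketched with MPUs represented as arbitrary callables. This is a minimal illustration in plain Python under stated assumptions: images are nested lists of pixel values, each MPU maps one image to a feature vector, and the toy MPUs (a scaled mean pixel value) merely stand in for the CNN-based MPUs of the paper. All names are hypothetical.

```python
def image_gcn_layer(images, norm_adjs, mpus):
    """One propagation step with a ReLU activation.

    norm_adjs[r][i][j] is the normalized weight of the message j -> i under relation r.
    mpus[r] is a callable that maps one image to a feature vector (list of floats).
    """
    # Each MPU is applied once per image; its outputs are reused by every receiver.
    messages = {r: [mpus[r](img) for img in images] for r in norm_adjs}
    dim = len(next(iter(messages.values()))[0])
    out = []
    for i in range(len(images)):
        agg = [0.0] * dim
        for r, adj in norm_adjs.items():
            for j, weight in enumerate(adj[i]):
                if weight:
                    agg = [a + weight * m for a, m in zip(agg, messages[r][j])]
        out.append([max(0.0, a) for a in agg])
    return out


def mean_pixel(img):
    """Toy feature extractor: the average pixel value of a 2D image."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


# Two 2x2 grayscale "images" connected by one relation plus the self-connection.
IMAGES = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
NORM_ADJS = {
    "self": [[1.0, 0.0], [0.0, 1.0]],
    "person": [[0.0, 1.0], [1.0, 0.0]],
}
MPUS = {
    "self": lambda img: [mean_pixel(img)],
    "person": lambda img: [0.5 * mean_pixel(img)],
}
REPRESENTATIONS = image_gcn_layer(IMAGES, NORM_ADJS, MPUS)
```

With these toy inputs, image 0 combines its own feature (1.0) with the message from image 1 (0.5 × 3.0), giving `[[2.5], [3.5]]` overall. Replacing the lambdas with CNN feature extractors would not change the aggregation logic.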

Figure 3: The parameter sharing schemes. (a) Partial parameter sharing: a large part of the parameters is shared between MPUs. (b) All parameter sharing: all the parameters are shared between MPUs.

Partial Parameter Sharing. Because each relation has its own MPU, an issue with applying Eq. 5 to an ImageGraph with many relation types is that the number of parameters grows rapidly with the number of relations. This leads to a very large model that is not easy to train with limited computing and storage resources, especially for MPUs with multi-layer neural networks.

To address this issue, we introduce the partial parameter sharing (PPS) scheme between MPUs. With PPS, the MPUs share most of their parameters to reduce the total number of parameters. In our design, the same CNN architecture is applied to all MPUs in the same layer, and all the parameters are shared between these MPUs except for the last parameter layer, whose parameters make the message passing rule differ between relations; see Fig. 3(a) for an ImageGCN layer with PPS. Thus, the message passing rule Eq. 5 can be further refined as:

$$H^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \hat{A}^r g_r^{(l)}\left(\phi^{(l)}\left(H^{(l)}\right)\right)\right) \tag{7}$$

where $\hat{A}^r$ is the normalized adjacency matrix of relation $r$, $\phi^{(l)}$ is shared by all relations, and only $g_r^{(l)}$, which has just a few parameters, determines the different message passing methods for different relations. We can also go further and share all the parameters between all MPUs, that is, assign the same message passing rule to all relations; this is all parameter sharing (APS), shown in Fig. 3(b). However, APS reduces the multiple relations to a single relation and thus reduces the model's expressive capacity; our experimental results in Sections 4.5 and 4.6 also show that APS is less effective than PPS.
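To make the saving concrete, the parameter counts of the three schemes can be compared with simple arithmetic. The sketch below borrows the private-part size from Section 4.3 (a linear layer from 1024 to 14 dimensions, for 4 relations plus the self-connection); the size of the shared part (2,000,000 parameters) is a made-up placeholder, not a figure from the paper.

```python
def mpu_parameter_counts(shared_params, private_params, num_relations):
    """Total parameters of one layer under each sharing scheme."""
    return {
        # Every relation owns a full copy of both parts.
        "no_sharing": num_relations * (shared_params + private_params),
        # Partial sharing: one shared part, one private part per relation.
        "pps": shared_params + num_relations * private_params,
        # All sharing: a single MPU serves every relation.
        "aps": shared_params + private_params,
    }


PRIVATE = 1024 * 14 + 14  # weights plus biases of a 1024 -> 14 linear layer
SHARED = 2_000_000        # hypothetical size of the shared CNN part
COUNTS = mpu_parameter_counts(SHARED, PRIVATE, num_relations=5)
```

Under these assumptions PPS needs 2,071,750 parameters, close to the 2,014,350 of APS and far below the 10,071,750 needed when nothing is shared, while still keeping a distinct message passing rule per relation.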

3.3 Training Strategies

Loss function. The loss function depends on the downstream task. Specifically, for a classic node classification task, we can use a softmax activation function in the last layer and minimize the cross-entropy loss on all labeled nodes. For multi-label classification, the loss function can be designed as in our experiments in Section 4.4.

Batch propagation. Equation 7 requires all nodes in the graph to be present during training, so it cannot support propagation in batches. This makes it difficult to scale to a large graph with high-dimensional node features, which is common in computer vision. One might simply construct a subgraph from a batch, but this usually leaves no edges in the batch if the graph is sparse. GraphSAGE [13] was designed to address this issue for single-relational graphs. Inspired by GraphSAGE, we introduce an inductive batch propagation algorithm for multi-relational ImageGraphs in Algorithm 1. For each sample $v$ in a batch and for each relation $r$, we randomly sample $n$ neighbors of $v$ to pass messages to $v$ with relation $r$ in a layer (Line 8). The union of the sampled neighbors and the samples in the batch is considered as a new batch for the next layer (Lines 3 to 11). For an $L$-layer ImageGCN, the neighbor sampling should be repeated $L$ times to reach the $L$th-order neighbors of the initial batch (Lines 2 to 12). We construct the subgraph based on the final batch (Lines 13 to 16, where $\mathcal{B}_L$ is the final batch). In each ImageGCN layer, the message passing is conducted inside the subgraph (Line 17). Note that the image features can be kept in persistent storage and loaded only when a batch and the neighbors of the images in the batch have been sampled (Line 13). This is important for reducing the memory requirement for large-scale graphs or graphs with high-dimensional node features.

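Algorithm 1 ImageGCN batch propagation algorithm.
Input: graph node set $\mathcal{V}$ and the mini-batch $\mathcal{B}$; relation adjacency matrices $A^r$, $r \in \mathcal{R}$; input image features $X$ (can be stored externally); network depth $L$; number of neighbors $n$ to sample for each node and each relation.
Output: the representations of all samples in $\mathcal{B}$
1:  $\mathcal{B}_0 \leftarrow \mathcal{B}$
2:  for $l = 1, \dots, L$ do
3:     $\mathcal{B}_l \leftarrow \mathcal{B}_{l-1}$
4:     for $v \in \mathcal{B}_{l-1}$ do
5:        for $r \in \mathcal{R}$ do
6:           $\mathcal{N}_v^r \leftarrow$ the neighbor set of $v$ based on $A^r$
8:           $\mathcal{S} \leftarrow$ $n$ random samples from $\mathcal{N}_v^r$ without replacement
9:           $\mathcal{B}_l \leftarrow \mathcal{B}_l \cup \mathcal{S}$
10:       end for
11:    end for
12: end for
13: load the features for $\mathcal{B}_L$ from $X$
14: for $r \in \mathcal{R}$ do
15:    $\hat{A}^r_{sub} \leftarrow$ the sub-matrix corresponding to $\mathcal{B}_L$ in the adjacency matrix $A^r$
16: end for
17: execute Eq. 7 with $\hat{A}^r_{sub}$, for $l = 1, \dots, L$
18: extract the representations of samples in $\mathcal{B}$ from $H^{(L)}$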
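The sampling and subgraph steps of the batch propagation procedure can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: neighbors are given as per-relation dictionaries rather than adjacency matrices, and when a node has fewer than `num_samples` neighbors under a relation, all of them are taken (an assumption). The names `expand_batch` and `sub_adjacency` are hypothetical.

```python
import random


def expand_batch(batch, neighbors, num_layers, num_samples, rng=None):
    """Grow a mini-batch by sampling neighbours per relation, once per layer.

    neighbors[relation][node] is the list of that node's neighbours under the relation.
    """
    rng = rng or random.Random()
    current = set(batch)
    for _ in range(num_layers):
        expanded = set(current)
        for v in sorted(current):
            for rel_neighbors in neighbors.values():
                candidates = rel_neighbors.get(v, [])
                # Assumption: take every neighbour when fewer than num_samples exist.
                k = min(num_samples, len(candidates))
                expanded.update(rng.sample(candidates, k))
        current = expanded
    return current


def sub_adjacency(adj, nodes):
    """Extract the rows and columns of a dense adjacency matrix for the given nodes."""
    order = sorted(nodes)
    return order, [[adj[i][j] for j in order] for i in order]


# Toy graph: images 0 and 1 share a patient, images 0 and 2 share a view position.
NEIGHBORS = {
    "person": {0: [1], 1: [0], 2: []},
    "view": {0: [2], 1: [], 2: [0]},
}
ONE_HOP = expand_batch([1], NEIGHBORS, num_layers=1, num_samples=1)
TWO_HOP = expand_batch([1], NEIGHBORS, num_layers=2, num_samples=1)
```

Starting from image 1, one round of sampling reaches image 0 and a second round reaches image 2, so `ONE_HOP` is `{0, 1}` and `TWO_HOP` is `{0, 1, 2}`. Features would then be loaded only for the nodes in the final set.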

In the test procedure, given a test batch (which may contain one or more samples), the relations between the test samples and the training samples are added to the adjacency matrices. The batch propagation algorithm (Algorithm 1) can then be applied directly to obtain the test data representations.

4 Experiments

4.1 ChestX-ray14 Dataset

We test ImageGCN for disease identification and localization on the ChestX-ray14 dataset [46], which consists of 112,120 frontal-view CXR images of 30,805 patients labeled with 14 thoracic diseases. The labels are mined from the associated radiological reports using natural language processing, and are expected to have accuracy above 90% [46]. Out of the 112,120 CXR images, 51,708 contain one or more pathologies. The remaining 60,412 images are considered normal. The ChestX-ray14 dataset also provides patient information for each CXR image, based on which we construct the ImageGraph. We randomly split the dataset into training, validation and test sets by the ratio 7:1:2 (training 78,484 images, validation 11,212 images, test 22,424 images). We regard the provided labels as ground truth to train the model on the training set and evaluate it on the test set. We do not apply any data augmentation techniques.

Preprocessing. Each image in the dataset is resized and then cropped at the center for fast processing. We normalize each image by the mean ([0.485, 0.456, 0.406]) and standard deviation ([0.229, 0.224, 0.225]) of the images from ImageNet.


4.2 Graph Construction

To construct an ImageGraph based on the dataset, besides the self-connection, we consider 4 types of relations between two CXR images that are relevant for disease classification and localization. (1) Person relation: if two images come from the same person, a person relation exists. (2) Age relation: if two images come from persons of the same age when the CXRs were taken, an age relation exists. (3) Gender relation: if the owners of two images have the same gender, a gender relation exists. (4) View relation: if two CXR images were taken with the same view position (posteroanterior or anteroposterior), a view relation exists.

The four relations are all reflexive, symmetric and transitive; thus each relation corresponds to a cluster graph that consists of a number of disjoint complete subgraphs. The person relation usually implies the gender relation but does not imply the age relation, because a person can have several CXR images taken at different ages. The adjacency matrix of each relation is a block diagonal matrix (after reordering the nodes by cluster). ImageGCN is built on this multi-relational graph. The adjacency matrices are normalized in advance. Note that because the self-connection relation is considered separately, the adjacency matrices do not need added self-connections.
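The construction of one adjacency matrix per relation can be sketched by grouping images on a metadata field. The following plain-Python example is illustrative only: the record keys (`patient`, `age`, `gender`, `view`) are hypothetical names rather than the dataset's actual column names, row normalization is an assumed choice, and dense matrices are used for readability (a sparse format would be needed at the scale of the full dataset).

```python
from collections import defaultdict


def relation_adjacency(records, key):
    """Connect every pair of distinct images that share the same value of `key`."""
    n = len(records)
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[key]].append(idx)
    adj = [[0.0] * n for _ in range(n)]
    for members in groups.values():
        for i in members:
            for j in members:
                if i != j:  # the self-connection is handled as a separate relation
                    adj[i][j] = 1.0
    return adj


def row_normalize(adj):
    """Divide each row by its sum; isolated nodes keep an all-zero row."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([x / total for x in row] if total else list(row))
    return out


# Three toy records with hypothetical field names.
RECORDS = [
    {"patient": 1, "age": 50, "gender": "M", "view": "PA"},
    {"patient": 1, "age": 51, "gender": "M", "view": "AP"},
    {"patient": 2, "age": 50, "gender": "F", "view": "PA"},
]
ADJS = {key: row_normalize(relation_adjacency(RECORDS, key))
        for key in ("patient", "age", "gender", "view")}
```

In this toy example images 0 and 1 are linked by the person and gender relations, while images 0 and 2 are linked by the age and view relations, which mirrors the observation that the person relation implies the gender relation but not the age relation.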

4.3 MPU design

Since the ImageGraph in our experiments is a cluster graph for each relation, each node can reach every node reachable from it in one step, so a one-layer ImageGCN is enough to capture the structure information of an image node. Stacking multiple GCN layers would result in over-smoothing [26]. For the one-layer ImageGCN, we design the MPUs as deep CNN architectures to capture high-level visual information. Following the partial parameter sharing introduced in Section 3.2 and Fig. 3, each MPU consists of two parts: the sharing part, which is common to all relations, and the private part, which is specific to each relation.

The sharing part. The sharing part of the MPUs consists of the feature layers of a pre-trained CNN architecture, a transition layer and a global pooling layer, in sequence. For a pre-trained model, we discard the high-level fully-connected layers and classification layers and keep only the remaining feature layers as the first component of the sharing part. The transition layer consists of a convolutional layer, a batch normalization layer [16] and a ReLU layer, in sequence. In the transition layer, the convolutional layer has 1024 filters that transform the output of the previous layers into a uniform number (1024 in our experiments) of feature maps, which are used to generate the heatmap for disease localization. The global pooling layer pools the 1024 feature maps into a 1024-dimensional vector, with a kernel size equal to the feature map's size. Thus, the sharing part of the MPUs transforms an image into a 1024-dimensional vector. We test the feature layers of three different pre-trained CNN architectures independently in our experiments: AlexNet [22], VGGNet16 with batch normalization (VGGNet16BN) [42], and ResNet50 [14].

The private part. The private part accepts the output of the sharing part and outputs an embedding to the aggregator. For each relation, we use a linear layer (with its own parameters) as the private part to transform the 1024-dimensional vector from the sharing part into a 14-dimensional vector. For an image, the 14-dimensional vectors from its neighbors are aggregated and fed to a sigmoid activation function to generate its probabilities for the 14 diseases. With a method similar to that in [55], the weights of the private linear layer of the self-connection, combined with the activations of the transition layer in the sharing part, can generate a heatmap for the disease localization task.
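The heatmap computation can be illustrated with a small class activation mapping sketch: the heatmap for one disease is the sum of the transition-layer feature maps, each weighted by the corresponding weight of the linear layer for that disease. The plain-Python code below is a toy version with two 2×2 feature maps instead of 1024; the function name and the values are illustrative.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps for a single class.

    feature_maps: K maps, each a list of rows (H x W).
    class_weights: the K linear-layer weights that feed the class of interest.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


# Two toy feature maps: one fires in the top-left corner, one in the bottom-right.
FEATURE_MAPS = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
CAM = class_activation_map(FEATURE_MAPS, [2.0, 0.5])
```

Here `CAM` is `[[2.0, 0.0], [0.0, 1.0]]`: the top-left location contributes most to the class score because its feature map carries the larger class weight. If the global pooling is an average, the mean of this map equals the class score before the bias and the sigmoid are applied.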

All the learnable parameters of the ImageGCN model are contained in these two parts: the sharing part corresponds to the feature layers of a pre-trained architecture, and the private parts correspond to 5 linear layers, one for each of the 4 relations and the self-connection. Though only a part of the pre-trained model, e.g., AlexNet, is incorporated in an MPU, we call it an AlexNet MPU for convenience, and similarly VGGNet16BN MPU and ResNet50 MPU. For each MPU type (e.g., AlexNet), we use two baselines to evaluate our model: ImageGCN with all parameter sharing (APS) and the basic pre-trained model (e.g., AlexNet) fine-tuned on the dataset. In the rest of this paper, we use A-GCN-PPS to denote the ImageGCN with AlexNet MPUs and partial parameter sharing, and similarly V-GCN-PPS for VGGNet16BN MPUs and R-GCN-PPS for ResNet50 MPUs.

Model             Atel   Card   Effu   Infi   Mass   Nodu   Pne1   Pne2   Cons   Edem   Emph   Fibr   PT     Hern   Mean
A-GCN-PPS (ours)  0.781  0.899* 0.865* 0.701  0.813* 0.721* 0.718* 0.881* 0.788  0.888  0.882* 0.804* 0.778* 0.904  0.816*
A-GCN-APS         0.739  0.876  0.815  0.671  0.799  0.704  0.679  0.857  0.762  0.846  0.863  0.792  0.765  0.910  0.791
AlexNet           0.782  0.895  0.863  0.705  0.781  0.714  0.716  0.869  0.790  0.889  0.876  0.799  0.773  0.899  0.811
R-GCN-PPS (ours)  0.785  0.890* 0.868* 0.699* 0.824* 0.739* 0.723* 0.895* 0.790  0.887  0.911* 0.819* 0.786* 0.941* 0.826*
R-GCN-APS         0.741  0.861  0.822  0.680  0.819  0.728  0.684  0.873  0.768  0.852  0.889  0.790  0.751  0.908  0.798
ResNet50          0.789  0.889  0.863  0.698  0.807  0.723  0.714  0.876  0.791  0.888  0.899  0.799  0.772  0.933  0.817
V-GCN-PPS (ours)  0.796* 0.896* 0.873* 0.699* 0.834* 0.762* 0.717* 0.890* 0.788* 0.889* 0.907* 0.813* 0.792* 0.917* 0.827*
V-GCN-APS         0.754  0.871  0.826  0.676  0.820  0.737  0.688  0.872  0.769  0.839  0.894  0.789  0.770  0.926  0.802
VGGNet16BN        0.785  0.876  0.872  0.686  0.813  0.734  0.712  0.882  0.787  0.883  0.902  0.812  0.773  0.925  0.817
Wang [46]         0.716  0.807  0.784  0.609  0.706  0.671  0.633  0.806  0.708  0.835  0.815  0.769  0.708  0.767  0.738
Yao [51]          0.772  0.904  0.859  0.695  0.792  0.717  0.713  0.841  0.788  0.882  0.829  0.767  0.765  0.914  0.803
Li [29]           0.800  0.870  0.870  0.700  0.830  0.750  0.670  0.870  0.800  0.880  0.910  0.780  0.760  0.770  0.804
Kumar [23]        0.762  0.913  0.864  0.692  0.750  0.666  0.715  0.859  0.784  0.888  0.898  0.756  0.774  0.802  0.794
Tang [43]         0.756  0.887  0.819  0.689  0.814  0.755  0.729  0.85   0.728  0.848  0.906  0.818  0.765  0.875  0.803
Shen [41]         0.766  0.801  0.797  0.751  0.76   0.741  0.778  0.800  0.787  0.82   0.773  0.765  0.759  0.748  0.775
Mao [31]          0.750  0.869  0.810  0.687  0.782  0.726  0.695  0.845  0.728  0.834  0.870  0.798  0.758  0.877  0.788
Guan [12]         0.781  0.883  0.831  0.697  0.83   0.764  0.725  0.866  0.758  0.853  0.911  0.826  0.78   0.918  0.816

Table 1: The AUC results of the models for classifying the 14 diseases on the ChestX-ray14 dataset. An asterisk (*) marks a result where our ImageGCN performs better than or equal to the corresponding baseline models. Abbreviations: Atel: Atelectasis; Card: Cardiomegaly; Effu: Effusion; Infi: Infiltration; Nodu: Nodule; Pne1: Pneumonia; Pne2: Pneumothorax; Cons: Consolidation; Edem: Edema; Emph: Emphysema; Fibr: Fibrosis; PT: Pleural Thickening; Hern: Hernia.

4.4 Experimental settings

Weakly supervised learning. The ChestX-ray14 dataset provides pathology bounding box (Bbox) annotations for a small number of CXR images, which can be used as the ground truth for the disease localization task. In our experiments, we adopt the weakly supervised learning scheme [38]: no annotations are used for training; they are only used to evaluate the disease localization performance of a model trained with image-level labels only.

Loss function. For multi-label classification on ChestX-ray14, the true label of each CXR image is a 14-dimensional binary vector $y = [y_1, \dots, y_{14}]$ with $y_c \in \{0, 1\}$, where $1$ denotes that the corresponding disease is present and $0$ denotes absence. An all-zero vector represents "No Findings" in the 14 diseases. Due to the high sparsity of the label matrix, we use the weighted cross entropy loss as Wang [46] did, where each sample with true labels $y$ and output probabilities $p$ has the loss

$$\mathcal{L}(p, y) = -\beta_P \sum_{c:\, y_c = 1} \ln p_c - \beta_N \sum_{c:\, y_c = 0} \ln (1 - p_c), \qquad \beta_P = \frac{|P| + |N|}{|P|}, \quad \beta_N = \frac{|P| + |N|}{|N|}$$

where $|N|$ and $|P|$ are the numbers of '0's and '1's in a mini-batch, respectively. The losses of the images in a mini-batch are averaged to give the loss of the batch.
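A weighted cross entropy of this kind can be sketched as follows. The weighting used here follows the form published by Wang et al. [46], in which positive and negative terms are scaled by the inverse frequency of '1's and '0's in the mini-batch; that this is exactly the variant used in these experiments is an assumption, and the function names are hypothetical.

```python
import math


def weighted_cross_entropy(probs, labels, num_pos, num_neg, eps=1e-12):
    """Loss of one sample; rare positives are up-weighted by the batch statistics."""
    total = num_pos + num_neg
    beta_pos = total / num_pos if num_pos else 0.0
    beta_neg = total / num_neg if num_neg else 0.0
    loss = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            loss -= beta_pos * math.log(max(p, eps))
        else:
            loss -= beta_neg * math.log(max(1.0 - p, eps))
    return loss


def batch_loss(batch_probs, batch_labels):
    """Average the per-sample losses, counting 1s and 0s over the whole mini-batch."""
    num_pos = sum(sum(labels) for labels in batch_labels)
    num_neg = sum(len(labels) for labels in batch_labels) - num_pos
    losses = [weighted_cross_entropy(p, y, num_pos, num_neg)
              for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)


# Toy mini-batch: 2 samples, 2 labels each, a single positive label overall.
LOSS = batch_loss([[0.5, 0.5], [0.5, 0.5]], [[1, 0], [0, 0]])
```

In this toy batch there is one '1' and three '0's, so the positive term is weighted by 4 and each negative term by 4/3, and the averaged loss works out to 4 ln 2.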

Hyperparameters. We set the batch size to 16. One neighbor is sampled for each image and each relation. All the models are trained using the Adam optimizer [20]. We terminate the training procedure after 10 epochs. In each epoch, the model with the best classification performance on the validation set is saved for evaluation.

4.5 Disease Identification

For the disease identification task, we use the AUC score to evaluate the performance of the models. Table 1 shows the AUC scores of all the models on the 14 diseases. As expected from Section 3, for all three types of MPUs, PPS clearly outperforms APS. For each type of MPU, GCN-PPS outperforms GCN-APS and the corresponding basic model overall and on most of the diseases. V-GCN-PPS can even outperform the corresponding V-GCN-APS and VGGNet16BN for all the 14 diseases.

Table 1 also lists some results reported in the related references. Some studies, such as [39], that used a different training-validation-test split ratio or augmented the dataset are not listed. Our V-GCN-PPS achieves the best overall result compared with the state-of-the-art methods. On 7 of the 14 diseases, ImageGCN achieves the best results among these state-of-the-art methods.

T(IoU)  Metric  Model             Atel     Card     Effu     Infi     Mass     Nodu     Pne1     Pne2
0.1     Acc     A-GCN-PPS (ours)  0.4889*  0.9932   0.6667*  0.6667*  0.4706*  0.0000   0.6417*  0.3469*
0.1     Acc     AlexNet           0.3889   1.0000   0.6144   0.5285   0.4706   0.0253   0.5833   0.3265
0.1     Acc     A-GCN-APS         0.3000   0.9863   0.5294   0.4634   0.2824   0.0127   0.5167   0.2755
0.1     Acc     Wang [46]         0.6888   0.9383   0.6601   0.7073   0.4000   0.1392   0.6333   0.3775
0.1     AFP     A-GCN-PPS (ours)  0.5111*  0.0137   0.3333*  0.3333*  0.5294*  1.0127   0.3583*  0.6531*
0.1     AFP     AlexNet           0.6111   0.0000   0.3856   0.4715   0.5294   0.9747   0.4167   0.6735
0.1     AFP     A-GCN-APS         0.7000   0.0137   0.4706   0.5447   0.7176   1.0000   0.4833   0.7245
0.1     AFP     Wang [46]         0.8943   0.5996   0.8343   0.6250   0.6666   0.6077   1.0203   0.4949
0.5     Acc     A-GCN-PPS (ours)  0.0222*  0.3836*  0.0458   0.1138*  0.0471   0.0000*  0.0750*  0.0408*
0.5     Acc     AlexNet           0.0111   0.2260   0.0784   0.0569   0.0824   0.0000   0.0750   0.0306
0.5     Acc     A-GCN-APS         0.0000   0.3082   0.0327   0.0325   0.0235   0.0000   0.0500   0.0204
0.5     Acc     Wang [46]         0.0500   0.1780   0.1111   0.0650   0.0117   0.0126   0.0333   0.0306
0.5     AFP     A-GCN-PPS (ours)  0.9778*  0.6233*  0.9542   0.8862*  0.9529   1.0127   0.9250*  0.9592*
0.5     AFP     AlexNet           0.9889   0.7740   0.9216   0.9431   0.9176   1.0000   0.9250   0.9694
0.5     AFP     A-GCN-APS         1.0000   0.6918   0.9673   0.9756   0.9765   1.0127   0.9500   0.9796
0.5     AFP     Wang [46]         1.0884   0.8506   1.0051   0.7632   0.7226   0.6189   1.1321   0.5478

Table 2: The comparison results of disease localization among the models. An asterisk (*) marks a result where our ImageGCN performs better than or equal to the corresponding baseline models. Acc: Accuracy; AFP: Average False Positive. Atel: Atelectasis; Card: Cardiomegaly; Effu: Effusion; Infi: Infiltration; Nodu: Nodule; Pne1: Pneumonia; Pne2: Pneumothorax.

In Table 1, GCN-APS is less effective than the corresponding basic model because, if all relations are treated equally as in APS, the graph becomes a complete graph and an image's own features are heavily dwarfed by the messages from its neighbor images. For example, in Fig. 3(b), the message from the image itself is treated the same as those from its neighbors. This makes an image and its neighbors indistinguishable, and thus leads to even lower performance than the baseline. In contrast, with PPS in Fig. 3(a), messages from neighbors with different relations are treated differently by the relation-specific private parts of the MPUs, so the less important messages have less influence on the results. Thus, ImageGCNs with PPS perform better than those with APS and the baseline model.

4.6 Disease Localization

The ChestX-ray14 dataset also contains 984 Bboxes, labelled by board-certified radiologists, for 880 CXR images. The provided Bboxes correspond to 8 of the 14 diseases; we use them as ground truth to evaluate the disease localization performance of the models.

With class activation mapping [55], for each image we generate a heatmap normalized to [0, 255] with the MPU of the self-connection in a weakly supervised manner. Following the setting of Wang et al. [46], we segment the heatmap with a threshold of 180 and generate Bboxes to cover the activated regions in the resulting binary map. We use the intersection over union ratio (IoU) between the detected region and the annotated ground truth to evaluate localization performance. We define a localization as correct when IoU > T(IoU), where T(IoU) is the self-defined threshold.
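The evaluation steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the heatmap is assumed to be an 8-bit array, a pixel is treated as activated when its value is at least the threshold, a single Bbox is drawn around all activated pixels (the paper may produce one Bbox per activated region), and a localization counts as correct when the IoU strictly exceeds the chosen threshold. The function names `heatmap_to_bbox` and `iou` are hypothetical.

```python
import numpy as np


def heatmap_to_bbox(heatmap, threshold=180):
    """Return (x_min, y_min, x_max, y_max) around all pixels >= threshold.

    Simplification: one box covers every activated pixel. Returns None when
    nothing is activated. x_max and y_max are exclusive.
    """
    mask = heatmap >= threshold
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Toy 8x8 heatmap with one strongly activated 4x4 region.
heatmap = np.zeros((8, 8), dtype=np.uint8)
heatmap[2:6, 2:6] = 200  # above the threshold of 180
heatmap[0, 0] = 100      # below the threshold, ignored

predicted = heatmap_to_bbox(heatmap)
ground_truth = (2, 2, 6, 8)  # made-up annotation for illustration
score = iou(predicted, ground_truth)
is_correct = score > 0.5     # 0.5 is one example value of the IoU threshold
print(predicted, round(score, 4), is_correct)
```

Handling several disjoint activated regions would require a connected-component step before drawing the boxes, which is omitted here to keep the sketch dependency-free.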

The localization results of the models are compared in Table 2. As Table 2 shows, our ImageGCN with AlexNet MPU and PPS outperforms the baselines in most cases.

Fig. 4 shows qualitative localization examples from A-GCN-PPS compared with those of the baselines. As Fig. 4 shows, our ImageGCN with AlexNet MPU and PPS usually produces smaller and more accurate Bboxes than the baselines.

(a) Atelectasis
(b) Cardiomegaly
(c) Effusion
(d) Infiltration
(e) Mass
(f) Nodule
(g) Pneumonia
(h) Pneumothorax
Figure 4: The Bbox results of the models for the 8 diseases. For each cell, the first image is the original CXR with the ground truth Bbox (blue) and the Bbox generated by A-GCN-PPS (red). The second to fourth images are the generated heatmaps and Bboxes of A-GCN-PPS, A-GCN-APS and the base AlexNet, respectively. In every image, the blue Bbox is the ground truth and the red one is generated by the respective model.

5 Conclusion

We propose ImageGCN to model relations between images and apply it to CXR images for disease identification and disease localization. To the best of our knowledge, this is the first study to model natural image-level relations for image representation learning. ImageGCN extends the original GCN to high-dimensional or unstructured data and incorporates the ideas of relational GCN and GraphSAGE for batch propagation on multi-relational ImageGraphs. We also introduce the PPS scheme to reduce the complexity of ImageGCN. The experimental results on the ChestX-ray14 dataset demonstrate that ImageGCN outperforms the respective baselines in both disease identification and localization and achieves results comparable to, and often better than, the state-of-the-art methods. Future research includes tuning the MPU of ImageGraph for different vision tasks and testing ImageGCN on more general datasets.


References

  • [1] J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Simaan. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967, 2017.
  • [2] R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrix completion. In SIGKDD, Deep Learning Day, 2018.
  • [3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [4] J. Chen, T. Ma, and C. Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
  • [5] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun. Gram: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
  • [6] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3076–3086, 2017.
  • [7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [9] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pages 6530–6539, 2017.
  • [10] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2018.
  • [11] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR.org, 2017.
  • [12] Q. Guan and Y. Huang. Multi-label chest x-ray image classification via category-wise residual attention learning. Pattern Recognition Letters, 2018.
  • [13] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024–1034, 2017.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • [17] J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1219–1228, 2018.
  • [18] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
  • [19] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
  • [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [23] P. Kumar, M. Grewal, and M. M. Srivastava. Boosted cascaded convnets for multilabel classification of thoracic diseases in chest radiographs. In International Conference Image Analysis and Recognition, pages 546–552. Springer, 2018.
  • [24] L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018.
  • [25] C.-W. Lee, W. Fang, C.-K. Yeh, and Y.-C. Frank Wang. Multi-label zero-shot learning with structured knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1576–1585, 2018.
  • [26] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
  • [27] Y. Li, R. Jin, and Y. Luo. Classifying relations in clinical narratives using segment graph convolutional and recurrent neural networks (seg-gcrns). Journal of the American Medical Informatics Association, 26(3):262–268, 2018.
  • [28] Z. Li, Q. Chen, and V. Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 537–546, 2018.
  • [29] Z. Li, C. Wang, M. Han, Y. Xue, W. Wei, L.-J. Li, and L. Fei-Fei. Thoracic disease identification and localization with limited supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8290–8299, 2018.
  • [30] M. Maire, T. Narihira, and S. X. Yu. Affinity cnn: Learning pixel-centric pairwise relations for figure/ground embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 174–182, 2016.
  • [31] C. Mao, L. Yao, Y. Pan, Y. Luo, and Z. Zeng. Deep generative classifiers for thoracic disease diagnosis with chest x-ray images. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1209–1214. IEEE, 2018.
  • [32] K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 20–28. IEEE, 2017.
  • [33] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.
  • [34] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In AAAI, 2019.
  • [35] M. Narasimhan, S. Lazebnik, and A. Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. In Advances in Neural Information Processing Systems, pages 2659–2670, 2018.
  • [36] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
  • [37] W. Norcliffe-Brown, S. Vafeias, and S. Parisot. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pages 8344–8353, 2018.
  • [38] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
  • [39] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
  • [40] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In ESWC, pages 593–607. Springer, 2018.
  • [41] Y. Shen and M. Gao. Dynamic routing on deep neural network for thoracic disease classification and sensitive area localization. In International Workshop on Machine Learning in Medical Imaging, pages 389–397. Springer, 2018.
  • [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [43] Y. Tang, X. Wang, A. P. Harrison, L. Lu, J. Xiao, and R. M. Summers. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In International Workshop on Machine Learning in Medical Imaging, pages 249–258. Springer, 2018.
  • [44] G. Te, W. Hu, A. Zheng, and Z. Guo. Rgcnn: Regularized graph cnn for point cloud segmentation. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 746–754. ACM, 2018.
  • [45] H. Wang, H. Huang, and C. Ding. Image annotation using bi-relational graph of images and semantic labels. In CVPR 2011, pages 793–800. IEEE, 2011.
  • [46] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3462–3471, 2017.
  • [47] X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6857–6866, 2018.
  • [48] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–685, 2018.
  • [49] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 9–16. IEEE, 2010.
  • [50] L. Yao, C. Mao, and Y. Luo. Graph convolutional networks for text classification. In AAAI, 2019.
  • [51] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, and K. Lyman. Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501, 2017.
  • [52] T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018.
  • [53] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983. ACM, 2018.
  • [54] Y. Zhang, P. Qi, and C. D. Manning. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, 2018.
  • [55] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
  • [56] Y. Zhu and S. Jiang. Deep structured learning for visual relationship detection. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [57] M. Zitnik, M. Agrawal, and J. Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457–i466, 2018.