1 Introduction
Learning low-dimensional representations of images is a fundamental task in computer vision. Deep learning techniques, especially convolutional neural network (CNN) architectures, have achieved remarkable breakthroughs in learning image representations for classification [22, 14, 15]. However, most existing approaches consider each input image independently and ignore the relations between images. In reality, multiple relations can exist between images; in a clinical setting in particular, medical images from the same person can show pathophysiologic progressions. Intuitively, related images can provide insights that help to better understand the current image. For example, images on the same web page can help to interpret each other, and knowing a patient's other medical images can help to analyze the current image. We model the images and the relations between them as a graph, named an ImageGraph, where a node corresponds to an image and an edge between two nodes represents a relation between the two corresponding images. An ImageGraph incorporating multiple types of relations is a multigraph, where multiple edges can exist between two nodes. The neighborhood of an image in the ImageGraph comprises the images closely related to it. Fig. 1(a) shows an example ImageGraph of chest X-ray (CXR) images incorporating 3 types of relations between 5 nodes.
Learning an image representation that incorporates both neighborhood information and the original pixel information is difficult, because the neighborhood information is unstructured and varies across nodes. Inspired by the emerging research on graph convolutional networks (GCN) [21, 13, 4, 40], which can model graph data to learn informative node representations from the original node features and the structure information, we propose ImageGCN, an end-to-end GCN framework on an ImageGraph, to learn image representations. In ImageGCN, each image updates its information based on its own features and the images related to it. Fig. 1 shows an overview of ImageGCN, where each node in an ImageGraph is transformed into an informative representation by a number of ImageGCN layers.
There are several issues when applying the original GCN [21] to an ImageGraph. (1) The original GCN is transductive and requires all node features to be present during training, which does not scale to large ImageGraphs. (2) The original GCN is designed for simple graphs and cannot support multi-relational ImageGraphs. (3) The original GCN is effective for low-dimensional node feature vectors, and cannot be effectively extended to nodes with high-dimensional or unstructured features in ImageGraphs. The inductive learning issue was addressed for GCN by GraphSAGE [13], and the multi-relational issue was addressed by relational GCN [40]. However, the third issue, applying GCN to high-dimensional or unstructured features, remains unaddressed. ImageGCN is proposed to address this issue and, further, to incorporate the ideas of GraphSAGE and relational GCN for batch propagation on multi-relational ImageGraphs.
In this paper, for graphs with high-dimensional or unstructured node features, we propose to design flexible message passing units (MPUs) to perform message passing between two adjacent nodes, instead of the linear transformation in the original GCN. In the proposed ImageGCN, we use a number of MPUs equipped with a multi-layer CNN architecture for message passing between images in a multi-relational ImageGraph. We introduce partial parameter sharing between the MPUs of different relations to reduce model complexity. We also incorporate the ideas of GraphSAGE and relational GCN into our ImageGCN model for inductive batch propagation on multi-relational ImageGraphs.
We evaluate ImageGCN on the ChestX-ray14 dataset [46], where rich relations are available between the CXR images. The experimental results demonstrate that ImageGCN outperforms the respective baselines in both disease identification and localization.
Besides the improved performance, the main contributions are as follows. (1) To the best of our knowledge, this is the first study to model natural image-level relations for image representation. (2) We propose ImageGCN to extend the original GCN to high-dimensional or unstructured data in an ImageGraph. (3) We incorporate the ideas of relational GCN and GraphSAGE into ImageGCN for inductive batch propagation on multi-relational ImageGraphs. (4) We introduce a partial parameter sharing scheme to reduce the model complexity of ImageGCN.
2 Related Work
Deep learning for disease identification with CXR. Since the ChestX-ray14 dataset [46] was released, an increasing amount of research on CXR image analysis has used deep neural networks for disease identification [46, 51, 23, 31, 12]. The general idea of previous work is to generate a low-dimensional representation for each image independently with a deep neural network architecture. In our work, we consider the relations between CXR images and learn a representation based on both the image itself and its neighboring images.
Relational modeling. Previous research on relational modeling in computer vision mainly focused on pixel-level relations [30, 33], object-level relations [49, 32, 6, 56, 52] and label-level relations [25, 45]. Image-level similarity relations were also studied in the literature [10, 45]. However, few studies model the natural image-level relations for image representation.
Graph neural networks. Recently, inspired by the huge success of CNNs on regular Euclidean data such as images (2D grids) and text (1D sequences), a large body of research has tried to generalize the convolution operation to non-Euclidean data such as graphs [36, 7, 33, 3, 21]. Among the pioneering studies, Kipf and Welling [21] resolved the computational bottleneck with a first-order approximation of spectral graph convolutions and provided fast approximate convolutions on graphs, Graph Convolutional Networks (GCN), which improved scalability and classification performance on large-scale graphs. GCN has a wide range of applications across different tasks and domains, such as natural language processing [50, 1, 54, 27], recommender systems [2, 33, 53], and life science and health care [18, 57, 9, 5, 19, 28]. GCN was also explored in several computer vision tasks, such as image classification [47, 10], scene graph generation [48, 17], semantic segmentation [24, 44], and visual reasoning [37, 35, 52]. In most previous studies, the graphs were built from knowledge graphs [47, 37, 35], object relations [48, 17], or point clouds [24, 44]. In this paper, we take into account the natural image-level relations to construct a multi-relational ImageGraph, and use GCN to model the relations to learn informative representations for the node images.

3 Methods
3.1 Graph Convolutional Networks
Graph convolutional networks (GCN) [21] can incorporate the node feature information and the structure information to learn informative representations for the nodes in a graph. GCN learns node representations with the following propagation rule, derived from spectral graph convolutions on an undirected graph [21]:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \quad (1)$$

where $\tilde{A} = A + I$ is the adjacency matrix with added self-connections, $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ can be seen as a symmetrically normalized adjacency matrix, $H^{(l)}$ and $W^{(l)}$ are the node representation matrix and the trainable linear transformation matrix in the $l$-th layer, $H^{(0)} = X$ is the original feature matrix of the nodes, and $\sigma$ is the activation function (such as ReLU).
The propagation rule of GCN in Eq. 1 can be interpreted as Laplacian smoothing on a graph [26]: the new feature of a node is computed as the weighted average of itself and its neighbors, followed by a linear transformation before the activation function, as in Eq. 2,

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in N_i} \frac{1}{c_{ij}}\, h_j^{(l)} W^{(l)}\Big), \quad (2)$$

where $h_i^{(l)}$ is the representation of node $i$ in the $l$-th layer, $N_i$ is the set of all nodes that have a connection with $i$ (self included), and $c_{ij}$ is a problem-specific normalization coefficient. It can be proven that Eq. 2 is equivalent to the original GCN in Eq. 1 when $1/c_{ij}$ is the $(i,j)$ entry of the symmetrically normalized adjacency matrix $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. Eq. 2 can be interpreted as a node accepting messages from its neighbors [11]; by adding self-connections, a node is also considered a neighbor of itself.
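The propagation rule above can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. 1 and 2, not the paper's implementation; the function name and the `tanh` activation are our own choices:

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One GCN layer (Eq. 1): act(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-connections
    d = A_hat.sum(axis=1)                     # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D~^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetrically normalized adjacency
    return act(A_norm @ H @ W)                # smooth, transform, activate
```

With no edges, `A_norm` reduces to the identity and each node simply transforms its own features, matching the Laplacian-smoothing reading of Eq. 2.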
Eq. 2 can be extended to multiple relations as Eq. 3 [40], where $r$ indicates a certain relation from a relation set $R$ and $N_i^r$ represents all the nodes that have relation $r$ with node $i$:

$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}}\, h_j^{(l)} W_r^{(l)}\Big). \quad (3)$$

The relational GCN formulated by Eq. 3 is interpreted as a node accepting messages from the nodes that have any relation with it. The message passing weights $W_r^{(l)}$ vary across relations and layers. In Eq. 3, note that one special relation in $R$ deserves more attention: the self-connection (denoted by $s$). We have $N_i^s = \{i\}$ and $c_{i,s} = 1$ if we consider that each node accepts its self-contribution as is during information updating. Different from the original GCN of Eqs. 1 and 2, where all connections, including the self-connection, are treated equally, the relational GCN designs different message passing methods for different relations, including the self-connection.
We can also write Eq. 3 in matrix form as Eq. 4, where $\hat{A}_r$ is a normalized adjacency matrix for relation $r$, and for the self-connection $s$, $\hat{A}_s = I$ is an identity matrix. With Eq. 4, the computation efficiency can be improved using sparse matrix multiplications:

$$H^{(l+1)} = \sigma\Big(\sum_{r \in R} \hat{A}_r H^{(l)} W_r^{(l)}\Big). \quad (4)$$
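The matrix form of Eq. 4 amounts to summing one normalized-adjacency propagation per relation before activating. A minimal sketch, assuming dense NumPy matrices (the real benefit comes from sparse multiplications) and treating the self-connection as a relation with $\hat{A}_s = I$:

```python
import numpy as np

def rgcn_layer(A_hats, H, Ws, act=np.tanh):
    """Relational GCN layer (Eq. 4): sum A_hat_r @ H @ W_r over relations r,
    then activate. The self-connection is one relation with A_hat = identity."""
    Z = sum(A_r @ H @ W_r for A_r, W_r in zip(A_hats, Ws))
    return act(Z)
```

Each relation contributes its own transformed message; zeroing a relation's weight matrix removes that relation's influence entirely.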
Note that Eqs. 3 and 4 can be generalized to multiple relations between two nodes and to directed graphs. When multiple relations hold between two nodes, e.g., two CXR images that share both the same patient and the same view position, message passing should be conducted multiple times, once for each relation. For directed graphs, the directed edges can be regarded as two relations, the in-relation and the out-relation, so there should be two different message passing methods, one for messages from the head node to the tail node and one from the tail to the head.
3.2 ImageGCN
However, Eq. 4 cannot be directly extended to an ImageGraph as in Fig. 1(a), where the original feature of each image is a 3-dimensional tensor. If we flattened the tensor and used a linear transformation matrix for message passing, the transformation matrix would be extremely large, inefficient, and of low nonlinear expressive capacity. To tackle this issue, in our ImageGCN we propose to design flexible message passing methods between images as

$$H^{(l+1)} = \sigma\Big(\sum_{r \in R} \hat{A}_r\, f_r^{(l)}\big(H^{(l)}\big)\Big), \quad (5)$$

where $f_r^{(l)}$ is the message passing unit (MPU) corresponding to relation $r$ in layer $l$, and $H^{(l)}$, which can be a 4-dimensional tensor, is the representation of all the images in the $l$-th layer; $H^{(0)}$ is the original pixel-level input tensor of the images. In the last layer, $H$ should be a matrix where each row corresponds to a distributed representation of an image. The multiplication between a matrix and a tensor in Eq. 5 is expanded correspondingly.
The propagation rule of ImageGCN is illustrated in Fig. 2, where each node of the input ImageGraph obtains a representation through a GCN layer; by stacking multiple GCN layers, each node eventually obtains an informative representation.
ImageGCN layer. An ImageGCN layer contains a number of MPUs to perform message passing between layers. Each MPU corresponds to the message passing of one type of relation. An ImageGCN layer also has an aggregator for each node to aggregate the messages received from its neighbors. An activation function (ReLU) is applied to the aggregation to enhance the nonlinear expressive capacity. Though many aggregators are available for this task [13, 34], we use the mean aggregator for simplicity, as the original GCN did. In ImageGCN, MPUs can be designed as multi-layer CNN architectures in the intermediate ImageGCN layers to extract high-level features, and linear MPUs can be used in the last layer to generate vector representations for images.
Propagation. For each image (e.g., Image 4 in Fig. 2), each of its neighbors is input to the corresponding MPU; the outputs are aggregated and then activated to generate the new representation of this image in the next layer. For each image, the propagation rule is

$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in R} \sum_{j \in N_i^r} \hat{a}_{ij}^{\,r}\, f_r^{(l)}\big(h_j^{(l)}\big)\Big), \quad (6)$$

where $\hat{a}_{ij}^{\,r}$ is the $(i,j)$ entry of the normalized adjacency matrix $\hat{A}_r$ of relation $r$. Eq. 6 is equivalent to Eq. 5 and can be seen as a generalization of Eq. 3.
Partial parameter sharing. Because each relation has its own MPU, an issue with applying Eq. 5 to an ImageGraph with many relation types is that the number of parameters grows rapidly with the number of relations. This leads to a very large model that is hard to train with limited computing and storage resources, especially for MPUs with multi-layer neural networks.
To address this issue, we introduce a partial parameter sharing (PPS) scheme between MPUs. With PPS, the MPUs share most of their parameters to reduce the total number of parameters. In our design, the same CNN architecture is applied to all MPUs in the same layer, and all parameters are shared between these MPUs except for the last parameter layer, whose parameters make the message passing rule differ between relations; see Fig. 3(a) for an ImageGCN layer with PPS. Thus, the message passing rule of Eq. 5 can be refined as

$$H^{(l+1)} = \sigma\Big(\sum_{r \in R} \hat{A}_r\, g_r^{(l)}\big(f^{(l)}(H^{(l)})\big)\Big), \quad (7)$$

where $f^{(l)}$ is shared by all relations and only $g_r^{(l)}$, which has only a few parameters, determines the different message passing methods for different relations. We could further share all parameters between all MPUs, that is, assign the same message passing rule to every relation: all parameter sharing (APS), Fig. 3(b). However, APS reduces the multiple relations to a single relation and thus reduces the model's expressive capacity; our experimental results in Sections 4.5 and 4.6 also demonstrate that APS is less effective than PPS.
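The computational point of PPS can be sketched directly: the expensive shared map is evaluated once, and each relation only applies a small private map on top. This is an illustrative NumPy sketch with linear maps standing in for the shared CNN and the private layers; the function and parameter names are our own:

```python
import numpy as np

def pps_messages(H, W_shared, w_privates):
    """Partial parameter sharing (Eq. 7 sketch): compute the shared transform
    f(H) = H @ W_shared once, then apply each relation's small private map g_r."""
    shared = H @ W_shared                    # shared by all MPUs, computed once
    return {r: shared @ w for r, w in w_privates.items()}
```

Under APS the private maps would all be identical, so every relation would produce the same message, collapsing the multi-relational structure.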
3.3 Training Strategies
Loss function. The loss function depends on the downstream task. For a classic node classification task, we can use a softmax activation function in the last layer and minimize the cross-entropy loss over all labeled nodes. For multi-label classification, the loss function can be designed as in our experiments in Section 4.4.

Batch propagation. Eq. 7 requires all nodes in the graph to be present during training and cannot support propagation in batches. This is difficult to scale to a large graph with high-dimensional node features, which is common in computer vision. One might simply construct a subgraph from each batch, but this usually yields no edges within a batch if the graph is sparse. GraphSAGE [13] was designed to address this issue for single-relational graphs. Inspired by GraphSAGE, we introduce an inductive batch propagation algorithm for multi-relational ImageGraphs in Algorithm 1. For each sample $i$ in a batch and each relation $r$, we randomly sample neighbors of $i$ that pass messages to $i$ with relation $r$ in a layer (Line 8). The union of the sampled neighbors and the samples in the batch is considered a new batch for the next layer (Lines 3 to 11). For an $L$-layer ImageGCN, the neighbor sampling should be repeated $L$ times to reach the $L$-th order neighbors of the initial batch (Lines 2 to 12). We construct the subgraph based on the final batch (Lines 13 to 16). In each ImageGCN layer, the message passing is conducted inside this subgraph (Line 17). Note that the image features can reside in persistent storage and be loaded only when a batch and the neighbors of its images are sampled (Line 13); this is important to reduce the memory requirement for large-scale graphs or graphs with high-dimensional node features.
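The batch-expansion step of the sampling procedure can be sketched as follows. This is an illustrative stdlib-only sketch of the neighbor-sampling loop (not the full Algorithm 1, which also builds the subgraph adjacency); the names `expand_batch` and `nbrs_per_rel` are our own:

```python
import random

def expand_batch(batch, nbrs_per_rel, num_layers, k=1, seed=0):
    """Batch expansion for inductive propagation (Algorithm 1 sketch):
    for every node reached so far, sample up to k neighbors per relation,
    repeat once per ImageGCN layer, and return the node set of the subgraph
    on which message passing is then run."""
    rng = random.Random(seed)
    nodes, frontier = set(batch), set(batch)
    for _ in range(num_layers):
        sampled = set()
        for v in frontier:
            for nbrs in nbrs_per_rel.values():       # one adjacency dict per relation
                cand = nbrs.get(v, [])
                if cand:
                    sampled.update(rng.sample(cand, min(k, len(cand))))
        frontier = sampled - nodes                   # only newly reached nodes expand next
        nodes |= sampled
    return nodes
```

Because sampling runs once per layer, an $L$-layer model reaches at most $L$-th order neighbors of the initial batch, which is exactly what the layer-wise message passing needs.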
In the test procedure, given a test batch (which can contain one or more samples), the relations between the test samples and the training samples are added to the adjacency matrices. The batch propagation algorithm (Algorithm 1) can then be directly applied to obtain the test data representations.
4 Experiments
4.1 ChestX-ray14 Dataset
We test ImageGCN for disease identification and localization on the ChestX-ray14 dataset [46], which consists of 112,120 frontal-view CXR images of 30,805 patients labeled with 14 thoracic disease labels. The labels were mined from the associated radiological reports using natural language processing and are expected to have an accuracy above 90% [46]. Of the 112,120 CXR images, 51,708 contain one or more pathologies; the remaining 60,412 images are considered normal. The ChestX-ray14 dataset also provides patient information for each CXR image, based on which we construct the ImageGraph. We randomly split the dataset into training, validation and test sets with the ratio 7:2:1 (78,484 training images, 11,212 validation images, 22,424 test images). We regard the provided labels as ground truth to train the model on the training set and evaluate it on the test set. We do not apply any data augmentation techniques.
Preprocessing. Each image in the dataset is resized and then cropped at the center for fast processing. We normalize the images by the mean ([0.485, 0.456, 0.406]) and standard deviation ([0.229, 0.224, 0.225]) of the images from ImageNet [8].

4.2 Graph Construction
To construct an ImageGraph from the dataset, besides the self-connection, we consider 4 types of relations between two CXR images that are relevant to disease classification and localization. (1) Person relation: if two images come from the same person, a person relation exists. (2) Age relation: if two images come from persons of the same age at the time the CXRs were taken, an age relation exists. (3) Gender relation: if the owners of two images have the same gender, a gender relation exists. (4) View relation: if two CXR images were taken with the same view position (postero-anterior or antero-posterior), a view relation exists.
All four relations are reflexive, symmetric and transitive, so each relation corresponds to a cluster graph consisting of a number of disjoint complete subgraphs. The person relation usually implies the gender relation but does not imply the age relation, because a person can have CXR images taken at different ages. The adjacency matrix of each relation is a block-diagonal matrix. Our ImageGCN is built on this multi-relational graph, with the adjacency matrices normalized in advance. Note that because the self-connection relation is handled separately, the adjacency matrices do not need added self-connections.
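Because each relation is defined by agreement on a single metadata field, its edge set can be built by grouping, with each group becoming a complete subgraph. A small illustrative sketch (the metadata field names here are hypothetical):

```python
from collections import defaultdict

def relation_edges(metadata, key):
    """Build one relation's edge set by grouping images whose metadata agree
    on `key` (e.g. 'patient', 'age', 'gender', 'view'); every group becomes
    a complete subgraph, so the relation is a cluster graph."""
    groups = defaultdict(list)
    for img_id, info in metadata.items():
        groups[info[key]].append(img_id)
    edges = set()
    for members in groups.values():
        for i in members:
            for j in members:
                if i != j:
                    edges.add((i, j))
    return edges
```

Running this once per metadata field yields the four block-diagonal adjacency structures described above (self-loops excluded, since the self-connection is its own relation).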
4.3 MPU design
Since the ImageGraph in our experiments is a cluster graph for each relation, each node can reach every other reachable node in 1 step, so a one-layer ImageGCN is enough to capture the structure information of an image node. Stacking multiple GCN layers would cause over-smoothing issues [26]. For the one-layer ImageGCN, we design the MPUs in our experiments as deep CNN architectures to capture high-level visual information. Following the partial parameter sharing introduced in Section 3 and Fig. 3, each MPU consists of two parts: the sharing part and the private part.
The sharing part. The sharing part of the MPUs consists, sequentially, of the feature layers of a pretrained CNN architecture, a transition layer and a global pooling layer. For a pretrained model, we discard the high-level fully-connected layers and classification layers and keep only the remaining feature layers as the first component of the sharing part. The transition layer consists of a convolutional layer, a batch normalization layer [16] and a ReLU layer, sequentially. In the transition layer, we let the convolutional layer have 1024 filters to transform the output of the previous layers into a uniform number (1024 in our experiments) of feature maps, which are used to generate the heatmap for disease localization. The global pooling layer pools the 1024 feature maps into a 1024-dimensional vector, with a kernel size equal to the feature map's size. Thus, through the sharing part of an MPU, an image is transformed into a 1024-dimensional vector. We independently test the feature layers of three different pretrained CNN architectures in our experiments: AlexNet [22], VGGNet-16 with batch normalization (VGGNet16-BN) [42], and ResNet-50 [14].

The private part. The private part accepts the output of the sharing part and outputs an embedding to the aggregator. For each relation, we use a linear layer (with different parameters per relation) as the private part to transform the 1024-dimensional vector from the sharing part into a 14-dimensional vector. For an image, the 14-dimensional vectors from its neighbors are aggregated and fed to a sigmoid activation function to generate its probabilities for the 14 diseases. With a method similar to [55], the weights of the private linear layer of the self-connection, combined with the activations of the transition layer in the sharing part, can generate a heatmap for the disease localization task.

All the learnable parameters of the ImageGCN model are contained in these two parts: the sharing part corresponds to the feature layers of a pretrained architecture plus the transition layer, and the private parts correspond to 5 linear layers for the 4 relations and the self-connection. Though only part of a pretrained model, e.g., AlexNet, is incorporated in an MPU, we call it an AlexNet MPU for convenience; similarly for the VGGNet16-BN MPU and the ResNet-50 MPU. For each MPU type (e.g., AlexNet), we use two baselines to evaluate our model: ImageGCN with all parameter sharing (APS) and the basic pretrained model (e.g., AlexNet) fine-tuned on the dataset. In the rest of this paper, we use AGCN-PPS to denote the ImageGCN with AlexNet MPUs and partial parameter sharing, and similarly VGCN-PPS for VGGNet16-BN MPUs and RGCN-PPS for ResNet-50 MPUs.
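Putting the pieces together for a single image, the private-part-plus-aggregation step can be sketched as follows. This is an illustrative NumPy sketch, assuming the shared backbone has already produced a 1024-d (here, toy-sized) vector per neighbor; the mean aggregator and sigmoid follow the description above, and the function name is our own:

```python
import numpy as np

def disease_probs(nbr_feats_per_rel, w_priv_per_rel):
    """One-layer ImageGCN output for a single image (sketch): each relation's
    private linear layer maps the shared features of the image's neighbors to
    14-d messages, which are mean-aggregated and squashed by a sigmoid into
    per-disease probabilities."""
    msgs = [f @ w
            for feats, w in zip(nbr_feats_per_rel, w_priv_per_rel)
            for f in feats]                     # one message per (relation, neighbor)
    z = np.mean(msgs, axis=0)                   # mean aggregator
    return 1.0 / (1.0 + np.exp(-z))             # sigmoid over disease scores
```

The self-connection is just another (relation, neighbor) pair here, with the image's own features and the self-connection's private weights.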
Model | Atel | Card | Effu | Infi | Mass | Nodu | Pne1 | Pne2 | Cons | Edem | Emph | Fibr | PT | Hern | Mean
AGCN-PPS (ours) | 0.781 | *0.899 | *0.865 | 0.701 | *0.813 | *0.721 | *0.718 | *0.881 | 0.788 | 0.888 | *0.882 | *0.804 | *0.778 | 0.904 | *0.816
AGCN-APS | 0.739 | 0.876 | 0.815 | 0.671 | 0.799 | 0.704 | 0.679 | 0.857 | 0.762 | 0.846 | 0.863 | 0.792 | 0.765 | 0.910 | 0.791
AlexNet | 0.782 | 0.895 | 0.863 | 0.705 | 0.781 | 0.714 | 0.716 | 0.869 | 0.790 | 0.889 | 0.876 | 0.799 | 0.773 | 0.899 | 0.811
RGCN-PPS (ours) | 0.785 | *0.890 | *0.868 | *0.699 | *0.824 | *0.739 | *0.723 | *0.895 | 0.790 | 0.887 | *0.911 | *0.819 | *0.786 | *0.941 | *0.826
RGCN-APS | 0.741 | 0.861 | 0.822 | 0.680 | 0.819 | 0.728 | 0.684 | 0.873 | 0.768 | 0.852 | 0.889 | 0.790 | 0.751 | 0.908 | 0.798
ResNet-50 | 0.789 | 0.889 | 0.863 | 0.698 | 0.807 | 0.723 | 0.714 | 0.876 | 0.791 | 0.888 | 0.899 | 0.799 | 0.772 | 0.933 | 0.817
VGCN-PPS (ours) | *0.796 | *0.896 | *0.873 | *0.699 | *0.834 | *0.762 | *0.717 | *0.890 | *0.788 | *0.889 | *0.907 | *0.813 | *0.792 | *0.917 | *0.827
VGCN-APS | 0.754 | 0.871 | 0.826 | 0.676 | 0.820 | 0.737 | 0.688 | 0.872 | 0.769 | 0.839 | 0.894 | 0.789 | 0.770 | 0.926 | 0.802
VGGNet16-BN | 0.785 | 0.876 | 0.872 | 0.686 | 0.813 | 0.734 | 0.712 | 0.882 | 0.787 | 0.883 | 0.902 | 0.812 | 0.773 | 0.925 | 0.817
Wang [46] | 0.716 | 0.807 | 0.784 | 0.609 | 0.706 | 0.671 | 0.633 | 0.806 | 0.708 | 0.835 | 0.815 | 0.769 | 0.708 | 0.767 | 0.738
Yao [51] | 0.772 | 0.904 | 0.859 | 0.695 | 0.792 | 0.717 | 0.713 | 0.841 | 0.788 | 0.882 | 0.829 | 0.767 | 0.765 | 0.914 | 0.803
Li [29] | 0.800 | 0.870 | 0.870 | 0.700 | 0.830 | 0.750 | 0.670 | 0.870 | 0.800 | 0.880 | 0.910 | 0.780 | 0.760 | 0.770 | 0.804
Kumar [23] | 0.762 | 0.913 | 0.864 | 0.692 | 0.750 | 0.666 | 0.715 | 0.859 | 0.784 | 0.888 | 0.898 | 0.756 | 0.774 | 0.802 | 0.794
Tang [43] | 0.756 | 0.887 | 0.819 | 0.689 | 0.814 | 0.755 | 0.729 | 0.850 | 0.728 | 0.848 | 0.906 | 0.818 | 0.765 | 0.875 | 0.803
Shen [41] | 0.766 | 0.801 | 0.797 | 0.751 | 0.760 | 0.741 | 0.778 | 0.800 | 0.787 | 0.820 | 0.773 | 0.765 | 0.759 | 0.748 | 0.775
Mao [31] | 0.750 | 0.869 | 0.810 | 0.687 | 0.782 | 0.726 | 0.695 | 0.845 | 0.728 | 0.834 | 0.870 | 0.798 | 0.758 | 0.877 | 0.788
Guan [12] | 0.781 | 0.883 | 0.831 | 0.697 | 0.830 | 0.764 | 0.725 | 0.866 | 0.758 | 0.853 | 0.911 | 0.826 | 0.780 | 0.918 | 0.816

Table 1: AUC results of various models for classifying the 14 diseases on the ChestX-ray14 dataset. Values marked with * (red in the original) are those where our ImageGCN performs better than or equal to the corresponding baseline models. Abbreviations: Atel: Atelectasis; Card: Cardiomegaly; Effu: Effusion; Infi: Infiltration; Nodu: Nodule; Pne1: Pneumonia; Pne2: Pneumothorax; Cons: Consolidation; Edem: Edema; Emph: Emphysema; Fibr: Fibrosis; PT: Pleural Thickening; Hern: Hernia.
4.4 Experimental settings
Weakly supervised learning. The ChestX-ray14 dataset provides pathology bounding box (Bbox) annotations for a small number of CXR images, which can be used as ground truth for the disease localization task. In our experiments, we adopt a weakly supervised learning scheme [38]: no Bbox annotations are used for training; they are used only to evaluate the disease localization performance of a model trained with image-level labels alone.

Loss function. For multi-label classification on ChestX-ray14, the true label of each CXR image is a 14-dimensional binary vector $y$, where $y_c = 1$ denotes that the corresponding disease is present and $y_c = 0$ denotes absence. An all-zero vector represents "No Finding" among the 14 diseases. Due to the high sparsity of the label matrix, we use the weighted cross-entropy loss as Wang et al. [46] did, where each sample with true labels $y$ and output probabilities $p$ has the loss

$$L(y, p) = -\beta_P \sum_{c:\, y_c = 1} \ln p_c \;-\; \beta_N \sum_{c:\, y_c = 0} \ln(1 - p_c), \quad (8)$$

where $\beta_P = \frac{|P| + |N|}{|P|}$ and $\beta_N = \frac{|P| + |N|}{|N|}$, with $|N|$ and $|P|$ the numbers of '0's and '1's in a mini-batch, respectively. The losses of the images in a mini-batch are averaged as the loss of the batch.
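The weighted loss can be written compactly as follows. This is an illustrative NumPy sketch of Eq. 8 under our reconstruction of the weights (the `eps` clipping and guard against empty classes are our own additions for numerical safety):

```python
import numpy as np

def weighted_bce(y_true, y_prob, eps=1e-7):
    """Weighted cross-entropy of Eq. 8 (sketch): the positive and negative
    terms are reweighted by (|P|+|N|)/|P| and (|P|+|N|)/|N|, where |P| and |N|
    count the 1s and 0s in the mini-batch labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    n_pos = y_true.sum()
    n_neg = y_true.size - n_pos
    beta_p = (n_pos + n_neg) / max(n_pos, 1)   # upweights the rare positives
    beta_n = (n_pos + n_neg) / max(n_neg, 1)
    loss = -(beta_p * y_true * np.log(y_prob)
             + beta_n * (1 - y_true) * np.log(1 - y_prob))
    return loss.mean()
```

With a heavily sparse label matrix, $\beta_P \gg \beta_N$, so the few positive labels are not drowned out by the many negatives.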
Hyperparameters. We set the batch size to 16, and 1 neighbor is sampled for each image and each relation. All models are trained using the Adam optimizer [20]. We terminate training after 10 epochs. After each epoch, we evaluate on the validation set, and the model with the best classification performance on the validation set is saved for evaluation.
4.5 Disease Identification
For the disease identification task, we use the AUC score to evaluate model performance. Table 1 shows the AUC scores of all the models on the 14 diseases. As expected from Section 3, for all three types of MPUs, PPS clearly outperforms APS. For each MPU type, GCN-PPS outperforms GCN-APS and the corresponding basic model overall and on most of the diseases. VGCN-PPS even outperforms the corresponding VGCN-APS and VGGNet16-BN on all 14 diseases.
Table 1 also lists results reported in related references. Studies such as [39] that used a different training-validation-test split ratio or augmented the dataset are not listed. Our VGCN-PPS achieves the best overall result among the compared state-of-the-art methods. On 7 of the 14 diseases, ImageGCN achieves the best result among these state-of-the-art methods.
T(IoU) | Metric | Model | Atel | Card | Effu | Infi | Mass | Nodu | Pne1 | Pne2
0.1 | Acc | AGCN-PPS (ours) | *0.4889 | 0.9932 | *0.6667 | *0.6667 | *0.4706 | 0.0000 | *0.6417 | *0.3469
0.1 | Acc | AlexNet | 0.3889 | 1.0000 | 0.6144 | 0.5285 | 0.4706 | 0.0253 | 0.5833 | 0.3265
0.1 | Acc | AGCN-APS | 0.3000 | 0.9863 | 0.5294 | 0.4634 | 0.2824 | 0.0127 | 0.5167 | 0.2755
0.1 | Acc | Wang [46] | 0.6888 | 0.9383 | 0.6601 | 0.7073 | 0.4000 | 0.1392 | 0.6333 | 0.3775
0.1 | AFP | AGCN-PPS (ours) | *0.5111 | 0.0137 | *0.3333 | *0.3333 | *0.5294 | 1.0127 | *0.3583 | *0.6531
0.1 | AFP | AlexNet | 0.6111 | 0.0000 | 0.3856 | 0.4715 | 0.5294 | 0.9747 | 0.4167 | 0.6735
0.1 | AFP | AGCN-APS | 0.7000 | 0.0137 | 0.4706 | 0.5447 | 0.7176 | 1.0000 | 0.4833 | 0.7245
0.1 | AFP | Wang [46] | 0.8943 | 0.5996 | 0.8343 | 0.6250 | 0.6666 | 0.6077 | 1.0203 | 0.4949
0.5 | Acc | AGCN-PPS (ours) | *0.0222 | *0.3836 | 0.0458 | *0.1138 | 0.0471 | *0.0000 | *0.0750 | *0.0408
0.5 | Acc | AlexNet | 0.0111 | 0.2260 | 0.0784 | 0.0569 | 0.0824 | 0.0000 | 0.0750 | 0.0306
0.5 | Acc | AGCN-APS | 0.0000 | 0.3082 | 0.0327 | 0.0325 | 0.0235 | 0.0000 | 0.0500 | 0.0204
0.5 | Acc | Wang [46] | 0.0500 | 0.1780 | 0.1111 | 0.0650 | 0.0117 | 0.0126 | 0.0333 | 0.0306
0.5 | AFP | AGCN-PPS (ours) | *0.9778 | *0.6233 | 0.9542 | *0.8862 | 0.9529 | 1.0127 | *0.9250 | *0.9592
0.5 | AFP | AlexNet | 0.9889 | 0.7740 | 0.9216 | 0.9431 | 0.9176 | 1.0000 | 0.9250 | 0.9694
0.5 | AFP | AGCN-APS | 1.0000 | 0.6918 | 0.9673 | 0.9756 | 0.9765 | 1.0127 | 0.9500 | 0.9796
0.5 | AFP | Wang [46] | 1.0884 | 0.8506 | 1.0051 | 0.7632 | 0.7226 | 0.6189 | 1.1321 | 0.5478

Table 2: Disease localization results (Acc and AFP) at T(IoU) = 0.1 and 0.5. Values marked with * (red in the original) are those where AGCN-PPS performs better than or equal to the baseline models.
In Table 1, GCN-APS is less effective than the corresponding basic model: if all relations are treated equally by APS, the graph becomes a complete graph, and an image's own features are heavily dwarfed by the messages from its neighbor images. For example, in Fig. 3(b), the message from the image itself is weighted equally with its neighbors' messages. This makes an image and its neighbors indistinguishable and thus leads to even lower performance than the baseline. On the contrary, with PPS in Fig. 3(a), messages from neighbors with different relations are weighted differently by the private layers, so less important messages have less influence on the result. Thus, ImageGCNs with PPS perform better than those with APS and than the baseline models.
4.6 Disease Localization
The ChestX-ray14 dataset also contains 984 Bboxes for 880 CXR images labeled by board-certified radiologists. The provided Bboxes correspond to 8 of the 14 diseases; we use these Bboxes as ground truth to evaluate the disease localization performance of the models.
With class activation mapping [55], for each image we generate a normalized heatmap from the MPU of the self-connection in a weakly supervised manner. Following the setting of Wang et al. [46], we segment the heatmap with a threshold of 180 and generate Bboxes to cover the activated regions in the binary map. We use the intersection-over-union ratio (IoU) between the detected region and the annotated ground truth to evaluate localization performance, and define a localization as correct when IoU > T(IoU), where T(IoU) is a self-defined threshold.
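The IoU criterion above is standard; for concreteness, a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form (the function name and box encoding are our own choices):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection counts as correct when `iou(pred, gt) > t` for the chosen threshold `t` (0.1 or 0.5 in Table 2).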
The disease localization results of the compared models are listed in Table 2. From Table 2, our ImageGCN with AlexNet MPUs and PPS outperforms the baselines in most cases.
Fig. 4 shows qualitative localization results of AGCN-PPS compared with the baselines. From Fig. 4, it can be seen that our ImageGCN with AlexNet MPUs and PPS usually produces smaller and more accurate Bboxes than the baselines.
5 Conclusion
We propose ImageGCN to model relations between images and apply it to CXR images for disease identification and localization. To the best of our knowledge, this is the first study to model natural image-level relations for image representation learning. ImageGCN extends the original GCN to high-dimensional or unstructured data, and incorporates the ideas of relational GCN and GraphSAGE for batch propagation on multi-relational ImageGraphs. We also introduce the PPS scheme to reduce the complexity of ImageGCN. The experimental results on the ChestX-ray14 dataset demonstrate that ImageGCN outperforms the respective baselines in both disease identification and localization, and achieves comparable and often better results than the state-of-the-art methods. Future research includes tuning the MPUs of ImageGCN for different vision tasks and testing ImageGCN on more general datasets.
References

[1]
J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Simaan.
Graph convolutional encoders for syntaxaware neural machine translation.
In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967, 2017.  [2] R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrix completion. In SIGKDD, Deep Learning Day, 2018.
 [3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [4] J. Chen, T. Ma, and C. Xiao. FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
 [5] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
 [6] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3076–3086, 2017.
 [7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
 [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [9] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pages 6530–6539, 2017.
 [10] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2018.
 [11] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.
 [12] Q. Guan and Y. Huang. Multi-label chest x-ray image classification via category-wise residual attention learning. Pattern Recognition Letters, 2018.
 [13] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024–1034, 2017.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
 [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 [17] J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1219–1228, 2018.
 [18] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
 [19] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
 [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [23] P. Kumar, M. Grewal, and M. M. Srivastava. Boosted cascaded convnets for multi-label classification of thoracic diseases in chest radiographs. In International Conference Image Analysis and Recognition, pages 546–552. Springer, 2018.
 [24] L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018.
 [25] C.-W. Lee, W. Fang, C.-K. Yeh, and Y.-C. Frank Wang. Multi-label zero-shot learning with structured knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1576–1585, 2018.
 [26] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
 [27] Y. Li, R. Jin, and Y. Luo. Classifying relations in clinical narratives using segment graph convolutional and recurrent neural networks (SegGCRNs). Journal of the American Medical Informatics Association, 26(3):262–268, 2018.
 [28] Z. Li, Q. Chen, and V. Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 537–546, 2018.
 [29] Z. Li, C. Wang, M. Han, Y. Xue, W. Wei, L.-J. Li, and L. Fei-Fei. Thoracic disease identification and localization with limited supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8290–8299, 2018.
 [30] M. Maire, T. Narihira, and S. X. Yu. Affinity CNN: Learning pixel-centric pairwise relations for figure/ground embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 174–182, 2016.
 [31] C. Mao, L. Yao, Y. Pan, Y. Luo, and Z. Zeng. Deep generative classifiers for thoracic disease diagnosis with chest x-ray images. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1209–1214. IEEE, 2018.
 [32] K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 20–28. IEEE, 2017.
 [33] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.
 [34] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, 2019.
 [35] M. Narasimhan, S. Lazebnik, and A. Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. In Advances in Neural Information Processing Systems, pages 2659–2670, 2018.
 [36] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
 [37] W. Norcliffe-Brown, S. Vafeias, and S. Parisot. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pages 8344–8353, 2018.
 [38] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
 [39] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
 [40] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In ESWC, pages 593–607. Springer, 2018.
 [41] Y. Shen and M. Gao. Dynamic routing on deep neural network for thoracic disease classification and sensitive area localization. In International Workshop on Machine Learning in Medical Imaging, pages 389–397. Springer, 2018.
 [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [43] Y. Tang, X. Wang, A. P. Harrison, L. Lu, J. Xiao, and R. M. Summers. Attentionguided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In International Workshop on Machine Learning in Medical Imaging, pages 249–258. Springer, 2018.
 [44] G. Te, W. Hu, A. Zheng, and Z. Guo. RGCNN: Regularized graph CNN for point cloud segmentation. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 746–754. ACM, 2018.
 [45] H. Wang, H. Huang, and C. Ding. Image annotation using bi-relational graph of images and semantic labels. In CVPR 2011, pages 793–800. IEEE, 2011.
 [46] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3462–3471, 2017.
 [47] X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6857–6866, 2018.
 [48] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–685, 2018.
 [49] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 9–16. IEEE, 2010.
 [50] L. Yao, C. Mao, and Y. Luo. Graph convolutional networks for text classification. In AAAI, 2019.
 [51] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, and K. Lyman. Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501, 2017.
 [52] T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018.
 [53] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983. ACM, 2018.
 [54] Y. Zhang, P. Qi, and C. D. Manning. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, 2018.
 [55] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
 [56] Y. Zhu and S. Jiang. Deep structured learning for visual relationship detection. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [57] M. Zitnik, M. Agrawal, and J. Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457–i466, 2018.