Interpreting a radiology imaging exam is a complex reasoning task, in which radiologists integrate patient history and image features from different anatomical locations to generate the most likely diagnoses. Convolutional Neural Networks (CNNs) have been widely applied in earlier works on automatic Chest X-ray (CXR) interpretation, one of the most commonly requested medical imaging modalities. Many of these works have framed the problem as a multi-label abnormality classification problem [rajpurkar2017chexnet, wang2017chestx], an abnormality detection and localization problem [gabruseva2020deep, sirazitdinov2019deep, wu2020automatic], or an image-to-text report generation problem [li2018hybrid, wang2018tienet]. However, these models fail to capture inter-dependencies between features or labels. Leveraging such contextual information, which encodes relational information among pathologies, is crucial for improving interpretability and reasoning in clinical diagnosis.
To this end, Graph Neural Networks (GNNs) have surfaced as a viable solution for modeling disease co-occurrence across images. GNNs learn representations of nodes based on the graph structure and have been widely explored, from graph embedding methods [grover2016node2vec, tang2015line] and generative models [wang2018graphgan, you2018graphrnn] to attention-based or recurrent models [li2015gated, velivckovic2017graph], among others. For a comprehensive review of model architectures, we refer the reader to a recent survey [wu2020comprehensive]. In particular, Graph Convolutional Networks (GCNs) [kipf2016semi] utilize graph convolution operations to learn representations by aggregating information from the neighborhood of a node, and have been successfully applied to CXR image classification. For example, the multi-relational ImageGCN model learns image representations that leverage additional information from related images [mao2019imagegcn], while CheXGCN and DD-GCN incorporate label co-occurrence GCN modules to capture the correlations between labels [chen2020label, liu2020dynamic]. To mitigate noise originating from background regions in related images, recent work utilizes attention mechanisms [cai2018iterative, zhou2021contrast] or auxiliary tasks such as lung segmentation [gordienko2018deep, chen2020two]. However, none of these works model correlations among anatomical regions and findings, e.g., by outputting the anatomical location for each finding.
We propose a novel model that captures the dependencies between the anatomical regions of a chest X-ray for classification of the pathological findings, termed Anatomy-aware X-ray Network (AnaXNet). We first extract the features of the anatomical regions using an object detection model. We develop a method to accurately capture the correlations between the various anatomical regions and learn their dependencies with a GCN model. Finally, we combine the localized region features via attention weights computed with a non-local operation [wang2018non] that resembles self-attention.
The main contributions of this paper are summarized as follows: 1) we propose a novel multi-label CXR findings classification framework that integrates both global and local anatomical visual features and outputs accurate localization of CXR findings at the level of clinically relevant anatomical regions, 2) we propose a method to automatically learn the correlation between the findings and the anatomical regions, and 3) we conduct an in-depth experimental analysis to demonstrate that our proposed AnaXNet model outperforms previous baselines and state-of-the-art models.
We first describe our proposed framework for multi-label chest X-ray classification. Let $\mathcal{X} = \{x_1, \ldots, x_M\}$ be a CXR image collection comprised of a set of chest X-ray images, where each image $x_i$ is associated with a set of labels $y_i = [y_{i1}, \ldots, y_{iK}] \in \{0,1\}^{K}$, with $y_{ik}$ indicating whether the label for pathology $k$ appears in image $x_i$ or not. Then the goal is to design a model that predicts the label set for an unseen image as accurately as possible, by utilizing the correlation among anatomical region features $Y \in \mathbb{R}^{N \times D}$, where $N$ is the number of anatomical region embeddings, each with dimensionality $D$. The anatomical region feature extractor is described in Subsection 3.3.
Given this initial set of anatomical region representations $Y^{(0)} = Y$, we define a normalized adjacency matrix $\bar{A} \in \mathbb{R}^{N \times N}$ that captures region correlations, and utilize a GCN to update $Y$ as follows:

$$Y^{(l+1)} = \sigma\!\left(\bar{A}\, Y^{(l)}\, W^{(l)}\right), \qquad l = 0, \ldots, L-1,$$

where $W^{(l)}$ is the learned weight matrix, $\sigma(\cdot)$ denotes a non-linear operation (ReLU [zeiler2013rectified] in our experiments), and $L$ is the number of stacked GCN layers. To construct the adjacency matrix $\bar{A}$, we extract co-occurrence patterns between anatomical regions for label pairs. More specifically, the label co-occurrence matrix $A$ can be computed based on Jaccard similarity:

$$A_{uv} = \frac{\sum_{k=1}^{K} \left| C_{u,k} \cap C_{v,k} \right|}{\sum_{k=1}^{K} \left| C_{u,k} \cup C_{v,k} \right|},$$

where $u$ and $v$ represent anatomical regions, $C_{u,k}$ is the set of images in which label $k$ is positive for region $u$ across all images, and $\cap$, $\cup$ denote the intersection and union over multi-sets. However, this label co-occurrence construction may overfit the training data due to incorporating noisy rare occurrences. To mitigate such issues, we use a filtering threshold $\tau$, i.e.,

$$\bar{A}_{uv} = \begin{cases} 1, & \text{if } A_{uv} \geq \tau, \\ 0, & \text{otherwise,} \end{cases}$$

where $\bar{A}$ is the final adjacency matrix.
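As an illustration, the following Python/NumPy sketch computes a thresholded Jaccard co-occurrence adjacency matrix from a binary tensor of per-region, per-finding labels; the tensor layout and the threshold value are illustrative assumptions, not the dataset's actual format.

```python
import numpy as np

def build_adjacency(region_labels: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Build a thresholded Jaccard co-occurrence adjacency matrix.

    region_labels: binary array of shape (num_images, num_regions, num_findings),
                   where entry (i, u, k) = 1 if finding k is positive in region u
                   of image i. `tau` is the filtering threshold (hypothetical value).
    """
    _, num_regions, _ = region_labels.shape
    A = np.zeros((num_regions, num_regions))
    for u in range(num_regions):
        for v in range(num_regions):
            # Intersection / union of positive (image, finding) occurrences
            inter = np.logical_and(region_labels[:, u, :], region_labels[:, v, :]).sum()
            union = np.logical_or(region_labels[:, u, :], region_labels[:, v, :]).sum()
            A[u, v] = inter / union if union > 0 else 0.0
    return (A >= tau).astype(np.float32)  # binarize with the filtering threshold
```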
To capture both global and local dependencies between anatomical regions, we leverage a non-local operation that resembles self-attention [wang2018non]:

$$Z = \mathrm{softmax}\!\left( Y W_{\theta} \left( Y W_{\phi} \right)^{\top} \right) Y,$$

where $W_{\theta}$ and $W_{\phi}$ are learned projection matrices. The final prediction is computed via

$$\hat{y} = f\!\left( \left[ Z \,;\, Y^{(L)} \right] \right),$$

where $f$ is a fully connected layer applied to the concatenation of the attention output $Z$ and the GCN output $Y^{(L)}$ to obtain the label predictions. The network is trained with a multi-label cross-entropy classification loss

$$\mathcal{L} = -\sum_{n=1}^{N} \sum_{k=1}^{K} \left[ y_{nk} \log \hat{y}_{nk} + \left(1 - y_{nk}\right) \log\!\left(1 - \hat{y}_{nk}\right) \right],$$

where $y_{nk}$ and $\hat{y}_{nk}$ are the ground-truth label and predicted probability for finding $k$ in anatomical region $n$.
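The forward pass described above can be summarized in the following minimal PyTorch sketch (two GCN layers, a non-local attention block over region features, and a shared fully connected classifier); module names, layer sizes, and the way the attention and GCN outputs are combined are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnaXNetSketch(nn.Module):
    """Illustrative sketch: GCN over region features + non-local attention + classifier."""

    def __init__(self, adj: torch.Tensor, in_dim: int = 1024, hid_dim: int = 512,
                 num_findings: int = 9):
        super().__init__()
        self.register_buffer("adj", adj)          # (N, N) thresholded adjacency matrix
        self.gcn1 = nn.Linear(in_dim, hid_dim)    # W^(0)
        self.gcn2 = nn.Linear(hid_dim, in_dim)    # W^(1)
        self.theta = nn.Linear(in_dim, in_dim)    # projections for the non-local block
        self.phi = nn.Linear(in_dim, in_dim)
        self.classifier = nn.Linear(2 * in_dim, num_findings)

    def forward(self, Y: torch.Tensor) -> torch.Tensor:
        # Y: (batch, N, in_dim) anatomical region features from the detector
        H = F.relu(torch.einsum("uv,bvd->bud", self.adj, self.gcn1(Y)))
        H = F.relu(torch.einsum("uv,bvd->bud", self.adj, self.gcn2(H)))
        # Non-local (self-attention style) aggregation over regions
        attn = torch.softmax(self.theta(Y) @ self.phi(Y).transpose(1, 2), dim=-1)
        Z = attn @ Y                               # (batch, N, in_dim)
        logits = self.classifier(torch.cat([Z, H], dim=-1))
        return logits                              # (batch, N, num_findings)

# Training would use a multi-label (binary) cross-entropy loss per region and finding:
# loss = F.binary_cross_entropy_with_logits(logits, targets)
```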
We describe experimental details, i.e., evaluation dataset, metrics, etc., and present quantitative and qualitative results, comparing AnaXNet with several baselines.
| Label ID | Description | # Images (label = 1) | # Bboxes |
|---|---|---|---|
| L4 | Enlarged Cardiac Silhouette | 55,187 | 58,929 |
| L5 | Pulmonary Edema/Hazy Opacity | 33,441 | 145,965 |
| L8 | Fluid Overload/Heart Failure | 6,317 | 18,066 |
| All 9 labels | Positive/Total | 153,333 / 217,417 | 720,098 / 3,877,010 |
Existing annotations of large-scale CXR datasets [wang2017chestx, johnson2019mimic, irvin2019chexpert]
are either weak global labels for 14 common CXR findings extracted from reports with Natural Language Processing (NLP) methods [irvin2019chexpert], or manually annotated bounding boxes for a smaller subset of images and a limited number of labels [shih2019augmenting, filice2020crowdsourcing]. None of these annotated datasets describe the anatomical location of the different CXR pathologies. However, localizing pathologies to anatomy is a key aspect of radiologists' reasoning and reporting process, where knowledge of the correlation between image findings and anatomy can help narrow down potential diagnoses.
Intersection over Union (IoU) scores are calculated between the automatically extracted anatomical bounding box (Bbox) regions and a set of single manual ground-truth bounding boxes for 1000 CXR images. Average precision and recall across the 9 CXR pathologies are shown for the NLP-derived labels at: right lung (RL), right apical zone (RAZ), right upper lung zone (RULZ), right mid lung zone (RMLZ), right lower lung zone (RLLZ), right costophrenic angle (RCA), left lung (LL), left apical zone (LAZ), left upper lung zone (LULZ), left mid lung zone (LMLZ), left lower lung zone (LLLZ), left costophrenic angle (LCA), mediastinum (Med), upper mediastinum (UMed), cardiac silhouette (CS) and trachea (Trach).
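For completeness, the IoU between an extracted box and a manually annotated box can be computed as in the following short sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the coordinate convention is an assumption).

```python
def bbox_iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```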
The Chest ImaGenome dataset builds on the works of [wu2020automatic, wu2020ai] to fill this gap by using a combination of rule-based text-analysis and atlas-based bounding box extraction techniques to structure the anatomies and the related pathologies from 217,417 report texts and frontal images (AP or PA view) from the MIMIC-CXR dataset [johnson2019mimic]. In summary, the text pipeline [wu2020ai]
first sections the report and retains only the finding and impression sentences. It then uses a previously curated CXR concept dictionary (lexicons) to identify named entities and detect their context (negated or not), as required for labeling the 18 anatomical regions and 9 CXR pathology labels from each retained sentence. The pathology labels are associated with the anatomical region described in the same sentence using a natural language parser, SpaCy [spacy], and clinical heuristics provided by a radiologist were used to correct obvious pathology-to-anatomy assignment errors (e.g., lung opacity wrongly assigned to the mediastinum). Finally, the pathology label(s) for each of the 18 anatomical regions from repeated sentences are grouped to the exam level. A separate anatomy atlas-based bounding box pipeline extracts the coordinates of the 18 anatomical regions from each frontal image [wu2020automatic].
Table 1 shows high-level statistics of the generated dataset. Dual annotations for 500 random reports (disagreements resolved via consensus) were curated at the sentence level by a clinician and a radiologist, who also annotated the bounding boxes for 1000 frontal CXRs (single annotation). For the 9 pathologies, the overall NLP average precision and recall, without considering localization, are 0.9819 and 0.9875, respectively. More detailed results by anatomical region are shown in Table 2.
The anatomical region feature extractor is a Faster R-CNN with ResNet-50 [he2016deep] as the base model. Additional implementation details, e.g., hyper-parameters, are provided in Subsection 3.3. We perform a comprehensive analysis on the Chest ImaGenome dataset. We compare our AnaXNet model against: 1) GlobalView: we implement a DenseNet169 [huang2017densely] model as a baseline to contrast the effectiveness of the location-aware AnaXNet against a global view of the image, 2) Faster R-CNN [ren2015faster] followed by a fully-connected layer, i.e., without the GCN and attention modules, to establish a baseline accuracy for the classification task using the extracted anatomical features, and 3) CheXGCN: we re-implement the state-of-the-art CheXGCN model [chen2020label] that utilizes GCNs to learn the label dependencies between pathologies in the X-ray images. The model uses a CNN for feature extraction and a GCN to learn the relationships between the labels via word embeddings. We replace the overall CNN with Faster R-CNN for a fair comparison with our model, but retain its label co-occurrence learning module.
3.3 Implementation details
We train the detection model to detect the 18 anatomical regions. To obtain the anatomical features, we take the final output of the model and perform non-maximum suppression for each object class using an IoU threshold. We then select all regions where any class probability exceeds our confidence threshold, for which we use a value of 0.5. For each region, we extract a 1024-dimensional convolutional feature vector. For multiple predictions of the same anatomical region, we select the prediction with the highest confidence score and drop the duplicates. When the model fails to detect a bounding box, we use a vector of zeros to represent the anatomical features of that region within the GCN. We use detectron2 (https://github.com/facebookresearch/detectron2) to train the Faster R-CNN that extracts the anatomical regions and their features. Our GCN model is made up of two GCN layers with output dimensionality of 512 and 1024, respectively. We train with the Adam optimizer for 25 epochs in total.
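As a rough sketch of this post-processing step, the per-region feature matrix could be assembled as follows; the detection output format and variable names are assumptions for illustration, not detectron2's actual API.

```python
import numpy as np

NUM_REGIONS = 18
FEAT_DIM = 1024

def assemble_region_features(detections, conf_thresh: float = 0.5) -> np.ndarray:
    """Assemble an (18, 1024) feature matrix from per-box detector outputs.

    `detections` is assumed to be a list of dicts with keys
    'region_id' (0..17), 'score' (confidence), and 'feature' (1024-d vector),
    produced after class-wise non-maximum suppression.
    """
    features = np.zeros((NUM_REGIONS, FEAT_DIM), dtype=np.float32)  # zeros if a region is missed
    best_score = np.full(NUM_REGIONS, -np.inf)
    for det in detections:
        if det["score"] < conf_thresh:
            continue  # keep only confident detections
        r = det["region_id"]
        if det["score"] > best_score[r]:  # keep only the highest-confidence duplicate
            best_score[r] = det["score"]
            features[r] = det["feature"]
    return features
```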
3.4 Results and Evaluation
Results are summarized in Table 3. The evaluation metric is Area Under the Curve (AUC). Note that the baseline GlobalView is a global classifier and does not produce localized labels; the remaining rows in Table 3 show localized label accuracy. For the localized methods, the reported numbers represent the average AUC of the model for each label over the various anatomical regions. If a finding is detected at the wrong anatomical location, it counts as a false detection. For a fair comparison, we use the same 70/10/20 train/validation/test split across patients to train each model. The AnaXNet model obtains improvements over the previous methods while also localizing the diseases in their correct anatomical regions. The GlobalView baseline is most likely limited because it focuses on the entire image instead of a specific region.
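To illustrate this evaluation protocol, the per-label AUC averaged over anatomical regions could be computed roughly as in the following scikit-learn sketch; array names and shapes are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def average_auc_per_label(y_true: np.ndarray, y_score: np.ndarray) -> np.ndarray:
    """Average AUC per finding over anatomical regions.

    y_true, y_score: arrays of shape (num_images, num_regions, num_findings),
    ground-truth binary labels and predicted scores per region and finding.
    """
    num_regions, num_findings = y_true.shape[1], y_true.shape[2]
    aucs = np.full((num_findings, num_regions), np.nan)
    for k in range(num_findings):
        for r in range(num_regions):
            labels = y_true[:, r, k]
            if labels.min() != labels.max():  # AUC is undefined if only one class is present
                aucs[k, r] = roc_auc_score(labels, y_score[:, r, k])
    return np.nanmean(aucs, axis=1)  # one averaged AUC per finding
```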
The CheXGCN model outperforms the other two baselines but is also limited, because it focuses on a single region and uses label dependencies to learn the relationships between the labels, while ignoring the relationships between the anatomical regions of the chest X-ray image. In Table 2, we visualize the output from both the CheXGCN model and our AnaXNet model. The CheXGCN model had difficulty predicting findings in small anatomical regions such as the costophrenic angles, while our model had additional information from the remaining anatomical regions, which helped its predictions. The CheXGCN model also struggled with the enlarged cardiac silhouette label, because information from the surrounding regions is needed to accurately determine whether the heart is enlarged.
In Figure 3, we also visualize the output of the Grad-CAM [selvaraju2017grad] method on the GlobalView model to highlight the importance of localization: although the prediction of Enlarged Cardiac Silhouette was correct, the GlobalView model focused on the lungs. Our method provides accurate localization information along with the finding.
We described a methodology for localized detection of diseases in chest X-ray images. Both the algorithmic framework of this work and the dataset of images with pathologies labeled within semantically labeled bounding boxes are important contributions. In our AnaXNet design, a Faster R-CNN architecture detects the bounding boxes and embeds them. The resulting embedded vectors are then used as input to a GCN and an attention block that learn representations by aggregating information from the neighboring regions.
This approach accurately detects each of the nine studied abnormalities and places it in the correct bounding box in the image. The 18 pre-specified bounding boxes are devised to map to the anatomical areas often described by radiologists in chest X-ray reports. As a result, our method provides all the necessary components for composing a structured report. Our vision is that the output of our trained model, subject to expansion of the number and variety of findings, will provide both the finding and its anatomical location for downstream report generation and other reasoning tasks. Despite the difficulty of localized disease detection, our method outperforms a global classifier. As our data show (see Figure 3), global classification can be unreliable even when the label is correct, since the classifier might find the correct label for the wrong reason, at an irrelevant spot.