Improving Interpretability of Deep Neural Networks in Medical Diagnosis by Investigating the Individual Units

07/19/2021 ∙ by Woo-Jeoung Nam, et al. ∙ Korea University

As the lack of interpretability has been pointed out as an obstacle to the adoption of Deep Neural Networks (DNNs), there is increasing interest in solving the transparency issue to guarantee their impressive performance. In this paper, we demonstrate the efficiency of recent attribution techniques in explaining the diagnostic decision by visualizing the significant factors in the input image. By utilizing the characteristics of objectness that DNNs have learned, fully decomposing the network prediction yields a clear localization of the target lesion. To verify our work, we conduct experiments on chest X-ray diagnosis with publicly accessible datasets. As an intuitive assessment metric for explanations, we report the Intersection over Union (IoU) between the visual explanation and the bounding box of lesions. Experimental results show that the recently proposed attribution methods visualize more accurate localization for the diagnostic decision than the traditionally used CAM. Furthermore, we analyze the inconsistency of intentions between humans and DNNs, which is easily obscured by high performance. By visualizing the relevant factors, it is possible to confirm that the criterion for the decision is in line with the learning strategy. Our analysis of unmasking machine intelligence underlines the necessity of explainability in medical diagnostic decisions.


1 Attribution Methods

In this section, we introduce the notation and the attribution methods LRP and RAP, which are closely related to each other but have different perspectives and algorithms. The overview of decomposition and visualization is illustrated in Fig. 2. For an input $x$, we denote by $f(x)$ the value of the network output before passing through the classification layer, such as a sigmoid or softmax layer. $R$ represents the input relevance for the attributing procedure, which is equal to the value $f(x)$ of the prediction node. $w$, $b$, and $\sigma$ denote the weight, bias, and activation function between layers $l$ and $l+1$, respectively. $x_j$ is the value of neuron $j$ after applying the activation function, and $z_{ij} = x_i w_{ij}$ denotes the contribution of neuron $i$ to neuron $j$. The superscripts $+$ and $-$ denote the positive and negative parts of a value.

1.0.1 Layerwise Relevance Propagation

The principle of LRP [2] is to find the parts with high relevance in the input by propagating the result from the back (output) to the front (input) of the network. The algorithm is based on the conservation principle, which maintains the total relevance across all layers, from input to output:

$$\sum_{i} R_i^{(1)} = \cdots = \sum_{i} R_i^{(l)} = \sum_{j} R_j^{(l+1)} = \cdots = f(x) \qquad (1)$$

Among the various LRP versions introduced in [2], we utilize LRP-$\alpha\beta$, which separates the positive and negative activations during the relevance propagation process while maintaining the conservation rule (1).

$$R_i^{(l)} = \sum_j \left( \alpha \, \frac{z_{ij}^+}{\sum_{i'} z_{i'j}^+} - \beta \, \frac{z_{ij}^-}{\sum_{i'} z_{i'j}^-} \right) R_j^{(l+1)} \qquad (2)$$

In this rule, $\alpha - \beta = 1$ and $\beta \geq 0$. The propagated attributions are allocated to the pixels of the input image, indicating how relevant each pixel is to the output prediction. In this paper, the function parameters are fixed to a single $(\alpha, \beta)$ setting satisfying these constraints for all experiments.
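To make the propagation rule concrete, the following is a minimal NumPy sketch of eq. (2) for a single fully connected layer; the layer sizes, variable names, and the toy conservation check are illustrative assumptions and not part of the original work.

```python
import numpy as np

def lrp_alpha_beta(x, W, R_out, alpha=1.0, beta=0.0, eps=1e-9):
    """Propagate relevance R_out of a dense layer back to its inputs with LRP-alpha-beta.

    x:     (n_in,)        input activations of the layer
    W:     (n_in, n_out)  weight matrix
    R_out: (n_out,)       relevance of the layer outputs
    """
    z = x[:, None] * W                   # contributions z_ij = x_i * w_ij
    z_pos = np.clip(z, 0, None)          # positive contributions z_ij^+
    z_neg = np.clip(z, None, 0)          # negative contributions z_ij^-
    # normalize each column (output neuron j) and redistribute its relevance
    pos = z_pos / (z_pos.sum(axis=0, keepdims=True) + eps)
    neg = z_neg / (z_neg.sum(axis=0, keepdims=True) - eps)
    return (alpha * pos - beta * neg) @ R_out

# toy check of the conservation rule (1): relevance mass is preserved when alpha - beta = 1
rng = np.random.default_rng(0)
x, W = rng.random(8), rng.normal(size=(8, 4))
R_out = np.maximum(x @ W, 0)             # pretend the layer output is the relevance to distribute
R_in = lrp_alpha_beta(x, W, R_out, alpha=2.0, beta=1.0)
print(R_in.sum(), R_out.sum())           # approximately equal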

1.0.2 Relative Attributing Propagation

RAP [17] decomposes the output predictions of DNNs in terms of the relative influence among neurons, assigning relevant and irrelevant attributions with bipolar importance. By changing the perspective from value to influence, the generated visual explanations show strong objectness and a clear distinction between relevant and irrelevant attributions. The algorithm has three main steps: (i) absolute influence normalization, (ii) deciding the criterion of relevance and propagating it in a backward pass, and (iii) uniform shifting to turn irrelevant neurons negative.

Absolute Influence Normalization is applied only in the first backward propagation to change the perspective on each neuron from its value to its influence. From the output prediction node in the last layer $L$, the relevance is allocated to the penultimate layer according to the actual contributions in the forward pass.

$$R_i^{(L-1)} = \frac{z_{ij}}{\sum_{i'} z_{i'j}} \, f(x) \qquad (3)$$

To adopt the influence perspective, the positive and negative relevance values allocated to the penultimate layer are normalized by the ratio of the sums of their absolute values.

$$\hat{R}_i^{(L-1)} = |R_i^{(L-1)}| \cdot \frac{\sum_{i'} R_{i'}^{(L-1)}}{\sum_{i'} |R_{i'}^{(L-1)}|} \qquad (4)$$

This process allocates the neurons a relative importance to the output prediction, from highly influential to rarely influential. In the next steps, i.e., the attributing procedure from the penultimate layer to the input layer, eqs. (5) and (6) are repeated in each layer, changing neurons with low influence to negative relevance.

$$R_{i \leftarrow j}^{(l,\,l+1)} = \frac{z_{ij}^+}{\sum_{i'} z_{i'j}^+}\, R_j^{(l+1)} + \frac{z_{ij}^-}{\sum_{i'} z_{i'j}^-}\, R_j^{(l+1)} \qquad (5)$$

$$R_i^{(l)} = \sum_j R_{i \leftarrow j}^{(l,\,l+1)} - \frac{\sum_{i'} R_{i'}^{-\,(l)}}{N} \qquad (6)$$

Here, $N$ is the number of activated neurons in each layer, and $R^{-}$ denotes the relevance propagated through the negative weights, i.e., the latter part of eq. (5). This procedure makes it possible to assign relatively irrelevant units negative relevance while emphasizing the important factors as highly positive. RAP also preserves the conservation rule (1).
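The NumPy sketch below mirrors the three steps for a single fully connected layer, following the reconstructed eqs. (4)-(6) above. Since those equations are reconstructed from the surrounding description, the exact weighting in the original RAP rule may differ; variable names and shapes are illustrative assumptions.

```python
import numpy as np

def absolute_influence_normalization(R, eps=1e-9):
    """Step (i): keep each neuron's magnitude of influence while preserving the
    total relevance mass, as in the reconstructed eq. (4)."""
    return np.abs(R) * R.sum() / (np.abs(R).sum() + eps)

def rap_layer_backward(x, W, R_out, eps=1e-9):
    """Steps (ii)-(iii): propagate relevance through positive and negative weights
    separately (eq. (5)), then uniformly shift the activated neurons so that rarely
    influenced units end up with negative relevance (eq. (6))."""
    z = x[:, None] * W                                   # contributions z_ij = x_i * w_ij
    z_pos, z_neg = np.clip(z, 0, None), np.clip(z, None, 0)
    R_pos = (z_pos / (z_pos.sum(axis=0, keepdims=True) + eps)) @ R_out  # positive-weight path
    R_neg = (z_neg / (z_neg.sum(axis=0, keepdims=True) - eps)) @ R_out  # negative-weight path
    activated = x > 0
    N = max(int(activated.sum()), 1)                     # number of activated neurons
    shift = R_neg.sum() / N                              # uniform shifting amount
    return R_pos + R_neg - activated * shift
```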

2 Experimental Evaluation

2.1 Data

2.1.1 NIH ChestX-ray14

NIH ChestX-ray14 Dataset [27] comprises 112,120 X-ray images from 30,805 patients with 14 corresponding disease labels: Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural Thickening, Cardiomegaly, Nodule, Mass, and Hernia. We utilize CheXNet [18], which is based on DenseNet [8] and widely used for radiologist-level chest X-ray diagnosis. This trained model is available online with verified performance. The average Area Under the Receiver Operating Characteristic curve (AUROC) of this model over the 14 classes is 0.843, and the AUROC of the 8 classes annotated with bounding box labels (Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax) is shown in Tab. 1. We utilize the 984 images annotated with bounding boxes to evaluate the visual explanations for the target lesions.

2.1.2 RSNA Pneumonia Detection

The RSNA Pneumonia Detection Challenge dataset [20] is a subset of 30,000 images from the NIH ChestX-ray14 dataset, labeled with two classes: Normal and Pneumonia. The original purpose of this challenge is detecting pneumonia lesions. We exclude the test data (which has no labels) and split the training dataset into train and validation sets in a ratio of 9:1. We trained VGG-16 [24], ResNet-50 [7], and DenseNet-121 networks, which are well established in the machine learning field with impressive performance. While this dataset is designed for localizing lesions, we train classification networks, which is a much easier task than training a detection network. The purpose of interpreting these models is to analyze whether their criterion for classification is fair compared to human intentions. Therefore, we also compare with the assessment of CheXNet under the same experimental setting to verify the analysis. The detailed discussion is in Section 3.
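As a rough illustration of this training setup (not the authors' code), the PyTorch sketch below shows the 9:1 train/validation split and a binary pneumonia classifier built on an ImageNet-pretrained DenseNet-121 backbone; the dataset path, transforms, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import random_split, DataLoader
from torchvision import datasets, models, transforms

# Hypothetical folder layout with one subdirectory per class (Normal / Pneumonia).
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
full = datasets.ImageFolder("rsna_pneumonia/train", transform=tfm)   # placeholder path
n_val = len(full) // 10                                               # 9:1 train/validation split
train_set, val_set = random_split(full, [len(full) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# ImageNet-pretrained backbone with a single-logit head for Normal vs. Pneumonia.
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, 1)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, labels in train_loader:                                   # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images).squeeze(1), labels.float())
    loss.backward()
    optimizer.step()
```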

NIH ChestX-ray14 (CheXNet)
  Atelect.  Cardiom.  Effusion  Infiltr.  Mass   Nodule  Pneumonia  Pneumothorax
  0.829     0.916     0.887     0.714     0.859  0.787   0.774      0.872

RSNA Pneumonia
  DenseNet-121  ResNet-50  VGG-16  CheXNet
  0.858         0.845      0.842   0.827

Table 1: The AUROC performance of each model used in our experiment.

2.1.3 Assessment of Explanation

It is difficult to judge the criterion of a better explanation because each method is designed for slightly different objectives, and there is no single commonly accepted measure for the quality of a visualization. In analyzing radiologist-level chest X-ray images, the interpretation of a diagnosis can be recast as the localization of lesions, which is the crucial evidence for deciding the patient's status. Intersection over Union (IoU) is widely used as an evaluation metric in semantic segmentation and object detection tasks by computing localization scores. The evaluation would be more accurate if the dataset were annotated with segmentation masks, but there is a practical limit to such annotation in the medical domain. Therefore, we utilize IoU to evaluate whether positive attributions are correctly distributed in the area of the lesions (bounding boxes). The result we report is the localization performance for lesions without any supervision from the bounding boxes during the training procedure.
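As a concrete reading of this metric, the sketch below computes the IoU between the binarized positive attributions of a heatmap and a lesion bounding box; the array shapes, box format, and threshold value are illustrative assumptions.

```python
import numpy as np

def heatmap_bbox_iou(heatmap, bbox, threshold=0.1):
    """IoU between the binarized positive attributions and a lesion bounding box.

    heatmap: 2D array of attributions, already normalized to [0, 1]
    bbox:    (x, y, w, h) in pixel coordinates
    """
    pred = heatmap >= threshold                 # pixels counted as "relevant"
    mask = np.zeros_like(pred, dtype=bool)      # bounding-box mask
    x, y, w, h = bbox
    mask[y:y + h, x:x + w] = True
    inter = np.logical_and(pred, mask).sum()
    union = np.logical_or(pred, mask).sum()
    return inter / union if union > 0 else 0.0
```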

T(IoU)  Method  Atelect.  Cardiom.  Effusion  Infiltr.  Mass   Nodule  Pneumonia  Pneumothorax
0.1     CAM     0.243     0.336     0.311     0.309     0.225  0.182   0.295      0.259
        LRP     0.500     0.687     0.503     0.567     0.514  0.435   0.568      0.456
        RAP     0.487     0.754     0.490     0.576     0.496  0.416   0.572      0.447
0.2     CAM     0.304     0.439     0.368     0.369     0.280  0.238   0.355      0.303
        LRP     0.560     0.569     0.543     0.603     0.582  0.506   0.602      0.494
        RAP     0.534     0.703     0.511     0.609     0.540  0.461   0.606      0.471
0.3     CAM     0.353     0.525     0.408     0.413     0.326  0.286   0.403      0.339
        LRP     0.563     0.428     0.526     0.558     0.583  0.545   0.564      0.492
        RAP     0.565     0.622     0.518     0.608     0.571  0.496   0.608      0.479
0.4     CAM     0.396     0.591     0.442     0.451     0.370  0.328   0.445      0.370
        LRP     0.543     0.441     0.502     0.508     0.552  0.561   0.515      0.484
        RAP     0.569     0.544     0.510     0.570     0.573  0.521   0.576      0.480
0.5     CAM     0.437     0.635     0.468     0.483     0.411  0.369   0.479      0.397
        LRP     0.519     0.424     0.485     0.477     0.520  0.551   0.482      0.479
        RAP     0.548     0.480     0.493     0.519     0.547  0.527   0.525      0.478
Table 2: The mean Intersection over Union (IoU) of each method, computed between the bounding box and the heatmaps. The threshold T(IoU) denotes the criterion for ignoring low relevance: pixels with normalized relevance below the threshold are cast to zero. The performance is the localization result without any supervision from the bounding boxes.

2.2 Results

2.2.1 Quantitative Assessment

To validate the efficiency of the attribution methods in visualizing target lesions, we compare them with CAM, which is widely used in the medical field to guarantee reliability. The heatmaps from each method are normalized to [0, 1] before the threshold is applied, and negative attributions are cast to zero for a fair comparison. Tab. 2 shows the mean IoU per class on CheXNet; pixels with relevance values lower than the threshold are cast to zero. As shown in Tab. 2, CAM shows lower IoU performance than LRP and RAP at low thresholds. Since the heatmaps from CAM are generated by resizing low-dimensional feature maps to the original input size, it is hard to visualize delicate interpretations for the target lesions. As the threshold value increases, low attributions that spread widely over irrelevant parts are removed, improving the localization performance of CAM. In contrast, the attribution methods LRP and RAP show a decrease in IoU when the threshold is too high. After the output predictions are fully decomposed and mapped pixel by pixel, the attributions compose detailed visual explanations graded by their degree of importance.
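A minimal sketch of the evaluation procedure described here (zeroing negative attributions, normalizing to [0, 1], thresholding, and averaging IoU over all annotated images, as in Tab. 2) is given below; it is an illustration under the same assumptions as the earlier IoU sketch, not the authors' exact pipeline.

```python
import numpy as np

def mean_iou_at_threshold(heatmaps, bboxes, threshold):
    """Mean IoU of one attribution method at one relevance threshold (cf. Tab. 2).

    heatmaps: list of 2D attribution maps (may contain negative values)
    bboxes:   list of lesion boxes as (x, y, w, h) in pixel coordinates
    """
    scores = []
    for R, (x, y, w, h) in zip(heatmaps, bboxes):
        R = np.maximum(R, 0)                       # cast negative attributions to zero
        R = R / (R.max() + 1e-9)                   # normalize to [0, 1]
        pred = R >= threshold                      # keep only sufficiently relevant pixels
        mask = np.zeros_like(pred, dtype=bool)
        mask[y:y + h, x:x + w] = True              # bounding-box mask
        union = np.logical_or(pred, mask).sum()
        scores.append(np.logical_and(pred, mask).sum() / union if union else 0.0)
    return float(np.mean(scores))
```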

2.2.2 Qualitative Assessment

To qualitatively evaluate the heatmaps from each method, we compare the results by examining how the highly activated points are distributed within the bounding box. As the methods share the same purpose of emphasizing the most important factors, we can assess whether each method is consistent in attributing positive relevance. Fig. 3 presents the heatmaps from each method (CAM, LRP, and RAP) for the diagnostic decisions by CheXNet. We qualitatively assessed all images in the test set of the NIH ChestX-ray14 dataset, and most of them show similarly satisfactory results from a human point of view. More qualitative comparisons are illustrated in the supplementary material.

Figure 3: Qualitative comparison of visual explanations. Each row illustrates an X-ray image, CAM, LRP, and RAP, respectively. The red square indicates the bounding box for each disease.

3 Inconsistency of Intention

It is not trivial to elucidate the decisions of DNNs because of the opacity arising from the myriad of linear and nonlinear operations. Tempted by the impressive performance, it is easy to believe that the criterion for the decisions comes from the same intentions as a human's. [12] points out this problem and insists on the necessity of explanation techniques and their evaluation metrics. Especially in the medical field, identifying the causes of a diagnosis is crucial to ensure reliability.

As described in Section 2.1.2, we trained DNN models on the RSNA Pneumonia Detection dataset for binary classification of pneumonia images. Each model, trained with general learning methods, shows fair performance. Fig. 4 illustrates the visual explanations of what the DNNs mainly focus on. The input X-ray images are correctly classified with the target labels. For the pneumonia X-ray, the relevance from the trained models (VGG, ResNet, DenseNet) is distributed over areas irrelevant to the lesions (bounding boxes) without regular patterns. However, CheXNet, pretrained on the NIH dataset with the specific purpose of classifying various diseases, shows clear visual explanations corresponding to pneumonia. For the normal X-ray image, the relevance from the trained models appears in areas that support the clear shape of normal lungs. Additional normal images also show similar relevance patterns. The interesting phenomenon here is that the DNN models learn the appearance of normal lungs rather than the characteristics of the pneumonia disease. Since we do not provide any supervision of the lesion area, the DNNs focus on the lungs in their normal state, which are large in volume and clearly visible, to pursue higher performance. CheXNet classified this input X-ray as Cardiomegaly, which is not closely related to lung diseases, and its visual explanation clearly supports the diagnostic decision by emphasizing the lesion area of the heart.

Figure 4: Investigating the inconsistency between human intention and what the DNN has learned. Please see Section 3 for details.

4 Conclusion

In this paper, we demonstrate an efficient way to unmask the opacity of DNNs and provide interpretations of diagnostic decisions by utilizing explanation techniques. The introduced methods, LRP and RAP, can visualize more accurate and clearer parts of lesions than the commonly used CAM. The generated heatmaps indicate the important factors for deciding the target diseases, with intensities ranging from highly relevant to barely relevant. We utilize the chest X-ray datasets NIH ChestX-ray14 and RSNA Pneumonia to verify how well attribution methods can localize target lesions without any supervision from bounding boxes. For the quantitative evaluation, we use the mean Intersection over Union for the visualization methods CAM, LRP, and RAP. The results show that fully decomposing the network by investigating the contributions of neurons makes it possible to clearly localize the parts of lesions. Furthermore, we analyze the inconsistency between human intentions and DNNs by utilizing explanation methods, and emphasize the necessity of interpretability for the adoption of machine intelligence in the medical domain.

Figure 5: Additional comparison of visual explanations generated from CheXNet. The first, second, and third rows in each tuple denote the image, LRP, and RAP, respectively. The visualization is shown without applying a threshold.
Figure 6: Additional comparison of visual explanations generated from CheXNet. The first, second, and third rows in each tuple denote the image, LRP, and RAP, respectively. The visualization is shown without applying a threshold.

References