Generation of Multimodal Justification Using Visual Word Constraint Model for Explainable Computer-Aided Diagnosis

06/10/2019 ∙ by Hyebin Lee, et al. ∙ 0

The opacity of the decision-making process has been pointed out as the main obstacle to the practical application of deep learning-based methods, despite their outstanding performance. Interpretability helps establish confidence in a deep learning system and is therefore particularly important in the medical field. In this study, a novel deep network is proposed that explains the diagnostic decision simultaneously with a visual pointing map and a diagnostic sentence justifying the result. To increase the accuracy of sentence generation, a visual word constraint model is devised for training the justification generator. To verify the proposed method, comparative experiments were conducted on the diagnosis of breast masses. Experimental results demonstrated that the proposed deep network could explain the diagnosis more accurately, with varied textual justifications.




1 Introduction

Figure 1: Overall proposed deep network framework for producing textual justification and visual justification.

Thanks to the remarkable achievements of deep learning, there have been various attempts to apply it in research fields such as image recognition [1, 2] and medical image analysis [3, 4]. Computer-aided detection (CADe) and computer-aided diagnosis (CADx) have also shown notable successes with deep learning-based approaches [5, 6]. However, the difficulty of understanding the cause of a decision still remains a dominant limitation for applying deep learning-based methods in the real world. To cope with this problem, several research efforts [7, 8, 9] have been devoted in recent years to developing methods for interpreting the decision of a deep network. The methods in [7, 8] find the specific area of the input image that has the biggest impact on the final result. The multimodal approach in [9] generates an explanation supporting the decision of the deep network in the form of an attentive pointing map and text.

In medical applications such as CADx, a reliable interpretation method is even more important, because the system is mainly used in a high-risk environment directly connected to human health. Several works utilize prescribed annotations or medical reports attached to the medical image as additional information for explaining the decision. The works in [10, 11] introduced a critic network that exploits a pre-defined medical lexicon to elaborate the visual evidence of a diagnosis. The works in [12, 13] proposed networks that generate natural medical reports using various Recurrent Neural Network (RNN) structures and point to the informative area of the input medical image.

However, it is challenging to generate accurate sentences with large variation because of the high complexity of natural language. As addressed in [14], conventional captioning methods suffer from a problem in which the model reproduces sentences identical to those in the training set, even when the model is trained on a large dataset. In other words, the deep network tends to memorize the sentences in the training set, so the generated sentence cannot describe the target image in sufficient detail. This problem becomes more serious in medical research because of the limited amount of medical report data.

In this study, we propose a novel deep network that provides visual and textual justification interpreting the diagnostic decision. The main contributions of this study are summarized as follows:

1) We propose a new justification generator to interpret the diagnostic decision of the deep network. The justification generator is constructed on top of the diagnosis network and provides textual and visual justification for the diagnostic decision. Because it is built on top of the diagnosis network, the proposed method can be applied to any conventional CADx network (a classifier of malignant and benign masses) to interpret its decision without degrading diagnostic performance.

2) To overcome the duplication problem, in which the model generates sentences identical to those in the training set, we devise a new learning method with a visual word constraint loss. To evaluate the proposed method, we collected a sentence dataset describing the characteristics (shape and margin) of breast masses with words corresponding to the Breast Imaging Reporting and Data System (BI-RADS) mass lexicon. Experimental results show that, by guiding the textual justification generator with the visual word constraint model, the proposed method can generate varied textual justifications that are not mere duplicates of training sentences.

The rest of the paper is organized as follows. Section 2 introduces the proposed diagnostic interpreting network for generating visual and textual justification and describes the detailed process of constructing the sentence dataset. Section 3 presents and analyzes the experimental results in terms of visual and textual justifying ability. Finally, Section 4 concludes the paper.

2 Proposed method

Figure 2: Detailed architecture of the justification generator.

Figure 3: Example of composed sentence describing ROI mass image.

Figure 4: Results of the textual and the visual justification of the proposed method. Diagnosis, margin, and shape denote the ground truth. The sentences for textual justification are compared between the proposed method and the method learned without the visual word constraint loss.

2.1 Overall framework

The overall framework of the proposed network is shown in Fig. 1. As shown in the figure, the architecture of the deep network is divided into two parts: a diagnosis network and a justification generator. Any conventional CADx network (a classifier of malignant and benign masses) can be used as the diagnosis network. The justification generator takes a visual feature and the diagnostic decision from the diagnosis network. To train the justification generator effectively while avoiding duplication of training-set sentences, a visual word constraint loss is devised for the training stage. The detailed structure of the justification generator and the learning strategy are described in the following subsections.

2.2 Justification generator

As shown in Fig. 2, in order to explain the diagnostic decision, the justification generator makes a textual justification and a visual justification from the predicted diagnosis and the visual feature. The visual feature is defined as an intermediate feature map in the diagnosis network. From a given image $I$, the visual feature $v$ is extracted by the visual feature encoder $f_v$ as

$$v = f_v(I). \tag{1}$$

The diagnosis predictor $f_d$ makes a diagnostic decision $d$ from the visual feature by

$$d = [p_b, p_m] = f_d(v), \tag{2}$$

where $p_b$ and $p_m$ denote the probability of benign and the probability of malignant, respectively. The diagnostic decision $d$ is embedded into the channel-wise attention weight $a$ as follows:

$$a = f_e(d; \theta_e), \tag{3}$$

where $f_e$ denotes a function with learnable parameter $\theta_e$ for embedding the diagnostic decision. This embedded vector refines the visual feature with channel attention as

$$v_d^k = a^k \, v^k, \tag{4}$$

where $v_d$ denotes the diagnosis embedded feature, $v^k$ is the k-th channel of $v$, and $a^k$ is the k-th element of $a$. From the diagnosis embedded feature, the visual justification $V$ is generated as follows:

$$M = f_s(v_d; \theta_s), \tag{5}$$

$$V = \mathrm{softmax}(M), \tag{6}$$

where $M$ is the 2D map obtained by a function $f_s$ with learnable parameter $\theta_s$. The softmax operation in Eq. (6) was applied to represent more focused areas and suppress the activation on the background. To obtain the textual justification, a text generating feature $g$ is encoded from the diagnosis embedded feature $v_d$. The text generating feature is used as the input of the textual justification generator, which is designed with Long Short-Term Memory (LSTM) networks. The text generating feature $g$ is obtained by

$$g = f_t(\hat{v}_d; \theta_t), \tag{7}$$

where $\hat{v}_d$ denotes the diagnosis embedded feature refined by the spatial attention $V$ and the channel-wise attention $a$, and $f_t$ is a function with learnable parameter $\theta_t$ for encoding the text generating feature. $f_s$ and $f_t$ are implemented by multiple convolutional layers. Finally, the textual justification is generated by a stacked LSTM network with two hidden layers as

$$\hat{w}_t = f_w(h_t; W_w, b_w), \tag{8}$$

where $\hat{w}_t$ denotes the t-th word, obtained by converting the t-th hidden state $h_t$ of the LSTM using a function $f_w$ with learnable parameters $W_w$ and $b_w$.
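The channel-wise attention and the spatial softmax described in this subsection can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the feature sizes and attention weights are toy values, and `channel_sum` is a hypothetical stand-in for the learned convolutional function that produces the 2D map.

```python
import math

def channel_attention(feature, weights):
    """Scale each channel (a 2D list) of `feature` by its attention weight."""
    return [[[w * x for x in row] for row in channel]
            for channel, w in zip(feature, weights)]

def spatial_softmax(score_map):
    """Softmax over all positions of a 2D map, so the entries sum to 1."""
    peak = max(max(row) for row in score_map)  # subtract the max for stability
    exps = [[math.exp(x - peak) for x in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def channel_sum(feature):
    """Hypothetical stand-in for the learned map function: sum over channels."""
    height, width = len(feature[0]), len(feature[0][0])
    return [[sum(ch[i][j] for ch in feature) for j in range(width)]
            for i in range(height)]

# Toy visual feature with 2 channels of size 2x2, and toy attention weights.
feature = [[[1.0, 0.0], [0.0, 0.0]],
           [[0.0, 0.0], [0.0, 3.0]]]
weights = [0.5, 1.0]

embedded = channel_attention(feature, weights)
pointing_map = spatial_softmax(channel_sum(embedded))
```

Because the softmax normalizes over every spatial position at once, a single strongly activated location takes most of the probability mass, which is the focusing effect the text attributes to this operation.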

2.3 Network training using visual word constraint

In the training stage, the textual difference loss $L_{text}$ is defined between the generated words and the ground-truth textual justification $\{w_t\}_{t=1}^{T}$, where $T$ denotes the number of words in the ground-truth sentence. To overcome the aforementioned duplication problem in textual justification generation, we devise a visual word constraint model. The visual word constraint model is designed as a sentence classifier [15] which predicts the margin and the shape from a given sentence $S$:

$$(\hat{m}, \hat{s}) = f_c(S),$$

where $\hat{m}$ and $\hat{s}$ are the predicted margin and the predicted shape, respectively, and $f_c$ denotes the function for predicting the margin and the shape. The visual word constraint model is pre-trained on the training set and used to guide the textual justification generator with a visual word constraint loss $L_{vwc}$. The visual word constraint loss is defined between the predictions $(\hat{m}, \hat{s})$ and $(m, s)$, the ground truth of margin and shape. As a result, the overall network is trained by minimizing the following loss function:

$$L = L_{text} + \lambda L_{vwc},$$

where $\lambda$ is a balancing hyper-parameter. By introducing the visual word constraint loss, the textual justification can contain a greater variety of words. The proposed model can capture the similarity in meaning among words describing the same margin or shape, even without an additional large-scale word embedding into a vector space.
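How the two training losses might be combined can be sketched as follows. The exact loss forms are not recoverable from the text here, so this sketch assumes cross-entropy for both terms and assumes that the balancing hyper-parameter weights the visual word constraint term; all function names are hypothetical.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under `probs`."""
    return -math.log(probs[target_index])

def textual_loss(word_probs, target_words):
    """Assumed form: mean cross-entropy over the words of the reference sentence."""
    losses = [cross_entropy(p, t) for p, t in zip(word_probs, target_words)]
    return sum(losses) / len(losses)

def constraint_loss(margin_probs, shape_probs, margin_gt, shape_gt):
    """Assumed form: cross-entropy on the margin plus cross-entropy on the shape."""
    return cross_entropy(margin_probs, margin_gt) + cross_entropy(shape_probs, shape_gt)

def total_loss(l_text, l_constraint, balance=2.0):
    """Textual loss plus the balanced visual word constraint loss (assumed weighting)."""
    return l_text + balance * l_constraint

# Toy example: a two-word sentence over a two-word vocabulary.
l_text = textual_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
l_vwc = constraint_loss([1.0, 0.0], [0.5, 0.5], margin_gt=0, shape_gt=1)
loss = total_loss(l_text, l_vwc)
```

In this toy case the margin is classified perfectly, so the constraint term comes only from the uncertain shape prediction; a generated sentence whose visual words imply the wrong margin or shape would raise the total loss even if it matched the reference wording closely.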

3 Experiments

3.1 Experimental condition

In the experiments, we used two mammogram datasets. The first was the public Digital Database for Screening Mammography (DDSM) dataset [16], in which the BI-RADS descriptions and the locations of masses were annotated by radiologists [16]. The dataset (605 masses) was split into a training set (484 masses) and a test set (121 masses). The second was a Full-Field Digital Mammogram (FFDM) dataset from a hospital. A total of 147 masses from 67 patients were collected, and two-fold cross-validation was conducted. The deep network learned from the DDSM dataset was used as the initial network for training on the FFDM dataset. Sentence datasets were collected for both the DDSM dataset and the FFDM dataset. Before composing sentences, we investigated the words and phrases used to describe the BI-RADS mass lexicons (margin and shape) in the medical literature [17, 18, 19, 20, 21, 22, 23, 24], together with their synonyms; we call these visual words. The visual words of each lexicon included 5-12 words or phrases. Three sentences were annotated for each ROI mass image. As shown in Fig. 3, each sentence contained at least one visual word for the mass margin and at least one for the mass shape. Following [9], every sentence included at least 10 words and did not contain the BI-RADS mass lexicon terms verbatim. In addition, the sentences contained individual details.
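The annotation rules above (at least 10 words, at least one visual word each for margin and shape, no verbatim BI-RADS term) can be expressed as a small checker. This is only a sketch: the visual word lists and the lexicon subset below are hypothetical placeholders, since the actual lists collected in the study are not reproduced here.

```python
# Hypothetical visual words; the actual lists used in the study are not given here.
VISUAL_WORDS = {
    "margin": ["sharply defined", "well-defined", "fuzzy", "radiating lines"],
    "shape": ["circular", "egg-shaped", "uneven"],
}
# Illustrative subset of BI-RADS lexicon terms that must not appear verbatim.
LEXICON_TERMS = ["circumscribed", "spiculated", "round", "oval", "irregular"]

def is_valid_sentence(sentence, min_words=10):
    """Check one annotated sentence against the three annotation rules."""
    text = sentence.lower()
    tokens = text.replace(",", " ").replace(".", " ").split()
    if len(tokens) < min_words:
        return False  # too short
    if any(term in tokens for term in LEXICON_TERMS):
        return False  # uses a lexicon term as it is
    # Require at least one visual word for margin and one for shape.
    return all(any(word in text for word in words)
               for words in VISUAL_WORDS.values())

ok = is_valid_sentence(
    "The mass appears egg-shaped with a sharply defined border "
    "against the surrounding tissue.")
too_short = is_valid_sentence("An egg-shaped mass with a sharply defined border.")
```

Lexicon terms are matched on whole tokens so that, for example, "surrounding" does not trigger the rule for "round", while multi-word visual phrases are matched as substrings of the sentence.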

To increase the number of training samples, data augmentation was conducted. Patches of two sizes were cropped from the original ROI image at five locations (top left, top right, center, bottom left, and bottom right). Each cropped image was also flipped and rotated (0°, 90°, 180°, and 270°). The mini-batch size was set to 64, and the Adam optimizer [25] was used with a learning rate of 0.0005. The balancing parameter was empirically set to 2.
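The augmentation scheme can be sketched on a toy 2D list standing in for an ROI image. The patch sizes are hypothetical, and the sketch assumes that both the original and a horizontally flipped copy of each crop are kept before rotation; the text does not specify either detail.

```python
def crop(image, top, left, size):
    """Cut a size x size patch whose top-left corner is at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]

def five_crops(image, size):
    """Crop at top left, top right, center, bottom left, and bottom right."""
    height, width = len(image), len(image[0])
    bottom, right = height - size, width - size
    corners = [(0, 0), (0, right), (bottom // 2, right // 2),
               (bottom, 0), (bottom, right)]
    return [crop(image, top, left, size) for top, left in corners]

def rotate90(image):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip(image):
    """Mirror a 2D list horizontally."""
    return [row[::-1] for row in image]

def augment(image, sizes):
    """Return every crop, with and without flipping, at four rotations."""
    samples = []
    for size in sizes:
        for patch in five_crops(image, size):
            for variant in (patch, flip(patch)):
                for _ in range(4):  # 0, 90, 180, and 270 degrees
                    samples.append(variant)
                    variant = rotate90(variant)
    return samples

roi = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 "image"
augmented = augment(roi, sizes=(6, 7))  # hypothetical patch sizes
```

Under these assumptions each ROI image yields 2 sizes x 5 locations x 2 flip states x 4 rotations = 80 training samples.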

For the diagnosis network at the front of the proposed network, we used a VGG16-based [26] binary classifier. The initial weights were pre-trained on ImageNet [27] and then fine-tuned. The feature map after conv5_3 in the VGG16 network was used as the visual feature. The area under the ROC curve (AUC) was calculated to evaluate the diagnostic performance, and an AUC of 0.918 was obtained on the DDSM dataset. During training of the justification generator, the parameters of the diagnosis network were fixed.
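For reference, the AUC can be computed directly from its pairwise interpretation: the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. The sketch below uses toy scores and labels, not data from the study, and the function name is hypothetical.

```python
def auc(scores, labels):
    """Pairwise AUC: fraction of (malignant, benign) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Toy malignancy scores with labels (1 = malignant, 0 = benign).
toy_auc = auc([0.9, 0.4, 0.5, 0.8, 0.2], [1, 1, 0, 1, 0])
```

The quadratic loop is adequate for test sets of this scale (121 masses); a rank-based formulation would be preferable for much larger sets.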

3.2 Results

To validate the effect of our model, we compared the proposed method with the method learned without the visual word constraint loss. Fig. 4 shows examples of the visual justification and the textual justification produced by the proposed method. As shown in the figure, the proposed method could provide both textual and visual justification for the diagnostic decision. The sentences generated by the method learned without the visual word constraint loss are also shown for comparison. In the training phase, the visual word constraint loss enabled the textual justification generator to match the margin and shape labels of the generated textual justification with those of the input ROI mass image. Therefore, the textual justification generated by the proposed method was more accurate than that generated without the visual word constraint loss.

To quantitatively evaluate the quality of the textual justification, we adopted the BLEU [28], ROUGE-L [29], and CIDEr [30] metrics, which measure the similarity between the generated sentence and the reference (ground truth) sentence. Table 1 shows the evaluation results for the textual justification on the DDSM dataset in terms of BLEU, ROUGE-L, and CIDEr. As shown in the table, when the model was trained with the visual word constraint loss, the generated textual justifications were closer to the reference sentences composed by humans. Furthermore, following the evaluations in [31], the ratio of unique sentences and the ratio of novel sentences on the DDSM dataset are reported in Table 2. A unique sentence was defined as a sentence that was not repeated among all generated sentences, and a novel sentence was defined as a sentence that was unseen in the training set. These two metrics were calculated to evaluate the textual justification with respect to the duplication problem. If duplication occurs, the textual justification cannot accurately describe the given test image. The ratios of novel and unique sentences therefore measure how reliably the textual justification is generated according to the given image. As shown in the table, the ratio of novel sentences increased dramatically with the proposed method (from 4.13% to 43.80%). The ratio of unique sentences also increased compared with the method learned without the visual word constraint loss (from 64.46% to 93.39%). We conducted the same evaluation on the FFDM dataset. As shown in Table 3 and Table 4, the proposed method achieved higher scores, indicating that more accurate and diverse textual justifications were generated.
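The two duplication metrics follow directly from their definitions and can be sketched as below. The sketch assumes exact string matching between sentences and counts a generated sentence as unique only if it occurs exactly once among all generated sentences; the function names and toy data are hypothetical.

```python
from collections import Counter

def unique_ratio(generated):
    """Fraction of generated sentences that are not repeated among the outputs."""
    counts = Counter(generated)
    return sum(1 for s in generated if counts[s] == 1) / len(generated)

def novel_ratio(generated, training):
    """Fraction of generated sentences that never appear in the training set."""
    seen = set(training)
    return sum(1 for s in generated if s not in seen) / len(generated)

# Toy data: four generated sentences and a two-sentence training set.
generated = ["a", "b", "b", "c"]
training = ["a", "x"]
u = unique_ratio(generated)           # "a" and "c" occur exactly once
n = novel_ratio(generated, training)  # "b", "b", and "c" are unseen in training
```

A model that simply memorizes training sentences scores near zero on the novel ratio, and a model that emits the same template for every image scores near zero on the unique ratio, which is why the two are reported together.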

Table 1: Evaluation of textual justification on the DDSM dataset.
Metric     Proposed method    Without visual word constraint loss
BLEU-1     0.3870             0.3687
BLEU-2     0.1968             0.1742
BLEU-3     0.1026             0.0887
BLEU-4     0.0586             0.0490
ROUGE-L    0.2526             0.2439
CIDEr      0.1514             0.1469

Table 2: Ratios of the unique sentences and the novel sentences on the DDSM dataset.
Metric                       Proposed method    Without visual word constraint loss
Ratio of unique sentences    93.39%             64.46%
Ratio of novel sentences     43.80%             4.13%

Table 3: Evaluation of textual justification on the FFDM dataset.
Metric     Proposed method    Without visual word constraint loss
BLEU-1     0.4070             0.3835
BLEU-2     0.2296             0.2133
BLEU-3     0.1354             0.1187
BLEU-4     0.0871             0.0650
ROUGE-L    0.2650             0.2596
CIDEr      0.1366             0.1185

Table 4: Ratios of the unique sentences and the novel sentences on the FFDM dataset.
Metric                       Proposed method    Without visual word constraint loss
Ratio of unique sentences    54.42%             11.56%
Ratio of novel sentences     65.99%             8.16%

4 Conclusion

In this paper, we proposed a novel deep network that provides multimodal justification for the diagnostic decision. The proposed method can explain the reason for the diagnostic decision with a sentence and indicate the important areas on the image. In textual justification generation for medical purposes, the network tends to generate templated results because of the limited number of medical reports. To overcome this problem, a learning method utilizing a visual word constraint loss was devised. Comparative experiments verified the effectiveness of the proposed method, which generated more diverse and accurate textual justifications. These results imply that the proposed method can explain the diagnostic decision of the deep network more persuasively.


  • [1] A. Krizhevsky, I. Sutskever, and G.E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR. IEEE, 2016, pp. 770–778.
  • [3] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
  • [4] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P. Jodoin, and H. Larochelle, “Brain tumor segmentation with deep neural networks,” Medical image analysis, vol. 35, pp. 18–31, 2017.
  • [5] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al., “Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning,” arXiv preprint arXiv:1711.05225, 2017.
  • [6] S.T. Kim, H. Lee, H.G. Kim, and Y.M. Ro, “Icadx: interpretable computer aided diagnosis of breast masses,” in Medical Imaging. International Society for Optics and Photonics, 2018, vol. 10575, p. 1057522.
  • [7] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV. IEEE, 2017, pp. 618–626.
  • [8] R.C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in ICCV. IEEE, 2017.
  • [9] D. Huk Park, L. Anne Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach, “Multimodal explanations: Justifying decisions and pointing to the evidence,” in CVPR. IEEE, 2018, pp. 8779–8788.
  • [10] S.T. Kim, J.H. Lee, H. Lee, and Y.M. Ro, “Visually interpretable deep network for diagnosis of breast masses on mammograms,” Physics in Medicine & Biology, vol. 63, no. 23, pp. 235025, 2018.
  • [11] S.T. Kim, J.H. Lee, and Y.M. Ro, “Visual evidence for interpreting diagnostic decision of deep neural network in computer-aided diagnosis,” in Medical Imaging 2019: Computer-Aided Diagnosis. International Society for Optics and Photonics, 2019, vol. 10950, p. 109500K.
  • [12] X. Wang, Y. Peng, L. Lu, Z. Lu, and R.M. Summers, “Tienet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays,” in CVPR. IEEE, 2018, pp. 9049–9058.
  • [13] Z. Zhang, Y. Xie, F. Xing, M. McGough, and L. Yang, “Mdnet: A semantically and visually interpretable medical image diagnosis network,” in CVPR. IEEE, 2017, pp. 6428–6436.
  • [14] X. Liu, H. Li, J. Shao, D. Chen, and X. Wang, “Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data,” in ECCV, 2018.
  • [15] Y. Kim, “Convolutional neural networks for sentence classification,” in Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746–1751.
  • [16] M. Heath, K. Bowyer, D. Kopans, R. Moore, and W.P. Kegelmeyer, “The digital database for screening mammography,” in International Workshop on Digital Mammography. Medical Physics Publishing, 2000, pp. 212–218.
  • [17] American College of Radiology, Breast Imaging Reporting and Data System® (BI-RADS®), American College of Radiology, Reston, Va, 4th edition, 2003.
  • [18] J. Lee, “Practical and illustrated summary of updated bi-rads for ultrasonography,” Ultrasonography, vol. 36, no. 1, pp. 71, 2017.
  • [19] W.K. Moon, C.M. Lo, J.M. Chang, C.S. Huang, J.H. Chen, and R.F. Chang, “Quantitative ultrasound analysis for classification of bi-rads category 3 breast masses,” Journal of digital imaging, vol. 26, no. 6, pp. 1091–1098, 2013.
  • [20] I. Thomassin-Naggara, A. Tardivon, and J. Chopier, “Standardized diagnosis and reporting of breast cancer,” Diagnostic and interventional imaging, vol. 95, no. 7-8, pp. 759–766, 2014.
  • [21] R. Selvi, Breast Diseases Imaging and Clinical Management, Springer India, 2015.
  • [22] K.A. Lee, N. Talati, R. Oudsema, S. Steinberger, and L.R. Margolies, “Bi-rads 3: Current and future use of probably benign,” Current radiology reports, vol. 6, no. 2, pp. 5, 2018.
  • [23] H. Berment, V. Becette, M. Mohallem, F. Ferreira, and P. Chérel, “Masses in mammography: What are the underlying anatomopathological lesions?,” Diagnostic and interventional imaging, vol. 95, no. 2, pp. 124–133, 2014.
  • [24] B. Surendiran and A. Vadivel, “Mammogram mass classification using various geometric shape and margin features for early detection of breast cancer,” International Journal of Medical Engineering and Informatics, vol. 4, no. 1, pp. 36–54, 2012.
  • [25] D.P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 2015.
  • [26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [27] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR. IEEE, 2009, pp. 248–255.
  • [28] K. Papineni, S. Roukos, T. Ward, and W.J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  • [29] C.Y. Lin, “Rouge: A package for automatic evaluation of summaries,” Text Summarization Branches Out, 2004.
  • [30] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in CVPR. IEEE, 2015, pp. 4566–4575.
  • [31] Y. Wang, Z. Lin, X. Shen, S. Cohen, and G.W. Cottrell, “Skeleton key: Image captioning by skeleton-attribute decomposition,” in CVPR. IEEE, 2017.