X-ray is a widely used medical imaging technique in clinics for diagnosis and treatment of thoracic diseases. Medical image interpretation, including both disease annotation and report writing, is a laborious routine for radiologists. Moreover, the quality of interpretation often varies with the experience, expertise and workload of the radiologists. To relieve radiologists of their excessive workload and to better control the quality of the written reports, it is desirable to implement a medical image interpretation system that automates the visual perception and cognition process and generates draft reports for radiologists to review, revise and finalize.
Despite rapid and significant development, existing natural image captioning models, e.g. [7, 18], fail to perform satisfactorily on medical report generation. The major challenge lies in the limited number of image-report pairs and the relative scarcity of abnormal pairs for model training, which are essential for quality radiology report generation. An additional challenge is the lack of appropriate performance evaluation metrics; the n-gram based BLEU scores widely used in natural language processing (NLP) are not suitable for assessing the quality of generated reports.
Nevertheless, several approaches have been developed to generate reports automatically for chest X-rays using the CNN-RNN architecture developed in natural image captioning research [6, 8, 17, 19] (Fig. 1a). Since a medical report typically consists of a sequence of sentences, Jing et al. [6] use a hierarchical LSTM to generate paragraphs and achieve impressive results on the Indiana University (IU) X-ray dataset [2]. Instead of only using visual features extracted from the image, they first predict the Medical Text Indexer (MTI) annotated tags, and then combine semantic features from the tags with visual features from the images for report generation. Similarly, Xue et al. [19] use both visual and semantic features but generate the ‘impression’ and ‘findings’ of the report separately. The former, a one-sentence summary, is generated from a CNN encoder, whereas the latter, a paragraph, is generated using visual and semantic features. Different from [6], the semantic feature is extracted by embedding the last generated sentence as opposed to the annotated tags. Li et al. [8] use a hierarchical decision-making procedure to determine whether to retrieve a template sentence from an existing template corpus or to invoke the lower-level decision to generate a new sentence from scratch. The decision priority is updated via reinforcement learning based on sentence-level and word-level rewards or punishments. However, none of these methods demonstrates satisfactory performance in disease localization and classification, which is a central issue in medical image interpretation.
Wang et al. [17] address both disease classification and medical image report generation in the same model. They introduce a novel Text-Image Embedding network (TieNet), which integrates a self-attention LSTM using textual report data and a visual attention CNN using image data. TieNet is capable of extracting an informative embedding to represent the paired medical image and report, which significantly improves disease classification performance over earlier work. However, TieNet’s performance on medical report generation improves only marginally over the baseline approach, trading medical report generation performance for disease classification performance. Moreover, TieNet does not provide visual support for radiologists to review and revise the automatically generated report.
We present an automatic medical image interpretation system with in situ visual support, striving for better performance in both image annotation and report generation (Fig. 1b). To our knowledge, this is among the first attempts to exploit disease localization for X-ray image report generation with visual support. Our contributions are four-fold: (1) we describe an integrated image interpretation framework for disease annotation and medical report generation, (2) we transfer knowledge from large image data sets (ChestX-ray8 [16] and ImageNet) to enhance medical image interpretation using a small number of reports for training (IU X-ray), (3) we evaluate the suitability of the NLP evaluation metrics for medical report generation, and (4) we demonstrate the functionality of localizing the key finding in an X-ray with a heatmap.
Our workflow (Fig. 2) first annotates an X-ray image by classifying and localizing thoracic diseases (Fig. 2a) and then generates the corresponding sentences to build up the entire report (Fig. 2b). Fig. 2c displays the structure of attentive LSTM used to generate reports.
2.1 Disease Classification and Localization
Fig. 2a shows our classification module built on a Dense Convolutional Network (DenseNet) [5]. As in prior work, we replace the last fully-connected layer with a new layer of dimension $M$, where $M$ is the number of diseases. This is a multiple binary classification problem: the input is a frontal view X-ray image $X$ and the output is a binary vector $y = [y_1, \ldots, y_m, \ldots, y_M]$, $y_m \in \{0, 1\}$, indicating the absence or presence of disease $m$. The binary cross-entropy loss function is defined as

$$L(X, y) = -\sum_{m=1}^{M} \left[ y_m \log p_m + (1 - y_m) \log (1 - p_m) \right],$$

where $p_m$ is the predicted probability for target disease $m$. If $p_m$ exceeds the classification threshold, the X-ray is annotated with disease $m$ for the next level of modeling. Otherwise, it is considered “Normal”. It is worth mentioning that the vast majority of X-rays are considered “Normal”; therefore, other choices of threshold also work well with our system.
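The classification objective and the thresholded annotation described above can be illustrated with a minimal plain-Python sketch. It is not the system's implementation: the function names, the three example diseases and the threshold value of 0.5 are assumptions chosen only for illustration.

```python
import math


def bce_loss(probs, labels):
    """Sum of per-disease binary cross-entropy terms for one X-ray."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def annotate(probs, disease_names, threshold=0.5):
    """Diseases whose probability exceeds the threshold, else ["Normal"].

    The default threshold of 0.5 is an assumption of this sketch.
    """
    found = [name for name, p in zip(disease_names, probs) if p > threshold]
    return found if found else ["Normal"]


# Hypothetical predictions for three diseases.
names = ["Cardiomegaly", "Effusion", "Pneumothorax"]
probs = [0.91, 0.12, 0.03]
labels = [1, 0, 0]

print(bce_loss(probs, labels))
print(annotate(probs, names))
print(annotate([0.2, 0.1, 0.05], names))
```

An image for which no probability exceeds the threshold falls through to the “Normal” branch, which is the common case in this setting.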
We apply Grad-CAM [14] to localize the disease with a heatmap. Grad-CAM uses the gradient information flowing back to the final convolutional layer to decipher the importance of each neuron in classifying an image as disease $m$. Formally, let $A^k$ be the $k$th feature map and let the weight $w_k^m$ represent the importance of feature map $k$ for disease $m$. We first calculate the gradient of the score for class $m$, $s^m$ (before the sigmoid), with respect to a feature map $A^k$, i.e., $\partial s^m / \partial A^k$. The weights $w_k^m$ are then calculated by

$$w_k^m = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial s^m}{\partial A^k_{ij}},$$

where $(i, j)$ represents the coordinates of a pixel and $Z$ is the total number of pixels. We then generate a heatmap for disease $m$ by applying a weighted combination of the feature maps $A^k$, followed by a ReLU activation:

$$H^m = \mathrm{ReLU}\Big( \sum_{k} w_k^m A^k \Big).$$

The localized semantic features used to predict disease $m$ are identified and visualized with the heatmap $H^m$. We then apply a thresholding-based bounding box (B-Box) generation method. The B-Box bounds pixels whose heatmap intensity is above 90% of the maximum intensity. The resulting region of interest is then cropped for the next level of modeling.
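The localization steps above (gradient-averaged weights, a ReLU-rectified weighted combination of feature maps, and a box around pixels above 90% of the peak intensity) can be sketched in plain Python. This is only an illustration under simplifying assumptions: the gradients are passed in as ready-made arrays rather than obtained by backpropagation, and the function names and toy 3 x 3 maps are hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Heatmap ReLU(sum_k w_k * A_k) from K feature maps and their gradients.

    Both arguments are lists of K maps, each an H x W nested list.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    n_pixels = height * width
    # Importance of each map: spatial mean of the class-score gradient.
    weights = [sum(sum(row) for row in grad) / n_pixels for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class.
    return [[max(value, 0.0) for value in row] for row in heatmap]


def bounding_box(heatmap, fraction=0.9):
    """Smallest (top, left, bottom, right) box around pixels whose intensity
    is above `fraction` of the maximum; None if the heatmap is all zero."""
    peak = max(max(row) for row in heatmap)
    hits = [(i, j) for i, row in enumerate(heatmap)
            for j, value in enumerate(row) if value > fraction * peak]
    if not hits:
        return None
    rows = [i for i, _ in hits]
    cols = [j for _, j in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy example: two 3 x 3 feature maps with uniform gradients.
maps = [
    [[0.0, 1.0, 0.0], [0.0, 4.8, 5.0], [0.0, 0.0, 0.0]],
    [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
grads = [
    [[1.0] * 3 for _ in range(3)],   # positive evidence for the class
    [[-1.0] * 3 for _ in range(3)],  # negative evidence, removed by ReLU
]
heat = grad_cam(maps, grads)
print(heat)
print(bounding_box(heat))
```

In the toy example the second map carries a negative weight, so its activation is suppressed by the ReLU and does not influence the box.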
2.2 Attention-based Report Generation
Fig. 2b illustrates the process of report generation. If there is no active thoracic disease found in an X-ray, a report will be directly generated by an attentive LSTM based on the original X-ray as shown in the green dashed box. Otherwise (as shown in the red dashed box), the cropped subimage with localized disease from the classification module (Fig. 2a) is used to generate description of abnormalities whereas the original X-ray is used to generate description of normalities in the report.
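The two branches of this procedure can be summarized with a short control-flow sketch. The callables `crop`, `describe_normal` and `describe_abnormal` are hypothetical stand-ins for the localization module and the trained attentive LSTMs, and the order in which abnormality and normality sentences are joined is an assumption of the sketch.

```python
def generate_report(xray, annotations, crop, describe_normal, describe_abnormal):
    """Compose a report from the two branches described above.

    crop, describe_normal and describe_abnormal are hypothetical callables
    standing in for the localization module and the trained attentive LSTMs.
    """
    if annotations == ["Normal"]:
        # No active disease: a single pass over the original X-ray.
        return describe_normal(xray)
    # Abnormality sentences come from the cropped region of each disease,
    # normality sentences from the original X-ray (ordering is assumed).
    parts = [describe_abnormal(crop(xray, disease), disease)
             for disease in annotations]
    parts.append(describe_normal(xray))
    return " ".join(parts)


# Toy stand-ins so the control flow can be exercised without any model.
def crop(xray, disease):
    return f"{xray}[{disease}]"


def describe_normal(image):
    return f"normal findings from {image}."


def describe_abnormal(image, disease):
    return f"{disease} seen in {image}."


normal_report = generate_report(
    "xray", ["Normal"], crop, describe_normal, describe_abnormal)
abnormal_report = generate_report(
    "xray", ["Cardiomegaly"], crop, describe_normal, describe_abnormal)
print(normal_report)
print(abnormal_report)
```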
As shown in Fig. 2c, the attentive LSTM is based on an encoder-decoder structure, which takes either the original X-ray image or the cropped subimage corresponding to the abnormal region as the input and generates a sequence of sentences for the entire report. Our encoder is built on a pre-trained ResNet-101 [4], which extracts the visual feature matrix $F \in \mathbb{R}^{D \times K}$ (reshaped from a three-dimensional feature tensor) from the last convolutional layer followed by an adaptive average pooling layer. Each vector $F_k \in \mathbb{R}^{D}$ of $F$ represents one regional feature vector, where $k = 1, \ldots, K$.
The LSTM decoder takes $F$ as input and generates sentences by producing a word at each time $t$. To utilize the spatial visual attention information, we define the weights $\alpha_{tk}$, which can be interpreted as the relative importance of region feature $F_k$ at time $t$. The weights $\alpha_{tk}$ are computed by a multilayer perceptron: $e_{tk} = \mathrm{MLP}(F_k, h_{t-1})$ and $\alpha_{tk} = \mathrm{softmax}(e_{tk})$, and hence the attentive visual feature vector is computed by $V_t = \sum_{k=1}^{K} \alpha_{tk} F_k$. In addition to the weighted visual feature $V_t$ and the last hidden layer $h_{t-1}$, the RNN also accepts the last output word at each time step as an input. We concatenate the embedding $E_{t-1}$ of the last output word and the visual feature $V_t$ as the context vector $c_t = [E_{t-1}; V_t]$. Thus the transition to the current hidden layer $h_t$ can be calculated as $h_t = \mathrm{LSTM}(c_t, h_{t-1})$. After model training, a report is generated by sampling words and updating the hidden layer until hitting the stop token.
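A single attention step of the decoder can be illustrated as follows. The sketch is not the trained model: a plain dot product stands in for the multilayer perceptron that scores each region, the LSTM update itself is omitted, and all names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions, hidden, score_fn):
    """Attention weights over regions and the resulting weighted feature.

    regions: list of K regional feature vectors; hidden: previous hidden state.
    """
    weights = softmax([score_fn(feature, hidden) for feature in regions])
    dim = len(regions[0])
    attended = [sum(w * f[d] for w, f in zip(weights, regions))
                for d in range(dim)]
    return weights, attended


def dot_score(feature, hidden):
    # Stand-in for the multilayer perceptron scoring function (assumption).
    return sum(a * b for a, b in zip(feature, hidden))


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 toy region features
hidden = [2.0, 0.0]                             # toy previous hidden state
weights, attended = attend(regions, hidden, dot_score)

# Context vector: embedding of the last word concatenated with the
# attended visual feature; the embedding values are made up.
word_embedding = [0.1, 0.2, 0.3]
context = word_embedding + attended
print(weights)
print(context)
```

Regions that align with the hidden state receive larger weights, so the attended feature is dominated by the first and third regions in this toy case.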
3 Experiments and Results
Datasets. We use the IU Chest X-ray Collection [2], an open image dataset with radiology reports paired with chest X-rays, for our experimental evaluation. Each report contains three sections: impression, findings and Medical Subject Headings (MeSH) terms. Similar to [6, 19], we generate sentences in ‘impression’ and ‘findings’ together. The MeSH terms are used as labels for disease classification as well as the follow-up report generation with abnormality and normality descriptions. We convert all the words to lower-case, remove all non-alphanumeric tokens, replace single-occurrence tokens with a special token and use another special token to separate sentences. We filter out images and reports that are not relevant to the eight common thoracic diseases included in both the ChestX-ray8 [16] and IU X-ray [2] datasets, resulting in a dataset of paired X-ray images and reports. Finally, we split all the image-report pairs into training, validation and testing sets.
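The text preprocessing described above can be sketched as follows. The special token strings `<unk>` and `<sep>`, the splitting of sentences on full stops and the two toy reports are assumptions of this sketch, not details of the actual pipeline.

```python
import re
from collections import Counter

UNK, SEP = "<unk>", "<sep>"  # hypothetical special tokens


def tokenize(report):
    """Lower-case a report and keep only alphanumeric tokens per sentence.

    Splitting on "." is a simplification and would mishandle decimals.
    """
    sentences = []
    for sentence in report.lower().split("."):
        words = re.findall(r"[a-z0-9]+", sentence)
        if words:
            sentences.append(words)
    return sentences


def encode(reports):
    """Replace tokens seen once in the corpus by UNK and mark sentence ends."""
    tokenized = [tokenize(report) for report in reports]
    counts = Counter(w for rep in tokenized for sent in rep for w in sent)
    encoded = []
    for rep in tokenized:
        tokens = []
        for sent in rep:
            tokens.extend(w if counts[w] > 1 else UNK for w in sent)
            tokens.append(SEP)
        encoded.append(tokens)
    return encoded


reports = [
    "The heart is enlarged. No pneumothorax.",
    "The heart is normal. No effusion.",
]
encoded = encode(reports)
print(encoded[0])
```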
We implement our model on a GeForce GTX 1080 Ti GPU platform using PyTorch. All hidden layers and word embeddings share the same dimension. The network is trained with the Adam optimizer using mini-batches. Training stops when the performance on the validation set has not improved for a set number of epochs. We do not fine-tune the DenseNet pretrained on ChestX-ray8 [16] or the ResNet pretrained on ImageNet, due to the small sample size of the IU X-ray dataset [2]. For each disease class, a specific pair of LSTMs is trained to ensure consistency between the predicted disease annotation(s) and the generated report. For the disease classes with too few samples, we train a shared attentive LSTM across the classes to generate the normality description of the report.
Evaluation of Automatic Medical Image Reports. We use metrics for NLP tasks such as BLEU [11], ROUGE [9], and CIDEr for automatic performance evaluation. Our model outperforms all baseline models [3, 10, 13, 15] and demonstrates the best CIDEr and ROUGE scores among all the advanced methods specifically designed for medical report generation [6, 8, 19], despite the fact that we only use a single frontal view X-ray. While BLEU scores measure the percentage of consistency between the automatic report and the manual report in light of the automatic report (precision), they are not informative in assessing the amount of information captured in the automatic report in light of the manual report (recall). In real-world clinical applications, both recall and precision are critical in evaluating the quality of an automatic report. An automatic report that omits part of the content of the manual report may have decreased recall without any effect on precision. Thus, an automatic report missing the key disease information can still achieve high BLEU scores even though it provides limited insight for medical image interpretation. Therefore, ROUGE is more suitable than BLEU for evaluating the quality of automatic reports since it measures both precision and recall. Further, CIDEr is more suitable for our purpose than ROUGE and BLEU since it captures the notions of grammaticality, saliency, importance and accuracy. Additionally, CIDEr uses TF-IDF to down-weight unimportant common words and to put more weight on disease keywords. As a result, higher ROUGE and CIDEr scores demonstrate the superior performance of our medical image interpretation system.
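The precision and recall argument can be made concrete with a small unigram example. The sketch below is a simplification, not the official BLEU or ROUGE implementation: it computes only clipped unigram precision (in the spirit of BLEU-1, without the brevity penalty) and unigram recall (in the spirit of ROUGE-1), and the two sentences are invented for illustration.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Clipped unigram precision and recall of a candidate against a reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    matched = sum((cand & ref).values())  # each word counted at most ref times
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return precision, recall


# Invented example: the candidate is correct but omits the effusion finding.
reference = "the heart is enlarged with small right pleural effusion"
candidate = "the heart is enlarged"
precision, recall = unigram_overlap(candidate, reference)
print(precision, recall)
```

Every word of the short candidate appears in the reference, so its precision is perfect, while fewer than half of the reference words are recovered.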
Evaluation of Disease Classification. Although ROUGE and CIDEr scores are effective in evaluating the consistency of an automatic report with a manual report, neither is designed for assessing the correctness of medical report annotation in terms of common thoracic diseases. The latter is another key output of a useful image interpretation system. For example, the automatically generated sentence “no focal airspace consolidation, pleural effusion or pneumothorax” is considered similar to the manually written sentence “persistent pneumothorax with small amount of pleural effusion” by both ROUGE and CIDEr scores, despite the completely opposite annotations. Therefore, we assess the accuracy of medical report annotation by comparing with TieNet [17] in disease classification, using the Area Under the ROC curve (AUROC) as the metric. Our result outperforms TieNet’s classification module on several of the eight diseases (Table 4, Fig. 4). As discussed before, the inferior performance of TieNet may be due to the fact that it trades image classification performance for report generation performance. On the contrary, our model exploits the former to enhance the latter via a bi-level attention.
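For reference, AUROC for a single disease can be computed directly from its ranking interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is a minimal illustration with invented labels and scores, not the evaluation code used for the reported results.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and predicted probabilities for one disease.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(auroc(labels, scores))
```

Because the measure depends only on the ordering of the scores, it is unaffected by the choice of classification threshold.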
Example System Outputs. Fig. 5 shows two example outputs each with a generated report and image annotation. The first row presents an annotated “Normal” case whereas the second row presents an annotated “Cardiomegaly” case with the disease localized in a red bounding box on the heatmap generated from our classification and localization module. The results show that our medical interpretation system is capable of diagnosing thoracic diseases, highlighting the key findings in X-rays with heatmaps and generating well-structured reports.
In summary, we propose a bi-level attention mechanism for automatic X-ray image interpretation. Using only a single frontal view chest X-ray, our system is capable of accurately annotating X-ray images and generating quality reports. Our system also provides visual support to assist radiologists in rendering diagnostic decisions. As more quality training data become available in the near future, our medical image interpretation system can be improved by: (1) incorporating both frontal and lateral views of X-rays, (2) predicting more disease classes, and (3) using hand-labeled bounding boxes as the target of localization. We will also generalize our system by extracting informative features from Electronic Health Record (EHR) data and repeated longitudinal radiology reports to further enhance its performance.
References
1. Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C.L., Parikh, D., Batra, D.: Vqa: Visual question answering. Int. J. Comput. Vision 123(1), 4–31 (May 2017)
2. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. JAMIA 23(2), 304–310 (2015)
3. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: ICCV. pp. 2625–2634 (2015)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
5. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. pp. 4700–4708 (2017)
6. Jing, B., Xie, P., Xing, E.: On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195 (2017)
7. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: CVPR. pp. 3337–3345. IEEE (2017)
8. Li, C.Y., Liang, X., Hu, Z., Xing, E.P.: Hybrid retrieval-generation reinforced agent for medical image report generation. arXiv preprint arXiv:1805.08298 (2018)
9. Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004)
10. Lu, J., Xiong, C., Parikh, D., Socher, R.: Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In: CVPR. vol. 6, p. 2 (2017)
11. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: ACL. pp. 311–318. Association for Computational Linguistics (2002)
12. Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et al.: Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
13. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: CVPR. pp. 7008–7024 (2017)
14. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV. pp. 618–626. IEEE (2017)
15. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: CVPR. pp. 3156–3164 (2015)
16. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: CVPR. pp. 3462–3471. IEEE (2017)
17. Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: Tienet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. In: CVPR. pp. 9049–9058 (2018)
18. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML. pp. 2048–2057 (2015)
19. Xue, Y., Xu, T., Long, L.R., Xue, Z., Antani, S., Thoma, G.R., Huang, X.: Multimodal recurrent model with attention for automated radiology report generation. In: MICCAI. pp. 457–466. Springer (2018)