Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation

06/06/2020 ∙ by Mingjie Li, et al. ∙ Monash University ∙ SUN YAT-SEN UNIVERSITY

Beyond the common difficulties faced in natural image captioning, medical report generation specifically requires the model to describe a medical image with a fine-grained and semantically coherent paragraph that satisfies both medical commonsense and logic. Previous works generally extract global image features and attempt to generate a paragraph that is similar to the reference reports; however, this approach has two limitations. Firstly, the regions of primary interest to radiologists usually occupy a small area of the global image, meaning that the remaining parts of the image can act as irrelevant noise during training. Secondly, each medical report contains many similar sentences that describe the normal regions of the image, which causes serious data bias. This deviation is likely to teach models to generate these inessential sentences on a regular basis. To address these problems, we propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns. In more detail, ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning. The core structure of ASGK consists of a medical graph encoder and a natural language decoder, inspired by Generative Pre-Training (GPT). Experiments on the CX-CHR dataset and our COVID-19 CT Report dataset demonstrate that ASGK is able to generate robust and accurate reports, and moreover outperforms state-of-the-art methods on both medical terminology classification and paragraph generation metrics.




1 Introduction

Natural image captioning, which aims to summarise visual information (images or videos) in a sentence or generate a topic-related paragraph Anderson et al. (2018); Rennie et al. (2017); Xu et al. (2015), is a complex task that requires the model to bridge visual and linguistic information. Compared to describing natural images Cao et al. (2018); Wu and Cohen (2016), medical report generation requires a greater capability to understand medical domain knowledge and to describe images at a fine-grained and semantically coherent level, covering accurate abnormal terminologies Li et al. (2018). In particular, the outstanding challenges in modeling medical reports lie in successfully detecting visual groundings and incorporating medical domain knowledge.

Figure 1: Two samples from the CX-CHR and COV-CTR datasets. Red bounding boxes annotated by a radiologist indicate the regions to which he or she pays most attention when describing the image. The red text describes the abnormalities. Underlined text indicates alignment between ground truth reports and generated reports.

Generally speaking, when a radiologist describes a medical image, he/she will carefully inspect the abnormal regions after quickly browsing the global image, then write a report that draws on the knowledge he/she learned from external medical domain information and his/her working experience. However, unlike radiologists’ working patterns, most existing methods Wang et al. (2017); Kumar et al. (2018); Wang et al. (2018) employ the global image as input and train their language model on the datasets’ corpora only, both of which are limitations. As shown in Figure 1, the attention regions take up only a small portion of the global image, but have been treated equally to other regions in previous works. The other regions can therefore act as irrelevant noise that distracts the model. Furthermore, as the attention regions differ from image to image, it is difficult to crop them manually in pre-processing. Moreover, unlike the common and trivial normal findings in medical reports, abnormalities are rare and diverse. Training on the datasets’ corpora alone thus makes it difficult to alleviate data deviation.

Accordingly, to mimic the behavior of medical experts and address the above-mentioned learning difficulties, we introduce two kinds of auxiliary signals: the internal fusion features and the external medical linguistic information. More specifically, we attend the auxiliary region features to the global visual features in order to produce the internal auxiliary signal, while the external auxiliary signals are contained in a large-scale, easily accessed medical textbook. We are inspired by the recent progress made in large-scale unsupervised/self-supervised vision and language understanding Sun et al. (2019); Zhu et al. (2020); Xiong et al. (2019); Devlin et al. (2018); Radford et al. (2018), which has demonstrated that auxiliary signals can improve data efficiency during training and reduce the gap between the visual and linguistic domains. Compared with medical images, which are difficult to acquire and annotate, these two auxiliary signals are far easier to access and can mitigate data inefficiency.

Capitalizing on these new auxiliary signals, we propose an Auxiliary Signal-Guided Knowledge (ASGK) approach to guide knowledge encoding and natural language decoding in order to facilitate medical report generation. The medical graph encoder and natural language decoder are pretrained using external auxiliary signals, enabling them to memorize and phrase medical knowledge, while the internal signals facilitate the graph encoding, which permits the incorporation of prior medical knowledge and the bridging of visual and linguistic information. Moreover, to tackle the imbalance between normal and abnormal tag distributions, we adopt focal loss Lin et al. (2017) as our training strategy for tag classification.

We further introduce a new COVID-19 CT Report (COV-CTR) dataset for use in validating the robustness and generalization ability of ASGK. Since December 2019, the novel COVID-19 virus has caused a global pandemic and infected millions of people across 200 countries. A key step in controlling the infection is identifying infected people. In addition to Reverse Transcription Polymerase Chain Reaction (RT-PCR) tests, lung CT scan analysis has emerged as another essential testing method. An accurately written report can therefore help patients and doctors understand the patient's health condition. We invited three radiologists with more than five years of working experience to apply their diagnostic skills to the public COVID-CT dataset Zhao et al. (2020) and used this information to construct the COV-CTR dataset.

We test our approach on the large-scale CX-CHR dataset and our COV-CTR dataset. We adopt CIDEr-D Vedantam et al. (2015), ROUGE-L Lin (2004) and BLEU Papineni et al. (2002) as the metrics for evaluating our approach. Comprehensive experiments demonstrate that ASGK improves performance in terms of both tag classification and report generation. Our ablation studies also provide insight into why ASGK works well.

The main contributions of this paper are three-fold as follows:

  • We identify and produce two kinds of auxiliary signals, namely the internal fusion visual features and the external medical linguistic information to facilitate graph encoding and medical knowledge learning respectively.

  • We design a medical tag graph encoder to transfer input features into higher-level information and adopt Generative Pre-Training (GPT) Radford et al. (2018) as our natural language decoder to generate accurate and robust medical reports.

  • We invite three radiologists with more than five years of experience to apply their diagnostic skills to the COVID-19 CT images Zhao et al. (2020) and use this information to construct a new medical report dataset, COVID-19 CT Report (COV-CTR), which will be made available.

2 Related Work

Visual Captioning and Medical Report Generation. Due to the rapid development of deep learning Huang et al. (2017); He et al. (2016); Hochreiter and Schmidhuber (1997); Chung et al. (2014), visual captioning models Anderson et al. (2018); Gan et al. (2017); Vinyals et al. (2015) have achieved significant progress in summarizing visual information (images or videos) in a single sentence or topic-related paragraph. Neural encoder-decoder frameworks Sutskever et al. (2014) and attention mechanisms Ranzato et al. (2015); You et al. (2016) have achieved great performance in both natural images and the medical domain. To further boost accuracy, scene graphs Johnson et al. (2015) and knowledge graphs Li et al. (2019) have been explored as replacements for the encoded vectors, since they can take advantage of detected nodes and their relationships. In order to alleviate textual data bias, reinforcement learning directly uses the evaluation metric as a reward and optimizes these non-differentiable metrics Liu et al. (2017); Rennie et al. (2017); however, it is poor at implicitly balancing the visual data bias.

Medical Image Analysis with Auxiliary Signals. Recent works Islam et al. (2017); Shin et al. (2016) discussed the application of deep learning technologies to the field of medical image analysis. However, due to the difficulty associated with accessing and annotating medical images, many researchers have attempted to use self-supervised learning to loosen the requirements on training data. The core of self-supervised learning involves the design of various proxy tasks that provide auxiliary signals for training deep neural networks Jing and Tian (2020). Furthermore, auxiliary signals are widely applied as the basic structure for image analysis. Adopting auxiliary signals to guide training has advantages in terms of boosting model performance and improving model robustness. Zhuang et al. (2019) found that auxiliary signals are likely to benefit 3D neural networks for brain hemorrhage classification and brain tumor segmentation.

Language Model Pre-training. Natural language decoders are another critical part of the image captioning process. Recent breakthroughs in the field of pretrained language models, such as ELMo Peters et al. (2018), BERT Devlin et al. (2018), and XLNet Yang et al. (2019), have demonstrated the effectiveness of auxiliary signals for a wide range of natural language processing tasks. For example, GPT-2 Radford et al. (2019) reveals that pretraining allows models to learn a language’s syntactic and semantic information via unsupervised learning, which is then transferred to other tasks. However, directly applying these models to medical domain datasets often yields unsatisfactory results due to the domain gap between general corpora and medical corpora. To tackle this problem, Habibi et al. (2017) propose a completely generic method based on deep learning and statistical word embedding, while Lee et al. (2020) pretrain BERT on medical corpora.

3 Approach

3.1 Problem Setup

The task of medical report generation involves asking a model to generate a topic-related paragraph, consisting of a series of sentences, to describe a medical image of a patient case. We represent the report as a sequence of word indices into V, the vocabulary of all words contained in the datasets. To generate fine-grained and semantically coherent sentences, we propose a graph encoder-decoder framework that first encodes the input feature vectors into a medical tag graph and then decodes the graph into a medical report. The medical tag graph consists of a set of nodes and a set of edges. In our task, we represent each node feature by its detected tag classification probability, then encode the correlation between each pair of tags as edge weights. N represents the total number of medical tags, which comprise abnormal terminologies, such as “pneumothorax” and “colon shadow”, and normal terminologies, such as “normal spine” and “normal intercostal space”.

Generally, when a radiologist describes an image, he/she will inspect the abnormal region carefully after quickly browsing the global image, then write a report that reflects both this inspection and the knowledge obtained from external medical domain information and working experience. To mimic this pattern, we first pretrain the framework with the external medical signals collected from an appropriate website so that it learns and correctly phrases medical knowledge. Subsequently, the internal visual fusion signals facilitate graph encoding and bridge the gap between the linguistic and visual domains. More details regarding these internal visual fusion signals are described in Section 3.3.

3.2 The structure of ASGK

Figure 2: An overview of our ASGK approach. The ASGK model consists of a medical graph encoder and a natural language decoder. The medical graph encoder encodes input features into the corresponding medical tag graph, while the natural language decoder transfers high-level information to sentences or reports. The external signals guide the pretraining procedure, while the internal signals guide the model to bridge linguistic and visual information. T and MCS represent threshold and max connection select operation respectively.

An overview of our approach is shown in Figure 2. The main structure of ASGK comprises a medical graph encoder and natural language decoder.

Medical Graph Encoder. This component is built to encode the input features into higher-level information, i.e. a medical tag graph. In the medical graph, each node denotes one detected medical tag, and its feature is the corresponding classification probability (Equation 1). These probabilities are obtained by applying a projection matrix to the input features; the size of the matrix is determined by the dimension of the input features and by N, the total number of tags. Given that ground-truth edge information is not available in our case, we conduct an attention operation to learn the edge weights automatically: Attention is executed as a scaled dot-product operation, and Norm is the normalization operation applied to its output.

Following Li et al. (2019), the medical tag graph is then combined, via an attention mechanism, with the prior medical knowledge, which is represented as a set of N nodes with initialized features and edges.

To enhance the correlation between the nodes, we employ a multi-head self-attention operation on the resulting graph to get the final graph. We further treat medical tag detection as a multi-label classification task and adopt BCE loss to maximize the prediction scores. The loss is computed from the ground truth labels and from predictions obtained by applying a projection matrix to the final graph tag features.
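The encoder steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the sigmoid used to turn projected features into per-tag probabilities, the softmax used as the normalization, and all function names and toy values are hypothetical choices, and the prior-knowledge fusion and multi-head self-attention steps are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def node_probabilities(W, x):
    """One probability per tag: sigmoid of a linear projection of the input
    features (the sigmoid is an assumption; W has one row per tag)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

def edge_weights(queries, keys):
    """Scaled dot-product attention scores, normalised per row with a softmax
    (the softmax stands in for the unspecified Norm operation)."""
    d = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over all tags (multi-label classification)."""
    terms = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs, labels)
    ]
    return sum(terms) / len(terms)

# Toy example: 3 tags, 2-dimensional input features (all values hypothetical).
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3]]
x = [1.0, 2.0]
probs = node_probabilities(W, x)
node_vecs = [[p, 1.0 - p] for p in probs]  # hypothetical per-node vectors
E = edge_weights(node_vecs, node_vecs)
loss = bce_loss(probs, [1, 0, 0])
```

Each row of E sums to one, so an edge weight can be read as the relative strength of one tag's connection to every other tag.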

Natural Language Decoder. Inspired by GPT Radford et al. (2018), we design a natural language decoder consisting of stacked blocks, similar to the Transformer decoder, to interpret the medical tag graph and enable semantic alignment between the visual and linguistic domains. The structure of the block is presented in Figure 2. This block applies a masked, multi-head self-attention operation to the tokens of the medical report or sentences, which are embedded using GloVe vectors pretrained on our datasets. Following Radford et al. (2018), we maximize the likelihood of each token given its history, where the conditional probability of the next token is modeled using a neural network with learnable parameters. Then, through position-wise feed-forward layers, the natural language decoder produces an output distribution over the token vocabulary. The decoder input combines the pretrained word embedding matrix, indexed by the positions of the input tokens in the vocabulary, with the position embedding matrix, indexed by each token's position in the sequence.
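For reference, the language-modeling objective and output layer of GPT, as given by Radford et al. (2018), are reproduced below in that paper's notation. The symbols are those of the GPT paper and may differ from the ones used in this paper's own equations, so this should be read as background rather than as a reconstruction of the formulas here.

```latex
\begin{align}
L_1(\mathcal{U}) &= \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right) \\
h_0 &= U W_e + W_p \\
h_l &= \mathrm{transformer\_block}(h_{l-1}) \quad \forall\, l \in [1, n] \\
P(u) &= \mathrm{softmax}\left(h_n W_e^{\top}\right)
\end{align}
```

Here k is the size of the context window, Θ denotes the network parameters, U is the context vector of tokens, W_e is the token embedding matrix, W_p is the position embedding matrix, and n is the number of blocks.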

3.3 Auxiliary Signal-Guided Learning

Pretraining with External Auxiliary Signals. The direct application of general pretrained language models to medical domain tasks leads to unsatisfactory results, since the word distributions of general and medical corpora differ. To resolve this problem, we collect medical textual information from an appropriate website to construct a large-scale medical textbook. This textbook provides sufficient medical knowledge, including the symptoms, manifestations and other information about COVID-19 and thoracic diseases. Before feeding it into the medical graph encoder, we divide the medical textbook into sentences and embed the word tokens with embedding vectors, which are trained on our datasets using GloVe. After embedding, sentences are encoded using a single-layer GRU with 1024 hidden units to produce the external medical auxiliary signals.

Training with Internal Auxiliary Signals. Evidently, the quality of the encoded medical graph will significantly affect the accuracy of the generated reports. Therefore, we produce internal fusion visual signals to facilitate medical graph encoding and bridge the gap between linguistic and visual information. As shown in Figure 2, we first classify the global image using DenseNet-121 and obtain both the feature maps before the final pooling layer and the output of the final pooling layer. To produce the mask, we perform a threshold operation on a heat map acquired by Equation 9 and select the max connected area.
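The threshold-then-select step can be illustrated with a short, dependency-free sketch. It assumes a heat map already scaled to [0, 1], a "greater than or equal" comparison, and 4-connectivity; none of these details are stated in the text, and the function names and toy heat map are hypothetical.

```python
from collections import deque

def largest_region_mask(heat_map, threshold):
    """Binary mask of the largest 4-connected region with heat >= threshold."""
    rows, cols = len(heat_map), len(heat_map[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or heat_map[r][c] < threshold:
                continue
            # Breadth-first search over one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                component.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc]
                            and heat_map[nr][nc] >= threshold):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(component) > len(best):
                best = component
    mask = [[0] * cols for _ in range(rows)]
    for r, c in best:
        mask[r][c] = 1
    return mask

def bounding_box(mask):
    """(top, left, bottom, right) of the non-zero cells, inclusive; None if empty."""
    cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not cells:
        return None
    rs = [r for r, _ in cells]
    cs = [c for _, c in cells]
    return min(rs), min(cs), max(rs), max(cs)

# Toy heat map (hypothetical values): two hot areas, the larger one top-left.
heat = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.9],
    [0.1, 0.1, 0.1, 0.1],
]
mask = largest_region_mask(heat, 0.7)
box = bounding_box(mask)
```

The bounding box of the selected region is one plausible way to crop the auxiliary region image that is fed to the second network; the isolated hot cell on the right is discarded because it belongs to a smaller component.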


We adopt another DenseNet to extract the attended region features from the final pooling layer, then perform an element-wise operation on the global and region features to produce the fusion signals. To balance the deviation in medical tags, we optimize the parameters of the three branches via focal loss Lin et al. (2017). The focal loss is computed from the label and the prediction probability; it uses a weighting hyper-parameter, set according to the dataset, and a modulating factor with a tunable focusing parameter. Both hyper-parameters are set to fixed values in our task.
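As a concrete reference, the binary focal loss of Lin et al. (2017) can be written as a short function. This is a sketch of the standard formulation rather than the exact training code; the values 0.25 and 2.0 used in the example are the commonly quoted defaults from Lin et al. and are illustrative placeholders, not the settings used in this paper.

```python
import math

def focal_loss(probs, labels, alpha, gamma, eps=1e-12):
    """Mean binary focal loss over tags.

    p_t is the probability assigned to the true class, alpha_t weights the
    positive and negative classes, and (1 - p_t) ** gamma is the modulating
    factor that down-weights easy, well-classified examples.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t + eps)
    return total / len(probs)

# Illustrative hyper-parameters only (assumed, not taken from this paper).
ALPHA, GAMMA = 0.25, 2.0
easy = focal_loss([0.95], [1], ALPHA, GAMMA)  # confident, correct prediction
hard = focal_loss([0.10], [1], ALPHA, GAMMA)  # confident, wrong prediction
```

With gamma set to 0 the modulating factor disappears and the loss reduces to an alpha-weighted cross entropy, which is the comparison made by the cross-entropy variant in the ablation study.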

4 Experiments

Figure 3: Sample output of our approach on both the CX-CHR and COV-CTR datasets. We use the outputs before the last pooling layer in DenseNet-121 to generate heat maps, then threshold them to produce the auxiliary regions. In the medical tag graphs, we show only the nodes whose value (which is equal to the classification probability) exceeds a display threshold and the edges whose weights are more than 0.3. For readability, we show the values of some edges in the appropriate places. The underlined text indicates alignment between ground truth reports and generated reports.

Datasets. We conduct experiments on the large-scale CX-CHR dataset and our COVID-19 CT Report dataset in order to validate the robustness and generalization ability of ASGK. CX-CHR is a large-scale chest X-ray dataset, constructed by a professional medical institution, that consists of 35,609 patients and 45,598 images paired with their corresponding Chinese diagnostic reports. We collect 173 medical tags comprising 155 abnormal terminologies and 28 normal terminologies from the “findings” section and annotate the paired images with these tags. Moreover, the COV-CTR dataset consists of 728 images (349 for COVID-19 and 379 for Non-COVID) collected from published papers, together with their corresponding paired Chinese reports. We perform the same operation described above and collect 68 tags (50 abnormalities and 18 normalities).

We tokenize all reports and the medical textbook and keep only tokens with a minimum frequency of three, which results in 27,683 unique tokens. On both datasets, we randomly split the data into training, validation, and testing sets; there is no overlap between these splits.
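The vocabulary filtering and random splitting described above can be sketched as follows. The split ratio shown is a hypothetical placeholder, since the ratio is not recoverable from the text, and the helper names and toy documents are illustrative.

```python
import random
from collections import Counter

def build_vocab(token_lists, min_freq=3):
    """Map every token that occurs at least min_freq times to an integer index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(kept)}

def coverage(token_lists, vocab):
    """Fraction of all token occurrences that fall inside the vocabulary."""
    total = sum(len(toks) for toks in token_lists)
    covered = sum(1 for toks in token_lists for tok in toks if tok in vocab)
    return covered / total if total else 0.0

def split_indices(n, ratios, seed=0):
    """Shuffle range(n) and cut it into disjoint train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

docs = [["a", "b"], ["a", "b", "c"], ["a", "b"]]
vocab = build_vocab(docs, min_freq=3)
cov = coverage(docs, vocab)
# (0.7, 0.1, 0.2) is an assumed ratio used only for this example.
train, val, test = split_indices(10, (0.7, 0.1, 0.2))
```

Because the three parts are slices of one shuffled index list, they cannot overlap, which matches the no-overlap requirement stated in the text.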

Evaluation Metrics. Following Li et al. (2019), we adopt three kinds of metrics to evaluate our approach. Firstly, we use the area under the curve (AUC) to evaluate the performance of all medical tag classifications. Moreover, to evaluate medical report generation, we select CIDEr-D, ROUGE-L, and BLEU as automatic metrics and conduct a human evaluation. We randomly select 100 samples from the testing set and generate corresponding medical reports using CoAtt Jing et al. (2017) and our approach.
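For the tag classification metric, a minimal AUC computation is sketched below using the pairwise (Mann-Whitney) definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. How AUC is aggregated over tags is not specified in the text, so the macro average over tags shown here is an assumption, as are the function names.

```python
def roc_auc(scores, labels):
    """AUC for one tag via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def macro_auc(score_rows, label_rows):
    """Average per-tag AUC, skipping tags for which only one class is present."""
    aucs = []
    for t in range(len(label_rows[0])):
        col_labels = [row[t] for row in label_rows]
        if len(set(col_labels)) < 2:
            continue
        col_scores = [row[t] for row in score_rows]
        aucs.append(roc_auc(col_scores, col_labels))
    if not aucs:
        raise ValueError("no tag has both positive and negative examples")
    return sum(aucs) / len(aucs)

single = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
overall = macro_auc([[0.9, 0.2], [0.1, 0.8]], [[1, 0], [0, 1]])
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based implementation would be preferable on a dataset of this size.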

Training Details. The whole network is implemented in PyTorch with Python 3.6 and trained on two GeForce RTX 2080 Ti GPUs. We adopt DenseNet-121 with no pretraining as the backbone to extract visual features. There are three steps in our training process: external auxiliary signal-guided pretraining, DenseNet pretraining, and internal auxiliary signal-guided training. In the first step, the maximum length of the sentence is 300 (padded with 0s), and the word embedding dimension is 300. We train ASGK for 30 epochs until convergence. The natural language decoder consists of three blocks. We adopt Adam as the optimizer with a learning rate of 5e-4. For the second step, we resize both the global and region images to a fixed size. The batch size is 32. We jointly train two DenseNets for 50 epochs until convergence. The learning rate starts from 1e-2 and decays every 10 epochs until it reaches 1e-5. We threshold the heat map by 0.7 to acquire region images. We adopt the model that achieves the best performance on the test datasets as the visual extractor in the third step. In the final step, we resize the images and train the entire network for 30 epochs until convergence. The learning rates for the visual extractor and ASGK are 1e-5 and 5e-4, respectively. We again adopt the Adam optimizer to minimize the loss function. Across the multiple tasks, we set all loss weights to 1.
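The learning-rate schedule of the second training step can be sketched as a step decay with a floor. The start value, step length and floor come from the text; the decay factor of 0.1 is an assumption, because the factor is not recoverable from the text, and the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=1e-2, decay=0.1, step=10, min_lr=1e-5):
    """Learning rate for a given epoch: multiply by `decay` every `step`
    epochs and never go below `min_lr`. `decay=0.1` is an assumed value."""
    return max(base_lr * decay ** (epoch // step), min_lr)

# Schedule over the 50 epochs used for DenseNet pretraining.
schedule = [step_decay_lr(epoch) for epoch in range(50)]
```

Under the assumed factor, the rate reaches the 1e-5 floor at epoch 30 and stays there for the remaining epochs.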

4.1 Results and Analysis

Dataset   Model                                C      R     B-1   B-2   B-3   B-4   Hit(%)
CX-CHR    CoAtt Jing et al. (2017)             273.5  64.5  64.7  57.5  52.5  48.7  8.0
CX-CHR    HRGR-Agent Li et al. (2018)          289.5  61.2  67.3  58.7  53.0  48.6  -
CX-CHR    KERP Li et al. (2019)                285.0  61.8  67.3  58.8  53.2  47.3  -
CX-CHR    Vision-BERT Devlin et al. (2018)     302.4  63.7  68.6  60.1  54.1  50.3  19.0
CX-CHR    Vision-GPT Radford et al. (2018)     301.8  63.0  67.9  59.6  54.0  48.7  -
CX-CHR    Ours                                 324.5  64.1  68.6  60.8  55.8  52.3  20.0
COV-CTR   CoAtt Jing et al. (2017)             67.2   74.8  70.9  64.5  60.3  55.2  25.0
COV-CTR   SAT Vinyals et al. (2015)            65.9   72.3  69.7  62.1  56.8  51.5  -
COV-CTR   AdaAtt Lu et al. (2017)              68.2   72.6  67.6  63.3  59.6  51.4  -
COV-CTR   Vision-BERT Devlin et al. (2018)     68.4   74.7  71.0  65.3  60.6  55.8  26.0
COV-CTR   Vision-GPT Radford et al. (2018)     68.0   74.6  70.8  64.5  60.0  54.9  -
COV-CTR   Ours                                 68.4   74.6  71.2  65.9  61.1  57.0  27.0

Table 1: Evaluation metrics on the CX-CHR and COV-CTR datasets comparing ASGK with other methods. C and R are short for CIDEr-D and ROUGE-L. B-n denotes that the BLEU score uses up to n-grams. Hit represents the human evaluation results.

Automatic Evaluation. Table 1 summarizes the performance of different models on the automatic evaluation metrics. The results on both datasets indicate that ASGK achieves the best or tied-best score on most metrics by exploiting auxiliary signals to guide the framework in the knowledge pretraining and knowledge transfer procedures. The results demonstrate the robustness and generalization ability of ASGK. We also combine our medical graph encoder with Vision-BERT Devlin et al. (2018) and Vision-GPT Radford et al. (2018) in order to validate the capability of language-to-vision transfer. We adopt CIDEr-D as the main metric to validate our model. On the large-scale CX-CHR dataset, ASGK significantly boosts performance compared with the other baselines: it increases the CIDEr-D score by 51.0, 35.0, 39.5, 22.1 and 22.7 over CoAtt, HRGR-Agent, KERP, Vision-BERT and Vision-GPT, respectively. However, ASGK achieves a slightly lower ROUGE-L score than the CoAtt Jing et al. (2017) method. ASGK also matches or outperforms the other baselines on most metrics on the COV-CTR dataset.
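The CIDEr-D gains quoted above follow directly from the CX-CHR rows of Table 1. The short check below recomputes them; the dictionary simply restates the C column, and the variable names are illustrative.

```python
# CIDEr-D scores on CX-CHR, copied from the C column of Table 1.
cider_cx_chr = {
    "CoAtt": 273.5,
    "HRGR-Agent": 289.5,
    "KERP": 285.0,
    "Vision-BERT": 302.4,
    "Vision-GPT": 301.8,
}
ours = 324.5

# Absolute gain of ASGK over each baseline, rounded to one decimal place
# to remove floating-point noise from the subtraction.
gains = {name: round(ours - score, 1) for name, score in cider_cx_chr.items()}
```

The same subtraction on the COV-CTR rows shows much smaller margins, including a tie with Vision-BERT on CIDEr-D.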

Medical Tags Classification. The AUCs of medical tag classification, which contains both normal and abnormal terminologies on both datasets, are presented in Table 2. Our framework, which is guided by two auxiliary signals, outperforms the baseline on both datasets. Baseline outputs are predicted by a DenseNet-121 without pretraining. We attempt to boost the performance through the use of internal auxiliary signals and the adaptation of focal loss to balance the deviation. This demonstrates that internal auxiliary signals effectively promote the medical graph encoder and facilitate the medical tag classification.

Human Evaluation. Given 200 random images drawn equally from the two datasets, we invited three radiologists to evaluate the corresponding outputs of our method, CoAtt Jing et al. (2017) and Vision-BERT Devlin et al. (2018). They were asked to select the more accurate result from each pair. The human evaluation results are presented in the Hit (%) column of Table 1. On the CX-CHR and COV-CTR datasets, radiologists judged 20.0% and 27.0% of our reports, respectively, to be more accurate than the others', and judged a further portion of the results to be the same. The human evaluation demonstrates that our method is capable of generating accurate and semantically coherent reports.

Visualization. An illustration of heat maps, auxiliary regions, medical tag graphs, and paragraphs of medical reports is presented in Figure 3. It is clear from the results that the auxiliary regions indicate where the model should focus. For example, in the first row, the auxiliary region focuses on the inferior lobe of the left lung, which presents a shadow. In the fourth row, moreover, the auxiliary region focuses on the inferior pleural area of the left lung, which covers ground-glass opacity, one of the symptoms of COVID-19. The medical tag graph demonstrates that ASGK is capable of encoding input features into a high-level knowledge graph; as we lack the ground truth of the corresponding graph, we train in an end-to-end way to encode the graph. The generated reports are of high quality and align closely with the ground truth.

4.2 Ablation Studies

Dataset   Model            C      R     B-4   AUC
CX-CHR    baseline         289.7  61.3  48.3  78.7
CX-CHR    baseline+IA+CE   304.6  62.5  48.9  82.1
CX-CHR    baseline+IA      305.3  62.7  49.1  83.2
CX-CHR    baseline+EA      317.2  63.8  52.0  79.3
CX-CHR    baseline+IA+EA   324.5  64.1  52.3  85.9
COV-CTR   baseline         59.1   68.3  52.5  72.7
COV-CTR   baseline+IA+CE   61.3   70.2  54.1  79.0
COV-CTR   baseline+IA      62.8   70.5  54.2  79.7
COV-CTR   baseline+EA      66.9   72.0  55.6  74.5
COV-CTR   baseline+IA+EA   68.4   74.6  57.0  80.4

Table 2: Ablation studies for different auxiliary signals. IA, EA and CE are short for “internal auxiliary signals”, “external auxiliary signals” and “cross entropy”. Four metrics (CIDEr-D, ROUGE-L, BLEU-4 and AUC) are adopted to evaluate our model on two datasets.

We conduct ablation experiments to compare the contributions of the two auxiliary signals. Table 2 presents the results of the automatic evaluation metrics and tag classification. The baseline represents the direct training of the ASGK model without any auxiliary signals. Unless otherwise noted, we adopt focal loss as our training strategy.

Do internal auxiliary signals help? From Table 2, we can see that internal auxiliary signals significantly boost the tag classification performance and improve the quality of generated reports. On the CX-CHR dataset, internal auxiliary signal-guided learning (baseline+IA) scores higher than the baseline on all three automatic metrics and also achieves a higher classification AUC. The quality of the medical tag graphs significantly impacts the natural language decoder. We produce internal auxiliary signals to mimic radiologists’ working patterns, since abnormal regions provide richer visual features. These experiments demonstrate that focusing on abnormal regions benefits the detection of medical tags and the generation of medical reports.

What is the use of focal loss? Radiologists are asked to describe all of their observations on one medical image, which leads to serious data deviation in medical tag labels and reports. Typically, each image contains three to five normal tags and a few abnormal terminologies. To alleviate the deviation in the multi-label classification task, we adopt focal loss to optimize the parameters of the DenseNets and the medical tag encoder. Comparing the second and third rows for each dataset in Table 2 shows that focal loss balances the deviation and improves the AUC: without focal loss, the AUC decreases by 1.1 and 0.7 points on the two datasets, respectively.

Are external auxiliary signals useful? The external auxiliary signals guide the pretraining procedure to assist the model in memorizing and phrasing medical knowledge. As expected, ASGK benefits considerably from the pretraining procedure. All automatic metrics improve substantially over the baseline on both datasets (baseline versus baseline+EA in Table 2), which indicates that external auxiliary signal-guided training helps the model generate accurate and semantically coherent sentences. However, it improves the classification accuracy only slightly (the AUC rises from 78.7 to 79.3 and from 72.7 to 74.5 on the two datasets, respectively), which demonstrates that exploiting medical domain knowledge primarily benefits the natural language decoder. Furthermore, our findings show that without external auxiliary signals, the model fails to alleviate the data bias and is therefore prone to repeating several specific words and sentences in one report.

Overall, the internal signals mainly facilitate the medical tag encoder’s effectiveness in generating fine-grained sentences and describing more medical tags. The external signals enable the natural language decoder to generate more semantically coherent sentences.
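The comparisons discussed in this section can be recomputed from Table 2. The sketch below assumes that the four numeric columns of Table 2 are, in order, CIDEr-D, ROUGE-L, BLEU-4 and AUC (an inference from the matching values in Table 1 and from the text, since the column header is not preserved); the helper name is hypothetical.

```python
# Rows of Table 2; each tuple is assumed to be (CIDEr-D, ROUGE-L, BLEU-4, AUC).
table2 = {
    "CX-CHR": {
        "baseline": (289.7, 61.3, 48.3, 78.7),
        "baseline+IA+CE": (304.6, 62.5, 48.9, 82.1),
        "baseline+IA": (305.3, 62.7, 49.1, 83.2),
        "baseline+EA": (317.2, 63.8, 52.0, 79.3),
        "baseline+IA+EA": (324.5, 64.1, 52.3, 85.9),
    },
    "COV-CTR": {
        "baseline": (59.1, 68.3, 52.5, 72.7),
        "baseline+IA+CE": (61.3, 70.2, 54.1, 79.0),
        "baseline+IA": (62.8, 70.5, 54.2, 79.7),
        "baseline+EA": (66.9, 72.0, 55.6, 74.5),
        "baseline+IA+EA": (68.4, 74.6, 57.0, 80.4),
    },
}

def delta(dataset, before, after):
    """Per-metric absolute change when moving from one configuration to another."""
    rows = table2[dataset]
    return tuple(round(b - a, 1) for a, b in zip(rows[before], rows[after]))

internal_gain = delta("CX-CHR", "baseline", "baseline+IA")
focal_gain = delta("CX-CHR", "baseline+IA+CE", "baseline+IA")
external_gain = delta("COV-CTR", "baseline", "baseline+EA")
```

Read this way, the internal signals give the larger AUC gain while the external signals give the larger gains on the generation metrics, which is the division of labour summarized in the paragraph above.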

5 Conclusions and Future Work

In this paper, we proposed an Auxiliary Signal-Guided Knowledge Encoder-Decoder approach that mimics radiologists’ working patterns to generate fine-grained and semantically coherent medical reports. We investigated how best to crop the auxiliary region from the global medical image, how to exploit medical domain knowledge from a medical textbook, and how these auxiliary signals work. Experiments demonstrate that ASGK outperforms existing methods and improves both report generation and tag classification on two medical datasets. In the future, we plan to focus on building a general captioning framework guided by auxiliary signals to encode and decode general corpora knowledge.

Broader Impacts

This work analyzes medical report generation, a meaningful task that combines computer vision and natural language processing. Especially when a pandemic such as COVID-19 occurs, robust and accurate medical report generation technology is of great clinical value: it can reduce the burden on doctors and enable people to grasp their health status more accurately. We propose an anthropomorphic model, mimicking radiologists’ working patterns, to advance the medical report generation task by acquiring easily accessed auxiliary signals. This approach may inspire researchers who have limited access to medical image resources to dig deeper into unsupervised learning methods that acquire more auxiliary signals to supervise this task and achieve state-of-the-art performance. However, more effort is still needed to provide a theoretical interpretation of these auxiliary signals.


Dr Chang is partially supported by Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) under Grant no. DE190100626. The authors also would like to thank Prof Anyuan Li from The First Affiliated Hospital of Harbin Medical University for providing his domain knowledge.


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §1, §2.
  • Z. Cao, W. Li, S. Li, and F. Wei (2018) Retrieve, rerank and rewrite: soft template based neural summarization. In ACL, Cited by: §1.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014)

    Empirical evaluation of gated recurrent neural networks on sequence modeling

    arXiv preprint arXiv:1412.3555. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2, §4.1, §4.1, Table 1.
  • Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng (2017) Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5630–5639. Cited by: §2.
  • M. Habibi, L. Weber, M. Neves, D. L. Wiegandt, and U. Leser (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33 (14), pp. i37–i48. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.
  • M. T. Islam, M. A. Aowal, A. T. Minhaz, and K. Ashraf (2017) Abnormality detection and localization in chest x-rays using deep convolutional neural networks. arXiv preprint arXiv:1705.09850. Cited by: §2.
  • B. Jing, P. Xie, and E. Xing (2017) On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195. Cited by: §4.1, §4.1, Table 1, §4.
  • L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §2.
  • P. Kumar, M. Grewal, and M. M. Srivastava (2018) Boosted cascaded convnets for multilabel classification of thoracic diseases in chest radiographs. In ICIAR, Cited by: §1.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §2.
  • C. Y. Li, X. Liang, Z. Hu, and E. P. Xing (2019) Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In AAAI, Cited by: §2, §3.2, Table 1, §4.
  • Y. Li, X. Liang, Z. Hu, and E. P. Xing (2018) Hybrid retrieval-generation reinforced agent for medical image report generation. In NeurIPS, Cited by: §1, Table 1.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. Cited by: §1.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §1.
  • S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy (2017) Improved image captioning via policy gradient optimization of spider. In Proceedings of the IEEE international conference on computer vision, pp. 873–881. Cited by: §2.
  • J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 375–383. Cited by: Table 1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. Cited by: §1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: 2nd item, §1, §3.2, §4.1, Table 1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §2.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §2.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In CVPR, Cited by: §1, §2.
  • H. Shin, K. Roberts, L. Lu, D. Demner-Fushman, J. Yao, and R. M. Summers (2016) Learning to read chest x-rays: recurrent neural cascade model for automated image annotation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2497–2506. Cited by: §2.
  • Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros (2019) Unsupervised domain adaptation through self-supervision. arXiv preprint arXiv:1909.11825. Cited by: §1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.
  • R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In CVPR, Cited by: §1.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §2, Table 1.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, Cited by: §1.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers (2018) Tienet: text-image embedding network for common thorax disease classification and reporting in chest x-rays. In CVPR, Cited by: §1.
  • Z. Wu and R. Cohen (2016) Encode, review, and decode: reviewer module for caption generation. arXiv preprint arXiv:1605.07912. Cited by: §1.
  • W. Xiong, J. Du, W. Y. Wang, and V. Stoyanov (2019) Pretrained encyclopedia: weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637. Cited by: §1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5754–5764. Cited by: §2.
  • Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4651–4659. Cited by: §2.
  • J. Zhao, Y. Zhang, X. He, and P. Xie (2020) COVID-ct-dataset: a ct scan dataset about covid-19. arXiv preprint arXiv:2003.13865. Cited by: 3rd item, §1.
  • F. Zhu, Y. Zhu, X. Chang, and X. Liang (2020) Vision-language navigation with self-supervised auxiliary reasoning tasks. In CVPR, Cited by: §1.
  • X. Zhuang, Y. Li, Y. Hu, K. Ma, Y. Yang, and Y. Zheng (2019) Self-supervised feature learning for 3d medical images by playing a rubik’s cube. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 420–428. Cited by: §2.