Radiologists face the time-consuming and repetitive daily task of reading hundreds of radiography images and writing up radiological reports. The fast turn-arounds they are expected to produce lead to fatigue that can negatively affect diagnostic accuracy [19]. Supervised learning for automated pathology detection from images has potential for clinical decision support. However, such image segmentation and classification tasks require detailed annotations covering a large distribution of input data for the algorithms to make robust predictions. Such annotations must be made by qualified radiologists and, given the detail and breadth required, annotating is at least as time-consuming as writing the reports manually. In addition, classification and semantic segmentation only predict the presence of pathologies; they do not generate reports, which contain additional information such as the severity, location, and absence of pathologies.
Recently, supervised learning approaches have aimed to take advantage of past radiological exams containing reports, either to auto-generate the reports [17, 7, 23] or to assist in classification tasks [16, 21, 20, 24, 22]. The noise present in medical reports, together with non-visually significant information such as the negation of pathologies, makes it difficult to learn from them directly as is done in natural image captioning frameworks. Additionally, high recall and precision for pathologies are more crucial in the medical domain, where the risk associated with mis-labelling is much higher.
We therefore propose to use a limited number of manual medical concept annotations in order to first learn to extract them from raw reports, and then take advantage of the model predictions as image annotations, thus providing a method for augmenting an image-annotation dataset. We then demonstrate how these image-concept annotations can be learned through sequence models conditioned on image features, and generate a more readable context for the diagnosis that can be used as part of a clinical decision support system, thus greatly alleviating the burden on radiologists. Our approach can be summarised in the following steps:
1. We propose a network that learns to extract visually significant medical concepts from raw reports. To our knowledge, this is the first attempt that goes beyond simple pathology detection to include concepts such as anatomical position and severity.
2. We explore several sequence-learning networks that condition the sequence generation process on image features in order to learn to auto-generate structured reports from radiological images.
3. We use the predictions made by the structured report generation process in step 1 to demonstrate how they can be used to create an image-report training set for step 2.
2 Related Work
2.1 Data-mining Image Labels
There are two common approaches to extracting image labels from raw reports: statistical and tool-based. Radiological text mapping tools such as DNorm [11] and MetaMap [1] have been used to extract labels for multi-label classification and in weakly supervised localisation learning frameworks. However, these approaches do not take advantage of other biological concepts in the reports, such as the location, severity, and other visually descriptive features of the pathology. Unsupervised statistical methods such as latent Dirichlet allocation and clustering have been used to implicitly define topics and cluster groups containing key words, and to propose classification into these topics and groups. These approaches depend heavily on the number of topics or groups that gives the lowest perplexity score, and this number can fall within a range of values. In addition, these are not generative models, so reports can only be selected through nearest-neighbour methods. We propose instead to learn to generate reports composed of medical concepts from images, in a style similar to natural image captioning.
2.2 Radiology Report Generation
Closest to our work, Shin et al. [17] proposed a cascaded learning framework to auto-generate MeSH annotations from chest X-rays, whereby image embeddings are first extracted from a pre-trained classification network and then used to initialise a sequence prediction model that auto-generates MeSH sequences. Zhang et al. [24, 23] leverage manually created structured reports in a dual-attention framework to improve the features used for classifying histopathology images and to provide interpretability to the classification. The reports used in both cases are far more structured than their raw counterparts, so this approach cannot be directly translated to hospital data. Training on raw hospital reports, Jing et al. [7] demonstrated how they can be generated by first training a multi-label CNN on the images and the Medical Text Indexer (MTI) tags identified in the original raw reports of the OpenI chest X-ray dataset. However, reports can be very long and heterogeneous, and the authors do not evaluate the model’s ability to determine whether visually and clinically relevant medical concepts have been identified. To address the challenges of learning from raw reports directly, we first learn to generate structured reports made up of only visually significant medical concepts that correspond directly to features seen in the images. Because these reports are shorter and vocabulary-controlled, the generation process is easier to evaluate for correct identification of pathologies.
3.1 Enriched Concept Extraction from Raw Reports
We approach learning structured reports from raw textual reports as a multi-label classification task, since the vocabulary of MeSH terms is limited and consistent across annotators. We modify the shallow CNN first introduced by Kim [8] for multi-class text classification and later adapted for multi-label text classification by Liu et al. [12]. First, we introduce a learnable embedding layer, as we do not have the advantage of pre-trained word embeddings for medical text. Second, to aid regularisation, we apply dropout followed by a fully-connected layer to each convolutional output prior to the concatenation.
Let $x_i \in \mathbb{R}^d$ be the $d$-dimensional word vector for the $i$-th word of a report. The textual report is thus represented as a concatenation of word embeddings, $x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$, where $n$ is the maximum length of the reports. A filter $w \in \mathbb{R}^{hd}$ is convolved with a window of $h$ words to produce a new feature $c_i$:

$$c_i = f(w \cdot x_{i:i+h-1} + b)$$

where $f$ is a non-linear activation function and $b$ is a bias term. The filter is applied consecutively to every $h$-word window in the report, resulting in a feature map $c = [c_1, c_2, \dots, c_{n-h+1}]$. Max-over-time pooling [2] is applied over each feature map to capture the most important feature, $\hat{c} = \max\{c\}$. In this way we apply many filter operations with varying window widths in order to obtain multiple features that capture semantic information of reports with varying word lengths. We use the sigmoid activation function, as we require an independent prediction for each class, and train by minimising the multi-class sigmoid cross-entropy (SCE) loss. In addition, because the positive label space is very sparse, we add weighted terms to the loss that balance maximising the true positive class predictions against the true negative class predictions. In the loss, $C$ is the number of classes, $y_{ij}$ is the presence or absence of class label $j$ for instance $i$, and $\hat{y}_{ij}$ is the prediction for instance $i$ on label $j$, made through a sigmoid activation of the corresponding network output $z_{ij}$:

$$\hat{y}_{ij} = \frac{1}{1 + e^{-z_{ij}}}$$

The weights of the loss terms are non-negative, sum to 1, and are chosen through cross-validation. Finally, the modified SCE loss is averaged over batches.
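The convolution, pooling and weighted loss described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-weight split of the loss (`w_pos`, `w_neg`) is one plausible form of the balancing terms, the ReLU non-linearity follows the experimental settings, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def max_over_time(report, filt, bias, width):
    """Apply one filter to every `width`-word window and keep the largest response.

    report: list of word vectors (each a list of d floats), at least `width` long
    filt:   flat list of width * d filter weights
    """
    feature_map = []
    for i in range(len(report) - width + 1):
        window = [x for vec in report[i:i + width] for x in vec]
        activation = sum(w * x for w, x in zip(filt, window)) + bias
        feature_map.append(max(0.0, activation))  # ReLU non-linearity
    return max(feature_map)


def weighted_sce(logits, labels, w_pos=0.5, w_neg=0.5, eps=1e-12):
    """Class-averaged sigmoid cross-entropy for one instance.

    Assumption: positive and negative terms carry separate non-negative
    weights that sum to 1; the exact form used in the paper may differ.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= w_pos * y * math.log(p + eps)
        total -= w_neg * (1 - y) * math.log(1 - p + eps)
    return total / len(logits)
```

Raising `w_pos` relative to `w_neg` penalises missed positive labels more heavily, which is one way to counter a sparse positive label space.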
3.2 Report Generation from Images
Given that it is possible to learn structured report outputs from raw reports, we propose a method of learning to auto-generate structured reports directly from images. We explore multiple ways of conditioning the MeSH sequence learning on the image embeddings that aim to maintain the dependency between the word generation process and the image embedding at every time step. The MeSH sequence is modelled using an RNN, specifically the Long Short-Term Memory (LSTM) implementation proposed in [6]. Each LSTM unit has three sigmoid gates to control the internal state: ‘input’, ‘output’ and ‘forget’. At each time step, the gates control how much of the previous time steps is propagated through to determine the output. For an input word sequence $\{w_1, \dots, w_T\}$, the internal hidden state $h_t$ and memory state $c_t$ are updated as follows:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1}) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1}) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1}) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $x_t$ is the word embedding of $w_t$, $W$ and $U$ are the trainable weight parameters, $\odot$ denotes element-wise multiplication, and $i_t$, $o_t$ and $f_t$ are the input, output and forget gates respectively. Bias terms are left out for readability.
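A single bias-free LSTM update can be written out in a few lines of plain Python. This is a sketch of the standard LSTM cell rather than the authors' code; the `params` layout (one `(W, U)` pair per gate) and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step without bias terms.

    params maps 'i', 'f', 'o' (gates) and 'g' (candidate memory) to a pair
    (W, U): W acts on the word embedding x, U on the previous hidden state.
    """
    def pre_activation(name):
        W, U = params[name]
        return [a + b for a, b in zip(matvec(W, x), matvec(U, h_prev))]

    i = [sigmoid(z) for z in pre_activation("i")]    # input gate
    f = [sigmoid(z) for z in pre_activation("f")]    # forget gate
    o = [sigmoid(z) for z in pre_activation("o")]    # output gate
    g = [math.tanh(z) for z in pre_activation("g")]  # candidate memory
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

With all weights set to zero every gate evaluates to 0.5, so half of the previous memory state is carried forward; this makes a convenient sanity check.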
The image embedding $v$ is extracted from the final spatial-average pooling layer of the pre-trained CNN. We explore three ways of conditioning the sequence learning process:
RNN0: The image embedding is projected into the same embedding space as the word embeddings via a dense transition layer: $v' = W_v v$. The projected image embedding is concatenated with the word sequence and thus treated as the first ‘word’ in the MeSH sequence.
RNN1: The image embedding is projected via a dense transition layer into a fixed embedding width and combined with the output of the recurrent layer, $h_t$, through either a concatenation or a summation operation, and passed to the decoder:
$$d_t = W_d \,(h_t \diamond v')$$
where $\diamond$ represents concatenation or summation and $W_d$ are the weights of the decoder.
RNN2: The image embedding is projected via a dense transition layer into a fixed embedding width and combined with the input of the recurrent layer, the word embedding $x_t$, through either a concatenation or a summation operation, and passed to the encoder:
$$e_t = W_e \,(x_t \diamond v')$$
where $W_e$ are the weights of the encoder.
The model architectures are illustrated in Figure 1. For all models, the decoder outputs $d_t$ are passed to the prediction layer $p_t = \phi(d_t)$, where $\phi$ is the softmax function. The models are all trained by minimising the cross-entropy loss between the output and true sequence:
$$\mathcal{L} = -\sum_{t=1}^{T} \log p(w_t \mid v, w_0, \dots, w_{t-1})$$
where $p(w_t \mid v, w_0, \dots, w_{t-1})$ is the probability that the predicted word equals the true word $w_t$ at time step $t$ given image features $v$ and previous words $w_0, \dots, w_{t-1}$, and $T$ is the LSTM sequence length.
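The difference between RNN1 and RNN2 comes down to where the projected image embedding is merged, and the training objective is a sum of per-step negative log-likelihoods. The sketch below illustrates both under simplifying assumptions: vectors are plain lists, the dense projections are omitted, and every function name is hypothetical.

```python
import math


def merge(a, b, mode="concat"):
    """Combine two vectors by concatenation or element-wise summation."""
    if mode == "concat":
        return list(a) + list(b)
    if mode == "sum":
        if len(a) != len(b):
            raise ValueError("summation needs vectors of equal width")
        return [x + y for x, y in zip(a, b)]
    raise ValueError("mode must be 'concat' or 'sum'")


def rnn1_decoder_input(lstm_output, image_proj, mode="concat"):
    """RNN1: merge the image with the OUTPUT of the recurrent layer."""
    return merge(lstm_output, image_proj, mode)


def rnn2_lstm_input(word_embedding, image_proj, mode="concat"):
    """RNN2: merge the image with the INPUT of the recurrent layer."""
    return merge(word_embedding, image_proj, mode)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_cross_entropy(step_scores, target_ids):
    """Sum over time steps of -log p(true word | image, previous words)."""
    loss = 0.0
    for scores, target in zip(step_scores, target_ids):
        loss -= math.log(softmax(scores)[target])
    return loss
```

In both variants the merge is applied at every time step, which is what keeps each generated word dependent on the image.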
We evaluated our models on the OpenI [14] Indiana University Chest X-ray Collection. This dataset consists of 7,470 frontal and lateral chest X-ray images and 3,955 associated radiological reports from the hospital’s picture archiving systems. All of them have been fully anonymised to remove patient names. In addition to the raw text reports, each exam has MeSH annotations made by qualified radiologists. MeSH annotations are (with some exceptions) formatted as [pathology/description,… pathology/description], where description is a combination of anatomy/position/severity. The number of captions per image is on average 2.33, with an average of 2.68 MeSH terms per caption.
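As an illustration of the annotation format, the sketch below splits a MeSH annotation string into pathology and description parts. The separator and the example strings are assumptions made for illustration, and `parse_mesh` is a hypothetical helper rather than part of any dataset tooling.

```python
def parse_mesh(annotation, sep=","):
    """Split 'pathology/description, ...' into a list of structured terms.

    The first slash-separated token is taken as the pathology; any remaining
    tokens (anatomy, position, severity) form the description.
    """
    terms = []
    for item in annotation.strip("[]").split(sep):
        parts = [p.strip().lower() for p in item.split("/") if p.strip()]
        if parts:
            terms.append({"pathology": parts[0], "description": parts[1:]})
    return terms
```

For example, `parse_mesh("Cardiomegaly/mild, Opacity/lung/base/left")` yields two terms, the second with the description `["lung", "base", "left"]`.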
Pre-processing involved lower-casing and the removal of punctuation and non-alphanumeric characters from the reports and MeSH annotations. We limit the MeSH annotations to just one pathology/description pair by selecting the caption with the most common pathology. Additionally, as the negation of pathologies was generally standard across reports, we performed negation removal using regular expressions. Finally, the text reports were cropped or padded to 32 words, based on the average length of 20.23 words plus one standard deviation of 11.9. MeSH captions were cropped or padded to length 5, likewise based on the average plus one standard deviation. Empty reports were removed. This resulted in 3,023 unique report-MeSH term pairs, of which 300 were randomly selected for validation and 300 for testing.
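The report pre-processing steps can be sketched as below. The negation cue list, the choice to drop whole negated sentences, and the `<pad>` token are assumptions; the paper only states that negation was removed with regular expressions.

```python
import re

# Hypothetical cue list; the actual expressions used are not specified.
NEGATION_CUES = re.compile(r"\b(no|not|without|negative for)\b")
PAD = "<pad>"


def preprocess_report(text, max_len=32):
    """Lower-case, drop negated sentences, strip punctuation, crop/pad."""
    text = text.lower()
    # Naive sentence split; negation is handled before punctuation is
    # stripped so that sentence boundaries are still available.
    sentences = [s for s in text.split(".") if s.strip()]
    kept = [s for s in sentences if not NEGATION_CUES.search(s)]
    cleaned = re.sub(r"[^a-z0-9\s]", " ", " ".join(kept))
    tokens = cleaned.split()[:max_len]
    return tokens + [PAD] * (max_len - len(tokens))
```

Splitting on full stops is deliberately simple and would also split decimal numbers, so a real pipeline would likely need a more careful sentence splitter.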
4.2 Experimental Settings
We first investigate whether the structured reports can be learned from raw reports by creating a subset of 1,000 of the MeSH-annotated reports and training the text CNN on the report-MeSH pairs in this subset. The trained text CNN is then used to make MeSH predictions on the remaining raw reports, and these predictions (together with the gold-standard annotated subset from the previous step) are used to train the image-MeSH sequence model. We compare this to training the image-MeSH sequence model on the entire gold-standard annotated set of 3,023.
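The two-stage data construction can be sketched as a single function. This is a schematic under assumptions: `train_text_model` stands in for training the text CNN and is expected to return a prediction callable, and the tuple layout of the examples is hypothetical.

```python
import random


def build_image_mesh_set(gold, train_text_model, subset_size=1000, seed=0):
    """Assemble image-MeSH training pairs from a small annotated subset.

    gold: list of (report, image_id, mesh) triples with gold-standard MeSH
    train_text_model: callable taking (report, mesh) pairs and returning a
        function that maps a report to predicted MeSH terms
    """
    rng = random.Random(seed)
    shuffled = list(gold)
    rng.shuffle(shuffled)
    annotated = shuffled[:subset_size]
    remaining = shuffled[subset_size:]

    # Stage 1: train the text model on the annotated subset only.
    predict = train_text_model([(report, mesh) for report, _, mesh in annotated])

    # Stage 2: gold labels for the subset, predicted labels for the rest.
    pairs = [(image, mesh) for _, image, mesh in annotated]
    pairs += [(image, predict(report)) for report, image, _ in remaining]
    return pairs
```

The gold MeSH terms of the remaining reports are never read, which mirrors the setting where only a limited number of manual annotations is available.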
4.2.1 Text CNN
For the text CNN model, we use rectified linear units as the activation function on the convolutional layers, one-dimensional convolutional filters of widths 3, 4 and 5 with 512 feature maps for each filter, a dropout rate of p=0.5, and 254 hidden units for the dense layer. The weights of the loss terms are chosen through cross-validation, as described in Section 3.1. The model was trained through batch backpropagation with a batch size of 128, using Adam optimisation [10] with a learning rate of 0.001, for 100 epochs with early stopping. To compensate for the class imbalance of ‘normal’ vs. diseased cases, we select batches with a uniform distribution over the classes, augmenting the instances by sentence-shuffling.
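Class-balanced batch selection with sentence-shuffling augmentation might look like the sketch below. It assumes each report is filed under a single class and that shuffling is applied to text that still contains sentence boundaries; both function names are hypothetical.

```python
import random


def shuffle_sentences(report, rng):
    """Return the report with its sentences in a random order."""
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def balanced_batch(reports_by_class, batch_size, rng):
    """Draw a batch whose class labels are uniformly distributed.

    reports_by_class: dict mapping a class label to a list of reports
    """
    classes = sorted(reports_by_class)
    batch = []
    for _ in range(batch_size):
        label = rng.choice(classes)                   # uniform over classes
        report = rng.choice(reports_by_class[label])  # uniform within class
        batch.append((shuffle_sentences(report, rng), label))
    return batch
```

Rare classes are drawn as often as common ones, so their reports repeat across batches; the sentence shuffling gives those repeats some surface variety.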
4.2.2 Sequence Models
We use VGG16 [18] and ResNet50 [5] models, pre-trained on ImageNet [4], to extract the respective image embeddings. For RNN0, the joint image-word embedding dimension is set to 2048 for the VGG input and 1024 for ResNet. For RNN1 and RNN2, the dense transition layer dimension is set to 1024. For all the sequence models, the LSTM hidden state dimension is set to 512, and the LSTM units are unrolled up to 6 time steps (1 for the start token and 5 for the MeSH sequence). All models are trained with a batch size of 128, using Adam optimisation [9] with a learning rate of 0.001 and early stopping.
Table 1. MeSH term prediction metrics of the text CNN.
| Training sample size | Acc. | R | R-OC | R-OS | P | P-OC | P-OS |

Table 2. BLEU-1 / BLEU-2 / BLEU-3 / BLEU-4 scores of the sequence models.
| Learning to read [17] | 97.2 / 67.1 / 14.9 / 2.8 | 68.1 / 30.1 / 5.2 / 1.1 | 79.3 / 9.1 / 0.0 / 0.0 |
| RNN0+vgg16+all | 8.8 / 1.8 / 0.7 / 0.2 | 7.8 / 2.6 / 1.1 / 1.4 | 6.9 / 2.3 / 0.7 / 0.1 |
| RNN0+resnet50+all | 16.5 / 8.6 / 4.6 / 2.4 | 16.7 / 8.7 / 3.9 / 1.0 | 18.8 / 10.4 / 3.8 / 1.9 |
| RNN1+resnet50+all | 77.9 / 45.6 / 29.4 / 18.9 | 65.7 / 51.6 / 30.2 / 17.1 | 66.7 / 47.1 / 26.8 / 15.9 |
| RNN2+resnet50+all | 74.1 / 42.0 / 26.8 / 17.3 | 63.2 / 47.5 / 27.3 / 15.8 | 63.2 / 43.9 / 25.5 / 13.6 |
| RNN0+resnet50+pred | 22.9 / 15.5 / 7.8 / 4.0 | 13.6 / 8.3 / 4.0 / 0.9 | 14.7 / 9.3 / 2.7 / 1.5 |
| RNN1+resnet50+pred | 73.6 / 50.0 / 30.9 / 17.8 | 41.5 / 29.7 / 15.9 / 7.2 | 41.6 / 28.2 / 13.2 / 8.1 |
| RNN2+resnet50+pred | 69.4 / 47.6 / 29.6 / 16.7 | 39.4 / 28.0 / 14.6 / 6.7 | 39.8 / 26.4 / 12.7 / 8.0 |
5.0.1 Enriched Concept Extraction from Reports
We evaluate the MeSH term prediction from the text reports by calculating the total binary accuracy (Acc), precision (P) and recall (R), as well as the mean-over-class (P-OC, R-OC) and mean-over-samples (P-OS, R-OS) precision and recall over the 102 classes. In addition, we report the metrics of the ‘pathology’ classes separately; these classes were allocated manually according to the definitions in the MeSH online library [13]. Complete metrics are compared in Table 1.
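The three ways of aggregating precision and recall can be made concrete with a short sketch. The convention of scoring an undefined precision or recall as 0.0 is an assumption, as are the function names.

```python
def precision_recall(pairs):
    """Precision and recall from an iterable of (true, predicted) 0/1 pairs."""
    pairs = list(pairs)
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention
    return precision, recall


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def multilabel_metrics(y_true, y_pred):
    """y_true, y_pred: lists (samples) of equal-length 0/1 lists (classes)."""
    n_classes = len(y_true[0])
    flat = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    per_class = [
        precision_recall((ts[j], ps[j]) for ts, ps in zip(y_true, y_pred))
        for j in range(n_classes)
    ]
    per_sample = [precision_recall(zip(ts, ps)) for ts, ps in zip(y_true, y_pred)]
    total_p, total_r = precision_recall(flat)
    return {
        "Acc": mean(t == p for t, p in flat),
        "P": total_p,
        "R": total_r,
        "P-OC": mean(p for p, _ in per_class),
        "R-OC": mean(r for _, r in per_class),
        "P-OS": mean(p for p, _ in per_sample),
        "R-OS": mean(r for _, r in per_sample),
    }
```

The total scores pool every label decision, whereas the mean-over-class and mean-over-samples scores weight each class or each report equally, so rare classes influence the former far more than they influence the totals.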
5.0.2 Report Generation from Images
During inference, the first word is sampled from the LSTM, concatenated to the input, and used to predict subsequent words. The quality of the generated reports was evaluated by measuring BLEU [15] scores averaged over all the reports. BLEU scores are a form of n-gram precision commonly used for evaluating image captioning, as they maintain high correlation with human judgement. BLEU scores of RNN0, RNN1 and RNN2 trained on all gold-standard annotations and on the predictions made by the text CNN are presented in Table 2. RNN0 is the same framework used in [17]; however, they additionally train their model in a cascaded fashion. This significantly improves the model’s ability to predict the first word, but the model struggles to maintain visual correspondence when generating subsequent words, hence the steep reduction in higher n-gram precision. Additionally, cascaded models suffer from error propagation at test time, hence the poor performance on test data. RNN1 and RNN2 address both problems by conditioning the word generation process on the images at every time step and by being trained end-to-end, and hence achieve higher n-gram scores on the test data. In addition, we have shown that we can achieve comparably high BLEU metrics when training on the predicted MeSH terms made by the text CNN.
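The n-gram precision at the core of BLEU can be sketched as follows. This shows only the clipped (modified) precision for a single reference; full BLEU additionally combines several n-gram orders and applies a brevity penalty, and the example MeSH tokens are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipping.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference, so repeating a correct word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    return clipped / total
```

A sequence that gets the first word right but the following words wrong scores well on 1-grams and poorly on higher orders, which is the pattern discussed above for the cascaded model.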
We demonstrate how, given a small number of manual annotations, clinically and visually important concepts can be learned from raw textual radiology reports. We then demonstrate how these concepts can serve as radiological image annotations in an image-sequence learning model that auto-generates reports as part of a clinical decision support system.
References
[1] (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, pp. 17. Cited by: §2.1.
[2] (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537. Cited by: §3.1.
[3] (2013) Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8609–8613. Cited by: §4.2.1.
[4] (2009) ImageNet: a large-scale hierarchical image database. pp. 248–255. Cited by: §4.2.2.
[5] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.2.2.
[6] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §3.2.
[7] (2017) On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195. Cited by: §1, §2.2.
[8] (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §3.1.
[9] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.2.
[10] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.1.
[11] (2015) Challenges in clinical natural language processing for automated disorder normalization. Journal of Biomedical Informatics 57, pp. 28–37. Cited by: §2.1.
[12] (2017) Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. Cited by: §3.1.
[13] (Website) Cited by: §5.0.1.
[14] (Website) Cited by: Automated Enriched Medical Concept Generation for Chest X-ray Images, §4.1.
[15] (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Cited by: §5.0.2.
[16] (2016) Interleaved text/image deep mining on a large-scale radiology database for automated image interpretation. Journal of Machine Learning Research 17 (1-31), pp. 2. Cited by: §1, §2.1.
[17] (2016) Learning to read chest X-rays: recurrent neural cascade model for automated image annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2497–2506. Cited by: §1, §2.2, Table 2, §5.0.2.
[18] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.2.2.
[19] (2018) A systematic review of fatigue in radiology: is it a problem?. American Journal of Roentgenology 210 (4), pp. 799–806. Cited by: §1.
[20] (2017) In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 998–1007. Cited by: §1, §2.1.
[21] (2017) ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106. Cited by: §1, §2.1.
[22] (2019) Holistic and comprehensive annotation of clinically significant findings on diverse CT images: learning from radiology reports and label ontology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8523–8532. Cited by: §1, §2.1.
[23] (2017) TandemNet: distilling knowledge from medical images using diagnostic reports as optional semantic references. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 320–328. Cited by: §1, §2.2.
[24] (2017) MDNet: a semantically and visually interpretable medical image diagnosis network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6428–6436. Cited by: §1, §2.2.