1 Introduction
Training medical imaging models requires large amounts of expertly annotated data, which is time-consuming and expensive to obtain. Fortunately, medical images are often accompanied by free-text reports written by radiologists summarising their main findings (what the radiologist sees in the image, e.g. “hyperdensity”) and impressions (what the radiologist diagnoses based on the findings, e.g. “haemorrhage”). This information can be converted into structured labels which are used to train image analysis algorithms to detect the findings and to predict the impressions. Image-level labels have previously been provided for this purpose, e.g. as part of the RSNA haemorrhage detection challenge [17] and the CheXpert challenge for automated chest X-ray interpretation [9]. Reading a radiology report and assigning labels is not trivial and requires a certain degree of medical knowledge on the part of the human annotator. An alternative is to extract labels automatically, and in this paper we study the task of automatically labelling head computed tomography (CT) radiology reports.
Automatic extraction has traditionally been accomplished using expert medical knowledge to engineer a feature extraction and classification pipeline [24]; this was the approach taken by Irvin et al. to label the CheXpert dataset of chest X-rays [9] and by Gorinski et al. in the EdIE-R method for labelling head CT reports [8]. Such pipelines separate the individual tasks, such as named entity recognition and negation detection.
An alternative approach is to design an end-to-end machine learning model that learns to extract the final labels directly from the text. Simple approaches have been demonstrated using word embeddings or bag-of-words feature representations followed by logistic regression [25, 22]. More complex recurrent neural networks (RNNs) have been shown to be effective for document classification by many authors [23, 3], and Drozdov et al. [7] show that a bidirectional long short-term memory (Bi-LSTM) network with a single attention mechanism also works well for a binary task. However, with recent developments in transformer natural language processing (NLP) models such as Bidirectional Encoder Representations from Transformers (BERT) [6], it is easier than ever before to take existing pre-trained models that have learnt underlying language patterns and fine-tune them on small domain-specific datasets. This was the approach taken by Wood et al. in the Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM) model for labelling head magnetic resonance imaging (MRI) reports [21]. Specifically, they use BioBERT [1] as the base model, which has been pre-trained on PubMed abstracts rather than Wikipedia, to obtain contextualised embeddings for each input token, and then apply a further attention mechanism to this embedding. Wood et al. perform a binary classification of normal versus abnormal radiology reports, where abnormality is determined by a number of criteria during data annotation. BERT has also been used for multi-label classification of radiology reports by Smit et al. [19], who show that BERT can outperform the previous state of the art across 13 different labels on the open-source CheXpert dataset [9]. Mullenbach et al. proposed per-label attention in a similar document classification task (clinical coding) in their Convolutional Attention for Multi-Label classification (CAML) model [15]. In this paper, inspired by [15], we extend existing state-of-the-art models with a label-dependent attention mechanism. Our contributions are to:
- Propose a set of radiographic findings and clinical impressions for labelling of head CT scans for suspected stroke patients.
- Show that a multi-headed model with per-label attention improves accuracy compared to a simple multi-label softmax output.
- Show that simple synthetic data significantly improves task performance, especially for classification of rarer labels.
2 Data
Below we describe the three datasets used in this work.
2.0.1 NHS GGC dataset:
Our target dataset contains 230 radiology reports supplied by the NHS Greater Glasgow and Clyde (GGC) Safe Haven. We have the required ethical approval (iCAIRD project number: 104690; University of St Andrews: CS14871) to use this data. A synthetic example report with a similar format to the NHS GGC reports can be seen in Figure 1.
[Figure 1: A synthetic example report in the format of the NHS GGC radiology reports.]
[Figure 2: The complete list of the 31 radiographic findings and clinical impressions used as labels.]
A list of 31 radiographic findings and clinical impressions found in stroke radiology reports was collated by a clinical researcher; this is the set of labels that we aim to classify. Figure 2 shows the complete list of these labels. Each sentence is labelled for each finding or impression as “positive”, “uncertain”, “negative” or “not mentioned”, the same certainty classes as used by Smit et al. [19]. The most common labels, such as “haemorrhage”, “infarct” and “hyperdensity”, have between 200 and 400 mentions each (100–200 negative, 0–50 uncertain, 100–200 positive), while the rarest labels, such as “abscess” or “cyst”, occur only once in the dataset.
During the annotation process, the reports were manually split into sentences by the clinical researcher, resulting in 1,353 sentences, which we split into training and validation datasets (due to the limited number of annotated reports, we do not have a separate test set). Each sentence was annotated independently; however, we allocate sentences from the same original radiology report to the same dataset to avoid data leakage.
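A group-aware split achieves this report-level allocation; the following is a minimal sketch using scikit-learn (the `report_ids` argument and the validation fraction shown are illustrative, not the exact procedure used):

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_report(sentences, labels, report_ids, val_fraction=0.4, seed=0):
    """Split sentences into training/validation sets such that all sentences
    originating from the same radiology report land in the same subset."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(splitter.split(sentences, labels, groups=report_ids))
    train = [(sentences[i], labels[i]) for i in train_idx]
    val = [(sentences[i], labels[i]) for i in val_idx]
    return train, val
```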
2.0.2 Synthetic dataset:
We augment our training dataset by synthesising 5 sentences for each label as follows:
- “There is [label].” (positive)
- “There is [label] in the brain.” (positive)
- “[Label] is evident in the brain.” (positive)
- “There may be [label].” (uncertain)
- “There is no [label].” (negative)
For the labels “haemorrhage/haematoma/contusion”, “evidence of surgery/intervention”, “vessel occlusion (embolus/thrombus)”, and “involution/atrophy”, we synthesise sentences for each variant. There are 180 synthetic sentences in total.
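The synthesis amounts to simple template expansion; a sketch is given below (the `TEMPLATES` list mirrors the five patterns above, while the `variants` mapping for the slash-separated labels is an illustrative assumption):

```python
# Template-based synthesis; the five templates mirror the patterns listed above.
TEMPLATES = [
    ("There is {}.", "positive"),
    ("There is {} in the brain.", "positive"),
    ("{} is evident in the brain.", "positive"),
    ("There may be {}.", "uncertain"),
    ("There is no {}.", "negative"),
]

def synthesise_sentences(labels, variants=None):
    """Return (sentence, label, certainty) triples. `variants` maps a label to
    its textual variants, e.g. {"involution/atrophy": ["involution", "atrophy"]};
    labels without variants use the label string itself."""
    variants = variants or {}
    rows = []
    for label in labels:
        for text in variants.get(label, [label]):
            for template, certainty in TEMPLATES:
                rows.append((template.format(text), label, certainty))
    return rows
```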
2.0.3 MIMIC-III dataset:
To pre-train the word embedding, we use clinical notes from the MIMIC-III dataset [11]; in total, 2,083,180 documents from 46,146 patients.
The datasets are summarised in Table 1.
Dataset | #patients | #reports | #sentences |
---|---|---|---|
NHS GGC – Training | 138 | 138 | 838 |
NHS GGC – Validation | 92 | 92 | 515 |
Synthetic data | - | - | 180 |
MIMIC-III | 46,146 | 2,083,180 | 99,718,301 |
3 Methods
Below we describe the methods compared in this paper (all implemented in Python). We denote our set of labels as $\mathcal{L}$ and our set of certainty classes as $\mathcal{C}$, with the number of labels $L = |\mathcal{L}|$ and the number of certainty classes $C = |\mathcal{C}|$. For the NHS GGC dataset, $L = 31$ and $C = 4$. For all methods, data is pre-processed by extracting sentences and words using the NLTK library [13], removing punctuation, and converting to lower case. Hyperparameter search was performed through manual tuning on the validation set, based on the micro-averaged F1 metric.
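A minimal sketch of this pre-processing, assuming NLTK's default Punkt sentence tokeniser and treebank word tokeniser:

```python
import string
from nltk.tokenize import sent_tokenize, word_tokenize  # requires the NLTK 'punkt' data

def preprocess(report_text):
    """Split a report into sentences of lower-cased tokens with punctuation removed."""
    processed = []
    for sentence in sent_tokenize(report_text):
        tokens = [t.lower() for t in word_tokenize(sentence)
                  if t not in string.punctuation]
        processed.append(tokens)
    return processed
```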
3.1 Simple machine learning approaches
3.1.1 BoW + RF:
The Bag of Words + Random Forest (BoW + RF) model uses a bag-of-words representation as its input. We train one model per label, since this gives the most accurate results, resulting in 31 random forest classifiers. Random forest classifiers are quick to train and apply, so multiple models are still practical in a real use case. We use the scikit-learn [16] implementation with 100 estimators, a maximum depth of 10, and 200 maximum features.
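A sketch of one such per-label classifier in scikit-learn is given below; wrapping the bag-of-words step and the forest in a single pipeline, and applying the 200-feature cap as the forest's `max_features`, are assumptions on our part:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

def build_bow_rf_models(label_names):
    """One bag-of-words + random forest pipeline per label (31 in total)."""
    return {
        label: make_pipeline(
            CountVectorizer(lowercase=True),
            RandomForestClassifier(n_estimators=100, max_depth=10, max_features=200),
        )
        for label in label_names
    }

# Usage: models[label].fit(train_sentences, train_certainties[label]),
# where train_sentences are raw sentence strings.
```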
3.1.2 Word2Vec:
The Word2Vec [14] baseline uses a pre-trained word embedding of size $d_e$. The embedding is pre-trained on the MIMIC-III dataset described in Section 2 for 30 epochs using the gensim [18] library; the vocabulary size is 107,497 words. The word vectors for the input sentence are averaged and passed through a fully connected single-layer neural network mapping to an output layer of size $L \times C$. This network is trained with a constant learning rate of 0.001, a batch size of 16 and an embedding size of $d_e = 200$. This and all following models are trained for a maximum of 200 epochs with early stopping (patience of 25 epochs) on micro-averaged F1.
3.2 Deep learning: Per-label attention mechanism
When training neural networks, we find that accuracy can be reduced when there are many classes. Here we describe the per-label attention mechanism [2], shown in Figure 3, an adaptation of the multi-label attention mechanism in the CAML model [15]. It can be applied to the output of any given neural network subarchitecture. We define the output of the subnetwork as $H \in \mathbb{R}^{N \times d_h}$, where $N$ is the number of tokens and $d_h$ is the hidden representation size. The parameters we learn are the weights $W$ and bias $b$. Furthermore, for each label $l$ we learn an independent vector $u_l$ used to calculate an attention vector $\alpha_l$ over the $N$ tokens. The attended outputs are then passed through $L$ parallel classification layers, each reducing dimensionality from $d_h$ to $C$.
[Figure 3: The per-label attention mechanism.]
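A minimal PyTorch sketch of a per-label attention module of this kind is shown below; the tanh projection and dot-product scoring follow common attention formulations, and the exact parameterisation in our models may differ:

```python
import torch
import torch.nn as nn

class PerLabelAttention(nn.Module):
    """Per-label attention over token representations H of shape (batch, N, d_h).
    Each label l gets its own attention vector u_l and its own C-way classifier."""

    def __init__(self, d_h, num_labels, num_classes):
        super().__init__()
        self.proj = nn.Linear(d_h, d_h)                      # projection with weights W and bias b
        self.u = nn.Parameter(torch.empty(num_labels, d_h))  # one attention vector per label
        nn.init.xavier_uniform_(self.u)
        self.classifiers = nn.ModuleList(
            [nn.Linear(d_h, num_classes) for _ in range(num_labels)]
        )

    def forward(self, H):
        # scores[b, l, n]: relevance of token n to label l
        scores = torch.einsum("bnd,ld->bln", torch.tanh(self.proj(H)), self.u)
        alpha = torch.softmax(scores, dim=-1)              # attention over the N tokens
        attended = torch.einsum("bln,bnd->bld", alpha, H)  # one d_h vector per label
        logits = torch.stack(
            [clf(attended[:, l]) for l, clf in enumerate(self.classifiers)], dim=1
        )
        return logits                                      # (batch, num_labels, num_classes)
```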
3.3 Deep learning: Neural network models
We pre-process the data before input to the neural network architectures. Each input sentence is limited to $N$ tokens and padded with zeros to reach this length if the input is shorter. We choose $N$ to be larger than the maximum number of words in any of the sentences in the NHS GGC dataset. The neural network models all finish with $L$ softmax classifier outputs, each with $C$ classes. Models are trained using a weighted categorical cross-entropy loss and the Adam optimiser [12]. We weight across the labels but not across the certainty classes, as the latter did not give any improvements. Given a parameter $\alpha$, the number of sentences $n$ and the number of “not mentioned” occurrences $n_l$ of a label $l$, we calculate a weight $w_l$ for each label from the training data.
3.3.1 CAML:
The CAML model follows the implementation by Mullenbach et al. [15] and uses an embedding that is initialised to the same pre-trained weights as for the Word2Vec baseline. The embedded input passes through a convolutional layer with graduated filter sizes applied in parallel (see below), followed by max-pooling operations across each graduated set of filters, to produce our intermediate representation $H$. This is then passed through the per-label attention mechanism introduced by the CAML model. For the convolutional layer, we chose 512 CNN filter maps with kernel sizes of 2 and 4. The model was trained with a learning rate of 0.0005 and a batch size of 16.
3.3.2 Bi-GRU:
The embedding is initialised to the same pre-trained weights as used for the Word2Vec baseline. The embedded sentence passes through a bidirectional GRU (Bi-GRU) network [5] with hidden size $d_h$. The outputs from both directions are concatenated to produce a representation for each input sentence. For Bi-GRU + single attention, this representation is passed through a single attention mechanism. For Bi-GRU + per-label attention, this representation is passed through the per-label attention mechanism. The model was trained with a learning rate of 0.0005, a batch size of 16 and hidden size $d_h$.
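A sketch of the Bi-GRU + per-label attention variant, reusing the `PerLabelAttention` module sketched in Section 3.2 (the hidden size and initialisation details shown are placeholders):

```python
import torch.nn as nn

class BiGRUPerLabelClassifier(nn.Module):
    """Embedding -> bidirectional GRU -> per-label attention; a sketch of the
    'Bi-GRU + per-label attention' variant. `PerLabelAttention` is the module
    sketched in Section 3.2."""

    def __init__(self, embedding_weights, d_h, num_labels, num_classes):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
        self.gru = nn.GRU(embedding_weights.size(1), d_h,
                          batch_first=True, bidirectional=True)
        self.attention = PerLabelAttention(2 * d_h, num_labels, num_classes)

    def forward(self, token_ids):
        H, _ = self.gru(self.embedding(token_ids))  # (batch, N, 2 * d_h)
        return self.attention(H)                    # (batch, num_labels, num_classes)
```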
3.3.3 BERT and BioBERT:
The BERT model is a standard pre-trained BERT model, “bert-base-uncased”; the weights are available for download online (https://github.com/google-research/bert) and we use the huggingface [20] implementation. We take the output representation for the [CLS] token (of size 768, at position 0) and follow it with the softmax outputs. The model was trained with a learning rate of 0.0001 and a batch size of 32. For BioBERT, we use a Bio-/ClinicalBERT model pre-trained on both PubMed abstracts and the MIMIC-III dataset (https://github.com/th0mi/clinicalBERT) with the huggingface BERT implementation. We use the same training parameters as for BERT (above).
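A minimal sketch of the BERT variant is given below, assuming a recent version of the huggingface transformers library; the multi-head classifier wrapper is our own illustration of the $L$ parallel softmax outputs, not the exact implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertMultiHeadClassifier(nn.Module):
    """BERT encoder followed by L parallel C-way classification heads applied
    to the [CLS] representation at position 0."""

    def __init__(self, model_name, num_labels, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # e.g. "bert-base-uncased"
        hidden = self.encoder.config.hidden_size              # 768 for bert-base
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, num_classes) for _ in range(num_labels)]
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                      # [CLS] token representation
        return torch.stack([head(cls) for head in self.heads], dim=1)
```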
3.3.4 ALARM:
Our implementation of the ALARM [21] model uses the BioBERT model (and training parameters) described above. Following the implementation details of Wood et al., instead of using a single output vector of size 768, we extract the entire learnt representation of size $N \times 768$. For the ALARM + softmax model, we pass this through a single attention vector and then through three fully connected layers of decreasing size down to the outputs. For ALARM + per-label attention, we employ per-label attention mechanisms instead of a single shared attention mechanism, before passing through three fully connected layers per label.
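The ALARM + per-label attention variant can be sketched by applying the per-label attention module from Section 3.2 to the full token representation; for brevity, this sketch collapses the three fully connected layers per label into the single classification layer of that module:

```python
import torch.nn as nn
from transformers import AutoModel

class AlarmPerLabelClassifier(nn.Module):
    """BioBERT token representations -> per-label attention -> per-label heads;
    a simplified sketch of the 'ALARM + per-label attention' variant."""

    def __init__(self, model_name, num_labels, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.attention = PerLabelAttention(self.encoder.config.hidden_size,
                                           num_labels, num_classes)  # see Section 3.2

    def forward(self, input_ids, attention_mask):
        H = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state  # (batch, N, 768)
        return self.attention(H)   # (batch, num_labels, num_classes)
```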
4 Results
Tables 2 and 3 show the results. We report the micro-averaged F1 score as our main metric, calculated across all labels. We also report the macro-averaged F1 score; this is the F1 score averaged across all labels with equal weighting for each label. We note that although we used micro F1 as our early stopping criterion, we do not observe an obvious difference in the scores if macro F1 is used for early stopping. We exclude the “not mentioned” certainty class from our metrics, similar to the approach used by Smit et al. [19]; denoting the remaining certainty classes as $\mathcal{C}'$, we have $C' = |\mathcal{C}'| = 3$. When we report F1 metrics for a single certainty class we report the usual F1 metric, whereas when we report metrics for all classes and labels we report an average per certainty class.
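One way to compute such metrics with scikit-learn, assuming certainty classes are encoded as integers with 0 for “not mentioned” (the encoding and the exact averaging are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

def certainty_f1(y_true, y_pred, classes=(1, 2, 3), average="micro"):
    """F1 per certainty class (here 1/2/3 = negative/uncertain/positive,
    0 = 'not mentioned'), plus their mean as the 'All' figure.
    y_true and y_pred have shape (n_sentences, n_labels)."""
    t, p = np.asarray(y_true).ravel(), np.asarray(y_pred).ravel()
    scores = {c: f1_score(t, p, labels=[c], average=average) for c in classes}
    scores["all"] = float(np.mean(list(scores.values())))
    return scores
```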
For all experiments, we use a machine with NVIDIA GeForce GTX 1080 Ti GPU (11GB of VRAM), Intel Xeon CPU E5 v3 (6 physical cores, maximum clock frequency of 3.401 GHz) and 32GB of RAM. Training run times range from 14 seconds for the Random Forest model to 376 seconds for the Bi-GRU + per-label attention model and 1448 seconds for the ALARM + per-label-attention model. For details of all run times, see Table 4.
Model | All | Negative | Uncertain | Positive |
---|---|---|---|---|
BoW + RF | ||||
Word2Vec | ||||
CAML [15] | ||||
Bi-GRU | ||||
Bi-GRU + single attention | ||||
Bi-GRU + per-label attention | ||||
BERT | ||||
BioBERT | ||||
ALARM + softmax | ||||
ALARM + per-label attention |
Model | All | Negative | Uncertain | Positive |
---|---|---|---|---|
BoW + RF | ||||
Word2Vec | ||||
CAML [15] | ||||
Bi-GRU | ||||
Bi-GRU + single attention | ||||
Bi-GRU + per-label attention | ||||
BERT | ||||
BioBERT | ||||
ALARM + softmax | ||||
ALARM + per-label attention |
4.0.1 Per-label attention:
The micro- and macro-averaged F1 scores (Tables 2 and 3) show that for both the BioBERT and the Bi-GRU models, adding per-label attention improves performance consistently over the models with a single attention mechanism (the improvements are statistically significant). We also show the breakdown of accuracies across the certainty classes (negative, uncertain and positive) in our results tables. It can be seen that per-label attention provides large gains in accuracy across all classes. The macro F1 metric amplifies this because all labels are weighted equally, giving an idea of how the model performs for the rarer labels, several of which have fewer than 10 training samples each.
[Figure 4: Attention weights learnt by (a) a per-label attention model, (b) a single-attention model, and (c) a per-label attention model trained without synthetic data.]
Figure 4 compares the attention learnt by a single-attention model to that learnt by per-label attention models. We see that the single attention vector (Figure 4b) attends to the correct words, “congenital” and “haemorrhage”; however, the model incorrectly predicts both labels as “not mentioned”. In comparison, the model with per-label attention (Figure 4a) recognises the same keywords separately within the respective label attention mechanisms and correctly predicts both labels as “positive”. This makes sense because the single attention mechanism does not have separate follow-on representations, so the features for all labels are entangled in one representation. Finally, the model trained without synthetic data (Figure 4c) does not recognise the “congenital” keyword and does not make the correct prediction for this label.
4.0.2 Synthetic data and importance of pre-training:
To investigate the effect of the synthetic training data, we train models on the synthetic data only, the NHS GGC data only, and both combined. The macro F1 results in Figure 5 clearly show an improvement when the synthetic data is used alongside the original data; this is consistent across both of our best models (the improvements are statistically significant). For numerical results see Tables 5 and 6.
We also investigated the effect of the embedding pre-training. A model with randomly initialised embeddings (maintaining the same vocabulary and embedding size) performs 0.028 worse on the micro-averaged F1 than a model using a pre-trained embedding (a statistically significant difference).
4.0.3 Error Analysis:
When investigating the prediction errors of our best model, we find that approximately 30% of errors are due to missed labels, 10% are due to falsely predicted labels, and the remaining 60% are due to confusion between certainty classes (negative, uncertain, positive). Many of the missed labels are caused by previously unseen synonyms or subtypes; for instance, “arteriovenous malformation” is an instance of “congenital abnormality”, which is a diverse class. There are also many subtly different ways of expressing certainty; for instance, a positive finding might be expressed as “probable”, “likely”, “indicates”, “suggestive of” or “is consistent with”, whereas uncertainty might be expressed as “possible”, “may represent”, “could indicate” or “is suspicious of”. Such errors might be mitigated with a larger training dataset and richer data synthesis, potentially by exploiting medical knowledge bases such as UMLS [4] to augment the synthetic dataset with a rich synonym set.
5 Conclusions and Future Work
We have introduced a set of radiographic findings and clinical impressions that are relevant for stroke and can be extracted from head CT radiology reports. For deep learning approaches, we have shown that per-label attention and a simple synthetic dataset each improve accuracy for our multi-label classification task, yielding a recipe for scalable learning of many labels. In future work, we intend to annotate a larger dataset and to leverage knowledge bases to create a richer synthetic dataset. Furthermore, the labels generated by our models will be used to train image analysis algorithms on the associated head CT scans.
6 Acknowledgements
This work is part of the Industrial Centre for AI Research in digital Diagnostics (iCAIRD) which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI) [project number: 104690]. We would like to thank the Glasgow Safe Haven for assistance in creating and providing this dataset. Thanks also to The Data Lab for support and funding.
Appendix
Model | #Parameters | Training time [s] | Inference time [s/sample]
---|---|---|---
BoW + RF | n/a | ||
Word2Vec | |||
CAML [15] | |||
Bi-GRU | |||
Bi-GRU + single attention | |||
Bi-GRU + per-label attention | |||
BERT | |||
BioBERT | |||
ALARM + softmax | |||
ALARM + per-label attention |
Model | Embedding | Data | All | Negative | Uncertain | Positive |
---|---|---|---|---|---|---|
Bi-GRU | MIMIC | S | ||||
Bi-GRU | MIMIC | N-S | ||||
Bi-GRU | Random | N+S | ||||
Bi-GRU | MIMIC | N+S | ||||
ALARM | MIMIC | S | ||||
ALARM | MIMIC | N-S | ||||
ALARM | MIMIC | N+S |
Model | Embedding | Data | All | Negative | Uncertain | Positive |
---|---|---|---|---|---|---|
Bi-GRU | MIMIC | S | ||||
Bi-GRU | MIMIC | N-S | ||||
Bi-GRU | Random | N+S | ||||
Bi-GRU | MIMIC | N+S | ||||
ALARM | MIMIC | S | ||||
ALARM | MIMIC | N-S | ||||
ALARM | MIMIC | N+S |
References
- [1] (2019-06) Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, pp. 72–78. External Links: Document Cited by: §1.
- [2] (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §3.2.
- [3] (2019) Hierarchical transfer learning for multi-label text classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6295–6300. Cited by: §1.
- [4] (2004-01) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (90001), pp. 267D–270. External Links: Document Cited by: §4.0.3.
- [5] (2014-10) On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, pp. 103–111. External Links: Document Cited by: §3.3.2.
- [6] (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document Cited by: Table 4, §1.
- [7] (2020) Supervised and unsupervised language modelling in chest x-ray radiological reports. Plos one 15 (3), pp. e0229963. Cited by: §1.
- [8] (2019) Named entity recognition for electronic health records: a comparison of rule-based and machine learning approaches. arXiv preprint arXiv:1903.03985. Cited by: §1.
- [9] (2019) CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. Cited by: §1, §1, §1.
- [10] (2015-05) Association between brain imaging signs, early and late outcomes, and response to intravenous alteplase after acute ischaemic stroke in the third international stroke trial (IST-3): secondary analysis of a randomised controlled trial. The Lancet Neurology 14, pp. 485–496. External Links: Document Cited by: Figure 2.
- [11] (2016) MIMIC-iii, a freely accessible critical care database. Scientific data 3, pp. 160035. Cited by: §2.0.3.
- [12] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §3.3.
- [13] (2002) NLTK: the natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Philadelphia: Association for Computational Linguistics. Cited by: §3.
- [14] (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §3.1.2.
- [15] (2018-06) Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1101–1111. External Links: Document Cited by: Table 4, §1, §3.2, §3.3.1, Table 2, Table 3.
- [16] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.1.1.
- [17] RSNA Intracranial Hemorrhage Detection (Kaggle challenge). Note: https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/overview Cited by: §1.
- [18] (2010-05-22) Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50 (English). Cited by: §3.1.2.
- [19] (2020) CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using bert. arXiv preprint arXiv:2004.09167. Cited by: §1, §2.0.1, §4.
- [20] (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §3.3.3.
- [21] (2020) Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM). In Medical Imaging with Deep Learning, External Links: Link Cited by: Table 4, §1, §3.3.4.
- [22] (2016) Automated outcome classification of computed tomography imaging reports for pediatric traumatic brain injury. Academic Emergency Medicine 23 (2), pp. 171–178. External Links: Document, https://onlinelibrary.wiley.com/doi/pdf/10.1111/acem.12859 Cited by: §1.
- [23] (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 1480–1489. Cited by: §1.
- [24] (2013) A text processing pipeline to extract recommendations from radiology reports. Journal of biomedical informatics 46 (2), pp. 354–362. Cited by: §1.
- [25] (2018) Natural language–based machine learning models for the annotation of clinical radiology reports. Radiology 287 (2), pp. 570–580. Cited by: §1.