Extracting Radiological Findings With Normalized Anatomical Information Using a Span-Based BERT Relation Extraction Model

by Kevin Lybarger, et al.

Medical imaging is critical to the diagnosis and treatment of numerous medical problems, including many forms of cancer. Medical imaging reports distill the findings and observations of radiologists, creating an unstructured textual representation of unstructured medical images. Large-scale use of this text-encoded information requires converting the unstructured text to a structured, semantic representation. We explore the extraction and normalization of anatomical information in radiology reports that is associated with radiological findings. We investigate this extraction and normalization task using a span-based relation extraction model that jointly extracts entities and relations using BERT. This work examines the factors that influence extraction and normalization performance, including the body part/organ system, frequency of occurrence, span length, and span diversity. It discusses approaches for improving performance and creating high-quality semantic representations of radiological phenomena.







Radiology reports contain detailed descriptions of diverse clinical abnormalities based on radiologists’ interpretation of medical imaging. Although structured reports with semantic representations of medical concepts have been developed,[15] nearly all radiology reports convey findings through unstructured text.[20]

Semantic representations of radiological findings could be automatically generated using natural language processing (NLP) information extraction techniques. These automatically derived semantic representations would enable a wide range of applications, including ground-truth labeling for artificial intelligence applications of medical images,[22] translation of reports into lay language for patients, integration with clinical decision support,[5] cross-specialty diagnosis correlation,[8] automated impression generation,[19] semantic searching of reports,[9] and timely follow-up of recommendations.[14]

We are currently conducting a large-scale clinical and economic analysis of incidental findings (incidentalomas) in radiology reports, focusing on six organ systems with the highest probability of incidental malignancy (thyroid, lung, adrenal glands, kidneys, liver, and pancreas). Incidentaloma identification requires the extraction of radiological findings and conversion of these findings to a structured semantic representation. To develop data-driven extraction models, we designed an event-based annotation schema and annotated computed tomography (CT) reports. Each finding event is characterized by a trigger and set of attributes (assertion, anatomy, characteristics, size, size-trend, size count). In this paper, we use this gold standard corpus to explore the extraction of radiological findings with normalized anatomy information. We extract radiological findings and associated anatomies as a relation extraction task, where the extracted anatomies are normalized to a set of 56 pre-defined anatomy labels. We investigate this relation extraction task using Eberts and Ulges’s Span-based Entity and Relation Transformer (SpERT).[7] SpERT is a state-of-the-art BERT model that jointly extracts entities and relations using span and relation output layers.

As part of an ablation study, we use the gold anatomical spans to explore anatomy normalization, without extraction, to better understand the normalization task and the role of context. In this normalization experimentation, anatomy phrases are normalized at 0.89 micro-averaged F1. In the extraction experimentation, finding spans are extracted at 0.83-0.92 F1, anatomy spans are extracted at 0.72-0.79 F1, and finding-anatomy relations are extracted at 0.63-0.72 F1. We explore the relationship between extraction performance, span length and diversity, and anatomy frequency. This work leverages state-of-the-art transformer-based extraction approaches and provides insight into the extraction of key finding and anatomy information from radiology reports.

Related Work

There is a large body of biomedical entity normalization work exploring the mapping of text spans to fixed vocabularies. A frequently explored ontology is the Unified Medical Language System (UMLS),[2] which includes the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) and RxNorm. The 2019 National NLP Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task explored the normalization of pre-defined text spans in clinical notes to SNOMED CT and RxNorm concepts. Top performing teams used dictionary and string matching, cosine distance, retrieval and ranking, and deep learning, with the highest performing system utilizing deep learning.[10] With large concept vocabularies, a frequently explored approach utilizes a two-step process, where a retrieval model identifies top candidate concepts and then a reranking model identifies the single best concept.[3, 4, 11, 21] Chen et al. normalize biomedical entities to SNOMED CT concepts, using knowledge sources to identify candidate concepts and an ensemble of machine learning approaches to identify target concepts.[3] Datta et al. and Ji et al. explore biomedical entity normalization tasks using BM25 to identify top concept candidates and BERT to select the top concept.[4, 11] In the n2c2 challenge, Xu et al. use a Lucene-based search that utilizes the UMLS and a BERT-based reranker.[21]

In this work, the anatomical concept vocabulary does not necessitate a retrieval model for identifying top candidates; however, we do utilize BERT-based models for identifying anatomy concepts. Tutubalina et al. investigate the normalization of medical concepts in social media posts to SNOMED CT concepts using a bidirectional recurrent neural network (RNN) and attention network to classify spans, incorporating semantic information from the UMLS.[17] Wang et al. explore a hierarchical anatomy normalization task with nine body parts (e.g. head and chest) and 41 sub-body parts (e.g. skull and brain).[18] Wang et al. use Wikipedia as an anatomical knowledge source and explore different scoring functions for comparing anatomical entities to anatomical wiki pages.

Recent work also explores both the extraction and normalization of biomedical entities, including anatomical spans. Tahmasebi et al. implement an unsupervised approach where anatomical phrases are identified using SNOMED CT and grammar-based patterns.[16] Anatomical phrases are normalized by representing each phrase as the weighted sum of word embeddings and comparing the cosine similarity between anatomical phrases and target concept labels. This unsupervised approach outperforms a stacked bidirectional RNN and conditional random fields (CRF) model. Tahmasebi et al. identify 56 anatomical class labels corresponding to SNOMED CT IDs, which we use in this work. In a sequence tagging task, Zhu et al. predict eight anatomy classes (brain, breast, kidney, liver, lung, prostate, thyroid, and other) using a stacked bidirectional long short-term memory network (bi-LSTM) and CRF that incorporates sentence-level context vectors that are learned to predict the presence of each anatomical class in the sentence.[23] Zhu et al. experiment with incorporating sentence-level and report-level context and find that incorporating report-level context improves classification performance. Similar to Zhu et al., we also explore the role of context in normalizing anatomical spans. Our work is differentiated from this prior work in that we extract anatomical information related to medical findings, and the anatomical phrases are normalized to a larger anatomy vocabulary.



Figure 1: Annotation example

This work utilizes an annotated data set created by Lau et al.,[12] which includes 500 randomly selected CT reports from an existing clinical data set from the University of Washington Medical Center and Harborview Medical Center. The existing clinical data set includes 706,908 CT reports authored from 2008-2018. The annotated reports use an event-based annotation scheme to characterize two types of findings: (1) lesion findings (e.g. mass or tumor) and (2) other medical problem findings (e.g. fracture or lymphadenopathy). These findings are characterized across multiple dimensions, including assertion (e.g. present vs. absent), anatomy, count, size, and other attributes. The inter-rater agreement on event annotations in 30 notes is 0.83 F1. Although the corpus is annotated with several attributes related to findings, including lesions, this work focuses on the extraction of findings and the associated anatomical information. We collectively refer to the lesion findings and other medical problem findings as Finding.

Lau et al.’s annotated corpus includes Anatomy annotations without anatomy normalization labels.[12] We augment the Anatomy annotations to include the Anatomy Subtype labels defined in Table 1. These Anatomy Subtype labels are based on Tahmasebi et al.’s work identifying anatomical terms using unsupervised learning.[16] The terms have associated SNOMED CT concept identifiers and represent all human organ systems, anatomic labels, and body regions. The Anatomy Subtype labels normalize the Anatomy spans, allowing the extracted finding and anatomy information to be more readily used in secondary use analyses.

Anatomy Subtype labels
Abdomen | Gallbladder | Nasal sinus | Seminal vesicle
Adrenal gland | Head | Neck | Spleen
Back | Heart | Nervous | Stomach
Bile Duct | Integumentary | Nose | Testis
Bladder | Intestine | Ovary | Thorax
Brain | Kidney | Pancreas | Thyroid
Breast | Laryngeal | Pelvis | Trach.
Cardio | Liver | Penis | Upper limb
Diaphragm | Lower limb | Pericardial sac | Urethra
Digestive | Lung | Peritoneal sac | Uterus
Ear | Lymphatic | Pharynx | Vagina
Esophagus | Mediastinum | Pleural sac | Vas deferens
Eye | Mouth | Prostate | Vulva
Fallopian tube | MSK | Retroperitoneal | Whole body
Table 1: Anatomy Subtype labels. Abbreviated terms include Cardiovascular (Cardio), Musculoskeletal (MSK), and Tracheobronchial (Trach.). indicates systems, like the Nervous System.

Figure 2: Anatomy Subtype label distribution for training set

We approach this radiological information extraction task as a relation extraction task, where spans are identified, mapped to a fixed set of classes, and linked through relations. Figure 1 presents example annotations. The entity types include Finding and Anatomy, although the phrases are not strictly noun phrases. Unlike a typical entity annotation, the Anatomy entities include Anatomy Subtype labels corresponding to the 56 anatomies defined in Table 1. We represent the Finding-Anatomy pairs as asymmetric relations, where the relation head is a Finding entity and the tail is an Anatomy entity. There is only a single relation type, has, so the Finding-Anatomy pairing can be interpreted as a binary classification task (connected vs. not connected).

The annotated corpus includes 500 CT reports, with 10,409 Finding entities, 5,081 Anatomy entities, and 6,295 Finding-Anatomy relations.[12] There are more Finding-Anatomy relations than Anatomy entities, because a given Anatomy entity can be associated with multiple findings. The corpus includes approximately 19K sentences and 203K tokens and is randomly split into training (70%), validation (10%), and test (20%) sets. Figure 2 presents the 20 most frequently annotated Anatomy Subtypes in the training set. Musculoskeletal system (MSK), Cardiovascular system (Cardio), and Lung are the most frequent Anatomy Subtypes, and there are many subtypes that occur infrequently or are absent from the data set. This skewed distribution is the result of randomly sampling the reports in the annotated corpus.

Information Extraction

We extract the radiological findings and related anatomy using Eberts and Ulges’s SpERT model.[7] SpERT jointly extracts entities and relations using a pre-trained BERT[6] model with output layers that classify spans and predict the relations between spans. SpERT achieves state-of-the-art performance in three entity and relation extraction tasks, including open domain information extraction (CoNLL04), science information extraction (SciERC), and adverse drug event extraction (ADE).[7] The SpERT framework is presented in Figure 3.

Input encoding: Each sentence is tokenized and converted to BERT word pieces. BERT generates a contextualized representation for the sentence, yielding a sequence of word-piece embeddings $(c, e_1, e_2, \ldots, e_n)$, where $c$ is the sentence-level representation associated with the [CLS] token, $e_i$ is the $i$-th word piece embedding, and $n$ is the sequence length.

Figure 3: SpERT framework

Span Classification: The span classifier predicts labels for each span, $s = (e_i, e_{i+1}, \ldots, e_{i+k})$, where the width of the span is $k+1$ word pieces. A learned matrix of span width embeddings, $w$, is used to incorporate a span width prior in the classification of spans and relations. A fixed length representation of the span, $e(s)$, is created by max pooling the associated BERT embeddings and looking up the relevant span width embedding, as

$$e(s) = \mathrm{maxpool}(e_i, e_{i+1}, \ldots, e_{i+k}) \circ w_{k+1} \tag{1}$$

where $\circ$ denotes concatenation. The span classifier input for the span, $x^s$, is the concatenation of the span embedding, $e(s)$, and sentence-level context embedding, $c$, as

$$x^s = e(s) \circ c \tag{2}$$

The span classifier consists of a single linear layer, as

$$\hat{y}^s = \mathrm{softmax}(W^s x^s + b^s) \tag{3}$$

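The span representation and classification steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the SpERT implementation: the function names, the two-dimensional toy embeddings, the width-embedding table, and the weight matrix are all hypothetical, and a real model would use BERT embeddings and learned parameters.

```python
import math

def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def enumerate_spans(n_pieces, max_width):
    """All (start, width) candidates no wider than max_width word pieces."""
    return [(i, k) for i in range(n_pieces)
            for k in range(1, max_width + 1) if i + k <= n_pieces]

def span_representation(embeddings, start, width, width_embeddings):
    """Max-pool the word pieces in the span, then append the width embedding."""
    pooled = max_pool(embeddings[start:start + width])
    return pooled + width_embeddings[width]

def classify_span(embeddings, context, start, width, width_embeddings, weights, bias):
    """Single linear layer plus softmax over [span embedding ; context]."""
    x = span_representation(embeddings, start, width, width_embeddings) + context
    scores = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(scores)

# Toy inputs (hypothetical): 4 word pieces, 2-dimensional embeddings.
embeddings = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4], [0.0, 0.7]]
context = [0.2, 0.2]                      # stands in for the [CLS] embedding
width_embeddings = {1: [0.0], 2: [1.0], 3: [2.0]}
labels = ["null", "Finding", "Lung"]      # reduced label set for illustration
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],     # arbitrary, untrained weights
           [0.0, 1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

spans = enumerate_spans(len(embeddings), max_width=3)
probs = classify_span(embeddings, context, 1, 2, width_embeddings, weights, bias)
print(len(spans), labels[probs.index(max(probs))])
```

Limiting `max_width` keeps the number of candidate spans linear in the sentence length rather than quadratic, which is the time and space concern the text raises.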
For our task, the span classifier label set, $\mathcal{L}_s$, includes a null label, Finding, and the 56 Anatomy Subtypes in Table 1 ($|\mathcal{L}_s| = 58$). The null label indicates no span prediction. We experimented with several multi-layer, hierarchical span classifiers, where the first classification layer predicts the entity labels, {null, Finding, Anatomy}, and the second layer resolves the 56 Anatomy Subtype labels for the Anatomy spans. However, none of the hierarchical span classifier configurations outperformed the base SpERT model, so these hierarchical configurations are not presented. By directly predicting the Anatomy Subtypes, the span classifier identifies and normalizes the Anatomy spans. Only spans with a width less than a predefined maximum are included in modeling to limit time and space complexity.

Relation Classification: The relation classifier predicts the relationship between a candidate head span, $s_h$, and a candidate tail span, $s_t$, with input

$$x^r = e(s_h) \circ c(s_h, s_t) \circ e(s_t) \tag{4}$$

where $e(s_h)$ and $e(s_t)$ are the head and tail span embeddings and $c(s_h, s_t)$ is the max pooling of the BERT embedding sequence between the head and tail spans. The relation classifier consists of a single linear layer, as

$$\hat{y}^r = \mathrm{softmax}(W^r x^r + b^r) \tag{5}$$

For our task, the relation classifier label set, $\mathcal{L}_r$, includes a null label and the relation types ($|\mathcal{L}_r| = 2$). Only spans predicted to have a non-null label are considered in the relation classification, to limit the time and space complexity of the pairwise span combinations.

Training: The span and relation classifier parameters are learned while fine-tuning BERT. For each training batch, the cross entropy loss for each classifier is averaged, and the averaged loss values are summed using uniform weighting. The training spans include all the gold spans, $S^g$, as positive examples and a fixed number of spans with the null label as negative examples. The training relations include all the gold relations as positive examples, and negative relation examples are created from all the entity pairs in $S^g$ that are not connected through a relation.

Baseline: As a baseline for evaluating the performance of SpERT, we implement a multi-step BERT approach (BERT-multi) where entities are first extracted and then relations between entities are resolved. BERT-multi is implemented by adding entity extraction and relation prediction layers to a single pretrained BERT model. For entity extraction, we implement a common BERT sequence tagging approach, where Begin-Inside-Outside (BIO) labels are predicted by a linear layer applied to the last BERT hidden state.[13] For evaluation, the word piece predictions are aggregated to token-level predictions by taking the label of the first word piece of the token. For relation prediction, we implement a common BERT sentence classification approach, where relation predictions are generated by a linear layer applied to the [CLS] encoding.[13] For each pair of predicted entities, a modified input sentence is created where the identified entities are replaced with special tags. For example, the first sentence in Figure 1 would become, Lungs: @Finding$ of the @Lung$.
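The entity-masking step used to build the BERT-multi relation input can be illustrated with a short sketch. This is not the authors' code: the function name `mask_entities`, the whitespace tokenization, and the `(start, end, label)` span format with an exclusive end index are assumptions made for illustration, and the two spans are assumed not to overlap.

```python
def mask_entities(tokens, head, tail):
    """Replace the head and tail token spans with @Label$ tags.

    head and tail are (start, end, label) tuples with an exclusive end index.
    """
    # Replace the later span first so the earlier span's indices stay valid.
    for start, end, label in sorted([head, tail], reverse=True):
        tokens = tokens[:start] + [f"@{label}$"] + tokens[end:]
    return " ".join(tokens)

# Example sentence from the text, split on whitespace for simplicity.
tokens = "Lungs : Compressive atelectasis of the right lower lobe .".split()
masked = mask_entities(tokens, head=(2, 4, "Finding"), tail=(6, 9, "Lung"))
print(masked)
```

Each candidate pair of predicted entities would yield one such masked sentence, which is then passed to the sentence-level classifier.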
When enumerating candidate head-tail pairs for BERT-multi, only Finding entities are included as potential heads and only Anatomy Subtype spans are included as potential tails. No such constraint is imposed in SpERT. Each training batch involves (i) generating sequence tag predictions and (ii) predicting relations for the identified spans, and the loss is backpropagated after both (i) and (ii).

Experimentation: The primary focus of this work is the extraction and normalization of anatomy information associated with findings. We use SpERT to extract Finding and Anatomy spans, normalize the Anatomy spans to the Anatomy Subtypes, and resolve Finding-Anatomy relations. The data set only includes anatomy annotations for anatomical information connected to findings, so not all anatomy phrases are annotated in the reports. We include normalization-only experimentation, where the Anatomy Subtype labels are predicted for gold anatomy phrases. The normalization-only experimentation is incorporated to explore the difficulty of the anatomy normalization task separate from span extraction and investigate the role of context in anatomy normalization. This normalization-only experimentation uses the same input encoding and span classifier as SpERT (see Equations 2-3). To investigate the role of context in anatomy normalization, we implement phrase-only models where the input is the anatomy phrase (e.g. right lower lobe) without any context and sentence context models where each anatomy phrase is contextualized in the sentence in which it is located (e.g. Lungs: Compressive atelectasis of the right lower lobe.) Both normalization models use the gold labels to identify the anatomy phrases.

Model architectures and hyperparameters were selected using the training and validation sets, and the final performance was evaluated on the withheld test set. Common parameters across all models include: pretrained transformer=Bio+Clinical BERT,[1] optimizer=Adam, maximum gradient norm=1.0, and learning rate=5e-5. Normalization parameters include: dropout=0.05, batch size=50, and epochs=15. SpERT parameters include: dropout=0.2, batch size=20, epochs=20, learning rate warmup=0.1, weight decay=0.01, negative entity count=100, negative relation count=100, max span width=10, and maximum span pairs=1000. BERT-multi parameters include: batch size=50, epochs=20, dropout=0.2, negative relation count=100, and maximum span pairs=1000. To account for the variance associated with model random initialization, each model was trained on the training set 10 times and evaluated on the test set to generate a distribution of performance values. The mean and standard deviation (SD) of the performance values are presented (mean ± SD). Significance is assessed using a two-sided t-test with unequal variance.
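As a rough illustration of this reporting procedure, the sketch below summarizes a set of runs as mean ± SD and computes the test statistic of a t-test with unequal variance (Welch's t-test). The score lists are made-up placeholders, not results from this work, and the helper names `summarize` and `welch_t` are hypothetical. Only the statistic and degrees of freedom are computed here; a p-value additionally requires the t distribution.

```python
import math
import statistics

def summarize(values):
    """Format a list of scores as 'mean ± SD' (sample standard deviation)."""
    return f"{statistics.mean(values):.2f} ± {statistics.stdev(values):.2f}"

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Placeholder F1 scores for two hypothetical models (NOT values from the paper).
runs_a = [0.70, 0.72, 0.71, 0.73, 0.72]
runs_b = [0.64, 0.66, 0.65, 0.67, 0.63]

t_stat, dof = welch_t(runs_a, runs_b)
print(summarize(runs_a), summarize(runs_b), round(t_stat, 2), round(dof, 1))
```

If SciPy is available, `scipy.stats.ttest_ind(runs_a, runs_b, equal_var=False)` performs the same test and also returns the two-sided p-value.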


Performance is assessed using precision (P), recall (R), and F-score (F1). Each entity, $z$, can be represented as a double, $z = (s, l)$, where $s$ is the span and $l$ is the span label in the span label set $\mathcal{L}_s$. Entity extraction performance is assessed using two sets of equivalence criteria: exact match and any overlap. Under the exact match criteria, a gold entity, $z$, is equivalent to a predicted entity, $\hat{z}$, if the span and span label match exactly, as $(s \equiv \hat{s}) \wedge (l \equiv \hat{l})$. Under the more relaxed any overlap criteria, $z$ is equivalent to $\hat{z}$ if there is at least one overlapping token in the gold and predicted spans and the span labels match, as $(s \cap \hat{s} \neq \emptyset) \wedge (l \equiv \hat{l})$.

We include this any overlap assessment because the Anatomy Subtype labels capture clinically relevant information, even if there are discrepancies in spans. In the example of Figure 1, the span right lower lobe is labeled as Anatomy with Anatomy Subtype Lung. If the span classifier predicts the span lower lobe to have the Anatomy Subtype label Lung, the gold and predicted spans would not match, and the sidedness information associated with right would not be captured. However, a majority of the clinically relevant information would be captured, namely that the Finding is associated with the Lung.

Each relation, $r$, can be represented as a triple, $r = (z_h, l_r, z_t)$, where $z_h$ is the head, $l_r$ is the relation label in the relation label set $\mathcal{L}_r$, and $z_t$ is the tail. A gold relation, $r$, and predicted relation, $\hat{r}$, are equivalent if $(z_h \equiv \hat{z}_h) \wedge (l_r \equiv \hat{l}_r) \wedge (z_t \equiv \hat{z}_t)$, where entity equivalence can be assessed using the exact match or any overlap criteria.
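The two entity equivalence criteria can be made concrete with a small scoring sketch. The representation of an entity as `((start, end), label)` with an exclusive end index and the greedy one-to-one matching of predictions to gold entities are assumptions for illustration; the evaluation code used in this work may differ in such details.

```python
def spans_overlap(a, b):
    """True if two (start, end) token spans with exclusive ends share a token."""
    return a[0] < b[1] and b[0] < a[1]

def entities_match(gold, pred, criterion="exact"):
    """Compare ((start, end), label) entities under 'exact' or 'overlap'."""
    (g_span, g_label), (p_span, p_label) = gold, pred
    if g_label != p_label:
        return False
    if criterion == "exact":
        return g_span == p_span
    return spans_overlap(g_span, p_span)

def precision_recall_f1(gold, pred, criterion="exact"):
    """Score predictions, letting each gold entity be matched at most once."""
    unmatched = list(gold)
    true_positives = 0
    for p in pred:
        for g in unmatched:
            if entities_match(g, p, criterion):
                unmatched.remove(g)
                true_positives += 1
                break
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Gold "right lower lobe" (tokens 6-8) vs. predicted "lower lobe" (tokens 7-8).
gold = [((6, 9), "Lung")]
pred = [((7, 9), "Lung")]
print(precision_recall_f1(gold, pred, "exact"))
print(precision_recall_f1(gold, pred, "overlap"))
```

The example mirrors the right lower lobe case in the text: the prediction is counted as an error under exact match but as correct under any overlap.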



Model F1 micro
phrase only 0.86
sentence context 0.89
Table 2: Anatomy normalization performance on test set (mean ± SD) for 10 models (1,153 phrases). indicates best performance with significance ().

This section presents the normalization results where Anatomy Subtype labels are predicted for gold anatomy phrases. Table 2 presents the anatomy normalization performance on the withheld test set averaged across the 10 randomly instantiated models for each input configuration: phrase only and sentence context. The F1 scores in Table 2 are micro averaged across the 56 Anatomy Subtype labels. The phrase only model achieves relatively high performance, indicating a high proportion of the anatomical phrases include strong cues for normalization. The inclusion of the sentence context improves normalization performance from 0.86 F1 to 0.89 F1 with significance (), indicating there are some ambiguous anatomy phrases that require intra-sentence context to resolve. For example, the term cervical can be related to the neck or the uterus, and sentence context is needed to resolve ambiguity. Early experimentation with context beyond the sentence of the anatomy phrase did not improve performance.

Gold Predicted Avg. freq.
Cardio MSK 18.6
Abdomen Intestine 5.0
Cardio Pelvis 5.0
MSK Thorax 4.5
Eye MSK 4.3
Intestine Abdomen 3.5
MSK Cardio 3.4
Head Neck 2.8
Lung Thorax 2.7
Thorax Abdomen 2.5
Table 3: Most confused Anatomy Subtypes for sentence context models on the test set, averaged across 10 models.

Table 3 presents the most frequently confused Anatomy Subtypes, averaged across the sentence context model predictions. We omit the full confusion matrix because of the high number of labels and sparsity of the matrix. In general, organs and body regions are the most confusable anatomy subtypes, as either could be applied. Cardio and MSK are among the most frequently confused labels, with 53% of all errors involving Cardio or MSK labels as the gold or predicted labels. Cardio and MSK labels are organ systems that extend throughout the body and therefore overlap with body region labels. Moreover, these labels are the most frequent in the data set. Other frequently confused labels include co-located body parts and organ systems, like Abdomen-Intestine and Head-Neck.
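A table of averaged confusion counts like Table 3 could be assembled along the following lines. The function names and the toy label lists are hypothetical and serve only to show the counting and averaging; they are not the data or code behind Table 3.

```python
from collections import Counter

def confusion_counts(gold_labels, predicted_labels):
    """Count (gold, predicted) pairs where the two labels differ."""
    return Counter((g, p) for g, p in zip(gold_labels, predicted_labels) if g != p)

def average_confusions(runs):
    """Average per-run confusion counts over a list of (gold, predicted) runs."""
    total = Counter()
    for gold_labels, predicted_labels in runs:
        total.update(confusion_counts(gold_labels, predicted_labels))
    return {pair: count / len(runs) for pair, count in total.items()}

# Toy example with two model runs over the same four gold phrases.
gold = ["Cardio", "Cardio", "Lung", "MSK"]
run_1 = ["MSK", "Cardio", "Thorax", "MSK"]
run_2 = ["MSK", "MSK", "Lung", "MSK"]

averaged = average_confusions([(gold, run_1), (gold, run_2)])
for pair, freq in sorted(averaged.items(), key=lambda item: -item[1]):
    print(pair, freq)
```

Storing only the off-diagonal pairs avoids materializing a sparse 56-by-56 matrix.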

Entity and Relation Extraction

Span label | # gold | SpERT exact P | SpERT exact R | SpERT exact F1 | SpERT overlap P | SpERT overlap R | SpERT overlap F1 | BERT-multi exact F1 | BERT-multi overlap F1
Finding | 2,122 | 0.82 | 0.84 | 0.83 | 0.91 | 0.92 | 0.92 | 0.79 | 0.91
Anatomy | 1,153 | 0.75 | 0.69 | 0.72 | 0.83 | 0.76 | 0.79 | 0.63 | 0.77
Anatomy Subtype | 1,153 | 0.70 | 0.64 | 0.67 | 0.77 | 0.70 | 0.73 | 0.58 | 0.71
(a) Span labeling performance

Relations | # gold | SpERT exact P | SpERT exact R | SpERT exact F1 | SpERT overlap P | SpERT overlap R | SpERT overlap F1 | BERT-multi exact F1 | BERT-multi overlap F1
Finding-Anatomy | 1,380 | 0.65 | 0.60 | 0.63 | 0.75 | 0.70 | 0.72 | 0.50 | 0.66
Finding-Anatomy Subtype | 1,380 | 0.61 | 0.56 | 0.58 | 0.70 | 0.65 | 0.67 | 0.47 | 0.60
(b) Relation extraction performance
Table 4: Average extraction performance on the withheld test set, as mean and standard deviation for 10 trained models. indicates best performance with significance ().

This section presents the entity and relation extraction performance. Tables 4(a) and 4(b) present the extraction performance on the withheld test set for SpERT and BERT-multi, averaged across 10 randomly instantiated models. Table 4(a) includes the span labeling performance for Finding and Anatomy entities and the micro-averaged Anatomy Subtype labels. An Anatomy label is assigned to any span with an Anatomy Subtype label. SpERT outperforms BERT-multi for all span labels, under the exact match or any overlap criteria, with significance (). There is a larger performance gap between the exact match and any overlap assessment for BERT-multi than SpERT, which is likely the result of the differing training objectives. SpERT is trained to identify exact span matches without any reward for partial matches, while BERT-multi generates word piece predictions that are aggregated to token predictions. For both architectures, the relatively small difference in performance between Anatomy and Anatomy Subtype (0.04-0.05 F1 for exact match and 0.06-0.07 F1 for any overlap) suggests that there is relatively low confusability between the Anatomy Subtype labels for spans that are correctly identified as Anatomy.

Figure 4: Span extraction recall as a function of span length in tokens (not word pieces)

Table 4(b) presents the relation extraction performance. SpERT outperforms BERT-multi for Finding-Anatomy and Finding-Anatomy Subtype relations with significance. As expected, the relation extraction performance is lower than the span labeling performance because of cascading errors. For both architectures, the magnitude of the performance drop from span labeling to relation extraction is roughly consistent with the accumulation of Finding and Anatomy span labeling errors, suggesting that the performance of the relation classifiers is relatively high.

Figure 4 presents the recall of SpERT as a function of the gold span length, in number of tokens (not word pieces). The recall is aggregated for the 10 model runs and reported for Finding, Anatomy, and Anatomy Subtype labels. The maximum span width for SpERT is set to 10 tokens, so the exact match recall is zero for all spans longer than 10 tokens. Under the exact match criteria, the Finding recall drops from approximately 0.9 for shorter spans to approximately 0.2-0.3 for long spans (9-10 tokens). Under the any overlap criteria, the Finding recall remains relatively high for all span lengths, as the extractor only needs to identify a portion of the gold span for a match. Under the exact match criteria, the Anatomy and Anatomy Subtype recall remains relatively steady across span lengths from 1-10. Under the any overlap criteria, the Anatomy and Anatomy Subtype recall tends to increase with span length.
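The recall-by-span-length analysis can be sketched as a simple grouping of gold entities by their length in tokens. The sketch below covers only the exact match criterion; the function name, the `((start, end), label)` entity format with an exclusive end index, and the toy entities are assumptions for illustration, and a full implementation would also key entities by sentence or document.

```python
from collections import defaultdict

def recall_by_length(gold, predicted):
    """Exact-match recall grouped by gold span length in tokens."""
    predicted = set(predicted)
    found = defaultdict(int)
    total = defaultdict(int)
    for (start, end), label in gold:
        length = end - start
        total[length] += 1
        if ((start, end), label) in predicted:
            found[length] += 1
    return {length: found[length] / total[length] for length in sorted(total)}

# Toy entities: two one-token findings and one three-token finding.
gold = [((0, 1), "Finding"), ((3, 4), "Finding"), ((5, 8), "Finding")]
predicted = [((0, 1), "Finding"), ((5, 7), "Finding")]
print(recall_by_length(gold, predicted))
```

In this toy case the three-token gold span is only partially predicted, so it counts as a miss under exact match, which is the pattern the text describes for long Finding spans.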

Figure 5: Performance, anatomy type frequency, and number of unique spans by Anatomy Subtype.

Figure 5 presents summary statistics and performance for the 15 most frequent Anatomy Subtypes in the test set. Figure 5 includes the label counts (# gold), number of unique lower cased spans (# unique). It also includes the normalization, span labeling, and relation extraction performance. The normalization performance is associated with sentence context models summarized in Table 2, and the span labeling and relation extraction performance is associated with the SpERT model summarized in Table 4. There is a large imbalance in the distribution of Anatomy Subtype labels with Cardio, MSK, and Lung accounting for approximately 50% of the labels. The diversity of the anatomy spans varies significantly by Anatomy Subtype. For example, MSK has 168 unique spans in 204 occurrences (ratio of 0.8), while Mediastinum has 7 unique spans in 37 occurrences (ratio of 0.2). The span labeling and relation extraction performance does not drop off for infrequent labels and appears to be more related to the span diversity.
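The span diversity ratio used above is the number of unique lower-cased spans divided by the number of occurrences, for example 168/204 ≈ 0.8 for MSK and 7/37 ≈ 0.2 for Mediastinum. A minimal sketch of this computation follows; the function name and the toy annotations are hypothetical.

```python
from collections import defaultdict

def span_diversity(annotations):
    """Per label: (occurrences, unique lower-cased spans, unique/occurrences)."""
    spans_by_label = defaultdict(list)
    for text, label in annotations:
        spans_by_label[label].append(text.lower())
    return {
        label: (len(spans), len(set(spans)), len(set(spans)) / len(spans))
        for label, spans in spans_by_label.items()
    }

# Toy (span text, Anatomy Subtype) annotations.
annotations = [
    ("Aorta", "Cardio"),
    ("aorta", "Cardio"),
    ("coronary arteries", "Cardio"),
    ("scapula", "MSK"),
    ("left third rib", "MSK"),
]
print(span_diversity(annotations))
```

A ratio near 1 means almost every occurrence is a distinct phrase, so the model sees few repeated surface forms for that label during training.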

Error Analysis

Span label | Short examples | Long examples
Finding | hernia | poor opacification of these vessels distally
Finding | lesion | expanded thoracic aortic aneurysm
Anatomy (MSK) | scapula | soft tissues of the posterolateral left chest wall
Anatomy (MSK) | left third rib | subcutaneous fat in the right groin
Anatomy (Cardio) | aorta | proximal descending thoracic aorta
Anatomy (Cardio) | coronary arteries | arteries of the right lower extremity and abdomen
Anatomy (Lung) | left lung | lateral aspect of the right major fissure
Anatomy (Lung) | right lower lobe | dependent portions of the left upper lobe adjacent
Table 5: Example false-negative spans

Generating a correct relation prediction requires identifying the Finding (head), identifying the Anatomy and Anatomy Subtype (tail), and pairing the head and tail (role). The results in Table 4 suggest the biggest source of error is identifying Anatomy entities, followed by identifying Finding entities. Error is also introduced in the Anatomy Subtype normalization and Finding-Anatomy pairing; however, entity extraction is the most challenging aspect of this task. Table 5 presents example SpERT false negative spans for Finding and the most frequent Anatomy Subtypes. These false negatives are assessed using the any overlap criteria, to identify text regions related to findings and anatomy that the model completely missed.

The short Finding examples are relatively straightforward targets, and the cause of these missed spans is unclear. The long Finding examples include medical problems coupled with anatomical information, resulting in longer spans that are generally more difficult to extract. The inclusion of anatomical information in the Finding spans creates annotation inconsistencies, where anatomical information may be labeled as Finding or Anatomy. We are currently building on this radiological work as part of an exploration of incidentalomas and have updated the annotation guidelines to separate finding information from anatomy information and create shorter, more consistently annotated spans. For example, the Finding span, expanded thoracic aortic aneurysm, would be annotated as the relation triple (Finding=aneurysm, role=has, Anatomy=thoracic aortic).

All of the short Anatomy examples are concise descriptions of anatomy that use common anatomical terminology. There are multiple contributing factors to these errors. In the corpus, only anatomy information associated with findings is annotated, so there are many descriptions of anatomy that are not annotated. As previously discussed, anatomy information is frequently incorporated into Finding annotations, which introduces annotation inconsistencies. The long Anatomy examples are more nuanced descriptions of anatomy that often describe multiple systems or body parts in relation to each other. For example, the Cardio span, arteries of the right lower extremity and abdomen, contains references to three Anatomy Subtypes: Cardio, Lower limb, and Abdomen. Annotating such examples with the Anatomy Subtype labels can be challenging, and more nuanced anatomy descriptions are likely to have noisier annotations.


This work explores a novel radiological information extraction task with the goal of automatically generating semantic representations of radiological findings that capture anatomical information. We extract and normalize anatomical information connected to findings in CT reports, using state-of-the-art extraction architectures. This extraction task is both novel and important because it couples extracted anatomical information with radiological findings and normalizes the anatomical information to a commonly used ontology. Linking the anatomy to findings and normalizing the anatomy yields a more complete semantic representation, which can more easily be incorporated into secondary use applications. We demonstrate that the span-based SpERT model, which jointly extracts entities and relations, outperforms a strong BERT baseline that separately extracts entities and relations in a pipelined approach. The explored extraction task involves three subtasks: identifying Finding and Anatomy entities, normalizing Anatomy entities to Anatomy Subtypes, and pairing Finding and Anatomy entities through relations. Entity extraction is the most difficult of these subtasks. We find that extraction performance for Finding entities decreases as span length increases; however, Anatomy extraction performance is relatively constant across span lengths. In an exploration of performance by Anatomy Subtype, we find span extraction performance is influenced more by the diversity of the associated spans than the frequency of the Anatomy Subtype labels. This work is limited by the annotated data set, which only utilizes data from a single hospital system and incorporates a single type of imaging report (CT). The extraction models trained on this annotated data set may not generalize well to other institutions or radiology modalities. 
We are currently expanding the annotated data set to other radiology modalities, including magnetic resonance imaging (MRI) and positron emission tomography (PET) reports. The 56 Anatomy Subtypes used in this work provide moderate granularity in resolving the anatomical location of radiological findings. In our current incidentaloma research, we anticipate representing anatomical locations with finer resolution. We will build on the work presented here and explore learned approaches for characterizing anatomical spans through multiple attributes. For example, the phrase "right lower lobe" could be characterized through a semantic representation describing the body part/organ (Lung), sidedness (right), and vertical location (lower). This type of detailed semantic representation could facilitate a wide range of impactful use cases.
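As a rough illustration of the multi-attribute representation described above, the sketch below stores the three attributes named in the example (body part/organ, sidedness, vertical location) in a small record. The class name, field names, and keyword tables are hypothetical, and the keyword lookup is only a stand-in used to populate the record; the approach proposed here is a learned one, which this sketch does not implement.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword tables, for illustration only. A real system would
# need disambiguation (for example, "lobe" also occurs in liver and brain
# anatomy), which is one reason a learned approach is proposed.
BODY_PART_KEYWORDS = {"lobe": "Lung", "lung": "Lung"}
SIDEDNESS_KEYWORDS = {"right": "right", "left": "left"}
VERTICAL_KEYWORDS = {"upper": "upper", "middle": "middle", "lower": "lower"}


@dataclass(frozen=True)
class AnatomyAttributes:
    """Attribute-level description of one anatomical span (assumed schema)."""

    span: str
    body_part: Optional[str] = None
    sidedness: Optional[str] = None
    vertical_location: Optional[str] = None


def first_match(tokens, table):
    """Return the value for the first token found in the table, else None."""
    for token in tokens:
        if token in table:
            return table[token]
    return None


def characterize(span: str) -> AnatomyAttributes:
    """Fill the record with a naive keyword lookup (not a learned model)."""
    tokens = span.lower().split()
    return AnatomyAttributes(
        span=span,
        body_part=first_match(tokens, BODY_PART_KEYWORDS),
        sidedness=first_match(tokens, SIDEDNESS_KEYWORDS),
        vertical_location=first_match(tokens, VERTICAL_KEYWORDS),
    )


example = characterize("right lower lobe")
print(example)
```

Keeping each attribute as a separate optional field lets a span that specifies only some attributes (for instance, sidedness without a vertical location) be represented without forcing a guess for the missing ones.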


This work was supported by NIH/NCI (1R01CA248422-01A1) and NIH/NLM (Biomedical and Health Informatics Training Program - T15LM007442). Research and results reported in this publication were partially facilitated by the generous contribution of computational resources from the University of Washington Department of Radiology.


  • [1] E. Alsentzer, J. Murphy, W. Boag, et al. (2019) Publicly available clinical BERT embeddings. In Clinical Natural Language Processing Workshop, pp. 72–78. Note: doi: 10.18653/v1/W19-1909 Cited by: Information Extraction.
  • [2] O. Bodenreider (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (suppl_1), pp. D267–D270. Note: doi: 10.1093/nar/gkh061 External Links: Document Cited by: Related Work.
  • [3] L. Chen, W. Fu, Y. Gu, et al. (2020-10) Clinical concept normalization with a hybrid natural language processing system combining multilevel matching and machine learning ranking. J Am Med Inform Assoc 27 (10), pp. 1576–1584. Note: doi: 10.1093/jamia/ocaa155 External Links: Document Cited by: Related Work.
  • [4] S. Datta, J. Godfrey-Stovall, and K. Roberts (2020) RadLex normalization in radiology reports. In AMIA Annu Symp Proc, Vol. 2020, pp. 338. Note: PMID: 33936406 Cited by: Related Work.
  • [5] D. Demner-Fushman, W. W. Chapman, and C. J. McDonald (2009) What can natural language processing do for clinical decision support?. J Biomed Inform 42 (5), pp. 760–772. Note: doi: 10.1016/j.jbi.2009.08.007 Cited by: Introduction.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In N Am Chapter Assoc Comput Linguist, pp. 4171–4186. Note: doi: 10.18653/v1/N19-1423 External Links: Document Cited by: Information Extraction.
  • [7] M. Eberts and A. Ulges (2020) Span-based joint entity and relation extraction with transformer pre-training. In Eur Conf on Artif Intell, pp. 2006–2013. External Links: Link Cited by: Introduction, Information Extraction.
  • [8] R. W. Filice (2019) Deep-learning language-modeling approach for automated, personalized, and iterative radiology-pathology correlation. J Am Coll Radiol 16 (9), pp. 1286–1291. Note: doi: 10.1016/j.jacr.2019.05.007 Cited by: Introduction.
  • [9] A. Gerstmair, P. Daumke, K. Simon, M. Langer, and E. Kotter (2012)

    Intelligent image retrieval based on radiology reports

    Eur Radiol 22 (12), pp. 2750–2758. Note: doi: 10.1007/s00330-012-2608-x Cited by: Introduction.
  • [10] S. Henry, Y. Wang, F. Shen, and Ö. Uzuner (2020) The 2019 national natural language processing (NLP) clinical challenges (n2c2)/open health NLP (OHNLP) shared task on clinical concept normalization for clinical records. J Am Med Inform Assoc 27 (10), pp. 1529–1537. Note: doi: 10.1093/jamia/ocaa106 Cited by: Related Work.
  • [11] Z. Ji, Q. Wei, and H. Xu (2020) BERT-based ranking for biomedical entity normalization. AMIA Jt Summits Transl Sci Proc 2020, pp. 269. Note: PMID: 32477646 Cited by: Related Work.
  • [12] W. Lau, D. Wayne, S. Lewis, Ö. Uzuner, M. Gunn, and M. Yetisgen (2021) A new corpus for clinical findings in radiology reports. In AMIA Annu Symp Proc, Cited by: Data, Data.
  • [13] J. Lee, W. Yoon, S. Kim, et al. (2019-09) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinform 36 (4), pp. 1234–1240. Note: doi: 10.1093/bioinformatics/btz682 External Links: ISSN 1367-4803, Document Cited by: Information Extraction.
  • [14] T. Mabotuwana, C. S. Hall, V. Hombal, et al. (2019) Automated tracking of follow-up imaging recommendations. Am J Roentgenol 212 (6), pp. 1287–1294. Note: doi: 10.2214/AJR.18.20586 Cited by: Introduction.
  • [15] D. L. Rubin and C. E. Kahn Jr (2017) Common data elements in radiology. Radiol 283 (3), pp. 837–844. Note: doi: 10.1148/radiol.2016161553 Cited by: Introduction.
  • [16] A. M. Tahmasebi, H. Zhu, G. Mankovich, et al. (2019) Automatic normalization of anatomical phrases in radiology reports using unsupervised learning. J Digit Imaging 32 (1), pp. 6–18. Note: doi: 10.1007/s10278-018-0116-5 External Links: Document Cited by: Related Work, Data.
  • [17] E. Tutubalina, Z. Miftahutdinov, S. Nikolenko, and V. Malykh (2018)

    Medical concept normalization in social media posts with recurrent neural networks

    J Biomed Inform 84, pp. 93–102. Note: doi: 10.1016/j.jbi.2018.06.006 External Links: ISSN 1532-0464, Document Cited by: Related Work.
  • [18] Y. Wang, X. Fan, L. Chen, et al. (2019) Mapping anatomical related entities to human body parts based on wikipedia in discharge summaries. BMC Bioinformatics 20 (1), pp. 1–11. Note: doi: 10.1186/s12859-019-3005-0 External Links: Document Cited by: Related Work.
  • [19] W. F. Wiggins, F. Kitamura, I. Santos, and L. M. Prevedello (2021) Natural language processing of radiology text reports: interactive text classification. Radiol Artif Intell, pp. e210035. Note: doi: 10.1148/ryai.2021210035 Cited by: Introduction.
  • [20] M. J. Willemink, W. A. Koszek, C. Hardell, et al. (2020) Preparing medical imaging data for machine learning. Radiol 295 (1), pp. 4–15. Note: doi: 10.1148/radiol.2020192224 Cited by: Introduction.
  • [21] D. Xu, M. Gopale, J. Zhang, K. Brown, E. Begoli, and S. Bethard (2020-07) Unified Medical Language System resources improve sieve-based generation and Bidirectional Encoder Representations from Transformers (BERT)–based ranking for concept normalization. J Am Med Inform Assoc 27 (10), pp. 1510–1519. Note: doi: 10.1093/jamia/ocaa080 External Links: ISSN 1527-974X, Document Cited by: Related Work.
  • [22] J. Zech, M. Pain, J. Titano, et al. (2018) Natural language–based machine learning models for the annotation of clinical radiology reports. Radiol 287 (2), pp. 570–580. Note: doi: 10.1148/radiol.2018171093 Cited by: Introduction.
  • [23] H. Zhu, I. C. Paschalidis, C. Hall, and A. Tahmasebi (2019) Context-driven concept annotation in radiology reports: anatomical phrase labeling. AMIA Jt Summits Transl Sci Proc 2019, pp. 232. Note: PMID: 31258975 Cited by: Related Work.