CLARA: Clinical Report Auto-completion

by   Siddharth Biswal, et al.

Generating clinical reports from raw recordings such as X-rays and electroencephalograms (EEG) is an essential and routine task for doctors. However, writing accurate and detailed reports is often time-consuming. Most existing methods try to generate whole reports from the raw input, with limited success, because 1) generated reports often contain errors that need manual review and correction, 2) they do not save time when doctors want to write additional information into the report, and 3) the generated reports are not customized to individual doctors' preferences. We propose CLinicAl Report Auto-completion (CLARA), an interactive method that generates reports sentence by sentence based on doctors' anchor words and partially completed sentences. CLARA searches for the most relevant sentences from existing reports to use as templates for the current report. The retrieved sentences are sequentially modified by combining them with the input feature representations to create the final report. In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 on EEG reports for sentence-level generation, which is up to a 35% improvement over the best baseline. In a user study, CLARA also produced reports with a significantly higher level of approval by doctors (3.74 out of 5 for CLARA vs 2.52 out of 5 for the baseline).





Medical imaging or neural recordings (e.g., X-ray images or EEG) are widely used in clinical practice for diagnosis and treatment. Typically clinical experts will visually inspect the images and signals, and then identify key disease phenotypes and compose text reports to narrate the abnormal patterns and detailed explanation of those findings. Currently, clinical report writing is cumbersome and labor-intensive. Moreover, it requires thorough knowledge and extensive experience in understanding the image or signal patterns and their correlations with target diseases [world2004neurology]. In the age of telemedicine, more diagnostic practices can be done on the web which requires a more efficient diagnostic process. Improving the quality and efficiency of medical report writing can have a direct impact on telemedicine and healthcare on the web.

To alleviate the limitations of manual report writing, several medical image report generation methods [jing2017automatic] have been proposed. However, none of the existing works simultaneously provides the following desired properties for medical report generation.

  1. Align with disease phenotypes. Medical reports describe clinical findings and diagnosis from medical images or neural recordings, which need to align with disease phenotypes and ensure the correctness of medical terminology usage.

  2. Adaptive report generation. The generated reports need to be adapted to the preference of end-users (e.g., clinicians) for improved adoption.

Figure 1: CLARA uses input data such as EEG or X-ray together with anchor words (disease phenotypes) to produce a report sentence by sentence. In inference mode, doctors can provide the anchor words, or predicted anchor words can be used, which gives more control over the final output report. CLARA uses different anchor words for the final report generation. In the report above, "Abnormal EEG", "Generalized slowing", "Seizures", and "Epileptiform discharges" were used as the anchor words to generate the final report.

To fill the gap, we propose an interactive method named CLARA that fills in medical reports sentence by sentence based on anchor words (disease phenotypes) and partially completed sentences (prefix text) provided by doctors. CLARA adopts an adaptive retrieve-and-edit framework to progressively complete report writing with doctors’ guidance. CLARA constructs a prototype sentence database from all previous reports. It extracts the most relevant sentence templates based on user queries and then edits those sentences using the feature representation extracted from the data. In particular, the retrieval step uses an information retrieval system such as Lucene to enable fast, flexible and accurate search [bworld]. The edit step then uses a modified version of the seq2seq method [Sutskever_Vinyals_Le_2014] to generate sentences for the current report. The latent representation of the previous sentences is adaptively used as context to generate the next sentence. In summary, CLARA makes the following contributions compared with other medical report generation approaches.

  1. Phenotype oriented. Since a CLARA-generated report is created using the anchor words of relevant disease phenotypes, this helps ensure that the report is clinically accurate. We also evaluate our method on clinical accuracy via disease phenotype classification.

  2. Interactive report generation. Users (e.g., doctors) have more control over the generated reports via interactive guidance on a sentence by sentence level.

We evaluate CLARA on two types of clinical report writing tasks: (1) X-ray report generation, which takes fixed-length imaging data as input, and (2) EEG report generation, which takes varying-length EEG time series as input. For EEG data, we evaluated our model on two datasets to test the generalizability of CLARA. We show that with the CLARA framework we can achieve 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 on EEG reports for sentence-level generation, which is up to a 35% improvement over the best baseline. Compared to other methods, CLARA can generate more clinically meaningful reports. We show via a user study that CLARA produces more clinically acceptable reports, as measured by a quality score of 3.74 out of 5 for CLARA vs. 2.52 out of 5 for the best baseline.

Related Work

Image captioning generates short descriptions of image input. There were a few attempts at solving the image captioning task before the deep learning era [yao2010i2t, ordonez2011im2text]. Several deep learning models were later proposed for this task [vinyals2015show, karpathy2015deep]. Many of these image captioning frameworks can be categorized into template-based, retrieval-based and novel caption generation [farhadi2010every, you2016image, li2011composing, mao2014deep, lu2018neural, dai2018compositional, VenugopalanHRMD16, RennieMMRG16, chenShowFool, xu2015show]. However, they do not perform very well in generating longer paragraphs. There is limited research on generating longer captions, notably the hierarchical RNN [krause2017hierarchical].

Medical report generation adapts similar ideas from image captioning to generate full medical text reports based on X-ray images [jing2017automatic, li2018hybrid, ZhangXXMY17, LiuCliniaclly2019, Zhang2018, gale2018radiology, li2019knowledge]. To improve report accuracy, researchers have utilized curated report templates to simplify the generation task [li2019knowledge, HanWLC018]. However, the generated full reports often contain errors that require significant time to correct. CLARA focuses on interactive report generation that follows the natural workflow of clinicians and leads to more accurate results. CLARA does not require any predefined templates but instead retrieves and adapts existing reports to generate the new one interactively. More recently, [EEG2text] developed a template-based approach to generate EEG reports using a hybrid model of CNN and LSTM.

Query auto-completion is about expanding prefix text with related text to generate more informative search queries. This is a well-established topic [Cai2016-mq]. Traditional query auto-completion suggests the more popular and relevant queries for the prefix text [Bar-Yossef2011-uk]. Recently, neural network models have been used for the query auto-completion task; these can potentially generate new and unseen queries using an LSTM [jaech2018personalized] or a hierarchical encoder-decoder [sordoni2015hierarchical]. CLARA differs in terms of model input, as our models accept multimodal input, not just short prefix text.


Task Formulation

Figure 2: An overview of CLARA. CLARA has an input encoder module to learn embeddings from medical images or neural recordings; meanwhile, a prototype repository is constructed by indexing the unique sentences from all medical reports. Anchor words and prefix text provided by users serve as queries to retrieve the most relevant sentence templates. These sentence templates are modified by the edit module via a seq2seq model to produce a new sentence for the current report. The process repeats iteratively to generate all sentences in the report description and the associated disease phenotypes.

Data: We denote data samples as $\mathcal{D} = \{(X_n, Y_n)\}_{n=1}^{N}$. In the case of EEG data, we denote $X_n \in \mathbb{R}^{C \times T}$ as the EEG record for subject $n$, where $C$ is the number of electrodes and $T$ is the number of discretized time steps per recording. In the case of X-ray, the input $X_n$ is an image. $a_n$ and $p_n^{(i)}$ are the guidance provided by users, namely, the anchor words and prefix text for subject $n$. These anchor words include general descriptions such as “normal” as well as diagnostic phenotypes such as “seizure”. The prefix text is the first few words of each sentence in the report.

Task: In this work, we focus on generating the findings (impression) section of medical reports due to its clinical importance. Given an input sample $X_n$ (X-ray or EEG), CLARA generates a text report $Y_n$ consisting of a sequence of sentences $Y_n = (y_n^{(1)}, y_n^{(2)}, \dots)$ to narrate the patterns and findings in $X_n$. $p_n^{(i)}$ are optional prefix texts provided by users for each sentence. Note that $p_n^{(i)}$ can be empty. CLARA generates a sentence $y_n^{(i)}$ using the data embedding $\mathbf{x}_n$ of input $X_n$, the context generated by the previous sentence $y_n^{(i-1)}$, anchor words $a_n$ and optional prefix text $p_n^{(i)}$. The notations are summarized in Table 1. The overall CLARA framework is illustrated in Fig. 2.

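| Notation | Definition |
| --- | --- |
| $D_n = (X_n, Y_n)$ | $n$-th data sample |
| $X_n$, $\mathbf{x}_n$ | $n$-th input sample (X-ray or EEG) and its embedding |
| $Y_n$ | $n$-th report |
| $y_n^{(i)}$ | $i$-th sentence in the $n$-th report |
| $a_n$ | anchor words provided by users for the $n$-th report |
| $p_n^{(i)}$ | optional prefix text provided by users for each sentence |
| $\mathcal{P}$ | prototype sentences extracted from all reports |

Table 1: Notations used in CLARA.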

The CLARA Framework

The CLARA framework comprises the following modules.

  • M1. Input encoder module transforms medical data such as image or EEG time series into compressed feature representations.

  • M2. Prototype construction builds a sentence-level repository which includes distinct sentences, their representations, writer information and frequency statistics derived from a large medical report database. This repository is searched dynamically to provide a starting point for generating sentences in a new report.

  • M3. Query module gives clinicians more control to interactively produce a customized medical report. It accepts queries from the clinicians in the form of anchor words (global context) and prefix text (local context). Anchor words are phenotype keywords associated with the entire report. Optional prefix texts are partial sentences entered by the users through interactive editing.

  • M4. Retrieve and edit module interactively produces the report, guided by users, from the data representation, anchor words, and prefix text. This module performs report generation sequentially. First, the retrieve module extracts the most relevant sentences from the prototype repository. Then the edit module uses a sequence-to-sequence [Sutskever_Vinyals_Le_2014] model to modify the retrieved sentences based on the data representation, anchor words, and prefix text.

M1. Input Encoder Module This module extracts a data embedding from the input to guide report completion. The input can be raw measurements of X-ray or EEG. For both images $X_n$ and EEG time series in the form of a sequence of EEG epochs $X_n = (X_n^{(1)}, X_n^{(2)}, \dots, X_n^{(K)})$, we encode them using a convolutional neural network (CNN) to obtain the image embedding $\mathbf{x}_n = \mathrm{CNN}(X_n)$, or the EEG embedding $\mathbf{x}_n^{(k)} = \mathrm{CNN}(X_n^{(k)})$ for epoch $k$.

For X-ray imaging, the DenseNet [huang2017densely] architecture is used as the CNN. For EEG, we use a CNN with convolutional-max pooling blocks to map the EEG data into feature space, with Rectified Linear Unit (ReLU) activations and batch normalization. More detailed model configuration is provided in the experiment section. Finally, we average over the per-epoch feature vectors to produce the final embedding $\mathbf{x}_n = \frac{1}{K} \sum_{k=1}^{K} \mathbf{x}_n^{(k)}$ for the EEG recording associated with the sample. More sophisticated aggregations such as an LSTM or an attention model were considered as well but gave very limited improvement. Therefore, we use this simple but effective average embedding. The output data embedding is fed into the retrieval step, where it is associated with anchor words and used jointly to generate reports. The anchor words are provided as labels.
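The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `average_embedding` and the toy 4-dimensional vectors are assumptions, and real per-epoch embeddings would come from the EEG CNN.

```python
from typing import List, Sequence


def average_embedding(epoch_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-epoch embedding vectors into one recording-level vector."""
    if not epoch_embeddings:
        raise ValueError("need at least one epoch embedding")
    dim = len(epoch_embeddings[0])
    if any(len(e) != dim for e in epoch_embeddings):
        raise ValueError("all epoch embeddings must have the same dimension")
    n_epochs = len(epoch_embeddings)
    # Element-wise mean over epochs.
    return [sum(e[d] for e in epoch_embeddings) / n_epochs for d in range(dim)]


# Three hypothetical epoch embeddings (toy values, not real CNN outputs).
epochs = [[1.0, 0.0, 2.0, 4.0], [3.0, 0.0, 2.0, 0.0], [2.0, 3.0, 2.0, 2.0]]
recording_embedding = average_embedding(epochs)
print(recording_embedding)  # [2.0, 1.0, 2.0, 2.0]
```

Because the mean does not depend on the number of epochs, the same function handles recordings of different lengths, which is what makes it suitable for varying-length EEG input.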

M2. Prototype Construction The idea here is to organize all the existing sentences from medical reports into a retrieval system as prototype sentences. We take a hybrid approach between information retrieval and deep learning to structure prototype sentences.

Motivation: Prototype learning [DBLP:journals/corr/SnellSZ17, DBLP:journals/corr/abs-1710-04806, li2019knowledge] and memory networks [DBLP:journals/corr/WestonCB14, sukhbaatar2015end] are different ways to incorporate data instances directly into neural networks. The common idea is to construct a set of prototypes and their representations. Then, given a new data instance $x$, prototype learning tries to learn a representation of $x$ in terms of $d(x, \cdot)$ evaluated against each prototype representation, where $d$ is a distance or similarity function. Similarly, a memory network puts all prototype representations in a memory bank and learns a similarity function between $x$ and every instance in the memory bank. However, these approaches have several significant limitations: 1) Storage and computation cost can be large when there are many prototypes. For example, if we treat all unique sentences from a medical report database as prototypes, every pass of the network involves a large number of distance/similarity computations. 2) Static prototypes: prototypes and their representations often have to be fixed before the prototype learning model can be trained, and once the model is trained, new prototypes cannot be added easily. In medical report applications, new reports are continuously being created and should be incorporated into the model without retraining from scratch. 3) Computational waste: it is wasteful to conduct all the similarity computations knowing that only a small fraction of prototypes are relevant for a given query.

Approach: We take a scalable approach to structuring prototypes in CLARA. We extract all sentences from a large collection of medical reports, then index these sentences for use by a retrieval system, e.g., an inverted index over the unique sentences. We also weigh those sentences by their popularity so that frequent sentences are more likely to be retrieved. This approach has several immediate benefits: 1) we can support a large number of sentences, as a typical retrieval system such as Lucene can support a web-scale corpus; 2) we can easily update the index with new documents, so new reports can be integrated; 3) the query response is much faster than a typical prototype learning model thanks to the fast retrieval system. Formally, given a report corpus $\{Y_n\}_{n=1}^{N}$, we map it into a set of sentence pairs $\mathcal{P}$, where $N$ is the number of reports and $M$ the number of unique sentences in $\mathcal{P}$. We then index the set $\mathcal{P}$ with a retrieval engine such as Lucene to support similarity queries.
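To make the indexing idea concrete, here is a small pure-Python sketch of a prototype repository: unique sentences with frequency counts plus an inverted index from tokens to sentences. It is only an illustration under simplifying assumptions. The paper uses Lucene, whereas the naive sentence splitter, the function names, and the two sample reports below are hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def split_sentences(report: str) -> List[str]:
    """Naive splitter: lowercase, split on ., ! or ?, normalize whitespace."""
    parts = re.split(r"[.!?]+", report.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def build_prototype_index(
    reports: List[str],
) -> Tuple[Dict[str, int], Dict[str, Set[str]]]:
    """Return (sentence -> frequency, token -> sentences containing the token)."""
    counts: Counter = Counter()
    for report in reports:
        counts.update(split_sentences(report))
    inverted: Dict[str, Set[str]] = defaultdict(set)
    for sentence in counts:
        for token in sentence.split():
            inverted[token].add(sentence)
    return dict(counts), dict(inverted)


# Two made-up reports; "no pleural effusion" occurs in both.
reports = [
    "The heart is normal in size. No pleural effusion.",
    "No pleural effusion. The lungs are clear.",
]
counts, inverted = build_prototype_index(reports)
print(counts["no pleural effusion"])  # 2
print(sorted(inverted["the"]))
```

Adding a new report only requires updating the counts and the affected posting sets, which mirrors the incremental-update benefit claimed for a retrieval-based repository.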

M3. Query Module provides interactive report auto-completion so that users can efficiently produce a report sentence by sentence. It supports two kinds of interaction.

  1. Anchor words are a set of keywords that provide a high-level context for the report. For EEG reports, anchor words include Normal, Sleep, Seizure, Focal Slowing, and Epileptiform. Similarly, for X-ray reports anchor words include Pneumonia, Cardiomegaly, Lung Lesion, Airspace Opacity, Edema, Pleural Effusion, Fracture as used in [irvin2019chexpert].

  2. Prefix text $p_n^{(i)}$ specifies the beginning of sentence $y_n^{(i)}$ in report $Y_n$. This prefix text enables customization and control by users. Note that prefix text is completely optional in CLARA.

Anchor words and prefix text are used in the Retrieve module to find relevant sentences from the prototype repository.

M4. Interactive Retrieve and Edit module aims to find the most relevant sentences from the prototype repository (Retrieve phase), and then edit them to fit the current report (Edit phase). Usually, clinicians use a predefined template to draft the report in the clinical workflow. For example, the standard clinical documentation often follows a SOAP note (an acronym for subjective, objective, assessment, and plan). In this case, we seek sentence-level templates that users prefer using. Below we describe the two-phase approach that CLARA uses to generate sentences for medical reports.

In the retrieve phase, we use an information retrieval system to find the most relevant sentences in the prototype repository. This step simulates a doctor looking up previously written reports to identify relevant sentences to modify. Given an anchor word and optional prefix text, this module extracts a template sentence from the prototype repository. Here we use the widely adopted information retrieval system Lucene to index and search for the relevant sentences [bworld, zobel2006inverted, perez2009integrating]. More details of the indexing and scoring operations performed by the Lucene engine are in Appendix A. If anchor words are not available, CLARA first predicts which anchor words should be present by learning a classifier from the data embedding $\mathbf{x}_n$ to the anchor words $a_n$. Compared to other retrieval approaches such as [li2019knowledge], our approach is more flexible and scalable thanks to the power of retrieval systems.
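The retrieve phase can be illustrated with a toy ranking function. This sketch does not reproduce Lucene's scoring. The score used here (token overlap with the anchor words, a fixed bonus for matching the prefix, and a log-frequency popularity weight) is an assumption chosen for readability, as are the function name `retrieve` and the sample prototype sentences.

```python
import math
from typing import Dict, List, Tuple


def retrieve(
    prototypes: Dict[str, int],
    anchor_words: List[str],
    prefix: str = "",
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Rank prototype sentences for a query of anchor words plus optional prefix.

    Hypothetical score: (anchor-token overlap + prefix bonus) * (1 + ln(frequency)).
    """
    query_tokens = {tok for word in anchor_words for tok in word.lower().split()}
    prefix_norm = " ".join(prefix.lower().split())
    scored = []
    for sentence, freq in prototypes.items():
        overlap = len(query_tokens & set(sentence.split()))
        bonus = 2.0 if prefix_norm and sentence.startswith(prefix_norm) else 0.0
        if overlap == 0 and bonus == 0.0:
            continue  # unrelated sentence, never returned
        scored.append((sentence, (overlap + bonus) * (1.0 + math.log(freq))))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Made-up prototype sentences with made-up frequencies.
prototypes = {
    "generalized slowing is seen throughout the record": 12,
    "no seizures were recorded": 30,
    "focal slowing is seen over the left temporal region": 5,
}
best = retrieve(prototypes, ["Focal Slowing"], prefix="focal slowing is")
print(best[0][0])  # focal slowing is seen over the left temporal region
```

With only the single anchor word "Slowing" and no prefix, the more frequent generalized-slowing sentence ranks first under this toy score, which shows how popularity weighting and user-supplied context interact. Matching here is on exact tokens; a real engine would also apply analysis steps such as stemming.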

In the edit phase, the retrieved sentence is modified to produce the final sentence for the current report. We adopted a sequence-to-sequence model [Sutskever_Vinyals_Le_2014], which consists of an encoder and a decoder: the encoder projects the input to compressed representations and the decoder reconstructs the output. Here we use both the sentence template and the data embedding $\mathbf{x}_n$ as input to the encoder, and the revised sentence $y_n^{(i)}$ is the output sequence. The encoder is implemented as a two-layer bi-directional long short-term memory network (LSTM) [hochreiter1997long]. The decoder is a three-layer LSTM. The decoder takes the resulting context vector as input for the generation process. The context vector is then concatenated with the decoder’s hidden states and used to compute a softmax distribution over output words to produce the final $y_n^{(i)}$.


Our CLARA framework uses a sequential generation process to produce the final report. We iteratively feed the previous hidden states to the encoder so that the context generated at each sentence guides the generation of the next sentence. The anchor words and prefix texts are often included in the final generated report, as these words are part of the reports.
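The control flow of this sequential process can be sketched as a simple loop. Everything below is a stand-in: `retrieve` and `edit` are hypothetical callables in place of the Lucene lookup and the seq2seq editor, and the context passed forward is the previous sentence string, whereas the paper passes hidden states.

```python
from typing import Callable, List, Optional, Sequence


def autocomplete_report(
    anchor_words: Sequence[str],
    prefixes: Sequence[Optional[str]],
    retrieve: Callable[[Sequence[str], Optional[str]], str],
    edit: Callable[[str, Optional[str]], str],
) -> List[str]:
    """Generate one sentence per prefix slot, carrying context forward."""
    report: List[str] = []
    context: Optional[str] = None
    for prefix in prefixes:
        template = retrieve(anchor_words, prefix)  # retrieve phase
        sentence = edit(template, context)         # edit phase
        report.append(sentence)
        context = sentence                         # context for the next sentence
    return report


# Toy stand-ins so the loop can run end to end.
def toy_retrieve(anchor_words, prefix):
    start = prefix or "the record shows"
    return start + " " + " and ".join(w.lower() for w in anchor_words)


def toy_edit(template, context):
    if context is None:
        return template.capitalize() + "."
    return "In addition, " + template + "."


report = autocomplete_report(
    ["Generalized Slowing"],
    ["this is an abnormal eeg due to", None],
    toy_retrieve,
    toy_edit,
)
print(report)
```

An empty prefix slot (`None`) corresponds to a sentence generated from anchor words alone, which is the case where the user provides no partial sentence.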


We evaluate the CLARA framework to answer the following questions:

  1. Can CLARA generate higher quality clinical reports?

  2. Can the generated reports capture disease phenotypes?

  3. Does CLARA generate better reports from clinicians’ view?

Experimental Setup

Data We conduct experiments using the following datasets.

(1) The Indiana University X-ray (IU X-ray) dataset contains 7,470 images and paired reports. Each patient has 2 images (a frontal view and a lateral view) [demner2015preparing]. The paired report contains impression, findings and indication sections. We apply data preprocessing to remove duplicates from this dataset. For X-ray reports, we focus only on the findings section. After extracting the findings section, we apply tokenization and keep tokens with at least 3 occurrences in the corpus, resulting in 1235 tokens in total. We use the labels produced by the CheXpert labeler as the anchor words [irvin2019chexpert]. These labels are representative of the different phenotypes present in X-ray reports.

(2) TUH EEG Data is an EEG dataset which provides variable length EEG recording and corresponding EEG report [obeid2016temple] collected at Temple University Hospital. This dataset contains 16,950 sessions from 10,865 unique subjects. We preprocess the reports to extract the impression section of the report. We apply similar tokenization to these reports to extract tokens. We only keep the tokens with 3 or more occurrences.

(3) Massachusetts General Hospital (MGH) EEG Data is another EEG report dataset used to evaluate our methods. It was collected at a large hospital in the United States and contains 12,980 deidentified EEG recordings paired with text reports written by clinicians. We apply similar preprocessing steps to clean the reports from this dataset.

The data statistics are summarized in Table 2.

| | IU X-ray | MGH EEG | TUH EEG |
| --- | --- | --- | --- |
| Number of Patients | 3,996 | 10,890 | 10,865 |
| Number of Reports | 7,470 | 12,980 | 16,950 |
| Total EEG length | - | 4,523 hrs | 3,452 hrs |
| Total number of Final Tokens | 1235 | 2987 | 2675 |

Table 2: Dataset Statistics

Baselines: For IU X-ray image data, we compared CLARA with the following baselines. We use DenseNet [huang2017densely] as the CNN model for extracting features for all variants of CLARA models for a fair comparison.

  1. CNN-RNN [vinyals2015show] passes the image through a CNN to obtain visual features and then passes to an LSTM to generate text reports.

  2. Adaptive Attention [lu2017knowing] uses adaptive attention to produce context vectors and then generates text reports via an LSTM.

  3. HRGR [li2018hybrid] uses reinforcement learning to either generate a text report or retrieve a report from a template database.

  4. KERP [li2019knowledge] uses a graph transformer-based neural network to generate reports with a template database based approach.

  5. AG [jing2017automatic] first generates the tags associated with X-ray reports then generates reports based on those tags and visual features.

Likewise, for EEG datasets, we consider the following baselines.

  1. Mean-pooling (MP) [venugopalan2014translating] uses a CNN to extract features for different EEG segments and then combines them using mean pooling. The output feature vectors are then passed to a 2-layer LSTM to generate text reports.

  2. S2VT [venugopalan2015sequence] applies a seq-to-seq model which reads CNN outputs using an LSTM and then produces text with another LSTM.

  3. Temporal Attention Network (TAM) [yao2015describing] uses a CNN to learn EEG features and then passes them to a decoder equipped with temporal attention, which allows focusing on different EEG segments to produce the text report.

  4. Soft Attention (SA) [bahdanau2014neural] uses a soft attention mechanism that lets the decoder focus on EEG feature representations.

  5. EEG2text [EEG2text] develops a template-based approach to generate EEG reports using a hybrid model of CNN and LSTM.

Metrics: To evaluate report generation quality, we use BLEU [papineni2002bleu] and CIDEr [vedantam2015cider], which are commonly used to evaluate language generation tasks. In addition, we qualitatively evaluate the generated texts via a user study with doctors.
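For readers unfamiliar with BLEU, the following is a simplified sentence-level sketch of BLEU-N: the geometric mean of clipped n-gram precisions times a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, so it is not necessarily how the reported scores were computed; published numbers are normally produced at corpus level with a standard toolkit.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Single-reference, unsmoothed BLEU with uniform n-gram weights."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "no seizures were recorded"
print(bleu("no seizures were recorded", reference))        # identical sentence
print(bleu("no seizures were seen", reference, max_n=1))   # 3 of 4 unigrams match
```

The second call illustrates why BLEU-1 is always at least as high as BLEU-4 in the result tables: a single changed word leaves most unigrams intact but breaks every longer n-gram that contains it.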

Training Details For all models, we split the data into train, validation, and test sets with a 70%, 10%, 20% ratio. No patient appears in more than one of the train, validation, and test sets. The word embeddings used in the editing module were pre-trained specifically for each dataset.
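A patient-disjoint split of this kind can be sketched as follows. The function name, the seed, and the rounding rule are assumptions. The source specifies only the 70/10/20 ratio and the absence of patient overlap, so here the ratio is applied to patients rather than to individual samples.

```python
import random
from typing import Dict, List, Sequence


def split_by_patient(
    patient_ids: Sequence[str],
    ratios: Sequence[float] = (0.7, 0.1, 0.2),
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Assign sample indices to splits so that no patient spans two splits."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(ratios[0] * len(unique))
    n_val = round(ratios[1] * len(unique))
    group = {}
    for rank, pid in enumerate(unique):
        if rank < n_train:
            group[pid] = "train"
        elif rank < n_train + n_val:
            group[pid] = "validation"
        else:
            group[pid] = "test"
    splits: Dict[str, List[int]] = {"train": [], "validation": [], "test": []}
    for index, pid in enumerate(patient_ids):
        splits[group[pid]].append(index)
    return splits


# Ten hypothetical patients with two samples each (e.g. two views per patient).
patient_ids = ["p%d" % (i // 2) for i in range(20)]
splits = split_by_patient(patient_ids)
print({name: len(indices) for name, indices in splits.items()})
```

Splitting at the patient level matters whenever one patient contributes several samples, since a sample-level shuffle would place near-duplicate data on both sides of the train/test boundary.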

Implementation Details We implemented CLARA in PyTorch 1.2 [Paszke2017-sg]. We use ADAM [kingma2014adam] with a batch size of 128 samples to optimize all models; the learning rate is selected from [2e-3, 1e-3, 7.5e-4] and $\beta_1$ is selected from [0.5, 0.9]. We use a machine equipped with an Intel Xeon E5-2640, 256GB RAM, eight Nvidia Titan-X GPUs and CUDA 10.0. We train all models for 1000 epochs and halve the learning rate every 2 epochs after epoch 50. We used 10% of the dataset as a validation set for tuning the hyper-parameters of each model, and searched over model parameters using random search. While preprocessing the text reports, any excluded word is replaced by a special “UNKNOWN” token. Word embeddings were used with the seq2seq model in the editing module of CLARA; they are typically used with such models to provide a fixed-length vector to the LSTM model. We used pretrained word embeddings in our training procedure.
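The learning-rate schedule described above can be written as a small function. This is one literal reading of the text, with assumptions made explicit: epochs are 0-indexed, the first halving takes effect at epoch 50, and no lower bound is applied. The function name and defaults are illustrative.

```python
def learning_rate(
    epoch: int,
    base_lr: float = 1e-3,
    start_epoch: int = 50,
    every: int = 2,
) -> float:
    """Constant until start_epoch, then halved once every `every` epochs."""
    if epoch < start_epoch:
        return base_lr
    # Assumption: the first halving applies at start_epoch itself.
    halvings = (epoch - start_epoch) // every + 1
    return base_lr * 0.5 ** halvings


for epoch in (0, 49, 50, 51, 52, 60):
    print(epoch, learning_rate(epoch))
```

Read this way, the rate falls below one billionth of its starting value by roughly epoch 110, so most of the 1000 epochs would see negligible updates; the text does not say whether a floor or early stopping was used in practice.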

Pretraining CNN for X-ray data. It has been shown that pretraining of neural networks leads to better classification performance in various tasks. In other image captioning tasks, ResNets pretrained on the ImageNet dataset are often used instead of training the entire network from scratch. We therefore pretrained a DenseNet [huang2017densely] model on the publicly available ChestX-ray8 [wang2017chestx] dataset for multi-label classification. The ChestX-ray8 dataset consists of 108,948 frontal-view X-ray images of 32,717 unique patients, with each image labeled with the occurrence of 14 common thorax diseases, where labels were text-mined from the associated radiological reports using natural language processing.

Encoder CNN Details for EEG data. The input EEG is usually 25-30 minutes long, and we divide each EEG into 1-minute segments. This chunking operation leads to a [19x6000x30] dimensional input for a 30-minute EEG, where there are 19 channels and 6000 data points in time (100 Hz, 60 seconds). Each 19x6000 segment is passed through a CNN architecture which can accept multi-channel input. This CNN is composed of multiple convolution, batch normalization, and max-pooling blocks. The output of this CNN is a 1x512 dimensional feature vector, obtained from the last layer of the network, which is a fully connected layer.
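The chunking arithmetic can be checked with a short sketch. The sampling rate and segment length come from the text; the function name, the use of nested lists instead of arrays, and the choice to drop a trailing partial segment are assumptions made for this illustration.

```python
from typing import List

SAMPLING_RATE_HZ = 100
SEGMENT_SECONDS = 60


def segment_eeg(recording: List[List[float]]) -> List[List[List[float]]]:
    """Split a (channels x time) recording into non-overlapping 1-minute segments.

    Assumption: a trailing partial segment is dropped.
    """
    samples_per_segment = SAMPLING_RATE_HZ * SEGMENT_SECONDS  # 6000 samples
    n_segments = len(recording[0]) // samples_per_segment
    return [
        [ch[s * samples_per_segment:(s + 1) * samples_per_segment] for ch in recording]
        for s in range(n_segments)
    ]


# A silent 30-minute, 19-channel recording as placeholder data.
recording = [[0.0] * (30 * 60 * SAMPLING_RATE_HZ) for _ in range(19)]
segments = segment_eeg(recording)
shape = (len(segments), len(segments[0]), len(segments[0][0]))
print(shape)  # (30, 19, 6000)
```

A 25-minute recording would yield 25 segments under the same rule, which is why the downstream aggregation has to handle a variable number of segments per recording.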

These are the steps of the operations for the CNN with EEG input. In the following notation, Conv2D refers to a 2D convolution operation. DepthwiseConv2D refers to a depthwise spatial convolution. SeparableConv2D refers to a separable convolution consisting of a depthwise spatial convolution followed by a pointwise convolution. Here C denotes the number of channels, T the number of time points, and F1 the number of filters. The following operations describe the CNN for processing the EEG input.

  1. Input EEG -> (C, T)
  2. Reshape -> (1, C, T)
  3. Conv2D -> (F1, C, T), kernel size = 64, filters = 8
  4. Batch Normalization
  5. DepthwiseConv2D, number of spatial filters = 2
  6. Batch Normalization
  7. Activation -> ReLU
  8. AveragePool2D, pool size = (1, 4)
  9. Dropout, dropout rate = 0.5
  10. SeparableConv2D, filters = 16
  11. Batch Normalization
  12. Activation -> ReLU
  13. AveragePool2D, pool size = (1, 8)
  14. Dropout, dropout rate = 0.5
  15. Dense
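To see how a 19x6000 segment could shrink to a 512-dimensional vector through the operations listed above, the sketch below traces tensor shapes only. Several details are not stated in the text and are assumed here: convolutions preserve the time length ("same" padding), the depthwise convolution spans all channels and so collapses the channel axis to 1, pooling uses floor division, and a flatten step precedes the dense layer. Batch normalization, activation, and dropout are omitted because they do not change shapes.

```python
from typing import List, Tuple


def trace_shapes(
    channels: int = 19,
    time_points: int = 6000,
    f1: int = 8,          # filters in the first Conv2D (from the text)
    depth: int = 2,       # spatial filters per feature map (from the text)
    f2: int = 16,         # filters in the separable convolution (from the text)
    embed_dim: int = 512,  # output size of the dense layer (from the text)
) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (layer, output shape) pairs under the assumptions stated above."""
    shapes: List[Tuple[str, Tuple[int, ...]]] = []
    shapes.append(("Input EEG", (channels, time_points)))
    shapes.append(("Reshape", (1, channels, time_points)))
    shapes.append(("Conv2D", (f1, channels, time_points)))            # same padding assumed
    shapes.append(("DepthwiseConv2D", (f1 * depth, 1, time_points)))  # spans all channels (assumed)
    t = time_points // 4
    shapes.append(("AveragePool2D (1, 4)", (f1 * depth, 1, t)))
    shapes.append(("SeparableConv2D", (f2, 1, t)))                    # same padding assumed
    t = t // 8
    shapes.append(("AveragePool2D (1, 8)", (f2, 1, t)))
    shapes.append(("Flatten", (f2 * t,)))                             # assumed step
    shapes.append(("Dense", (embed_dim,)))
    return shapes


for layer, shape in trace_shapes():
    print(f"{layer:<22} {shape}")
```

Under these assumptions the two pooling stages reduce the time axis from 6000 to 1500 and then to 187, so the dense layer maps 2992 features to the 512-dimensional embedding.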

Anchor words used as classification labels

Anchor words are the words used by CLARA to trigger auto-completion by retrieving and editing sentences to produce the final report. These words are critical because they are used to extract the candidate sentences. We list the anchor words used in our experiments. For images, we used the labels in CheXpert [irvin2019chexpert] as the anchor words for X-ray report completion. For EEG, we use a list of terms obtained from the American Clinical Neurophysiology Society (ACNS) [hirsch2013american] guidelines. The two sets of keywords are listed below.

  1. X-ray anchor words include the following: No Finding; Enlarged Cardiomediastinum; Cardiomegaly; Lung Lesion; Airspace Opacity; Edema; Consolidation; Pneumonia; Atelectasis; Pneumothorax; Pleural Effusion; Pleural Other; Fracture.

  2. EEG anchor words include Normality, Sleep, Generalized Slowing, Focal Slowing, Epileptiform Discharges, Drowsiness, Spindles, Vertex Waves, Seizure.


(A). CLARA can generate higher quality clinical reports

We compare CLARA with state-of-the-art baselines using the following experiments:

  1. Report level auto-completion with predefined anchor words.

  2. Report level auto-completion without predefined anchor words (i.e., anchor words are predicted). This experiment evaluates the scenario of fully automated report generation.

  3. Sentence level auto-completion. This experiment simulates the real-world report auto-completion behavior where the recommendation is provided sentence by sentence.

Table 5 summarizes the report level performance on both X-ray image and EEG datasets. CLARA (predicted anchor words) outperforms the best baselines with a 17-30% improvement in CIDEr, which confirms the effectiveness of the retrieval from the prototype repository. We can also see with interactive guidance of anchor words from clinicians, CLARA (defined anchor words) provides an even better performance, which shows the importance of human input for report generation. To further understand the behavior of individual modules, we evaluate CLARA without edit module (only sentence retrieval from existing reports), which still achieves better performance in CIDEr than baselines but is much lower than CLARA (predicted anchor words) utilizing both retrieval and edit modules.

With sentence-by-sentence interactive report auto-completion using anchor words and prefix text, the performance of CLARA can be further improved. We evaluated CLARA with varying numbers of anchor words (1 to 5) and with prefix sentences of variable length to understand their effect. We present these sentence-level auto-completion results in Table 3. As the results show, scores increase with the number of anchor words. We observe that increasing the number of anchor words improves the performance of CLARA by 1-2%. In a real-world deployment of CLARA, clinicians can provide more input (anchor words) to the system to obtain more accurate results, which is an advantage over current baselines, where clinicians have no control over the report generation.

Figure 3: CIDEr and BLEU scores for EEG and X-ray report generation increase with the number of anchor words. This trend indicates that anchor words help guide CLARA to produce higher-quality reports. As clinicians provide more anchor words, CLARA can extract better candidate sentences and edit the sentences
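
| Dataset | Method | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|---|---|
| IU X-ray (image) | CNN-RNN | 0.294 | 0.216 | 0.124 | 0.087 | 0.066 |
| IU X-ray (image) | Adaptive Attention | 0.295 | 0.220 | 0.127 | 0.089 | 0.068 |
| IU X-ray (image) | AG [jing2017automatic] | 0.277 | 0.455 | 0.288 | 0.205 | 0.154 |
| IU X-ray (image) | HRGR [li2018hybrid] | 0.343 | 0.438 | 0.298 | 0.208 | 0.151 |
| IU X-ray (image) | CLARA (1 anchor word) | 0.356 | 0.471 | 0.318 | 0.209 | 0.199 |
| IU X-ray (image) | CLARA (2 anchor words) | 0.367 | 0.484 | 0.334 | 0.212 | 0.218 |
| IU X-ray (image) | CLARA (3 anchor words) | 0.374 | 0.488 | 0.355 | 0.235 | 0.227 |
| IU X-ray (image) | CLARA (4 anchor words) | 0.379 | 0.495 | 0.358 | 0.243 | 0.234 |
| IU X-ray (image) | CLARA (5 anchor words) | 0.393 | 0.498 | 0.375 | 0.259 | 0.248 |
| IU X-ray (image) | CLARA (with prefix) | 0.425 | 0.512 | 0.402 | 0.281 | 0.254 |
| MGH (EEG) | MP [venugopalan2014translating] | 0.371 | 0.715 | 0.634 | 0.561 | 0.448 |
| MGH (EEG) | S2VT [venugopalan2015sequence] | 0.321 | 0.748 | 0.623 | 0.531 | 0.469 |
| MGH (EEG) | TAM [yao2015describing] | 0.345 | 0.748 | 0.672 | 0.593 | 0.381 |
| MGH (EEG) | SA [bahdanau2014neural] | 0.353 | 0.689 | 0.634 | 0.573 | 0.484 |
| MGH (EEG) | EEG2text [EEG2text] | 0.386 | 0.731 | 0.719 | 0.562 | 0.453 |
| MGH (EEG) | CLARA (1 anchor word) | 0.443 | 0.763 | 0.684 | 0.603 | 0.463 |
| MGH (EEG) | CLARA (2 anchor words) | 0.458 | 0.765 | 0.687 | 0.609 | 0.468 |
| MGH (EEG) | CLARA (3 anchor words) | 0.462 | 0.773 | 0.688 | 0.614 | 0.485 |
| MGH (EEG) | CLARA (4 anchor words) | 0.477 | 0.781 | 0.693 | 0.627 | 0.489 |
| MGH (EEG) | CLARA (5 anchor words) | 0.482 | 0.785 | 0.716 | 0.645 | 0.491 |
| MGH (EEG) | CLARA (with prefix) | 0.495 | 0.793 | 0.725 | 0.661 | 0.516 |
| TUH (EEG) | MP [venugopalan2014translating] | 0.368 | 0.643 | 0.579 | 0.462 | 0.364 |
| TUH (EEG) | S2VT [venugopalan2015sequence] | 0.371 | 0.725 | 0.634 | 0.545 | 0.441 |
| TUH (EEG) | TAM [yao2015describing] | 0.385 | 0.719 | 0.646 | 0.503 | 0.469 |
| TUH (EEG) | SA [bahdanau2014neural] | 0.353 | 0.738 | 0.621 | 0.524 | 0.432 |
| TUH (EEG) | EEG2text [EEG2text] | 0.368 | 0.723 | 0.678 | 0.609 | 0.457 |
| TUH (EEG) | CLARA (1 anchor word) | 0.449 | 0.763 | 0.688 | 0.614 | 0.464 |
| TUH (EEG) | CLARA (2 anchor words) | 0.452 | 0.771 | 0.691 | 0.621 | 0.469 |
| TUH (EEG) | CLARA (3 anchor words) | 0.454 | 0.773 | 0.694 | 0.624 | 0.470 |
| TUH (EEG) | CLARA (4 anchor words) | 0.467 | 0.782 | 0.701 | 0.637 | 0.475 |
| TUH (EEG) | CLARA (5 anchor words) | 0.479 | 0.789 | 0.705 | 0.645 | 0.483 |
| TUH (EEG) | CLARA (with prefix) | 0.505 | 0.792 | 0.726 | 0.668 | 0.496 |

Table 3: Sentence level completion for EEGs and X-rays
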
| Dataset | Method | Averaged Accuracy | PR-AUC |
|---|---|---|---|
| IU X-ray (image) | CNN-RNN | 0.804 | 0.709 |
| IU X-ray (image) | Adaptive Attention | 0.823 | 0.723 |
| IU X-ray (image) | CLARA (predicted anchor words) | 0.871 | 0.796 |
| IU X-ray (image) | CLARA (defined anchor words) | 0.894 | 0.804 |
| MGH | MP [venugopalan2014translating] | 0.745 | 0.724 |
| MGH | S2VT [venugopalan2015sequence] | 0.773 | 0.738 |
| MGH | TAM [yao2015describing] | 0.761 | 0.713 |
| MGH | SA [bahdanau2014neural] | 0.743 | 0.716 |
| MGH | EEG2text [EEG2text] | 0.784 | 0.748 |
| MGH | CLARA (predicted anchor words) | 0.835 | 0.803 |
| MGH | CLARA (defined anchor words) | 0.861 | 0.814 |
| TUH | MP [venugopalan2014translating] | 0.758 | 0.697 |
| TUH | S2VT [venugopalan2015sequence] | 0.764 | 0.683 |
| TUH | TAM [yao2015describing] | 0.782 | 0.698 |
| TUH | SA [bahdanau2014neural] | 0.781 | 0.701 |
| TUH | EEG2text [EEG2text] | 0.793 | 0.735 |
| TUH | CLARA (predicted anchor words) | 0.827 | 0.786 |
| TUH | CLARA (defined anchor words) | 0.834 | 0.804 |

Table 4: Accuracy of disease phenotype prediction based on generated reports

Figure 4: Examples of X-ray reports (doctor-written report vs. generated reports). The second column shows the doctor-written reports. The third and fourth columns show the reports generated by CLARA without edit module (just retrieval of most relevant sentences) and CLARA with edit module using defined anchor words (our best model), respectively. Underlined text indicates matched terms of the generated text and ground truth reports. The highlighted words show the changes added by the edit module of CLARA. These generated reports show that our method CLARA is able to write reports that are similar to the doctor's written report. CLARA without the edit module is also capable of extracting good candidate sentences.
Figure 5: An example of EEG report generation. The first column shows the ground truth text report. The second column shows the report generated by the TAM [yao2015describing] method. The third column shows the report generated by CLARA without the edit module, and the fourth column shows the report generated by CLARA with the edit module using defined anchor words. Underlined text indicates the correspondence between the generated text and the ground truth reports. The highlighted words show the edit module of CLARA adding more information to the retrieved report. Noticeably, CLARA with the edit module can capture specific information such as “Delta slowing” and “anterior quadrant”. These changes introduced by the edit module showcase the importance of retrieve and edit module for producing a module

| Dataset | Method | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|---|---|
| IU X-ray (image) | CNN-RNN | 0.294 | 0.216 | 0.124 | 0.087 | 0.066 |
| IU X-ray (image) | Adaptive Attention | 0.295 | 0.220 | 0.127 | 0.089 | 0.068 |
| IU X-ray (image) | AG [jing2017automatic] | 0.277 | 0.455 | 0.288 | 0.205 | 0.154 |
| IU X-ray (image) | HRGR [li2018hybrid] | 0.343 | 0.438 | 0.298 | 0.208 | 0.151 |
| IU X-ray (image) | KERP [li2019knowledge] | 0.280 | 0.482 | 0.325 | 0.226 | 0.162 |
| IU X-ray (image) | CLARA (without edit module) | 0.317 | 0.421 | 0.288 | 0.201 | 0.142 |
| IU X-ray (image) | CLARA (predicted anchor words) | 0.359 | 0.471 | 0.324 | 0.214 | 0.199 |
| IU X-ray (image) | CLARA (defined anchor words) | 0.374 | 0.489 | 0.356 | 0.225 | 0.234 |
| MGH (EEG) | MP [venugopalan2014translating] | 0.367 | 0.714 | 0.644 | 0.563 | 0.443 |
| MGH (EEG) | S2VT [venugopalan2015sequence] | 0.319 | 0.741 | 0.628 | 0.529 | 0.462 |
| MGH (EEG) | TAM [yao2015describing] | 0.334 | 0.749 | 0.668 | 0.581 | 0.378 |
| MGH (EEG) | SA [bahdanau2014neural] | 0.348 | 0.684 | 0.629 | 0.568 | 0.472 |
| MGH (EEG) | EEG2text [EEG2text] | 0.372 | 0.742 | 0.728 | 0.587 | 0.381 |
| MGH (EEG) | CLARA (without edit module) | 0.382 | 0.691 | 0.651 | 0.564 | 0.405 |
| MGH (EEG) | CLARA (predicted anchor words) | 0.419 | 0.742 | 0.674 | 0.594 | 0.452 |
| MGH (EEG) | CLARA (defined anchor words) | 0.443 | 0.762 | 0.684 | 0.614 | 0.464 |
| TUH (EEG) | MP [venugopalan2014translating] | 0.363 | 0.645 | 0.578 | 0.459 | 0.361 |
| TUH (EEG) | S2VT [venugopalan2015sequence] | 0.364 | 0.724 | 0.613 | 0.543 | 0.438 |
| TUH (EEG) | TAM [yao2015describing] | 0.384 | 0.714 | 0.647 | 0.492 | 0.461 |
| TUH (EEG) | SA [bahdanau2014neural] | 0.341 | 0.736 | 0.619 | 0.519 | 0.420 |
| TUH (EEG) | EEG2text [EEG2text] | 0.381 | 0.752 | 0.618 | 0.593 | 0.428 |
| TUH (EEG) | CLARA (without edit module) | 0.368 | 0.725 | 0.634 | 0.573 | 0.423 |
| TUH (EEG) | CLARA (predicted anchor words) | 0.399 | 0.769 | 0.635 | 0.601 | 0.455 |
| TUH (EEG) | CLARA (defined anchor words) | 0.425 | 0.784 | 0.659 | 0.624 | 0.483 |

Table 5: Report level completion tasks on image data (IU X-ray) and EEG time series data (MGH and TUH) for Impression Section generation task

(B). CLARA provides accurate disease phenotyping

We also evaluate the effectiveness of CLARA in disease phenotype prediction. In particular, we train a character CNN classifier [zhang2015character] on original reports written by doctors to predict disease phenotypes. This classifier is then used to score the reports generated by the different baselines and by CLARA, and we measure the accuracy of predicting the different disease phenotypes. CLARA consistently outperforms the baselines. The results are shown in Table 4.
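
The scoring step of this protocol can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "averaged accuracy" means per-phenotype accuracy averaged over phenotypes, and the keyword-based `toy_classifier`, the function names, and the phenotype labels are all hypothetical stand-ins for the character CNN trained on doctor-written reports.

```python
from typing import Callable, Dict, List

Labels = Dict[str, int]  # phenotype name -> 0 (absent) or 1 (present)


def averaged_accuracy(
    classifier: Callable[[str], Labels],
    generated_reports: List[str],
    true_labels: List[Labels],
) -> float:
    """Apply a report classifier to generated reports and average the
    per-phenotype accuracies (an assumed reading of 'averaged accuracy')."""
    predictions = [classifier(report) for report in generated_reports]
    phenotypes = sorted(true_labels[0].keys())
    per_phenotype = []
    for phenotype in phenotypes:
        correct = sum(
            int(pred[phenotype] == truth[phenotype])
            for pred, truth in zip(predictions, true_labels)
        )
        per_phenotype.append(correct / len(generated_reports))
    return sum(per_phenotype) / len(per_phenotype)


def toy_classifier(report: str) -> Labels:
    """Hypothetical keyword rule standing in for the trained character CNN."""
    text = report.lower()
    return {
        "effusion": int("effusion" in text and "no pleural effusion" not in text),
        "pneumothorax": int("pneumothorax" in text and "no pneumothorax" not in text),
    }


generated = [
    "Small left pleural effusion. No pneumothorax.",
    "No pleural effusion. No pneumothorax.",
]
truth = [
    {"effusion": 1, "pneumothorax": 0},
    {"effusion": 0, "pneumothorax": 1},
]
score = averaged_accuracy(toy_classifier, generated, truth)
```

The key design point is that the classifier is fixed after training on doctor-written reports, so differences in the score reflect only how well each generated report preserves the phenotype-relevant content.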

(C). Clinical Expert Evaluation of CLARA

The results of our models were evaluated by an expert neurologist in terms of their usefulness for clinical practice. In this setup, we measured a quality score for the generated reports. We only evaluated the EEG report generation task in this experimental setting. We provided the experts with samples in which doctor-written reports, reports generated by the best-performing baselines, and reports generated by CLARA were presented side by side. Clinicians were asked to provide a quality score in the range of 0-5. As shown in Figure 6, CLARA obtained an average quality score of 3.74, compared to 2.52 for TAM (the best-performing baseline). These results indicate that the reports produced by CLARA are of higher clinical quality.

Figure 6: Clinical expert assessment of generated reports.

(D). Qualitative Analysis

We show sample results of clinical report generation using CLARA in Figure 4. Reports generated by CLARA show significant clinical accuracy and a granular understanding of the input image. Because clinicians use CLARA with different anchor words to generate the report, important clinical findings such as “Pleural effusion” and “Pneumothorax” are included. Since anchor words are based on clinically significant findings, they push the report generation module to be clinically accurate. The third and fourth columns of the figure show the changes introduced by the edit module, which is able to modify the retrieved sentences with more details. For example, “focal consolidation” in row 1, “Cholecystectomy” in row 2, and “Granuloma in right side” are important edits performed by the edit module of CLARA.


Medical report writing is important but labor-intensive for human doctors. Most existing works on medical report generation focus on generating full reports without close human guidance, which is error-prone and does not follow the clinical workflow. In this work, we propose CLARA, a computational method for the clinical report auto-completion task, which interactively helps doctors write clinical reports in a sentence-by-sentence fashion. At its core, CLARA combines an information retrieval engine with neural networks: it retrieves the most relevant sentences via the retrieval system and then modifies them using neural networks. Our experiments show that CLARA can produce higher-quality and clinically accurate reports. CLARA outperforms a variety of compelling baseline methods across tasks and datasets, with up to 35% improvement in CIDEr and BLEU-4 over the best baseline.


This work was supported in part by the National Science Foundation awards IIS-1418511, CCF-1533768 and IIS-1838042, and the National Institute of Health awards NIH R01 1R01NS107291-01 and R56HL138415.


Appendix A Supplementary

Lucene Details

Lucene implements a variant of the TF-IDF scoring model, built from the following quantities:

  • tf = term frequency in document = measure of how often a term appears in the document

  • idf = inverse document frequency = measure of how often the term appears across the index

  • coord = number of terms in the query that were found in the document

  • lengthNorm = measure of the importance of a term according to the total number of terms in the field

  • queryNorm = normalization factor so that queries can be compared

  • boost (index) = boost of the field at index-time

  • boost (query) = boost of the field at query-time


Factor Description:

  1. tf(t in d): Term frequency factor for the term (t) in the document (d).

  2. idf(t): Inverse document frequency of the term.

  3. coord(q,d): Score factor based on how many of the query terms are found in the specified document.

  4. queryNorm(q): Normalizing factor used to make scores between queries comparable.

  5. t.getBoost(): Field boost.

  6. norm(t,d): Encapsulates a few (indexing time) boost and length factors.
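
To make the role of each factor concrete, the sketch below combines them in the form of Lucene's classic practical scoring function, score(q,d) = coord(q,d) · queryNorm(q) · Σ_t tf(t in d) · idf(t)² · t.getBoost() · norm(t,d). It is a plain-Python illustration and not Lucene's API. The specific definitions used (tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), and norm as 1 / sqrt(number of terms in the document)) follow the classic TF-IDF similarity as we understand it and are assumptions here, since they differ between Lucene versions and the text does not state which version was used. The function name and the example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def classic_tfidf_score(
    query_terms: List[str],
    doc_terms: List[str],
    corpus: List[List[str]],
    boosts: Optional[Dict[str, float]] = None,
) -> float:
    """Score one document against a query with a classic TF-IDF formula
    (assumed definitions; see the accompanying text)."""
    boosts = boosts or {}
    query_terms = list(dict.fromkeys(query_terms))  # drop duplicate query terms
    if not query_terms or not doc_terms:
        return 0.0
    num_docs = len(corpus)
    counts = Counter(doc_terms)

    def idf(term: str) -> float:
        doc_freq = sum(1 for doc in corpus if term in doc)
        return 1.0 + math.log(num_docs / (doc_freq + 1))

    # queryNorm(q): makes scores from different queries comparable.
    sum_sq = sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)

    # coord(q,d): fraction of query terms that occur in the document.
    matched = [t for t in query_terms if counts[t] > 0]
    coord = len(matched) / len(query_terms)

    # norm(t,d): here only the length normalization, without index-time boosts.
    length_norm = 1.0 / math.sqrt(len(doc_terms))

    total = 0.0
    for t in matched:
        tf = math.sqrt(counts[t])
        total += tf * idf(t) ** 2 * boosts.get(t, 1.0) * length_norm
    return coord * query_norm * total


corpus = [
    "no pleural effusion or pneumothorax".split(),
    "small right pleural thickening".split(),
    "heart size is normal".split(),
]
query = ["pleural", "effusion"]
scores = [classic_tfidf_score(query, doc, corpus) for doc in corpus]
```

In this toy corpus the first sentence matches both query terms and ranks first, the second matches only one term and is further penalized by coord, and the third scores zero.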

Lucene steps from query to output

In this section, we describe some details of the Lucene search engine. Usually, a query is passed to the Searcher of the Lucene engine, which begins the scoring process. The Searcher then uses a Collector to score and sort the search results. The following objects are involved in a search: (1) the Weight object of the Query, an internal representation of the Query that allows the Query to be reused by the Searcher; (2) the Searcher that initiated the call; (3) a Filter for limiting the result set; and (4) a Sort object for specifying the sorting criteria for the results when the standard score-based sort is not desired.
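
The flow from query to ranked output can be mimicked in a few lines of Python. This is a toy analogue of the roles described above, not Lucene's API: `overlap_score` stands in for the Weight, `doc_filter` for the Filter, the `heapq.nlargest` call for the Collector, and `sort_key` for the Sort object. All names and the scoring rule are hypothetical.

```python
import heapq
from typing import Callable, List, Optional, Sequence, Tuple


def overlap_score(query: Sequence[str], doc: Sequence[str]) -> float:
    """Stand-in for the Weight: fraction of query terms present in the document."""
    if not query:
        return 0.0
    doc_set = set(doc)
    return sum(1 for term in query if term in doc_set) / len(query)


def search(
    query: Sequence[str],
    docs: List[List[str]],
    top_k: int = 3,
    doc_filter: Optional[Callable[[List[str]], bool]] = None,
    sort_key: Optional[Callable[[List[str]], float]] = None,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the best-matching documents."""
    # Filter: restrict the candidate set before scoring.
    candidates = [
        (i, doc) for i, doc in enumerate(docs)
        if doc_filter is None or doc_filter(doc)
    ]
    # Scoring: keep only documents that match at least one query term.
    scored = [(overlap_score(query, doc), i) for i, doc in candidates]
    scored = [(score, i) for score, i in scored if score > 0]
    # Collector: keep the top_k hits, highest score first.
    hits = heapq.nlargest(top_k, scored, key=lambda hit: hit[0])
    # Sort: optional custom ordering instead of the score-based default.
    if sort_key is not None:
        hits = sorted(hits, key=lambda hit: sort_key(docs[hit[1]]))
    return [(i, score) for score, i in hits]


docs = [
    ["no", "pleural", "effusion"],
    ["heart", "size", "normal"],
    ["small", "pleural", "effusion", "on", "the", "left"],
    ["pleural", "thickening"],
]
query = ["pleural", "effusion"]
default_hits = search(query, docs, top_k=2)
filtered_hits = search(query, docs, top_k=2, doc_filter=lambda d: "no" not in d)
shortest_first = search(query, docs, top_k=3, sort_key=len)
```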

Simulated auto-completion: The auto-completion system requires a trigger from the user to initiate the process. These triggers are initialized with a prefix or anchor words provided by the user. In a real-world deployed version of our model, doctors will provide anchor words or a prefix to CLARA to trigger completion of the sentences. While developing the method, however, we cannot expect to train the model with input from clinicians. We therefore created a simulated environment where the anchor words are predefined for each report. The anchor word creation steps are detailed in the section above.
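
One way such a simulated trigger could be produced is sketched below. This is an assumption for illustration only and does not reproduce the authors' anchor word creation steps: it takes anchor words to be the reference-sentence tokens found in a hypothetical vocabulary of clinical terms (`anchor_vocab`) and takes the prefix to be the first few tokens of the reference sentence. The function name and parameters are likewise hypothetical.

```python
from typing import List, Set, Tuple


def simulate_trigger(
    reference_sentence: str,
    anchor_vocab: Set[str],
    max_anchors: int = 5,
    prefix_len: int = 3,
) -> Tuple[List[str], str]:
    """Derive simulated anchor words and a prefix from a reference sentence."""
    cleaned = reference_sentence.lower().replace(".", " ").replace(",", " ")
    tokens = cleaned.split()
    anchors: List[str] = []
    for token in tokens:
        if len(anchors) == max_anchors:
            break
        # Keep each clinical term once, in order of appearance.
        if token in anchor_vocab and token not in anchors:
            anchors.append(token)
    prefix = " ".join(tokens[:prefix_len])
    return anchors, prefix


vocab = {"effusion", "pneumothorax", "consolidation"}
sentence = "There is no pleural effusion or pneumothorax."
one_anchor, prefix = simulate_trigger(sentence, vocab, max_anchors=1)
all_anchors, _ = simulate_trigger(sentence, vocab, max_anchors=5)
```

Varying `max_anchors` from 1 to 5 mirrors the sentence-level experiments with an increasing number of anchor words, and varying `prefix_len` mirrors the variable-length prefix setting.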