DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction

Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The core problem of patient-trial matching is to find qualified patients for a trial, where patient information is stored in electronic health records (EHR) while trial eligibility criteria (EC) are described in text documents available on the web. How should longitudinal patient EHR be represented? How can complex logical rules be extracted from EC? Most existing works rely on manual rule-based extraction, which is time-consuming and inflexible for complex inference. To address these challenges, we proposed DeepEnroll, a cross-modal inference learning model that jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference. DeepEnroll applies a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode clinical trial information into sentence embeddings, and uses a hierarchical embedding model to represent patient longitudinal EHR. In addition, DeepEnroll is augmented by a numerical information embedding and entailment module to reason over numerical information in both EC and EHR. These encoders are trained jointly to optimize the patient-trial matching score. We evaluated DeepEnroll on the patient-trial matching task using real-world datasets. DeepEnroll outperformed the best baseline by up to 12.4% in average F1.




1. Introduction

Clinical trial enrollment is a long-standing problem for drug development. Often too few patients enroll in trials even though many of them have the target condition. The first barrier to trial participation is simple: many patients are unaware that clinical trials are open and relevant to them. Fortunately, the online availability of trial eligibility criteria (EC) and the rich collection of patient electronic health record (EHR) data in hospitals bring new promise to data-driven automated trial enrollment.

Over the years, both rule-based systems and machine learning approaches have been proposed for patient-trial matching. Rule-based systems (weng2011elixr; kang2017eliie) rely heavily on laborious rule-setting and human annotation. They also yield poor recall because of morphological variants and inadequate rule coverage. Machine learning based models perform rule extraction automatically. For example, Alicante et al. (alicante2016unsupervised) applied unsupervised clustering methods for eligibility rule extraction. Bustos et al. (bustos2018learning) evaluated several classifiers, including convolutional neural networks, for EC classification in the cancer domain. Criteria2Query (yuan2019criteria2query) utilized rules extracted from (kang2017eliie) to generate matching patient definitions. Despite their improvement over rule-based systems, existing patient-trial matching methods still face the following challenges.

  1. Heterogeneous data from EC and patient EHR. Existing methods work in two steps, extracting rules from EC first, which could yield criteria that are too strict to find enough patients in EHR data. However, simultaneous matching of EC and EHR is difficult. ECs use unstructured natural language to describe eligibility criteria. EHRs, on the other hand, use structured clinical codes to represent patients' hospital visits. There is a lack of cross-modal representations that can link matched concepts in ECs and EHR.

  2. Lack of explicit modeling for numerical information. EC statements often include numerical information such as age, lab result values and medication dosage. This creates challenges for matching, since numbers are a common source of contradictions in natural language processing tasks (dagan2013recognizing) and distributed representations are often insensitive to them (roy2015reasoning). Existing patient-trial matching works pay little attention to this issue.

To fill the gap, we proposed DeepEnroll to perform patient-trial matching based on clinical trial EC and patient health data. DeepEnroll is enabled by the following technical contributions.

  1. Joint embedding and entailment prediction to match heterogeneous data. DeepEnroll applies a Bidirectional Encoder Representations from Transformers (BERT (devlin2018bert)) model to encode clinical trial information. It also performs hierarchical embedding to encode patient health data. Patient-trial matching is cast as an entailment prediction problem, where we model the patient embedding as the hypothesis and the trial embedding as the premise, and the objective is to predict whether a particular patient can be inferred from a given trial embedding. The encoders for EC and EHR and the entailment task are jointly optimized to find good matches between trials and patients.

  2. Numerical information entailment module to explicitly match numerical information. DeepEnroll is augmented by a numerical information embedding and entailment module to explicitly reason over numbers. It proposes a numerical representation to specifically encode numbers in textual input (EC). Then a pattern-based comparison algorithm is developed for (1) extracting numbers from free-text ECs and (2) inference between extracted numerical information in EC and EHR. Then the comparison results will be used to update the matching results.

We evaluated DeepEnroll on both real-world clinical trial datasets and a synthetic dataset. We evaluated patient-trial matching by predicting patient enrollment for trials. DeepEnroll outperformed the best state-of-the-art baselines by up to 12.4% in average F1 and 6.8% in PR-AUC.

2. Related Work

Patient-Trial Matching includes both rule-based and machine learning based models. Among rule-based models, EliXR (weng2011elixr) matches Unified Medical Language System (UMLS) concepts and relations via a pre-defined dictionary and regular expressions. Another system, EliIE (kang2017eliie), parses and formalizes free-text EC with the OMOP standard. The rule-based models learn to structure named entities and relations for clinical trials only, rather than for patient-trial matching. In addition, they rely heavily on rule-setting and human annotation, and they yield poor recall because of morphological variants of ECs and inadequate rule coverage. As for machine learning based algorithms, (alicante2016unsupervised) applies unsupervised clustering methods for eligibility rule extraction, and (bustos2018learning) compared CNN, SVM, and kNN as classifiers for EC classification in the cancer domain. More recently, Criteria2Query (yuan2019criteria2query) introduced a hybrid information extraction pipeline that combines machine learning and rule-based methods to form patient criteria. Compared to the existing works, DeepEnroll is an end-to-end deep learning model that achieves significantly better performance, as shown in the experiments.

Pre-training Techniques. The goal of pre-training techniques is to provide model training with good initializations. Recently, language model pre-training techniques such as (peters2018deep; radford2018improving; devlin2018bert) have been shown to substantially improve performance on multiple natural language processing tasks. As the most widely used one, BERT (devlin2018bert) builds on the Transformer (vaswani2017attention) architecture and improves pre-training by using a masked language model for bidirectional representation. Previously, BERT has been applied to EHR data to leverage unlabeled data for medication recommendation (ijcai2019-825). In this paper, we adapt the BERT framework and pre-train our clinical trial EC embedding using large medical corpora.

Cross-modal Matching is enabled by joint embedding learning and pairwise similarity learning. Joint embedding learning aims to find a joint latent space where embedding of similar text and tabular data are close (li2017identity; wang2016learning). Pairwise similarity learning focuses on similarity computation for the matching (huang2017instance; wang2018learning). For patient-trial matching task, one EC statement could entail a group of EHR and vice versa. Therefore, similarity learning is not appropriate due to the need for capturing the many-to-many relationship.

3. Method

3.1. Problem Formulation

Patient EHR data for patient can be represented as a sequence of hospital visits . Each hospital visit consists of a major diagnosis code and a set of associated treatments (medications or procedures) . Similarly, each consists of a varying number of treatments . The corresponding clinical trial can be represented as where each indicates the -th EC and can be represented as a sequence of words: where is the -th word in EC . In this paper, we embed EC and EHR into a shared latent space, where and are the sets of distributed representations of and after transformation. The matching is cast as a multiclass classification problem, where the matching results between and are classified into the following categories: "entailment", "contradiction" and "neutral".

Notation Description
; EC set of the -th trial; EHR set of the -th patient
the -th EC of the -th trial
unnormalized attention weight matrix in alignment module
; EC embedding set; EHR embedding set
;; the -th visit; major diagnosis; treatment codes of patient
the -th treatment code from
weight matrix for predicting and
weight matrix for MLP layer
set of demographics of given patient
the -th demographics of given patient
visit-level embedding of
; EC embedding; EHR embedding
; soft-aligned phrases of to ; of to
partly inference output from (, ) and (, )
aggregation output from and
;;; function for EHR hierarchical embeddings

non-linear activation function

; # of EC statements; # of EHR visits
real, predicted outcome
Table 1. Notations used in this paper.

3.2. The DeepEnroll Model

As illustrated in Fig. 1, DeepEnroll includes the following components: a trial EC embedding module, a hierarchical patient representation module, and an alignment and entailment prediction module that performs patient-trial matching via attentive comparison of soft-aligned fragments. We first introduce these modules and then provide details of training and inference.

Figure 1. The DeepEnroll framework. DeepEnroll applies a pre-trained BERT model to encode clinical trial information into sentence embedding , and the hierarchical embedding model encodes patient health data into embedding . The distributed representations and are then soft-aligned to capture the interaction from each side. In addition, DeepEnroll is augmented by a numerical information embedding and entailment module to explicitly match numerical information representations (NIR) and reason over numerical information in EC and EHR. Finally, the match module predicts the matching scores from the interaction representations and . The results are further updated by the result of numerical information entailment.

Trial EC Sentence Embedding Using BERT

Clinical trial ECs are in the form of unstructured text. To embed these data into meaningful vector representations, we use a pre-trained BERT model as the base model for word and sentence representations. Based on a multi-layer Transformer encoder (vaswani2017attention), BERT is pre-trained using two unsupervised tasks.

  1. Masked Language Model. Instead of predicting words based on previous words, BERT predicts randomly masked words in a sequence to learn better bidirectional representations.

  2. Next Sentence Prediction. To obtain better sentence embeddings, which are used in our downstream matching prediction task, BERT includes a binary classification task that predicts whether one sentence follows another.

In particular, we applied Clinical BERT (alsentzer2019publicly), which was pre-trained on three medical corpora including PubMed abstracts, PubMed full-text articles, and MIMIC-III doctor notes. We then collected text data for clinical trials, including trial condition, summary, description and ECs, to further fine-tune Clinical BERT. Finally, we used the fine-tuned BERT model to generate the EC sentence embeddings for the downstream patient-trial matching task.

The trial EC embedding process can be formally described as follows. We denote the fine-tuned BERT model as and the ECs of the clinical trial as , where each indicates the -th EC and is represented as a sequence of words: . The embedding for each sentence is given by:


where represents the sentence embedding of the -th EC statement of clinical trial and is the sequence of embedding for the entire clinical trial .

Hierarchical Patient Data Embedding

Inspired by the multilevel patient embedding strategy in (choi2018mime), we also leverage the inherent multilevel structure of EHR data to learn patient embeddings. EHR data is organized hierarchically: the patient level, then the visit level, then the diagnosis codes and corresponding treatment codes for each visit. Demographic information is also provided at the patient level, including birth year, gender, country, geo location, ethnicity, and blood type. We encode these medical codes and demographics using one-hot vectors. We first embed them into a dense representation, and then apply a hierarchical network to learn the patient embedding.

Formally, for each patient, we denote patient demographics as . Each visit encompasses treatments and diagnosis . The hierarchical embedding process is given below.


We denote as a single MLP that encodes demographics data , diagnosis codes , and treatment codes into a -dimensional embedding. A weight matrix converts into a different latent space to effectively capture its interaction with . The function combines the distributed representation of the diagnosis code and the sum of drug-treatment interactions into the visit-level embedding. The function then integrates the demographics embedding and the visit embeddings into the patient-level embedding.

To fine-tune the patient embedding, we further develop an auxiliary prediction task to predict diagnosis code and corresponding treatment codes based on visit-level embedding .


where and are the weight matrices used to predict the diagnosis code and the treatment codes, respectively.

Alignment Module

Attentive neural networks have demonstrated success in entailment prediction tasks (parikh2016decomposable; rocktaschel2015reasoning). In our case, we regard the trial EC embedding as the premise and the encoded EHR as the hypothesis. Our task is to predict whether the hypothesis is entailed by the premise, i.e., given a trial EC, whether a particular patient is a match for the trial.

First, we create a soft-alignment matrix using neural attention between a set of premises and hypotheses. Next, we use the alignment vectors to decompose our task into two sub-problems.

  1. Given a certain premise, predict the entailment between the premise and all hypotheses.

  2. Given a certain hypothesis, predict the entailment between the hypothesis and all premises.

Such a decomposition allows every EC to be soft-aligned to all related EHR, as well as every EHR to all related EC. Accordingly, the match module can capture the entailment relations between all EC and EHR pairs.

Formally, we denote as a set of premises and as a set of hypotheses, where . We feed elements from the premise and hypothesis sets into a shared transformation layer parameterized by () and obtain unnormalized attention weights as the soft-alignment matrix .



Here is an activation function (ReLU in our case) that operates element-wise over a vector, and is the attention weight between and . The soft-alignment matrix is then used to obtain the weighted summation for premise as and hypothesis as .


Here is the related hypothesis in that is softly aligned to , while is the related premise in EC that is softly aligned to . is the formal representation of the input to sub-problem (1), and is the formal representation of the input to sub-problem (2).

The proposed decomposable soft-alignment has two major benefits. First, the aligned vectors focus only on related local sub-structures, which has been shown to be effective for the entailment inference task (parikh2016decomposable). Second, the decomposition avoids quadratic complexity ( times) and only requires linear complexity ( times).
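The soft-alignment step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `soft_align`, `relu`, `softmax` and `dot` are hypothetical, a parameter-free ReLU stands in for the learned shared transformation layer, and softmax normalization of the attention weights is assumed in the style of decomposable attention.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def relu(v):
    # Assumption: stands in for the learned shared transformation layer.
    return [max(0.0, x) for x in v]

def soft_align(premises, hypotheses, transform=relu):
    """Return (hypothesis summary per premise, premise summary per hypothesis)."""
    fp = [transform(p) for p in premises]
    fh = [transform(h) for h in hypotheses]
    # Unnormalized attention weights: one row per premise, one column per hypothesis.
    scores = [[dot(p, h) for h in fh] for p in fp]
    dim = len(premises[0])
    aligned_hyp = []
    for row in scores:
        w = softmax(row)  # normalize over hypotheses
        aligned_hyp.append(
            [sum(w[j] * hypotheses[j][k] for j in range(len(hypotheses)))
             for k in range(dim)]
        )
    aligned_prem = []
    for j in range(len(hypotheses)):
        w = softmax([scores[i][j] for i in range(len(premises))])  # over premises
        aligned_prem.append(
            [sum(w[i] * premises[i][k] for i in range(len(premises)))
             for k in range(dim)]
        )
    return aligned_hyp, aligned_prem
```

Each premise receives one weighted summary of the hypotheses, and each hypothesis receives one weighted summary of the premises, which corresponds to the inputs of the two sub-problems described above.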

Match via Entailment Prediction

Given the output from the alignment module, we jointly consider the interactions between individual EHR and EC as well as the original EC and EHR embeddings, feeding their concatenation into a shared transformation layer to compute the entailment relationship between the given patient and trial.


where (,) are the parameters of the transformation layer. The outputs and represent the entailment prediction results for sub-problem (1) and sub-problem (2), respectively. They are aggregated into the final entailment inference results.


where and are the summations of the prediction results of the two sub-problems. Following the aggregation method of (mou2015natural), we adopt three aggregation methods to measure their similarity and closeness in latent space: (a) vector concatenation; (b) element-wise product; (c) element-wise difference. The aggregated representations are then fed into a final output layer.


where (,) are the parameters of the output layer. represents the predicted label, which is one of "entailment", "contradiction", and "neutral". The entire embedding and entailment prediction process is summarized in Algorithm 1.
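The three aggregation methods can be illustrated with a short sketch. The function name `aggregate` and the choice to join all three results into a single feature vector are assumptions made for this example; the text above does not specify how the three outputs are combined before the output layer.

```python
def aggregate(v1, v2):
    """Combine two summed prediction vectors into one feature vector."""
    concatenation = list(v1) + list(v2)           # (a) vector concatenation
    product = [a * b for a, b in zip(v1, v2)]     # (b) element-wise product
    difference = [a - b for a, b in zip(v1, v2)]  # (c) element-wise difference
    # Assumption: the three views are joined into a single input vector.
    return concatenation + product + difference

features = aggregate([1.0, 2.0], [3.0, 4.0])
```

For two inputs of dimension d the resulting vector has dimension 4d, which would be the input size of the final output layer in this sketch.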

Numerical Information Entailment

Numerical information in ECs, such as age or dosage, is important for patient-trial matching. However, it is difficult to match, since numbers are a common source of contradictions in natural language processing tasks (dagan2013recognizing) and distributed representations are often insensitive to them (roy2015reasoning). To explicitly model this numerical information, we proposed the following numerical information representation (NIR).

The design of NIR is inspired by (roy2015reasoning; ho2019qsearch). We first encode numerical information as a triplet (), where the three variables correspond to number, unit, and concept, respectively.

  • Number : The lower or upper bound of a value range, e.g., more than 500 mg, at least one month, during the last three months. We do not store the surface forms but convert them into a set of ranges. For example, "more than 20 mg" is stored as (20, +∞).

  • Unit : The scale on which a value is measured, e.g., mg or weeks. The word "weeks" in the phrase "within 12 weeks" is a unit.

  • Concept : An aligned concept that the quantity is associated with, e.g., a certain drug or procedure. This is stored to augment the quantity and input representations.

Next, we develop the following extraction method. Given a sequence of tokens , we extract its word-level and character-level features to recognize numerical information. At the word level, we check whether appears in a set of known scientific units (e.g., mcg, g/L) or written numbers. At the character level, may contain a digit, consist entirely of digits, or include an ordinal suffix (st, nd, rd, th). The extraction results are then formulated as numerical information representations with standardization: numbers are converted into floating-point values or a fixed date type, and units are converted into a standard base unit.

Finally, we design a numerical information comparison method to infer the entailment between the NIR learned from ECs and the numerical information in EHR. Numerical information in EHR is often represented as a fixed value with a unit. Therefore, we first check whether the two units are comparable. If they are, we then compare the fixed value with the value range in the NIR. The possible results are "entailment", "contradiction" and "not comparable". Only "entailment" results are regarded as supported by the numerical information entailment module.
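The extraction and comparison steps can be sketched as below. This is a simplified illustration, not the authors' pattern set: the unit table `UNITS`, the comparator phrases in `COMPARATORS`, the regular expression, and the function names `extract_nir` and `compare` are all hypothetical. The sketch also treats every bound as inclusive, so it does not distinguish "more than" from "at least".

```python
import math
import re
from dataclasses import dataclass

# Hypothetical unit table: surface form -> (base unit, multiplier to base unit).
UNITS = {
    "mcg": ("mg", 0.001), "mg": ("mg", 1.0), "g": ("mg", 1000.0),
    "day": ("day", 1.0), "days": ("day", 1.0),
    "week": ("day", 7.0), "weeks": ("day", 7.0),
}

# Hypothetical comparator phrases; True means the number is a lower bound.
COMPARATORS = {"more than": True, "at least": True,
               "less than": False, "at most": False, "within": False}

PATTERN = re.compile(
    r"(" + "|".join(COMPARATORS) + r")\s+(\d+(?:\.\d+)?)\s*([a-z/]+)"
)

@dataclass
class NIR:
    low: float     # lower bound of the range, in base units
    high: float    # upper bound of the range, in base units
    unit: str      # standardized base unit
    concept: str   # concept the quantity is attached to

def extract_nir(text, concept):
    """Extract the first recognized quantity from an EC statement, or None."""
    match = PATTERN.search(text.lower())
    if match is None or match.group(3) not in UNITS:
        return None
    phrase, number, unit = match.groups()
    base_unit, scale = UNITS[unit]
    value = float(number) * scale
    if COMPARATORS[phrase]:
        return NIR(value, math.inf, base_unit, concept)
    return NIR(-math.inf, value, base_unit, concept)

def compare(nir, value, unit):
    """Compare a fixed EHR value against the NIR range."""
    if unit not in UNITS or UNITS[unit][0] != nir.unit:
        return "not comparable"
    scaled = value * UNITS[unit][1]
    return "entailment" if nir.low <= scaled <= nir.high else "contradiction"

nir = extract_nir("More than 20 mg daily", "drug")
```

With this sketch, an EHR value of 0.05 g is first converted to milligrams and then tested against the stored range, while a value expressed in weeks is reported as not comparable with a dosage range.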

Input: A set of labelled triplets . Let equal the size of the training set.

for iter  do
    EC Sentence Embedding()
    Hierarchical Patient Embedding()
    , Concept Alignment()
    if Entailment Match(,) = Entailment then
        if Quantity Match() = Entailment then
            return Entailment
        else
            return Contradiction
        end if
    else if Entailment Match(,) = Contradiction then
        return Contradiction
    else if Entailment Match(,) = Neutral then
        return Neutral
    end if
end for

Output: Trained deep embedding and entailment prediction model.

Algorithm 1 DeepEnroll for Patient-Trial Matching
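The branching logic of Algorithm 1 can be written as a small function. The name `combine_predictions` and the string labels are assumptions for illustration; the sketch mirrors the algorithm as listed, so any quantity result other than "entailment" (including "not comparable") overrides an entailment prediction with a contradiction.

```python
def combine_predictions(entailment_label, quantity_label):
    """Merge the entailment-module label with the numerical-module label."""
    if entailment_label == "entailment":
        # The numerical module must also support the match.
        if quantity_label == "entailment":
            return "entailment"
        return "contradiction"
    # "contradiction" and "neutral" are passed through unchanged.
    return entailment_label

final_label = combine_predictions("entailment", "contradiction")
```

Here `final_label` is "contradiction", because the numerical check does not support the match predicted by the embedding-based module.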

4. Experiment

We evaluated the DeepEnroll model on three datasets: two sets of proprietary matched trial and EHR data, and a publicly available synthetic dataset. The code is publicly available. We designed experiments to answer the following questions.

Q1: Does DeepEnroll have better performance in matching patients and trials as measured by predicting clinical trial enrollment?

Q2: Does the numerical information entailment module help improve the performance of DeepEnroll?

Q3: Can DeepEnroll provide interpretable matching results?

4.1. Experimental Setting

Data The data used in experiments are listed below.

  • Clinical Trial EC. We randomly selected 794 clinical trials spanning varying disease domains. We first downloaded the XML files for the clinical trials. Each file follows a fixed structure defined by the clinical trial XML schema. We select the Inclusion Criteria and Exclusion Criteria to form the raw ECs. We then extract EC statements from the free-text raw ECs through pattern-based paragraph segmentation and sentence segmentation. In total, we obtain 12,445 sentence-level EC statements for the 794 clinical trials.

  • Large Scale Patient-Trial Matching Data (IQVIA dataset). We use this trial-patient matched data, referred to as the IQVIA dataset, for model training and validation. It contains 561 registered clinical trials and 57,696 matched patients. The patient information is encoded in large-scale longitudinal prescription and medical claims data. For each trial, a number of free-text EC statements describe trial requirements from different aspects. Patients are profiled as a set of EHR, including records of diagnoses and treatments. We label the pairing of an inclusion or exclusion EC with a matched patient's EHR as "Entailment" or "Contradiction", respectively. Unmatched EHR and EC statements are labelled as "Neutral". In all, we have 852,103 labelled examples.

  • Rare Disease Data. The rare disease data is a subset of the IQVIA dataset that contains conditions from the rare diseases list provided by the National Organization for Rare Disorders (NORD). This results in 77 trials and 562 matched patients from the IQVIA dataset.

  • Synthetic Data. For reproducibility, we also develop a synthetic dataset based on (yuan2019criteria2query) using Synthea, a synthetic patient population simulator, to automatically generate patient EHR. We select 243 registered trials from a wide range of disease domains, and leverage Criteria2Query to generate cohort definitions for filtering matched patient EHRs. In total, we have 68,845 simulated EHRs, which are labelled in the same way as the IQVIA dataset. Details of the synthetic data generation can be found in Section 7 of the appendix.

Statistic                  IQVIA      Synthetic   Rare Disease
# of trials                561        243         77
# of EC statements         9,510      2,935       1,394
Avg. EC sentence length    24.2       21.3        23.4
# of patients              57,696     13,634      562
# of EHR                   852,103    68,845      9,580
# of unique medications    352        156         294
# of unique diagnoses      939        288         309

Table 2. Data Statistics

Baseline We compared DeepEnroll with the following baselines.

  • Logistic Regression (LR) (hosmer2013applied). We concatenate the sentence embedding of the EC and the one-hot EHR vectors and apply LR for multiclass classification.

  • Multi-Layer Perceptron (MLP) is leveraged as the entailment inference module (bowman2015large; conneau2017supervised). The model is simply a stack of three 200d tanh layers, with the bottom layer taking the concatenated premise and hypothesis word representations as input and the top layer feeding a softmax classifier.

  • Long Short Term Memory (LSTM) (hochreiter1997long). We apply two LSTMs to model the premise sequence and the hypothesis sequence, and then use the corresponding final outputs for entailment prediction.

  • match-LSTM (mLSTM) (wang2015learning) performs word-by-word matching of the hypothesis with the premise.

  • Stack Augmented Parser Interpreter Network (SPINN) (bowman2016fast) combines parsing and interpretation within a tree-sequence hybrid model by integrating tree-structured sentence interpretation.

  • Word-by-word Attention (WA) (rocktaschel2015reasoning) is an entailment inference model that uses the output latent state of the premise LSTM as the input of the hypothesis LSTM. Word-by-word attention is then applied over both LSTMs for entailment prediction.

  • Criteria2Query (yuan2019criteria2query) is the current state-of-the-art work for the patient-trial matching task. It implements a systematic information extraction pipeline to parse free-text eligibility criteria into a structured and computable representation using the OMOP CDM, and also leverages a series of natural language inference techniques such as named entity recognition to autonomously generate matching patient definitions to identify patient cohorts. Note that we do not report its performance on the Synthea dataset, which is generated through Criteria2Query.

To measure the accuracy of matching, we use the average F1 and Precision-Recall AUC (PR-AUC) as our metrics. Precision, recall, and F1 score are computed from the numbers of true positives (TP), false positives (FP), and false negatives (FN); the F1 score is the harmonic mean of precision and recall. PR-AUC measures the area under the precision-recall curve. Since the entailment prediction task is cast as a multi-class classification task, we use micro-F1 as the metric and calculate the average F1 based on it.
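As a worked illustration of the micro-F1 computation, the sketch below pools TP, FP and FN over all labels before computing precision and recall. The function name `micro_f1` and the toy labels are hypothetical; this is not the evaluation script used in the experiments.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

labels = ["entailment", "contradiction", "neutral"]
y_true = ["entailment", "contradiction", "neutral", "entailment"]
y_pred = ["entailment", "neutral", "neutral", "contradiction"]
score = micro_f1(y_true, y_pred, labels)
```

When every example carries exactly one label, each wrong prediction counts once as a false positive and once as a false negative, so micro-F1 coincides with plain accuracy (two of four correct in the toy data).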

Evaluation Strategy

All methods are implemented in PyTorch and trained on an Ubuntu 16.04 machine with 64GB of memory and three GTX 1080 Ti GPUs. For all datasets, we randomly select 60% of the patients as the training set, 20% as the validation set and the remaining 20% as the test set. We train the model on the training data and fix the model parameters based on the best performance on the validation set. We then test the model on the test set. We perform three random runs and report both the mean and the standard deviation of the test performance.
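A patient-level 60/20/20 split can be sketched as follows. The function name `split_patients` and the seed handling are assumptions; the point of the sketch is that splitting by patient identifier keeps all records of one patient in a single partition.

```python
import random

def split_patients(patient_ids, seed=0):
    """Randomly split unique patient ids into 60% train, 20% validation, 20% test."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 6 // 10   # integer arithmetic avoids rounding surprises
    n_val = len(ids) * 2 // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train_ids, val_ids, test_ids = split_patients(range(10), seed=1)
```

Repeating the call with three different seeds would give the three random runs over which a mean and standard deviation can be reported.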

Implementation Details

We use stochastic gradient descent (SGD) with a learning rate of 0.1 and a weight decay of 0.99. At each epoch, we divide the learning rate by 5 if the dev accuracy decreases. We use mini-batches of size 64, and training is stopped when the learning rate falls below the threshold of 10^-5. For the classifier, we use a multi-layer perceptron with one hidden layer of 512 hidden units. We use dropout (srivastava2014dropout) with a rate of 0.5, which is applied to all feedforward connections. For the pre-trained Clinical BERT, we use a batch size of 32 and fine-tune for 3 epochs over the data for the two unsupervised tasks. For each task, we selected a fine-tuning learning rate of 2e-5.
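The learning-rate rule described above can be sketched independently of any training framework. The function name `learning_rate_schedule` is hypothetical, the stopping threshold is read here as 1e-5 (an assumption about the stated value), and the 0.99 decay factor is left out of the sketch.

```python
def learning_rate_schedule(dev_accuracies, lr=0.1, shrink=5.0, min_lr=1e-5):
    """Return the learning rate used in each epoch until training stops."""
    history = []
    previous = None
    for accuracy in dev_accuracies:
        history.append(lr)  # rate used for this epoch
        if previous is not None and accuracy < previous:
            lr /= shrink    # dev accuracy dropped: shrink the step size
        previous = accuracy
        if lr < min_lr:
            break           # stop once the rate falls below the threshold
    return history

rates = learning_rate_schedule([0.50, 0.60, 0.55, 0.70])
```

In this toy run the rate stays at 0.1 for three epochs and drops to 0.02 for the fourth, because the dev accuracy fell from 0.60 to 0.55 in the third epoch.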

For all baselines except Criteria2Query, we use 300-dimensional GloVe embeddings (Pennington14glove:global) as word embeddings and max-pooling as the sentence embedding method. Out-of-vocabulary (OOV) words are hashed to one of 100 random embeddings, each initialized with mean 0 and standard deviation 1. All other hidden layer weights are initialized from a random Gaussian distribution with mean 0 and standard deviation 0.01. Each hyperparameter setting was run on the same machine as DeepEnroll, using Adagrad (Duchi:2011:ASM:1953048.2021068) for optimization with an initial accumulator value of 0.1. We convert EHR data into one-hot vectors as the hypothesis representation. For Criteria2Query, we directly use the existing model to produce the matching results for the given trials and patients.

Q1: DeepEnroll achieved the best predictive performance in clinical trial enrollment prediction

We compared DeepEnroll against the state-of-the-art baselines on the IQVIA dataset, the synthetic data, and the rare disease data. We report the results in Table 3, Table 4 and Table 5, respectively. Each entry gives the mean ± standard deviation over the random runs.

Model                     Average F1         PR-AUC
LR                        0.5675 ± 0.0003    0.6580 ± 0.0005
MLP                       0.5864 ± 0.0001    0.6738 ± 0.0000
LSTM                      0.6225 ± 0.0001    0.6871 ± 0.0001
SPINN                     0.6331 ± 0.0042    0.7014 ± 0.0063
Word-by-word Attention    0.6457 ± 0.0012    0.7163 ± 0.0021
match-LSTM                0.6641 ± 0.0007    0.7423 ± 0.0011
Criteria2Query            0.6819 ± 0.0001    0.7103 ± 0.0002
DeepEnroll                0.7543 ± 0.0018    0.7864 ± 0.0083

Table 3. Performance Comparison on IQVIA dataset.

Model                     Average F1         PR-AUC
LR                        0.6123 ± 0.0002    0.6849 ± 0.0003
MLP                       0.6231 ± 0.0001    0.6914 ± 0.0000
LSTM                      0.6424 ± 0.0002    0.7092 ± 0.0003
SPINN                     0.6233 ± 0.0074    0.6712 ± 0.0078
Word-by-word Attention    0.6245 ± 0.0019    0.6821 ± 0.0034
match-LSTM                0.6542 ± 0.0013    0.6956 ± 0.0025
DeepEnroll                0.7298 ± 0.0013    0.7431 ± 0.0029

Table 4. Performance Comparison on synthetic data. C2Q is not compared here since it was used in data synthesis.

Model                     Average F1         PR-AUC
LR                        0.5518 ± 0.0004    0.6375 ± 0.0012
MLP                       0.5454 ± 0.0001    0.6844 ± 0.0000
LSTM                      0.6163 ± 0.0002    0.6782 ± 0.0002
SPINN                     0.6194 ± 0.0061    0.6892 ± 0.0078
Word-by-word Attention    0.6357 ± 0.0014    0.7065 ± 0.0023
match-LSTM                0.6442 ± 0.0009    0.7133 ± 0.0021
Criteria2Query            0.6155 ± 0.0001    0.6481 ± 0.0001
DeepEnroll                0.7241 ± 0.0011    0.7453 ± 0.0029

Table 5. Performance comparison on rare disease data.

From the results, we observe that DeepEnroll consistently outperforms the best baselines. On the IQVIA dataset, it outperformed the best baseline Criteria2Query by 10.6% in Average F1 and match-LSTM by 5.9% in PR-AUC. On the synthetic data, it outperformed the best baseline match-LSTM by 11.5% in Average F1 and 6.8% in PR-AUC. On the rare disease dataset, it outperformed the best baseline match-LSTM by 12.4% in Average F1 and 4.5% in PR-AUC.
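The percentages quoted here appear to be relative improvements over the baseline score rather than absolute differences. That reading is an inference from the table values, not something the text states, and the helper `relative_improvement` below is a hypothetical name used only for this check.

```python
def relative_improvement(model_score, baseline_score):
    """Relative gain of the model over the baseline, in percent."""
    return 100.0 * (model_score - baseline_score) / baseline_score

# Mean scores taken from Tables 3 and 5.
iqvia_f1 = relative_improvement(0.7543, 0.6819)  # DeepEnroll vs. Criteria2Query
iqvia_pr = relative_improvement(0.7864, 0.7423)  # DeepEnroll vs. match-LSTM
rare_f1 = relative_improvement(0.7241, 0.6442)   # DeepEnroll vs. match-LSTM
print(f"{iqvia_f1:.1f} {iqvia_pr:.1f} {rare_f1:.1f}")
```

Rounded to one decimal place, these three values reproduce the 10.6%, 5.9% and 12.4% figures reported for the IQVIA and rare disease datasets.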

Among the baselines, LR and MLP have the worst performance because they fail to capture the temporal patterns in EC text and patient data, which leads to inferior matching performance. In comparison, the LSTM and the word-by-word attention model manage to capture long-range temporal dependencies in both EC and patient data. The SPINN model uses a tree-sequence hybrid structure for better sequence modeling. These models thus demonstrated improved performance over LR and MLP.

The two best baselines are match-LSTM and Criteria2Query. The match-LSTM model is an LSTM-based natural language inference model. It performs word-by-word matching of the hypothesis with the premise, so that it can not only place more emphasis on important word-level matching results, but also remember important mismatches that are critical for predicting the contradiction or neutral label. With this design, match-LSTM performs considerably better than the other LSTM-based models. Criteria2Query (yuan2019criteria2query), on the other hand, implements a systematic information extraction pipeline to parse free-text eligibility criteria into a structured and computable representation using the OMOP CDM, and also leverages a series of natural language inference techniques such as named entity recognition to autonomously generate matching patient definitions to identify patient cohorts. It achieved very good performance across all clinical trials thanks to its rule extraction and natural language inference techniques. However, on the rare disease trial data, the performance of Criteria2Query is not encouraging, which could be due to a lack of full knowledge of rare diseases in the knowledge system it builds on. For example, Criteria2Query produced confusing results when recognizing numerical restrictions such as "Heart failure with ejection fraction 40" (NCT04078425). Criteria2Query would recognize only "Heart failure", while DeepEnroll could recognize the specific numeric criterion for matching.

To compare with a retrieval-style baseline, we used Solr (smiley2011apache) to query 20 clinical trials with textual descriptions of EHR formatted as a bag of words, including demographics and the names of diagnoses, products, and procedures. The F1 score for the top 5 results is 0.4731 and for the top 10 is 0.5378 (vs. 0.7543 for DeepEnroll). As for the baselines mentioned in Related Work, Criteria2Query already uses the same retrieval component as EliXR and EliIE.
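The text does not state how the top-k F1 score was computed for the retrieval baseline. As a minimal sketch, assuming the common convention of combining precision and recall over the top-k ranked results, the computation could look as follows. The function name `f1_at_k` and the trial identifiers are hypothetical and are not taken from the original experiments.

```python
def f1_at_k(ranked_ids, relevant_ids, k):
    """F1 over the top-k retrieved items (assumed convention, not the paper's stated one)."""
    top_k = ranked_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for item in top_k if item in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)   # fraction of retrieved trials that are relevant
    recall = hits / len(relevant)   # fraction of relevant trials that were retrieved
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranking returned by a retrieval engine for one patient query.
ranked = ["T3", "T7", "T1", "T9", "T4", "T2"]
relevant = {"T3", "T1", "T2"}

score_top5 = f1_at_k(ranked, relevant, 5)
score_top6 = f1_at_k(ranked, relevant, 6)
```

In this toy ranking, widening the cutoff from 5 to 6 raises the score because the extra result is relevant, which mirrors the direction of the top-5 versus top-10 numbers reported above.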

In this experiment, we also evaluated model performance on patient-trial matching for rare diseases. Recruiting suitable patients to trials for rare diseases is challenging because there are often few patients with the disease, so it may not be feasible to significantly narrow entry criteria based on disease stage or other characteristics (Augustine2013). From the results in Table 5, we observe that although the performance of all models decreased, DeepEnroll still outperformed all baselines with a 7.99% improvement and showed the smallest performance reduction. Among the baselines, Criteria2Query has the biggest performance drop since it relies heavily on disease guidelines, which are not sufficient for rare diseases.

Q2: The numerical information entailment module augments the performance of DeepEnroll

We conducted an ablation study to understand the contribution of the numerical information representation and entailment module (NIR) in DeepEnroll. We removed the NIR module from patient-trial matching and performed trial enrollment prediction using the reduced model. We compared its prediction results against the full model across all datasets. The parameters of the reduced model were determined with cross-validation, and the best performances are reported in Table 6.

Dataset          Metric      w/o NIR            DeepEnroll
IQVIA dataset    PR AUC      0.7611 ± 0.0094    0.7864 ± 0.0083
                 F1 Score    0.7392 ± 0.0021    0.7543 ± 0.0018
Rare Disease     PR AUC      0.7258 ± 0.0032    0.7453 ± 0.0029
                 F1 Score    0.7012 ± 0.0011    0.7241 ± 0.0011
Synthetic Data   PR AUC      0.7399 ± 0.0028    0.7431 ± 0.0029
                 F1 Score    0.7211 ± 0.0013    0.7298 ± 0.0013
Table 6. Ablation study of DeepEnroll, demonstrating the advantage of numerical information entailment augmentation.

From Table 6, we see that when we rely solely on the EC sentence embedding and the hierarchical patient embedding for entailment matching, performance drops on every dataset. For example, on the IQVIA dataset and the rare disease data, performance decreases by up to 3.32% in PR-AUC and 3.26% in F1 score. This suggests the necessity of augmenting the textual representation with explicit embedding and entailment for numerical information in the trial-patient matching task.
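The percentages quoted above appear to be relative differences measured against the reduced model; that reading is an assumption, since the text does not give the formula. The sketch below recomputes the differences from the mean values in Table 6 under that assumption. The names `table6` and `relative_gain` are illustrative only.

```python
# Mean scores from Table 6: (without NIR, full DeepEnroll).
table6 = {
    ("IQVIA", "PR AUC"): (0.7611, 0.7864),
    ("IQVIA", "F1"): (0.7392, 0.7543),
    ("Rare Disease", "PR AUC"): (0.7258, 0.7453),
    ("Rare Disease", "F1"): (0.7012, 0.7241),
    ("Synthetic", "PR AUC"): (0.7399, 0.7431),
    ("Synthetic", "F1"): (0.7211, 0.7298),
}


def relative_gain(reduced, full):
    """Percentage difference of the full model relative to the reduced model (assumed definition)."""
    return (full - reduced) / reduced * 100.0


gains = {key: relative_gain(*scores) for key, scores in table6.items()}

# Largest gap per metric across datasets.
max_pr_auc = max(v for (_, metric), v in gains.items() if metric == "PR AUC")
max_f1 = max(v for (_, metric), v in gains.items() if metric == "F1")
```

Under this definition the largest PR-AUC gap comes from the IQVIA dataset and the largest F1 gap from the rare disease data, both at roughly 3.3%, which is in line with the figures quoted in the text.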

Q3: DeepEnroll can provide interpretable matching results

To better facilitate clinical decision making, we also designed the following strategy to understand the key factors for matching patients and trials. For each trial considered in the case study, we randomly select patients who are enrolled in the trial. We collect the soft-alignment attention weights and visualize them as heatmaps.

Here, we select the following trials: trials on heart failure (HF, NCT03882710), Alzheimer's Disease (AD, NCT03690193) and idiopathic pulmonary fibrosis (IPF, NCT02085018). Each row represents one EC statement and each column shows one patient embedding. We chose 5-6 representative EC statements with relatively high attention weights. We aggregated the attention weights between EHR and EC to the patient level using a weighted average, and visualized the impact of each EC statement on the selected patients. We use ECharts as our visualization tool. Light color indicates lower relevance while dark color indicates strong relevance.
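The aggregation step can be illustrated with a small sketch. The weights used for the average are not specified in the text, so the per-visit weights below are a hypothetical choice, as are the function names and the toy attention values. The sketch only shows how per-visit attention could be collapsed into one score per EC statement and patient, and then arranged with EC statements as rows and patients as columns.

```python
def patient_level_scores(attention, visit_weights):
    """Collapse attention[i][j] (EC statement i, EHR visit j) into one score per EC statement.

    visit_weights is a hypothetical per-visit weighting; the source does not specify it.
    """
    total = sum(visit_weights)
    return [
        sum(a * w for a, w in zip(row, visit_weights)) / total
        for row in attention
    ]


def heatmap_matrix(per_patient_scores):
    """Arrange per-patient score vectors as columns, so rows correspond to EC statements."""
    return [list(row) for row in zip(*per_patient_scores)]


# Toy attention weights: 2 EC statements x 2 visits for each of 2 patients.
patient_a = patient_level_scores([[0.2, 0.6], [0.1, 0.1]], visit_weights=[1, 3])
patient_b = patient_level_scores([[0.4, 0.4], [0.9, 0.3]], visit_weights=[1, 1])

matrix = heatmap_matrix([patient_a, patient_b])
```

The resulting `matrix` has one row per EC statement and one column per patient, which is the layout a heatmap tool expects.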

Figure 2. Heatmap showing the attention weights for pairs of EC statements and patients. Rows represent EC statements (e.g., inclusion and exclusion) and columns are patient embeddings. Dark color indicates strong relevance. Matched patients of each trial are highlighted.

Heart failure is a serious condition in which the heart cannot keep up with its workload, so the body of the patient may not get the oxygen it needs. HF has no cure. Drugs for HF are designed to help patients better manage their condition. From Figure 2 (a), the most relevant EC statement requires the patients to have no "other systemic disease limiting life expectancy to less than 3 years", mainly because the clinical trial will take more than 3 years. Another important statement is about "LVEF in a nondilated LV", which is a key sign of HF.

Alzheimer's disease (AD) is an irreversible, progressive brain disorder that slowly destroys memory and thinking skills and, eventually, the ability to carry out the simplest tasks. Currently, there is no cure for AD or a way to stop its progression. Drugs and treatments are developed to either cure AD or help treat some of its symptoms. From Figure 2 (b), the most relevant EC statement states that, to be considered for the trial, patients should have normal major organ function, and that patients with diabetes or cancer should be excluded. For the AD trial, the BMI measure and behavior description are less important.

The last example is a trial on treating idiopathic pulmonary fibrosis (IPF). IPF is a pulmonary disease characterized by the formation of scar tissue within the lungs in the absence of any known provocation (Meltzer2008). IPF is a rare disease which affects approximately 5 million people worldwide, with prevalence rate at . Currently there is no cure for IPF. Treatments are being developed and tested to slow the rate of scarring (pirfenidone and nintedanib) and to treat particular IPF symptoms such as breathlessness and coughing. From Figure 2 (c), the most relevant EC statements for this trial state that the patients should have a "history of cough" and no "known allergy to Omeprazole or other proton pump inhibitor".

Through visualization, we conclude that the attention weights in DeepEnroll help better understand the key inclusion and exclusion criteria for successful recruitment of patients.

4.2. Case Study: trials that are difficult to match patients with machine learning models

Figure 3. Attention weights for BLITZ Heart Failure trial. Matched patients are highlighted.

In our experiments, some clinical trials proved difficult to match to patients with machine learning models. To analyze the possible reasons, we select a trial for which fewer than 30% of patient-trial matches were correct using either DeepEnroll or Criteria2Query. The trial, BLITZ Heart Failure (NCT03661060), is an observational study designed to evaluate whether a structured educational program can improve the adherence of both acute and chronic HF patients to guideline recommendations.

We take a similar strategy to visualize its attention matrix in Figure 3. There are two possible reasons for the low matching performance. First, there are many over-simplified abbreviations in the trial EC statements, such as the ones shown in the fourth and fifth rows. These over-simplified descriptions can be ambiguous and hard to match with concepts in patient EHR. Second, the EC statement in the sixth row has relatively high attention weights, but is actually unrelated to the information in patient EHR.

5. Conclusion

In this paper, we proposed DeepEnroll, a cross-modal inference learning model that jointly encodes enrollment criteria and patient records into a shared latent space for patient-trial matching. DeepEnroll applies a pre-trained BERT model to encode clinical trial information, and a hierarchical model to embed patient EHR. DeepEnroll is also augmented by a numerical information embedding and entailment module to explicitly match numerical information in EC and EHR. We evaluated DeepEnroll on both a real-world clinical trial dataset and a synthetic dataset. Experiments on real-world datasets demonstrated the strong performance of DeepEnroll. The DeepEnroll method can also be extended to other application domains with matching problems over heterogeneous data, such as personalized targeted advertising based on matching ad descriptions (text) with customers' browsing history (sequential data).

6. Acknowledgments

This work is in part supported by National Science Foundation awards IIS-1418511, CCF-1533768 and IIS-1838042, and the National Institutes of Health awards NIH R01 1R01NS107291-01 and R56HL138415.

7. Appendix: Details on generating the synthetic dataset

For model reproducibility, we generated a synthetic dataset following the procedure below.

  1. Data Gathering. For clinical trial data, we downloaded 1,000 clinical trial descriptions. Each trial is stored in an XML file that follows a structure of fields defined by an XML schema. We select the EC items in free-text format from the XML files as our original dataset. Then, we split the EC into statements by treating the different types of bullet and list symbols as natural sentence boundaries.

  2. EHR Generation. According to (weng2010formal), EC items can be divided into the following categories: demographic information; condition occurrence; procedure occurrence; measurement; drug exposure; observation; and willing information. Except for willing and demographic information, the categories can be mapped to corresponding tables in an EHR system. For each EC record, we use the Criteria2Query (yuan2019criteria2query) system to generate a corresponding EHR query in CDM v5.0 format, which includes the category label for the EC and the EHR query information. Then, we run these queries in ATLAS, an online EHR query engine, to generate eligible patient-level EHR information. This information is stored in a relational database. We developed an algorithm to translate the database information into natural text.

  3. Labelling. Last, the matching EHR and EC are labelled as Entailment for inclusion criteria and Contradiction for exclusion criteria. Additionally, a number of unrelated EHR and EC pairs are generated and labelled as Neutral. The labelling process yields tuples of an EHR record, an EC statement, and a label as model input.
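The splitting and labelling steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the set of bullet and list symbols, the section-header detection, the tuple order, and all function names and example texts are hypothetical, and the Criteria2Query and ATLAS steps are not reproduced.

```python
import re

# Assumed set of leading list markers: "-", "*", "•", "1.", "2)", "(3)", "a)", "(b)".
BULLET = re.compile(r"^\s*(?:[-*•]|\(?\d+[.)]|\(?[a-zA-Z][.)])\s+")


def parse_ec(ec_text):
    """Split free-text EC into (criterion_type, statement) pairs, one per list item."""
    section = None
    parsed = []
    for line in ec_text.splitlines():
        text = BULLET.sub("", line).strip()
        if not text:
            continue
        lowered = text.lower()
        if lowered.startswith("inclusion"):
            section = "inclusion"   # header line, not a statement
        elif lowered.startswith("exclusion"):
            section = "exclusion"
        elif section is not None:
            parsed.append((section, text))
    return parsed


def label_pair(criterion_type, related):
    """Label rule from step 3: Entailment / Contradiction for matches, Neutral otherwise."""
    if not related:
        return "Neutral"
    return "Entailment" if criterion_type == "inclusion" else "Contradiction"


ec_text = """Inclusion Criteria:
- Age 18 years or older
- History of cough
Exclusion Criteria:
1. Known allergy to proton pump inhibitors
"""

statements = parse_ec(ec_text)

# Hypothetical EHR text that matches every statement above.
ehr_text = "Adult patient with chronic cough and a recorded proton pump inhibitor allergy."
tuples = [(ehr_text, stmt, label_pair(kind, related=True)) for kind, stmt in statements]
```

Each element of `tuples` pairs one EHR description with one EC statement and its label; Neutral examples would be produced by pairing statements with unrelated EHR text and passing `related=False`.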