COMPOSE: Cross-Modal Pseudo-Siamese Network for Patient Trial Matching

06/15/2020 ∙ by Junyi Gao, et al. ∙ IQVIA ∙ University of Illinois at Urbana-Champaign

Clinical trials play an important role in drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The availability of massive electronic health record (EHR) data and trial eligibility criteria (EC) brings a new opportunity for data-driven patient recruitment. One key task, patient-trial matching, is to find qualified patients for clinical trials given structured EHR and unstructured EC text (both inclusion and exclusion criteria). How can complex EC text be matched with longitudinal patient EHRs? How can many-to-many relationships between patients and trials be embedded? How can the difference between inclusion and exclusion criteria be handled explicitly? In this paper, we propose the CrOss-Modal PseudO-SiamEse network (COMPOSE) to address these challenges for patient-trial matching. One path of the network encodes EC using a convolutional highway network. The other path processes EHR with a multi-granularity memory network that encodes structured patient records into multiple levels based on medical ontology. Using the EC embedding as query, COMPOSE performs attentional record alignment and thus enables dynamic patient-trial matching. COMPOSE also introduces a composite loss term to maximize the similarity between patient records and inclusion criteria while minimizing the similarity to the exclusion criteria. Experimental results show that COMPOSE reaches 98.0% AUROC on criteria-level matching and 83.7% accuracy on trial-level matching, a 24.3% relative improvement over the best baseline on real-world patient-trial matching tasks.




1. Introduction

Clinical trials, with an annual market of over 46 billion USD, are the only established process for developing new treatments for diseases. But trials often suffer from expensive, inaccurate and insufficient patient recruitment. Many trials struggle to acquire the required number of patients. Moreover, 50% of trials are delayed due to patient recruitment issues, while some trials are unable to find sufficient patients to begin at all (Pharmafile, 2016). For example, Campbell et al. (2007) reported that one-third of publicly funded trials required time extensions due to insufficient enrollment. It was also reported that of cancer trials failed to enroll sufficient patients (S., 2015). Even with sufficient patients, the recruitment cost is high, estimated at around 6000 to 7500 USD per patient (Myshko, 2005).

Figure 1. An illustration of patient-trial matching.

Automated patient-trial matching brings a new opportunity to optimize the trial recruitment process. The key is to find qualified patients for clinical trials given each patient's longitudinal electronic health records (EHR) and trial eligibility criteria (EC) described as both inclusion and exclusion criteria, as shown in Fig. 1. Over the years, rule-based retrieval systems and, lately, a deep embedding model were developed to match patients to trials. Rule-based systems either rely on a large amount of human annotation (Weng et al., 2011; Kang et al., 2017), train supervised learning classifiers to extract rules (Bustos and Pertusa, 2018), or combine machine learning and rule-based methods to generate SQL queries for ECs (Yuan et al., 2019). However, they often yield poor recall due to morphological variants and inadequate rule coverage. More recently, DeepEnroll (Zhang et al., 2020) was proposed as a deep embedding model to predict whether a patient is eligible for a trial. It jointly embeds and aligns medical concepts derived from patient data and EC in the same embedding space, and then performs a simple similarity match of patients to trials based on the aligned embeddings. However, DeepEnroll ignores the difference between inclusion and exclusion criteria, which can lead to criteria mismatch. In general, the key open challenges for patient-trial matching are:

  1. Multi-granularity medical concepts. Unstructured ECs often encode more general disease concepts due to the heterogeneity of disease manifestation (Hill et al., 2008), while structured patient EHRs often represent patient conditions using more specific medical codes. For example, a patient with pleuropericardial adhesion in EHR data can be recruited to a trial that requires the more general cardiovascular diseases. It is nontrivial to match medical concepts of heterogeneous granularity across different data modalities.

  2. Many-to-many relationship between patients and trials. In practice, each patient may enroll in more than one trial and vice versa. Since each trial generally focuses on certain diseases, the semantic distance between trials, and even between ECs, may be large. However, all existing works align the patient embedding to different trial embeddings, which may confuse the embedding function and lead to inferior matching results.

  3. Explicit inclusion/exclusion criteria handling. Trial ECs comprise inclusion and exclusion criteria, which describe what is desired and unwanted in the targeted patients. Existing approaches do not explicitly differentiate these two types of criteria, which may significantly affect the matching accuracy.

To address these challenges, we propose the CrOss-Modal Pseudo-Siamese network (COMPOSE) for patient-trial matching, which makes the following contributions:

  1. Taxonomy-guided multi-granularity medical concept embedding. To match medical concepts of heterogeneous granularity, we augment the medical codes in a patient's records with their textual descriptions and hierarchical taxonomies, so that concepts can be embedded at both finer and coarser levels to better align detailed medical codes in EHR with EC medical concepts of various granularity.

  2. Attentional record alignment and dynamic patient-trial matching. Instead of aligning a patient's entire record to each trial, we develop an attentive READ mechanism inside a dynamic memory network to extract the best-matching part of the patient EHR and match it with ECs at the criteria level.

  3. Differentiating inclusion/exclusion criteria. COMPOSE also has a composite similarity loss term to explicitly handle the inclusion and exclusion criteria separately. It improves patient-trial matching by maximizing the similarity between the patient and inclusion criteria while minimizing the similarity between the patient and exclusion criteria in latent space.

We evaluated COMPOSE on real world clinical trial dataset. COMPOSE significantly outperformed the best state-of-the-art baselines. It achieved 24.3% relatively higher accuracy over the best baselines on patient-trial matching tasks.

2. Related Works

Patient trial matching methods can be categorized as rule-based systems and deep embedding based models. Rule-based systems try to extract named entities and relations from trial eligibility criteria (ECs) and construct rules for identifying patients. They either rely on a large amount of human annotation (Weng et al., 2011; Kang et al., 2017), train supervised learning classifiers to extract rules (Bustos and Pertusa, 2018), or combine machine learning and rule-based methods (Yuan et al., 2019) to generate rules for ECs. For example, EliXR (Weng et al., 2011) matches Unified Medical Language System (UMLS) concepts and relations via a pre-defined dictionary and regular expressions. Alicante et al. (2016) utilized unsupervised clustering methods for eligibility rule extraction. Bustos and Pertusa (2018) used machine learning models such as CNN, SVM, and kNN as classifiers for disease-specific EC classification. Yuan et al. (2019) proposed a complete pipeline, Criteria2Query, which combines machine learning and rule-based methods to form patient criteria. These methods often yield poor recall due to morphological variants and inadequate rule coverage. In recent years, deep embedding based models such as DeepEnroll (Zhang et al., 2020) jointly embed patient records and trial ECs in the same latent space, and then align them using attentive inference. However, DeepEnroll neither matches concepts of different granularity nor differentiates inclusion and exclusion criteria. In the experiments, we compare with Criteria2Query and DeepEnroll as they are the state-of-the-art methods.

Cross-modal retrieval enables flexible retrieval across different modalities, such as semantic image-text retrieval (You et al., 2018; Ren et al., 2015). The core of cross-modal retrieval is how to measure the content similarity. Among others, Siamese network is a typical structure for uni-modal retrieval tasks (Guo et al., 2017; Qi et al., 2016). It consists of two branches of the same structure and utilizes similar pairs and dissimilar pairs for similarity learning. Pseudo-Siamese network (Hughes et al., 2018; Treible et al., 2019) is more flexible than Siamese network in the sense that it allows different structures to receive inputs from different modalities. In this work, we developed a pseudo-siamese network for patient trial matching task.

Figure 2. COMPOSE model. (1) Clinical trial EC sentence embedding: First, we use pretrained BERT to generate contextualized word embeddings for each word in EC sentences. Then we use CNNs to capture semantic features and generate embeddings for EC sentence. (2) Taxonomy guided multi-granularity medical concept embedding: We use hierarchical memory networks to store medical concept taxonomies. Different granularity information of concept is stored in different levels of the memory network. (3) Attentional record align and dynamic patient-trial match: We use learned EC embeddings to attentively visit the memory network and retrieve the best matching memories . Finally, we use the EC embedding and matched memory embedding to predict matching result. Meanwhile, we also optimize the distance between two embeddings to differentiate inclusion and exclusion ECs.

3. Problem Formulation

Below we define the input data and the modeling problem in this paper. We summarize the notations in Table. 1.

Definition 1 (Patient Records).

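In longitudinal EHR data, each patient can be represented as a sequence of multivariate observations $P = [v_1, v_2, \dots, v_T]$, where $T$ is the number of visits. Each visit is represented by $v_t = \{c_t^d, c_t^m, c_t^p\}$ with diagnosis $c_t^d \subseteq \mathcal{D}$, medication $c_t^m \subseteq \mathcal{M}$ and procedure $c_t^p \subseteq \mathcal{P}$. Here $\mathcal{D}$, $\mathcal{M}$ and $\mathcal{P}$ are the sets of diseases, medications and procedures, respectively. For simplicity, we use $c$ to represent any medical code ($c^d$, $c^m$, or $c^p$) in $v_t$. Each patient also has demographic features $\mathbf{u}$ including gender and age.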

Definition 2 (Clinical Trials).

For each clinical trial $C$, its protocol comprises two types of eligibility criteria, inclusion criteria and exclusion criteria: $C = [e_I^1, \dots, e_I^{N_I}; e_E^1, \dots, e_E^{N_E}]$, where $N_I$ ($N_E$) denotes the number of inclusion (exclusion) criteria in the trial and $e_I^i$ ($e_E^i$) denotes the $i$-th inclusion (exclusion) criterion. Each criterion $e$ can be represented as a sequence of words: $e = [w_1, w_2, \dots, w_n]$, where $n$ is the number of words in $e$.

Definition 3 (Medical Taxonomy).

The medical taxonomy expresses the hierarchy of various medical concepts in the form of a parent-child relationship, including the diagnosis taxonomy $\mathcal{T}_d$, procedure taxonomy $\mathcal{T}_p$ and medication taxonomy $\mathcal{T}_m$. The leaf nodes in a taxonomy are detailed medical concepts (e.g. Type 2 diabetes mellitus), while parent nodes represent more general concepts (e.g. Endocrine, nutritional and metabolic diseases). A taxonomy can be built by using a well-organized taxonomy of medical concepts (e.g. ICD, CCS or USC). We use the notation $\mathcal{T}$ to represent a general taxonomy for medical code $c$.
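As a rough illustration of the parent-child structure in this definition, the sketch below stores a taxonomy as a child-to-parent map and walks from a leaf concept up to the root. The leaf and root names are the examples from the text; the intermediate node, the `PARENT` map and the `path_from_root` helper are hypothetical and only serve the example.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent map. Only the leaf and the root names come from
# the text; the intermediate level is invented for illustration.
PARENT: Dict[str, Optional[str]] = {
    "Type 2 diabetes mellitus": "Diabetes mellitus",
    "Diabetes mellitus": "Endocrine, nutritional and metabolic diseases",
    "Endocrine, nutritional and metabolic diseases": None,  # root node
}


def path_from_root(concept: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the concepts from the most general (root) to the given concept."""
    path = [concept]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


levels = path_from_root("Type 2 diabetes mellitus", PARENT)
```

Each entry of `levels` corresponds to one granularity level, which is the information a level-wise memory slot would later consume.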

Problem 1 (Patient Criteria Matching).

Given a patient's visit records $P$ and a set of inclusion or exclusion criteria, we formulate the patient-criteria matching task as a multi-class classification problem, which is to classify the matching results between patients and ECs into three categories, "match", "mismatch", and "unknown", based on the similarity between patient records and trial criteria: $\hat{y}(P, e) \in \{\text{match}, \text{mismatch}, \text{unknown}\}$.

Problem 2 (Patient Trial Matching).

Given a patient’s visit records and a clinical trial that comprises a set of inclusion or exclusion criteria, we consider a patient and a trial are a match only if the patient matches (i.e., confirms) all inclusion criteria and mismatches (i.e., refutes) all exclusion criteria in the trial.
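The trial-level rule in Problem 2 can be sketched as a small predicate over criteria-level labels. This is a minimal sketch, assuming the criteria-level predictions are already available as strings; the function name and label constants are illustrative, not part of the paper.

```python
from typing import Sequence

MATCH, MISMATCH, UNKNOWN = "match", "mismatch", "unknown"


def patient_matches_trial(inclusion_labels: Sequence[str],
                          exclusion_labels: Sequence[str]) -> bool:
    """True only if every inclusion criterion is a 'match'
    and every exclusion criterion is a 'mismatch'."""
    return (all(label == MATCH for label in inclusion_labels)
            and all(label == MISMATCH for label in exclusion_labels))
```

A single "unknown" or contradicting label on any criterion is enough to reject the patient-trial pair under this strict rule.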

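Notation | Short explanation
$P$ | patient record sequence
$v_t$ | multivariate visit record at the $t$-th visit
$\mathbf{u}$ | patient demographic features
$c$ | a medical code in $v_t$
$\mathcal{T}_d, \mathcal{T}_m, \mathcal{T}_p$ | taxonomy for disease, medication and procedure
$\mathcal{D}, \mathcal{M}, \mathcal{P}$ | sets of diseases, medications, and procedures
$C$ | a clinical trial
$e_I^i$ / $e_E^i$ | the $i$-th inclusion or exclusion criterion
$w_i$ | the $i$-th word in a text sequence
$\mathbf{e}$ | learned EC sentence embedding
$\mathbf{m}^l$ | the memory slot at level $l$
$\mathbf{x}$ | retrieved patient embedding
$\hat{y}$ / $y$ | predicted / ground-truth label for criterion $e$ with patient $P$
Table 1. Notations used in COMPOSE.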

4. The Compose model

This section presents our COMPOSE model (Fig. 2). COMPOSE is a pseudo-siamese network consisting of two branches: 1) a convolutional neural network-based branch to learn trial eligibility criteria (EC) embeddings and 2) a taxonomy-guided memory network branch to learn embeddings for patient records (EHR). A dynamic alignment and matching module then generates the final outputs.

COMPOSE embeds two modalities of data, ECs and EHR, into a shared latent space. In particular, we use EC embeddings as queries to read memories in the memory network. Finally, we use the retrieved EHR embeddings and EC embeddings to jointly predict matching scores $\hat{y}$.

4.1. Trial Eligibility Criteria Embedding

Trial eligibility criteria (ECs) describe the inclusion and exclusion criteria of clinical trials using unstructured text. To learn embeddings for trial ECs, previous works use GloVe (Pennington et al., 2014) or BERT (Devlin et al., 2018) with max-pooling to obtain a static embedding vector for each criterion. However, this does not capture detailed information in sentences such as numerical values or units. We want the learned EC embeddings to retain sentence-level semantics while also capturing important details in sentences. To this end, we use convolutional neural networks (CNN) to learn EC embeddings, as previous work (Hu et al., 2014) shows that CNNs can generate rich matching patterns at different levels for semantic matching tasks.

First, we use pretrained BERT to generate a contextualized word embedding for each word of an EC. Specifically, we apply Clinical BERT (Alsentzer et al., 2019), pretrained on PubMed text resources and MIMIC-III (Johnson et al., 2016) doctor notes. Concretely, given an EC sentence $e = [w_1, w_2, \dots, w_n]$, the contextualized embeddings are calculated as:

$$[\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n] = \mathrm{BERT}(w_1, w_2, \dots, w_n) \tag{1}$$

where $n$ denotes the number of words in the EC sentence and $\mathbf{h}_i$ is a contextualized word embedding for $w_i$. The final embedding $\mathbf{H}$ for the EC sentence is the concatenation of these word embeddings.

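Then we use multiple one-dimensional convolutional layers with different kernel sizes to capture semantics at different granularity levels (four levels in our experiment) (You et al., 2018). We feed $\mathbf{H}$ into the one-dimensional convolutional layers and concatenate the outputs to generate a vector as:

$$\mathbf{X} = [\mathrm{Conv}(\mathbf{H}, k_1); \mathrm{Conv}(\mathbf{H}, k_2); \mathrm{Conv}(\mathbf{H}, k_3); \mathrm{Conv}(\mathbf{H}, k_4)] \tag{2}$$

where $\mathrm{Conv}(\cdot, k)$ denotes the one-dimensional convolutional operator with kernel sizes $k_1$ to $k_4$. Then we feed $\mathbf{X}$ into convolutional highway layers, which have been widely used in many natural language processing tasks to extract text semantics (You et al., 2018; Srivastava et al., 2015). Particularly, the outputs of the highway networks are calculated as:

$$\mathbf{G} = \sigma(\mathrm{Conv}_g(\mathbf{X})), \qquad \mathbf{Y} = \mathbf{G} \odot \mathrm{Conv}_h(\mathbf{X}) + (1 - \mathbf{G}) \odot \mathbf{X} \tag{3}$$

where $\sigma$ denotes the sigmoid activation and $\odot$ the element-wise product. We set the stride to 1 and use same padding to make sure the output dimension of $\mathrm{Conv}_g$ and $\mathrm{Conv}_h$ is the same as the input dimension of $\mathbf{X}$.

Finally, we use max-over-time pooling (Kim et al., 2016) to reduce the dimension and obtain the final embedding for the EC sentence in the current trial as $\mathbf{e}$. Each dimension in $\mathbf{e}$ denotes a semantic feature captured by convolutional filters.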
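The highway gating and max-over-time pooling steps described above can be illustrated with a small pure-Python sketch. It assumes the standard highway form (a sigmoid gate that mixes a transformed signal with the unchanged input) and takes the two convolution outputs as precomputed inputs; `transform` and `gate_logits` are hypothetical stand-ins for those convolution outputs, and no actual convolution or BERT call is performed here.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = time steps (words), columns = feature channels


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def highway(x: Matrix, transform: Matrix, gate_logits: Matrix) -> Matrix:
    """Element-wise y = g * transform + (1 - g) * x, with g = sigmoid(gate_logits).

    All three matrices must share one shape, which is why the text uses
    stride 1 and same padding for the convolutions.
    """
    out = []
    for x_row, h_row, z_row in zip(x, transform, gate_logits):
        out.append([sigmoid(z) * h + (1.0 - sigmoid(z)) * xv
                    for xv, h, z in zip(x_row, h_row, z_row)])
    return out


def max_over_time(features: Matrix) -> List[float]:
    """Keep the maximum of each feature channel across all time steps."""
    return [max(column) for column in zip(*features)]


x = [[1.0, 2.0], [3.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
y = highway(x, transform=zeros, gate_logits=zeros)  # gate = 0.5 everywhere
sentence_embedding = max_over_time(y)
```

With a gate of 0.5 and a zero transform, half of the input passes through, so the pooled vector keeps the largest surviving value per channel regardless of sentence length.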

4.2. Taxonomy Guided Patient Embedding

Memory networks are powerful frameworks to process, store and retrieve information from sequential data (Weston et al., 2014). To enable the patient-trial matching task, we also learn effective patient representation from patient data (e.g., EHR) by using memory network. In this work, we developed a hierarchical memory network to leverage medical taxonomy. This is achieved by allowing each memory slot to memorize concept on a specific level of the taxonomy.

Formally, given a patient's EHR data, the $t$-th visit is represented by three types of medical concepts: diagnoses, medications and procedures. We encode each type of code using a separate memory component. Each type of concept can be divided into different levels, from a more general concept to a more specific one. In this work, we use the Uniform System of Classification (USC) taxonomy, a therapeutic classification system, to divide each concept into four levels. To this end, our patient embedding memory network consists of three separate memory networks to store diagnoses, medications and procedures respectively. Each memory network has four memory slots to store information from fine-grained to coarse levels. This can be represented as:

$$\mathbf{M}_d = [\mathbf{m}_d^1, \mathbf{m}_d^2, \mathbf{m}_d^3, \mathbf{m}_d^4], \quad \mathbf{M}_m = [\mathbf{m}_m^1, \mathbf{m}_m^2, \mathbf{m}_m^3, \mathbf{m}_m^4], \quad \mathbf{M}_p = [\mathbf{m}_p^1, \mathbf{m}_p^2, \mathbf{m}_p^3, \mathbf{m}_p^4] \tag{4}$$

where $\mathbf{m}^1$ is at the root level (most general), while $\mathbf{m}^4$ is at the leaf level (most specific). Each memory slot is a vector initialized to zero. At each visit, the values in the memory slots are updated. Note that we use four levels because of the specific medical taxonomies used in the experiment. The method can be easily generalized to taxonomies with a different number of levels, as in (Gao et al., 2019; Choi et al., 2017).

Augment medical codes with textual description. It is worth noting that we do not use one-hot encoded medical codes as the input, as previous trial matching and other clinical prediction works do (Zhang et al., 2020; Choi et al., 2018; Gao et al., 2020; Ma et al., 2019), because in clinical trial ECs medical concepts are described in natural language rather than medical codes. Therefore, using the textual description of medical codes to embed patients' EHR can help the model find matched concepts. The textual description of each medical concept can be found in the Uniform System of Classification (USC). For example, we use the textual description "Contact dermatitis and other eczema" for the diagnosis code 692.9. More specifically, each medical concept $c$ can be represented as a list of words, i.e., $c = [w_1, w_2, \dots, w_{n_c}]$, where $n_c$ is the length of the text description for the code $c$. We use BERT with max-pooling to generate the concept embedding:

$$\mathbf{c} = \mathrm{MaxPool}(\mathrm{BERT}(w_1, w_2, \dots, w_{n_c})) \tag{5}$$

Specifically, when the patient has multiple diagnoses / medications / procedures, we use a max-pooling layer to aggregate the final embedding for each type of code (i.e., we obtain three code embeddings $\mathbf{c}^d$, $\mathbf{c}^m$ and $\mathbf{c}^p$).

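Update memories at each visit. Since patients' EHRs are sequential data, we update the memory slots at each visit. We adopt the erase-followed-by-add update mechanism of (Zhang et al., 2017; Gao et al., 2019). It allows us to erase unnecessary information at each visit and then add new information dynamically. Given the medical code embedding $\mathbf{c}_t$ at the $t$-th visit, we can obtain its parent code embeddings $\mathbf{c}_t^1$ to $\mathbf{c}_t^3$ in the corresponding taxonomy, and $\mathbf{c}_t^4 = \mathbf{c}_t$ is the leaf node. For the code at level $l$ ($\mathbf{c}_t^l$) and the corresponding memory slot ($\mathbf{m}_{t-1}^l$), we calculate the erase and add gates as:

$$\mathbf{z}_t^l = \sigma(\mathbf{W}_z \mathbf{c}_t^l + \mathbf{b}_z), \qquad \mathbf{a}_t^l = \tanh(\mathbf{W}_a \mathbf{c}_t^l + \mathbf{b}_a) \tag{6}$$

Then we update the memory in the slot as:

$$\mathbf{m}_t^l = \mathbf{m}_{t-1}^l \odot (1 - \mathbf{z}_t^l) + \mathbf{a}_t^l \tag{7}$$

where $\odot$ denotes the element-wise product and $\mathbf{W}_z$, $\mathbf{W}_a$, $\mathbf{b}_z$, $\mathbf{b}_a$ are learnable parameters.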
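To make the erase-followed-by-add update concrete, here is a minimal sketch of one slot update and of a visit-wise update over the taxonomy levels. It assumes the gates have already been computed (in the model they would come from learned layers applied to the code embeddings); the function names and the two-dimensional toy vectors are illustrative only.

```python
from typing import List

Vector = List[float]


def erase_add_update(slot: Vector, erase: Vector, add: Vector) -> Vector:
    """new_slot = slot * (1 - erase) + add, element-wise; erase values lie in [0, 1]."""
    return [m * (1.0 - e) + a for m, e, a in zip(slot, erase, add)]


def update_memory(memory: List[Vector],
                  erase_gates: List[Vector],
                  add_gates: List[Vector]) -> List[Vector]:
    """Apply one visit's update to every level (root first, leaf last)."""
    return [erase_add_update(m, e, a)
            for m, e, a in zip(memory, erase_gates, add_gates)]


# Four levels, zero-initialized, two feature dimensions for readability.
memory = [[0.0, 0.0] for _ in range(4)]
memory = update_memory(memory, [[0.5, 0.5]] * 4, [[1.0, 2.0]] * 4)  # visit 1
memory = update_memory(memory, [[1.0, 0.0]] * 4, [[0.5, 0.5]] * 4)  # visit 2
```

In the second visit the first dimension is fully erased before the new value is added, while the second dimension keeps its earlier content and accumulates.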

4.3. Attentional Record Alignment and Dynamic Criteria Level Match

Previous patient trial matching works learn an alignment matrix which maps patients' EHR embeddings and EC embeddings to the same latent space. However, each patient may enroll in more than one trial and vice versa. Since each trial generally focuses on certain diseases, the semantic distance between trials may be large. Trying to push the patient's EHR embedding close to different EC embeddings may confuse the embedding function, or the patient embedding could be mapped to a position near the average of those EC embeddings. Such a situation is similar to the multi-label image classification case in computer vision (Ren et al., 2015).

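To overcome this problem, we let each EC correspond to the sub-memories that best match the current trial. Concretely, we use the inclusion criteria embedding $\mathbf{e}_I$ or exclusion criteria embedding $\mathbf{e}_E$ as queries to attentively visit the memory network. Given the memory slot $\mathbf{m}^l$ and a query $\mathbf{e}$, the attention weight $\alpha^l$ for the current slot is calculated as:

$$\alpha^l = \mathrm{softmax}_l\big(\mathrm{MLP}(\mathbf{e})^\top \mathbf{m}^l\big) \tag{8}$$

where $\mathrm{MLP}$ is a multi-layer perceptron to align the dimension between $\mathbf{e}$ and $\mathbf{m}^l$. A large $\alpha^l$ may indicate that the queried trial is highly related to the information stored in memory slot $\mathbf{m}^l$.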

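Given a criterion $e$, we can obtain the best matching memories as:

$$\mathbf{x} = \sum_{l} \alpha^l \mathbf{m}^l \tag{9}$$

Specifically, we also use an MLP to learn an embedding for the patient's demographics as $\mathbf{x}_u = \mathrm{MLP}(\mathbf{u})$, where $\mathbf{x}_u$ has the same dimension as $\mathbf{x}$. Finally, we concatenate the EC embedding, the demographics embedding and the retrieved memory to predict whether the criterion matches the patient as:

$$\hat{y} = \mathrm{softmax}\big(\mathrm{MLP}(\mathbf{x} \oplus \mathrm{MLP}(\mathbf{e}) \oplus \mathbf{x}_u)\big) \tag{10}$$

where $\oplus$ denotes the concatenation operation and we use MLPs to map the criterion and demographics embeddings to the same embedding space as $\mathbf{x}$.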
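The attentive read over memory slots can be sketched as a softmax over query-slot scores followed by a weighted sum. This is a simplified illustration, assuming the query has already been projected to the slot dimension (the MLP alignment step is omitted) and using a plain dot product as the score; `attention_read` is a hypothetical helper name.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attention_read(query: Vector, slots: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (attention weights, retrieved memory) for one criterion query."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in slots]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    dim = len(slots[0])
    retrieved = [sum(w * slot[j] for w, slot in zip(weights, slots))
                 for j in range(dim)]
    return weights, retrieved


weights, retrieved = attention_read([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
```

Here the query points along the first axis, so nearly all attention mass falls on the first slot and the retrieved vector is close to that slot's content.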

4.4. Joint learning with Explicit Inclusion/Exclusion Criteria Handling

During model optimization, in order to maximize patient-trial matching performance as well as to explicitly handle the difference between inclusion and exclusion criteria, we design a composite loss function with the following loss terms.

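Classification Loss. We use a cross-entropy loss term in Eq. 11 to optimize the classification performance between prediction $\hat{y}$ and ground truth $y$:

$$\mathcal{L}_c = -\sum_{k=1}^{3} y_k \log \hat{y}_k \tag{11}$$

where $k$ indexes the three classes ("match", "mismatch" and "unknown").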
Inclusion/Exclusion Loss. In addition, we also construct a loss term to explicitly handle the match between the patient embedding and the EC embedding for inclusion criteria and exclusion criteria. The loss term enables the model to extract different features (e.g. negation words) in the inclusion/exclusion criteria and thus helps decide whether to include or exclude the patient. Mathematically, this boils down to minimizing the distance between the retrieved patient memory and the embedding for inclusion criteria (i.e., $1 - d(\mathbf{x}, \mathbf{e}_I)$) while maximizing the distance between the memory and the embedding for exclusion criteria (i.e., $1 - d(\mathbf{x}, \mathbf{e}_E)$). We construct the loss term with the following pairwise distance loss:

$$\mathcal{L}_d = \begin{cases} 1 - d(\mathbf{x}, \mathbf{e}_I) & \text{for an inclusion criterion} \\ \max\big(0,\ \epsilon - (1 - d(\mathbf{x}, \mathbf{e}_E))\big) & \text{for an exclusion criterion} \end{cases} \tag{12}$$

where $d(\cdot, \cdot)$ is the similarity function between two vectors. In this work we use the cosine similarity function to measure the distance between the two modalities of data. Here $\epsilon$ is a hyper-parameter which denotes the minimum distance between the embedding of exclusion criteria and the patient memory. If a patient matches an inclusion criterion, the model will minimize the cosine distance between the two embeddings to make $\mathbf{x}$ close to $\mathbf{e}_I$. If the patient is excluded by an exclusion criterion, the distance between the two embeddings (i.e. $1 - d(\mathbf{x}, \mathbf{e}_E)$) should be no smaller than $\epsilon$. This allows $\mathbf{e}_I$ and $\mathbf{e}_E$ to have different distances to $\mathbf{x}$ in the latent embedding space.

Finally, we jointly minimize the loss functions by back propagation in an end-to-end way as:

$$\mathcal{L} = \mathcal{L}_c + \mathcal{L}_d \tag{13}$$
Our COMPOSE algorithm is summarized in Algorithm. 1.

Input: Patient records $P$, a set of inclusion criteria or exclusion criteria sentences $\{e\}$.
Initialize all memory slots to zero.
  for $t = 1$ to $T$ do
     Generate embeddings for all medical concepts in $v_t$ using Eq. 5;
     Update corresponding memory slots using Eq. 6 and 7;
  end for
  for $i = 1$ to $N_I + N_E$ do
     Generate EC sentence embedding $\mathbf{e}$;
     Obtain the convolution results using Eq. 3;
  end for
  Retrieve the best-matching memories using Eq. 8 and 9;
  Calculate matching results using Eq. 10;
  Update parameters by optimizing the loss in Eq. 11, 12 and 13.
Algorithm 1 The COMPOSE model
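The composite objective optimized in the last step of the algorithm can be sketched in a few lines. This is an illustrative version under stated assumptions: the distance is taken as one minus cosine similarity, the exclusion branch is a hinge that is zero once the distance reaches the margin, the two terms are summed without weights, and the default `margin=0.5` is a hypothetical value rather than the paper's setting.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def distance_loss(patient: Vector, criterion: Vector,
                  is_inclusion: bool, margin: float = 0.5) -> float:
    """Pull inclusion pairs together; push exclusion pairs at least `margin` apart."""
    distance = 1.0 - cosine_similarity(patient, criterion)
    if is_inclusion:
        return distance
    return max(0.0, margin - distance)


def cross_entropy(probs: List[float], true_index: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])


def composite_loss(probs: List[float], true_index: int, patient: Vector,
                   criterion: Vector, is_inclusion: bool,
                   margin: float = 0.5) -> float:
    return (cross_entropy(probs, true_index)
            + distance_loss(patient, criterion, is_inclusion, margin))
```

Identical patient and criterion vectors give zero distance loss for an inclusion criterion but a penalty equal to the margin for an exclusion criterion, which is what separates the two cases in latent space.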

5. Experiment

We evaluate COMPOSE by comparing against state-of-the-art baselines on a real-world patient trial matching dataset. The code of COMPOSE is publicly available.

5.1. Experimental Setup

Dataset description We evaluated COMPOSE using the data below.

  1. Clinical Trial Data We randomly selected 590 clinical trials with varying disease domains from a publicly available data source. We extract the inclusion criteria and exclusion criteria from these trials. In total, we obtain criteria-level (i.e., sentence-level) EC statements.

  2. Patient EHR Data We extract patient claims data from IQVIA’s real-world patient database, which can be accessed by request. In total we have EHR records from patients from 2002 to 2018, where each patient is a match to at least one trial in the previous extracted trial dataset. The patient information is encoded in a longitudinal prescription and medical claims data including diagnosis, procedures and medications.

We label each inclusion/exclusion criterion and their corresponding matched patients’ EHR as ”match”/”mismatch”. For each criterion, we randomly sample one inclusion criterion and exclusion criterion from another trial and label them as ”unknown”. In all we have 397,321 labelled pairs. The data statistics are in Appendix.

Baselines We evaluated COMPOSE against the following baselines.

  1. LSTM+GloVe (Hochreiter and Schmidhuber, 1997). We use LSTM to learn the representation for longitudinal EHR data and use GloVe (Pennington et al., 2014) followed by a max-pooling layer to learn the sentence embedding for ECs. Then we concatenate the outputs to predict the matching results.

  2. LSTM+BERT (Devlin et al., 2018) We use LSTM to learn EHR embeddings and use BERT to learn EC sentence embeddings. Then we concatenate the outputs to predict the matching results.

  3. Criteria2Query (Yuan et al., 2019) consists of a systematic information extraction pipeline that uses basic text mining solutions (such as named entity recognition) to parse unstructured text data and translate it into a set of structured attributes, and then uses these attributes to identify patient cohorts.

  4. DeepEnroll (Zhang et al., 2020) is the previous state-of-the-art deep learning model for the patient trial matching task. The structure of DeepEnroll is similar to LSTM+BERT, but it uses MiME (Choi et al., 2018) instead of LSTM to learn the EHR embedding. DeepEnroll uses an alignment matrix to predict the matching results.

GloVe and BERT are common models for text-related retrieval tasks. We use them as our baseline pseudo-siamese structures. In addition to these baselines, we also perform ablation studies by comparing COMPOSE against its reduced models:

  1. COMPOSE-MN We remove the memory network from COMPOSE. We use an LSTM model to learn the EHR embedding, and we use one-hot encoded codes instead of textual descriptions.

  2. COMPOSE-Highway We remove the highway network from COMPOSE. We use a regular three-layer CNN to learn the trial embedding.

  3. COMPOSE-$\mathcal{L}_d$ We remove the inclusion/exclusion loss term $\mathcal{L}_d$ from COMPOSE. We use just the classification loss $\mathcal{L}_c$ to optimize the model. This setting makes the model not explicitly differentiate inclusion criteria and exclusion criteria.

Evaluation strategy and metrics (1) Criteria Level: We use all labelled patient-criterion pairs to train and evaluate our model. We follow previous work (Zhang et al., 2020) and use the accuracy score (Accuracy), area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC) to evaluate model performance. Since the matching task is cast as a multi-class classification task, we calculate all scores using micro-averaging. (2) Trial Level: For trial-level matching tasks, if patient A has enrolled in trial B, we first obtain all matching results between A and all criteria in B, then aggregate the results to get the final matching result between patient A and trial B. The result is considered correct when the predictions between A and all inclusion criteria in B are "match" and the predictions between A and all exclusion criteria are "mismatch". We use the accuracy score to evaluate the performance. In practice, some inclusion or exclusion criteria can be so strict that they prevent finding patients, yet are non-essential and thus can be modified. We choose four thresholds (0.7, 0.8, 0.9 and 1) to simulate that trial recruiters may loosen the restrictions to enroll enough patients. For example, threshold 0.7 means we consider a patient to match the trial when the patient matches at least 70% of the criteria in the trial.
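The thresholded trial-level rule described above can be sketched as follows. This is one possible reading, assuming the threshold is applied to the fraction of all criteria (inclusion and exclusion counted together) whose predicted label agrees with eligibility; the helper names are illustrative.

```python
from typing import Sequence


def fraction_satisfied(inclusion_labels: Sequence[str],
                       exclusion_labels: Sequence[str]) -> float:
    """Share of criteria the patient satisfies: inclusion 'match' or exclusion 'mismatch'."""
    satisfied = (sum(label == "match" for label in inclusion_labels)
                 + sum(label == "mismatch" for label in exclusion_labels))
    total = len(inclusion_labels) + len(exclusion_labels)
    return satisfied / total if total else 0.0


def matches_at_threshold(inclusion_labels: Sequence[str],
                         exclusion_labels: Sequence[str],
                         threshold: float = 1.0) -> bool:
    """Threshold 1.0 reproduces the strict all-criteria rule."""
    return fraction_satisfied(inclusion_labels, exclusion_labels) >= threshold


inclusion = ["match"] * 6 + ["unknown"]
exclusion = ["mismatch"] * 2 + ["match"]
```

In this example 8 of 10 criteria are satisfied, so the patient is accepted at thresholds 0.7 and 0.8 but rejected at 0.9 and 1.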

We fix a test set of 30% patients, and divide the rest of the dataset into the training set and validation set with a proportion of 90%:10%. We fix the best model on the validation set and report the performance on the test set. We perform five random runs and report both mean and standard deviation for testing performance except for Criteria2Query, since it requires no training process.
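A patient-level split with these proportions might look like the sketch below. It is only an illustration: the shuffling procedure, the seed, and the `split_patients` helper are assumptions, since the text specifies just the proportions (30% test, then 90%:10% train/validation on the remainder).

```python
import random
from typing import Iterable, List, Tuple


def split_patients(patient_ids: Iterable[int],
                   test_frac: float = 0.3,
                   val_frac: float = 0.1,
                   seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Split by patient so that no patient appears in more than one subset."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_patients(range(100))
```

Splitting on patient identifiers rather than on patient-criterion pairs keeps all pairs of a given patient in the same subset.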

5.2. Results

We designed experiments to answer the following questions.

Q1. How does COMPOSE perform in patient-trial matching?

Q2. How does COMPOSE perform for various disease categories?

Q3. How does COMPOSE perform for trials at each phase?

Q4. Sensitivity of performance for different matching threshold.

Q5. Cases for showing the attentional record align mechanism.

Q1. Performance on Patient Trial Matching

The results for trial-level matching are shown in Table 2. The accuracy score is computed based on 100% matching criteria (i.e. a patient matches a trial only when the patient matches all inclusion criteria and mismatches all exclusion criteria in the trial). COMPOSE significantly outperforms other state-of-the-art methods: it achieves a 24.3% relatively higher accuracy than the best baseline, DeepEnroll, and a 36.3% relatively higher accuracy than Criteria2Query.

Group | Model | Accuracy
Baselines | LSTM+GloVe | 0.4294 ± 0.010
Baselines | LSTM+BERT | 0.5460 ± 0.008
Baselines | Criteria2Query | 0.6147 ± -
Baselines | DeepEnroll | 0.6737 ± 0.021
Reduced | COMPOSE-MN | 0.7833 ± 0.011
Reduced | COMPOSE-Highway | 0.8102 ± 0.009
Reduced | COMPOSE-$\mathcal{L}_d$ | 0.8212 ± 0.010
Proposed | COMPOSE | 0.8373 ± 0.012
Table 2. Patient-trial matching. Performance is measured by accuracy based on the match of all criteria.
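The relative improvement over DeepEnroll quoted for trial-level matching follows directly from the mean accuracies in Table 2. The short check below recomputes it; the helper name is illustrative.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


compose_acc, deepenroll_acc = 0.8373, 0.6737  # mean accuracies from Table 2
gain = round(relative_improvement(compose_acc, deepenroll_acc), 1)
print(gain)  # 24.3
```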

Among all baselines, Criteria2Query and DeepEnroll achieve the best performance. All reduced models of COMPOSE also outperform all baseline models. The results also show that the textual-description-enhanced code embeddings and the hierarchical memory network that performs dynamic matching contribute most to the performance. Below we also provide the patient-criteria matching results in Table 3. Compared to the best baseline model DeepEnroll, COMPOSE achieves 8.8% relatively higher accuracy, 4.7% higher AUROC and 3.3% higher AUPRC. The strong criteria-level performance of COMPOSE provides a good foundation for patient-trial matching.

Group | Model | Accuracy | AUROC | AUPRC
Baselines | LSTM+GloVe | 0.722 ± 0.010 | 0.789 ± 0.009 | 0.784 ± 0.009
Baselines | LSTM+BERT | 0.834 ± 0.008 | 0.845 ± 0.007 | 0.840 ± 0.007
Baselines | DeepEnroll | 0.869 ± 0.012 | 0.936 ± 0.013 | 0.947 ± 0.011
Reduced | COMPOSE-MN | 0.899 ± 0.012 | 0.955 ± 0.013 | 0.960 ± 0.010
Reduced | COMPOSE-Highway | 0.912 ± 0.007 | 0.965 ± 0.007 | 0.967 ± 0.009
Reduced | COMPOSE-$\mathcal{L}_d$ | 0.939 ± 0.010 | 0.976 ± 0.009 | 0.973 ± 0.007
Proposed | COMPOSE | 0.945 ± 0.008 | 0.980 ± 0.007 | 0.979 ± 0.008
Table 3. Performance on criteria-level matching.

Varying Length of Patient Record It is challenging to match patients with longer records, due to the vanishing-gradient issues of deep learning models or the evolving health conditions of patients. Here we experimentally explore how COMPOSE performs in matching trials with patients who have short or long records. We categorize patients into three groups based on the length of their EHR records: Short (1 visit), Medium (2-3 visits), Long (more than 3 visits). We report the patient-trial matching performance for each group in Table 4.

Model | Short | Medium | Long
LSTM+GloVe | 0.4906 | 0.4328 | 0.0000
LSTM+BERT | 0.5484 | 0.5512 | 0.5338
Criteria2Query | 0.6833 | 0.5989 | 0.5172
DeepEnroll | 0.6779 | 0.6797 | 0.6443
COMPOSE | 0.8420 | 0.8389 | 0.8350
Table 4. Performance (measured by accuracy) on trial-level matching for different lengths of records.

From Table 4, we can observe that for most models it is easier to match patients with short and medium-length records to trials. This is probably because patients with shorter sequences tend to have simpler health conditions, while patients with longer records tend to have more complex conditions or condition changes, so their EHR contains irrelevant information that confuses the patient-trial matching model. Compared with the baselines, COMPOSE has robust performance for patients with different lengths of records. This is because COMPOSE uses a dynamic memory network to store patients' EHR information, which is better able to preserve fine-grained information in different slots.

Q2. Varying Disease Types

We also conduct experiments to explore how our model performs on different types of diseases. We select trials related to chronic diseases, oncology and rare diseases. In particular, we consider 19 trials on 9 chronic diseases, including chronic pain and chronic obstructive pulmonary disease. For oncology, we consider 33 trials on 18 cancer types, including gastric cancer and lung cancer. We also select 5 trials on rare diseases, including glioma and polymyositis. More details are provided in the Appendix.

From the results in Table 5, COMPOSE outperforms all baseline methods, with 77.3% relatively higher accuracy than DeepEnroll for chronic diseases. For the oncology and rare disease cohorts, the trained baseline models fail to match the correct patients. Criteria2Query, however, outperforms the other baselines because it requires no training process, so insufficient data does not affect its performance.

For COMPOSE, the performance on trials designed for chronic diseases is lower than on the other two disease types. This is because patients with chronic diseases often have complex conditions and heterogeneous manifestations, so the criteria for these diseases often contain general descriptions. For example, nearly half of these trials (47.4%) are related to chronic pain, which is a common symptom often caused by other diseases. The ECs for trials on chronic pain often contain vague descriptions such as “Adolescents experiencing chronic pain of any type” or “Have a history of non-cancer pain in past 6 months”. It is difficult for automatic matching methods to match patients against terms like “any type” or “non-cancer”, which results in low accuracy.

In contrast, trials on oncology and rare diseases have stricter criteria for recruiting patients. Most of these ECs require different medications and diagnoses from different aspects. The task is therefore much more difficult than common matching tasks for the baseline models, since these models have to align the patient representation with each criterion. However, there is too little training and testing data available for most baseline models to learn such a complex task. We discuss this further in the case study section.

Model Chronic Diseases Oncology Rare Diseases
LSTM+GloVe 0.1793 0.0000 0.0000
LSTM+BERT 0.2062 0.0000 0.0000
Criteria2Query 0.5103 0.2722 0.2292
DeepEnroll 0.3345 0.0000 0.0000
COMPOSE 0.5931 0.6370 0.6875
Table 5. Performance (measured by accuracy) on trial level matching for different disease types.

Q3. Varying Trial Phases

From the results in Table 6, COMPOSE substantially outperforms the other models in all phases. Compared to DeepEnroll, COMPOSE achieves 155% relatively higher accuracy for phase I trials, 19% higher accuracy for phase II trials and 27% higher accuracy for phase III trials. On phase I trials Criteria2Query is the stronger baseline (0.3025 versus 0.2034 for DeepEnroll), and COMPOSE still exceeds it by a wide margin.

Across all phases, the matching performance on phase I trials is generally much lower. This is because phase I trials are designed to test a new regimen’s tolerability and toxicity, and usually enroll a limited number of patients who have exhausted other treatment options. Consequently, our data contains far fewer phase I trials and fewer patients available for training and testing: only 5% of patients are enrolled in phase I trials, while many more are enrolled in phase II (42%) and phase III trials (53%).

Model Phase I Phase II Phase III
LSTM+GloVe 0.0008 0.5865 0.3743
LSTM+BERT 0.0025 0.6045 0.4862
Criteria2Query 0.3025 0.6433 0.5870
DeepEnroll 0.2034 0.7493 0.6329
COMPOSE 0.5189 0.8939 0.8005
Table 6. Performance (measured by accuracy) on trial level matching for different trial phases.
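The relative improvements quoted for the trial phases can be recomputed from the rounded accuracies in Table 6. The sketch below is a plain arithmetic check, not part of the original evaluation code; the dictionary layout and the helper name `relative_gain` are illustrative. With the rounded table values, the gains over DeepEnroll come out close to the reported 155%, 19% and 27%, with small differences attributable to rounding.

```python
# Accuracy values copied from Table 6.
table6 = {
    "Criteria2Query": {"Phase I": 0.3025, "Phase II": 0.6433, "Phase III": 0.5870},
    "DeepEnroll": {"Phase I": 0.2034, "Phase II": 0.7493, "Phase III": 0.6329},
    "COMPOSE": {"Phase I": 0.5189, "Phase II": 0.8939, "Phase III": 0.8005},
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


for baseline in ("DeepEnroll", "Criteria2Query"):
    for phase, compose_acc in table6["COMPOSE"].items():
        gain = relative_gain(compose_acc, table6[baseline][phase])
        print(f"{phase:9s} vs {baseline:14s}: {gain:6.1f}%")
```

Reporting the gain against both baselines makes explicit which comparison a quoted percentage refers to, since the strongest baseline differs by phase.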

Q4. Varying Threshold of Matching

In practice, some inclusion or exclusion criteria are so strict that they prevent finding patients, yet are non-essential and can therefore be modified. In this section, we examine how trial matching accuracy changes under different matching thresholds. We also show that COMPOSE can provide insights for guiding the adjustment of criteria.

Table 7 shows the trial matching performance under thresholds of 70%, 80% and 90% (i.e., a patient-trial pair is considered a match when the patient matches 70%/80%/90% of the trial’s criteria). Criteria2Query is not applicable to this analysis since it requires all criteria in a trial to match. For COMPOSE, each stricter matching threshold lowers accuracy by about 2%. For the baseline models, performance drops by 5% to 6% when the threshold rises from 80% to 90%, and by more than 11% when it rises from 90% to 100%.
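The thresholded matching rule can be written down in a few lines. This is a minimal sketch under stated assumptions: the function name `trial_matches` is hypothetical, and each per-criterion flag is assumed to be already resolved (an inclusion criterion is satisfied, or an exclusion criterion is not triggered).

```python
from typing import Sequence


def trial_matches(criteria_matched: Sequence[bool], threshold: float) -> bool:
    """Return True when the matched fraction of criteria reaches `threshold`."""
    if not criteria_matched:
        raise ValueError("a trial needs at least one criterion")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1]")
    matched_fraction = sum(criteria_matched) / len(criteria_matched)
    return matched_fraction >= threshold


# Hypothetical patient who satisfies 8 of a trial's 10 criteria.
per_criterion = [True] * 8 + [False] * 2
decisions = {t: trial_matches(per_criterion, t) for t in (0.7, 0.8, 0.9, 1.0)}
print(decisions)
```

A threshold of 1.0 recovers the strict setting in which every criterion must match, which is the only mode a rule-based system such as Criteria2Query supports.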

To explore which criteria cause the performance drop, we examine three criteria in the Tanezumab for Diabetic Peripheral Neuropathy trial (NCT01087203): (1) Other types of diabetic neuropathies; (2) Clinically significant neurological diseases; (3) Clinically significant psychiatric diseases. These three criteria are matched correctly by COMPOSE but wrongly by the baselines when the threshold is increased from 0.8 to 1. They describe general disease cohorts rather than a specific disease. For baseline models that use an RNN to encode medical codes, it is difficult to align such abstract cohort descriptions with detailed codes in patients’ EHR records, such as “Diabetes mellitus due to underlying condition with diabetic polyneuropathy”. Because its hierarchical memory network stores the taxonomy of medical concepts, COMPOSE can align a criterion to either a detailed code or a more general parent concept.

Model 70% Matching 80% Matching 90% Matching
LSTM+GloVe 0.6218 0.5862 0.5057
LSTM+BERT 0.7231 0.6861 0.6238
DeepEnroll 0.8225 0.7963 0.7422
COMPOSE 0.9334 0.9193 0.8915
Table 7. Performance (accuracy) for different thresholds.

Q5. Case Studies

To show how the attentional record alignment mechanism in COMPOSE works, we choose a trial on Cabozantinib, which treats grade IV astrocytic tumors. COMPOSE successfully matches this trial (94% matching) while all baselines fail (<50% matching). Fig. 3 shows the attention weights on different memory slots for 6 selected criteria.

Figure 3. Attention weights on the memory slots for the Cabozantinib trial for treating grade IV astrocytic tumors.
Figure 4. Visualization of EHR and EC embeddings using PCA. Panel (a): DeepEnroll.

For this trial, each criterion focuses on a single diagnosis or medication, so it is difficult for baseline models to match each criterion to a longer patient record. The attentional record alignment mechanism in COMPOSE, however, lets each criterion attend to the most related memory slots and thereby achieves dynamic matching. For example, the 4th criterion (pregnant or breast-feeding) is aligned to the demographic slot, and the 2nd criterion (receiving warfarin or other coumarin derivatives) is aligned to Lv.2 and Lv.4 of the medication slots, since warfarin corresponds to a detailed medication while coumarin derivatives correspond to a higher-level category.
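The idea of using a criterion embedding as a query over memory slots can be illustrated with a toy attention computation. This is a sketch, not the COMPOSE implementation: dot-product scoring followed by a softmax is an assumption, the slot names are illustrative, and the 3-dimensional vectors are hand-picked rather than learned.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_criterion(query: List[float], slots: Dict[str, List[float]]) -> Dict[str, float]:
    """Attention weights of one criterion embedding over named memory slots.

    Assumption: dot-product scoring; the scoring function in COMPOSE may differ.
    """
    names = list(slots)
    scores = [sum(q * m for q, m in zip(query, slots[name])) for name in names]
    return dict(zip(names, softmax(scores)))


# Toy 3-dimensional slot embeddings (hypothetical values, not learned).
memory_slots = {
    "demographics": [1.0, 0.0, 0.0],
    "medication_lv2": [0.0, 1.0, 0.0],
    "medication_lv4": [0.0, 0.9, 0.1],
    "diagnosis_lv4": [0.0, 0.0, 1.0],
}
# Stands in for the embedding of "pregnant or breast-feeding".
pregnancy_query = [2.0, 0.1, 0.0]
weights = align_criterion(pregnancy_query, memory_slots)
top_slot = max(weights, key=weights.get)
print(top_slot)
```

Because the weights are recomputed for every criterion, two criteria from the same trial can attend to different parts of the same patient record, which is the behaviour described for the demographic and medication slots.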

We also visually compare the learned embeddings from COMPOSE and DeepEnroll using principal component analysis (PCA) in Fig. 4. For DeepEnroll, a patient (green circle) is aligned to multiple inclusion and exclusion criteria (squares and triangles), so the EC embeddings are mixed, which leads to wrong predictions. For COMPOSE, in contrast, each EC is used as a query to match EHR records, so there are many EHR embeddings (blue and red circles), each corresponding to a specific EC. The model can therefore match different ECs more accurately. In addition, the inclusion and exclusion EC embeddings form separate clusters, which means the model can differentiate them by optimizing the distance.

We also found that for some trials it is difficult to find matching patients, e.g., the trial on Nivolumab Plus Ipilimumab or Nivolumab Plus Chemotherapy Versus Chemotherapy Alone in Early Stage Non-Small Cell Lung Cancer (NSCLC, NCT02998528). All models achieve an accuracy score lower than 50% for this trial. The ECs in this trial are listed in Appendix B, and we denote inclusion criteria as and exclusion criteria as . The prediction results of COMPOSE for this trial and a case patient are shown in Fig. 5.

Figure 5. Prediction results for NCT02998528 trial. Inclusion criteria are denoted as and exclusion criteria as .

The results show that COMPOSE successfully matches and to the patient but classifies the other ECs as unknown. and describe a large group of diseases, which models fail to match to either a detailed code or a category. For , and , EHR data is not enough to determine whether a patient matches these criteria.

6. Conclusion

In this work, we propose a cross-modal pseudo-siamese network model, COMPOSE, for patient-trial matching. COMPOSE performs dynamic patient-trial matching by learning taxonomy-guided, multi-granularity medical concept embeddings. COMPOSE is also augmented with a composite loss term that maximizes the similarity between patient records and inclusion criteria while minimizing the similarity between patient records and exclusion criteria. Experiments on real-world datasets demonstrate that COMPOSE significantly outperforms state-of-the-art baselines.

7. Acknowledgments

This work is supported in part by National Science Foundation awards IIS-1418511, CCF-1533768 and IIS-1838042, and National Institutes of Health awards NIH R01 1R01NS107291-01 and R56HL138415.


  • A. Alicante, A. Corazza, F. Isgrò, and S. Silvestri (2016) Unsupervised entity and relation extraction from clinical records in italian. Computers in biology and medicine 72, pp. 263–275. Cited by: §2.
  • E. Alsentzer, J. R. Murphy, W. Boag, W. Weng, D. Jin, T. Naumann, and M. McDermott (2019) Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323. Cited by: §4.1.
  • A. Bustos and A. Pertusa (2018) Learning eligibility in cancer clinical trials using deep neural networks. Applied Sciences 8 (7), pp. 1206. Cited by: §1, §2.
  • MK. Campbell, C. Snowdon, D. Francis, D. Elbourne, AM. McDonald, R. Knight, and A. Grant (2007) Recruitment to randomised trials: strategies for trial enrollment and participation study: the steps study. Health Technol. Assess. Cited by: §1.
  • E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun (2017) GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 787–795. Cited by: §4.2.
  • E. Choi, C. Xiao, W. Stewart, and J. Sun (2018) Mime: multilevel medical embedding of electronic health records for predictive healthcare. In Advances in Neural Information Processing Systems, pp. 4547–4557. Cited by: §4.2, item iv.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.1, item ii.
  • J. Gao, X. Wang, Y. Wang, Z. Yang, J. Gao, J. Wang, W. Tang, and X. Xie (2019) Camp: co-attention memory networks for diagnosis prediction in healthcare. Cited by: §4.2, §4.2.
  • J. Gao, C. Xiao, Y. Wang, W. Tang, L. M. Glass, and J. Sun (2020) StageNet: stage-aware neural networks for health risk prediction. arXiv preprint arXiv:2001.10054. Cited by: §4.2.
  • Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang (2017) Learning dynamic siamese network for visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1763–1771. Cited by: §2.
  • NS. Hill, IR. Preston, and KE. Roberts (2008) Patients with pulmonary arterial hypertension in clinical trials. who are they?. Proc. Am. Thorac. Soc. 5, pp. 503–609. Cited by: item i.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: item i.
  • B. Hu, Z. Lu, H. Li, and Q. Chen (2014) Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pp. 2042–2050. Cited by: §4.1.
  • L. H. Hughes, M. Schmitt, L. Mou, Y. Wang, and X. X. Zhu (2018) Identifying corresponding patches in sar and optical images with a pseudo-siamese cnn. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 784–788. Cited by: §2.
  • A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific data 3, pp. 160035. Cited by: §4.1.
  • T. Kang, S. Zhang, Y. Tang, G. W. Hruby, A. Rusanov, N. Elhadad, and C. Weng (2017) EliIE: an open-source information extraction system for clinical trial eligibility criteria. Journal of the American Medical Informatics Association 24 (6), pp. 1062–1071. Cited by: §1, §2.
  • Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush (2016) Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §4.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix C.
  • L. Ma, J. Gao, Y. Wang, C. Zhang, J. Wang, W. Ruan, W. Tang, X. Gao, and X. Ma (2019) AdaCare: explainable clinical health status representation learning via scale-adaptive feature extraction and recalibration. arXiv preprint arXiv:1911.12205. Cited by: §4.2.
  • D. Myshko (2005) Accurately costing a clinical trial - PharmaVOICE. Note: 2020-2-13 Cited by: §1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: Appendix C.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In EMNLP. Cited by: 1st item, §4.1, item i.
  • Pharmafile (2016) Clinical trials and their patients: the rising costs and how to stem the loss. Note: Jan 1, 2020 Cited by: §1.
  • Y. Qi, Y. Song, H. Zhang, and J. Liu (2016) Sketch-based image retrieval via siamese convolutional neural network. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 2460–2464. Cited by: §2.
  • Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille (2015) Multi-instance visual-semantic embedding. arXiv preprint arXiv:1512.06963. Cited by: §2, §4.3.
  • F. S. (2015) One in four cancer trials fails to enroll enough participants. Cited by: §1.
  • R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §4.1.
  • W. Treible, P. Saponaro, and C. Kambhamettu (2019) Wildcat: in-the-wild color-and-thermal patch comparison with deep residual pseudo-siamese networks. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1307–1311. Cited by: §2.
  • C. Weng, X. Wu, Z. Luo, M. R. Boland, D. Theodoratos, and S. B. Johnson (2011) EliXR: an approach to eligibility criteria extraction and representation. Journal of the American Medical Informatics Association 18 (Supplement_1), pp. i116–i124. Cited by: §1, §2.
  • J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §4.2.
  • Q. You, Z. Zhang, and J. Luo (2018) End-to-end convolutional semantic embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5735–5744. Cited by: §2, §4.1.
  • C. Yuan, P. B. Ryan, C. Ta, Y. Guo, Z. Li, J. Hardin, R. Makadia, P. Jin, N. Shang, T. Kang, et al. (2019) Criteria2Query: a natural language interface to clinical databases for cohort definition. Journal of the American Medical Informatics Association 26 (4), pp. 294–305. Cited by: §1, §2, item iii.
  • J. Zhang, X. Shi, I. King, and D. Yeung (2017) Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th international conference on World Wide Web, pp. 765–774. Cited by: §4.2.
  • X. Zhang, C. Xiao, L. Glass, and J. Sun (2020) Patient-trial matching with deep embedding and entailment prediction. In Proceedings of The Web Conference 2020, WWW ’20, New York, NY, USA, pp. 1029–1037. ISBN 9781450370233. Cited by: §1, §2, §4.2, item iv, §5.1.

Appendix A Details of Trial Data

The basic statistics of the trial dataset and EHR dataset are shown in Table 8. The detailed diseases in different cohorts and the number of trials and patients for each disease are shown in Table 9.

# of trials 590
Avg. # of Inclusion criteria per trial 6.67
Avg. # of Exclusion criteria per trial 10.56
Avg. criteria sentence length 31.80
# of patients 83,371
# of unique medications 286
# of unique diagnosis 421
# of labeled pairs 397,321
Avg. # of trials per patient 1.20
Avg. # of patients per trial 169.49
Table 8. Data statistics
Cohort Disease # of trials # of patients
Rare Polymyositis 1 11
Glioblastoma 1 16
Diffuse Large B-Cell Lymphoma 1 2
Glioma 1 2
Gastrointestinal Stromal Tumors 1 1
Oncology Brain and Central Nervous System Tumors 1 60
Unspecified Adult Solid Tumor 1 1
Astrocytic Tumors 1 19
Advanced Solid Tumor 7 9
Ovarian Cancer 1 5
cMET-dysregulated Advanced Solid Tumors 1 1
Triple Negative Breast Cancer 1 2
Cancer 1 1
Non Small Cell Lung Cancer 6 6
Metastatic Colorectal Cancer 1 1
Colorectal Cancer 2 4
Tumors 2 2
Solid Tumor 3 3
PIK3CA Mutated Advanced Solid Tumors 1 1
mCRPC 1 2
Advanced or Metastatic Breast Cancer 1 2
Gastrointestinal Stromal Tumors 1 1
Advanced Gastric Cancer 1 1
Chronic Chronic Demyelinating Polyradiculoneuropathy 1 56
Chronic Low Back Pain 2 43
Chronic Cluster Headache 3 49
Chronic Pain 4 20
Chronic Sinusitis With or Without Nasal Polyps 1 13
Chronic Myeloid Leukemia 3 5
Hepatitis C, Chronic 3 5
Chronic Severe Plaque-type Psoriasis 1 45
Treatment for Prevention of Chronic Migraine 1 21
Table 9. Diseases in different cohorts

Appendix B Inclusion and Exclusion Criteria for NCT02998528 Trial

All criteria in the NCT02998528 trial are listed below, each marked as an inclusion or exclusion criterion.

  • Inclusion: Early stage IB-IIIA, operable non-small cell lung cancer, confirmed in tissue

  • Inclusion: Lung function capacity capable of tolerating the proposed lung surgery

  • Inclusion: Eastern Cooperative Oncology Group (ECOG) Performance Status of 0-1

  • Inclusion: Available tissue of primary lung tumor

  • Exclusion: Presence of locally advanced, inoperable or metastatic disease

  • Exclusion: Participants with active, known or suspected autoimmune disease

  • Exclusion: Prior treatment with any drug that targets T cell co-stimulation pathways (such as checkpoint inhibitors)

Appendix C Implementation details

All methods are implemented in PyTorch (Paszke et al., 2017) and trained on an Ubuntu 16.04 machine with 64GB memory and a Tesla V100 GPU. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. We use a batch size of 512 and train all models for 20 epochs.

For the hyper-parameter settings of each baseline model, our principle is as follows: for each hyper-parameter, we use the recommended setting if it is available in the original paper. Otherwise, we determine its value by grid search on the validation set.

  • LSTM+GloVe. We use 300-dimensional GloVe embeddings (Pennington et al., 2014) as word embeddings and max-pooling as the sentence embedding method. Out-of-vocabulary (OOV) words are hashed to one of 100 random embeddings, each initialized with mean 0 and standard deviation 1. The number of hidden units of the LSTM cell is set to 256.

  • LSTM+BERT. The number of hidden units of the LSTM cell is set to 512. We use the same pretrained clinicalBERT model as in COMPOSE.

  • Criteria2Query. We directly use the existing model to produce the matching results for given trials and patients. This baseline does not require any training process or hyper-parameters.

  • DeepEnroll. We use the recommended settings in the paper. However, to be fair, we use the same BERT model for DeepEnroll and our model.

  • COMPOSE. We set the dimension of all convolutional layers to 128. The kernel sizes are set to 1, 3, 5 and 7, and we use kernel size 3 for the highway layers. The number of hidden units of the memory slots is set to 320. The margin in the loss term is set to 0.3.
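The out-of-vocabulary handling described for the LSTM+GloVe baseline (hashing unknown words to one of 100 random embeddings) can be sketched as follows. This is an illustration under assumptions: the Gaussian initialisation, the fixed seed, the SHA-256 hash and all names (`OOV_TABLE`, `oov_bucket`, `embed`, `toy_vocab`) are choices made here, since the text only states the number of buckets, the dimensionality and the mean and standard deviation.

```python
import hashlib
import random
from typing import Dict, List

EMBED_DIM = 300        # GloVe dimensionality stated for the baseline
NUM_OOV_BUCKETS = 100  # number of shared random OOV embeddings

# Assumption: Gaussian initialisation (mean 0, std 1) with a fixed seed.
_rng = random.Random(0)
OOV_TABLE: List[List[float]] = [
    [_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(NUM_OOV_BUCKETS)
]


def oov_bucket(word: str) -> int:
    """Deterministically map a word to one of the OOV buckets.

    hashlib is used because Python's built-in hash() is salted per process.
    """
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_OOV_BUCKETS


def embed(word: str, vocab: Dict[str, List[float]]) -> List[float]:
    """Look up a pretrained vector, falling back to a hashed OOV vector."""
    if word in vocab:
        return vocab[word]
    return OOV_TABLE[oov_bucket(word)]


# Hypothetical one-word vocabulary standing in for the GloVe table.
toy_vocab = {"pain": [0.0] * EMBED_DIM}
vec_known = embed("pain", toy_vocab)
vec_oov = embed("cabozantinib", toy_vocab)
print(len(vec_known), len(vec_oov))
```

Hashing keeps the embedding of an unseen word stable across sentences and runs, so rare drug names in criteria text still receive a consistent, if uninformative, representation.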