Predicting Opioid Use Disorder from Longitudinal Healthcare Data using Multi-stream Transformer

by   Sajjad Fouladvand, et al.

Opioid Use Disorder (OUD) is a public health crisis costing the US billions of dollars annually in healthcare, lost workplace productivity, and crime. Analyzing longitudinal healthcare data is critical in addressing many real-world problems in healthcare. Leveraging the real-world longitudinal healthcare data, we propose a novel multi-stream transformer model called MUPOD for OUD prediction. MUPOD is designed to simultaneously analyze multiple types of healthcare data streams, such as medications and diagnoses, by finding the attentions within and across these data streams. Our model tested on the data from 392,492 patients with long-term back pain problems showed significantly better performance than the traditional models and recently developed deep learning models.


page 1

page 2

page 3

page 4


Modeling disease progression in longitudinal EHR data using continuous-time hidden Markov models

Modeling disease progression in healthcare administrative databases is c...

Double-Coupling Learning for Multi-Task Data Stream Classification

Data stream classification methods demonstrate promising performance on ...

Distilling Knowledge from Deep Networks with Applications to Healthcare Domain

Exponential growth in Electronic Healthcare Records (EHR) has resulted i...

Functional Principal Component Analysis for Extrapolating Multi-stream Longitudinal Data

The advance of modern sensor technologies enables collection of multi-st...

A Canonical Architecture For Predictive Analytics on Longitudinal Patient Records

Many institutions within the healthcare ecosystem are making significant...

Predicting Ulnar Collateral Ligament Injury in Rookie Major League Baseball Pitchers

In the growing world of machine learning and data analytics, scholars ar...

Trading-Off Cost of Deployment Versus Accuracy in Learning Predictive Models

Predictive models are finding an increasing number of applications in ma...

1 Introduction

Early identification and engagement of individuals at risk of developing an opioid use disorder (OUD) is a critical unmet need in healthcare [gostin2017reframing, pitt2018modeling]. Individuals with OUD often do not seek treatment or have internalized stigma about OUD that limits identification through traditional means, such as screening and clinical interview [olsen2014confronting]. Significant disparities limit access to treatment for OUD resulting in less than 20% of all individuals with OUD receiving any form of treatment in the past year [wu2016treatment]

. Typical clinician workflow does not allow for comprehensive OUD screening, but available administrative and clinical data have the potential to help clinicians identify and screen higher risk patients providing an opportunity for primary care professionals to play a greater role in increasing OUD detection, treatment, and prevention. Healthcare data are a growing source of information that can be harnessed together with machine learning to advance our understanding of factors that increase the propensity for developing OUDs as well as those that aid in the treatment of the disorders

[RN4]. In healthcare data, patients’ outcomes and treatments are collected at multiple follow-up times. Tools developed to analyze longitudinal healthcare data and to extract meaningful patterns from these ever growing data are critical in addressing real-world public health emergency including but not limited to OUD.

Analyzing real-world data is a complicated task with multiple computational challenges including high dimensionality, heterogeneity, temporal dependency, sparsity, and irregularity [RN13]. In particular, healthcare (and claim) data are typically collected from multiple sources, and the subsequent data analysis requires to simultaneously analyze the temporal correlation among multiple streams such as medications, diagnoses, and procedures. Deep learning models have demonstrated great potential in addressing some of these challenges and creating promising longitudinal healthcare data analysis tools. Among them, Doctor AI [choi2016doctor], RETAIN [choi2016retain], and DeepCare [pham2016deepcare]

modeled multiple data streams including medications, diagnoses, and procedures using Recurrent Neural Network (RNN) models such as Long-Short Term Memory models (LSTMs) 


. Doctor AI concatenated multi-hot input vectors to predict subsequent visit events 

[choi2016doctor]. RETAIN used two separated RNNs to generate attentions at the visit level and the variable level as well [choi2016retain]. These applications demonstrate that RNNs are promising in longitudinal and sequential healthcare data analysis, since RNNs are capable of extracting contextual information from past time steps and pass this information forward, which helps to efficiently model long-term dependencies in input streams[RN25]. Nevertheless, the network architecture and design preclude RNNs from processing long streams in reasonable amount of time [ma2017dipole]. Attention mechanism was introduced to RNNs to increase their capacity in capturing longer dependencies more efficiently[ma2017dipole, luong2015, bahdanau2014]. Attention-based models bridge the gap between different states in RNNs using a context vector. Successful applications of attentions led to the transformer model [NIPS2017_7181], which removed recurrence in RNNs and instead relied entirely on the attention mechanism.

Transformer is a variety of attention-based deep learning models originally proposed for natural language processing (NLP) tasks such as machine translation 

[NIPS2017_7181]. Later, transformer has been applied on longitudinal EHR data [choi2020learning] to predict patients’ outcomes in the future. There are already several models that have been successfully applied on EHR data without significantly changing the network architecture or loss [song2018attend, wang2019inpatient2vec, shang2019pre]. Of course, transformer’s structure can be altered to better fit the special needs of solving healthcare problems [choi2020learning, li2020behrt]

. Choi et al. proposed a transformer model for healthcare data analysis by utilizing the conditional probabilities calculated from the encounter records to guiding the self-attention mechanism in the transformer 

[choi2020learning]. BEHRT [li2020behrt] was developed based on BERT [bert], a popular transformer model for NLP tasks, for analyzing EHR data. BEHRT considers the patients’ existing diagnoses and demographic data to predict their future diagnoses. Similar to RNNs, transformers have been modified to model multiple data streams. Li et al developed a two-stream transformer to analyze both time-over-channel and channel-over-time features in human activity recognition tasks [bing2021that]. Two parallel, yet separate transformers were used to handle two input streams. Another multi-stream transformer has been developed to generate effective self-attentions for speech recognition [han2019state]. They parallelized multiple self-attention encoders to process different input speech frames. Gomez et al. developed a multi-channel transformer for sign language translation using one self-attention encoder [camgoz2020multi]

. Their model finds the attentions across three different channels, i.e. hand shapes, mouthing, and upper body pose. A more recent work 

[hu2021transformer] showed that “transformer is all you need” by using multiple transformer encoders. The encoded outputs can be concatenated using a joint decoder that enables simultaneous model training. There are also works that analyze multi-stream data using transformer by simply stacking or parallelizing multiple transformer models [libovicky2018input, li2019visualbert].

Figure 1: Data preprocessing and patient representation. EHR data are first converted to an enrollee-time matrix . Then, the data are fed to LSTM models to encode the medication and diagnosis streams separately.

Although the recently developed transformer models showed promising performance, especially on handling multiple data streams, the potential of applying transformer on healthcare data analysis has not yet been fully explored. One of the major limitations is the lack of power to model multiple data streams within the self-attention layer. Transformer originally was designed to process one data stream, which is mostly an order of words in a NLP task, at a time. The modified transforms either can only handle multiple streams at intra-stream level or they are not suitable to solve OUD prediction problem as a real-time task where only previous clinical events can be used to make a prediction at a specific time point. Here, OUD prediction is a complex data analysis task that includes not only finding long term effects of prescription opioids such as morphine and fentanyl, history of diagnoses such as mood disorders, but also the hidden associations between patient’s prescriptions and diagnoses, since these input streams are highly correlated with each other. Identifying the relationships within and between input streams may reveal hidden patterns leading to an increased power in OUD prediction and interpretation. Moreover, the medication application patterns and the interactions between medications across different visits as well as the patient’s diagnoses patterns throughout his/her medical history may carry important information that should be extracted in order to develop precise OUD prediction and treatment tools.

This study proposes a novel transformer model called Multi-stream Transformer for Predicting Opioid Use Disorder (MUPOD) to analyze longitudinal healthcare data collected from multiple sources and predict the onset of OUD. First of all, MUPOD is capable of analyzing multiple data streams, such as medication and diagnosis, simultaneously and extracting associations within and between the streams. Second, MUPOD utilizes attention weights within and across data streams to interpret the prediction results. In our experiment, MUPOD successfully captured the complex associations within and between multiple streams including medication, diagnoses, and demographic information, and predicts the onset of OUD precisely.

2 Materials and Methods

Data Set

The large-scale administrative records in the IBM (formerly Truven Health Analytics) MarketScan Commercial Claims [truven] database were used to train and test both baseline models and MUPOD. Data include person-specific clinical utilization, expenditures and enrollment across inpatient, outpatient, prescription drug and carve-out services. The database contains about 30 million enrollees annually across the US, and these enrollees are nationally representative of the US population with respect to sex (50% female), regional distribution, and age.

We extracted medications, diagnoses and demographic information of 682,402 patients who have at least one diagnosis of OUD (ICD-9: 304.0x, 305.5x and ICD-10:; where x can be any code) from 2009 to 2018. Hyper-geometric [rice2006mathematical] test was used to identify sub-cohorts of OUD with high statistical significance of whether a population consists the richest information of OUD. We identified an OUD sub-cohort (p-value 0.0) with 229,214 patients who had at least one Clinical Classification Software code (CCS) of 205 (patients with Spondylosis; intervertebral disc disorders; other back problems). This sub-cohort was defined as the case cohort. Note that CCS 205 has already been shown to be a prevalent diagnosis in OUD patients in the literature [mark2013psychiatric, barnett2009comparison].

The case cohort (OUD positive and CCS 205) was matched with a subpopulation of OUD-negative patients called the control cohort. All the individuals in the control cohort has the same back pain diagnosis (CCS 205) but do not develop OUD. We first matched case and control based on age and sex. Second, we matched case and control based on the opioid medication use duration. Specifically, we grouped every opioid medication with a Generic Product Identifier (GPI) of 65x as opioid medications. Buprenorphine and Methadone were excluded as they are often used as a treatment for opioid overdose. Second, we randomly sampled OUD-negative patients who have the matched age and gender with the case ensuring that the averaged opioid use ratio between case and control is almost equal.

Table 1

shows the characteristics of case and control regarding age, sex, top-10 most frequent medications and top-10 most frequent diagnoses. The diagnoses and the medications were classified using CCS codes and Generic Product Identifier codes (TCGPI) respectively. For Analgestics-Opioid, Neuromuscular Agents Anticonvulsants, Musculoskeletal Therapy Agents (TCGPI=65x, TCGPI=72x and TCGPI=75x where x can be any code) and Antianxiety Agents (TCGPI=57x), we grouped these medications based on the first two digits of their TCGPI codes (from left to right). The rest of medications were classified using the first 6 digits of their TCGPI codes (from left to right). The variables presented in Table 

1 have already been reported as OUD risk factors in the literature [mark2013psychiatric, clarke2014rates]. Especially, diseases including “Other connective tissue disease”, “Other nervous system disorders”, “Essential hypertension”, “Mood disorders”, “Other non-traumatic joint disorders and Anxiety disorders” have been found to be more prevalent diagnoses among OUD patients than normal people [mark2013psychiatric]. Note, since we matched the case and control cohorts based on age, sex and analgesics-opioid use these three variables have similar statistical characteristics across both case and control cohorts. However, the distributions of other variables varies across the case and control cohorts and can be utilized by our deep learning models to discriminate OUD-positive patients from OUD-negative individuals.

Variables Case Control Variables Case Control
Age (SD) 45.62 (13.81) 52.35 (14.39) Female (percentage) 109,121 (55.60%) 117,699 (59.98%)
Diagnoses (CCS Code) Medications (TCGPI Code)
Other connective tissue disease (211) 152,703 (77.81%) 165,112 (84.14%) Analgesics - Opioid (65) 190,141 (96.89%) 196,246 (100%)
Other nervous system disorders (95) 138,866 (70.76%) 141,350 (72.03%) Neuromuscular Agents Anticonvulsants (72) 105,508 (53.76%) 97,444 (49.65%)
Essential hypertension (98) 106,299 (54.17%) 132,049 (67.29) Musculoskeletal Therapy Agents (75) 106,186 (54.11%) 102,888 (52.43%)
Mood disorders (657) 97,035 (49.45%) 81,306 (41.43%) Antianxiety Agents (57) 76,830 (39.15%) 75,463 (38.45%)
Other aftercare (257) 127,131 (64.78%) 133,920 (68.24%) Proton Pump Inhibitors (492700) 71,243 (36.30%) 86,561 (44.11%)
Residual codes; unclassified (259) 136,177 (69.39%) 152,748 (77.83%) Serotonin-norepinephrine Reuptake Inhibitors (581800) 58,039 (29.57%) 48,323 (24.62%)
Other non-traumatic joint disorders (204) 134,042 (68.30%) 150,660 (76.77%) Selective Serotonin Reuptake Inhibitors (581600) 69,665 (35.50%) 65,005 (33.12%)
Anxiety disorders (651) 91,736 (46.75%) 78,296 (39.90%) Hmg Coa Reductase Inhibitors (394000) 53,806 (27.42%) 79,201 (40.36)
Disorders of lipid metabolism (53) 94,507 (48.16%) 122,322 (62.33%) Non-barbiturate Hypnotics (602040) 46,965 (23.93%) 44,404 (22.63%)
Medical examination/evaluation (256) 129,224 (65.85%) 147,268 (75.04%) Nonsteroidal Anti-inflammatory Agents (661000) 87,301 (44.49%) 98,639 (50.26%)
Table 1: Distributions of age, sex, medication, and diagnoses in both case and control. Top 10 diagnoses and medications are provided. The numbers indicate the number of patients who had at least one such diagnosis or medication.

Data Pre-processing

For each of the enrollees in the case and control cohort, his/her medications and diagnoses between Jan 2009 and Dec 2018 and demographic records were extracted. In total, we extracted 78,136,935 medication records and 143,275,864 diagnoses records. The original format of the prescription and professional service encounter claims in IBM MarketScan data is a table where each row is a visit and columns are enrollee ID, date of visit, and prescription/diagnoses. If an enrollee has multiple visits, each visit will occupy a row in the table. To facilitate further study of the temporal patterns in the data, we converted the data into an enrollee-time matrix , where is the complete set of enrollees in , is the set of time steps between Jan 2009 and Dec 2018 (by month) and is the feature set, each vector records the medications/diagnoses an enrollee took at time . We excluded patients from if the number of valid entries is less than 3.

The goal of data representation is to learn a function: , where , and and are medication and diagnosis, respectively. To train the function , LSTM [sud2019Sajjad, RN19] was adopted. The outputs of the trained LSTMs were used to represent both the OUD case and control cohorts. The general schema of the data pre-processing and representation is shown in Figure 1.

MUPOD Architecture

MUPOD is a transformer-based deep learning model designed to analyze highly correlated healthcare data streams simultaneously. To minimize ambiguity, the algorithms is described for a single patient and for . Each patient can be represented by in which is a set of input streams and is the target label. Herein, three input streams are considered: 1) medication tuples in which is the time step and is a list of medications that the patient is prescribed with at time , 2), diagnoses tuples where is the time step and is a list of diagnoses assigned to the patient at time , 3) demographic tuple in which is the time step and is the demographic information of patient at .

This study uses the encoder part of transformer to identify the associations between medication and diagnosis across time and make the prediction of OUD. Medications , diagnoses , and demographics are fed to the model in parallel. The first step is to incorporate the temporal patterns of the data stream into the encoder’s inputs using positional encoding. The embedding layer in the transformer is replaced by the proposed LSTM based representation layer. This change has two computational advantages. Firstly, it deals with challenges in the input data such as variable dimension and data sparseness, which is common in longitudinal healthcare data. Secondly, it extracts hidden parameters and transforms the original input into a new feature space where cases and controls are better separated than in the original feature space.

The encoded input streams are plugged into the attention layer to generate Query, Key, and Value matrices for each input stream. For example, medications are fed to a set of fully connected layers to generate , , and , representing the query, key, and value matrices for the encoded medication stream for patient . Let , the Query, Key, and Value matrices are used to find the attentions across these three input streams:


Figure 2 describes how the attention weights are calculated across different input streams and are transformed into new streams for the next encoder. In the figure, , , and represent query, key, and value matrices for stream (). All possible permutations of the stream are used to determine the attention weights between different visits and across streams. Attentions are then passed through a set of dense layers to generate outputs. For example, given two data streams and , we can generate three permutations i.e. , , and .

(a) General architecture of MUPOD.
(b) Multi-stream encoder.
Figure 2: MUPOD architecture. represent query, key, and value matrices for the input stream , where . represents the attention weights between different records across input streams and , where . represents the outputs, which capture the associations between the input streams and . The demographic information is plugged into the system before the last layer and in the prediction layer.

The reconstruction layer receives the relevant output and map them to appropriate format for the next layer as described in Equation 2. For example, only the outputs relevant to the medications () including and are used to reconstruct the medication stream appropriate to be fed into the next encoder layer.


where , , , and are trainable reconstruction weight and bias matrices. The two reconstructed matrices generated by the last encoder layer are fed to a prediction layer to make the final prediction for the current patient . The prediction layer is implemented in Equation 3, where is trainable weight matrix and is the bias matrix.


Experimental Results

All the deep learning models in this work were deployed on the TensorFlow platform 


and were trained using eight GeForce GTX 1080 GPUs. Original transformer model, LSTM models, Random Forest (RF) 


and Support Vector Machine (SVM)

[RN14] were compared with MUPOD as baselines. We used 314,504 samples for training, 38,776 samples for validation and 39,212 for testing the models. All results reported in this paper are on the test set. We optimized all models using a random search policy across hyper-parameters of each model. A grid of hyper-parameters values was set up and 10 random combinations of the hyper-parameters were selected to train the models. For the LSTMs, their learning rates were randomly set to where . The batch size was randomly selected from and the number of iterations was randomly selected from where . The regularization parameter for LSTM models were randomly selected from where

. The number of hidden neurons for the LSTMs in representation layer was fixed to 10; because the outputs of these LSTM models was the inputs to MUPOD and the inputs to our model has to have a fixed dimension (the dimension of our model in this paper is 20: 10 for medications and 10 for diagnoses stream). However, the number of hidden neurons for other LSTM model used as a baseline (refer to Table 

2) was randomly selected from where .

Table 2 compares the prediction performance of MUPOD with RF, SVM, LSTM and original Transformer model. We used the same train, validation and test data to train, validate and test all models in Table 2 except for the SVM model. Due to the hardware and time limitation we had to train and test this model using 10,000 randomly selected samples. Note, the LSTM model in Table 2 is trained using medication, diagnosis and demographic data. We concatenated medications, diagnoses and demographics in each time step and formed a single vector which was fed to this LSTM model. The hyper-parameter search space for this LSTM was the same as explained earlier in this section. We used a randomized 5-fold cross validation to tune RF and SVM models. The RF and SVM were trained on the static data and the LSTM, transformer and MUPOD were trained on the longitudinal data. To create a static data for RF and SVM, the longitudinal data was converted to a new format , where is the complete list of patients, and is a vector including aggregated values for all medication, diagnosis and demographic features across time steps (from Jan. 2009 to Dec. 2018). Transformer is the original encoder block of the transformer model [NIPS2017_7181]. In Table 2, MUPOD has the highest accuracy (0.775), precision (0.741), F1-score (0.790) and AUC (0.871). These results indicate that our proposed model captures important factors in the medication, diagnosis and demographic data and provides an increased power to predict the development of OUD, while RF, SVM, LSTM and original Transformer miss such factors.

Model Acc. Prec. Rec. F1-score AUC
RF 0.698 0.694 0.708 0.701 0.774
SVM 0.619 0.644 0.569 0.604 0.665
LSTM 0.656 0.700 0.546 0.614 0.729
Transformer 0.708 0.654 0.880 0.751 0.801
MUPOD 0.775 0.741 0.847 0.790 0.871
Table 2: Performance of OUD prediction using MUPOD compared to RF, SVM, LSTM and original transformer.

In addition, we tested the models performances on three imbalanced test data sets with the ratio of OUD-positive samples to OUD-negative samples of 0.1, 0.2 and 0.5. Table 3 shows the model performances on imbalanced test sets. Table 3 shows that MUPOD maintains higher performance on all imbalanced test sets compared to all baselines in terms of precision, F1-score and AUC. Note, we didn’t show accuracy in Table 3, because this measure is not informative when assessing algorithms on imbalanced data.

Model Precision Recall F1-score AUC
RF .531 .313 .182 .710 .715 .701 .608 .436 .290 .773 .777 .770
LSTM .539 .312 .189 .548 .532 .546 .544 .393 .281 .730 .723 .732
Transformer .486 .276 .160 .879 .885 .883 .626 .420 .270 .799 .804 .796
MUPOD .588 .364 .221 .845 .848 .843 .693 .509 .351 .871 .870 .871
Table 3: OUD prediction results for imbalanced test sets. The means the number of samples in the OUD-positive cohort are times smaller than the number of samples in the OUD-negative cohort.

We revealed the relationships between the medication and diagnosis streams by aggregating the attention weights in the first layer of the model for all the records of each individual and visualized the results. In the visualization, a rectangular node represents a medication type and an oval node represents a diagnosis code. We divided the accumulated attention weights to “weak”, “moderate”, and “strong” based on a pre-defined threshold (i.e. weak: , moderate: , and strong: ), represented using solid, dashed, and dotted lines respectively. The lines of an OUD-negative patient are colored black, while the lines of an OUD-positive patient are colored red.

Figure 2(b)

shows the attention weights computed with MUPOD on one OUD-positive and one OUD-negative patient. The cosine similarities of the medication and diagnosis streams of the two patients are 0.85 and 0.27, respectively, indicating that they have different diagnoses but similar medication records. The connections belonging to the positive and negative are well separated. Besides, almost all the strong connections are from the OUD-positive patient, while all the moderate connections are from the OUD-negative patient. Similarly, Figure 

2(c) shows the attention weights on one OUD-positive and one OUD-negative patient. The cosine similarities of the medication and diagnosis streams of the two patients are 0.71 and 0.93, respectively, indicating that they have very similar diagnoses and medication records. Although they have similar records and similar connections between medication and diagnoses nodes, the strengths of attention is different for the OUD-positive patient versus the OUD-negative patient and MUPOD was able to correctly classify these two samples.

(a) Medication and diagnoses abbreviations and full names.
(b) Sample 1.
(c) Sample 2.
Figure 3: Attention weights. Rectangular nodes represent medications and oval nodes represent diagnoses. Solid, dashed and dotted edges respectively mean strong, moderate and weak connections. We used abbreviations for medications and diagnoses, and provided the full names in (a).


OUD is a public health crisis costing the US billions of dollars annually in healthcare, lostworkplace productivity, and crime. In this study, we developed a multi-stream transformer model to analyze the long-term impact of medication application pattern, diagnosis history and demographic information, and to explore the associations within and between these streams of patients data. Our proposed model was able to predict the onset of OUD more efficiently compared to baseline models including RF, SVM, LSTM and original transformer model. We discovered that the associations between medication and diagnosis streams are key factors that improve power to predict the development of subsequent OUD.

There are some limitations in our approach. First, the current model relies on patient demographic information and limited subset of medications and diagnoses as features. Incorporating more detailed diagnostic and medication information such as daily dose of opioid could refine the relationship between medications and diagnoses, and create more accurate OUD prediction tools. Second, the current approach cannot predict/estimate risks because the medication application patterns and the diagnosis history of patients that may lead to the increment of OUD risk has not been studied. In the future, we will extend our model to address the aforementioned problems such as incorporating more medication and diagnosis features as well as the Morphine Milligram Equivalent (MME) information in MUPOD. The rationale is, given a patient who is constantly on the same type of medication for a while, the variation of the dosage may indicate whether the medication is still effective for the patient.

Despite the limitations of the model, the current approach adds detail to our understanding of the factors that may be important to the development of OUD. Our hope is a more thorough understanding of the relationships between medications and diagnosis will eventually enable clinicians to identify individuals at risk for OUD at an earlier stage, and ideally, perhaps even prevent OUD.


This research is supported by Kentucky Lung Cancer Research (grant no.KLCR-3048113817).