Outcome identification in electronic health records using predictions from an enriched Dirichlet process mixture

06/06/2018 ∙ by Bret Zeldow, et al. ∙ 0

We propose a novel semiparametric model for the joint distribution of a continuous longitudinal outcome and the baseline covariates using an enriched Dirichlet process (EDP) prior. This joint model decomposes into a linear mixed model for the outcome given the covariates and marginals for the covariates. The nonparametric EDP prior is placed on the regression and spline coefficients, the error variance, and the parameters governing the predictor space. We predict the outcome at unobserved time points for subjects with data at other time points as well as for new subjects with only baseline covariates. We find improved prediction over mixed models with Dirichlet process (DP) priors when there are a large number of covariates. Our method is demonstrated with electronic health records consisting of initiators of second generation antipsychotic medications, which are known to increase the risk of diabetes. We use our model to predict laboratory values indicative of diabetes for each individual and assess incidence of suspected diabetes from the predicted dataset. Our model also serves as a functional clustering algorithm in which subjects are clustered into groups with similar longitudinal trajectories of the outcome over time.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Electronic health records (EHR), now a critical component of health care, make a large quantity of data available for researchers. Challenges in using EHR for statistical analyses, however, are well-documented [1]. The focus of this paper is on the challenge of outcome identification. Many diseases can be identified in the data from diagnostic codes. However, this is unlikely to fully capture outcomes. EHR data often contain longitudinal measures from laboratory tests (labs) which can be used for the diagnosis of diseases and for disease monitoring. In practice, labs are sometime used to identify additional outcomes (beyond those identified from diagnostic codes). For instance, subjects at risk for diabetes can have fasting glucose labs monitored over time, which can be instrumental in diagnosing the disease [2]. From a statistical perspective, one challenge is that labs may be abundant for some subjects and sparse or missing for others. Unlike in planned observational studies with primary data collection, labs are not necessarily observed at ideal times. Correspondingly, it may be helpful to model these labs and to use this model to make predictions at time points of interest for EHR containing missing or sparse data. To this end, we propose a flexible joint model for the distribution of a continuous longitudinal outcome (lab values) and baseline covariates. The parameters from the joint model are all given a Dirichlet process (DP) prior with the enrichment proposed in Wade et al. [3]. Our model provides a flexible framework for prediction as well as serving as a functional clustering algorithm in which one does not specify the number of clusters a priori.

The Dirichlet process (DP) mixture is a popular Bayesian nonparametric (BNP) model [4, 5, 6] found in many applications, including topic modeling [7], survival analysis [8], regression [9], classification [10], and causal inference [11]. Consider the regression setting of Shahbaba and Neal [12] and Hannah et al. [9], where there is an outcome which we would like to regress on covariates . In a Bayesian generalized linear model (GLM) setup [13], the predictors

are restricted to be a linear combination of the unknown regression parameters. Because of this, GLMs are not appropriate to model nonlinear response curves when the regression coefficients are given normally distributed priors

[14]. In contrast, placing a DP prior on the regression coefficients (DP-GLM) instead of a parametric prior allows for nonlinearities despite the underlying GLM framework, and this flexibility can often be achieved with only modest additional computational burden. The power of the DP prior stems in part from its partitioning properties [15], where it clusters observations and fits local regressions among subjects with similar relationships between covariates and the outcome [9].

Wade et al. [3] showed that with a high number of covariates , the likelihood contribution of can dominate the posterior of the partition so that clusters form based more on similarity of covariates than on regression parameters. This leads to a high number of clusters with few observations per cluster and can result in poor predictive performance that can be improved by using an enriched DP (EDP) mixture instead of a DP mixture [16]. The EDP mixture allows for nested clustering, where one can have clusters based solely on the regression coefficients governing on and within those, nested clusters based on similarity in the covariate space. The benefits of the EDP mixture were demonstrated in simulation and in a real data analysis [16].

In this paper, we extend the EDP mixture model to longitudinal settings with a continuous outcome. Some alternatives to our EDP approach to longitudinal data have been proposed in the literature. Müller and Rosner [17] modeled blood concentrations in a pharmacokinetic study using DP mixtures with a DP prior on the covariate parameters and the regression coefficients. Li et al. [18]

developed a flexible semiparametric mixed model with smoothing splines and a DP prior on the random effects with a uniform shrinkage prior for its hyperparameters.

Das et al. [19] fit a bivariate longitudinal model for sparse data with penalized splines for the effect of time and DP priors on the random effects. Quintana et al. [20] developed a longitudinal model with random effects and a Gaussian process with DP mixtures on covariance parameters of the Gaussian process. This allows for flexible modeling of the correlation structure. Bigelow and Dunson [21] fit a joint model for a binary outcome and functional predictor where the functional predictor was modeled with cubic B-splines whose basis coefficients were given a DP prior. Scarpa and Dunson [22] developed an enriched (unrelated to the enriched DP) stick-breaking process which incorporated curve features to better fit functional data.

Our model is unique in that the regression parameters and the parameters for the covariates are given an EDP prior rather than the usual DP prior. As a result, the partitions are not dominated by the covariates as may otherwise happen. Along with improved prediction over DP priors, our model serves as a functional clustering algorithm in which subjects with similar trajectories over time are likely to be part of the same cluster. This aspect also benefits from the EDP prior as functions cluster separately on the regression parameters and covariates. Functional clustering can illuminate distinct patterns among different groups of subjects. A review of functional clustering can be found in Jacques and Preda [23]

. Notably, frequentist and parametric Bayesian methods often require prior specification of the number of clusters, often chosen through model fit statistics. Our EDP model requires no such specification; new clusters may form and existing clusters may vanish throughout the Markov Chain Monte Carlo (MCMC) algorithm.

Our motivating example is a study of individuals who newly initiate a second-generation antipsychotic (SGA). SGAs are known to increase incidence of diabetes [24, 25]. A previous analysis explored the value of incorporating elevated laboratory test results as part of the definition of the outcome of incident diabetes, defined by diagnosis codes and dispensing claims of antidiabetics [26]

. However, many subjects had no recorded lab values or had them measured outside the narrow study window. In this paper, we demonstrate our model by regressing each of three lab values indicative of diabetes (hemoglobin A1c, fasting glucose, and random glucose) on baseline covariates and time. Throughout the MCMC algorithm, values are predicted for each subject at the end of the individual’s follow-up, either at one year post SGA initiation or earlier if censored prior to that time. We then combine each set of predictions with the observed data so that each draw can be thought of as an imputed lab value. We then calculate the incidence of diabetes using a multiple imputation procedure

[27]. Lastly, we demonstrate how our model can be used for functional clustering by examining posterior clustering patterns resulting from the model.

The rest of the paper is organized as follows. In section 2, we write out the details of our model and describe key components. In section 3, we discuss computations and making predictions from our model. In section 4, we test our method on simulated datasets. In section 5, we apply our method to the SGA dataset. We discuss the paper in section 6 including limitations and future directions.

2 Model

To motivate our model, first consider a hypothetical planned observational study, where the outcome of interest is diabetes status one year following initiation of an SGA. In that hypothetical study, we would collect laboratory data, such as hemoglobin A1c (HbA1c) at the end of the study. We might then classify people as having the outcome if their HbA1c value was


Now consider a study with the same goals, but using EHR data. Figure 1 shows four hypothetical subjects with longitudinal measurements of HbA1c over a period of about 15 months. We are interested in determining whether HbA1c levels are

6.5% at month 12. However, none of the four subjects have data collected precisely at month 12, so we need to interpolate from observed data to classify them as elevated or not at month 12. How we classify them is dependent on the algorithm used. Naive algorithms might include basing classification on the value closest to month 12, on the value closest to month 12 that is prior to month 12, or on the maximum value prior to month 12. For instance, it is clear that subject (a) can be classified as either elevated or not depending on the algorithm implemented. Subject (b) has many observations but only one is above the critical threshold and the overall trend suggests their value at month 12 would not elevated. The data for subject (c) has highly variable data and it is uncertain what their month 12 value would be. All subjects have varying degrees of uncertainty in their classifications. These naive classification methods do not use all of the data and do not account for uncertainty in the prediction/imputation.

Our BNP model, described below, was designed to impute outcomes at any time or times of interest, while fully utilizing all of the data (covariates and labs over time). It uses all available data and predicts the outcome at unobserved time points periodically throughout the MCMC algorithm. Thus, for each subject we estimate the distribution of the outcome at the time point of interest rather than just a single prediction.

Figure 1: Hypothetical example of data from electronic health records. The four panels represent the hemoglobin A1c (HbA1c) values for four subjects over a 12+ month period. The vertical dashed line at month 12 indicates the time point of interest. The horizontal dashed line represents the critical value (6.5%) of HbA1c above which or equal to indicates diabetes. The cross marks indicate the observed values for each subject. None of the four subjects have values taken precisely at month 12 so interpolation is necessary. Panel (a) shows a subject who has a rising trajectory but doesn’t cross the threshold until after month 12. Panel (b) shows a subject who crosses the threshold once prior to month 12 but is stable below the threshold for many other observations. Panel (c) shows a subject with highly variable data around the threshold before and after month 12. Panel (d) shows a subject below the threshold for all observed data.

2.1 Notation

Let denote the occurrence () of a continuous outcome for subject , , observed at time . Let

denote the vector of outcomes for all subjects and

denote the vector of outcomes for the subject. Let denote the vector of time points at which were recorded so that both and are length . The covariates for subject are measured at baseline and denoted by the -dimensional vector . Without loss of generality, let the first values be binary and the remaining be continuous with . Let denote the total number of subjects and denote the total number of observations, accounting for multiple observations per subject.

We model the distribution the outcome as a function of covariates and time jointly with the marginal distributions of . To allow for nonlinearities across time, we use splines with prespecified knots at with . Bigelow and Dunson [21] considered B-splines [28] and Li et al. [18] used P-splines. We opt for penalized, thin plate splines which have good mixing properties in Bayesian analysis [29]. The choice of penalized splines also allows us to choose a large number of knots, reducing the dependency of the model fit on the selection of knot locations. However, any number of basis expansions are possible, including wavelets [30]. For thin plate splines, let denote the by matrix with each row corresponding to the basis functions evaluated at each observed time point . The matrix is calculated as , where the rows of are equal to and the penalty matrix is a matrix where the th entry is [29]. The penalty matrix prevents overfitting by penalizing the coefficients of . Each subject in the sample contains a by submatrix of which corresponds to the basis functions evaluated at each .

We fit the model


where are the regression parameters. The notation means that and , where and are positive valued parameters and is the base distribution with parameters and independent. Here,


where the first product is among binary covariates followed by a product over the continuous covariates. The notation indicates the vector with time possibly added, as would be the case if splines were omitted.

We assume that continuous variables are (locally) normally distributed and that binary predictors are Bernoulli. Other distributions can be used, but these distributions are convenient for their conjugacy properties. The parameter corresponds to the two dimensional parameter with mean and variance if the

covariate is continuous or the one dimensional probability parameter

if the covariate is binary. Integrating out the subject specific parameters and as in Wade et al. [16], our model can be thought as a countable mixture of linear mixed models where each subject is assigned to one of the mixture components.

We do not posit any a priori relationship between time and the outcome. In some applications where the overall trend may be known (for example, the amount of medication in blood may decrease over time after a drug is administered in a pharmacokinetic study), we may posit a model for equation (1) incorporating such knowledge, as in Müller and Rosner [17] which assumed a piecewise linear structure.

2.2 Clustering

A consequence of using the EDP prior on the regression coefficients is that subjects cluster based on their regression parameters (that is, for some , ), and within these clusters, will form sub-clusters based on their covariate parameters . Since includes , the coefficients on the spline basis functions for time, subjects with similar trajectories of their outcomes over time will likely be assigned the same cluster. However, subjects are also clustered by the parameter , which governs variability of outcomes. Thus, it is possible to have clusters with small variability that follow a precise trajectory over time, and it is possible to have clusters whose large variability defines the cluster, or some combination of the two. The total number of clusters depend on the data and the parameters and , where values closer to 0 indicate fewer clusters.

For this paper, we use the term -cluster to indicate clusters based on the parameters . A -cluster denotes a cluster nested within a -cluster and indicates closeness in the covariate space governed by covariate parameters . The -clusters are only meaningful with respect to the -cluster in which it is nested.

While an advantage of the BNP approach is not having to select the number of clusters, this creates added difficulty in summarizing the clusters. We use the strategy employed in Medvedovic and Sivaganesan [31], which employed a distance metric based off of empirical pairwise probabilities of subjects being in the same cluster. To do this, we create a matrix where each element indicates the number of times two corresponding subjects were in the same -cluster over all post burn-in MCMC iterations. From the rows of this matrix, we compute a distance matrix using the supremum norm. We then use Ward’s hierarchical agglomerative clustering method implemented by Murtagh and Legendre [32]. This last step requires choosing a number of clusters, which we choose from the median of the posterior distribution on the number of -clusters. R code for this calculation is provided in the appendix.

3 Computations

Draws from the posterior distribution of all parameters are obtained through Gibbs sampling. We use an extension of algorithm 8 by Neal (2000) [33] accommodating the nested partitioning of the EDP [16] and repeated measurements. Algorithm 8 involves generating sets of auxiliary parameters corresponding to clusters that currently have no members. Broadly, at each iteration we alternate between updating cluster membership for each subject, and then within each cluster we update the parameters . Let denote the cluster membership for the subject, where denotes the -cluster corresponding to and denotes the -cluster nested within corresponding to . Let denote the value of corresponding to the unique value of . Similarly, let denote the value of corresponding to the unique value of within the unique value of Note that if , then . Furthermore, let denote the number of subjects in the unique cluster of and denote the number of subjects in the unique cluster of nested within the unique value of The notation and denote the size of the clusters with the subject removed. Recall that the similar notation with no superscript, , refers to the number of observations for the individual.

The first step of our algorithm updates the value of for every individual. First, remove individual from their current cluster. The probability that an individual is in any given cluster depends on the current values of and , the number of subjects within that cluster, the values of and as well as the observed data. In choosing clusters, there are three possibilities: subjects can be assigned to an existing -cluster within an existing -cluster, a new -cluster within an existing -cluster, or a new -cluster and a new -cluster. An individual is assigned to an existing cluster with probability proportional to:

An individual is assigned to a new -cluster within the existing -cluster with probability proportional to:

An individual is assigned a new -cluster and a new -cluster with probability proportional to:

These probabilities are then normalized to sum to 1.

The notation and refers to parameters from a cluster that currently has no members (also called auxiliary parameters, see [33]). They are generated randomly from the prior base distributions and for and . The notation corresponds to the normal density in equation (2) or the binomial density in equation (3) for continuous and binary, respectively, and corresponds to the normal density from equation (1) evaluated with parameters . Once we calculate these probabilities, we draw cluster membership using a random multinomial distribution. This is done separately for each individual in the cohort.

Once cluster memberships for all individuals have been updated, the within cluster parameters and are updated. To update the regression parameters for the cluster, we consider only individuals with . First, we update the regression variance

using a conjugate draw from an inverse gamma distribution and then update regression parameters

for covariates from a draw with a multivariate normal distribution. Next, update the variance for the spline effects from a random draw from an inverse gamma distribution. Lastly, we update the coefficients for the spline effects from a draw from a multivariate normal distribution. In essence, within each cluster we are fitting separate Bayesian mixed effects models and updating parameters accordingly [34]. Full posterior distributions for updating are in the appendix.

Next, we update covariate parameters . To update , we take subjects with . If the covariate is binary, then the distribution of is assumed Bernoulli and the parameter

is updated from a Beta distribution with parameters

and . If the covariate is continuous then the distribution of is normal and the parameters are updated from conjugate inverse- and normal distributions, available in the appendix.

It remains to update the random intercepts , the variance , , and . The new random intercepts are calculated after taking the residuals from the current fit given covariates and the rest of the current parameter values. The variance is updated through a random draw from an inverse gamma distribution with shape and rate . Finally, we update and . is updated by generating a random value from a mixture of two gamma posteriors as in Escobar and West [6]. is updated through a Metropolis-Hastings step. The updates for these parameters are equivalent to those in Roy et al. [11] who also employ an EDP mixture model. Consult the appendix for expanded details of the MCMC algorithm.

3.1 Predictions

Predicting values for subjects who have observed data (that is, data at time points other than the time point of interest) is straightforward. At every iteration where we seek to make a prediction, each subject is assigned a cluster with corresponding . From this, we can predict from a single draw from a normal distribution given with mean and variance .

For subjects missing outcome data, we must make predictions from their covariates . These subjects may be part of the existing cluster with parameters or may be in an entirely new cluster. If they are part of the existing -cluster, we use the current values from the corresponding parameters for that cluster (i.e., ) and draw the prediction from a normally distribution with mean and variance . If a subject is part of a new cluster that currently has no members, we generate using the base distribution .

The probability that a subject is in the existing -cluster is proportional to:

where the summation iterates through all nested -clusters for the -cluster.

The probability that a subject is in a new -cluster is proportional to:

where , the density integrated over the base measure evaluated at the observed data [16]

. This computation for binary and continuous covariates using our distributional and prior assumptions is shown in the appendix. Since we used conjugate priors, this integration can be done analytically. When non-conjugate priors are used, Monte Carlo integration is an option.

4 Simulations

We used simulation to assess the predictive performance of our longitudinal model with splines and an EDP prior. For each simulated subject, we predicted the outcome at a specific time and compared it to the true value. Let be the subject’s true value at time and let be the prediction of from a given model. We computed the mean absolute prediction error and the mean squared prediction error over all simulated subjects.

We simulated sample sizes of and . Each individual was randomly assigned a minimum of 1 and a maximum of 5 repeated measurements corresponding to time points within the interval

generated randomly from an independent uniform distribution. As before, let

denote the regression parameters and denote the covariate parameters. The true cluster structure had three -clusters. Within each -cluster, there were 3, 2, and 3 nested -clusters. Thus, the total number of unique clusters was 8 while the total number of unique -clusters was 3. The structure of the clustering along with probabilities of being in each cluster are given in Figure 2. Each subject was assigned 20 simulated covariates from distributions whose parameters differed between -clusters. Full data-generating details are available in the appendix and code is available upon request (code for the EDP and DP models are at https://www.github.com/zeldow/EDPlong and https://www.github.com/zeldow/DPlong).

Predictions were made for each subject at and the true value was calculated based on the mean for the -cluster to which the individual belongs (mean function shown in appendix) and the random intercept. We generated 100 datasets and take the mean of and over all simulations, and then calculate

where and are and calculated on the simulated dataset.

To assess the performance of our mixed model with an EDP prior, we compared it to two competitor models: a Bayesian mixed model with a DP prior and a linear mixed model (implemented by the lme4 package [35] in R [36]). For each sample size, we varied the regression variance and the random intercept variance resulting in four simulation scenarios: (1) low variability in and low variability in ; (2) low variability in and high variability in ; (3) high variability in and low variability in ; and (4) high variability in and high variability in . All models were fit using thin plate splines for the time effect. In the appendix, we show results using cubic B-splines as well with 2 knots at and .

; 0.66 0.87 0.89 1.43 1.11 1.85
; 0.82 1.19 1.07 1.93 1.11 1.87
; 0.89 1.46 1.08 2.00 1.23 2.32
; 1.05 1.90 1.17 2.28 1.24 2.37
Table 1: Simulation results for showing mean and errors over 100 datasets for predictions at . indicates the simulated regression variance and indicates the simulated random intercept variance. EDP indicates the longitudinal model with an enriched Dirichlet process prior. DP indicates the longitudinal model with a Dirichlet process prior. ME indicates a mixed effects model fit using the lmer package in R [37]. Fit with penalized thin plate splines with 20 knots.
; 0.60 0.73 0.91 1.49 1.10 1.82
; 0.77 1.06 1.10 2.03 1.11 1.86
; 0.71 0.99 1.04 1.90 1.21 2.27
; 0.85 1.27 1.15 2.25 1.23 2.34
Table 2: Simulation results for showing mean and errors over 100 datasets for predictions at . indicates the simulated regression variance and indicates the simulated random intercept variance. EDP indicates the longitudinal model with an enriched Dirichlet process prior. DP indicates the longitudinal model with a Dirichlet process prior. ME indicates a mixed effects model fit using the lmer package in R [37]. Fit with penalized thin plate splines with 20 knots.
Figure 2: Structure of clustering for simulations with probabilities of being in each cluster. Probabilities for being in a -cluster are conditional on being in the appropriate -cluster. refers to the regression parameters and refers to the covariate parameters.

The results of the simulation study for are shown in Table 1. For all scenarios the EDP model outperformed the DP model and the mixed model in terms of mean and prediction error. The mean error for the EDP model ranged from 0.66 to 1.05. For the DP model, it ranged from 0.89 to 1.17, and for the standard mixed model it ranged from 1.11 to 1.24. The mean errors range from 0.87 to 1.90, 1.43 to 2.28, and 1.85 to 2.37 among the models for the four simulation scenarios, respectively. Given that the data were generated in clusters, it is unsurprising that the two methods implementing clustering provide better predictions than the standard mixed effects model. However, we see that the EDP model yielded more precise prediction than the DP model based on and prediction error.

The results in Table 2 display the results for the simulations with . Again, the EDP model outperforms the DP model which outperforms the mixed model. For the most part, there are no large differences in the relative performance of methods at the two sample sizes. Using cubic B-splines (see Appendix) in lieu of thin plate splines also did not have considerable effect on prediction error. Overall the results for thin-plate splines were slightly improved over those of B-splines, but more research needs to be done and more scenarios examined.

; 0.31 0.16 0.31 0.16 0.28 0.12
; 0.56 0.50 0.56 0.50 0.38 0.23
; 0.34 0.18 0.34 0.18 0.35 0.20
; 0.58 0.52 0.58 0.52 0.52 0.42
Table 3: Simulation results for showing mean and errors over 100 datasets for predictions at . indicates the simulated regression variance and indicates the simulated random intercept variance. EDP indicates the longitudinal model with an enriched Dirichlet process prior. DP indicates the longitudinal model with a Dirichlet process prior. ME indicates a mixed effects model fit using the lmer package in R [37]. Fit with penalized thin plate splines with 20 knots.

Lastly, we performed simulations where all subjects were part of the same cluster so that the linear mixed model was correctly specified and was expected to work best. The results for are included in Table 3. Over 100 simulated datasets, the correctly specified standard mixed model outperformed both the EDP and DP models in almost all scenarios. Results from EDP and DP models showed no difference up to two decimal places. This interesting finding was due to the fact that the EDP model did not split -clusters in subclusters, rendering the difference between the DP and EDP models irrelevant. Overall, we found that the EDP and DP models concentrated around one large -cluster with scattered observations in other -clusters. With , the error for the mixed model ranged from 0.28 to 0.52, while for the DP and EDP models, it ranged from 0.31 to 0.58. The largest difference in favor of the standard mixed model occurred with low regression variance and high random intercept variance ( error: 0.38 versus 0.56; error: 0.23 versus 0.50). Other scenarios showed either no difference or only a modest improvement for the standard mixed model. One possible explanation for this discrepancy is that there is an identifiability problem in the models with DP or EDP priors in which the algorithm has difficulty determining if is smaller and there are many clusters or if is large and there are few clusters. Thus, in this scenario the EDP and DP models split the sample into more clusters than was necessary and prediction suffered accordingly. On the other hand, the reverse scenario with high regression variance and low random intercept variance showed no difference between the three models, indicating that the nonparametric prior performed fine in more likely situations.

5 Data Analysis

Sentinel is an initiative of the US Food and Drug Administration with 17 data partners [38]. Under Sentinel, a distributed database has been established that collects EHR and administrative health plan data to assess safety in approved medical products, particularly drugs and vaccines. As part of a workgroup effort to understand and use laboratory results data in the Sentinel Distributed Database (SDD) [39], Flory et al. [26] used the SDD to calculate incidence rates of diabetes among new initiators of second generation antipsychotics (SGAs), which are known to increase the risk of Type II diabetes mellitis (T2DM) [24, 25]. T2DM is often diagnosed based on elevated levels of hemoglobin A1c (HbA1c), serum glucose, or capillary glucose [2]. In Flory et al. [26] incidence rates for T2DM were computed from two outcomes: (O1) diagnosis codes and dispensement of antidiabetic medication and (O2) diagnosis codes, dispensement of antidiabetic medication as well as an elevated diabetes labs. Lab values were considered elevated if fasting glucose mg/dl, random glucose mg/dl, or HbA1c v%. Including diabetes labs increased the number of T2DM cases, but missingness was differential among the sites analyzed, affecting some sites more than others. In this paper, we extend some of the results of Flory et al. [26] using predictions from our longitudinal EDP model.

We restricted our analysis to site one of Flory et al. [26], which corresponds to a small integrated delivery system. As in that publication, our cohort was restricted to participants at least 21 years of age who had at least 183 days of health plan enrollment prior to initiating a SGA (aripiprazole, olanzapine, quetiapine, and risperidone). We included those who had first dispensement of a SGA between 1 January 2008 and 31 October 2012. Any individuals with evidence of diabetes prior to initiation of the SGA, including diagnosis of diabetes, receipt of an antidiabetic medication, or an elevated diabetes lab, were excluded. Follow-up began at first dispensement of a SGA and continued until discontinuation of insurance, death, occurrence of the outcome, or end of 365 days, whichever came first. The outcome was incident diabetes within 365 days of study, equal to that of outcome O1 above. We also define a new outcome O3, which consists of O1 and predicted elevated lab values.

The motivation for using our EDP longitudinal model for this problem is as follows. Our interest lies in calculating the incidence of diabetes within one year of initiating a SGA, supplementing the outcome with information from recorded lab values. The previous analysis was limited by restricting to lab values within one year of follow-up. However, lab values after one year can be informative as well, particularly those drawn soon after study end. Over 30% of the subjects from site one did not have any lab values recorded between 1 and 365 days of SGA initiation. Subjects with lab values recorded had differential amounts of data recorded within that study window, ranging from 1 to 4 records for HbA1c, 1 to 5 of fasting glucose, and 1 to 115 of random glucose. Lastly, the approach in Flory et al. [26] treats any instance of a lab value exceeding the threshold as part of the outcome even if only one measurement among many exceeded the threshold. Because of this, uncertainty stemming from measurement error was inadequately accounted for. Our model incorporates such uncertainty through the regression variance component as well as the fact that cluster membership changes throughout the algorithm.

We fit EDP longitudinal models for each of three lab values (HbA1c, fasting glucose, and random glucose) separately. Models were fit with the entire history of the subject’s lab values until initiation of an anti-diabetic medication. Our dataset had a total of study participants. Among these, 680 subjects contributed 1,003 observations for HbA1c. For fasting glucose, 2,032 subjects contributed 4,110 observations. For random glucose, 3,013 subjects contributed 21,614 observations. We used 200,000 iterations with 40,000 burn in period. Throughout the 160,000 post burn in iterations, predictions were drawn at 800 evenly spaced iterations. Each subject had predictions made at day 365, unless their study censoring time was prior to that, at which point we made predictions at that censoring time. All predicted values were appended to the original dataset resulting in 800 imputed datasets. Each imputed dataset consists of the original data, including diabetes diagnoses and dispensement of antidiabetics, along with three predicted values for HbA1c, random glucose, and fasting glucose. The outcome O3 was calculated for each imputed dataset. From this, we then calculate the incidence of diabetes and use multiple imputation methods to combine estimates across imputations [27]. Overall, the HbA1c model took 4.7 hours of runtime, the fasting glucose model 22.4 hours, and the random glucose model 63.2 hours.

In total, 89 participants were diagnosed with diabetes through diagnosis codes or dispensement of anti-diabetic medication. The total number of outcomes O3 ranged from 146 to 394 outcomes with a median of 200 throughout the 800 imputations. This resulted in an incidence of 0.059 events per person-year (95% confidence interval: 0.043–0.080). This result is similar to the incidence found in

Flory et al. [26] for site one among those with recorded lab values, except the confidence interval is wider, reflecting greater uncertainty in classification using lab values.

5.1 Clustering

We also examined clustering resulting from our model. There is a multitude of reasons one may be interested in clustering in the present example. First, it can show heterogeneity (or lack thereof) of outcome features among groups of individuals. The cluster itself may be able to predict outcomes. For example, if we know that a certain individual is in a cluster with rising HbA1c values over time, we know that their likelihood of a diabetes diagnosis is increased compared to a group with flat trajectories over time. Further, once we have identified the clustering structure, we can examine the distributions of covariates within cluster and determine covariates that may be affecting the differences among groups.

Recall that we refer to clusters based on regression parameters as -clusters and the nested clusters based on covariates as -clusters. For illustrative purposes, we focus strictly on functional clustering using -clusters. Other applications may have interest in summarizing -clusters as well. Given that within the MCMC algorithm, not only cluster membership but the number of clusters can change, we condense the results into a single point estimate for the posterior cluster structure. For the HbA1c and fasting glucose models, the posterior number of clusters concentrated around two. For random glucose, the posterior number of clusters concentrated around three. All models were initialized to have two -clusters. When we initialized the number of -clusters to 10, results eventually converged to similar answers for each of the outcomes. However, computation time was considerably longer when initialized with a large number of -clusters.

Figure 3: Clustering results for HbA1c model. The model settled on two -clusters which are shown in the figure. The larger cluster has 650 subjects and the smaller cluster has 30 subjects.
Figure 4: Clustering results for fasting glucose model. The model settled on two -clusters. The larger cluster has 1997 subjects and the smaller cluster has 35 subjects.
Figure 5: Clustering results for random glucose model. The model settled on three -clusters with 2563, 419, and 31 subjects.

For both the model with HbA1c as the outcome and the model with fasting glucose as the outcome, our algorithm settled on two distinct clusters as seen in Figure 3 and Figure 4, respectively. For HbA1c, the first cluster contained 650 observations consisting of trajectories that mostly stay within the values 5% and 8%. The remaining 30 observations in the second cluster consisted of highly variable trajectories that had spikes in their values. In the fasting glucose model, the first cluster had 1997 members and consisted of tight trajectories below the threshold of 126 mg/dL, while the second cluster housed the remaining 35 subjects mostly of subjects whose trajectory at some point contains a spike or is somehow indicative of higher variability.

The model with random glucose as the outcome settled on three clusters which can be seen in Figure 5. The largest cluster had 2563 subjects who had relatively flat trajectories with small within-subject variability. The second largest cluster contained 419 subjects who had trajectories with slightly more variability than the first cluster. The third cluster contains 31 subjects with large spikes and characterized by larger variability than the other clusters.

6 Discussion

In this paper, we presented a joint model for a continuous longitudinal outcome and the baseline covariates. The model is partitioned into the product of a linear mixed model for the outcome given the covariates and the marginal distributions for the covariates. The use of the EDP prior in a longitudinal model is an extension of the model developed by Wade et al. [3], which itself is an extension of the DP prior. Through the nested clustering of the EDP prior, where subjects are clustered separately for their regression trajectories and similarity in the covariate space, our model allows for improved prediction over the same model with the usual DP prior. This improvement was demonstrated in simulation scenarios in which the EDP longitudinal model outperformed both a standard mixed model and a longitudinal model with a DP prior when the data generating distribution contained a nested clustering structure. When the simulation scenario was simplified so that there was no underlying cluster structure and the linear mixed model was correctly specified, using the nonparametric EDP prior did not excessively diminish predictive performance. Our model also serves as a functional clustering algorithm, the first to use an EDP prior. In our model setup, the EDP prior is particularly useful because it allows the functional to cluster solely on functional features rather than non-functional components (i.e., closeness in the covariate space).

One limitation of the present model is that it can only incorporate baseline covariates. In many longitudinal settings, covariates may be updated throughout the study. One possibility to incorporate this into our model would be to use the dynamic DP, which allows for distributions to evolve in discrete time [40]. From the current state of the literature, DPs which evolve throughout time are less thoroughly developed and more difficult to implement. The extension of our model to handle time-varying covariates is a topic for future research.

Throughout the paper we made several modeling choices that could be changed or generalized. For example, the value for could depend on so that the mass parameter is written as . This would allow the number of subclusters to differ depending on the value of . Further, we made the assumption that the values of and were independent through the fact that . This assumption simplifies calculations but can be relaxed if needed. These two changes were discussed in Wade et al. [3]. Lastly, we focused on continuous outcomes, but our methods can be extended to more general settings such as binary or count outcomes or different link functions with various additional computational challenges (non-conjugacy for one). Bayesian computations for generalized linear mixed models are provided in Zhao et al. [41] and references therein.

We demonstrated our model with data from the Sentinel Distributed Database, where we used predicted lab measurements to augment incidence rates of diabetes among subjects initiating certain anti-psychotics. Our incidence rates were similar to those found in a previous paper on the same study population [26], but our estimates gave wider confidence intervals, reflecting greater uncertainty about the incorporation of labs as part of the diabetes diagnosis. Our model is well-suited for other applications as well, such as with data arising from studies using wearable devices or studies of symptoms of chronic conditions with interest in detecting patterns among patients.


The authors gratefully acknowledge Christine Lu, Joshua Gagne, Azadeh Shoaibi, Lisa Herrinton, Mano Selvan, Xingmei Wang, and Elisabetta Patorno for their work developing the research questions, designing the original study, and creating the analytic database. This project was supported in part by Task Order HHSF22301012T-0008 under Master Agreement HHSF2232009100061 from the US Food and Drug Administration (FDA). The FDA approved the study protocol including statistical analysis plan, and reviewed and approved this manuscript. Coauthors from the FDA participated in the results interpretation and in the preparation and decision to submit the manuscript for publication. The FDA had no role in data collection, management, or analysis.


  • of Sciences, Engineering, and Medicine [2017] National Academies of Sciences, Engineering, and Medicine. Refining the Concept of Scientific Inference When Working with Big Data: Proceedings of a Workshop. National Academies Press, 2017.
  • Association [2014] American Diabetes Association. Diagnosis and classification of diabetes mellitus. Diabetes care, 37(Supplement 1):S81–S90, 2014.
  • Wade et al. [2011] Sara Wade, Silvia Mongelluzzo, and Sonia Petrone. An enriched conjugate prior for bayesian nonparametric inference. Bayesian Analysis, 6(3):359–385, 2011.
  • Ferguson [1973] Thomas S Ferguson. A bayesian analysis of some nonparametric problems. The annals of statistics, pages 209–230, 1973.
  • Ferguson [1983] Thomas S Ferguson. Bayesian density estimation by mixtures of normal distributions. Recent advances in statistics, 24(1983):287–302, 1983.
  • Escobar and West [1995] Michael D Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the american statistical association, 90(430):577–588, 1995.
  • Teh et al. [2004] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Sharing clusters among related groups: Hierarchical dirichlet processes. In NIPS, pages 1385–1392, 2004.
  • Hanson and Johnson [2004] Timothy Hanson and Wesley O Johnson. A bayesian semiparametric aft model for interval-censored data. Journal of Computational and Graphical Statistics, 13(2):341–361, 2004.
  • Hannah et al. [2011] Lauren A Hannah, David M Blei, and Warren B Powell. Dirichlet process mixtures of generalized linear models.

    Journal of Machine Learning Research

    , 12(Jun):1923–1953, 2011.
  • Cruz-Mesía et al. [2007] Rolando De la Cruz-Mesía, Fernando A Quintana, and Peter Müller. Semiparametric bayesian classification with longitudinal markers. Journal of the Royal Statistical Society: Series C (Applied Statistics), 56(2):119–137, 2007.
  • Roy et al. [2017] Jason Roy, Kirsten J Lum, Michael J Daniels, Bret Zeldow, Jordan Dworkin, and Vincent Lo Re III. Bayesian nonparametric generative models for causal inference with missing at random covariates. arXiv preprint arXiv:1702.08496, 2017.
  • Shahbaba and Neal [2009] Babak Shahbaba and Radford Neal. Nonlinear models using dirichlet process mixtures. Journal of Machine Learning Research, 10(Aug):1829–1850, 2009.
  • McCullagh [1984] Peter McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.
  • Gelman et al. [2014] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis, volume 2. Chapman & Hall/CRC Boca Raton, FL, USA, 2014.
  • Müller et al. [2015] Peter Müller, Fernando Andrés Quintana, Alejandro Jara, and Tim Hanson. Bayesian nonparametric data analysis. Springer, 2015.
  • Wade et al. [2014] Sara Wade, David B Dunson, Sonia Petrone, and Lorenzo Trippa. Improving prediction from dirichlet process mixtures via enrichment. Journal of Machine Learning Research, 15(1):1041–1071, 2014.
  • Müller and Rosner [1997] Peter Müller and Gary L Rosner. A bayesian population model with hierarchical mixture priors applied to blood count data. Journal of the American Statistical Association, 92(440):1279–1292, 1997.
  • Li et al. [2010] Yisheng Li, Xihong Lin, and Peter Müller. Bayesian inference in semiparametric mixed models for longitudinal data. Biometrics, 66(1):70–78, 2010.
  • Das et al. [2013] Kiranmoy Das, Runze Li, Subhajit Sengupta, and Rongling Wu. A bayesian semiparametric model for bivariate sparse longitudinal data. Statistics in medicine, 32(22):3899–3910, 2013.
  • Quintana et al. [2016] Fernando A Quintana, Wesley O Johnson, L Elaine Waetjen, and Ellen B. Gold. Bayesian nonparametric longitudinal data analysis. Journal of the American Statistical Association, 111(515):1168–1181, 2016.
  • Bigelow and Dunson [2009] Jamie L Bigelow and David B Dunson. Bayesian semiparametric joint models for functional predictors. Journal of the American Statistical Association, 104(485):26–36, 2009.
  • Scarpa and Dunson [2014] Bruno Scarpa and David B Dunson. Enriched stick-breaking processes for functional data. Journal of the American Statistical Association, 109(506):647–660, 2014.
  • Jacques and Preda [2014] Julien Jacques and Cristian Preda. Functional data clustering: a survey. Advances in Data Analysis and Classification, 8(3):231–255, 2014.
  • Newcomer [2005] John W Newcomer. Second-generation (atypical) antipsychotics and metabolic effects. CNS drugs, 19(1):1–93, 2005.
  • De Hert et al. [2012] Marc De Hert, Johan Detraux, Ruud Van Winkel, Weiping Yu, and Christoph U Correll. Metabolic and cardiovascular adverse effects associated with antipsychotic drugs. Nature Reviews Endocrinology, 8(2):114–126, 2012.
  • Flory et al. [2017] James H Flory, Jason Roy, Joshua J Gagne, Kevin Haynes, Lisa Herrinton, Christine Lu, Elisabetta Patorno, Azadeh Shoaibi, and Marsha A Raebel. Missing laboratory results data in electronic health databases: implications for monitoring diabetes risk. Journal of Comparative Effectiveness Research, 6(1):25–32, 2017.
  • Rubin [2004] Donald B Rubin. Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons, 2004.
  • Hastie and Tibshirani [1990] Trevor J Hastie and Robert J Tibshirani. Generalized additive models, volume 43. CRC press, 1990.
  • Crainiceanu et al. [2005] Ciprian M Crainiceanu, David Ruppert, and Matthew P Wand. Bayesian analysis for penalized spline regression using win bugs. Journal of Statistical Software, 14(14), 2005.
  • Ray and Mallick [2006] Shubhankar Ray and Bani Mallick. Functional clustering by bayesian wavelet methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(2):305–332, 2006.
  • Medvedovic and Sivaganesan [2002] Mario Medvedovic and Siva Sivaganesan. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18(9):1194–1206, 2002.
  • Murtagh and Legendre [2014] Fionn Murtagh and Pierre Legendre. Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? Journal of Classification, 31(3):274–295, 2014.
  • Neal [2000] Radford M Neal. Markov chain sampling methods for dirichlet process mixture models. Journal of computational and graphical statistics, 9(2):249–265, 2000.
  • Zeger and Karim [1991] Scott L Zeger and M Rezaul Karim. Generalized linear models with random effects; a gibbs sampling approach. Journal of the American statistical association, 86(413):79–86, 1991.
  • Bates et al. [2014] Douglas Bates, Martin Maechler, Ben Bolker, and Steven Walker. lme4: Linear mixed-effects models using eigen and s4. R package version, 1(7), 2014.
  • R Core Team [2017] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017. URL https://www.R-project.org/.
  • Bates [2005] Douglas Bates. Fitting linear mixed models in r. R news, 5(1):27–30, 2005.
  • [38] Sentinel. https://www.sentinelinitiative.org/. Accessed: 2017-10-13.
  • [39] Marsha A Raebel, Susan Shetterly, Andrea R Paolino, Christine Y Lu, Joshua J Gagne, Kevin Haynes, James Flory, Elisabetta Patorno, David H Smith, Mano Selvan, Lisa J Herrinton, Frank E Harrell Jr, Azadeh Shoaibi, and Jason Roy. Analytic methods for using laboratory test results in active database surveillance. https://www.sentinelinitiative.org/sentinel/methods/analytic-methods-using-laboratory-test-results-active-database-surveillance. Accessed: 2017-10-13.
  • Rodriguez and Ter Horst [2008] Abel Rodriguez and Enrique Ter Horst. Bayesian dynamic density estimation. Bayesian Analysis, 3(2):339–365, 2008.
  • Zhao et al. [2006] Yihua Zhao, John Staudenmayer, Brent A Coull, and Matthew P Wand. General design bayesian generalized linear mixed models. Statistical Science, pages 35–51, 2006.

Appendix: Simulation data-generating distribution

See Figure 2 for details on clustering structure. Each subject has a random number of time points observed drawn from a discrete uniform distribution on , call this . We now randomly draw time points from a uniform distribution on and order them as . For each subject we draw a random intercept from a distribution. The outcome is generated from independent normal distribution (given ) with variance and the mean depending on which -cluster the subject was randomly assigned to. For , the outcome has mean . For , . For , . For all the above represents the randomly generated time points for each subject.

There were 20 covariates were generated as:

Appendix: Computations

The R/C++ code is available at on GitHub. We give some further details on updating some of the parameters in our model below.

MCMC program:

Step 0: Let be the total number of subjects and denote the total number of observations. Initialize all parameter values including , the partitioning variable.

Step 1: Update for .

Let be the number of unique -clusters in .

Step 2: Iterate through .

Restrict to subjects with . Let be the number of subjects in the cluster and be the total number of observations in this cluster. Below , , , , , and will refer to subjects within the given cluster.

Step 2a: Update and :

First, calculate residuals: . We specify priors and Define and . The posteriors (within clusters) are given by and .

Step 2b: Update and

Now, calculate residuals: . Given prior distributions and , define and . The posteriors are given by and ).

Step 2c: Iterate through , where is the number of -clusters nested within the -cluster. Now, we update covariate parameters , further restricting to subjects with :

For binary covariates, the prior is and the posterior is given by .

For continuous covariates, prior: with posteriors and where and .

This marks the end of the within-cluster updates.

Step 3: Update random intercept :

Iterate through . Calculate residuals