1. Introduction
The United States is in the midst of an opioid addiction epidemic. According to estimates by the Centers for Disease Control and Prevention (CDC), 42,000 people died from opioid overdoses in 2016 and 49,000 in 2017. Overdose mortalities specifically from synthetic opioids, such as Fentanyl, have increased exponentially since 1999.^{1}^{1}1https://www.cdc.gov/drugoverdose/data/analysis.html A major cause of this epidemic is overprescription of opioids (for legitimate pain management) by physicians who lack proper prescribing guidelines (Califf et al., 2016; Makary et al., 2017).
One actionable insight for prescribers is characteristics of patients for whom treatment with synthetic opioids, as opposed to natural or semisynthetic opioids, causes a greater risk of adverse outcomes such as longterm use and addiction than for the general population. Toward this end, we study causal treatment effect estimation from observational data under heterogeneity, i.e. the phenomenon of different individuals having different responses to the same treatment. In particular, we focus on the discovery of subgroups of patients (really portions of a feature space) that have enhanced or diminished treatment effects. We aim for the discovered subgroups to be humaninterpretable so that the results can be directly used in prescribing guidelines.
We make use of the MarketScan database of medical claims and pharmaceutical claims. Claims are a form of administrative data that are commonly repurposed for medical studies because they capture the diagnoses, procedures, and prescriptions of patients longitudinally. Inference of causal relationships using observational data is challenging since counterfactual outcomes are not observed and the treated and untreated populations may have underlying differences that affect the outcome. However, modern machine learning techniques provide an avenue to overcome the challenges.
To identify subgroups with different treatment effects, we hypothesize that a latent variable determines the treatment effect of each individual. Moreover, individuals with similar characteristics belong to the same latent subgroup, resulting in similar responses to treatment across the subgroup. Figure 1 is an abstract representation of such a phenomenon.
We propose a Bayesian network to model these subgroups, specifically as a mixture model, along with their corresponding treatment effects. Sparsity is induced in the learned mixture component parameters to improve interpretability. Our approach, which we name the heterogeneous effect mixture model (HEMM), is similar in spirit to causal rule sets for identifying subgroups with enhanced treatment effect (Wang and Rudin, 2017)
but does not require hard partitions or assignments. Moreover, we incorporate nonlinear as well as linear outcome models, which increases the expressiveness of the model to better adjust for confounding without sacrificing the interpretability of the subgroup definitions. We thus benefit from both the interpretability of sparse mixture models and the representation learning capability of neural networks. In contrast, recent works
(Louizos et al., 2017; Shalit et al., 2017; Alaa and van der Schaar, 2017) that use neural networks or nonparametric methods to estimate heterogeneous treatment effects do not identify subgroups of individuals with similar responses. While our motivating application is opioid use, the proposed approach applies to any problem domain requiring the discovery of subgroups with heterogeneous responses to actions. In this spirit, we also validate our method on synthetic data and the Infant Health and Development Program (IHDP) dataset in terms of its heterogeneous effect estimation and subgroup identification performance.With respect to opioids, we provide domain expert interpretation of the enhanced treatment effect subgroup discovered using MarketScan data, i.e. patients at higher risk of adverse outcomes after an initial synthetic opioid prescription. Some characteristics of this subgroup are wellknown and/or reflected in CDC opioid prescribing guidelines (Dowell et al., 2016): chronic pain conditions, psychological comorbidities, heart disease and obesity. The presence of minor injuries and dental/oral conditions in the subgroup can be explained by the common practice of prescribing opioids for postsurgical or intense acute pain. Lastly, some discovered conditions are unexpected, such as skin infections, abscesses, and reproductive disorders.
Overall, our contributions can be summarized as follows:

[label=)]

We propose the HEMM for discovering subgroups with enhanced and diminished treatment effects in a potential outcomes causal inference framework, using sparsity to enhance interpretability.

We extend the HEMM’s outcome model to include neural networks to better adjust for confounding and develop a joint inference procedure for the overall graphical model and the neural networks.

We demonstrate strong performance in estimating heterogeneous effects and identifying subgroups compared to existing approaches.

We apply the methodology to a largescale medical claims dataset and discover actionable patient subgroups at enhanced risk of adverse outcomes with synthetic opioids.
2. Related Work
There is a rich literature of dataoriented research on understanding the patterns and risks of opioid prescribing and addiction in the fields of medicine, medical informatics, and machine learning; some is specifically intended to inform prescribing guidelines. Kim et al. (2016) conduct a randomized controlled trial and Neill and Herlands (2018) analyze spatiotemporal overdose event data, but a large part of the literature works with medical claims data (Parente et al., 2004; Edlund et al., 2007; Jena et al., 2014; Zhang et al., 2017; Brat et al., 2018; Klueh et al., 2018) and similar administrative data (Che et al., 2017) as we do. However, unlike us, none of these works focus on heterogeneous treatment effects.
The identification of subgroups with heterogeneous or enhanced treatment effects has been addressed in the statistics literature by building separate factual and counterfactual outcome models and then regressing the difference of the two using another method, e.g. a decision tree
(Su et al., 2009). This final model can then be deployed to identify subgroups. Within this category of approaches, Lipkovich et al. (2011) propose the subgroup identification based on differential effect search (SIDES) algorithm, Dusseldorp and Van Mechelen (2014) propose the qualitative interaction trees (QUINT) algorithm, and Foster et al. (2011) propose the virtual twins (VT) method. We consider empirical comparisons to these algorithms in the sequel.Wang and Rudin (2017) propose causal rule sets for discovering subgroups with enhanced treatment effect. This is the closest to and an inspiration for our work. That work seeks to learn discrete humaninterpretable rules predictive of enhanced treatment effect and involves optimization by Monte Carlo methods. We consider instead a mixture of experts approach with soft assignment to groups that retains most of the interpretability but allows greater expressiveness and can be optimized via gradient methods. Our outcome model (6), (7) also differs from that of (Wang and Rudin, 2017). Most importantly, we allow nonlinearity in the form of neural networks whereas (Wang and Rudin, 2017) considers only linear models. Our model also has a single term representing the main effect of treatment whereas (Wang and Rudin, 2017) has three such terms: a population average, a subgroup term that is always active, and a subgroup term that is only active under treatment.
Recent papers have proposed estimating heterogeneous/individual treatment effects using neural networks (Louizos et al., 2017; Shalit et al., 2017) or a Bayesian nonparametric method involving Gaussian processes (Alaa and van der Schaar, 2017)
. These methods rely on constructing distributional representations of the factual and counterfactual outcomes that are similar in a statistical sense. While these methods perform well on estimating heterogenous effects, they do not identify subgroups of individuals with similar treatment effects and characteristics and are thus less interpretable. This makes the application of such methods to inform policy decisions more difficult.
3. Heterogeneous Effect Mixture Model
In this section, we propose a generative mixture model for heterogeneous treatment effects. One way to model heterogeneity, and the one in our proposal, is as a finite mixture of components with a different treatment effect model in each component (some enhanced and some diminished). For tractability, we keep the form of the mixtures to be the simplest possible: Gaussiandistributed for continuous covariates and Bernoullidistributed for discrete covariates. We encode a preference for components to involve few covariates through a Laplace prior or a group
prior on the means of the covariates, described in Section 3.3. Section 3.4 presents a model for the outcomes as they depend on treatment, covariates, and mixture membership, including nonlinear dependence on the covariates.3.1. Preliminaries
We adopt the NeymanRubin potential outcomes framework (Rubin, 1974)
for causal inference. Define random variables
representing covariates and as the treatment indicator. The subset of continuousvalued covariates is denotedand the discrete covariates (binaryvalued or binarized) is denoted
. We will sometimes refer to as ‘the treatment’ and as ‘the control.’ Corresponding to the levels of treatment are two potential outcomes and , which are the outcomes under and respectively. These outcomes can be discrete or continuousvalued. We are given an observational dataset of samples in which only one of the outcomes is observed for each individual: if then , and if then .Our interest lies in estimating the conditional average treatment effect (CATE) conditioned on , defined as
In this work, the dependence on is mediated primarily through subgroup membership, i.e. members of the same subgroup have similar treatment effects. We make the standard assumptions that allow CATE to be identifiable from observational data, namely exchangeability conditioned on the available covariates, , positivity of the treatment propensity, for all , and no dependence between individuals i.e. the stable unit treatment value assumption (Rubin, 1986; Hernán and Robins, 2018). The first two assumptions are collectively known as strong ignorability.
For the mixture model proposed in this paper, we additionally define the latent random variable to indicate mixture membership. Both the distribution of covariates and the treatment effect are dependent on as described next.
3.2. Generative Model
The generative model is presented in Figure 2 in plate notation. We first give an overview of the distributions and then go into more detail regarding and in Sections 3.3 and 3.4.

We draw a sample independently for each individual that determines the latent group membership. The prior distribution for is uniform over the groups,
(1) 
Conditioned on the latent group assignment ,

The are drawn i.i.d. from a Gaussian distribution with mean and covariance :
(2) In this paper, we constrain the offdiagonal elements of to be to reduce the number of parameters, although nondiagonal covariances can be easily accommodated.

The are drawn i.i.d. from a multivariate Bernoulli distribution with mean :
(3) We enforce sparsity in in order to improve interpretability. We describe this in detail in Section 3.3.


Conditioned on the covariates , the treatment assignment is drawn from a Bernoulli distribution whose mean is a function of :
This corresponds to a model for treatment propensity. Note from Figure 2 that the generative model assumes that is conditionally independent of given . Under this assumption, it will be seen in Section 4 that inference for the propensity model can be done independently from the other components of the generative model.

Finally, an outcome sample is drawn from a distribution whose mean is a function of the covariates , treatment assignment , and latent group assignment . If is binaryvalued, the distribution is Bernoulli,
whereas if is continuous, the distribution is Gaussian,
where
is the variance. The outcome model is discussed further in Section
3.4.
3.3. Sparse Mixture Components for Interpretability
Without further measures, the mixture component means and learned from data may be dense, making them difficult for a domain expert to interpret. We hypothesize that a large number of learned mean parameters may have small values and that promoting sparsity through appropriate prior distributions can overcome this problem. To this end, we experiment with two different sparsitypromoting priors on the means of discrete covariates. The same priors can be placed on the continuous covariate means but we do not find this necessary in the present work.

Laplace () Prior: We assume that the means follow zeromean Laplace distributions and are independent across mixture components and covariates. The negative loglikelihood is therefore proportional to the norm
(4) where the summation over is restricted to the discrete covariates.

Group Prior: It may further be the case that some covariates are noninformative of group membership, in which case the means should be zero across all groups and follow the group distribution (Marlin et al., 2009), similar to the group lasso (Yuan and Lin, 2006). The corresponding negative loglikelihood is
(5)
3.4. Treatment Outcome Model
We model the enhanced or diminished treatment effect in a subgroup through the following relationships. In the case where
is binary, its mean is equal to the probability of
. We define the latter using the logistic sigmoid function
to be(6) 
where is a function of parametrized by , . The term represents the main effect due to treatment and the coefficient , i.e. the size of the effect, depends on the group membership . The parameters are allowed to be different for and to better account for differing covariate distributions between the two treatment groups, a.k.a. selection bias. In the case of continuous , we replace with the identity function as follows:
(7) 
The simplest choice for function is linear, i.e., two linear functions and . In practice, however, the outcome may have a highly nonlinear dependence on the covariates. To accommodate nonlinear covariate interactions and thus better adjust for confounding, we also allow
to be a nonlinear function. In this paper, we experiment with one and twohiddenlayer feedforward neural networks with ReLU activations. Outcomes under
and are produced by two different heads of the network, following (Shalit et al., 2017; Louizos et al., 2017; Johansson et al., 2016). Even in the nonlinear case, the assignment of an individual to a subgroup is still described by a mixture model and directly interpretable in terms of the original feature representation, thus preserving interpretability of the discovered subgroups.4. Inference
We would like to fit our proposed model to a given observational dataset . Denote by the set of all parameters of the model.
We have considered two approaches: maximizing the joint likelihood , and maximizing the conditional likelihood . The joint and conditional likelihoods can be related as follows:
(8) 
The conditional likelihood can be further expanded as
(9) 
where we have used the conditional independence of and given in the first factor on the righthand side. The resulting first factor as well as the term in (8) depend only on the mixture model (1)–(3), to wit and . The second factor on the righthand side of (9) depends only on the outcome model (6), (7). The remaining term in (8) depends on the propensity model. Since this is the only place where the propensity model appears, its inference is separable from the remainder of the problem, as claimed in Section 3.2. We do not discuss propensity modeling further as it is not the focus of this work.
Although maximizing the joint likelihood (8) results in some closedform expressions and accordingly easier inference of parameters, we have observed in practice that maximizing the conditional likelihood (9) has superior performance in estimating the potential outcomes and treatment effects. Therefore, we pursue this discriminative approach in this work. We do however include the sparsityinducing prior on the parameters discussed in Section 3.3. The full objective function is therefore
(10) 
where controls the strength of the prior.
4.1. Evidence Lower Bound (ELBO) Optimization
Instead of optimizing the conditional loglikelihood in (10) directly using a gradient method, we choose to lower bound the likelihood with a variational approximation, more commonly known as the Evidence Lower Bound (ELBO) (Blei et al., 2017). For any variational distribution over the latent variable , we have
(11) 
using Jensen’s inequality. Now, replacing with and using (9) (and from Figure 2), we obtain
(12) 
We hence substitute (12) in place of in (10) and proceed to maximize the objective function using the Adam gradient method (Kingma and Ba, 2014)
, a variant of stochastic gradient descent that is a popular choice for nonconvex functions like neural networks. The same method is used for both linear and nonlinear
in (6), (7). As noted above, the first factor in (12) depends only on the mixture model parameters in (1)–(3) while the second factor depends only on the outcome model parameters in (6), (7). We enable “weight decay” (Krogh and Hertz, 1992) on the parameters as a form of regularization. For tractability, we compute the ELBO only over a fixedsize minibatch of the data before each parameter update. Additional details on the algorithm and parameter initialization can be found in Appendices A and B in the supplement.We also considered an expectationmaximization (EM) algorithm to maximize (
10) as an alternative to ELBO. Our experience however was that ELBO provided better heterogeneous effect estimates in terms of the metric reported in Section 6.2.5. Dataset Descriptions
We demonstrate the performance of HEMM on a synthetic dataset, the semisynthetic Infant Health and Development Program (IHDP) dataset, and a realworld dataset on opioids. These datasets are described further below.
Synthetic: We take
and sample it from a uniform distribution over
. In order to simulate the selection bias inherent in observational studies, the treatment variable depends on as for and for . The potential outcomes and are also Bernoulli with means given by the functions of shown in Figure 2(a). The figure shows that , i.e. treatment increases the probability of positive outcome. Note that under the conditional exchangeability assumption we have . We model the effect of the confounders by assigning higher probability to the upper triangular region of . This together with the distribution of imply that individuals who are more likely to have positive outcome regardless of treatment (upper triangle) are also more likely to receive treatment (right halfsquare). Lastly, we model the enhanced treatment effect group as a circular region , where . We set and . A total of samples were generated as described above.IHDP (SemiSynthetic): The IHDP dataset has gained popularity in the causal inference literature dealing with heterogenous treatment effects (Alaa and van der
Schaar, 2017; Shalit
et al., 2017; Louizos et al., 2017; Hill, 2011). The original data includes 25 real covariates and comes from a randomized experiment to evaluate the benefit of IHDP on IQ scores of threeyearold children. A selection bias was introduced by removing some of the treated population, thus resulting in 608 control patients and 139 treated (747 total). The outcomes were simulated using the standard nonlinear ‘Response Surface B’ as described in (Hill, 2011).
Total Covariate Dimension  1226  3 Continuous, 
1223 Binary  
ICD9 Diagnostic Codes  1013  Binary 
CPT Procedure Codes  171  Binary 
HandCrafted Comorbidities  41  Binary 
Daily Morphine Equivalent,  3  Continuous 
Total Number of Visits, Age 
Addicted  NotAddicted  Total  
(=1)  (=0)  
Treated (=1)  2060  19983  22043 
Control (=0)  7269  176156  183425 
Total  9329  196139  205468 
Opioid: We sampled a subpopulation consisting of healthcare claims for five million patients from the MarketScan Commercial claims database. These claims describe patients’ medical histories, including both inpatient admissions and outpatient services. Diagnoses, procedures, prescriptions and dosages are recorded. We follow the cohort selection procedure outlined in Zhang et al. (2017) to filter patients based on several criteria.
For each patient in our final cohort, we create a feature vector that includes basic demographic information such as age, gender and geographic region. We also included a predefined set of procedures along with diagnostic codes which are associated with opioids and/or addiction, based on input from a physician.
We label all patients who received addiction diagnoses and patients who continued use of opioids for more than one year after the initial prescription as belonging to the positive (adverse outcome) class. Patients who discontinued opioid use within one year of initial treatment were labeled as negative. We use the terms “addicted” and “not addicted” as shorthand for these outcomes. Patients prescribed natural or semisynthetic opioids are considered the control group, whereas patients administered synthetic opioids are considered the treated group. Table 1 summarizes the basic statistics of this dataset.
6. Experimental Results
We evaluated the proposed HEMM quantitatively on two tasks, prediction of heterogeneous treatment effects and identification of subgroups with enhanced or reduced effect. These results are discussed in Sections 6.2 and 6.3 respectively in comparison to existing methods, focusing on those that also estimate heterogeneous effects in an interpretable manner. Methods used in the comparison are described in Section 6.1 and parameter selection details are in Appendix B. In Section 6.4, we provide qualitative results for the Opioid dataset on the features the model discovers as characteristics of “atrisk” individuals, i.e. those in enhanced effect subgroups.
6.1. Algorithms in Comparison
We have considered Virtual Twins (VT) (Foster et al., 2011), QUINT (Dusseldorp and Van Mechelen, 2014), and SIDES (Lipkovich et al., 2011)
among methods that identify subgroups with different treatment effects. We implemented two versions of VT in which the treatment effect is modeled by a decision tree classifier (VTC) or regressor (VTR). For VTC, to better represent the continuousvalued treatment effect (which is a difference in probabilities even if
is binary), we use a collection of decision tree classifiers obtained by applying different thresholds to the treatment effect.For QUINT and SIDES, we utilized the standard R implementations and performed extensive hyperparameter tuning. However both QUINT and SIDES failed to recover any subgroups on
Synthetic and Opioid and we thus did not consider them further. For QUINT, the likely reason is that its assumption of a subgroup with diminished effect is not always met, whereas for SIDES, there may be a numerical issue in how it discretizes continuous covariates.In terms of methods that only predict heterogeneous effects and do not identify subgroups, we also compare our method with some common approaches in Table 2
. Here Linear1 corresponds to a single ordinary least squares (for continuous outcomes) or logistic regression (for binary outcomes) for both factual and counterfactual outcomes. In the case of Linear2, we fit two separate linear models to the control and treated populations to better accommodate selection bias and confounding. The other baselines,
NN, GP, and CFRF are nonparametric versions of this approach where the estimators of the factual and counterfactual outcomes arenearest neighbours, Gaussian processes with a linear kernel and Random Forests respectively.
6.2. Heterogeneous Effect Estimation
We first evaluate our performance on estimation of the CATE (). A popular metric for this evaluation is the Precision in Estimating Heterogenous Effects (PEHE). The PEHE is defined as
Here and are the estimated potential outcomes under treatment and control, respectively.
Table 2 compares the performance of HEMM against the methods described in Section 6.1 on both insample PEHE (corresponding to a retrospective study) computed on the training data, and outofsample PEHE computed on heldout test data. HEMMMLP and HEMMLin refer to the proposed approach with in (6), (7
) as a multilayer perceptron and linear function respectively to model the effect of confounders on the outcome.
HEMM consistently outperforms these standard causal inference baselines. GP and Linear2 perform close to HEMM on Synthetic. We noticed that when a larger sample of data points is available to VTR, its performance increases dramatically. However, its performance drops in higherdimensional settings as in the case of IHDP and Opioid; this is expected with methods involving nonparametric regression.
Synthetic  IHDP  

Insample  OutSample  Insample  OutSample  
HEMMMLP  
HEMMLin  
Linear1  
Linear2  
NN  
GP  
CFRF  
VTR 
values in estimating heterogeneous effects. Error represents 95% confidence interval of multiple Monte Carlo initializations.
6.3. Subgroup Identification
Synthetic: In the case of synthetic data, the subgroup with enhanced treatment effect is known. We first visualize the performance of HEMM and VT in identifying this subgroup. For HEMMLin, Figure 2(b) shows the estimated probability of belonging to the enhanced effect subgroup evaluated on the test set. The true circular region is recovered well. Figures 2(c) and 2(d) plot the VTC and VTR predictions of CATE on the test set. For VTC, the prediction represents an average over the collection of decision tree classifiers, while for VTR, it is simply the output of the decision tree regressor. While the difficulty in reproducing the circular shape is expected for decision trees, the enhanced effect estimates are also less uniform than in Figure 2(b).
We also evaluate subgroup identification more quantitatively by treating it as a problem of classifying whether or not points in the test set belong to the enhanced effect subgroup. ROC curves may then be plotted as in Figure 3(a). For HEMM, the ROC is traced by varying the threshold on the probability of being in the enhanced effect subgroup (shown in Figure 2(b)). Similarly for VTC and VTR, the threshold on the CATE estimates (Figures 2(c) and 2(d)) is varied. HEMM has higher ROCs than VT on this example, in line with Figures 2(b)–2(d). There is little difference between HEMMLin and HEMMMLP since the dependence on the covariates in Figure 2(a) is simple and complex adjustment is not needed.
IHDP and Opioid: For the other two datasets, we conduct a relative comparison with VT since we lack data on ground truth subgroups. The evaluation involves two steps. First, we assign individuals to an enhanced effect subgroup of varying size. (The same procedure can be used for a diminished effect subgroup but we omit the results due to space.) For HEMM, we choose the subgroup with the largest main effect and vary the threshold applied to the corresponding membership probability returned by the model. For VTC and VTR, we vary the threshold applied to the CATE estimates, either the composite estimate of the decision tree classifiers or the regressor estimate, the same quantities as for the synthetic data. In the second step, we build a propensity score model (an estimator of treatment propensity ) to estimate the average treatment effect (ATE) conditioned on belonging to the enhanced treatment effect subgroup defined in the first step. For the propensity score model , we fit a random forest, for which parameter tuning is performed on the dev set. We then use the inverse probability of treatment weighting (IPTW) estimator (Imbens, 2004) of the ATE within a subgroup as follows:
(13) 
IPTW estimation is used for both HEMM and VTdefined subgroups to be consistent.
Figures 3(b) and 3(c) plot subgroup ATE versus subgroup size (as a fraction of the population) as the threshold for subgroup assignment is varied. When the subgroup is the entire population at size , all curves meet at the population ATE (dashed line). Since we have selected the enhanced effect subgroup, the curves are then expected to increase as the subgroup is restricted to individuals with larger treatment effects. The fact that this increase is nearly monotonic for HEMMMLP is evidence for the validity of the discovered subgroup, since the IPTW estimator used here is an independent check on the treatment effect model (6), (7) used by HEMM. Compared to VT, the subgroups identified by HEMMMLP have higher ATE. This suggests that for a given subgroup size, HEMMMLP is better at grouping together individuals with more enhanced effects. HEMMLin on the other hand displays contrasting performances. On IHDP in Figure 3(b), the estimated ATE actually decreases for subgroup sizes less than , likely due to the inadequacy of a linear model to adjust for confounding and accurately estimate CATE. In Figure 3(c) however, the ATE does increase monotonically and faster than for VT.
6.4. Interpretation of the Opioid Enhanced Effect Subgroup
Musculoskeletal System  Nervous System 

1.0 spinal curve (kyphosis, lordosis, scoliosis)  1.0 extrapyramidal diseases/movmt. disorders 
1.0 ankle fracture  1.0 idiopathic peripheral neuropathies 
1.0 sprains/strains of hand and wrist  1.0 headaches 
Integumentary System  Endocrine System 
1.0 cellulitis and abscess of finger and toe  .70 simple and unspecified goiter 
1.0 local skin infections  .67 other endocrine disorders 
1.0 psoriasis and similar disorders  .65 thyrotoxicosis with or without goiter 
Reproductive System  Digestive and Excretory Systems 
1.0 female infertility  .71 benign neoplasm of intestinal tract 
.82 testicular dysfunction  .69 inguinal hernia 
.82 disorders of penis  .58 diverticulitis 
Circulatory System  Immune System 
1.0 hypertensive heart disease  .56 immunization 
.79 other disorders of circulatory system  .54 strep throat and scarlet fever 
.70 cardiac dysrhythmias  .52 bacterial infections in other conditions 
Nutrition  Visual System 
1.0 BMI  .87 keratitis 
1.0 bcomplex deficiency  .60 other disorders of eye (epi)scleritis 
.71 disorder of electrolyte/acidbase balance  .57 visual disturbances 
Auditory System  Psychology 
.53 vertiginous syndrome/vestibular disorder  1.0 suspected mental health condition 
.50 otitis media/eustachian tube disorders  .58 adjustment reaction 
.44 disorders of pinna and mastoid process  .55 nondependent abuse of drugs 
Digestive System (upper/oral)  Respiratory System 
.77 hernia, abdominal cavity w/o obstruction  1.0 other diseases of respiratory tract 
.77 dentofacial anomalies of jaw  .74 deviated nasal septum 
.76 diseases of oral soft tissues  .69 influenza 
We now turn our focus to the motivating application of opioids and analyze key characteristics of the enhanced effect subgroup, i.e. those patients at greater risk of adverse outcomes when treated initially with synthetic opioids. To interpret these features, we collaborated with a subject matter expert (SME) with a PhD in cognitive neuroscience and a clinical research emphasis in chronic pain conditions and treatments, including opioids. Table 3 shows the top features of the enhanced effect subgroup as identified by HEMMMLP. The features are organized by general bodily system and sorted in descending order of prevalence relative to the other subgroup (see table caption); the selection of 3 features in each system was arbitrary and chosen primarily for simplicity and space constraints.
Patients with a history of chronic conditions in general, as well as chronic pain conditions more specifically, are at an increased risk for addiction. Many of the chronic conditions in Table 3, e.g. heart disease (circulatory system), psoriasis (integumentary system), and BMI/obesity (nutrition) also appear in the CDC opioid prescribing guidelines (Dowell et al., 2016) or have extensive literature linking them to increased risk for longterm pain, either intrinsic to the condition or due to needed medical procedures that are more likely to expose patients to opioids (Glanz et al., 2018). For example, numerous papers show a link between increased bodymass index (BMI) and increased pain intensity and duration (with anticorrelations between BMI and pain recovery) (Okifuji and Hare, 2015), and obesity has also been associated with higher initial opioid doses (Kobus et al., 2012). Additionally, the chronic nutritional deficiencies and imbalances shown in Table 3 have been linked to acute but intense muscle spasms as well as peripheral polyneuropathies and paresthesias (see e.g. (Mostacci et al., 2018)), pain disorders which also show up as increased risk factors (nervous system).
Regarding chronic pain conditions, patients with a history of abnormal spinal curvatures (which can produce low back pain and neuropathy), idiopathic peripheral neuropathies, and headaches (musculoskeletal and nervous systems) are at increased risk for addiction. These are not surprising as they nonopioid therapies such as nonsteroidal antiinflammatory drugs (NSAIDs), steroids, or common procedures and surgeries (e.g. joint replacement or local injections) (Crofford, 2013). They involve pain that may be severely intense or debilitating, sometimes unpredictable or idiopathic, and often nonspecific or diffuse (pain is referred, not well localized, or difficult to describe) and thus require a cocktail of prescription medications or invasive procedures, increasing the likelihood of exposure to opioids (Volkow and McLellan, 2016; Rosenblum et al., 2008; Patil et al., 2015). The intensity, duration, and nonspecificity of pain may also be a reason why digestive excretory, digestive (upper/oral), and reproductive conditions also show up as moderately strong features in Table 3. These diagnoses may either directly result in acute or chronic nonsomatic visceral pain (hernia, diverticulitis) (Davis, 2012) or relate to conditions with chronic visceral pain (e.g. female infertility may be secondary to endometriosis or pelvic inflammatory diseases). Opioids are widely utilized for such visceral pain conditions (Gebhart et al., 2000), although often in short duration due to adverse events. Notably, some minor injuries causing acute or procedurerelated pain (ankle fractures or hand/wrist sprains – musculoskeletal system) also feature prominently, as do features related to the mouth (dentofacial abnormalities and soft oral tissues – digestive system (upper/oral)). This is also expected given that opioids are most commonly prescribed for postsurgical or intense acute pain (Harbaugh et al., 2018; Brummett et al., 2017); regarding dental procedures specifically, previous research suggests that a substantial proportion of adults are first exposed to opioids through dental procedures (Schroeder et al., ress).
Another expected finding was that individuals with psychological comorbidities (mental health conditions) also have high probability of belonging to the enhanced response group, with individuals participating in psychotherapy having reduced risk of addiction (psychotherapy was among the lowestscored features and hence not shown in the table). Substantial research has already linked mental illness with opioid misuse (Glanz et al., 2018; Volkow and McLellan, 2016; Rosenblum et al., 2008). Although adjustment reaction (psychology) appears with a lower score in Table 3, it encompasses reactions to trauma, episodic emotional disorders, and chronic anxiety that have been shown to be comorbid with many of the chronic diagnoses and pain conditions discussed above (Glanz et al., 2018; Volkow and McLellan, 2016; Rosenblum et al., 2008) and have also been linked with increased opioid dosages (Helmerhorst et al., 2014). Similarly, nondependent abuse of drugs also has a lower score but it is well known that opioid dependence and addiction are associated with polysubstance use and abuse (Soyka, 2015; Pergolizzi et al., 2012).
A small subset of features with high scores were more challenging to interpret, likely because they pick up on subtle relationships between existing clinical variables or hidden variables. For example, electrolyte imbalances (nutrition) occur often and in many situations, making it difficult to speculate on why they were a top feature for risk propensity. Likewise, skin infections and abscesses (integumentary system) also are common and nonspecific. However, it is possible that these features are secondary symptoms of important risk features. For instance, electrolyte imbalances are commonly seen in alcoholism and substance use disorders (Palmer and Clegg, 2017), as are skin infections and abscesses (Springer et al., 2018).
Based on this initial overview, the SME judged the majority of the identified features to be scientifically meaningful with potential clinical utility for future prescribing guidelines. In summary, acute or chronic conditions that put patients at increased risk for initial exposure to opioids, via acute procedures or comorbid prolonged intense pain, both increased a patient’s addiction likelihood. Comorbid mental health disorders, particularly those related to stress or trauma and substance abuse also put individuals at greater risk for future opioid addiction.
7. Conclusion
We presented a Heterogeneous Effect Mixture Model (HEMM) for inferring subgroups of individuals that exhibit an enhanced effect caused by treatment. Our work contrasts with existing heterogeneous effect estimation methods as we learn interpretable subgroups using soft assignments while retaining expressiveness in the model. The latter is attributed to the capabilities of neural networks, used here to adjust for confounding. We evaluated the performance of HEMM on a synthetic dataset, the semisynthetic IHDP dataset, and a large realworld healthcare claims dataset (Opioid).
We additionally conducted qualitative analysis of the results obtained by HEMM on the Opioid dataset. Some of our findings are in accordance with existing CDC opioid prescribing guidelines. However, our interpretations are preliminary and future analyses are needed to better understand these features and their relationships. A longerterm goal is to translate such insights into a policy white paper on datadriven, causallyvalid opioid prescribing guidelines.
Acknowledgments
This work was conducted under the auspices of the IBM Science for Social Good initiative. The authors thank ChingHua Chen, Fredrik Johansson, Aleksandra Mojsilović, Peder Olsen, Jinghe Zhang, and colleagues at IBM Watson Health for assistance.
Appendix A Parameter Initialization
Gradient based optimization strategies can be subject to local minima and hence their performance is dependent on parameter initialization. To initialize the model with ‘good’ values ensuring better convergence, we set the mean for each component, and , equal to the sample mean of the entire data, i.e. , , and the covariance of every component to , where is a vector consisting of the sample variances of the continuous covariates . We pretrain the parameters in the outcome model (6) using standard crossentropy loss without the subgroup and treatment assignment term . Finally, we initialize the treatment coefficients, , randomly with positive values for all .
Appendix B Model Fitting
Our implementation of HEMM has two free parameters, the number of groups and the strength of the sparsity prior, .
For the Opioid dataset we divide the dataset into 3 parts with 70% as train for model training, 10% as dev for parameter tuning, and 20% as test
for evaluation. The partition is done so that the joint distribution of outcome and treatment is approximately the same in the 3 sets:
. For IHDP we use the stnadard 80/20 train/test split as is popular in literature.We perform a grid search over and . For each pair, we perform 5 runs with randomly initialized values of the treatment coefficients . All other parameters are initialized as described in Section A. For Adam, we use a step size of and minibatch sizes of 10, 20 & 1000 for Synthetic, IHDP, and Opioid respectively, and stop parameter update if the ELBO on the dev
is lower at the end of an epoch. We also search over the space of models where the outcome and counterfactual have the same or different parameterisation based on treatment assignment. From all the
pairs and random initializations above, we select the model that has the best performance on the dev set in predicting the outcome , in terms of the Area Under the Receiver Operating Characteristic (AUROC).For the Synthetic dataset, we simply set and . In this case there is no need for a dev set and the data is split 50/50 between train and test.
Figure 5 shows the percentage of parameters equal to zero (labeled “Sparsity” and “Group Sparsity” in the plots) across all groups for different values of and averaged over random initializations. It is clear that larger values of the prior strength parameter result in sparser solutions.
References
 (1)
 Alaa and van der Schaar (2017) Ahmed M Alaa and Mihaela van der Schaar. 2017. Bayesian inference of individualized treatment effects using multitask gaussian processes. In Advances in Neural Information Processing Systems. 3424–3432.
 Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. 2017. Variational inference: A review for statisticians. J. Amer. Statist. Assoc. 112, 518 (2017), 859–877.
 Brat et al. (2018) Gabriel A. Brat, Denis Agniel, Andrew Beam, Brian Yorkgitis, Mark Bicket, Mark Homer, Kathe P. Fox, Daniel B. Knecht, Cheryl N McMahillWalraven, Nathan Palmer, and Isaac Kohane. 2018. Postsurgical Prescriptions for Opioid Naive Patients and Association with Overdose and Misuse: Retrospective Cohort Study. BMJ 360 (Jan. 2018), j5790.
 Brummett et al. (2017) Chad M. Brummett, Jennifer F. Waljee, Jenna Goesling, Stephanie Moser, Paul Lin, Michael J. Englesbe, Amy S. B. Bohnert, Sachin Kheterpal, and Brahmajee K. Nallamothu. 2017. New Persistent Opioid Use After Minor and Major Surgical Procedures in US Adults. JAMA Surg. 152, 6 (2017), e170504.
 Califf et al. (2016) Robert M Califf, Janet Woodcock, and Stephen Ostroff. 2016. A proactive response to prescription opioid abuse. New England J. Med. 374, 15 (2016), 1480–1485.
 Che et al. (2017) Zhengping Che, Jennifer St. Sauver, Hongfang Liu, and Yan Liu. 2017. Deep Learning Solutions for Classifying Patients on Opioid Use. In AMIA Annu. Symp. Proc. 525–534.
 Crofford (2013) Leslie J. Crofford. 2013. Use of NSAIDs in Treating Patients with Arthritis. Arthritis Res. Ther. 15, Suppl. 3 (July 2013), S2.
 Davis (2012) Mellar P. Davis. 2012. Drug Management of Visceral Pain: Concepts from Basic Research. Pain Res. Treat. 2012 (2012), 265605.
 Dowell et al. (2016) Deborah Dowell, Tamara M Haegerich, and Roger Chou. 2016. CDC guideline for prescribing opioids for chronic pain—United States, 2016. JAMA 315, 15 (2016), 1624–1645.
 Dusseldorp and Van Mechelen (2014) Elise Dusseldorp and Iven Van Mechelen. 2014. Qualitative Interaction Trees: A Tool to Identify Qualitative Treatment–Subgroup Interactions. Stat. Med. 33, 2 (Jan. 2014), 219–237.
 Edlund et al. (2007) Mark J. Edlund, Diane Steffick, Teresa Hudson, Katherine M. Harris, and Mark Sullivan. 2007. Risk Factors for Clinically Recognized Opioid Abuse and Dependence among Veterans using Opioids for Chronic NonCancer Pain. Pain 129, 3 (June 2007), 355–362.
 Foster et al. (2011) Jared C Foster, Jeremy MG Taylor, and Stephen J Ruberg. 2011. Subgroup identification from randomized clinical trial data. Stat. Med. 30, 24 (2011), 2867–2880.
 Gebhart et al. (2000) G. F. Gebhart, Xin Su, Shailen Joshi, N. Ozaki, and J. N. Sengupta. 2000. Peripheral Opioid Modulation of Visceral Pain. Ann. N. Y. Acad. Sci. 909, 1 (Jan. 2000), 41–50.
 Glanz et al. (2018) Jason M. Glanz, Komal J. Narwaney, Shane R. Mueller, Edward M. Gardner, Susan L. Calcaterra, Stanley Xu, Kristin Breslin, and Ingrid A. Binswanger. 2018. Prediction Model for TwoYear Risk of Opioid Overdose Among Patients Prescribed Chronic Opioid Therapy. J. Gen. Intern. Med. 33, 10 (Oct. 2018), 1646–1653.
 Harbaugh et al. (2018) Calista M. Harbaugh, Jay S. Lee, Hsou Mei Hu, Sean Esteban McCabe, Terri VoepelLewis, Michael J. Englesbe, Chad M. Brummett, and Jennifer F. Waljee. 2018. Persistent Opioid Use Among Pediatric Patients After Surgery. Pediatrics 144, 1 (Jan. 2018), e20172439.
 Helmerhorst et al. (2014) G. T. T. Helmerhorst, A.M. Vranceanu, M. Vrahas, M. Smith, and D. Ring. 2014. Risk Factors for Continued Opioid Use One to Two Months After Surgery for Musculoskeletal Trauma. J. Bone & Joint Surgery 96, 6 (March 2014), 495–499.
 Hernán and Robins (2018) Miguel A. Hernán and James M. Robins. 2018. Causal Inference. Chapman & Hall/CRC, Boca Raton, FL, USA. Forthcoming.
 Hill (2011) Jennifer L Hill. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 1 (2011), 217–240.
 Imbens (2004) Guido W Imbens. 2004. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and statistics 86, 1 (2004), 4–29.
 Jena et al. (2014) Anupam B Jena, Dana Goldman, Leonard D Schaeffer, Lesley Weaver, and Pinar KaracaMandic. 2014. Opioid Prescribing by Multiple Providers in Medicare: Retrospective Observational Study of Insurance Claims. BMJ 348 (Feb. 2014), g1393.
 Johansson et al. (2016) Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In Proc. Int. Conf. Mach. Learn. 3020–3029.
 Kim et al. (2016) Nayoung Kim, Jonas L. Matzon, Jack Abboudi, Christopher Jones, William Kirkpatrick, Charles F. Leinberry, Frederic E. Liss, Kevin F. Lutsky, Mark L. Wang, Mitchell Maltenfort, and Asif M. Ilyas. 2016. A Prospective Evaluation of Opioid Utilization After UpperExtremity Surgical Procedures: Identifying Consumption Patterns and Determining Prescribing Guidelines. J. Bone Joint Surg. 98, 20 (Oct. 2016), e89.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Klueh et al. (2018) Michael P. Klueh, Hsou M. Hu, Ryan A. Howard, Joceline V. Vu, Calista M. Harbaugh, Pooja A. Lagisetty, Chad M. Brummett, Michael J. Englesbe, Jennifer F. Waljee, and Jay S. Lee. 2018. Transitions of Care for Postoperative Opioid Prescribing in Previously OpioidNaïve Patients in the USA: a Retrospective Review. J. Gen. Intern. Med. in press (2018).
 Kobus et al. (2012) A. M. Kobus, D. H. Smith, B. J. Morasco, E. S. Johnson, X. Yang, A. F. Petrik, and R. A. Deyo. 2012. Correlates of higherdose opioid medication use for low back pain in primary care. J. Pain 13, 11 (Nov. 2012), 1131–1138.
 Krogh and Hertz (1992) Anders Krogh and John A Hertz. 1992. A simple weight decay can improve generalization. In Advances in neural information processing systems. 950–957.
 Lipkovich et al. (2011) Ilya Lipkovich, Alex Dmitrienko, Jonathan Denne, and Gregory Enas. 2011. Subgroup identification based on differential effect search —– a recursive partitioning method for establishing response to treatment in patient subpopulations. Stat. Med. 30, 21 (2011), 2601–2621.
 Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latentvariable models. In Adv. Neur. Inf. Process. Syst. 6446–6456.
 Makary et al. (2017) Martin A Makary, Heidi N Overton, and Peiqi Wang. 2017. Overprescribing is major contributor to opioid crisis. BMJ 359 (2017), j4792.
 Marlin et al. (2009) Benjamin M. Marlin, Mark Schmidt, , and Kevin P. Murphy. 2009. Group Sparse Priors for Covariance Estimation. In Proc. Conf. Uncertainty Artif. Intell. Montreal, Canada, 383–392.
 Mostacci et al. (2018) Barbara Mostacci, Rocco Liguori, and Arrigo F. Cicero. 2018. Nutraceutical Approach to Peripheral Neuropathies: Evidence from Clinical Trials. Current Drug Metabolism 19, 5 (2018), 460–468.
 Neill and Herlands (2018) Daniel B. Neill and William Herlands. 2018. Machine Learning for Drug Overdose Surveillance. J. Tech. Human Serv. 36, 1 (Jan. 2018), 8–14.
 Okifuji and Hare (2015) Akiko Okifuji and Bradford D. Hare. 2015. The association between chronic pain and obesity. J. Pain Res. 8 (2015), 399–408.
 Palmer and Clegg (2017) Biff F. Palmer and Deborah J. Clegg. 2017. Electrolyte disturbances in patients with chronic alcoholuse disorder. New England J. Med. 377, 14 (2017), 1368–1377.
 Parente et al. (2004) Stephen T. Parente, Susan S. Kim, Michael D. Finch, Lisa A. Schloff, Thomas S. Rector, Raafat Seifeldin, and J. David Haddox. 2004. Identifying Controlled Substance Patterns of Utilization Requiring Evaluation Using Administrative Claims Data. Am. J. Managed Care 10, 11 (Nov. 2004), 783–790.
 Patil et al. (2015) Pravinkumar R. Patil, Jonathan Wolfe, Qayyim Said, Jeremy Thomas, and Bradley C. Martin. 2015. Opioid Use in the Management of Diabetic Peripheral Neuropathy (DPN) in a Large Commercially Insured Population. Clin. J. Pain 31, 5 (May 2015), 414–424.
 Pergolizzi et al. (2012) Joseph V. Pergolizzi, Christopher Gharibo, Steven Passik, Sumedha Labhsetwar, Robert Taylor, Jason S. Pergolizzi, and Gerhard MüllerSchwefe. 2012. Dynamic risk factors in the misuse of opioid analgesics. J. Psychosomatic Res. 72, 6 (2012), 443–451.
 Rosenblum et al. (2008) A. Rosenblum, L. A. Marsch, H. Joseph, and R. K. Portenoy. 2008. Opioids and the treatment of chronic pain: controversies, current status, and future directions. Exp. Clin. Psychopharmacol. 16, 5 (Oct. 2008), 405–416.
 Rubin (1974) Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. J. Edu. Psych. 66, 5 (1974), 688–701.
 Rubin (1986) Donald B. Rubin. 1986. Which Ifs Have Causal Answers. J. Am. Stat. Assoc. 81, 396 (1986), 961–962.
 Schroeder et al. (ress) Alan R. Schroeder, Melody Dehghan, Thomas B. Newman, Jason P. Bentley, and K. T. Park. in press. Association of Opioid Prescriptions From Dental Clinicians for US Adolescents and Young Adults With Subsequent Opioid Use and Abuse. JAMA Intern Med. (in press).
 Shalit et al. (2017) Uri Shalit, Fredrik D. Johansson, and David Sontag. 2017. Estimating individual treatment effect: Generalization bounds and algorithms. In Proc. Int. Conf. Mach. Learn. Sydney, Australia, 3076–3085.
 Soyka (2015) M. Soyka. 2015. Alcohol Use Disorders in Opioid Maintenance Therapy: Prevalence, Clinical Correlates and Treatment. Eur. Addict. Res. 21, 2 (Jan. 2015), 78–87.
 Springer et al. (2018) Sandra A Springer, P Todd Korthuis, and Carlos Del Rio. 2018. Integrating treatment at the intersection of opioid use disorder and infectious disease epidemics in medical settings. Ann Intern Med 169, 5 (2018), 335–336.
 Su et al. (2009) Xiaogang Su, ChihLing Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. 2009. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research 10, Feb (2009), 141–158.
 Volkow and McLellan (2016) Nora D. Volkow and A. Thomas McLellan. 2016. Opioid Abuse in Chronic Pain — Misconceptions and Mitigation Strategies. N. Engl. J. Med. 374, 13 (2016), 1253–1263.
 Wang and Rudin (2017) Tong Wang and Cynthia Rudin. 2017. Causal Rule Sets for Identifying Subgroups with Enhanced Treatment Effect. arXiv:1710.05426.
 Yuan and Lin (2006) Ming Yuan and Yi Lin. 2006. Model Selection and Estimation in Regression with Grouped Variables. J. R. Stat. Soc. B 68, 1 (2006), 49–67.
 Zhang et al. (2017) Jinghe Zhang, Vijay S. Iyengar, Dennis Wei, Bhanukiran Vinzamuri, Hamsa Bastani, Alexander R. Macalalad, Anne E. Fischer, Gigi YuenReed, Aleksandra Mojsilović, and Kush R. Varshney. 2017. Exploring the Causal Relationships between Initial Opioid Prescriptions and Outcomes. In AMIA Workshop Data Min. Med. Informat. Washington, DC, USA.
Comments
There are no comments yet.