Learning Optimal Interventions

by   Jonas Mueller, et al.

Our goal is to identify beneficial interventions from observational data. We consider interventions that are narrowly focused (impacting few covariates) and may be tailored to each individual or globally enacted over a population. For applications where harmful intervention is drastically worse than proposing no change, we propose a conservative definition of the optimal intervention. Assuming the underlying relationship remains invariant under intervention, we develop efficient algorithms to identify the optimal intervention policy from limited data and provide theoretical guarantees for our approach in a Gaussian Process setting. Although our methods assume covariates can be precisely adjusted, they remain capable of improving outcomes in misspecified settings where interventions incur unintentional downstream effects. Empirically, our approach identifies good interventions in two practical applications: gene perturbation and writing improvement.



page 1

page 2

page 3

page 4


Stacking interventions for equitable outcomes

Predictive risk scores estimating probabilities for a binary outcome on ...

The Difficult Task of Distribution Generalization in Nonlinear Models

We consider the problem of predicting a response from a set of covariate...

All in one stroke? Intervention Spaces for Dark Patterns

This position paper draws from the complexity of dark patterns to develo...

Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks

While much attention has been given to the problem of estimating the eff...

Analysis of "Learn-As-You-Go" (LAGO) Studies

In learn-as-you-go (LAGO) adaptive designs, the intervention is a packag...

Instrumental Variable Methods using Dynamic Interventions

Recent work on dynamic interventions has greatly expanded the range of c...

VOIDD: automatic vessel of intervention dynamic detection in PCI procedures

In this article, we present the work towards improving the overall workf...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In many data-driven applications, including medicine, the primary interest is identifying interventions that produce a desired change in some associated outcome. Due to experimental limitations, learning in such domains is commonly restricted to an observational dataset

which consists of IID samples from a population with joint distribution

over covariates (features) and outcomes . Typically, such data is analyzed using models which facilitate understanding of the relations between variables (eg. assuming linearity/additivity). Based on conclusions drawn from this analysis, the analyst decides how to intervene in a manner they confidently believe will improve outcomes.

Formalizing such beliefs via Bayesian inference, we develop a framework that identifies beneficial interventions directly from the data. In our setup, an intervention on an individual with pre-treatment covariates

produces post-treatment covariate values that determine the resulting outcome (depicted as the graphical model: . Each possible intervention results in a diffferent . More concretely, we make the following simplifying assumption:


for some underlying function that encodes the effects of causal mechanisms (ie.  represents a fair description of the system state, and some covariates in causally affect , not vice-versa). The observed data is comprised of naturally occurring covariate values where we presume for (ie. the system state remains static without intervention, so the observed covariate values directly influence the observed outcomes). Moreover, we assume the relationship between these covariate values and the outcomes remains invariant, following the same (unknown) function for any arising from one of our feasible interventions (or no intervention at all). Note that this assumption precludes the presence of hidden confounding. Peters et al. (2016) have also relied on this invariance assumption, verifying it as a reasonable property of causal mechanisms in nature.

Given this data, we aim to learn an intervention policy defined by a covariate transformation , applied to each individual in the population. Here, presents a desired setting of the covariates that should be reflected by subsequent intervention to actually influence outcomes. When only specifies changes to a subset of the covariates, an intervention seeking to realize may have unintended side-effects on covariates outside of this subset. We ignore such “fat hand” settings (Duvenaud et al., 2010) until §7. Instead, our methods assume interventions can always be carried out with great precision to ensure the desired transformation is exactly reflected in the post-treatment values: . Our goal is to identify the transformation which produces the largest corresponding post-treatment improvement with high certainty. can either represent a single mapping to be performed on all individuals (global policy) or encode a personalized policy where the intervened upon variables and their values may change with .

Our strong assumptions are made to ensure that statistical modeling alone suffices to identify beneficial interventions. While many real-world tasks violate these conditions, there exist important domains in which violations are sufficiently minor that our methods can discover effective interventions (cf. Rojas-Carulla et al. (2016); Peters et al. (2016)). We use two applications to illustrate our framework. One is a writing improvement task where the data consists of documents labeled with associated outcomes (eg. grades or popularity) and the goal is to suggest beneficial changes to the author. Our second example is a gene perturbation task where the expression of some regulatory genes can be up/down-regulated in a population (eg. cells or bacteria) with the goal of inducing a particular phenotype or activation/repression of a downstream gene. In these examples, covariates are known to cause outcomes and our other assumptions may hold to some degree, depending on the type of external intervention used to alter covariate values.

The contributions of this work include: (1) a formal definition of the optimal intervention that exhibits desirable characteristics under uncertainty due to limited data, (2) widely applicable types of (sparse) intervention policy that are easily enacted across a whole population, (3) algorithms to find the optimal intervention under practical constraints, (4) theoretical insight regarding our methods’ properties in Gaussian Process settings as well as certain misspecified applications.

2 Related Work

The same invariance assumption has been exploited by Peters et al. (2016) and Rojas-Carulla et al. (2016) for causal variable selection in regression models. Recently, researchers such as Duvenaud et al. (2010) and Kleinberg et al. (2015) have supported a greater role for predictive modeling in various decision-making settings. Zeevi et al. (2015)

use gradient boosting to predict glycemic response based on diet (and personal/microbiome covariates), and found they can naively leverage their regressor to select personalized diets which result in superior glucose levels than the meals proposed by a clinical dietitian. As treatment-selection in high-impact applications (eg. healthcare) grows increasingly reliant on supervised learning methods, it is imperative to properly handle uncertainty.

Nonlinear Bayesian predictive models have been employed by Hill (2011), Brodersen et al. (2015), and Krishnan et al. (2015) for quantifying the effects of a given treatment from observations of individuals who have been treated and those who have not. Rather than considering a single given intervention, we introduce the notion of an optimal intervention under various practical constraints, and how to identify such a policy from a limited dataset (in which no individuals have necessarily received any interventions).

Although our goals appear similar to Bayesian optimization and bandit problems (Shahriari et al., 2016; Agarwal et al., 2013), additional data is not collected in our setup. Since we consider settings where interventions are proposed based on all available data, acquisition functions for sequential exploration of the response-surface are not appropriate. As most existing data is not generated through sequential experimentation, our methods are more broadly applicable than iterative approaches like Bayesian optimization.

A greater distinction is our work’s focus on the pre vs. post-intervention change in outcome for each particular individual, whereas Bayesian optimization seeks a single globally optimal configuration of covariates. In practice, feasible covariate transformations are constrained based on an individual’s naturally occurring covariate-values, which stem from some underlying population beyond our control. For example in the writing improvement task, the goal is not to identify a globally optimal configuration of covariates that all texts should strive to achieve, but rather to inform a particular author of simple modifications likely to improve the outcome of his/her existing article. Appropriately treating such constraints is particularly important when we wish to prescribe a global policy corresponding to a single intervention applied to all individuals from the population (there is no notion of an underlying population in Bayesian optimization).

3 Methods

Our strategy is to first fit a Bayesian model for whose posterior encodes our beliefs about the underlying function given the observed data. Subsequently, the posterior for is used to identify a transformation of the covariates which is likely to improve expected post-intervention outcomes according to our current beliefs. The posterior for may be summarized at any points by mean function and covariance function .

3.1 Intervening at the Individual Level

For representing the covariate-measurements from an individual, we are given a set that denotes constraints of possible transformations of . Let denote the new covariate-measurements of this individual after a particular intervention on which alters covariates as specified by transformation . Recall that we assume an intervention can be conducted to produce post-treatment covariate-values that exactly match any feasible transformation: , and we thus write in place of .

We first consider personalized interventions in which may be tailored to a particular . Under the Bayesian perspective, is randomly distributed according to our posterior beliefs, and we define the individual expected gain function:


Since , random function evaluates the expected outcome-difference at the post vs. pre-intervention setting of the covariates (this expectation is over the noise

, not our posterior). To infer the best personalized intervention (assuming higher outcomes are desired), we use optimization over vectors

to find:


where denotes the quantile of our posterior distribution over . We choose , which implies the intervention that produces

should improve the expected outcome with probability

under our posterior beliefs.

Defined based on known constraints of feasible interventions, the set enumerates possible transformations that can be applied to an individual with covariate values . If the set of possible interventions is independent of (ie. ), then our goal is similar to the optimal covariate-configuration problem studied in Bayesian optimization. However, in many practical applications, -independent transformations are not realizable through intervention. Consider gene perturbation, a scenario where it is impractical to simultaneously target more than a few genes due to technological limitations. If alternatively intervening on a quantity like caloric intake, it is only realistic to change an individual’s current value by at most a small amount. The choice reflects the constraint that at most covariates can be intervened upon. We can denote limits on the amount that the covariate may be altered by for . In realistic settings, may be the intersection of many such sets reflecting other possible constraints such as boundedness, impossible joint configurations of multiple covariates, etc.

For any : the posterior distribution for has:


which is easily computed using the corresponding mean/covariance functions of the posterior . When , the objective in (3) takes value 0, so any superior optimum corresponds to an intervention we are confident will lead to expected improvement. If there is no good intervention in (corresponding to a large increase in the posterior mean) or too much uncertainty about given limited data, then our method simply returns indicating no intervention should be performed.

Our objective exhibits these desirable characteristics because it relies on the posterior beliefs regarding both and , which are tied via the covariance function. In contrast, a similarly-conservative lower confidence bound objective (ie. the UCB acquisition function with lower rather than upper quantiles) would only consider , and could propose unsatisfactory transformations where .

3.2 Intervening on Entire Populations

The above discussion focused on personalized interventions tailored on an individual basis. In certain applications, policy-makers are interested in designing a single intervention which will be applied to all individuals from the same underlying population as the data. Relying on such a global policy is the only option in cases where we no longer observe covariate-measurements of new individuals outside the data. In our gene perturbation example, gene expression may no longer be individually profiled in future specimens that receive the decided-upon intervention to save costs/labor.

Here, the covariates are assumed distributed according to some underlying (pre-intervention) population, and we define the population expected gain function:

which is also randomly distributed based on our posterior ( is expectation with respect to the covariate-distribution which is not modeled by ). Our goal is now to find a single transformation corresponding to a population intervention which will (with high certainty under our posterior beliefs) lead to large outcome improvements on average across the population:


Here, the family of possible transformations is constrained such that for all

. As a good model of our multivariate features may be unknown, we instead work with the empirical estimate:


is the empirical population expected gain, whose posterior distribution has:


The population intervention objective in (7) is again 0 for the identity mapping . Under excessive uncertainty or a dearth of beneficial transformations in , the policy produced by this method will again simply be to perform no intervention. In this population intervention setting, is designed assuming future individuals will stem from the same underlying distribution as the samples in . Although is a function of , the form of the transformation must be agnostic to the specific values of (so the intervention can be applied to new individuals without measuring their covariates).

We consider two types of transformations that we find widely applicable. Shift interventions involve transformations of the form: where represents a (sparse) shift that the policy applies to each individuals’ covariates (eg. always adding 3 to the value of the second covariate corresponds to ). Covariate-fixing interventions are policies which set certain covariates to a constant value for all individuals, and involve transformations such that for some covariate-subset and for : is fixed across all (eg. always setting the first covariate to 0, for example in gene knockout, corresponds to ). Figure 1 depicts examples of these different interventions. Under a sparsity constraint, we must carefully model the underlying population in order to identify the best covariate-fixing intervention (here, setting to a large value is superior to intervening on ).

Figure 1: Contour plot of expected outcomes over feature space for relationship . Black points: the underlying population. Gold diamond: optimal covariate-setting if any transformation in the box were feasible. Red points: same population after shift intervention . Light (or dark) green points (along border): best covariate-fixing intervention which can only set (or only ) to a fixed value. Blue, purple, light blue points: individuals who receive a single-variable personalized intervention (arrows indicate direction of optimal transformation).

4 Algorithms

Throughout this work, we use Gaussian Process (GP) regression (Rasmussen, 2006) to model as described in §S1

(‘S’ indicates references in the Supplementary Material). This nonparametric method has been favored in many applications as it produces both accurate predictions and effective measures of uncertainty (with closed-form estimators available in the standard case). Furthermore, a variety of GP models exist for different settings including: non-Gaussian response variables

(Rasmussen, 2006), non-stationary relationships (Paciorek & Schervish, 2004), deep representations (Daminaou & Lawrence, 2013), measurement error (McHutchon & Rasmussen, 2011)

, and heteroscedastic noise

(Le et al., 2005). While these variants are not employed in this work, our methodology can be directly used in conjunction with such extensions (or more generally, any model which produces a useful posterior for ).

Under the standard GP model,

follows a Gaussian distribution and the

quantile of our personalized gain is simply given by:


where denotes the quantile function. The quantiles of the empirical population gain may be similarly obtained. When a smooth smooth covariance kernel is adopted in the GP prior, derivatives of our intervention-objectives are easily computed with respect to .

In many practical settings, an intervention that only affects a small subset of variables is desired. Software to improve text, for example, should not overwhelm authors with a multitude of desired changes, but rather present a concise list of the most beneficial revisions in order to retain underlying semantics. Note that identifying a sparse transformation of the covariates is different from feature selection in supervised learning (where the goal is to identify dimensions along which

varies most). In contrast, we seek the dimensions along which one of our feasible covariate-transformations can produce the largest high-probability increase in , assuming the other covariates remain fixed at their initial pre-treatment values (in the case of personalized intervention) or follow the same distribution as the pre-intervention population (in the case of a global policy).

For a shift intervention , we introduce the convenient notation . In applications where shifting (the covariate for ) by one unit incurs cost , we account for these costs by considering the following regularized intervention-objective:


By maximizing this objective over feasible set , policy-makers can decide which variables to intervene upon (and how much to shift them), depending on the relative value of outcome-improvements (specified by ).

This optimization is performed using the proximal gradient method (Bertsekas, 1995), where at each iterate: a step in the gradient direction is followed by a soft-thresholding operation (Bach et al., 2012) as well as a projection back onto the feasible set . However, a simple gradient method may suffer from local optima. To avoid severely suboptimal solutions, we develop a continuation technique (Mobahi et al., 2012) that performs a series of gradient-based optimizations over variants of this objective with tapering levels of added smoothness (details in §S2).

In some settings, one may want to ensure at most covariates are intervened upon. We identify the optimal -sparse shift intervention via the Sparse Shift Algorithm below, which relies on -relaxation (Bach et al., 2012) and the regularization path of our penalized objective in (13).

[leftmargin=0cm,rightmargin=0cm,skipbelow=0cm,skipabove=0cm, innertopmargin=3pt, innerbottommargin=3pt, innerrightmargin=0.8pt, innerleftmargin=0.8pt] Sparse Shift Algorithm: Finds best -sparse
  shift intervention.

  1. [1:, topsep=-1.5ex,leftmargin=6mm]

  2. Set for

  3. Perform binary search over to find:

  4. Define

  5. Return:  

Recall that in the case of personalized intervention, we simply optimize over vectors . Any personalized transformation can therefore be equivalently expressed as a shift in terms of such that . After substituting the individual gain in place of the population gain within our definition of in (13), we can thus employ the same algorithms to identify sparse/cost-sensitive personalized interventions. To find a covariate-fixing intervention which sets of the covariates to particular fixed constants across all individuals from the population, we instead employ a forward step-wise selection algorithm (detailed in §S2.2), as the form of the optimization is not amenable to -relaxation in this case.

5 Theoretical Results

Consider the following basic conditions: (A1) all data lies in , (A2) . Throughout this section, we assume (A5), (A5), and the conditions laid out in §1 hold. For clarity, we rewrite the true underlying relationship as , letting now denote arbitrary functions. Our results are with respect to the true improvement of an intervention , (note that are no longer random). Our theory relies on Gaussian Process results derived by Srinivas et al. (2010); van der Vaart & van Zanten (2011), and we relegate proofs and technical definitions to §S6.

Theorem 1.

Suppose we adopt a GP prior and the following conditions hold:
(A3) noise variables (A4) there exist such that the Hölder space has probability one under our prior (see van der Vaart & van Zanten (2011)). (A5) and any supported by the prior are Lipschitz continuous over with constant (A6) the density of our input covariates is bounded above and below over domain .

Then, for all :

where constant depends on the prior and density and we define:

is the (generalized) inverse of which depends on the concentration function . measures how well the RKHS of our GP prior approximates (see van der Vaart & van Zanten (2011) for more details). The expectation is over the distribution of the data . Importantly, Theorem 1 does not assume anything about the true relationship , and the bound depends on the distance between and our prior. When is a -smooth function, a typical bound is given by if is the Matérn kernel with smoothness parameter . When is the squared exponential kernel and is -regular (in Sobolev sense), (van der Vaart & van Zanten, 2011).

Theorem 2.

Under the assumptions of Theorem 1, for any such that :

Theorems 1 and 2 characterize the rate at which our personalized/population-intervention objectives are expected to converge to the true improvement (due to contraction of the posterior as grows). Since these results hold for all , this implies the maximizer of our intervention-objectives will converge to the true optimal transformation as (under a reasonable prior). Complementing these results, Theorem 6 in §S6 ensures that for any : optimizing our personalized intervention objective corresponds to improving a lower bound on the true improvement with high probability, when is small and belongs to the RKHS of our prior. In this case, the optimal transformation inferred by our approach only worsens the actual expected outcome with low probability.

6 Results

§S3 contains an analysis of our approach on simulated data from simple covariate-outcome relationships. The average improvement produced by our chosen interventions rapidly converges to the best possible value with increasing . In these experiments, sparse-interventions consistently alter the correct feature subset, and proposed transformations under our conservative criterion are much more rarely harmful than those suggested by optimizing the posterior mean function (which ignores uncertainty).

6.1 Gene Perturbation

Next, we applied our method to search for population interventions in observational yeast gene expression data from Kemmeren et al. (2014). We evaluated the effects of proposed interventions (restricted to single gene knockouts) over a set of 10 transcription factors () with the goal of down-regulating each of a set of 16 small molecule metabolism target genes, . Results for all methods are compared to the actual expression change of the target gene found experimentally under individual knockouts of each transcription factor in

. Compared to marginal linear regressions and multivariate linear regression, our method’s uncertainty prevents it from proposing harmful interventions, and the interventions it proposes are optimal or near optimal (Figure 


Insets (a) and (b) in Figure 2 show empirical marginal distributions between target gene TSL1 and members of identified for knockout by our method (CIN5) and marginal regression (GAT2). From the linear perspective, these relationships are fairly indistinguishable, but only CIN5 displays a strong inhibitory effect in the knockout experiments. Inset (c) shows the empirical marginal for a harmful intervention proposed by multivariate regression for down-regulating GPH1, where the overall correlation is significantly positive, but the few lowest expression values (which influence our GP intervention objective the most) do not provide strong evidence of a large knockdown effect.

Figure 2: Actual effects of proposed interventions (single gene knockout) over a set transcription factors on down-regulation of each of a set of 16 small molecule metabolism target genes.

6.2 Writing Improvement

Finally, we apply our personalized intervention methodology to the task of transforming a given news article into one which will be more widely-shared on social media. We use a dataset from Fernandes et al. (2015) containing various features about individual Mashable articles along with their subsequent popularity in social networks (detailed description/results for this analysis in §S5). We train a GP regressor on 5,000 articles labeled with popularity-annotations and evaluate sparse interventions on a held-out set of 300 articles based on changes they induce in article benchmark popularity (defined in §S5). When , the average benchmark popularity increase produced by our personalized intervention methodology is 0.59, whereas it statistically significantly decreases to 0.55 if is chosen. Thus, even given this large sample size, ignoring uncertainty appears detrimental for this application, and results in 4 articles whose benchmark popularity worsens post-intervention (compared to only 2 for ). Nonetheless, both methods generally produce very beneficial improvements in this analysis, as seen in Figure S3.

As an example of the personalization of proposed interventions, our method () generally proposes different sparse interventions for articles in the Business category vs. the Entertainment category. On average, the sparse transformation for business articles uniquely advocates decreasing global sentiment polarity and increasing word count (which are not commonly altered in the personalized interventions found for entertainment articles), whereas interventions to decrease title subjectivity are uniquely prevalent throughout the entertainment category. These findings appear intuitive (eg. critical business articles likely receive more discussion, and titles of popular entertainment articles often contain startling statements written non-subjectively as fact). Interestingly, the model also tends to advise shorter titles for business articles, but increasing the length for entertainment articles. Articles across all categories are universally encouraged to include more references to other articles and keywords that were historically popular.

7 Misspecified Interventions

Our methodology heavily relies on the assumption that the outcome-determining covariate values produced through intervention exactly match the desired covariate transformation . When transformations are only allowed to alter at most covariates, this requires that we can intervene to alter only this subset without affecting the values of other covariates. If specifies a sparse change affecting only a subset of the covariates , our methods assume the post-treatment value of any non-intervened-upon covariate remains at its initial value (ie. ).

In some domains, the covariate-transformation induced via sparse external intervention can only be roughly controlled (eg. our gene perturbation example when the profiled genes belong to a common regulatory network). Let denote a covariate-fixing transformation which sets a subset of covariates in to constant values across all individuals in the population. In this section, we consider an alternative assumption under which the intervention applied in hopes of achieving propagates downstream to affect other covariates outside (so there may exist : ), which we formalize as the do-operation in the causal calculus of Pearl (2000). Here, we suppose the underlying population of follows a structural equation model (SEM) (Pearl, 2000). The outcome is restricted to be a sink node of the causal DAG, so we can still write and maintain the other conditions from §1. Rather than exhibiting covariate-distribution with (as presumed in our methods), the post-treatment population which arises from an intervention seeking to enact transformation is now assumed to follow the distribution specified by . Note that the do-operation here is only applied to some nodes in the DAG (variables in subset ) as discussed by Peters et al. (2014), but its effects can alter the distributions of non-intervened-upon covariates outside of which lie downstream in the DAG.

Theorem 3.

For some , suppose the condition: (A7) holds. Then, for any covariate-fixing transformation :  and
are equal.

Here, denotes the variables which are parents of outcome in the underlying causal DAG, and is the set of variables which are not descendants of variables in subset . For the next result, we define: as the intervention set corresponding to the optimal -sparse covariate-fixing transformation (where in the case of ties, the set of smallest cardinality is chosen), if transformations were exactly realized by our interventions (which is not necessarily the case in this section).

Theorem 4.

Suppose the underlying DAG satisfies: (A8) No variable in is a descendant of other parents, ie. s.t. . Then, satisfies (A3).

In the absence of extremely strong interactions between variables in , the equality of Theorem 3 will also hold for if . For settings where sparse interventions elicit unintentional -effects and the causal DAG meets condition (A4), Theorems 3 and 4 imply that, under complete certainty about , the (minimum cardinality) maximizer of our covariate-fixing intervention objective corresponds to an transformation that produces an equally good outcome change when the corresponding intervention is actually realized as a -operation in the underlying population. Combined with Theorem 2, our results ensure that, even in this misspecified setting, the empirical maximizer of our sparse covariate-fixing intervention objective (7) produces (in expectation as ) beneficial interventions for populations whose underlying causal relationships satisfy certain conditions.

Next, we empirically investigate how effective our methods are in this misspecified SEM setting, where a proposed sparse population transformation is actually realized as a do-operation and can therefore unintentionally affect other covariates in the post-intervention population. We generate data from an underlying linear non-Gaussian SEM, and where is a sink node in the corresponding causal DAG (see §S3.1 for details). Our approach to identify a beneficial sparse population intervention is compared with inferring the complete SEM using the LinGAM estimator of Shimizu et al. (2006) and subsequently identifying the optimal single-node do-operation in the inferred SEM. Note that LinGAM is explicitly designed for this setting, while both our method and the relied-upon Gaussian Process model are severely misspecified.

Figures 3A and 3B demonstrate that the inferred best single-variable shift population intervention (under constraints on the magnitude of the shift) matches the performance the interventions suggested by LinGAM (except for in rare cases with tiny sample size) when the proposed interventions are evaluated as do-operations in the true underlying SEM. Thus, we believe a supervised learning approach like ours is preferable in practical applications where interpreting the underlying causal structure is not as important as producing good outcomes (especially for higher dimensional data where estimation of the causal structure becomes difficult (Peters et al., 2014)).

Figure 3: The average (solid) and quantile (dashed) expected outcome change produced by our method (red) vs LinGAM (blue) over 100 datasets drawn from two underlying SEMs chosen by Shimizu et al. (2006). The black dashed line indicates the best possible improvement in each case.

The assumption of sparse interventions realized as a -operation (as defined by Peters et al. (2014)) may also be an inappropriate in many domains, particularly if off-target effects of interventions are explicitly mitigated via external controls. To appreciate the intricate nature of assumptions regarding non-intervened-upon variables, consider our example of modeling text documents represented using two features: polarity and word count. A desired transformation to increase the text’s polarity can be accomplished by inserting additional positive adjectives, but such an intervention also increases articles’ word count. Alternatively, polarity may be identically increased by replacing words with more positive alternatives, an external intervention which would not affect the word count (and thus follows the assumptions of our framework).

8 Discussion

This work introduces methods for directly learning beneficial interventions from purely observational data without treatments. While this objective is, strictly speaking, only possible under stringent assumptions, our approach performs well in both intentionally-misspecified and complex real-world settings. As supervised learning algorithms grow ever more popular, we expect intervention-decisions in many domains will increasingly rely on predictive models. Our conservative definition of the optimal intervention provides a principled approach to handle the inherent uncertainty in these settings due to finite data. Able to employ any Bayesian regressor, our ideas are widely applicable, considering practical types of interventions that can either be personalized or enacted uniformly over a population.

Acknowledgements: We thank David Gifford for helpful comments. DR was funded by an award from IARPA (under research contract 2015-15061000003).



  • (1)
  • Agarwal et al. (2013) Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L. & Schapire, R. (2013), ‘Taming the monster: A fast and simple algorithm for contextual bandits’,

    30th International Conference on Machine Learning (ICML)

  • Bach et al. (2012) Bach, F., Jenatton, R., Mairal, J. & Obozinski, G. (2012), ‘Optimization with sparsity-inducing penalties’, Foundations and Trends in Machine Learning 4(1), 1–106.
  • Bertsekas (1995) Bertsekas, D. (1995), Nonlinear Programming, Athena Scientific.
  • Brodersen et al. (2015) Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N. & Scott, S. L. (2015), ‘Inferring causal impact using bayesian structural time-series models’, Annals of Applied Statistics 9, 247–274.
  • Daminaou & Lawrence (2013) Daminaou, A. & Lawrence, A. (2013), ‘Deep Gaussian processes’, 16th International Conference on Artificial Intelligence and Statistics (AISTATS) .
  • Duvenaud et al. (2010) Duvenaud, D., Eaton, D., Murphy, K. & Schmidt, M. (2010), ‘Causal learning without DAGs’, JMLR: Workshop and Conference Proceedings 6, 177–190.
  • Fernandes et al. (2015) Fernandes, K., Vinagre, P. & Cortez, P. (2015), ‘A proactive intelligent decision support system for predicting the popularity of online news’, 17th EPIA Portuguese Conference on Artificial Intelligence .
  • Hill (2011) Hill, J. L. (2011), ‘Bayesian nonparametric modeling for causal inference’, Journal of Computational and Graphical Statistics 20(1), 217–240.
  • Kemmeren et al. (2014) Kemmeren, P., Sameith, K., van de Pasch, L. A., Benschop, J. J., Lenstra, T. L., Margaritis, T., O’Duibhir, E., Apweiler, E., van Wageningen, S., Ko, C. W. et al. (2014), ‘Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors’, Cell 157(3), 740–752.
  • Kleinberg et al. (2015) Kleinberg, J., Ludwig, J., Mullainathan, S. & Obermeyer, Z. (2015), ‘Prediction policy problems’, American Economic Review: Papers & Proceedings 105(5), 491–495.
  • Krishnan et al. (2015)

    Krishnan, R. G., Shalit, U. & Sontag, D. (2015), ‘Deep kalman filters’,

    Advances in Neural Information Processing Systems (NIPS) 28.
  • Le et al. (2005) Le, Q. V., Smola, A. J. & Canu, S. (2005), ‘Heteroscedastic Gaussian process regression’, 22nd International Conference on Machine Learning (ICML) .
  • McHutchon & Rasmussen (2011) McHutchon, A. & Rasmussen, C. E. (2011), ‘Gaussian process training with input noise’, Advances in Neural Information Processing Systems (NIPS) 24.
  • Mobahi et al. (2012) Mobahi, H., L, Z. C. & Ma, Y. (2012), ‘Seeing through the blur’,

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  • Paciorek & Schervish (2004) Paciorek, C. J. & Schervish, M. J. (2004), ‘Nonstationary covariance functions for Gaussian process regression’, Advances in Neural Information Processing Systems (NIPS) 17.
  • Pearl (2000) Pearl, J. (2000), Causality: Models, Reasoning and Inference, Cambridge Univ. Press.
  • Peters et al. (2016)

    Peters, J., Bühlmann, P. & Meinshausen, N. (2016), ‘Causal inference using invariant prediction: identification and confidence intervals’,

    Journal of the Royal Statistical Society: Series B 78, 1–42.
  • Peters et al. (2014) Peters, J., Mooij, J. M., Janzing, D. & Schölkopf, B. (2014), ‘Causal discovery with continuous additive noise models’, Journal of Machine Learning Research 15, 2009–2053.
  • Rasmussen (2006) Rasmussen, C. E. (2006), Gaussian processes for machine learning, MIT Press.
  • Rojas-Carulla et al. (2016) Rojas-Carulla, M., Schölkopf, B., Turner, R. & Peters, J. (2016), ‘Causal transfer in machine learning’, arXiv:1507.05333 .
  • Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N. (2016), ‘Taking the human out of the loop: A review of Bayesian optimization’, Proceedings of the IEEE 104(1), 148–175.
  • Shimizu et al. (2006) Shimizu, S., Hoyer, P., Hyvärinen, A. & Kerminen, A. J. (2006), ‘A linear non-Gaussian acyclic model for causal discovery’, Journal of Machine Learning Research 7, 2003–2030.
  • Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S. M. & Seeger, M. (2010), ‘Gaussian process optimization in the bandit setting: No regret and experimental design’, 27th International Conference on Machine Learning (ICML) .
  • van der Vaart & van Zanten (2011) van der Vaart, A. & van Zanten, H. (2011), ‘Information rates of nonparametric Gaussian process methods’, Journal of Machine Learning Research 12, 2095–2119.
  • Zeevi et al. (2015) Zeevi, D., Korem, T., Zmora, N., Israeli, D., Rothschild, D., Weinberger, A., Ben-Yacov, O., Lador, D., Avnit-Sagi, T., Lotan-Pompan, M. et al. (2015), ‘Personalized nutrition by prediction of glycemic responses’, Cell 163(5), 1079–1094.

S1 Gaussian Process Regression

Gaussian Process regression Rasmussen2006si adopts a prior under which follow multivariate Gaussian distribution N for any collection . The model is specified by a prior mean function and positive-definite covariance function which encodes our prior belief regarding properties of the underlying relationship between and (such as smoothness or periodicity). Here, the vector denotes the evaluation of function at each point , and denotes the matrix whose component is . Given test input points in addition to training data , we additionally define: , , matrix with entry (where is the training input), and matrix which contains pairwise covariances between test inputs.

Assuming the noise ) is independently sampled for each observation, the posterior for at the test inputs, , follows   distribution with the following mean vector and covariance matrix:

Note that our intervention-optimization framework is not specific to this GP model, but can be combined with any algorithm that learns a reasonable posterior for . While employing a more powerful model should improve the results of our approach, comparing various regressors is not our focus. Thus, all practical results of our methodology are presented using only the standard GP regression model, under which the posterior distribution over is given by the above expressions. In each application presented here, our GP uses the Automatic-Relevance-Determination (ARD) covariance function, a popular choice for multi-dimensional data Rasmussen2006si:


The ARD kernel relies on length-scale hyperparameters

which determine how much can depend on each dimension of the feature-space. All hyperparameters of our GP regression model (covariance-kernel parameters and (the output variance) as well as the variance of the noise ) are empirically selected via marginal-likelihood maximization Rasmussen2006si. In each application, we employ the posterior-quantile () in our method to ensure that with high probability, the intervention it infers to be optimal induces a nonnegative change in expected outcomes.

S2 Algorithmic Details

To find an optimal transformation of our regularized objective in (13), we employ the proximal gradient method described in §4. When and there is no penalty, we instead use Sequential Least Squares Programming Kraft1988si. However, the intervention objective

may be highly nonconcave. To deal with local optima in acquisition functions, Bayesian optimization methods employ heuristics like combining the results of many local optimizers or operating over a fine partitioning of the feature space Shahriari2016si, Lizotte2008. We instead propose a continuation technique that solves a series of optimization problems, each of which operates on our objective under a smoothed posterior (and the amount of additional smoothing is gradually decreased to zero). Excessive smoothing of the posterior is achieved by simply considering GP models whose kernels are given overly large length-scale parameters. Each time the amount of smoothing is tapered, we initialize our local optimizer using the solution found at the previously greater smoothing level. Intuitively, the highly smoothed GP model is primarily influenced by the global structure in the data, and thus our optimization with respect to the posterior of this model is far less susceptible to low-quality local maxima. Analysis of a similar homotopy strategy under radial basis kernels has been conducted by Mohabi2012si.

s2.1 Sparse Shift Intervention

Here, we provide an explanatory description of the Sparse Shift Algorithm from §4. To find the best -sparse population shift intervention, we resort to relaxation. As the -norm provides the closest convex relaxation to the norm, this is a a commonly adopted strategy to avoid combinatorial search in feature selection Bach2012si. First, we compute the regularization path over different settings of the penalty for the following regularized objective:


which is maximized over the feasible set
(recall we write when ).

Subsequently, we identify the regularization penalty which produces a shift of desired cardinality and select our intervention set as the covariates which receive nonzero shift. Finally, we optimize the original unregularized objective () with respect to only the selected covariates in to remove bias induced by the regularizer. Each inner maximization in both the Sparse Shift/Covariate-fixing algorithms is performed via the proximal gradient methods combined with our continuation approach introduced in §S2.

s2.2 Sparse Covariate-fixing Intervention

Another goal is to identify the optimal covariate-fixing intervention which sets of the covariates to particular fixed constants uniformly across all individuals from the population. We employ the forward step-wise selection algorithm described below, as the form of the optimization in this case is not amenable to -relaxation. Recall denotes the subset of covariates which are intervened upon, and the covariate-fixing intervention produces vector such that if , otherwise which is a constant chosen by the policy-maker. This same transformation is applied to each individual in the population, creating a more homogeneous group which share the same value for the covariates in . For a given , the objective function to find the best constants is:


which is maximized over the constraints: for .

  Sparse Covariate-fixing Algorithm: Identifies best -sparse covariate-fixing intervention.
  Input: Dataset , Posterior
Parameters: specifies the maximal number of covariates which may be set by the covariate-fixing intervention, are sets of feasible settings for each covariate.

  1. [1:, topsep=-1.5ex,leftmargin=6mm]

  2. Initialize ,  ,  

  3. While :

  4. Set       for each

  5. Find

  6. If :       ,  ,  

  7. Else:      break

  8. Return:


S3 Simulations

(A) Linear:   (B) Quadratic:  
(C) Product:  
Figure S1: The mean (solid) and quantile (dashed) expected outcome change produced under personalized interventions suggested by various methods, over 100 datasets of each sample size. Each dataset contains 10-dimensional covariates, with , and is determined by the indicated relationships and additive Gaussian noise (). The black lines indicate the best possible expected outcome change (when the best change depends on which individual received the intervention, the black solid/dashed lines indicates the mean and quantile over our 100 trials).
(A) Linear:   (B) Quadratic:  
Figure S2: The mean (solid) and quantile (dashed) expected outcome change produced by our population intervention method, over 100 datasets for each sample size (same setting as in Figure §S1). The black line indicates the best possible expected outcome improvement.

Over the simulated data summarized in Figure S1, we apply our basic personalized intervention method () with purely local optimization (standard) and our continuation technique (smoothed), which significantly improves results. For each of the 100 datasets, we randomly sampled a new point (from the same underlying distribution) to receive a personalized intervention. The magnitude of each intervention is bounded by 1, except for in data from the quadratic relationship. We also infer sparse interventions (with a cardinality constraint of 2 for the linear and quadratic relationships, 1 for the product relationship). When , the optimal (constrained) intervention may drastically vary depending upon the individual’s covariate-values, and our algorithm is able to correctly infer this behavior (Simulation C). Finally, we also apply a variant of our method which entirely ignores uncertainty (). While this approach is on average better for larger sample sizes, highly harmful interventions are occasionally proposed, whereas our uncertainty-adverse method () is much less prone to producing damaging interventions (preferring to abstain by returning instead). This is an invaluable characteristic since interventions generally require effort and are only worth conducting when they are likely to produce a substantial benefit.

Figure S2 displays the behavior of both the population shift intervention in the linear setting, and the population covariate-fixing intervention under the quadratic relationship. The population intervention is notably safer than the individually tailored variants, producing no negative changes in our experiments.

s3.1 Linear SEM Analysis

Here, we suppose that a desired transformation upon variable cannot be enacted exactly and the which arises post-treatment is distributed according to , where is the mean of the pre-treatment marginal distribution of the th covariate. In this case, do-effects can propagate to other covariates which are descendants of in the DAG because the values of descendant variables are redrawn from the do-distribution which arises as a result of shifting . Because all relationships are linear in our SEMs, the actual expected outcome change resulting from a particular shift (resulting from the corresponding do-operation) is easily obtained in closed form.

Our GP framework is applied to the data to infer an optimal 1-sparse shift population intervention (only interventions on a single variable are allowed). The maximal allowed magnitude of the shift is constrained to ensure the optimum is well-defined (to

times the standard deviation of each variable in the underlying SEM distribution). An alternative approach to improve outcomes in contrast to our black-box approach is to apply a causal inference method like LinGAM Shimizu2006si to estimate the SEM from the data, and then identify the optimal single-variable shift

in the LinGAM-inferred SEM (since all inferred relationships are also linear, the optimal single-variable shift will be either 0 or the lower/upper allowed shift and we simply search over these possibilities). We compare our approach against LinGAM by evaluating the actual expected outcome change produced by the shift proposed by each method (where the actual expected outcome change is found by analytically performing the operation in the true underlying SEM) .

In our experiment, two underlying SEM models are considered which were used by Shimizu2006si to demonstrate the utility of their LinGAM method (albeit with impractically large sample size = 10,000). SEM is used to refer to the model depicted in Figure 3 of Shimizu2006si, where we define as x6 (a sink node in the causal DAG). SEM denotes the underlying model of Figure 4 in the same paper ( is defined as sink node x7). The remainder of the variables in each SEM are adopted as our observed covariates .

This experiment represents an application of our method in a highly misspecified setting. The true data-generating mechanism differs significantly from assumptions of our GP regressor (output noise is now fairly non-Gaussian, the underlying relationships are all linear while we use an ARD kernel). Furthermore, an intervention to transform a single covariate incurs a multitude of unintentional off-target effects resulting from the do-effects propagating to downstream covariates in the SEM, whereas our method believes only the chosen covariate is changed. In contrast, this data exactly follows the special assumptions required by LinGAM, and we properly account for inferred downstream do-operation effects when identifying the best inferred intervention under LinGAM. The only disadvantage of the LinGAM method is that it does not know the direction of the causal relationship (although we found it always estimated this direction correctly except on rare occasions with tiny sample sizes of ).

Since LinGAM only estimates linear relations, the best inferred shift-intervention found by this approach will always be 0 or the minimal/maximal shift allowed for a particular covariate. Searching over these three values for each covariate ensures the actual optimal shift will be recovered if the LinGAM SEM-estimate were correct. However, under our approach, identifying the optimal population shift-intervention requires solving an optimization problem. Even if the GP regression posterior were to exactly reflect the true data-generating mechanism, our approach might get stuck in a suboptimal local maximum or avoid the minimal/maximal allowed shift due to too much uncertainty about in the resulting region of feature-space. In practice, these potential difficulties do not pose much of an issue for our approach.

S4 Gene Knockout Interventions

The data set used for this analysis contains gene expression levels for a set of wild type (ie. ‘observational’) samples, , as well as for a set of ‘interventional’ samples, , in which each individual gene was serially knocked out. In our analysis, we search for potential interventions for affecting the expression of a desired target gene by training our GP regressor on and determining which knockout produces the best value of our empirical covariate-fixing population intervention objective (for down-regulating the target). Subsequently, we use to evaluate the actual effectiveness of proposed interventions in the knockout experiments. We only search for interventions present in (single gene knockouts) rather than optimizing to infer optimal covariate transformations.

As candidate genes for this analysis we used only the 700 genes that Kemmeren2014largesi classified as responsive mutants (at least four transcripts show robust changes in response to the knockout). Furthermore, we omitted genes whose expression over the 161 observational samples had standard deviation

. Out of the transcription factors present in the remaining set of genes, we defined the top 10 factors as our feature set , after ranking the transcription factors by the difference between their expression when they were knocked out in the interventional data and their quantile expression level in the observational data. This was to ensure that our model would be trained on data that at least resembled the experimental data . The set of genes to down-regulate was simply chosen to be those classified by Kemmeren2014largesi as small molecule metabolism genes that met the minimum standard deviation requirement in their observational expression marginal distribution. The resulting set was 16 target genes, and the (negative) expression of each of was treated as an outcome in our analyses.

Each method evaluated in this analysis was to propose an intervention (single gene knockout) to down-regulate the expression of each target gene (separately). Once a gene to knock out was proposed, this intervention was evaluated by comparing the resulting expression of the target when the proposed knockout was actually performed in the experimental data . This expression level could then be compared to the ‘optimal’ choice of gene from to intervene upon (the gene in whose knockout produced the largest down-regulation of the target in ).

We compared our approach against two methods popularly used to draw conclusions about affecting outcomes in the sciences. First, we applied a multivariate regression analysis in which a linear regression model was fit to the observations of

in . The best gene to knockout was inferred on the basis of the regression coefficients and expression values (if no beneficial regression coefficient was found significant at the 0.05 level under the standard -test, then no intervention was proposed). Second, we performed a marginal analysis in which separate univariate linear regression models were fit to , and the best knockout was again inferred on the basis of the regression coefficients and expression values (again, no intervention was recommended if there was no statistically significant beneficial regression coefficient at the 0.05 level, after correcting for multiple testing via the False Discovery Rate).

Figure 2 compares the results produced by these methods to the optimal intervention over for down-regulating each , as found in the experimental data . Of the 16 small molecule metabolism target genes tested, in three cases our method proposed an intervention which was found to be optimal or near optimal in , while in the remaining cases, the model uncertainty causes the method not to recommend any intervention (except for one very minorly harmful intervention for target SAM3). On the other hand, neither form of linear regression proposed effective interventions for any target other than FKS1, and in some cases, the linear regressors proposed counterproductive interventions that up-regulated the target. This highlights the importance of a model that properly accounts uncertainty when evaluating potential interventions.

S5 Interventions to Improve Article Popularity

We demonstrate our personalized intervention methodology in a setting with rich nonlinear underlying relationships. The data consist of 39,000 news articles published by Mashable around 2013-15 Fernandes2015si. Each article is annotated with the number of shares it received in social networks (which we use as our outcome variable after log-transform and rescaling). A multitude of features have been extracted from each article (eg. word count, the category such as “tech” or “lifestyle”, keyword properties), many of which Fernandes2015si produced using natural language processing algorithms (eg. subjectivity, polarity, alignment with topics found by Latent Dirichlet Allocation). After removing many highly redundant covariates, we center and rescale all variables to unit-variance (see Table

S2 for a complete description of the 29 covariates used in this analysis).

We randomly partition the articles into 3 disjoint groups: a training set (5,000 articles on which scaling-factors are computed and our GP regressor is trained), an improvement set (300 articles we find interventions for), and a held-out set (over 34,000 articles used for evaluation). A large group is left out for validation to ensure there are many near-neighbors for any given article, so we can reasonably estimate the true expected popularity given any setting of the article-covariates. Subsequently, a basic GP regression model is fitted to the training set. As the predictive power of our GP regressor did not measurably benefit from ARD feature-weighting, we simply use the squared exponential kernel. Over the held-out articles, the Pearson correlation between the observed popularity and the GP (posterior mean) predictions is 0.35. Furthermore, there is a highly significant () positive correlation of 0.07 between the model’s predictive variance and the actual squared errors of GP predictions over this held-out set. Our model is thus able to make reasonable predictions of popularity based on the available covariates, and its uncertainty estimates tend to be larger in areas of the feature-space where the posterior mean lies further from actual popularity values.

In this analysis, we compare our personalized intervention methodology which rejects uncertainty (using ) with a variant of the this approach that ignores uncertainty (using the same objective function with ). Both methods share the same GP regressor, optimization procedure, and set of constraints. For the 300 articles in the intervention set (not part of the training data) we allow intervening upon all covariates except for the article category which presumably is fixed from an author’s perspective. All covariate-transformations are constrained to lie within [-2,2] of the original (rescaled) covariate value, and we impose a sparsity constraint that at most 10 covariates can be intervened upon for a given article.

Unfortunately, no pre-and-post-intervention articles are available for us to ascertain a ground truth evaluation. To crudely measure performance, we estimate the underlying expected popularity of a given covariate-setting using benchmark popularity: the (weighted) average observed popularity amongst 100 nearest neighbors (in the feature-space) from the set of held-out articles (with weights based on inverse Euclidean distance). Over our improvement set, the Pearson correlation between articles’ observed popularity and benchmark popularity is 0.28 (highly significant: ). This approach thus appears to be, on average, a reasonable way to benchmark performance (even though nearest-neighbor held-out articles can individually differ from the text of a particular pre/post-intervention article despite sharing similar values of our 29 measured covariates).

Figure S3 depicts the results of our personalized intervention for each article in our intervention set. The expected improvement produced by a particular intervention is estimated as the difference between the benchmark popularity of the post-intervention covariate-settings and the original covariate-settings of the article receiving the personalized intervention. Table S1 summarizes these results. A paired-sample -test suggests our method is significantly superior on average ().


Figure S3: Benchmark popularity changes produced by the personalized interventions for 300 articles suggested by our method with (Rejecting Uncertainty) vs. (Ignoring Uncertainty). The points (ie. articles) are colored according to the value of our personalized intervention objective with . Using outperforms in this analysis in 177/300 articles in the improvement set.
      Method  Mean  Median Quantile Num. Negative
Rejecting Uncertainty 0.586 0.578 0.126 2
Ignoring Uncertainty 0.552 0.555 0.105 4
Table S1: Summary statistics for the benchmark popularity change produced by each method over the 300 articles of the intervention set. The last column counts the number of harmful interventions (with change ).

To provide concrete examples, we present some articles of the Business and Entertainment categories (taken from our improvement set). For this business article: http://mashable.com/2014/07/30/how-to-beat-the-heat/, our method proposes shifting the following 10 covariates (see Table S2 for feature descriptions):

num_hrefs: +2, num_self_hrefs: -1.25, average_token_length: -1.771, kw_avg_min: +1.71, kw_avg_avg: +2, self_reference_min_shares: +2, self_reference_max_shares: +1.68, self_reference_avg_sharess: +2, global_subjectivity: +1.57, global_sentiment_polarity: -2

For this entertainment article: http://mashable.com/2014/07/30/how-to-beat-the-heat/, our method proposes shifting the following 10 covariates:

average_token_length: -1.55, kw_avg_min: + 1.63, kw_avg_avg: +2, self_reference_min_shares: +2 self_reference_max_shares: +1.85, self_reference_avg_shares: +2.0, LDA_00: +1.63, LDA_01: -2, LDA_04: +0.82, global_subjectivity: +1.62

Indifferent to uncertainty, the method with advocates shifting all these covariates by the maximal allowed amounts, which leads to a 0.04 worse improvement in benchmark popularity compared with the covariate-changes specified above for this article.

        Feature Description
n_tokens_title Number of words in the title
n_tokens_content Number of words in the content
n_unique_tokens Rate of unique words in the content
n_non_stop_words Rate of non-stop words in the content
num_hrefs Number of links
num_self_hrefs Number of links to other articles published by Mashable
average_token_length Average length of the words in the content
num_keywords Number of keywords in the metadata
data_channel_is_lifestyle Is the article category “Lifestyle”?
data_channel_is_entertainment Is the article category “Entertainment”?
data_channel_is_bus Is the article category “Business”?
data_channel_is_socmed Is the article category “Social Media”?
data_channel_is_tech Is the article category “Tech”?
data_channel_is_world Is the article category “World”?
kw_avg_min Avg. shares of articles with the least popular keyword used for this article
kw_avg_max Avg. shares of articles with the most popular keyword used for this article
kw_avg_avg Avg. shares of the average-popularity keywords used for this article
self_reference_min_shares Min. shares of referenced articles in Mashable
self_reference_max_shares Max. shares of referenced articles in Mashable
self_reference_avg_shares Avg. shares of referenced articles in Mashable
LDA_00 Closeness to first LDA topic
LDA_01 Closeness to second LDA topic
LDA_02 Closeness to third LDA topic
LDA_03 Closeness to fourth LDA topic
LDA_04 Closeness to fifth LDA topic
global_subjectivity Subjectivity score of the text
global_sentiment_polarity Sentiment polarity of the text
title_subjectivity Subjectivity score of title
title_sentiment_polarity Sentiment polarity of title
Table S2: The 29 covariates of each article (dimensions of in this analysis). Features involving the share-counts of other articles and LDA were based only on data known before the publication date.

S6 Proofs and additional Theoretical Results

Notation and Definitions

All points lie in convex and compact domain .

denotes constants whose value may change from line to line.

All occurrences of are implicitly referring to .

, , and respectively denote the mean, variance, and covariance function of our posterior for under the GP prior.

denotes the

quantile of random variable


denotes the quantile function.

denotes the norm of reproducing kernel Hilbert space .

denotes the ball of radius centered at .

represents the set of variables which are intervened upon in sparse settings.

denotes the set of variables which are parents of in a causal directed acyclic graph (DAG) Pearl2000si

is the set of variables which are descendants of at least one variable in according to the causal DAG.

denotes the complement of set .

The squared exponential kernel (with length-scale parameter ) is defined:

The Matérn kernel (with another parameter controlling smoothness of sample paths) is defined:

Random variables form a martingale difference sequence which is uniformly bounded by if and .

A function is Lipshitz continuous with constant if: for every .

Suppose is expressed as for nonnegative integer and .
The Hölder space is the space of functions with existing partial derivatives of orders for all integers satisfying . Additionally, each function’s highest order partial derivative must form a function that satisfies: for any .

Theorem 5 (VanderVaart2011si).

Under the assumptions of Theorem 1:

where is defined as in §5. See VanderVaart2011si for a detailed discussion about this function.

Proof of Theorem 1


Recall depends on . We fix and adapt the bound provided by Theorem 5 to show our result. Let denote the ball of radius centered at . We first establish the bound:


where Vol. Theorem 5 implies the following inequality (ignoring constant factors):