1 Introduction
Machine learning is increasingly being used in domains that have considerable impact on people, ranging from healthcare Callahan and Shah (2017) to banking Siddiqi (2012) to manufacturing Wuest et al. (2016). Moreover, in many of these domains, the desire for transparency has led to published machine-learned models that play a dual role in prediction and influencing behavior change. Consider a doctor who uses a risk tool to predict whether a patient is in danger of having a heart attack, while at the same time wanting to recommend lifestyle changes to improve outcomes. (Footnote 1: MDCalc is one example of a site that provides risk assessment calculators for use by medical professionals; see https://www.mdcalc.com/.) Consider a bank, looking to predict whether customers are likely to repay loans while customers are, at the same time, seeking to improve their credit scores. Consider a wine producer, looking to predict the demand for a new vintage while at the same time deciding how to make changes to their production process to improve future vintages.
It is well understood that correlation and causation need not go hand in hand Pearl and others (2009); Rubin (2005). What is novel about this work is that we seek models that serve the dual purpose of achieving predictive accuracy as well as providing high confidence that decisions made with respect to the model improve outcomes. That is, we care about the utility that comes from having a predictive tool, while recognizing that these tools may also drive decisions. To illustrate the potential pitfalls of a purely predictive approach, consider a doctor who would like to advise a patient on how to reduce risk of heart attack. If the doctor assesses risk using a linear predictive model (as is often the case, see Ustun and Rudin (2016)), then a negative coefficient for alcohol consumption may lead the doctor to suggest a daily glass of red wine. Is this decision justified? Perhaps not, although this recommendation has often been made based on correlative evidence and despite a clear lack of experimental support Haseeb et al. (2017); Sahebkar et al. (2015).
At the same time, predictive models are valuable in and of themselves, for example in assessing whether a patient is in immediate risk. Similarly, banks want to understand credit risk while promoting good decisions by consumers in regard to true creditworthiness, and wine producers want to predict the marketability of a new vintage while improving their processes for next year. As designers of a learning framework, what degrees of freedom can we utilize to promote good decisions? Our main insight is that controlling the tradeoff between accuracy and decision quality, where it exists, can be cast as a problem of model selection. For instance, there may be multiple models with similar predictive performance but with different coefficients that therefore induce very different decisions Breiman and others (2001). To achieve this tradeoff we introduce lookahead regularization, which balances accuracy and the improvement associated with induced decisions. Lookahead regularization anticipates how users will act and penalizes a model unless there is high confidence that these decisions will improve outcomes.

Formally, these decisions, which depend on the predictive model, induce a target distribution $p'$ on covariates that may differ from an initial distribution $p$. An individual with covariates $x$ is mapped to new covariates $x'$. For a prespecified confidence level $\tau$, we want to guarantee improvement for at least a fraction $\tau$ of the population, comparing outcomes under $p'$ in relation to observed outcomes in $p$ (under an invariance assumption on $p(y \mid x)$). The technical challenge is that $p'$ may differ considerably from $p$, resulting in uncertainty in estimating the effect of decisions. To solve this, lookahead regularization makes use of an uncertainty model that provides confidence intervals around decision outcomes. A discriminative uncertainty model is trained under a learning framework that makes use of importance weighting Gretton et al. (2009); Shimodaira (2000); Sugiyama et al. (2008) to handle covariate shift and is designed to provide accurate intervals for $p'$.

Our algorithm has stages that alternate between optimizing the different components of our framework: the predictive model (under the lookahead regularization term), the uncertainty model (used within the regularization term), and the propensity model (used for covariate shift adjustment). If the uncertainty model is differentiable and the predictive model is twice-differentiable, then gradients can pass through the entire pipeline and gradient-based optimization can be applied. We run three experiments. One uses synthetic data and illustrates how our approach can be useful, as well as helping to understand what is needed for it to succeed. The second application is to wine quality prediction and shows that even simple tasks have interesting tradeoffs between accuracy and improved decisions that can be utilized. The third experiment focuses on predicting diabetes progression and includes a demonstration of the framework in a setting with individualized actions.
1.1 Related work
Strategic Classification. In the field of strategic classification, the learner and agents engage in a Stackelberg game, where the learner attempts to publish a maximally accurate classifier taking into account that agents will shift their features to obtain better outcomes under the classifier Hardt et al. (2016). While early efforts viewed all modifications as "gaming" (an adversarial effect to be mitigated) Dong et al. (2018); Brückner and Scheffer (2011), a recent trend has focused on creating incentives for modifications that lead to better outcomes under the ground truth function rather than simply better classifications Kleinberg and Raghavan (2019); Alon et al.; Haghtalab et al. (2020); Tabibian et al. (2019). In the absence of a known mapping from effort to ground truth, Miller et al. (2019) show that incentive design relates to causal modeling, and several responsive works explore how the actions induced by classifiers can facilitate discovery of these causal relationships Perdomo et al. (2020); Bechavod et al. (2020); Shavit et al. (2020). The second-order effect of strategic classification on algorithmic fairness has also motivated several works Liu et al. (2020); Hu et al. (2019); Milli et al. (2019). Generally, these works consider the equilibrium effects of classifiers, where the choice of model affects covariate distributions and in turn predictive accuracy. In contrast, we consider what can be done given a snapshot at a point in time, or when the input distribution remains unaffected by user actions.

Causality, Covariate Shift, and Distributionally Robust Learning. There are many efforts in ML to quantify the uncertainty associated with predictions and identify domain regions where models err Lakshminarayanan et al. (2017); Hernández-Lobato and Adams (2015); Gal and Ghahramani (2016); Guo et al. (2017); Tagasovska and Lopez-Paz (2019); Liu et al. (2019). However, most methods fail to achieve desirable properties when deployed out of distribution (OOD) Snoek et al. (2019). When the shifted distribution is unknown at train time, distributionally robust learning can provide worst-case guarantees for specific types of shifts but requires unrealistic computational expense or restrictive assumptions on model classes Sinha et al. (2017).
Although we do not know our shifted distribution of interest ahead of training, our framework is concerned only with the single, specific OOD distribution that is induced by the learned predictive model. Hence, we need only guarantee robustness to this particular distribution, for which we make use of tools from learning under covariate shift Bickel et al. (2009). Relevant to our task, Mueller et al. (2016) seek to identify treatments which are beneficial with high probability under the covariate shift assumption. Because model variance generally increases when covariate shift acts on non-causal variables Peters et al. (2016), our framework of trading off uncertainty minimization with predictive power relates to efforts in the causal literature to find models which have optimal predictive accuracy while being robust to classes of interventional perturbations Meinshausen (2018); Rothenhäusler et al. (2018).

2 Method
Let $x \in X \subseteq \mathbb{R}^d$ denote a feature vector and $y \in \mathbb{R}$ denote a label, where $x$ describes the object of interest (e.g., a patient, a customer, a wine vintage), and $y$ describes the quality of an outcome associated with $x$, where we assume that higher $y$ is better. We assume an observational dataset $S = \{(x_i, y_i)\}_{i=1}^{m}$, which consists of IID samples from a population with joint distribution $p(x, y)$ over covariates (features) $x$ and outcomes $y$. We denote by $p(x)$ the marginal distribution on covariates.

Let $f : X \to \mathbb{R}$ denote a model trained on $S$. We assume that $f$ is used in two different ways:

Prediction: To predict outcomes $y$ for objects $x$, sampled from $p$.

Decision: To take action, through changes to $x$, with the goal of improving outcomes.
We will assume that user actions map each $x$ to a new $x' \in X$. We refer to $x'$ as a user's decision or action and denote decision outcomes by $y'$. We set $x' = d(x)$ and refer to $d : X \to X$ as the decision function. We will assume that users consult $f$ to drive decisions, either because they care only about predicted outcomes (e.g., the case of bank loans), or because they consider the model to be a valid proxy of the effect of a decision on the outcome (e.g., the case of heart attack risk or wine production). As in other works incorporating strategic users into learning Perdomo et al. (2020); Hardt et al. (2016), our framework requires an explicit model of how users use the model to make decisions. For concreteness, we model users as making a step in the direction of the gradient of $f$, but note that the framework can also support any other differentiable decision model. (Footnote 2: Many works consider decisions that take gradient steps under cost constraints. Note that such constraints can be incorporated into learning using, for example, differentiable optimization layers Agrawal et al. (2019).) Since not all attributes may be susceptible to change, we distinguish between mutable and immutable features using a masking operator $\Gamma$.
Assumption 1 (User decision model).
Given masking operator $\Gamma$, we define user decisions as:
$$x' = d(x) = x + \eta \, \Gamma\big(\nabla_x f(x)\big) \tag{1}$$
where the step size $\eta > 0$ is a design parameter.
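The decision model of Assumption 1 can be illustrated with a minimal sketch. It assumes that the masking operator acts as an elementwise 0/1 mask and approximates the gradient by central differences; the model `f_linear`, its coefficients, and the helper names are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def numerical_gradient(f: Callable[[Vector], float], x: Vector, eps: float = 1e-6) -> Vector:
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += eps
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def decide(f: Callable[[Vector], float], x: Vector, mask: Vector, eta: float) -> Vector:
    """Masked gradient step: x' = x + eta * mask * grad f(x) (elementwise mask)."""
    grad = numerical_gradient(f, x)
    return [xj + eta * mj * gj for xj, mj, gj in zip(x, mask, grad)]


def f_linear(x: Vector) -> float:
    """Hypothetical linear predictive model with coefficients (2.0, -1.0, 0.5)."""
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2]


x = [1.0, 1.0, 1.0]
mask = [1.0, 1.0, 0.0]  # assumption: the third feature is immutable
x_new = decide(f_linear, x, mask, eta=0.1)
print(x_new)  # mutable features move along the gradient; the masked one stays put
```

For a linear model the gradient is constant, so every individual receives the same shift; a nonlinear model would produce individualized decisions.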
Through Assumption 1, user decisions induce a particular decision function $d$, and in turn, a target distribution over $X$, which we denote $p'(x)$. This leads to a new joint distribution $p'(x, y)$, with decisions inducing new outcomes. To achieve causal validity in the way we reason about the effect of decisions on outcomes, we follow Peters et al. (2016) and assume that $y$ depends only on $x$ and is invariant to the distribution on $x$.
Assumption 2 (Covariate shift Shimodaira (2000)).
The conditional distribution on outcomes, $p(y \mid x)$, is invariant for any arbitrary marginal distribution $p'(x)$ on covariates, including the data distribution $p(x)$.
Assumption 2 says that whatever the transform $d$, the conditional distribution $p(y \mid x)$ is fixed, and the new joint distribution is $p'(x, y) = p(y \mid x)\, p'(x)$, for any $p'$ (note that $p'$ also depends on $f$). This covariate-shift assumption ensures the causal validity of our approach (and entails the property of no unobserved confounders). Although a strong assumption, this kind of invariance has been leveraged in other works that relate to questions of causality Rojas-Carulla et al. (2018); Mueller et al. (2016), as well as for domain adaptation Schweikert et al. (2009); Quiñonero-Candela et al. (2009).
2.1 Learning objective
Our goals in designing a learning framework are twofold. First, we would like learning to result in a model $f$ whose predictions $\hat{y} = f(x)$ closely match the corresponding labels $y$ for $x \sim p(x)$. Second, we would like the model to induce decisions $x'$ for counterfactual distribution $p'$ whose outcome $y'$ improves upon the initial $y$. To balance between these two goals, we construct a learning objective in which a predictive loss function is augmented with a regularization term that promotes good decisions. The difficulty is that decision outcomes $y'$ depend on decisions $x'$ through the learned model $f$. Hence, realizations of $y'$ are unavailable at train time, as they cannot be observed until after the model is deployed. For this reason, simple constraints of the form $y' \ge y$ are ill-defined, and to regularize we must reason about outcome distributions $y' \sim p(y \mid x')$, for $x' \sim p'$. A naive approach might consider the average improvement, with $\mu = \mathbb{E}_{y' \sim p(y \mid x')}[y']$, for a given $x'$, and penalize the model whenever $\mu < y$, for example linearly using $y - \mu$. Concretely, $\mu$ must be estimated, and since $f$ minimizes MSE, then $\hat{y}' = f(x')$ is a plausible estimate of $\mu$, giving:
$$\min_{f \in F} \; \mathbb{E}_{p(x,y)}\big[(\hat{y} - y)^2\big] + \lambda \, \mathbb{E}_{p(x,y)}\big[y - \hat{y}'\big] \tag{2}$$
where $\lambda \ge 0$ determines the relative importance of improvement over accuracy. There are two issues with this approach. First, learning can result in an $f$ that severely overfits in estimating $\mu$, meaning that at train time the penalty term in the (empirical) objective will appear to be low whereas at test time its (expected) value will be high. This can happen, for example, when $x'$ is moved to a low-density region of $p(x)$ where $f$ is unconstrained by the data and, if flexible enough, can artificially (and wrongly) signal improvement. To address this we use two decoupled models: one for predicting $y$ on distribution $p$, and another for handling $y'$ on distribution $p'$.
Second, in many applications it may be unsafe to guarantee that improvement holds only on average per individual (e.g., heart attack risk, credit scores). To address this, we encourage $f$ to improve outcomes with a certain degree of confidence $\tau$, for $\tau \in [0, 1]$, i.e., such that $P[y' \ge y] \ge \tau$ for a given $(x, y)$ and induced $x'$ and thus $p(y \mid x')$. Importantly, while one source of uncertainty in $y'$ is $p(y \mid x')$, other sources of uncertainty exist, including those coming from insufficient data as well as model uncertainty. Our formulation is useful when additional sources of uncertainty are significant, such as when the model leads to actions that place $x'$ in low-density regions of $p$.
In our method, we replace the average-case penalty in Eq. (2) with a confidence-based penalty:
$$\min_{f \in F} \; \mathbb{E}_{p(x,y)}\big[(\hat{y} - y)^2\big] + \lambda \, \mathbb{E}_{p(x,y)}\big[\mathbb{1}\{P[y' \ge y] < \tau\}\big] \tag{3}$$
where $\mathbb{1}\{A\} = 1$ if $A$ is true, and $0$ otherwise. In practice, $P[y' \ge y]$ is unknown, and must be estimated. For this, we make use of an uncertainty model, $g_\tau$, which we learn, and which maps points $x'$ to intervals $[\ell', u']$ that cover $y'$ with probability $\tau$. We also replace the penalty term in Eq. (3) with the slightly more conservative $\mathbb{1}\{\ell' < y\}$, and to make learning feasible we use the hinge loss $\max\{0, y - \ell'\}$ as a convex surrogate. (Footnote 3: The penalty is conservative in that it considers only one-sided uncertainty, i.e., only the lower bound $\ell'$ enters the penalty and the upper bound $u'$ is not used explicitly. Although open intervals suffice here, most methods for interval prediction consider closed intervals, and in this way our objective can support them. For symmetric intervals, simply becomes .) For a given uncertainty model, $g_\tau$, the empirical learning objective for model $f$ on sample set $S$ is:
$$\min_{f \in F} \; \frac{1}{m} \sum_{i=1}^{m} \big(f(x_i) - y_i\big)^2 + \lambda \, R(g_\tau; S), \qquad R(g_\tau; S) = \frac{1}{m} \sum_{i=1}^{m} \max\{0, \, y_i - \ell'_i\} \tag{4}$$
where $[\ell'_i, u'_i] = g_\tau(x'_i)$ and $R$ is the lookahead regularization term.
By anticipating how users decide, this penalizes models whose induced decisions do not improve outcomes at a sufficient rate (see Figure 1).

The novelty in the regularization term is that it accounts for uncertainty in assessing improvement, and does so for points that are out of distribution. If $f$ pushes $x'$ towards regions of high uncertainty, then the interval $[\ell', u']$ is likely to be large, and $f$ must make more "effort" to guarantee improvement, something that may come at some cost to in-distribution prediction accuracy. As a by-product, while the objective encodes the rate of decision improvement, we will also see the magnitude of improvement increase in our experiments.
Note that the regularization term $R$ depends both on $f$ and $g$: on $f$ to determine $x'$, and on $g$ to determine $\ell'$ given $x'$. This justifies the need for the decoupling of $f$ and $g$: without this, uncertainty estimates based on $f$ are prone to overfit by artificially manipulating intervals to be higher than $y$, resulting in low penalization at train time without actual improvement (see Figure 2 (right)).
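As a concrete reading of the hinge-based lookahead term, the following sketch evaluates an empirical objective of the form "squared error plus weighted hinge penalty". The averaging convention, the function names, and the toy numbers are assumptions made for illustration; the lower bounds would in practice come from a learned uncertainty model evaluated at the decided points.

```python
from typing import Sequence


def lookahead_penalty(y: Sequence[float], lower_new: Sequence[float]) -> float:
    """Mean hinge penalty max(0, y_i - l'_i): positive only when the lower
    bound on the post-decision outcome falls below the observed outcome."""
    return sum(max(0.0, yi - li) for yi, li in zip(y, lower_new)) / len(y)


def lookahead_objective(
    y: Sequence[float],
    y_pred: Sequence[float],
    lower_new: Sequence[float],
    lam: float,
) -> float:
    """Mean squared error plus lam times the lookahead penalty."""
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y)) / len(y)
    return mse + lam * lookahead_penalty(y, lower_new)


# Toy values (hypothetical): observed outcomes, predictions, and lower bounds
# of the intervals predicted for the decided points x'.
y = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.5]
lower_new = [1.5, 1.0, 3.0]

penalty = lookahead_penalty(y, lower_new)
objective = lookahead_objective(y, y_pred, lower_new, lam=0.5)
print(penalty, objective)
```

Only the second point is penalized here, since it is the only one whose lower bound falls below the observed outcome; wider intervals at the decided points lower the bound and therefore raise the penalty.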
2.2 Estimating uncertainty
The usefulness of lookahead regularization relies on the ability of the uncertainty model $g$ to correctly capture the various kinds of uncertainties about the outcome value for the perturbed points. This can be difficult because uncertainty estimates are needed for out-of-distribution points $x'$.
Fortunately, for a given $f$ the counterfactual distribution $p'$ is known (by Assumption 1), and we can use the covariate transform associated with the decision to construct sample set $S' = \{x'_i\}_{i=1}^{m}$. Even without labels for $S'$, estimating $g$ is now a problem of learning under covariate shift, where the test distribution $p'$ can differ from the training distribution $p$. In particular, we are interested in learning uncertainty intervals that provide good coverage. There are many approaches to learning under covariate shift. Here we describe the simple and popular method of importance weighting, or inverse propensity weighting Shimodaira (2000). For a loss function $L$, we would like to minimize $\mathbb{E}_{p'(x,y)}[L(g(x), y)]$. Let $w(x) = p'(x) / p(x)$, then by the covariate shift assumption:
$$\mathbb{E}_{p'(x,y)}\big[L(g(x), y)\big] = \int L(g(x), y)\, p(y \mid x)\, p'(x)\, dx\, dy = \int w(x)\, L(g(x), y)\, p(y \mid x)\, p(x)\, dx\, dy = \mathbb{E}_{p(x,y)}\big[w(x)\, L(g(x), y)\big]$$
Hence, training $g$ with points sampled from distribution $p$ but weighted by $w$ will result in an uncertainty model that is optimized for the counterfactual distribution $p'$. In practice, $w$ is itself unknown, but many methods exist for learning an approximate model $\hat{w}$ using sample sets $S$ and $S'$ (e.g. Kanamori et al. (2009)). To remain within our discriminative approach, here we follow Bickel et al. (2009) and train a logistic regression model $h$ to differentiate between points $x \in S$ and $x' \in S'$, which are assigned opposite class labels, and set the weights $w(x)$ from the output of $h$. As we are interested in training $g$ to gain a coverage guarantee, we define $L$ in terms of whether the interval $g(x)$ covers $y$.

2.3 Algorithm
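As a concrete illustration of the importance weights from Section 2.2 that the algorithm relies on, the sketch below fits a one-dimensional logistic classifier to separate source points from shifted points and converts its logit into a density-ratio estimate. It uses the standard identity that, for a probabilistic classifier $h$ separating the two samples, $p'(x)/p(x)$ is proportional to $h(x)/(1 - h(x))$. The labeling convention (source = 0, shifted = 1), the toy data, and the plain gradient-descent fit are assumptions for illustration, not necessarily the authors' exact setup.

```python
import math
from typing import List, Tuple


def fit_logistic_1d(xs: List[float], labels: List[int],
                    lr: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Fit p(label=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, c in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - c) * x
            gb += p - c
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def importance_weight(x: float, w: float, b: float,
                      n_source: int, n_target: int) -> float:
    """Density-ratio estimate: h/(1-h) = exp(w*x + b), rescaled by class sizes."""
    return (n_source / n_target) * math.exp(w * x + b)


# Toy data (hypothetical): decisions shift every source point by +1.
source = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = [s + 1.0 for s in source]

xs = source + target
labels = [0] * len(source) + [1] * len(target)  # assumed labeling convention
w, b = fit_logistic_1d(xs, labels)

weights = [importance_weight(x, w, b, len(source), len(target)) for x in source]
print(weights)  # source points closer to the shifted region receive larger weight
```

These weights would then multiply the per-example loss when training the uncertainty model on the labeled source sample.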
All the elements in our framework (the predictive model $f$, the uncertainty model $g$, and the propensity model $h$) are interdependent. Specifically, optimizing $f$ in Eq. (4) requires intervals from $g$; learning $g$ requires weights from $h$; and $h$ is trained on $S'$, which is in turn determined by $f$. Our algorithm therefore alternates between optimizing each of these components while keeping the others fixed. At round $t$, $f^{(t)}$ is optimized with intervals from $g^{(t-1)}$, $g^{(t)}$ is trained using weights from $h^{(t)}$, and $h^{(t)}$ is trained using points $S'^{(t)}$ as determined by $f^{(t)}$. The procedure is initialized by training $f^{(0)}$ without the lookahead term $R$. For training $g$ and $h$, weights $w$ and points $S'$, respectively, can be precomputed and plugged into the objective. Training $f$ with Eq. (4), however, requires access to the function $g$, since during optimization, the lower bounds $\ell'$ must be evaluated for points $x'$ that vary as updates to $f$ are made. Hence, to optimize $f$ with gradient methods, we use an uncertainty model $g$ that is differentiable, so that gradients can pass through it (while keeping its parameters fixed). Furthermore, since gradients must also pass through $x'$ (which includes $\nabla_x f$), we require that $f$ be twice-differentiable.
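The alternation described above can be sketched as a small driver loop. This is only a structural skeleton under stated assumptions: the three trainers and the decision function are passed in as callables with hypothetical signatures, and the stub implementations below exist solely to make the control flow visible; they do not train anything.

```python
from typing import Callable, List, Optional, Tuple

Sample = List[Tuple[float, float]]  # (x, y) pairs; scalar x for simplicity


def lookahead_alternation(
    S: Sample,
    fit_f: Callable[[Sample, Optional[Callable]], Callable],
    fit_g: Callable[[Sample, Callable], Callable],
    fit_h: Callable[[List[float], List[float]], Callable],
    decide: Callable[[Callable, float], float],
    rounds: int,
):
    """Alternate between propensity (h), uncertainty (g) and predictive (f) models."""
    f = fit_f(S, None)  # initialization: no lookahead term yet
    g = h = None
    for _ in range(rounds):
        S_new = [decide(f, x) for x, _ in S]   # decided points induced by current f
        h = fit_h([x for x, _ in S], S_new)    # propensity model: S versus S'
        g = fit_g(S, h)                        # uncertainty model, weighted via h
        f = fit_f(S, g)                        # predictive model with lookahead term
    return f, g, h


# Stub components (hypothetical) that only record the order of training calls.
trace: List[str] = []


def stub_fit_f(S, g):
    trace.append("f")
    return lambda x: 0.0


def stub_fit_h(X, X_new):
    trace.append("h")
    return lambda x: 1.0


def stub_fit_g(S, h):
    trace.append("g")
    return lambda x: (0.0, 0.0)


def stub_decide(f, x):
    return x


S = [(0.0, 1.0), (1.0, 2.0)]
lookahead_alternation(S, stub_fit_f, stub_fit_g, stub_fit_h, stub_decide, rounds=2)
print(trace)
```

Each component is held fixed while the next is refit, so the only state carried between rounds is the current triple of models.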
In the experiments we consider two methods for learning $g$:

Bootstrapping Efron and Tibshirani (1994), where a collection of models is trained for prediction each on a subsampled dataset and combined to produce a single interval model $g$, and

Quantile regression, the method used for the quadratic model in Section 3.3.
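The bootstrapping option can be sketched as follows: several simple models are fit on resampled data and the spread of their predictions at a query point defines an interval. This is a plain case-resampling bootstrap over one-dimensional least-squares fits with empirical quantiles, chosen for brevity; it is an assumption for illustration and differs from the residuals-based, importance-weighted variant used in the experiments. All names and the toy data are hypothetical.

```python
import random
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: all x identical
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def bootstrap_interval(xs: List[float], ys: List[float], x_query: float,
                       tau: float = 0.9, n_models: int = 200,
                       seed: int = 0) -> Tuple[float, float]:
    """Interval from empirical quantiles of bootstrap predictions at x_query."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a + b * x_query)
    preds.sort()
    lower = preds[int((1 - tau) / 2 * (n_models - 1))]
    upper = preds[int((1 + tau) / 2 * (n_models - 1))]
    return lower, upper


# Toy data (hypothetical): noisy line observed on [-1.5, 1.4].
data_rng = random.Random(1)
xs = [i / 10 for i in range(-15, 15)]
ys = [2.0 * x + data_rng.gauss(0.0, 0.5) for x in xs]

inside = bootstrap_interval(xs, ys, x_query=0.0)
outside = bootstrap_interval(xs, ys, x_query=10.0)
print(inside, outside)  # the interval far from the data is much wider
```

The widening of the interval away from the observed data is exactly the signal the lookahead term uses to discourage decisions that land in poorly supported regions.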
3 Experiments
In this section, we evaluate our approach in three experiments of increasing complexity and scale, where the first is synthetic and the latter two use real data. Because the goal of regularization is to balance accuracy with decision quality, we will be interested in understanding the attainable frontier of accuracy vs. improvement. For our method, this will mostly be controlled by varying the lookahead regularization parameter, $\lambda$. In all experiments we measure predictive accuracy with root mean squared error (RMSE), and decision quality in two ways: mean improvement rate $\mathbb{E}[\mathbb{1}\{y' > y\}]$ and mean improvement magnitude $\mathbb{E}[y' - y]$.
To evaluate the approach, we need a means for evaluating counterfactual outcomes $y'$ for decisions $x'$. Therefore, and similarly to Shavit and Moses (2019), we make use of an inferred 'ground-truth' function $f^*$ to test decision improvement, assuming $\mathbb{E}[y \mid x] = f^*(x)$. Model $f^*$ is trained on the entirety of the data. By optimizing for RMSE, we think of this as estimating the conditional mean of $p(y \mid x)$, with the data labels as (arbitrarily) noisy observations. To make for an interesting experiment, we learn $f^*$ from a function class that is more expressive than that of $f$ or $g$. The sample set $S$ will contain a small and possibly biased subsample of the data, which we call the 'active set', and that plays the role of a representative sample from $p$. This setup allows us not only to evaluate improvement, but also to experiment with the effects of different sample sets.
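The three evaluation quantities named above can be computed as in the following sketch. The exact definitions (strict improvement for the rate, signed difference for the magnitude) and the toy values are assumptions for illustration.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between labels and predictions."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def improvement_rate(y: Sequence[float], y_new: Sequence[float]) -> float:
    """Fraction of individuals whose post-decision outcome exceeds the original."""
    return sum(1 for a, b in zip(y, y_new) if b > a) / len(y)


def improvement_magnitude(y: Sequence[float], y_new: Sequence[float]) -> float:
    """Mean signed change in outcome, post-decision minus original."""
    return sum(b - a for a, b in zip(y, y_new)) / len(y)


# Toy outcomes (hypothetical): original y and post-decision y'.
y = [1.0, 2.0, 3.0, 4.0]
y_new = [2.0, 2.5, 2.0, 6.0]

rate = improvement_rate(y, y_new)
magnitude = improvement_magnitude(y, y_new)
error = rmse([1.0, 2.0], [1.0, 4.0])
print(rate, magnitude, error)
```

The two decision metrics can disagree: a model may improve most individuals slightly while harming a few substantially, which is why both are tracked.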
3.1 Experiment 1: Quadratic curves
We first explore the effects of regularized and unregularized learning on decision quality in a stylized setting using unidimensional quadratic curves. The ground-truth function $f^*$ is a quadratic curve, and we assume $y = f^*(x) + \epsilon$ where $\epsilon$ is independently, normally distributed. By varying the decision model step size $\eta$, we explore three conditions: one where a naïve approach works well, one where it fails but regularization helps, and one where regularization also fails.

In Figure 2 (left), $\eta$ is small, and the $x'$ points stay within the high-certainty region of $p$. Here, the baseline works well, giving both a good fit and effective decisions, and the regularization term in the lookahead objective remains inactive. In Figure 2 (center), $\eta$ is larger. Here, the baseline model pushes $x'$ points to a region where $y'$ values are low. Meanwhile, the lookahead model, by incorporating into the objective the decision model and estimating uncertainty surrounding $y'$, is able to adjust the model to induce good decisions with some reduction in accuracy. In Figure 2 (right), $\eta$ is large. Here, the $x'$ points are pushed far into areas of high uncertainty. The success of lookahead relies on the successful construction of intervals at $x'$ through the successful estimation of $w$, and may fail if $p$ and $p'$ differ considerably, as is the case here.
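The role of the step size in this experiment can be illustrated numerically. The sketch below assumes a hypothetical concave ground truth $f^*(x) = -x^2$ and a linear predictive model whose slope matches the local slope of $f^*$ at $x = -1$; neither is taken from the experiment itself, and noise is omitted.

```python
def f_star(x: float) -> float:
    """Hypothetical concave quadratic ground truth with its peak at x = 0."""
    return -x * x


# Assumption: a linear model fitted on data near x = -1 recovers roughly the
# local slope of f_star there, i.e. d/dx(-x^2) at x = -1, which equals 2.
slope = 2.0
x = -1.0

improvement = {}
for eta in (0.25, 0.5, 1.5):
    x_new = x + eta * slope  # gradient-step decision under the linear model
    improvement[eta] = f_star(x_new) - f_star(x)

print(improvement)  # small steps help; a large step overshoots the peak
```

With a small step the decision climbs toward the peak, while the largest step lands well past it and makes the true outcome worse even though the linear model predicts a gain.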
3.2 Experiment 2: Wine quality
The second experiment focuses on wine quality using the wine dataset from the UCI data repository Dua and Graff (2017). The wine in the data set has 13 features, most of which correlate linearly with quality $y$, but two of which (alcohol and malic acid) have a nonlinear U-shaped or inverse-U-shaped relationship with $y$. For the ground truth model, we set $f^*$ so that it captures these nonlinearities. To better demonstrate the capabilities of our framework, we sample points into the active set non-uniformly by thresholding on the nonlinear features. The active set includes 30% of the data, and is further split 75-25 into a train set used for learning and tuning and a held-out test set used for final evaluation.
For the predictive model, our focus here is on linear models. The baseline includes a linear $f$ trained with $\ell_2$ regularization (i.e., Ridge Regression) with regularization coefficient $\alpha$. Our lookahead model includes a linear $f$ trained with lookahead regularization (Eq. (4)) with regularization coefficient $\lambda$. In some cases we will add to the objective an additional $\ell_2$ term, so that for a fixed $\alpha$, setting $\lambda = 0$ recovers the baseline model. Lookahead was trained for 10 rounds and the baseline with a matching number of overall epochs. The uncertainty model $g$ uses residuals-based bootstrapping with 20 linear sub-models. The propensity model $h$ is also linear. We consider two settings: one where all features (i.e., wine attributes) are mutable, and another where only a subset of the features are mutable, each with its own decision step size $\eta$.

Full mutability.
Figure 3 (left) presents the frontier of accuracy vs. improvement on the test set when all features are mutable. The baseline and lookahead models coincide when $\lambda = 0$. For the baseline, as $\alpha$ increases, predictive performance (RMSE) displays a typical learning curve with accuracy improving until reaching an optimum at some intermediate value of $\alpha$. Improvement, however, monotonically decreases with $\alpha$, and is highest with no regularization ($\alpha = 0$). This is because in this setting, gradients of $f$ induce reasonably good decisions: $f$ is able to approximately recover the dominant linear coefficients of $f^*$, and shrinkage due to higher $\ell_2$ penalization reduces the magnitude of the (typically positive, on average) change. With lookahead, increasing $\lambda$ leads to better decisions, but at the cost of higher (albeit sublinear) RMSE. The initial improvement rate at $\lambda = 0$ is high, but lookahead and $\ell_2$ penalties have opposing effects on the model. Here, improvement is achieved by (and likely requires) increasing the size of the coefficients of the linear model $f$. We see that $f$ learns to do this in an efficient way, as compared to a naïve scaling of the predictively optimal $f$.
Partial mutability.
Figure 3 (center) presents the frontier of accuracy vs. improvement when only a subset of the features are mutable (note that this affects the scale of possible improvement). The baseline presents a similar behavior to the fully mutable setting, but with the optimal predictive model inducing a negative improvement. Here we consider lookahead with various degrees of additional $\ell_2$ regularization. When $\lambda = 0$, the models again coincide. However, for larger $\lambda$, significant improvement can be gained with very little or no loss in RMSE, while moderate $\lambda$ values improve both decisions and accuracy. This holds for various values of $\alpha$, and setting $\alpha$ to its optimal value results in lookahead dominating the tradeoff curve for all observed $\lambda$. Improvement is reflected in magnitude and rate, which rises quickly from the baseline's value to its optimum, showing how lookahead learns models that lead to safe decisions.
Figure 3 (right) shows how the coefficients of the baseline and lookahead models change as $\alpha$ and $\lambda$ increase, respectively (for lookahead, $\alpha$ is held fixed). As can be seen, lookahead works by making substantial changes to mutable coefficients, sometimes reversing their sign, with milder changes to immutable coefficients. Lookahead achieves improvement by capitalizing on its freedom to learn a useful direction of improvement within the mutable subspace, while compensating for the possible loss in accuracy through mild changes in the immutable subspace.
3.3 Experiment 3: Diabetes
The final experiment focuses on the prediction of diabetes progression using the diabetes dataset Efron et al. (2004) (Footnote 4: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). The dataset has 10 features describing various patient attributes. We consider two features as mutable: BMI and T-cell count (marked as 's1'). While both display a similar (although reversed) linear relationship with $y$, feature s1 is much noisier. The setup is as in wine but with two differences: to capture nonlinearities we set $f^*$ to be a flexible generalized additive model (GAM) with splines of degree 10, and train and test sets are sampled uniformly from the data. We normalize to and set .
Figure 4 (left) presents the accuracy-improvement frontier for linear $f$ and bootstrapped linear $g$. Results show a similar trend to the wine experiment, with lookahead providing improved outcomes (both rate and magnitude) while preserving predictive accuracy. Here, lookahead improves results by learning to increase the coefficient of s1, while adjusting other coefficients to maintain reasonable uncertainty. The baseline fails to utilize s1 for improvement since from a predictive perspective there is little value in placing weight on s1.
When $f$ is linear, decisions are uniform across the population in that $\nabla_x f$ is independent of $x$. To explore individualized actions, we also consider a setting where $f$ is a more flexible quadratic model (i.e., linear in $x$ and $x^2$) in which gradients depend on $x$ and uncertainty is estimated using quantile regression. Figure 4 (center) shows the data as projected onto the subspace of the two mutable features (BMI and s1), with color indicating outcome values $f^*(x)$, interpolated within this subspace. As can be seen, the mapping $x \mapsto x'$ due to $f$ generally improves outcomes. The plot reveals that, had we had knowledge of $f^*$, uniformly decreasing BMI would also have improved outcomes, and this is in fact the strategy invoked by the linear $f$. But decisions must be made based on the sample set, and so uncertainty must be taken into account. Figure 4 (right) shows a similar plot but with color indicating uncertainty estimates as measured by the interval sizes given by $g$. The plot shows that decisions are directed towards regions of lower uncertainty (i.e., approximately following the negative gradients of the uncertainty slope), showing how lookahead successfully utilizes these uncertainties to adjust the predictive model $f$.

4 Discussion
Given the extensive use of machine learning across an ever-growing range of applications, we think it is appropriate to assume, as we have here, that predictive models will remain in widespread use, and that at the same time, and despite well-understood concerns, users will continue to act upon them. In line with this, our goal with this work has been to develop a machine learning framework that accounts for decision making by users but remains fully within the discriminative framing of statistical machine learning. The lookahead regularization framework that we have proposed augments existing machine learning methodologies with a component that promotes good human decisions. We have demonstrated the utility of this approach across three different experiments, one on synthetic data, one on predicting and deciding about wine, and one on predicting and deciding in regard to diabetes progression. We hope that this work will inspire continued research in the machine learning community that embraces predictive modeling while also being cognizant of the ways in which our models are used.
Broader Impact
In our work, the learning objective was designed to align with and support the possible use of a predictive model to drive decisions by users. It is our belief that a responsible and transparent deployment of models with "lookahead-like" regularization components should avoid the kinds of mistakes that can be made when predictive methods are conflated with causally valid methods.
At the same time, we have made a strong simplifying assumption, that of covariate shift, which requires that the relationship between covariates and outcome variables is invariant as decisions are made and the feature distribution changes. This strong assumption is made to ensure validity for the lookahead regularization, since we need to be able to perform inference about counterfactual observations. As discussed by Mueller et al. (2016) and Peters et al. (2016), there exist real-world tasks that reasonably satisfy this assumption, and yet at the same time, other tasks (notably those with unobserved confounders) where this assumption would be violated. Moreover, this assumption is not testable on the observational data. This, along with the need to make an assumption about the user decision model, means that an application of the method proposed here should be done with care and will require some domain knowledge to understand whether or not the assumptions are plausible.
Furthermore, the validity of the interval estimates requires that any assumptions for the interval model used are satisfied and that weights $w$ provide a reasonable estimation of $p'/p$. In particular, fitting to $p'$ which has little to no overlap with $p$ (see Figure 2) may result in underestimating the possibility of bad outcomes.
If used carefully and successfully, then the system provides safety and protects against the misuse of a model. If used in a domain for which the assumptions fail to hold then the framework could make things worse, by trading accuracy for an incorrect view of user decisions and the effect of these decisions on outcomes.
We would also caution against any specific interpretation of the application of the model to the wine and diabetes data sets. We note that model misspecification of $f^*$ could result in arbitrarily bad outcomes, and estimating $f^*$ in any high-stakes setting requires substantial domain knowledge and should err on the side of caution. We use the data sets for purely illustrative purposes because we believe the results are representative of the kinds of results that are available when the method is correctly applied to a domain of interest.
References
 Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570. Cited by: footnote 2.
 [2] Multiagent evaluation mechanisms. Cited by: §1.1.
 Causal feature discovery through strategic modification. arXiv preprint arXiv:2002.07024. Cited by: §1.1.
 Discriminative learning under covariate shift. Journal of Machine Learning Research 10 (Sep), pp. 2137–2155. Cited by: §1.1, §2.2.
 Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical science 16 (3), pp. 199–231. Cited by: §1.
 Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 547–555. Cited by: §1.1.
 Machine learning in healthcare. In Key Advances in Clinical Informatics, pp. 279–291. Cited by: §1.
 Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 55–70. Cited by: §1.1.
 UCI machine learning repository. Cited by: §3.2.
 Least angle regression. The Annals of statistics 32 (2), pp. 407–499. Cited by: §3.3.
 An introduction to the bootstrap. CRC press. Cited by: item 1.

 Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §1.1.
 Covariate shift by kernel mean matching. Dataset shift in machine learning 3 (4), pp. 5. Cited by: §1.

 On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1321–1330. Cited by: §1.1.
 Maximizing welfare with incentive-aware evaluation mechanisms. Technical report working paper. Cited by: §1.1.
 Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pp. 111–122. Cited by: §1.1, §2.
 Wine and cardiovascular health: a comprehensive review. Circulation 136 (15), pp. 1434–1448. Cited by: §1.

 Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869. Cited by: §1.1.
 The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 259–268. Cited by: §1.1.
 A least-squares approach to direct importance estimation. Journal of Machine Learning Research 10 (Jul), pp. 1391–1445. Cited by: §2.2.
 How do classifiers induce agents to invest effort strategically?. In Proceedings of the 2019 ACM Conference on Economics and Computation, pp. 825–844. Cited by: §1.1.
 Quantile regression. Journal of economic perspectives 15 (4), pp. 143–156. Cited by: item 2.

 Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89 (425), pp. 278–288. Cited by: §B.1.
 Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413. Cited by: §1.1.
 Accurate uncertainty estimation and decomposition in ensemble learning. In Advances in Neural Information Processing Systems, pp. 8950–8961. Cited by: §1.1.
 The disparate equilibria of algorithmic decision making when individuals invest rationally. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 381–391. Cited by: §1.1.

 Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pp. 6–10. Cited by: §1.1.
 Strategic adaptation to classifiers: a causal perspective. arXiv preprint arXiv:1910.10362. Cited by: §1.1.
 The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 230–239. Cited by: §1.1.
 Learning optimal interventions. arXiv preprint arXiv:1606.05027. Cited by: §1.1, §2, Broader Impact.
 Causal inference in statistics: an overview. Statistics surveys 3, pp. 96–146. Cited by: §1.
 Performative prediction. arXiv preprint arXiv:2002.06673. Cited by: §1.1, §2.
 Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 947–1012. Cited by: §1.1, §2, Broader Impact.
 Dataset shift in machine learning. The MIT Press. Cited by: §2.

 Invariant models for causal transfer learning. The Journal of Machine Learning Research 19 (1), pp. 1309–1342. Cited by: §2.
 Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229. Cited by: §1.1.
 Causal inference using potential outcomes: design, modeling, decisions. Journal of the American Statistical Association 100 (469), pp. 322–331. Cited by: §1.
 Lack of efficacy of resveratrol on C-reactive protein and selected cardiovascular risk factors: results from a systematic review and meta-analysis of randomized controlled trials. International journal of cardiology 189, pp. 47–55. Cited by: §1.
 An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In Advances in neural information processing systems, pp. 1433–1440. Cited by: §2.
 Learning from strategic agents: accuracy, improvement, and causality. arXiv preprint arXiv:2002.10066. Cited by: §1.1.
 Extracting incentives from black-box decisions. arXiv preprint arXiv:1910.05664. Cited by: §3.
 Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90 (2), pp. 227–244. Cited by: §1, §2.2, Assumption 2.
 Credit risk scorecards: developing and implementing intelligent credit scoring. Vol. 3, John Wiley & Sons. Cited by: §1.
 Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571. Cited by: §1.1.
 Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980. Cited by: §1.1.
 Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pp. 1433–1440. Cited by: §1.
 Optimal decision making under strategic behavior. arXiv preprint arXiv:1905.09239. Cited by: §1.1.
 Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems, pp. 6414–6425. Cited by: §1.1.
 Supersparse linear integer models for optimized medical scoring systems. Machine Learning 102 (3), pp. 349–391. Cited by: §1.
 Machine learning in manufacturing: advantages, challenges, and applications. Production & Manufacturing Research 4 (1), pp. 23–45. Cited by: §1.
Appendix A Pseudocode
Our algorithm alternates between optimizing the three components of the framework: a predictive model, a propensity model, and an uncertainty model. Here we give pseudocode for the following per-component objectives:

A predictive model, optimizing the squared loss.

A propensity weight model, optimizing the log-loss.

An uncertainty interval model, optimizing the quantile loss.

Note that other objectives can be plugged in. The pseudocode is given below.
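As an illustration of the propensity objective only (this is not the algorithm listing referred to above), the following minimal sketch fits a logistic propensity model with the log-loss and converts its output into importance weights. It assumes that the model is trained to distinguish original points from post-decision points and that weights are formed as h / (1 - h), a common construction under covariate shift. The function names, the synthetic data, and the shift of 0.5 are hypothetical choices for illustration, not details taken from the experiments.

```python
import numpy as np

def add_intercept(X):
    """Append a constant column so the model has a bias term."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_logistic(X, z, lr=0.1, epochs=500):
    """Logistic regression fit by gradient descent on the mean log-loss."""
    Xb = add_intercept(X)
    theta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        theta -= lr * Xb.T @ (p - z) / len(z)  # gradient of the mean log-loss
    return theta

def propensity_weights(X, X_shifted, lr=0.1, epochs=500):
    """Weights h / (1 - h) for the original points (assumed construction)."""
    data = np.vstack([X, X_shifted])
    # Label 0 marks original points, label 1 marks post-decision points.
    z = np.concatenate([np.zeros(len(X)), np.ones(len(X_shifted))])
    theta = fit_logistic(data, z, lr=lr, epochs=epochs)
    h = 1.0 / (1.0 + np.exp(-add_intercept(X) @ theta))
    h = np.clip(h, 1e-6, 1.0 - 1e-6)  # guard against division by zero
    return h / (1.0 - h)

# Hypothetical data: every point moves by +0.5 after the decision.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))
X_shifted = X + 0.5
w = propensity_weights(X, X_shifted)
```

Points that lie in the direction of the shift receive larger weights, so a model trained on the weighted original sample emphasizes the region where post-decision points are likely to fall.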
Appendix B Uncertainty models
Here we describe the two uncertainty methods used in our paper and how they apply to our setting.
B.1 Bootstrapping
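Bootstrapping produces uncertainty intervals by combining the outputs of a collection of $k$ models $g^{(1)}, \dots, g^{(k)}$, each trained independently for prediction on a random subset of the data. There are many approaches to bootstrapping, and here we describe two:

Vanilla bootstrapping: Each $g^{(i)}$ is trained using a predictive objective (e.g., squared loss) on a sample set $S^{(i)}$ of $m$ pairs $(x_j, y_j)$ that are sampled with replacement from the training set $S$. The submodels are then combined using:
$$g(x) = \left[\, \mu(x) - z_\tau\, \sigma(x),\;\; \mu(x) + z_\tau\, \sigma(x) \,\right]$$
where:
$$\mu(x) = \frac{1}{k} \sum_{i=1}^{k} g^{(i)}(x),$$
$\sigma(x)$ is the standard deviation of the submodel predictions $g^{(1)}(x), \dots, g^{(k)}(x)$, and $z_\tau$ is the z-score corresponding to the confidence parameter $\tau$ under a normal distribution.

Bootstrapping residuals: First, a predictive model $\hat{f}$ is fit to the data, and residuals $r_j = y_j - \hat{f}(x_j)$ are computed. Then, each $g^{(i)}$ is trained on the original sample data but with ground-truth labels replaced with random pseudo-labels:
$$\tilde{y}_j = \hat{f}(x_j) + \tilde{r}_j$$
where the $\tilde{r}_j$ are sampled with replacement from $\{r_1, \dots, r_m\}$.

In our framework, because $g$ must apply to the post-decision distribution $p'$, each $g^{(i)}$ is trained with propensity weights $w$. To account for cases where $p$ and $p'$ differ, the $g^{(i)}$ are trained not on sample sets of size $m$, but rather of size $m_{\mathrm{eff}}$, where $m_{\mathrm{eff}}$ is the effective sample size Kong et al. [1994] given by:
$$m_{\mathrm{eff}} = \frac{\left(\sum_{j=1}^{m} w_j\right)^2}{\sum_{j=1}^{m} w_j^2}$$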
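The weighted bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: submodels are linear least-squares fits, propensity weights enter as per-sample weights in each fit, the effective sample size is taken to be the standard form (sum of weights) squared over the sum of squared weights, and the z-score is two-sided. All function names and the synthetic data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def effective_sample_size(w):
    """Effective sample size (sum w)^2 / sum(w^2) (assumed form)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def fit_weighted_linear(X, y, w):
    """Weighted least squares with an intercept, via row scaling by sqrt(w)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    s = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Xb * s[:, None], y * s, rcond=None)
    return coef

def bootstrap_interval(X, y, w, X_query, k=20, tau=0.95, seed=0):
    """Vanilla bootstrap interval mu +/- z * sigma from k weighted submodels."""
    rng = np.random.default_rng(seed)
    # Each submodel sees m_eff resampled points instead of the full m.
    m_eff = max(2, int(round(effective_sample_size(w))))
    Xq = np.hstack([X_query, np.ones((len(X_query), 1))])
    preds = np.empty((k, len(X_query)))
    for i in range(k):
        idx = rng.integers(0, len(X), size=m_eff)  # sample with replacement
        preds[i] = Xq @ fit_weighted_linear(X[idx], y[idx], w[idx])
    mu = preds.mean(axis=0)
    sigma = preds.std(axis=0, ddof=1)
    z = NormalDist().inv_cdf(0.5 + tau / 2.0)  # two-sided z-score (assumption)
    return mu - z * sigma, mu + z * sigma

# Hypothetical data: linear outcome, weights that favor larger x.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=200)
w = np.exp(0.5 * X[:, 0])
X_query = np.array([[0.0], [2.0]])
lower, upper = bootstrap_interval(X, y, w, X_query)
```

When the weights are uniform the effective sample size equals the number of samples, and it shrinks as the weights concentrate on fewer points, which in turn widens the resulting intervals.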
B.2 Quantile regression
Quantile regression is a learning framework for training models to predict the $\tau$-th quantile of the conditional label distribution $p(y \mid x)$. Just as training with the squared loss is aimed at predicting the mean of $p(y \mid x)$, training with the absolute loss is aimed at the median. Quantile regression generalizes the absolute loss by considering a 'tilted' variant with slopes $\tau$ and $\tau - 1$:
$$L_\tau(y, \hat{y}) = \max\left\{ \tau\,(y - \hat{y}),\; (\tau - 1)\,(y - \hat{y}) \right\}$$
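To see why the tilted loss targets a quantile, the sketch below minimizes it over a constant prediction and compares the minimizer with the empirical quantile. The loss is written in the standard pinball form with slopes tau and tau - 1, which is assumed to match the tilted variant described above. The function name, the candidate grid, and the synthetic data are illustrative.

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Mean tilted absolute loss with slopes tau and tau - 1."""
    u = y - y_hat
    return np.mean(np.maximum(tau * u, (tau - 1.0) * u))

rng = np.random.default_rng(0)
y = rng.normal(size=2000)

# Search over constant predictions on a fine grid.
candidates = np.linspace(-3.0, 3.0, 601)
minimizers = {}
for tau in (0.5, 0.9):
    losses = [pinball_loss(y, c, tau) for c in candidates]
    minimizers[tau] = candidates[int(np.argmin(losses))]

# Empirical quantiles of the same sample, for comparison.
empirical = {tau: float(np.quantile(y, tau)) for tau in (0.5, 0.9)}
```

With tau = 0.5 the loss is half the absolute loss and the minimizer is the sample median. Fitting one model at a low quantile and one at a high quantile then yields the two endpoints of an interval.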
Appendix C Experimental details
C.1 Experiment 1: Quadratic curves
Here we set . and include quadratic functions, and to include linear functions. For uncertainty estimation we used vanilla bootstrap, and for propensity scores we used logistic regression. For lookahead, we set , , use bootstrapped models, and train for rounds. The data includes samples drawn from , and where . We use a train-test split. The three conditions vary only in with values , and .
Quantitative results are given in the table below:
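| Condition | Method | RMSE | Imp. rate | Imp. mag. |
|---|---|---|---|---|
| 1 | baseline | 0.349 | 0.857 | 1.109 |
| 1 | lookahead | 0.351 | 0.857 | 1.108 |
| 2 | baseline | 0.342 | 0.143 | 0.261 |
| 2 | lookahead | 0.424 | 0.714 | 1.065 |
| 3 | baseline | 0.342 | 0 | 35.13 |
| 3 | lookahead | 0.675 | 0.571 | 0.604 |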
C.2 Experiment 2: Wine quality
The wine dataset includes examples and features. We learn a quadratic . , , and include linear functions. For uncertainty estimation we used residuals bootstrap, and for propensity scores we used logistic regression. For lookahead, we set , use bootstrapped models, and train for rounds. For , we use SGD with a learning rate of 0.1 and 1000 epochs for initialization and 100 additional epochs per round. For , each submodel was trained with SGD using a learning rate of 0.1 and for 500 epochs. We set and for the fully and partially mutable settings, respectively.
C.3 Experiment 3: Diabetes
The diabetes dataset includes examples and features. We set to be a generalized additive model (GAM) with splines of degree 10 trained on the entire dataset and tuned using cross-validation. In the first setting, , , and include linear functions. In the second setting, , are quadratic functions (i.e., linear in and in ) and remains linear. For uncertainty estimation we used quantile regression, and for propensity scores we used logistic regression. For lookahead, we set and train for rounds. For , we use SGD with a learning rate of 0.05 and 1000 epochs for initialization and 100 additional epochs per round. For , we use SGD with a learning rate of 0.05 and for 500 epochs. For both linear and nonlinear settings we set , and normalize to be in .