Counterfactual Explanations in Sequential Decision Making Under Uncertainty

Stratis Tsirtsis et al. · Max Planck Institute for Software Systems · IIT Bombay · 07/06/2021

Methods to find counterfactual explanations have predominantly focused on one-step decision making processes. In this work, we initiate the development of methods to find counterfactual explanations for decision making processes in which multiple, dependent actions are taken sequentially over time. We start by formally characterizing a sequence of actions and states using finite horizon Markov decision processes and the Gumbel-Max structural causal model. Building upon this characterization, we formally state the problem of finding counterfactual explanations for sequential decision making processes. In our problem formulation, the counterfactual explanation specifies an alternative sequence of actions differing in at most k actions from the observed sequence that could have led the observed process realization to a better outcome. Then, we introduce a polynomial time algorithm based on dynamic programming to build a counterfactual policy that is guaranteed to always provide the optimal counterfactual explanation on every possible realization of the counterfactual environment dynamics. We validate our algorithm using both synthetic and real data from cognitive behavioral therapy and show that the counterfactual explanations our algorithm finds can provide valuable insights to enhance sequential decision making under uncertainty.


1 Introduction

In recent years, there has been a rising interest in the potential of machine learning models to assist and enhance decision making in high-stakes applications such as justice, education and healthcare [kleinberg2018human, zhu2015machine, ahmad2018interpretable]. In this context, machine learning models and algorithms have been used not only to predict the outcome of a decision making process from a set of observable features but also to find what would have to change in (some of) the features for the outcome of a specific process realization to change. For example, in loan decisions, a bank may use machine learning both to estimate the probability that a customer repays a loan and to find by how much a customer may need to reduce her credit card debt to increase the probability of repayment over a given threshold. Here, our focus is on the latter, which has often been referred to as counterfactual explanations.

The rapidly growing literature on counterfactual explanations has predominantly focused on one-step decision making processes such as the one described above [verma2020counterfactual, karimi2020survey]. In such settings, the probability that an outcome occurs is typically estimated using a supervised learning model and finding counterfactual explanations reduces to a search problem across the space of features and model predictions [wachter2017counterfactual, ustun2019actionable, karimi2020model, mothilal2020explaining, rawal2020beyond]. Moreover, it has been argued that, to obtain counterfactual explanations that are actionable and have the predicted effect on the outcome, one should favor causal models [karimi2020algorithmic, karimi2021algorithmic, tsirtsis2020decisions, madumal2020explainable]. In this work, our goal is instead to find counterfactual explanations for decision making processes in which multiple, dependent actions are taken sequentially over time. In this setting, the (final) outcome of the process depends on the overall sequence of actions and the counterfactual explanation specifies an alternative sequence of actions differing in at most $k$ actions from the observed sequence that could have led the process realization to a better outcome. For example, in medical treatment, assume a physician takes a sequence of actions to treat a patient but the patient's prognosis does not improve; then, a counterfactual explanation would help the physician understand how a small number of actions taken differently could have improved the patient's prognosis. However, since there is (typically) uncertainty about the counterfactual dynamics of the environment, there may be a different counterfactual explanation that is optimal under each possible realization of the counterfactual dynamics. Moreover, we focus on decision making processes where the state and action spaces are discrete and low-dimensional since, in many realistic scenarios, a decision maker needs to decide among a small number of actions on the basis of a few observable covariates, which are often discretized into (percentile) ranges.

The work most closely related to ours, which lies within the area of machine learning for healthcare, has achieved early success at using machine learning to enhance sequential decision making [parbhoo2017combining, guez2008adaptive, komorowski2018artificial, buesing2018woulda]. However, rather than finding counterfactual explanations, this line of work has focused on using reinforcement learning to design better treatment policies. A recent notable exception is by Oberst and Sontag [oberst2019counterfactual], which introduces an off-policy evaluation procedure for highlighting specific realizations of a sequential decision making process where a policy trained using reinforcement learning would have led to a substantially different outcome. We see this work, whose modeling framework we build upon, as complementary to ours.

Our contributions. We start by formally characterizing a sequence of (discrete) actions and (discrete) states using finite horizon Markov decision processes (MDPs). Here, we model the transition probabilities between a pair of states, given an action, using the Gumbel-Max structural causal model [oberst2019counterfactual]. This model has been shown to have a desirable counterfactual stability property and, given a sequence of actions and states, it allows us to reliably estimate the counterfactual outcome under an alternative sequence of actions. Building upon this characterization, we formally state the problem of finding the counterfactual explanation for an observed realization of a sequential decision making process as a constrained search problem over the set of alternative sequences of actions differing in at most $k$ actions from the observed sequence. Then, we present a polynomial time algorithm based on dynamic programming that finds a counterfactual policy that is guaranteed to always provide the optimal counterfactual explanation on every possible realization of the counterfactual transition probability induced by the observed process realization. Finally, we validate our algorithm using both synthetic and real data from cognitive behavioral therapy and show that the counterfactual explanations can provide valuable insights to enhance sequential decision making under uncertainty.

Further related work. Our work also builds upon further related work on interpretable machine learning and counterfactual inference. In addition to the work on counterfactual explanations for one-step decision making processes discussed previously [verma2020counterfactual, karimi2020survey], there is also a popular line of work focused on feature-based explanations [ribeiro2016should, koh2017understanding, lundberg2017unified]. Feature-based explanations highlight the importance each feature has on a particular prediction by a model, typically through local approximation. The literature on counterfactual inference has a long history [imbens2015causal]; however, it has primarily focused on estimating quantities related to the counterfactual distribution of interest such as, e.g., the conditional average treatment effect (CATE).

2 Characterizing Sequential Decision Making using Causal Markov Decision Processes

Our starting point is the following stylized setting, which resembles a variety of real-world sequential decision making processes. At each time step $t = 0, \ldots, T-1$, the decision making process is characterized by a state $s_t \in \mathcal{S}$, where $\mathcal{S}$ is a space of states, an action $a_t \in \mathcal{A}$, where $\mathcal{A}$ is a space of actions, and a reward $r_t \in \mathbb{R}$. Moreover, given a realization $\tau = \{(s_t, a_t, r_t)\}_{t=0}^{T-1}$ of a decision making process, the outcome $o(\tau)$ of the decision making process is given by the sum of the rewards.

Given the above setting, we characterize the relationship between actions, states and outcomes using finite horizon Markov decision processes (MDPs). More specifically, we consider an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, T)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the set of actions, $P$ denotes the transition probability $P(s_{t+1} = s' \mid s_t = s, a_t = a)$, $R$ denotes the immediate reward $R(s, a)$, and $T$ is the time horizon. While this characterization is helpful to make predictions about future states and design action policies [sutton2018reinforcement], it is not sufficient to make counterfactual predictions, i.e., given a realization $\tau$ of a decision making process, we cannot know what would have happened if, instead of taking action $a_t$ at time $t$, we had taken a different action $a'_t$. To overcome this limitation, we will now augment the above characterization using a particular class of structural causal models (SCMs) [peters2017elements] with desirable properties.
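To keep the sketches that follow concrete, the following minimal illustration (our own, not the authors' code; all class and field names are assumptions) shows one way to represent the MDP $\mathcal{M}$ and an observed realization $\tau$ in Python:

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class FiniteHorizonMDP:
    """Finite-horizon MDP M = (S, A, P, R, T) with discrete states and actions."""
    n_states: int          # |S|
    n_actions: int         # |A|
    P: np.ndarray          # transition probabilities, shape (|S|, |A|, |S|)
    R: np.ndarray          # immediate rewards, shape (|S|, |A|)
    horizon: int           # time horizon T


@dataclass
class Realization:
    """Observed realization tau = {(s_t, a_t)} of the decision making process."""
    states: List[int]      # s_0, ..., s_{T-1}
    actions: List[int]     # a_0, ..., a_{T-1}

    def outcome(self, R: np.ndarray) -> float:
        # The outcome o(tau) is the sum of the immediate rewards along the realization.
        return float(sum(R[s, a] for s, a in zip(self.states, self.actions)))
```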

Let $\mathcal{C}$ be a structural causal model defined by the assignments

$S_{t+1} := f_S(S_t, A_t, U_t), \qquad A_t := f_A(S_t, V_t)$   (1)

where $U_t$ and $V_t$ are (multi-dimensional) independent noise variables and $f_S$ and $f_A$ are two given functions, and let $P_{\mathcal{C}}$ be the distribution entailed by $\mathcal{C}$. Then, as argued by Buesing et al. [buesing2018woulda], we can always find a distribution for the noise variables $U_t$ and a function $f_S$ so that the transition probability of the MDP of interest is given by the following interventional distribution over the SCM $\mathcal{C}$:

$P(s_{t+1} = s' \mid s_t = s, a_t = a) = P_{\mathcal{C};\, do(A_t := a)}\big(S_{t+1} = s' \mid S_t = s\big)$   (2)

where $do(A_t := a)$ denotes a (hard) intervention in which the second assignment in Eq. 1 is replaced by the value $a$.

Under this view, given an observed realization $\tau = \{(s_t, a_t)\}_{t=0}^{T-1}$ of a decision making process, we can compute the posterior distribution of each noise variable $U_t$ and, building on the conditional density function of this posterior distribution, which we denote as $p(u \mid \tau)$, we can define a (non-stationary) counterfactual transition probability

$P_{\tau, t}(s_{t+1} = s' \mid s_t = s, a_t = a) = \int P_{\mathcal{C};\, do(A_t := a)}\big(S_{t+1} = s' \mid S_t = s, U_t = u\big)\, p(u \mid \tau)\, du$   (3)

where the noise posterior $p(u \mid \tau)$ can be used without further conditioning on $S_t = s$ because $U_t$ and $S_t$ are independent in the modified SCM. Importantly, the above counterfactual transition probability allows us to make counterfactual predictions, i.e., given that, at time $t$, the state was $s_t$ and, at time $t+1$, the process transitioned to state $s_{t+1}$ after taking action $a_t$, what would have been the probability of transitioning to state $s'$ after taking action $a$ if the state had been $s$ at time $t$.

However, for state variables taking discrete values, the posterior distribution of the noise may be non-identifiable without further assumptions, as argued by Oberst and Sontag [oberst2019counterfactual]. This is because there may be several noise distributions and functions $f_S$ which give interventional distributions consistent with the MDP's transition probabilities but result in different counterfactual transition distributions. To avoid these non-identifiability issues, we follow Oberst and Sontag and restrict our attention to the class of Gumbel-Max SCMs, i.e.,

$S_{t+1} := \operatorname{argmax}_{s' \in \mathcal{S}} \big\{ \log P(s_{t+1} = s' \mid S_t, A_t) + g_{s'} \big\}, \qquad g_{s'} \sim \text{Gumbel}(0, 1)$ i.i.d.   (4)

where the $g_{s'}$ are independent standard Gumbel noise variables and $P$ is the transition probability of the MDP. More specifically, this class of SCMs has been shown to satisfy a desirable counterfactual stability property, which goes intuitively as follows. Assume that, at time $t$, the process transitioned from state $s$ to state $s'$ after taking action $a$. Then, in a counterfactual scenario, it is unlikely that, at time $t$, the process would transition from the state $s$ to a different state $s''$ after taking an alternative action $a'$ if $a'$ does not increase the relative likelihood of $s''$ compared to that of $s'$.

In words, transitioning to a state other than $s'$—the factual one—is unlikely unless one chooses an action that decreases the relative chances of $s'$ compared to the other states. More formally, given the observed transition from $s$ to $s'$ under action $a$, then for all $s'' \in \mathcal{S}$ with $s'' \neq s'$, the condition

$\dfrac{P(s_{t+1} = s' \mid s_t = s, a_t = a')}{P(s_{t+1} = s' \mid s_t = s, a_t = a)} \;\ge\; \dfrac{P(s_{t+1} = s'' \mid s_t = s, a_t = a')}{P(s_{t+1} = s'' \mid s_t = s, a_t = a)}$

implies that the counterfactual probability of transitioning to $s''$ under action $a'$ is zero. In practice, in addition to solving the non-identifiability issues, the use of Gumbel-Max SCMs allows for an efficient procedure to sample from the corresponding noise posterior distribution $p(u \mid \tau)$, described elsewhere [oberst2019counterfactual, gumbelmachinery], and, given a set of $m$ samples from the noise posterior distribution, we can compute an unbiased finite sample Monte-Carlo estimator for the counterfactual transition probability, as defined in Eq. 3, as follows:

$\hat{P}_{\tau, t}(s_{t+1} = s' \mid s_t = s, a_t = a) = \dfrac{1}{m} \sum_{i=1}^{m} \mathbb{1}\big[ s' = \operatorname{argmax}_{\bar{s} \in \mathcal{S}} \{ \log P(s_{t+1} = \bar{s} \mid s_t = s, a_t = a) + g^{(i)}_{\bar{s}} \} \big], \quad g^{(i)} \sim p(u \mid \tau)$   (5)
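To make the estimator in Eq. 5 concrete, the sketch below (our own illustration, not the authors' code) estimates the counterfactual transition probability of a Gumbel-Max SCM. For simplicity, it samples the posterior Gumbel noise by naive rejection sampling, keeping only noise draws that reproduce the observed transition, whereas the paper relies on the more efficient samplers described in [oberst2019counterfactual, gumbelmachinery]; the function names and array shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_posterior_gumbels(p_obs: np.ndarray, s_next_obs: int, n_samples: int) -> np.ndarray:
    """Sample Gumbel noise vectors consistent with the observed transition (rejection sampling).

    Assumes the observed transition has non-zero probability under p_obs.
    """
    log_p = np.log(np.clip(p_obs, 1e-12, None))    # avoid log(0) for impossible transitions
    samples = []
    while len(samples) < n_samples:
        g = rng.gumbel(size=p_obs.shape[0])        # one i.i.d. Gumbel(0, 1) draw per next state
        if np.argmax(log_p + g) == s_next_obs:     # keep draws that reproduce the factual outcome
            samples.append(g)
    return np.array(samples)


def counterfactual_transition(P: np.ndarray, s_obs: int, a_obs: int, s_next_obs: int,
                              s_cf: int, a_cf: int, n_samples: int = 1000) -> np.ndarray:
    """Monte-Carlo estimate of P_tau(S_{t+1} = . | S_t = s_cf, A_t = a_cf), in the spirit of Eq. 5.

    P has shape (|S|, |A|, |S|); (s_obs, a_obs, s_next_obs) is the factual transition at time t.
    """
    gumbels = sample_posterior_gumbels(P[s_obs, a_obs], s_next_obs, n_samples)
    log_p_cf = np.log(np.clip(P[s_cf, a_cf], 1e-12, None))
    counts = np.zeros(P.shape[2])
    for g in gumbels:
        counts[np.argmax(log_p_cf + g)] += 1.0     # counterfactual next state under the same noise
    return counts / n_samples
```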

3 Counterfactual Explanations in Markov Decision Processes

Inspired by previous work on counterfactual explanations in supervised learning [wachter2017counterfactual, ustun2019actionable], we focus on counterfactual outcomes that could have occurred if the alternative action sequence was "close" to the observed one. However, since in our setting there is uncertainty about the counterfactual dynamics of the environment, we will look for a non-stationary counterfactual policy that, under every possible realization of the counterfactual transition probability defined in Eq. 3, is guaranteed to provide the optimal alternative sequence of actions differing in at most $k$ actions from the observed one.

More specifically, let $\tau = \{(s_t, a_t)\}_{t=0}^{T-1}$ be an observed realization of a decision making process characterized by a Markov decision process (MDP) $\mathcal{M}$ with a transition probability defined via a Gumbel-Max structural causal model (SCM), as described in Section 2. Then, to characterize the effect that any alternative action sequence would have had on the outcome of the above process realization, we start by building a non-stationary counterfactual MDP $\mathcal{M}^+ = (\mathcal{S}^+, \mathcal{A}, P^+_\tau, R^+, T)$. Here, $\mathcal{S}^+ = \mathcal{S} \times \{0, \ldots, k\}$ is an enhanced state space such that each $(s, c) \in \mathcal{S}^+$ corresponds to a pair indicating that the original decision making process would have been at state $s$ having already taken $c$ actions differently from the observed sequence. Following this definition, $R^+$ denotes the reward function, which we define as $R^+((s, c), a) = R(s, a)$ for any $a \in \mathcal{A}$ and $(s, c) \in \mathcal{S}^+$, i.e., the counterfactual rewards remain independent of the number of modifications in the action sequence. Lastly, let $P_\tau$ be the counterfactual transition probability, as defined by Eq. 3. Then, the transition probability for the enhanced state space is defined as:

$P^+_{\tau, t}\big((s', c') \mid (s, c), a\big) = P_{\tau, t}(s' \mid s, a)\; \mathbb{1}\big[c' = c + \mathbb{1}[a \neq a_t]\big]$   (6)

where note that the dynamics of the original states are equivalent under both $P_\tau$ and $P^+_\tau$; however, under $P^+_\tau$, we also keep track of the number of actions differing from the observed actions. Now, let $\pi = (\pi_0, \ldots, \pi_{T-1})$, with $\pi_t : \mathcal{S}^+ \to \mathcal{A}$, be a policy that deterministically decides the counterfactual action that should have been taken if the process's enhanced state at time $t$ had been $(s, c)$, i.e., the counterfactual state at time $t$ was $s$ after performing $c$ action changes. Then, given such a counterfactual policy $\pi$, we can compute the corresponding average counterfactual outcome as follows:

$\bar{o}(\pi \mid \tau) = \mathbb{E}\big[ \sum_{t=0}^{T-1} R^+\big((s'_t, c_t), \pi_t(s'_t, c_t)\big) \big]$   (7)

where $\{((s'_t, c_t), a'_t)\}_{t=0}^{T-1}$ is a realization of the non-stationary counterfactual MDP with $(s'_0, c_0) = (s_0, 0)$ and the expectation is taken over all the realizations induced by the transition probability $P^+_\tau$ and the policy $\pi$. Here, note that, if $\pi_t(s, c) = a_t$ for all $t$, then $\bar{o}(\pi \mid \tau)$ matches the outcome of the observed realization.
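As an illustration of the construction in Eq. 6, the sketch below (our own, reusing the array conventions from the earlier snippets) builds the enhanced transition probability at a given time step, where each enhanced state pairs the original state with a counter of action changes:

```python
import numpy as np


def enhanced_transition(P_tau_t: np.ndarray, observed_action: int, k_max: int) -> np.ndarray:
    """Build P^+_t((s', c') | (s, c), a) from the counterfactual transition P_tau at step t.

    P_tau_t: counterfactual transition at time t, shape (|S|, |A|, |S|).
    observed_action: the action a_t actually taken at time t.
    k_max: maximum number of action changes tracked by the counter c.
    """
    n_s, n_a, _ = P_tau_t.shape
    n_c = k_max + 1
    P_plus = np.zeros((n_s, n_c, n_a, n_s, n_c))
    for s in range(n_s):
        for c in range(n_c):
            for a in range(n_a):
                c_next = c + (0 if a == observed_action else 1)  # one more change if a differs from a_t
                if c_next <= k_max:      # changes beyond the budget are ruled out by the policy
                    P_plus[s, c, a, :, c_next] = P_tau_t[s, a, :]
    return P_plus
```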

Input: counterfactual policy $\pi$, horizon $T$, counterfactual transition probability $P_\tau$, reward function $R$, initial state $s_0$.
Initialize: $s'_0 \leftarrow s_0$, $c_0 \leftarrow 0$, $o \leftarrow 0$
for $t = 0, \ldots, T-1$ do
       $a'_t \leftarrow \pi_t(s'_t, c_t)$
       $o \leftarrow o + R(s'_t, a'_t)$
       if $a'_t \neq a_t$ then
             $c_{t+1} \leftarrow c_t + 1$
       else
             $c_{t+1} \leftarrow c_t$
       end if
       $s'_{t+1} \sim P_{\tau, t}(\cdot \mid s'_t, a'_t)$
end for
Return the counterfactual explanation $\{a'_t\}_{t=0}^{T-1}$, the counterfactual realization $\{(s'_t, a'_t)\}_{t=0}^{T-1}$ and its outcome $o$
ALGORITHM 1: It samples a counterfactual explanation from the counterfactual policy $\pi$
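A minimal Python sketch of this sampling procedure (our own rendering of Algorithm 1, not the authors' code), assuming the counterfactual policy is given as a function pi(t, s, c) and the per-step counterfactual transitions as a list of arrays:

```python
import numpy as np


def sample_counterfactual(pi, P_tau, R, s0, observed_actions, rng=np.random.default_rng(1)):
    """Sample one counterfactual realization and its outcome from the counterfactual policy pi.

    pi(t, s, c) -> action; P_tau[t] has shape (|S|, |A|, |S|) and each row sums to one;
    R has shape (|S|, |A|); observed_actions holds a_0, ..., a_{T-1}.
    """
    T = len(observed_actions)
    s, c, outcome = s0, 0, 0.0
    states, actions = [s0], []
    for t in range(T):
        a = pi(t, s, c)                              # counterfactual action suggested by the policy
        outcome += R[s, a]
        if a != observed_actions[t]:                 # one more action changed w.r.t. the observed sequence
            c += 1
        s = int(rng.choice(P_tau[t].shape[2], p=P_tau[t][s, a]))  # next state under counterfactual dynamics
        states.append(s)
        actions.append(a)
    return states, actions, outcome
```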

Then, our goal is to find the optimal counterfactual policy $\pi^*$ that maximizes the average counterfactual outcome subject to a constraint on the number of counterfactual actions that can differ from the observed ones, i.e.,

$\pi^* = \operatorname{argmax}_{\pi}\ \bar{o}(\pi \mid \tau) \quad \text{subject to} \quad \sum_{t=0}^{T-1} \mathbb{1}\big[a'_t \neq a_t\big] \le k$ for every realization $\{(s'_t, a'_t)\}_{t=0}^{T-1}$ induced by $P_\tau$ and $\pi$   (8)

where $\{a'_t\}_{t=0}^{T-1}$ is one realization of counterfactual actions and $\{a_t\}_{t=0}^{T-1}$ are the observed actions. The constraint guarantees that any counterfactual action sequence induced by the counterfactual transition probability $P_\tau$ and the counterfactual policy $\pi$ can differ in at most $k$ actions from the observed sequence. Finally, once we have found the optimal policy $\pi^*$, we can sample a counterfactual realization of the process and the corresponding optimal counterfactual explanation using Algorithm 1.

4 Finding Optimal Counterfactual Explanations via Dynamic Programming

To solve the problem defined by Eq. 8, we break it into several smaller sub-problems. Here, the key idea is to compute the counterfactual policy values that lead to the optimal counterfactual outcome recursively, by expanding the expectation and switching the order of the sums in Eq. 7.

We start by computing $V_t(s, j)$, the highest average cumulative reward that one could have achieved in the last $T - t$ steps of the decision making process, starting from state $s$ at time $t$, if at most $j$ actions had been different from the observed ones in those last steps. For $j \ge 1$, we proceed recursively:

$V_t(s, j) = \max\big\{ R(s, a_t) + \sum_{s' \in \mathcal{S}} P_{\tau, t}(s' \mid s, a_t)\, V_{t+1}(s', j),\ \max_{a \in \mathcal{A} \setminus \{a_t\}} \big[ R(s, a) + \sum_{s' \in \mathcal{S}} P_{\tau, t}(s' \mid s, a)\, V_{t+1}(s', j-1) \big] \big\}$   (9)

and, for $j = 0$, we trivially have that:

$V_t(s, 0) = R(s, a_t) + \sum_{s' \in \mathcal{S}} P_{\tau, t}(s' \mid s, a_t)\, V_{t+1}(s', 0)$   (10)

with the boundary condition $V_T(s, j) = 0$ for all $s \in \mathcal{S}$ and $j \in \{0, \ldots, k\}$. In Eq. 9, the first term of the outer maximization corresponds to the case where, at time $t$, the observed action $a_t$ is taken and the second term corresponds to the case where, instead of the observed action, the best alternative action is taken.

By definition, we can easily conclude that $V_0(s_0, k)$ is the average counterfactual outcome of the optimal counterfactual policy $\pi^*$, i.e., the objective value of the solution to the optimization problem defined by Eq. 8, and we can recover the optimal counterfactual policy by keeping track of the action chosen at each recursive step in Eqs. 9 and 10. The overall procedure, summarized by Algorithm 2 in Appendix A, uses dynamic programming—it first computes the values $V_{T-1}(s, j)$ for all $s \in \mathcal{S}$ and $j \in \{0, \ldots, k\}$ and then proceeds with the rest of the computations in a bottom-up fashion—and has complexity $O(k\, T\, |\mathcal{S}|^2\, |\mathcal{A}|)$. Finally, we have the following proposition (proven by induction in Appendix B):

Proposition 4. The counterfactual policy $\pi^*$ returned by Algorithm 2 is the solution to the optimization problem defined by Eq. 8.
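The following sketch (our own implementation outline, not the authors' released code) illustrates the dynamic program of Eqs. 9 and 10. It indexes values by the remaining budget of action changes, which is equivalent to tracking the number of changes already made:

```python
import numpy as np


def optimal_counterfactual_policy(P_tau, R, observed_actions, k: int):
    """Dynamic program for the optimal counterfactual policy, in the spirit of Eqs. 9-10.

    P_tau: list of length T of arrays with shape (|S|, |A|, |S|) (counterfactual transition per step).
    R: immediate rewards, shape (|S|, |A|).  observed_actions: a_0, ..., a_{T-1}.
    k: maximum number of actions allowed to differ from the observed ones.
    """
    T = len(observed_actions)
    n_s, n_a, _ = P_tau[0].shape
    V = np.zeros((T + 1, n_s, k + 1))              # V[T] = 0 terminates the recursion
    pi = np.zeros((T, n_s, k + 1), dtype=int)      # pi[t, s, b]: best action with budget b remaining
    for t in range(T - 1, -1, -1):
        a_obs = observed_actions[t]
        for s in range(n_s):
            for b in range(k + 1):
                # Option 1 (Eq. 10 / first term of Eq. 9): keep the observed action, budget unchanged.
                best_val = R[s, a_obs] + P_tau[t][s, a_obs] @ V[t + 1, :, b]
                best_act = a_obs
                # Option 2 (second term of Eq. 9): change the action, spending one unit of budget.
                if b > 0:
                    for a in range(n_a):
                        if a == a_obs:
                            continue
                        val = R[s, a] + P_tau[t][s, a] @ V[t + 1, :, b - 1]
                        if val > best_val:
                            best_val, best_act = val, a
                V[t, s, b] = best_val
                pi[t, s, b] = best_act
    return V, pi
```

Under these assumptions, V[0, s0, k] is the optimal average counterfactual outcome and the policy table can be plugged into the earlier sampling sketch via, e.g., lambda t, s, c: int(pi[t, s, k - c]), since c changes already made leave a budget of k - c.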

5 Experiments on Synthetic Data

In this section, we evaluate Algorithm 2 on realizations of a synthetic decision making process. (All experiments were performed on a machine equipped with 48 Intel(R) Xeon(R) 3.00GHz CPU cores and 1.5TB of memory.) To this end, we first look into the average outcome improvement that could have been achieved if at most $k$ actions had been different from the observed ones in every realization, as dictated by the optimal counterfactual policy. Then, we investigate to what extent the level of uncertainty of the decision making process influences the average counterfactual outcome achieved by the optimal counterfactual policy as well as the number of distinct counterfactual explanations it provides.

Experimental setup. We characterize the synthetic decision making process using an MDP with a discrete set of states $\mathcal{S}$, a discrete set of actions $\mathcal{A}$, and time horizon $T$. For each state $s$ and action $a$, we set the immediate reward $R(s, a)$ to be increasing in $s$, i.e., the higher the state, the higher the reward. To set the values of the transition probabilities $P(s' \mid s, a)$, we proceed as follows. First, for each state-action pair, we pick one state uniformly at random and assign it a dominant weight and then, for the remaining states, we sample smaller random weights. Next, we set each transition probability proportionally to its weight. It is easy to check that, for each state-action pair $(s, a)$ at time $t$, the randomly chosen dominant state is the one most likely to be observed at the next time step $t+1$. Here, the scale of the random weights acts as a parameter that controls the level of uncertainty.
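For illustration only, here is a rough sketch of such a generative procedure; the weight distribution, the uncertainty parameter and all numerical choices below are our own assumptions rather than the paper's exact setup:

```python
import numpy as np


def random_mdp_transitions(n_states: int, n_actions: int, uncertainty: float,
                           rng=np.random.default_rng(0)) -> np.ndarray:
    """Generate synthetic transition probabilities with a controllable level of uncertainty.

    For each (s, a), one randomly chosen next state gets a dominant weight and the remaining states
    get smaller random weights; `uncertainty` in (0, 1] controls how much mass leaks to the rest.
    """
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            preferred = rng.integers(n_states)              # the most likely next state
            w = uncertainty * rng.uniform(size=n_states)    # small random weights for all states
            w[preferred] = 1.0                              # dominant weight for the preferred state
            P[s, a] = w / w.sum()                           # normalize into a distribution
    return P
```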

Then, we compute the optimal policy that maximizes the average outcome of the decision making process by solving Bellman's equation using dynamic programming [sutton2018reinforcement] and use this policy to sample the (observed) realizations as follows. For each realization, we start from a random initial state and, at each time step $t$, we pick the action indicated by the optimal policy with high probability and a different action uniformly at random otherwise. This leads to action sequences that are slightly suboptimal in terms of average outcome. Finally, to compute the counterfactual transition probability $P_\tau$ for each observed realization $\tau$, we follow the procedure described in Section 2 with $m$ Monte Carlo samples for each noise posterior distribution.
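For completeness, a standard backward-induction sketch (our own, not the authors' code) of the finite-horizon Bellman recursion used to obtain such an optimal action policy:

```python
import numpy as np


def finite_horizon_optimal_policy(P: np.ndarray, R: np.ndarray, T: int) -> np.ndarray:
    """Backward induction for a finite-horizon MDP.

    P: transitions, shape (|S|, |A|, |S|); R: rewards, shape (|S|, |A|); T: horizon.
    Returns the optimal (non-stationary) policy as an array of shape (T, |S|).
    """
    n_s, n_a, _ = P.shape
    V = np.zeros(n_s)                      # value of the empty tail of the process
    policy = np.zeros((T, n_s), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = R + P @ V                      # Q[s, a] = R[s, a] + sum_{s'} P[s, a, s'] * V[s']
        policy[t] = Q.argmax(axis=1)       # greedy action at time t
        V = Q.max(axis=1)                  # value function one step earlier
    return policy
```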

Figure 1: Empirical distribution of the relative difference between the average counterfactual outcome $\bar{o}(\pi^* \mid \tau)$ achieved by the optimal counterfactual policy $\pi^*$ and the observed outcome $o(\tau)$, in a synthetic decision making process. Panels (a)-(c) correspond to different values of $k$; in all panels, the distributions are estimated using realizations from multiple instances of the decision making process.
Figure 2: Influence of the level of uncertainty of a synthetic decision making process on the average counterfactual outcome achieved by the optimal counterfactual policy $\pi^*$ and on the number of distinct counterfactual explanations it provides. Panel (a) shows the average counterfactual outcome against the level of uncertainty, panel (b) shows the relative difference with respect to the observed outcome against the level of uncertainty, and panel (c) shows the number of unique counterfactual explanations. In panel (c), for each realization, we sample multiple counterfactual realizations and compute the average number of unique explanations across realizations. In all panels, the results are estimated using realizations from multiple instances of the decision making process. Shaded regions correspond to confidence intervals.

Results. We first measure to what extent the counterfactual explanations provided by the optimal counterfactual policy $\pi^*$ would have improved the outcome of the decision making process. To this end, for each observed realization $\tau$, we compute the relative difference between the average optimal counterfactual outcome $\bar{o}(\pi^* \mid \tau)$ and the observed outcome $o(\tau)$. Figure 1 summarizes the results for different values of $k$. We find that the relative difference between the average counterfactual outcome and the observed outcome is always positive, i.e., the sequence of actions specified by the counterfactual explanations would have led the process realization to a better outcome in expectation. However, this may not come as a surprise given that the counterfactual policy is optimal, as shown in Proposition 4. Moreover, as the sequences of actions specified by the counterfactual explanations are allowed to differ more from the observed actions (i.e., as $k$ increases), the improvement in terms of expected outcome increases.

Next, we investigate how the level of uncertainty of the decision making process influences the average counterfactual outcome achieved by the optimal counterfactual policy $\pi^*$ as well as the number of distinct counterfactual explanations it provides. Figure 2 summarizes the results, which reveal several interesting insights. As the level of uncertainty increases, the average counterfactual outcome decreases, as shown in panel (a); however, the relative difference with respect to the observed outcome increases, as shown in panel (b). This suggests that, under a high level of uncertainty, the counterfactual explanations may be more valuable to a decision maker who aims to improve her actions over time. However, in this context, we also find that, under high levels of uncertainty, the number of distinct counterfactual explanations increases rapidly with $k$. As a result, it may be preferable to use relatively low values of $k$ to be able to effectively show the counterfactual explanations to a decision maker in practice.

6 Experiments on Real Data

In this section, we evaluate Algorithm 2 using real patient data from a series of cognitive behavioral therapy sessions. To this end, similarly to Section 5, we first measure the average outcome improvement that could have been achieved if at most $k$ actions had been different from the observed ones in every therapy session, as dictated by the optimal counterfactual policy. Then, we look into individual therapy sessions and showcase how Algorithm 2, together with Algorithm 1, can be used to highlight specific patients and actions of interest for closer inspection. (Our results should be interpreted in the context of our modeling assumptions and they do not suggest the existence of medical malpractice.) Appendix D contains additional experiments benchmarking the optimal counterfactual policy against several baselines. (We do not evaluate our algorithm against prior work on counterfactual explanations for one-step decision making processes since the settings are not directly comparable.)

Experimental setup. We use anonymized data from a clinical trial comparing the efficacy of hypnotherapy and cognitive behavioral therapy [fuhr2017efficacy] for the treatment of patients with mild to moderate symptoms of major depression. (All participants gave written informed consent and the study protocol was peer-reviewed [fuhr2021efficacy].) In our experiments, we use data from the patients who received manualized cognitive behavioral therapy, which is one of the gold standards in depression treatment. Among these patients, we discard four of them because they attended too few sessions. Each patient attended a series of weekly therapy sessions and, for each session, the dataset contains the topic of discussion, chosen by the therapist from a pre-defined set of topics (e.g., psychoeducation, behavioural activation, cognitive restructuring techniques). Additionally, a severity score is included, based on a standardized questionnaire [kroenke2001phq] filled in by the patient at each session, which assesses the severity of depressive symptoms. For more details about the severity score and the pre-defined set of discussion topics, refer to Appendix C.

To derive the counterfactual transition probability for each patient, we start by creating an MDP with a discrete set of states and actions. Each state corresponds to a severity score, where smaller numbers represent lower severity, and each action corresponds to a topic from the pre-defined list of topics that the therapists discussed during the sessions. Moreover, each realization of the MDP corresponds to the therapy sessions of a single patient in chronological order and the time horizon $T$ is the number of therapy sessions per patient. Here, we denote the set of realizations for all patients as $\mathcal{D}$.

Figure 3: Performance achieved by the optimal counterfactual policy $\pi^*$ in a series of real manualized cognitive behavioral therapy sessions, where each realization includes all the sessions of an individual patient in chronological order. Panel (a) shows the distribution of the relative difference between the average counterfactual outcome achieved by $\pi^*$ and the observed outcome. Panels (b) and (c) show the average counterfactual outcome achieved by $\pi^*$ and the average number of unique counterfactual explanations it provides, averaged across patients, against the number of actions $k$ allowed to differ from the observed ones. In panel (c), for each realization, the average number of unique counterfactual explanations provided by $\pi^*$ is estimated using multiple counterfactual realizations. Shaded regions correspond to confidence intervals.

In addition, to estimate the values of the transition probabilities, we proceed as follows. For every state-action pair $(s, a)$, we assume a $|\mathcal{S}|$-dimensional Dirichlet prior on the probabilities $P(\cdot \mid s, a)$, whose concentration parameters are larger for severity levels equal or adjacent to $s$ and smaller otherwise. Then, if we observe a number of transitions from state $s$ to each state $s'$ after action $a$ in the patients' therapy sessions $\mathcal{D}$, the posterior of the probabilities $P(\cdot \mid s, a)$ is again a Dirichlet distribution with updated concentration parameters. Finally, to estimate the value of the transition probabilities $P(s' \mid s, a)$, we take the average of samples drawn from this posterior. This procedure sets the value of the transition probabilities proportionally to the number of times they appeared in the data; however, it ensures that all transition probability values are non-zero and that transitions between adjacent severity levels are much more likely to happen. Moreover, we set the immediate reward $R(s, a)$ for a pair of state and action to be decreasing in the severity level, i.e., the lower the patient's severity level, the higher the reward. Here, if some state-action pair is never observed in the data, we set its immediate reward to a prohibitively low value. This ensures that those state-action pairs never appear in a realization induced by the optimal counterfactual policy. Finally, to compute the counterfactual transition probability $P_\tau$ for each realization $\tau$, we follow the procedure described in Section 2 with $m$ samples for each noise posterior distribution.
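As an illustration of this estimation step, the sketch below (our own) computes the posterior-mean transition probabilities from transition counts and a Dirichlet prior; the prior concentrations passed in stand in for the paper's values, which are not reproduced here:

```python
import numpy as np


def estimate_transitions_dirichlet(counts: np.ndarray, prior: np.ndarray,
                                   n_samples: int = 1000,
                                   rng=np.random.default_rng(2)) -> np.ndarray:
    """Estimate P(. | s, a) from observed transition counts via a Dirichlet posterior.

    counts: observed s -> s' transition counts under each action, shape (|S|, |A|, |S|).
    prior: strictly positive Dirichlet prior concentrations of the same shape (placeholder values).
    Returns posterior-mean transition probabilities, approximated by averaging posterior samples.
    """
    n_s, n_a, _ = counts.shape
    P_hat = np.zeros((n_s, n_a, n_s))
    for s in range(n_s):
        for a in range(n_a):
            posterior = prior[s, a] + counts[s, a]              # conjugate Dirichlet update
            samples = rng.dirichlet(posterior, size=n_samples)  # draws from the posterior
            P_hat[s, a] = samples.mean(axis=0)                  # average of posterior samples
    return P_hat
```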

Figure 4: Insights on the counterfactual explanations provided by the optimal counterfactual policy $\pi^*$ for one real patient who received manualized cognitive behavioral therapy. Panel (a) shows the distribution of the counterfactual outcomes for the counterfactual realizations induced by $\pi^*$. Panel (b) shows, for each time step, how frequently a counterfactual explanation changes the observed action, as well as the observed severity level and the severity level in the counterfactual realization with the highest counterfactual outcome; darker colors correspond to higher frequencies and higher severities. Panel (c) shows the action changes in the unique counterfactual explanations (green) provided by $\pi^*$, along with the mean counterfactual outcome (r) that each one achieves and how frequently (f) it appears across the counterfactual realizations; the bottom row shows the observed actions that were changed by at least one of the counterfactual explanations. Refer to Appendix C for a definition of the actions (i.e., topics). Results are estimated using multiple counterfactual realizations.

Results. We first measure to what extent the counterfactual explanations provided by the optimal counterfactual policy $\pi^*$ would have improved each patient's severity of depressive symptoms over time. To this end, for each observed realization $\tau$ corresponding to each patient, we compute the same quality metrics as in the experiments on synthetic data in Section 5. Figure 3 summarizes the results. Panel (a) reveals that, for most patients, the improvement in terms of relative difference between the average optimal counterfactual outcome and the observed outcome is rather modest. Moreover, panel (b) also shows that the average optimal counterfactual outcome, averaged across patients, does not increase significantly even if one allows for more changes in the sequence of observed actions. These findings suggest that, in retrospect, the choice of topics by most therapists in the sessions was almost optimal. That being said, for some of the patients, the average counterfactual outcome improves noticeably over the observed outcome and, as we will later discuss, there exist individual counterfactual realizations in which the counterfactual outcome improves much more. In that context, it is also important to note that, as shown in panel (c), the growth in the number of unique counterfactual explanations with respect to $k$ is weaker than the growth found in the experiments with synthetic data and, for the values of $k$ we consider, the number of unique counterfactual explanations remains small. This latter finding suggests that, in practice, it may be possible to effectively show, or summarize, the optimal counterfactual explanations, a possibility that we investigate next.

We focus on a patient for whom the average counterfactual outcome achieved by the optimal policy $\pi^*$ improves substantially over the observed outcome $o(\tau)$. Then, using the policy $\pi^*$ and the counterfactual transition probability $P_\tau$, we sample multiple counterfactual explanations using Algorithm 1 and look at each counterfactual outcome. Figure 4(a) summarizes the results, which show that, in most of these counterfactual realizations, the counterfactual outcome is greater than the observed outcome—if at most $k$ actions had been different from the observed ones, as dictated by the optimal policy, there is a high probability that the outcome would have improved. Next, we investigate to what extent there are specific time steps within the counterfactual realizations where $\pi^*$ is more likely to suggest an action change. Figure 4(b) shows that, for the patient under study, there are indeed a few time steps that are overrepresented in the optimal counterfactual explanations. Moreover, the first of these time steps is when the patient had started worsening their depression after an earlier period in which they showed signs of recovery. Remarkably, we find that, in the counterfactual realization with the best counterfactual outcome, the worsening is mostly avoided. Finally, we look closer into the actual action changes suggested by the optimal counterfactual policy $\pi^*$. Figure 4(c) summarizes the results, which reveal that $\pi^*$ recommends replacing some of the sessions on cognitive restructuring techniques (CRT) with behavioral activation (BHA) consistently across counterfactual realizations, particularly at the start of the worsening period. We discussed this recommendation with one of the researchers in clinical psychology who co-authored Fuhr et al. [fuhr2017efficacy] and she told us that, from a clinical perspective, such a recommendation is sensible since, whenever the severity of depressive symptoms is high, it is very challenging to apply CRT and it is instead quite common to use BHA. Appendix E contains additional insights about other patients in the dataset.

7 Conclusions

We have initiated the study of counterfactual explanations in decision making processes in which multiple, dependent actions are taken sequentially over time. Building on a characterization of sequential decision making processes using Markov decision processes and the Gumbel-Max structural causal model, we have developed an algorithm based on dynamic programming to find optimal counterfactual explanations in polynomial time. We have validated our algorithm using synthetic and real data from cognitive behavioral therapy and shown that the counterfactual explanations our algorithm finds can provide valuable insights to enhance sequential decision making under uncertainty.

Our work opens up many interesting avenues for future work. For example, we have considered discrete states and actions; it would be interesting to extend our work to continuous states and actions. Moreover, it would be valuable to validate our algorithm using real datasets from other (medical) applications. In this context, it would also be worth considering applications in which the true transition probabilities are induced by a machine learning algorithm. In our work, we have focused on sequential decision making processes that satisfy the Markov property; it would be interesting to lift this assumption and consider semi-Markov processes. Finally, it would be important to carry out a user study in which the counterfactual explanations found by our algorithm are shared with the human experts (e.g., therapists, physicians) who took the observed actions.

Acknowledgements. We would like to thank Kristina Fuhr and Anil Batra for giving us access to the cognitive behavioral therapy data that made this work possible. Tsirtsis and Gomez-Rodriguez acknowledge support from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 945719).

References

Appendix A Dynamic Programming Algorithm

Here, we provide a complete description of the algorithm, based on dynamic programming, that solves the optimization problem of Equation 8.

Input: States $\mathcal{S}$, actions $\mathcal{A}$, realization $\tau = \{(s_t, a_t)\}_{t=0}^{T-1}$, horizon $T$, counterfactual transition probability $P_\tau$, reward function $R$, constraint $k$.
Initialize: the last-step values $V_{T-1}(s, j)$ and actions $A_{T-1}(s, j)$ for all $s \in \mathcal{S}$ and $j \in \{0, \ldots, k\}$.
for $s \in \mathcal{S}$ do
       $V_{T-1}(s, 0) \leftarrow R(s, a_{T-1})$;  $A_{T-1}(s, 0) \leftarrow a_{T-1}$
       for $j = 1, \ldots, k$ do
             $V_{T-1}(s, j) \leftarrow \max_{a \in \mathcal{A}} R(s, a)$;  $A_{T-1}(s, j) \leftarrow \operatorname{argmax}_{a \in \mathcal{A}} R(s, a)$
       end for
end for
for $t = T-2, \ldots, 0$ do
       for $s \in \mathcal{S}$ do
             $V_t(s, 0) \leftarrow R(s, a_t) + \sum_{s' \in \mathcal{S}} P_{\tau, t}(s' \mid s, a_t)\, V_{t+1}(s', 0)$;  $A_t(s, 0) \leftarrow a_t$
             for $j = 1, \ldots, k$ do
                   $v_{\text{keep}} \leftarrow R(s, a_t) + \sum_{s' \in \mathcal{S}} P_{\tau, t}(s' \mid s, a_t)\, V_{t+1}(s', j)$
                   $v_{\text{change}} \leftarrow \max_{a \neq a_t} \big[ R(s, a) + \sum_{s' \in \mathcal{S}} P_{\tau, t}(s' \mid s, a)\, V_{t+1}(s', j-1) \big]$
                   if $v_{\text{change}} > v_{\text{keep}}$ then
                         $V_t(s, j) \leftarrow v_{\text{change}}$;  $A_t(s, j) \leftarrow \operatorname{argmax}_{a \neq a_t} \big[ R(s, a) + \sum_{s' \in \mathcal{S}} P_{\tau, t}(s' \mid s, a)\, V_{t+1}(s', j-1) \big]$
                   else
                         $V_t(s, j) \leftarrow v_{\text{keep}}$;  $A_t(s, j) \leftarrow a_t$
                   end if
             end for
       end for
end for
Set $\pi_t(s, c) \leftarrow A_t(s, k - c)$ for all $t \in \{0, \ldots, T-1\}$, $s \in \mathcal{S}$ and $c \in \{0, \ldots, k\}$.
Return the counterfactual policy $\pi$ and its average counterfactual outcome $V_0(s_0, k)$
ALGORITHM 2: It returns the optimal counterfactual policy and its average counterfactual outcome

Appendix B Proof of Proposition 4

Using induction, we will prove that the value $V_t(s, j)$ set by Algorithm 2 is optimal for every $t \in \{0, \ldots, T-1\}$, $s \in \mathcal{S}$ and $j \in \{0, \ldots, k\}$, in the sense that following the corresponding policy maximizes the average cumulative reward that one could have achieved in the last $T - t$ steps of the decision making process, starting from state $s$, if at most $j$ actions had been different from the observed ones in those last steps. Formally:

$V_t(s, j) = \max_{\pi} \ \mathbb{E}\big[ \sum_{t'=t}^{T-1} R(s'_{t'}, a'_{t'}) \ \big|\ s'_t = s \big]$   (11)
subject to $\sum_{t'=t}^{T-1} \mathbb{1}\big[a'_{t'} \neq a_{t'}\big] \le j$ for every realization $\{(s'_{t'}, a'_{t'})\}_{t'=t}^{T-1}$ induced by $P_\tau$ and $\pi$   (12)

Recall that $a_0, \ldots, a_{T-1}$ are the observed actions and the counterfactual realizations $\{(s'_{t'}, a'_{t'})\}$ are induced by the counterfactual transition probability $P_\tau$ and the policy $\pi$.

We start by proving the induction basis. Assume that a realization has reached a state $s$ at time $T - 1$, one time step before the end of the process. If $j = 0$ (i.e., no action changes remain available), following Equation 10, the algorithm will choose the observed action $a_{T-1}$ and return an average cumulative reward $R(s, a_{T-1})$, since $V_T(s', 0) = 0$ for all $s' \in \mathcal{S}$. Since no more action changes can be performed at this stage, this is the only feasible solution and therefore it is optimal.

If $j \ge 1$, since $V_T(s', j') = 0$ for all $s'$ and $j'$, it is easy to verify that Equation 9 reduces to $V_{T-1}(s, j) = \max_{a \in \mathcal{A}} R(s, a)$, which is obviously the optimal choice for the last time step.

Now, we will prove that, for a counterfactual realization being at state $s$ at a time step $t < T - 1$ with at most $j$ remaining action changes, the maximum average cumulative reward $V_t(s, j)$ given by Algorithm 2 is optimal, under the inductive hypothesis that the values $V_{t+1}(s', j')$ already computed for time $t+1$ and all $s'$ and $j'$ are optimal. Assume that the algorithm returns an average cumulative reward $V_t(s, j)$ by choosing action $a$ while the optimal solution gives a strictly higher average cumulative reward by choosing an action $a^*$. Here, by $\tau'$ we will denote realizations starting from time $t$ with $s'_t = s$, where the subsequent actions are given by the policy of Algorithm 2, and we will write $\tau^*$ if the policy is optimal. Also, we will denote a possible next state at time $t+1$, after performing action $a$, as $s'$, where the remaining budget stays at $j$ if $a = a_t$ and becomes $j - 1$ otherwise. Similarly, after performing action $a^*$, we will denote a possible next state as $s^*$, where the remaining budget stays at $j$ if $a^* = a_t$ and becomes $j - 1$ otherwise. Then, we get: