1 Introduction
Recent years have seen an active interest in causal reinforcement learning. In this thread of work, a fundamental model is that of causal bandits [BFP15, LLR16, SSDS17, LB18, YHS18, LB19, NPS21]. In the causal bandits setting, one assumes an environment comprising causal variables that influence an outcome of interest; specifically, a reward. The goal of a learner then is to maximize her reward by intervening on certain variables (i.e., by fixing the values of certain variables). Note that the reward is assumed to be dependent on the values that the causal variables take, and the causal variables themselves may influence each other. The relationship between these causal variables is typically expressed via a directed acyclic graph (DAG), which is referred to as the causal graph [Pea09].
Of particular interest are causal settings wherein the learner is allowed to perform atomic interventions. Here, at most one causal variable can be set to a particular value, while other variables take values in accordance with their underlying distributions. Prominent results in the context of atomic interventions include [CB20] and [BGK20].
It is relevant to note that when a learner performs an intervention in a causal graph, she gets to see the values that all the causal variables took. Hence, the collective dependence of the reward on the variables is observed through each intervention. That is, from such an observation, the learner may be able to make inferences about the (expected) reward under other values for the causal variables [PJS17]. In essence, with a single intervention, the learner is allowed to intervene on a variable (in the causal graph), allowed to observe all other variables, and further, is privy to the effects of such an intervention. Indeed, such an observation in a causal graph is richer than a usual sample from a stochastic process. Hence, a standard goal in causal bandits (and causal reinforcement learning, in general) is to understand the power and limitations of interventions. This goal manifests in the form of developing algorithms that identify intervention(s) that lead to high rewards, while using as few observations/interventions as possible. We use the term intervention complexity (rather than sample complexity) for our algorithm, to emphasize the point that in causal reinforcement learning one deals with a richer class of observations.
Addressing causal bandits, the notable work of Lattimore et al. [LLR16] obtains an intervention-complexity bound with a focus on atomic interventions and parallel causal graphs. While the causal bandit framework provides a meaningful (causal) extension of the classic multi-armed bandit setting, it is still not general enough to directly capture settings wherein one requires multiple states to model the environment. Specifically, causal bandits, as is, do not carry the same modelling prowess as Markov decision processes (MDPs). Motivated, in part, by this consideration, recent works in causal reinforcement learning have generalized causal bandits to causal MDPs; see, e.g., [LMTY20].
The current work contributes to this thread of work and extends the causal bandit framework of Lattimore et al. [LLR16]. In particular, we develop results for two-stage causal MDPs (see Figure 0(a)). Such a setup is general enough to address the fact that underlying environment states can evolve, as in an MDP, while simultaneously utilizing (causal) aspects from the causal bandit setup.
We now provide a stylized example that highlights the applicability of the two-stage model. Consider a patient who visits a doctor with certain medical issues. The patient may arrive with a combination of symptoms and lifestyle factors. Some of these may include immediate symptoms, such as fever, but may also include more complex lifestyle factors, e.g., a sedentary routine or smoking. On observing the patient and before prescribing an invasive procedure, the doctor may consider prescribing certain lifestyle changes or milder medicines. This initial intervention can then lead to the patient evolving to a new set of symptoms. At this point, with fresh symptoms and lifestyle factors (i.e., in the second stage of the MDP), the doctor can finalize a course of medication. Such an interaction can be modelled as a two-stage causal MDP, and is not directly captured by the causal bandit framework. Also, the outcome of whether the patient is cured, or not, corresponds to a 0-1 reward for the interventions chosen by the doctor.
1.1 Additional Related Work
An extension of the earlier literature on causal bandits, towards causal MDPs, was proposed by Lu et al. [LMTY20]. This work considers a causal graph at each state of the MDP. Furthermore, in this model, along with the rewards, the state transitions are also (stochastically) dependent on the causal variables. We address a similar model in our two-stage causal MDP, wherein the state transitions as well as the rewards are functions of the causal variables. It is, however, relevant to note that in [LMTY20] it is assumed that the MDP can be initialized to any state. The work of Azar et al. [AOM17] also conforms to this assumption. Hence, while these two results address a more general MDP setup (than the two-stage one), their results are not directly applicable in the current context wherein the MDP always starts at a specific state and transitions based on the chosen interventions. Indeed, the assumption that the MDP can be initialized arbitrarily might not hold in real-world domains, such as the medical-intervention example mentioned above.
Sachidananda and Brunskill [SB17] propose a Thompson Sampling based model in a causal bandit setting to minimize cumulative regret. Nair et al. [NPS21] study the problem of online causal learning to minimize expected cumulative regret under the setting of no-backdoor graphs. They also supply an algorithm for expected simple regret minimization in the causal bandit setting with non-uniform costs associated with the interventions.
Much of the literature in causal learning assumes the causal graph structure is known. In more general settings, learning the causal graph structure is an important subproblem; for relevant contributions to the problem of causal graph learning see [SKDV15, KSB17, KDV17], and references therein. Lu et al. [LMT21] and Maiti et al. [MNS21] extend this to the causal bandit problem. Further, under many circumstances, the structure of the causal graph can be learnt externally, or via some form of hypothesis testing [ABDK18].
The current work contributes to the growing body of work on causal reinforcement learning by developing intervention-efficient algorithms for finding near-optimal policies. We focus on simple regret minimization (i.e., near-optimal policy identification) in causal MDPs.
1.2 Our Contributions
Our main contributions are summarized next.
We formulate and study two-stage causal MDPs, which encompass many of the issues that arise when considering extensions from bandits to general MDPs. At the same time, the current setup is structured enough to be amenable to a thorough analysis. A notable feature of our setting is that we do not assume that the learner has ready access to all the states; instead, she has to rely on the transitions to reach certain states.
Here, we develop and analyze an algorithm for finding (near) optimal intervention policies. The algorithm’s objective is to minimize simple regret in an intervention-efficient manner. We focus on causal MDPs wherein the non-zero transition probabilities are sufficiently high and show that, interestingly, the intervention complexity of our algorithm depends on an instance-dependent structural parameter (see equation (1)) rather than directly on the number of interventions or states (Theorem 1).
Notably, our algorithm uses a convex program to identify optimal interventions. Using convex optimization to design efficient explorations is a distinguishing feature of the current work. The algorithm spends part of the given budget learning the MDP parameters (e.g., the transition probabilities). After this, it solves an optimization problem to design efficient exploration of the causal graphs at various states. Such an optimization problem gives rise to the structural parameter of the causal MDP instance. We note that this parameter can be significantly smaller than, say, the total number of interventions in the causal MDP, as demonstrated by our experiments (see Section 5).
In fact, we provide a lower bound showing that our algorithm’s regret guarantee is tight (up to a log factor) for certain classes of two-stage causal MDPs (see Section 4).
2 Notation and Preliminaries
We consider a Markov decision process (MDP) that starts at state , transitions to one of states , receives a reward, and then finally terminates at state ; see Figure 0(a). At each state , there is a causal graph along the lines of the ones studied in [LLR16]; see Figure 0(b). In particular, at state , the causal graph is composed of independent Bernoulli variables . For each , the associated probability .
In the MDP, for each state , all the variables , are observable. Furthermore, we are allowed atomic interventions, i.e., we can select at most one variable and set it to either 0 or 1. We will use to denote the set of atomic interventions available at state ; in particular, . We note that is an empty intervention that allows all the variables to take values from their underlying (Bernoulli) distributions. Also, and set the value of variable to 0 and 1, respectively, while leaving all the other variables to independently draw values from their respective distributions. Note that for all , we have . Write .
The model provides us with a reward as we transition to the terminal state from an intermediate state. Depending on the state , from where we transition to , we label the reward as . Note that the reward stochastically depends on the variables ; in particular, for all and each realization , the reward is distributed as . Extending this, we will write to denote the expected value of reward when intervention is performed in state . For instance, is the expected reward when variable is set to , and all the other variables independently draw values from their respective distributions.
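The intervention model at a single state can be illustrated with a minimal sketch. It assumes binary variables indexed from 0 and encodes the empty intervention as `None`; the function names and the example probabilities are hypothetical and not taken from the paper.

```python
import random

def atomic_interventions(n):
    """Enumerate the atomic interventions on n binary variables.

    None is the empty (do-nothing) intervention; (j, v) sets variable j to v.
    """
    return [None] + [(j, v) for j in range(n) for v in (0, 1)]

def sample_variables(q, intervention, rng):
    """Draw one realization of n independent Bernoulli variables.

    q[j] is the probability that variable j equals 1 when it is not intervened on.
    """
    x = [1 if rng.random() < qj else 0 for qj in q]
    if intervention is not None:
        j, v = intervention
        x[j] = v  # the intervened variable is fixed; the others keep their draws
    return x

rng = random.Random(0)
q = [0.1, 0.5, 0.9]                    # hypothetical Bernoulli parameters
acts = atomic_interventions(len(q))    # empty intervention plus 2 per variable
x = sample_variables(q, (1, 0), rng)   # set the second variable to 0
```

With `n` variables the sketch yields `2n + 1` interventions, and every sampled vector exposes the values of all variables, which is what makes a single round more informative than a plain bandit pull.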
Note that, across the states, the Bernoulli probabilities of the variables and the reward distributions are fixed but unknown. Indeed, the high-level goal of the current work is to develop an algorithm that, in a sample-efficient manner, identifies interventions that maximize the expected rewards.
We denote by , the causal parameter from [LLR16] at state . This parameter is a crucial factor in the regret bound obtained by [LLR16]. Formally, at state , we consider the Bernoulli probabilities of the variables in increasing order, , and write . In addition, let denote the diagonal matrix of .
Remark 1.
The probabilities s are a priori unknown. It is, however, instructive to consider the computation of from s: (1) Without loss of generality, assume that (otherwise consider the lesser of the two quantities as ) (2) Sort the s in increasing order (3) Compute and write
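The three steps of the remark can be sketched in code. The sketch assumes the definition of the causal parameter given in [LLR16], namely the smallest τ in {2, ..., N} such that at most τ of the (balanced) probabilities are below 1/τ; the exact expression in the remark is not reproduced here, so details may differ. The function name is hypothetical.

```python
def causal_parameter(q):
    """Causal parameter m(q), assuming the definition from [LLR16].

    q[j] is the Bernoulli probability of variable j at a given state.
    """
    n = len(q)
    if n < 2:
        raise ValueError("need at least two variables")
    # Step 1: replace each q_j by the lesser of q_j and 1 - q_j.
    # Step 2: sort the resulting values in increasing order.
    p = sorted(min(qj, 1.0 - qj) for qj in q)
    # Step 3: find the smallest tau with at most tau entries strictly below 1/tau.
    for tau in range(2, n + 1):
        unbalanced = sum(1 for pj in p if pj < 1.0 / tau)
        if unbalanced <= tau:
            return tau
    return n  # not reached: tau = n always satisfies the condition
```

Under this definition, fully balanced variables (all probabilities equal to 1/2) give the smallest possible value 2, while fully deterministic variables give the largest value N.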
MDP Notations: At state , the transition to the intermediate states stochastically depends on the independent Bernoulli random variables . Here, denotes the probability of transitioning into state with atomic intervention ; recall that includes the do-nothing intervention. We will collectively denote these transition probabilities as matrix . Furthermore, write to denote the minimum non-zero value in . Note that matrix is fixed, but unknown.
A map , between states and interventions (performed by the algorithm), will be referred to as a policy. Specifically, is the intervention at state . Note that, for any policy , the expected reward is equal to .
Maximizing expected reward, at each intermediate state , we obtain the overall optimal policy as follows: , for , and .
Our goal is to find a policy with (expected) reward as close to that of as possible. We will use to denote the suboptimality of a policy ; in particular, is defined as the difference between the expected rewards of and .
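When the transition matrix and the expected rewards are known, the optimal policy and the suboptimality of any other policy follow by backward induction over the two stages. The sketch below assumes interventions and intermediate states are indexed from 0, with `P[a][i]` the probability of reaching intermediate state `i` under intervention `a` at the start state and `mu[i][b]` the expected reward at state `i` under intervention `b`; these names and the numbers are illustrative only.

```python
def expected_reward(P, mu, a0, pi):
    """Expected reward of the policy that plays a0 at the start state and pi[i] at state i."""
    return sum(P[a0][i] * mu[i][pi[i]] for i in range(len(mu)))

def optimal_policy(P, mu):
    """Backward induction: best intervention per intermediate state, then at the start state."""
    k = len(mu)
    best = [max(range(len(mu[i])), key=lambda b: mu[i][b]) for i in range(k)]
    a0 = max(range(len(P)), key=lambda a: expected_reward(P, mu, a, best))
    return a0, best

# Hypothetical instance with two interventions per state and two intermediate states.
P = [[0.5, 0.5], [0.9, 0.1]]
mu = [[0.2, 0.8], [0.6, 0.4]]
a_star, pi_star = optimal_policy(P, mu)
opt_value = expected_reward(P, mu, a_star, pi_star)
# Suboptimality of an arbitrary policy: the gap to the optimal expected reward.
gap = opt_value - expected_reward(P, mu, 0, [0, 0])
```

In the learning problem none of these quantities are known, so the algorithm works with estimates; the sketch only makes the target of that estimation concrete.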
Conforming to the standard simple-regret framework, the algorithm is given a time budget , i.e., the algorithm can go through the two stages of the MDP times. In each of these rounds, the algorithm can perform the atomic interventions of its choice (both at state and then at the resulting intermediate state). The overall goal of the algorithm is to compute a policy with high expected reward and the algorithm’s suboptimality is defined as its regret, . Here, the expectation is with respect to the policy computed by the algorithm; indeed, given any two-stage causal MDP instance and time budget to an algorithm, different policies will have potentially different probabilities of being returned.
Notation  Explanation 

Transition probabilities matrix:  
Policy, a map from states to interventions.  
i.e. for  
Expectation of the reward at state given intervention  
Optimal Policy  
Computed policy  
Suboptimality of  
3 Main Algorithm and its Analysis
Our algorithm (ALGCE) uses subroutines to estimate the transition probabilities, the causal parameters, and the rewards. From these, it outputs the best available interventions as its policy . Given time budget , the algorithm uses the first rounds to estimate the transition probabilities (i.e., the matrix ) in Algorithm 2. The subsequent rounds are utilized in Algorithm 3 to estimate the causal parameters. Finally, the remaining budget is used in Algorithm 4 to estimate the intervention-dependent rewards, for all intermediate states .
To judiciously explore the interventions at state , ALGCE computes frequency vectors . In such vectors, the th component denotes the fraction of time that the corresponding intervention is performed by the algorithm, i.e., given time budget , the intervention will be performed times. Note that, by definition, and the frequency vectors are computed by solving convex programs over the estimates. The algorithm and its subroutines throughout consider empirical estimates, i.e., find the estimates by direct counting. Here, let denote the computed estimate of the matrix and be the estimate of the diagonal matrix . We obtain a regret upper bound via an optimal frequency vector (see Step 5 in ALGCE).
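Estimation by direct counting can be illustrated for the transition matrix. This is a minimal sketch, not the paper's Algorithm 2: it assumes each round is logged as a pair (intervention index, intermediate state reached), and the function name is hypothetical.

```python
from collections import Counter

def estimate_transitions(samples, num_interventions, num_states):
    """Empirical transition matrix from (intervention, next_state) pairs.

    Entry [a][i] is the fraction of rounds with intervention a that reached state i.
    Rows of interventions that were never performed are left as all zeros.
    """
    pair_counts = Counter(samples)
    pulls = Counter(a for a, _ in samples)
    P_hat = [[0.0] * num_states for _ in range(num_interventions)]
    for (a, i), count in pair_counts.items():
        P_hat[a][i] = count / pulls[a]
    return P_hat

log = [(0, 0), (0, 1), (0, 1), (1, 0)]   # hypothetical observations
P_hat = estimate_transitions(log, num_interventions=3, num_states=2)
```

A pair that is never observed keeps an estimate of zero, which matches the remark in the proof of Corollary 1 that an unreachable state is never seen under the corresponding intervention.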
Recall that for any vector (with nonnegative components), the Hadamard exponentiation leads to the vector wherein for each component .
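Hadamard exponentiation is simply component-wise exponentiation, as the following minimal sketch shows; the function name and arguments are hypothetical.

```python
def hadamard_power(x, r):
    """Component-wise power of a vector with non-negative components."""
    if any(xi < 0 for xi in x):
        raise ValueError("components must be non-negative")
    # A negative exponent applied to a zero component raises ZeroDivisionError.
    return [xi ** r for xi in x]

roots = hadamard_power([4.0, 9.0, 0.25], 0.5)
```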
We next define a key parameter that specifies the regret bound in Theorem 1 (below).
(1) 
Furthermore, we will write to denote the optimal frequency vector in equation (1). Hence, with vector , we have . Note that Step 5 in ALGCE addresses an analogous optimization problem, albeit with the estimates and . Also, we show in Lemma 11 (see Section B) that this optimization problem is convex and, hence, Step 5 admits an efficient implementation.
The following theorem is the main result of the current work. It upper bounds the regret of ALGCE. The result requires the algorithm’s time budget to be at least
(2) 
Theorem 1.
3.1 Proof of Theorem 1
To prove the theorem, we analyze the algorithm’s execution as falling under either a good event or a bad event, and bound the regret under each.
Definition 1.
We define five events, to , whose intersection we call the good event, E, i.e., .


. That is, for every intervention , the empirical estimate of transition probability in each of Algorithms 2, 3 and 4 is good, up to an absolute factor of .

Estimate in Algorithm 2. In other words, our estimate for causal parameter for state 0 in Algorithm 2 is relatively good.

, for all states . That is, our estimate of parameter is relatively good for every state , in Algorithms 3 and 4.

, for all interventions . Here, random variable and is the estimated transition probability computed in Algorithm 2.
Definition 2.
We define bad event F, as the complement of the intersection of events  , as defined above, i.e., .
Before we proceed with the proof, we state below a corollary which provides a multiplicative bound on with respect to , complementing the additive form of .
Corollary 1.
Under event , for all interventions and states , we have:
Proof.
Event ensures that , for each intervention and state . This, in particular, implies that for each intervention and state the following inequality holds: . Note that if , then the algorithm will never observe state with intervention , i.e., in such a case . For the non-zero s, recall that (by definition), . Therefore, for any non-zero , the above-mentioned inequality gives us . Equivalently, and . Therefore, for all s the corollary holds. ∎
Considering the estimates and , along with frequency vector (computed in Step 5), we define random variable . Note that is a surrogate for . We will, in fact, show that, under the good event, is close to (Lemma 3).
Recall that and here the expectation is with respect to the policy computed by the algorithm. We can further consider the expected suboptimality of the algorithm and the quality of the estimates (in particular, , and ) under good event (E).
Based on the estimates returned at Step 5 of ALGCE, either the good event holds, or we have the bad event (though this is unknown to our algorithm). We obtain the regret guarantee by first bounding the suboptimality of policies computed under the good event, and then bounding the probability of the bad event.
Lemma 1.
For the optimal policy , under the good event (E), we have
Proof.
Consider the expression
We can add and subtract and take common terms out to reduce the expression:
Note that:



(from )

(from )
Furthermore, it follows from Corollary 1 that (componentwise) . Hence, the abovementioned expression is bounded above by
Note that the definition of ensures . Further, . Therefore,
This establishes the lemma. ∎
We now state a similar lemma for any policy computed under the good event.
Lemma 2.
Let be a policy computed by ALGCE under the good event (E). Then,
Proof.
Consider the expression:
We can add and subtract to get:
Analogous to Lemma 1, one can show that this expression is bounded above by
∎
We can also bound to within a constant factor of .
Lemma 3.
Under the good event, we have .
Proof.
Corollary 1 ensures that given event (and, hence, the good event), . In addition, note that event gives us . From these observations we obtain the desired bound:
Here, the first inequality follows from the fact that is the minimizer of the expression, and for the second inequality, we substitute the appropriate bounds of and . ∎
Corollary 2.
Let be a policy computed by ALGCE under the good event (E), then
Proof.
Corollary 2 shows that under the good event, the (true) expected reward of and are within of each other.
In Lemma 4 (stated below and proved in Appendix A.6) we will show that , for appropriately large . (Footnote 4: Recall that, by definition, .)
Lemma 4 (Bound on Bad Event).
Write . Then for any :
4 Lower Bound
This section provides a lower bound on regret for a family of instances. For any number of states , we show that there exist transition matrices and reward distributions () such that the regret achieved by ALGCE (Theorem 1) is tight, up to log factors.
Theorem 2.
There exists a transition matrix , reward distributions, and probabilities corresponding to causal variables , such that for any , corresponding to causal variables at states , the simple regret achieved by any algorithm is
4.1 Theorem 2: Proof Setup
This section establishes Theorem 2. We will identify a collection of two-stage causal MDP instances and show that, for any given algorithm , there exists an instance in this collection for which ’s regret is .
First we describe the collection of instances and then provide the proof.
For any integer , consider causal variables at each state . The transition matrix is set to be deterministic. Specifically, for each , we have . For all other interventions at state 0, we transition to state k with probability 1. Such a transition matrix can be achieved by setting for all . As before, the total number of interventions .
Now consider a family of instances . Here, and each is a two-stage causal MDP with the above-mentioned transition probabilities. The instances differ in the rewards at the intermediate states. In particular, in instance , we set the reward distributions such that for all states and interventions . For each and , instance differs from only at state and for intervention . Specifically, by construction, we will have , for a parameter . The expected rewards under all other interventions will be , the same as in .
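The shape of this family can be sketched as follows. The sketch is illustrative only: the baseline expected reward `base` and the gap `eps` are stand-ins for the values elided above, the perturbed reward is assumed to be `base + eps`, and every (state, intervention) pair is perturbed, which may be broader than the index set used in the proof.

```python
def build_instances(k, num_interventions, base, eps):
    """Family of expected-reward tables indexed by the perturbed (state, intervention) pair.

    family[None] is the baseline instance; family[(i, a)] differs from it only in
    the expected reward at intermediate state i under intervention a.
    """
    baseline = [[base] * num_interventions for _ in range(k)]
    family = {None: baseline}
    for i in range(k):
        for a in range(num_interventions):
            inst = [row[:] for row in baseline]  # copy so instances stay independent
            inst[i][a] = base + eps              # assumed form of the perturbation
            family[(i, a)] = inst
    return family

family = build_instances(k=3, num_interventions=5, base=0.5, eps=0.1)
```

Any two instances in the family agree on all but at most two entries, so an algorithm can only tell them apart by sampling the perturbed (state, intervention) pairs often enough.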
Given any algorithm , we will consider the execution of over all the instances in the family. The execution of algorithm over each instance induces a trace, which may include the realized transition probabilities , the realized variable probabilities for and and the corresponding s, and the realized rewards . Each such realization (random variable) has a corresponding distribution (over many possible runs of the algorithm). We call the measures corresponding to these random variables under the instances and as and , respectively.
4.2 Proof of Theorem 2
For any algorithm and given time budget , we first consider its execution over instance . As mentioned previously, denotes the trace distribution induced by the algorithm for . In particular, write to denote the expected number of times state is visited, .