Chronological Causal Bandits

by Neil Dhir, et al.
The Alan Turing Institute

This paper studies an instance of the multi-armed bandit (MAB) problem, specifically where several causal MABs operate chronologically in the same dynamical system. Practically, the reward distribution of each bandit is governed by the same non-trivial dependence structure, which is a dynamic causal model. The model is dynamic because we allow each causal MAB to depend on the preceding MAB, and in doing so we are able to transfer information between agents. Our contribution, the Chronological Causal Bandit (CCB), is useful in discrete decision-making settings where the causal effects change across time and can be informed by earlier interventions in the same system. In this paper, we present some early findings of the CCB as demonstrated on a toy problem.





1 Introduction

A dynamical system evolves in time. Examples include the weather, financial markets, and unmanned aerial vehicles (Gauthier et al., 2021). One practical goal is to make decisions within those systems, such as deciding which financial instrument to buy, one day after another. The multi-armed bandit (MAB) paradigm (Robbins, 1952; Chernoff, 1959) can help with that. In this paradigm an agent and an environment interact sequentially (Lattimore and Szepesvári, 2020); the agent picks an action and receives a reward from the environment. The agent continues like so, usually for a finite number of plays, with the goal of maximising the total reward. The challenge in this problem stems from the trade-off between exploiting actions with known rewards, versus exploring actions with unknown rewards.

Recently, many studies have addressed the case in which there is a non-trivial dependency structure between the arms. One such direction presumes that the dependency structure is modelled explicitly through causal graphs (Lee and Bareinboim, 2018, 2019; Lattimore et al., 2016; Lu et al., 2020). We extend that paradigm to also account for the causal temporal dynamics of the system. MABs already constitute sequential decision-making paradigms, but here we expand that idea to cover chronological MABs: where the reward distribution is conditional upon the actions taken by earlier MABs (see fig. 1). We are not considering a Markov decision process (MDP) – we have no explicit notion of state and consequently do not maintain a model of the state dynamics. This type of transfer learning in causal MABs was also studied by Zhang and Bareinboim (2017) and Azar et al. (2013), but there the authors transfer information between unrelated tasks, whereas we are interested in transfers for when all agents operate in the same (possibly non-stationary) dynamical system. Ours, instead, is more similar to the restless bandit (Whittle, 1988; Guha et al., 2010) problem, where rewards vary with time (unlike the standard bandit setting where they are fixed but unknown).


Consider a dynamical environment (we will somewhat abuse standard notation for dynamical systems theory), such as a country subject to the COVID-19 pandemic. The highly transient nature of the virus (Mandal et al., 2020) necessitates multiple chronological clinical trials, the start of each indexed by $t$ (see fig. 1), to find a treatment or vaccine. Suppose we conduct clinical trial $t$, which has a number of different treatments of unknown efficacy for COVID-19, and we have a group of patients in our study. Patients arrive sequentially, and we must decide on a treatment to administer to each new patient.

Figure 1: Structural causal model framed as a multi-armed bandit problem. The SCM-MAB topology is adapted from figure 3(c) in (Lee and Bareinboim, 2018). Top: Each intervention can be construed as pulling an arm and receiving a reward (measuring the causal effect). Shown are all possible interventions – including sub-optimal ones (e.g. pulling two arms together). Bottom: The optimal intervention (and consequent samples from the SEM) from the preceding SCM-MAB are transferred to the current SCM-MAB. Exogenous variables and their incoming edges are not shown to avoid clutter. Implemented interventions are represented by squares.

To make this decision, we could learn from how the previous choices of treatments fared for the previous patients. After a sufficient number of trials, we may have a reasonable idea of which treatment is most effective, and from thereon we could administer that treatment to all patients. However, the exploration phase may take a long time, and many patients may receive a sub-optimal treatment during that period. But we know that an earlier, similar trial $t-1$ has just concluded, and because we are aware of the causal nature of our treatments and their evolution over time, we can condition our trial on the lessons learned, and actions taken, in trial $t-1$, before we start ours. There are two purposes to this: (1) the additional information may aid the discovery of the most effective treatment in our trial, and (2) it may also show that the optimal intervention changes over time owing to the highly non-stationary environment of real systems – a typical assumption (Lattimore and Szepesvári, 2020) for standard MABs is a stationary reward distribution. The time between two consecutive trials $t-1$ and $t$ is $\Delta t$.


The chronological causal bandit (CCB) extends the SCM-MAB of Lee and Bareinboim (2018) by conditioning an SCM-MAB on prior causal bandits played in the same environment. The result is a piece-wise stationary model which offers a novel approach for causal decision making under uncertainty within dynamical systems. The reward process of the arms is non-stationary on the whole, but stationary on intervals (Yu and Mannor, 2009). Specifically, past optimal interventions are transferred across time, allowing the present MAB to weigh the utility of those actions in the present game.

1.1 Preliminaries


Random variables are denoted by upper-case letters and their values by lower-case letters. Sets of variables and their values are denoted by bold upper-case and lower-case letters respectively. We make extensive use of the do-calculus (for details see (Pearl, 2000, §3.4)). Samples (observational) drawn from a system or process unperturbed over time are contained in $\mathcal{D}^O$. Samples (interventional) drawn from a system or process subject to one or more interventions are denoted by $\mathcal{D}^I$. The domain of a variable $V$ is denoted by $D(V)$.

Structural causal model.

Structural causal models (SCMs) (Pearl, 2000, ch. 7) are used as the semantic framework to represent an underlying environment. For the exact definition as used by Pearl see (Pearl, 2000, def. 7.1.1). Let $\mathcal{M}$ be an SCM parametrised by the quadruple $\langle \mathbf{U}, \mathbf{V}, \mathbf{F}, P(\mathbf{U}) \rangle$. Here, $\mathbf{U}$ is a set of exogenous variables which follow a joint distribution $P(\mathbf{U})$, and $\mathbf{V}$ is a set of endogenous (observed) variables. Within $\mathbf{V}$ we distinguish between two types of variables: manipulative (treatment) and target (output) variables (always denoted by $Y$ in this paper). Further, the endogenous variables are determined by a set of functions $\mathbf{F}$. Let $\mathbf{F} = \{f_V\}_{V \in \mathbf{V}}$ (Lee and Bareinboim, 2020, §1) s.t. each $f_V$ is a mapping from (the respective domains of) $\mathbf{U}_V \cup \mathrm{pa}(V)$ to $V$ – where $\mathbf{U}_V \subseteq \mathbf{U}$ and $\mathrm{pa}(V) \subseteq \mathbf{V} \setminus V$. Graphically, each SCM is associated with a causal diagram (a directed acyclic graph, DAG for short) $\mathcal{G}$ where the edges are given by $\mathbf{F}$. Each vertex in the graph corresponds to a variable, and the directed edges point from members of $\mathbf{U}_V$ and $\mathrm{pa}(V)$ toward $V$ (Pearl, 2000, ch. 7). A directed edge $X \to Y$ is s.t. $X \in \mathrm{pa}(Y)$ (i.e. $Y$ is a child of $X$). A bidirected edge between $X$ and $Y$ occurs if they share an unobserved confounder, which is to say $\mathbf{U}_X \cap \mathbf{U}_Y \neq \emptyset$ (Lee and Bareinboim, 2018). Unless otherwise stated, from hereon, when referring to $\mathbf{V}$ we are implicitly considering $\mathbf{V} \setminus \{Y\}$ – i.e. the manipulative variables not including the outcome variable. Finally, the fundamental do-operator $\mathrm{do}(\mathbf{X} = \mathbf{x})$ represents the operation of fixing a set of endogenous variable(s) $\mathbf{X}$ to constant value(s) $\mathbf{x}$ irrespective of their original mechanisms. Throughout we do not consider graphs with non-manipulative variables. For a more incisive discussion on the properties of SCMs we refer the reader to (Pearl, 2000; Bareinboim and Pearl, 2016).
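To make the SCM machinery concrete, the following minimal Python sketch implements a discrete SCM with a do-operator. The variable names ($X$, $Z$, $Y$) mirror the toy graph of fig. 2, but the structural functions and noise probabilities below are invented for illustration – they are not the paper's true SEM.

```python
import random

# Minimal discrete SCM sketch: exogenous noise U, endogenous V = {X, Z, Y},
# and structural functions F. The concrete mechanisms are illustrative only.
def sample_scm(do=None, rng=random):
    do = do or {}
    # Exogenous variables U ~ P(U) (independent fair Bernoullis here).
    u_x, u_z, u_y = (int(rng.random() < 0.5) for _ in range(3))
    v = {}
    # Each endogenous variable is set by its structural function f_V,
    # unless it is intervened on: do(V = v) overrides the mechanism.
    v["X"] = do.get("X", u_x)
    v["Z"] = do.get("Z", v["X"] ^ u_z)          # Z depends on its parent X
    v["Y"] = do.get("Y", v["Z"] ^ (u_y & u_x))  # Y depends on Z (and noise)
    return v

sample = sample_scm(do={"X": 1})  # X is fixed; Z and Y still follow F
```

Note that the do-operator is implemented simply as an override: the intervened variable ignores its mechanism, while its descendants continue to respond to the intervened value.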

Multi-armed bandit.

The MAB setting (Robbins, 1952) entertains a discrete sequential decision-making scenario in which an agent selects an action or ‘arm’ $a_t \in \mathcal{A}$ according to a policy $\pi$, and receives a stochastic reward $r_t$ emanating from an unknown distribution particular to each arm. The expectation of the reward is given by $\mu_a = \mathbb{E}[r \mid a]$. The goal of the agent is to optimise the arm-selection sequence and thereby maximise the expected reward $\mathbb{E}_\pi[\sum_{t=1}^{N} r_t]$ after $N$ rounds, where $\mathbb{E}_\pi$ is the expectation under the given policy and $a_t$ is the arm played on the $t^{\text{th}}$ round. We will use a similar performance measure, the cumulative regret (Lee and Bareinboim, 2018) $R_N = N\mu^* - \mathbb{E}_\pi[\sum_{t=1}^{N} r_t]$, where the max reward is $\mu^* = \max_{a \in \mathcal{A}} \mu_a$. Using the regret decomposition lemma (Lattimore and Szepesvári, 2020, Lemma 4.5), we can write this in the form

$$R_N = \sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E}[T_a(N)] \tag{1}$$

where each arm’s gap from the best arm (“suboptimality gap” (Lattimore and Szepesvári, 2020, §4.5)) is $\Delta_a = \mu^* - \mu_a$ and $T_a(N)$ is the total number of times arm $a$ was played after $N$ rounds.
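As a quick numerical illustration of the regret decomposition, the cumulative regret can be computed directly from the arms' mean rewards and play counts. The mean rewards and counts below are invented for the sake of the example.

```python
# Regret decomposition: R_N = sum_a Delta_a * E[T_a(N)], where
# Delta_a = mu_star - mu_a is the suboptimality gap of arm a.
def cumulative_regret(mu, plays):
    mu_star = max(mu.values())
    return sum((mu_star - mu[a]) * n for a, n in plays.items())

mu = {"a0": 0.2, "a1": 0.5, "a2": 0.4}  # illustrative mean rewards
plays = {"a0": 10, "a1": 80, "a2": 10}  # T_a(N) with N = 100 rounds
regret = cumulative_regret(mu, plays)   # 10*0.3 + 80*0.0 + 10*0.1 = 4.0
```

Only the suboptimal arms (`a0`, `a2`) contribute to the regret; playing the best arm is free.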

Figure 2: Toy SCM used throughout this paper, based on (Lee and Bareinboim, 2018, Fig. 3(c)), but with the important difference that this SCM is propagated in time.

Connecting SCMs to MABs.

Echoing the approach taken by Lee and Bareinboim (2018, §2), and using the notation and concepts introduced in section 1.1, let $\mathcal{M}$ be an SCM parametrised by $\langle \mathbf{U}, \mathbf{V}, \mathbf{F}, P(\mathbf{U}) \rangle$ where $Y \in \mathbf{V}$ is the, as-noted, reward variable. The arms of the bandit are given by the set $\mathcal{A} = \{\mathrm{do}(\mathbf{X} = \mathbf{x}) \mid \mathbf{X} \subseteq \mathbf{V} \setminus \{Y\}, \mathbf{x} \in D(\mathbf{X})\}$. This is the set of all possible interventions on the endogenous variables except the reward variable (Lee and Bareinboim, 2018, §2). Each arm is associated with a reward distribution $P(Y \mid \mathrm{do}(\mathbf{X} = \mathbf{x}))$ where the mean reward is $\mu_{\mathbf{x}} = \mathbb{E}[Y \mid \mathrm{do}(\mathbf{X} = \mathbf{x})]$. This is the SCM-MAB setting (Lee and Bareinboim, 2018), fully represented by the tuple $\langle \mathcal{M}, Y \rangle$. As noted by the authors, an agent facing an SCM-MAB intervenes (plays arms) with knowledge of $\mathcal{G}$ and $Y$, but does not have access to the structural equation model $\mathbf{F}$ and the joint distribution over the exogenous variables $P(\mathbf{U})$.
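The arm set – every value assignment to every subset of the manipulative variables, plus the empty intervention – can be enumerated mechanically. With two binary manipulative variables this yields nine arms (as in table 1). A sketch of the enumeration, where the variable names and domains are assumptions matching the toy example:

```python
from itertools import combinations, product

def enumerate_arms(domains):
    """All interventions do(X = x) for X a subset of V \\ {Y} and
    x in D(X), including the empty intervention do()."""
    variables = sorted(domains)
    arms = [{}]  # do(): intervene on nothing, just observe the reward
    for r in range(1, len(variables) + 1):
        for subset in combinations(variables, r):
            for values in product(*(domains[v] for v in subset)):
                arms.append(dict(zip(subset, values)))
    return arms

arms = enumerate_arms({"X": (0, 1), "Z": (0, 1)})
# 1 empty + 2 + 2 singleton + 4 joint assignments = 9 arms
```

This exhaustive set is what Remark 2.1 later prunes down to the POMIS.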

Causality across time.

Taking inspiration from Aglietti et al. (2021, §2), we consider causality in time, manifested by propagating a DAG in time and connecting each time-slice DAG with directed edges, as shown in fig. 2. By doing this we are invariably considering a dynamic Bayesian network (DBN) (Koller and Friedman, 2009). As we are making interventions on time-slices of the DBN, we introduce notation to aid the exposition of the method.

Definition 1.

Let $\mathcal{M}_t$ be the SCM at time step $t$. The temporal window $w$ covered by the SCM spans $[t-1, t]$, i.e. taking into account only the most recent time-slice and the actions taken therein. It is also possible to increase the size of $w$ to include the entire history, i.e. $[0, t]$, as is done in fig. 3. (The choice of window size is a difficult one. More information is typically better, but we may also subject the model to ‘stale’ information, i.e. interventions which are no longer of any relevance or, worse, misleading in the present scenario.)

Definition 2.

Let $\mathcal{G}_t$ (Pearl, 2000, p. 203) be the induced subgraph associated with $\mathcal{M}_t$. In $\mathcal{G}_t$, following the rules of do-calculus (Pearl, 2000, Theorem 3.4.1), the intervened variable(s) at $t-1$ have no incoming edges, i.e. the time-slice part of $\mathcal{G}_t$ has been mutilated in accordance with the implemented intervention at $t-1$.

2 Chronological Causal Bandits

The SCM-MAB is particular to one graph, in which we seek to minimise the cumulative regret of eq. 1. We instead seek a sequence of interventions which minimise it at each time-step $t$ (i.e. the start of the trial) of a piece-wise stationary SCM-MAB, as set out in the next paragraph. (For clarity: each trial contains multiple rounds, summarised in the horizon $N$.)

Problem statement.

Similar to (Aglietti et al., 2021), the goal of this work is to find a sequence of optimal interventions over time, indexed by $t$, by playing a series of sequential conditional SCM-MABs or chronological causal bandits (CCB). The agent is provided with $\mathcal{G}_t$ and $Y_t$ for each $t$ and is then tasked with optimising the arm-selection sequence (within each ‘trial’ $t$) and, in so doing, minimising the total regret for the given horizon $N$. However, different to the standard MAB problem, we must take into account previous interventions as well. We do that by first writing the regret in eq. 1 in a different form which is conditional on past interventions, enabling the transfer of information:

$$R_N^t = \sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E}\big[T_a(N) \mid \mathbb{1}\{\mathrm{do}(\mathbf{X}^*_{t-1} = \mathbf{x}^*_{t-1})\}\big] \tag{2}$$

where $\mathrm{do}(\mathbf{X}^*_{t-1} = \mathbf{x}^*_{t-1})$ denotes the previously implemented intervention and $\mathbb{1}\{\cdot\}$ is the indicator function. Now, take particular care w.r.t. the index $t-1$ to which the “implemented intervention” refers. Drawing upon our example in section 1, the implemented intervention here corresponds to the treatment that was found to be the most effective during trial $t-1$ (during which $N$ rounds were played). Although the agent finds a sequence of decisions/treatments, only one of them, in this setting, is the overall optimal choice – i.e. has the lowest, on average, suboptimality gap $\Delta_a$. The implemented intervention at $t-1$ is found by investigating which arm has the lowest regret at the end of horizon $N$:

$$\mathrm{do}(\mathbf{X}^*_{t-1} = \mathbf{x}^*_{t-1}) = \operatorname*{arg\,max}_{a \in \mathcal{A}} T_a(N) \tag{3}$$

Simply, we pick the arm that has been played the most by the agent.
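The "most played arm" rule reduces to a one-line counting operation; a sketch, where the play counts are made-up numbers:

```python
def implemented_intervention(play_counts):
    """Pick the arm played most often over the horizon N -- the
    empirical proxy for the lowest-regret arm at the end of a trial."""
    return max(play_counts, key=play_counts.get)

# Illustrative play counts T_a(N) at the end of a 10,000-round trial.
counts = {"do()": 120, "do(X=1)": 9500, "do(Z=0)": 380}
best = implemented_intervention(counts)  # -> "do(X=1)"
```

The selected arm is then carried forward as the implemented intervention for the next trial.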

Remark 2.1.

The above problem statement does not discuss which set of interventions is being considered. Naively, one approach is to consider all the sets in $\mathcal{A}$ (Aglietti et al., 2020). Though a valid approach, the size of this set grows exponentially with the complexity of $\mathcal{G}$. Intuitively, one may believe that the best and only action worth considering is to intervene on all the parents of the reward variable $Y$. This is indeed true provided that $Y$ is not confounded by any of its ancestors (Lee and Bareinboim, 2018). But whenever unobserved confounders are present (shown by red, dashed edges in fig. 2) this no longer holds true. By exploiting the rules of do-calculus and the partial orders amongst subsets (Aglietti et al., 2020), Lee and Bareinboim (2018) characterise a reduced set of intervention variables called the possibly optimal minimal intervention set (POMIS) – whereas $\mathcal{A}$ refers to all possible interventions, the POMIS is typically much smaller. They demonstrate empirically that the selection of arms based on POMISs makes standard MAB solvers “converge faster to an optimal arm” (Lee and Bareinboim, 2018, §5). Throughout, we use the POMIS as well. For the full set of arms for the toy problem in fig. 2, see table 1.

Assumptions 1.

To tackle a CCB style problem, we make the following assumptions (included are also those made by Lee and Bareinboim (2018) w.r.t. the standard SCM-MAB):

  1. Invariance of the causal structure $\mathcal{G}$.

  2. Aglietti et al. (2021) showed that $\mathbf{F}$ does not change across time given assumption (1).

  3. An agent facing an SCM-MAB plays arms with knowledge of $\mathcal{G}$ and $Y$ but not $\mathbf{F}$ and $P(\mathbf{U})$.

Assumption (1) posits that the DAG is known. If this were not true then we would have to undertake causal discovery (CD) (Glymour et al., 2019) or spend the first interactions with the environment (Lee and Bareinboim, 2018) learning the causal DAG, from $\mathcal{D}^O$ (Spirtes et al., 2000), from $\mathcal{D}^I$ (Kocaoglu et al., 2017), or both (Aglietti et al., 2020). As it is, CD is outside the scope of this paper.

2.1 Transferring information between causal bandits

Herein we describe how to express the reward distribution for trial $t$ as a function of the intervention(s) implemented at the previous trial $t-1$. The key to this section is the relaxation of assumption (3). We seek an estimate $\hat{\mathbf{F}}$ of $\mathbf{F}$ (the true SEM) and $\hat{P}(\mathbf{U})$ of $P(\mathbf{U})$. Thus, if we have $\hat{\mathbf{F}}$, we can relay information about past interventions to the present SCM-MAB and thereby enable the present reward distribution to take those interventions into account. Reminding ourselves that the members of $\mathbf{F}$ are the functions $\{f_V\}_{V \in \mathbf{V}}$, for each $V$ we seek a function estimate $\hat{f}_V$. We model all functions in $\hat{\mathbf{F}}$ as independent probability mass functions (PMFs).

Recall that $P(\mathbf{V})$ is an observational distribution, whose samples are contained in $\mathcal{D}^O$, whereas an interventional distribution, found by fixing variables $\mathbf{X}$ to $\mathbf{x}$ in $\mathbf{F}$, has its samples contained in $\mathcal{D}^I$. As we are operating in a discrete world, we do not need to approximate any intractable integrals (as is required in e.g. (Aglietti et al., 2021, 2020)) to compute the effect of interventions. Rather, by assuming the availability of $\mathcal{D}^O$ and using the do-calculus, we are at liberty to estimate $\hat{\mathbf{F}}$ by approximating the individual PMFs that arise from applying the do-calculus (see (Pearl, 2000, Theorem 3.4.1)). Consequently, we can build a ‘simulator’ using $\mathcal{D}^O$ (which concerns the whole of $\mathcal{G}$, not just the window $w$). $\mathcal{D}^I$ is very scarce because playing an arm does not yield an observed target value but only a reward (hence we cannot exploit the actions taken during horizon $N$). The only interventional data available, at each $t$, to each SCM-MAB, is the implemented intervention from trial $t-1$. The CCB method is graphically depicted in fig. 3.
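Since all variables are discrete, each conditional PMF in $\hat{\mathbf{F}}$ can be estimated by simple frequency counts over $\mathcal{D}^O$. A minimal sketch, where the tiny dataset is synthetic and purely for illustration:

```python
from collections import Counter

def estimate_pmf(samples, child, parents):
    """Estimate P(child | parents) by frequency counts over
    observational samples D^O (each sample is a dict of values)."""
    joint, marginal = Counter(), Counter()
    for s in samples:
        pa = tuple(s[p] for p in parents)
        joint[(pa, s[child])] += 1
        marginal[pa] += 1
    # Normalise counts within each parent configuration.
    return {k: v / marginal[k[0]] for k, v in joint.items()}

# Synthetic D^O in which Z always copies X:
d_obs = [{"X": 0, "Z": 0}, {"X": 0, "Z": 0}, {"X": 1, "Z": 1}]
pmf = estimate_pmf(d_obs, child="Z", parents=["X"])
# pmf[((0,), 0)] == 1.0 and pmf[((1,), 1)] == 1.0
```

Estimating one such PMF per structural function, and overriding the intervened variables, yields the discrete 'simulator' described above.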

Figure 3: Transfer learning shown through the lens of the CCB applied to fig. 2. Each row corresponds to one SCM-MAB which, if $t > 0$, takes into account previous interventions and secondary effects from those interventions (intervening on one variable in time-slice $t-1$ means we can also calculate values for its descendants in that time-slice). Further shown are the structural equation models considered by the CCB at every time step $t$, as well as the minimisation task over the cumulative regret, undertaken to find the best action for each time-slice. Square nodes denote intervened variables.

3 Experiments

We demonstrate some empirical observations on the toy example used throughout, shown in fig. 2. We consider the first five time-slices (trials) as shown in fig. 3. The reward distribution, under the POMIS, for each slice is shown in fig. 4 (these distributions are found conditional upon the optimal intervention being implemented in the preceding trial, as shown in fig. 3). For the true SEM see eqs. 4 to 6 and eqs. 7 to 9. Figure 4 shows that the system is, in effect, oscillating in the reward distribution. This is because the optimal intervention changes in between trials.

Figure 4: Reward distribution for arms in the POMIS across the first five trials. For the setting, see table 1.

The horizon for each trial was set to 10,000. We used two common MAB solvers: Thompson sampling (TS) (Thompson, 1933) and KL-UCB (Cappé et al., 2013). Displayed results are averaged over 100 replicates of each experiment shown in fig. 5. We investigate the cumulative regret (CR) and the optimal arm selection probability at various instances in time.
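For reference, a Beta-Bernoulli Thompson sampling solver of the kind used here can be sketched as follows. The arm reward probabilities below are invented; in the paper the rewards come from the SCM.

```python
import random

def thompson_sampling(reward_probs, horizon, rng=random.Random(0)):
    """Beta-Bernoulli Thompson sampling: sample a mean for each arm
    from its Beta posterior, play the argmax, update the posterior."""
    k = len(reward_probs)
    wins, losses = [1] * k, [1] * k  # Beta(1, 1) priors per arm
    plays = [0] * k
    for _ in range(horizon):
        draws = [rng.betavariate(wins[a], losses[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: draws[a])
        reward = int(rng.random() < reward_probs[arm])  # Bernoulli reward
        wins[arm] += reward
        losses[arm] += 1 - reward
        plays[arm] += 1
    return plays

plays = thompson_sampling([0.1, 0.8, 0.3], horizon=2000)
# The best arm (index 1) should dominate the play counts.
```

KL-UCB replaces the posterior sampling step with an upper-confidence index per arm, but the play/update loop is the same.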

For the first trial, CCB and SCM-MAB target the same reward distribution. Consequently, they both identify the same optimal intervention. Come the next trial, things start to change; having implemented the optimal intervention from the first trial, CCB is now targeting a different set of rewards (see fig. 4). The SCM-MAB, being agnostic about past interventions, targets the same reward as previously (blue bars in fig. 4). As discussed, this ignores the dynamics of the system; a vaccinated population will greatly alter the healthcare environment, hence to administer the same vaccine (arm 3 in the previous trial) at the next clinical trial, without taking this into account, makes for poor public policy.

(a) SCM-MAB at $t=1$.
(b) CCB at $t=1$.
(c) Optimal arm-selection probability.
Figure 5: Figures 5(a) and 5(b) (truncated at 5,000 rounds) show cumulative regret along with the standard deviation at trial $t=1$. Optimal arm-selection probability is shown in fig. 5(c), found using a Thompson sampling policy (Lee and Bareinboim, 2018, §5). An equivalent plot for KL-UCB can be found in fig. 6.

Consider the CR from the CCB at trial $t=1$ in fig. 5(b); it is lower than that of the SCM-MAB in fig. 5(a), as it transfers the preceding intervention to the current causal model (fig. 3) and now finds a different optimal intervention (see appendix A for the full optimal intervention sequence). Let us now turn to fig. 5(c); to minimise the regret, the agent should be choosing the optimal action almost all of the time. But it is only possible to reduce regret if the algorithm has discovered the arm with the largest mean. In these trials the reward per arm, across the POMIS, is almost identical. As it stands, the agent does not have reasonable statistical certainty that it has found the optimal arm (orange and red curves in fig. 5(c)). But all arms have almost the same causal effect, which is why the CR in fig. 5(b) is low.

4 Conclusion

We propose the chronological causal bandit (CCB) algorithm, which transfers information between causal bandits that have been played in the same dynamical system at an earlier point in time. Some initial findings are demonstrated on a simple toy example, where we show that taking the system dynamics into account has a profound effect on the final action selection. Further, whilst in this example we have assumed that $\mathcal{G}$ is the same for each trial, it remains a large assumption that will be studied further in future work. There are many other interesting avenues for further work, such as the optimal length of horizon $N$ as well as determining the best time to play a bandit (i.e. start a trial). Moreover, the current framework allows for confounders to change across time-steps – something we have yet to explore.


  • V. Aglietti, N. Dhir, J. González, and T. Damoulas (2021) Dynamic causal bayesian optimization. In Advances in Neural Information Processing Systems, Vol. 35. Cited by: §1.1, item 2, §2, §2.1.
  • V. Aglietti, X. Lu, A. Paleyes, and J. González (2020) Causal Bayesian Optimization. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 108, pp. 3155–3164. Cited by: §2, §2.1, Remark 2.1.
  • M. G. Azar, A. Lazaric, and E. Brunskill (2013) Sequential transfer in multi-armed bandit with finite set of models. arXiv preprint arXiv:1307.6887. Cited by: §1.
  • E. Bareinboim and J. Pearl (2016) Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113 (27), pp. 7345–7352. Cited by: §1.1.
  • O. Cappé, A. Garivier, O. Maillard, R. Munos, and G. Stoltz (2013) Kullback-leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, pp. 1516–1541. Cited by: Figure 6, §3.
  • H. Chernoff (1959) Sequential design of experiments. The Annals of Mathematical Statistics 30 (3), pp. 755–770. Cited by: §1.
  • D. J. Gauthier, E. Bollt, A. Griffith, and W. A. Barbosa (2021) Next generation reservoir computing. arXiv preprint arXiv:2106.07688. Cited by: §1.
  • C. Glymour, K. Zhang, and P. Spirtes (2019) Review of causal discovery methods based on graphical models. Frontiers in Genetics 10. Cited by: §2.
  • S. Guha, K. Munagala, and P. Shi (2010) Approximation algorithms for restless bandit problems. Journal of the ACM (JACM) 58 (1), pp. 1–50. Cited by: §1.
  • M. Kocaoglu, K. Shanmugam, and E. Bareinboim (2017) Experimental design for learning causal graphs with latent variables. In Nips, Cited by: §2.
  • D. Koller and N. Friedman (2009) Probabilistic graphical models: principles and techniques - adaptive computation and machine learning. The MIT Press. External Links: ISBN 0262013193 Cited by: §1.1.
  • F. Lattimore, T. Lattimore, and M. D. Reid (2016) Causal bandits: learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181–1189. Cited by: §1.
  • T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: §1, §1.1, §1.
  • S. Lee and E. Bareinboim (2018) Structural causal bandits: where to intervene?. Advances in Neural Information Processing Systems 31 31. Cited by: §A.1, §A.2, Appendix A, Figure 1, Figure 2, §1, §1.1, §1.1, §1.1, §1, §2, Remark 2.1, Figure 5, Assumptions 1.
  • S. Lee and E. Bareinboim (2019) Structural causal bandits with non-manipulable variables. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4164–4172. Cited by: §1.
  • S. Lee and E. Bareinboim (2020) Characterizing optimal mixed policies: where to intervene and what to observe. Advances in neural information processing systems 33. Cited by: §1.1.
  • Y. Lu, A. Meisami, A. Tewari, and W. Yan (2020) Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp. 141–150. Cited by: §1.
  • M. Mandal, S. Jana, S. K. Nandi, A. Khatua, S. Adak, and T. Kar (2020) A model based study on the dynamics of covid-19: prediction and control. Chaos, Solitons & Fractals 136, pp. 109889. Cited by: §1.
  • J. Pearl (2000) Causality: models, reasoning and inference. Vol. 29, Springer. Cited by: §1.1, §1.1, §2.1, Definition 2.
  • H. Robbins (1952) Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58 (5), pp. 527–535. Cited by: §1.1, §1.
  • P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman (2000) Causation, prediction, and search. MIT press. Cited by: §2.
  • W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §3.
  • P. Whittle (1988) Restless bandits: activity allocation in a changing world. Journal of applied probability 25 (A), pp. 287–298. Cited by: §1.
  • J. Y. Yu and S. Mannor (2009) Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th annual international conference on machine learning, pp. 1177–1184. Cited by: §1.
  • J. Zhang and E. Bareinboim (2017) Transfer learning in multi-armed bandit: a causal approach. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 1778–1780. Cited by: §1.

Appendix A Toy example details

Material herein is adapted from (Lee and Bareinboim, 2018, task 2).

A.1 Structural equation model

Generating model for exogenous variables:

In the CCB setting, these probabilities remain the same for all time-slices indexed by $t$, as shown by the functions in $\mathbf{F}_t$ when $t > 0$:


When $t = 0$ we use the original set of structural equation models (Lee and Bareinboim, 2018, appendix D):


where $\oplus$ is the exclusive disjunction operator and $\wedge$ is the logical conjunction operator (i.e. ‘and’). The biggest difference between eqs. 4 to 6 and eqs. 7 to 9 is that the former has an explicit dependence on the past. Depending on the implemented intervention at $t-1$, one or both of $\{X, Z\}$ will be fixed to the implemented value(s).

A.2 Intervention sets

“The task of finding the best arm among all possible arms can be reduced to a search within the MISs” (Lee and Bareinboim, 2018). The POMIS arms for our toy problem are highlighted in table 1.

Arm ID Domain Interventions
Table 1: Assume binary domains, i.e. $D(X) = \{0, 1\}$ and $D(Z) = \{0, 1\}$. The first arm (ID 0) does nothing, i.e. corresponds to the intervention on the empty set, $\mathrm{do}(\emptyset)$. Arms which belong to the POMIS are highlighted.

A.3 Additional results

The optimal intervention sequence, one implemented intervention per trial, translated to the example domain, is shown in fig. 3.

Figure 6: Results using a Kullback-Leibler Upper-confidence bound (KL-UCB) policy (Cappé et al., 2013).