1 Introduction
A dynamical system evolves in time. Examples include the weather, financial markets, and unmanned aerial vehicles (Gauthier et al., 2021). One practical goal is to make decisions within such systems, such as deciding which financial instrument to buy, one day after another. The multi-armed bandit (MAB) paradigm (Robbins, 1952; Chernoff, 1959) can help with this. In this setting an agent and an environment interact sequentially (Lattimore and Szepesvári, 2020): the agent picks an action and receives a reward from the environment. The agent continues in this fashion, usually for a finite number of plays, with the goal of maximising the total reward. The challenge in this problem stems from the trade-off between exploiting actions with known rewards and exploring actions with unknown rewards.
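The explore/exploit loop described above can be sketched in a few lines. The epsilon-greedy agent on a Bernoulli bandit below is purely illustrative; all names and parameter values are our own, not from the paper:

```python
import random

def run_bandit(true_means, horizon=1000, eps=0.1, seed=0):
    """Epsilon-greedy agent on a Bernoulli bandit: with probability eps the
    agent explores a random arm, otherwise it exploits the arm with the best
    empirical mean so far (illustrative sketch only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms   # times each arm was played
    est = [0.0] * n_arms    # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < eps:
            a = rng.randrange(n_arms)                      # explore
        else:
            a = max(range(n_arms), key=lambda i: est[i])   # exploit
        r = 1.0 if rng.random() < true_means[a] else 0.0   # Bernoulli reward
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]                 # incremental mean
        total_reward += r
    return est, counts, total_reward
```

With enough rounds the agent concentrates its plays on the arm with the highest mean reward.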
Recently, many studies have addressed the case in which there is a non-trivial dependency structure between the arms. One such direction presumes that the dependency structure is modelled explicitly through causal graphs (Lee and Bareinboim, 2018, 2019; Lattimore et al., 2016; Lu et al., 2020). We extend that paradigm to also account for the causal temporal dynamics of the system. MABs already constitute sequential decision-making paradigms, but here we expand that idea to cover chronological MABs, where the reward distribution is conditional upon the actions taken by earlier MABs (see fig. 1). We are not considering a Markov decision process (MDP): we have no explicit notion of state and consequently do not maintain a model of the state dynamics. This type of transfer learning in causal MABs was also studied by Zhang and Bareinboim (2017) and Azar et al. (2013), but there the authors transfer information between unrelated tasks, whereas we are interested in transfer when all agents operate in the same (possibly non-stationary) dynamical system. Our setting is instead closer to the restless bandit problem (Whittle, 1988; Guha et al., 2010), where rewards vary with time (unlike the standard bandit setting, where they are fixed but unknown).
Example.
Consider a dynamical environment (we will somewhat abuse standard notation from dynamical systems theory), such as a country subject to the COVID-19 pandemic. The highly transient nature of the virus (Mandal et al., 2020) necessitates multiple chronological clinical trials, the start of each indexed by t (see fig. 1), to find a treatment or vaccine. Suppose we conduct a clinical trial with several treatments of unknown efficacy for COVID-19 and a study group of patients. Patients arrive sequentially, and we must decide on a treatment to administer to each new patient.
To make this decision, we could learn from how the previous choices of treatments fared for the previous patients. After a sufficient number of rounds, we may have a reasonable idea of which treatment is most effective, and from thereon we could administer that treatment to all patients. However, the exploration phase may take a long time, and many patients may receive a suboptimal treatment during that period. But we know that an earlier, similar trial has just concluded, and because we are aware of the causal nature of our treatments and their evolution over time, we can condition our trial on the lessons learned, and actions taken, in that trial before we start ours. There are two purposes to this: (1) the additional information may aid the discovery of the most effective treatment in our trial, and (2) it may also show that the optimal intervention changes over time owing to the highly non-stationary environment of real systems, whereas a typical assumption for standard MABs is a stationary reward distribution (Lattimore and Szepesvári, 2020). The time between two consecutive trials t-1 and t is left to vary.
Contributions.
The chronological causal bandit (CCB) extends the SCM-MAB by Lee and Bareinboim (2018) by conditioning an SCM-MAB on prior causal bandits played in the same environment. The result is a piecewise-stationary model which offers a novel approach for causal decision-making under uncertainty within dynamical systems. The reward process of the arms is non-stationary on the whole, but stationary on intervals (Yu and Mannor, 2009). Specifically, past optimal interventions are transferred across time, allowing the present MAB to weigh the utility of those actions in the present game.
1.1 Preliminaries
Notation.
Random variables are denoted by uppercase letters and their values by lowercase letters. Sets of variables and their values are denoted by bold uppercase and lowercase letters respectively. We make extensive use of the do-calculus (for details see (Pearl, 2000, §3.4)). Observational samples, drawn from a system or process unperturbed over time, are contained in the dataset D_O. Interventional samples, drawn from a system or process subject to one or more interventions, are contained in the dataset D_I. The domain of a variable X is denoted by D(X).
Structural causal model.
Structural causal models (SCMs) (Pearl, 2000, ch. 7) are used as the semantic framework to represent an underlying environment. For the exact definition as used by Pearl see (Pearl, 2000, def. 7.1.1). Let M be an SCM parametrised by the quadruple ⟨U, V, F, P(U)⟩. Here, U is a set of exogenous variables which follow a joint distribution P(U), and V is a set of endogenous (observed) variables. Within V we distinguish between two types of variables: manipulative (treatment) variables and the target (output) variable (always denoted by Y in this paper). Further, the endogenous variables are determined by a set of functions F. Let f_i ∈ F (Lee and Bareinboim, 2020, §1) s.t. each f_i is a mapping from (the respective domains of) U_i ∪ Pa_i to V_i, where U_i ⊆ U and Pa_i ⊆ V \ {V_i}. Graphically, each SCM is associated with a causal diagram (a directed acyclic graph, DAG for short) G, where the edges are given by F. Each vertex in the graph corresponds to a variable, and the directed edges point from members of U_i and Pa_i toward V_i (Pearl, 2000, ch. 7). A directed edge V_i → V_j exists if V_i ∈ Pa_j (i.e. V_j is a child of V_i). A bidirected edge between V_i and V_j occurs if they share an unobserved confounder, which is to say U_i ∩ U_j ≠ ∅ (Lee and Bareinboim, 2018). Unless otherwise stated, from hereon, when referring to V we implicitly consider V \ {Y}, i.e. the manipulative variables not including the outcome variable. Finally, the fundamental do-operator, do(X = x), represents the operation of fixing a set of endogenous variable(s) X to constant value(s) x irrespective of their original mechanisms. Throughout, we do not consider graphs with non-manipulative variables. For a more incisive discussion of the properties of SCMs we refer the reader to (Pearl, 2000; Bareinboim and Pearl, 2016).
Multi-armed bandit.
The MAB setting (Robbins, 1952) entertains a discrete sequential decision-making scenario in which an agent selects an action or ‘arm’ a_t according to a policy and receives a stochastic reward Y_t emanating from an unknown distribution particular to each arm. The expected reward of arm a is μ_a = E[Y | a]. The goal of the agent is to optimise the arm-selection sequence and thereby maximise the expected cumulative reward after T rounds, where the expectation is taken under the given policy and a_t is the arm played on the t-th round. We will use a similar performance measure, the cumulative regret (Lee and Bareinboim, 2018) R_T = T μ* − Σ_{t=1}^{T} E[Y_t], where the maximal mean reward is μ* = max_a μ_a. Using the regret decomposition lemma (Lattimore and Szepesvári, 2020, Lemma 4.5), we can write this in the form
(1)  $R_T = \sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E}\left[T_a(T)\right]$
where each arm’s gap from the best arm (“suboptimality gap” (Lattimore and Szepesvári, 2020, §4.5)) is Δ_a = μ* − μ_a, and T_a(T) is the total number of times arm a was played after T rounds.
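As a concrete check of the decomposition, the following helper (our own illustration, not from the paper) computes R_T from the mean rewards and the play counts:

```python
def cumulative_regret(mu, plays):
    """Regret via the decomposition lemma: R_T = sum_a Delta_a * T_a(T),
    with Delta_a = max(mu) - mu[a]. `mu` holds each arm's mean reward and
    `plays` how often each arm was played (illustrative helper)."""
    mu_star = max(mu)
    return sum((mu_star - m) * n for m, n in zip(mu, plays))
```

For two arms with means 0.5 and 0.9, playing the worse arm 10 times costs 10 × 0.4 = 4 units of regret.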
Connecting SCMs to MABs.
Echoing the approach taken by Lee and Bareinboim (2018, §2), using the notation and concepts introduced in section 1.1, let M be an SCM parametrised by ⟨U, V, F, P(U)⟩, where Y ∈ V is the, as noted, reward variable. The arms of the bandit are given by the set A = {do(X = x) | X ⊆ V \ {Y}, x ∈ D(X)}, i.e. the set of all possible interventions on the endogenous variables except the reward variable (Lee and Bareinboim, 2018, §2). Each arm do(X = x) is associated with a reward distribution P(Y | do(X = x)), with mean reward μ_x = E[Y | do(X = x)]. This is the SCM-MAB setting (Lee and Bareinboim, 2018), fully represented by the tuple ⟨M, Y⟩. As noted by the authors, an agent facing an SCM-MAB intervenes (plays arms) with knowledge of G and D(V), but does not have access to the structural equation model F or the joint distribution P(U) over the exogenous variables.
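The arm set A can be enumerated mechanically from the variable domains; the sketch below (our own illustration) lists every do(X = x) over all subsets of the manipulative variables:

```python
from itertools import combinations, product

def all_arms(domains):
    """Enumerate the full arm set: one arm per subset X of the manipulative
    variables and per value assignment x in its domain (the empty dict is
    the null intervention do()). `domains` maps variable name -> values."""
    names = sorted(domains)
    arms = []
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            for values in product(*(domains[v] for v in subset)):
                arms.append(dict(zip(subset, values)))
    return arms
```

For two binary manipulative variables this yields 1 + 2 + 2 + 4 = 9 arms.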
Causality across time.
Taking inspiration from Aglietti et al. (2021, §2), we consider causality in time, manifested by propagating a DAG in time and connecting each time-slice DAG with directed edges, as shown in fig. 2. In doing so we are invariably considering a dynamic Bayesian network (DBN) (Koller and Friedman, 2009). As we make interventions on time-slices of the DBN, we introduce notation to aid the exposition of the method.
Definition 1.
Let M_t be the SCM at time step t. The temporal window w covered by the SCM spans one time-slice, i.e. taking into account only the most recent time-slice and the actions taken therein. It is also possible to increase the size of w to include the entire history, as is done in fig. 3. (The choice of window size is a difficult one: more information is typically better, but we may also subject the model to ‘stale’ information, i.e. interventions which are no longer of any relevance or, worse, misleading in the present scenario.)
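A fixed-size memory of past implemented interventions, in the spirit of the temporal window of Definition 1, can be kept with a bounded queue (an illustrative data structure, not the paper's implementation):

```python
from collections import deque

def make_history(window):
    """A bounded record of implemented interventions: window=1 keeps only the
    previous time-slice, a larger window keeps more history. Appending past
    capacity silently discards the oldest ('stale') entry."""
    return deque(maxlen=window)
```

With `window=1`, appending the intervention of trial t evicts that of trial t-1, mirroring the smallest window discussed above.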
Definition 2.
2 Chronological Causal Bandits
The SCM-MAB is particular to one graph, in which we seek to minimise the cumulative regret in eq. 1. We instead seek a sequence of interventions which minimises the regret at each time step t (i.e. the start of each trial; for clarity, each trial contains multiple rounds, summarised by the horizon T) of a piecewise-stationary SCM-MAB, as set out in the next paragraph.
Problem statement.
Similar to (Aglietti et al., 2021), the goal of this work is to find a sequence of optimal interventions over time, indexed by t, by playing a series of sequential conditional SCM-MABs, or chronological causal bandits (CCB). The agent is provided with an SCM-MAB for each trial t and is then tasked with optimising the arm-selection sequence within each trial, in so doing minimising the total regret for the given horizon T. However, different from the standard MAB problem, we must take previous interventions into account as well. We do that by first writing the regret of eq. 1 in a different form, conditional on past interventions, enabling the transfer of information
(2)  
where I_{t-1} denotes the previously implemented intervention and 𝟙(·) is the indicator function. Now, take particular care w.r.t. the trial index which the “implemented intervention” concerns. Drawing upon our example in section 1, the implemented intervention here corresponds to the treatment that was found to be the most effective during trial t-1 (during which T rounds were played). Although the agent finds a sequence of decisions/treatments, only one of them, in this setting, is the overall optimal choice, i.e. has the lowest average suboptimality gap. The implemented intervention at trial t-1 is found by investigating which arm has the lowest regret at the end of horizon T
(3)  $I_{t-1} = \operatorname{arg\,max}_{a \in \mathcal{A}} T_a(T)$
Simply, we pick the arm that has been played the most by the agent.
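That selection rule is a one-liner; the sketch below (our own illustration) returns the most-played arm given the play counts at the end of the horizon:

```python
def implemented_intervention(play_counts):
    """The arm carried forward to the next trial: the one played most often
    during the horizon, i.e. the arm a good solver has concentrated on."""
    return max(range(len(play_counts)), key=lambda a: play_counts[a])
```

A well-performing bandit algorithm plays the optimal arm the vast majority of the time, so the most-played arm is the natural candidate to implement.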
Remark 2.1.
The above problem statement does not discuss which set of interventions is being considered. Naively, one approach is to consider all the sets in A (Aglietti et al., 2020). Though valid, this set grows exponentially with the complexity of G. Intuitively, one may believe that the best and only action worth considering is to intervene on all the parents of the reward variable Y. This is indeed true provided that Y is not confounded with any of its ancestors (Lee and Bareinboim, 2018). But whenever unobserved confounders are present (shown by red and dashed edges in fig. 2) this no longer holds true. By exploiting the rules of do-calculus and the partial orders amongst subsets of variables (Aglietti et al., 2020), Lee and Bareinboim (2018) characterise a reduced set of intervention variables called the possibly-optimal minimal intervention set (POMIS), whose induced arm set is a subset of A and is typically much smaller. They demonstrate empirically that the selection of arms based on POMISs makes standard MAB solvers “converge faster to an optimal arm” (Lee and Bareinboim, 2018, §5). Throughout, we use the POMIS as well. For the full set of arms for the toy problem in fig. 2, see table 1.
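To see why restricting the arm set matters, note that for n binary manipulative variables the unrestricted arm set has size 3^n (every subset of variables times every binary assignment). The sketch below counts that growth; it is our own illustration, not the POMIS algorithm itself:

```python
from math import comb

def full_arm_count(n_binary_vars):
    """Size of the unrestricted arm set over n binary manipulative variables:
    sum over subset sizes k of C(n, k) * 2^k assignments, which equals 3^n."""
    return sum(comb(n_binary_vars, k) * 2 ** k
               for k in range(n_binary_vars + 1))
```

Already at n = 10 the full arm set holds 59,049 arms, which motivates searching only within the POMIS-induced subset.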
Assumptions 1.
To tackle a CCB-style problem, we make the following assumptions (included are also those made by Lee and Bareinboim (2018) w.r.t. the standard SCM-MAB):
1. Invariance of the causal structure G across time.
2. The causal diagram G is known; Aglietti et al. (2021) showed that, given assumption (1), the set of interventions worth considering does not change across time.
3. An agent facing an SCM-MAB plays arms with knowledge of G and D(V), but not of F and P(U).
Assumption (2) posits that the DAG is known. If this were not true then we would have to undertake causal discovery (CD) (Glymour et al., 2019) or spend the first interactions with the environment (Lee and Bareinboim, 2018) learning the causal DAG, either from observational data (Spirtes et al., 2000), from interventional data (Kocaoglu et al., 2017), or both (Aglietti et al., 2020). As it is, CD is outside the scope of this paper.
2.1 Transferring information between causal bandits
Herein we describe how to express the reward distribution for trial t as a function of the intervention(s) implemented at the previous trial t-1. The key to this section is the relaxation of assumption (3). We seek an estimate of F, the true SEM. Equipped with such an estimate, we can relay information about past interventions to the present SCM-MAB and thus enable the present reward distribution to take those interventions into account. Because we operate with discrete variables, we model all functions in the estimated SEM as independent probability mass functions (PMFs).
Simulation.
Recall that P(V) is an observational distribution, whose samples are contained in D_O, whereas P(V | do(X = x)) is an interventional distribution, found by fixing X to x in M; its samples are contained in D_I. As we are operating in a discrete world, we do not need to approximate any intractable integrals (as is required in e.g. (Aglietti et al., 2021, 2020)) to compute the effect of interventions. Rather, by assuming the availability of D_O and using the do-calculus, we are at liberty to estimate F by approximating the individual PMFs that arise from applying the do-calculus (see (Pearl, 2000, Theorem 3.4.1)). Consequently, we can build a ‘simulator’ using D_O (which concerns the whole history, not just the window w). D_I is very scarce because playing an arm does not yield an observed target value but only a reward (hence we cannot exploit the actions taken during horizon T). The only interventional data available, at each trial t, to each SCM-MAB, is the implemented intervention I_{t-1}. The CCB method is graphically depicted in fig. 3.
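The simulator's building block is a frequency estimate of each PMF from observational samples. The sketch below (our own illustration, assuming discrete variables and sufficient data per stratum) estimates a marginal and a conditional PMF:

```python
from collections import Counter

def estimate_pmf(samples):
    """Frequency estimate of a probability mass function from a list of
    discrete observational samples."""
    counts = Counter(samples)
    n = len(samples)
    return {value: c / n for value, c in counts.items()}

def estimate_conditional_pmf(pairs):
    """Estimate P(child | parent) from (parent, child) observational pairs.
    An interventional query do(parent = p) then simply reads the table at p,
    since fixing a variable severs its incoming mechanisms."""
    by_parent = {}
    for parent, child in pairs:
        by_parent.setdefault(parent, []).append(child)
    return {p: estimate_pmf(children) for p, children in by_parent.items()}
```

Chaining such tables in topological order of the DAG yields a forward simulator for the discrete SEM estimate.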
3 Experiments
We demonstrate some empirical observations on the toy example used throughout, shown in fig. 2. We consider the first five time-slices (trials) as shown in fig. 3. The reward distribution, under the POMIS, for each slice is shown in fig. 4 (these distributions are found conditional upon the optimal intervention being implemented in the preceding trial, as shown in fig. 3). For the true SEM see eqs. 4 to 6 and eqs. 7 to 9. Figure 4 shows that the system is, in effect, oscillating in its reward distribution. This is because the optimal intervention changes between trials.
The horizon for each trial was set to 10,000 rounds. We used two common MAB solvers: Thompson sampling (TS) (Thompson, 1933) and KL-UCB (Cappé et al., 2013). Displayed results are averaged over 100 replicates of each experiment, shown in fig. 5. We investigate the cumulative regret (CR) and the optimal arm-selection probability at various instances in time. For the first trial, CCB and SCM-MAB are targeting the same reward distribution; consequently they both identify the same optimal intervention. At the next trial things start to change: having implemented the optimal intervention from the previous trial, CCB is now targeting a different set of rewards (see fig. 4). The SCM-MAB, being agnostic about past interventions, targets the same rewards as previously (blue bars in fig. 4). As discussed, this ignores the dynamics of the system; a vaccinated population will greatly alter the healthcare environment, hence administering the same vaccine (arm 3) at the next clinical trial, without taking this into account, makes for poor public policy.
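A textbook Beta-Bernoulli Thompson sampling solver, one of the two policies used above, can be written as follows (a generic implementation for illustration, not the authors' code):

```python
import random

def thompson_sampling(true_means, horizon=1000, seed=0):
    """Beta-Bernoulli Thompson sampling: maintain a Beta(alpha, beta)
    posterior per arm, sample one draw per arm each round, and play the arm
    with the largest draw. Returns the per-arm play counts."""
    rng = random.Random(seed)
    n = len(true_means)
    alpha = [1] * n   # prior successes + 1
    beta = [1] * n    # prior failures + 1
    plays = [0] * n
    for _ in range(horizon):
        theta = [rng.betavariate(alpha[a], beta[a]) for a in range(n)]
        a = max(range(n), key=lambda i: theta[i])
        r = 1 if rng.random() < true_means[a] else 0   # Bernoulli reward
        alpha[a] += r
        beta[a] += 1 - r
        plays[a] += 1
    return plays
```

As the posteriors concentrate, plays accumulate on the arm with the highest mean, which is exactly why play counts identify the implemented intervention in eq. 3.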
Cumulative regret curves (truncated at 5,000 rounds) are shown along with their standard deviation for each trial. The optimal arm-selection probability is shown in fig. 4(c), found using a Thompson sampling policy (Lee and Bareinboim, 2018, §5). An equivalent plot for KL-UCB can be found in fig. 6. Consider the CR from the CCB in fig. 4(b); it is lower than that of the SCM-MAB in fig. 4(a), as the CCB transfers the preceding intervention to the current causal model (fig. 3) and now finds a different optimal intervention (see appendix A for the full optimal intervention sequence). Let us now turn to fig. 4(c): to minimise the regret, the agent should be choosing the optimal action almost all of the time. But it is only possible to reduce regret if the algorithm has discovered the arm with the largest mean. In the later trials the reward per arm, across the POMIS, is almost identical. As it stands, the agent does not have reasonable statistical certainty that it has found the optimal arm (orange and red curves in fig. 4(c)). But since these arms have nearly the same causal effect, the CR in fig. 4(b) remains low.
4 Conclusion
We propose the chronological causal bandit (CCB) algorithm, which transfers information between causal bandits that have been played in the same dynamical system at earlier points in time. Some initial findings are demonstrated on a simple toy example, where we show that taking the system dynamics into account has a profound effect on the final action selection. Further, whilst in this example we have assumed that the graph G is the same for each trial, this remains a strong assumption that will be studied further in future work. There are many other interesting avenues for further work, such as the optimal length of the horizon, as well as determining the best time to play a bandit (i.e. start a trial). Moreover, the current framework allows confounders to change across time steps, something we have yet to explore.
References
Aglietti et al. (2021). Dynamic causal Bayesian optimization. In Advances in Neural Information Processing Systems, Vol. 35.
Aglietti et al. (2020). Causal Bayesian optimization. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 108, pp. 3155–3164.
Azar et al. (2013). Sequential transfer in multi-armed bandit with finite set of models. arXiv preprint arXiv:1307.6887.
Bareinboim and Pearl (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27), pp. 7345–7352.
Cappé et al. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, pp. 1516–1541.
Chernoff (1959). Sequential design of experiments. The Annals of Mathematical Statistics 30(3), pp. 755–770.
Gauthier et al. (2021). Next generation reservoir computing. arXiv preprint arXiv:2106.07688.
Glymour et al. (2019). Review of causal discovery methods based on graphical models. Frontiers in Genetics 10.
Guha et al. (2010). Approximation algorithms for restless bandit problems. Journal of the ACM 58(1), pp. 1–50.
Kocaoglu et al. (2017). Experimental design for learning causal graphs with latent variables. In Advances in Neural Information Processing Systems.
Koller and Friedman (2009). Probabilistic graphical models: principles and techniques. The MIT Press. ISBN 0262013193.
Lattimore et al. (2016). Causal bandits: learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181–1189.
Lattimore and Szepesvári (2020). Bandit algorithms. Cambridge University Press.
Lee and Bareinboim (2018). Structural causal bandits: where to intervene? Advances in Neural Information Processing Systems 31.
Lee and Bareinboim (2019). Structural causal bandits with non-manipulable variables. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4164–4172.
Lee and Bareinboim (2020). Characterizing optimal mixed policies: where to intervene and what to observe. Advances in Neural Information Processing Systems 33.
Lu et al. (2020). Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp. 141–150.
Mandal et al. (2020). A model based study on the dynamics of COVID-19: prediction and control. Chaos, Solitons & Fractals 136, 109889.
Pearl (2000). Causality: models, reasoning and inference. Cambridge University Press.
Robbins (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5), pp. 527–535.
Spirtes et al. (2000). Causation, prediction, and search. MIT Press.
Thompson (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), pp. 285–294.
Whittle (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability 25(A), pp. 287–298.
Yu and Mannor (2009). Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1177–1184.
Zhang and Bareinboim (2017). Transfer learning in multi-armed bandit: a causal approach. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 1778–1780.
Appendix A Toy example details
Material herein is adapted from (Lee and Bareinboim, 2018, task 2).
A.1 Structural equation model
Generating model for exogenous variables:
In the CCB setting, these probabilities remain the same for all time-slices indexed by t, as shown by the functions in F for t ≥ 1:
(4)  
(5)  
(6) 
When t = 0 we use the original set of structural equation models (Lee and Bareinboim, 2018, appendix D):
(7)  
(8)  
(9) 
where ⊕ is the exclusive disjunction (XOR) operator and ∧ is the logical conjunction operator (i.e. ‘and’). The biggest difference between eqs. 4 to 6 and eqs. 7 to 9 is that the former has an explicit dependence on the past. Depending on the implemented intervention at t-1, one or both of X and Z will be fixed to the value(s) in I_{t-1}.
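An SEM of this XOR/conjunction form can be simulated forward, with the do-operator overriding mechanisms. The functional forms below are illustrative assumptions in the spirit of eqs. 7 to 9, not the paper's exact equations:

```python
import random

def sample_sem(do=None, seed=None):
    """One forward sample from an illustrative XOR-style SEM over binary
    variables X, Z, Y (the concrete mechanisms here are assumptions, not the
    paper's exact equations). `do` fixes endogenous variables, overriding
    their mechanisms, exactly as the do-operator prescribes."""
    rng = random.Random(seed)
    do = do or {}
    u_x, u_z, u_y = (rng.randint(0, 1) for _ in range(3))  # exogenous draws
    x = do.get("X", u_x)
    z = do.get("Z", x ^ u_z)        # ^ is XOR (exclusive disjunction)
    y = do.get("Y", z ^ (x & u_y))  # mixes XOR with conjunction
    return {"X": x, "Z": z, "Y": y}
```

Repeatedly sampling under a fixed `do` assignment yields the discrete interventional distributions that the simulator of section 2.1 estimates from D_O.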
A.2 Intervention sets
“The task of finding the best arm among all possible arms can be reduced to a search within the MISs” (Lee and Bareinboim, 2018). The POMIS determines which of the arms in table 1 are considered.
Arm ID  Domain  Interventions  

0  
1  
2  
3  
4  
5  
6  
7  
8 
A.3 Additional results
The optimal intervention sequence alternates between trials, consistent with the oscillating reward distributions in fig. 4.