# Causal Bandits with Propagating Inference

The bandit problem is a framework for designing sequential experiments. In each experiment, a learner selects an arm $A \in \mathcal{A}$ and obtains an observation corresponding to $A$. Theoretically, the tight regret lower bound for the general bandit problem is polynomial in the number of arms $|\mathcal{A}|$. This makes bandit algorithms incapable of handling an exponentially large number of arms, so bandit problems with side information are often considered in order to overcome this lower bound. Recently, a bandit framework over a causal graph was introduced, where the structure of the causal graph is available as side information. A causal graph is a fundamental model that is frequently used in a variety of real problems. In this setting, the arms are identified with interventions on a given causal graph, and the effect of an intervention propagates throughout the causal graph. The task is to find the best intervention, that is, the one that maximizes the expected value on a target node. Existing algorithms for causal bandits overcame the $\Omega(\sqrt{|\mathcal{A}|/T})$ simple-regret lower bound; however, they work only when the interventions $\mathcal{A}$ are localized around a single node (i.e., an intervention propagates only to its neighbors). We propose a novel causal bandit algorithm for an arbitrary set of interventions, which can propagate throughout the causal graph. We also show that it achieves an $O(\sqrt{\gamma^* \log(|\mathcal{A}|T)/T})$ regret bound, where $\gamma^*$ is determined by the causal graph structure. In particular, if the in-degree of the causal graph is bounded, then $\gamma^* = O(N^2)$, where $N$ is the number of nodes.


## 1 Introduction

The multi-armed bandit has been widely recognized as a standard framework for modeling online learning with a limited number of observations. In each round of the bandit problem, a learner chooses an arm $A$ from a given set $\mathcal{A}$ of candidates and obtains a corresponding observation. Since observations are limited, the learner must adopt an efficient strategy for exploring the optimal arm $A^*$. The efficiency of the strategy is measured by regret, and the theoretically tight lower bound is polynomial in the number of arms $|\mathcal{A}|$ in the general multi-armed bandit setting. Thus, in order to improve on this lower bound, one requires additional information about the bandit setting. For example, the contextual bandit [1, 3] is a well-known class of bandit problems with side information based on domain-expert knowledge. For this setting, there is a regret bound that is logarithmic in the number of arms. In this paper, we also achieve a regret bound that is logarithmic in the number of arms, for a novel class of bandit problems with side information. To this end, let us introduce our bandit setting in detail.

The causal graph is a well-known tool for modeling a variety of real problems, including computational advertising, genetics, agriculture, and marketing. Based on causal graph discovery studies [5, 7, 8, 16], Lattimore et al. recently introduced the causal bandit framework. They consider the problem of finding the best intervention, one that causes desirable propagation of a probability distribution over a given causal graph, with a limited number of experiments $T$. In this setting, the arms are identified with a set $\mathcal{A}$ of interventions on the causal graph. A set of binary random variables $V_1, \dots, V_N$ is associated with the nodes of the causal graph. At each round of an experiment, a learner selects an intervention $A \in \mathcal{A}$, which enforces the realization of a variable $V_n$ to be $A_n$ when $A_n \in \{0, 1\}$. The effect of the intervention then propagates throughout the causal graph through the edges, and a realization over all nodes is observed after propagation. The goal of the causal bandit problem is to control the realization of the target variable $V_N$ with an optimal intervention.

Figure 1 is an illustrative example of the causal bandit problem. In the figure, the four nodes on the right represent a consumer decision-making model in e-commerce borrowed from the literature. This model assumes that customers decide whether to purchase based on their perceived risk in an online transaction (e.g., a defective product), their trust in a web vendor, and the perceived benefit of e-commerce (e.g., increased convenience). Consumer trust influences perceived risk. Here, we consider controlling customers' behavior with two kinds of advertising, which correspond to adding two intervenable nodes (Ad A and Ad B) to the model. Ad A can change only the reliability of a website; that is, it can influence the decision of customers only indirectly, through the middle nodes. In contrast, Ad B can change the perceived benefit. The aim is to increase the number of purchases by choosing an effective advertisement. This is indeed a bandit problem over a causal graph.

Lattimore et al. considered the causal bandit problem of minimizing simple regret and offered an improved regret bound over the aforementioned tight lower bound $\Omega(\sqrt{|\mathcal{A}|/T})$ [Theorem 4] for the general bandit setting [2, 6]. Sen et al. extended this study by incorporating a smooth intervention, and they provided a new regret bound parameterized by the performance gap between the optimal and sub-optimal arms. This parameterized bound comes from a technique developed for the general multi-armed bandit problem. These analyses, however, only work for a special class of interventions with known true parameters. Indeed, they only consider localized interventions.

#### Main contribution

This paper proposes the first algorithm for the causal bandit problem with an arbitrary set of interventions (which can propagate throughout the causal graph), with a theoretically guaranteed simple regret bound. The bound is $O(\sqrt{\gamma^* \log(|\mathcal{A}|T)/T})$, where $\gamma^*$ is a parameter bounded on the basis of the graph structure. In particular, $\gamma^* = O(N^2)$ if the in-degree of the causal graph is bounded by a constant, where $N$ is the number of nodes.

The major difficulty in dealing with an arbitrary intervention comes from the accumulation and propagation of estimation error. Existing studies consider interventions that only affect the parents $P_N$ of a single node $v_N$. To estimate the relationship between $P_N$ and $V_N$ in this setting, we could apply an efficient importance sampling algorithm [4, 11]. On the other hand, when we intervene on an arbitrary node, the intervention can affect the probabilistic propagation mechanism in any part of the causal graph. Hence, we cannot directly control the realization of intermediate nodes when designing efficient experiments.

The proposed algorithm consists of two steps. First, the preprocessing step is devoted to estimating parameters for designing efficient experiments used in the main step. More precisely, we focus on estimation of parameters with bounded relative error. By truncating small parameters that are negligible but tend to have large relative error, we manage to avoid accumulation of estimation error. In the main step, we apply an importance sampling approach introduced in [11, 15] on the basis of estimated parameters with a guaranteed relative error. This step allows us to estimate parameters with bounded absolute error, which results in the desired regret bound.

#### Related studies

Minimizing simple regret in bandit problems is called the best-arm identification [6, 9] or pure exploration problem, and it has been extensively studied in the machine learning research community. The inference of a causal graph structure is also well studied, and can be classified into causal graph discovery and causal inference: causal graph discovery [5, 7, 8, 16] considers efficient experiments for determining the structure of a causal graph, while causal inference [13, 14, 17, 18] challenges one to determine the graph structure only from historical data without additional experiments. The causal bandit problem designs experiments without using historical data, which is rather compatible with causal graph discovery studies.

#### Outline

This paper is organized as follows. We introduce the causal bandit problem proposed by Lattimore et al. in Section 2. We then present our bandit algorithm and regret bound in Section 3. The proof of the bound is presented in Section 4. We offer an experimental evaluation of our algorithm in Section 5.

## 2 Causal bandit problem

This section introduces the causal bandit problem proposed by Lattimore et al.

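Let $G = (V, E)$ be a directed acyclic graph (DAG) with a node set $V = \{v_1, \dots, v_N\}$ and a (directed) edge set $E$. Let $(v_i, v_j)$ denote an edge from $v_i$ to $v_j$. Without loss of generality, we suppose that the nodes in $V$ are topologically sorted so that no edge from $v_i$ to $v_j$ exists if $i \geq j$. For each $n \in \{1, \dots, N\}$, let $P_n$ denote the index set of the parents of $v_n$, i.e., $P_n = \{i : (v_i, v_n) \in E\}$. We then define $\bar{P}_n := P_n \cup \{n\}$.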

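Each node $v_n$ is associated with a random variable $V_n$, which takes a value in $\{0, 1\}$. The distribution of $V_n$ is then influenced by the variables associated with the parents of $v_n$ (unless $v_n$ is intervened, as described below). For each $n$ and $\pi \in \{0,1\}^{\bar{P}_n}$, the parameter $\alpha_n(\pi)$ defined below characterizes the distribution of $V_n$ given the realizations of its parents:

$$\alpha_n(\pi) := \mathrm{Prob}(V_n = \pi_n \mid V_i = \pi_i \text{ for all } i \in P_n).$$

That is to say, if the parents of $v_n$ are realized as $\pi_{P_n}$, then $V_n = \pi_n$ with probability $\alpha_n(\pi)$, and $V_n = 1 - \pi_n$ with probability $1 - \alpha_n(\pi)$.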

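Together with a DAG, we are also given a set $\mathcal{A}$ of interventions. Each intervention $A \in \mathcal{A}$ is identified with a vector $A = (A_1, \dots, A_N) \in \{0, 1, *\}^N$, where $A_n \neq *$ implies that $v_n$ is intervened and that the realization of $V_n$ is fixed as $A_n$. Let . Given an intervention $A$ and a realization $\pi \in \{0,1\}^{\bar{P}_n}$, the probability that $V_n = \pi_n$ holds, given the realization $\pi_{P_n}$ over the parents, is then determined as follows:

$$\mathrm{Prob}(V_n = \pi_n \mid V_i = \pi_i \text{ for all } i \in P_n, do(A)) = \begin{cases} \alpha_n(\pi) & \text{if } A_n = *, \\ 1 & \text{if } A_n = \pi_n, \\ 0 & \text{if } A_n = 1 - \pi_n. \end{cases}$$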

This equality, together with the adjacency of the causal graph $G$, completely determines the joint distribution over the variables $V_1, \dots, V_N$ under an arbitrary intervention $A$.
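To make the propagation mechanism concrete, the following is a minimal Python sketch of a single experiment under $do(A)$. The data layout is an assumption made purely for illustration: nodes are 0-indexed, `parents[n]` lists the parents of node `n`, `p_one[n]` maps each tuple of parent values to the probability that the node takes the value 1, and `None` plays the role of $*$. The function name `sample_realization` and the numbers in the toy graph are hypothetical.

```python
import random


def sample_realization(parents, p_one, A, rng):
    """Draw one realization of all nodes under the intervention A.

    parents[n] : tuple of parent indices of node n (all smaller than n)
    p_one[n]   : dict, tuple of parent values -> Prob(V_n = 1 | parents)
    A          : tuple with entries 0, 1, or None (None stands for '*')
    """
    values = []
    for n in range(len(parents)):  # nodes are topologically sorted
        if A[n] is not None:
            # Intervened node: its realization is fixed to A[n].
            values.append(A[n])
        else:
            # Free node: sample from its conditional distribution.
            key = tuple(values[i] for i in parents[n])
            values.append(1 if rng.random() < p_one[n][key] else 0)
    return tuple(values)


# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2.
parents = [(), (0,), (0, 1)]
p_one = [
    {(): 0.5},
    {(0,): 0.2, (1,): 0.9},
    {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.95},
]
rng = random.Random(0)
sample = sample_realization(parents, p_one, (1, None, None), rng)
fixed = sample_realization(parents, p_one, (1, 0, 1), rng)
```

Intervened nodes never consult `p_one`, which mirrors the second and third cases of the display above; only the non-intervened nodes draw from their conditional distributions.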

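In the causal bandit problem, we are given a DAG $G$ and a set $\mathcal{A}$ of interventions. However, the parameters $\alpha_n(\pi)$ ($n \in \{1, \dots, N\}$, $\pi \in \{0,1\}^{\bar{P}_n}$) are not known. Our ideal goal is then to find an intervention $A^* \in \mathcal{A}$ that maximizes the probability $\mu(A)$ of realizing $V_N = 1$, where $\mu(A)$ is defined by

$$\mu(A) := \mathrm{Prob}(V_N = 1 \mid do(A))$$

for each $A \in \mathcal{A}$.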

For this purpose, we discuss algorithms of the following form. First, they estimate $\mu(A)$ ($A \in \mathcal{A}$) from $T$ experimental trials. Each experiment consists of the application of an intervention and the observation of a realization over all nodes. Let $\hat{\mu}(A)$ denote the estimate of $\mu(A)$. Second, the algorithm selects the intervention $\hat{A}$ that maximizes $\hat{\mu}$. We evaluate the efficiency of such an algorithm with the simple regret $R_T$ defined as follows:

$$R_T = \mu(A^*) - E[\mu(\hat{A})].$$

Note that, even if an algorithm is deterministic, $\hat{A}$ includes stochasticity since the observations obtained in each experiment are produced by a stochastic process.
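As a small worked example of the definition, the sketch below evaluates the simple regret when the distribution of the returned intervention is known. The dictionaries `mu` and `pick_prob` and their values are hypothetical; in practice the distribution of the returned intervention is induced by the algorithm and the random observations.

```python
def simple_regret(mu, pick_prob):
    """Simple regret: best achievable mu minus the expected mu of the returned arm.

    mu[a]        : true success probability of intervention a
    pick_prob[a] : probability that the algorithm returns intervention a
    """
    best = max(mu.values())
    expected = sum(pick_prob[a] * mu[a] for a in mu)
    return best - expected


# Hypothetical two-arm example: the better arm is returned 90% of the time.
mu = {"A": 0.6, "B": 0.4}
pick_prob = {"A": 0.9, "B": 0.1}
regret = simple_regret(mu, pick_prob)  # 0.6 - (0.9 * 0.6 + 0.1 * 0.4)
```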

In this paper, we assume that and for ease of technical discussion.

## 3 Proposed Algorithm

We propose an algorithm for the causal bandit problem and present a regret bound for it in this section. The proofs of the bound are presented in the next section. Let for each , and . For a realization $\pi$ and an index set $S$ on which $\pi$ is defined, let $\pi_S$ denote the restriction of $\pi$ onto $S$.

### 3.1 Outline of the proposed algorithm

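Recall that the purpose of the causal bandit problem is to identify an intervention that maximizes $\mu$. This task is trivial if $\alpha_n(\pi)$ is known for all $n$ and $\pi$, because $\mu(A)$ can then be calculated for all $A \in \mathcal{A}$. Let $I_n := \{1, \dots, n\}$, and for $A \in \mathcal{A}$, let $I_{n,A}$ denote the set of nodes in $I_n$ which are not intervened by $A$; $I_{n,A} := \{m \in I_n : A_m = *\}$. $\mu(A)$ can then be represented as

$$\mu(A) = \sum_{\pi \in B(A)} \prod_{m \in I_{N,A}} \alpha_m(\pi_{\bar{P}_m}).$$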

Therefore, to compute $\mu(A)$ approximately, our algorithm estimates $\alpha_n(\pi)$ ($n \in \{1, \dots, N\}$, $\pi \in \{0,1\}^{\bar{P}_n}$).
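When all parameters are known, the sum-product representation of $\mu(A)$ can be evaluated by brute-force enumeration on small graphs. The sketch below is illustrative only. It assumes that $B(A)$ is the set of realizations over all nodes that agree with $A$ on the intervened nodes and set the target node to 1 (this reading is an assumption). Nodes are 0-indexed, `None` stands for $*$, and `p_one[n]` stores the probability that node `n` takes the value 1 for each tuple of parent values. The names `mu`, `parents`, and `p_one` are hypothetical.

```python
from itertools import product


def mu(parents, p_one, A):
    """Exact Prob(V_target = 1 | do(A)) by enumerating all realizations."""
    N = len(parents)
    total = 0.0
    for pi in product((0, 1), repeat=N):
        if pi[N - 1] != 1:
            continue  # only realizations with the target equal to 1
        if any(a is not None and a != v for a, v in zip(A, pi)):
            continue  # realization must agree with the intervention
        prob = 1.0
        for n in range(N):
            if A[n] is None:  # product over non-intervened nodes only
                p1 = p_one[n][tuple(pi[i] for i in parents[n])]
                prob *= p1 if pi[n] == 1 else 1.0 - p1
        total += prob
    return total


# Hypothetical chain 0 -> 1 with node 1 as the target.
parents = [(), (0,)]
p_one = [{(): 0.5}, {(0,): 0.2, (1,): 0.9}]
no_intervention = mu(parents, p_one, (None, None))  # 0.5 * 0.2 + 0.5 * 0.9
force_parent_on = mu(parents, p_one, (1, None))     # 0.9
```

The enumeration is exponential in the number of nodes, so it is only meant to illustrate the formula, not to serve as an efficient implementation.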

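In order to estimate $\alpha_n(\pi)$ efficiently, we are required to manipulate the random variables associated with the parents of $v_n$. More concretely, to estimate $\alpha_n(\pi)$ for $\pi \in \{0,1\}^{\bar{P}_n}$, we require samples whose realization satisfies $V_i = \pi_i$ over the parents $i \in P_n$ of $v_n$. For $n \in \{1, \dots, N\}$, $\pi \in \{0,1\}^{P_n}$, and $A \in \mathcal{A}$, we thus introduce the additional quantities $\beta_n(\pi, A)$ that denote the probability of realizing $\pi$ over $P_n$ with $A_n = *$ under a given intervention $A$. More precisely, we define

$$\beta_n(\pi, A) := \begin{cases} \mathrm{Prob}(V_m = \pi_m, \forall m \in P_n \mid do(A)) & \text{if } A_n = *, \\ 0 & \text{otherwise.} \end{cases}$$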

Our algorithm consists of two phases. The first phase estimates $\beta_n(\pi, A)$ ($n \in \{1, \dots, N\}$, $\pi \in \{0,1\}^{P_n}$, $A \in \mathcal{A}$), and the second phase estimates $\alpha_n(\pi)$. The algorithm conducts experiments in both phases. In the rest of this section, we first explain those phases and then present a regret bound for the algorithm.

### 3.2 First Phase: Estimation of β

Here, we introduce the estimation phase of for all . The pseudo-code of this phase is described in Algorithm 1. Algorithm 1 requires a positive number as a parameter, which will be set to . We perform experiments in this phase.

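Before explaining the details of Algorithm 1, we note that $\beta_n(\pi, A)$ can be calculated from $\alpha_m$ ($m < n$). For $n \in \{1, \dots, N\}$, $\pi \in \{0,1\}^{P_n}$, and $A \in \mathcal{A}$, let

$$B_n(\pi, A) := \{\pi' \in \{0,1\}^{n-1} \mid \pi'_i = A_i \text{ if } A_i \neq * \text{ and } i \in [1, n-1],\ \pi'_i = \pi_i \text{ if } i \in P_n\} \tag{1}$$

denote the set of realizations over $\{1, \dots, n-1\}$ that are consistent with the realization $\pi$ over $P_n$ and the intervention $A$. If $A_n = *$, then $\beta_n(\pi, A)$ is described as

$$\beta_n(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \alpha_m(\pi'_{\bar{P}_m}). \tag{2}$$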
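Equation (2) can be sketched in the same brute-force style. The code below is a hedged illustration rather than the authors' implementation: it enumerates the realizations of the nodes preceding node `n`, keeps those consistent with both the intervention and the requested parent values, and sums the product of the conditional probabilities of the non-intervened nodes. Nodes are 0-indexed, `None` stands for $*$, and the names `beta`, `parents`, and `p_one` are hypothetical.

```python
from itertools import product


def beta(parents, p_one, n, pi, A):
    """Probability that the parents of node n realize pi under do(A).

    pi : dict mapping each parent index of node n to its required value
    Returns 0 when node n itself is intervened, as in the definition.
    """
    if A[n] is not None:
        return 0.0
    total = 0.0
    for prefix in product((0, 1), repeat=n):  # realizations of nodes 0..n-1
        if any(A[i] is not None and A[i] != prefix[i] for i in range(n)):
            continue  # inconsistent with the intervention
        if any(prefix[i] != pi[i] for i in parents[n]):
            continue  # inconsistent with the requested parent values
        prob = 1.0
        for m in range(n):
            if A[m] is None:
                p1 = p_one[m][tuple(prefix[i] for i in parents[m])]
                prob *= p1 if prefix[m] == 1 else 1.0 - p1
        total += prob
    return total


# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2.
parents = [(), (0,), (0, 1)]
p_one = [
    {(): 0.5},
    {(0,): 0.2, (1,): 0.9},
    {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.95},
]
target = {0: 1, 1: 1}
b_free = beta(parents, p_one, 2, target, (None, None, None))  # 0.5 * 0.9
b_do = beta(parents, p_one, 2, target, (1, None, None))       # 0.9
```

In this toy example, intervening on node 0 doubles the probability of observing the desired parent configuration, which is why the choice of intervention matters for collecting useful samples.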

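Algorithm 1 consists of $N$ iterations. The $n$-th iteration computes the following objects:

• an estimate $\hat{\beta}_n(\pi, A)$ of $\beta_n(\pi, A)$,

• $\hat{A}_{n,\pi}$ for each $\pi \in \{0,1\}^{P_n}$,

• an estimate $\check{\alpha}_n(\bar{\pi})$ of $\alpha_n(\bar{\pi})$, and

• $G_n \subseteq \{0,1\}^{\bar{P}_n}$.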

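We remark that $\check{\alpha}_n$ in Algorithm 1 are used only for computing an estimate $\hat{\beta}$ and are not used for estimating $\mu$. An estimate $\hat{\alpha}_n$ of $\alpha_n$ is computed in the next phase of our algorithm.

At the beginning of the $n$-th iteration, we compute $\hat{\beta}_n(\pi, A)$ for each $\pi \in \{0,1\}^{P_n}$ and $A \in \mathcal{A}$ by (2), substituting $\check{\alpha}_m$ for $\alpha_m$:

$$\hat{\beta}_n(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \check{\alpha}_m(\pi'_{\bar{P}_m}). \tag{3}$$

Let us confirm that this can be computed if $\check{\alpha}_m$ ($m < n$) are available.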

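For each $\pi \in \{0,1\}^{P_n}$, then, we identify an intervention $\hat{A}_{n,\pi}$ that attains $\max_{A \in \mathcal{A}} \hat{\beta}_n(\pi, A)$. Using $\hat{A}_{n,\pi}$, we compute $\check{\alpha}'_n(\bar{\pi})$ as follows, where $\bar{\pi}$ is an extension of $\pi$ onto $\bar{P}_n$. We conduct experiments with $\hat{A}_{n,\pi}$. Let $t_n(\pi)$ be the number of those experiments in which the obtained realization satisfies $V_i = \pi_i$ for each $i \in P_n$. Let $\bar{t}_n(\pi)$ be the number of experiments counted in $t_n(\pi)$ in which $V_n = 1$ also holds. We then compute $\check{\alpha}'_n(\bar{\pi})$ using the equation

$$\check{\alpha}'_n(\bar{\pi}) = \begin{cases} \bar{t}_n(\pi)/t_n(\pi) & \text{if } \bar{\pi}_n = 1, \\ 1 - \bar{t}_n(\pi)/t_n(\pi) & \text{if } \bar{\pi}_n = 0. \end{cases} \tag{4}$$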

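The vector $\bar{\pi}$ is added to $G_n$ if

$$\check{\alpha}'_n(\bar{\pi}) \, \hat{\beta}_n(\pi, \hat{A}_{n,\pi}) \leq 2e S(\lambda), \tag{5}$$

where $S(\lambda)$ is defined as

$$S(\lambda) := \frac{12 \lambda N^2 C \log T}{T}.$$

The set $G_n$ thus collects those $\bar{\pi}$ for which $\alpha_n(\bar{\pi})$ is too small to estimate with sufficient accuracy. Then $\check{\alpha}_n$ is determined by replacing $\check{\alpha}'_n(\bar{\pi})$ with $0$ for $\bar{\pi} \in G_n$:

$$\check{\alpha}_n(\bar{\pi}) := \begin{cases} \check{\alpha}'_n(\bar{\pi}) & \text{if } \bar{\pi} \notin G_n, \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$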
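The counting estimate (4) and the truncation rule (5)-(6) can be sketched for a single parent realization as follows. This is a minimal illustration under stated assumptions: the threshold $2eS(\lambda)$ is passed in as a plain number rather than computed, the handling of `t == 0` is an arbitrary choice of this sketch, and the function name `truncated_estimates` and the example counts are hypothetical.

```python
def truncated_estimates(t, t_bar, beta_hat, threshold):
    """Empirical estimate of alpha for one parent realization, then truncation.

    t         : number of experiments whose parent values matched pi
    t_bar     : number of those experiments in which the node was 1
    beta_hat  : estimate of beta_n(pi, A) for the chosen intervention
    threshold : stands in for 2 e S(lambda), supplied by the caller
    Returns (alpha_check, G): alpha_check maps the node value (0 or 1) to
    the truncated estimate, and G is the set of truncated node values.
    """
    ratio = t_bar / t if t > 0 else 0.0  # assumption: fall back to 0 if no data
    alpha_prime = {1: ratio, 0: 1.0 - ratio}                    # equation (4)
    G = {v for v, a in alpha_prime.items() if a * beta_hat <= threshold}  # (5)
    alpha_check = {v: (0.0 if v in G else a) for v, a in alpha_prime.items()}  # (6)
    return alpha_check, G


# Hypothetical counts: 100 matching experiments, the node was 1 in 2 of them.
alpha_check, G = truncated_estimates(t=100, t_bar=2, beta_hat=0.5, threshold=0.02)
```

In this example the estimate for the value 1 is 0.02, and its product with `beta_hat` falls below the threshold, so that entry is set to zero while the estimate for the value 0 is kept.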

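This replacement contributes to reducing the relative estimation error of $\hat{\beta}_m$ in subsequent iterations ($m > n$).

After iterating for all $n$, the algorithm computes $H_n$ and $D_n$ ($n = 1, \dots, N$) defined by

$$H_n = \{\bar{\pi} \in \{0,1\}^{\bar{P}_n} \mid \hat{\beta}_n(\bar{\pi}_{P_n}, \hat{A}_{n,\bar{\pi}_{P_n}}) \leq 8eC^2 S(\lambda)\}, \tag{7}$$

$$D_n = G_n \cup H_n. \tag{8}$$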

This contributes to bound the absolute error of the estimation of for . The algorithm returns an estimate and the family .

### 3.3 Second Phase: Estimation of α

In this phase, our algorithm computes an estimate $\hat{\alpha}_n(\pi)$ of $\alpha_n(\pi)$ for all $n$ and $\pi$. The pseudo-code for this phase is given in Algorithm 2. As an input, it receives the estimate $\hat{\beta}$ and the family $\{D_n\}$ from Algorithm 1.

Algorithm 2 consists of two parts. The first part conducts experiments with (computed from , ) for each and . This is the same process used to compute in Algorithm 1. Let

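$$D^{\downarrow}_n := \{\pi \in \{0,1\}^{P_n} \mid \bar{\pi}^0, \bar{\pi}^1 \in D_n\},$$

where $\bar{\pi}^j$ is the extension of $\pi$ onto $\bar{P}_n$ with $\bar{\pi}^j_n = j$. Let us define a constant $r_{n,\pi}$ for each $n$ and $\pi \in \{0,1\}^{P_n}$. In the second part, the algorithm solves the following optimization problem:

$$\min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{n \in I_{N,A}} \sum_{\pi \in \{0,1\}^{P_n} \setminus D^{\downarrow}_n} \frac{\hat{\beta}_n^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \hat{\beta}_n(\pi, A') + r_{n,\pi}} \quad \text{s.t.} \quad \sum_{A' \in \mathcal{A}} \eta_{A'} = 1. \tag{9}$$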

Note that, for each , only if according to Line 20 of Algorithm 1. Thus the denominator is positive for every , and the above optimization problem is well-defined. Let be an optimal solution for (9). Consider the distribution over that generates with a probability of . For times, the second part samples an intervention according to and uses it to conduct experiments.

For each and , the algorithm counts the number (resp., ) of experiments that result in with (resp., and ). Then, (, ) is defined by

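$$\hat{\alpha}'_n(\pi) = \begin{cases} \bar{t}'_n(\pi_{P_n})/t'_n(\pi_{P_n}) & \text{if } \pi_n = 1, \\ 1 - \bar{t}'_n(\pi_{P_n})/t'_n(\pi_{P_n}) & \text{if } \pi_n = 0. \end{cases} \tag{10}$$

The output $\hat{\alpha}_n$ is defined by

$$\hat{\alpha}_n(\pi) = \begin{cases} \hat{\alpha}'_n(\pi) & \text{if } \pi \notin D_n, \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$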

### 3.4 Regret bound

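Pseudo-code of our entire algorithm is provided in Algorithm 3. It computes an estimate $\hat{\beta}$ of $\beta$ by Algorithm 1 and then computes $\hat{\alpha}$ by Algorithm 2. It then computes an estimate $\hat{\mu}(A)$ of $\mu(A)$ by

$$\hat{\mu}(A) = \sum_{\pi \in B(A)} \prod_{n \in I_{N,A}} \hat{\alpha}_n(\pi_{\bar{P}_n}) \tag{12}$$

for each $A \in \mathcal{A}$. The algorithm returns an intervention $\hat{A}$ that maximizes $\hat{\mu}$.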

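Let us define $\gamma^*$ as the optimum value of the following problem:

$$\gamma^* := \min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{n=1}^{N} \sum_{\pi \in \{0,1\}^{P_n} : \beta_n(\pi, A) > 0} \frac{\beta_n^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \beta_n(\pi, A')} \quad \text{s.t.} \quad \sum_{A' \in \mathcal{A}} \eta_{A'} = 1. \tag{13}$$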
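The inner objective of (13) is easy to evaluate for a fixed sampling distribution $\eta$, and since $\gamma^*$ is a minimum over $\eta$, any feasible $\eta$ yields an upper bound on $\gamma^*$. The sketch below only evaluates the objective; it does not solve the min-max problem. The table layout `beta[A][(n, pi)]`, the function name `gamma_objective`, and the toy numbers are assumptions for illustration.

```python
import math


def gamma_objective(beta, eta):
    """Value of max_A sum_{n, pi} beta^2 / (sum_A' eta_A' * beta) for a fixed eta.

    beta[A][(n, pi)] : value of beta_n(pi, A); missing keys are treated as 0
    eta[A]           : sampling probability of intervention A (sums to 1)
    """
    worst = 0.0
    for A, row in beta.items():
        total = 0.0
        for key, b in row.items():
            if b <= 0:
                continue  # the sum in (13) ranges over beta_n(pi, A) > 0 only
            denom = sum(eta[A2] * beta[A2].get(key, 0.0) for A2 in beta)
            if denom == 0:
                return math.inf  # eta puts no mass on any useful intervention
            total += b * b / denom
        worst = max(worst, total)
    return worst


# Hypothetical table with two interventions and a single (node, realization) key.
beta = {"a": {(0, ()): 1.0}, "b": {(0, ()): 0.5}}
uniform = gamma_objective(beta, {"a": 0.5, "b": 0.5})       # 1 / 0.75
concentrated = gamma_objective(beta, {"a": 1.0, "b": 0.0})  # 1.0
```

In this toy example, concentrating all sampling mass on the intervention that realizes the parent configuration more often gives a smaller objective than uniform sampling, which illustrates why the experiment design $\eta$ is optimized rather than fixed.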

The regret bound of Algorithm 3 is parameterized by the optimum value $\gamma^*$:

###### Theorem 1.

The regret of Algorithm 3 satisfies

$$R_T \leq O\left(\sqrt{\frac{\max\{\gamma^*, N\} \log(|\mathcal{A}|T)}{T}}\right).$$

The notation is used here under the assumption that is sufficiently small with respect to but not negligible. The optimum value is bounded as follows. Let denote the number of nodes intervened by , i.e., :

###### Proposition 2.

It holds that .

Since the lower bound for the general best-arm identification problem is $\Omega(\sqrt{|\mathcal{A}|/T})$ [Theorem 4], our algorithm provides a better regret bound when the number of interventions $|\mathcal{A}|$ is large compared to $\gamma^*$, which depends only on the causal graph structure.

###### Remark 3.

We present Algorithms 1, 2, and 3 for the setting in which every $\alpha_n(\pi)$ is unknown. However, our algorithms can be applied even when $\alpha_n(\pi)$ is known for some $n$ and $\pi$ by incorporating minor modifications. In this case, we denote the number of unknown $\alpha_n(\pi)$ as . The modified algorithm just skips experiments for estimating the known $\alpha_n(\pi)$, and we can define for such $n$ and $\pi$. We then redefine $\gamma^*$ by replacing corresponding with in (13), and our bound in Theorem 1 is valid for this reduced $\gamma^*$. In particular, we can recover the regret bound considered in [Theorem 3] as follows:

###### Corollary 4.

Suppose that is known for every and . Then the regret of Algorithm 3 satisfies , where

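$$\gamma^* = \min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{\pi \in \{0,1\}^{P_N}} \frac{\beta_N^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \beta_N(\pi, A')} \quad \text{s.t.} \quad \sum_{A' \in \mathcal{A}} \eta_{A'} = 1.$$

###### Remark 5.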

Our problem setting is often called hard intervention, which directly controls the realization of a node as . In contrast, Sen et al.  introduced the soft intervention model on a node where an intervention changes the conditional probability of a node . They in fact considered a simple case where a graph has a single node such that , whose conditional probability can be controlled by soft intervention. In their model, we are given a discrete set as the set of soft interventions. For and , define

$$\alpha_k(\pi, S) := \mathrm{Prob}(V_k = \pi_k \mid V_i = \pi_i, \forall i \in P_k, do_{\mathrm{soft}}(S))$$

as the probability of realizing under the soft intervention and the condition for . The goal is then to maximize the following probability:

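$$\mathrm{Prob}(V_N = 1 \mid do_{\mathrm{soft}}(S)) = \sum_{\pi \in \{0,1\}^N : \pi_N = 1} \alpha_N(\pi_{\bar{P}_N}) \, \alpha_k(\pi_{\bar{P}_k}, S) \cdot \mathrm{Prob}(V_i = \pi_i, \forall i \in P_k).$$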

Sen et al.  proved parameterized regret bound assuming that is known in advance.

We here remark that their model can be implemented by the hard intervention model as follows. Regard as the set of indices, and we add nodes for each to the graph. Every has only one adjacent edge from to . Observe that is the set of indices of nodes which are the parents of in the new graph. For , we define by

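$$\alpha_k(\pi) := \begin{cases} \alpha_k(\pi_{\bar{P}_k}, S) & \text{if } \pi_S = 1 \text{ and } \pi_{S'} = 0 \text{ for all the other } S' \in \mathcal{S}, \\ 0 & \text{otherwise.} \end{cases}$$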

We consider the set of hard interventions , where each intervention is indexed by , and fixes the realization of the node as . More concretely,

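$$A_{S,n} := \begin{cases} 1 & \text{if } n = S, \\ 0 & \text{if } n \in \mathcal{S} \text{ and } n \neq S, \\ * & \text{otherwise.} \end{cases}$$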

Then the joint distribution over the nodes under the soft intervention is equal to the distribution under the corresponding hard intervention , and thus the soft intervention model is reduced to the hard intervention model.

## 4 Proofs

This section is devoted to proving Theorem 1. We introduce a series of well-known technical lemmas together with a novel variant of Hoeffding’s inequality in Section 4.1. In Sections 4.2 and 4.3, we ensure the accuracy of estimation in Algorithms 1 and 2, respectively, which are presented formally as Propositions 10 and 14. Section 4.4 then proves Theorem 1 and Proposition 2, whose statements are presented in the previous section.

### 4.1 Technical lemmas

We introduce Hoeffding’s inequality, Chernoff’s bound, and Hoeffding’s lemma as follows.

###### Proposition 6 (Hoeffding’s inequality).

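For every $i \in \{1, \dots, n\}$, suppose that $X_i$ is an independent random variable over $[a_i, b_i]$. We define $S = \sum_{i=1}^{n} X_i$ and $\mu = E[S]$. Then for any $\varepsilon > 0$ we have

$$\mathrm{Prob}(|S - \mu| \geq \varepsilon) \leq 2\exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).$$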
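As a quick numerical illustration of Proposition 6, the sketch below evaluates the right-hand side of Hoeffding's inequality for given ranges $[a_i, b_i]$. Note that $S$ is the sum of the variables, not their mean. The function name `hoeffding_bound` and the example values are hypothetical, and capping the result at 1 is a convenience of this sketch.

```python
import math


def hoeffding_bound(eps, ranges):
    """Upper bound on Prob(|S - E[S]| >= eps) for a sum of independent variables.

    eps    : deviation of the sum from its mean (must be positive)
    ranges : list of (a_i, b_i) pairs bounding each variable
    """
    denom = sum((b - a) ** 2 for a, b in ranges)
    return min(1.0, 2.0 * math.exp(-2.0 * eps ** 2 / denom))


# Hypothetical example: 100 variables on [0, 1], deviation of 10 from the mean sum.
bound = hoeffding_bound(10.0, [(0.0, 1.0)] * 100)  # 2 * exp(-2)
```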
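
###### Proposition 7 (Chernoff’s bound).

For every $i \in \{1, \dots, n\}$, suppose that $X_i$ is an independent random variable over $[0, 1]$. We define $S = \sum_{i=1}^{n} X_i$ and $\mu = E[S]$. Then we have

$$\mathrm{Prob}(S \leq (1 - \delta)\mu) \leq \exp\left(-\frac{\delta^2 \mu}{2}\right), \quad 0 \leq \forall \delta \leq 1,$$

$$\mathrm{Prob}(S \geq (1 + \delta)\mu) \leq \exp\left(-\frac{\delta^2 \mu}{3}\right), \quad 0 \leq \forall \delta \leq 1,$$

$$\mathrm{Prob}(S \geq (1 + \delta)\mu) \leq \exp\left(-\frac{\delta \mu}{3}\right), \quad \forall \delta \geq 1.$$

###### Proposition 8 (Hoeffding’s lemma).

Suppose that $X$ is a random variable over $[0, 1]$, and define $\bar{X} = E[X]$. Then for any $\lambda \in \mathbb{R}$ it holds that

$$E[\exp(\lambda(X - \bar{X}))] \leq \exp\left(\frac{1}{8}\lambda^2\right).$$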

The following statement is a variant of Hoeffding’s inequality, which is proven on the basis of Hoeffding’s lemma.

###### Lemma 9 (Variant of Hoeffding’s inequality).

For every and , let be a random variable over . For each , we assume that the variables in are independent and has the identical mean (i.e., for all ) under the condition that the variables in are fixed. For , let be a random variable over which is independent of