1 Introduction
The multi-armed bandit has been widely recognized as a standard framework for modeling online learning with a limited number of observations. In each round of the bandit problem, a learner chooses an arm from the given candidates and obtains a corresponding observation. Since observations are limited, the learner must adopt an efficient strategy for exploring the optimal arm. The efficiency of the strategy is measured by regret, and the theoretically tight lower bound is polynomial with respect to the number of arms in the general multi-armed bandit setting. Thus, in order to improve on this lower bound, one requires additional information about the bandit setting. For example, the contextual bandit [1, 3] is a well-known class of bandit problems with side information in the form of domain-expert knowledge. For this setting, there is a regret bound that is logarithmic with respect to the number of arms. In this paper, we also achieve such a regret bound for a novel class of bandit problems with side information. To this end, let us introduce our bandit setting in detail.
The causal graph [14] is a well-known tool for modeling a variety of real problems, including computational advertising [4], genetics [12], agriculture [19], and marketing [10]. Building on causal graph discovery studies [5, 7, 8, 16], Lattimore et al. [11] recently introduced the causal bandit framework. They consider the problem of finding, within a limited number of experiments, the best intervention, that is, the one that causes a desirable propagation of a probability distribution over a given causal graph. In this setting, the arms are identified with interventions on the causal graph. A set of binary random variables is associated with the nodes of the causal graph. At each round of an experiment, a learner selects an intervention that fixes the realization of each intervened variable to a prescribed value. The effect of the intervention then propagates throughout the causal graph along the edges, and a realization over all nodes is observed after propagation. The goal of the causal bandit problem is to control the realization of a target variable with an optimal intervention.
Figure 1 is an illustrative example of the causal bandit problem. In the figure, the four nodes on the right represent a consumer decision-making model in e-commerce borrowed from [10]. This model assumes that customers decide whether to purchase based on their perceived risk in an online transaction (e.g., a defective product), their trust in a web vendor, and the perceived benefit of e-commerce (e.g., increased convenience). Consumer trust influences perceived risk. Here, we consider controlling customer behavior by two kinds of advertising, which correspond to adding two nodes (Ad A and Ad B) to the model as targets of intervention. Ad A can change only the reliability of a website; that is, it can influence the decision of customers only indirectly, through the middle nodes. In contrast, Ad B can change the perceived benefit. The aim is to increase the number of purchases by consumers by choosing an effective advertisement. This is indeed a bandit problem over a causal graph.
The work in [11] considered the causal bandit problem of minimizing simple regret and offered an improved regret bound over the aforementioned tight lower bound [2, Theorem 4] for the general bandit setting [2, 6]. Sen et al. [15] extended this study by incorporating a smooth intervention, and they provided a new regret bound parameterized by the performance gap between the optimal and suboptimal arms. This parameterized bound comes from a technique developed for the general multi-armed bandit problem [2]. These analyses, however, only work for a special class of interventions with known true parameters. Indeed, they only consider localized interventions.
Main contribution
This paper proposes the first algorithm for the causal bandit problem with an arbitrary set of interventions (which can propagate throughout the causal graph), with a theoretically guaranteed simple regret bound. The bound is parameterized by a quantity that is bounded on the basis of the graph structure. In particular, if the in-degree of the causal graph is bounded by a constant, this quantity is bounded in terms of the number of nodes.
The major difficulty in dealing with an arbitrary intervention comes from the accumulation and propagation of estimation error. Existing studies consider interventions that only affect the parents of a single node. To estimate the relationship between these parents and that node in this setting, one can apply an efficient importance sampling algorithm [4, 11]. On the other hand, when we intervene on an arbitrary node, the intervention can affect the probabilistic propagation mechanism in any part of the causal graph. Hence, we cannot directly control the realization of intermediate nodes when designing efficient experiments.
The proposed algorithm consists of two steps. First, the preprocessing step is devoted to estimating the parameters needed for designing the efficient experiments used in the main step. More precisely, we focus on estimating parameters with bounded relative error. By truncating small parameters, which are negligible but tend to have large relative error, we manage to avoid the accumulation of estimation error. In the main step, we apply the importance sampling approach introduced in [11, 15] on the basis of the estimated parameters with a guaranteed relative error. This step allows us to estimate parameters with bounded absolute error, which results in the desired regret bound.
Related studies
Minimizing simple regret in bandit problems is called the best-arm identification [6, 9] or pure exploration [4] problem, and it has been extensively studied in the machine learning research community. The inference of a causal graph structure is also well studied, and this line of work can be classified into causal graph discovery and causal inference. Causal graph discovery [5, 7, 8, 16] considers efficient experiments for determining the structure of a causal graph, while causal inference [13, 14, 17, 18] aims to determine the graph structure from historical data alone, without additional experiments. The causal bandit problem designs experiments without using historical data, and is therefore closer to causal graph discovery studies.
Outline
2 Causal bandit problem
This section introduces the causal bandit problem proposed by [11].
Let be a directed acyclic graph (DAG) with a node set and a (directed) edge set . Let denote an edge from to . Without loss of generality, we suppose that the nodes in are topologically sorted so that no edge from to exists if . For each , let denote the index set of the parents of , i.e., . We then define .
Each node is associated with a binary random variable. The distribution of this variable is influenced by the variables associated with the parents of the node (unless the node is intervened on, as described below). For each node, a parameter characterizes the distribution of its variable given the realizations of its parents. That is to say, if the parents are realized as a given vector, then the variable takes the value 1 with the probability given by the parameter, and the value 0 with the complementary probability.
Together with a DAG, we are also given a set of interventions. Each intervention is identified with a vector with one entry per node, where an entry either indicates that the node is intervened on, in which case the realization of its variable is fixed to the specified value, or indicates that the node is left free. Given an intervention and realizations over the parents, the probability that a variable takes the value 1 is then determined as follows: it is fixed by the intervention if the node is intervened on, and it is given by the parameter otherwise. This rule, together with the adjacency of the causal graph, completely determines the joint distribution over the variables under an arbitrary intervention.
In the causal bandit problem, we are given a DAG and a set of interventions. However, the parameters are not known. Our ideal goal is then to find an intervention that maximizes the probability that the target variable is realized as 1.
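The generative model described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not part of the text: nodes are indexed from 0 in topological order, `None` marks a node that is not intervened on, and the three-node chain with its probabilities is a hypothetical example.

```python
import random

def sample_realization(parents, theta, intervention, rng):
    """Draw one realization of all nodes under a hard intervention.

    parents[i]      -- tuple of parent indices of node i (all smaller than i)
    theta[i][pa]    -- P(X_i = 1 | parents of i realized as the tuple pa)
    intervention[i] -- 0 or 1 to fix node i, or None to leave it free
    """
    x = []
    for i in range(len(parents)):
        if intervention[i] is not None:
            # Intervened node: its realization is fixed, parents are ignored.
            x.append(intervention[i])
        else:
            pa = tuple(x[j] for j in parents[i])
            x.append(1 if rng.random() < theta[i][pa] else 0)
    return tuple(x)

# Hypothetical chain 0 -> 1 -> 2; node 2 plays the role of the target.
parents = [(), (0,), (1,)]
theta = [
    {(): 0.5},
    {(0,): 0.2, (1,): 0.9},
    {(0,): 0.1, (1,): 0.8},
]
rng = random.Random(0)
do_x0 = (1, None, None)  # fix node 0 to 1, leave the others free
samples = [sample_realization(parents, theta, do_x0, rng) for _ in range(2000)]
target_rate = sum(s[2] for s in samples) / len(samples)
```

In this toy chain the exact probability that the target equals 1 under `do_x0` is 0.9 * 0.8 + 0.1 * 0.1 = 0.73, so `target_rate` should be close to that value.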
For this purpose, we consider algorithms of the following form. First, the algorithm estimates, for every intervention, the probability to be maximized from experimental trials. Each experiment consists of the application of an intervention and the observation of a realization over all nodes. Second, the algorithm selects the intervention that maximizes the estimate. We evaluate the efficiency of such an algorithm by the simple regret, that is, the gap between the optimal probability and the expected probability attained by the selected intervention.
Note that, even if an algorithm is deterministic, the selected intervention is random, since the observations obtained in each experiment are produced by a stochastic process.
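As a small numerical illustration of simple regret, the sketch below uses the standard definition (optimal value minus the expected value of the selected arm). The intervention names, the success probabilities, and the selection distribution are all hypothetical.

```python
def simple_regret(mu, selection_probs):
    """Simple regret: best true value minus expected value of the output.

    mu[a]              -- true success probability of intervention a
    selection_probs[a] -- probability that the algorithm outputs a
    """
    best = max(mu.values())
    expected = sum(selection_probs[a] * mu[a] for a in mu)
    return best - expected

# Hypothetical values, not taken from the paper.
mu = {"a1": 0.73, "a2": 0.60, "a3": 0.41}
selection_probs = {"a1": 0.80, "a2": 0.15, "a3": 0.05}
regret = simple_regret(mu, selection_probs)

# An algorithm that always outputs the best intervention has zero regret.
oracle_regret = simple_regret(mu, {"a1": 1.0, "a2": 0.0, "a3": 0.0})
```

The expectation over `selection_probs` reflects the point made above: the output of the algorithm is random because the observations are random.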
In this paper, we assume that and for ease of technical discussion.
3 Proposed Algorithm
We propose an algorithm for the causal bandit problem and present a regret bound for it in this section. The proofs of the bound are presented in the next section. Let for each , and . For and , let denote the restriction of onto .
3.1 Outline of the proposed algorithm
Recall that the purpose of the causal bandit problem is to identify an intervention that maximizes . This task is trivial if is known for all , because can then be calculated for all . Let , and for , let denote the set of nodes in which are not intervened by ; . can then be represented as
Therefore, for computing approximately, our algorithm estimates ().
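When all parameters are known, the success probability of an intervention can be computed exactly by enumeration. The sketch below assumes that the representation referred to above is the usual factorization: a sum, over realizations of the non-intervened nodes in which the target equals 1, of the product of the per-node conditional probabilities. The data layout and the toy chain are hypothetical.

```python
from itertools import product

def success_probability(parents, theta, intervention, target):
    """Exact P(X_target = 1) under a hard intervention, by enumeration."""
    n = len(parents)
    free = [i for i in range(n) if intervention[i] is None]
    total = 0.0
    for values in product((0, 1), repeat=len(free)):
        x = list(intervention)
        for i, v in zip(free, values):
            x[i] = v
        if x[target] != 1:
            continue
        # Product of conditional probabilities over non-intervened nodes.
        weight = 1.0
        for i in free:
            pa = tuple(x[j] for j in parents[i])
            p_one = theta[i][pa]
            weight *= p_one if x[i] == 1 else 1.0 - p_one
        total += weight
    return total

# Hypothetical chain 0 -> 1 -> 2 with target node 2.
parents = [(), (0,), (1,)]
theta = [
    {(): 0.5},
    {(0,): 0.2, (1,): 0.9},
    {(0,): 0.1, (1,): 0.8},
]
p_do = success_probability(parents, theta, (1, None, None), target=2)
p_obs = success_probability(parents, theta, (None, None, None), target=2)
```

The enumeration is exponential in the number of free nodes, so it is only meant to make the formula concrete on small graphs.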
In order to estimate efficiently, we are required to manipulate the random variables associated with the parents of . More concretely, to estimate for , we require samples with realization satisfying over the parents of . For , , and , we thus introduce the additional quantities that denote the probability of realizing with under a given intervention . More precisely, we define
Our algorithm consists of two phases. The first phase estimates (), and the second phase estimates . The algorithm requires experiments in the first phase, and experiments in the second phase. In the rest of this section, we first explain those phases and then present a regret bound on the algorithm.
3.2 First Phase: Estimation of
Here, we introduce the estimation phase of for all . The pseudocode of this phase is described in Algorithm 1. Algorithm 1 requires a positive number as a parameter, which will be set to . We perform experiments in this phase.
Before explaining the details of Algorithm 1, we note that can be calculated from . For , let
(1) 
denote the set of realizations over that is consistent with the realization over and the intervention . If , then is then described as
(2) 
Algorithm 1 consists of iterations. The th iteration computes the following objects:

an estimate of ,

for each ,

an estimate of , and

.
We remark that in Algorithm 1 are used only for computing an estimate and are not used for estimating . An estimate of is computed in the next phase of our algorithm.
At the beginning of the th iteration, we compute for each and by (2) substituting for ;
(3) 
Let us confirm that this can be computed if () are available.
For each , then, we identify an intervention that attains . Using , we compute as follows, where is an extension of onto . We conduct experiments with . Let be the number of experiments in those experiments in which the obtained realization satisfies for each . Let be the number of experiments counted in , where also holds. We then compute using the equation
(4) 
The vector is added to if
(5) 
where is defined as
This reserves such that is too small to estimate with sufficient accuracy. Then is determined by replacing with for :
(6) 
This replacement contributes to reducing the relative estimation error of in subsequent steps ().
After iterating for all , the algorithm computes and () defined by
(7)  
(8) 
This contributes to bounding the absolute error of the estimation of for . The algorithm returns an estimate and the family .
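The count-based estimate with truncation can be illustrated as follows. This is only a sketch of the idea: the threshold `min_count` and the `fallback` value are hypothetical stand-ins and do not reproduce the thresholds of conditions (5) to (8), and the counts are made-up numbers.

```python
def estimate_with_truncation(counts, min_count, fallback=0.0):
    """Ratio-of-counts estimates with truncation of rarely observed cases.

    counts[(i, pa)] = (n_pa, n_one), where n_pa is the number of experiments
    in which the parents of node i were realized as pa, and n_one is the
    number of those in which node i was also realized as 1.
    """
    estimates, truncated = {}, set()
    for key, (n_pa, n_one) in counts.items():
        if n_pa < min_count:
            # Too few observations: the relative error would be large,
            # so the pair is set aside and a fixed fallback is used.
            truncated.add(key)
            estimates[key] = fallback
        else:
            estimates[key] = n_one / n_pa
    return estimates, truncated

# Hypothetical counts for illustration.
counts = {
    (1, (0,)): (480, 96),
    (1, (1,)): (520, 468),
    (2, (1, 1)): (3, 2),
}
estimates, truncated = estimate_with_truncation(counts, min_count=30)
```

Here the pair `(2, (1, 1))` is observed only three times, so it is placed in the truncated family instead of receiving the unreliable estimate 2/3.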
3.3 Second Phase: Estimation of
In this phase, our algorithm computes an estimate of for all . The pseudocode for this phase is given in Algorithm 2. As an input, it receives and from Algorithm 1.
Algorithm 2 consists of two parts. The first part conducts experiments with (computed from , ) for each and . This is the same process used to compute in Algorithm 1. Let
where is the extension of onto with . Let us define a constant for each and . In the second part, the algorithm solves the following optimization problem:
s.t.  (9) 
Note that, for each , only if according to Line 20 of Algorithm 1. Thus the denominator is positive for every , and the above optimization problem is well-defined. Let be an optimal solution for (9). Consider the distribution over that generates with a probability of . For times, the second part samples an intervention according to and uses it to conduct experiments.
For each and , the algorithm counts the number (resp., ) of experiments that result in with (resp., and ). Then, (, ) is defined by
(10) 
The output is defined by
(11) 
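The sampling part of the second phase can be sketched as below: interventions are drawn from a fixed distribution, and the conditional probabilities are estimated by pooled ratios of counts. The distribution `eta` is simply given here rather than obtained from problem (9), only non-intervened nodes are counted (an assumption of this sketch), and the toy chain is hypothetical.

```python
import random
from collections import defaultdict

def run_mixture_experiments(parents, theta, interventions, eta, rounds, rng):
    """Sample interventions from eta and estimate P(X_i = 1 | parents = pa)."""
    n_pa = defaultdict(int)   # times parents of i were realized as pa
    n_one = defaultdict(int)  # ... and node i was realized as 1
    for _ in range(rounds):
        a = rng.choices(interventions, weights=eta, k=1)[0]
        x = []
        for i in range(len(parents)):
            if a[i] is not None:
                x.append(a[i])  # fixed by the intervention, not counted
            else:
                pa = tuple(x[j] for j in parents[i])
                x.append(1 if rng.random() < theta[i][pa] else 0)
                n_pa[(i, pa)] += 1
                n_one[(i, pa)] += x[i]
    return {key: n_one[key] / n_pa[key] for key in n_pa}

# Hypothetical chain 0 -> 1 -> 2 and three candidate interventions.
parents = [(), (0,), (1,)]
theta = [
    {(): 0.5},
    {(0,): 0.2, (1,): 0.9},
    {(0,): 0.1, (1,): 0.8},
]
interventions = [(0, None, None), (1, None, None), (None, None, None)]
eta = [0.25, 0.25, 0.5]
estimates = run_mixture_experiments(
    parents, theta, interventions, eta, rounds=20000, rng=random.Random(1)
)
```

Pooling observations across interventions is what makes a well-chosen `eta` valuable: every parent configuration of interest should be reached often enough under the mixture.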
3.4 Regret bound
The pseudocode of our entire algorithm is provided in Algorithm 3. It computes an estimate of by Algorithm 1 and then computes by Algorithm 2. It then computes an estimate of by
(12) 
for each . The algorithm returns an intervention that maximizes .
Let us define as the optimum value of the following problem:
s.t.  (13) 
The regret bound of Algorithm 3 is parameterized by the optimum value :
Theorem 1.
The regret of Algorithm 3 satisfies
The notation is used here under the assumption that is sufficiently small with respect to but not negligible. The optimum value is bounded as follows. Let denote the number of nodes intervened by , i.e., :
Proposition 2.
It holds that .
Since the lower bound for the general best-arm identification problem is [2, Theorem 4], our algorithm provides a better regret bound when the number of interventions is large compared to , which depends only on the causal graph structure.
Remark 3.
We present Algorithms 1, 2, and 3 for the setting in which every is unknown. However, our algorithms can be applied, with minor modifications, even when is known for some and . In this case, we denote the number of unknown as . The modified algorithm simply skips the experiments for estimating the known , and we can define for such and . We then redefine by replacing the corresponding with in (13), and our bound in Theorem 1 is valid for this reduced . In particular, we can recover the regret bound considered in [11, Theorem 3] as follows:
Corollary 4.
Suppose that is known for every and . Then the regret of Algorithm 3 satisfies , where
s.t. 
Remark 5.
Our problem setting is often called hard intervention, which directly controls the realization of a node as . In contrast, Sen et al. [15] introduced the soft intervention model on a node where an intervention changes the conditional probability of a node . They in fact considered a simple case where a graph has a single node such that , whose conditional probability can be controlled by soft intervention. In their model, we are given a discrete set as the set of soft interventions. For and , define
as the probability of realizing under the soft intervention and the condition for . The goal is then to maximize the following probability:
Sen et al. [15] proved a parameterized regret bound assuming that is known in advance.
We here remark that their model can be implemented by the hard intervention model as follows. Regard as the set of indices, and we add nodes for each to the graph. Every has only one adjacent edge from to . Observe that is the set of indices of nodes which are the parents of in the new graph. For , we define by
We consider the set of hard interventions , where each intervention is indexed by , and fixes the realization of the node as . More concretely,
Then the joint distribution over the nodes under the soft intervention is equal to the distribution under the corresponding hard intervention , and thus the soft intervention model is reduced to the hard intervention model.
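The reduction can be checked on a toy example. The sketch below assumes one particular encoding: one selector node is added per soft intervention, the hard intervention indexed by k fixes selector k to 1 and all other selectors to 0, and selector configurations that are never produced by these interventions receive an arbitrary value. The function name and the tables are hypothetical.

```python
from itertools import product

def soft_to_hard(soft_tables, num_parents):
    """Encode soft interventions on one node via added selector nodes.

    soft_tables[k][pa] -- P(node = 1 | original parents = pa) under the
                          k-th soft intervention.
    Returns the conditional table over (original parents + selectors)
    and, for each k, the values that hard intervention k gives the selectors.
    """
    num_soft = len(soft_tables)
    hard_table = {}
    for pa in product((0, 1), repeat=num_parents):
        for sel in product((0, 1), repeat=num_soft):
            if sum(sel) == 1:
                hard_table[pa + sel] = soft_tables[sel.index(1)][pa]
            else:
                # Never reached under the interventions below; arbitrary.
                hard_table[pa + sel] = 0.0
    selector_values = [
        tuple(1 if j == k else 0 for j in range(num_soft))
        for k in range(num_soft)
    ]
    return hard_table, selector_values

# Two hypothetical soft interventions on a node with one original parent.
soft_tables = [
    {(0,): 0.1, (1,): 0.6},
    {(0,): 0.3, (1,): 0.9},
]
hard_table, selector_values = soft_to_hard(soft_tables, num_parents=1)
```

For every k and every parent realization, looking up `hard_table` at the selector values of hard intervention k returns the entry of the k-th soft table, which is the equality of distributions claimed above.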
4 Proofs
This section is devoted to proving Theorem 1. We introduce a series of well-known technical lemmas together with a novel variant of Hoeffding’s inequality in Section 4.1. In Sections 4.2 and 4.3, we ensure the accuracy of estimation in Algorithms 1 and 2, respectively, which are presented formally as Propositions 10 and 14. Section 4.4 then proves Theorem 1 and Proposition 2, whose statements are presented in the previous section.
4.1 Technical lemmas
We introduce Hoeffding’s inequality, Chernoff’s bound, and Hoeffding’s lemma as follows.
Proposition 6 (Hoeffding’s inequality).
For every , suppose that is an independent random variable over . We define and . Then for any we have
Proposition 7 (Chernoff’s bound).
For every , suppose that is an independent random variable over . We define and . Then we have
Proposition 8 (Hoeffding’s lemma).
Suppose that is a random variable over , and define . Then for any it holds that
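For reference, common textbook forms of these three results are given below. The notation is assumed for this illustration, and the constants and the one-sided or two-sided forms may differ from the statements used in this paper.

```latex
% Assumed notation: X_1, ..., X_n independent with values in [0, 1],
% \bar{X} = (1/n) \sum_{i=1}^{n} X_i, and \mu = \mathbb{E}[\bar{X}].

% Hoeffding's inequality, for any \epsilon > 0:
\Pr\bigl( |\bar{X} - \mu| \ge \epsilon \bigr) \le 2 \exp\bigl( -2 n \epsilon^2 \bigr)

% Multiplicative Chernoff bound, for any 0 < \epsilon \le 1:
\Pr\bigl( |\bar{X} - \mu| \ge \epsilon \mu \bigr) \le 2 \exp\bigl( -n \mu \epsilon^2 / 3 \bigr)

% Hoeffding's lemma, for X with values in [a, b] and any real \lambda:
\mathbb{E}\bigl[ e^{\lambda (X - \mathbb{E}[X])} \bigr] \le \exp\bigl( \lambda^2 (b - a)^2 / 8 \bigr)
```

The first bound controls absolute error and the second controls relative error, which matches the roles that absolute and relative estimation error play in the two phases of the algorithm.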
The following statement is a variant of Hoeffding’s inequality, which is proven on the basis of Hoeffding’s lemma.
Lemma 9 (Variant of Hoeffding’s inequality).
For every and , let be a random variable over . For each , we assume that the variables in are independent and has the identical mean (i.e., for all ) under the condition that the variables in are fixed. For , let be a random variable over which is independent of