1 Introduction
Causal graphs [28] are useful for representing causal relationships among interacting variables in large systems [7]. Over the last few decades, causal models have found use in computational advertising [7], biological systems [25], sociology [5], agriculture [36] and epidemiology [18]. There are two important questions commonly studied with causal graphs: (i) How to learn a directed causal graph that encodes the pattern of interaction among components in a system (causal structure learning) [28]? and (ii)
Using previously acquired (partial) knowledge about the causal graph structure, how to estimate and/or to optimize the effect of a new
intervention on other variables (optimization) [7, 18, 20, 6, 21]? Here, an intervention is a forcible change to the value of a variable in a system. The change either alters the relationship between the parental causes and the variable, or decouples it from the parental causes entirely. Our focus is on optimizing over a given set of interventions. An illustrative example is online advertising [7]
, where there is a collection of click-through rate scoring algorithms that provide an estimate of the probability that a user clicks on an ad displayed at a specific position. The interventions occur through the choice of click-through rate scoring algorithm; the algorithm choice directly impacts ad placement and pricing, and, through a complex network of interactions, affects the revenue generated through advertisements. The revenue is used to determine the best scoring algorithm (optimize for the best intervention); see Figure
1. Another example is in biological gene-regulatory networks [6], where a large number of genes interact amongst each other and also interact with environmental factors. The objective here is to understand the best perturbation of some genes in terms of its effect on the expression of another subset of genes (the target) in cellular systems. This paper focuses on the following setting: We are given a priori knowledge about the structure and strength of interactions over a small part of the causal graph. In addition, there is freedom to intervene (from a set of allowable interventions) at a certain node in the known part of the graph, and collect data under the chosen intervention; further, we can alter the interventions over time and observe the corresponding effects. Given a set of potential interventions to optimize over, the key question of interest is: How to choose the best sequence of allowable interventions in order to discover which intervention maximizes the expectation of a downstream target node?
Determining the best intervention in the above setting can be cast as a best arm identification bandit problem, as noted in [22]. The possible interventions to optimize over are the arms of the bandit, while the sample value of the target node under an intervention is the reward.
More formally, suppose that $X$ is a node in a causal graph (as shown in Fig. 2), with the parents of $X$ denoted by $\mathrm{Pa}(X)$. In Fig. 1, $X$ corresponds to the click-through rate and its parents are user-query and ads-chosen. This essentially means that $X$ is causally determined by a function of $\mathrm{Pa}(X)$ and some exogenous random noise. This dependence is characterized by the conditional distribution $P(X \mid \mathrm{Pa}(X))$. (Formally, if node $X$ has parents $\mathrm{Pa}(X)$, then this distribution is the conditional $P(X = x \mid \mathrm{Pa}(X) = \mathbf{z})$ for all values $x$ and $\mathbf{z}$.) Then a (soft) intervention mathematically corresponds to changing this conditional probability distribution, i.e. probabilistically forcing $X$ to certain states given its parents. In the computational advertising example, the interventions correspond to changing the click-through rate scoring algorithm, whose input-output characteristics are well-studied. Further, suppose that the effect of an intervention is observed at a node $Y$ which is downstream of $X$ in the topological order (w.r.t. the graph); refer to Fig. 2. Then, our key question is stated as follows: Given a collection of $N$ interventions, find the best intervention among these possibilities, i.e. the one that maximizes $\mathbb{E}[Y]$, under a fixed budget of $T$ (intervention, observation) pairs.
1.1 Main Contributions
(Successive Rejects Algorithm) We provide an efficient successive rejects multiphase algorithm. The algorithm uses clipped importance sampling; the clipper level is set adaptively in each phase in order to trade off bias and variance. Our procedure yields a major improvement over the algorithm in [22] (both in theoretical guarantees and in practice), which sets the clippers and allocates samples in a static manner.
(Gap Dependent Error Guarantees under Budget Constraints) In the classic best arm identification problem [3], Audibert et al. derive gap dependent bounds on the probability of error given a fixed sample budget. Specifically, let $\Delta_{(i)}$ be the gap (difference in expected reward from the best arm) of the arm with the $i$-th largest expected reward (e.g. $\Delta_{(2)}$ is the difference between the best arm's expected reward and the second best reward). Then, it has been shown in [3] that the number of samples needed scales as (up to poly-log factors) $\max_i i\,\Delta_{(i)}^{-2}$.
In our setting, a fundamental difference from the classical best arm setting [3] is the information leakage across the arms, i.e., samples from one arm can inform us about the expected value of other arms because of the shared causal graph. We show that this information leakage yields significant improvement both in theory and in practice. We derive the first gap dependent bounds (gaps between the expected reward at the target under different interventions) on the probability of error in terms of the number of samples $T$, the cost budget on the relative fractions of times various arms are sampled, and the divergences between the soft intervention distributions of the arms.
In our result (up to poly-log factors) the factor $i$ is replaced by the 'effective variance' $\sigma_{(i)}^2$ of the estimator for arm $(i)$, i.e. we obtain (with informal notation) a sample requirement scaling as $\max_i \sigma_{(i)}^2\,\Delta_{(i)}^{-2}$. Here $\sigma_{(i)}^2$ can be much smaller than $i$ (the corresponding term in the results of [3]). Our theoretical guarantees quantify the improvement obtained by leveraging information leakage, which has been empirically observed in [7]. We discuss in more detail in Sections 3.3 and B (in the appendix) how these guarantees can be exponentially better than the classical ones. We also derive simple regret bounds (refer to Section 3.1) analogous to the gap dependent error bounds.
(Novel divergence measure for analyzing Importance Sampling) We provide a novel analysis of clipped importance sampling estimators, where pairwise divergences between the arm distributions $P_i$ and $P_j$, computed for a carefully chosen function $f$ (see Section C.1), act as the 'effective variance' term in the analysis of the estimators (similar to Bernstein's bound [4]).
(Extensive Empirical Validation) We demonstrate that our algorithm outperforms the prior works [22, 3] on the Flow Cytometry dataset [32] (in Section 4.1). We exhibit an innovative application of our algorithm for model interpretation of the Inception Deep Network [38] for image classification (refer to Section 4.2).
Remark 1.
The techniques in this paper can be directly applied to more general settings: (i) the intervention source can be a collection of nodes, with the changes affecting the joint conditional distribution given the union of all the parents; (ii) the importance sampling can be applied at a directed cut separating the sources and the targets, provided the effect of the interventions on the nodes forming the cut can be estimated. Moreover, our techniques can be applied without complete knowledge of the source distributions. We explain these variations in more detail in Section A in the appendix.
1.2 Related Work
The problem lies at the intersection of causal inference and best arm identification in bandits. There have been many studies on the classical best arm identification in the bandit literature, both in the fixed confidence regime [19, 13] and in the fixed budget setting [3, 10, 17, 8]. It was shown recently in [8] that the results of [3] are optimal. The key difference from our work is that, in these models, there is no information leakage among the arms.
There has been a lot of work [27, 16, 12, 15, 35, 29, 24, 33] on learning causal models from data and/or experiments and using them to answer causal-strength questions of a counterfactual nature. One notable work that partially inspired ours is [7], where the causal graph underlying a computational advertising system (as in Bing, Google, etc.) is known and the primary interest is to find out how a change in the system would affect some other variable.
At the intersection of causality and bandits, [22] is perhaps most relevant to our setting. It studies the problem of identifying the best hard interventions on multiple variables (among many), provided the distribution of the parents of the target is known under those interventions, and derives a gap independent simple regret bound. We assume soft interventions that affect the mechanism between a 'source' node and its parents, far away from the target (similar to the case of computational advertising considered in [7]). Further, we derive the first gap dependent bounds (which can be exponentially small), generalizing the results of [3]. Our formulation can handle general budget constraints on the bandit arms and also recovers the problem independent bounds of [22] (order-wise). Budget constraints in bandit settings have been explored before in [1, 34].
In the context of machine learning,
importance sampling has been mostly used to recondition input data to adhere to conditions imposed by learning algorithms [37, 23, 39].
2 Problem Setting
A causal graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ specifies causal relationships among the random variables representing the vertices $\mathcal{V}$ of the graph. The relationships are specified by the directed edges $\mathcal{E}$; an edge $(U, V)$ implies that $U$ is a direct parental cause for the effect $V$. With some abuse of notation, we will denote the random variable associated with a node by the node itself. We will denote the parents of a node $V$ by $\mathrm{Pa}(V)$. The causal dependence implies that $V = g_V(\mathrm{Pa}(V), \epsilon_V)$, where $\epsilon_V$ is an independent exogenous noise variable. One does not get to measure the functions $g_V$ in practice. The noise variable and the above functional dependence induce a conditional probability distribution $P(V \mid \mathrm{Pa}(V))$. Further, the joint distribution of the variables decomposes into a product of conditional distributions when the graph is viewed as a Bayesian network, i.e. $P(\mathcal{V}) = \prod_{V \in \mathcal{V}} P(V \mid \mathrm{Pa}(V))$. Interventions in a causal setting can be categorized into two kinds:

Soft Interventions: At node $X$, the conditional distribution relating $X$ and $\mathrm{Pa}(X)$ is changed from $P(X \mid \mathrm{Pa}(X))$ to a new conditional distribution $\tilde{P}(X \mid \mathrm{Pa}(X))$.

Hard Interventions: We force the node $X$ to take a specific value $x_0$. The conditional distribution is set to a point mass at $x_0$.
In this work, we consider the problem of identifying the best soft intervention, i.e. the one that maximizes the expected value of a certain target variable. The problem setting is best illustrated in Figure 2. Consider a causal graph $\mathcal{G}$ that specifies directed causal relationships between its variables. Let $Y$ be a target random variable which is downstream in the graph $\mathcal{G}$; the expected value of this target variable is the quantity of interest. Consider another random variable $X$ along with its parents $\mathrm{Pa}(X)$. We assume that there are $N$ possible soft interventions. Each soft intervention $i$ is a distinct conditional distribution $P_i(X \mid \mathrm{Pa}(X))$ that dictates the relationship between $X$ and $\mathrm{Pa}(X)$. During soft intervention $i$ ($1 \le i \le N$), the conditional distribution of $X$ given its parents is set to $P_i(X \mid \mathrm{Pa}(X))$, and all other relationships in the causal graph are unchanged.
It is assumed that the conditional distributions $P_i(X \mid \mathrm{Pa}(X))$ and the marginals of $\mathrm{Pa}(X)$, for $1 \le i \le N$, are known from past experiments or existing domain knowledge. We only observe samples of $X$ and $Y$, while the rest of the variables in the causal graph may be unobserved under different interventions. For simplicity we assume that the variables $X$ and $\mathrm{Pa}(X)$ are discrete, while the target variable $Y$ may be continuous or discrete and has bounded support. Further, we assume that the various conditionals $P_i(X \mid \mathrm{Pa}(X))$ are absolutely continuous with respect to each other; in the case of discrete distributions, the nonzero supports of these distributions are identical. However, our algorithm can be easily generalized to continuous distributions on $X$ and $\mathrm{Pa}(X)$ (as in our experiments in Section 4.1). In this setting, we are interested in the following natural questions: Which of the $N$ soft interventions yields the highest expected value of the target, and what misidentification error can be achieved with a finite total budget of $T$ samples?
Remark 2.
Although we may know a priori the joint distribution of $X$ and $\mathrm{Pa}(X)$ under different interventions, how each change affects another variable in the causal graph is unknown and must be learnt from samples. The task is to transfer prior knowledge to identify the best intervention.
Bandit Setting: The $N$ different soft interventions can be thought of as the arms of a bandit problem. Let the reward of arm $i$ be denoted by $\mu_i = \mathbb{E}_i[Y]$, the expected value of $Y$ under soft intervention $i$, i.e. when the conditional distribution of $X$ given its parents is set to $P_i(X \mid \mathrm{Pa}(X))$ while keeping everything else in $\mathcal{G}$ unchanged. We assume that there is only one best arm. Let $a^*$ be the arm that yields the highest expected reward and $\mu^*$ be the value of the corresponding expected reward, i.e. $a^* = \arg\max_i \mu_i$ and $\mu^* = \mu_{a^*}$. Let the optimality gap of arm $i$ be defined as $\Delta_i = \mu^* - \mu_i$. We shall see that these gaps and the relationships between the distributions $P_i$ are important parameters of the problem. Let the minimum gap be $\Delta_{\min} = \min_{i \ne a^*} \Delta_i$.
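To make these bandit quantities concrete, here is a toy computation of the best arm, the gaps, and the minimum gap; all reward values are hypothetical numbers chosen only for illustration.

```python
# Hypothetical expected rewards mu_i = E_i[Y], one per soft intervention (arm).
mu = [0.50, 0.72, 0.65, 0.30]

a_star = max(range(len(mu)), key=lambda i: mu[i])   # best arm a*
mu_star = mu[a_star]                                # mu* = mu_{a*}
Delta = [mu_star - m for m in mu]                   # optimality gaps; Delta[a*] == 0
Delta_min = min(d for i, d in enumerate(Delta) if i != a_star)

print(a_star, mu_star, Delta_min)                   # arm 1 is best; min gap 0.07
```

The minimum gap (here between arms 1 and 2) is the quantity that ultimately governs how hard the identification problem is.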
Fixed Budget for Samples: In this paper, we work under the fixed budget setting of best arm identification [3]. Let $T_i$ be the number of times intervention $i$ is used to obtain samples. We require that $\sum_{i=1}^{N} T_i \le T$. Let $w_i = T_i / T$ be the fraction of times intervention $i$ is played.
Additional Cost Budget on Interventions: In the context of causal discovery, some interventions require far more resources or experimental effort than others; we find such examples in the context of online advertisement design [7]. Therefore, we introduce two variants of an additional cost constraint that influences the choice of interventions. (i) Difficult arm budget (S1): Some arms are deemed to be difficult. Let $\mathcal{B}$ be the set of difficult arms. We require that the total fraction of times the difficult arms are played does not exceed $b$, i.e. $\sum_{i \in \mathcal{B}} w_i \le b$. (ii) Cost Budget (S2): This is the most general budget setting, capturing variable costs of sampling each arm [34]. We assume that there is a cost $c_i$ associated with sampling arm $i$. It is required that the average cost of sampling does not exceed a cost budget $B$, i.e. $\sum_i w_i c_i \le B$; $B$ along with the total budget $T$ completely defines this budget setting. It should be noted that S1 is a special case of S2.
We note that unless otherwise stated, we work with the most general setting in S2. We state some of our results in the setting S1 for clearer exposition.
Objectives: There are two main quantities of interest:
(Probability of Error): This is the probability of failing to identify the best soft intervention (arm). Let $\hat{a}$ be the arm that is predicted to be the best arm at the end of the experiment. Then the probability of error [3, 8] is given by $P_e = \mathbb{P}(\hat{a} \ne a^*)$.
(Simple Regret): Another important quantity that has been analyzed in the best arm identification setting is the simple regret [22], given by $r = \mathbb{E}[\mu^* - \mu_{\hat{a}}]$.
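Both objectives can be estimated empirically by repeating a best-arm procedure several times and recording the arm it returns. A minimal sketch with hypothetical outcomes (the rewards and the returned arms below are invented for illustration):

```python
# Assumed expected rewards; arm 1 is the best arm.
mu = {0: 0.50, 1: 0.72, 2: 0.65}
# Hypothetical arm returned at the end of each of six independent runs.
recommended = [1, 1, 2, 1, 1, 2]

best = max(mu, key=mu.get)
# Probability of error: fraction of runs that returned a suboptimal arm.
p_err = sum(a != best for a in recommended) / len(recommended)
# Simple regret: average shortfall mu* - mu_{returned arm}.
simple_regret = sum(mu[best] - mu[a] for a in recommended) / len(recommended)
```

Note that simple regret weights each mistake by its gap, so confusing the best arm with a close runner-up is penalized far less than a gross error, while the probability of error treats all mistakes alike.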
3 Our Main Results
In this section we provide our main theoretical contributions. In Section 3.2, we provide a successive rejects style algorithm that leverages the information leakage between the arms via importance sampling. Then, we provide theoretical guarantees on the probability of misidentification and simple regret for our algorithm in Section 3.3. In order to explain our algorithm and our results formally, we first describe several key ideas in our algorithm and introduce important definitions in Section 3.1.
3.1 Definitions
Quantifying Information Leakage: Our setting is one in which there is information leakage among the arms of the bandit. Recall that each arm $i$ corresponds to a different conditional distribution $P_i(X \mid \mathrm{Pa}(X))$ imposed on the node $X$ given its parents $\mathrm{Pa}(X)$, while the rest of the relationships in the causal graph remain unchanged. Since the different candidate conditional distributions are known from prior knowledge (and are absolutely continuous with respect to each other), it is possible to utilize samples obtained under one arm to obtain an estimate for the expectation under some other arm. A popular method for utilizing this information leakage among different distributions is importance sampling, which has been used in counterfactual analysis in similar causal settings [22, 7].
Importance Sampling: Suppose we get samples from arm $i$ and we are interested in estimating $\mu_k = \mathbb{E}_k[Y]$, the expected value of $Y$ under arm $k$. In this context it is helpful to express $\mu_k$ in the following manner:

$\mu_k = \mathbb{E}_k[Y] = \mathbb{E}_i\!\left[ Y\, \frac{P_k(X \mid \mathrm{Pa}(X))}{P_i(X \mid \mathrm{Pa}(X))} \right]. \qquad (1)$
(1) is trivially true because the only change to the joint distribution of all the variables in the causal graph between arms $i$ and $k$ is in the factor $P(X \mid \mathrm{Pa}(X))$. Suppose we observe $\tau$ samples from arm $i$, denoted by $\{(X_i(t), Y_i(t))\}_{t=1}^{\tau}$. Here $X_i(t)$ denotes the sample of random variable $X$ at time step $t$, while the subscript $i$ denotes that the samples are collected under arm $i$. In light of Equation (1), one might assume that the naive estimator,

$\hat{\mu}_k = \frac{1}{\tau}\sum_{t=1}^{\tau} Y_i(t)\,\frac{P_k\big(X_i(t) \mid \mathrm{Pa}(X)_i(t)\big)}{P_i\big(X_i(t) \mid \mathrm{Pa}(X)_i(t)\big)}, \qquad (2)$
provides a good estimate for $\mu_k$. However, the confidence guarantees on such an estimate can be arbitrarily bad even if $Y$ is bounded. This is because the likelihood ratio $P_k / P_i$ can be very large for several instances of $(X, \mathrm{Pa}(X))$. Therefore, usual measure concentrations (e.g. the Azuma-Hoeffding inequality) would not yield good confidence intervals. This has been noted in [22] in a similar setting, where a static clipper was applied to the weighted samples to control the variance. However, a static clipper introduces a fixed bias, and thus is not suitable for obtaining gap dependent simple regret bounds. Instead, our algorithm takes a multiphase approach and uses dynamic clipping to adaptively control the bias vs. variance tradeoff in a phase dependent manner, which leads to significantly better gap dependent bounds. (We note that the authors in [22] discuss the possibility of a multiphase approach, where clipper levels could change across phases. However, they do not pursue this direction, with no specific algorithm or results, as their objective is to derive gap independent bounds, i.e. minimax regret.) We now define some key quantities.
Definition 1.
Let $f$ be a nonnegative convex function such that $f(1) = 0$. For two joint distributions $P_i$ and $P_j$ (and the associated conditionals), the conditional divergence is given by:
$D_f\big(P_i \,\|\, P_j\big) = \sum_{\mathbf{z}} P\big(\mathrm{Pa}(X) = \mathbf{z}\big) \sum_{x} P_j(x \mid \mathbf{z})\, f\!\left(\frac{P_i(x \mid \mathbf{z})}{P_j(x \mid \mathbf{z})}\right).$
Recall that $P_i(X \mid \mathrm{Pa}(X))$ is the conditional distribution of node $X$ given the state of its parents under intervention $i$. Thus, $D_f(P_i \,\|\, P_j)$ is the conditional divergence between the conditional distributions of arms $i$ and $j$. Now we define the log-divergences that are crucial in our analysis.
Definition 2.
($M$ measure) Consider a carefully chosen nonnegative convex function $f$ (specified in Section C.1). We define the log-divergence measure $M_{ij}$ between arms $i$ and $j$ as the corresponding conditional divergence $D_f(P_i \,\|\, P_j)$ on a logarithmic scale.
These log-divergences help us control the bias-variance tradeoff in importance sampling, as shown in Section C.2 in the appendix. We also note that estimates of the divergence measure can be obtained directly from empirical data (without knowledge of the full distributions) using techniques such as those of [30].
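A runnable sketch of Definition 1's conditional divergence, using $f(t) = t\log t - t + 1$ (nonnegative, convex, with $f(1)=0$; for normalized conditionals it yields exactly the conditional KL divergence) as a stand-in for the paper's specific $f$ from its Section C.1. All distributions below are hypothetical toy numbers.

```python
import math

def conditional_f_divergence(p_i, p_j, p_pa, f):
    """Conditional f-divergence D_f(P_i || P_j) in the sense of Definition 1:
    sum_z P(Pa = z) * sum_x P_j(x|z) * f(P_i(x|z) / P_j(x|z)).
    p_i, p_j: dict z -> {x: prob}; p_pa: dict z -> prob of the parent state."""
    total = 0.0
    for z, pz in p_pa.items():
        for x, q in p_j[z].items():
            total += pz * q * f(p_i[z][x] / q)
    return total

# f(t) = t*log(t) - t + 1: nonnegative, convex, f(1) = 0 (gives conditional KL).
f_kl = lambda t: t * math.log(t) - t + 1

p_pa = {0: 0.5, 1: 0.5}                              # parent-state distribution
p_1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}     # arm 1 conditional
p_2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}     # arm 2 conditional

d_self = conditional_f_divergence(p_1, p_1, p_pa, f_kl)   # 0: identical arms
d_12 = conditional_f_divergence(p_1, p_2, p_pa, f_kl)     # strictly positive
```

A zero divergence means samples from one arm are perfectly informative about the other; large divergences correspond to low information leakage.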
Aggregating Interventional Data: We describe an efficient estimator of $\mu_k$ ($1 \le k \le N$) that combines the available samples from different arms. This estimator adaptively weights samples depending on the relative $M$ measures, and also uses clipping to control variance at the price of introducing bias. The estimator is given by (3).
Suppose we obtain $T_i$ samples from arm $i$. Let the total number of samples from all arms be denoted by $T = \sum_i T_i$. Further, let us index all the samples by $t \in \{1, \dots, T\}$ and let $\mathcal{T}_i$ be the indices of the samples collected from arm $i$. Let $Y(t)$ denote the sample collected for the random variable $Y$ under the intervention in force at time instant $t$. We denote the estimate of $\mu_k$ by $\hat{\mu}_k(\epsilon)$, where $\epsilon$ indicates the level of confidence desired. Our estimator is:
$\hat{\mu}_k(\epsilon) = \frac{1}{T}\sum_{i=1}^{N}\sum_{t \in \mathcal{T}_i} Y(t)\, W_{ik}(t)\, \mathbf{1}\{ W_{ik}(t) \le c_{ik}(\epsilon) \}, \quad \text{where } W_{ik}(t) = \frac{P_k\big(X(t) \mid \mathrm{Pa}(X)(t)\big)}{P_i\big(X(t) \mid \mathrm{Pa}(X)(t)\big)} \qquad (3)$

and the clip levels $c_{ik}(\epsilon)$ are determined by $\epsilon$ and the log-divergences $M_{ik}$.
In other words, $\hat{\mu}_k(\epsilon)$ is the weighted average of the clipped samples, where the samples from arm $i$ are weighted by the likelihood ratio between arms $k$ and $i$ and clipped at a level determined by $\epsilon$. The choice of $\epsilon$ controls the bias-variance tradeoff, which we adaptively change in our algorithm.
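A toy numerical sketch of the clipped importance-sampling idea behind estimator (3). The instance, the clip level, and the clipping convention (truncating the weight at the clip level) are assumptions for illustration; the paper sets its clip levels per phase via the log-divergences.

```python
import random

random.seed(0)

# Toy instance: X takes values {0, 1, 2}; both arm distributions are known.
p_src = [0.6, 0.3, 0.1]     # arm we are allowed to sample from
p_tgt = [0.1, 0.3, 0.6]     # arm whose mean reward we want to estimate
y_of_x = [0.2, 0.5, 0.9]    # E[Y | X = x] (deterministic here for clarity)

xs = random.choices(range(3), weights=p_src, k=20000)

def clipped_is(xs, clip):
    """Clipped importance-sampling estimate of E_tgt[Y] from source-arm
    samples: weight each Y by P_tgt(x)/P_src(x), truncated at `clip`."""
    total = 0.0
    for x in xs:
        w = min(p_tgt[x] / p_src[x], clip)
        total += w * y_of_x[x]
    return total / len(xs)

est = clipped_is(xs, clip=4.0)                           # clip the weight 6.0 -> 4.0
true_mean = sum(p * y for p, y in zip(p_tgt, y_of_x))    # 0.71
# est concentrates near 0.53 < 0.71: truncation trades a known downward
# bias for a bounded per-sample weight, and hence a controlled variance.
```

Raising the clip level shrinks the bias (at clip level 6 the estimator is unbiased here) but lets rare, heavily weighted samples dominate the variance, which is exactly the tradeoff the phase-dependent $\epsilon$ manages.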
3.2 Algorithm
Now, we describe our main algorithmic contribution, Algorithms 1 and 2. Algorithm 1 starts with all the arms under consideration and then proceeds in phases, possibly rejecting one or more arms at the end of each phase.
In every phase, the estimator (3), with a phase-specific choice of the parameter $\epsilon$ (i.e. controlling the bias-variance tradeoff), is applied to all arms under consideration. Using a phase-specific threshold on these estimates, some arms are rejected at the end of each phase. A random arm among the ones surviving at the end of all phases is declared to be the optimal. We now describe the duration of the various phases.
Recall the parameters: $T$, the total sample budget available, and $B$, the average cost budget constraint. The algorithm proceeds in phases numbered $k = 1, 2, \dots$; let $T^{(k)}$ be the total number of samples in phase $k$, with the phase budgets set (in the spirit of [3]) so that $\sum_k T^{(k)} \le T$. Let $\mathcal{A}_k$ be the set of arms remaining to compete with the optimal arm at the beginning of phase $k$, which is continuously updated.
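For reference, the phase lengths in the classic successive-rejects schedule of Audibert et al. [3], which the multiphase schedule here is built in the spirit of, can be computed as follows (this reproduces [3]'s construction, not this paper's exact phase budgets):

```python
import math

def sr_phase_lengths(T, N):
    """Per-arm phase lengths of the Successive Rejects schedule of [3]:
    log_bar(N) = 1/2 + sum_{i=2..N} 1/i, and for phase k = 1..N-1,
    n_k = ceil((T - N) / (log_bar(N) * (N + 1 - k)))."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    return [math.ceil((T - N) / (log_bar * (N + 1 - k))) for k in range(1, N)]

phases = sr_phase_lengths(T=1000, N=5)
# Later phases allocate more samples per surviving arm, since fewer
# (and harder-to-separate) arms remain.
```

The schedule front-loads rejections of clearly bad arms and saves most of the budget for distinguishing the near-optimal ones.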
Allocation of Budget: Let $T_i^{(k)}$ be the number of samples allocated to arm $i$ in phase $k$, and let $\mathbf{T}^{(k)}$ be the vector consisting of these entries. The vector of allocated samples is decided by Algorithm 3. Intuitively, an arm that provides sufficient information about all the remaining arms needs to be given more budget than other, less informative arms. This allocation depends on the average budget constraints and the relative log-divergences between the arms (Definition 2). Algorithm 3 formalizes this intuition, and ensures that the variance of the worst estimator (of the form (3)) for the arms in $\mathcal{A}_k$ is as good as possible (quantified in Theorem 4 and Lemma 4 in the appendix). The inverse of the maximal objective of the LP in Algorithm 3 acts as an effective standard deviation uniformly for all the estimators for the remaining arms in $\mathcal{A}_k$. It is analogous to the variance terms appearing in Bernstein-type concentration bounds (refer to Lemma 4 in the appendix).
Definition 3.
The effective standard deviation $\sigma(B, \mathcal{A})$ for budget $B$ and arm set $\mathcal{A}$ is defined as the inverse of the maximal objective returned by Algorithm 3 with input $B$ and arm set $\mathcal{A}$.
Algorithm 3 minimizes the variance terms in the confidence bounds for the estimates of all the arms that are still in contention, i.e. all arms in $\mathcal{A}_k$. This minimization is performed subject to the constraints on the fractional budget of each arm (recall $B$ from Section 2). Note that Algorithm 3 only needs to ensure good confidence bounds for the arms that remain in $\mathcal{A}_k$. This gets easier as the number of remaining arms decreases; therefore the effective variances become progressively better with every phase.
Remark 3.
Note that Line 6 uses only the samples acquired in that phase. Clearly, a natural extension is to modify the algorithm to reuse all the samples acquired prior to that step. We give that variation in Algorithm 2. We prove all our guarantees for Algorithm 1. We conjecture that the second variation has tighter guarantees (dropping a multiplicative log factor) in the sample complexity requirements.
3.3 Theoretical Guarantees
We state our main results as Theorem 1 and Theorem 2, which provide guarantees on probability of error and simple regret respectively. Our results can be interpreted as a natural generalization of the results in [3], when there is information leakage among the arms. This is the first gap dependent characterization.
Theorem 1.
(Proved formally as Theorem 5) Let $\sigma(\cdot, \cdot)$ be the effective standard deviation as in Definition 3. The probability of error for Algorithm 1 satisfies:
(5) 
when the budget for the total number of samples is sufficiently large (the precise requirement is stated in Theorem 5). Here,
(6) 
and the set appearing in (6) is the set of arms whose distance from the optimal arm is roughly at most twice that of arm $i$.
Comparison with the result in [3]: Let $S_i$ be the set of arms which are closer to the optimal than arm $i$, and let $H_2 = \max_i |S_i| \Delta_i^{-2}$. The result in [3] can be stated as follows: the error in finding the optimal arm is bounded as $\mathbb{P}(\text{error}) \le O\big(\exp\big(-T / (\overline{\log}(N)\, H_2)\big)\big)$, up to polynomial factors in $N$.
Our result is analogous to the above (up to poly-log factors), except that the effective variance term appears instead of the cardinality of the set of arms closer to the optimal (the corresponding term above). In Section B.1 (in the appendix), we demonstrate through simple examples that the effective variance can be significantly smaller than that cardinality even when there are no average budget constraints. Moreover, our results can be exponentially better in the presence of average budget constraints (examples in Section B.1). We now present our bounds on simple regret in Theorem 2.
Theorem 2.
Comparison with the result in [22]: In [22], the simple regret bound is gap independent and does not adapt to the gaps. We provide gap dependent bounds that can be exponentially better than that of [22] (when the gaps are not too small and the first term in (7) is zero). Moreover, our bounds specialize to gap independent bounds that match [22] order-wise. Further details are provided in Section B.2 (in the appendix).
4 Empirical Validation
We empirically validate the performance of our algorithms in two real data settings. In Section 4.1, we study the empirical performance of our algorithm on the flow cytometry dataset [32]. In Section 4.2, we apply our algorithms for the purpose of model interpretability of the Inception Deep Network [38] in the context of image classification. Section 4.3 is dedicated to synthetic experiments. In Section D (in the appendix) we provide more details about our experiments. In the appendix we empirically show that our divergence metric is fundamental and replacing it with other divergences is suboptimal.
4.1 Flow Cytometry DataSet
The flow cytometry dataset [32] consists of multivariate measurements of protein interactions in a single cell, under different experimental conditions (soft interventions). This dataset has been extensively used for validating causal inference algorithms. Our experiments are aimed at identifying the best intervention among many, given some ground truth about the causal graph. For this purpose we borrow the causal graph from Fig. 5(c) in [26] (shown in Fig. 3(a)) and consider it to be the ground truth.
Parametric linear models have been popularly used for causal inference on this dataset [25, 11]. We fit a GLM gamma model [14] between the activation of each node and its parents in Fig. 3(a) using the observational data. In Section D.1 (in the appendix) we provide further details showing that the sampled distributions in the fitted model are extremely close to the empirical distributions from the data. The soft interventions signifying the arms are generated by changing the distribution of a source node pkc in the GLM. The objective is to identify the intervention that yields the highest output at the target node erk. (The activations of the node erk have been scaled so that the mean is less than one. The marginal distribution still has an exponential tail, and thus does not strictly adhere to our boundedness assumption on the target variable; however, the experiments suggest that our algorithms still perform extremely well.) We provide empirical results for two sets of interventions at the source node. Both these experiments have been performed with several arms, each representing a different distribution at pkc.
Budget Restriction: The experiments are performed in the budget setting S1, where all arms except a single default arm are deemed to be difficult. We plot our results as a function of the total number of samples $T$, while the fractional budget $b$ of the difficult arms is held fixed at a small value. This corresponds to the case where a lot of data can be acquired for a default arm, while any new change requires significant cost in acquiring samples.
Competing Algorithms: We test our algorithms on different problem parameters and compare with related prior work [3, 22]. The algorithms compared are: (i) SRISv1: Algorithm 1, introduced in Section 3.2; the divergences are estimated from sampled data using techniques from [30]. (ii) SRISv2: Algorithm 2, as detailed in Section 3.2. (iii) SR: the Successive Rejects algorithm from [3], adapted to the budget setting; the division of the total budget into phases is identical, while the individual arm budgets are decided in each phase according to the budget restrictions. (iv) CR: Algorithm 2 from [22]; the optimization problem for calculating the mixture parameter is not efficiently solvable for general distributions and budget settings, so the mixture proportions are set by Algorithm 3.
In these experiments, the budget restrictions imply that the default arm can be pulled much more than the other arms. Intuitively, the divergences of the arms from the default arm, as well as the minimum gap, define the hardness of identification. Fig. 3(b) represents a difficult scenario where the divergences from the default arm are large for many arms (large divergences imply low information leakage) and the minimum gap is small (which increases hardness). In Fig. 3(c) (an easier scenario) the divergences are small for most arms while the gap is the same as before. We see that SRISv2 outperforms all the other algorithms by a large margin, especially in the low sample regime.
4.2 Interpretability of Inception Deep Network
In this section we use our algorithm for model interpretation of the pretrained Inceptionv3 network [38] for classifying images. Model Interpretation essentially addresses: ’why does a learning model classify in a certain way?’, which is an important question for complicated models like deep nets [31].
When an RGB image is fed to Inception, it produces an ordered sequence of labels (e.g. 'drums', 'sunglasses'), and generally the top labels are an accurate description of the objects in the image. To address interpretability, we segment the image into a number of superpixels/segments (using segmentation algorithms like SLIC [2]) and infer which superpixels encourage the neural net to output a certain label (henceforth referred to as labelI; e.g. 'drum') within its top few labels, and to what extent.
Given a mixture distribution over the superpixels of an image (Figure 4(a)), a few superpixels are randomly sampled from the distribution with replacement. Then a new image is generated where all other superpixels of the original image are blurred out except the ones selected. This image is then fed to Inception, and it is observed whether labelI appears within the top labels. A mixture distribution is said to be a good interpretation for labelI if there is a high probability that labelI appears for an image generated by this mixture distribution. To empirically test the goodness of a mixture distribution, we would generate (using this mixture distribution) a number of random images, and determine the fraction of images for which labelI appears; a large fraction indicates that the mixture distribution is a good interpretation of labelI.
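The scoring loop just described can be sketched as follows. The classifier call is a hypothetical stub standing in for the Inception forward pass on the blurred image, and all names and numbers are invented for illustration.

```python
import random

def score_mixture(mixture, n_images, k, classifier_top_labels, label):
    """Fraction of generated images for which `label` appears among the top
    labels. `mixture` is a distribution over superpixel indices;
    `classifier_top_labels(kept)` stands in for running the classifier on the
    image with every superpixel blurred except those in `kept`."""
    hits = 0
    segments = list(range(len(mixture)))
    for _ in range(n_images):
        kept = random.choices(segments, weights=mixture, k=k)  # with replacement
        if label in classifier_top_labels(set(kept)):
            hits += 1
    return hits / n_images

random.seed(0)
# Toy stand-in classifier: reports 'drum' whenever superpixel 3 is visible.
toy = lambda kept: ['drum'] if 3 in kept else ['other']

good = score_mixture([0.03, 0.03, 0.03, 0.85, 0.03, 0.03], 200, 4, toy, 'drum')
bad = score_mixture([1 / 6] * 6, 200, 4, toy, 'drum')
assert good > bad   # mass concentrated on the informative superpixel scores higher
```

A mixture concentrated on the superpixels that actually drive the label survives blurring far more often than a uniform mixture, which is exactly what "good interpretation" means here.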
Motivated by the above discussion, we generate a large number of mixture distributions, with the goal of finding the one that best interprets labelI. To highlight the applicability of our algorithm, we allow images to be generated for only a small fraction of these mixture distributions; in other words, most of the mixture distributions cannot be directly tested. Nevertheless, we determine the best from among the entire collection of mixture distributions (learning counterfactuals).
Specifically, in our experiments we consider the image in Figure 4(a), partition it into superpixels, and generate images from mixture distributions by sampling superpixels with replacement. We generate arm distributions which lie in the probability simplex over the superpixels but have sparse support. The supports of these distributions are randomly generated by techniques like Markov random walks (which encourage contiguity), random choice, etc., as detailed in Section D.2 in the appendix. However, we are only allowed to sample using a different set of arms that are dense distributions chosen uniformly at random from the simplex. These sampling distributions are generated in a manner which is completely agnostic to the image content. The total sample budget $T$ is smaller than the number of arms.
Figure 4 shows images in which the segments are weighted in proportion to the optimal distribution (obtained by SRISv2) for the interpretation of three different labels. This showcases the true counterfactual power of the algorithm, as the set of arms that can be optimized over is disjoint from the arms that can be sampled from. Moreover, the sample budget is less than the number of arms; this is an extreme special case of budget setting S2. We see that our algorithm can generate meaningful interpretations for all the labels with relatively few runs of Inception. Even sampling a modest number of times from each of the arms to be optimized over would require far more runs of Inception for a single image and label than we use by leveraging information leakage.
4.3 Synthetic Experiments
In this section, we empirically validate the performance of our algorithm through synthetic experiments. We carefully design a simulation setting that is simple, yet sufficient to capture the various tradeoffs involved in the problem. An important point to note is that our algorithm is not aware of the actual effect of the changes on the target (the gaps between expectations); it only knows the divergences among the candidate soft interventions. Sometimes, a change with large divergence from an existing one may not maximize the effect we are looking for. Conversely, a smaller divergence may sometimes lead one closer to the optimum. We demonstrate that our algorithm performs well in all the experiments, as compared to previous works [3, 22].
Experimental Setup: We set up our experiments according to the simple causal graph in Figure 5. is assumed to be a random variable taking values in . The various arms are discrete distributions with support . We will vary and over the course of our experiments.
is assumed to be a function of and some random noise which is external to the system. In our experiments, we set the function as follows:
where is an arbitrary function. We set in all our experiments. The discrete candidate distributions are modified to explore various tradeoffs between the gaps and the effective standard deviation parameters.
Budget Restriction: The experiments are performed in the budget setting S1, where all arms except arm are deemed to be difficult. We plot our results as a function of the total number of samples , while the fractional budget of the difficult arms () is set to . Therefore, we have . This corresponds to the case where a large amount of data can be acquired for a default arm, while any new change incurs significant cost in acquiring samples.
Experiments: In our experiments, we choose to be the parity function of , represented in base . Note that arm is the arm that can be sampled times, while the rest of the arms can only be sampled times due to the above budget constraints. So, the divergence of arm from the other arms is crucial, alongside the gaps. We perform our experiments in different regimes that get progressively easier, varying both the divergences between the other arms and arm , and the gaps from the optimal arm in terms of target value. When there is no information leakage, the samples are divided among the arms, so the loss from having multiple arms can be expressed as a scaling of the standard deviation. Recall the divergence measure , which measures the information leakage from arm to another arm . Therefore, in the following, when we say high divergence from arm , we mean that is high for most arms .
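As a concrete illustration of a parity-based target, the following sketch computes the parity of the binary representation of the intervened variable and flips it with small probability to model exogenous noise. The noise model (`noise_p`) is a hypothetical assumption for illustration; the paper's exact noise term is elided in the text.

```python
def parity_target(x, rng, noise_p=0.1):
    """Hypothetical target variable: parity of the binary digits of x,
    flipped with probability noise_p to model external randomness.

    `rng` is any object with a `random()` method returning a uniform
    draw in [0, 1), e.g. the standard `random` module.
    """
    parity = bin(x).count("1") % 2  # parity of x represented in base 2
    if rng.random() < noise_p:
        parity = 1 - parity  # exogenous noise flips the outcome
    return parity
```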
High Divergence and Low : This is the hardest of all the settings. Here, we set and . Here, is quite high for all the arms . This means that arm , which can be pulled times, provides highly noisy estimates for the other arms. We have for most arms. Moreover, the minimum gap from the best arm is , which is quite small. This makes it harder to distinguish the best arm.
The results are demonstrated in Figure 6. Figure 6(a) displays the simple regret. We see that both SRISv1 and SRISv2 outperform the others by a large margin in this hard setting, even when the number of samples is very low. In Figure 6(b) we plot the probability of error in exactly identifying the best arm. We see that none of the algorithms successfully identify the best arm in the small-sample regime, as the gap is very low. However, our algorithms quickly zero in on arms that are almost as good as the optimal, and therefore the simple regret is well-behaved. Our algorithm performs this well even when the divergences are large, because it is able to reject the arms with high very effectively in the early phases.
High Divergence and High : This is easier than the previous setting. Here, we set and . Here, is very high for all the arms . Thus arm provides very noisy estimates for the other arms. We have for many arms. However, the minimum gap from the best arm is , which is not too small. This suggests it may be easier to distinguish the best arm.
The results are demonstrated in Figure 7. Figure 7(a) displays the simple regret. We see that in the small-sample regime, SRISv1 and SRISv2 outperform the others by a large margin. In the high-sample regime, SRISv2 is still the best, while SR and SRISv1 are close behind. In Figure 7(b) we plot the probability of error in exactly identifying the best arm. We see that SRISv2 performs very well in identifying the best arm even though arm yields highly noisy estimates. It is interesting to note that CR does not perform well. This can be attributed to the non-adaptive clipper in CR, which incurs a significant bias because arm has high divergences from most of the other arms.
Low Divergence and Low : This is another moderately hard setting, similar to the previous one. Here, we set and . Here, is not too high for the arms . This means that arm , which can be pulled times, is moderately good for estimating the other arms. Here, for most arms . However, the minimum gap from the best arm is , which is small. This suggests it may be hard to distinguish the best arm.
The results are demonstrated in Figure 8. Figure 8(a) displays the simple regret. We see that in the small-sample regime, SRISv1 and SRISv2 outperform the others by a large margin. In the high-sample regime, SRISv2 is still the best, while CR is close behind. In Figure 8(b) we plot the probability of error in exactly identifying the best arm. We see that most of the algorithms have a moderately high probability of error, as the gap is small. However, SRISv2 and SRISv1 are quickly able to zero in on arms close to the optimal, as reflected in the simple regret in the small-sample regime.
Low Divergence and High : This is the easiest of all the settings. Here, we set and . Here, arm is pretty close to the uniform distribution on . Therefore, it is very well suited for estimating the means of all the other arms. In fact, we have for many arms. Moreover, the minimum gap from the best arm is , which is not too small. This suggests it should be very easy to distinguish the best arm.

The results are demonstrated in Figure 9. Figure 9(a) displays the simple regret. We see that SRISv2 and CR perform extremely well, closely followed by SRISv1. In Figure 9(b) we plot the probability of error in exactly identifying the best arm. Again, SRISv2 and CR have almost zero probability of error, with SRISv1 close behind. This is because is pretty large. In this example, we observe that all the algorithms that use information leakage outperform SR, because arm is well-behaved. CR performs almost as well as SRISv2 in this example, since the static clipper is essentially never invoked: the ratios in the importance sampler are almost always well bounded.
5 Discussion and Future Work
In this paper, we analyze the problem of identifying the best arm (among various known conditionals) at a node in a causal graph, in terms of its effect on a target variable further downstream, possibly in a less understood portion of a larger causal network. We characterize the hardness of this problem in terms of the relative divergences of the various conditionals being tested and the gaps between the expected value of the target under the various arms. We provide the first problem-dependent simple regret and error bounds for this problem, which is a natural generalization of [3], but with information leakage between arms. We provide an efficient successive-rejects-style algorithm that achieves these guarantees by leveraging the leakage of information through carefully designed clipped importance samplers. Further, we introduce a new divergence measure that may be relevant for analyzing importance sampling estimators in the causal context; this may be of independent interest. We believe that our work paves the way for various interesting problems with significant practical implications. In the following, we state a few open questions in this regard:
Tighter guarantees on SRISv2: In Section 4, we observed that a slightly modified version of our algorithm, SRISv2, performs the best among all the competing algorithms, including SRISv1. The only difference between SRISv2 and Algorithm 1 is that in line 6 the estimators used in a phase also use samples from past phases, clipped according to the criterion of the current phase. We believe that this algorithm has tighter error and simple regret guarantees. We conjecture that at least one of the in the definition of in (6) can be eliminated, leading to better guarantees.
Estimating the marginals of the parents: In Algorithm 1, either the marginal of the parents of , that is , is required in order to calculate the divergences in Definition 2, or prior data involving the parents is required to estimate the divergences directly from data. However, we believe it is possible to fold this estimation directly into the online framework, since data about the marginals of the parents is available through the samples from all the arms, as these marginals remain unchanged.
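One simple way this online estimation could be modeled is a running empirical estimate of the parents' marginal, pooled across every arm's samples (valid precisely because soft interventions at the node leave the parents' marginal unchanged). This is a sketch of the idea, not the paper's algorithm.

```python
from collections import Counter

class ParentMarginalEstimator:
    """Running empirical estimate of the (unchanged) marginal distribution
    of a node's parents, pooled over samples collected under every arm."""

    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def update(self, parent_value):
        # Called once per sample, regardless of which arm produced it.
        self.counts[parent_value] += 1
        self.n += 1

    def prob(self, parent_value):
        # Empirical probability of a particular parent configuration.
        return self.counts[parent_value] / self.n if self.n else 0.0
```

The plug-in estimates `prob(.)` could then replace the true parent marginal wherever the divergences of Definition 2 are computed.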
Problem Dependent Lower Bound: In [22], a problem-independent lower bound of was provided for a special causal graph. However, a problem-parameter-dependent lower bound, like that of [3], remains an open problem. We believe that the lower bound will depend on the divergences between the distributions and the gaps between the rewards of the arms, similar to the term in (6).
General Learning Framework: Our work paves the way for a more general setting for learning counterfactual effects. Importance sampling is a fairly general tool and can in principle be applied at any set of nodes of a causal graph. Thus, it is possible to study the effect of a change at on a target by using importance sampling between the changed marginal distributions at an intermediate cut that blocks every path from to . In fact, this is explored in a non-bandit context in [7]. An important question is: What is the most suitable cut to use? [22] uses the cut closest to , i.e. the immediate parents of . However, the marginals of the cut under the different changes need to be estimated, and this cut is 'far' from the source and closer to the target. Therefore, there is a delicate tradeoff between the estimation errors of the changes at an intermediate cut between and , and the reduction in importance sampling divergences for cut distributions closer to the target . We believe understanding this tradeoff is quite important to fully exploit partial or full knowledge of the causal graph structure when answering causal strength questions from observed data.
References
 [1] Jacob Abernethy, Yiling Chen, Chien-Ju Ho, and Bo Waggoner. Low-cost learning via active data procurement. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 619–636. ACM, 2015.
 [2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels. Technical report, 2010.
 [3] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
 [4] George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.
 [5] Hubert M Blalock. Causal models in the social sciences. Transaction Publishers, 1985.
 [6] Richard Bonneau, Marc T Facciotti, David J Reiss, Amy K Schmid, Min Pan, Amardeep Kaur, Vesteinn Thorsson, Paul Shannon, Michael H Johnson, J Christopher Bare, et al. A predictive model for transcriptional control of physiology in a free living cell. Cell, 131(7):1354–1365, 2007.
 [7] Léon Bottou, Jonas Peters, Joaquin Quinonero Candela, Denis Xavier Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Y Simard, and Ed Snelson. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.
 [8] Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. arXiv preprint arXiv:1605.09004, 2016.
 [9] Image source. http://bit.ly/2hviIwP. Accessed: 20170223.
 [10] Lijie Chen and Jian Li. On the optimal sample complexity for best arm identification. arXiv preprint arXiv:1511.03774, 2015.

 [11] Hyunghoon Cho, Bonnie Berger, and Jian Peng. Reconstructing causal biological networks through active learning. PloS one, 11(3):e0150611, 2016.
 [12] Frederick Eberhardt. Almost optimal intervention sets for causal discovery. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI), pages 161–168, 2008.
 [13] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212–3220, 2012.
 [14] James William Hardin, Joseph M Hilbe, and Joseph Hilbe. Generalized linear models and extensions. Stata press, 2007.
 [15] Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of Sixth European Workshop on Probabilistic Graphical Models, 2012.
 [16] Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.
 [17] Kevin G Jamieson, Matthew Malloy, Robert D Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In COLT, volume 35, pages 423–439, 2014.
 [18] Michael Joffe, Manoj Gambhir, Marc ChadeauHyam, and Paolo Vineis. Causal diagrams in systems epidemiology. Emerging themes in epidemiology, 9(1):1, 2012.
 [19] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multiarmed bandit models. The Journal of Machine Learning Research, 2015.
 [20] Patrick Kemmeren, Katrin Sameith, Loes AL van de Pasch, Joris J Benschop, Tineke L Lenstra, Thanasis Margaritis, Eoghan O'Duibhir, Eva Apweiler, Sake van Wageningen, Cheuk W Ko, et al. Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell, 157(3):740–752, 2014.
 [21] Gabriel Krouk, Jesse Lingeman, Amy Marshall Colon, Gloria Coruzzi, and Dennis Shasha. Gene regulatory networks in plants: learning causality from time and perturbation. Genome biology, 14(6):1, 2013.
 [22] Finnian Lattimore, Tor Lattimore, and Mark D Reid. Causal bandits: Learning good interventions via causal inference. In Advances In Neural Information Processing Systems, pages 1181–1189, 2016.
 [23] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306. ACM, 2011.
 [24] Po-Ling Loh and Peter Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15(1):3065–3105, 2014.
 [25] Nicolai Meinshausen, Alain Hauser, Joris M Mooij, Jonas Peters, Philip Versteeg, and Peter Bühlmann. Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, 113(27):7361–7368, 2016.
 [26] Joris Mooij and Tom Heskes. Cyclic causal discovery from continuous equilibrium data. arXiv preprint arXiv:1309.6849, 2013.
 [27] Joris M Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.
 [28] Judea Pearl. Causality. Cambridge University Press, 2009.
 [29] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
 [30] Fernando Pérez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 1666–1670. IEEE, 2008.
 [31] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
 [32] Karen Sachs, Omar Perez, Dana Pe'er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
 [33] Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.
 [34] Aleksandrs Slivkins. Dynamic ad allocation: Bandits with budgets. arXiv preprint arXiv:1306.0155, 2013.
 [35] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.

 [36] Jerzy Splawa-Neyman, DM Dabrowska, TP Speed, et al. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–472, 1990.
 [37] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.

 [38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [39] Tong Zhang and Peiling Zhao. Stochastic optimization with importance sampling for regularized loss minimization. arXiv preprint arXiv:1401.2753, 2014.
Appendix A Variations of the Problem Setting
In this section we provide more general causal settings where our results can be directly applied.
Multiple nodes in the graph: This is illustrated in Fig. (a). Soft interventions can be performed at multiple nodes, such as at . These interventions can be modeled as changing the distribution , where is the union of the parents of and . These distributions can be thought of as the arms of the bandit, and our techniques can be applied as before to estimate the best intervention.
Directed cut between sources and targets: Fig. (b) represents the most general scenario in which our techniques can be applied. Soft or hard interventions can be performed at multiple source nodes, while the goal is to choose the best of these interventions in terms of maximizing a known function of multiple target nodes. If the effect of these interventions can be estimated on a directed cut separating the targets and the sources, then our techniques can be applied as before. This is akin to knowing under all the interventions in Fig. (b), because and form a directed cut separating the sources from the targets.
Empirical knowledge of continuous arm distributions: Our techniques can be applied to continuous distributions, as shown in our empirical results in Section 4.1. The extension is straightforward, using the general definition of divergences. More importantly, our techniques can be applied even if only prior empirical samples from the distributions are available, rather than the full distributions. In this case, the divergences can be estimated using nearest neighbor estimators similar to [30]. Moreover, for importance sampling only ratios of densities are needed, which can again be estimated from empirical data using nearest neighbor based techniques.
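A nearest-neighbor density-ratio estimate of the kind alluded to above can be sketched as follows. The normalization cancels between the two k-NN density estimates, in the spirit of the estimators in [30]; the choice of k and the point-wise form are illustrative assumptions.

```python
import numpy as np

def knn_density_ratio(x, p_samples, q_samples, k=5):
    """k-NN estimate of the density ratio q(x)/p(x) at a query point x.

    p_samples, q_samples: (n, d) and (m, d) arrays of prior empirical
    samples from the two arm distributions.  Uses the k-NN density
    estimate f(x) ~ k / (n * c_d * r_k(x)^d); the unit-ball volume c_d
    cancels in the ratio.
    """
    x = np.atleast_1d(x).astype(float)
    d = p_samples.shape[1]
    rp = np.sort(np.linalg.norm(p_samples - x, axis=1))[k - 1]  # k-NN radius in P
    rq = np.sort(np.linalg.norm(q_samples - x, axis=1))[k - 1]  # k-NN radius in Q
    n, m = len(p_samples), len(q_samples)
    # q(x)/p(x) ~ [k/(m c_d rq^d)] / [k/(n c_d rp^d)] = (n/m) * (rp/rq)^d
    return (n / m) * (rp / rq) ** d
```

Such point-wise ratio estimates could feed directly into the importance sampling weights when only empirical data about the arm distributions is available.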
Appendix B Interpretation of our Theoretical Results
In this section we compare our theoretical bounds on the probability of misidentification with the corresponding bounds in [3]. We also compare our simple regret guarantees with the guarantees in [22]. In both these cases, we demonstrate significant improvements. These theoretical improvements are exhibited in our empirical results in Section 4.1.
B.1 Comparison with [3]
Let , i.e. the set of arms that are closer to the optimal than arm . Let . The result for best arm identification with no information leakage in [3] can be stated as follows: the error in finding the optimal arm is bounded as:
(8) 
One intuitive interpretation of is that it is the maximum, over arms, of the number of samples (neglecting some log factors) required to conclude that arm is suboptimal, among the arms that are closer to the optimal than itself. Intuitively, this is because when there is no information leakage, one requires samples to distinguish between the th optimal arm and the optimal arm. Further, the th arm is played only a fraction of the time, since we do not know the identity of the th optimal arm.
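For concreteness, the phase schedule of the Successive Rejects algorithm of [3], the no-leakage baseline discussed here, can be sketched as follows (a direct transcription of the published schedule, with K arms and total budget n):

```python
import math

def successive_rejects_schedule(n, K):
    """Per-arm pull counts for each of the K-1 phases of Successive
    Rejects [3].  In phase k, each surviving arm is pulled
    n_k - n_{k-1} additional times, then the worst arm is rejected."""
    # log_bar(K) = 1/2 + sum_{i=2}^{K} 1/i
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    # n_k = ceil( (n - K) / (log_bar(K) * (K + 1 - k)) ), with n_0 = 0.
    n_k = [0] + [math.ceil((n - K) / (log_bar * (K + 1 - k)))
                 for k in range(1, K)]
    return [n_k[k] - n_k[k - 1] for k in range(1, K)]
```

The schedule spends roughly equal "effort" on each phase, which is what yields the dependence of the error bound discussed above; the total number of pulls never exceeds n.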
Our main result in Theorem 1 can be seen as a generalization of this existing result to the case where there is information leakage between the arms (various changes in a causal graph).
The term in our setting is the 'effective standard deviation' due to information leakage. There is a similar interpretation of our result (ignoring the log factors): since there is information leakage, the expression characterizes the number of samples required to weed arm out of contention from among competing arms (arms that are at a distance at most twice that of arm from the optimal arm). The interpretation of 'effective variance' is justified using importance sampling, as detailed in Section C.3. Further, in our framework also incorporates any budget constraint that comes with the problem, i.e. any a priori constraint on the relative fraction of times different arms may be pulled.
For ease of exposition, let denote the index of the th best arm (for ) and let denote the corresponding gap. In this setting, the terms (from the result in [3]) and can be written as:
can be smaller than due to information leakage, since every single arm pull contributes to the estimates of other arms. Therefore, our bounds provide better guarantees than [3].
To see the improvement over the previous result in [3], we consider the special case when the cost budget is infinite and there is only the sample budget . In addition, let us assume that the log divergences are such that: . Let . If , the optimal solution to (4) is somewhat complicated to interpret. Consider the feasible allocation in (4). Evaluating the objective function for this feasible allocation, it is possible to show that . Hence, unless the variance due to information leakage is too large, the effective variance is smaller than in the case with no information leakage.
The improvement over the no-information-leakage setting is even more pronounced under budget constraints. Consider the setting S1, and assume that the fractional budget of the difficult arms is . This implies that the total number of samples available for the difficult arms is . The budget-constrained case was not analyzed in [3]; however, in the absence of information leakage, one would expect the arms with the fewest samples to be the hardest to eliminate, so the error guarantees would scale as (excluding factors). On the other hand, our algorithm can leverage the information leakage, and its error guarantee scales as , which can be order-wise better if the effective standard deviations are well-behaved.
B.2 Comparison with [22]
In [22], the algorithm is based on clipped importance samples, where the clipper is always set at a static level of (excluding factors). The simple regret guarantee in [22] scales as , where is a global hardness parameter. The guarantees do not adapt to the problem parameters, specifically the gaps .
On the contrary, we provide problem-dependent bounds that differentiate the arms according to their gaps from the optimal arm and their effective standard deviation parameters. The terms can be interpreted as the hardness parameter for rejecting arm . Note that depends only on the arms that are at least as bad in terms of their gap from the optimal arm. Moreover, the guarantees adapt to our general budget constraints, which are absent in [22]. It can be seen that when the 's do not scale with , our simple regret is exponentially small in (depending on the 's) and can be much smaller than . The guarantee also generalizes to the problem-independent setting when the 's scale as .
Appendix C Proofs
In this section we present the theoretical analysis of our algorithm. Before we proceed to the proof of our main theorems, we derive some key lemmas that are useful in analyzing clipped importance sampled estimators.
C.1 Clipped Importance Sampling Estimator
In Section 3.1 we introduced the concept of importance sampling, which helps us use samples collected under one arm to estimate the means of other arms. As noted in Section 3.1, a naive unbiased importance sampling estimator can have unbounded variance, leading to poor guarantees. We now introduce clipped importance samplers and provide a novel analysis of these estimators that alleviates the variance-related issues.
Clipped Importance Samplers: The naive estimator of (2) is not suitable for yielding good confidence intervals. It has been observed in the context of importance sampling that clipping the estimator in (2) at a carefully chosen value can yield better confidence guarantees, even though the resulting estimator becomes slightly biased [7]. Before we introduce the precise estimator, let us define a key quantity that will be useful for the analysis.
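A minimal sketch of ratio-clipped importance sampling captures the bias/variance trade described above. This is one common form of the idea (zeroing out terms whose importance ratio exceeds the clip level); the paper's exact estimator in (10) may differ in details.

```python
def clipped_is_estimate(samples, w, clip):
    """Estimate E_q[Y] from (x, y) samples drawn under a sampling arm p.

    `w(x) = q(x) / p(x)` is the importance ratio of the target arm q to
    the sampling arm p.  Terms whose ratio exceeds `clip` are dropped:
    every retained term is bounded by clip * |y|, so the variance is
    controlled at the cost of a small, controllable bias [7].
    """
    total = 0.0
    for x, y in samples:
        r = w(x)            # importance ratio q(x)/p(x)
        if r <= clip:       # keep only well-behaved ratios
            total += y * r
    return total / len(samples)
```

With a generous clip level the estimate coincides with the naive unbiased importance sampler; lowering the clip trades bias for boundedness.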
Definition 4.
We define as follows:
(9) 
for all , where .
We shall see that is related to the conditional divergence between and for the carefully chosen function introduced in Section 3.1.
Now we are in a position to provide confidence guarantees for the following clipped estimator:
(10) 
Lemma 1.
The estimate for satisfies the following:

(11) 
(12)
Proof.
We have the following chain:
Here, follows because . This yields the first part of the lemma:
(13) 
where . Note that all the terms in the summation of (10) are bounded by