 # Approximating Pandora's Box with Correlations

The Pandora's Box problem asks to find a search strategy over $n$ alternatives given stochastic information about their values, aiming to minimize the sum of the search cost and the value of the chosen alternative. Even though the case of independently distributed values is well understood, our algorithmic understanding of the problem is very limited once the independence assumption is dropped. Our work aims to characterize the complexity of approximating the Pandora's Box problem under correlated value distributions. To that end, we present a general reduction to a simpler version of Pandora's Box, that only asks to find a value below a certain threshold, and eliminates the need to reason about future values that will arise during the search. Using this general tool, we study two cases of correlation: the case of explicitly given distributions of support $m$ and the case of mixtures of $m$ product distributions.

• In the first case, we connect Pandora's Box to the well studied problem of Optimal Decision Tree, obtaining an $O(\log m)$ approximation but also showing that the problem is strictly easier as it is equivalent (up to constant factors) to the Uniform Decision Tree problem.

• In the case of mixtures of product distributions, the problem is again related to the noisy variant of Optimal Decision Tree, which is significantly more challenging. We give a constant-factor approximation that runs in time $n^{\tilde{O}(m^2/\varepsilon^2)}$ for $m$ mixture components whose marginals on every alternative are either identical or separated in TV distance by $\varepsilon$.


## 1 Introduction

Many everyday tasks involve making decisions under uncertainty; for example driving to work using the fastest route or buying a house at the best price. Despite not knowing how our current decisions will turn out, or how they affect future outcomes, there is usually some prior information which we can use to facilitate the decision making process. For example, having driven on the possible routes to work before, we know which one is most frequently the busiest. It is also common in such cases that we can remove part of the uncertainty by paying some additional cost. This type of problem is modeled by Pandora’s Box, first formalized by Weitzman in [WEI79]. In this problem, the algorithm is given $n$ alternatives called boxes, each containing a value from a known distribution. The exact value is not known, but can be revealed for a known opening cost specific to the box. The goal is for the algorithm to decide which is the next box to open and whether to select a value and stop, such that the total opening cost plus the minimum value revealed is minimized. In the case of independent distributions on the boxes’ values, this problem has a very elegant and simple solution, described by Weitzman [WEI79], which obtains the optimal cost: calculate an index for each box (a special case of the Gittins index [GJ74]), open the boxes in decreasing index, and stop when the expected gain is worse than the value already obtained.

Weitzman’s model makes the crucial assumption that the distributions on the values are independent. This, however, is not always the case in practice and, as it turns out, the simple algorithm of the independent case fails to find the optimal solution under correlated distributions. Generally, the complexity of the Pandora’s Box with correlations is not yet well understood. The first step towards this direction was made by [CGT+20], who considered competing against a simpler benchmark, namely the optimal performance achievable using a strategy that cannot adapt the order in which it opens boxes to the values revealed. In general, optimal strategies can decide both the ordering of the boxes and the stopping time based on the values revealed, but such strategies can be hard to learn using samples.

In this work we study the complexity of the Pandora’s Box problem with correlated value distributions against the most general benchmark and provide the first non-trivial approximations. We start by presenting a reduction to a simpler version of Pandora’s Box, in which we optimize the search cost until a value less than a threshold is found. The goal of this reduction is to serve as a tool to remove the need to account for values altogether, making the problem easier to approach. The generality of this tool allows us to use it under any correlated setting. We specifically study two cases of succinctly representable correlated distributions: the case of explicitly given distributions over a small support of size $m$ and the case of mixtures of $m$ product distributions.

In the case of correlated distributions with small (explicitly given) support, we show that Pandora’s Box is tightly connected to another well known problem in decision making, the Optimal Decision Tree (ODT). In ODT, we are asked to identify an unknown hypothesis, out of $m$ possible ones, by performing a sequence of tests. Each test has a cost and, if chosen, reveals a result, which depends on which hypothesis is realized. The goal of the algorithm is to minimize the total cost of tests performed in order to learn which is the correct hypothesis. This problem has been studied for many years, and has various applications in medical diagnosis (e.g. [PKS+02]), fault diagnosis (e.g. [PD92]), and active learning ([GB09, GKR10]). Currently there is an $O(\log m)$-approximation for this problem [GB09], which is also shown to be the best possible in [CPR+11]. By connecting Pandora’s Box to Optimal Decision Tree, we immediately obtain an $O(\log m)$-approximation algorithm. However, going one step further, we show that Pandora’s Box is in fact equivalent, up to constant factors, to Uniform Decision Tree (UDT), a special case of ODT where the distribution over hypotheses is uniform. UDT was recently shown to be strictly easier than the general version [LLM20], which also makes Pandora’s Box strictly easier.

In the mixture of distributions case, we observe that Pandora’s Box is related to the noisy version of ODT, where the result of every test and every hypothesis is not deterministic. Previous work in this area obtained algorithms whose approximation and runtime depend on the amount of noise. In our case, by only requiring that the marginals of the mixtures differ enough in TV distance for each box, we obtain a constant-factor approximation whose running time depends only on the number of boxes and the number of the mixtures.

We give a more detailed overview of our results in the next section.

### 1.1 Our Results

##### A tool for removing values:

in section 3 we present a reduction to the threshold version of Pandora’s Box (PB≤T), where the objective is to find a value below a threshold $T$, instead of the minimum value. Given an $\alpha$-approximation for PB≤T, our reduction produces an algorithm that is $O(\alpha \log \alpha)$-approximate for the original instance with values, while making no assumptions on the type of correlation.

##### Explicitly given distributions:

in this case we show that Pandora’s Box is closely related to the optimal decision tree problem (Theorem 4.1), which implies an $O(\log m)$ approximation for Pandora’s Box. Subsequently we show that, in fact, Pandora’s Box is equivalent up to constant factors to the Uniform Decision Tree problem (Theorems 4.2 and 4.4), which is known to be strictly easier than the general Optimal Decision Tree (as shown in [LLM20]).

##### Mixture of m product distributions:

in this case Pandora’s Box is related to the noisy version of the optimal decision tree. The noise can be arbitrary and we only require the mixtures to satisfy a separability condition: the marginals should either be identical or differ by at least $\varepsilon$ in TV distance. Using this property, we design a constant-factor approximation for PB≤T that runs in time $n^{\tilde{O}(m^2/\varepsilon^2)}$ (Theorem 5.1), where $n$ is the number of boxes. Using the tool of Theorem 3.2, this also implies a constant-factor approximation for the initial Pandora’s Box problem with values (Corollary 5.1.1).

### 1.2 Related work

The Pandora’s Box problem was first introduced by Weitzman in the Economics literature [WEI79]. Since then, there has been a long line of research studying Pandora’s Box and its variants [DOV18, BK19, EHL+19, CGT+20, BFL+20], the generalized setting where more information can be obtained for a price [CFG+00, GK01, CJK+15b, CHK+15a] and in settings with more complex combinatorial constraints [SIN18, GGM06, GN13, ASW16, GNS16, GNS17b, GJS+19].

Optimal decision tree is an old problem studied in a variety of settings, and its most notable application is in active learning. It was proven to be NP-Hard by Hyafil and Rivest [HR76]. Since then the problem of finding the best algorithm has been an active one [GG74, LOV85, KPB99, DAS04, CPR+11, CPR+09, GB09, GNR17a, CJL+10, AH12], and finally a greedy $O(\log m)$-approximation for the general case was given by [GB09]. This approximation ratio is proven to be the best possible [CPR+11]. For the case of Uniform Decision Tree less is known: until recently the best algorithm was the same as for the optimal decision tree, with a lower bound given in [CPR+11]. The recent work of Li et al. [LLM20] showed that there is an algorithm strictly better than $O(\log m)$ for the uniform decision tree.

The noisy version of optimal decision tree was first studied in [GKR10], which gave an algorithm with runtime that depends exponentially on the number of noisy outcomes. (This result is based on a result from [GK11] which turned out to be wrong [NS17]; the correct results are presented in [GK17].) Subsequently, Jia et al. [JNN+19] gave an approximation algorithm whose ratio depends on the maximum number of different test results per test and per scenario, using a reduction to the Adaptive Submodular Ranking problem [KNN17]. In the case of a large number of noisy outcomes, they obtain an approximation by exploiting the connection to Stochastic Set Cover [LPR+08, INv16].

## 2 Preliminaries

We formally define the problems used in the following sections. We distinguish them in two families; the variants of Decision Tree and the variants of Pandora’s Box.

### 2.1 Decision Tree-like problems

In Decision Tree problems we are given a set $\mathcal{S}$ of $m$ scenarios, each scenario $s \in \mathcal{S}$ occurring with (known) probability $p_s$, and $n$ tests $T_1, \dots, T_n$, each with non-negative cost $c_i$ for $i \in [n]$. Each test $T_i$ gives a specific result $T_i(s)$ for every scenario $s$. Nature picks a scenario from the distribution $p$. The goal of the algorithm is to run a series of tests to determine which scenario is realized.

The output of the algorithm is a decision tree where at each node there is a test that is performed, and the branches are the outcomes of the test. In each of the leaves there is an individual scenario that is the only one consistent with the results of the test in the branch from the root to this leaf. We can think of this tree as an adaptive policy that, given the set of outcomes so far, decides the next test to perform.

The objective is to find a decision tree that minimizes the cost $\sum_{s \in \mathcal{S}} p_s c_s$, i.e. the average cost for all scenarios to be identified, where $c_s$ is the total cost of the tests used to reach scenario $s$. We denote this general version as ODT: Optimal Decision Tree. In the case where all $p_s = 1/m$, the problem is called Uniform Decision Tree or UDT.

###### Definition 2.1 (Policy π).

Consider the set of possible tests and the set of possible results for each test. A policy $\pi$ is a function that, given the set of tests done so far and their results, returns the next test to be performed.
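To make the decision tree objective concrete, the following is a minimal Python sketch that evaluates the expected cost $\sum_s p_s c_s$ of an adaptive policy on a small explicit instance. The function name `expected_tree_cost`, the encoding `results[i][s]`, and the representation of a policy as a function of the set of scenarios still consistent with the observed results are illustrative assumptions, not notation from the paper.

```python
from typing import Callable, FrozenSet, List


def expected_tree_cost(
    probs: List[float],
    test_costs: List[float],
    results: List[List[int]],
    policy: Callable[[FrozenSet[int]], int],
) -> float:
    """Expected cost sum_s p_s * c_s of an adaptive testing policy.

    results[i][s] is the outcome of test i under scenario s (assumed encoding).
    policy maps the set of scenarios consistent with all outcomes seen so far
    to the index of the next test to run.
    """
    total = 0.0
    for s, p in enumerate(probs):
        alive = frozenset(range(len(probs)))  # scenarios not yet ruled out
        cost = 0.0
        while len(alive) > 1:
            i = policy(alive)
            cost += test_costs[i]
            narrowed = frozenset(t for t in alive if results[i][t] == results[i][s])
            if narrowed == alive:
                raise ValueError("policy chose a test that rules out no scenario")
            alive = narrowed
        total += p * cost
    return total


# Toy instance: test 0 separates scenario 0 from {1, 2}; test 1 separates 1 from 2.
probs = [0.5, 0.25, 0.25]
test_costs = [1.0, 2.0]
results = [[0, 1, 1], [0, 0, 1]]


def toy_policy(alive: FrozenSet[int]) -> int:
    return 0 if len(alive) == 3 else 1


toy_cost = expected_tree_cost(probs, test_costs, results, toy_policy)
```

In this toy instance scenario 0 is identified after the first test, while scenarios 1 and 2 need both tests, so the expected cost is a probability-weighted mix of the two path costs.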

### 2.2 Pandora’s Box-like problems

In the original Pandora’s Box problem we are given $n$ boxes, each box $i$ with cost $c_i \ge 0$ and value $v_i \sim \mathcal{D}_i$, where the distributions are known and independent. To learn the exact value inside box $i$ we need to pay the respective probing cost $c_i$. The objective is to minimize the sum of the total probing cost and the minimum value obtained. In the correlated version of the Pandora’s Box problem, denoted by PB, the distributions on the values can be arbitrarily correlated. We use $\mathcal{D}$ to denote the joint distribution over the values and $\mathcal{D}_i$ its marginal for the value in box $i$. The objective is to find a policy $\pi$ of opening boxes such that the expected cost of the boxes opened plus the minimum value discovered is minimized, where the expectation is taken over all possible realizations of the values in each box. Formally we want to minimize

$$\mathbb{E}_{s}\left[\min_{j \in \pi} v_{js} + \sum_{i \in \pi} c_i\right],$$

where we slightly abused notation and denoted by $\pi$ the set of boxes opened.
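As an illustration of this objective, here is a small Python sketch that evaluates the expected cost of an adaptive policy when the joint distribution is given as an explicit list of scenarios. The names `expected_pb_cost` and `example_policy`, and the convention that a policy returns `None` to stop, are assumptions made for this example only.

```python
from typing import Callable, Dict, List, Optional

# A policy looks at the values revealed so far ({box index: value}) and returns
# the next box to open, or None to stop and keep the minimum value seen.
Policy = Callable[[Dict[int, float]], Optional[int]]


def expected_pb_cost(
    probs: List[float],
    costs: List[float],
    values: List[List[float]],
    policy: Policy,
) -> float:
    """E_s[min_{j in pi} v_js + sum_{i in pi} c_i]; values[s][i] is box i under scenario s."""
    total = 0.0
    for p, vals in zip(probs, values):
        seen: Dict[int, float] = {}
        while True:
            nxt = policy(seen)
            if nxt is None:
                break
            if nxt in seen:
                raise ValueError("policy tried to reopen a box")
            seen[nxt] = vals[nxt]
        if not seen:
            raise ValueError("policy must open at least one box")
        total += p * (min(seen.values()) + sum(costs[i] for i in seen))
    return total


def example_policy(seen: Dict[int, float]) -> Optional[int]:
    # Open box 0 first; stop if its value is small, otherwise also open box 1.
    if 0 not in seen:
        return 0
    if seen[0] <= 1 or 1 in seen:
        return None
    return 1


pb_cost = expected_pb_cost(
    probs=[0.5, 0.5],
    costs=[1.0, 1.0],
    values=[[0.0, 10.0], [10.0, 0.0]],
    policy=example_policy,
)
```

Here the policy adapts to what it sees: in the first scenario it stops after one box, in the second it pays for both boxes before finding the small value.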

We also introduce Pandora’s Box with a costly outside option, with two parameters: $T$ (called the threshold) and the cost of an extra box (called the outside option). In this version the objective is to minimize the cost of finding a value $\le T$, while we have the extra option to quit searching by opening the outside option box at its cost. We say that a scenario is covered in a given run of the algorithm if it does not choose the outside option box. For the remainder of this paper, the outside option always costs $T$, and we denote the problem by PB≤T.
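The threshold version can be sketched in the same style. The snippet below is a hedged illustration, assuming an explicit list of scenarios, an outside option of cost $T$, and the convention that the search ends as soon as a value at most $T$ is revealed; the names `expected_threshold_cost` and `OUTSIDE` are hypothetical. It returns both the expected cost and the covered probability mass.

```python
from typing import Callable, Dict, List, Tuple

OUTSIDE = -1  # sentinel meaning "take the outside option" (assumed encoding)


def expected_threshold_cost(
    probs: List[float],
    costs: List[float],
    values: List[List[float]],
    T: float,
    policy: Callable[[Dict[int, float]], int],
) -> Tuple[float, float]:
    """Return (expected cost, covered mass) of a policy for the threshold problem.

    A scenario is covered if a value <= T is found before the outside option is
    taken; taking the outside option costs T.
    """
    total, covered = 0.0, 0.0
    for p, vals in zip(probs, values):
        seen: Dict[int, float] = {}
        cost = 0.0
        while True:
            if any(v <= T for v in seen.values()):
                covered += p  # found a value below the threshold
                break
            nxt = policy(seen)
            if nxt == OUTSIDE:
                cost += T  # quit searching and pay for the outside option
                break
            if nxt in seen:
                raise ValueError("policy tried to reopen a box")
            cost += costs[nxt]
            seen[nxt] = vals[nxt]
        total += p * cost
    return total, covered


def order_then_quit(seen: Dict[int, float]) -> int:
    # Open boxes 0 and 1 in order, then give up.
    for i in (0, 1):
        if i not in seen:
            return i
    return OUTSIDE


thr_cost, thr_covered = expected_threshold_cost(
    probs=[0.5, 0.5],
    costs=[1.0, 1.0],
    values=[[0.0, 10.0], [10.0, 10.0]],
    T=2.0,
    policy=order_then_quit,
)
```

Note that the revealed values no longer enter the cost; only the probing cost and the outside option do, which is what makes this version easier to reason about.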

Finally, we denote by $\pi_I$ and $\pi^*_I$ a policy and the optimal policy for an instance $I$ of PB or PB≤T. Additionally, the policy for a specific scenario $s$ is denoted by $\pi_I(s)$.

#### 2.2.1 Modeling Correlation

In this work we study two general ways of modeling the correlation between the values in the boxes.

##### Explicit Distributions:

in this case, $\mathcal{D}$ is a distribution over $m$ scenarios where the $s$’th scenario is realized with probability $p_s$, for $s \in [m]$. Every scenario corresponds to a fixed and known vector of values contained in each box. Specifically, box $i$ has value $v_{is}$ for scenario $s$.

##### Mixture of Distributions:

We also consider a more general setting, where $\mathcal{D}$ is a mixture of $m$ product distributions. Specifically, each scenario $s$ is a product distribution; instead of giving a deterministic value for every box $i$, the result is drawn from distribution $\mathcal{D}_{is}$. This setting is a generalization of the explicit distributions setting described before.
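The relationship between the two models can be illustrated with a short Python sketch that samples a vector of box values from a mixture of product distributions. The data layout (`marginals[s][i]` as a list of `(value, probability)` pairs) and the function name `sample_values` are assumptions for illustration. An explicit distribution is recovered as the special case in which every marginal is a point mass.

```python
import random
from typing import List, Tuple

Marginal = List[Tuple[float, float]]  # (value, probability) pairs for one box


def sample_values(
    weights: List[float],
    marginals: List[List[Marginal]],
    rng: random.Random,
) -> Tuple[int, List[float]]:
    """Draw a scenario s with probability weights[s], then draw every box
    independently from its marginal marginals[s][i]."""
    s = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    vals = []
    for dist in marginals[s]:
        support = [v for v, _ in dist]
        pr = [q for _, q in dist]
        vals.append(rng.choices(support, weights=pr, k=1)[0])
    return s, vals


# Explicit distribution as a special case: each scenario fixes all box values.
explicit = [[0.0, 5.0, 3.0], [4.0, 1.0, 3.0]]
point_masses = [[[(v, 1.0)] for v in scenario] for scenario in explicit]

rng = random.Random(0)
s, vals = sample_values([0.5, 0.5], point_masses, rng)
```

With point-mass marginals the sampled vector always equals the value vector of the drawn scenario; with non-trivial marginals, the same scenario can produce different value vectors.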

## 3 A Tool Removing Values from Pandora’s Box

### 3.1 Warmup: A naive reduction

A solution to Pandora’s Box involves two components: the order in which to open boxes and a stopping rule. As a warm-up, we present a simple reduction from PB to PB≤T that is computationally efficient for the explicit distribution setting. This result essentially simplifies the stopping rule of the problem, allowing us to focus on the order in which boxes are opened.

###### Theorem 3.1.

Given an efficient -approximation for PB≤T for arbitrary $T$, there exists a -approximation for PB that runs in polynomial time in the number of scenarios, number of boxes, and the number of values.

The main idea is we can move the value information contained in the boxes into the cost of the boxes by creating one new box for every box and realized value pair. We still need to use the original boxes to obtain information about which scenario is realized. We do so by replacing values in the original boxes by high values while maintaining correlation.

##### PB≤T Instance:

given an instance $I$ of PB, we construct an instance $I'$ of PB≤T. We need $T$ to be sufficiently large so that the outside option is never chosen and so that we can easily get a policy for $I$ from a policy for $I'$. This is just a technical nuance. In particular, choosing $T$ to be larger than the sum of the costs of all the boxes plus the largest value that could ever be achieved ensures that the outside option is never better than even opening all boxes. Next, we replace the values in the original boxes by new values. All of these new values are larger than $T$, so we cannot stop after receiving such a value. However, they cause the same branching behaviour as before, since each distinct value is mapped bijectively to a new distinct value. We also add “final” boxes, one for each pair $(j, v)$ of a box $j$ and a value $v$ that this box could give. Each “final” box has a cost that incorporates the value $v$, and has value $0$ for the scenarios where box $j$ gives exactly value $v$ and value $T+1$ for all other scenarios. Formally,

$$V'_{i,(j,v)} = \begin{cases} 0 & \text{if } V_{i,j} = v \\ T+1 & \text{else} \end{cases}$$

Intuitively, these “final” boxes indicate to a policy that this will be the last box opened, and so its value, which is at least the best value among the boxes chosen, should now be taken into account in the cost of the solution. The proof of the theorem is deferred to section A.1 of the Appendix.

### 3.2 Main Tool

The reduction presented in the previous section, even though simple to describe, cannot be applied to every correlation setting. Specifically, observe that the number of boxes is increased to a number proportional to the size of the support, which could even be exponential. In this section we introduce a more sophisticated reduction that overcomes these issues while still preserving the approximation up to logarithmic factors.

###### Theorem 3.2.

If there exists an $\alpha$-approximation for PB≤T, then there exists an $O(\alpha \log \alpha)$-approximation for PB.

On a high level, in this reduction we repeatedly run the algorithm for PB≤T with increasingly large values of $T$, with the goal of capturing some mass of scenarios at every step. The thresholds for every run have to be cleverly chosen to guarantee that enough mass is captured at every run. The distributions on the boxes remain the same, and this reduction does not increase the number of boxes, therefore avoiding the issues that the simple reduction of section 3.1 faced.

In the proof of the theorem we use the following quantity, called the $p$-threshold. A $p$-threshold is the minimum possible threshold $T$ such that at most $p$ mass of the scenarios has cost more than $T$ in $\pi^*_I$. Formally

###### Definition 3.3 (p-Threshold).

Let $I$ be an instance of PB and $c_s$ be the cost of scenario $s$ in $\pi^*_I$. We define the $p$-threshold as

$$t_p = \min\{T : \Pr[c_s > T] \le p\}.$$
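For a finite set of scenarios the $p$-threshold can be computed directly from the definition. The sketch below assumes non-negative scenario costs and restricts attention to $T \ge 0$; the function name `p_threshold` and the small numerical tolerance are illustrative choices. It relies on the fact that the tail mass $\Pr[c_s > T]$ only changes at the scenario costs, so it suffices to check those candidates in increasing order.

```python
from typing import List


def p_threshold(costs: List[float], probs: List[float], p: float) -> float:
    """Smallest T >= 0 such that the probability mass of scenarios with
    cost strictly greater than T is at most p."""
    candidates = sorted(set([0.0] + list(costs)))
    for T in candidates:
        tail = sum(q for c, q in zip(costs, probs) if c > T)
        if tail <= p + 1e-12:  # tolerance for floating-point sums
            return T
    raise ValueError("p must be non-negative")


scenario_costs = [1.0, 2.0, 4.0, 8.0]
scenario_probs = [0.4, 0.3, 0.2, 0.1]

t_half = p_threshold(scenario_costs, scenario_probs, 0.5)
t_tenth = p_threshold(scenario_costs, scenario_probs, 0.1)
```

Smaller values of $p$ push the threshold towards the most expensive scenarios, which is why the reduction uses thresholds $t_p$ for geometrically decreasing $p$.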

Before continuing to the proof of the theorem, we show two key lemmas that guarantee that enough probability mass is covered at every phase we run the algorithm (Lemma 3.5) and show a lower bound on the optimal policy for the initial problem in every phase of the reduction (Lemma 3.6). Their proofs are deferred to section A.2 of the Appendix. We first formally define what is a sub-instance of Pandora’s Box which is used both in our reduction and lemma.

###### Definition 3.4 (Sub-instance).

Let be an instance of with set of scenarios each with probability . For any . We call a sub-instance of if and .

###### Lemma 3.5 (Threshold Bound).

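Given an instance $I$ of PB and an $\alpha$-approximation algorithm for PB≤T, let $I_q$ be a sub-instance of $I$. If the threshold $T$ satisfies

$$T \le t_{q/(10\alpha)} + 10\alpha \sum_{\substack{s \in \mathcal{S} \\ c_s \in [t_q,\, t_{q/(10\alpha)}]}} c_s \frac{p_s}{q}$$

then when running the algorithm on $I_q$ with threshold $T$, at most a $0.2$ fraction of the scenarios (by probability mass) picks the outside option box.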

###### Lemma 3.6 (Optimal Lower Bound).

In the reduction of Theorem 3.2, let $I$ be the instance of PB. For the optimal policy $\pi^*_I$ for $I$, in every phase it holds that

$$\pi^*_I \ge \sum_{i=1}^{\infty} \frac{1}{10\alpha} \cdot (0.2)^i \, t_{(0.2)^i/10\alpha}.$$

###### Proof of Theorem 3.2.

Given an instance $I$ of PB, we repeatedly run the algorithm for PB≤T in phases. Phase $i$ consists of running it with threshold $T_i$ on a sub-instance of the original problem, denoted by $I_i$. After every run, we remove the probability mass that was covered (recall, a scenario is covered if it does not choose the outside option box), and run on this new instance with a new threshold $T_{i+1}$. In each phase, the boxes, costs and values remain the same, but the objective is different: we are seeking values less than $T_i$. The thresholds are chosen such that at the end of each phase, $0.8$ of the remaining probability mass is covered. The reduction process is formally shown in Algorithm 1.

##### Calculating the thresholds:

for every phase $i$ we choose a threshold $T_i$ such that at most $0.2$ of the probability mass of the remaining scenarios is not covered. In order to select this threshold, we do binary search, running every time the $\alpha$-approximation algorithm for PB≤T with the outside option box and checking how many scenarios select it. We denote by $\mathrm{Int}_i$ the relevant interval of costs at every run of the algorithm. Then, by Lemma 3.5, we know that for remaining total probability mass $(0.2)^i$, a threshold which satisfies

$$T_i \le t_{(0.2)^{i-1}/10\alpha} + 10\alpha \sum_{\substack{s \in \mathcal{S} \\ c_s \in \mathrm{Int}_i}} c_s \frac{p_s}{(0.2)^i} \tag{1}$$

also satisfies the desired covering property: at least $0.8$ mass of the current scenarios is covered. Therefore the threshold found by our binary search satisfies inequality (1).
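The threshold search can be sketched as an ordinary binary search over a sorted list of candidate thresholds. This is only an illustration under stated assumptions: `uncovered_mass` is a hypothetical black box standing in for "run the approximation algorithm with threshold $T$ and measure the fraction of remaining mass that takes the outside option", it is assumed to be non-increasing in $T$, and the target fraction $0.2$ is taken from the $(0.2)^i$ factors used in the analysis.

```python
from typing import Callable, List


def smallest_covering_threshold(
    candidates: List[float],
    uncovered_mass: Callable[[float], float],
    target: float = 0.2,
) -> float:
    """Binary search for the smallest candidate T whose uncovered mass is at
    most `target`. Assumes `candidates` is sorted increasingly and that
    `uncovered_mass` is non-increasing in T."""
    lo, hi = 0, len(candidates) - 1
    if uncovered_mass(candidates[hi]) > target:
        raise ValueError("no candidate threshold covers enough mass")
    while lo < hi:
        mid = (lo + hi) // 2
        if uncovered_mass(candidates[mid]) <= target:
            hi = mid  # candidates[mid] works; look for something smaller
        else:
            lo = mid + 1  # candidates[mid] leaves too much mass uncovered
    return candidates[lo]


# Toy oracle: scenario s is covered exactly when T reaches need[s].
need = [1.0, 2.0, 4.0, 8.0]
mass = [0.4, 0.3, 0.2, 0.1]


def toy_uncovered(T: float) -> float:
    return sum(q for c, q in zip(need, mass) if c > T)


T_phase = smallest_covering_threshold(sorted(need), toy_uncovered)
```

In the actual reduction each oracle call is a full run of the PB≤T algorithm, so the number of candidate thresholds tried by the search directly affects the running time.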

##### Constructing the final policy:

by running PB≤T in phases, we get a different policy for every phase $i$, which we denote by $\pi_i$ for brevity. We construct the policy $\pi_I$ for the original instance by following $\pi_i$ in each phase until the total probing cost exceeds $T_i$, at which point $\pi_I$ starts following $\pi_{i+1}$, or stops if a value below $T_i$ is found.

##### Accounting for the values:

in the initial problem, the value chosen for every scenario is part of the cost, while in the transformed problem, only the probing cost is part of the cost. Consider any scenario $s$ and its cost in the original and in the transformed problem; we claim that the two are within a constant factor of each other. Observe that in every run with threshold $T_i$, only the scenarios with $v_{js} \le T_i$ for some box $j$ can be covered and removed from the instance. The way we constructed the final policy is essentially the same as running ski-rental for every scenario: in the phase in which the scenario is covered, its value is at most $T_i$, and we stop when the probing cost exceeds this amount. Therefore, the total cost without the value is within a constant factor of the cost with the value included.

##### Bounding the final cost:

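using the guarantee that at the end of every phase we cover $0.8$ of the remaining scenarios, observe that the algorithm for PB≤T is run in an interval of the form $[t_q, t_{q/(10\alpha)}]$, where $q$ is the remaining probability mass. Note also that these intervals are overlapping. Bounding the cost of the final policy $\pi_I$ over all intervals we get

$$\begin{aligned}
\pi_I &\le \sum_{i=0}^{\infty} (0.2)^i\, T_i \\
&\le \sum_{i=0}^{\infty} \left( (0.2)^i\, t_{(0.2)^{i-1}/10\alpha} + 10\alpha \sum_{\substack{s \in \mathcal{S} \\ c_s \in \mathrm{Int}_i}} c_s p_s \right) && \text{from inequality (1)} \\
&\le 2 \cdot 10\alpha\, \pi^*_I + 10\alpha \sum_{i=0}^{\infty} \sum_{\substack{s \in \mathcal{S} \\ c_s \in \mathrm{Int}_i}} c_s p_s && \text{using Lemma 3.6} \\
&\le 20\alpha \log\alpha \cdot \pi^*_I,
\end{aligned}$$

where the last inequality follows since each scenario with cost $c_s$ can belong to at most $O(\log \alpha)$ intervals; this gives the theorem. The extra constant factor accounts for the values in the original instance. ∎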

Notice the generality of this reduction; the distributions on the values are preserved, and we did not make any more assumptions on the scenarios or values throughout the proof. Therefore we can apply this tool regardless of the type of correlation or the way it is given to us, e.g. we could be given a parametric distribution, or an explicitly given distribution, as we see in the next section.

## 4 Explicit Distributions

In this section we study the case where the distributions are explicitly given, in the form of $m$ scenarios. We show that in this case, our problem is directly related to the optimal decision tree literature. We first describe a straightforward reduction from PB≤T to Optimal Decision Tree. Then we show that the problem is actually easier and reduces to Uniform Decision Tree. The full picture of our reductions is shown in figure 1. There, an arrow from problem $A$ to problem $B$ means that $A$ reduces to $B$, and the number of the theorem where this is shown is given above the arrow. Similarly, a double-headed arrow is the same as having both directions.

### 4.1 Reduction to ODT

We first show that PB≤T can be reduced to the more general Optimal Decision Tree. This reduction implies that any known algorithm for ODT can be applied to PB≤T and give the same guarantees. Since the best possible algorithm for ODT is an $O(\log m)$-approximation, this implies an $O(\log m)$ approximation for PB≤T.

###### Theorem 4.1.

If there exists an -approximation algorithm for then there exists a -approximation for .

###### Proof.

We show how to convert an input given for PB≤T to one for ODT, and how to use a solution of the ODT instance to get one for PB≤T in polynomial time with the claimed approximation factor.

##### ODT Instance:

Let $I'$ be the initial instance of PB≤T. To construct the instance $I$ of ODT, we subdivide the scenarios and their probabilities. To be precise, for any scenario $j$ we construct two subdivisions $j_1$ and $j_2$, both having probability $p_j/2$. For every box $i$ with cost $c_i$ we create a test $T_i$ costing $c_i$. The result of this test, when scenario subdivision $j_k$ is realized is

 Ti(jk)={j∗k, if vij

In other words, if a box gave a value less than $T$ for a scenario $j$, the test corresponding to the box isolates the subdivisions of that scenario, so that the policy can stop.

We also have a single multi-way test that distinguishes all scenarios instantly at a cost of $T$. This simulates the outside option.

##### Constructing the policy:

We construct a policy $\pi'$ for PB≤T given a policy $\pi$ for ODT. We start at the root of $\pi$, and open boxes in $\pi'$ as suggested by our current location in $\pi$. At every step, the outcome of the test suggests a step to take in the tree $\pi$. To be precise, there are two cases to consider.

1. If we encounter a box test $T_i$, we open the corresponding box $i$ in $\pi'$.
   1. If the value of this box is less than $T$, $\pi'$ terminates.
   2. If the value of this box is at least $T$, then $\pi'$ continues following the corresponding branch in $\pi$.
2. If we encounter the outside option test, then $\pi'$ takes the outside option and terminates.

We argue that $\pi'$ is in fact a feasible policy for the PB≤T instance. This follows since the only ways that $\pi$ could isolate a scenario subdivision $j_k$ are either by running a box test $T_i$ whose result isolates it, or by running the outside option test, which isolates all scenario subdivisions by definition. In the latter case, $\pi'$ takes the outside option in step 2 and so gives a valid solution to scenario $j$. In the former case, we know that $T_i$ isolates $j_k$ only if $v_{ij} < T$. Thus, box $i$ is opened on $j$’s branch in step 1 and a value less than $T$ is achieved. Hence, $\pi'$ is a feasible policy for the instance.

##### Approximation ratio:

First, note that $\pi'$ always opens boxes (or takes the outside option) with the same cost as the corresponding test run by $\pi$, by construction. Also, the two subdivisions of a scenario always go down the same branches and are always isolated at the same time. This follows since either a test corresponding to a box isolated a subdivision, and so would isolate both, or running the outside option test isolated them. Hence, the branch for scenario $j$ has exactly the same cost in $\pi'$ as its subdivisions in $\pi$, and their total probabilities are the same. So, $c(\pi') \le c(\pi)$.

For the optimal solutions we show that $c(\pi^*_I) \le c(\pi^*_{I'})$, where $I$ is the ODT instance and $I'$ the PB≤T instance. This follows almost identically to the argument above just by swapping boxes with tests. In particular, for any branch of $\pi^*_{I'}$, we either reach a box with a value less than $T$ or take the outside option. If we do the sequence of tests corresponding to this sequence of box openings and end whenever the outside option is taken, we end up with a solution to the ODT instance of cost at most $c(\pi^*_{I'})$. Putting it all together, we get that

$$c(\pi') \le c(\pi) \le \alpha\, c(\pi^*_I) \le \alpha\, c(\pi^*_{I'})$$

Clearly, the reduction can be done in polynomial time and so this yields the claimed approximation for PB≤T. ∎

### 4.2 A Stronger Result: Equivalence with UDT

The previous reduction to the optimal decision tree problem highlights the similarity of these two problems. However, optimal decision tree is a very general and powerful problem. In this section we show that the Pandora’s Box problem is actually strictly easier than the optimal decision tree. Specifically, we reduce PB≤T to the uniform decision tree, which Li et al. [LLM20] proved admits a strictly better approximation than the best possible for optimal decision tree.

#### 4.2.1 Reducing PB≤T to UDT

In this section we show a reduction from PB≤T to UDT, formally stated in Theorem 4.2. The currently known results for UDT all assume uniform cost tests, but this reduction, even though simple, introduces a test with cost $T$. If the initial PB≤T instance had non-uniform cost boxes, introducing costs is unavoidable. However, if the initial instance had uniform costs, except the outside option box, we show that it is possible to avoid introducing costs (Theorem 4.3), and therefore all the known results for UDT apply to PB≤T.

###### Theorem 4.2.

If there exists an approximation for where is the number of scenarios, then there exists an -approximation for .

###### Theorem 4.3.

If there exists an approximation for with uniform costs where is the number of scenarios, then there exists an -approximation for with uniform costs.

This theorem combined with the result of Li et al. [LLM20] that gives a -approximation for with uniform test costs and with our reduction from section 3.2 we get the following corollary.

###### Corollary 4.3.1.

For an instance of with uniform costs and scenarios, there exists a -approximation algorithm for . Additionally if the tests only have a constant number of different results, the approximation ratio is .444This holds since where is number of different test results.

The proof of Theorem 4.3 follows similarly to that for Theorem 4.2, and is deferred to section B.1 of the Appendix.

###### Proof of Theorem 4.2.

Given an instance $I$ of PB≤T with an outside option box of cost $T$, we construct the instance $I'$ of UDT as follows.

##### Constructing the instance:

we first remove all scenarios with . Let be the number of scenarios remaining so that is the smallest probability remaining. All probabilities are scaled by such that . We then create copies of scenario , called sub-scenarios and denoted by where each having equal probability . For simplicity, we assume the number of copies is integral. For any box of , we create a test called a box test such that

 TB(i,j)={j, if vBi

In other words, if a box had a value less than $T$ for a scenario, the test corresponding to the box isolates each sub-scenario of that scenario by sending it to its own branch, and sends all other scenario copies to the same branch. We also add an outside option test that costs $T$ and isolates all copies of all scenarios by giving a different result for each.

##### Constructing the policy:

given a policy $\pi_{I'}$, we construct a policy $\pi_I$. Starting from the root of $\pi_{I'}$, whenever $\pi_{I'}$ chooses to run a box test on a branch with a copy of scenario $i$, $\pi_I$ opens the corresponding box on the branch for scenario $i$. If $\pi_{I'}$ chooses the outside option test, then $\pi_I$ also chooses the outside option box. If at some point the policy has spent more than $T$, we stop and take the outside option box, incurring at most twice the cost of $\pi_{I'}$.

For the constructed policy to be feasible we show that for any scenario either (1) $\pi_I$ opens a box giving value less than $T$ or (2) $\pi_I$ takes the outside option. Feasibility is immediate since the isolating tests correspond one-to-one to the boxes with value less than $T$, and the low probability scenarios take the outside option.

##### Approximation ratio:

the low probability scenarios at worst take the outside option incurring cost

$$2\,\frac{c_{\min}}{Tm} \cdot T \cdot m \le 2 c_{\min} \le 2 c(\pi^*_{I'})$$

since there is at most and their probability is at most . Let be any scenario with . Then for every test run, the corresponding box is opened, and whenever isolates, at the same box also finds a value less than . Therefore, scenario contributes the same cost, but with scaled probability , and since it holds that . Summing up, we get . Putting it all together we get

$$c(\pi_I) \le 2 c(\pi_{I'}) \le 2\alpha\, c(\pi^*_{I'}) \le 4\alpha\, c(\pi^*_I),$$

where the second inequality follows since we are given an approximation and the last inequality since if we are given an optimal policy for , the exact same policy is also feasible for any instance, which has cost at least . ∎

#### 4.2.2 Reducing UDT to PB≤T

We proceed to show the reverse reduction, from UDT to PB≤T, thus showing that the problems are equivalent up to constant factors.

###### Theorem 4.4.

If there exists an -approximation for then there exists an -approximation for .

Given an instance of we construct an instance of by keeping the scenarios and probabilities the same and choosing , where is the cost of test . For every test , we construct a test box and we define so that each value for a standard box is an infinity chosen to match the same branching as . The costs of these boxes is the same cost as the corresponding test.

Next, we introduce isolating boxes for each scenario , we define isolating box satisfying and for all other scenarios . The cost of an isolating box is the minimum cost test needed to isolate from scenario where is the scenario that maximizes this quantity. Formally, if , then . Overall, the instance will have boxes and scenarios.

The policy for UDT is constructed by following the policy given by PB≤T, and ensuring that there are at most two scenarios that are not distinguished at every leaf of the policy tree. The full proof of the theorem is deferred to section B.2 of the Appendix.

## 5 Mixture of Product Distributions

In this section we switch gears and consider the case where we are given a mixture of product distributions. Observe that using the tool described in section 3.2, we can reduce this problem to PB≤T. This now is equivalent to the noisy version of ODT [GK17, JNN+19], where for a specific scenario, the result of each test is not deterministic and can take different values with different probabilities.

##### Comparison with previous work:

previous work on noisy decision trees either considers limited noise models, or has a runtime and approximation ratio that depend on the type of noise. For example, in the main result of [JNN+19], the noise outcomes are binary with equal probability. The authors mention that it is possible to extend this result in the following ways:

• to probabilities within , incurring an extra factor in the approximation

• to non-binary noise outcomes, incurring an extra at most factor in the approximation

Additionally, their algorithm works by expanding the scenarios for every possible noise outcome (e.g. to for binary noise). In our work the number of noisy outcomes does not affect the number of scenarios whatsoever.

In our work, we obtain a constant approximation factor that does not depend in any way on the type of the noise. Additionally, the outcomes of the noisy tests can be arbitrary, and do not affect either the approximation factor or the runtime. We only require a separability condition to hold: the distributions either differ enough or are exactly the same. Formally, we require that for any two scenarios and for every box , the distributions and satisfy , where is the total variation distance of distributions and .
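As a minimal sketch of the separability condition, the following code checks, for finite-support marginals, that every pair of scenarios is either identical on a box or at TV distance at least ε. The representation of a distribution as a `dict` from value to probability, the helper names, and the numerical tolerance `tol` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List

Dist = Dict[float, float]  # value -> probability (assumed finite support)


def tv_distance(p: Dist, q: Dist) -> float:
    """Total variation distance between two finite-support distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def is_separable(marginals: List[List[Dist]], eps: float,
                 tol: float = 1e-12) -> bool:
    """marginals[s][k] is the distribution of box k under scenario s.

    Returns True if, for every box and every pair of scenarios, the two
    marginals are either (numerically) identical or at TV distance >= eps.
    """
    n_boxes = len(marginals[0])
    for s, t in combinations(range(len(marginals)), 2):
        for k in range(n_boxes):
            d = tv_distance(marginals[s][k], marginals[t][k])
            if tol < d < eps:
                return False
    return True


marginals = [
    [{0: 0.5, 1: 0.5}, {0: 1.0}],  # scenario 0
    [{0: 0.9, 1: 0.1}, {0: 1.0}],  # scenario 1
]
print(is_separable(marginals, eps=0.3))  # box 0 is at distance 0.4, box 1 at 0
print(is_separable(marginals, eps=0.5))  # 0.4 falls strictly between 0 and 0.5
```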

### 5.1 A DP Algorithm for noisy PB≤T

We move on to designing a dynamic programming algorithm to solve the problem, in the case of a mixture of product distributions. The guarantees of our dynamic programming algorithm are given in the following theorem.

###### Theorem 5.1.

For any , let and be the policies produced by Algorithm described by Equation (2) and the optimal policy respectively and . Then it holds that

$$c(\pi_{\mathrm{DP}}) \le (1+\beta)\, c(\pi^*),$$

and the DP runs in time , where is the number of boxes and is the minimum cost box.

Using the reduction described in section 3.2 and the previous theorem we can get a constant-approximation algorithm for the initial problem given a mixture of product distributions. Observe that in the reduction, for every instance of it runs, the chosen threshold satisfies that where is the optimal policy for the threshold . The inequality holds since the algorithm for the threshold is a approximation and it covers of the scenarios left (i.e. pays for the rest). This is formalized in the following corollary.

###### Corollary 5.1.1.

Given an instance of on scenarios, and the DP algorithm described in Equation (2), then using Algorithm 1 we obtain an -approximation algorithm for that runs in .

Observe that the naive DP, which keeps track of all the boxes and possible outcomes, requires space exponential in the number of boxes, which can be very large. In our DP, we exploit the separability property of the distributions by dividing the boxes into two different types based on a given set of scenarios. Informally, the informative boxes help us distinguish between two scenarios by giving us enough TV distance, while the non-informative ones always have zero TV distance. The formal definition follows.

###### Definition 5.2 (Informative and non-informative boxes).

Let be a set of scenarios. Then we call a box informative if there exist such that

$$\left| D_{k s_i} - D_{k s_j} \right| \ge \varepsilon.$$

We denote the set of all informative boxes by . Similarly, the boxes for which the above does not hold are called non-informative and the set of these boxes is denoted by .
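Definition 5.2 can be sketched in code as follows. The sketch assumes finite-support marginals stored as `dict` objects and uses the hypothetical names `split_boxes` and `alive` (the set of scenarios under consideration). It illustrates that the classification of a box depends on which scenarios are still in the set.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Dist = Dict[float, float]  # value -> probability (assumed finite support)


def tv_distance(p: Dist, q: Dist) -> float:
    """Total variation distance between two finite-support distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def split_boxes(marginals: List[List[Dist]], alive: Set[int],
                eps: float) -> Tuple[List[int], List[int]]:
    """Partition boxes into (informative, non_informative) w.r.t. `alive`.

    marginals[s][k] is the distribution of box k under scenario s.
    A box is informative if some pair of scenarios in `alive` has
    marginals at TV distance at least eps on that box.
    """
    informative: List[int] = []
    non_informative: List[int] = []
    for k in range(len(marginals[0])):
        pairs = combinations(sorted(alive), 2)
        if any(tv_distance(marginals[s][k], marginals[t][k]) >= eps
               for s, t in pairs):
            informative.append(k)
        else:
            non_informative.append(k)
    return informative, non_informative


marginals = [
    [{0: 0.5, 1: 0.5}, {0: 1.0}],  # scenario 0
    [{0: 0.5, 1: 0.5}, {0: 1.0}],  # scenario 1 (same marginals as scenario 0)
    [{0: 0.9, 1: 0.1}, {0: 1.0}],  # scenario 2 differs on box 0
]
print(split_boxes(marginals, {0, 1, 2}, eps=0.3))  # ([0], [1])
print(split_boxes(marginals, {0, 1}, eps=0.3))     # ([], [0, 1])
```

Once scenario 2 is removed from the set, box 0 stops being informative, which is why the definition is stated relative to a set of scenarios.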

##### Recursive calls of the DP:

Our dynamic program chooses at every step one of the following options:

1. open an informative box: this step contributes towards eliminating improbable scenarios. From the definition of informative boxes, every time such a box is opened, it gives TV distance at least ε between at least two scenarios, making one of them more probable than the other. We show (Lemma 5.3) that it takes a finite number of these boxes to decide, with high probability, which scenario is the one realized (i.e. eliminating all but one scenario).

2. open a non-informative box: this is a greedy step; the best non-informative box to open next is the one that maximizes the probability of finding a value smaller than . Given a set of scenarios that are not yet eliminated, there is a unique next non-informative box which is best. We denote by the function that returns this next best non-informative box. Observe that the non-informative boxes do not affect the greedy ordering of which is the next best, since they do not affect which scenarios are eliminated.
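The greedy step for non-informative boxes can be sketched as below. The sketch takes the criterion literally as stated (maximize the probability of a value below the threshold), writes the threshold as a parameter `threshold`, and breaks ties by the smallest box index. The tie-breaking rule and all names are assumptions for illustration. Since a non-informative box has the same marginal under every surviving scenario, any surviving scenario can serve as the reference.

```python
from typing import Dict, Iterable, List, Optional, Set

Dist = Dict[float, float]  # value -> probability (assumed finite support)


def prob_below(dist: Dist, threshold: float) -> float:
    """Probability that a draw from `dist` is strictly below `threshold`."""
    return sum(p for v, p in dist.items() if v < threshold)


def next_non_informative(marginals: List[List[Dist]], alive: Set[int],
                         non_informative: Iterable[int], opened: Set[int],
                         threshold: float) -> Optional[int]:
    """Unopened non-informative box most likely to reveal a value < threshold.

    marginals[s][k] is the distribution of box k under scenario s.
    Returns None when every non-informative box has already been opened.
    """
    ref = min(alive)  # any surviving scenario works for non-informative boxes
    candidates = [k for k in non_informative if k not in opened]
    if not candidates:
        return None
    # Highest success probability first; ties go to the smallest index.
    return max(candidates,
               key=lambda k: (prob_below(marginals[ref][k], threshold), -k))


marginals = [[{1: 0.2, 10: 0.8}, {2: 0.6, 10: 0.4}, {3: 0.9, 10: 0.1}]]
print(next_non_informative(marginals, {0}, [0, 1, 2], set(), 5))  # 2
print(next_non_informative(marginals, {0}, [0, 1, 2], {2}, 5))    # 1
```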

##### State space of the DP:

the DP keeps track of the following three quantities:

1. a list which consists of sets of informative boxes opened and numbers of non-informative ones opened in between the sets of informative ones. Specifically, has the following form: where is a set of informative boxes, and is the number of non-informative boxes opened exactly after the boxes in set . (Footnote 5: if for are boxes, the list looks like this: .) We also denote by the informative boxes in the list .

In order to update at every recursive call, we either append a new informative box opened (denoted by ) or, when a non-informative box is opened, we add at the end, denoted by .

2. a list of tuples of integers , one for each pair of distinct scenarios with . The number keeps track of the number of informative boxes between and for which the value discovered had higher probability for scenario , and the number is the total number of informative boxes opened for scenarios and . Every time an informative box is opened, we increase the variables for the scenarios for which the box was informative and add to the if the value discovered had higher probability in . When a non-informative box is opened, the list remains the same. We denote this update by .

3. a list of the scenarios not yet eliminated. Every time an informative test is performed, and the list updated, if for some scenario there exists another scenario such that and then is removed from , otherwise is removed (this is the process of elimination in the proof of Lemma 5.3). This update is denoted by .
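The bookkeeping of the second quantity can be sketched as follows. Each pair of scenarios `(i, j)` with `i < j` stores a tuple `(wins_i, total)`: the number of informative boxes whose revealed value was more likely under `i`, and the number of boxes informative for the pair opened so far. The dictionary layout and names are hypothetical, and the elimination rule of the third quantity is deliberately not implemented, since it relies on the thresholds from the proof of Lemma 5.3.

```python
from typing import Dict, List, Tuple

Dist = Dict[float, float]                       # value -> probability
Counts = Dict[Tuple[int, int], Tuple[int, int]]  # (i, j) -> (wins_i, total)


def tv_distance(p: Dist, q: Dist) -> float:
    """Total variation distance between two finite-support distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def update_counts(counts: Counts, marginals: List[List[Dist]],
                  box: int, value: float, eps: float) -> Counts:
    """Return the counters after opening `box` and observing `value`.

    Only pairs for which the box is informative (TV distance >= eps) change;
    a non-informative box leaves the counters untouched.
    """
    new_counts = dict(counts)
    for (i, j), (wins_i, total) in counts.items():
        p_i, p_j = marginals[i][box], marginals[j][box]
        if tv_distance(p_i, p_j) >= eps:
            total += 1
            # The observation supports scenario i if it is likelier under i.
            if p_i.get(value, 0.0) > p_j.get(value, 0.0):
                wins_i += 1
            new_counts[(i, j)] = (wins_i, total)
    return new_counts


marginals = [
    [{0: 0.9, 1: 0.1}, {0: 1.0}],  # scenario 0
    [{0: 0.1, 1: 0.9}, {0: 1.0}],  # scenario 1
]
counts = {(0, 1): (0, 0)}
counts = update_counts(counts, marginals, box=0, value=0, eps=0.5)
counts = update_counts(counts, marginals, box=0, value=1, eps=0.5)
counts = update_counts(counts, marginals, box=1, value=0, eps=0.5)
print(counts)  # {(0, 1): (1, 2)}
```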

##### Base cases:

if a value below is found, the algorithm stops. The other base case is when , which means that the realized scenario is identified; we then either take the outside option or search the boxes for a value below , whichever is cheaper. If the scenario is identified correctly, the DP finds the expected optimal for this scenario. We later show that we make a mistake only with low probability, thus increasing the cost only by a constant factor. We denote by the “nature’s” move, where the value in the box we chose is realized, and is the minimum value obtained by opening boxes. The recursive formula is shown below.

 (2)

The final solution is , where is a list of tuples of the form , and in order to update we set .

###### Lemma 5.3.

Let be any two scenarios. Then after opening informative boxes, we can eliminate one scenario with probability at least .

We defer the proof of this lemma and Theorem 5.1 to section C of the Appendix.

## References

• [ASW16] M. Adamczyk, M. Sviridenko, and J. Ward (2016) Submodular stochastic probing on matroids. Math. Oper. Res. 41 (3), pp. 1022–1038.
• [AH12] M. Adler and B. Heeringa (2012) Approximating optimal binary decision trees. Algorithmica 62 (3-4), pp. 1112–1121.
• [BK19] H. Beyhaghi and R. Kleinberg (2019) Pandora’s problem with nonobligatory inspection. In Proceedings of the 2019 ACM Conference on Economics and Computation, EC 2019, Phoenix, AZ, USA, June 24-28, 2019, A. Karlin, N. Immorlica, and R. Johari (Eds.), pp. 131–132.
• [BFL+20] S. Boodaghians, F. Fusco, P. Lazos, and S. Leonardi (2020) Pandora’s box problem with order constraints. In EC ’20: The 21st ACM Conference on Economics and Computation, Virtual Event, Hungary, July 13-17, 2020, P. Biró, J. D. Hartline, M. Ostrovsky, and A. D. Procaccia (Eds.), pp. 439–458.
• [CPR+11] V. T. Chakaravarthy, V. Pandit, S. Roy, P. Awasthi, and M. K. Mohania (2011) Decision trees for entity identification: approximation algorithms and hardness results. ACM Trans. Algorithms 7 (2), pp. 15:1–15:22.
• [CPR+09] V. T. Chakaravarthy, V. Pandit, S. Roy, and Y. Sabharwal (2009) Approximating decision trees with multiway branches. In Automata, Languages and Programming, 36th International Colloquium, ICALP 2009, Rhodes, Greece, July 5-12, 2009, Proceedings, Part I, S. Albers, A. Marchetti-Spaccamela, Y. Matias, S. E. Nikoletseas, and W. Thomas (Eds.), Lecture Notes in Computer Science, Vol. 5555, pp. 210–221.
• [CFG+00] M. Charikar, R. Fagin, V. Guruswami, J. M. Kleinberg, P. Raghavan, and A. Sahai (2000) Query strategies for priced information (extended abstract). In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, May 21-23, 2000, Portland, OR, USA, pp. 582–591.
• [CGT+20] S. Chawla, E. Gergatsouli, Y. Teng, C. Tzamos, and R. Zhang (2020) Pandora’s box with correlations: learning and approximation. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pp. 1214–1225.
• [CHK+15a] Y. Chen, S. H. Hassani, A. Karbasi, and A. Krause (2015) Sequential information maximization: when is greedy near-optimal?. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pp. 338–363.
• [CJK+15b] Y. Chen, S. Javdani, A. Karbasi, J. A. Bagnell, S. S. Srinivasa, and A. Krause (2015) Submodular surrogates for value of information. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA.
• [CJL+10] F. Cicalese, T. Jacobs, E. S. Laber, and M. Molinaro (2010) On greedy algorithms for decision trees. In Algorithms and Computation - 21st International Symposium, ISAAC 2010, Jeju Island, Korea, December 15-17, 2010, Proceedings, Part II, O. Cheong, K. Chwa, and K. Park (Eds.), Lecture Notes in Computer Science, Vol. 6507, pp. 206–217.
• [DAS04] S. Dasgupta (2004) Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pp. 337–344.
• [DOV18] L. Doval (2018) Whether or not to open Pandora’s box. J. Econ. Theory 175, pp. 127–158.
• [EHL+19] H. Esfandiari, M. T. Hajiaghayi, B. Lucier, and M. Mitzenmacher (2019) Online Pandora’s boxes and bandits. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 1885–1892.
• [GG74] M. R. Garey and R. L. Graham (1974) Performance bounds on the splitting algorithm for binary testing. Acta Informatica 3, pp. 347–355.
• [GJ74] J.C. Gittins and D.M. Jones (1974) A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pp. 241–266.
• [GGM06] A. Goel, S. Guha, and K. Munagala (2006) Asking the right questions: model-driven optimization using probes. In Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 26-28, 2006, Chicago, Illinois, USA, pp. 203–212.
• [GKR10] D. Golovin, A. Krause, and D. Ray (2010) Near-optimal Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 766–774.
• [GK11] D. Golovin and A. Krause (2011) Adaptive submodularity: theory and applications in active learning and stochastic optimization. J. Artif. Intell. Res. 42, pp. 427–486.
• [GK17] D. Golovin and A. Krause (2017) Adaptive submodularity: A new approach to active learning and stochastic optimization. CoRR abs/1003.3967.
• [GB09] A. Guillory and J. A. Bilmes (2009) Average-case active learning with costs. In Algorithmic Learning Theory, 20th International Conference, ALT 2009, Porto, Portugal, October 3-5, 2009. Proceedings, R. Gavaldà, G. Lugosi, T. Zeugmann, and S. Zilles (Eds.), Lecture Notes in Computer Science, Vol. 5809, pp. 141–155.
• [GJS+19] A. Gupta, H. Jiang, Z. Scully, and S. Singla (2019) The Markovian price of information. In Integer Programming and Combinatorial Optimization - 20th International Conference, IPCO 2019, Ann Arbor, MI, USA, May 22-24, 2019, Proceedings, pp. 233–246.
• [GK01] A. Gupta and A. Kumar (2001) Sorting and selection with structured costs. In 42nd Annual Symposium on Foundations of Computer Science, FOCS 2001, 14-17 October 2001, Las Vegas, Nevada, USA, pp. 416–425.
• [GNR17a] A. Gupta, V. Nagarajan, and R. Ravi (2017) Approximation algorithms for optimal decision trees and adaptive TSP problems. Math. Oper. Res. 42 (3), pp. 876–896.
• [GNS16] A. Gupta, V. Nagarajan, and S. Singla (2016) Algorithms and adaptivity gaps for stochastic probing. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pp. 1731–1747.
• [GNS17b] A. Gupta, V. Nagarajan, and S. Singla (2017) Adaptivity gaps for stochastic probing: submodular and XOS functions. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pp. 1688–1702.
• [GN13] A. Gupta and V. Nagarajan (2013) A stochastic probing problem with applications. In Integer Programming and Combinatorial Optimization - 16th International Conference, IPCO 2013, Valparaíso, Chile, March 18-20, 2013. Proceedings, pp. 205–216.
• [HR76] L. Hyafil and R. L. Rivest (1976) Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett. 5 (1), pp. 15–17.
• [INv16] S. Im, V. Nagarajan, and R. van der Zwaan (2016) Minimum latency submodular cover. ACM Trans. Algorithms 13 (1), pp. 13:1–13:28.
• [JNN+19] S. Jia, V. Nagarajan, F. Navidi, and R. Ravi (2019) Optimal decision tree with noisy outcomes. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 3298–3308.
• [KNN17] P. Kambadur, V. Nagarajan, and F. Navidi (2017) Adaptive submodular ranking. In Integer Programming and Combinatorial Optimization - 19th International Conference, IPCO 2017, Waterloo, ON, Canada, June 26-28, 2017, Proceedings, pp. 317–329.
• [KPB99] S. R. Kosaraju, T. M. Przytycka, and R. S. Borgstrom (1999) On an optimal split tree problem. In Algorithms and Data Structures, 6th International Workshop, WADS ’99, Vancouver, British Columbia, Canada, August 11-14, 1999, Proceedings, F. K. H. A. Dehne, A. Gupta, J. Sack, and R. Tamassia (Eds.), Lecture Notes in Computer Science, Vol. 1663, pp. 157–168.
• [LLM20] R. Li, P. Liang, and S. Mussmann (2020) A tight analysis of greedy yields subexponential time approximation for uniform decision tree. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, S. Chawla (Ed.), pp. 102–121.
• [LPR+08] Z. Liu, S. Parthasarathy, A. Ranganathan, and H. Yang (2008) Near-optimal algorithms for shared filter evaluation in data stream systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, J. T. Wang (Ed.), pp. 133–146.
• [LOV85] D. W. Loveland (1985) Performance bounds for binary testing with arbitrary weights. Acta Informatica 22 (1), pp. 101–114.
• [NS17] F. Nan and V. Saligrama (2017) Comments on the proof of adaptive stochastic set cover based on adaptive submodularity and its implications for the group identification problem in “group-based active query selection for rapid diagnosis in time-critical situations”. IEEE Trans. Inf. Theory 63 (11), pp. 7612–7614.
• [PD92] K. R. Pattipati and M. Dontamsetty (1992) On a generalized test sequencing problem. IEEE Trans. Syst. Man Cybern. 22 (2), pp. 392–396.
• [PKS+02] V. Podgorelec, P. Kokol, B. Stiglic, and I. Rozman (2002-11) Decision trees: an overview and their use in medicine. Journal of medical systems 26, pp. 445–63.
• [SIN18] S. Singla (2018) The price of information in combinatorial optimization. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp. 2523–2532.
• [WEI79] M. L. Weitzman (1979-05) Optimal Search for the Best Alternative. Econometrica 47 (3), pp. 641–654.

## Appendix A Proofs from section 3

### A.1 Proofs from subsection 3.1

In order to prove the result, we use the following two key lemmas. In Lemma A.1 we show that the optimal solution for the transformed instance of is not much higher than the optimal for initial problem. In Lemma A.2 we show how to obtain a policy for the initial problem with values, given a policy for the problem with a threshold.

###### Lemma A.1.

Given the instance of and the instance of it holds that

$$c(\pi^*_{I'}) \le 2\, c(\pi^*_{I}).$$

###### Proof of Lemma A.1.

We show that given an optimal policy for , we can construct a feasible policy for such that . We construct the policy by opening the same boxes as and finally opening the corresponding “values” box, in order to find the and stop.

Fix any scenario , and let the smallest values box opened for be box which gives value . Since is opened, in the instance we open box , and from the construction of we have that . Since on every branch we open a box with values ( opens at least one box), we get that is a feasible policy for . For scenario , we have that the cost of is

$$c(\pi(i)) = \min_{k \in \pi(i)} V_{i,k} + \sum_{k \in \pi(i)} c_k$$

while in the minimum value is and there is the additional cost of the “values” box. Formally, the cost of is

$$c(\pi'(i)) = 0 + \sum_{k \in \pi(i)} c_k + c_{(j, V_{i,j})} = \min_{k \in \pi(i)} V_{i,k} + \sum_{k \in \pi(i)} c_k + c_j = c(\pi(i)) + c_j$$

Since appears in the cost of , we know that . Thus, , which implies that for our feasible policy . Observing that for any policy, gives the lemma. ∎
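A small numerical check of the per-scenario bound may help here. The sketch below computes the two costs from the displayed formulas for a single scenario, assuming the box `j` is the opened box with the smallest value (as in the proof) and that all costs and values are nonnegative. The function names and the toy numbers are illustrative assumptions.

```python
from typing import Dict, Iterable


def original_cost(values: Dict[int, float], costs: Dict[int, float],
                  opened: Iterable[int]) -> float:
    """Smallest revealed value plus the opening costs of all opened boxes."""
    opened = list(opened)
    return min(values[k] for k in opened) + sum(costs[k] for k in opened)


def transformed_cost(values: Dict[int, float], costs: Dict[int, float],
                     opened: Iterable[int]) -> float:
    """Cost in the transformed instance: the original cost plus c_j,
    where j is the opened box with the smallest value."""
    opened = list(opened)
    j = min(opened, key=lambda k: values[k])
    return original_cost(values, costs, opened) + costs[j]


values = {0: 7.0, 1: 3.0, 2: 9.0}  # realized values for one scenario
costs = {0: 1.0, 1: 2.0, 2: 0.5}   # opening costs
opened = [0, 1]

c = original_cost(values, costs, opened)         # 3.0 + (1.0 + 2.0) = 6.0
c_prime = transformed_cost(values, costs, opened)  # 6.0 + 2.0 = 8.0
print(c, c_prime, c_prime <= 2 * c)
```

Because box `j` is among the opened boxes, its cost is already part of the original cost, so adding it once more can at most double the total.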

###### Lemma A.2.

Given a policy for the instance of , there exists a feasible policy for the instance of of no larger expected cost. Furthermore, any branch of can be constructed from in polynomial time.

###### Proof of Lemma A.2.

We construct a policy for using the policy . Fix some branch of . If opens a box , our policy opens the same box. When opens a “final” box , our policy opens box if it has not been opened already. There are two cases to consider depending on where the “final” box is opened.

1. “Final” box is at a leaf of : since has finite expected cost and this is the first “final” box we encountered, the result must be , therefore for policy the values will be , by definition of . Observe that in this case, since the (at most) extra paid by for the value term, has already been paid by the box cost in when box was opened.

2. “Final” box is at an intermediate node of : after opens box , we copy the subtree of that follows the branch into the branch of that follows the branch, and the subtree of that follows the branch into each branch that has a value different from (the non- branches). The cost of this new subtree is improved as the root of the original subtree had cost and now it has cost . The branch may accrue an additional cost of or smaller if was not the smallest values box on this branch of the tree, so in total the branch has cost at most that originally. However, the non- branches have a term removed going down the tree. Specifically, since the feedback of down the non- branch was , some other box with values had to be opened at some point, and this box is still available to be used as the final values for this branch later on (since if this branch already had a , it would have stopped). Thus, the cost of this subtree is at most that originally, and has one fewer “final” box opened.

Thus, we construct a policy for with cost at most that of . Also, computing a branch for is as simple as following the corresponding branch of , opening box instead of box and remembering the feedback to know which boxes of to open in the future. Hence, we can compute a branch of from in polynomial time. ∎

###### Proof of Theorem 3.1.

Suppose we have an -approximation for . Given an instance of , we construct the instance for as described before and then run the approximation algorithm on to get a policy . Next, we prune the tree as described in Lemma A.2 to get a policy of no worse cost. Also, our policy will use time at most polynomially more than the policy for , since each branch of can be computed in polynomial time from . Hence, the runtime is polynomial in the size of . We also note that we added at most total “final” boxes to construct our new instance , and so this algorithm will run in polynomial time in and . Then, by Lemma A.2 and Lemma A.1 we know the cost of the constructed policy is

$$c(\pi) \le c(\pi') \le \alpha\, c(\pi^*_{I'}) \le 2\alpha\, c(\pi^*_{I})$$

Hence, this algorithm is a 2α-approximation for .

### A.2 Proofs from subsection 3.2

###### Proof of Lemma 3.5.

Consider a policy , that for of the scenarios chooses the outside option and for the rest of the scenarios it runs the policy , then we get

$$c(\pi^*_{I_q}) \le c(\pi_{I_q}) = \frac{T}{10\alpha} + \sum_{\substack{c_s \in [t_q,\, t_q/(10\alpha)] \\ s \in S}} c_s \frac{p_s}{q}, \tag{3}$$

On the other hand since is an -approximation to the optimal we have that

$$c(\pi_{I_q}) \le \alpha\, c(\pi^*_{I_q}) \le \frac{T}{5}$$

where for the last inequality we put together the assumption from our lemma, , (3) and that pays at least for the scenarios in . Since the expected cost of is at most using Markov’s inequality, we get that . Therefore, covers at least mass every time. ∎

###### Proof of Lemma 3.6.

In every interval of the form the optimal policy for covers at least of the probability mass that remains. Since the values belong in the interval in phase , it follows that the minimum possible value that the optimal policy might pay is , i.e. the lower end of the interval. Summing up for all intervals, we get the lemma. ∎

## Appendix B Proofs from section 4

### B.1 Proofs from section 4.2.1

In order to show Theorem 4.3 we first need the following lemma that bounds the cost of a scenario that takes at least one outside option test in the transformed instance of .

###### Lemma B.1.

Let be an instance of , and the instance of constructed by the reduction of Theorem 4.3. For a scenario with probability , if there is at least one outside option test run in , then .

###### Proof of Lemma B.1.

Denote by the box tests run before there were total outside option tests run on the branch of corresponding to scenario , and similarly denote by the total number of outside option tests on