# The Markovian Price of Information

Suppose there are n Markov chains and we need to pay a per-step price to advance them. The "destination" states of the Markov chains contain rewards; however, we can only get rewards for a subset of them that satisfy a combinatorial constraint, e.g., at most k of them, or those forming an acyclic subgraph of an underlying graph. What strategy should we choose to advance the Markov chains if our goal is to maximize the total reward minus the total price that we pay? In this paper we introduce a Markovian price of information model to capture settings such as the above, where the input parameters of a combinatorial optimization problem are given via Markov chains. We design optimal/approximation algorithms that jointly optimize the value of the combinatorial problem and the total paid price. We also study the robustness of our algorithms to errors in the model parameters, and show how to handle an additional commitment constraint. Our work brings together two classical lines of investigation: getting optimal strategies for Markovian multi-armed bandits, and getting exact and approximation algorithms for discrete optimization problems using combinatorial as well as linear-programming relaxation ideas.


## 1 Introduction

Suppose we are running an oil company and are deciding where to set up new drilling operations. There are several candidate sites, but the value of drilling each site is a random variable. We must therefore inspect sites before drilling. Each inspection gives more information about a site's value, but the inspection process is costly. Based on laws, geography, or availability of equipment, there are constraints on which sets of drilling sites are feasible. We ask:

What adaptive inspection strategy should we adopt to find a feasible set of sites to drill which maximizes, in expectation, the value of the chosen (drilled) sites minus the total inspection cost of all sites?

Let us consider the optimization challenges in this problem:

1. Even if we could fully inspect each site for free, choosing the best feasible set of sites is a combinatorial optimization problem.

2. Each site may have multiple stages of inspection. The costs and possible outcomes of later stages may depend on the outcomes of earlier stages. We use a Markov chain for each site to model how our knowledge about the value of the site stochastically evolves with each inspection.

3. Since a site's Markov chain model may not exactly match reality, we want a robust strategy that performs well even under small changes in the model parameters.

4. If there is competition among several companies, it may not be possible to do a few stages of inspection at a given site, abandon that site's inspection to inspect other sites, and then later return to further inspect the first site. In this case the problem has additional "take it or leave it" or commitment constraints, which prevent interleaving inspection of multiple sites.

While each of the above aspects has been individually studied in the past, no prior work addresses all of them. In particular, aspects 1 and 2 have not been simultaneously studied before. In this work we advance the state of the art by solving the problems that combine aspects 1, 2, and 3, and aspects 1, 2, and 4.

To study aspects 1 and 2 together, in §2 we propose the Markovian Price of Information (Markovian PoI) model. The Markovian PoI model unifies prior models which address 1 or 2 alone. These prior models include those of Kleinberg et al. [33] and Singla [37], who study the combinatorial optimization aspect 1 in the so-called price of information model, in which each site has just a single stage of inspection; and those of Dumitriu et al. [17] and Kleinberg et al. [33, Appendix G], who consider the multiple-stage inspection aspect 2 for the problem of selecting just a single site.

Our main results show how to solve combinatorial optimization problems, including both maximization and minimization problems, in the Markovian PoI model. We give two methods of transforming classic algorithms, originally designed for the Free-Info (inspection is free) setting, into adaptive algorithms for the Markovian PoI setting. These adaptive algorithms respond dynamically to the random outcomes of inspection.


• In §3.3 we transform "greedy" α-approximation algorithms in the Free-Info setting into α-approximation adaptive algorithms in the Markovian PoI setting (Theorem 3.1). For example, this yields optimal algorithms for matroid optimization (Corollary 1).

• In §4 we show how to slightly modify our α-approximations for the Markovian PoI setting in Theorem 3.1 to make them robust to small changes in the model parameters (Theorem 4.1).

• In §5 we use online contention resolution schemes (OCRSs) [19] to transform LP-based Free-Info maximization algorithms into adaptive Markovian PoI algorithms while respecting the commitment constraints. Specifically, a $\frac{1}{\alpha}$-selectable OCRS yields an α-approximation with commitment (Theorem 5.1).

The general idea behind our first result (Theorem 3.1) is the following. A Frugal combinatorial algorithm (Definition 8) is, roughly speaking, “greedy”: it repeatedly selects the feasible item of greatest marginal value. We show how to adapt any Frugal algorithm to the Markovian PoI setting:


• Instead of using a fixed value for each item $i$, we use a time-varying "proxy" value that depends on the state of $i$'s Markov chain.

• Instead of immediately selecting the item $i$ of greatest marginal value, we advance $i$'s Markov chain one step.

The main difficulty lies in choosing each item's proxy value, for which simple heuristics can be suboptimal. We use a quantity for each state of each item's Markov chain called its grade, and an item's proxy value is its minimum grade so far. A state's grade is closely related to the Gittins index from the multi-armed bandit literature, which we discuss along with other related work in §6.

## 2 The Markovian Price of Information Model

To capture the evolution of our knowledge about an item’s value, we use the notion of a Markov system from [17] (who did not consider values at the destinations).

###### Definition 1 (Markov System)

A Markov system $S = (V, P, s, T, \pi, r)$ for an element consists of a discrete Markov chain with state space $V$, a transition matrix $P = \{p_{u,v}\}$ indexed by $V \times V$ (here $p_{u,v}$ is the probability of transitioning from $u$ to $v$), a starting state $s \in V$, a set of absorbing destination states $T \subseteq V$, a non-negative probing price $\pi_u$ for every state $u \in V$, and a value $r_t$ for each destination state $t \in T$. We assume that every state reaches some destination state.
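Definition 1 can be rendered as a small data structure. The following is a minimal sketch, assuming integer-indexed states; the class and the two-stage example instance are ours, not the paper's.

```python
# A minimal rendering of Definition 1, assuming integer-indexed states;
# the class and the two-stage example instance are ours, not the paper's.
from dataclasses import dataclass

@dataclass
class MarkovSystem:
    P: list                    # transition matrix: P[u][v] = Pr[u -> v]
    start: int                 # starting state s
    destinations: set          # absorbing destination states T
    price: dict                # non-negative probing price pi_u per state u
    value: dict                # value r_t for each destination state t

    def is_ready(self, state: int) -> bool:
        """An element is ready once its chain reaches a destination."""
        return state in self.destinations

# Two-stage inspection: a cheap survey (state 0), then a deeper test
# (state 1) that resolves to a low-value or high-value site.
site = MarkovSystem(
    P=[[0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]],
    start=0,
    destinations={2, 3},
    price={0: 1.0, 1: 4.0},
    value={2: 0.0, 3: 20.0},
)
```

The example models the drilling scenario from the introduction: a cheap survey (price 1) followed by a deeper test (price 4) that reveals a worthless or a high-value site, each with probability 1/2.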

We have a collection $J$ of $n$ ground elements, each associated with its own Markov system $S_i$. An element $i$ is ready if its Markov system has reached one of its absorbing destination states. For a ready element $i$, if $\omega_i$ is the (random) trajectory of its Markov chain then $d(\omega_i) \in T_i$ denotes its associated destination state. We now define the Markovian PoI game, which consists of an objective function $f$ on subsets of $J$.

###### Definition 2 (Markovian PoI Game)

Given a set of ground elements $J$, constraints $\mathcal{F} \subseteq 2^J$, an objective function $f$, and a Markov system $S_i = (V_i, P_i, s_i, T_i, \pi^i, r^i)$ for each element $i \in J$, the Markovian PoI game is the following. At each time step, we either advance a Markov system $S_i$ from its current state $u \in V_i$ by incurring price $\pi^i_u$, or we end the game by selecting a subset $\mathcal{I}$ of ready elements that is feasible—i.e., $\mathcal{I} \in \mathcal{F}$.

A common choice for $f$ is the additive objective $f(\mathcal{I}, r) = \sum_{i \in \mathcal{I}} r_i$.

Let $\omega = (\omega_1, \ldots, \omega_n)$ denote the trajectory profile for the Markovian PoI game: it consists of the random trajectories $\omega_i$ taken by all the Markov chains $S_i$ at the end of the game. To avoid confusion, we write the selected feasible solution as $\mathcal{I}(\omega) \in \mathcal{F}$. A utility/disutility optimization problem is to give a strategy for a Markovian PoI game while jointly optimizing the objective and the total paid price.

Utility Maximization (Util-Max): A Markovian PoI game where the constraints $\mathcal{F}$ are downward-closed (i.e., packing) and the values $r^i_t$ are non-negative for every $i \in J$ and $t \in T_i$ (so $r^i_t$ can be understood as a reward obtained for selecting $i$). The goal is to find a strategy ALG maximizing utility:

$$U^{\max}(\textsc{ALG}) \;\overset{\Delta}{=}\; \mathbb{E}_{\omega}\Bigg[\underbrace{f\big(\mathcal{I}(\omega), \{r^i_{d(\omega_i)}\}_{i \in \mathcal{I}(\omega)}\big)}_{\text{value}} \;-\; \underbrace{\sum_i \sum_{u \in \omega_i} \pi^i_u}_{\text{total price}}\Bigg]. \tag{1}$$

Since the empty set is always feasible, the optimum utility is non-negative.
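For intuition, Eq. (1) can be evaluated in closed form on a single two-stage chain like the running example (the function and parameter names are ours): the strategy probes to completion and selects the element iff its destination value is positive.

```python
# Evaluating Eq. (1) in closed form for one two-stage chain (our own
# illustration): the strategy pays both probing prices and selects the
# ready element only when its destination value is positive.
def expected_utility(price0, price1, p_high, v_high, v_low=0.0):
    total_price = price0 + price1                  # paid on every trajectory
    exp_value = p_high * max(v_high, 0.0) + (1 - p_high) * max(v_low, 0.0)
    return exp_value - total_price                 # value minus total price

u = expected_utility(price0=1.0, price1=4.0, p_high=0.5, v_high=20.0)
# 0.5 * 20 - (1 + 4) = 5.0
```

When the rewards are too small relative to the prices the utility of this naive strategy goes negative, in which case never probing (utility 0) is better, consistent with the remark that the optimum utility is non-negative.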

We also define a minimization variant of the problem that is useful to capture covering combinatorial problems such as minimum spanning trees and set cover.

Disutility Minimization (Disutil-Min): A Markovian PoI game where the constraints $\mathcal{F}$ are upward-closed (i.e., covering) and the values $r^i_t$ are non-negative for every $i \in J$ and $t \in T_i$ (so $r^i_t$ can be understood as a cost we pay for selecting $i$). The goal is to find a strategy ALG minimizing disutility:

$$U^{\min}(\textsc{ALG}) \;\overset{\Delta}{=}\; \mathbb{E}_{\omega}\Bigg[f\big(\mathcal{I}(\omega), \{r^i_{d(\omega_i)}\}_{i \in \mathcal{I}(\omega)}\big) \;+\; \sum_i \sum_{u \in \omega_i} \pi^i_u\Bigg].$$

We will assume that the function $f$ is non-negative whenever all of its value arguments are non-negative. Hence, the disutility of the optimal policy is non-negative.

In the special case where all the Markov chains for a Markovian PoI game are formed by a directed acyclic graph (Dag), we call the corresponding optimization problem Dag-Util-Max or Dag-Disutil-Min.

## 3 Adaptive Utility Maximization via Frugal Algorithms

Frugal algorithms, introduced in Singla [37], capture the intuitive notion of "greedy" algorithms. There are many known Frugal algorithms, e.g., optimal algorithms for matroids and approximation algorithms for matchings, vertex cover, and facility location. These Frugal algorithms were designed in the traditional free information (Free-Info) setting, where each ground element has a fixed value. Can we use them in the Markovian PoI world?

Our main contribution is a technique that adapts any Frugal algorithm to the Markovian PoI world, achieving the same approximation ratio as the original algorithm. The result applies to semiadditive objective functions $f$, which are those of the form $f(\mathcal{I}, y) = \sum_{i \in \mathcal{I}} y_i + h(\mathcal{I})$ for some function $h$ that depends only on the selected set $\mathcal{I}$.

###### Theorem 3.1

For a semiadditive objective function $f$, if there exists an $\alpha$-approximation Frugal algorithm for a Util-Max problem over some packing constraints $\mathcal{F}$ in the Free-Info world, then there exists an $\alpha$-approximation strategy for the corresponding Util-Max problem in the Markovian PoI world.

We prove an analogous result for Disutil-Min in §0.D. The following corollaries immediately follow from known Frugal algorithms [37].

###### Corollary 1

In the Markovian PoI world, we have:


• An optimal algorithm for both Util-Max and Disutil-Min for matroids.

• A 2-approx for Util-Max for matchings and a $k$-approx for a $k$-system.

• A $d$-approx for Disutil-Min for set-cover, where $d$ is the maximum number of sets in which a ground element is present.

• A 1.861-approx for Disutil-Min for facility location.

• A 3-approx for Disutil-Min for prize-collecting Steiner tree.

Before proving Theorem 3.1, we define a grade for every state in a Markov system in §3.1, much as in [17]. This grade is a variant of the popular Gittins index. In §3.2, we use the grade to define a prevailing cost and an epoch for a trajectory. In §3.3, we use these definitions to prove Theorem 3.1. We consider Util-Max throughout, but analogous definitions and arguments hold for Disutil-Min.

### 3.1 Grade of a State

To define the grade of a state in Markov system $S$, we consider the following Markov game called $\tau$-penalized $S$, denoted $S(\tau)$. Roughly, $S(\tau)$ is the same as $S$ but with a termination penalty, which is a constant $\tau \in \mathbb{R}$.

Suppose $u$ denotes the current state of $S$ in the game $S(\tau)$. In each move, the player has two choices: (a) Halt, which immediately ends the game, and (b) Play, which changes the state, price, and value as follows:


• If $u \notin T$, the player pays price $\pi_u$, the current state of $S$ changes according to the transition matrix $P$, and the game continues.

• If $u \in T$, then the player receives penalized value $r_u - \tau$, where $\tau$ is the aforementioned termination penalty, and the game ends.

The player wishes to maximize his utility, which is the expected value he obtains minus the expected price he pays. We write $U_\tau(u)$ for the utility attained by optimal play starting from state $u$.

The utility $U_\tau(u)$ is clearly non-increasing in the penalty $\tau$, and one can also show that it is continuous [17, Section 4]. For a sufficiently large penalty $\tau$, it is optimal to halt immediately, achieving $U_\tau(u) = 0$. In the opposite extreme $\tau \to -\infty$, it is optimal to play until the game ends, achieving $U_\tau(u) > 0$. Thus, as we increase $\tau$ from $-\infty$ to $\infty$, the utility $U_\tau(u)$ becomes $0$ at some critical value of $\tau$. This critical value, which depends on the state $u$, is the grade.

###### Definition 3 (Grade)

The grade of a state $u$ in Markov system $S$ is $\mathrm{grade}(u) \overset{\Delta}{=} \sup\{\tau \in \mathbb{R} : U_\tau(u) > 0\}$. For a Util-Max problem, we write the grade of a state $u$ in the Markov system $S_i$ corresponding to element $i$ as $\tau^i_u$.

The grade of a state is well-defined by the above discussion. We emphasize that it is independent of all other Markov systems. Put another way, the grade of a state is the penalty $\tau$ that makes the player indifferent between halting and playing. It is known how to compute the grade efficiently [17, Section 7].
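Because the penalized utility is non-increasing and continuous in $\tau$, the grade can be found by binary search on the penalty. The sketch below does this for a DAG chain by direct backward recursion; the toy instance and all names are ours, and [17, Section 7] gives genuinely efficient methods.

```python
# Computing the grade by binary search on the penalty tau, using backward
# recursion on a DAG chain. Toy instance and names are ours, not [17]'s.
def penalized_utility(P, price, value, dests, state, tau):
    """Optimal utility of the tau-penalized game started at `state`."""
    if state in dests:
        return max(0.0, value[state] - tau)       # halt vs. penalized value
    play = -price[state] + sum(
        p * penalized_utility(P, price, value, dests, v, tau)
        for v, p in enumerate(P[state]) if p > 0)
    return max(0.0, play)                         # halt vs. play on

def grade(P, price, value, dests, state, lo=-1e6, hi=1e6, iters=100):
    """Critical penalty at which optimal play from `state` breaks even."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if penalized_utility(P, price, value, dests, state, mid) > 0:
            lo = mid                              # penalty can still be raised
        else:
            hi = mid
    return (lo + hi) / 2

# Two-stage chain: 0 (survey, price 1) -> 1 (deep test, price 4)
# -> destinations 2 (value 0) or 3 (value 20), each w.p. 1/2.
P = [[0, 1, 0, 0], [0, 0, 0.5, 0.5], [0, 0, 1, 0], [0, 0, 0, 1]]
price = {0: 1.0, 1: 4.0}
value = {2: 0.0, 3: 20.0}
dests = {2, 3}
# grade at state 1 solves -4 + 0.5 * (20 - tau) = 0, i.e. tau = 12
```

On this instance the grade of the deep-test state is 12, of the start state 10, and of a destination state exactly its value, which is the key fact used in §3.3.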

### 3.2 Prevailing Cost and Epoch

We now define a prevailing cost [17] and an epoch. Roughly, the prevailing cost of Markov system $S_i$ is its minimum grade at any point in time.

###### Definition 4 (Prevailing Cost)

The prevailing cost of Markov system $S_i$ in a trajectory $\omega_i$ is $Y^{\max}_i(\omega_i) \overset{\Delta}{=} \min_{u \in \omega_i}\{\tau^i_u\}$. For a trajectory profile $\omega$, let $Y^{\max}(\omega) = (Y^{\max}_1(\omega_1), \ldots, Y^{\max}_n(\omega_n))$ denote the list of prevailing costs for each Markov system.

Put another way, the prevailing cost is the maximum termination penalty $\tau$ for the game $S_i(\tau)$ such that, at every state along $\omega_i$, the player does not want to halt.

Observe that the prevailing cost of a trajectory can only decrease as it extends further. In particular, it decreases whenever the Markov system reaches a state with grade smaller than each of the previously visited states. We can therefore view the prevailing cost as a non-increasing piecewise constant function of time. This motivates us to define an epoch.

###### Definition 5 (Epoch)

An epoch for a trajectory $\omega_i$ is any maximal continuous segment of $\omega_i$ where the prevailing cost does not change.

Since the grade can be computed efficiently, we can also compute the prevailing cost and epochs of a trajectory efficiently.
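Given precomputed grades, both quantities are simple to extract from a trajectory; a sketch with invented names:

```python
# Prevailing cost (Definition 4) and epochs (Definition 5) for a single
# trajectory, given precomputed grades; names are ours.
def prevailing_costs(trajectory, grade_of):
    """Running minimum of the grades seen along the trajectory."""
    costs, cur = [], float("inf")
    for state in trajectory:
        cur = min(cur, grade_of[state])
        costs.append(cur)
    return costs

def epochs(trajectory, grade_of):
    """Maximal segments on which the prevailing cost stays constant."""
    costs = prevailing_costs(trajectory, grade_of)
    segments, start = [], 0
    for t in range(1, len(costs)):
        if costs[t] != costs[t - 1]:
            segments.append(trajectory[start:t])
            start = t
    segments.append(trajectory[start:])
    return segments

grades = {"s0": 10, "s1": 12, "s2": 3}
# along ["s0", "s1", "s2"] the prevailing cost never rises: [10, 10, 3]
```

A new epoch begins exactly when the chain reaches a state whose grade undercuts every grade seen before, matching the observation above.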

### 3.3 Adaptive Algorithms for Utility Maximization

In this section, we prove Theorem 3.1, which adapts a Frugal algorithm in the Free-Info world into a probing strategy in the Markovian PoI world. The theorem concerns semiadditive functions, which are useful to capture non-additive objectives of problems like facility location and prize-collecting Steiner tree.

###### Definition 6 (Semiadditive Function [37])

A function $f : 2^J \times \mathbb{R}^n \to \mathbb{R}$ is semiadditive if there exists a function $h : 2^J \to \mathbb{R}$ s.t. $f(\mathcal{I}, y) = \sum_{i \in \mathcal{I}} y_i + h(\mathcal{I})$.

All additive functions are semiadditive with $h(\mathcal{I}) = 0$ for all $\mathcal{I}$. To capture the facility location problem on a graph with metric $d$, clients $C$, and facility opening costs $y$, we can define $h(\mathcal{I}) = \sum_{j \in C} \min_{i \in \mathcal{I}} d(j, i)$. Notice $h$ only depends on the identity of the facilities and not on their opening costs.
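As a hedged illustration of this facility-location objective (the tiny instance and all names are invented), $h$ charges each client its distance to the nearest open facility:

```python
# Facility location as a semiadditive function: val(I, y) = opening
# costs of I plus h(I), where h(I) is the total client-to-nearest-open-
# facility distance and depends only on the identity of I. Instance and
# names are our own illustration.
def facility_location_val(I, opening_cost, dist, clients):
    if not I:
        return float("inf")              # every client must be served
    h = sum(min(dist[c][f] for f in I) for c in clients)
    return sum(opening_cost[f] for f in I) + h

dist = {"c1": {"f1": 1, "f2": 4}, "c2": {"f1": 3, "f2": 1}}
cost = {"f1": 2, "f2": 2}
# opening both: (2 + 2) + (1 + 1) = 6; opening only f1: 2 + (1 + 3) = 6
```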

The proof of Theorem 3.1 takes two steps. We first give a randomized reduction to upper bound the utility of the optimal strategy in the Markovian PoI world with the optimum of a surrogate problem in the Free-Info world. Then, we transform a Frugal algorithm into a strategy with utility close to this bound.

#### 3.3.1 Upper Bounding the Optimal Strategy Using a Surrogate.

The main idea in this section is to show that for Util-Max, no strategy (in particular, the optimal one) can derive more utility from an element than its prevailing cost. Here, the prevailing cost of element $i$ is $Y^{\max}_i(\omega_i)$ for a random trajectory $\omega_i$ to a destination state in Markov system $S_i$. Since the optimal strategy can only select a feasible set in $\mathcal{F}$, this idea naturally leads to the following Free-Info surrogate problem: imagine each element's value is exactly its (random) prevailing cost; the goal is to select a set feasible in $\mathcal{F}$ to maximize the total value. In Lemma 1, we show that the expected optimum value of this surrogate problem is an upper bound on the optimum utility for Util-Max. First, we formally define the surrogate problem.

###### Definition 7 (Surrogate Problem)

Given a Util-Max problem with semiadditive objective $f$ and packing constraints $\mathcal{F}$ over universe $J$, the corresponding surrogate problem over $J$ is the following. It consists of constraints $\mathcal{F}$ and a (random) objective function given by $\mathrm{val}(\mathcal{I}, Y^{\max}(\omega)) \overset{\Delta}{=} f(\mathcal{I}, Y^{\max}(\omega))$, where $Y^{\max}(\omega)$ denotes the prevailing costs over a random trajectory profile $\omega$ consisting of independent random trajectories for each element to a destination state. The goal is to select $\mathcal{I} \in \mathcal{F}$ to maximize $\mathrm{val}(\mathcal{I}, Y^{\max}(\omega))$.

Let $\textsc{SUR}(\omega)$ denote the optimum value of the surrogate problem for trajectory profile $\omega$. We now upper bound the optimum utility in the Markovian PoI world. Our proof borrows ideas from the "prevailing reward argument" in [17].

###### Lemma 1

For a Util-Max problem with objective $f$ and packing constraints $\mathcal{F}$, let OPT denote the utility of the optimal strategy. Then,

$$\textsc{OPT} \;\leq\; \mathbb{E}_{\omega}[\textsc{SUR}(\omega)] \;=\; \mathbb{E}_{\omega}\Big[\max_{\mathcal{I} \in \mathcal{F}}\big\{\mathrm{val}(\mathcal{I}, Y^{\max}(\omega))\big\}\Big],$$

where the expectation is over a random trajectory profile $\omega$ in which every Markov system reaches a destination state.

We prove Lemma 1 in §0.A.

#### 3.3.2 Designing an Adaptive Strategy Using a Frugal Algorithm.

A Frugal algorithm selects elements one-by-one and irrevocably. Besides greedy algorithms, its definition also captures “non-greedy” algorithms such as primal-dual algorithms that do not have the reverse-deletion step [37].

###### Definition 8 (Frugal Packing Algorithm)

For a combinatorial optimization problem on universe $J$ in the Free-Info world with packing constraints $\mathcal{F}$ and objective $f$, we say Algorithm $\mathcal{A}$ is Frugal if there exists a marginal-value function $g$, increasing in each element's value, for which the pseudocode of $\mathcal{A}$ is given by Algorithm 1. Note that this algorithm always returns a feasible solution if $\emptyset \in \mathcal{F}$.
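To make the adaptation concrete, here is a hedged sketch of the adapted greedy for the simplest Frugal setting (additive objective, cardinality-$k$ constraint); it is our own rendering, not the paper's Algorithm 1. Each element's proxy value is its prevailing cost, and since the grade of a destination state equals its value, a ready element is selected exactly when its capped value is the best remaining option.

```python
import random

def adaptive_greedy(systems, grades, k, rng=None):
    """Sketch of adapting the simplest Frugal greedy (additive objective,
    cardinality-k constraint) to Markovian PoI. Each element's proxy value
    is its prevailing cost (minimum grade seen so far); we always advance
    the chain of the element with the largest positive proxy and select it
    once it is ready. All names here are ours, not the paper's pseudocode."""
    rng = rng or random.Random(0)
    state = {i: s["start"] for i, s in systems.items()}
    proxy = {i: grades[i][state[i]] for i in systems}
    picked, paid = [], 0.0
    while len(picked) < k:
        cand = [i for i in systems if i not in picked]
        i = max(cand, key=lambda j: proxy[j], default=None)
        if i is None or proxy[i] <= 0:
            break                            # probing further cannot help
        s = systems[i]
        if state[i] in s["dests"]:
            picked.append(i)                 # ready and best: select it
            continue
        paid += s["price"][state[i]]         # advance i's chain one step
        state[i] = rng.choices(range(len(s["P"])), weights=s["P"][state[i]])[0]
        proxy[i] = min(proxy[i], grades[i][state[i]])
    reward = sum(systems[i]["value"][state[i]] for i in picked)
    return picked, reward - paid

# Three single-stage "Pandora's box" chains (state 0 -> destination 1).
def box(price, value):
    return {"P": [[0, 1], [0, 1]], "start": 0, "dests": {1},
            "price": {0: price}, "value": {1: value}}

systems = {"a": box(1, 10), "b": box(2, 5), "c": box(1, 0.5)}
# For these deterministic boxes, the grade of the start state is
# value - price, and the grade of a destination state is its value.
grades = {i: {0: s["value"][1] - s["price"][0], 1: s["value"][1]}
          for i, s in systems.items()}
picked, utility = adaptive_greedy(systems, grades, k=2)
# picks "a" then "b": utility (10 + 5) - (1 + 2) = 12
```

On deterministic single-stage chains this reduces to probing boxes in decreasing order of value minus price while that quantity stays positive.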

The following lemma shows that a Frugal algorithm can be converted to a strategy with the same utility in the Markovian PoI world.

###### Lemma 2

Given a Frugal packing Algorithm $\mathcal{A}$, there exists an adaptive strategy for the corresponding Util-Max problem in the Markovian PoI world with utility at least $\mathbb{E}_{\omega}[\mathrm{val}(\mathcal{A}(Y^{\max}(\omega)), Y^{\max}(\omega))]$, where $\mathcal{A}(Y^{\max}(\omega))$ is the solution returned by $\mathcal{A}$ for objective $\mathrm{val}(\cdot\,, Y^{\max}(\omega))$.

We prove Lemma 2 in §0.B. Finally, we can prove Theorem 3.1.

###### Proof (Proof of Theorem 3.1)

From Lemma 2, the utility of our strategy is at least $\mathbb{E}_{\omega}[\mathrm{val}(\mathcal{A}(Y^{\max}(\omega)), Y^{\max}(\omega))]$. Since Algorithm $\mathcal{A}$ is an $\alpha$-approx algorithm in the Free-Info world, it follows that

$$\mathbb{E}_{\omega}\big[\mathrm{val}(\mathcal{A}(Y^{\max}(\omega)), Y^{\max}(\omega))\big] \;\geq\; \frac{1}{\alpha} \cdot \mathbb{E}_{\omega}\Big[\max_{\mathcal{I} \in \mathcal{F}}\big\{\mathrm{val}(\mathcal{I}, Y^{\max}(\omega))\big\}\Big].$$

Using the upper bound on the optimal utility from Lemma 1, the utility of our strategy is at least $\frac{1}{\alpha} \cdot \textsc{OPT}$.

In §0.D, a similar approach is used for the Disutil-Min problem with a semiadditive function. This shows that for both the Util-Max and the Disutil-Min problem with a semiadditive function, a Frugal algorithm can be transformed from the Free-Info to the Markovian PoI world while retaining its performance.

## 4 Robustness in Model Parameters

In practical applications, the parameters of the Markov systems (i.e., transition probabilities, values, and prices) are not known exactly but are estimated by statistical sampling. In this setting, the true parameters, which govern how each Markov system evolves, differ from the estimated parameters that the algorithm uses to make decisions. This raises a natural question: how well does an adapted Frugal algorithm do when the true and the estimated parameters differ? We would hope to design a robust algorithm, meaning small estimation errors cause only a small loss in the utility objective.

In the important special case where the Markov chain corresponding to each element is formed by a directed acyclic graph (Dag), an adaptation of our strategy in Theorem 3.1 is robust. This Dag assumption turns out to be necessary as similar results do not hold for general Markov chains (see Appendix 0.F.1). In particular, we prove the following generalization of Theorem 3.1 under the Dag assumption.

###### Theorem 4.1 (Informal statement)

If there exists an $\alpha$-approximation Frugal algorithm ($\alpha \geq 1$) for a packing problem with a semiadditive objective function, then it suffices to estimate the true model parameters of a Dag-Markovian PoI game within an additive error of $1/\mathrm{poly}$, where poly is some polynomial in the size of the input and $1/\epsilon$, to design a strategy with utility at least $\frac{1}{\alpha} \cdot \textsc{OPT} - \epsilon$, where OPT is the utility of the optimal policy that knows all the true model parameters.

Specifically, our strategy for Theorem 4.1 is obtained from the strategy in Theorem 3.1 by making use of the following idea: each time we advance an element’s Markov system, we slightly increase the estimated grade of every state in that Markov system. This ensures that whenever we advance a Markov system, we advance through an entire epoch and remain optimal in the “teasing game”.

Our analysis works roughly as follows. We first show that close estimates of the model parameters of a Markov system can be used to closely estimate the grade of each state. We can therefore assume that close estimates of all grades are given as input. Next, we define the "shifted" prevailing costs corresponding to the "shifted" grades. This allows us to equate the utility of our strategy with the utility of running the Frugal algorithm in a "modified" surrogate problem, where its input is the "shifted" prevailing costs instead of the true prevailing costs. Finally, we prove that the "shifted" prevailing costs are close to the real prevailing costs, and thus the "modified" surrogate problem is close to the surrogate problem. This allows us to bound the utility of running the Frugal algorithm in the "modified" surrogate problem against the optimal strategy for the surrogate problem. Combining with Lemma 1 finishes the proof of Theorem 4.1.

Similar arguments extend to prove the analogous result for Disutil-Min.

We formally state our main theorem and the parameters on which it depends in Section 4.1. Section 4.2 shows that close estimates of transition probabilities can be used to obtain close estimates of the grades. In Section 4.3, we use these estimated grades to transform a Frugal algorithm into a robust adaptive algorithm for Dag-Util-Max. Similar arguments can be used to obtain the corresponding results for Dag-Disutil-Min (we omit this proof).

### 4.1 Main Results and Assumptions

We first explicitly define the input size of Dag-Util-Max as follows.

(a) $n$ is the number of Markov systems.

(b) $k$ is the maximum number of elements in a feasible solution, i.e., $k = \max_{\mathcal{I} \in \mathcal{F}} |\mathcal{I}|$.

(c) $D$ is the maximum depth of any Dag Markov system.

Let $R$ denote an upper bound on all input prices and values, i.e., $\max_{i,u}\{\pi^i_u, |r^i_u|\} \leq R$. We make the following assumption.

###### Assumption 4.2

The upper bound $R$ is polynomial in $n$, $k$, and $D$.

Such an assumption turns out to be necessary (see Appendix 0.F.2). We now state our main theorem of this section.

###### Theorem 4.3

Consider a Dag-Util-Max problem with a semiadditive objective and satisfying Assumption 4.2. Suppose there exists an $\alpha$-approximation Frugal algorithm in the Free-Info world. If each input parameter is known to within an additive error of $1/\mathrm{poly}$, where poly is some polynomial in the size of the input and $1/\epsilon$, then there exists an adaptive algorithm with utility at least

$$\frac{1}{\alpha} \cdot \textsc{OPT} - \epsilon,$$

where OPT is the utility of the optimal policy that exactly knows the true input parameters.

To simplify the proof of Theorem 4.3, we also assume the following without loss of generality (see Appendix 0.F.3 for justifications).

(a) All non-zero transition probabilities are lower bounded by $1/\mathrm{poly}$, where poly is some polynomial in the size of the input and $1/\epsilon$.

(b) We know the prices and the rewards exactly, i.e., the only unknown input parameters are the transition probabilities.

### 4.2 Well-Estimated Input Parameters Imply Well-Estimated Grades

We call the set of Markov systems constructed using our estimated transition probabilities the estimated world. The $i$th Markov system in this estimated world is denoted by $\hat{S}_i = (V_i, \hat{P}_i, s_i, T_i, \pi^i, r^i)$, where $\hat{P}_i$ contains the estimated transition probabilities. Note that $\pi^i$ and $r^i$ are exact due to assumption (b) above. We estimate the grade of a state by simply computing the grade of that state in the estimated world. The following Lemma 3 bounds the error in the estimated grades in terms of the error in the transition probabilities.

###### Lemma 3

Consider the Dag-Util-Max problem satisfying the assumptions in Section 4.1. Suppose all transition probabilities are estimated to within an additive error of $\delta$. Then, for every state $u$ of every Markov system $S_i$, the estimated grade $\hat{\tau}^i_u$ is within an additive factor $\epsilon'$ of the real grade $\tau^i_u$, where $\epsilon'$ is polynomial in $\delta$ and the input parameters.

###### Proof

We show below that $\hat{\tau}^i_u \leq \tau^i_u + \epsilon'$. A symmetrical argument shows $\hat{\tau}^i_u \geq \tau^i_u - \epsilon'$, which finishes the proof of this lemma.

We consider the Markov game $\hat{S}_i(\hat{\tau}^i_u)$ defined in Section 3.1, played in the estimated world. By definition, there exists an optimal policy Pol that advances the chain at least one more step and achieves an expected utility of 0. Also consider the Markov game in the real world and apply Pol to it. Notice that Pol might be sub-optimal in the real world and might therefore obtain a negative expected utility. Let $\tau^{\mathrm{fair}}$ be the penalty in the real world at which Pol obtains an expected utility of 0. It follows that $\tau^i_u \geq \tau^{\mathrm{fair}}$. It therefore suffices to show that $\tau^{\mathrm{fair}} \geq \hat{\tau}^i_u - \epsilon'$.

Denote the set of trajectories arising when applying Pol (in either world) by $S$, and those in which the item is picked by $S^{\mathrm{win}}$. Denote by $p_\omega$ the probability of a trajectory $\omega \in S$ in the real world and by $\hat{p}_\omega$ its probability in the estimated world. Let $r_\omega$ be the utility of $\omega$ (as defined for Util-Max by ignoring the penalty) in either world. It follows that

$$\tau^{\mathrm{fair}} \;=\; \frac{1}{\sum_{\omega \in S^{\mathrm{win}}} p_\omega} \cdot \sum_{\omega \in S} p_\omega \cdot r_\omega \;=\; \sum_{\omega \in S} \Bigg(\frac{p_\omega}{\sum_{\omega' \in S^{\mathrm{win}}} p_{\omega'}} \cdot r_\omega\Bigg),$$

and that

$$\hat{\tau}^i_u \;=\; \frac{1}{\sum_{\omega \in S^{\mathrm{win}}} \hat{p}_\omega} \cdot \sum_{\omega \in S} \hat{p}_\omega \cdot r_\omega \;=\; \sum_{\omega \in S} \Bigg(\frac{\hat{p}_\omega}{\sum_{\omega' \in S^{\mathrm{win}}} \hat{p}_{\omega'}} \cdot r_\omega\Bigg).$$

Since each transition probability is lower bounded by $1/\mathrm{poly}$, it is estimated to within a small multiplicative error. Since $p_\omega$ and $\hat{p}_\omega$ can each be written as the product of at most $D$ transition probabilities, each term $\hat{p}_\omega / \sum_{\omega' \in S^{\mathrm{win}}} \hat{p}_{\omega'}$ is within a small multiplicative error of the corresponding term $p_\omega / \sum_{\omega' \in S^{\mathrm{win}}} p_{\omega'}$. It follows that $\hat{\tau}^i_u$ is within a small multiplicative factor of $\tau^{\mathrm{fair}}$. But notice that $|r_\omega|$ is bounded by a polynomial in the input parameters, which implies that $|\hat{\tau}^i_u - \tau^{\mathrm{fair}}| \leq \epsilon'$.

### 4.3 Designing an Adaptive Strategy for DAG-Utility Maximization

From the previous section we know how to obtain close estimates of the grades. Now we use well-estimated grades to design a robust adaptive strategy for Dag-Util-Max and prove Theorem 4.3. Theorem 4.3 directly follows by combining Lemma 1 and the following Lemma 4.

###### Lemma 4

Assume the conditions of Theorem 4.3, and that the grade of each state of Markov system $S_i$ is estimated to within an additive factor that is inversely polynomial in $k$ and $D_i$, where $D_i$ is the depth of $S_i$. Then there exists an adaptive algorithm with utility at least

$$\frac{1}{\alpha} \cdot \mathbb{E}_{\omega}\Big[\max_{\mathcal{I} \in \mathcal{F}}\big\{\mathrm{val}(\mathcal{I}, Y^{\max}(\omega))\big\}\Big] - \epsilon.$$

To prove Lemma 4, we first describe our algorithm (Algorithm 2) and the "shifted" prevailing costs it uses.

###### Definition 9

Fix a trajectory profile $\omega$ in which each Markov system reaches a destination state. For each $i$ and each $u \in \omega_i$, let $t^i_u$ be the number of transitions for $S_i$ to reach $u$ from $s_i$ by taking the trajectory $\omega_i$. Define the shifted grades by adding to each estimated grade $\hat{\tau}^i_u$ a small term increasing in $t^i_u$, and let the shifted prevailing cost $\hat{Y}^{\max}_i(\omega_i)$ be the minimum shifted grade along $\omega_i$. Denote the list of $\hat{Y}^{\max}_i(\omega_i)$'s as $\hat{Y}^{\max}(\omega)$, and denote by $A$ the corresponding list of values.

The key idea in Algorithm 2 (the main difference from Algorithm 4) is the "upward shifting" technique in Step 2. As we advance a Markov system, we shift our estimates of its grades upward. This guarantees that our algorithm is optimal in the teasing game defined for Claim 8.

###### Proof (Proof of Lemma 4)

This lemma immediately follows from the following two claims (whose proofs are in Appendix 0.E).

###### Claim

The utility of running Algorithm 2 in the real world is exactly the same as

$$\mathbb{E}_{\omega}\big[\mathrm{val}\big(\mathrm{Alg}(\hat{Y}^{\max}(\omega), A),\, Y^{\max}(\omega)\big)\big].$$
###### Claim

For any trajectory profile $\omega$ and any $i$, the shifted prevailing cost $\hat{Y}^{\max}_i(\omega_i)$ is close to the true prevailing cost $Y^{\max}_i(\omega_i)$. Thus

$$\mathrm{val}\big(\mathrm{Alg}(\hat{Y}^{\max}(\omega), A),\, Y^{\max}(\omega)\big) \;\geq\; \frac{1}{\alpha} \cdot \max_{\mathcal{I} \in \mathcal{F}}\big\{\mathrm{val}(\mathcal{I}, Y^{\max}(\omega))\big\} - \epsilon.$$

## 5 Handling Commitment Constraints

Consider the Markovian PoI model defined in §2 with an additional restriction that whenever we abandon advancing a Markov system, we need to immediately and irrevocably decide if we are selecting this element into the final solution . Since we only select ready elements, any element that is not ready when we abandon its Markov system is automatically discarded. We call this constraint commitment. The benchmark for our algorithm is the optimal policy without the commitment constraint. For single-stage probing, such commitment constraints have been well studied, especially in the context of stochastic matchings [11, 6].

We study Util-Max in the Dag model with the commitment constraint. Our algorithms make use of the online contention resolution schemes (OCRSs) proposed in [19]. OCRSs address our problem in the Free-Info world (i.e., we can see the realizations of the r.v.s for free, but there is the commitment constraint). In fact, OCRSs consider a variant in which an adversary chooses the order in which the elements are tried; this also handles the present problem, where we may choose the order. Constant-factor "selectable" OCRSs are known for several constraint families, including matroids, matchings, and intersections of matroids [19]. We show how to adapt them to Markovian PoI with commitment.

###### Theorem 5.1

For an additive objective, if there exists a $\frac{1}{\alpha}$-selectable OCRS ($\alpha \geq 1$) for a packing constraint $\mathcal{F}$, then there exists an $\alpha$-approximation algorithm for the corresponding Dag-Util-Max problem with commitment.

The proof of this result uses a new LP relaxation (inspired by [22]) to bound the optimum utility of a Markovian PoI game without commitment (see §5.1). Although this relaxation is not exact even for Pandora's box (and so cannot be used to design the optimal strategies of Corollary 1), it turns out to suffice for our approximation guarantees. In §5.2, we use an OCRS to round this LP with only a small loss in the utility, while respecting the commitment constraint.

###### Remark 1

We do not consider the Disutil-Min problem under commitment because it captures prophet inequalities in a minimization setting, where no polynomial approximation is possible even for i.i.d. r.v.s [18].

In §5.1, we give an LP relaxation to upper bound the optimum utility without the commitment constraint. In §5.2, we apply an OCRS to round the LP solution to obtain an adaptive policy, while satisfying the commitment constraint.

### 5.1 Upper Bounding the Optimum Utility

Define the following variables, where $i \in J$ is an index for the Markov systems.

• $y^i_u$: the probability that we reach state $u$ in Markov system $S_i$, for $u \in V_i$.

• $z^i_u$: the probability that we play $S_i$ when it is in state $u$, for $u \in V_i$.

• $x_i$: the probability that $i$ is selected into the final solution when in a destination state.

• $P_{\mathcal{F}}$: a convex relaxation containing all feasible solutions for the packing constraint $\mathcal{F}$.

We can now formulate the following LP, which is inspired by [22].

$$\begin{aligned}
\max_z \quad & \sum_i \Big(\sum_{u \in T_i} r^i_u z^i_u - \sum_{u \in V_i \setminus T_i} \pi^i_u z^i_u\Big) & \\
\text{subject to} \quad & y^i_{s_i} = 1 & \forall i \in J \\
& y^i_u = \sum_{v \in V_i} (P_i)_{vu}\, z^i_v & \forall i \in J,\ \forall u \in V_i \setminus \{s_i\} \\
& x_i = \sum_{u \in T_i} z^i_u & \forall i \in J \\
& z^i_u \leq y^i_u & \forall i \in J,\ \forall u \in V_i \\
& x \in P_{\mathcal{F}} & \\
& x_i,\ y^i_u,\ z^i_u \geq 0 & \forall i \in J,\ \forall u \in V_i
\end{aligned}$$

The first four constraints characterize the dynamics of advancing the Markov systems. The fifth constraint encodes the packing constraint $\mathcal{F}$. We denote an optimal solution of this LP by $(x^*, y^*, z^*)$. We can efficiently solve the above LP for packing constraints such as matroids, matchings, and intersections of matroids.

If we interpret the variables $x_i$, $y^i_u$, and $z^i_u$ as the corresponding probabilities for the optimal strategy without commitment, they form a feasible solution to the LP. This implies the following lemma.

###### Lemma 5

The optimum utility without commitment is at most the LP value.

### 5.2 Rounding the LP Using an OCRS

Before describing our rounding algorithm, we define an OCRS. Intuitively, it is an online algorithm that, given a random set of ground elements, selects a feasible subset of them. Moreover, if it can guarantee that every active element is selected w.p. at least $\frac{1}{\alpha}$, it is called $\frac{1}{\alpha}$-selectable.

###### Definition 10 (OCRS [19])

Given a point $x \in \mathcal{P}_{\mathcal{F}}$, let $R(x)$ denote a random set containing each element $i$ independently w.p. $x_i$. The elements reveal one-by-one whether they belong to $R(x)$, and we need to decide irrevocably whether to select an element into the final solution before the next element is revealed. An OCRS is an online algorithm that selects a feasible subset $S \subseteq R(x)$.

###### Definition 11 (1/α-Selectability [19])

Let $\alpha \ge 1$. An OCRS for $\mathcal{P}_{\mathcal{F}}$ is $1/\alpha$-selectable if for any $x \in \mathcal{P}_{\mathcal{F}}$ and all elements $i$, we have $\Pr[i \in S \mid i \in R(x)] \ge 1/\alpha$.

Our algorithm ALG uses the OCRS as an oracle. It starts by fixing an arbitrary order of the Markov systems. (Our algorithm works even when an adversary decides the order of the Markov systems.) Then at each step, the algorithm considers the next element $i$ in this order and queries the OCRS on whether it would select element $i$ if it were ready. If the OCRS decides to select $i$, then ALG advances the Markov system such that it plays from each state $u$ with independent probability $z_u^i / y_u^i$. This guarantees that each destination state $u \in T_i$ is reached and played with probability $z_u^i$, so $i$ becomes ready with probability $x_i$. If the OCRS is not going to select $i$, then ALG moves on to the next element. A formal description of the algorithm can be found in Algorithm 3.
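The randomized advancing step can be sketched as follows (our own illustration with invented LP values for a single one-step chain; `advance_once` is a hypothetical helper, not from the paper): playing from each state $u$ with probability $z_u/y_u$ makes destination $u$ ready with probability exactly $z_u$, which we verify by Monte Carlo.

```python
# Sketch of ALG's randomized advancing on one chain s -> a (0.6), s -> b (0.4).
import random

# Hypothetical feasible LP values (y_a = 0.6*z_s, y_b = 0.4*z_s, z_u <= y_u).
y = {"s": 1.0, "a": 0.48, "b": 0.32}   # reach probabilities
z = {"s": 0.80, "a": 0.30, "b": 0.32}  # play probabilities

def advance_once(rng):
    """Advance the system as ALG does; return the ready destination or None."""
    if rng.random() >= z["s"] / y["s"]:      # play s w.p. z_s / y_s
        return None
    u = "a" if rng.random() < 0.6 else "b"   # chain transition
    if rng.random() < z[u] / y[u]:           # play u w.p. z_u / y_u
        return u                             # destination u is ready
    return None

rng = random.Random(0)
N = 200_000
counts = {"a": 0, "b": 0}
for _ in range(N):
    u = advance_once(rng)
    if u is not None:
        counts[u] += 1
freq = {u: counts[u] / N for u in counts}
print(freq)  # empirical readiness frequencies, close to z_a and z_b
```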

We show below that ALG obtains a utility of at least $1/\alpha$ times the LP value.

###### Lemma 6

The utility of ALG is at least $1/\alpha$ times the LP optimum.

Since by Lemma 5 the LP optimum is an upper bound on the utility of any policy without commitment, this proves Theorem 5.1. We now prove Lemma 6.

###### Proof (Proof of Lemma 6)

Recollect that we call a Markov system ready if it reaches an absorbing destination state. We first notice that once ALG starts to advance a Markov system $i$, then by Step 3 of Algorithm 3, element $i$ is ready with probability exactly $x_i$. This agrees with what ALG tells the OCRS. Since the OCRS is $1/\alpha$-selectable, the probability that any Markov system $i$ begins advancing is at least $1/\alpha$. Here the probability is both over the random choices of the OCRS and the randomness due to the Markov systems. Conditioning on the event that $i$ begins advancing, the probability that it is selected into the final solution on reaching a destination state is exactly $1$, since the OCRS has already committed to it. Hence, the conditioned utility from Markov system $i$ is exactly

$$\sum_{u \in T_i} r_u^i z_u^i - \sum_{u \in V_i \setminus T_i} \pi_u^i z_u^i.$$

By removing the conditioning and by linearity of expectation, the utility of ALG is at least $\frac{1}{\alpha} \sum_i \big( \sum_{u \in T_i} r_u^i z_u^i - \sum_{u \in V_i \setminus T_i} \pi_u^i z_u^i \big)$, which proves this lemma.

## 6 Related Work

Our work is related to work on multi-armed bandits in the scheduling literature. The Gittins index theorem [21] provides a simple optimal strategy for several scheduling problems where the objective is to maximize the long-term exponentially discounted reward. This theorem turned out to be fundamental, and [38, 39, 41] gave alternate proofs. It can also be used to solve Weitzman’s Pandora’s box. The reader is referred to the book [20] for further discussion of this topic. Influenced by this literature, [17] studied scheduling of Markovian jobs, which is a minimization variant of the Gittins index theorem without any discounting. Their paper is part of the inspiration for our Markovian PoI model.

The Lagrangian variant of stochastic probing considered in [22] is similar to our Markovian PoI model. However, their approach using an LP relaxation to design a probing strategy is fundamentally different from our approach using a Frugal algorithm. E.g., unlike Corollary 1, their approach cannot give optimal probing strategies for matroid constraints due to an integrality gap. Also, their approach does not work for Disutil-Min. In §5, we extend their techniques using OCRSs to handle the commitment constraint for Util-Max.

There is also a large body of work in related models where information has a price [28, 10, 32, 25, 14, 1, 13, 12]. Finally, as discussed in the introduction, the works in [33] and [37] are directly relevant to this paper. The former’s primary focus is on single-item settings and applications to auction design, and the latter studies the price of information in a single-stage probing model. Our contributions concern selecting multiple items in a multi-stage probing model, in some sense unifying these two lines of work.

The field of combinatorial optimization has been extensively studied: we refer the readers to Schrijver’s popular book [36], and the references therein. In recent years, there has also been a lot of interest in studying these combinatorial problems for stochastic inputs. [15, 16, 24, 22, 9, 34, 35] considered stochastic knapsack, [11, 2, 6, 8, 3] studied stochastic matchings, [23, 27, 7] studied stochastic orienteering, [5, 29, 4, 31, 30] considered stochastic submodular maximization, and [22, 23, 26, 35] studied budgeted multi-armed bandits. These works (except [22]) do not consider a mixed-sign utility objective or multi-stage probing, which is our primary focus.

## References

• [1] Abbas, A.E., Howard, R.A.: Foundations of decision analysis. Pearson Higher Ed (2015)
• [2] Adamczyk, M.: Improved analysis of the greedy algorithm for stochastic matching. Inf. Process. Lett. 111(15), 731–737 (2011)
• [3] Adamczyk, M., Grandoni, F., Mukherjee, J.: Improved approximation algorithms for stochastic matching. In: Algorithms-ESA 2015, pp. 1–12. Springer (2015)
• [4] Adamczyk, M., Sviridenko, M., Ward, J.: Submodular stochastic probing on matroids. Mathematics of Operations Research 41(3), 1022–1038 (2016)
• [5] Asadpour, A., Nazerzadeh, H., Saberi, A.: Stochastic submodular maximization. In: International Workshop on Internet and Network Economics. pp. 477–489. Springer (2008)
• [6] Bansal, N., Gupta, A., Li, J., Mestre, J., Nagarajan, V., Rudra, A.: When LP Is the Cure for Your Matching Woes: Improved Bounds for Stochastic Matchings. Algorithmica 63(4), 733–762 (2012)
• [7] Bansal, N., Nagarajan, V.: On the adaptivity gap of stochastic orienteering. In: IPCO. pp. 114–125 (2014)
• [8] Baveja, A., Chavan, A., Nikiforov, A., Srinivasan, A., Xu, P.: Improved bounds in stochastic matching and optimization. In: APPROX. pp. 124–134 (2015)
• [9] Bhalgat, A., Goel, A., Khanna, S.: Improved approximation results for stochastic knapsack problems. In: SODA. pp. 1647–1665 (2011)
• [10] Charikar, M., Fagin, R., Guruswami, V., Kleinberg, J.M., Raghavan, P., Sahai, A.: Query strategies for priced information. J. Comput. Syst. Sci. 64(4), 785–819 (2002). https://doi.org/10.1006/jcss.2002.1828
• [11] Chen, N., Immorlica, N., Karlin, A.R., Mahdian, M., Rudra, A.: Approximating Matches Made in Heaven. In: ICALP (1). pp. 266–278 (2009)
• [12] Chen, Y., Immorlica, N., Lucier, B., Syrgkanis, V., Ziani, J.: Optimal data acquisition for statistical estimation. arXiv preprint arXiv:1711.01295 (2017)
• [13] Chen, Y., Hassani, S.H., Karbasi, A., Krause, A.: Sequential information maximization: When is greedy near-optimal? In: Conference on Learning Theory. pp. 338–363 (2015)
• [14] Chen, Y., Javdani, S., Karbasi, A., Bagnell, J.A., Srinivasa, S.S., Krause, A.: Submodular surrogates for value of information. In: AAAI. pp. 3511–3518 (2015)
• [15] Dean, B.C., Goemans, M.X., Vondrák, J.: Approximating the stochastic knapsack problem: The benefit of adaptivity. In: Foundations of Computer Science, 2004. Proceedings. 45th Annual IEEE Symposium on. pp. 208–217. IEEE (2004)
• [16] Dean, B.C., Goemans, M.X., Vondrák, J.: Adaptivity and approximation for stochastic packing problems. In: SODA. pp. 395–404 (2005)
• [17] Dumitriu, I., Tetali, P., Winkler, P.: On playing golf with two balls. SIAM Journal on Discrete Mathematics 16(4), 604–615 (2003)
• [18] Esfandiari, H., Hajiaghayi, M., Liaghat, V., Monemizadeh, M.: Prophet secretary. SIAM Journal on Discrete Mathematics 31(3), 1685–1701 (2017)
• [19] Feldman, M., Svensson, O., Zenklusen, R.: Online contention resolution schemes. In: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 1014–1033. Society for Industrial and Applied Mathematics (2016)
• [20] Gittins, J., Glazebrook, K., Weber, R.: Multi-armed bandit allocation indices. John Wiley & Sons (2011)
• [21] Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. Progress in statistics pp. 241–266 (1974)
• [22] Guha, S., Munagala, K.: Approximation algorithms for budgeted learning problems. In: STOC, pp. 104–113 (2007), full version as: Approximation Algorithms for Bayesian Multi-Armed Bandit Problems, http://arxiv.org/abs/1306.3525
• [23] Guha, S., Munagala, K.: Multi-armed bandits with metric switching costs. In: ICALP. pp. 496–507 (2009)
• [24] Guha, S., Munagala, K.: Adaptive uncertainty resolution in bayesian combinatorial optimization problems. ACM Transactions on Algorithms (TALG) 8(1),  1 (2012)
• [25] Guha, S., Munagala, K., Sarkar, S.: Information acquisition and exploitation in multichannel wireless systems. In: IEEE Transactions on Information Theory. Citeseer (2007)
• [26] Gupta, A., Krishnaswamy, R., Molinaro, M., Ravi, R.: Approximation algorithms for correlated knapsacks and non-martingale bandits. In: FOCS. pp. 827–836 (2011)
• [27] Gupta, A., Krishnaswamy, R., Nagarajan, V., Ravi, R.: Approximation algorithms for stochastic orienteering. In: SODA (2012), http://dl.acm.org/citation.cfm?id=2095116.2095237
• [28] Gupta, A., Kumar, A.: Sorting and selection with structured costs. In: Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on. pp. 416–425. IEEE (2001)
• [29] Gupta, A., Nagarajan, V.: A stochastic probing problem with applications. In: IPCO. pp. 205–216 (2013)
• [30] Gupta, A., Nagarajan, V., Singla, S.: Algorithms and adaptivity gaps for stochastic probing. In: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 1731–1747. SIAM (2016)
• [31] Gupta, A., Nagarajan, V., Singla, S.: Adaptivity Gaps for Stochastic Probing: Submodular and XOS Functions. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 1688–1702. SIAM (2017)
• [32] Kannan, S., Khanna, S.: Selection with monotone comparison costs. In: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. pp. 10–17. Society for Industrial and Applied Mathematics (2003)
• [33] Kleinberg, R., Waggoner, B., Weyl, G.: Descending Price Optimally Coordinates Search. arXiv preprint arXiv:1603.07682 (2016)
• [34] Li, J., Yuan, W.: Stochastic combinatorial optimization via poisson approximation. In: Symposium on Theory of Computing Conference, STOC’13, Palo Alto, CA, USA, June 1-4, 2013. pp. 971–980 (2013). https://doi.org/10.1145/2488608.2488731
• [35] Ma, W.: Improvements and generalizations of stochastic knapsack and multi-armed bandit approximation algorithms: Extended abstract. In: SODA. pp. 1154–1163 (2014)
• [36] Schrijver, A.: Combinatorial optimization: polyhedra and efficiency, vol. 24. Springer Science & Business Media (2003)
• [37] Singla, S.: The price of information in combinatorial optimization. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM (2018)
• [38] Tsitsiklis, J.N.: A short proof of the Gittins index theorem. The Annals of Applied Probability pp. 194–199 (1994)
• [39] Weber, R.: On the Gittins index for multiarmed bandits. The Annals of Applied Probability 2(4), 1024–1033 (1992)
• [40] Weitzman, M.L.: Optimal search for the best alternative. Econometrica: Journal of the Econometric Society pp. 641–654 (1979)
• [41] Whittle, P.: Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B (Methodological) pp. 143–149 (1980)

## Appendix 0.A Proof of Lemma 1

We restate Lemma 1 below.

See 1

###### Proof

We abuse the notation and use OPT to denote both the optimal policy and its utility. Suppose we fix a trajectory profile $\omega$ where each Markov system reaches a destination state. Let $I(\omega)$ be the set of elements selected by OPT on $\omega$; notice that some of the unselected elements may not be ready, since OPT might have selected $I(\omega)$ after playing only prefixes of the trajectories in $\omega$. The following observation follows from the definition of the surrogate $\textsc{SUR}$.

###### Observation 0.A.1

For any trajectory profile $\omega$,

 $\mathrm{val}(I(\omega), Y^{\max}(\omega)) \le \textsc{SUR}(\omega).$

Now, using the following Lemma 7 along with Observation 0.A.1 finishes the proof of Lemma 1.

###### Lemma 7

The utility of the optimal strategy satisfies

 $\textsc{OPT} \le \mathbb{E}_\omega\big[\mathrm{val}(I(\omega), Y^{\max}(\omega))\big].$
###### Proof (Proof of Lemma 7)

Since for every trajectory profile $\omega$, OPT in the Markovian PoI world and in the Free-Info world picks the same set of elements $I(\omega)$, the expected value due to the set function is the same. Hence, WLOG assume the set-function value is $0$ for all $\omega$.

Now consider the following teasing game defined using the prevailing cost from Definition 4. Each Markov system starts at its initial state, and a player is invited to advance the Markov systems. Besides advancing, the player is allowed to select arbitrary elements (which need not be feasible in $\mathcal{F}$) or to terminate the game at any time. Whenever an element is selected, the player pays a corresponding cost, which is set to be the prevailing cost defined by the trajectory that led to the element's current state. The player’s goal is to maximize the expected value, which is the expected utility (as defined for Util-Max) from advancing the Markov systems minus the expected total cost he pays when items are selected. Observe that in this game the costs are updated in a “teasing” manner according to the prevailing costs, which motivates the player to continue playing. By an argument similar to [17], we have the following lemma.

###### Lemma 8

The teasing game is fair, which means that no strategy achieves a positive expected value by playing it and that there exists a strategy with zero expected value. Moreover, the following strategy plays fairly: irrespective of the order in which the Markov systems are played, whenever the player starts to advance a Markov system, he continues to advance it through the entire epoch.

Now consider running the optimal policy OPT in the teasing game. Let $\omega$ be a trajectory profile in which each chain reaches its destination state, and let $\omega_T$ denote the profile of trajectory prefixes up to the moment when OPT returns the solution $I(\omega)$ on $\omega$. It should be noticed that each trajectory in $\omega_T$ is a prefix of the corresponding trajectory in $\omega$. In particular, for an element $i \in I(\omega)$, the two coincide since the destination state of $i$ is reached. For an element $i \notin I(\omega)$, however, the former may only be a prefix of the latter. It follows that applying OPT along the trajectory profile $\omega$ incurs a cost of $\sum_{i \in I(\omega)} Y^{\max}(\omega_T)_i$, where $Y^{\max}(\omega_T)_i$ is the prevailing cost for $i$ on its trajectory in $\omega_T$ according to Definition 4. Since the teasing game is fair, the expected utility of OPT cannot be larger than the expected cost it pays, i.e.,

 $\textsc{OPT} \le \mathbb{E}_\omega\Big[\sum_{i \in I(\omega)} Y^{\max}(\omega_T)_i\Big].$

Since the elements $i \in I(\omega)$ are ready, we have $Y^{\max}(\omega_T)_i = Y^{\max}_{\omega_i}$ and

 $\sum_{i \in I(\omega)} Y^{\max}(\omega_T)_i = \sum_{i \in I(\omega)} Y^{\max}_{\omega_i}.$

This implies

 $\textsc{OPT} \le \mathbb{E}_\omega\Big[\sum_{i \in I(\omega)} Y^{\max}_{\omega_i}\Big],$

which finishes the proof of Lemma 7.

## Appendix 0.B Proof of Lemma 2

We restate Lemma 2 below.

See 2

###### Proof (Proof of Lemma 2)

We describe how to adapt the Frugal algorithm $\mathcal{A}$ into an adaptive strategy in the Markovian PoI world. The strategy uses the grade as a proxy for $Y^{\max}$, since $Y^{\max}$ is known only when the Markov systems reach their destination states. More specifically, at each moment when the Frugal algorithm evaluates the marginal-value function $g$ for each element, instead of using the value of the element, which we may not yet know, the strategy uses the grade of the element's current state to compute the marginal. The Markov system of the element chosen by $g$ is advanced one more step. A more specific description of our algorithm is given in Algorithm 4. Here, for a set $M$, $Y^{\max}_M(\omega)$ is defined as the list of values $Y^{\max}_{\omega_i}$ for the elements $i$ in the set $M$.

In the following Claim 0.B, we argue that for any trajectory profile $\omega$, running Algorithm 4 in the Markovian PoI world returns the same set of elements as running $\mathcal{A}$ on $Y^{\max}(\omega)$.

###### Claim (Claim 0.B)

For any trajectory profile $\omega$, the solution returned by running Algorithm 4 in the Markovian PoI world is the same as the solution returned by Algorithm $\mathcal{A}$ on $Y^{\max}(\omega)$.

Before proving Claim 0.B, we use it to prove Lemma 2 by showing that the utility of Algorithm 4 in the Markovian PoI world is at least

 $\mathbb{E}_\omega\big[\mathrm{val}\big(\mathcal{A}(Y^{\max}(\omega)), Y^{\max}(\omega)\big)\big].$

By Claim 0.B, the value due to the set function is the same for both algorithms. So without loss of generality, assume the set-function value is always $0$. We consider the teasing game as defined in Lemma 8. By definition, $g$ is an increasing function of its last parameter. Since the grade is used as that parameter, and the grade of each state visited during an epoch is at least the grade of the initial state of that epoch, it follows that once Algorithm 4 starts to play a Markov system, it will not switch before finishing an epoch. Therefore, by Lemma 8, Algorithm 4 plays a fair game. So the expected cost that Algorithm 4 pays is the same as its expected utility from playing the Markov systems. Moreover, Claim 0.B gives that the expected cost paid by Algorithm 4 is the same as the cost of running Algorithm $\mathcal{A}$ in the Free-Info world. Hence, the utility of running Algorithm 4 is at least $\mathbb{E}_\omega\big[\mathrm{val}\big(\mathcal{A}(Y^{\max}(\omega)), Y^{\max}(\omega)\big)\big]$.

It remains to prove the missing Claim 0.B in the proof of Lemma 2.

###### Proof (Proof of Claim 0.B)

Suppose we fix a trajectory profile $\omega$ where each Markov system reaches some destination state. We prove the claim by induction on the number of elements already selected into the set $M$. Suppose the set of elements selected into $M$ so far is the same for both algorithms. We show that the next element selected into $M$ is also the same.

Assume for the purpose of contradiction that the next element picked by $\mathcal{A}$ is $j$, but the next element picked by Algorithm 4 is $i \ne j$. By the definition of Algorithm $\mathcal{A}$,

 $j = \arg\max_{i' \notin M} \big\{ g\big(Y^{\max}_M(\omega), i', Y^{\max}_{\omega_{i'}}\big) \big\},$ (2)

where $\omega_{i'}$ denotes the trajectory of element $i'$ in $\omega$. Now consider the trajectory $\omega_i$: the prevailing cost is non-increasing over this trajectory and equals $Y^{\max}_{\omega_i}$ when the Markov system reaches the destination state. Look at the last moment when the prevailing cost of $i$ decreases, and consider the first moment $t$ after it at which our Algorithm 4 decides to play $i$ (but has not actually played it yet). It follows that the prevailing cost of $i$ at moment $t$ is exactly $Y^{\max}_{\omega_i}$, which equals the grade $\tau^i_{u_i}$ of its current state $u_i$. Let $Y^{\max}_{\omega'_j}$ and $u_j$ denote the prevailing cost and the state of $j$ at moment $t$. Then we have $Y^{\max}_{\omega'_j} \le \tau^j_{u_j}$ because the prevailing cost of $j$ is also non-increasing. By the definition of Algorithm 4, one has

 $g\big(Y^{\max}_M(\omega), i, Y^{\max}_{\omega_i}\big) = g\big(Y^{\max}_M(\omega), i, \tau^i_{u_i}\big) > g\big(Y^{\max}_M(\omega), j, \tau^j_{u_j}\big) \ge g\big(Y^{\max}_M(\omega), j, Y^{\max}_{\omega'_j}\big).$

However, since $g$ is increasing in its last parameter and $Y^{\max}_{\omega'_j} \ge Y^{\max}_{\omega_j}$ (the prevailing cost of $j$ can only decrease after moment $t$), it follows that

 $g\big(Y^{\max}_M(\omega), j, Y^{\max}_{\omega'_j}\big) \ge g\big(Y^{\max}_M(\omega), j, Y^{\max}_{\omega_j}\big),$

which implies

 $g\big(Y^{\max}_M(\omega), i, Y^{\max}_{\omega_i}\big) > g\big(Y^{\max}_M(\omega), j, Y^{\max}_{\omega_j}\big).$

This contradicts the definition of $j$ in Eq. (2).

## Appendix 0.C Comparing Grade and Weitzman’s Index for Pandora’s Box

Recall Weitzman's Pandora's box formulation of the oil-drilling problem mentioned in Section 1. Given the probability distributions of $n$ independent random variables $X_i$ (the amount of oil at site $i$) and their probing (inspection) prices $\pi_i$, the goal is to design a strategy that adaptively probes a set $\mathrm{Probed}$ of sites to maximize the expected utility

 $\mathbb{E}\Big[\max_{i \in \mathrm{Probed}}\{X_i\} - \sum_{i \in \mathrm{Probed}} \pi_i\Big].$

Weitzman's index for site $i$, denoted $\sigma_i$, is defined by the equation $\mathbb{E}[(X_i - \sigma_i)^+] = \pi_i$. The following strategy is known to be optimal [40].

Selection Rule: The next site to be probed is the one with the highest Weitzman's index.

Stopping Rule: Terminate when the maximum realized value amongst the probed sites exceeds Weitzman's index of every unprobed site.
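A minimal sketch of this rule (our own illustration; the sites, distributions, and helper names are invented, not from the paper): the index is found by bisection on $\mathbb{E}[(X-\sigma)^+] = \pi$, which is decreasing in $\sigma$, and the selection and stopping rules are then applied in decreasing index order.

```python
# Illustrative sketch of Weitzman's index and strategy for discrete sites.
import random

def weitzman_index(outcomes, probs, price, lo=-1e6, hi=1e6, iters=100):
    """Bisect for sigma satisfying E[(X - sigma)^+] = price."""
    def surplus(sigma):
        return sum(p * max(x - sigma, 0.0) for x, p in zip(outcomes, probs))
    for _ in range(iters):
        mid = (lo + hi) / 2
        if surplus(mid) > price:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical sites: (outcomes, probabilities, probing price).
sites = [([0.0, 10.0], [0.5, 0.5], 1.0),   # risky site
         ([4.0], [1.0], 0.0)]              # free, deterministic site
indices = [weitzman_index(o, p, c) for o, p, c in sites]

def run_strategy(rng):
    """One run of Weitzman's rule; returns the realized utility."""
    best, utility = 0.0, 0.0
    order = sorted(range(len(sites)), key=lambda i: -indices[i])
    for i in order:
        if best >= indices[i]:      # stopping rule
            break
        outcomes, probs, price = sites[i]
        x = rng.choices(outcomes, probs)[0]  # realized value
        utility -= price                     # pay the probing price
        best = max(best, x)
    return utility + best

rng = random.Random(0)
avg = sum(run_strategy(rng) for _ in range(100_000)) / 100_000
print(f"index of risky site: {indices[0]:.2f}")
print(f"average utility: {avg:.2f}")
```

For the risky site, $0.5\,(10-\sigma) = 1$ gives index $8$; the strategy probes it first, stops on a high draw, and falls back to the free site otherwise.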

It turns out that Weitzman's index is simply the grade, defined in Section 3.1, in disguise. To see this, we start by noticing that each variable $X_i$ with probing price $\pi_i$ can be thought of as the following Markov system: there is one initial state $s_i$ with moving cost $\pi_i$, which has transitions, with probabilities according to the distribution of $X_i$, to a set of destination states, each corresponding to a possible outcome of $X_i$. The value of each destination state is naturally set to the corresponding outcome of $X_i$. We show below that $\sigma_i$ is simply the grade of the initial state $s_i$.

According to our definition of grade in Section 3.1, in the grade-penalized Markov game there is a fair strategy that probes site $i$ and achieves zero utility. Such a strategy picks site $i$ (i.e., plays in the corresponding destination state) if and only if the realized value exceeds the grade $\tau^i_{s_i}$. The utility of that policy is thus $\mathbb{E}[(X_i - \tau^i_{s_i})^+] - \pi_i = 0$. Comparing with the definition of Weitzman's index, this shows $\sigma_i = \tau^i_{s_i}$. The optimality of Weitzman's strategy is therefore also implied by Theorem 3.1.

## Appendix 0.D Adaptive Algorithms for Disutility Minimization

We give the corresponding definitions for the Disutil-Min problem.

###### Definition 12 (Prevailing Reward for Disutil-Min)

The prevailing reward of a Markov system for the trajectory $P_i$ in Disutil-Min is defined as

 $R^{\min}_{P_i} \overset{\Delta}{=} \max_{u \in P_i}\{-\tau^i_u\}.$

For a trajectory profile $\omega$, let $R^{\min}(\omega)$ denote the list of prevailing rewards, one for each Markov system.

For a trajectory in the Disutil-Min problem, consider how the prevailing reward changes as the Markov system starts from its initial state and moves according to the trajectory. The prevailing reward is non-decreasing in this process. Moreover, it increases whenever the Markov system reaches a state that has a smaller grade than every previously visited state. Now we are ready to state the definition of an epoch.

###### Definition 13 (Epoch for Disutil-Min)

An epoch is defined to be the period from the time when the prevailing reward increases until the moment just before the next time it increases.

It follows that within an epoch, all states visited have grade no smaller than the prevailing reward at the start of this epoch, and thus the prevailing reward stays constant in an epoch. We can therefore view the prevailing reward as a non-decreasing, piecewise-constant function of time.
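This piecewise-constant behavior can be sketched as follows (hypothetical grade values; `prevailing_rewards` is our own helper, not from the paper): the prevailing reward is the running maximum of $-\tau_u$ along the trajectory, so it never decreases, and each strict increase opens a new epoch.

```python
# Sketch: prevailing reward along a trajectory and its epoch count.
def prevailing_rewards(grades):
    """Prevailing reward R^min after each step, given the grades tau_u visited."""
    out, best = [], float("-inf")
    for tau in grades:
        best = max(best, -tau)   # running maximum of -tau
        out.append(best)
    return out

grades = [5.0, 7.0, 3.0, 4.0, 2.0]   # hypothetical grades along one trajectory
rewards = prevailing_rewards(grades)
print(rewards)   # non-decreasing, constant within each epoch

# An epoch starts at each strict increase of the prevailing reward.
epochs = 1 + sum(rewards[t] > rewards[t - 1] for t in range(1, len(rewards)))
print(epochs)
```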

###### Definition 14 (Frugal Covering Algorithm)

For a Disutil-Min problem in the Deterministic world with covering constraints $\mathcal{F}$ and a cost function, we say Algorithm $\mathcal{A}$ is Frugal if there exists a marginal-value function $g$ that is decreasing in its last parameter, and for which the pseudocode is given by Algorithm 5. Moreover, the function $g$ should encode the constraints $\mathcal{F}$, so that whenever the current solution is infeasible, $g$ proposes some element to add. This requirement ensures that a feasible solution is returned.