Consider an online stochastic optimization problem with a finite number of rounds. There is a set of tasks (or items, boxes, jobs, or actions). The system has an initial “state” (called the value of the system). In each round, we can select a task, and each task can be chosen at most once. Finishing the task generates some (possibly stochastic) feedback: it changes the value of the system and provides some profit for the round. Our goal is to design a strategy that maximizes our total (expected) profit.
The above problem can be modeled as a class of stochastic dynamic programs, which was introduced by Bellman. Many problems in stochastic combinatorial optimization fit in this model, e.g., the stochastic knapsack problem and the Probemax problem. Formally, the problem is specified by a -tuple . Here, is the set of all possible values of the system. is a finite set of items or tasks which can be selected, and each item can be chosen at most once. This model proceeds for at most rounds. At each round , we use to denote the current value of the system and the set of remaining available items. If we select an item , the value of the system changes to . Here may be stochastic and is assumed to be independent for each item . Using the terminology from Markov decision processes, the state at time is . (This is why we do not call the state of the system.) Hence, if we select an item , the evolution of the state is determined by the state transition function :
Meanwhile the system yields a random profit . The function is the terminal profit function at the end of the process.
We begin with the initial state . We choose an item . Then the system yields a profit , and moves to the next state where follows the distribution and . This process is iterated yielding a random sequence
The profits are accumulated over steps. (If there are fewer than steps, we can pad the process with special items which satisfy that and for any value .) The goal is to find a policy that maximizes the expectation of the total profits . Formally, we want to determine:
By Bellman’s equation , for every initial state , the optimal value is given by . Here is the function defined by together with the recursion:
When the value and the item spaces are finite, and the expectations can be computed, this recursion yields an algorithm to compute the optimal value. However, since the state space is exponentially large, this exact algorithm requires exponential time. Since this model can capture several stochastic optimization problems which are known (or believed) to be #P-hard or even PSPACE-hard, we are interested in obtaining polynomial-time approximation algorithms with provable performance guarantees.
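To make the exponential cost of the exact recursion concrete, the following is a minimal Python sketch of a Bellman-style recursion over states (round, value, set of remaining items). The function name `solve_exact`, the callback signatures, and the toy instance are illustrative assumptions, not notation from the paper; the sketch also assumes an item is selected in every round while items remain, so stopping early would have to be modeled with zero-effect padding items.

```python
from functools import lru_cache


def solve_exact(items, initial_value, rounds, transition, terminal):
    """Exact recursion over states (round, value, remaining items).

    transition(value, item) returns a list of (prob, next_value, profit)
    outcomes; terminal(value) returns the terminal profit. Both callbacks
    are hypothetical interfaces chosen for this sketch. The remaining-item
    set is part of the state, so the table can hold exponentially many
    entries in len(items).
    """

    @lru_cache(maxsize=None)
    def dp(t, value, remaining):
        if t > rounds or not remaining:
            return terminal(value)
        return max(
            sum(prob * (profit + dp(t + 1, nxt, remaining - {item}))
                for prob, nxt, profit in transition(value, item))
            for item in remaining
        )

    return dp(1, initial_value, frozenset(items))


# Toy instance (an assumption for illustration): the value counts the
# successes so far, and each success yields profit 1.
SUCCESS_PROB = {"a": 0.9, "b": 0.5, "c": 0.1}


def toy_transition(value, item):
    p = SUCCESS_PROB[item]
    return [(p, value + 1, 1.0), (1.0 - p, value, 0.0)]


# With two rounds the best plan uses items "a" and "b" (about 1.4 in total).
print(solve_exact(SUCCESS_PROB, 0, 2, toy_transition, lambda value: 0.0))
```

The memoization key contains a `frozenset` of remaining items, which is exactly where the exponential blow-up discussed above comes from.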
1.1 Our Results
In order to obtain a polynomial-time approximation scheme (PTAS) for the stochastic dynamic program, we need the following assumptions.
Assumption 1. In this paper, we make the following assumptions.
(1) The value space is discrete and ordered, and its size is a constant. W.l.o.g., we assume .
(2) The function satisfies that , which means the value is nondecreasing.
(3) The function is a nonnegative function. The expected profit is nonnegative (although the function may be negative with nonzero probability).
Assumption (1) seems to be quite restrictive. However, for several concrete problems where the value space is not of constant size (e.g., Probemax in Section 1.2), we can discretize the value space and reduce its size to a constant without losing much profit. Assumptions (2) and (3) are quite natural for many problems. Now, we state our main result.
For any fixed , if Assumption 1 holds, we can find an adaptive policy in polynomial time with expected profit at least where and denotes the expected profit of the optimal adaptive policy.
Our Approach: For the stochastic dynamic program, an optimal adaptive policy can be represented as a decision tree (see Section 2 for more details). The decision tree corresponding to the optimal policy may be exponentially large and arbitrarily complicated. Hence, it is unlikely that one can even represent an optimal decision for the stochastic dynamic program in polynomial space. In order to reduce the space, we focus on a special class of policies, called block adaptive policies. The idea of block adaptive policy was first introduced by Bhalgat et al. [6] and further generalized in [18] to the context of the stochastic knapsack. To the best of our knowledge, the idea has not been extended to other applications. In this paper, we make use of the notion of block adaptive policy as well, but we aim to develop a general framework. To this end, we provide a general model of block policy (see Section 3). Since we need to work with the more abstract dynamic program, our construction of block adaptive policy is somewhat different from that in [6, 18].
Roughly speaking, in a block adaptive policy, we take a batch of items simultaneously instead of a single one each time. This can significantly reduce the size of the decision tree. Moreover, we show that there exists a block-adaptive policy that approximates the optimal adaptive policy and has only a constant number of blocks on the decision tree (the constant depends on ). Since the decision tree corresponding to a block adaptive policy has a constant number of nodes, the number of all topologies of the block decision tree is a constant. Fixing the topology of the decision tree corresponding to the block adaptive policy, we still need to decide the subset of items to place in each block. Again, there is an exponential number of possible choices. For each block, we can define a signature, which allows us to represent a block using polynomially many possible signatures. The signatures are defined so that two subsets with the same signature have approximately the same reward distribution. Finally, we show that we can enumerate the signatures of all blocks in polynomial time using dynamic programming and find a nearly optimal block-adaptive policy. The high-level idea is somewhat similar to that in , but the details are again quite different.
Our framework can be used to obtain the first PTAS for the following problems.
1.2.1 The Probemax Problem
In the Probemax problem, we are given a set of items. Each item has a value which is an independent random variable following a known (discrete) distribution . We can probe a subset of items sequentially. Each time after probing an item , we observe its value realization, which is an independent sample from the distribution . We can adaptively probe at most items and each item can be probed at most once. The reward is the maximum among the realized values. Our goal is to design an adaptive probing policy such that the expected value of the reward is maximized.
Despite being a very basic stochastic optimization problem, we still do not have a complete understanding of the approximability of the Probemax problem. It is not even known whether it is intractable to obtain the optimal policy. For the non-adaptive Probemax problem (i.e., the probed set is fixed a priori), it is easy to obtain a approximation by noticing that is a submodular function (see e.g., Chen et al. ). Chen et al. obtained the first PTAS. When considering adaptive policies, Munagala provided a -approximation algorithm via LP relaxation. His policy is essentially a non-adaptive policy (it is related to the contention resolution schemes [24, 11]). He also showed that the adaptivity gap (the gap between the optimal adaptive policy and the optimal non-adaptive policy) is at most 3. For the Probemax problem, the best-known approximation ratio is . Indeed, this can be obtained using the algorithm for stochastic monotone submodular maximization in Asadpour et al. . This is also a non-adaptive policy, which implies the adaptivity gap is at most . In this paper, we provide the first PTAS among all adaptive policies. Note that our policy is indeed adaptive.
There exists a PTAS for the Probemax problem. In other words, for any fixed constant , there is a polynomial-time approximation algorithm for the Probemax problem that finds a policy with the expected profit at least , where denotes the expected profit of the optimal adaptive policy.
Let the value be the maximum among the realized values of the probed items at the time period . Using our framework, we have the following system dynamics for Probemax:
. Clearly, Assumption 1 (2) and (3) are satisfied. But Assumption 1 (1) is not satisfied because the value space is not of constant size. Hence, we need to discretize the value space and reduce its size to a constant. See Section 4 for more details. If the reward is the summation of top- values () among the realized values, we obtain the ProbeTop- problem. Our techniques also allow us to derive the following result.
For the ProbeTop- problem where is a constant, there is a polynomial time algorithm that finds an adaptive policy with the expected profit at least , where denotes the expected profit of the optimal adaptive policy.
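As a small illustration of the Probemax setting and of the adaptivity gap discussed above, the following Python sketch computes, by brute force, the value of the optimal adaptive policy and of the best non-adaptive probed set on a toy instance. The function names, the dictionary encoding of distributions (`value -> probability`), and the instance itself are assumptions made for this sketch; values are assumed nonnegative, and the running maximum starts at 0.

```python
from functools import lru_cache
from itertools import combinations, product


def adaptive_probemax(dists, m):
    """Optimal adaptive value; the system value is the running maximum."""

    @lru_cache(maxsize=None)
    def best(current_max, remaining, probes_left):
        if probes_left == 0 or not remaining:
            return current_max
        return max(
            sum(p * best(max(current_max, x), remaining - {i}, probes_left - 1)
                for x, p in dists[i].items())
            for i in remaining
        )

    return best(0, frozenset(dists), m)


def nonadaptive_probemax(dists, m):
    """Best expected maximum over probed sets fixed in advance."""

    def expected_max(subset):
        total = 0.0
        for outcome in product(*(dists[i].items() for i in subset)):
            prob = 1.0
            for _, p in outcome:
                prob *= p
            total += prob * max((x for x, _ in outcome), default=0)
        return total

    size = min(m, len(dists))
    return max(expected_max(s) for s in combinations(dists, size))


# Toy instance (hypothetical): a safe item, a medium item, and a long shot.
TOY = {"A": {1: 1.0}, "B": {0: 0.5, 2: 0.5}, "C": {0: 0.8, 4: 0.2}}

# Adaptive is about 1.7 (probe B, then C after a high outcome and A after a
# low one); the best fixed pair reaches about 1.6.
print(adaptive_probemax(TOY, 2), nonadaptive_probemax(TOY, 2))
```

Both routines are exponential in the number of items and are meant only to make the definitions tangible on tiny inputs.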
1.2.2 Committed ProbeTop-k Problem
We are given a set of items. Each item has a value which is an independent random variable with a known (discrete) distribution . We can adaptively probe at most items and choose values in the committed model, where is a constant. In the committed model, once we probe an item and observe its value realization, we must make an irrevocable decision whether to choose it or not, i.e., we must either add it to the final chosen set immediately or discard it forever. (In [11, 12], it is called the online decision model.) If we add the item to the final chosen set , the realized profit is collected. Otherwise, no profit is collected and we go on to probe the next item. Our goal is to design an adaptive probing policy such that the expected value is maximized, where is the final chosen set.
There is a polynomial time algorithm that finds a committed policy with the expected profit at least for the committed ProbeTop- problem, where is the expected total profit obtained by the optimal policy.
Let represent the action that we probe item with the threshold (i.e., we choose item if realizes to a value such that ). Let be the number of items that have been chosen at time period . Using our framework, we have the following transition dynamics for the ProbeTop- problem.
for , and . Since is a constant, Assumption 1 is immediately satisfied. There is one extra requirement for the problem: in any realization path, we can choose at most one action from the set . See Section 5 for more details.
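The threshold actions of the committed model can be sketched as a brute-force recursion in which the system value is the number of items chosen so far. This is only an illustration under stated assumptions: the function name, the dictionary encoding of distributions, and the restriction of thresholds to support points of each item are choices made here, and values are assumed nonnegative.

```python
from functools import lru_cache


def committed_probetop_k(dists, m, k):
    """Brute-force optimal committed policy for tiny instances.

    dists maps an item name to {value: probability}. An action is a pair
    (item, theta): probe the item and keep it iff its value is >= theta.
    """

    @lru_cache(maxsize=None)
    def best(chosen, remaining, probes_left):
        if chosen == k or probes_left == 0 or not remaining:
            return 0.0
        result = 0.0
        for i in remaining:
            rest = remaining - {i}
            # Only support points matter as thresholds for a discrete item.
            for theta in dists[i]:
                expected = 0.0
                for x, p in dists[i].items():
                    if x >= theta:  # commit: collect x and use one slot
                        expected += p * (x + best(chosen + 1, rest,
                                                  probes_left - 1))
                    else:           # discard forever
                        expected += p * best(chosen, rest, probes_left - 1)
                result = max(result, expected)
        return result

    return best(0, frozenset(dists), m)


# Toy instance (hypothetical): with k = 1 and two probes, it is better to
# probe the risky item first with a high threshold, keeping the safe item
# as a fallback (about 7 in expectation).
TOY = {"A": {0: 0.5, 10: 0.5}, "B": {4: 1.0}}
print(committed_probetop_k(TOY, m=2, k=1))
```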
1.2.3 Committed Pandora’s Box Problem
For Weitzman’s “Pandora’s box” problem, we are given boxes. For each box , the probing cost is deterministic and the value is an independent random variable with a known (discrete) distribution . Opening a box incurs a cost of . When we open the box , its value is realized, which is a sample from the distribution . The goal is to adaptively open a subset of boxes to maximize the expected profit, i.e., the expected maximum value among the opened boxes minus the total opening cost. Weitzman provided an elegant optimal adaptive strategy, which can be computed in polynomial time. Recently, Singla generalized this model to other combinatorial optimization problems such as matching, set cover and so on.
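The text above does not spell out Weitzman's strategy. As background, the standard description of it assigns each box a reservation value, defined as the threshold at which the expected gain above the threshold equals the opening cost, opens boxes in decreasing order of reservation value, and stops once the best value seen is at least the next reservation value. The Python sketch below follows that textbook description for the original (non-committed) problem; the function names, the bisection routine, and the assumption of an outside option worth 0 are choices made for this illustration.

```python
def reservation_value(dist, cost, iters=100):
    """Solve E[max(X - sigma, 0)] = cost for sigma by bisection.

    dist maps value -> probability; cost is assumed positive.
    """
    lo, hi = min(dist) - cost, max(dist)
    for _ in range(iters):
        mid = (lo + hi) / 2
        surplus = sum(p * max(x - mid, 0.0) for x, p in dist.items())
        if surplus > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def index_policy_value(boxes):
    """Expected profit of the reservation-value policy.

    boxes is a list of (dist, cost) pairs. The recursion enumerates all
    outcomes, so it is only suitable for tiny instances.
    """
    order = sorted(boxes, key=lambda b: reservation_value(*b), reverse=True)

    def run(idx, best_seen):
        if idx == len(order):
            return best_seen
        dist, cost = order[idx]
        if best_seen >= reservation_value(dist, cost):
            return best_seen  # no remaining box is worth opening
        return -cost + sum(p * run(idx + 1, max(best_seen, x))
                           for x, p in dist.items())

    return run(0, 0.0)


# Toy instance (hypothetical): reservation values are about 8 and 5.
BOXES = [({0: 0.5, 10: 0.5}, 1.0), ({6: 1.0}, 1.0)]
print(reservation_value(*BOXES[0]), index_policy_value(BOXES))
```

In the committed model studied in this paper, a value cannot be recalled after it has been discarded, so this index rule is shown only as a reference point for the original problem.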
In this paper, we focus on the committed model, which is mentioned in Section 1.2.2. Again, we can adaptively open the boxes and choose at most values in the committed way, where is a constant. Our goal is to design an adaptive policy such that the expected value is maximized, where is the final chosen set and is the set of opened boxes. Although the problem looks like a slight variant of Weitzman’s original problem, it is quite unlikely that we can adapt Weitzman’s argument (or any argument at all) to obtain an optimal policy in polynomial time. When , we provide the first PTAS for this problem. Note that a PTAS was not previously known even for .
When , there is a polynomial time algorithm that finds a committed policy with the expected value at least for the committed Pandora’s Box problem.
Similar to the committed ProbeTop- problem, let represent the action that we open the box with threshold . Let be the number of boxes that have been chosen at time period . Using our framework, we have the following system dynamics for the committed Pandora’s Box problem:
1.2.4 Stochastic Target Problem
İlhan et al. introduced the following stochastic target problem. (İlhan et al. called the problem the adaptive stochastic knapsack instead. However, their problem is quite different from the stochastic knapsack problem studied in the theoretical computer science literature, so we use a different name.) In this problem, we are given a predetermined target and a set of items. Each item has a value which is an independent random variable with a known (discrete) distribution . Once we decide to insert an item into a knapsack, we observe a reward realization which follows the distribution . We can insert at most items into the knapsack and our goal is to design an adaptive policy such that is maximized, where is the set of inserted items. For the stochastic target problem, İlhan et al. provided a heuristic based on dynamic programming for the special case where the random profit of each item follows a known normal distribution. In this paper, we provide an additive PTAS for the stochastic target problem when the target is relaxed to .
There exists an additive PTAS for the stochastic target problem if we relax the target to . In other words, for any given constant , there is a polynomial-time approximation algorithm that finds a policy such that the probability of the total rewards exceeding is at least , where is the resulting probability of an optimal adaptive policy.
Let the value be the total profits of the items in the knapsack at time period . Using our framework, we have the following system dynamics for the stochastic target problem:
for . Then Assumption 1 (2,3) is immediately satisfied. But Assumption 1 (1) is not satisfied because the value space is not of constant size. Hence, we need to discretize the value space and reduce its size to a constant. See Section 7 for more details.
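The stochastic target dynamics can be illustrated with a brute-force recursion whose system value is the total reward collected so far. The sketch below is an assumption-laden toy: the function name and encoding are hypothetical, rewards are assumed nonnegative (so reaching the target is absorbing), and "reaching the target" is taken to mean a total of at least the target.

```python
from functools import lru_cache


def best_target_probability(dists, m, target):
    """Maximum probability that the total reward reaches the target.

    dists maps an item name to {reward: probability}; at most m items
    may be inserted. Exponential time, intended for tiny instances only.
    """

    @lru_cache(maxsize=None)
    def best(total, remaining, left):
        if total >= target:
            return 1.0
        if left == 0 or not remaining:
            return 0.0
        return max(
            sum(p * best(total + x, remaining - {i}, left - 1)
                for x, p in dists[i].items())
            for i in remaining
        )

    return best(0, frozenset(dists), m)


# Toy instance (hypothetical): starting with C and then reacting to its
# outcome reaches the target with probability about 0.8.
TOY = {"A": {5: 1.0}, "B": {0: 0.5, 10: 0.5}, "C": {0: 0.4, 5: 0.6}}
print(best_target_probability(TOY, m=2, target=10))
```

Because the total reward ranges over many possible sums, the value space here is not of constant size, which is the obstacle that the discretization mentioned above is meant to remove.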
1.2.5 Stochastic Blackjack Knapsack
Levin et al. introduced the stochastic blackjack knapsack. In this problem, we are given a capacity and a set of items. Each item has a size which is an independent random variable with a known distribution and a profit . We can adaptively insert the items into a knapsack, as long as the capacity constraint is not violated. Our goal is to design an adaptive policy such that the expected total profit of all items inserted is maximized. The key difference from the classic stochastic knapsack is that we gain zero on overflow, i.e., we lose the profits of all items already inserted if the total size is larger than the capacity. This extra restriction may push us toward more conservative policies. Levin et al. presented a non-adaptive policy with expected value that is at least times the expected value of the optimal adaptive policy. Chen et al. assumed each size follows a known exponential distribution and gave an optimal policy for based on dynamic programming. In this paper, we provide the first bi-criteria PTAS for the problem.
For any fixed constant , there is a polynomial-time approximation algorithm for stochastic blackjack knapsack that finds a policy with the expected profit at least , when the capacity is relaxed to , where is the expected profit of the optimal adaptive policy.
Denote and let be the total sizes and total profits of the items in the knapsack at the time period respectively. When we insert an item into the knapsack and observe its size realization, say , we define the system dynamics function to be
and for . Then Assumption 1 (2,3) is immediately satisfied. But Assumption 1 (1) is not satisfied because the value space is not of constant size. Hence, we need to discretize the value space and reduce its size to a constant. See Section 8 for more details.
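The effect of the overflow rule can be seen in a small brute-force sketch in which the system value is the pair (total size, total profit) and the policy may stop at any time. The function name, the encoding of items as `(profit, size distribution)` pairs, and the toy instance are assumptions made for this illustration.

```python
from functools import lru_cache


def blackjack_knapsack(items, capacity):
    """Optimal expected profit when an overflow forfeits everything.

    items maps a name to (profit, {size: probability}). Exponential time,
    intended for tiny instances only.
    """

    @lru_cache(maxsize=None)
    def best(used, gained, remaining):
        result = gained  # option: stop now and keep the collected profit
        for i in remaining:
            profit, sizes = items[i]
            expected = 0.0
            for s, p in sizes.items():
                if used + s <= capacity:
                    expected += p * best(used + s, gained + profit,
                                         remaining - {i})
                # on overflow the total profit drops to zero, adding nothing
            result = max(result, expected)
        return result

    return best(0, 0, frozenset(items))


# Toy instance (hypothetical): inserting B first and then deciding about A
# based on B's realized size yields about 6.5, while inserting A first
# makes B too risky and yields 5.
TOY = {"A": (5, {6: 1.0}), "B": (4, {3: 0.5, 8: 0.5})}
print(blackjack_knapsack(TOY, capacity=10))
```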
For the case without relaxing the capacity, we can improve the result of in .
For any , there is a polynomial time algorithm that finds a -approximate adaptive policy for .
1.3 Related Work
Stochastic dynamic programming has been widely studied in computer science and operations research (see, for example, [4, 21]) and has many applications in different fields. It is a natural model for decision making under uncertainty. In the 1950s, Richard Bellman introduced the “principle of optimality”, which leads to dynamic programming algorithms for solving sequential stochastic optimization problems. However, Bellman’s principle does not immediately lead to efficient algorithms for many problems due to the “curse of dimensionality” and the large state space.
There are some constructive frameworks that provide approximation schemes for certain classes of stochastic dynamic programs. Shmoys et al. dealt with stochastic linear programs. Halman et al. [13, 14, 15] studied stochastic discrete DPs with scalar state and action spaces and designed an FPTAS for their framework. As one of the applications, they used it to solve the stochastic ordered adaptive knapsack problem. In comparison, the state space in our model is exponentially large, and hence our model cannot be solved by the previous frameworks.
The stochastic knapsack problem is one of the most well-studied stochastic combinatorial optimization problems. We are given a knapsack of capacity . Each item has a random value with a known distribution and a profit . We can adaptively insert the items into the knapsack, as long as the capacity constraint is not violated. The goal is to maximize the expected total profit of all items inserted. For , Dean et al. first provided a constant-factor approximation algorithm. Later, Bhalgat et al. improved that ratio to and gave an algorithm with ratio of by using extra budget for any given constant . In that paper, the authors first introduced the notion of block adaptive policies, which is crucial for this paper. The best known single-criterion approximation factor is 2 [5, 18, 19].
The Probemax problem and ProbeTop- problem are special cases of the general stochastic probing framework formulated by Gupta et al. . They showed that the adaptivity gap of any stochastic probing problem where the outer constraint is prefix-closed and the inner constraint is an intersection of matroids is at most , where is the number of items. The Bernoulli version of stochastic probing was introduced in , where each item has a fixed value and is “active” with an independent probability . Gupta et al.  presented a framework which yields a -approximation algorithm for the case when and are respectively an intersection of and matroids. This ratio was improved to by Adamczyk et al.  using the iterative randomized rounding approach. Weitzman’s Pandora’s Box is a classical example in which the goal is to find out a single random variable to maximize the utility minus the probing cost. Singla  generalized this model to other combinatorial optimization problems such as matching, set cover, facility location, and obtained approximation algorithms.
2 Policies and Decision Trees
An instance of the stochastic dynamic program is given by . For each item and values , we denote and . The process of applying a feasible adaptive policy can be represented as a decision tree . Each node on is labeled by a unique item . Before selecting the item , we denote the corresponding time index, the current value and the set of the remaining available items by and respectively. Each node has several children, each corresponding to a different value realization (one possible ). Let be the -th edge emanating from where is the realized value. We call the -child of . Thus has probability and weight .
We use to denote the expected profit that the policy can obtain. For each node on , we define . In order to clearly illustrate the tree structure, we add a dummy node at the end of each root-to-leaf path and set if is a dummy node. Then, we recursively define the expected profit of the subtree rooted at to be
if is an internal node and if is a leaf (i.e., the dummy node). The expected profit of the policy is simply . Then, according to Equation (1.2), we have
for each node . For a node , we call the path from the root to it on the realization path of , and denote it by . We denote the probability of reaching as . Then, we have
We use to denote the expected profit of the optimal adaptive policy. For each node on the tree , by Assumption 1 (2) that , we define . For a set of nodes , we define .
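The two ways of computing the expected profit of a policy tree described in this section, the bottom-up recursion over subtrees and the sum over nodes weighted by the probability of reaching them, can be checked against each other on a small example. The Python sketch below uses a hypothetical `Node` class and a hand-built Probemax-style tree; the class layout and the convention of storing the terminal profit at leaves are assumptions made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    item: str = None                           # item selected at this node
    edges: list = field(default_factory=list)  # (prob, profit, child) triples
    terminal: float = 0.0                      # terminal profit at a leaf


def subtree_profit(node):
    """Bottom-up recursion: expected profit of the subtree rooted at node."""
    if not node.edges:
        return node.terminal
    return sum(p * (g + subtree_profit(child)) for p, g, child in node.edges)


def path_form_profit(root):
    """Sum over nodes of (reach probability) x (expected local profit)."""
    total, stack = 0.0, [(root, 1.0)]
    while stack:
        node, reach = stack.pop()
        if not node.edges:
            total += reach * node.terminal
            continue
        total += reach * sum(p * g for p, g, _ in node.edges)
        stack.extend((child, reach * p) for p, _, child in node.edges)
    return total


def leaf(value):
    return Node(terminal=value)


# Hypothetical policy: probe B; after a high outcome probe the long shot C,
# after a low outcome probe the safe item A. Terminal profit = running max.
after_high = Node("C", [(0.8, 0.0, leaf(2)), (0.2, 0.0, leaf(4))])
after_low = Node("A", [(1.0, 0.0, leaf(1))])
root = Node("B", [(0.5, 0.0, after_low), (0.5, 0.0, after_high)])

# Both evaluations give about 1.7 on this tree.
print(subtree_profit(root), path_form_profit(root))
```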
Lemma 2.1. Given a policy , there is a policy with profit at least which satisfies that for any realization path , , where .
Proof. Consider a random realization path generated by . Recall in Assumption 1 (1), the value space is . For each node on the tree, we define , which is larger than
We now define a sequence of random variables :
This sequence is a martingale: conditioning on current value , we have
The last equation is due to the definition of . By the martingale property, we have for any . Thus, we have
Let be the set of realization paths on the tree for which . Then, we have which implies that , where is the probability of passing the path . For each path , let be the first node on the path such that , where is the path from the root to the node . Let be the set of such nodes. For the policy , we have a truncation on the node when we reach the node , i.e., we do not select items (including ) any more in the new policy . The total profit loss is at most
where . ∎
W.l.o.g, we assume that all (optimal or near optimal) policies considered in this paper satisfy that for any realization ,
3 Block Adaptive Policies
The decision tree corresponding to the optimal policy may be exponentially large and arbitrarily complicated. Now we consider a restricted class of policies, called block-adaptive policies. The concept was first introduced by Bhalgat et al. in the context of stochastic knapsack. Our construction is somewhat different from that in [6, 18]. Here, we need to define an order for each block and introduce the notion of approximate block policy.
Formally, a block-adaptive policy can be thought of as a decision tree . Each node on the tree is labeled by a block which is a set of items. For a block , we choose an arbitrary order for the items in the block. According to the order , we take the items one by one, until we get a bigger value or all items in the block are taken but the value does not change (recall from Assumption 1 that the value is nondecreasing). Then we visit the child block which corresponds to the realized value. We use to denote the current value right before taking the items in the block . Then for each edge , it has probability
if and if .
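The rule for processing a block (take items in a fixed order until the value increases, or until the block is exhausted with the value unchanged) determines a distribution over the outgoing edges of the block. The Python sketch below computes that distribution directly from the verbal description; the function names and the Probemax-style transition used in the example are assumptions made for this illustration.

```python
def block_edge_probabilities(current, ordered_block, transition):
    """Distribution of the value reached after processing one block.

    transition(item, value) returns {next_value: probability}, with
    next_value >= value (the value is nondecreasing). Items are taken in
    the given order until the value increases for the first time.
    """
    probs = {}
    stay = 1.0  # probability that no earlier item changed the value
    for item in ordered_block:
        dist = transition(item, current)
        for nxt, p in dist.items():
            if nxt > current:
                probs[nxt] = probs.get(nxt, 0.0) + stay * p
        stay *= dist.get(current, 0.0)
    probs[current] = stay  # all items taken, value unchanged
    return probs


def max_transition(dists):
    """Probemax-style dynamics: the value is the running maximum."""
    def transition(item, value):
        out = {}
        for x, p in dists[item].items():
            nxt = max(value, x)
            out[nxt] = out.get(nxt, 0.0) + p
        return out
    return transition


# Toy block (hypothetical). The two orders give different edge
# probabilities, e.g. the edge to value 4 has probability about 0.1 in
# the order (B, C) and 0.2 in the order (C, B).
DISTS = {"B": {0: 0.5, 2: 0.5}, "C": {0: 0.8, 4: 0.2}}
trans = max_transition(DISTS)
print(block_edge_probabilities(0, ["B", "C"], trans))
print(block_edge_probabilities(0, ["C", "B"], trans))
```

The dependence on the order that shows up in this toy example is what makes the exact edge probabilities awkward to work with.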
Similar to Equation (2.1), for each block and an arbitrary order for , we recursively define the expected profit of the subtree rooted at to be
if is an internal block and if is a leaf (i.e., the dummy node). Here is the expected profit we can get from the block which is equal to
Since the profit and the probability are dependent on the order and thus difficult to deal with, we define the approximate block profit and the approximate probability which do not depend on the choice of the specific order :
if and if . Then we recursively define the approximate profit
if is an internal block and if is a leaf. For each block , we define . Lemma 3.1 below can be used to bound the gap between the approximate profit and the original profit if the policy satisfies the following property. Then it suffices to consider the approximate profit for a block adaptive policy in this paper.
(P1) Each block with more than one item satisfies that .
Lemma 3.1. For any block-adaptive policy satisfying Property (P1), we have
The right-hand side of this lemma can be proved by induction: for each block on the decision tree, we have
If is a leaf, we have which implies that Equation (3.4) holds. For an internal block , by Property (P1), we have
if has more than one item and if has only one item. For each edge , we have . Then, by induction, we have
To prove the left-hand side of the lemma, we use Equation (2.2):
where is the probability of reaching the block . For each edge , if or has only one item, we have . Otherwise, we have
Then, for each block and its realization path , we have
where the last inequality holds because the value is nondecreasing and . Thus we have
3.1 Constructing a Block Adaptive Policy
In this section, we show that there exists a block-adaptive policy that approximates the optimal adaptive policy. In order to prove this, from an optimal (or nearly optimal) adaptive policy , we construct a block adaptive policy which satisfies certain nice properties and can obtain almost as much profit as does. Thus it is sufficient to restrict our search to the block-adaptive policies. The construction is similar to that in .
An optimal policy can be transformed into a block adaptive policy with approximate expected profit at least . Moreover, the block-adaptive policy satisfies Properties (P1) and (P2):
(P1) Each block with more than one item satisfies that .
(P2) There are at most blocks on any root-to-leaf path on the decision tree.
For a node on the decision tree and a value , we use to denote the -child of , which is the child of corresponding to the realized value . We say an edge is non-increasing if and define the leftmost path of to be the realization path which starts at , ends at a leaf, and consists of only the non-increasing edges.
We say a node is a starting node if is the root or corresponds to an increasing value of its parent (i.e., ). For each starting node , we greedily partition the leftmost path of into several segments such that for any two nodes in the same segment and for any value , we have
Since is at most for each root-to-leaf path by Lemma 2.1, the second inequality in (3.5) can yield at most blocks. Now focus on the first inequality in (3.5). Fix a particular leftmost path from a starting node on . For each value , we have
Otherwise, replacing the subtree with increases the profit of the policy for some if . Thus, for each particular size , we could cut the path at most times. Since , we have at most segments on the leftmost path . Now, fix a particular root-to-leaf path. Since the value is nondecreasing by Assumption 1 (2), there are at most starting nodes on the path. Thus the first inequality in (3.5) can yield at most segments on the root-to-leaf path. In total, there are at most segments on any root-to-leaf path on the decision tree.
Now, we are ready to describe the algorithm, which takes a policy as input and outputs a block adaptive policy . For each node , we denote its segment and use to denote the last node in . In Algorithm 1, we can see that the set of items which the policy attempts to take always corresponds to some realization path in the original policy . Properties (P1) and (P2) follow immediately from the partition argument. Now we show that the expected profit that the new policy can obtain is at least .
Our algorithm deviates from the policy the first time a node in the segment makes a transition to an increasing value, say . In this case, would visit , the -child of , and follow from then on. But our algorithm visits , the -child of (i.e., the last node of ), and follows . The expected profit gap in each such event can be bounded by
due to the first inequality in Equation (3.5). Suppose pays such a profit loss, and switches to visit . Then, and our algorithm always stay at the same node. Note that there are at most starting nodes on any root-to-leaf path. Thus pays at most times in any realization. Therefore, the total profit loss is at most . By Lemma 3.1, we have
3.2 Enumerating Signatures
To search for the (nearly) optimal block-adaptive policy, we want to enumerate all possible structures of the block decision tree. Fixing the topology of the decision tree, we need to decide the subset of items to place in each block. To do this, we define the signature such that two subsets with the same signature have approximately the same profit distribution. Then, we can enumerate the signatures of all blocks in polynomial time and find a nearly optimal block-adaptive policy. Formally, for an item and a value , we define the signature of on to be the following vector
for any . (If is unknown, for several concrete problems (e.g., Probemax), we can get a constant approximation result for , which is sufficient for our purpose. In general, we can guess a constant approximation result for using binary search.) For a block of items, we define the signature of on to be
Consider two decision trees corresponding to block-adaptive policies with the same topology (i.e., and are isomorphic) and the two block adaptive policies satisfy Properties (P1) and (P2). If for each block on , the block at the corresponding position on satisfies that where , then