# Planning and Learning with Stochastic Action Sets

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution.

## 1 Introduction

Markov decision processes (MDPs) are the standard model for sequential decision making under uncertainty, and provide the foundations for reinforcement learning (RL). With the recent emergence of RL as a practical AI technology in combination with deep learning [12, 13], new use cases are arising that challenge basic MDP modeling assumptions. One such challenge is that many practical MDP and RL problems have stochastic sets of feasible actions; that is, the set $A_s$ of feasible actions at state $s$ varies stochastically with each visit to $s$. For instance, in online advertising, the set of available ads differs at distinct occurrences of the same state (e.g., same query, user, contextual features), due to exogenous factors like campaign expiration or budget throttling. In recommender systems with large item spaces, often a set of candidate recommendations is first generated, from which top scoring items are chosen; exogenous factors often induce non-trivial changes in the candidate set. With the recent application of MDP and RL models in ad serving and recommendation [4, 9, 3, 2, 1, 18, 19, 11], understanding how to capture the stochastic nature of available action sets is critical.

Somewhat surprisingly, this problem seems to have been largely unaddressed in the literature. Standard MDP formulations [17] allow each state $s$ to have its own feasible action set $A_s$, and it is not uncommon to allow the set to be non-stationary or time-dependent. However, they do not support the treatment of $A_s$ as a random variable. In this work, we: (a) introduce the stochastic action set MDP (SAS-MDP) and provide its theoretical foundations; (b) describe how to account for stochastic action sets in model-free RL (e.g., Q-learning); and (c) develop tractable algorithms for solving SAS-MDPs in important special cases.

An obvious way to treat this problem is to embed the set of available actions into the state itself. This provides a useful analytical tool, but it does not immediately provide tractable algorithms for learning and optimization, since each state is augmented with all possible subsets of actions, incurring an exponential blow up in state space size. To address this issue, we show that SAS-MDPs possess an important property: the Q-value of an available action is independent of the availability of other actions. This allows us to prove that optimal policies can be represented compactly using (state-specific) decision lists (or orderings) over the action set.

This special structure allows one to solve the SAS RL problem effectively using, for example, Q-learning. We also devise model-based algorithms that exploit this policy structure. We develop value and policy iteration schemes, showing they converge in a polynomial number of iterations (w.r.t. the size of the underlying “base” MDP). We also show that per-iteration complexity is polynomial time for two important special forms of action availability distribution: (a) when action availabilities are independent, both methods are exact; (b) when the distribution over sets is sampleable, we obtain approximation algorithms with polynomial sample complexity. In fact, policy iteration is strongly polynomial under additional assumptions (for a fixed discount factor). We show that a linear program for SAS-MDPs can be solved in polynomial time as well. Finally, we offer a simple empirical demonstration of the importance of accounting for stochastic action availability when computing an MDP policy.

Additional discussion and full proofs of all results can be found in a longer version of this paper [sasmdps_full:arxiv18].

## 2 MDPs with Stochastic Action Sets

We first introduce SAS-MDPs and provide a simple example illustrating how action availability impacts optimal decisions. See [17] for more background on MDPs.

### 2.1 The SAS-MDP Model

Our formulation of MDPs with Stochastic Action Sets (SAS-MDPs) derives from a standard, finite-state, finite-action MDP (the base MDP) $M$, with states $S$, base actions $B_s$ for $s \in S$, and transition and reward functions, $P$ and $R$. We use $p^k_{s,s'}$ and $r^k_s$ to denote the probability of transition to $s'$ and the accrued reward, respectively, when action $k$ is taken at state $s$. For notational ease, we assume that feasible action sets for each $s$ are identical, so $B_s = B$ (allowing distinct base sets at different states has no impact on what follows). Let $n = |S|$ and $m = |B|$. We assume an infinite-horizon, discounted objective with fixed discount rate $\gamma$, $0 \le \gamma < 1$.

In a SAS-MDP, the set of actions available at state $s$ at any stage is a random subset $A_s \subseteq B$. We assume a family of action availability distributions $P_s$, defined over the powerset of $B$. These can depend on $s$ but are otherwise history-independent, hence $P_s(A_s)$ denotes the probability that $A_s$ is the realized available set. Only actions in the realized available action set can be executed at that stage. Apart from this, the dynamics of the MDP is unchanged: when an (available) action is taken, state transitions and rewards are prescribed as in the base MDP. In what follows, we assume that some action is always available, i.e., $P_s(\emptyset) = 0$ for all $s$ (models that trigger process termination when $A_s = \emptyset$ are well-defined, but we set aside this model variant here). Note that a SAS-MDP does not conform to the usual definition of an MDP.

### 2.2 Example

The following simple MDP shows the importance of accounting for stochastic action availability when making decisions. The MDP has two states, $s_1$ and $s_2$. Assume the agent starts at state $s_1$, where two actions are always available: one ($a$) stays at $s_1$, and the other ($b$) transitions to state $s_2$, both with reward $p$. At $s_2$, the action $b'$ returns to $s_1$, is always available, and has reward 0. A second action $a'$ also returns to $s_1$, but is available only with probability $q$ and has reward 1.

A naive solution that ignores action availability is as follows: we first compute the optimal Q-function assuming all actions are always available (this can be derived from the optimal value function of the base MDP, computed using standard techniques). Then at each stage, we use the best action available at the current state, where actions are ranked by Q-value. Unfortunately, this leads to a suboptimal policy when the action $a'$ has low availability, specifically if $q < p$.

The best naive policy always chooses to move to $s_2$ from $s_1$; at $s_2$, it picks the best action available. This yields a reward of $p$ at even stages and an expected reward of $q$ at odd stages. However, by anticipating the possibility that action $a'$ is unavailable at $s_2$, the optimal (SAS) policy always stays at $s_1$, obtaining reward $p$ at all stages. For $q < p$, the latter policy dominates the former, and the fraction of the optimal (SAS) value lost by the naive policy grows as the availability probability $q$ shrinks. This example also illustrates that as action availability probabilities approach 1, the optimal policy for the base MDP is also optimal for the SAS-MDP.
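This comparison can be checked numerically. The sketch below uses the reconstruction of the example adopted here (reward $p$ for both always-available actions at $s_1$, reward 1 for the sporadically available action at $s_2$, availability $q$); the specific constants are illustrative assumptions, not taken from the paper.

```python
# Closed-form values of the two candidate policies in the two-state example.

def v_stay(p, gamma):
    # SAS-optimal candidate: stay at s1 forever, earning reward p each stage.
    return p / (1 - gamma)

def v_naive(p, q, gamma):
    # Naive policy: move to s2 (reward p), take the best available action
    # at s2 (expected reward q), return to s1, and repeat.
    return (p + gamma * q) / (1 - gamma ** 2)

gamma, p = 0.9, 0.5
# Staying dominates exactly when q < p, independently of gamma.
assert v_stay(p, gamma) > v_naive(p, 0.3, gamma)   # q = 0.3 < p
assert v_stay(p, gamma) < v_naive(p, 0.8, gamma)   # q = 0.8 > p
```

Algebraically, $p/(1-\gamma) > (p + \gamma q)/(1-\gamma^2)$ reduces to $p > q$, matching the condition in the text.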

### 2.3 Related Work

While a general formulation of MDPs with stochastic action availability does not appear in the literature, there are two strands of closely related work. In the bandits literature, sleeping bandits are defined as bandit problems in which the arms available at each stage are determined randomly or adversarially (sleeping experts are similar, with complete feedback being provided rather than bandit feedback) [8, 7]. Best action orderings (analogous to our decision list policies for SAS-MDPs) are often used to define regret in these models. The goal is to develop exploration policies that minimize regret. Since these models have no state, if the action reward distributions are known, the optimal policy is trivial: always take the best available action. By contrast, a SAS-MDP, even with a known model, induces a difficult optimization problem, since the quality of an action depends not just on its immediate reward, but also on the availability of actions at reachable (future) states. This is our focus.

The second closely related branch of research comes from the field of stochastic routing. The “Canadian Traveller Problem”—the problem of minimizing travel time in a graph with unavailable edges—was introduced by Papadimitriou and Yannakakis [15], who gave intractability results (under much weaker assumptions about edge availability, e.g., adversarial). Polychronopoulos and Tsitsiklis [16] consider a stochastic version of the problem, where edge availabilities are random but static (and any edge observed to be unavailable remains so throughout the scenario). Most similar to our setting is the work of Nikolova and Karger [14], who discuss the case of resampling edge costs at each node visit; however, the proposed solution is well-defined only when the edge costs are finite and does not easily extend to unavailable actions/infinite edge costs. Due to the specificity of their modeling assumptions, none of the solutions found in this line of research can be adapted in a straightforward way to SAS-MDPs.

## 3 Two Reformulations of SAS-MDPs

The randomness of feasible actions means that SAS-MDPs do not conform to the usual definition of an MDP. In this section, we develop two reformulations of SAS-MDPs that transform them into MDPs. We discuss the relative advantages of each, outline key properties and relationships between these models, and describe important special cases of the SAS-MDP model itself.

### 3.1 The Embedded MDP

We first consider a reformulation of the SAS-MDP in which we embed the (realized) available action set into the state space itself. This is a straightforward way to recover a standard MDP. The embedded MDP $M_e$ for a SAS-MDP has state space $S_e = \{s \circ A : s \in S, A \subseteq B\}$, with $s \circ A$ having feasible action set $A$ (embedded states whose embedded action subsets have zero probability are unreachable and can be ignored). The history independence of $P_s$ allows transitions to be defined as:

$$p^k_{s \circ A,\, s' \circ A'} = P(s' \circ A' \mid s \circ A, k) = p^k_{s,s'}\, P_{s'}(A'), \quad \forall k \in A.$$

Rewards are defined similarly: $r^k_{s \circ A} = r^k_s$ for $k \in A$.

In our earlier example, the embedded MDP has three states: $s_1 \circ \{a, b\}$, $s_2 \circ \{a', b'\}$, and $s_2 \circ \{b'\}$ (other action subsets have probability 0, hence their corresponding embedded states are unreachable). The feasible actions at each state are given by the embedded action set, and the only stochastic transition occurs when $b$ is taken at $s_1 \circ \{a, b\}$: it moves to $s_2 \circ \{a', b'\}$ with probability $q$ and to $s_2 \circ \{b'\}$ with probability $1 - q$.

Clearly, the induced reward process and dynamics are Markovian, hence $M_e$ is in fact an MDP under the usual definition. Given the natural translation afforded by the embedded MDP, we view this as providing the basic “semantic” underpinnings of the SAS-MDP model. This translation affords the use of standard MDP analytical tools and methods.

A (stationary, deterministic, Markovian) policy $\pi$ for $M_e$ is restricted so that $\pi(s \circ A) \in A$. The policy backup operator $T^\pi_e$ and Bellman operator $T^*_e$ for $M_e$ decompose naturally as follows:

$$T^\pi_e V_e(s \circ A_s) = r^{\pi(s \circ A_s)}_s + \gamma \sum_{s'} p^{\pi(s \circ A_s)}_{s,s'} \sum_{A_{s'} \subseteq B} P_{s'}(A_{s'})\, V_e(s' \circ A_{s'}), \tag{1}$$

$$T^*_e V_e(s \circ A_s) = \max_{k \in A_s}\, r^k_s + \gamma \sum_{s'} p^k_{s,s'} \sum_{A_{s'} \subseteq B} P_{s'}(A_{s'})\, V_e(s' \circ A_{s'}). \tag{2}$$

Their fixed points, $V^\pi_e$ and $V^*_e$ respectively, can be expressed similarly.

Obtaining an MDP from an SAS-MDP via action-set embedding comes at the expense of a (generally) exponential blow-up in the size of the state space, which can increase by a factor of up to $2^m$.

### 3.2 The Compressed MDP

The embedded MDP provides a natural semantics for SAS-MDPs, but is problematic from an algorithmic and learning perspective given the state space blow-up. Fortunately, the history independence of the availability distributions gives rise to an effective, compressed representation. The compressed MDP $M_c$ recasts the embedded MDP in terms of the original state space, using expectations to express value functions, policies, and backups over $S$ rather than over the (exponentially larger) $S_e$. As we will see below, the compressed MDP induces a blow-up in action space rather than state space, but offers significant computational benefits.

Formally, the state space for $M_c$ is $S$. To capture action availability, the feasible action set for $s$ is the set of state policies, or mappings $\mu_s : 2^B \to B$ satisfying $\mu_s(A_s) \in A_s$. In other words, once we reach $s$, $\mu_s$ dictates what action to take for any realized action set $A_s$. A policy for $M_c$ is a family of such state policies, $\mu = \{\mu_s : s \in S\}$. Transitions and rewards use expectations over $A_s$:

$$p^{\mu_s}_{s,s'} = \sum_{A_s \subseteq B} P_s(A_s)\, p^{\mu_s(A_s)}_{s,s'} \quad\text{and}\quad r^{\mu_s}_s = \sum_{A_s \subseteq B} P_s(A_s)\, r^{\mu_s(A_s)}_s.$$

In our earlier example, the compressed MDP has only two states, $s_1$ and $s_2$. Focusing on $s_2$, its “actions” in the compressed MDP are the set of state policies, or mappings from the realizable available sets into action choices (as above, we ignore unrealizable action subsets). In this case, there are two such state policies: the first selects $a'$ for $\{a', b'\}$ and (obviously) $b'$ for $\{b'\}$; the second selects $b'$ for both $\{a', b'\}$ and $\{b'\}$.
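For small $B$, the compressed transitions and rewards can be computed by brute-force enumeration of the realizable subsets, as in the sketch below. The data layout (a dictionary of subset probabilities and a state-policy function) and the example constants are illustrative assumptions.

```python
# Compute compressed dynamics p^{mu_s}_{s,s'} and rewards r^{mu_s}_s by
# enumerating the realizable action subsets at a state.

def compressed_dynamics(P_s, mu_s, p, r):
    """P_s: maps frozenset of actions -> probability; mu_s: state policy,
    maps an available set to a chosen action; p[k]: list of transition
    probabilities over base states; r[k]: base reward. Returns (p_mu, r_mu)."""
    n = len(next(iter(p.values())))
    p_mu, r_mu = [0.0] * n, 0.0
    for A, prob in P_s.items():
        k = mu_s(A)
        assert k in A                 # a state policy must pick an available action
        r_mu += prob * r[k]
        for s2 in range(n):
            p_mu[s2] += prob * p[k][s2]
    return p_mu, r_mu

# State s2 of the running example: a' (reward 1) is available w.p. q,
# b' (reward 0) is always available; both return to state 0 (= s1).
q = 0.75
P_s = {frozenset({"a'", "b'"}): q, frozenset({"b'"}): 1 - q}
mu_s = lambda A: "a'" if "a'" in A else "b'"    # the first state policy above
p = {"a'": [1.0, 0.0], "b'": [1.0, 0.0]}
r = {"a'": 1.0, "b'": 0.0}
p_mu, r_mu = compressed_dynamics(P_s, mu_s, p, r)
# Expected reward is q * 1 = 0.75; all transition mass goes to s1.
```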

It is not hard to show that the dynamics and reward process defined above over this compressed state space and expanded action set (i.e., the set of state policies) are Markovian. Hence we can define policies, value functions, optimality conditions, and policy and Bellman backup operators in the usual fashion. For instance, the Bellman and policy backup operators, $T^*_c$ and $T^\mu_c$, on compressed value functions are:

$$T^*_c V_c(s) = \mathbb{E}_{A_s \subseteq B}\Big[\max_{k \in A_s}\, r^k_s + \gamma \sum_{s'} p^k_{s,s'}\, V_c(s')\Big], \tag{3}$$

$$T^\mu_c V_c(s) = \mathbb{E}_{A_s \subseteq B}\Big[r^{\mu_s(A_s)}_s + \gamma \sum_{s'} p^{\mu_s(A_s)}_{s,s'}\, V_c(s')\Big]. \tag{4}$$

It is easy to see that any state policy $\mu$ induces a Markov chain over base states, hence we can define a standard $n \times n$ transition matrix $P^\mu$ for such a policy in the compressed MDP, where $P^\mu(s, s') = p^{\mu_s}_{s,s'}$. When additional independence assumptions hold, this expectation over subsets can be computed efficiently (see Section 3.4).

Critically, we can show that there is a direct “equivalence” between policies and their value functions (including optimal policies and values) in $M_e$ and $M_c$. Define the action-expectation operator $E$ to be a mapping that compresses a value function $V_e$ for $M_e$ into a value function for $M_c$:

$$V^c_e(s) = EV_e(s) = \mathbb{E}_{A_s \subseteq B}\, V_e(s \circ A_s) = \sum_{A_s \subseteq B} P_s(A_s)\, V_e(s \circ A_s).$$

We emphasize that $E$ transforms an (arbitrary) value function in embedded space into a new value function defined in compressed space (hence, $E$ is defined for arbitrary $V_e$, not only for value functions of policies).

###### Lemma 1

$E T_e V_e = T_c E V_e$ for any value function $V_e$. Hence, $T_c$ has a unique fixed point, $V^*_c = E V^*_e$.

###### Proof.
$$\begin{aligned}
E T_e V_e(s) &= \mathbb{E}_{A \subseteq B}\, T_e V_e(s \circ A)\\
&= \mathbb{E}_{A \subseteq B} \max_{k \in A}\Big[ r^k_s + \gamma \sum_{s' \circ A'} p^k_{s \circ A,\, s' \circ A'}\, V_e(s' \circ A')\Big]\\
&= \mathbb{E}_{A \subseteq B} \max_{k \in A}\Big[ r^k_s + \gamma \sum_{s'} p^k_{s,s'}\, \mathbb{E}_{A' \subseteq B}\, V_e(s' \circ A')\Big]\\
&= \mathbb{E}_{A \subseteq B} \max_{k \in A}\Big[ r^k_s + \gamma \sum_{s'} p^k_{s,s'}\, E V_e(s')\Big]\\
&= T_c E V_e(s).
\end{aligned}$$

###### Lemma 2

Given the optimal value function $V^*_c$ for $M_c$, the optimal policy for $M_e$ can be constructed directly. Specifically, for any embedded state $s \circ A_s$, the optimal action and optimal value at that embedded state can be computed in polynomial time.

Proof Sketch: Given $V^*_c$, the expected value of each action $k \in A_s$ can be computed using a one-step backup of $V^*_c$. Then $\pi^*_e(s \circ A_s)$ is the action with maximum value, and $V^*_e(s \circ A_s)$ is its backed-up expected value.

Therefore, it suffices to work directly with the compressed MDP, which allows one to use value functions (and Q-functions) over the original state space. The price is that one needs to use state policies, since the best action at $s$ depends on the available set $A_s$. In other words, while the embedded MDP causes an exponential blow-up in state space, the compressed MDP causes an exponential blow-up in action space. We now turn to assumptions that allow us to effectively manage this action-space blow-up.

### 3.3 Decision List Policies

The embedded and compressed MDPs do not, prima facie, offer much computational or representational advantage, since they rely on an exponential increase in the size of the state space (embedded MDP) or decision space (compressed MDP). Fortunately, SAS-MDPs have optimal policies with a useful, concise form. We first focus on the policy representation itself, then describe the considerable computational leverage it provides.

A decision list (DL) policy is a type of policy for $M_e$ that can be expressed compactly, using $O(nm)$ space, and executed efficiently. Let $\Sigma_B$ be the set of permutations over the base action set $B$. A DL policy $\sigma$ associates a permutation $\sigma_s \in \Sigma_B$ with each state, and is executed at embedded state $s \circ A$ by executing the first action in $\sigma_s$ that appears in $A$. In other words, whenever base state $s$ is encountered and $A$ is the available set, the first available action in the order dictated by DL $\sigma_s$ is executed. Equivalently, we can view $\sigma_s$ as a state policy for $s$ in $M_c$. In our earlier example, one DL is $\langle a', b' \rangle$, which requires taking (base) action $a'$ if it is available, otherwise taking $b'$.
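Executing a DL policy is a one-line scan, sketched below (the representation of a permutation as a Python list is an assumed convention):

```python
# Execute a decision list at one state: take the first action in the
# state's permutation that appears in the realized available set.

def dl_action(sigma_s, available):
    for k in sigma_s:
        if k in available:
            return k
    # Unreachable under the standing assumption P_s(emptyset) = 0.
    raise ValueError("empty available set")

# The DL <a', b'> from the example: take a' if available, else b'.
assert dl_action(["a'", "b'"], {"a'", "b'"}) == "a'"
assert dl_action(["a'", "b'"], {"b'"}) == "b'"
```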

For any SAS-MDP, we have optimal DL policies:

###### Theorem 1

$M_e$ has an optimal policy that can be represented using a decision list. The same policy is optimal for the corresponding $M_c$.

Proof Sketch: Let $V^*_e$ be the (unique) optimal value function for $M_e$ and $Q^*_e$ its corresponding Q-function (see Sec. 5.1 for a definition). A simple inductive argument shows that no DL policy is optimal only if there are some state $s$, action sets $A, A'$, and (base) actions $k, k'$, s.t. (i) $k, k' \in A \cap A'$; (ii) $\pi^*(s \circ A) = k$ and $\pi^*(s \circ A') = k'$ for some optimal policy $\pi^*$; and (iii) $Q^*_e(s \circ A, k) > Q^*_e(s \circ A, k')$ and $Q^*_e(s \circ A', k') > Q^*_e(s \circ A', k)$. However, the fact that the optimal Q-value of an available action at state $s$ is independent of the other actions in the available set (i.e., it depends only on the base state) implies that these conditions are mutually contradictory.

### 3.4 The Product Distribution Assumption

The DL form ensures that optimal policies and value functions for SAS-MDPs can be expressed polynomially in the size of the base MDP $M$. However, their computation still requires expectations over action subsets, e.g., in Bellman or policy backups (Eqs. 3–4). This will generally be infeasible without some assumptions on the form of the action availability distributions $P_s$.

One natural assumption is the product distribution assumption (PDA). PDA holds when $P_s$ is a product distribution in which each action $k$ is available with probability $\rho^k_s$, so subset $A \subseteq B$ has probability $P_s(A) = \prod_{k \in A} \rho^k_s \prod_{k \notin A} (1 - \rho^k_s)$. This assumption is a reasonable approximation in the settings discussed above, where state-independent exogenous processes determine the availability of actions (e.g., the probability that one advertiser’s campaign has budget remaining is roughly independent of another advertiser’s). For ease of notation, we assume that $\rho^k_s$ is identical for all states (allowing different availability probabilities across states has no impact on what follows). To ensure the MDP is well-founded, we assume some default action (e.g., no-op) is always available (we omit the default action from the analysis for ease of exposition). Our earlier running example trivially satisfies PDA: at $s_2$, $a'$’s availability probability ($q$) is independent of the availability of $b'$ (1).

When the PDA holds, the DL form of policies allows the expectations in policy and Bellman backups to be computed efficiently, without enumeration of subsets of $B$. For example, given a fixed DL policy $\mu$, writing $\mu(s)(i)$ for the $i$th action in the list at $s$, we have

$$T^\mu_c V_c(s) = \sum_{i=1}^{m} \Big[\prod_{j=1}^{i-1}\big(1 - \rho^{\mu(s)(j)}_s\big)\Big]\, \rho^{\mu(s)(i)}_s \Big(r^{\mu(s)(i)}_s + \gamma \sum_{s'} p^{\mu(s)(i)}_{s,s'}\, V_c(s')\Big). \tag{5}$$

The Bellman operator has a similar form. We exploit this below to develop tractable value iteration and policy iteration algorithms, as well as a practical LP formulation.
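A concrete sketch of the Eq. 5 backup: walk down the decision list, weighting each action's backed-up value by the probability that it is the first available one. The data layout (`r[s][k]`, `p[s][k][s']`, `rho[k]`) and the example constants are illustrative assumptions.

```python
# PDA policy backup (Eq. 5) for a fixed decision list at one state.

def pda_policy_backup(s, dl, rho, r, p, V, gamma):
    """dl: permutation of actions for state s; rho[k]: availability
    probability; r[s][k], p[s][k][s']: base model; V: compressed values."""
    val, none_before = 0.0, 1.0   # prob. that no earlier action was available
    for k in dl:
        backed_up = r[s][k] + gamma * sum(
            p[s][k][s2] * V[s2] for s2 in range(len(V)))
        val += none_before * rho[k] * backed_up
        none_before *= 1.0 - rho[k]
    return val

# State s2 of the running example (assumed constants q = 0.75):
rho = {"a'": 0.75, "b'": 1.0}
r = {1: {"a'": 1.0, "b'": 0.0}}
p = {1: {"a'": [1.0, 0.0], "b'": [1.0, 0.0]}}
val = pda_policy_backup(1, ["a'", "b'"], rho, r, p, V=[0.0, 0.0], gamma=0.9)
# From V = 0, one backup returns the expected immediate reward q * 1 = 0.75.
```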

### 3.5 Arbitrary Distributions with Sampling (ADS)

We can also handle the case where, at each state, the availability distribution is unknown, but is sampleable. In the longer version of the paper [sasmdps_full:arxiv18], we show that samples can be used to approximate expectations w.r.t. available action subsets, and that the required sample size is polynomial in the size of the base MDP and the desired accuracy, and not in the size of the support of the distribution.

Of course, when we discuss algorithms for policy computation, this approach does not allow us to compute the optimal policy exactly. However, it has important implications for sample complexity of learning algorithms like Q-learning. We note that the ability to sample available action subsets is quite natural in many domains. For instance, in ad domains, it may not be possible to model the process by which eligible ads are generated (e.g., involving specific and evolving advertiser targeting criteria, budgets, frequency capping, etc.). But the eligible subset of ads considered for each impression opportunity is an action-subset sampled from this process.

Under ADS, we compute approximate backup operators as follows. Let $A^{(1)}_s, \ldots, A^{(T)}_s$ be an i.i.d. sample of size $T$ of action subsets in state $s$. For a subset of actions $A$, an index $i$, and a decision list $\mu(s)$, define $I[i, A, \mu(s)]$ to be 1 if $\mu(s)(i) \in A$ and $\mu(s)(j) \notin A$ for each $j < i$, and 0 otherwise. Similar to Eq. 5, we define:

$$T^\mu_c V_c(s) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{m} I\big[i, A^{(t)}_s, \mu(s)\big] \Big(r^{\mu(s)(i)}_s + \gamma \sum_{s'} p^{\mu(s)(i)}_{s,s'}\, V_c(s')\Big).$$
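The indicator $I$ simply picks out the first DL action present in each sampled subset, so the sample-average backup reduces to a short loop. The sketch below assumes the backed-up Q-values have already been computed; all names are illustrative.

```python
# ADS sample-average backup: average, over sampled subsets, the
# backed-up value of the first decision-list action present in each.

def ads_backup(sample_sets, dl, q_vals):
    total = 0.0
    for A in sample_sets:
        for k in dl:            # the index i with I[i, A, dl] = 1
            if k in A:
                total += q_vals[k]
                break
    return total / len(sample_sets)

# Two samples, DL <a', b'>, backed-up values a' = 2, b' = 1:
# a' is present in one sample, absent in the other, so the
# estimate is (2 + 1) / 2 = 1.5.
assert ads_backup([{"a'", "b'"}, {"b'"}], ["a'", "b'"],
                  {"a'": 2.0, "b'": 1.0}) == 1.5
```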

In the sequel, we focus largely on PDA; in most cases equivalent results can be derived in the ADS model.

## 4 Q-Learning with the Compressed MDP

Suppose we are faced with learning the optimal value function or policy for an SAS-MDP from a collection of trajectories. The (implicit) learning of the transition dynamics and rewards can proceed as usual; the novel aspect of the SAS model is that the action availability distribution must also be considered. Remarkably, Q-learning can be readily augmented to incorporate stochastic action sets: we require only that our training trajectories are augmented with the set of actions that were available at each state,

$$\ldots\, s^{(t)}, A^{(t)}, k^{(t)}, r^{(t)}, s^{(t+1)}, A^{(t+1)}, k^{(t+1)}, r^{(t+1)}, \ldots$$

where: $s^{(t)}$ is the realized state at time $t$; $A^{(t)}$ is the realized available set at time $t$, drawn from $P_{s^{(t)}}$; $k^{(t)} \in A^{(t)}$ is the action taken; and $r^{(t)}$ is the realized reward. Such augmented trajectory data is typically available. In particular, the required sampling of available action sets is usually feasible (e.g., in ad serving as discussed above).

SAS-Q-learning can be applied directly to the compressed MDP $M_c$, requiring only a minor modification of the standard Q-learning update for the base MDP. We simply require that each Q-update maximize over the realized available actions $A^{(t+1)}$:

$$Q_{\mathrm{new}}(s^{(t)}, k^{(t)}) \leftarrow (1 - \alpha_t)\, Q_{\mathrm{old}}(s^{(t)}, k^{(t)}) + \alpha_t \Big[ r^{(t)} + \gamma \max_{k \in A^{(t+1)}} Q_{\mathrm{old}}(s^{(t+1)}, k)\Big].$$

Here $Q_{\mathrm{old}}$ is the previous Q-function estimate and $Q_{\mathrm{new}}$ is the updated estimate (thus the update encompasses both online and batch Q-learning, experience replay, etc.), and $\alpha_t$ is our (adaptive) learning rate.

It is straightforward to show that, under the usual exploration conditions, SAS-Q-learning will converge to the optimal Q-function for the compressed MDP, since the expected maximum over sampled action sets at any particular state will converge to the expected maximum at that state.
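A minimal tabular sketch of SAS-Q-learning on the running example: the only change from standard Q-learning is that the target's max ranges over the realized available set. The environment constants ($p = 0.5$, $q = 0.75$, so that moving to $s_2$ is optimal) are illustrative assumptions.

```python
import random

def sas_q_update(Q, s, k, r, s_next, A_next, alpha, gamma):
    # Target max ranges only over the realized available set A_next.
    target = r + gamma * max(Q[(s_next, kk)] for kk in A_next)
    Q[(s, k)] = (1 - alpha) * Q[(s, k)] + alpha * target

def sample_actions(s, q_avail):
    # s1 (state 0): both actions always available; s2: a' available w.p. q.
    if s == 0:
        return {"stay", "go"}
    return {"a'", "b'"} if random.random() < q_avail else {"b'"}

random.seed(0)
p_rew, q_avail, gamma = 0.5, 0.75, 0.9
Q = {(0, "stay"): 0.0, (0, "go"): 0.0, (1, "a'"): 0.0, (1, "b'"): 0.0}
s, A = 0, sample_actions(0, q_avail)
for _ in range(20000):
    k = random.choice(sorted(A))                # uniform exploration
    r = p_rew if s == 0 else (1.0 if k == "a'" else 0.0)
    s_next = 1 if (s == 0 and k == "go") else 0
    A_next = sample_actions(s_next, q_avail)
    sas_q_update(Q, s, k, r, s_next, A_next, alpha=0.05, gamma=gamma)
    s, A = s_next, A_next
# At s2, a' (reward 1) ends up clearly preferred to b' (reward 0).
```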

###### Theorem 2

The SAS-Q-learning algorithm will converge w.p. 1 to the optimal Q-function for the (discounted, infinite-horizon) compressed MDP if the usual stochastic approximation requirements are satisfied: that is, if (a) rewards are bounded, and (b) the subsequence of learning rates $\alpha_t$ applied to any state-action pair $(s, k)$ satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (see, e.g., [21]).

Moreover, function approximation techniques, such as DQN [13], can be directly applied with the same available-set maximization. Implementing an optimal policy is also straightforward: given a state and the realization of the available actions, one simply executes the available action with maximal Q-value, $\arg\max_{k \in A_s} Q^*(s, k)$.

We note that extracting the optimal value function for the compressed MDP from the learned Q-function is not viable without some information about the action availability distribution. Fortunately, one need not know the expected value at a state to implement the optimal policy (it is, of course, straightforward to learn an optimal value function if desired).

## 5 Value Iteration in the Compressed MDP

Computing a value function for $M_c$, with its “small” state space $S$, suffices to execute an optimal policy. We develop an efficient value iteration (VI) method to do this.

### 5.1 Value Iteration

Solving an SAS-MDP using VI is challenging in general due to the required expectations over action sets. However, under PDA, we can derive an efficient VI algorithm whose complexity depends only polynomially on the base action set size $m$.

Assume a current iterate $V_t$, where $V_0$ is initialized arbitrarily. We compute $V_{t+1}$ as follows:

• For each $s, k$, compute its $(t{+}1)$-stage-to-go Q-value: $Q_{t+1}(s, k) = r^k_s + \gamma \sum_{s'} p^k_{s,s'}\, V_t(s')$.

• Sort these Q-values in descending order. For convenience, we re-index each action by its Q-value rank (i.e., $k(1)$ is the action with largest Q-value and $\rho(1)$ is its probability, $k(2)$ the second-largest, etc.).

• For each $s$, compute its $(t{+}1)$-stage-to-go value:

$$V_{t+1}(s) = \mathbb{E}_{A_s}\Big[\max_{k \in A_s} Q_{t+1}(s, k)\Big] = \sum_{i=1}^{m} \Big(\prod_{j=1}^{i-1}\big(1 - \rho(j)\big)\Big)\, \rho(i)\, Q_{t+1}(s, k(i)).$$
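These three steps can be sketched as a single VI sweep: back up Q-values, sort them (which also yields the greedy DL), and take the expectation over which action is the first available. The data layout (`r[s][k]`, `p[s][k][s']`, `rho[k]`) and the example constants ($p = 0.5$, $q = 0.75$) are illustrative assumptions.

```python
# One sweep of compressed value iteration under PDA.

def vi_sweep(V, r, p, rho, gamma):
    n = len(V)
    V_new = [0.0] * n
    for s in range(n):
        Q = {k: r[s][k] + gamma * sum(p[s][k][s2] * V[s2] for s2 in range(n))
             for k in r[s]}
        v, none_before = 0.0, 1.0
        for k in sorted(Q, key=Q.get, reverse=True):   # the greedy DL for s
            v += none_before * rho[k] * Q[k]
            none_before *= 1.0 - rho[k]
        V_new[s] = v
    return V_new

# Running example (assumed constants p = 0.5, q = 0.75, so "go" is optimal):
r = {0: {"stay": 0.5, "go": 0.5}, 1: {"a'": 1.0, "b'": 0.0}}
p = {0: {"stay": [1.0, 0.0], "go": [0.0, 1.0]},
     1: {"a'": [1.0, 0.0], "b'": [1.0, 0.0]}}
rho = {"stay": 1.0, "go": 1.0, "a'": 0.75, "b'": 1.0}
V = [0.0, 0.0]
for _ in range(600):
    V = vi_sweep(V, r, p, rho, gamma=0.9)
# Fixed point: V[0] = (0.5 + 0.9 * 0.75) / (1 - 0.81) ≈ 6.184
```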

Under ADS, we use the approximate Bellman operator:

$$\hat{V}_{t+1}(s) = \hat{\mathbb{E}}_{A_s}\Big[\max_{k \in A_s} \hat{Q}_{t+1}(s, k)\Big] = \frac{1}{T} \sum_{\tau=1}^{T} \sum_{i=1}^{m} I\big[i, A^{(\tau)}_s, \mu(s)\big]\, \hat{Q}_{t+1}\big(s, \mu(s)(i)\big),$$

where $\mu(s)$ is the DL resulting from sorting the $\hat{Q}_{t+1}(s, \cdot)$-values.

The Bellman operator under PDA is tractable:

###### Observation 1

The compressed Bellman operator $T^*_c$ can be computed in $O(mn^2 + nm \log m)$ time.

Therefore the per-iteration time complexity of VI for $M_c$ compares favorably to the $O(mn^2)$ time of VI in the base MDP. The added complexity arises from the need to sort Q-values (the products of the action availability probabilities can be computed in linear time via caching). Conveniently, this sorting process immediately provides the desired DL state policy for each $s$.

Using standard arguments, we obtain the following results, which immediately yield a polytime approximation method.

###### Lemma 3

$T^*_c$ is a contraction with modulus $\gamma$, i.e., $\|T^*_c V - T^*_c V'\|_\infty \le \gamma \|V - V'\|_\infty$.

###### Corollary 1

For any precision $\epsilon > 0$, the compressed value iteration algorithm converges to an $\epsilon$-approximation of the optimal value function in $O(\log(M/\epsilon) / \log(1/\gamma))$ iterations, where $M$ is an upper bound on $\|V^*_c\|_\infty$.

We provide an even stronger result next: VI, in fact, converges to an optimal solution in polynomial time.

### 5.2 The Complexity of Value Iteration

Given its polytime per-iteration complexity, to ensure VI is polytime, we must show that it converges to a value function that induces an optimal policy in polynomially many iterations. To do so, we exploit the compressed representation and adapt the technique of [20].

Assume, w.r.t. the base MDP $M$, that the discount factor $\gamma$, rewards $r^k_s$, and transition probabilities $p^k_{s,s'}$ are rational numbers represented with a precision of $1/\delta$ ($\delta$ is an integer). Tseng shows that VI for a standard MDP is strongly polynomial, assuming constant $\gamma$ and $\delta$, by proving that: (a) if the $t$'th value function $V_t$ produced by VI satisfies

$$\|V_t - V^*\| < 1/\big(2 \delta^{2n+2} n^n\big),$$

then the policy induced by $V_t$ is optimal; and (b) VI achieves this bound in polynomially many iterations.

We derive a similar bound on the number of VI iterations needed for convergence in an SAS-MDP, using the same input parameters as in the base MDP, and applying the same precision to the action availability probabilities. We apply Tseng’s result by exploiting the fact that: (a) the optimal policy for the embedded MDP can be represented as a DL; (b) the transition function for any DL policy can be expressed using an $n \times n$ matrix (we simply take expectations, see above); and (c) the corresponding linear system can be expressed over the compressed rather than the embedded state space to determine $V_c$ (rather than $V_e$).

Tseng’s argument requires some adaptation to apply to the compressed VI algorithm. We extend his precision assumption to account for our action availability probabilities as well, ensuring each $\rho^k_s$ is also represented up to a precision of $1/\delta$.

Since $M_c$ is an MDP, Tseng’s result applies; but notice that each entry of the transition matrix for any state’s DL $\sigma_s$, which serves as an action in $M_c$, is a product of probabilities, each with precision $1/\delta$. We have that each such entry has a precision of $1/\delta^{m+1}$. Thus the required precision parameter for our MDP is at most $\delta^{m+1}$. Plugging this into Tseng’s bound, VI applied to $M_c$ must induce an optimal policy at the $t$'th iteration if

$$\|V_t - V^*_c\| < 1/\big(2 (\delta^{m+1})^{2n} n^n\big) = 1/\big(2 \delta^{(m+1)2n} n^n\big).$$

This in turn gives us a bound on the number of iterations of VI needed to reach an optimal policy:

###### Theorem 3

VI applied to $M_c$ converges to a value function whose greedy policy is optimal in $t^*$ iterations, where

$$t^* \le \log\big(2 \delta^{2n(m+1)} n^n M\big) / \log(1/\gamma).$$

Combined with Obs. 1, we have:

###### Corollary 2

VI yields an optimal policy for the SAS-MDP corresponding to $M_c$ in polynomial time.

Under ADS, VI merely approximates the optimal policy. In fact, one cannot compute an exact optimal policy without observing the entire support of the availability distributions (requiring exponential sample size).

## 6 Policy Iteration in the Compressed MDP

We now outline a policy iteration (PI) algorithm.

### 6.1 Policy Iteration

The concise DL form of optimal policies can be exploited in PI as well. Indeed, the greedy policy with respect to any value function in the compressed space is representable as a DL. Thus the policy improvement step of PI can be executed using the same independent evaluation of action Q-values and sorting as used in VI above:

$$Q_V(s, k) = r^k_s + \gamma \sum_{s'} p^k_{s,s'}\, V(s'), \qquad Q_V(s, A_s) = \max_{k \in A_s} Q_V(s, k), \qquad \pi_V(s, A_s) = \arg\max_{k \in A_s} Q_V(s, k).$$

The DL policy form can also be exploited in the policy evaluation phase of PI. The tractability of policy evaluation requires a tractable representation of the action availability probabilities, which PDA provides, leading to the following PI method that exploits PDA:

1. Initialize an arbitrary policy $\pi$ in decision list form.

2. Evaluate $\pi$ by solving the following linear system over the variables $V^\pi(s)$, $s \in S$ (we use $Q^\pi(s, k(i))$ to represent the relevant linear expression over $V^\pi$):

$$V^\pi(s) = \sum_{i=1}^{m} \Big[\prod_{j=1}^{i-1}\big(1 - \rho(j)\big)\Big]\, \rho(i)\, Q^\pi(s, k(i)).$$
3. Let $\pi'$ denote the greedy policy w.r.t. $V^\pi$, which can be expressed in DL form for each $s$ by sorting Q-values as above (with standard tie-breaking rules). If $\pi' = \pi$, terminate; otherwise replace $\pi$ with $\pi'$ and repeat (Steps 2 and 3).

Under ADS, PI can use the approximate Bellman operator, giving an approximately optimal policy.
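The PI loop above can be sketched as follows. For brevity, policy evaluation here iterates the policy backup to (numerical) convergence rather than solving the linear system directly; the two agree at the fixed point. The data layout and example constants ($p = 0.5$, $q = 0.75$) are illustrative assumptions carried over from the running example.

```python
# PDA policy iteration with decision-list policies.

def evaluate_dl(policy, r, p, rho, gamma, sweeps=2000):
    n = len(policy)
    V = [0.0] * n
    for _ in range(sweeps):
        V_new = []
        for s in range(n):
            v, none_before = 0.0, 1.0
            for k in policy[s]:                      # walk down the DL
                q_sk = r[s][k] + gamma * sum(
                    p[s][k][s2] * V[s2] for s2 in range(n))
                v += none_before * rho[k] * q_sk
                none_before *= 1.0 - rho[k]
            V_new.append(v)
        V = V_new
    return V

def policy_iteration(r, p, rho, gamma):
    policy = [sorted(r[s]) for s in range(len(r))]   # arbitrary initial DLs
    while True:
        V = evaluate_dl(policy, r, p, rho, gamma)
        greedy = [sorted(r[s],
                         key=lambda k: r[s][k] + gamma * sum(
                             p[s][k][s2] * V[s2] for s2 in range(len(r))),
                         reverse=True)
                  for s in range(len(r))]
        if greedy == policy:
            return policy, V
        policy = greedy

r = {0: {"stay": 0.5, "go": 0.5}, 1: {"a'": 1.0, "b'": 0.0}}
p = {0: {"stay": [1.0, 0.0], "go": [0.0, 1.0]},
     1: {"a'": [1.0, 0.0], "b'": [1.0, 0.0]}}
rho = {"stay": 1.0, "go": 1.0, "a'": 0.75, "b'": 1.0}
policy, V = policy_iteration(r, p, rho, gamma=0.9)
# With q > p, the optimal DL at s1 tries "go" first.
```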

### 6.2 The Complexity of Policy Iteration

The per-iteration complexity of PI in $M_c$ is polynomial: as in standard PI, policy evaluation solves an $n \times n$ linear system (naively, $O(n^3)$), plus the additional overhead (linear in $m$) to compute the compounded availability probabilities; and policy improvement requires computation of $O(nm)$ action Q-values, plus $O(nm \log m)$ overhead for sorting Q-values (to produce improving DLs for all states).

An optimal policy is reached in a number of iterations no greater than that required by VI, since: (a) the sequence of value functions for the policies generated by PI contracts at least as quickly as the value functions generated by VI (see, e.g., [10, 6]); (b) our precision argument for VI ensures that the greedy policy extracted at that point will be optimal; and (c) once PI finds an optimal policy, it will terminate (with one extra iteration). Hence, PI is polytime (assuming a fixed discount $\gamma$).

###### Theorem 4

PI yields an optimal policy for the SAS-MDP in polynomial time.

In the longer version of the paper [sasmdps_full:arxiv18], we adapt more direct proof techniques [22, 6] to derive polynomial-time convergence of PI for SAS-MDPs under additional assumptions. Concretely, for a policy $\pi$ and a pair of actions, consider the probability, over action sets, that at state $s$ the optimal policy selects the first action and $\pi$ selects the second. Let $q$ be a lower bound on these probabilities whenever they are positive. We show:

###### Theorem 5

The number of iterations it takes policy iteration to converge is no more than

$$
O\!\left(\frac{nm^2}{1-\gamma} \, \log \frac{m}{1-\gamma} \, \log \frac{e}{q}\right).
$$

Under PDA, the theorem implies strongly polynomial convergence of PI if each action is available with constant probability. In this case, the probability that any given pair of actions is selected by the two policies at a state is bounded below by a constant, so $q$ in the bound above can be taken to be a constant.

## 7 Linear Programming in the Compressed MDP

An alternative model-based approach is linear programming (LP). The primal formulation for the embedded MDP is straightforward (since it is a standard MDP), but requires exponentially many variables (one per embedded state) and constraints (one per embedded state, base action pair).

A (nonlinear) primal formulation for the compressed MDP reduces the number of variables to $n$, one per base state:

$$
\min_{v} \; \sum_{s \in S} \alpha_s v_s \quad \text{s.t.} \quad
v_s \ge \mathbb{E}_{A_s}\!\Big[\max_{k \in A_s} Q(s,k)\Big] \quad \forall s. \tag{6}
$$

Here $\alpha$ is an arbitrary, positive state weighting over the embedded states corresponding to each base state, and

$$
Q(s,k) = r^k_s + \sum_{s' \in S} p^k_{s,s'} \, v_{s'}
$$

abbreviates the linear expression of the action-value backup at the state and action in question w.r.t. the value variables $v$. This program is valid given the definition of the compressed MDP and the fact that a weighting over embedded states corresponds to a weighting over base states by taking expectations. Unfortunately, this formulation is non-linear, due to the max term in each constraint. And while it has only $n$ variables, it has factorially many constraints; moreover, the constraints themselves are not compact due to the presence of the expectation in each constraint.

PDA can be used to render this formulation tractable. Let $\sigma$ denote an arbitrary (inverse) permutation of the action set, so that $\sigma(i) = k$ means that action $k$ is ranked in position $i$. As above, the optimal policy at a base state w.r.t. a Q-function is expressible as a DL (a permutation $\sigma$ with actions sorted by Q-values), and its expected value is given by the expression derived below. Specifically, if $\sigma$ reflects the relative ranking of the (optimal) Q-values of the actions at some fixed state $s$, then $\sigma(1)$ is executed with probability $\rho_{\sigma(1)}$, i.e., the probability that $\sigma(1)$ occurs in $A_s$. Similarly, $\sigma(2)$ is executed with probability $(1 - \rho_{\sigma(1)})\rho_{\sigma(2)}$, and so on. We define the Q-value of a DL as follows:

$$
Q^V_s(\sigma) = \sum_{i=1}^{n} \Big[ \prod_{j=1}^{i-1} \big(1 - \rho_{\sigma(j)}\big) \Big] \rho_{\sigma(i)} \, Q_V(s, \sigma(i)). \tag{7}
$$

Thus, for any fixed action permutation $\sigma$, the constraint that $v_s$ at least matches the expectation of the maximum action's Q-value is linear. Hence, the program can be recast as an LP by enumerating action permutations for each base state, replacing the constraints in Eq. (6) as follows:

$$
v_s \ge Q^V_s(\sigma) \qquad \forall s \in S, \; \forall \sigma \in \Sigma. \tag{8}
$$

The constraints in this LP are now each compactly represented, but it still has factorially many constraints. Despite this, it can be solved in polynomial time. First, we observe that the LP is well-suited to constraint generation. Given a relaxed LP with a subset of constraints, a greedy algorithm that simply sorts actions by Q-value to form a permutation can be used to find the maximally violated constraint at any state. Thus we have a practical constraint generation algorithm for this LP since (maximally) violated constraints can be found in polynomial time.
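The greedy separation step can be sketched as follows (the data are illustrative): a simple exchange argument shows that sorting actions by descending Q-value maximizes $Q^V_s(\sigma)$ over permutations, so the sorting permutation witnesses the maximal violation of $v_s \ge Q^V_s(\sigma)$ at each state.

```python
import numpy as np

# Constraint generation for the compressed-MDP LP (sketch with toy data).
n, m, gamma = 4, 3, 0.9
rng = np.random.default_rng(2)
r = rng.uniform(size=(n, m))
P = rng.dirichlet(np.ones(n), size=(n, m))
rho = np.full(m, 0.8)  # PDA availability probabilities (illustrative)

def dl_value(s, sigma, v):
    """Q_s^V(sigma), Eq. (7): expected Q-value of the first available action
    when actions are tried in the order given by sigma."""
    q = r[s] + gamma * P[s] @ v
    total, stay = 0.0, 1.0
    for k in sigma:
        total += stay * rho[k] * q[k]
        stay *= 1.0 - rho[k]
    return total

def most_violated(s, v):
    """Greedy separation oracle: sorting actions by descending Q(s, k)
    maximizes Q_s^V(sigma), hence yields the most violated constraint."""
    q = r[s] + gamma * P[s] @ v
    sigma = list(np.argsort(-q))
    return sigma, dl_value(s, sigma, v) - v[s]
```

If the returned violation is non-positive at every state, the current solution $v$ is feasible for the full LP (8) and constraint generation terminates.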

More importantly from a theoretical standpoint, the constraint generation algorithm can be used as a separation oracle within an ellipsoid method for this LP. This directly yields an exact, (weakly) polynomial time algorithm for this LP [5].

## 8 Empirical Illustration

We now provide a somewhat more elaborate empirical demonstration of the effects of stochastic action availability. Consider an MDP corresponding to a routing problem on a real-world road network (Fig. 1) in the San Francisco Bay Area, where the shortest path between the source and destination locations is sought. The dashed edge in Fig. 1 represents a bridge, available only with some probability $\rho < 1$, while all other edges correspond to action choices that are always available. At each node, a no-op action (waiting) is available at constant cost; otherwise the edge costs are the geodesic lengths of the corresponding roads on the map. Optimal policies for different bridge availability probabilities are depicted in Fig. 1, where line thickness and color indicate traversal probabilities under the corresponding optimal policies. Lower availability probabilities lead to policies with more redundancy. Fig. 2 investigates the effect of solving the routing problem obliviously to the stochastic action availability (i.e., assuming all actions are always available). The SAS-optimal policy allows graceful scaling of the expected travel time from source to destination as bridge availability decreases. Finally, the effects of violating the PDA assumption are investigated in the long version of this work [sasmdps_full:arxiv18].
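To make the effect concrete, consider a toy version of the routing problem (the graph and costs below are invented for illustration, not the Bay Area network of Fig. 1): a bridge edge available with probability 0.5, an always-available detour, and a no-op wait action. Value iteration on the compressed MDP, brute-forcing the expectation over realized edge sets at each node, yields the expected travel cost:

```python
import itertools

# Each out-edge is (next_node, cost, availability_probability).
WAIT = 1.0  # no-op cost, always available
edges = {
    "src":        [("bridge_end", 2.0, 0.5),   # bridge: available w.p. 0.5
                   ("detour",     5.0, 1.0)],  # surface road: always there
    "detour":     [("dst", 4.0, 1.0)],
    "bridge_end": [("dst", 1.0, 1.0)],
    "dst":        [],                           # destination: absorbing
}

def expected_cost(edges, iters=200):
    """VI on the compressed MDP: at each node, take the expectation over
    realized edge sets of the cheapest available option (waiting included)."""
    V = {u: 0.0 for u in edges}
    for _ in range(iters):
        for u, outs in edges.items():
            if not outs:
                continue
            total = 0.0
            for avail in itertools.product([0, 1], repeat=len(outs)):
                p = 1.0
                options = [WAIT + V[u]]  # waiting is always an option
                for (nxt, c, rho), a in zip(outs, avail):
                    p *= rho if a else 1.0 - rho
                    if a:
                        options.append(c + V[nxt])
                total += p * min(options)
            V[u] = total
        # expected cost from "src": 0.5 * 3 (bridge) + 0.5 * 5 (wait, retry)
    return V
```

At the fixed point, when the bridge is missing it is cheaper to wait and retry (cost $1 + V(\text{src})$) than to commit to the detour, so the expected cost degrades gracefully as availability falls rather than jumping to the full detour cost.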

## 9 Concluding Remarks

We have developed a new MDP model, SAS-MDPs, that extends the usual finite-action MDP model by allowing the set of available actions to vary stochastically. This captures an important use case that arises in many practical applications (e.g., online advertising, recommender systems). We have shown that embedding action sets in the state gives a standard MDP, supporting tractable analysis at the cost of an exponential blow-up in state space size. Despite this, we demonstrated that (optimal and greedy) policies have a useful decision list structure. We showed how this DL format can be exploited to construct tractable Q-learning, value and policy iteration, and linear programming algorithms.

While our work offers firm foundations for stochastic action sets, most practical applications will not use the algorithms described here explicitly. For example, in RL, we generally use function approximators for generalization and scalability in large state/action problems. We have successfully applied Q-learning using DNN function approximators (i.e., DQN) using sampled/logged available actions in ads and recommendations domains as described in Sec. 4. This has allowed us to apply SAS-Q-learning to problems of significant, commercially viable scale. Model-based methods such as VI, PI, and LP also require suitable (e.g., factored) representations of MDPs and structured implementations of our algorithms that exploit these representations. For instance, extensions of approximate linear programming or structured dynamic programming to incorporate stochastic action sets would be extremely valuable.

Other important questions include developing a polynomial-sized direct LP formulation, and deriving sample-complexity results for RL algorithms like Q-learning, especially as they pertain to sampling of the action distribution. Finally, we are quite interested in relaxing the strong assumptions embodied in the PDA model; of particular interest is the extension of our algorithms to less extreme forms of action-availability independence, for example, as represented using concise graphical models (e.g., Bayes nets).

Acknowledgments: Thanks to the reviewers for their helpful suggestions.

## References

• [1] Kareem Amin, Michael Kearns, Peter Key, and Anton Schwaighofer. Budget optimization for sponsored search: Censored learning in MDPs. In Proceedings of the Twenty-eighth Conference on Uncertainty in Artificial Intelligence (UAI-12), pages 543–553, Catalina, CA, 2012.
• [2] Nikolay Archak, Vahab Mirrokni, and S. Muthukrishnan. Budget optimization for online campaigns with positive carryover effects. In Proceedings of the Eighth International Workshop on Internet and Network Economics (WINE-12), pages 86–99, Liverpool, 2012.
• [3] Nikolay Archak, Vahab S. Mirrokni, and S. Muthukrishnan. Mining advertiser-specific user behavior using adfactors. In Proceedings of the Nineteenth International World Wide Web Conference (WWW 2010), pages 31–40, Raleigh, NC, 2010.
• [4] Moses Charikar, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. On targeting Markov segments. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC-99), pages 99–108, Atlanta, 1999.
• [5] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2 of Algorithms and Combinatorics. Springer, 1988.
• [6] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1), 2013. Article 1, 16pp.
• [7] Varun Kanade, H Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Twelfth International Conference on Artificial Intelligence and Statistics (AIStats-09), pages 272–279, Clearwater Beach, FL, 2009.
• [8] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine learning, 80(2–3):245–272, 2010.
• [9] Ting Li, Ning Liu, Jun Yan, Gang Wang, Fengshan Bai, and Zheng Chen. A Markov chain model for integrating behavioral targeting into contextual advertising. In Proceedings of the Third International Workshop on Data Mining and Audience Intelligence for Advertising (ADKDD-09), pages 1–9, Paris, 2009.
• [10] U. Meister and U. Holzbaur. A polynomial time bound for Howard’s policy improvement algorithm. OR Spektrum, 8:37–40, 1986.
• [11] Martin Mladenov, Craig Boutilier, Dale Schuurmans, Ofer Meshi, Gal Elidan, and Tyler Lu. Logistic Markov decision processes. In Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 2486–2493, Melbourne, 2017.
• [12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
• [13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
• [14] Evdokia Nikolova and David R. Karger. Route planning under uncertainty: The Canadian traveller problem. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI’08, pages 969–974. AAAI Press, 2008.
• [15] Christos H. Papadimitriou and Mihalis Yannakakis. Shortest paths without a map. Theoretical Computer Science, 84(1):127 – 150, 1991.
• [16] George H. Polychronopoulos and John N. Tsitsiklis. Stochastic shortest path problems with recourse. Networks, 27(2):133–143, 1996.
• [17] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.
• [18] David Silver, Leonard Newnham, David Barker, Suzanne Weller, and Jason McFall. Concurrent reinforcement learning from customer interactions. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 924–932, Atlanta, 2013.
• [19] Georgios Theocharous, Philip S. Thomas, and Mohammad Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the Twenty-fourth International Joint Conference on Artificial Intelligence (IJCAI-15), pages 1806–1812, Buenos Aires, 2015.
• [20] Paul Tseng. Solving h-horizon, stationary Markov decision problems in time proportional to log(h). Operations Research Letters, 9(5):287–297, 1990.
• [21] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
• [22] Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.