Reinforcement Learning When All Actions are Not Always Available

06/05/2019 ∙ by Yash Chandak, et al. ∙ University of Massachusetts Amherst ∙ Adobe

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs suffer from divergence issues, and present new algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches using several tasks inspired by real-life use cases wherein the action set is stochastic.




1 Introduction

In many real-world sequential decision making problems, the set of available decisions, which we call the action set, is stochastic. In vehicular routing on a road network (Gendreau et al., 1996) or packet routing on the internet (Ribeiro et al., 2008), the goal is to find the shortest path between a source and destination. However, due to construction, traffic, or other damage to the network, not all pathways are always available. In online advertising (Tan and Srikant, 2012; Mahdian et al., 2007), the set of available ads can vary due to fluctuations in advertising budgets and promotions. In robotics (Feng and Yan, 2000), actuators can fail. In recommender systems (Harper and Skiba, 2007), the set of possible recommendations can vary based on product availability. These examples capture the broad idea and motivate the question we aim to address: how can we develop efficient learning algorithms for sequential decision making problems wherein the action set can be stochastic?

Sequential decision making problems without stochastic action sets are typically modeled as Markov decision processes (MDPs). Although the MDP formulation is remarkably flexible, and can incorporate concepts like stochastic state transitions, partial observability, and even different (deterministic) action availability depending on the state, it does not allow for stochastic action sets. As a result, algorithms designed for MDPs are not suited to our setting of interest. Recently, Boutilier et al. (2018) proposed a new problem formulation, stochastic action set Markov decision processes (SAS-MDPs), that extends MDPs to include stochastic action sets. They also showed how the Q-learning and value iteration algorithms, two classic algorithms for approximating optimal solutions to MDPs, can be extended to SAS-MDPs.

In this paper we show that the lack of convergence guarantees of the Q-learning algorithm when using function approximators in the MDP setting can be exacerbated in the SAS-MDP setting. We therefore derive policy gradient and natural policy gradient algorithms for the SAS-MDP setting and provide conditions for their almost-sure convergence. Furthermore, since stochastic action sets introduce further uncertainty into the decision making process, variance reduction techniques are of increased importance. We therefore derive new approaches to variance reduction for policy gradient algorithms that are unique to the SAS-MDP setting. We validate our new algorithms empirically on tasks inspired by real-world problems with stochastic action sets.

2 Related Work

While there is extensive literature on solving sequential decision problems modeled as MDPs (Puterman, 2014; Sutton and Barto, 2018), there are few methods designed to handle stochastic action sets. Recently, Boutilier et al. (2018) laid the foundation for studying MDPs with stochastic action sets by defining the new SAS-MDP problem formulation, which we review in the background section. After defining SAS-MDPs, Boutilier et al. (2018) presented and analyzed the model-based value iteration and policy iteration algorithms and the model-free Q-learning algorithm for SAS-MDPs.

In the bandit setting, wherein individual decisions are optimized rather than sequences of dependent decisions, sleeping bandits extend the standard bandit problem formulation to allow for stochastic action sets (Kanade et al., 2009; Kleinberg et al., 2010). We focus on the SAS-MDP formulation rather than the sleeping bandit formulation because we are interested in sequential problems. Such sequential problems are more challenging because making optimal decisions requires one to reason about the long-term impact of decisions, which includes reasoning about how a decision will influence the probability that different actions (decisions) will be available in the future.

Although we focus on the model-free setting, wherein the dynamics of the environment are not known a priori to the agent optimizing its decisions, in the alternative model-based setting researchers have considered related problems in the area of stochastic routing (Papadimitriou and Yannakakis, 1991; Polychronopoulos and Tsitsiklis, 1996; Nikolova et al., 2006; Nikolova and Karger, 2008). In stochastic routing problems, the goal is to find a shortest path on a graph with stochastic availability of edges. The SAS-MDP framework generalizes stochastic routing problems by allowing for sequential decision making problems that are not limited to shortest path problems.

3 Background

MDPs and SAS-MDPs (Boutilier et al., 2018) are mathematical formulations of sequential decision problems. Before defining SAS-MDPs, we define MDPs. We refer to the entity interacting with an MDP or SAS-MDP and trying to optimize its decisions as the agent.

Formally, an MDP is a tuple $\mathcal{M} := (\mathcal{S}, \mathcal{B}, \mathcal{P}, \mathcal{R}, \gamma, d_0)$. $\mathcal{S}$ is the set of all possible states that the agent can be in, called the state set. Although our math notation assumes that $\mathcal{S}$ is countable, our primary results extend to MDPs with continuous states. $\mathcal{B}$ is a finite set of all possible actions that the agent can take, called the base action set. $S_t$ and $A_t$ are random variables that denote the state of the environment and action chosen by the agent at time $t \in \{0, 1, \dots\}$. $\mathcal{P}$ is called the transition function and characterizes how states transition: $\mathcal{P}(s, a, s') := \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$. $R_t \in [-R_{\max}, R_{\max}]$, a bounded random variable, is the scalar reward received by the agent at time $t$, where $R_{\max}$ is a finite constant. $\mathcal{R}$ is called the reward function, and is defined as $\mathcal{R}(s, a) := \mathbf{E}[R_t \mid S_t = s, A_t = a]$. The reward discount parameter, $\gamma \in [0, 1)$, characterizes how the utility of rewards to the agent decays based on how far in the future they occur. We call $d_0$ the start state distribution, which is defined as $d_0(s) := \Pr(S_0 = s)$.

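We now turn to defining a SAS-MDP. Let the set of actions available at time $t$ be a random variable, $\mathcal{A}_t \subseteq \mathcal{B}$, which we assume is never empty, i.e., $\mathcal{A}_t \neq \emptyset$. Let $\varphi$ characterize the conditional distribution of $\mathcal{A}_t$: $\varphi(s, \alpha) := \Pr(\mathcal{A}_t = \alpha \mid S_t = s)$. We assume that $\mathcal{A}_t$ is Markovian, in that its distribution is conditionally independent of all events prior to the agent entering state $S_t$ given $S_t$. Formally, a SAS-MDP is $\mathcal{M}' := \{\mathcal{M} \cup \varphi\}$, with the additional requirement that $A_t \in \mathcal{A}_t$.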
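As a rough illustration of a conditional distribution over available action sets, the sketch below samples a non-empty set under the assumption that each base action is available independently with a fixed probability, which is the availability model used for the domains later in the paper. The function name `sample_action_set` and the resample-until-non-empty rule are hypothetical choices made for this example; the formulation itself only requires that the sampled set is never empty.

```python
import random

def sample_action_set(base_actions, p_available, rng=random):
    """Sample a non-empty subset of base_actions.

    Assumption: each action is available independently with probability
    p_available, and empty draws are rejected and resampled.
    """
    while True:
        available = [a for a in base_actions if rng.random() < p_available]
        if available:
            return available

rng = random.Random(0)
action_set = sample_action_set(["left", "right", "up", "down"], 0.5, rng)
```

Note that rejecting empty draws means the resulting distribution is the independent-availability distribution conditioned on at least one action being present.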

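A policy $\pi : \mathcal{S} \times 2^{\mathcal{B}} \times \mathcal{B} \to [0, 1]$ is a conditional distribution over actions for each state and available action set: $\pi(s, \alpha, a) := \Pr(A_t = a \mid S_t = s, \mathcal{A}_t = \alpha)$ for all $s \in \mathcal{S}$, $a \in \alpha$, and $t$, where $\alpha \subseteq \mathcal{B}$ and $\alpha \neq \emptyset$. Sometimes a policy is parameterized by a weight vector $\theta$, such that changing $\theta$ changes the policy. We write $\pi^\theta$ to denote such a parameterized policy with weight vector $\theta$. For any policy $\pi$, we define the corresponding state-action value function to be $q^\pi(s, a) := \mathbf{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, A_t = a, \pi]$, where conditioning on $\pi$ denotes that $A_{t+k} \sim \pi(S_{t+k}, \mathcal{A}_{t+k}, \cdot)$ for all subsequent actions, i.e., for $k \geq 1$. Similarly, the state-value function associated with policy $\pi$ is $v^\pi(s) := \mathbf{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, \pi]$. For a given SAS-MDP $\mathcal{M}'$, the agent's goal is to find an optimal policy, $\pi^*$ (or equivalently optimal policy parameters $\theta^*$), which is any policy that maximizes the expected sum of discounted future rewards. More formally, an optimal policy is any $\pi^* \in \arg\max_{\pi \in \Pi} J(\pi)$, where $J(\pi) := \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid \pi]$ and $\Pi$ denotes the set of all possible policies. For notational convenience, we sometimes use $\theta$ in place of $\pi$, e.g., to write $v^\theta$, $q^\theta$, or $J(\theta)$, since a weight vector $\theta$ induces a specific policy.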

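One way to model stochastic action sets using the MDP formulation rather than the SAS-MDP formulation is to define states such that one can infer $\mathcal{A}_t$ given $S_t$. Transforming an MDP into a new MDP with $\mathcal{A}_t$ embedded in $S_t$ in this way can result in the size of the state set growing exponentially, by a factor of $2^{|\mathcal{B}|}$. This drastic increase in the size of the state set can make finding or approximating an optimal policy prohibitively difficult. Using the SAS-MDP formulation, the challenges associated with this exponential increase in the size of the state set can be avoided, and one can derive algorithms for finding or approximating optimal policies in terms of the state set of the original underlying MDP. This is accomplished using a variant of the Bellman operator, $\mathcal{T}^\pi$, which incorporates the concept of stochastic action sets:

$$\mathcal{T}^\pi v(s) := \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi(s, \alpha, a) \sum_{s' \in \mathcal{S}} \mathcal{P}(s, a, s') \big(\mathcal{R}(s, a) + \gamma v(s')\big), \tag{1}$$

for all $s \in \mathcal{S}$. Similarly, one can extend the Bellman optimality operator (Sutton and Barto, 2018):

$$\mathcal{T}^* v(s) := \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \max_{a \in \alpha} \sum_{s' \in \mathcal{S}} \mathcal{P}(s, a, s') \big(\mathcal{R}(s, a) + \gamma v(s')\big). \tag{2}$$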

Showing an equivalence between the fixed point of this modified Bellman operator and the fixed point of the standard Bellman operator on the MDP with embedded actions, Boutilier et al. (2018) proposed the following update for a tabular estimate, $q$, of $q^*$, with learning rate $\eta$:

$$q(S_t, A_t) \leftarrow (1 - \eta)\, q(S_t, A_t) + \eta \Big(R_t + \gamma \max_{a \in \mathcal{A}_{t+1}} q(S_{t+1}, a)\Big). \tag{3}$$

Notice that the maximum is computed only over the available actions, $\mathcal{A}_{t+1}$, in state $S_{t+1}$. We refer to the algorithm using this update rule as SAS-Q-learning.
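A minimal tabular sketch of one such update step is given below, to make concrete that the maximum ranges only over the actions available in the next state. The function name `sas_q_update`, the array layout (`q[state, action]`), and the default step size and discount are assumptions chosen for this example.

```python
import numpy as np

def sas_q_update(q, s, a, r, s_next, available_next, eta=0.1, gamma=0.9):
    """One tabular SAS-Q-learning step on the table q[state, action].

    available_next is the (non-empty) collection of action indices that
    can be taken in s_next; unavailable actions are ignored by the max.
    """
    target = r + gamma * max(q[s_next, b] for b in available_next)
    q[s, a] = (1.0 - eta) * q[s, a] + eta * target
    return q

q = np.zeros((2, 2))
q[1] = [5.0, 1.0]   # in state 1, action 0 looks best but may be unavailable
sas_q_update(q, s=0, a=1, r=0.0, s_next=1, available_next=[1])
```

Here action 0 has the larger estimate in the next state, but because only action 1 is available the target is built from `q[1, 1]`.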

4 Limitations of SAS-Q-Learning

Although SAS-Q-learning provides a powerful first model-free algorithm for approximating optimal policies for SAS-MDPs, it inherits several of the drawbacks of the Q-learning algorithm for MDPs. Just like Q-learning, in a state $S_t$ and with available actions $\mathcal{A}_t$, the SAS-Q-learning method chooses actions deterministically when not exploring: $A_t \in \arg\max_{a \in \mathcal{A}_t} q(S_t, a)$. This limits its practicality for problems where optimal policies are stochastic, which is often the case when the environment is partially observable or when the use of function approximation causes state aliasing (Baird, 1995). Additionally, if the SAS-Q-learning update converges to an estimate, $q$, of $q^*$ such that the induced state-value estimate $v$ satisfies $\mathcal{T}^* v(s) = v(s)$ for all $s \in \mathcal{S}$, where $\mathcal{T}^*$ is the SAS Bellman optimality operator, then the agent will act optimally; however, convergence to a fixed point of $\mathcal{T}^*$ is seldom achieved in practice, and reducing the difference between $v(s)$ and $\mathcal{T}^* v(s)$ (what SAS-Q-learning aims to do) does not ensure improvement of the policy (Sutton and Barto, 2018).

SAS-Q-learning does not perform gradient ascent or descent on any function, and it can cause divergence of the estimator when using function approximation, just like Q-learning for MDPs (Baird, 1995). Furthermore, we contend that the divergent behavior of SAS-Q-Learning can in some cases be more severe than that of the Q-Learning algorithm for MDPs. That is, in cases where Q-learning converges, SAS-Q-learning can diverge.

Figure 1: The $\theta \to 2\theta$ MDP.

To see this, consider the SAS variant of the classical $\theta \to 2\theta$ MDP (Tsitsiklis and Roy, 1983) illustrated in Figure 1. In this example there are two states, $s_1$ (left in Figure 1) and $s_2$ (right), and two actions, $a_1$ (left) and $a_2$ (right). The agent in this example uses function approximation (Sutton and Barto, 2018), with weight vector $\theta \in \mathbb{R}^2$, such that $q(s_1, a_i) = \theta_i$ and $q(s_2, a_i) = 2\theta_i$ for $i \in \{1, 2\}$. In either state, if the agent takes the left action, it goes to the left state, and if the agent takes the right action, it goes to the right state. In our SAS-MDP version of this problem, the two actions are not both always available. Let $R_t = 0$ always, and let $\gamma \in (1/2, 1)$. Consider the case where the weights of the $q$-approximation are initialized to $\theta_1 = 0$ and some $\theta_2 < 0$. Now suppose that a transition is observed from the left state to the right state, and after the transition the left action is not available to the agent. As per the SAS-Q-learning update rule provided in (3), $\theta_2 \leftarrow \theta_2 + \eta (R_t + \gamma 2\theta_2 - \theta_2)$. Since $R_t = 0$, this is equivalent to $\theta_2 \leftarrow (1 + \eta(2\gamma - 1))\,\theta_2$. If this transition is used repeatedly on its own, then irrespective of the learning rate, $\eta > 0$, the weight $\theta_2$ would diverge to $-\infty$, because $2\gamma - 1 > 0$. In contrast, had there been no constraint of using the max over $q$ given the available actions, the Q-learning update would have been $\theta_2 \leftarrow \theta_2 + \eta (R_t + \gamma 2\theta_1 - \theta_2)$, because action $a_1$ has a higher q-value than $a_2$ due to $\theta_1 > \theta_2$. This would make $\theta_2$ converge to the correct value of $0$. This provides an example of how the stochastic constraints on the set of available actions can be instrumental in causing the SAS-Q-learning method to diverge.
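The contrast in this example can be reproduced numerically. The sketch below repeats the single left-to-right transition under assumed values that are not taken from the paper: initial weights $\theta_1 = 0$, $\theta_2 = -1$, a discount of 0.9, a step size of 0.1, and the parameterization $q(s_1, a_i) = \theta_i$, $q(s_2, a_i) = 2\theta_i$. The function name `simulate_theta2` is hypothetical.

```python
def simulate_theta2(left_available, theta1=0.0, theta2=-1.0,
                    eta=0.1, gamma=0.9, steps=300):
    """Repeat the s1 --a2--> s2 transition and return the final theta2.

    Assumed parameterization: q(s1, a_i) = theta_i, q(s2, a_i) = 2 * theta_i,
    with zero reward on every step.
    """
    for _ in range(steps):
        next_values = [2.0 * theta2]            # a2 is always available here
        if left_available:
            next_values.append(2.0 * theta1)    # a1 only when it was not removed
        target = 0.0 + gamma * max(next_values)
        # Semi-gradient step on q(s1, a2) = theta2 (gradient w.r.t. theta2 is 1).
        theta2 += eta * (target - theta2)
    return theta2

diverged = simulate_theta2(left_available=False)   # max restricted to {a2}
converged = simulate_theta2(left_available=True)   # unrestricted max
```

With the left action removed, each step multiplies the weight by a factor greater than one, so its magnitude grows without bound; with both actions available the target is zero and the weight shrinks geometrically toward zero.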

5 Policy Gradient Methods for SAS-MDPs

In this section we derive policy gradient algorithms (Sutton et al., 2000) for the SAS-MDP setting. While the Q-learning algorithm minimizes the error between $\mathcal{T}^* v(s)$ and $v(s)$ for all states $s$ (using a procedure that is not a gradient algorithm), policy gradient algorithms perform stochastic gradient ascent on the objective function $J$. That is, they use the update $\theta \leftarrow \theta + \eta \Delta$, where $\Delta$ is an unbiased estimator of $\nabla J(\theta)$.


Unlike the Q-learning algorithm, policy gradient algorithms for MDPs provide (local) convergence guarantees even when using function approximation, and can approximate optimal stochastic policies. However, ignoring the fact that actions are not always available and using off-the-shelf algorithms for MDPs fails to fully capture the problem setting (Boutilier et al., 2018). It is therefore important that we derive policy gradient algorithms that are appropriate for the SAS-MDP setting, as they provide the first convergent model-free algorithms for SAS-MDPs when using function approximation.

In the following lemma we extend the expression for the policy gradient for MDPs (Sutton et al., 2000) to handle stochastic action sets.

Lemma 1 (SAS Policy Gradient).

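For a SAS-MDP, for all $\theta$,

$$\nabla J(\theta) = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^t \Pr(S_t = s \mid \theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} q^\theta(s, a) \frac{\partial \pi^\theta(s, \alpha, a)}{\partial \theta}. \tag{4}$$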


See Appendix A. ∎

It follows from Lemma 1 that we can create unbiased estimates of $\nabla J(\theta)$, which can be used to update $\theta$ using the well-known stochastic gradient ascent algorithm. This algorithm is presented in Algorithm 1 in Appendix E. Notably, this process does not require the agent to know $\varphi$. Notice that in the special case where all actions are always available, the expression in Lemma 1 degenerates to the policy gradient theorem for MDPs (Sutton and Barto, 2018). We now establish that SAS policy gradient algorithms are guaranteed to converge to locally optimal policies under standard assumptions: the policy is differentiable (A1), the gradient of $J$ is Lipschitz (A2), and the step sizes are decayed appropriately (A3). Formal assumption statements are deferred to Appendix B.
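To make the sampling-based estimate concrete, here is a minimal Monte Carlo (REINFORCE-style) sketch. It is not the paper's Algorithm 1; it assumes a linear softmax policy whose probabilities are renormalized over the sampled action set, and all function names, shapes, and step sizes are hypothetical. The distribution over action sets never appears in the code: the sampled sets in the episode are all that is needed.

```python
import numpy as np

def masked_softmax(logits, available):
    """Softmax over available actions only; unavailable ones get probability 0."""
    mask = np.zeros(logits.shape, dtype=bool)
    mask[list(available)] = True
    z = np.where(mask, logits - logits[mask].max(), -np.inf)
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(theta, x, available, a):
    """Score of a linear-softmax policy restricted to `available`."""
    probs = masked_softmax(theta @ x, available)
    one_hot = np.zeros(len(probs))
    one_hot[a] = 1.0
    return np.outer(one_hot - probs, x)

def sas_reinforce_update(theta, episode, eta=0.01, gamma=0.99):
    """episode: list of (features, available_set, action, reward) tuples."""
    returns, g = [], 0.0
    for *_, r in reversed(episode):
        g = r + gamma * g            # discounted return from each step onward
        returns.append(g)
    returns.reverse()
    new_theta = theta.copy()
    for t, ((x, available, a, _), g_t) in enumerate(zip(episode, returns)):
        new_theta += eta * gamma**t * g_t * grad_log_pi(theta, x, available, a)
    return new_theta

theta0 = np.zeros((3, 2))                                   # 3 base actions, 2 features
episode = [(np.array([1.0, 0.0]), {0, 2}, 0, 1.0)]          # action 1 was unavailable
theta1 = sas_reinforce_update(theta0, episode)
```

In this sketch the parameters of an action that was unavailable at a step receive no gradient from that step, since its probability and its indicator are both zero there.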

Lemma 2.

Under Assumptions (A1)-(A3), the SAS policy gradient algorithm causes $\nabla J(\theta_t) \to 0$ as $t \to \infty$, with probability one.


See Appendix B. ∎

Natural policy gradient algorithms (Kakade, 2002) extend policy gradient algorithms to follow the natural gradient of $J$ (Amari, 1998). In essence, whereas policy gradient methods perform gradient ascent in the space of policy parameters by computing the gradient of $J$ as a function of the parameters $\theta$, natural policy gradient methods perform gradient ascent in the space of policies (which are probability distributions) by computing the gradient of $J$ as a function of the policy, $\pi$. Thus, whereas policy gradient implicitly measures distances between policies by the Euclidean distance between their policy parameters, natural policy gradient methods measure distances between policies using notions of distance between probability distributions. In the most common form of natural policy gradients, the distances between policies are measured using a Taylor approximation of Kullback–Leibler divergence (KLD). By performing gradient ascent in the space of policies rather than the space of policy parameters, the natural policy gradient becomes invariant to how the policy is parameterized (Thomas et al., 2018), which can help to mitigate the vanishing gradient problem in neural networks and improve learning speed (Amari and Douglas, 1998).

The natural policy gradient (using a Taylor approximation of KLD to measure distances) is $\widetilde{\nabla} J(\theta) := F_\theta^{-1} \nabla J(\theta)$, where $F_\theta$ is the Fisher information matrix (FIM) associated with the policy $\pi^\theta$. Although the FIM is a well-known quantity, it is typically associated with a single parameterized probability distribution. Here, $\pi^\theta$ is a collection of probability distributions, one per state. This raises the question of what $F_\theta$ should be when computing the natural policy gradient. Following the work of Bagnell and Schneider (2003) for MDPs, we show that the FIM, $F_\theta$, for computing the natural policy gradient for a SAS-MDP can also be derived by viewing $\pi^\theta$ as a single distribution over possible trajectories (sequences of states, available action sets and executed actions).

Property 1 (Fisher Information Matrix).

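For a policy parameterized using weights $\theta$, the Fisher information matrix is given by

$$F_\theta = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^t \Pr(S_t = s \mid \theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^\theta(s, \alpha, a)\, \psi^\theta(s, \alpha, a)\, \psi^\theta(s, \alpha, a)^\top, \tag{5}$$

where $\psi^\theta(s, \alpha, a) := \partial \log \pi^\theta(s, \alpha, a) / \partial \theta$.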


See Appendix C. ∎

Furthermore, Kakade (2002) showed that many terms in the definition of the natural policy gradient cancel, providing a simple expression for the natural gradient which can be estimated with time linear in the number of policy parameters per time step. We extend the result of Kakade (2002) to the SAS-MDP formulation in the following lemma:

Lemma 3 (SAS Natural Policy Gradient).

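Let $w$ be a parameter such that

$$\frac{\partial}{\partial w}\, \mathbf{E}\left[\frac{1}{2} \sum_{t=0}^{\infty} \gamma^t \Big(\psi^\theta(S_t, \mathcal{A}_t, A_t)^\top w - q^\theta(S_t, A_t)\Big)^2 \,\middle|\, \theta\right] = 0, \tag{6}$$

where $\psi^\theta(s, \alpha, a) := \partial \log \pi^\theta(s, \alpha, a) / \partial \theta$; then, in the SAS-MDP $\mathcal{M}'$,

$$\widetilde{\nabla} J(\theta) = w.$$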

See Appendix C. ∎

From Lemma 3, we can derive a computationally efficient natural policy gradient algorithm by using the well-known temporal difference algorithm (Sutton and Barto, 2018), modified to work with SAS-MDPs, to estimate $q^\theta$ with the approximator $\psi^\theta(s, \alpha, a)^\top w$, and then using the update $\theta \leftarrow \theta + \eta w$. This algorithm, which is the SAS-MDP equivalent of NAC-TD (Bhatnagar et al., 2008; Degris et al., 2012; Morimura et al., 2005; Thomas and Barto, 2012), is provided in Algorithm 2 in Appendix E.
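One plausible shape for a single step of such an algorithm, in the style of incremental natural actor-critic methods, is sketched below. This is not the paper's Algorithm 2, which is in the appendix; the function name `sas_nac_step`, the step sizes, and the use of a sampled TD error as the regression target are assumptions made for illustration.

```python
import numpy as np

def sas_nac_step(theta, w, psi, td_error, alpha_w=0.1, eta=0.01):
    """One natural actor-critic style step.

    psi      : score vector d/dtheta log pi(s, alpha, a), computed with the
               policy restricted to the available action set alpha.
    td_error : sampled TD error, used here as a stand-in regression target.
    """
    # Move w so that psi^T w tracks the target (compatible features).
    w = w + alpha_w * (td_error - psi @ w) * psi
    # The natural gradient estimate is w itself, so the actor follows w.
    theta = theta + eta * w
    return theta, w

theta, w = sas_nac_step(
    theta=np.zeros(2), w=np.zeros(2),
    psi=np.array([1.0, 0.0]), td_error=2.0, alpha_w=0.5, eta=0.1,
)
```

The per-step cost is linear in the number of policy parameters because no Fisher matrix is formed or inverted; the only place the stochastic action set enters is through the score vector `psi`.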

6 Adaptive Variance Mitigation

In the previous section, we derived (natural) policy gradient algorithms for SAS-MDPs. While these algorithms avoid the divergence of SAS-Q-learning, they suffer from the high variance of policy gradient estimates (Kakade et al., 2003). As a consequence of the additional stochasticity that results from stochastic action sets, this problem can be even more severe in the SAS-MDP setting. In this section, we leverage insights from the Bellman equation for SAS-MDPs, provided in (1), to reduce the variance of policy gradient estimates.

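One of the most popular methods to reduce variance is the use of a state-dependent baseline $b(s)$. Sutton et al. (2000) showed that, for any state-dependent baseline $b(s)$:

$$\nabla J(\theta) = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^t \Pr(S_t = s \mid \theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \big(q^\theta(s, a) - b(s)\big) \frac{\partial \pi^\theta(s, \alpha, a)}{\partial \theta}. \tag{7}$$

For any random variables $X$ and $Y$, we know that the variance of $X - Y$ is given by $\operatorname{var}(X - Y) = \operatorname{var}(X) + \operatorname{var}(Y) - 2\operatorname{cov}(X, Y)$, where cov stands for covariance. Therefore, the variance of $X - Y$ is less than the variance of $X$ if $2\operatorname{cov}(X, Y) > \operatorname{var}(Y)$. As a result, any state-dependent baseline $b(s)$ whose value is sufficiently correlated with the expected return, $q^\theta(s, a)$, can be used to reduce the variance of the sample estimator of (7). A common choice for such a baseline is a state-value function estimator, $\hat{v}(s)$.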
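A quick numerical check of this variance identity, using synthetic data rather than anything from the paper's experiments, is shown below. The variables `x` and `y` are stand-ins for a return and a correlated baseline; the noise scale of 0.5 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)              # stand-in for the baseline
x = y + 0.5 * rng.normal(size=10_000)    # stand-in for the return, correlated with y

cov_xy = np.cov(x, y, bias=True)[0, 1]   # population covariance, to match np.var
var_diff = np.var(x - y)
identity_rhs = np.var(x) + np.var(y) - 2.0 * cov_xy
reduces_variance = 2.0 * cov_xy > np.var(y)
```

Because the baseline here is strongly correlated with the return, twice the covariance exceeds the baseline's variance and subtracting it lowers the variance.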

A baseline dependent on both the state and action can have higher correlation with , and could therefore reduce variance further. However, such action dependent baselines cannot be used directly, as they can result in biased gradient estimates. Developing such baselines remains an active area of research for MDPs (Thomas and Brunskill, 2017; Grathwohl et al., 2017; Liu et al., 2017; Wu et al., 2018; Tucker et al., 2018) and is largely complementary to our purpose.

We now show that we can introduce a baseline for SAS-MDPs that lies between state-dependent and state-action-dependent baselines. Like state-dependent baselines, these new baselines do not introduce bias into gradient estimates. However, like action-dependent baselines, these new baselines include some information about the chosen actions. Specifically, we propose baselines that depend on the state, $S_t$, and available action set, $\mathcal{A}_t$, but not the precise action, $A_t$.

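Recall from the SAS Bellman equation (1) that the state-value function for SAS-MDPs can be written as $v^\theta(s) = \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^\theta(s, \alpha, a)\, q^\theta(s, a)$. While we cannot directly use a baseline dependent on the action sampled from $\pi^\theta$, we can use a baseline dependent on the sampled action set. We consider a new baseline which leverages this information about the sampled action set $\alpha$. This baseline is $\bar{q}(s, \alpha) := \sum_{a \in \alpha} \pi^\theta(s, \alpha, a)\, \hat{q}(s, a)$, where $\hat{q}$ is a learned estimator of the state-action value function, and $\bar{q}$ represents its expected value under the current policy, $\pi^\theta$, conditioned on the sampled action set $\alpha$.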
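The action-set-conditioned baseline described here is an average of estimated action values under the policy restricted to the sampled set. A small sketch follows, assuming a softmax policy over per-action logits that is renormalized over the available actions; the function name `action_set_baseline` and the example numbers are hypothetical.

```python
import numpy as np

def action_set_baseline(logits, q_hat, available):
    """Return the sum over available a of pi(s, alpha, a) * q_hat(s, a).

    Assumption: pi is a softmax over `logits`, renormalized over `available`.
    """
    idx = np.asarray(sorted(available))
    z = logits[idx] - logits[idx].max()      # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs @ q_hat[idx])

logits = np.array([0.0, 0.0, 3.0])           # action 2 is strongly preferred
q_hat = np.array([1.0, 3.0, 10.0])           # and has the largest value estimate
b_all = action_set_baseline(logits, q_hat, {0, 1, 2})
b_sub = action_set_baseline(logits, q_hat, {0, 1})   # preferred action missing
```

When the preferred, high-value action is missing from the sampled set, the baseline drops accordingly, which a baseline that depends on the state alone cannot reflect.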

In principle, we expect $\bar{q}(S_t, \mathcal{A}_t)$ to be more correlated with $q^\theta(S_t, A_t)$, as it explicitly conditions on the action set and does not average over all possible action sets, as $\hat{v}(S_t)$ does. Practically, however, estimating $\hat{q}$ values can be harder than estimating $\hat{v}$. This can be attributed to the fact that, with the same number of training samples, $\hat{q}$ has more parameters to learn than an estimate of $v^\theta$. This poses a new dilemma of deciding when to use which baseline. To get the best of both, we consider using a weighted combination of $\hat{v}(S_t)$ and $\bar{q}(S_t, \mathcal{A}_t)$. In the following property we establish that using any weighted combination of these two baselines results in an unbiased estimate of the SAS policy gradient.

Property 2 (Unbiased estimator).

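Let $\hat{J}(s, \alpha, a, \theta) := \psi^\theta(s, \alpha, a)\, q^\theta(s, a)$ and $\hat{J}^{\lambda_1, \lambda_2}(s, \alpha, a, \theta) := \psi^\theta(s, \alpha, a) \big(q^\theta(s, a) + \lambda_1 \hat{v}(s) + \lambda_2 \bar{q}(s, \alpha)\big)$, with $\psi^\theta(s, \alpha, a) := \partial \log \pi^\theta(s, \alpha, a) / \partial \theta$, then for any values of $\lambda_1 \in \mathbb{R}$ and $\lambda_2 \in \mathbb{R}$,

$$\mathbf{E}\big[\hat{J}(S_t, \mathcal{A}_t, A_t, \theta) \mid \theta\big] = \mathbf{E}\big[\hat{J}^{\lambda_1, \lambda_2}(S_t, \mathcal{A}_t, A_t, \theta) \mid \theta\big]. \tag{8}$$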


See Appendix D. ∎

The question remains: what values should be used for $\lambda_1$ and $\lambda_2$ for combining $\hat{v}$ and $\bar{q}$? Similar problems of combining different estimators have been studied in the statistics literature (Graybill and Deal, 1959; Meir et al., 1994) and more recently for combining control variates (Wang et al., 2013; Geffner and Domke, 2018). Building upon their ideas, rather than leaving $\lambda_1$ and $\lambda_2$ as open hyperparameters, we propose a method for automatically adapting $A = [\lambda_1, \lambda_2]^\top$ for the specific SAS-MDP and current policy parameters, $\theta$. The following lemma presents an analytic expression for the value of $A$ that minimizes a sample-based estimate of the variance of $\hat{J}^{\lambda_1, \lambda_2}$.

Lemma 4 (Adaptive variance mitigation).

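If $A = [\lambda_1, \lambda_2]^\top$ and $B = [\psi^\theta \hat{v}, \psi^\theta \bar{q}]$, where $A \in \mathbb{R}^{2 \times 1}$, $B \in \mathbb{R}^{d \times 2}$ with $d$ the number of policy parameters, and $C = \psi^\theta q^\theta \in \mathbb{R}^{d \times 1}$, then the $A$ that minimizes the variance of $\hat{J}^{\lambda_1, \lambda_2} = C + BA$ is given by

$$A = -\big(\mathbf{E}[B^\top B]\big)^{-1}\, \mathbf{E}[B^\top C]. \tag{9}$$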


See Appendix D. ∎

Lemma 4 provides the values for $\lambda_1$ and $\lambda_2$ that result in the minimal variance of $\hat{J}^{\lambda_1, \lambda_2}$. Note that the computational cost associated with evaluating the inverse of $\mathbf{E}[B^\top B]$ is negligible because its dimension is always $2 \times 2$, independent of the number of policy parameters. Also, Lemma 4 provides the optimal values of $\lambda_1$ and $\lambda_2$, which still must be approximated using sample-based estimates of $B$ and $C$. Furthermore, one might use double sampling for $B$ to get unbiased estimates of the variance-minimizing value of $A$ (Baird, 1995). However, as Property 2 ensures that estimates of $\nabla J(\theta)$ for any value of $\lambda_1$ and $\lambda_2$ are always unbiased, we opt to use all the available samples for estimating $B$ and $C$. Detailed step-by-step pseudocode for optimizing $\lambda_1$ and $\lambda_2$, constructing the baselines, and using them within a SAS policy gradient algorithm is provided in Algorithm 1 in Appendix E.
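A sample-based version of this computation can be sketched as a two-variable least-squares problem. The code below is an illustration under assumptions, not the paper's pseudocode: it takes the combined estimator to have the form "score times (return estimate plus weighted baselines)", replaces expectations with batch averages, and uses hypothetical names and array shapes throughout.

```python
import numpy as np

def optimal_lambdas(psi, q_hat, v_hat, q_bar):
    """Estimate the variance-minimizing weights for the two baselines.

    psi   : (n, d) score vectors, one row per sample.
    q_hat : (n,) return or action-value estimates.
    v_hat : (n,) state-conditioned baseline values.
    q_bar : (n,) action-set-conditioned baseline values.
    """
    B = np.stack([psi * v_hat[:, None], psi * q_bar[:, None]], axis=2)  # (n, d, 2)
    C = psi * q_hat[:, None]                                            # (n, d)
    BtB = np.einsum("ndi,ndj->ij", B, B) / len(psi)   # 2 x 2, whatever d is
    BtC = np.einsum("ndi,nd->i", B, C) / len(psi)     # length 2
    return -np.linalg.solve(BtB, BtC)

rng = np.random.default_rng(0)
psi = rng.normal(size=(50, 3))
v_hat = rng.normal(size=50)
q_bar = rng.normal(size=50)
# Sanity case: if the return estimate equals v_hat exactly, the first
# baseline should be subtracted in full and the second ignored.
lam = optimal_lambdas(psi, v_hat.copy(), v_hat, q_bar)
```

Solving the 2-by-2 system with `np.linalg.solve` avoids forming an explicit inverse; the cost of building the two small moment matrices is linear in the number of policy parameters.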

7 Empirical Analysis

Figure 2: (Top) Best performing learning curves on the domains considered. Each action is available in the action set with a fixed probability. (Bottom) Autonomously adapted values of $\lambda_1$ and $\lambda_2$ associated with $\hat{v}$ and $\bar{q}$, respectively, for the SAS-PG results. Shaded regions correspond to one standard deviation across trials.


In this section we use empirical studies to answer the following three questions: (a) How do our proposed algorithms, SAS policy gradient (SAS-PG) and SAS natural policy gradient (SAS-NPG), compare to the prior method SAS-Q-learning? (b) How does our adaptive variance reduction technique weight the two baselines over the training duration? (c) What impact does the probability of action availability have on the performances of SAS-PG, SAS-NPG, and SAS-Q-learning? To evaluate these aspects, we first briefly introduce three domains inspired by real-world problems.

Routing in San Francisco.

This task models the problem of finding shortest paths in San Francisco, and was first presented with stochastic actions by Boutilier et al. (2018). Stochastic actions model the concept that certain paths in the road network may not be available at certain times. A positive reward is provided to the agent when it reaches the destination, while a small penalty is applied at every time step. We modify the domain presented by Boutilier et al. (2018) so that the starting state of the agent is not one particular node, but rather is uniformly randomly chosen among all possible locations. This makes the problem more challenging, since it requires the agent to learn the shortest path from every node. All the states (nodes) are discrete, and edges correspond to the action choices. Each edge is made available with some fixed probability. The overall map is shown in Figure 4.

Robot locomotion task in a maze.

In this domain, the agent has to navigate a maze using unreliable actuators. The agent starts at the bottom left corner and a goal reward is given when it reaches the goal position, marked by a star in Figure 4. The agent is penalized at each time step to encourage it to reach the goal as quickly as possible. The state space is continuous, and corresponds to real-valued Cartesian coordinates of the agent’s position. The agent has actuators pointing in different directions. Turning each actuator on moves the agent in the direction of the actuator. However, each actuator is unreliable, and is therefore only available with some fixed probability.

Product recommender system.

In online marketing and sales, product recommendation is a popular problem. Due to factors such as stock outages, promotions, and delivery issues, not all products can always be recommended. To model this, we consider a synthetic setup of providing recommendations to a user from a batch of products, each available with some fixed probability and associated with a stochastic reward corresponding to profit. Each user has a real-valued context, which forms the state space, and the recommender system interacts with a randomly chosen user for a fixed number of steps. The goal for the recommender system is to suggest products that maximize total profit. Often the problem of recommendation is formulated as a contextual bandit or collaborative filtering problem, but as shown by Theocharous et al. (2015), these approaches fail to capture the long-term value of the prediction. Hence we resort to the full RL setup.

7.1 Results

Figure 3: Best performances of different algorithms across different values of probabilities for action availability. The error bars correspond to one standard deviation across trials.

Here we only discuss the representative results for the three major questions of interest. Plots for detailed evaluations are available in Appendix F.2.

(a) For the routing problem in San Francisco, as both the states and actions are discrete, the q-function for each state-action pair has a unique parameter. When no parameters are shared, SAS-Q-learning will not diverge. Therefore, in this domain, we notice that SAS-Q-learning performs similarly to the proposed algorithms. However, in many large-scale problems, the use of function approximators is crucial for estimating the optimal policy. For the robot locomotion task in the maze domain and the recommender system, the state space is not discrete and hence function approximators are required to obtain the state features. As we saw in Section 4, the sharing of state features can create problems for SAS-Q-learning. The increased variance in the performance of SAS-Q-learning is visible in both the Maze and the Recommender system domains in Figure 2. While the SAS-Q method eventually performs the same on the Maze domain, its performance improvement saturates quickly in the recommender system domain thus resulting in a sub-optimal policy.

(b) To provide visual intuition for the behavior of adaptive variance mitigation, we report the values of $\lambda_1$ and $\lambda_2$ over the training duration in Figure 2. As several factors are combined through (9) to influence the $\lambda$ values, it is hard to pinpoint any individual factor that is responsible for the observed trend. However, note that for both the routing problem in San Francisco and the robot navigation in the maze, the goal reward is obtained on reaching the destination and intermediate actions do not impact the total return significantly. Intuitively, this makes the action-set-conditioned baseline, $\bar{q}$, similarly correlated with the observed return as the state-only-conditioned baseline, $\hat{v}$, but at the expense of estimating significantly more parameters. Thus the importance of $\bar{q}$ is automatically adapted to be closer to zero. On the other hand, in the recommender system, each product has a significant amount of associated reward. Therefore, the total return possible during each episode has a strong dependency on the available action set, and thus the magnitude of the weight for $\bar{q}$ is much larger than that for $\hat{v}$.

(c) To understand the impact of the probability of an action being available, we report the best performances for all the algorithms for different probability values in Figure 3. We notice that in the San Francisco routing domain, SAS-Q-learning has a slight edge over the proposed methods. This can be attributed to the fact that off-policy samples can be re-used without causing any divergence problems as state features are not shared. For the maze and the recommender system tasks, where function approximators are necessary, the proposed methods significantly out-perform SAS-Q.

8 Conclusion

Building upon the SAS-MDP framework of Boutilier et al. (2018), we studied an under-addressed problem of dealing with MDPs with stochastic action sets. We highlighted some of the limitations of the existing method and addressed them by generalizing policy gradient methods for SAS-MDPs. Additionally, we introduced a novel baseline and an adaptive variance reduction technique unique to this setting. Our approach has several benefits. Not only does it generalize the theoretical properties of standard policy gradient methods, but it is also practically efficient and simple to implement.


  • Gendreau et al. (1996) Michel Gendreau, Gilbert Laporte, and René Séguin. Stochastic vehicle routing. European Journal of Operational Research, 1996.
  • Ribeiro et al. (2008) Alejandro Ribeiro, Nikolaos D Sidiropoulos, and Georgios B Giannakis. Optimal distributed stochastic routing algorithms for wireless multihop networks. IEEE Transactions on Wireless Communications, 2008.
  • Tan and Srikant (2012) Bo Tan and Rayadurgam Srikant. Online advertisement, optimization and stochastic networks. IEEE Transactions on Automatic Control, 2012.
  • Mahdian et al. (2007) Mohammad Mahdian, Hamid Nazerzadeh, and Amin Saberi. Allocating online advertisement space with unreliable estimates. In Proceedings of the 8th ACM conference on Electronic commerce. ACM, 2007.
  • Feng and Yan (2000) Youyi Feng and Houmin Yan. Optimal production control in a discrete manufacturing system with unreliable machines and random demands. IEEE Transactions on Automatic Control, 2000.
  • Harper and Skiba (2007) Gregory W Harper and Steven Skiba. User-personalized media sampling, recommendation and purchasing system using real-time inventory database, 2007. US Patent 7,174,312.
  • Boutilier et al. (2018) Craig Boutilier, Alon Cohen, Amit Daniely, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov, and Dale Schuurmans. Planning and learning with stochastic action sets. In IJCAI, 2018.
  • Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Kanade et al. (2009) Varun Kanade, H. Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS, 2009.
  • Kleinberg et al. (2010) Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine learning, 80(2-3):245–272, 2010.
  • Papadimitriou and Yannakakis (1991) Christos H Papadimitriou and Mihalis Yannakakis. Shortest paths without a map. Theoretical Computer Science, 84(1):127–150, 1991.
  • Polychronopoulos and Tsitsiklis (1996) George H Polychronopoulos and John N Tsitsiklis. Stochastic shortest path problems with recourse. Networks: An International Journal, 27(2):133–143, 1996.
  • Nikolova et al. (2006) Evdokia Nikolova, Matthew Brand, and David R Karger. Optimal route planning under uncertainty. In ICAPS, volume 6, pages 131–141, 2006.
  • Nikolova and Karger (2008) Evdokia Nikolova and David R Karger. Route planning under uncertainty: The canadian traveller problem. In AAAI, pages 969–974, 2008.
  • Baird (1995) Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
  • Tsitsiklis and Roy (1983) JN Tsitsiklis and BV Roy. An analysis of temporal-difference with function approximation. IEEE Trans. Autom. Control, 42(5):834–836, 1983.
  • Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • Kakade (2002) Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
  • Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • Thomas et al. (2018) Philip Thomas, Christoph Dann, and Emma Brunskill. Decoupling gradient-like learning rules from representations. In International Conference on Machine Learning, 2018.
  • Amari and Douglas (1998) Shun-Ichi Amari and Scott C Douglas. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181). IEEE, 1998.
  • Bagnell and Schneider (2003) J. Andrew Bagnell and Jeff G. Schneider. Covariant policy search. In IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003.
  • Bhatnagar et al. (2008) Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, and Richard S Sutton. Incremental natural actor-critic algorithms. In Advances in neural information processing systems, pages 105–112, 2008.
  • Degris et al. (2012) T. Degris, P. M. Pilarski, and R. S. Sutton. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.
  • Morimura et al. (2005) T. Morimura, E. Uchibe, and K. Doya. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Application, pages 256–263, 2005.
  • Thomas and Barto (2012) P. S. Thomas and A. G. Barto. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, pages 1–8, 2012.
  • Kakade et al. (2003) Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.
  • Thomas and Brunskill (2017) Philip S Thomas and Emma Brunskill. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. arXiv preprint arXiv:1706.06643, 2017.
  • Grathwohl et al. (2017) Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.
  • Liu et al. (2017) Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-depedent control variates for policy optimization via stein’s identity. arXiv preprint arXiv:1710.11198, 2017.
  • Wu et al. (2018) Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.
  • Tucker et al. (2018) George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031, 2018.
  • Graybill and Deal (1959) Franklin A Graybill and RB Deal. Combining unbiased estimators. Biometrics, 15(4):543–550, 1959.
  • Meir et al. (1994) Ronny Meir et al. Bias, variance and the combination of estimators: The case of linear least squares. Citeseer, 1994.
  • Wang et al. (2013) Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, 2013.
  • Geffner and Domke (2018) Tomas Geffner and Justin Domke. Using large ensembles of control variates for variational inference. In Advances in Neural Information Processing Systems, 2018.
  • Theocharous et al. (2015) Georgios Theocharous, Philip S Thomas, and Mohammad Ghavamzadeh. Ad recommendation systems for life-time value optimization. In Proceedings of the 24th International Conference on World Wide Web, pages 1305–1310. ACM, 2015.
  • Bertsekas and Tsitsiklis (2000) Dimitri P Bertsekas and John N Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
  • Amari and Nagaoka (2007) Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
  • Thomas (2014) Philip Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pages 441–448, 2014.
  • Konidaris et al. (2011) George Konidaris, Sarah Osentoski, and Philip Thomas. Value function approximation in reinforcement learning using the fourier basis. In Twenty-fifth AAAI conference on artificial intelligence, 2011.

Reinforcement Learning When All Actions are Not Always Available (Supplementary Material)

Appendix A SAS Policy Gradient

Lemma 1 (SAS Policy Gradient).

For all ,


where (16) comes from unrolling the Bellman equation. We started with the partial derivative of the value of a state, expanded the definition of the value of a state, and obtained an expression in terms of the partial derivative of the value of another state. Now, we again expand using the definition of the state-value function and the Bellman equation.


Expanding allowed us to write it in terms of the partial derivative of yet another state, . We could continue this process, “unravelling” the recurrence further. Each time that we expand the partial derivative of the value of a state with respect to the parameters, we get another term. The first two terms that we have obtained are marked above. If we were to unravel the expression more times, by expanding and then differentiating, we would obtain the subsequent third, fourth, etc., terms.

Finally, to get the desired result, we expand the start-state objective and take the derivative with respect to it,


Combining results from (30) and (31), we index each term by $t$, with the first term being $t = 0$, the second $t = 1$, etc., which results in the expression:


Notice that to get the gradient with respect to , we have included a sum over all the states weighted by, , the start state probability. When , the only state where is not zero will be when (at which point this probability is one). This allows us to succinctly represent all the terms. With this we conclude the proof. ∎
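The lemma can be sanity-checked numerically. The sketch below is an illustration only, not the authors' code: it builds a tiny SAS-MDP with made-up transition, reward, and action-set probabilities (all names such as `PHI`, `P`, `R`, and `D0` are hypothetical, and the action-set distribution is assumed not to depend on the state), uses a softmax policy restricted to the available actions, and compares the sum over time steps, states, action sets, and actions described by the lemma against a finite-difference gradient of the start-state objective.

```python
import itertools
import math

GAMMA = 0.9
STATES = [0, 1]
BASE_ACTIONS = [0, 1, 2]
# All non-empty subsets of the base actions, with made-up probabilities
# phi(alpha) that (for brevity) do not depend on the state.
ACTION_SETS = [frozenset(c) for r in (1, 2, 3)
               for c in itertools.combinations(BASE_ACTIONS, r)]
PHI = {alpha: w / 28.0 for alpha, w in zip(ACTION_SETS, (1, 2, 3, 4, 5, 6, 7))}
# P[s][a][s2]: transition probabilities; R[s][a]: expected rewards;
# D0[s]: start-state distribution. All values are assumptions for this toy.
P = [[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
     [[0.3, 0.7], [0.6, 0.4], [0.1, 0.9]]]
R = [[1.0, 0.0, 0.5], [0.0, 2.0, -1.0]]
D0 = [0.6, 0.4]


def policy(theta, s, alpha):
    """Softmax restricted to the available actions; returns {action: prob}."""
    z = sum(math.exp(theta[s][b]) for b in alpha)
    return {a: math.exp(theta[s][a]) / z for a in alpha}


def q_value(v, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)


def state_values(theta, sweeps=400):
    """Iterative policy evaluation; the expectation runs over alpha and a."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [sum(PHI[al] * sum(p * q_value(v, s, a)
                               for a, p in policy(theta, s, al).items())
                 for al in ACTION_SETS) for s in STATES]
    return v


def objective(theta):
    v = state_values(theta)
    return sum(D0[s] * v[s] for s in STATES)


def sas_policy_gradient(theta, horizon=400):
    """Sum of gamma^t Pr(S_t=s) phi(alpha) q(s,a) dpi(s,alpha,a)/dtheta."""
    v = state_values(theta)
    grad = [[0.0] * len(BASE_ACTIONS) for _ in STATES]
    p_t = list(D0)  # Pr(S_t = s), starting from the start-state distribution
    for t in range(horizon):
        p_next = [0.0 for _ in STATES]
        for s in STATES:
            for al in ACTION_SETS:
                pi = policy(theta, s, al)
                for a in al:
                    weight = (GAMMA ** t) * p_t[s] * PHI[al] * q_value(v, s, a)
                    for b in al:  # softmax derivative w.r.t. theta[s][b]
                        grad[s][b] += weight * pi[a] * ((a == b) - pi[b])
                    for s2 in STATES:
                        p_next[s2] += p_t[s] * PHI[al] * pi[a] * P[s][a][s2]
        p_t = p_next
    return grad


def finite_difference_gradient(theta, eps=1e-5):
    grad = [[0.0] * len(BASE_ACTIONS) for _ in STATES]
    for s in STATES:
        for b in BASE_ACTIONS:
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[s][b] += eps
            down[s][b] -= eps
            grad[s][b] = (objective(up) - objective(down)) / (2 * eps)
    return grad
```

Calling `sas_policy_gradient(theta)` and `finite_difference_gradient(theta)` on the same parameter table, for example `[[0.1, -0.3, 0.5], [0.2, 0.0, -0.4]]`, should give entries that agree up to finite-difference error. The infinite sums are truncated at a fixed horizon, which is an approximation that is harmless here because the discount factor is below one.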

Appendix B Convergence

Assumption A1 (Differentiable).

For any state, action-set, and action triplet $(s, \alpha, a)$, the policy $\pi^\theta(s, \alpha, a)$ is continuously differentiable in the parameter $\theta$.

Assumption A2 (Lipschitz smooth gradient).

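Let $\Theta$ denote the set of all possible parameters for policy $\pi^\theta$, then for some constant $L$,

$$\left\lVert \frac{\partial J(\theta)}{\partial \theta} - \frac{\partial J(\theta')}{\partial \theta'} \right\rVert \le L \left\lVert \theta - \theta' \right\rVert \quad \forall\, \theta, \theta' \in \Theta.$$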

Assumption A3 (Learning rate schedule).

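Let $\eta_t$ be the learning rate for updating policy parameters $\theta$ at step $t$, then,

$$\sum_{t=0}^{\infty} \eta_t = \infty, \qquad \sum_{t=0}^{\infty} \eta_t^2 < \infty.$$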

Lemma 2.

Under Assumptions (A1)-(A3), the SAS policy gradient algorithm causes $\partial J(\theta_t)/\partial \theta \to 0$ as $t \to \infty$, with probability one.


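Following the standard result on the convergence of gradient ascent (descent) methods [Bertsekas and Tsitsiklis, 2000], we know that under Assumptions (A1)-(A3), either $J(\theta_t) \to \infty$ or $\partial J(\theta_t)/\partial \theta \to 0$ as $t \to \infty$. However, the maximum possible reward is $R_{\max}$ and $\gamma < 1$, therefore $J(\theta)$ is bounded above by $R_{\max}/(1-\gamma)$. Hence $J(\theta_t)$ cannot go to $\infty$ and we get the desired result. ∎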
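The argument can be illustrated on a toy problem. The sketch below is not the SAS policy gradient algorithm itself: it assumes a hypothetical three-armed bandit with a softmax policy, uses the exact gradient rather than a stochastic estimate, and picks a decaying step size whose sum diverges while the sum of its squares converges. Under those assumptions the objective stays bounded above by the largest reward, so the only remaining possibility is that the gradient norm shrinks.

```python
import math

REWARDS = [1.0, 0.5, 0.2]  # hypothetical bandit; max(REWARDS) bounds the objective


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in theta]
    z = sum(e)
    return [x / z for x in e]


def objective(theta):
    return sum(p * r for p, r in zip(softmax(theta), REWARDS))


def gradient(theta):
    # For a softmax bandit, dJ/dtheta_a = pi_a * (r_a - J).
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, REWARDS))
    return [p * (r - j) for p, r in zip(pi, REWARDS)]


def grad_norm(theta):
    return math.sqrt(sum(g * g for g in gradient(theta)))


def learning_rate(t, power=0.6):
    # (t + 1)^(-power) with 0.5 < power <= 1: the sum of the rates diverges
    # while the sum of their squares converges.
    return (t + 1) ** (-power)


def gradient_ascent(steps):
    theta = [0.0, 0.0, 0.0]
    for t in range(steps):
        g = gradient(theta)
        theta = [x + learning_rate(t) * gi for x, gi in zip(theta, g)]
    return theta
```

Comparing `grad_norm` and `objective` at the initial parameters and at `gradient_ascent(20000)` shows the objective rising toward, but never exceeding, `max(REWARDS)` while the gradient norm decays.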

Appendix C SAS Natural Policy Gradient

Property 1 (Fisher Information Matrix).

For a policy, parameterized using weights , the Fisher information matrix is given by,


where, .


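To prove this result, we first note the following relation by Amari and Nagaoka [2007], which connects the Hessian and the FIM of a random variable $X$ parameterized using $\theta$,

$$\mathbb{E}\left[\frac{\partial^2 \log \Pr(X)}{\partial \theta^2}\right] = -\,\mathbb{E}\left[\frac{\partial \log \Pr(X)}{\partial \theta} \left(\frac{\partial \log \Pr(X)}{\partial \theta}\right)^{\top}\right]. \qquad (34)$$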
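The relation between the expected Hessian of the log-probability and the FIM can be checked numerically for a single decision step. The sketch below is an illustration under stated assumptions: a stateless setting, a softmax policy restricted to the available actions, and a made-up action-set distribution `PHI` (all names are hypothetical). Because the action-set probability does not depend on the policy parameters, only the log-policy term contributes to both sides.

```python
import itertools
import math

BASE_ACTIONS = [0, 1, 2]
ACTION_SETS = [frozenset(c) for r in (1, 2, 3)
               for c in itertools.combinations(BASE_ACTIONS, r)]
# Made-up probabilities of each non-empty action set (they sum to one).
PHI = {alpha: w / 28.0 for alpha, w in zip(ACTION_SETS, (1, 2, 3, 4, 5, 6, 7))}
N = len(BASE_ACTIONS)


def log_pi(theta, alpha, a):
    """Log-probability of action a under a softmax over the set alpha."""
    return theta[a] - math.log(sum(math.exp(theta[b]) for b in alpha))


def score(theta, alpha, a):
    """Gradient of log_pi w.r.t. theta; zero for unavailable actions."""
    z = sum(math.exp(theta[b]) for b in alpha)
    return [((b == a) - math.exp(theta[b]) / z) if b in alpha else 0.0
            for b in BASE_ACTIONS]


def fisher_from_scores(theta):
    """Expected outer product of the score over alpha ~ PHI and a ~ pi."""
    fim = [[0.0] * N for _ in range(N)]
    for alpha in ACTION_SETS:
        for a in alpha:
            prob = PHI[alpha] * math.exp(log_pi(theta, alpha, a))
            psi = score(theta, alpha, a)
            for i in range(N):
                for j in range(N):
                    fim[i][j] += prob * psi[i] * psi[j]
    return fim


def fisher_from_hessian(theta, eps=1e-4):
    """Negative expected Hessian of log_pi, via central finite differences."""
    fim = [[0.0] * N for _ in range(N)]
    for alpha in ACTION_SETS:
        for a in alpha:
            prob = PHI[alpha] * math.exp(log_pi(theta, alpha, a))

            def shifted(i, di, j, dj):
                t = list(theta)
                t[i] += di
                t[j] += dj
                return log_pi(t, alpha, a)

            for i in range(N):
                for j in range(N):
                    d2 = (shifted(i, eps, j, eps) - shifted(i, eps, j, -eps)
                          - shifted(i, -eps, j, eps)
                          + shifted(i, -eps, j, -eps)) / (4 * eps * eps)
                    fim[i][j] -= prob * d2
    return fim
```

For any parameter vector, for example `[0.3, -0.2, 0.1]`, the two functions should return matrices that agree up to finite-difference error.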


Now, let denote the random variable corresponding to the trajectories observed using policy . Let denote an outcome of , then the probability of observing this trajectory, , is given by,




We know that Fisher Information Matrix for a random variable, which in our case is , is given by,

(Using Equation (34)) (41)
(Using Equation (39)) (42)

where the summation over corresponds to all possible values of and for every step in the trajectory. Expanding the inner summation in (43),