1 Introduction
In many real-world sequential decision-making problems, the set of available decisions, which we call the action set, is stochastic. In vehicular routing on a road network (Gendreau et al., 1996) or packet routing on the internet (Ribeiro et al., 2008), the goal is to find the shortest path between a source and a destination. However, due to construction, traffic, or other damage to the network, not all pathways are always available. In online advertising (Tan and Srikant, 2012; Mahdian et al., 2007), the set of available ads can vary due to fluctuations in advertising budgets and promotions. In robotics (Feng and Yan, 2000), actuators can fail. In recommender systems (Harper and Skiba, 2007), the set of possible recommendations can vary based on product availability. These examples capture the broad idea and motivate the question we aim to address: how can we develop efficient learning algorithms for sequential decision-making problems wherein the action set can be stochastic?
Sequential decision-making problems without stochastic action sets are typically modeled as Markov decision processes (MDPs). Although the MDP formulation is remarkably flexible, and can incorporate concepts like stochastic state transitions, partial observability, and even different (deterministic) action availability depending on the state, it does not allow for stochastic action sets. As a result, algorithms designed for MDPs are not suited to our setting of interest. Recently, Boutilier et al. (2018) proposed a new problem formulation, stochastic action set Markov decision processes (SAS-MDPs), that extends MDPs to include stochastic action sets. They also showed how the Q-learning and value iteration algorithms, two classic algorithms for approximating optimal solutions to MDPs, can be extended to SAS-MDPs.
In this paper we show that the convergence problems of the Q-learning algorithm when using function approximators, already a concern in the MDP setting, can be exacerbated in the SAS-MDP setting. We therefore derive policy gradient and natural policy gradient algorithms for the SAS-MDP setting and provide conditions for their almost-sure convergence. Furthermore, since stochastic action sets introduce additional uncertainty into the decision-making process, variance reduction techniques are of increased importance. We therefore derive new approaches to variance reduction for policy gradient algorithms that are unique to the SAS-MDP setting. We validate our new algorithms empirically on tasks inspired by real-world problems with stochastic action sets.
2 Related Work
While there is extensive literature on solving sequential decision problems modeled as MDPs (Puterman, 2014; Sutton and Barto, 2018), there are few methods designed to handle stochastic action sets. Recently, Boutilier et al. (2018) laid the foundation for studying MDPs with stochastic action sets by defining the new SAS-MDP problem formulation, which we review in the background section. After defining SAS-MDPs, Boutilier et al. (2018) presented and analyzed the model-based value iteration and policy iteration algorithms, and the model-free Q-learning algorithm, for SAS-MDPs.
In the bandit setting, wherein individual decisions are optimized rather than sequences of dependent decisions, sleeping bandits extend the standard bandit problem formulation to allow for stochastic action sets (Kanade et al., 2009; Kleinberg et al., 2010). We focus on the SAS-MDP formulation rather than the sleeping bandit formulation because we are interested in sequential problems. Such sequential problems are more challenging because making optimal decisions requires one to reason about the long-term impact of decisions, which includes reasoning about how a decision will influence the probability that different actions (decisions) will be available in the future.
Although we focus on the model-free setting, wherein the dynamics of the environment are not known a priori to the agent optimizing its decisions, in the alternative model-based setting researchers have considered related problems in the area of stochastic routing (Papadimitriou and Yannakakis, 1991; Polychronopoulos and Tsitsiklis, 1996; Nikolova et al., 2006; Nikolova and Karger, 2008). In stochastic routing problems, the goal is to find a shortest path on a graph whose edges have stochastic availability. The SAS-MDP framework generalizes stochastic routing by allowing for sequential decision-making problems that are not limited to shortest-path problems.
3 Background
MDPs and SAS-MDPs (Boutilier et al., 2018) are mathematical formulations of sequential decision problems. Before defining SAS-MDPs, we define MDPs. We refer to the entity interacting with an MDP or SAS-MDP and trying to optimize its decisions as the agent.
Formally, an MDP is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, d_0)$. $\mathcal{S}$ is the set of all possible states that the agent can be in, called the state set. Although our notation assumes that $\mathcal{S}$ is countable, our primary results extend to MDPs with continuous states. $\mathcal{A}$ is a finite set of all possible actions that the agent can take, called the base action set. $S_t$ and $A_t$ are random variables that denote the state of the environment and the action chosen by the agent at time $t$. $\mathcal{P}$ is called the transition function and characterizes how states transition: $\mathcal{P}(s, a, s') := \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$. $R_t \in [-R_{\max}, R_{\max}]$, a bounded random variable, is the scalar reward received by the agent at time $t$, where $R_{\max}$ is a finite constant. $\mathcal{R}$ is called the reward function, and is defined as $\mathcal{R}(s, a) := \mathbf{E}[R_t \mid S_t = s, A_t = a]$. The reward discount parameter, $\gamma \in [0, 1)$, characterizes how the utility of rewards to the agent decays based on how far in the future they occur. We call $d_0$ the start state distribution, which is defined as $d_0(s) := \Pr(S_0 = s)$.

We now turn to defining a SAS-MDP. Let the set of actions available at time $t$ be a random variable, $\mathcal{A}_t \subseteq \mathcal{A}$, which we assume is never empty: $\mathcal{A}_t \neq \emptyset$. Let $\varphi$ characterize the conditional distribution of $\mathcal{A}_t$: $\varphi(\alpha \mid s) := \Pr(\mathcal{A}_t = \alpha \mid S_t = s)$. We assume that $\mathcal{A}_t$ is Markovian, in that its distribution is conditionally independent of all events prior to the agent entering state $S_t$ given $S_t$. Formally, a SAS-MDP is $(\mathcal{S}, \mathcal{A}, \varphi, \mathcal{P}, \mathcal{R}, \gamma, d_0)$, with the additional requirement that $A_t \in \mathcal{A}_t$.
A policy is a conditional distribution over actions for each state and available action set: $\pi(a \mid s, \alpha) := \Pr(A_t = a \mid S_t = s, \mathcal{A}_t = \alpha)$ for all $s \in \mathcal{S}$, $\alpha \subseteq \mathcal{A}$, and $a \in \alpha$. Sometimes a policy is parameterized by a weight vector $\theta$, such that changing $\theta$ changes the policy. We write $\pi^\theta$ to denote such a parameterized policy with weight vector $\theta$. For any policy $\pi$, we define the corresponding state-action value function to be $q^\pi(s, a) := \mathbf{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, A_t = a, \pi]$, where conditioning on $\pi$ denotes that $A_j \sim \pi(\cdot \mid S_j, \mathcal{A}_j)$ for all $j > t$. Similarly, the state-value function associated with policy $\pi$ is $v^\pi(s) := \mathbf{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, \pi]$. For a given SAS-MDP, the agent's goal is to find an optimal policy, $\pi^*$ (or, equivalently, optimal policy parameters $\theta^*$), which is any policy that maximizes the expected sum of discounted future rewards. More formally, an optimal policy is any $\pi^* \in \arg\max_{\pi \in \Pi} J(\pi)$, where $J(\pi) := \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid \pi]$ and $\Pi$ denotes the set of all possible policies. For notational convenience, we sometimes use $\theta$ in place of $\pi^\theta$, e.g., to write $J(\theta)$, $v^\theta$, or $q^\theta$, since a weight vector induces a specific policy.

One way to model stochastic action sets using the MDP formulation rather than the SAS-MDP formulation is to define states such that one can infer $\mathcal{A}_t$ from $S_t$. Transforming an MDP into a new MDP with $\mathcal{A}_t$ embedded in $S_t$ in this way can result in the size of the state set growing exponentially, by a factor of up to $2^{|\mathcal{A}|}$. This drastic increase in the size of the state set can make finding or approximating an optimal policy prohibitively difficult. Using the SAS-MDP formulation, the challenges associated with this exponential increase in the size of the state set can be avoided, and one can derive algorithms for finding or approximating optimal policies in terms of the state set of the original underlying MDP. This is accomplished using a variant of the Bellman operator, $\mathcal{T}^\pi$, which incorporates the concept of stochastic action sets:
$(\mathcal{T}^\pi v)(s) = \sum_{\alpha \subseteq \mathcal{A}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \pi(a \mid s, \alpha) \Big( \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s, a, s')\, v(s') \Big), \quad (1)$
for all $s \in \mathcal{S}$. Similarly, one can extend the Bellman optimality operator (Sutton and Barto, 2018):
$(\mathcal{T} v)(s) = \sum_{\alpha \subseteq \mathcal{A}} \varphi(\alpha \mid s) \max_{a \in \alpha} \Big( \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s, a, s')\, v(s') \Big). \quad (2)$
Showing an equivalence between the fixed point of this modified Bellman operator and the fixed point of the standard Bellman operator on the MDP with embedded actions, Boutilier et al. (2018) proposed the following update for a tabular estimate, $q$, of $q^*$:

$q(S_t, A_t) \leftarrow q(S_t, A_t) + \eta_t \Big( R_t + \gamma \max_{a' \in \mathcal{A}_{t+1}} q(S_{t+1}, a') - q(S_t, A_t) \Big). \quad (3)$
Notice that the maximum is computed only over the available actions, $\mathcal{A}_{t+1}$, in state $S_{t+1}$. We refer to the algorithm using this update rule as SAS-Q-learning.
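As a concrete illustration, the update in (3) is only a few lines; the following is a minimal tabular sketch, where `sas_q_update`, the dictionary representation of `q`, and the toy values are our own illustrative choices rather than the paper's implementation:

```python
def sas_q_update(q, s, a, r, s_next, avail_next, alpha=0.1, gamma=0.95):
    """One tabular SAS-Q-learning step: the bootstrap max is taken only
    over the actions actually available in the next state (Eq. 3)."""
    target = r + gamma * max(q[(s_next, ap)] for ap in avail_next)
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q

# Toy usage: two states, two actions, only action 1 available next.
q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
q[(1, 0)], q[(1, 1)] = 5.0, 1.0
q = sas_q_update(q, s=0, a=0, r=1.0, s_next=1, avail_next=[1])
# The bootstrap uses q[(1, 1)] = 1.0, not the larger but unavailable q[(1, 0)].
```

Note that ordinary Q-learning would instead bootstrap from the larger value 5.0, which is exactly the difference the SAS formulation captures.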
4 Limitations of SAS-Q-Learning
Although SAS-Q-learning provides a powerful first model-free algorithm for approximating optimal policies for SAS-MDPs, it inherits several of the drawbacks of the Q-learning algorithm for MDPs. Just like Q-learning, in a state $s$ with available actions $\alpha$, the SAS-Q-learning method chooses actions deterministically when not exploring: $a \in \arg\max_{a' \in \alpha} q(s, a')$. This limits its practicality for problems where optimal policies are stochastic, which is often the case when the environment is partially observable or when the use of function approximation causes state aliasing (Baird, 1995). Additionally, if the SAS-Q-learning update converges to an estimate, $q$, of $q^*$ such that $q(s, a) = q^*(s, a)$ for all $s$ and $a$, then the agent will act optimally; however, convergence to a fixed point of $\mathcal{T}$ is seldom achieved in practice, and reducing the difference between $q$ and $\mathcal{T} q$ (what SAS-Q-learning aims to do) does not ensure improvement of the policy (Sutton and Barto, 2018).
SAS-Q-learning does not perform gradient ascent or descent on any objective function, and, just like Q-learning for MDPs, it can cause the estimator to diverge when using function approximation (Baird, 1995). Furthermore, we contend that the divergent behavior of SAS-Q-learning can in some cases be more severe than that of the Q-learning algorithm for MDPs. That is, in cases where Q-learning converges, SAS-Q-learning can diverge.
To see this, consider the SAS variant of the classical counterexample MDP (Tsitsiklis and Van Roy, 1997) illustrated in Figure 1. In this example there are two states, $s_1$ (left in Figure 1) and $s_2$ (right), and two actions, left and right. The agent uses function approximation (Sutton and Barto, 2018) with a scalar weight $\theta$, such that $\hat q(s_1, a) = \theta$ for each action $a$, while $\hat q(s_2, \text{left}) = \theta$ and $\hat q(s_2, \text{right}) = 2\theta$. In either state, if the agent takes the left action, it goes to the left state, and if the agent takes the right action, it goes to the right state. In our SAS-MDP version of this problem, both actions are not always available: both actions are always available in $s_1$, but the left action is sometimes unavailable in $s_2$. Consider the case where the weight is initialized to a negative value, $\theta < 0$. Now suppose that a transition is observed from the left state to the right state (with reward zero), and after the transition the left action is not available to the agent. As per the SAS-Q-learning update rule provided in (3), the bootstrap maximum is forced onto the only available action, giving the update $\theta \leftarrow \theta + \eta(\gamma\, 2\theta - \theta)$. If this transition is used repeatedly on its own, then for any $\gamma > 1/2$, irrespective of the learning rate $\eta$, the weight $\theta$ would diverge to $-\infty$. In contrast, had there been no constraint of taking the max over only the available actions, the Q-learning update would have been $\theta \leftarrow \theta + \eta(\gamma\theta - \theta)$, because the left action has a higher q-value than the right action ($\theta > 2\theta$ when $\theta < 0$). This would make $\theta$ converge to the correct value of $0$. This provides an example of how the stochastic constraints on the set of available actions can be instrumental in causing the SAS-Q-learning method to diverge.
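The contrast can be checked numerically. Below is a minimal simulation under one instantiation consistent with the two-state example: a shared weight $\theta$ with $\hat q(s_1, \cdot) = \theta$, $\hat q(s_2, \text{left}) = \theta$, $\hat q(s_2, \text{right}) = 2\theta$, reward zero, $\gamma = 0.9$, and $\theta_0 = -1$. The function names and constants are our own illustrative choices:

```python
def sas_q_step(theta, alpha=0.1, gamma=0.9):
    # SAS-Q-learning: the left action is unavailable in the right state,
    # so the bootstrap max is forced onto q(s_right, right) = 2*theta.
    return theta + alpha * (gamma * 2.0 * theta - theta)

def q_step(theta, alpha=0.1, gamma=0.9):
    # Ordinary Q-learning: both actions available; for theta < 0 the max
    # picks q(s_right, left) = theta, since theta > 2*theta.
    return theta + alpha * (gamma * max(theta, 2.0 * theta) - theta)

theta_sas = theta_q = -1.0
for _ in range(200):
    theta_sas = sas_q_step(theta_sas)
    theta_q = q_step(theta_q)
# theta_sas grows without bound (toward -inf); theta_q shrinks toward 0.
```

Per step the SAS update multiplies $\theta$ by $1 + \alpha(2\gamma - 1) > 1$, while the unconstrained update multiplies it by $1 - \alpha(1 - \gamma) < 1$, which is the whole divergence-versus-convergence story in one line each.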
5 Policy Gradient Methods for SAS-MDPs
In this section we derive policy gradient algorithms (Sutton et al., 2000) for the SAS-MDP setting. While the Q-learning algorithm minimizes the error between $q(s, a)$ and $\mathcal{T} q(s, a)$ for all states (using a procedure that is not a gradient algorithm), policy gradient algorithms perform stochastic gradient ascent on the objective function $J$. That is, they use the update $\theta_{t+1} = \theta_t + \eta_t \hat\Delta_t$, where $\hat\Delta_t$ is an unbiased estimator of $\nabla J(\theta_t)$.

Unlike the Q-learning algorithm, policy gradient algorithms for MDPs provide (local) convergence guarantees even when using function approximation, and can approximate optimal stochastic policies. However, ignoring the fact that actions are not always available and using off-the-shelf algorithms for MDPs fails to fully capture the problem setting (Boutilier et al., 2018). It is therefore important that we derive policy gradient algorithms that are appropriate for the SAS-MDP setting, as they provide the first convergent model-free algorithms for SAS-MDPs when using function approximation.
In the following lemma we extend the expression for the policy gradient for MDPs (Sutton et al., 2000) to handle stochastic action sets.
Lemma 1 (SAS Policy Gradient).
For a SAS-MDP, for all $\theta$,

$\nabla J(\theta) = \sum_{t=0}^{\infty} \mathbf{E}\!\left[ \gamma^t\, q^\theta(S_t, A_t)\, \frac{\partial \log \pi^\theta(A_t \mid S_t, \mathcal{A}_t)}{\partial \theta} \right]. \quad (4)$
Proof.
See Appendix A. ∎
It follows from Lemma 1 that we can create unbiased estimates of $\nabla J(\theta)$, which can be used to update $\theta$ using the well-known stochastic gradient ascent algorithm. This algorithm is presented in Algorithm 1 in Appendix E. Notably, this process does not require the agent to know $\varphi$. Notice that in the special case where all actions are always available, the expression in Lemma 1 reduces to the policy gradient theorem for MDPs (Sutton and Barto, 2018). We now establish that SAS policy gradient algorithms are guaranteed to converge to locally optimal policies under standard assumptions: the policy is differentiable (A1), the gradient of $J$ is Lipschitz (A2), and the step sizes are decayed appropriately (A3). Formal assumption statements are deferred to Appendix B.
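To make the estimator concrete, here is a REINFORCE-style sketch of the SAS policy gradient estimate in (4) for a tabular softmax policy that renormalizes over the available action set. The names `sas_reinforce_grad`, the episode format, and the preference table are our own illustrative constructions, not the algorithm from Appendix E:

```python
import numpy as np

def softmax_over_available(prefs, avail):
    """Softmax policy restricted (and renormalized) to the available set."""
    z = np.exp(prefs[avail] - prefs[avail].max())
    return z / z.sum()

def sas_reinforce_grad(episode, prefs, gamma=0.99):
    """Monte-Carlo estimate of the SAS policy gradient:
    sum_t gamma^t G_t grad log pi(a_t | s_t, avail_t).
    `episode` is a list of (state, avail, action, reward) tuples and
    `prefs[s]` holds the softmax preferences for state s."""
    # Discounted returns G_t, computed backwards through the episode.
    returns, g = [], 0.0
    for (_, _, _, r) in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    grads = {s: np.zeros_like(p) for s, p in prefs.items()}
    for t, (s, avail, a, _) in enumerate(episode):
        pi = softmax_over_available(prefs[s], avail)
        glog = np.zeros_like(prefs[s])
        glog[avail] -= pi          # -pi(a'|s, avail) on available actions only
        glog[a] += 1.0             # +1 on the action actually taken
        grads[s] += (gamma ** t) * returns[t] * glog
    return grads
```

Note that the score function is zero on unavailable actions, which is what makes this an estimate of (4) rather than of the ordinary policy gradient.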
Lemma 2 (Almost-sure convergence). Under Assumptions (A1)-(A3), the SAS policy gradient algorithm ensures that $J(\theta_t)$ converges and $\lim_{t \to \infty} \nabla J(\theta_t) = 0$ almost surely.
Proof.
See Appendix B. ∎
Natural policy gradient algorithms (Kakade, 2002) extend policy gradient algorithms to follow the natural gradient of $J$ (Amari, 1998). In essence, whereas policy gradient methods perform gradient ascent in the space of policy parameters by computing the gradient of $J$ as a function of the parameters $\theta$, natural policy gradient methods perform gradient ascent in the space of policies (which are probability distributions) by computing the gradient of $J$ as a function of the policy, $\pi$. Thus, whereas policy gradient methods implicitly measure distances between policies by the Euclidean distance between their policy parameters, natural policy gradient methods measure distances between policies using notions of distance between probability distributions. In the most common form of natural policy gradients, the distances between policies are measured using a Taylor approximation of the Kullback-Leibler divergence (KLD). By performing gradient ascent in the space of policies rather than the space of policy parameters, the natural policy gradient becomes invariant to how the policy is parameterized (Thomas et al., 2018), which can help to mitigate the vanishing gradient problem in neural networks and improve learning speed (Amari and Douglas, 1998).

The natural policy gradient (using a Taylor approximation of KLD to measure distances) is $\widetilde\nabla J(\theta) := F_\theta^{-1} \nabla J(\theta)$, where $F_\theta$ is the Fisher information matrix (FIM) associated with the policy $\pi^\theta$. Although the FIM is a well-known quantity, it is typically associated with a single parameterized probability distribution. Here, $\pi^\theta$ is a collection of probability distributions, one per state. This raises the question of what $F_\theta$ should be when computing the natural policy gradient. Following the work of Bagnell and Schneider (2003) for MDPs, we show that the FIM, $F_\theta$, for computing the natural policy gradient for a SAS-MDP can also be derived by viewing $\pi^\theta$ as a single distribution over possible trajectories (sequences of states, available action sets, and executed actions).
Property 1 (Fisher Information Matrix).
For a policy parameterized using weights $\theta$, the Fisher information matrix is given by

$F_\theta = \sum_{t=0}^{\infty} \mathbf{E}\!\left[ \gamma^t\, \psi(S_t, \mathcal{A}_t, A_t)\, \psi(S_t, \mathcal{A}_t, A_t)^\top \right], \quad (5)$

where $\psi(s, \alpha, a) := \partial \log \pi^\theta(a \mid s, \alpha) / \partial \theta$.
Proof.
See Appendix C. ∎
Furthermore, Kakade (2002) showed that many terms in the definition of the natural policy gradient cancel, providing a simple expression for the natural gradient that can be estimated in time linear in the number of policy parameters per time step. We extend the result of Kakade (2002) to the SAS-MDP formulation in the following lemma:
Lemma 3 (SAS Natural Policy Gradient).
Let $w$ be a parameter vector such that

$w \in \arg\min_{w'} \sum_{t=0}^{\infty} \mathbf{E}\!\left[ \gamma^t \Big( \psi(S_t, \mathcal{A}_t, A_t)^\top w' - q^\theta(S_t, A_t) \Big)^{\!2} \right], \quad (6)$

then $F_\theta^{-1} \nabla J(\theta) = w$.
Proof.
See Appendix C. ∎
From Lemma 3, we can derive a computationally efficient natural policy gradient algorithm by using the well-known temporal difference algorithm (Sutton and Barto, 2018), modified to work with SAS-MDPs, to estimate $q^\theta$ with the approximator $\psi(s, \alpha, a)^\top w$, and then using the update $\theta \leftarrow \theta + \eta w$. This algorithm, which is the SAS-MDP equivalent of NAC-TD (Bhatnagar et al., 2008; Degris et al., 2012; Morimura et al., 2005; Thomas and Barto, 2012), is provided in Algorithm 2 in Appendix E.
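As a rough sketch of how the compatible weights $w$ from Lemma 3 might be estimated, the least-squares problem in (6) can be solved in batch form over collected samples (the paper's algorithm uses temporal differences instead; `natural_gradient_step`, the batch fit, and the ridge term are our own simplifications):

```python
import numpy as np

def natural_gradient_step(psi, adv, theta, eta=0.1, reg=1e-6):
    """One batch natural policy gradient step using compatible features.
    psi: (N, d) matrix whose rows are grad log pi(a_t | s_t, avail_t);
    adv: (N,) q-value (or advantage) samples. The compatible weights w
    solve the least-squares problem psi @ w ~= adv; by the SAS analogue of
    Kakade's result, w is the natural gradient direction."""
    A = psi.T @ psi / len(adv) + reg * np.eye(psi.shape[1])  # ridge for stability
    b = psi.T @ adv / len(adv)
    w = np.linalg.solve(A, b)
    return theta + eta * w
```

The key point the sketch illustrates is that the expensive object, $F_\theta^{-1} \nabla J(\theta)$, never has to be formed explicitly: fitting $w$ by least squares on the score features yields the natural gradient direction directly.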
6 Adaptive Variance Mitigation
In the previous section, we derived (natural) policy gradient algorithms for SAS-MDPs. While these algorithms avoid the divergence of SAS-Q-learning, they suffer from the high variance of policy gradient estimates (Kakade et al., 2003). As a consequence of the additional stochasticity introduced by stochastic action sets, this problem can be even more severe in the SAS-MDP setting. In this section, we leverage insights from the Bellman equation for SAS-MDPs, provided in (1), to reduce the variance of policy gradient estimates.
One of the most popular methods to reduce variance is the use of a state-dependent baseline, $b(s)$. Sutton et al. (2000) showed that, for any state-dependent baseline $b(s)$:

$\nabla J(\theta) = \sum_{t=0}^{\infty} \mathbf{E}\!\left[ \gamma^t \Big( q^\theta(S_t, A_t) - b(S_t) \Big) \frac{\partial \log \pi^\theta(A_t \mid S_t, \mathcal{A}_t)}{\partial \theta} \right]. \quad (7)$
For any random variables $X$ and $Y$, the variance of $X - Y$ is given by $\operatorname{var}(X - Y) = \operatorname{var}(X) + \operatorname{var}(Y) - 2\operatorname{cov}(X, Y)$, where cov denotes covariance. Therefore, the variance of $X - Y$ is less than the variance of $X$ if $2\operatorname{cov}(X, Y) > \operatorname{var}(Y)$. As a result, any state-dependent baseline whose value is sufficiently correlated with the expected return can be used to reduce the variance of the sample estimator of (7). A common choice for such a baseline is a state-value function estimator, $\hat v(s)$.
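The covariance identity above is easy to verify numerically; a small sketch with synthetic, correlated samples (all names and constants are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
# x plays the role of the per-sample return; y is a correlated baseline.
x = rng.normal(0.0, 1.0, 10_000)
y = 0.9 * x + rng.normal(0.0, 0.2, 10_000)
# var(x - y) = var(x) + var(y) - 2 cov(x, y), so subtracting the baseline
# helps whenever 2 cov(x, y) > var(y).
reduced = np.var(x - y)
original = np.var(x)
```

With these constants the baseline removes most of the variance, since almost all of the variability in `x` is shared with `y`.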
A baseline dependent on both the state and the action can have higher correlation with the return, and could therefore reduce variance further. However, such action-dependent baselines cannot be used directly, as they can result in biased gradient estimates. Developing such baselines remains an active area of research for MDPs (Thomas and Brunskill, 2017; Grathwohl et al., 2017; Liu et al., 2017; Wu et al., 2018; Tucker et al., 2018) and is largely complementary to our purpose.
We now show that we can introduce a baseline for SAS-MDPs that lies between state-dependent and state-action-dependent baselines. Like state-dependent baselines, these new baselines do not introduce bias into gradient estimates. However, like action-dependent baselines, they include some information about the chosen actions. Specifically, we propose baselines that depend on the state, $S_t$, and the available action set, $\mathcal{A}_t$, but not on the precise action, $A_t$.
Recall from the SAS Bellman equation (1) that the state-value function for SAS-MDPs can be written as $v^\pi(s) = \sum_{\alpha \subseteq \mathcal{A}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \pi(a \mid s, \alpha)\, q^\pi(s, a)$. While we cannot directly use a baseline dependent on the action sampled from $\pi$, we can use a baseline dependent on the sampled action set. We consider a new baseline which leverages this information about the sampled action set $\mathcal{A}_t$. This baseline is $\bar v(s, \alpha) := \sum_{a \in \alpha} \pi(a \mid s, \alpha)\, \hat q(s, a)$, where $\hat q$ is a learned estimator of the state-action value function and $\bar v(s, \alpha)$ represents its expected value under the current policy, $\pi$, conditioned on the sampled action set $\alpha$.
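Computing this action-set-conditioned baseline for a sampled action set amounts to one expectation over the available actions; a minimal sketch, where `action_set_baseline`, the `q_hat` dictionary, and the uniform policy are illustrative assumptions of ours:

```python
def action_set_baseline(q_hat, pi, s, avail):
    """Baseline conditioned on the sampled action set: the expected value
    of the learned q_hat over the actions actually available, weighted by
    the policy restricted to that set."""
    probs = pi(s, avail)                      # policy over `avail` only
    return sum(p * q_hat[(s, a)] for p, a in zip(probs, avail))

# Toy check: uniform policy over the available set {0, 2}.
q_hat = {(0, 0): 1.0, (0, 1): 3.0, (0, 2): 5.0}
uniform = lambda s, avail: [1.0 / len(avail)] * len(avail)
b = action_set_baseline(q_hat, uniform, s=0, avail=[0, 2])
# b = 0.5 * 1.0 + 0.5 * 5.0 = 3.0
```

Note how the unavailable action 1 contributes nothing, which is exactly what distinguishes this baseline from a plain state-value baseline.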
In principle, we expect $\bar v(s, \mathcal{A}_t)$ to be more correlated with the return than $\hat v(s)$, as it explicitly conditions on the sampled action set rather than averaging over all possible action sets, as $\hat v$ does. Practically, however, estimating $\hat q$ can be harder than estimating $\hat v$: with the same number of training samples, the number of parameters to learn for $\hat q$ is larger than for an estimate of $\hat v$. This poses a new dilemma of deciding when to use which baseline. To get the best of both, we consider using a weighted combination of $\hat v$ and $\bar v$. In the following property we establish that using any weighted combination of these two baselines results in an unbiased estimate of the SAS policy gradient.
Property 2 (Unbiased estimator).
Let $\hat v : \mathcal{S} \to \mathbb{R}$ be any state-dependent baseline and $\bar v$ be the action-set-dependent baseline defined above; then for any values of $\lambda_1 \in \mathbb{R}$ and $\lambda_2 \in \mathbb{R}$,

$\nabla J(\theta) = \sum_{t=0}^{\infty} \mathbf{E}\!\left[ \gamma^t \Big( q^\theta(S_t, A_t) - \lambda_1 \hat v(S_t) - \lambda_2 \bar v(S_t, \mathcal{A}_t) \Big) \frac{\partial \log \pi^\theta(A_t \mid S_t, \mathcal{A}_t)}{\partial \theta} \right]. \quad (8)$
Proof.
See Appendix D. ∎
The question remains: what values should be used for $\lambda_1$ and $\lambda_2$ when combining $\hat v$ and $\bar v$? Similar problems of combining different estimators have been studied in the statistics literature (Graybill and Deal, 1959; Meir et al., 1994) and more recently for combining control variates (Wang et al., 2013; Geffner and Domke, 2018). Building upon their ideas, rather than leaving $\lambda_1$ and $\lambda_2$ as open hyperparameters, we propose a method for automatically adapting $\lambda := [\lambda_1, \lambda_2]^\top$ to the specific SAS-MDP and the current policy parameters, $\theta$. The following lemma presents an analytic expression for the value of $\lambda$ that minimizes a sample-based estimate of the variance of the gradient estimator.

Lemma 4 (Adaptive variance mitigation).
Let $\hat\Delta(\lambda)$ denote the sample estimate of the gradient in (8) with baseline weights $\lambda = [\lambda_1, \lambda_2]^\top$; then the $\lambda$ that minimizes the variance of $\hat\Delta(\lambda)$ is given by

$\lambda^* = C^{-1} b, \quad (9)$

where $C$ is the $2 \times 2$ matrix of sample inner products between the two baseline-weighted score terms, and $b$ is the vector of their sample inner products with the unbaselined gradient term.
Proof.
See Appendix D. ∎
Lemma 4 provides the values for $\lambda_1$ and $\lambda_2$ that result in the minimal variance of the gradient estimate. Note that the computational cost associated with evaluating the inverse in (9) is negligible, because the matrix being inverted is always $2 \times 2$, independent of the number of policy parameters. Also, Lemma 4 provides the optimal values of $\lambda_1$ and $\lambda_2$, which must still be approximated using sample-based estimates of $C$ and $b$. Furthermore, one might use double sampling to obtain unbiased estimates of the variance-minimizing value of $\lambda$ (Baird, 1995). However, as Property 2 ensures that gradient estimates are unbiased for any values of $\lambda_1$ and $\lambda_2$, we opt to use all the available samples when estimating $C$ and $b$. Detailed step-by-step pseudocode for optimizing $\lambda$, constructing the baselines, and using them within a SAS policy gradient algorithm is provided in Algorithm 1 in Appendix E.
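Since (9) has the structure of a standard two-control-variate fit, the weights can be estimated from samples by solving a 2x2 linear system. The following is a hedged sketch of our own construction: `optimal_baseline_weights` treats the two baseline-weighted score terms as control variates, which mirrors, but is not taken verbatim from, Lemma 4:

```python
import numpy as np

def optimal_baseline_weights(g0, c1, c2):
    """Variance-minimizing weights for two control variates (here: the
    state baseline and the action-set baseline, each multiplied by the
    score function). g0, c1, c2: (N, d) per-sample gradient terms.
    Solves the 2x2 system C lambda = b from sample inner products; the
    inversion cost is independent of the number of policy parameters d."""
    C_rows = np.stack([c1.ravel(), c2.ravel()])     # (2, N*d)
    g = g0.ravel()
    C = C_rows @ C_rows.T / g.size                  # 2x2 matrix
    b = C_rows @ g / g.size                         # 2-vector
    return np.linalg.solve(C, b)
```

When the unbaselined gradient term is close to a linear combination of the two control variates, the fitted weights recover that combination, and the residual after subtracting the weighted baselines is nearly noise-free.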
7 Empirical Analysis
(Figure 2: learning curves and baseline weights for the SAS-PG results; shaded regions correspond to one standard deviation across trials.)

In this section we use empirical studies to answer the following three questions: (a) How do our proposed algorithms, SAS policy gradient (SAS-PG) and SAS natural policy gradient (SAS-NPG), compare to the prior method, SAS-Q-learning? (b) How does our adaptive variance reduction technique weight the two baselines over the training duration? (c) What impact does the probability of action availability have on the performances of SAS-PG, SAS-NPG, and SAS-Q-learning? To evaluate these aspects, we first briefly introduce three domains inspired by real-world problems.
Routing in San Francisco.
This task models the problem of finding shortest paths in San Francisco, and was first presented with stochastic actions by Boutilier et al. (2018). Stochastic actions model the fact that certain paths in the road network may be unavailable at certain times. A positive reward is provided to the agent when it reaches the destination, while a small penalty is applied at every time step. We modify the domain presented by Boutilier et al. (2018) so that the starting state of the agent is not one particular node, but rather is chosen uniformly at random among all possible locations. This makes the problem more challenging, since it requires the agent to learn the shortest path from every node. All the states (nodes) are discrete, and edges correspond to the action choices. Each edge is made available with some fixed probability. The overall map is shown in Figure 4.
Robot locomotion task in a maze.
In this domain, the agent has to navigate a maze using unreliable actuators. The agent starts at the bottom-left corner, and a goal reward is given when it reaches the goal position, marked by a star in Figure 4. The agent is penalized at each time step to encourage it to reach the goal as quickly as possible. The state space is continuous, and corresponds to the real-valued Cartesian coordinates of the agent's position. The agent has actuators pointing in different directions. Turning an actuator on moves the agent in the direction of that actuator. However, each actuator is unreliable, and is therefore only available with some fixed probability.
Product recommender system.
In online marketing and sales, product recommendation is a popular problem. Due to various factors such as stock outages, promotions, delivery issues, etc., not all products can always be recommended. To model this, we consider a synthetic setup of providing recommendations to a user from a batch of products, each available with some fixed probability and associated with a stochastic reward corresponding to profit. Each user has a real-valued context, which forms the state space, and the recommender system interacts with a randomly chosen user for a fixed number of steps. The goal for the recommender system is to suggest products that maximize total profit. Often the problem of recommendation is formulated as a contextual bandit or collaborative filtering problem, but as shown by Theocharous et al. (2015), these approaches can fail to capture the long-term value of predictions. Hence we resort to the full RL setup.
7.1 Results
Here we discuss only representative results for the three major questions of interest. Plots for detailed evaluations are available in Appendix F.2.
(a) For the routing problem in San Francisco, as both the states and actions are discrete, the q-function for each state-action pair has a unique parameter. When no parameters are shared, SAS-Q-learning will not diverge. Therefore, in this domain, we notice that SAS-Q-learning performs similarly to the proposed algorithms. However, in many large-scale problems, the use of function approximators is crucial for estimating the optimal policy. For the robot locomotion task in the maze domain and for the recommender system, the state space is not discrete, and hence function approximators are required to obtain state features. As we saw in Section 4, the sharing of state features can create problems for SAS-Q-learning. The increased variance in the performance of SAS-Q-learning is visible in both the maze and recommender system domains in Figure 2. While SAS-Q eventually performs on par in the maze domain, its performance improvement saturates quickly in the recommender system domain, resulting in a sub-optimal policy.
(b) To provide visual intuition for the behavior of adaptive variance mitigation, we report the values of $\lambda_1$ and $\lambda_2$ over the training duration in Figure 2. As several factors are combined through (9) to influence these values, it is hard to pinpoint any individual factor responsible for the observed trend. However, note that for both the routing problem in San Francisco and the robot navigation in the maze, the goal reward is obtained on reaching the destination, and intermediate actions do not impact the total return significantly. Intuitively, this makes the action-set-conditioned baseline, $\bar v$, similarly correlated with the observed return as the state-only-conditioned baseline, $\hat v$, but at the expense of estimating significantly more parameters. Thus the weight for $\bar v$ is automatically adapted to be closer to zero. On the other hand, in the recommender system, each product has a significant amount of associated reward. Therefore, the total return possible during each episode has a strong dependency on the available action set, and thus the magnitude of the weight for $\bar v$ is much larger than that for $\hat v$.
(c) To understand the impact of the probability of an action being available, we report the best performances of all the algorithms for different probability values in Figure 3. We notice that in the San Francisco routing domain, SAS-Q-learning has a slight edge over the proposed methods. This can be attributed to the fact that off-policy samples can be reused without causing divergence problems, as state features are not shared. For the maze and recommender system tasks, where function approximators are necessary, the proposed methods significantly outperform SAS-Q.
8 Conclusion
Building upon the SAS-MDP framework of Boutilier et al. (2018), we studied the under-addressed problem of MDPs with stochastic action sets. We highlighted some limitations of the existing method and addressed them by generalizing policy gradient methods to SAS-MDPs. Additionally, we introduced a novel baseline and an adaptive variance reduction technique unique to this setting. Our approach has several benefits: not only does it generalize the theoretical properties of standard policy gradient methods, but it is also practically efficient and simple to implement.
References
 Gendreau et al. (1996) Michel Gendreau, Gilbert Laporte, and René Séguin. Stochastic vehicle routing. European Journal of Operational Research, 1996.
 Ribeiro et al. (2008) Alejandro Ribeiro, Nikolaos D Sidiropoulos, and Georgios B Giannakis. Optimal distributed stochastic routing algorithms for wireless multi-hop networks. IEEE Transactions on Wireless Communications, 2008.
 Tan and Srikant (2012) Bo Tan and Rayadurgam Srikant. Online advertisement, optimization and stochastic networks. IEEE Transactions on Automatic Control, 2012.
 Mahdian et al. (2007) Mohammad Mahdian, Hamid Nazerzadeh, and Amin Saberi. Allocating online advertisement space with unreliable estimates. In Proceedings of the 8th ACM conference on Electronic commerce. ACM, 2007.
 Feng and Yan (2000) Youyi Feng and Houmin Yan. Optimal production control in a discrete manufacturing system with unreliable machines and random demands. IEEE Transactions on Automatic Control, 2000.
 Harper and Skiba (2007) Gregory W Harper and Steven Skiba. User-personalized media sampling, recommendation and purchasing system using real-time inventory database, 2007. US Patent 7,174,312.
 Boutilier et al. (2018) Craig Boutilier, Alon Cohen, Amit Daniely, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov, and Dale Schuurmans. Planning and learning with stochastic action sets. In IJCAI, 2018.
 Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

 Kanade et al. (2009) Varun Kanade, H. Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS, 2009.
 Kleinberg et al. (2010) Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245-272, 2010.
 Papadimitriou and Yannakakis (1991) Christos H Papadimitriou and Mihalis Yannakakis. Shortest paths without a map. Theoretical Computer Science, 84(1):127–150, 1991.
 Polychronopoulos and Tsitsiklis (1996) George H Polychronopoulos and John N Tsitsiklis. Stochastic shortest path problems with recourse. Networks: An International Journal, 27(2):133–143, 1996.
 Nikolova et al. (2006) Evdokia Nikolova, Matthew Brand, and David R Karger. Optimal route planning under uncertainty. In ICAPS, volume 6, pages 131–141, 2006.
 Nikolova and Karger (2008) Evdokia Nikolova and David R Karger. Route planning under uncertainty: The canadian traveller problem. In AAAI, pages 969–974, 2008.
 Baird (1995) Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
 Tsitsiklis and Van Roy (1997) J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Kakade (2002) Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
 Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
 Thomas et al. (2018) Philip Thomas, Christoph Dann, and Emma Brunskill. Decoupling gradientlike learning rules from representations. In International Conference on Machine Learning, 2018.
 Amari and Douglas (1998) Shun-Ichi Amari and Scott C Douglas. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (Cat. No. 98CH36181). IEEE, 1998.
 Bagnell and Schneider (2003) J. Andrew Bagnell and Jeff G. Schneider. Covariant policy search. In IJCAI03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence., 2003.
 Bhatnagar et al. (2008) Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, and Richard S Sutton. Incremental natural actorcritic algorithms. In Advances in neural information processing systems, pages 105–112, 2008.
 Degris et al. (2012) T. Degris, P. M. Pilarski, and R. S. Sutton. Modelfree reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.
 Morimura et al. (2005) Tetsuro Morimura, Eiji Uchibe, and Kenji Doya. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Application, pages 256–263, 2005.
 Thomas and Barto (2012) Philip S. Thomas and Andrew G. Barto. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, pages 1–8, 2012.
 Kakade (2003) Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
 Thomas and Brunskill (2017) Philip S Thomas and Emma Brunskill. Policy gradient methods for reinforcement learning with function approximation and actiondependent baselines. arXiv preprint arXiv:1706.06643, 2017.
 Grathwohl et al. (2017) Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.
 Liu et al. (2017) Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein’s identity. arXiv preprint arXiv:1710.11198, 2017.
 Wu et al. (2018) Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.
 Tucker et al. (2018) George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031, 2018.
 Graybill and Deal (1959) Franklin A Graybill and RB Deal. Combining unbiased estimators. Biometrics, 15(4):543–550, 1959.
 Meir et al. (1994) Ronny Meir et al. Bias, variance and the combination of estimators: The case of linear least squares. Citeseer, 1994.
 Wang et al. (2013) Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, 2013.
 Geffner and Domke (2018) Tomas Geffner and Justin Domke. Using large ensembles of control variates for variational inference. In Advances in Neural Information Processing Systems, 2018.
 Theocharous et al. (2015) Georgios Theocharous, Philip S Thomas, and Mohammad Ghavamzadeh. Ad recommendation systems for lifetime value optimization. In Proceedings of the 24th International Conference on World Wide Web, pages 1305–1310. ACM, 2015.
 Bertsekas and Tsitsiklis (2000) Dimitri P Bertsekas and John N Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
 Amari and Nagaoka (2007) Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.
 Thomas (2014) Philip Thomas. Bias in natural actorcritic algorithms. In International Conference on Machine Learning, pages 441–448, 2014.
 Konidaris et al. (2011) George Konidaris, Sarah Osentoski, and Philip Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Twenty-fifth AAAI Conference on Artificial Intelligence, 2011.
Reinforcement Learning When All Actions are Not
Always Available (Supplementary Material)
Appendix A SAS Policy Gradient
Lemma 1 (SAS Policy Gradient).
For all $s \in \mathcal{S}$,
(10) $\frac{\partial v^{\pi}(s)}{\partial \theta} = \sum_{t=0}^{\infty} \gamma^{t} \sum_{s' \in \mathcal{S}} \Pr(S_t = s' \mid S_0 = s) \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s') \sum_{a \in \alpha} q^{\pi}(s', a) \frac{\partial \pi(a \mid s', \alpha)}{\partial \theta}.$
Proof.
(11) $\frac{\partial v^{\pi}(s)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \pi(a \mid s, \alpha)\, q^{\pi}(s, a)$
(12) $= \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \frac{\partial}{\partial \theta} \big( \pi(a \mid s, \alpha)\, q^{\pi}(s, a) \big)$
(13) $= \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \Big( q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta} + \pi(a \mid s, \alpha) \frac{\partial q^{\pi}(s, a)}{\partial \theta} \Big)$
(14) $= \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta} + \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \pi(a \mid s, \alpha) \frac{\partial q^{\pi}(s, a)}{\partial \theta}$
(15) $= \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta} + \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \pi(a \mid s, \alpha) \frac{\partial}{\partial \theta} \Big( R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, v^{\pi}(s') \Big)$
(16) $= \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta} + \gamma \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \pi(a \mid s, \alpha) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \frac{\partial v^{\pi}(s')}{\partial \theta}$
(17) $= \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta} + \gamma \sum_{s' \in \mathcal{S}} \Pr(S_1 = s' \mid S_0 = s) \frac{\partial v^{\pi}(s')}{\partial \theta},$
where $\Pr(S_1 = s' \mid S_0 = s) = \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} \pi(a \mid s, \alpha) P(s' \mid s, a),$
and where (16) comes from unrolling the Bellman equation, $q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, v^{\pi}(s')$, together with the fact that $R$ and $P$ do not depend on $\theta$. We started with the partial derivative of the value of a state, expanded the definition of the value of a state, and obtained an expression in terms of the partial derivative of the value of another state. Now, we again expand $\frac{\partial v^{\pi}(s')}{\partial \theta}$ using the definition of the state-value function and the Bellman equation.
Applying the same steps to $\frac{\partial v^{\pi}(s')}{\partial \theta}$,
(18) $\frac{\partial v^{\pi}(s')}{\partial \theta} = \sum_{\alpha' \in 2^{\mathcal{A}}} \varphi(\alpha' \mid s') \sum_{a' \in \alpha'} q^{\pi}(s', a') \frac{\partial \pi(a' \mid s', \alpha')}{\partial \theta} + \gamma \sum_{s'' \in \mathcal{S}} \Pr(S_1 = s'' \mid S_0 = s') \frac{\partial v^{\pi}(s'')}{\partial \theta}.$
Substituting (18) into (17),
(19) $\frac{\partial v^{\pi}(s)}{\partial \theta} = \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta} + \gamma \sum_{s' \in \mathcal{S}} \Pr(S_1 = s' \mid S_0 = s) \sum_{\alpha' \in 2^{\mathcal{A}}} \varphi(\alpha' \mid s') \sum_{a' \in \alpha'} q^{\pi}(s', a') \frac{\partial \pi(a' \mid s', \alpha')}{\partial \theta} + \gamma^{2} \sum_{s' \in \mathcal{S}} \Pr(S_1 = s' \mid S_0 = s) \sum_{s'' \in \mathcal{S}} \Pr(S_1 = s'' \mid S_0 = s') \frac{\partial v^{\pi}(s'')}{\partial \theta}.$
Since
(20) $\sum_{s' \in \mathcal{S}} \Pr(S_1 = s' \mid S_0 = s)\, \Pr(S_1 = s'' \mid S_0 = s') = \Pr(S_2 = s'' \mid S_0 = s),$
this becomes
(21) $\frac{\partial v^{\pi}(s)}{\partial \theta} = \underbrace{\sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta}}_{\text{first term}} + \underbrace{\gamma \sum_{s' \in \mathcal{S}} \Pr(S_1 = s' \mid S_0 = s) \sum_{\alpha' \in 2^{\mathcal{A}}} \varphi(\alpha' \mid s') \sum_{a' \in \alpha'} q^{\pi}(s', a') \frac{\partial \pi(a' \mid s', \alpha')}{\partial \theta}}_{\text{second term}} + \gamma^{2} \sum_{s'' \in \mathcal{S}} \Pr(S_2 = s'' \mid S_0 = s) \frac{\partial v^{\pi}(s'')}{\partial \theta}.$
Continuing to unroll every remaining $\frac{\partial v^{\pi}(\cdot)}{\partial \theta}$ term in the same way, each expansion contributes the next power of $\gamma$, giving
(30) $\frac{\partial v^{\pi}(s)}{\partial \theta} = \sum_{t=0}^{\infty} \gamma^{t} \sum_{s' \in \mathcal{S}} \Pr(S_t = s' \mid S_0 = s) \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s') \sum_{a \in \alpha} q^{\pi}(s', a) \frac{\partial \pi(a \mid s', \alpha)}{\partial \theta}.$
Expanding $\frac{\partial v^{\pi}(s')}{\partial \theta}$ allowed us to write it in terms of the partial derivative of the value of yet another state, $\frac{\partial v^{\pi}(s'')}{\partial \theta}$. We could continue this process, “unravelling” the recurrence further. Each time that we expand the partial derivative of the value of a state with respect to the parameters, we get another term. The first two terms that we have obtained are marked above. If we were to unravel the expression more times, by expanding and then differentiating, we would obtain the subsequent third, fourth, etc., terms.
Finally, to get the desired result, we expand the start-state objective, $J(\theta) = \sum_{s_0 \in \mathcal{S}} d_0(s_0)\, v^{\pi}(s_0)$, and take its derivative with respect to $\theta$,
(31) $\frac{\partial J(\theta)}{\partial \theta} = \sum_{s_0 \in \mathcal{S}} d_0(s_0) \frac{\partial v^{\pi}(s_0)}{\partial \theta}.$
Combining the results from (30) and (31), we index each term by the time step $t$, with the first term being $t = 0$, the second $t = 1$, etc., which results in the expression:
(32) $\frac{\partial J(\theta)}{\partial \theta} = \sum_{t=0}^{\infty} \gamma^{t} \sum_{s \in \mathcal{S}} \Pr(S_t = s) \sum_{\alpha \in 2^{\mathcal{A}}} \varphi(\alpha \mid s) \sum_{a \in \alpha} q^{\pi}(s, a) \frac{\partial \pi(a \mid s, \alpha)}{\partial \theta},$
where $\Pr(S_t = s) = \sum_{s_0 \in \mathcal{S}} d_0(s_0) \Pr(S_t = s \mid S_0 = s_0)$. Notice that to get the gradient with respect to $\theta$, we have included a sum over all the states weighted by $d_0(s_0)$, the start-state probability. When $t = 0$, the only state where $\Pr(S_0 = s \mid S_0 = s_0)$ is not zero is $s = s_0$ (at which point this probability is one). This allows us to succinctly represent all the terms. With this we conclude the proof. ∎
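As a sanity check, the gradient in (32) can be estimated from sampled trajectories in the REINFORCE style, using the observed return as an unbiased sample of $q^{\pi}(s, a)$. The sketch below assumes a tabular softmax policy restricted to whichever actions are sampled as available; the helper names (`softmax_policy`, `sas_reinforce_gradient`) and the trajectory format are our own illustration, not the paper's implementation.

```python
import numpy as np

def softmax_policy(theta, s, available):
    """Action probabilities over the *available* subset, from tabular preferences theta[s]."""
    prefs = theta[s, list(available)]
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def sas_reinforce_gradient(trajectory, theta, gamma=0.99):
    """Monte-Carlo estimate of the SAS policy gradient in (32):
    sum_t gamma^t * G_t * d ln pi(a_t | s_t, alpha_t) / d theta,
    where the return G_t stands in for q^pi(s_t, a_t).
    Each trajectory element is (state, available_actions, action, reward)."""
    # Discounted returns G_t, computed backwards through the trajectory.
    G, returns = 0.0, []
    for (_, _, _, r) in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    grad = np.zeros_like(theta)
    for t, ((s, available, a, _), G_t) in enumerate(zip(trajectory, returns)):
        available = list(available)
        p = softmax_policy(theta, s, available)
        # Softmax score restricted to the sampled action set:
        # +1 at the chosen action, minus the probability of each available action.
        for j, act in enumerate(available):
            grad[s, act] += (gamma ** t) * G_t * ((act == a) - p[j])
    return grad
```

For a softmax restricted to the available set, the per-step gradient entries over the available actions sum to zero, which gives a quick correctness check on the estimator.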
Appendix B Convergence
Assumption A1 (Differentiable).
For any state, action-set, and action triple $(s, \alpha, a)$, the policy $\pi(a \mid s, \alpha)$ is continuously differentiable in the parameter $\theta$.
Assumption A2 (Lipschitz smooth gradient).
Let $\Theta$ denote the set of all possible parameters for policy $\pi$; then, for some finite constant $k$,
$\left\lVert \frac{\partial J(\theta)}{\partial \theta} - \frac{\partial J(\theta')}{\partial \theta'} \right\rVert \leq k\, \lVert \theta - \theta' \rVert, \qquad \forall\, \theta, \theta' \in \Theta.$
Assumption A3 (Learning rate schedule).
Let $\eta_t$ denote the learning rate for updating the policy parameters $\theta$; then,
$\sum_{t=0}^{\infty} \eta_t = \infty, \qquad \sum_{t=0}^{\infty} \eta_t^{2} < \infty.$
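For concreteness, one standard schedule satisfying these two conditions is $\eta_t = c/(t+1)$ for a constant $c > 0$: the partial sums of $\eta_t$ diverge like the harmonic series, while the partial sums of $\eta_t^2$ converge to $c^2 \pi^2 / 6$. A quick numeric illustration (the value of $c$ here is an arbitrary choice of ours):

```python
import numpy as np

# eta_t = c / (t + 1): sum of eta_t diverges, sum of eta_t^2 converges.
c = 0.5
t = np.arange(1, 200001)
eta = c / t

partial_sum = eta.sum()            # grows like c * ln(t), without bound
partial_sum_sq = (eta ** 2).sum()  # approaches c**2 * pi**2 / 6
```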
Lemma 2.
Proof.
Appendix C SAS Natural Policy Gradient
Property 1 (Fisher Information Matrix).
For a policy $\pi$, parameterized using weights $\theta$, the Fisher information matrix is given by,
(33) $F_{\theta} = \mathbf{E}\left[\sum_{t=0}^{\infty} \psi(S_t, \alpha_t, A_t)\, \psi(S_t, \alpha_t, A_t)^{\top}\right],$
where $\psi(s, \alpha, a) = \frac{\partial \ln \pi(a \mid s, \alpha)}{\partial \theta}$.
Proof.
To prove this result, we first note the following relation by Amari and Nagaoka [2007], which connects the Hessian and the FIM of a random variable $X$ parameterized using $\theta$,
(34) $F_{\theta} = -\mathbf{E}\left[\frac{\partial^{2} \ln \Pr(X; \theta)}{\partial \theta^{2}}\right].$
Now, let $X$ denote the random variable corresponding to the trajectories observed using policy $\pi$. Let $\tau = (s_0, \alpha_0, a_0, s_1, \ldots)$ denote an outcome of $X$; then the probability of observing this trajectory, $\Pr(\tau; \theta)$, is given by,
(35) $\Pr(\tau; \theta) = d_0(s_0) \prod_{t=0}^{\infty} \varphi(\alpha_t \mid s_t)\, \pi(a_t \mid s_t, \alpha_t)\, P(s_{t+1} \mid s_t, a_t),$
(36) $\ln \Pr(\tau; \theta) = \ln d_0(s_0) + \sum_{t=0}^{\infty} \big( \ln \varphi(\alpha_t \mid s_t) + \ln \pi(a_t \mid s_t, \alpha_t) + \ln P(s_{t+1} \mid s_t, a_t) \big).$
Therefore,
(37) $\frac{\partial^{2} \ln \Pr(\tau; \theta)}{\partial \theta^{2}} = \frac{\partial^{2}}{\partial \theta^{2}} \left( \ln d_0(s_0) + \sum_{t=0}^{\infty} \big( \ln \varphi(\alpha_t \mid s_t) + \ln \pi(a_t \mid s_t, \alpha_t) + \ln P(s_{t+1} \mid s_t, a_t) \big) \right)$
(38) $= \sum_{t=0}^{\infty} \frac{\partial^{2} \ln \pi(a_t \mid s_t, \alpha_t)}{\partial \theta^{2}},$
since $d_0$, $\varphi$, and $P$ do not depend on $\theta$. Combining this with (34),
(39) $F_{\theta} = -\mathbf{E}\left[\sum_{t=0}^{\infty} \frac{\partial^{2} \ln \pi(A_t \mid S_t, \alpha_t)}{\partial \theta^{2}}\right].$
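The relation (34) can be sanity-checked in closed form for a one-parameter toy case. Below, a Bernoulli (logistic) policy $\pi(1) = \sigma(\theta)$ stands in for the policy; this toy is our own illustration, not the SAS construction. The score is $a - \sigma(\theta)$ and $\partial^{2} \ln \pi(a)/\partial \theta^{2} = -\sigma(\theta)(1-\sigma(\theta))$ for either action, so the expected squared score and the negative expected Hessian both reduce to $\sigma(\theta)(1-\sigma(\theta))$:

```python
import numpy as np

# One-parameter logistic policy over two actions: pi(1) = sigmoid(theta).
theta = 0.7
sig = 1.0 / (1.0 + np.exp(-theta))

# Fisher information as the expected squared score, E[(a - sigmoid(theta))^2]:
fisher_score = (1 - sig) * (0 - sig) ** 2 + sig * (1 - sig) ** 2

# Fisher information as the negative expected Hessian of ln pi, relation (34):
# d^2 ln pi(a) / d theta^2 = -sig * (1 - sig) for both a = 0 and a = 1.
fisher_hessian = -(-sig * (1 - sig))
```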
We know that the Fisher information matrix for a random variable, which in our case is $X$, is given by,
(40) $F_{\theta} = \mathbf{E}\left[\frac{\partial \ln \Pr(X; \theta)}{\partial \theta} \frac{\partial \ln \Pr(X; \theta)}{\partial \theta}^{\top}\right]$
(41) $= \sum_{\tau} \Pr(\tau; \theta)\, \frac{\partial \ln \Pr(\tau; \theta)}{\partial \theta} \frac{\partial \ln \Pr(\tau; \theta)}{\partial \theta}^{\top}$
(42) $= \sum_{\tau} \Pr(\tau; \theta) \left(\sum_{t=0}^{\infty} \psi(s_t, \alpha_t, a_t)\right) \left(\sum_{t=0}^{\infty} \psi(s_t, \alpha_t, a_t)\right)^{\top}$
(43) $= \sum_{\tau} \Pr(\tau; \theta) \sum_{t=0}^{\infty} \sum_{t'=0}^{\infty} \psi(s_t, \alpha_t, a_t)\, \psi(s_{t'}, \alpha_{t'}, a_{t'})^{\top},$
where (42) uses the fact that, by (36), only the $\ln \pi$ terms depend on $\theta$, and where the summation over $\tau$ corresponds to all possible values of $s_t$, $\alpha_t$, and $a_t$ for every step in the trajectory. Expanding the inner summation in (43),