1 Introduction
Reinforcement learning agents typically have either a discrete or a continuous action space [Sutton and Barto 1998]. With a discrete action space, the agent decides which distinct action to perform from a finite action set. With a continuous action space, actions are expressed as a single real-valued vector. If we use a continuous action space, we lose the ability to consider differences in kind: all actions must be expressible as a single vector. If we use only discrete actions, we lose the ability to finely tune action selection based on the current state.
A parameterized action is a discrete action parameterized by a real-valued vector. Modeling actions this way introduces structure into the action space by treating different kinds of continuous actions as distinct. At each step an agent must choose both which action to use and what parameters to execute it with. For example, consider a soccer-playing robot which can kick, pass, or run. We can associate a continuous parameter vector with each of these actions: we can kick the ball to a given target position with a given force, pass to a specific position, and run with a given velocity. Each of these actions is parameterized in its own way. Parameterized action Markov decision processes (PAMDPs) model situations where we have distinct actions that require parameters to adjust the action to different situations, or where there are multiple mutually incompatible continuous actions.
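As a concrete illustration, a parameterized action can be represented as a pair of a discrete action label and a real-valued parameter vector. The sketch below is a minimal illustration of this structure for the soccer example; the action names and parameter dimensions are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

# Illustrative parameter dimensions for each discrete action:
# kick a target (x, y) with a force, pass to a target (x, y), run with a velocity.
PARAM_DIMS = {"kick": 3, "pass": 2, "run": 2}

def make_action(name, params):
    """A parameterized action is a pair: a discrete action and its parameter vector."""
    params = np.asarray(params, dtype=float)
    assert name in PARAM_DIMS, "unknown discrete action"
    assert params.shape == (PARAM_DIMS[name],), "wrong parameter dimension"
    return (name, params)

# Kick the ball towards (10, 5) with force 0.8 (arbitrary example values).
action = make_action("kick", [10.0, 5.0, 0.8])
```

The key point is that each discrete action carries its own parameter space, so actions of different kinds need not share one parameter vector.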
We focus on how to learn an action-selection policy given predefined parameterized actions. We introduce the Q-PAMDP algorithm, which alternates learning action-selection and parameter-selection policies, and compare it to a direct policy search method. We show that with appropriate update rules Q-PAMDP converges to a local optimum. These methods are compared empirically in the goal and Platform domains. We found that Q-PAMDP outperformed direct policy search and fixed-parameter SARSA.
2 Background
A Markov decision process (MDP) is a tuple ⟨S, A, P, R, γ⟩, where S is a set of states, A is a set of actions, P(s′ | s, a) is the probability of transitioning to state s′ from state s after taking action a, R(s, a, r) is the probability of receiving reward r for taking action a in state s, and γ ∈ [0, 1) is a discount factor [Sutton and Barto 1998]. We wish to find a policy, π(a | s), which selects an action for each state so as to maximize the expected sum of discounted rewards (the return). The value function

V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, π ]

is defined as the expected discounted return achieved by policy π starting at state s. Similarly, the action-value function

Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ]

is the expected return obtained by taking action a in state s, and then following policy π thereafter. While selecting actions using the value function requires a model of the transition probabilities, we would prefer to act without needing such a model. We can approach this problem by learning Q, which allows us to directly select the action which maximizes Q(s, a). We can learn Q for an optimal policy using a method such as Q-learning [Watkins and Dayan 1992]. In domains with a continuous state space, we can represent Q using parametric function approximation with a set of parameters ω and learn it with algorithms such as gradient-descent SARSA(λ) [Sutton and Barto 1998].
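For concreteness, a single gradient-descent SARSA(λ) update with a linear function approximator can be sketched as follows. This is a minimal sketch under standard textbook assumptions; the feature vectors and step sizes are placeholders, not the settings used in the experiments below.

```python
import numpy as np

def sarsa_lambda_update(w, z, phi_sa, phi_next, r, gamma=0.99, lam=0.9, alpha=0.01):
    """One gradient-descent SARSA(lambda) update for a linear Q(s, a) = w . phi(s, a).

    w: weight vector, z: eligibility trace, phi_sa: features of the current (s, a),
    phi_next: features of the next state-action pair (all zeros at terminal states).
    """
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi_sa)  # TD error
    z = gamma * lam * z + phi_sa                                 # accumulate traces
    w = w + alpha * delta * z                                    # gradient step on w
    return w, z

# One update from a zero-initialized approximator, reward 1, terminal next state.
w, z = sarsa_lambda_update(np.zeros(2), np.zeros(2),
                           np.array([1.0, 0.0]), np.zeros(2), r=1.0, alpha=0.1)
```

Here the TD error is 1 (no prior estimate), so the weight on the active feature moves by alpha.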
For problems with a continuous action space (A ⊆ ℝ^m), selecting the optimal action with respect to Q is nontrivial, as it requires finding the global maximum of a function over a continuous space. We can avoid this problem using a policy search algorithm, in which a class of policies parameterized by a set of parameters θ is given, transforming the problem into one of direct optimization over θ for an objective function J(θ). Several policy search approaches exist, including policy gradient methods, entropy-based approaches, path integral approaches, and sample-based approaches [Deisenroth, Neumann, and Peters 2013].
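To illustrate direct optimization over policy parameters, here is a toy sketch that ascends a finite-difference estimate of the gradient of J(θ). This is only a stand-in for intuition: practical policy search methods such as eNAC estimate the gradient from sampled episodes rather than by perturbing θ directly, and the objective below is an arbitrary example.

```python
import numpy as np

def finite_difference_search(J, theta, alpha=0.1, eps=1e-2, steps=100):
    """Hill-climb J(theta) using a central finite-difference gradient estimate."""
    theta = np.array(theta, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)  # dJ/dtheta_i
        theta += alpha * grad  # ascend the estimated gradient
    return theta

# Toy concave objective with its maximum at theta = (1, 1, 1).
best = finite_difference_search(lambda th: -np.sum((th - 1.0) ** 2), np.zeros(3))
```

In reinforcement learning, evaluating J(θ) itself requires running episodes, which is why sample-efficient gradient estimators are preferred over naive perturbation.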
Parameterized Tasks
A parameterized task is a problem defined by a task parameter vector given at the beginning of each episode. These parameters are fixed throughout the episode, and the goal is to learn a task-dependent policy. Kober et al. (2012) developed algorithms to adjust motor primitives to different task parameters, and applied them to learn table tennis and darts with different starting positions and targets. Da Silva et al. (2012) introduced the idea of a parameterized skill as a task-dependent parameterized policy: they sample a set of tasks, learn their associated policy parameters, and determine a mapping from task parameters to policy parameters. Deisenroth et al. (2014) applied a model-based method to learn task-dependent parameterized policies, used to learn a ball-hitting task and to solve a block manipulation problem. Parameterized tasks can be used as parameterized actions. For example, if we learn a parameterized task for kicking a ball to position x, this could be used as a parameterized action kick-to(x).
3 Parameterized Action MDPs
We consider MDPs where the state space is continuous (S ⊆ ℝ^n) and the actions are parameterized: there is a finite set of discrete actions A_d = {a_1, a_2, …, a_k}, and each a ∈ A_d has a set of continuous parameters X_a ⊆ ℝ^{m_a}. An action is a tuple (a, x) where a is a discrete action and x ∈ X_a are the parameters for that action. The action space is then given by

A = ∪_{a ∈ A_d} { (a, x) | x ∈ X_a },

which is the union of each discrete action with all possible parameters for that action. We refer to such MDPs as parameterized action MDPs (PAMDPs). Figure 1 depicts the different action spaces.
We apply a two-tiered approach to action selection: first selecting the discrete action, then selecting the parameters for that action. The discrete-action policy is denoted π^d(a | s). To select the parameters for the action, we define the action-parameter policy for each action a ∈ A_d as π^a(x | s). The overall policy is then given by

π(a, x | s) = π^d(a | s) π^a(x | s).

In other words, to select a complete action (a, x), we sample a discrete action a from π^d and then sample a parameter vector x from π^a. The discrete-action policy is defined by a set of parameters ω and is denoted π^d_ω. The action-parameter policy for action a is determined by a set of parameters θ_a, and is denoted π^a_{θ_a}. The set of all such parameters is given by Θ = [θ_{a_1}, …, θ_{a_k}].
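This two-tiered selection can be sketched directly: sample the discrete action from a softmax policy, then sample its parameters from a Gaussian action-parameter policy. The feature and weight shapes below are illustrative assumptions, and the concrete policy classes are only one common choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract max for numerical stability
    return e / e.sum()

def sample_action(phi, omega, Theta, Sigma):
    """Sample a complete action (a, x): a discrete action a from a softmax over
    omega @ phi(s), then parameters x from a Gaussian with mean Theta[a] @ phi(s)."""
    probs = softmax(omega @ phi)                           # discrete-action policy
    a = rng.choice(len(probs), p=probs)
    x = rng.multivariate_normal(Theta[a] @ phi, Sigma[a])  # action-parameter policy
    return a, x

# Example: 2 discrete actions, 3 state features, 2-dimensional parameters each.
phi = np.array([1.0, 0.5, -0.2])
omega = np.zeros((2, 3))                   # zero weights give a uniform softmax
Theta = [np.zeros((2, 3)), np.zeros((2, 3))]
Sigma = [np.eye(2), np.eye(2)]
a, x = sample_action(phi, omega, Theta, Sigma)
```

Note how ω only affects which discrete action is chosen, while Θ only affects the parameters, mirroring the factorization of π above.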
The first approach we consider is direct policy search: we use a policy search method to optimize the objective function

J(Θ, ω) = E_{s ∼ S_0}[ V^{π_{Θ,ω}}(s) ]

with respect to (Θ, ω), where s is a state sampled according to the initial state distribution S_0. J(Θ, ω) is the expected return of the policy π_{Θ,ω} starting from an initial state.
Our second approach is to alternate between updating the parameter-policy and learning an action-value function for the discrete actions. For any PAMDP with a fixed parameter-policy Θ, there exists a corresponding discrete-action MDP, M_Θ, in which each discrete action a is executed with parameters drawn from π^a_{θ_a}. We represent the action-value function for M_Θ using function approximation with parameters ω. For M_Θ, there exists an optimal set of representation weights which maximizes J(Θ, ω) with respect to ω. Let

W(Θ) = argmax_ω J(Θ, ω).

We can learn W(Θ) for a fixed Θ using a Q-learning algorithm. Finally, we define, for fixed Θ,

H(Θ) = J(Θ, W(Θ)),

which is the performance of the best discrete policy for that fixed Θ.
Algorithm 1 describes a method for alternately updating Θ and ω. The algorithm takes two input methods, P-UPDATE and Q-LEARN, and a positive integer parameter k, which determines the number of updates to Θ in each iteration. P-UPDATE(Θ, ω) should be a policy search method that updates Θ with respect to the objective function J(Θ, ω). Q-LEARN can be any algorithm for Q-learning with function approximation. We consider two main cases of the Q-PAMDP algorithm: Q-PAMDP(1) and Q-PAMDP(∞).

Q-PAMDP(1) performs a single update of Θ and then relearns ω to convergence. If at each step we update Θ only once, and then update ω until convergence, we are optimizing Θ with respect to H. In the next section we show that if we can find a local optimum with respect to H, then we have found a local optimum with respect to J.
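The alternation in Algorithm 1 can be sketched as follows. P-UPDATE and Q-LEARN are passed in as functions, and the stopping test is simplified to a fixed iteration count for illustration; this is a structural sketch, not the full algorithm with its convergence checks.

```python
def q_pamdp(Theta, omega, p_update, q_learn, k=1, iterations=100):
    """Sketch of Q-PAMDP: alternate k parameter-policy updates with relearning
    the discrete-action weights. k = 1 gives Q-PAMDP(1); performing a full
    optimization in the parameter-policy step corresponds to Q-PAMDP(inf)."""
    omega = q_learn(Theta, omega)          # learn Q for the initial parameter-policy
    for _ in range(iterations):
        for _ in range(k):                 # k updates to the parameter-policy
            Theta = p_update(Theta, omega)
        omega = q_learn(Theta, omega)      # relearn Q for the updated policy
    return Theta, omega

# Toy numeric stand-ins just to show the call pattern.
Theta, omega = q_pamdp(0, 0,
                       p_update=lambda T, w: T + 1,
                       q_learn=lambda T, w: w + 1,
                       k=1, iterations=3)
```

With these stand-ins, Θ receives one update per iteration and ω is relearned once initially and once per iteration.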
4 Theoretical Results
We now show that Q-PAMDP(1) converges to a local or global optimum under mild assumptions. We assume that iterating P-UPDATE converges to a local optimum of a given objective function; as the P-UPDATE step is a design choice, it can be selected to have the appropriate convergence property. Q-PAMDP(1) is equivalent to the sequence

ω_t = W(Θ_t),
Θ_{t+1} = P-UPDATE(Θ_t, J(·, ω_t)),

if Q-LEARN converges to the optimal weights W(Θ_t) for each given Θ_t.
Theorem 4.1 (Convergence to a Local Optimum).
For any Θ_0, if the sequence

Θ_{t+1} = P-UPDATE(Θ_t, H)    (1)

converges to a local optimum of H, then Q-PAMDP(1) converges to a local optimum of J.
Proof.
By definition of the sequence above, ω_t = W(Θ_t), so it follows that J(Θ_t, ω_t) = H(Θ_t). In other words, the objective function J(Θ, ω) equals H(Θ) if ω = W(Θ). Therefore, we can replace J with H in our update for Θ, to obtain the update rule Θ_{t+1} = P-UPDATE(Θ_t, H). Therefore, by equation 1, the sequence (Θ_t) converges to a local optimum Θ* of H. Let ω* = W(Θ*). As Θ* is a local optimum of H, by definition there exists an ε > 0 such that

H(Θ) ≤ H(Θ*) for all Θ with ‖Θ − Θ*‖ < ε.

Therefore, for any (Θ, ω) with ‖Θ − Θ*‖ < ε,

J(Θ, ω) ≤ H(Θ) ≤ H(Θ*) = J(Θ*, ω*).

Therefore (Θ*, ω*) is a local optimum of J. ∎
In summary, if we can locally optimize Θ, and set ω = W(Θ) at each step, then we will find a local optimum of J. The conditions for the previous theorem can be met by taking P-UPDATE to be a local optimization method such as a gradient-based policy search. A similar argument shows that if the sequence converges to a global optimum of H, then Q-PAMDP(1) converges to a global optimum of J.
One problem is that at each step we must relearn W(Θ) for the updated value of Θ. We now show that if updates to Θ are bounded and W is a continuous function, then the required updates to ω will also be bounded. Intuitively, we are supposing that a small update to Θ results in a small change in the weights specifying which discrete action to choose. The assumption that W is continuous is strong, and may not be satisfied by all PAMDPs. It is not necessary for the operation of Q-PAMDP(1), but when it is satisfied we do not need to completely relearn ω after each update to Θ. We show that by selecting an appropriate initial update rate α_0 we can shrink the differences in ω as desired.
Theorem 4.2 (Bounded Updates to ω).
If W is continuous with respect to Θ, and updates to Θ are of the form

Θ_{t+1} = Θ_t + α_t P-UPDATE(Θ_t, ω_t),

with the norm of each P-UPDATE bounded by

‖P-UPDATE(Θ_t, ω_t)‖ < c

for some c > 0, then for any desired difference in ω, ε > 0, there is an initial update rate α_0 > 0 such that

‖ω_{t+1} − ω_t‖ < ε.
Proof.
Let ε > 0, and let α_t ≤ α_0 for all t. As

Θ_{t+1} = Θ_t + α_t P-UPDATE(Θ_t, ω_t),

it follows that

‖Θ_{t+1} − Θ_t‖ = α_t ‖P-UPDATE(Θ_t, ω_t)‖ < α_0 c.

So we have ‖Θ_{t+1} − Θ_t‖ → 0 as α_0 → 0. As W is continuous, this means that

‖ω_{t+1} − ω_t‖ = ‖W(Θ_{t+1}) − W(Θ_t)‖ → 0 as α_0 → 0,

so there exists an α_0 > 0 for which ‖ω_{t+1} − ω_t‖ < ε. ∎
In other words, if our updates to Θ are bounded and W is continuous, we can always adjust the initial update rate α_0 so that the difference between ω_t and ω_{t+1} is bounded.
With Q-PAMDP(1) we want P-UPDATE to optimize H(Θ). One logical choice is a gradient update. The next theorem shows that the gradient of H is equal to the gradient of J(Θ, ω) when ω = W(Θ). This is useful as we can apply existing gradient-based policy search methods to compute the gradient of J with respect to Θ. The proof follows from the fact that we are at a global optimum of J with respect to ω, and so the gradient of J with respect to ω is zero. This theorem requires that W is differentiable (and therefore also continuous).
Theorem 4.3 (Gradient of H(Θ)).
If J(Θ, ω) is differentiable with respect to Θ and ω, and W(Θ) is differentiable with respect to Θ, then the gradient of H is given by ∇_Θ H(Θ) = ∇_Θ J(Θ, ω), where ω = W(Θ).
Proof.
If ω = W(Θ), then we can compute the gradient of H(Θ) = J(Θ, W(Θ)) by the chain rule:

∇_Θ H(Θ) = ∇_Θ J(Θ, ω) + ∇_ω J(Θ, ω) ∂W(Θ)/∂Θ,

where ω = W(Θ). Note that as ω = W(Θ) = argmax_{ω′} J(Θ, ω′) by the definitions of W and H, the gradient of J with respect to ω is zero, as ω is a global maximum of J(Θ, ·) for fixed Θ. Therefore, we have that

∇_Θ H(Θ) = ∇_Θ J(Θ, ω). ∎
To summarize, if W is continuous and P-UPDATE converges to a global or local optimum, then Q-PAMDP(1) will converge to a global or local optimum, respectively, and the Q-LEARN step will require only bounded updates if the update rate of the P-UPDATE step is bounded. As such, if P-UPDATE is a policy gradient update step, then by Theorem 4.1 Q-PAMDP(1) will converge to a local optimum, and by Theorem 4.2 the Q-LEARN step will require only a bounded adjustment of ω. This policy gradient step can use the gradient of J with respect to Θ (Theorem 4.3).
With Q-PAMDP(∞), each step performs a full optimization of Θ and then a full optimization of ω. The Θ step optimizes J(Θ, ω) for the current ω, not H(Θ), as we do not update ω while we update Θ. Q-PAMDP(∞) has the disadvantage of requiring global convergence properties for the P-UPDATE method.
Theorem 4.4 (Local Convergence of Q-PAMDP(∞)).
If at each step of Q-PAMDP(∞), for some bounded set Ω,

Θ_{t+1} = argmax_Θ J(Θ, ω_t),
ω_{t+1} = argmax_{ω ∈ Ω} J(Θ_{t+1}, ω),

then Q-PAMDP(∞) converges to a local optimum.
Proof.
By the definition of W, ω_{t+1} = W(Θ_{t+1}). Therefore this algorithm takes the form of direct alternating optimization of J. As such, it converges to a local optimum [Bezdek and Hathaway 2002]. ∎
Q-PAMDP(∞) has weaker convergence guarantees than Q-PAMDP(1), as it requires a globally convergent P-UPDATE. However, it has the potential to bypass nearby local optima [Bezdek and Hathaway 2002].
5 Experiments
We first consider a simplified robot soccer problem [Kitano et al. 1997] in which a single striker attempts to score a goal against a keeper. Each episode starts with the player at a random position along the bottom bound of the field. The player starts in possession of the ball, and the keeper is positioned between the ball and the goal. The game takes place in a 2D environment where the player and the keeper each have a position, velocity, and orientation, and the ball has a position and velocity, resulting in 14 continuous state variables.

An episode ends when the keeper possesses the ball, the player scores a goal, or the ball leaves the field. The reward for an action is 0 for a non-terminal state, 50 for a terminal goal state, and −d for a terminal non-goal state, where d is the distance of the ball to the goal. The player has two parameterized actions: kick-to(x, y), which kicks the ball towards position (x, y); and shoot-goal(h), which shoots the ball towards a position h along the goal line. Noise is added to each action. If the player is not in possession of the ball, it moves towards it. The keeper has a fixed policy: it moves towards the ball, and if the player shoots at the goal, the keeper moves to intercept the ball.
To score a goal, the player must shoot around the keeper. This means that at some positions it must shoot to the left of the keeper, and at others to the right. However, at no point should it shoot directly at the keeper, so an optimal policy is discontinuous. We therefore split the shoot-goal action into two parameterized actions: shoot-goal-left and shoot-goal-right. This allows us to use a simple action-selection policy instead of a complex continuous-action policy. Such a policy would be difficult to represent in a purely continuous action space, but is simple in a parameterized action setting.
We represent the action-value function for each discrete action using linear function approximation with Fourier basis features [Konidaris, Osentoski, and Thomas 2011]. As we have 14 state variables, we must be selective about which basis functions to use: we only use basis functions with at most two nonzero elements, and exclude all velocity state variables. We use a softmax discrete-action policy [Sutton and Barto 1998]. We represent the action-parameter policy for each action a as a normal distribution around a weighted sum of features,

π^a_{θ_a}(x | s) = N(θ_a^T ψ_a(s), Σ_a),

where θ_a is a matrix of weights, ψ_a(s) gives the features for state s, and Σ_a is a fixed covariance matrix. We use specialized features for each action. For the shoot-goal actions we use a simple linear basis (1, g), where g is the projection of the keeper onto the goal line. For kick-to we use linear features based on the position of the ball and the position of the keeper.

For the direct policy search approach, we use the episodic natural actor critic (eNAC) algorithm [Peters and Schaal 2008], computing the gradient of J(Θ, ω) with respect to (Θ, ω). For the Q-PAMDP approach we use the gradient-descent SARSA(λ) algorithm for Q-learning, and the eNAC algorithm for policy search. At each step we perform one eNAC update based on 50 episodes, and then refit ω using 50 gradient-descent SARSA(λ) episodes.
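The pruned Fourier basis described above can be sketched as follows: keep the constant feature plus every coefficient vector with at most two nonzero entries, rather than enumerating the full exponential basis. The order and scaling are illustrative assumptions, and states are assumed to be scaled to [0, 1].

```python
import numpy as np
from itertools import combinations, product

def fourier_basis(state, order=3, max_nonzero=2):
    """Features cos(pi * c . s) for coefficient vectors c with entries in
    {0, ..., order} and at most max_nonzero nonzero entries."""
    d = len(state)
    coeffs = [np.zeros(d)]                       # the constant feature
    for k in range(1, max_nonzero + 1):
        for idxs in combinations(range(d), k):   # which entries are nonzero
            for vals in product(range(1, order + 1), repeat=k):
                c = np.zeros(d)
                c[list(idxs)] = vals
                coeffs.append(c)
    return np.array([np.cos(np.pi * c @ state) for c in coeffs])

# With 14 state variables, order 3, and at most 2 nonzero coefficients, this
# yields 1 + 14*3 + C(14,2)*9 = 862 features instead of the full 4**14.
feats = fourier_basis(np.zeros(14))
```

This kind of pruning is what makes linear function approximation tractable with 14 state variables.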
Return is directly correlated with goal-scoring probability, so their graphs are close to identical. As it is easier to interpret, we plot goal-scoring probability in Figure 2. We can see that direct eNAC is outperformed by Q-PAMDP(1) and Q-PAMDP(∞). This is likely due to the difficulty of optimizing the action-selection parameters directly, rather than with Q-learning.

For both methods, the goal probability is greatly increased: while the initial policy rarely scores a goal, both Q-PAMDP(1) and Q-PAMDP(∞) increase the probability of a goal to roughly 35%. Direct eNAC converged to a local maximum of 15%. Finally, we include the performance of SARSA(λ) where the action parameters are fixed at their initial values; this achieves roughly 20% scoring probability. Both Q-PAMDP(1) and Q-PAMDP(∞) strongly outperform fixed-parameter SARSA, but eNAC does not. Figure 3 depicts a single episode using a converged Q-PAMDP(1) policy: the player draws the keeper out and strikes when the goal is open.
Next we consider the Platform domain, where the agent starts on a platform and must reach a goal while avoiding enemies. If the agent reaches the goal platform, touches an enemy, or falls into a gap between platforms, the episode ends. This domain is depicted in Figure 4. The reward for a step is the change in the agent's position for that step, divided by the total length of all the platforms and gaps. The agent has two primitive actions: run, which continues for a fixed period, and jump, which continues until the agent lands again. There are two different kinds of jumps: a high jump to get over enemies, and a long jump to get over gaps between platforms. The domain therefore has three parameterized actions: run, hop, and leap. The agent only takes actions while on the ground, and enemies only move when the agent is on their platform. The state space consists of four variables, representing the agent position, agent speed, enemy position, and enemy speed respectively. To learn Q, as in the previous domain, we use linear function approximation with the Fourier basis. We apply a softmax discrete-action policy based on Q, and a Gaussian parameter policy based on scaled parameter features ψ_a(s).
Figure 5 shows the performance of eNAC, Q-PAMDP(1), Q-PAMDP(∞), and SARSA with fixed parameters. Both Q-PAMDP(1) and Q-PAMDP(∞) outperformed the fixed-parameter SARSA baseline of 40%, reaching on average 50% and 65% of the total distance respectively, and both did much better than direct optimization using eNAC, which reached 10%. We suggest that Q-PAMDP(∞) outperforms Q-PAMDP(1) due to the nature of the Platform domain. Q-PAMDP(1) is best suited to domains with smooth changes in the action-value function with respect to changes in the parameter-policy. In the Platform domain, our initial policy is unable to make the first jump without modification, and once the policy can reach the second platform, the action-value function must change drastically to account for it. Q-PAMDP(1) may therefore be poorly suited to this domain, as the small change in parameters between failing to make the jump and actually making it results in a large change in the action-value function. Figure 6 shows a successfully completed episode of the Platform domain.
6 Related Work
Guestrin, Hauskrecht, and Kveton (2004) introduced an algorithm for solving factored MDPs with a hybrid discrete-continuous action space. However, their formalism has an action space with a mixed set of discrete and continuous components, whereas ours has distinct actions, each with a different number of continuous components. Furthermore, they assume the domain has a compact factored representation, and they only consider planning.
Rachelson (2009) encountered parameterized actions in the form of an action to wait for a given period of time in his research on time-dependent, continuous-time MDPs (TMDPs). He developed XMDPs, which are TMDPs with a parameterized action space [Rachelson 2009]. He developed a Bellman operator for this setting, and a later paper mentions that the TiMDPpoly algorithm can work with parameterized actions, although this refers specifically to the parameterized wait action [Rachelson, Fabiani, and Garcia 2009]. This research also takes a planning perspective, and only considers a time-dependent domain. Additionally, the size of the parameter space is the same for all actions.
Hoey, Schroder, and Alhothali (2013) considered mixed discrete-continuous actions in their work on Bayesian affect control theory. To approach this problem they use a form of POMCP, a Monte Carlo sampling algorithm [Silver and Veness 2010], with domain-specific adjustments to compute the continuous action components. They note that the discrete and continuous components of the action space reflect different aspects of control: the discrete component provides the “what,” while the continuous component describes the “how” [Hoey, Schroder, and Alhothali 2013].
In their research on symbolic dynamic programming (SDP) algorithms, Zamani, Sanner, and Fang (2012) considered domains with a set of discrete parameterized actions, each with a different parameter space. Symbolic dynamic programming is a form of planning for relational or first-order MDPs, where the MDP has a set of logical relationships defining its dynamics and reward function. Their algorithms represent the value function as an extended algebraic decision diagram (XADD), and are limited to MDPs with predefined logical relations.
A hierarchical MDP is an MDP where each action has subtasks. A subtask is itself an MDP with its own states and actions, which may in turn have their own subtasks. Hierarchical MDPs are well-suited to representing parameterized actions, as we could consider selecting the parameters for a discrete action to be a subtask. MAXQ is a method for value function decomposition of hierarchical MDPs [Dietterich 2000]; one possibility is to use MAXQ to learn the action-values in a parameterized action problem.
7 Conclusion
The PAMDP formalism models reinforcement learning domains with parameterized actions. Parameterized actions give us the adaptability of continuous action spaces while retaining distinct kinds of actions. They also allow for the simple representation of discontinuous policies without complex parameterizations. We have presented three approaches for model-free learning in PAMDPs: direct optimization, and two variants of the Q-PAMDP algorithm. We have shown that Q-PAMDP(1), with an appropriate P-UPDATE method, converges to a local or global optimum, and that Q-PAMDP(∞) with a global optimization step converges to a local optimum.
We have examined the performance of these approaches in the goal-scoring domain and the Platform domain. The robot soccer goal domain models the situation where a striker must outmaneuver a keeper to score a goal. There, Q-PAMDP(1) and Q-PAMDP(∞) outperformed eNAC and fixed-parameter SARSA, performing similarly well and learning policies that score goals roughly 35% of the time. In the Platform domain we found that both Q-PAMDP(1) and Q-PAMDP(∞) again outperformed eNAC and fixed-parameter SARSA.
References
[Bezdek and Hathaway 2002] Bezdek, J., and Hathaway, R. 2002. Some notes on alternating optimization. In Advances in Soft Computing. Springer. 288–300.

[da Silva, Konidaris, and Barto 2012] da Silva, B.; Konidaris, G.; and Barto, A. 2012. Learning parameterized skills. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 1679–1686.

[Deisenroth et al. 2014] Deisenroth, M.; Englert, P.; Peters, J.; and Fox, D. 2014. Multi-task policy search for robotics. In Proceedings of the Fourth International Conference on Robotics and Automation, 3876–3881.

[Deisenroth, Neumann, and Peters 2013] Deisenroth, M.; Neumann, G.; and Peters, J. 2013. A Survey on Policy Search for Robotics. Number 1–2. Now Publishers.

[Dietterich 2000] Dietterich, T. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.

[Guestrin, Hauskrecht, and Kveton 2004] Guestrin, C.; Hauskrecht, M.; and Kveton, B. 2004. Solving factored MDPs with continuous and discrete variables. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, 235–242.

[Hoey, Schroder, and Alhothali 2013] Hoey, J.; Schroder, T.; and Alhothali, A. 2013. Bayesian affect control theory. In Proceedings of the Fifth International Conference on Affective Computing and Intelligent Interaction, 166–172. IEEE.

[Kitano et al. 1997] Kitano, H.; Asada, M.; Kuniyoshi, Y.; Noda, I.; Osawa, E.; and Matsubara, H. 1997. RoboCup: A challenge problem for AI. AI Magazine 18(1):73.

[Kober et al. 2012] Kober, J.; Wilhelm, A.; Oztop, E.; and Peters, J. 2012. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots 33(4):361–379.

[Konidaris, Osentoski, and Thomas 2011] Konidaris, G.; Osentoski, S.; and Thomas, P. 2011. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 380–385.

[Peters and Schaal 2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180–1190.

[Rachelson, Fabiani, and Garcia 2009] Rachelson, E.; Fabiani, P.; and Garcia, F. 2009. TiMDPpoly: an improved method for solving time-dependent MDPs. In Proceedings of the Twenty-First International Conference on Tools with Artificial Intelligence, 796–799. IEEE.

[Rachelson 2009] Rachelson, E. 2009. Temporal Markov Decision Problems: Formalization and Resolution. Ph.D. Dissertation, University of Toulouse, France.

[Silver and Veness 2010] Silver, D., and Veness, J. 2010. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, volume 23, 2164–2172.

[Sutton and Barto 1998] Sutton, R., and Barto, A. 1998. Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press.

[Watkins and Dayan 1992] Watkins, C., and Dayan, P. 1992. Q-learning. Machine Learning 8(3–4):279–292.

[Zamani, Sanner, and Fang 2012] Zamani, Z.; Sanner, S.; and Fang, C. 2012. Symbolic dynamic programming for continuous state and action MDPs. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence.