Discretizing Continuous Action Space for On-Policy Optimization

01/29/2019 ∙ by Yunhao Tang, et al.

In this work, we show that discretizing action space for continuous control is a simple yet powerful technique for on-policy optimization. The explosion in the number of discrete actions can be efficiently addressed by a policy with factorized distribution across action dimensions. We show that the discrete policy achieves significant performance gains with state-of-the-art on-policy optimization algorithms (PPO, TRPO, ACKTR) especially on high-dimensional tasks with complex dynamics. Additionally, we show that an ordinal parameterization of the discrete distribution can introduce the inductive bias that encodes the natural ordering between discrete actions. This ordinal architecture further significantly improves the performance of PPO/TRPO.




1 Introduction

In reinforcement learning (RL), the action spaces of conventional control tasks are usually dichotomized as either discrete or continuous (Brockman et al., 2016). While discrete action spaces are conducive to theoretical analysis, in the context of deep reinforcement learning their applications have been largely limited to video game playing or board games (Mnih et al., 2013; Silver et al., 2016). On the other hand, in simulated or real-life robotics control (Levine et al., 2016; Schulman et al., 2015a), the action space is by design continuous. Continuous control typically requires more subtle treatment, since a continuous range of control contains an infinite number of feasible actions and one must resort to parametric functions for a compact representation of distributions over actions.

Can we retain the simplicity of discrete actions when solving continuous control tasks? A straightforward solution is to discretize the continuous action space, i.e. we discretize the continuous range of actions into a finite set of atomic actions and reduce the original task to a new task with a discrete action space. A common argument against this approach is that for an action space with M dimensions, discretizing into K atomic actions per dimension leads to K^M combinations of joint atomic actions, which quickly becomes intractable as M increases. However, a simple fix is to represent the joint distribution over discrete actions as factorized across dimensions, so that the joint policy remains tractable. As prior works have applied such a discretization method in practice (OpenAI, 2018; Jaśkowski et al., 2018), we aim to carry out a systematic study of this straightforward discretization method in simulated environments, and show how it improves upon on-policy optimization baselines.

The paper proceeds as follows. In Section 2, we introduce background on on-policy optimization baselines (e.g. TRPO and PPO) and related work. In Section 3, we introduce the straightforward method of discretizing the action space for continuous control, and analyze the properties of the resulting policies as the number of atomic actions changes. In Section 4, we introduce the stick-breaking parameterization (Khan et al., 2012), an architecture that parameterizes the discrete distribution while encoding the natural ordering between discrete actions. In Section 5, through extensive experiments we show how the discrete/ordinal policy improves upon current on-policy optimization baselines and related prior works, especially on high-dimensional tasks with complex dynamics.

2 Background

2.1 Markov Decision Process

In the standard formulation of a Markov Decision Process (MDP), an agent starts with an initial state s_0 at time t = 0. At time t, the agent is in state s_t, takes an action a_t, receives a reward r_t and transitions to a next state s_{t+1}. A policy π is a mapping from a state to a distribution over actions, π(·|s). The expected cumulative reward under policy π is J(π) = E_π[ Σ_{t≥0} γ^t r_t ], where γ ∈ [0, 1) is a discount factor. The objective is to search for the optimal policy π* = arg max_π J(π). For convenience, under policy π we define the action value function Q^π(s, a) = E_π[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a ] and the value function V^π(s) = E_π[ Σ_{t≥0} γ^t r_t | s_0 = s ]. Also define the advantage function A^π(s, a) = Q^π(s, a) − V^π(s).
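The discounted return and advantage definitions above can be made concrete with a short sketch; `discounted_returns` is our own illustrative helper (not code from the paper) and assumes a single finite trajectory:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # G_t = sum_{k >= t} gamma^(k - t) r_k, computed backwards in one pass
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# The advantage A(s_t, a_t) is then estimated as G_t - V(s_t),
# with V given by a learned critic (not shown here).
print(discounted_returns([1.0, 0.0, 1.0], gamma=0.5).tolist())  # [1.25, 0.5, 1.0]
```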

2.2 On-Policy Optimization

In policy optimization, one restricts the policy search to a class of parameterized policies π_θ, where θ ∈ Θ is the parameter and Θ is the parameter space. A straightforward way to update the policy is to do local search in the parameter space with the policy gradient g = ∇_θ J(π_θ), i.e. the incremental update θ_new = θ_old + α g with some learning rate α. Alternatively, we can motivate the above gradient update with a trust region formulation. In particular, consider the following constrained optimization problem

max_θ J(π_θ)   subject to   ||θ − θ_old||_2 ≤ ε,     (1)

for some ε > 0. If we do a linear approximation of the objective in (1), J(π_θ) ≈ J(π_θ_old) + g^T (θ − θ_old), we recover the gradient update by properly choosing α given ε.

In such vanilla policy gradient updates, the training can suffer from occasional large step sizes and may never recover from a bad policy (Schulman et al., 2015a). The following variants were introduced to entail more stable updates.

2.2.1 Trust Region Policy Optimization (TRPO)

Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) applies an information constraint between θ_old and θ to better capture the geometry of the parameter space induced by the underlying policy distributions. Consider the following trust region formulation

max_θ  E_{s,a ∼ π_θ_old} [ (π_θ(a|s) / π_θ_old(a|s)) A^{π_θ_old}(s, a) ]   subject to   E_s [ KL( π_θ_old(·|s) || π_θ(·|s) ) ] ≤ ε.     (2)

The trust region enforced by the KL divergence entails that the update according to (2) optimizes a lower bound of J(π_θ), so as to avoid accidentally taking large steps that irreversibly degrade the policy performance during training, as in vanilla policy gradient (1) (Schulman et al., 2015a). As a practical algorithm, TRPO further approximates the KL constraint in (2) by a quadratic constraint and approximately inverts the constraint matrix by conjugate gradient iterations (Wright and Nocedal, 1999).

Actor Critic using Kronecker-Factored Trust Region (ACKTR)

As a more scalable and stable alternative to TRPO, ACKTR (Wu et al., 2017) proposes to invert the constraint matrix with K-FAC (Martens and Grosse, 2015) instead of the conjugate gradient iterations of TRPO.

2.2.2 Proximal Policy Optimization (PPO)

For a practical algorithm, TRPO requires approximately inverting the Fisher matrix with conjugate gradient iterations. Proximal Policy Optimization (PPO) (Schulman et al., 2017a) proposes to approximate a trust-region method by clipping the likelihood ratio r_θ(s, a) = π_θ(a|s) / π_θ_old(a|s) as clip(r_θ(s, a), 1 − ε, 1 + ε), where clip(x, 1 − ε, 1 + ε) clips the argument x between 1 − ε and 1 + ε. Consider the following objective

max_θ  E_{s,a ∼ π_θ_old} [ min( r_θ(s, a) A^{π_θ_old}(s, a),  clip(r_θ(s, a), 1 − ε, 1 + ε) A^{π_θ_old}(s, a) ) ].     (3)

The intuition behind the objective (3) is that when A^{π_θ_old}(s, a) > 0, the clipping removes the incentive for the ratio to go above 1 + ε, with similar effects for the case A^{π_θ_old}(s, a) < 0. PPO achieves more stable empirical performance than TRPO and involves only relatively cheap first-order optimization.
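The clipping behavior can be illustrated with a small sketch (ε = 0.2 is an assumed value and `ppo_clip_objective` is our own helper name, not from any library):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    # per-sample clipped surrogate: min( r * A, clip(r, 1 - eps, 1 + eps) * A )
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.minimum(ratio * advantage, clipped * advantage)

# With A > 0, pushing the ratio above 1 + eps earns no extra objective:
print(ppo_clip_objective(1.5, advantage=1.0))   # 1.2
# With A < 0, the min keeps the pessimistic (clipped) value:
print(ppo_clip_objective(0.5, advantage=-1.0))  # -0.8
```

The outer min makes the surrogate a pessimistic bound, so aggressive ratio changes in either direction are not rewarded.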

2.3 Related Work

On-policy Optimization.

Vanilla policy gradient updates are typically unstable (Schulman et al., 2015a). Natural policy gradient (Kakade, 2002) applies an information metric instead of the Euclidean metric on parameters for updates. To further stabilize training for deep RL, TRPO (Schulman et al., 2015a) places an explicit KL divergence constraint between consecutive policies (Kakade and Langford, 2002) and approximately enforces the constraint with conjugate gradients. PPO (Schulman et al., 2017a) replaces the KL divergence constraint with a clipping of the likelihood ratio of consecutive policies, which allows for fast first-order updates and achieves state-of-the-art performance for on-policy optimization. On top of TRPO, ACKTR (Wu et al., 2017) further improves the natural gradient computation with Kronecker-factored approximation (George et al., 2018). Orthogonal to these algorithmic advances, we demonstrate that policy optimization with discrete/ordinal policies achieves consistent and significant improvements over baseline distributions on all of the above benchmark algorithms.

Policy Classes.

Orthogonal to the algorithmic procedures for policy updates, one is free to choose any policy class. In discrete action space, the canonical choice is a categorical distribution (or a discrete distribution). In continuous action space, the default baseline is a factorized Gaussian (Schulman et al., 2015a, 2017a). Gaussian mixtures, implicit generative models or even normalizing flows (Rezende and Mohamed, 2015) can be used for more expressive and flexible policy classes (Tang and Agrawal, 2018; Haarnoja et al., 2017, 2018b, 2018a), which achieve performance gains primarily for off-policy learning. One issue with the aforementioned prior works is that they do not disentangle algorithms from distributions; it is therefore unclear whether the benefits result from a better algorithm or a more expressive policy. To make the contributions clear, we make no changes to the on-policy algorithms and show the net effect of how the policy classes improve the performance. Motivated by the fact that unbounded distributions can generate infeasible actions, Chou et al. (2017) propose to use the Beta distribution and also show improvements on TRPO. Early prior work (Shariff and Dick) also proposes a truncated Gaussian distribution, but the idea was not tested on deep RL tasks. Complementary to prior works, we propose the discrete/ordinal policy as a simple yet powerful alternative to baseline policy classes.

Discrete and Continuous Action Space.

Prior works have exploited the connection between discrete and continuous action spaces. For example, to solve discrete control tasks, Van Hasselt and Wiering (2009); Dulac-Arnold et al. (2015) leverage the continuity of the underlying continuous action space for generalization across discrete actions. Prior works have also converted continuous control problems into discrete ones, e.g. Pazis and Lagoudakis (2009) convert low-dimensional control problems into discrete problems with binary actions. Surprisingly, few prior works have considered a discrete policy and applied off-the-shelf policy optimization algorithms directly. Recently, OpenAI (2018); Jaśkowski et al. (2018) applied discrete policies to challenging hand manipulation and humanoid walking respectively. As a more comprehensive study, we carry out a full evaluation of discrete/ordinal policy on continuous control benchmarks and validate their performance gains.

To overcome the explosion of the joint action space, Metz et al. (2018) cast multi-dimensional discrete control as sequence prediction, but so far their strategy has only been shown effective on relatively low-dimensional problems (e.g. HalfCheetah). Tavakoli et al. (2018) propose to avoid taking the argmax across all joint actions in Q-learning by applying the argmax independently across dimensions. Their method is also only tested on a very limited number of tasks. As an alternative, we consider distributions that factorize across dimensions and show that this simple technique yields consistent performance gains.

Ordinal Architecture.

When discrete variables have an internal ordering, it is beneficial to account for such ordering when modeling the categorical distributions. In statistics, such problems are tackled as ordinal regression or classification (Winship and Mare, 1984; Chu and Ghahramani, 2005; Chu and Keerthi, 2007). Few prior works aim to combine ideas of ordinal regression with neural networks. Though Cheng et al. (2008) propose to introduce the ordering as part of the loss function, they do not introduce a proper probabilistic model and need additional techniques during inference. More recently, Khan et al. (2012) motivate the stick-breaking parameterization, a proper probabilistic model which does not introduce additional parameters compared to the original categorical distribution. In our original derivation, we motivate the architecture of (Khan et al., 2012) by transforming the loss function of (Cheng et al., 2008). We also show that such additional inductive bias greatly boosts the performance of PPO/TRPO.

3 Discretizing Action Space for Continuous Control

Without loss of generality, we assume the action space A = [−1, 1]^M. We discretize each dimension of the action space into K equally spaced atomic actions, so that the set of atomic actions for each dimension is evenly spaced over [−1, 1]. To overcome the curse of dimensionality, we represent the distribution as factorized across dimensions. In particular, in state s, we specify a categorical distribution π_θ_i(·|s) over the K atomic actions for each dimension i, with θ_i as the parameters of this marginal distribution. Then we define the joint discrete policy π_θ(a|s) = Π_{i=1}^{M} π_θ_i(a_i|s), where θ = {θ_i}. The factorization allows us to maintain a tractable distribution over joint actions, making it easy to do both sampling and training.
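As a concrete illustration of the factorized discrete policy, the following sketch (with assumed toy values K = 5, M = 3, uniform marginals standing in for a network output, and helper names of our own) samples a joint action on the discretized grid and evaluates its factorized log-probability:

```python
import numpy as np

K, M = 5, 3                          # assumed values: K atomic actions, M dimensions
atoms = np.linspace(-1.0, 1.0, K)    # evenly spaced atomic actions in [-1, 1]

def sample_factorized(probs, rng):
    # probs: (M, K) per-dimension categorical probabilities for one state
    idx = np.array([rng.choice(K, p=probs[i]) for i in range(M)])
    return atoms[idx], idx

def joint_log_prob(probs, idx):
    # factorization: log pi(a|s) = sum_i log pi_i(a_i|s)
    return np.sum(np.log(probs[np.arange(M), idx]))

rng = np.random.default_rng(0)
probs = np.full((M, K), 1.0 / K)     # uniform marginals as a toy stand-in for a network
action, idx = sample_factorized(probs, rng)
print(action)                        # a joint action on the discretized grid
print(joint_log_prob(probs, idx))    # equals M * log(1/K) under uniform marginals
```

Sampling and log-probability evaluation both cost O(M·K), rather than O(K^M) for an unfactorized joint distribution.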

3.1 Network Architecture

The discrete policy is parameterized as follows. As in prior works (Schulman et al., 2015b, 2017b), the policy is a neural network that takes the state s as input and, through multiple layers of transformation, encodes the state into a hidden vector h. For the jth atomic action in the ith dimension of the action space, we output a logit L_ij computed from h with parameters w_ij. For any dimension i, the logits L_i1, ..., L_iK are combined by a softmax to compute the probability of choosing the jth atomic action, π_θ_i(a_i = j | s) = exp(L_ij) / Σ_k exp(L_ik). By construction, the network has a fixed-size low-level parameter set for the encoder, while the output layer has parameters whose size scales linearly with K.

3.2 Understanding Discrete Policy for Continuous Control

Here we briefly analyze the empirical properties of the discrete policy.

Discrete Policy is more expressive than Gaussian.

Though a discrete policy is limited to taking atomic actions, in practice it can represent much more flexible distributions than a Gaussian when there is a sufficient number of atomic actions. Intuitively, a discrete policy can represent a multi-modal action distribution while a Gaussian is by design unimodal. We illustrate this practical difference with a bandit example in Figure 1. Consider a one-step bandit problem with a one-dimensional continuous action. The reward function over actions is illustrated as the blue curve in Figure 1(a). We train a discrete policy and a Gaussian policy on the environment for a fixed number of steps and show their training curves in (b), with five different random seeds per policy. We see that 4 out of 5 Gaussian policies are stuck at a suboptimal policy while all discrete policies achieve the optimal reward. Figure 1(a) illustrates the density of a trained discrete policy (red) and a suboptimal Gaussian policy (green). The trained discrete policy is bi-modal and automatically captures the bi-modality of the reward function (notice that we did not add entropy regularization to encourage high entropy). The only Gaussian policy that achieves optimal rewards in (b) captures only one mode of the reward function.

For general high-dimensional problems, the reward landscape becomes much more complex. However, this simple example illustrates that the discrete policy can potentially capture the multi-modality of the landscape and achieve better exploration (Haarnoja et al., 2017) to bypass suboptimal policies.

Figure 1: Analyzing discrete policy. Panels: (a) Bandit: density, (b) Bandit: learning curves. (a) Bandit example: comparison of the normalized reward (blue), trained discrete policy density (red) and trained Gaussian policy density (green). (b) Bandit example: learning curves of discrete policy vs. Gaussian policy over five random seeds. Most seeds for the Gaussian policy get stuck at the suboptimal policy displayed in (a), while all discrete policies reach the bi-modal optimal policy in (a).
Effects of the Number of Atomic Actions K.

Choosing a proper number of atomic actions K per dimension is critical for learning. The trade-off comes in many aspects: (a) Control capacity. When K is small, the discretization is too coarse and the policy does not have enough capacity to achieve good performance. (b) Training difficulty. When K increases, the variance of policy gradients also increases. We detail the analysis of policy gradient variance in Appendix B. We also present the combined effects of (a) and (b) in Appendix B, where we find that the best performance is obtained at intermediate values of K, and setting K either too large or too small degrades the performance. (c) Model parameters and computational costs. Both the number of model parameters and the computational costs grow linearly in K. We present detailed computational results in Appendix B.

4 Discrete Policy with Ordinal Architecture

4.1 Motivation

When the continuous action space is discretized, we treat continuous variables as discrete and discard important information about the underlying continuous space. It is therefore desirable to incorporate the notion of continuity when parameterizing distributions over discrete actions.

4.2 Ordinal Distribution Network Architecture

For simplicity, we discuss the discrete distribution over a single action dimension. Recall that a typical feed-forward architecture producing a discrete distribution over K classes outputs logits L_1, ..., L_K at the last layer and derives the probability of class j via a softmax, p_j = exp(L_j) / Σ_k exp(L_k). In the ordinal architecture, we retain these logits and first transform them via a sigmoid, s_j = σ(L_j). Then we compute the final logits as

L'_j = Σ_{k=1}^{j} log s_k + Σ_{k=j+1}^{K} log(1 − s_k),     (4)

and derive the final output probability via a softmax, p_j = exp(L'_j) / Σ_k exp(L'_k). The actions are sampled according to this discrete distribution.
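A minimal numpy sketch of the ordinal transformation, under our reading of Eq. (4); the helper names and the example logits are our own, not the authors' code:

```python
import numpy as np

def ordinal_logits(logits):
    # L'_j = sum_{k <= j} log sigmoid(L_k) + sum_{k > j} log(1 - sigmoid(L_k))
    s = 1.0 / (1.0 + np.exp(-logits))
    log_s, log_1ms = np.log(s), np.log(1.0 - s)
    K = len(logits)
    return np.array([log_s[: j + 1].sum() + log_1ms[j + 1 :].sum() for j in range(K)])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

raw = np.array([2.0, 0.5, -1.0, -3.0])  # illustrative raw logits
p = softmax(ordinal_logits(raw))
print(p.round(3))  # unimodal: mass peaks at one bin and decays over its neighbors
```

Note that each final logit depends on all raw logits, which is how the ordering information enters the distribution without adding any parameters.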

This architecture is very similar to the stick-breaking parameterization introduced in (Khan et al., 2012), where they argue that such a parameterization is beneficial when samples drawn from class j can be easily separated from samples drawn from all classes k > j. In our original derivation, we motivate the ordinal architecture from the loss function of (Cheng et al., 2008), and we show that the intuition behind (4) is clearer from this perspective. We illustrate the intuition below with a K-way classification problem, where the classes are internally ordered as 1 < 2 < ... < K.

Intuition behind (4).

For clarity, let s_j = σ(L_j) such that s_j ∈ (0, 1), and define the stable cross entropy H(y, s) = −Σ_j [ y_j log(s_j + δ) + (1 − y_j) log(1 − s_j + δ) ] with a very small δ > 0 to avoid numerical singularity. For a sample from class j, the K-way classification loss based on (4) is H(y_j, s), where the predicted vector is s = (s_1, ..., s_K) and the target vector y_j has its first j entries set to 1 and the remaining entries set to 0. The intuition becomes clear when we interpret y_j as a continuous encoding of class j (instead of the one-hot vector) and s as an intermediate vector from which we draw the final prediction. We see that continuity between classes is introduced through the loss function: for example, ||y_j − y_{j+1}||_1 = 1 < ||y_j − y_{j+2}||_1 = 2, i.e. the discrepancy between classes j and j+1 is strictly smaller than that between classes j and j+2. On the contrary, such information cannot be introduced by one-hot encoding: let e_j be the one-hot vector for class j; we always have ||e_j − e_k||_1 = 2 for any j ≠ k, i.e. one-hot encoding introduces no discrepancy relationship between classes. While Cheng et al. (2008) introduce such continuous encoding techniques, they do not propose a proper probabilistic model and require additional techniques at inference time to make predictions. Here, the ordinal architecture (4) defines a proper probabilistic model that implicitly introduces an internal ordering between classes through the parameterization, while maintaining all the probabilistic properties of discrete distributions.

In summary, the ordinal architecture (4) introduces additional dependencies between the logits which implicitly inject information about the class ordering. In practice, we find that this generally brings significant performance gains during policy optimization.

5 Experiments

Our experiments aim to address the following questions: (a) Does discrete policy improve the performance of baseline algorithms on benchmark continuous control tasks? (b) Does the ordinal architecture further improve upon discrete policy? (c) How sensitive is the performance to hyper-parameters, particularly to the number of bins per action dimension?

For clarity, we henceforth refer to a policy with the discrete distribution as the discrete policy, and a policy with the ordinal architecture as the ordinal policy. To address (a), we carry out comparisons in two parts: (1) We compare discrete policy (with varying K) against Gaussian policy over baseline algorithms (PPO, TRPO and ACKTR), evaluated on benchmark tasks in gym MuJoCo (Brockman et al., 2016; Todorov, 2008), rllab (Duan et al., 2016), Roboschool (Schulman et al., 2017a) and Box2D. Here we pay special attention to the Gaussian policy because it is the default policy class implemented in popular code bases (Dhariwal et al., 2017); (2) We compare with other architectural alternatives, either straightforward architectural variants or those suggested in prior works (Chou et al., 2017). We evaluate their performance on high-dimensional tasks with complex dynamics (e.g. Walker, Ant and Humanoid). All the above results are reported in Section 5.1. To address (b), we compare discrete policy against ordinal policy with PPO in Section 5.2 (results for TRPO are also in Section 5.1). To address (c), we randomly sample hyper-parameters for Gaussian policy and discrete policy and compare their quantile plots in Section 5.3 and Appendix C.

Implementation Details.

As we aim to study the net effect of the discrete/ordinal policy with on-policy optimization algorithms, we make minimal modifications to the original PPO/TRPO/ACKTR algorithms as implemented in OpenAI baselines (Dhariwal et al., 2017). We leave all hyper-parameter settings to Appendix A.

Figure 2: MuJoCo Benchmarks: learning curves of PPO/TRPO on OpenAI gym MuJoCo locomotion tasks. Panels: (a) HalfCheetah + PPO, (b) Ant + PPO, (c) Walker + PPO, (d) Humanoid (R) + PPO, (e) Sim. Human. (R) + PPO, (f) Humanoid + PPO, (g) HalfCheetah + TRPO, (h) Ant + TRPO, (i) Walker + TRPO, (j) Sim. Human. (R) + TRPO, (k) Humanoid (R) + TRPO, (l) Humanoid + TRPO. Each curve is averaged over 5 random seeds and corresponds to a different policy architecture (Gaussian or discrete actions with varying number of bins K). The vertical axis is the cumulative reward and the horizontal axis is the number of time steps. Discrete actions significantly outperform Gaussian on the Humanoid tasks. Tasks with (R) are from rllab.

5.1 Benchmark performance

All benchmark comparison results are presented in plots (Figures 2 and 3) or tables (Tables 1 and 2). For plots, we show the learning curves of different policy classes trained for a fixed number of time steps. The x-axis shows the time steps while the y-axis shows the cumulative rewards. Each curve shows the average performance, with the standard deviation shown in shaded areas. Results in Figures 2 and 4 are averaged over 5 random seeds and Figure 3 over 2 random seeds. In Tables 1 and 2, we train all policies for a fixed number of time steps and show the average ± standard deviation of the cumulative rewards obtained in the last 10 training iterations.

PPO/TRPO - Comparison with Gaussian Baselines.

We evaluate PPO/TRPO with Gaussian policy against PPO/TRPO with discrete policy on the full suite of MuJoCo control tasks and display all results in Figure 2. For PPO, on tasks with relatively simple dynamics, discrete policy does not necessarily enjoy significant advantages over Gaussian policy. For example, the rate of learning of discrete policy is comparable to Gaussian on HalfCheetah (Figure 2(a)) and even slightly lower on Ant (Figure 2(b)). However, on high-dimensional tasks with very complex dynamics (e.g. Humanoid, Figure 2(d)-(f)), discrete policy significantly outperforms Gaussian policy. For TRPO, the performance gains from discrete policy are also very consistent and significant.

We also evaluate the algorithms on Roboschool Humanoid tasks, as shown in Figure 3. We see that discrete policy achieves better results than Gaussian across all tasks and both algorithms. The performance gains are most significant with TRPO (Figure 3(b)(d)(f)), where we see that Gaussian policy barely makes progress during training while discrete policy has very stable learning curves. For completeness, we also evaluate PPO/TRPO with discrete policy vs. Gaussian policy on Box2D tasks and see that the performance gains are significant. Due to space limits, we present the Box2D results in Appendix C.

By construction, when discrete policy and Gaussian policy share the encoding architecture described in Section 3, discrete policy has many more parameters than Gaussian policy. A critical question is whether we can achieve the same performance gains by simply increasing the number of parameters. We show that when we train a Gaussian policy with many more hidden units per layer, the policy does not perform as well. This validates our speculation that the performance gains result from a more carefully designed distribution class rather than from larger networks.

PPO - Comparison with Off-Policy Baselines.

To further illustrate the strength of PPO with discrete policy on high-dimensional tasks with very complex dynamics, we compare PPO with discrete policy against state-of-the-art off-policy algorithms on Humanoid tasks (Humanoid-v1 and Humanoid rllab); both tasks have very high-dimensional observation and action spaces. Such algorithms include DDPG (Lillicrap et al., 2015), SQL (Haarnoja et al., 2017), SAC (Haarnoja et al., 2018b) and TD3 (Fujimoto et al., 2018), among which SAC and TD3 are known to achieve significantly better performance on MuJoCo benchmark tasks than other algorithms. Off-policy algorithms reuse samples and can potentially achieve orders of magnitude better sample efficiency than on-policy algorithms. For example, it has been commonly observed in prior works (Haarnoja et al., 2018b, a; Fujimoto et al., 2018) that SAC/TD3 can achieve state-of-the-art performance on most benchmark control tasks within relatively few training steps, on condition that off-policy samples are heavily replayed. In general, on-policy algorithms cannot match such fast convergence because samples are quickly discarded. However, for highly complex tasks such as Humanoid, even off-policy algorithms take many more samples to learn, potentially because off-policy learning becomes more unstable and off-policy samples are less informative. In Table 1, we record the final performance of off-policy algorithms directly from the figures in (Haarnoja et al., 2018b), following the practice of (Mania et al., 2018). The final performance of the PPO algorithms is computed as the average ± std of the returns in the last 10 training iterations across 5 random seeds. All algorithms are trained for the same number of steps. We observe in Table 1 that PPO + discrete (ordinal) actions achieves comparable or even better results than the off-policy baselines. This shows that for general complex applications, PPO with discrete/ordinal policy is still as competitive as the state-of-the-art off-policy methods.

PPO/TRPO - Comparison with Alternative Architectures.

We also compare with straightforward architectural alternatives: Gaussian with tanh non-linearity as the output layer, and the Beta distribution (Chou et al., 2017). The primary motivation for these architectures is that they naturally bound the sampled actions (or action means) to the feasible range ([−1, 1] for gym tasks). By construction, our proposed discrete/ordinal policy also bounds the sampled actions within the feasible range. In Table 2, we show results for PPO/TRPO, where we select the best result over the number of bins K for discrete/ordinal policy. We make several observations from the results in Table 2: (1) Bounding actions (or action means) to the feasible range does not consistently bring performance gains, because we observe that Gaussian + tanh and the Beta distribution do not consistently outperform Gaussian. This is potentially because the parameterizations that bound the actions (or action means) also introduce challenges for optimization. For example, Gaussian + tanh bounds the action means to (−1, 1); for an action mean to approach the boundary, the pre-activation output must reach extreme values, which is hard to achieve with SGD-based methods. (2) Discrete/ordinal policies achieve significantly better results consistently across most tasks. Combining (1) and (2), we argue that the performance gains of discrete/ordinal policies are due to reasons beyond a bounded action distribution.

Figure 3: Roboschool Humanoid Benchmarks: learning curves of PPO/TRPO on Roboschool Humanoid locomotion tasks. Panels: (a) Humanoid + PPO, (b) Humanoid + TRPO, (c) Flagrun + PPO, (d) Flagrun + TRPO, (e) FlagrunHarder + PPO, (f) FlagrunHarder + TRPO. Each curve corresponds to a different policy architecture (Gaussian or discrete actions with varying number of bins K). Discrete policies outperform Gaussian policy on all Humanoid tasks and the performance gains are most significant with TRPO.

Here we discuss the results for the Beta distribution. In our implementation, we find that training with the Beta distribution tends to generate numerical errors when the update is more aggressive (e.g. a larger PPO learning rate or TRPO trust region size). More conservative updates (e.g. a smaller PPO learning rate or TRPO trust region size) reduce numerical errors but also greatly degrade the learning performance. We suspect that this is because the Beta distribution parameterization (Appendix A and (Chou et al., 2017)) is numerically unstable, and we discuss the potential reason in Appendix A. In Table 2, the results for the Beta distribution are recorded as the performance of the last 10 iterations before training terminates (potentially prematurely due to numerical errors). The potential advantages of the Beta distribution are largely offset by the unstable training. We show more results in Appendix C.

Tasks DDPG SQL SAC TD3 PPO + Gaussian PPO + discrete PPO + ordinal
Table 1: A comparison of PPO with discrete/ordinal policy against state-of-the-art baseline algorithms on Humanoid benchmark tasks from OpenAI gym and rllab. For each task, we show the average reward achieved after training the agent for a fixed number of time steps. The results for PPO + discrete/ordinal/Gaussian policy are mean performance averaged over 5 random seeds (see Figure 2). The results for DDPG, SQL, SAC and TD3 are approximated based on the figures in (Haarnoja et al., 2018b). The results for PPO are consistent with the results in (Haarnoja et al., 2018b). Even compared to off-policy algorithms, PPO + ordinal policy achieves state-of-the-art performance across both tasks.

PPO: Gaussian | Gaussian+tanh | Beta | Discrete | Ordinal
Humanoid (R)
Sim. Humanoid (R)
Humanoid Standup
TRPO: Gaussian | Gaussian+tanh | Beta | Discrete | Ordinal
Humanoid (R)
Sim. Humanoid (R)
Table 2: Comparison across a range of policy alternatives (Gaussian, Gaussian + tanh, and Beta distribution (Chou et al., 2017)). All policies are optimized with PPO/TRPO. All tasks are trained for 10M steps. Results are the average ± std performance over the last 10 training iterations. The top two results (with the highest average) are highlighted in bold font. Tasks with (R) are from rllab.
ACKTR - Comparison with Gaussian Baselines.

We show results for ACKTR in Appendix C. We observe that for tasks with complex dynamics, discrete policy still achieves performance gains over its Gaussian policy counterpart.

Figure 4: MuJoCo Benchmarks: learning curves of PPO + discrete policy vs. PPO + ordinal policy on OpenAI gym MuJoCo locomotion tasks. Panels: (a) Walker, (b) Ant, (c) Humanoid (R), (d) Sim. Humanoid (R), (e) Humanoid, (f) Humanoid Standup. All policies use the same number of bins K per dimension. We see that for each task, ordinal policy outperforms discrete policy.

5.2 Discrete Policy vs. Ordinal Policy

In Figure 4, we evaluate PPO + discrete policy and PPO + ordinal policy on high-dimensional tasks. Across all presented tasks, ordinal policy achieves significantly better performance than discrete policy both in terms of asymptotic performance and speed of convergence. Similar results are also presented in Table 1, where we show that PPO + ordinal policy achieves performance comparable to efficient off-policy algorithms on Humanoid tasks. We also compare these two architectures when trained with TRPO; the comparison of the trained policies can be found in Table 2. For most tasks, we find that ordinal policy still significantly improves upon discrete policy.

Summarizing the results for PPO/TRPO, we conclude that the ordinal architecture introduces a useful inductive bias that improves policy optimization. We note that the stick-breaking parameterization (4) is not the only parameterization that leverages natural orderings between discrete classes. We leave as promising future work how to better exploit task-specific orderings between classes.

5.3 Sensitivity to Hyper-parameters

Here we evaluate the policy classes’ sensitivity to more general hyper-parameters, such as the learning rate, the number of bins per dimension, and random seeds. We present the results of PPO in Appendix C. For PPO with Gaussian, we uniformly sample a learning rate and a random seed. For PPO with discrete actions, we additionally uniformly sample the number of bins. For each benchmark task, we sample 30 hyper-parameter configurations and show the quantile plot of the final performance. As seen in Appendix C, PPO with discrete actions is generally more robust to such hyper-parameters than Gaussian.
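The sampling procedure above can be sketched as follows. This is a minimal illustration, not the authors' script; the concrete learning-rate grid, bin counts, and seed pool are hypothetical placeholders, since the exact ranges are not given in the text.

```python
import random

# Hypothetical search space (the paper does not specify the ranges here):
LEARNING_RATES = [3e-5, 1e-4, 3e-4, 1e-3]
BINS = [7, 11, 15]          # number of atomic actions per dimension
SEEDS = list(range(10))

def sample_config(policy="discrete", rng=random):
    """Draw one hyper-parameter configuration for the sensitivity study."""
    config = {
        "lr": rng.choice(LEARNING_RATES),
        "seed": rng.choice(SEEDS),
    }
    if policy == "discrete":
        # Only the discrete policy has a number-of-bins hyper-parameter.
        config["bins"] = rng.choice(BINS)
    return config

# 30 configurations per policy class, as in the quantile plots of Appendix C.
configs = [sample_config() for _ in range(30)]
```

Each configuration is then trained to completion and the 30 final scores are summarized in a quantile plot.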

6 Conclusion

We have carried out a systematic evaluation of action discretization for continuous control across baseline on-policy algorithms and benchmark tasks. Though the idea is simple, we find that it greatly improves the performance of baseline algorithms, especially on high-dimensional tasks with complex dynamics. We also show that the ordinal architecture, which encodes the natural ordering of the discretized actions into the discrete distribution, can further boost the performance of baseline algorithms.

7 Acknowledgements

This work was supported by an Amazon Research Award (2017). The authors also acknowledge the computational resources provided by Amazon Web Services (AWS).


Appendix A Hyper-parameters

All implementations of algorithms (PPO, TRPO, ACKTR) are based on OpenAI baselines (Dhariwal et al., 2017). All environments are based on OpenAI gym (Brockman et al., 2016), rllab (Duan et al., 2016) and Roboschool (Schulman et al., 2017a).

We present the details of each policy class as follows.

Gaussian Policy.

Factorized Gaussian policies take the form of a normal distribution over actions, with the mean given by a two-layer neural network (the number of hidden units per layer differs between PPO/ACKTR and TRPO). The covariance matrix is diagonal, with each diagonal entry a single variable shared across all states. These hyper-parameter settings are the defaults in baselines.

Discrete Policy.

Discrete policies are represented as a product of per-dimension factors, each a categorical distribution over K atomic actions. The K atomic actions in each dimension are evenly spaced over that dimension's continuous action range. In each action dimension, the categorical distribution is specified by a set of logits (one per atomic action in each state), parameterized by a neural network with the same architecture as the factorized Gaussian above.
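As a concrete sketch of the factorized discrete policy, the following samples each action dimension from its own categorical distribution and maps the sampled indices back to evenly spaced atomic actions. The bin count K, the action dimension, and the action range [-1, 1] are illustrative assumptions, and the logits here are random stand-ins for network outputs.

```python
import numpy as np

K = 11          # number of atomic actions per dimension (assumption)
ACTION_DIM = 3  # number of action dimensions (assumption)
ATOMS = np.linspace(-1.0, 1.0, K)  # evenly spaced atomic actions

def softmax(logits):
    # Numerically stable softmax along the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_action(logits, rng):
    """logits: array of shape (ACTION_DIM, K), one categorical per dimension."""
    probs = softmax(logits)
    # Factorized distribution: each dimension is sampled independently.
    idx = [rng.choice(K, p=probs[d]) for d in range(ACTION_DIM)]
    return ATOMS[idx]  # joint continuous action assembled per dimension

rng = np.random.default_rng(0)
action = sample_action(rng.standard_normal((ACTION_DIM, K)), rng)
```

Because the distribution factorizes, the policy needs only ACTION_DIM × K logits rather than K^ACTION_DIM joint probabilities.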

Ordinal Policy.

Ordinal policies augment discrete policies with the ordinal parameterization (4). An ordinal policy has exactly the same number of parameters as the corresponding discrete policy.
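One stick-breaking-style ordinal transformation, consistent with the description here and with ordinal-regression constructions, is sketched below; whether it matches parameterization (4) term-for-term is an assumption. It maps the network's raw outputs to logits in which adjacent atomic actions share terms, encoding the natural ordering without adding any parameters.

```python
import numpy as np

def ordinal_logits(x):
    """Transform raw outputs x_1..x_K into ordinal logits.

    L_k = sum_{j<=k} log sigmoid(x_j) + sum_{j>k} log(1 - sigmoid(x_j)),
    so neighboring atomic actions receive correlated probabilities --
    the inductive bias described in the text. (Assumed form; see eq. (4).)
    """
    log_s = -np.logaddexp(0.0, -x)    # log sigmoid(x_j), numerically stable
    log_1ms = -np.logaddexp(0.0, x)   # log(1 - sigmoid(x_j))
    K = len(x)
    return np.array([log_s[: k + 1].sum() + log_1ms[k + 1 :].sum()
                     for k in range(K)])

x = np.zeros(5)              # identical raw outputs -> uniform distribution
logits = ordinal_logits(x)   # note: no extra parameters are introduced
```

A softmax over these logits then gives the per-dimension categorical distribution, exactly as for the plain discrete policy.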

Gaussian Policy (with transformed mean).

The architecture is the same as above, but the final layer is followed by a transformation that constrains the mean to the bounded action range.
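The bounding transformation is not spelled out above; a common choice (an assumption here) is a tanh squashed and rescaled to the action range:

```python
import numpy as np

# Hypothetical bounded-mean head: squash the raw network output with tanh
# and rescale to [low, high]. The use of tanh is an assumption; the text
# only states that the mean is constrained to the action range.
def bounded_mean(raw_output, low=-1.0, high=1.0):
    return low + 0.5 * (np.tanh(raw_output) + 1.0) * (high - low)

mu = bounded_mean(np.array([-10.0, 0.0, 10.0]))
```

Unlike the Beta parameterization below, the Gaussian still places some probability mass outside the feasible range, which is typically clipped by the environment.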

Beta Policy.

A Beta policy models each action dimension with a Beta distribution. Here, the two shape parameters are output by a two-layer neural network with a softplus at the end, following (Chou et al., 2017). Actions sampled from this distribution have a strictly finite support. We notice that this parameterization introduces potential instability during optimization: for example, converging on policies that sample actions at the boundary of the support requires extreme values of the shape parameters, which can be numerically unstable. We also observe such instability in practice: when the trust region size is large, training can easily terminate prematurely due to numerical errors, while reducing the trust region size stabilizes training but degrades performance. The results for Beta policy in the main paper are obtained under a trust region size (for TRPO) and learning rate (for PPO) chosen such that the policy achieves a fairly fast rate of learning (compared to other policy classes) at the cost of more numerical errors (which lead to premature termination).
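A minimal sketch of such a Beta policy head follows. The "+ 1" offset after the softplus (keeping both shape parameters above 1, i.e. a unimodal Beta) is an assumption about the exact parameterization in (Chou et al., 2017); the rescaling maps the (0, 1) support onto a hypothetical action range.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably.
    return np.logaddexp(0.0, x)

def beta_params(raw_alpha, raw_beta):
    # Softplus keeps the shape parameters positive; the +1 offset
    # (assumed) keeps the density unimodal.
    return softplus(raw_alpha) + 1.0, softplus(raw_beta) + 1.0

def sample_action(raw_alpha, raw_beta, low, high, rng):
    alpha, beta = beta_params(raw_alpha, raw_beta)
    u = rng.beta(alpha, beta)        # support is (0, 1) ...
    return low + u * (high - low)    # ... rescaled to the action range

rng = np.random.default_rng(0)
a = sample_action(0.5, -0.5, low=-2.0, high=2.0, rng=rng)
```

The instability discussed above is visible in this form: putting mass at the boundary u = 0 or u = 1 requires pushing a shape parameter toward its extreme, which stresses the optimizer numerically.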

Others Hyper-parameters.

Value functions are two-layer neural networks, with the number of hidden units per layer chosen separately for PPO/ACKTR and for TRPO. For PPO, the learning rate is tuned over a small grid; for TRPO and ACKTR, the KL constraint parameter is tuned over a small grid. All other hyper-parameters are the defaults in the baselines implementations.

Appendix B Effects of the Number of Atomic Actions

Variance of Policy Gradients.

We analyze the variance of policy gradients when the continuous action space is discretized. For ease of analysis, we assume the following policy architecture: the policy is a neural network (or any differentiable function) that takes state s as input and, through multiple layers of transformation, encodes it into a hidden vector h(s). For the i-th dimension of the action space and the j-th atomic action in that dimension, we output a logit L_ij via its own parameters. For any dimension i, the K logits are combined by a softmax to compute the probability of choosing the j-th atomic action. As noted, the number of model parameters scales linearly with K.

To compare the variance of policy gradients across models with varying K, we analyze the gradients of the parameters that encode the state s into the hidden vector h(s); such parameters are shared by all models. For simplicity, we consider a one-step bandit problem whose single action dimension is discretized into K atomic actions. The instant reward for every action is a fixed constant r. Let p_j be the probability of taking the j-th action a_j. We also assume that upon initialization the policy has very high entropy, so that p_j ≈ 1/K. The policy gradient estimator is the score-function estimator g = r ∇ log p_a, with a sampled from the categorical distribution (p_1, ..., p_K).

Under this setting, the expected policy gradient is E[g] = r Σ_j p_j ∇ log p_j = r ∇ Σ_j p_j = 0, and the variance is

Var(g) = r^2 Σ_j p_j ||∇ log p_j||^2 ≈ (r^2 / K) Σ_j ||∇L_j − (1/K) Σ_k ∇L_k||^2,

where the approximation comes from replacing p_j by 1/K. Notice that ∇L_j itself does not depend on K, since each logit L_j has an independent dependency on its own parameters. Under conventional neural network initializations (all weight and bias matrices are initialized independently), the logit gradients

∇L_1, ..., ∇L_K

are i.i.d. random variables, with their randomness stemming from the random initialization of neural network parameters. Denoting by E_init the expectation w.r.t. neural network initializations, the expected variance is

E_init[Var(g)] ≈ r^2 (1 − 1/K) σ^2,

where σ^2 is the variance of the logit gradients.

(a) Gradient Variance
(b) Control capacity
Figure 5: Analyzing discrete policy: (a) Variance of policy gradients on multiple tasks upon initialization, compared with the curve suggested by the simplified theoretical analysis above. The horizontal axis is the number of bins K and the vertical axis is normalized. The variance saturates quickly as K grows. (b) Control capacity as a function of the number of bins K. When K is small, the control capacity is small and the policy cannot achieve good performance. When K is large, the control capacity increases but training becomes harder, which leads to a potential drop in performance.

Though the above derivation makes multiple restrictive assumptions, we find that it largely matches the results in more complex scenarios. For multiple MuJoCo tasks, we compute the empirical variance of policy gradients upon random initialization of network parameters. In Figure 5(a) we compare the empirical variance against the predicted variance, normalized to a common scale. With the same hyper-parameters (including the batch size for each update), policy gradients have larger variance for models with finer discretization (larger K) and are harder to optimize.
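The simplified bandit analysis can also be checked numerically. The sketch below treats the logit gradients as i.i.d. normal draws (the model of the derivation, not measured gradients from a real network) and estimates the expected gradient variance for several K, which should follow r^2 σ^2 (1 − 1/K) and saturate as K grows.

```python
import numpy as np

# Monte Carlo check of the bandit analysis: at a high-entropy (uniform)
# initialization, the policy-gradient variance is proportional to
# (1 - 1/K). The i.i.d. N(0, sigma^2) model for the per-action logit
# gradients is the simplifying assumption from the derivation.
def expected_grad_variance(K, sigma=1.0, r=1.0, n_init=20000, seed=0):
    rng = np.random.default_rng(seed)
    # i.i.d. logit gradients g_1..g_K, one scalar per atomic action,
    # drawn once per simulated network initialization.
    g = rng.normal(0.0, sigma, size=(n_init, K))
    centered = g - g.mean(axis=1, keepdims=True)  # grad log p_j at p_j = 1/K
    # Var(g_hat) = (r^2 / K) * sum_j ||grad log p_j||^2, averaged over inits.
    return (r ** 2 * (centered ** 2).sum(axis=1) / K).mean()

variances = {K: expected_grad_variance(K) for K in (2, 4, 8, 16, 32)}
# Prediction: variance tracks r^2 * sigma^2 * (1 - 1/K).
```

The estimated values climb quickly and then flatten, matching the saturation seen in Figure 5(a).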

Combined Effects of Control Capacity and Variance.

When K increases, the policy has larger capacity for control. However, as analyzed above, the policy-gradient variance also increases with K, which makes policy optimization with SGD-based methods more difficult.

The combined effects can be observed in Figure 5(b). We train policies with various K for a fixed number of time steps and evaluate their performance at the end of training. We find that the best performance is obtained at intermediate values of K. When K is small, the performance degrades drastically due to the lack of control capacity. When K is large, the performance only slightly degrades: this might be because the variance of the policy gradient almost saturates for large K, as shown in Figure 5.

Model Parameters and Training Costs.

Both the number of model parameters and the training cost scale linearly with K. In Table 3 we present the training costs: we train discrete policies (on Reacher-v1) with various K for a fixed number of time steps and record the wall time. The results are standardized such that Gaussian policy is 100%.

Action size
Percentage 116% 120% 143% 240%
Table 3: Computational costs measured in wall time on the Reacher task. PPO with Gaussian policy is normalized to be 100% and we report the normalized time for discrete policy. Each number is averaged over 3 random seeds. The increase in costs is roughly linear in K but can be more severe when the action dimension increases.

Appendix C Additional Experiments

C.1 PPO

We show results for PPO on simpler MuJoCo tasks in Figure 6. In such tasks, discrete policy does not necessarily outperform factorized Gaussian.

(a) Reacher
(b) Swimmer
(c) Inverted Pendulum
(d) Double Pendulum
Figure 6: MuJoCo Benchmarks: learning curves of PPO on OpenAI gym MuJoCo locomotion tasks. Each curve corresponds to a different policy architecture (Gaussian or discrete policy with varying number of bins K).
(a) Bipedal Walker
(b) Lunar Lander
Figure 7: Box2D Benchmarks: learning curves of PPO on OpenAI gym Box2D locomotion tasks. Each curve corresponds to a different policy architecture (Gaussian or discrete policy with varying number of bins ).

C.2 TRPO

We show results for TRPO on simpler MuJoCo tasks in Figure 8. Even on simple tasks, discrete policy can still significantly outperform Gaussian ((a) Reacher and (d) Double Pendulum).

(a) Reacher
(b) Swimmer
(c) Inverted Pendulum
(d) Double Pendulum
Figure 8: MuJoCo Benchmarks: learning curves of TRPO on MuJoCo locomotion tasks. Each curve is averaged over 6 random seeds. Each curve corresponds to a different policy architecture (Gaussian or discrete policy with varying number of bins K). The vertical axis is the cumulative reward and the horizontal axis is the number of time steps.
(a) Bipedal Walker
(b) Lunar Lander
Figure 9: Box2D Benchmarks: learning curves of TRPO on OpenAI gym Box2D locomotion tasks. Each curve is averaged over 5 random seeds and shows performance. Each curve corresponds to a different policy architecture (Gaussian or discrete policy with varying number of bins ). Vertical axis is the cumulative rewards and horizontal axis is the number of time steps.

C.3 ACKTR

We show results for ACKTR on a set of MuJoCo and rllab tasks in Figure 10. For tasks with relatively simple dynamics, the performance gain of discrete policy is not significant ((a)(b)(c)). However, on Humanoid tasks, discrete policy does achieve significant performance gains over Gaussian ((d)(e)).

(a) Reacher
(b) Swimmer
(c) HalfCheetah
(d) Humanoid (R)
(e) Sim. Humanoid (R)
(f) Humanoid
Figure 10: MuJoCo Benchmarks: learning curves of ACKTR on OpenAI gym MuJoCo locomotion tasks. Each curve is averaged over 5 random seeds and shows performance. Each curve corresponds to a different policy architecture (Gaussian or discrete policy with varying number of bins ). Vertical axis is the cumulative rewards and horizontal axis is the number of time steps. Discrete policy significantly outperforms Gaussian on Humanoid tasks. Tasks with (R) are from rllab.

C.4 Sensitivity to Hyper-parameters

We present the sensitivity results in Figure 11 below.

(a) Roboschool Reacher
(b) Hopper
(c) HalfCheetah
(d) Roboschool Ant
(e) Roboschool Walker
(f) Humanoid
(g) Sim. Humanoid (R)
(h) Humanoid (R)
Figure 11: PPO sensitivity: quantile plots of final performance on benchmark tasks in OpenAI MuJoCo, rllab and Roboschool. In each plot, 30 different hyper-parameter configurations are drawn for each policy (Gaussian vs. discrete). Reacher and Hopper are trained for fewer steps than the other tasks. Tasks with (R) are from rllab.

C.5 Comparison with Gaussian Policy with Big Networks

In our implementation, discrete/ordinal policies have more parameters than the Gaussian policy. A natural question is whether the gains in policy optimization are due to a bigger network. To test this, we train Gaussian policies with larger networks: two-layer neural networks with more hidden units per layer. In Table 4 and Table 5, we find that for Gaussian policy, the bigger network does not perform as well as the smaller default network. Since the Gaussian policy with the bigger network has more parameters than the discrete policy, this validates the claim that the performance gains of discrete policy are not (only) due to increased parameter count. Table 5 shows results for PPO and Table 4 for TRPO.

Task Gaussian (big) Gaussian (small)
Sim. Human. (L)
Humanoid (L)
Table 4: Comparison of TRPO Gaussian policy with networks of different sizes. Both networks have two layers; the big network has more hidden units per layer than the small network. Small networks generally perform better than big networks. We show average ± std cumulative rewards at the end of training.

Task Gaussian (big) Gaussian (small)
Sim. Human. (L)
Humanoid (L)
Table 5: Comparison of PPO Gaussian policy with networks of different sizes. Both networks have two layers; the big network has more hidden units per layer than the small network. Small networks generally perform better than big networks. We show average ± std cumulative rewards at the end of training.

C.6 Additional Comparison with Beta Policy

Chou et al. (2017) show performance gains of the Beta distribution policy on a limited number of benchmark tasks, most of which are relatively simple (with low-dimensional observation and action spaces). However, they also show performance gains on Humanoid-v1. We compare the results of our Figure 10 with Figure 5(j) in (Chou et al., 2017) (converting their training epochs into time steps): within 10M training steps, discrete/ordinal policy achieves faster progress and reaches a higher score at the end of training than Beta policy. According to (Chou et al., 2017), the Beta distribution can reach better asymptotic performance given more training, and we find that discrete/ordinal policy likewise achieves strong asymptotic performance.

Appendix D Illustration of Benchmark Tasks

We present an illustration of benchmark tasks in Figure 12. All the benchmark tasks are implemented with very efficient physics simulators. All tasks use sensory data as states and actuator controls as actions.

Figure 12: Benchmark Tasks: Illustration of locomotion benchmark tasks in OpenAI gym (Brockman et al., 2016), rllab (Duan et al., 2016) with MuJoCo (Todorov, 2008) as simulation engines (top row) and Roboschool with open source simulation engine (Schulman et al., 2017a) (bottom row). All tasks involve using sensory data as states and actuator controls as actions.