In reinforcement learning (RL), the action spaces of conventional control tasks are usually dichotomized into either discrete or continuous (Brockman et al., 2016). While discrete action spaces are conducive to theoretical analysis, in the context of deep reinforcement learning their applications are largely limited to video game playing or board games (Mnih et al., 2013; Silver et al., 2016). On the other hand, in simulated or real-life robotics control (Levine et al., 2016; Schulman et al., 2015a), the action space is by design continuous. Continuous control typically requires more subtle treatment, since a continuous range of control contains an infinite number of feasible actions and one must resort to parametric functions for a compact representation of distributions over actions.
Can we retain the simplicity of discrete actions when solving continuous control tasks? A straightforward solution is to discretize the continuous action space: we discretize the continuous range of action into a finite set of atomic actions and reduce the original task to a new task with a discrete action space. A common argument against this approach is that for an action space with $m$ dimensions, discretizing $K$ atomic actions per dimension leads to $K^m$ combinations of joint atomic actions, which quickly becomes intractable as $m$ increases. However, a simple fix is to represent the joint distribution over discrete actions as factorized across dimensions, so that the joint policy remains tractable. As prior works have applied such a discretization method in practice (OpenAI, 2018; Jaśkowski et al., 2018), we aim to carry out a systematic study of this straightforward discretization method in simulated environments, and show how it improves upon on-policy optimization baselines.
The paper proceeds as follows. In Section 2, we introduce background on on-policy optimization baselines (e.g. TRPO and PPO) and related work. In Section 3, we introduce the straightforward method of discretizing the action space for continuous control, and analyze the properties of the resulting policies as the number of atomic actions changes. In Section 4, we introduce the stick-breaking parameterization (Khan et al., 2012), an architecture that parameterizes the discrete distribution while encoding the natural ordering between discrete actions. In Section 5, through extensive experiments we show how the discrete/ordinal policy improves upon current on-policy optimization baselines and related prior works, especially on high-dimensional tasks with complex dynamics.
2.1 Markov Decision Process
In the standard formulation of a Markov Decision Process (MDP), an agent starts with an initial state $s_0$ at time $t = 0$. At time $t$, the agent is in state $s_t$, takes an action $a_t$, receives a reward $r_t = r(s_t, a_t)$ and transitions to a next state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. A policy $\pi$ is a mapping from a state to a distribution over actions $\pi(\cdot \mid s_t)$. The expected cumulative reward under a policy $\pi$ is $J(\pi) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$, where $\gamma \in [0, 1)$ is a discount factor. The objective is to search for the optimal policy $\pi^{\ast} = \arg\max_{\pi} J(\pi)$ that achieves the maximum reward. For convenience, under policy $\pi$ we define the action value function $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\big]$ and the value function $V^{\pi}(s) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\big]$. Also define the advantage function $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.
2.2 On-Policy Optimization
In policy optimization, one restricts the policy search to a class of parameterized policies $\pi_{\theta}$, where $\theta \in \Theta$ is the parameter and $\Theta$ the parameter space. A straightforward way to update the policy is to do local search in the parameter space with the policy gradient $\nabla_{\theta} J(\pi_{\theta})$, using the incremental update $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\pi_{\theta})$ with some learning rate $\alpha > 0$. Alternatively, we can motivate the above gradient update with a trust region formulation. In particular, consider the following constrained optimization problem
$$\max_{\theta} \; J(\pi_{\theta}) \quad \text{s.t.} \quad \lVert \theta - \theta_{\text{old}} \rVert_2 \le \epsilon, \tag{1}$$
for some $\epsilon > 0$, where $\theta_{\text{old}}$ is the current parameter. If we do a linear approximation of the objective in (1), $J(\pi_{\theta}) \approx J(\pi_{\theta_{\text{old}}}) + \nabla_{\theta} J(\pi_{\theta_{\text{old}}})^{T} (\theta - \theta_{\text{old}})$, we recover the gradient update by properly choosing $\epsilon$ given $\alpha$.
With such vanilla policy gradient updates, training can suffer from occasional large step sizes and never recover from a bad policy (Schulman et al., 2015a). The following variants were introduced to enable more stable updates.
2.2.1 Trust Region Policy Optimization (TRPO)
Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) applies an information-theoretic constraint between $\pi_{\theta_{\text{old}}}$ and $\pi_{\theta}$ to better capture the geometry on the parameter space induced by the underlying policy distributions. Consider the following trust region formulation
$$\max_{\theta} \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} A^{\pi_{\theta_{\text{old}}}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s}\!\left[ \mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\big) \right] \le \epsilon. \tag{2}$$
The trust region enforced by the KL divergence ensures that the update according to (2) optimizes a lower bound of $J(\pi_{\theta})$, so as to avoid accidentally taking large steps that irreversibly degrade the policy performance during training, as can happen with the vanilla policy gradient (1) (Schulman et al., 2015a). As a practical algorithm, TRPO further approximates the KL constraint in (2) by a quadratic constraint and approximately inverts the constraint matrix with conjugate gradient iterations (Wright and Nocedal, 1999).
2.2.2 Proximal Policy Optimization (PPO)
As a practical algorithm, TRPO requires approximately inverting Fisher matrices by conjugate gradient iterations. Proximal Policy Optimization (PPO) (Schulman et al., 2017a) proposes instead to approximate a trust-region method by clipping the likelihood ratio $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ as $\mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$, where $\mathrm{clip}(x, a, b)$ clips the argument $x$ between $a$ and $b$. Consider the following objective
$$\max_{\theta} \; \mathbb{E}_{t}\!\left[ \min\big( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \big) \right], \tag{3}$$
where $\hat{A}_t$ is an estimate of the advantage.
The intuition behind the objective (3) is that when $\hat{A}_t > 0$, the clipping removes the incentive for the ratio to go above $1 + \epsilon$, with similar effects for the case $\hat{A}_t < 0$. PPO achieves more stable empirical performance than TRPO and involves only relatively cheap first-order optimization.
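As a concrete illustration, the per-sample clipped surrogate in (3) can be sketched in a few lines of numpy (function and variable names are our own, not from any particular code base):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, epsilon=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    return np.minimum(unclipped, clipped)

# With A > 0, pushing the ratio above 1 + eps yields no additional objective;
# with A < 0, pushing it below 1 - eps yields no additional objective.
```

Maximizing this objective therefore removes any incentive to move the ratio outside $[1 - \epsilon, 1 + \epsilon]$, matching the intuition above.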
2.3 Related Work
Vanilla policy gradient updates are typically unstable (Schulman et al., 2015a). Natural policy gradient (Kakade, 2002) applies an information metric, instead of the Euclidean metric on parameters, for updates. To further stabilize training for deep RL, TRPO (Schulman et al., 2015a) places an explicit KL divergence constraint between consecutive policies (Kakade and Langford, 2002) and approximately enforces the constraint with conjugate gradients. PPO (Schulman et al., 2017a) replaces the KL divergence constraint by clipping the likelihood ratio of consecutive policies, which allows for fast first-order updates and achieves state-of-the-art performance among on-policy optimization methods. On top of TRPO, ACKTR (Wu et al., 2017) further improves the natural gradient computation with Kronecker-factored approximations (George et al., 2018). Orthogonal to these algorithmic advances, we demonstrate that policy optimization with discrete/ordinal policies achieves consistent and significant improvements over baseline distributions on all of the above benchmark algorithms.
Orthogonal to the algorithmic procedures for policy updates, one is free to choose any policy class. In a discrete action space, the only choice is a categorical distribution (a discrete distribution). In a continuous action space, the default baseline is a factorized Gaussian (Schulman et al., 2015a, 2017a). Gaussian mixtures, implicit generative models, or even normalizing flows (Rezende and Mohamed, 2015) can be used for more expressive and flexible policy classes (Tang and Agrawal, 2018; Haarnoja et al., 2017, 2018b, 2018a), which achieve performance gains primarily in off-policy learning. One issue with the aforementioned prior works is that they do not disentangle algorithms from distributions; it is therefore unclear whether the benefits result from a better algorithm or a more expressive policy. To make the contributions clear, we make no changes to the on-policy algorithms and show the net effect of how the policy classes improve performance. Motivated by the fact that unbounded distributions can generate infeasible actions, Chou et al. (2017) propose to use the Beta distribution and also show improvements on TRPO. Early prior work (Shariff and Dick) also proposes a truncated Gaussian distribution, but the idea is not tested on deep RL tasks. Complementary to prior works, we propose discrete/ordinal policies as simple yet powerful alternatives to baseline policy classes.
Discrete and Continuous Action Space.
Prior works have exploited the connection between discrete and continuous action spaces. For example, to solve discrete control tasks, Van Hasselt and Wiering (2009) and Dulac-Arnold et al. (2015) leverage the continuity of an underlying continuous action space for generalization across discrete actions. Prior works have also converted continuous control problems into discrete ones; e.g., Pazis and Lagoudakis (2009) convert low-dimensional control problems into discrete problems with binary actions. Surprisingly, few prior works have considered a discrete policy and applied off-the-shelf policy optimization algorithms directly. Recently, OpenAI (2018) and Jaśkowski et al. (2018) applied discrete policies to challenging hand manipulation and humanoid walking tasks, respectively. As a more comprehensive study, we carry out a full evaluation of discrete/ordinal policies on continuous control benchmarks and validate their performance gains.
To overcome the explosion of the joint action space, Metz et al. (2018) resort to sequence prediction, but so far their strategy has only been shown effective on relatively low-dimensional problems (e.g., HalfCheetah). Tavakoli et al. (2018) propose to avoid taking the $\max$ across all joint actions in Q-learning by applying the $\max$ independently across dimensions. Their method is also tested on only a very limited number of tasks. As an alternative, we consider distributions that factorize across dimensions, and we show that this simple technique yields consistent performance gains.
When discrete variables have an internal ordering, it is beneficial to account for this ordering when modeling the categorical distribution. In statistics, such problems are tackled as ordinal regression or classification (Winship and Mare, 1984; Chu and Ghahramani, 2005; Chu and Keerthi, 2007). Few prior works aim to combine ideas of ordinal regression with neural networks. Though Cheng et al. (2008) propose to introduce the ordering as part of the loss function, they do not introduce a proper probabilistic model and need additional techniques during inference. More recently, Khan et al. (2012) motivate the stick-breaking parameterization, a proper probabilistic model which does not introduce additional parameters compared to the original categorical distribution. In our original derivation, we motivate the architecture of Khan et al. (2012) by transforming the loss function of Cheng et al. (2008). We also show that this additional inductive bias greatly boosts the performance of PPO/TRPO.
3 Discretizing Action Space for Continuous Control
Without loss of generality, we assume the action space $\mathcal{A} = [-1, 1]^m$. We discretize each dimension of the action space into $K$ equally spaced atomic actions. The set of atomic actions for any dimension $i$ is $\mathcal{A}_i = \big\{ -1 + \frac{2j}{K-1} : 0 \le j \le K - 1 \big\}$. To overcome the curse of dimensionality, we represent the distribution as factorized across dimensions. In particular, in state $s$, we specify a categorical distribution $\pi_{\theta_i}(\cdot \mid s)$ over the $K$ atomic actions for each dimension $i$, with $\theta_i$ as the parameters of this marginal distribution. Then we define the joint discrete policy $\pi_{\theta}(a \mid s) = \prod_{i=1}^{m} \pi_{\theta_i}(a_i \mid s)$, where $\theta = \{\theta_i\}_{i=1}^{m}$. The factorization allows us to maintain a tractable distribution over joint actions, making it easy to do both sampling and training.
3.1 Network Architecture
The policy network takes the state $s$ as input and, through multiple layers of transformation, encodes the state into a hidden vector $h(s)$. For the $j$th atomic action in the $i$th dimension of the action space, we output a logit $l_{ij} = w_{ij}^{T} h(s)$ with parameters $w_{ij}$. For any dimension $i$, the logits $\{ l_{ij} \}_{j=1}^{K}$ are combined by a softmax to compute the probability of choosing action $j$, $\pi_{\theta_i}(j \mid s) = \frac{\exp(l_{ij})}{\sum_{j'=1}^{K} \exp(l_{ij'})}$. By construction, the network has a fixed-size low-level parameterization (the state encoder), while the output layer has parameters whose size scales linearly with $K$.
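A minimal numpy sketch of this factorized policy head, with the state encoder stubbed out and all names illustrative rather than taken from the paper's implementation:

```python
import numpy as np

def factorized_discrete_policy(logits):
    """logits: array of shape (m, K) -- one row of K logits per action dimension.
    Returns per-dimension categorical probabilities, shape (m, K)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sample_action(probs, rng):
    """Sample one atomic-action index per dimension, then map it to [-1, 1]."""
    m, K = probs.shape
    idx = np.array([rng.choice(K, p=probs[i]) for i in range(m)])
    atoms = np.linspace(-1.0, 1.0, K)  # K equally spaced atomic actions
    return atoms[idx], idx

# Example: m = 3 action dimensions, K = 11 bins, uniform logits from a stub encoder.
rng = np.random.default_rng(0)
probs = factorized_discrete_policy(np.zeros((3, 11)))
action, idx = sample_action(probs, rng)
```

The joint log-probability of an action is simply the sum of the per-dimension categorical log-probabilities, which is what makes both sampling and policy-gradient training tractable.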
3.2 Understanding Discrete Policy for Continuous Control
Here we briefly analyze the empirical properties of the discrete policy.
Discrete Policy is more expressive than Gaussian.
Though a discrete policy is limited to taking atomic actions, in practice it can represent much more flexible distributions than a Gaussian when there is a sufficient number of atomic actions. Intuitively, a discrete policy can represent multi-modal action distributions while a Gaussian is by design unimodal. We illustrate this practical difference with a bandit example in Figure 1. Consider a one-step bandit problem with a one-dimensional continuous action. The reward function over actions is illustrated as the blue curve in Figure 1(a). We train a discrete policy and a Gaussian policy on the environment and show their training curves in (b), with five different random seeds per policy. We see that 4 out of 5 Gaussian policies get stuck at a suboptimal policy while all discrete policies achieve the optimal reward. Figure 1(a) illustrates the density of a trained discrete policy (red) and a suboptimal Gaussian policy (green). The trained discrete policy is bi-modal and automatically captures the bi-modality of the reward function (notice that we did not add entropy regularization to encourage high entropy). The only Gaussian policy that achieves the optimal reward in (b) captures only one mode of the reward function.
For general high-dimensional problems, the reward landscape becomes much more complex. However, this simple example illustrates that the discrete policy can potentially capture the multi-modality of the landscape and achieve better exploration (Haarnoja et al., 2017) to bypass suboptimal policies.
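The point can be illustrated without any training. In the toy computation below (a hypothetical bimodal reward, not the exact bandit of Figure 1), a categorical distribution over atomic actions keeps mass on both reward modes, while a moment-matched Gaussian places most of its density between them:

```python
import numpy as np

# 11 atomic actions on [-1, 1]; a bimodal reward peaked at -0.8 and +0.8.
atoms = np.linspace(-1.0, 1.0, 11)
reward = np.exp(-((atoms - 0.8) ** 2) / 0.02) + np.exp(-((atoms + 0.8) ** 2) / 0.02)

# A categorical policy with logits proportional to the reward keeps both modes.
cat = np.exp(5.0 * reward)
cat /= cat.sum()

# A Gaussian matched to the categorical's mean and std is unimodal, so it
# concentrates density between the two modes rather than on them.
mean = float((cat * atoms).sum())
std = float(np.sqrt((cat * (atoms - mean) ** 2).sum()))

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p_modes = cat[1] + cat[9]   # probability mass at the atoms -0.8 and +0.8
p_mid = cat[5]              # probability mass at the atom 0.0
```

Here nearly all categorical mass sits on the two modes, while the matched Gaussian has higher density at the (low-reward) midpoint than at either mode.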
Effects of the Number of Atomic Actions $K$.
Choosing a proper number of atomic actions per dimension is critical for learning. The trade-off comes in several aspects: (a) Control capacity. When $K$ is small, the discretization is too coarse and the policy does not have enough capacity to achieve good performance. (b) Training difficulty. As $K$ increases, the variance of the policy gradient also increases. We detail the analysis of policy gradient variance in Appendix B. We also present the combined effects of (a) and (b) in Appendix B, where we find that the best performance is obtained at an intermediate value of $K$, and setting $K$ either too large or too small degrades performance. (c) Model parameters and computational costs. Both the number of model parameters and the computational costs grow linearly in $K$. We present detailed computational results in Appendix B.
4 Discrete Policy with Ordinal Architecture
When the continuous action space is discretized, we treat continuous variables as discrete and discard important information about the underlying continuous space. It is therefore desirable to incorporate the notion of continuity when parameterizing distributions over discrete actions.
4.2 Ordinal Distribution Network Architecture
For simplicity, we discuss the discrete distribution over only one action dimension. Recall that a typical feed-forward architecture producing a discrete distribution over $K$ classes outputs logits $l_1, \ldots, l_K$ at the last layer and derives the probabilities via a softmax, $\pi(j \mid s) \propto \exp(l_j)$. In the ordinal architecture, we retain these logits and first transform them via a sigmoid, $\sigma(l_j)$. Then we compute the final logits as
$$L_j = \sum_{k=1}^{j} \log \sigma(l_k) + \sum_{k=j+1}^{K} \log\big(1 - \sigma(l_k)\big), \tag{4}$$
and derive the final output probabilities via a softmax, $\pi(j \mid s) \propto \exp(L_j)$. Actions are sampled according to this discrete distribution.
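A small numpy sketch of this ordinal transformation, assuming the stick-breaking form $L_j = \sum_{k \le j} \log \sigma(l_k) + \sum_{k > j} \log(1 - \sigma(l_k))$ (variable names are ours):

```python
import numpy as np

def ordinal_logits(raw_logits):
    """Transform raw logits l_1..l_K into ordinal logits
    L_j = sum_{k<=j} log sigmoid(l_k) + sum_{k>j} log(1 - sigmoid(l_k))."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(raw_logits, dtype=float)))
    log_s = np.log(s)
    log_1ms = np.log1p(-s)  # numerically stable log(1 - sigmoid)
    K = len(s)
    return np.array([log_s[: j + 1].sum() + log_1ms[j + 1:].sum() for j in range(K)])

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# The transformed logits still define an ordinary categorical distribution,
# but each L_j now depends on all raw logits, encoding the class ordering.
probs = softmax(ordinal_logits([2.0, 0.0, -2.0]))
```

Note that no extra parameters are introduced: the same $K$ raw logits are reused, only coupled together before the softmax.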
This architecture is very similar to the stick-breaking parameterization introduced by Khan et al. (2012), who argue that such a parameterization is beneficial when samples drawn from class $j$ can be easily separated from samples drawn from all classes $j' > j$. In our original derivation, we motivate the ordinal architecture from the loss function of Cheng et al. (2008), and we show that the intuition behind (4) is clearer from this perspective. We illustrate the intuition below with a $K$-way classification problem, where the classes are internally ordered as $1 \prec 2 \prec \cdots \prec K$.
Intuition behind (4).
For clarity, let $p_k = \sigma(l_k)$ such that $p_k \in (0, 1)$, and define the stable cross entropy $H(t, p) = -\sum_{k=1}^{K} \big( t_k \log(p_k + \delta) + (1 - t_k) \log(1 - p_k + \delta) \big)$ with a very small $\delta > 0$ to avoid numerical singularity. For a sample from class $j$, the $K$-way classification loss based on (4) is $H(t_j, p)$, where the predicted vector is $p = [p_1, \ldots, p_K]$ and the target vector $t_j$ has its first $j$ entries equal to $1$ and the others $0$. The intuition becomes clear when we interpret $t_j$ as a continuous encoding of class $j$ (instead of the one-hot vector) and $p$ as an intermediate vector from which we draw the final prediction. We see that continuity between classes is introduced through the loss function; for example, $\lVert t_j - t_{j+1} \rVert_1 = 1 < \lVert t_j - t_{j+2} \rVert_1 = 2$, i.e., the discrepancy between classes $j$ and $j+1$ is strictly smaller than that between $j$ and $j+2$. On the contrary, such information cannot be introduced by one-hot encoding: letting $e_j$ be the one-hot vector for class $j$, we always have $\lVert e_j - e_{j'} \rVert_1 = 2$ for any $j \neq j'$, i.e., we introduce no discrepancy relationship between classes. While Cheng et al. (2008) introduce such continuous encoding techniques, they do not propose a proper probabilistic model and require additional techniques at inference time to make predictions. Here, the ordinal architecture (4) defines a proper probabilistic model that implicitly introduces the internal ordering between classes through the parameterization, while maintaining all the probabilistic properties of discrete distributions.
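The discrepancy argument is easy to check numerically, using cumulative target vectors $t_j$ (first $j$ entries one) versus one-hot vectors (illustrative code, not from the paper):

```python
import numpy as np

def cumulative_target(j, K):
    """Cumulative encoding of class j (1-indexed): first j entries 1, rest 0."""
    t = np.zeros(K)
    t[:j] = 1.0
    return t

def one_hot(j, K):
    e = np.zeros(K)
    e[j - 1] = 1.0
    return e

K = 4
# Cumulative encoding: adjacent classes are closer than distant ones ...
d12 = np.abs(cumulative_target(1, K) - cumulative_target(2, K)).sum()  # = 1
d13 = np.abs(cumulative_target(1, K) - cumulative_target(3, K)).sum()  # = 2
# ... while one-hot encoding gives the same distance between any two classes.
e12 = np.abs(one_hot(1, K) - one_hot(2, K)).sum()  # = 2
e13 = np.abs(one_hot(1, K) - one_hot(3, K)).sum()  # = 2
```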
In summary, the ordinal architecture (4) introduces additional dependencies between logits, which implicitly inject information about the class ordering. In practice, we find that this generally brings significant performance gains during policy optimization.
5 Experiments
Our experiments aim to address the following questions: (a) Does the discrete policy improve the performance of baseline algorithms on benchmark continuous control tasks? (b) Does the ordinal architecture further improve upon the discrete policy? (c) How sensitive is the performance to hyper-parameters, particularly to the number of bins per action dimension?
For clarity, we henceforth refer to the policy with the discrete distribution as the discrete policy, and the policy with the ordinal architecture as the ordinal policy. To address (a), we carry out comparisons in two parts: (1) We compare the discrete policy (with varying $K$) against the Gaussian policy over baseline algorithms (PPO, TRPO and ACKTR), evaluated on benchmark tasks in gym MuJoCo (Brockman et al., 2016; Todorov, 2008), rllab (Duan et al., 2016), Roboschool and Box2D. Here we pay special attention to the Gaussian policy because it is the default policy class implemented in popular code bases (Dhariwal et al., 2017). (2) We compare with other architectural alternatives, either straightforward architectural variants or those suggested in prior works (Chou et al., 2017). We evaluate their performance on high-dimensional tasks with complex dynamics (e.g., Walker, Ant and Humanoid). All the above results are reported in Section 5.1. To address (b), we compare the discrete policy with the ordinal policy under PPO in Section 5.2 (results for TRPO are also in Section 5.1). To address (c), we randomly sample hyper-parameters for the Gaussian policy and the discrete policy and compare their quantile plots in Section 5.3 and Appendix C.
As we aim to study the net effect of the discrete/ordinal policy with on-policy optimization algorithms, we make minimal modifications to the PPO/TRPO/ACKTR algorithms as originally implemented in OpenAI baselines (Dhariwal et al., 2017). We list all hyper-parameter settings in Appendix A.
5.1 Benchmark Performance
For plots, we show the learning curves of different policy classes trained for a fixed number of time steps: the x-axis shows the time steps while the y-axis shows the cumulative reward. Each curve shows the average performance, with the standard deviation shown in shaded areas. Results in Figures 2 and 4 are averaged over 5 random seeds and those in Figure 3 over 2 random seeds. In Tables 1 and 2, we train all policies for a fixed number of time steps and show the average ± standard deviation of the cumulative rewards obtained in the last 10 training iterations.
PPO/TRPO - Comparison with Gaussian Baselines.
We evaluate PPO/TRPO with the Gaussian policy against PPO/TRPO with the discrete policy on the full suite of MuJoCo control tasks and display all results in Figure 2. For PPO, on tasks with relatively simple dynamics, the discrete policy does not necessarily enjoy significant advantages over the Gaussian policy. For example, the rate of learning of the discrete policy is comparable to the Gaussian on HalfCheetah (Figure 2(a)) and even slightly lower on Ant (Figure 2(b)). However, on high-dimensional tasks with very complex dynamics (e.g., Humanoid, Figure 2(d)-(f)), the discrete policy significantly outperforms the Gaussian policy. For TRPO, the performance gains from the discrete policy are also very consistent and significant.
We also evaluate the algorithms on the Roboschool Humanoid tasks, as shown in Figure 3. We see that the discrete policy achieves better results than the Gaussian across all tasks and both algorithms. The performance gains are most significant with TRPO (Figure 3(b)(d)(f)), where the Gaussian policy barely makes progress during training while the discrete policy has very stable learning curves. For completeness, we also evaluate PPO/TRPO with discrete vs. Gaussian policies on Box2D tasks and again find significant performance gains. Due to space limits, we present the Box2D results in Appendix C.
By construction, when the discrete policy and the Gaussian policy share the encoding architecture of Section 3, the discrete policy has many more parameters than the Gaussian policy. A critical question is whether the performance gains can be achieved by simply increasing the number of parameters. We find that when we train a Gaussian policy with many more parameters (e.g., more hidden units per layer), the policy does not perform as well. This validates our speculation that the performance gains result from a more carefully designed distribution class rather than from larger networks.
PPO - Comparison with Off-Policy Baselines.
To further illustrate the strength of PPO with the discrete policy on high-dimensional tasks with very complex dynamics, we compare it against state-of-the-art off-policy algorithms on Humanoid tasks (Humanoid-v1 and Humanoid from rllab), both of which have very high-dimensional observation and action spaces. Such algorithms include DDPG (Lillicrap et al., 2015), SQL (Haarnoja et al., 2017), SAC (Haarnoja et al., 2018b) and TD3 (Fujimoto et al., 2018), among which SAC and TD3 are known to achieve significantly better performance on MuJoCo benchmark tasks than other algorithms. Off-policy algorithms reuse samples and can potentially achieve orders of magnitude better sample efficiency than on-policy algorithms. For example, it has been commonly observed in prior works (Haarnoja et al., 2018b, a; Fujimoto et al., 2018) that SAC/TD3 can achieve state-of-the-art performance on most benchmark control tasks within relatively few training steps, on the condition that off-policy samples are heavily replayed. In general, on-policy algorithms cannot match this speed of convergence because samples are quickly discarded. However, for highly complex tasks such as Humanoid, even off-policy algorithms take many more samples to learn, potentially because off-policy learning becomes more unstable and off-policy samples are less informative. In Table 1, we record the final performance of off-policy algorithms directly from the figures in (Haarnoja et al., 2018b), following the practice of (Mania et al., 2018). The final performance of the PPO algorithms is computed as the average ± standard deviation of the returns in the last 10 training iterations across 5 random seeds. All algorithms are trained for a fixed number of time steps. We observe in Table 1 that PPO + discrete (ordinal) actions achieves comparable or even better results than the off-policy baselines. This shows that for generally complex applications, PPO with discrete/ordinal policies is still as competitive as the state-of-the-art off-policy methods.
PPO/TRPO - Comparison with Alternative Architectures.
We also compare with straightforward architectural alternatives: a Gaussian with a tanh non-linearity as the output layer, and the Beta distribution (Chou et al., 2017). The primary motivation for these architectures is that they naturally bound the sampled actions (or action means) to the feasible range ($[-1, 1]$ for gym tasks). By construction, our proposed discrete/ordinal policies also bound the sampled actions within the feasible range. In Table 2, we show results for PPO/TRPO where we select the best result over $K$ for the discrete/ordinal policy. We make several observations from the results in Table 2: (1) Bounding actions (or action means) to the feasible range does not consistently bring performance gains, as the Gaussian with tanh and the Beta distribution do not consistently outperform the plain Gaussian. This is potentially because the parameterizations that bound the actions (or action means) also introduce challenges for optimization. For example, the Gaussian with tanh bounds the action mean as $\mu = \tanh(\cdot)$; for $\mu$ to approach the boundary values $\pm 1$, the pre-activation must reach extreme values, which is hard to achieve with SGD-based methods. (2) The discrete/ordinal policies achieve significantly better results consistently across most tasks. Combining (1) and (2), we argue that the performance gains of discrete/ordinal policies are due to reasons beyond a bounded action distribution.
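The saturation argument about tanh-bounded means can be made concrete through the derivative $\tanh'(x) = 1 - \tanh^2(x)$, which vanishes as the pre-activation grows (a generic illustration, not tied to any specific implementation):

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

# At the center of the range the gradient is 1; near saturation it is tiny,
# so SGD makes very slow progress pushing the action mean toward +-1.
grads = {x: float(tanh_grad(x)) for x in (0.0, 2.0, 5.0)}
```

For instance, the gradient at a pre-activation of 5 is below $10^{-3}$, so gradient signals that should drive the mean toward the boundary are heavily attenuated.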
Here we discuss the results for the Beta distribution. In our implementation, we find that training with the Beta distribution tends to generate numerical errors when the updates are aggressive (e.g., a large PPO learning rate or TRPO trust region size). More conservative updates (e.g., a small PPO learning rate or TRPO trust region size) reduce numerical errors but also greatly degrade learning performance. We suspect that this is because the Beta distribution parameterization (Appendix A and Chou et al. (2017)) is numerically unstable, and we discuss the potential reason in Appendix A. In Table 2, the results for the Beta distribution are recorded as the performance of the last 10 iterations before training terminates (potentially prematurely due to numerical errors). The potential advantages of the Beta distribution are largely offset by the unstable training. We show more results in Appendix C.
Table 1: Final performance of DDPG, SQL, SAC, TD3, and PPO with Gaussian, discrete and ordinal policies on the Humanoid tasks, including Sim. Humanoid (rllab). (Numeric entries omitted.)
ACKTR - Comparison with Gaussian Baselines.
We show results for ACKTR in Appendix C. We observe that for tasks with complex dynamics, discrete policy still achieves performance gains over its Gaussian policy counterpart.
5.2 Discrete Policy vs. Ordinal Policy
In Figure 4, we evaluate PPO with the discrete policy and PPO with the ordinal policy on high-dimensional tasks. Across all presented tasks, the ordinal policy achieves significantly better performance than the discrete policy, both in terms of asymptotic performance and speed of convergence. Similar results are also presented in Table 1, where we show that PPO with the ordinal policy achieves performance comparable to efficient off-policy algorithms on Humanoid tasks. We also compare these two architectures when trained with TRPO; the comparison of the trained policies can be found in Table 2. For most tasks, we find that the ordinal policy still significantly improves upon the discrete policy.
Summarizing the results for PPO/TRPO, we conclude that the ordinal architecture introduces a useful inductive bias that improves policy optimization. We note that the stick-breaking parameterization (4) is not the only parameterization that leverages the natural ordering between discrete classes. We leave as promising future work how to better exploit task-specific orderings between classes.
5.3 Sensitivity to Hyper-parameters
Here we evaluate the policy classes' sensitivity to more general hyper-parameters, such as the learning rate, the number of bins $K$ per dimension, and random seeds. We present the results for PPO in Appendix C. For PPO with the Gaussian policy, we uniformly sample the learning rate and the random seed. For PPO with discrete actions, we further uniformly sample $K$. For each benchmark task, we sample multiple hyper-parameter configurations and show the quantile plot of the final performance. As seen in Appendix C, PPO with discrete actions is generally more robust to such hyper-parameters than with the Gaussian.
We have carried out a systematic evaluation of action discretization for continuous control across baseline on-policy algorithms and benchmark tasks. Though the idea is simple, we find that it greatly improves the performance of baseline algorithms, especially on high-dimensional tasks with complex dynamics. We also show that the ordinal architecture, which encodes the natural ordering into the discrete distribution, can further boost the performance of baseline algorithms.
This work was supported by an Amazon Research Award (2017). The authors also acknowledge the computational resources provided by Amazon Web Services (AWS).
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
- Cheng et al. (2008) Cheng, J., Wang, Z., and Pollastri, G. (2008). A neural network approach to ordinal regression. In Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 1279–1284. IEEE.
- Chou et al. (2017) Chou, P.-W., Maturana, D., and Scherer, S. (2017). Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International Conference on Machine Learning, pages 834–843.
- Chu and Ghahramani (2005) Chu, W. and Ghahramani, Z. (2005). Gaussian processes for ordinal regression. Journal of machine learning research, 6(Jul):1019–1041.
- Chu and Keerthi (2007) Chu, W. and Keerthi, S. S. (2007). Support vector ordinal regression. Neural computation, 19(3):792–815.
- Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. (2017). Openai baselines. https://github.com/openai/baselines.
- Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338.
- Dulac-Arnold et al. (2015) Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. (2015). Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.
- Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
- George et al. (2018) George, T., Laurent, C., Bouthillier, X., Ballas, N., and Vicent, P. (2018). Fast approximate natural gradient descent in a kronecker-factored eigenbasis. arXiv preprint arXiv:1806.03884.
- Haarnoja et al. (2018a) Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. (2018a). Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808.
- Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
- Haarnoja et al. (2018b) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018b). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Jaśkowski et al. (2018) Jaśkowski, W., Lykkebø, O. R., Toklu, N. E., Trifterer, F., Buk, Z., Koutník, J., and Gomez, F. (2018). Reinforcement learning to run… fast. In The NIPS’17 Competition: Building Intelligent Systems, pages 155–167. Springer.
- Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274.
- Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538.
- Khan et al. (2012) Khan, M., Mohamed, S., Marlin, B., and Murphy, K. (2012). A stick-breaking likelihood for categorical data analysis with latent gaussian models. In Artificial Intelligence and Statistics, pages 610–618.
- Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Mania et al. (2018) Mania, H., Guy, A., and Recht, B. (2018). Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.
- Martens and Grosse (2015) Martens, J. and Grosse, R. (2015). Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417.
- Metz et al. (2018) Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. (2018). Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035.
- Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- OpenAI (2018) OpenAI (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
- Pazis and Lagoudakis (2009) Pazis, J. and Lagoudakis, M. G. (2009). Binary action search for learning continuous-action control policies. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 793–800. ACM.
- Rezende and Mohamed (2015) Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
- Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
- Schulman et al. (2015b) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015b). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
- Schulman et al. (2017a) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017a). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schulman et al. (2017b) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Shariff and Dick. Shariff, R. and Dick, T. Lunar lander: A continuous-action case study for policy-gradient actor-critic algorithms.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Tang and Agrawal (2018) Tang, Y. and Agrawal, S. (2018). Implicit policy for reinforcement learning. arXiv preprint arXiv:1806.06798.
- Tavakoli et al. (2018) Tavakoli, A., Pardo, F., and Kormushev, P. (2018). Action branching architectures for deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Todorov (2008) Todorov, E. (2008). General duality between optimal control and estimation. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pages 4286–4292. IEEE.
- Van Hasselt and Wiering (2009) Van Hasselt, H. and Wiering, M. A. (2009). Using continuous action spaces to solve discrete problems. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 1149–1156. IEEE.
- Winship and Mare (1984) Winship, C. and Mare, R. D. (1984). Regression models with ordinal variables. American Sociological Review, pages 512–525.
- Wright and Nocedal (1999) Wright, S. and Nocedal, J. (1999). Numerical optimization. Springer Science, 35(67-68):7.
- Wu et al. (2017) Wu, Y., Mansimov, E., Liao, S., Gross, R., and Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. arXiv preprint arXiv:1708.05144.
Appendix A Hyper-parameters
All implementations of algorithms (PPO, TRPO, ACKTR) are based on OpenAI baselines (Dhariwal et al., 2017). All environments are based on OpenAI gym (Brockman et al., 2016), rllab (Duan et al., 2016) and Roboschool (Schulman et al., 2017a).
We present the details of each policy class as follows.
Factorized Gaussian policies are represented as $\pi_\theta(\cdot|s) = \mathcal{N}(\mu_\theta(s), \Sigma)$, with the mean $\mu_\theta$ a two-layer neural network (with the default number of hidden units per layer, which differs between PPO/ACKTR and TRPO). The covariance matrix is diagonal, $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_m^2)$, with each $\sigma_i$ a single variable shared across all states. These hyper-parameter settings are the defaults in baselines.
Discrete policies are represented as $\pi_\theta(a|s) = \prod_{i=1}^{m} \pi_\theta(a_i|s)$, where each factor is a categorical distribution over $K$ atomic actions. We specify $K$ atomic actions per dimension, evenly spaced between the lower and upper bounds of the action range. In each action dimension, the categorical distribution is specified by a set of logits ($L_j(s)$ for action $a_j$ in state $s$), each parameterized by a neural network with the same architecture as the factorized Gaussian above.
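The discretization described above can be sketched as follows. This is a minimal illustration, not the baselines implementation; the bounds `low`/`high`, the number of atomic actions `K`, and the helper names are our own:

```python
import numpy as np

def make_atomic_actions(low, high, K):
    """Evenly spaced atomic actions per dimension; returns shape (dim, K)."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    return np.linspace(low, high, K).T  # (dim, K)

def sample_factorized(logits, rng):
    """Sample one index per dimension from independent categoricals.

    logits: (dim, K) array of unnormalized log-probabilities, one row
    per action dimension (the factorized joint distribution)."""
    # Gumbel-max trick: argmax(logits + Gumbel noise) ~ Categorical(softmax(logits))
    g = rng.gumbel(size=logits.shape)
    return np.argmax(logits + g, axis=-1)  # (dim,)

def to_continuous(indices, atomic):
    """Map sampled per-dimension indices back to continuous controls."""
    return atomic[np.arange(atomic.shape[0]), indices]
```

Because the joint distribution factorizes across dimensions, sampling and log-probability computation cost $O(mK)$ rather than $O(K^m)$.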
Ordinal policies augment the discrete policies with an ordinal parameterization and have exactly the same number of parameters as the discrete policy. The architecture is the same as above, except that the final layer is followed by an additional transformation (the stick-breaking parameterization of Section 4) that encodes the natural ordering between atomic actions.
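One common stick-breaking construction (in the spirit of Khan et al., 2012) maps unconstrained logits to an ordered categorical distribution; this is our own illustrative sketch, and the paper's exact transform may differ in details:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stick_breaking_probs(logits):
    """Map K-1 unconstrained logits to K categorical probabilities.

    p_j = sigmoid(x_j) * prod_{i<j} (1 - sigmoid(x_i))  for j = 1..K-1,
    p_K = prod_{i<K} (1 - sigmoid(x_i)).
    The cumulative product structure respects the ordering of the
    atomic actions: each break fraction splits off the remaining mass.
    """
    s = sigmoid(logits)  # break fractions in (0, 1)
    # remaining[j] = mass left before break j: [1, (1-s_1), (1-s_1)(1-s_2), ...]
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - s)])
    probs = np.append(s, 1.0) * remaining  # length K, sums to 1 by telescoping
    return probs
```

For zero logits each break fraction is 0.5, so the probabilities decay geometrically toward the last atomic action, with the final two actions tied.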
A Beta policy has the form $\pi_\theta(\cdot|s) = \text{Beta}(\alpha_\theta(s), \beta_\theta(s))$, where $\alpha_\theta$ and $\beta_\theta$ are shape parameters, each parameterized by a two-layer neural network with a softplus at the end, i.e. $\log(1 + \exp(x))$, following (Chou et al., 2017). Actions sampled from this distribution have a strictly finite support. We notice that this parameterization introduces potential instability during optimization: for example, to converge on policies that sample actions at the boundary of the support, the shape parameters must be driven to extreme values, which can be very unstable. We also observe such instability in practice: when the trust region size is large, training can easily terminate prematurely due to numerical errors, while reducing the trust region size stabilizes training but degrades performance. The results for the Beta policy in the main paper are obtained under a trust region size (for TRPO) and learning rate (for PPO) chosen such that the policy achieves a fairly fast rate of learning (compared to other policy classes) at the cost of more numerical errors (which lead to premature termination).
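A minimal sketch of the Beta parameterization, assuming the softplus-plus-one offset used by Chou et al. (2017) to keep the density unimodal (the function names and the rescaling helper are ours):

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def beta_policy_params(net_out_alpha, net_out_beta):
    """Shape parameters from raw network outputs.

    The +1 offset (following Chou et al., 2017) keeps alpha, beta > 1,
    so the Beta density is unimodal with no mass spike at the boundary."""
    alpha = softplus(net_out_alpha) + 1.0
    beta = softplus(net_out_beta) + 1.0
    return alpha, beta

def sample_action(alpha, beta, low, high, rng):
    """Sample u in (0, 1) from Beta, then rescale to the action range."""
    u = rng.beta(alpha, beta)
    return low + (high - low) * u
```

The sketch also makes the instability visible: concentrating mass near `low` or `high` requires `u` near 0 or 1, which pushes the shape parameters toward extreme values during optimization.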
Value functions are two-layer neural networks, with the number of hidden units per layer set separately for PPO/ACKTR and for TRPO. For PPO, the learning rate is tuned over a small grid of values. For TRPO and for ACKTR, the KL constraint parameter is tuned over a small grid of values. All other hyper-parameters are the default parameters in the baselines implementations.
Appendix B Effects of the Number of Atomic Actions
Variance of Policy Gradients.
We analyze the variance of policy gradients when the continuous action space is discretized. For ease of analysis, we assume the following policy architecture: the policy is a neural network (or any differentiable function) that takes state $s$ as input and, through multiple layers of transformation, encodes it into a hidden vector $h = h_w(s)$ with parameters $w$. For the $i$th dimension of the action space and the $j$th atomic action $a_j$, we output a logit $L_{ij} = \theta_{ij}^T h$ with parameters $\theta_{ij}$. For any dimension $i$, the $K$ logits are combined by a soft-max to compute the probability of choosing action $a_j$, $\pi_{ij} = \exp(L_{ij}) / \sum_{k=1}^{K} \exp(L_{ik})$. As noted, the number of model parameters scales linearly with $K$.
To compare the variance of policy gradients across models with varying $K$, we analyze the gradients with respect to the parameters $w$ that encode $s$ into $h$; such parameters are shared by all models. For simplicity, we consider a one-step bandit problem with action space $\{a_1, \ldots, a_K\}$. The instant reward for any action is $r(a_j) = r$, where $r$ is a fixed constant. Since we have only one action dimension, let $\pi_j$ be the probability of taking the $j$th action $a_j$, with logit $L_j$. We also assume that upon initialization the policy has very high entropy, $\pi_j \approx 1/K$. The policy gradient estimator is
$$\hat{g} = r \nabla_w \log \pi_a, \quad a \sim \pi.$$
Under this setting, the policy gradient and its variance are
$$\mathbb{E}[\hat{g}] = r \sum_{j=1}^{K} \pi_j \nabla_w \log \pi_j = r \nabla_w \sum_{j=1}^{K} \pi_j = 0,$$
$$\mathbb{V}[\hat{g}] = \mathbb{E}[\hat{g}^2] = r^2 \sum_{j=1}^{K} \pi_j \left(\nabla_w \log \pi_j\right)^2 \approx \frac{r^2}{K} \sum_{j=1}^{K} \Big( \nabla_w L_j - \frac{1}{K} \sum_{k=1}^{K} \nabla_w L_k \Big)^2, \tag{B}$$
where the approximations come from replacing $\pi_j \approx 1/K$. Notice that $\nabla_w L_j$ does not depend on $K$, since each logit has an independent dependency on $w$. Under conventional neural network initializations (all weight and bias matrices of $h_w$ and the $\theta_j$ are independently initialized), the gradients $\nabla_w L_j$ are i.i.d. random variables, with their randomness stemming from the random initialization of neural network parameters. Denoting by $\mathbb{E}_{\text{init}}[\cdot]$ the expectation w.r.t. neural network initializations, we analyze the expectation of (B):
$$\mathbb{E}_{\text{init}}\big[\mathbb{V}[\hat{g}]\big] \approx r^2 \Big(1 - \frac{1}{K}\Big) \sigma^2,$$
where $\sigma^2$ is the variance of the logit gradients $\nabla_w L_j$.
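Under the simplifying assumptions of this bandit setting (i.i.d. logit gradients with variance $\sigma^2$, uniform initial policy $\pi_j \approx 1/K$), the expected variance works out to $r^2(1 - 1/K)\sigma^2$. This scaling can be checked with a toy Monte-Carlo simulation; the setup below, with unit-variance Gaussian stand-ins for the logit gradients, is our own verification sketch:

```python
import numpy as np

def expected_pg_variance(K, r=1.0, sigma2=1.0, n_trials=200000, seed=0):
    """Monte-Carlo estimate of E_init[Var[g_hat]] for K atomic actions.

    Each trial draws i.i.d. logit gradients g_j ~ N(0, sigma2) and computes
    (r^2 / K) * sum_j (g_j - mean(g))^2, whose expectation under the
    derivation's assumptions is r^2 * (1 - 1/K) * sigma2."""
    rng = np.random.default_rng(seed)
    g = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, K))
    centered = g - g.mean(axis=1, keepdims=True)
    return (r**2 / K) * (centered**2).sum(axis=1).mean()
```

The estimate grows with `K` but saturates toward $r^2\sigma^2$, matching the observation that finer discretization raises gradient variance with diminishing increments for large `K`.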
Though the above derivation makes multiple restrictive assumptions, we find that it largely matches the results in more complex scenarios. For multiple MuJoCo tasks, we compute the empirical variance of policy gradients over random initializations of network parameters. In Figure 5(a) we compare the empirical variance against the predicted variance, normalizing so that the variances are directly comparable across values of $K$. With the same hyper-parameters (including the batch size for each update), policy gradients have larger variance for models with finer discretization (larger $K$), making them harder to optimize.
Combined Effects of Control Capacity and Variance.
As $K$ increases, the policy has larger capacity for control. However, as analyzed above, the policy gradient variance also increases with $K$, which makes policy optimization more difficult with SGD-based methods.
The combined effects can be observed in Figure 5(b). We train policies with various values of $K$ for a fixed number of time steps and evaluate their performance at the end of training. We find that the best performance is obtained at an intermediate value of $K$. When $K$ is small, performance degrades drastically due to the lack of control capacity. When $K$ is large, performance only slightly degrades: this might be because the variance of the policy gradient almost saturates for large $K$, as shown in Figure 5.
Model Parameters and Training Costs.
Both the number of model parameters and the training cost scale linearly with $K$. In Table 3 we present the computational results for the training costs: we train discrete policies (on Reacher-v1) with various values of $K$ for a fixed number of time steps and record the wall time. The results are standardized relative to the Gaussian policy.
Appendix C Additional Experiments
We show results for PPO on simpler MuJoCo tasks in Figure 6. On such tasks, the discrete policy does not necessarily outperform the factorized Gaussian.
We show results for TRPO on simpler MuJoCo tasks in Figure 8. Even on simple tasks, the discrete policy can still significantly outperform the Gaussian ((a) Reacher and (d) Double Pendulum).
We show results for ACKTR on a set of MuJoCo and rllab tasks in Figure 6. For tasks with relatively simple dynamics, the performance gain of the discrete policy is not significant ((a)(b)(c)). However, on the Humanoid tasks, the discrete policy does achieve significant performance gains over the Gaussian ((d)(e)).
C.4 Sensitivity to Hyper-parameters
We present the sensitivity results in Figure 11 below.
C.5 Comparison with Gaussian Policy with Big Networks
In our implementation, discrete/ordinal policies have more parameters than the Gaussian policy. A natural question is whether the gains in policy optimization are due to a bigger network. To test this, we train the Gaussian policy with larger networks: two-layer neural networks with more hidden units per layer. In Table 4 and Table 5, we find that for the Gaussian policy, the bigger network does not perform as well as the smaller network. Since the Gaussian policy with the bigger network has more parameters than the discrete policy, this validates the claim that the performance gains of the discrete policy are not (only) due to increased parameters. Below we show results for PPO in Table 5 and for TRPO in Table 4.
|Task|Gaussian (big)|Gaussian (small)|
|---|---|---|
|Sim. Human. (L)| | |
|Task|Gaussian (big)|Gaussian (small)|
|---|---|---|
|Sim. Human. (L)| | |
C.6 Additional Comparison with Beta Policy
Chou et al. (2017) show performance gains of the Beta distribution policy on a limited number of benchmark tasks, most of which are relatively simple (with low-dimensional observation and action spaces). However, they also show performance gains on Humanoid-v1. We compare the results of our Figure 10 with Figure 5(j) in (Chou et al., 2017) (assuming each training epoch takes a fixed number of steps): within 10M training steps, the discrete/ordinal policy achieves faster progress, reaching a higher reward at the end of training than the Beta policy. According to (Chou et al., 2017), the Beta distribution can reach better asymptotic performance given more training, while we find that the discrete/ordinal policy also achieves strong asymptotic performance.
Appendix D Illustration of Benchmark Tasks
We present an illustration of benchmark tasks in Figure 12. All the benchmark tasks are implemented with very efficient physics simulators. All tasks use sensory data as states and actuator controls as actions.