Introduction
Model-free control methods are currently divided into two main branches: value-based methods and policy gradient methods. Value-based methods, such as Q-learning, have been quite successful in discrete-action domains [Mnih et al. 2013; van Hasselt, Guez, and Silver 2015], whereas policy gradient methods have been more commonly used in continuous action spaces. One reason for this split is that finding the maximizing action for Q-learning can be difficult in continuous-action spaces, requiring an optimization problem to be solved on every step.
A common strategy when using action-value methods with continuous actions has been to restrict the form of the action-values, so that the optimization over actions is easy to solve. Wire-fitting [Baird and Klopf 1993; del R Millán, Posenato, and Dedieu 2002] interpolates between a set of action points, adjusting those points over time to force one interpolated action point to become the maximizing action. Normalized Advantage Functions (NAF) [Gu et al. 2016b] learn an advantage function [Baird 1993; Harmon and Baird 1996a; Harmon and Baird 1996b] by constraining it to be quadratic in the actions, keeping track of the vertex of the parabola. Partially Input Convex Neural Networks (PICNN) are learned such that the action-values are guaranteed to be convex in the actions [Amos, Xu, and Kolter 2016]. To guarantee convexity, however, PICNNs are restricted to non-negative weights and convex, non-decreasing activations such as ReLU, and the maximizing action is found with an approximate gradient descent from random action points.
Another direction has been to parameterize the policy using the action-values, and use instead a soft Q-learning update [Haarnoja et al. 2017]. For action selection, the policy is parameterized as an energy-based model using the action-values. This approach avoids the difficult optimization over actions, but it can instead be expensive to sample an action from the policy. The action-values can be an arbitrary (energy) function, and sampling from the corresponding energy-based model requires an approximate sampling routine, like MCMC. Moreover, this approach optimizes an entropy-regularized objective, which differs from the traditional objective used by most other action-value learning algorithms, like Q-learning.
Policy gradient methods, on the other hand, learn a simple parametric distribution or deterministic function over actions that can be easily used in continuous action spaces. In recent years, policy gradient methods have been particularly successful in continuous-action benchmark domains [Duan et al. 2016], facilitated by the Actor-Critic framework. Actor-Critic methods, first introduced by Sutton (1984), use a Critic (value function) that evaluates the current policy to help compute the gradient for the Actor (policy). This separation into Actor and Critic enables the two components to be optimized in a variety of ways, facilitating algorithm development. The Actor can incorporate different update mechanisms to achieve better sample efficiency [Mnih et al. 2016; Kakade 2001; Peters and Schaal 2008; Wu et al. 2017] or stable learning [Schulman et al. 2015; Schulman et al. 2017]. The Critic can be used as a baseline or control variate to reduce variance [Greensmith, Bartlett, and Baxter 2004; Gu et al. 2016a; Schulman et al. 2016], and can improve sample efficiency by incorporating off-policy samples [Degris, White, and Sutton 2012; Silver et al. 2014; Lillicrap et al. 2015; Wang et al. 2016].

In this work, we propose a framework called Actor-Expert, that parallels Actor-Critic, but for value-based methods, facilitating the use of Q-learning in continuous action spaces. Actor-Expert decouples optimal action selection (Actor) from action-value representation (Expert), enabling a variety of optimization methods to be used for the Actor. The Expert learns the action-values using Q-learning. The Actor learns the greedy action by iteratively updating towards an estimate of the maximal action for the action-values given by the Expert. This decoupling also enables any Actor to be used, including any exploration mechanism, without interfering with the Expert's goal of learning the optimal action-values. Actor-Expert differs from Actor-Critic because the Expert uses Q-learning—the Bellman optimality operator—whereas the Critic performs policy evaluation to obtain the values of the current (suboptimal) policy. In Actor-Expert, the Actor tracks the Expert, to track the greedy action, whereas in Actor-Critic, the Critic tracks the Actor, to track the policy values.
Taking advantage of this formalism, we introduce a Conditional Cross Entropy Method for the Actor that puts minimal restrictions on the form of the action-values. The basic idea is to iteratively increase the likelihood of near-maximal actions for the Expert over time, extending a global optimization algorithm, the Cross Entropy Method [Rubinstein 1999], to be conditioned on state. We show in a toy domain with bimodal action-values—which are neither quadratic nor convex—that previous action-value methods with restrictive action-value forms (NAF and PICNN) perform poorly, whereas Actor-Expert learns the optimal policy. We then show on several continuous-action benchmark domains that our algorithm outperforms previous value-based methods and an instance of an Actor-Critic method, Deep Deterministic Policy Gradient (DDPG).
Background and Problem Formulation
The interaction between the agent and environment is formalized as a Markov decision process $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s' \mid s, a)$ is the one-step state transition dynamics, $r(s, a, s')$ is the reward function and $\gamma \in [0, 1)$ is the discount rate. At each discrete time step $t$, the agent selects an action $a_t$ according to policy $\pi$, transitions to state $s_{t+1}$ according to $P$, and observes a scalar reward $r_{t+1}$.

For value-based methods, the objective is to find the fixed point of the Bellman optimality operator:

$$Q^*(s, a) = \mathbb{E}\big[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\big|\, s_t = s, a_t = a \,\big] \qquad (1)$$

The corresponding optimal policy selects a greedy action from the set $\arg\max_a Q^*(s, a)$. These optimal Q-values are typically learned using Q-learning [Watkins and Dayan 1992]: for action-values $Q_\theta$ parameterized by $\theta$, the iterative updates are

$$\theta_{t+1} = \theta_t + \alpha_t \, \delta_t \, \nabla_\theta Q_{\theta_t}(s_t, a_t), \quad \text{for } \delta_t = r_{t+1} + \gamma \max_{a'} Q_{\theta_t}(s_{t+1}, a') - Q_{\theta_t}(s_t, a_t).$$
Q-learning is an off-policy algorithm that can learn the action-values for the optimal policy while following a different (exploratory) behaviour policy.
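To make the update concrete, here is a minimal sketch of this Q-learning update with linear function approximation; the feature map, candidate-action set, and all constants are our own illustrative choices, not from the paper (the finite candidate set sidesteps the continuous-action maximization discussed below):

```python
import numpy as np

def features(s, a):
    """A simple fixed feature map phi(s, a) (illustrative only)."""
    return np.array([s, a, s * a, 1.0])

def q_value(theta, s, a):
    """Linear action-values: Q_theta(s, a) = theta . phi(s, a)."""
    return theta @ features(s, a)

def q_learning_update(theta, s, a, r, s_next, candidates, alpha=0.1, gamma=0.5):
    """One Q-learning step: theta <- theta + alpha * delta * grad_theta Q(s, a).
    The max over next actions is approximated over a finite candidate set."""
    max_next = max(q_value(theta, s_next, a2) for a2 in candidates)
    delta = r + gamma * max_next - q_value(theta, s, a)   # TD error
    return theta + alpha * delta * features(s, a)

theta = np.zeros(4)
candidates = np.linspace(-1.0, 1.0, 5)
# Repeatedly updating on the same transition drives the TD error toward zero.
for _ in range(300):
    theta = q_learning_update(theta, s=0.5, a=0.2, r=1.0, s_next=0.5,
                              candidates=candidates)
```

Note that the candidate-set max is exactly the step that becomes problematic as the action space grows, which motivates the Actor introduced later.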
Policy gradient methods directly optimize a parameterized policy $\pi_w$, with parameters $w$. The objective is typically an average reward objective,

$$J(w) = \int_{\mathcal{S}} d_{\pi_w}(s) \int_{\mathcal{A}} \pi_w(a \mid s) \, r(s, a) \, da \, ds \qquad (2)$$

where $d_{\pi_w}$ is the stationary distribution over states, representing state visitation. Policy gradient methods estimate gradients of this objective [Sutton et al. 2000]. For example, in the policy-gradient approach called Actor-Critic [Sutton 1984], the Critic estimates $Q^{\pi_w}$ and the Actor uses the Critic to obtain an estimate of the above gradient to adjust the policy parameters $w$.
Action-value methods for continuous actions can be difficult to use, because an optimization over actions needs to be solved, both for decision-making and for the Q-learning update. For a reasonably small number of discrete actions, $\max_a Q_\theta(s, a)$ is straightforward to solve, by iterating across all actions. For continuous actions, $Q_\theta(s, a)$ cannot be queried for all actions, and the optimization can be difficult to solve, such as when $Q_\theta(s, a)$ is non-convex in $a$.
Actor-Expert Formalism
We propose a new framework for value-based methods, with an explicit Actor. The goal is to provide a framework similar to Actor-Critic—which has been so successful for algorithm development of policy gradient methods—to simplify algorithm development for value-based methods. The Expert learns the action-values using Q-learning, but with an explicit Actor that provides the greedy actions. The Actor has two roles: to select which action to take (behavior policy) and to provide the greedy action for the Expert's Q-learning target. In this section, we develop a Conditional Cross Entropy Method for the Actor, to estimate the greedy action, and provide theoretical guarantees that the approach tracks a changing Expert.
Conditional Cross Entropy Method for the Actor
The primary role of the Actor is to identify—or learn—$\arg\max_a Q_\theta(s, a)$ for the Expert. Different strategies can be used to obtain this greedy action on each step. The simplest strategy is to solve this optimization with gradient ascent, to convergence, on every time step. This is problematic for two reasons: it is expensive, and it is likely to get stuck in suboptimal stationary points.
Consider now a slightly more effective strategy, which learns an Actor that provides an approximate greedy action to serve as a good initial point for gradient ascent. Such a strategy reduces the number of gradient ascent steps required, and so makes it more feasible to solve the gradient ascent problem on each step. After obtaining the resulting action at the end of the gradient ascent iterations, the Actor can be trained towards that action, using a supervised learning update. The Actor slowly learns to select better initial actions, conditioned on state, that are near stationary points of $Q_\theta(s, \cdot)$—which hopefully correspond to high-value actions. This Actor reduces computational complexity, but still suffers from reaching suboptimal stationary points.

To overcome this issue, we propose an approach inspired by the Cross Entropy Method from global optimization. Global optimization strategies are designed to find the global optimum of a function $f(\theta)$ for some parameters $\theta$. For example, for the parameters of a neural network, $f(\theta)$
may be the loss function on a sample of data. The advantage of these methods is that they do not rely on gradient-based strategies, which are prone to getting stuck in saddle points and local optima. Instead, they use randomized search strategies that have been shown to be effective in practice
[Salimans et al. 2017; Peters and Schaal 2007; Szita and Lörincz 2006; Hansen, Müller, and Koumoutsakos 2003].

One such algorithm is the Cross Entropy Method (CEM) [Rubinstein 1999]. This method maintains a distribution $p(\theta)$ over parameters, starting with a wide distribution, such as a Gaussian with mean zero and a diagonal covariance of large magnitude. The high-level idea is elegantly simple. On each iteration, the goal is to minimize the KL-divergence to the uniform distribution over the set of parameters where the objective function is greater than some threshold: $\{\theta : f(\theta) \geq f_t\}$. This distribution can be approximated with an empirical distribution, obtained by sampling several parameter vectors $\theta_1, \ldots, \theta_N$ from $p$ and keeping those with $f(\theta_i) \geq f_t$ while discarding the rest. Minimizing the KL-divergence to this empirical distribution corresponds to maximizing the likelihood, under the distribution $p$, of the parameters in the retained set. Iteratively, the distribution over parameters narrows around higher-valued $\theta$. Sampling the $\theta_i$ from $p$ narrows the search and makes it more likely that they produce a useful approximation to the desired distribution.

CEM, however, finds a single best set of parameters for a single optimization problem. Most work using CEM in reinforcement learning aims to learn a single best set of policy parameters that achieve higher rollout returns [Szita and Lörincz 2006; Mannor, Rubinstein, and Gat 2003]. Our goal, however, is not a single global optimization over returns, but rather a repeated optimization to select maximal actions, conditioned on each state. The global optimization strategy could be run on each step to find the exact best action for each current state, but this is expensive and throws away prior information about the function surface gained from previous optimizations.

We extend the Cross Entropy Method to be (a) conditioned on state and (b) learned iteratively over time. CEM is well-suited to a conditional approach, for use in the Actor, because it provides a stochastic Actor that can explore naturally and is effective for smooth, non-convex functions [Kroese, Porotsky, and Rubinstein 2006]. The idea is to iteratively update the policy $\pi_w(a \mid s)$, where previous updates conditioned on state generalize to similar states. The Actor learns a stochastic policy that slowly narrows around maximal actions, conditioned on states, as the agent performs CEM updates iteratively for the functions $Q_\theta(s, \cdot)$.
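As a reference point, unconditional CEM as just described can be sketched in a few lines; the bimodal objective and all constants here are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Bimodal objective: global maximum near x = 2, lower local maximum near x = -2.
    return np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-(x + 2.0) ** 2)

mu, sigma = 0.0, 3.0                                # start with a wide Gaussian
for _ in range(30):
    samples = rng.normal(mu, sigma, size=200)       # sample candidate parameters
    elite = samples[np.argsort(f(samples))[-20:]]   # keep the top 10% quantile
    mu, sigma = elite.mean(), elite.std() + 1e-3    # maximum-likelihood refit
```

A gradient-based search started at 0 could easily settle on the smaller mode; the randomized quantile refit concentrates on the global one.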
The Conditional CEM (CCEM) algorithm replaces the learned distribution $p(\theta)$ with a conditional distribution $\pi_w(a \mid s)$, where $\pi_w$ can be any parameterized, multimodal distribution. For a mixture model, for example, the parameters are conditional means $\mu_i(s)$, conditional diagonal covariances $\Sigma_i(s)$ and coefficients $c_i(s)$, for the $i$-th component of the mixture. On each step, the conditional mixture model $\pi_w(\cdot \mid s)$ is sampled to provide a set of actions, from which we construct the empirical distribution of actions whose values $Q_\theta(s, a_i)$ exceed the threshold for state $s$. The parameters $w$ are updated using a gradient ascent step on the log-likelihood of these actions under $\pi_w(\cdot \mid s)$.
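A minimal sketch of one such update, assuming a linear-Gaussian actor with fixed variance (the full method uses a multimodal mixture and also adapts a bounded variance); the toy Expert and all constants are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def ccem_step(w, s, q, sigma=0.5, n_samples=30, rho=0.2, lr=0.05):
    """One Conditional CEM update for an actor pi(a|s) = N(w*s, sigma^2).
    Sample actions, keep the top-rho fraction under q(s, .), then take a
    gradient ascent step on their mean log-likelihood (analytic gradient
    for a 1-D Gaussian). sigma is held fixed here for simplicity."""
    mu = w * s
    actions = rng.normal(mu, sigma, size=n_samples)
    k = max(1, int(rho * n_samples))
    elite = actions[np.argsort(q(s, actions))[-k:]]   # empirical distribution
    # d/dw mean log N(a; w*s, sigma^2) = mean(a - mu) / sigma^2 * s
    grad_w = np.mean(elite - mu) / sigma ** 2 * s
    return w + lr * grad_w

# Toy "Expert": the maximizing action is a = 0.8 * s in every state.
q = lambda s, a: -(a - 0.8 * s) ** 2
w = 0.0
for _ in range(1000):
    s = rng.uniform(-1.0, 1.0)        # states drawn from a fixed distribution
    w = ccem_step(w, s, q)
```

The likelihood steps conditioned on each sampled state generalize across states, so the single parameter `w` tracks the greedy action for every state at once.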
The high-level framework is given in Algorithm 1. The Expert is updated towards learning the optimal Q-values, with (a variant of) Q-learning. The Actor provides exploration and, over time, learns to find the maximal action for the Expert in the given state, using the described Conditional CEM algorithm. The strategy for constructing the empirical distribution is assumed to be given; we discuss the two strategies we explore in the experiments in the next subsection.
We depict an Actor-Expert architecture where the Actor uses a mixture model in Figure 1. In our implementation, we use mixture density networks [Bishop 1994] to learn a Gaussian mixture distribution. As in Figure 1, the Actor and Expert share the same neural network to obtain the representation for the state, and learn separate functions conditioned on that representation. To obtain the maximal action under mixture models with a small number of components, we simply use the mean $\mu_i(s)$ with the highest coefficient $c_i(s)$. To prevent the diagonal covariance from exploding or vanishing, we bound it within a fixed range using a tanh layer. We also follow standard practice of using experience replay and target networks to stabilize learning in neural networks. A more detailed algorithm for Actor-Expert with neural networks is described in Supplement 2.1.
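The greedy-action rule and the bounded variance can be sketched as follows; the mixture values and tanh bounds are illustrative, not those used in the experiments:

```python
import numpy as np

def greedy_action(coeffs, means):
    """Greedy action from a mixture policy: the mean of the component with
    the highest mixture coefficient (a cheap stand-in for the true argmax
    of the mixture density)."""
    return means[np.argmax(coeffs)]

def bounded_log_std(raw, lo=-5.0, hi=2.0):
    """Squash an unbounded network output into [lo, hi] with tanh, keeping
    the diagonal covariance from exploding or vanishing. Bounds are
    illustrative choices."""
    return lo + 0.5 * (hi - lo) * (np.tanh(raw) + 1.0)

coeffs = np.array([0.3, 0.6, 0.1])
means = np.array([[-1.0], [0.5], [2.0]])
a = greedy_action(coeffs, means)      # picks the mean of the 0.6 component
```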
Selecting the empirical distribution
A standard strategy for selecting the empirical distribution in CEM is to use the top quantile of sampled variables—actions in this case (Algorithm 2). For actions $a_1, \ldots, a_N$ sampled from $\pi_w(\cdot \mid s)$, we select the subset of actions whose values $Q_\theta(s, a_i)$ fall in the top quantile, with equal weight on each selected action. This strategy is generic and, as we find empirically, effective.

For particular regularities in the action-values, however, we may be able to further improve this empirical distribution. For action-values differentiable in the action, we can perform a small number of gradient ascent steps from each sampled action to reach actions with slightly higher action-values (Algorithm 3). The empirical distribution should then contain a larger number of useful actions—those with higher action-values—on which to perform maximum likelihood, potentially also requiring fewer samples. In our experiments we perform 10 gradient ascent steps.
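The two strategies can be sketched as follows, assuming a differentiable toy action-value function of our own; `rho`, the step size, and the number of steps are illustrative:

```python
import numpy as np

def top_quantile(actions, values, rho=0.2):
    """Quantile empirical distribution: keep the actions whose values fall
    in the top rho fraction (Algorithm 2 style)."""
    k = max(1, int(np.ceil(rho * len(actions))))
    return actions[np.argsort(values)[-k:]]

def refine(actions, q_grad, n_steps=10, lr=0.05):
    """Optimized variant: a few gradient ascent steps on each sampled action,
    for action-values differentiable in the action (Algorithm 3 style)."""
    a = actions.copy()
    for _ in range(n_steps):
        a += lr * q_grad(a)
    return a

# Illustrative differentiable action-values with maximum at a = 1.
q = lambda a: -(a - 1.0) ** 2
q_grad = lambda a: -2.0 * (a - 1.0)

actions = np.linspace(-2.0, 2.0, 10)
elite = top_quantile(actions, q(actions), rho=0.2)   # the 2 best actions
refined = refine(actions, q_grad)                    # all actions nudged uphill
```

Refinement moves every sampled action toward higher action-values before the quantile cut, which is why fewer samples suffice.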
Theoretical guarantees for the Actor
In this section, we derive guarantees that the Conditional CEM Actor tracks a CEM update, for an evolving Expert. We follow a two-timescale stochastic approximation approach, where the action-values (Expert) change more slowly than the policy (Actor), allowing the Actor to track the maximal actions.¹ The Actor itself has two timescales, to account for its own parameters changing at different timescales. Actions for the maximum likelihood step are selected according to older—slower—parameters, so that it is as if the primary—faster—parameters are updated using samples from a fixed distribution.

¹This is actually opposite to Actor-Critic, for which the Actor changes slowly and the value estimates are on the faster timescale.
We provide an informal theorem statement here, with a proof sketch. We include the full theorem statement, with assumptions and proof, in Supplement 1.
Theorem 1 (Informal Convergence Result).
Let $\theta_t$ be the action-value parameters with stepsize $\alpha_t$, and $w_t$ be the policy parameters with stepsize $\beta_t$, with a more slowly changing set of policy parameters $w'_t$ updated towards $w_t$ with stepsize $\beta'_t$. Assume
1. States are sampled from a fixed marginal distribution.
2. $\pi_w(\cdot \mid s)$ is locally Lipschitz w.r.t. $w$, for all $s$.
3. Parameters $\theta_t$ and $w_t$ remain bounded almost surely.
4. Stepsizes are chosen for three different timescales, so that $w_t$ evolves faster than $w'_t$, and $w'_t$ evolves faster than $\theta_t$.
5. All three stepsizes decay to zero, while the number of sampled actions strictly increases to infinity.
Then the Conditional CEM Actor tracks the CEM Optimizer for actions, conditioned on state: the stochastic recursion for the Actor asymptotically behaves like an expected CEM Optimizer, with expectation taken across states.
Proof Sketch: The proof follows a multi-timescale stochastic approximation analysis. The primary concern is that the stochastic update to the Actor is not a direct gradient-descent update. Rather, each update to the Actor is a CEM update, which requires a different analysis to ensure that the stochastic noise remains bounded and is asymptotically negligible. Further, classical results for CEM do not immediately apply, because those results assume the distribution parameters can be computed directly. Here, the distribution parameters are conditioned on state, as outputs from a parameterized function. We identify conditions on the parameterized policy that ensure well-behaved CEM updates.
The multi-timescale analysis allows us to focus on the updates of the Actor parameters $w$, assuming the action-value parameters $\theta$ and action-sampling parameters $w'$ are quasi-static. These parameters are allowed to change with time—as they will in practice—but move at a sufficiently slower timescale relative to $w$, and hence the analysis can be undertaken as if they were static. The action-value updates need to produce $\theta$ that keep the action-values bounded for each state and action, but we do not specify the exact algorithm for the action-values. We assume that the action-value algorithm is given, and focus the analysis on the novel component: the Conditional CEM updates for the Actor.
The first step in the proof is to formulate the update to the weights $w$ as a projected stochastic recursion—simply meaning a stochastic update where, after each update, the weights are projected onto a compact, convex set to keep them bounded. The stochastic recursion is reformulated into a summation involving the mean vector field (which depends on the action-value parameters $\theta$), martingale noise, and a loss term due to having approximate quantiles. The key steps are then to show, almost surely, that the mean vector field is locally Lipschitz, the martingale noise is quadratically bounded, and the loss term decays to zero asymptotically. For the first two, we identify conditions on the policy parameterization that guarantee these properties. For the last, we adapt the proof that sampled quantiles approach true quantiles in CEM, with modifications to account for expectations over the conditioning variable, the state.
Experiments
In this section, we investigate the utility of AE, particularly highlighting the benefit of a general functional form for the action-values and demonstrating performance across several benchmark domains. We first design a domain where the true action-values are neither quadratic nor concave. Then we test AE and the other algorithms listed below in more complex continuous-action domains from OpenAI Gym [Brockman et al. 2016] and MuJoCo [Todorov, Erez, and Tassa 2012].
Algorithms
We use two versions of Actor-Expert: AE, which uses the Quantile Empirical Distribution (Alg. 2), and AE+, which uses the Optimized Quantile Empirical Distribution (Alg. 3). We use a bimodal Gaussian mixture for both Actors. AE+ uses fewer sampled actions than AE, reflecting that a smaller number of samples is needed for the optimized set of actions. For the benchmark environments, it was even effective—and more efficient—for AE+ to sample only 1 action. For NAF, PICNN, Wire-fitting, and DDPG, we attempt to match the settings used in their original works.
Normalized Advantage Functions (NAF) [Gu et al. 2016b] use $Q(s, a) = V(s) + A(s, a)$, restricting the advantage function to the quadratic form $A(s, a) = -\frac{1}{2}(a - \mu(s))^\top P(s)(a - \mu(s))$ with $P(s)$ positive definite. $V(s)$ corresponds to the state value for the maximal action $\mu(s)$, and the advantage only decreases this value for $a \neq \mu(s)$. NAF takes actions by sampling from a Gaussian with learned mean $\mu(s)$ and learned covariance, with the initial exploration scale swept in {0.1, 0.3, 1.0}.
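A sketch of this quadratic parameterization, with the network outputs replaced by fixed illustrative values:

```python
import numpy as np

def naf_q(v, mu, L, a):
    """NAF action-values: Q(s, a) = V(s) + A(s, a), with
    A(s, a) = -0.5 (a - mu)^T P (a - mu) and P = L L^T positive semi-definite,
    so Q is maximized exactly at a = mu, where Q = V."""
    P = L @ L.T
    d = a - mu
    return v - 0.5 * d @ P @ d

# Stand-ins for network outputs at a single state (illustrative values).
v = 3.0                                 # state value V(s)
mu = np.array([0.5, -0.2])              # maximizing action mu(s)
L = np.array([[1.0, 0.0], [0.3, 0.8]])  # lower-triangular factor of P(s)

q_at_mu = naf_q(v, mu, L, mu)           # equals v
q_off = naf_q(v, mu, L, mu + 0.5)       # strictly lower
```

The convenience is clear here: the argmax is read off as `mu` with no search. The cost, examined in the Bimodal Domain below, is that only unimodal, quadratic value surfaces can be represented.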
Partially Input Convex Neural Networks (PICNN) [Amos, Xu, and Kolter 2016] are neural networks that are convex with respect to part of their input—the action in this case. PICNN learns the action-value function so that it is convex with respect to the action, by restricting the weights of intermediate layers to be non-negative and the activation functions to be convex and non-decreasing (e.g., ReLU). For exploration in PICNN, we use OU noise—temporally correlated stochastic noise generated by an Ornstein-Uhlenbeck process [Uhlenbeck and Ornstein 1930]—added to the greedy action. To obtain the greedy action, as suggested in their paper, we used 5 iterations of the bundle entropy method from a randomly initialized action.

Wire-fitting [Baird and Klopf 1993] outputs a set of action control points and corresponding action-values for a state. By construction, the optimal action is the action control point with the highest action-value. Like PICNN, we use OU exploration. This method uses interpolation between the action control points to find the action-values, and thus its performance depends largely on the number of control points. We used 100 action control points for the Bimodal Domain. For the benchmark problems, we found that Wire-fitting did not scale well, and so it was omitted.
Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al. 2015] learns a deterministic policy, parameterized as a neural network, using the deterministic policy gradient theorem [Silver et al. 2014]. We include it as a policy gradient baseline, as it is a competitive Actor-Critic method using off-policy policy gradients. Like PICNN and Wire-fitting, DDPG uses OU noise for exploration.
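Since several of the baselines share this OU exploration scheme, here is a minimal sketch of the process; `theta` and `sigma` are common defaults, not necessarily the values used in these experiments:

```python
import numpy as np

def ou_noise(n_steps, dim, theta=0.15, sigma=0.2, dt=1.0, seed=0):
    """Ornstein-Uhlenbeck process: temporally correlated noise that reverts
    toward zero at rate theta, producing smoother exploration than i.i.d.
    Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    out = np.empty((n_steps, dim))
    for t in range(n_steps):
        # Euler-Maruyama step: mean reversion plus Gaussian diffusion.
        x = x + theta * (0.0 - x) * dt + sigma * np.sqrt(dt) * rng.normal(size=dim)
        out[t] = x
    return out

noise = ou_noise(1000, dim=2)
# exploratory action at step t = greedy action + noise[t]
```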
Experimental Settings
Agent performance is evaluated periodically during training, by executing the current policy without exploration for 10 episodes. The performance was averaged over 20 runs with different random seeds for the Bimodal Domain, and 10 runs for the benchmark domains. For all agents we use a neural network with 2 layers of 200 hidden units each, with ReLU activations between layers and a tanh activation for action outputs. For AE and AE+, the Actor and Expert share the first layer, and branch out into two separate layers which each have 200 hidden units. We keep a running average and standard deviation to normalize unbounded state inputs. We use an experience replay buffer and target networks, as is common with neural networks, with a batch size of 32 and a fixed discount factor for all agents. We sweep over learning rates – policy: {1e-3, 1e-4, 1e-5}, action-values: {1e-2, 1e-3, 1e-4} – and over the use of layer normalization between network layers. For PICNN, however, layer normalization could not be used, in order to preserve convexity. The best hyperparameter settings found for all agents are reported in Supplement 2.4.
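The running normalization of states can be sketched with Welford's online algorithm (our choice for illustration; the paper does not specify the exact scheme):

```python
import numpy as np

class RunningNormalizer:
    """Track a running mean and standard deviation of observed states
    (Welford's online algorithm) and use them to normalize inputs."""
    def __init__(self, dim, eps=1e-8):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # running sum of squared deviations
        self.eps = eps

    def update(self, s):
        self.n += 1
        delta = s - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (s - self.mean)

    def normalize(self, s):
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + self.eps
        return (s - self.mean) / std

rng = np.random.default_rng(0)
norm = RunningNormalizer(dim=3)
states = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
for s in states:
    norm.update(s)
z = np.array([norm.normalize(s) for s in states])   # roughly zero mean, unit scale
```

The single-pass update avoids storing the full history of states, which matters when training for millions of steps.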
Experiments in a Bimodal Toy Domain
To illustrate the limitation posed by restricting the functional form of the action-values, we design a toy domain with a single state and a one-dimensional continuous action, where the true action-values—shown in Figure 3—are a function of two radial basis functions with different centers and unequal peak values. We assume a deterministic setting, and so the rewards correspond directly to these action-values.

We plot the average performance of the best setting for each agent over 20 runs in Figure 2. We also monitored the training process, logging the action-value function, exploratory action, and greedy action at each time step. We include videos in the Supplement², and descriptions can be found in Supplement 2.2.

²https://sites.google.com/ualberta.ca/actorexpert/
All the methods that restrict the functional form over actions failed in many runs. PICNN and NAF start to increase the value of one action center and, by necessity of convexity, must overly decrease the values around the other action center. Consequently, when they randomly explore and observe a higher reward for that action than they predict, a large update skews the action-value estimates. DDPG similarly suffers, because the Actor only learns to output one action. Even though its action-value function is not restricted, DDPG may periodically see high value for the other action center, and so its choice of greedy action can be pulled back and forth between these high-valued actions. AE methods, on the other hand, almost always found the optimal action. Wire-fitting performed better than DDPG, PICNN and NAF, as it is capable of correctly modeling the action-values, but it still converged to the suboptimal policy quite often.
The exploration mechanisms also played an important role. For certain exploration settings, the agents restricting the functional form of the action-values or policy can learn to settle on one action, rather than oscillating. For NAF with a small exploration scale, and for DDPG using OU noise, such oscillations were not observed in the above figure, because the agent only explores locally around one action, avoiding oscillation but also often converging to the suboptimal action. This was still the better choice for overall performance, as oscillation produces lower accumulated reward. AE and AE+, on the other hand, explore by sampling from their learned multimodal Gaussian mixture distribution, with no external exploration parameter to tune.
Experiments in Benchmark Domains
We evaluated the algorithms on a set of benchmark continuous-action tasks, with results shown in Figure 4. As mentioned above, we do not include Wire-fitting, as it scaled poorly on these domains. A detailed description of the benchmark environments and their dimensions is included in Supplement 2.3; state dimensions range from 3 to 17 and action dimensions from 1 to 6.
In all benchmark environments, AE and AE+ perform as well as or better than the other methods. In particular, they seem to learn more quickly. We hypothesize that this is because AE better estimates the greedy actions and explores around actions with high action-values more effectively.
NAF and PICNN seemed to have less stable behavior, potentially due to their restrictive action-value forms. PICNN likely suffers less, because its functional form is more general, but its greedy action selection mechanism is not as robust, and some instability was observed in Lunar Lander. Such instability is not observed in Pendulum or HalfCheetah, possibly because the action-value surface is simple in Pendulum, and for a locomotion environment like HalfCheetah precision is not necessary: approximately good actions may still enable the agent to move and achieve reasonable performance.
Though the goal here is to evaluate the utility of AE compared to value-based methods, we do include one policy gradient method as a baseline, providing a preliminary comparison of value-based versus policy gradient approaches for continuous control. It is interesting to see that AE methods often perform better than their Actor-Critic (policy gradient) counterpart, DDPG. In particular, AE seems to learn much more quickly, which is a hypothesized benefit of value-based methods. Policy gradient methods, on the other hand, typically have to use out-of-date value estimates to update the policy, which can slow learning.
Discussion and Future Work
In this work, we introduced a new framework called Actor-Expert, which decouples action selection from action-value representation by introducing an Actor that learns to identify maximal actions for the Expert's action-values. Previous value-based approaches for continuous control have typically limited the action-value functional form to make optimization over actions easy. We have shown that this can be problematic in domains where the true action-values do not follow this parameterization. We proposed an instance of Actor-Expert by developing a Conditional Cross Entropy Method that iteratively finds greedy actions conditioned on states. We use a multi-timescale analysis to prove that this Actor tracks the Cross Entropy updates that seek the optimal actions across states, as the Expert evolves gradually. This proof differs from other multi-timescale proofs in reinforcement learning, as we analyze a stochastic recursion based on the Cross Entropy Method, rather than the more typical stochastic (semi-)gradient descent update. We conclude by showing that AE methods are able to find the optimal policy even when the true action-value function is bimodal, and perform as well as or better than previous methods in more complex domains. Like the Actor-Critic framework, we hope the Actor-Expert framework will facilitate further development and use of value-based methods for continuous action problems.
One such direction is to more extensively compare value-based methods and policy gradient methods for continuous control. In this work, we investigated how to use value-based methods with continuous actions, but did not claim that value-based methods are preferable to policy gradient methods. However, there are several potential benefits of value-based methods that merit further exploration. One advantage is that Q-learning easily incorporates off-policy samples, potentially improving sample complexity, whereas for policy gradient methods this comes at the cost of introducing bias. Although some off-policy policy gradient methods like DDPG have achieved high performance in benchmark domains, they are also known to suffer from brittleness and hyperparameter sensitivity [Duan et al. 2016; Henderson et al. 2017]. Another, more speculative, advantage is in terms of the optimization surface. The Q-learning update converges to the optimal values in tabular settings and, under some conditions, with linear function approximation [Melo and Ribeiro 2007]. Policy gradient methods, on the other hand, can have local minima, even in the tabular setting. One goal of Actor-Expert is to improve value-based methods for continuous actions, and so facilitate investigation into these hypotheses, without being limited by difficulties in action selection.
References
 [Amos, Xu, and Kolter2016] Amos, B.; Xu, L.; and Kolter, J. Z. 2016. Input convex neural networks. CoRR abs/1609.07152.
 [Baird and Klopf1993] Baird, L. C., and Klopf, A. H. 1993. Reinforcement learning with highdimensional, continuous actions. Wright Laboratory.
 [Baird1993] Baird, L. C. 1993. Advantage updating. Technical report, Technical report, DTIC Document.
 [Bishop1994] Bishop, C. M. 1994. Mixture density networks. Technical report.
 [Borkar1997] Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems & Control Letters 29(5):291–294.
 [Borkar2008] Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
 [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym.
 [Degris, White, and Sutton2012] Degris, T.; White, M.; and Sutton, R. S. 2012. Offpolicy actorcritic. CoRR abs/1205.4839.
 [del R Millán, Posenato, and Dedieu2002] del R Millán, J.; Posenato, D.; and Dedieu, E. 2002. ContinuousAction QLearning. Machine Learning.
 [Duan et al.2016] Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. CoRR abs/1604.06778.
 [Durrett1991] Durrett, R. 1991. Probability. theory and examples. the wadsworth & brooks/cole statistics/probability series. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA.
 [Greensmith, Bartlett, and Baxter2004] Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. J. Mach. Learn. Res. 5:1471–1530.
 [Gu et al.2016a] Gu, S.; Lillicrap, T. P.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016a. Q-Prop: Sample-efficient policy gradient with an off-policy critic. CoRR abs/1611.02247.
 [Gu et al.2016b] Gu, S.; Lillicrap, T. P.; Sutskever, I.; and Levine, S. 2016b. Continuous deep Q-learning with model-based acceleration. CoRR abs/1603.00748.
 [Haarnoja et al.2017] Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. CoRR abs/1702.08165.
 [Hansen, Müller, and Koumoutsakos2003] Hansen, N.; Müller, S. D.; and Koumoutsakos, P. 2003. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evol. Comput. 11(1):1–18.
 [Harmon and Baird1996a] Harmon, M. E., and Baird, L. C. 1996a. Advantage learning applied to a game with nonlinear dynamics and a nonlinear function approximator. Proceedings of the International Conference on Neural Networks (ICNN).
 [Harmon and Baird1996b] Harmon, M. E., and Baird, L. C. 1996b. Multi-player residual advantage learning with general function approximation. Technical report, Wright Laboratory.
 [Henderson et al.2017] Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2017. Deep reinforcement learning that matters. CoRR abs/1709.06560.

 [Hoeffding1963] Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301):13–30.
 [Homem-de-Mello2007] Homem-de-Mello, T. 2007. A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing 19(3):381–394.
 [Hu, Fu, and Marcus2007] Hu, J.; Fu, M. C.; and Marcus, S. I. 2007. A model reference adaptive search method for global optimization. Operations Research 55(3):549–568.
 [Kakade2001] Kakade, S. 2001. A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press.
 [Kroese, Porotsky, and Rubinstein2006] Kroese, D. P.; Porotsky, S.; and Rubinstein, R. Y. 2006. The cross-entropy method for continuous multi-extremal optimization. Methodology and Computing in Applied Probability 8(3):383–407.
 [Kushner and Clark2012] Kushner, H. J., and Clark, D. S. 2012. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Science & Business Media.
 [Lillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR abs/1509.02971.
 [Mannor, Rubinstein, and Gat2003] Mannor, S.; Rubinstein, R.; and Gat, Y. 2003. The cross entropy method for fast policy search. In Proceedings of the Twentieth International Conference on Machine Learning, 512–519. AAAI Press.
 [Melo and Ribeiro2007] Melo, F. S., and Ribeiro, M. I. 2007. Convergence of Q-learning with linear function approximation. In European Control Conference.
 [Mnih et al.2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
 [Mnih et al.2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783.
 [Morris1982] Morris, C. N. 1982. Natural exponential families with quadratic variance functions. The Annals of Statistics 65–80.
 [Peters and Schaal2007] Peters, J., and Schaal, S. 2007. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, 745–750. ACM.
 [Peters and Schaal2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180–1190.
 [Robbins and Monro1985] Robbins, H., and Monro, S. 1985. A stochastic approximation method. In Herbert Robbins Selected Papers. Springer. 102–109.
 [Rubinstein and Shapiro1993] Rubinstein, R. Y., and Shapiro, A. 1993. Discrete event systems: Sensitivity analysis and stochastic optimization by the score function method, volume 1. Wiley New York.
 [Rubinstein1999] Rubinstein, R. 1999. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability 1(2):127–190.
 [Salimans et al.2017] Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. CoRR abs/1703.03864.
 [Schulman et al.2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization. CoRR abs/1502.05477.
 [Schulman et al.2016] Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2016. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations.
 [Schulman et al.2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.
 [Sen and Singer2017] Sen, P. K., and Singer, J. M. 2017. Large Sample Methods in Statistics (1994): An Introduction with Applications. CRC Press.
 [Silver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, Volume 32, I–387–I–395. JMLR.org.
 [Sutton et al.2000] Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.
 [Sutton1984] Sutton, R. S. 1984. Temporal Credit Assignment in Reinforcement Learning. Ph.D. Dissertation.
 [Szita and Lörincz2006] Szita, I., and Lörincz, A. 2006. Learning Tetris using the noisy cross-entropy method. Neural Computation 18(12):2936–2941.
 [Todorov, Erez, and Tassa2012] Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In IROS, 5026–5033. IEEE.
 [Uhlenbeck and Ornstein1930] Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the Brownian motion. Physical Review 36(5):823.
 [van Hasselt, Guez, and Silver2015] van Hasselt, H.; Guez, A.; and Silver, D. 2015. Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461.
 [Wang et al.2016] Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; and de Freitas, N. 2016. Sample efficient actor-critic with experience replay. CoRR abs/1611.01224.
 [Watkins and Dayan1992] Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learning 279–292.
 [Wu et al.2017] Wu, Y.; Mansimov, E.; Liao, S.; Grosse, R. B.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. CoRR abs/1708.05144.
1 Convergence Analysis
In this section, we prove that the stochastic Conditional Cross-Entropy Method update for the Actor tracks an underlying deterministic ODE for the expected cross-entropy update over states. We begin by providing some definitions, particularly for the quantile function, which is central to the analysis. We then lay out the assumptions and discuss some policy parameterizations that satisfy those assumptions. Finally, we state the theorem with proof, and provide one lemma needed to prove the theorem in the final subsection.
1.1 Notation and Definitions
Notation: For a set $A$, let $A^{o}$ represent the interior of $A$, while $\partial A$ is the boundary of $A$. The abbreviation a.s. stands for almost surely and i.o. stands for infinitely often. Let $\mathbb{N}$ represent the set of natural numbers. For a set $A$, we let $I_{A}$ be the indicator function/characteristic function of $A$, defined as $I_{A}(x) = 1$ if $x \in A$ and 0 otherwise. Let $\mathbb{E}[\cdot]$, $\mathbb{V}[\cdot]$ and $\mathbb{P}(\cdot)$ denote the expectation, variance and probability measure. For a $\sigma$-field $\mathcal{F}$, let $\mathbb{E}[\cdot \mid \mathcal{F}]$ represent the conditional expectation w.r.t. $\mathcal{F}$. A function $f$ is called Lipschitz continuous if there exists $L \geq 0$ such that $\|f(x) - f(y)\| \leq L \|x - y\|$ for all $x, y$ in its domain. A function $f$ is called locally Lipschitz continuous if for every $x$, there exists a neighbourhood of $x$ on which $f$ is Lipschitz continuous. Let $C(X, Y)$ represent the space of continuous functions from $X$ to $Y$. Also, let $B_{r}(x)$ represent an open ball of radius $r$ centered at $x$. For a positive integer $n$, let $[n] := \{1, 2, \dots, n\}$.
Definition 1.
A function $f: V \to U$ between normed vector spaces is Frechet differentiable at $x \in V$ if there exists a bounded linear operator $Df(x): V \to U$ such that the limit
(3) $\displaystyle \lim_{\alpha \to 0} \frac{f(x + \alpha h) - f(x)}{\alpha}$
exists for every $h \in V$ and is equal to $Df(x)(h)$. We say $f$ is Frechet differentiable if the Frechet derivative of $f$ exists at every point in its domain.
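As a concrete illustration of this definition (our own example, not taken from the paper), consider the squared Euclidean norm:

```latex
% Worked example (ours): Frechet derivative of f(x) = \|x\|_2^2 on \mathbb{R}^d.
\[
  f(x + \alpha h) - f(x)
  = 2\alpha\, x^{\top} h + \alpha^{2} \|h\|_2^{2},
  \qquad\text{so}\qquad
  \lim_{\alpha \to 0} \frac{f(x + \alpha h) - f(x)}{\alpha}
  = 2\, x^{\top} h .
\]
% The map h \mapsto 2 x^{\top} h is bounded and linear, hence
% Df(x)(h) = 2 x^{\top} h, i.e., the derivative operator is Df(x) = 2 x^{\top}.
```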
Definition 2.
Given a bounded real-valued continuous function $f$ and a scalar $\rho \in (0, 1]$, we define the $(1-\rho)$-quantile of $f$ w.r.t. the PDF $g$ (denoted $f_{\rho}(g)$) as follows:
(4) $f_{\rho}(g) := \sup_{\ell \in \mathbb{R}} \left\{ \ell : \mathbb{P}_{g}\left( f(X) \geq \ell \right) \geq \rho \right\},$
where $\mathbb{P}_{g}$ is the probability measure induced by the PDF $g$, i.e., for a Borel set $B$, $\mathbb{P}_{g}(B) := \int_{B} g(x)\, dx$.
This quantile operator will be used to succinctly write the quantile for , with actions selected according to , i.e.,
(5) 
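To make the quantile operator concrete, the following sketch estimates a $(1-\rho)$-quantile from samples, as cross-entropy-style updates do in practice. The function `f` and the sampling distribution below are hypothetical stand-ins for the action-value function and the policy:

```python
import math
import random

# Hedged sketch: estimating the (1 - rho)-quantile of f(A), A ~ g, from
# samples. f and g here are toy stand-ins (our own choices) for the
# action-values and the policy density.

def empirical_quantile(f, sample, rho):
    """Largest threshold ell such that a fraction >= rho of the sampled
    f-values lie at or above ell: the elite cutoff in CEM-style methods."""
    values = sorted((f(a) for a in sample), reverse=True)
    n_elite = max(1, math.ceil(rho * len(values)))
    return values[n_elite - 1]

rng = random.Random(0)
f = lambda a: -(a - 0.3) ** 2                        # toy "action-value", peak at 0.3
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # A ~ g = N(0, 1)

ell = empirical_quantile(f, sample, rho=0.1)
elite = [a for a in sample if f(a) >= ell]           # top 10% of sampled actions
print(ell, len(elite))
```

The elite set concentrates around the maximizing action, which is exactly the set of actions a conditional CEM update would increase probability on.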
1.2 Assumptions
Assumption 1.
We are given a realization of the transition dynamics of the MDP in the form of a sequence of transition tuples, where each state is drawn using a latent sampling distribution, an action is chosen at that state, and the transitioned state and reward are then observed. We further assume that the reward is uniformly bounded.
Here, we analyze the long-run behaviour of the conditional cross-entropy recursion (actor), which is defined as follows:
(6)  
(7) 
Here, is the projection operator onto the compact (closed and bounded) and convex set with a smooth boundary . Therefore, maps vectors in to the nearest vectors in w.r.t. the Euclidean distance (or equivalent metric). Convexity and compactness ensure that the projection is unique and belongs to .
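A minimal sketch of such a projection operator, assuming the constraint set is a box or a Euclidean ball (hypothetical concrete choices; the set in the analysis is an abstract compact convex set):

```python
import math

# Hedged sketch: Euclidean projection onto two common compact convex sets.
# The constraint set in the analysis is abstract; a box and an L2 ball are
# hypothetical concrete instances for illustration.

def project_box(x, lo, hi):
    """Project x onto the box [lo, hi]^d (componentwise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]

def project_ball(x, radius):
    """Project x onto the L2 ball of the given radius centered at the origin."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)          # already inside: projection is the identity
    return [xi * radius / norm for xi in x]

print(project_box([2.0, -3.0, 0.5], -1.0, 1.0))   # -> [1.0, -1.0, 0.5]
print(project_ball([3.0, 4.0], 1.0))              # -> [0.6, 0.8]
```

In both cases convexity and compactness make the nearest point unique, which is the property the recursion relies on.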
Assumption 2.
The predetermined, deterministic step-size sequences are positive scalars that satisfy the following:
The first conditions in Assumption 2 are the classical Robbins-Monro conditions [Robbins and Monro1985] required for stochastic approximation algorithms. The last two conditions enable the different stochastic recursions to evolve on separate timescales. Indeed, they ensure that one recursion proceeds relatively faster than the others. This timescale separation is needed to obtain the pursued coherent asymptotic behaviour, as we describe in the next section.
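As an illustration (the exponents are our own hypothetical choices, not taken from the paper), schedules of the form $a_t = t^{-0.6}$ and $b_t = t^{-0.9}$ satisfy the Robbins-Monro conditions, and their ratio decays to zero, giving the required timescale separation:

```python
# Hedged sketch: step-size schedules satisfying the Robbins-Monro conditions
# plus timescale separation. The exponents 0.6 and 0.9 are hypothetical
# choices; any exponent in (0.5, 1] gives sum a_t = inf and sum a_t^2 < inf.

def fast_step(t):
    """Faster-timescale step size a_t."""
    return (t + 1) ** -0.6

def slow_step(t):
    """Slower-timescale step size b_t."""
    return (t + 1) ** -0.9

# Timescale separation: b_t / a_t = (t + 1)^{-0.3} -> 0, so the b-recursion
# evolves "slowly" relative to the a-recursion.
ratios = [slow_step(t) / fast_step(t) for t in (10, 1000, 100000)]
print(ratios)   # strictly decreasing toward 0
```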
Assumption 3.
The predetermined, deterministic, sample length schedule is positive and strictly monotonically increases to and .
Assumption 3 states that the number of samples increases to infinity, and is primarily required to ensure that the estimation error arising from the sample quantiles eventually decays to zero. In practice, one can instead use a fixed, finite, positive sample length that is large enough to accommodate an acceptable error.
Assumption 4.
The sequence satisfies , where is a convex, compact set. Also, for , let , .
Assumption 4 assumes stability of the Expert, and minimally requires only that the values remain in a bounded range. We make no additional assumptions on the convergence properties of the Expert, as we simply need stability to prove that the Actor tracks the desired update.
Assumption 5.
For and , let , and .
Assumption 5 implies that there always exists a strictly positive probability mass beyond every threshold. This assumption is easily satisfied when the action-value function is continuous in the action and the policy is a continuous probability density function.
Assumption 6.
Assumption 7.
For , is locally Lipschitz continuous w.r.t. .
Assumptions 6 and 7 are technical requirements; they can be justified and more appropriately characterized when we consider the stochastic policy to belong to the popular natural exponential family (NEF) of distributions.
Definition 3.
Natural exponential family of distributions (NEF)[Morris1982]:
These probability distributions over $\mathbb{R}^{d}$ are represented by
(8) $\pi(a; \theta) := h(a) \exp\left( \theta^{\top} \Gamma(a) - K(\theta) \right),$
where $\theta \in \Theta$ is the natural parameter, while $\Gamma(a)$ is called the sufficient statistic and $K(\theta) := \ln \int h(a) \exp(\theta^{\top} \Gamma(a))\, da$ is called the cumulant function of the family. The space $\Theta$ is defined as $\Theta := \{\theta : |K(\theta)| < \infty\}$. Also, the above representation is assumed minimal.^{3}
^{3} For a distribution in the NEF, there may exist multiple representations of the form (8). However, there always exists a representation in which the components of the sufficient statistic are linearly independent, and such a representation is referred to as minimal.
A few popular distributions which belong to the NEF family include Binomial, Poisson, Bernoulli, Gaussian, Geometric and Exponential distributions.
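As a sanity check (our own illustration, not from the paper), the fixed-variance Gaussian can be written in the NEF form of Eq. (8) with sufficient statistic $\Gamma(a) = a$ and cumulant $K(\theta) = \theta^2/2$:

```python
import math

# Hedged sketch: the Gaussian N(mu, 1) as a natural exponential family.
# With Gamma(a) = a, natural parameter theta = mu, cumulant K(theta) =
# theta^2 / 2, and carrier h(a) = exp(-a^2/2) / sqrt(2*pi), the NEF form
# h(a) exp(theta * a - K(theta)) recovers the usual N(mu, 1) density.

def nef_gaussian_pdf(a, theta):
    h = math.exp(-a * a / 2.0) / math.sqrt(2.0 * math.pi)   # carrier h(a)
    K = theta * theta / 2.0                                  # cumulant K(theta)
    return h * math.exp(theta * a - K)

def gaussian_pdf(a, mu):
    return math.exp(-((a - mu) ** 2) / 2.0) / math.sqrt(2.0 * math.pi)

for a, mu in [(0.0, 0.5), (1.3, -0.7), (-2.0, 2.0)]:
    assert abs(nef_gaussian_pdf(a, mu) - gaussian_pdf(a, mu)) < 1e-12
print("NEF form matches the N(mu, 1) density")
```

The algebra behind the check: $h(a)\exp(\theta a - \theta^2/2) = \frac{1}{\sqrt{2\pi}} \exp(-(a-\theta)^2/2)$, i.e., the natural parameter is exactly the mean.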
We parametrize the policy using a neural network, which implies that, when we consider an NEF stochastic policy, the natural parameter of the NEF is parametrized by the network. To be more specific, for a given state, the output of the actor network represents the natural parameter of the NEF policy at that state. Further,
(9) 
Therefore, Assumption 7 can be directly satisfied by assuming that the network output is twice continuously differentiable w.r.t. the network parameters.
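A minimal sketch of this parameterization (the architecture, sizes, and weights below are all hypothetical): a small network with smooth tanh activations maps a state to the natural parameter of a Gaussian NEF policy, so the natural parameter is smooth in the weights:

```python
import math
import random

# Hedged sketch: a tiny one-hidden-layer network mapping a state to the
# natural parameter theta of an NEF (here Gaussian) policy. tanh and the
# linear maps are smooth, so theta is twice continuously differentiable in
# the weights, as the discussion of Assumption 7 requires. All sizes and
# weights are hypothetical.

rng = random.Random(0)
STATE_DIM, HIDDEN, THETA_DIM = 3, 8, 1
W1 = [[rng.gauss(0, 0.5) for _ in range(STATE_DIM)] for _ in range(HIDDEN)]
W2 = [[rng.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(THETA_DIM)]

def natural_param(state):
    """theta(s; w): smooth in both the state and the weights."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

theta = natural_param([0.2, -1.0, 0.4])
# Sample an action from the induced Gaussian N(theta, 1) policy.
action = rng.gauss(theta[0], 1.0)
print(theta, action)
```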
The next assumption is a standard assumption that the sample average converges with an exponential rate in the number of samples. The assumption reflects that this should be true for arbitrary .
Assumption 8.
For and , we have
where .
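The flavour of this assumption can be checked empirically. For i.i.d. variables bounded in $[0, 1]$, Hoeffding's inequality [Hoeffding1963] gives $\mathbb{P}(|\bar{X}_N - \mu| \geq \epsilon) \leq 2\exp(-2N\epsilon^2)$; the toy simulation below (our own construction, not the paper's setting) verifies that the observed deviation frequency respects this exponential bound:

```python
import math
import random

# Hedged sketch: exponential concentration of sample averages, as posited
# by Assumption 8. Toy uniform-[0, 1] data (our own choice); Hoeffding's
# inequality bounds the probability that the sample mean deviates from the
# true mean 0.5 by at least eps.

def deviation_frequency(n_samples, eps, n_trials, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        mean = sum(rng.random() for _ in range(n_samples)) / n_samples
        hits += abs(mean - 0.5) >= eps
    return hits / n_trials

N, EPS = 100, 0.2
bound = 2.0 * math.exp(-2.0 * N * EPS ** 2)      # Hoeffding bound, ~6.7e-4
freq = deviation_frequency(N, EPS, n_trials=2000)
print(freq, bound)
```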
Assumption 9.
For every , and , (from Eq. (5)) exists and is unique.
The above assumption ensures that the true quantile is unique; it is usually satisfied for most distributions and a well-behaved .
1.3 Main Theorem
To analyze the algorithm, we employ the ODE-based analysis proposed in [Borkar2008, Kushner and Clark2012]. The actor recursions (Eqs. (6)–(7)) represent a classical two-timescale stochastic approximation recursion, where there exists a bilateral coupling between the individual stochastic recursions (6) and (7). Since the step-size schedules satisfy the conditions of Assumption 2, one recursion proceeds relatively faster than the other. This disparity induces a pseudo-heterogeneous rate of convergence (or timescales) between the individual stochastic recursions, which further amounts to the asymptotic emergence of a stable, coherent behaviour that is quasi-asynchronous. This pseudo-behaviour can be interpreted from multiple viewpoints: when viewed from the faster-timescale recursion, the slower-timescale recursion appears quasi-static ('almost a constant'); likewise, when observed from the slower timescale, the faster-timescale recursion seems equilibrated. The existence of this stable long-run behaviour, under certain standard assumptions of stochastic approximation algorithms, is rigorously established in [Borkar1997] and in Chapter 6 of [Borkar2008]. For our stochastic approximation setting (Eqs. (6)–(7)), we can directly apply this appealing characterization of the long-run behaviour of two-timescale stochastic approximation algorithms (after ensuring the compliance of our setting with the prerequisites demanded by the characterization) by considering the slow-timescale stochastic recursion (7) to be quasi-stationary while analyzing the limiting behaviour of the faster-timescale recursion (6). Similarly, we let the Expert iterates be quasi-stationary too. The asymptotic behaviour of the slower-timescale recursion is then analyzed by considering the faster-timescale temporal variable at the limit point obtained during the quasi-stationary analysis.
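The quasi-stationarity argument can be visualized with a toy two-timescale recursion (our own illustration, not the paper's actor recursions): the fast iterate continually re-equilibrates to the slowly drifting iterate, and both settle at the slow recursion's limit:

```python
import random

# Hedged sketch: a toy two-timescale stochastic approximation. The fast
# iterate x tracks the quasi-static slow iterate y, while y slowly drifts
# toward a fixed target. Step-size exponents and the target are
# hypothetical choices.

rng = random.Random(0)
TARGET = 2.0
x, y = 0.0, 0.0
for t in range(1, 200001):
    a_t = t ** -0.6                    # fast step size
    b_t = t ** -0.9                    # slow step size (b_t / a_t -> 0)
    noise = rng.gauss(0.0, 0.1)
    x += a_t * (y - x + noise)         # fast: chases the current y
    y += b_t * (TARGET - y + noise)    # slow: drifts toward TARGET
print(x, y)   # both near TARGET; x has "equilibrated" to y at every stage
```

From the fast timescale, `y` looks almost constant between updates of `x`; from the slow timescale, `x` is always (approximately) at its equilibrium `y`, which is exactly the two-viewpoint interpretation described above.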
Define the filtration , a family of increasing natural σ-fields, where .
Theorem 2.
Proof.
First, we rewrite the stochastic recursion (6) under the hypothesis that the slower-timescale quantities are quasi-stationary, as follows:
(11) 
where and , i.e., the gradient w.r.t. at . Also,
(12) 
(13) 