Actor-Expert: A Framework for using Action-Value Methods in Continuous Action Spaces

by Sungsu Lim, et al.
University of Alberta
Indiana University

Value-based approaches can be difficult to use in continuous action spaces, because an optimization has to be solved to find the greedy action for the action-values. A common strategy has been to restrict the functional form of the action-values to be convex or quadratic in the actions, to simplify this optimization. Such restrictions, however, can prevent learning accurate action-values. In this work, we propose the Actor-Expert framework for value-based methods, which decouples action-selection (Actor) from the action-value representation (Expert). The Expert uses Q-learning to update the action-values towards the optimal action-values, whereas the Actor learns to output the greedy action for the current action-values. We develop a Conditional Cross Entropy Method for the Actor, to learn the greedy action for a generically parameterized Expert, and provide a two-timescale analysis to validate asymptotic behavior. We demonstrate in a toy domain with bimodal action-values that previous restrictive action-value methods fail, whereas the decoupled Actor-Expert with a more general action-value parameterization succeeds. Finally, we demonstrate that Actor-Expert performs as well as or better than these other methods on several benchmark continuous-action domains.




Model-free control methods are currently divided into two main branches: value-based methods and policy gradient methods. Value-based methods, such as Q-learning, have been quite successful in discrete-action domains [Mnih et al.2013, van Hasselt, Guez, and Silver2015], whereas policy gradient methods have been more commonly used in continuous action spaces. One reason for this choice is that finding the greedy action for Q-learning can be difficult in continuous-action spaces, because it requires solving an optimization problem over actions.

A common strategy when using action-value methods in continuous action spaces has been to restrict the form of the action-values, to make the optimization over actions easy to solve. Wire-fitting [Baird and Klopf1993, del R Millán, Posenato, and Dedieu2002] interpolates between a set of action points, adjusting those points over time to force one interpolation action point to become the maximizing action. Normalized Advantage Functions (NAF) [Gu et al.2016b] learn an advantage function [Baird1993, Harmon and Baird1996a, Harmon and Baird1996b] by constraining the advantage function to be quadratic in terms of the actions, keeping track of the vertex of the parabola. Partially Input Convex Neural Networks (PICNN) are learned such that action-values are guaranteed to be convex in terms of the action [Amos, Xu, and Kolter2016]. To enable convex functions to be learned, however, PICNNs are restricted to non-negative weights and ReLU activations, and the maximizing action is found with an approximate gradient descent from random action points.

Another direction has been to parameterize the policy using the action-values, and to use a soft Q-learning update instead [Haarnoja et al.2017]. For action selection, the policy is parameterized as an energy-based model using the action-values. This approach avoids the difficult optimization over actions, but sampling an action from the policy can be expensive. The action-values can be an arbitrary (energy) function, and sampling from the corresponding energy-based model requires an approximate sampling routine, like MCMC. Moreover, it optimizes an entropy-regularized objective, which differs from the traditional objective in most other action-value learning algorithms, like Q-learning.

Policy gradient methods, on the other hand, learn a simple parametric distribution or a deterministic function over actions that can be easily used in continuous action spaces. In recent years, policy gradient methods have been particularly successful in continuous action benchmark domains [Duan et al.2016], facilitated by the Actor-Critic framework. Actor-Critic methods, first introduced in [Sutton1984], use a Critic (value function) that evaluates the current policy, to help compute the gradient for the Actor (policy). This separation into Actor and Critic enabled the two components to be optimized in a variety of ways, facilitating algorithm development. The Actor can incorporate different update mechanisms to achieve better sample efficiency [Mnih et al.2016, Kakade2001, Peters and Schaal2008, Wu et al.2017] or stable learning [Schulman et al.2015, Schulman et al.2017]. The Critic can be used as a baseline or control variate to reduce variance [Greensmith, Bartlett, and Baxter2004, Gu et al.2016a, Schulman et al.2016], and improve sample efficiency by incorporating off-policy samples [Degris, White, and Sutton2012, Silver et al.2014, Lillicrap et al.2015, Wang et al.2016].

In this work, we propose a framework called Actor-Expert, that parallels Actor-Critic, but for value-based methods, facilitating use of Q-learning for continuous action spaces. Actor-Expert decouples optimal action selection (Actor) from action-value representation (Expert), enabling a variety of optimization methods to be used for the Actor. The Expert learns the action-values using Q-learning. The Actor learns the greedy action by iteratively updating towards an estimate of the maximum action for the action-values given by the Expert. This decoupling also enables any Actor to be used, including any exploration mechanism, without interfering with the Expert’s goal to learn the optimal action-values. Actor-Expert is different from Actor-Critic because the Expert uses Q-learning—the Bellman optimality operator—whereas the Critic performs policy evaluation to get values of the current (sub-optimal) policy. In Actor-Expert, the Actor tracks the Expert, to track the greedy action, whereas in Actor-Critic, the Critic tracks the Actor, to track the policy values.

Taking advantage of this formalism, we introduce a Conditional Cross Entropy Method for the Actor, which puts minimal restrictions on the form of the action-values. The basic idea is to iteratively increase the likelihood of near-maximal actions for the Expert over time, extending the global optimization algorithm, the Cross Entropy Method [Rubinstein1999], to be conditioned on state. We show in a toy domain with bimodal action-values, which are neither quadratic nor convex, that previous action-value methods with restrictive action-values (NAF and PICNN) perform poorly, whereas Actor-Expert learns the optimal policy well. We then show, on several continuous-action benchmark domains, that our algorithm outperforms previous value-based methods and an instance of an Actor-Critic method, Deep Deterministic Policy Gradient (DDPG).

Background and Problem Formulation

The interaction between the agent and environment is formalized as a Markov decision process $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the one-step state transition dynamics, $R$ is the reward function and $\gamma$ is the discount rate. At each discrete time step $t = 1, 2, \ldots$, the agent selects an action $A_t \sim \pi(\cdot \mid S_t)$ according to policy $\pi$, the agent transitions to state $S_{t+1}$ according to $P$, and observes a scalar reward $R_{t+1}$.

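For value-based methods, the objective is to find the fixed point of the Bellman optimality operator:

$$Q^*(s,a) = \int_{\mathcal{S}} P(s,a,s') \left[ R(s,a,s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s',a') \right] ds'.$$

The corresponding optimal policy selects a greedy action from the set $\arg\max_{a \in \mathcal{A}} Q^*(s,a)$. These optimal Q-values are typically learned using Q-learning [Watkins and Dayan1992]: for action-values $Q_\theta$ parameterized by $\theta$, the iterative updates are

$$\theta_{t+1} = \theta_t + \alpha_t \delta_t \nabla_\theta Q_\theta(S_t, A_t) \quad \text{for} \quad \delta_t = R_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q_\theta(S_{t+1}, a') - Q_\theta(S_t, A_t).$$

Q-learning is an off-policy algorithm that can learn the action-values for the optimal policy while following a different (exploratory) behaviour policy.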
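To make the update concrete, here is a minimal Python sketch of one Q-learning step with a linear action-value function. The feature map `features`, the stepsize, and the use of a finite grid of candidate actions to approximate the max over actions are all assumptions made for this illustration, not details taken from the paper.

```python
import numpy as np

def features(state, action):
    # Hypothetical feature map: low-order polynomial features of (s, a).
    return np.array([1.0, state, action, state * action, action ** 2])

def q_value(theta, state, action):
    # Linear action-value estimate Q_theta(s, a) = theta^T phi(s, a).
    return float(theta @ features(state, action))

def q_learning_step(theta, s, a, r, s_next, candidate_actions, alpha=0.1, gamma=0.99):
    # The max over a' is approximated on a finite candidate set (an assumption
    # of this sketch; this is exactly the step that is hard in continuous actions).
    max_next = max(q_value(theta, s_next, a2) for a2 in candidate_actions)
    delta = r + gamma * max_next - q_value(theta, s, a)   # TD error
    # For a linear Q, the gradient w.r.t. theta is the feature vector.
    return theta + alpha * delta * features(s, a), delta

theta = np.zeros(5)
candidates = np.linspace(-1.0, 1.0, 21)
theta, delta = q_learning_step(theta, s=0.5, a=0.2, r=1.0, s_next=0.4,
                               candidate_actions=candidates)
```

The grid-based max is only workable for low-dimensional actions, which is the difficulty the rest of the paper addresses.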

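Policy gradient methods directly optimize a parameterized policy $\pi_w$, with parameters $w$. The objective is typically an average reward objective,

$$\max_w \int_{\mathcal{S}} d_{\pi_w}(s) \int_{\mathcal{A}} \pi_w(a \mid s)\, r(s,a)\, da\, ds,$$

where $r(s,a)$ is the expected one-step reward and $d_{\pi_w}$ is the stationary distribution over states, representing state visitation. Policy gradient methods estimate gradients of this objective [Sutton et al.2000]:

$$\int_{\mathcal{S}} d_{\pi_w}(s) \int_{\mathcal{A}} \nabla_w \pi_w(a \mid s)\, Q^{\pi_w}(s,a)\, da\, ds.$$

For example, in the policy-gradient approach called Actor-Critic [Sutton1984], the Critic estimates $Q^{\pi_w}$ and the Actor uses the Critic to obtain an estimate of the above gradient to adjust the policy parameters $w$.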

Action-value methods for continuous actions can be difficult to use, because an optimization over actions needs to be solved, both for decision-making and for the Q-learning update. For a reasonably small number of discrete actions, $\max_{a} Q_\theta(s,a)$ is straightforward to solve, by iterating across all actions. For continuous actions, $Q_\theta(s,a)$ cannot be queried for all actions, and the optimization can be difficult to solve, for example if $Q_\theta(s,a)$ is non-convex in $a$.

Actor-Expert Formalism

We propose a new framework for value-based methods, with an explicit Actor. The goal is to provide a similar framework to Actor-Critic—which has been so successful for algorithm development of policy gradient methods—to simplify algorithm development for value-based methods. The Expert learns using Q-learning, but with an explicit actor that provides the greedy actions. The Actor has two roles: to select which action to take (behavior policy) and to provide the greedy action for the Expert’s Q-learning target. In this section, we develop a Conditional Cross Entropy Method for the Actor, to estimate the greedy action, and provide theoretical guarantees that the approach tracks a changing Expert.

Conditional Cross Entropy Method for the Actor

The primary role of the Actor is to identify, or learn, $\arg\max_{a} Q_\theta(s,a)$ for the Expert. Different strategies can be used to obtain this greedy action on each step. The simplest strategy is to solve this optimization with gradient ascent, to convergence, on every time step. This is problematic for two reasons: it is expensive and it is likely to get stuck in suboptimal stationary points.

Consider now a slightly more effective strategy, which learns an Actor that can provide an approximate greedy action to serve as a good initial point for gradient ascent. Such a strategy reduces the number of gradient ascent steps required, and so makes it more feasible to solve the gradient ascent problem on each step. After obtaining an action $a^*$ at the end of the gradient ascent iterations, the Actor can be trained towards $a^*$, using a supervised learning update on $(s, a^*)$. The Actor will slowly learn to select better initial actions, conditioned on state, that are near stationary points for $Q_\theta(s, \cdot)$, which hopefully correspond to high-value actions. This Actor learns to maximize $Q_\theta(s, \cdot)$, reducing computational complexity, but still suffers from reaching suboptimal stationary points.

To overcome this issue, we propose an approach inspired by the Cross Entropy Method from global optimization. Global optimization strategies are designed to find the global optimum of a function $f(\theta)$ for some parameters $\theta$. For example, for parameters $\theta$ of a neural network, $f$ may be the loss function on a sample of data. The advantage of these methods is that they do not rely on gradient-based strategies, which are prone to getting stuck in saddle points and local optima. Instead, they use randomized search strategies that have been shown to be effective in practice [Salimans et al.2017, Peters and Schaal2007, Szita and Lörincz2006, Hansen, Müller, and Koumoutsakos2003].

One such algorithm is the Cross Entropy Method (CEM) [Rubinstein1999]. This method maintains a distribution $p(\theta)$ over parameters $\theta$, starting with a wide distribution, such as a Gaussian distribution with mean zero $\mu_0 = 0$ and a diagonal covariance $\Sigma_0$ of large magnitude. The high-level idea is elegantly simple. On each iteration $t$, the goal is to minimize the KL-divergence to the uniform distribution over parameters where the objective function is greater than some threshold: $I(f(\theta) \geq \text{threshold})$. This distribution can be approximated with an empirical distribution, such as by sampling several parameter vectors $\theta_1, \ldots, \theta_N$ and keeping those $\theta_1^*, \ldots, \theta_h^*$ with $f(\theta_i) \geq \text{threshold}$ and discarding the rest. Each minimization of the KL-divergence to this empirical distribution $\hat{I} = \{\theta_1^*, \ldots, \theta_h^*\}$, for $h < N$, corresponds to maximizing the likelihood of the parameters in the set $\hat{I}$ under the distribution $p_t$. Iteratively, the distribution over parameters narrows around higher valued $\theta$. Sampling the $\theta_i$ from $p_t$ narrows the search over $\theta$ and makes it more likely for them to produce a useful approximation to $I(f(\theta) \geq \text{threshold})$.
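The CEM loop described above can be sketched in a few lines of Python. This is an illustrative sketch only: the bimodal objective `f`, the one-dimensional Gaussian search distribution, and the settings (100 samples, top 20% kept, 30 iterations) are assumptions chosen for this example.

```python
import numpy as np

def f(x):
    # Hypothetical bimodal objective (illustrative only): a local peak near
    # x = -1 and a higher peak near x = +1.
    return 1.0 * np.exp(-(x + 1.0) ** 2 / 0.08) + 1.5 * np.exp(-(x - 1.0) ** 2 / 0.08)

def cem_step(rng, mean, std, n_samples=100, elite_frac=0.2):
    """One CEM iteration for a 1-D Gaussian search distribution."""
    samples = rng.normal(mean, std, size=n_samples)
    n_elite = int(np.ceil(elite_frac * n_samples))
    # Keep the samples with the highest objective values (the empirical distribution).
    elite = samples[np.argsort(f(samples))[-n_elite:]]
    # Maximum-likelihood Gaussian fit to the elite set; the small constant
    # keeps the standard deviation strictly positive.
    new_mean, new_std = elite.mean(), elite.std() + 1e-3
    return new_mean, new_std, samples, elite

rng = np.random.default_rng(0)
mean, std = 0.0, 2.0  # wide initial distribution
for _ in range(30):
    mean, std, samples, elite = cem_step(rng, mean, std)
print(mean, std)
```

Because a single Gaussian is unimodal, this sketch commits to one peak; which peak it settles on depends on the samples drawn, which is one motivation for the multi-modal policy used later.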

CEM, however, finds the single-best set of optimal parameters for a single optimization problem. Most of the work using CEM in reinforcement learning aims to learn a single-best set of parameters that optimize towards higher roll-out returns [Szita and Lörincz2006, Mannor, Rubinstein, and Gat2003]. However, our goal is not to do a single global optimization over returns, but rather a repeated optimization to select maximal actions, conditioned on each state. The global optimization strategy could be run on each step to find the exact best action for each current state, but this is expensive and throws away the information about the function surface gained in previous optimizations.

Figure 1: Actor-Expert with an Actor using a Conditional Cross Entropy Method (CCEM), with a bimodal distribution. The Actor and Expert share the same network to learn the state representation, but then learn separate functions: the policy distribution $\pi(\cdot \mid S_t)$ and the $Q$-function for the Expert, where the actions come in through late fusion. The policy is a conditional mixture model, with coefficients $c_i$, means $\mu_i$ and diagonal covariances $\Sigma_i$. Such a multimodal stochastic Actor naturally provides an exploration mechanism to gather data for the Expert (the Q-learner) and enables more than one optimal action in a state. For example, $Q(s, \cdot)$ could be bimodal due to symmetries in action selection, with equal value at two actions $a_1$ and $a_2$. The distribution could learn a bimodal distribution, with means $a_1$ and $a_2$ and equal coefficients $c_1 = c_2$. Alternatively, it can still learn to select one action, where the extra expressivity of two modes is collapsed into one.

We extend the Cross Entropy Method to be (a) conditioned on state and (b) learned iteratively over time. CEM is well-suited to extend to a conditional approach, for use in the Actor, because it provides a stochastic Actor that can explore naturally and is effective for smooth, non-convex functions [Kroese, Porotsky, and Rubinstein2006]. The idea is to iteratively update $\pi(\cdot \mid S_t)$, where previous updates conditioned on state generalize to similar states. The Actor learns a stochastic policy that slowly narrows around maximal actions, conditioned on states, as the agent does CEM updates iteratively for the functions $Q_\theta(S_1, \cdot), Q_\theta(S_2, \cdot), \ldots$.

The Conditional CEM (CCEM) algorithm replaces the learned $p(\cdot)$ with $\pi(\cdot \mid S_t)$, where $\pi(\cdot \mid S_t)$ can be any parametrized, multi-modal distribution. For a mixture model, for example, the parameters are conditional means $\mu_i(S_t)$, conditional diagonal covariances $\Sigma_i(S_t)$ and coefficients $c_i(S_t)$, for the $i$th component of the mixture. On each step, the conditional mixture model, $\pi(\cdot \mid S_t)$, is sampled to provide a set of actions $a_1, \ldots, a_N$ from which we construct the empirical distribution $\hat{I}(S_t) = \{a_1^*, \ldots, a_h^*\}$ where $h < N$ for state $S_t$ with current values $Q_\theta(S_t, \cdot)$. The parameters $w$ are updated using a gradient ascent step on the log-likelihood of the actions $\hat{I}(S_t)$ under $\pi$.

The high-level framework is given in Algorithm 1. The Expert is updated towards learning the optimal Q-values, with (a variant of) Q-learning. The Actor provides exploration and, over time, learns how to find the maximal action for the Expert in the given state, using the described Conditional CEM algorithm. The strategy for the empirical distribution is assumed to be given. We discuss two strategies we explore in the experiments, in the next subsection.

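Initialize Actor parameters $w$ and Expert parameters $\theta$.
for t=1, 2, … do
    Observe $S_t$, sample action $A_t \sim \pi_w(\cdot \mid S_t)$, and observe $R_{t+1}$, $S_{t+1}$
    Obtain maximum action $a'$ from Actor $\pi_w(\cdot \mid S_{t+1})$
    Update expert $\theta$, using Q-learning with $a'$
    Sample $N$ actions $a_i \sim \pi_w(\cdot \mid S_t)$
    Obtain empirical distribution $\hat{I}(S_t) = \{a_1^*, \ldots, a_h^*\}$ based on $a_1, \ldots, a_N$
    Increase likelihood for high-value actions: $w \leftarrow w + \alpha_{a,t} \sum_{j=1}^{h} \nabla_w \ln \pi_w(a_j^* \mid S_t)$
Algorithm 1 Actor-Expert (with Conditional CEM)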
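As a rough illustration of the Actor update in Algorithm 1, the Python sketch below runs Conditional CEM against a fixed Expert. It makes several simplifying assumptions that are not from the paper: the Expert is a hypothetical hand-written function `expert_q` (no Q-learning is performed), the policy is a single Gaussian with a mean linear in the state rather than a mixture density network, and the sample count, quantile, stepsize and clipping range are illustrative values.

```python
import numpy as np

def expert_q(s, a):
    # Hypothetical fixed Expert: the greedy action in state s is a = s.
    return -(a - s) ** 2

def ccem_actor_update(w, log_std, s, rng, n_samples=30, rho=0.2, lr=0.01):
    """One Conditional CEM step for a Gaussian policy N(w[0] + w[1]*s, exp(log_std)^2)."""
    mu, std = w[0] + w[1] * s, np.exp(log_std)
    actions = rng.normal(mu, std, size=n_samples)      # sample N actions from pi(.|s)
    n_elite = int(np.ceil(rho * n_samples))
    # Empirical distribution: the top-quantile actions under the Expert.
    elite = actions[np.argsort(expert_q(s, actions))[-n_elite:]]
    z = (elite - mu) / std
    grad_mu = np.mean(z / std)            # d/d mu of the mean elite log-likelihood
    grad_log_std = np.mean(z ** 2 - 1.0)  # d/d log_std of the mean elite log-likelihood
    # Gradient ascent on the log-likelihood; mu is linear in w, so chain rule gives [1, s].
    w = w + lr * grad_mu * np.array([1.0, s])
    # Clip the log standard deviation so it neither explodes nor vanishes (assumed range).
    log_std = np.clip(log_std + lr * grad_log_std, -2.0, 2.0)
    return w, log_std

rng = np.random.default_rng(0)
w, log_std = np.zeros(2), 0.0
for _ in range(5000):
    s = rng.uniform(-1.0, 1.0)            # states drawn from a fixed distribution
    w, log_std = ccem_actor_update(w, log_std, s, rng)
print(w, np.exp(log_std))
```

In this toy setting the policy mean should drift towards the greedy action $a = s$ (so `w` approaches `[0, 1]`) while the standard deviation narrows, mirroring how the Actor concentrates on maximal actions conditioned on state.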

We depict an Actor-Expert architecture where the Actor uses a mixture model in Figure 1. In our implementation, we use mixture density networks [Bishop1994] to learn a Gaussian mixture distribution. As in Figure 1, the Actor and Expert share the same neural network to obtain the representation for the state, and learn separate functions conditioned on that state. To obtain the maximal action under mixture models with a small number of components, we simply used the mean $\mu_i$ with the highest coefficient $c_i$. To prevent the diagonal covariance from exploding or vanishing, we bound it within a fixed interval using a tanh layer. We also follow standard practice of using experience replay and target networks to stabilize learning in neural networks. A more detailed algorithm for Actor-Expert with neural networks is described in Supplement 2.1.
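The two implementation details above, picking the mean of the highest-weight component as the greedy action and bounding the covariance with a tanh, can be sketched as follows. The function names and the bounds `low` and `high` (applied here to the log-variance) are hypothetical choices for this example; the paper's actual interval is not reproduced here.

```python
import numpy as np

def greedy_action(coefficients, means):
    # Greedy action for a mixture policy: the mean of the component
    # with the largest mixture coefficient.
    return means[int(np.argmax(coefficients))]

def bounded_log_variance(raw, low=-2.0, high=2.0):
    # tanh squashes the raw network output into (-1, 1); the result is then
    # rescaled to [low, high]. The bounds are assumed values for illustration.
    return low + 0.5 * (np.tanh(raw) + 1.0) * (high - low)

coefficients = np.array([0.3, 0.7])
means = np.array([[-1.0], [1.0]])   # two components, 1-D action
action = greedy_action(coefficients, means)
log_var = bounded_log_variance(np.array([-100.0, 0.0, 100.0]))
```

Taking the top-coefficient mean is a cheap approximation to the mode of the mixture; it is reasonable when components are few and well separated.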

Selecting the empirical distribution

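A standard strategy for selecting the empirical distributions in CEM is to use the top quantile of sampled variables, which are actions in this case (Algorithm 2). For $a_1, \ldots, a_N$ sampled from $\pi_w(\cdot \mid S_t)$, we select $\{a_1^*, \ldots, a_h^*\} \subset \{a_1, \ldots, a_N\}$ where $a_i^*$ are all $a_i$ with the top quantile values $Q_\theta(S_t, a_i)$. The resulting empirical distribution is $\hat{I}(S_t) = \{a_1^*, \ldots, a_h^*\}$, for $h < N$. This strategy is generic and, as we find empirically, effective.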

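Evaluate and sort in descending order: $Q_\theta(S_t, a_{i_1}) \geq \ldots \geq Q_\theta(S_t, a_{i_N})$
get top quantile: the first $h$ actions in this order
return $\hat{I}(S_t) = \{a_{i_1}, \ldots, a_{i_h}\}$ (where $h < N$)
Algorithm 2 Quantile Empirical Distribution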
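Algorithm 2 translates almost directly into code. The sketch below is one possible Python rendering; the function name, the quantile parameter `rho`, and the rule $h = \lceil \rho N \rceil$ are assumptions made for this illustration.

```python
import numpy as np

def quantile_empirical_distribution(q_fn, state, actions, rho=0.2):
    """Return the top-rho fraction of `actions`, ranked by q_fn(state, action)."""
    values = np.array([q_fn(state, a) for a in actions])
    order = np.argsort(-values)               # indices in descending order of value
    h = int(np.ceil(rho * len(actions)))      # number of actions to keep (assumed rule)
    return [actions[i] for i in order[:h]]

# Usage with a hypothetical action-value function peaked at a = 0.3.
q_fn = lambda s, a: -(a - 0.3) ** 2
elite = quantile_empirical_distribution(q_fn, None, [-1.0, 0.2, 0.5, 1.0], rho=0.5)
```

With four sampled actions and `rho=0.5`, the two actions closest to the peak are kept.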

For particular regularities in the action-values, however, we may be able to further improve this empirical distribution. For action-values differentiable in the action, we can perform a small number of gradient ascent steps from each sampled action $a_i$ to reach actions with slightly higher action-values (Algorithm 3). The empirical distribution, then, should contain a larger number of useful actions (those with higher action-values) on which to perform maximum likelihood, potentially also requiring fewer samples. In our experiments we perform 10 gradient ascent steps.

For each $a_i$, perform $n$ steps of gradient ascent on $Q_\theta(S_t, \cdot)$ starting from $a_i$, giving $a_i'$
return Quantile Empirical Distribution($\{a_1', \ldots, a_N'\}$)
Algorithm 3 Optimized Quantile Empirical Distr.
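A possible rendering of Algorithm 3 in Python is sketched below. The bimodal `q_fn`, the finite-difference gradient, the ascent stepsize and the quantile value are all assumptions for illustration; a neural-network Expert would normally supply the gradient with respect to the action via automatic differentiation.

```python
import numpy as np

def q_fn(state, a):
    # Hypothetical bimodal action-values (state is ignored in this toy example).
    return 1.0 * np.exp(-(a + 1.0) ** 2 / 0.5) + 1.5 * np.exp(-(a - 1.0) ** 2 / 0.5)

def dq_da(state, a, eps=1e-5):
    # Central finite-difference estimate of the gradient of Q w.r.t. the action.
    return (q_fn(state, a + eps) - q_fn(state, a - eps)) / (2.0 * eps)

def optimized_quantile(state, actions, rho=0.2, n_steps=10, step_size=0.05):
    refined = np.array(actions, dtype=float)
    for _ in range(n_steps):
        # A few gradient ascent steps from every sampled action.
        refined = refined + step_size * dq_da(state, refined)
    h = int(np.ceil(rho * len(refined)))
    order = np.argsort(-q_fn(state, refined))   # descending action-value
    return refined[order[:h]], refined

actions = np.linspace(-2.0, 2.0, 10)
elite, refined = optimized_quantile(None, actions, rho=0.5)
```

Each refined action has an action-value at least as high as the one it started from, so the quantile step then selects from a better pool of candidates.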

Theoretical guarantees for the Actor

In this section, we derive guarantees that the Conditional CEM Actor tracks a CEM update, for an evolving Expert. We follow a two-timescale stochastic approximation approach, where the action-values (Expert) change more slowly than the policy (Actor), allowing the Actor to track the maximal actions. (This is the opposite of Actor-Critic, for which the Actor changes slowly and the value estimates are on the faster timescale.) The Actor itself has two timescales, to account for its own parameters changing at different timescales. Actions for the maximum likelihood step are selected according to older (slower) parameters, so that it is as if the primary (faster) parameters are updated using samples from a fixed distribution.

We provide an informal theorem statement here, with a proof-sketch. We include the full theorem statement, with assumptions and proof, in Supplement 1.

Theorem 1 (Informal Convergence Result).

Let $\theta_t$ be the action-value parameters with stepsize $\alpha_{q,t}$, and $w_t$ be the policy parameters with stepsize $\alpha_{a,t}$, with a more slowly changing set of policy parameters $w'_t$ updated with stepsize $\alpha'_{a,t}$. Assume

  1. States are sampled from a fixed marginal distribution.

  2. is locally Lipschitz w.r.t. , .

  3. Parameters and remain bounded almost surely.

  4. Stepsizes are chosen for three different timescales, so that $w_t$ evolves faster than $w'_t$ and $w'_t$ evolves faster than $\theta_t$.

  5. All three stepsizes decay to $0$, while the sample length strictly increases to infinity.

  6. Both

    norm and the centered second moment of

    w.r.t. are bounded uniformly.

Then the Conditional CEM Actor tracks the CEM Optimizer for actions, conditioned on state: the stochastic recursion for the Actor asymptotically behaves like an expected CEM Optimizer, with expectation taken across states.

Proof Sketch:  The proof follows a multi-timescale stochastic approximation analysis. The primary concern is that the stochastic update to the Actor is not a direct gradient-descent update. Rather, each update to the Actor is a CEM update, which requires a different analysis to ensure that the stochastic noise remains bounded and is asymptotically negligible. Further, the classical results of the CEM also do not immediately apply, because such updates assume distribution parameters can be directly computed. Here, distribution parameters are conditioned on state, as outputs from a parametrized function. We identify conditions on the parametrized policy to ensure well-behaved CEM updates.

The multi-timescale analysis allows us to focus on the updates of the Actor $w_t$, assuming the action-value parameter $\theta$ and action-sampling parameter $w'$ are quasi-static. These parameters are allowed to change with time, as they will in practice, but are moving at a sufficiently slower timescale relative to $w_t$, and hence the analysis can be undertaken as if they are static. The action-value updates need to produce $\theta$ that keep the action-values bounded for each state and action, but we do not specify the exact algorithm for the action-values. We assume that the action-value algorithm is given, and focus the analysis on the novel component: the Conditional CEM updates for the Actor.

The first step in the proof is to formulate the update to the weights $w_t$ as a projected stochastic recursion, simply meaning a stochastic update where after each update the weights are projected to a compact, convex set to keep them bounded. The stochastic recursion is reformulated into a summation involving the mean vector field (which depends on the action-value parameters $\theta$), martingale noise and a loss term that is due to having approximate quantiles. The key steps are then to show almost surely that the mean vector field is locally Lipschitz, the martingale noise is quadratically bounded and that the loss term decays to zero asymptotically. For the first and second, we identify conditions on the policy parameterization that guarantee these. For the final case, we adapt the proof for sampled quantiles approaching true quantiles for CEM, with modifications to account for expectations over the conditioning variable, the state.


Experiments

In this section, we investigate the utility of Actor-Expert (AE), particularly the benefit of generalizing the functional form for the action-values, and demonstrate performance across several benchmark domains. We first design a domain where the true action-values are neither quadratic nor concave, to isolate the effect of the functional form. Then, we test AE and several other algorithms listed below in more complex continuous-action domains from OpenAI Gym [Brockman et al.2016] and MuJoCo [Todorov, Erez, and Tassa2012].


We use two versions of Actor-Expert: AE which uses the Quantile Empirical Distribution (Alg. 2) and AE+ which uses the Optimized Quantile Empirical Distribution (Alg. 3). We use a bimodal Gaussian mixture for both Actors, with and for AE and and for AE+. The second choice for AE+ reflects that a smaller number of samples is needed for the optimized set of actions. For benchmark environments, it was even effective—and more efficient—for AE+ by sampling only 1 action (), with . For NAF, PICNN, Wire-fitting, and DDPG, we attempt to match the settings used in their works.

Normalized Advantage Function (NAF) [Gu et al.2016b] uses $Q(s,a) = V(s) + A(s,a)$, restricting the advantage function to the form $A(s,a) = -\frac{1}{2} (a - \mu(s))^\top \Sigma^{-1}(s) (a - \mu(s))$. $V(s)$ corresponds to the state value for the maximum action $\mu(s)$, and $A(s,a)$ only decreases this value for $a \neq \mu(s)$. NAF takes actions by sampling from a Gaussian with learned mean $\mu(s)$ and learned covariance $\Sigma(s)$, with initial exploration scale swept in {0.1, 0.3, 1.0}.
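To see why this quadratic form is restrictive, the short Python sketch below evaluates a NAF-style action-value on a grid of one-dimensional actions. The specific numbers (state value 1.0, mean 0.5, precision 4.0) are arbitrary illustrative choices, and `naf_q` is a hypothetical helper rather than code from the NAF paper.

```python
import numpy as np

def naf_q(action, state_value, mu, precision):
    # Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu), with P the inverse covariance.
    diff = np.asarray(action, dtype=float) - mu
    return state_value - 0.5 * diff @ precision @ diff

mu = np.array([0.5])
precision = np.array([[4.0]])          # positive definite, so Q is concave in a
q_at_mu = naf_q(mu, 1.0, mu, precision)

grid = np.linspace(-2.0, 2.0, 81)
q_grid = np.array([naf_q(np.array([a]), 1.0, mu, precision) for a in grid])
# Count strict interior local maxima on the grid.
n_peaks = int(np.sum((q_grid[1:-1] > q_grid[:-2]) & (q_grid[1:-1] > q_grid[2:])))
```

Whatever the learned mean and covariance, the resulting surface has a single peak at $\mu(s)$, so it cannot represent two separated high-value actions.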

Partially Input Convex Neural Networks (PICNN) [Amos, Xu, and Kolter2016] is a neural network that is convex with respect to a part of its input, the action in this case. PICNN learns $-Q(s,a)$ so that it is convex with respect to $a$, by restricting the weights of intermediate layers to be non-negative, and the activation function to be convex and non-decreasing (e.g. ReLU). For exploration in PICNN, we use OU noise, that is, temporally correlated stochastic noise generated by an Ornstein-Uhlenbeck process [Uhlenbeck and Ornstein1930], where the noise is added to the greedy action. To obtain the greedy action, as suggested in their paper, we used 5 iterations of the bundle entropy method from a randomly initialized action.
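For readers unfamiliar with OU exploration, a minimal discretized Ornstein-Uhlenbeck process can be sketched in Python as below. The class name and the parameter values (`theta=0.15`, `sigma=0.2`, `dt=1.0`) are common defaults assumed for this example; they are not taken from this paper's settings.

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process with zero long-run mean."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.zeros(dim)

    def sample(self):
        # Mean-reverting drift pulls the state back towards zero;
        # the Gaussian term adds fresh randomness at every step.
        drift = -self.theta * self.x * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + drift + diffusion
        return self.x

noise = OUNoise(dim=1)
greedy = np.array([0.3])                                    # hypothetical greedy action
exploratory = np.clip(greedy + noise.sample(), -1.0, 1.0)   # keep within action bounds
```

Because each sample depends on the previous one, consecutive exploratory actions are correlated, which tends to produce smoother exploration than independent Gaussian noise.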

Wire-fitting [Baird and Klopf1993] outputs a set of action control points and corresponding action values for a state. By construction, the optimal action is one of the action control points with the highest action value. Like PICNN, we use OU exploration. This method uses interpolation between the action control points to find the action values, and thus its performance is largely dependent on the number of control points. We used 100 action control points for the Bimodal Domain. For the benchmark problems, we found that Wire-fitting did not scale well, and so was omitted.

Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al.2015] learns a deterministic policy, parameterized as a neural network, using the deterministic policy gradient theorem [Silver et al.2014]. We include it as a policy gradient baseline, as it is a competitive Actor-Critic method using off-policy policy gradient. Like PICNN and Wire-fitting, DDPG uses OU noise for exploration.

Figure 2: Agents evaluated on Bimodal Toy Environment. Each faded line represents one run while the dark line represents the average. AE methods almost always converged to the optimal policy, while other value-based methods such as NAF, PICNN, and Wire-fitting struggled. DDPG also often converged to a suboptimal policy. When using a small exploration scale of 0.1, NAF was able to fit its action-value function to either local optimum, often learning the optimal policy. However, with a larger scale of 0.3 or 1.0, NAF often observed both optima during exploration, and attempted to fit its action-value function to encompass both optima. This results in a policy that is worse than fitting to either local optimum. With scale 0.3, the final policy ends up selecting actions around a=0.5 (slightly skewed towards the higher rewarding action a=1.0), and with scale 1.0, the agent consistently picks actions around 0, which results in zero return. PICNN has a more general functional form than NAF, but appears to suffer more. We attribute this behavior to the increased difficulties in optimizing the specialized network and to the need to solve the optimization problem on each step to get a greedy action. We found the 5 iterations suggested by the original PICNN paper to be insufficient for this toy domain; because it is computationally feasible here, we performed 20 iterations of the bundle entropy method. Even with more iterations, however, the performance was noisy, as seen in the figure. In our observations, the action-values of PICNN often correctly identify the best action, but the lack of a robust action-selection mechanism caused it to select suboptimal actions around the optimal action. For DDPG, the curious peak occurring around 60 steps arises because the initial action-value function of DDPG is inaccurate and almost linear. As the Actor is slowly updated to explore local actions around its current greedy action, it passes the actual optimal action. It then takes DDPG a while to learn a correct action-value and policy, resulting in a lapse in performance until converging to an action.

Experimental Settings

Agent performance is evaluated every

steps of training, by executing the current policy without exploration for 10 episodes. The performance was averaged over 20 runs of different random seeds for the Bimodal Domain, and 10 runs for the benchmark domains. For all agents we use a neural network of 2-layers with 200 hidden units each, with ReLU activations between each layer and tanh activation for action outputs. For AE and AE+, the Actor and Expert share the first layer, and branch out into two separate layers which all have 200 hidden units. We keep a running average and standard deviation to normalize unbounded state inputs. We use an experience replay buffer and target networks, as is common with neural networks. We use a batch size of 32, with buffer size =

, target networks(), and discount factor

for all agents. We sweep over learning rates (policy: {1e-3, 1e-4, 1e-5}; action-values: {1e-2, 1e-3, 1e-4}) and over the use of layer normalization between network layers. For PICNN, however, layer normalization could not be used, in order to preserve convexity. Best hyperparameter settings found for all agents are reported in Supplement 2.4.

Experiments in a Bimodal Toy Domain

To illustrate the limitation that could be posed by restricting the functional form of , we design a toy domain with a single state and , where the true —shown in in Figure 3

—is a function of two radial basis functions centered at

and respectively, with unequal values of and respectively. We assume a deterministic setting, and so the rewards .

Figure 3: Optimal Action-values for the Bimodal Domain.

We plot the average performance of the best setting for each agent over 20 runs, in Figure 2. We also monitored the training process, logging the action-value function, exploratory action, and greedy action at each time step. We include videos in the Supplement, and descriptions can be found in Supplement 2.2.

All the methods that restrict the functional form for actions failed in many runs. PICNN and NAF start to increase value for one action center, and by necessity of convexity, must overly decrease the values around the other action center. Consequently, when they randomly explore and observe the higher reward for that action than they predict, a large update skews the action-value estimates. DDPG would similarly suffer, because the Actor only learns to output one action. Even though its action-value function is not restrictive, DDPG may periodically see high value for the other action center, and so its choice of greedy action can be pulled back-and-forth between these high-valued actions. AE methods, on the other hand, almost always found the optimal action. Wire-fitting performed better than DDPG, PICNN and NAF, as it should be capable of correctly modeling action-values, but still converged to the suboptimal policy quite often.

The exploration mechanisms also played an important role. For certain exploration settings, the agents restricting the functional form on the action-values or policy can learn to settle on one action, rather than oscillating. For NAF with a small exploration scale and DDPG using OU noise, such oscillations were not observed in the above figure, because the agent only explores locally around one action, avoiding oscillation but also often converging to the suboptimal action. This was still a better choice for overall performance, as oscillation produces lower accumulated reward. AE and AE+, on the other hand, explore by sampling from their learned multi-modal Gaussian mixture distribution, with no external exploration parameter to tune.

Experiments in Benchmark Domains


Figure 4: Agents evaluated on benchmark domains. Results are over 10 runs, smoothed with a moving-window average of size 10. The exact versions of the environments we used are detailed in Supplement 2.3. In these complex domains, AE methods perform similarly to or better than other baseline methods. Additionally, the optimized quantile distribution in AE+ generally performs better than the standard quantile distribution in AE, motivating further investigation into improving the empirical quantile distribution used in AE.

We evaluated the algorithms on a set of benchmark continuous action tasks, with results shown in Figure 4. As mentioned above, we do not include Wire-fitting, as it scaled poorly on these domains. A detailed description of the benchmark environments and their dimensions is included in Supplement 2.3, with state dimensions ranging from 3 to 17 and action dimensions ranging from 1 to 6.

In all benchmark environments, AE and AE+ perform as well as or better than other methods. In particular, they seem to learn more quickly. We hypothesize that this is because AE better estimates greedy actions and explores around actions with high action-values more effectively.

NAF and PICNN seemed to have less stable behavior, potentially due to their restrictive action-value functions. PICNN likely suffers less, because its functional form is more general, but its greedy action selection mechanism is not as robust, and some instability was observed in Lunar Lander. Such instability is not observed in Pendulum or HalfCheetah, possibly because the action-value surface is simple in Pendulum, and for a locomotion environment like HalfCheetah, precision is not necessary; approximately good actions may still enable the agent to move and achieve reasonable performance.

Though the goal here is to evaluate the utility of AE compared to value-based methods, we do include one policy gradient method as a baseline and as a preliminary investigation into value-based versus policy gradient approaches for continuous control. It is interesting to see that AE methods often perform better than their Actor-Critic (policy gradient) counterpart, DDPG. In particular, AE seems to learn much more quickly, which is a hypothesized benefit of value-based methods. Policy gradient methods, on the other hand, typically have to use out-of-date value estimates to update the policy, which could slow learning.

Discussion and Future Work

In our work, we introduced a new framework called Actor-Expert, which decouples action-selection from action-value representation by introducing an Actor that learns to identify maximal actions for the Expert action-values. Previous value-based approaches for continuous control have typically limited the action-value functional form to easily optimize over actions. We have shown that this can be problematic in domains with true action-values that do not follow this parameterization. We proposed an instance of Actor-Expert, by developing a Conditional Cross Entropy Method to iteratively find greedy actions conditioned on states. We use a multi-timescale analysis to prove that this Actor tracks the Cross Entropy updates which seek the optimal actions across states, as the Expert evolves gradually. This proof differs from other multi-timescale proofs in reinforcement learning, as we analyze a stochastic recursion that is based on the Cross Entropy Method, rather than a more typical stochastic (semi-)gradient descent update. We conclude by showing that AE methods are able to find the optimal policy even when the true action-value function is bimodal, and perform as well as or better than previous methods in more complex domains. Like the Actor-Critic framework, we hope for the Actor-Expert framework to facilitate further development and use of value-based methods for continuous action problems.

One such direction is to more extensively compare value-based methods and policy gradient methods for continuous control. In this work, we investigated how to use value-based methods under continuous actions, but did not claim that value-based methods are preferable to policy gradient methods. However, there are several potential benefits of value-based methods that merit further exploration. One advantage is that Q-learning easily incorporates off-policy samples, potentially improving sample complexity, whereas with policy gradient methods, it comes at the cost of introducing bias. Although some off-policy policy gradient methods like DDPG have achieved high performance in benchmark domains, they are also known to suffer from brittleness and hyperparameter sensitivity [Duan et al. 2016, Henderson et al. 2017]. Another more speculative advantage is in terms of the optimization surface. The Q-learning update converges to optimal values in tabular settings and linear function approximation [Melo and Ribeiro 2007]. Policy gradient methods, on the other hand, can have local minima, even in the tabular setting. One goal with Actor-Expert is to improve value-based methods for continuous actions, and so facilitate investigation into these hypotheses, without being limited by difficulties in action selection.


1 Convergence Analysis

In this section, we prove that the stochastic Conditional Cross-Entropy Method update for the Actor tracks an underlying deterministic ODE for the expected Cross-Entropy update over states. We begin by providing some definitions, particularly for the quantile function, which is central to the analysis. We then lay out the assumptions, and discuss some policy parameterizations that satisfy those assumptions. We finally state the theorem, with proof, and provide one lemma needed to prove the theorem in the final subsection.

1.1 Notation and Definitions

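Notation: For a set $A$, let $\mathring{A}$ represent the interior of $A$, while $\partial A$ is the boundary of $A$. The abbreviation a.s. stands for almost surely and i.o. stands for infinitely often. Let $\mathbb{N}$ represent the set $\{0, 1, 2, \dots\}$. For a set $A$, we let $I_A$ be the indicator function/characteristic function of $A$, defined as $I_A(x) = 1$ if $x \in A$ and 0 otherwise. Let $\mathbb{E}_g[\cdot]$, $\mathbb{V}_g[\cdot]$ and $\mathbb{P}_g(\cdot)$ denote the expectation, variance and probability measure w.r.t. $g$. For a $\sigma$-field $\mathcal{F}$, let $\mathbb{E}[\cdot \mid \mathcal{F}]$ represent the conditional expectation w.r.t. $\mathcal{F}$. A function $f: X \to Y$ is called Lipschitz continuous if $\exists L \in (0, \infty)$ s.t. $\|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\|$, $\forall x_1, x_2 \in X$. A function $f$ is called locally Lipschitz continuous if for every $x \in X$, there exists a neighbourhood $U$ of $x$ such that $f|_U$ is Lipschitz continuous. Let $C(X, Y)$ represent the space of continuous functions from $X$ to $Y$. Also, let $B_r(x)$ represent an open ball of radius $r$ centered at $x$. For a positive integer $M$, let $[M] := \{1, 2, \dots, M\}$.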

Definition 1.

A function is Frechet differentiable at if there exists a bounded linear operator such that the limit


exists and is equal to . We say is Frechet differentiable if Frechet derivative of exists at every point in its domain.

Definition 2.

Given a bounded real-valued continuous function with and a scalar , we define the -quantile of w.r.t. the PDF (denoted as ) as follows:


where is the probability measure induced by the PDF , i.e., for a Borel set , .
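As a concrete illustration of Definition 2, the quantile can be estimated from samples: draw actions from the policy, evaluate the action-values, and take the empirical quantile as the threshold. The sketch below is only an assumption-laden example: the symbol `rho`, the choice of the $(1-\rho)$ level, the standard normal policy and the quadratic action-value function are all hypothetical stand-ins, not quantities taken from the analysis.

```python
import numpy as np

def empirical_quantile_threshold(q_values, rho):
    """Sample estimate of the (1 - rho)-quantile of the action-values."""
    return np.quantile(q_values, 1.0 - rho)

rng = np.random.default_rng(0)
actions = rng.normal(0.0, 1.0, size=10_000)   # actions drawn from a hypothetical policy PDF
q_values = -(actions ** 2)                    # hypothetical action-values at one fixed state
rho = 0.2

threshold = empirical_quantile_threshold(q_values, rho)
# Fraction of sampled actions whose value is at or above the threshold.
elite_fraction = float(np.mean(q_values >= threshold))
```

By construction roughly a fraction `rho` of the sampled actions lie at or above the threshold; with more samples the estimate of the threshold becomes less noisy.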

This quantile operator will be used to succinctly write the quantile for , with actions selected according to , i.e.,


1.2 Assumptions

Assumption 1.

Given a realization of the transition dynamics of the MDP in the form of a sequence of transition tuples , where the state is drawn using a latent sampling distribution , while is the action chosen at state , the transitioned state and the reward . We further assume that the reward is uniformly bounded, i.e., .

Here, we analyze the long run behaviour of the conditional cross-entropy recursion (actor) which is defined as follows:


Here, is the projection operator onto the compact (closed and bounded) and convex set with a smooth boundary . Therefore, maps vectors in to the nearest vectors in w.r.t. the Euclidean distance (or equivalent metric). Convexity and compactness ensure that the projection is unique and belongs to .
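A closed Euclidean ball is one simple example of a compact, convex set with a smooth boundary, and its projection has a closed form. The following sketch assumes such a ball as the constraint set purely for illustration; the function name `project_onto_ball` and the radius are hypothetical, and the set used in the analysis need not be a ball.

```python
import numpy as np

def project_onto_ball(w, radius=1.0):
    """Euclidean projection of w onto the closed ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(w)
    if norm <= radius:
        return w.copy()              # already inside the set: left unchanged
    return w * (radius / norm)       # otherwise rescaled onto the boundary

inside = project_onto_ball(np.array([0.3, -0.4]))   # norm 0.5, stays put
outside = project_onto_ball(np.array([3.0, 4.0]))   # norm 5, mapped to [0.6, 0.8]
```

Points inside the set are fixed by the projection, and points outside are mapped to the unique nearest boundary point, which is the uniqueness property that convexity and compactness provide.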

Assumption 2.

The pre-determined, deterministic, step-size sequences , and are positive scalars which satisfy the following:

The first conditions in Assumption 2 are the classical Robbins-Monro conditions [Robbins and Monro 1985] required for stochastic approximation algorithms. The last two conditions enable the different stochastic recursions to have separate timescales. Indeed, they ensure that the recursion is relatively faster compared to the recursions of and . This timescale divide is needed to obtain the pursued coherent asymptotic behaviour, as we describe in the next section.
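For intuition, polynomially decaying schedules are a common way to meet conditions of this kind. The sketch below uses two hypothetical schedules, $t^{-0.6}$ for a faster recursion and $t^{-1}$ for a slower one; these exponents and the names `fast` and `slow` are assumptions for illustration. A finite computation cannot prove the limiting conditions, so it only shows the qualitative behaviour.

```python
import numpy as np

t = np.arange(1, 1_000_001, dtype=np.float64)
fast = t ** -0.6          # hypothetical faster-timescale step sizes
slow = t ** -1.0          # hypothetical slower-timescale step sizes

partial_sum_fast = fast.sum()            # keeps growing as more terms are added
partial_sum_sq_fast = (fast ** 2).sum()  # levels off, since the exponent 1.2 exceeds 1
ratio_at_end = slow[-1] / fast[-1]       # equals t^(-0.4), shrinking towards zero
```

The sum of step sizes diverges while the sum of squares stays bounded, matching the Robbins-Monro requirements, and the vanishing ratio is what separates the two timescales.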

Assumption 3.

The pre-determined, deterministic, sample length schedule is positive and strictly monotonically increases to and .

Assumption 3 states that the number of samples increases to infinity and is primarily required to ensure that the estimation error arising due to the estimation of sample quantiles eventually decays to zero. Practically, one can indeed consider a fixed, finite, positive integer for which is large enough to accommodate the acceptable error.

Assumption 4.

The sequence satisfies , where is a convex, compact set. Also, for , let , .

Assumption 4 assumes stability of the Expert, and minimally only requires that the values remain in a bounded range. We make no additional assumptions on the convergence properties of the Expert, as we simply need stability to prove that the Actor tracks the desired update.

Assumption 5.

For and , let , and .

Assumption 5 implies that there always exists a strictly positive probability mass beyond every threshold. This assumption is easily satisfied when the action-value function is continuous in the action and the policy is a continuous probability density function.

Assumption 6.
Assumption 7.

For , is locally Lipschitz continuous w.r.t. .

Assumptions 6 and 7 are technical requirements and can be justified and more appropriately characterized when we consider to belong to the most popular natural exponential family (NEF) of distributions.

Definition 3.

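Natural exponential family of distributions (NEF) [Morris 1982]: These probability distributions over $\mathbb{R}^m$ are represented by

$\{ f_\theta(x) := h(x) \exp(\theta^\top T(x) - K(\theta)) \mid \theta \in \Lambda \subset \mathbb{R}^d \}$,   (8)

where $\theta$ is the natural parameter, $h: \mathbb{R}^m \to \mathbb{R}$, while $T: \mathbb{R}^m \to \mathbb{R}^d$ (called the sufficient statistic) and $K(\theta) := \ln \int h(x) \exp(\theta^\top T(x)) \, dx$ (called the cumulant function of the family). The space $\Lambda$ is defined as $\Lambda := \{\theta \in \mathbb{R}^d : |K(\theta)| < \infty\}$. Also, the above representation is assumed minimal. (For a distribution in NEF, there may exist multiple representations of the form (8). However, there always exists a representation in which the components of the sufficient statistic are linearly independent, and such a representation is referred to as minimal.)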

A few popular distributions which belong to the NEF family include Binomial, Poisson, Bernoulli, Gaussian, Geometric and Exponential distributions.
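As a worked example, the univariate Gaussian can be written in exponential-family form. The block below assumes the usual representation $f_\theta(x) = h(x)\exp(\theta^\top T(x) - K(\theta))$, with $h$, $T$ and $K$ denoting the base measure, sufficient statistic and cumulant function; these symbol names are assumptions for illustration.

```latex
\begin{align*}
\mathcal{N}(x;\mu,\sigma^2)
  &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
   = \underbrace{\frac{1}{\sqrt{2\pi}}}_{h(x)}
     \exp\!\left(\theta^\top T(x) - K(\theta)\right), \\
T(x) &= \begin{pmatrix} x \\ x^2 \end{pmatrix}, \qquad
\theta = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \\
K(\theta) &= -\frac{\theta_1^2}{4\theta_2} - \frac{1}{2}\ln(-2\theta_2)
           = \frac{\mu^2}{2\sigma^2} + \ln\sigma .
\end{align*}
```

The components $x$ and $x^2$ of the sufficient statistic are linearly independent, so this representation is minimal, and $K$ is finite exactly when $\theta_2 < 0$.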

We parametrize the policy using a neural network, which implies that when we consider NEF for the stochastic policy, the natural parameter of the NEF is being parametrized by . To be more specific, we have to be the function space induced by the neural network of the actor, i.e., for a given state , represents the natural parameter of the NEF policy . Further,


Therefore Assumption 7 can be directly satisfied by assuming that is twice continuously differentiable w.r.t. .

The next assumption is a standard assumption that the sample average converges at an exponential rate in the number of samples. The assumption reflects that this should be true for arbitrary .

Assumption 8.

For and , we have

where .

Assumption 9.

For every , and , (from Eq. (5)) exists and is unique.

The above assumption ensures that the true -quantile is unique and the assumption is usually satisfied for most distributions and a well-behaved .

1.3 Main Theorem

To analyze the algorithm, we employ here the ODE-based analysis as proposed in [Borkar 2008, Kushner and Clark 2012]. The actor recursions (Eqs. (6-7)) represent a classical two timescale stochastic approximation recursion, where there exists a bilateral coupling between the individual stochastic recursions (6) and (7). Since the step-size schedules and satisfy , we have relatively faster than . This disparity induces a pseudo-heterogeneous rate of convergence (or timescales) between the individual stochastic recursions which further amounts to the asymptotic emergence of a stable coherent behaviour which is quasi-asynchronous. This pseudo-behaviour can be interpreted using multiple viewpoints, i.e., when viewed from the faster timescale recursion (recursion controlled by ), the slower timescale recursion (recursion controlled by ) appears quasi-static (‘almost a constant’); likewise, when observed from the slower timescale, the faster timescale recursion seems equilibrated. The existence of this stable long run behaviour under certain standard assumptions of stochastic approximation algorithms is rigorously established in [Borkar 1997] and also in Chapter 6 of [Borkar 2008]. For our stochastic approximation setting (Eqs. (6-7)), we can directly apply this appealing characterization of the long run behaviour of the two timescale stochastic approximation algorithms, after ensuring the compliance of our setting to the pre-requisites demanded by the characterization, by considering the slow timescale stochastic recursion (7) to be quasi-stationary (i.e., , , ), while analyzing the limiting behaviour of the faster timescale recursion (6). Similarly, we let to be quasi-stationary too (i.e., , , ). The asymptotic behaviour of the slower timescale recursion is further analyzed by considering the faster timescale temporal variable with the limit point so obtained during quasi-stationary analysis.
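The quasi-static and equilibrated viewpoints can be seen in a toy coupled recursion. The sketch below is not the actor recursion (6-7); it is a hypothetical scalar example in which a fast iterate tracks a slow iterate under noise, while the slow iterate is driven towards a target through the fast one. The names `fast_var`, `slow_var`, `target` and the step-size exponents are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fast_var, slow_var = 0.0, 0.0          # hypothetical scalar iterates
target = 3.0

for t in range(1, 20_001):
    a_t = t ** -0.6                    # faster step size
    b_t = 1.0 / t                      # slower step size, with b_t / a_t -> 0
    noise = rng.normal(0.0, 1.0)
    # Fast recursion: noisy tracking of the current slow iterate,
    # which looks almost constant on this timescale.
    new_fast = fast_var + a_t * (slow_var - fast_var + noise)
    # Slow recursion: uses the fast iterate as if it had already equilibrated.
    slow_var = slow_var + b_t * (target - fast_var)
    fast_var = new_fast
```

After many steps the fast iterate stays close to the slow one, and the slow iterate settles near the target, which is the kind of coherent long-run behaviour the two-timescale argument formalizes.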

Define the filtration , a family of increasing natural -fields, where .

Theorem 2.

Let Assumptions 1-9 hold. Then the stochastic sequence generated by the stochastic recursion (6) asymptotically tracks the following ODE:


In other words, , where is the set of stable equilibria of the ODE (10) contained inside .


Firstly, we rewrite the stochastic recursion (6) under the hypothesis that and are quasi-stationary, i.e., and as follows:


where and , i.e., the gradient w.r.t. at . Also,