Policy Search in Continuous Action Domains: an Overview

03/13/2018 ∙ by Olivier Sigaud, et al. ∙ DLR UPMC 0

Continuous action policy search, the search for efficient policies in continuous control tasks, is currently the focus of intensive research driven both by the recent success of deep reinforcement learning algorithms and by the emergence of competitors based on evolutionary algorithms. In this paper, we present a broad survey of policy search methods, incorporating into a common big picture these very different approaches as well as alternatives such as Bayesian Optimization and directed exploration methods. The main message of this overview is in the relationship between the families of methods, but we also outline some factors underlying sample efficiency properties of the various approaches. Besides, to keep this survey as short and didactic as possible, we do not go into the details of mathematical derivations of the elementary algorithms.




1 Introduction

Autonomous systems are systems which know what to do in their domain without external intervention. Generally, their behavior is specified through a policy. The policy of a robot, for instance, is defined through a controller which determines actions to take or signals to send to the actuators for any state of the robot in its environment.

Robot policies are often designed by hand, but this manual design is only viable for systems acting in well-structured environments and for well-specified tasks. When those conditions are not met, a more appealing alternative is to let the system find its own policy by exploring various behaviors and exploiting those that perform well with respect to some predefined utility function. This approach is called policy search, a particular case of reinforcement learning (RL) (Sutton and Barto, 1998) where actions are vectors from a continuous space. More precisely, the goal of policy search is to optimize a policy where the function relating behaviors to their utility is black-box, i.e. no analytical model or gradient of the utility function is available. In practice, a policy search algorithm runs the system with some policies to generate rollouts made of several state and action steps and gets the utility as a return (see Figure 1). These utilities are then used to improve the policy, and the process is repeated until some satisfactory set of behaviors is found. In general, policies are represented with a parametric function, and policy search explores the space of policy parameters. For doing so, rollout and utility data are processed by policy improvement algorithms as a set of samples.

Figure 1: Visualization of one episode, the information contained in a rollout, and the definition of the episode utility (which is also known as the episode return when the utility is a reward).
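As a minimal sketch of the rollout and episode utility described above, the following Python snippet runs one episode and sums the step rewards. The 1-D dynamics, the reward, the horizon and the linear policy are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np

def rollout(policy, theta, horizon=50):
    """Run one episode of a toy 1-D system and record every step."""
    state = 1.0                                   # hypothetical start state
    steps = []
    for _ in range(horizon):
        action = policy(theta, state)
        reward = -(state ** 2) - 0.01 * action ** 2   # assumed reward: stay near 0
        steps.append((state, action, reward))
        state = state + 0.1 * action              # assumed toy dynamics
    return steps

def episode_utility(steps):
    """Episode utility, here the sum of step rewards (the episode return)."""
    return sum(reward for _, _, reward in steps)

def linear_policy(theta, state):
    """Hypothetical parametric policy with a single parameter."""
    return theta[0] * state

steps = rollout(linear_policy, np.array([-2.0]))
print(len(steps), episode_utility(steps))
```

A policy search algorithm would only see the pair (policy parameters, utility), or the individual steps of the rollout, depending on the family of methods.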

In the context of robotics, sample efficiency is a key concern. There are three aspects to sample efficiency: (1) data efficiency, i.e. extracting more information from available data (definition taken from (Deisenroth and Rasmussen, 2011)), (2) sample choice, i.e. obtaining data in which more information is available and (3) sample reuse, i.e. improving a policy several times by using the same samples more than once through experience replay.

In this paper, we provide a broad overview of policy search algorithms under the perspective of these three aspects.

1.1 Scope and Contributions

Three surveys about policy search for robotics have been published in recent years (Deisenroth et al., 2013; Stulp and Sigaud, 2013; Kober et al., 2013). With respect to these previous surveys, we cover a broader range of policy search algorithms, including optimization without a utility model, Bayesian optimization (BO), directed exploration methods, and deep RL. The counterpart of this breadth is that we do not give a detailed account of the corresponding algorithms nor their mathematical derivation. To compensate for this lack of details, we refer the reader to (Deisenroth et al., 2013) for the mathematical derivation and description of most algorithms before 2013, and we provide carefully chosen references as needed when describing more recent algorithms.

Furthermore, we focus on the case where the system is learning to solve a single task. That is, we do not cover the broader domain of lifelong, continual or open-ended learning, where a robot must learn how to perform various tasks over a potentially infinite horizon (Thrun and Mitchell, 1995). Additionally, though a subset of policy search methods are based on RL, we do not cover recent work on RL with discrete actions such as dqn and its successors (Mnih et al., 2015; Hessel et al., 2017). Finally, we restrict ourselves to the case where samples are the unique source of information for improving the policy. That is, we do not consider the interactive context where a human user can provide external guidance (Najar et al., 2016), either through feedback, shaping or demonstration (Argall et al., 2009).

1.2 Perspective and structure of the survey

The main message of this paper is as follows. In optimization, when the utility function to be optimized is known and convex, efficient convex methods can be applied (Gill et al., 1981). If the function is known but not convex, a local optimum can be found using gradient descent, iteratively moving from the current point towards a local optimum by following the direction provided by the derivative of the function at this point. If the function is black-box, neither the function nor its analytic gradient is known. In policy search, policy parameters are only related to their utility indirectly through an intermediate set of observed behaviors. Given that policy search corresponds to this more difficult context, we consider five solutions:

  1. searching for high utility policy parameters without building a utility model (Section 2),

  2. learning a model of the utility function in the space of policy parameters and performing stochastic gradient descent (SGD) using this model (Section 3),


  3. defining an arbitrary outcome space and using directed exploration of this outcome space for finding high utility policy parameters (Section 4),

  4. doing the same as in Solution 2 in the state-action space (Section 5),

  5. learning a model of the transition function of the system-environment interaction that predicts the next state given the current state and action, to generate samples without using the system, and then applying one of the above solutions based on the generated samples.

An important distinction in the policy search domain is whether the optimization method is episode-based or step-based (Deisenroth et al., 2013). This distinction exactly matches the phylogenetic RL versus ontogenetic RL distinction in (Togelius et al., 2009). The first three solutions above are episode-based, the fourth is step-based and the fifth can be applied to all others.

Figure 2: Simplified classification of the algorithms covered in the paper. $\Theta$ is the space of policy parameters (see Section 2), $O$ is an outcome space (see Section 4), and the third search space is the state and action space (see Section 5). Algorithms not covered in (Deisenroth et al., 2013) have a lighter (green) background. References to the main paper for each of these algorithms are given in a table at the end of each section. From left to right, algorithms are roughly ranked in order of increasing sample reuse, but methods using a utility model in $\Theta$ and $O$ show better sample choice, resulting in competitive sample efficiency.

Following the above perspective, this survey is structured according to the organization of methods depicted in Figure 2. The rest of the paper describes the different nodes in the trees and highlights some of their sample efficiency factors. A table giving a quick reference to the main paper for each algorithm is given at the end of each section.

In Sections 2 to 5, we present Solutions 1 to 4 above in more detail, showing how the corresponding methods are implemented in various policy search algorithms. We do not cover Solution 5 and refer readers to (Chatzilygeroudis et al., 2017) for a recent presentation of these model-based policy search methods. Then, in Section 6, we discuss the different elementary design choices that matter in terms of sample efficiency. Finally, Section 7 summarizes the paper and provides some perspectives about current trends in the domain.

2 Policy search without a utility model

When the function to be optimized is available but has no favorable property, the standard optimization method known as Gradient Descent consists in iteratively following the gradient of this function towards a local optimum. When the same function is only known through a model built by regression from a batch of samples, one can also do the same, but computing the gradient requires evaluations over the whole batch, which can be computationally expensive. An alternative known as Stochastic Gradient Descent (SGD) circumvents this difficulty by taking a small subset of the batch at each iteration (Bottou, 2012). Before starting to present these methods in Section 3, we first investigate a family of methods which perform policy search without learning a model of the utility function at all. They do so by sampling the policy parameter space $\Theta$ and moving towards policy parameters $\theta$ of higher utility $J(\theta)$.

2.1 Truly random search

At one extreme, the simplest black-box optimization (BBO) method randomly searches $\Theta$ until it stumbles on a good enough utility. We call this method “Truly random search” as the name “random search” is used in the optimization community to refer to gradient-free methods (Rastrigin, 1963). Its distinguishing feature is in its sample choice strategy: the utility of the previous $\theta$ has no impact on the choice of the next $\theta$.

Quite obviously, this sample choice strategy is not efficient, but it requires no assumption at all on the function to be optimized. Therefore, it is an option when this function does not show any regularity that can be exploited. All other methods rely on the implicit assumption that $J$ presents some smoothness around optima $\theta^*$, which is a first step towards using a gradient.
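The sample choice strategy of truly random search can be sketched in a few lines of Python. The quadratic utility, the sampling bounds and the budget are hypothetical; the only point of the sketch is that each new sample is drawn independently of all previously observed utilities.

```python
import numpy as np

def utility(theta):
    """Hypothetical black-box utility; the optimizer only sees its values."""
    return -np.sum((theta - 0.5) ** 2)

def truly_random_search(utility, dim, n_samples, rng):
    best_theta, best_j = None, -np.inf
    for _ in range(n_samples):
        # The next sample ignores every utility observed so far.
        theta = rng.uniform(-1.0, 1.0, size=dim)
        j = utility(theta)
        if j > best_j:
            best_theta, best_j = theta, j
    return best_theta, best_j

rng = np.random.default_rng(0)
best_theta, best_j = truly_random_search(utility, dim=2, n_samples=500, rng=rng)
print(best_theta, best_j)
```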

So globally, this method provides a proof of concept that an agent can obtain a better utility without estimating any gradient at all. Recently, other forms of gradient-free methods called random search, though they are not truly random, have been shown to be competitive with deep RL (Mania et al., 2018).

The next three families of methods, population-based optimization, evolutionary strategies and estimation of distribution algorithms, are all variants of evolutionary methods. An overview of these methods is depicted in Figure 3. (A blog with dynamic visualizations and more technical details can be found at http://blog.otoro.net/2017/10/29/visual-evolution-strategies/.)

Figure 3: One iteration of evolutionary methods. (a) Population-based methods (b) Evolutionary Strategies (c) EDAs. Blue: current generation and sampling domain. Full blue dots: samples with a good evaluation. Dots with a red cross: samples with a poor evaluation. Green: new generation and sampling domain, empty dots are not evaluated yet. Red dots: optimum guess. In population-based methods, the next generation are offspring from several elite individuals of the previous generation. In ES, it is obtained from an optimum guess and sampling from fixed Gaussian noise. In EDAs, Gaussian noise is tuned using Covariance Matrix Adaptation.

2.2 Population-based optimization

Population-based BBO methods manage a limited population of individuals, and generate new individuals randomly in the vicinity of the previous elite individuals. There are several families of population-based optimization methods, the most famous being Genetic Algorithms (GAs) (Goldberg, 1989), Genetic Programming (GP) (Koza, 1992), and the more advanced NEAT framework (Stanley and Miikkulainen, 2002). In these methods, the parameter $\theta$ corresponding to an individual is often called its genotype and the corresponding utility is called its fitness, see (Back, 1996) for further reading. These methods have already been combined with neural networks giving rise to neuroevolution (Floreano et al., 2008) but, until recently, these methods were mostly applied to small to moderate size policy representations. However, the availability of modern computational resources has made it possible to apply them to large and deep neural network representations, defining the emerging domain of deep neuroevolution (Petroski Such et al., 2017). Among other things, it was shown that, given large enough computational resources, methods as simple as GAs can offer a competitive alternative to deep RL methods presented in Section 5, mostly due to their excellent parallelization capabilities (Petroski Such et al., 2017; Conti et al., 2017).
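A minimal sketch of a population-based method is given here in Python. It is a mutation-only genetic algorithm with elitism; the fitness function, population size, number of elites and mutation scale are assumptions chosen for illustration, and crossover is deliberately left out.

```python
import numpy as np

def utility(theta):
    """Hypothetical fitness: higher is better, optimum at the origin."""
    return -np.sum(theta ** 2)

def genetic_algorithm(fitness, dim, pop_size=20, n_elite=5, sigma=0.1,
                      n_gen=50, seed=0):
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(pop_size, dim))          # initial genotypes
    for _ in range(n_gen):
        scores = np.array([fitness(g) for g in population])
        elite = population[np.argsort(scores)[-n_elite:]]  # best individuals
        parents = elite[rng.integers(n_elite, size=pop_size - n_elite)]
        # New individuals are generated in the vicinity of elite individuals.
        offspring = parents + sigma * rng.normal(size=parents.shape)
        population = np.vstack([elite, offspring])         # elites are kept
    scores = np.array([fitness(g) for g in population])
    return population[np.argmax(scores)], scores.max()

best, best_fit = genetic_algorithm(utility, dim=5)
print(best, best_fit)
```

Each fitness evaluation is independent of the others, which is what makes this family easy to parallelize.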

2.3 Evolutionary strategies

Evolutionary strategies (ES) can be seen as specific population-based optimization methods where only one individual is retained from one generation to the next. More specifically, an optimum guess is computed from the previous samples, then the next samples are obtained by adding Gaussian noise to the current optimum guess.

Moving from one optimum guess to the next implements a form of policy improvement similar to SGD, but where the gradient is approximated by averaging over samples instead of being analytically computed. Hence this method is more flexible but, since gradient approximation uses a random exploration component, it is less data efficient. However, data efficiency can be improved by reusing samples between one generation and the next when their sampling domain overlaps, a method called importance mixing (Sun et al., 2009). An improved version of importance mixing was recently proposed in (Pourchot et al., 2018), showing a large impact on sample efficiency, but not large enough to compete with deep RL methods on this aspect. Further results about importance mixing can be found in (Pourchot and Sigaud, 2018), showing that more investigations are necessary to better understand in which context this mechanism can be most useful.
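The gradient approximation by averaging over samples can be sketched as follows. This is an illustrative Python example with fixed Gaussian noise and a mean baseline, not the implementation of any cited paper; the utility, learning rate, noise scale and population size are hypothetical.

```python
import numpy as np

def utility(theta):
    """Hypothetical utility with its optimum at theta = (1, ..., 1)."""
    return -np.sum((theta - 1.0) ** 2)

def evolution_strategy(utility, theta, sigma=0.1, lr=0.05, pop_size=50,
                       n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        # Samples are the current optimum guess plus fixed Gaussian noise.
        noise = rng.normal(size=(pop_size, theta.size))
        returns = np.array([utility(theta + sigma * eps) for eps in noise])
        # Average the noise directions, weighted by baseline-corrected utility.
        grad_estimate = noise.T @ (returns - returns.mean()) / (pop_size * sigma)
        theta = theta + lr * grad_estimate      # move the optimum guess uphill
    return theta

theta_final = evolution_strategy(utility, np.zeros(3))
print(theta_final)
```

No analytic derivative of the utility is used: the update direction comes only from sampled utilities, at the price of extra rollouts per iteration.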

The correlation between the direction of the gradient given by SGD and the same direction for ES depends on the evolutionary algorithm. Interestingly, good ES performance can be obtained even when the correlation is not high, though this result still needs to be confirmed in the case of policy search (Zhang et al., 2017).

A specific ES implementation of deep neuroevolution where constant Gaussian noise is used at each generation was shown to compete with deep RL methods on standard benchmarks (Salimans et al., 2017). This simple implementation generated an insightful comparison with methods based on SGD depending on various gradient landscapes, showing under which conditions ES can find better optima than SGD (Lehman et al., 2017).

Finally, instead of approximating the vanilla gradient of utility, nes (Wierstra et al., 2008) and xnes (Glasmachers et al., 2010) approximate its natural gradient (Akimoto et al., 2010), but for doing so they have to compute the inverse of the Fisher Information Matrix, which is prohibitively expensive in large dimensions (Grondman et al., 2012). We refer the reader to (Pierrot et al., 2018) for a detailed presentation of natural gradient and other advanced gradient descent concepts.

2.4 Estimation of Distribution Algorithms

The standard perspective about EDAs is that they are a specific family of ES using a covariance matrix $\Sigma$ (Larrañaga and Lozano, 2001). This covariance matrix defines a multivariate Gaussian function over $\Theta$, hence its size is $|\theta| \times |\theta|$, where $|\theta|$ is the number of policy parameters. Samples at the next iteration are drawn with a probability proportional to this Gaussian function. Along iterations, the ellipsoid defined by $\Sigma$ is progressively adjusted to the top part of the hill corresponding to the local optimum $\theta^*$.

The role of $\Sigma$ is to control exploration. The exploration policy can be characterized as uncorrelated when it only updates the diagonal of $\Sigma$ and correlated when it updates the full $\Sigma$ (Deisenroth et al., 2013). The latter is more efficient in small parameter spaces but computationally more demanding and potentially inaccurate in larger spaces as more samples are required. In particular, it cannot be applied in the deep neuroevolution context where the order of magnitude of the size of $\theta$ is between thousands and millions.
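The difference between correlated and uncorrelated exploration can be illustrated with a small cross-entropy-style sketch in Python. The utility, population size and number of elites are hypothetical, and the covariance update is a plain refit to the elite samples rather than the full Covariance Matrix Adaptation rule.

```python
import numpy as np

def utility(theta):
    """Hypothetical utility with its optimum at theta = (0.5, 0.5, 0.5)."""
    return -np.sum((theta - 0.5) ** 2)

def cem(utility, mean, n_iter=30, pop_size=40, n_elite=10, full_cov=True,
        seed=0):
    rng = np.random.default_rng(seed)
    dim = mean.size
    cov = np.eye(dim)                        # |theta| x |theta| matrix
    for _ in range(n_iter):
        samples = rng.multivariate_normal(mean, cov, size=pop_size)
        scores = np.array([utility(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        # Correlated exploration: refit the full covariance matrix.
        cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(dim)
        if not full_cov:
            # Uncorrelated exploration: keep only the diagonal.
            cov = np.diag(np.diag(cov))
    return mean, cov

mean_full, cov_full = cem(utility, np.zeros(3), full_cov=True)
mean_diag, cov_diag = cem(utility, np.zeros(3), full_cov=False)
print(mean_full, mean_diag)
```

The full matrix has a number of entries quadratic in the number of parameters, which is why the diagonal variant is the only practical option for large policies.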

Various instances of EDAs, such as cem, cma-es, pi-cma, are covered in (Stulp and Sigaud, 2012a, b, 2013). Among them, the cma-es algorithm is also shown to approximate the natural gradient (Arnold et al., 2011). By contrast, the pi algorithm, also described in (Stulp and Sigaud, 2013), is a simplification of pi-cma where covariance matrix adaptation has been removed. Thus it should be considered an instance of the former ES category.

2.5 Finite difference methods

In finite difference methods, the gradient of utility with respect to $\theta$ is estimated as the first order approximation of the Taylor expansion of the utility function. This estimation is performed by applying local perturbations to the current input. Thus these methods are derivative-free and we classify them as using no model, even if they are based on a local linear approximation of the gradient.

In finite difference methods, gradient estimation can be cast as a standard regression problem, but perturbations along each dimension of $\Theta$ can be treated separately, which results in a very simple algorithm (Riedmiller et al., 2008). The counterpart of this simplicity is that it suffers from a lot of variance, so in practice the methods are limited to deterministic policies.
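Treating each dimension separately leads to the following Python sketch, which uses a forward difference per parameter. The deterministic utility, the perturbation size and the step size are assumptions for illustration.

```python
import numpy as np

def utility(theta):
    """Hypothetical deterministic utility with optimum at (1, 1, 1)."""
    return -np.sum((theta - 1.0) ** 2)

def finite_difference_gradient(utility, theta, delta=1e-4):
    base = utility(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        perturbed = theta.copy()
        perturbed[i] += delta                 # perturb one dimension at a time
        grad[i] = (utility(perturbed) - base) / delta
    return grad

initial_grad = finite_difference_gradient(utility, np.zeros(3))
theta = np.zeros(3)
for _ in range(100):
    theta = theta + 0.1 * finite_difference_gradient(utility, theta)
print(initial_grad, theta)
```

Each gradient estimate costs one extra utility evaluation per parameter, and a noisy utility would corrupt the difference directly, which is consistent with the restriction to deterministic policies mentioned above.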

2.6 Reference to the main algorithms

| Algorithm | Main paper |
|---|---|
| cma-es | (Hansen and Ostermeier, 2001) |
| cem | (Rubinstein and Kroese, 2004) |
| finite diff. | (Riedmiller et al., 2008) |
| nes | (Wierstra et al., 2008) |
| xnes | (Glasmachers et al., 2010) |
| pi | (Stulp and Sigaud, 2012b) |
| pi-cma | (Stulp and Sigaud, 2012a) |
| OpenAI-ES | (Salimans et al., 2017) |
| Random Search | (Mania et al., 2018) |

Table 1: Main gradient-free algorithms. Above the line, they were studied in (Deisenroth et al., 2013), below they were not.

2.7 Sample efficiency analysis

In all gradient-free methods, sampling a vector of policy parameters $\theta$ provides exact information about its utility $J(\theta)$. However, the function $J$ can be stochastic, in which case one value of $J(\theta)$ only contains partial information about the value of that $\theta$. In any case, sample reuse can be implemented by storing an archive of the already sampled pairs $\langle \theta, J(\theta) \rangle$. Each time an algorithm needs the utility $J(\theta)$ of a sample $\theta$, if this utility is already available in the archive, it can use it instead of sampling again. In the deterministic case, using the stored value is enough. In the stochastic case, the archive may provide a distribution over values $J(\theta)$, and the algorithm may either draw a value from this distribution or sample again, depending on accuracy requirements.
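The archive mechanism can be sketched as a small Python class. The class name `UtilityArchive`, the rounding used to build lookup keys and the choice of returning the mean of stored values are all hypothetical design choices made for this illustration.

```python
import numpy as np

class UtilityArchive:
    """Stores observed utilities so that a rollout is not repeated needlessly."""

    def __init__(self, utility):
        self.utility = utility
        self.values = {}        # parameter tuple -> list of observed utilities
        self.n_rollouts = 0

    def evaluate(self, theta, resample=False):
        key = tuple(np.round(theta, 6))       # assumed matching tolerance
        if key not in self.values or resample:
            self.values.setdefault(key, []).append(self.utility(theta))
            self.n_rollouts += 1
        # Stochastic case: summarize the stored distribution by its mean.
        return float(np.mean(self.values[key]))

rng = np.random.default_rng(0)

def noisy_utility(theta):
    """Hypothetical stochastic utility."""
    return -np.sum(theta ** 2) + 0.1 * rng.normal()

archive = UtilityArchive(noisy_utility)
theta = np.array([0.5, -0.5])
first = archive.evaluate(theta)                  # runs a rollout
second = archive.evaluate(theta)                 # reuses the stored value
third = archive.evaluate(theta, resample=True)   # samples again for accuracy
print(first, second, third, archive.n_rollouts)
```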

Message 1: Policy search without a utility model is generally less data efficient than Stochastic Gradient Descent (SGD). Though sample reuse is technically possible without a utility model, in practice it is seldom used. Despite their lower sample efficiency in comparison to SGD, some of these methods are highly parallelizable and offer a viable alternative to deep RL provided enough computational resources.

3 Policy search with a model of utility in the space of policy parameters

As outlined in the introduction, the utility of a vector of policy parameters is only available by observing the corresponding behavior. Although no model that relates policy parameters to utilities is given, one may approximate the utility function in $\Theta$ from these observations, by collecting samples consisting of (policy parameters, utility) pairs and using regression to infer a model of the corresponding function (see e.g. (Stulp and Sigaud, 2015)). Such a model could either be deterministic, giving one utility per policy parameters vector, or stochastic, giving a distribution over utility values.

Once such a model is learned, one could perform gradient descent on this model. These steps could be performed sequentially (first model learning and then gradient descent) or incrementally (improving the model and performing gradient descent after each new utility observation). In the latter case, the model is necessarily persistent: it evolves from iteration to iteration given new information, in contrast with the sequential case where it could be transient, that is recomputed from scratch at each iteration.
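The sequential variant (first model learning, then gradient steps on the model) can be sketched in Python as follows. The 1-D utility, the quadratic regression model and the step size are assumptions for illustration, and the loop ascends the modeled utility, which is equivalent to gradient descent on its negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_utility(theta):
    """Hypothetical noisy utility observed through rollouts."""
    return -(theta - 0.7) ** 2 + 0.01 * rng.normal()

# Step 1: collect (policy parameter, utility) samples and fit a model.
thetas = rng.uniform(-2.0, 2.0, size=30)
utilities = np.array([observed_utility(t) for t in thetas])
coeffs = np.polyfit(thetas, utilities, deg=2)    # deterministic utility model
model_gradient = np.polyder(coeffs)              # derivative of the model

# Step 2: follow the gradient of the model, without any further rollout.
theta = 0.0
for _ in range(200):
    theta += 0.05 * np.polyval(model_gradient, theta)
print(theta)
```

Here the model is transient in the sense used above: it is fitted once from a batch and would be recomputed from scratch if new samples were collected.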

3.1 Bayesian Optimization

Though the above approach seems appealing, we are not aware of any algorithm performing what is described above in the deterministic case. A good reason for this is that utility functions are generally stochastic in $\Theta$. Thus, algorithms which learn a model have to learn a distribution over such models. This is exactly what Bayesian optimization (BO) does. The distribution over models is updated through Bayesian inference. It is initialized with a prior, and each new sample, considered as some new evidence, helps adjust the model distribution towards a peak at the true value, whilst keeping track of the variance over models. By estimating the uncertainty over the distribution of models, BO methods are endowed with active learning capabilities, dramatically improving their sample efficiency at the cost of worse scalability.

A BO algorithm comes with a covariance function that determines how the information provided by a new sample influences the model distribution around this sample. It also comes with an acquisition function used to choose the next sample given the current model distribution. A good acquisition function should take into account the value and the uncertainty of the model over the sampled space.

By quickly reducing uncertainty, BO implements a form of active learning. As a consequence, it is very sample efficient when the parameter space is small enough, and it searches for the global optimum, rather than a local one. However, given the necessity to optimize globally over the acquisition function, it scales poorly in the size of the parameter space. For more details, see (Brochu et al., 2010).
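The interplay between covariance function and acquisition function can be sketched with a small Gaussian process in Python. The squared-exponential covariance, the upper confidence bound acquisition, the hyperparameters and the 1-D utility are hypothetical choices for this illustration; other covariance and acquisition functions are possible.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Covariance function: how much one sample informs nearby points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-3):
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(x_train.size)
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    var = 1.0 - np.sum(k_qt * np.linalg.solve(k_tt, k_qt.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))

def utility(theta):
    """Hypothetical 1-D utility with its optimum at theta = 0.3."""
    return -(theta - 0.3) ** 2

candidates = np.linspace(-1.0, 1.0, 201)
thetas = np.array([-1.0, 1.0])                 # initial samples
utilities = utility(thetas)
for _ in range(15):
    mean, std = gp_posterior(thetas, utilities, candidates)
    # Acquisition function: favor high predicted value and high uncertainty.
    next_theta = candidates[np.argmax(mean + 2.0 * std)]
    thetas = np.append(thetas, next_theta)
    utilities = np.append(utilities, utility(next_theta))
best_theta = thetas[np.argmax(utilities)]
print(best_theta)
```

The acquisition function is maximized here by exhaustive evaluation on a grid, which is only feasible in very low dimension and gives a feel for the scalability issue mentioned above.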

The rock algorithm is an instance of BO that searches for a local optimum instead of a global one (Hwangbo et al., 2014). It uses cma-es to find the optimum over the model function. By doing so, it performs natural rather than vanilla gradient optimization, but it does not use the available model of the utility function, though this could improve sample efficiency.

Bayesian optimization algorithms generally use Gaussian kernels to efficiently represent the distribution over models. However, some authors have started to note that, in the specific context of policy search, BO was not using all the information available in elementary steps of the agent. This led to the investigation of more appropriate data-driven kernels based on the Kullback-Leibler divergence between the rollout densities generated by two policies (Wilson et al., 2014).

Using BO in the context of policy search is an emerging domain (Lizotte et al., 2007; Calandra et al., 2014; Metzen et al., 2015; Martinez-Cantin et al., 2017). Furthermore, recent attempts to combine BO with reinforcement learning approaches, giving rise to the Bayesian Optimization Reinforcement Learning (BORL) framework, are described in Section 5.

3.2 Reference to the main algorithms

| Algorithm | Main paper |
|---|---|
| Bayes. Opt. | (Pelikan et al., 1999) |
| rock | (Hwangbo et al., 2014) |

Table 2: Main Bayesian Optimization algorithms

3.3 Sample efficiency analysis

Learning a model of the utility function in $\Theta$ should be more sample efficient than trying to optimize without a model, as the gradient with respect to the model can be used to accelerate parameter improvement. However, learning a deterministic model is not enough for most cases, as the true utility function is generally stochastic in $\Theta$, and learning a stochastic model comes with an additional computational cost which impacts the scalability of the approach.

Message 2: Bayesian Optimization is BBO managing a distribution over models in the policy parameter space. Its sample efficiency benefits from active choice of samples. But as it performs global search, it does not scale well to large policy parameter spaces. Thus, it is difficult to apply to deep neural network representations.

4 Directed exploration methods

Directed exploration methods are particularly useful in tasks with sparse rewards, i.e. where large parts of the search space have the same utility signal. These methods have two main features. First, instead of searching directly in the policy parameter space $\Theta$, they search in a smaller outcome space $O$ (also called descriptor space or behavioral space) and learn an invertible mapping between $\Theta$ and $O$. Second, they all optimize a task-independent criterion called novelty or diversity which is used to efficiently cover the outcome space.

The outcome itself corresponds to properties of the observed behavior. The general intuition is that if the outcome space is properly covered by known policy parametrizations, and if utility can be easily related to outcomes, then it should be easy to find policy parameters with a high utility, even when the utility function is null for most policy parameters. Figure 4 visualizes why it is generally more efficient to perform the search for novel solutions in a dedicated outcome space $O$ and learn a mapping from $\Theta$ to $O$ than to perform this search directly in $\Theta$ (Baranes et al., 2014).

So, for the method to work, the outcome space has to be defined in such a way that determining the utility corresponding to an outcome is straightforward. Generally, the outcome space is defined by an external user to meet this requirement. Nevertheless, using representation learning methods to let the agent autonomously define its own outcome space is an emerging topic of interest (Pere et al., 2018; Laversanne-Finot et al., 2018).

Figure 4: A standard mapping between a policy parameter space $\Theta$ and an outcome space $O$. Most often, many policy parameters result in the same outcome (for instance, in the case of a robotic arm which must move a ball around, if the policy defines arm movements and the outcome space is defined as ball positions, most policy parameters will result in a static ball). In that case, sampling directly in $\Theta$ works poorly: one has to sample in such a way as to efficiently cover $O$.

Directed exploration methods can be split into novelty search (NS) (Lehman and Stanley, 2011), quality-diversity (QD) (Pugh et al., 2015) and goal exploration processes (geps) (Baranes and Oudeyer, 2010; Forestier and Oudeyer, 2016; Forestier et al., 2017). The first two derive from evolutionary methods, whereas geps come from the developmental learning and intrinsic motivation literature.

An important distinction between them is that NS and geps are designed to optimize diversity only (hence the dotted line in Figure 2), thus they do not use the utility function at all, whereas QD methods rely on multi-objective optimization methods to optimize diversity and utility at the same time.

The NS approach arose from the realization that optimizing utility as a single objective is not the only option (Doncieux and Mouret, 2014). In particular, in the case of sparse or deceptive reward problems, it was shown that seeking novelty or diversity is an efficient strategy to obtain high utility solutions, even without explicitly optimizing this utility (Lehman and Stanley, 2011). The gep approach was more inspired by thoughts on intrinsic motivations, where the goal was to have an agent achieve its own goal without an external utility signal (Forestier et al., 2017). However, researchers in evolutionary methods also realized that diversity and utility can be optimized jointly (Cuccu and Gomez, 2011), giving rise to more advanced NS and QD algorithms (Pugh et al., 2015; Cully and Demiris, 2017).

All these methods share a lot of similarities. They all start with a random search stage and, when they evaluate a policy parameter vector $\theta$ resulting in a point $o$ in the outcome space $O$, they store the corresponding $\langle \theta, o \rangle$ pair in an archive. Because they use this archive for policy improvement, they all implement a form of lazy learning, endowing them with interesting sample efficiency properties (Aha, 1997). The archive itself can be seen as a stochastic model of the function relating $\Theta$ to $O$, as made particularly obvious in the MAP-Elites algorithm (Cully et al., 2015).

In more detail, the main differences between these methods lie in the way they cover the outcome space $O$. NS and QD methods perform undirected variations to the elite $\theta$ vectors present in the archive. More precisely, in NS, the resulting solution is just added to the archive, whereas in QD the new solution replaces a previous one if it outperforms it both in terms of diversity and utility. By contrast, geps choose a desired outcome and modify a copy of the $\theta$ leading to the closest outcome in the archive. The choice of a desired outcome can be performed randomly or using curriculum learning or learning progress concepts (Baranes and Oudeyer, 2013; Forestier et al., 2017). Similarly, the modification of $\theta$ can be performed using undirected Gaussian noise or in more advanced ways. For instance, some gep methods build a local linear model of the mapping from $\Theta$ to $O$ to efficiently invert it, so as to find the $\theta$ corresponding to the desired outcome (Baranes and Oudeyer, 2013).
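The goal exploration loop described above can be sketched in Python, in its simplest form: random choice of the desired outcome and undirected Gaussian modification of the selected parameters. The mapping from parameters to outcomes, the dimensions and the noise scale are hypothetical.

```python
import numpy as np

def outcome_of(theta):
    """Hypothetical behavior descriptor: a 2-D outcome for a 4-D theta."""
    return np.tanh(theta[:2] + 0.5 * theta[2:])

def goal_exploration(n_bootstrap=20, n_iter=300, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    archive = []                                   # list of (theta, outcome)
    for _ in range(n_bootstrap):                   # initial random search stage
        theta = rng.uniform(-1.0, 1.0, size=4)
        archive.append((theta, outcome_of(theta)))
    for _ in range(n_iter):
        goal = rng.uniform(-1.0, 1.0, size=2)      # desired outcome
        outcomes = np.array([o for _, o in archive])
        nearest = np.argmin(np.linalg.norm(outcomes - goal, axis=1))
        # Modify a copy of the theta leading to the closest known outcome.
        theta = archive[nearest][0] + sigma * rng.normal(size=4)
        archive.append((theta, outcome_of(theta)))
    return archive

archive = goal_exploration()
print(len(archive))
```

No utility appears anywhere in this loop: the archive only grows so as to cover the outcome space, in line with the diversity-only objective of geps.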

Thus directed exploration methods all learn a stochastic and invertible mapping between $\Theta$ and $O$. When they also learn a model of the utility, this model is stochastic with respect to $\Theta$, which makes them similar to BO methods. In that case, the outcome space is an intermediate space between $\Theta$ and utilities: policy parameters are first projected into the outcome space, and then a model of the utility function in this outcome space can be learned.

Learning a model of the utility in $O$ shares some similarities with learning a critic in the state-action space, as presented in Section 5. From this perspective, these methods can be seen as providing an intermediary family between evolutionary, BO and reinforcement learning methods. However, we shall see soon that learning a critic in the state-action space benefits from additional properties related to temporal difference learning, which limits the use of the above unifying perspective.

4.1 Reference to the main algorithms

| Algorithm | Main paper |
|---|---|
| Novelty Search | (Lehman and Stanley, 2011) |
| Quality-Diversity | (Pugh et al., 2015) |
| Goal Exploration | (Forestier et al., 2017) |

Table 3: Main directed exploration algorithms

4.2 Sample efficiency analysis

The defining characteristic of all directed exploration methods is their capability to widely cover the outcome space. This provides efficient exploration, which in turn critically improves sample efficiency when combined with more standard evolutionary methods mentioned in Section 2 (Conti et al., 2017) or deep RL methods mentioned in Section 5 (Colas et al., 2018).

Even though our article focuses on single-task learning, it is worth mentioning that directed exploration methods may greatly improve sample efficiency in multi-task learning scenarios. This is because such methods aim at covering the (interesting) outcome space, and can thus more easily adapt when facing multiple tasks, and thus potentially multiple outcomes.

Message 3: Looking for diversity only in a user defined outcome space is an efficient way to perform exploration, and can help solve sparse or deceptive reward problems, where more standard exploration would fail. Directed exploration methods are thus useful complements to other methods covered in this survey.

5 Policy search with a critic

The previous two sections have presented methods which learn mappings from the policy parameter space to utilities or outcomes. We now cover methods which learn a model of utility in the state-action space.

An important component in the RL formalization, the utility of a state-action pair corresponds to the return the agent may expect from performing a given action when it is in a given state and then following either its current policy or the optimal policy. This quantity may also depend on a discount factor and a noise parameter.

The true utility can be approximated with a parametric model. Such a model is called a critic. A key feature is that the critic can be learned from samples corresponding to single steps in the rollouts of the agent, either with temporal differencing or Monte Carlo methods. Methods that approximate the true utility by a critic, and determine the policy parameters by descending the gradient of this critic with respect to these parameters, are called actor-critic methods, the policy being the actor (Peters and Schaal, 2008b; Deisenroth et al., 2013).

This actor-critic approach can be applied to stochastic and deterministic policies (Silver et al., 2014). The space of deterministic policies being smaller than the space of stochastic policies, deterministic policies can be advantageous because searching the former space is faster than searching the latter. However, a stochastic policy might be more appropriate when the Markov property does not hold (Williams and Singh, 1998; Sigaud and Buffet, 2010) or in adversarial contexts (Wang et al., 2016b).

5.1 Exploration in parameter or state-action space

As mentioned in Section 3, learning a model of the utility in the policy parameter space is a regression problem that is performed by sampling and exploring directly in that space. In contrast, the state-action space cannot be sampled directly, as one does not know in advance which policy parameters will result in visiting which states and performing which actions. Exploration is therefore performed either by adding noise to the policy parameters (policy parameter perturbation), or by adding noise to the actions the policy outputs (action perturbation). In the latter case, exploration is generally undirected and adds Gaussian noise or correlated Ornstein-Uhlenbeck noise to the actions taken by the policy. Policy parameter perturbation is used in pepg, power and pi, and has more recently been applied to ddpg (Fortunato et al., 2017; Plappert et al., 2017), whereas action perturbation is used in the other algorithms presented in this paper.
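The two exploration schemes can be contrasted in a minimal sketch. The linear policy, the function names and all noise constants below are illustrative assumptions and are not taken from any of the cited algorithms.

```python
import random


def linear_policy(theta, state):
    """Hypothetical deterministic policy: action = theta[0] * state + theta[1]."""
    return theta[0] * state + theta[1]


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x <- x + rate * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, mu=0.0, rate=0.15, sigma=0.2, rng=None):
        self.mu, self.rate, self.sigma = mu, rate, sigma
        self.rng = rng if rng is not None else random.Random()
        self.x = mu

    def sample(self):
        self.x += self.rate * (self.mu - self.x) + self.sigma * self.rng.gauss(0.0, 1.0)
        return self.x


def act_with_action_perturbation(theta, state, noise):
    """Action perturbation: noise is added to each action the policy outputs."""
    return linear_policy(theta, state) + noise.sample()


def perturb_parameters(theta, sigma, rng):
    """Policy parameter perturbation: noise is added to the parameters
    (here once per rollout), and the perturbed policy is then run unchanged."""
    return [p + rng.gauss(0.0, sigma) for p in theta]


rng = random.Random(0)
theta = [0.5, -0.1]

# Action perturbation: a fresh (correlated) noise value at every step.
ou_noise = OrnsteinUhlenbeckNoise(rng=rng)
noisy_actions = [act_with_action_perturbation(theta, 1.0, ou_noise) for _ in range(5)]

# Parameter perturbation: one noisy parameter vector for the whole rollout.
theta_explore = perturb_parameters(theta, sigma=0.05, rng=rng)
rollout_actions = [linear_policy(theta_explore, 1.0) for _ in range(5)]
```

With parameter perturbation the behavior stays consistent within a rollout (the same state always yields the same action), whereas action perturbation changes the action at every step even when the state is unchanged.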

All actor-critic algorithms iterate over the following three steps:

  A. Collect new step samples from the current policy with policy parameter perturbation or action perturbation for exploration,

  B. Compute a new critic based on these samples, by determining its parameters through a temporal difference method,

  C. Update the policy parameters through gradient descent with respect to the critic.

A distinction here should be made on whether 1) the critic is discarded after step C, and must thus be learned from scratch in step B in the next iteration, or 2) the critic is persistent throughout the learning, and incrementally updated in step B. We discuss the differences between these two variants – which we denote transient critic and persistent critic respectively – in more detail in the next two sections.
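The three-step loop and the difference between the two variants can be sketched on a toy problem. The sketch makes strong simplifying assumptions, all of them ours: a single state and one-step episodes (so no bootstrapping is involved and the critic simply regresses on the immediate reward), a scalar policy parameter equal to the action, and a critic reduced to a local linear model of which only the slope is kept. The reward function, batch size and step sizes are hypothetical.

```python
import random


def reward(action):
    """Hypothetical one-step task: the best action is 2.0."""
    return -(action - 2.0) ** 2


def collect(theta, n, sigma, rng):
    """Step A: sample actions around the policy output (action perturbation)."""
    actions = [theta + rng.gauss(0.0, sigma) for _ in range(n)]
    return [(u, reward(u)) for u in actions]


def fit_slope(samples):
    """Batch least-squares slope of the reward with respect to the action."""
    n = len(samples)
    mean_u = sum(u for u, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    cov = sum((u - mean_u) * (r - mean_r) for u, r in samples)
    var = sum((u - mean_u) ** 2 for u, _ in samples)
    return cov / var


def actor_critic(persistent, iterations=100, seed=0):
    rng = random.Random(seed)
    theta, slope = 0.0, 0.0  # actor parameter and (single) critic parameter
    for _ in range(iterations):
        samples = collect(theta, n=50, sigma=0.5, rng=rng)  # step A
        if persistent:
            # Step B, persistent critic: the previous estimate is kept and
            # incrementally blended with the new evidence.
            slope = 0.5 * slope + 0.5 * fit_slope(samples)
        else:
            # Step B, transient critic: the previous critic is discarded and
            # a new one is fitted from scratch by batch regression.
            slope = fit_slope(samples)
        # Step C: move the policy parameter along the critic's gradient
        # (the reward is maximized here, so the step follows the slope).
        theta += 0.1 * slope
    return theta
```

Calling `actor_critic(persistent=False)` and `actor_critic(persistent=True)` both drive the policy parameter towards the best action of this toy task; the example says nothing about the relative merits of the two variants on real problems.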

5.2 Transient Critic Algorithms

In methods with a transient critic, Monte Carlo sampling – running a large set of episodes and averaging over the stochastic return – is used to evaluate the current policy and generate a new set of step samples. Then, determining the optimal critic parameters given these samples can be cast as a batch regression problem.

Among these methods, one must distinguish between three families: likelihood ratio methods such as reinforce (Williams, 1992) and pepg (Sehnke et al., 2010), natural gradient methods such as nac and enac (Peters and Schaal, 2008a), and EM-based methods such as power (Kober and Peters, 2009) and the variants of reps (Peters et al., 2010). The critic is generally persistent in actor-critic methods, but this is not the case in nac and enac. Interestingly, pi can also be seen as a transient critic method, though it could in principle use a persistent one and fall into Section 5.3; it uses batch updates only because they make it more stable (Deisenroth et al., 2013). All the corresponding algorithms are well described in (Deisenroth et al., 2013).

Although they derive from different mathematical frameworks, likelihood ratio methods and EM-based methods are similar: they both use unbiased estimation of the gradient through Monte Carlo sampling and they are both mathematically designed so that the most rewarding rollouts get the highest probability.
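As an illustration of the likelihood ratio idea, the following sketch estimates the gradient of the expected return by Monte Carlo sampling for a Gaussian policy over a scalar action. The one-step task, the function names and the constants are hypothetical; this is a textbook-style estimator written under these assumptions, not the implementation of any cited algorithm.

```python
import random


def rollout_return(action):
    """Hypothetical one-step episode whose return peaks at action = 2.0."""
    return -(action - 2.0) ** 2


def likelihood_ratio_gradient(theta, sigma, n_rollouts, rng):
    """Monte Carlo estimate of dJ/dtheta for a Gaussian policy N(theta, sigma^2).

    Uses d/dtheta log pi(u) = (u - theta) / sigma^2, weighted by the return,
    so that highly rewarded actions pull theta towards them.
    """
    total = 0.0
    for _ in range(n_rollouts):
        action = rng.gauss(theta, sigma)
        score = (action - theta) / sigma ** 2
        total += score * rollout_return(action)
    return total / n_rollouts


rng = random.Random(0)
estimate = likelihood_ratio_gradient(theta=0.0, sigma=0.5, n_rollouts=20000, rng=rng)
# For this task the expected return is -((theta - 2)^2 + sigma^2), so the
# exact gradient at theta = 0 is 4; the estimate fluctuates around that value.
```

The estimator only needs returns of complete rollouts and the score of the policy, which is why no model of the dynamics or of the reward is required.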

The trpo (Schulman et al., 2015) algorithm also follows an iterative approach and can use a deep neural network representation, thus it can be classified as a deep RL method. Among other things, it uses a bound on the Kullback-Leibler divergence between policies at successive iterations to ensure safe and efficient exploration. Finally, the Guided Policy Search (gps) algorithm (Levine and Koltun, 2013; Montgomery and Levine, 2016) is another transient critic deep RL method inspired by reinforce, but adding guiding rollouts obtained from simpler policies.

5.3 Persistent Critic Algorithms

In contrast with transient critic algorithms, persistent critic algorithms incrementally update the critic during training. Most such algorithms use an actor-critic architecture, with the notable exception of naf (Gu et al., 2016b), which does not have an explicit representation of the actor. To our knowledge, before the emergence of the deep RL algorithms described below, the four inac algorithms were the only representatives of this family (Bhatnagar et al., 2007).

The critic is computed incrementally with a temporal difference (TD) method, also called a bootstrap method (Sutton, 1988). Such methods compute at each step a temporal difference error, or reward prediction error (RPE), between the immediate reward predicted by the current values of the critic and the actual reward received by the agent. This RPE can then be used as a loss that the critic should minimize over iterations (Sutton and Barto, 1998).
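A minimal sketch of one such incremental update is given below for a critic that is linear in hand-chosen features. The feature vectors, learning rate and function names are assumptions made for the example. The reward predicted by the critic is the value of the current pair minus the discounted value of the next pair, so the TD error is the gap between the received reward and that prediction.

```python
def q_value(weights, features):
    """Linear critic: weighted sum of the features of a state-action pair."""
    return sum(w * f for w, f in zip(weights, features))


def td_update(weights, features, reward, next_features,
              gamma=0.99, lr=0.1, terminal=False):
    """One incremental critic update; returns (new_weights, td_error)."""
    # Bootstrap: the target reuses the critic's own estimate of the next pair.
    bootstrap = 0.0 if terminal else gamma * q_value(weights, next_features)
    td_error = reward + bootstrap - q_value(weights, features)
    # Gradient step reducing the squared TD error (target held fixed).
    new_weights = [w + lr * td_error * f for w, f in zip(weights, features)]
    return new_weights, td_error
```

For example, `td_update([0.0, 0.0], [1.0, 0.0], reward=1.0, next_features=[0.0, 1.0], terminal=True)` yields a TD error of 1.0 and moves only the weight of the active feature.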

5.4 Key properties of Persistent Critic Algorithms

Most mechanisms that made deep actor-critic algorithms possible were first introduced in dqn (Mnih et al., 2015). Though dqn is a discrete action algorithm which is outside the scope of this survey, we briefly review its important concepts and mechanisms before listing the main algorithms in the family of continuous action deep RL methods.

5.4.1 Accuracy and scalability: deep neural networks

By using deep neural networks as function approximators and taking advantage of the large computational capabilities of modern computer clusters, all deep RL algorithms are capable of addressing much larger problems than before, and of approximating gradients with unprecedented accuracy, which makes them more stable than the previous linear architectures of nac and power, hence amenable to incremental updates of a persistent critic rather than recomputing a transient one.

5.4.2 Stability: the target critic

Deep RL methods introduced a target critic as a way to improve stability. Standard regression is the process of fitting a model to samples so as to approximate an unknown stationary function (Stulp and Sigaud, 2015). Estimating a critic through temporal difference methods is similar to regression, but the target function is not stationary: it is itself a function of the estimated critic, thus it is modified each time the critic is modified. This can result in divergence of the critic when the target function and the estimated critic are racing after each other (Baird, 1994). To mitigate this instability, one should keep the target function stationary during several updates and reset it periodically to a new function corresponding to the current critic estimate, switching from one regression problem to another. This idea was first introduced in dqn (Mnih et al., 2015) and then modified from periodic updates to smooth variations in ddpg (Lillicrap et al., 2015).
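The two ways of refreshing the target critic can be sketched as follows, with critics represented as plain lists of weights. The function names, the reset period and the tracking rate `tau` are illustrative assumptions, and the critic update itself is omitted.

```python
def hard_update(critic_weights):
    """Periodic reset: the target becomes a frozen copy of the current critic."""
    return list(critic_weights)


def soft_update(target_weights, critic_weights, tau):
    """Smooth variation: the target slowly tracks the critic at rate tau."""
    return [(1.0 - tau) * t + tau * c
            for t, c in zip(target_weights, critic_weights)]


critic = [1.0, -2.0]
periodic_target = [0.0, 0.0]
smooth_target = [0.0, 0.0]
for step in range(1, 21):
    # A critic update regressing towards the current target would go here.
    if step % 10 == 0:  # hypothetical reset period
        periodic_target = hard_update(critic)
    smooth_target = soft_update(smooth_target, critic, tau=0.1)
```

In both cases the regression target changes more slowly than the critic itself, which is the property that mitigates the "racing" effect described above.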

5.4.3 Sample reuse: the replay buffer

Since they are based on value propagation, TD methods can give rise to more sample reuse than standard regression methods, provided that these samples are saved into a replay buffer. Using a replay buffer is at the heart of the emergence of modern actor-critic approaches in deep RL. Actually, learning from the samples in the order in which they are collected is detrimental to learning performance and stability because these samples are not independent and identically distributed (i.i.d.). Stability is improved by drawing the samples randomly from the replay buffer and sample efficiency can be further improved by better choosing the samples, using prioritized experience replay (Schaul et al., 2015).
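A minimal replay buffer with uniform random sampling can be sketched as follows. The class name, the layout of a step sample and the capacity are assumptions made for the example; prioritized experience replay would replace the uniform draw with a priority-weighted one and is not shown.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of step samples; the oldest samples are dropped first."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform draw without replacement: breaks the temporal correlation
        # between consecutive samples of the same rollout.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0.0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

Because each stored step can be drawn many times before it is evicted, the same interaction with the environment contributes to several critic updates.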

5.4.4 Adaptive step sizes and return lengths

Modern SGD methods provided by most machine learning libraries now incorporate adaptive step sizes, removing a difficulty with previous actor-critic algorithms such as enac. Another important ingredient for the success of some recent methods is the use of n-step returns, which consists in performing temporal difference updates over several time steps, making it possible to control the bias-variance trade-off (see Section 6.3.1).

5.5 Overview of deep RL algorithms

All these favorable properties are common traits of several incremental deep RL algorithms: ddpg (Lillicrap et al., 2015), naf (Gu et al., 2016b), ppo (Schulman et al., 2017), acktr (Wu et al., 2017), sac (Haarnoja et al., 2018), td3 (Fujimoto et al., 2018) and d4pg (Barth-maron et al., 2018). As depicted in Figure 2, the last one, d4pg, is an instance of Bayesian Optimization Reinforcement Learning (BORL) algorithms which derive from BO but belong to the step-based category of methods described in Section 5. These algorithms result from an effort to incorporate Bayesian computations into the deep RL framework, and correspond to a very active trend in the field. Most of these works address discrete actions (Azizzadenesheli et al., 2018; Tang and Kucukelbir, 2017), but d4pg is an exception that derives from adopting a distributional perspective on policy gradient computation, resulting in more accurate estimates of the gradient and better sample efficiency (Bellemare et al., 2017).

Finally, a few algorithms such as acer (Wang et al., 2016b), Q-prop (Gu et al., 2016a) and pgql (O’Donoghue et al., 2016) combine properties of transient and persistent critic methods, and are captured into the more general framework of Interpolated Policy Gradient (ipg) (Gu et al., 2017). For a more detailed description of all these algorithms, we refer the reader to the corresponding papers and to a recent survey (Arulkumaran et al., 2017).

5.6 Reference to the main algorithms

Algorithm Main paper
reinforce (Williams, 1992)
g(po)mdp (Baxter and Bartlett, 2001)
nac (Peters and Schaal, 2008a)
enac (Peters and Schaal, 2008a)
power (Kober and Peters, 2009)
pi (Theodorou et al., 2010)
reps (Peters et al., 2010)
pepg (Sehnke et al., 2010)
vips (Neumann, 2011)

inac (Bhatnagar et al., 2007)
gps (Levine and Koltun, 2013)
trpo (Schulman et al., 2015)
ddpg (Lillicrap et al., 2015)
a3c (Mnih et al., 2016)
naf (Gu et al., 2016b)
acer (Wang et al., 2016b)
Q-prop (Gu et al., 2016a)
pgql (O’Donoghue et al., 2016)
ppo (Schulman et al., 2017)
acktr (Wu et al., 2017)
sac (Haarnoja et al., 2018)
td3 (Fujimoto et al., 2018)
d4pg (Barth-maron et al., 2018)
Table 4: Main reinforcement learning algorithms. Algorithms below the line have not yet been covered in (Deisenroth et al., 2013).

5.7 Sample efficiency analysis

Message 4: Being step-based, deep RL methods are able to use more information from rollouts than episode-based methods. Furthermore, using a replay buffer leads to further sample reuse.

6 Discussion

In the previous sections we have presented methods which: (1) do not build a utility model; (2) learn a utility model: (2a) in the policy parameter space, (2b) in an arbitrary outcome space, (2c) in the state-action space. In this section, we come back to the sample efficiency properties of these different methods. We do so by descending the tree of design choices depicted in Figure 2.

6.1 Building a model or not

We have outlined that policy search methods which build a model of the utility function are generally more sample efficient than methods which do not. However, the reliance of the former on SGD can make them less robust to local optima (Lehman et al., 2017), and it has been shown recently that methods which do not build a model of utility are still competitive in terms of final performance, due to their higher parallelization capability and distinguishing properties with respect to various gradient landscapes (Salimans et al., 2017; Petroski Such et al., 2017; Zhang et al., 2017).

6.2 Building a utility function model in the policy parameter space versus the state-action space.

Several elements speak in favor of the higher sample efficiency of learning a critic in the state-action space. First, it can give rise to more sample reuse than learning a model of the utility function in the policy parameter space. Second, learning from each step separately makes better use of the information available from a rollout than learning from global episodes.

Furthermore, the state-action space may naturally exhibit a hierarchical structure (especially the state), which is not so obvious for the policy parameter space. As a consequence, methods that model utility in the state-action space may benefit from learning intermediate representations at different levels in the hierarchy, thus reducing the dimensionality of the policy search problem. Learning such intermediate and more compact representations is the focus of hierarchical reinforcement learning, a domain which has also been impacted by the emergence of deep RL (Kulkarni et al., 2016; Bacon et al., 2017). Hierarchical reinforcement learning can also be performed off-line, which corresponds to the perspective of the DREAM project (http://www.robotsthatdream.eu/), illustrated for instance in (Zimmer and Doncieux, 2017).

Finally, an important factor of sample efficiency is the size and structure of the policy parameter space with respect to the state-action space. In both respects, the emergence of deep RL methods using large neural networks as policy representation has changed the perspective. First, in deep RL, the size of the policy parameter space can become larger than that of the state-action space, which speaks in favor of learning a critic. Second, deep neural networks seem to generally induce a smooth structure between the policy parameter space and the state-action space, which facilitates learning. Finally, a utility function modeled in a larger space may suffer from fewer local minima, as more directions remain for improving the gradient (Kawaguchi, 2016).

The above conclusions might be mitigated by considering exploration. Indeed, in several surveys about policy search for robotics, policy parameter perturbation methods are considered superior to action perturbation methods (Stulp and Sigaud, 2013; Deisenroth et al., 2013). This analysis is backed up by several mathematical arguments, but it might be true mostly when the policy parameter space is smaller than the state-action space. Until recently, all deep RL methods were using action perturbation. But deep RL algorithms using policy parameter perturbation have recently been published, showing again that one can model the utility function in the state-action space while performing exploration in the policy parameter space (Fortunato et al., 2017; Plappert et al., 2017). Exploration is currently one of the hottest topics in deep RL and directed exploration methods presented in Section 4 may play a key role in this story, despite the lower data efficiency of their policy improvement mechanisms (Conti et al., 2017; Colas et al., 2018).

Message 5: There are more arguments for learning a utility model in the state-action space than in the policy parameter space, but this ultimately depends on the size of these spaces and the structure of their relationship.

6.3 Transient versus persistent critic

At first glance, having a persistent critic may seem superior to having a transient one, for three reasons. First, by avoiding recomputing the critic at each iteration, it is computationally more efficient. Second, immediate updates favor data efficiency because the policy is improved as soon as possible, which in turn helps generate better samples. Third, being based on bootstrap methods, persistent critic methods give rise to more sample reuse. However, these statements must be qualified, as two factors (described below) must be taken into account.

6.3.1 Trading bias against variance

Estimating the utility of a policy in the state-action space is subject to a bias-variance compromise (Kearns and Singh, 2000). On the one hand, estimating the utility of a given policy through Monte Carlo sampling, as is generally done in transient critic approaches, is subject to a variance which grows with the length of the episodes. On the other hand, incrementally updating a persistent critic reduces variance, but may suffer from bias, resulting in potential sub-optimality, or even divergence. Instead of performing bootstrap updates of a critic over one step, one can do so over n steps. The larger n, the closer to Monte Carlo estimation, thus tuning n is a way of controlling the bias-variance compromise. For instance, while the transient critic trpo algorithm is less sample efficient than actor-critic methods but more stable, often resulting in superior performance (Duan et al., 2016), its immediate successor, ppo, uses n-step returns, resulting in a good compromise between both families (Schulman et al., 2017).
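The role of the number of steps n can be made concrete with a small sketch of an n-step return. The function name and the data layout are assumptions: `rewards[k]` is the reward received at step k of a finished episode, and `values[k]` is the critic's estimate at step k, with one extra final entry equal to 0 for the terminal state.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Return estimate from step t: n discounted real rewards, then the
    critic's estimate at the step where the look-ahead stops (bootstrap)."""
    horizon = min(t + n, len(rewards))  # do not look past the end of the episode
    discounted_rewards = sum(gamma ** (k - t) * rewards[k]
                             for k in range(t, horizon))
    return discounted_rewards + gamma ** (horizon - t) * values[horizon]


rewards = [1.0, 1.0, 1.0]
values = [10.0, 10.0, 10.0, 0.0]  # deliberately poor critic; terminal value is 0
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.5)
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
monte_carlo = n_step_return(rewards, values, t=0, n=3, gamma=0.5)
```

With n = 1 the estimate relies mostly on the (here inaccurate) critic, which is the source of bias; once n reaches the episode length the critic no longer contributes and the estimate is the plain Monte Carlo return, which is unbiased but carries the variance of all the sampled rewards.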

6.3.2 Off-policy versus on-policy updates

In on-policy methods such as Sarsa, the samples used to learn the critic must come from the current policy, whereas in off-policy methods such as q-learning, they can come from any policy. In most transient critic methods, the samples are discarded from one iteration to the next and these methods are generally on-policy. By contrast, persistent critic methods using a replay buffer are generally off-policy. The a3c algorithm, an incremental actor-critic method which does not use a replay buffer, is classified as on-policy.
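The distinction shows up in the target used to update the critic, as in the sketch below. The critic `q`, the finite set of candidate actions used to approximate the maximization, and the numerical values are assumptions made for the example; continuous action algorithms replace this maximization by other mechanisms.

```python
def sarsa_target(reward, gamma, q, next_state, next_action):
    """On-policy target: bootstraps on the action the current policy
    actually took in the next state."""
    return reward + gamma * q(next_state, next_action)


def q_learning_target(reward, gamma, q, next_state, candidate_actions):
    """Off-policy target: bootstraps on the best action according to the
    critic, whatever policy generated the sample."""
    return reward + gamma * max(q(next_state, a) for a in candidate_actions)


def q(state, action):
    """Hypothetical critic that prefers actions close to the state value."""
    return -(action - state) ** 2


on_policy = sarsa_target(1.0, 0.9, q, next_state=1.0, next_action=0.0)
off_policy = q_learning_target(1.0, 0.9, q, next_state=1.0,
                               candidate_actions=[0.0, 0.5, 1.0])
```

Since the off-policy target does not depend on the action stored with the sample, samples gathered by older policies can be reused, which is what makes a replay buffer applicable.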

This on-policy versus off-policy distinction is related to the bias-variance compromise. Indeed, when learning a persistent critic incrementally, using off-policy updates is more flexible because the samples can come from any policy, but these off-policy updates introduce bias in the estimation of the critic. As a result, off-policy methods such as ddpg and naf are more sample efficient because they use a replay buffer, but they are also more prone to sub-optimality and divergence. In that respect, a key contribution of acer and Q-prop is that they provide an off-policy, sample efficient update method which strongly controls the bias, resulting in more stability (Gu et al., 2016a; Wang et al., 2016b; Wu et al., 2017; Gu et al., 2017). These aspects are currently the subject of intensive research but the resulting algorithms suffer from being more complex, with additional meta-parameters.

Message 6: Persistent critic methods are superior to transient critic methods in many respects, but the latter are more stable because they decorrelate the problem of estimating the utility function from the problem of descending its gradient, and they suffer from less bias.

7 Conclusion

In this paper, we have contrasted various approaches to policy search, from evolutionary methods which do not learn a model of the utility function to deep RL methods which do so in the state-action space.

In (Stulp and Sigaud, 2013), the authors have shown that policy search applied to robotics was shifting from actor-critic methods to evolutionary methods. Part of this shift was due to the use of open-loop dmps (Ijspeert et al., 2013) as a policy representation, which favors episode-based approaches, but another part resulted from the higher stability and efficiency of evolutionary methods at that time.

The emergence of deep RL methods has changed this perspective. It should be clear from this survey that, in the context of large problems where deep neural network representations are now the standard option, deep RL is generally more sample efficient than deep neuroevolution methods, as empirically confirmed in (de Froissard de Broissia and Sigaud, 2016) and (Pourchot et al., 2018). The higher sample efficiency of deep RL methods, and particularly actor-critic architectures with a persistent critic, results from several mechanisms. They benefit from better approximation capability of non-linear critics and the incorporation of an adapted step size in SGD, they model the utility function in the state-action space, and they benefit from massive sample reuse by using a replay buffer. Using a target network has also mitigated the intrinsic instability of incrementally approximating a critic. However, it is important to acknowledge that incremental deep RL methods still suffer from significant instability (as outlined at https://www.alexirpan.com/2018/02/14/rl-hard.html).

7.1 Future directions

The field of policy search is currently the object of an intense race for increased performance, stability and sample efficiency. We now outline what we currently consider as promising research directions.

7.1.1 More analyses than competitions

Up to now, the main trend in the literature focuses on performance comparisons (Duan et al., 2016; Islam et al., 2017; Henderson et al., 2017; Petroski Such et al., 2017) showing that, despite their lower sample efficiency, methods which do not build a model of utility are still a competitive alternative in terms of final performance (Salimans et al., 2017; Chrabaszcz et al., 2018). But stability and sample efficiency comparisons are missing and works analyzing the reasons why an algorithm performs better than another are only just emerging (Lehman et al., 2017; Zhang et al., 2017; Gangwani and Peng, 2017). By drawing an overview of the whole field and revealing some important factors behind sample efficiency, this paper was intended as a starting point towards broader and deeper analyses of the efficacy of various policy search methods.

7.1.2 More combinations than competitions

An important trend corresponds to the emergence of methods which combine algorithms from various families described above. As already noted in Section 4, directed exploration methods are often combined with evolutionary or deep RL methods (Conti et al., 2017; Colas et al., 2018). There is also an emerging trend combining evolutionary or population-based methods and deep RL methods (Jaderberg et al., 2017; Khadka and Tumer, 2018; Pourchot and Sigaud, 2018) which seem to be able to take the best of both worlds. We believe we are just at the beginning of such combinations and that this area has a lot of potential.

7.1.3 Beyond single policy improvement

Though we decided to keep lifelong, continual and open-ended learning outside the scope of this survey, we must mention that fast progress in policy improvement has favored an important tendency to address several tasks at the same time (Yang and Hospedales, 2014). This subfield is extremely active at the moment, with many works in multitask learning (Vezhnevets et al., 2017; Veeriah et al., 2018; Gangwani and Peng, 2018), hierarchical reinforcement learning (Levy et al., 2018; Nachum et al., 2018) and meta reinforcement learning (Wang et al., 2016a), to cite only a few.

Finally, because we focused on these elementary aspects, we have left aside the emerging topic of state representation learning (Jonschkowski and Brock, 2014; Raffin et al., 2016; Lesort et al., 2018) or using auxiliary tasks for improving deep RL (Shelhamer et al., 2016; Jaderberg et al., 2016; Riedmiller et al., 2018). The impact of these methods should be made clearer in the future.

7.2 Final word

As we have highlighted in the article, research in policy search and deep RL moves at a very high pace. Therefore, forecasting future trends, as we have done above, is risky, and even attempts to analyze the factors underlying current trends may be quickly outdated, but this is also what makes this such an exciting research field.


Olivier Sigaud was supported by the European Commission, within the DREAM project, and has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement 640891. Freek Stulp was supported by the HGF project “Reduced Complexity Models”. We thank David Filliat, Nicolas Perrin and Pierre-Yves Oudeyer for their feedback on this article.