Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies

09/10/2015 ∙ by David Balduzzi, et al. ∙ Victoria University of Wellington

This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.




1 Introduction

In reinforcement learning, an agent learns to maximize its discounted future rewards (Sutton and Barto, 1998). The structure of the environment is initially unknown, so the agent must both learn the rewards associated with various action-sequence pairs and optimize its policy. A natural approach is to tackle the subproblems separately via a critic and an actor (Barto et al., 1983; Konda and Tsitsiklis, 2000), where the critic estimates the value of different actions and the actor maximizes rewards by following the policy gradient (Sutton et al., 1999; Peters and Schaal, 2006; Silver et al., 2014). Policy gradient methods have proven useful in settings with high-dimensional continuous action spaces, especially when task-relevant policy representations are at hand (Deisenroth et al., 2011; Levine et al., 2015; Wahlström et al., 2015).

We tackle the problem of learning actor (policy) and critic representations. In the supervised setting, representation or deep learning algorithms have recently demonstrated remarkable performance on a range of benchmark problems. However, the problem of learning features for reinforcement learning remains comparatively underdeveloped. The most dramatic recent success uses $Q$-learning over finite action spaces, and essentially builds a neural network critic (Mnih et al., 2015). Here, we consider continuous action spaces, and develop an algorithm that simultaneously learns the value function and its gradient, which it then uses to find the optimal policy.

1.1 Outline

This paper presents Value-Gradient Backpropagation (GProp), a deep actor-critic algorithm for continuous action spaces with compatible function approximation. Our starting point is the deterministic policy gradient and associated compatibility conditions derived in (Silver et al., 2014). Roughly speaking, the compatibility conditions are that

  1. the critic approximates the gradient of the value-function and

  2. the approximation is closely related to the gradient of the policy.

See Theorem 2 for details. We identify and solve two problems with prior work on policy gradients – relating to the two compatibility conditions:

  1. Temporal difference methods do not directly estimate the gradient of the value function.
    Instead, temporal difference methods are applied to learn an approximation of the form $Q^{\mathbf v}(\mathbf s) + A^{\mathbf w}(\mathbf s,\mathbf a)$, where $Q^{\mathbf v}(\mathbf s)$ estimates the value of a state, given the current policy, and $A^{\mathbf w}(\mathbf s,\mathbf a)$ estimates the advantage from deviating from the current policy (Sutton et al., 1999; Peters and Schaal, 2006; Deisenroth et al., 2011; Silver et al., 2014). Although the advantage is related to the gradient of the value function, it is not the same thing.

  2. The representations used for compatible approximation scale badly on neural networks.
    The second problem is that prior work has been restricted to advantage functions constructed from a particular state-action representation, $\phi(\mathbf s,\mathbf a)$, that depends on the gradient of the policy. The representation is easy to handle for linear policies. However, if the policy is a neural network, then the standard state-action representation ties the critic too closely to the actor and depends on the internal structure of the actor; see Example 2. As a result, weight updates cannot be performed by backpropagation; see section 5.5.

The paper makes three novel contributions. The first two contributions relate directly to problems P1 and P2. The third is a new task designed to test the accuracy of gradient estimates.

Method to directly learn the gradient of the value function.

The first contribution is to modify temporal difference learning so that it directly estimates the gradient of the value-function. The gradient perturbation trick, Lemma 3, provides a way to simultaneously estimate both the value of a function at a point and its gradient, by perturbing the function’s input with uncorrelated Gaussian noise.

Plugging in a neural network instead of a linear estimator extends the trick to the problem of learning a function and its gradient over the entire state-action space. Moreover, the trick combines naturally with temporal difference methods, Theorem 5, and is therefore well-suited to applications in reinforcement learning.

Deviator-Actor-Critic (DAC) model with compatible function approximation.

The second contribution is to propose the Deviator-Actor-Critic (DAC) model, Definition 2, consisting of three coupled neural networks, and Value-Gradient Backpropagation (GProp), Algorithm 1, which backpropagates three different signals to train the three networks. The main result, Theorem 6, is that GProp has compatible function approximation when implemented on the DAC model, provided the neural networks consist of linear and rectilinear units. (Footnote 1: The proof also holds for maxpooling, weight-tying and other features of convnets. A description of how closely related results extend to convnets is provided in (Balduzzi, 2015).)

The proof relies on decomposing the Actor-network into individual units that are considered as actors in their own right, based on ideas in (Srivastava et al., 2014; Balduzzi, 2015). It also suggests interesting connections to work on structural credit assignment in multiagent reinforcement learning (Agogino and Tumer, 2004, 2008; HolmesParker et al., 2014).

Contextual bandit task to probe the accuracy of gradient estimates.

A third contribution, which may be of independent interest, is a new contextual bandit setting designed to probe the ability of reinforcement learning algorithms to estimate gradients. A supervised-to-contextual bandit transform was proposed in (Dudík et al., 2014) as a method for turning classification datasets into $K$-armed contextual bandit datasets.

We are interested in the continuous setting in this paper. We therefore adapt their transform with a twist. The SARCOS and Barrett datasets from robotics have features corresponding to the positions, velocities and accelerations of seven joints and labels corresponding to their torques. There are 7 joints in both cases, so the feature and label spaces are 21 and 7 dimensional respectively. The datasets are traditionally used as regression benchmarks labeled SARCOS1 through SARCOS7 where the task is to predict the torque of a single joint – and similarly for Barrett.

We convert the two datasets into two continuous contextual bandit tasks where the reward signal is the negative distance to the correct 7-dimensional label. The algorithm is thus “told” that the label lies on a sphere in a 7-dimensional space. The missing information required to pin down the label’s position is precisely the gradient. For an algorithm to make predictions that are competitive with fully supervised methods, it is necessary to find extremely accurate gradient estimates.
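The transform can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names `bandit_reward` and `make_bandit` and the toy data are hypothetical, and the Euclidean norm is assumed as the distance.

```python
import math

def bandit_reward(action, label):
    """Reward is the negative Euclidean distance between action and hidden label."""
    return -math.dist(action, label)

def make_bandit(dataset):
    """Turn (features, label) pairs into (context, reward_fn) pairs.

    The label is hidden inside reward_fn; the learner only ever sees rewards.
    """
    return [(x, lambda a, y=y: bandit_reward(a, y)) for x, y in dataset]

# Hypothetical toy data: 3-dimensional contexts, 2-dimensional labels.
toy = [([0.1, 0.2, 0.3], [1.0, -1.0]), ([0.0, 0.5, 0.9], [0.0, 2.0])]
tasks = make_bandit(toy)
context, reward_fn = tasks[0]
```

A reward of `-c` only says that the label lies on a sphere of radius `c` around the chosen action; the direction towards the label has to be inferred from how the reward changes under small changes to the action.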


Section 6 evaluates the performance of GProp on the contextual bandit problems described above and on the challenging octopus arm task (Engel et al., 2005). We show that GProp is able to simultaneously solve seven nonparametric regression problems without observing any labels – instead using the distance between its actions and the correct labels. It turns out that GProp is competitive with recent fully supervised learning algorithms on the task. Finally, we evaluate GProp on the octopus arm benchmark, where it achieves the best performance reported to date.

1.2 Related work

An early reinforcement learning algorithm for neural networks is REINFORCE (Williams, 1992). A disadvantage of REINFORCE is that the entire network is trained with a single scalar signal.

Our proposal builds on ideas introduced with deep $Q$-learning (Mnih et al., 2015), such as replay. However, deep $Q$-learning is restricted to finite action spaces, whereas we are concerned with continuous action spaces.

Policy gradients were introduced in (Sutton et al., 1999) and have been used extensively (Kakade, 2001; Peters and Schaal, 2006; Deisenroth et al., 2011). The deterministic policy gradient was introduced in (Silver et al., 2014), which also proposed the algorithm COPDAC-Q. The relationship between GProp and COPDAC-Q is discussed in detail in section 5.5.

An alternate approach, based on the idea of backpropagating the gradient of the value function, is developed in (Jordan and Jacobs, 1990; Prokhorov and Wunsch, 1997; Wang and Si, 2001; Hafner and Riedmiller, 2011; Fairbank and Alonso, 2012; Fairbank et al., 2013). Unfortunately, these algorithms do not have compatible function approximation in general, so there are no guarantees on actor-critic interactions. See section 5.5 for further discussion.

The analysis used to prove compatible function approximation relies on decomposing the Actor neural network into a collection of agents corresponding to the units in the network. The relation between GProp and the difference-based objective proposed for multiagent learning (Agogino and Tumer, 2008; HolmesParker et al., 2014) is discussed in section 5.4.

1.3 Notation

We use boldface to denote vectors, subscripts for time, and superscripts for individual units in a network. Sets of parameters are capitalized ($\Theta$, $W$, $V$) when they refer to matrices or to the parameters of neural networks.

2 Deterministic Policy Gradients

This section recalls previous work on policy gradients. The basic idea is to simultaneously train an actor and a critic. The critic learns an estimate of the value of different policies; the actor then follows the gradient of the value-function to find an optimal (or locally optimal) policy in terms of expected rewards.

2.1 The Policy Gradient Theorem

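The environment is modeled as a Markov Decision Process consisting of state space $\mathcal S$, action space $\mathcal A \subset \mathbb R^d$, initial distribution $p_1(\mathbf s)$ on states, stationary transition distribution $p(\mathbf s_{t+1} \mid \mathbf s_t, \mathbf a_t)$ and reward function $r: \mathcal S \times \mathcal A \to \mathbb R$. A policy is a function $\mu_\theta: \mathcal S \to \mathcal A$ from states to actions. We will often add noise to policies, causing them to be stochastic. In this case, the policy is a function $\mu_\theta: \mathcal S \to \Delta_{\mathcal A}$, where $\Delta_{\mathcal A}$ is the set of probability distributions on actions.

Let $p_t(\mathbf s \to \mathbf s', \mu)$ denote the distribution on states $\mathbf s'$ at time $t$ given policy $\mu$ and initial state $\mathbf s$ at $t = 0$ and let $\rho^\mu(\mathbf s') := \int_{\mathcal S} \sum_{t=1}^{\infty} \gamma^{t-1} p_1(\mathbf s)\, p_t(\mathbf s \to \mathbf s', \mu)\, d\mathbf s$. Let $r^\gamma_t := \sum_{\tau = t}^{\infty} \gamma^{\tau - t}\, r(\mathbf s_\tau, \mathbf a_\tau)$ be the discounted future reward. Define the

value of a state-action pair: $Q^\mu(\mathbf s, \mathbf a) := \mathbb E\big[r^\gamma_1 \mid \mathbf s_1 = \mathbf s, \mathbf a_1 = \mathbf a; \mu\big]$ (1)
value of a policy: $J(\mu) := \mathbb E_{\mathbf s \sim \rho^\mu}\big[Q^\mu(\mathbf s, \mu(\mathbf s))\big]$ (2)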

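The aim is to find the policy with maximal value. A natural approach is to follow the gradient (Sutton et al., 1999), which in the deterministic case can be computed explicitly:

Theorem 1 (policy gradient)

Under reasonable assumptions on the regularity of the Markov Decision Process the policy gradient can be computed as

$$\nabla_\theta J(\mu_\theta) = \mathbb E_{\mathbf s \sim \rho^\mu}\Big[\nabla_\theta \mu_\theta(\mathbf s) \cdot \nabla_{\mathbf a} Q^\mu(\mathbf s, \mathbf a)\big|_{\mathbf a = \mu_\theta(\mathbf s)}\Big]. \tag{3}$$

Proof. See (Silver et al., 2014).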

2.2 Linear Compatible Function Approximation

Since the agent does not have direct access to the value function $Q^\mu$, it must instead learn an estimate $Q^{\mathbf w} \approx Q^\mu$. A sufficient condition under which plugging an estimate $Q^{\mathbf w}$ into the policy gradient yields an unbiased estimator was first proposed in (Sutton et al., 1999).

A sufficient condition in the deterministic setting is:

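Theorem 2 (compatible value function approximation)

The value-estimate $Q^{\mathbf w}(\mathbf s, \mathbf a)$ is compatible with the policy gradient, that is

$$\nabla_\theta J(\mu_\theta) = \mathbb E_{\mathbf s \sim \rho^\mu}\Big[\nabla_\theta \mu_\theta(\mathbf s) \cdot \nabla_{\mathbf a} Q^{\mathbf w}(\mathbf s, \mathbf a)\big|_{\mathbf a = \mu_\theta(\mathbf s)}\Big], \tag{4}$$

if the following conditions hold:

  C1. $Q^{\mathbf w}$ approximates the value gradient:
    The weights learned by the approximate value function must satisfy $\mathbf w = \arg\min_{\mathbf w'} \ell_{grad}(\theta, \mathbf w')$, where

$$\ell_{grad}(\theta, \mathbf w) := \mathbb E_{\mathbf s \sim \rho^\mu}\Big[\big\| \nabla_{\mathbf a} Q^{\mathbf w}(\mathbf s, \mathbf a)\big|_{\mathbf a = \mu_\theta(\mathbf s)} - \nabla_{\mathbf a} Q^{\mu}(\mathbf s, \mathbf a)\big|_{\mathbf a = \mu_\theta(\mathbf s)} \big\|^2\Big] \tag{5}$$

    is the mean-square difference between the gradient of the true value function $Q^\mu$ and the approximation $Q^{\mathbf w}$.

  C2. $Q^{\mathbf w}$ is policy-compatible:
    The gradients of the value-function and the policy must satisfy

$$\nabla_{\mathbf a} Q^{\mathbf w}(\mathbf s, \mathbf a)\big|_{\mathbf a = \mu_\theta(\mathbf s)} = \nabla_\theta \mu_\theta(\mathbf s)^\top \mathbf w. \tag{6}$$

Proof. See (Silver et al., 2014).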

Having stated the compatibility condition, it is worth revisiting the problems that we propose to tackle in the paper. The first problem is to directly estimate the gradient of the value function, as required by Eq. (5) in condition C1. The standard approach used in the literature is to estimate the value function, or the closely related advantage function, using temporal difference learning, and then compute the derivative of the estimate. The next section shows how the gradient can be estimated directly.

The second problem relates to the compatibility condition on policy and value gradients required by Eq. (6) in condition C2. The only function approximation satisfying C2 that has been proposed is

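Example 1 (standard value function approximation)

Let $\phi(\mathbf s)$ be an $m$-dimensional feature representation on states and set $\phi(\mathbf s, \mathbf a) := \nabla_\theta \mu_\theta(\mathbf s) \cdot (\mathbf a - \mu_\theta(\mathbf s))$. Then the value function approximation

$$Q^{\mathbf v, \mathbf w}(\mathbf s, \mathbf a) := \underbrace{\langle \phi(\mathbf s, \mathbf a), \mathbf w \rangle}_{\text{advantage}} + \underbrace{\langle \phi(\mathbf s), \mathbf v \rangle}_{\text{value}} \tag{7}$$

satisfies condition C2 of Theorem 2.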
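To see why, here is a short worked check. It assumes the standard form from (Silver et al., 2014), in which the advantage term is linear in the action and $\nabla_\theta \mu_\theta(\mathbf s)$ denotes the Jacobian of the policy with one row per parameter; the symbols are illustrative rather than taken verbatim from the paper.

```latex
\begin{aligned}
Q^{\mathbf v,\mathbf w}(\mathbf s,\mathbf a)
  &= \big\langle \nabla_\theta \mu_\theta(\mathbf s)\,(\mathbf a - \mu_\theta(\mathbf s)),\ \mathbf w \big\rangle
     + \langle \phi(\mathbf s), \mathbf v \rangle \\
  &= (\mathbf a - \mu_\theta(\mathbf s))^\top \nabla_\theta \mu_\theta(\mathbf s)^\top \mathbf w
     + \langle \phi(\mathbf s), \mathbf v \rangle, \\
\nabla_{\mathbf a}\, Q^{\mathbf v,\mathbf w}(\mathbf s,\mathbf a)
  &= \nabla_\theta \mu_\theta(\mathbf s)^\top \mathbf w .
\end{aligned}
```

The state-value term does not depend on the action, so it drops out of the gradient, and what remains is exactly the right-hand side required by condition C2.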

The approximation in Example 1 encounters serious problems when applied to deep policies, see discussion in section 5.5.

3 Learning Value Gradients

In this section, we tackle the first problem by modifying temporal-difference (TD) learning so that it directly estimates the gradient of the value function. First, we develop a new approach to estimating the gradient of a black-box function at a point, based on perturbing the function with Gaussian noise. It turns out that the approach extends easily to learning the gradient of a black-box function across its entire domain. Moreover, it is easy to combine with neural networks and temporal difference learning.

3.1 Estimating the gradient of an unknown function at a point

Gradient estimates have been intensively studied in bandit problems, where rewards (or losses) are observed but labels are not. Thus, in contrast to supervised learning where it is possible to compute the gradient of the loss, in bandit problems the gradient must be estimated. More formally, consider the following setup.

Definition 1 (zeroth-order black-box)

A function $f: \mathbb R^d \to \mathbb R$ is a zeroth-order black-box if it can only be queried for zeroth-order information. That is, a user can request the value $f(\mathbf x)$ of $f$ at any point $\mathbf x$, but cannot request the gradient of the function.

We use the shorthand black-box in what follows.

The black-box model for optimization was introduced in (Nemirovski and Yudin, 1983), see (Raginsky and Rakhlin, 2011) for a recent exposition. In those papers, a black-box consists in a first-order oracle that can provide both zeroth-order information (the value of the function) and first-order information (the gradient or subgradient of the function).

Remark 1 (reward function is a black-box; value function is not)

The reward function is a black box since Nature does not provide gradient information. The value function is not even a black-box: it cannot be queried directly since it is defined as the expected discounted future reward. It is for this reason the gradient perturbation trick must be combined with temporal difference learning, see section 3.4.

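An important insight is that the gradient of an unknown function $f$ at a specific point can be estimated by perturbing its input (Flaxman et al., 2005). For example, for small $\delta > 0$ the gradient of $f: \mathbb R^d \to \mathbb R$ is approximately $\nabla f(\mathbf x) \approx \frac{d}{\delta}\, \mathbb E\big[f(\mathbf x + \delta \mathbf u)\, \mathbf u\big]$, where the expectation is over vectors $\mathbf u$ sampled uniformly from the unit sphere.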

The following lemma provides a simple method for estimating the gradient of a function at a point based on Gaussian perturbations:

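Lemma 3 (gradient perturbation trick)

The gradient of differentiable $f: \mathbb R^d \to \mathbb R$ at $\mathbf x$ is

$$\nabla f(\mathbf x) = \lim_{\sigma^2 \to 0}\ \arg\min_{G \in \mathbb R^d} \Big\{ \min_{Q \in \mathbb R}\ \mathbb E_{\epsilon \sim N(0, \sigma^2 I_d)}\Big[\big(f(\mathbf x + \epsilon) - \langle G, \epsilon \rangle - Q\big)^2\Big] \Big\}. \tag{8}$$

Proof. By taking sufficiently small variance, we can assume that $f$ is locally linear. Setting $Q = f(\mathbf x)$ yields a line through the origin. It therefore suffices to consider the special case $f(\epsilon) = \langle \mathbf v, \epsilon \rangle$. Setting

$$\hat G := \arg\min_{G \in \mathbb R^d}\ \mathbb E_{\epsilon}\Big[\tfrac12 \big(\langle \mathbf v, \epsilon \rangle - \langle G, \epsilon \rangle\big)^2\Big],$$

we are required to show that $\hat G = \mathbf v$. The problem is convex, so setting the gradient to zero requires solving $\mathbb E\big[\langle \mathbf v - G, \epsilon \rangle \cdot \epsilon\big] = 0$, which reduces to solving the set of linear equations

$$\sum_{i=1}^{d} (v^i - G^i)\, \mathbb E[\epsilon^i \epsilon^j] = (v^j - G^j)\, \mathbb E[(\epsilon^j)^2] = (v^j - G^j)\, \sigma^2 = 0 \quad \text{for all } j.$$

The first equality holds since $\mathbb E[\epsilon^i \epsilon^j] = 0$ for $i \neq j$. It follows immediately that $\hat G = \mathbf v$.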
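The trick can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it replaces the expectation with a finite sample of Gaussian perturbations and solves the resulting least-squares problem over the gradient estimate and the baseline through the normal equations. The helper names `solve_linear` and `perturbation_gradient` and the test function are hypothetical.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def perturbation_gradient(f, x, sigma=0.01, n_samples=2000, seed=0):
    """Regress f(x + eps) on [eps, 1]; the eps-coefficients estimate the gradient."""
    rng = random.Random(seed)
    d = len(x)
    A = [[0.0] * (d + 1) for _ in range(d + 1)]   # normal-equation matrix
    b = [0.0] * (d + 1)                           # normal-equation right-hand side
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, sigma) for _ in range(d)]
        y = f([xi + ei for xi, ei in zip(x, eps)])
        feats = eps + [1.0]
        for i in range(d + 1):
            b[i] += feats[i] * y
            for j in range(d + 1):
                A[i][j] += feats[i] * feats[j]
    sol = solve_linear(A, b)
    return sol[:d], sol[d]   # (gradient estimate G, baseline Q)

# Hypothetical black box with known gradient (2 * x0, 3).
G, Q = perturbation_gradient(lambda x: x[0] ** 2 + 3.0 * x[1], [1.0, 2.0])
```

Only function values are queried, never derivatives. The baseline absorbs the value of the function at the query point, which is what allows the linear term to isolate the gradient.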

3.2 Learning gradients across a range

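The solution to the optimization problem in Eq. (8) is the gradient $\nabla f(\mathbf x)$ of $f$ at a particular $\mathbf x$. The next step is to learn a function $G^W: \mathbb R^d \to \mathbb R^d$ that approximates the gradient across a range of values.

More precisely, given a sample $\{\mathbf x_k\}$ of points, we aim to find

$$W^* := \arg\min_{W}\ \sum_k \big\| \nabla f(\mathbf x_k) - G^W(\mathbf x_k) \big\|^2.$$

The next lemma considers the case where $Q^{\mathbf v}$ and $G^W$ are linear estimates, of the form $Q^{\mathbf v}(\mathbf x) := \langle \phi(\mathbf x), \mathbf v \rangle$ and $G^W(\mathbf x) := W \cdot \psi(\mathbf x)$ for fixed representations $\phi$ and $\psi$.

Lemma 4 (gradient learning)

Let $f: \mathbb R^d \to \mathbb R$ be a differentiable function. Suppose that $\phi: \mathbb R^d \to \mathbb R^m$ and $\psi: \mathbb R^d \to \mathbb R^n$ are representations such that there exists an $m$-vector $\mathbf v^*$ and a $(d \times n)$-matrix $W^*$ satisfying $f(\mathbf x_k) = \langle \phi(\mathbf x_k), \mathbf v^* \rangle$ and $\nabla f(\mathbf x_k) = W^* \cdot \psi(\mathbf x_k)$ for all $\mathbf x_k$ in the sample.

If we define loss function

$$\ell(W, \mathbf v, \mathbf x, \sigma) := \mathbb E_{\epsilon \sim N(0, \sigma^2 I_d)}\Big[\big(f(\mathbf x + \epsilon) - \langle G^W(\mathbf x), \epsilon \rangle - Q^{\mathbf v}(\mathbf x)\big)^2\Big],$$

then

$$W^* = \lim_{\sigma^2 \to 0}\ \arg\min_{W}\ \min_{\mathbf v}\ \sum_k \ell(W, \mathbf v, \mathbf x_k, \sigma).$$

Proof. Follows from Lemma 3.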

In short, the lemma reduces gradient estimation to a simple optimization problem given a good enough representation. Jumping ahead slightly to section 4, we ensure that our model has good enough representations by constructing two neural networks to learn them. The first neural network, $Q^V$, learns an approximation to $f$ that plays the role of the baseline $Q$. The second neural network, $G^W$, learns an approximation to the gradient.

3.3 Temporal difference learning

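Recall that $Q^\mu(\mathbf s, \mathbf a)$ is the expected value of a state-action pair given policy $\mu$. It is never observed directly, since it is computed by discounting over future rewards. TD-learning is a popular approach to estimating $Q^\mu$ through dynamic programming (Sutton and Barto, 1998).

We quickly review TD-learning. Let $\phi: \mathcal S \to \mathbb R^m$ be a fixed representation. The goal is to find a value-estimate

$$Q^{\mathbf v}(\mathbf s) := \langle \phi(\mathbf s), \mathbf v \rangle,$$

where $\mathbf v$ is an $m$-dimensional vector, that is as close as possible to the true value function. If the value-function were known, we could simply minimize the mean-square error with respect to $\mathbf v$:

$$\ell_{MSE}(\mathbf v) := \mathbb E_{(\mathbf s, \mathbf a) \sim (\rho^\mu, \mu)}\Big[\big(Q^\mu(\mathbf s, \mathbf a) - Q^{\mathbf v}(\mathbf s)\big)^2\Big].$$

Unfortunately, it is impossible to minimize the mean-square error directly, since the value-function is the expected discounted future reward, rather than the reward. That is, the value function is not provided explicitly by the environment, not even as a black-box. The Bellman error is therefore used as a substitute for the mean-square error:

$$\ell_{BE}(\mathbf v) := \mathbb E_{(\mathbf s, \mathbf a) \sim (\rho^\mu, \mu)}\Big[\big(r(\mathbf s, \mathbf a) + \gamma\, Q^{\mathbf v}(\mathbf s') - Q^{\mathbf v}(\mathbf s)\big)^2\Big],$$

where $\mathbf s'$ is the state subsequent to $\mathbf s$.

Let $\delta_t := r_t + \gamma\, Q^{\mathbf v_t}(\mathbf s_{t+1}) - Q^{\mathbf v_t}(\mathbf s_t)$ be the TD-error. TD-learning updates $\mathbf v$ according to

$$\mathbf v_{t+1} \leftarrow \mathbf v_t + \eta_t \cdot \delta_t \cdot \nabla_{\mathbf v} Q^{\mathbf v_t}(\mathbf s_t) = \mathbf v_t + \eta_t \cdot \delta_t \cdot \phi(\mathbf s_t), \tag{17}$$

where $\eta_t$ is a sequence of learning rates. The convergence properties of TD-learning and related algorithms have been studied extensively, see (Tsitsiklis and Roy, 1997; Dann et al., 2014).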
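The update is easy to trace on a toy problem. The following is a minimal sketch of TD(0) with linear function approximation, assuming a constant learning rate and a hypothetical two-state cycle with one-hot features; `dot`, `td0`, `phi` and `step` are illustrative names, not part of the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td0(phi, step, s0, gamma=0.5, eta=0.1, n_steps=4000):
    """Linear TD(0): v <- v + eta * delta * phi(s), with delta the TD-error."""
    v = [0.0] * len(phi(s0))
    s = s0
    for _ in range(n_steps):
        r, s_next = step(s)
        delta = r + gamma * dot(v, phi(s_next)) - dot(v, phi(s))
        v = [vi + eta * delta * fi for vi, fi in zip(v, phi(s))]
        s = s_next
    return v

# Hypothetical two-state cycle: 0 -> 1 pays reward 1, 1 -> 0 pays reward 0.
phi = lambda s: [1.0, 0.0] if s == 0 else [0.0, 1.0]
step = lambda s: (1.0, 1) if s == 0 else (0.0, 0)
v = td0(phi, step, s0=0)
```

For this chain the Bellman equations are $v_0 = 1 + \gamma v_1$ and $v_1 = \gamma v_0$, so with $\gamma = 0.5$ the estimate should approach $v_0 = 4/3$ and $v_1 = 2/3$.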

3.4 Temporal difference gradient (TDG) learning

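Finally, we apply temporal difference methods to estimate the gradient of the value function, as required by condition C1 of Theorem 2. (Footnote 2: Residual gradient (RG) and gradient temporal difference (GTD) methods were introduced in (Baird, 1995; Sutton et al., 2009a, b). The similar names may be confusing. RG and GTD methods are TD methods derived from gradient descent. In contrast, we develop a TD-based approach to learning gradients. The two approaches are thus complementary and straightforward to combine. However, in this paper we restrict to extending vanilla TD to learning gradients.) We are interested in gradient approximations of the form

$$G^W(\mathbf s) := W \cdot \psi(\mathbf s),$$

where $\psi: \mathcal S \to \mathbb R^n$ and $W$ is a $(d \times n)$-dimensional matrix. The goal is to find $W^*$ such that $G^{W^*}(\mathbf s) \approx \nabla_{\mathbf a} Q^\mu(\mathbf s, \mathbf a)\big|_{\mathbf a = \mu(\mathbf s)}$ for all sampled state-action pairs.

It is convenient to introduce the notation $Q^{\mathbf v, W}(\mathbf s, \epsilon) := Q^{\mathbf v}(\mathbf s) + \langle G^W(\mathbf s), \epsilon \rangle$. Then, analogously to the mean-square error, define the perturbed gradient error:

$$\ell_{PGE}(\mathbf v, W, \sigma^2) := \mathbb E_{\mathbf s, \epsilon}\Big[\big(Q^\mu(\mathbf s, \mu(\mathbf s) + \epsilon) - Q^{\mathbf v}(\mathbf s) - \langle G^W(\mathbf s), \epsilon \rangle\big)^2\Big].$$

Given a good enough representation, Lemma 4 guarantees that minimizing the perturbed gradient error yields the gradient of the value function. Unfortunately, as discussed above, the value function cannot be queried directly. We therefore introduce the Bellman gradient error as a proxy

$$\ell_{BGE}(\mathbf v, W, \sigma^2) := \mathbb E_{\mathbf s, \epsilon}\Big[\big(r(\mathbf s, \mu(\mathbf s) + \epsilon) + \gamma\, Q^{\mathbf v}(\mathbf s') - Q^{\mathbf v}(\mathbf s) - \langle G^W(\mathbf s), \epsilon \rangle\big)^2\Big].$$

Set the TDG-error as

$$\xi_t := r_t + \gamma\, Q^{\mathbf v_t}(\mathbf s_{t+1}) - Q^{\mathbf v_t}(\mathbf s_t) - \langle G^{W_t}(\mathbf s_t), \epsilon \rangle$$

and, analogously to Eq. (17), define the TDG-updates

$$\mathbf v_{t+1} \leftarrow \mathbf v_t + \eta_t \cdot \xi_t \cdot \phi(\mathbf s_t), \qquad W_{t+1} \leftarrow W_t + \eta_t \cdot \xi_t \cdot \epsilon \otimes \psi(\mathbf s_t),$$

where $\epsilon \otimes \psi(\mathbf s)$ is the $(d \times n)$ matrix given by the outer product. We refer to $\xi \cdot \epsilon$ as the perturbed TDG-error.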

The following extension theorem allows us to import guarantees from temporal-difference learning to temporal-difference gradient learning.

Theorem 5 (zeroth to first-order extension)

Guarantees on TD-learning extend to TDG-learning.

The idea is to reformulate TDG-learning as TD-learning, with a slightly different reward function and function approximation. Since the function approximation is still linear, any guarantees on convergence for TD-learning transfer automatically to TDG-learning.

First, we incorporate into the state-action pair. Define and


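Second, we define a dot product on matrices of equal size by flattening them down to vectors. More precisely, given two matrices $A$ and $B$ of the same dimension $(m \times n)$, define the dot-product $\langle A, B \rangle := \sum_{i,j} A_{ij} B_{ij}$. It is easy to see that

$$\langle G^W(\mathbf s), \epsilon \rangle = \langle W \cdot \psi(\mathbf s), \epsilon \rangle = \langle W, \epsilon \otimes \psi(\mathbf s) \rangle.$$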


The TDG-error can then be rewritten as


where is a linear function approximation.

If we are in a setting where TD-learning is guaranteed to converge to the value-function, it follows that TDG-learning is also guaranteed to converge – since it is simply a different linear approximation. Thus, and the result follows by Lemma 4.

4 Algorithm: Value-Gradient Backpropagation

This section presents our model, which consists of three coupled neural networks that learn to estimate the value function, its gradient, and the optimal policy respectively.

Definition 2 (deviator-actor-critic)

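The deviator-actor-critic (DAC) model consists of three neural networks:

  • actor-network with policy $\mu_\Theta: \mathcal S \to \mathcal A \subset \mathbb R^d$;

  • critic-network, $Q^V: \mathcal S \to \mathbb R$, that estimates the value function; and

  • deviator-network, $G^W: \mathcal S \to \mathbb R^d$, that estimates the gradient of the value function.

Gaussian noise is added to the policy during training, resulting in actions $\mathbf a = \mu_\Theta(\mathbf s) + \epsilon$ where $\epsilon \sim N(0, \sigma^2 I_d)$. The outputs of the critic and deviator are combined as

$$Q^{V,W}(\mathbf s, \mu_\Theta(\mathbf s) + \epsilon) := Q^V(\mathbf s) + \langle G^W(\mathbf s), \epsilon \rangle.$$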
The Gaussian noise plays two roles. Firstly, it controls the explore/exploit tradeoff by controlling the extent to which Actor deviates from its current optimal policy. Secondly, it controls the “resolution” at which Deviator estimates the gradient.

The three networks are trained by backpropagating three different signals. Critic, Deviator and Actor backpropagate the TDG-error, the perturbed TDG-error, and Deviator’s gradient estimate respectively; see Algorithm 1. An explicit description of the weight updates of individual units is provided in Appendix A.

Deviator estimates the gradient of the value-function with respect to deviations from the current policy. Backpropagating the gradient through Actor makes it possible to estimate the influence of Actor-parameters on the value function as a function of their effect on the policy.

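for rounds $t = 1, 2, \ldots$ do
       Network gets state $\mathbf s_t$, responds $\mathbf a_t = \mu_{\Theta_t}(\mathbf s_t) + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_d)$, gets reward $r_t$
       $\xi_t \leftarrow r_t + \gamma\, Q^{V_t}(\mathbf s_{t+1}) - Q^{V_t}(\mathbf s_t) - \langle G^{W_t}(\mathbf s_t), \epsilon \rangle$    // compute TDG-error
       $\Theta_{t+1} \leftarrow \Theta_t + \eta^A_t \cdot \nabla_\Theta \mu_{\Theta_t}(\mathbf s_t) \cdot G^{W_t}(\mathbf s_t)$    // backpropagate $G^W$
       $V_{t+1} \leftarrow V_t + \eta^C_t \cdot \xi_t \cdot \nabla_V Q^{V_t}(\mathbf s_t)$    // backpropagate $\xi$
       $W_{t+1} \leftarrow W_t + \eta^D_t \cdot \xi_t \cdot \nabla_W \langle G^{W_t}(\mathbf s_t), \epsilon \rangle$    // backpropagate $\xi \cdot \epsilon$
Algorithm 1 Value-Gradient Backpropagation (GProp).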
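To make the three updates concrete, here is a minimal sketch on a hypothetical one-dimensional contextual bandit (so $\gamma = 0$). It makes several assumptions that are not in the paper: all three models are linear rather than neural networks, the optimal action is $2s$, the reward is the negative squared error, and the learning rates are hand-picked. The function name `gprop_bandit` is illustrative.

```python
import random

def gprop_bandit(n_rounds=30000, sigma=0.3, eta_a=0.002, eta_c=0.1, eta_d=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0       # Actor:    mu(s) = theta * s
    v = [0.0, 0.0]    # Critic:   Q(s)  = v[0] + v[1] * s**2
    w = 0.0           # Deviator: G(s)  = w * s
    for _ in range(n_rounds):
        s = rng.uniform(-1.0, 1.0)       # context
        eps = rng.gauss(0.0, sigma)      # exploration noise
        a = theta * s + eps              # perturbed action
        r = -(a - 2.0 * s) ** 2          # toy reward, optimal action is 2 * s
        q = v[0] + v[1] * s * s
        g = w * s
        xi = r - q - g * eps             # TDG-error with gamma = 0
        theta += eta_a * s * g           # Actor backpropagates Deviator's estimate
        v[0] += eta_c * xi               # Critic backpropagates the TDG-error
        v[1] += eta_c * xi * s * s
        w += eta_d * xi * eps * s        # Deviator backpropagates the perturbed TDG-error
    return theta, v, w

theta, v, w = gprop_bandit()
```

The Actor never sees the reward or the TDG-error directly; it only receives the Deviator's gradient estimate. In this toy setting the Actor parameter should drift towards 2, the slope of the optimal policy.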

Critic and Deviator learn representations suited to estimating the value function and its gradient respectively. Note that even though the gradient is a linear function at a point, it can be a highly nonlinear function in general. Similarly, Actor learns a policy representation.

We set the learning rates of Critic and Deviator to be equal in the experiments in section 6. However, the perturbation has the effect of slowing down and stabilizing Deviator updates:

Remark 2 (stability)

The magnitude of Deviator’s weight updates depends on $\epsilon \sim N(0, \sigma^2 I_d)$ since they are computed by backpropagating the perturbed TDG-error $\xi \cdot \epsilon$. Thus as $\sigma^2 \to 0$, Deviator’s learning rate essentially tends to zero. In general, Deviator learns more slowly than Critic. This has a stabilizing effect on the policy since Actor is insulated from Critic: its weight updates only depend (directly) on the output of Deviator.

5 Analysis: Deep Compatible Function Approximation

Our main result is that the deviator’s value gradient is compatible with the policy gradient of each unit in the actor-network – considered as an actor in its own right:

Theorem 6 (deep compatible function approximation)

Suppose that all units are rectilinear or linear. Then for each Actor-unit in the Actor-network there exists a reparametrization of the value-gradient approximator, $G^W$, that satisfies the compatibility conditions in Theorem 2.

The actor-network is thus a collection of interdependent agents that individually follow the correct policy gradients. The experiments below show that they also collectively converge on useful behaviors.

Overview of the proof.

The next few subsections prove Theorem 6. We provide a brief overview before diving into the details.

Guarantees for temporal difference learning and policy gradients are typically based on the assumption that the value-function approximation is a linear function of the learned parameters. However, we are interested in the case where Actor, Critic and Deviator are all neural networks, and are therefore highly nonlinear functions of their parameters. The goal is thus to relate the representations learned by neural networks to the prior work on linear function approximations.

To do so, we build on the following observation, implicit in (Srivastava et al., 2014):

Remark 3 (active submodels)

A neural network of linear and rectilinear units can be considered as a set of submodels, corresponding to different subsets of units. The active submodel at time $t$ consists of the active units (that is, the linear units and the rectifiers that do not output 0).

The active submodel has two important properties:

  • it is a linear function from inputs to outputs, since rectifiers are linear when active, and

  • at each time step, learning only occurs over the active submodels, since only active units update their weights.

The feedforward sweep of a rectifier network can thus be disentangled into two steps (Balduzzi, 2015). The first step, which is highly nonlinear, applies a gating operation that selects the active submodel – by rendering various units inactive. The second step computes the output of the neural network via matrix multiplication. It is important to emphasize that although the active submodel is a linear function from inputs to outputs, it is not a linear function of the weights.
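The two-step view can be illustrated on a tiny rectifier network. The sketch below uses hypothetical weights and helper names (`relu_forward`, `active_submodel`): it first records which hidden units are active, then rebuilds the output from the single matrix obtained by keeping only those units.

```python
def relu_forward(W1, W2, x):
    """Two-layer network: rectified hidden units followed by linear outputs."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    hidden = [max(0.0, p) for p in pre]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return out, [p > 0 for p in pre]   # output and gating pattern

def active_submodel(W1, W2, gates):
    """Matrix of the linear map computed by the active hidden units only."""
    n_in = len(W1[0])
    return [[sum(row[j] * W1[j][i] for j in range(len(W1)) if gates[j])
             for i in range(n_in)] for row in W2]

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, -1.0]]   # 3 hidden rectifiers, 2 inputs
W2 = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]       # 2 linear outputs
x = [2.0, 0.5]

out, gates = relu_forward(W1, W2, x)           # step 1: gating selects the submodel
A = active_submodel(W1, W2, gates)
linear_out = [sum(a * xi for a, xi in zip(row, x)) for row in A]   # step 2: matrix multiply
```

Here the third hidden unit is inactive, so the network output coincides with the plain matrix product `A x`. The matrix `A` itself is a product of weights, which is why the submodel is linear in the input but not in the weights.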

The strategy of the proof is to decompose the Actor-network into an interacting collection of agents, referred to as Actor-units. That is, we model each unit in the Actor-network as an Actor in its own right. On each time step that an Actor-unit is active, it interacts with the Deviator-submodel corresponding to the current active submodel of the Deviator-network. The proof shows that each Actor-unit has compatible function approximation.

5.1 Error backpropagation on rectilinear neural networks

First, we recall some basic facts about backpropagation in the case of rectilinear units. Recent work has shown that replacing sigmoid functions with rectifiers improves the performance of neural networks (Nair and Hinton, 2010; Glorot et al., 2011; Zeiler et al., 2013; Dahl et al., 2013).

Let us establish some notation. The output of a rectifier with weight vector $\mathbf w$ is

$$f_{\mathbf w}(\mathbf x) := \max(0, \langle \mathbf w, \mathbf x \rangle).$$

The rectifier is active if $\langle \mathbf w, \mathbf x \rangle > 0$. We use rectifiers because they perform well in practice and have the nice property that units are linear when they are active. The rectifier subgradient is the indicator function, equal to 1 if the unit is active and 0 otherwise.

Consider a neural network of units, each equipped with a weight vector $\mathbf w^j$. Hidden units are rectifiers; output units are linear. There are $n$ units in total. It is convenient to combine all the weight vectors into a single object; let $W = (\mathbf w^1, \ldots, \mathbf w^n) \in \mathbb R^N$ where $N$ is the total number of weights. The network is a function $F_W: \mathbb R^m \to \mathbb R^d$.

The network has error function $E(F_W(\mathbf x), \mathbf y)$ with gradient $\mathbf g = \nabla_{out} E$. Let $x^j$ denote the output of unit $j$ and $\phi^j(\mathbf x)$ denote its input, so that $x^j = \max(0, \langle \mathbf w^j, \phi^j(\mathbf x) \rangle)$ for a rectifier. Note that $\phi^j$ depends on $W$ (specifically, the weights of lower units) but this is suppressed from the notation.

Definition 3 (influence)

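The influence of unit $j$ on unit $k$ at time $t$ is $\pi^{j,k}_t := \partial x^k_t / \partial x^j_t$ (Balduzzi et al., 2015). The influence of unit $j$ on the output layer is the vector $\pi^j_t := (\pi^{j,k}_t)_{k \in out}$.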

The following lemma summarizes an analysis of the feedforward and feedback sweep of neural nets.

Lemma 7 (structure of neural network gradients)

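The following properties hold:

  1. Influence.
    A path is active at time $t$ if all units on the path are firing. The influence of $j$ on $k$ is the sum of products of weights over all active paths from $j$ to $k$:

$$\pi^{j,k}_t = \sum_{\text{active paths } j \to k}\ \prod_{(\alpha \to \beta) \text{ on the path}} w^{\alpha\beta},$$

    where $\alpha, \beta$ refer to units along the path from $j$ to $k$.

  2. Output decomposition.
    The output of a neural network decomposes, relative to the output of unit $j$, as

$$F_W(\mathbf x) = \pi^j \cdot x^j + \pi^{-j} \cdot \mathbf x,$$

    where $\pi^{-j}$ is the $(d \times m)$-matrix whose $(k, i)$ entry is the sum over all active paths from input unit $i$ to output unit $k$ that do not intersect unit $j$.

  3. Output gradient.
    Fix an input $\mathbf x$ and consider the network as a function $W \mapsto F_W(\mathbf x)$ from parameters to outputs whose gradient is an $(N \times d)$-matrix. The $(ij)$-entry of the gradient is the input to the unit times its influence:

$$\frac{\partial F_W(\mathbf x)}{\partial w^{ij}} = \phi^{ij}(\mathbf x) \cdot \pi^j \ \text{ if unit } j \text{ is active, and } 0 \text{ otherwise.}$$

  4. Backpropagated error.
    Fix $\mathbf x, \mathbf y$ and consider the function $W \mapsto E(F_W(\mathbf x), \mathbf y)$. Let $\mathbf g = \nabla_{out} E$.

    The gradient of the error function is

$$\frac{\partial E}{\partial w^{ij}} = \Big\langle \mathbf g, \frac{\partial F_W(\mathbf x)}{\partial w^{ij}} \Big\rangle = \delta^j \cdot \phi^{ij}(\mathbf x),$$

    where the backpropagated error signal $\delta^j$ received by unit $j$ decomposes as $\delta^j = \langle \mathbf g, \pi^j \rangle$.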

Direct computation.

The lemma holds generically for networks of rectifier and linear units. We apply it to actor, critic and deviator networks below.
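The influence, output decomposition and output gradient properties can be checked numerically on a small network. The sketch below assumes a single hidden layer of rectifiers with linear outputs, so the influence of an active hidden unit is simply its column of output weights. The weights and the finite-difference check are illustrative, not taken from the paper.

```python
def forward(W1, W2, x):
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    hidden = [max(0.0, p) for p in pre]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2], hidden

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, -1.0]]   # hypothetical hidden weights
W2 = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]       # hypothetical output weights
x = [2.0, 0.5]
out, hidden = forward(W1, W2, x)

j = 1   # hidden unit under study
# Influence of unit j on the output layer (zero if the unit is inactive).
influence = [row[j] if hidden[j] > 0 else 0.0 for row in W2]

# Output decomposition: contribution through unit j plus all paths avoiding j.
rest = [sum(row[k] * hidden[k] for k in range(len(hidden)) if k != j) for row in W2]
decomposed = [p * hidden[j] + r for p, r in zip(influence, rest)]

# Output gradient w.r.t. weight W1[j][i]: input x[i] times the influence of unit j.
i, h = 0, 1e-6
W1_shift = [row[:] for row in W1]
W1_shift[j][i] += h
out_shift, _ = forward(W1_shift, W2, x)
finite_diff = [(a - b) / h for a, b in zip(out_shift, out)]
analytic = [x[i] * p for p in influence]
```

The finite-difference estimate agrees with the analytic expression up to discretization error, and `decomposed` reproduces the network output.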

5.2 A minimal DAC model

This subsection proves condition C1 of compatible function approximation for a minimal, linear Deviator-Actor-Critic model. The next subsection shows how the minimal model arises at the level of Actor-units.

Definition 4 (minimal model)

The minimal model of a Deviator-Actor-Critic consists in an Actor with linear policy , where is an -vector and is a noisy scalar. The Critic and Deviator together output:


where is an -vector, is a scalar, and is simply scalar multiplication.

The Critic in the minimal model is standard. However, the Deviator has been reduced to almost nothing: it learns a single scalar parameter, , that is used to train the actor. The minimal model is thus too simple to be much use as a standalone algorithm.

Lemma 8 (compatible function approximation for the minimal model)

There exists a reparametrization of the gradient estimate of the minimal model such that compatibility condition C1 in Theorem 2 is satisfied:


Let and construct . Clearly,


Observe that and that, similarly,


as required.

5.3 Proof of Theorem 6

The proof proceeds by showing that the compatibility conditions in Theorem 2 hold for each Actor-unit. The key step is to relate the Actor-units to the minimal model introduced above.

Lemma 9 (reduction to minimal model)

Actor-units in a DAC neural network are equivalent to minimal model Actors.

Let denote the influence of unit on the output layer of the Actor-network at time . When unit is active, Lemma 7ab implies we can write , where is the sum over all active paths from the input to the output of the Actor-network that do not intersect unit .

Following Remark 3, the active subnetwork of the Deviator-network at time $t$ is a linear transform which, by abuse of notation, we denote by


Combine the last two points to obtain


Observe that is a -vector. We have therefore reduced Actor-unit ’s interaction with the Deviator-network to copies of the minimal model.

Theorem 6 follows from combining the above Lemmas.

Compatibility condition C1 follows from Lemmas 8 and 9. Compatibility condition C2 holds since the Critic and Deviator minimize the Bellman gradient error with respect to and which also, implicitly, minimizes the Bellman gradient error with respect to the corresponding reparametrized ’s for each Actor-unit.

Theorem 6 shows that each Actor-unit satisfies the conditions for compatible function approximation and so follows the correct gradient when performing weight updates.

5.4 Structural credit assignment for multiagent learning

It is interesting to relate our approach to the literature on multiagent reinforcement learning (Guestrin et al., 2002; Agogino and Tumer, 2004, 2008). In particular, (HolmesParker et al., 2014) consider the structural credit assignment problem within populations of interacting agents: How to reward individual agents in a population for rewards based on their collective behavior? They propose to train agents within populations with a difference-based objective of the form


where is the objective function to be maximized; and are the system variables that are and are not under the control of agent respectively, and is a fixed counterfactual action.
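To make the idea concrete, the following is a minimal sketch of a difference-based objective. It assumes the form commonly used in the difference-reward literature, in which an agent's score is the global objective minus the global objective with that agent's action replaced by a fixed counterfactual. The global objective `G`, the joint action `z`, and the counterfactual value `0.0` are hypothetical and chosen only for illustration.

```python
def difference_objective(G, z, i, c_i):
    """Global objective minus the objective with agent i's action set to c_i."""
    counterfactual = list(z)
    counterfactual[i] = c_i
    return G(z) - G(counterfactual)


def G(z):
    # Hypothetical global objective: the summed actions should equal 1.0.
    return -(sum(z) - 1.0) ** 2


z = [0.5, 0.25, 0.5]  # joint action of three agents (toy values)
d = [difference_objective(G, z, i, 0.0) for i in range(len(z))]
```

In this toy setting the second agent receives a negative score, since removing its contribution would bring the sum exactly onto the target, whereas the other two agents score zero.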

In our setting, the gradient used by Actor-unit to update its weights can be described explicitly:

Lemma 10 (local policy gradients)

Actor-unit follows policy gradient


where is Deviator’s estimate of the directional derivative of the value function in the direction of Actor-unit ’s influence.

Follows from Lemma 7b.

Notice that in Eq. (40). It follows that training the Actor-network via GProp causes the Actor-units to optimize the difference-based objective, without needing to compute the difference explicitly. Although the topic is beyond the scope of the current paper, it is worth exploring how suitably adapted variants of backpropagation can be applied to reinforcement learning problems in the multiagent setting.

5.5 Comparison with related work

Comparison with .

Extending the standard value function approximation in Example 1 to the setting where Actor is a neural network yields the following representation, which is used in (Silver et al., 2014) when applying to the octopus arm task:

Example 2 (extension of standard value approximation to neural networks)

Let and be an Actor and Critic neural network respectively. Suppose the Actor-network has parameters (i.e. the total number of entries in ). It follows that the Jacobian is an -matrix.

The value function approximation is then


where is an -vector.

Weight updates under , with the function approximation above, are therefore as described in Algorithm 2.

for rounds  do
       Network gets state , responds where , gets reward
Algorithm 2 Compatible Deterministic Actor-Critic ().
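As a rough illustration of one round of such an update, the sketch below follows the update rules for the compatible deterministic actor-critic as given in Silver et al. (2014), specialised to a linear Actor so that the Jacobian can be written out by hand. The linear Actor, the linear state baseline, the shared step size `lr`, the discount `gamma`, and all function names are assumptions made for illustration, not the implementation used in the experiments.

```python
def actor(theta, s):
    """Linear Actor: mu(s)[j] = sum_i theta[i][j] * s[i]."""
    return [sum(theta[i][j] * s[i] for i in range(len(s)))
            for j in range(len(theta[0]))]


def jacobian_t_times(w, s):
    """J^T w for the linear Actor, a d-vector: sum_i s[i] * w[i][j]."""
    return [sum(s[i] * w[i][j] for i in range(len(s)))
            for j in range(len(w[0]))]


def q_value(theta, w, v, s, a):
    """Q(s, a) = (a - mu(s))^T J^T w + v^T s."""
    mu = actor(theta, s)
    jt_w = jacobian_t_times(w, s)
    advantage = sum((a[j] - mu[j]) * jt_w[j] for j in range(len(mu)))
    baseline = sum(v[i] * s[i] for i in range(len(s)))
    return advantage + baseline


def update(theta, w, v, s, a, r, s_next, gamma=0.9, lr=0.1):
    mu = actor(theta, s)
    m, d = len(s), len(mu)
    # TD-error, bootstrapping from the Actor's own action in the next state.
    q_next = q_value(theta, w, v, s_next, actor(theta, s_next))
    delta = r + gamma * q_next - q_value(theta, w, v, s, a)
    jt_w = jacobian_t_times(w, s)
    # Actor: theta <- theta + lr * J (J^T w).
    new_theta = [[theta[i][j] + lr * s[i] * jt_w[j] for j in range(d)]
                 for i in range(m)]
    # Advantage weights: w <- w + lr * delta * J (a - mu(s)).
    new_w = [[w[i][j] + lr * delta * s[i] * (a[j] - mu[j]) for j in range(d)]
             for i in range(m)]
    # Baseline weights: v <- v + lr * delta * s.
    new_v = [v[i] + lr * delta * s[i] for i in range(m)]
    return new_theta, new_w, new_v, delta


theta = [[0.0], [0.0]]  # m = 2 state features, d = 1 action dimension
w = [[1.0], [0.0]]      # one advantage weight per Actor parameter
v = [0.0, 0.0]
theta2, w2, v2, delta = update(theta, w, v, s=[1.0, 0.0], a=[0.5], r=1.0,
                               s_next=[0.0, 1.0])
```

Note that `w` has exactly the same shape as `theta`: the advantage function needs one weight per Actor parameter, and both updates pass through the Actor's Jacobian.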

Let us compare GProp with , considering the three updates in turn:

  • Actor updates.
    Under GProp, the Actor backpropagates the value-gradient estimate. In contrast under the Actor performs a complicated update that combines the policy gradient with the advantage function’s weights – and differs substantively from backprop.

  • Deviator / advantage-function updates.
    Under GProp, the Deviator backpropagates the perturbed TDG-error. In contrast, uses the gradient of the Actor to update the weight vector of the advantage function.

    By Lemma 7d, backprop takes the form where is a -vector. In contrast, the advantage function requires computing , where is an -vector. Although the two formulae appear superficially similar, they carry very different computational costs.

    The first consequence is that the parameters of must exactly line up with those of the policy. The second consequence is that, by Lemma 7c, the advantage function requires access to


    where is the input from unit to unit . Thus, the advantage function requires access to the input and the influence of every unit in the Actor-network.

  • Critic updates.
    The critic updates for the two algorithms are essentially identical, with the TD-error replaced with the TDG-error.

In short, the approximation in Example 2 that is used by is not well-adapted to deep learning. The main reason is that learning the advantage function requires coupling the vector with the parameters of the actor.

Comparison with computing the gradient of the value-function approximation.

Perhaps the most natural approach to estimating the gradient is to simply estimate the value function, and then use its gradient as an estimate of the derivative (Jordan and Jacobs, 1990; Prokhorov and Wunsch, 1997; Wang and Si, 2001; Hafner and Riedmiller, 2011; Fairbank and Alonso, 2012; Fairbank et al., 2013). The main problem with this approach is that, to date, it has not been shown that the resulting updates of the Critic and the Actor are compatible.

There are also no guarantees that the gradient of the Critic will be a good approximation to the gradient of the value function, although it is intuitively plausible. The problem becomes particularly severe when the value-function is estimated via a neural network that uses activation functions that are not smooth, such as rectifiers. Rectifiers are becoming increasingly popular due to their superior empirical performance (Nair and Hinton, 2010; Glorot et al., 2011; Zeiler et al., 2013; Dahl et al., 2013).

6 Experiments

We evaluate GProp on three tasks: two highly nonlinear contextual bandit tasks constructed from benchmark datasets for nonparametric regression, and the octopus arm.

We do not evaluate GProp on other standard reinforcement learning benchmarks such as Mountain Car, Pendulum or Puddle World, since these can already be handled by linear actor-critic algorithms. The contribution of GProp is the ability to learn representations suited to nonlinear problems.

Cloning and replay.

Temporal difference learning can be unstable when run over a neural network. A recent innovation introduced in (Mnih et al., 2015) that stabilizes TD-learning is to clone a separate network to compute the targets . The parameters of the cloned network are updated periodically.

We implement a similar modification of the TDG-error in Algorithm 1. We also use experience replay (Mnih et al., 2015). GProp is well-suited to replay, since the critic and deviator can learn values and gradients over the full range of previously observed state-action pairs offline.
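The two stabilizers can be sketched in a few lines. The snippet below is a generic illustration rather than the Theano implementation used in the experiments; the class names, the buffer capacity, and the refresh period are hypothetical.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly for offline updates."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state):
        self.items.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.items), min(batch_size, len(self.items)))


class ClonedTarget:
    """Frozen copy of the online parameters, refreshed every `period` steps."""

    def __init__(self, params, period):
        self.period = period
        self.steps = 0
        self.params = copy.deepcopy(params)

    def step(self, online_params):
        self.steps += 1
        if self.steps % self.period == 0:
            self.params = copy.deepcopy(online_params)
        return self.params


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0.0, 1.0, t + 1)

online = {"w": [0.0]}
target = ClonedTarget(online, period=2)
online["w"][0] = 1.0                     # the online network keeps learning
after_one = target.step(online)["w"][0]  # clone still holds the stale value
after_two = target.step(online)["w"][0]  # clone refreshed on the second step
```

Targets computed from the clone stay fixed between refreshes, which is what keeps the bootstrapped error from chasing its own updates.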

Cloning and replay were also applied to . Both algorithms were implemented in Theano (Bergstra et al., 2010; Bastien et al., 2012).

6.1 Contextual Bandit Tasks

Figure 1: Performance on contextual bandit tasks. The rewards (negative normalized test MSE) for 10 runs are shown and averaged (thick lines). Performance variation for GProp is barely visible. Epochs refer to multiples of the dataset; algorithms are ultimately trained on the same number of random samples for both datasets.

The goal of the contextual bandit tasks is to probe the ability of reinforcement learning algorithms to accurately estimate gradients. The experimental setting may thus be of independent interest.


We converted two robotics datasets, SARCOS and Barrett WAM, into contextual bandit problems via the supervised-to-contextual-bandit transform in (Dudík et al., 2014). The datasets have 44,484 and 12,000 training points respectively, both with 21 features corresponding to the positions, velocities and accelerations of seven joints. Labels are 7-dimensional vectors corresponding to the torques of the 7 joints.

In the contextual bandit task, the agent samples 21-dimensional state vectors i.i.d. from either the SARCOS or Barrett training data and executes 7-dimensional actions. The reward is the negative mean-square distance from the action to the label. Note that the reward is a scalar, whereas the correct label is a 7-dimensional vector. The gradient of the reward


is the direction from the action to the correct label. In the supervised setting, the gradient can be computed. In the bandit setting, the reward is a zeroth-order black box.

The agent thus receives far less information in the bandit setting than in the fully supervised setting. Intuitively, the negative distance “tells” the algorithm that the correct label lies on the surface of a sphere in the 7-dimensional action space that is centred on the most recent action. By contrast, in the supervised setting, the algorithm is given the position of the label in the action space. In the bandit setting, the algorithm must estimate the position of the label on the surface of the sphere. Equivalently, the algorithm must estimate the label’s direction relative to the center of the sphere – which is given by the gradient of the value function.
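The difference between the two settings can be illustrated with a toy example. The sketch below assumes the reward is the negative squared distance averaged over action dimensions (the exact normalization is an assumption), uses a 3-dimensional action instead of 7, and compares the analytic gradient, which needs the label, with a finite-difference estimate that only queries the scalar reward. All names and values are hypothetical.

```python
def reward(action, label):
    """Negative mean-square distance from the action to the label (a scalar)."""
    d = len(action)
    return -sum((a - y) ** 2 for a, y in zip(action, label)) / d


def reward_gradient(action, label):
    """Analytic gradient w.r.t. the action: (2 / d) * (label - action)."""
    d = len(action)
    return [2.0 * (y - a) / d for a, y in zip(action, label)]


def finite_difference(action, label, eps=1e-6):
    """Zeroth-order estimate built from reward queries alone."""
    grad = []
    for k in range(len(action)):
        up = list(action)
        down = list(action)
        up[k] += eps
        down[k] -= eps
        grad.append((reward(up, label) - reward(down, label)) / (2 * eps))
    return grad


label = [1.0, -2.0, 0.5]   # hidden from the agent in the bandit setting
action = [0.0, 0.0, 0.0]
g = reward_gradient(action, label)      # needs the label (supervised)
g_fd = finite_difference(action, label)  # needs only scalar rewards (bandit)
```

The finite-difference estimate costs two reward queries per action dimension for every state, which is not available to a bandit agent that sees each sampled state only once; this is why the gradient has to be learned rather than probed.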

The goal of the contextual bandit task is thus to simultaneously solve seven nonparametric regression problems when observing distances-to-labels instead of directly observing labels. The value function is relatively easy to learn in the contextual bandit setting since the task is not sequential. However, both the value function and its gradient are highly nonlinear, and it is precisely the gradient that specifies where labels lie on the spheres.

Network architectures.

GProp and were implemented on actor and deviator networks of two layers (300 and 100 rectifiers) each and a critic with two hidden layers of 100 and 10 rectifiers. Updates were computed via RMSProp with momentum. The variance of the Gaussian noise was set to decrease linearly from until reaching at which point it remained fixed.
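A schedule of this kind can be written as a short helper. The sketch below is illustrative only: the function name and the start value, end value, and number of decay steps used in the example are hypothetical and are not the settings used in the experiments.

```python
def noise_variance(step, start, end, decay_steps):
    """Linearly anneal from `start` to `end` over `decay_steps`, then hold at `end`."""
    if step >= decay_steps:
        return end
    frac = step / decay_steps
    return start + frac * (end - start)


# Hypothetical values: anneal from 1.0 to 0.5 over 10 steps.
schedule = [noise_variance(t, 1.0, 0.5, 10) for t in (0, 5, 10, 100)]
```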


Figure 1 compares the test-set performance of policies learned by GProp against . The final policies trained by GProp achieved average mean-square test error of 0.013 and 0.014 on the seven SARCOS and Barrett benchmarks respectively.

Remarkably, GProp is competitive with fully-supervised nonparametric regression algorithms on the SARCOS and Barrett datasets, see Figure 2bc in (Nguyen-Tuong et al., 2008) and the results in (Kpotufe and Boularias, 2013; Trivedi et al., 2014). It is important to note that the results reported in those papers are for algorithms that are given the labels and that solve one regression problem at a time. To the best of our knowledge, there are no prior examples of a bandit or reinforcement learning algorithm that is competitive with fully supervised methods on regression datasets.

For comparison, we implemented Backprop on the Actor-network under full supervision. Backprop converged to 0.006 and 0.005 on SARCOS and BARRETT, compared to 0.013 and 0.014 for GProp. Note that Backprop is trained on 7-dim labels whereas GProp receives 1-dim rewards.

Figure 2: Gradient estimates for contextual bandit tasks. The normalized MSE of the gradient estimates compared against the true gradients, i.e. , are shown for 10 runs of and GProp, along with their averages (thick lines).

Accuracy of gradient-estimates.

The true value-gradients can be computed and compared with the algorithm’s estimates on the contextual bandit task. Fig. 2 shows the performance of the two algorithms. GProp’s gradient-error converges to on both tasks. ’s gradient estimate, implicit in the advantage function, converges to 0.03 (SARCOS) and 0.07 (BARRETT). This confirms that GProp yields significantly better gradient estimates.

’s estimates are significantly worse for Barrett compared to SARCOS, in line with the worse performance of on Barrett in Fig. 1. It is unclear why ’s gradient estimate gets worse on Barrett for some period of time. On the other hand, since there are no guarantees on ’s estimates, its erratic behavior is perhaps not surprising.

Comparison with bandit task in (Silver et al., 2014).

Note that although the contextual bandit problems investigated here are lower-dimensional (with 21-dimensional state spaces and 7-dimensional action spaces) than the bandit problem in (Silver et al., 2014) (with no state space and 10, 25 and 50-dimensional action spaces), they are nevertheless much harder. The optimal action in the bandit problem, in all cases, is the constant vector consisting of only 4s. In contrast, SARCOS and BARRETT are nontrivial benchmarks even when fully supervised.

6.2 Octopus Arm

The octopus arm task is a challenging environment that is high-dimensional, sequential and highly nonlinear.


The objective is to learn to hit a target with a simulated octopus arm (Engel et al., 2005). Settings are taken from (Silver et al., 2014). Importantly, the action-space is not simplified using “macro-actions”. The arm has 6 compartments attached to a rotating base. There are 50 state variables (x, y position/velocity of nodes along the upper/lower side of the arm; angular position/velocity of the base) and 20 action variables controlling the clockwise and counter-clockwise rotation of the base and three muscles per compartment.

After each step, the agent receives a reward of , where is the change in distance between the arm and the target. The final reward is if the agent hits the target. An episode ends when the target is hit or after 300 steps.

The arm initializes at eight positions relative to the target: . See Appendix B for more details.

Figure 3: Performance on octopus arm task. Ten runs of GProp and on a 6-segment octopus arm with 20 action and 50 state dimensions. Thick lines depict average values. Left panel: number of steps/episode for the arm to reach the target. Right panel: corresponding average rewards/step.

Network architectures.

We applied GProp to an actor-network with hidden rectifiers and linear output units clipped to lie in ; and critic and deviator networks both with two hidden layers of and rectifiers, and linear output units. Updates were computed via RMSProp with step rate of , moving average decay, with Nesterov momentum (Hinton et al., 2012) penalty of