Introduction
Reinforcement and imitation learning using deep neural networks have achieved impressive results on a wide range of tasks spanning manipulation [levine2016end, levine2014learning], locomotion [silver2014deterministic], games [mnih2015human, silver2016mastering], and autonomous driving [ho2016generative, li2017inferring]. Model-free methods search for optimal policies without explicitly modeling the system’s dynamics [williams1992simple, schulman2015trust]. Most model-free algorithms build an estimate of the policy gradient by sampling trajectories from the environment and perform gradient ascent. However, these methods either suffer from very high sample complexity, due to the generally large variance of the estimator, or are restricted to policies with few parameters.
On the other hand, model-based reinforcement learning methods learn an explicit model of the dynamics of the system. A policy can then be optimized under this model by backpropagating the reward signal through the learned dynamics. While these approaches can greatly improve sample efficiency, the dynamics model needs to be carefully handcrafted for each task. Recently, hybrid approaches, at the interface of model-free and model-based methods, have attempted to balance sample efficiency with generalizability, with notable success in robotics [levine2016end]. Existing methods, however, are limited to continuous action spaces [levine2014learning, heess2015learning].
In this work, we introduce a hybrid reinforcement learning algorithm for deterministic policy optimization that is applicable to continuous and discrete action spaces. Starting from a class of deterministic policies, we relax the corresponding policy optimization problem to one over a carefully chosen set of stochastic policies under approximate dynamics. This relaxation allows the derivation of a novel policy gradient estimator, which combines pathwise derivative and score function estimators. This enables incorporating model-based assumptions while remaining applicable to discrete action spaces. We then perform gradient-based optimization on this larger class of policies while slowly annealing stochasticity, converging to an optimal deterministic solution. We additionally justify and bound the dynamics approximation under certain assumptions. Finally, we complement this estimator with a scalable method to estimate the dynamics, first introduced in [levine2014learning], and with a general extension to non-differentiable rewards, rendering it applicable to a large class of tasks. Our contributions are as follows:

We introduce a novel deterministic policy optimization method that leverages a model of the dynamics and can be applied to any action space, whereas existing methods are limited to continuous action spaces. We also provide theoretical guarantees for our approximation.

We show how our estimator can be applied to a broad class of problems by extending it to discrete rewards and by using a sample-efficient dynamics estimation method [levine2014learning].

We show that this method successfully optimizes complex neural network deterministic policies without additional variance reduction techniques. We present experiments on tasks with discrete action spaces where model-based or hybrid methods are not applicable. On these tasks, we show significant gains in terms of sample complexity, ranging between and .
Related Work
Sample efficiency is a key metric for RL methods, especially when used on real-world physical systems. Model-based methods improve sample efficiency at the cost of defining and learning task-specific dynamics models, while model-free methods typically require significantly more samples but are more broadly applicable.
Model-based methods approximate dynamics using various classes of models, spanning from Gaussian processes [deisenroth2011pilco] to neural networks [fu2016one]. Assuming the reward function is known and differentiable, the policy gradient can be computed exactly by differentiating through the dynamics model, thus enabling gradient-based optimization. [heess2015learning] extends this idea to stochastic dynamics and policies by expressing randomness as an exogenous variable and making use of the “reparameterization trick”. This requires differentiability of the dynamics function w.r.t. both state and action, limiting its applicability to continuous action spaces. On the other hand, model-free algorithms are broadly applicable but require significantly more samples as well as variance reduction techniques through control variates [mnih2016asynchronous, ho2016model] or trust regions [schulman2015trust] to converge in reasonable time.
Recently, hybrid approaches have attempted to balance sample efficiency with applicability. [levine2014learning] locally approximates dynamics to fit locally optimal policies, which then serve as supervision for global policies. A good dynamics model also enables generating artificial trajectories to limit strain on the simulator [gu2016continuous]. However, these works are once again limited to continuous action spaces. Our work can also be considered a hybrid algorithm, one that handles discrete action spaces as well and improves sample efficiency on them.
Background
In this section, we first present the canonical RL formalism. We then review the score function and pathwise derivative estimators, and their respective advantages. We illustrate them first with the REINFORCE estimator [williams1992simple], and then with a standard method for model-based policy optimization that backpropagates through the dynamics equation.
Notations and definitions
Let and denote the state and action spaces, respectively. refers to a deterministic dynamics function i.e. . A deterministic policy is a function , and a stochastic policy is a conditional distribution over actions given state, denoted . For clarity, when considering parameterized policies, stochastic ones will be functions of , and deterministic ones functions of . Throughout this work, dynamics will be considered deterministic.
We consider the standard RL setting where an agent interacts with an environment and obtains rewards for its actions. The agent’s goal is to maximize its expected cumulative reward. Formally, there exists an initial distribution over and a collection of reward functions with . is sampled from , and at each step, the agent is presented with and chooses an action according to a policy . is then computed as . In the finite horizon setting, the episode ends when . The agent is then provided with the cumulative reward, . The goal of the agent is to find a policy that maximizes the expected cumulative reward .
Score Function Estimator and Pathwise Derivative
We now review two approaches for estimating the gradient of an expectation. Let be a probability distribution over a measurable set and . We are interested in obtaining an estimator of the following quantity: .
Score Function Estimator
The score function estimator relies on the ‘log-trick’, i.e. the following identity (given appropriate regularity assumptions on and ):
(1) 
This last quantity can then be estimated using Monte Carlo sampling. This estimator is very general and can be applied even if is a discrete random variable, as long as is differentiable w.r.t. for all . However, it suffers from high variance [glasserman2013monte].
Pathwise Derivative
The pathwise derivative estimator depends on being reparameterizable, i.e., there exists a function and a distribution (independent of ) such that sampling is equivalent to sampling and computing . Given that observation, . This quantity can once again be estimated using Monte Carlo sampling, and in contrast typically has lower variance [glasserman2013monte]. This requires to be a differentiable function of for all .
We override (resp. ) as (resp. ), and aim at maximizing this objective function using gradient ascent.
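To make the contrast between the two estimators concrete, the following is a minimal sketch on a toy problem chosen for illustration only (it is not taken from the experiments): estimating the derivative with respect to `mu` of the expectation of `x**2` when `x` follows a unit-variance Gaussian centered at `mu`, whose exact value is `2 * mu`. The function name `gradient_samples` and the sample size are hypothetical.

```python
import random
import statistics

def gradient_samples(mu, n, seed=0):
    """Per-sample estimates of d/dmu E_{x ~ N(mu, 1)}[x**2]; the exact value is 2*mu."""
    rng = random.Random(seed)
    score, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)      # noise whose distribution does not depend on mu
        x = mu + eps                   # reparameterized sample of N(mu, 1)
        # Score function: phi(x) * d/dmu log N(x; mu, 1) = x**2 * (x - mu)
        score.append(x * x * (x - mu))
        # Pathwise: d/dmu phi(mu + eps) = 2 * (mu + eps)
        pathwise.append(2.0 * x)
    return score, pathwise

score, pathwise = gradient_samples(mu=1.0, n=20000)
score_mean, pathwise_mean = statistics.fmean(score), statistics.fmean(pathwise)
score_var, pathwise_var = statistics.variance(score), statistics.variance(pathwise)
```

Both sample means approximate the same gradient, but on this toy problem the per-sample variance of the pathwise estimate is markedly smaller, which is the behaviour the comparison above describes.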
REINFORCE
Using the score function estimator, we can derive the REINFORCE rule [williams1992simple], which is applicable without assumptions on the dynamics or action space. With a stochastic policy, we want to maximize where . Then the REINFORCE rule is . We can estimate this quantity using Monte Carlo sampling. This only requires differentiability of w.r.t. and does not assume knowledge of and . This estimator is however not applicable to deterministic policies.
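As an illustration of the rule above, here is a small sketch on a hypothetical three-step chain (integer state, actions -1 and +1, terminal reward equal to the final state, logistic policy). The environment, the horizon `T` and all function names are assumptions made for the example. The Monte Carlo estimate only uses sampled transitions and rewards, and is checked against a finite-difference gradient of the exactly enumerated objective.

```python
import itertools
import math
import random

T = 3  # horizon (hypothetical toy value)

def p_right(theta, s):
    """Probability of action +1 in state s under a logistic policy."""
    return 1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * s)))

def expected_reward(theta):
    """Exact J(theta) by enumerating all 2**T action sequences; the reward is s_T."""
    total = 0.0
    for actions in itertools.product((-1, 1), repeat=T):
        s, prob = 0, 1.0
        for a in actions:
            p = p_right(theta, s)
            prob *= p if a == 1 else 1.0 - p
            s += a                      # deterministic dynamics s' = s + a
        total += prob * s
    return total

def reinforce_gradient(theta, n, seed=0):
    """Monte Carlo REINFORCE estimate: mean of R * sum_t grad log pi(a_t | s_t)."""
    rng = random.Random(seed)
    grad = [0.0, 0.0]
    for _ in range(n):
        s, score = 0, [0.0, 0.0]
        for _ in range(T):
            p = p_right(theta, s)
            a = 1 if rng.random() < p else -1
            d = (1.0 if a == 1 else 0.0) - p   # derivative of log pi(a | s) w.r.t. the logit
            score[0] += d
            score[1] += d * s
            s += a
        grad[0] += s * score[0] / n            # terminal reward R = s_T
        grad[1] += s * score[1] / n
    return grad

def finite_difference_gradient(theta, h=1e-5):
    """Central finite differences of the exact objective, used only as a reference."""
    grad = []
    for k in range(2):
        up, down = list(theta), list(theta)
        up[k] += h
        down[k] -= h
        grad.append((expected_reward(up) - expected_reward(down)) / (2 * h))
    return grad

theta = [0.3, -0.2]
estimate = reinforce_gradient(theta, n=40000)
exact = finite_difference_gradient(theta)
```

Note that `reinforce_gradient` never differentiates the dynamics or the reward, which is why the rule applies to discrete actions.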
A Method for Model-based Policy Optimization
Let be a deterministic policy, differentiable w.r.t. both and . Assuming we have knowledge of and , we can directly differentiate : . The first term, , is easily computed given knowledge of the reward functions. The second term, given knowledge of the dynamics, can be computed by recursively differentiating , i.e. and:
(2) 
This method can be extended to stochastic dynamics and policies by using a pathwise derivative estimator i.e. by reparameterizing the noise [heess2015learning].
This method is applicable for a deterministic and differentiable policy w.r.t. both and , differentiable dynamics w.r.t. both variables and differentiable reward function. This implies that must be continuous. This model-based method aims at utilizing the dynamics and differentiating the dynamics equation.
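The recursive differentiation described above can be sketched on a scalar example. Everything specific here is an assumption made for illustration: linear dynamics `A*s + B*a`, a linear deterministic policy `a = theta * s`, and a quadratic terminal reward. The sensitivity of the state to `theta` is propagated forward with the chain rule and compared with a finite-difference estimate.

```python
A, B = 0.9, 0.5          # hypothetical linear dynamics f(s, a) = A*s + B*a
T, S0, TARGET = 5, 1.0, 0.2

def rollout(theta):
    """Roll out the deterministic policy a = theta * s and return the final state."""
    s = S0
    for _ in range(T):
        s = A * s + B * (theta * s)
    return s

def objective(theta):
    """Terminal reward r(s_T) = -(s_T - TARGET)**2."""
    return -(rollout(theta) - TARGET) ** 2

def model_based_gradient(theta):
    """dJ/dtheta = r'(s_T) * ds_T/dtheta, with ds_t/dtheta propagated forward in time."""
    s, ds = S0, 0.0                       # ds_0/dtheta = 0: the start state is fixed
    for _ in range(T):
        a = theta * s
        da = s + theta * ds               # total derivative of the action w.r.t. theta
        s, ds = A * s + B * a, A * ds + B * da   # chain rule through f
    return -2.0 * (s - TARGET) * ds

theta = -0.4
analytic = model_based_gradient(theta)
h = 1e-6
numeric = (objective(theta + h) - objective(theta - h)) / (2 * h)
```

The line computing `da` is where differentiability with respect to the action is needed, which is the step that has no direct analogue for discrete actions.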
Relaxing the Policy Optimization Problem
Deterministic policies are optimal in a non-game-theoretic setting. Our objective, in this work, is to perform hybrid policy optimization for deterministic policies for both continuous and discrete action spaces. In order to accomplish this, we present a relaxation of the RL optimization problem. This relaxation allows us to derive, in the subsequent section, a policy gradient estimator that differentiates approximate dynamics and leverages model-based assumptions for any action space. Contrary to traditional hybrid methods, this does not assume differentiability of the dynamics function w.r.t. the action variable, thus elegantly handling discrete action spaces. As in the previous section, we place ourselves in the finite-horizon setting. We assume terminal rewards and convex state space .
Relaxing the dynamics constraint
In this section, we describe our relaxation. Starting from a class of deterministic policies parameterized by , we can construct a class of stochastic policies parameterized by , that can be chosen as close as desired to by adjusting . On this extended class, we can derive a low-variance gradient estimator. We first explain the relaxation, then detail how to construct from , and finally, provide guarantees on the approximation.
Formally, the RL problem with deterministic policy and dynamics can be written as:
(3)  
subject to 
Given that is deterministic, the constraint can be equivalently rewritten as:
(4) 
Having made this observation, we proceed to relaxing the optimization from over to , with the constraint now being over approximate and in particular differentiable dynamics. The relaxed optimization program is therefore:
(5)  
subject to 
This relaxation casts the optimization to be over stochastic policies, which allows us to derive a policy gradient in Theorem 3, but under different, approximated dynamics. We later describe how to project the solution in back to an element on .
Construction of from
Here we show how to construct stochastic policies from deterministic ones while providing a parameterized ‘knob’ to control randomness and closeness to a deterministic policy.
Discrete action spaces
The natural parameterization is as a softmax model. However, this requires careful parameterization, in order to ensure that all policies of are included. We use the deterministic policy as a prior of which we can control the strength. Formally, we choose a class of parameterized functions . For and , we can define the following stochastic policy s.t. . We have therefore defined where and . We easily verify that , as, for any , we can choose an arbitrary and define , we then have .
Continuous action spaces
In the continuous setting, a very simple parameterization is by adding Gaussian noise, of which we can control variance. Formally, given , let and . The surjection can be derived by setting to . More complicated stochastic parameterizations can be easily derived as long as the density remains tractable.
Rounding
We assume that there is a surjection from to , s.t. any stochastic policy can be made deterministic by setting to a certain value. For the above examples, the mapping consists of setting to .
Similar in spirit to simulated annealing for optimization [kirkpatrick1983optimization], we optimize over , while slowly annealing to converge to an optimal solution in .
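The role of the stochasticity ‘knob’ can be illustrated with a short sketch. It assumes a temperature-scaled softmax for the discrete case, which is a common stand-in and may differ from the prior-strength parameterization described above, and additive Gaussian noise for the continuous case. The function names and the example scores are hypothetical.

```python
import math
import random

def softmax_policy(scores, temperature):
    """Action probabilities proportional to exp(score / temperature)."""
    top = max(scores)                     # subtract the max for numerical stability
    weights = [math.exp((x - top) / temperature) for x in scores]
    total = sum(weights)
    return [w / total for w in weights]

def deterministic_action(scores):
    """Rounding for the discrete case: pick the highest-scoring action."""
    return max(range(len(scores)), key=lambda i: scores[i])

def gaussian_policy_sample(mean_action, sigma, rng):
    """Continuous case: the deterministic action plus Gaussian noise of scale sigma."""
    return rng.gauss(mean_action, sigma)

scores = [0.2, 1.0, -0.5]                 # hypothetical per-action scores in one state
warm = softmax_policy(scores, temperature=1.0)
cold = softmax_policy(scores, temperature=0.01)
best = deterministic_action(scores)
rounded = gaussian_policy_sample(0.7, 0.0, random.Random(0))
```

As the temperature decreases, the probability mass concentrates on the action the deterministic policy would pick, and with zero noise scale the Gaussian policy returns the deterministic action itself.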
Theoretical Guarantees
Having presented the relaxation, we now provide theoretical justifications. We first show that, given conditions on the stochastic policy, a trajectory computed with approximate dynamics under a stochastic policy is close to the trajectory computed with the true dynamics under a deterministic policy. We additionally present connections in the case where our dynamics are a discretization of a continuous-time system.
Bounding the deviation from real dynamics
In this paragraph, we assume that the action space is continuous. Given the terminal reward setting, the amount of approximation can be defined as the divergence between , the trajectory from following a deterministic policy , and , the trajectory corresponding to the approximate dynamics with a policy . This will allow us to relate the optimal value of the relaxed program with the optimal value of the initial optimization problem.
Theorem 1 (Approximation Bound).
Let and be a deterministic and stochastic policy, respectively, s.t. . Let us suppose that is Lipschitz and is Lipschitz. We further assume that and that . We have the following guarantee:
(6) 
Furthermore, if are distributions of fixed variance , the approximation converges towards when .
Proof.
See Appendix. ∎
We know that solving the relaxed optimization problem will provide an upper bound on the expected terminal reward from a policy in . Given a Lipschitz reward function, this bound shows that the optimal value of the true optimization program is within of the optimal value of the relaxed one.
Equivalence in continuous-time systems
The relaxation has strong theoretical guarantees when the dynamics are a discretization of a continuous-time system. With analogous notations as before, let us consider a continuous-time dynamical system: .
A discretization of this system can be written as . We can thus write the dynamics of the relaxed problem: .
Intuitively, when the discretization step tends to , the policy converges to a deterministic one. In the limit, our approximation is the true dynamics for continuous-time systems. Proposition 2 formalizes this idea.
Proposition 2.
Let be a trajectory obtained from the discretized relaxed system, following a stochastic policy. Let be the continuous-time trajectory. Then, with probability :
(7) 
Proof.
See [munos2006policy]. ∎
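The intuition behind this limit can be checked numerically on a toy system. The sketch below assumes the relaxed dynamics average the drift over the action distribution, and uses a hypothetical one-dimensional system whose velocity is -1 or +1; it compares a trajectory driven by one sampled action per step with the averaged trajectory, for a coarse and a fine discretization step.

```python
import random

def simulate(step, p_forward, horizon=1.0, seed=0):
    """Integrate ds/dt = u with u in {-1, +1}, where +1 is drawn with probability p_forward."""
    rng = random.Random(seed)
    n_steps = round(horizon / step)
    sampled = relaxed = 0.0
    for _ in range(n_steps):
        u = 1.0 if rng.random() < p_forward else -1.0
        sampled += step * u                        # one sampled action per step
        relaxed += step * (2.0 * p_forward - 1.0)  # drift averaged over the policy
    return sampled, relaxed

coarse = simulate(step=0.25, p_forward=0.7)
fine = simulate(step=1e-4, p_forward=0.7)
gap_fine = abs(fine[0] - fine[1])
```

With the fine step, the deviation between the sampled and averaged end states has a standard deviation of roughly `0.01`, and it shrinks with the square root of the step in this toy setting.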
Relaxed Policy Gradient
The relaxation presented in the previous section allows us to differentiate through the dynamics for any action space. In this section, we derive a novel deterministic policy gradient estimator, Relaxed Policy Gradient (RPG), that combines pathwise derivative and score function estimators, and is applicable to all action spaces. We then show how to apply our estimator, using a sample-efficient and scalable method to estimate the dynamics, first presented in [levine2014learning]. We conclude by extending the estimator to discrete reward functions, effectively making our algorithm applicable to a wide variety of problems. For simplicity, we omit (the stochasticity) from our derivations and consider it fixed.
Estimator
We place ourselves in the relaxation setting. Letting be a parameterized stochastic policy, we define . Our objective is to find . To that aim, we wish to perform gradient ascent on .
Theorem 3 (Relaxed Policy Gradient).
Given a trajectory , sampled by following a stochastic policy , we define the following quantity where can be computed with the recursion defined by and:
(8)  
Proof.
See Appendix. ∎
The presented RPG estimator is effectively a combination of pathwise derivative and score function estimators. Intuitively, the pathwise derivative cannot be used for discrete action spaces as it requires differentiability w.r.t. the action. To this end, we differentiate pathwise through and handle the discrete portion with a score function estimator.
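The combination of the two estimators can be sketched on a toy problem with two discrete actions. The sketch assumes the relaxed dynamics take the form of an expectation of the next state over the policy's actions, and it sums over both actions exactly instead of using the sampled action, so it illustrates the structure of the recursion rather than the Monte Carlo estimator itself. The dynamics `f`, the logistic policy and the constants are hypothetical.

```python
import math

T, S0, TARGET = 4, 0.0, 1.0               # hypothetical toy problem

def f(s, a):
    """Toy dynamics: action 1 pushes right, action 0 pushes left."""
    return 0.9 * s + 0.1 * s * s + (0.5 if a == 1 else -0.5)

def df_ds(s, a):
    """State gradient of the toy dynamics (the action only shifts the output)."""
    return 0.9 + 0.2 * s

def p_one(theta, s):
    """Probability of action 1 under a logistic policy."""
    return 1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * s)))

def relaxed_objective(theta):
    """Terminal reward after rolling out s' = E_{a ~ pi(.|s)}[f(s, a)]."""
    s = S0
    for _ in range(T):
        p = p_one(theta, s)
        s = p * f(s, 1) + (1.0 - p) * f(s, 0)
    return -(s - TARGET) ** 2

def relaxed_gradient(theta):
    """Differentiate the relaxed rollout, carrying g = ds_t/dtheta forward in time."""
    s, g = S0, [0.0, 0.0]
    for _ in range(T):
        p = p_one(theta, s)
        s_next, g_next = 0.0, [0.0, 0.0]
        for a, prob in ((1, p), (0, 1.0 - p)):
            d = (1.0 if a == 1 else 0.0) - p      # derivative of log pi(a|s) w.r.t. the logit
            s_next += prob * f(s, a)
            for k, dlogit in enumerate((1.0, s)):
                # Score-function term: the action itself is never differentiated.
                score = d * (dlogit + theta[1] * g[k])
                # Pathwise term: differentiate the dynamics w.r.t. the state only.
                g_next[k] += prob * (f(s, a) * score + df_ds(s, a) * g[k])
        s, g = s_next, g_next
    return [-2.0 * (s - TARGET) * gk for gk in g]

theta = [0.2, -0.3]
analytic = relaxed_gradient(theta)
h = 1e-6
numeric = []
for k in range(2):
    up, down = list(theta), list(theta)
    up[k] += h
    down[k] -= h
    numeric.append((relaxed_objective(up) - relaxed_objective(down)) / (2 * h))
```

Only `df_ds` (the state gradient of the dynamics) and the log-probability gradient of the policy are needed, so nothing in the recursion requires the dynamics to be differentiable in the action.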
Benefits of pathwise derivative
Gradient estimates through pathwise derivatives are considered lower variance [glasserman2013monte]. In the context of RL, intuitively, REINFORCE suffers from high variance as it requires many samples to properly establish credit assignment, increasing or decreasing probabilities of actions seen during a trajectory based on the reward. Conversely, as RPG estimates the dependency of the state on , the gradient can directly adjust to steer the trajectory to high reward regions. This is possible because the estimator utilizes a model of the dynamics. While REINFORCE assigns credit indifferently to actions, RPG adjusts the entire trajectory.
However, when examining our expression for RPG, the computation requires the gradient of the dynamics w.r.t. the state, unlike REINFORCE. In the next section, we present a scalable and sample-efficient dynamics fitting method, further detailed in [levine2016end], to estimate this term.
Scalable estimation of the dynamics
The Relaxed Policy Gradient estimator can be computed exclusively from sampled trajectories if provided with the state gradient of the dynamics . However, this information is not available to the agent and thus needs to be estimated from samples. We review a method, first presented in [levine2014learning], that estimates the state gradient of the dynamics by incorporating information from multiple sampled trajectories. Although the dynamics are deterministic, we utilize a stochastic estimation to account for modeling errors.
Formally, we have sampled trajectories of the form from the dynamical system. Our goal is not only to accurately model the dynamics of the system, but also to correctly estimate the state gradient. This prevents us from simply fitting a parametric function to the samples. Indeed, this approach would provide no guarantee that is a good approximation of . This is especially important as the variance of the estimator depends on the quality of our approximation of and not the approximation of .
In order to fit a dynamics model, we choose to locally model the dynamics of the discretized stochastic process as , parameterized by . We choose this approach as it does not model global dynamics but instead a good time-varying local model around the latest sampled trajectories, which corresponds to the term we want to estimate. Under that model, the term we are interested in estimating is then .
While this approach is simple, it requires a large number of samples to be accurate. To greatly increase sample efficiency, we assume a prior over sampled trajectories. We use the GMM approach of [levine2014learning] which corresponds to softly piecewise linear dynamics. At each iteration, we update our prior using the current trajectories and fit our model. This allows us to utilize both past trajectories as well as nearby time steps that are included in the prior. Another key advantage of this stochastic estimation is that it elegantly handles discontinuous dynamics models. Indeed, by averaging the dynamics locally around the discontinuities, it effectively smooths out the discontinuities of the underlying deterministic dynamics. We refer to [levine2016end] for more detailed derivations.
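The local fitting step can be sketched, in a much reduced form, as an ordinary least-squares fit of a linear model at a single time step. This sketch assumes a scalar state, omits the GMM prior and the noise covariance entirely, and uses hypothetical names (`fit_local_linear`, `solve_linear_system`) and synthetic transitions. The fitted coefficient on the state plays the role of the state gradient of the dynamics.

```python
import random

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(rows[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (rows[r][n] - tail) / rows[r][r]
    return x

def fit_local_linear(transitions):
    """Least-squares fit of s' ~ A*s + B*a + c from (s, a, s') samples at one time step."""
    features = [(s, a, 1.0) for s, a, _ in transitions]
    targets = [s_next for _, _, s_next in transitions]
    gram = [[sum(x[i] * x[j] for x in features) for j in range(3)] for i in range(3)]
    moment = [sum(x[i] * y for x, y in zip(features, targets)) for i in range(3)]
    return solve_linear_system(gram, moment)   # [A, B, c]; A estimates df/ds

rng = random.Random(0)
true_a, true_b, true_c = 0.8, 0.3, -0.1        # hypothetical ground-truth local model
samples = []
for _ in range(20):
    s, a = rng.uniform(-1.0, 1.0), rng.choice((-1.0, 1.0))
    samples.append((s, a, true_a * s + true_b * a + true_c))
A_hat, B_hat, c_hat = fit_local_linear(samples)
```

With few samples per time step such a plain fit becomes unreliable, which is the motivation given above for sharing information across trajectories and time steps through a prior.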
Extension to discrete rewards
Prior to this work, estimators incorporating models of the dynamics such as [heess2015learning, levine2014learning] were constrained to continuously differentiable rewards. We present an extension of this type of estimator to a class of non-continuous rewards. To do so, we make assumptions on the form of the reward and approximate it by a smooth function.
We assume that and that the reward can be written as a sum of indicator functions, i.e.:
(9) 
where are compact subsets of . This assumption covers a large collection of tasks where the goal is to end up in a compact subspace of . For each , we are going to approximate by an arbitrarily close smooth function.
Proposition 4 (Smooth approximation of an indicator function).
Let be a compact of . For any neighborhood of , there exists, , smooth, s.t. .
Proof.
See Appendix. ∎
We can now approximate each by a smooth function . Given this surrogate reward , we can apply our estimator. In practice, however, discrete reward functions are often defined as , where is a function from to . We approximate the reward function by ( being the sigmoid function). If is , then is too and pointwise converges to . We present in the Appendix an example of the approximate reward functions for the Mountain Car task.
Given these arbitrarily close surrogate rewards, we can now prove a continuity result, i.e. the sequence of optimal policies under the surrogate rewards converges to the optimal policy.
Proposition 5 (Optimal policies under surrogate reward functions).
Let be a reward function defined as and the optimal policy under this reward. Let us define to be the value of state under the policy . Then, there exists a sequence of smooth reward functions s.t. if is optimal under the reward , .
Proof.
See Appendix. ∎
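The sigmoid surrogate described before Proposition 5 can be sketched as follows. The sketch assumes the discrete reward is the indicator that some function `g` of the state is non-negative, and uses a hypothetical goal threshold and sharpness values. It measures the largest gap between the surrogate and the indicator on a few states away from the decision boundary.

```python
import math

def indicator_reward(g_value):
    """Discrete reward: 1 if g(s) >= 0, else 0."""
    return 1.0 if g_value >= 0.0 else 0.0

def surrogate_reward(g_value, sharpness):
    """Smooth surrogate sigmoid(sharpness * g(s)), written to avoid overflow in exp."""
    z = sharpness * g_value
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def goal_margin(position, goal=0.5):
    """Hypothetical g for a reach-the-goal task: signed distance past the goal."""
    return position - goal

positions = (-1.2, 0.0, 0.4, 0.6, 1.0)     # sample states, none exactly at the goal
errors = []
for sharpness in (1.0, 10.0, 100.0, 1000.0):
    worst = max(
        abs(surrogate_reward(goal_margin(x), sharpness) - indicator_reward(goal_margin(x)))
        for x in positions
    )
    errors.append(worst)
```

The gap shrinks as the sharpness grows, while the surrogate stays differentiable in the state, which is what the gradient estimator needs.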
Practical Algorithm
In Algorithm 1, we present a practical algorithm to perform deterministic policy optimization based on the previously defined RPG. In summary, given , we construct our extended class of stochastic policies . We then perform gradient-based optimization over , i.e. stochastic policies, while slowly annealing , converging to a policy in . We easily extend the estimator to the case of non-terminal rewards as:
(10) 
Experiments
We empirically evaluate our algorithm to investigate the following questions: 1) How much does RPG improve sample complexity? 2) Can RPG learn an optimal policy for the true reward using a smooth surrogate reward? 3) How effective is our approximation of the dynamics?
In an attempt to evaluate the benefits of the relaxation compared to other estimators as fairly as possible, we do not use additional techniques, such as trust regions [schulman2015trust]. Both our method and the compared ones can incorporate these improvements, but we leave study of these more sophisticated estimators for future work.
Classical Control Tasks
We apply our Relaxed Policy Gradient (RPG) to a set of classical control tasks with discrete action spaces: Cart Pole [barto1983neuronlike], Acrobot [sutton1996generalization], Mountain Car [sutton1996generalization] and Hand Mass [munos2006policy]. For the first three tasks, we used the OpenAI Gym [brockman2016openai] implementation and followed [munos2006policy] for the implementation of Hand Mass. Diagrams of our tasks are presented in Figure 1.
Baselines
We compare our method against two different baselines: a black-box derivative-free optimization algorithm called the Cross-Entropy Method (CEM) [szita2006learning] and the Actor-Critic algorithm (A3C) [mnih2016asynchronous]. CEM is an evolutionary-type algorithm that samples a different set of parameters at each step, simulates using these parameters and keeps only the top , using those to refit the distribution over parameters. CEM is known to work well [duan2016benchmarking], but lacks sample efficiency as the algorithm does not exploit any structure. A3C is a variant of REINFORCE [williams1992simple] that learns a value function to reduce the variance of the estimator. For each task and algorithm, we evaluate for distinct random seeds.
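For readers unfamiliar with CEM, the sample, select and refit loop described above can be sketched on a scalar toy objective. This is a generic illustration, not the baseline implementation used in the experiments: the population size, the elite fraction, the iteration count and the quadratic score (standing in for the return of a simulated episode) are all hypothetical.

```python
import random
import statistics

def cross_entropy_method(score, mean, std, population=50, elite_fraction=0.2,
                         iterations=30, seed=0):
    """Maximize score(x) over a scalar x with a Gaussian sampling distribution."""
    rng = random.Random(seed)
    n_elite = max(2, int(population * elite_fraction))
    for _ in range(iterations):
        candidates = [rng.gauss(mean, std) for _ in range(population)]
        elite = sorted(candidates, key=score, reverse=True)[:n_elite]
        mean = statistics.fmean(elite)                 # refit the distribution
        std = max(statistics.stdev(elite), 1e-3)       # floor keeps sampling alive
    return mean

best = cross_entropy_method(lambda x: -(x - 3.0) ** 2, mean=0.0, std=2.0)
```

Each iteration spends one evaluation per candidate, so a single CEM step costs as many episodes as its population size, which is the sample-efficiency drawback mentioned above.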
Results and Analysis
We present the learning curves for all tasks in Figure 2. Even when training is done with surrogate rewards, in both instances we report the actual reward. We do not show CEM learning curves as these are not comparable due to the nature of the algorithm: each CEM episode is equivalent to RPG or A3C episodes. We instead report final performance in Tables 1 and 2. For all tasks, the policy is parameterized by a layer neural network with tanh nonlinearities and hidden units, with a softmax layer outputting a distribution over actions. In practice, we optimize over stochastic policies with a fixed . In our experiments, it converged to a near-deterministic policy since the optimization is well-conditioned enough in that case. The policies are trained using Adam [kingma2014adam].
We evaluate differently depending on the task: since Cart Pole and Acrobot have a fixed reward threshold at which they are considered solved, for these we present the number of training samples needed to reach that performance (in Table 1). In contrast, for Mountain Car and Hand Mass, we report (in Table 2) the reward achieved with a fixed number of samples. Both of these metrics are meant to evaluate sample efficiency.
Sample Complexity
As shown in Tables 1 and 2, our algorithm clearly outperforms A3C across all tasks, requiring between and times fewer samples to solve the tasks and showing significantly better performance for the same number of samples. As shown in Table 2, RPG performs better than CEM in Hand Mass and within for Mountain Car, despite using times fewer samples. We also note that CEM is particularly well suited for Acrobot, as it is a derivative-free method that can explore the space of parameters very fast and find near-optimal parameters quickly when the optimal policies are fairly simple, explaining its impressive sample complexity on this specific task. We additionally point out that full plots of performance against number of samples are reported for all tasks in Figure 2, and the numbers in Tables 1 and 2 can all be extracted from there.
Robustness, Variance and Training Stability
Overall, our method was very robust to the choice of hyperparameters such as learning rate or architecture. Indeed, our policy was trained on all tasks with the same learning rate , whereas different learning rates were cross-validated for A3C. When examining the training curves, our estimator demonstrates significantly less variance than A3C and a more stable training process. In tasks where the challenges are exploratory (Acrobot or Mountain Car), RPG’s exploration process is guided by its model of the dynamics while A3C’s is completely undirected, often leading to total failure. The same phenomenon can be observed on control tasks (Hand Mass or Cart Pole), where A3C favors bad actions seen in high reward trajectories, leading to unstable control.
Approximate Reward
On all tasks except Hand Mass, our estimator was trained using approximate smoothed rewards. The performances reported in Figure 2 and Table 2 show that this did not impair training at all. Regarding the baselines, it is interesting to note that since CEM does not leverage the structure of the RL problem, it is natural to train with the real rewards. For A3C, we experimented with the smoothed rewards and obtained comparable numbers.
Method  Samples until solve 

Cart Pole (Threshold = )  
CEM  
A3C  
RPG  
Acrobot (Threshold = 105)  
CEM  
A3C  
RPG 
Method  Samples  Performance 

Hand Mass ( episodes)  
CEM  
A3C  
RPG  
Mountain Car ( episodes)  
CEM  
A3C  
RPG 
Limitations
In this section, we explore limitations of our method. To leverage the dynamics of the RL problem, we trade off some flexibility in the class of problems that model-free RL can tackle for better sample complexity. This can also be seen as an instance of the bias/variance tradeoff.
Limitations on the reward function
While we presented a general way to extend this estimator for rewards in the discrete domain, such function approximations can be difficult to construct for high-dimensional state spaces such as Atari games. Indeed, one would have to fit the indicator function of a very low-dimensional manifold (corresponding to the set of images encoding the state of a given game score) living in a high-dimensional space (order of ).
Limitations on the type of tasks
While we show results on classical control tasks, our estimator is broadly applicable to all tasks where the dynamics can be estimated reasonably well. This has been shown to be possible on a number of locomotion and robotics tasks [levine2014learning, heess2015learning]. However, our work is not directly applicable to raw pixel input tasks such as the Atari domain.
Computational overhead
Compared to REINFORCE, our model presents some computational overhead as it requires evaluating as well as fitting dynamics matrices. In practice, this is minor compared to other training operations such as sampling trajectories or computing necessary gradients. In our experiments, computing the amounts to less than of overhead while dynamics estimation constitutes about .
Discussion
In this work, we presented a method to find an optimal deterministic policy for any action space and in particular discrete actions. This method relies on a relaxation of the optimization problem to a carefully chosen, larger class of stochastic policies. On this class of policies, we can derive a novel, low-variance, policy gradient estimator, Relaxed Policy Gradient (RPG), by differentiating approximate dynamics. We then perform gradient-based optimization while slowly annealing the stochasticity of our policy, converging to a deterministic solution. We showed how this method can be successfully applied to a collection of RL tasks, presenting significant gains in sample efficiency and training stability over existing methods. Furthermore, we introduced a way to apply this algorithm to non-continuous reward functions by learning a policy under a smooth surrogate reward function, for which we provided a construction method. It is also important to note that our method is easily amenable to problems with stochastic dynamics, assuming one can reparameterize the noise.
This work also opens the way to promising future extensions of the estimator. For example, one could easily incorporate imaginary rollouts [gu2016continuous] with the estimated dynamics model or trust regions [schulman2015trust]. Finally, this work can also be extended more broadly to gradient estimation for any discrete random variables.
Acknowledgments
The authors thank Aditya Grover, Jiaming Song and Steve Mussmann for useful discussions, as well as the anonymous reviewers for insightful comments and suggestions. This research was funded by Intel Corporation, FLI, TRI and NSF grants , , .
References
Appendix A
Task specifications
Cart Pole
The classical Cart Pole setting, where the agent wishes to balance a pole mounted on a pivot point on a cart. Due to instability, the agent has to perform continuous cart movement to preserve the equilibrium. The agent receives reward for each time step the pole spends upright. The task is considered solved if the agent reaches a reward of .
Acrobot
First introduced in [sutton1996generalization], the system is composed of two joints and two links. Initially hanging downwards, the objective is to actuate the links to swing the end of the lower link up to a fixed height. The path’s reward is determined by the negative number of actions required to reach the objective. The task is considered solved if the agent reaches a reward of .
Mountain Car
We consider the usual setting presented in [sutton1996generalization]. The objective is to get a car to the top of a hill by learning a policy that uses momentum gained by alternating backward and forward motions. The state consists of , where is the horizontal position of the car. The reward can be expressed as . The task’s difficulty lies in both the limitation of maximum tangential force and the need for exploration. The agent receives a reward of . We use a surrogate smoothed reward for DCTPG. We evaluate the performance after a fixed number of episodes.
Hand Mass
We consider a physical system first presented in [munos2006policy]. The agent is holding a spring linked to a mass. The agent can move in four directions. The objective is to bring the mass to a target point while keeping the hand as close as possible to the origin point . The state can be described as and the dynamics are described in [munos2006policy]. At the final time step, the agent receives the reward .
Proof of Theorem 1
Lemma 6.
Let be a random variable of of bounded variance. Let .
Proof.
We can simply use the Cauchy-Schwarz inequality. Let us define .
∎
We prove the bound by induction.
(11)  
(12)  
(13)  
(14)  
(15) 
(12) is obtained with the triangle inequality, (13) because is Lipschitz, (14) with the triangle inequality and because , and (15) because of Lemma 6 and the uniform bound on variances.
Given this inequality and the fact that , we can compute the geometric sum, i.e. , which concludes the proof.
Proof of Theorem 3
As in the model-based method presented in Background, we want to directly differentiate . As before, we require the reward function to be differentiable. , but when computing , we need to differentiate through the relaxed equation of the dynamics:
The recursion corresponds to a Monte Carlo estimate with trajectories obtained by running the stochastic policy in the environment, thus concluding the proof.
Proof of Proposition 4
Given a compact of and an open neighborhood , let us show that there exists s.t. . While according to Urysohn’s lemma [alexandroff1924theorie], such continuous functions exist in the general setting of normal spaces given any pair of disjoint closed sets and , we will show how to explicitly construct under our simpler assumptions.
It is sufficient to show that we can construct a function such that for open set containing the closed unit ball, is equal to 1 on the unit ball and is null outside of . Given this function, we can easily construct by scaling and translating.
Let us define . This function is smooth on both and .
Let us define . This function is strictly positive on and null outside of . For large enough, the function satisfies the desired property, thus proving that we can construct .
Figure 3 shows an example of a smooth surrogate reward for the Mountain Car task.
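For concreteness, a smooth function with the properties required in this proof (equal to 1 on the closed unit ball and null outside a larger ball) can be built from the classical `exp(-1/x)` building block. The sketch below uses a standard smooth-step construction, which may differ from the composition used in the proof above. The radii and the function names are hypothetical.

```python
import math

def h(x):
    """Smooth on the whole real line: exp(-1/x) for x > 0 and 0 otherwise."""
    return math.exp(-1.0 / x) if x > 0.0 else 0.0

def smooth_step(t):
    """Smooth, equal to 0 for t <= 0 and to 1 for t >= 1."""
    return h(t) / (h(t) + h(1.0 - t))

def bump(point, inner=1.0, outer=2.0):
    """Smooth function equal to 1 on the ball of radius inner and 0 outside radius outer."""
    radius = math.sqrt(sum(x * x for x in point))
    return smooth_step((outer - radius) / (outer - inner))

inside = bump((0.5, 0.0))
between = bump((1.5, 0.0))
outside = bump((3.0, 0.0))
```

Scaling and translating `bump` then gives a surrogate for the indicator of any ball, in the spirit of the scaling and translation argument used above.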
Proof of Proposition 5
Let us consider the deterministic case where and the reward is terminal. Let be the optimal policy under reward .
There exists a sequence of open neighborhoods s.t. and tends to . Let be the smooth reward function constructed for with neighborhood . Let be the deterministic optimal policy under reward .
For a start state , we can distinguish between two cases:
For the former, no policy can reach the compact set and thus the surrogate reward makes no difference. For the latter, the policy , takes to final state where . However, which tends to . The is a Cauchy sequence in a complete space and thus converges to a limit . Since and by continuity of the distance to a compact set, proving the result.