1 Introduction
Reinforcement Learning (RL) is a promising approach to solving complex real-world tasks with physical robots, supported by recent successes [Andrychowicz et al., 2018, Kalashnikov et al., 2018, Haarnoja et al., 2018]. Exploration is an integral part of RL, responsible for the discovery of new behaviors, and is typically achieved by executing a stochastic behavior policy [Sutton and Barto, 2018]. In the continuous control domain, for instance, policies with a parametrized Gaussian distribution have been commonly used [Schulman et al., 2015, Levine et al., 2016, Mnih et al., 2016, Schulman et al., 2017, Haarnoja et al., 2018]. The samples from such policies are temporally coherent only through the distribution mean. In most environments this coherence is not sufficient to provide consistent and effective exploration. In the early stages of learning in particular, with randomly initialized policy parameters, exploration essentially relies on a white noise process around a zero mean. In environments where actions represent low-level motion control, e.g., velocity or torque, such exploration rarely produces a consistent motion that could lead to the discovery of rewarding behaviors. This contributes to the low sample efficiency of learning algorithms [Wawrzynski, 2015, van Hoof et al., 2017, Plappert et al., 2018]. For real-world robotic applications a short reaction time is often desirable; however, as the action rate increases, a white noise exploration model becomes even less viable, effectively locking the robot in place [Plappert et al., 2018]. In addition, temporally incoherent exploration results in jerky movement on physical robots, leading to hardware damage and safety concerns [Peters and Schaal, 2007].

In this work we observe that the parameters of the policy distribution typically exhibit high temporal coherence, particularly at higher action rates. Specifically, in the case of a Gaussian policy distribution, the action can be represented as the sum of a deterministic parametrized mean and a scaled white noise component. The mean part typically changes smoothly between subsequent states; it is the white noise part that results in inconsistent exploration behavior. We propose to replace the white noise component with an autoregressive Gaussian process that has a stationary standard normal distribution while exhibiting temporal coherence between subsequent observations. We derive a general form of these processes of an arbitrary order and show how the degree of temporal coherence can be continuously adjusted with a scalar parameter. We demonstrate an advantage of higher-order processes compared to the first-order ones used in prior work. Further, we propose an agent policy structure that directly implements the autoregressive computation in such processes. The temporal action smoothing mechanism is therefore not hidden from the agent but is made explicit through its policy function. In order to achieve this, we require a fixed-length history of past states and actions. However, the set of resulting history-dependent policies contains the set of Markov deterministic policies, which in turn contains the optimal policies for many tasks [Puterman, 2014, Section 4.4]. We find that, in practical applications, the search for such optimal policies can be more efficient and safe in a space of history-dependent stochastic policies with a special structure, compared to a conventional search in a space of Markov stochastic policies.

Empirically, we show that the proposed autoregressive policies can be used with off-the-shelf learning algorithms and result in superior exploration and learning in sparse reward tasks compared to conventional Gaussian policies, while achieving similar or slightly better performance in tasks with dense reward. We also show that the drop in learning performance due to an increasing action rate can be greatly mitigated by increasing the degree of temporal coherence in the underlying autoregressive process. In real-world robotic experiments we demonstrate that autoregressive policies result in smoother and safer movement.¹

¹ See the accompanying video at https://youtu.be/NCpyXBNqNmw
2 Related Work
The problem of exploration in Reinforcement Learning has been studied extensively. One approach has been to modify the environment by changing its reward function, making it easier for the agent to obtain any rewards or encouraging the agent to visit new environment states. This approach includes work on reward shaping [Ng et al., 1999] and auxiliary reward components such as curiosity [Oudeyer et al., 2007, Houthooft et al., 2016, Pathak et al., 2017, Burda et al., 2018]. Note that, regardless of the chosen reward function, temporally consistent behavior would still be beneficial in most tasks, as it would discover rewarding behaviors more efficiently. A randomly initialized agent is unaware of the reward function and, for example, will not exhibit curiosity until after some amount of learning, which itself requires visiting new states and discovering rewarding behaviors in the first place.
A second approach, particularly common in practical robotic applications, has been to directly enforce temporal coherence between subsequent motion commands. In the simplest case a low-pass filter is applied, e.g., the agent's actions are exponentially averaged over a fixed- or infinite-length window [Benbrahim and Franklin, 1997]. A similar alternative is to employ derivative control, where the agent's actions represent higher-order derivatives of the control signal [Mahmood et al., 2018a]. Both of these approaches correspond to acting in a modified MDP with different state and action spaces, and result in a less direct connection between the agent's action and its consequence in the environment, which can make the learning problem harder. They also make the process less observable, unless the agent has access to the history of past actions used in smoothing and to the structure of the smoothing mechanism itself, which is typically not the case. As in the case of a modified reward function, the optimal policies in the new MDP generally may not correspond to the optimal policies in the original MDP.
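As a minimal sketch of the first of these smoothing approaches, an exponential low-pass filter over raw agent actions can be written as follows. The filter constant `beta` is an illustrative assumption, not a value from the cited works:

```python
import numpy as np

def lowpass_filter_actions(actions, beta=0.8):
    """Exponentially average raw agent actions before they reach the
    actuators: u_t = beta * u_{t-1} + (1 - beta) * a_t.

    `beta` (assumed value) controls smoothness. Note that the filter
    state u is hidden from a standard agent, which is exactly the
    observability issue discussed above."""
    filtered = np.zeros_like(actions, dtype=float)
    u = np.zeros(actions.shape[1:])
    for t, a in enumerate(actions):
        u = beta * u + (1.0 - beta) * a
        filtered[t] = u
    return filtered
```

Applying the filter to white-noise actions shrinks their variance while introducing temporal correlation, which is precisely why it smooths robot motion but also changes the effective MDP.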
A third approach has been to learn the parameters of predefined parametrized controllers, such as motor primitives, instead of learning control directly in the actuation space [Peters and Schaal, 2008]. This approach is attractive, as it allows ensuring safe robot behavior and often results in an easier learning problem [van Hoof et al., 2017]. However, it requires expert knowledge to define an appropriate class of controllers and limits possible policies to those representable within the selected class. In complex tasks [Andrychowicz et al., 2018, Kalashnikov et al., 2018] it may be non-trivial to design a sufficiently rich set of primitives.
Several studies have considered applying exploration noise to policy distribution parameters, such as network weights and hidden unit activations. Plappert et al. [2017] applied Gaussian exploration noise to policy parameters at the beginning of each episode, demonstrating a more coherent and efficient exploration behavior compared to only adding Gaussian noise to the action itself. Fortunato et al. [2017] similarly applied independent Gaussian noise to policy parameters, where the scale of the noise was also learned via gradient descent. Both of these works demonstrated improved learning performance compared to baseline Gaussian action space exploration, particularly in tasks with sparse rewards. Our approach is fully complementary to the auxiliary-reward and parametric-noise ideas, as both still rely on exploration noise in the action space in addition to other noise sources, and can benefit from the consistent and temporally smooth exploration trajectories provided by our method.
In the context of continuous control deep RL, our work is most closely related to the use of temporally coherent Gaussian noise during exploration. Wawrzynski [2015] used a moving average process for exploration, where the temporal smoothness of exploration trajectories was determined by the integer size of an averaging window. They showed that learning with such a process results in a similar final performance as with standard Gaussian exploration, while providing smoother behavior suitable for physical hardware applications. Van Hoof et al. [2017] proposed a stationary first-order AR exploration process in parameter space. Lillicrap et al. [2015] and Tallec et al. [2019] used the Ornstein-Uhlenbeck (OU) process for exploration in off-policy learning. The latter work showed that adjusting the process parameters according to the time step duration helps to maintain exploration performance at higher action rates. It can be shown that in discrete-time form the OU process is a first-order Gaussian AR process, which makes it a particular case of our model. The AR processes derived in this work generalize the processes used in these studies, providing a wider space of possible exploration trajectories. In addition, the current work proposes a policy structure that directly implements the autoregressive computation, in contrast to the above studies, where the agent was unaware of the noise structure. Due to this explicit policy formulation, autoregressive exploration can be used in both on-policy and off-policy learning.
Autoregressive architectures have been proposed in the context of high-dimensional discrete or discretized continuous action spaces [Metz et al., 2017, Vinyals et al., 2017], with the regression defined over action components. The objective of such architectures was to reduce the dimensionality of the action space. In contrast, we draw on the autoregressive stochastic processes literature and define the regression over time steps, directly in a multidimensional continuous action space, with the objective of enforcing temporally coherent behavior.
3 Background
Reinforcement Learning
The Reinforcement Learning (RL) framework [Sutton and Barto, 2018] describes an agent interacting with an environment at discrete time steps. At each step $t$ the agent receives the environment state $s_t$ and a scalar reward signal $r_t$. The agent selects an action $a_t$ according to a policy defined by a probability distribution $\pi(a_t \mid s_t)$. At the next time step, in part due to the agent's action, the environment transitions to a new state $s_{t+1}$ and produces a new reward $r_{t+1}$ according to a transition probability distribution $p(s_{t+1}, r_{t+1} \mid s_t, a_t)$. The objective of the agent is to find a policy that maximizes the expected return, defined as the future accumulated rewards $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma \in [0, 1)$ is a discount factor. In practice, the agent often observes the environment's state partially through a real-valued observation vector.

Autoregressive processes
An autoregressive process of order $p$ (AR($p$)) is defined as

$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \sigma Z_t, \qquad (1)$$

where $\phi_1, \dots, \phi_p$ and $\sigma > 0$ are constants and $Z_t$ is white noise with standard normal distribution. An autoregressive process is called weakly stationary if its mean function is independent of $t$ and its covariance function $\mathrm{Cov}(X_t, X_{t+k})$ is independent of $t$ for each $k$. In the following we use the term stationary to imply this definition. The process (1) is stationary if the roots (possibly complex) of its characteristic polynomial

$$z^p - \phi_1 z^{p-1} - \dots - \phi_{p-1} z - \phi_p$$

lie within the unit circle, i.e., $|z_i| < 1$ (see, e.g., [Brockwell et al., 2002, Section 3.1]).

An autocovariance function is defined as $\gamma(k) = \mathrm{Cov}(X_t, X_{t+k})$. From the definition, $\gamma(-k) = \gamma(k)$. For a stationary AR($p$) process the linear system of Yule-Walker equations holds:

$$\gamma(k) = \sum_{i=1}^{p} \phi_i \gamma(k - i), \quad k = 1, \dots, p, \qquad (2)$$

and

$$\gamma(0) = \sum_{i=1}^{p} \phi_i \gamma(i) + \sigma^2.$$

The system (2) has a unique solution with respect to the variables $\gamma(0), \dots, \gamma(p)$.
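As a concrete illustration, the Yule-Walker system can be solved numerically for the autocovariances and the noise scale under the normalization $\gamma(0) = 1$ used below. This is a sketch for intuition, not code from the paper:

```python
import numpy as np

def yule_walker_sigma(phi):
    """Given stationary AR coefficients phi, solve the Yule-Walker
    system under the normalization gamma(0) = 1 and return
    (gamma, sigma): autocovariances gamma(1..p) and the noise scale
    sigma that makes the process variance equal to 1."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    # Linear system from gamma(k) = sum_i phi_i * gamma(|k - i|),
    # k = 1..p, with gamma(0) = 1 moved to the right-hand side.
    A = np.zeros((p, p))
    b = np.zeros(p)
    for k in range(1, p + 1):
        A[k - 1, k - 1] += 1.0
        for i in range(1, p + 1):
            lag = abs(k - i)
            if lag == 0:
                b[k - 1] += phi[i - 1]
            else:
                A[k - 1, lag - 1] -= phi[i - 1]
    gamma = np.linalg.solve(A, b)
    sigma2 = 1.0 - phi @ gamma  # from gamma(0) = sum_i phi_i gamma(i) + sigma^2
    return gamma, np.sqrt(sigma2)
```

For an AR(1) process with coefficient $\phi_1 = a$ this recovers the familiar $\sigma^2 = 1 - a^2$.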
4 Stationary autoregressive Gaussian processes
In this section we derive a family of stationary AR($p$) Gaussian processes for any $p \geq 1$, such that $X_t \sim \mathcal{N}(0, 1)$, meaning $X_t$ has a marginal standard normal distribution at each step $t$. We also show how the degree of temporal smoothness of trajectories formed by subsequent observations of such processes can be continuously tuned with a scalar parameter.
Proposition 4.1.
For any $p \geq 1$ and for any (possibly complex) $z_1, \dots, z_p$ with $|z_i| < 1$ yielding real coefficients, consider the set of coefficients $\phi_1, \dots, \phi_p$ defined by

$$z^p - \phi_1 z^{p-1} - \dots - \phi_p = (z - z_1)(z - z_2)\cdots(z - z_p). \qquad (3)$$

The Yule-Walker system (2) with coefficients $\phi_1, \dots, \phi_p$ has a unique solution with respect to $\gamma(1), \dots, \gamma(p), \sigma$, such that $\gamma(0) = 1$ and $\sigma > 0$. Furthermore, the autoregressive process

$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \sigma Z_t \qquad (4)$$

is a stationary Gaussian process with zero mean and unit variance, meaning $X_t \sim \mathcal{N}(0, 1)$.
Proof.
The proof follows from the observation that $\phi_1, \dots, \phi_p$ are the coefficients of a polynomial with roots $z_1, \dots, z_p$ that all lie within the unit circle. Since this polynomial is the characteristic polynomial of the process (4), the process is stationary. The existence of a unique solution to the system (2) with $\gamma(0) = 1$ follows from the observation that for any fixed $\sigma > 0$ the system (2) has a unique solution with respect to $\gamma(0), \dots, \gamma(p)$, while it is homogeneous with respect to $(\gamma, \sigma^2)$. The result then follows from observing that $\gamma(0) > 0$, and scaling the solution by $1/\gamma(0)$. A complete proof can be found in Appendix A. ∎
Corollary 4.1.1.
For any $p \geq 1$ and for any $\alpha \in [0, 1)$, let $X_t$ be the autoregressive process

$$X_t = \sum_{i=1}^{p} (-1)^{i+1} \binom{p}{i} \alpha^i X_{t-i} + \sigma Z_t, \qquad (5)$$

where $\binom{p}{i}$ is a binomial coefficient and $\sigma$ is the solution to the system (2) with coefficients $\phi_i = (-1)^{i+1} \binom{p}{i} \alpha^i$ and $\gamma(0) = 1$. Then $X_t \sim \mathcal{N}(0, 1)$.
Proposition 4.1 allows us to formulate stationary autoregressive processes of an arbitrary order $p$, for arbitrary roots inside the unit circle, such that the marginal distributions of the realizations of these processes are standard normal at each time step. This gives us great flexibility and power in defining the properties of these processes, such as the degree of temporal coherence between process realizations at various time lags. If we use these processes as a source of exploration behavior in RL algorithms, this flexibility translates into flexibility in defining the shape and smoothness of exploration trajectories. Note that the process (4) trivially generalizes to a vector form by defining a multivariate white noise $Z_t$ with a diagonal covariance.
Autoregressive processes in the general form (4) possess a number of interesting properties that can be utilized in reinforcement learning. However, for the purposes of the discussion in the following sections, from now on we consider the simpler subfamily of processes defined by (5). Notice that $\alpha = 0$ results in $\phi_i = 0$ for all $i$, and $X_t$ becomes white Gaussian noise. On the other hand, as $\alpha \to 1$, $\sigma \to 0$ and $X_t$ approaches a constant function. Therefore, tuning the single scalar parameter $\alpha$ from 0 to 1 continuously adjusts the temporal smoothness of $X_t$, ranging from white noise to a constant function. Figure 1 shows realizations of such processes at different values of $p$ and $\alpha$. The realizations are initialized from the same set of 3 random seeds for each $(p, \alpha)$ pair.
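A short script along the following lines reproduces such realizations and checks that the marginal variance is indeed close to 1. It is an illustrative sketch; the noise scale `sigma` is assumed to have been obtained from the Yule-Walker system (2) as described above:

```python
import numpy as np
from math import comb

def arp_phi(p, alpha):
    """Coefficients phi_i = (-1)**(i+1) * C(p, i) * alpha**i, i.e. the
    characteristic polynomial (z - alpha)**p with all roots at alpha,
    hence stationary for |alpha| < 1 (Corollary 4.1.1)."""
    return np.array([(-1) ** (i + 1) * comb(p, i) * alpha ** i
                     for i in range(1, p + 1)])

def sample_arp(phi, sigma, n, rng, burn=1000):
    """Simulate n steps of X_t = sum_i phi_i X_{t-i} + sigma Z_t,
    discarding `burn` initial steps so that the zero-initialized
    history washes out."""
    p = len(phi)
    x = np.zeros(burn + n + p)
    for t in range(p, len(x)):
        # x[t-p:t][::-1] = [x[t-1], x[t-2], ..., x[t-p]]
        x[t] = phi @ x[t - p:t][::-1] + sigma * rng.standard_normal()
    return x[p + burn:]
```

For example, for $p = 2$, $\alpha = 0.5$ the coefficients are $\phi = (1.0, -0.25)$, and solving (2) with $\gamma(0) = 1$ gives $\sigma^2 = 0.3375$; simulating with that $\sigma$ yields unit-variance, positively correlated samples.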
5 Autoregressive policies
In continuous control RL a policy is often defined as a parametrized diagonal Gaussian distribution:

$$\pi_\theta(a_t \mid s_t) = \mathcal{N}\big(\mu_\theta(s_t),\, \mathrm{diag}(\sigma_\theta^2(s_t))\big), \qquad (6)$$

where $s_t$ is a state at time $t$, and $\mu_\theta(s_t)$, $\sigma_\theta(s_t)$ are vectors parametrized by deep neural networks. The actions sampled from such a distribution can be represented as

$$a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot Z_t,$$

where $Z_t$ is white Gaussian noise. We propose to replace $Z_t$ with observations $X_t$ of an AR process defined by (5) for some $p$ and $\alpha$:

$$a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot X_t. \qquad (7)$$
Both $Z_t$ and $X_t$ follow a marginal standard normal distribution at each step $t$; therefore, this substitution does not change the network-output-to-noise ratio in sampled actions. However, for $\alpha > 0$ the sequence $X_t$ possesses temporal coherence and can provide more consistent exploration behavior. We would like to build an agent that implements a stochastic policy with samples defined by (7). From the definition (5) of the process $X_t$, (7) can be expanded as

$$a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot \Big( \sum_{i=1}^{p} \phi_i X_{t-i} + \sigma Z_t \Big). \qquad (8)$$

From (7) also, $X_{t-i} = (a_{t-i} - \mu_\theta(s_{t-i})) / \sigma_\theta(s_{t-i})$ (element-wise), hence (8) can be rewritten as

$$a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot \Big( \sum_{i=1}^{p} \phi_i \frac{a_{t-i} - \mu_\theta(s_{t-i})}{\sigma_\theta(s_{t-i})} + \sigma Z_t \Big). \qquad (9)$$

Denote $m_t = \sum_{i=1}^{p} \phi_i (a_{t-i} - \mu_\theta(s_{t-i})) / \sigma_\theta(s_{t-i})$ the autoregressive "history" term in (9). Note that $m_t$ is a function of the past states and actions $(s_{t-1}, a_{t-1}, \dots, s_{t-p}, a_{t-p})$. Then $a_t$ follows the distribution:

$$a_t \sim \mathcal{N}\big(\mu_\theta(s_t) + \sigma_\theta(s_t) \odot m_t,\; \sigma^2\, \mathrm{diag}(\sigma_\theta^2(s_t))\big). \qquad (10)$$
In order to implement such an action distribution, we need to define a history-dependent policy $\pi(a_t \mid s_t, h_t)$, where $h_t = (s_{t-1}, a_{t-1}, \dots, s_{t-p}, a_{t-p})$ is a history of past states and actions. In general, history-dependent policies do not induce Markov stochastic processes, even if the environment transition probabilities are Markovian [Puterman, 2014, Section 2.1.6]. However, when the dependence is only on a history of fixed size, such a policy induces a Markov stochastic process in an extended state space, where states are defined as the pairs $(s_t, h_t)$. In order to be able to lean on existing theoretical results, such as the Policy Gradient Theorem [Sutton et al., 2000], and to use existing learning algorithms, we will discuss learning policies in this extended MDP.
More formally, let $M = (S, A, P, R)$ be a given MDP, with $S$ and $A$ denoting the state and action sets, and $P$ and $R$ denoting the transition probability and reward functions, respectively. Let $p$ be an arbitrary positive integer. We define a modified MDP $M' = (S', A, P', R')$ with the state set $S' = S \times (S \times A)^p$, where $(S \times A)^p$ denotes the Cartesian product of the set $S \times A$ with itself $p$ times, and with $P'$ and $R'$ defined as follows: a state $s'_t = (s_t, s_{t-1}, a_{t-1}, \dots, s_{t-p}, a_{t-p})$ transitions under action $a_t$ to $s'_{t+1} = (s_{t+1}, s_t, a_t, \dots, s_{t-p+1}, a_{t-p+1})$ with probability $P(s_{t+1} \mid s_t, a_t)$, and $R'(s'_t, a_t) = R(s_t, a_t)$.
In other words, transitions in the modified MDP correspond to transitions in the original MDP, with states in $S'$ also containing the history of past states and actions in $M$. The interaction between the agent and the environment induced by $M'$ occurs in the following way. At each time step $t$ the agent is presented with the current state $s'_t$. Based on this state and its policy, it chooses an action $a_t$ from the set $A$ and sends it to the environment. Internally, the environment propagates the action to the original MDP $M$, currently in state $s_t$, which responds with a reward value $r_{t+1}$ and transitions to a new state $s_{t+1}$. At this moment, the MDP $M'$ transitions to the new state $s'_{t+1}$ and presents it to the agent together with the reward $r_{t+1}$. Let $s_0$ be an element of the set of initial states of $M$. A corresponding initial state of $M'$ is defined as $s'_0 = (s_0, s_0, a_0, \dots, s_0, a_0)$, where $a_0$ is any element of the action set $A$, for example the zero vector in the case of a continuous space. The particular choice of $a_0$ is immaterial, since it does not affect future transitions and rewards (more details in Appendix E). We define an autoregressive policy (ARP) over $M'$ as:
$$\pi_\theta(a_t \mid s'_t) = \mathcal{N}\Big(\mu_\theta(s_t) + \sigma_\theta(s_t) \odot \sum_{i=1}^{p} \phi_i \frac{a_{t-i} - \mu_\theta(s_{t-i})}{\sigma_\theta(s_{t-i})},\; \sigma^2\, \mathrm{diag}(\sigma_\theta^2(s_t))\Big), \qquad (11)$$

where $\mu_\theta$ and $\sigma_\theta$ are parametrized function approximators, such as deep neural networks. For notational brevity, we omit the dependence of the policy on $\phi_1, \dots, \phi_p$ and $\sigma$, since these values are constant once the autoregressive model (5) is selected. In this parametrization, $\mu_\theta$ should be thought of as the same parametrized function applied to different parts of the state vector $s'_t$; therefore each occurrence of $\mu_\theta$ and $\sigma_\theta$ in (11) contributes to the gradient with respect to the parameters $\theta$. Note that including a history of states and actions does not affect the dimensionality of the input to the function approximators, as both $\mu_\theta$ and $\sigma_\theta$ accept only states from the original space $S$ as inputs.

The history-dependent policy (11) results in the desired action distribution (10) in the original MDP $M$; at the same time, with respect to $M'$ it is just a particular case of a Gaussian policy (6). Formally, we perform learning in $M'$, where the policy is Markov, and therefore all the related theoretical results apply, and any off-the-shelf learning algorithm applicable to policies of type (6) can be used. In particular, the value function in, e.g., actor-critic architectures is learned with the usual methods. Empirically, we found that conditioning the value function only on the current state $s_t$ from the original MDP instead of the entire vector $s'_t$ gives more stable learning performance. It also keeps the critic network size invariant to the AR process order $p$.
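One possible way to realize the extended MDP $M'$ in practice is an environment wrapper that concatenates the current state with the last $p$ state-action pairs. The interface below (`reset`/`step` returning a 3-tuple, and the flat observation layout) is an assumption for illustration, not the paper's implementation:

```python
import numpy as np
from collections import deque

class HistoryWrapper:
    """Wrap an environment so that observations carry the current
    state plus the last p (state, action) pairs, as in the modified
    MDP M'. Assumes env.reset() -> obs and env.step(a) -> (obs, r, done)."""

    def __init__(self, env, p, act_dim):
        self.env, self.p, self.act_dim = env, p, act_dim

    def reset(self):
        s = self.env.reset()
        # Pad the initial history with copies of s_0 and zero actions;
        # the choice of padding action is immaterial for future rewards.
        self.states = deque([s] * (self.p + 1), maxlen=self.p + 1)
        self.actions = deque([np.zeros(self.act_dim)] * self.p, maxlen=self.p)
        return self._obs()

    def step(self, a):
        s, r, done = self.env.step(a)
        self.states.appendleft(s)          # newest first
        self.actions.appendleft(np.asarray(a, dtype=float))
        return self._obs(), r, done

    def _obs(self):
        # Flattened layout: [s_t, s_{t-1}, ..., s_{t-p}, a_{t-1}, ..., a_{t-p}];
        # the exact ordering is an implementation choice.
        return np.concatenate([*self.states, *self.actions])
```

Since $\mu_\theta$ and $\sigma_\theta$ only ever consume individual states sliced out of this vector, the networks themselves stay the same size regardless of $p$.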
By design, for each sample path in $M'$ there is a corresponding sample path in $M$ with identical rewards. Therefore, improving the policy and the obtained rewards in $M'$ results in an identical improvement of the corresponding history-dependent policy in $M$. Notice also that if $\sigma_\theta \equiv 0$ in (11), then the policy reduces to a Markov deterministic policy $a_t = \mu_\theta(s_t)$ in $M$. Therefore, the optimal policy in the set of ARPs defined by (11) is at least as good as the best deterministic policy in the set $\{\mu_\theta\}$. This is in contrast with action averaging approaches, where temporal smoothing is typically imposed on the entire action vector and not just on the exploration component, limiting the space of possible deterministic policies.
It is important to point out that for any history-dependent policy there exists an equivalent Markov stochastic policy with identical expected returns [Puterman, 2014, Theorem 5.5.1]. For the policy (11), for example, it can be constructed by averaging (11) over the set of all histories of size $p$ leading to each state. However, this equivalent policy is a non-trivial function of the state, unknown to us at the beginning of learning. It is certainly not given by a random initialization of (6), while a random initialization of (11) already provides consistent and smooth behavior. The equivalent Markov policy also cannot be derived analytically from (11), since computing it requires knowledge of the environment transition probabilities, which we cannot expect to have for each given task. From these considerations, the particular form of policy parametrization defined by (11) can also be thought of as an additional structure, enforced upon the general class of Markov policies, such as those defined by (6), restricting possible behaviors to temporally coherent ones.
Although the autoregressive term in (11) is formally a part of the distribution mean, numerically it corresponds to a stationary zero-mean random process $\sigma_\theta(s_t) \odot X_t$ (together with the sampling noise), where $X_t$ is the underlying AR process defined by (5). Therefore, it can be thought of as part of the action exploration component around the "true" mean given by $\mu_\theta(s_t)$. It is this part that ensures consistent and smooth exploration, as will be demonstrated in the next section.
In principle, one could define the policy in (11) using arbitrary values of the coefficients $\phi_i$ and $\sigma$. The role of the particular values of $\phi_i$, computed according to (5), is to ensure that the underlying AR process is stationary and the autoregressive part does not explode. The role of $\sigma$, computed by solving (2) with coefficients $\phi_i$ and $\gamma(0) = 1$, is to ensure that the variance of the underlying process is 1. The total variance around $\mu_\theta(s_t)$ is then conveniently defined by the agent-controlled $\sigma_\theta(s_t)$.
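To make the sampling rule concrete, a single ARP action draw following (9)-(10) can be sketched as below, with `mu` and `sig` as stand-in callables for the parametrized mean and scale networks (all names are illustrative):

```python
import numpy as np

def arp_action(mu, sig, states, past_actions, phi, sigma, rng):
    """One ARP action draw: a_t ~ N(mu(s_t) + sig(s_t) * m_t,
    (sigma * sig(s_t))**2), where m_t is the autoregressive history
    term sum_i phi_i * (a_{t-i} - mu(s_{t-i})) / sig(s_{t-i}).

    states: [s_t, s_{t-1}, ..., s_{t-p}] (current state first);
    past_actions: [a_{t-1}, ..., a_{t-p}];
    phi, sigma: fixed AR coefficients and noise scale from Section 4."""
    p = len(phi)
    m = sum(phi[i] * (past_actions[i] - mu(states[i + 1])) / sig(states[i + 1])
            for i in range(p))
    mean = mu(states[0]) + sig(states[0]) * m
    return mean + sigma * sig(states[0]) * rng.standard_normal()
```

With $\sigma = 0$ the draw reduces to the deterministic history term around the mean, and with $\phi = 0$ it reduces to ordinary Gaussian sampling, matching the two limiting cases discussed above.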
6 Experiments
We compared the conventional Gaussian policy with ARPs on a set of tasks with both sparse and dense reward functions, in simulation and in the real world. In the following learning experiments we used the OpenAI Baselines PPO implementation [Schulman et al., 2017, Dhariwal et al., 2017]. Results with Baselines TRPO [Schulman et al., 2015] are provided in Appendix C. For each experiment we used identical algorithm hyperparameters and neural network structures to parametrize $\mu_\theta$, $\sigma_\theta$, and the value networks for both Gaussian and ARP policies. We used the same set of random seeds to initialize neural networks and the same set of random seeds to initialize environments that involve uncertainty. Detailed parameters for each task are included in Appendix D. We did not perform a hyperparameter search to optimize for ARP performance, as our primary objective is to demonstrate the advantage of temporally coherent exploration even in a setting tuned for a standard Gaussian policy. A video of agent behaviors can be found at https://youtu.be/NCpyXBNqNmw. The code to reproduce the experiments is available at https://github.com/kindredresearch/arp.
The order of an autoregressive process
From Figure 1 one can notice that the temporal smoothness of realizations of the AR processes (5) empirically increases with both the parameter $\alpha$ and the order $p$. Why do we need higher-order processes if we can simply increase $\alpha$ to achieve a higher degree of temporal coherence? To answer this question it is helpful to consider the autocorrelation function (ACF) of these processes. White Gaussian noise by definition has an autocorrelation function equal to zero at any lag other than 0. An autoregressive process with nonzero coefficients generally has nonzero values of the autocorrelation function at all lags.
One of the reasons we are interested in autoregressive processes for exploration is that they provide smooth trajectories that do not result in jerky movement and do not damage physical robot hardware. Intuitively, the smoothness of a process realization is determined by the correlation $\rho(1)$ between subsequent observations, which for a given $p$ increases with increasing $\alpha$. However, given the same value of $\rho(1)$, processes of different orders behave differently. Figure 2a shows ACFs for different processes defined by (5), with their values of $\alpha$ chosen to match the same value of $\rho(1)$, while Figure 2b shows realizations of these processes. ACF values at higher orders decrease much faster with increasing time lag compared to the first-order process, where correlation between past and future observations lingers over long periods of time. As shown in Figure 2b, the first-order AR process produces nearly a constant function, while the fifth-order process exhibits much more diverse exploratory behavior. Given the same correlation between subsequent realizations, higher-order autoregressive processes exhibit lower correlation between observations distant in time, resulting in trajectories with better exploration potential. In robotics applications, where smoothness of the trajectory can be critical, higher-order autoregressive processes may therefore be a preferable choice. Empirically, we found that third-order processes provide sufficiently smooth trajectories while exhibiting good exploratory behavior, and we used them in all our subsequent learning experiments, varying only the smoothing parameter $\alpha$.
Toy environment with sparse reward
To demonstrate the advantage of temporally consistent exploration, particularly at high action rates, we designed a toy Square environment with a 2D continuous state space bounded by a 10x10 square arena. The agent controls a dot through continuous direct velocity control. The agent is initialized in the middle of the arena at the start of each episode and receives a −1 reward at each time step, scaled by the time step duration. The target is generated at a random location on a circle of diameter 5 centered at the middle of the arena, to make episodes homogeneous in difficulty. The episode ends when the agent approaches the target to within a distance of 0.5. The action space is bounded within a two-dimensional interval. The observation vector contains the agent's position, velocity, and the vector difference between the agent's position and the target position.
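A self-contained sketch of such a Square environment might look as follows. Details that the text does not pin down, such as the per-dimension velocity bound of 1 and the exact observation layout, are assumptions:

```python
import numpy as np

class SquareEnv:
    """Sketch of the toy Square environment: a 10x10 arena, direct
    velocity control, -1 reward per step scaled by dt, episode ends
    within distance 0.5 of a target on a circle of diameter 5."""

    def __init__(self, dt=0.1, rng=None):
        self.dt = dt
        self.rng = rng or np.random.default_rng()

    def reset(self):
        self.pos = np.array([5.0, 5.0])                  # arena center
        theta = self.rng.uniform(0.0, 2.0 * np.pi)
        self.target = self.pos + 2.5 * np.array([np.cos(theta), np.sin(theta)])
        return self._obs(np.zeros(2))

    def step(self, action):
        v = np.clip(action, -1.0, 1.0)                   # assumed bound
        self.pos = np.clip(self.pos + v * self.dt, 0.0, 10.0)
        done = np.linalg.norm(self.pos - self.target) < 0.5
        return self._obs(v), -1.0 * self.dt, done

    def _obs(self, v):
        # position, velocity, and the vector difference to the target
        return np.concatenate([self.pos, v, self.target - self.pos])
```

Because the action space is clipped at the bound, simply inflating the Gaussian variance saturates most actions rather than diversifying them, which is the failure mode discussed in the results below.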
To compare exploration efficiency, we ran randomly initialized ARP and Gaussian agents for 10 million simulated seconds at different action rates. Figure 3a shows the average time to reach the target as a function of the action rate. The results show that the optimal degree of temporal coherence depends on environment properties, such as the action rate. At a low control frequency, white Gaussian exploration is more effective than ARPs with high $\alpha$, as in the latter case the agent quickly reaches the boundary of the state space and gets stuck there. The efficiency of Gaussian exploration drops dramatically as the action rate increases. However, it is possible to recover the same exploration performance in an ARP by accordingly increasing the parameter $\alpha$. This effect is visualized in Figure 3b, which shows five 10-second-long exploration trajectories at 10 Hz and 100 Hz control for Gaussian and ARP policies. Although run for the same amount of simulated time, Gaussian exploration at 100 Hz covers a substantially smaller area of the state space compared to 10 Hz control, while increasing $\alpha$ from 0.8 to 0.95 (values chosen empirically) results in ARP trajectories covering a similar space at both action rates. Note that the issue with the Gaussian policy cannot be fixed by simply increasing the variance, as most actions will then be clipped at the boundary, resulting in similarly poor exploration. The top right plot of Figure 3b shows Gaussian exploration trajectories with the default (blue) and an increased (gray) variance. Contrary to common intuition, in bounded action spaces Gaussian exploration with high variance does not produce a diverse state-action visitation.
The advantage of ARPs in exploration translates into an advantage in learning. Figure 4 shows learning curves (averaged over 5 random seeds) on the Square environment at different action rates, run for 50,000 seconds of total simulated time with episodes limited to 1,000 simulated seconds. Not only do ARPs exhibit better learning, but their initial random behavior also gives much higher returns compared to the initial Gaussian agent behavior. At higher action rates, ARPs with higher $\alpha$ produce better results.
In the formulation of the AR(1) process used in Lillicrap et al. [2015] and Tallec et al. [2019], the autoregressive coefficient corresponds to $1 - \theta \Delta t$, where $\Delta t$ is the time step duration and $\theta$ is a process parameter. Hence, in that formulation the coefficient naturally approaches 1 as $\Delta t$ approaches zero. Note that in order to achieve the best performance on each given task, the parameter $\theta$ still needs to be tuned, just as the parameter $\alpha$ needs to be tuned in our formulation. The optimal values of these parameters depend not only on the action rate, but also on environment properties, such as the size of the state space relative to the typical size of an agent's step.
MuJoCo experiments
Figure 5 shows the learning results on standard OpenAI Gym MuJoCo environments [Brockman et al., 2016]. These environments have dense rewards, so consistent exploration is less crucial here compared to tasks with sparse rewards. Nevertheless, we found that ARPs perform similarly to, or slightly better than, a standard Gaussian policy. On the Swimmer-v2 environment, ARP resulted in much better performance than the Gaussian policy, possibly because smooth trajectories are highly rewarded in this environment.
Physical robot experiments
On a UR5 robotic arm we were able to obtain results similar to those in the toy environment. We designed a sparse reward version of the UR5 Reacher 2D task introduced in [Mahmood et al., 2018b]. In the modified task, at each time step the agent receives a −1 reward scaled by the time step duration. The episode ends when the agent reaches the target to within a distance of 0.05. In order to provide sufficient time for exploration in a sparse reward setting, we doubled the episode duration to 8 seconds. Figure 6 shows the learning curves for 25 Hz and 125 Hz control. Each curve is an average across 4 random seeds. The Gaussian policy fails to learn in the 125 Hz control setting within a 5 hour time limit, while ARP was able to find an effective policy in 50% of the runs, with higher values of $\alpha$ becoming more effective at the higher action rate.
7 Conclusions
We introduced autoregressive Gaussian policies (ARPs) for temporally coherent exploration in continuous control deep reinforcement learning. The policy form is grounded in the theory of stationary autoregressive stochastic processes. We derived a family of stationary Gaussian autoregressive processes of an arbitrary order with a continuously adjustable degree of temporal coherence between subsequent observations, and an agent policy that implements these processes through a standard agent-environment interface. Empirically, we showed that ARPs result in superior exploration and learning in sparse reward tasks and perform on par with or better than standard Gaussian policies in dense reward tasks. On physical hardware, ARPs produce smooth trajectories that are safer to execute compared to trajectories produced by conventional Gaussian exploration.
References
 [Andrychowicz et al.2018] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous inhand manipulation. arXiv preprint arXiv:1808.00177, 2018.
 [Benbrahim and Franklin1997] Hamid Benbrahim and Judy A Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22(3-4):283–302, 1997.
 [Brockman et al.2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
 [Brockwell et al.2002] Peter J Brockwell, Richard A Davis, and Matthew V Calder. Introduction to time series and forecasting, volume 2. Springer, 2002.
 [Burda et al.2018] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Largescale study of curiositydriven learning. arXiv preprint arXiv:1808.04355, 2018.
 [Dhariwal et al.2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. GitHub repository, 2017.
 [Fortunato et al.2017] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
 [Haarnoja et al.2018] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
 [van Hoof et al.2017] Herke van Hoof, Daniel Tanneberg, and Jan Peters. Generalized exploration in policy search. Machine Learning, 106(9–10):1705–1724, 2017.
 [Houthooft et al.2016] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
 [Kalashnikov et al.2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In Aude Billard, Anca Dragan, Jan Peters, and Jun Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 651–673. PMLR, 29–31 Oct 2018.
 [Levine et al.2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 [Lillicrap et al.2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [Mahmood et al.2018a] A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4635–4640. IEEE, 2018.
 [Mahmood et al.2018b] A. Rupam Mahmood, Dmytro Korenkevych, Gautham Vasan, William Ma, and James Bergstra. Benchmarking reinforcement learning algorithms on realworld robots. In CoRL, 2018.
 [Metz et al.2017] Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035, 2017.
 [Mnih et al.2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 [Ng et al.1999] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.

 [Oudeyer et al.2007] Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
 [Pathak et al.2017] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
 [Peters and Schaal2007] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pages 745–750. ACM, 2007.
 [Peters and Schaal2008] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
 [Plappert et al.2017] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
 [Plappert et al.2018] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
 [Puterman2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 [Schulman et al.2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 [Schulman et al.2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [Sutton et al.2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [Tallec et al.2019] Corentin Tallec, Léonard Blier, and Yann Ollivier. Making deep Q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732, 2019.
 [Vinyals et al.2017] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
 [Wawrzynski2015] Pawel Wawrzynski. Control policy with autocorrelated noise in reinforcement learning for robotics. International Journal of Machine Learning and Computing, 5(2):91, 2015.
Appendix A Proof of Proposition 4.1
Lemma A.1.
For any $p \geq 1$ and for any $\alpha \in [0, 1)$ the autoregressive process
(12)  $X_n = \sum_{k=1}^{p} (-1)^{k+1} \binom{p}{k} \alpha^k X_{n-k} + \epsilon_n$
is stationary.
Proof.
The characteristic polynomial (13) of the process (12) is $\phi(z) = (1 - \alpha z)^p$. For $\alpha \in (0, 1)$ its only root $z = 1/\alpha$ lies outside the unit circle, and for $\alpha = 0$ the process is white noise; in both cases the process is stationary [Brockwell et al.2002, Section 3.1]. ∎
Lemma A.2.
Let be an autoregressive process defined in (12). If its white noise component is Gaussian, i.e. , then
are identically distributed normal random variables with zero mean and finite variance
.Proof.
According to Lemma A.1 the process (12) is stationary, meaning $X_n$ are identically distributed random variables with finite variance $\sigma_X^2$. If $\epsilon_n$ is Gaussian, then $X_n$ are identically distributed normal variables. Let $\mu$ denote the mean of this distribution. Taking the expectation of both sides of (12) gives
$\mu = \mu \sum_{k=1}^{p} \varphi_k$.
Since the polynomial (13) does not have a root at $z = 1$, we have $\sum_{k=1}^{p} \varphi_k \neq 1$, and therefore $\mu = 0$. ∎
We established that under Gaussian white noise the process (12) represents a series of identically distributed normal variables with zero mean and finite variance. From the linear form of the process it is clear that the process variance $\sigma_X^2$ depends linearly on the variance $\sigma^2$ of the white noise component: scaling $\sigma^2$ by a factor $c > 0$ scales $\sigma_X^2$ by the same factor $c$. Therefore, it should be possible to pick $\sigma^2$ such that the variance $\sigma_X^2$ takes any desired positive value, in particular, the value of 1. The following lemma formalizes this observation.
Lemma A.3.
For any $p \geq 1$ let $\varphi_1, \dots, \varphi_p$ be a set of coefficients corresponding to a stationary AR($p$) process defined in (1). Then the linear system
(14)  $\rho_j = \sum_{k=1}^{p} \varphi_k \rho_{j-k}, \quad j = 1, \dots, p,$
and  $\sigma_p^2 = 1 - \sum_{k=1}^{p} \varphi_k \rho_k$
has a unique solution $(\rho_1, \dots, \rho_p, \sigma_p)$, where $\rho_{-k} = \rho_k$ and $\rho_0 = 1$. Furthermore, the autoregressive process
(15)  $X_n = \sum_{k=1}^{p} \varphi_k X_{n-k} + \sigma_p \epsilon_n$
is stationary with variance $\sigma_X^2 = 1$.
Proof.
For any $\sigma_p > 0$, stationarity of the process (15) follows from stationarity of the process (1), since both processes share the same coefficients $\varphi_1, \dots, \varphi_p$, and therefore the same characteristic polynomial.
Since the process is stationary, the corresponding system of Yule-Walker equations (2) has a unique solution with respect to the autocovariances $\gamma(0), \dots, \gamma(p)$ [Brockwell et al.2002, Section 3.1]. Notice, however, that the system (2) is homogeneous with respect to the variables $\gamma(0), \dots, \gamma(p)$, meaning that if $(\gamma(0), \dots, \gamma(p))$ is a solution, then $(c\gamma(0), \dots, c\gamma(p))$ is also a solution for any $c > 0$. Therefore, since $\gamma(0) > 0$, there exists a unique solution such that $\gamma(0) = 1$. We can find it by substituting $\gamma(0) = 1$ into the linear system (2), resulting in (14). This solution corresponds to a stationary process with $\sigma_X^2 = \gamma(0) = 1$.
∎
Now the proof of Proposition 4.1 is straightforward. By Lemma A.1 the coefficients $\varphi_1, \dots, \varphi_p$ correspond to a stationary autoregressive process; therefore, by Lemma A.3 the system (14) has a unique solution and the process (4) is a stationary process with unit variance. Since $\epsilon_n$ is normal, by Lemma A.2 the process (4) is Gaussian with zero mean. Therefore, $X_n \sim \mathcal{N}(0, 1)$. ∎
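For concreteness, a worked instance of the construction in the first-order case (taking $p = 1$ with coefficient $\varphi_1 = \alpha$) recovers the familiar process used in prior work:

```latex
% First-order instance (p = 1, \varphi_1 = \alpha):
% the Yule--Walker system (14) with \rho_0 = 1 reduces to
\rho_1 = \alpha, \qquad \sigma_1^2 = 1 - \alpha \rho_1 = 1 - \alpha^2,
% so the stationary unit-variance process is
X_n = \alpha X_{n-1} + \sqrt{1 - \alpha^2}\, \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, 1).
```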
Appendix B Example of an AR(3) Gaussian process
A third-order process AR(3) defined by (4) has the form:
$X_n = 3\alpha X_{n-1} - 3\alpha^2 X_{n-2} + \alpha^3 X_{n-3} + \sigma_3 \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, 1),$
where $\sigma_3$ is a solution of the system:
$\rho_j = 3\alpha \rho_{j-1} - 3\alpha^2 \rho_{j-2} + \alpha^3 \rho_{j-3}, \quad j = 1, 2, 3,$
and  $\sigma_3^2 = 1 - 3\alpha \rho_1 + 3\alpha^2 \rho_2 - \alpha^3 \rho_3,$
resulting in
$\sigma_3^2 = \dfrac{(1 - \alpha^2)^5}{1 + 4\alpha^2 + \alpha^4}.$
For any $\alpha \in [0, 1)$ this process is stationary with $X_n \sim \mathcal{N}(0, 1)$.
If $\alpha = 0$, the process reduces to $X_n = \epsilon_n$, i.e., to uncorrelated white noise exploration.
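As a numerical sanity check, the unit stationary variance can be verified by simulation. The sketch below (an illustration, not part of the released code) simulates a third-order process of this family with coefficients $3\alpha$, $-3\alpha^2$, $\alpha^3$, with the noise scale $\sigma_3$ chosen so that the stationary variance equals 1, and estimates the sample variance:

```python
import math
import random

def simulate_ar3(alpha, n_steps, seed=0):
    """Simulate X_n = 3a X_{n-1} - 3a^2 X_{n-2} + a^3 X_{n-3} + sigma3 * eps_n,
    with sigma3 chosen so that the stationary variance equals 1."""
    rng = random.Random(seed)
    sigma3 = math.sqrt((1 - alpha ** 2) ** 5 / (1 + 4 * alpha ** 2 + alpha ** 4))
    x1 = x2 = x3 = 0.0  # three most recent values of the process
    samples = []
    for _ in range(n_steps):
        x = (3 * alpha * x1 - 3 * alpha ** 2 * x2 + alpha ** 3 * x3
             + sigma3 * rng.gauss(0.0, 1.0))
        x3, x2, x1 = x2, x1, x
        samples.append(x)
    return samples

samples = simulate_ar3(alpha=0.5, n_steps=200_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# The sample mean is empirically close to 0 and the sample variance close to 1.
```

Increasing alpha toward 1 makes consecutive samples more strongly correlated while leaving the stationary distribution standard normal.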
Appendix C Learning results in simulation with ARPs and OpenAI Baselines TRPO
We ran a set of experiments with ARPs (p = 3) and the OpenAI Baselines TRPO algorithm. Figures 6(a) and 6(b) show learning curves in the Mujoco and Square environments, respectively. The hyperparameters are specified in Appendix D. TRPO delivered performance similar to PPO on the Mujoco tasks (with more stable performance on the Reacher-v2 task); however, PPO produced better results on the sparse reward Square environment.
Appendix D Algorithm parameters
In our experiments we used the default OpenAI Baselines parameters specified in https://github.com/openai/baselines/blob/master/baselines, with the exception of the discount factor γ and the GAE parameter λ, for which we used a larger value of 0.995. For experiments in the Square environment at 10Hz action rate we used 4 times larger batch and optimization batch sizes to account for longer episodes. For the Square and UR5 Reacher 2D environments at higher action rates we used the same parameter values as for the basic versions (10Hz and 25Hz, respectively), but scaled the batch and optimization batch sizes accordingly, so that the data within a batch at all action rates corresponds to the same amount of simulated time (e.g., for UR5 Reacher 2D at 125Hz we used a 5 times larger batch and optimization batch than for UR5 Reacher 2D at 25Hz). In all experiments we used fully connected networks with the same hidden sizes to parametrize the policy and value networks. For each experiment, identical parameters and network architectures were used for the standard Gaussian policy and the ARP. The tables below show the parameter values for the basic versions of the environments:
PPO:
Hyperparameter    Square at 10Hz    Mujoco and UR5 Reacher 2D at 25Hz
batch size        8192              2048
stepsize
opt. batch size   256               64
opt. epochs       10                10
γ                 0.995             0.995
λ                 0.995             0.995
clip              0.2               0.2
hidden layers     2                 2
hidden sizes      64                64
TRPO:
Hyperparameter    Square at 10Hz    Mujoco
batch size        8192              1024
max KL            0.01              0.01
CG iters          10                10
vf iters          5                 5
vf stepsize
γ                 0.995             0.995
λ                 0.995             0.995
hidden layers     2                 2
hidden sizes      64                64
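The batch scaling rule described above (a batch should always cover the same amount of simulated time regardless of action rate) can be sketched as follows; `scaled_batch_size` is a hypothetical helper, not part of the released code:

```python
def scaled_batch_size(base_batch: int, base_rate_hz: float, rate_hz: float) -> int:
    """Scale the batch size proportionally to the action rate, so that a batch
    always covers the same amount of simulated time."""
    return int(base_batch * rate_hz / base_rate_hz)

# UR5 Reacher 2D: base batch 2048 at 25Hz becomes 5 times larger at 125Hz.
print(scaled_batch_size(2048, 25.0, 125.0))  # 10240
```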
Appendix E Implementation details
The equation (11) in the main text, replicated for convenience below, defines a stationary autoregressive policy distribution over actions under given parameters $\theta$:
$\pi_\theta(a_n \mid s_n, h_n) = \mathcal{N}\!\left(\mu_\theta(s_n) + \sigma \sum_{k=1}^{p} \varphi_k z_{n-k},\; \sigma_p^2 \sigma^2\right), \qquad z_{n-k} = \frac{a_{n-k} - \mu_\theta(s_{n-k})}{\sigma},$
where $h_n$ is a fixed length history of past states and actions.
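As an illustration of this policy structure, the following is a minimal first-order (p = 1) sketch. The class name, the linear mean function, and all parameter values are hypothetical, not part of the released implementation; the point is that the action is the deterministic mean plus a scaled, stationary, temporally coherent noise state:

```python
import math
import random

class ARPolicy:
    """First-order autoregressive Gaussian policy sketch:
    a_n = mu(s_n) + sigma * z_n, where z_n = alpha * z_{n-1} + sqrt(1 - alpha^2) * eps_n,
    so the noise z_n stays standard normal while being temporally coherent."""

    def __init__(self, mean_fn, sigma, alpha, seed=0):
        self.mean_fn = mean_fn  # deterministic parametrized mean mu(s)
        self.sigma = sigma      # exploration scale
        self.alpha = alpha      # temporal coherence parameter in [0, 1)
        self.z = 0.0            # stationary AR(1) noise state
        self.rng = random.Random(seed)

    def act(self, state):
        sigma_p = math.sqrt(1.0 - self.alpha ** 2)
        self.z = self.alpha * self.z + sigma_p * self.rng.gauss(0.0, 1.0)
        return self.mean_fn(state) + self.sigma * self.z

policy = ARPolicy(mean_fn=lambda s: 0.1 * s, sigma=0.2, alpha=0.8)
actions = [policy.act(1.0) for _ in range(5)]
```

With alpha = 0 the noise state reduces to white noise and the policy coincides with a conventional Gaussian policy; larger alpha yields smoother, temporally coherent action trajectories at the same marginal action distribution.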