I Introduction
Recently, in a series of blog posts (http://www.argmin.net/2018/03/20/mujocoloco/) and in [12], Ben Recht and colleagues reached the following conclusion: “Our findings contradict the common belief that policy gradient techniques, which rely on exploration in the action space, are more sample efficient than methods based on finite differences.”
That’s a conclusion that we have often felt has much merit. In a survey (with Jens Kober and Jan Peters) [10], we wrote:
“Black box methods are general stochastic optimization algorithms (Spall, 2003) using only the expected return of policies, estimated by sampling, and do not leverage any of the internal structure of the RL problem. These may be very sophisticated techniques (Tesch et al., 2011) that use response surface estimates and bandit-like strategies to achieve good performance. White box methods take advantage of some of the additional structure within the reinforcement learning domain, including, for instance, the (approximate) Markov structure of problems, developing approximate models, value-function estimates when available (Peters and Schaal, 2008c), or even simply the causal ordering of actions and rewards. A major open issue within the field is the relative merits of these two approaches: in principle, white box methods leverage more information, but with the exception of models (which have been demonstrated repeatedly to often make tremendous performance improvements, see Section 6), the performance gains are traded off with additional assumptions that may be violated and less mature optimization algorithms. Some recent work including (Stulp and Sigaud, 2012; Tesch et al., 2011) suggest that much of the benefit of policy search is achieved by black-box methods.”
Many empirical examples (the classic Tetris [17], for instance) demonstrate that Cross-Entropy or other (fast) black-box heuristic methods are generally far superior to any policy gradient method, and that policy gradient methods often achieve orders of magnitude better performance than methods demanding still more structure, e.g., temporal difference learning [16]. In this paper, we set out to study why black-box parameter space exploration methods work better, and in what situations we can expect them to perform worse than traditional action space exploration methods.
II The Structure of Policies
Action space exploration methods, like REINFORCE [18], SEARN [6], PSDP [3], AGGREVATE [14, 15], and LOLS [5], leverage more structure than parameter space exploration methods. More specifically, they understand the relationship (e.g., the Jacobian) between a policy’s parameters and its outputs. We could ask: Does this matter? In the regime of large parameter spaces and small output spaces, as often explored by our colleagues in, e.g., Atari games [13], it might. (Typical implementations also leverage the causality of the reward structure, although one might expect that to be relatively minor.)
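To make this distinction concrete, here is a minimal sketch (all names are ours, and the linear Gaussian policy is chosen purely for illustration) contrasting the two estimator families: the action space estimator exploits the policy’s structure through its score function, so the feature vector (the Jacobian of the mean) enters the estimate, while the parameter space estimator treats the policy as a black box and only evaluates returns at perturbed parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def action_space_grad(theta, x, reward_fn, sigma=0.1):
    """Score-function (REINFORCE-style) estimate for a linear Gaussian policy.

    For a ~ N(theta @ x, sigma^2 I), grad_theta log p(a|x) is the outer
    product of (a - theta @ x) / sigma^2 with x, i.e. the feature vector
    (the policy's Jacobian) appears explicitly in the estimate.
    """
    mean = theta @ x
    a = mean + sigma * rng.standard_normal(mean.shape)  # explore in action space
    r = reward_fn(a)
    score = np.outer((a - mean) / sigma**2, x)          # d log pi / d theta
    return r * score

def parameter_space_grad(theta, x, reward_fn, nu=0.1):
    """Black-box (ARS/finite-difference-style) estimate: perturb the
    parameters directly and never look inside the policy."""
    delta = rng.standard_normal(theta.shape)            # explore in parameter space
    r_plus = reward_fn((theta + nu * delta) @ x)
    r_minus = reward_fn((theta - nu * delta) @ x)
    return (r_plus - r_minus) / (2 * nu) * delta
```

Both return an estimate shaped like the parameter matrix; only the first one ever touches the input features directly.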
In particular, the intuition behind the use of action space exploration techniques is that they should perform well when the action space is quite small compared to the parametric complexity required to solve the Reinforcement Learning problem.
III Experiments
We test this intuition across three experiments: MNIST, Linear Regression, and LQR. The code for all these experiments is published at https://github.com/LAIRLAB/ARSexperiments.
III-A MNIST
To investigate this we, like [1], consider some toy RL problems, beginning with a single-time-step MDP. In particular, we start with a classic problem: MNIST digit recognition. To put it in a bandit/RL framework, we consider a reward for getting the digit correct and a reward for getting it wrong. We use a LeNet-style architecture [11]: two convolution layers, followed by two fully connected layers and an output softmax layer producing a one-hot encoding over the ten digit classes. The number of trainable parameters in this architecture far exceeds the ten-dimensional action space. The experimental setup is described in greater detail in Appendix C2. We then compare the learning curves of vanilla REINFORCE with supervised learning and with ARS V2-t, an augmented random search in parameter space procedure introduced in Mania et al. [12]. Figure 1 demonstrates the results, where solid lines represent mean test accuracy over random seeds and the shaded region corresponds to one standard deviation. Clearly, in this setting, where the parameter space dimensionality significantly exceeds the action space dimensionality, action space exploration methods such as REINFORCE outperform parameter space exploration methods like ARS.
III-B Linear Regression
Following the heuristic of the Linearization Principle introduced in [1], we attempt to get a handle on the trade-off between sample complexity in parameter space and in action space by considering another simple, one-step problem: linear regression with a single output variable, so the number of parameters equals the input dimensionality. We consider learning a random linear function.
Before empirically studying REINFORCE and ARS, we first perform a regret analysis (Appendix A) on online linear regression for three learning strategies: (1) online gradient descent, (2) exploration in action space, and (3) exploration in parameter space, which correspond to the full information setting, the linear contextual bandit setting, and the bandit setting, respectively. The online gradient descent approach is simply OGD from Zinkevich [19] applied to the full information online linear regression setting, and for exploration in parameter space we simply use the BGD algorithm from Flaxman et al. [7], which completely ignores the properties of the linear regression setting and works in a pure bandit setting. The algorithm for random exploration in action space (possibly the simplest linear contextual bandit algorithm, shown in Alg. 3) operates in the middle: it has access to feature vectors and performs random search in prediction space to form estimates of gradients. The analysis of all three algorithms is performed in the online setting: no statistical assumptions are made on the sequence of loss functions, and multi-point queries per loss function [2] are not allowed. (Note that the ARS algorithms presented in Mania et al. [12] actually take advantage of the reset property of the episodic RL setting and perform two-point feedback queries to reduce the variance of gradient estimates.)
The detailed algorithms and analysis are provided in Appendix C3. The main difference between exploration in action space and exploration in parameter space is that exploration in action space can take advantage of the fact that the predictor it is learning is linear and that it has access to the linear feature vector (i.e., the Jacobian of the predictor). The key advantage of exploration in action space over exploration in parameter space is that exploration in action space is input-dimension free. (It will depend on the output dimension if one considers multivariate linear regression.) More specifically, one can show that in order to achieve average regret, the algorithm performing exploration in parameter space (Alg. 2) requires samples (we ignore problem-specific parameters such as the maximum norm of the feature vector, the maximum norm of the linear predictor, and the maximum value of the prediction, which we assume are constants), while the algorithm performing exploration in action space (Alg. 3) requires , which is not explicitly dependent on .
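As an illustrative sketch of the three feedback models (not the exact algorithms of the appendix; the step size, the one-point estimators, and the gradient clipping below are our own simplifications for stability), the key difference is visible in how each strategy forms its gradient: only the action space learner multiplies its scalar exploration signal by the feature vector, while the parameter space learner explores in all input dimensions at once.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 50, 20000          # input dimension, number of rounds
w_true = rng.standard_normal(d) / np.sqrt(d)

def run(strategy, eta=0.05, delta=0.1):
    """Average squared loss over the last 1000 rounds of online linear
    regression under one of three feedback models."""
    w = np.zeros(d)
    losses = []
    for t in range(T):
        x = rng.standard_normal(d) / np.sqrt(d)   # feature vector, norm ~ 1
        y = float(w_true @ x)                     # noiseless label
        if strategy == "full_info":               # OGD: sees the true gradient
            g = 2 * (w @ x - y) * x
        elif strategy == "action":                # explore in 1-d prediction space
            u = rng.choice([-1.0, 1.0])
            loss = (w @ x + delta * u - y) ** 2   # bandit feedback only
            g = (loss / delta) * u * x            # uses the feature vector (Jacobian)
        else:                                     # "parameter": explore in d dims
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            loss = ((w + delta * u) @ x - y) ** 2
            g = (d * loss / delta) * u            # ignores x entirely (BGD-style)
        g = g / max(1.0, np.linalg.norm(g))       # clipped for stability in this toy demo
        w -= eta * g
        losses.append((w @ x - y) ** 2)
    return float(np.mean(losses[-1000:]))
```

Note that the action space estimator is unbiased for the true gradient (averaging over u in {-1, +1} recovers 2(w @ x - y) x), while the parameter space estimator carries a variance factor that grows with d.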
We empirically compare the test squared loss of REINFORCE, natural REINFORCE (which simply amounts to whitening of the input features) [8], and the ARS V2-t method discussed in Mania et al. [12] with classic follow-the-regularized-leader (supervised learning). The results are shown in Figures 2(a), 2(b), and 2(c), where solid lines represent mean test squared loss over 10 random seeds and the shaded region corresponds to one standard deviation. The learning curves match our expectations and, moreover, show that this bandit-style REINFORCE lies between the curves of supervised learning and parameter space exploration: that is, action space exploration takes advantage of the Jacobian of the policy itself and can learn much more quickly.
III-C LQR
Finally, we consider what happens as we extend the time horizon. In Appendix B, we consider a finite-horizon optimal control task with deterministic dynamics, a fixed initial state, and a linear stationary policy. We show that we can estimate the policy gradient via a random search in parameter space as ARS does (Eq. 21 in Appendix B), or we can do a random search in action space across all time steps independently (Eq. 20 in Appendix B). Comparing the norms of the two gradient estimates, we see that the major difference is that the norm of the gradient estimate from random exploration in parameter space (Eq. 21) scales linearly with the dimensionality of the state space (i.e., the dimensionality of the parameter space, as we assume a linear policy), while the norm of the gradient estimate from random search in action space (Eq. 20) scales linearly with the product of the horizon and the action space dimensionality. Hence, when the dimensionality of the state space is smaller than the product of the horizon and the action space dimensionality, one may prefer random search in parameter space; otherwise, random search in action space is preferable. Note that for most of the continuous control tasks in OpenAI Gym [4], the horizon is significantly larger than the state space dimensionality. (Take Walker2d-v2 as an example: the horizon is usually equal to 1000, so random exploration in action space is actually searching a 6000-dimensional space, while random search in parameter space is searching a 17-dimensional space. This explains why ARS [12] outperforms most of the action space exploration methods on these tasks.)
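The two estimators of this section can be sketched as follows (a toy stand-in for Eqs. 20 and 21, not their exact form; all names are ours). Note where each perturbation lives: the parameter space estimate perturbs the (action dim × state dim) policy matrix once per rollout, while the action space estimate draws an independent perturbation at each of the H steps and maps it back to parameters through the policy’s Jacobian, i.e. the visited states.

```python
import numpy as np

rng = np.random.default_rng(2)

def rollout(K, A, B, Q, R, x0, H, action_noise=None):
    """Total quadratic cost of the linear policy u_t = K x_t on the
    deterministic dynamics x_{t+1} = A x_t + B u_t."""
    x, cost = x0.copy(), 0.0
    for t in range(H):
        u = K @ x
        if action_noise is not None:
            u = u + action_noise[t]
        cost += float(x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return cost

def param_space_estimate(K, A, B, Q, R, x0, H, nu=0.01):
    """ARS-style: one two-point perturbation of the whole parameter
    matrix; the random direction lives in parameter space."""
    delta = rng.standard_normal(K.shape)
    cp = rollout(K + nu * delta, A, B, Q, R, x0, H)
    cm = rollout(K - nu * delta, A, B, Q, R, x0, H)
    return (cp - cm) / (2 * nu) * delta

def action_space_estimate(K, A, B, Q, R, x0, H, nu=0.01):
    """Perturb the actions independently at every step instead; the
    random direction lives in an (H x action dim)-dimensional space."""
    du = rng.standard_normal((H, B.shape[1]))
    cp = rollout(K, A, B, Q, R, x0, H, action_noise=nu * du)
    cm = rollout(K, A, B, Q, R, x0, H, action_noise=-nu * du)
    xs, x = [], x0.copy()
    for _ in range(H):                       # states of the unperturbed rollout
        xs.append(x)
        x = (A + B @ K) @ x
    # chain rule: pull the action-space direction back through du_t/dK,
    # a per-step outer product with the visited state
    direction = sum(np.outer(du[t], xs[t]) for t in range(H))
    return (cp - cm) / (2 * nu) * direction
```

Both return a matrix shaped like K; the difference discussed above is in the dimensionality of the space each one randomly searches.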
The simplest setting in which to empirically evaluate this is one where we would all likely agree that a model-based method would be the preferred approach: a finite-horizon Linear Quadratic Regulator problem with a 1-d control space and a d-dimensional state space. We then compare random search (ARS V1-t from Mania et al. [12]) against REINFORCE (with ADAM [9] as the underlying optimizer) in terms of the number of samples needed to train a stationary policy that reaches within 5% error of the non-stationary optimal policy’s performance, as the horizon varies. Fig. 3 shows the comparison, where the statistics are averaged over 10 random seeds (mean ± standard error).
From Fig. 3 we see that as the horizon increases, both algorithms require a larger number of samples. (Our simple analysis of finite-horizon optimal control with deterministic dynamics shows that both gradient estimators’ norms depend linearly on the objective value measured at the current policy; as the horizon increases, the total cost increases.) While ARS is stable across different random seeds, we found that REINFORCE becomes more and more sensitive to random seeds as the horizon grows, and its performance becomes less stable. Notice, however, that for small horizons REINFORCE has lower variance as well and can consistently outperform ARS. Though we would expect REINFORCE to require more samples than ARS once the product of horizon and action space dimensionality exceeds the dimension of the state space, in this experiment we only notice this phenomenon at the largest horizon, though the variance of REINFORCE is already terrible at that point.
IV Conclusion
Mania et al. [12] have shown that simple random search in parameter space is a competitive alternative to traditional action space exploration methods. In our work, we show that this is only true for domains where the parameter space dimensionality is significantly smaller than the product of action space dimensionality and horizon length. For domains where this does not hold true, action space exploration methods such as REINFORCE [18] are more sample efficient as they do not have an explicit dependence on parameter space dimensionality.
V Acknowledgements
We thank the entire LairLab for stimulating discussions, and Ben Recht for his interesting blog posts.
References
 [1] An outsider’s tour of reinforcement learning. http://www.argmin.net/2018/05/11/outsiderrl/. Accessed: 2019-05-26.
 Agarwal et al. [2010] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
 Bagnell et al. [2004] J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, pages 831–838, 2004.
 Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
 Chang et al. [2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In ICML, 2015.
 Daumé III et al. [2009] Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 2009.
 Flaxman et al. [2005] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
 Kakade [2002] Sham Kakade. A natural policy gradient. NIPS, 2002.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kober et al. [2013] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Mania et al. [2018] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
 Mnih et al. [2015] Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 2015.
 Ross and Bagnell [2014] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.

 Sun et al. [2017] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In ICML, 2017.
 Sutton [1988] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
 Thiery and Scherrer [2009] Christophe Thiery and Bruno Scherrer. Building controllers for Tetris. ICGA Journal, 32:3–11, 2009.
 Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
 Zinkevich [2003] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML, 2003.