Exploration in Action Space

03/31/2020 ∙ by Anirudh Vemula, et al. ∙ Carnegie Mellon University

Parameter space exploration methods with black-box optimization have recently been shown to outperform state-of-the-art approaches in continuous control reinforcement learning domains. In this paper, we examine reasons why these methods work better and the situations in which they are worse than traditional action space exploration methods. Through a simple theoretical analysis, we show that when the parametric complexity required to solve the reinforcement learning problem is greater than the product of action space dimensionality and horizon length, exploration in action space is preferred. This is also shown empirically by comparing simple exploration methods on several toy problems.




I Introduction

Recently, in a series of blog posts (footnote 1: http://www.argmin.net/2018/03/20/mujocoloco/) and in [12], Ben Recht and colleagues reached the following conclusion: “Our findings contradict the common belief that policy gradient techniques, which rely on exploration in the action space, are more sample efficient than methods based on finite-differences.”

That’s a conclusion that we have often felt has much merit. In a survey (with Jens Kober and Jan Peters) [10], we wrote:

“Black box methods are general stochastic optimization algorithms (Spall, 2003) using only the expected return of policies, estimated by sampling, and do not leverage any of the internal structure of the RL problem. These may be very sophisticated techniques (Tesch et al., 2011) that use response surface estimates and bandit-like strategies to achieve good performance. White box methods take advantage of some of additional structure within the reinforcement learning domain, including, for instance, the (approximate) Markov structure of problems, developing approximate models, value-function estimates when available (Peters and Schaal, 2008c), or even simply the causal ordering of actions and rewards. A major open issue within the field is the relative merits of the these two approaches: in principle, white box methods leverage more information, but with the exception of models (which have been demonstrated repeatedly to often make tremendous performance improvements, see Section 6), the performance gains are traded-off with additional assumptions that may be violated and less mature optimization algorithms. Some recent work including (Stulp and Sigaud, 2012; Tesch et al., 2011) suggest that much of the benefit of policy search is achieved by black-box methods.”

Many empirical examples, the classic Tetris [17] for instance, demonstrate that Cross-Entropy or other (fast) black-box heuristic methods are generally far superior to any policy gradient method, and that policy gradient methods often achieve orders of magnitude better performance than methods demanding still more structure, e.g., temporal difference learning [16]. In this paper, we set out to study why black-box parameter space exploration methods work better and in what situations we can expect them to perform worse than traditional action space exploration methods.

II The Structure of Policies

Action space exploration methods, like REINFORCE [18], SEARN [6], PSDP [3], AGGREVATE [14, 15], and LOLS [5], leverage more structure than parameter space exploration methods. More specifically, they understand the relationship (e.g., the Jacobian) between a policy’s parameters and its outputs. We could ask: does this matter? In the regime of large parameter space and small output space problems, as often explored by our colleagues (for example, Atari games [13]), it might. (Typical implementations also leverage the causal structure of rewards, although one might expect that to be relatively minor.)

In particular, the intuition behind the use of action space exploration techniques is that they should perform well when the action space is quite small compared to the parametric complexity required to solve the Reinforcement Learning problem.

III Experiments

We test this intuition across three experiments: MNIST, Linear Regression and LQR. The code for all these experiments is published at


III-A MNIST

To investigate this we, like [1], consider some toy RL problems, beginning with a single time step MDP. In particular, we start with a classic problem: MNIST digit recognition. To put it in a bandit/RL framework, we consider a reward for getting the digit correct and a reward for getting it wrong. We use a LeNet-style architecture [11] (footnote 2: two convolution layers, followed by two fully connected layers and an output softmax layer resulting in a 1-hot encoding). The total number of trainable parameters in this architecture is . The experimental setup is described in greater detail in Appendix -C2.

We then compare the learning curves for vanilla REINFORCE with supervised learning and with ARS V2-t, an augmented random search in parameter space procedure introduced in Mania et al. [12]. Figure 1 shows the results, where solid lines represent mean test accuracy over random seeds and the shaded region corresponds to standard deviation. In this setting, where the parameter space dimensionality significantly exceeds the action space dimensionality, action space exploration methods such as REINFORCE outperform parameter space exploration methods like ARS.

Fig. 1: Test accuracy of different approaches against number of samples
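To make the bandit framing of digit recognition concrete, the following is a minimal sketch of a REINFORCE gradient estimate for a single-step softmax policy. It is not the experimental code: the network is abstracted away as a vector of logits, the function names are hypothetical, and the reward values (+1 for a correct guess, -1 otherwise) are an assumption made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_grad(logits, action, reward):
    """REINFORCE estimate of d E[reward] / d logits from one sampled action.

    Uses grad log pi(action) = onehot(action) - softmax(logits).
    """
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


def bandit_step(logits, true_label, rng):
    """Sample a digit guess, observe only the reward, return the estimate."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    # Assumed reward values; the learner never sees true_label directly.
    reward = 1.0 if action == true_label else -1.0
    return action, reward, reinforce_logit_grad(logits, action, reward)
```

The estimate lives in the 10-dimensional output space; in a full implementation it would be pushed back through the network's Jacobian by backpropagation, which is the structure that parameter space methods do not use.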

III-B Linear Regression

Following the heuristic of the Linearization Principle introduced in [1], we attempt to get a handle on the trade-off between sample complexity in parameter space and complexity in action space by considering another simple, one-step problem: linear regression with a single output variable, where the number of parameters is set by the input dimensionality. We consider learning a random linear function.

Before empirically studying REINFORCE and ARS, we first perform a regret analysis (Appendix -A) on online linear regression for three learning strategies: (1) online gradient descent, (2) exploration in action space, and (3) exploration in parameter space, which correspond to the full information setting, the linear contextual bandit setting, and the bandit setting respectively. The online gradient descent approach is simply OGD from Zinkevich [19] applied to the full information online linear regression setting. For exploration in parameter space, we simply use the BGD algorithm from Flaxman et al. [7], which completely ignores the properties of the linear regression setting and works in a bandit setting. The algorithm for random exploration in action space, possibly the simplest linear contextual bandit algorithm (shown in Alg. 3), operates in the middle: it has access to feature vectors and performs random search in prediction space to form estimates of gradients. The analysis of all three algorithms is performed in the online setting: there are no statistical assumptions on the sequence of linear loss functions, and multi-point queries per loss function [2] are not allowed (footnote 3: note that the ARS algorithms presented in Mania et al. [12] actually take advantage of the reset property of the episodic RL setting and perform two-point feedback queries to reduce the variance of gradient estimates).

The detailed algorithms and analysis are provided in Appendix -C3. The main difference between exploration in action space and exploration in parameter space is that exploration in action space can take advantage of the fact that the predictor it is learning is linear and that it has access to the linear feature vector (i.e., the Jacobian of the predictor). The key advantage of exploration in action space over exploration in parameter space is that it is input-dimension free (footnote 4: it will depend on the output dimension if one considers multivariate linear regression). More specifically, one can show that, to reach a given level of average regret, the number of samples required by the algorithm performing exploration in parameter space (Alg. 2) grows with the input dimensionality (we ignore problem specific parameters such as the maximum norm of the feature vector, the maximum norm of the linear predictor, and the maximum value of the prediction, which we assume are constants), while the number of samples required by the algorithm performing exploration in action space (Alg. 3) is not explicitly dependent on the input dimensionality.
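The contrast between the two bandit-style estimators can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of Alg. 2 or Alg. 3 from the appendix: the loss is taken to be the squared error, the parameter space estimator follows the standard one-point form of Flaxman et al. [7], and all function names are hypothetical.

```python
import math
import random


def squared_loss(w, x, y):
    """Squared loss of the linear predictor w on the example (x, y)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2


def action_space_grad(w, x, y, delta, u):
    """One-point estimate from perturbing the scalar prediction, u in {-1, +1}."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    loss = (pred + delta * u - y) ** 2      # only the loss value is observed
    g_pred = loss * u / delta               # estimate of d loss / d prediction
    return [g_pred * xi for xi in x]        # chain rule through the known Jacobian x


def parameter_space_grad(w, x, y, delta, u):
    """One-point estimate from perturbing the weights by delta * u, u a unit vector."""
    d = len(w)
    w_pert = [wi + delta * ui for wi, ui in zip(w, u)]
    loss = squared_loss(w_pert, x, y)
    return [(d / delta) * loss * ui for ui in u]


def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]
```

In this sketch the parameter space estimate carries an explicit factor of the input dimension, whereas the action space estimate perturbs a single scalar and recovers the direction from the feature vector. For the squared loss, averaging the action space estimate over the two signs of `u` gives the exact gradient.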

We empirically compare the test squared loss of REINFORCE, natural REINFORCE (which simply amounts to whitening of input features) [8] and the ARS V2-t method discussed in Mania et al. [12] with classic follow-the-regularized-leader (Supervised Learning). The results are shown in Figures 2(a), 2(b) and 2(c), where solid lines represent mean test squared loss over 10 random seeds and the shaded region corresponds to standard deviation. The learning curves match our expectations, and moreover show that this bandit style REINFORCE lies between the curves of supervised learning and parameter space exploration: that is, action space exploration takes advantage of the Jacobian of the policy itself and can learn much more quickly than parameter space exploration.

Fig. 2: Linear Regression Experiments with varying input dimensionality

III-C LQR

Finally, we consider what happens as we extend the time horizon. In Appendix -B, we consider a finite horizon optimal control task with deterministic dynamics, a fixed initial state and a linear stationary policy. We show that we can estimate the policy gradient via a random search in parameter space as ARS did (Eq. 21 in Appendix -B), or we can do a random search in action space across all time steps independently (Eq. 20 in Appendix -B). Comparing the norms of the two gradient estimates, the major difference is that the norm of the gradient estimate from random exploration in parameter space (Eq. 21) scales linearly with the dimensionality of the state space (i.e., the dimensionality of the parameter space, as we assume a linear policy), while the norm of the gradient estimate from random search in action space (Eq. 20) scales linearly with the product of horizon and action space dimensionality. Hence, when the dimensionality of the state space is smaller than the product of horizon and action space dimensionality, one may prefer random search in parameter space; otherwise random search in action space is preferable. Note that for most of the continuous control tasks in OpenAI gym [4], the horizon is significantly larger than the state space dimensionality (footnote 5: take Walker2d-v2 as an example, where the horizon is usually equal to 1000. The action space dimension is 6, and the dimension of the state space is 17. Hence random exploration in action space is actually randomly searching in a 6000-dimensional space, while random search in parameter space is searching in a 17-dimensional space). This explains why ARS [12] outperforms most of the action space exploration methods in these tasks.
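The comparison above reduces to simple arithmetic, which the following sketch spells out. The helper name `preferred_exploration` is hypothetical, and the rule it encodes is only the heuristic stated in the text (state dimension standing in for parameter dimension under a linear policy), not a general guarantee. The Walker2d-v2 action dimension of 6 used in the usage note is the value implied by 6000 / 1000.

```python
def search_dimensions(state_dim, action_dim, horizon):
    """Return (parameter space search dim, action space search dim).

    Follows the text's simplification: the parameter space dimension is
    taken to be the state dimension, and action space exploration perturbs
    every action coordinate at every time step.
    """
    return state_dim, action_dim * horizon


def preferred_exploration(state_dim, action_dim, horizon):
    """Apply the heuristic from the text to pick an exploration space."""
    param_dim, action_search_dim = search_dimensions(state_dim, action_dim, horizon)
    if param_dim < action_search_dim:
        return "parameter space"
    return "action space"
```

For example, `search_dimensions(17, 6, 1000)` gives `(17, 6000)`, so the heuristic favors parameter space search for Walker2d-v2, in line with the footnote.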

The simplest setting in which to evaluate this empirically is one where we would all likely agree that a model-based method would be the preferred approach: a finite-horizon Linear Quadratic Regulator problem with a 1-d control space and a state space of fixed dimension. We then compare random search (ARS V1-t from Mania et al. [12]) with REINFORCE (with ADAM [9] as the underlying optimizer) in terms of the number of samples each needs to train a stationary policy that reaches within 5% error of the non-stationary optimal policy’s performance, across a range of horizon lengths. Fig. 3 shows the comparison, where the statistics are averaged over 10 random seeds (mean ± standard error).
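As a rough illustration of the parameter space side of this comparison, the sketch below rolls out a stationary linear policy on a small deterministic linear system and forms a two-point random-search gradient estimate along one direction. Everything specific here is an assumption for illustration: the dynamics form, the stage cost (identity state cost, unit control cost), and the function names. It omits the direction selection and normalization that ARS V1-t adds.

```python
def rollout_cost(k, A, B, x0, horizon):
    """Total quadratic cost of the stationary linear policy u_t = -k . x_t.

    Assumed setup: x_{t+1} = A x_t + B u_t with scalar control u_t,
    stage cost x_t . x_t + u_t^2, deterministic dynamics, fixed x0.
    """
    x = list(x0)
    total = 0.0
    for _ in range(horizon):
        u = -sum(ki * xi for ki, xi in zip(k, x))
        total += sum(xi * xi for xi in x) + u * u
        x = [sum(a * xj for a, xj in zip(row, x)) + b * u
             for row, b in zip(A, B)]
    return total


def two_point_random_search_grad(k, direction, delta, A, B, x0, horizon):
    """Two-point estimate of the cost gradient along one search direction."""
    k_plus = [ki + delta * di for ki, di in zip(k, direction)]
    k_minus = [ki - delta * di for ki, di in zip(k, direction)]
    diff = (rollout_cost(k_plus, A, B, x0, horizon)
            - rollout_cost(k_minus, A, B, x0, horizon))
    return [diff / (2.0 * delta) * di for di in direction]
```

Each estimate costs two full rollouts regardless of the horizon, and the search direction lives in a space whose size is the state dimension, which is the property the preceding analysis attributes to parameter space exploration.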

Fig. 3: LQG with state dimensionality with varying horizon

From Fig. 3 we see that as the horizon increases, both algorithms require a larger number of samples (footnote 6: our simple analysis of finite horizon optimal control with deterministic dynamics shows that the norms of both gradient estimators depend linearly on the objective value measured at the current policy. As the horizon increases, the total cost increases). While ARS is stable across different random seeds, we found that REINFORCE becomes more and more sensitive to random seeds as the horizon grows, and its performance becomes less stable. However, notice that when the horizon is small, REINFORCE has lower variance as well and can consistently outperform ARS. Though we would expect REINFORCE to require more samples than ARS once the horizon exceeds the dimension of the state space, in this experiment we only notice this phenomenon at a larger horizon, though the variance of REINFORCE is already terrible at that point.

IV Conclusion

Mania et al. [12] have shown that simple random search in parameter space is a competitive alternative to traditional action space exploration methods. In our work, we show that this is only true for domains where the parameter space dimensionality is significantly smaller than the product of action space dimensionality and horizon length. For domains where this does not hold true, action space exploration methods such as REINFORCE [18] are more sample efficient as they do not have an explicit dependence on parameter space dimensionality.

V Acknowledgements

We thank the entire LairLab for stimulating discussions, and Ben Recht for his interesting blog posts.


  • [1] An outsider’s tour of reinforcement learning. http://www.argmin.net/2018/05/11/outsider-rl/. Accessed: 2019-05-26.
  • [2] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
  • [3] J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, pages 831–838, 2004.
  • [4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • [5] Kai-wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, and John Langford. Learning to search better than your teacher. In ICML, 2015.
  • [6] Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 2009.
  • [7] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
  • [8] Sham Kakade. A natural policy gradient. NIPS, 2002.
  • [9] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [10] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • [11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [12] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
  • [13] Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • [14] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
  • [15] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In ICML, 2017.
  • [16] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  • [17] Christophe Thiery and Bruno Scherrer. Building controllers for tetris. ICGA Journal, 32:3–11, 2009.
  • [18] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 1992.
  • [19] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML, 2003.