!MDP Playground: Meta-Features in Reinforcement Learning

09/17/2019 ∙ by Raghu Rajan, et al. ∙ University of Freiburg

Reinforcement Learning (RL) algorithms usually assume their environment to be a Markov Decision Process (MDP). Additionally, they do not try to identify specific features of environments which could help them perform better. Here, we present a few key meta-features of environments: delayed rewards, specific reward sequences, sparsity of rewards, and stochasticity of environments, which may violate the MDP assumptions and adapting to which should help RL agents perform better. While it is very time consuming to run RL algorithms on standard benchmarks, we define a parameterised collection of fast-to-run toy benchmarks in OpenAI Gym by varying these meta-features. Despite their toy nature and low compute requirements, we show that these benchmarks present substantial difficulties to current RL algorithms. Furthermore, since we can generate environments with a desired value for each of the meta-features, we have fine-grained control over the environments' difficulty and also have the ground truth available for evaluating algorithms. We believe that devising algorithms that can detect such meta-features of environments and adapt to them will be key to creating robust RL algorithms that work in a variety of different real-world problems.




1 Introduction

Like humans, a true Artificial General Intelligence (AGI) would generalise to all sorts of environments by adapting to the task at hand. Despite the success of RL algorithms at many tasks (Abbeel et al., 2010; Mnih et al., 2013; Silver et al., 2016; Chua et al., 2018), we are still far away from AGI. RL algorithms can solve many tasks such as the game of Go, video game playing in Atari, and locomotion in Mujoco, but when faced with a completely new environment, they can’t really adapt like humans do. We need to understand the environments and their interactions with RL algorithms better if we are to progress to more intelligent algorithms.

RL algorithms usually assume the environment to be an MDP (or POMDP) and update their agents based on the reward received at every timestep. The reward is assumed to be for the current state and action [1], an assumption that may not hold true in real life [2]. In this work, we attempt to identify some key meta-features of environments which may help agents deal with violations of Markovianness and we provide a platform with different instantiations of these meta-features in order to understand the workings of different RL algorithms better. The platform is implemented as a Python package, !MDP Playground, that allows us complete control over these meta-features to be able to benchmark RL algorithms.

[1] Even algorithms which use techniques like eligibility traces are based on frequency and recency of occurrence of states and actions, which can help us identify “correlationary” causes but not necessarily the true ones.

[2] This depends on the state formulation in general. But assuming an infinitely differentiable world and dynamics of arbitrary order, we would need an infinite “stack” of the current “state” and its higher order derivatives to be able to predict the next state based on the current one and thus have Markovianness.

Usual environments that RL algorithms are tested on also tend to take a long time to run. As a remedy, our platform is meant to be a low cost proxy for identifying how RL algorithms would work on real world problems.

2 Key Meta-Features

We first define an MDP and then explain the key meta-features we identified and how varying them in an environment violates MDP assumptions.

2.1 MDPs and Assumptions

We define an MDP as a 6-tuple $(S, A, P, R, \rho_0, T)$, where $S$ is the set of states, $A$ is the set of actions, $P$ is the transition dynamics, $R$ is the reward dynamics, $\rho_0$ is the initial state distribution, and $T$ is the set of terminal states.

In RL algorithms, such as TD learning (Sutton and Barto, 2018), we make the usual MDP assumptions like receiving immediate reward depending on only the previous state and action. However, in general, this is not true for even many simple environments, such as a slot machine, where depending on the discretisation of the time step, the reward may not be immediate.

Algorithms like DQN (Mnih et al., 2013) have been applied to many varied environments and show highly variable performance across them. In some simple environments, DQN’s performance exceeds human performance by large amounts, but in other environments, such as Montezuma’s Revenge, performance is very poor. For such environments, we need a very specific sequence of actions to get a reward. In general, algorithms which model the environment as an MDP will tend to work better in environments where the MDP assumptions hold to a greater extent.

Another key feature of environments is sparsity of rewards. Sparsity usually exists in environments where specific sequences of states need to be followed to reach the target. Agents generally perform a very specific sequence of tasks and are rewarded at the end of it. This leads to sparse rewards.

Another important meta-feature of environments is stochasticity. The environment itself, i.e., the transition dynamics $P$ and the reward dynamics $R$, may be stochastic or may seem stochastic to the agent due to partial observability or sensor noise.

The key meta-features of the environment that we identify from the above discussion are

  • Delayed Rewards

  • Specific Sequences

  • Sparsity of Rewards

  • Stochasticity

We now describe the benchmark environment where we control these meta-features to test algorithms.

3 !MDP Playground

We believe that the meta-features discussed above are key features which can be controlled in small environments and if an algorithm can master all of these in simple environments, it has gone some way towards being able to perform at scale as well. In order to be able to benchmark how algorithms would perform in such environments, we implement a package, !MDP Playground [3], which generates randomly configured OpenAI Gym environments to allow us to benchmark algorithms across a grid of configurations. It is also manually configurable so we can control in a fine-grained manner how exactly we intend the environment to be. We now briefly describe how the environment and these meta-features are implemented.

[3] The ! signifies that the environments in the package are not MDPs. We can of course model them as MDPs with an augmented state and that is indeed how we implement them in practice.

The environment is implemented as an MDP with an augmented state. We generate random MDPs with various instantiations of the transition dynamics $P$ and the reward dynamics $R$. For the current implementation, we restrict ourselves to discrete state and action spaces. $P$ and $R$ are deterministic unless we deliberately inject stochasticity through the respective meta-features we have created for them.

Delayed Rewards are pretty common in real environments. We delay the reward for a state-action pair by a number of timesteps, which we call the delay length, $d$. In general, $d$ will not be a constant in real world environments and would be a function of the state and action (sequence), but for our simple instantiation here, we use a fixed $d$.
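As a minimal sketch of the fixed-delay mechanism described above (not the package's actual implementation), a constant delay can be emulated with a first-in-first-out buffer: the reward earned at one timestep is handed out a fixed number of timesteps later. The class name `DelayedReward` and its method `step` are hypothetical names chosen for illustration.

```python
from collections import deque


class DelayedReward:
    """Hands out each reward `delay` timesteps after it was earned."""

    def __init__(self, delay: int):
        if delay < 0:
            raise ValueError("delay must be non-negative")
        # Pre-fill with zeros: nothing was earned before the episode started.
        self.buffer = deque([0.0] * delay)

    def step(self, earned_reward: float) -> float:
        """Queue the reward earned now and return the one that is due now."""
        self.buffer.append(earned_reward)
        return self.buffer.popleft()


delayer = DelayedReward(delay=2)
earned = [1.0, 0.0, 0.0, 5.0, 0.0, 0.0]
received = [delayer.step(r) for r in earned]
print(received)  # each reward shows up two steps after it was earned
```

With a delay of zero the buffer is empty and every reward is returned in the same step, which recovers the usual immediate-reward setting.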

Specific Sequences of states which are rewarded, instead of just a single state, are also pretty common in real environments. For example, to execute a tennis serve, we need a very specific sequence of actions which would result in a point if we served an ace. Let’s denote the sequence length characteristic of an environment by $n$. Like the delay length $d$, $n$ would not be a constant in real world environments. For our experiments, we consider that specific sequences of states would be rewardable, though, in general, specific sequences of states and actions should be considered for rewards [4].

[4] The implementation would be easily extendable to additionally consider actions for future experiments.

Sparsity of Rewards tells us how sparse the reward is in the environment. Here we define reward density of sequences in terms of the fraction of possible sequences of length $n$ that are actually rewarded by the environment, for the specific case when the sequence length is constant. If there are $N$ possible specific sequences for an environment with sequence length $n$ and $N_r$ of them are rewarded, we define the reward density to be $N_r / N$ and sparsity as $1 - N_r / N$ [5].

[5] For the general case where $n$ is variable, it may be worthwhile to define reward density as the average reward a random agent receives in an environment, but for the fixed length case, it makes sense to implement and define it like we have done here.
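The definition of reward density can be illustrated with a small sketch. It assumes that the candidate sequences are ordered sequences of distinct states (suggested by the "no repeats" comment in Algorithm 1, but not confirmed as the exact count used by the package) and takes sparsity to be one minus the density. The function name `sample_rewardable_sequences` and its parameters are hypothetical.

```python
import itertools
import random


def sample_rewardable_sequences(num_states, seq_len, reward_density, seed=0):
    """Pick a `reward_density` fraction of all candidate sequences to reward."""
    # Assumption: candidates are ordered sequences of distinct states.
    candidates = list(itertools.permutations(range(num_states), seq_len))
    num_rewarded = int(reward_density * len(candidates))
    rng = random.Random(seed)
    return set(rng.sample(candidates, num_rewarded)), len(candidates)


rewardable, num_candidates = sample_rewardable_sequences(
    num_states=8, seq_len=3, reward_density=0.25
)
density = len(rewardable) / num_candidates
sparsity = 1.0 - density
print(num_candidates, len(rewardable), density, sparsity)
```

Because the number of candidates grows quickly with the sequence length, enumerating them all is only practical for toy-sized state spaces like the ones considered here.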

With regard to sparsity, recall the tennis serve again. The point received by serving an ace would be a sparse reward. We as humans know to reward ourselves for executing only a part of the sequence correctly. Rewards in continuous control tasks to reach a target point (e.g., in Mujoco (Todorov et al., 2012)), are usually dense (such as the negative squared distance from the target). This lets the algorithm obtain a dense signal in space to guide learning, and it is well known that it would be much harder for the algorithm to learn if it only received a single reward at the target point. We make our environments configurable to allow this sort of reward shaping to make the reward denser and enable algorithms to learn faster. The environment gives a fractional reward if a fraction of a specific sequence is achieved when configured to do so.

Stochasticity is implemented by making the transition dynamics $P$ and the reward dynamics $R$ noisy. For discrete environments, for $P$ we do this by taking a transition noise between 0 and 1, say $p$, and letting the environment transition to a state that is not the true next state given by $P$ a fraction $p$ of the time. For $R$, we take a reward noise $\sigma$ and add to the reward a normal random variable distributed according to $\mathcal{N}(0, \sigma^2)$.
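The two kinds of noise can be sketched as follows. This is an illustration under assumptions rather than the package's code: the wrong next state is drawn uniformly from the other states (the text does not say how it is chosen), the reward noise is treated as a standard deviation, and the function names are hypothetical.

```python
import random


def noisy_transition(true_next_state, num_states, transition_noise, rng):
    """With probability `transition_noise`, go to a state other than the true one."""
    if rng.random() < transition_noise:
        # Assumption: the wrong state is chosen uniformly among the others.
        others = [s for s in range(num_states) if s != true_next_state]
        return rng.choice(others)
    return true_next_state


def noisy_reward(true_reward, reward_noise_std, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return true_reward + rng.gauss(0.0, reward_noise_std)


rng = random.Random(0)
samples = [noisy_transition(3, 8, 0.25, rng) for _ in range(10_000)]
fraction_wrong = sum(s != 3 for s in samples) / len(samples)
print(f"fraction of noisy transitions: {fraction_wrong:.3f}")
print(noisy_reward(1.0, 0.0, rng))  # zero noise leaves the reward unchanged
```

Setting both noise values to zero recovers the deterministic environment.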


Some additional meta-features that we allow the user to fully control are the terminal state density and the reward unit, which is the reward given whenever the environment hands out one (this could help with reward scaling). In the future, we also intend to allow the user to fully specify an MDP they may have in mind and use it for their experiments. Algorithm 1 gives very simplified high-level code for how !MDP Playground generates random MDPs.

A parallel and independent work along similar lines that was released within the last month is the Behaviour Suite for RL (bsuite, Osband et al. (2019)). That suite collects simple RL benchmarks from the literature that are representative for various types of problems which occur in RL and tries to characterise RL algorithms. However, they do not employ orthogonal meta-features like we do and do not generate new environments; as a result, they do not have the same type of fine-grained control over their environments’ difficulty, especially not along controllable dimensions. Unlike their framework, where currently there’s no toy environment for Hierarchical RL (HRL) algorithms, the specific sequences that we describe would also fit very well with HRL. An important distinction between the two platforms could be summed up by saying that they try to characterise algorithms while we try to characterise environments with the aim that new adaptable algorithms can be developed that can tackle environments of desired difficulty.

(a) DQN
(b) Rainbow
(c) A3C
(d) A3C + LSTM
Figure 1: Mean episodic reward at the end of training for the different algorithms when varying delay and sequence lengths. Please note the different colorbar scales. We note that the intent of this figure is solely to show how the performance of an algorithm across meta-features gets worse for greater violations of underlying assumptions. It is not to compare DQN with A3C since the training procedures were different and we stopped training after different numbers of environment timesteps. What we tried to keep constant was the number of optimizer steps trained.
Input: number of states |S|, number of actions |A|, reward delay d, length of reward sequences n, density of reward sequences
init_terminal_states() ▷ According to configured terminal state density
function init_reward_function()
     Randomly sample the configured fraction of possible state sequences of length n to be
     rewardable and store in rewardable_sequences ▷ State sequences with no repeats in them
end function
function init_transition_function()
     for each state s in S do
         Set possible successor states: S’ = S
         for each action a in A do
              Set successor state for state s and action a to a state s’ sampled uniformly from S’
              Remove s’ from S’
         end for
     end for
end function
function reward_function()
     r = 0
     if not (denser rewards configured) then
         if sequence of n states ending d steps in the past is in rewardable_sequences then
              r = reward_unit
         end if
     else
         for i in range(n) do
              if sequence of i states ending d steps in the past is in sub-sequences of length i in rewardable_sequences then
                  r += reward_unit * i/n
              end if
         end for
     end if
     return r
end function
Algorithm 1 Generating random MDPs with !MDP Playground
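The core of Algorithm 1 can be sketched in runnable Python as follows. This is an illustrative reading of the pseudocode, not the package's source: `rng.sample` stands in for the "sample, then remove" loop (both draw successors without replacement), only the non-dense reward branch is shown, and the function signatures are assumptions.

```python
import random


def init_transition_function(num_states, num_actions, rng):
    """For each state, map every action to a distinct successor state."""
    if num_actions > num_states:
        raise ValueError("sampling without replacement needs num_actions <= num_states")
    transitions = {}
    for state in range(num_states):
        # rng.sample draws without replacement, so no successor repeats.
        successors = rng.sample(range(num_states), num_actions)
        for action, next_state in enumerate(successors):
            transitions[(state, action)] = next_state
    return transitions


def reward_function(state_history, rewardable_sequences, seq_len, delay, reward_unit=1.0):
    """Reward a rewardable sequence of `seq_len` states that ended `delay` steps ago."""
    end = len(state_history) - delay
    start = end - seq_len
    if start < 0:
        return 0.0  # not enough history yet
    window = tuple(state_history[start:end])
    return reward_unit if window in rewardable_sequences else 0.0


rng = random.Random(0)
P = init_transition_function(num_states=8, num_actions=8, rng=rng)
rewardable = {(0, 1, 2)}
history = [5, 0, 1, 2, 7]  # the sequence (0, 1, 2) ended one step in the past
print(reward_function(history, rewardable, seq_len=3, delay=1))
print(reward_function(history, rewardable, seq_len=3, delay=0))
```

When the number of actions equals the number of states, as in the example call, every state is reachable from every other state in a single step.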

4 Experiments and Results

We ran DQN (Mnih et al., 2013), Rainbow DQN (Hessel et al., 2017), A3C (Mnih et al., 2016), and A3C with LSTM (all from the Ray RLlib (Liang et al., 2017) implementations) on grids of values for the meta-features discussed above. We fixed the number of states $|S|$ and the number of actions $|A|$ to be 8, the initial state distribution $\rho_0$ to be uniformly random over non-terminal states, and the density of terminal states, equal to $|T|/|S|$, to be 0.25 for the experiments. The reward unit is fixed to be 1.0 whenever a reward is given by the environment. The transition function $P$ is randomly sampled, but since we use $|S| = |A|$ in our experiments, from each state there is exactly one action that leads to each of the states in $S$.

Results for varying reward delay and length of specific reward sequences

We plot the average over 10 runs [6] of the final mean episodic reward [7] at the end of training for all the algorithms in Figure 1 for a grid of values over the delay and specific sequences meta-features. As can be seen from the figure, all algorithms perform very well in the vanilla environment where the MDP assumption is perfect because there is no delay and the sequence length is 1, but performance degrades in environments where these assumptions are not met. The performance degradations clearly grow as we move further and further away from the assumptions. It is interesting (and expected) that Rainbow manages to be more robust than DQN. However, it is unexpected that A3C with an LSTM does not improve over vanilla A3C (even though we set the LSTM max sequence length to the delay + sequence length which would let it remember the stack of states that would let the environment be correctly modelled as an MDP); we plan to study this effect in more detail in the future.

[6] Over 10 random seeds for the algorithm but a fixed seed for the environment.

[7] Over the previous 100 episodes.

Figure 2: DQN training reward standard deviation across 10 runs

We plot the standard deviation in the training of one of the algorithms (DQN) in Figure 2. The plot shows a lot of variance in many of the environments where assumptions are violated. Sometimes, DQN nevertheless managed to perform decently, which emphasizes that algorithms can sometimes perform well even when their assumptions are violated, and offers an explanation for the fact that tuning seeds can lead to good results.
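The summary statistics behind these plots (a final mean episodic reward per run, then a mean and a standard deviation across runs) can be sketched as below. The function names are hypothetical, the data is synthetic, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def final_mean_reward(episode_rewards, window=100):
    """Mean episodic reward over the last `window` episodes of one run."""
    return statistics.fmean(episode_rewards[-window:])


def summarise_runs(runs, window=100):
    """Mean and sample standard deviation of the final reward across runs."""
    finals = [final_mean_reward(run, window) for run in runs]
    return statistics.fmean(finals), statistics.stdev(finals)


# Two synthetic runs: 50 early episodes with no reward, then 100 steady ones.
runs = [[0.0] * 50 + [1.0] * 100, [0.0] * 50 + [3.0] * 100]
mean_reward, std_reward = summarise_runs(runs)
print(mean_reward, std_reward)
```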

We relegate plots of the evaluation reward at the end of the training [8] to the Appendix (Figure 1 in Appendix) since they are qualitatively similar to the training episodic rewards in Figure 1.

[8] Rollout with the learnt policy, averaged over 10 episodes.

Results for varying transition and reward noise

We see a similar trend as for delays and sequences when we vary the transition and reward noises in Figure 3. There is a gradient of performance degradation as more and more noise is injected. It’s interesting here to see that DQN seems to be more sensitive to noise in the transition dynamics than in the reward dynamics: transition noise values as low as 0.02 led to a clear handicap in learning, while for the reward dynamics (with the reward unit being 1.0) noise with a standard deviation as large as 1.0 still allowed decent learning. Further interesting results can be seen in the evaluation rollouts (Figure 4) [9]. We see that the algorithms are more sensitive to noise in the transition dynamics during training as compared to during evaluation. While it is obvious that the mean episodic reward would be perturbed when noise is injected into the reward function, it is non-trivial that injecting noise into the transition function still leads to good learning as displayed in the evaluation rollout plots. An additional seeming anomaly is that the evaluation rollouts for A3C seem to suggest that it performs better in the presence of transition noise (when reward noise is 0 or 1). This seems counterintuitive and warrants more investigation.

[9] Here, for evaluation, and not for training because training is in the noisy environment, we evaluated in the corresponding environment without noise to get an idea of how true learning is proceeding.

In addition to the plots for the rewards at the end of the training, we plot the complete learning curves for evaluation rollouts for DQN in the presence of injected transition and reward noises in Figure 5. So, each square in the heatmap in Figure 3(a) corresponds to the mean over the rightmost points in the corresponding noise training curve plot in Figure 5. We see here visually how the training seems more robust to transition noise than reward noise.

(a) DQN
(b) Rainbow
(c) A3C
(d) A3C + LSTM
Figure 3: Mean episodic reward at the end of training for the different algorithms when varying transition noise and reward noise. Please note the different colorbar scales.
(a) DQN
(b) Rainbow
(c) A3C
(d) A3C + LSTM
Figure 4: Mean episodic reward for evaluation rollouts (limited to 100 timesteps) at the end of training for the different algorithms when varying transition noise and reward noise. Please note the different colorbar scales.
Figure 5: Evaluation Learning Curves for DQN when varying transition noise and reward noise. Please note the different colorbar scales.
Results for sparsity

The plots for controlling the sparsity meta-feature in the vanilla environment show that DQN variants are able to learn the important rewarding states even when these are sparse, while A3C once again behaved a bit unexpectedly (Figure 6).

The plots for the configuration setting where we make the environment give denser rewards for specific sequences, by rewarding even when only part of the sequence has been achieved, show that learning varies less across different runs, although the algorithms still don’t perform as well as they could, probably due to the sequence lengths still violating the MDP assumptions. These are relegated to the appendix (Figure 17 in Appendix).

(a) DQN
(b) Rainbow
(c) A3C
(d) A3C + LSTM
Figure 6: Mean episodic reward at the end of training for the different algorithms when varying reward sparsity. Please note the different colorbar scales.
Hyperparameter Tuning

Hyperparameters were tuned for the vanilla environment; we did so manually in order to obtain good intuition about them before applying automated tools. We tuned the hyperparameters in sets, loosely in order of their significance, and did 3 runs over each setting to get a more robust estimate of the performance. We did not use more advanced AutoML methods because this was mainly to obtain a better manual understanding of how these hyperparameters work in RL and to motivate moving towards algorithms which recognise meta-features and adapt to them. Another prior was that, for such toy environments, we wouldn’t need a lot of hyperparameter tuning. But hyperparameter tuning turned out to be very significant even here, which suggests that the toy environments might be decent test beds for researching hyperparameters in RL too. We describe a small part of our hyperparameter tuning for DQN next.

We thought that pretty small neural networks would do for such toy environments and we initially grid searched over small NN sizes (Figure 7(a)). But the variance in performance was pretty high there (Figure 7(b)). When we tried to tune the DQN hyperparameters learning starts and target network update frequency, however, it became clear that the target network update frequency was pretty significant (Figures 7(c) and 7(d)), and setting it to a better value of 800 (instead of the old 80) led to better, expected and less variable performance for the NN sizes when we searched over those again (Figures 7(e) and 7(f)).

(a) Reward
(b) Std dev.
(c) Reward
(d) Std. Dev.
(e) Reward
(f) Std dev.
Figure 7: Mean episodic reward at the end of training for different hyperparameter sets for DQN. Please note the different colorbar scales.

5 Conclusion and Future Work

We introduced a platform to test out baseline RL algorithms in environments with varying key meta-features that we identified, and evaluated some baseline RL algorithms on these. The platform allows us to test new algorithms which may adapt to these and more key meta-features in environments. It also allows quick and coarse insights into current algorithms and their hyperparameters. We will release the code as Open Source to help with benchmarking of algorithms [10].

[10] It is currently available for anonymous review, with the Appendix with full-quality images, at https://github.com/anonips/-MDP-Playground.git.

We will further implement plug and play model-based metrics to evaluate model-based algorithms, such as the KL-divergence (probably a sampled version because analytical calculation would be intractable in many cases) between the true dynamics model and the learnt one. We intend to have toy benchmarks for exploration by testing whether algorithms can perform directed exploration along simple manifolds of rewarding sequences [11].

[11] Since this is a platform under active development, there are a lot of further improvements planned and we explain some things mentioned here in a bit more detail in the Appendix (Section D).

Even though we have a playground to generate environments where the meta-features such as sequence length are not constant, being able to solve environments with variable delay and sequence lengths and identifying them (i.e., segmentation of events in the time domain) is another area we are currently working on with attention-based agents.

Another significant meta-feature is reachability in the transition graph. It would be interesting to have a few important environments which capture some real world characteristics at a very high level.

It would also be interesting to integrate our platform with the bsuite. Overall, we intend to promote more adaptivity in RL algorithms and we hope this platform is a first small step towards it.


Acknowledgements

The authors gratefully acknowledge support by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant no. 716721, by BMBF grant DeToL and by Bosch Center for Artificial Intelligence. Raghu would additionally like to thank his group, especially André Biedenkapp, for helpful discussions and the RLSS 2019, Lille organizers and participants for a stimulating summer school.


References

  • P. Abbeel, A. Coates, and A. Y. Ng (2010) Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29 (13), pp. 1608–1639.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765.
  • M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2017) Rainbow: combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298.
  • E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica (2017) RLlib: abstractions for distributed reinforcement learning. arXiv preprint arXiv:1712.09381.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepesvári, S. Singh, B. Van Roy, R. Sutton, D. Silver, and H. van Hasselt (2019) Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033.