Challenges of Context and Time in Reinforcement Learning: Introducing Space Fortress as a Benchmark

09/06/2018 · Akshat Agarwal et al., Carnegie Mellon University

Research in deep reinforcement learning (RL) has coalesced around improving performance on benchmarks like the Arcade Learning Environment. However, these benchmarks conspicuously miss important characteristics like abrupt context-dependent shifts in strategy and temporal sensitivity that are often present in real-world domains. As a result, RL research has not focused on these challenges, resulting in algorithms which do not understand critical changes in context and have little notion of real-world time. To tackle this issue, this paper introduces the game of Space Fortress as an RL benchmark which incorporates these characteristics. We show that existing state-of-the-art RL algorithms are unable to learn to play the Space Fortress game. We then confirm that this poor performance is due to the RL algorithms' context insensitivity and reward sparsity. We also identify independent axes along which to vary context and temporal sensitivity, allowing Space Fortress to be used as a testbed for understanding both characteristics in combination and in isolation. We release Space Fortress as an open-source Gym environment.


1 Introduction

Recent advances in computer vision [Krizhevsky, Sutskever, and Hinton2012] and natural language processing [Sutskever, Vinyals, and Le2014] can be attributed to the advent of deep learning and the presence of robust benchmarks to quantitatively measure progress, such as the ImageNet challenge [Russakovsky et al.2015]. In the last few years, neural network-based function approximation has also proven successful in reinforcement learning, with AI agents now able to perform at superhuman levels in games like Go [Silver et al.2016] and the Atari suite [Mnih et al.2015]. Once again, research in deep RL has been steered by the establishment of benchmarks like the Arcade Learning Environment [Bellemare et al.2013], along with the OpenAI Gym interface [Brockman et al.2016], which has been widely adopted by the research community.

These benchmarks are conspicuously missing two challenging characteristics: (a) abrupt context-dependent switching of strategy and (b) temporal sensitivity. For agents to operate in the real world, they need to be able to switch behaviors very abruptly, which necessitates (i) learning to identify critical points where behavior needs to change, and (ii) learning the different behaviors required in each context. Agents also need to understand time as an independent variable and adapt their behavior accordingly. While a lack of any notion of continuously elapsing time may suffice in simulated or static environments, it is not acceptable in dynamic real-world environments with moving entities, where decisions may have to be taken very quickly or very slowly depending on the context. Since existing benchmarks do not focus on these properties, reinforcement learning research has not tackled these problems yet.

In this paper, we introduce a challenging RL environment based on Space Fortress (SF) [Mané and Donchin1989], an arcade-style game which was developed by psychologists in the 1980s to study human skill acquisition and is still used quite frequently [Towne, Boot, and Ericsson2016, Destefano and Gray2016]. The objective of the game is to fly a ship and destroy a fortress by firing missiles at it. The ship has to respect a minimum time difference between successive shots while building up the fortress' vulnerability, and once the fortress becomes vulnerable, destroy it with a rapid double shot. As an RL testbed, Space Fortress possesses both the characteristics discussed above: context-dependent strategy change (a change in the required firing rate after the fortress becomes vulnerable) and time sensitivity (firing rate requirements independent of the agent's decision speed, i.e., the frame rate). It also has a sparse reward structure and, as we show, is not solved by state-of-the-art RL algorithms such as Rainbow [Hessel et al.2018], Proximal Policy Optimization (PPO) [Schulman et al.2017] and Advantage Actor-Critic (A2C) [Mnih et al.2016].

Besides being an interesting and relevant challenge for reinforcement learning, Space Fortress's rich background in human skill acquisition research also makes it an attractive tool to study human-AI collaboration in a dynamic environment, to compare the skill acquisition of humans and artificial agents, and to work on few-shot learning by leveraging lessons from cognitive architectures like ACT-R [Anderson2009], which have previously learned the game with extremely high sample efficiency, albeit using handcrafted features and extensive domain knowledge.

We make the following contributions. First, we present a new RL testbed that requires the agent to switch strategies abruptly based on context and to develop a conceptualization of time independent of its speed of decision making, and we demonstrate empirically that performance on par with humans is beyond the capability of current state-of-the-art RL algorithms, even after relaxing reward sparsity through shaping [Ng, Harada, and Russell1999]. Second, we identify the aspects of the game which can be varied to control temporal and context sensitivity, allowing research on either in isolation. Third, we demonstrate that after introducing modifications to ease the identification of critical contexts, the PPO algorithm learns to play the game well enough to outperform humans, verifying that context insensitivity is the primary driver behind the poor performance of RL algorithms. We also present robust human benchmark results for Space Fortress, allowing future researchers to place new experimental results in context. Finally, we open-source the OpenAI Gym environment for Space Fortress (https://github.com/agakshat/spacefortress), as well as all the code used to run our experiments, to promote research on temporally and context-sensitive reinforcement learning algorithms.

2 Related Work

The Arcade Learning Environment (ALE) [Bellemare et al.2013] poses the challenge of building AI agents with competency across dozens of Atari 2600 games, like Space Invaders, Asteroids, Bowling and Enduro. Following the development of Deep Q Networks [Mnih et al.2015], much research in the RL community has focused on improving performance in one or more of these games with improvements like massive parallelization, sample efficiency [Wang et al.2015, Schaul et al.2015], better exploration [Fortunato et al.2017, Plappert et al.2017], reward sparsity [Pathak et al.2017, Andrychowicz et al.2017] and long-term strategies [Bacon, Harb, and Precup2017, Kulkarni et al.2016]. In continuous control tasks on the MuJoCo testbed [Todorov, Erez, and Tassa2012], on-policy actor-critic methods [Schulman et al.2017, Mnih et al.2016] have shown promise. Bellemare et al. (2017) estimated a probability distribution over the Q-value of a state (instead of just its mean), with greatly improved results. Rainbow [Hessel et al.2018] combined several orthogonal improvements to DQN to achieve state-of-the-art results. However, we show below that these algorithms fail to learn anything on Space Fortress.

Games like Ms. Pacman and Seaquest in the ALE have previously required some context or temporal sensitivity, but these characteristics cannot be controlled or varied, and they form only a minor part of the overall game. As an RL testbed, Space Fortress relies heavily on both context and temporal sensitivity, as we show in Section 4, and both characteristics can be controlled directly to enable their study in isolation.

There has also been a fair amount of prior work on reinforcement learning with sparse rewards. Pathak et al. (2017) use curiosity as an intrinsic reward signal to efficiently direct exploration. State visitation counts have also been investigated for exploration [Bellemare et al.2016], and Osband et al. (2016) train multiple value functions and make use of bootstrapping and Thompson sampling for exploration. These works focus on learning with sparse rewards through better exploration of the state space, which does not help with Space Fortress, where exploration is required in time and in latent contexts.

Zambrano et al. (2015) trained agents to deal with actions that take a finite amount of time through neural reinforcement learning in grid worlds, which still did not require a conceptualization of time independent of the internal speed of decision making, and hence differs from the proposed work. Finally, earlier work [van2017higher, van2017towards] used A3C on a simple control task abstracted from Space Fortress, with no fortress destruction required. Crucially, this task removed the interesting characteristics of Space Fortress, namely contextual and temporal sensitivity as well as reward sparsity. In contrast, we release an implementation of the full game as an OpenAI Gym environment to promote research, conduct an ablation study to ascertain the roles of context and temporal sensitivity and of reward sparsity in the poor performance, and then present results showing existing RL algorithms outperforming humans after we control for the above factors.

3 The Space Fortress RL Environment

Figure 1: Game screens in Space Fortress. The ship has to fly between the two hexagons, while the fortress can only change its orientation at a fixed position. The game score is displayed at the top, and the fortress' vulnerability is displayed as a bar which fills up on each shot. (a) The bar is empty, indicating that the fortress' vulnerability is 0; (b) the bar is full, indicating that the vulnerability is equal to 10 and a rapid double shot will now destroy the fortress; (c) the fortress has been destroyed. This is followed by a reset of the fortress and continuation of the game until the end of the episode (3 minutes of game time).

We now describe the Space Fortress game, discuss its utility as a testbed for reinforcement learning, and present results from humans learning to play the game, intended as a baseline. The game environment can be seen in Fig. 1.

3.1 Game Description

The player/AI agent controls a ship, which has to fly around a frictionless arena, firing missiles to destroy a fortress located at the center of the arena. Hitting the walls on either side, or being hit by shells fired by the fortress, results in immediate ship death, which incurs a penalty on the agent. Destroying the fortress, however, requires a context-aware strategy. Each missile that successfully hits the fortress increases its vulnerability v by one. While v < 10, the fortress is 'not vulnerable', and the ship must fire its missiles spaced more than 250ms apart. Firing faster than this while v < 10 resets the fortress vulnerability back to zero. This is obviously undesirable, and the agent must learn to shoot slowly. However, once v = 10, the fortress becomes vulnerable, and a rapid double shot (2 shots spaced less than 250ms apart) is required to destroy it. We refer to this 250ms time specification as the "critical time interval". It is important to note that once v = 10, continuing to shoot at the fortress at a rate slower than 4Hz leads to no change in vulnerability. Hence the firing strategy completely reverses at the point when vulnerability reaches 10, and the agent must learn to identify this critical point to perform well. Since the game is simply reset (without ending the episode) when the fortress is destroyed, it is crucial that the agent also recognize this second critical point of fortress destruction, and switch its firing rate back to continue playing well. This major dependence on contextual and temporal sensitivity is unique to Space Fortress among RL benchmarks.
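
To make the firing-rate rules concrete, the following Python sketch implements the vulnerability logic described above, based only on the rules stated in this section; the class and method names are illustrative, and this is not the released environment's implementation.

```python
# Simplified sketch of the fortress vulnerability rules described above.
# Names (FortressState, on_missile_hit) are illustrative, not the environment's API.

CRITICAL_INTERVAL = 0.250  # the 250ms "critical time interval", in seconds

class FortressState:
    def __init__(self):
        self.vulnerability = 0
        self.last_hit_time = None

    def on_missile_hit(self, t):
        """Apply the firing-rate rules for a missile hitting the fortress at time t."""
        gap = float("inf") if self.last_hit_time is None else t - self.last_hit_time
        self.last_hit_time = t

        if self.vulnerability < 10:
            if gap > CRITICAL_INTERVAL:
                self.vulnerability += 1   # well-spaced shots build vulnerability
            else:
                self.vulnerability = 0    # firing too fast resets vulnerability
        else:
            if gap < CRITICAL_INTERVAL:
                self.vulnerability = 0    # rapid double shot destroys the fortress;
                                          # the fortress resets and the game continues
            # shots slower than the critical interval leave vulnerability unchanged
```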

A single game lasts for 3 minutes. The game does not end in the event of either a fortress or ship destruction, and points are scored by destroying the fortress as many times as possible in those 3 minutes while avoiding being shot down by the fortress or colliding with the arena walls. When the fortress is destroyed, its vulnerability resets to zero and the game continues. When the ship is destroyed, it respawns at a random position and orientation, but the fortress' vulnerability is preserved.

3.2 Game Versions

Space Fortress requires the agent to master advanced controls in a frictionless environment, orienting and firing missiles at the fortress while avoiding shells and not colliding with the walls. Since current RL algorithms proved unable to solve the game in its entirety (see experiments in Section 4.2), we introduced another version of the game to reduce navigation complexity by having the ship automatically pointed at the fortress. Throughout the rest of the paper, the simpler version is called ‘Autoturn’, while the original game is referred to as ‘Youturn’.

3.3 Human Evaluations

The human player results were collected by the authors in the context of a study on human skill acquisition [Anderson et al.2018]. 117 people were asked to play 20 games of Space Fortress, with 52 playing Autoturn and 65 playing Youturn. They were all given instructions about the rules of the game beforehand, and told about the change in firing rate required when the fortress vulnerability reaches 10. Considering that humans would require some turns to learn to play the game, we report the following results in Table 1: (1) Best performance of any subject in any game, (2) Average performance of all subjects in the last 5 games, considering the first 15 as a learning phase, (3) Average performance of all subjects in the last 10 games, considering the first 10 as a learning phase, (4) Average performance of all subjects in the last 15 games, considering the first 5 as a learning phase and (5) Average performance of all subjects in all 20 games. The scores shown to the humans (and reported in Table 1) were as follows: +100 for fortress destruction, -100 for ship death and -2 for each missile shot to penalize excessive firing.
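
For reference, the score shown to players can be written as a single function reflecting exactly these point values:

```python
def game_score(fortress_destructions, ship_deaths, missiles_fired):
    """Score shown to human players and reported in Table 1:
    +100 per fortress destruction, -100 per ship death, -2 per missile fired."""
    return 100 * fortress_destructions - 100 * ship_deaths - 2 * missiles_fired
```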

Game N Metric Best Last 5 Last 10 Last 15 All
Autoturn 52 Score 3000 1989 1978 1940 1810
Autoturn 52 FortressDeath 40 30.311 30.044 29.591 28.181
Youturn 65 Score 2314 216 153 43 -169
Youturn 65 FortressDeath 32 14.36 13.704 12.882 11.4
Table 1: Aggregated results for 102 humans playing Space Fortress. After being provided with instructions about the rules beforehand, each player played the game for 1 hour, or 20 games. Allowing for a few practice games, we report the average scores over the last n games (n = 5, 10, 15) and over all 20 games, as well as the best individual score.
S.No. Algorithm Game Avg. Score Best Score Fortress Death
1 A2C Autoturn -2685 -2242 0
2 A2C Youturn -5859 -5604 0
3 PPO Autoturn -2502 -2178 0
4 PPO Youturn -5269 -4698 0
5 Rainbow Autoturn -8327 -8264 0
6 Rainbow Youturn -9378 -9245 0
Table 2: Average game scores for RL agents, trained with default (sparse) rewards, for 45M steps.

3.4 RL Setup

We now describe the exact game setup used for reinforcement learning on Space Fortress.

  • Observations: The observations are 84x84 grayscale renderings of the game screen (similar to Fig. 1). Important information such as the time elapsed since the last shot is not part of this observation, making the task partially observed. We provide the agent with a stack of the last 4 observations as input at each time step, allowing it to infer the direction of movement of the ship and fortress from the differences between successive frames.

  • Actions: The agent chooses from 5 actions: (i) No Operation, (ii) Fire (a missile), (iii) Thrust Forward (in the direction of current orientation), (iv) Thrust Right (rotate right without changing position) and (v) Thrust Left (rotate left without changing position). The game operates at a default frame rate of 30 FPS and there is no action repeat, which means an action is chosen every 33ms. Note that the Autoturn version only has 3 actions (since no turning is required).

  • Rewards: In line with [Mnih et al.2015], we found that learning was more stable when using clipped rewards. The fortress and ship destruction rewards were clipped to +1 and -1, respectively, and the missile penalty was reduced to -0.05. Note that the scores used for evaluation and reporting were not clipped, in order to follow the same scheme as described in Section 3.3. (A sketch of the observation and reward preprocessing follows this list.)
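
The observation stacking and reward clipping described above can be expressed as standard Gym wrappers. The sketch below is illustrative: it assumes the underlying environment already emits single 84x84 grayscale frames and raw (unclipped) scores, with at most one scoring event per step, and the wrapper names are ours rather than the released environment's API.

```python
import collections
import numpy as np
import gym

class FrameStack(gym.Wrapper):
    """Stack the last k grayscale frames into a (k, 84, 84) observation."""
    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = collections.deque(maxlen=k)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames, axis=0)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.stack(self.frames, axis=0), reward, done, info

class ClipRewards(gym.RewardWrapper):
    """Clip fortress/ship rewards to +1/-1 and shrink the missile penalty to -0.05
    (illustrative thresholds, assuming one scoring event per step)."""
    def reward(self, reward):
        if reward >= 100:      # fortress destroyed (+100 in the raw score)
            return 1.0
        if reward <= -100:     # ship destroyed (-100 in the raw score)
            return -1.0
        if reward < 0:         # missile penalty (-2 in the raw score)
            return -0.05
        return reward
```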

S.No. Algorithm Game Avg. Score Best Score Fortress Death
1 A2C Autoturn -4116 -2100 0
2 A2C Youturn -4781 -3890 1.3
3 PPO Autoturn -1294 -1108 1
4 PPO Youturn -1435 -1206 0.94
5 Rainbow Autoturn -6161 -5960 0
6 Rainbow Youturn -4894 -4577 0
Table 3: Average game scores for RL agents, trained with dense rewards, for 45M steps.
S.No. Algorithm Architecture Game Avg. Score Best Score Fortress Death
1 A2C SF-GRU Autoturn -1641 -718 3
2 A2C SF-GRU Youturn -2444 -1700 11
3 PPO SF-FF Autoturn 2337 2818 41
4 PPO SF-FF Youturn 2235 2880 40
5 PPO SF-GRU Autoturn 2510 2870 43
6 PPO SF-GRU Youturn 2356 2932 41
7 Rainbow Autoturn -2973 -2330 1.2
8 Rainbow Youturn -4112 -3934 0.0
Table 4: Average game scores for RL agents, trained after making context identification easier, for 45M steps.

4 Experiments and Results

In this section, we experimentally show that (a) no state-of-the-art reinforcement learning algorithm (Rainbow [Hessel et al.2018], A2C [Mnih et al.2016] or PPO [Schulman et al.2017]) can learn to play Space Fortress, (b) removing reward sparsity does not improve performance, and (c) making context identification easier through specific alterations in the reward structure allows PPO to achieve superhuman performance. We also discuss temporal sensitivity by examining the effectiveness of transfer learning across different settings of the game's critical time interval.

4.1 Network Architecture

For Rainbow, the Q-network architecture was identical to that in [Hessel et al.2018]. For PPO and A2C, we experiment with two policy network architectures:

  • SF-GRU: The agent's policy network takes the 1x84x84 environment observation as input, and outputs (a) a probability distribution over the actions and (b) a value estimate of the expected return. The input passes through two convolutional layers, with 16 and 32 filters of sizes 8 and 4 and strides 4 and 2 respectively, each followed by a ReLU activation. The output is flattened and passed through a linear layer with a ReLU non-linearity to produce a vector of size 256. This is then passed through a unidirectional Gated Recurrent Unit (GRU) cell [Cho et al.2014] with a tanh non-linearity, giving an output of size 256. Finally, this vector is passed as input to two linear layers that output the probability distribution over actions (using a softmax activation) and the value estimate of the expected return. (A sketch of this architecture is given after this list.)

  • SF-FF: Same as above, but with a fully connected layer of size 256 with ReLU non-linearity instead of the recurrent GRU cell.
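
A PyTorch sketch of the SF-GRU network described above follows; SF-FF is obtained by replacing the GRU cell with a fully connected layer of the same size. Layer dimensions follow the text, and any detail not stated there (such as the class name) is illustrative.

```python
import torch
import torch.nn as nn

class SFGRU(nn.Module):
    """Sketch of the SF-GRU policy/value network described in the text."""
    def __init__(self, num_actions, hidden_size=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        # 84x84 input -> 20x20 after the first conv -> 9x9 after the second
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, hidden_size), nn.ReLU())
        self.gru = nn.GRUCell(hidden_size, hidden_size)
        self.policy_head = nn.Linear(hidden_size, num_actions)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, obs, hidden):
        # obs: (batch, 1, 84, 84), hidden: (batch, hidden_size)
        x = self.conv(obs)
        x = self.fc(x.flatten(start_dim=1))
        hidden = self.gru(x, hidden)        # tanh non-linearity is built into the GRU cell
        logits = self.policy_head(hidden)   # softmax is applied when sampling / in the loss
        value = self.value_head(hidden)
        return logits, value, hidden
```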

For all experiments, we ran 16 processes collecting game experience in parallel, with a discount factor γ and Generalized Advantage Estimation (GAE) [Schulman et al.2015] parameter λ. PPO and A2C each used their own value loss coefficient, entropy regularization coefficient and learning rate, and both used n-step returns. These hyperparameters were found after extensive tuning. We also clipped the gradients of all the network parameters to 0.5, to prevent catastrophic updates from outlying samples of the expected gradient value. Since the PPO algorithm is more stable, we updated the policy 4 times every epoch, while A2C made only 1 update per epoch.
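
For reference, a minimal sketch of GAE and the gradient clipping described above is shown below; gamma and lam stand for the discount factor and GAE parameter, whose tuned values are not restated here.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma, lam):
    """Generalized Advantage Estimation over a rollout of length T.
    rewards, values: float tensors of shape (T,); dones: 0/1 float tensor of shape (T,)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# Gradient clipping to 0.5, as described in the text:
# torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
```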

(a) Autoturn
(b) Youturn
Figure 2: Learning curves for PPO on Space Fortress, for different reward structures and architectures. 'Average Human Score' refers to the average score over all 20 games, provided as a point of comparison. 'Default Rewards' is discussed in Section 4.2, 'Dense Rewards' in Section 4.3 and 'AECI' (After Easing Context Identification) in Section 4.4. Both the SF-FF and SF-GRU architectures achieve superhuman performance after making context identification easier, whereas the agent's performance is very poor with both default (sparse) rewards and dense rewards.
(a) 125ms
(b) 400ms
(c) 600ms
Figure 3: Checking for positive transfer of learning while changing the critical time interval in Space Fortress. In each panel, we changed the critical time interval to a different value and verified whether transferring learned weights from the agent trained with 250ms as the critical interval led to any speedup in the learning process and an improvement in final performance.

4.2 With Default (Sparse) Rewards

With the default sparse reward structure which rewards fortress destruction and penalizes ship destruction and missile firing, no algorithm is able to learn to destroy the fortress. A visual inspection of the game play revealed that the PPO and A2C agents (with both architectures) just learned to stop firing, since that leads to an immediate penalty. The Rainbow agent did not learn anything. Table 2 presents the aggregated results for PPO, A2C and Rainbow on both versions of the game. The ‘Fortress Death’ column in Table 2 indicates the number of times the agent was able to destroy the fortress per game, on average.

4.3 With Dense Rewards

Considering the results in Section 4.2 and to understand how reward sparsity is impacting performance, we introduce an additional reward of +1 each time the fortress is hit by a missile, and a penalty of -1 if the fortress’ vulnerability gets reset due to a faster firing speed than the context demands. This makes the reward density comparable to Atari games, on which PPO, A2C and Rainbow have all been shown to perform well. Their performance on Space Fortress with dense rewards can be seen in Table 3, where the scores for PPO and Rainbow have improved. From watching a video of the trained agent playing the game, we observed that the improvement stemmed from having learned to avoid ship death and to fire at the fortress, albeit without knowledge of the critical time interval and context-dependent strategy shifts, resulting in an inability to destroy the fortress with any consistency.
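
A hedged sketch of this dense reward as a Gym wrapper is shown below; it assumes the environment reports fortress hits and vulnerability resets through the info dict, and the field names are placeholders rather than the released environment's actual interface.

```python
import gym

class DenseReward(gym.Wrapper):
    """Adds the dense shaping reward of this section on top of the clipped rewards."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if info.get("fortress_hit", False):          # placeholder info field
            reward += 1.0                            # reward each successful hit
        if info.get("vulnerability_reset", False):   # placeholder info field
            reward -= 1.0                            # penalize firing faster than the context demands
        return obs, reward, done, info
```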

Hence, Space Fortress clearly presents a challenge to the state of the art in reinforcement learning, and is a useful and relevant benchmark for further research.

We now move on to studying the impact of the context insensitivity of RL algorithms on task performance (Section 4.4), and show that by making the identification of critical contexts easier with two simple modifications to the reward, PPO learns to play the game very well, outperforming humans comfortably. This clearly indicates that context insensitivity and the inability to identify critical points are what hamper performance, further making the case that Space Fortress is a useful benchmark for studying context sensitivity.

4.4 After Making Context Identification Easier

As discussed in Section 3.1, there are two critical points at which the agent has to learn to identify the context and switch strategies. The first is when the fortress becomes vulnerable, i.e., v = 10, and the agent has to switch from firing more than 250ms apart to a rapid double shot fired less than 250ms apart. The second is when the fortress is destroyed, and the agent has to switch back to its slow firing speed. To confirm our intuition that it is indeed the algorithms' inability to identify these critical points and adapt their firing strategy accordingly that leads to poor performance, we introduce two changes to the reward structure (with respect to the dense reward from Section 4.3) which make it trivial for the agent to identify the critical points where the context changes:

  • Instead of rewarding fortress hits (as in Section 4.3), we switch to rewarding fortress vulnerability change, giving a reward of +1 for each unit increase in vulnerability and a penalty of -1 for a decrease in vulnerability. This has the effect of rewarding fortress hits only while the fortress' vulnerability is building up to 10, after which further hits are not rewarded. This clearly helps the agent identify the critical context at which the fortress becomes vulnerable.

  • We give the agent a bonus reward of +2 for fortress destruction, to help it identify when the fortress is destroyed. (A sketch of this shaped reward follows this list.)
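
The two changes above can be written as another reward wrapper, sketched below under the same assumption that the info dict exposes the fortress' vulnerability and destruction events; the field names are placeholders.

```python
import gym

class ContextShapedReward(gym.Wrapper):
    """Rewards vulnerability changes (+1/-1) and adds a +2 fortress-destruction bonus."""

    def __init__(self, env):
        super().__init__(env)
        self._prev_vuln = 0

    def reset(self, **kwargs):
        self._prev_vuln = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        vuln = info.get("vulnerability", self._prev_vuln)   # placeholder info field
        if info.get("fortress_destroyed", False):           # placeholder info field
            reward += 2.0                                    # bonus marking the destruction context
        elif vuln > self._prev_vuln:
            reward += 1.0                                    # vulnerability increased by one unit
        elif vuln < self._prev_vuln:
            reward -= 1.0                                    # vulnerability was reset
        self._prev_vuln = vuln
        return obs, reward, done, info
```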

Table 4 presents the results for agents trained after these two changes were introduced to the reward structure to ease context identification. PPO with the recurrent architecture SF-GRU achieves the best performance in both score and number of fortress deaths, learning faster than SF-FF and achieving a higher final score. The performance of both A2C and Rainbow also improves, although they are still unable to outperform humans. Fig. 2 tracks the learning curves for PPO with all three reward settings (default, dense, and after making context identification easier) in both game versions, Autoturn and Youturn.

4.5 Temporal Sensitivity

Having established that context insensitivity is the primary driver of the poor performance of state-of-the-art RL algorithms on Space Fortress, we now analyze the temporal sensitivity of the PPO algorithm. As described in Section 3.1, Space Fortress has a dominant temporal aspect: missiles must hit the fortress at least 250ms apart when it is not vulnerable, and the strategy must then reverse to hit the fortress twice within 250ms when it is vulnerable, in order to destroy it. To understand whether the RL algorithms had developed any understanding of time as an independent dimension, we modified the critical time interval from 250ms to other values and checked for positive transfer of learning from the policy trained with 250ms as the critical time interval. Transfer is achieved by simply initializing the weights of the new agent with the learned weights of the trained agent.
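
Concretely, the transfer amounts to loading the source agent's weights before training resumes at the new interval, roughly as in the sketch below; the constructor name and checkpoint path are placeholders.

```python
import torch

# Hypothetical constructor for the SF-FF policy used in the transfer experiments.
policy = build_sf_ff_policy(num_actions=5)

# Initialize with the weights of the PPO SF-FF agent trained at the 250ms interval
# (placeholder checkpoint path), then continue PPO training in the environment
# configured with the new critical time interval.
policy.load_state_dict(torch.load("ppo_sf_ff_youturn_250ms.pt"))
```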

Figure 3 compares the learning curves for an agent learning with PPO (using the SF-FF architecture) on Youturn, when the critical time interval is changed from 250ms to 125ms, 400ms and 600ms. The blue line is for an agent learning from scratch, while the orange line is for an agent transferring learning from the PPO SF-FF agent trained on the 250ms interval. From Figures 3(b) and 3(c), it can be seen that while the transfer of learning helps by initializing the weights in a favorable region of the parameter space, learning saturates very quickly and ends with a final score much lower than that achieved when the critical interval was 250ms. Modifying the critical time interval in Space Fortress is thus a useful technique for studying the temporal sensitivity of reinforcement learning algorithms.

5 Conclusion

This paper introduced Space Fortress as a new challenge for deep reinforcement learning research, with its time-sensitive game play, abrupt context-dependent shifts of strategy, and sparse rewards. We showed that state-of-the-art RL algorithms (PPO, A2C and Rainbow) were unable to learn to play the game with either the default sparse rewards or the dense reward structure we defined. After making context identification easier through two minor tweaks to the reward structure, however, PPO was able to learn to play the game, outperforming humans comfortably. This ablation study allowed us to conclude that context insensitivity, along with the inability to learn with sparse rewards, was the primary reason behind the poor performance of RL algorithms on Space Fortress. We then examined whether PPO develops a concept of time as an independent variable by checking for positive transfer of learning while changing the critical time interval of 250ms in Space Fortress. We found that while there was some positive transfer, the agents saturated very quickly and did not achieve a good final score. By studying generalization and transfer across different settings of the critical time interval, Space Fortress can hence also be used as a benchmark to study the temporal sensitivity of reinforcement learning algorithms.

Learning to play Space Fortress without making any modifications to the reward structure will require reinforcement learning algorithms to be able to identify various latent contexts and adapt their strategies suitably. It will also require being able to learn with very sparse rewards. This is beyond the capability of current state of the art reinforcement learning algorithms, making Space Fortress a useful benchmark for research.

Acknowledgments

This research was sponsored by AFOSR Grant FA9550-15-1-0442. The collection of human data and development of the OpenAI Gym interface for Space Fortress was supported by ONR grant N00014-15-1-2151. We would like to thank Shawn Betts and John Anderson for insightful discussions on the game of Space Fortress, and for the OpenAI Gym interface for Space Fortress which was used to run the experiments in this paper.

References