It has become ubiquitous to apply deep reinforcement learning methods to the games that humans enjoy. Perfect information games such as Go have fallen to a combination of deep RL and Monte-Carlo Tree Search (Silver et al., 2017), and even imperfect information games such as Poker are being solved (Moravcík et al., 2017). Video games, starting with classic Atari console titles, were among the first to be tackled by deep RL (Mnih et al., 2015), and are still widely used as benchmarks for state-of-the-art RL algorithms today. More recently, much interest has been shown in modern games such as StarCraft II (Vinyals et al., 2017) and Dota 2 (OpenAI, 2017), which have established fan followings and professional scenes.
In all of these cases, the bar we wish our agents to reach is the level of competent or even world-class humans. This is especially true of multi-player games in which humans can face off directly against trained AI opponents. It is certainly impressive, and perhaps awe-inspiring, to watch machines surpass us at the games that we have invested so much passion and dedication in mastering.
However, AI agents often win on more than intelligence alone – they possess superhuman speed and precision by default. A more principled way to compare the intelligence – that is, the information-processing abilities – of machines and people would be to level the playing field in this regard. Adding human constraints may also lead agents to employ strategies that are more interesting and relatable to humans.
To mimic the limits of human reaction time, we add a fixed delay between the time an agent chooses an action and the time that action reaches the environment. To our knowledge, deep reinforcement learning methods have not been deliberately applied to environments with action delay. We investigate how deep RL methods perform with delay, and find that performance falls drastically as delay increases for agents playing Super Smash Bros. Melee and a variety of Atari 2600 games.
We present a novel technique for deep RL agents to cope with action delay, inspired by human perception and previous work on constant-delay Markov Decision Processes (MDPs). We endow agents with a neural predictive model of the environment, which can “undo” action delay, enabling them to act according to an estimate of the true state in which their action will be executed. Combining this predictive model with the IMPALA architecture, we extend the work of (Firoiu et al., 2017), which trained superhuman undelayed SSBM agents via self-play. With this predictive architecture, agents are able to challenge world-class SSBM players while constrained by human-like reaction times.
2.1 Super Smash Bros. Melee
Super Smash Bros. Melee (SSBM) is a fast-paced multi-player fighting game released in 2001 for the Nintendo Gamecube. SSBM has steadily grown in popularity over its 17-year history, and today sports an active professional scene with tournaments that can draw hundreds of thousands of viewers. Although 2v2 matches are also played professionally, we focus on 1v1, which is the main tournament format.
We use the same interface to SSBM as in (Firoiu et al., 2017), which uses a discrete action set and a structured state space with both discrete and continuous components. While deep RL has often been applied to environments with visual state spaces such as Atari (Bellemare et al., 2013) and Deepmind-lab (Beattie et al., 2016), more recent work on Dota 2 and StarCraft II has used structured feature representations. Rewards are given both for knockouts – the underlying objective – and damage, which is displayed on screen.
Being a fighting game, SSBM is naturally faster-paced than Dota or SC2. With important interactions occurring at such high frequency, human players are pushed to the limits of their reaction time. Without this handicap, relatively standard deep RL methods combined with self-play have surpassed human professionals (Firoiu et al., 2017). There even exists a hand-engineered, decision-tree-based AI which can play almost perfectly against humans, albeit in a limited setting where it can fully utilize its unlimited reactions (Petro, 2017). Given the importance of reaction time, SSBM is a natural environment in which to pose the problem of AI with action delay, from the point of view of both scientists and players.
2.2 Delayed MDPs
(Walsh et al., 2008) studied constant-delay Markov Decision Processes (CDMDPs), defined as MDPs where actions are delayed by a constant number of steps. They showed that state augmentation, which naively turns the CDMDP back into an MDP by appending the delayed actions to the state, is intractable due to the exponential blowup in the size of the new state space. They proposed Model-Based Simulation (MBS), an approach similar to ours, as a sample-efficient solution that is theoretically tractable when the underlying MDP is only “mildly stochastic”. Empirically, they found that MBS performs well on grid worlds, mazes, and the one-dimensional mountain car problem. We note that these environments are both simpler than SSBM and, crucially, single-agent; the presence of an adversary greatly complicates the problem of modeling the environment.
2.3 Reaction Time
Fast-paced games like SSBM push players to the limits of their reaction time, which for the average person is about 250ms for visual stimuli (Jain et al., 2015). It has been found that this reaction time both varies throughout the population and can be improved with training, such as by playing video games (Dye et al., 2009). Human auditory reaction times are known to be somewhat faster, and indeed professional SSBM players will in certain situations listen for auditory cues instead of visual ones.
Many video games, Atari and SSBM included, run at 60Hz, which means that each frame lasts about 17ms. A completely undelayed agent thus has a reaction time of 17ms, while an agent under 15 frames of delay will have the reactions of an average human. We consider 12 frames to be the lowest human-plausible reaction time.
3 Deep RL and action delay
To our knowledge, deep reinforcement learning methods have not been deliberately applied to environments with action delay. (Anecdotally, we have heard that A3C performs significantly worse in OpenAI’s Universe framework, which introduces a modest 40ms of delay.) That being so, an empirical investigation is in order.
For all experiments, we augment the environment with a length-$d$ queue of actions. When the agent takes an action, it is pushed onto the queue, and the action which pops out of the other end is executed instead. Thus, each action is executed exactly $d$ steps later than usual. Note that each step encompasses multiple game frames due to frame skipping.
The action queue is passed to the agent along with the state at each step, giving the agent in principle perfect information. This is known as the augmented approach in (Walsh et al., 2008).
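As a concrete sketch, the queue mechanism above can be implemented as a thin environment wrapper. This is a hypothetical illustration, not the paper's code: the `DelayedEnv` class, the gym-style `reset()`/`step()` interface, and the `noop` filler action are all assumptions.

```python
from collections import deque

class DelayedEnv:
    """Wraps an environment so each action is executed `delay` steps late.

    `env` is assumed to expose gym-style reset()/step(action) methods;
    `noop` is the action executed while the queue is still filling.
    """

    def __init__(self, env, delay, noop=0):
        self.env = env
        self.delay = delay
        self.noop = noop

    def reset(self):
        # Pre-fill the queue so the first `delay` executed actions are no-ops.
        self.queue = deque([self.noop] * self.delay)
        state = self.env.reset()
        # The agent observes both the state and the pending action queue.
        return state, tuple(self.queue)

    def step(self, action):
        # Push the new action; pop and execute the one chosen `delay` steps ago.
        self.queue.append(action)
        delayed_action = self.queue.popleft()
        state, reward, done, info = self.env.step(delayed_action)
        return (state, tuple(self.queue)), reward, done, info
```

Passing the remaining queue back to the agent alongside the state is what makes this the augmented formulation rather than a partially observed one.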
We trained IMPALA agents on six Atari games for 200 million frames using a frame skip of 4 and delays of 0 through 5 agent steps. Figure 1 shows the learning curves of the agents with varied delay for each game. While the outcomes of Ms. Pacman were slightly mixed, increasing delay resulted in significantly lower scores on all other games.
We trained IMPALA agents on SSBM against the in-game AI at its hardest difficulty setting for one day using a frame skip of 3. Figure 2 shows the learning curves of agents with varying delay against the in-game AI. Again, increasing delay dramatically lowered performance.
3.4 Why is delay hard?
As we have seen, agents under action delay perform quite poorly. Intuitively, we can see that, with delay, the agent does not know which state it will be in when its action is eventually executed by the environment, and without this knowledge it is difficult to act appropriately (Figure 2(b)), as compared to the process of an agent with no delay (Figure 2(a)).
This is especially problematic when it comes to the discrete components of the state, which can completely change the transition dynamics and therefore the optimal policy. For example, in SSBM each of the two characters has a discrete “animation state” which can take on over three hundred different values. Possible values discriminate between the twenty or so different attacks the character might be performing, whether the character is jumping, running, crouching, rolling, sliding, stunned from an enemy attack, and many others. Knowing which state your character is in is crucial for determining the best action.
Even the continuous components such as position can be tricky to deal with under uncertainty, as there is sharp discontinuity between an attack hitting or missing based on the distance of the characters.
More theoretically, we can measure the complexity of adding delay by considering the size of the resulting delayed MDP. In order to be Markovian, we must augment the original state space $S$ with the queue of $d$ delayed actions, yielding $S \times A^d$. This results in an increase in size by a factor of $|A|^d$, which can easily become quite large.
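To make the blowup concrete, a few lines of arithmetic suffice. The action-set size used below (30) is an assumption for illustration, not SSBM's actual count.

```python
def augmented_space_blowup(num_actions, delay):
    """Factor |A|**d by which augmenting S to S x A^d grows the state space."""
    return num_actions ** delay

# Illustrative numbers: even a modest discrete action set makes the
# augmented space explode at human-scale delays.
for d in range(6):
    factor = augmented_space_blowup(30, d)
    print(f"delay={d}: augmented state space is {factor:,}x larger")
```

At a delay of five agent steps, the augmented space is already tens of millions of times larger than the original, which is why naive state augmentation struggles.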
4 Predictive modeling as a solution to delayed actions
4.1 Human perception
As we have seen, deep RL agents struggle in delayed environments. Since we wish to train policies that act under human-like delays, it is natural to ask how humans themselves deal with delay. Experimental psychology suggests that the brain constantly and subconsciously anticipates the near future in physical environments (Nijhawan, 1994). Optical illusions such as the Flash-Lag Effect show that our very perception of the present is actually a prediction, with moving objects placed in their extrapolated rather than present locations. This feature of our perceptual systems explains how we can perform athletic feats such as catching a baseball or returning a tennis serve with relatively slow motor controls.
4.2 Predicting the present
Taking this insight to heart, we endow our agents with a predictive model of the SSBM environment. Once trained, this model can be used to “undo” the agent’s delay, as in MBS (Walsh et al., 2008). Figure 3 displays the predictive architecture: Figure 3(a) illustrates the predictive agent unrolled, and Figure 3(b) shows the predictive model unrolled.
More precisely, suppose that $f$ is the learned action-conditional transition model, the agent is under $d$ frames of delay, the current state is $s_t$, and the previously chosen actions were $a_{t-d}, \dots, a_{t-1}$. Due to the delay, the next action to be sent to the environment is precisely $a_{t-d}$, and the current decision $a_t$ will only reach the environment at state $s_{t+d}$.
Our initial agents used a policy network that directly output $a_t$ given the augmented state $(s_t, a_{t-d}, \dots, a_{t-1})$. With our predictive model, we can instead generate predicted states $\hat{s}_{t+1}, \dots, \hat{s}_{t+p}$, where $\hat{s}_t = s_t$ and $\hat{s}_{t+k} = f(\hat{s}_{t+k-1}, a_{t-d+k-1})$.
We say that a $(d, p)$ agent is one whose actions are under $d$ frames of delay and which runs the predictive model for $p \le d$ steps. In state $s_t$, the agent’s policy network receives as input the predicted state $\hat{s}_{t+p}$ and the remaining delayed actions $a_{t-d+p}, \dots, a_{t-1}$.
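A minimal sketch of this unrolling, assuming only that the model is callable as `f(state, action) -> next_state` (the helper name and signature are hypothetical):

```python
def predict_present(f, state, pending_actions, p):
    """Unroll a transition model over the oldest p pending actions.

    f: callable f(state, action) -> next_state (assumed interface)
    pending_actions: the d chosen-but-unexecuted actions, oldest first
    p: number of prediction steps, 0 <= p <= d

    Returns the predicted state the policy conditions on, together with
    the not-yet-simulated actions it also receives as input.
    """
    assert 0 <= p <= len(pending_actions)
    s_hat = state
    for a in pending_actions[:p]:       # a_{t-d}, ..., a_{t-d+p-1}
        s_hat = f(s_hat, a)
    return s_hat, pending_actions[p:]   # a_{t-d+p}, ..., a_{t-1}
```

With `p` equal to the full delay `d`, the second return value is empty and the policy conditions purely on an estimate of the state in which its action will actually execute.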
Note that $d$ and $p$ are measured in the frames the agent sees, not counting those skipped. Thus, a $(d, p)$ agent acting every $c$ frames has a reaction time of $cd$ frames. The frame skip itself adds another $(c-1)/2$ frames on average. When specifying the frame skip, we refer to such an agent as a $(d, p, c)$ agent.
4.3 Predictive architecture
Our predictive model employs a residual-style architecture. Writing $o$ for the output of the recurrent core described below, the predicted next state is

$$f(s, o) = \phi(o) \odot (s + \delta(o)) + (1 - \phi(o)) \odot \nu(o),$$

where
- $\delta$ is a “delta” network which additively adjusts the previous state,
- $\nu$ is a “new” network which constructs a new state from scratch,
- $\phi$ is a “forget” network whose outputs are weights in $[0, 1]$ and which smoothly interpolates between the adjusted and new states.

All three networks are feed-forward with output shapes equal to the state itself. Addition and multiplication ($\odot$) are done component-wise.
This architecture leverages the fact that our states are already encoded by semantically meaningful features. The changes in continuous components such as character position and velocity are well captured by the delta network. For the discrete components, we first transform from probability to logit space, where addition is more meaningful. Interpreting the continuous components of the predicted state as means of fixed-variance normal distributions, the predicted state becomes a diagonal (that is, independent-component) approximation to the true distribution over states.
Although we omit their dependence on previous states, in practice the networks sit on top of a shared recurrent core using a Gated Recurrent Unit (Cho et al., 2014). Using $h_t$ for core hidden states and $o_t$ for core outputs, $(o_t, h_{t+1}) = \mathrm{GRU}((s_t, a_t), h_t)$, and the delta, new, and forget networks each take $o_t$ as input.
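The component-wise combination of the delta, new, and forget heads can be sketched as follows. This is a simplified illustration, not the paper's implementation: each head is a single linear layer in NumPy rather than a feed-forward network, and all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_predict(s, o, W_delta, W_new, W_forget):
    """One residual prediction step: interpolate component-wise between an
    additively adjusted copy of the previous state and a freshly built one.

    s: previous state vector; o: recurrent core output vector.
    """
    delta = W_delta @ o                 # additive adjustment to the state
    new = W_new @ o                     # state constructed from scratch
    forget = sigmoid(W_forget @ o)      # per-component weights in [0, 1]
    # Component-wise interpolation between adjusted and new states.
    return forget * (s + delta) + (1.0 - forget) * new
```

When the forget weights saturate near 1, the model behaves like a pure residual update; near 0, it discards the previous state entirely.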
4.4 Training with delay
We train our predictive model by regressing each predicted state $\hat{s}_{t+k}$ to its true counterpart $s_{t+k}$. The distance between states is computed component-wise, with squared ($L_2$) error for the continuous components (character position, velocity, etc.) and cross-entropy for the discrete components.
Returns are computed somewhat differently for delayed agents. Because the action $a_t$ taken in state $s_t$ isn’t executed until state $s_{t+d}$, it does not make sense to use any of the rewards $r_t, \dots, r_{t+d-1}$ for reinforcing $a_t$. Instead, we use the return $R_{t+d}$ from time step $t+d$, the point when $a_t$ is executed.
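The shifted credit assignment can be sketched for a finite episode as follows (a hedged illustration; the function name, end-of-episode handling, and discount are assumptions):

```python
def delayed_returns(rewards, d, gamma=0.99):
    """Discounted returns shifted by the action delay d.

    The action chosen at step t is reinforced with the return starting at
    step t + d, when it actually reaches the environment. Actions chosen
    within the last d steps of the episode never execute and get return 0.
    """
    T = len(rewards)
    # Standard discounted returns: R_t = r_t + gamma * R_{t+1}.
    returns = [0.0] * (T + 1)
    for t in reversed(range(T)):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    # Credit action a_t with R_{t+d}.
    return [returns[min(t + d, T)] for t in range(T)]
```

With `d = 0` this reduces to ordinary discounted returns, so the undelayed agent is recovered as a special case.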
This choice of return raises the question of what to do with the critic. Already, our objective has changed: at time $t$, we wish to estimate the expected return at time $t+d$ rather than time $t$. Intuitively, one might use the same predicted state $\hat{s}_{t+p}$ that the policy does. However, because the critic is only used during training, we have full knowledge of the true state $s_{t+d}$, and so we can use that instead to form a more accurate value estimate.
The policy gradient is largely unchanged, although one must be careful to compute the predicted state in the same manner on both the actor and the learner. We found V-trace – the off-policy correction algorithm introduced in (Espeholt et al., 2018) – to be important, as the $p$ steps of prediction make the policy even more sensitive to changes in the parameters.
In the first test of our predictive architecture, we trained three agents, $(4, 0)$, $(4, 2)$, and $(4, 4)$, against the in-game AI at its highest difficulty setting. As seen in Figure 4(a), we found the predictive agents to do slightly worse. Since the in-game AI is mostly deterministic and easily exploitable, and because the predictive model is non-trivially slower to run and train, against such a weak opponent the faster non-predictive agents can do slightly better in terms of wall-clock time.
Ultimately, performance against the in-game AI is not our real objective – we wish to train agents with self-play that will be able to defeat human players. This suggests comparing the predictive and non-predictive agents more directly, by having them train against each other. The resulting scores, seen in Figure 4(b), clearly show the $(4, 4)$ agent with a significant advantage over the other two, suggesting that the predictive model is necessary for learning more difficult policies. In particular, it appears that predicting only partially, that is, with $p < d$, is insufficient, and best results are achieved with $p = d$.
Our final test was against “Professor Pro”, the top player in the UK and ranked 41st internationally. To face him, we trained a (6, 6, 2) agent for three days, and then retrained it as a (7, 7, 2) agent for one week. Games were in tournament format – first to four KOs – and recorded at both delays 6 and 7. We also trained a non-predictive (6, 0, 2) agent for one week.
Table 1: Wins and losses against Professor Pro, broken down by delay, prediction steps, and days trained.
Although our predictive agents were not ultimately victorious, they did come close to even against a very skilled human opponent. We believe that with some additional work, perhaps by leveraging the predictive model for better exploration as in (Pathak et al., 2017), truly superhuman agents with human-level reactions will be possible.
5 Future directions
Perhaps the most promising extension of our work is to run the predictive model past the delayed action sequence and into the future. This opens the avenue of neural model-based planning, which has proven immensely successful in perfect information games (Silver et al., 2016).
There are several challenges along this path, however. Without access to the true environment model, errors can quickly compound, making the resulting plan unreliable. This is exacerbated by the search procedure itself, which is likely to exploit flaws in the model as it tries to optimize reward. The approach taken in (Weber et al., 2017) attempts to remedy this by allowing the policy to arbitrarily interpret the planned trajectory.
Another issue is runtime, which can be limited in real-time environments such as SSBM. Already, unrolling the predictive model can be quite expensive. While not an issue for a (7, 7, 2) agent, we found that at (9, 9, 2) the agent could not run quickly enough to keep up with a real-time environment, and thus could not play against human opponents. However, there are certainly opportunities for improving the model’s computational complexity, for example by precomputing predictive steps before they are needed.
5.2 Modeling the opponent
While we demonstrate that our approach can perform well in the multi-agent setting – that is, when the opponent is also learning – our predictive model ignores the opponent, effectively pretending that the opponent is a part of the environment. With privileged post-facto information of the opponent’s actions, one could train a model that conditions on both players’ actions, and use it to reason about the underlying imperfect-information game. In this form it would be possible to apply methods from (Moravcík et al., 2017), though to our knowledge this has yet to be attempted with a neural environment model.
5.3 Other temporal action spaces
While constant delay may be a reasonable proxy for human reaction time, in other contexts such as robotics (especially over an unreliable network) variable delay may be more accurate. Constructing models that can deal with variable delay in real time is likely to be difficult, and it may be more pragmatic to simply move to lower-frequency policies.
Another limitation that humans have, aside from reaction time, is their total number of actions per minute (APM). Even in games such as StarCraft which are known for high APM, top professionals rarely exceed 400 APM, well below the 1800 taken by an RL agent with frame skip of two. Clearly humans are being much more efficient, acting only when it is truly necessary to do so. An RL agent that could decide not to act might even learn more effectively, as the credit assignment problem becomes easier when there are fewer actions that need to be reinforced.
In this paper we consider the problem of deep reinforcement learning in environments with action delay. We find that standard methods such as IMPALA are ill-equipped to deal with this new challenge and rapidly lose performance with increasing delay. Inspired by human visual perception and previous work on constant-delay MDPs, we propose a solution using a predictive environment model to anticipate the future state on which the current action will act. This provides the right inductive bias that is missing from the simpler augmented-state approach, endowing the agent with a model that more closely matches reality. Empirically, we find that predictive agents significantly outperform non-predictive ones when matched head to head, and can even hold their own against highly-ranked human professionals.
- Beattie et al.  Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Kuttler, Andrew Lefrancq, Simon Green, Victor Valdes, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. CoRR, 2016.
- Bellemare et al.  Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res., 47(1):253–279, May 2013.
- Cho et al.  Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014.
- Dye et al.  M. W. Dye, C. S. Green, and D. Bavelier. Increasing Speed of Processing With Action Video Games. Curr Dir Psychol Sci, 18(6):321–326, 2009.
- Espeholt et al.  Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures, 2018.
- Firoiu et al.  Vlad Firoiu, William F. Whitney, and Joshua B. Tenenbaum. Beating the world’s best at super smash bros. with deep reinforcement learning, 2017.
- Jain et al.  A. Jain, R. Bansal, A. Kumar, and K. D. Singh. A comparative study of visual and auditory reaction times on the basis of gender and physical activity levels of medical first year students. Int J Appl Basic Med Res, 5(2):124–127, 2015.
- Moravcík et al.  Matej Moravcík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael H. Bowling. Deepstack: Expert-level artificial intelligence in no-limit poker. CoRR, abs/1701.01724, 2017.
- Nijhawan  R Nijhawan. Motion extrapolation in catching. In Nature, pages 256–257, 1994.
- OpenAI  OpenAI. Dota 2, 2017.
- Pathak et al.  Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction, 2017.
- Petro  Dan Petro. Smashbot, 2017.
- Silver et al.  David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016.
- Silver et al.  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–359, October 2017.
- Vinyals et al.  Oriol Vinyals, Stephen Gaffney, and Timo Ewalds. DeepMind and Blizzard open StarCraft II as an AI research environment, 2017.
- Walsh et al.  Thomas J. Walsh, Ali Nouri, Lihong Li, and Michael L. Littman. Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems, 18:83–105, 2008.
- Weber et al.  Théophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning, 2017.