AlphaStar: An Evolutionary Computation Perspective

02/05/2019 · Kai Arulkumaran et al., Imperial College London

In January 2019, DeepMind revealed AlphaStar to the world: the first artificial intelligence (AI) system to beat a professional player at the game of StarCraft II, representing a milestone in the progress of AI. AlphaStar draws on many areas of AI research, including deep learning, reinforcement learning, game theory, and evolutionary computation (EC). In this paper we analyze AlphaStar primarily through the lens of EC, presenting a new look at the system and relating it to many concepts in the field. We highlight some of its most interesting aspects: the use of Lamarckian evolution, competitive co-evolution, and quality diversity. In doing so, we hope to provide a bridge between the wider EC community and one of the most significant AI systems developed in recent times.

1. Background

The field of artificial intelligence (AI) has long sought to create artificial systems that can rival humans in their intelligence, and as such has looked to games as a way of challenging AI systems. Games are created by humans, for humans, and therefore have external validity as AI benchmarks (Yannakakis and Togelius, 2018).

After the defeat of the reigning chess world champion by Deep Blue in 1997, the next major milestone in AI versus human games came in 2016, when a Go grandmaster was defeated by AlphaGo (Silver et al., 2016). Both chess and Go were seen as some of the biggest challenges for AI, and arguably one of the few comparable tests remaining is to beat a grandmaster at StarCraft (SC), a real-time strategy (RTS) game. Both the original game and its sequel, SC II, have several properties that make them considerably more challenging than even Go: real-time play, partial observability, no single dominant strategy, complex rules that make it hard to build a fast forward model, and a particularly large and varied action space.

DeepMind recently took a considerable step towards this grand challenge with AlphaStar, a neural-network-based AI system that was able to beat a professional SC II player in December 2018 (Vinyals et al., 2019). This system, like its predecessor AlphaGo, was initially trained using imitation learning to mimic human play, and then improved through a combination of reinforcement learning (RL) and self-play. At this point the algorithms diverge, as AlphaStar utilises population-based training (PBT) (Jaderberg et al., 2017) to explicitly keep a population of agents that train against each other (Jaderberg et al., 2018). This part of the training process was built upon multi-agent RL and game-theoretic perspectives (Lanctot et al., 2017; Balduzzi et al., 2018), but the very notion of a population is central to evolutionary computation (EC), and hence we can examine AlphaStar through this lens as well.[1]

[1] Note that we present a high-level overview of general interest, and have left aside the many deep links between EC and game theory (Smith, 1982).

2. Components

2.1. Lamarckian evolution

Currently, the most popular approach to training the parameters of neural networks is backpropagation (BP). However, there are many methods to tune their hyperparameters, including evolutionary algorithms (EAs). A particularly synergistic approach is to use a memetic algorithm (MA), in which evolution is run as an outer optimisation algorithm, and individual solutions can be optimised by other means, such as BP, in an inner loop (Moscato, 1989). In this specific case, an MA can combine the exploration and global search properties of EAs with the efficient local search properties of BP.
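To make the outer/inner structure concrete, the following minimal Python sketch shows a generic memetic algorithm; it is written purely for illustration (it is not AlphaStar or PBT code), and the toy fitness function and the hill-climbing local_search standing in for BP are assumptions of ours.

import random

# Toy fitness: maximise -||x||^2 (optimum at the origin).
def evaluate(x):
    return -sum(v * v for v in x)

# Inner loop: simple hill climbing as a stand-in for BP-based training.
def local_search(x, steps=10, lr=0.1):
    for _ in range(steps):
        candidate = [v + lr * random.gauss(0, 1) for v in x]
        if evaluate(candidate) > evaluate(x):
            x = candidate
    return x

# Outer-loop variation operator.
def mutate(x, sigma=0.5):
    return [v + sigma * random.gauss(0, 1) for v in x]

population = [[random.gauss(0, 3) for _ in range(5)] for _ in range(20)]
for generation in range(50):
    population = [local_search(x) for x in population]   # inner loop refines
    population.sort(key=evaluate, reverse=True)
    parents = population[:10]                             # truncation selection
    population = parents + [mutate(random.choice(parents)) for _ in range(10)]

The key design choice is simply the division of labour: the inner loop refines each individual locally before the outer loop applies selection and variation globally.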

PBT (Jaderberg et al., 2017), used in AlphaStar to train agents, is an MA that uses Lamarckian evolution (LE)[2]: in the inner loop, neural networks are continuously trained using BP, while in the outer loop, networks are picked using one of several selection methods, such as binary tournament selection (Goldberg and Deb, 1991), with the winner's parameters overwriting the loser's; the loser also receives a mutated copy of the winner's hyperparameters. PBT was originally demonstrated on a range of supervised learning and RL tasks, tuning networks to higher performance than had previously been achieved. It is perhaps most beneficial in problems with highly non-stationary loss surfaces, such as deep RL, as it can change hyperparameters on the fly.

[2] A more extensive literature review on LE can be found in the original paper.
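As a hedged illustration of this exploit/explore step, based only on the description above and not on DeepMind's implementation, the sketch below shows how one agent might inherit a fitter rival's weights and a mutated copy of its hyperparameters; the Agent fields and the mutation factors are assumptions.

import copy
import random
from dataclasses import dataclass

# Hypothetical container for one population member.
@dataclass
class Agent:
    weights: list
    hyperparams: dict
    fitness: float = 0.0

# Perturb each hyperparameter by a random multiplicative factor.
def mutate(hyperparams, factors=(0.8, 1.2)):
    return {k: v * random.choice(factors) for k, v in hyperparams.items()}

def exploit_and_explore(agent, population):
    """Binary tournament against a random rival: if the rival is fitter,
    inherit its weights (the Lamarckian step) and a mutated copy of its
    hyperparameters; otherwise keep training unchanged."""
    rival = random.choice([a for a in population if a is not agent])
    if rival.fitness > agent.fitness:
        agent.weights = copy.deepcopy(rival.weights)    # parameters overwritten
        agent.hyperparams = mutate(rival.hyperparams)   # explore near the winner
    return agent

In PBT, each worker would interleave such a step with continued BP-based training of its own network, without waiting for the rest of the population.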

As a single network may take several gigabytes of memory, or need to train for several hours, scalability is key for PBT. As a consequence, PBT is both asynchronous and distributed (Nowostawski and Poli, 1999). Rather than running many experiments with static hyperparameters, the same amount of hardware can utilise PBT with little overhead: the outer loop reuses solution evaluations from the inner loop and requires relatively little communication. When considering the effect of non-stationary hyperparameters and the pre-emption of weaker solutions, the savings are even greater.

Another consequence of these requirements is that PBT is steady state (Syswerda, 1991), as opposed to generational EAs such as classic genetic algorithms. A natural fit for asynchronous EAs and LE, steady state EAs can allow the optimisation and evaluation of individual solutions to proceed uninterrupted and hence maximise resource efficiency. The fittest solutions survive longer, naturally providing a form of elitism/hall of fame, but even ancestors that aren't elites may be preserved, maintaining diversity.[3]

[3] When given an appropriate selection pressure (Miller and Goldberg, 1995).

2.2. Co-evolution

When optimising an agent to play a game, as in AlphaStar, it is possible to use self-play to let the agent improve against itself. Competitive co-evolutionary algorithms (CCEAs) can be seen as a superset of self-play: rather than keeping only a solution and its predecessors, it is possible to keep, and evaluate against, an entire population of solutions. Like self-play, CCEAs form a natural curriculum (Hillis, 1990), but they also confer additional robustness, as solutions are evaluated against a varied set of other solutions (Rosin and Belew, 1997; Stanley and Miikkulainen, 2004).

Through the use of PBT in a CCEA setting, Jaderberg et al. (2018) were able to train agents to play a first-person game from pixels, utilising BP-based deep RL in combination with evolved reward functions (Ackley and Littman, 1991). The design of CCEAs has many aspects (Popovici et al., 2012), and characterising this approach along them suggests many potential variants. Here, for example, the interaction method was atypically based on sampling agents with similar fitness evaluations (Elo ratings), but many other heuristics exist.
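As a hedged illustration of such an interaction method, the sketch below samples opponents with probability decaying in Elo distance and applies a standard Elo update after each match; the temperature, K-factor, and initial ratings are arbitrary choices of ours and none of this is drawn from AlphaStar's matchmaking code.

import math
import random

def sample_opponent(agent_id, ratings, temperature=100.0):
    """Sample an opponent with probability decaying in Elo distance."""
    others = [a for a in ratings if a != agent_id]
    weights = [math.exp(-abs(ratings[a] - ratings[agent_id]) / temperature)
               for a in others]
    return random.choices(others, weights=weights, k=1)[0]

def update_elo(ratings, winner, loser, k=32):
    """Standard Elo update after a match between two agents."""
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected)
    ratings[loser] -= k * (1 - expected)

# Example usage with five hypothetical agents.
ratings = {"agent_%d" % i: 1200.0 + 50.0 * i for i in range(5)}
opponent = sample_opponent("agent_0", ratings)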

2.3. Quality diversity

A major advantage of keeping a population of solutions, as opposed to a single one, is that the population can represent a diverse set of solutions. This is not restricted to multi-objective optimisation problems, but also applies to single objectives, where behaviour descriptors (BDs; i.e., solution phenotypes) can be used to pick solutions at the end. Quality diversity (QD) algorithms explicitly optimise for a single objective (quality), but also search for a large variety of solution types, via BDs, to encourage greater diversity in the population (Cully and Demiris, 2018b). Recently, Ecoffet et al. (2019) used a QD algorithm to reach another milestone in playing games with AI: their system was the first to solve Montezuma's Revenge, a platform game notorious for the difficulty of exploring its environment.
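For readers unfamiliar with QD, the toy sketch below of a MAP-Elites-style archive (one representative QD algorithm, not the mechanism used in AlphaStar) makes the quality/diversity split concrete: fitness decides which solution occupies a niche, while the behaviour descriptor decides which niche a solution belongs to. The two-dimensional BD and the toy fitness are illustrative assumptions.

import random

# Toy evaluation: returns (fitness, behaviour descriptor).
def evaluate(x):
    fitness = -sum(v * v for v in x)            # quality objective
    bd = (round(x[0], 1), round(x[1], 1))       # phenotype bin as the BD
    return fitness, bd

archive = {}                                    # BD -> (fitness, solution)
for _ in range(10000):
    if archive and random.random() < 0.9:
        parent = random.choice(list(archive.values()))[1]
        x = [v + 0.1 * random.gauss(0, 1) for v in parent]
    else:
        x = [random.uniform(-2, 2) for _ in range(2)]
    fitness, bd = evaluate(x)
    if bd not in archive or fitness > archive[bd][0]:
        archive[bd] = (fitness, x)              # keep only the best per niche

The archive ends up holding one high-quality solution per behavioural niche, which is exactly the kind of diverse-yet-strong set a population-based league also aims for.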

In SC, there is no single best strategy. Hence, the final AlphaStar agent consists of the set of solutions from the Nash distribution of the population: the set of complementary, least exploitable strategies (Balduzzi et al., 2018). In order to improve training, as well as to increase the variety of the final set of solutions, it therefore makes sense to explicitly encourage diversity. In doing so, AlphaStar can also be classified as a QD algorithm. In particular, agents may have game-specific BDs, such as building extra units of a certain type, but also criteria to beat a certain other agent,[4] criteria to beat a set of other agents, or even a mix of these. Furthermore, these specific criteria are also adapted online, which is relatively novel among QD algorithms (Wang et al., 2019). There is more that could be done here though: it may be possible to extract BDs from human data (Yannakakis and Togelius, 2018), or even to learn them in an unsupervised manner (Cully and Demiris, 2018a). And, given a set of diverse strategies, a natural next step is to infer which might work best against a given opponent, enabling online adaptation.

[4] A concept highly related to competitive fitness sharing in CCEAs (Rosin and Belew, 1997).
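To make the notion of a Nash distribution over a population concrete, the sketch below approximates the symmetric Nash mixture of an empirical, antisymmetric payoff matrix via fictitious play; this is a generic textbook construction with a toy rock-paper-scissors-like payoff, not the procedure used by AlphaStar nor the exact Nash averaging of Balduzzi et al. (2018).

import numpy as np

def nash_average(payoff, iters=2000):
    """Approximate the symmetric Nash mixture of an antisymmetric payoff
    matrix via fictitious play (empirical frequency of best responses)."""
    n = payoff.shape[0]
    counts = np.ones(n)                    # start from a uniform prior
    for _ in range(iters):
        mix = counts / counts.sum()        # current empirical mixture
        br = np.argmax(payoff @ mix)       # best response to that mixture
        counts[br] += 1
    return counts / counts.sum()

# Toy cyclic payoff (row beats the next strategy, loses to the previous one).
A = np.array([[0.0,  1.0, -1.0],
              [-1.0, 0.0,  1.0],
              [1.0, -1.0,  0.0]])
print(nash_average(A))                     # approximately uniform weights

A mixture of agents drawn this way from a win-rate matrix corresponds to the "complementary, least exploitable" set of strategies described above.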

3. Discussion

While AlphaStar is a complex system that draws upon many areas of AI research, we believe that viewing it as an EA is a hitherto undersold perspective. In particular, it combines LE, CCEAs, and QD to spectacular effect. We hope that this perspective will enable both the EC and deep RL communities to better appreciate and build upon this significant AI system.

References