Leveling the Playing Field - Fairness in AI Versus Human Game Benchmarks

03/17/2019 ∙ by Rodrigo Canaan, et al. ∙ NYU

From the beginning of the history of AI, there has been interest in games as a platform of research. As the field developed, human-level competence in complex games became a target researchers worked to reach. Only relatively recently has this target finally been met for traditional tabletop games such as Backgammon, Chess and Go. Current research focus has shifted to electronic games, which provide unique challenges. As is often the case with AI research, these results are liable to be exaggerated or misrepresented by either authors or third parties. The extent to which these game benchmarks consist of fair competition between human and AI is also a matter of debate. In this work, we review the statements made by authors and third parties in the general media and academic circles about these game benchmark results and discuss factors that can impact the perception of fairness in the contest between humans and machines.




I Introduction

Since the inception of AI as a research field, games have been a popular application. Alan Turing famously worked on a program that plays chess in 1948 [1], before there was even a machine that could run such a program. A popular way of evaluating such programs is by having them play a competent human player. Human-level competence in complex games became a target researchers worked to reach.

Decades later, this target finally started being met. From TD-Gammon’s [2] (1992) unorthodox play that challenged the prevailing consensus in backgammon theory, to Deep Blue’s [3] famous victory against Garry Kasparov in a 1997 showmatch, to AlphaGo’s [4] demonstration of the power of MCTS and neural networks in 2016, these achievements have helped advance AI research and shape the perception of AI by the general public.

After Go was beaten, focus started to shift to video games. Notable examples are the Arcade Learning Environment (ALE) [5], the Mario AI Competition [6], the General Video-Game AI Competition [7] and current advancements in Starcraft [8] and Dota 2 [9] agents.

As is often the case with AI research, these results are liable to be exaggerated or misrepresented by either authors or media. The extent to which these approaches provide a pathway towards “true AI” or “general AI” is also a matter of academic debate. The goal of this work is two-fold. The first goal is to research how game AI benchmarks and their achievements are portrayed by their authors, by outside media outlets and (to a smaller extent) by follow-up academic publications. These findings will be related to previously published guidelines on how to represent AI in media, such as those found in [10, 11].

The second goal is to discuss in what ways human and AI performance in games is meaningfully comparable, and how innate differences between human and machine can make such comparisons difficult, both in tabletop and electronic games.

II Portrayal of AI game benchmark achievements

In this section, we look at what claims were made by the original authors of some systems that achieved success in game benchmarks, and how those results were subsequently discussed in the general media and follow-up academic papers. Our main inspiration is previously published guidelines on how to write about AI research such as [10] and [11]. While the use of “suitcase words” [10] is almost unavoidable given the prevalence of terms such as intelligence, learning and prediction even in academic writing, more egregious violations will be explicitly noted. Our goal in this section is also to illustrate how game AI benchmarks are perceived by society, and what the main concerns are regarding the fairness of comparison between humans and AI programs.

We note that no statistical significance or quantitative fact about these portrayals is claimed. Articles were selected based on the discussion to be had on their portrayal of AI versus human game benchmarks.

II-A TD-Gammon

TD-Gammon [2] is a Backgammon-playing program developed by Gerald Tesauro at IBM using temporal-difference learning, a reinforcement learning technique where a neural network learns through self-play by minimizing the difference in its prediction of the outcome of the game between successive game states. Between 1991 and 1992, it played over a hundred games against some of the best players in the world across three different versions of the algorithm. The last version (TD-Gammon 2.1) came very close to parity with Bill Robertie, a former world champion, with a 40-game series decided by a single point.
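
The idea of learning from the difference between successive predictions can be illustrated with a minimal sketch. The code below is not TD-Gammon: it assumes a tabular value function instead of a neural network, and a hypothetical five-state random walk instead of backgammon. The function name, step size and episode count are illustrative choices. After each move, the prediction for the current state is nudged towards the prediction for the next state, or towards the actual outcome when the episode ends.

```python
import random


def td0_random_walk(episodes=20000, alpha=0.01, n_states=5, seed=0):
    """Tabular TD(0) on a toy random walk (hypothetical task, not backgammon).

    The walker starts in the middle state and steps left or right at random.
    Leaving on the right counts as outcome 1, leaving on the left as outcome 0.
    """
    rng = random.Random(seed)
    # values[i] is the predicted final outcome when standing in state i.
    values = [0.5] * n_states
    for _ in range(episodes):
        state = n_states // 2
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                target = 0.0                  # episode ended on the left
            elif next_state >= n_states:
                target = 1.0                  # episode ended on the right
            else:
                target = values[next_state]   # prediction at the next state
            # Temporal-difference update: shrink the gap between successive predictions.
            values[state] += alpha * (target - values[state])
            if next_state < 0 or next_state >= n_states:
                break
            state = next_state
    return values


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

In this toy task the exact answer is known (state i has value (i + 1) / 6), so the learned predictions can be checked against it, which is not possible in a game as large as backgammon.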

Tesauro highlights how observing the algorithm play has led to a change in how humans evaluate positions, especially in opening theory for the game. In particular, with some opening rolls, the system preferred “splitting” its back checkers rather than the riskier option, favored at the time, of “slotting” its 5-point. Since then, the splitting opening has been confirmed to be the superior choice by computer rollouts and is now the standard for the 2-1, 4-1 and 5-1 initial rolls.

When discussing applicability in other domains, Tesauro lists robot motor control and financial trading as potential applications, while cautioning that the lack of a forward model and the scarcity of data might limit success in these real-world environments. Not much discussion of TD-Gammon’s achievements was found in general media dating from the time of its release, but Woolsey, an analyst quoted in Tesauro’s paper [2], says that TD-Gammon’s algorithm is “smart” and learns “pretty much the same way humans do”, as opposed to “dumb” chess programs that merely calculate faster than humans.

II-B Deep Blue

Deep Blue [3] is a computational system for playing chess, designed by a team at IBM led by Murray Campbell. It uses a combination of specialized hardware with software techniques, such as tree search augmented by heuristics crafted by human experts, pruning, and databases of opening moves and endgame scenarios. It achieved enormous visibility in 1997 when it defeated the reigning champion Garry Kasparov in a six-game match under tournament conditions with a score of 3½–2½. Kasparov had previously beaten an earlier version of the system in 1996 by 4–2.
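
To make “tree search with pruning” concrete, the following is a minimal sketch of textbook alpha-beta search. It is not Deep Blue’s actual search, which ran on custom hardware with a far more elaborate evaluation function. The game tree is assumed to be given as nested Python lists whose leaves are heuristic scores from the maximizing player’s point of view; the function name and tree format are hypothetical.

```python
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Return the minimax value of a tree given as nested lists with numeric leaves."""
    if not isinstance(node, list):
        return node  # leaf: heuristic evaluation of the position
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the opponent already has a better option elsewhere
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizer already has a better option elsewhere
    return best


if __name__ == "__main__":
    # Three candidate moves, each answered by two possible replies.
    tree = [[3, 5], [2, 9], [0, 7]]
    print(alphabeta(tree))  # prints 3
```

In the example tree, the replies 9 and 7 are never evaluated: once a reply worth 2 (or 0) is found, that move is already worse for the maximizer than the first move, which guarantees 3.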

While the authors make no speculative claims in their paper describing the system [3], the same cannot be said about the media. One article from the Weekly Standard, with the ominous title “Be Afraid” [12], first claimed that the system’s “brute force” approach is “not artificial intelligence”, but mere calculation. This claim is backed by an interview of Deep Blue programmer Joe Hoane in the same article.

From there, however, the article argues that in the second game, “Deep Blue won. Brilliantly. Creatively. Humanly” from a position that allegedly does not benefit as much from brute-force calculation. The author then speculates that this amounts to passing a chess-specific Turing test and that, if machines can pass this test, they might eventually pass the more general Turing test and grow beyond our control and understanding. Ultimately, they might become “creatures sharing our planet who not only imitate and surpass us in logic, who have perhaps even achieved consciousness and free will, but are utterly devoid of the kind of feelings and emotions that, literally, humanize human beings”. This is an example of magical thinking and of the “Hollywood scenario” [10], where AI might gain human-like abilities in general intelligence through unspecified means unrelated to Chess research, and whose potentially dangerous results should be feared.

Other commentators, such as the author of a New York Times article [13], focus on Kasparov’s own reactions to the match, especially to the last game, which Kasparov conceded after 19 moves claiming he had lost his fighting spirit and that he, as a human being, is afraid when faced with something he does not understand. Kasparov also said that the match should have been longer, as he needs time to rest, and that previous games by Deep Blue should be made available. This final remark might be justified by the fact that Deep Blue’s “opening preparation was most extensive in those openings expected to arise in match play against Kasparov”, although ultimately “none of the Kasparov-specific preparation arose in the 1997 match.” [3].

Another article, also from the New York Times [14], starts by characterizing both Deep Blue and the human brain as information processing machines, so that, in this view, the match is “not a matter of man versus machine but machine versus machine”. The main difference is that Kasparov and humans have feelings such as fear and regret, which help control the many activities that can be performed by a human. Deep Blue has no such feelings. However, the article speculates that in the future, a potential machine called Deeper Blue might be able to model its opponent and even have life goals outside of chess, such as fame. A yet more advanced machine, Deepest Blue, might also have a model of itself (which might count as consciousness) and be vulnerable to psychological warfare, at which point humans would again stand a chance in a game of chess. This is also an example of speculation unrelated to the matter at hand (magical thinking), although with milder consequences than seen in [12].

II-C AlphaGo, AlphaGo Zero and AlphaZero

In 2016, AlphaGo [4], an agent developed by a group of Google DeepMind researchers led by David Silver, became the first program to beat a human Go champion in a match against Lee Sedol, in which AlphaGo won by 4–1. The system uses a combination of Monte Carlo Tree Search with convolutional neural networks, which learned from professional human games and self-play. In 2017, the group announced a new version, AlphaGo Zero [15], which learned entirely from self-play, with no human examples, and which was able to beat the previous AlphaGo version (AlphaGo Lee). Still in 2017, they announced AlphaZero, which uses a similar architecture (but different input representations and training) to beat other top engines in Go, Chess and Shogi.

The authors claim that the later versions of the system (i.e., AlphaGo Zero and AlphaZero) master the games without human help, or tabula rasa. These claims were scrutinized in a paper by Gary Marcus [16], who views the agent as an example of a hybrid system. In particular, he points out the inability of the system to generalize to variations of the game without further training. The system is also unable to learn the paradigm of tree search or the rules of the game, which humans are capable of.

Similar to Deep Blue, AlphaGo and its successors also received wide media coverage. An article from Wired [17] states in its title that AlphaGo and Lee Sedol, together, “redefined the future”, referring to two specific moves (which became famous as move 37 and move 78), the first by AlphaGo, the second by the human champion. Both moves defied all expert opinions and, indeed, were both evaluated by AlphaGo itself as having a probability of being played by a human close to one in ten million. An article by The Washington Post [17] also looks at move 37, and asks experts about its implications for creativity. One interviewee, Pedro Domingos, sees the move as creative, asking “if that’s not creative, then what is?”. Others, such as Jerry Kaplan, attribute the move to clever programming, not creativity of the software.

One final article worth discussing is “Why is Elon Musk afraid of AlphaGo-Zero?” [18]. While the article does not explicitly state that Elon Musk is right to be afraid, the title evokes the same fear as seen in [12], with an additional appeal to Elon Musk’s authority as a well-known figure in the technology industry to give the scenario more credibility.

II-D Electronic games

Electronic games (or video games) offer additional challenges to AI researchers compared to traditional tabletop games. Due to a combination of an almost continuous time scale (limited by the system’s frame rate) and potentially huge game state and action spaces, electronic games are typically even more intractable by brute-force search than games such as Go or Chess. As an example, the estimates quoted by Ontañón et al. [8] for the state space, branching factor and depth of Starcraft all far exceed the corresponding values for Go. As such, a number of video game AI benchmarks have been proposed. While the use of video games as AI benchmarks goes back a long way, interest in these benchmarks has spiked since AlphaGo’s results of 2016, as Go, which was considered among the most challenging tabletop games, was finally beaten and new, harder challenges had to be explored.

Some of these benchmarks encourage the development of general techniques that can be applied to a large number of problem domains, such as different games. That is the case of frameworks such as the Arcade Learning Environment (ALE) [19], where agents can be evaluated in one of hundreds of Atari 2600 games, and the General Video Game AI Competition [7], where agents are evaluated in previously unseen arcade-like games.

Other benchmarks have been proposed for specific games, such as Vizdoom [20] (first-person shooter) and the Mario AI Benchmark [6] (platform game). There are even benchmarks not focused on winning a game, but on building a level for a platform game [21] or, inspired by the Turing test, on playing in a way that is indiscernible from humans [22].

While all these benchmarks have garnered academic interest, none has arguably received as much general media coverage and player attention as the AI challenges for Starcraft [5] and Dota 2 (a game by Valve). The Starcraft challenge gained attention after Google DeepMind and Blizzard, the game’s publisher, released a reinforcement learning environment for the game. In Dota 2, different versions of agents developed by OpenAI went from defeating one of the best players in the world in a limited 1v1 version of the game, in a showmatch at an official Valve tournament in 2017 [23], to defeating a team of 5 semi-professional players in the 99th percentile of skill in another showmatch in 2018 [9, 24], to eventually losing to professional players in a showmatch during The International 8 [25], the biggest Dota 2 event of the year. The fact that both Starcraft and especially Dota 2 are popular eSports seems to have helped garner a lot of attention from the community of players as well.

Starcraft-playing agents are still unable to beat top human players, which has probably contributed to toning down the amount of speculation, but media outlets (and some researchers) seem to be betting on an AI victory in the near future [26].

For the remainder of this section, we will focus on Dota 2 AI media coverage, whose trajectory has been full of ups and downs and controversy.

A major point of debate has been the way the OpenAI agent visualizes and interacts with the game, as described in [9]. The high-level features used by the agent in its observations allow it to “see”, at any point in time, information such as the remaining health and attack value of all units in its view. A human would have to click on each unit, one by one, to view this information. The agent can also specify its actions at a high level by selecting ability, target, offset and even a time delay (from one to four frames). A human would have to make a combination of key presses and imprecise mouse movements to achieve the same effect.

An article on Motherboard [27] has described the advantages provided to the AI as “basically cheating”, summing up that “Open AI Five plays like an entire team with programmable mice and telepathy”. This statement is framed by the fact that humans have been disqualified from tournaments before due to the use of programmable macro actions. The article also proposes that the agent should learn directly from visuals.

In a blog post [28], AI researcher Mike Cook, while ultimately having a positive view on the benchmark, also comments on the interface advantages, drawing attention to some highlights of the games where, even though the agents have a reaction speed of 200ms (in theory comparable to humans), they executed key actions such as interrupting a spell or coordinating powerful abilities in a way that is seemingly impossible for humans. Cook also warned about the potential of the AI to fall prey to techniques it has never encountered in its self play (such as the technique of pulling or unusual hero lineups) and that good performance in a few facets of the game (such as teamfighting) might give the illusion of greater overall competence in the game.

A final critique against OpenAI’s agents came from the number of simplifications that had to be made to tackle a game as complex as Dota 2, such as playing with a reduced Hero pool, the inability to fight Roshan (a powerful NPC that typically takes risky team-wide efforts to kill, but drops a valuable reward and is often the focus of game-deciding fights between teams) and the choice to have individual invulnerable couriers per player (as opposed to a vulnerable courier shared by the entire team). These critiques can be seen in game forums such as [29, 30] and ultimately led to OpenAI’s decision to drop most restrictions in preparation for the final matches at The International 8, which OpenAI lost [25].

III A discussion of human-AI comparisons in game benchmarks

At first glance, the issue of fair conditions between a computer agent and a human seems more tractable in tabletop games, such as Backgammon, Chess and Go, than in electronic games. A major difference between the two domains seems to be the input and output interfaces between the algorithm and the game itself.

This issue seems to be less applicable to tabletop games. Modifying AlphaGo with a camera to read the board state and a robotic arm to move the pieces (as opposed to using a human facilitator to receive input and effect its output on a real game board) might pose interesting computer vision and robotics problems on their own, but none of the comments we have seen argue that such improvements would make for a better or fairer Go player.

Due to this significant difference, we divide discussion below between tabletop games and electronic games. Issues discussed for tabletop games in general also apply to electronic games, but the reverse is not true.

III-A Tabletop games

The first key issue affecting the fairness between human and artificial players in tabletop games is the role of feelings such as fatigue, fear and anxiety. In [13], Kasparov comments on the role these factors can play in a match. In [31], Ke Jie, another prominent Go player who has also lost to AlphaGo, stated that psychological factors are possibly “the weakest part of human beings”. It is a regular occurrence for sports commentators to also build a narrative around the mental factors going into an important match, especially one where a lot of pride or money is involved. The magnitude of the psychological effect is unclear from this brief study but, to the degree to which it might change the outcomes, compensating for it is also not trivial. There is no straightforward way to account for these emotions in a computer simulation, and attempting to do so (e.g. by artificially injecting noise in the algorithm’s evaluation in situations of high stress) would defeat the purpose of building the best possible game-playing systems.

A second issue that can be raised is the use of look-up tables for specific points of a match, such as the opening and endgame. These have been used in Deep Blue [3] and suggested as a potential improvement for TD-Gammon [2] when noticing that TD-Gammon’s biggest weakness was in endgame situations. However, such databases of moves and positions would clearly not be allowed in human tournament play.

An argument could be made, following the Extended Mind [32], that whether such a database is internal or external to a system makes little difference when considering the system’s cognitive abilities. Taking this argument to one extreme, the whole artificial game-playing system could be viewed as a mere augmentation of the human’s cognitive abilities, leading us to the absurd scenario of a “human versus AI” match where nonetheless all moves are selected by the same algorithm, one playing for itself, the other in the human’s stead. Taken to the opposite extreme, we could deem the use of these databases by the AI as inherently unfair, which could lead into a rabbit hole of judging exactly which techniques are to be considered fair, possibly culminating with a Chinese Room scenario where no AI achievements are ever to be considered as proof of mastery in a game, as they can all be reduced to a human following instructions on a piece of paper.

A third factor relates to the availability of information about one’s opponent in a match. If an algorithm is capable of studying examples of human play in general (as is the case for the original AlphaGo [4]) or even has some of its parameters or design decisions tuned to face a specific human player (as happened with Deep Blue [3]), wouldn’t it be fair for a human to review a large number of games by an artificial agent, receive a detailed summary of its preferred openings and strategies, perhaps even inspect the source code?

The second and third issues could be alleviated as the methods behind existing engines rely less on human examples and pre-calculated lookup tables. Similarly, all three issues can be alleviated as the gap widens between state-of-the-art game-playing agents and human players. In a close series such as the one between Deep Blue and Kasparov in 1997, it is conceivable that fatigue, anxiety, endgame databases or specific opponent knowledge could have played a significant role in the end result. Today’s top chess engines (over 20 years later), however, play at an estimated Elo rating of 3500 [33], compared to around 2800 for top humans [FIDE], a gap too wide to be plausibly explained by these factors. Still, these concerns about the fairness of comparison between machines and humans could be relevant for new tasks where human performance might be achieved in the near future, and where the machine’s margin of victory is potentially still small.
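
The size of such a rating gap can be made concrete with the standard Elo expected-score formula, E_A = 1 / (1 + 10^((R_B - R_A) / 400)). The minimal sketch below assumes, for illustration only, that the engine rating of 3500 and the human rating of 2800 lie on a common scale, which is not guaranteed since they come from different rating pools. The function name is hypothetical.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score (win = 1, draw = 0.5, loss = 0) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    human = elo_expected_score(2800, 3500)
    # Roughly 0.017 points per game for the human under these assumptions.
    print(f"expected human score per game: {human:.3f}")
```

Under these assumptions, a 700-point gap leaves the human with an expected score of under 2% per game, which is hard to attribute to fatigue or opening preparation alone.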

A fourth and final issue is the machine’s ability to generalize what it has learned across different games or variations of the same game. According to Brooks [10], humans are prone to infer competence from performance. As humans, we might expect a system that performs as the best Go player in the world to be competent enough to play on a board of different dimensions, or play with a different goal (such as the intent to lose), or be at least a passable player in another similar game (such as chess). Marcus [16] points out that this is not the case with most existing techniques. This fact can be seen as disappointing, and it is one of the motivations behind competition frameworks such as ALE [19], GVGAI [7] and GGP [34].

III-B Electronic games

The major difference between tabletop games and electronic games when it comes to the perception of fairness seems to be rooted in the representation of the observation and action spaces, as well as reaction time, as discussed in [27, 28].

Regarding the observation space, a common paradigm to solve the issue is playing the game from pixels, rather than from higher-level game features. This is the approach followed by Vizdoom [20] and ALE [19]. While the approach can be said to more closely emulate the way humans perceive video games, the comparison is not perfect. On one hand, favoring the AI, questions such as “is the distance between these two objects smaller than X?” are still much easier to answer accurately for an agent playing from pixels than for a human, and could benefit the agent when aiming an ability. On the other hand, when a human sees pixels in the shape of a coin, a spider and fire, they can reasonably infer that the first object has to be collected, the second attacked and the third avoided, and such a heuristic would work well for many games. Embedding this representation and real-world knowledge in a visual AI system is an unsolved problem, which provides humans with an advantage that is not easy to surmount at the moment.

While objections to high-level representations are valid, taken to the extreme, these objections would imply that no meaningful advancements could be made in video game-playing AI before the field of computer vision is essentially solved. This would be disappointing from a game AI perspective. After all, low-level recognition of pixel patterns is not what immediately comes to mind when we picture a human expertly playing a game. Results obtained on less structured or more general representations can fairly be characterized as more impressive, but the challenges involved in dealing with lower-level representations don’t necessarily capture what makes games such interesting AI problems in the first place.

For this reason, these challenges shouldn’t be a barrier for game AI research, especially in environments where humans currently have the upper hand. We believe novel results using higher-level representations are important. Further research that attempts to replicate these results while using less favorable or more general representations is also important, and will likely follow naturally from the initial results.

Similarly, the representation of the action space can take many forms, ranging from the high-level actions specifying ability, target and offset present in OpenAI’s methodology [9], to simulated medium-level user interface commands such as screen movements and unit highlighting in Starcraft [5], to direct simulation of a virtual controller as in ALE [19]. The extreme position would be to insist on a robotic arm manipulating a physical controller or keyboard, which would again distract researchers from other legitimate game AI problems that can be tackled with higher-level representations.

Excessively fast reaction speed is often cited as one of the factors that make an agent play in a way that is perceived as artificial [35]. A popular solution, used by OpenAI [9], is to directly enforce a specific reaction time. Alternative solutions involve “sticky actions” and other methods discussed in [36] for the ALE environment. Interestingly, the original motivation for these methods was not to emulate human play, but to provide enough randomness to the otherwise deterministic ALE environment to force the agent to learn “closed loop policies” that react to a perceived game state, rather than “open loop policies” that merely memorize effective action sequences. As a side effect, they also help avoid inhuman reaction speeds.
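
Both ideas can be sketched as small filters placed between the agent and the game. The code below is a minimal illustration under stated assumptions, not the implementation used by OpenAI or ALE: the class names, the `filter` method, the integer action encoding and the default stickiness of 0.25 are all hypothetical choices, and the enforced reaction time is modelled here as a fixed delay on actions.

```python
import random
from collections import deque


class StickyActions:
    """With probability `stickiness`, the previously executed action is repeated."""

    def __init__(self, stickiness=0.25, seed=None):
        self.stickiness = stickiness
        self.rng = random.Random(seed)
        self.previous = None  # last action actually sent to the game

    def filter(self, action):
        if self.previous is not None and self.rng.random() < self.stickiness:
            action = self.previous  # the agent's new choice is ignored this frame
        self.previous = action
        return action


class ReactionDelay:
    """Delays every action by a fixed number of frames, sending `noop` meanwhile."""

    def __init__(self, delay_frames, noop=0):
        self.queue = deque([noop] * delay_frames)

    def filter(self, action):
        self.queue.append(action)
        return self.queue.popleft()


if __name__ == "__main__":
    delay = ReactionDelay(delay_frames=2, noop=0)
    print([delay.filter(a) for a in [1, 2, 3, 4]])  # prints [0, 0, 1, 2]
```

Because a sticky filter makes the outcome of a memorized action sequence unpredictable, an agent has to observe the resulting state before choosing its next action, which is the closed-loop behaviour described above.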


IV Conclusion

We have briefly described some of the most relevant game AI benchmark results of the past three decades, for both tabletop games (Backgammon, Chess and Go) and electronic games (especially Starcraft and Dota 2). We have also looked at some of the claims made by the authors of these game-playing systems and some third-party comments made by general media and research communities. While many examples were found of constructive questioning of what constitutes fairness in contests between humans and AI, some examples of violations of the principles of AI coverage in the media could also easily be found.

We also listed and discussed some factors that might affect the perception of fairness between human and AI. For both tabletop and electronic games, human feelings such as fatigue, fear and anxiety, the ability of artificial agents to study the style of a specific human opponent, and techniques such as databases of moves can be construed as unfair advantages of AI. However, as the gap between the best agents and humans widens (for games where the AI has an advantage), it becomes harder to argue that these are the deciding factors in the outcome.

For video games specifically, most of the discussion is centered around the interface used by the agent for input and output, and also on the unfairness of a reaction time that is faster than humanly possible. We argue that while (all else being equal) results using architectures that interact with the game in a human-like fashion are more impressive, this should not discourage research done using more high-level representations. It is likely that, once these games are eventually beaten using more generous architectures, work that attempts to achieve similar performance while reducing the AI program’s inherent advantages will quickly follow.