A Continuous Information Gain Measure to Find the Most Discriminatory Problems for AI Benchmarking

09/09/2018 ∙ by Matthew Stephenson, et al. ∙ 0

This paper introduces an information-theoretic method for selecting a small subset of problems which gives us the most information about a group of problem-solving algorithms. This method was tested on the games in the General Video Game AI (GVGAI) framework, allowing us to identify a smaller set of games that still gives a large amount of information about the game-playing agents. This approach can be used to make agent testing more efficient in the future. We can achieve almost as good discriminatory accuracy when testing on only a handful of games as when testing on more than a hundred games, something which is often computationally infeasible. Furthermore, this method can be extended to study the dimensions of effective variance in game design between these games, allowing us to identify which games differentiate between agents in the most complementary ways. As a side effect of this investigation, we provide an up-to-date comparison on agent performance for all GVGAI games, and an analysis of correlations between scores and win-rates across both games and agents.





Competitions and challenges are regularly used within AI as a way of evaluating algorithms, and also for promoting interest into specific problems. However, if the challenge poses a large set of possible problems it can often be impractical or even impossible to evaluate a new algorithm on every problem within this set. Comparing a new algorithm with the state of the art on the full set of problems can require immense computational resources, which are not available to many researchers. Therefore, a smaller set of problems is usually selected that intends to be representative of the entire problem space. This leads to the fundamental question: how should we select this subset of problems? This question is critical, as selecting a poorly representative subset of the problems available might leave out key aspects of the challenge, resulting in an unintentional bias that leads to specific solutions performing better than they would have on the entire problem set. If you have good knowledge of the domain, you could choose a set of problems with interestingly difficult design features, but there is no guarantee that these differences in design translate to meaningfully different challenges. Also in many cases, you do not have deep knowledge about the design of the different problems.

This issue is prevalent in any situation where it may be computationally prohibitive to test a new algorithm on all sub-problems presented. Examples of competitions or challenges where this is the case include the GVGAI [perez20162014], ALE [bellemare2013arcade] and Kaggle competitions [carpenter2011may], each of which has hundreds of separate problems. There are also other sets of machine learning benchmarks that contain a multitude of disparate tasks, such as the OpenAI Gym or the UCI repository of supervised learning tasks [uclrep]. Many of the participants in these challenges cherry-pick benchmarks where their new algorithm performs well. This will continue to happen as long as benchmark sets are so large that it is computationally infeasible to test on all benchmarks available. However, we should not reject good papers simply because the authors lack the resources to perform full evaluations on every benchmark possible. The solution to this dilemma is to test new algorithms on problems that are most relevant given the current set of well-performing algorithms, not simply those where the new algorithm performs best. This paper therefore proposes an approach for selecting a small number of problems out of a larger set for accurately testing an algorithm, and has great potential to revolutionize AI benchmarking across many different AI challenges.

As a first approach to the task of selecting which problems to test an algorithm on, we look at the correlations between different algorithms' performance on different problems. After testing an algorithm on one problem, another problem could be selected with an anti-correlated performance profile (i.e. one where other algorithms perform differently). This is an incomplete solution, however, most importantly because it does not tell us which problem to test on first and does not factor in the potential variability in agent performance. We instead propose an information-theoretic measure for determining which problems are best at telling a given set of algorithms apart; a measure that also takes into account the concept of noise when analyzing performance measures. By recursively applying this measure, we can find problems that are maximally informative considering previously selected problems, meaning that we can identify problems that discriminate among a set of algorithms in different ways [martinez2016ai].

We use the General Video Game AI (GVGAI) framework as a testbed for our method. The GVGAI library includes more than a hundred mini video games [bontrager2016matching], and several dozen different agents that can play these games [soemers2016enhancements, gaina2017rolling, weinstein2012bandit, perez2016analyzing, mendes2016hyper] have been submitted to the associated GVGAI competition  [perez2016general]. We present a formalised analysis on the correlations between agent performances across different games, and use our information-theoretic measure to select a subset of the GVGAI game library that accurately represents the full discriminative scope when testing on all games available (i.e. provides a diverse range of problems that best distinguishes between agents).


General Video Game Playing (GVGP) is an area of research that looks to expand upon the success of the General Game Playing (GGP) competition. The GGP competition offers a platform for researchers to create agents that can play a wide range of board games  [Genesereth2005]. While board games offer an interesting variety of problems, they do not offer real-time situations where rapid decision making is key, which is what the GVGP competition does by providing a library of video games as a research platform [Levine2013].

The GVGAI competition has been running annually since 2014 and provides a Video Game Description Language (VGDL) with which to quickly design games, and a common API for agents to access those games [Ebner2013]. Each year ten games are selected to evaluate the submitted agents, which often covers a wide range of game types from role-playing to puzzle games [perez2018general]. One of the key elements of this competition is that the games being played by the agents for each year’s competition are unknown to both the developers and agents beforehand. Many of the GVGAI games and most of the agents include some form of stochasticity, meaning that performance evaluation is inherently noisy. Playing a GVGAI game will also give two signals of performance, whether an agent won the game or not and what score was obtained [perez20162014]. The competition currently offers multiple tracks, including a single and multi-player planning track [twoplayer], which provides a forward model for analyzing future game states, and a learning track which removes the forward model but allocates a training time to agents before submission [learningtrack]. For this paper, we only consider the games and agents used in the single-player planning track.

One of the issues that we set out to tackle with this work is that the games within the GVGAI framework are currently not well documented. A few previous papers attempted to evaluate how certain agents perform on different GVGAI games [bontrager2016matching, Nelson2016InvestigatingVM], but none have investigated the different discrimination profiles presented by the full game corpus, or how this information could be used to help design better agents and games in the future. Because different GVGAI games pose different types of problems to agents, selecting a subset of games may introduce a bias towards a particular type of algorithm. Understanding what bias may exist in a given set of games, and being able to select ten games which minimize any particular bias, is desirable for ensuring that the competition is genuinely evaluating general problem-solving capabilities.

Data Collection

The first step towards analyzing different GVGAI games is to collect data from various playthroughs using a collection of agents. We used twenty-seven commonly available agents, which were some of the top performing entries in the previous GVGAI competitions over the last five years. The games that were used consist of the full corpus of 102 GVGAI games that are currently available (at the time of writing), plus an additional six deceptive GVGAI games introduced by Anderson et al.  [Anderson2018], making the total number of games equal to 108.

Similar to the GVGAI competition format, each agent has 40 milliseconds to perform each action and runs for at most 2000 time steps. To replicate the competition environment, we ran the agents using 243 CPU cores with 2.6 GHz and 8 GB of memory. Each successful playthrough of a game that resulted in either a win or loss without any crashes produces one unit of data, containing the information . Unfortunately, some of the agents occasionally crashed on certain games due to changes that the GVGAI framework has received over the years, so not all agents have the same amount of generated data. A total of 3,990,760 successful playthroughs were recorded across all agents and games, with an average of 1,368.6 data samples per game-agent pair.

Algorithm Performance

As a basis for further analysis, we compute both the average win-rate and score for each game-agent pair, with the results visualized in Figure 1. Looking at the win-rate in Figure 1(a) we can already see that there are some games where nearly all agents either win or lose, in which case the score seen in Figure 1(b) would be the deciding value. Conceptually, it seems that games which offer a large spread of performance values would be best at discriminating between good and bad agents in a competition, but we can also see in Figure 1 that not all games are necessarily won by the same agents.

(a) The average win-rate of each agent for each game.
(b) The average normalized score of each agent for each game.
Figure 1: The average performance of each agent for each game. Figure 1(a) shows the average win-rate, while Figure 1(b) shows the average score (normalised based on the highest score achieved by any agent for each game). The games and agents are sorted using a hierarchical clustering algorithm.

We investigated this further by performing a principal component analysis, where the agents' win-rates and scores are the data points, and the initial dimensions are the games. We found that the first ten principal components account for 80.1% of the variance. We also checked which agents did well along which axes, the results of which can be seen in Table 1. The fact that different agents perform well along different dimensions reinforces the point that it matters which subset of games is picked for a competition. By choosing games aligned with any of these principal dimensions, one could design a competition that would almost certainly be won by the top performer for that component.

Rank D1 Agents D2 Agents D3 Agents D4 Agents D5 Agents D6 Agents D7 Agents D8 Agents D9 Agents D10 Agents
#1 MaastCTS2 AtheneAI NovelTS bladerunner TomVodo adrienctx MaastCTS2 MaastCTS2 aStar MaastCTS2
#2 thorbjrn YBCriber aStar Return42 sampleMCTS TeamTopbug YBCriber ICELab adrienctx Number27
#3 NovTea NovTea TeamTopbug TomVodo SJA86 muzzle AtheneAI AtheneAI muzzle NovTea
#4 Return42 MaastCTS2 jaydee sampleMCTS CatLinux MaastCTS2 NovTea thorbjrn NovTea adrienctx
#5 AtheneAI ICELab muzzle muzzle Number27 MH2015 Return42 Number27 bladerunner muzzle
Table 1: Top five performing agents for each dimension using principal component analysis.

Correlation Analysis

In this section, we analyze the correlations between games in terms of agent performance, as well as between using win-rate and score as performance measures, as games which have similar performance patterns should have similar problem characteristics. The resulting correlation matrices are then used for clustering, and these clusters are analyzed for meaningful similarities between games.

Game / game correlation

(a) Win-rate correlation matrix.
(b) Score correlation matrix.
Figure 2: The correlation matrix between every game in the framework. Figure 2(a) is based on the agents’ win-rates, while Figure 2(b) is based on the agents’ scores. The games are sorted based on the result of a hierarchical clustering algorithm.

In the video game industry, similar video games are usually grouped under a specific category that is defined by common gameplay characteristics, referred to as a game genre. Games in the GVGAI framework are mostly ports of known video games, meaning that we can often find genre relations between them. However, attempting to group games by their genres does not necessarily indicate that similar problem-solving capabilities are required to solve them. We, therefore, took a more formal and robust approach for identifying correlations between games based on agent performance.

For this analysis, we calculated the correlation matrix between all 108 games in our sample using either the agents’ win-rates or scores. Figure 2 shows these correlation matrices where blue means high correlation, red means high anti-correlation, and white means no correlation. To simplify the task of analyzing such a large matrix, we clustered their values using a hierarchical clustering algorithm and selected clusters that minimize the variance between the games within each cluster. The different clusters are represented by the black vertical or horizontal lines and are ordered (and subsequently referred to) in terms of their location from the left/top of the matrix.
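The game/game correlation step can be sketched as follows. This is a minimal illustration in plain Python, assuming each game is summarised by one mean performance value per agent and that Pearson correlation is the measure used (the text does not name the coefficient). The game names and win-rates are hypothetical, and the hierarchical clustering step is omitted.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns None when either sequence is constant, since the
    correlation is undefined in that case.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(perf):
    """perf maps a game name to per-agent mean performance (same agent order)."""
    games = sorted(perf)
    return {g: {h: pearson(perf[g], perf[h]) for h in games} for g in games}

# Hypothetical mean win-rates of four agents on four made-up games.
win_rates = {
    "game_a": [0.1, 0.4, 0.6, 0.9],
    "game_b": [0.2, 0.5, 0.7, 1.0],  # same ordering of agents as game_a
    "game_c": [0.9, 0.6, 0.4, 0.1],  # reversed ordering of agents
    "game_d": [0.0, 0.0, 0.0, 0.0],  # every agent always loses
}
corr = correlation_matrix(win_rates)
```

In this toy data `game_a` and `game_b` are fully correlated, `game_a` and `game_c` are fully anti-correlated, and `game_d` yields no correlation value at all because every agent has the same result.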

Figure 2(a) shows the correlation matrix using the agents’ win-rates. Using this matrix, we can see that games in the fifth cluster have a low anti-correlation to the rest of the games in the framework. These games are characterized by either being very hard to beat (plants) or not having a winning condition (invest). By analyzing the clusters row by row, we can see that the win-rates of most games are not highly correlated except for the first three clusters. Most of the games within these first three clusters appear to be puzzle games (zenpuzzle, sokoban, cookmepasta are some examples). These types of games are typically characterized by the need for long-term planning to solve them, which likely causes their win-rates to be highly similar.

Figure 2(b) shows the correlation matrix using the agents’ scores. By looking closely, we can see that the score distribution is similar across most of the games (the matrix is mostly blue). This was not surprising as we know that most of the games in the framework are designed to have a score distribution that reflects the progress of the agents in the game (good states have high scores, while bad states have low scores). The only exceptions are the first three clusters, which are highly anti-correlated with every game in the framework except for those within their own cluster. These games appear to be characterized by a delayed score distribution (score only received near the end of the game), which makes them very different from the other games that provide rewards for incremental steps closer to the solution.

Win-rate / score correlation

Since many GVGAI games are designed so that the score heavily indicates progression towards the win condition, most of the 108 games within our sample had very high correlations between agent win-rates and score. In fact, 13 of the games had perfect correlation values of one (frogs, pokemon, racebet2, roadfighter, run, waitforbreakfast, watergame, x-racer, modality, portals, racebet, tercio and witnessprotected). However, some of the games had a very low or even negative correlation, with the ten lowest correlation games shown in Table 2. Note that some games such as flower and invest always result in the agent either losing or winning regardless of the actions they perform, meaning that these games do not provide any correlation measure.

Game name correlation coefficient
painter -0.90488026
rivers -0.62030330
surround -0.49545454
lemmings -0.37996316
chainreaction -0.35905795
donkeykong -0.11941969
lasers2 -0.04154697
boloadventures -0.02221504
beltmanager 0.01585323
deflection 0.05087994
Table 2: Games with the lowest correlation between win-rate and score.

From these results we can see that Painter has a very high negative correlation between win-rate and score, far more so than any other game. The likely reason for this is that the objective of Painter is to change the color of all of the tiles in the game to the same color, in as few steps as possible. Each time the color of a tile is changed the agent receives a score reward, so solutions that win the game quickly will often have a lower score. While this is a rather counter-intuitive idea, several other GVGAI games are also known to have this property within their design. This strong negative correlation between win-rate and score appears to be indicative of a particular type of deception in the game, the greed trap. This deceptive design element exploits the fact that most agents assume the reward structure for a game leads towards a goal state, by creating levels where this is not strictly the case [Anderson2018].

While our correlation matrices could be used to roughly identify a collection of games with decent discriminatory performance by selecting a game from each cluster, this approach has several limitations. It is difficult to tell which games in each cluster would provide the most information. It also ignores the fact that an agent’s performance on the same game can vary dramatically between attempts, and that two distinct performance measures are available. However, these correlation results can certainly be useful in other areas, such as for allowing game designers to understand which games are similar in terms of agent performance. Identifying which games present unique performance distributions could help in designing additional games that fit entirely new or underrepresented clusters. Accomplishing this would increase the overall discrimination potential of our total game set, and thus also increase the total amount of information that could be achieved from a subset of games (i.e. allows our proposed information-theoretic measure to be even more effective).

Information Gain Analysis

In this section, we analyze the information provided by each of our 108 sample games. Information here is used in the sense of Shannon Information Theory [eoit], and the information gain of a game is the average reduction in uncertainty regarding what algorithm we are testing, given the score and/or win-rate performance of that algorithm. This information gain measure can then be used to identify a benchmark set of games that provides us with the maximum information about our agents.

While it is possible to compute information gain on discretized data by first binning the mean performances of the different agents, this is problematic for two reasons. First, as long as all the agents’ results are at least somewhat different, it would be theoretically possible to obtain all information from just a single game. This situation would make calculating the information gain highly redundant, as nearly all of the games would give us the same maximal amount of information. Second, this approach disregards any noise within the measuring process. As an example, if we assume that the average results for two agents are .49 and .50 when playing a specific game, then a discretized information gain analysis would give these as two separate outcomes (assuming we binned to the nearest .01 value). This approach does not take into account the fact that repeated measurements would likely produce slightly different results, varied by some noise. Consequently, a game that gives us average results for two agents of .1 and .9, rather than our previous example of .49 and .50, would be much better suited to tell two agents apart, as the scores are likely to be significantly different even when taking noise into account. In essence, games with agent results that are furthest apart and with the lowest noise, provide the most information.

The following information gain formalism is an attempt to accurately measure this difference by modelling the noise within the agents’ performances as a Gaussian distribution. This approach calculates the information gain for a specific game $g$. Let us first define a few terms:

  • $A$: The set of all algorithms,

  • $a \in A$: A specific algorithm, having an average performance of $\mu_a$, with a variance of $\sigma_a^2$, for the game in question.

This allows us to approximate the conditional probability $P(a \mid b)$. This probability expresses how likely it is that we are observing the result of algorithm $a$, if we are in fact observing the average performance for algorithm $b$. In other words, how well does $a$ work as an explanation for what we see from $b$?

Equation 1 approximates this probability. It assumes that observations of performance are normally distributed, parameterized by the means and variances from their actual results. The numerator is the probability density function for a normal distribution based on $a$, computing how likely a result equal to the mean of $b$ is. The denominator is a normalization sum over all possible algorithms, ensuring that the overall probabilities sum to one.

$$P(a \mid b) = \frac{\frac{1}{\sqrt{2\pi\sigma_a^2}} \exp\left(-\frac{(\mu_b - \mu_a)^2}{2\sigma_a^2}\right)}{\sum_{a' \in A} \frac{1}{\sqrt{2\pi\sigma_{a'}^2}} \exp\left(-\frac{(\mu_b - \mu_{a'})^2}{2\sigma_{a'}^2}\right)} \qquad (1)$$

Computing the probabilities for all pairwise combinations of algorithms allows us to define a confusion matrix $C$ between different algorithms as:

$$C = \begin{pmatrix} P(a_1 \mid a_1) & \cdots & P(a_n \mid a_1) \\ \vdots & \ddots & \vdots \\ P(a_1 \mid a_n) & \cdots & P(a_n \mid a_n) \end{pmatrix} \qquad (2)$$

Each row of the matrix sums to 1, and the entries in the first row give our belief about the actual algorithm given that we observed the mean of algorithm $a_1$. The matrix of conditional probabilities can then be seen as an error matrix for a channel defined by using the game in question as a measurement device. This allows us to compute the mutual information for this channel, under the assumption that the input distribution is an equal distribution. This value is equivalent to the amount of information we get about what algorithm is used from observing the average performance result. Formally, we can define this as the mutual information between our belief distribution $\hat{A}$ and the distribution of the actual algorithm $A$, expressed in Equation 3.

$$I(\hat{A}; A) = H(\hat{A}) - H(\hat{A} \mid A) = \log_2 |A| + \frac{1}{|A|} \sum_{b \in A} \sum_{a \in A} P(a \mid b) \log_2 P(a \mid b) \qquad (3)$$

The a priori distribution of our beliefs, $\hat{A}$, is an equal distribution. If we observe the average performance of algorithm $b$ we get a distribution of $P(\hat{A} \mid b)$, as defined by the confusion matrix. The average information gain of observing these results is the average difference in the entropy before observation and after observation, $H(\hat{A}) - H(\hat{A} \mid A)$. The entropy of the equal distribution reduces to the $\log_2$ of the number of states, so we only need to compute the conditional entropy. A higher value here is more desirable, as the best games should provide us with the most information.
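The single-game measure described above can be sketched in a few lines of plain Python. This is an illustrative reading of Equations 1 to 3 rather than the authors' implementation: it assumes strictly positive variances, a uniform prior over algorithms, and logarithms in base 2. The function names and the two-agent example values (taken from the .49/.50 versus .1/.9 discussion, with an assumed variance of 0.01) are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution N(mean, var) at x; assumes var > 0."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def confusion_matrix(means, variances):
    """Row b holds P(a | b) for every candidate a, given that the mean of b was observed."""
    n = len(means)
    rows = []
    for b in range(n):
        densities = [normal_pdf(means[b], means[a], variances[a]) for a in range(n)]
        total = sum(densities)  # normalisation over all algorithms
        rows.append([d / total for d in densities])
    return rows

def information_gain(matrix):
    """log2(n) minus the average row entropy, in bits (uniform prior assumed)."""
    n = len(matrix)
    cond_entropy = 0.0
    for row in matrix:
        cond_entropy -= sum(p * math.log2(p) for p in row if p > 0.0) / n
    return math.log2(n) - cond_entropy

# Two agents with the same assumed noise level (variance 0.01) on two hypothetical games.
gain_close = information_gain(confusion_matrix([0.49, 0.50], [0.01, 0.01]))
gain_far = information_gain(confusion_matrix([0.10, 0.90], [0.01, 0.01]))
```

With these inputs `gain_close` is close to zero, because each agent explains the other's mean almost as well as its own, while `gain_far` is close to the one-bit maximum for two agents.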

Information gain for multiple games

The previous formalism allows us to quantify how much information a single specific game can provide us with about what algorithm is being used, but the information gained from looking at two games is always less than or equal to the sum of the information gain from both games individually. To address this, we can directly compute a confusion matrix for a pair, or any higher number, of games by extending the definition of the conditional probability to that presented in Equation 4.


Using this new conditional probability definition allows us to compute the information gain for any subset of games, by just picking a suitable set $G$ of games. The mutual information for the resulting confusion matrix is computed as usual. This means that the theoretical maximum information gain that any set of games could give is equal to $\log_2(|A|)$, where $|A|$ is the number of algorithms. In general, games that offer different kinds of information lose less information due to redundancy.
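A minimal sketch of the multi-game case follows. It assumes that performance on different games is independent, so that the per-game Gaussian densities multiply; this is one reading consistent with the independence assumption stated in the text, and the exact form of Equation 4 may differ. The data layout, function names, and toy values are hypothetical.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x; assumes var > 0."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def joint_confusion_matrix(means, variances, games):
    """means[g][a] and variances[g][a] describe agent a on game g.

    Row b holds P(a | b) when the mean results of b on all listed games are observed.
    """
    games = list(games)
    n = len(means[games[0]])
    rows = []
    for b in range(n):
        # Independence assumption: log-likelihoods add across games.
        logs = [
            sum(log_normal_pdf(means[g][b], means[g][a], variances[g][a]) for g in games)
            for a in range(n)
        ]
        peak = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(v - peak) for v in logs]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def information_gain(matrix):
    """log2(n) minus the average row entropy, in bits (uniform prior assumed)."""
    n = len(matrix)
    cond_entropy = 0.0
    for row in matrix:
        cond_entropy -= sum(p * math.log2(p) for p in row if p > 0.0) / n
    return math.log2(n) - cond_entropy

# Three agents: game "x" only separates agent 0, game "y" only separates agent 2.
means = {"x": [0.1, 0.9, 0.9], "y": [0.1, 0.1, 0.9]}
variances = {"x": [0.01] * 3, "y": [0.01] * 3}

gain_x = information_gain(joint_confusion_matrix(means, variances, ["x"]))
gain_y = information_gain(joint_confusion_matrix(means, variances, ["y"]))
gain_xy = information_gain(joint_confusion_matrix(means, variances, ["x", "y"]))
```

In this toy setup each game alone leaves two agents indistinguishable, while the pair comes close to the maximum for three agents, and the joint gain stays below the sum of the two individual gains.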

Game Name (win-rate) | Information gain | Game Name (score) | Information gain | Game Name (combined) | Information gain | Game Name (top 10) | Information gain (cumulative)
freeway | 1.17484168 | invest | 1.62405816 | freeway | 1.89430152 | freeway | 1.89430152
labyrinth | 1.10088062 | intersection | 1.13955416 | invest | 1.62405816 | invest | 3.08236771
tercio | 1.10018133 | freeway | 1.13619392 | intersection | 1.59362941 | labyrinthdual | 3.81992620
labyrinthdual | 1.08531707 | tercio | 1.10018133 | chopper | 1.48524965 | tercio | 4.22563462
iceandfire | 1.07275305 | watergame | 0.89206793 | tercio | 1.44693431 | sistersavior | 4.40856274
chopper | 1.06542656 | cops | 0.88658183 | labyrinthdual | 1.42090667 | avoidgeorge | 4.54036694
doorkoban | 0.98911214 | flower | 0.86746818 | iceandfire | 1.32455879 | escape | 4.60252506
hungrybirds | 0.91886839 | waitforbreakfast | 0.80128373 | hungrybirds | 1.32100004 | whackamole | 4.64444512
watergame | 0.89206793 | labyrinth | 0.78021437 | waitforbreakfast | 1.28983481 | chopper | 4.67138328
escape | 0.87721725 | realportals | 0.73246317 | doorkoban | 1.28593860 | watergame | 4.68457480
Table 3: The games with the highest information gain (using win-rate, score or both combined as measure of performance), as well as the top 10 games which collectively provide the highest information gain.

Combine win-rate and score together

Using the previous equations, we can calculate the information gain for a particular game or set of games, using either the win-rate or score as the measure of performance. However, it is also possible to calculate the total information gain based on both win-rate and score combined. To do this, we treat each of these cases as a separate game (i.e., for a particular game there are two variants, one where the win-rate is used as the measure of performance and one where the score is used). Since the distance is scaled by the variance, both win-rate and score can be translated into information in the same way. We can then use Equation 4 to calculate the total combined information gain of the game by letting the set of games consist of these two variants. This means we can create a single confusion matrix for each game that encompasses both the win-rates and scores of all agents. The first six columns of Table 3 show the 10 games with the highest information gain when using either the win-rate, score, or both of these combined as the measure of performance. In general, this approach allows for the combination of any scalar values expressed by the game, and can, therefore, be applied to a range of different gaming benchmarks, even those where games have entirely different performance measures.

Note, though, that the approximation used here operates under the assumptions that the performance measures are distributed independently. This is true for combining the same performance measure across different games, but not necessarily true for different performance measures, such as win-rate and score, for the same game. A more faithful, but also more complex, approximation could be achieved by using the Mahalanobis distance [mahalanobis1936generalized] instead of the sum of variances.

Top ten games (of 2018)

By initially selecting the game that provides the largest information gain (based on both win-rate and score combined) and then recursively selecting the game that adds the most information to the already selected games, we can create a set of 10 games that provide the most information possible. The rightmost two columns of Table 3 provide the 10 games that were chosen for this set in the order they were selected, along with the total cumulative information gain of the set after each game was added. These 10 games are also highlighted in red in Figures 1 and 2.
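The recursive selection procedure can be sketched as a greedy loop. The version below is a hedged illustration, not the authors' code: it assumes independent Gaussian noise per game (so per-game log-likelihoods add), a uniform prior over agents, and ties broken by alphabetical game name. All names and the toy data are hypothetical. Greedy selection of this kind is not guaranteed to find the globally best subset.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x; assumes var > 0."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def set_information_gain(means, variances, games):
    """Information gain in bits of a list of games, under the independence assumption."""
    if not games:
        return 0.0
    n = len(means[games[0]])
    cond_entropy = 0.0
    for b in range(n):
        logs = [
            sum(log_normal_pdf(means[g][b], means[g][a], variances[g][a]) for g in games)
            for a in range(n)
        ]
        peak = max(logs)
        weights = [math.exp(v - peak) for v in logs]
        total = sum(weights)
        for w in weights:
            p = w / total
            if p > 0.0:
                cond_entropy -= p * math.log2(p) / n
    return math.log2(n) - cond_entropy

def greedy_select(means, variances, k):
    """Pick up to k games, each time adding the game that maximises the set's gain."""
    selected, gains = [], []
    remaining = sorted(means)  # alphabetical order makes tie-breaking deterministic
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda g: set_information_gain(means, variances, selected + [g]),
        )
        selected.append(best)
        remaining.remove(best)
        gains.append(set_information_gain(means, variances, selected))
    return selected, gains

# Toy data: "x_copy" duplicates "x"; "y" is noisier but separates a different agent.
means = {"x": [0.1, 0.9, 0.9], "x_copy": [0.1, 0.9, 0.9], "y": [0.1, 0.1, 0.9]}
variances = {"x": [0.01] * 3, "x_copy": [0.01] * 3, "y": [0.04] * 3}
selected, gains = greedy_select(means, variances, 3)
```

Here `x` is chosen first, and `y` is chosen second even though `x_copy` has a higher individual gain than `y`, because the duplicate adds essentially nothing to what `x` already provides. This mirrors the redundancy effect described for intersection and freeway.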


The theoretical maximum information gain that any set of games could give is roughly 4.75 ($\log_2 27$, as there are 27 agents), so we can see from these results that after selecting only 3 or 4 games we can already get the majority of information about which agent is playing. It is worth noting that this set of games is not simply the ten games that individually provide the most information, as some of these games likely provide the same “kind” of information. For example, the game intersection had the third highest information gain when looking at each game individually but was not selected for our top 10 games set. This is likely because it provides the same information as one of the previously selected games. By looking at how this game is played and our correlation matrices in Figure 2, it would appear that this game is very close to the game freeway and would likely give similar information.
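As a quick check of these figures, the arithmetic below relates the cumulative values in Table 3 to the theoretical maximum for 27 agents. The ratios are rounded and are derived only from the numbers reported in the table.

```latex
\log_2 27 = 3 \log_2 3 \approx 4.755
\qquad
\frac{3.820}{4.755} \approx 0.80 \;\; \text{(after 3 games)}
\qquad
\frac{4.226}{4.755} \approx 0.89 \;\; \text{(after 4 games)}
\qquad
\frac{4.685}{4.755} \approx 0.985 \;\; \text{(after 10 games)}
```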

We can also compare the information gain provided by using just the win-rate or score for certain games, versus the combined information gain from using both. When looking at each of these performance measures separately it appears that Invest has the highest information gain, which is likely due to the large variation in possible scores that agents could achieve. However, as agents will always lose this game, either by spending too much money or the time limit expiring, the win-rate provides no information gain at all. Freeway, on the other hand, has a high information gain when using either win-rate or score, allowing it to have a combined information gain that is higher than Invest. It is worth reiterating that the combined information gain for a game is not simply the sum of its individual parameters, as some information may be shared between the different performance measures (calculation for combined information gain is subadditive).

Conclusions and Future Work

In this paper, we have proposed an information-theoretic method for selecting which problems to test a given algorithm on. This is particularly useful for the many cases where it is computationally infeasible to test a new algorithm on all benchmark problems. Our method is generally applicable to any situation where algorithms need to be tested on many problems, and is especially useful when the problems are noisy and/or have multiple performance metrics.

As part of developing this method we performed an in-depth analysis into the discriminatory capabilities of the games in the GVGAI framework, as well as the correlations between games in terms of agent performance. Our correlation analysis shows that there are substantial variations in agent performance between different GVGAI games, and the resulting correlation matrices can be used to cluster certain games together. Developing new GVGAI games that do not fit within these identified clusters would present an entirely new challenge for the current selection of agents, making them highly desirable. Games that have different discriminatory profiles from those that already exist would likely be more useful for investigating agent performance than those with similar profiles to previous games.

Our proposed information theory-based method provides a more principled approach to finding discriminatory games. We extend the notion of information gain to handle noisy feedback and to combine two feedback signals (win-rate and score). We also show how this measure can be applied recursively to find a small set of games that give us almost as much information about an agent as the full set of games would have. Analyzing this set of games reveals that several of them have “deceptive” qualities, where the score is not strongly correlated with win-rate. Future work could involve expanding the evaluation criteria to include additional data from agent playthroughs, such as the time required to solve a level or the number of moves used, which may help us to better differentiate between agents.

This research will hopefully allow future developers and researchers to accurately compare the performance of their new agent against the current set of evaluated agents without the need for exhaustive testing on the full GVGAI corpus. New agents can be tested on a set of “exploratory experiments” to help gauge how well the agent may perform on more detailed experiments that consider the entire GVGAI game set. This will be especially important in the future as more and more games are added to the GVGAI game library, resulting in significantly increased benchmark evaluation times. The proposed approach can also be used to evaluate games that are selected for future GVGAI competitions, to ensure that they present a diverse range of problems.

One thing to note here is that the specific results of our analysis are based on the combined game-agent ecosystem. Having a different set of agents could mean that different games, which might previously have been too hard, would suddenly become more discriminatory. Similarly, adding additional games can affect which games provide us with redundant information. Because of this, while the specific games identified here are interesting today, they might very well change in the future. However, we believe the more important contribution presented in this paper is the methodology that was used to select these games. Furthermore, this method, while used here on GVGAI games, could also easily be applied to other sets of problems and algorithms. Our approach can be generalized from just win-rates and scores to include any number of different outcome measures from other domains. This could even include problems outside of the traditional game space, as long as it is possible to obtain the mean and variance of each algorithm’s performance. Some obvious future applications would be to analyze the performance of multiple deep reinforcement learning algorithms on the Atari games in the Arcade Learning Environment (ALE) framework, and of supervised learning algorithms tested on datasets associated with Kaggle competitions.


Damien Anderson is funded by the Carnegie Trust for the Universities of Scotland as a PhD Scholar. Christoph Salge is funded by the EU Horizon 2020 programme under the Marie Sklodowska-Curie grant 705643. Ahmed Khalifa acknowledges the financial support from NSF grant (Award number 1717324 - “RI: Small: General Intelligence through Algorithm Invention and Selection.”).