Introduction
Competitions and challenges are regularly used within AI as a way of evaluating algorithms, and also for promoting interest into specific problems. However, if the challenge poses a large set of possible problems it can often be impractical or even impossible to evaluate a new algorithm on every problem within this set. Comparing a new algorithm with the state of the art on the full set of problems can require immense computational resources, which are not available to many researchers. Therefore, a smaller set of problems is usually selected that intends to be representative of the entire problem space. This leads to the fundamental question: how should we select this subset of problems? This question is critical, as selecting a poorly representative subset of the problems available might leave out key aspects of the challenge, resulting in an unintentional bias that leads to specific solutions performing better than they would have on the entire problem set. If you have good knowledge of the domain, you could choose a set of problems with interestingly difficult design features, but there is no guarantee that these differences in design translate to meaningfully different challenges. Also in many cases, you do not have deep knowledge about the design of the different problems.
This issue is prevalent in any situation where it may be computationally prohibitive to test a new algorithm on all subproblems presented. Examples of competitions or challenges where this is the case include the GVGAI [perez20162014], ALE [bellemare2013arcade] and Kaggle competitions [carpenter2011may], each of which has hundreds of separate problems. There are also other sets of machine learning benchmarks that contain a multitude of disparate tasks, such as the OpenAI Gym [openaigym] or the UCI repository of supervised learning tasks [uclrep]. Many of the participants in these challenges cherry-pick benchmarks where their new algorithm performs well. This will continue to happen as long as benchmark sets are so large that it is computationally infeasible to test on all benchmarks available. However, we should not reject good papers simply because the authors lack the resources to perform full evaluations on every benchmark possible. The solution to this dilemma is to test new algorithms on problems that are most relevant given the current set of well-performing algorithms, not simply those where the new algorithm performs best. This paper therefore proposes an approach for selecting a small number of problems out of a larger set for accurately testing an algorithm, and has great potential to revolutionize AI benchmarking across many different AI challenges.
As a first approach to the task of selecting which problems to test an algorithm on, we look at the correlations between different algorithms’ performance on different problems. After testing an algorithm on one problem, another problem could be selected with an anticorrelated performance profile (i.e. one where other algorithms perform differently). This is an incomplete solution, however, most importantly because it does not tell us which problem to test on first and does not factor in the potential variability in agent performance. We instead propose an information-theoretic measure for determining which problems are best at telling a given set of algorithms apart; a measure that also takes into account the concept of noise when analyzing performance measures. By recursively applying this measure, we can find problems that are maximally informative considering previously selected problems, meaning that we can identify problems that discriminate among a set of algorithms in different ways [martinez2016ai].
We use the General Video Game AI (GVGAI) framework as a testbed for our method. The GVGAI library includes more than a hundred mini video games [bontrager2016matching], and several dozen different agents that can play these games [soemers2016enhancements, gaina2017rolling, weinstein2012bandit, perez2016analyzing, mendes2016hyper] have been submitted to the associated GVGAI competition [perez2016general]. We present a formalised analysis of the correlations between agent performances across different games, and use our information-theoretic measure to select a subset of the GVGAI game library that accurately represents the full discriminative scope when testing on all games available (i.e. provides a diverse range of problems that best distinguishes between agents).
Background
General Video Game Playing (GVGP) is an area of research that looks to expand upon the success of the General Game Playing (GGP) competition. The GGP competition offers a platform for researchers to create agents that can play a wide range of board games [Genesereth2005]. While board games offer an interesting variety of problems, they do not offer real-time situations where rapid decision making is key, which is what the GVGP competition does by providing a library of video games as a research platform [Levine2013].
The GVGAI competition has been running annually since 2014 and provides a Video Game Description Language (VGDL) with which to quickly design games, and a common API for agents to access those games [Ebner2013]. Each year ten games are selected to evaluate the submitted agents, which often cover a wide range of game types from role-playing to puzzle games [perez2018general]. One of the key elements of this competition is that the games being played by the agents for each year’s competition are unknown to both the developers and agents beforehand. Many of the GVGAI games and most of the agents include some form of stochasticity, meaning that performance evaluation is inherently noisy. Playing a GVGAI game also gives two signals of performance: whether an agent won the game or not, and what score was obtained [perez20162014]. The competition currently offers multiple tracks, including single-player and multiplayer planning tracks [twoplayer], which provide a forward model for analyzing future game states, and a learning track which removes the forward model but allocates a training time to agents before submission [learningtrack]. For this paper, we only consider the games and agents used in the single-player planning track.
One of the issues that we set out to tackle with this work is that the games within the GVGAI framework are currently not well documented. A few previous papers attempted to evaluate how certain agents perform on different GVGAI games [bontrager2016matching, Nelson2016InvestigatingVM], but none have investigated the different discrimination profiles presented by the full game corpus, or how this information could be used to help design better agents and games in the future. The fact that different GVGAI games pose different types of problems to agents may lead to biases towards a particular type of algorithm when selecting game subsets. Understanding what bias may exist in a given set of games, and being able to select ten games which minimize any particular bias, is desirable for ensuring that the competition is genuinely evaluating general problem-solving capabilities.
Data Collection
The first step towards analyzing different GVGAI games is to collect data from various playthroughs using a collection of agents. We used twenty-seven commonly available agents, which were some of the top performing entries in the previous GVGAI competitions over the last five years. The games that were used consist of the full corpus of 102 GVGAI games that are currently available (at the time of writing), plus an additional six deceptive GVGAI games introduced by Anderson et al. [Anderson2018], making the total number of games equal to 108.
Similar to the GVGAI competition format, each agent has 40 milliseconds to perform each action and runs for at most 2000 time steps. To replicate the competition environment, we ran the agents using 243 CPU cores with 2.6 GHz and 8 GB of memory. Each successful playthrough of a game that resulted in either a win or loss without any crashes produces one unit of data. Unfortunately, some of the agents occasionally crashed on certain games due to changes that the GVGAI framework has received over the years, so not all the agents have the same amount of generated data. A total of 3,990,760 successful playthroughs were recorded across all agents and games, with an average of 1,368.6 data samples per game-agent pair.
Algorithm Performance
As a basis for further analysis, we compute both the average winrate and score for each game-agent pair, with the results visualized in Figure 1. Looking at the winrate in Figure 1(a) we can already see that there are some games where nearly all agents either win or lose, in which case the score seen in Figure 1(b) would be the deciding value. Conceptually, it seems that games which offer a large spread of performance values would be best at discriminating between good and bad agents in a competition, but we can also see in Figure 1 that not all games are necessarily won by the same agents.
Figure 1: (a) shows the average winrate and (b) shows the average score (normalised based on the highest score achieved by any agent for each game) for each game-agent pair. The games and agents are sorted using a hierarchical clustering algorithm.
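The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(game, agent, won, score)`, the function name `summarise`, and the agent names are hypothetical, and the score is normalised by the highest mean score on each game, which is one possible reading of "highest score achieved by any agent".

```python
from collections import defaultdict
from statistics import mean


def summarise(playthroughs):
    """Return {(game, agent): (winrate, normalised mean score)}.

    `playthroughs` is an iterable of (game, agent, won, score) tuples;
    this layout is an assumption made for the example.
    """
    wins = defaultdict(list)
    scores = defaultdict(list)
    for game, agent, won, score in playthroughs:
        wins[(game, agent)].append(1.0 if won else 0.0)
        scores[(game, agent)].append(float(score))

    mean_scores = {key: mean(values) for key, values in scores.items()}

    # Highest mean score reached by any agent on each game.
    best = {}
    for (game, _agent), value in mean_scores.items():
        best[game] = max(best.get(game, value), value)

    summary = {}
    for key, value in mean_scores.items():
        top = best[key[0]]
        # Assumed convention: games whose best score is not positive map to 0.
        normalised = value / top if top > 0 else 0.0
        summary[key] = (mean(wins[key]), normalised)
    return summary


# Hypothetical records for two made-up agents on one game.
records = [
    ("freeway", "agentA", True, 10),
    ("freeway", "agentA", False, 6),
    ("freeway", "agentB", False, 4),
    ("freeway", "agentB", False, 4),
]
table = summarise(records)
```

The same pass could also collect the per-pair variances, which the noise-aware measure introduced later requires.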
We investigated this further by performing a principal component analysis, where the agents’ winrates and scores on the games are the data points, and the initial dimensions are the games. We found that the first ten principal components account for 80.1% of the variance. We also checked which agents did well along which axes, the results of which can be seen in Table 1. The fact that different agents perform well along different dimensions reinforces the point that it matters which subset of games is picked for a competition. By choosing games aligned with any of these principal dimensions, one could design a competition that would almost certainly be won by the top performer for that component.
Table 1: The five best-performing agents along each of the first ten principal components (D1 to D10).
Rank  D1 Agents  D2 Agents  D3 Agents  D4 Agents  D5 Agents  D6 Agents  D7 Agents  D8 Agents  D9 Agents  D10 Agents

#1  MaastCTS2  AtheneAI  NovelTS  bladerunner  TomVodo  adrienctx  MaastCTS2  MaastCTS2  aStar  MaastCTS2
#2  thorbjrn  YBCriber  aStar  Return42  sampleMCTS  TeamTopbug  YBCriber  ICELab  adrienctx  Number27
#3  NovTea  NovTea  TeamTopbug  TomVodo  SJA86  muzzle  AtheneAI  AtheneAI  muzzle  NovTea
#4  Return42  MaastCTS2  jaydee  sampleMCTS  CatLinux  MaastCTS2  NovTea  thorbjrn  NovTea  adrienctx
#5  AtheneAI  ICELab  muzzle  muzzle  Number27  MH2015  Return42  Number27  bladerunner  muzzle
Correlation Analysis
In this section, we analyze the correlations between games in terms of agent performance, as well as between using winrate and score as performance measures, as games which have similar performance patterns should have similar problem characteristics. The resulting correlation matrices are then used for clustering, and these clusters are analyzed for meaningful similarities between games.
Game / game correlation
In the video game industry, similar video games are usually grouped under a specific category that is defined by common gameplay characteristics, referred to as a game genre. Games in the GVGAI framework are mostly ports of known video games, meaning that we can often find genre relations between them. However, attempting to group games by their genres does not necessarily indicate that similar problemsolving capabilities are required to solve them. We, therefore, took a more formal and robust approach for identifying correlations between games based on agent performance.
For this analysis, we calculated the correlation matrix between all 108 games in our sample using either the agents’ winrates or scores. Figure 2 shows these correlation matrices where blue means high correlation, red means high anticorrelation, and white means no correlation. To simplify the task of analyzing such a large matrix, we clustered their values using a hierarchical clustering algorithm and selected clusters that minimize the variance between the games within each cluster. The different clusters are represented by the black vertical or horizontal lines and are ordered (and subsequently referred to) in terms of their location from the left/top of the matrix.
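As a rough illustration of the correlation step, the sketch below builds a game-by-game Pearson correlation table from per-agent winrates using only the standard library. The function names and all numeric values are hypothetical, and the hierarchical clustering step is not shown. Games on which every agent obtains the same result have no defined correlation, so the sketch returns `None` for them.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # constant results: the correlation is undefined
    return cov / math.sqrt(var_x * var_y)


def game_correlations(performance):
    """performance[game] lists one value per agent, in the same agent order."""
    games = sorted(performance)
    return {(g, h): pearson(performance[g], performance[h])
            for g in games for h in games}


# Made-up winrates of three agents on four games (illustrative only).
winrates = {
    "sokoban": [0.1, 0.4, 0.8],
    "zenpuzzle": [0.2, 0.5, 0.9],
    "plants": [0.3, 0.2, 0.1],
    "invest": [0.0, 0.0, 0.0],
}
corr = game_correlations(winrates)
```

With these made-up numbers the two puzzle games are perfectly correlated, `plants` is strongly anticorrelated with both, and every entry involving `invest` is undefined.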
Figure 2(a) shows the correlation matrix using the agents’ winrates. Using this matrix, we can see that games in the fifth cluster have a low anticorrelation to the rest of the games in the framework. These games are characterized by either being very hard to beat (plants) or not having a winning condition (invest). By analyzing the clusters row by row, we can see that the winrates of most games are not highly correlated except for the first three clusters. Most of the games within these first three clusters appear to be puzzle games (zenpuzzle, sokoban, cookmepasta are some examples). These types of games are typically characterized by the need for long-term planning to solve them, which likely causes their winrates to be highly similar.
Figure 2(b) shows the correlation matrix using the agents’ scores. By looking closely, we can see that the score distribution is similar across most of the games (the matrix is mostly blue). This was not surprising as we know that most of the games in the framework are designed to have a score distribution that reflects the progress of the agents in the game (good states have high scores, while bad states have low scores). The only exceptions to this are the first three clusters, which are highly anticorrelated with every game in the framework except for those within their own cluster. These games appear to be characterized by a delayed score distribution (score only received near the end of the game) which makes them very different from the other games that provide rewards for incremental steps closer to the solution.
Winrate / score correlation
Since many GVGAI games are designed so that the score heavily indicates progression towards the win condition, most of the 108 games within our sample had very high correlations between agent winrates and score. In fact, 13 of the games had perfect correlation values of one (frogs, pokemon, racebet2, roadfighter, run, waitforbreakfast, watergame, xracer, modality, portals, racebet, tercio and witnessprotected). However, some of the games had a very low or even negative correlation, with the ten lowest correlation games shown in Table 2. Note that some games such as flower and invest always result in the agent either losing or winning regardless of the actions they perform, meaning that these games do not provide any correlation measure.
Game name  correlation coefficient

painter  -0.90488026
rivers  -0.62030330
surround  -0.49545454
lemmings  -0.37996316
chainreaction  -0.35905795
donkeykong  -0.11941969
lasers2  -0.04154697
boloadventures  -0.02221504
beltmanager  -0.01585323
deflection  0.05087994
From these results we can see that Painter has a very high negative correlation between winrate and score, far more so than any other game. The likely reason for this is that the objective of Painter is to change the color of all of the tiles in the game to the same color, in as few steps as possible. Each time the color of a tile is changed the agent receives a score reward, so solutions that win the game quickly will often have a lower score. While this is a rather counterintuitive idea, several other GVGAI games are also known to have this property within their design. This strong negative correlation between winrate and score appears to be indicative of a particular type of deception in the game, the greed trap. This deceptive design element exploits the fact that most agents assume the reward structure for a game leads towards a goal state, by creating levels where this is not strictly the case [Anderson2018].
While our presented correlation matrices could be used to roughly identify a collection of games with decent discriminatory performance by selecting a game from each cluster, this approach has several limitations. It is difficult to tell which games in each cluster would provide the most information. In addition, the approach takes into account neither the fact that an agent’s performance on the same game can vary dramatically between attempts, nor the fact that two distinct performance measures are available. However, these correlation results can certainly be useful in other areas, such as for allowing game designers to understand which games are similar in terms of agent performance. Identifying which games present unique performance distributions could help in designing additional games that fit entirely new or underrepresented clusters. Accomplishing this would increase the overall discrimination potential of our total game set, and thus also increase the total amount of information that could be achieved from a subset of games (i.e. allows our proposed information-theoretic measure to be even more effective).
Information Gain Analysis
In this section, we analyze the information provided by each of our 108 sample games. Information here is used in the sense of Shannon Information Theory [eoit], and the information gain of a game is the average reduction in uncertainty regarding what algorithm we are testing, given the score and/or winrate performance of that algorithm. This information gain measure can then be used to identify a benchmark set of games that provides us with the maximum information about our agents.
While it is possible to compute information gain on discretized data by first binning the mean performances of the different agents, this is problematic for two reasons. First, as long as all the agents’ results are at least somewhat different, it would be theoretically possible to obtain all information from just a single game. This situation would make calculating the information gain highly redundant, as nearly all of the games would give us the same maximal amount of information. Second, this approach disregards any noise within the measuring process. As an example, if we assume that the average results for two agents are .49 and .50 when playing a specific game, then a discretized information gain analysis would give these as two separate outcomes (assuming we binned to the nearest .01 value). This approach does not take into account the fact that repeated measurements would likely produce slightly different results, varied by some noise. Consequently, a game that gives us average results for two agents of .1 and .9, rather than our previous example of .49 and .50, would be much better suited to tell two agents apart, as the scores are likely to be significantly different even when taking noise into account. In essence, games with agent results that are furthest apart and with the lowest noise, provide the most information.
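The first objection can be made concrete with a small sketch. The function below (a hypothetical helper, not part of the original analysis) computes the information gain obtained by binning mean results, assuming a uniform prior over agents. Any two agents that fall into different bins yield the full one bit, whether their means are .49 and .50 or .1 and .9.

```python
import math
from collections import Counter


def binned_information_gain(means, bin_width=0.01):
    """Information gain (bits) about the agent from its binned mean result.

    Assumes a uniform prior over agents and a noise-free measurement, which is
    exactly the simplification criticised in the text.
    """
    bins = [round(m / bin_width) for m in means]
    counts = Counter(bins)
    n = len(means)
    # Agents sharing a bin stay indistinguishable: log2(c) bits remain for them.
    remaining = sum((c / n) * math.log2(c) for c in counts.values())
    return math.log2(n) - remaining


close_pair = binned_information_gain([0.49, 0.50])
far_pair = binned_information_gain([0.10, 0.90])
same_bin = binned_information_gain([0.492, 0.494])
```

Both `close_pair` and `far_pair` come out as the maximal one bit, so the discretized measure cannot express that the second game separates the agents far more reliably once noise is considered.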
The following information gain formalism is an attempt to accurately measure this difference by modelling the noise within the agents’ performances as a Gaussian distribution. This approach calculates the information gain for a specific game $g$. Let us first define a few terms:
$A$: The set of all algorithms, $a \in A$.
$a$: A specific algorithm, having an average performance of $\mu_a$, with a variance of $\sigma_a^2$, for the game in question.
This allows us to approximate the conditional probability $P(a \mid \hat{a})$. This probability expresses how likely it is that we are observing the result of algorithm $a$, if we are in fact observing the average performance for algorithm $\hat{a}$. In other words, how well does $a$ work as an explanation for what we see from $\hat{a}$.
Equation 1 approximates this probability. It assumes that observations of performance are normally distributed, parameterized by the means and variances from their actual results. The numerator of the equation is the probability density function for a normal distribution based on $a$, computing how likely a result equal to the mean of $\hat{a}$ is. The denominator is a normalization sum over all possible algorithms, ensuring that the overall probabilities sum to one.
$$P(a \mid \hat{a}) \approx \frac{\frac{1}{\sqrt{2\pi\sigma_a^2}} \exp\left(-\frac{(\mu_{\hat{a}} - \mu_a)^2}{2\sigma_a^2}\right)}{\sum_{a' \in A} \frac{1}{\sqrt{2\pi\sigma_{a'}^2}} \exp\left(-\frac{(\mu_{\hat{a}} - \mu_{a'})^2}{2\sigma_{a'}^2}\right)} \tag{1}$$
Computing the probabilities for all pairwise combinations of algorithms allows us to define a confusion matrix $C$ between different algorithms $a_1, \dots, a_n$ (with $n = |A|$) as:
$$C = \begin{pmatrix} P(a_1 \mid \hat{a}_1) & \cdots & P(a_n \mid \hat{a}_1) \\ \vdots & \ddots & \vdots \\ P(a_1 \mid \hat{a}_n) & \cdots & P(a_n \mid \hat{a}_n) \end{pmatrix} \tag{2}$$
Each row of the matrix sums to 1, and each entry in the first row indicates our best guess for the actual algorithm given that we observed the mean of algorithm $a_1$. The matrix of conditional probabilities can then be seen as an error matrix for a channel defined by using the game in question as a measurement device. This allows us to compute the mutual information for this channel, under the assumption that the input distribution is an equal distribution. This value is equivalent to the amount of information we get about what algorithm is used from observing the average performance result. Formally, we can define this as the mutual information between our belief distribution $A$ and the distribution of the actual algorithm $\hat{A}$, expressed in Equation 3.
$$I(A; \hat{A}) = H(A) - H(A \mid \hat{A}) = \log_2 |A| + \frac{1}{|A|} \sum_{\hat{a} \in A} \sum_{a \in A} P(a \mid \hat{a}) \log_2 P(a \mid \hat{a}) \tag{3}$$
The a priori distribution of our beliefs, $A$, is an equal distribution. If we observe the average performance of the algorithm $\hat{a}$ we get a distribution of $P(A \mid \hat{a})$, as defined by the corresponding row of the confusion matrix. The average information gain of observing these results is the average difference in the entropy before observation and after observation, $H(A) - H(A \mid \hat{A})$. The entropy of the equal distribution reduces to the $\log_2$ of the number of states, so we only need to compute the conditional entropy. A higher value here is more desirable, as the best games should provide us with the most information.
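A minimal sketch of the single-game measure follows, assuming the Gaussian noise model and the uniform prior described above. The function names, the example means and variances, and the `min_var` floor (added here so that agents with zero variance do not cause a division by zero) are assumptions made for illustration and are not taken from the original implementation.

```python
import math


def confusion_matrix(means, variances, min_var=1e-9):
    """Row i holds the probability of each candidate algorithm, given that the
    mean result of algorithm i was observed (Gaussian model of Equation 1)."""
    matrix = []
    for observed in means:
        densities = []
        for mu, var in zip(means, variances):
            var = max(var, min_var)  # assumed floor for zero-variance agents
            density = (math.exp(-((observed - mu) ** 2) / (2 * var))
                       / math.sqrt(2 * math.pi * var))
            densities.append(density)
        total = sum(densities)  # normalisation over all candidate algorithms
        matrix.append([d / total for d in densities])
    return matrix


def information_gain(matrix):
    """log2 of the number of algorithms minus the average row entropy, in bits."""
    n = len(matrix)
    conditional_entropy = -sum(
        p * math.log2(p) for row in matrix for p in row if p > 0
    ) / n
    return math.log2(n) - conditional_entropy


# Two agents with the same (hypothetical) variance of 0.01 on two games.
close = information_gain(confusion_matrix([0.49, 0.50], [0.01, 0.01]))
far = information_gain(confusion_matrix([0.10, 0.90], [0.01, 0.01]))
```

Under these assumed values the game with means .49 and .50 yields almost no information, while the game with means .1 and .9 yields close to the full one bit, which matches the intuition given in the discretization example.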
Information gain for multiple games
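The previous formalism allows us to quantify how much information a single specific game can provide us with about what algorithm is being used, but the information gained from looking at two games is always less than or equal to the sum of the information gain from both games individually. To address this, we can directly compute a confusion matrix for a pair, or any higher number, of games by extending the definition of the conditional probability to that presented in Equation 4. Here $G$ is the set of games considered, $\mu_{a,g}$ and $\sigma_{a,g}^2$ are the mean and variance of the performance of algorithm $a$ on game $g$, and $\hat{a}$ is the algorithm whose average performance is observed.
$$P(a \mid \hat{a}) \approx \frac{\prod_{g \in G} \frac{1}{\sqrt{2\pi\sigma_{a,g}^2}} \exp\left(-\frac{(\mu_{\hat{a},g} - \mu_{a,g})^2}{2\sigma_{a,g}^2}\right)}{\sum_{a' \in A} \prod_{g \in G} \frac{1}{\sqrt{2\pi\sigma_{a',g}^2}} \exp\left(-\frac{(\mu_{\hat{a},g} - \mu_{a',g})^2}{2\sigma_{a',g}^2}\right)} \tag{4}$$
Using this new conditional probability definition allows us to compute the information gain for any subset of games, by just picking a suitable set $G$. The mutual information for the resulting confusion matrix is computed as usual. This means that the theoretical maximum information gain that any set of games could give is equal to $\log_2 |A|$. In general, those games that offer different kinds of information lose less information due to redundancy.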
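The effect of redundancy can be illustrated with a short sketch. It assumes that performance on different games is independent, so the joint likelihood is a product of per-game normal densities; this is one reading of the extension to several games, not a confirmed reproduction of the original implementation. The log-space computation, the `min_var` floor, and all names and numbers are choices made for this example.

```python
import math


def log_density(observed, mu, var):
    """Log of the normal density with mean `mu` and variance `var` at `observed`."""
    return -0.5 * math.log(2 * math.pi * var) - (observed - mu) ** 2 / (2 * var)


def joint_information_gain(stats, games, min_var=1e-9):
    """Information gain (bits) of a non-empty list of games.

    stats[game][agent] = (mean, variance); every game lists the same agents.
    """
    agents = sorted(stats[games[0]])
    conditional_entropy = 0.0
    for seen in agents:  # agent whose mean results are observed
        logs = []
        for candidate in agents:  # candidate explanation for those results
            total = 0.0
            for g in games:
                mu, var = stats[g][candidate]
                total += log_density(stats[g][seen][0], mu, max(var, min_var))
            logs.append(total)
        top = max(logs)  # subtract the maximum before exponentiating
        weights = [math.exp(v - top) for v in logs]
        norm = sum(weights)
        for w in weights:
            p = w / norm
            if p > 0:
                conditional_entropy -= p * math.log2(p)
    return math.log2(len(agents)) - conditional_entropy / len(agents)


# Four hypothetical agents; g1_copy duplicates g1, g2 splits the agents differently.
low, high = (0.1, 0.01), (0.9, 0.01)
stats = {
    "g1": {"A": low, "B": low, "C": high, "D": high},
    "g1_copy": {"A": low, "B": low, "C": high, "D": high},
    "g2": {"A": low, "B": high, "C": low, "D": high},
}
single = joint_information_gain(stats, ["g1"])
redundant = joint_information_gain(stats, ["g1", "g1_copy"])
complementary = joint_information_gain(stats, ["g1", "g2"])
```

In this toy setup each game alone gives about one bit, the duplicated pair still gives about one bit, and the complementary pair gives close to the two-bit maximum for four agents.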
Game Name (winrate)  Information gain  Game Name (score)  Information gain  Game Name (combined)  Information gain  Game Name (top 10)  Information gain (cumulative) 
freeway  1.17484168  invest  1.62405816  freeway  1.89430152  freeway  1.89430152 
labyrinth  1.10088062  intersection  1.13955416  invest  1.62405816  invest  3.08236771 
tercio  1.10018133  freeway  1.13619392  intersection  1.59362941  labyrinthdual  3.81992620 
labyrinthdual  1.08531707  tercio  1.10018133  chopper  1.48524965  tercio  4.22563462 
iceandfire  1.07275305  watergame  0.89206793  tercio  1.44693431  sistersavior  4.40856274 
chopper  1.06542656  cops  0.88658183  labyrinthdual  1.42090667  avoidgeorge  4.54036694 
doorkoban  0.98911214  flower  0.86746818  iceandfire  1.32455879  escape  4.60252506 
hungrybirds  0.91886839  waitforbreakfast  0.80128373  hungrybirds  1.32100004  whackamole  4.64444512 
watergame  0.89206793  labyrinth  0.78021437  waitforbreakfast  1.28983481  chopper  4.67138328 
escape  0.87721725  realportals  0.73246317  doorkoban  1.28593860  watergame  4.68457480 
Combine winrate and score together
Using the previous equations, we can calculate the information gain for a particular game or set of games, using either the winrate or score as the measure of performance. However, it is also possible to calculate the total information gain based on both winrate and score combined. To do this, we treat each of these cases as a separate game (i.e., for a particular game $g$ there are two variants, one where the winrate is used as the measure of performance, $g_w$, and one where the score is used, $g_s$). Since the distance between means is scaled by the variance, both winrate and score are translated into information in the same way. We can then use Equation 4 to calculate the total combined information gain of the game by setting the set of games in that equation to $G = \{g_w, g_s\}$. This means we can create a single confusion matrix for each game that encompasses both the winrates and scores of all agents. The first six columns of Table 3 show the 10 games with the highest information gain when using either the winrate, score, or both of these combined as the measure of performance. In general, this approach allows for the combination of any scalar values expressed by the game, and can, therefore, be applied to a range of different gaming benchmarks, even those where games have entirely different performance measures.
Note, though, that the approximation used here operates under the assumption that the performance measures are distributed independently. This is true for combining the same performance measure across different games, but not necessarily true for different performance measures, such as winrate and score, for the same game. A more faithful, but also more complex, approximation could be achieved by using the Mahalanobis distance [mahalanobis1936generalized] instead of the sum of variances.
Top ten games (of 2018)
By initially selecting the game that provides the largest information gain (based on both winrate and score combined) and then recursively selecting the game that adds the most information to the already selected games, we can create a set of 10 games that provide the most information possible. The rightmost two columns of Table 3 provide the 10 games that were chosen for this set in the order they were selected, along with the total cumulative information gain of the set after each game was added. These 10 games are also highlighted in red in Figures 1 and 2.
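The recursive selection can be sketched as a simple greedy loop. The helper `info_gain` below is a compact version of the multi-game measure that assumes independent Gaussian noise per game; the function names, the `min_var` floor, the tie-breaking by game name, and the toy statistics are all assumptions made for this illustration.

```python
import math


def info_gain(stats, games, min_var=1e-9):
    """Bits of information a list of games gives about which agent is playing.

    stats[game][agent] = (mean, variance); independence across games is assumed.
    """
    agents = sorted(stats[games[0]])
    conditional_entropy = 0.0
    for seen in agents:
        logs = []
        for cand in agents:
            total = 0.0
            for g in games:
                mu, var = stats[g][cand]
                var = max(var, min_var)
                total += (-0.5 * math.log(2 * math.pi * var)
                          - (stats[g][seen][0] - mu) ** 2 / (2 * var))
            logs.append(total)
        top = max(logs)
        weights = [math.exp(v - top) for v in logs]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        conditional_entropy -= sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(len(agents)) - conditional_entropy / len(agents)


def greedy_select(stats, k):
    """Pick up to k games, each time adding the game that maximises the
    information gain of the set selected so far."""
    chosen, history = [], []
    remaining = sorted(stats)  # ties are broken by game name
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda g: info_gain(stats, chosen + [g]))
        chosen.append(best)
        remaining.remove(best)
        history.append((best, info_gain(stats, chosen)))
    return history


# Toy example: g1_copy duplicates g1, g2 is noisier but splits agents differently.
stats = {
    "g1": {"A": (0.1, 0.01), "B": (0.1, 0.01), "C": (0.9, 0.01), "D": (0.9, 0.01)},
    "g1_copy": {"A": (0.1, 0.01), "B": (0.1, 0.01), "C": (0.9, 0.01), "D": (0.9, 0.01)},
    "g2": {"A": (0.1, 0.04), "B": (0.9, 0.04), "C": (0.1, 0.04), "D": (0.9, 0.04)},
}
history = greedy_select(stats, 2)
```

In this toy example the duplicate `g1_copy` is passed over in the second step in favour of the noisier but complementary `g2`, mirroring how a game with a high individual information gain can be left out of the final set.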
Discussion
The theoretical maximum information gain that any set of games could give is roughly 4.75 ($\log_2 27$), so we can see from these results that after selecting only 3 or 4 games we can already get the majority of information about which agent is playing. It is worth noting that this set of games is not simply the ten games that individually provide the most information, as some of these games likely provide the same “kind” of information. For example, the game intersection had the third highest information gain when looking at each game individually but was not selected for our top 10 games set. This is likely because it provides the same information as one of the previously selected games. By looking at how this game is played and our correlation matrices in Figure 2, it would appear that this game is very similar to the game freeway and would likely give similar information.
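The arithmetic behind this statement can be checked directly from the cumulative column of Table 3. The short script below only restates those published values; the variable names are arbitrary.

```python
import math

# Cumulative information gain after each of the ten selected games (Table 3).
cumulative_gain = [
    1.89430152, 3.08236771, 3.81992620, 4.22563462, 4.40856274,
    4.54036694, 4.60252506, 4.64444512, 4.67138328, 4.68457480,
]

max_gain = math.log2(27)  # 27 agents, uniform prior
fractions = [gain / max_gain for gain in cumulative_gain]

for k, (gain, fraction) in enumerate(zip(cumulative_gain, fractions), start=1):
    print(f"{k:2d} games: {gain:.3f} bits ({fraction:.1%} of the maximum)")
```

Three games already account for roughly 80% of the maximum, four for roughly 89%, and all ten for a little over 98%.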
We can also compare the information gain provided by using just the winrate or score for certain games, versus the combined information gain from using both. When looking at each of these performance measures separately it appears that Invest has the highest information gain, which is likely due to the large variation in possible scores that agents could achieve. However, as agents will always lose this game, either by spending too much money or the time limit expiring, the winrate provides no information gain at all. Freeway, on the other hand, has a high information gain when using either winrate or score, allowing it to have a combined information gain that is higher than Invest. It is worth reiterating that the combined information gain for a game is not simply the sum of its individual parameters, as some information may be shared between the different performance measures (calculation for combined information gain is subadditive).
Conclusions and Future Work
In this paper, we have proposed an informationtheoretic method for selecting which problems to test a given algorithm on. This is particularly useful for the many cases where it is computationally infeasible to test a new algorithm on all benchmark problems. Our method is generally applicable to any situation where algorithms need to be tested on many problems, and is especially useful when the problems are noisy and/or have multiple performance metrics.
As part of developing this method we performed an in-depth analysis into the discriminatory capabilities of the games in the GVGAI framework, as well as the correlations between games in terms of agent performance. Our correlation analysis shows that there are substantial variations in agent performance between different GVGAI games, and the resulting correlation matrices can be used to cluster certain games together. Developing new GVGAI games that do not fit within these identified clusters would present an entirely new challenge for the current selection of agents, making them highly desirable. Games that have different discriminatory profiles from those that already exist would likely be more useful for investigating agent performance than those with similar profiles to previous games.
Our proposed information theory-based method provides a more principled approach to finding discriminatory games. We extend the notion of information gain to handle noisy feedback and to combine two feedback signals (winrate and score). We also show how this measure can be applied recursively to find a small set of games that give us almost as much information about an agent as the full set of games would have. Analyzing this set of games reveals that several of them have “deceptive” qualities, where the score is not strongly correlated with winrate. Future work could involve expanding the evaluation criteria to include additional data from agent playthroughs, such as the time required to solve a level or the number of moves used, which may help us to better differentiate between agents.
This research will hopefully allow future developers and researchers to accurately compare the performance of their new agent against the current set of evaluated agents without the need for exhaustive testing on the full GVGAI corpus. New agents can be tested on a set of “exploratory experiments” to help gauge how well the agent may perform on more detailed experiments that consider the entire GVGAI game set. This will be especially important in the future as more and more games are added to the GVGAI game library, resulting in significantly increased benchmark evaluation times. The proposed approach can also be used to evaluate games that are selected for future GVGAI competitions, to ensure that they present a diverse range of problems.
One thing to note here is that the specific results of our analysis are based on the combined game-agent ecosystem. Having a different set of agents could mean that different games, which might previously have been too hard, would suddenly be more discriminatory. Similarly, adding additional games can affect which games provide us with redundant information. Because of this, while the specific games identified here are interesting today, they might very well change in the future. However, we believe the more important contribution presented in this paper is our proposed methodology that was used to select these games. Furthermore, this method, while used here on GVGAI games, could also easily be applied to other sets of problems and algorithms. Our approach can be generalized from just winrates and scores to include any number of different outcome measures from other domains. This could even include problems outside of the traditional game space, as long as it is possible to obtain the mean and variance of each algorithm’s performance. Some obvious future applications would be to analyze the performance of multiple deep reinforcement learning algorithms on the Atari games in the Arcade Learning Environment (ALE) framework, and supervised learning algorithms tested on datasets associated with Kaggle competitions.
Acknowledgments
Damien Anderson is funded by the Carnegie Trust for the Universities of Scotland as a PhD Scholar. Christoph Salge is funded by the EU Horizon 2020 programme under the Marie Sklodowska-Curie grant 705643. Ahmed Khalifa acknowledges the financial support from NSF grant (Award number 1717324 - "RI: Small: General Intelligence through Algorithm Invention and Selection").