1 Introduction
Evaluation is a key driver of progress in machine learning: for example, ImageNet deng:09 and the Arcade Learning Environment BellemareNVB13 enabled subsequent breakthroughs in supervised and reinforcement learning krizhevsky:12 ; mnih:15 . However, developing evaluations has received little systematic attention compared to developing algorithms. Immense amounts of compute are continually expended smashing algorithms and tasks together, yet in a striking asymmetry the results are almost exclusively used to evaluate and optimize algorithms, and almost never to evaluate and optimize evaluations.

The classic train-and-test paradigm on common datasets, which has served the community well donoho:15 , is reaching its limits. Three examples suffice. Adversarial attacks have complicated evaluation, raising questions about which attacks to test against szegedy:13 ; tramer:18 ; kurakin:18 ; uesato:18 . Agents trained far beyond human performance with self-play can only really be evaluated against each other silver:17 ; silver:17a . The desire to build increasingly general-purpose agents has led to a proliferation of environments: MuJoCo, DM Lab, OpenAI Gym, Psychlab and others todorov:12 ; beattie:16 ; brockman:16 ; leibo:18 .
In this paper we pause to ask, and partially answer, some basic questions about evaluation: Q1. Do tasks test what we think they test? Q2. When is a task redundant? Q3. Which tasks (and agents) matter the most? Q4. How should evaluations be evaluated?
We consider two scenarios: agent vs task (AvT), where algorithms are evaluated on suites of datasets or environments; and agent vs agent (AvA), where agents compete directly as in Go and StarCraft. Our goal is to treat tasks and agents symmetrically – with a view towards, ultimately, co-optimizing agents and evaluations. From this perspective AvA, where the task is (beating) another agent, is especially interesting. Performance in AvA is often quantified using Elo ratings Elo78 or the closely related TrueSkill herbrich:07 . There are two main problems with Elo. Firstly, Elo bakes in the assumption that relative skill is transitive; but Elo is meaningless – it has no predictive power – in cyclic games like rock-paper-scissors. Intransitivity has been linked to biodiversity in ecology, and may be useful when evolving populations of agents frean:01 ; kerr:02 ; laird:06 ; szolnoki:14 . Secondly, an agent’s Elo rating can be inflated by instantiating many copies of an agent it beats (or conversely, deflated). This can cause problems when Elo guides hyperparameter optimization methods like population-based training jaderberg:17b . Similarly, the most important decision when constructing a task-suite is which tasks to include. It is easy, and all too common, to bias task-suites in favor of particular agents or algorithms.
1.1 Overview
Section 2 presents background information on Elo and tools for working with antisymmetric matrices, such as the Schur decomposition and combinatorial Hodge theory. A major theme underlying the paper is that the fundamental algebraic structure of tournaments and evaluation is antisymmetric dgm:18 . Techniques specific to antisymmetric matrices are less familiar to the machine learning community than approaches like PCA that apply to symmetric matrices and are typically correlationbased.
Section 3 presents a unified approach to representing evaluation data, where agents and tasks are treated symmetrically. A basic application of the approach results in our first contribution: a multidimensional Elo rating (mElo) that handles cyclic interactions. We also sketch how the Schur decomposition can uncover latent skills and tasks, providing a partial answer to Q1. We illustrate mElo on the domain of training an AlphaGo agent DSilverHMGSDSAPL16 .
The second contribution of the paper is Nash averaging, an evaluation method that is invariant to redundant tasks and agents; see section 4. The basic idea is to play a meta-game on evaluation data lanctotl:17 . The meta-game has a unique maximum entropy Nash equilibrium. The key insight of the paper is that the maxent Nash adapts automatically to the presence of redundant tasks and agents. The maxent Nash distribution thus provides a principled answer to Q2 and Q3: which tasks and agents do and do not matter is determined by a meta-game. Finally, the expected difficulty of tasks under the Nash distribution on agents yields a partial answer to Q4. The paper concludes by taking a second look at the performance of agents on Atari. We find that, under Nash averaging, human performance ties with the best agents, suggesting better-than-human performance has not yet been achieved.
1.2 Related work
Legg and Hutter developed a definition of intelligence which, informally, states “intelligence measures an agent’s ability to achieve goals in a wide range of environments” legg:05 ; legg:13 . Formally, they consider all computable tasks weighted by algorithmic complexity solomonoff:64 ; kolmogorov:65 ; chaitin:66 . Besides being incomputable, the distribution places (perhaps overly) heavy weight on the simplest tasks.
A comprehensive study of performance metrics for machine learning and AI can be found in ferri:09 ; hernandez:12 ; hernandez:17 ; hernandez:17a ; Olson2017 . There is a long history of psychometric evaluation in humans, some of which has been applied in artificial intelligence spearman:04 ; woolley:10 ; bringsjord:11 . Bradley-Terry models provide a general framework for pairwise comparison hunter:04 . Researchers have recently taken a second look at the Arcade Learning Environment BellemareNVB13 and introduced new performance metrics Machado17a , although the approach is quite particular. Recent work using agents to evaluate games has somewhat overlapping motivation with this paper liapis:13 ; horn:14 ; nielsen:15 ; silva:17 ; volz:18 . Item response theory is an alternative, and likely complementary, approach to using agents to evaluate tasks hambleton:91 that has recently been applied to study the performance of agents on the Arcade Learning Environment martinez:17 .

Our approach draws heavily on work applying combinatorial Hodge theory to statistical ranking jiang:11 and game theory candogan:11 ; candogan:13 ; candogan:13a . We also draw on empirical game theory Walsh03 ; Wellman06 , by using a meta-game to “evaluate evaluations”, see section 4. Empirical game theory has been applied to domains like poker and continuous double auctions, and has recently been extended to asymmetric games PhelpsPM04 ; PhelpsCMNPS07 ; PonsenTKR09 ; BloembergenTHK15 ; Tuyls18 . Von Neumann winners in the dueling bandit setting and NE-regret are related to Nash averaging dudik:15 ; balsubramani:16 ; Jordan:07 ; Jordan:10 .

2 Preliminaries
Notation. Vectors are column vectors; $\mathbf{0}$ and $\mathbf{1}$ denote the constant vectors of zeros and ones respectively. We sometimes use subscripts to indicate the dimensions of vectors and matrices, e.g. $\mathbf{1}_n$ or $\mathbf{S}_{m\times n}$, and sometimes their entries, e.g. $r_i$ or $S_{ij}$; no confusion should result. The unit vector with a 1 in coordinate $i$ is $\mathbf{e}_i$. Proofs and code are in the appendix.
2.1 The Elo rating system
Suppose $n$ agents play a series of pairwise matches against each other. Elo assigns a rating to each player based on their wins and losses, which we represent as an $n$-vector $\mathbf{r} = (r_1, \ldots, r_n)$. The predicted probability of $i$ beating $j$ given their Elo ratings is

$$\hat{p}_{ij} := \sigma\big(\alpha(r_i - r_j)\big), \quad\text{where } \sigma(x) = \frac{1}{1 + e^{-x}}. \qquad (1)$$

The constant $\alpha = \frac{\ln 10}{400}$ is not important in what follows, so we pretend $\alpha = 1$. Observe that only the difference between Elo ratings affects win-loss predictions. We therefore impose that Elo ratings sum to zero, $\mathbf{r}^\top \mathbf{1} = 0$, without loss of generality. Define the loss

$$\ell(p_{ij}, \hat{p}_{ij}) = -p_{ij} \log \hat{p}_{ij} - (1 - p_{ij}) \log(1 - \hat{p}_{ij}), \qquad (2)$$

where $p_{ij}$ is the true probability of $i$ beating $j$. Suppose the $t^{\text{th}}$ match pits player $i$ against $j$, with outcome $S^t_{ij} = 1$ if $i$ wins and $S^t_{ij} = 0$ if $i$ loses. Online gradient descent on $\ell$ obtains

$$r^{t+1}_i \leftarrow r^t_i + \eta \cdot \big(S^t_{ij} - \hat{p}^t_{ij}\big) \quad\text{and}\quad r^{t+1}_j \leftarrow r^t_j - \eta \cdot \big(S^t_{ij} - \hat{p}^t_{ij}\big). \qquad (3)$$

Choosing learning rate $\eta = 16$ or $32$ recovers the updates introduced by Arpad Elo in Elo78 .
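As a concrete sketch (a hypothetical helper, not code released with this paper), the online update (3) with $\alpha = 1$ can be written as:

```python
import math

def elo_update(r_i, r_j, s_ij, eta=16.0):
    """One online Elo update (eq. 3). s_ij is 1.0 if player i won, 0.0 if i lost.

    Uses alpha = 1; classical Elo corresponds to eta = 16 or 32 with
    alpha = ln(10)/400 folded into the ratings.
    """
    p_hat = 1.0 / (1.0 + math.exp(-(r_i - r_j)))  # predicted probability i beats j (eq. 1)
    delta = eta * (s_ij - p_hat)                  # gradient of the cross-entropy loss (eq. 2)
    return r_i + delta, r_j - delta
```

Note that an upset (a win by the lower-rated player) moves both ratings further than an expected result, since the surprise $S^t_{ij} - \hat{p}^t_{ij}$ is larger.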
The win-loss probabilities predicted by Elo ratings can fail in simple cases. For example, rock, paper and scissors all receive the same Elo rating. Elo’s predictions are then $\hat{p}_{ij} = \frac{1}{2}$ for all $i, j$ – so Elo has no predictive power for any given pair of players (e.g. paper beats rock with probability 1).
What are the Elo update’s fixed points? Suppose we batch matches to obtain empirical estimates $\bar{p}_{ij}$ of the probability of player $i$ beating $j$. As the number of matches approaches infinity, the empirical estimates approach the true probabilities $p_{ij}$.

Proposition 1.
Elo ratings are at a stationary point under batch updates iff the matrices of empirical probabilities and predicted probabilities have the same row-sums (or, equivalently, the same column-sums):

$$\sum_{j} \bar{p}_{ij} = \sum_{j} \hat{p}_{ij} \quad\text{for all } i. \qquad (4)$$
Many different win-loss probability matrices result in identical Elo ratings. The situation is analogous to how many different joint probability distributions can have the same marginals. We return to this topic in section 3.1.

2.2 Antisymmetric matrices
We recall some basic facts about antisymmetric matrices. Matrix $\mathbf{A}$ is antisymmetric if $\mathbf{A} + \mathbf{A}^\top = 0$. Antisymmetric matrices have even rank and purely imaginary eigenvalues $\{\pm i\lambda_1, \ldots, \pm i\lambda_r\}$. Any antisymmetric matrix admits a real Schur decomposition

$$\mathbf{A} = \mathbf{Q} \cdot \mathbf{\Lambda} \cdot \mathbf{Q}^\top, \qquad (5)$$

where $\mathbf{Q}$ is orthogonal and $\mathbf{\Lambda}$ consists of zeros except for $2 \times 2$ diagonal blocks of the form

$$\begin{pmatrix} 0 & \lambda_j \\ -\lambda_j & 0 \end{pmatrix}. \qquad (6)$$

The entries $\lambda_j$ of $\mathbf{\Lambda}$ are real numbers, obtained from the imaginary parts of the eigenvalues of $\mathbf{A}$.
Proposition 2.
Given an $m \times n$ matrix $\mathbf{S}$ with rank $r$ and singular value decomposition $\mathbf{S} = \mathbf{U} \cdot \mathbf{D} \cdot \mathbf{V}^\top$, construct the antisymmetric matrix

$$\mathbf{A} = \begin{pmatrix} \mathbf{0}_{m \times m} & \mathbf{S} \\ -\mathbf{S}^\top & \mathbf{0}_{n \times n} \end{pmatrix}. \qquad (7)$$

Then the thin Schur decomposition of $\mathbf{A}$ is $\mathbf{Q} \cdot \mathbf{\Lambda} \cdot \mathbf{Q}^\top$, where the nonzero pairs $\pm\lambda_j$ in $\mathbf{\Lambda}$ are the singular values in $\mathbf{D}$, and the columns of $\mathbf{Q}$ interleave the singular vectors:

$$\mathbf{q}_{2j-1} = \begin{pmatrix} \mathbf{u}_j \\ \mathbf{0}_n \end{pmatrix}, \qquad \mathbf{q}_{2j} = \begin{pmatrix} \mathbf{0}_m \\ \mathbf{v}_j \end{pmatrix}. \qquad (8)$$
Combinatorial Hodge theory is developed by analogy with differential geometry, see jiang:11 ; candogan:11 ; candogan:13 ; candogan:13a . Consider a fully connected graph with vertex set $\{1, \ldots, n\}$. Assign a flow $A_{ij} \in \mathbb{R}$ to each edge $(i, j)$ of the graph. The flow in the opposite direction is $A_{ji} = -A_{ij}$, so flows are just antisymmetric matrices. The flow on a graph is analogous to a vector field on a manifold.
The combinatorial gradient of an $n$-vector $\mathbf{r}$ is the flow $\operatorname{grad}(\mathbf{r}) := \mathbf{r}\mathbf{1}^\top - \mathbf{1}\mathbf{r}^\top$. Flow $\mathbf{A}$ is a gradient flow if $\mathbf{A} = \operatorname{grad}(\mathbf{r})$ for some $\mathbf{r}$, or equivalently if $A_{ij} = r_i - r_j$ for all $i, j$. The divergence of a flow is the vector $\operatorname{div}(\mathbf{A}) := \frac{1}{n} \mathbf{A} \cdot \mathbf{1}$. The divergence measures the contribution to the flow of each vertex, considered as a source. The curl of a flow is the three-tensor $\operatorname{curl}(\mathbf{A})_{ijk} := A_{ij} + A_{jk} - A_{ik}$. Finally, the rotation is $\operatorname{rot}(\mathbf{A})_{ij} := \frac{1}{n} \sum_{k=1}^{n} \operatorname{curl}(\mathbf{A})_{ijk}$.

Theorem (Hodge decomposition, jiang:11 ).
(i) $\operatorname{div} \circ \operatorname{grad}(\mathbf{r}) = \mathbf{r}$ for any $\mathbf{r}$ satisfying $\mathbf{r}^\top \mathbf{1} = 0$.
(ii) $\operatorname{div}(\operatorname{rot}(\mathbf{A})) = \mathbf{0}$ for any flow $\mathbf{A}$.
(iii) $\operatorname{curl}(\operatorname{grad}(\mathbf{r})) = 0$ for any vector $\mathbf{r}$.
(iv) The vector space of antisymmetric matrices admits an orthogonal decomposition

$$\{\text{antisymmetric matrices}\} = \underbrace{\operatorname{im}(\operatorname{grad})}_{\text{transitive flows}} \oplus \underbrace{\ker(\operatorname{div})}_{\text{cyclic flows}} \qquad (9)$$

with respect to the standard inner product $\langle \mathbf{A}, \mathbf{B} \rangle = \sum_{ij} A_{ij} B_{ij}$. Concretely, any antisymmetric matrix decomposes as

$$\mathbf{A} = \operatorname{grad}\big(\operatorname{div}(\mathbf{A})\big) + \operatorname{rot}(\mathbf{A}). \qquad (10)$$
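As an illustrative sketch (a hypothetical helper, written here for exposition), the decomposition (10) is a few lines of code for small matrices:

```python
def hodge_decompose(A):
    """Split an antisymmetric matrix A (list of lists) into its transitive
    and cyclic parts, A = grad(div(A)) + rot(A) (eq. 10)."""
    n = len(A)
    div = [sum(row) / n for row in A]                                    # div(A) = (1/n) A.1
    trans = [[div[i] - div[j] for j in range(n)] for i in range(n)]      # grad(div(A))
    cyc = [[A[i][j] - trans[i][j] for j in range(n)] for i in range(n)]  # rot(A)
    return div, trans, cyc

# Rock-paper-scissors is purely cyclic: the divergence vanishes and rot(A) = A.
rps = [[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]]
div, trans, cyc = hodge_decompose(rps)
```

Conversely, a gradient flow has zero cyclic part, so its divergence recovers the ratings exactly, as in part (i) of the theorem.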
Sneak peek. The divergence recovers Elo ratings or plain average performance, depending on the scenario. The Hodge decomposition separates transitive interactions (captured by averages or Elo) from cyclic interactions (rock-paper-scissors), and explains when Elo ratings make sense. The Schur decomposition is a window onto the latent skills and tasks not accounted for by Elo and averages.
3 On the algebraic structure of evaluation
The Schur decomposition and combinatorial Hodge theory provide a unified framework for analyzing evaluation data in the AvA and AvT scenarios. In this section we provide some basic tools and present a multidimensional extension of Elo that handles cyclic interactions.
3.1 Agents vs agents (AvA)
In AvA, results are collated into an $n \times n$ matrix $\mathbf{P}$ of win-loss probabilities based on relative frequencies, with $p_{ij} + p_{ji} = 1$. Construct $\mathbf{A} := \operatorname{logit}(\mathbf{P})$ with $A_{ij} = \log \frac{p_{ij}}{p_{ji}}$. Matrix $\mathbf{A}$ is antisymmetric since $\operatorname{logit}(p_{ij}) = -\operatorname{logit}(p_{ji})$.
When can Elo correctly predict win-loss probabilities? The answer is simple in logit space:

Proposition 3.
(i) If the probabilities $\hat{\mathbf{P}}$ are generated by Elo ratings $\mathbf{r}$, then the divergence of their logit recovers the ratings. That is,

$$\operatorname{div}\big(\operatorname{logit}(\hat{\mathbf{P}})\big) = \mathbf{r}. \qquad (11)$$

(ii) There is an Elo rating that generates probabilities $\mathbf{P}$ iff $\operatorname{rot}(\operatorname{logit}(\mathbf{P})) = \mathbf{0}$. Equivalently, iff $\operatorname{curl}(\operatorname{logit}(\mathbf{P}))_{ijk} = 0$ for all $i, j, k$.

Elo is, essentially, a uniform average in logit space. Elo’s predictive failures are due to the cyclic component $\operatorname{rot}(\operatorname{logit}(\mathbf{P}))$ that uniform averaging ignores.
Multidimensional Elo (mElo). Elo ratings bake in the assumption that relative skill is transitive. However, there is no single dominant strategy in games like rock-paper-scissors or (arguably) StarCraft. Rating systems that can handle intransitive abilities are therefore necessary. An obvious approach is to learn a feature vector $\mathbf{w}_i$ per player and a rating vector $\mathbf{r}$, and predict $\hat{p}_{ij} = \sigma(\mathbf{w}_i^\top\mathbf{r} - \mathbf{w}_j^\top\mathbf{r})$. Unfortunately, this reduces to the standard Elo rating since $\mathbf{w}_i^\top\mathbf{r}$ is a scalar.

Handling intransitive abilities requires learning an approximation to the cyclic component $\operatorname{rot}(\mathbf{A})$. Combining the Schur and Hodge decompositions allows us to construct low-rank approximations that extend Elo. Note that antisymmetric matrices have even rank. Consider

$$\mathbf{A} \approx \operatorname{grad}(\mathbf{r}) + \mathbf{C} \cdot \mathbf{\Omega} \cdot \mathbf{C}^\top, \quad\text{where } \mathbf{\Omega} = \sum_{l=1}^{k} \big(\mathbf{e}_{2l-1}\mathbf{e}_{2l}^\top - \mathbf{e}_{2l}\mathbf{e}_{2l-1}^\top\big) \qquad (12)$$

and the columns of the $n \times 2k$ matrix $\mathbf{C}$ are orthogonal to each other, to $\mathbf{1}$, and to $\mathbf{r}$. The larger $k$, the better the approximation. Let mElo$_{2k}$ assign each player an Elo rating $r_i$ and a $2k$-dimensional vector $\mathbf{c}_i$. Vanilla Elo uses $k = 0$. The mElo$_{2k}$ win-loss prediction is

$$\hat{p}_{ij} = \sigma\big(r_i - r_j + \mathbf{c}_i^\top \mathbf{\Omega}\, \mathbf{c}_j\big). \qquad (13)$$
Online updates can be computed by gradient descent, see section E, with orthogonality enforced.
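A minimal sketch of the prediction rule (13), with hypothetical helper names and a hand-picked rock-paper-scissors parameterization (equal Elo ratings; the cyclic structure is carried entirely by the $\mathbf{c}$-vectors, which for a demo need not be orthonormalized):

```python
import math

def melo_predict(r_i, r_j, c_i, c_j):
    """mElo win probability, eq. (13): sigma(r_i - r_j + c_i^T Omega c_j).

    c_i, c_j are 2k-dimensional vectors; Omega pairs up coordinates (2l, 2l+1),
    so each pair contributes a cross-product-like, antisymmetric term.
    """
    k = len(c_i) // 2
    adv = sum(c_i[2 * l] * c_j[2 * l + 1] - c_i[2 * l + 1] * c_j[2 * l] for l in range(k))
    return 1.0 / (1.0 + math.exp(-(r_i - r_j + adv)))

# k = 1: place the three strategies at 120-degree angles in the plane.
cvecs = {
    "paper": (1.0, 0.0),
    "rock": (-0.5, 0.866),
    "scissors": (-0.5, -0.866),
}
```

With these vectors the model predicts that paper likely beats rock, scissors likely beats paper, and rock likely beats scissors – a cyclic prediction that no single Elo rating vector can produce, which is exactly the failure mode of section 2.1.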
3.2 Application: predicting winloss probabilities in Go
Elo ratings are widely used in Chess and Go. We compared the predictive capabilities of Elo and the simplest extension, mElo$_2$, on eight Go algorithms taken from extended data table 9 in DSilverHMGSDSAPL16 : seven variants of AlphaGo, and Zen. On both Frobenius norm and logistic loss against the empirical win-loss probabilities, mElo$_2$ outperformed Elo.
To better understand the difference, we zoom in on three algorithms that were observed to interact non-transitively in Tuyls18 : an AlphaGo variant with value net, an AlphaGo variant with policy net, and Zen. Elo’s win-loss predictions are poor: Elo incorrectly predicts both that the policy-net variant likely beats the value-net variant and that the value-net variant likely beats Zen. In contrast, mElo$_2$ correctly predicts the likely winner in all cases, with more accurate probabilities:

Elo predictions
             value net   policy net   Zen
value net        –          0.41      0.58
policy net      0.59         –        0.67
Zen             0.42        0.33       –

Empirical probabilities
             value net   policy net   Zen
value net        –          0.7       0.4
policy net      0.3          –        1.0
Zen             0.6         0.0        –

mElo predictions
             value net   policy net   Zen
value net        –          0.72      0.46
policy net      0.28         –        0.98
Zen             0.55        0.02       –
3.3 Agents vs tasks (AvT)
In AvT, results are represented as an $m \times n$ matrix $\mathbf{S}$: rows are agents, columns are tasks, entries are scores (e.g. accuracy or total reward). Subtract the total mean, so the sum of all entries of $\mathbf{S}$ is zero. We recast both agents and tasks as players and construct an antisymmetric matrix. Let $\mathbf{s} = \frac{1}{n} \mathbf{S} \cdot \mathbf{1}_n$ and $\mathbf{d} = -\frac{1}{m} \mathbf{S}^\top \cdot \mathbf{1}_m$ be the average skill of each agent and the average difficulty of each task. Define the residual $\hat{\mathbf{S}} := \mathbf{S} - \mathbf{s}\mathbf{1}_n^\top + \mathbf{1}_m\mathbf{d}^\top$. Let $\mathbf{r} = \binom{\mathbf{s}}{\mathbf{d}}$ be the concatenation of $\mathbf{s}$ and $\mathbf{d}$. We construct the antisymmetric $(m+n) \times (m+n)$ matrix

$$\mathbf{A} = \operatorname{grad}(\mathbf{r}) + \begin{pmatrix} \mathbf{0}_{m \times m} & \hat{\mathbf{S}} \\ -\hat{\mathbf{S}}^\top & \mathbf{0}_{n \times n} \end{pmatrix}. \qquad (14)$$

The top-right block of $\mathbf{A}$, namely $\mathbf{s}\mathbf{1}_n^\top - \mathbf{1}_m\mathbf{d}^\top + \hat{\mathbf{S}} = \mathbf{S}$, is agent performance on tasks; the bottom-left is task difficulty for agents. The top-left block compares agents by their average skill on tasks; the bottom-right compares tasks by their average difficulty for agents. Average skill and difficulty explain the data if the score of agent $i$ on task $j$ is $S_{ij} = s_i - d_j$, the agent’s skill minus the task’s difficulty, for all $i, j$. Paralleling proposition 3, averages explain the data, $\hat{\mathbf{S}} = \mathbf{0}$, iff $\operatorname{rot}(\mathbf{A}) = \mathbf{0}$.
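A minimal sketch of the construction in (14), using a hypothetical pure-Python helper:

```python
def avt_matrix(S):
    """Embed an m x n score matrix S (total sum zero) into an (m+n) x (m+n)
    antisymmetric matrix, eq. (14): A = grad(r) + [[0, S_hat], [-S_hat^T, 0]]."""
    m, n = len(S), len(S[0])
    s = [sum(row) / n for row in S]                              # average skill of each agent
    d = [-sum(S[i][j] for i in range(m)) / m for j in range(n)]  # average difficulty of each task
    r = s + d                                                    # concatenated ratings
    A = [[r[a] - r[b] for b in range(m + n)] for a in range(m + n)]  # grad(r): transitive part
    for i in range(m):
        for j in range(n):
            resid = S[i][j] - (s[i] - d[j])  # S_hat: what skill-minus-difficulty misses
            A[i][m + j] += resid
            A[m + j][i] -= resid
    return A
```

By construction the top-right block of `avt_matrix(S)` reproduces `S` exactly, so nothing is lost in the embedding.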
The failure of averages to explain performance is encapsulated in $\hat{\mathbf{S}}$ and $\operatorname{rot}(\mathbf{A})$. By proposition 2, the SVD of $\hat{\mathbf{S}}$ and the Schur decomposition of $\operatorname{rot}(\mathbf{A})$ are equivalent. If the SVD is $\hat{\mathbf{S}} = \mathbf{U} \cdot \mathbf{D} \cdot \mathbf{V}^\top$, then the columns of $\mathbf{U}$ represent the latent abilities exhibited by agents and the columns of $\mathbf{V}$ represent the latent problems posed by tasks.
4 Invariant evaluation
Evaluation is often based on metrics like average performance or Elo ratings. Unfortunately, two (or two hundred) tasks or agents that look different may test/exhibit identical skills. Overrepresenting particular tasks or agents introduces biases into averages and Elo – biases that can only be detected post hoc. Humans must therefore decide which tasks or agents to retain, to prevent redundant agents or tasks from skewing results. At present, evaluation is not automatic and does not scale. To be scalable and automatic, an evaluation method should always benefit from including additional agents and tasks. Moreover, it should adjust automatically and gracefully to redundant data.

Definition 1.
An evaluation method maps data to a real-valued function on players (that is, agents or agents and tasks):

$$\operatorname{Eval}: \text{data} \longrightarrow \big\{ f: \text{players} \to \mathbb{R} \big\}. \qquad (15)$$
Desired properties. An evaluation method should be:

P1. Invariant: adding redundant copies of an agent or task to the data should make no difference.

P2. Continuous: the evaluation method should be robust to small changes in the data.

P3. Interpretable: hard to formalize, but the procedure should agree with intuition in basic cases.

Elo and uniform averaging over tasks are examples of evaluation methods that invariance excludes.
4.1 Nash averaging
This section presents an evaluation method satisfying properties P1–P3. We discuss AvA here; see section D for AvT. Given an antisymmetric logit matrix $\mathbf{A}$, define a two-player meta-game with payoffs $\mathbf{p}^\top \mathbf{A}\, \mathbf{q}$ and $-\mathbf{p}^\top \mathbf{A}\, \mathbf{q}$ for the row and column meta-players, where $\mathbf{p}$ and $\mathbf{q}$ range over probability distributions on the $n$ agents. The game is symmetric because $\mathbf{A} = -\mathbf{A}^\top$ and zero-sum because the two payoffs sum to zero.

The row and column meta-players pick ‘teams’ of agents. Their payoff is the expected log-odds of their respective team winning under the joint distribution. If there is a dominant agent that has better than even odds of beating the rest, both players will pick it. In rock-paper-scissors, the only unbeatable-on-average team is the uniform distribution $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. In general, the value of the game is zero and the Nash equilibria are teams that are unbeatable in expectation.
A problem with Nash equilibria (NE) is that they are not unique, which forces the user to make choices and undermines interpretability vonneumann:44 ; nash:50 . Fortunately, for zero-sum games there is a natural choice of Nash:

Proposition 4 (maxent NE).
For antisymmetric $\mathbf{A}$ there is a unique symmetric Nash equilibrium $(\mathbf{p}^*, \mathbf{p}^*)$ solving $\max_{\mathbf{p}} \min_{\mathbf{q}} \mathbf{p}^\top \mathbf{A}\, \mathbf{q}$ with greater entropy than any other Nash equilibrium.

Maxent Nash is maximally indifferent between players with the same empirical performance.

Definition 2.
The maxent Nash evaluation method for AvA is

$$\mathbf{A} \mapsto \big( \mathbf{p}^*,\; \mathbf{n}_{\mathbf{A}} := \mathbf{A} \cdot \mathbf{p}^* \big), \qquad (16)$$

where $\mathbf{p}^*$ is the maxent Nash equilibrium and $\mathbf{n}_{\mathbf{A}}$ is the Nash average.
Invariance to redundancy is best understood by looking at an example; for details see section C.
Example 1 (invariance).
Consider two logit matrices, where the second adds a redundant copy of agent $c$ to the first:

$$\mathbf{A} = \begin{pmatrix} 0 & 4.6 & -4.6 \\ -4.6 & 0 & 4.6 \\ 4.6 & -4.6 & 0 \end{pmatrix} \quad\text{and}\quad \mathbf{B} = \begin{pmatrix} 0 & 4.6 & -4.6 & -4.6 \\ -4.6 & 0 & 4.6 & 4.6 \\ 4.6 & -4.6 & 0 & 0 \\ 4.6 & -4.6 & 0 & 0 \end{pmatrix}$$

The maxent Nash for $\mathbf{A}$ is $\mathbf{p}^*_{\mathbf{A}} = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. It is easy to check that $(\frac{1}{3}, \frac{1}{3}, p, \frac{1}{3} - p)$ is a Nash for $\mathbf{B}$ for any $0 \leq p \leq \frac{1}{3}$, and thus the maxent Nash for $\mathbf{B}$ is $\mathbf{p}^*_{\mathbf{B}} = (\frac{1}{3}, \frac{1}{3}, \frac{1}{6}, \frac{1}{6})$. Maxent Nash automatically detects the redundant agents and distributes $c$’s mass over them equally.

Uniform averaging is not invariant to adding redundant agents: concretely, $\mathbf{A} \cdot \frac{\mathbf{1}}{3} = \mathbf{0}_3$ whereas $\mathbf{B} \cdot \frac{\mathbf{1}}{4} = (-1.15, 1.15, 0, 0)$, falsely suggesting agent $b$ is superior. In contrast, $\mathbf{A} \cdot \mathbf{p}^*_{\mathbf{A}} = \mathbf{0}_3$ and $\mathbf{B} \cdot \mathbf{p}^*_{\mathbf{B}} = \mathbf{0}_4$ (the zero-vectors have different sizes because there are different numbers of agents). Nash averaging correctly reports no agent is better than the rest in both cases.
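The equilibrium claims in Example 1 are easy to verify numerically. A sketch with a hypothetical helper (the sign pattern of the $4 \times 4$ matrix below is one consistent reading of the example): for antisymmetric $\mathbf{A}$ the game value is zero, so a distribution $\mathbf{p}$ is a Nash equilibrium of the symmetric zero-sum meta-game iff no pure strategy beats it, i.e. $\max_i (\mathbf{A}\mathbf{p})_i \leq 0$; the Nash average is then $\mathbf{A}\mathbf{p}$.

```python
def nash_average(A, p, tol=1e-9):
    """Return the Nash average A.p (eq. 16), after checking that p is an
    equilibrium of the symmetric zero-sum meta-game: max_i (A.p)_i <= 0."""
    n = len(A)
    avg = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
    if max(avg) > tol:
        raise ValueError("p is beatable, hence not a Nash equilibrium")
    return avg

# The 4 x 4 logit matrix of Example 1 (agent c duplicated) and its maxent Nash.
B = [[0.0, 4.6, -4.6, -4.6],
     [-4.6, 0.0, 4.6, 4.6],
     [4.6, -4.6, 0.0, 0.0],
     [4.6, -4.6, 0.0, 0.0]]
p_star = [1 / 3, 1 / 3, 1 / 6, 1 / 6]
```

Running `nash_average(B, p_star)` returns the zero-vector, while the uniform distribution is beatable and assigns agent $b$ a spurious positive average.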
Theorem 1 (main result for AvA; the main result for AvT is analogous, see section D).
The maxent NE has the following properties:

P1. Invariant: Nash averaging, with respect to the maxent NE, is invariant to redundancies in $\mathbf{A}$.

P2. Continuous: If $\mathbf{p}$ is a Nash equilibrium for $\mathbf{A}$ and $\|\mathbf{A} - \mathbf{B}\|_\infty \leq \epsilon$, then $\mathbf{p}$ is an $\epsilon$-Nash equilibrium for $\mathbf{B}$.

P3. Interpretable: (i) The maxent NE on $\mathbf{A}$ is the uniform distribution, $\mathbf{p}^* = \frac{1}{n}\mathbf{1}$, iff the meta-game is cyclic, i.e. $\operatorname{div}(\mathbf{A}) = \mathbf{0}$. (ii) If the meta-game is transitive, i.e. $\mathbf{A} = \operatorname{grad}(\mathbf{r})$, then the maxent NE is the uniform distribution on the player(s) with highest rating(s) – there could be a tie.

See section C for proof and formal definitions. For interpretability, if $\mathbf{A} = \operatorname{grad}(\mathbf{r})$ then the transitive rating $\mathbf{r}$ is all that matters: Nash averaging measures performance against the best player(s). If $\operatorname{div}(\mathbf{A}) = \mathbf{0}$ then no player is better than any other. Mixed cases cannot be described in closed form.
The continuity property is quite weak: theorem 1.2 shows the payoff is continuous – a team that is unbeatable for $\mathbf{A}$ is, at worst, $\epsilon$-beatable for nearby $\mathbf{B}$. Unfortunately, Nash equilibria themselves can jump discontinuously when $\mathbf{A}$ is modified slightly. Perturbed best response converges to a more stable approximation of Nash hofbauer:02 ; sandholm:10 , but that approximation is unfortunately not invariant.

Example 2 (continuity).
Consider the cyclic and transitive logit matrices
(17) 
The maxent Nash equilibria and Nash averages of these matrices are
(18) 
The maxent Nash is the uniform distribution over agents in the cyclic case, and is concentrated on the first player when it dominates the others. In between, the optimal team places most of its mass on the first and last players, and the Nash jumps discontinuously as the interpolation parameter crosses a threshold.
4.2 Application: reevaluation of agents on the Arcade Learning Environment
To illustrate the method, we re-evaluate the performance of agents on Atari BellemareNVB13 . Data is taken from results published in wang:16 ; vanhasselt:16 ; ostrovski:17 ; hessel:17 . Agents include rainbow, dueling networks, prioritized replay, pop-art, DQN, count-based exploration, and baselines like human, random-action and no-action. The 20 agents evaluated on 54 environments are represented by a $20 \times 54$ matrix $\mathbf{S}$. It is necessary to standardize units across environments with quite different reward structures: for each column we subtract the minimum score and divide by the range, so scores lie in $[0, 1]$.
We introduce a meta-game where the row meta-player aims to pick the best distribution on agents and the column meta-player aims to pick the hardest distribution on environments; see section D for details. We find a Nash equilibrium using an LP-solver; it should be possible to find the maxent Nash using the algorithm in ortiz:06 ; ortiz:07 . The Nash distributions are shown in figure 1. The supports of the distributions are the ‘core agents’ and the ‘core environments’ that form unexploitable teams. See the appendix for tables containing all skills and difficulties.
Figure 2A shows the skill of agents under uniform and Nash averaging over environments; panel B shows the difficulty of environments under uniform and Nash averaging over agents. There is a tie for top between the agents with nonzero mass – including the human baseline. This follows from the indifference principle for Nash equilibria: strategies in the support have equal payoff.
Our results suggest that the better-than-human performance observed on the Arcade Learning Environment arises because ALE is skewed towards environments that (current) agents do well on, and contains fewer environments testing skills specific to humans. Solving the meta-game automatically finds a distribution on environments that evens out the playing field and, simultaneously, identifies the most important agents and environments.
5 Conclusion
A powerful guiding principle when deciding what to measure is to find quantities that are invariant to naturally occurring transformations. The determinant is computed over a basis – however, the determinant is invariant to the choice of basis, since $\det(\mathbf{M}) = \det(\mathbf{T}\mathbf{M}\mathbf{T}^{-1})$ for any invertible matrix $\mathbf{T}$. Noether’s theorem implies the dynamics of a physical system with symmetries obey a conservation law. The speed of light is fundamental because it is invariant to the choice of inertial reference frame.

One must have symmetries in mind to talk about invariance. What are the naturally occurring symmetries in machine learning? The question admits many answers depending on the context, see e.g. diaconis:88 ; lecun:98 ; kondor:03 ; kondor:08 ; zaheer:17 ; hartford:18 ; kondor:18 . In the context of evaluating agents, which are typically built from neural networks, it is unclear a priori whether two seemingly different agents – based on their parameters or hyperparameters – are actually different. Further, it is increasingly common that environments and tasks are parameterized – or are learning agents in their own right; see self-play silver:17 ; silver:17a , adversarial attacks szegedy:13 ; tramer:18 ; kurakin:18 ; uesato:18 , and automated curricula sukhbaatar:17 . The overwhelming source of symmetry when evaluating learning algorithms is therefore redundancy: different agents, networks, algorithms, environments and tasks that do basically the same job.

Nash evaluation computes a distribution on players (agents, or agents and tasks) that automatically adjusts to redundant data. It thus provides an invariant approach to measuring agent-agent and agent-environment interactions. In particular, Nash averaging encourages a maximally inclusive approach to evaluation: computational cost aside, the method should only benefit from including as many tasks and agents as possible. Easy tasks or poorly performing agents will not bias the results. As such, Nash averaging is a significant step towards more objective evaluation.
Nash averaging is not always the right tool. Firstly, it is only as good as the data: garbage in, garbage out. Nash decides which environments are important based on the agents provided to it, and conversely. As a result, the method is blind to differences between environments that do not make a difference to agents, and vice versa. Nash-based evaluation is likely to be most effective when applied to a diverse array of agents and environments. Secondly, for good or ill, Nash averaging removes control from the user. One may have good reason to disagree with the distribution chosen by Nash. Finally, Nash is a harsh master. It takes an adversarial perspective and may not be the best approach to, say, constructing automated curricula – although boosting is a related approach that works well freund:96 ; schapire:12 . It is an open question whether alternative invariant evaluations can be constructed, game-theoretically or otherwise.
Acknowledgements. We thank Georg Ostrovski, Pedro Ortega, José Hernández-Orallo and Hado van Hasselt for useful feedback.
References
 (1) J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, “Imagenet: A largescale hierarchical image database,” in CVPR, 2009.
 (2) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” J. Artif. Intell. Res., vol. 47, pp. 253–279, 2013.

 (3) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012.
 (4) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 02 2015.

 (5) D. Donoho, “50 years of Data Science,” based on a presentation at the Tukey Centennial workshop, 2015.
 (6) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in arXiv:1312.6199, 2013.
 (7) F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “Ensemble Adversarial Training: Attacks and Defenses,” in ICLR, 2018.
 (8) A. Kurakin, I. Goodfellow, S. Bengio, Y. Dong, F. Liao, M. Liang, T. Pang, J. Zhu, X. Hu, C. Xie, J. Wang, Z. Zhang, Z. Ren, A. Yuille, S. Huang, Y. Zhao, Y. Zhao, Z. Han, J. Long, Y. Berdibekov, T. Akiba, S. Tokui, and M. Abe, “Adversarial Attacks and Defences Competition,” in arXiv:1804.00097, 2018.
 (9) J. Uesato, B. O’Donoghue, A. van den Oord, and P. Kohli, “Adversarial Risk and the Dangers of Evaluating Against Weak Attacks,” in ICML, 2018.
 (10) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature, vol. 550, pp. 354–359, 2017.
 (11) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” in arXiv:1712.01815, 2017.
 (12) E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for modelbased control,” in IROS, 2012.
 (13) C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen, “DeepMind Lab,” in arXiv:1612.03801, 2016.
 (14) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.
 (15) J. Z. Leibo, C. de Masson d’Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. G. Castañeda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, and M. M. Botvinick, “Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents,” in arXiv:1801.08116, 2018.
 (16) A. E. Elo, The Rating of Chess players, Past and Present. Ishi Press International, 1978.
 (17) R. Herbrich, T. Minka, and T. Graepel, “TrueSkill: a Bayesian skill rating system,” in NIPS, 2007.
 (18) M. Frean and E. R. Abraham, “Rock-scissors-paper and the survival of the weakest,” Proc. R. Soc. Lond. B, no. 268, pp. 1323–1327, 2001.
 (19) B. Kerr, M. A. Riley, M. W. Feldman, and B. J. M. Bohannan, “Local dispersal promotes biodiversity in a reallife game of rock–paper–scissors,” Nature, no. 418, pp. 171–174, 2002.
 (20) R. A. Laird and B. S. Schamp, “Competitive Intransitivity Promotes Species Coexistence,” The American Naturalist, vol. 168, no. 2, 2006.
 (21) A. Szolnoki, M. Mobilia, L.L. Jiang, B. Szczesny, A. M. Rucklidge, and M. Perc, “Cyclic dominance in evolutionary games: a review,” J R Soc Interface, vol. 11, no. 100, 2014.
 (22) M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu, “Population based training of neural networks,” CoRR, vol. abs/1711.09846, 2017.
 (23) D. Balduzzi, S. Racanière, J. Martens, J. Foerster, K. Tuyls, and T. Graepel, “The mechanics of n-player differentiable games,” in ICML, 2018.
 (24) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 (25) M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Perolat, D. Silver, and T. Graepel, “A Unified GameTheoretic Approach to Multiagent Reinforcement Learning,” in NIPS, 2017.
 (26) S. Legg and M. Hutter, “A universal measure of intelligence for artificial agents,” in IJCAI, 2005.
 (27) S. Legg and J. Veness, “An Approximation of the Universal Intelligence Measure,” in Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, 2013.
 (28) R. J. Solomonoff, “A formal theory of inductive inference I, II,” Inform. Control, vol. 7, no. 122, 224254, 1964.
 (29) A. N. Kolmogorov, “Three approaches to the quantitative definition of information,” Problems Inform. Transmission, vol. 1, no. 1, pp. 1–7, 1965.
 (30) G. J. Chaitin, “On the length of computer programs for computing finite binary sequences,” J Assoc. Comput. Mach., vol. 13, pp. 547–569, 1966.
 (31) C. Ferri, J. Hernández-Orallo, and R. Modroiu, “An experimental comparison of performance measures for classification,” Pattern Recognition Letters, no. 30, pp. 27–38, 2009.
 (32) J. Hernández-Orallo, P. Flach, and C. Ferri, “A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss,” JMLR, no. 13, pp. 2813–2869, 2012.
 (33) J. Hernández-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, 2017.
 (34) J. Hernández-Orallo, “Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement,” Artificial Intelligence Review, vol. 48, no. 3, pp. 397–447, 2017.
 (35) R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore, “PMLB: a large benchmark suite for machine learning evaluation and comparison,” BioData Mining, vol. 10, p. 36, Dec 2017.
 (36) C. Spearman, “‘General Intelligence,’ objectively determined and measured,” Am. J. Psychol., vol. 15, no. 201, 1904.
 (37) A. Woolley, C. Fabris, A. Pentland, N. Hashmi, and T. Malone, “Evidence for a Collective Intelligence Factor in the Performance of Human Groups,” Science, no. 330, pp. 686–688, 2010.
 (38) S. Bringsjord, “Psychometric artificial intelligence,” Journal of Experimental & Theoretical Artificial Intelligence, vol. 23, no. 3, pp. 271–277, 2011.
 (39) D. R. Hunter, “MM algorithms for generalized Bradley-Terry models,” Annals of Statistics, vol. 32, no. 1, pp. 384–406, 2004.
 (40) M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling, “Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents,” Journal of Artificial Intelligence Research (JAIR), vol. 61, pp. 523–562, 2018.
 (41) A. Liapis, G. N. Yannakakis, and J. Togelius, “Towards a Generic Method of Evaluating Game Levels,” in Artificial Intelligence in Digital Interactive Entertainment (AIIDE), 2013.
 (42) B. Horn, S. Dahlskog, N. Shaker, G. Smith, and J. Togelius, “A Comparative Evaluation of Procedural Level Generators in the Mario AI Framework,” in Foundations of Digital Games, 2014.
 (43) T. S. Nielsen, G. Barros, J. Togelius, and M. J. Nelson, “General video game evaluation using relative algorithm performance profiles,” in EvoApplications, 2015.
 (44) F. de Mesentier Silva, S. Lee, J. Togelius, and A. Nealen, “AIbased Playtesting of Contemporary Board Games,” in Foundations of Digital Games (FDG), 2017.
 (45) V. Volz, J. Schrum, J. Liu, S. M. Lucas, A. M. Smith, and S. Risi, “Evolving Mario Levels in the Latent Space of a Deep Convolutional Generative Adversarial Network,” in GECCO, 2018.
 (46) R. K. Hambleton, H. Swaminathan, and H. J. Rogers, Fundamentals of item response theory. Sage Publications, 1991.
 (47) F. Martínez-Plumed and J. Hernández-Orallo, “AI results for the Atari 2600 games: difficulty and discrimination using IRT,” in Workshop on Evaluating General-Purpose AI (EGPAI at IJCAI), 2017.
 (48) X. Jiang, L.H. Lim, Y. Yao, and Y. Ye, “Statistical ranking and combinatorial Hodge theory,” Math. Program., Ser. B, vol. 127, pp. 203–244, 2011.
 (49) O. Candogan, I. Menache, A. Ozdaglar, and P. A. Parrilo, “Flows and Decompositions of Games: Harmonic and Potential Games,” Mathematics of Operations Research, vol. 36, no. 3, pp. 474–503, 2011.
 (50) O. Candogan, A. Ozdaglar, and P. A. Parrilo, “Near-Potential Games: Geometry and Dynamics,” ACM Trans Econ Comp, vol. 1, no. 2, 2013.
 (51) O. Candogan, A. Ozdaglar, and P. A. Parrilo, “Dynamics in near-potential games,” Games and Economic Behavior, vol. 82, pp. 66–90, 2013.

 (52) W. E. Walsh, D. C. Parkes, and R. Das, “Choosing samples to compute heuristic-strategy Nash equilibrium,” in Proceedings of the Fifth Workshop on Agent-Mediated Electronic Commerce, 2003.
 (53) M. P. Wellman, “Methods for empirical game-theoretic analysis,” in Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pp. 1552–1556, 2006.
 (54) S. Phelps, S. Parsons, and P. McBurney, “An Evolutionary Game-Theoretic Comparison of Two Double-Auction Market Designs,” in Agent-Mediated Electronic Commerce VI, Theories for and Engineering of Distributed Mechanisms and Systems, AAMAS Workshop, pp. 101–114, 2004.
 (55) S. Phelps, K. Cai, P. McBurney, J. Niu, S. Parsons, and E. Sklar, “Auctions, Evolution, and Multiagent Learning,” in AAMAS and 7th European Symposium on Adaptive and Learning Agents and Multi-Agent Systems (ALAMAS), pp. 188–210, 2007.
 (56) M. Ponsen, K. Tuyls, M. Kaisers, and J. Ramon, “An evolutionary game-theoretic analysis of poker strategies,” Entertainment Computing, vol. 1, no. 1, pp. 39–45, 2009.
 (57) D. Bloembergen, K. Tuyls, D. Hennes, and M. Kaisers, “Evolutionary dynamics of multiagent learning: A survey,” J. Artif. Intell. Res. (JAIR), vol. 53, pp. 659–697, 2015.
 (58) K. Tuyls, J. Perolat, M. Lanctot, J. Z. Leibo, and T. Graepel, “A Generalised Method for Empirical Game Theoretic Analysis,” in AAMAS, 2018.
 (59) M. Dudik, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi, “Contextual Dueling Bandits,” in COLT, 2015.
 (60) A. Balsubramani, Z. Karnin, R. E. Schapire, and M. Zoghi, “Instance-dependent Regret Bounds for Dueling Bandits,” in COLT, 2016.
 (61) P. R. Jordan, C. Kiekintveld, and M. P. Wellman, “Empirical game-theoretic analysis of the TAC supply chain game,” in 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007), Honolulu, Hawaii, USA, May 14–18, 2007, p. 193, 2007.
 (62) P. R. Jordan, Practical Strategic Reasoning with Applications in Market Games. PhD thesis, 2010.
 (63) J. von Neumann and O. Morgenstern, Theory of Games and Economic Behavior. Princeton University Press, Princeton NJ, 1944.
 (64) J. F. Nash, “Equilibrium Points in N-Person Games,” Proc Natl Acad Sci U S A, vol. 36, no. 1, pp. 48–49, 1950.
 (65) J. Hofbauer and W. H. Sandholm, “On the global convergence of stochastic fictitious play,” Econometrica, vol. 70, no. 6, pp. 2265–2294, 2002.
 (66) W. H. Sandholm, Population Games and Evolutionary Dynamics. MIT Press, 2010.
 (67) Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling Network Architectures for Deep Reinforcement Learning,” in ICML, 2016.
 (68) H. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, “Learning values across many orders of magnitude,” in NIPS, 2016.
 (69) G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, “CountBased Exploration with Neural Density Models,” in ICML, 2017.
 (70) M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining Improvements in Deep Reinforcement Learning,” in AAAI, 2018.
 (71) L. E. Ortiz, R. E. Schapire, and S. M. Kakade, “Maximum entropy correlated equilibrium,” Technical Report TR200621, CSAIL MIT, 2006.
 (72) L. E. Ortiz, R. E. Schapire, and S. M. Kakade, “Maximum entropy correlated equilibria,” in AISTATS, 2007.
 (73) P. Diaconis, Group Representations in Probability and Statistics. Institute of Mathematical Statistics, 1988.
 (74) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 (75) R. Kondor and T. Jebara, “A kernel between sets of vectors,” in ICML, 2003.
 (76) R. Kondor, “Group theoretical methods in machine learning,” in PhD dissertation, 2008.
 (77) M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep Sets,” in NIPS, 2017.
 (78) J. Hartford, D. R. Graham, K. LeytonBrown, and S. Ravanbakhsh, “Deep Models of Interactions Across Sets,” in ICML, 2018.
 (79) R. Kondor, Z. Lin, and S. Trivedi, “Clebsch–Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network,” in NIPS, 2018.
 (80) S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus, “Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play,” in ICLR, 2017.
 (81) Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, 1996.
 (82) R. Schapire and Y. Freund, Boosting: Foundations and Algorithms. MIT Press, 2012.
 (83) R. J. Vandenberg and C. E. Lance, “A Review and Synthesis of the Measurement Invariance Literature: Suggestions, Practices, and Recommendations for Organizational Research,” Organizational Research Methods, vol. 3, no. 1, pp. 4–70, 2000.
A Invariance: Further motivation
Consider agents A, B, C evaluated on a benchmark suite comprising three tasks:
          task 1   task 2   task 3   average   rank
agent A     89       93       76        86       1
agent B     85       85       85        85       2
agent C     79       74       99        84       3
On average, agent A performs best and agent C performs worst. Consider a second benchmark suite containing an additional fourth task. On the second benchmark suite, agent A performs worst and agent C performs best on average. However, a closer look at the second suite reveals that the additional task is a minor variant of one of the original three tasks.
          task 1   task 2   task 3a   task 3b   average   rank
agent A     89       93       76        77         84       3
agent B     85       85       85        84         85       2
agent C     79       74       99        98         88       1
Measuring performance by uniformly averaging over tasks in a benchmark suite is sensitive to the set of tasks that are included in the suite. Including redundant tasks, whether consciously or not, can easily skew average performance in favor of or against particular agents.
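The effect of the redundant task is easy to reproduce in a few lines (a sketch; the agents and scores are taken directly from the tables above):

```python
# Sketch: uniform averaging is not invariant to duplicating a task.
suite1 = {"A": [89, 93, 76], "B": [85, 85, 85], "C": [79, 74, 99]}
# Suite 2 adds task 3b, a near-copy of task 3.
suite2 = {"A": [89, 93, 76, 77], "B": [85, 85, 85, 84], "C": [79, 74, 99, 98]}

def ranking(suite):
    """Agents sorted by uniform average score, best first."""
    avg = {agent: sum(scores) / len(scores) for agent, scores in suite.items()}
    return sorted(avg, key=avg.get, reverse=True)

print(ranking(suite1))  # ['A', 'B', 'C']
print(ranking(suite2))  # ['C', 'B', 'A'] -- the duplicated task flips the ranking
```

One near-duplicate task is enough to reverse the ranking entirely.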
The problem becomes more serious when agent performance is measured against other agents, as in games such as Go, Chess or StarCraft. It is easy to manipulate the measured performance of agents by tweaking the composition of the population used to evaluate them. Consider the following example:
          agent A   agent B   agent C   Elo
agent A     0.5       0.9       0.1      0
agent B     0.1       0.5       0.9      0
agent C     0.9       0.1       0.5      0
The three agents exhibit rock-paper-scissors dynamics; their Elo ratings (normalized to sum to zero) are all zero. However, adding a second copy of agent C decreases the Elo rating of agent A and increases the Elo rating of agent B:
          agent A   agent B   agent C   agent C   Elo
agent A     0.5       0.9       0.1       0.1     -63
agent B     0.1       0.5       0.9       0.9     +63
agent C     0.9       0.1       0.5       0.5       0
agent C     0.9       0.1       0.5       0.5       0
That is, the Elo ratings of agents A and B are easily manipulated by changing the structure of the population.
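The manipulation can be checked numerically. The sketch below fits Elo ratings to a win-probability matrix by batch gradient updates; the natural-log logistic units (rather than the conventional 400/log10 Elo scale), learning rate, and iteration count are arbitrary choices for illustration:

```python
import math

def fit_elo(P, iters=2000, lr=0.1):
    """Fit Elo ratings (natural-log units, normalized to sum to zero) to a
    win-probability matrix P via batch updates r_i += lr * sum_j (p_ij - phat_ij)."""
    n = len(P)
    r = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            grad = sum(P[i][j] - 1 / (1 + math.exp(r[j] - r[i]))
                       for j in range(n) if j != i)
            r[i] += lr * grad
    mean = sum(r) / n
    return [x - mean for x in r]

rps = [[0.5, 0.9, 0.1],
       [0.1, 0.5, 0.9],
       [0.9, 0.1, 0.5]]
print(fit_elo(rps))   # all ratings ~0: Elo is blind to the cycle

dup = [[0.5, 0.9, 0.1, 0.1],
       [0.1, 0.5, 0.9, 0.9],
       [0.9, 0.1, 0.5, 0.5],
       [0.9, 0.1, 0.5, 0.5]]
print(fit_elo(dup))   # A's rating drops below zero, B's rises; the copies of C stay at 0
```

Duplicating agent C changes nothing about the underlying skills, yet moves A's and B's fitted ratings in opposite directions.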
The examples above suggest it is important to find evaluation metrics that are
invariant to redundant changes in the population of agents or suite of tasks.
Related work.
A different notion of measurement invariance has been proposed in the psychometric and consumer-research literatures vandenberg:00 . There, measurement invariance refers to the statistical property that a measurement measures the same construct across a predefined set of groups. For example, whether a question in an IQ test is measurement invariant depends on whether the question is interpreted in the same way by individuals with different cultural backgrounds.
B Proofs of propositions
Proof of proposition 1.
Proposition 1.
Batch Elo updates are stable (i.e., at a stationary point) if and only if the matrices of empirical probabilities and predicted probabilities have the same row-sums (or, equivalently, the same column-sums):

(19) $\sum_j \bar{p}_{ij} = \sum_j \hat{p}_{ij}$ for all $i$.

Proof.
Player $i$'s rating $r_i$ is updated in one batch after observing win-loss probabilities $\bar{p}_{ij}$ against each player $j$. Observe that

(20) $\Delta r_i = \eta \sum_j \big(\bar{p}_{ij} - \hat{p}_{ij}\big),$

which vanishes for every $i$ iff the row-sums agree; the column-sum formulation is equivalent since $\bar{p}_{ji} = 1 - \bar{p}_{ij}$ and $\hat{p}_{ji} = 1 - \hat{p}_{ij}$. The result follows. ∎
Proof of proposition 2.
Proposition 2.
Let $\mathbf{A}$ be an $m \times n$ matrix with rank $r$ and singular value decomposition $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$. Construct the antisymmetric matrix

(21) $\mathbf{B} = \begin{pmatrix} \mathbf{0} & \mathbf{A} \\ -\mathbf{A}^\top & \mathbf{0} \end{pmatrix}.$

Then the thin Schur decomposition of $\mathbf{B}$ is $\mathbf{B} = \mathbf{Q}\mathbf{T}\mathbf{Q}^\top$, where the eigenpairs in the $2\times 2$ diagonal blocks of $\mathbf{T}$ are the singular values in $\boldsymbol{\Sigma}$ and

(22) $\mathbf{T} = \bigoplus_{k=1}^{r} \begin{pmatrix} 0 & \sigma_k \\ -\sigma_k & 0 \end{pmatrix}, \qquad \mathbf{Q} = \begin{pmatrix} \mathbf{u}_1 & \mathbf{0} & \cdots & \mathbf{u}_r & \mathbf{0} \\ \mathbf{0} & \mathbf{v}_1 & \cdots & \mathbf{0} & \mathbf{v}_r \end{pmatrix}.$
Proof.
Direct computation; multiply out the matrices. ∎
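The proposition is straightforward to sanity-check numerically. The block construction of B below is our reading of (21), with Q interleaving left and right singular vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4
A = rng.standard_normal((m, n))             # generic matrix: rank r = min(m, n)
r = min(m, n)
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Antisymmetric block matrix built from A.
B = np.block([[np.zeros((m, m)), A], [-A.T, np.zeros((n, n))]])

# Columns of Q interleave left/right singular vectors; T is block-diagonal
# with 2x2 blocks [[0, sigma_k], [-sigma_k, 0]].
Q = np.zeros((m + n, 2 * r))
T = np.zeros((2 * r, 2 * r))
for k in range(r):
    Q[:m, 2 * k] = U[:, k]
    Q[m:, 2 * k + 1] = Vt[k, :]
    T[2 * k, 2 * k + 1] = sigma[k]
    T[2 * k + 1, 2 * k] = -sigma[k]

assert np.allclose(B, Q @ T @ Q.T)          # the thin Schur form reproduces B
```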
Proof of proposition 3.
Proposition 3.
(i) If the probabilities are generated by Elo ratings, $\bar{p}_{ij} = \sigma(r_i - r_j)$, then the divergence of their logits recovers the ratings. That is, with ratings normalized to sum to zero,

(23) $\operatorname{div}\big(\operatorname{logit} \bar{\mathbf{P}}\big) = \mathbf{r}.$

(ii) There is an Elo rating that generates probabilities $\bar{\mathbf{P}}$ iff $\operatorname{curl}\big(\operatorname{logit} \bar{\mathbf{P}}\big) = \mathbf{0}$. Alternatively, iff $\operatorname{logit}(\bar{p}_{ij}) + \operatorname{logit}(\bar{p}_{jk}) + \operatorname{logit}(\bar{p}_{ki}) = 0$ for all triples $(i, j, k)$.

Proof.
For the first claim, apply the definitions and recall that $\operatorname{logit}(\sigma(x)) = x$. For the second, apply the Hodge decomposition from section 2.2. ∎
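Part (ii) gives a directly computable test for Elo-representability: the logits must sum to zero around every 3-cycle. A sketch, with illustrative ratings:

```python
import math
from itertools import permutations

def logit(p):
    return math.log(p / (1 - p))

# Probabilities generated by Elo ratings: p_ij = sigma(r_i - r_j).
r = [1.0, 0.2, -0.7]
n = len(r)
P_elo = [[1 / (1 + math.exp(r[j] - r[i])) for j in range(n)] for i in range(n)]

def max_triple_curl(P):
    """Largest |logit p_ij + logit p_jk + logit p_ki| over all triples of distinct players."""
    n = len(P)
    return max(abs(logit(P[i][j]) + logit(P[j][k]) + logit(P[k][i]))
               for i, j, k in permutations(range(n), 3))

print(max_triple_curl(P_elo))   # ~0: Elo-representable

P_rps = [[0.5, 0.9, 0.1], [0.1, 0.5, 0.9], [0.9, 0.1, 0.5]]
print(max_triple_curl(P_rps))   # > 0: no Elo ratings can generate this matrix
```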
Proof of proposition 4.
Proposition 4 (maxent NE).
The zero-sum game on an antisymmetric matrix $\mathbf{A}$ has a unique symmetric Nash equilibrium $(\mathbf{p}^*, \mathbf{p}^*)$, with greater entropy than any other Nash equilibrium.
Proof.
The Nash equilibria of a two-player zero-sum game are rectangular: if $(\mathbf{p}_1, \mathbf{q}_1)$ and $(\mathbf{p}_2, \mathbf{q}_2)$ are Nash equilibria then so are $(\mathbf{p}_1, \mathbf{q}_2)$ and $(\mathbf{p}_2, \mathbf{q}_1)$. Further, they form a convex polytope. Since $\mathbf{A} = -\mathbf{A}^\top$, the set of Nash equilibria is also symmetric: if $(\mathbf{p}, \mathbf{q})$ is a Nash equilibrium then so is $(\mathbf{q}, \mathbf{p})$. The entropy is strictly concave and therefore achieves a unique maximum on the compact, convex, symmetric set of Nash equilibria. ∎
C Proof of theorem 1
Theorem (main result for AvA).
The maxent NE $\mathbf{p}^*$ of $\mathbf{A}$ has the following properties:

Invariant: Nash averaging, with respect to the maxent NE, is invariant to redundancies in $\mathbf{A}$.

Continuous: If $\mathbf{p}^*$ is a Nash equilibrium for $\mathbf{A}$ and $\|\mathbf{A} - \mathbf{B}\|_\infty \le \epsilon$, then $\mathbf{p}^*$ is an $\epsilon$-Nash equilibrium for $\mathbf{B}$.

Interpretable: (i) The maxent NE on $\mathbf{A}$ is the uniform distribution, $\mathbf{p}^* = \frac{1}{n}\mathbf{1}$, iff the meta-game is cyclic, i.e. $\mathbf{A}\cdot\mathbf{1} = \mathbf{0}$. (ii) If the meta-game is transitive, i.e. $a_{ij} = r_i - r_j$ for some ratings $\mathbf{r}$, then the maxent NE is the uniform distribution on the player(s) with highest rating(s) – there could be a tie.
C.1 Proof of theorem 1.1
First, we more precisely formalize invariance to redundancy.
Definition 3 (invariance to copying the last row and column).
Given an $n \times n$ antisymmetric matrix $\mathbf{A}$, denote its rightmost column by $\mathbf{c}$. Assume the rightmost column (and bottom row) of $\mathbf{A}$ differs from all other columns. Construct the $(n+1)\times(n+1)$ antisymmetric matrix

(24) $\mathbf{A}^+ = \begin{pmatrix} \mathbf{A} & \mathbf{c} \\ -\mathbf{c}^\top & 0 \end{pmatrix}$

by adding an additional copy of the rightmost column (and bottom row) to $\mathbf{A}$. A family of functions

(25) $\Big\{ f_n : \{\text{$n\times n$ antisymmetric matrices}\} \to \Delta_{n-1} \Big\}_{n \ge 1}$

is invariant to adding a row and column according to (24) if, writing $f_n(\mathbf{A}) = (p_1, \ldots, p_n)$,

(26) $f_{n+1}(\mathbf{A}^+) = \big(p_1, \ldots, p_{n-1}, \tfrac{p_n}{2}, \tfrac{p_n}{2}\big).$

If the copied row is not unique and receives positive mass under the maxent Nash, then the maxent Nash will already be spreading mass across the copies. In that case, adding yet another copy will result in the maxent Nash on the larger matrix spreading mass evenly across all copies.
Lemma 1.
Suppose $\mathbf{A}$ is antisymmetric with Nash equilibrium $\mathbf{p} = (p_1, \ldots, p_n)$. Construct $\mathbf{A}^+$ from $\mathbf{A}$ by adding a redundant copy of the rightmost column and bottom row according to (24). Then

(27) $\mathbf{p}_\lambda := \big(p_1, \ldots, p_{n-1}, \lambda p_n, (1-\lambda) p_n\big)$

is a Nash equilibrium for $\mathbf{A}^+$ for all $\lambda \in [0, 1]$. Conversely, if $\tilde{\mathbf{p}} = (\tilde{p}_1, \ldots, \tilde{p}_{n+1})$ is a Nash equilibrium for $\mathbf{A}^+$ then

(28) $\big(\tilde{p}_1, \ldots, \tilde{p}_{n-1}, \tilde{p}_n + \tilde{p}_{n+1}\big)$

is a Nash equilibrium for $\mathbf{A}$.
Proof.
Since the value of the game on an antisymmetric matrix is zero, it follows that $\mathbf{p}$ is a Nash equilibrium iff all the coordinates of $\mathbf{p}^\top \mathbf{A}$ are nonnegative, i.e. $\mathbf{p}^\top \mathbf{A} \ge \mathbf{0}$. Direct computation shows that $\mathbf{p}_\lambda^\top \mathbf{A}^+ = \big(\mathbf{p}^\top \mathbf{A}, (\mathbf{p}^\top \mathbf{A})_n\big) \ge \mathbf{0}$, which implies $\mathbf{p}_\lambda$ is a Nash equilibrium for $\mathbf{A}^+$. The converse follows similarly. ∎
Finally, we prove theorem 1.1.
Proof.
The definition and lemma are stated in the particular case where the last column and row are copied; this is simply for notational convenience. They generalize trivially to copying arbitrary rows/columns into arbitrary positions, and can be applied inductively to the case where $\mathbf{A}^+$ is constructed from $\mathbf{A}$ by inserting multiple redundant copies. ∎
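Lemma 1 is easy to verify numerically on a small example (a sketch; the rock-paper-scissors payoffs are illustrative, and the mass-splitting follows the lemma's construction):

```python
import numpy as np

def is_nash(p, A, tol=1e-9):
    """For antisymmetric A (game value 0), p is a symmetric Nash iff no pure
    deviation gains: all coordinates of A @ p are <= 0 (equivalently p @ A >= 0)."""
    return bool(np.all(A @ p <= tol))

# Rock-paper-scissors payoffs; the uniform distribution is the Nash equilibrium.
A = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])
p = np.full(3, 1 / 3)
assert is_nash(p, A)

# Copy the last row/column as in (24).
c = A[:, -1]
A_plus = np.block([[A, c[:, None]], [-c[None, :], np.zeros((1, 1))]])

# Splitting the copied player's mass in any proportion stays a Nash, as in (27).
for lam in [0.0, 0.3, 1.0]:
    p_lam = np.array([p[0], p[1], lam * p[2], (1 - lam) * p[2]])
    assert is_nash(p_lam, A_plus)
print("lemma 1 verified on rock-paper-scissors")
```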
C.2 Proof of theorem 1.2
Recall that the max-norm on matrices is $\|\mathbf{A}\|_\infty = \max_{i,j} |a_{ij}|$.
Definition 4 ($\epsilon$-Nash equilibrium).
A joint strategy $(\mathbf{p}^*, \mathbf{q}^*)$ is an $\epsilon$-Nash equilibrium for $\mathbf{A}$ if the benefit from either player deviating, separately, is at most $\epsilon$:

(29) $\mathbf{p}^\top \mathbf{A} \mathbf{q}^* \le \mathbf{p}^{*\top} \mathbf{A} \mathbf{q}^* + \epsilon \quad\text{and}\quad \mathbf{p}^{*\top} \mathbf{A} \mathbf{q} \ge \mathbf{p}^{*\top} \mathbf{A} \mathbf{q}^* - \epsilon \quad \text{for all } \mathbf{p}, \mathbf{q}.$
We are now ready to prove

Continuous: If $\mathbf{p}^*$ is a Nash equilibrium for $\mathbf{A}$ and $\|\mathbf{A} - \mathbf{B}\|_\infty \le \epsilon$, then $\mathbf{p}^*$ is an $\epsilon$-Nash equilibrium for $\mathbf{B}$.
Proof.
Suppose $\mathbf{p}^*$ is a Nash equilibrium for the antisymmetric matrix $\mathbf{A}$. Observe that

(30) $\mathbf{q}^\top \mathbf{B} \mathbf{q} = 0$

for any distribution $\mathbf{q}$, because $\mathbf{B}$ is antisymmetric. It follows that

(31) $\max_{\mathbf{q}} \mathbf{q}^\top \mathbf{B} \mathbf{p}^* \le \max_{\mathbf{q}} \mathbf{q}^\top \mathbf{A} \mathbf{p}^* + \max_{\mathbf{q}} \mathbf{q}^\top (\mathbf{B} - \mathbf{A}) \mathbf{p}^*.$

The first term on the right-hand-side is at most zero, since $\mathbf{p}^*$ is a Nash equilibrium for $\mathbf{A}$ and the value of the game is zero. The second term on the right-hand-side is at most $\epsilon$, because $\|\mathbf{A} - \mathbf{B}\|_\infty \le \epsilon$ and $\mathbf{q}$ and $\mathbf{p}^*$ are probability distributions. ∎
Note that since $\|\mathbf{A}\|_\infty \le \|\mathbf{A}\|_2 \le \|\mathbf{A}\|_F$ for any matrix $\mathbf{A}$, the divergence from Nash is also controlled by how well $\mathbf{B}$ approximates $\mathbf{A}$ in the operator or Frobenius norms.
The proof above is adapted from the proof of the following lemma in Tuyls18 .
Lemma 2 ($\epsilon$-Nash on approximate games).
Suppose $(\mathbf{p}, \mathbf{q})$ is a Nash equilibrium for $\mathbf{A}$ and that $\|\mathbf{A} - \mathbf{B}\|_\infty \le \epsilon$. Then $(\mathbf{p}, \mathbf{q})$ is a $2\epsilon$-Nash equilibrium for $\mathbf{B}$.
Our result is slightly sharper because we specialize to antisymmetric matrices.
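The continuity property can also be checked numerically (a sketch; the base matrix and the perturbation size eps are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# An antisymmetric 3x3 matrix with zero row-sums: uniform is an exact Nash.
A = np.array([[0., 2., -2.], [-2., 0., 2.], [2., -2., 0.]])
p = np.full(3, 1 / 3)
assert np.allclose(A @ p, 0)          # p is an exact Nash equilibrium for A

# Perturb A by an antisymmetric noise matrix with max-norm <= eps.
eps = 0.05
N = rng.uniform(-eps, eps, size=(3, 3))
N = np.triu(N, 1)
N = N - N.T                           # keep the perturbed matrix antisymmetric
B = A + N

# Best pure deviation against p gains at most eps (the baseline p^T B p is 0).
gain = np.max(B @ p)
assert gain <= eps + 1e-12
print("deviation gain:", gain, "<= eps =", eps)
```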
C.3 Proof of theorem 1.3
Proof.
(i) If $\mathbf{A}\cdot\mathbf{1} = \mathbf{0}$ then $\mathbf{A}\cdot\tfrac{1}{n}\mathbf{1} = \mathbf{0}$ and $\tfrac{1}{n}\mathbf{1}^\top \mathbf{A} = \mathbf{0}$, implying the uniform distribution is a Nash equilibrium because there is no incentive to deviate from $\tfrac{1}{n}\mathbf{1}$. The uniform distribution also has maximum entropy. Conversely, suppose $\mathbf{A}\cdot\mathbf{1} \neq \mathbf{0}$. Then $\mathbf{A}\cdot\tfrac{1}{n}\mathbf{1}$ has at least one positive and one negative coordinate, because we know $\mathbf{1}^\top \mathbf{A} \mathbf{1} = 0$ by antisymmetry. It follows that if the row player chooses $\tfrac{1}{n}\mathbf{1}$ then the column player is incentivized to choose a distribution with more mass on the positive coordinate and less on the negative. In other words, the column player will not play the uniform distribution, and the uniform distribution is therefore not a Nash equilibrium.
(ii) By assumption $a_{ij} = r_i - r_j$, so the payoff $\mathbf{p}^\top \mathbf{A} \mathbf{q} = \mathbf{p}^\top \mathbf{r} - \mathbf{r}^\top \mathbf{q}$ decouples into an independent maximization problem with respect to $\mathbf{p}$ and minimization problem with respect to $\mathbf{q}$. It follows that the optimal distribution $\mathbf{p}$ concentrates mass on the maximal coordinate(s) of $\mathbf{r}$, and so does $\mathbf{q}$. That is, to be a Nash equilibrium, $\mathbf{p}$ and $\mathbf{q}$ must place their mass on the maximal coordinate if it is unique, and can distribute it arbitrarily over the set of maximal coordinates if there is a tie. Adding the condition that the Nash equilibrium has maximum entropy entails placing the uniform distribution over the maximal coordinate(s). ∎
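A quick check of the transitive case (the ratings are illustrative, with a deliberate tie at the top):

```python
import numpy as np

# Transitive meta-game: a_ij = r_i - r_j for ratings r.
r = np.array([1.0, 2.0, 2.0, 0.5])    # two top-rated players tie
A = r[:, None] - r[None, :]

# Maxent NE: uniform over the argmax of r, splitting mass across the tie.
p = (r == r.max()).astype(float)
p /= p.sum()                          # [0, 0.5, 0.5, 0]

# No pure deviation gains: (A @ p)_i = r_i - max(r) <= 0 for every i.
assert np.all(A @ p <= 1e-12)
print(p)
```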
D Nash averaging for agentvstask
Given an $m \times n$ score matrix $\mathbf{S}$ (rows indexed by agents, columns by tasks), construct the antisymmetric matrix

(32) $\mathbf{A} = \begin{pmatrix} \mathbf{0} & \mathbf{S} \\ -\mathbf{S}^\top & \mathbf{0} \end{pmatrix}.$
Note that this differs from the antisymmetrization used in section 3.3; see the next remark.
Remark 1.
The graph structure underlying AvT is bipartite: agents interact with tasks and tasks with agents, but there are no direct agent-agent or task-task interactions. When done in full generality, the definitions of $\operatorname{grad}$, $\operatorname{div}$ and $\operatorname{curl}$ take into account the graph structure, see jiang:11 . In particular, $\operatorname{grad}$, $\operatorname{div}$ and $\operatorname{curl}$ are computed differently on bipartite graphs than on fully connected graphs – note the definitions in section 2.2 are specific to fully connected graphs. Working in full generality is overkill for our purposes. It suffices to introduce slightly ad hoc notation to handle the specific case of AvT.
Introduce the notation

(33) $\mathbf{s} := \tfrac{1}{n}\,\mathbf{S}\cdot\mathbf{1}_n, \qquad \mathbf{d} := -\tfrac{1}{m}\,\mathbf{S}^\top\cdot\mathbf{1}_m,$

where $\mathbf{s}$ measures the uniform average skill of agents on tasks and $\mathbf{d}$ measures the uniform average difficulty of tasks for agents. Let
(34) 
Define a two-player zero-sum meta-game
(35) 
The setup is the same as for AvA in the main text, except the row and column meta-players each play two distributions: one on agents and one on tasks/environments. The same argument as in proposition 4 shows there is a unique symmetric maxent Nash equilibrium $(\mathbf{p}^*_a, \mathbf{p}^*_e)$.
Definition 5.
The maxent Nash evaluation method for AvT is

(36) $\mathbf{n}_e := -\mathbf{S}^\top \cdot \mathbf{p}^*_a, \qquad \mathbf{n}_a := \mathbf{S} \cdot \mathbf{p}^*_e,$

where $\mathbf{p}^*_a$ and $\mathbf{p}^*_e$ are the maxent Nash equilibria over agents and environments, and $\mathbf{n}_e$ and $\mathbf{n}_a$ are the Nash averages quantifying the difficulty of environments (Nash averaged over agents) and the skill of agents (Nash averaged over environments) respectively.
Suppose $\mathbf{S}$ decomposes as

(37) $\mathbf{S} = \mathbf{s}\cdot\mathbf{1}_n^\top - \mathbf{1}_m\cdot\mathbf{d}^\top$

where $\mathbf{s} \in \mathbb{R}^m$ and $\mathbf{d} \in \mathbb{R}^n$. Observe that

(38) $\mathbf{p}_a^\top\, \mathbf{S}\, \mathbf{q}_e = \mathbf{p}_a^\top \mathbf{s} - \mathbf{d}^\top \mathbf{q}_e.$
If the decomposition (37) holds then the game reduces to

(39) $\big(\mathbf{p}_a^\top \mathbf{s} + \mathbf{d}^\top \mathbf{p}_e\big) - \big(\mathbf{q}_a^\top \mathbf{s} + \mathbf{d}^\top \mathbf{q}_e\big)$

and so the row player maximizes its payoff by putting all its agent-mass on the most skillful agent(s) and all of its environment-mass on the most difficult task(s) – and similarly for the column player.
It is easy to check that the maxent Nash has a uniform distribution on all agents iff $\mathbf{S}\cdot\mathbf{1}_n = \mathbf{0}$, and has a uniform distribution on tasks iff $\mathbf{1}_m^\top\cdot\mathbf{S} = \mathbf{0}$. We thus obtain
Theorem 2 (main result for AvT).
The maxent NE has the following properties:

Invariant: Nash averaging is invariant to redundancies in $\mathbf{S}$.

Continuous: If $(\mathbf{p}^*_a, \mathbf{p}^*_e)$ is a Nash equilibrium for $\mathbf{S}$ and $\|\mathbf{S} - \mathbf{S}'\|_\infty \le \epsilon$, then it is an $\epsilon$-Nash equilibrium for $\mathbf{S}'$.

Interpretable: (i) The agent component of the maxent NE on $\mathbf{S}$ is the uniform distribution on agents, $\frac{1}{m}\mathbf{1}_m$, iff $\mathbf{S}\cdot\mathbf{1}_n = \mathbf{0}$.
(ii) The task component of the maxent NE on $\mathbf{S}$ is the uniform distribution on tasks, $\frac{1}{n}\mathbf{1}_n$, iff $\mathbf{1}_m^\top\cdot\mathbf{S} = \mathbf{0}$.
(iii) If the meta-game is transitive, i.e. $\mathbf{S} = \mathbf{s}\cdot\mathbf{1}_n^\top - \mathbf{1}_m\cdot\mathbf{d}^\top$, then the maxent NE is the uniform distribution on the most skillful agent(s) and the uniform distribution on the most difficult task(s) – there could be ties.
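The constructions of this section can be sketched in code (the block matrix follows our reading of (32); the uniform averages correspond to the special case where both meta-players play uniform distributions, and the score matrix is hypothetical):

```python
import numpy as np

# Hypothetical 3-agent x 2-task score matrix (rows: agents, cols: tasks).
S = np.array([[89., 76.],
              [85., 85.],
              [79., 99.]])
m, n = S.shape

# Embed AvT in the AvA formalism via the bipartite antisymmetric matrix.
A = np.block([[np.zeros((m, m)), S], [-S.T, np.zeros((n, n))]])
assert np.allclose(A, -A.T)

# With uniform meta-players, the Nash averages reduce to uniform averages:
# skill of each agent and (negated) difficulty of each task.
skill = S @ np.full(n, 1 / n)           # uniform average skill per agent
difficulty = -S.T @ np.full(m, 1 / m)   # uniform average difficulty per task
print(skill, difficulty)
```

Replacing the uniform distributions with the maxent Nash distributions over agents and tasks yields the Nash averages of definition 5.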
E Code for computing mElo updates
The routine takes as input: a pair of players $(i, j)$; the probability $\bar{p}_{ij}$ of player $i$ beating player $j$ (which could be 0 or 1 if only a single match is observed on the given round); the rating vector $\mathbf{r}$; and the matrix $\mathbf{C}$ quantifying nontransitive interactions. It returns updates to the $i^{\text{th}}$ and $j^{\text{th}}$ entries of $\mathbf{r}$ and $\mathbf{C}$.
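A minimal sketch of such an update for mElo2, where each player carries a 2-dimensional c-vector and the predicted win probability adds a rank-2 antisymmetric correction to the rating difference. The learning rates, and the detail that the coupled c-updates are applied sequentially rather than simultaneously, are our assumptions rather than the paper's exact routine:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def melo2_update(i, j, p_ij, r, C, eta_r=0.1, eta_c=0.01):
    """One mElo2 update: the prediction uses the rating difference plus the
    antisymmetric correction c_i^T Omega c_j = c_i0*c_j1 - c_i1*c_j0."""
    p_hat = sigmoid(r[i] - r[j] + C[i][0] * C[j][1] - C[i][1] * C[j][0])
    delta = p_ij - p_hat
    r[i] += eta_r * delta
    r[j] -= eta_r * delta
    # Gradient of the correction term with respect to each player's c-vector.
    C[i][0] += eta_c * delta * C[j][1]
    C[i][1] -= eta_c * delta * C[j][0]
    C[j][0] -= eta_c * delta * C[i][1]
    C[j][1] += eta_c * delta * C[i][0]
    return delta

r = [0.0, 0.0]
C = [[1.0, 0.0], [0.0, 1.0]]
delta = melo2_update(0, 1, 1.0, r, C)   # player 0 beats player 1
assert r[0] > 0 > r[1]
```

The c-vectors let the model fit cyclic interactions (such as the rock-paper-scissors example of appendix A) that a scalar rating cannot represent.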