The Elo system for rating chess players [Elo86], named after its creator Arpad Elo, has been employed by the World Chess Federation for over four decades as a method for assessing players’ strengths and ranking them in official tournaments. Although not without controversy, it is accepted as generally reliable, and is also used in other games and sports such as Scrabble, Go, American football and major league basketball.
The Elo rating system is based on the model of paired comparisons [Dav88], which can be applied to the problem of ranking any set of objects for which we have a preference relation. The model is particularly useful in that a ranking can be obtained in situations where a preference exists only for some of the pairs of objects under consideration. Paired comparison models have been successfully applied to measure ability in competitive games and sports [Joe91, Gli99], the most notable example being the widely used Elo system for rating chess players.
Bayesian rating systems. Both these systems estimate, in addition to the rating, the degree of uncertainty that the rating represents the player’s true ability. The uncertainty allows the system to control the change made to the rating after a game has been played. In particular, if the uncertainty is low then the changes made to the rating should be smaller as the rating is already reasonably accurate, while if the uncertainty is high then the changes made to the rating should be larger.
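Here we adopt the Bradley-Terry model [BT52], which provides the theoretical underpinning of Elo’s model, where the probability that a player $A$, whose strength is $s_A$, wins against a player $B$, whose strength is $s_B$, is given by the logistic function $L(s_A - s_B)$, namely
$$\Pr(A \text{ beats } B) = L(s_A - s_B) = \frac{1}{1 + 10^{-(s_A - s_B)/c}}, \qquad (1)$$
where $c$ is a positive scaling factor. We note that $L(x)$ is strictly monotonically increasing, $L(0) = 1/2$, and $L(x) \to 1$ as $x \to \infty$. Moreover,
$$L(-x) = 1 - L(x).$$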
In this paper we are interested in the distribution of ratings within the pool of players that arises as a result of the model induced by (1). We are not aware of any research in this direction, although it is generally accepted that this distribution is well approximated by a Gaussian (i.e. normal) distribution [CG96, BSMG09]. It is worth mentioning that Elo [Elo86] claimed that the distribution of ratings of established chess players was not Gaussian, and suggested the Maxwell-Boltzmann distribution as an alternative that fitted the data he used slightly better.
In Section 2 we outline Elo’s rating system. In Section 3 we do some exploratory data analysis on published official chess rating data. We show that the Gaussian distribution provides a very good fit to the data, but there is a small negative skew present. In Section 4 we propose an evolutionary stochastic model, which as a first attempt assumes a symmetric distribution of ratings. The derivation of the distribution is presented in Section 5, where we prove that the resulting distribution is indeed normal, with the interesting feature that the variance increases with time in a logarithmic fashion. In Section 6 we validate the model using published rating data from January 2007 to January 2010, and in Section 7 we modify the model to allow for the skewness present in the data. With reference to this data, we show through simulation that the modified model yields a better approximation to the actual distribution. Finally, in Section 8 we give our concluding remarks.
2 Elo’s Rating System
The fundamental assumption of Elo’s rating system is that each player has a current playing strength. In a game played between players $A$ and $B$, with unknown strengths $s_A$ and $s_B$, the score of the game for player $A$ is denoted by $S_{AB}$, where $S_{AB}$ is 1 if $A$ wins, 0 if $A$ loses and $1/2$ if the game is a draw. Its expected value is assumed to be [GJ99]
$$E(S_{AB}) = \frac{1}{1 + 10^{-(s_A - s_B)/400}},$$
where $E$ is the expectation operator and the right-hand side is the logistic function of (1) with the scaling factor chosen to match the 400-point Elo scale.
The Elo system attempts to estimate the strength $s_A$ of player $A$ using a calculated rating $r_A$, which is adjusted according to the results of games played by $A$. We observe that this model is related to the Bradley-Terry model for paired comparison data [BT52]; see also [Dav88].
After playing a game against player $B$, player $A$’s rating is adjusted according to the following formula (see equation (2) in [GJ99])
$$r_A' = r_A + K \bigl(S_{AB} - E(S_{AB})\bigr), \qquad (5)$$
where the expected score $E(S_{AB})$ is computed from the current ratings $r_A$ and $r_B$ in place of the unknown strengths, and $K$ (known as the $K$-factor) is the maximum number of points by which a rating can be changed as a result of a single game. (A high $K$-factor gives more weight to recent results, while a low $K$-factor increases the relative influence of results from earlier games.) In the Elo system the $K$-factor is typically between 10 and 30. (There has been some controversy involving a recent proposal by the World Chess Federation to change the $K$-factor [Son09, Zul09].) For the purpose of experimentation we have fixed the $K$-factor at 20.
Player $B$’s rating is updated similarly. We note that, after updating both $A$’s and $B$’s ratings, the sum of their ratings remains unchanged. The above method can be straightforwardly extended to the case of a player competing in a tournament, or to a number of games played over a given period.
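The update rule of this section can be illustrated with a minimal Python sketch. It assumes the conventional base-10 logistic curve with a 400-point scale and the fixed $K$-factor of 20; the function names `expected_score` and `elo_update` are hypothetical and are not taken from any rating library.

```python
def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the base-10 logistic curve.

    The 400-point scale is the conventional Elo choice (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / scale))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 20.0):
    """Return the new ratings of A and B after one game.

    score_a is 1 for a win by A, 0 for a loss and 0.5 for a draw.
    B's score is 1 - score_a and B's expected score is 1 - A's,
    so B's change is exactly the negative of A's change.
    """
    change = k * (score_a - expected_score(r_a, r_b))
    return r_a + change, r_b - change


# Illustrative ratings only: A (1600) beats B (1500).
new_a, new_b = elo_update(1600.0, 1500.0, 1.0)
```

Because the two changes cancel, `new_a + new_b` equals the sum of the old ratings, which is the conservation property noted above.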
3 The Distribution of Elo Rating Data
The World Chess Federation, known as FIDE, publishes a rating list several times each year. Traditionally FIDE published the rating list every three months, but from 2009 has moved to bi-monthly publication; the official rating data can be obtained from http://ratings.fide.com.
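The probability density function of the Gaussian distribution is
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \qquad (6)$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.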
With these observations in mind, we performed some exploratory data analysis on the FIDE rating data from January 2007 to January 2010. To test the normality of the data, we binned each of the four data sets, taking the bin width to be 20 (the fixed -factor). The resulting plots for January 2007 to January 2010 are shown on the left-hand side of Figure 1.
We then fitted a constant multiple of a Gaussian distribution to each of the four data sets, using Matlab. The plots for the fitted data for January 2007 to January 2010 are shown on the right-hand side of Figure 1. The fitted parameters, , and , are shown in Table 1, where is the multiplicative constant. Clearly, is an approximation to the actual number of players . Table 1 also shows , the coefficient of determination [Mot95]. It can be seen that this is close to 1, which indicates a very good fit. For comparison, the last two columns in the table show the mean and standard deviation computed from the actual FIDE rating data. On average is about 7-15 Elo points greater than the fitted standard deviation .
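The binning and goodness-of-fit steps can be sketched in Python with NumPy. This is not the Matlab fit used for Table 1: as a simplifying assumption, the Gaussian parameters are taken from the sample mean and standard deviation rather than from a least-squares fit, and the multiplicative constant is taken to be the number of players times the bin width. The function names and the synthetic sample are hypothetical.

```python
import numpy as np


def bin_ratings(ratings, width=20.0):
    """Count ratings in consecutive bins of the given width.

    Returns the bin centres and the number of ratings in each bin.
    """
    ratings = np.asarray(ratings, dtype=float)
    low = np.floor(ratings.min() / width) * width
    high = np.floor(ratings.max() / width) * width + width
    edges = np.arange(low, high + width / 2, width)
    counts, _ = np.histogram(ratings, bins=edges)
    centres = edges[:-1] + width / 2
    return centres, counts


def scaled_gaussian(x, c, mu, sigma):
    """A constant multiple c of the Gaussian density with mean mu and s.d. sigma."""
    return c * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))


def r_squared(observed, fitted):
    """Coefficient of determination of the fitted values against the observed ones."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    ss_res = np.sum((observed - fitted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for a rating list (not FIDE data).
rng = np.random.default_rng(0)
sample = rng.normal(2000.0, 200.0, 50000)
centres, counts = bin_ratings(sample, width=20.0)
curve = scaled_gaussian(centres, sample.size * 20.0, sample.mean(), sample.std())
fit_quality = r_squared(counts, curve)
```

A value of `fit_quality` close to 1 indicates that the binned counts are well described by the scaled Gaussian.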
It can be seen that the plots on the left-hand side of Figure 1 appear to show a small negative skew. (We note that this is in contrast to the positive skew of the Maxwell-Boltzmann distribution, suggested by Elo [Elo86].) As a next step, we therefore investigated the skewness of the data for 13 rating periods from October 2006 to September 2009. The skewness is defined by
$$\gamma = \frac{E\left[(X - \mu)^3\right]}{\sigma^3},$$
where $X$ denotes a player’s rating, and $\mu$ and $\sigma$ are the mean and standard deviation of the ratings.
The skewness of the actual FIDE rating data is shown in the left-hand plot in Figure 2. As can be seen, it shows that there is a small negative skew, which has generally slowly increased over the period. (The increase in skewness in September 2009 is mostly due to FIDE temporarily lowering the minimum rating for new players from 1400 to 1200, and then reverting to the original policy in the following period.) The negative skew can be attributed to the slow decrease in the mean rating with the growing number of players, since it is more likely that a new player joining the pool will enter with a rating lower than the average. This can be formalised as follows.
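Let $N_1$ be the number of players in the pool at the end of the first period and let $\mu_1$ be the mean rating of those players. We define $N_2$ and $\mu_2$ similarly for the second period. Then the total of the ratings of all players in the pool is $N_1 \mu_1$ for the first period and $N_2 \mu_2$ for the second period. Assuming the average rating of new players joining during the second period is $\mu_1 - \nu$, we have
$$N_2 \mu_2 = N_1 \mu_1 + (N_2 - N_1)(\mu_1 - \nu). \qquad (7)$$
Since (7) rearranges to $N_2 (\mu_2 - \mu_1) = -\nu (N_2 - N_1)$, we can approximate (7) by the differential equation
$$\frac{d\mu}{dN} = -\frac{\nu}{N},$$
which has the solution
$$\mu = C - \nu \ln N, \qquad (8)$$
where $C$ is a constant.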
The right-hand graph in Figure 2 shows the mean Elo rating plotted against the logarithm of the number of players . The linear fit shown is in good agreement with (8), with , , and . Thus the average rating is decreasing slowly as a linear function of the logarithm of the number of players in the pool. In addition, knowing would allow us to predict the rate of decrease, and also to estimate the skewness shown in the left-hand graph in Figure 2.
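The logarithmic decline of the mean can be checked numerically with a short Python sketch. It assumes that each new player joins with a rating a fixed number of points, here called `offset` (a hypothetical name), below the current mean; the starting pool size, mean and offset used below are illustrative values, not figures from the FIDE data.

```python
import math


def mean_after_growth(n_start, mean_start, n_end, offset):
    """Grow the pool one player at a time and return the final mean rating.

    Each newcomer is assumed to enter `offset` points below the current mean.
    """
    mean = mean_start
    for n in range(n_start, n_end):
        total = n * mean + (mean - offset)  # rating total after one more player joins
        mean = total / (n + 1)
    return mean


def log_law(n_start, mean_start, n_end, offset):
    """Continuous approximation: the mean falls linearly in the log of the pool size."""
    return mean_start - offset * math.log(n_end / n_start)


# Illustrative values only.
exact = mean_after_growth(1000, 2000.0, 2000, 100.0)
approx = log_law(1000, 2000.0, 2000, 100.0)
```

For pools of this size the two values agree to within a small fraction of a rating point, which is why the mean appears as a straight line when plotted against the logarithm of the number of players.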
4 An Evolutionary Urn Transfer Model
In our evolutionary stochastic model for rating game players, two main types of event may take place. The first event type occurs when a new player enters the system. We make two assumptions related to such an event:
(i) that new players enter the system at a fixed rate, and
(ii) that once players enter the system they do not leave it.
(We note that the model can be extended to allow players to leave the pool as long as the rate at which players enter the pool is greater than the rate at which they leave.)
The second event type occurs when a game is played between two players. In this case, we assume
(iii) that the outcome of the game is either a win or loss for the first player, and
(iv) that every game occurs between two players of fairly similar strength; in particular, we assume that the absolute value of the difference in strength between the players in any game is at most .
Assumption (iii) is often made, cf. [Gli99], to avoid including extra parameters in the model, as it is reasonable to assume that a draw is equivalent to half a win and half a loss (which is consistent with the score of a draw being $1/2$, as in Section 2); see [Joe91, Hen92, GJ99] for alternative ways of dealing with draws. The basis for Assumption (iv) is that players will normally play games against players of comparable strength; for example, many tournaments are divided into separate grading sections for that reason. We note that, for any two players $A$ and $B$, the win probabilities given by (1) satisfy
$$\Pr(A \text{ beats } B) + \Pr(B \text{ beats } A) = 1, \qquad (9)$$
which is consistent with Assumption (iii).
In our model, we approximate the ratings using a discrete numerical scale of values at intervals of . We use urns to store the pool of players, with each urn containing players of approximately similar strength. Let denote the average rating of all the players. Then , the th urn, where , contains those players whose rating is in the range , i.e. the players are grouped into bins of width . Thus a player with rating will be in urn number .
Players enter the system at a rate , where . After playing a game, a player may stay in the same urn or be transferred to one of the two neighbouring urns, depending on the result of the game. We now describe the urn model in detail.
We assume a countable number of urns, with being the central urn; to its left are the urns with negative subscripts and to its right are the urns with positive subscripts. We let denote the number of players in at stage of the stochastic process. Initially , , with , i.e. initially has players in it, and all other urns are empty, i.e. for .
When a player enters the system, an existing player is selected uniformly at random from the urns and the new player is put into the same urn as player , i.e. we assign the new player the same approximate rating as the selected existing player . In other words, new players enter the system according to the distribution of players currently in the system.
The stochastic process modelling the changes in rating can be viewed as a random walk [RG04], where the probabilities of players increasing, decreasing or maintaining their ratings depend on their current ratings, as explained below.
At time , , a player is chosen uniformly at random from the urns, say from , i.e. is selected with probability
where means is approximately equal to for large t. (This approximation holds since the expectation of the number of players at time is .)
As above, we assume . Then one of two things may occur:
with probability , a new player is inserted into , i.e. into the same urn as the chosen player ;
with probability , an opponent for the player is chosen from urns
The probability that player is chosen from is , , where for symmetry we assume . Depending on the result of the game, player either moves to or , or remains in . The probabilities of these events are chosen so that the expected change in ’s rating is identical to that prescribed by the Elo system.
As we are working in terms of urn numbers rather than Elo ratings, we let , so is the scaling factor in terms of urn numbers. Thus, since and , the probability that player wins is , by (1). Therefore, from (5) and (2), when wins ’s new rating is given by
In order to find the new urn number for , corresponding to the rating , we first normalise (11) by subtracting and dividing by , giving
We restrict player to moving up or down by at most one urn. Moreover, we discretise the change stochastically so that the new urn number will be integral but the expected change unaffected. Hence,
We note that has to be chosen so that the probability in (12) does not exceed 1 for all , . We therefore require . For simplicity, we will choose .
The probability that player moves to is
i.e. the product of the probability that wins against and the corresponding discretisation probability.
Similarly, when loses we have
Again restricting to moving up or down by at most one urn, on stochastically discretising, we obtain
Therefore the probability that moves to is
i.e. the product of the probability that loses against and the corresponding discretisation probability.
Then, in summary, if the selected player is from and the chosen opponent is from , ,
with probability player moves to ,
with probability player moves to , and
with probability player stays in .
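The three transition probabilities summarised above can be sketched in Python. The sketch rests on assumptions that should be read as such: the urn width is taken equal to the $K$-factor, so that an Elo gain or loss measured in urns never exceeds one; the logistic curve is taken in its base-10 form, with the scaling factor in urn units defaulting to 400/20 = 20; and the probabilities are conditional on the opponent’s urn, so the opponent-selection probability is not included. The names `logistic` and `move_probabilities` are hypothetical.

```python
def logistic(k: float, c_hat: float = 20.0) -> float:
    """Base-10 logistic curve in urn units (assumed form).

    Gives the probability of beating an opponent who is k urns lower.
    """
    return 1.0 / (1.0 + 10.0 ** (-k / c_hat))


def move_probabilities(k: int, c_hat: float = 20.0):
    """Return (up, down, stay) for a player whose opponent is k urns higher.

    A win yields an Elo gain of 1 - p_win urns, which is rounded up to one
    whole urn with probability 1 - p_win and down to zero otherwise, so the
    expected gain is unchanged. A loss is treated symmetrically.
    """
    p_win = logistic(-k, c_hat)      # the player is k urns below the opponent
    p_loss = 1.0 - p_win
    up = p_win * (1.0 - p_win)       # win, then the gain is rounded up to one urn
    down = p_loss * p_win            # loss, then the loss is rounded up to one urn
    stay = 1.0 - up - down
    return up, down, stay


up, down, stay = move_probabilities(3)
```

Under these assumptions `up` and `down` are always equal, so the walk is symmetric, and the variance of the move is `up + down`, twice the variance `p_win * (1 - p_win)` of the undiscretised Elo change in the same units.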
We note that is proportional to the derivative of the logistic function, viz.
This symmetric bell-shaped curve is proportional to the probability density function of the logistic distribution, with standard deviation [EHP00].
It is easy to show that, conditional on being chosen from and from , the variance of the change in rating is , whereas with the Elo system it is only ; the additional variance is due to the stochastic discretisation. It therefore follows that the unconditional variance in our model will also be increased by a factor of compared to that for the Elo system.
It is clear that, according to the Elo model, player ’s rating should be updated in a similar manner to player ’s. However, we simplify the analysis by considering each game as essentially equivalent to two “half games”, since the players are chosen randomly. It is therefore sufficient to analyse only the change to ’s rating.
(We note that, unlike the proposal in [GJ99], our evolutionary model does not take into account, for example, the fact that junior players tend to be under-rated and to improve more rapidly than older players.)
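The stochastic process of this section can be sketched as a small Python simulation. It is an illustration under stated assumptions, not the Matlab implementation used in the paper: the opponent’s urn is chosen uniformly within `max_gap` urns of the selected player, the move probability uses a base-10 logistic curve with scaling factor `c_hat` in urn units, and only the selected player’s urn changes (the “half game” simplification). All names and default values are hypothetical.

```python
import random
from collections import Counter


def logistic(k, c_hat=20.0):
    """Base-10 logistic curve in urn units (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** (-k / c_hat))


def simulate(steps, p_entry, n_initial=10, max_gap=5, c_hat=20.0, seed=0):
    """Run the urn process and return the urn number of every player."""
    rng = random.Random(seed)
    urns = [0] * n_initial  # all initial players start in the central urn
    for _ in range(steps):
        idx = rng.randrange(len(urns))  # player chosen uniformly at random
        if rng.random() < p_entry:
            # A new player enters the same urn as the chosen player.
            urns.append(urns[idx])
        else:
            # A game: the opponent sits k urns above the chosen player.
            k = rng.randint(-max_gap, max_gap)
            q = logistic(k, c_hat) * logistic(-k, c_hat)  # chance of moving up (or down)
            u = rng.random()
            if u < q:
                urns[idx] += 1
            elif u < 2 * q:
                urns[idx] -= 1
    return urns


def urn_counts(urns):
    """Number of players in each occupied urn."""
    return Counter(urns)


players = simulate(steps=5000, p_entry=0.05, seed=1)
occupancy = urn_counts(players)
```

The dictionary `occupancy` maps each occupied urn number to its player count and can be binned or fitted in the same way as the rating data.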
5 Derivation of the Distribution of Players’ Ratings
Considering all possible choices for player , it follows from the above discussion that the probability that will move to is given by
and, by symmetry, that this is also the probability that will move to .
At time , , a game is played with probability , and there are then the following three possible ways that the contents of may change.
The player chosen uniformly at random is selected from , and then plays an opponent from say . By (15), the probability that beats and moves to is , that loses to and moves to is , and that stays in is . Thus the net expected loss from is .
The player chosen uniformly at random is selected from , and then plays an opponent from say . By (15), the probability that beats and moves to is ; so the net expected gain to is . (In all other cases the contents of do not change.)
The player chosen uniformly at random is selected from , and then plays an opponent from say . By (15), the probability that loses to and moves to is ; so the net expected gain to is . (In all other cases the contents of do not change.)
If is selected from any of the other urns, the contents of do not change.
We now obtain the difference equation for the urn transfer model, by considering the expected change to , as discussed above. For integer and ,
To derive (16), we follow a mean-field theory approach, such as that in [OS01, LFLW02], replacing by its expectation , as in (10). The expected value of is equal to the previous number of players in plus the two probabilities of inserting a player into , from either or , minus the probability of moving a player from to either of the neighbouring urns, i.e. and , plus the probability of inserting a new player into .
We now take expectations in (16), and we write for . By the linearity of , we obtain
We note that (17) defines a symmetric random walk by the selected player at time , where the probability of moving right or left is proportional to , but the probability that is selected decreases over time. Thus the distribution of the players in the urns flattens asymptotically over time and the standard deviation increases, as in a diffusion process [DB03].
We will see that in our case the variance increases logarithmically with time and thus the distribution will flatten very slowly.
We now approximate our discrete model by a continuous model using a continuous function to approximate . In particular, we may approximate
), we thus derive the partial differential equation
is a constant.
If we now let
we can transform (18) into
The initial conditions of the discrete model are , where , and for . Since
the boundary conditions for the continuous model become , where is the Dirac delta function. This yields the boundary conditions
and we see from (6) that this is the density function of the Gaussian distribution with mean 0 and variance .
From (23) it follows that
6 Modelling the Distribution of Chess Players’ Ratings
We are assuming that and , as stated previously in Sections 2 and 4; thus . We consider the cases and , and for simplicity we assume that the urn from which the opponent is selected is chosen uniformly, i.e. . We can then compute from (14) and from (15).
Finally, we need estimates for , and . We assume, as indicated in Section 3, that the ratings are normally distributed; we relax this assumption in Section 7 to cater for some degree of skewness in the distribution. In order to validate our model, we obtain estimates for these parameters using the published official rating data from January 2007 to January 2010, as described in Section 3. Our methodology is to extract values for these parameters from this data, using the analysis in Section 5, and then run simulations of our model in order to see how closely the resulting distribution matches that obtained from the actual data.
To estimate from the actual rating data, we proceed in the following way. Let be the number of rated players recorded at January of a given year. Let be the number of games played and be the number of new players joining the pool of rated players during the previous year (computed as the difference between and its value for the previous January). According to the data, the rate at which players entered the system during the previous year is given by
The values for these parameters from January 2007 to January 2010, calculated using the official FIDE data, are presented in Table 2. In the simulations we took the rate to be , the average rate over the complete four-year period, as shown in the summary row. It can be seen from the table that, in reality, fluctuates somewhat, but as an approximation we assume that is constant. We can then compute from (19).
and that , the variance of the rating distribution, is . We thus obtain
To get a single value for , we simply take the average over the years 2007 to 2010, where we compute a year-specific value for from (26) using the values of and from Table 1. Finally, we estimate using (25).
For and , the estimated values for and are presented in Table 3, where the values for are rounded to the nearest 10. We also obtained alternative estimates by replacing by in (25) and (26); the two alternatives are indicated by the first column of Table 3. The alternatives will be denoted by and , respectively.
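Table 3: Estimated parameter values for each estimation alternative (column “Using”), computed from the rating data until 2007, until 2008, until 2009 and until 2010.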
As mentioned above, we fixed at , the value obtained in Table 2. For each set of values for the parameters , and in Table 3, we ran 10 simulations of the stochastic process described in Section 4, implemented in Matlab. In each case we then fitted a Gaussian to the distribution of the number of players in the urns, again using Matlab. Each row in Table 4 was computed from the average of the 10 simulations in exactly the same way that the values in Table 1 were computed from the actual rating data. That is, , and are the values calculated from the results of the simulations, and , and are the values obtained by fitting a Gaussian distribution to the simulation results. (In order to obtain Elo ratings from the urn numbers of the players in the simulation, the urn numbers were calibrated by means of a suitable shift. This was chosen so that the means from Table 1 for each of the four years were within the range of .) It can be seen that, in each row of Table 4, all the fitted and calculated values are very close to each other. This and the fact that is so close to one gives strong confirmation of our analysis in Section 5.
We now compare the fitted and calculated parameters from Table 4 with those in Table 1. Obviously, by construction, and are very close to the corresponding values in Table 1. In addition, it can be seen that the values for and when using and are very close to the values for in Table 1, and correspondingly close to the values for in Table 1 when using and . However, the calculated standard deviation in Table 4 is consistently lower than its counterpart in Table 1. For 2007 they are very close, for 2008 they are about 10 Elo points apart, for 2009 they are about 17 points apart, while for 2010 they are about 24 points apart. Although these results are very encouraging, we will see in the next section that we can get much closer to the actual standard deviations by introducing skewness into the model.
7 Taking Skewness into Account
As discussed in Section 3, the actual rating data exhibits a small negative skew. We now consider modifying the urn model presented in Section 4 to take this into account. Since it is likely that a new player will enter with a rating lower than the average, we can model this skewness in a simple way by making a small change to the way in which new players are added. Instead of inserting the new player into the same urn as the chosen player , say , we put the new player into , where determines the amount of negative skew we wish to introduce.
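The modified entry step, together with the sample skewness used to assess it, can be sketched in Python. The parameter that determines the amount of negative skew is called `shift` here, a hypothetical name, and the skewness is computed as the third central moment divided by the cube of the standard deviation, which is an assumption about the exact estimator used.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def add_player(urns, rng, shift=0):
    """Add a new player `shift` urns below a uniformly chosen existing player.

    With shift = 0 this reduces to the symmetric entry rule of Section 4.
    Returns the urn number given to the new player.
    """
    chosen = urns[rng.randrange(len(urns))]
    urns.append(chosen - shift)
    return urns[-1]


# Illustrative use: grow a small pool with a one-urn downward shift.
rng = random.Random(0)
pool = [0] * 10
for _ in range(1000):
    add_player(pool, rng, shift=1)
pool_skew = skewness(pool)
```

In a full simulation this entry step would replace the symmetric one, with the game step left unchanged.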
To validate the modified stochastic process, we ran a batch of simulations in Matlab, starting the process with the actual rating data as of October 2006 and ending in January 2010. For the October 2006 starting data , and (as shown in Figure 2). From October to December 2006 the number of games played was , and the number of new players was . Using these values together with the data in Table 2, we therefore took the number of simulation steps to be and, as before, the rate at which players enter the system to be . Tables 5, 6 and 7 show the average skewness , mean rating and standard deviation over 10 simulations, for and , respectively, with varying from to . As a reference point, for the actual rating data as of January 2010, and , as in Table 1, and we computed .
It can be seen that the results are rather similar in all three tables. As is increased, the skewness becomes more negative, the mean decreases and the standard deviation increases, as expected. The closest fit to the actual skewness and the standard deviation is when is . However, the closest fit to the mean Elo rating is when is or . The suggested values for therefore correspond to a new player being rated Elo points below the average rating. This latter value is in broad agreement with the value obtained in Section 3 from Figure 2. Although this value was obtained using the entire three year period, the values for the individual years calculated from (7) are similar, being roughly in the range . These results confirm that the modified process is a reasonable model for obtaining rating data with the observed parameters, despite the discrepancy between the values for . This discrepancy is not surprising, since the modified model, as a first approximation, is clearly an oversimplification. We note that the value of seems to have very little effect on the results, although it is possible that some pattern might be noticeable if a significantly larger value for was used.
8 Concluding Remarks
We have constructed a stochastic evolutionary urn model that generates the distribution of players’ ratings and have validated this model using published official rating data on chess players. For the symmetric case, our analysis of the model yielded a Gaussian distribution, which has the interesting feature that the variance increases logarithmically with time. This implies that the distribution of ratings is quite stable, but has the tendency to flatten extremely slowly over time. These results were validated by simulating the model. Although the data is well approximated by a Gaussian, there is a small negative skew present in the data. An improvement can be made to the model to account for this by breaking the symmetry and putting new players into lower-numbered urns, corresponding to new players generally having lower than average ratings. The modified stochastic process was validated by simulation starting with actual rating data. Deriving analytically the distribution for the modified process remains an open problem.
Throughout the paper we have assumed that the -factor is fixed at 20. It would be interesting to allow the -factor to vary with players’ ratings and the number of games they have played, as suggested in [GJ99], and to see whether such a modification could shed some light on the -factor controversy mentioned in Section 2.
- [BSMG09] M. Bilalić, K. Smallbone, P. McLeod, and F. Gobet. Why are (the best) women so good at chess? Participation rates and gender differences in intellectual domains. Proceedings of the Royal Society of London, Series B, 276:1161–1165, 2009.
- [BT52] R.A. Bradley and M.E. Terry. Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika, 39:324–345, 1952.
- [CG96] N. Charness and Y. Gerchak. Participation rates and maximal performance: A log-linear explanation for group differences, such as Russian and male dominance in chess. Psychological Science, 7:46–51, 1996.
- [Dav88] H.A. David. The Method of Paired Comparisons. Hodder Arnold, London, UK, 2nd edition, 1988.
- [DB03] K.A. Dill and S. Bromberg. Molecular Driving Forces, Statistical Thermodynamics in Chemistry and Biology. Garland Science, New York, NY, 2003.
- [EHP00] M. Evans, N. Hastings, and B. Peacock. Statistical Distributions. John Wiley & Sons, New York, NY, 3rd edition, 2000.
- [Elo86] A.E. Elo. The Rating of Chessplayers Past & Present. ARCO Publishing, New York, NY, 2nd edition, 1986.
- [GJ99] M.E. Glickman and A.C. Jones. Rating the chess rating system. Chance, 12:21–28, 1999.
- [Gli99] M.E. Glickman. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48:377–394, 1999.
- [Hen92] R.J. Henery. An extension of the Thurstone-Mosteller model for chess. The Statistician, 41:559–567, 1992.
- [HMG06] R. Herbrich, T. Minka, and T. Graepel. TrueSkill: A Bayesian skill rating system. In Proceedings of Advances in Neural Information Processing Systems (NIPS) 19, pages 569–576, Vancouver, B.C., Canada, 2006.
- [Joe91] H. Joe. Rating systems based on paired comparison models. Statistics & Probability Letters, 11:343–347, 1991.
- [LFLW02] M. Levene, T.I. Fenner, G. Loizou, and R. Wheeldon. A stochastic model for the evolution of the Web. Computer Networks, 39:277–287, 2002.
- [Mot95] H. Motulsky. Intuitive Biostatistics. Oxford University Press, Oxford, 1995.
- [OS01] M. Opper and D. Saad, editors. Advanced Mean Field Methods: Theory and Practice. MIT Press, Cambridge, MA, 2001.
- [RG04] J. Rudnick and G. Gaspari. Elements of the Random Walk, An Introduction for Advanced Students and Researchers. Cambridge University Press, Cambridge, UK, 2004.
- [Son09] J. Sonas. Ratings summit in Athens. See http://www.chessbase.com/newsdetail.asp?newsid=5527, June 2009.
- [Zul09] D. Zult. On the increase of the K-factor (Update). See www.chessvibes.com/reports/on-the-increase-of-the-k-factor, May 2009.