I Introduction
In applied statistics, a recurring challenge is inferring causation from correlation. Observational data are inevitably biased by selection effects, so a causal claim based on an observed correlation, no matter how natural it feels, may be an illusion.
One concrete example occurs in gaming statistics. Humans are obsessed with games. We like to play, to watch, and most importantly, to win. Thus, it is common practice to collect records of past games and analyze which moves have led to victories. One of the most natural statistics to compute is the probability of winning given a certain move. It is a well-defined conditional probability, but quoting it can often be misleading. To a general audience, this quantity sounds like the intrinsic value of a move, which should have the following definition.

Comparing “making this move” to “not making this move”, how much more likely is a player to win, given that all other conditions are the same?
Unfortunately, this impression is seldom accurate, because “given that all other conditions are the same” is almost never satisfied by the statistics. In realistic data, not all participating players are equally skilled. If we condition on a good move, we preferentially select good players, who are more likely to make it. They also make other good moves, which together increase their winning chances. Therefore, the conditional winning probability can overestimate the intrinsic value of that move. Even more annoyingly, a bad move is sometimes favored by good players, and it can have a higher conditional winning probability just because of that. This creates a misconception that such a bad move is actually good.
Sociological effects also compound this issue, which is sometimes referred to as “group thinking” in the gaming community; see Turner and Pratkanis (1998). When a game is complicated enough, people usually do not learn the strategy from scratch. We often learn from experts (good players), who will tell us that the moves they make are good. Most of that advice is actually correct, and our skill improves as we start to follow it. However, if those good players have any misconceptions, we will likely inherit them. As we become better players and our data show up in the statistics, the misconceptions are further solidified.
Of course, fundamentally, this is not a sociological issue. It is a mathematical fact that
(Conditional Probability to Win) $\neq$ (Intrinsic Value of a Move)
Treating them as equal is nothing but a mathematical mistake. It affects human players as well as sophisticated AIs. AlphaGo Master and AlphaGo Zero have identical architecture, but the latter is clearly superior to the former; see Silver et al. (2017). One major difference is that the former started learning from human experts, but the latter did not. This suggests that the initial human input did teach the AI some misconceptions, which the subsequent training from self-play did not get rid of. This is not entirely surprising, since reinforcement learning is ultimately based on the outcome of the game, which is effectively the conditional winning probability. During the training process, the strategy of the AI evolves semi-stochastically, which means that the data are effectively taken from a group of players with a nonzero skill variation. Thus, the resulting AI can suffer from the skill-bias problem just like any human player.
In this paper, we provide a simple model to deal with this problem. We propose a way to calculate the intrinsic value of a move from statistical data that remains quite straightforward, with only 2 more quantities involved.
(1) 
Here is the naïve conditional probability of winning given such a move. is the probability for a random player from “group 1” to defeat a random player from “group 2”. Group 1 consists of players with the same average skill as those who did make this move, and group 2 consists of players with the same average skill as those who did not make this move. Subtracting from is clearly understandable as removing any advantage from simply being better players. Astute readers might immediately realize that such removal results in an underestimation, because “making this move” is part of “being better players”. That is why we need to divide the result by a factor that is slightly less than one. is how much more likely a player with skill 1 higher than average is to make such a move, compared to an average player. This formula is derived under the assumption that no single move changes the result dramatically. Namely, and should be close to for 2-player games, and . If these conditions are not satisfied, we cannot guarantee the accuracy of Eq. (1).
The two new quantities here, and , are easily computable from data if the number of games is much larger than (number of players). When that is not the case, one can still use modern skill evaluation systems like TrueSkill, see Herbrich et al. (2006), to derive the skill of each player from the data. As long as such skill evaluation is reliable, we can use it to calculate and . Of course, this requires us to keep track of the players. If the gaming data are player-blind, then our method is not applicable. If no one plays the game repeatedly, then there is no way to have a reliable evaluation of their skills, and our method is also useless.
We should emphasize that other than the need to keep track of the players, our method does not require any extra data.
Actually, it is highly modular.
If we want to know the intrinsic value of one move, we still only need the data of that move.
Although fundamentally, the skill bias effect we want to remove is due to other moves, we managed to circumvent the need of using them.
Also, our method is independent of the average skill level of the players.
We can learn the same intrinsic value from a group of experts or from a group of mediocre players.
This is only proven in our toy model, where the moves are mechanically independent of one another.
We are quite aware that such an assumption may not be correct for realistic games.
Interestingly, when we apply our method to actual data from a board game, this property still seems to hold.
Outline. The rest of this paper is organized as follows.
In Sec. II, we use a simple toy model to demonstrate the skill bias effect and to derive Eq. (1).
In Sec. III, we demonstrate how to compute Eq. (1) from basic statistical data.
In Sec. IV, we apply our method to a computerized version of our toy model to demonstrate two properties: (1) our result is independent of player skills, and (2) our method can reveal important strategic bias.
In Sec. V, we use data scraped online for the board game Through the Ages to demonstrate the result of Eq. (1).
Surprisingly, the data confirm that the intrinsic values obtained by our method are independent of the average skill of the players in the data.
They also reveal real misconceptions that exist among the online players of this game.
II Skill Bias Removal: Derivation from a Toy Model
We will start by setting up an abstract, 2-player, zero-sum, not entirely deterministic game with no ties. The game consists of binary matches. Real numbers for quantify the intrinsic values of these matches. or means that the outcome of the th match is “favorable” to either player or (if ). By definition, when a match outcome is favorable to one player, it must be unfavorable to the other player. The probability for player to win is given by
(2) 
Naturally, for player , it is
(3) 
We will assume that ’s are small enough such that the probabilities are bounded between 0 and 1. [Footnote 1: Strictly speaking, we should use a sigmoid or function to achieve that. However, for the purpose of this paper, we will skip such nonlinear technicalities. When the value of stays small, within the approximately linear response range of those functions, our result is valid. Situations very different from this regime will be a topic of future work.] All these parameters are actually hidden. They will help us to visualize the problem and find a solution, while the solution will be independent of these parameters.
In this game, a “move” is basically a successful attempt to secure the “favorable” outcome of a match. A “good move” would be securing a match whose is indeed larger than zero. Thus, we will model the strategy of a player by numbers, . For player , we denote these numbers by , which basically means how much player wants to make the outcome of the th match “favorable” to him. The probability for that to actually happen to player against player is
(4) 
One may ask why someone would “not want” the favorable outcome. The reason is that the game is complicated enough that no player is absolutely certain about the sign of each . For all they know, might be negative. In that case, a small would have been the better strategy.
The “skill” of a player is quantified in the following way. Let be the skill of player ; the strategy is then given as the sum of two terms.
(5) 
’s are independent random numbers with zero mean.
(6)  
(7) 
Their existence guarantees that no 2 players are identical. As we will see, they play no role in our result. We have the freedom to choose any variation of per player per match and still get the same result. That speaks for the generality of our method.
is the “skill” of player , which is the tendency to consistently get more favorable outcomes. If , then is the “simplicity” of match . Larger implies a simple choice: only a small skill advantage is needed to recognize which outcome is actually favorable in the th match. On the other hand, implies a “misleading” choice: better players are actually less likely to get the favorable outcome in this match.
The expected chance for someone with skill to defeat someone with skill is given by
(8)  
(9)  
(10) 
Thus, without loss of generality, we demand that . This ensures that more skilled players indeed have a higher winning chance. [Footnote 2: If this were not the case, simply flip the sign of all .]
Now, assume that there are players, and their skill distribution is
(11) 
From the data of many games between these players, we would like to ask the following question:
How do we find out the importance of a particular match ?
Naïvely, we should look at the conditional probability: “how likely is a player to win if the th outcome is favorable?”
This, however, is only correct when all players have the same skill, , because by conditioning on the outcome of the th match, we are also selecting a slightly more skilled subset of players.
Part of the increased winning probability comes from the fact that their superior skills tend to swing other matches in their favor.
This is what we call the “skill bias” effect.
Here is the math.
This clearly shows the two contributions: the real importance of the match , and the extra contribution through other matches which is the skill bias. Thus, in order to find , we need to know the value of this skill bias term.
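The two contributions can be illustrated with a small exact calculation. The following Python sketch is not the code used in this paper. It assumes simple linear forms in the spirit of the toy model: a match is favorable to a player with probability 1/2 plus half the skill difference times a per-match sensitivity, and each favorable match shifts the winning chance by a fixed per-match value. The names `skills`, `v`, `k`, and `conditional_win_prob`, as well as all numbers, are hypothetical choices for illustration.

```python
from itertools import permutations

def conditional_win_prob(skills, v, k, i):
    """Chance to win given that match i came out favorable to the player,
    averaged over all ordered pairs of distinct players (assumed linear model)."""
    num = 0.0
    den = 0.0
    for s_a, s_b in permutations(skills, 2):
        d = s_a - s_b
        # Assumed form: probability that match i is favorable to the first player.
        p_fav = 0.5 + 0.5 * k[i] * d
        # Match i is fixed to "favorable"; every other match j contributes its
        # expected outcome k[j] * d, weighted by its value v[j].
        p_win = 0.5 + v[i] + sum(v[j] * k[j] * d for j in range(len(v)) if j != i)
        num += p_fav * p_win
        den += p_fav
    return num / den

skills = [-0.2, -0.1, 0.0, 0.1, 0.2]   # hypothetical player skills
v = [0.02] * 10                        # hypothetical value of each match
k = [0.5] * 10                         # hypothetical skill sensitivity of each match

naive = conditional_win_prob(skills, v, k, 0)
print(f"conditional: {naive:.5f}   intrinsic: {0.5 + v[0]:.5f}")

# A slightly harmful match that skilled players nevertheless prefer.
v_bad = [-0.001] + [0.02] * 9
print(f"harmful match, conditional: {conditional_win_prob(skills, v_bad, k, 0):.5f}")
```

Under these assumptions the conditional probability exceeds 1/2 plus the value of the match, and the harmful match still shows a conditional probability above 1/2. Setting the sensitivity of the conditioned match to zero removes the excess, since conditioning then no longer selects for skill.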
We can first ask about the average skill for players who get (un)favorable outcomes from match .
(13) 
As a reality check, let us think about the situation when . This means that match is quite nontrivial, such that even good players cannot determine which outcome is favorable. Naturally, everyone chooses randomly, and conditioning on such choice does not lead to a skill bias.
Next, for the two subsets of players with average skill given by Eq. (13), we calculate the expected chance that a random player from one subset defeats someone from the other subset.
(14) 
Combining the conditional winning probability above with Eq. (14), we get the following formula, which is very close to the final result.
(15) 
The final input we need is the standard deviation in skill.
is just a theoretical, unobservable parameter in our model. The corresponding observable is the standard deviation in winning chances.
(16) 
Combining Eqs. (16) and (14), we get
(17) 
III How does it work?
Eq. (18) contains three probabilities that we should read from data. Among them, is somewhat a new idea. Let us look at the following example.
Outcome  Game 1  Game 2  Game 3  Game 4  Game 5
Favorable  A  C  C  c  b
Unfavorable  b  b  a  B  A
This is the record of 100 games between 3 players, A, B, and C. A capital letter means that the player wins the game. The first row is the player who gets the favorable outcome from the match under study. For simplicity, we assume that the same 5 games repeat exactly, so the above table is repeated 20 times to form the entire data set.
Clearly, we have
$$P(\text{win} \mid \text{favorable outcome}) = \frac{3}{5} \tag{19}$$
which seems to suggest that the outcome labeled favorable is indeed the favorable outcome in this match. However, we can see that A and C seem to be better players, both with $2/3$ winning chances, and B is a weaker player with only a $1/4$ winning chance. Incidentally, players A and C contribute all the wins for the favorable outcome, while player B contributes the majority of the losses for the unfavorable outcome. It is not directly clear whether the outcome is truly favorable, or simply the preferred choice of the better players.
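In order to resolve this ambiguity, our method requires us to calculate the probability for a random player from the favorable group to defeat a random player from the unfavorable group.
One intuitive approach is to derive it directly from pairwise winning chances in the data. Writing $P_{XY}$ for the chance that player X defeats player Y, the record gives
$$P_{AB} = 1, \qquad P_{CA} = 1, \qquad P_{CB} = \frac{1}{2}. \tag{20}$$
Also, by definition, $P_{XX} = \frac{1}{2}$. [Footnote 3: Note that two of the pairwise winning chances are exactly 1, which is somewhat outside the scope of our assumptions of the previous section. So we should not take the result too seriously. This is just a toy example to demonstrate how our method works.]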
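From these probabilities, we can make Table 1, in which each entry $P_{XY}$ is the chance that the column player X (favorable side) defeats the row player Y (unfavorable side).
Unfavorable \ Favorable  A  C  C  C  B
B  $P_{AB}$  $P_{CB}$  $P_{CB}$  $P_{CB}$  $P_{BB}$
B  $P_{AB}$  $P_{CB}$  $P_{CB}$  $P_{CB}$  $P_{BB}$
A  $P_{AA}$  $P_{CA}$  $P_{CA}$  $P_{CA}$  $P_{BA}$
B  $P_{AB}$  $P_{CB}$  $P_{CB}$  $P_{CB}$  $P_{BB}$
A  $P_{AA}$  $P_{CA}$  $P_{CA}$  $P_{CA}$  $P_{BA}$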
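Then we can plug in Eq. (20), add up all those numbers in the table, and divide by 25. That gives us
$$P(\text{favorable group defeats unfavorable group}) = \frac{16}{25} = 0.64. \tag{21}$$
Since this is actually higher than the conditional winning probability of $3/5$, it suggests that the intrinsic value is negative, and that the outcome labeled favorable is not really a favorable outcome of this match.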
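The arithmetic of this example can be checked with a short script. This is a minimal sketch in Python, assuming the five-game record is encoded as (favorable player, unfavorable player, whether the favorable player won); the names `games`, `p_beats`, `cond`, and `p12` are hypothetical and not part of the paper.

```python
from collections import defaultdict
from fractions import Fraction

# One entry per game: (favorable-side player, unfavorable-side player, favorable side won?)
games = [
    ("A", "B", True),
    ("C", "B", True),
    ("C", "A", True),
    ("C", "B", False),
    ("B", "A", False),
]

wins = defaultdict(int)      # wins[(x, y)]: number of times x defeated y
meetings = defaultdict(int)  # meetings[{x, y}]: number of games between x and y
for fav, unfav, fav_won in games:
    winner, loser = (fav, unfav) if fav_won else (unfav, fav)
    wins[(winner, loser)] += 1
    meetings[frozenset((fav, unfav))] += 1

def p_beats(x, y):
    """Empirical chance that x defeats y; 1/2 by definition when x == y.
    Assumes every distinct pair has met at least once, as in this example."""
    if x == y:
        return Fraction(1, 2)
    return Fraction(wins[(x, y)], meetings[frozenset((x, y))])

favorable = [g[0] for g in games]    # A, C, C, C, B
unfavorable = [g[1] for g in games]  # B, B, A, B, A

# Naive conditional winning probability of the favorable side.
cond = Fraction(sum(1 for _, _, won in games if won), len(games))

# Average of the 5 x 5 table: random favorable player vs random unfavorable player.
p12 = sum(p_beats(f, u) for f in favorable for u in unfavorable) / (
    len(favorable) * len(unfavorable)
)

print(cond, p12)
```

Both groups are listed once per game, so the average is weighted by games rather than by players. Repeating the five games 20 times leaves both fractions unchanged.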
In reality, when the number of games is large, we may not want to calculate the entire table. We can sample the table sparsely by randomly selecting players from the two groups and matching them with each other. One thing to note here is that in our method, we always weight by games, not players. Thus the random sample from the favorable subset should be drawn from {A:1, B:1, C:3} with repeats allowed. Namely, a player who appears in more games contributes more to the probability. That is because the calculation of the conditional winning probability in Sec. II naturally requires such weighting, and all other expectation values must follow the same rule to be self-consistent. In particular, this weighting guarantees that the average winning chance is $1/2$. That would not have been the case if we weighted by players. Finally, can be calculated by randomly drawing pairs of players from {A:3, B:4, C:3}, which is always weighted by their appearance in all recorded games.
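The sparse, game-weighted sampling described above might look as follows. This is only a sketch under stated assumptions: the pairwise chances are the ones read off the five-game record, the unfavorable-side counts {A:2, B:3} are read from the second row of that record, and the function name `sample_group_vs_group`, the sample size, and the seed are hypothetical.

```python
import random

# Pairwise winning chances from the five-game record; equal players get 1/2.
PAIRWISE = {
    ("A", "B"): 1.0, ("B", "A"): 0.0,
    ("C", "A"): 1.0, ("A", "C"): 0.0,
    ("C", "B"): 0.5, ("B", "C"): 0.5,
}

def p_beats(x, y):
    return 0.5 if x == y else PAIRWISE[(x, y)]

# Appearance counts per 5 games: players are weighted by games, not by identity.
fav_counts = {"A": 1, "B": 1, "C": 3}
unfav_counts = {"A": 2, "B": 3}
all_counts = {"A": 3, "B": 4, "C": 3}

def sample_group_vs_group(counts_1, counts_2, n, rng):
    """Monte Carlo estimate of the chance that a game-weighted random player
    from group 1 defeats a game-weighted random player from group 2."""
    xs = rng.choices(list(counts_1), weights=list(counts_1.values()), k=n)
    ys = rng.choices(list(counts_2), weights=list(counts_2.values()), k=n)
    return sum(p_beats(x, y) for x, y in zip(xs, ys)) / n

rng = random.Random(0)
estimate = sample_group_vs_group(fav_counts, unfav_counts, 100_000, rng)
print(f"sampled estimate: {estimate:.3f}")

# Individual win rates from the record: A 2/3, B 1/4, C 2/3.
win_rate = {"A": 2 / 3, "B": 1 / 4, "C": 2 / 3}
game_weighted = sum(all_counts[p] * win_rate[p] for p in win_rate) / sum(all_counts.values())
player_weighted = sum(win_rate.values()) / len(win_rate)
print(f"game-weighted: {game_weighted:.3f}   player-weighted: {player_weighted:.3f}")
```

The sampled estimate should land close to the exact table average, and the last two lines contrast the two weightings of the average winning chance: weighting by games gives one half, while weighting by players does not.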
The reason why we want this example to have 100 games instead of 5 is that we want to take the probabilities in Eq. (20) seriously. If there were only 5 games, we would have been claiming three things:

A will defeat B all the time, based on the fact that it happened twice.

C will defeat A all the time, based on the fact that it happened once.

C will defeat B 50% of the time, based on two results, and ignoring their relative performance against A.
We can see that these are not the best assumptions. However, if A indeed defeats B 40 times in a row, the actual winning chance might be really close to 1. Thus, if (number of games) $\gg$ (number of players), then we would have enough statistics for the majority of pairwise winning chances. In those situations, using Eq. (20) is fine.
When the number of games is not that large, not only is the statistical winning percentage questionable, but games between certain pairs of players might simply not exist. In those cases, we should use tools like TrueSkill, see Herbrich et al. (2006), which will give us better estimates of the pairwise winning chances. Of course, even the TrueSkill system requires a sufficient number of games per player, and every pair of players should be at least indirectly connected. Thus, the limitation of our method is the requirement that the skill of the majority of players can be reliably evaluated.
IV A Simple Example
We have implemented the game described in Sec. II as a computer program, with nonlinear modifications to Eqs. (4) and (2).
(22)  
(23) 
When the arguments of are small, nonlinearities are not important, and it should behave the same way as we calculated analytically. The players are coded similarly, with the extra constraint that their strategies, , are dimensional vectors normalized to a given length, . The normalization is imposed to prevent the values of from running away during a training process, and for the convenience that the typical value of is about . [Footnote 4: Due to this normalization, the statistics of will not be exactly the same as in Sec. II. That is not a problem, since one of our goals is to check whether our method is applicable generally, without the specific assumption about the player skills. This can also be understood as part of the game rules: all players have the same finite total budget that they can use to compete during each match.] Clearly, if we treat both the lists of and as vectors, when the nonlinearities are not important, the best strategy should be pretty close to satisfying . In fact, we will use this inner product as our way to evaluate skill, and a little reinforcement learning can confirm that it is a good choice. We started 10 players with random vectors , allowed them to play against each other, and modified their strategies according to the outcome of each game. We plot their values of in Figure 1 and see that they indeed grow and approach 1.
For better visualization, we will focus on a game with and for all . Namely, securing every single match will increase your winning chance by about (it can be less if you are far away from due to the nonlinearities). Obviously, the best strategy is to have all positive and equal. We then introduce a “misguided expert”. This player follows a strategy that is quite good.
(24) 
Basically, he got all matches but 1 right. The one he got wrong, however, he got completely wrong.
This misguided expert does not play himself. He is well respected and people learn from him. We make another 10 players who start with random strategies, but instead of reinforcement learning, we modify their strategies in the following way to mimic “learning” from this expert.
(25) 
with the value of varying from one player to another.
In the following 2 examples, we have . Nonlinearities will be somewhat significant in each match, but we need to live with that to see the effect. Namely, we need to make sure that players have strong opinions on each match, despite the fact that their individual influence on the final result is small. This, fortunately, is quite a common behavior for human players. In Figure 2, we compare the results between a group of 10 players with random strategies and 10 players who learned from the misguided expert to various degrees. We can see that the conditional winning probabilities are strongly influenced by the biased teaching from the misguided expert. , the match that the misguided expert considers to be bad, does perform much more poorly in terms of the conditional winning probability. At the same time, other matches, for which the misguided expert’s opinion is correct, perform better than they should. Our method of skill-bias removal always recovers the value of more accurately. Even among random players, our answer seems to track the actual values of slightly better than the conditional probability.
On top of recovering the value of more accurately, these 2 examples also support our claim that the method works independently of the average skill level of the players. The expert, though misguided, does teach mostly correct lessons. Thus, the group of players who learned from him are indeed better, and we recover from the records of both groups equally well.
This example helps us to demonstrate the skill-invariance of our method, and its advantage over using the conditional winning probability directly. However, one may wonder how much these 2 properties remain true in more realistic situations. In particular,

In our toy model, every match is mechanically independent of the others (except for the common budget implied by the normalization of ). In a realistic game, almost all decisions are intricately connected. Will the skill-invariance of our method remain true?

Skill-bias removal seems to be most important when there is a strongly biased opinion among the players, such as the one introduced by our imaginary, misguided expert. How often is that true?
In the next section, we will demonstrate the result with an actual board game with only scraped data, which may shed some light on these 2 concerns.
V A Realistic Example
Through the Ages: A New Story of Civilization is a deep strategic board game for 2 to 4 players. There is a website that allows real people to play against each other, and it keeps records of past games. We scraped the data of 30k+ games and use them as our example.
First of all, the multiplayer nature has a well-known solution. A game with more than 2 players will be treated as multiple 2-player games by comparing the results between all pairs present. Next, any nonbinary decision can be decomposed into multiple binary decisions. For example, in this game, each player can try to play cards from a common pool, which are limited in supply. Instead of treating all cards together like a complicated multi-choice match, we can treat each card separately. Each card is treated as a match with 2 outcomes: whether you play it or not. Then, between two players who chose differently in the same game, we can apply our method.
For example, consider a game with 4 players, A, B, C and D, who ended up with scores , , , . If player D is the only one who played a certain card, then we treat it as 3 games.
Outcome  Game 1  Game 2  Game 3
Played the card  D  D  d
Did not play the card  a  b  C
Note that player C does defeat A and B. However, those results are only taken into account in the computation of their TrueSkill. Since none of them played this particular card, the results of this game between those players are irrelevant to its evaluation. We applied this method to all games, and obtained the value of the card .
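The decomposition of one multiplayer game into pairwise records for a single card can be sketched as follows. This is an illustrative Python sketch, not the code used for the scraped data; the function name `pairwise_records` and the scores are hypothetical, chosen only so that D finishes ahead of A and B but behind C. Skipping tied pairs is an extra assumption, since the toy model has no ties.

```python
from itertools import combinations

def pairwise_records(scores, played_card):
    """Split one multiplayer game into 2-player records for a single card.

    Only pairs that chose differently (one played the card, one did not) are
    kept. Each record is (player who played the card, player who did not,
    whether the card player finished ahead).
    """
    records = []
    for p, q in combinations(scores, 2):
        if (p in played_card) == (q in played_card):
            continue  # same choice: irrelevant for evaluating this card
        if scores[p] == scores[q]:
            continue  # assumption: tied pairs are skipped
        with_card, without_card = (p, q) if p in played_card else (q, p)
        records.append((with_card, without_card, scores[with_card] > scores[without_card]))
    return records

# Hypothetical final scores for one 4-player game in which only D played the card.
scores = {"A": 100, "B": 120, "C": 200, "D": 150}
records = pairwise_records(scores, played_card={"D"})
for record in records:
    print(record)
```

With these made-up scores, the output reproduces the three games in the table above: D defeats A and B, and loses to C. The pairs among A, B, and C are dropped here and would only enter through the skill evaluation.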
In Figure 3, we show the true value of all Leader cards from Age II of the game, including the value of not choosing a leader at all. First of all, the result is not far from the common consensus among seasoned players. When applied to all cards in the game, we do get a few unexpected results. But at least 90% of the results will not raise strong objections.
Another interesting thing to see in Figure 3 is how the result changes with the average skill of the players we collect data from. According to our toy model in Sec. II, the result should be invariant, and it indeed is. That is a little surprising. Every card (match) in our toy model is mechanically independent of the others. It is natural to suspect that such a simplifying assumption is essential to the apparent independence from average player skill. In Through the Ages, and in Figure 3 in particular, the decisions to take each card are clearly not independent of each other. These cards become available in different sequences from game to game. They stay available and useful for different durations, and taking one of them forbids you from taking another. Thus, in a multiplayer game, the decisions to take each card are highly intertwined. Not to mention that each card may open up a different set of other choices, so their net effect is likely highly nonlinear and profound. However, when the effect of each decision is subtle enough that no one can be absolutely certain, treating them as independent does not seem to be a bad assumption.
Let us go back to the few unexpected results. Figure 4 shows the intrinsic value of each Age A Wonder card. We choose this set of cards because it demonstrates a clear misconception, quite similar to the “teaching from a biased expert” we designed in the previous section. We can see that the card “Pyramids” is most favored by good players, but its intrinsic value is not the highest. In fact, its conditional winning probability, , is consistent with only . However, because good players prefer it so much, , the removal of the skill bias effect reveals that it is actually a bad card. In the technical terms of our toy model in Sec. II, Pyramids has yet , thus it is a misleading choice: good players evaluate such a card incorrectly. In fact, we can see that another card here, Library of Alexandria, has yet consistent with being 0. That implies a difficult choice: although it is actually a good card, good players fail to recognize that. At least, better players do not pick this card more often than worse players.
We are pleasantly surprised by both facts:

The skill-invariance of our method persists in a realistic game and data.

Our method does differ significantly from using conditional probability only, and it demonstrates important misconceptions in the strategy of this game.
We hope to see our method applied to more games and data.
Acknowledgements.
We thank the administrator of boardgamingonline.com for tolerating the scraping of data used in this paper. We also thank the players from boardgamegeek.com for interesting discussions.
References
 Turner and Pratkanis (1998) M. E. Turner and A. R. Pratkanis, Organizational Behavior and Human Decision Processes 73, 105 (1998), ISSN 07495978, URL http://www.sciencedirect.com/science/article/pii/S074959789892756X.
 Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Nature 550, 354 (2017).
 Herbrich et al. (2006) R. Herbrich, T. Minka, and T. Graepel, pp. 569–576 (2006), URL http://dl.acm.org/citation.cfm?id=2976456.2976528.