Limits of PageRank-based ranking methods in sports data

12/11/2020 ∙ by Yuhao Zhou, et al. ∙ Beijing Normal University ∙ University of Fribourg

While PageRank has been extensively used to rank sport tournament participants (teams or individuals), its superiority over simpler ranking methods has never been clearly demonstrated. We use sports results from 18 major leagues to calibrate a state-of-the-art model for synthetic sports results. Model data are then used to assess the ranking performance of PageRank in a controlled setting. We find that PageRank outperforms the benchmark ranking by the number of wins only when a small fraction of all games have been played. Increased randomness in the data, such as intrinsic randomness of outcomes or advantage of home teams, further reduces the range of PageRank's superiority. We propose a new PageRank variant which outperforms PageRank in all evaluated settings, yet shares its sensitivity to increased randomness in the data. Our main findings are confirmed by evaluating the ranking algorithms on real data. Our work demonstrates the danger of using novel metrics and algorithms without considering their limits of applicability.


1 Introduction

Humans are born to compete: We thrive by measuring ourselves against others (Frederick-Recascino et al., 2003, Le Bouc and Pessiglione, 2013). Sport in particular provides ample opportunities for competition with high significance for the economy (Tranter, 1998, Hone and Silvers, 2006, Rosner and Shropshire, 2011) and society (McPherson et al., 1989, Giulianotti, 2015). Once a sport competition is over, it remains to decide who has won. What is a simple question for a single match between two participants (individuals or teams) becomes highly non-trivial for a structured tournament with multiple games between several participants. The design of sport tournaments is therefore of crucial importance. Effective tournament design helps the participants to perform well during the tournament, produces match-ups that are interesting to the fans and, crucially, allows the best-performing participant to be identified with high probability (Szymanski, 2003) (see (Dechenaux et al., 2015) for a survey of results on sport tournaments and beyond).

We focus here on sports leagues, such as a soccer league, where each participating team plays against every other team once or several times. In such a league, points are traditionally assigned to teams for every win or tie that they achieve. The teams are then ranked by their point totals (from the highest to the lowest; additional criteria can be used to break ties). This benchmark ranking method is so simple that it inevitably leads to the question: Can we not rank the teams better? Of particular appeal is the idea of considering the strength of the opposing team. In particular: Can we improve the ranking if a win against a strong team counts more than a win against a weak team? A closely related idea has been previously formalized by the PageRank algorithm, which was originally designed to rank web pages and later used in a broad range of systems (Gleich, 2015). Unsurprisingly, PageRank and its modifications have also been applied to sports results mapped onto a directed network where team $i$ losing against team $j$ is represented by an edge from $i$ to $j$ (Radicchi, 2011, Júnior et al., 2012, Spanias and Knottenbelt, 2013, Lazova and Basnarkov, 2015).

While PageRank is frequently used on sports results data, its superiority over simple point-based schemes has not yet been established. That is not surprising as real sports results lack robust ground truth (a correct team ranking) against which different rankings could be compared. We fill this gap by providing an extensive evaluation of three different ranking algorithms (ranking by points, PageRank, and a new PageRank variant) on synthetic data. We first calibrate our data-generating model on real sports results from numerous leagues in four major sports (baseball, basketball, ice hockey, and soccer) and find that each sport has a characteristic range of model parameters. We then compare the performance of the ranking algorithms on synthetic data generated for various values of the model parameters, thus identifying the parameter ranges that are favorable for each of the algorithms. We establish in this way, for the first time, the conditions that have to be met for PageRank to perform better than a point-based ranking. We find that PageRank requires results with little randomness which is, in fact, not typical for real sports results. While the newly proposed PageRank variant outperforms PageRank in all evaluated settings, the point-based ranking is the best algorithm in most relevant settings. We conclude this work by reproducing our key results on real data, thus showing that our findings are not model-dependent.

This paper is structured as follows. Section 2 reviews modeling sports results data and ranking algorithms for sports results. Section 3 introduces the methods used in our work—the algorithm to create synthetic data, ranking algorithms, evaluation metrics, and real datasets. Section 4 describes model calibration on the real datasets. Sections 5 and 6 present a comparison of ranking algorithms on synthetic and real datasets, respectively. Finally, Section 7 summarizes the results, discusses their limitations, and proposes new research directions.

2 Related work

2.1 Modeling sports results

Results in a sport where players or teams play against each other can be seen as outcomes of paired comparisons between the participants. The Bradley-Terry model (Bradley and Terry, 1952) is one of the first efforts to model outcomes of such paired comparisons (the authors themselves do not mention applying their model to sport specifically). The model is based on assigning a non-negative winning propensity, $\pi_i$, to each participant $i$ and postulating that the probability of $i$ winning against $j$ has the form $\pi_i/(\pi_i + \pi_j)$. A model generalization to include ties (which is important for sports such as soccer) has been introduced in (Rao and Kupper, 1967). The Bradley-Terry model has been used in a broad range of problems. In (Aoki et al., 2017), for example, the authors used the model to show that sports data have a high degree of randomness. In (Deng et al., 2012), the winning propensities of tennis players were shown to be directly proportional to players’ ranks in official tennis rankings. The rank of individuals instead of a continuous-valued winning propensity was also used in (Chetrite et al., 2017).

In (Ben-Naim et al., 2013), the authors propose a simpler model based on a fixed upset probability parameter that directly specifies the probability that a weaker team beats a stronger team. The model’s simplicity makes it possible to derive analytical results for the probability that the weakest team wins an elimination tournament, for example. O’Malley (2008) goes in the opposite direction by proposing a model for tennis match outcomes based on the detailed structure of the game. See (Bradley, 1976, David, 1988, Cattelan, 2012) for detailed reviews of models for paired comparison data.

2.2 Ranking algorithms for sports results data

The most elementary method to aggregate results of multiple sports games is to compute each team’s winning percentage, $w/(w+l)$, where $w$ and $l$ are the team’s numbers of wins and losses, respectively. In this way, each team is assigned a quantity in the range $[0, 1]$. Taking into account a starting uniform prior in the same range, a modified estimate has the form $(1+w)/(2+w+l)$. In (Colley, 2002), this estimate was used as a basis for a ranking scheme which is particularly useful when many teams have not played against each other (early in a season or in a more complicated setup where teams are divided in multiple divisions or conferences). The information provided by respective wins and losses is limited in low-scoring sports such as soccer where a single lucky shot can greatly influence the outcome of a match. In (Brechot and Flepp, 2020), the authors propose to limit this randomness by estimating the number of expected goals.

Indirect comparisons (comparing teams A and C based on team A beating team B and team B beating team C) have been considered in (Redmond, 2003). Indirect wins and indirect losses have been quantified in (Park and Newman, 2005) where they are ultimately combined in a score which is, in fact, a generalization of the well-known Katz centrality metric. A method based on representing the available results with an incomplete pairwise comparison matrix has been proposed in (Bozóki et al., 2016) where, however, the eventually optimized cost function has to be chosen from several distinct possibilities. In (Spanias and Knottenbelt, 2013), a sort-based ranking has been proposed which also alleviates the problem of missing pair comparisons.

A popular line of research considers the use of eigenvector-based methods for sports rankings (Keener, 1993). In particular PageRank, a seminal ranking algorithm/centrality metric for nodes in a directed network (Brin and Page, 1998), has been widely applied to sports data such as tennis (Radicchi, 2011) or cricket (Mukherjee, 2012). PageRank-like algorithms seem well-suited for a sports ranking as they value a win against a strong opponent more than a win against a weak opponent. How to transform input sports results into a directed network on which PageRank is computed is an open question. The simplest approach is to represent a win of $i$ over $j$ with a directed link from $j$ to $i$. In (Govan et al., 2008), the authors proposed to assign link weights based on the score difference in the corresponding games. A comparison of various edge-weighting methods for the soccer World Cup data is presented in (Lazova and Basnarkov, 2015). Time-aware PageRank variants (Júnior et al., 2012, Motegi and Masuda, 2012) take the time of each game into account to capture the player/team capability that varies in time.

Validation of the various ranking methods described above is often limited as it typically relies on official rankings that are directly influenced by the same results data that are used by the evaluated algorithm (see (Mukherjee, 2012, Júnior et al., 2012, Lazova and Basnarkov, 2015), for example). However, the best agreement with an official ranking is achieved by a ranking method which is identical to that used to produce the official ranking.

3 Materials and methods

3.1 Sports results model

We assume a competition setting where $N$ teams play against each other once or multiple times. In the model, we assume that the outcome of a match between teams $i$ (home team) and $j$ (away team) is stochastic. The probability that the home team wins is assumed in the form of the logistic function

$$P_H(i,j) = \frac{1}{1+\exp\left[-(f_i - f_j + h)/\Delta\right]} \tag{1}$$

where $f_i$ and $f_j$ are the intrinsic fitness values of the two competing teams, $h$ is an additive term which represents the typical home team advantage and $\Delta$ is a fitness “weighting” parameter which translates a difference in team fitness into the winning probability of the home team. The model assumes that there are only two possible outcomes: team $i$ wins or team $j$ wins. Generalizing the model to involve the possibility of a draw or formulating a probability distribution for the score difference is beyond the scope of this article. We assume for simplicity that team fitness remains the same throughout the whole competition; allowing for fitness variations is yet another interesting direction for future research. Home advantage has been documented for a wide variety of sports (Nevill et al., 2005, Ribeiro et al., 2016). While it may seem an auxiliary issue, and has been ignored in (Deng et al., 2012), for example, we find the home advantage $h$ to be significantly positive in a vast majority of the sports results sets that we analyze in this paper. We also find that home advantage strongly affects the ranking ability of the respective algorithms.

It is helpful to study the implications of Eq. (1) more closely before proceeding. When $(f_j - f_i - h)/\Delta \gg 1$ (that is, the away team is much stronger than the home team), the home team's winning probability approaches zero as expected. If the fitness sensitivity $\Delta$ increases, the same fitness difference affects the winning probability less and the match outcomes thus become more random (in the limit $\Delta \to \infty$, the winning probability approaches $1/2$ for any $f_i$, $f_j$, and $h$). We thus refer to $\Delta$ as the sport randomness indicator: When $\Delta$ is large in comparison with fitness differences among the teams, the winning probability is close to $1/2$ for all $i$ and $j$. While home team advantage increases with $h$, outcomes are random in the large $\Delta$ limit and no home team advantage ensues. The effective strength of home team advantage is thus determined by the ratio $h/\Delta$.
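The limits discussed above can be checked numerically with a minimal sketch of the logistic winning probability. The function name and the argument names `f_home`, `f_away`, `h`, and `delta` are illustrative choices made here (standing for the two fitness values, the home advantage, and the randomness parameter), not identifiers from the paper.

```python
import math


def home_win_probability(f_home, f_away, h=0.0, delta=1.0):
    """Logistic winning probability of the home team (a sketch of Eq. (1)).

    Argument names are assumptions of this example: fitness of the home and
    away team, additive home advantage h, and randomness parameter delta.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    x = (f_home - f_away + h) / delta
    # Numerically stable evaluation of 1 / (1 + exp(-x)).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Equal teams and no home advantage: a coin flip.
p_equal = home_win_probability(0.5, 0.5, h=0.0, delta=0.2)
# Equal teams with a home advantage: the home team is favoured.
p_home = home_win_probability(0.5, 0.5, h=0.1, delta=0.2)
# A very large delta washes out both the fitness gap and the home advantage.
p_random = home_win_probability(0.9, 0.1, h=0.1, delta=1e6)
```

Only the ratio of `h` to `delta` enters the exponent when the two fitness values are equal, which is why that ratio serves as the effective home advantage.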

3.2 Algorithm to generate synthetic sports results

The algorithm has the following parameters: number of teams $N$, fitness sensitivity $\Delta$, home advantage $h$, and the fraction of games that have been played (fractions of 0 and 1 correspond to no games played and all games played, respectively). It is also possible to consider a fraction greater than 1: that would correspond to a league where the teams play more than once against each other.

Synthetic data are then created in three main steps:

  1. Fitness of team is set to where . In this way, the fitness values range from to and they are regularly distributed in the range (we investigate other fitness distributions later).

  2. Each team is assigned to play the number of games implied by the chosen fraction of played games, against opponents chosen at random without any two teams playing against each other more than once (in practice, we used the random_degree_sequence_graph function from NetworkX (Hagberg et al., 2008)). If this number is not an integer, teams are assigned to play either the next lower or the next higher integer number of games in such a way that the total number of played games matches the chosen fraction. The home team is chosen at random for each game.

  3. Determine the outcome of each game by Eq. (1).

By varying the model parameters, we can create synthetic results corresponding to a broad range of sports. Note that the algorithm can be easily modified to encompass more complicated settings such as the regular season followed by playoffs, for example.
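The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: fitness values are assumed to be evenly spaced in [0, 1], and the schedule is drawn by sampling team pairs uniformly at random instead of using NetworkX's `random_degree_sequence_graph`, so teams do not necessarily play the same number of games. All function and parameter names are hypothetical.

```python
import itertools
import math
import random


def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def generate_results(n_teams, delta, h, fraction, seed=None):
    """Return (fitness, games); each game is a (home, away, winner) tuple.

    ASSUMPTIONS of this sketch: evenly spaced fitness in [0, 1]; a uniformly
    sampled subset of all pairs is played; 0 <= fraction <= 1.
    """
    if n_teams < 2 or not 0.0 <= fraction <= 1.0:
        raise ValueError("need n_teams >= 2 and 0 <= fraction <= 1")
    rng = random.Random(seed)
    # Step 1: assign fitness values.
    fitness = [i / (n_teams - 1) for i in range(n_teams)]
    # Step 2: choose which pairs play (each pair at most once).
    all_pairs = list(itertools.combinations(range(n_teams), 2))
    pairs = rng.sample(all_pairs, round(fraction * len(all_pairs)))
    games = []
    for a, b in pairs:
        # The home team is chosen at random for each game.
        home, away = (a, b) if rng.random() < 0.5 else (b, a)
        # Step 3: draw the outcome from the logistic winning probability.
        p_home = logistic((fitness[home] - fitness[away] + h) / delta)
        winner = home if rng.random() < p_home else away
        games.append((home, away, winner))
    return fitness, games


fitness, games = generate_results(n_teams=10, delta=0.2, h=0.05,
                                  fraction=0.5, seed=1)
```

Passing a fixed `seed` makes a run reproducible, which is convenient when averaging a metric over many independently generated datasets.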

3.3 Ranking algorithms

In sports leagues, teams are typically ranked by the number of points that they obtain (such as two points for a win, one point for a draw, zero points for a loss). Since we focus here on sports where draws are not possible, ranking the teams by the number of points is the same as ranking them by the ratio of wins. The win ratio is thus our benchmark method.

To apply the PageRank algorithm, we create a directed network of participating teams where all games are represented with directed links. In particular, when player/team $i$ wins against player/team $j$, a directed link from $j$ to $i$ is formed along which “sports prestige” flows: A win against a highly-valued team contributes highly to the winner’s own evaluation. The process can be mathematically represented by the formula (Radicchi, 2011)

$$P_i = (1-q)\sum_j P_j \frac{w_{ji}}{s_j^{\mathrm{out}}} + \frac{q}{N} + \frac{1-q}{N}\sum_j P_j\,\delta(s_j^{\mathrm{out}}) \tag{2}$$

where $P_i$ is the prestige score of team/player/node $i$, $N$ is the number of nodes in the network, $w_{ji}$ is the weight of the link from $j$ to $i$ which is equal to the number of wins of $i$ over $j$, $s_j^{\mathrm{out}}$ is the out-strength of node $j$ (the number of losses of $j$), $\delta(x)$ equals one for $x = 0$ and zero otherwise, and $q$ is the algorithm parameter (often referred to as the teleportation probability). The last term in Eq. (2) makes the algorithm robust against the nodes with $s_j^{\mathrm{out}} = 0$ (“dangling nodes”) which would otherwise act as score sinks. In line with (Radicchi, 2011) and other PageRank literature, we use $q = 0.15$. PageRank has been widely applied to sports results (Gleich, 2015), and its performance is the main focus of this work.
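The prestige formula can be illustrated with a short power-iteration sketch in plain Python. The data layout (a dictionary keyed by `(loser, winner)`), the function names, the fixed number of iterations, and the default teleportation value of 0.15 are choices made for this example only.

```python
def links_from_games(games):
    """Build link weights from (winner, loser) pairs.

    The key (loser, winner) stores the number of wins of `winner` over
    `loser`, i.e. a link pointing from the loser to the winner.
    """
    weights = {}
    for winner, loser in games:
        weights[(loser, winner)] = weights.get((loser, winner), 0) + 1
    return weights


def pagerank(weights, n, q=0.15, n_iter=200):
    """Iterate the prestige equation (a sketch of Eq. (2)) n_iter times."""
    out_strength = [0.0] * n
    for (src, _dst), w in weights.items():
        out_strength[src] += w
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        # Score held by dangling nodes (no outgoing links, i.e. no losses)
        # is redistributed uniformly over all nodes.
        dangling = sum(scores[j] for j in range(n) if out_strength[j] == 0)
        new_scores = [q / n + (1.0 - q) * dangling / n] * n
        for (src, dst), w in weights.items():
            new_scores[dst] += (1.0 - q) * scores[src] * w / out_strength[src]
        scores = new_scores
    return scores


# Team 0 beats teams 1 and 2; team 1 beats team 2.
example_games = [(0, 1), (0, 2), (1, 2)]
prestige = pagerank(links_from_games(example_games), n=3)
```

Each iteration preserves the total score, so the returned values sum to one; in this small example team 0 receives the highest prestige and team 2 the lowest.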

In addition to standard PageRank, we consider here a new method closely based on PageRank which we refer to as bi-directional PageRank (BiPageRank). The bi-directional PageRank score, $B_i$, is defined as

$$B_i = P_i - Q_i \tag{3}$$

where $P_i$ is the previously introduced PageRank score (computed on the original directed network) and $Q_i$ is given by

$$Q_i = (1-q)\sum_j Q_j \frac{w_{ij}}{s_j^{\mathrm{in}}} + \frac{q}{N} + \frac{1-q}{N}\sum_j Q_j\,\delta(s_j^{\mathrm{in}}) \tag{4}$$

In agreement with the previous definition of the link weights, here $w_{ij}$ is the number of losses of $i$ against $j$ and $s_j^{\mathrm{in}}$ is the in-strength of $j$ (the number of wins of $j$). In this way, the winner of a match is assigned a part of the losing team’s score (through $P$) and the team that loses is assigned a part of the winning team’s negative score (through $Q$). Bi-directional PageRank is then a simple combination of the two scores. The motivation for this modification is simple: While $P_i$ allows us to award team $i$ “good” score based on which teams it won against, $Q_i$ allows us to award team $i$ “bad” score based on which teams it lost against. As a practical illustration, take teams $A$ and $B$ that lost all their matches, hence they receive identical PageRank scores. If team $A$ lost against good teams (teams that lose rarely) and team $B$ lost against bad teams (teams that lose often), then $Q_A < Q_B$. The new algorithm thus allows us to distinguish the two teams. Note that separate win and loss scores have also been considered in (Park and Newman, 2005).
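The illustration with two winless teams can be reproduced in a few lines. In this sketch the loss score is obtained by running the same iteration on the reversed network, and the two scores are combined by subtraction; that combination rule, like all names below, is an assumption of this example and not a definitive statement of the authors' formula.

```python
def pagerank(weights, n, q=0.15, n_iter=200):
    """PageRank-style iteration on links given as {(src, dst): weight}."""
    out_strength = [0.0] * n
    for (src, _dst), w in weights.items():
        out_strength[src] += w
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        dangling = sum(scores[j] for j in range(n) if out_strength[j] == 0)
        new_scores = [q / n + (1.0 - q) * dangling / n] * n
        for (src, dst), w in weights.items():
            new_scores[dst] += (1.0 - q) * scores[src] * w / out_strength[src]
        scores = new_scores
    return scores


def bi_pagerank(games, n, q=0.15):
    """Return (win_score, loss_score, combined) for (winner, loser) games.

    ASSUMPTION: the combined score is win_score minus loss_score.
    """
    win_links, loss_links = {}, {}
    for winner, loser in games:
        # Prestige flows from the loser to the winner ...
        win_links[(loser, winner)] = win_links.get((loser, winner), 0) + 1
        # ... while "bad" score flows from the winner to the loser.
        loss_links[(winner, loser)] = loss_links.get((winner, loser), 0) + 1
    win_score = pagerank(win_links, n, q)
    loss_score = pagerank(loss_links, n, q)
    combined = [p - b for p, b in zip(win_score, loss_score)]
    return win_score, loss_score, combined


# Teams 0 and 1 lose all their matches. Team 0 loses to team 2, which
# never loses; team 1 loses to team 3, which itself loses to team 2.
games = [(2, 0), (3, 1), (2, 3)]
win_score, loss_score, combined = bi_pagerank(games, n=4)
```

Here teams 0 and 1 have identical win scores, but team 1 accumulates a larger loss score because it lost to an opponent that itself loses, so the combined score ranks team 0 above team 1.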

3.4 Evaluation metrics

On synthetic data, the resulting ranking of teams produced by a ranking algorithm can be directly compared with their fitness values which are thus used as the ground truth. Denote the computed and ground-truth ranking of team $i$ as $r_i$ and $\hat{r}_i$, respectively. We use the following distinct metrics to quantify the ranking performance of an algorithm:

  1. The Kendall correlation coefficient, $\tau$, is defined as

    $$\tau = \frac{n_c - n_d}{N(N-1)/2} \tag{5}$$

    where $N$ is the number of teams. In the numerator, the first and second terms correspond to the number of “concordant” (their order agrees between the computed and ground-truth ranking) and “discordant” (their order differs between the computed and ground-truth ranking) pairs of teams. Kendall’s $\tau$ ranges from $1$ when the rankings are identical to $-1$ when one ranking is the reverse of the other. Note that tied ranking positions in the computed ranking (the ground-truth ranking has no ties by construction) do not contribute to $\tau$; a degenerate ranking that would assign the same rank to all teams would thus achieve $\tau = 0$.

  2. While Kendall’s $\tau$ takes all teams into account, the other two metrics explicitly focus on how well the top teams are ranked. Firstly, the average top-5 ranking is the average computed ranking of the top 5 ground-truth teams. The smaller the value, the better the computed ranking.

  3. Secondly, the area under the ROC curve, commonly referred to as AUC in the statistics literature. We again use the top 5 ground-truth teams as our goal top set; the other teams constitute the ordinary set. To compute AUC, we use the probabilistic approach (Lü and Zhou, 2011) where we pick $n$ pairs of teams, one from the top set and the other from the ordinary set. If the top team is higher ranked than the ordinary team $n'$ times and tied $n''$ times, the value can be computed as $\mathrm{AUC} = (n' + 0.5\,n'')/n$.
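The three metrics listed above can be sketched in plain Python. The sketch assumes that rank 1 denotes the best team and that rankings are given as lists indexed by team; unlike the sampling-based description, the AUC here is computed exhaustively over all pairs of one top and one ordinary team. Function names are illustrative.

```python
from itertools import combinations


def kendall_tau(computed, truth):
    """(concordant - discordant) / (number of pairs); ties contribute zero."""
    n = len(computed)
    concordant = discordant = 0
    for a, b in combinations(range(n), 2):
        s = (computed[a] - computed[b]) * (truth[a] - truth[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def average_top_rank(computed, truth, k=5):
    """Average computed rank of the k best teams of the ground truth."""
    top = sorted(range(len(truth)), key=lambda t: truth[t])[:k]
    return sum(computed[t] for t in top) / k


def auc_top(computed, truth, k=5):
    """AUC for separating the k ground-truth top teams from the rest."""
    order = sorted(range(len(truth)), key=lambda t: truth[t])
    top, ordinary = order[:k], order[k:]
    better = tied = 0
    for t in top:
        for o in ordinary:
            if computed[t] < computed[o]:      # top team ranked higher
                better += 1
            elif computed[t] == computed[o]:   # tie counts one half
                tied += 1
    return (better + 0.5 * tied) / (len(top) * len(ordinary))


truth = list(range(1, 11))          # ground-truth ranks of 10 teams
reversed_ranking = truth[::-1]      # the exact opposite order
all_tied = [1] * 10                 # degenerate ranking with a single rank
```

A fully tied ranking yields a Kendall value of zero and an AUC of one half, which matches the behaviour described for degenerate rankings.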

| Sport | Country | League name | Label | Size | Years |
|---|---|---|---|---|---|
| Baseball | U.S.A. | Major League Baseball – American League | AL | 14–15 | 1997–2016 |
| Baseball | U.S.A. | Major League Baseball – National League | NL | 15–16 | 1997–2016 |
| Baseball | Japan | Nippon Professional Baseball | NPB | 12–14 | 2010–2019 |
| Baseball | Mexico | Ligue Mexicaine de Baseball | LMB | 14–19 | 2007–2019 |
| Ice hockey | U.S.A. | National Hockey League | NHL | 30–31 | 2000–2019 |
| Ice hockey | Switzerland | Ligue Nationale A | LNA | 12–22 | 2008–2019 |
| Ice hockey | Germany | Deutsche Eishockey Liga | DEL | 14–16 | 2007–2020 |
| Soccer | Germany | Deutsche Fußball Liga | Bundesliga | 18 | 2000–2019 |
| Soccer | Italy | Lega Serie A | Serie A | 20 | 2005–2019 |
| Soccer | Spain | Primera division de Liga | Liga 1 | 20 | 1998–2017 |
| Soccer | England | England Premier League | EPL | 20 | 1999–2018 |
| Soccer | U.S.A. | Major League Soccer | MLS | 10–24 | 2000–2019 |
| Soccer | France | Championnat de France de football Ligue 1 | Ligue 1 | 20 | 2000–2019 |
| Soccer | China | Chinese Football Association | CSL | 12–16 | 2004–2019 |
| Basketball | Italy | Lega Basket Serie A | LBSA | 15–18 | 2008–2019 |
| Basketball | China | Chinese Basketball Association | CBA | 17–20 | 2007–2017 |
| Basketball | Spain | Asociacion de Clubes de Baloncesto | ACB | 17–18 | 2007–2019 |
| Basketball | U.S.A. | National Basketball Association | NBA | 30 | 2001–2020 |

Table 1: Basic description of the analyzed sports results sets. The league size is the number of competing teams (a range is provided if the number varies in the considered period).

3.5 Empirical sports results data

The analyzed sports data have been obtained from the websites https://www.sports-reference.com/ and http://www.win007.com/. Except for soccer, all games have only two possible outcomes: the home team wins or the away team wins. As we consider an outcome model without draws, all soccer draws (20–25% of games in one season for all six considered leagues) are ignored. In one season, the participating teams play against each other two or more times. We analyze only results from regular seasons; playoff matches are ignored. See Table 1 for an overview of the analyzed datasets and their basic statistical properties.

4 Model calibration on real datasets

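To determine realistic values of all model parameters (team fitness values $f_i$, fitness sensitivity $\Delta$, and home advantage $h$), we use maximum likelihood estimation for sets of results from various sports. For a given set of $G$ games, we index a single game as $g$ and denote its home team as $i_g$, its away team as $j_g$, and its result as $y_g$, where $y_g = 1$ means that the home team won game $g$ and $y_g = 0$ means that the home team lost. The data likelihood given the model then has the form

$$L = \prod_{g=1}^{G} P_H(i_g, j_g)^{y_g}\,\left[1 - P_H(i_g, j_g)\right]^{1-y_g} \tag{6}$$

where $P_H(i_g, j_g)$ is the home team's winning probability given by Eq. (1).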

Likelihood maximization for the real datasets described in Section 3.5 reveals the surprising fact that maximum likelihood estimates (MLE) of team fitness values, $f_i$, are close to the fraction of wins, $W_i$, of each respective team in the analyzed dataset. In particular, the difference between the likelihood maximized over all model parameters and the likelihood maximized only over $\Delta$ and $h$ (whereas $f_i$ is replaced with $W_i$) is not sufficient to justify the higher number of parameters in the former model ($N + 2$ vs. $2$; we used the Akaike information criterion for model selection (Claeskens et al., 2008)). This allows us to write the simplified winning probability of the home team as

$$P_H(i,j) = \frac{1}{1+\exp\left[-(W_i - W_j + h)/\Delta\right]} \tag{7}$$

where $W_i - W_j$ is the difference between the win ratios of the home and away teams.
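To make the two-parameter calibration concrete, the following sketch evaluates the log-likelihood of a set of games when team fitness is replaced by the win ratio, and picks the best pair of home advantage and fitness sensitivity from a coarse grid. The grid search is a stand-in chosen for brevity, not the optimizer used in the paper, and all names (`log_likelihood`, `fit_h_delta`, `aic`, the game tuple layout) are hypothetical. The `aic` helper uses the standard definition of the Akaike information criterion.

```python
import math


def log_likelihood(games, win_ratio, h, delta):
    """Log-likelihood of (home, away, home_won) games under the logistic model."""
    total = 0.0
    for home, away, home_won in games:
        x = (win_ratio[home] - win_ratio[away] + h) / delta
        # Stable log of the logistic function: log P = -log(1 + exp(-x)).
        if x >= 0:
            log_p_home = -math.log1p(math.exp(-x))
        else:
            log_p_home = x - math.log1p(math.exp(x))
        # log(1 - P) = log P - x for the logistic function.
        total += log_p_home if home_won else log_p_home - x
    return total


def fit_h_delta(games, win_ratio, h_grid, delta_grid):
    """Coarse grid search for (h, delta); returns (h, delta, log-likelihood)."""
    best = max((log_likelihood(games, win_ratio, h, d), h, d)
               for h in h_grid for d in delta_grid)
    return best[1], best[2], best[0]


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik


# Two equally strong teams; the home team wins every time.
toy_games = [(0, 1, True), (1, 0, True), (0, 1, True)]
toy_ratio = [0.5, 0.5]
h_hat, delta_hat, ll_hat = fit_h_delta(toy_games, toy_ratio,
                                       h_grid=[0.0, 0.1, 0.2],
                                       delta_grid=[0.1, 0.2])
```

Comparing `aic` values of a model with one fitness parameter per team plus two global parameters against the two-parameter model mirrors the model-selection argument made above.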

If the home advantage is absent ($h = 0$), denoting $\pi_i = \exp(f_i/\Delta)$ and $\pi_j = \exp(f_j/\Delta)$ transforms Eq. (1) to $\pi_i/(\pi_i + \pi_j)$, which is precisely the form assumed by the seminal Bradley-Terry model (Bradley and Terry, 1952). We see now that the model formulation presented by Eq. (1) is still advantageous as: (1) Unlike the “winning propensities” $\pi_i$, the team fitness values $f_i$ directly correspond to the team win ratios, (2) $h$ introduces home advantage in a scale that can be directly compared with the teams’ win ratios and their differences (a home advantage of a given size is as important as an equally large difference in win ratios between the teams).
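The reduction to the Bradley-Terry form can be written out in one line. The notation below ($f_i$ for fitness, $\Delta$ for the fitness sensitivity, $\pi_i$ for the winning propensity) is assumed for this worked step; the identity itself follows from multiplying the numerator and denominator of the logistic function by $e^{f_i/\Delta}$.

```latex
\[
\frac{1}{1 + e^{-(f_i - f_j)/\Delta}}
  = \frac{e^{f_i/\Delta}}{e^{f_i/\Delta} + e^{f_j/\Delta}}
  = \frac{\pi_i}{\pi_i + \pi_j},
\qquad \pi_i = e^{f_i/\Delta}.
\]
```

With a non-zero home advantage $h$, the same manipulation multiplies the home team's propensity by the constant factor $e^{h/\Delta}$, which is another way to see why the ratio $h/\Delta$ sets the effective home advantage.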

Figure 1: Relationship between the win ratio difference and the home team winning probability for four distinct leagues. Symbols show empirical winning probabilities for a given end-of-season win ratio difference (data points based on fewer than four games have been omitted), error bars show double the standard error of the mean, and lines show the model probability of winning given by Eq. (7) for the maximum likelihood estimates of the fitness sensitivity and the home advantage (shown in each panel). The horizontal and vertical dashed lines show the baseline win probability of $1/2$ and the zero win ratio difference, respectively.

Using four sample results sets, Figure 1 shows a comparison between Eq. (1) with maximum likelihood estimates for $\Delta$ and $h$ and the empirical winning probability plotted as a function of the win ratio difference between the competing teams. The good agreement that can be observed over the whole range of win ratio differences confirms that Eq. (1) can model the empirical data well. The sigmoid curve’s steepness in the figure is directly related to the fitness sensitivity parameter $\Delta$ (smaller $\Delta$ yields higher steepness). Note also that the home team's winning probability exceeds $1/2$ when the win ratio difference is zero, which is a direct consequence of a positive home advantage in all four results sets.

Figure 2: Parameter estimates for the sports results sets from Table 1 (each panel shows a different sport). Each symbol represents the parameter estimates for a single season.

Figure 2 further summarizes the maximum likelihood parameter estimates in all analyzed results sets, divided in panels by sport. We see that different sports differ in their level of randomness as characterized by $\Delta$ (baseball and basketball are the most and the least random sport, respectively). Different leagues in the same sport have mostly similar $\Delta$ values except for the basketball leagues NBA (U.S.A.) and CBA (China) where CBA is significantly less random than NBA (in fact, CBA is the least random league on average among the 18 analyzed leagues). The home advantage value $h$ is distributed between 0 and 0.25, and the home advantage is more pronounced in basketball and soccer than in baseball and ice hockey. In agreement with Eq. (1), the effective strength of the home advantage is characterized by $h/\Delta$, which is shown in Figure 2 on the vertical axes. The values of $h/\Delta$ differ significantly between the sports as well as between different leagues in the same sport. CBA again stands out by having the highest average effective home advantage. By contrast, baseball leagues have an average effective home advantage 5.4 times smaller than CBA.

Figure 3: Performance of the studied ranking algorithms on synthetic data without home advantage. Panels (a)–(c) use three different evaluation metrics and plot the results as a function of fitness sensitivity, $\Delta$, for various fractions of played matches. The lines represent mean results and the shaded regions indicate double the standard error of the mean, all determined from 100 independently created synthetic datasets. Panel (d) shows the difference in Kendall's $\tau$ between bi-directional PageRank and the win ratio as a function of both $\Delta$ and the fraction of played matches.

5 Results on synthetic datasets

Our goal now is to assess the performance of different ranking algorithms on synthetic sports results generated by the above-described algorithm. In simulations, we assume that there are 30 teams; the results are robust with respect to the number of teams. We begin by studying the case of no home advantage ($h = 0$) and explore a range of fitness sensitivity values, $\Delta$, which corresponds to the empirical values in Figure 2.

Figure 3 shows the results of numerical simulations comparing the three considered ranking algorithms as a function of fitness sensitivity, $\Delta$, and the fraction of matches played. Panels (a)–(c) show that the comparison results are remarkably similar for the three evaluation metrics (Kendall’s $\tau$, AUC, and average top-5 ranking). In particular, the results show that: (1) As the fraction of matches played grows, the ranking performance improves as expected. (2) PageRank outperforms the win ratio only when the fraction of matches played is sufficiently small, and the range of PageRank’s superiority shrinks as $\Delta$ grows. The threshold fraction below which PageRank outperforms the win ratio is considerably stable with respect to the number of teams, $N$ (results not shown). (3) The newly proposed bi-directional PageRank is always an improvement (or a tie) over standard PageRank. (4) Figure 2 shows that a vast majority of the analyzed datasets lie in a range of $\Delta$ which, together with Figure 3, implies that the use of PageRank brings no improvement in sport tournament rankings. Moreover, PageRank is significantly inferior for sports with high randomness (high $\Delta$) later in a season. Finally, the heatmap in Figure 3(d) provides a comparison between bi-directional PageRank and the win ratio for a broad range of the two key parameters, $\Delta$ and the fraction of matches played. It shows clearly that the two ranking algorithms tie at a fraction of matches played which progressively decreases as $\Delta$ grows.

Based on Figure 3, we can conclude that PageRank and bi-directional PageRank are both more sensitive than the win ratio to increasing randomness of outcomes (represented by increasing $\Delta$). This increased sensitivity can be explained by the algorithms’ network nature: While a “surprise” outcome of a single match has only a local impact on the win ratio (only the two competing teams are affected), PageRank propagates its scores further over the whole network of teams. When $\Delta$ is sufficiently large, the surprising outcomes are numerous and their network propagation and accumulation are ultimately detrimental to the ranking ability of PageRank and bi-directional PageRank.

Figure 4: Performance of the studied ranking algorithms on synthetic data with home advantage (top row) and with a non-uniform distribution of team fitness (bottom row). Panels (a)–(c) show the ranking performance vs. the home advantage parameter, $h$, for fixed values of the other model parameters (as in Figure 3). Panel (d) shows the relation between the fraction of wins and the team rank (from the worst to the best) for four chosen result sets (NHL, 2011; ACB, 2011; LMB, 2012; Bundesliga, 2010). Panel (e) shows fits of Eq. (8) for all seasons of seven different leagues. Panel (f) shows the difference between the win ratio and bi-directional PageRank for synthetic data with various values of the two fitness distribution parameters (results are averaged over 100 realizations).

The top row of Figure 4 evaluates the ranking performance of algorithms when the home advantage parameter, $h$, is positive. We see that as $h$ grows, the performance of all three ranking algorithms deteriorates. At the same time, the win ratio is more robust to increasing $h$ than the other two ranking methods, which is in line with its higher robustness to increasing $\Delta$. In particular, the number of unexpected results (a weaker team wins against a stronger team) increases as $h$ grows and these unexpected results negatively affect the ranking results of PageRank and bi-directional PageRank. The home advantage thus further reduces the limited range of applicability of PageRank (the range in which PageRank outperforms the win ratio).

In synthetic data so far, we assumed the team fitness values to be uniformly distributed. That this is not the case in real data can be easily illustrated as we have already shown that the win ratio is a good approximation for team fitness. Figure 4d shows the win ratio in four different datasets and reveals distinct non-linearity for two of them. This motivates us to consider a non-linear assignment of fitness in the form

(8)

which fits most of the considered datasets well (see least-squares fits in Figure 4d). A power-law fitness distribution has also been suggested before in (Da Silva et al., 2013). In Eq. (8), one parameter controls the heterogeneity of the fitness distribution and another determines the difference between the worst and the best team. Once these two parameters are chosen, the remaining absolute term is fixed by the requirement that the average win ratio be one half (as someone’s win is always someone else’s loss). Also, the fitness difference that influences the match outcome in Eq. (1) is directly determined by the first two parameters as the absolute term cancels out.

Figure 4e further shows the fitted values of the two parameters for seven representative leagues and helps us identify the relevant ranges for them. Figure 4f shows the difference between the win ratio and bi-directional PageRank for synthetic data generated with the two parameters in the identified ranges. We choose here on purpose parameters that favor bi-directional PageRank: small randomness (low $\Delta$), no home advantage ($h = 0$), and few games played. In agreement with the results presented in Figure 3, bi-directional PageRank outperforms the win ratio for the parameter values at which the fitness values are uniformly distributed. We see now that this is essentially the ideal setup for bi-directional PageRank as its advantage decreases when the heterogeneity parameter departs substantially from this setting as well as when the difference between the worst and the best team is lowered. This is because the average fitness difference between the teams then decreases which, in agreement with Eq. (1), increases the probability of unexpected outcomes. The randomness introduced in this way is detrimental to the performance of PageRank and bi-directional PageRank, which is clearly visible in Figure 4f. When the fitness sensitivity, the home advantage, or the fraction of played games increase, the behavior is similar; only the region where bi-directional PageRank outperforms the win ratio shrinks.

Figure 5: Ranking performance of the evaluated algorithms when a fraction of unexpected outcomes (team $i$ wins over team $j$ when $f_i < f_j$) is: (a) removed, (b) reverted. Results are averaged over 100 independent realizations of the model.

In summary, we identified the sensitivity of PageRank and bi-directional PageRank to unexpected results as the main factor limiting their performance. Unexpected results are due to intrinsic randomness of sport, home advantage, and similarity of team fitness values (in reality, many other factors contribute: weather, immediate form of individual players, injuries, and others). As our initial empirical analysis shows, all these factors are common to sports results data. If substantial randomness of results is inevitable, one can ask if we can at least suppress the unexpected results to help PageRank and bi-directional PageRank perform better and possibly outperform the win ratio. To explore the feasibility of this idea, we benefit from the use of synthetic data where team fitness values are known. We can thus identify the unexpected results (wins of weaker teams against stronger teams) and either remove them from the dataset (see Figure 5a) or reverse them (see Figure 5b). In Figure 5, we remove or correct a gradually increasing fraction of unexpected outcomes (a fraction of 1 means that all unexpected outcomes have been treated), which naturally benefits all three evaluated algorithms. However, PageRank and bi-directional PageRank require a large treated fraction for their performance to improve substantially, whereas the win ratio improves uniformly over the whole range of treated fractions. As a result, there is no treated fraction for which PageRank or bi-directional PageRank perform better than the win ratio. In real data where team fitness values are not directly known, one would first have to identify the unexpected results, which would further lower the efficiency of this approach. We can thus conclude that the removal or correction of unexpected results cannot help PageRank and bi-directional PageRank outperform the win ratio.
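The treatment of unexpected outcomes can be sketched as a small filter over a list of games. The sketch assumes that games are stored as `(winner, loser)` pairs and that the ground-truth fitness of every team is known, as is the case for synthetic data; the function name, the `mode` argument, and the random choice of which unexpected games to treat are illustrative assumptions.

```python
import random


def treat_unexpected(games, fitness, fraction, mode="remove", seed=None):
    """Remove or revert a fraction of games won by the weaker team.

    games: list of (winner, loser); fitness: ground-truth fitness per team.
    mode: "remove" drops the chosen games, "revert" swaps winner and loser.
    """
    if mode not in ("remove", "revert"):
        raise ValueError("mode must be 'remove' or 'revert'")
    rng = random.Random(seed)
    # A result is unexpected when the winner has lower fitness than the loser.
    unexpected = [k for k, (w, l) in enumerate(games) if fitness[w] < fitness[l]]
    chosen = set(rng.sample(unexpected, round(fraction * len(unexpected))))
    treated = []
    for k, (w, l) in enumerate(games):
        if k not in chosen:
            treated.append((w, l))
        elif mode == "revert":
            treated.append((l, w))
        # In "remove" mode the chosen game is simply skipped.
    return treated


toy_fitness = [0.1, 0.5, 0.9]
toy_games = [(0, 1), (1, 0), (2, 0), (0, 2)]  # first and last are upsets
removed = treat_unexpected(toy_games, toy_fitness, 1.0, mode="remove")
reverted = treat_unexpected(toy_games, toy_fitness, 1.0, mode="revert")
```

The treated list can then be passed to any of the ranking algorithms to see how their performance changes with the treated fraction.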

6 Results on real datasets

After comparing the ranking performance on synthetic datasets in the previous section, we now present a similar comparison on real datasets. Since team fitness values are not known in real data, we use the ranking of all teams at the end of the season as the ground truth and compare it against the rankings produced by the respective algorithms in earlier parts of the season. This choice is motivated by the earlier observation that the team win ratio is a good proxy for team fitness [see the discussion before Eq. (7)]. For each season, we use only an initial fraction of its games as input for a given algorithm and quantify the algorithm's performance using Kendall's τ between the computed ranking and the end-of-season number of wins. This value is then averaged over seasons. In this way, we compare the performance of the win ratio with that of bi-directional PageRank by evaluating the difference between their season-averaged values of Kendall's τ.
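The evaluation procedure can be sketched as follows. This is an illustration only: it assumes that games are stored in chronological order as `(winner, loser)` pairs, it uses the tie-aware tau-b variant of Kendall's τ (the text does not state which variant is used), and all function names are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-b between two equally long score lists (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: counts toward neither factor
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = ((concordant + discordant + ties_x)
             * (concordant + discordant + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


def win_counts(games, teams):
    """Number of wins per team."""
    wins = {t: 0 for t in teams}
    for winner, _ in games:
        wins[winner] += 1
    return wins


def win_ratio(games, teams):
    """Wins divided by games played (0 for teams that have not played yet)."""
    wins = win_counts(games, teams)
    played = {t: 0 for t in teams}
    for winner, loser in games:
        played[winner] += 1
        played[loser] += 1
    return {t: wins[t] / played[t] if played[t] else 0.0 for t in teams}


def early_season_tau(games, teams, fraction, score_fn):
    """Tau between scores from the first `fraction` of games and final wins."""
    n_used = int(fraction * len(games))
    early = score_fn(games[:n_used], teams)
    final = win_counts(games, teams)
    return kendall_tau([early[t] for t in teams], [final[t] for t in teams])


def season_averaged_tau(seasons, fraction, score_fn):
    """Average of early_season_tau over a list of (games, teams) seasons."""
    taus = [early_season_tau(g, t, fraction, score_fn) for g, t in seasons]
    return sum(taus) / len(taus)
```

Calling `season_averaged_tau` once with `win_ratio` and once with another scoring function at the same `fraction`, and subtracting the two results, gives the kind of performance difference discussed in this section.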

Figure 6: The performance difference between bi-directional PageRank and the win ratio in real sports data, where the final win ratio in each season is used as the ground truth (the results are averaged over the last 10 available seasons). Individual rows represent different leagues from the beginning of the season (left) until its end (right). Vertical markers highlight the largest fraction of games played at which bi-directional PageRank outperforms the win ratio.

The results are shown in Figure 6, where each horizontal bar represents one league, with the left and right sides representing the start and the end of the season, respectively, and the computed differences between the two methods are color-coded. We see that despite using the final fraction of wins in a season as the ground truth (which obviously favors the win ratio), bi-directional PageRank is still able to outperform the win ratio when only a small fraction of games have been played. In a direct parallel with previously presented results on synthetic data, the win ratio outperforms bi-directional PageRank almost always except for the very beginning of the season. To highlight the transition between the early part of the season when bi-directional PageRank is best and the later part when the win ratio is best, we mark with a vertical line, for each league, the largest fraction of games played at which bi-directional PageRank outperforms the win ratio. These threshold values are around 0.1 or lower except for two leagues (ACB and DEL), for which narrow ranges where bi-directional PageRank is weakly ahead appear also late in the season. The overall behavior is best visible in the last row, where the difference between the two methods is averaged over all considered leagues. The threshold value here is and bi-directional PageRank never outperforms the win ratio by more than (as measured by Kendall's τ). This confirms in a model-free way that PageRank and bi-directional PageRank are beneficial for sports results data only when the information is scarce (few games have been played). When sufficiently many teams have already played against each other, the win ratio generally ranks the teams better.
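Locating the threshold marked by the vertical lines amounts to finding the largest fraction of games played at which the performance difference is still positive. A minimal sketch, with a hypothetical function name and purely illustrative numbers that are not taken from the paper:

```python
def last_positive_fraction(fractions, differences):
    """Largest fraction at which the difference is positive, or None if none is."""
    positive = [f for f, d in zip(fractions, differences) if d > 0]
    return max(positive) if positive else None


# Illustrative values only (not results from the paper).
fractions = [0.05, 0.1, 0.2, 0.5, 1.0]
differences = [0.03, 0.01, -0.02, -0.05, 0.0]
threshold = last_positive_fraction(fractions, differences)
```

Taking the maximum rather than the first sign change matters for leagues in which the difference turns positive again later in the season.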

7 Conclusions

In this paper, we have focused on sports results data from regular leagues where a fixed number of teams play against each other. Results of the games can be represented as a directed network in which a directed link between two teams is drawn when one of them has won over the other. We have evaluated the ranking performance of three distinct methods to rank the competing teams: their win ratio, their PageRank score, and their newly proposed bi-directional PageRank score. Bi-directional PageRank combines two different scores: a positive one which accumulates mainly through winning over good opponents (as in PageRank), and a negative one which accumulates mainly through losing against bad opponents. We have calibrated a model for synthetic sports results, a variant of the classical Bradley–Terry model (Bradley and Terry, 1952). The model uses only two parameters, home advantage and sport randomness, yet it produces excellent agreement with empirical sports results.
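For readers who want to experiment, the network representation and the standard PageRank score can be sketched as below. This is a plain power-iteration sketch under stated assumptions, not the authors' implementation: links are assumed to point from the loser to the winner (so that score flows toward winning teams), repeated wins are treated as link weights, the damping factor 0.85 is the conventional default rather than a value taken from the paper, and bi-directional PageRank is not covered.

```python
def pagerank(games, teams, damping=0.85, tol=1e-12, max_iter=1000):
    """PageRank scores on the win network built from (winner, loser) pairs.

    Assumption: each game adds a link from the loser to the winner.
    """
    n = len(teams)
    # out_links[loser][winner] = number of wins of `winner` over `loser`
    out_links = {t: {} for t in teams}
    for winner, loser in games:
        out_links[loser][winner] = out_links[loser].get(winner, 0) + 1
    out_weight = {t: sum(out_links[t].values()) for t in teams}

    score = {t: 1.0 / n for t in teams}
    for _ in range(max_iter):
        # Teams without any loss have no outgoing links (dangling nodes);
        # their score is redistributed uniformly over all teams.
        dangling = sum(score[t] for t in teams if out_weight[t] == 0)
        new = {t: (1.0 - damping) / n + damping * dangling / n for t in teams}
        for loser in teams:
            for winner, w in out_links[loser].items():
                new[winner] += damping * score[loser] * w / out_weight[loser]
        change = sum(abs(new[t] - score[t]) for t in teams)
        score = new
        if change < tol:
            break
    return score
```

The returned scores sum to one, and ranking the teams by decreasing score gives the PageRank ranking that can be compared with the ranking by win ratio.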

The ranking algorithms were first evaluated on synthetic data. The main finding is that PageRank outperforms the win ratio only when a small fraction of all games have been played and the randomness of results is sufficiently small. In particular, PageRank yields for the levels of randomness found in real sports data (we considered baseball, ice hockey, soccer and basketball). Note that while Ghoshal and Barabási (2011) report that incompleteness of the network is harmful to PageRank’s performance, which is natural, we find that incompleteness of the network is actually favorable when PageRank’s performance is judged relative to the win ratio benchmark.

The newly proposed bi-directional PageRank outperforms PageRank for all parameter settings, yet it outperforms the win ratio only for sports with the lowest randomness when a small fraction of all games (10%) have been played. Both PageRank and bi-directional PageRank further suffer when other sources of randomness (home advantage and non-uniform distribution of team abilities) are considered. The sensitivity of PageRank and closely related ranking methods to changes in the data has already been discussed by, for example, Chartier et al. (2011). We demonstrate here, for the first time, that this sensitivity combined with the natural randomness of sport renders PageRank of little use on results from a sport tournament. By contrast, the ranking of teams by their win ratio turns out to be comparatively robust to various sources of randomness in results.

To keep the model for synthetic sports results simple, we neglected further factors that can be addressed in future research. The assumption of fixed team fitness can be relaxed to allow for modeling a variable level of play or temporary adverse effects of injuries, for example. The simple tournament setup can be generalized to irregular games between the teams, as is the case for national teams in soccer or players in tennis. In tennis, in particular, each player has recently played only a small fraction of other players. Using the terminology of our model, the effective fraction of games played is small, which suggests that PageRank might have some merit for tennis data.

Besides providing specific results on the use of PageRank on sports results data, our work highlights the need to carefully assess the actual performance and limitations of network metrics. This need is exacerbated by the complexity of systems that produce the data, which makes it difficult to judge ex ante whether an algorithm is a good match for the data. In citation data, for example, PageRank has been frequently used, yet Mariani et al. (2015) show that the natural growth of the citation network makes PageRank scores difficult to interpret. If a ground truth set is available, a comparative assessment on real data is possible. This can be made more robust by considering multiple real datasets and multiple ground truth sets, as done recently by Xu et al. (2020) to compare ranking metrics for citation data. If a ground truth set is not available but a credible model for a given system exists, an assessment using synthetic data (as we have used here) is a practical alternative. Using a network metric without understanding its scope and limitations carries the risk of obtaining unreliable or inferior results.

Acknowledgement

This work was supported by the Swiss National Science Foundation (grant No. 182498). MM and AZ acknowledge support from the National Natural Science Foundation of China (grant Nos. 11850410444 and 71731002, respectively).

References

  • Aoki et al. (2017) R. Y. Aoki, R. M. Assuncao, and P. O. Vaz de Melo. Luck is hard to beat: The difficulty of sports prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1367–1376, 2017.
  • Ben-Naim et al. (2013) E. Ben-Naim, N. Hengartner, S. Redner, and F. Vazquez. Randomness in competitions. Journal of Statistical Physics, 151(3-4):458–474, 2013.
  • Bozóki et al. (2016) S. Bozóki, L. Csató, and J. Temesi. An application of incomplete pairwise comparison matrices for ranking top tennis players. European Journal of Operational Research, 248(1):211–218, 2016.
  • Bradley (1976) R. A. Bradley. Science, statistics, and paired comparisons. Biometrics, 32(2):213–239, 1976.
  • Bradley and Terry (1952) R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Brechot and Flepp (2020) M. Brechot and R. Flepp. Dealing with randomness in match outcomes: How to rethink performance evaluation in European club football using expected goals. Journal of Sports Economics, 21(4):335–362, 2020.
  • Brin and Page (1998) S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.
  • Cattelan (2012) M. Cattelan. Models for paired comparison data: A review with emphasis on dependent data. Statistical Science, pages 412–433, 2012.
  • Chartier et al. (2011) T. P. Chartier, E. Kreutzer, A. N. Langville, and K. E. Pedings. Sensitivity and stability of ranking vectors. SIAM Journal on Scientific Computing, 33(3):1077–1102, 2011.
  • Chetrite et al. (2017) R. Chetrite, R. Diel, M. Lerasle, et al. The number of potential winners in Bradley–Terry model in random environment. The Annals of Applied Probability, 27(3):1372–1394, 2017.
  • Claeskens et al. (2008) G. Claeskens, N. L. Hjort, et al. Model selection and model averaging. Cambridge Books, 2008.
  • Colley (2002) W. N. Colley. Colley’s bias free college football ranking method: The Colley matrix explained. Princeton University, Princeton, 2002.
  • Da Silva et al. (2013) S. Da Silva, R. Matsushita, and E. Silveira. Hidden power law patterns in the top European football leagues. Physica A: Statistical Mechanics and Its Applications, 392(21):5376–5386, 2013.
  • David (1988) H. A. David. The method of paired comparisons, volume 12. London, 1988.
  • Dechenaux et al. (2015) E. Dechenaux, D. Kovenock, and R. M. Sheremeta. A survey of experimental research on contests, all-pay auctions and tournaments. Experimental Economics, 18(4):609–669, 2015.
  • Deng et al. (2012) W. Deng, W. Li, X. Cai, A. Bulou, and Q. A. Wang. Universal scaling in sports ranking. New Journal of Physics, 14(9):093038, 2012.
  • Frederick-Recascino et al. (2003) C. M. Frederick-Recascino, H. Schuster-Smith, et al. Competition and intrinsic motivation in physical activity: A comparison of two groups. Journal of Sport Behaviour, 26(3):240–254, 2003.
  • Ghoshal and Barabási (2011) G. Ghoshal and A.-L. Barabási. Ranking stability and super-stable nodes in complex networks. Nature Communications, 2(1):1–7, 2011.
  • Giulianotti (2015) R. Giulianotti. Sport: A critical sociology. John Wiley & Sons, 2015.
  • Gleich (2015) D. F. Gleich. PageRank beyond the Web. SIAM Review, 57(3):321–363, 2015.
  • Govan et al. (2008) A. Y. Govan, C. D. Meyer, and R. Albright. Generalizing Google’s PageRank to rank national football league teams. In Proceedings of the SAS Global Forum, volume 2008, 2008.
  • Hagberg et al. (2008) A. A. Hagberg, D. A. Schult, and P. J. Swart. Exploring network structure, dynamics, and function using NetworkX. In G. Varoquaux, T. Vaught, and J. Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11 – 15, Pasadena, CA USA, 2008.
  • Hone and Silvers (2006) P. Hone and R. Silvers. Measuring the contribution of sport to the economy. Australian Economic Review, 39(4):412–419, 2006.
  • Júnior et al. (2012) P. S. P. Júnior, M. A. Gonçalves, A. H. Laender, T. Salles, and D. Figueiredo. Time-aware ranking in sport social networks. Journal of Information and Data Management, 3(3):195–195, 2012.
  • Keener (1993) J. P. Keener. The Perron–Frobenius theorem and the ranking of football teams. SIAM Review, 35(1):80–93, 1993.
  • Lazova and Basnarkov (2015) V. Lazova and L. Basnarkov. PageRank approach to ranking national football teams. arXiv preprint arXiv:1503.01331, 2015.
  • Le Bouc and Pessiglione (2013) R. Le Bouc and M. Pessiglione. Imaging social motivation: distinct brain mechanisms drive effort production during collaboration versus competition. Journal of Neuroscience, 33(40):15894–15902, 2013.
  • Lü and Zhou (2011) L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.
  • Mariani et al. (2015) M. S. Mariani, M. Medo, and Y.-C. Zhang. Ranking nodes in growing networks: When PageRank fails. Scientific Reports, 5:16181, 2015.
  • McPherson et al. (1989) B. D. McPherson, J. E. Curtis, J. W. Loy, et al. The social significance of sport: An introduction to the sociology of sport. Human Kinetics Publishers, 1989.
  • Motegi and Masuda (2012) S. Motegi and N. Masuda. A network-based dynamical ranking system for competitive sports. Scientific Reports, 2:904, 2012.
  • Mukherjee (2012) S. Mukherjee. Identifying the greatest team and captain: A complex network approach to cricket matches. Physica A: Statistical Mechanics and its Applications, 391(23):6066–6076, 2012.
  • Nevill et al. (2005) A. Nevill, N. Balmer, S. Wolfson, et al. The extent and causes of home advantage: Some recent insights. Journal of Sports Sciences, 23(4):335–445, 2005.
  • O’Malley (2008) A. J. O’Malley. Probability formulas and statistical analysis in tennis. Journal of Quantitative Analysis in Sports, 4(2), 2008.
  • Park and Newman (2005) J. Park and M. E. Newman. A network-based ranking system for US college football. Journal of Statistical Mechanics: Theory and Experiment, 2005(10):P10014, 2005.
  • Radicchi (2011) F. Radicchi. Who is the best player ever? A complex network analysis of the history of professional tennis. PLoS ONE, 6(2):e17249, 2011.
  • Rao and Kupper (1967) P. Rao and L. L. Kupper. Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association, 62(317):194–204, 1967.
  • Redmond (2003) C. Redmond. A natural generalization of the win-loss rating system. Mathematics Magazine, 76(2):119–126, 2003.
  • Ribeiro et al. (2016) H. V. Ribeiro, S. Mukherjee, and X. H. T. Zeng. The advantage of playing home in NBA: Microscopic, team-specific and evolving features. PLoS ONE, 11(3):e0152440, 2016.
  • Rosner and Shropshire (2011) S. Rosner and K. Shropshire. The business of sports. Jones & Bartlett Publishers, 2011.
  • Spanias and Knottenbelt (2013) A. D. Spanias and B. W. Knottenbelt. Tennis player ranking using quantitative models. Manuscript. http://www.doc.ic.ac.uk/~wjk/publications/spanias-knottenbelt-mis-2013.pdf, 2013.
  • Szymanski (2003) S. Szymanski. The economic design of sporting contests. Journal of Economic Literature, 41(4):1137–1187, 2003.
  • Tranter (1998) N. Tranter. Sport, economy and society in Britain 1750-1914, volume 33. Cambridge University Press, 1998.
  • Xu et al. (2020) S. Xu, M. S. Mariani, L. Lü, and M. Medo. Unbiased evaluation of ranking metrics reveals consistent performance in science and technology citation data. Journal of Informetrics, 14(1):101005, 2020.