1.1 Modelling and predicting competitive sports
Competitive sports refer to any sport in which two teams or individuals compete against each other to achieve the higher score. Competitive team sports include some of the most popular and most watched games, such as football, basketball and rugby. Such sports are played both in domestic professional leagues, such as the National Basketball Association, and in international competitions, such as the FIFA World Cup. For football alone, there are over one hundred fully professional leagues in 71 countries globally. It is estimated that the Premier League, the top football league in the United Kingdom, attracted a (cumulative) television audience of 4.7 billion viewers in its last season.
The outcome of a match is determined by a large number of factors. Just to name a few, they might involve the competitive strength of each individual player in both teams, the smoothness of collaboration between players, and the team’s strategy of playing. Moreover, the composition of any team changes over the years, for example because players leave or join the team. The team composition may also change within the tournament season or even during a match because of injuries or penalties.
Understanding these factors is, by the prediction-validation nature of the scientific method, closely linked to predicting the outcome of a pairing. By Occam’s razor, the factors which empirically help in prediction are exactly those that one may hypothesize to be relevant for the outcome.
Since keeping track of all relevant factors is unrealistic, one cannot, of course, expect a certain prediction of a competitive sports outcome. Moreover, it is also unreasonable to believe that all factors can be measured or controlled, hence it is reasonable to assume that unpredictable, or non-deterministic, statistical "noise" is involved in the process of generating the outcome (or to subsume the unknowns as such noise). A good prediction will, hence, not exactly predict the outcome, but will anticipate the "correct" odds more precisely. The extent to which the outcomes are predictable may hence be considered a surrogate quantifier of how much the outcome of a match is influenced by "skill" (as surrogated by determinism/prediction), or by "chance"¹ (as surrogated by the noise/unknown factors).
¹ We expressly avoid use of the word "luck", as in vernacular use it often means "chance" jointly with the belief that it may be influenced by esoteric, magical or otherwise metaphysical means. While in the suggested surrogate use it may well be that the "chance" component of a model subsumes possible points of influence which simply are not measured or observed in the data, an extremely strong corpus of scientific evidence implies that these will not be metaphysical, only unknown; two qualifiers which are obviously not the same, despite strong human tendencies to believe the contrary.
Phenomena which can not be specified deterministically are in fact very common in nature. Statistics and probability theory provide ways to make inference under randomness. Therefore, modelling and predicting the results of competitive team sports naturally falls into the area of statistics and machine learning. Moreover, any interpretable predictive model yields a possible explanation of what constitutes factors influencing the outcome.
1.2 History of competitive sports modelling
Research on modelling competitive sports has a long history. In its early days, research was often closely related to sports betting or player/team ranking [22, 26]. The two most influential approaches are due to Bradley and Terry  and to Élő . The Bradley-Terry and Élő models allow estimation of player ratings; the Élő system additionally contains algorithmic heuristics to easily update a player's rank, which have been in use for official chess rankings since the 1960s. The Élő system is also designed to predict the odds of a player winning or losing against an opponent. In contemporary practice, Bradley-Terry and Élő type models are broadly used in the modelling of sports outcomes and the ranking of players, and it has been noted that the two are mathematically very close.
More recently, relatively diverse modelling approaches originating from the Bayesian statistical framework [37, 13, 20], as well as some inspired by machine learning principles [36, 23, 43], have been applied to modelling competitive sports. These models are more expressive and remove some of the limitations of the Bradley-Terry and Élő models, though usually at the price of interpretability, computational efficiency, or both.
A more extensive literature overview of existing approaches will be given later, in Section 3, as the literature spans multiple communities and, in our opinion, a prior exposition of the technical setting and a simultaneous ordering of the relevant ideas benefit the understanding, and allow us to give proper credit and context for the widely different ideas employed in competitive sports modelling.
1.3 Aim of competitive sports modelling
In the literature, the study of competitive team sports may be seen to lie between two primary goals. The first goal is to design models that make good predictions for future match outcomes. The second goal is to understand the key factors that influence the match outcome, mostly through retrospective analysis [45, 50]. As explained above, these two aspects are intrinsically connected, and in our view they are two facets of a single problem: on one hand, proposed influential factors are only scientifically valid if confirmed by falsifiable experiments, such as predictions on future matches. If the predictive performance does not increase when information about such factors enters the model, one should conclude by Occam's razor that these factors are actually irrelevant². On the other hand, it is plausible to assume that predictions improve as relevant factors (also known as "features") become available, for example because they are capable of explaining unmodelled random effects (noise). In light of this, the main problem considered in this work is the (validatable and falsifiable) prediction problem, which in machine learning terminology is also known as the supervised learning task.
² … to distinguish/characterize the observations, which in some cases may plausibly pertain to restrictions in the set of observations, rather than to causative relevance. Hypothetical example: the age of football players may be identified as unimportant for the outcome, which may plausibly be due to the fact that the data contained no players of ages 5 or 80, say, as opposed to player age being unimportant in general. Rephrased: it is only unimportant for cases that are plausible to be found in the data set in the first place.
1.4 Main questions and challenges in competitive sports outcomes prediction
Given the above discussion, the major challenges may be stated as follows:
On the methodological side, what are suitable models for competitive sports outcomes? No current model is simultaneously interpretable, easily computable, able to use feature information on the teams/players, and able to predict scores or ternary outcomes. It is an open question how best to achieve this, and this manuscript attempts to highlight a possible path.
The main technical difficulty lies in the fact that off-the-shelf methods do not apply due to the structured nature of the data: unlike in individual sports such as running and swimming, where the outcome depends only on the given individual and where the prediction task may be dealt with using classical statistics and machine learning technology (see  for a discussion of this in the context of running), in competitive team sports the outcome may be determined by potentially complex interactions between two opposing teams. In particular, the performance of any team is not measured directly using a simple metric, but only in relation to the opposing team's performance.
On the side of domain applications, which in this manuscript is Premier League football, it is of great interest to determine the relevant factors determining the outcome, the best way to predict, and which ranking systems are fair and appropriate.
All these questions are related to predictive modelling, as well as to the availability of suitable amounts of quality data. Unfortunately, the scarcity of features available in systematic form places a hurdle in the way of academic research in competitive team sports, especially when it comes to assessing important factors such as team member characteristics, or strategic considerations during the match.
Moreover, closely linked is the question of the extent to which outcomes are determined by "chance" as opposed to "skill". If, on one hypothetical extreme, results proved to be completely unpredictable, there would be no empirical evidence to distinguish the matches from a game of chance such as flipping a coin. On the other hand, the importance of a measurement for prediction would strongly suggest its importance for winning (or losing), though without an experiment this does not necessarily establish a causal link.
We attempt to address these questions in the case of Premier League football, within the confines of readily available data.
1.5 Main contributions
Our main contributions in this manuscript are the following:
We give what we believe to be the first comprehensive literature review of state-of-the-art competitive sports modelling that comprises the multiple communities (Bradley-Terry models, Élő type models, Bayesian models, machine learning) in which research has so far been conducted mostly separately.
We present a unified Bradley-Terry-Élő model which combines the statistical rigour of the Bradley-Terry models with fitting and update strategies similar to those found in the Élő system. Mathematically only a small step, this joint view is essential in a predictive/supervised setting, as it allows efficient training and application in an on-line learning situation. Practically, this step solves some problems of the Élő system (including ranking initialization and the choice of K-factor), and establishes close relations to logistic regression, low-rank matrix completion, and neural networks.
This unified view of Bradley-Terry-Élő allows us to introduce classes of joint extensions, the structured log-odds models, which unite desirable properties of the extensions found in the disjoint communities: probabilistic prediction of scores and wins/draws/losses, batch/epoch and on-line learning, as well as the possibility to incorporate features in the prediction, without having to sacrifice the structural parsimony of the Bradley-Terry models, or the simplicity and computational efficiency of Élő's original approach.
We validate the practical usefulness of the structured log-odds models in synthetic experiments and in answering domain questions on English Premier League data, most prominently on the importance of features, fairness of the ranking, as well as on the “chance”-“skill” divide.
1.6 Manuscript structure
Section 2 gives an overview of the mathematical setting in competitive sports prediction. Building on the technical context, Section 3 presents a more extensive review of the literature related to the prediction problem in competitive sports, and introduces a joint view of Bradley-Terry and Élő type models. Section 4 introduces the structured log-odds models, which are validated in empirical experiments in Section 5. Our results and possible future directions for research are discussed in Section 6.
This manuscript is based on ZQ's MSc thesis, submitted in September 2016 at University College London and written under the supervision of FK. FK provided the ideas of re-interpretation and possible extensions of the Élő model. The literature overview is jointly due to ZQ and FK, and in parts follows some very helpful pointers by I. Kosmidis (see below). The novel technical ideas in Sections 4.2 to 4.4, and the experiments (set-up and implementation), are mostly due to ZQ.
The present manuscript is a substantial re-working of the thesis manuscript, jointly done by FK and ZQ.
We are thankful to Ioannis Kosmidis for comments on an earlier form of the manuscript, for pointing out some earlier occurrences of ideas presented in it but not given proper credit, as well as relevant literature in the “Bradley-Terry” branch.
2 The Mathematical-Statistical Setting
This section formulates the prediction task in competitive sports and fixes notation, considering the task as an instance of supervised learning with several non-standard structural aspects of relevance.
2.1 Supervised prediction of competitive outcomes
We introduce the mathematical setting for outcome prediction in competitive team sports. As outlined in the introductory Section 1.1, three crucial features need to be taken into account in this setting:
The outcome of a pairing cannot be exactly predicted prior to the game, even with perfect knowledge of all determinants. Hence it is preferable to predict a probabilistic estimate for all possible match outcomes (win/draw/loss) rather than deterministically choosing one of them.
In a pairing, two teams play against each other, one as the home team and the other as the away or guest team. Not all pairs may play against each other, while some may play multiple times. As a mathematically prototypical (though inaccurate) sub-case, one may consider all pairs playing exactly once, which gives the observations an implicit matrix structure (row = home team, column = away team). Outcome labels and features crucially depend on the teams constituting the pairing.
Pairings take place over time, and the expected outcomes are plausibly expected to change with (possibly hidden) characteristics of the teams. Hence we will model the temporal dependence explicitly to be able to take it into account when building and checking predictive strategies.
2.1.1 The Generative Model.
Following the above discussion, we will fix a generative model as follows: as in the standard supervised learning setting, we will consider a generative joint random variable taking values in , where  is the set of features (or covariates, independent variables) for each pairing, while  is the set of labels (or outcome variables, dependent variables).
In our setting, we will consider only the cases and , in which case an observation from is a so-called match outcome, as well as the case , in which case an observation is a so-called final score (in which case, by convention, the first component of  is that of the home team), or the case of score differences, where  (in which case, by convention, a positive number is in favour of the home team). From the official rule set of a game (such as football), the match outcome is uniquely determined by a score or score difference. As all the above sets are discrete, predicting will amount to supervised classification (the score difference problem may be phrased as a regression problem, but we will abstain from doing so for technical reasons that become apparent later).
The random variable and its domain shall include information on the teams playing, as well as on the time of the match.
We will suppose there is a set of teams, and for we will denote by the random variable conditioned on the knowledge that is the home team, and is the away team. Note that information in can include any knowledge on either single team or , but also information corresponding uniquely to the pairing .
We will assume that there are teams, which means that the and may be arranged in matrices each.
Further there will be a set of time points at which matches are observed. For we will denote by or an additional conditioning that the outcome is observed at time point .
Note that the indexing and formally amounts to a double conditioning and could be written as and , where are random variables denoting the home team, the away team, and the time of the pairing. However, we believe that the index/bracket notation is easier to carry through and to follow (including an explicit mirroring of the "matrix structure") than the conditional or "graphical models" type notation, which is our main reason for adopting the former and not the latter.
2.1.2 The Observation Model.
By construction, the generative random variable contains all information on having any pairing playing at any time. However, observations in practice will concern two teams playing at a certain time, hence observations in practice will only include independent samples of for some , and never full observations of , which can be interpreted as a latent variable.
Note that the observations can be, in-principle, correlated (or unconditionally dependent) if the pairing or the time is not made explicit (by conditioning which is implicit in the indices ).
An important aspect of our observation model will be that whenever a value of or is observed, it will always come together with the information of the playing teams and the time at which it was observed. This fact will be implicitly made use of in the description of algorithms and validation methodology. (Formally, this could be achieved by explicitly exhibiting/adding as a Cartesian factor of the sampling domains or , which we will not do for reasons of clarity and readability.)
Two independent batches of data will be observed in the exposition. We will consider:
where and are i.i.d. samples from .
Note that, unfortunately (from a notational perspective), one cannot omit the superscripts as in when defining the samples, since the figurative "dice" should be cast anew for each pairing taking place. In particular, if all games consisted of a single pair of teams playing, with results independent of time, they would all be the same (and not only identically distributed) without the super-index, i.e., without distinguishing different games as different samples from .
2.1.3 The Learning Task.
As set out in the beginning, the main task we will be concerned with is predicting future outcomes given past outcomes and features, observed from the process above. In this work, the features will be assumed to change slowly over time. It is not our primary goal to identify the hidden features in , as they are never observed and hence not accessible as a ground truth against which our models could be validated. However, these will be of secondary interest and considered empirically validated by a well-predicting model.
More precisely, we will describe methodology for learning and validating predictive models of the type
where is the set of (discrete probability) distributions on . That is, given a pairing and a time point at which the teams and play, and information of type , make a probabilistic prediction of the outcome.
Most algorithms we discuss will not use added information in , hence will be of type . Some will disregard the time in . Indeed, the latter algorithms are to be considered scientific baselines above which any algorithm using information in and/or has to improve.
The models above will be learnt on a training set , and validated on an independent test set as defined above. In this scenario, will be a random variable which may implicitly depend on but will be independent of . The learning strategy, which depends on , may take any form and is considered in a full black-box sense. In the exposition, it will in fact take the form of various parametric and non-parametric prediction algorithms.
The goodness of such an will be evaluated by a loss which compares a probabilistic prediction to the true observation. The best will have a small expected generalization loss
at any future time point and for any pairing . Under mild assumptions, we will argue below that this quantity is estimable from and only mildly dependent on .
However, a good form for is not a priori clear. Also, it is unclear under which assumptions is estimable, due to the conditioning on in the training set. These special aspects of the competitive sports prediction setting will be addressed in the subsequent sections.
2.2 Losses for probabilistic classification
In order to evaluate different models, we need a criterion to measure the goodness of probabilistic predictions. The most common error metric used in supervised classification problems is the prediction accuracy. However, accuracy is often insensitive to the probabilistic nature of predictions.
For example, suppose that on a certain test case model A predicts a win probability of 60%, while model B predicts a win probability of 95%. If the actual outcome is not a win, both models are wrong. In terms of prediction accuracy (or any other non-probabilistic metric), they are equally wrong, because each of them made one mistake. However, model A should be considered better than model B, since it assigned a higher probability to the outcome that actually occurred.
Similarly, if a large number of outcomes of a fair coin toss have been observed as training data, a model that predicts 50% for both outcomes on any test data point should be considered more accurate than a model that predicts 100% for either outcome 50% of the time.
There exist two commonly used criteria that take into account the probabilistic nature of predictions, both of which we adopt. The first is the Brier score (Equation 1 below) and the second is the log-loss or log-likelihood loss (Equation 2 below). Both losses compare a distribution to an observation, hence mathematically have the signature of a function . By (very slight) abuse of notation, we will identify a distribution on a (discrete) with its probability mass function; for a distribution , and for , we write for the mass on the observation (= the probability to observe in a random experiment following ).
With this convention, log-loss and Brier loss are defined as follows:
The log-loss and the Brier loss functions have the following properties:
the Brier score is only defined on a with addition/subtraction and a norm defined. This is not necessarily the case in our setting, where it may be that . In the literature, this is often identified with , though the identification is arbitrary, and the Brier score may change depending on which numbers are used.
On the other hand, the log-loss is defined for any and remains unchanged under any renaming or renumbering of a discrete .
For a joint random variable taking values in , it can be shown that the expected losses are minimized by the “correct” prediction .
The two loss functions usually are introduced as empirical losses on a test set , i.e.,
The empirical log-loss is the (negative log-)likelihood of the test predictions.
The empirical Brier loss, usually called the "Brier score", is a straightforward translation of the mean squared error used in regression problems to the classification setting, namely the expected mean squared error of the predicted confidence scores. However, in certain cases the Brier score is hard to interpret and may behave in unintuitive ways , which may partly be seen as a phenomenon caused by the above-mentioned lack of invariance under class re-labelling.
Given this, and the interpretability of the empirical log-loss as a likelihood, we will use the log-loss as the principal evaluation metric in the competitive outcome prediction setting.
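To make the two criteria concrete, the following minimal Python sketch implements both losses under the convention above, with a prediction represented as a probability mass function over a discrete outcome set; the outcome labels and probability values are illustrative, not taken from any fitted model.

```python
import math

def log_loss(p, y):
    """Log-loss: negative log of the probability mass assigned
    to the observed outcome y. p maps outcomes to probabilities."""
    return -math.log(p[y])

def brier_loss(p, y):
    """Brier loss: squared distance between the predicted distribution
    and the one-hot indicator of the observed outcome y."""
    return sum((p[z] - (1.0 if z == y else 0.0)) ** 2 for z in p)

# Model A hedges; model B is confident that the home team wins.
p_a = {"win": 0.60, "draw": 0.25, "loss": 0.15}
p_b = {"win": 0.95, "draw": 0.03, "loss": 0.02}

# If the observed outcome is "win", the confident model B incurs
# the lower loss; if it is "loss", B is penalised far more heavily.
assert log_loss(p_b, "win") < log_loss(p_a, "win")
assert log_loss(p_b, "loss") > log_loss(p_a, "loss")
```

Note that the log-loss is invariant under renaming of the outcome labels, while the Brier loss would change if the outcomes were instead encoded as numbers and the squared error taken on those numbers.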
2.3 Learning with structured and sequential data
The dependency of the observed data on pairing and time makes the prediction task at hand non-standard. We outline the major consequences for learning and model validation, as well as the implicit assumptions which allow us to tackle these. We will do this separately for the pairing and the temporal structure, as these behave slightly differently.
2.3.1 Conditioning on the pairing
Match outcomes are observed for given pairings , that is, each feature-label-pair will be of form , where as above the subscripts denote conditioning on the pairing. Multiple pairings may be observed in the training set, but not all; some pairings may never be observed.
This has consequences for both learning and validating models.
For model learning, one needs to ensure that the pairings to be predicted can be predicted from the pairings observed. In other words, the label in the test set that we want to predict has to be (in a practically substantial way) dependent on the training set . Note that smart models will be able to predict the outcome of a pairing even if it has not been observed before, and even if it has, they will use information from other pairings to improve their predictions.
For various parametric models, "predictability" can be related to the completability of a data matrix with as entries. In Section 4, we will relate Élő type models to low-rank matrix completion algorithms; prediction can then be understood as low-rank completion, hence predictability corresponds to completability. However, working out completability exactly is not the primary aim of this manuscript, and for our data of interest, the English Premier League, all pairings are observed in any given year, so completability is not an issue. Hence we refer to  for a study of low-rank matrix completability. General parametric models may be treated along similar lines.
For model-agnostic model validation, it should hold that the expected generalization loss
can be well-estimated by empirical estimation on the test data. For league-level team sports data sets, this can be achieved by having multiple years of data available: even if not all pairings are observed, the set of pairings which is observed is usually (almost) the same in each year, hence the pairings will be similar in the training and test set if whole years (or half-seasons) are included. Further, we will consider an average over all observed pairings, i.e., we will compute the empirical loss on the training set as
By the above argument, the set of all observed pairings in any given year is plausibly modelled as similar, hence it is plausible to conclude that this empirical loss estimates some expected generalization loss
where (possibly dependent) are random variables that select teams which are paired.
Note that this type of aggregate evaluation does not exclude the possibility that predictions for single teams (e.g., newcomers or after re-structuring) may be inaccurate, but only that the “average” prediction is good. Further, the assumption itself may be violated if the whole league changes between training and test set.
2.3.2 Conditioning on time
As a second complication, match outcome data are gathered through time. The data set might display temporal structure and correlation with time. Again, this has consequences for learning and validating the models.
For model learning, models should be able to intrinsically take into account the temporal structure, though as a baseline, time-agnostic models should be tried. A common approach for statistical models is to assume a temporal structure in the latent variables that determine a team's strength. A different and somewhat ad-hoc approach, proposed by Dixon and Coles , is to assign lower weights to earlier observations and to estimate parameters by maximizing the weighted log-likelihood function. For machine learning models, the temporal structure is often encoded with handcrafted features.
Similarly, one may opt to choose a model that can be updated as time progresses. A common ad-hoc solution is to re-train the model after a certain amount of time (a week, a month, etc.), possibly with temporal discounting, though there is no general consensus about how frequently the retraining should be performed.
Further, there are genuinely updating models, so-called on-line learning models, which update model parameters after each new match outcome is revealed.
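As an illustration of the discounting idea, here is a minimal sketch of exponential down-weighting in the spirit of the Dixon-Coles approach; the decay rate `xi` is a hypothetical tuning parameter, and the per-match log-likelihoods are assumed to come from whichever model is being fitted.

```python
import math

def temporal_weights(match_times, t_now, xi=0.005):
    """Weight each past match by exp(-xi * age): recent matches count
    almost fully, old ones are discounted; xi = 0 recovers the
    ordinary (unweighted) likelihood."""
    return [math.exp(-xi * (t_now - t)) for t in match_times]

def weighted_log_likelihood(log_liks, weights):
    # Each match contributes its log-likelihood scaled by its weight;
    # parameters are then chosen to maximize this weighted sum.
    return sum(w * ll for w, ll in zip(weights, log_liks))
```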
For model evaluation, the sequential nature of the data poses a severe restriction: any two data points were measured at certain time points, and one cannot assume that they are uncorrelated through time. That such correlation exists is quite plausible in the application domain, as a team would be expected to perform more similarly at close time points than at distant time points. Also, we would like to make sure that we fairly test the models for their prediction accuracy, hence the validation experiment needs to mimic the "real world" prediction process, in which the predicted outcomes lie in the temporal future of the training data. Hence the test set, in a validation experiment that should quantify the goodness of such prediction, also needs to lie in the temporal future of the training set.
In particular, the common independence assumption that allows the application of re-sampling strategies such as the K-fold cross-validation method , which guarantees that the expected loss is estimated by the empirical loss, is violated. In the presence of temporal correlation, the variance of the error metric may be underestimated, and the error metric itself will, in general, be mis-estimated. Moreover, the validation method will need to accommodate the fact that the model may be updated on-line during testing. In the literature, model-independent validation strategies for data with temporal structure are a largely unexplored (since technically difficult) area. Nevertheless, developing a reasonable validation method is crucial for scientific model assessment. A plausible validation method is introduced in detail in Section 5.2.2. It follows similar lines as the often-seen "temporal cross-validation", where training/test splits are always temporal, i.e., the training data points are in the temporal past of the test data points, for multiple splits. An earlier occurrence of such a validation strategy may be found in .
This strategy comes without strong estimation guarantees and is in part heuristic; the empirical loss will estimate the generalization loss as long as statistical properties do not change as time shifts forward, for example under stationarity assumptions. While this implicit assumption may be plausible for the English Premier League, it is routinely violated in financial time series, for example.
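The kind of temporal train/test splitting described above can be sketched as follows; the equal-sized contiguous folds are an illustrative simplification of the concrete protocol (which Section 5.2.2 specifies), and observations are assumed to be sorted chronologically.

```python
def temporal_splits(n_obs, n_folds=5):
    """Yield (train, test) index lists such that all training indices
    precede all test indices in time, for several forward-moving splits;
    unlike K-fold cross-validation, the test data are never in the
    temporal past of the training data."""
    fold = n_obs // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield list(range(0, k * fold)), list(range(k * fold, (k + 1) * fold))
```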
3 Approaches to competitive sports prediction
In this section, we give a brief overview of the major approaches to prediction in competitive sports found in the literature. Briefly, these are:
The Bradley-Terry models and extensions.
The Élő model and extensions.
Bayesian models, especially latent variable models and/or graphical models for the outcome and score distribution.
Supervised machine learning type models that use domain features for prediction.
(a) The Bradley-Terry model is the most influential statistical approach to ranking based on competitive outcomes. With its original applications in psychometrics, the goal of the class of Bradley-Terry models is to estimate a hypothesized rank or skill level from observations of pairwise competition outcomes (win/loss). Literature in this branch of research is usually primarily concerned not with prediction but with estimation of a "true" rank or skill, the existence of which is hypothesized, though prediction of (binary) outcome probabilities or odds is well possible within the paradigm. A notable exception is the work of , where the problem is in essence formulated as supervised prediction, similar to our work.
Mathematically, Bradley-Terry models may be seen as log-linear two-factor models that, at the state of the art, are usually estimated by (analytic or semi-analytic) likelihood maximization . Recent work has seen many extensions of the Bradley-Terry models, most notably for modelling ties , for making use of features , or for explicitly modelling the time dependency of skill .
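To make the model class concrete, the following toy sketch implements the basic Bradley-Terry win probability, together with a naive maximum-likelihood fit by gradient ascent on win/loss data; this is an illustrative simplification rather than the (semi-)analytic estimation procedures of the literature, and the learning rate and epoch count are arbitrary choices.

```python
import math

def bt_prob(theta_i, theta_j):
    # Bradley-Terry: P(i beats j) is logistic in the skill difference.
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

def fit_bt(n_players, outcomes, lr=0.1, epochs=500):
    """Naive maximum-likelihood fit of the log-skills theta.
    outcomes: list of (winner_index, loser_index) pairs."""
    theta = [0.0] * n_players
    for _ in range(epochs):
        grad = [0.0] * n_players
        for w, l in outcomes:
            p = bt_prob(theta[w], theta[l])
            grad[w] += 1.0 - p  # winner's skill is pushed up
            grad[l] -= 1.0 - p  # loser's skill is pushed down
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n_players
        theta = [t - mean for t in theta]  # only differences are identified
    return theta
```

Note that the maximum-likelihood estimate only exists if no player is separated from the rest by wins (or losses) alone; practical implementations handle this by regularization or constrained fitting.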
(b) The Élő system is one of the earliest attempts to model competitive sports and, due to its mathematical simplicity, is well known and widely used by practitioners . Historically, the Élő system has been used for chess rankings, to assign a rank score to chess players. Mathematically, the Élő system only uses information about historical match outcomes. It assigns to each team a parameter, the so-called Élő rating. The rating reflects a team's competitive skill: the team with the higher rating is stronger. As such, the Élő system is, originally, not a predictive model or a statistical model in the usual sense. However, the Élő system also gives a probabilistic prediction for the binary match outcome based on the ratings of the two teams.
After what appears to have been a period of parallel development that is still partly ongoing,
it has been recently noted by members of the Bradley-Terry community that the Élő prediction heuristic
is mathematically equivalent to the prediction via the simple Bradley-Terry
model [see 10, section 2.1].
The Élő ratings are learnt via an update rule that is applied whenever a new outcome is observed. This update strategy is inherently algorithmic and was later shown to be closely related to on-line learning strategies for neural networks; to our knowledge, it appears first in Élő’s work and is not found in the Bradley-Terry strain.
(c) The Bayesian paradigm
offers a natural framework to model match outcomes probabilistically, and to obtain probabilistic predictions as the posterior predictive distribution. Bayesian parametric models also allow researchers to inject expert knowledge through the prior distribution. The prediction function is naturally given by the posterior distribution of the scores, which can be updated as more observations become available.
Often, such models explicitly model not only the outcome but also the score distribution,
such as Maher’s model  which models outcome scores
based on independent Poisson random variables with team-specific means.
Dixon and Coles 
extend Maher’s model by introducing a correlation effect between
the two final scores.
More recent models also include dynamic components to model
temporal dependence [20, 50, 11].
Most models of this type only use historical match outcomes as features,
see Constantinou et al.  for an exception.
(d) More recently, the method-agnostic supervised machine learning paradigm has been
applied to prediction of match outcomes [36, 23, 43].
The main rationale in this branch of research is that the best model is not known in advance, hence
a number of off-the-shelf predictors are tried and compared in a benchmarking experiment.
Further, these models can easily make use of features other than previous outcomes.
However, usually, the machine learning models are trained in-batch, i.e., not following a dynamic update or on-line learning strategy,
and they need to be re-trained periodically to incorporate new observations.
In this manuscript, we will re-interpret the Élő model and its update rule as the simplest case of a structured extension of predictive logistic (or generalized linear) regression models, with the update rule arising as the canonical gradient ascent update of its likelihood - hence, in fact, giving it a parametric form not entirely unlike the models mentioned in (c). In the subsequent sections, this will allow us to complement it with the beneficial properties of the machine learning approach (d), most notably the addition of possibly complex features, paired with the Élő update rule, which can be shown to generalize to an on-line update strategy.
A more detailed literature and technical overview is given in the subsequent sections. The Élő model and its extensions, as well as its novel parametric interpretation, are reviewed in Section 3.1. Section 3.2 reviews other parametric models for predicting final scores. Section 3.3 reviews the use of machine learning predictors and feature engineering for sports prediction.
3.1 The Bradley-Terry-Élő models
This section reviews the Bradley-Terry models, the Élő system, and closely related variants.
We give the above-mentioned joint formulation, following the modern rationale of considering as a “model” not only a generative specification, but also algorithms for training, predicting and updating its parameters. As the first seems to originate with the work of , and the second in the on-line update heuristic of , we argue that for giving proper credit, it is probably more appropriate to talk about Bradley-Terry-Élő models (except in the specific hypothesis testing scenario covered in the original work of Bradley and Terry).
Later, we will attempt to understand the Élő system as an on-line update of a structured logistic odds model.
3.1.1 The original formulation of the Élő model
We will first introduce the original version of the Élő model, following . As stated above, its original form, which is still applied for determining the official chess ratings (with minor domain-specific modifications), is neither a statistical model nor a predictive model in the usual sense.
Instead, the original version is centered around the ratings: each team $i$ is assigned a rating $r_i$. These ratings are updated via the Élő update rule, which we explain (for the sake of clarity) for the case of no draws: after observing a match between (home) team $i$ and (away) team $j$, the ratings of teams $i$ and $j$ are updated as
$$r_i \leftarrow r_i + K\left(s_{ij} - \hat p_{ij}\right), \qquad r_j \leftarrow r_j - K\left(s_{ij} - \hat p_{ij}\right),$$
where $K$, often called “the K factor”, is an arbitrarily chosen constant, that is, a model parameter usually set by hand; $s_{ij}$ is $1$ if team/player $i$ has been observed to win, and $0$ otherwise.
Further, $\hat p_{ij}$ is the probability of $i$ winning against $j$, which is predicted from the ratings prior to the update by
$$\hat p_{ij} = \sigma(r_i - r_j),$$
where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function (which has a sigmoid shape, hence is also often called “the sigmoid”). Sometimes a home team parameter $h$ is added to account for home advantage, and the predictive equation becomes
$$\hat p_{ij} = \sigma(r_i - r_j + h).$$
Élő’s update rule (Equation 3) makes sense intuitively because the term $(s_{ij} - \hat p_{ij})$ can be thought of as the discrepancy between what is expected, $\hat p_{ij}$, and what is observed, $s_{ij}$. The update will be larger if the current parameter setting produces a large discrepancy. However, a concise theoretical justification has not been articulated in the literature. In fact, Élő himself commented that “the logic of the equation is evident without algebraic demonstration”  - which may be true in his case, but is satisfactory neither in an applied scientific nor in a theoretical/mathematical sense.
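For concreteness, the update rule above can be sketched in a few lines of code. This is a minimal illustration using the plain logistic function of the text (chess practice adds further scaling constants); the function names and the choice $K = 20$ are ours, not part of any official rating system.

```python
import math

def sigmoid(x):
    """The logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def elo_update(r_home, r_away, outcome, k=20.0, h=0.0):
    """One Élő update: move both ratings by K times the discrepancy
    between the observed outcome (1 = home win, 0 = home loss) and the
    predicted home win probability sigma(r_home - r_away + h)."""
    p_hat = sigmoid(r_home - r_away + h)
    delta = k * (outcome - p_hat)
    return r_home + delta, r_away - delta

# An upset win by the lower-rated home team produces a large update.
r_home, r_away = elo_update(0.0, 1.0, outcome=1.0)
```

Note that the update is zero-sum: the total rating mass is conserved, in line with the invariance of the model under a common shift of all ratings.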
As an initial issue, it has been noted that the whole model is invariant under joint re-scaling of the ratings $r_i$ and the model parameters, as well as under an arbitrary choice of zero for the $r_i$ (i.e., adding a fixed constant to all $r_i$). Hence, fixed domain models will usually choose zero and scale arbitrarily. In chess rankings, for example, the formula includes additional scaling constants of the form ; scale and zero are set through fixing some historical chess players’ ratings, which happens to set the “interesting” range in the positive thousands (a common misunderstanding here is that no Élő ratings below zero may occur; this is, in principle, wrong, though it may be extremely unlikely in practice if the arbitrarily chosen zero is low enough). One can show that there are no further parameter redundancies, hence scaling/zeroing turns out not to be a problem if kept in mind.
However, three issues are left open in this formulation:
How the ratings are determined for players/teams who have never played a game before.
The choice of the constant/parameter $K$, the “K factor”.
If a home parameter $h$ is present, its size.
These issues are usually addressed in everyday practice by (more or less well-justified) heuristics.
The parametric and probabilistic supervised setting in the following sections yields more principled ways to address this: step (i) will become unnecessary by pointing out a batch learning method; the constant $K$ in (ii) will turn out to be the learning rate in a gradient update, hence it can be cross-validated or entirely replaced by a different strategy for learning the model. Parameters such as $h$ in (iii) will be interpretable as logistic regression coefficients.
3.1.2 Bradley-Terry-Élő models
As outlined in the initial discussion, the class of Bradley-Terry models introduced by  may be interpreted as a proper statistical model formulation of the Élő prediction heuristic.
Despite their close mathematical vicinity, it should be noted that classically Bradley-Terry and Élő models are usually applied and interpreted differently, and consequently fitted/learnt differently: while both models estimate a rank or score, the primary (historical) purpose of the Bradley-Terry model is to estimate the rank, while the Élő system is additionally intended to supply easy-to-compute updates as new outcomes are observed, a feature for which it has historically paid by a lack of mathematical rigour. The Élő system is often invoked to predict future outcome probabilities, while the Bradley-Terry models usually do not see predictive use (despite their capability to do so, and the mathematical equivalence of both predictive rules).
However, as mentioned above and as noted for example by , a joint mathematical formulation can be found, and as we will show, the different methods of training the model may be interpreted as variants of likelihood-based batch or on-line strategies.
The parametric formulation is quite similar to logistic regression models, or generalized linear models, in that we will use a link function and define a model for the outcome odds. Recall that the odds for a probability $p$ are $\operatorname{odds}(p) = p/(1-p)$, and the logit function is $\operatorname{logit}(p) = \log\left(p/(1-p)\right)$ (sometimes also called the “log-odds function” for obvious reasons). A straightforward calculation shows that $\operatorname{logit}(\sigma(x)) = x$, or equivalently, $\sigma(\operatorname{logit}(p)) = p$ for any $p \in (0,1)$, i.e., the logistic function is the inverse of the logit (and vice versa by the symmetry theorem for the inverse function).
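This inverse relationship is easy to verify numerically; the following sketch (our own illustration, not part of the model) checks both directions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# logit(sigmoid(x)) = x and sigmoid(logit(p)) = p, up to rounding error
for x in (-3.0, -0.5, 0.0, 2.0):
    assert abs(logit(sigmoid(x)) - x) < 1e-9
for p in (0.1, 0.5, 0.9):
    assert abs(sigmoid(logit(p)) - p) < 1e-9
```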
Hence we can posit the following two equivalent equations in latent parameters $\theta_1, \dots, \theta_n$ as the definition of a predictive model:
$$p_{ij} = \sigma(\theta_i - \theta_j), \qquad \text{equivalently} \qquad \operatorname{logit}(p_{ij}) = \theta_i - \theta_j.$$
That is, $p_{ij}$ in the first equation is interpreted as a predictive probability, i.e., $p_{ij} = P(S_{ij} = 1)$, the probability that team $i$ beats team $j$. The second equation interprets this prediction in terms of a generalized linear model with a response function that is linear in the $\theta_i$. We will write $\theta = (\theta_1, \dots, \theta_n)$
for the vector of latent parameters; hence the second equation could also be written, in vector notation, as $L = \theta \mathbf{1}^\top - \mathbf{1}\theta^\top$, where $L$ is the matrix with entries $L_{ij} = \operatorname{logit}(p_{ij})$ and $\mathbf{1}$ is the vector of ones. Hence, in particular, the matrix $L$ has rank (at most) two.
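The rank-two structure of the log-odds matrix can be checked directly; the sketch below (with illustrative ratings of our choosing) builds the matrix of pairwise rating differences with numpy and confirms its rank and anti-symmetry:

```python
import numpy as np

theta = np.array([0.5, 0.0, -0.5, 1.0])   # illustrative latent ratings
ones = np.ones_like(theta)

# L_ij = theta_i - theta_j, i.e. an outer-product difference of rank two
L = np.outer(theta, ones) - np.outer(ones, theta)
P = 1.0 / (1.0 + np.exp(-L))              # entry-wise logistic: P_ij = sigma(L_ij)

rank = np.linalg.matrix_rank(L)           # two for any non-constant theta
```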
Fitting the above model means estimating its latent variables $\theta_1, \dots, \theta_n$. This may be done by considering the likelihood of the latent parameters given the training data. For a single observed match outcome $s_{ij}$, the log-likelihood of $\theta_i$ and $\theta_j$ is
$$\ell(\theta_i, \theta_j \mid s_{ij}) = s_{ij} \log \hat p_{ij} + (1 - s_{ij}) \log(1 - \hat p_{ij}),$$
where the $\hat p_{ij}$ on the right hand side need to be interpreted as functions of $\theta_i, \theta_j$ (namely, as in equation 6). We call $\ell$ the one-outcome log-likelihood as it is based on a single data point. Similarly, if multiple training outcomes $\mathcal{D}$ are observed, the log-likelihood of the vector $\theta$ is
$$\ell(\theta \mid \mathcal{D}) = \sum_{s_{ij} \in \mathcal{D}} \left[ s_{ij} \log \hat p_{ij} + (1 - s_{ij}) \log(1 - \hat p_{ij}) \right].$$
We will call $\ell(\theta \mid \mathcal{D})$ the batch log-likelihood as the training set contains more than one data point.
The derivative of the one-outcome log-likelihood is
$$\frac{\partial \ell}{\partial \theta_i} = s_{ij} - \hat p_{ij} = -\frac{\partial \ell}{\partial \theta_j},$$
hence the ratings in the Élő update rule (see equation 3) may be interpreted as being updated by gradient ascent, with the K factor as the learning rate, in an on-line likelihood update. We also obtain a batch gradient from the batch log-likelihood:
$$\frac{\partial \ell(\theta \mid \mathcal{D})}{\partial \theta_i} = \frac{1}{2}\, c_i - \sum_{j \in M_i} \left( \hat p_{ij} - \frac{1}{2} \right),$$
where $c_i$ is team $i$’s number of wins minus number of losses observed in $\mathcal{D}$, and $M_i$ is the (multi-)set of (unordered) pairings team $i$ has participated in within $\mathcal{D}$. The batch gradient directly gives rise to a batch gradient update
$$\theta \leftarrow \theta + K \cdot \nabla_\theta\, \ell(\theta \mid \mathcal{D}).$$
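The batch gradient ascent just described can be sketched as follows; the data set, learning rate, and iteration count are illustrative choices of ours, and the mean-centering step fixes the arbitrary zero point discussed earlier:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(matches, n_teams, lr=0.1, n_iter=500):
    """Batch gradient ascent on the Bradley-Terry log-likelihood.
    matches: list of (i, j, s) with s = 1 if team i beat team j, else 0."""
    theta = [0.0] * n_teams
    for _ in range(n_iter):
        grad = [0.0] * n_teams
        for i, j, s in matches:
            p_hat = sigmoid(theta[i] - theta[j])
            grad[i] += s - p_hat   # one-outcome gradient for team i
            grad[j] -= s - p_hat   # ... and its negative for team j
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n_teams
        theta = [t - mean for t in theta]   # fix the arbitrary zero point
    return theta

# Team 0 beats team 1 in two matches out of three; team 1 beats team 2 likewise.
data = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (1, 2, 1), (1, 2, 1), (1, 2, 0)]
theta = fit_bradley_terry(data, 3)
```

Replacing the inner loop by a single pass over incoming matches, with a fixed learning rate, recovers exactly the on-line Élő update.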
Note that the above model highlights several novel, interconnected, and possibly so far unknown (or at least not jointly observed) aspects of Bradley-Terry and Élő type models:
The Élő system can be seen as a learning algorithm for a logistic odds model with latent variables, the Bradley-Terry model (and hence, by extension, as a full fit/predict specification of a certain one-layer neural network).
The Bradley-Terry and Élő model may simultaneously be interpreted as Bernoulli observation models of a rank two matrix.
The gradient of the Bradley-Terry model’s log-likelihood gives rise to a (novel) batch gradient and a single-outcome gradient ascent update. A single iteration per-sample of the latter (with a fixed update constant) is Élő’s original update rule.
These observations give rise to a new family of models: the structured log-odds models that will be discussed in Sections 4 and 4.1, together with concomitant gradient update strategies of batch and on-line type. This joint view also makes extensions straightforward; for example, the “home team parameter” in the common extension of the Élő system may be interpreted as a Bradley-Terry model with an intercept term, with log-odds $\theta_i - \theta_j + h$, that is updated by the one-outcome Élő update rule.
Since, more generally, the structured log-odds models arise from combining the parametric form of the Bradley-Terry model with Élő’s update strategy, we also argue for synonymous use of the term “Bradley-Terry-Élő models” whenever Bradley-Terry models are updated in batch or epoch-wise fashion, or whenever they are, more generally, used in a predictive, supervised, or on-line setting.
3.1.3 Glickman’s Bradley-Terry-Élő model
For the sake of completeness and comparison, we discuss the probabilistic formulation of Glickman . In this fully Bayesian take on the Bradley-Terry-Élő model, it is assumed that there is a latent random variable $Z_i$ associated with team $i$. The latent variables are statistically independent and follow a specific generalized extreme value (GEV) distribution:
$$Z_i \sim \mathrm{GEV}(\theta_i, 1, 0),$$
where the location parameter $\theta_i$ varies across teams, and the other two parameters (scale and shape) are fixed at one and zero. The density function of $Z_i$ is then that of a Gumbel distribution with location $\theta_i$:
$$f(z \mid \theta_i) = \exp\left(-(z - \theta_i)\right) \exp\left(-\exp\left(-(z - \theta_i)\right)\right).$$
The model further assumes that team $i$ wins over team $j$ in a match if and only if a random sample $(z_i, z_j)$ from the associated latent variables satisfies $z_i > z_j$. It can be shown that the difference variables $Z_i - Z_j$ then happen to follow a logistic distribution with mean $\theta_i - \theta_j$ and scale parameter 1, see .
Hence, the (predictive) winning probability for team $i$ is eventually given by Élő’s original equation 4, which is equivalent to the Bradley-Terry odds. In fact, the arguably strange parametric form for the distribution of the $Z_i$ gives the impression of having been chosen for this particular, singular reason.
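The claim that pairwise comparisons of Gumbel samples reproduce the logistic win probability can be checked by simulation; in the following Monte Carlo sketch (sample size, seed, and ratings are arbitrary choices of ours), the empirical win frequency is compared against $\sigma(\theta_i - \theta_j)$:

```python
import math
import random

random.seed(0)

def sample_gumbel(mu):
    """Inverse-CDF sample from a standard Gumbel with location mu."""
    u = random.random()
    return mu - math.log(-math.log(u))

theta_i, theta_j = 1.0, 0.0
n = 200_000
wins = sum(sample_gumbel(theta_i) > sample_gumbel(theta_j) for _ in range(n))

empirical = wins / n
predicted = 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))  # sigma(theta_i - theta_j)
```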
We argue that Glickman’s model makes unnecessary assumptions through the latent random variables $Z_i$, which furthermore carry an unnatural distribution. This is certainly true in the frequentist interpretation, as the parametric model in Section 3.1.2 is not only more parsimonious, since it does not assume a generative process behind the outcomes via latent samples, but it also avoids assuming random variables that are never directly observed (such as the $Z_i$). This is also true in the Bayesian interpretation, where a prior is assumed on the $\theta_i$ which then indirectly gives rise to the outcome via the $Z_i$.
Hence, one may argue by Occam’s razor that modelling the $Z_i$ is unnecessary and, as we believe, may put obstacles on the path to the existing and novel extensions in Section 4 that would otherwise appear natural.
3.1.4 Limitations of the Bradley-Terry-Élő model and existing remedies
We point out some limitations of the original Bradley-Terry and Élő models which we attempt to address in Section 4.
The original Bradley-Terry and Élő models do not model the possibility of a draw. This might be reasonable in official chess tournaments, where players play on until draws are resolved. However, in many competitive sports a significant number of matches end in a draw - in the English Premier League, for example, about twenty percent of the matches. Modelling the possibility of a draw outcome is therefore very relevant. One of the first extensions of the Bradley-Terry model, the ternary outcome model by Rao and Kupper , was suggested to address exactly this shortcoming. The strategy for modelling draws in the joint framework, closely following this work, is outlined in Section 4.2.2.
Using final scores in the model
The Bradley-Terry-Élő model only takes into account the binary outcome of the match. In sports such as football, the final scores of both teams may contain more information. Generalizations exist to tackle this problem. One approach is adopted by the official FIFA Women’s football ranking , where the actual outcome of the match is replaced by the “Actual Match Percentage”, a quantity that depends on the final scores. FiveThirtyEight, an online media outlet, proposed another approach : it introduces the “Margin of Victory Multiplier” in the rating system to adjust the K factor for different final scores.
In a survey paper, Lasek et al.  presented empirical evidence that rating methods which take into account the final scores often outperform those that do not. However, it is worth noting that the existing methods often rely on heuristics, and their mathematical justifications are often unpublished or unknown. In Section 4.2.3, we describe a principled way to incorporate final scores into the framework, following ideas of Dixon and Coles .
Using additional features
The Bradley-Terry-Élő model only takes into account very limited information. Apart from previous match outcomes, the only feature it uses is the identity of the home and away teams. There are many other potentially useful features, for example whether the team has recently been promoted from a lower-division league, or whether a key player is absent from the match. These features may help make better predictions if they are properly modelled. In Section 4.2.1, we extend the Bradley-Terry-Élő model to a logistic odds model that can also make use of features, along lines similar to the feature-dependent models of Firth and Turner .
3.2 Domain-specific parametric models
We review a number of parametric and Bayesian models that have been considered in the literature to model competitive sports outcomes. A predominant property of this branch of modelling is that the final scores are modelled explicitly.
3.2.1 Bivariate Poisson regression and extensions
Maher  proposed to model the final scores as independent Poisson random variables. If team $i$ is playing on its home field against team $j$, then the final scores $X_{ij}$ (home) and $Y_{ij}$ (away) follow
$$X_{ij} \sim \mathrm{Poisson}(\alpha_i \beta_j \gamma), \qquad Y_{ij} \sim \mathrm{Poisson}(\alpha_j \beta_i),$$
where the $\alpha$’s measure the ‘attack’ rates and the $\beta$’s measure the ‘defense’ rates of the respective teams. The parameter $\gamma$
is an adjustment term for home advantage. The model further assumes that all historical match outcomes are independent. The parameters are estimated by maximizing the log-likelihood function of all historical data. Empirical evidence suggests that the Poisson distribution fits the data well. Moreover, the Poisson distribution arises as the distribution of the number of events occurring at a constant risk during a fixed time period, an interpretation which fits the setting of competitive team sports.
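Under the independence assumption, win/draw/loss probabilities follow by summing the joint Poisson probabilities over the score grid. A small sketch (the rates are illustrative, and the infinite sum is truncated at 25 goals per team, which is numerically negligible for football-sized rates):

```python
import math

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=25):
    """Home win / draw / away win probabilities for independent
    Poisson scores, truncating the sums at max_goals per team."""
    p_win = p_draw = p_loss = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(lam_home, x) * poisson_pmf(lam_away, y)
            if x > y:
                p_win += p
            elif x == y:
                p_draw += p
            else:
                p_loss += p
    return p_win, p_draw, p_loss

w, d, l = outcome_probs(1.6, 1.1)   # e.g. lam_home = alpha_i * beta_j * gamma
```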
Dixon and Coles  proposed two modifications to Maher’s model. First, the final scores $X_{ij}$ and $Y_{ij}$ are allowed to be correlated when they are both less than two. The model employs a free parameter $\rho$ to capture this effect. The joint probability function of $(X_{ij}, Y_{ij})$ is given by the bivariate Poisson distribution (equation 7):
$$P(X_{ij} = x,\, Y_{ij} = y) = \tau_{\lambda,\mu}(x, y)\, \frac{\lambda^x e^{-\lambda}}{x!}\, \frac{\mu^y e^{-\mu}}{y!},$$
where $\lambda$ and $\mu$ are the two Poisson means.
The function $\tau_{\lambda,\mu}$ adjusts the probability function so that a draw becomes less likely when both scores are low. The second modification is that the Dixon-Coles model no longer assumes match outcomes to be independent through time. The modified log-likelihood function of all historical data is represented as a weighted sum of the log-likelihoods of individual matches, illustrated in equation 8, where $t$ represents the time index. The weights are heuristically chosen to decay exponentially through time in order to emphasize more recent matches.
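A sketch of the two modifications follows. The adjustment function tau is given here in the form commonly attributed to Dixon and Coles (the identity whenever either score is at least two); the exponential time weight and all parameter values are illustrative:

```python
import math

def tau(x, y, lam, mu, rho):
    """Low-score adjustment factor; equal to one unless both scores are below two."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dixon_coles_pmf(x, y, lam, mu, rho):
    """Joint score probability: tau times the product of Poisson terms."""
    return tau(x, y, lam, mu, rho) * poisson_pmf(lam, x) * poisson_pmf(mu, y)

def time_weight(t_now, t_match, xi):
    """Exponential down-weighting of older matches in the log-likelihood."""
    return math.exp(-xi * (t_now - t_match))
```

With $\rho = 0$ the adjustment vanishes and Maher’s independent Poisson model is recovered.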
The parameter estimation procedure is the same as in Maher’s model: estimates are obtained from batch optimization of the modified log-likelihood.
Karlis and Ntzoufras  explored several other possible parametrizations of the bivariate Poisson distribution, including those proposed by Kocherlakota and Kocherlakota , and Johnson et al. . The authors performed a model comparison between Maher’s independent Poisson model and various bivariate Poisson models based on AIC and BIC. However, the comparison did not include the Dixon-Coles model. Goddard  performed a more comprehensive model comparison based on forecasting performance.
3.2.2 Bayesian latent variable models
Rue and Salvesen  proposed a Bayesian parametric model based on the bivariate Poisson model. In addition to the paradigm change, there are three major modifications to the parameterization. First of all, the distributions for the scores are truncated: scores greater than four are treated as the same category. The authors argued that the truncation mitigates the effect of extreme cases in which one team scores many goals. Secondly, the final scores $X_{ij}$ and $Y_{ij}$ are assumed to be drawn from a mixture model: one component is the truncated version of the Dixon-Coles model, and the other component is a truncated bivariate Poisson distribution (7) with attack and defense parameters equal to the average values across all teams. Thus, the mixture model encourages a reversion to the mean. Lastly, the attack and defense parameters
for each team change over time following a Brownian motion; the temporal dependence between match outcomes is reflected by the change in parameters. This model does not have an analytical posterior for its parameters, and the Bayesian inference procedure is carried out via Markov chain Monte Carlo.
Crowder et al.  proposed another Bayesian formulation of the bivariate Poisson model based on the Dixon-Coles model. The parametric form remains unchanged, but the attack and defense parameters change over time following an AR(1) process. Again, the model does not have an analytical posterior; the authors proposed a fast variational inference procedure to conduct the inference.
Baio and Blangiardo  proposed a further extension to the bivariate Poisson model of Karlis and Ntzoufras . The authors noted that the correlation between the final scores is parametrized explicitly in previous models, which seems unnecessary in the Bayesian setting. In their proposed model, both scores are conditionally independent given an unobserved latent variable. This hierarchical structure naturally encodes the marginal dependence between the scores.
3.3 Feature-based machine learning predictors
In recent publications, researchers reported that machine learning models achieved good prediction results for the outcomes of competitive team sports. The strengths of the machine learning approach lie in the model-agnostic and data-centric modelling using available off-the-shelf methodology, as well as the ability to incorporate features in model building.
In this branch of research, the prediction problem is usually studied as a supervised classification problem, either binary (home team win/lose, or win/other), or ternary, i.e., where the outcome of a match falls into three distinct classes: home team win, draw, and home team loss.
Liu and Lai 
applied logistic regression, support vector machines with different kernels, and AdaBoost to predict NCAA football outcomes. For this prediction problem, the researchers hand-crafted 210 features.
Hucaljuk and Rakipović 
explored more machine learning predictors in the context of sports prediction: naïve Bayes classifiers, Bayes networks, LogitBoost, k-nearest neighbours, random forests, and artificial neural networks. The models are trained on 20 features derived from previous match outcomes and 10 features designed subjectively by experts (such as a team’s morale).
Odachowski and Grekow 
conducted a similar study. The predictors are commercial implementations of various decision tree and ensemble tree algorithms as well as a hand-crafted Bayes network. The models are trained on a subset of 320 features derived from the time series of betting odds. In fact, this is the only study so far in which the predictors have no access to previous match outcomes.
Kampakis and Adamides  explored the possibility of predicting match outcomes from Tweets. The authors applied naïve Bayes classifiers, random forests, logistic regression, and support vector machines to a feature set composed of 12 match outcome features and a number of Tweet features derived from unigrams and bigrams of the Tweets.
3.4 Evaluation methods used in previous studies
In all studies mentioned in this section, the authors validated their new model on a real data set and showed that the new model performs better than an existing model. However, complications arise when we would like to aggregate and compare the findings made in different papers. Different studies may employ different validation settings, different evaluation metrics, and different data sets. We report on this with a focus on the following, methodologically crucial aspects:
Studies may or may not include a well-chosen benchmark for comparison. If this is not done, then it may not be concluded that the new method outperforms the state of the art, or a random guess.
Variable selection or hyper-parameter tuning procedures may or may not be described explicitly. This may raise doubts about the validity of conclusions, as “hand-tuning” parameters is implicit overfitting, and may lead to underestimating the generalization error in validation.
In Table 1, we summarize the benchmark evaluation methodology used in previous studies. One may remark that the sizes of the testing data sets vary considerably across studies, and most studies do not provide a quantitative assessment of the evaluation metric. We also note that some studies perform the evaluation on the training data (i.e., in-sample). Without further argument, these evaluation results only show the goodness-of-fit of the model on the training data, as they do not provide a reliable estimate of the expected predictive performance (on unseen data).
|Lasek et al. ||On-line||Yes||Binary||Brier score, Binomial divergence||Yes||Yes||NA||979|
|Dixon and Coles ||No||No||Scores||Non-standard||No||No||NA||NA|
|Karlis and Ntzoufras ||In-sample||Bayes||Scores||AIC, BIC||No||No||615||NA|
|Rue and Salvesen ||Custom||Bayes||Scores||log-loss||Yes||No||280||280|
|Crowder et al. ||On-line||Bayes||Ternary||Accuracy||No||No||1680||1680|
|Baio and Blangiardo ||Hold-out||Bayes||Scores||Not reported||No||No||4590||306|
|Liu and Lai ||Hold-out||No||Binary||Accuracy||Yes||No||480||240|
|Hucaljuk and Rakipović ||Custom||Yes||Binary||Accuracy, F1||Yes||No||96||96|
|Odachowski and Grekow ||10-fold CV||No||Ternary||Accuracy||Yes||No||1116||1116|
|Kampakis and Adamides ||LOO-CV||No||Binary||Accuracy, Cohen’s kappa||No||Yes||NR||NR|
4 Extending the Bradley-Terry-Élő model
In this section, we propose a new family of models for the outcome of competitive team sports, the structured log-odds models. We will show that both Bradley-Terry and Élő models belong to this family (section 4.1), as well as logistic regression. We then propose several new models with added flexibility (section 4.2) and introduce various training algorithms (section 4.3 and 4.4).
4.1 The structured log-odds model
Recall our principal observations obtained from the joint discussion of Bradley-Terry and Élő models in Section 3.1.2:
The Élő system can be seen as a learning algorithm for a logistic odds model with latent variables, the Bradley-Terry model (and hence, by extension, as a full fit/predict specification of a certain one-layer neural network).
The Bradley-Terry and Élő model may simultaneously be interpreted as Bernoulli observation models of a rank two matrix.
The gradient of the Bradley-Terry model’s log-likelihood gives rise to a (novel) batch gradient and a single-outcome gradient ascent update. A single iteration per-sample of the latter (with a fixed update constant) is Élő’s original update rule.
We collate these observations in a mathematical model, and highlight relations to well-known model classes, including the Bradley-Terry-Élő model, logistic regression, and neural networks.
4.1.1 Statistical definition of structured log-odds models
In the definition below, we separate added assumptions and notations for the general set-up, given in the paragraph “Set-up and notation”, from model-specific assumptions, given in the paragraph “model definition”. Model-specific assumptions, as usual, need not hold for the “true” generative process, and the mismatch of the assumed model structure to the true generative process may be (and should be) quantified in a benchmark experiment.
Set-up and notation.
We keep the notation of Section 2; for the time being, we assume that there is no dependence on time, i.e., the observations follow a generative joint random variable $(X_{ij}, S_{ij})$. The variable $S_{ij}$ models the outcomes of a pairing where home team $i$ plays against away team $j$. We will further assume that the outcomes are binary, home team win/lose = 1/0, i.e., $S_{ij}$ takes values in $\{0, 1\}$. The variable $X_{ij}$ models features relevant to the pairing. From it, we may single out features that pertain to a single team $i$, as a variable $X_i$. Without loss of generality (for example, through the introduction of indicator variables), we will assume that the feature variables take values in real vector spaces, and we will write upper indices for their components.
The two restrictive assumptions (independence of time, binary outcome) are temporary and are made for expository reasons. We will discuss in subsequent sections how these assumptions may be removed.
We have noted that the double sub-index notation easily allows us to consider the $p_{ij}$ in matrix form. We will denote by $P$ the (real) matrix with entry $p_{ij}$ in the $i$-th row and $j$-th column. Similarly, we will denote by $L$ the matrix with entries $L_{ij} = \operatorname{logit}(p_{ij})$. We do not fix a particular ordering of the entries in $P$, as the numbering of teams does not matter; however, the indexing needs to be consistent across $P$, $L$, and any matrix of this format that we may define later.
A crucial observation is that the entries of the matrix $P$ can be plausibly expected not to be arbitrary. For example, if team $i$ is a strong team, we should expect $p_{ij}$ to be larger for all $j$. We can make a similar argument if we know that team $i$ is a weak team. This means that the entries of the matrix $P$ are not completely independent from each other (in an algebraic sense); in other words, the matrix can be plausibly assumed to have an inherent structure.
Hence, prediction of the outcomes should be more accurate if the correct structural assumption is made on $P$, which will be one of the cornerstones of the structured log-odds models.
For mathematical convenience (and for reasons of scientific parsimony which we will discuss), we will not directly endow the matrix $P$ with structure, but rather the matrix $L = \operatorname{logit}(P)$, where, as usual and as in the following, univariate functions are applied entry-wise (e.g., $P = \sigma(L)$ is also a valid statement and equivalent to the above).
We are now ready to introduce the structured log-odds models for competitive team sports. As the name says, the main assumption of the model is that the log-odds matrix $L$ is a structured matrix, alongside the other assumptions of the Bradley-Terry-Élő model in Section 3.1.2.
More explicitly, all assumptions of the structured log-odds model may be written as
$$S_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad L = \operatorname{logit}(P) \;\text{ satisfies certain structural assumptions,}$$
where we have not yet made the structural assumptions on $L$ explicit. The matrix $L$ may depend on the features $X_{ij}$, though a sensible model may already be obtained from a constant matrix with restricted structure. We will show that the Bradley-Terry and Élő models are of this subtype.
Structural assumptions for the log-odds.
We list a few structural assumptions that may or may not be present in some form, and which will be key to understanding important cases of the structured log-odds models. These may be applied to $L$ as a constant matrix to obtain the simplest class of log-odds models, such as the Bradley-Terry-Élő model, as we will explain in the subsequent section.
Low-rankness. A common structural restriction for a matrix (and arguably the most scientifically or mathematically parsimonious one) is the assumption of low rank: namely, that the rank of the matrix of relevance is less than or equal to a specified value $r$. Typically,
$r$ is far less than either size of the matrix, which heavily restricts the number of (model/algebraic) degrees of freedom in an $n \times m$ matrix from $nm$ to $r(n + m - r)$. The low-rank assumption essentially reflects a belief that the unknown matrix is determined by only a small number of factors, corresponding to a small number of prototypical rows/columns, with the small number being equal to
$r$. By the singular value decomposition theorem, any rank $r$ matrix $A \in \mathbb{R}^{n \times m}$ may be written as
$$A = \sum_{k=1}^{r} \lambda_k\, u_k v_k^\top$$
for some $\lambda_1, \dots, \lambda_r \in \mathbb{R}$, pairwise orthogonal $u_1, \dots, u_r \in \mathbb{R}^n$, and pairwise orthogonal $v_1, \dots, v_r \in \mathbb{R}^m$;
equivalently, in matrix notation, $A = U \Lambda V^\top$, where $\Lambda \in \mathbb{R}^{r \times r}$ is diagonal, $U^\top U = I$ and $V^\top V = I$ (and where $\lambda_k = \Lambda_{kk}$, and $u_k$, $v_k$ are the columns of $U$ and $V$).
Anti-symmetry. A further structural assumption is symmetry or anti-symmetry of a matrix. Anti-symmetry arises naturally in competitive outcome prediction as follows: if all matches were played on neutral fields (or if home advantage is modelled separately), one should expect that $p_{ij} = 1 - p_{ji}$, which means the probability of team $i$ beating team $j$ is the same regardless of where the match is played (i.e., which one is the home team). Hence,
$$L_{ij} = \operatorname{logit}(p_{ij}) = -\operatorname{logit}(p_{ji}) = -L_{ji},$$
that is, $L$ is an anti-symmetric matrix, i.e., $L^\top = -L$.
Anti-symmetry and low-rankness. It is known that any real anti-symmetric matrix always has even rank $2r$. That is, if a matrix is assumed to be low-rank and anti-symmetric simultaneously, it will have rank $0$ or $2$ or $4$ etc. In particular, the simplest (non-trivial) anti-symmetric low-rank matrices have rank $2$. One can also show that any real anti-symmetric matrix $M$ with rank $2r$ can be decomposed as
$M = \sum_{k=1}^{r} \lambda_k \left( u_k v_k^\top - v_k u_k^\top \right)$
for some $\lambda_1,\dots,\lambda_r \in \mathbb{R}^+$, pairwise orthogonal $u_1,\dots,u_r \in \mathbb{R}^n$, pairwise orthogonal $v_1,\dots,v_r \in \mathbb{R}^n$;
equivalently, in matrix notation,
$M = U^\top \Lambda V - V^\top \Lambda U,$
where $\Lambda \in \mathbb{R}^{r\times r}$ is diagonal, $U, V \in \mathbb{R}^{r\times n}$ (and where $\lambda_k = \Lambda_{kk}$, and $u_k^\top, v_k^\top$ are the rows of $U, V$).
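The even-rank property and the pairwise decomposition can likewise be sketched numerically (sizes and factors hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 7, 2  # r factor pairs -> anti-symmetric matrix of rank 2r

u = rng.standard_normal((n, r))
v = rng.standard_normal((n, r))

# Anti-symmetric matrix built from r factor pairs: M = sum_k (u_k v_k^T - v_k u_k^T)
M = u @ v.T - v @ u.T
assert np.allclose(M, -M.T)  # anti-symmetry

# Its rank is even: generically exactly 2r
rank = np.linalg.matrix_rank(M)
assert rank == 2 * r
assert rank % 2 == 0
```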
Separation. In the above, in general, the factors give rise to interaction constants (namely, the summands $\lambda_k u_{ik} v_{jk}$ of the decompositions above) that are specific to the pairing. To obtain interaction constants that depend on only one of the teams, one may additionally assume that one of the factors is constant, or, without loss of generality, a vector of ones. Similarly, a matrix with constant entries corresponds to an effect independent of the pairing.
Learning/fitting of structured log-odds models
will be discussed in Section 4.3, after we have established a number of important sub-cases and the full formulation of the model.
In a brief preview summary, it will be shown that the log-likelihood function has in essence the same form for all structured log-odds models. Namely, for any parameter $\theta$ on which $L$ may depend, the (one-outcome) log-likelihood satisfies
$\ell(\theta;\, y_{ij}) = y_{ij}\, L_{ij} - \log\left(1 + \exp L_{ij}\right).$
Similarly, for its derivative one obtains
$\frac{\partial \ell}{\partial \theta} = \left( y_{ij} - p_{ij} \right) \frac{\partial L_{ij}}{\partial \theta},$
where the partial derivatives on the right-hand side will have a different form under different structural assumptions, while the general form of the formula above is the same for any such assumption.
Section 4.3 will expand on this for the full model class.
4.1.2 Important special cases
We highlight a few important special types of structured log-odds models that we have already seen, or that are prototypical
for our subsequent discussion:
The Bradley-Terry model, and via identification the Élő system, are obtained under the structural assumption that $L$ is anti-symmetric and of rank 2, with one factor being a vector of ones.
Namely, recalling equation 6, we recognize that the log-odds matrix in the Bradley-Terry model is given by $L_{ij} = \theta_i - \theta_j$, where $\theta_i$ and $\theta_j$ are the Élő ratings. Using the rule of matrix multiplication, one can verify that this is equivalent to
$L = \theta\, \mathbb{1}^\top - \mathbb{1}\, \theta^\top,$
where $\mathbb{1}$ is a vector of ones and $\theta$ is the vector
of Élő ratings. For general $\theta$, the
log-odds matrix will have rank two (general = except if $\theta_i = \theta_j$ for all $i, j$).
By the exposition above, making the three assumptions (anti-symmetry, rank two, one factor of ones) is equivalent to positing the Bradley-Terry or Élő model.
Two interesting observations may be made:
First, the ones-vector being a factor entails that the winning chance
depends only on the difference of the team-specific ratings, $\theta_i - \theta_j$, without any further interaction term.
Second, the entry-wise exponential of $L$, with entries $\exp(L_{ij}) = \exp(\theta_i)\exp(-\theta_j)$, is a matrix of rank (at most) one.
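Both observations are easy to verify numerically; the following illustrative sketch uses hypothetical Élő-style ratings:

```python
import numpy as np

theta = np.array([1.5, 0.3, -0.2, -1.6])  # hypothetical Élő-style ratings
ones = np.ones_like(theta)

# Bradley-Terry log-odds matrix: L_ij = theta_i - theta_j
L = np.outer(theta, ones) - np.outer(ones, theta)

assert np.allclose(L, -L.T)              # anti-symmetric
assert np.linalg.matrix_rank(L) == 2     # rank two (ratings not all equal)

# Entry-wise exponential has rank one: exp(L)_ij = e^{theta_i} * e^{-theta_j}
assert np.linalg.matrix_rank(np.exp(L)) == 1

# Winning probability depends only on the rating difference
p = 1.0 / (1.0 + np.exp(-L))
assert np.allclose(p + p.T, np.ones_like(p))  # p_ij = 1 - p_ji
```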
The popular Élő model with home advantage is obtained from the Bradley-Terry-Élő model under the structural assumption that $L$ is a sum of a low-rank matrix and a constant; equivalently, from an assumption of rank 3 which is further restricted by fixing some factors to each other or to vectors of ones.
More precisely, from equation 5, one can recognize that for the Élő model with home advantage, the log-odds matrix decomposes as
$L = \theta\, \mathbb{1}^\top - \mathbb{1}\, \theta^\top + h\, \mathbb{1}\mathbb{1}^\top.$
Note that the log-odds matrix is no longer anti-symmetric due to the constant term
with home advantage parameter $h$ that is (algebraically) independent of the playing teams.
Also note that the anti-symmetric part, i.e., $\tfrac{1}{2}(L - L^\top) = \theta\,\mathbb{1}^\top - \mathbb{1}\,\theta^\top$,
is equivalent to the constant-free Élő model's log-odds, while the symmetric
part, i.e., $\tfrac{1}{2}(L + L^\top) = h\, \mathbb{1}\mathbb{1}^\top$, is exactly the new constant home advantage term.
More factors: full two-factor Bradley-Terry-Élő models may be obtained by dropping the separation assumption from either Bradley-Terry-Élő model, i.e., keeping the assumption of anti-symmetric rank two, but allowing an arbitrary second factor not necessarily equal to the vector of ones. The teams' competitive strength is then determined by two interacting factors $u$, $v$, as
$L = u v^\top - v u^\top, \qquad \text{i.e.,} \quad L_{ij} = u_i v_j - u_j v_i.$
Intuitively, this may cover, for example, a situation where the benefit from being much better may be
smaller (or larger) than that from being a little better, akin to a discounting of extremes.
If the full two-factor model predicts better than the Bradley-Terry-Élő model, it may be evidence for
different interaction strengths in different ranges of the Élő scores.
A home advantage factor (a constant) may or may not be added, yielding a model of total rank 3.
Raising the rank: higher-rank Bradley-Terry-Élő models may be obtained by relaxing the assumption of rank 2 (or 3) to higher rank. We will consider the next more expressive model, of rank four. The rank-four Bradley-Terry-Élő model which we will consider adds a full anti-symmetric rank-two summand to the log-odds matrix, which is hence assumed to have the following structure:
$L = \theta\, \mathbb{1}^\top - \mathbb{1}\, \theta^\top + u v^\top - v u^\top.$
The teams' competitive strength is captured by three factors $\theta$, $u$ and $v$; note that we have kept the vector of ones as a factor. Also note that setting either of $u, v$ to $\mathbb{1}$ would not result in a model extension, as the resulting matrix would still have rank two. The rank-four model may intuitively make sense if there are (at least) two distinguishable qualities determining the outcome, for example physical fitness of the team and strategic competence. Whether there is evidence for the existence of more than one factor, as opposed to assuming just a single one (as a single summary quantifier for good vs bad), may be checked by comparing the predictive capabilities of the respective models. Again, a home advantage factor may be added, yielding a log-odds matrix of total rank 5.
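A short numerical sketch of the rank-four structure (hypothetical factors), including the collapse back to rank two when a factor is replaced by the ones-vector:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
ones = np.ones(n)
theta = rng.standard_normal(n)  # first quality, e.g. overall strength
u = rng.standard_normal(n)      # second quality, e.g. fitness
v = rng.standard_normal(n)      # interaction partner of u

# Rank-four log-odds: L = theta 1^T - 1 theta^T + u v^T - v u^T
L = (np.outer(theta, ones) - np.outer(ones, theta)
     + np.outer(u, v) - np.outer(v, u))

assert np.allclose(L, -L.T)               # still anti-symmetric
assert np.linalg.matrix_rank(L) == 4      # generically rank four

# Replacing v by the ones-vector collapses the extension back to rank two,
# since L then equals (theta + u) 1^T - 1 (theta + u)^T
L2 = (np.outer(theta, ones) - np.outer(ones, theta)
      + np.outer(u, ones) - np.outer(ones, u))
assert np.linalg.matrix_rank(L2) == 2
```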
We would like to note that a mathematically equivalent model, as well as models with more factors, have already been considered by Stanescu,
though without making explicit the connection to matrices which are of low rank, anti-symmetric, or structured in any other way.
Logistic regression may also be obtained as a special case of structured log-odds models. In the simplest form of logistic regression, the log-odds matrix is a linear functional in the features. Recall that in the case of competitive outcome prediction, we consider pairing features $x_{ij}$ taking values in $\mathbb{R}^p$, and team features $x_i$ taking values in $\mathbb{R}^q$. We may model the log-odds matrix as a linear functional in these, i.e., model that
$L_{ij} = \langle \beta_{ij}, x_{ij}\rangle + \langle u_{ij}, x_i\rangle + \langle v_{ij}, x_j\rangle,$
where $\beta_{ij} \in \mathbb{R}^p$ and $u_{ij}, v_{ij} \in \mathbb{R}^q$. If the coefficients are constant across pairings, we obtain a simple two-factor logistic regression model. In the case that there are only two teams playing only with each other, or (the mathematical correlate of) a single team playing only with itself, the standard logistic regression model is recovered.
Conversely, a way to obtain the Bradley-Terry model as a special case of classical logistic regression is as follows: consider the indicator feature $x_{ij} := e_i - e_j \in \mathbb{R}^n$, where $e_i$ denotes the $i$-th standard basis vector. With a coefficient vector $\beta \in \mathbb{R}^n$, the logistic odds will be $\langle \beta, x_{ij}\rangle = \beta_i - \beta_j$. In this case, the coefficient vector $\beta$ corresponds to a vector of Élő ratings.
Note that in the above formulation, the coefficient vectors are explicitly allowed to depend on the teams. If we further allow a constant term depending on both teams, the model includes the Bradley-Terry-Élő models above as well; we could also make the feature coefficients depend on both teams. However, allowing the coefficients to vary in full generality is not very sensible, and as for the constant term, which may yield the Élő model under specific structural assumptions, we need to endow all model parameters with structural assumptions to prevent a combinatorial explosion of parameters and overfitting.
These subtleties in incorporating features, and more generally how to combine features with hidden factors will be discussed in the separate, subsequent Section 4.2.1.
4.1.3 Connection to existing model classes
Close connections to three important classes of models become apparent through the discussion in the previous sections:
Generalized Linear Models generalize both linear and log-linear models (such as the Bradley-Terry model) through so-called link functions, or more generally (and less classically) link distributions, combined with flexible structural assumptions on the target variable. The generalization aims at extending prediction with linear functionals through the choice of link which is most suitable for the target [for an overview, see 39].
Particularly relevant for us are generalized linear models for ordinal outcomes, which include the
ternary (win/draw/lose) case, as well as link distributions for scores. Some existing extensions of this type,
such as the ternary outcome model of  and the score model of ,
may be interpreted as specific choices of suitable linking distributions.
How these ideas may be used as a component of structured log-odds models will be discussed in Section 4.2.
Neural networks (vulgo “deep learning”) may be seen as a generalization of logistic regression, which is mathematically equivalent to a single-layer network with softmax activation function. The generalization is achieved through functional nesting, which allows for non-linear prediction functionals and greatly expands the capability of regression models to handle non-linear feature-target relations [for an overview, see 51].
A family of ideas which immediately transfers to our setting are strategies for training and model fitting. In particular, on-line update strategies, as well as training in batches and epochs, yield a natural and principled way to learn Bradley-Terry-Élő and log-odds models in an on-line setting, or to potentially improve their predictive power in a static supervised learning setting. A selection of such training strategies for structured log-odds models will be explored in Section 4.3. This will not include variants of stochastic gradient descent, which we leave to future investigations.
It is also beyond the scope of this manuscript to explore the implications of using multiple layers
in a competitive outcome setting, though, given the closeness of the model classes,
this seems a natural idea that might be worth exploring in further research.
Low-rank Matrix Completion is the supervised task of filling in some missing entries of a low-rank matrix, given others and the information that the rank is small. Many machine learning applications can be viewed as estimation or completion of a low-rank matrix, and different solution strategies exist [4, 6, 42, 32, 40, 55, 63, 33].
The feature-free variant of structured log-odds models (see Section 4.1.1) may be regarded as a low-rank matrix completion problem: from observations of outcomes $Y_{ij}$ for pairings $(i,j) \in E$, where the set $E$ of observed pairings may be considered as the set of observed positions, estimate the underlying low-rank matrix $L$, or predict $L_{ij}$ for some pairing $(i,j)$ which is possibly not contained in $E$.
One popular low-rank matrix completion strategy for estimating model parameters or completing missing entries replaces the discrete rank constraint by a continuous spectral surrogate, penalizing not the rank but the nuclear norm (= trace norm = 1-Schatten norm) of the matrix modelled to have low rank [an early occurrence of this idea may be found in 57]. The advantage of this strategy is that no particular rank needs to be assumed a priori; instead, the objective implicitly selects a low rank through a trade-off with model fit. This strategy will be explored in Section 4.4 for the structured log-odds models.
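The spectral surrogate idea can be illustrated by singular value soft-thresholding, the proximal operator of the nuclear norm (an illustrative sketch with a hypothetical matrix and threshold, not the fitting procedure of Section 4.4):

```python
import numpy as np

def svt(M, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    u, s, vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (u * s_shrunk) @ vt

rng = np.random.default_rng(3)
M = rng.standard_normal((10, 8))  # hypothetical full-rank input

M_low = svt(M, tau=2.0)

# Thresholding kills small singular values, implicitly selecting a low rank
assert np.linalg.matrix_rank(M_low) < np.linalg.matrix_rank(M)

# Surviving singular values are shrunk copies of the originals
s_orig = np.linalg.svd(M, compute_uv=False)
s_new = np.linalg.svd(M_low, compute_uv=False)
assert np.allclose(s_new, np.maximum(s_orig - 2.0, 0.0))
```

The threshold `tau` trades off model fit against rank, in place of fixing the rank in advance.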
Further, identifiability of the structured log-odds models is closely linked to the question whether a given entry of a low-rank matrix may be reconstructed from those which have been observed. Somewhat straightforwardly, one may see that reconstructability in the algebraic sense, see , is a necessary condition for identifiability under the respective structural assumptions. However, even though many results of  directly generalize, completability of anti-symmetric low-rank matrices, with or without vectors of ones as factors, has not been studied explicitly in the literature to our knowledge; hence we only point this out as an interesting avenue for future research.
4.2 Predicting non-binary labels with structured log-odds models
In Section 4.1, we have not introduced all aspects of structured log-odds models, in favour of a clearer exposition. In this section, we discuss those aspects that are useful for the domain application more precisely, namely:
(i) how to use features in the prediction;
(ii) how to model ternary match outcomes (win/draw/lose) or score outcomes;
(iii) how to train the model in an on-line setting with a batch/epoch strategy.
For point (i) “using features”, we will draw from the structured log-odds models’ closeness to logistic regression; the approach to (ii) “general outcomes” may be treated by choosing an appropriate link function as with generalized linear models; for (iii), parallels may be drawn to training strategies for neural networks.
4.2.1 The structured log-odds model with features
As highlighted in Section 4.1.2, pairing features $x_{ij}$ taking values in $\mathbb{R}^p$, and team features $x_i$ taking values in $\mathbb{R}^q$, may be incorporated by modelling the log-odds matrix as
$L_{ij} = \lambda_{ij} + \langle \beta_{ij}, x_{ij}\rangle + \langle u_{ij}, x_i\rangle + \langle v_{ij}, x_j\rangle,$
where $\beta_{ij} \in \mathbb{R}^p$ and $u_{ij}, v_{ij} \in \mathbb{R}^q$. Note that, differently from the simpler exposition in Section 4.1.2, we allow all coefficients, including the constant term $\lambda_{ij}$, to vary with $i$ and $j$.
However, allowing the coefficients to vary completely freely may lead to over-parameterisation or overfitting, similarly to an unrestricted (full-rank) log-odds matrix in the low-rank Élő model, especially if the number of distinct observed pairings is of similar magnitude as the number of total observed outcomes.
Hence, structural restriction of the degrees of freedom may be as important for the feature coefficients as for the constant term. The simplest such assumption is that all coefficients are independent of the pairing, i.e., assuming that
$L_{ij} = \lambda_{ij} + \langle \beta, x_{ij}\rangle + \langle u, x_i\rangle + \langle v, x_j\rangle$
for a pairing-feature coefficient $\beta \in \mathbb{R}^p$ and team-feature coefficients $u, v \in \mathbb{R}^q$, where the constant term $\lambda_{ij}$ may follow the assumptions of the feature-free log-odds models. This will be the main variant, which we will refer to as the structured log-odds model with features.
However, the assumption that the coefficients are independent of the pairing may be too restrictive, as it is plausible that, for example, teams of different strength profit differently from, or are impaired differently by, the same circumstance, e.g., injury of a key player.
To address such a situation, it is helpful to re-write Equation 13 in matrix form:
$L = \Lambda + \mathcal{B} \odot \mathcal{X} + U X_T^\top + X_T V^\top,$
where $X_T$ is the matrix whose rows are the team features $x_i$, where $U$ and $V$ are matrices whose rows are the team-feature coefficients, and where $\Lambda$ is the matrix with entries $\lambda_{ij}$. The symbols $\mathcal{B}$ and $\mathcal{X}$
denote tensors of degree 3 (= 3D-arrays) whose $(i,j,k)$-th elements are $(\beta_{ij})_k$ and $(x_{ij})_k$. The symbol $\odot$ stands for the index-wise product of degree-3 tensors which eliminates the third index and yields a matrix, i.e.,
$(\mathcal{B} \odot \mathcal{X})_{ij} = \sum_{k} \mathcal{B}_{ijk}\, \mathcal{X}_{ijk}.$
A natural parsimony assumption for the constant term and the two team-coefficient matrices is, again, that of low rank. For these matrices, one can explore the same structural assumptions as in Section 4.1.1: low-rankness and factors of ones are reasonable to assume for all three, while anti-symmetry seems natural for the constant term but not for the team-coefficient matrices.
A low tensor rank (Tucker or Waring) appears to be a reasonable assumption for the pairing-coefficient tensor. As an ad-hoc definition of the tensor (decomposition) rank of a degree-3 tensor $\mathcal{B}$, one may take the minimal $r$ such that there is a decomposition into real vectors $a_l, b_l, c_l$ such that
$\mathcal{B}_{ijk} = \sum_{l=1}^{r} a_{il}\, b_{jl}\, c_{kl}.$
Further reasonable assumptions are anti-symmetry in the first two indices, i.e., $\mathcal{B}_{ijk} = -\mathcal{B}_{jik}$, as well as some factors being vectors of ones.
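The index-wise product eliminating the third index is a tensor contraction; a sketch with hypothetical dimensions (`np.einsum` performs the contraction):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 3  # n teams, p pairing features (hypothetical sizes)

B = rng.standard_normal((n, n, p))  # coefficient tensor, (i, j, k) -> (beta_ij)_k
X = rng.standard_normal((n, n, p))  # feature tensor,     (i, j, k) -> (x_ij)_k

# Index-wise product eliminating the third index: (B ⊙ X)_ij = sum_k B_ijk X_ijk
M = np.einsum('ijk,ijk->ij', B, X)

assert M.shape == (n, n)
i, j = 1, 3
assert np.isclose(M[i, j], B[i, j] @ X[i, j])

# Anti-symmetrizing B in its first two indices makes B ⊙ X anti-symmetric
# whenever X is symmetric in its first two indices
B_anti = 0.5 * (B - B.transpose(1, 0, 2))
X_sym = 0.5 * (X + X.transpose(1, 0, 2))
M2 = np.einsum('ijk,ijk->ij', B_anti, X_sym)
assert np.allclose(M2, -M2.T)
```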
Exploring these possible structural assumptions on the coefficients of features in experiments is possibly interesting both from a theoretical and a practical perspective, but beyond the scope of this manuscript. Instead, we will restrict ourselves to the case of a constant pairing-feature coefficient, of constant team-feature coefficients, and of the constant term following one of the low-rank structural assumptions of Section 4.1.1, as in the feature-free model.
We would like to note that variants of the Bradley-Terry model with features have already been proposed and implemented in the BradleyTerry2 package for R , though isolated from other aspects of the Bradley-Terry-Élő model class such as modelling draws, or structural restrictions on hidden variables or the coefficient matrices and tensors, and the Élő on-line update.
4.2.2 Predicting ternary outcomes
This section addresses the issue of modelling draws raised in Section 3.1.4. When it is necessary to model draws, we assume that the outcome of a match is an ordinal random variable with three so-called levels: win $\succ$ draw $\succ$ lose. The draw is treated as a middle outcome. The extension of the structured log-odds model is inspired by an extension of logistic regression: the Proportional Odds model.
The Proportional Odds model is a well-known family of models for ordinal random variables. It extends logistic regression to model ordinal target variables. The model parameterizes the logit transformation of the cumulative probability as a linear function of features. The coefficients associated with the feature variables are shared across all levels, but there is an intercept term which is specific to each level. For a generic feature-label distribution $(X, Y)$, where $X$ takes values in $\mathbb{R}^p$ and $Y$ takes values in a discrete set of ordered levels $\{1, \dots, K\}$, the Proportional Odds model may be written as
$\log \frac{P(Y \le k \mid X = x)}{P(Y > k \mid X = x)} = \alpha_k + \langle \beta, x \rangle, \qquad k = 1, \dots, K-1,$
where $\alpha_1 \le \alpha_2 \le \dots \le \alpha_{K-1}$ and $\beta \in \mathbb{R}^p$. The model is called the Proportional Odds model because the cumulative odds for any two different levels $k, k'$, given an observed feature set, are proportional with a constant that does not depend on the features; mathematically,
$\frac{P(Y \le k \mid X = x)\,/\,P(Y > k \mid X = x)}{P(Y \le k' \mid X = x)\,/\,P(Y > k' \mid X = x)} = \exp(\alpha_k - \alpha_{k'}).$
Using a similar formulation, in which we closely follow Rao and Kupper, the structured log-odds model can be extended to model draws, namely by setting
$\log \frac{P(\text{win})}{P(\text{draw}) + P(\text{lose})} = L_{ij} - c, \qquad \log \frac{P(\text{win}) + P(\text{draw})}{P(\text{lose})} = L_{ij} + c,$
where $L_{ij}$ is the entry in the structured log-odds matrix and $c \ge 0$ is a free parameter that affects the estimated probability of a draw. Under this formulation, the probabilities for the different outcomes are given by
$P(\text{win}) = \sigma(L_{ij} - c), \qquad P(\text{draw}) = \sigma(L_{ij} + c) - \sigma(L_{ij} - c), \qquad P(\text{lose}) = 1 - \sigma(L_{ij} + c),$
where $\sigma(z) = 1/(1 + e^{-z})$ denotes the logistic function.
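A minimal sketch of a Rao-Kupper-type ternary model, assuming win probability $\sigma(L_{ij} - c)$ and lose probability $\sigma(-L_{ij} - c)$ (our reading of the elided formulas; the inputs below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ternary_probs(log_odds, c):
    """Win/draw/lose probabilities in a Rao-Kupper-type model.

    log_odds: entry L_ij of the structured log-odds matrix
    c: non-negative draw parameter (c = 0 gives draw probability 0)
    """
    p_win = sigmoid(log_odds - c)
    p_lose = sigmoid(-log_odds - c)
    p_draw = 1.0 - p_win - p_lose
    return p_win, p_draw, p_lose

p_win, p_draw, p_lose = ternary_probs(log_odds=0.4, c=0.5)
assert np.isclose(p_win + p_draw + p_lose, 1.0)
assert p_draw > 0

# With c = 0 the model reduces to the binary structured log-odds model
p_win0, p_draw0, _ = ternary_probs(0.4, 0.0)
assert np.isclose(p_draw0, 0.0)
assert np.isclose(p_win0, sigmoid(0.4))

# A larger c inflates the draw probability
assert ternary_probs(0.4, 1.0)[1] > p_draw
```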
Note that this may be seen as a choice of ordinal link distribution in a “generalized” structured odds model, and may be readily combined with feature terms as in Section 4.2.1.
4.2.3 Predicting score outcomes
Several models have been considered in Section 3.1.4 that use score differences to update the Élő ratings. In this section, we derive a principled way to predict scores, score differences and/or learn from scores or score differences.
Following the analogy to generalized linear models, we will be able to tackle this by using a suitable linking distribution, through which the model can utilize the additional information contained in final scores.
The simplest natural assumption one may make on scores is that of a dependent scoring process, i.e., both the home and the away team's scores are Poisson-distributed with a team-dependent parameter and possible correlation. This assumption is frequently made in the literature [37, 13, 11] and eventually leads to a (double) Poisson regression when combined with structured log-odds models.
The natural linking distributions for differences of scores are Skellam distributions, which are obtained as difference distributions of two (possibly correlated) Poisson distributions, as has been suggested by Karlis and Ntzoufras.
In the following, we discuss only the case of score differences in detail; predictions of both teams' score distributions can be obtained similarly, by predicting the correlated Poisson variables with the respective parameters instead of the Skellam difference distribution.
We first introduce some notation. As a difference of Poisson distributions whose support is $\mathbb{N}_0$, the support of a Skellam distribution is the set of integers $\mathbb{Z}$. The probability mass function of a Skellam distribution takes two positive parameters $\mu_1$ and $\mu_2$, and is given by
$p(k \mid \mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_{|k|}\!\left(2\sqrt{\mu_1 \mu_2}\right), \qquad k \in \mathbb{Z},$
where $I_{\nu}$ is the modified Bessel function of the first kind with parameter $\nu$, given by
$I_{\nu}(x) = \sum_{m=0}^{\infty} \frac{1}{m!\,\Gamma(m + \nu + 1)} \left(\frac{x}{2}\right)^{2m + \nu}.$
If random variables $S_1$ and $S_2$ follow Poisson distributions with mean parameters $\lambda_1$ and $\lambda_2$ respectively, and their correlation is $\rho$, then their difference $S_1 - S_2$ follows a Skellam distribution with parameters $\mu_1 = \lambda_1 - \rho\sqrt{\lambda_1 \lambda_2}$ and $\mu_2 = \lambda_2 - \rho\sqrt{\lambda_1 \lambda_2}$.
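The Skellam probability mass function can be sketched directly from its Bessel-series definition (the intensities below are hypothetical; the normalization check sums over a truncated support):

```python
import math

def bessel_i(nu, x, terms=60):
    """Modified Bessel function of the first kind I_nu(x), via its power series."""
    return sum((x / 2.0) ** (2 * m + nu) / (math.factorial(m) * math.gamma(m + nu + 1))
               for m in range(terms))

def skellam_pmf(k, mu1, mu2):
    """P(D = k) for D = S1 - S2, S1 ~ Poisson(mu1) independent of S2 ~ Poisson(mu2)."""
    return (math.exp(-(mu1 + mu2))
            * (mu1 / mu2) ** (k / 2.0)
            * bessel_i(abs(k), 2.0 * math.sqrt(mu1 * mu2)))

mu1, mu2 = 1.8, 1.1  # hypothetical home/away scoring intensities

# The pmf sums to one (tails beyond |k| = 30 are negligible at these intensities)
total = sum(skellam_pmf(k, mu1, mu2) for k in range(-30, 31))
assert abs(total - 1.0) < 1e-12

# mu1 > mu2 favours positive score differences
assert skellam_pmf(1, mu1, mu2) > skellam_pmf(-1, mu1, mu2)
```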
Now we are ready to extend the structured log-odds model to incorporate historical final scores. We will use a Skellam distribution as the linking distribution: we assume that the score difference of a match between team $i$ and team $j$, that is, $D_{ij}$ (taking values in $\mathbb{Z}$), follows a Skellam distribution with (unknown) parameters $\mu^{(1)}_{ij}$ and $\mu^{(2)}_{ij}$.
Note that there are hence now two structured parameter matrices, with entries $\mu^{(1)}_{ij}$ and $\mu^{(2)}_{ij}$, each of which may be subject to constraints such as in Section 4.1.1, or constraints connecting them to each other, and each of which may depend on features as outlined in Section 4.2.1.
A simple (and arguably the simplest sensible) structural assumption is that the matrix with entries $\log \mu^{(1)}_{ij}$ is of rank two, with a factor of ones, as follows:
$\log \mu^{(1)}_{ij} = \theta_i - \theta_j, \qquad \text{i.e.,} \quad \mu^{(1)}_{ij} = \exp(\theta_i)\exp(-\theta_j);$
equivalently, that the matrix with entries $\mu^{(1)}_{ij}$ has rank one and only non-negative entries.
As mentioned above, features such as home advantage may be added to the structured parameter matrices in the way introduced in Section 4.2.1.
Also note that the above yields a strategy to make ternary predictions while training on the scores. Namely, a prediction for ternary match outcomes may simply be derived from predicted score differences $D_{ij}$, through defining
$P(\text{win}) = P(D_{ij} > 0), \qquad P(\text{draw}) = P(D_{ij} = 0), \qquad P(\text{lose}) = P(D_{ij} < 0).$
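A minimal sketch of this ternary-from-scores construction: with independent Poisson scores (hypothetical intensities), win/draw/lose probabilities are obtained as the probabilities of a positive, zero, and negative score difference:

```python
import math

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

def ternary_from_scores(mu1, mu2, n_max=60):
    """P(win), P(draw), P(lose) from the score difference D = S1 - S2,
    with independent S1 ~ Poisson(mu1), S2 ~ Poisson(mu2)."""
    p_win = p_draw = p_lose = 0.0
    for s1 in range(n_max):
        for s2 in range(n_max):
            p = poisson_pmf(s1, mu1) * poisson_pmf(s2, mu2)
            if s1 > s2:
                p_win += p
            elif s1 == s2:
                p_draw += p
            else:
                p_lose += p
    return p_win, p_draw, p_lose

p_win, p_draw, p_lose = ternary_from_scores(1.8, 1.1)  # hypothetical intensities
assert abs(p_win + p_draw + p_lose - 1.0) < 1e-12
assert p_win > p_lose   # the team with the larger scoring intensity is favoured
assert p_draw > 0       # draws get positive probability automatically
```

In particular, no extra draw parameter is needed here: the score-based model assigns positive probability to draws by construction.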