1 Introduction
Sporting events are widely popular across many cultures. Although most people attend sporting events for entertainment, some have a financial investment in the outcome of the game. These groups have a keen interest in accurately predicting the results of such events, so it is not surprising that several attempts have been made by researchers to develop methods and algorithms that predict the outcome of various sporting events (Aoki et al., 2017; Dani et al., 2006; Kampakis & Adamides, 2014; Kampakis & Thomas, 2015; Király & Qian, 2017; Knowlton, 2015; Pelechrinis, 2017).
Oddsmakers are one group who attempt to accurately predict the result of games in order to maximize their profit. They accomplish this by setting the spread at a value that encourages equal betting by gamblers on each team in the competition (for a description of the spread estimation process, see https://tinyurl.com/p5yxekh). What makes this challenging is that the outcome of an event can be influenced by many factors that dynamically change across time (Aoki et al., 2017), so the spread can be considered a high-level metric that captures these variables.
Due to the difficulty of setting the spread, oddsmakers will often leverage algorithms to set the initial value and manually adjust it as more information becomes available (Figure 1). This process introduces an interesting issue, since it is well-known that both algorithmic performance (Tschantz & Datta, 2015; David et al., 2011b; Sweeney, 2013) and human decision-making (Camerer, 2003; David et al., 2011a; Gilovich, 1993; Gilovich et al., 2002; Haghighat et al., 2013; Kahneman & Tversky, 1979; Gilovich et al., 1982; Kahneman et al., 2012; Schlicht et al., 2010) can be biased. Therefore, oddsmakers may be (implicitly) providing reliable information about the outcome of a sporting event to gamblers through the spread. Using poker terminology, the spread may provide a tell that gamblers can exploit to improve predictive performance.
This study is the first to investigate if oddsmaker bias can be exploited to improve the prediction of outcomes in NFL games. Previous research exclusively leveraged game and situational features (e.g., historic win percent, margin of victory) in their predictive models (Aoki et al., 2017; David et al., 2011a; Haghighat et al., 2013; Király & Qian, 2017; Knowlton, 2015; Pelechrinis, 2017), or aggregated expert predictions to improve model performance (Dani et al., 2006). This effort extends existing research by systematically exploring whether spread biases provide reliable information about game outcomes.
First, Section 2.2 explores how models that exploit decision biases perform against models that do not leverage this information, under conditions of temporally-independent data. Then, Section 2.3 investigates how the same models perform under temporally-dependent conditions that better reflect the nature of real-world gambling. Before performance is considered, the next section describes how the data used in this effort were leveraged to train and evaluate model performance under the conditions considered in this paper.
2 Methods and Results
In order to investigate if oddsmaker bias can be exploited to improve the prediction of NFL outcomes, real-world gambling data need to be acquired to train and evaluate the predictive models. Section 2.1 describes how the data used in this effort were obtained and filtered.
Section 2.2 will outline how the data were leveraged to train and test predictive models under conditions in which the date of the game is ignored (i.e., temporally-independent data conditions). Then, Section 2.3 outlines how the models perform under situations where the data used to train and test each model are sensitive to the dates on which the games were played (i.e., temporally-dependent conditions).
2.1 Real-World Data
To explore if spread bias can be exploited to improve the prediction of NFL outcomes, real-world gambling data are required that contain both spread and outcome information (Figure 2). This study leveraged data obtained from an online gambling site (OddsShark NFL Database, accessed on 10.11.17: http://www.oddsshark.com/nfl/database).
The maximum amount of data that could be obtained from the database had an upper bound of 30 samples per team. Therefore, the maximum number of games was downloaded, which resulted in 960 unfiltered samples (30 samples × 32 NFL teams). The samples obtained were all regular-season conference games that took place between December 2014 and October 2017. Since the database produced duplicates for some games, these were removed, which resulted in 648 unique samples that contained both the NFL game outcome (Visitor Score − Home Score) and the final spread. The next section describes how the data were used to train the predictive models used in this study.
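The duplicate-removal step described above can be sketched with a simple key-based pass. This is a minimal illustration, assuming each exported row carries hypothetical `date`, `home`, and `visitor` fields (the actual export schema is not specified here):

```python
def dedup_games(rows):
    """Drop duplicate game rows produced by the database export,
    keyed on (date, home team, visitor team)."""
    seen, unique = set(), []
    for row in rows:
        key = (row["date"], row["home"], row["visitor"])  # assumed schema
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```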
2.2 Temporally-Independent (TI) Models
The goal of this study is to systematically explore if spread estimates provide reliable information about game outcomes. To accomplish this goal, it is necessary to estimate the probability of an outcome (o) given a valid spread (s), i.e., p(o|s). For the temporally-independent (TI) models described first, this density is estimated through a training procedure that does not consider the dates on which games were played. In this respect, models can be trained using data from more recent games and evaluated on games played further in the past. Obviously, this does not directly approximate the real-world context, but it allows our analysis to simulate hundreds of games and provide insight into the variability of model performance. The next section describes how models were trained and evaluated under TI conditions.
2.2.1 TI-Model Training
To explore if oddsmaker bias can be exploited to improve the prediction of NFL outcomes, candidate models need to be defined in order to compare their relative performance across the TI conditions considered in this section. The most straightforward model, which defines a lower bound on performance, is the RandomGuess model. Since there are two possible outcomes (i.e., home wins or visitor wins), the RandomGuess model decision (D_RG) is analogous to a coin flip (ignoring ties, which are uncommon in the NFL). Notice that the RandomGuess model does not use potential information about the relationship between spreads and outcomes when making its prediction:
$$D_{RG} = \begin{cases} \text{home}, & r < 1/C \\ \text{visitor}, & \text{otherwise} \end{cases} \qquad (1)$$

where r is a sample from a uniform random number generator (0–1) and C is the number of possible outcomes (C = 2).
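A minimal sketch of this baseline follows; the function name `random_guess` and its interface are illustrative, not the study's actual implementation:

```python
import random

def random_guess(num_outcomes=2, rng=None):
    """Coin-flip baseline: ignore the spread entirely and pick the
    home team if a uniform draw falls below 1/C, else the visitor."""
    rng = rng or random.Random()
    return "home" if rng.random() < 1.0 / num_outcomes else "visitor"
```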
Another model that is considered in this paper is termed the MaximumProbability model. The MaxProbability model's predictions (D_MP) select the team that maximizes the probability of an against-the-spread (ATS) win for either the home (h) or visiting (v) team. Predictions are made for each valid spread (s):

$$D_{MP}(s) = \operatorname*{arg\,max}_{t \in \{h,\,v\}} \; p(\text{ATS win}_t \mid s) \qquad (2)$$
The validity of a spread corresponds to the number of samples available at each spread value. Under the temporally-independent (TI) conditions considered in this section, valid spreads were required to have at least 25 outcome samples (Figure 3). The data filtering process resulted in 7 valid spread values, for which histograms are shown in Supplementary Figure 1.
This filtering was performed because the data needed to be separated into training data, used to estimate p(o|s), and test data, used to evaluate predictive performance. Figure 3 shows the process by which the data were separated. For each of the N simulations (N = 200), 10 data samples per valid spread were held out as test data, resulting in 14,000 test samples (10 samples × 7 valid spreads × 200 simulations), or 2,000 test samples per valid spread.
Since we required at least 25 samples per spread, this ensures that there is a minimum of 15 training samples per simulation that can be used to estimate p(o|s). Supplementary Figure 2 shows the estimated pdf for each of the 7 valid spread values. Notice that the date of the game was not considered during the separation of data between train and test conditions, which is why these are termed temporally-independent (TI) conditions.
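The split described above can be sketched as follows. The grouping of `(spread, outcome)` pairs and the function name `ti_split` are assumptions for illustration; the study's actual code is not provided:

```python
import random
from collections import defaultdict

def ti_split(games, n_test=10, min_samples=25, rng=None):
    """Temporally-independent split: ignore game dates, group outcomes
    by spread, keep only spreads with >= min_samples outcomes, and
    hold out n_test outcomes per valid spread for testing."""
    rng = rng or random.Random()
    by_spread = defaultdict(list)
    for spread, outcome in games:
        by_spread[spread].append(outcome)
    train, test = {}, {}
    for spread, outcomes in by_spread.items():
        if len(outcomes) < min_samples:
            continue  # invalid spread: too few samples
        outcomes = outcomes[:]
        rng.shuffle(outcomes)
        test[spread] = outcomes[:n_test]
        train[spread] = outcomes[n_test:]  # >= 15 samples remain for training
    return train, test
```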
In order to estimate the densities needed to make maximum-probability predictions (p(o|s)), we used kernel density estimation (with width w = 4) to estimate the quantities of interest (Figure 3):

$$p(o \mid s) = \frac{1}{Z}\sum_{i=1}^{n_s} K\!\left(\frac{o - o_i}{w}\right) \qquad (3)$$

where s is the valid spread value, o ranges over the quantized outcome values used during the kernel density estimation (−40 to +40), o_i are the n_s training outcomes observed at spread s, K is the kernel, and Z is a normalization constant that makes the density sum to one over the quantized grid.
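A discrete version of this estimator can be sketched with a Gaussian kernel (the text specifies width = 4 but not the kernel family, so the Gaussian is an assumption), normalized over the quantized outcome grid:

```python
import math

def kde_pmf(outcomes, width=4.0, lo=-40, hi=40):
    """Gaussian kernel density estimate of p(outcome | spread),
    evaluated on the quantized outcome grid lo..hi and normalized
    so the discrete probabilities sum to one."""
    grid = list(range(lo, hi + 1))
    dens = [sum(math.exp(-0.5 * ((o - x) / width) ** 2) for x in outcomes)
            for o in grid]
    total = sum(dens)
    return {o: d / total for o, d in zip(grid, dens)}
```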
Notice that both the RandomGuess and MaximumProbability models make predictions across each of the valid spread values and do not take into account oddsmaker bias. However, the main objective of this paper is to explore how decision bias can be exploited to improve NFL outcome prediction, so models that accomplish this objective are considered next.
A first step to exploiting decision bias is to find a method that is capable of quantitatively identifying biases. Since bias can be considered the amount of information the spread provides about the outcome, we can use Shannon entropy (Shannon, 1948) as an effective metric for this purpose. In this respect, Shannon entropy (H) reflects the amount of uncertainty associated with the information carried by spread estimates in their ability to predict outcomes for each team (t): H(s) = −Σ_t p(ATS win_t | s) log₂ p(ATS win_t | s). Stated another way, decisions that are strongly biased will correspond to minimum-entropy spreads. Spreads that fall below some threshold level of entropy (0.95 in our work) are then identified and considered biased. These biased spreads can then be exploited to improve predictive performance.
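The entropy computation over the two ATS outcomes, and the thresholding that flags biased spreads, can be sketched as follows (the helper names and the dictionary input format are illustrative):

```python
import math

def ats_entropy(p_home):
    """Shannon entropy (bits) over the two ATS outcomes at one spread."""
    return -sum(p * math.log2(p) for p in (p_home, 1.0 - p_home) if p > 0)

def biased_spreads(p_home_by_spread, threshold=0.95):
    """Flag spreads whose ATS-outcome entropy falls below the threshold."""
    return {s for s, p in p_home_by_spread.items()
            if ats_entropy(p) < threshold}
```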
The MinEntropy model essentially makes the same predictions (D_ME) as the MaxProbability model for a given spread. However, whereas the MaxProbability model wagers across all valid spreads, the MinEntropy model only wagers on the spread that is maximally biased (i.e., the minimum-entropy spread (s*)). In this respect, the MinEntropy approach determines which spread to select through Shannon entropy, and then uses a MaxProbability approach to decide which team to choose (Figure 3):

$$s^{*} = \operatorname*{arg\,min}_{s} H(s), \qquad D_{ME} = D_{MP}(s^{*}) \qquad (4)$$
A more general version of the MinEntropy model is the kLowest Entropy model. Whereas the MinEntropy model wagers only on the spread with the largest bias, the kLowest Entropy model wagers (D_kLE) on all k spreads that fall below the threshold level of entropy (0.95 in this paper). Once the k lowest-entropy spreads (S_k) are identified, those spreads are selected and teams are chosen for each spread based on the MaxProbability method:

$$S_k = \{\, s : H(s) < 0.95 \,\}, \qquad D_{kLE}(s) = D_{MP}(s) \;\; \forall\, s \in S_k \qquad (5)$$
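Combining the entropy filter with the MaxProbability team choice gives a compact sketch of the kLowest Entropy decision rule (again, names and the input format are assumptions for illustration):

```python
import math

def entropy_bits(p):
    """Binary Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def k_lowest_entropy_wagers(p_home_by_spread, threshold=0.95):
    """Wager on every spread whose entropy is below the threshold,
    choosing the team with the higher estimated ATS-win probability
    (the MaxProbability rule applied only to biased spreads)."""
    return {s: ("home" if p > 0.5 else "visitor")
            for s, p in p_home_by_spread.items()
            if entropy_bits(p) < threshold}
```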
This section outlined how each of the candidate models (RandomGuess, MaxProbability, MinEntropy, and kLowest Entropy) was trained and evaluated under TI conditions. The next section discusses the results obtained through 200 simulations conducted under these TI conditions.
2.2.2 TI-Model Evaluation
Using the procedure outlined in Section 2.2.1, model performance was evaluated with the test data held out during each training iteration. As mentioned above, this resulted in 2,000 test samples for each of the 7 valid spread values.
The RandomGuess model effectively flipped a coin for each of the 2,000 test samples across each of the 7 valid spreads. The actual NFL game outcomes from the test data were compared to the RandomGuess model predictions, resulting in average performance (Mean: 49.46%; SEM: 1.11) that was close to the expected level of chance (Figure 4, Right).
Predictions from the MaxProbability model were compared to actual NFL outcomes across all valid spreads for the test data. The results show a slight improvement in average predictive performance (Mean: 58.15%; SEM: 1.13) above chance (Figure 4, Right).
Finally, the performance of the models that exploit oddsmaker bias was examined by comparing predictions from the MinEntropy model to the 2,000 actual NFL outcomes for only the test data of the minimum-entropy spread (Figure 4, Left). Similarly, the kLowest Entropy model predictions were compared to each of the (k = 2) valid spreads that fall below the entropy threshold (0.95), resulting in 4,000 (2 spreads × 2,000 samples) comparisons. As Figure 4 (Right) shows, both the MinEntropy (Mean: 71.85%; SEM: 0.82) and kLowest Entropy (Mean: 66%; SEM: 0.99) models outperform the two approaches that do not leverage biases in oddsmaker decisions.
Although the results under these temporally-independent conditions appear promising, it would be remiss not to evaluate how the entropy-based models perform under conditions that better reflect real-world gambling. The next section provides such an evaluation under temporally-dependent conditions.
2.3 Temporally-Dependent (TD) Models
The previous section was able to simulate hundreds of trials in order to evaluate the performance of the different models under TI conditions. This section strikes the opposite balance by using historic data in an attempt to predict more recent outcomes. Although these temporally-dependent (TD) conditions accurately reflect the temporal components of real-life gambling, they allow performance to be evaluated only once.
The hope is that by leveraging these complementary approaches to model evaluation, insight can be gained into the effectiveness expected from each of the candidate approaches if utilized outside of this study.
2.3.1 TD-Model Training
The candidate TD models are identical to those outlined in Section 2.2.1, and the training procedure is also similar to that reflected in Figure 3, with two key differences.
First, instead of simulating over multiple iterations that require train and test data to be randomly selected for each iteration, TD conditions separate the data only once. This one-time separation is based exclusively on the dates on which the games were played. More specifically, the test data exclusively included games played in the year 2017 (n = 85), whereas the training data included games from years prior to 2017 (2014–2016; n = 648 − 85 = 563). Implicitly, this reflects a situation where the gambler trained the model between the 2016 and 2017 seasons and did not update the model as results became available. Therefore, this can be considered a conservative estimate of model performance under TD conditions.
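This one-time date-based separation can be sketched as follows; the `year` field and the function name are hypothetical stand-ins for the actual data schema:

```python
def td_split(games, cutoff_year=2017):
    """One-time temporally-dependent split: train on games played
    before the cutoff year, test on games from the cutoff year onward."""
    train = [g for g in games if g["year"] < cutoff_year]
    test = [g for g in games if g["year"] >= cutoff_year]
    return train, test
```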
A second difference was in the definition of what constitutes a valid spread. The minimum number of samples required per spread was reduced to 15 (instead of 25). Remember, this threshold determines the amount of training data used to estimate p(o|s); unlike the TI conditions, no additional data in the TD condition need to be removed for testing. Effectively, this reduction balanced the minimum number of training samples required for a valid spread between the TI and TD conditions.
This filtering process resulted in 12 valid spreads, which produced a total of 54 (of the possible 85) test samples across the valid spreads. Therefore, the data available to evaluate performance are relatively limited due to the TD constraints. Section 2.3.2 will overview the performance of each model under the conditions described in this section.
2.3.2 TD-Model Evaluation
Figure 5 (Left) shows the entropy for each of the 12 valid spreads in the TD condition. It shows that there are 4 spreads whose entropy falls below the threshold used in this study (0.95), i.e., 4 spreads that exhibit exploitable bias. As a result, the kLowest Entropy model has a maximum of k = 4 spreads to wager across.
Surprisingly, the MinEntropy model performed poorly (37.50% ATS wins) across the test samples that correspond to the minimum-entropy spread (Figure 5 (Right)). However, the kLowest Entropy method realized good predictive performance under the k = 3 (66.67% ATS wins) and k = 4 (65.00% ATS wins) conditions. The k = 2 condition (44.44% ATS wins) performed at similar levels to the MaxProbability (50.00% ATS wins) and RandomGuess (48.68% ATS wins) models.
In order to visualize the impact on performance of increasing k beyond the number of spreads below the entropy threshold (0.95), Figure 5 (Right) shows how ATS win percent decreases for all spreads associated with k > 4. This suggests that the performance gains were the result of exploiting spreads with significant decision bias.
3 Discussion
This effort was the first to investigate if oddsmaker decision biases can be exploited to improve the prediction of NFL game outcomes. The study evaluated two models that do not exploit decision bias (the RandomGuess and MaxProbability methods) against two approaches that do (the MinEntropy and kLowest Entropy methods) in their ability to predict the outcomes of NFL games.
The results show that under temporally-independent conditions, methods that exploit decision bias predict NFL outcomes better than those that do not (Figure 4 (Right)). However, when the same models were tested under conditions that account for the dates on which games were played (temporally-dependent), only the kLowest Entropy methods predicted NFL outcomes above chance (Figure 5 (Right)). For max-k conditions (TI: k = 2, TD: k = 4), the kLowest Entropy method achieved at least a 65% ATS win percent under both TI and TD conditions.
Taken together, these results suggest that the kLowest Entropy method is a robust approach capable of improving predictive performance by exploiting oddsmaker decision biases. This result is surprising given the relative simplicity of the model (only using the spread as a feature) compared to previous work in this area (Aoki et al., 2017; David et al., 2011a; Haghighat et al., 2013; Király & Qian, 2017; Knowlton, 2015; Pelechrinis, 2017). This is likely due to the fact that, in order to set spread values, oddsmakers must account for many game and situational variables (Figure 1). Therefore, the spread appears to be an extremely useful summary feature from which one can evaluate bias.
Another compelling issue to pursue is gaining insight into the underlying reason oddsmakers exhibit biases. As our analyses show (Figure 4 (Left) and Figure 5 (Left)), biases tend to be relatively stable for some spreads (e.g., 2.5). This implies that either the algorithm or the human is producing this spread for cases in which the visiting team wins frequently. It is interesting to note, looking at the marginal spread distribution (Figure 2), that the mean (2.24) is near this maximum-bias location. This may imply that this spread corresponds to the default home-field advantage produced when two teams are incorrectly estimated to be evenly matched. Future work will explore this topic in greater detail.
Future work will also investigate more rigorous methods to set the entropy threshold used to identify biased spreads. This value was set visually (at 0.95) in this study, so a quantitative threshold-selection method is needed. Moreover, efforts will be made to acquire greater amounts of data that can be used to further train and evaluate the models proposed in this study.
Overall, these results demonstrate how identifying and exploiting decision biases can improve performance in competitive wagering situations. Indeed, these results may extend to other competitive decision-making tasks, such as stock selection, where human and algorithmic decisions can be biased. It will be interesting to see if this work can be generalized to other important domains.
Acknowledgments
The author would like to thank his wife and son for their patience while this analysis was being performed.
References

Aoki, R., Assuncao, R.M., and de Melo, P. Vaz. Luck is hard to beat: The difficulty of sports prediction. arXiv: Machine Learning, pp. 1–10, 2017.

Camerer, C. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, 2003.

Dani, V., Madani, O., Pennock, D.M., Sanghai, S., and Galebach, B. An empirical comparison of algorithms for aggregating expert predictions. In 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), Cambridge, MA, 2006.

David, J.A., Pasteur, R.D., Ahmad, M.S., and Janning, M.C. NFL prediction using committees of artificial neural networks. Journal of Quantitative Analysis in Sports, 7(2):193–206, 2011a.

David, J.A., Pasteur, R.D., Ahmad, M.S., and Janning, M.C. Algorithmic bias: From discrimination discovery to fairness-aware data mining. Journal of Quantitative Analysis in Sports, 7(2):193–206, 2011b.

Gilovich, T. How We Know What Isn't So: The Fallibility of Human Reason in Everyday Life. New York: The Free Press, 1993.

Gilovich, T., Griffin, D., and Kahneman, D. Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press, 1982.

Gilovich, T., Griffin, D., and Kahneman, D. Heuristics and Biases: The Psychology of Intuitive Judgment. Cambridge University Press, 2002.

Haghighat, M., Rastegari, H., and Nourafza, N. A review of data mining techniques for result prediction in sports. Advances in Computer Science: An International Journal, 2(5), 2013.

Kahneman, D. and Tversky, A. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.

Kahneman, D., Knetsch, J.L., and Thaler, R.H. Anomalies: The endowment effect, loss aversion, and status quo bias. Journal of Economic Perspectives, 5(1):193–206, 2012.

Kampakis, S. and Adamides, A. Using Twitter to predict football outcomes. arXiv: Machine Learning, pp. 1–10, 2014.

Kampakis, S. and Thomas, B. Using machine learning to predict the outcome of English county twenty over cricket matches. arXiv: Machine Learning, pp. 1–17, 2015.

Király, F.J. and Qian, Z. Modelling competitive sports: Bradley-Terry-Élő models for supervised and online learning of paired competition outcomes. arXiv: Machine Learning, pp. 1–10, 2017.

Knowlton, E. Microsoft's sports algorithm is probably better at picking NFL winners than you are. Fiscal Times (online article), https://tinyurl.com/y7trzobr, 2015.

Pelechrinis, K. iWinRNFL: A simple and well-calibrated in-game NFL win probability model. arXiv: Statistical Applications, pp. 1–10, 2017.

Schlicht, E.J., Shimojo, S., Camerer, C.F., Battaglia, P., and Nakayama, K. Human wagering behavior depends on opponents' faces. PLoS ONE, 5(7), https://doi.org/10.1371/journal.pone.0011663, 2010.

Shannon, C.E. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

Sweeney, L. Discrimination in online ad delivery. ACM Queue, 11(3), 2013.

Tschantz, M.C., Datta, A., and Datta, A. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies, 1:92–112, 2015.
Supplementary Figures
Model | Percent ATS Win | N Test Samples
Random | 48.68% | 54
MaxProb | 50.00% | 54
MinEnt | 37.50% | 8
2-Lowest Ent | 44.44% | 9
3-Lowest Ent | 66.67% | 18
4-Lowest Ent | 65.00% | 20
5-Lowest Ent | 56.67% | 30
6-Lowest Ent | 58.06% | 31
7-Lowest Ent | 52.50% | 33
8-Lowest Ent | 48.84% | 40
9-Lowest Ent | 45.65% | 46
10-Lowest Ent | 48.98% | 49
11-Lowest Ent | 50.00% | 52
12-Lowest Ent | 50.00% | 54