1 Introduction
A basketball game can be framed as a collection of episodes from complex stochastic processes. Each episode, or play, comprises a finite number of transitions between players and locations, ultimately terminating in a shot, turnover, or foul. An integral attribute of the game is that it is nonstationary; the transition probabilities are not constant over the 24 seconds in which a team has to shoot the ball. For example, consider the relationship between time on the shot clock, which counts down these 24 seconds, and the probability of taking a shot as shown in Figure 1. The plot in Figure 1(b) shows empirical league-average shot policies, which we define as the probability that any on-ball event (i.e. dribbles, passes, and shots) will be a shot, for the set of court regions defined in Figure 1(a). As the shot clock winds down, the probability of shooting increases, quite dramatically in the final seconds of the shot clock.
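The empirical shot policy defined above can be illustrated with a small sketch. The event tuples, the region labels, and the 3-second binning of the shot clock are assumptions made for the example, not the data format used in this paper.

```python
from collections import defaultdict


def empirical_shot_policy(events, interval_seconds=3):
    """Fraction of on-ball events that are shots, per (region, clock bin)."""
    shots = defaultdict(int)
    totals = defaultdict(int)
    last_bin = 24 // interval_seconds - 1
    for region, shot_clock, event_type in events:
        # Bin by seconds remaining; a reading of exactly 24.0 goes in the last bin.
        clock_bin = min(int(shot_clock // interval_seconds), last_bin)
        key = (region, clock_bin)
        totals[key] += 1
        if event_type == "shot":
            shots[key] += 1
    return {key: shots[key] / totals[key] for key in totals}


# Hypothetical toy events: (court region, seconds left on shot clock, event type)
events = [
    ("midrange", 20.0, "dribble"),
    ("midrange", 19.5, "pass"),
    ("midrange", 2.0, "shot"),
    ("midrange", 1.0, "dribble"),
    ("rim", 1.5, "shot"),
]
policy = empirical_shot_policy(events)
```

Here bin 0 holds events with fewer than three seconds remaining, so `policy[("midrange", 0)]` is the share of late-clock midrange events that were shots.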
Determining optimal policies for player shooting is a critical problem in the game of basketball and it remains an active area of research (Skinner & Goldman, 2015; Goldman & Rao, 2014b). However, the inherent nonstationarity introduced by the shot clock makes assessing shot selection optimality a complex problem. In this project, we propose a method to test and compare shot policies which accounts for the dynamic nature of a basketball play. Two critical assumptions underlying our approach are that shot policies are both time-varying and malleable. Basketball analysts often focus on the less flexible component of shot efficiency: field goal percentage, or the percentage of a player’s shots that he makes. Improving shooting skill can take years of practice, whereas the shot policy is comparatively controllable; players choose where and how often they shoot when they have the ball in their possession.
Given the malleable nature of shot policies, we explore what could have happened if a player’s shot policy had changed. To enable this exploration, we model plays as episodes from latent Markov decision processes (MDPs) with dynamic within-episode transition probabilities. We approximate these functional transition probabilities via transition probability tensors (TPTs), then estimate the latent components of the MDP using Bayesian hierarchical models. Our method involves combining several Markov chains with overlapping state spaces into an average Markov chain, which we derive subject to the constraint that the expected total transition count for an arbitrary state pair is equal to the weighted sum of the expected counts of the separate chains.
We then develop a method to simulate from these processes not simply by outcome, but rather at the sub-second level, incorporating every intermediary and terminal on-ball event over the course of a play. The uncertainty in our estimation of the MDP is propagated into the simulations via posterior samples of the MDP model. Ultimately, our method allows us to make distributional estimates of counterfactual scenarios such as, “What could have happened if a team took contested midrange shots less frequently early in the shot clock?” While we focus primarily on shot policies in this paper, narrowing in on a player’s choice to shoot or not at any given instant, the framework presented here can be altered to accommodate the whole space of decisions players can make on-ball, including movement and passing.
1.1 Related work and contributions
This paper adds to the growing literature of spatiotemporal analyses of team invasion sports (i.e. basketball, football, soccer, and hockey). We refer the reader to Gudmundsson & Horton (2017) for a survey. Within this body of work, Markov models have a significant presence: Goldner (2012) uses a Markov model as a framework for evaluating plays in American football; Hirotsu & Wright (2002) use Markov processes to determine optimal substitution patterns in an English Premier League match; and Thomas et al. (2013) use a semi-Markov process to model team scoring rates in hockey.
The landmark work of Cervone et al. (2014) is particularly relevant to the methods we introduce in this paper. The state space and hierarchical models we develop have similarities to the coarsened possession process they employ; however, our ultimate goals are fundamentally different. Cervone et al. aim to estimate instantaneous point values of possessions whereas we utilize a decision process framework to estimate the macro effects if player decisions were to change.
We approach the problem similarly to Routley & Schulte (2015), who apply a Markov game formalism to value player actions in hockey, incorporating context and a lookahead window in time. As in Routley & Schulte (2015), we do not aim to compute optimal strategies; however, we provide a basketball play simulator by which alternate policies can be explored. Since the defense is not an adversarial agent in our model but is built into the system via the probabilistic components of the MDP, this simulator is proposed as an exploratory tool as opposed to a mechanism to compute policy optima.
Several papers have endeavored to simulate a basketball game using Markov models (Štrumbelj & Vračar, 2012; Vračar et al., 2016). Our simulator is unique among these studies in a number of ways. We account for the uncertainty in every estimated parameter, propagating this uncertainty through to the simulations. Also, though Vračar et al. (2016) incorporate game-clock time, these simulation methods do not account for the inherent nonstationarity within a possession introduced by the shot clock. We propose a novel method to account for the nonstationarity of basketball plays using transition tensor Markov decision processes. By incorporating this dependency in our model, we can explore far more detailed policy changes with correspondingly more accurate results, particularly with respect to shot clock violations and time-specific policy changes within plays.
This work also contributes to the literature and practical application of discrete absorbing Markov chains. We formalize a method to construct and estimate an average chain from several independent Markov chains with overlapping state spaces. The term “average” as used here refers to a chain that yields the same number of state-to-state transition counts in expectation for all unique state pairs spanned by the set of independent chains. This allows us to reliably estimate aggregate counts across the system of chains without having to estimate each chain individually. This result is critical in making our method both parsimonious and computationally feasible.
1.2 Description of data
We use high-resolution optical tracking data collected by STATS LLC for the 2015-2016 NBA regular season. These data include the coordinates of all 10 players on the court and the coordinates of the ball at a frequency of 25 observations per second. These data are further annotated with features such as shots, passes, dribbles, fouls, etc. For this project we use observations with tagged ball-events including dribbles, passes, rebounds, turnovers, and shots. This significantly reduces the number of observations while retaining the core structure of a possession.
1.3 Outline
The rest of the paper is organized as follows. In Section 2 we give a brief overview of Markov decision processes and frame the process in basketball terms. In Section 3 we describe how we incorporate tensors in the framework of an MDP, detail the inference procedures, and illustrate the model fit. In Section 4 we describe how we simulate plays from team-specific MDPs and show calibration results from our simulations. In Section 5 we show the results of our simulations under various altered policies and discuss potential game-theoretic ramifications of altering policies. Our concluding remarks comprise Section 6.
2 Markov decision processes
Markov decision processes are utilized in many modern reinforcement learning problems to characterize interactions between an agent and his environment. In this paper we restrict our attention to finite MDPs, which can be represented as a tuple: $\langle \mathcal{S}, \mathcal{A}, P(\cdot), R(\cdot) \rangle$. $\mathcal{S}$ represents a discrete and finite set of states. $\mathcal{A}$ represents a finite set of actions the agent can take. $P(\cdot)$ defines the transition probabilities between states, and $R(\cdot)$ defines the immediate reward the agent receives for any given state/action pair. The agent operates in the environment according to a policy, $\pi(\cdot)$, which defines the probabilities that govern the agent’s choice of action based on the current state of the environment. $\pi(\cdot)$ is the only aspect of the system that the agent controls. Typically, the agent’s goal is to maximize his long-term rewards, which he does by modifying his policy. We can define these functions succinctly in mathematical terms:
$$P(s, a, s') = \mathbb{P}\left[S^{(n+1)} = s' \mid S^{(n)} = s,\, A^{(n)} = a\right] \tag{1}$$
$$R(s, a) = \mathbb{E}\left[R^{(n+1)} \mid S^{(n)} = s,\, A^{(n)} = a\right] \tag{2}$$
$$\pi(s, a) = \mathbb{P}\left[A^{(n)} = a \mid S^{(n)} = s\right] \tag{3}$$
Figure 2 illustrates an MDP in the context of a basketball play. In basketball terms, we use $\pi$ to represent the probability that the ballcarrier takes a shot (or other actions, as we explore later) given his current state. If he takes a shot, $R$ dictates the expected point value of that shot. If he decides not to shoot, $P$ denotes the probabilities of the ball entering any other state given his current state. In our case, the governing probabilities are unobserved for each of these components and hence must be estimated. We refer the reader to Sutton & Barto (1998) for a more expansive introduction to reinforcement learning and Markov decision processes.
2.1 State and action space
The state space of our model, $\mathcal{S}$, is defined in the context of the ballcarrier. Following Cervone et al. (2014), at any time $t$, the state is given by the identity of the ballcarrier, his court region, and an indicator of his defensive pressure (open or contested). Court region is a function of the coordinates of the ballcarrier and can take any of the six regions shown in Figure 1(b): rim, paint, midrange, corner 3, arc 3, or backcourt. Defensive pressure is determined by the distance of the nearest defender to the ballcarrier, using a threshold that depends on the court region of the ballcarrier: rim 3 ft, paint 3.5 ft, midrange 4 ft, and 3-point regions 5 ft.
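A minimal sketch of this state encoding follows. The `State` type, the `make_state` helper, and the treatment of a defender exactly at the threshold are illustrative assumptions; the text gives no threshold for the backcourt, so the sketch treats it as open.

```python
from typing import NamedTuple

# Contest-distance thresholds in feet by court region, as listed in the text.
CONTEST_THRESHOLD_FT = {
    "rim": 3.0,
    "paint": 3.5,
    "midrange": 4.0,
    "corner 3": 5.0,
    "arc 3": 5.0,
}


class State(NamedTuple):
    player: str
    region: str
    contested: bool


def make_state(player, region, nearest_defender_ft):
    """Build a (player, region, defensive pressure) state for the ballcarrier."""
    # Assumption: a defender exactly at the threshold counts as contesting,
    # and regions without a listed threshold (backcourt) are treated as open.
    threshold = CONTEST_THRESHOLD_FT.get(region)
    contested = threshold is not None and nearest_defender_ft <= threshold
    return State(player, region, contested)


example = make_state("player_1", "midrange", 3.2)
```

Because `State` is hashable, it can serve directly as a row key in a transition matrix or policy lookup.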
As we are primarily interested in shot policies, we have chosen a binary action space; at each step in the process, the ballcarrier decides to either shoot or not shoot according to his policy $\pi$. If a shot is taken the play terminates; otherwise, the subsequent transition is generated by $P$. Later, we explore changes to passing probabilities via perturbations to $P$.
2.2 Defining the average chain
Before unrolling the modeling details for the components of the MDP, we pause to explain how we have specified the data generating process. Because most teams use upwards of 500 lineups over the course of a season, we assume the transition probabilities in $P$ and action probabilities in $\pi$ are invariant to the lineup, i.e., other players do not impact transitions. This allows us to define the process at any given point in time using a single team-average transition probability matrix (TPM). We construct this team-average chain such that it yields the same number of state pair transitions in expectation as the sum across all the independent chains for every unique state pair spanned by the set of lineups. We now briefly detail this derivation for two lineups.
Consider Markov chains for two separate lineups, $\ell_1$ and $\ell_2$, each having a set of transient states denoted $\mathcal{T}_1$ and $\mathcal{T}_2$ respectively. Let the matrix of expected state-pair transition counts for lineup $\ell_1$ (denoted by $C_1$) be defined in a cell-wise manner such that the $(j, k)$ cell of $C_1$ equals
$$C_1[j, k] = \sum_{i \in \mathcal{T}_1} \delta_i \, N_{ij} \, p_{jk} \tag{4}$$
where $i$ indexes the initial starting state of the episode, $j$ indexes the origin state of the state pair of interest, and $k$ indexes the destination state of the pair. $\delta$ is the initial distribution of the chain, $N_{ij}$ is the $(i, j)$ entry of the fundamental matrix $N$, and $p_{jk}$ is the probability of immediately transitioning to state $k$ given current state $j$. The $(i, j)$ entry of $N$ is the expected number of times that the process arrives in state $j$ given that the episode is initialized in state $i$. We combine $C_1$ and $C_2$ with weights proportional to the number of episodes that come from each chain, then normalize the rows of the resulting matrix to create the average chain $\bar{P}$. It can then be shown for any state pair $(j, k)$ that the expected transition count under $\bar{P}$ equals the weighted sum of the corresponding expected counts of the two lineup chains. We have omitted some details for brevity, but a full exposition can be found in the Appendix.
The average chain allows us to accurately estimate aggregate counts (e.g. over the course of the regular season) across all lineups without having to estimate each lineup’s transition probabilities individually, making the problem tractable while still retaining enough detail to explore the nuanced questions we are interested in.
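The construction of the average chain can be checked numerically on a toy example. The sketch below is not the derivation from the Appendix: the two small chains, the weights, and the choice to mix the initial distributions with the same weights are all assumptions made for illustration. Expected visits are computed by iterating $v \leftarrow \delta + vQ$, which converges to the initial distribution times the fundamental matrix.

```python
def expected_visits(delta, Q, n_iter=500):
    """Expected visits to each transient state per episode.

    Iterates v <- delta + v Q, which converges to delta (I - Q)^{-1}.
    """
    n = len(delta)
    v = list(delta)
    for _ in range(n_iter):
        v = [delta[k] + sum(v[j] * Q[j][k] for j in range(n)) for k in range(n)]
    return v


def expected_counts(delta, P):
    """Cell (j, k) holds the expected number of j -> k transitions per episode."""
    n = len(delta)
    Q = [row[:n] for row in P]  # transient-to-transient block
    v = expected_visits(delta, Q)
    return [[v[j] * p for p in P[j]] for j in range(n)]


# Hypothetical toy chains on three transient states (0, 1, 2) plus one
# absorbing column (3). Lineup a never uses state 2; lineup b never uses state 0.
P_a = [[0.5, 0.3, 0.0, 0.2], [0.2, 0.4, 0.0, 0.4], [0.0, 0.0, 0.0, 0.0]]
P_b = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.3, 0.4, 0.3], [0.0, 0.5, 0.2, 0.3]]
delta_a, delta_b = [0.7, 0.3, 0.0], [0.0, 0.6, 0.4]
w_a, w_b = 0.6, 0.4  # weights proportional to episodes from each lineup

C_a, C_b = expected_counts(delta_a, P_a), expected_counts(delta_b, P_b)
combined = [[w_a * a + w_b * b for a, b in zip(ra, rb)] for ra, rb in zip(C_a, C_b)]

# Row-normalize the combined counts to obtain the average chain.
P_avg = [[c / sum(row) for c in row] for row in combined]
delta_avg = [w_a * a + w_b * b for a, b in zip(delta_a, delta_b)]
C_avg = expected_counts(delta_avg, P_avg)

max_gap = max(abs(x - y) for rx, ry in zip(C_avg, combined) for x, y in zip(rx, ry))
```

In this example `max_gap` is zero up to floating-point error: the single average chain reproduces the weighted expected counts of both lineups for every state pair.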
3 Transition and policy tensors
In most MDP applications the transition dynamics, $P$, are treated as being static while $R$ is assumed to vary temporally. However, in this paper we assume the opposite; only the reward function is time-independent. The reason for this is that in our case, time (or rather, the shot clock) resets with each new episode of the process as opposed to continuing globally across episodes. We are concerned with within-episode temporal dynamics, whereas most MDP applications consider time globally. As such, the way we consider nonstationarity is quite different from how it is primarily treated in the literature. Our framework requires a functional form of $P$ and $\pi$, whereas these are conventionally modelled statically. (Footnote 1: In reinforcement learning applications, $\pi$ typically gets updated as the agent learns more about his environment. In this sense $\pi$ is dynamic, but this is quite different from the functional form for $\pi$ we refer to here.)
To incorporate within-episode nonstationarity in $P$ and $\pi$, we propose using tensors to allow for dynamic transition probabilities and shot policies over the shot clock. In the stochastic processes literature, the term ‘transition probability tensor’ arises (albeit infrequently) in reference to the series of transition probability matrices induced by a higher-order Markov chain (see Li & Ng (2014) for example). This is not what we mean by this term. Rather, we refer to a transition (or policy) tensor as an approximation to a dynamic transition probability function of a continuous temporal covariate $t$. Specifically, we model $P$ and $\pi$ as tensors with 8 matrix slices, or TPMs, each representing a three-second interval of the shot clock as illustrated in Figure 3.
The policy tensor is virtually identical to the transition probability tensor (TPT) in form, the only difference being the column space. Since the ballcarrier makes only a binary decision at every step of the process, the shot policy (given any time $t$) is a matrix slice with row space equal to that of the corresponding TPM slice and a column space of length two (shot and no shot). This tensor framework is the key to accurately exploring the effects of altering shot policies. The efficiency of a shot is dependent on the time remaining on the shot clock, and this model framework allows us to account for this temporal dependency and tailor our policy alterations accordingly.
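As a rough illustration of how a tensor slice might be selected from the shot clock, consider the sketch below. The nested-list storage, the function names, and the convention that slice 0 covers the start of the play are assumptions; the paper's own indexing may differ.

```python
N_SLICES = 8          # eight matrix slices, per the text
SLICE_SECONDS = 3.0   # each slice covers three seconds of the 24-second clock


def slice_index(shot_clock):
    """Map seconds remaining (0 to 24) to a 0-based slice index.

    Assumption: slice 0 covers the start of the play (24 to 21 seconds left).
    """
    if not 0.0 <= shot_clock <= 24.0:
        raise ValueError("shot clock must be within [0, 24]")
    elapsed = 24.0 - shot_clock
    return min(int(elapsed // SLICE_SECONDS), N_SLICES - 1)


def shot_probability(policy_tensor, state, shot_clock):
    """Look up P(shot | state, t) in a tensor stored as policy_tensor[slice][state]."""
    return policy_tensor[slice_index(shot_clock)][state]


# Hypothetical policy tensor for two states, with shot probability rising
# as the shot clock winds down.
policy_tensor = [[0.02 * (k + 1), 0.05 * (k + 1)] for k in range(N_SLICES)]
late_clock_prob = shot_probability(policy_tensor, 1, 2.0)
```

The same lookup applies to a transition tensor, where each slice is a full TPM rather than a two-column policy matrix.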
3.1 Tensor model specification
We employ a Bayesian hierarchical modeling approach, which provides a natural way to share strength across parameters that are alike. Note that while we fit each model independently of the others, they each employ a common hierarchical structure: player-specific parameters borrow strength from position-specific parameters (e.g. point guards, power forwards, etc.), which in turn borrow strength from global location and defensive pressure parameters. Since much of the notation and details of the policy model and transition model are redundant, we only show the specifics for the policy model in this section. The details of the transition tensor model are included in the Appendix for the interested reader.
Let $A^{(n)}$ be a Bernoulli random variable with 0 denoting ‘no shot’ and 1 denoting ‘shot’ in the $n$th step of an episode from the MDP. In our context, an episode is a sequence of events that comprise a single play from start to finish. Let $S^{(n)}$ be a categorical indicator of the ballcarrier’s current state in the space of all player/location/defense combinations. Specifically, the state combines a player index $x \in \{1, \ldots, X\}$, a location index $y \in \{1, \ldots, Y\}$, and a defense indicator $z \in \{0, 1\}$, where $X$ and $Y$ represent the total numbers of players on the roster and locations respectively, and where $z$ is a binary indicator of defensive pressure. Next, let $T^{(n)}$ represent the interval of the shot clock in which the play falls in its $n$th step. Since we partition the 24-second shot clock into 3-second intervals, this takes on values from 1 to 8. $g^{(n)}$ will denote the position (or group) of the ballcarrier (player $x$) at the $n$th step of the episode, therefore $g^{(n)} \in \{1, \ldots, G\}$, where $G$ is the total number of unique player positions. Note that players are nested within position, hence $x_g$ shows that player $x$ belongs to position group $g$. At step $n$,
(Equations (5)-(8): hierarchical specification of the shot policy model.)
where each $\Sigma$ is an AR(1) correlation matrix with temporal correlation parameter $\rho$ corresponding to its level of the hierarchy. If a shot is taken (i.e. $A^{(n)} = 1$) then the MDP episode terminates and the reward is determined by $R$. Otherwise, $A^{(n)} = 0$ and the next state is determined by $P$. (Footnote 2: Due to the extreme infrequency of backcourt shots we do not estimate player-specific coefficients for backcourt shot policies and field goal percentages. For notational simplicity we have omitted this technicality in the model definition.) Figure 4 shows a graphical representation of the model for $\pi$.
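To make the role of the AR(1) correlation structure concrete, the following sketch simulates a three-level hierarchy of shot propensities over the eight shot clock intervals. It is only an illustration under stated assumptions: the logit link, the scale parameters `sigma`, and all numeric values are hypothetical, and this is a forward simulation, not the inference procedure used in the paper.

```python
import math
import random


def ar1_correlation(n, rho):
    """AR(1) correlation matrix: entry (i, j) is rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def sample_ar1(mean, sigma, rho, rng):
    """Draw from N(mean, sigma^2 * AR1(rho)) using the AR(1) recursion."""
    z = rng.gauss(0.0, 1.0)
    draw = [mean[0] + sigma * z]
    for m in mean[1:]:
        # Each step keeps unit variance and correlation rho with the previous one.
        z = rho * z + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        draw.append(m + sigma * z)
    return draw


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)
# Hypothetical global logits that rise as the shot clock winds down.
global_logits = [-3.0 + 0.3 * k for k in range(8)]
# Position-level parameters borrow strength from the global level ...
position_logits = sample_ar1(global_logits, sigma=0.5, rho=0.9, rng=rng)
# ... and player-level parameters borrow strength from the position level.
player_logits = sample_ar1(position_logits, sigma=0.3, rho=0.9, rng=rng)
player_policy = [expit(x) for x in player_logits]  # P(shot) per 3-second interval
```

A correlation parameter near 1 ties neighbouring intervals together, which is what smooths a player's estimated policy across the shot clock.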
3.2 Reward function
In the context of a basketball play, equation (2) can be restated as, “How many immediate points do we expect when a player in state $s$ takes action $a$?” If the action is a shot, then this expected value is his expected points per shot from the given state; otherwise it is 0 (for simplicity, we have omitted free throws from our analysis). This allows us to define the reward function of the MDP completely in terms of a shot efficiency model.
As with $P$ and $\pi$, we use a hierarchical model for $R$. However, while $P$ and $\pi$ define player groups using naive player position (center, power forward, etc.), this model uses new groups on which to base the regularization. The reason for this change is that a player’s shooting skill does not have as clear a correspondence to his naive position. As such, we create customized groupings to ensure sensitivity to this variation.
We first clustered players into three categories based exclusively on the volume of shots they took over the course of the season, irrespective of the shot locations. Next, we reclustered players into six shot propensity categories based on the proportional breakdown of their shots by court region, irrespective of volume. In both clustering procedures we used the k-means algorithm initialized at cluster centroids calculated via Ward linkage (Ward Jr, 1963). We then crossed these clusters, giving a total of 18 groups which differentiate players by how much they shoot and where they tend to shoot from. Table 1 shows three example players in each cluster.

Shot Volume | Shot Region Propensity
            | Equal Balance | 3-point Heavy | Mid Heavy    | Rim Heavy  | 3-point Specialist | Rim Specialist
High        | L. James      | D. Lillard    | J. Wall      | A. Davis   | S. Curry           | A. Drummond
            | R. Westbrook  | K. Love       | D. Nowitzki  | I. Thomas  | T. Ariza           | G. Monroe
            | D. Cousins    | J. Harden     | K. Leonard   | D. Wade    | W. Matthews        | J. Okafor
Med         | L. Barbosa    | P. Beverly    | R. Rubio     | D. Favors  | K. Korver          | D. Jordan
            | L. Stephenson | E. Ilyasova   | M. Speights  | E. Turner  | J. Terry           | T. Booker
            | N. Jokic      | O. Porter     | M. Belinelli | B. Portis  | N. Mirotic         | C. Capela
Low         | A. Roberson   | J. Jerebko    | M. Muscala   | A. Miller  | J. Ingles          | A. Bogut
            | D. Motiejunas | B. Jennings   | C. Watson    | A. Varejao | J. Ennis           | B. Bass
            | K. McDaniels  | D. Augustin   | T. Prince    | D. Powell  | B. Rush            | J. McGee
We now specify the model governing the reward function in mathematical terms. Given a shot, let $M_i$ be a Bernoulli random variable with 1 = make, 0 = miss. For shot $i$,
(Equations (9)-(12): hierarchical specification of the shot make probability model.)
In this model defensive pressure is a location-specific additive effect rather than being built into the player-specific parameters. Also, the hierarchical parameters in this model are univariate normal rather than multivariate normal since we model a player’s shooting skill as being constant over the shot clock. We also assume independence across plays for shot make probabilities, which is a debated area of research (see Neiman & Loewenstein (2011), for example). Finally, shot efficiency is determined by scaling the estimated make probabilities for each state by the corresponding point value of the shot: 2 or 3 points, depending on the court region.
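The final scaling step is simple enough to show directly. In the sketch below the make probabilities are hypothetical values, not estimates from the model, and the region names follow those used earlier in the text.

```python
# Point value of a shot from each court region (backcourt omitted here).
POINT_VALUE = {"rim": 2, "paint": 2, "midrange": 2, "corner 3": 3, "arc 3": 3}


def expected_points_per_shot(make_probability, region):
    """Scale a make probability by the point value of the shot's region."""
    return make_probability * POINT_VALUE[region]


# Hypothetical make probabilities for one player.
epps_midrange = expected_points_per_shot(0.45, "midrange")
epps_arc3 = expected_points_per_shot(0.36, "arc 3")
```

With these illustrative numbers the 3-point attempt is worth more in expectation despite the lower make probability, which is the usual argument against midrange shots.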
3.3 Inference and validation
After removing plays we are not interested in modeling (plays that terminated in either fouls, timeouts, jump balls, or in the backcourt) we have 155,656 plays ( million observations) on which we fit our models. We held out a sample of approximately 28,000 plays to use for model validation. We fit our models using Stan, an open-source software package which offers a suite of MCMC methods for statistical inference (Carpenter et al., 2017; Stan, 2018). For each model we initialized two chains and let them mix long enough to ensure we had a satisfactory potential scale reduction factor for every parameter. Effective sample sizes ranged from 47 to 15,000 across the set of parameters (see the Appendix for additional diagnostics). We used diffuse gamma priors for all variance hyperparameters and uniform priors for the correlation hyperparameters.
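For readers unfamiliar with the potential scale reduction factor, the sketch below computes the classic Gelman-Rubin version for a single parameter from equal-length chains. It is an illustration only; the diagnostic reported by Stan may differ in detail (for example by splitting chains), and the simulated draws are hypothetical.

```python
import math
import random
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Classic (non-split) Gelman-Rubin statistic for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(1)
# Two chains sampling the same distribution: the statistic should be near 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(2)]
# Shifting one chain mimics chains stuck in different regions.
stuck = [mixed[0], [x + 3.0 for x in mixed[1]]]

rhat_mixed = potential_scale_reduction(mixed)
rhat_stuck = potential_scale_reduction(stuck)
```

Values close to 1 indicate that the chains agree with one another; values well above 1, as in the shifted example, indicate that they have not mixed.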
Due to the massive dimensionality of the joint posterior of the TPT model, we fit it in two stages: one stage to fit the top two layers of the hierarchy (position and global location parameters) and a second stage to fit the lower level of the hierarchy (player-specific parameters). After fitting the global location and position-specific parameters, we initialized the prior means of the player-specific parameters at the corresponding posterior means of the position-specific parameter estimates.
Next, using the posterior draws of the position-specific parameters we simulated a sample of 1000 plays at the position level (roughly six games’ worth of plays), then refit the first stage of the TPT model on these simulated plays. (Footnote 3: Details on how we simulate plays are in Section 4.) We then used the posterior mean of the position-level variance parameter of these 1000 plays as the prior variance for the player-specific parameters in the second stage of the estimation. Together, these strategies lead to a straightforward interpretation: the player-specific parameters are initialized using position-specific estimates with a weight of six games’ worth of plays. This allows us to shrink low-usage players’ transition probabilities toward the league average for their position while not swamping medium- to high-usage players with the position-specific estimates.
Table 2 shows out-of-sample log-likelihoods for four models of increasing complexity for each component of the MDP. The transition model column, $P$, represents log-likelihoods computed using only the Cleveland Cavaliers TPT model (and corresponding out-of-sample data), whereas the other columns comprise the entire league. For all three components, the models with player-specific shrinkage perform best. All subsequent references to MDP model components refer to the models in row D of Table 2.
Model  

A. Empirical model  36808  17299  5956 
B. Model A + location shrinkage  25183  36748  4572 
C. Model B + position shrinkage  24464  25099  4562 
D. Model C + player shrinkage  21552  13478  4541 
3.4 Model fit
Figure 5 shows 90% credible intervals for the transition probabilities in the top hierarchy of the TPT model. As shown in the block diagonal frames of the figure, the highest transition probabilities are to the same state, due to the predominant influence of dribbles in the data. Conversely, it is improbable for the ball to transition immediately to a state which is not directly geographically adjacent. Interestingly, the defensive pressure of the destination state appears to have a much larger impact on transition probabilities than the defensive pressure of the origin state.
The estimated shot policies and reward functions for LeBron James and Kyrie Irving of the Cleveland Cavaliers are shown in Figure 6. The strong temporal autocorrelation captured by the model significantly smooths jagged empirical policies, yielding more plausible results. The two players’ policies look quite similar, with the exception that Irving tends to take contested midrange shots more frequently than James.
Interestingly, Irving also takes contested midrange shots more frequently than he takes arc 3point shots. In general, this is considered poor shot selection because most players have a higher expected points per shot (EPPS) from beyond the arc than from the midrange. However, Irving appears to be an anomaly in this respect; his midrange reward distribution is greater than his arc 3 distribution in expectation. His estimated shot policy evidences that he knows his strengths and acts accordingly.
4 Simulating plays
Having fit the models for the MDP, our next task is to simulate plays using these models. Algorithm 1 details the conceptual structure of the simulation process. The algorithm takes as inputs the MDP components: a transition probability tensor, a shot policy, and a reward function. We use the MCMC posterior draws from Stan for these inputs which naturally propagate the uncertainty in the MDP estimation through to our simulations, making the process analogous to a posterior predictive distribution.
Our simulator also requires initial states and starting shot clock times for the plays we want to simulate. For these inputs we use the observed starting states and corresponding times on the shot clock for each team’s collection of plays in the 2015-2016 regular season. Note that we do not treat the number of plays in a season nor the states in which plays begin as random. Consequently, we do not analyze rebounding; the number of plays is fixed beforehand and once a turnover happens or a shot is taken, the play ends.
Lastly, we need a mechanism to take time off the shot clock at each step of the MDP. This component of the simulator makes performing analytical operations on the process intractable because the distribution of time lapses between events does not lend itself to a parametric distribution. Instead, we sample the empirical distribution of timelapses between events as a mechanism to simulate the time between events. We denote this empirical distribution function by .
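The pieces described in this section can be assembled into a toy simulator. The sketch below is not Algorithm 1: the data structures, the state names, the order in which time is removed from the clock, and the Bernoulli draw for made shots are all assumptions chosen to keep the example short, and every number is hypothetical. In practice one posterior draw of each MDP component would be supplied per simulated season.

```python
import random


def simulate_play(state, shot_clock, tpt, policy, reward, time_lapses, rng):
    """Simulate one play; returns (points, terminal_event, path)."""
    path = [state]
    while True:
        # Assumption: slice 0 covers 24-21 s left, slice 7 covers the last 3 s.
        k = min(int((24.0 - shot_clock) // 3), 7)
        if rng.random() < policy[k][state]:
            made = rng.random() < reward[state]["make_prob"]
            return (reward[state]["points"] if made else 0), "shot", path
        # No shot: draw the next state from this slice's transition row.
        row = tpt[k][state]
        next_states = list(row)
        state = rng.choices(next_states, weights=[row[s] for s in next_states])[0]
        if state == "turnover":
            return 0, "turnover", path
        path.append(state)
        # Take time off the clock by resampling observed time lapses.
        shot_clock -= rng.choice(time_lapses)
        if shot_clock <= 0:
            return 0, "shot clock violation", path


# Hypothetical two-state system with identical transition slices.
tpt = [{
    "A_mid": {"A_mid": 0.6, "B_arc3": 0.35, "turnover": 0.05},
    "B_arc3": {"A_mid": 0.3, "B_arc3": 0.65, "turnover": 0.05},
} for _ in range(8)]
policy = [{"A_mid": 0.03 * (k + 1), "B_arc3": 0.04 * (k + 1)} for k in range(8)]
reward = {"A_mid": {"make_prob": 0.42, "points": 2},
          "B_arc3": {"make_prob": 0.37, "points": 3}}
time_lapses = [0.4, 0.8, 1.2, 2.0]  # stand-in for the empirical distribution

rng = random.Random(42)
results = [simulate_play("A_mid", 24.0, tpt, policy, reward, time_lapses, rng)
           for _ in range(1000)]
points_per_play = sum(r[0] for r in results) / len(results)
```

Because every intermediate state is recorded in `path`, transition counts can be tallied per state pair and shot clock interval, which is what the calibration checks below rely on.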
4.1 Calibration
We can be extremely detailed in checking the calibration of our simulations since we keep track of all simulated intermediary and terminal transitions. To assess the calibration, we simulate 300 seasons for the Cleveland Cavaliers using the observed starting states of all their 2015-16 plays and compare our simulations to the actual transition counts. Note that these simulations are on-policy, meaning they are computed using variates of the shot policy estimated on the observed data. In making this comparison we must be cognizant of overfitting; the empirical model will always yield optimal calibration because the empirical model fits both trends and errors. Models with regularization may appear less calibrated, but ultimately give better fits because the modeling of errors is attenuated by the induced shrinkage.
Figure 7 shows the simulated player-aggregate transition counts for these 300 simulations for the Cavaliers’ starting lineup overlaid with the observed counts in red. Our simulations capture the aggregate transition count trends over the shot clock closely; however, they appear to be biased low for some state pairs. On the other hand, simulated transition counts for low-usage players (not shown) are generally biased high. As noted previously, these phenomena are due to shrinkage in the hierarchical model, which we are quick to note is not a model deficiency. As evidenced in Table 2, this borrowing of information improves out-of-sample model fit, giving us more reliable calibration on macro-level features.
We can also calculate correlations between the average simulated transition counts and observed transition counts. Using this measure, the simulations match on multiple accounts: 2-point shots (correlation 0.979), 3-point shots (correlation 0.965), and turnovers.
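The correlation check can be sketched as follows, assuming the measure is an ordinary Pearson correlation between per-state-pair averages and observed counts. The counts below are hypothetical and far smaller than a real season.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts for four state pairs: one observed season and
# three simulated seasons.
observed = [120, 45, 300, 10]
simulated_seasons = [
    [110, 50, 290, 14],
    [118, 42, 310, 12],
    [125, 47, 295, 9],
]
# Average the simulated counts per state pair before correlating.
simulated_mean = [sum(col) / len(col) for col in zip(*simulated_seasons)]
calibration_corr = pearson_correlation(simulated_mean, observed)
```

A high correlation indicates that the simulations rank state pairs by volume in the same way as the observed data, although it does not by itself rule out a systematic bias in level.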
5 Altering policies
With confidence that the method accurately reproduces play sequences under the observed policy model, we now simulate team-specific plays under altered shot policies. The way we implement a policy change via simulation is simple: we transform the posterior draws of the shot policy model according to our desired alteration, then simulate seasons with these modified posterior draws. However, before providing examples of altered policies we pause to discuss some relevant topics from game theory.
5.1 Game theory
5.1.1 Optimal policies
A general assumption of this paper is that teams are not operating on optimal shot policies. This is difficult to test but there is research that supports this conclusion (Goldman & Rao, 2014b; Skinner, 2012). Regarding optimal stopping times (i.e. when a player shoots during a play relative to the shot clock), Goldman & Rao (2014b) show that while on average the league as a whole closely follows the optimal curve, individual lineups are not perfect optimizers, often exhibiting a tendency to undershoot. Even under the assumption that a team is operating optimally, players and coaches could still gain utility by exploring adverse effects of changes to this policy.
5.1.2 Allocative efficiency
A player’s shot efficiency depends on the volume of opportunities he is allocated. The mathematical formulation of this concept originates in the work of Oliver (2004). Oliver defines the relationship between a player’s usage and his efficiency as a “skill curve” and suggests that it should generally exhibit a downward trend, meaning that players become less efficient as they carry more of the scoring load. This relationship is important in the context of altering shot policies. As explained in Goldman & Rao (2014a), if a team changes its shot policy to take more 3-point shots, the team has to accept lower quality 3-point opportunities on the margin. This will lead to lower expected values for these additional shots but higher expected values for the 2-point shots that by consequence have a lower usage rate due to the increase in 3’s. There is a counterbalancing relationship for policy changes due to moving up (or down) the skill curve. For our purposes, the simulation method should not bias the results of testing policy changes, as long as the changes are not drastic.
5.1.3 Defensive response
If a team makes a tactical change that gives them an advantage, it is reasonable to assume that the defense will attempt to eliminate the advantage. This defensive response brings up some important questions in the context of our project: “How sensitive are defenses to policy alterations?” and “How long does it take for a defense to respond sufficiently to make a policy change ineffectual?” These questions depend on too many variables to suggest a single answer; however, we offer some observational evidence from the past two NBA regular seasons that suggests that, in some cases, the defensive response resulting from a team’s altered shot policy does not render its strategy change ineffectual over the course of a season.
In the 2016-17 NBA regular season, the Toronto Raptors took on average 30.5% of their shots from 3-point range and they averaged 1.1 points per shot on these attempts. (Footnote 4: These statistics were gathered from stats.nba.com.) In the 2017-18 season, they took 39.6% of their shots from 3-point range. This represents a 30% increase in their team 3-point shot policy. Despite this increase, the Raptors’ expected points per shot (EPPS) from beyond the arc decreased by less than 2% (from 1.1 in 2016-17 to 1.08 in 2017-18). Additionally, the Raptors’ overall EPPS increased from 1.1 in 2016-17 to 1.14 in 2017-18. So while the policy change resulted in a small loss of efficiency (perhaps due to defensive adaptation), the response was not such that it rendered the Raptors’ policy change a zero-sum net benefit.
We acknowledge that this example is observational; these results could be due to season-to-season variability or the outcome of other variables, such as the addition of new players or the development/decline of returning players. Ultimately, predicting season outcomes for alternate policies is an extrapolation. As such, we believe that testing minor perturbations to a team’s policy will yield more credible results and that proposed changes should be carefully crafted prior to testing.
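The percentage changes quoted in the Raptors example can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the helper name `relative_change` is ours and purely illustrative.

```python
def relative_change(old, new):
    """Relative change from old to new, expressed as a fraction of old."""
    return (new - old) / old


# Figures quoted in the text for the Raptors (source: stats.nba.com).
three_point_share = relative_change(0.305, 0.396)  # share of shots from 3-point range
three_point_epps = relative_change(1.10, 1.08)     # EPPS on 3-point attempts
overall_epps = relative_change(1.10, 1.14)         # overall EPPS

print(f"3-point shot share: {three_point_share:+.1%}")
print(f"3-point EPPS:       {three_point_epps:+.1%}")
print(f"Overall EPPS:       {overall_epps:+.1%}")
```

The three relative changes come out to roughly +29.8%, -1.8%, and +3.6%, matching the "30% increase" and "less than 2%" decrease described above.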
5.2 Shot policy changes
We now show two examples of policy changes that could be explored with our methods and compare the altered policy simulations to on-policy simulations. For each policy, we simulate 300 seasons for the Cleveland Cavaliers; the results are shown in Figure 8.
Alteration 1. Reduce the contested midrange shot policy by 20% for all players on the team while more than 10 seconds remain on the shot clock.
Alteration 2. Regardless of time on the shot clock, reduce all contested midrange shot policies by 70% while doubling all three-point shot policies.
The most obvious distinction between the policies is the divergence between the contested midrange and 3-point shot distributions, which is not surprising since we directly altered these shot policies. However, in order to measure whether the policy yields a net positive result for a given team, we must quantify how the altered policy affects efficiency and production. To measure these effects, we restrict our attention to the differences in EPPS and expected points per play (EPPP). Under policy 2, shot efficiency increases (1.100 to 1.124 in EPPS) as does play production (1.004 to 1.034 in EPPP). Under policy 1, these distributions show no practical differences, largely because only 7.5% of plays end in a midrange shot with more than 10 seconds on the shot clock, which limits the potential impact.
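To make the two alterations concrete, the following is a minimal sketch of how a time-indexed shot policy could be rescaled. It is not the implementation used for the simulations: the dictionary layout, the region labels, the one-second time grid, the cap at 1.0, and all probability values are assumptions made for illustration.

```python
def alter_shot_policy(policy, factor, applies):
    """Return a copy of `policy` with selected shot probabilities rescaled.

    policy:  dict (player, region, seconds_left) -> P(shot | on-ball event)
    factor:  multiplicative change (0.8 reduces by 20%, 2.0 doubles)
    applies: predicate on (player, region, seconds_left) selecting entries
    Rescaled probabilities are capped at 1.0 (an assumption of this sketch).
    """
    return {
        key: min(1.0, p * factor) if applies(*key) else p
        for key, p in policy.items()
    }


# Hypothetical shot policy for one player; the values are invented.
base_policy = {
    ("player_a", "contested_midrange", 18): 0.10,
    ("player_a", "contested_midrange", 6): 0.20,
    ("player_a", "three_point", 18): 0.15,
    ("player_a", "three_point", 2): 0.60,
}

# Alteration 1: cut contested midrange policy by 20% with more than 10 seconds left.
alteration_1 = alter_shot_policy(
    base_policy,
    0.8,
    lambda player, region, secs: region == "contested_midrange" and secs > 10,
)

# Alteration 2: cut contested midrange by 70% and double three-point policy at all times.
alteration_2 = alter_shot_policy(
    alter_shot_policy(
        base_policy, 0.3, lambda player, region, secs: region == "contested_midrange"
    ),
    2.0,
    lambda player, region, secs: region == "three_point",
)
```

Because the shot policy is the probability that an on-ball event is a shot, the complementary probability of a non-shot event adjusts implicitly. Doubling a policy that already exceeds 0.5 is not well defined without a rule such as the cap used here.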
5.3 Passing policy changes
With a few modifications we can consider broader policy changes that encompass not only shooting but passing and movement as well. This entails altering the probabilities of non-terminating state transitions via the TPT.^{5} We now explore two altered policies of this nature; the results are shown in Figure 9.
^{5} In addition to the game-theoretic consequences mentioned previously, new complexities arise with altering passing/movement policies in the context of our model framework. Many state-transition pairs in the TPT are not physically possible (e.g. a player cannot transition directly from the backcourt to the restricted area). Also, any change where we increase player-to-player transition probabilities is potentially problematic. Passing more often to a player in a specific location hinges on the assumption that the other player is correspondingly available in that location, which is something we do not control in our model.
Alteration 3. Reduce the transitions from Irving to James by 90%.
Alteration 4. Triple the transition probabilities from all veterans to players on rookie contracts, while reducing the transition probabilities from rookie contract players to veterans by 75%.
Alteration 3 represents a pathological example in which Irving forces his way into being the dominant player on the team by almost never distributing the ball to James. The downstream effects of the dominant Irving policy lead to a 17% increase in his expected total shot count, while James’ is reduced by 14%. Interestingly, though Irving’s and James’ total shot distributions change dramatically, our method predicts that the overall differences in production would be negligible.
Alteration 4 could be described as a youth development policy, where veteran players are asked to take a back seat and players on rookie contracts are given the green light on offense. This policy change has a much larger predicted impact on production. We estimate this policy change would cost the Cavaliers 0.017 EPPP, which could have significant consequences on win totals and playoff outcomes.
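Altering a transition probability requires deciding where the displaced probability mass goes. The sketch below shows one possible rule for a single row of a TPT slice: the targeted destinations are scaled exactly by the requested factor and the remaining destinations are rescaled proportionally so the row still sums to one. This redistribution rule, the state labels, and the probabilities are all assumptions for illustration, not the procedure used to produce Figure 9.

```python
def rescale_transitions(row, factor, targets):
    """Scale the probabilities of `targets` in one TPT row by `factor`.

    row:     dict destination state -> transition probability (sums to 1)
    targets: set of destination states whose probability is multiplied by factor
    The remaining destinations are rescaled proportionally so the row still
    sums to 1; this redistribution rule is an assumption of the sketch.
    """
    target_mass = sum(p * factor for d, p in row.items() if d in targets)
    other_mass = sum(p for d, p in row.items() if d not in targets)
    if target_mass > 1.0 or other_mass <= 0.0:
        raise ValueError("cannot renormalize row after rescaling")
    adjust = (1.0 - target_mass) / other_mass
    return {d: p * factor if d in targets else p * adjust for d, p in row.items()}


# Hypothetical row for one ball handler at one shot-clock interval (invented values).
row = {
    "to_james": 0.30,
    "to_other_teammate": 0.20,
    "keep_ball": 0.50,
    "impossible_move": 0.0,  # physically impossible transition stays at zero
}

# In the spirit of Alteration 3: reduce transitions to James by 90%.
new_row = rescale_transitions(row, 0.1, {"to_james"})
```

Proportional redistribution has the convenient property that transitions with zero probability, such as physically impossible moves, remain at zero after the change.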
6 Conclusion
We have developed and implemented a method to test the impact of shot-clock-dependent policy adjustments over the course of a season at an unprecedented level of detail while accounting for model uncertainty in every aspect of the system. These methods could have immediate practical impact across multiple levels of a basketball organization. Coaches could assess proposed strategy changes outside of games rather than risking poor results by testing them in games. Front offices could explore the performance of hypothetical rosters by leveraging the position-level transition probabilities in tandem with their player-specific shot policies and reward functions. These tools could prove useful in evaluating trades and in free-agency negotiations. Additionally, our methods could enable teams to gauge the effects of having to play second-string players if any starters suffered a long-term injury. The examples we have shown in this paper are only the tip of the iceberg in terms of how these methods could be utilized.
We have primarily considered shooting decisions in this introductory work, but as shown in Section 5.3, our methodology naturally scales to include all different types of basketball decisions, allowing coaches and analysts to explore incredibly nuanced tactical changes. Additionally, with similar tracking data now available for most major sports including hockey, football, and soccer, our methods could extend to testing decision policies in other sports.
In a broader statistical context, we have provided and implemented a novel framework for modeling within-episode nonstationarity in MDPs through the use of policy and transition probability tensors. We have also shown how to combine multiple MDPs into a single weighted average process, which can enable solutions to problems that were previously impracticable to compute. Additionally, we have built a method to simulate from this type of MDP when the arrival times cannot be modeled parametrically. These contributions could be beneficial in many different areas such as traffic modeling, queuing applications, and environmental processes.
Our paper opens doors for promising further studies. In terms of reinforcement learning, a clear next step would be to solve/estimate the action-value function for the functional MDP we introduce in this paper. In the basketball context, addressing the game-theoretic aspects by incorporating usage curves and simulating defensive response could make these methods more robust.
Appendix: Derivations, Model Specification, and Diagnostics
A.1. Deriving the average Markov chain
Here we derive the average Markov chain for two independent Markov chains with overlapping state spaces and show that the expected total transition count for an arbitrary state pair from this chain is equal to the weighted sum of the separate chain expected totals.
Consider the two absorbing Markov chains $A$ and $B$ with transient states ($T_A$, $T_B$) and absorbing states ($E_A$, $E_B$), written in canonical form:
$$P_A = \begin{pmatrix} Q_A & R_A \\ 0 & I \end{pmatrix}, \qquad P_B = \begin{pmatrix} Q_B & R_B \\ 0 & I \end{pmatrix}.$$
We note that the set of transient states in $A$, $T_A$, is not equal to the set of transient states in $B$, $T_B$. Following convention, we define the fundamental matrix for chain $A$ as $N_A = (I - Q_A)^{-1}$, where $Q_A$ is the block of $P_A$ governing transitions among transient states. The $(k, i)$ entry of $N_A$ is the expected number of times that the process is in the transient state $i$ given that the episode is initialized in state $k$. Next, we define the matrix of expected state-pair transition counts, $M_A$, in a cellwise manner such that the $(i, j)$ cell of $M_A$ equals
$$[M_A]_{ij} = \sum_{k \in T_A} \pi_{A,k} \, [N_A]_{ki} \, [P_A]_{ij}, \tag{13}$$
where $k$ indexes the initial starting state of the episode, $i$ indexes the origin state, and $j$ indexes the destination state. $\pi_A$ is the initial distribution of $A$ and $[P_A]_{ij}$ is the probability of immediately transitioning to state $j$ given current state $i$. The matrix $M_B$ is defined analogously for chain $B$.
Conceptually, we will combine $M_A$ and $M_B$ proportional to the number of episodes that come from each chain using weights $w_A$ and $w_B$, then normalize the rows of the resulting matrix to create the average chain $\bar{P}$. In formal notation, define the set of state transition pairs, $\mathcal{S}$, as the outer product of the transient state space $T = T_A \cup T_B$ with the total state space $T \cup E$, where $E = E_A \cup E_B$. Specifically, $\mathcal{S} = \{(i, j) : i \in T,\ j \in T \cup E\}$. Next, we define the matrix of expected average chain transition counts, $\bar{M}$, in a cellwise manner such that the $(i, j)$ cell of $\bar{M}$ equals
$$\bar{M}_{ij} = w_A \, [M_A]_{ij} + w_B \, [M_B]_{ij}, \qquad (i, j) \in \mathcal{S}, \tag{14}$$
where $[M_A]_{ij}$ (respectively $[M_B]_{ij}$) is taken to be zero when the pair $(i, j)$ does not occur in chain $A$ (respectively $B$). Then, the $(i, j)$ cell of $\bar{P}$ is
$$\bar{P}_{ij} = \frac{\bar{M}_{ij}}{\sum_{j' \in T \cup E} \bar{M}_{ij'}}. \tag{15}$$
Clearly the rows of this matrix sum to 1 and all entries are nonnegative, thereby making it a valid TPM. By (14), it is trivial that the expected transition count for an arbitrary state pair from $\bar{P}$ is equal to the weighted average of the $A$ and $B$ expected transition counts. While we have shown this only for two chains, we can make this same argument recursively with the current iteration of $\bar{P}$ and a subsequent chain to incorporate into the average, $C$, showing that this result holds for an arbitrary number of Markov chains.
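The construction above can be checked numerically on a toy example. The sketch below is not the authors' code: the two chains, their state labels, and the weights are invented, and the expected visit counts are obtained by fixed-point iteration rather than by inverting a matrix so that the example needs only the standard library. It assumes the initial distribution of the average chain is the same weighted combination of the separate initial distributions.

```python
def expected_visits(pi, P, tol=1e-13, max_iter=100_000):
    """Expected visits to each transient state of an absorbing chain.

    pi: dict state -> initial probability (transient states only)
    P:  dict origin -> dict destination -> probability; the keys of P are the
        transient states, absorbing states appear only as destinations.
    Solves v = pi + v Q by fixed-point iteration, which is equivalent to
    multiplying pi by the fundamental matrix.
    """
    v = {s: pi.get(s, 0.0) for s in P}
    for _ in range(max_iter):
        new = {s: pi.get(s, 0.0) + sum(v[k] * P[k].get(s, 0.0) for k in P) for s in P}
        if max(abs(new[s] - v[s]) for s in P) < tol:
            return new
        v = new
    raise RuntimeError("iteration did not converge; is the chain absorbing?")


def expected_counts(pi, P):
    """Expected number of i -> j transitions per episode (cf. equation 13)."""
    v = expected_visits(pi, P)
    return {(i, j): v[i] * p for i, row in P.items() for j, p in row.items()}


def average_chain(chains, weights):
    """Weighted sum of expected counts, followed by row normalization."""
    total = {}
    for (pi, P), w in zip(chains, weights):
        for pair, c in expected_counts(pi, P).items():
            total[pair] = total.get(pair, 0.0) + w * c
    row_sums = {}
    for (i, _), c in total.items():
        row_sums[i] = row_sums.get(i, 0.0) + c
    P_bar = {}
    for (i, j), c in total.items():
        if row_sums[i] > 0.0:  # skip states that are never visited
            P_bar.setdefault(i, {})[j] = c / row_sums[i]
    return P_bar, total


# Two toy absorbing chains sharing the transient state "y" (invented values).
pi_a = {"x": 1.0}
P_a = {"x": {"y": 0.5, "shot": 0.5}, "y": {"x": 0.25, "shot": 0.75}}
pi_b = {"z": 1.0}
P_b = {"z": {"y": 0.6, "turnover": 0.4}, "y": {"z": 0.5, "shot": 0.5}}
w_a, w_b = 0.7, 0.3

P_bar, weighted_counts = average_chain([(pi_a, P_a), (pi_b, P_b)], [w_a, w_b])

# Expected counts generated by the average chain itself, started from the
# weighted initial distribution.
pi_bar = {"x": w_a, "z": w_b}
average_counts = expected_counts(pi_bar, P_bar)
max_gap = max(abs(average_counts[pair] - weighted_counts[pair]) for pair in weighted_counts)
print(f"largest discrepancy in expected counts: {max_gap:.2e}")
```

In this toy example the expected transition counts of the average chain agree with the weighted sums of the separate chains up to numerical tolerance. For larger state spaces a linear solver would be a more natural choice than the iteration used here.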
A.2. Transition tensor model specification
For a given time interval, we tally up the transitions in these play sequences yielding a matrix of transition counts between states. For each row of these transition count matrices we assume a multinomial distribution. Conditional on , let be a categorical random variable in a virtually identical space as defined above, the only difference being the one additional state that can transition to — the terminal state representing a turnover. and are also defined as they are in Section 3.1. At step ,
(16)  
(17)  
(18)  
(19) 
where is an AR(1) correlation matrix with temporal correlation parameter corresponding to its level of the hierarchy (i.e. ). Note that the model structure here is identical to that of the shot policy model. The only differences are that in this case we have a categorical likelihood rather than a Bernoulli likelihood and consequently a much larger parameter space.
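For readers unfamiliar with the AR(1) correlation structure, the sketch below builds such a matrix under the standard definition, in which the correlation between time steps $i$ and $j$ is the temporal correlation parameter raised to the power $|i - j|$. The function name, the number of time steps, and the parameter value are illustrative assumptions and are not taken from the fitted model.

```python
def ar1_correlation(n_steps, rho):
    """n_steps x n_steps AR(1) correlation matrix with entries rho ** |i - j|."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(n_steps)] for i in range(n_steps)]


# Illustrative values only: 4 time intervals, temporal correlation 0.5.
R = ar1_correlation(4, 0.5)
for row in R:
    print(row)
```

Adjacent time intervals are the most strongly correlated, and the correlation decays geometrically with the distance between intervals, which is what smooths the estimated probabilities across the shot clock.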
A.3. MCMC diagnostics
Parameter  

MCMC samples (per chain)  2000  16000  1500  11000 
Burn-in  500  1000  500  1000 
Minimum eff. sample size  80  47  371  219 
Maximum $\hat{R}$  1.037  1.096  1.004  1.014 
References
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. & Riddell, A. (2017), ‘Stan: A probabilistic programming language’, Journal of Statistical Software 76(1).
Cervone, D., D’Amour, A., Bornn, L. & Goldsberry, K. (2014), ‘A Multiresolution Stochastic Process Model for Predicting Basketball Possession Outcomes’, Journal of the American Statistical Association 111(514), 585–599.
Goldman, M. & Rao, J. (2014a), ‘Misperception of risk and incentives by experienced agents’, SSRN Scholarly Paper ID 2435551, Social Science Research Network, Rochester, NY. https://ssrn.com/abstract=2435551
Goldman, M. & Rao, J. (2014b), ‘Optimal stopping in the NBA: An empirical model of the Miami Heat’, Scholarly Paper ID 2363709, Social Science Research Network, Rochester, NY. https://ssrn.com/abstract=2363709
Goldner, K. (2012), ‘A Markov model of football: Using stochastic processes to model a football drive’, Journal of Quantitative Analysis in Sports 8(1).
Gudmundsson, J. & Horton, M. (2017), ‘Spatio-temporal analysis of team sports’, ACM Comput. Surv. 50(2), 22:1–22:34. http://doi.acm.org/10.1145/3054132
Hirotsu, N. & Wright, M. (2002), ‘Using a Markov process model of an association football match to determine the optimal timing of substitution and tactical decisions’, Journal of the Operational Research Society 53(1), 88–96. https://doi.org/10.1057/palgrave.jors.2601254
Li, W. & Ng, M. K. (2014), ‘On the limiting probability distribution of a transition probability tensor’, Linear and Multilinear Algebra 62(3), 362–385. https://doi.org/10.1080/03081087.2013.777436
Neiman, T. & Loewenstein, Y. (2011), ‘Reinforcement learning in professional basketball players’, Nature Communications 2, 569.
Oliver, D. (2004), Basketball on paper: rules and tools for performance analysis, Potomac Books, Inc.
Routley, K. & Schulte, O. (2015), ‘A Markov Game Model for Valuing Player Actions in Ice Hockey’, Uncertainty in Artificial Intelligence (UAI), pp. 782–791.
Skinner, B. (2012), ‘The problem of shot selection in basketball’, PloS one 7(1), e30776.
Skinner, B. & Goldman, M. (2015), ‘Optimal strategy in basketball’, arXiv preprint arXiv:1512.05652.
Stan Development Team (2018), ‘CmdStan: the command-line interface to Stan’. http://mc-stan.org
Štrumbelj, E. & Vračar, P. (2012), ‘Simulating a basketball match with a homogeneous Markov model and forecasting the outcome’, International Journal of Forecasting 28(2), 532–542.
Sutton, R. S. & Barto, A. G. (1998), Reinforcement learning: An introduction, MIT Press.
Thomas, A. C., Ventura, S. L., Jensen, S. T. & Ma, S. (2013), ‘Competing process hazard function models for player ratings in ice hockey’, The Annals of Applied Statistics 7(3), 1497–1524. http://www.jstor.org/stable/23566482
Vračar, P., Štrumbelj, E. & Kononenko, I. (2016), ‘Modeling basketball play-by-play data’, Expert Systems with Applications 44, 58–66.
Ward Jr, J. H. (1963), ‘Hierarchical grouping to optimize an objective function’, Journal of the American Statistical Association 58(301), 236–244.