Transition Tensor Markov Decision Processes: Analyzing Shot Policies in Professional Basketball

12/12/2018 ∙ by Nathan Sandholtz, et al. ∙ Simon Fraser University

In this paper we model basketball plays as episodes from team-specific non-stationary Markov decision processes (MDPs) with shot clock dependent transition probability tensors. Bayesian hierarchical models are employed in the modeling and parametrization of these tensors to borrow strength across players and through time. To enable computational feasibility, we combine lineup-specific MDPs into team-average MDPs using a novel transition weighting scheme. Specifically, we derive the dynamics of the team-average process such that the expected transition count for an arbitrary state-pair is equal to the weighted sum of the expected counts of the separate lineup-specific MDPs. We then utilize these non-stationary MDPs in the creation of a basketball play simulator with uncertainty propagated via posterior samples of the model components. After calibration, we simulate seasons both on-policy and under altered policies and explore the net changes in efficiency and production under the alternate policies. Additionally, we discuss the game-theoretic ramifications of testing alternative decision policies.




1 Introduction

A basketball game can be framed as a collection of episodes from complex stochastic processes. Each episode, or play, is comprised of a finite number of transitions between players and locations, ultimately terminating in a shot, turnover, or foul. An integral attribute of the game is that it is non-stationary; the transition probabilities are not constant over the 24 seconds in which a team has to shoot the ball. For example, consider the relationship between time on the shot clock, which counts down these 24 seconds, and the probability of taking a shot as shown in Figure 1. The plot in Figure 1(b) shows empirical league-average shot policies, which we define as the probability that any on-ball event (i.e. dribbles, passes, and shots) will be a shot, for the set of court regions defined in Figure 1(a). As the shot clock winds down, the probability of shooting increases, quite dramatically in the final seconds of the shot clock.

Figure 1: (a) Breakdown of court locations as used in our models and simulations. Rim: within a 6 ft radius of the center of the hoop; Paint: outside the restricted area but within the key; Mid-range: outside of the paint but within the 3-point line; Corner 3: beyond the 3-point line but below the break of the 3-point arc; Arc 3: beyond the 3-point line, above the arc 3 break, but within 30 ft. of the hoop; Backcourt: all locations beyond the arc 3 region. (b) Empirical league-average shot policies for the 2015-16 NBA regular season. We see lower shot probabilities in the mid-range and arc 3 regions because the on-ball events in these regions are dominated by passes and dribbles.

Determining optimal policies for player shooting is a critical problem in the game of basketball and it remains an active area of research (Skinner & Goldman, 2015; Goldman & Rao, 2014b). However, the inherent non-stationarity introduced by the shot clock makes assessing shot selection optimality a complex problem. In this project, we propose a method to test and compare shot policies which accounts for the dynamic nature of a basketball play. Two critical assumptions underlying our approach are that shot policies are both time-varying and malleable. Basketball analysts often focus on the less flexible component of shot efficiency — field goal percentage, or the percentage of a player’s shots that he makes. Improving shooting skill can take years of practice, whereas the shot policy is comparatively controllable; players choose where and how often they shoot when they have the ball in their possession.

Given the malleable nature of shot policies, we explore what could have happened if a player’s shot policy had changed. To enable this exploration, we model plays as episodes from latent Markov decision processes (MDPs) with dynamic within-episode transition probabilities. We approximate these functional transition probabilities via transition probability tensors (TPTs), then estimate the latent components of the MDP using Bayesian hierarchical models. Our method involves combining several Markov chains with overlapping state spaces into an average Markov chain, which we derive subject to the constraint that the expected total transition count for an arbitrary state-pair is equal to the weighted sum of the expected counts of the separate chains.

We then develop a method to simulate from these processes not simply by outcome, but rather at the sub-second level, incorporating every intermediary and terminal on-ball event over the course of a play. The uncertainty in our estimation of the MDP gets propagated into the simulations via posterior samples of the MDP model. Ultimately, our method allows us to make distributional estimates of counter-factual scenarios such as, “What could have happened if a team took contested mid-range shots less frequently early in the shot clock?” While we focus primarily on shot policies in this paper, narrowing in on a player’s choice to shoot or not at any given instant, the framework presented here can be altered to accommodate the whole space of decisions players can make on-ball, including movement and passing.

1.1 Related work and contributions

This paper adds to the growing literature of spatiotemporal analyses of team invasion sports (i.e. basketball, football, soccer, and hockey). We refer the reader to Gudmundsson & Horton (2017) for a survey. Within this body of work, Markov models have a significant presence: Goldner (2012) uses a Markov model as a framework for evaluating plays in American football; Hirotsu & Wright (2002) use Markov processes to determine optimal substitution patterns in an English Premier League match; and Thomas et al. (2013) use a semi-Markov process to model team scoring rates in hockey.

The landmark work of Cervone et al. (2014) is particularly relevant to the methods we introduce in this paper. The state space and hierarchical models we develop have similarities to the coarsened possession process they employ; however, our ultimate goals are fundamentally different. Cervone et al. aim to estimate instantaneous point values of possessions, whereas we utilize a decision process framework to estimate the macro-effects if player decisions were to change.

We approach the problem similarly to Routley & Schulte (2015), who apply a Markov game formalism to value player actions in hockey, incorporating context and a lookahead window in time. As in Routley & Schulte (2015), we do not aim to compute optimal strategies; however, we provide a basketball play simulator by which alternate policies can be explored. Since the defense is not an adversarial agent in our model but is built into the system via the probabilistic components of the MDP, this simulator is proposed as an exploratory tool as opposed to a mechanism to compute policy optima.

Several papers have endeavored to simulate a basketball game using Markov models (Štrumbelj & Vračar, 2012; Vračar et al., 2016). Our simulator is unique among these studies in a number of ways. We account for the uncertainty in every estimated parameter, propagating this uncertainty through to the simulations. Also, though Vračar et al. (2016) incorporate game-clock time, these simulation methods do not account for the inherent non-stationarity within a possession introduced by the shot clock. We propose a novel method to account for the non-stationarity of basketball plays using transition tensor Markov decision processes. By incorporating this dependency in our model, we can explore far more detailed policy changes with correspondingly more accurate results, particularly with respect to shot clock violations and time-specific policy changes within plays.

This work also contributes to the literature and practical application of discrete absorbing Markov chains. We formalize a method to construct and estimate an average chain from several independent Markov chains with overlapping state spaces. The term “average” as used here refers to a chain that yields the same number of state-to-state transition counts in expectation for all unique state pairs spanned by the set of independent chains. This allows us to reliably estimate aggregate counts across the system of chains without having to estimate each chain individually. This result is critical in making our method both parsimonious and computationally feasible.

1.2 Description of data

We use high-resolution optical tracking data collected by STATS LLC for the 2015-2016 NBA regular season. These data include the coordinates of all 10 players on the court and the coordinates of the ball at a frequency of 25 observations per second. These data are further annotated with features such as shots, passes, dribbles, fouls, etc. For this project we use observations with tagged ball-events including dribbles, passes, rebounds, turnovers, and shots. This significantly reduces the number of observations while retaining the core structure of a possession.

1.3 Outline

The rest of the paper is outlined as follows: In Section 2 we give a brief overview of Markov decision processes and frame the process in basketball terms. In Section 3 we describe how we incorporate tensors in the framework of an MDP, detail the inference procedures, and illustrate the model fit. In Section 4 we describe how we simulate plays from team-specific MDPs and show calibration results from our simulations. In Section 5 we show the results of our simulations under various altered policies and discuss potential game-theoretic ramifications of altering policies. Our concluding remarks comprise Section 6.

2 Markov decision processes

Markov decision processes are utilized in many modern reinforcement learning problems to characterize interactions between an agent and his environment. In this paper we restrict our attention to finite MDPs, which can be represented as a tuple:

$$\langle \mathcal{S}, \mathcal{A}, P(\cdot), R(\cdot) \rangle.$$

$\mathcal{S}$ represents a discrete and finite set of states. $\mathcal{A}$ represents a finite set of actions the agent can take. $P(\cdot)$ defines the transition probabilities between states, and $R(\cdot)$ defines the immediate reward the agent receives for any given state/action pair. The agent operates in the environment according to a policy, $\pi(\cdot)$, which defines the probabilities that govern the agent’s choice of action based on the current state of the environment. $\pi(\cdot)$ is the only aspect of the system that the agent controls. Typically, the agent’s goal is to maximize his long-term rewards, which he does by modifying his policy. We can define these functions succinctly in mathematical terms:

$$P(s, a, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad (1)$$

$$R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a], \qquad (2)$$

$$\pi(s, a) = \Pr(A_t = a \mid S_t = s). \qquad (3)$$


Figure 2 illustrates an MDP in the context of a basketball play. In basketball terms, we use $\pi(\cdot)$ to represent the probability that the ballcarrier takes a shot (or other actions, as we explore later) given his current state. If he takes a shot, $R(\cdot)$ dictates the expected point value of that shot. If he decides not to shoot, $P(\cdot)$ denotes the probabilities of the ball entering any other state given his current state. In our case, the governing probabilities are unobserved for each of these components and hence must be estimated. We refer the reader to Sutton & Barto (1998) for a more expansive introduction to reinforcement learning and Markov decision processes.

Figure 2: Illustration of the components of the MDP for a single player in context of a basketball play. The blue circles represent states, the solid green circles represent actions (shots), and the curved blue lines represent transition probabilities between states. The green lines of varying width connecting the blue state circles to the green action circles represent the policy. The purple lines connecting the green action circles to the squares represent the reward function. Players may pass the ball to another player (not shown) which is also a transition to a non-terminal state.

2.1 State and action space

The state space of our model, $\mathcal{S}$, is defined in the context of the ballcarrier. Following Cervone et al. (2014), at any time $t$, the state is given by the identity of the ballcarrier, his court region, and an indicator of his defensive pressure (open or contested). Court region is a function of the coordinates of the ballcarrier and can take any of the six regions shown in Figure 1(a): rim, paint, mid-range, corner 3, arc 3, or backcourt. Defensive pressure is determined by the distance of the nearest defender to the ballcarrier, with a threshold that depends on the court region of the ballcarrier: 3 ft at the rim, 3.5 ft in the paint, 4 ft in the mid-range, and 5 ft in the 3-point regions.
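As a rough illustration of this state definition, the sketch below encodes a state as a (player, region, pressure) triple. The distance thresholds are the ones quoted above; the function names, the rule that a defender at or inside the threshold counts as "contested", and the handling of the backcourt (for which no threshold is given) are assumptions made only for this example.

```python
# Thresholds (in feet) are taken from the text; everything else is illustrative.
CONTEST_THRESHOLD_FT = {
    "rim": 3.0,
    "paint": 3.5,
    "mid-range": 4.0,
    "corner 3": 5.0,
    "arc 3": 5.0,
}


def defensive_pressure(region, nearest_defender_ft):
    """Return 'contested' or 'open' for a ballcarrier in `region`."""
    threshold = CONTEST_THRESHOLD_FT.get(region)
    if threshold is None:
        # No backcourt threshold is listed in the text; treat it as open (assumption).
        return "open"
    # Assumption: a defender at or inside the threshold means 'contested'.
    return "contested" if nearest_defender_ft <= threshold else "open"


def make_state(player, region, nearest_defender_ft):
    """A state is (ballcarrier identity, court region, defensive pressure)."""
    return (player, region, defensive_pressure(region, nearest_defender_ft))
```

For instance, `make_state("L. James", "mid-range", 3.2)` would yield a contested mid-range state under these assumptions.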

As we are primarily interested in shot policies, we have chosen a binary action space; at each step in the process, the ballcarrier decides to either shoot or not shoot ($\mathcal{A} = \{\text{shoot}, \text{not shoot}\}$) according to his policy $\pi(\cdot)$. If a shot is taken the play terminates; otherwise, the subsequent transition is generated by $P(\cdot)$. Later, we explore changes to passing probabilities via perturbations to $P(\cdot)$.

2.2 Defining the average chain

Before unrolling the modeling details for the components of the MDP, we pause to explain how we have specified the data generating process. Because most teams use upwards of 500 lineups over the course of a season, we assume the transition probabilities in $P(\cdot)$ and action probabilities in $\pi(\cdot)$ are invariant to the lineup, i.e., other players do not impact transitions. This allows us to define the process at any given point in time using a single team-average transition probability matrix (TPM). We construct this team-average chain such that it yields the same number of state-pair transitions in expectation as the sum across all the independent chains for every unique state pair spanned by the set of lineups. We now briefly detail this derivation for two lineups.

Consider Markov chains for two separate lineups, $A$ and $B$, each having a set of transient states denoted $\mathcal{T}_A$ and $\mathcal{T}_B$ respectively. Let the matrix of expected state-pair transition counts for lineup $A$ (denoted by $C_A$) be defined in a cell-wise manner such that the $(j, k)$ cell of $C_A$ equals

$$C_A[j, k] = \sum_{i \in \mathcal{T}_A} \alpha_A(i) \, N_A[i, j] \, P_A[j, k],$$

where $i$ indexes the initial starting state of the episode, $j$ indexes the origin state of the state-pair of interest, and $k$ indexes the destination state of the pair. $\alpha_A$ is the initial distribution of the chain, $N_A[i, j]$ is the $(i, j)$ entry of the fundamental matrix $N_A$, and $P_A[j, k]$ is the probability of immediately transitioning to state $k$ given current state $j$. The $(i, j)$ entry of $N_A$ is the expected number of times that the process arrives in state $j$ given that the episode is initialized in state $i$. We combine $C_A$ and $C_B$ with weights $w_A$ and $w_B$ proportional to the number of episodes that come from each chain, then normalize the rows of the resulting matrix to create the average chain $\bar{P}$. It can then be shown for any state-pair $(j, k)$ that $\bar{C}[j, k] = w_A C_A[j, k] + w_B C_B[j, k]$, where $\bar{C}$ is the matrix of expected transition counts under the average chain. We have omitted some details for brevity, but a full exposition can be found in the Appendix.
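The averaging property can be checked numerically on a toy example. The sketch below is not the authors' code: it assumes the average chain starts from the weighted mixture of the two initial distributions, uses made-up transition probabilities over a shared four-state space (state 3 is absorbing), and computes expected visits by fixed-point iteration instead of inverting a matrix.

```python
def expected_visits(alpha, P, transient, iters=2000):
    """Expected visits to each state, v = alpha (I - Q)^{-1}, found by
    iterating v <- alpha + v Q (converges for an absorbing chain)."""
    n = len(alpha)
    v = list(alpha)
    for _ in range(iters):
        v = [alpha[j] + sum(v[i] * P[i][j] for i in transient) if j in transient else 0.0
             for j in range(n)]
    return v


def expected_counts(alpha, P, transient):
    """Cell (j, k) is the expected number of j -> k transitions per episode."""
    v = expected_visits(alpha, P, transient)
    n = len(P)
    return [[v[j] * P[j][k] for k in range(n)] for j in range(n)]


def average_chain(chains, weights):
    """Weight the per-lineup expected counts, then row-normalize."""
    n = len(chains[0][1])
    counts = [[0.0] * n for _ in range(n)]
    alpha_bar = [0.0] * n
    transient_bar = set()
    for (alpha, P, transient), w in zip(chains, weights):
        C = expected_counts(alpha, P, transient)
        for j in range(n):
            alpha_bar[j] += w * alpha[j]  # assumed: mixture of initial distributions
            for k in range(n):
                counts[j][k] += w * C[j][k]
        transient_bar |= set(transient)
    P_bar = [[0.0] * n for _ in range(n)]
    for j in range(n):
        row_sum = sum(counts[j])
        if row_sum > 0:
            P_bar[j] = [c / row_sum for c in counts[j]]
    return alpha_bar, P_bar, transient_bar, counts


# Hypothetical lineups: A uses states {0, 1}, B uses states {1, 2}; state 3 absorbs.
P_a = [[0.5, 0.3, 0.0, 0.2], [0.4, 0.2, 0.0, 0.4], [0.0] * 4, [0.0, 0.0, 0.0, 1.0]]
P_b = [[0.0] * 4, [0.0, 0.3, 0.3, 0.4], [0.0, 0.5, 0.1, 0.4], [0.0, 0.0, 0.0, 1.0]]
chain_a = ([1.0, 0.0, 0.0, 0.0], P_a, {0, 1})
chain_b = ([0.0, 0.5, 0.5, 0.0], P_b, {1, 2})

alpha_bar, P_bar, transient_bar, target = average_chain([chain_a, chain_b], [0.6, 0.4])
achieved = expected_counts(alpha_bar, P_bar, transient_bar)
max_diff = max(abs(achieved[j][k] - target[j][k]) for j in range(4) for k in range(4))
```

Here `target` holds the weighted sum of the two lineups' expected counts and `achieved` holds the expected counts of the row-normalized average chain; `max_diff` measures how closely they agree.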

The average chain allows us to accurately estimate aggregate counts (e.g. over the course of the regular season) across all lineups without having to estimate each lineup’s transition probabilities individually, making the problem tractable while still retaining enough detail to explore the nuanced questions we are interested in.

3 Transition and policy tensors

In most MDP applications the transition dynamics, $P(\cdot)$, are treated as being static while $R(\cdot)$ is assumed to vary temporally. However, in this paper we assume the opposite; only the reward function is time-independent. The reason for this is that in our case, time (or rather the shot clock) resets with each new episode of the process as opposed to continuing globally across episodes. We are concerned about within-episode temporal dynamics, whereas most MDP applications consider time globally. As such, the way we consider non-stationarity is quite different than how it’s primarily treated in the literature. Our framework requires a functional form of $P(\cdot)$ and $\pi(\cdot)$, whereas these are conventionally modelled statically. (Footnote 1: In reinforcement learning applications, $\pi(\cdot)$ typically gets updated as the agent learns more about his environment. In this sense $\pi(\cdot)$ is dynamic, but this is quite different than the functional form for $\pi(\cdot)$ we refer to here.)

To incorporate within-episode non-stationarity in $P(\cdot)$ and $\pi(\cdot)$, we propose using tensors to allow for dynamic transition probabilities and shot policies over the shot clock. In the stochastic processes literature, the term ‘transition probability tensor’ arises (albeit infrequently) in reference to the series of transition probability matrices induced by a higher-order Markov chain (see Li & Ng (2014) for example). This is not what we mean by this term. Rather, we refer to a transition (or policy) tensor as an approximation to a dynamic transition probability function of a continuous temporal covariate $t$. Specifically, we model $P(\cdot)$ and $\pi(\cdot)$ as tensors with 8 matrix slices, or TPMs, each representing a three-second interval of the shot clock as illustrated in Figure 3.
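The slice lookup implied by this construction can be sketched as follows. The 0-based indexing, the convention that slice 0 covers the start of the play, and the toy two-state tensor are all assumptions made for illustration.

```python
def shot_clock_interval(seconds_remaining, n_slices=8, clock_length=24.0):
    """Map time remaining on the shot clock to a slice index in 0..n_slices-1.

    Assumption: index 0 covers the start of the play (21 to 24 s remaining)
    and index 7 the final three seconds.
    """
    if not 0.0 <= seconds_remaining <= clock_length:
        raise ValueError("seconds_remaining must lie in [0, clock_length]")
    width = clock_length / n_slices
    elapsed = clock_length - seconds_remaining
    return min(int(elapsed // width), n_slices - 1)


# Hypothetical tensor: 8 slices of a 2 x 2 TPM, indexed [slice][origin][destination].
toy_tpt = [[[1.0 - 0.05 * k, 0.05 * k], [0.5, 0.5]] for k in range(8)]


def transition_row(tensor, seconds_remaining, origin):
    """Transition probabilities out of `origin` at the given shot clock time."""
    return tensor[shot_clock_interval(seconds_remaining)][origin]
```

With 10 seconds remaining, `transition_row(toy_tpt, 10.0, 0)` reads from the fifth slice (index 4) of the toy tensor.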

Figure 3: A concept illustration of a transition probability tensor for the Cleveland Cavaliers. For illustration purposes the row and column space of the tensor has been condensed to five single-player states. In our models, a typical team has a row and column space of over 200 states.

The policy tensor is virtually identical to the transition probability tensor (TPT) in form, the only difference being the column space. Since the ballcarrier makes only a binary decision at every step of the process, the shot policy (given any time $t$) is a matrix slice with row space equal to that of the corresponding TPM slice and a column space of length two (shot and no shot). This tensor framework is the key to accurately exploring the effects of altering shot policies. The efficiency of a shot is dependent on the time remaining on the shot clock, and this model framework allows us to account for this temporal dependency and tailor our policy alterations accordingly.

3.1 Tensor model specification

We employ a Bayesian hierarchical modeling approach, which provides a natural way to share strength across parameters that are alike. Note that while we fit each model independently of the others, they each employ a common hierarchical structure — player-specific parameters borrow strength from position-specific parameters (e.g. point guards, power forwards, etc.), which in turn borrow strength from global location and defensive pressure parameters. Since much of the notation and details of the policy model and transition model are redundant, we only show the specifics for the policy model in this section. The details of the transition tensor model are included in the Appendix for the interested reader.


Let $A_n$ be a Bernoulli random variable with 0 denoting ‘no shot’ and 1 denoting ‘shot’ in the $n$th step of an episode from the MDP. In our context, an episode is a sequence of events that comprise a single play from start to finish. Let $S_n$ be a categorical indicator of the ballcarrier’s current state in the space of all player/location/defense combinations. Specifically, $S_n = (x_n, y_n, z_n)$ with $x_n \in \{1, \dots, M\}$, $y_n \in \{1, \dots, L\}$, and $z_n \in \{0, 1\}$, where $M$ and $L$ represent the total numbers of players on the roster and locations respectively, and where $z_n$ is a binary indicator of defensive pressure. Next, let $t_n$ represent the interval of the shot clock in which the play falls in its $n$th step. Since we partition the 24-second shot clock into 3-second intervals, this takes on values from 1 to 8. $g_n$ will denote the position (or group) of the ballcarrier (player $x_n$) at the $n$th step of the episode, therefore $g_n \in \{1, \dots, G\}$, where $G$ is the total number of unique player positions. Note that players are nested within position, hence $g(x)$ shows that player $x$ belongs to position group $g$. At step $n$,


where each correlation matrix is an AR(1) correlation matrix with a temporal correlation parameter corresponding to its level of the hierarchy. If a shot is taken (i.e. $A_n = 1$) then the MDP episode terminates and the reward is determined by $R(\cdot)$. Otherwise, $A_n = 0$ and the next state is determined by $P(\cdot)$. (Footnote 2: Due to the extreme infrequency of backcourt shots, we don’t estimate player-specific coefficients for backcourt shot policies and field goal percentages. For notational simplicity we have omitted this technicality in the model definition.) Figure 4 shows a graphical representation of the model for $\pi(\cdot)$.
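For readers unfamiliar with the AR(1) structure, a correlation matrix of this type has entry $\rho^{|i-j|}$ in cell $(i, j)$, so adjacent shot clock intervals are more strongly correlated than distant ones. The sketch below builds such a matrix for the 8 intervals; the value of `rho` is a placeholder, not an estimate from the paper.

```python
def ar1_correlation(n, rho):
    """n x n AR(1) correlation matrix with entries rho ** |i - j|."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


# Hypothetical correlation parameter for the 8 three-second intervals.
corr = ar1_correlation(8, rho=0.9)
```

Each level of the hierarchy would use its own correlation parameter in place of the single placeholder here.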

Figure 4: Policy tensor graphical model.

3.2 Reward function

In the context of a basketball play, equation (2) can be restated as, “How many immediate points do we expect when a player in state $s$ takes action $a$?”. If the action is a shot, then this expected value is his expected points per shot from the given state, otherwise it is 0 (for simplicity, in our analysis we have omitted free throws). This allows us to define the reward function of the MDP completely in terms of a shot efficiency model.

As with $P(\cdot)$ and $\pi(\cdot)$, we use a hierarchical model for $R(\cdot)$. However, while $P(\cdot)$ and $\pi(\cdot)$ define player groups using naive player position (center, power forward, etc.), this model uses new groups on which to base the regularization. The reason for this change is that a player’s shooting skill does not have as clear a correspondence to his naive position. As such, we create customized groupings to ensure sensitivity to this variation.

We first clustered players into three categories based exclusively on the volume of shots they took over the course of the season, irrespective of the shot locations. Next, we re-clustered players into six shot propensity categories based on the proportional breakdown of their shots by court region, irrespective of volume. In both clustering procedures we used the k-means algorithm initialized at cluster centroids calculated via Ward linkage (Ward Jr, 1963). We then crossed these clusters, giving a total of 18 groups which differentiate players by how much they shoot and where they tend to shoot from. Table 1 shows three example players in each cluster.

Shot Volume | Shot Region Propensity
            | Equal Balance | 3-point Heavy | Mid Heavy    | Rim Heavy  | 3-point Specialist | Rim Specialist
------------------------------------------------------------------------------------------------------------
High        | L. James      | D. Lillard    | J. Wall      | A. Davis   | S. Curry           | A. Drummond
            | R. Westbrook  | K. Love       | D. Nowitzki  | I. Thomas  | T. Ariza           | G. Monroe
            | D. Cousins    | J. Harden     | K. Leonard   | D. Wade    | W. Matthews        | J. Okafor
Med         | L. Barbosa    | P. Beverly    | R. Rubio     | D. Favors  | K. Korver          | D. Jordan
            | L. Stephenson | E. Ilyasova   | M. Speights  | E. Turner  | J. Terry           | T. Booker
            | N. Jokic      | O. Porter     | M. Belinelli | B. Portis  | N. Mirotic         | C. Capela
Low         | A. Roberson   | J. Jerebko    | M. Muscala   | A. Miller  | J. Ingles          | A. Bogut
            | D. Motiejunas | B. Jennings   | C. Watson    | A. Varejao | J. Ennis           | B. Bass
            | K. McDaniels  | D. Augustin   | T. Prince    | D. Powell  | B. Rush            | J. McGee
Table 1: Players were independently clustered by shot volume and shot region propensity. The table shows three players in each group after crossing the resulting clusters.

We now specify the model governing the reward function in mathematical terms. Given a shot, let $Y_i$ be a Bernoulli random variable with 1 = make, 0 = miss. For shot $i$,


In this model defensive pressure is a location-specific additive effect rather than being built into the player-specific parameters. Also, the hierarchical parameters in this model are univariate normal rather than multivariate normal since we model a player’s shooting skill as being constant over the shot clock. We also assume independence across plays for shot make probabilities, which is a debated area of research (see Neiman & Loewenstein (2011) for example). Finally, shot efficiency is determined by scaling the estimated make-probabilities for each state by the corresponding point value of the shot (2 or 3 points, depending on the court region).
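The final scaling step amounts to multiplying a make probability by the point value of the region. A minimal sketch follows; the function name is illustrative, and the make probabilities used in the tests are hypothetical, not estimates from the model.

```python
# Point value of a made shot by court region (backcourt lies beyond the 3-point line).
POINT_VALUE = {
    "rim": 2,
    "paint": 2,
    "mid-range": 2,
    "corner 3": 3,
    "arc 3": 3,
    "backcourt": 3,
}


def expected_points_per_shot(make_prob, region):
    """Expected points per shot: make probability times the shot's point value."""
    if not 0.0 <= make_prob <= 1.0:
        raise ValueError("make_prob must be a probability")
    return make_prob * POINT_VALUE[region]
```

Under these definitions a hypothetical 45% mid-range shooter is worth 0.9 points per shot from that region, while a hypothetical 38% shooter from the arc is worth 1.14.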

3.3 Inference and validation

After removing plays we are not interested in modeling (plays that terminated in fouls, timeouts, jump balls, or in the backcourt) we have 155,656 plays ( million observations) on which we fit our models. We held out a sample of approximately 28,000 plays to use for model validation. We fit our models using Stan, an open-source software package which offers a suite of MCMC methods for statistical inference (Carpenter et al., 2017; Stan, 2018). For each model we initialized two chains and let them run long enough to ensure an acceptable potential scale reduction factor for every parameter. Effective sample sizes ranged from 47 to 15,000 across the set of parameters (see the Appendix for additional diagnostics). We used diffuse gamma priors for all variance hyperparameters and uniform priors for the correlation hyperparameters.

Due to the massive dimensionality of the joint posterior of the TPT model, we fit it in two stages: one stage to fit the top two layers of the hierarchy (position and global location parameters) and a second stage to fit the lower level of the hierarchy (player-specific parameters). After fitting the global location and position-specific parameters, we initialized the prior means of the player-specific parameters at the corresponding posterior means of the position-specific parameter estimates.

Next, using the posterior draws of the position-level parameters, we simulated a sample of 1000 plays at the position level (roughly six games' worth of plays), then refit the first stage of the TPT model on these simulated plays. (Footnote 3: Details on how we simulate plays are in Section 4.) We then used the posterior mean of the position-level variance parameter of these 1000 plays as the prior variance for the player-specific parameters in the second stage of the estimation. Together, these strategies lead to a straightforward interpretation: the player-specific parameters are initialized using position-specific estimates with a weight of six games' worth of plays. This allows us to shrink low-usage players’ transition probabilities toward the league average for their position while not swamping medium- to high-usage players with the position-specific estimates.

Table 2 shows out-of-sample log-likelihoods for four models of increasing complexity for each component of the MDP. The transition model column, $P(\cdot)$, represents log-likelihoods computed using only the Cleveland Cavaliers TPT model (and corresponding out-of-sample data), whereas the other columns comprise the entire league. For all three components, the models with player-specific shrinkage perform best. All subsequent references to MDP model components refer to the models in row D of Table 2.

A. Empirical model              | -36808 | -17299 | -5956
B. Model A + location shrinkage | -25183 | -36748 | -4572
C. Model B + position shrinkage | -24464 | -25099 | -4562
D. Model C + player shrinkage   | -21552 | -13478 | -4541
Table 2: Out-of-sample log-likelihoods for four models of increasing complexity over each component of the MDP.

3.4 Model fit

Figure 5 shows 90% credible intervals for the transition probabilities in the top hierarchy of the TPT model. As shown in the block diagonal frames of the figure, the highest transition probabilities are to the same state, due to the predominant influence of dribbles in the data. Conversely, it is improbable for the ball to transition immediately to a state which is not directly geographically adjacent. Interestingly, the defensive pressure of the destination state appears to have a much larger impact on transition probabilities than the defensive pressure of the origin state.

Figure 5: Estimated league-wide transition probability tensor for the top level of the hierarchy on which each team’s TPT is built. Within each plot frame, the 90% credible interval of the origin to destination transition probability is shown — the x-axis represents time on the shot clock, while the y-axis represents the transition probability. Across plot frames, the y-axis represents the origin state and the x-axis represents the destination state. Corner 3, paint, and rim states are omitted to maintain a practical size for the figure.

The estimated shot policies and reward functions for LeBron James and Kyrie Irving of the Cleveland Cavaliers are shown in Figure 6. The strong temporal autocorrelation captured by the model significantly smooths jagged empirical policies, yielding more plausible results. The two players’ policies look quite similar, with the exception that Irving tends to take contested mid-range shots more frequently than James.

Figure 6: Estimated shot policies (95% credible intervals) and reward functions (posterior densities) for LeBron James and Kyrie Irving in three sample states. The shot policies are overlaid with dots corresponding to their empirical shot policies; the sizes of the dots are relative to how many shots they took within that time interval from the indicated state. The reward functions are also overlaid with the empirical points per shot, and the number of shots they took from each state is given in the legend.

Interestingly, Irving also takes contested mid-range shots more frequently than he takes arc 3-point shots. In general, this is considered poor shot selection because most players have a higher expected points per shot (EPPS) from beyond the arc than from the mid-range. However, Irving appears to be an anomaly in this respect; his mid-range reward distribution is greater than his arc 3 distribution in expectation. His estimated shot policy evidences that he knows his strengths and acts accordingly.

4 Simulating plays

Having fit the models for the MDP, our next task is to simulate plays using these models. Algorithm 1 details the conceptual structure of the simulation process. The algorithm takes as inputs the MDP components: a transition probability tensor, a shot policy, and a reward function. We use the MCMC posterior draws from Stan for these inputs which naturally propagate the uncertainty in the MDP estimation through to our simulations, making the process analogous to a posterior predictive distribution.

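Inputs: $\pi$, $P$, $R$, $\hat{F}$ (empirical time-lapse distribution), $s_0$ (initial state), $c_0$ (shot clock at play-start)
Output: Tensor of counts of simulated states (terminal and intermediary), actions, and rewards (expected point values of state-action pairs)
i = 0;
while $s_i \neq$ Turnover do
       $a_i$ = Binomial draw from $\pi(s_i, c_i)$;
       lapse = draw from $\hat{F}$;
       $c_{i+1}$ = $c_i$ - lapse;
       if $c_{i+1}$ < 0 then
             $s_{i+1}$ = Turnover (shot clock violation);
       else if $a_i$ = Shot then
             record reward $R(s_i, a_i)$;
             break loop;
       else
             $s_{i+1}$ = Multinomial draw from $P(s_i, c_i)$;
       end if
       i = i + 1;
end while
Algorithm 1 Play Simulator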
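A minimal runnable sketch of a play simulator in the spirit of Algorithm 1 is given below. It is not the authors' implementation: the data structures (lists of per-interval dictionaries), the state names, the toy probabilities, and the choice to look up the policy and transition slices using the shot clock before the time lapse is subtracted are all assumptions made for illustration.

```python
import random

TURNOVER = "turnover"


def interval(clock, n_slices=8, clock_length=24.0):
    """0-based index of the three-second shot clock interval (0 = start of play)."""
    width = clock_length / n_slices
    return min(int((clock_length - clock) // width), n_slices - 1)


def simulate_play(policy, transitions, reward, lapses, start_state, start_clock, rng):
    """Simulate one play; returns a list of (state, event, reward) tuples.

    policy[t][s]      -> probability of shooting from state s in interval t
    transitions[t][s] -> dict mapping next state (or TURNOVER) to probability
    reward[s]         -> expected points per shot from state s
    lapses            -> observed time lapses (seconds) to resample from
    """
    state, clock = start_state, start_clock
    events = []
    while True:
        t = interval(clock)
        shoot = rng.random() < policy[t][state]   # Bernoulli action draw
        clock -= rng.choice(lapses)               # empirical time-lapse draw
        if clock < 0:
            events.append((state, "shot clock violation", 0.0))
            return events
        if shoot:
            events.append((state, "shot", reward[state]))
            return events
        events.append((state, "no shot", 0.0))
        options = list(transitions[t][state])
        probs = [transitions[t][state][s] for s in options]
        state = rng.choices(options, weights=probs, k=1)[0]  # multinomial draw
        if state == TURNOVER:
            events.append((TURNOVER, "turnover", 0.0))
            return events


# Hypothetical two-state team; shot probabilities rise as the clock winds down.
toy_policy = [{"guard_arc3_open": 0.05 + 0.1 * t, "big_rim_contested": 0.15 + 0.1 * t}
              for t in range(8)]
toy_transitions = [{
    "guard_arc3_open": {"guard_arc3_open": 0.60, "big_rim_contested": 0.35, TURNOVER: 0.05},
    "big_rim_contested": {"guard_arc3_open": 0.45, "big_rim_contested": 0.45, TURNOVER: 0.10},
} for _ in range(8)]
toy_reward = {"guard_arc3_open": 1.1, "big_rim_contested": 1.2}
toy_lapses = [0.4, 0.8, 1.2, 2.0]

rng = random.Random(0)
plays = [simulate_play(toy_policy, toy_transitions, toy_reward, toy_lapses,
                       "guard_arc3_open", 24.0, rng) for _ in range(200)]
```

To propagate estimation uncertainty as described above, one would call `simulate_play` with a different posterior draw of the policy, transition, and reward components for each simulated season, instead of the fixed toy inputs used here.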

Our simulator also requires initial states and starting shot clock times for the plays we want to simulate. For these inputs we use the observed starting states and corresponding times on the shot clock for each team’s collection of plays in the 2015-2016 regular season. Note that we do not treat the number of plays in a season nor the states in which plays begin as random. Consequently, we do not analyze rebounding; the number of plays is fixed beforehand and once a turnover happens or a shot is taken, the play ends.

Lastly, we need a mechanism to take time off the shot clock at each step of the MDP. This component of the simulator makes performing analytical operations on the process intractable because the distribution of time lapses between events does not lend itself to a parametric distribution. Instead, we sample the empirical distribution of time lapses between events as a mechanism to simulate the time between events. We denote this empirical distribution function by $\hat{F}$.

4.1 Calibration

We can be extremely detailed in checking the calibration of our simulations since we keep track of all simulated intermediary and terminal transitions. To assess the calibration, we simulate 300 seasons for the Cleveland Cavaliers using the observed starting states of all their 2015-16 plays and compare our simulations to the actual transition counts. Note that these simulations are on-policy, meaning they are computed using variates of the shot policy estimated on the observed data. In making this comparison we must be cognizant of overfitting; the empirical model will always yield optimal calibration because the empirical model fits both trends and errors. Models with regularization may appear less calibrated, but ultimately give better fits because the modeling of errors is attenuated by the induced shrinkage.

Figure 7 shows the simulated player-aggregate transition counts for these 300 simulations for the Cavaliers’ starting lineup overlaid with the observed counts in red. Our simulations capture the aggregate transition count trends over the shot clock with high fidelity; however, they appear to be biased low for some state pairs. On the other hand, simulated transition counts for low-usage players (not shown) are generally biased high. As noted previously, these phenomena are due to shrinkage in the hierarchical model, which is not a model deficiency. As evidenced in Table 2, this borrowing of information improves out-of-sample model fit, giving us more reliable calibration on macro-level features.

We can also calculate correlations between the average simulated transition counts and observed transition counts. Using this measure, the simulations match on multiple accounts: 2-point shots (correlation 0.979), 3-point shots (correlation 0.965), and turnovers.
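A correlation of this kind could be computed along the following lines. This is a sketch under stated assumptions: the simulated counts are taken to be stored as one row per simulated season and one column per count cell (for example, per shot clock second), the Pearson correlation is used, and the data below are synthetic. The function and variable names are hypothetical.

```python
import numpy as np

def calibration_correlation(simulated, observed):
    """Pearson correlation between mean simulated counts and observed counts.

    simulated : array of shape (n_seasons, n_cells)
    observed  : array of shape (n_cells,)
    """
    mean_simulated = simulated.mean(axis=0)
    return float(np.corrcoef(mean_simulated, observed)[0, 1])

# Synthetic stand-in data: 300 simulated seasons scattered around the observed counts.
rng = np.random.default_rng(0)
observed = rng.poisson(50, size=24).astype(float)
simulated = rng.poisson(observed, size=(300, 24))
r = calibration_correlation(simulated, observed)
```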

Figure 7: 300 simulated (gray) season-total transition counts over the shot clock overlaid with the corresponding observed counts (red) for the 2015-2016 season. Within each plot frame, the x-axis represents time on the shot clock, while the y-axis represents total transition counts. Across plot frames, the y-axis represents the origin state and the x-axis represents the destination state.

5 Altering policies

With confidence that the method accurately reproduces play sequences under the observed policy model, we now simulate team-specific plays under altered shot policies. Implementing a policy change via simulation is simple: we transform the posterior draws of the shot policy model according to our desired alteration, then simulate seasons with these modified posterior draws. However, before providing examples of altered policies we pause to discuss some relevant topics from game theory.

5.1 Game theory

5.1.1 Optimal policies

A general assumption of this paper is that teams are not operating on optimal shot policies. This is difficult to test, but there is research that supports this conclusion (Goldman & Rao, 2014b; Skinner, 2012). Regarding optimal stopping times (i.e. when a player shoots during a play relative to the shot clock), Goldman & Rao (2014b) show that while the league as a whole closely follows the optimal curve on average, individual lineups are not perfect optimizers, often exhibiting a tendency to undershoot. Even under the assumption that a team is operating optimally, players and coaches could still gain utility by exploring adverse effects of changes to this policy.

5.1.2 Allocative efficiency

A player’s shot efficiency depends on the volume of opportunities he is allocated. The mathematical formulation of this concept originates in the work of Oliver (2004). Oliver defines the relationship between a player’s usage and his efficiency as a “skill curve” and suggests that it should generally exhibit a downward trend, meaning that players become less efficient as they carry more of the scoring load. This relationship is important in the context of altering shot policies. As explained in Goldman & Rao (2014a), if a team changes its shot policy to take more 3-point shots, the team has to accept lower quality 3-point opportunities on the margin. This will lead to lower expected values for these additional shots but higher expected values for the 2-point shots, which consequently have a lower usage rate due to the increase in 3-point attempts. There is thus a counter-balancing relationship for policy changes due to moving up (or down) the skill curve. For our purposes, the simulation method should not bias the results of testing policy changes, as long as the changes are not drastic.

5.1.3 Defensive response

If a team makes a tactical change that gives them an advantage, it is reasonable to assume that the defense will attempt to eliminate the advantage. This defensive response raises two important questions in the context of our project: “How sensitive are defenses to policy alterations?” and “How long does it take for a defense to respond sufficiently to make a policy change ineffectual?” These questions depend on too many variables to suggest a single answer; however, we offer some observational evidence from the past two NBA regular seasons that suggests that, in some cases, the defensive response resulting from a team’s altered shot policy does not render its strategy change ineffectual over the course of a season.

In the 2016-17 NBA regular season, the Toronto Raptors took on average 30.5% of their shots from 3-point range and they averaged 1.1 points per shot on these attempts (see footnote 4). In the 2017-18 season, they took 39.6% of their shots from 3-point range. This represents a 30% increase in their team 3-point shot policy. Despite this increase, the Raptors’ expected points per shot (EPPS) from beyond the arc decreased by less than 2% (from 1.1 in 2016-17 to 1.08 in 2017-18). Additionally, the Raptors’ overall EPPS increased from 1.1 in 2016-17 to 1.14 in 2017-18. So while the policy change resulted in a small loss of 3-point efficiency (perhaps due to defensive adaptation), the response did not erase the net benefit of the Raptors’ policy change.

Footnote 4: These statistics were gathered from [...]
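As a check on the percentages quoted above, the relative changes follow directly from the reported figures (the first is the change in 3-point shot share, the second the change in 3-point EPPS, the third the change in overall EPPS):

```latex
\[
\frac{39.6 - 30.5}{30.5} \approx 0.298,
\qquad
\frac{1.10 - 1.08}{1.10} \approx 0.018,
\qquad
\frac{1.14 - 1.10}{1.10} \approx 0.036 .
\]
```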

We acknowledge that this example is observational; these results could be due to season-to-season variability or the outcome of other variables, such as the addition of new players or the development/decline of returning players. Ultimately, predicting season outcomes for alternate policies is an extrapolation. As such, we believe that testing minor perturbations to a team’s policy will yield more credible results and that proposed changes should be carefully crafted prior to testing.

5.2 Shot policy changes

We now show two examples of policy changes that could be explored with our methods and compare the altered policy simulations to on-policy simulations. For each policy, we simulate 300 seasons for the Cleveland Cavaliers; the results are shown in Figure 8.

Alteration 1. Reduce the contested mid-range shot policy by 20% for all players on the team while more than 10 seconds remain on the shot clock.
Alteration 2. Regardless of time on the shot clock, reduce all contested mid-range shot policies by 70% while doubling all three-point shot policies.
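The two alterations can be expressed as transformations of posterior draws of the shot policy. The sketch below assumes the draws are stored as an array indexed by (draw, seconds remaining, state) and that "reducing a policy by 20%" means multiplying the shot probability by 0.8 on the probability scale. Both the storage layout and the multiplicative interpretation are assumptions of this example, as are all names and the toy state set.

```python
import numpy as np

def alter_shot_policy(draws, clock_mask, state_mask, factor):
    """Scale shot probabilities for the selected clock bins and states.

    draws      : array (n_draws, n_clock_bins, n_states) of shot probabilities
    clock_mask : boolean array (n_clock_bins,) selecting affected clock bins
    state_mask : boolean array (n_states,) selecting affected states
    factor     : multiplicative change (0.8 = reduce by 20%, 2.0 = double)
    """
    altered = draws.copy()
    cell_mask = clock_mask[:, None] & state_mask[None, :]
    # Clip so that scaled values remain valid probabilities.
    altered[:, cell_mask] = np.clip(altered[:, cell_mask] * factor, 0.0, 1.0)
    return altered

# Toy setup: 3 states (contested mid-range, other 2-point, 3-point), 24 clock bins.
seconds_left = np.arange(1, 25)
is_midrange = np.array([True, False, False])
is_three = np.array([False, False, True])
any_time = np.ones(24, dtype=bool)
draws = np.full((4, 24, 3), 0.2)

# Alteration 1: mid-range down 20% while more than 10 seconds remain.
alt1 = alter_shot_policy(draws, seconds_left > 10, is_midrange, 0.8)
# Alteration 2: mid-range down 70% and three-point policies doubled, at all times.
alt2 = alter_shot_policy(alter_shot_policy(draws, any_time, is_midrange, 0.3),
                         any_time, is_three, 2.0)
```

Seasons would then be simulated with `alt1` or `alt2` in place of the original draws, so that posterior uncertainty carries through to the altered-policy results.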

Figure 8: Left to right: distribution of simulated contested mid-range shots, 3-point shots, expected points per shot, and expected points per play.

The most obvious distinction between the policies is the divergence between the contested mid-range and 3-point shot distributions, which is not surprising since we directly altered these shot policies. However, in order to measure whether the policy yields a net positive result for a given team, we must quantify how the altered policy affects efficiency and production. To measure these effects, we restrict our attention to the differences in EPPS and expected points per play (EPPP). Under Alteration 2, shot efficiency increases (1.100 to 1.124 in EPPS) as does play production (1.004 to 1.034 in EPPP). Under Alteration 1, these distributions show no practical differences, largely because only 7.5% of plays end in a mid-range shot with more than 10 seconds on the shot clock, which limits the potential impact.

5.3 Passing policy changes

With a few modifications we can consider broader policy changes that encompass not only shooting but passing and movement as well. This entails altering the probabilities of non-terminating state transitions via the TPT (see footnote 5). We now explore two altered policies of this nature; the results are shown in Figure 9.

Footnote 5: In addition to the game theoretic consequences mentioned previously, new complexities arise with altering passing/movement policies in the context of our model framework. Many state-transition pairs in the TPT are not physically possible (e.g. a player cannot transition directly from the backcourt to the restricted area). Also, any change where we increase player-to-player transition probabilities is potentially problematic. Passing more often to a player in a specific location hinges on the assumption that the other player is correspondingly available in that location, which is something we do not control in our model.

Alteration 3. Reduce the transitions from Irving to James by 90%.
Alteration 4. Triple the transition probabilities from all veterans to players on rookie contracts, while reducing the transition probabilities from rookie contract players to veterans by 75%.
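One way to implement an alteration such as Alteration 3 is to scale the targeted transition probability and rescale the remaining entries of the same row so that it still sums to one. The sketch below does this for a tensor indexed by (clock bin, origin, destination). The proportional redistribution of the freed probability mass is an assumption of this example, since the text does not specify how rows are renormalized, and the function name and toy tensor are hypothetical. It also assumes the targeted entry is strictly less than 1.

```python
import numpy as np

def scale_transition(tpt, origin, dest, factor):
    """Multiply tpt[:, origin, dest] by factor and rescale the rest of each row.

    tpt : array (n_clock_bins, n_origins, n_destinations) whose rows sum to 1.
    """
    out = tpt.copy()
    old = out[:, origin, dest].copy()
    new = np.clip(old * factor, 0.0, 1.0)
    remainder_old = 1.0 - old
    # Rescale the other destinations to fill the remaining mass 1 - new.
    scale = np.divide(1.0 - new, remainder_old,
                      out=np.ones_like(remainder_old), where=remainder_old > 0)
    out[:, origin, :] *= scale[:, None]
    out[:, origin, dest] = new
    return out

# Toy tensor: 2 clock bins, 3 origin states, 3 destinations.
rng = np.random.default_rng(2)
tpt = rng.dirichlet(np.ones(3), size=(2, 3))
# Reduce transitions from state 0 to state 1 by 90%.
tpt_alt = scale_transition(tpt, origin=0, dest=1, factor=0.1)
```

Setting the target entry first and rescaling the others keeps the reduction at exactly 90%, whereas scaling and then renormalizing the whole row would shrink the reduction slightly.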

Figure 9: Left to right: distributions of simulated transitions from Irving to James, Irving’s total shots, James’ total shots, and expected points per play.

Alteration 3 represents a pathological example in which Irving forces his way into being the dominant player on the team by almost never distributing the ball to James. The downstream effects of the dominant Irving policy lead to a 17% increase in his expected total shot count, while James’ is reduced by 14%. Interestingly, though Irving’s and James’ total shot distributions change dramatically, our method predicts that the overall differences in production would be negligible.

Alteration 4 could be described as a youth development policy, where veteran players are asked to take a back seat and players on rookie contracts are given the green light on offense. This policy change has a much larger predicted impact on production. We estimate this policy change would cost the Cavaliers 0.017 EPPP, which could have significant consequences on win totals and playoff outcomes.

6 Conclusion

We have developed and implemented a method to test the impact of shot-clock dependent policy adjustments over the course of a season at an unprecedented level of detail while accounting for model uncertainty in every aspect of the system. These methods could have immediate practical impact across multiple levels of a basketball organization. Coaches could assess proposed strategy changes outside of games rather than risking poor results by testing them in games. Front offices could explore the performance of hypothetical rosters by leveraging the position-level transition probabilities in tandem with their player-specific shot policies and reward functions. These tools could prove useful in evaluating trades and in free-agency negotiations. Additionally, our methods could enable teams to gauge the effects of having to play second-string players if any starters suffered a long-term injury. The examples we have shown in this paper are only the tip of the iceberg in terms of how these methods could be utilized.

We have primarily considered shooting decisions in this introductory work, but as shown in Section 5.3, our methodology naturally scales to include all different types of basketball decisions, allowing coaches and analysts to explore incredibly nuanced tactical changes. Additionally, with similar tracking data now available for most major sports including hockey, football, and soccer, our methods could extend to testing decision policies in other sports.

In a broader statistical context, we have provided and implemented a novel framework for modeling within-episode non-stationarity in MDPs through the use of policy and transition probability tensors. We have also shown how to combine multiple MDPs into a single weighted average process, which can enable solutions to problems that were previously impracticable to compute. Additionally, we have built a method to simulate from this type of MDP when the arrival times cannot be modeled parametrically. These contributions could be beneficial in many different areas such as traffic modeling, queuing applications, and environmental processes.

Our paper opens doors for promising further studies. In terms of reinforcement learning, a clear next step would be to solve/estimate the action-value function for the functional MDP we introduce in this paper. In the basketball context, addressing the game theoretic aspects by incorporating usage curves and simulating defensive response could make these methods more robust.

Appendix: Derivations, Model Specification, and Diagnostics

A.1. Deriving the average Markov chain

Here we derive the average Markov chain for two independent Markov chains with overlapping state spaces and show that the expected total transition count for an arbitrary state-pair from this chain is equal to the weighted sum of the separate chain expected totals.

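Consider the two absorbing Markov chains $P_A$ and $P_B$ with transient states ($\mathcal{T}_A$, $\mathcal{T}_B$) and absorbing states ($\mathcal{R}_A$, $\mathcal{R}_B$), written in canonical form:

$$P_A = \begin{pmatrix} Q_A & R_A \\ 0 & I \end{pmatrix}, \qquad P_B = \begin{pmatrix} Q_B & R_B \\ 0 & I \end{pmatrix}.$$

We note that the set of transient states in $P_A$ is not equal to the set of transient states in $P_B$. Following convention, we define the fundamental matrix for chain $X$ as $N_X = (I - Q_X)^{-1}$, where $X \in \{A, B\}$. The $(i, j)$ entry of $N_X$ is the expected number of times that the process is in the transient state $j$ given that the episode is initialized in state $i$. Next, we define the matrix of expected state-pair transition counts, $C_X$, in a cell-wise manner such that the $(j, k)$ cell of $C_X$ equals

$$C_X(j, k) = \sum_{i \in \mathcal{T}_X} \pi_X(i)\, N_X(i, j)\, P_X(j, k),$$

where $i$ indexes the initial starting state of the episode, $j$ indexes the origin state, and $k$ indexes the destination state. $\pi_X$ is the initial distribution of $P_X$ and $P_X(j, k)$ is the probability of immediately transitioning to state $k$ given current state $j$.

Conceptually, we will combine $C_A$ and $C_B$ proportional to the number of episodes that come from each chain using weights $w_A$ and $w_B$, then normalize the rows of the resulting matrix to create the average chain $P_{AB}$. In formal notation, define the set of state transition pairs, $\mathcal{S}$, as the outer product of the transient state space with the total state space. Specifically, $\mathcal{S} = (\mathcal{T}_A \cup \mathcal{T}_B) \times (\mathcal{T}_A \cup \mathcal{T}_B \cup \mathcal{R}_A \cup \mathcal{R}_B)$. Next, we define the matrix of expected average chain transition counts, $C_{AB}$, in a cell-wise manner such that the $(j, k)$ cell of $C_{AB}$ equals

$$C_{AB}(j, k) = w_A\, C_A(j, k) + w_B\, C_B(j, k), \qquad (j, k) \in \mathcal{S}, \tag{17}$$

where $C_X(j, k)$ is taken to be 0 whenever the pair $(j, k)$ lies outside the state space of chain $X$.

Then, the $(j, k)$ cell of $P_{AB}$ is

$$P_{AB}(j, k) = \frac{C_{AB}(j, k)}{\sum_{k'} C_{AB}(j, k')}.$$

Clearly the rows of this matrix sum to 1 and all entries are non-negative, thereby making it a valid TPM. By (17), it is trivial that the expected transition count for an arbitrary state-pair from $P_{AB}$ is equal to the weighted average of the $P_A$ and $P_B$ expected transition counts. While we have shown this only for two chains, we can make this same argument recursively with the current iteration of $P_{AB}$ and a subsequent chain to incorporate into the average, $P_C$, showing that this result holds for an arbitrary number of Markov chains.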
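The construction can be checked numerically. The sketch below builds the average of two small absorbing chains and confirms that its expected state-pair transition counts equal the weighted sum of the separate chains' counts. It makes several simplifying assumptions not stated in the text: both chains share the same state space (transient states listed first), the weights sum to one, the average chain is started from the weighted mixture of the two initial distributions, and every transient state has positive expected visits. All names and numbers are illustrative.

```python
import numpy as np

def expected_counts(P, pi, n_t):
    """Expected state-pair transition counts per episode of an absorbing chain.

    P   : full transition matrix in canonical form (transient states first)
    pi  : initial distribution over the n_t transient states
    """
    Q = P[:n_t, :n_t]
    N = np.linalg.inv(np.eye(n_t) - Q)   # fundamental matrix
    visits = pi @ N                      # expected visits to each transient state
    return visits[:, None] * P[:n_t, :]  # cell (j, k) = visits[j] * P[j, k]

def average_chain(P_a, pi_a, w_a, P_b, pi_b, w_b, n_t):
    """Row-normalize the weighted expected counts to get the average chain."""
    C_ab = w_a * expected_counts(P_a, pi_a, n_t) + w_b * expected_counts(P_b, pi_b, n_t)
    P_ab = np.eye(P_a.shape[0])          # absorbing rows stay as identity
    P_ab[:n_t, :] = C_ab / C_ab.sum(axis=1, keepdims=True)
    pi_ab = w_a * pi_a + w_b * pi_b      # assumed mixture of initial distributions
    return P_ab, pi_ab, C_ab

# Two toy chains with 2 transient and 2 absorbing states.
P_a = np.array([[0.2, 0.5, 0.2, 0.1],
                [0.3, 0.1, 0.4, 0.2],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
P_b = np.array([[0.5, 0.2, 0.1, 0.2],
                [0.1, 0.6, 0.2, 0.1],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
pi_a, pi_b = np.array([0.7, 0.3]), np.array([0.4, 0.6])
P_ab, pi_ab, C_ab = average_chain(P_a, pi_a, 0.6, P_b, pi_b, 0.4, n_t=2)
C_from_average = expected_counts(P_ab, pi_ab, 2)
```

Here `C_from_average` agrees with `C_ab` up to floating point error, which is the equality the derivation asserts.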

A.2. Transition tensor model specification

For a given time interval, we tally up the transitions in these play sequences yielding a matrix of transition counts between states. For each row of these transition count matrices we assume a multinomial distribution. Conditional on , let be a categorical random variable in a virtually identical space as defined above, the only difference being the one additional state that can transition to — the terminal state representing a turnover. and are also defined as they are in Section 3.1. At step ,


where is an AR(1) correlation matrix with temporal correlation parameter corresponding to its level of the hierarchy (i.e. ). Note that the model structure here is identical to that of the shot policy model. The only differences are that in this case we have a categorical likelihood rather than a Bernoulli likelihood and consequently a much larger parameter space.
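Two ingredients of this specification are easy to illustrate: tallying transitions into one count matrix per shot clock interval, and building an AR(1) correlation matrix, whose $(s, t)$ entry is $\rho^{|s - t|}$. The sketch below assumes events are available as (seconds remaining, origin index, destination index) tuples and that the shot clock is split into equal-width intervals. These layouts and all names are assumptions for illustration, not the authors' data format.

```python
import numpy as np

def tally_transitions(events, n_intervals, n_origin, n_dest, clock_max=24.0):
    """Count transitions into an (interval, origin, destination) tensor."""
    counts = np.zeros((n_intervals, n_origin, n_dest), dtype=int)
    for clock, origin, dest in events:
        # Equal-width intervals; a clock reading of clock_max falls in the last one.
        t = min(int(clock * n_intervals / clock_max), n_intervals - 1)
        counts[t, origin, dest] += 1
    return counts

def ar1_correlation(n, rho):
    """AR(1) correlation matrix: entry (s, t) equals rho ** |s - t|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Toy events: (seconds remaining, origin state, destination state).
events = [(23.5, 0, 1), (23.9, 0, 1), (2.5, 1, 2)]
counts = tally_transitions(events, n_intervals=24, n_origin=2, n_dest=3)
R = ar1_correlation(3, 0.5)
```

Each row `counts[t, origin, :]` is then the vector of counts to which the multinomial likelihood applies, and `R` plays the role of the temporal correlation structure across intervals.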

A.3. MCMC diagnostics

MCMC samples (per chain)     2000    16000     1500    11000
Burn-in                       500     1000      500     1000
Minimum eff. sample size       80       47      371      219
Maximum $\hat{R}$           1.037    1.096    1.004    1.014
Table 3: MCMC details and diagnostics for each fitted model. and refer to the first and second stage of fitting the Cleveland Cavaliers TPT, respectively.


  • Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. & Riddell, A. (2017), ‘Stan: A probabilistic programming language’, Journal of statistical software 76(1).
  • Cervone et al. (2014) Cervone, D., D’Amour, A., Bornn, L. & Goldsberry, K. (2014), ‘A Multiresolution Stochastic Process Model for Predicting Basketball Possession Outcomes’, Journal of the American Statistical Association 111(514), 585–599.
  • Goldman & Rao (2014a) Goldman, M. & Rao, J. (2014a), ‘Misperception of risk and incentives by experienced agents’, SSRN Scholarly Paper ID 2435551, Social Science Research Network, Rochester, NY.
  • Goldman & Rao (2014b) Goldman, M. & Rao, J. (2014b), ‘Optimal stopping in the NBA: An empirical model of the Miami Heat’, SSRN Scholarly Paper ID 2363709, Social Science Research Network, Rochester, NY.
  • Goldner (2012) Goldner, K. (2012), ‘A markov model of football: Using stochastic processes to model a football drive’, Journal of Quantitative Analysis in Sports 8(1).
  • Gudmundsson & Horton (2017) Gudmundsson, J. & Horton, M. (2017), ‘Spatio-temporal analysis of team sports’, ACM Comput. Surv. 50(2), 22:1–22:34.
  • Hirotsu & Wright (2002) Hirotsu, N. & Wright, M. (2002), ‘Using a markov process model of an association football match to determine the optimal timing of substitution and tactical decisions’, Journal of the Operational Research Society 53(1), 88–96.
  • Li & Ng (2014) Li, W. & Ng, M. K. (2014), ‘On the limiting probability distribution of a transition probability tensor’, Linear and Multilinear Algebra 62(3), 362–385.
  • Neiman & Loewenstein (2011) Neiman, T. & Loewenstein, Y. (2011), ‘Reinforcement learning in professional basketball players’, Nature communications 2, 569.
  • Oliver (2004) Oliver, D. (2004), Basketball on paper: rules and tools for performance analysis, Potomac Books, Inc.
  • Routley & Schulte (2015) Routley, K. & Schulte, O. (2015), ‘A Markov Game Model for Valuing Player Actions in Ice Hockey’, Uncertainty in Artificial Intelligence (UAI), pp. 782–791.
  • Skinner (2012) Skinner, B. (2012), ‘The problem of shot selection in basketball’, PloS one 7(1), e30776.
  • Skinner & Goldman (2015) Skinner, B. & Goldman, M. (2015), ‘Optimal strategy in basketball’, arXiv preprint arXiv:1512.05652 .
  • Stan (2018) Stan, D. T. (2018), ‘Cmdstan: the command-line interface to stan’.
  • Štrumbelj & Vračar (2012) Štrumbelj, E. & Vračar, P. (2012), ‘Simulating a basketball match with a homogeneous markov model and forecasting the outcome’, International Journal of Forecasting 28(2), 532–542.
  • Sutton & Barto (1998) Sutton, R. S. & Barto, A. G. (1998), Reinforcement learning: An introduction, MIT press.
  • Thomas et al. (2013) Thomas, A. C., Ventura, S. L., Jensen, S. T. & Ma, S. (2013), ‘Competing process hazard function models for player ratings in ice hockey’, The Annals of Applied Statistics 7(3), 1497–1524.
  • Vračar et al. (2016) Vračar, P., Štrumbelj, E. & Kononenko, I. (2016), ‘Modeling basketball play-by-play data’, Expert Systems with Applications 44, 58–66.
  • Ward Jr (1963) Ward Jr, J. H. (1963), ‘Hierarchical grouping to optimize an objective function’, Journal of the American statistical association 58(301), 236–244.