1 Introduction
Soccer is a complex game with sparse rewards and a large variety of actions, outcomes, and strategies, which makes its analysis difficult. Recent breakthroughs in computer vision help sports analytics companies such as InStat [1], Wyscout [2], StatsBomb [3], STATS [4], and Opta [5] collect highly accurate tracking and event datasets from match videos. The existing tracking and event data contain the prior decisions and observed outcomes of players and coaches, which follow nonrandom, regular, and non-optimized policies. We call these behavioral policies throughout the rest of this paper. Analyzing the behavioral policies followed by the players and dictated by coaches has become one of the most interesting topics for researchers. Sports analysts, i.e., academic researchers, applications, scouts, and other sports professionals, are investigating the potential of using previously collected, i.e., offline, data to make counterfactual inferences about how alternative decision policies could perform in a real match.
There are several action valuation methods in the sports analytics literature. Some focus on passes and shots (e.g., [6, 7, 8, 9]), and others cover all types of actions (e.g., [10, 11, 12]). They accurately evaluate the player actions and their contribution to goal scoring. However, all those models leave the player and the coach with the value of the performed action, without proposing alternative, optimal actions. To fill this gap, this work goes beyond action valuation by proposing a novel policy optimization method that decides on the optimal action to perform in critical situations. In soccer, we consider as critical situations the moments with a high probability of losing the ball, or of scoring/conceding a goal. In these situations, the player has no chance of passing to a teammate or dribbling. Thus, she/he needs to decide immediately among the following options: 1) shooting, 2) sending the ball out, 3) committing a foul, or 4) giving the ball to the opponent by making an error. Moreover, we define the optimal action as the action that maximizes the expected goal for the team. Thus, our method should both evaluate the behavioral policy and suggest the optimal target policy to the players and coaches for critical situations.
Designing such a system in soccer is challenging for the following reasons. First, soccer is a highly interactive game with sparse rewards and a relatively large number of agents (i.e., players). Thus, state representation is ambiguous in such a system and requires an exact definition. Second, the spatiotemporal and sequential nature of soccer events and the dynamic locations of players on the field dramatically increase the state dimensions, which complicates machine learning tasks. Third, the game context in a soccer match strongly affects the model's prediction performance. Fourth, evaluating a trained optimal policy requires deployment in a real soccer match, which is infeasible due to the large cost of deployment. This work offers solutions to all of the above-mentioned challenges.
Sports professionals can use our policy optimization method after the match to check what action the player performed, evaluate it, and propose the optimal alternative action in that critical situation. If the action performed by the player, and the optimal action proposed by the optimal policy are not the same, we can relate it to the player’s mistake, or poor strategy from the coach.
In summary, our work contains the following contributions:

We propose an end-to-end framework that utilizes raw data and trains an optimal policy to maximize the expected goal of soccer teams via reinforcement learning (RL);

We introduce a soccer ball possession model, which we assume to be Markovian, and a new state representation to analyze the impact of actions on subsequent possessions;

We suggest spectral clustering with regard to the opponents’ positions and velocities for measuring the pressure on the ball holder at any moment of the match;

We propose a new reward function for each timestep of the game, based on the output of the neural network predictor model;

We derive the optimal policy in critical situations of soccer matches with the help of a fully off-policy deep reinforcement learning method.
2 Related work
State-of-the-art models in soccer analytics focus on several aspects, such as evaluating actions, players, and strategies. The plus/minus method is an early work on player evaluation proposed by Kharrat et al. [13]. This method assigns a plus for each goal scored and a minus for each goal conceded by the players per total time they were on the pitch. Although this is the simplest method, it ignores the rating of other players and the opposition strength, and it does not account for match situations. A regression method on actions and shots was first proposed by Ian et al. [14]. They estimate the number of shots as a function of crosses, dribbles, passes, clearances, etc. The coefficients show how important each of these is in generating shots. However, this model does not work well in some cases (e.g., when the value of a pass changes, or when we want to know where the cross occurred). Another interesting player evaluation method is percentiles and player radars by StatsBomb [15]. This method estimates the relative rank of each player based on his actions. For example, a ranking can be assigned to a player for all his defensive actions (tackle/interception/block), his accurate passes, crosses, etc.
The application of a Markovian model in action valuation was first proposed by Rudd [16, 17]. The input of this model is the probability of the ball location in the next five seconds. Given these probabilities, the model estimates the likely outcomes after many iterations based on the probabilities of transitioning from one state to another. Another application of the Markovian model is Expected Threat (xT) [18], which uses simulations of soccer matches to assign value to the actions. However, we believe that such simulations tend to be unrealistic, because simulations started from an arbitrary point do not result in a goal within several iterations. VAEP [10] is another action valuation model, which considers all types of actions. This model uses a classifier to estimate the probability that an action leads to a goal within the next 10 actions, and the game state is described by 3 actions. This model ignores the concept of possession in valuation. Considering possession, the Expected Possession Value (EPV) metrics in football [6] and basketball [19] were proposed. These models assume a simple world in which the actions of the players inside possessions are limited to pass, shot, and dribble, thus ignoring other actions such as foul, ball out, or errors, which frequently happen in critical situations.
Recently, researchers have utilized deep learning methods due to their promising performance in valuation domains. Fernandez and Bornn [8] present a convolutional neural network architecture that is capable of estimating full probability surfaces of potential passes in soccer. Moreover, Liu et al. took advantage of RL by assigning a value to each of the actions in ice hockey [11] and soccer [12] using a Q-function. They later used a linear model tree to mimic the output of the original deep learning model to resolve the trade-off between accuracy and transparency [20]. Moreover, Dick and Brefeld [21] used reinforcement learning to rate player positioning in soccer. In this paper, we go beyond the valuation of actions in critical situations, and use RL to derive the optimal policy to be performed by the teams and players.
3 Our Markovian possession model
In order to train an RL model, we first represent a soccer game as a Markov decision process. To this end, in this section we introduce an episode of the game, the start, intermediate, and final states. In the next sections, we define the state, action, and reward in each timestep of the game.
Due to the fluid nature of a soccer game, it is not straightforward to give a comprehensive description of a possession that applies to all the different types of soccer logs provided by different companies (e.g., InStat, Wyscout, StatsBomb, Opta, etc.). In the InStat dataset, and accordingly in this work, possessions of the home and away teams are clearly defined and numbered. A possession starts at the beginning of a deliberate, on-the-ball action by a team and lasts until it either “ends” due to some event such as ball out, foul, bad ball control, offside, clearance, or goal (regardless of who possesses the ball afterward, i.e., the next possession can belong to the same team or the opposing team), or “transfers” through a defensive action of the opponent, such as a pass interception, tackle, or clearance.
The possession is transferred if and only if the team is not in possession of the ball over two consecutive events. Thus, unsuccessful touches of the opponent in fewer than 3 consecutive actions are not considered a possession loss. Consequently, all on-the-ball actions of players of the same team should be counted to get the possession length, not only passes, shots, and dribbles. Accordingly, we define an episode as the subsequent possessions of a team until they lose the ball or end the possession sequence with a shot.
We aspire to describe the possessions with a Markovian model. In order to take advantage of the Markovian model of possessions and their outcomes, we converted the action level nature of the dataset to possession level, and each possession is labeled by its own terminating action. This conversion expedites the usage of supervised learning methods to predict the most probable outcomes as well. Our proposed model can be separately applied to any team participating in the games.
We can model this process as a Finite State Automaton, with the initial node of “Start” of possession, the final nodes of (“Loss” or “Shot”), and the intermediate node of “Keep” the possession. The schematic view of the state transition is illustrated in Figure 1.
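The automaton described above can be sketched in a few lines of Python. This is only an illustration: Figure 1 is not reproduced here, so the transition rule (a shot ends the episode in “Shot”, a ball out or foul after which the team still has the ball leads to “Keep”, and everything else leads to “Loss”) is our reading of the text, and the names `Node`, `next_node`, and `split_episodes` are hypothetical.

```python
from enum import Enum


class Node(Enum):
    START = "Start"
    KEEP = "Keep"
    LOSS = "Loss"
    SHOT = "Shot"


def next_node(ending_action: str, keeps_ball: bool) -> Node:
    """Map the ending action of a possession to the next automaton node.

    Assumption: 'out' and 'foul' keep the episode alive only when the
    next possession still belongs to the same team.
    """
    if ending_action == "shot":
        return Node.SHOT
    if ending_action in ("out", "foul") and keeps_ball:
        return Node.KEEP
    return Node.LOSS


def split_episodes(possessions):
    """Group (ending_action, keeps_ball) pairs of one team into episodes."""
    episodes, current = [], []
    for ending_action, keeps_ball in possessions:
        current.append(ending_action)
        # An episode terminates in one of the two final nodes.
        if next_node(ending_action, keeps_ball) in (Node.SHOT, Node.LOSS):
            episodes.append(current)
            current = []
    if current:  # trailing possessions that never reached a final node
        episodes.append(current)
    return episodes
```

For example, a foul after which the team keeps the ball, followed by a shot, forms a single two-possession episode, while an error forms an episode of length one.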
4 State representation and neural network architecture
In this study, we present an end-to-end framework to learn an optimal policy for maximizing the expected goals in a soccer game. Data preparation is a core task for obtaining a reliable RL model. In this section, we present the steps of building the states. Given the definition of the episode in Section 3, an episode does not necessarily contain a goal, so we need to define a well-suited reward function for each timestep. We propose a neural network model that utilizes the suggested state, and we use its output to obtain the reward of each timestep. The structure of the datasets used in this study is provided in Appendix A.
4.1 Game context: opponent pressure
Considering a descriptive game context is one of the most important aspects of feature engineering in soccer analytics. Several works have introduced different methods, KPIs, and features to address this problem. Among these works, Decroos et al. [10] created the following game context features: number of goals scored by the attacking team after the action, number of goals scored by the defending team after the action, and goal difference after the action. Fernandez et al. [6] considered the role of context by slicing the possession into 3 phases: build-up, progression, and finalization. They considered three dynamic formation or pressure lines, and grouped the actions based on particular relative locations: first vertical pressure line (forwards), second vertical pressure line (midfielders), and third vertical pressure line (defenders). Another interesting approach, by Alguacil et al. [7], applied a model of the collective motion of animal groups, called self-propelled particles, to soccer. They claimed that at FC Barcelona, and in general, coaches can talk about three different playing zones: the intervention zone (immediate points around the ball), the mutual help zone (players close to the ball, but further away than the first zone), and the cooperation zone (players not expected to receive the ball within a few seconds).
Our approach to modeling the pressure exerted by the opposing team and considering game context in our valuation framework matches the self-propelled particle model in grouping the opponents into several zones around the ball holder. To this end, we take advantage of a clustering method, keeping in mind that opponents inside the clusters are not distributed spherically (according to their positions and velocities). The K-means algorithm is ill-suited here, as it assumes that the clusters are roughly spherical and operates on Euclidean distance (Figure 4(a)). In soccer tracking data, however, such clusters are unevenly distributed in size. Thus, we experimented with spectral clustering to provide the number of opponents inside each cluster as an indicator of defensive pressure around the ball holder. We treated the positions and velocities of the opponents around the ball holder as graph vertices, and we constructed a k-nearest neighbors graph for each frame (5 neighbors in this work). In this graph, nodes are the opponent players’ positions and velocities (direction and magnitude), and an edge is drawn from each node to its k nearest neighbors in the original space. The graph Laplacian is defined as the difference between the degree and adjacency matrices, $L = D - A$. Then, we used K-means to perform clustering on the eigenvectors associated with the zero eigenvalues of the Laplacian (the connected components), setting the exact position (x,y) of the ball holder at each frame as the initial centroid of the clusters. Thus, each opponent player is assigned to exactly one spectral cluster (see Figure 4(b)).
Moreover, we experimentally selected the optimal number of clusters to be 3, using the elbow method with the metric set to distortion (the sum of squared distances from each point to its assigned center) and inertia (the sum of squared distances of samples to their closest cluster center). Figure 4 depicts a frame of a specific match in our dataset, with K-means on the top and spectral clustering on the bottom. Opponents from the away team are clustered into 3 groups. In the bottom panel (Figure 4(b)), cluster 1 (blue) includes 4 opponents who might immediately take possession of the ball, cluster 2 (yellow) contains opponents who might intercept a pass or dribble of the ball holder, and cluster 3 (red) contains opponents who cannot reach the ball within a few seconds. We compute the number of opponents in each cluster for all frames of the matches in our dataset, and use them as the pressure feature throughout the rest of this work. The pseudocode of the clustering algorithm for pressure measurement is provided in Algorithm 1.
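As a rough illustration of the pressure feature, the following NumPy sketch clusters the opponents of a single frame and counts them per zone. It is not the paper's Algorithm 1: the helper names are hypothetical, the spectral embedding uses the eigenvectors of the smallest Laplacian eigenvalues (the usual practical variant), K-means is initialized with a deterministic farthest-point rule rather than with the ball holder's position, and zones are ordered by their mean distance to the ball holder so that zone 1 is the closest cluster.

```python
import numpy as np


def knn_adjacency(points, k=5):
    """Symmetric k-nearest-neighbor adjacency matrix (requires k < len(points))."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a node is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros_like(dist)
    rows = np.repeat(np.arange(len(points)), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)           # make the graph undirected


def kmeans(rows, n_clusters, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point init."""
    centers = [rows[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(rows - c, axis=1) for c in centers], axis=0)
        centers.append(rows[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(rows[:, None, :] - centers[None, :, :], axis=-1)
        labels = np.argmin(d, axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = rows[labels == j].mean(axis=0)
    return labels


def pressure_zones(opponents, ball_xy, n_clusters=3, k=5):
    """Count opponents per spectral cluster, nearest cluster first.

    opponents: array of shape (n_players, 4) holding (x, y, vx, vy).
    """
    opponents = np.asarray(opponents, dtype=float)
    adj = knn_adjacency(opponents, k)
    laplacian = np.diag(adj.sum(axis=1)) - adj      # L = D - A
    _, vecs = np.linalg.eigh(laplacian)             # eigenvalues ascending
    labels = kmeans(vecs[:, :n_clusters], n_clusters)
    # Assumption: zones are ranked by mean distance to the ball holder.
    dist = np.linalg.norm(opponents[:, :2] - np.asarray(ball_xy, dtype=float), axis=1)
    present = [j for j in range(n_clusters) if np.any(labels == j)]
    order = sorted(present, key=lambda j: dist[labels == j].mean())
    counts = [int(np.sum(labels == j)) for j in order]
    return counts + [0] * (n_clusters - len(counts))
```

In practice, scikit-learn's `SpectralClustering` with `affinity="nearest_neighbors"` and `n_neighbors=5` covers the graph construction and embedding steps; the hand-written version is shown only to make the Laplacian step explicit.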
4.2 Selected features
The InStat tracking data is a one-frame-per-second representation of the positions of all players, both home and away. As mentioned in Section 4.1, we use the tracking data to calculate the velocities and the opponents’ locations around the ball holder, and we compute the defensive pressure in 3 different pressure zones to reduce the dimensionality of the feature set. Another option is to avoid clustering the position features and feed the network with the 44-dimensional ((x,y) for 22 players) normalized locations on the pitch. In contrast, angle and distance to goal, time remaining, home/away, and body ID can be calculated directly from the event stream data. Table 1 shows the final list of features used for our machine learning tasks in the following sections. Note that we use either location features (44-dimensional exact player locations) or pressure features (numbers of players in each cluster) to represent the 3 state types in Table 2.
Feature set  Feature name  Description 
handcrafted  Angle to goal  the angle between the goal posts seen from the shot location 
handcrafted  Distance to goal  Euclidean distance from shot location to center of the goal line 
handcrafted  Time remaining  time remaining from the action to the end of the match half 
handcrafted  Home/Away  action is performed by home or away team? 
handcrafted  Action result  successful or unsuccessful 
handcrafted  Body ID  action is performed by head? body? foot? 
contextual: clustered locations  Pressure in zone 1  number of opponents in first cluster 
contextual: clustered locations  Pressure in zone 2  number of opponents in second cluster 
contextual: clustered locations  Pressure in zone 3  number of opponents in third cluster 
contextual: exact locations  Locations  44-dimensional exact locations (x,y) of opponents 
4.3 Possession input representation
State representation is one of the most challenging steps in soccer analytics due to the high-dimensional nature of the datasets. We describe each game state by generating the most relevant features and labels for it. To this end, we define different sets of features, i.e., handcrafted and contextual (Table 1), and 3 types of state representation (Table 2).
For each of the state types (I, II, III), we represent the state as the combination of the feature vectors (see Tables 1 and 2) and the one-hot representation of the action, for all the actions inside each possession excluding the ending action. Thus, the possession length, which varies across possessions, is the number of actions inside a possession excluding the ending one. The state is then a 2-dimensional array whose first dimension is the possession length $l$ (varying for each possession) and whose second dimension is the number of features $n$. Therefore, a state/possession of length $l$ can be represented as an $l \times n$ array.
Due to the complex and spatiotemporal nature of the dataset, we select the best representation of the state through an experimental process. To do this, we train the spatiotemporal models on three different state types. State type (I) ignores the players’ locations and only reflects the occurred actions in addition to the handcrafted features of each action. State type (II) is a high-dimensional representation that considers the exact players’ locations besides the actions and their corresponding handcrafted features. In state type (III), we handle the curse of dimensionality of type (II) by clustering the locations as shown in Section 4.1. See Table 2 for more details on the states.
4.4 CNN-LSTM architecture for deriving behavioral policy
In the soccer event dataset, each possession is represented by a sequence of actions. We aim to classify these possessions (and show the result only for the home team) based on their ending actions. Thus, each possession is assigned to one of the following classes: 1) Shot (goal or unsuccessful), 2) Ball out, 3) Foul, 4) Error (possession loss due to an inaccurate pass, bad ball control, or a tackle or interception by the opponent). Note that foul and ball out actions are only those performed by the home team. Thus, if the possession is terminated by any action of the opponent, including foul and ball out, we classify it as an error. To this end, we utilize the classification capability of sequence prediction methods. To handle the spatiotemporal nature of our dataset, we needed a sophisticated model and the best feature set to optimize the prediction performance. Thus, model selection was a core task of this study. We first created state dimensions suitable for each model by reshaping the state inputs, then fed the reshaped arrays of the 3 state types to the following networks: 3D-CNN, LSTM, Autoencoder-LSTM, and CNN-LSTM, to compare their classification performance (see Table 2). A validation split of 30% of consecutive possessions is used for evaluation during training, and the cross-entropy loss on the training and validation datasets is used to evaluate the model. As the table suggests, the CNN-LSTM [22] trained on state type (III) outperforms the other models in terms of accuracy and loss. This result indicates that the exact locations of the players are not necessary and that the pressure features are sufficient. Although the accuracy of the Autoencoder-LSTM trained on state types (II and III) is quite similar to that of the CNN-LSTM, its relatively large inference time and number of trainable parameters make the implementation more demanding and expensive. Thus, we continued the rest of the analysis by developing a CNN-LSTM network [22], using a CNN for spatial feature extraction from the input possessions, and an LSTM layer with 100 memory units (smart neurons) to support sequence prediction and interpret features across time steps. Figure 5 depicts the architecture of our network. Since our input possessions (possession array) have a three-dimensional structure, i.e., first dimension: number of possessions, second dimension: dynamic possession length (maximum=10), third dimension: number of features, the CNN is capable of picking invariant features for each class of possession. These learned, consolidated spatial features are then fed to the LSTM layer. Finally, dense output layers with a softmax activation function perform our multi-class classification task.
Note that the feature vector has a fixed length for each individual action, but the number of feature vectors in a state varies (because the possession length, i.e., the number of actions, varies). This is one of the main challenges in our work, as most machine learning methods require fixed-length feature vectors. To address the challenge of the dynamic length of possession features (the second dimension of the possession input array), we use truncating and padding. We mapped each action in a possession to an 11-dimensional real-valued vector. We also limit the total number of actions in a possession to 10, truncating long possessions and padding short possessions with zero values. In this way, all sequences in the dataset have a fixed length for modeling.
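The truncating and padding step can be sketched as follows. The sizes (10 actions, 11 features) come from the text; the function name is hypothetical, and the text does not say which end of a long possession is truncated or where the zeros are placed, so keeping the most recent actions and padding at the end are assumptions.

```python
MAX_ACTIONS = 10   # maximum possession length used in the text
N_FEATURES = 11    # length of each per-action feature vector


def pad_or_truncate(possession, max_len=MAX_ACTIONS, n_features=N_FEATURES):
    """Return a max_len x n_features list of lists for one possession.

    Assumptions: long possessions keep their last max_len actions
    (those closest to the ending action); short ones get zero rows appended.
    """
    rows = [list(action) for action in possession[-max_len:]]
    for action in rows:
        if len(action) != n_features:
            raise ValueError("each action must have n_features values")
    padding = [[0.0] * n_features for _ in range(max_len - len(rows))]
    return rows + padding
```

Stacking the outputs for all possessions yields the fixed-shape array (number of possessions, 10, 11) expected by the sequence model.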
Consequently, this network estimates the categorical probability distribution over ending actions for any given possession, parameterized by the network weights. Throughout the rest of the paper, we denote this probability distribution as $\pi_b$.
5 Off-policy reinforcement learning
Most RL methods require active data collection, where the agent actively interacts with the environment to get rewards. Obviously, this situation is impossible in our soccer analytics problem, since we are not able to modify the players’ actions. Thus, our study falls right into the category of batch RL. In this case, we will not face the exploration vs. exploitation tradeoff, since the actions and rewards are not sampled randomly, but they are sampled from the real world (players’ actions in a match) before the learning process. Moreover, our network learns a better target policy from a fixed set of interactions.
Before the learning process, the players selected some ending actions according to a non-optimal (behavioral) policy. We aim to use those selected actions and the acquired rewards to learn a better policy. Therefore, we prepared our dataset of transitions in the form (current state, action, reward, next state) for learning a new policy. Throughout the rest of the paper, we use the notation and definitions shown in Table 3.
Notation  Definition 

State: ($s_t$)  Sequence of actions and their features in a possession of each team, excluding the ending action 
Action: ($a_t$)  Ending action of each possession, which leads to state transition 
Episode: ($\tau$)  Sequence of possessions of the home team, until they lose the possession, or end it with a shot, denoted by $\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$ 
Reward: $r_t$  Reward acquired from each ending action at the end of a possession 
Episode reward: $r(\tau)$  Sum of rewards (expected goals) for each episode: $r(\tau) = \sum_{t=1}^{T} r_t$ 
Return: $R_t$  Cumulative discounted and normalized reward 
Target policy distribution: $\pi_\theta$  Learned policy (probability distribution of actions from the policy network) 
Behavior policy distribution: $\pi_b$  Actual policy (probability distribution of actions collected offline from a real match) 
$l$  Length (number of actions) in possession 
$T$  Total number of possessions 
5.1 Action reward function
In this section, we aim to estimate the reward acquired for the ending actions. Owing to the complex and sparse environment of soccer games, it is tedious to design the perfect reward function. In general though, every team desires to be in the most precious states, i.e., with maximum probability of goal scoring, as much as possible.
In the soccer dataset (for either the home or the away team), each episode starts at the moment the team acquires possession of the ball, and it terminates when the team either loses possession (loss) or ends up shooting (win). According to the Markovian possession model in Section 3, the set of ending actions (out, foul, shot, error) leads to state transitions. We estimated the probability of a possession belonging to the shot class in Section 4.4 with the help of a CNN-LSTM network. In order to define the value of each possession, we need to define the following concepts:

$P(\text{shot} \mid s)$: computed by the CNN-LSTM, this is the probability that a possession belongs to the shot class, given the features of the possession.

$P(\text{goal} \mid \text{shot}, s)$: the probability of goal scoring, assuming that a possession belongs to the shot class, and given the shot features. This is the same concept as the state-of-the-art expected goal (xG) model, which classifies shots into goal and no-goal. In this work, we computed xG using logistic regression, and Table 4 shows its higher performance under 5-fold cross-validation in comparison with other classifiers (details in Appendix C).
A higher $P(\text{shot} \mid s)$ indicates a higher chance of a shot, and a higher $P(\text{goal} \mid \text{shot}, s)$ indicates a better chance of goal scoring. Thus, the product of these two terms gives us the Possession Value (PV) in state $s$, given in (1). The Bayesian formula for this equation is provided in Appendix E.
$$PV(s) = P(\text{shot} \mid s) \times P(\text{goal} \mid \text{shot}, s) \qquad (1)$$
Now we define the rewards acquired by each ending action in a possession. The most valuable actions in critical situations satisfy 2 criteria: 1) they prevent possession loss, and 2) they keep possession for the team and lead to a transition to a more valuable possession with a higher PV. Thus, we present our reward function as depicted in (2):
$$r(s_t, a_t, s_{t+1}) = \begin{cases} PV(s_t) & \text{if } a_t = \text{shot} \\ PV(s_{t+1}) - PV(s_t) & \text{if } a_t \neq \text{shot and the possession is kept} \\ -0.1 & \text{if the possession is lost} \end{cases} \qquad (2)$$
where $r(s_t, a_t, s_{t+1})$ is the reward when the state changes from $s_t$ to $s_{t+1}$ by taking action $a_t$. Our proposed reward function computes the immediate reward of the action that each player performed. When choosing the shot, the player receives the PV of the possession. If he performs any action other than a shot (e.g., ball out or foul), but the next possession still belongs to his team, the model computes the PV of the next possession and compares it to that of the current possession. On the other hand, if he performs any action leading to possession loss (e.g., bad ball control, an inaccurate pass, or a tackle or interception by the opponent), he receives a negative reward. In this work, $-0.1$ proved to be the best reward for possession loss to ensure the convergence of the policy network. Moreover, the sum of $r$ over the timesteps of the whole episode is the indicator of the expected goal for the team. Thus, the control objective is to maximize the expected goal of the teams.
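The reward rules described above can be written as a small function. This is a sketch under stated assumptions: the function and argument names are hypothetical, the "comparison" between the next and the current possession is taken to be the difference of their PVs, and the possession-loss reward is taken as a penalty of magnitude 0.1 because the text calls it a negative reward.

```python
LOSS_REWARD = -0.1  # assumed sign: the text describes a negative reward


def possession_value(p_shot, xg):
    """PV = P(possession ends in a shot) * P(goal | shot), i.e. the xG."""
    return p_shot * xg


def reward(action, pv_current, pv_next=None, keeps_ball=False):
    """Immediate reward of the ending action of a possession."""
    if action == "shot":
        return pv_current
    if keeps_ball and pv_next is not None:
        # Assumption: compare the two possessions by their PV difference.
        return pv_next - pv_current
    return LOSS_REWARD
```

For instance, a foul that leads from a possession with PV 0.02 to one with PV 0.05 is rewarded with about 0.03, whereas an inaccurate pass that loses the ball is penalized regardless of the current PV.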
Classifier  Brier  AUC 

XGBoost  0.014  0.765 
Random Forest  0.014  0.759 
SVM  0.015  0.733 
Logistic Regression  0.012  0.798 
5.2 Training protocol and return
For each state, the network needs to decide about performing the appropriate action with the corresponding parameter gradient. The parameter gradient tells us how the network should modify the parameters if we want to encourage that decision in that possession in the future. We modulate the loss for each action taken at the end of a possession according to their eventual outcome, since we aim to increase the log probability of successful actions (with higher rewards) and decrease it for the unsuccessful actions.
We define the discounted reward (return) for episode $\tau$ in (3).
$$R_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k} \qquad (3)$$
where $\gamma$ is a discount factor (Appendix I), and $r_{t+k}$ is the estimated reward (expected goals) for timestep $t+k$ after standardization to control the gradient estimator variance. $R_t$ shows that the strength of encouraging a sampled action at the end of a possession is the weighted sum of the rewards (expected goals) afterwards. In this work, we constrain the look-ahead to the end of the episode.
5.3 Policy gradient
Policy gradient (PG) is a type of score function gradient estimator. Using PG, we aim to train a policy network that directly learns the optimal policy by learning a function that outputs the best action to be taken in each possession.
The CNN-LSTM network in Section 4.4 estimated the behavioral probability distribution over actions (shot, out, foul, error) for any given possession, denoted by $\pi_b$. This categorical probability distribution reflects the nonrandom, regular, and non-optimized policies followed by the players and possibly dictated by coaches throughout the matches. In order to find a better policy, which optimizes the expected goal of episodes, we need to train a second network. We call this network the target policy network $\pi_\theta$. The training is done with the help of the gradient vector, which encourages the network to slightly increase the likelihood of highly positive rewarding actions and decrease the likelihood of negative ones. We seek to learn how the distribution should be shifted (through its parameters $\theta$) in order to increase the reward of the taken actions.
In the general case, we have an expression of the form
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$
in which $R(\tau)$ is our return, and $\pi_\theta$ is our learned policy. In our soccer problem, this expression is an indicator of the expected goals in each episode through the whole match. In order to maximize the expected goals, we need to compute the gradient vector as follows:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\right]$$
However, PG is on-policy, i.e., training samples are collected according to the target policy. This does not hold in our offline setting, where we encounter out-of-distribution actions. Thus, we need to reformulate the PG as in (4) using an importance weight (proof in Appendix F).
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_b}\left[\frac{\pi_\theta(\tau)}{\pi_b(\tau)}\, R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\right] \qquad (4)$$
The gradient vector $\nabla_\theta \log \pi_\theta(\tau)$ gives a direction in the parameter space that leads to an increase of the probability assigned to $\tau$. Consequently, high rewarding actions tug on the probability density more strongly than low rewarding actions. Therefore, by training the network, the probability density shifts in the direction of high rewarding actions, making them more likely to occur.
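To make the update concrete, the sketch below computes discounted returns for one episode and the importance-weighted gradient for a single decision of a softmax policy. It is a simplified, per-decision illustration rather than the paper's implementation: the function names and the default discount factor are hypothetical, the policy is reduced to a vector of logits over the four ending actions, and the importance weight is applied per possession instead of per trajectory.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for every timestep of one episode (gamma is assumed)."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def off_policy_pg_grad(logits, action, behavior_prob, ret):
    """Ascent direction w.r.t. the logits for one logged decision.

    Implements w * R * grad log pi_theta(a|s) with the importance weight
    w = pi_theta(a|s) / pi_b(a|s); for a softmax policy,
    d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    weight = probs[action] / behavior_prob
    return [
        weight * ret * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]
```

With a positive return, the logit of the logged action is pushed up and the others are pushed down; the importance weight scales the step by how much more (or less) likely the target policy is to take that action than the behavioral policy was.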
5.4 Off-policy training
Our soccer analysis problem falls into the category of off-policy RL methods. In this setting, the agent learns (trains and evaluates) solely from historical data, without online interaction with the environment.
Figure 6 illustrates our training workflow of the policy network, with offline data collection, and gradient computation.
6 Experimental results
Evaluating our implemented framework is challenging, as there is no ground truth for action valuation or policy optimization in soccer. Therefore, we evaluate the performance of our proposed framework with an eye towards two questions: 1) How well can our trained network maximize the expected goal in comparison to the behavioral policy? We answer this question with the off-policy policy evaluation (OPE) method. 2) What is the intuition behind the actions selected by our target policy? We elaborate on this by providing three scenarios of the most critical situations in a particular match from the dataset. The structure of the datasets used in this study is provided in Appendix A.
6.1 Off-policy policy evaluation with importance sampling and doubly robust methods
Applying the off-policy method to our soccer analysis problem, we faced the following challenge: while training can be performed without a real environment or simulator, the evaluation cannot, because we cannot deploy the learned policy in a real soccer match to test its performance. This challenge motivated us to use off-policy policy evaluation (OPE), which is a common technique for testing the performance of a new policy when the environment is not available or is expensive to use.
With OPE, we aim to estimate the value and performance of our newly optimized policy based on the historical match data collected by a different behavioral policy obeyed by the players. For this aim, we use the importance sampling method used by different works such as Teng et al. [23] and doubly robust in [24] and [25]. They take samples from behavioral policy to evaluate the performance of target policy . The workflow of the evaluation with importance sampling is sketched in Figure 13 of Appendix G, and details of the doubly robust are provided in Appendix H. Moreover, the input dataset format to the OPE is shown in Table 5 of Appendix B.
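A minimal sketch of an importance sampling estimate is given below. It assumes each logged step carries the behavioral and target probabilities of the action actually taken plus its reward; this tuple layout is hypothetical and is not the exact format of Table 5. Both the ordinary and the weighted (self-normalized) variants are shown.

```python
def importance_sampling_value(trajectories, gamma=1.0):
    """Estimate the target policy value from behavior-policy trajectories.

    trajectories: list of trajectories, each a list of
    (behavior_prob, target_prob, reward) tuples in time order.
    Returns (ordinary_estimate, weighted_estimate).
    """
    weights, weighted_returns = [], []
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for t, (behavior_prob, target_prob, reward) in enumerate(traj):
            rho *= target_prob / behavior_prob  # cumulative importance ratio
            ret += (gamma ** t) * reward        # discounted return
        weights.append(rho)
        weighted_returns.append(rho * ret)
    ordinary = sum(weighted_returns) / len(trajectories)
    total_weight = sum(weights)
    weighted = sum(weighted_returns) / total_weight if total_weight > 0 else 0.0
    return ordinary, weighted

# Two hypothetical one-step episodes.
logged = [[(0.25, 0.5, 0.4)], [(0.5, 0.25, 0.1)]]
ordinary, weighted = importance_sampling_value(logged)
```

The ordinary estimate is unbiased but can have high variance when the two policies differ strongly; the weighted variant trades a small bias for lower variance.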
6.2 Experiments
We trained our policy on the 104 games under the 3 state types and evaluated it with the OPE methods. In this section, we demonstrate the performance of the obtained policy and compare it to the behavior policy on different state representations. Then we present three scenarios and analyze the performance of our policy versus the real players’ actions.
Figure 7 shows the mean rewards over 100 epochs of the trained policy network under the different proposed state representations (see Table 2), evaluated by the importance sampling and doubly robust methods. Both OPE methods show that our proposed state representation type (III) (purple line) lets the policy network converge after sufficient epochs. In particular, under state (III) the policy network converges after about 70 epochs when evaluated by importance sampling, and after around 80 epochs when evaluated by doubly robust. In contrast, the mean reward curves under state (I) converge quickly (due to their low-dimensional input) to a relatively lower mean reward, and the curves under state (II) fail to converge (due to their high-dimensional and complex input structure). These results indicate that our proposed state representation (III) outperforms the other types. Using importance sampling as the better evaluator of the optimal policy with state (III), any model after epoch 70 is suitable for deployment by the football club for analysis. The reward (expected goal) acquired by the trained policy is around 0.45, with some variance, averaged over all 104 games. The figure also shows that the optimized policy (purple line) outperforms the behavioral policy (green line), whose mean reward is about 0.1 over all matches.
Moreover, Figure 8 compares the kernel density estimation (KDE) of the mean rewards of the behavioral and optimal policies for all matches, evaluated by OPE. The density of the optimized policy has moved to the positive side and clearly improves on the behavior policy. It also has a smaller variance than the behavior policy.
Now, we consider some scenarios to see how the optimized policy works compared to the behavior policy. Figure 12 sketches 3 different scenarios in the critical situations of a particular match in our dataset, when the ball holder has no chance to pass or dribble. Thus, the ball holder needs to choose among the 3 intentional options (shot, out, foul), or give the ball to the opponent through an error. The action performed by the player and the action proposed by the policy network in each scenario are as follows.
Scenario 1: home player missed a goal-scoring opportunity: Figure 12(a) shows the episode of the match. Player A from the home team stops a long sequence of passes by committing a foul, and he gets a reward of 0.16. At this moment, there is high pressure from away players (B, C, D), so A commits a foul to prevent possession loss. Then player D from the away team gets the ball, but immediately loses it due to an inaccurate pass. As stated earlier, unsuccessful touches by the opponent in fewer than 3 consecutive actions are not considered possession loss. Thus, the home team keeps possession after the foul by A. Even so, the policy network suggests shooting instead of committing the foul. Player A could then have gained a reward (expected goal) of 0.4, meaning that the probability of scoring was 0.4, and he missed this opportunity.
Scenario 2: goal conceding: Figure 12(b) shows the episode of the match. Player A from the home team loses possession through an error (tackle by D) and gets a reward of 0.1. The next possession belongs to the away team, and they score a goal (red trajectory in the figure). The policy network assigns a higher probability to a foul in this situation than to this inaccurate pass (error), so A had a chance to save possession and avoid conceding a goal for the home team.
Scenario 3: goal conceding: Figure 12(c) shows the episode of the match. Player A from the home team loses possession due to bad ball control and high pressure from B, D, and E, and gets a reward of 0.1. The next possession belongs to the away players, and they score a goal (red trajectory). The policy network, perhaps surprisingly, suggests sending the ball out in this situation, so the home players could probably have saved possession and avoided conceding a goal.
7 Conclusion
We proposed a data-driven deep reinforcement learning framework to optimize the impact of actions in the critical situations of a soccer match. In these situations, the player cannot pass the ball to a teammate or continue dribbling. Thus, the player can only commit a foul, send the ball out, shoot, or, if not skilled enough, lose the ball to a defensive action of the opponent. Our framework, built on a trained policy network, will help players and coaches compare their behavioral policy with the optimal policy. More specifically, sports professionals can feed in any state with the proposed possession features and state representation to find the optimal actions. We conducted experiments on 104 matches and showed that the optimal policy network can increase the mean reward to 0.45, outperforming the expected goals gained by the behavioral policy, which is 0.1. To the best of our knowledge, this work constitutes the first usage of off-policy policy gradient reinforcement learning to maximize the expected goal in soccer games. A direction for future work is to expand the framework to evaluate all on-the-ball actions of the players, including passes and dribbles.
Acknowledgment
Project no. 128233 has been implemented with the support provided by the Ministry of Innovation and Technology of Hungary from the National Research, Development and Innovation Fund, financed under the FK_18 funding scheme. The authors thank xfb Analytics (http://www.xfbanalytics.hu/) for supplying the event stream and tracking data used in this work.
References
 [1] “Instat,” https://instatsport.com/.
 [2] “Wyscout,” https://wyscout.com/.
 [3] “Statsbomb,” https://statsbomb.com/.
 [4] “Stats,” https://www.statsperform.com/.
 [5] “Opta,” https://www.optasports.com/.
 [6] J. Fernandez, L. Bornn, and D. Cervone, “Decomposing the immeasurable sport: A deep learning expected possession value framework for soccer,” in MIT Sloan Sports Analytics Conference, 2019.
 [7] F. P. Alguacil, J. Fernandez, P. P. Arce, and D. Sumpter, “Seeing in to the future: using self-propelled particle models to aid player decision-making in soccer,” in MIT Sloan Sports Analytics Conference, 2020.
 [8] J. Fernandez and L. Bornn, “Soccermap: A deep learning architecture for visually-interpretable analysis in soccer,” in ECML PKDD, 2020.
 [9] L. Gyarmati and R. Stanojevic, “Qpass: a merit-based evaluation of soccer passes,” in KDD Workshop on Large-Scale Sports Analytics, 2016.
 [10] T. Decroos, L. Bransen, J. V. Haaren, and J. Davis, “Actions speak louder than goals: Valuing player actions in soccer,” in ACM KDD, 2019.

 [11] G. Liu and O. Schulte, “Deep reinforcement learning in ice hockey for context-aware player evaluation,” in International Joint Conference on Artificial Intelligence, 2018.
 [12] G. Liu, Y. Luo, O. Schulte, and T. Kharrat, “Deep soccer analytics: learning an action-value function for evaluating soccer players,” Data Mining and Knowledge Discovery, vol. 34, no. 2, 2020.
 [13] T. Kharrat, J. L. Peña, and I. McHale, “Plus-minus player ratings for soccer,” European Journal of Operational Research, vol. 283, no. 2, 2017.
 [14] I. G. McHale, P. A. Scarf, and D. E. Folker, “On the development of a soccer player performance rating system for the English Premier League,” Interfaces, vol. 42, no. 4, pp. 339–351, 2012.
 [15] StatsBomb, “New data, new StatsBomb radars,” https://statsbomb.com/2018/08/new-data-new-statsbomb-radars/, 2018.

 [16] S. Rudd, “A framework for tactical analysis and individual offensive production assessment in soccer using Markov chains,” 2018. [Online]. Available: http://nessis.org/nessis11/rudd.pdf
 [17] K. Goldner, “A Markov model of football: Using stochastic processes to model a football drive,” Journal of Quantitative Analysis in Sports, vol. 8, no. 1, 2012.
 [18] K. Singh, “Introducing expected threat (xT): modelling team behaviour in possession to gain a deeper understanding of buildup play,” 2018. [Online]. Available: https://karun.in/blog/expected-threat.html
 [19] D. Cervone, A. D’Amour, L. Bornn, and K. Goldsberry, “Pointwise: Predicting points and valuing decisions in real time with NBA optical tracking data,” in MIT Sloan Sports Analytics Conference, 2014.
 [20] X. Sun, J. Davis, O. Schulte, and G. Liu, “Cracking the black box: Distilling deep sports analytics,” in ACM KDD, 2020.
 [21] U. Dick and U. Brefeld, “Learning to rate player positioning in soccer,” Big Data, vol. 7, no. 1, 2019.

 [22] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [23] T. Xie, Y. Ma, and Y. Wang, “Optimal off-policy evaluation for reinforcement learning with marginalized importance sampling,” CoRR, 2019.
 [24] M. Farajtabar, Y. Chow, and M. Ghavamzadeh, “More robust doubly robust off-policy evaluation,” CoRR, 2018.
 [25] N. Jiang and L. Li, “Doubly robust off-policy evaluation for reinforcement learning,” CoRR, 2015.
 [26] N. Jiang and L. Li, “Doubly robust off-policy value evaluation for reinforcement learning,” in International Conference on Machine Learning, 2016.
Appendix A Raw data description
The data used to conduct our experiments were collected by a company called InStat. The dataset provides both event and tracking information for 104 European soccer matches in the 2017-2018 season. The original InStat datasets (both event and tracking data) use relative coordinates whose origin is the bottom-right corner of the pitch from the attacking team’s perspective (0 to 105 on the x-axis and 0 to 68 on the y-axis). The attack direction is always set to be from left to right, regardless of the home or away teams. The original columns of the event dataset are as follows: action name (pass, shot, dribble, ball out, foul, clearance, assist, and events such as goal, offside, own goal, challenges, etc.), (x, y) coordinates of the start and end, action result (successful or not), zone id, body id, time second, player name, team name, opponents, and match id. Due to the confidentiality of the InStat dataset, we are not allowed to share the data. However, the results of the proposed algorithms can be reproduced using publicly available event and tracking datasets such as Wyscout (https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Our code is publicly available online (https://github.com/Peggy4444/soccer_RL).
Appendix B Transformed data description
The possession input to all spatiotemporal models can be constructed from most of the publicly available soccer logs, such as the Wyscout dataset (https://figshare.com/collections/Soccer_match_event_dataset/4415000/2). However, generating the defensive pressure feature requires access to tracking data, which is missing in some of the datasets.
Moreover, the input to the network (for filling the replay buffer) for OPE should be prepared in the format of Table 5.
| action | all action probabilities | episode | reward | possession number | possession team | possession feature 1 (10-dimensional) | … | possession feature n (10-dimensional) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| out | [0.36, 0.25, 0.14, 0.25] | 1 | 0.01 | 1 | home | [0.40, …, 0.25] | … | [2, …, 3] |
| foul | [0.24, 0.42, 0.26, 0.10] | 1 | 0.05 | 2 | home | [0.37, …, 0.65] | … | [4, …, 1] |
| shot | [0.22, 0.25, 0.33, 0.20] | 1 | 0.28 | 3 | home | [0.48, …, 0.32] | … | [1, …, 3] |
| … | … | … | … | … | … | … | … | … |
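One way to hold a row of this format in code is sketched below. The class and field names are illustrative rather than taken from the authors' code, and the feature vector in the example is a placeholder because the table elides the actual values.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LoggedStep:
    """One row of the OPE input; field names are hypothetical."""
    action: str
    action_probs: List[float]
    episode: int
    reward: float
    possession_number: int
    possession_team: str
    possession_features: List[List[float]]

    def check(self, feature_dim: int = 10, tol: float = 0.05) -> None:
        # Tolerance allows for probabilities that were rounded to two decimals.
        if abs(sum(self.action_probs) - 1.0) > tol:
            raise ValueError("action probabilities should sum to about 1")
        for feature in self.possession_features:
            if len(feature) != feature_dim:
                raise ValueError("each possession feature should be 10-dimensional")

# Scalar columns follow the first table row; the feature vector is a placeholder.
row = LoggedStep("out", [0.36, 0.25, 0.14, 0.25], 1, 0.01, 1, "home", [[0.0] * 10])
row.check()
```

Validating rows before they fill the replay buffer catches malformed probability vectors or feature lengths early.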
Appendix C Expected goal model with logistic regression
Following state-of-the-art expected goal (xG) models in soccer analytics, we collected 15,225 shots and labeled them with two classes, goal and not goal. We then used logistic regression to estimate the probability of scoring given the features of the shot. For the implementation of logistic regression, we used the scikit-learn Python package (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). This classification model achieved an AUC of 80% with 5-fold cross-validation on the shot dataset.
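The procedure can be sketched in a dependency-free way as follows. This is not the scikit-learn pipeline used in this work: it swaps in plain batch gradient descent for the solver, uses synthetic shots, and assumes two hypothetical features (scaled distance and angle), since the actual shot features are not listed here.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auc(scores, labels):
    """Probability that a random goal is ranked above a random non-goal."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cross_validated_auc(X, y, k=5):
    """Mean held-out AUC over k folds."""
    idx = list(range(len(X)))
    aucs = []
    for fold in range(k):
        held_out = idx[fold::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = [predict(w, b, X[i]) for i in held_out]
        aucs.append(auc(scores, [y[i] for i in held_out]))
    return sum(aucs) / k

# Synthetic shots: closer shots and wider angles are more likely to be goals.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(300)]
y = [1 if random.random() < sigmoid(0.5 - 5.0 * d + 2.0 * a) else 0 for d, a in X]
mean_auc = cross_validated_auc(X, y)
```

The predicted probability for a shot then plays the role of the expected goal value of that shot.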
Appendix D Spatiotemporal models implementation

The spatiotemporal models are implemented using Keras sequential model.

Training the policy gradient method is performed via Keras with a TensorFlow backend.

We conducted training and experimental results using a Tesla K80 GPU.
Appendix E Bayesian formula for deriving possession value (PV)
Appendix F Deriving gradients from the off-policy policy gradient method
Given that training samples are drawn from the behavioral policy, we can rewrite the gradient as follows:
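A standard importance-weighted form of this gradient is shown below in assumed notation: $\pi_\theta$ is the target policy, $\pi_b$ the behavioral policy, $Q^{\pi_\theta}$ the action value function, and $d^{\pi_b}$ the state distribution induced by $\pi_b$. The symbols used elsewhere in this work may differ.

```latex
\nabla_\theta J(\theta)
  \approx \mathbb{E}_{s \sim d^{\pi_b},\; a \sim \pi_b(\cdot \mid s)}
  \left[
    \frac{\pi_\theta(a \mid s)}{\pi_b(a \mid s)}\,
    Q^{\pi_\theta}(s, a)\,
    \nabla_\theta \log \pi_\theta(a \mid s)
  \right]
```

The ratio $\pi_\theta(a \mid s) / \pi_b(a \mid s)$ reweights each logged action so that the expectation taken over behavioral samples approximates the gradient under the target policy.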
Appendix G Off-policy policy evaluation with importance sampling
The importance sampling method takes samples from the behavioral policy to evaluate the performance of the target policy. Figure 13 shows the workflow of evaluation with this method.
Appendix H Off-policy policy evaluation with doubly robust
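Following the model in [26], for an $H$-step trajectory $\tau = (s_1, a_1, r_1, \dots, s_H, a_H, r_H, s_{H+1})$ and discount factor $\gamma$, we define the state value function as $V(s)$ and the action value function as $Q(s, a)$. Then, if we are supplied with $\hat{Q}$, which is an estimate of the action value function, we can apply the doubly robust evaluator at each timestep as follows:
$$V_{DR}^{H+1-t} = \hat{V}(s_t) + \rho_t \left( r_t + \gamma V_{DR}^{H-t} - \hat{Q}(s_t, a_t) \right) \qquad (5)$$
where $\hat{V}(s) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[\hat{Q}(s, a)]$, $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_b(a_t \mid s_t)$ is the ratio of target to behavioral action probabilities, and $V_{DR}^{0} = 0$. Therefore, the doubly robust estimate of the target policy value is $V_{DR}^{H}$.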
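The step-wise doubly robust recursion of [26] can be sketched as below. Starting from a value of zero, each step, processed from the last to the first, combines the estimated state value with an importance-weighted correction of the estimated action value. The tuple layout and the callables `q_hat` and `v_hat` are assumptions for illustration.

```python
def doubly_robust_value(trajectory, q_hat, v_hat, gamma=1.0):
    """Doubly robust estimate for one trajectory.

    trajectory: list of (state, action, reward, behavior_prob, target_prob)
    in time order. q_hat(state, action) estimates the action value and
    v_hat(state) is its expectation under the target policy.
    """
    v_dr = 0.0  # value with zero steps remaining
    for state, action, reward, behavior_prob, target_prob in reversed(trajectory):
        rho = target_prob / behavior_prob  # per-step importance ratio
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr

# Hypothetical one-step episode in which the estimated action value matches the reward.
episode = [("s1", "shot", 0.4, 0.25, 0.5)]
estimate = doubly_robust_value(episode, q_hat=lambda s, a: 0.4, v_hat=lambda s: 0.3)
```

When the value estimates are accurate, the correction term vanishes and the estimate falls back on the model; when they are set to zero, the recursion reduces to step-wise importance sampling.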
Appendix I Numerical Experiment Details
In all the experiments we used the following parameters:

discount factor

learning rate in policy gradient

learning rate in deep learning