1 Introduction
Over the last decade, interest in analytics has grown in football (soccer) and many other team sports. Increasing compute power and data availability have made statistical analysis more effective and, more importantly, have enabled compute-intensive and data-intensive machine learning methods. Many success stories have been well documented in mainstream publications such as “The Numbers Game”
[Anderson2013TheWrong], “Basketball on Paper” [Oliver2020BasketballAnalysis] and, perhaps most well known, “Moneyball” [Lewis2004Moneyball:Game]. As a result, a growing number of sports teams now employ analytics specialists. If such trends continue, both compute power and the amount of available data are likely to increase exponentially in the coming years. However, it will remain nearly impossible to collect real-world sports data in a scientific manner where variables can be controlled. This cannot be helped, since top-level sports are highly competitive in nature and leave very little room for experimentation. To solve this problem, agent-based simulation (ABS) can be used as a testbed to simulate various scenarios in a scientific manner.
Recently, deep reinforcement learning (RL) methods have shown that it is possible to train agents, from scratch, that outperform human experts in both traditional [Silver2016MasteringSearch, silver2017mastering] and modern games [Mnih2013PlayingLearning, alphastarblog, Berner2021DotaLearning]. These breakthroughs, coupled with increasingly sophisticated simulation environments, open a promising new direction of analysis in sports. Therefore, in this paper, we examine the characteristics of football-playing RL agents and uncover how strategies may develop during training. Out of the many team sports that exist, we choose to focus on football due to its popularity and the availability of a sufficient simulation environment (see §2 for more detail). We use the Google Research Football environment [Kurach2019GoogleEnvironment] to train football-playing RL agents in a single-agent manner. Fig. 1 illustrates the training setup we used. Another problem concerning the use of ABS is that the domain gap between RL agents and real-world football players is not clear.
To gain a better understanding of this domain gap, we compared the characteristics of football strategies in RL agents and realworld football players. In summary, the main contributions of the study are as follows:

We compared the characteristics of football-playing RL agents [Kurach2019GoogleEnvironment] trained under various training processes with those of real-world football players for the first time, thus verifying simulations as a practical approach for football analysis.

We found that more competitive RL agents have a well-balanced passing strategy that is more similar to that of real-world footballers than less competitive RL agents do.

We analyzed how the football strategies of RL agents evolve as the competitiveness of the agent increases. Strong correlations were found between the competitiveness of the agent and many aggregated statistics and social network analysis metrics.
The outline of this paper is as follows. §2 provides background on agent-based simulation, deep RL and football analytics. §3 and §4 discuss the preliminaries and methods used to train deep RL agents and the metrics used to analyse playing characteristics. We present results and discussions in §5. Finally, we summarise our conclusions and future work in §6.
2 Related Works
2.1 Agent-Based Simulation
Agent-based simulation (ABS) is a computationally demanding technique for simulating dynamic complex systems and observing “emergent” behaviour. With ABS, we can explore different outcomes of phenomena for which it is infeasible to test and formulate hypotheses in real life. In the context of football, we can use ABS to examine the effects of different formations on match outcomes or to study various play styles using millions of simulated football games. The availability of good simulation environments is critical to ABS. Fortunately, football has received a lot of attention in this field thanks to the long history of the RoboCup simulation track [itsuki1995soccer]. In recent years, many other simulation environments have also been introduced [Liu2019EmergentCompetition, Cao2020REINFORCEMENTCRITICS, Liu2021FromFootball]. Among these, the Google Research Football environment [Kurach2019GoogleEnvironment] stands out as an interesting testbed. Kaggle has held a competition with over a thousand participating teams (https://www.kaggle.com/c/googlefootball), and researchers have already started to develop methods to analyze football matches using Google Research Football via graphical tools [PinciroliVago2020INTEGRA:Matches] or RL-inspired metrics [Garnier2021EvaluatingLearning]. Therefore, we choose to use the Google Research Football environment to conduct our simulations. It reproduces a full football match with all of its usual regulations and events, as well as player tiredness, misses, etc. We list an overview of available simulation environments in Table 1.
Environment  Description 

RoboCup Soccer [itsuki1995soccer]  An 11 vs 11 soccer simulator. Agents receive noisy input from virtual sensors and perform some basic commands such as dashing, turning or kicking. 
MuJoCo 2 vs 2 [Liu2019EmergentCompetition]  A 2 vs 2 football environment with simulated physics built on MuJoCo [Todorov2012MuJoCo:Control]. Uses relatively simple bodies with a 3-dimensional action space. 
Unity 2 vs 2 [Cao2020REINFORCEMENTCRITICS]  A 2 vs 2 football environment built on Unity. Two types of players with slightly different action spaces are available. 
Google Research [Kurach2019GoogleEnvironment]  An 11 vs 11 soccer environment built on GameplayFootball. Simulates a full football game and includes common aspects such as goals, fouls, corners, etc. 
Humanoid [Liu2021FromFootball]  A 2 vs 2 football environment with simulated physics built on MuJoCo [Todorov2012MuJoCo:Control] designed to embed sophisticated motor control of the humanoid. Physical aspects such as the radius of the ball and goal size are adjusted in proportion to the height of the humanoid. 
2.2 Deep Reinforcement Learning
Deep RL is a subset of RL that combines the traditional reinforcement learning setup, in which agents learn optimal actions in a given environment, with deep neural networks. There have been many remarkable examples of agents trained via deep RL outperforming experts. One such example is DeepMind’s AlphaGo [Silver2016MasteringSearch]. Its successors AlphaZero [silver2018general] and MuZero [Schrittwieser2020MasteringModel] achieved a superhuman level of play in the games of chess, shogi and Go solely via self-play.
In contrast to the single-player, deterministic, perfect-information setup of the classical games mentioned above, football is a highly stochastic, imperfect-information game with multiple players that form a team. Although these characteristics have made it difficult to learn through self-play, recent works have shown promising results in similarly categorised games such as DotA and StarCraft. For example, OpenAI Five [Berner2021DotaLearning] scaled existing RL systems to unprecedented levels, while performing “surgery” to utilise thousands of GPUs over multiple months. AlphaStar [Vinyals2019GrandmasterLearning], on the other hand, populated a league consisting of agents with distinct objectives, and introduced agents that specifically try to exploit shortcomings in other agents and in the league. This allowed agents to train while continually adapting strategies and counter-strategies.
As for research directly related to football, robot soccer [itsuki1995soccer] has been one of the longstanding challenges in AI. Although this challenge has been tackled with machine learning techniques [Riedmiller2009ReinforcementSoccer, Macalpine2018Journal], it has not yet been mastered by end-to-end deep RL. Nonetheless, baseline approaches for other simulation environments mostly utilise deep RL. [Liu2019EmergentCompetition] used population-based training with evolution and reward shaping on a recurrent policy with a recurrent action-value estimator in MuJoCo Soccer, whereas [Cao2020REINFORCEMENTCRITICS] showed that RL from hierarchical critics was effective in the Unity 2 vs 2 environment. Proximal Policy Optimization (PPO) [Schulman2017ProximalAlgorithms], IMPALA [Espeholt2018IMPALA:Architectures] and Ape-X DQN [Horgan2018DistributedReplay] were provided as benchmark results for Google Research Football [Kurach2019GoogleEnvironment]. Finally, a combination of imitation learning, single- and multi-agent RL and population-based training was used in Humanoid Football [Liu2021FromFootball].
Many researchers have attempted to model the behaviour of players by predicting the short-term future, in contrast to the long-horizon goal approach of deep RL [le2017coordinated, Felsen2018WhereAutoencoders, Yeh_2019_CVPR]. Such research offers important insights into which architectures, time horizons and rewards may be effective.
2.3 Football Analytics
Football has been considered to be one of the most challenging sports to analyze due to the number of players, continuous events and low frequency of points (goals). Therefore, it is only recently that a data-driven approach has started to gain attention. Nevertheless, numerous approaches have been proposed, from the simple aggregation of individual/team play statistics [Novatchkov2013ArtificialTraining] to complex methods, such as those that use gradient boosting to model the value of actions [decroos2018actions]. In general, one can observe two different types of analysis. The first focuses on evaluating the overall performance of a single player or team. In this case, each action is usually assigned a value, and the values are then aggregated by player or team. [decroos2018actions] assigned values to on-ball actions by measuring their effect on the probability that a team will score. In turn, [fernandez2018wide] proposed a method to value off-the-ball actions by estimating pitch value with a neural network. The second category of analysis is strategy or play style analysis. Methods such as automatic formation [Bialkowski2016DiscoveringData] or tactic [Gyarmati2015AutomaticTeams, Decroos2018AutomaticData] discovery fall into this category. Social network analysis is also a well-used method to analyse interactions between players [Clemente2016SocialAnalysis, Buldu2018UsingGame]. Network metrics such as betweenness, centrality and eccentricity are often used. [Pena2012AStrategies] demonstrated that winning teams presented lower betweenness scores. Similarly, [Goncalves2017ExploringFootball] provided evidence that a lower passing dependency for a given player and higher intra-team well-connected passing relations may optimise team performance.
3 Preliminaries
3.1 Proximal Policy Optimization
To learn policies for agents to play Google Research Football, we follow the original paper [Kurach2019GoogleEnvironment] and use Proximal Policy Optimisation (PPO) [Schulman2017ProximalAlgorithms]. PPO belongs to a family of reinforcement learning called policy gradient methods. These methods try to find an optimal behaviour strategy by alternating between optimising a clipped surrogate objective function and sampling data through interactions with the environment. The objective function of PPO is denoted as follows,
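$$J(\theta) = \mathbb{E}\left[\min\left(r(\theta)\,\hat{A}_{\theta_{old}}(s, a),\ \mathrm{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon)\,\hat{A}_{\theta_{old}}(s, a)\right)\right]$$
where

$r(\theta) = \pi_{\theta}(a \mid s) / \pi_{\theta_{old}}(a \mid s)$ is the probability ratio between old and new policies.

$\pi_{\theta}(a \mid s)$ is a policy, given parameter $\theta$, state $s$ and action $a$.

$\mathrm{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon)$ clips $r(\theta)$ to be in the range of $1 - \epsilon$ and $1 + \epsilon$.

$\hat{A}_{\theta_{old}}(s, a)$ is an estimate of the advantage function $A(s, a) = Q(s, a) - V(s)$, given action-value function $Q(s, a)$ and state-value function $V(s)$.
Typically $\theta$ is updated via stochastic gradient ascent with an optimiser such as Adam [Kingma2014Adam:Optimization].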
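As an illustration of the clipped surrogate objective, the following minimal sketch evaluates it for a small batch of samples in plain Python. It is not the training code used in this paper: the function name `clipped_surrogate`, the use of log-probabilities as inputs and the default `epsilon=0.2` are assumptions made for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch of (state, action) samples.

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and old policies. epsilon is the clip range (0.2 is an assumed,
    illustrative value, not one taken from the paper).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r(theta)
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum removes the incentive to move the ratio
        # outside [1 - epsilon, 1 + epsilon] when that would raise the objective.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + ϵ, whereas with a negative advantage a large ratio is still fully penalised, which is what keeps each policy update conservative.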
3.2 TrueSkill™ Ranking System
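To measure the competitiveness of the learned RL agents, the TrueSkill™ ranking system [Herbrich2007TrueSkill:System] was used. The TrueSkill™ ranking system is a skill-based ranking system that quantifies a player’s rating using Bayesian inference. This system has been frequently used in many different multiplayer games and sports applications [Tarlow2014KnowingUncertainty]. Although it also works well with team games and free-for-all games, we focus our attention on the simplest case, a two-player match.
Each rating is characterised by a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. These values are updated based on the outcome of a game with the following update equations,
$$\mu_{winner} \leftarrow \mu_{winner} + \frac{\sigma_{winner}^2}{c} \cdot v\left(\frac{\mu_{winner} - \mu_{loser}}{c}, \frac{\varepsilon}{c}\right) \quad (1)$$
$$\mu_{loser} \leftarrow \mu_{loser} - \frac{\sigma_{loser}^2}{c} \cdot v\left(\frac{\mu_{winner} - \mu_{loser}}{c}, \frac{\varepsilon}{c}\right) \quad (2)$$
$$\sigma_{winner}^2 \leftarrow \sigma_{winner}^2 \cdot \left[1 - \frac{\sigma_{winner}^2}{c^2} \cdot w\left(\frac{\mu_{winner} - \mu_{loser}}{c}, \frac{\varepsilon}{c}\right)\right] \quad (3)$$
$$\sigma_{loser}^2 \leftarrow \sigma_{loser}^2 \cdot \left[1 - \frac{\sigma_{loser}^2}{c^2} \cdot w\left(\frac{\mu_{winner} - \mu_{loser}}{c}, \frac{\varepsilon}{c}\right)\right] \quad (4)$$
$$c^2 = 2\beta^2 + \sigma_{winner}^2 + \sigma_{loser}^2 \quad (5)$$
where $\varepsilon$ is a configurable parameter that should be adjusted according to the likelihood of a draw, and $\beta^2$ is the variance of the performance around the skill of each player. $v$ and $w$ are functions that are designed so that weighting factors are roughly proportional to the uncertainty of the winner/loser vs. the total sum of uncertainties. We refer the reader to the original paper [Herbrich2007TrueSkill:System] for further explanation. Finally, a so-called conservative skill estimate can be calculated by $\mu - k\sigma$, where $k$ is usually set to 3.
3.3 Social Network Analysis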
To analyse the intelligence of coordinated RL agents and compare their characteristics with real-world data, an analysis framework that is not influenced by physical differences between simulations and the real world is necessary. Passes do not rely on individual physical ability and are an important component of team play. Therefore, we focus on social network analysis (SNA) of passes.
A pass network is a weighted directed graph that considers the direction and frequency of passes between two players. It takes the form of an adjacency matrix $A$ and weight matrix $W$. $W_{ij}$ represents the number of passes from player $i$ to player $j$, and $A_{ij}$ is simply $1$ if $W_{ij} > 0$ or $0$ otherwise. Below, we explain the three metrics used in this paper.
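The construction of a pass network can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `build_pass_network`, the representation of a pass as a `(passer, receiver)` index pair and the rule that an adjacency entry is 1 whenever at least one pass was made are assumptions for the example.

```python
def build_pass_network(passes, n_players):
    """Return (W, A) from a list of (passer, receiver) player-index pairs.

    W[i][j] counts passes from player i to player j.
    A[i][j] is 1 if at least one such pass exists, else 0 (assumed definition).
    """
    W = [[0] * n_players for _ in range(n_players)]
    for passer, receiver in passes:
        if passer != receiver:  # a player cannot pass to themselves
            W[passer][receiver] += 1
    A = [[1 if count > 0 else 0 for count in row] for row in W]
    return W, A


# Toy example with three players (indices are hypothetical).
example_passes = [(0, 1), (0, 1), (1, 2), (2, 0), (1, 0)]
W, A = build_pass_network(example_passes, n_players=3)
```

Because the graph is directed, `W[0][1]` and `W[1][0]` are counted separately, so the matrices are generally not symmetric.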
Closeness Centrality. Closeness is calculated by computing the sum of all the geodesic (shortest) path distances between the node $v$ and all other nodes $w \in V$ in the following equation.
$$\mathrm{Closeness}(v) = \frac{1}{\sum_{w \in V} \sigma_{vw}}$$
where $\sigma_{vw}$ is defined as the shortest distance between nodes $v$ and $w$. This score indicates how easy it is for a player to be connected with teammates. Therefore, a high closeness score indicates that a player is well-connected within the team.
Betweenness Centrality. Betweenness is calculated by counting the total number of geodesic paths linking $s$ and $t$ and the number of those paths that intersect a node $v$ in the following equation.
$$\mathrm{Betweenness}(v) = \sum_{s \neq v \in V} \sum_{t \neq v \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
where $\sigma_{st}$ is the number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through node $v$. This score indicates how much a player acts as a bridge between passing plays. A high deviation within a team may indicate a well-balanced passing strategy and less dependence on a single player.
Pagerank Centrality. Pagerank is calculated based on the total number of passes a player made in the following equation.
$$\mathrm{Pagerank}(v) = p \sum_{w \neq v} \frac{A_{vw}}{L_{w}^{out}} \mathrm{Pagerank}(w) + q$$
where $L_{w}^{out}$ is the total number of passes made by player $w$, $p$ represents the probability that a player will decide to pass the ball, and $q$ can be thought of as “free popularity”; $p$ and $q$ are heuristic parameters. These parameters are set to $p = 0.85$ and $q = 1$ following [Pena2012AStrategies]. A high pagerank score implies that the player is a popular choice for other players to pass to.
4 Proposed Analysis Framework
In this section, we present the details of our proposed analysis framework, which is outlined in Fig. 2, and the setup of the subsequent experiments. Our framework consists of five parts. In the first part (i), we train agents using proximal policy optimisation in the Google Research Football simulation environment. In the second part (ii), we rank the agents with the TrueSkill ranking system. In the third part (iii), we extract event data concerning on-the-ball actions from the simulations and convert it into a tabular format. This format is similar to the Soccer Player Action Description Language (SPADL) but simplified to only include passes and shots. We also convert real-world football data into the same format. Finally, we perform (iv) correlation analysis and (v) social network analysis on the obtained data.
4.1 Agent Training and Ranking
To train agents, we closely follow the setup of the baseline agents for the Google Research Football environment presented in [Kurach2019GoogleEnvironment]. An agent controls a single active player at each timestep and can switch control to any other player on the same team (excluding the goalkeeper). Non-active players are controlled by an in-game rule-based system. In this system, the behaviour of the non-active players corresponds to simple actions such as running towards the ball when not in possession, or moving forward together with the active player when in possession. Hence, the players can be regarded as being centrally controlled. In this paper, we consider multi-agent RL to be out of scope and hope to pursue such a setup in the future.
4.1.1 Deep RL Implementation
The training pipeline is as follows. First, we reproduce the results presented in [Kurach2019GoogleEnvironment] by using the same hyperparameters and training setup. The deep RL agent uses the PPO algorithm [Schulman2017ProximalAlgorithms] as described in §3.1, with an IMPALA policy [Espeholt2018IMPALA:Architectures]. The architecture is shown in Fig. 3.
Each state of the simulation is represented by a Super Mini Map (SMM) based on [Kurach2019GoogleEnvironment]. The SMM consists of four matrices, each a binary representation of the locations of the home team players, the away team players, the ball and the active player, respectively. A visualisation can be found in Fig. 4. The actions available to the central control agent are displayed in Table 2 (see https://git.io/Jn7Oh for a complete overview of observations and actions). Each movement action is sticky; once executed, the action persists until there is an explicit stop action.
Top  Bottom  Left 
Right  TopLeft  TopRight 
BottomLeft  BottomRight  Shot 
Short Pass  High Pass  Long Pass 
Idle  Sliding  Dribble 
StopDribble  Sprint  StopMoving 
StopSprint     
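To make the Super Mini Map representation described above more concrete, the sketch below builds four binary planes from player and ball coordinates. It is only an illustration of the idea: the grid size, the coordinate ranges and all function names are assumptions for this example and should be checked against the environment documentation.

```python
WIDTH, HEIGHT = 96, 72                          # assumed grid size (columns, rows)
X_RANGE, Y_RANGE = (-1.0, 1.0), (-0.42, 0.42)   # assumed pitch coordinate ranges


def to_cell(x, y):
    """Map pitch coordinates to a (row, col) grid cell, clamped to the grid."""
    col = int((x - X_RANGE[0]) / (X_RANGE[1] - X_RANGE[0]) * WIDTH)
    row = int((y - Y_RANGE[0]) / (Y_RANGE[1] - Y_RANGE[0]) * HEIGHT)
    return min(max(row, 0), HEIGHT - 1), min(max(col, 0), WIDTH - 1)


def binary_plane(points):
    """Return a HEIGHT x WIDTH matrix with 1 in every cell that holds a point."""
    plane = [[0] * WIDTH for _ in range(HEIGHT)]
    for x, y in points:
        row, col = to_cell(x, y)
        plane[row][col] = 1
    return plane


def build_smm(home_players, away_players, ball, active_player):
    """Stack the four planes: home team, away team, ball, active player."""
    return [
        binary_plane(home_players),
        binary_plane(away_players),
        binary_plane([ball]),
        binary_plane([active_player]),
    ]


# Toy state: two home players, one away player, ball at the centre spot.
smm = build_smm([(0.0, 0.0), (-1.0, -0.42)], [(1.0, 0.42)], (0.0, 0.0), (0.0, 0.0))
```

Stacking the planes as channels gives an image-like input, which is why a convolutional policy network can be applied directly to this state representation.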
Rewards are based on whether a goal is conceded, scored, or neither. In addition to this goal-based reward, a small “checkpoint” reward is used to aid initial development, when goals are sparse. We refer the reader to [Kurach2019GoogleEnvironment] for a more in-depth description of possible training setups.
Based on the above setup, we started by training for 50 million timesteps against the built-in easy, medium and hard level bots. During this phase, we noticed that the performance of the agents had not converged. Therefore, we trained for an extra 50 million timesteps against the easy and medium bots and an extra 150 million timesteps against the hard-level bot. The average goal difference for the resulting agents at 50, 100 and 200 million timesteps is presented in Table 3.
Bot Level  50M  100M  200M 

Easy  5.66  8.20   
Medium  0.93  2.35   
Hard  0.08  1.25  2.81 
4.1.2 TrueSkill Ranking Implementation
To implement the TrueSkill ranking, we create a round-robin tournament composed of 15 agents (5 from each setup: easy, medium and hard) using intermediate checkpoints saved at 20%, 40%, 60%, 80% and 100% of training. In a single round-robin tournament, each agent plays every other agent once, giving 105 matches. We conducted a total of 50 round-robin tournaments, resulting in a total of 5250 matches. Next, we use the resulting scores of all 5250 matches to calculate a TrueSkill rating for each agent. We show the top-3 and bottom-3 ranked agents of the resulting leaderboard in Table 4. Notice that the agents trained against the easy-level built-in bot rank 1st, 2nd and 3rd. This result seems counter-intuitive, since agents trained longer against stronger built-in bots should be more competitive. This suggests that there could be better training strategies. However, exploring alternative training strategies is out of scope for this work and is left for future work.
Ranking  Bot Level  Checkpoint %  rating 

1  Easy  80%  34.1 
2  Easy  100%  31.5 
3  Easy  40%  31.5 
…  
13  Easy  20%  8.3 
14  Hard  20%  7.9 
15  Medium  20%  7.0 
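The rating procedure can be sketched with the standard two-player TrueSkill update from [Herbrich2007TrueSkill:System], restricted here to decisive results. This is a simplified illustration, not the implementation used for Table 4: draws are not handled, and the prior (μ = 25, σ = 25/3), β = 25/6 and all function names are assumptions chosen for the example.

```python
import math
from itertools import combinations


def _pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def v_win(t, eps):
    """Mean correction factor for a win (no draw)."""
    return _pdf(t - eps) / _cdf(t - eps)


def w_win(t, eps):
    """Variance correction factor for a win (no draw)."""
    v = v_win(t, eps)
    return v * (v + t - eps)


def update(winner, loser, beta=25.0 / 6.0, eps=0.0):
    """Update two (mu, sigma) ratings after the first player beats the second."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2.0 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v, w = v_win(t, eps / c), w_win(t, eps / c)
    new_winner = (mu_w + sig_w ** 2 / c * v,
                  math.sqrt(sig_w ** 2 * (1.0 - sig_w ** 2 / c ** 2 * w)))
    new_loser = (mu_l - sig_l ** 2 / c * v,
                 math.sqrt(sig_l ** 2 * (1.0 - sig_l ** 2 / c ** 2 * w)))
    return new_winner, new_loser


def conservative_estimate(rating, k=3.0):
    """Conservative skill estimate mu - k * sigma."""
    return rating[0] - k * rating[1]


# Schedule size: every pair of the 15 agents meets once per tournament.
pairings = list(combinations(range(15), 2))
total_matches = 50 * len(pairings)

# One decisive match between two agents that both start from the assumed prior.
prior = (25.0, 25.0 / 3.0)
new_winner, new_loser = update(prior, prior)
```

Sorting agents by the conservative estimate rather than by the mean alone penalises ratings that are still uncertain, which matters when some pairings have produced few decisive results.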
4.2 Data Extraction
Action data and observation data are extracted from the games saved when calculating the TrueSkill ranking. From this data, we extract all pass and shot actions and programmatically label their results based on the following events. For real-world football data, we use event-stream data for five matches played between three teams in the 2019-2020 J1 League. The J1 League is the top division of the Japan professional football league. The data was purchased from DataStadium Inc. We show the match results in Table 5. The three teams, Kashima Antlers, FC Tokyo and Yokohama F Marinos, were chosen since they were the top three teams on the leaderboard at the time.
Date  Home Team  Score  Away Team 

2019/04/14  FC Tokyo  (1-3)   
2019/04/28     (2-1)   
2019/06/29  FC Tokyo  (4-2)   
2019/08/10     (2-1)   
2019/09/14     (2-0)  FC Tokyo 
We also extract all pass and shot actions from this data. The resulting format of both simulation and real-world data is tabular and is a simplified version of SPADL [Decroos2019ActionsSoccer]. An explanation of the variables used in the analysis is listed in Table 6.
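A simplified tabular format of this kind could look like the sketch below, which keeps only passes and shots from a raw event stream. The field names, the input dictionary keys and the function `to_simplified_rows` are hypothetical and chosen for illustration; they are not the variables of Table 6 or the SPADL schema itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionRow:
    """One row of a simplified, SPADL-like action table (hypothetical fields)."""
    match_id: str
    team: str
    player: int
    action_type: str              # "pass" or "shot"
    success: bool
    receiver: Optional[int] = None  # only set for passes


def to_simplified_rows(match_id, events):
    """Keep only pass and shot events and convert them to ActionRow objects."""
    rows = []
    for event in events:
        if event["type"] not in ("pass", "shot"):
            continue  # all other on-the-ball actions are dropped
        rows.append(ActionRow(
            match_id=match_id,
            team=event["team"],
            player=event["player"],
            action_type=event["type"],
            success=event["success"],
            receiver=event.get("receiver"),
        ))
    return rows


# Toy event stream with hypothetical keys.
raw_events = [
    {"type": "pass", "team": "home", "player": 3, "receiver": 7, "success": True},
    {"type": "dribble", "team": "home", "player": 7, "success": True},
    {"type": "shot", "team": "home", "player": 7, "success": False},
]
rows = to_simplified_rows("match-1", raw_events)
```

Using one shared row type for simulated and real-world matches means the same downstream statistics and pass networks can be computed without source-specific code.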
4.3 Data Analysis
Two types of football analysis are applied to the extracted data. We first focus on finding statistics and metrics that correlate with the agent’s TrueSkill ranking. For this, we calculate simple descriptive statistics, such as the number of passes/shots, and social network analysis (SNA) metrics, such as closeness, betweenness and pagerank. As explained in §3.3, SNA was chosen because it describes a team’s ball passing strategy. Therefore, it is sufficient for the analysis of central-control-based RL agents. We calculate the Pearson correlation coefficient $r$ and the $p$-value for testing non-correlation. The following criteria were used to interpret the magnitude of correlation: values less than 0.3 were interpreted as trivial; between 0.3 and 0.5 as moderate; between 0.5 and 0.7 as strong; between 0.7 and 0.9 as very strong; more than 0.9 as nearly perfect. A $p$-value less than 0.05 is considered statistically significant; any result above this threshold is deemed unclear.
Our second focus is the comparison of SNA metrics between RL agents and real-world football data. By using SNA metrics, we can compare the ball passing strategy between RL agents and real-world football data. To ensure fairness, we bootstrap samples of passes from each team before generating a pass network to analyse. We repeat this process 50 times. Then, we conduct normality tests to check whether the distribution is Gaussian. Finally, we plot and visually inspect the distribution.
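The bootstrap procedure can be sketched as follows, using the minimum pagerank in the network as the example metric. This is a rough illustration under stated assumptions, not the analysis code of this paper: the sample size, the fixed-point iteration, the convention that the weight from `w` to `v` counts passes from `w` to `v`, the values `p=0.85` and `q=1.0`, and all function names are assumptions.

```python
import random


def pagerank(W, p=0.85, q=1.0, iters=200):
    """Iterate x_v = p * sum_w (W[w][v] / out_w) * x_w + q to a fixed point.

    W[w][v] is the number of passes from player w to player v (assumed
    convention); out_w is the total number of passes made by player w.
    """
    n = len(W)
    out = [sum(row) for row in W]
    x = [1.0] * n
    for _ in range(iters):
        x = [
            p * sum(W[w][v] / out[w] * x[w]
                    for w in range(n) if w != v and out[w] > 0) + q
            for v in range(n)
        ]
    return x


def bootstrap_min_pagerank(passes, n_players, sample_size, n_boot=50, seed=0):
    """Resample passes with replacement and return min pagerank per resample."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        sample = rng.choices(passes, k=sample_size)  # sampling with replacement
        W = [[0] * n_players for _ in range(n_players)]
        for passer, receiver in sample:
            W[passer][receiver] += 1
        values.append(min(pagerank(W)))
    return values


# Toy data: a three-player team (indices are hypothetical).
toy_passes = [(0, 1), (1, 2), (2, 0), (0, 2), (1, 0), (0, 1)]
distribution = bootstrap_min_pagerank(toy_passes, n_players=3, sample_size=20)
```

Drawing the same number of passes for every team removes the effect of different match lengths, so the resulting distributions can be compared between simulated and real-world teams.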
5 Results and Discussion
In this section, we show the results of the two types of data analysis detailed in §4.3. The first is a correlation analysis between descriptive statistics / SNA metrics and TrueSkill rankings. The second is a comparative analysis which uses SNA metrics generated from RL agents (Google Research Football) and real-world football players (2019-2020 season J1 League).
5.1 Correlation Analysis
For each team an agent controls, descriptive statistics and SNA metrics were calculated using the variables listed in Table 6. The Pearson correlation coefficients are shown in Table 7.
Metric  Correlation coefficient  p-value 

Total Passes  0.5  0.061  
Total Shots  0.77  0.001  
Successful Pass Pct  0.62  0.014  
Successful Shot Pct  0.68  0.005  
PageRank (std)  0.58  0.022  
PageRank (mean)  0.05  0.848  
PageRank (max)  0.48  0.068  
PageRank (min)  -0.91  0.001  
Closeness (std)  0.54  0.036  
Closeness (mean)  0.64  0.010  
Closeness (max)  0.61  0.015  
Closeness (min)  0.66  0.007  
Betweenness (std)  0.65  0.009  
Betweenness (mean)  0.72  0.002  
Betweenness (max)  0.65  0.009  
Betweenness (min)  0.0  0.0 
As can be seen in Table 7, many of the descriptive statistics and SNA metrics have a strong correlation with TrueSkill rankings. We observe that “Total Shots” and “Betweenness (mean)” have a very strong positive correlation with TrueSkill rankings. On the other hand, “PageRank (min)” has a nearly perfect negative correlation.
The metric with the largest overall correlation is the pagerank aggregated by the minimum value in the network ($r = -0.91$, $p = 0.001$). We present a scatter plot of this metric in Fig. 5.
Since pagerank roughly assigns to each player the probability that they will have the ball after an arbitrary number of passes, the node with the minimum pagerank centrality is likely to be the goalkeeper, from whom we assume the agent quickly learns to keep the ball away. Another interesting finding is the strong positive correlation with the standard deviation of betweenness ($r = 0.65$, $p = 0.009$). This metric is also presented as a scatter plot in Fig. 6.
A large variance in betweenness has been demonstrated to be related to a well-balanced passing strategy and less dependence on specific players [Clemente2016SocialAnalysis]. It is fascinating that the agents learn to prefer a well-balanced passing strategy as TrueSkill increases. In general, most of the metrics presented in Table 7 have a moderate to strong correlation, either negative or positive, with the TrueSkill rating.
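The interpretation criteria of §4.3 can be sketched in a few lines. The code below computes a Pearson correlation coefficient and maps its magnitude to the labels used in this paper; it does not compute the p-value, and the function names and the handling of values that fall exactly on a boundary are assumptions for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def magnitude_label(r):
    """Label the magnitude of a correlation, ignoring its sign."""
    a = abs(r)
    if a < 0.3:
        return "trivial"
    if a < 0.5:
        return "moderate"
    if a < 0.7:
        return "strong"
    if a < 0.9:
        return "very strong"
    return "nearly perfect"


# Labels for three coefficients reported in Table 7.
labels = [magnitude_label(r) for r in (0.77, 0.62, -0.91)]
```

Because the label depends only on the absolute value, the sign of the coefficient has to be reported alongside it to distinguish positive from negative relationships.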
5.2 Comparative Analysis Between Simulated and Realworld Football
As explained in §4.2, for each of the five real-world football matches played by three teams, we calculated the distribution of SNA metrics. Distributions were calculated by bootstrapping samples of passes 50 times. The same procedure was applied to the matches played by the best and worst ranked agents (see Table 4). In Fig. 7, we visualise each of the three SNA metrics aggregated by two different methods. Aggregation methods that showed strong correlations in Table 7 were chosen. The total number of passes and shots per match cannot be fairly compared between RL agents and real-world footballers because of different match lengths. In summary, a total of six variables were compared over five agents/teams (worst RL agent, best RL agent, FC Tokyo, Kashima Antlers and Yokohama F Marinos).
Observing this visualisation, we can see that the distributions of the “Betweenness (mean)”, “Betweenness (std)” and “Closeness (std)” metrics for the worst agent are distant from the others. The fact that the best agent’s distributions of the same metrics are much closer to those of the J League teams implies that the agent has learnt to play in a similar style through RL. However, the same cannot be said for the other metrics, “Closeness (mean)”, “PageRank (std)” and “PageRank (min)”.
From the perspective of football analysis, the distribution of “Betweenness (std)” is very interesting. Since a high deviation in betweenness may indicate a well-balanced passing strategy and less dependence on a single player, we can hypothesise that agents are learning to play a more well-balanced passing strategy similar to that of real-world footballers.
Although it is difficult to interpret the results from the PageRank and Closeness metrics, it is surprising that even the worst RL agents have overlapping distributions with the real-world footballers. Considering that even the worst RL agent was trained for millions of timesteps, this may be because strategies related to PageRank and Closeness are easier to learn.
6 Conclusions and Future Work
In this paper, we compared the characteristics and play styles of RL agents of increasing competitiveness. As a result, we found many metrics that strongly correlate with the competitiveness (TrueSkill rating) of an agent. Another contribution of this paper is the comparison between RL agents and real football players. Our findings suggest that an RL agent can learn to play football in a style similar to that of real players without being explicitly programmed to do so.
There are many directions in which the research presented in this paper can be extended. In particular, we plan to work on increasing the degrees of freedom within the simulations to create a more realistic environment. This can be achieved by conducting multi-agent simulation, where an RL agent controls a single active player rather than a whole team. Another approach would be to use a less restrictive environment such as the “Humanoid Football” environment to introduce biomechanical movements. Although both approaches appear interesting, improvements in training methodology, such as imitation learning and auto-curricular learning, may be required to produce adequate agents.
We also noticed that it was difficult to use state-of-the-art football analysis methods due to different representations of the underlying data. Since efficient representations such as SPADL already exist, we hope other researchers can build on top of these so that the community can easily take advantage of existing methods.