Introduction
Agent-based modelling (ABM) is a computational modelling approach most often applied to the study of complex adaptive systems. ABMs typically represent individuals directly, and situate these agents in a virtual environment of some kind. These agents then engage in varied behaviours encoded in a set of decision rules that drive their actions in response to behavioural, environmental and social change. The resultant complex interactions between agents and their environments can lead to emergent behaviours, in which the patterns of behaviour seen at the population level have new properties that are not straightforwardly attributable to individual-level actions.
The ability of ABMs to model complex interactions and to demonstrate emergence has made ABM particularly relevant to those disciplines of the social sciences where individual agency is considered important to population-level outcomes. This is not a new phenomenon; one of the very first ABMs was a social model, a simple model of residential housing segregation designed by Thomas Schelling (1971). Since the 1980s and Axelrod's The Evolution of Cooperation Axelrod (1984), this synergy with the social sciences has led to the development of the field of social simulation, in which this variety of computational social science is used to examine the development and evolution of human society in a wide variety of circumstances Silverman (2018). In recent years, more applied areas of social science, such as public health, have proposed ABM as a means to investigate societal responses to new social or economic policies Rutter et al. (2017).
As ABM becomes more commonplace in policy debates, methodological discussions amongst ABM practitioners have focused on the development of more efficient means for understanding the outputs of these complex simulations. Even a simple ABM may have several interacting processes affecting its agents, meaning that the relationships between model parameters can be highly non-linear – small changes to one parameter can lead to unexpectedly large effects on simulation outcomes, or vice versa. This, in turn, means that understanding an ABM is a complex undertaking, often involving detailed sensitivity analyses performed using thousands of repeated runs at a wide range of parameter values. When ABMs are highly complex, performing these kinds of analyses becomes both time- and cost-prohibitive, leading many to truncate these analyses or eliminate them entirely.
Recent developments in uncertainty quantification (UQ) have proven fruitful for ABM researchers. Statistical emulation allows the modeller to build a statistical model of the simulation itself, meaning that repeated simulation 'runs' can be completed in mere seconds using a statistical surrogate of the original complex ABM Kennedy and O'Hagan (2001); O'Hagan (2006). The most common approach is Gaussian process (GP) emulation, which has been used with some success in ABM applications in a variety of fields including public health Silverman et al. (2013b, a). Note that the terms 'surrogate model' and 'emulator' are sometimes used interchangeably, but the originators of the GP approach use the term 'emulator' specifically for methods that provide full probabilistic predictions of simulation behaviour, not only approximations of the outputs.
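As a minimal illustration of the emulation idea (not the GEMSA workflow used later in this paper), the sketch below fits a Gaussian process surrogate to a handful of runs of a one-parameter stand-in 'simulator' using scikit-learn; the toy function and design are illustrative assumptions, not our ABM:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Stand-in for an expensive simulator: one input parameter, one output.
def simulator(x):
    return np.sin(3 * x) + 0.5 * x

# A small design of training runs spanning the parameter range.
X_train = np.linspace(0, 2, 8).reshape(-1, 1)
y_train = simulator(X_train).ravel()

# Fit the GP emulator: a probabilistic model of the simulator.
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gp.fit(X_train, y_train)

# The emulator predicts both a mean and an uncertainty at untried inputs,
# in milliseconds rather than the runtime of a full simulation.
X_new = np.array([[0.7], [1.3]])
mean, std = gp.predict(X_new, return_std=True)
```

The key property exploited later in the paper is the second one: once fitted, the surrogate can be queried thousands of times for sensitivity analysis at negligible cost.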
Machine and deep learning approaches have shown wide applicability and versatility when simulating mechanistic processes Zampieri et al. (2019). With the advent of accessible machine-learning methods including artificial neural networks (ANNs), various authors have proposed machine learning as a potentially productive means of creating surrogate models for ABM emulation van der Hoog (2017); Pereda et al. (2017). This seems particularly promising in the case of ANNs; while it is well-known that shallow-depth ANNs are universal approximators capable of approximating any continuous function, recent work has shown that modern deep neural networks are capable of similar or greater expressiveness even when the network width is limited Lu et al. (2017). These properties suggest that ANNs can more easily cope with the highly non-linear nature of ABMs than other comparable approaches. Other machine-learning approaches are significantly faster to train than ANNs, and may also prove fruitful for surrogate modelling purposes.

Despite its high potential, machine learning can be found in only a very limited number of applications in the ABM domain. As well as having been used as part of an ABM calibration process Torrens et al. (2011); Lamperti et al. (2017), one recent project has proposed to determine single-agent behaviour using machine learning. In this project, a machine-learning model was trained to mimic behavioural patterns, where parameters are given as input and the action is the output, all performed at single-agent resolution Kavak et al. (2018).
However, at the time of writing, these methods have not been widely used to develop surrogate models for the purpose of sensitivity analysis. The majority of extant examples of machine-learning methods applied to ABM outputs are GP implementations; to our knowledge, various versions of ANNs and other methods have been discussed in principle, but not yet designed and implemented on agent-based simulations.
We contend that machine and deep learning methods, when applied to the generation of surrogate models, can improve theoretical understanding of the ABM, help calibrate the model more efficiently, and provide more insightful interpretations of simulation outputs and behaviours. Therefore, for parameter spaces that cannot be searched effectively with heuristics, we propose that surrogate models can be learned from ABM outputs, allowing machine/deep learning techniques to emulate the ABMs with high accuracy.
Interpretability of machine-learning models
The widespread use of machine-learning models today has led to concerns being raised regarding their interpretability, given that understanding the predictions produced by machine-learning models is far from straightforward Gilpin et al. (2018). Deep-learning models in particular are enormously complex, often containing hundreds of layers of neurons adding up to tens of millions of parameters. Recently, significant progress has been made in developing tools for interpreting these models, including a striking interactive attempt to make a large deep-learning model interpretable Carter et al. (2019). Such tools are progressing rapidly, but they still require significant time investment from the modeller and are not yet in widespread use. However, we note that while large machine-learning models may appear particularly problematic, simpler methods like logistic regression can be equally difficult to interpret when dealing with large data sets Lipton (2016).

We acknowledge that machine-learning models have difficulties with interpretability, but in this use case we advocate a pragmatic approach. When building surrogate models, our primary aim is to significantly reduce the time required to produce model outputs for sensitivity analyses, so we aim for convenient implementation and computational efficiency. The surrogate enables us to produce analyses that help us understand the behaviour of the original model, regardless of the interpretability of the surrogate itself.
In this paper, we implement a variety of machine-learning methods and provide a thorough comparison of their performance in an ABM emulation scenario. As an exemplar model, we have used an agent-based model of social care provision in the United Kingdom Silverman et al. (2013a), generated sets of observations across ranges of input parameter values designed for optimum parameter space coverage, and attempted to replicate the ABM outputs using machine-learning algorithms. In the following section we outline the methods we studied, and in the Results section we provide a detailed analysis of their performance in this task.
Motivations
With this paper, we investigate a question raised elsewhere Lamperti et al. (2017); van der Hoog (2017): whether neural networks and other machine-learning methods may be used successfully and efficiently for surrogate modelling of ABMs. Work in this area has thus far proposed this possibility, but has not taken the step of directly comparing machine learning to other methods for surrogate modelling. In an effort to spur further work on this topic, we have developed a comparison of nine different candidate methods.
We have chosen a selection of the most widely-used machine-learning methods for our comparison. The primary advantage of surrogate models is that they drastically reduce the time required to perform detailed analyses of ABM outputs; with that in mind, we chose methods that can generate predictions rapidly once the surrogate is trained. These methods can differ significantly in the time required for training, so we compare training times as well as predictive accuracy.
Substantial work remains to be done on developing these methods into a more accessible form aimed specifically at model calibration and emulation, and on refining their application in this context. Here we aim to inform these efforts by determining whether useful surrogate models can be built using machine-learning methods, and which methods are most effective.
We also address the problems ABMs pose for current emulation methods, such as Gaussian process emulators, and investigate whether machine-learning methods can overcome these obstacles. We therefore included Gaussian process emulators in our comparative framework. As we demonstrate below, Gaussian process emulators are both efficient and powerful, but can nonetheless struggle to fit the output of even moderately complex ABMs. Our hope is that new surrogate modelling methods can ultimately supplement GPs in these cases.
Finally, to ensure our results are accessible to a wide range of modellers, we performed our analyses using the widely available Mathematica and MATLAB software. The code and results are shared in full in our GitHub repository. We chose this approach in the hope that even modellers using graphical ABM tools like NetLogo or Repast will be able to try these methods and test our results without extensive programming experience; ABM is often applied in the social sciences as well as the physical and computational sciences, and in those settings using Python, R or Julia may introduce technical barriers.
Summary of methods studied
Table 1 provides a summary of the machine-learning methods studied in this paper, including a description of each method and the commonly-cited advantages and disadvantages of each approach. This summary is provided as a quick guide for ABM practitioners who may be new to these methods, rather than as a definitive comparison. For further details on each method, please refer to the key references cited in each description.
Method | Description | Advantages | Disadvantages
Linear Regression | Predicts values using a linear combination of input features, in this case parameter values Hastie et al. (2001). | Fast, extremely well-studied, easy to implement with any number of tools. | Not well-suited for data with non-linear relationships.
Decision Trees | Generates binary trees that predict the value of a variable based on several inputs, represented as interior nodes in the tree Quinlan (1986). | Fast and easy to train, simple to implement, very easy to understand and interpret. | High variance (slight changes to input data can produce very different trees), prone to overfitting.
Random Forests | Ensemble approach to decision trees in which multiple models are trained using random split points. When making a prediction, the predictions of each tree in the ensemble are averaged together to produce the final result Breiman (2001). | Easy to parallelise, low computational load, low variance. | Prone to overfitting, low interpretability.
Gradient Boosted Trees | Ensemble method that combines weak learners into a strong learner Friedman (2001). Weak learners are trained sequentially so that each successive learner improves on its predecessor's predictions. XGBoost Chen and Guestrin (2016) and LightGBM Ke et al. (2017) have further optimised this approach and are frequently used in a variety of machine-learning domains. | Fast, low variance, very successful across a wide range of problems. | Prone to overfitting, requires extensive parameter optimisation.
K-Nearest Neighbours | Makes a prediction by finding the K most-similar points in the training data set to the new data point to be labelled, then summarising the output values at those points to arrive at the prediction for the new point Cover and Hart (2006). | Simple to implement, fast, only requires computational resources when making a prediction. | Must retain the entire training set, becomes ineffective as the number of input variables becomes high (the 'curse of dimensionality').
Gaussian Process Emulation | The most common method for emulating computer models; GPs model a simulation as a Gaussian process O'Hagan (2006). Useful measures like the main effects, or the output variance due to each input parameter, are straightforwardly derivable from the emulator. | Very computationally efficient, highly useful for sensitivity analyses, specialised free software (GEMSA) speeds the process up considerably. | Assumes the model to be emulated is smooth (which may not be the case with complex ABMs), GEMSA software no longer maintained, only copes with single model outputs.
Support Vector Machine (SVM) | Finds a hyperplane in a high-dimensional space of data points that separates those points with the widest possible margin; can be used for classification or regression Cristianini and Shawe-Taylor (2000). | Scales well, few hyperparameters to optimise, flexible and powerful in higher dimensions thanks to the 'kernel trick'. | Choosing the right kernel can be difficult, long training times for large datasets, low interpretability.
Neural Network | A network of nodes loosely based on biological neurons, normally consisting of an input and output layer with one or more hidden layers of neurons in between. Learning algorithms adjust the weighted connections between neurons to enable regression or classification of input datasets. Deep neural networks use many hidden layers and can model complex non-linear relationships Hinton et al. (2006). | An enormous variety of possible layer types and network architectures, can learn supervised or unsupervised, highly suitable for modelling non-linear relationships, very well-supported by powerful open-source software. | Computationally expensive hyperparameter optimisation, prone to overfitting, large networks require GPU access for training, low interpretability.
Methods
Exemplar agent-based model
In order to present a useful comparison of methods for generating surrogate ABMs, we chose a moderate-complexity ABM as an exemplar. Relatively simple ABMs have behaviour that is easier to replicate, meaning they would not present a sufficient test of the capabilities of each method. High-complexity ABMs, on the other hand, would present a very robust test in this regard, but a highly complex model would have taken far more CPU time to generate output data. Therefore, for reasons of practicality and replicability, we chose a moderate-complexity model which could be run relatively quickly and easily.
The chosen model is the ‘Linked Lives’ model of social care provision in the UK, which models the interaction between supply and demand of informal social care Noble et al. (2012); Silverman et al. (2013a). The model is written in Python, and simulates the movements and life course decisions of agents in a virtual space built as a rough approximation of UK geography. Our simulation code is freely available via GitHub and the specific release used for this paper is available at https://github.com/thorsilver/SocialCareABMforUQ/releases/tag/v0.91.
Agents in the Linked Lives model are capable of migrating domestically, forming and dissolving partnerships, and participating in a simple simulated economy. As agents age they may develop long-term limiting health conditions that require social care; in such cases, family members will attempt to provide care according to their availability. Care availability varies according to an agent's employment status, geographical location and health status. The simulation thus provides insight into patterns of social care provision in families, and how these patterns may shift in response to policy changes, which can be implemented in the simulation by altering the simulation parameters.
Parameter | Description | Default | Range
ageingParentsMoveInWithKids | Probability agents move back in with adult children | 0.1 | 0.1 – 0.8
baseCareProb | Base probability used for care provision functions | 0.0002 | 0.0002 – 0.0016
retiredHours | Hours of care provided by retired agents | 60.0 | 40 – 80
ageOfRetirement | Age of retirement for working agents | 65 | 55 – 75
personCareProb | General individual probability of requiring care | 0.0008 | 0.0002 – 0.0016
maleAgeCareScaling | Scaling factor for likelihood of care need for males | 18.0 | 10 – 25
femaleAgeCareScaling | Scaling factor for likelihood of care need for females | 19.0 | 10 – 25
childHours | Hours of care provided by children living at home | 5.0 | 1 – 10
homeAdultHours | Hours of care provided by unemployed adults | 30.0 | 5 – 50
workingAdultHours | Hours of care provided by employed adults | 25.0 | 5 – 40
Generating simulation data
The Linked Lives model contains 22 user-alterable parameters, 10 of which are potentially relevant to modelling social care policies. Table 2 summarises these ten parameters and their function within the simulation. Input values for these parameters were varied across ranges determined by experimentation to lie at the upper and lower bounds of interpretable simulation behaviour; beyond those bounds, simulation results were generally highly unrealistic, leading to either collapsing or exploding agent populations.
The simulation output of interest is the per capita cost of social care per annum. Each simulation run starts with a random initial population in the year 1860, then runs in yearly time steps until 2050. The final per capita cost of care in 2050 is then recorded along with the input parameter values for that run.
Full simulation runs were generated in four batches of 200, 400, 800 and 1600 runs, allowing us to compare the performance of each machine-learning method on both smaller and larger sets of observations. The LPTau algorithm was used to generate sets of input parameters that reasonably covered the simulation parameter space Sobol (1977). LPTau is simple to implement and generates appropriately-sized space-filling experimental designs very quickly. When using a maximin Latin hypercube design Morris and Mitchell (1995), we found that generating the designs took orders of magnitude longer than with LPTau, so we elected to use LPTau for simplicity and speed. The input parameter sets and simulation outputs for each of the four batches of runs are available in the GitHub repository.
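LPTau designs are drawn from the Sobol' low-discrepancy sequence, so a comparable space-filling design can be generated with SciPy's quasi-Monte Carlo module. The sketch below is illustrative rather than the code used for our runs, and uses a four-parameter subset of the bounds from Table 2:

```python
from scipy.stats import qmc

# Lower and upper bounds for four of the parameters in Table 2:
# ageingParentsMoveInWithKids, baseCareProb, retiredHours, ageOfRetirement.
lower = [0.1, 0.0002, 40, 55]
upper = [0.8, 0.0016, 80, 75]

# Sobol' (LP-tau) sequence: space-filling points in the unit hypercube,
# then rescaled to the parameter ranges. Using a power-of-two sample
# count preserves the sequence's balance properties.
sampler = qmc.Sobol(d=len(lower), scramble=False)
unit_design = sampler.random(n=8)
design = qmc.scale(unit_design, lower, upper)  # one row per simulation run
```

Each row of `design` then supplies the input parameters for one simulation run.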
Application of machine learning to simulation outputs
Each of the machine-learning algorithms chosen for this comparison was trained to predict the output of the simulation using each of the four batches of runs outlined above. The ten simulation parameters described in Table 2 were used as input features, and the output to be predicted is the social care cost per capita per annum. All algorithms were tested using the same data sets for each of the four batches of simulation runs.
The loss function used was the mean-squared error (MSE), the most commonly-used loss function for regression. This is calculated as the mean of the squared differences between actual and predicted values, or:

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \quad (1)

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. Training times were recorded in seconds for each algorithm.
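Computed directly, for an illustrative pair of actual and predicted output vectors:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative values only.
actual = [2.0, 3.5, 4.0]
predicted = [2.5, 3.0, 5.0]
mse(actual, predicted)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```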
All the tested machine-learning methods were trained by splitting the simulation run data into training, validation and test sets, as is common practice in machine learning. The test set consisted of 20% of the initial set of runs, the validation set consisted of 20% of the remaining 80%, and all the remaining data formed the training set. This three-way split allows hyperparameter optimisation to be performed using the validation set, without any risk of accidentally training a model on the test set and thereby obtaining biased results. The training set was used exclusively for training, and the validation set for evaluation of the trained model. The final-round MSE loss figures were then compiled into our results table (within Figure 5), providing a summary of the relative performance of all the machine-learning methods for creating surrogate models of the ABM.
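The three-way split described above can be sketched with two successive holdout splits; the data here is a random stand-in for the simulation runs, and the resulting proportions are 64% training, 16% validation and 20% test:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in data: 200 runs, 10 input parameters, one output each.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.random(200)

# First split off the 20% test set, then take 20% of the remainder
# as the validation set, leaving 64% of the original data for training.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=0)

len(X_train), len(X_val), len(X_test)  # (128, 32, 40)
```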
Creating ABM surrogate models with machine learning algorithms
We created the ABM surrogate models in Mathematica 12.0. Once the simulation runs were completed, we chose a selection of the most commonly used machine learning methods to test as possible means for creating surrogate models of the ABM. Given that the motivation for generating these surrogates is to enable detailed analysis of ABM outputs without having to run the ABM many thousands of times, we sought to test methods that would produce predictions very quickly once the model is trained. As part of our comparison table in the Results section, we included the training time required for each surrogate model.
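A rough Python analogue of this comparison workflow (train each method on identical data, recording training time and validation MSE) might look as follows; the method selection and stand-in data are illustrative only, not the Mathematica pipeline actually used:

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in data: 200 'runs' of 10 parameters with a noisy non-linear response.
rng = np.random.default_rng(1)
X = rng.random((200, 10))
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.1 * rng.standard_normal(200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

methods = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=1),
    "Gradient Boosted Trees": GradientBoostingRegressor(random_state=1),
}

results = {}
for name, model in methods.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)              # training time is recorded...
    elapsed = time.perf_counter() - start
    val_mse = mean_squared_error(y_val, model.predict(X_val))  # ...alongside MSE
    results[name] = (elapsed, val_mse)
```

Once trained, any of these surrogates can stand in for the ABM when exploring the parameter space.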
Table 1 provides a basic summary of the methods tested. Each method description includes a reference to a seminal paper or monograph on that method; readers unfamiliar with these methods are referred to those references for further details. We also included the relative advantages and disadvantages of each method, as a guideline for readers who might be unfamiliar with such methods.
We note that in addition to the methods listed, we also tested XGBoost Chen and Guestrin (2016) and LightGBM Ke et al. (2017). However, both proved unable to generate a usable model of the ABM data. This may be because these methods are more specialised for modelling very large input datasets. Our simulation datasets are comparatively small, and in this study we are seeking methods for generating surrogate models that can provide good predictions even with small training sets, thus allowing the modeller to save substantial time when performing model calibration and analyses. We therefore elected not to report results for XGBoost and LightGBM in this paper, as neither method proved fit for this specific purpose.
In order to maximise the replicability and transparency of our results, we implemented each method using standard software. All the methods listed in Table 1, except for support vector machines, were implemented using Mathematica 12.0 Wolfram Research Inc. (2019), which includes a comprehensive machine learning suite. All of these methods can be implemented using Mathematica's Predict function (documented at https://reference.wolfram.com/language/ref/Predict.html), which produces a model from supplied training data that can then be analysed in depth. Figure 2a shows sample output from an instance of Predict implementing a gradient-boosted trees model of the 800-run simulation dataset.
Artificial neural network implementation
The neural network models were implemented using Mathematica's NetTrain function (documented at https://reference.wolfram.com/language/ref/NetTrain.html). The networks were deep neural networks with a 10-node input layer, a batch normalisation layer and activation layer, a variable number of fully-connected hidden layers, a further batch normalisation layer and activation layer, and a single-node output layer (outputting a scalar). For each set of runs, a brief grid search was conducted using steadily increasing numbers of hidden layers until we found an architecture which provided the best possible fit without overfitting. This produced neural networks that increased in depth as the size of the training set increased. The learning rate was set at 0.0003 for all networks, with L2 regularisation set at 0.03 to help avoid overfitting. All networks were trained for 15,000 epochs, with batch sizes left at default values. All Mathematica code and results are available in our GitHub repository: https://github.com/thorsilver/EmulatingABMswithML.

The right side of Figure 2b shows some sample output of a neural network implemented using NetTrain and trained on the 800-run simulation dataset. Figure 1a shows a sample architecture for the most successful neural network for the 800-run training set. This network contains a total of thirteen layers, nine of those being fully-connected hidden layers with 50 nodes each. Figure 1b shows a simplified view of the network structure.
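A comparable architecture can be sketched in PyTorch rather than Mathematica; the activation function (ReLU) and the use of Adam's weight decay as a stand-in for L2 regularisation are assumptions on our part, so this is an illustrative analogue rather than a reproduction of the NetTrain configuration:

```python
import torch
from torch import nn

def make_surrogate(n_hidden_layers=9, width=50, n_inputs=10):
    """Deep feed-forward surrogate: batch-normalised input, fully-connected
    hidden layers, and a single scalar output, loosely mirroring the
    architecture described above."""
    layers = [nn.BatchNorm1d(n_inputs), nn.Linear(n_inputs, width), nn.ReLU()]
    for _ in range(n_hidden_layers - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.BatchNorm1d(width), nn.Linear(width, 1)]
    return nn.Sequential(*layers)

net = make_surrogate()
# lr matches the 0.0003 reported above; weight_decay approximates L2 regularisation.
optimiser = torch.optim.Adam(net.parameters(), lr=3e-4, weight_decay=0.03)

# One illustrative training step on random stand-in data.
X = torch.rand(32, 10)
y = torch.rand(32, 1)
loss = nn.functional.mse_loss(net(X), y)
loss.backward()
optimiser.step()
```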
Results
Initial investigation using GPs
Following the generation of our simulation results, we performed an initial uncertainty analysis using a Gaussian process emulator (GP). This analysis was performed using the GEMSA software package O’Hagan (2006), using data gathered from 400 simulation runs. The emulator output in Figure 1 shows that the GP emulator was unable to fit a GP model to the simulation data, suffering from an extremely large output variance.
GPs are limited in their capacity to emulate certain kinds of complex simulation models, in that they assume that simulation output varies smoothly in response to changing inputs Kennedy and O'Hagan (2001); O'Hagan (2006). In the case of complex ABMs, this assumption often fails, as model outputs can change in unexpected ways in response to small variations in input parameter values. Our exemplar ABM falls into this category, meaning that even though other critical GP assumptions still hold (the model is deterministic, and has a relatively low number of inputs), the emulator is unable to fit a GP model to the simulation data. As the next analysis will show, determining the impact of the input variables on the final output variance is challenging in this case, which may also have contributed to the failure of this initial GP emulation attempt.
Output quantity | Value
Largest standardised error | 0.1147
Cross-validation variance range |
Estimate of mean output | 11892.8
Estimate of output variance |
Estimate of total output variance |
Investigation of simulation outputs
In order to further investigate the contributions of simulation parameters to the final output variance, we used Principal Component Analysis (PCA) Jolliffe and Cadima (2016), a common technique for determining the variables causing the largest variation in a dataset, to identify the variables accounting for the greatest variation in per capita social care cost. The data was first normalised using the z-score:

z = \frac{x - \mu}{\sigma} \quad (2)

where $x$ is the raw score, $\mu$ is the sample mean, and $\sigma$ is the standard deviation.
Significant components from the analysis were selected according to the Kaiser criterion Kaiser (1960) and Jolliffe's rule Jolliffe (2011). According to the Kaiser criterion, any component with an eigenvalue of magnitude greater than one should be retained. Jolliffe's rule additionally requires that 70% of the variation be explained, which may require retaining more components. Applying these criteria to the datasets containing 200 and 400 samples, PCA reduces the dimensionality of the data to 7 components; in the datasets containing 800 and 1600 samples, PCA identifies a single component.
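The z-score normalisation and the two selection rules can be sketched as follows, on stand-in data; here the eigenvalues are the variances of the principal components of the standardised data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real data: 200 runs by 10 parameters.
rng = np.random.default_rng(0)
X = rng.random((200, 10))

# z-score each column, as in Equation (2).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(Z)
eigenvalues = pca.explained_variance_

# Kaiser criterion: retain components with eigenvalue greater than one.
kaiser = int(np.sum(eigenvalues > 1))

# Jolliffe's rule: additionally require >= 70% of variance explained.
cum_var = np.cumsum(pca.explained_variance_ratio_)
jolliffe = int(np.searchsorted(cum_var, 0.70) + 1)

# Keep enough components to satisfy both criteria.
n_components = max(kaiser, jolliffe)
```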
To determine the variable(s) contributing most to the variance of each component, a correlation above 0.5 was considered significant. As the number of samples in the dataset increases, PCA gives increasingly ambiguous results. At 200 and 400 samples, the first and last components are separated by less than 2% and 1% contribution to variance respectively, showing very little discrimination between components. Additionally, the two highest-contributing variables to the first component in the 200-sample dataset differ from those identified in the 400-sample dataset. At 800 and 1600 samples, PCA fails to discriminate between the variables, counting them all as significant contributors to the first component. From these results, we conclude that PCA is unreliable on the smaller datasets, and that at 800 samples and above it is unable to identify which variables contribute most strongly to per capita social care cost.
Surrogate modelling results
In total, nine different machine-learning methods were implemented to attempt to replicate the behaviour of the ABM. The PCA work above demonstrates the difficulties of modelling ABM outputs – the parameter spaces of such models tend to be complex, and the contribution of model parameters to output variance is difficult to unravel. We tested each of the nine methods on its ability to replicate the final output of the ABM, comparing strength of fit and computation time. Each method was tested on four datasets containing the results of 200, 400, 800 and 1600 simulation runs; each dataset examines the same range of the model parameter space at an increasingly higher resolution. Neural networks were the strongest performers overall, particularly in the 800- and 1600-sample cases, although this comes at the cost of a computational speed considerably lower than the other methods. Non-linear SVMs and gradient-boosted trees generally have slightly higher error, but considerably faster training (Figure 5).
Method | Runtime (200 Runs) | MSE Loss (200 Runs) | Runtime (400 Runs) | MSE Loss (400 Runs) | Runtime (800 Runs) | MSE Loss (800 Runs) | Runtime (1600 Runs) | MSE Loss (1600 Runs)
Neural Network | 47.5s | 0.965 | 91.77s | 0.761 | 222.969s | 0.224 | 695.31s | 0.188
Linear SVM | 0.216s | 0.919 | 0.248s | 0.686 | 0.168s | 0.83 | 0.386s | 1.188
Non-Linear SVM | 0.081s | 0.95 | 0.096s | 0.905 | 0.097s | 0.86 | 0.134s | 1.315
Random Forest | 2.14s | 16.936 | 0.397s | 12.06 | 1.58s | 2.656 | 2.16s | 4.31
Linear Regression | 1.47s | 5.237 | 1.42s | 4.374 | 3.06s | 13.439 | 3.86s | 24.18
Gradient Boosted Trees | 1.4s | 3.778 | 1.47s | 2.526 | 1.5s | 0.452 | 3.88s | 2.774
K-Nearest Neighbours | 0.33s | 12.895 | 0.35s | 13.422 | 0.37s | 1.356 | 2.29s | 7.404
Gaussian Process | 2.02s | 4.284 | 2.97s | 1.928 | 4.12s | 0.33 | 17.7s | 37.02
Decision Trees | 0.33s | 19.734 | 0.355s | 14.321 | 0.532s | 3.855 | 2.09s | 2.247
Linear regression
The PCA results indicate that the ABM's behaviour is highly complex, and the linear regression results further reinforce this impression. While the final-round validation MSE loss in the 200- and 400-sample scenarios was the fourth highest of the nine methods tested (5.237 at 200 samples and 4.374 at 400 samples), at 800 and 1600 samples the MSE loss was the highest (13.439) and second highest (24.18) respectively. This suggests that linear regression was not able to adequately capture the complex behaviour of the original ABM.
Decision trees
Decision trees, in contrast to linear regression, showed increasing accuracy as the number of observations increased. In the 200- and 400-sample scenarios they were the worst-performing of the nine methods tested, with MSE loss of 19.734 and 14.321 respectively. In the 800-sample scenario, an MSE loss of 3.855 still placed decision trees as the second-worst performer. However, at 1600 samples they became more competitive, producing the fourth-lowest MSE loss of 2.247.
Random forest
Similar to decision trees, Random Forest performed poorly in the scenarios with lower numbers of samples, producing the second-worst (16.936) and third-worst (12.06) MSE loss results at 200 and 400 samples. At 800 and 1600 samples, Random Forest produced substantially lower MSE loss but remained among the weaker performers, placing third worst at 800 samples (2.656) and fourth worst at 1600 samples (4.31).
Gradient-boosted trees
Gradient-boosted trees were among the more consistent performers, ranking in the middle of the table at 200, 400 and 1600 samples (with MSE loss of 3.778, 2.526 and 2.774 respectively). The 800 sample scenario was the notable exception, with gradient-boosted trees producing the third-lowest MSE loss of the nine methods tested (0.452). Overall, our results suggest that gradient-boosted trees are relatively strong performers on this task, combining consistent accuracy with fast training times.
K-Nearest neighbours
K-Nearest Neighbours (hereafter KNN) followed a similar pattern, with high MSE loss at 200, 400 and 1600 samples (12.895, 13.422 and 7.404 respectively) and a markedly lower MSE loss at 800 samples (1.356). In ranking terms, however, the change was small: KNN’s MSE loss was the third, second and third highest at 200, 400 and 1600 samples respectively, and still the fourth highest at 800 samples. Many of the nine methods improved at 800 samples, so even with the marked reduction in MSE loss, KNN’s actual improvement in ranking was small.
Gaussian Process emulation
As expected given the particular characteristics of Gaussian process emulation (GP), the method produced relatively strong results when applied to smaller datasets (MSE loss of 4.284 at 200 samples, 1.928 at 400 samples, 0.33 at 800 samples), and very poor results at high numbers of observations (MSE loss of 37.02 at 1600 samples, by far the worst performer). These results confirm that GPs are powerful and useful for emulating many varieties of complex computer simulation, but can prove less effective with ABMs, where the outputs do not necessarily vary smoothly with small changes in input parameters.
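The smoothness limitation can be seen in a toy example. The sketch below, which assumes scikit-learn’s GaussianProcessRegressor with an RBF kernel, fits the same GP to a smooth target and to a step-like, ‘jagged’ one; both functions are illustrative stand-ins, not the ABM’s actual response surface.

```python
# Minimal sketch of why GP emulation struggles with non-smooth outputs.
# Target functions and kernel settings are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, size=(200, 1)), axis=0)   # training inputs
X_test = np.linspace(0, 1, 500).reshape(-1, 1)          # dense test grid

def smooth(x):
    return np.sin(2 * np.pi * x).ravel()                # smooth response

def jagged(x):
    return np.sign(np.sin(40 * np.pi * x)).ravel()      # step-like response

mses = {}
for name, f in [("smooth", smooth), ("jagged", jagged)]:
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1),
                                  alpha=1e-6, normalize_y=True)
    gp.fit(X, f(X))
    mses[name] = mean_squared_error(f(X_test), gp.predict(X_test))
    print(f"{name}: MSE = {mses[name]:.4f}")
```

In our runs the smooth target is recovered almost exactly, while the discontinuous one leaves a much larger error, mirroring the behaviour reported above.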
Support Vector Machine
Two varieties of SVM were used: linear SVM and non-linear (Gaussian) SVM. Both were faster to train than all other methods at all sample sizes. Linear SVM outperformed non-linear SVM, recording a smaller MSE loss in all four scenarios, and outperformed all other methods at 200 (0.919) and 400 (0.686) samples. The fast runtimes and relatively low MSE loss make linear SVM in particular a competitive performer on this task, though at 800 and 1600 samples it is progressively outperformed by neural networks on MSE loss.
Neural networks
Neural networks proved to be highly consistent performers, recording very low MSE loss across all four scenarios. They were third lowest (0.965) at 200 samples, second lowest at 400 samples, and outperformed all eight rival methods in the 800 sample (0.224) and 1600 sample (0.188) scenarios. However, their training time was orders of magnitude longer than that of any other method, particularly compared with the two SVM regression methods, which combined competitive performance with the fastest training times of the nine methods. We note that in this comparison we trained the neural networks on CPU to allow a direct comparison with the other methods; in practice, using GPUs would speed up training considerably, making neural networks a strong candidate method for the emulation of ABMs.
Discussion
Although we tested a range of machine-learning tools for replicating the outcomes of the full ABM, our results suggest that a deep learning approach (using multi-layered neural networks) is the most promising candidate for creating a surrogate of the ABM. Neural networks can be trained on data derived from the ABM simulations, and can then replicate the ABM output with high accuracy. More generally, we have shown for the first time that machine-learning methods can approximate ABM predictions with high accuracy and dramatically reduced computational effort, and thus have great potential as a means for more efficient parameter calibration and sensitivity analysis.
The results in Table 3 clearly illustrate the challenges of surrogate modelling for ABMs. The performance of most methods – with the notable exception of the neural networks – varies significantly with the number of observations. Linear regression, for example, performs passably at 200 and 400 runs, but fails to produce a good fit at 800 and 1600, where the complex and jagged nature of the model’s parameter space becomes more evident. The Gaussian process emulation runs are also instructive: GPs cannot produce a good fit when the function being emulated does not vary smoothly, as is the case in the 1600-run scenario.
Neural networks, by contrast, improve steadily in predictive accuracy as the number of observations increases. As per the universal approximation theorem, neural networks with non-linear activation functions can approximate any continuous function to an arbitrary degree of precision Cybenko (1989), which enables these networks to approximate our simulation output despite its non-linearity and lack of smoothness. There is a significant computational cost, however: a more accurate approximation of a more complex function requires exponentially more neurons, which leads to significantly longer training times. These costs are reflected in the large increase in run times for the neural networks as the number of observations increases.
In our testing for this paper, we performed all machine-learning tasks using consumer-class CPUs only, to allow like-for-like comparison of computation times. This leaves neural networks at a notable disadvantage, as training neural networks on GPUs leads to enormous reductions in run times. In this case, the model was relatively simple, meaning that even the 11-layer network used for the 1600-run scenario could be trained in a matter of minutes. For surrogate modelling of more complex ABMs, however, we would recommend training the neural networks on GPUs.
We should note that we intentionally kept the neural network architectures very simple – all networks used here are plain stacks of fully-connected layers with sigmoid activation functions. Despite this lack of architectural optimisation, the neural network models are very accurate in all four scenarios, and in some cases significantly more accurate than the other methods. More sophisticated architectures could potentially achieve greater accuracy, but at a cost in testing and hyperparameter optimisation. Ultimately, we advocate a balanced approach: our results suggest that even simplistic neural networks can perform very well as surrogate models, and architectural optimisation is likely to consume a great deal of time for relatively little benefit.
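A network of the kind described above can be sketched in a few lines. This is an illustrative example using scikit-learn’s MLPRegressor, where "logistic" is scikit-learn’s name for the sigmoid activation; the layer widths, solver and synthetic stand-in data are our assumptions, not the paper’s exact configuration.

```python
# Sketch of a simple fully-connected surrogate: stacked dense layers with
# sigmoid activations. Sizes and data are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(size=(1600, 4))                 # ABM parameter samples (stand-in)
y = np.sin(6 * X[:, 0]) * X[:, 1] + 0.5 * X[:, 2] ** 2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "logistic" = sigmoid activation; lbfgs suits small datasets on CPU.
net = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation="logistic",
                   solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
mse = mean_squared_error(y_te, net.predict(X_te))
print(f"surrogate MSE: {mse:.4f}")
```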
Of the other machine-learning methods tested, SVM, gradient-boosted trees and Gaussian process emulation were the next strongest performers. Both linear and non-linear (Gaussian) SVM produced very accurate predictions in the 200 and 400-sample scenarios, but began to fall behind the neural networks beyond this point. The gradient-boosted trees performed reasonably well in all four scenarios, while the GPs were particularly strong in the 800-run scenario and particularly weak in the 1600-run scenario. Boosted trees were more consistent across the four scenarios and extremely fast to compute, which suggests that they are a viable alternative to neural networks, particularly when time is of the essence. We should note again that our attempts to use highly-optimised boosted-tree implementations such as XGBoost and LightGBM were unsuccessful, as they are specialised for very large datasets.
While this work provides a useful comparison of machine-learning-based surrogate modelling methods applied to a moderately complex ABM, further work could expand on these comparisons. In our experiments, the surrogate models were only required to approximate the behaviour of a model with a single major output of interest; future work could examine how these methods compare when applied to models with multiple outputs. There may also be circumstances where a surrogate that replicates simulation behaviour over time is desirable, rather than one focussing only on the final outputs at the end of a simulation run. In that case, machine-learning methods used for predicting time-series data, such as recurrent neural networks, could serve as useful surrogate models.
In conclusion, our results illustrate that neural networks can function as highly accurate surrogate models for complex agent-based simulations, and are better able than other machine-learning methods to overcome the challenges posed by their complex and non-linear outputs. Neural networks are markedly less computationally efficient than the other approaches, but relatively simple neural network surrogates can be trained effectively even on consumer-level CPUs; for more complex applications, GPU-based training may be necessary. For time-critical applications, or when computational power is at a premium, SVM or gradient-boosted trees could provide a viable alternative, given their relatively consistent performance and quick training times.
Future Work
As noted earlier, Gaussian process emulation has become a popular means of analysing simulation outputs due to its flexibility, its power, and the very useful measures of uncertainty it can produce. However, its limitations suggest that it is less useful for analysing some complex agent-based simulations. Our comparisons here have shown that other ML methods can be viable alternatives in these circumstances.
Before these alternative emulation methods can become widely adopted, however, we must find efficient and effective ways of making use of these more accurate surrogate models. Unlike GPs, methods such as neural networks, SVM and gradient-boosted trees do not bring insightful uncertainty quantification measures ‘for free’ – we must construct these ourselves. Future work inspired by the results reported here could focus on developing libraries and examples showing how to use ML-based surrogate models to generate deeper insights into the behaviour of complex ABMs. This would greatly enhance future modelling efforts in the ABM community. More specifically, these additional methods for simulation analysis would fill the gaps where GPs are unable to produce usable surrogates, helping to ensure that detailed sensitivity and uncertainty analyses become the norm in ABM studies.
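One simple way to construct such uncertainty measures ourselves is a bootstrap ensemble: train many copies of the surrogate on resampled data and use the spread of their predictions as an uncertainty estimate. The sketch below applies this idea to gradient-boosted trees; the ensemble size, model choice and synthetic data are illustrative assumptions.

```python
# Bootstrap-ensemble uncertainty for a surrogate that lacks native
# uncertainty quantification. All settings here are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.utils import resample

rng = np.random.default_rng(3)
X = rng.uniform(size=(400, 4))                       # ABM parameter samples (stand-in)
y = np.sin(6 * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=400)

# Train each ensemble member on a bootstrap resample of the data.
ensemble = []
for seed in range(20):
    Xb, yb = resample(X, y, random_state=seed)
    ensemble.append(GradientBoostingRegressor(random_state=seed).fit(Xb, yb))

# The spread of member predictions serves as an uncertainty measure.
X_new = rng.uniform(size=(5, 4))
preds = np.stack([m.predict(X_new) for m in ensemble])   # shape (20, 5)
mean, std = preds.mean(axis=0), preds.std(axis=0)
for mu, sd in zip(mean, std):
    print(f"prediction: {mu:+.3f} ± {sd:.3f}")
```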
Acknowledgements
Eric Silverman is part of the Complexity in Health Improvement Programme supported by the Medical Research Council (MC_UU_12017/14) and the Chief Scientist Office (SPHSU14). Claudio Angione would like to acknowledge the support from UKRI Research England’s THYME project, and from the Children’s Liver Disease Foundation.
Authors’ contributions
E.S. and C.A. designed the study. E.S. modified the ABM software and produced the analyses done in Mathematica 12.0. E.Y. produced all the analyses done in MatLab and RStudio. Neural network models and analyses were performed by E.S. and C.A. All three authors wrote, reviewed and edited the manuscript collaboratively.
Data, code and materials
The code and the results presented in this manuscript are shared in full in our GitHub repository https://github.com/thorsilver/EmulatingABMswithML. Our simulation of the ‘Linked Lives’ ABM model of social care provision in the UK is available at https://github.com/thorsilver/SocialCareABMforUQ/releases/tag/v0.91.
References
 The evolution of cooperation. Basic Books, New York, NY, USA. Cited by: Introduction.
 Random forests. Machine Learning 45 (1), pp. 5–32. External Links: Document, Link Cited by: Table 1.
 Activation atlas. Distill 4 (3), pp. e15. Cited by: Interpretability of machine-learning models.
 XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 785–794. External Links: ISBN 978-1-4503-4232-2, Link, Document Cited by: Table 1, Creating ABM surrogate models with machine learning algorithms.
 Nearest neighbor pattern classification. IEEE Trans. Inf. Theor. 13 (1), pp. 21–27. External Links: ISSN 0018-9448, Link, Document Cited by: Table 1.
 An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press. External Links: Document Cited by: Table 1.

 Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314. External Links: ISSN 1435-568X, Document, Link Cited by: Discussion.
 Greedy function approximation: a gradient boosting machine. Ann. Statist. 29 (5), pp. 1189–1232. External Links: Document, Link Cited by: Table 1.
 Explaining explanations: an overview of interpretability of machine learning. External Links: 1806.00069 Cited by: Interpretability of machine-learning models.
 The elements of statistical learning. Springer Series in Statistics, Springer New York Inc., New York, NY, USA. Cited by: Table 1.
 A fast learning algorithm for deep belief nets. Neural Computation 18 (7), pp. 1527–1554. Cited by: Table 1.
 Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 (2065), pp. 20150202. Cited by: Investigation of simulation outputs.
 Principal component analysis. Springer. Cited by: Investigation of simulation outputs.
 The application of electronic computers to factor analysis. Educational and psychological measurement 20 (1), pp. 141–151. Cited by: Investigation of simulation outputs.
 Big data, agents, and machine learning: towards a data-driven agent-based modeling approach. In Proceedings of the Annual Simulation Symposium, pp. 12. Cited by: Introduction.
 LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3146–3154. Cited by: Table 1, Creating ABM surrogate models with machine learning algorithms.
 Bayesian calibration of computer models. Journal of the Royal Statistical Society, Series B 63 (3), pp. 425–464. Cited by: Introduction, Initial investigation using GPs.
 Agent-based model calibration using machine learning surrogates. arXiv.org. Cited by: Introduction, Motivations.
 The mythos of model interpretability. External Links: 1606.03490 Cited by: Interpretability of machine-learning models.
 The expressive power of neural networks: a view from the width. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6231–6239. Cited by: Introduction.
 Exploratory designs for computational experiments. Journal of Statistical Planning and Inference 43 (3), pp. 381–402. External Links: Document Cited by: Generating simulation data.
 Linked lives: the utility of an agent-based approach to modelling partnership and household formation in the context of social care. In Proceedings of the 2012 Winter Simulation Conference, C. Laroque, J. Himmelspach, R. Pasupathy, O. Rose, and J.M. Uhrmacher (Eds.), External Links: Link Cited by: Exemplar agent-based model.
 Bayesian analysis of computer code outputs: a tutorial. Reliability Engineering and System Safety 91 (10–11), pp. 1290–1300. Cited by: Introduction, Table 1, Initial investigation using GPs, Initial investigation using GPs.
 A brief introduction to the use of machine learning techniques in the analysis of agent-based models. In Advances in Management Engineering, pp. 179–186. Cited by: Introduction.
 Induction of decision trees. Machine Learning 1 (1), pp. 81–106. External Links: Document, Link Cited by: Table 1.
 Why we need a complex systems model of evidence for public health. The Lancet 390 (10112), pp. 2602–2604. Cited by: Introduction.
 Dynamic models of segregation. Journal of Mathematical Sociology 1, pp. 143–186. Cited by: Introduction.
 Simulating the cost of social care in an ageing population. In Proceedings of the 27th European Conference on Modelling and Simulation, W. Rekdalsbakken, R.T. Bye, and H. Zhang (Eds.), pp. 689–695. Cited by: Interpretability of machine-learning models, Introduction, Exemplar agent-based model.
 Methodological investigations in agentbased modelling – with applications for the social sciences. Springer, Berlin, Germany. Cited by: Introduction.
 When demography met social simulation: a tale of two modelling approaches. Journal of Artificial Societies and Social Simulation 16 (4), pp. 9. Note: http://jasss.soc.surrey.ac.uk/16/4/9.html External Links: Document Cited by: Introduction.
 Uniformly distributed sequences with an additional uniform property. USSR Computational Mathematics and Mathematical Physics 16, pp. 236–242. Cited by: Generating simulation data.
 Building agent-based walking models by machine-learning on diverse databases of space-time trajectory samples. Transactions in GIS 15 (s1), pp. 67–94. Cited by: Introduction.
 Deep learning in (and of) agent-based models: a prospectus. arXiv preprint arXiv:1706.06302. Cited by: Introduction, Motivations.
 Mathematica 12.0. External Links: Link Cited by: Creating ABM surrogate models with machine learning algorithms.
 Machine and deep learning meet genomescale metabolic modeling. PLoS computational biology 15 (7). Cited by: Introduction.