Introduction
Simulation has become an indispensable tool for researchers across the sciences to explore the behavior of complex, dynamic systems under varying conditions [Cranmer2020TheFO], including hypothetical or extreme conditions and, increasingly, tipping points in environments such as climate [vautard2013simulation, dunne2013gdfl, roberts2018benefits], biology [dada2011multi, dror2012biomolecular], and sociopolitics [elshafei2016sensitivity, hamilton2005climate], among other domains with significant consequences. Yet several challenges limit the utility of simulators (and modeling tools broadly) in many settings. First, despite hardware advances that enable simulations of increasingly complex systems, computational costs severely limit the level of geometric detail, the complexity of the physics, and the number of simulator runs. This forces simplifying assumptions, which often render the results unusable for hypothesis testing and practical decision-making. In addition, simulators are inherently biased, as they simulate only what they are programmed to simulate; sensitivity and uncertainty analyses are often impractical for expensive simulators; simulation code is composed of low-level mechanistic components that are typically non-differentiable and lead to intractable likelihoods; and simulators can rarely integrate with real-world data streams, let alone run online with live data updates.
Recent progress in artificial intelligence (AI) and machine learning (ML) in the sciences has advanced methods towards several key objectives for AI/ML to be useful in science (beyond discovering patterns in high-dimensional data). These advances allow us to import priors or domain knowledge into ML models and export knowledge from learned models back to the scientific domain; leverage ML for numerically intractable simulation and optimization problems, and maximize the utility of real-world data; generate myriad synthetic data; quantify and reason about uncertainties in models and data; and infer causal relationships in the data.
It is at the intersection of AI and simulation sciences where we can expect significant strides in scientific experimentation and discovery, in essentially all domains. Consider, for instance, the use of neural networks to accelerate simulation software for climate science [Ramadhan2020CapturingMP], or of multi-agent reinforcement learning and game theory for economic policy simulations [Zheng2020TheAE]. Yet this area is relatively nascent and disparate, and a unifying, holistic perspective is needed to advance the intersection of AI and simulation sciences. This paper explores this perspective. We lay out the methodologies required to make significant strides in simulation and AI for science, and how they must be fruitfully combined. The field of scientific computing was at a similar inflection point in 2004, when Phillip Colella presented to DARPA the “Seven Dwarfs” for Scientific Computing, where each of the seven represents an algorithmic method that captures a pattern of computation and data movement [7dwarfs, Asanovi2006TheLO, kaltofen2012seven].^ii For the remainder of this paper we replace a potentially insensitive term with “motif”, a change we suggest for the field going forward.
^ii The 2004 “Seven Motifs for Scientific Computing”: Dense Linear Algebra, Sparse Linear Algebra, Computations on Structured Grids, Computations on Unstructured Grids, Spectral Methods, Particle Methods, and Monte Carlo [7dwarfs].
The motifs nomenclature has proved useful for reasoning at a high level of abstraction about the behavior and requirements of these methods across a broad range of applications, while decoupling them from specific implementations. Moreover, it provides an understandable vocabulary for communicating across disciplinary boundaries. Motifs also serve as “anti-benchmarks”: they are not tied to narrow performance or code artifacts, thus encouraging innovation in algorithms, programming languages, data structures, and hardware [Asanovi2006TheLO]. The motifs of scientific computing thus provided an explicit roadmap for R&D efforts in numerical methods (and eventually parallel computing) in the sciences.
In this paper, we similarly define the Nine Motifs of Simulation Intelligence: classes of complementary algorithmic methods that represent the foundation for synergistic simulation and AI technologies to advance the sciences. Simulation intelligence (SI) describes a field that merges scientific computing, scientific simulation, and artificial intelligence towards studying processes and systems in silico to better understand and discover in situ phenomena. Each of the SI motifs has momentum in the scientific computing and AI communities, yet they must be pursued in concert and integrated in order to overcome the shortcomings of scientific simulators and enable new scientific workflows.
Unlike the older seven motifs of scientific computing, our SI motifs are not necessarily independent. Many of these are interconnected and interdependent, much like the components within the layers of an operating system. The individual modules can be combined and interact in multiple ways, gaining from this combination. Using this metaphor, we explore the nature of each layer of the “SI stack”, the motifs within each layer, and the combinatorial possibilities available when they are brought together – the layers are illustrated in Fig. 1.
We begin by describing the core layers of the SI stack, detailing each motif within: the concepts, challenges, state-of-the-art methods, future directions, ethical considerations, and many motivating examples. As we traverse the SI stack, encountering the numerous modules and scientific workflows, we lay out how these advances will benefit the many users of simulation and scientific endeavors. Our discussion goes on to cover important SI themes such as inverse problem solving and human-machine teaming, and essential infrastructure areas such as data engineering and accelerated computing.
By pursuing research in each of these SI motifs, as well as ways of combining them in generalizable software towards specific applications in science and intelligence, the recent trend of decelerating scientific progress may be reversed [Cowen2019IsTR, Bhattacharya2020StagnationAS]. This paper aims to motivate the field and provide a roadmap for those who wish to work at the intersection of AI and simulation in pursuit of new scientific methods and frontiers.
Simulation Intelligence Motifs
Although we define nine concrete motifs, they are not necessarily independent – many of the motifs are synergistic in function and utility, and some may be building blocks underlying others. We start by describing the “module” motifs, as in Fig. 1, followed by the underlying “engine” motifs: probabilistic and differentiable programming. We then describe the motifs that aim to push the frontier of intelligent machines: open-endedness and machine programming.
The Modules
Above the “engine” in the proverbial SI stack (Fig. 1) are “modules”, each of which can be built in probabilistic and differentiable programming frameworks, making use of the accelerated computing building blocks in the hardware layer. We start this section with several motifs that closely relate to the physics-informed learning topics discussed above, and then proceed through the SI stack, describing how and why the module motifs complement one another in myriad, synergistic ways.
1. Multi-Physics & Multi-Scale Modeling
Simulations are pervasive in every domain of science and engineering, yet they are often done in isolation: climate simulations of coastal erosion do not model human-driven effects such as urbanization and mining, and even then are only consistent within a constrained region and timescale. Natural systems involve various types of physical phenomena operating at different spatial and temporal scales. For simulations to be accurate and useful they must support multiple physics and multiple scales (spatial and temporal). The same goes for AI and ML, which can be powerful for modeling multimodal, multi-fidelity scientific data, but machine learning alone – based on data-driven relationships – ignores the fundamental laws of physics and can result in ill-posed problems or non-physical solutions. For ML (and AI-driven simulation) to be accurate and reliable in the sciences, methods must integrate multiscale, multiphysics data and uncover mechanisms that explain the emergence of function.
Multiphysics
Complex real-world problems require solutions that span a multitude of physical phenomena, which often can only be solved using simulation techniques that cross several engineering disciplines. Almost all practical problems in fluid dynamics involve the interaction of a gas and/or liquid with a solid object, and include a range of associated physics such as heat transfer, particle transport, erosion, deposition, flow-induced stress, combustion, and chemical reaction. A multiphysics environment is defined by coupled processes or systems involving more than one simultaneously occurring physical field or phenomenon. Such an environment is typically described by multiple partial differential equations (PDEs) that are tightly coupled, such that solving them presents significant challenges with nonlinearities and time-stepping. In general, the more interacting physics in a simulation, the more costly the computation.
Multiscale
Ubiquitous in science and engineering, cascades of scales involve more than two scales with long-range spatiotemporal interactions (which often lack self-similarity and proper closure relations). In the context of the biological and behavioral sciences, for instance, multiscale modeling applications range from the molecular, cellular, tissue, and organ levels all the way to the population level; multiscale modeling can enable researchers to probe biologically relevant phenomena at smaller scales and seamlessly embed the relevant mechanisms at larger scales to predict the emergent dynamics of the overall system [Hunt2018TheSO]. Domains such as energy and synthetic biology require engineering materials at the nanoscale, optimizing multiscale processes and systems at the macroscale, and even discovering new governing physicochemical laws across scales. These scientific drivers call for a deeper, broader, and more integrated understanding of common multiscale phenomena and scaling cascades. In practice, multiscale modeling is burdened by computational inefficiency, especially with increasing complexity and scales; it is not uncommon to encounter “hidden” or unknown physics of interfaces, inhomogeneities, symmetry-breaking, and other singularities.
Methods for utilizing information across scales (and physics that vary across space and time) are often needed in real-world settings, and with data from various sources: multi-fidelity modeling aims to synergistically combine abundant, inexpensive, low-fidelity data with sparse, expensive, high-fidelity data from experiments and simulations. Multi-fidelity modeling is often useful in building efficient and robust surrogate models (which we detail in the surrogate modeling motif later) [Alber2019IntegratingML] – examples include simulating mixed convection flow past a cylinder [Perdikaris2016ModelIV] and cardiac electrophysiology [Costabal2019MultifidelityCU].
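The additive-correction idea behind many multi-fidelity surrogates can be sketched in a few lines. Here the Forrester benchmark pair stands in for an expensive simulator and its cheap low-fidelity counterpart, and the scale factor rho is assumed known rather than estimated (an illustrative simplification, not a published method):

```python
import numpy as np

# Forrester et al. benchmark pair: an "expensive" high-fidelity (HF) function
# and a cheap, biased low-fidelity (LF) approximation, standing in for a full
# simulator and its coarse counterpart.
def f_hi(x):
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def f_lo(x):
    return 0.5 * f_hi(x) + 10 * (x - 0.5) - 5

x_hf = np.array([0.0, 0.4, 0.6, 1.0])   # only a few expensive HF runs
rho = 2.0                                # assumed LF-to-HF scale factor

# Fit a low-order polynomial to the discrepancy d(x) = f_hi(x) - rho * f_lo(x)
d = f_hi(x_hf) - rho * f_lo(x_hf)
coeffs = np.polyfit(x_hf, d, 2)

def f_mf(x):
    """Multi-fidelity surrogate: scaled cheap model plus learned correction."""
    return rho * f_lo(x) + np.polyval(coeffs, x)

x_test = np.linspace(0, 1, 50)
err_lf = np.mean((f_lo(x_test) - f_hi(x_test)) ** 2)
err_mf = np.mean((f_mf(x_test) - f_hi(x_test)) ** 2)
print(err_mf < err_lf)  # the correction learned from 4 HF samples beats LF alone
```

In practice the discrepancy (and often rho itself) would be modeled with a GP rather than a fixed polynomial, which also yields uncertainty estimates.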
In computational fluid dynamics (CFD), classical methods for multiphysics and multiscale simulation (such as finite elements and pseudo-spectral methods) are only accurate if flow features are smooth, and thus meshes must resolve the smallest features. Consequently, direct numerical simulation of real-world systems such as climate and jet physics is infeasible. It is common to use smoothed versions of the Navier–Stokes equations to allow coarser meshes, while sacrificing accuracy. Although successful in the design of engines and turbomachinery, there are severe limits to what can be accurately and reliably simulated – the resolution-efficiency tradeoff imposes a significant bottleneck. Methods for AI-driven acceleration and surrogate modeling could accelerate CFD and multiphysics, multiscale simulation by orders of magnitude. We discuss specific methods and examples below.
Physics-informed ML
The newer class of physics-informed machine learning methods integrates mathematical physics models with data-driven learning. More specifically, making an ML method physics-informed amounts to introducing appropriate observational, inductive, or learning biases that can steer or constrain the learning process towards physically consistent solutions [Karniadakis2021PhysicsinformedML]:

Observational biases can be introduced via data that allow an ML system to learn functions, vector fields, and operators that reflect the physical structure of the data.

Inductive biases are encoded as model-based structure that imposes prior assumptions or physical laws, ensuring the physical constraints are strictly satisfied.

Learning biases force the training of an ML system to converge on solutions that adhere to the underlying physics, implemented by specific choices in loss functions, constraints, and inference algorithms.
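As a minimal illustration of a learning bias (a toy stand-in for the PDE-residual terms in PINN losses), one can fit a polynomial surrogate to an ODE by penalizing the physics residual at collocation points; the ODE, basis, and weights here are illustrative choices, not from any cited method:

```python
import numpy as np

# Learning-bias sketch: fit a polynomial u(t) = sum_k c_k t^k to the ODE
# u' + lam*u = 0 with u(0) = 1 (exact solution exp(-lam*t)) by penalizing
# the physics residual u' + lam*u at collocation points in the least squares.
lam, degree = 1.0, 4
t_col = np.linspace(0.0, 2.0, 50)            # collocation points
powers = np.arange(degree + 1)

V = t_col[:, None] ** powers                 # basis values u(t_col)
dV = np.zeros_like(V)                        # basis derivatives u'(t_col)
dV[:, 1:] = powers[1:] * t_col[:, None] ** (powers[1:] - 1)

# Stack physics-residual rows (u' + lam*u = 0) with a weighted
# initial-condition row (u(0) = 1), then solve jointly in least squares.
ic_row = np.sqrt(10.0) * (powers == 0).astype(float)
A = np.vstack([dV + lam * V, ic_row])
b = np.concatenate([np.zeros(len(t_col)), [np.sqrt(10.0)]])
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)

def u_hat(t):
    return (np.asarray(t)[:, None] ** powers) @ coeffs

# The physics term lets the surrogate extrapolate from a single data point.
print(abs(u_hat(np.array([2.0]))[0] - np.exp(-2.0)) < 0.05)
```

A PINN replaces the polynomial with a neural network and computes the residual derivatives by automatic differentiation, but the composite data-plus-residual loss has the same shape.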
A common problem template involves extrapolating from an initial condition obtained from noisy experimental data, where a governing equation is known to describe at least some of the physics. An ML method would aim to predict the latent solution of the system at later times and propagate the uncertainty due to noise in the initial data. A common use case in scientific computing is reconstructing a flow field from scattered measurements (e.g., particle image velocimetry data), and using the governing Navier–Stokes equations to extrapolate this initial condition in time.
For fluid flow problems and many others, Gaussian processes (GPs) present a useful approach for capturing the physics of dynamical systems. A GP is a Bayesian nonparametric machine learning technique that provides a flexible prior distribution over functions, enjoys analytical tractability, defines kernels for encoding domain structure, and has a fully probabilistic workflow for principled uncertainty reasoning [Rasmussen2006GaussianPF, Ghahramani2015ProbabilisticML]. For these reasons GPs are used widely in scientific modeling, and several recent methods encode physics into GP models more directly: numerical GPs have covariance functions resulting from the temporal discretization of the time-dependent partial differential equations (PDEs) that describe the physics [Raissi2018NumericalGP, Raissi2018HiddenPM]; modified Matérn GPs can be defined to represent the solution to stochastic partial differential equations [Lindgren2011AnEL] and extended to Riemannian manifolds to fit more complex geometries [Borovitskiy2020MaternGP]; and the physics-informed basis-function GP derives a GP kernel directly from the physical model [Hanuka2020PhysicsinformedGP] – the latter method we elucidate in an experiment optimization example in the surrogate modeling motif.
With the recent wave of deep learning progress, research has developed methods for building the three physics-informed ML biases into nonlinear regression-based physics-informed networks. More specifically, there is increasing interest in physics-informed neural networks (PINNs): deep neural nets for building surrogates of physics-based models described by partial differential equations (PDEs) [Raissi2019PhysicsinformedNN, Zhu2019PhysicsConstrainedDL]. Despite recent successes (with some examples explored below) [Raissi2018DeepLO, Zhang2019QuantifyingTU, Meng2020ACN, Berg2018AUD, Sirignano2018DGMAD, Sun2020SurrogateMF], PINN approaches are currently limited to tasks characterized by relatively simple and well-defined physics, and often require domain-expert craftsmanship. Even so, the characteristics of many physical systems are often poorly understood or hard to encode implicitly in a neural network architecture. Probabilistic graphical models (PGMs) [Koller2009ProbabilisticGM], on the other hand, are useful for encoding a priori structure, such as the dependencies among model variables, in order to maintain physically sound distributions. This is the intuition behind the promising direction of graph-informed neural networks (GINNs) for multiscale physics [Hall2021GINNsGN]. There are two main components of this approach (shown in Fig. 7): first, embedding a PGM into the physics-based representation to encode complex dependencies among model variables that arise from domain-specific information and to enable the generation of physically sound distributions; second, identifying computational bottlenecks intrinsic to the underlying physics-based model from the embedded PGM and replacing them with an efficient NN surrogate. The hybrid model thus encodes a domain-aware physics-based model that synthesizes stochastic and multiscale modeling, while computational bottlenecks are replaced by a fast surrogate NN whose supervised learning and prediction are further informed by the PGM (e.g., through structured priors). Given the significant computational advantages of surrogate modeling within the GINN, we explore this approach further in the surrogate modeling motif later.
Modeling and simulation of complex nonlinear multiscale and multiphysics systems requires the inclusion and characterization of uncertainties and errors that enter at various stages of the computational workflow. Typically we are concerned with two main classes of uncertainty in ML: aleatoric and epistemic. The former quantifies system stochasticity such as observation and process noise, while the latter is model-based or subjective uncertainty due to limited data. In environments with multiple scales and multiple physics, one should consider an additional type of uncertainty due to the randomness of parameters of stochastic physical systems (often described by stochastic partial or ordinary differential equations (SPDEs, SODEs)). One can view this type of uncertainty as arising from the computation of a sufficiently well-posed deterministic problem, in contrast to the notions of epistemic and aleatoric uncertainty quantification. The field of probabilistic numerics [Hennig2015ProbabilisticNA] makes a similar distinction, where probabilistic modeling is used to reason about uncertainties that arise strictly from the lack of information inherent in the solution of intractable problems such as quadrature methods and other integration procedures. In general, the probabilistic numerics viewpoint provides a principled way to manage the parameters of numerical procedures. We discuss probabilistic numerics and uncertainty reasoning further in the SI themes section. The GINN can quantify uncertainties with a high degree of statistical confidence, while Bayesian analogs of PINNs are a work in progress [Yang2021BPINNsBP] – one cannot simply plug in MC-dropout or other deep learning uncertainty estimation methods. The various uncertainties may be quantified and mitigated with methods that utilize data-driven learning to inform the original systems of differential equations. We define this class of physics-infused machine learning later in this section and in the next motif. The synergies of mechanistic physics models and data-driven learning are brought to bear when physics-informed ML is built with differentiable programming (one of the engine motifs), which we explore in the first example below.
Examples
Accelerated CFD via physics-informed surrogates and differentiable programming
Kochkov et al. [Kochkov2021MachineLC] bring the advantages of semi-mechanistic modeling and differentiable programming to the challenge of complex computational fluid dynamics (CFD). The Navier–Stokes (NS) equations describe fluid dynamics well, yet in cases of multiple physics and complex dynamics, solving the equations at scale is severely limited by the computational cost of resolving the smallest spatiotemporal features. Approximation methods can alleviate this burden, but at the cost of accuracy. In Kochkov et al., the components of traditional fluid solvers most affected by the loss of resolution are replaced with better-performing machine-learned alternatives (as presented in the semi-mechanistic modeling section later). This AI-driven solver algorithm is represented as a differentiable program, with the neural networks and the numerical methods written in the JAX framework [jax2018github]. JAX is a leading framework for differentiable programming, with reverse-mode automatic differentiation that allows for end-to-end gradient-based optimization of the entire programmed algorithm. In this CFD use case, the result is an algorithm that maintains accuracy while using 10x coarser resolution in each dimension, yielding an 80-fold improvement in computation time with respect to an advanced numerical method of similar accuracy.
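The core pattern – a trainable correction inside an unrolled numerical solver, optimized end-to-end against a reference – can be sketched without JAX. Here a finite-difference gradient stands in for reverse-mode autodiff, and the 1-D diffusion setup is an illustrative toy, not the Navier–Stokes configuration of Kochkov et al.:

```python
import numpy as np

# An unrolled explicit-Euler diffusion solver with a trainable diffusivity
# correction theta, trained through the full rollout. In JAX the gradient
# would come from reverse-mode autodiff; a central finite difference stands
# in here so the sketch stays dependency-free.
n = 32
x = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
dx, dt, steps = x[1] - x[0], 0.02, 20
u0 = np.sin(x)

def rollout(nu):
    u = u0.copy()
    for _ in range(steps):
        lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
        u = u + dt * nu * lap           # explicit Euler step, periodic grid
    return u

target = rollout(0.5)                   # "ground truth": true diffusivity 0.5
nu_coarse = 0.2                         # deliberately wrong base model

def loss(theta):
    return np.mean((rollout(nu_coarse + theta) - target) ** 2)

theta, lr, eps = 0.0, 5.0, 1e-4
for _ in range(100):                    # gradient descent through the solver
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
    theta -= lr * grad

print(round(theta, 3))                  # recovers the missing 0.3 of diffusivity
```

The same differentiate-through-the-solver structure is what lets learned solver components be trained on coarse grids against fine-grid references.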
Related approaches implement PINNs without the use of differentiable programming, such as TF-Net for modeling turbulent flows with several specially designed U-Net deep learning architectures [Wang2020TowardsPD]. A promising direction in this and other spatiotemporal use cases is neural operator learning: using NNs to learn mesh-independent, resolution-invariant solution operators for PDEs. To achieve this, Li et al. [Li2020FourierNO] use a Fourier layer that implements a Fourier transform, then a linear transform, and an inverse Fourier transform, for a convolution-like operation in a NN. In a Bayesian inverse experiment, the Fourier neural operator acting as a surrogate can draw MCMC samples from the posterior of initial NS vorticity given sparse, noisy observations in 2.5 minutes, compared to 18 hours for the traditional solver.
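A single-channel caricature of such a Fourier layer (omitting the per-mode weight matrices over channels, the pointwise linear path, and the nonlinearity of the full architecture) might look like:

```python
import numpy as np

# Sketch of a single-channel Fourier layer: FFT -> per-mode linear transform
# on a truncated set of modes -> inverse FFT. Because the learned weights live
# in frequency space, the same layer applies unchanged at any grid resolution,
# the source of the neural operator's resolution invariance.
rng = np.random.default_rng(0)
n_modes = 8
W = rng.normal(size=n_modes) + 1j * rng.normal(size=n_modes)  # "learned" weights

def fourier_layer(u):
    u_hat = np.fft.rfft(u)                    # to frequency space
    out_hat = np.zeros_like(u_hat)
    out_hat[:n_modes] = W * u_hat[:n_modes]   # act on the lowest modes only
    return np.fft.irfft(out_hat, n=len(u))    # back to physical space

u64 = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))
u256 = np.sin(np.linspace(0, 2 * np.pi, 256, endpoint=False))
print(fourier_layer(u64).shape, fourier_layer(u256).shape)  # same W, both grids
```

Truncating to the lowest modes is what makes the parameter count independent of mesh size, so a model trained on a coarse grid can be evaluated on a fine one.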
Multiphysics and HPC simulation of blood flow in an intracranial aneurysm
SimNet is an AI-driven multiphysics simulation framework based on neural network solvers – more specifically, it approximates the solution to a PDE by a neural network [Hennigh2020NVIDIASA]. SimNet improves on previous NN solvers to take on the challenge of gradients and discontinuities introduced by complex geometries or physics. The main novelties are the use of signed distance functions for loss weighting, and integral continuity planes for flow simulation. An intriguing real-world use case is simulating the flow inside a patient-specific geometry of an aneurysm, as shown in Fig. A. It is particularly challenging to get the flow field to develop correctly, especially inside the aneurysm sac. Building SimNet to support multi-GPU and multi-node scaling provides the computational efficiency necessary for such complex geometries. There is also optimization over repeated trainings, such as training for surrogate-based design optimization or uncertainty quantification, where transfer learning reduces the time to convergence for neural network solvers: once a model is trained for a single geometry, the trained model parameters are transferred to solve a different geometry, without having to train on the new geometry from scratch. As shown in Fig. B, transfer learning accelerates the patient-specific intracranial aneurysm simulations.
Raissi et al. [Raissi2020HiddenFM] similarly approach the problem of 3D physiologic blood flow in a patient-specific intracranial aneurysm, but implement a physics-informed NN technique: the Hidden Fluid Mechanics approach, which uses autodiff (i.e., within differentiable programming) to simultaneously exploit the information in the Navier–Stokes equations of fluid dynamics and the information in flow visualization snapshots. Continuing this work has high potential for robust and data-efficient simulation in physical and biomedical applications. Raissi et al. effectively solve this as an inverse problem using blood flow data, while SimNet approaches it as a forward problem without data. Nonetheless, SimNet has potential for solving inverse problems as well (within the inverse design workflow of Fig. 33, for example).
Learning quantum chemistry
Using modern algorithms and supercomputers, systems containing thousands of interacting ions and electrons can now be described using approximations to the physical laws that govern the world on the atomic scale, namely the Schrödinger equation [Butler2018MachineLF]. Chemical simulation can allow the properties of a compound to be anticipated (with reasonable accuracy) before synthesizing it in the laboratory. These computational chemistry applications range from catalyst development for greenhouse gas conversion, to materials discovery for energy harvesting and storage, to computer-assisted drug design [Walsh2013ComputationalAT]. The potential energy surface is the central quantity of interest in the modeling of molecules and materials, computed by approximations to the time-independent Schrödinger equation. Yet there is a computational bottleneck: high-level wave function methods have high accuracy, but are often too slow for use in many areas of chemical research. The standard approach today is density functional theory (DFT) [Lejaeghere2016ReproducibilityID], which can yield results close to chemical accuracy, often on the scale of minutes to hours of computational time. DFT has enabled the development of extensive databases that cover the calculated properties of known and hypothetical systems, including organic and inorganic crystals, single molecules, and metal alloys [Hachmann2011TheHC, Jain2013CommentaryTM, Calderon2015TheAS]. In high-throughput applications, often-used methods include force-field (FF) and semi-empirical quantum mechanics (SEQM), with runtimes on the order of fractions of a second, but at the cost of the reliability of the predictions [Christensen2021OrbNetDA]. ML methods can potentially provide accelerated solvers without loss of accuracy, but to be reliable in the chemistry space the model must capture the underlying physics, and well-curated training data covering the relevant chemical problems must be available – Butler et al.
[Butler2018MachineLF] provide a list of publicly accessible structure and property databases for molecules and solids. Symmetries in geometries or physics, or equivariance – defined as the property of being independent of the choice of reference frame – can be exploited to this end. For instance, a message passing neural network [Gilmer2017NeuralMP] called OrbNet encodes a molecular system in graphs based on features from a low-cost quantum calculation, implementing symmetry-adapted atomic orbitals [Qiao2020OrbNetDL, Christensen2021OrbNetDA]. The authors demonstrate the effectiveness of their equivariant approach on several organic and biological chemistry benchmarks, with accuracies on par with those of modern DFT functionals, providing a far more efficient drop-in replacement for DFT energy predictions. Additional approaches that constrain NNs to be symmetric under these geometric operations have been successfully applied in molecular design [Simm2021SymmetryAwareAF] and quantum chemistry [Qiao2021UNiTEUN].
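The message passing at the heart of such models reduces to a simple aggregate-and-update rule over the molecular graph; the graph, feature sizes, and weights below are illustrative, not taken from OrbNet or any published model:

```python
import numpy as np

# One round of message passing on a toy molecular graph (water: an O atom
# bonded to two H atoms) - the basic operation behind MPNN-style models.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)    # adjacency: two O-H bonds
h = rng.normal(size=(3, 4))               # per-atom feature vectors
W_self = rng.normal(size=(4, 4)) * 0.1    # transform of an atom's own state
W_msg = rng.normal(size=(4, 4)) * 0.1     # transform of aggregated messages

def mp_round(h):
    messages = A @ h @ W_msg              # sum neighbor features, then transform
    return np.maximum(h @ W_self + messages, 0.0)   # ReLU update

h1 = mp_round(h)
h2 = mp_round(h1)     # stacking rounds widens each atom's receptive field
print(h2.shape)       # still one feature vector per atom
```

Real chemistry models add edge features (bond types, distances) and equivariant operations so the update respects rotations and reflections of the molecule.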
Future directions
In general, machine-learned models are ignorant of the fundamental laws of physics, and can result in ill-posed problems or non-physical solutions. This effect is exacerbated when modeling the interplay of multiple physics or cascades of scales. Despite the recent successes mentioned above, much work remains for multiscale and multiphysics problems. For instance, PINNs can struggle with high-frequency domains and frequency bias [Basri2019TheCR], which is particularly problematic in multiscale problems [Wang2020WhenAW]. Improved methods for learning multiple physics simultaneously are also needed, as the training can be prohibitively expensive – for example, a good approach with current tooling is to train a model for each field separately, and subsequently learn the coupled solutions through either a parallel or a serial architecture using supervised learning based on additional data for a specific multiphysics problem. In order to approach these challenges in a collective and reproducible way, there is a need to create open benchmarks for physics-informed ML, much as in other areas of the ML community such as computer vision and natural language processing. Yet producing quality benchmarks tailored for physics-informed ML can be more challenging:

To benchmark physics-informed ML methods, we additionally need the proper parameterized physical models to be explicitly included in the databases.

Many applications in physics and chemistry require full-field data, which cannot be obtained experimentally and/or call for significant compute.

Often different, problem-specific, physics-based evaluation methods are necessary – for example, the metrics proposed in [Ltjens2021PhysicallyConsistentGA] for scoring physical consistency and in [Vlachas2020BackpropagationAA] for scoring spatiotemporal predictions.
An overarching benchmarking challenge – but also an advantageous constraint – is the multidisciplinary nature of physics-informed ML: there must be multiple benchmarks in multiple domains, rather than one benchmark to rule them all, a development bias that ImageNet has imposed on the computer vision field over the past decade. And because we have domain knowledge and numerical methods for the underlying data-generating mechanisms and processes in the physical and life sciences, we have an opportunity to robustly quantify the characteristics of datasets to better ground the performances of various models and algorithms. This is in contrast to the common practice of naïve data gathering to compile massive benchmark datasets for deep learning – for example, scraping the internet for videos to compose a benchmark dataset for human action recognition (deepmind.com/research/opensource/kinetics) – where not only are the underlying statistics and causal factors a priori unknown, but the target variables and class labels are non-trivial to define and can lead to significant ethical issues, such as dataset biases that lead to model biases, which in some cases can propagate harmful assumptions and stereotypes. Further, we propose to broaden the class of methods beyond physics-informed, to physics-infused machine learning. The former is unidirectional (physics providing constraints or other information to direct ML methods), whereas the latter is bidirectional, including approaches that can better synergize the two computational fields. For instance, for systems with partial information, methods in physics-infused ML can potentially complement known physical models to learn missing or misunderstood components of the systems. We specifically highlight one such approach, universal differential equations [Rackauckas2020UniversalDE], in the surrogate modeling motif next.
Physics-infused ML can enable many new simulation tools and use cases because of this ability to integrate physical models and data within a differentiable software paradigm. JAX and SimNet are good examples, each enabling physics-informed ML methods for many science problems and workflows. Consider, for instance, the JAX fluid dynamics example above: another use case in the same DP framework is JAX MD [Schoenholz2020JAXMA] for performing differentiable physics simulations with a focus on molecular dynamics.
Another exciting area of future multiphysics, multiscale development is inverse problem solving. We already mentioned the immense acceleration in the CFD example above with Fourier neural operators [Li2020FourierNO]. Beyond efficiency gains, physics-infused ML can take on applications with inverse and ill-posed problems that are either difficult or impossible to solve with conventional approaches, notably in quantum chemistry: a recent approach called FermiNet [Pfau2019AbInitioSO] takes a dual physics-informed learning approach to solving the many-electron Schrödinger equation, employing both inductive bias and learning bias. The advantage of physics-infused ML here is eliminating the extrapolation problems of the standard numerical approach, which are a common source of error in computational quantum chemistry. In other domains such as biology, biomedicine, and the behavioral sciences, focus is shifting from solving forward problems based on sparse data towards solving inverse problems to explain large datasets [Alber2019IntegratingML]. The aim is to develop multiscale simulations to infer the behavior of the system, given access to massive amounts of observational data, while the governing equations and their parameters are not precisely known. We further detail the methods and importance of inverse problem solving with SI in the Discussion section.
2. Surrogate Modeling & Emulation
A surrogate model is an approximation method that mimics the behavior of an expensive computation or process. For example, the design of an aircraft fuselage includes computationally intensive simulations with numerical optimizations that may take days to complete, making design space exploration, sensitivity analysis, and inverse modeling infeasible. In this case a computationally efficient surrogate model can be trained to represent the system, learning a mapping from simulator inputs to outputs. Earth system models (ESMs), similarly, are extremely computationally expensive to run due to the large range of spatial and temporal scales and the large number of processes being modeled. ESM surrogates can be trained on a few selected samples of the full, expensive simulations using supervised machine learning tools. In this simulation context, the aim of surrogate modeling (or statistical emulation) is to replace simulator code with a machine learning model (i.e., an emulator) such that running the ML model to infer the simulator outputs is more efficient than running the full simulator itself. An emulator is thus a model of a model: a statistical model of the simulator, which is itself a mechanistic model of the world.
For surrogate modeling in the sciences, nonlinear, nonparametric Gaussian processes (GPs) [Rasmussen2006GaussianPF] are typically used because of their flexibility, interpretability, and accurate uncertainty estimates [Kennedy2001BayesianCO]. Although traditionally limited to smaller datasets because of the $\mathcal{O}(n^3)$ computational cost of training (where $n$ is the number of training data points), much work on reliable GP sparsification and approximation methods makes them viable for real-world use [Titsias2009VariationalLO, Hensman2013GaussianPF, Wilson2015ThoughtsOM, Izmailov2018ScalableGP].
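To make the emulation idea concrete, here is a minimal sketch of a GP surrogate trained on a handful of runs of a toy stand-in "simulator" (the function, training grid, and kernel choice are all illustrative assumptions, using scikit-learn's GP implementation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Stand-in for an expensive simulator (illustrative toy function).
def simulator(x):
    return np.sin(3 * x) + 0.5 * x

# A handful of full simulator runs serve as supervised training data.
X_train = np.linspace(0.0, 2.0, 10).reshape(-1, 1)
y_train = simulator(X_train).ravel()

# GP emulator: flexible and nonparametric, with calibrated uncertainty.
gp = GaussianProcessRegressor(
    kernel=ConstantKernel() * RBF(length_scale=0.5), normalize_y=True
)
gp.fit(X_train, y_train)

# Querying the emulator is cheap, and it returns a predictive std. dev.
X_test = np.linspace(0.0, 2.0, 50).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
```

In a real workflow the training inputs would be chosen by an experimental design (e.g., Latin hypercube sampling), and the predictive standard deviation indicates where further simulator runs are most needed.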
Neural networks (NNs) can also be well-suited to the surrogate modeling task as function approximation machines: a feedforward network defines a mapping $y = f(x; \theta)$ and learns the value of the parameters $\theta$ that result in the best function approximation. The Universal Approximation Theorem demonstrates that sufficiently large NNs can approximate any nonlinear function with a finite set of parameters [Lin2018ResNetWO, Winkler2017PerformanceOD]. Although sufficient to represent any function, a NN layer may be infeasibly large such that it may fail to learn and generalize correctly [Goodfellow2015DeepL]. Recent work has shown that a NN with an infinitely wide hidden layer converges to a GP, representing a normal distribution over the space of functions.
In Fig. 6 we show an example of how a NN surrogate can be used as an emulator that encapsulates either the entire simulator or a specific part of the simulator. In the former case, training is relatively straightforward because the loss function involves only the neural network, and the trained network can be used towards inverse problem solving. However, we now have a black-box simulator: there is no interpretability of the trained network, and we cannot utilize the mechanistic components (i.e., the differential equations of the simulator) for scientific analyses. In the case of the partial surrogate we have several advantages: the surrogate's number of parameters is reduced and thus the network is more stable (similar logic holds for nonparametric GP surrogate models), and the simulator retains all structural knowledge and the ability to run numerical analysis. Yet the main challenge is that backpropagation through arbitrary scientific simulators is required – thus the significance of the differentiable programming motif we discussed earlier. The computational gain associated with the use of a hybrid surrogate-simulator cascades into a series of additional advantages, including the possibility of simulating more scenarios towards counterfactual reasoning and epistemic uncertainty estimates, decreasing grid sizes, or exploring finer-scale parameterizations [Rackauckas2020GeneralizedPL, Goldstein2015MachineLC].
Semi-mechanistic modeling
It follows from the Universal Approximation Theorem that an NN can learn to approximate any sufficiently regular differential equation. The approach of recent neural ODE methods [Chen2018NeuralOD] is to learn to approximate differential equations directly from data, but these can perform poorly when required to extrapolate [Yan2020OnRO]. More encouraging for emulation and scientific modeling, however, is to directly utilize mechanistic modeling simultaneously with NNs (or more generally, universal approximator models) in order to allow for arbitrary data-driven model extensions. The result is a semi-mechanistic approach, more specifically Universal Differential Equations (UDEs) [Rackauckas2020UniversalDE] of the form $\dot{u} = f(u, t, U_\theta(u, t))$, where part of the differential equation contains a universal approximator model $U_\theta$ – we'll generally assume an NN is used for the UDE in this paper, but other options include GPs, Chebyshev expansions, or random forests.
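As a minimal illustration of the semi-mechanistic idea (not the Julia UDE implementation of Rackauckas et al.), the sketch below assumes a toy system $\dot{u} = -u + 0.1u^2$ in which only the decay term $-u$ is known; a quadratic polynomial stands in for the universal approximator $U_\theta$ and is fit to the residual that the known physics fails to explain:

```python
import numpy as np
from scipy.integrate import solve_ivp

# True system (unknown to the modeler): du/dt = -u + 0.1*u^2.
def true_rhs(t, u):
    return -u + 0.1 * u**2

t_eval = np.linspace(0.0, 5.0, 501)
sol = solve_ivp(true_rhs, (0.0, 5.0), [2.0], t_eval=t_eval, rtol=1e-8, atol=1e-10)
u = sol.y[0]

# Known mechanistic part: du/dt ≈ -u. Estimate what it fails to explain.
dudt = np.gradient(u, t_eval)   # numerical derivative of the observed trajectory
residual = dudt - (-u)          # the "missing physics" signal

# Fit the missing term with a simple universal approximator (a quadratic
# polynomial standing in for the neural network U_theta).
coeffs = np.polyfit(u, residual, deg=2)

# Recovered semi-mechanistic (UDE-style) right-hand side: known + learned.
def ude_rhs(t, u_):
    return -u_ + np.polyval(coeffs, u_)
```

The recovered right-hand side can then be integrated like any other ODE, and, as discussed later in this motif, the learned component can be sparse-identified back into mechanistic terms.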
UDEs augment scientific models with machine-learnable structures for scientifically based learning. This is very similar in motivation to the physics-informed neural nets (PINNs) we previously discussed, but the implementation in general has a key distinction: a PINN is a deep learning model that becomes physics-informed because of some added physics bias (observational, inductive, or learning), whereas a UDE is a differential equation with one or more mechanistic components replaced with a data-driven model – this distinction is why we earlier defined physics-infused ML as the bidirectional influence of physics and ML.
Bayesian optimal experiment design
A key aspect that makes emulators useful in scientific endeavors is that they allow us to reason probabilistically about outer-loop decisions such as optimization [Shahriari2016TakingTH] and data collection [Krause2007NonmyopicAL], and to use them to explain how uncertainty propagates in a system [Bilionis2015BayesianUP].
Numerous challenges in science and engineering can be framed as optimization tasks, including the maximization of reaction yields, the optimization of molecular and materials properties, and the fine-tuning of automated hardware protocols [Aldeghi2021GolemAA]. When we seek to optimize the parameters of a system with an expensive cost function $f$, we look to employ Bayesian optimization (BO) [McIntire2016SparseGP, Frazier2018ATO, Snoek2012PracticalBO, Shahriari2016TakingTH] to efficiently explore the search space of solutions with a probabilistic surrogate model rather than experimenting with the real system. Gaussian process models are the most common surrogates due to their flexible, nonparametric behavior. Various explore-exploit strategies over the search space can be implemented with acquisition functions, which efficiently guide the BO search by estimating the utility of evaluating $f$ at a given point (or parameterization). Often in science and engineering settings this provides the domain experts with a few highly promising candidate solutions to then try on the real system, rather than searching the intractably large space of possibilities themselves – for example, generating novel molecules with optimized chemical properties [Griffiths2019ConstrainedBO, GmezBombarelli2018AutomaticCD], materials design with expensive physics-based simulations [Zhang2020BayesianOF], and design of aerospace engineering systems [Poloczek2018AdvancesIB].
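A minimal BO loop can be sketched as follows, assuming a hypothetical toy objective and the expected improvement acquisition function (kernel hyperparameters are fixed here for simplicity; real applications would tune them):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical expensive objective to minimize (stand-in for a real experiment).
def cost(x):
    return (x - 0.6) ** 2 + 0.1 * np.sin(20 * x)

def expected_improvement(mu, sigma, best):
    # EI acquisition for minimization; guard against zero predictive variance.
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(4, 1))        # small initial design
y = cost(X).ravel()
grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)

for _ in range(15):  # BO loop: fit surrogate, maximize EI, evaluate the system
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), optimizer=None,
                                  alpha=1e-6, normalize_y=True)
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, [x_next]])
    y = np.append(y, cost(x_next[0]))

x_best = X[np.argmin(y), 0]   # most promising candidate for the real system
```

Each iteration spends one expensive evaluation where the acquisition function predicts the most improvement, trading off exploration (high predictive uncertainty) against exploitation (low predicted cost).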
Similarly, scientists can utilize BO for designing experiments such that the outcomes will be as informative as possible about the underlying process. Bayesian optimal experiment design (BOED) is a powerful mathematical framework for tackling this problem [Lindley1956OnAM, Chaloner1995BayesianED, Foster2021DeepAD, zhang2021scalable], and can be implemented across disciplines, from bioinformatics [Vanlier2012ABA] to pharmacology [Lyu2018BayesianAD] to physics [Dushenko2020SequentialBE] to psychology [Myung2013ATO]. In addition to design, there are also control methods in experiment optimization, which we detail in the context of a particle physics example below.
Examples
Simulation-based online optimization of physical experiments
It is often necessary and challenging to design experiments such that outcomes will be as informative as possible about the underlying process, typically because experiments are costly or dangerous. Many applications such as nuclear fusion and particle acceleration call for online control and tuning of system parameters to deliver optimal performance levels – i.e., the control class of experiment design we introduced above.
In the case of particle accelerators, although physics models exist, there are often significant differences between the simulation and the real accelerator, so we must leverage real data for precise tuning. Yet we cannot rely on many runs with the real accelerator to tune the hundreds of machine parameters, and archived data does not suffice because there are often new machine configurations to try – a control or tuning algorithm must robustly find the optimum in a complex parameter space with high efficiency. With physics-infused ML, we can exploit well-verified mathematical models to learn approximate system dynamics from few data samples and thus optimize systems online and in silico. It follows that we can additionally look to optimize new systems without prior data.
For the online control of particle accelerators, Hanuka et al. [Hanuka2020PhysicsinformedGP] develop a physics-informed basis-function GP. To clarify what this model encompasses, consider the several ways to build such a GP surrogate for BO of a physical system:

1. Data-informed GP, using real experimental data

2. Physics-informed GP, using simulated data

3. Basis-function GP, deriving a GP kernel directly from the physical model

4. Physics-informed basis-function GP, a combination of the above – methods 2 and 3 were combined in Hanuka et al.
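The basis-function construction (method 3) can be sketched in a few lines: a GP kernel of the form $k(x, x') = \phi(x)^\top \Sigma\, \phi(x')$ built from physics-derived basis functions, so that GP samples live in the span of the physics basis. The basis and weight covariance below are illustrative assumptions, not those of Hanuka et al.:

```python
import numpy as np

# Hypothetical physics-derived basis functions (e.g. known response modes).
def phi(x):
    x = np.atleast_1d(x)
    return np.stack([np.ones_like(x), x, np.sin(x)], axis=-1)   # shape (n, 3)

Sigma = np.diag([1.0, 0.5, 2.0])   # prior covariance over the basis weights

def basis_kernel(X1, X2):
    # k(x, x') = phi(x)^T Sigma phi(x'): GP samples lie in the physics basis.
    return phi(X1) @ Sigma @ phi(X2).T

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-6):
    # Standard GP regression posterior mean under the basis-function kernel.
    K = basis_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    return basis_kernel(X_test, X_train) @ np.linalg.solve(K, y_train)

# If the target truly lies in the basis span, few samples suffice.
X_train = np.linspace(0.0, 5.0, 10)
y_train = 2.0 + 0.3 * X_train + 1.5 * np.sin(X_train)
X_test = np.linspace(0.0, 5.0, 20)
pred = gp_posterior_mean(X_train, y_train, X_test)
```

Because the kernel encodes the physical structure directly, no likelihood evaluations of the physical model are needed to construct it, which is the property Hanuka et al. connect to likelihood-free inference.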
The resulting physics-informed GP is more representative of the particle accelerator system, and performs faster in an online optimization task compared to routinely used optimizers (ML-based and otherwise). Additionally, the method presents a relatively simple way to construct the GP kernel, including correlations between devices – learning the kernel from simulated data instead of machine data is a form of kernel transfer learning, which can help with generalizability. Hanuka et al. interestingly point out that constructing the kernel from basis functions without using the likelihood function is a form of Gaussian process with likelihood-free inference, which is the regime of problems that simulation-based inference is designed for (i.e., the motif we discuss next).
This and similar physics-informed methods are emerging as a powerful strategy for in silico optimization of expensive scientific processes and machines, and further to enable scientific discovery by means of autonomous experimentation. For instance, the recent Gemini [Hickman2021GeminiDB] and Golem [Aldeghi2021GolemAA] molecular experiment optimization algorithms are purpose-built for automated science workflows with SI: using surrogate modeling techniques to proxy expensive chemistry experiments, their BO and uncertainty estimation methods are designed for robustness to common scientific measurement challenges such as input variability and noise, proxy measurements, and systematic biases. Similarly, Shirobokov et al. [shirobokov2020blackbox] propose a method for gradient-based optimization of black-box simulators using local generative surrogates that are trained in successive local neighborhoods of the parameter space during optimization, and demonstrate this technique in optimizing the experimental design of the SHiP (Search for Hidden Particles) experiment proposed at CERN. These and other works of Alán Aspuru-Guzik et al. are good sources to follow in this area of automating science. There are potentially significant cause-effect implications to consider in these workflows, as we introduce in the causality motif later.
Multi-physics, multi-scale surrogates for Earth systems emulation
The climate change situation is worsening in an accelerating fashion: the most recent decade (2010 to 2019) has been the costliest on record, with climate-driven economic damage reaching $2.98 trillion (US), nearly double that of the decade 2000–2009 [Bauer2021TheDR]. The urgency for climate solutions motivates the need for modeling systems that are computationally efficient and reliable, lightweight for low-resource use cases, informative towards policy- and decision-making, and cyber-physical with varieties of sensors and data modalities. Further, models need to be integrated with, and workflows extended to, climate-dependent domains such as energy generation and distribution, agriculture, water and disaster management, and socioeconomics. To this end, we and many others have been working broadly on Digital Twin Earth (DTE), a catalogue of ML and simulation methods, datasets, pipelines, and tools for Earth systems researchers and decision-makers. In general, a digital twin is a computer representation of a real-world process or system – from large aircraft to individual organs. We define digital twin in the more precise sense of simulating the real physics and data-generating processes of an environment or system, with sufficient fidelity such that one can reliably run queries and experiments in silico.
Some of the main ML-related challenges for DTE include integrating simulations of multiple domains, geographies, and fidelities, not to mention the need to integrate real and synthetic data, as well as data from multiple modalities (such as fusing Earth observation imagery with on-the-ground sensor streams). SI methods play important roles in the DTE catalogue, notably the power of machine-learned surrogates to accelerate existing climate simulators. Here we highlight one example for enabling lightweight, real-time simulation of coastal environments: existing simulators for coastal storm surge and flooding are physics-based numerical models that can be extremely computationally expensive. Thus the simulators cannot be used for real-time predictions with high resolution, are unable to quantify uncertainties, and require significant computational infrastructure overhead available only to top national labs. To this end, Jiang et al. [Jiang2021DigitalTE] developed physics-infused ML surrogates to emulate several of the main coastal simulators worldwide, NEMO [Madec2008NEMOOE] and CoSMoS [Barnard2014DevelopmentOT]. Variations of the Fourier Neural Operator (FNO) [Li2020FourierNO] (introduced in the multi-physics motif) were implemented to produce upwards of 100x computational efficiency on comparable hardware.
This use case exemplified a particularly thorny data preprocessing challenge that is commonplace when working with spatiotemporal simulators and Digital Twin Earth: one of the coastal simulators to be emulated uses a standard grid-spaced representation for geospatial topology, which is readily computable with the DFT in the FNO model, but another coastal simulator uses highly irregular grids that differ greatly in scale, and further stacks these grids at varying resolutions. A preprocessing pipeline to regrid and interpolate the data maps was developed, along with substitute Fourier transform methods. It is our experience that many applications in DTE call for tailored solutions such as this – as a community we are lacking shared standards and formats for scientific data and code.
Hybrid PGM and NN for efficient domain-aware scientific modeling
One can in general characterize probabilistic graphical models (PGMs) [Koller2009ProbabilisticGM] as structured models for encoding domain knowledge and constraints, contrasted with deep neural networks as data-driven function approximators. The advantages of PGMs have been utilized widely in scientific ML [Friedman2004InferringCN, Beerenwinkel2015CancerEM, Leifer2008QuantumGM, Boixo2017SimulationOL], as have the recent NN methods discussed throughout this paper. Graph-Informed Neural Networks (GINNs) [Hall2021GINNsGN] are a new approach to incorporating the best of both worlds: PGMs incorporate expert knowledge, available data, constraints, etc. with physics-based models such as systems of ODEs and PDEs, while computationally intensive nodes in this hybrid model are replaced by learned features as NN surrogates. GINNs are particularly suited to enhance the computational workflow for complex systems featuring intrinsic computational bottlenecks and intricate physical relations among variables. Hall et al. demonstrate GINNs towards simulation-based decision-making in a multiscale model of electrical double-layer (EDL) supercapacitor dynamics. The ability for downstream decision-making is afforded by robust and reliable sensitivity analysis (due to the probabilistic ML approach), and orders-of-magnitude gains in computational efficiency mean many hypotheses can be simulated and predicted posteriors quantified.
Auto-emulator design with neural architecture search
Kasim et al. look to recent advances in neural architecture search (NAS) to automatically design and train a NN as an efficient, high-fidelity emulator, as doing this manually can be time-consuming and require significant ML expertise. NAS methods aim to learn a network topology that can achieve the best performance on a certain task by searching over the space of possible NN architectures given a set of NN primitives – see Elsken et al. [Elsken2019NeuralAS] for a thorough overview.
The NAS results are promising for automated emulator construction: running on ten distinct scientific simulation cases, from fusion energy science [Anirudh2020ImprovedSI, GaldnQuiroga2018BeamIonAD] to aerosol-climate [Tegen2019TheGA] and oceanic [Khatiwala2007ACF] modeling, the results are reliably accurate output simulations with NN-based emulators that run thousands to billions of times faster than the originals, while also outperforming other NAS-based emulation approaches as well as manual emulator design. For example, a global climate model (GCM) simulation tested normally takes about 1150 CPU-hours to run [Tegen2019TheGA], yet the emulator speedup is a factor of 110 million in direct comparison, and over 2 billion with a GPU – providing scientists with simulations on the order of seconds rather than days enables faster iteration of hypotheses and experiments, and potentially new experiments never before thought possible.
Also in this approach is a modified MC dropout method for estimating the predictive uncertainty of emulator outputs. Alternatively, we suggest pursuing Bayesian optimization-based NAS methods [White2021BANANASBO] for more principled uncertainty reasoning. For example, the former can flag when an emulator architecture is overconfident in its predictions, while the latter can do the same while also using the uncertainty values to dynamically adjust training parameters and search strategies.
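The mechanics of MC dropout are simple to sketch: keep dropout active at inference time and treat the spread over stochastic forward passes as a predictive uncertainty estimate. The tiny MLP below uses random weights purely for illustration; a real emulator would of course be trained first:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny MLP with RANDOM weights, purely to illustrate the mechanics;
# a trained emulator would have learned weights here.
W1, b1 = rng.normal(size=(1, 32)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)

def forward(x, drop_rate=0.2):
    # One stochastic forward pass with dropout left ON at inference time.
    h = np.tanh(x @ W1 + b1)
    mask = rng.random(h.shape) > drop_rate
    h = h * mask / (1.0 - drop_rate)   # inverted-dropout scaling
    return h @ W2 + b2

# MC dropout: many stochastic passes give a predictive mean and spread.
x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
samples = np.stack([forward(x) for _ in range(100)])
pred_mean, pred_std = samples.mean(axis=0), samples.std(axis=0)
```

A large `pred_std` relative to the emulator's training error is the kind of signal that can flag overconfident or out-of-distribution predictions.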
The motivations of Kasim et al. for emulator-based accelerated simulation are the same as those we've described above: enable rapid screening and idea testing, and real-time prediction-based experimental control and optimization. The more we can optimize and automate the development and verification of emulators, the more efficiently scientists without ML expertise can iterate over simulation experiments, leading to more robust conclusions and more hypotheses to explore.
Deriving physical laws from data-driven surrogates
Simulating complex dynamical systems often relies on governing equations conventionally obtained from rigorous first principles such as conservation laws or knowledge-based phenomenological derivations. Although nontrivial to derive, these symbolic or mechanistic equations are interpretable and understandable for scientists and engineers. NN-based simulations, including surrogates, are not interpretable in this way and can thus be challenging to use, especially in the many cases where it is important that the scientist or engineer understands the causal, data-generating mechanisms.
Recent ML-driven advances have led to approaches for sparse identification of nonlinear dynamics (SINDy) [Brunton2016SparseIO] to learn ODEs or PDEs from observational data. SINDy essentially selects dominant candidate functions from a high-dimensional nonlinear function space based on sparse regression to uncover ODEs that match the given data; one can think of SINDy as providing an ODE surrogate model. This exciting development has led to scientific applications from biological systems [Mangan2016InferringBN] to chemical processes [Bhadriraju2019MachineLA] to active matter [Cichos2020MachineLF], as well as data-driven discovery of spatiotemporal systems governed by PDEs [Rudy2017DatadrivenDO, Schaeffer2017LearningPD]. Here we showcase two significant advances on SINDy utilizing methods described in the surrogate modeling motif and others:

Synergistic learning of a deep NN surrogate and governing PDEs from sparse and independent data – Chen et al. [Chen2020DeepLO] present a novel physics-informed deep learning framework to discover the governing PDEs of nonlinear spatiotemporal systems from scarce and noisy data, accounting for different initial/boundary conditions (IBCs). Their approach integrates the strengths of deep NNs for learning rich features, automatic differentiation for accurate and efficient derivative calculation, and sparse regression to tackle the fundamental limitation of existing methods that scale poorly with data noise and scarcity. The special network architecture design is able to account for multiple independent datasets sampled under different IBCs, shown with simple experiments that should still be validated on more complex datasets. An alternating direction optimization strategy simultaneously trains the NN on the spatiotemporal data and determines the optimal sparse coefficients of selected candidate terms for reconstructing the PDE(s) – the NN provides accurate modeling of the solution and its derivatives as a basis for constructing the governing equation(s), while the sparsely represented PDE(s) in turn informs and constrains the NN, which makes it generalizable and further enhances the discovery. The overall semi-mechanistic approach – bottom-up (data-driven) and top-down (physics-informed) processes – is promising for ML-driven scientific discovery.

Sparse identification of missing model terms via Universal Differential Equations – We earlier described several scenarios where an ML surrogate is trained for only part of the full simulator system, perhaps for the computationally inefficient or the unknown parts. This is also a use case of the UDE, replacing parts of a simulator described by mechanistic equations with a data-driven NN surrogate model. Now consider we're at the end of the process of building a UDE: we have learned and verified an approximation for part of the causal generative model (i.e., a simulator). Do we lose interpretability and analysis capabilities? With a knowledge-enhanced version of the SINDy method we can sparse-identify the learned semi-mechanistic UDE back to mechanistic terms that are understandable and usable by domain scientists. Rackauckas et al. [Rackauckas2020UniversalDE] modify the SINDy algorithm to apply to only subsets of the UDE equation in order to perform equation discovery specifically on the trained neural network components. In a sense this narrows the search space of potential governing equations by utilizing the prior mechanistic knowledge that wasn't replaced in training the UDE. Along with the UDE approach in general, this sparse identification method needs further development and validation with more complex datasets.
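The core sparse-regression step shared by these SINDy variants, sequential thresholded least squares over a library of candidate terms, can be sketched on a toy system with known dynamics (the system and library below are illustrative assumptions):

```python
import numpy as np

# Illustrative data from a known system: du/dt = -2u + 0.5u^3.
u = np.linspace(-1.5, 1.5, 200)
dudt = -2.0 * u + 0.5 * u**3

# Candidate library of nonlinear terms: [1, u, u^2, u^3].
Theta = np.stack([np.ones_like(u), u, u**2, u**3], axis=1)

def stls(Theta, dudt, threshold=0.1, iters=10):
    # Sequential thresholded least squares: the core SINDy step.
    xi = np.linalg.lstsq(Theta, dudt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], dudt, rcond=None)[0]
    return xi

xi = stls(Theta, dudt)   # sparse coefficients, one per library term
```

The nonzero entries of `xi` directly name the active mechanistic terms, which is what makes the recovered model readable by domain scientists; in the UDE variant, the regression is applied only to the learned network component rather than the full dynamics.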
Future directions
The UDE approach has significant implications for use with simulators and physical modeling, where the underlying mechanistic models are commonly differential equations. By directly utilizing mechanistic modeling simultaneously with universal approximator models, the UDE is a powerful semi-mechanistic approach allowing for arbitrary data-driven model extensions. In the context of simulators, this means a synergistic model of domain expertise and real-world data that more faithfully represents the true data-generating process of the system.
What we’ve described is a transformative approach for MLaugmented scientific modeling. That is,

Practitioner identifies known parts of a model and builds a UDE – when using probabilistic programming (an SI engine motif), this step can be done in a high-level abstraction where the user does not need to write custom inference algorithms.

Train an NN (or another surrogate model such as a Gaussian process) to capture the missing mechanisms – one may look to NAS and Bayesian optimization approaches to do this in an automated, uncertainty-aware way.

The missing terms can be sparse-identified into mechanistic terms – this is an active area of research and much verification of this concept is needed, as mentioned in the example above.

Verify the recovered mechanisms are scientifically sane – for future work, how can we better enable this with human-machine teaming?

Verify quantitatively: extrapolate, do asymptotic analysis, run posterior predictive checks, predict bifurcations.

Gather additional data to validate the new terms. (Note we use "verify" and "validate" deliberately, as there's an important difference between verification and validation (V&V): verification asks "are we building the solution right?" whereas validation asks "are we building the right solution?" [lavin2021technology].)
Providing the tools for this semi-mechanistic modeling workflow can be immensely valuable for enabling scientists to make the best use of domain knowledge and data, and is precisely what the SI stack can deliver – notably with the differentiable programming and probabilistic programming "engine" motifs. Even more, building in a unified framework that's purpose-built for SI provides extensibility, for instance to integrate recent graph neural network approaches from Cranmer et al. [Cranmer2020DiscoveringSM] that can recover the governing equations in symbolic form from learned physics-informed models, or the "AI Feynman" [Udrescu2020AIFA, Udrescu2020AIF2] approaches based on traditional fitting techniques in coordination with neural networks that leverage physics properties such as symmetries and separability in the unknown dynamics function(s) – key features such as NN equivariance and normalizing flows (and how they may fit into the stack) are discussed later.
3. Simulation-Based Inference
Numerical simulators are used across many fields of science and engineering to build computational models of complex phenomena. These simulators are typically built by incorporating scientific knowledge about the mechanisms which are known (or assumed) to underlie the process under study. Such mechanistic models have often been extensively studied and validated in the respective scientific domains. In complexity, they range from extremely simple models that have a conceptual or even pedagogical flavor (e.g., the Lotka-Volterra equations describing predator-prey interactions in ecological systems and also economic theories [Gandolfo2008GiuseppePA] (expressed in Fig. 21)) to extremely detailed and expensive simulations implemented on supercomputers, e.g., whole-brain simulations [markram2015reconstruction].
A common challenge – across scientific disciplines and complexity of models – is the question of how to link such simulation-based models with empirical data. Numerical simulators typically have some parameters whose exact values are not known a priori and have to be inferred from data. For reasons detailed below, classical statistical approaches cannot readily be applied to models defined by numerical simulators. The field of simulation-based inference (SBI) [Cranmer2020TheFO] aims to address this challenge by designing statistical inference procedures that can be applied to complex simulators. Building on foundational work from the statistics community (see [sisson2018handbook] for an overview), SBI is starting to bring together work from multiple fields – including, e.g., population genetics, neuroscience, particle physics, cosmology, and astrophysics – which are facing the same challenges and using tools from machine learning to address them. SBI can provide a unifying language, and common tools [tejero2020sbi, lintusaari2018elfi, klinger2018pyabc] and benchmarks [lueckmann2021benchmarking] are being developed and generalized across different fields and applications.
Why is it so challenging to constrain numerical simulations by data? Many numerical simulators have stochastic components, which are included either to provide a verisimilar model of the system under study if it is believed to be stochastic itself, or often also pragmatically to reflect incomplete knowledge about some components of the system. Linking such stochastic models with data falls within the domain of statistics, which aims to provide methods for constraining the parameters of a model by data, approaches for selecting between different model candidates, and criteria for determining whether a hypothesis can be rejected on grounds of empirical evidence. In particular, statistical inference aims to determine which parameters – and combinations of parameters – are compatible with empirical data and (possibly) a priori assumptions. A key ingredient of most statistical procedures is the likelihood $p(x \mid \theta)$ of data $x$ given parameters $\theta$. For example, Bayesian inference characterizes the parameters which are compatible both with data and prior by the posterior distribution $p(\theta \mid x)$, which is proportional to the product of likelihood and prior, $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$. Frequentist inference procedures typically construct confidence regions based on hypothesis tests, often using the likelihood ratio as test statistic.
However, for many simulation-based models, one can easily sample from the model (i.e., generate synthetic data $x \sim p(x \mid \theta)$) but evaluating the associated likelihoods can be computationally prohibitive – because, for instance, the same output could result from a very large number of internal paths through the simulator, and integrating over all of them is prohibitive. More pragmatically, it might also be the case that the simulator is implemented in a "black-box" manner which does not provide access to its internal workings or states. If likelihoods cannot be evaluated, most conventional inference approaches cannot be used. The goal of simulation-based inference is to make statistical inference possible for so-called implicit models, which allow generating simulated data but not evaluating likelihoods.
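Formally, the intractability arises because the likelihood of an implicit model marginalizes over all latent internal paths $z$ through the simulator:

```latex
p(x \mid \theta) = \int p(x, z \mid \theta)\, \mathrm{d}z
                 = \int p(x \mid z, \theta)\, p(z \mid \theta)\, \mathrm{d}z
```

When the simulator has many stochastic branches, $z$ is high-dimensional and this integral has no closed form, even though drawing samples $x$ is easy.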
SBI is not a new idea. Simulation-based inference approaches have been studied extensively in statistics, typically under the heading of likelihood-free inference. An influential approach has been that of Approximate Bayesian Computation (ABC) [rubin1984, beaumont2002approximate, sisson2018handbook]. In its simplest form it consists of drawing parameter values $\theta$ from a proposal distribution, running the simulator for these parameters to generate synthetic outputs $x$, comparing these outputs against the observed data $x_{\text{obs}}$, and accepting the parameter values only if they are close to the observed data under some distance metric, $\rho(x, x_{\text{obs}}) \le \epsilon$. After following this procedure repeatedly, the accepted samples approximately follow the posterior. A second class of methods approximates the likelihood by sampling from the simulator and estimating the density in the sample space with kernel density estimation or histograms. This approximate density can then be used in lieu of the exact likelihood in frequentist or Bayesian inference techniques [Diggle1984MonteCM]. Both of these methods enable approximate inference in the likelihood-free setting, but they suffer from certain shortcomings: in the limit of a strict ABC acceptance criterion ($\epsilon \to 0$) or small kernel size, the inference results become exact, but the sample efficiency is reduced (the simulation has to be run many times). Relaxing the acceptance criterion or increasing the kernel size improves the sample efficiency, but reduces the quality of the inference results. The main challenge, however, is that these methods do not scale well to high-dimensional data, as the number of required simulations grows approximately exponentially with the dimension of the data $x$. In both approaches, the raw data is therefore usually first reduced to low-dimensional summary statistics. These are typically designed by domain experts with the goal of retaining as much information on the parameters as possible. In many cases, the summary statistics are not sufficient, and this dimensionality reduction limits the quality of inference or model selection [robert2011lack]. Recently, new methods for learning summary statistics in a fully [jiang2017learning, chen2021neural] or semi-automatic manner [fearnhead2012constructing] are emerging, which might alleviate some of these limitations.
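The simplest form of ABC described above can be sketched in a few lines (the Gaussian toy simulator, uniform prior, summary statistic, and tolerance are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Implicit model: we can simulate data given theta, but have no likelihood.
def simulator(theta, n=50):
    return rng.normal(loc=theta, scale=1.0, size=n)

x_obs = simulator(1.5)   # "observed" data, generated at a true theta of 1.5

def abc_rejection(x_obs, n_draws=20000, eps=0.1):
    # Simplest ABC: accept theta whenever simulated and observed summary
    # statistics (here, the sample mean) fall within eps of each other.
    s_obs = x_obs.mean()
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)          # draw from a broad prior
        if abs(simulator(theta).mean() - s_obs) < eps:
            accepted.append(theta)
    return np.array(accepted)

posterior_samples = abc_rejection(x_obs)   # approximate posterior draws
```

Shrinking `eps` sharpens the approximation but drives the acceptance rate, and hence the sample efficiency, towards zero, which is exactly the trade-off discussed above.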
The advent of deep learning has powered a number of new simulation-based inference techniques. Many of these methods rely on the key principle of training a neural surrogate for the simulation. Such models are closely related to the emulator models discussed in the previous section, but instead of being geared towards efficient sampling, they are designed to give access to the likelihood [2018arXiv180507226P, 2018arXiv180509294L] (or the related likelihood ratio [Neal:2007zz, 2012arXiv1212.1479F, Cranmer:2015bka, 2016arXiv161003483M, 2016arXiv161110242D, miller2021truncated]) or the posterior [le2017using, NIPS2016_6084, 2017arXiv171101861L, greenberg2019automatic]. After the surrogate has been trained, it can be used during frequentist or Bayesian inference instead of the simulation. At a high level, this approach is similar to the traditional method based on histograms or kernel density estimators [Diggle1984MonteCM], but modern ML models and algorithms allow it to scale to higher-dimensional and potentially structured data.
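The "likelihood-ratio trick" behind several of these surrogate methods can be demonstrated at toy scale: train a classifier to distinguish simulator samples at parameter $\theta$ from samples of a reference distribution, and read the likelihood ratio off the classifier's odds. In this sketch a logistic regression on hand-picked features stands in for a neural network, and the Gaussian "simulator", the reference distribution, and all hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Samples from the "simulator" at theta (N(1, 1)) and from a reference N(0, 1).
x_theta = rng.normal(1.0, 1.0, 5000)
x_ref = rng.normal(0.0, 1.0, 5000)

# Features [x, x^2, 1], so logistic regression can represent the exact
# log-ratio of two Gaussians (a quadratic function of x).
def features(x):
    return np.stack([x, x**2, np.ones_like(x)], axis=1)

X = np.concatenate([features(x_theta), features(x_ref)])
y = np.concatenate([np.ones(5000), np.zeros(5000)])

# Train by plain gradient descent on the cross-entropy loss.
w = np.zeros(3)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

# The classifier odds c/(1-c) = exp(X w) approximate p(x|theta) / p_ref(x).
def log_ratio(x):
    return features(np.atleast_1d(x)) @ w

# Exact log-ratio of N(1,1) vs N(0,1) is x - 0.5; compare the estimate at x = 1.
print(log_ratio(1.0), 1.0 - 0.5)
```

With sufficient capacity and data the learned odds converge to the true ratio, which is exactly the property the neural surrogates above exploit at scale.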
The impressive recent progress in SBI methods does not stem from deep learning alone. Another important theme is active learning: running the simulator and inference procedure iteratively and using past results to improve the proposal distribution of parameter values for the next runs [ranjan2008sequential, Bect2012, 2014arXiv1401.2838M, 2015arXiv150103291G, 2015arXiv150603693M, 2018arXiv180507226P, 2018arXiv180509294L, NIPS2016_6084, 2017arXiv171101861L, jarvenpaa2019efficient]. This can substantially improve the sample efficiency. Finally, in some cases simulators are not just black boxes: we have access to (part of) their latent variables and mechanisms, or to probabilistic characteristics of their stack trace. In practice, such information can be made available through domain-specific knowledge or by implementing the simulation in a framework that supports differentiable or probabilistic programming – i.e., the SI engine. If it is accessible, such data can substantially improve the sample efficiency with which neural surrogate models can be trained, reducing the required compute [graham2017asymptotically, Brehmer:2018hga, Stoye:2018ovl, baydin2019efficient, kersting2020differentiable]. At a high level, this represents a tighter integration of the inference engine with the simulation [Baydin2019EtalumisBP].

These components – neural surrogates for the simulator, active learning, and the integration of simulation and inference – can be combined in different ways to define workflows for simulation-based inference, both in the Bayesian and frequentist settings. We show some example inference workflows in Fig. 10. The optimal choice of workflow depends on the characteristics of the problem, in particular on the dimensionality and structure of the observed data and the parameters, whether a single data point or multiple i.i.d. draws are observed, the computational complexity of the simulator, and whether the simulator admits access to its latent process.
Examples
Simulation-based inference techniques have been used to link models to data in a wide range of problems, reflecting the ubiquity of simulators across the sciences. In physics, SBI is useful from the smallest scales in particle colliders [Brehmer:2018kdj, Brehmer:2018eca, Brehmer:2019xox], where it allows us to measure the properties of the Higgs boson with higher precision and less data, to the largest scales in the modeling of gravitational waves [Delaunoy:2020zcu, Dax:2021tsq], stellar streams [Hermans:2020skz], gravitational lensing [Brehmer:2019jyt], and the evolution of the universe [Alsing:2018eau]. These methods have also been applied to study the evolutionary dynamics of protein networks [ratmann2007using] and of yeast strains [avecilla2021simulation]. In neuroscience, they have, e.g., been used to estimate the properties of ion channels from high-throughput voltage-clamp data and the properties of neural circuits from observed rhythmic behaviour in the stomatogastric ganglion [gonccalves2020training], to identify network models which can capture the dynamics of neuronal cultures [sukenik2021neuronal], to study how the connectivity between different cell types shapes dynamics in cortical circuits [bittner2021interrogating], and to identify biophysically realistic models of neurons in the retina [oesterle2020bayesian, yoshimatsu2021ancestral].
SBI is not restricted to the natural sciences. In robotics, for example, simulators are commonly used to generate training data on which the parameters of control and policy-search algorithms are subsequently trained. A common challenge is the question of how policies trained on simulated data can subsequently be transferred to the real world, called “sim-to-real” transfer. SBI has been used to infer posterior distributions over the parameters of simulators [ramos2019bayessim, muratore2021neural] – the idea is that simulations from the posterior would yield more realistic data than ‘hand-tuned’ simulators, and that by simulating from the posterior, one would also cover a range of different domains (or distributions, environment variations, etc.) rather than overfitting to a single parameter regime. For example, [marlier2021simulation] used this approach to identify posteriors over hand configurations, on which a multi-fingered robotic grasping algorithm was trained.
We cannot do justice to the wide variety of SBI applications here; instead, we briefly zoom in on two particular use cases from particle physics and neuroscience.
Measuring the properties of the Higgs boson
The experiments at the Large Hadron Collider probe the laws of physics at the smallest length scales accessible to humans. A key goal is to precisely measure how the Higgs boson interacts with other elementary particles: patterns in these interactions may provide evidence for physics beyond the “Standard Model”, the established set of elementary particles and laws that govern them, and ultimately help us answer some of the open fundamental questions of physics.
In this endeavour physicists rely on a chain of high-fidelity simulators for the interactions between elementary particles, their radiation and decay patterns, and their interactions with the detector. Owing to the large number of stochastic latent variables (state-of-the-art simulations involve many millions of random variables), the likelihood function of these simulators is intractable. Inference is traditionally made possible by first reducing the high-dimensional data to hand-crafted summary statistics and estimating the likelihood with histograms, as discussed above, but popular summary statistics are often lossy, leading to less precise measurements.
Recently, particle physicists have begun developing and applying machine-learning-driven simulation-based inference methods to this problem [Brehmer:2020cvb, Brehmer:2018kdj, Brehmer:2019xox]. On the one hand, these methods draw on the growing interdisciplinary toolbox of simulation-based inference and on the rapid progress in deep learning. On the other hand, they are tailored to the characteristics of this scientific measurement problem: they make use of the latent structure in particle physics simulators, are geared towards a large number of i.i.d. samples, and support inference in a frequentist framework, the established paradigm in particle physics. Multiple different algorithms have been proposed, but the basic strategy commonly consists of three steps (as we sketch in Fig. 11):

The chain of simulators is run for various values of the input parameters $\theta$, for instance characterizing the interaction patterns between the Higgs boson and other elementary particles. These parameters are sampled from a proposal distribution, which needs to cover the region of interest in the parameter space (but does not have to be related to a Bayesian prior). For each simulated sample, the simulator inputs and outputs are stored. The outputs can consist of low-level data like energy deposits in the various detector components, but often it is more practical to parameterize them in the form of preprocessed high-level variables like the kinematic properties of reconstructed particles. In addition, certain latent variables that characterize the simulated process can be stored.

A machine learning model, often a multi-layer perceptron [Tang2016ExtremeLM], is trained on the simulated dataset, minimizing one of several proposed loss functions. They all have in common that (assuming sufficient capacity, infinite data, and perfect optimization) the network will converge to the likelihood ratio function $r(x|\theta) = p(x|\theta) / p_{\mathrm{ref}}(x)$, where $p_{\mathrm{ref}}$ is a reference distribution (for instance the marginal likelihood or the likelihood for some value of the parameters). Some loss functions leverage the latent variables from the simulator process to improve the sample efficiency of the training.
The observed dataset consists of multiple i.i.d. events $x_i$. The trained network is evaluated for each of these events. This provides a tractable and differentiable approximate likelihood function (up to an irrelevant overall constant), which is then used to define the maximum likelihood estimator as well as confidence regions based on likelihood ratio tests, often relying on convenient asymptotic properties; in this frequentist approach, nuisance parameters are incorporated through profiling [Cowan:2010js].
First phenomenological studies on synthetic data have shown that these methods have the potential to substantially improve the precision of Higgs measurements at the LHC [Brehmer:2018kdj, Brehmer:2018eca, Brehmer:2019xox, Brehmer:2019gmn, Chen:2020mev, Barman:2021yfh, Bahl:2021dnc]. It remains a challenge to scale these analyses to real data and in particular to a realistic modeling of systematic uncertainties, which are usually modeled with thousands of nuisance parameters.
Future directions
An intriguing direction in which to further develop SBI methods is towards human-machine inference: an efficient and automated loop of the scientific method (at least for sufficiently well-posed problems). As described in Fig. 12, the overall process is (1) perform experiment, (2) perform inference, (3) optimize experiment, and (4) repeat from (1). This approach provides an efficient means of running physical experiments, particularly expensive experiments in large search and hypothesis spaces, for instance at the Large Hadron Collider at CERN [Louppe2017TeachingMT]. This is one means by which simulation intelligence can broaden the scientific method. It will be fascinating to advance this SI-based science workflow by replacing the information gain objectives with open-ended optimization (discussed later in the SI engine motifs).
Simulation-based inference is a rapidly advancing class of methods, with progress fueled by newer programming paradigms such as probabilistic programming and differentiable programming. Cranmer et al. [Cranmer2020TheFO]
predict that “several domains of science should expect either a significant improvement in inference quality or the transition from heuristic approaches to those grounded in statistical terms tied to the underlying mechanistic model. It is not unreasonable to expect that this transition may have a profound impact on science.”
4. Causal Reasoning
Standard approaches in statistical analysis and pattern recognition draw conclusions based on associations among variables and learned correlations in the data. Yet these approaches do not suffice when scientific queries are of a causal rather than associative nature – there is a marked distinction between correlations and cause-effect relationships. Such causal questions require some knowledge of the underlying data-generating process, and cannot be answered from the data alone, nor from the distributions that govern the data [Pearl2000CausalityMR, Pearl2009CausalII]. Understanding causality is both the basis and the ultimate goal of scientific research in many domains – from genomics [Gruber2010AnAO] and healthcare [Prosperi2020CausalIA], to economics [Athey2018TheIO] and game theory [Toulis2016LongtermCE], to Earth systems sciences [Runge2019InferringCF] and more.

Abstractly, we can consider a causal model to sit qualitatively between a mechanistic model (described by differential equations, for example) and a statistical model (learned from data). That is, a mechanistic model is a description of the dynamics governing a system, but is manually encoded by human experts, while a statistical model such as a neural network learns a mapping of variables from observational data that only holds when the experimental conditions are fixed. Neither allows for causal queries. Causal learning lies between these two extremes, seeking to model the effects of interventions and distribution changes with a combination of data-driven learning and assumptions not already included in the statistical description of a system [Scholkopf2021TowardCR].
Reasoning about cause and effect with a causal model corresponds to formulating causal queries, which have been organized by Pearl & Mackenzie [Pearl2018TheBO] in a three-level hierarchy termed the ladder of causation [Peyrard2020ALO] (Fig. 13):

Observational queries: seeing and observing. What can we tell about $Y$ if we observe $X = x$?

Interventional queries: acting and intervening. What can we tell about $Y$ if we do $X = x$?

Counterfactual queries: imagining, reasoning, and understanding. Given that $X = x$ actually happened, what would have happened to $Y$ had we done $X = x'$?

Here $X$ and $Y$ are variables in the system of interest, with events $x$ and $x'$. The difference between observation (conditional probability) and action (interventional calculus) is what motivated the development of causality [Hardt2021PatternsPA]. More precisely, the disagreement between interventional statements and conditional statements is defined as confounding: $X$ and $Y$ are confounded when the causal effect of the action $do(X = x)$ on $Y$ does not coincide with the corresponding conditional probability $P(Y \mid X = x)$.

It is common to approach the analysis of causal relationships with counterfactual reasoning: the difference between the outcomes if an action had been taken and if it had not been taken is defined as the causal effect of the action [Pearl2009CausalII, Morgan2014CounterfactualsAT]. Counterfactual problems involve reasoning about why things happened, imagining the consequences of different actions that were never observed, and determining which actions would have achieved a desired outcome. A counterfactual question may be “would the virus have spread outside the city if the majority of adults had been vaccinated one week sooner?” In practice this can be done with structural causal models (SCMs), which give us a way to formalize the effect of hypothetical actions or interventions on the population within the assumptions of our model. An SCM, represented as a directed acyclic graph (DAG), is a collection of formal assumptions about how certain variables interact and compose the data-generating process.
Answers to counterfactual questions strongly depend on the specifics of the SCM; it is possible to construct two models that have identical graph structures and behave identically under interventions, yet give different answers or potential outcomes [Peters2017ElementsOC]. Without an SCM, we can still obtain causal estimates based on counterfactuals by using the potential outcomes framework, or Rubin–Neyman causal model [Rubin1974EstimatingCE, Rubin2005CausalIU], which computes cause-effect estimates by considering the set of potential events or actions leading to an outcome.
A step below on the causal ladder are interventional questions, for instance “how does the probability of herd immunity change if the majority of adults are vaccinated?” Interventions in an SCM are made by modifying a subset of variable assignments in the model. Several types of interventions exist [Eaton2007ExactBS]: a hard/perfect intervention involves directly fixing a variable's value, a soft/imperfect intervention corresponds to changing the conditional distribution, an uncertain intervention implies the learner is unsure which mechanism/variable is affected by the intervention, and of course there is the no-intervention base case, where only observational data is obtained from the SCM.
From the machine learning perspective, counterfactual reasoning is strictly harder than interventional reasoning, as it requires imagining the action rather than doing it. And associative reasoning – that is, with a statistical model such as a neural network – is the simplest of all, as it purely observes data (i.e., level 1 of the causal ladder). We illustrate these three levels of causality in Fig. 13. For a full treatment, please refer to Pearl's many works [Pearl2000CausalityMR, Pearl2009CausalII], the causal inference survey by Yao et al. [Yao2020ASO], Schölkopf et al. [Scholkopf2019CausalityFM, Scholkopf2021TowardCR] and the references therein for more details on causality in machine learning, and Guo et al. [Guo2020ASO] for an overview of causal data science broadly. Here we focus on causality in scientific simulators.
Simulators provide virtual testbeds for the exploration of complex realworld scenarios, and, if built with the tooling for causal machine learning (specifically counterfactual reasoning), can provide “whatif” analyses in an efficient and costeffective way. That is, each run of a simulator is a realization of a particular possible world, and counterfactual whatif claims can be studied by varying independent parameters of the model and rerunning the simulation.
Simulation tracebased causality
Since simulator execution traces in general consist of states, events, and actions, we can define causal relationships based on these entities – without this grounding, one cannot make reliable causal claims, as we discuss in the final example of this section. Herd & Miles [Herd2019DetectingCR] provide a practical formulation: event $e_1$ causes event $e_2$ in an actual trace $\tau$ if and only if there exists a counterfactual trace $\tau'$ that is similar enough to $\tau$, which requires that neither $e_1$ nor $e_2$ happens in the same state in which they happened in $\tau$. This suggests counterfactual analysis can be performed by searching all traces produced by the simulator for such a counterfactual $\tau'$, but this is clearly intractable for any nontrivial simulator. However, if we intervene on the trace during the simulation process, a variety of counterfactual traces can be generated with slight differences in events such that we may reason about the causal relationships between events. This approach is accordingly called intervention-based counterfactual analysis.
Simulation traces can be complex, admitting several forms of causation based on the occurrence and omission of events in a trace [Herd2019DetectingCR]:

Basic causation – In order for event $e_1$ to cause event $e_2$ ($e_1 \rightarrow e_2$), $e_1$'s occurrence has to be both sufficient and necessary for $e_2$ in $\tau$. That is, for the mathematically inclined reader, $e_1 \Rightarrow e_2$ and $\neg e_1 \Rightarrow \neg e_2$.

Prevention – In order for event $e_1$ to prevent event $e_2$, $e_1$'s occurrence has to be both necessary and sufficient for $e_2$'s omission. That is, $e_1 \Rightarrow \neg e_2$ and $\neg e_1 \Rightarrow e_2$.

Transitivity – If $e_1$ causes $e_2$ and $e_2$ causes $e_3$, then $e_1$ causes $e_3$.

Causation by omission – If $e_1$'s omission is both necessary and sufficient for $e_2$'s occurrence, and we accept that omissions are allowed to be causes, then $e_1$'s omission can be considered to cause $e_2$'s occurrence.

Prevention by caused prevention – It follows from the prior causal claims that if $e_1$ causes $e_2$ and $e_2$ prevents $e_3$, then $e_1$ prevents $e_3$.
The execution of a simulator results in one or more simulation traces, which represent various paths through the state space given the underlying data-generating model – not unlike the execution traces resulting from probabilistic programming inference. By formalizing the sequence of simulator states with trace-based definitions of causal dependencies, we have the computational and mathematical tooling to make causal inquiries, which we introduce in the examples below. Further, with a simulator that produces a full posterior over the space of traces, as in probabilistic programming, we can attempt to draw causal conclusions with causal inference and discovery algorithms. We introduce this later in the motifs integrations section.
Examples
Tracing causal epidemiology chains with in silico interventions
Systems of networked entities or agents that evolve over time can be difficult to understand in a way that is practical for decision-making. Consider, for instance, the propagation of a virus in an environment, where it can be useful to identify “patient zero”, the agent that initiated the chain of infection. Methods for intervention-based counterfactual analysis can be used in agent-based modeling simulations to determine the root cause of infection for each eventually infected agent.
First, the researchers [Herd2019DetectingCR] begin with a simple counterfactual simulation experiment without intervention – that is, only manipulating the initial infection states before starting the simulation, not actively intervening in the simulation execution. This exemplifies the standard use of simulators for counterfactual reasoning: modify the input variables and environment parameters, observe the change in outputs per run, and assume cause-effect relationships. This inactive approach is susceptible to preemption confounding: only manipulating the initial state of the disease transmission model may open the door for spurious background effects to creep in and, from an observer's perspective, mask the original causal relationship. Thus, except for very simplistic simulations, inactive counterfactual reasoning is not necessarily reliable for determining root causes. To actively intervene in the simulation, one needs to implement a series of experiments with both negated infections and altered times of infection. In a small testbed of 10 agents, the root cause analysis produces a correct partitioning of the agents into two groups, where each group shares a common root cause infection, as shown in Fig. 15. This demonstration is simplistic in its representation and dynamics of agent states. A more interesting use case for future work would be simulations with multi-agent policies, such as the economic simulations we discuss next in the agent-based modeling motif, where one could actively intervene to discover why agents make certain decisions. Additional challenges to this approach can arise when considering multiple competing and interacting causal chains, which is to be expected in simulations of complex scenarios.
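A minimal sketch of intervention-based root-cause analysis on such a model (the contact network, infection rule, and agent count are invented, and the dynamics are deterministic for brevity): rerun the simulation while negating one agent's initial infection, and attribute to that agent every infection that disappears in the counterfactual trace.

```python
def run_epidemic(contacts, seeds, steps, negate=None):
    """Deterministic toy contact-process simulation.
    negate: an agent whose initial infection is suppressed (an intervention)."""
    infected = set(seeds) - ({negate} if negate is not None else set())
    for _ in range(steps):
        newly = {b for a, b in contacts if a in infected}
        infected |= newly
    return infected

# Two separate transmission chains, seeded by agents 0 and 5.
contacts = [(0, 1), (1, 2), (2, 3), (5, 6), (6, 7)]
seeds = {0, 5}
factual = run_epidemic(contacts, seeds, steps=5)

# Root-cause analysis: seed s caused agent a's infection iff negating s's
# infection makes a's infection vanish in the counterfactual trace.
caused_by = {s: factual - run_epidemic(contacts, seeds, 5, negate=s) for s in seeds}
print(caused_by)  # partitions the infected agents by root cause
```

In a stochastic agent-based model, the same negation would be applied during execution (holding the other random choices fixed), which is precisely what distinguishes active intervention from merely changing the initial conditions.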
Active causal discovery and optimizing experiment design
Asking the causal questions discussed above requires a causal model to begin with. The problem of inferring such a model from observational, interventional, or mixed data is called causal discovery. Dedicated causal discovery algorithms exist and can be separated into two subtypes: constraint-based and score-based. Constraint-based algorithms construct the causal structure from conditional independence constraints, while score-based algorithms generate a number of candidate causal graphs, assign a score to each, and select a final graph based on the scores [Shen2020ChallengesAO]. For an overview of the causal discovery subfield, see Glymour et al. [Glymour2019ReviewOC]. Here we are interested in active causal discovery, illustrating a method that can form a novel “human-machine teaming” workflow.
The setting for active causal discovery involves an agent iteratively selecting experiments (or targeted interventions) that are maximally informative about the underlying causal structure of the system under study. The experiment process is shown in Fig. 16. At each iteration the agent,

Performs an experiment in the form of intervening on one of the variables and fixing its value, $do(X_i = x)$;

Observes the outcome by sampling from the interventional distribution $p(X \mid do(X_i = x))$;

Updates its beliefs with the observed outcome;

Plans the next experiment to maximize information gain (or similar criteria).
The process repeats until the agent converges on a causal world model (with sufficient confidence) or an experiment threshold is reached. We use the syntax $do(\cdot)$, which refers to Pearl's “do-calculus” [Pearl1995CausalDF, Tucci2013IntroductionTJ]. In short, the do-operator forces a variable to take a certain value or distribution, which is distinct from conditioning that variable on a value or distribution. This operation allows us to estimate interventional distributions and causal effects from observational distributions (under certain conditions). Notice in the SI stack (Fig. 1) that do-calculus is stated explicitly within the causality module.
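The experiment loop above can be sketched for the smallest interesting case, two Markov-equivalent hypotheses $X \rightarrow Y$ and $Y \rightarrow X$. Observational data alone cannot separate them, but a handful of do-interventions on $X$ can; the mechanisms, the fixed intervention value, and the simple log-likelihood update are all illustrative:

```python
import math
import random

random.seed(0)

# Ground truth (hidden from the agent): X -> Y with y = 2x + noise.
def experiment(do_x):
    """Perform a targeted intervention do(X = do_x) and observe Y."""
    return 2 * do_x + random.gauss(0, 0.5)

def lognorm(v, mu, sd):
    return -0.5 * ((v - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def loglik(hyp, do_x, y):
    """Log-likelihood of the outcome under each causal hypothesis, with
    mechanisms assumed known (illustrative): if X -> Y, Y responds to the
    intervention; if Y -> X, Y keeps its observational marginal."""
    if hyp == "x->y":
        return lognorm(y, 2 * do_x, 0.5)
    return lognorm(y, 0.0, math.sqrt(4.25))  # marginal sd of Y in obs. data

# Bayesian belief update over the two Markov-equivalent graphs.
log_post = {"x->y": 0.0, "y->x": 0.0}
for _ in range(5):
    x_val = 3.0              # experiment chosen to be highly informative
    y = experiment(x_val)    # observe the interventional outcome
    for hyp in log_post:
        log_post[hyp] += loglik(hyp, x_val, y)

best = max(log_post, key=log_post.get)
print(best, {h: round(v, 1) for h, v in log_post.items()})
```

A full implementation would also maintain beliefs over mechanism parameters and choose each intervention by maximizing expected information gain, rather than fixing it in advance as done here.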
The active learning approach with targeted experiments allows us to uniquely recover the causal graph, whereas score- or constraint-based causal discovery from observational data can struggle to go beyond the Markov equivalence class [Kgelgen2019OptimalED]. In the simulation setting we have virtual agents, while real scientists can follow the same procedures in the lab, potentially minimizing the experiments needed on costly equipment – notice the relationship to the sequential design-of-experiment methods we previously discussed, now with the mathematics of causality. One can also imagine several variations involving human-machine teams to iterate over causal models and interventions (both in silico and in situ).
Von Kügelgen et al. [Kgelgen2019OptimalED] introduce a probabilistic approach with Bayesian nonparametrics, and there has been one application we know of, in the domain of neurodegenerative diseases [Lavin2020NeurosymbolicND]. The main developments are twofold:
1. The causal models contain continuous random variables with nonlinear functional relationships, which are modeled with Gaussian process (GP) priors. A GP is a stochastic process fully specified by its mean function $m(x)$ and covariance function $k(x, x')$, such that any finite set of its random variables has a joint Gaussian distribution [GPML]. GPs provide a robust method for modeling nonlinear functions in a Bayesian nonparametric framework; ordinarily one places a GP prior over the function and combines it with a suitable likelihood to derive a posterior estimate for the function given data. GPs are flexible in that, unlike parametric counterparts, they adapt to the complexity of the data, and they are largely advantageous in practice because they provide a principled way to reason about uncertainties [Ghahramani2015ProbabilisticML].

2. The space of possible interventions is intractable, so Bayesian optimization (BO) is used to efficiently select the sequence of experiments that balances exploration and exploitation relative to the gained information and the optimization objective. BO is a probabilistic ML approach for derivative-free global optimization of a function $f$, where $f$ is typically costly to evaluate (in compute, time, or interactions with people and environment). BO consists of two main components: a Bayesian statistical model of the objective function (typically a GP, thus with uncertainty quantification), and an acquisition function for deciding where to sample next [Shahriari2016TakingTH, Frazier2018ATO]. The general idea is to trade off the computation spent evaluating $f$ many times against computation invested in selecting more promising candidate solutions $x \in \mathcal{X}$, where the space $\mathcal{X}$ can represent the parameters of a real-world system, a machine learning model, an in situ experiment setup, and so on.

The use of probabilistic ML methods here enables us to maintain uncertainty estimates over both graph structures and their associated parameters during the causal discovery process, and potentially to propagate uncertainty values to downstream tasks. We are further investigating how to reliably quantify uncertainties with human-machine causal discovery. Another promising direction for work is to use physics-infused GP methods (discussed in the surrogate modeling motif) as the BO surrogate model – we explore this and related combinations of motifs in the Integrations section later.
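The two BO components just described, a GP surrogate and an acquisition function, can be sketched end to end in NumPy. The quadratic objective, the squared-exponential kernel with unit lengthscale, and the expected-improvement acquisition are all illustrative choices:

```python
import math
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential covariance function k(x, x')."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """GP posterior mean and standard deviation at the query points."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_query)
    sol = np.linalg.solve(K, np.column_stack([y_train, Ks]))
    mu = Ks.T @ sol[:, 0]
    var = 1.0 - np.einsum("ij,ij->j", Ks, sol[:, 1:])  # diag of Ks' K^-1 Ks
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sd, best):
    """Expected-improvement acquisition for minimization."""
    z = (best - mu) / sd
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    return (best - mu) * cdf + sd * pdf

# Costly black-box objective f (illustrative); its true minimum is at x = 0.7.
f = lambda x: (x - 0.7) ** 2

xs = np.linspace(0.0, 2.0, 200)      # candidate experiment locations
x_train = np.array([0.0, 2.0])       # initial design
y_train = f(x_train)
for _ in range(8):                   # BO loop: refit surrogate, maximize acquisition
    mu, sd = gp_posterior(x_train, y_train, xs)
    x_next = xs[np.argmax(expected_improvement(mu, sd, y_train.min()))]
    x_train = np.append(x_train, x_next)
    y_train = np.append(y_train, f(x_next))

print(x_train[np.argmin(y_train)])   # best sampled x, close to the optimum 0.7
```

In the active causal discovery setting, the objective would instead score candidate interventions (e.g., by expected information gain), but the surrogate-plus-acquisition loop is the same.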
Building a driving simulator for causal reasoning
Robust and reliable simulation tools can help address the overarching challenge in counterfactual (and causal) reasoning: one never observes the true counterfactual outcome. In theory, the parameters of simulation environments can be systematically controlled, thereby enabling causal relationships to be established and confounders to be introduced by the researcher. In practice, simulators for complex environments are typically built on game engines [Chance2021OnDO] that do not necessarily suffice for causal reasoning, including simulation testbeds for AI agents such as Unity AI [Juliani2018UnityAG] and OpenAI Gym [Brockman2016OpenAIG]. We suggest several main limitations to overcome:

Counterfactual reasoning in general is not as simple as changing a variable and observing the change in outcomes. Rather, principled counterfactual reasoning is more precise: quantifying the potential outcomes under the Rubin–Neyman causal model [Rubin2005CausalIU] and within the constraints of potential confounders, to derive true cause-effect relationships. An example is the implementation of counterfactual Gaussian processes [Schulam2017WhatIfRW] to simulate continuous-time potential outcomes of various clinical actions on patients.

Confounders – factors that impact both the intervention and the outcomes – can be known or measurable in some cases and hidden in others. The challenge here is to enable the researcher to systematically introduce, remove, and examine the impact of many different types of confounders (both known and hidden) while experimenting with simulations. As mentioned in the ABM example, only manipulating simulation inputs is subject to preemption confounding.

Prior work with causalityspecific simulators has remained in relatively small environments with simplistic entities and actions, for example balls or objects moving on 2D and 3D surfaces [Yi2020CLEVRERCE, Li2020CausalDI, Gerstenberg2021ACS].
The recent “CausalCity” [McDuff2021CausalCityCS] simulation environment aims to provide explicit mechanisms for causal discovery and reasoning – i.e., enabling researchers to create complex, counterfactual scenarios, including different types of confounders, with relatively little effort. The main novelty over similar simulation environments is the notion of agency: each entity in the simulation has the capability to “decide” its low-level behaviors, which enables scenarios to be designed with simple high-level configurations rather than specifying every single low-level action. The authors suggest this is crucial to creating simulation environments that reflect the nature and complexity of these types of temporal real-world reasoning tasks. CausalCity addresses problem (3) above with a city driving simulation of configurable, temporal driving scenarios and large parameter spaces. However, the main aims of a causal-first simulator are left unfulfilled, with only parts of problems (1) and (2) addressed. The notion of counterfactual reasoning in CausalCity is to forward-simulate scenarios with specific conditions or parameters changed and observe the difference in outcomes, meaning the counterfactual claims are limited to the full sequence of events rather than specific cause-effect relationships in the space between simulation inputs and outputs. For analyzing confounders, CausalCity provides tools to control extraneous variables – all variables which are not the independent variable but could affect the results of the experiment – which do not necessarily imply confounding, let alone reveal specific confounding relationships (as one would in an SCM, for instance).
Thus the CausalCity simulator does not include the means for deriving precise causal mechanisms through direct interventions, nor for controlling causal inference bounds due to confounder assumptions. There needs to be tooling for causal modeling directly on the simulator execution trace, as we suggest later with probabilistic-programming-based causal simulations. Nonetheless, CausalCity represents a useful step in the direction of simulators for causal reasoning.
Future directions
Causal reasoning facilitates the simulation of possible futures and the prediction of intervention outcomes, leading to more principled scientific discovery and decision-making. In several of the preceding examples we highlighted “active” processes with causal machine learning and simulators. Our belief is that further development and validation can bring these active causal reasoning workflows into scientific practice to accelerate discovery and experimentation in diverse domains. Further, one can make the case that human-machine inference methods (Fig. 12) embodied with causal reasoning are necessary to make progress towards the Nobel Turing Challenge [Kitano2021NobelTC] of developing AI scientists capable of autonomously carrying out research to make major scientific discoveries – it is difficult to imagine such an AI scientist without, for example, the ability to intelligently explore-exploit a system to derive its causal structure as in Fig. 16.
The specific directions of work we described are:
- Causality definitions and inference constraints based on simulation trace relationships
- Potential outcomes methods purpose-built for simulation counterfactual reasoning
- Tooling for active interventions on running simulations
- Tooling for autonomous and human-machine active causal discovery
- Game engines and testbeds with infrastructure and programmable interfaces for principled counterfactual (and interventional) reasoning
Another direction, which we propose in the Integrations section later, is probabilistic programming as a formalism for causal reasoning – a natural representation framework that also enables probabilistic causal reasoning. Under the umbrella of advancing probabilistic causal reasoning, and continuing the thread from our causal optimization example above, work on causal inference with Bayesian optimization can offer new, improved approaches for decision-making under uncertainty. Aglietti et al. [Aglietti2020CausalBO] propose an intriguing approach to global causal optimization, and continuing this line of work is an important direction – for example, combining their proposed framework with causal discovery algorithms to remove the dependency on manually encoded graphs while also accounting for uncertainty in the graph structure.
Perhaps the most important contribution of causal reasoning in simulation intelligence systems comes from the fact that correlations may well indicate causal relationships but can never strictly prove them. Reproducing the statistical properties of data, no matter how robustly, is a rather weak form of model validation, since many possible causal structures are consistent with a given set of statistical dependencies. Building simulators with proper representations and tooling for causal inference is critical for validating that simulators faithfully represent the true data-generating processes, and ultimately for the reliability of simulation outcomes. One must consider the ethical implications of downstream actions and decision-making based on such outcomes, from epidemiology and medicine to economics and government policy [lavin2021technology]. Simulation-based approaches can better deal with cause and effect than the purely statistical methods widely used in AI, ML, and data science, yet similar ethical concerns such as model trust and interpretability [Lipton2018TheMO] and algorithmic fairness [Loftus2018CausalRF, Oneto2020FairnessIM, Fazelpour2020AlgorithmicFF] remain crucial.
5. Agent-Based Modeling
In agent-based modeling (ABM) [Holland1991ArtificialAA, Bonabeau2002AgentbasedMM], systems are modeled as collections of autonomous interacting entities (or agents) with encapsulated functionality that operate within a computational world. ABM agents can represent either individual or collective entities (such as organizations or groups). For example, in an artificial economy simulation for quantitatively exploring the consequences of policy decisions, agents could be skilled and unskilled workers at varying class levels in a city environment [Farmer2009TheEN]. Or, in biology, individual bacteria can be modeled as agents from population statistics obtained with flow cytometry [Garca2018StochasticIM].
ABM combines elements of game theory, complex systems, emergence, computational sociology, multi-agent systems, evolutionary programming, Monte Carlo methods, and reinforcement learning. Complex systems theory, for example, can be used to investigate how the interactions between parts create collective behavior within a dynamic system [Mainzer2007ThinkingIC]. And the dynamic interaction between multiple agents or decision-makers, which simultaneously affects the state of the environment, can cause the emergence of specific and novel behavioral patterns, which can be studied with reinforcement learning techniques such as shaping reward structures and developing language between agents. The decision-making processes of autonomous individuals in a boundedly rational environment are often heterogeneous and lend themselves well to ABM as a tool for analysis [Batty2008GenerativeSS]: individuals represented by agents dynamically interact with other agents based on simple rules that give rise to complex behaviors and patterns of displacement.
It is standard that agents can only acquire new data about their world constructively (through interactions), and agents can have uncomputable beliefs about their world that influence their interactions. Uncomputable beliefs can arise from inborn (initially configured) attributes, from communications received from other agents, and/or from the use of nonconstructive methods (e.g., proof by contradiction) to interpret acquired data. These uncomputable beliefs enable agents to make creative leaps, i.e. to come up with new ideas about their world not currently supportable by measurements, observations, or logical extrapolations from existing information [Borrill2011AgentbasedMT].
Formally, a standard framework is partially observable multi-agent Markov Games (MGs) [Sutton2018ReinforcementLA], with a set of agents interacting in a state and action space that evolves over time – at each time step the environment evolves, and each agent receives an observation, executes an action, and receives a reward. Using machine learning to optimize agent behavior, each agent aims to learn a policy that maximizes its expected reward, which depends on the behavioral policies of the other agents and the environment transition dynamics. In a multi-agent economy setting, for example, the reward is modeled as a utility function, and rational economic agents optimize their total discounted utility over time. This describes the individual perspective in ABM, aiming to achieve one's own goals. An additional planner perspective instead seeks to achieve some notion of social welfare or equilibrium for an interacting population, which usually involves intervening on population-level dynamics through policies, institutions, and centralized mechanisms [Dafoe2020OpenPI]. We describe this in the context of a machine-learned social planner that improves societal outcomes in the first example later.
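To make the Markov Game interface concrete, here is a minimal two-agent sketch: at each step the environment evolves, each agent receives a (partial) observation, executes an action, and receives a reward. The payoff matrix and the two hand-written policies are hypothetical stand-ins for learned policies.

```python
# A minimal sketch of the partially observable multi-agent Markov Game
# interface described above. The two-action matrix game and the simple
# policies are hypothetical, for illustration only.

PAYOFF = {  # (a1, a2) -> (r1, r2): a prisoner's-dilemma-style reward structure
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def tit_for_tat(obs):
    return obs or "C"          # cooperate first, then mirror the opponent

def always_defect(obs):
    return "D"

def run_episode(policies, steps=10):
    obs = [None, None]         # each agent observes only the other's last action
    returns = [0, 0]
    for _ in range(steps):
        actions = [policies[i](obs[i]) for i in range(2)]
        rewards = PAYOFF[tuple(actions)]
        for i in range(2):
            returns[i] += rewards[i]
            obs[i] = actions[1 - i]   # partial observation of the joint state
    return returns

returns = run_episode([tit_for_tat, always_defect])
```

In a MARL setting, the hand-written policies above would be replaced by learned ones, each updated to maximize expected return against the other agents' evolving behavior.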
The utility of ABM has proved challenging due to the complexity of realistically modeling human behavior and large environments (despite the ability of ABM to adopt behavioral rules deduced from human experiments [Arthur1991DesigningEA]). Recent advances in deep reinforcement learning (RL) provide an opportunity to overcome these challenges by framing ABM simulations as multi-agent reinforcement learning (MARL) problems. In MARL, agents need to learn together with other learning agents, creating a non-stationary environment [HernandezLeal2017ASO]. The past few years have seen tremendous progress on games such as Go [Silver2016MasteringTG, Schrittwieser2020MasteringAG], StarCraft II [Vinyals2019GrandmasterLI], Dota 2 [Berner2019Dota2W], and two-player poker [Brown2019DeepCR]. These represent two-player zero-sum games, which are convenient testbeds for multi-agent research because of their tractability: the predefined solutions coincide with the Nash equilibrium, the solutions are interchangeable, and they have worst-case guarantees [Neumann1945TheoryOG]. Yet these testbeds limit research to domains that are inherently adversarial, while non-warfare real-world scenarios such as simulated economies imply cooperative objectives and dynamics, where agents learn to communicate, collaborate, and interact with one another. The MARL field is starting to trend towards games of pure common interest, such as Hanabi [Bard2020TheHC] and Overcooked [Carroll2019OnTU], as well as team games [Jaderberg2019HumanlevelPI, Baker2020EmergentTU]. Mixed-motive settings are more aptly connected to real-world settings, for example in alliance politics [Dafoe2020OpenPI] and tax policy [Zheng2020TheAE]; these are important to research since some strategic dynamics arise only in such settings, such as issues of trust, deception, and commitment problems.
The field of cooperative AI investigates a spectrum of these common-vs-conflicting interest dynamics, as well as myriad types of cooperative agents – machine-machine, human-machine, human-human, and more complex constellations. While much AI research to date has focused on improving the individual intelligence of agents and algorithms, cooperative AI aims to improve social intelligence: the ability of groups to effectively cooperate to solve the problems they face. See Dafoe et al. [Dafoe2020OpenPI] for a thorough overview.
Reinforcement learning (including MARL and cooperative AI) provides a useful toolbox for studying and generating the variety of behaviors agents develop. The primary factor that influences the emergence of agent behavior is the reward structure, which can be categorized as cooperative, competitive, or intrinsic [Gronauer2021MultiagentDR]. Intrinsic motivation implies the agent maximizes an internal reinforcement signal (intrinsic reward) by actively discovering novel or surprising patterns, a concept which resembles developmental learning [Piaget1971TheOO, Gopnik1999TheSI], where organisms have to spontaneously explore their environment and acquire new skills [Barto2013IntrinsicMA, Aubret2019ASO]. In addition to the various reward strategies, the development of language corpora and communication skills of autonomous agents is significant in emerging behaviors and patterns. Gronauer et al. [Gronauer2021MultiagentDR] categorize this communication as referential or dialogue-based: the former describes cooperative games where a speaking agent communicates an objective via messages to another, listening agent, and the latter includes negotiations, question-answering, and other back-and-forth dialogues between agents. Multi-agent communication can in general improve task performance and cumulative reward, but quantifying communication remains an open question – numerical, task-specific performance measurements provide evidence but do not give insights about the communication abilities learned by the agents.
It is natural for ABM testbeds to be implemented in game environments; while the AI Economist uses a simple two-dimensional grid environment, the Dota 2 and StarCraft II deep RL examples utilize the full video games. More generally, game engines such as Unity [Juliani2018UnityAG] provide powerful simulation environments for MARL and ABM problems broadly. Minecraft, for instance, has been a testbed for RL and human-AI interaction [Shah2021TheMB], and the “CausalCity” [McDuff2021CausalCityCS] simulation environment we discussed earlier is built with the Unreal game engine (leveraging the AirSim [Shah2017AirSimHV] package for autonomous vehicle simulations). Game and physics engines are increasingly useful components of the AI toolbox and the SI ecosystem.
Examples
The ABM toolbox is applicable to the study of complex interactive processes in all scientific disciplines. ABM is particularly interesting for simulating and studying systems in the social sciences and economics, given the relative ease of mapping agents to recognizable social entities, the natural hierarchical self-organization in social systems, and the interesting results in emergent behavior – applications range from urban traffic simulation [Balmer2008AgentbasedSO] to humanitarian assistance [Crooks2013GISAA] to emergency evacuations [Schoenharl2006WIPERAM]. Complex emergent economic behavior has previously been studied in economics through agent-based modeling [Bonabeau2002AgentbasedMM, Garrido2013AnAB], but this has largely proceeded without the benefit of recent advances in AI. We illustrate AI-infused ABM towards societal intelligence with a couple of examples below, followed by additional domains in the life sciences.
As with the other motifs, we advocate for probabilistic approaches to ABM, to properly calibrate models, assimilate observational data into models, and quantify uncertainties in results. For instance, we may make an agent's behavior stochastic in order to express our uncertainty in that behavior, and we may implement Monte Carlo methods to compute a posterior over behaviors rather than single point predictions. In some cases we may be able to build a surrogate model of the ABM simulation (i.e., statistical emulation) for efficient execution and uncertainty quantification [Angione2020UsingML, zhang2020modern].
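As a minimal illustration of this probabilistic treatment, the sketch below makes each agent's rule stochastic and runs Monte Carlo replicates to obtain a distribution over a system-level outcome rather than a single point prediction; the adoption-dynamics model and its parameters are hypothetical.

```python
import random
import statistics

# Stochastic agent rule + Monte Carlo replicates: instead of one deterministic
# run, we obtain a distribution over the macro outcome, expressing our
# uncertainty in agent behavior. The model and parameters are hypothetical.

def simulate_adoption(n_agents=100, steps=20, p_spontaneous=0.01, p_peer=0.05,
                      rng=None):
    rng = rng or random.Random()
    adopted = [False] * n_agents
    for _ in range(steps):
        frac = sum(adopted) / n_agents
        for i in range(n_agents):
            if not adopted[i]:
                # stochastic rule: adopt spontaneously or via peer influence
                if rng.random() < p_spontaneous + p_peer * frac:
                    adopted[i] = True
    return sum(adopted) / n_agents

# Monte Carlo over replicates yields an outcome distribution, not a point.
samples = [simulate_adoption(rng=random.Random(seed)) for seed in range(200)]
mean, spread = statistics.mean(samples), statistics.stdev(samples)
```

The spread across replicates is exactly the quantity a single deterministic ABM run hides; a surrogate model (statistical emulation) would aim to reproduce this distribution at far lower cost.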
Optimizing economic policies
The “AI Economist” [Zheng2020TheAE] aims to promote social welfare through the design of optimal tax policies in dynamic economies, using ABM and deep RL. It uses a principled economic simulation with both workers and a policy maker (i.e., individual agents and a planner, respectively), all of whom are collectively learning in a dynamic inner-outer reinforcement learning loop (see Fig. 17). In the inner loop, RL agents gain experience by performing labor, receiving income, and paying taxes, and learn through trial-and-error how to adapt their behavior to maximize their utility. In the outer loop, the social planner adapts its tax policy to optimize the social objective – a social welfare function from economic theory that considers the tradeoff between income equality and productivity. The AI Economist framework is designed in a way that allows the social planner to optimize taxes for any social objective, which can be user-defined or, in theory, learned. Another key benefit is that the planner in this ABM setup does not need prior world knowledge, prior economic knowledge, or assumptions about the behavior of economic agents.
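The inner-outer structure can be sketched in a few lines; to be clear, the toy below is not the AI Economist itself: the utility, welfare function, and update rules are simplified, hypothetical stand-ins for the paper's RL machinery.

```python
# Toy sketch of an inner-outer loop (not the actual AI Economist): in the
# inner loop agents adapt labor to maximize after-tax utility; in the outer
# loop a planner picks a flat tax rate to maximize a social objective trading
# off equality and productivity. All functional forms are hypothetical.

def inner_loop(skills, tax, lr=0.1, iters=200):
    labor = [1.0] * len(skills)
    for _ in range(iters):
        for i, s in enumerate(skills):
            # utility = after-tax income minus quadratic effort cost;
            # simple gradient ascent stands in for the agents' RL updates
            grad = (1 - tax) * s - labor[i]
            labor[i] += lr * grad
    return [s * l for s, l in zip(skills, labor)]   # pre-tax incomes

def social_welfare(incomes, tax):
    productivity = sum(incomes)
    # flat tax collected and redistributed equally across agents
    redistributed = [(1 - tax) * y + tax * productivity / len(incomes)
                     for y in incomes]
    equality = min(redistributed) / max(redistributed)
    return equality * productivity   # equality-productivity tradeoff

def outer_loop(skills, candidates=(0.0, 0.25, 0.5, 0.75)):
    return max(candidates, key=lambda t: social_welfare(inner_loop(skills, t), t))

best_tax = outer_loop([1.0, 2.0, 4.0])
```

Even in this toy, the two-level structure is visible: the planner's choice changes the incentives the agents optimize against, so the outer objective can only be evaluated after the inner loop re-equilibrates.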
The data-driven AI Economist simulations are validated by matching the learned behavior to results known from economic theory, for example agent specialization. Specialization emerges because differently skilled workers learn to balance their income and effort. In some scenarios, agents learn tax-avoidance behaviors, modulating their incomes across tax periods, yet the dynamic tax schedule generated by the planner performs well despite this kind of strategic behavior. This exemplifies the rich, realistic dynamics and emergent behavior of a relatively simplified economic simulation. The validations also build confidence that agents respond to economic drivers, and demonstrate the value of a data-driven dynamic tax policy versus prior practice.
The AI Economist provides a powerful framework for integrating economic theory with data-driven simulation, producing transparent and objective insights into the economic consequences of different tax policies. Yet there are limitations: the simulations do not yet model human behavioral factors and interactions between people, including social considerations, and they consider a relatively small economy. It would also be interesting to extend the framework with counterfactual reasoning for what-if experimentation – this is a key value-add of AI-driven simulators in general, which we detail in the causality motif.
Simulating networked acts of violence
Particularly intriguing use cases for new ABM approaches are seemingly random and chaotic human events, such as insurrections and other acts of violence. Political revolutions, for example, are not understood nearly well enough to predict: similar patterns lead to heterogeneous scenarios, the cascading effects of actions (and their timing) carry high uncertainty with expensive consequences, the origins of emergent behaviors are often unknown and thus unidentifiable in the future, and it is impossible to know which episodes lead to local versus mass mobilization [Moro2016UnderstandingTD]. Other ABM concerns are especially relevant in these applications: data limitations (particularly challenges with multimodal, sparse, noisy, and biased data), as well as challenges in validating simulations (and, further, trust).
Recent work applying ABM methods to violence events, such as modeling the insurrection events of the Arab Spring [Moro2016UnderstandingTD], can yield interesting outcomes, but nonetheless suffers from critical roadblocks: existing approaches use simplifying assumptions without real-world validation, cannot incorporate real data (historical or new), and are not necessarily interpretable or probeable. Another example is the agent-based insurgency model of Overbey et al. [Overbey2013ALS], which simulates hundreds of thousands of agents that can exhibit a variety of cognitive and behavioral states and actions, in an environment loosely based on the demographic and geophysical characteristics of Ramadi, Iraq in the early 2000s. Large-scale parallel compute with GPUs was critical to produce simulations of realistic insurgent and counterinsurgent activities. Yet more acceleration and scaling would be needed for validation, including generating a posterior of potential outcomes for quantifying uncertainty and reasoning about counterfactual scenarios. These two use cases exemplify the bottleneck ABM faces in geopolitical settings without newer AI-driven simulation techniques.
One important development would be multiscale formalisms that are amenable to reinforcement learning and machine learning in general. A nice example in this direction is the hierarchical multi-agent, multiscale setup of Bae & Moon [Bae2016LDEFFF], which was effective in simulating evacuation through a city's road networks following bombing attacks, as well as ambulance effectiveness in transporting victims. For overcoming compute bottlenecks, existing HPC systems can be leveraged, but NN-based surrogates would be needed for acceleration and additional emulation uses (as described in the emulation motif earlier).
Existing works in geopolitical ABM rely on mechanistic descriptions of human behavior rooted solely in theory [Epstein2002ModelingCV], and recent work with deep RL uses simplistic board-game environments [Paquette2019NoPD]. For reliable simulations it is important to utilize real-world data in a way that complements and validates the theory. In the applications here, we would implement a semi-mechanistic modeling approach that coordinates mechanistic equations (such as information-theoretic models of disinformation [Kopp2018InformationtheoreticMO]) with data-driven function approximations. Further, building this ABM framework in a generative modeling (and more specifically probabilistic programming) way would be essential for simulating the multitudes of outcomes and their full posteriors. The probabilistic programming approach would also behoove counterfactual reasoning, i.e., quantifying the potential outcomes due to various courses of action, as discussed in both the probabilistic programming and causality motifs.
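A minimal sketch of such a semi-mechanistic model, on a hypothetical system: a mechanistic logistic-growth term is combined with a data-driven correction whose single coefficient is fit against observations (a stand-in for training a richer function approximator).

```python
# Semi-mechanistic modeling sketch: a known mechanistic component (logistic
# growth) is combined with a data-driven residual term whose coefficient is
# fit to observations. System, data, and basis function are hypothetical.

def simulate(r, k, beta, x0=0.1, steps=50, dt=0.1):
    xs, x = [], x0
    for _ in range(steps):
        mech = r * x * (1 - x / k)       # mechanistic term: logistic growth
        learned = beta * x * x           # data-driven residual correction
        x += dt * (mech + learned)       # forward Euler integration
        xs.append(x)
    return xs

# Hypothetical "observations" generated from the true system (beta = -0.3)
obs = simulate(r=1.0, k=1.0, beta=-0.3)

# Fit beta by one-dimensional grid search against the data (a stand-in for
# gradient-based training of a richer data-driven component).
def loss(beta):
    sim = simulate(r=1.0, k=1.0, beta=beta)
    return sum((a - b) ** 2 for a, b in zip(sim, obs))

beta_hat = min((b / 100 for b in range(-100, 101)), key=loss)
```

The mechanistic term encodes the theory we trust; the fitted residual absorbs dynamics the theory misses, which is the division of labor the semi-mechanistic approach relies on.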
Multiscale systems biology
Currently, a main challenge in molecular modeling is the complex relationship between spatial resolution and time scale, where increasing one limits the other. There are macroscopic methods that make simplifying assumptions to facilitate modeling of the target molecular system (such as ODEs and PDEs operating at the population level), and microscopic methods that are much finer in spatial resolution but highly computationally expensive at temporal scales of interest (such as Brownian and molecular dynamics methods). The reader may recall a similar spatial-temporal cascade of simulation methods in the motif on multiscale modeling (Fig. 5).
To highlight this cascade and how ABM can help overcome existing spatial-temporal systems biology limitations, consider several of the potentially many use cases for modeling mRNA [Soheilypour2018AgentBasedMI]:
- The nuclear export of mRNA transcripts requires a millisecond time scale, which has been explored via ABM [Azimi2014AnAM], while high-resolution techniques such as Brownian or molecular dynamics are only tractable at nano- and microsecond scales, respectively. Further, these methods can compute 10s to 100s of molecules, whereas ABM can handle 1000s.
- Cellular processes are often defined by spatial details and constrained environments – e.g., RNA-binding proteins can travel between the nucleus and the cytoplasm (depending on their binding partners), while others, like the nucleo- and cytoskeleton linker, reside within the nuclear envelope. These spatial characteristics can be easily incorporated in ABM as prior knowledge that improves modeling (Fig. 19).
- Tracking of individual particles over time is valuable in the study of molecular systems, such as mRNA export. ABM is well-suited for particle tracking – for instance, recall the virus-propagation example in Fig. 15 of the causality motif – and can complement in vivo particle tracking advances [Katz2016MappingT] towards cyber-physical capabilities.
- The concept of emergence in ABM is valuable here as well – for instance, Soheilypour & Mofrad [Soheilypour2016RegulationOR] use ABM to show that several low-level characteristics of mRNA (such as quality control mechanisms) directly influence the overall retention of molecules inside the nucleus. ABM, as a complex systems approach, has the ability to predict how a molecular system behaves given the rules that govern the behavior of individual molecules.
Another illuminating example is multiscale ABM for individual cell signalling, cell population behavior, and the extracellular environment. “PhysiBoSS” is an example framework for exploring the effects of environmental and genetic alterations of individual cells at the population level, bridging the critical gap from single-cell genotype to single-cell phenotype and emergent multicellular behavior [Letort2019PhysiBoSSAM]. The ability of ABM to encompass multiple scales of biological processes as well as spatial relationships, expressible in an intuitive modeling paradigm, suggests that ABM is well suited for molecular modeling and translational systems biology broadly.
Generating agent-based model programs from scratch
In recent work, a combination of agent-based modeling and program synthesis via genetic programming was used to develop a codified ruleset for the behavior of individual agents, with the loss function dependent on the behaviors of agents within the system as a whole [Greig2021GeneratingAM]. Genetic programming is the process of evolving computer programs from scratch or from initial conditions: a set of primitives and operators, often in a domain-specific language. In program synthesis, the desired program behavior is specified but the (complete) implementation is not: the synthesis tool determines how to produce an executable implementation. Similarly, the behavior or any partial implementation can be ambiguously specified, and iteration with the programmer, data-driven techniques, or both is used to construct a valid solution. Together, program synthesis via genetic programming allows nontrivial programs to be generated from example data. ABM is an intriguing area in which to apply this methodology, where emergence and individual-to-macro complexity are main characteristics to study and exploit. Focusing on two use cases – flocking behavior and opinion dynamics (i.e., how opinions on a topic spread amongst a population of agents) – the researchers were able to generate rules for each agent within the overall system and then evolve them to fit the overall desired behavior of the system. This kind of approach is early but has implications for large systems, particularly in sociological or ecological modeling. This and related approaches are emblematic of the artificial life (ALife) field, which we discuss more in the open-ended optimization motif next.
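As a rough sketch of this idea, the toy below evolves the parameter of a fixed-form per-agent rule (a simplification of full genetic programming over program primitives) so that the emergent, system-level opinion spread matches a target; the rule, fitness, and all settings are hypothetical.

```python
import random
import statistics

# Evolving an agent rule against a macro objective: fitness is measured on the
# emergent system-level outcome, not on individual agents. A fixed-form rule
# evolved by a simple genetic algorithm stands in for genetic programming
# over a DSL; everything here is hypothetical, for illustration only.

def run_abm(threshold, n=50, steps=30, rng=None):
    rng = rng or random.Random(1)
    opinions = [rng.random() for _ in range(n)]
    for _ in range(steps):
        mean = sum(opinions) / n
        # per-agent rule: move halfway toward the mean when close enough to it
        opinions = [o + 0.5 * (mean - o) if abs(mean - o) < threshold else o
                    for o in opinions]
    return statistics.pstdev(opinions)   # macro outcome: opinion spread

TARGET_SPREAD = 0.05   # desired system-level behavior (near-consensus)

def fitness(threshold):
    return -abs(run_abm(threshold) - TARGET_SPREAD)

def evolve(pop_size=20, gens=15, rng=None):
    rng = rng or random.Random(0)
    pop = [rng.random() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection
        children = [min(1.0, max(0.0, rng.choice(parents) + rng.gauss(0, 0.1)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children                 # mutation-only reproduction
    return max(pop, key=fitness)

best = evolve()
```

The key inversion to notice: the loss never inspects an individual agent's rule directly, only the population-level behavior it gives rise to, which is what makes the approach attractive for studying emergence.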
Future directions
The previous examples elucidate a central feature of ABM that should motivate its application in diverse domains: the ability to generate system-level behaviors that could not have been reasonably inferred from, and often may be counterintuitive to, the underlying rules of the agents and environments alone. The implication is that there can be simulation-based discovery of emergent phenomena in many systems at all scales, from particle physics and quantum chemistry, to energy and transportation systems, to economics and healthcare, to climate and cosmology.
There are of course contexts in which the use of ABM warrants thorough ethical consideration, for instance ABM of social systems. In some applications, decision-making and social norms are modeled and inferred (e.g., of functional groups such as households, religious congregations, co-ops, and local governments [BealCohen2021IntragroupDI]), which may unintentionally introduce elements of bias or ignorance when bridging individual-level assumptions and population-level dynamics [Hammond2015ConsiderationsAB]. There should be awareness and explicit systems in place to monitor these concerns in the use of ABM and in downstream tasks, because the resulting predictions can inform the design or evaluation of interventions, including policy choices.
Broadly in ABM, domain-specific approaches and narrow solutions are the norm. The lack of generalizable methods and unifying frameworks for comparing methods poses a significant bottleneck to research progress. Even more, this limits the utility of ABM for the many researchers interested in interdisciplinary simulations. A basic infrastructure for ABM is needed, with formalisms built on ABM logic and deep RL techniques (to integrate with state-of-the-art machine learning), supporting the generic design of simulation engines working with multimodal data. Building atop differentiable and probabilistic programming (the engine motifs discussed next) would behoove such an ABM framework. This is also a hardware problem: in line with the layer-by-layer integrated nature of simulation and AI computing that we describe throughout this paper, existing ABM simulation platforms for the most part cannot take advantage of modern hardware, nor are they scalable to high-performance computing (HPC) deployments (which we discuss later in this work).
The AI Economist and similar works are moving towards an integration of traditional ABM methodology with newer deep RL approaches. In addition to powerful results, this can be a fruitful paradigm for studying language and communication between agents in many environments and settings – as mentioned earlier, the development of metrics and quantitative tools for studying such communication (and its emergent behaviors) is essential. It will likely be necessary to utilize the machinery of causal inference and counterfactual reasoning, for studying various communication types as well as motivations (e.g. [Jaques2019SocialIA]). It will also be critical to build methods in a hierarchical way for multiscale ABM, to help overcome the burden of simplifying assumptions and approximations about individuals and groups of agents – notice the close similarity in challenges and methods described in the multiscale multiphysics motif.
The Engine
6. Probabilistic Programming
The probabilistic programming (PP) paradigm equates probabilistic generative models with executable programs [van2018introduction]. Probabilistic programming languages (PPLs) enable practitioners to leverage the power of programming languages to create rich and complex models, while relying on a built-in inference backend to operate on any model written in the language. This decoupling is a powerful abstraction because it allows practitioners to rapidly iterate over models and experiments, from prototyping through deployment. The programs are generative models: they relate unobservable causes to observable data, to simulate how we believe data is created in the real world (Fig. 20).
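The paradigm can be sketched with a toy probabilistic program in plain Python: a generative model relates an unobservable cause (a coin's bias) to observable data (flips), and a generic, model-agnostic inference routine (here, simple rejection sampling) recovers the latent from observations. A real PPL supplies this decoupled inference backend for arbitrary models; everything below is a hypothetical illustration.

```python
import random

# A toy "probabilistic program": the generative model relates an unobservable
# cause (a coin's bias) to observable data (flips). A generic inference
# routine, here rejection sampling, runs on the same program. This stands in
# for a real PPL's decoupled model/inference machinery; all details are
# hypothetical, for illustration only.

def model(rng):
    bias = rng.random()                       # latent cause: prior Uniform(0, 1)
    flips = [rng.random() < bias for _ in range(10)]
    return bias, flips

def infer(observed_heads, n_samples=50000, rng=None):
    rng = rng or random.Random(0)
    accepted = []
    for _ in range(n_samples):
        bias, flips = model(rng)              # run the program forward
        if sum(flips) == observed_heads:      # condition on the observation
            accepted.append(bias)
    return sum(accepted) / len(accepted)      # posterior mean of the latent

posterior_mean = infer(observed_heads=8)
```

Note that `infer` knows nothing about coins: it only runs the program forward and conditions on the observation, which is the decoupling of model from inference that makes the paradigm powerful (here the posterior mean lands near the Beta(9, 3) value of 0.75).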
PPLs pack additional advantages for scientific ML, mainly by design allowing experts to incorporate domain knowledge into models and to export knowledge from learned models. First, PPLs are often high-level programming languages in which scientists can readily encode domain knowledge as modeling structure – there is a direct correspondence between probability theory and coding machinery that is obfuscated in standard programming paradigms. Second, since models are encoded as programs, program structure gives us a structured framework within which to reason about the properties and relationships involved in a domain before inference is even attempted [Pfeffer2018StructuredFI]. Crucially, in PPLs we get uncertainty reasoning as a standard feature, allowing us to express and work with aleatoric (irreducible uncertainty due to stochasticity such as observation and process noise) and epistemic (subjective uncertainty due to limited knowledge) uncertainty measures. The main drawback of probabilistic programming is computational cost, which is especially pronounced in the case of unrestricted (universal, or Turing-complete) PPLs [Gordon2014ProbabilisticP] that can operate on arbitrary programs [Goodman2008ChurchAL, Pfeffer2009FigaroA, Mansinghka2014VentureAH, WoodAISTATS2014, GoodmanStuhlm14]. Another class of PPLs is more restrictive and thus computationally efficient, where constraining the set of expressible models ensures that particular inference algorithms can be applied efficiently [Lunn2009TheBP, winn2009probabilistic, Milch2005BLOGPM, Tran2016EdwardAL]. Note that extending an existing Turing-complete programming language with operations for sampling and conditioning results in a universal PPL [Gordon2014ProbabilisticP, Munk2019DeepPS].
Although probabilistic programming is a relatively nascent field at the intersection of AI, statistics, and programming languages, it has delivered successful applications across the biological and physical sciences. For example, inferring physics from high-energy particle collisions [Baydin2019EtalumisBP], pharmacometric modeling and drug development [Weber2018BayesianAO], inferring cell signalling pathways [Merrell2020InferringSP], predicting neurodegeneration [Lavin2020NeurosymbolicND], optimal design of scientific experiments [Ouyang2016PracticalOE], modeling epidemiological disease spread [Wood2020PlanningAI, Witt2020SimulationBasedIF], untangling dynamics of microbial communities [Ranjeva2019UntanglingTD], wildfire spread prediction [Joseph2019SpatiotemporalPO], spacecraft collision avoidance [Acciarini2020SpacecraftCR], and more. We next discuss several key mechanisms for connecting probabilistic programming and scientific simulation across these and other domains.
Probabilistic programs as simulators
First, a probabilistic program is itself a simulator, as it expresses a stochastic generative model of data; that is, given a system or process sufficiently described in a PPL, the forward execution of the program produces simulated data. Typically one runs a simulator forward to predict the future, yet with a probabilistic program we can additionally infer the parameters given the outcomes that are observed. For example, the neurodegeneration programs in [Lavin2020NeurosymbolicND] define a statistical model with random samplings, from which we can generate probabilistic disease trajectories and infer parameters for individual biomarkers. Another advantage of simulating with probabilistic programs is semi-mechanistic modeling, where Turing-complete PPLs provide flexible modeling for integrating physical laws or mechanistic equations describing a system with data-driven components that are conditioned or trained on data observations – the PPL Turing.jl [Ge2018TuringAL], for example, illustrates this with ordinary differential equations. Similarly, some PPLs enable semi-parametric modeling [Dalibard2017BOATBA, Lavin2018DoublyBO], where the inference engine can handle programs that combine parametric and nonparametric models: the parametric component learns a fixed number of parameters and allows the user to specify domain knowledge, while a nonparametric model (e.g., a Gaussian process) learns an unbounded number of parameters that grows with the training data. These two programming features – semi-mechanistic and semi-parametric modeling – along with the overall expressiveness of PPLs make them ideal for non-ML specialists to readily build and experiment with probabilistic reasoning algorithms. Utilizing probabilistic programs as simulators plays an important role in the simulation-based inference motif.
Compiling simulators to neural networks
Second, we have inference compilation [Le2017InferenceCA, Harvey2019AttentionFI], a method for using deep neural networks to amortize the cost of inference in universal PPL models. The transformation of a PPL inference problem into an amortized inference problem using neural networks is referred to as “compilation”, by analogy with compiling computer code between two programming languages. In the context of simulation, one use for inference compilation is as a preprocessing step for probabilistic programming algorithms, where initial runs of the PPL-based simulator are used to train a neural network that constructs proposal distributions for all the latent random variables in the program based on the observed variables – this represents one of the simulation-based inference workflows in Fig. 10. Another main use for inference compilation is to construct probabilistic surrogate networks (PSN): any existing stochastic simulator, written in any programming language, can be turned into a probabilistic program by adding a small amount of annotation to the source code (for recording and controlling simulator execution traces at the points of stochasticity). Munk et al. [Munk2019DeepPS] demonstrate this on a nontrivial use case with composite materials simulation, yet more testing with additional simulators is needed to validate the generalizability claims and robustness. PSN and other extensions to inference compilation [Casado2017ImprovementsTI] help unify probabilistic programming with existing, non-differentiable scientific simulators in the third and perhaps most powerful mechanism, simulator inversion.
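The amortization idea can be caricatured without any neural network machinery: run the simulator from the prior to collect (latent, observation) pairs, fit a cheap mapping from observation to proposal parameters – here a linear-Gaussian regression standing in for the inference-compilation network – then reuse that “compiled” proposal for importance sampling at inference time. The toy simulator and all function names below are assumptions for illustration:

```python
import math, random

SIGMA = 0.5  # observation noise of the toy simulator (invented)

def simulate(rng):
    """One prior run of the 'simulator': latent z, then observation x."""
    z = rng.gauss(0.0, 1.0)
    x = rng.gauss(z, SIGMA)
    return z, x

def normal_logpdf(v, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (v - mu) ** 2 / (2 * sd * sd)

def compile_proposal(n=50000, seed=1):
    """'Compilation': regress proposal parameters q(z|x) = N(a*x + b, s)
    from prior simulator runs (a linear stand-in for the neural network)."""
    rng = random.Random(seed)
    pairs = [simulate(rng) for _ in range(n)]
    mx = sum(x for _, x in pairs) / n
    mz = sum(z for z, _ in pairs) / n
    cov = sum((x - mx) * (z - mz) for z, x in pairs) / n
    var = sum((x - mx) ** 2 for _, x in pairs) / n
    a = cov / var
    b = mz - a * mx
    s = math.sqrt(sum((z - (a * x + b)) ** 2 for z, x in pairs) / n)
    return a, b, s

def amortized_posterior_mean(x_obs, proposal, n=5000, seed=2):
    """Importance sampling with the compiled proposal as the guide."""
    a, b, s = proposal
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        z = rng.gauss(a * x_obs + b, s)
        logw = (normal_logpdf(z, 0.0, 1.0)             # prior
                + normal_logpdf(x_obs, z, SIGMA)       # likelihood
                - normal_logpdf(z, a * x_obs + b, s))  # proposal
        w = math.exp(logw)
        num += w * z
        den += w
    return num / den
```

The upfront regression cost is paid once; afterwards, inference for any new observation reuses the proposal, which is the essence of amortization. For this linear-Gaussian toy case the fitted slope approaches the exact posterior coefficient 0.8.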
Simulator inversion
The Etalumis project (“simulate” spelled backwards) uses probabilistic programming methods to invert existing, large-scale simulators via Bayesian inference [Baydin2019EtalumisBP], requiring minimal modification of a given simulator’s source code. Many simulators model the forward evolution of a system (coinciding with the arrow of time), such as the interaction of elementary particles, diffusion of gases, folding of proteins, or the evolution of the universe at the largest scale. Inference then refers to finding the initial conditions or global parameters of such systems that can lead to some observed data representing the final outcome of a simulation. In probabilistic programming, this inference task is performed by defining prior distributions over any latent quantities of interest, and obtaining posterior distributions over these latent quantities conditioned on observed outcomes (for example, experimental data) using Bayes’ rule. This process, in effect, corresponds to inverting the simulator so that we go from the outcomes back to the inputs that caused them. A special protocol, called the probabilistic programming execution protocol (PPX) (Fig. 22), is used to interface PPL inference engines with existing stochastic simulator code bases in two main ways:

Record execution traces of the simulator as a sequence of sampling and conditioning operations in the PPL. The execution traces can be used for inspecting the probabilistic model implemented by the simulator, computing likelihoods, learning surrogate models, and generating training data for inference compilation NNs – see Fig. 22.

Control the simulator execution with the PPL by “hijacking” the simulator’s random number generator. More precisely, general-purpose PPL inference guides the simulator by making random number draws not from the prior but from proposal distributions that depend on observed data [GramHansen2019HijackingMS] – see Fig. 23.
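The two modes above can be sketched in a few lines: a stand-in “legacy” simulator routes all of its stochasticity through an injected random number generator; a recording RNG draws from the prior and logs the execution trace, while a controlling RNG redirects draws to proposal distributions and accumulates the importance-weight correction. The class and function names are invented for illustration and do not correspond to the actual PPX API:

```python
import math, random

def simulator(rng):
    """A stand-in 'legacy' simulator: all stochasticity goes through rng.gauss."""
    z = rng.gauss(0.0, 1.0)           # latent initial condition (prior N(0, 1))
    x = z ** 2 + rng.gauss(0.0, 0.1)  # observable outcome
    return z, x

def _logpdf(v, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (v - mu) ** 2 / (2 * sd * sd)

class RecordingRNG:
    """Mode 1: draw from the prior and record the execution trace."""
    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.trace = []               # list of (mu, sd, value) sample records
    def gauss(self, mu, sd):
        v = self._rng.gauss(mu, sd)
        self.trace.append((mu, sd, v))
        return v

class ControllingRNG:
    """Mode 2: 'hijack' selected draws with proposal distributions and
    accumulate the importance-weight correction log p(v) - log q(v)."""
    def __init__(self, seed, proposals):
        self._rng = random.Random(seed)
        self._proposals = proposals   # draw index -> proposal (mu, sd)
        self._i = 0
        self.log_weight = 0.0
    def gauss(self, mu, sd):
        qmu, qsd = self._proposals.get(self._i, (mu, sd))
        v = self._rng.gauss(qmu, qsd)
        self.log_weight += _logpdf(v, mu, sd) - _logpdf(v, qmu, qsd)
        self._i += 1
        return v
```

A recorded run exposes the simulator’s latent structure as a trace; a controlled run with, say, `proposals={0: (1.5, 0.3)}` steers the latent draw toward 1.5 while `log_weight` keeps the inference mathematically correct.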
These methods have been demonstrated on massive, legacy simulator code bases, and scaled to high-performance computing (HPC) platforms for handling multi-TB data and supercomputer-scale distributed training and inference. The ability to scale probabilistic inference to large-scale simulators is of fundamental importance to the field of probabilistic programming and the wider scientific community.
Examples
Compilation and inversion of a particle physics simulator
The Standard Model of particle physics has a number of parameters (e.g., particle masses) describing the way particles and fundamental forces act in the universe. In experiments such as those at the Large Hadron Collider (described in the simulation-based inference motif earlier), scientists generate and observe events resulting in cascades of particles interacting with very large instruments such as the ATLAS and CMS detectors. The Standard Model is generative in nature, and can be seen as describing the probability space of all events given the initial conditions and model parameters. Baydin et al. [Baydin2019EtalumisBP, baydin2019efficient] developed the Etalumis approach originally to work with the state-of-the-art Sherpa simulator [osti_1562545], an implementation of the Standard Model in C++. Based on the PPX protocol, they worked with this very large-scale simulator, involving approximately one million lines of C++ code and more than 25k latent variables, and performed the first instance of Bayesian inference in this setting, demonstrating the approach on the decay of the tau lepton. Due to the scale of this simulator, the work made use of the Cori supercomputer at Lawrence Berkeley National Lab, mainly for distributed data generation and training of inference compilation networks for inference in Sherpa.
Probabilistic, efficient, interpretable simulation of epidemics
Epidemiology simulations have become a fundamental tool in the fight against the epidemics and pandemics of various infectious diseases like AIDS, malaria, and COVID-19 [schroederdewitt2020simulation]. However, these simulators have been mostly limited to forward sampling of epidemiology models with different sets of hand-picked inputs, without much attention given to the problem of inference. Moreover, inefficient runtimes limit their usability for iterating over large spaces of scenarios and policy decisions. Motivated by the COVID-19 pandemic, Wood et al. [wood2020planning] demonstrated how parts of the infectious disease control and policymaking processes can be automated by formulating these as inference problems in epidemiological models. Using the Etalumis approach, Gram-Hansen et al. [GramHansen2019HijackingMS] developed a probabilistic programming system to turn existing population-based epidemiology simulators into probabilistic generative models (Fig. 23). Applied to two state-of-the-art malaria simulators (namely EMOD [Bershteyn2018ImplementationAA] and OpenMalaria [Smith2008TowardsAC]), the system provides interpretable introspection, full Bayesian inference, uncertainty reasoning, and efficient, amortized inference. This is accomplished by “hijacking” the legacy simulator codebases, overriding their internal random number generators: by replacing the existing low-level random number generator in a simulator with a call to a purpose-built universal PPL “controller”, the PPL can track and manipulate the stochasticity of the simulator as it runs. The PPL controller can then perform several novel tasks with the hijacked codebase: running inference by conditioning on the values of certain data samples and manipulating others, uncovering stochastic structure in the underlying simulation code, and using the execution traces from the PPL backend to automatically produce result summaries.
The resulting execution traces can potentially be used for causal chain discovery (for instance, deriving the infection spread from host to host) as described in the causality motif.
The probabilistic programming advantage for generating molecules
As we’ve covered in several motifs and integrations in this paper, designing and discovering new molecules is key for addressing some of humanity’s most pressing problems, from developing new medicines, to providing agrochemicals in the face of shifting populations and environments, to discovering new materials to tackle climate change. The molecular discovery process usually proceeds in design-make-test-analyze cycles (shown in Fig. 33). ML can help accelerate this process (ultimately reducing the number of expensive cycles) both by broadening the search and by searching more intelligently. That respectively amounts to learning strong generative models of molecules that can be used to sample novel molecules for downstream screening and scoring tasks, and finding molecules that optimize properties of interest (e.g. binding affinity, solubility, non-toxicity, etc.) [Bradshaw2020BarkingUT]. We’ve suggested there are gaps in this process from lack of information sharing across phases, particularly the need for communication between the molecule design / lead generation phase and the synthesis phase – see for instance Fig. 37. One possibility for better linking the design and make steps in ML-driven molecular generation is to explicitly include synthesis instructions in the design step. To this end, Bradshaw et al. [Bradshaw2020BarkingUT] present a framework for multi-step molecular synthesis routes: directed acyclic graphs (DAGs) represent the synthetic routes and are arranged in a hierarchical manner, where a novel neural message passing procedure exchanges information among multiple levels, enabling gradient-based optimization of the whole system via differentiable programming.
Intuitively this works as follows: the model aims to generate “recipes” (i.e., synthesis trees or synthesis graphs as DAGs) which start from a small number of building blocks (e.g., easily purchasable compounds) that can then be combined, possibly iteratively, to produce progressively more complex molecules. We assume that a “reaction outcome predictor” exists, which can be anything that takes in a set of reactants and outputs a distribution over products. Given a dataset of potential products and their synthetic routes, this can then be fit as a structured deep generative model, which defines a distribution over molecules by first selecting a handful of easily obtainable compounds as starting points, then iteratively combining them. This generative model can then be used for Bayesian optimization of molecules (or perhaps open-ended optimization, in SI future work), in a way that regularizes the “suggested” molecules to only those that have a high probability of being synthesized reliably in a lab. We discuss similar pipelines and methods in the Integrations section later. The overall structure of the probabilistic model is rather complex, as it depends on a series of branching conditions. It is thus advantageous to frame the entire generative procedure as a probabilistic program – notice our location at the intersection of DP and PP in Fig. 25. Such a probabilistic program defines a distribution over DAG serializations: running the program forward samples from the generative process, and the same program can just as easily evaluate the probability of a DAG of interest by instead accumulating the log probability at each distribution encountered during execution. Inspecting the PPL execution traces may provide deeper levels of interpretability for the molecule generation process.
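The run-forward-to-sample, replay-to-score duality can be sketched as a tiny probabilistic program over toy synthesis routes (degenerate chain-shaped “DAGs”, for brevity): the same program either makes random choices or replays a recorded choice sequence while accumulating its log-probability. The building blocks, stop probability, and combination rule are invented for illustration:

```python
import math, random

BLOCKS = ["A", "B", "C"]   # illustrative purchasable building blocks
P_STOP = 0.5               # chance of stopping after each reaction step

class Sampler:
    """Forward mode: make random choices and remember them."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.choices = []
    def choose(self, options):
        o = self.rng.choice(options)
        self.choices.append(o)
        return o
    def flip(self, p):
        b = self.rng.random() < p
        self.choices.append(b)
        return b

class Scorer:
    """Replay mode: follow recorded choices, accumulate log-probability."""
    def __init__(self, choices):
        self.choices = list(choices)
        self.logp = 0.0
    def choose(self, options):
        o = self.choices.pop(0)
        self.logp += math.log(1.0 / len(options))  # uniform choice
        return o
    def flip(self, p):
        b = self.choices.pop(0)
        self.logp += math.log(p if b else 1.0 - p)
        return b

def synthesis_program(ctx):
    """Toy synthesis route: start from a block, iteratively combine reactants."""
    nodes = [ctx.choose(BLOCKS)]
    while not ctx.flip(P_STOP):            # keep reacting with prob 1 - P_STOP
        reactant = ctx.choose(BLOCKS)
        nodes.append("(" + nodes[-1] + "+" + reactant + ")")
    return nodes[-1]                       # final product 'molecule'
```

Running `synthesis_program(Sampler(seed))` samples a route; replaying the recorded choices through a `Scorer` returns the identical product while exposing its exact log-probability – the mechanism by which a PPL scores a DAG of interest.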
This exemplifies the utility of probabilistic programs for “partially specified” models, which occupy the rich area between data-driven (black-box deep learning) models and fully-specified simulators. We give another example of this in the Integrations section later, also in the context of molecular synthesis.
Simulation-based common sense cognition
The pursuit of common sense reasoning has largely been missing from mainstream AI research, notably characteristics such as generalizability and causal reasoning [Lake2016BuildingMT, George2017AGV]. To achieve common sense, and ultimately general machine intelligence, Marcus & Davis [Marcus2021InsightsFA] suggest starting by developing systems that can represent the core frameworks of human knowledge: time, space, causality, basic knowledge of physical objects and their interactions, and basic knowledge of humans and their interactions. Practical pursuit of this objective has largely focused on reverse engineering the “common sense core” from the cognitive and neurosciences: intuitive physics [Tgls2011PureRI] and intuitive psychology, or basic representations of objects in the world and the physics by which they interact, as well as agents with their motives and actions [Liu2017TenmontholdII]. In particular, work has studied the hypothesis that many intuitive physical inferences are based on a mental physics engine that is analogous in many ways to the machine physics engines used in building interactive video games [Ullman2017MindGG], and has modeled the intuitive physics and intuitive psychology of children as mental simulation engines based on probabilistic programs [Lake2016BuildingMT, Ullman2020BayesianMO].
To capture these kinds of ideas in engineering terms, probabilistic programming is needed: PPL integrate our best ideas on machine and biological intelligence across three different kinds of mathematics – symbolic manipulation (i.e., algebra and logic), Bayesian inference (probability), and neural networks (calculus). By wrapping physics engine programs and physics simulators inside frameworks for probabilistic modeling and inference, we gain the powerful tools necessary to capture common sense intuitive physics. In the works of Tenenbaum et al. this has been called “the simulator in your head”, and also “theory of mind” [Gershman2016PlansHA], which has a purpose distinct from the typical use of game engines in most recent AI research: to capture the mental models that are used for online inference and planning, rather than using games to train a NN or other model over a massive amount of simulated data.
These common sense cognition concepts have been demonstrated in a series of simulation-based cognitive science experiments to test and analyze the PPL-based intuitive physics engine [Battaglia2013SimulationAA, Hamrick2016InferringMI] and the intuitive psychology engine [Baker2017RationalQA], where the same kind of model can be used to jointly infer what somebody wants and what somebody believes (because people sometimes take inefficient actions [Gershman2016PlansHA]). Uncertainty reasoning is once again key here: the PPL-based approach leads to more robust machine intelligence because any real-world AI agent will never have perfect perception nor perfect physics models, so, like the human brain, making probabilistic inferences based on the available data and models is a robust, generalizable approach. Follow-up neuroscience experiments by Tenenbaum et al. aim to show where and how the intuitive physics engine works, and how its neural mechanisms connect synergistically to other cognitive functions based on their neural locus. An implication is that these kinds of probabilistic program models are operative even in the minds and brains of young babies, which may bear significance for developing methods for cooperative AI, which we discuss in the agent-based modeling and uncertainty reasoning sections.
As we noted earlier, sampling-based inference is predominant in PPL but in general slow, and certainly restrictive with games/graphics engines. We can alternatively use NNs to learn fast bottom-up approximations or inverses to these graphics engines, also known as “vision as inverse graphics” [Mansinghka2013ApproximateBI, Kulkarni2015PictureAP]. Leaning on NNs, PhysNet [Wang20183DPhysNetLT] attempts to treat intuitive physics as a pattern recognition task. Although able to reproduce some of the stability-prediction results, as is typical of deep learning the PhysNet approach required massive amounts of training and still could not generalize beyond the very specific physical object shapes and numbers.
Now a physics engine defined as a probabilistic program must be a program that takes programs as input and produces programs as output. This “hard problem of learning” represents a highly challenging search problem to find good programs, yet humans can solve that problem, so machine intelligence should be able to as well. Bayesian Program Learning [Lake2015HumanlevelCL] is an early example of this, but there is still considerable progress to be made. Additional notable works towards learning as programming include “the child as hacker” [Rule2020TheCA] and DreamCoder, which combines neural networks with symbolic program search to learn to solve problems in many different domains, essentially learning to construct a deep library of programs [Ellis2020DreamCoderGG].
Future directions
Probabilistic programming and simulation should continue to advance the cognitive and neurosciences work we described above from Tenenbaum et al. The hypothesis, with a growing body of support, is that a simulator or game engine type of representation is what evolution has built into our brains in order to understand the world of objects and agents; babies start off with a minimal proto game engine, learn with the game engine and expand it to cases it was not necessarily built for, and use this as the foundation for subsequently learning everything else.
Another general area in which to advance and scale PPL use is simulation-based inference. Specifically in the context of human-machine inference, using probabilistic programs as the simulator has many potential advantages: modularity in modeling systems, general-purpose inference engines, uncertainty quantification and propagation, and interpretability. See Fig. 12 and the simulation-based inference motif.
With the Etalumis project we now have the ability to control existing scientific simulators at scale and to use them as generative, interpretable models, which is highly valuable across the simulation sciences where interpretability in model inference is critical. The proof of concept was developed at scale in particle physics with the Large Hadron Collider’s (LHC) “Sherpa” simulator. This enables Bayesian inference on the full latent structure of the large numbers of collision events produced at particle accelerators such as the LHC, enabling deep interpretation of observed events. Work needs to continue across scientific domains to validate and generalize these methods, as has begun in epidemiology [GramHansen2019HijackingMS, Witt2020SimulationBasedIF] and astrophysics [Acciarini2020SpacecraftCR]. Other promising areas that are bottlenecked by existing simulators include hazard analysis in seismology [Heinecke2014PetascaleHO], supernova shock waves in astrophysics [Endeve2012TURBULENTMF], market movements in economics [Raberto2001AgentbasedSO], blood flow in biology [Perdikaris2016MultiscaleMA], and many more. Building a common PPL abstraction framework for different simulators would be ideal, both for experiment reproducibility and to allow easy, direct comparison between related or competing simulators.
In general, the intersection of probabilistic programming and scientific ML is an actively growing ecosystem. Still, much more progress stands to be made in computational efficiency, both in the inference algorithms and potentially in specialized compute architectures for PPL. We also believe the development of causal ML methods within PPL will be imperative, as the abstractions of probabilistic programming provide an opportunity to integrate the mathematics of causality with ML software. The larger opportunity for probabilistic programming is to expand the scope of scientific modeling: just as Turing machines are universal models of computation, probabilistic programs are universal probabilistic models – their model space comprises all computable probabilistic models, i.e., models where the joint distribution over model variables and evidence is Turing-computable [Ullman2020BayesianMO].
7. Differentiable Programming
Differentiable programming (DP) is a programming paradigm in which derivatives of a program are automatically computed and used in gradient-based optimization in order to tune the program to achieve a given objective. DP has found use in a wide variety of areas, particularly scientific computing and ML.
From the ML perspective, DP describes what is arguably the most powerful concept in deep learning: parameterized software modules that can be trained with some form of gradient-based optimization. In the DP paradigm, neural networks are constructed as differentiable directed graphs assembled from functional blocks (such as feedforward, convolutional, and recurrent elements), and their parameters are learned using gradient-based optimization of an objective describing the model’s task [Baydin2017AutomaticDI]. The ML field is increasingly embracing DP, freely composing NN building blocks in arbitrary algorithmic structures using control flow, as well as introducing novel differentiable architectures such as the Neural Turing Machine [Graves2016HybridCU] and differentiable versions of data structures such as stacks, queues, and deques [Grefenstette2015LearningTT].
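The core loop can be shown in a dependency-free sketch: a parameterized “program” containing control flow, whose derivative is propagated alongside its state (hand-written here as a forward-mode tangent; a DP framework would derive it automatically), tuned by gradient descent to hit a target output. The program, constants, and names are invented for illustration:

```python
def program(theta, steps=20):
    """A tiny 'simulator' with a loop: the state y and its tangent dy/dtheta
    are propagated together (the essence of forward-mode differentiation)."""
    y, dy = 0.0, 0.0
    for _ in range(steps):
        y = 0.9 * y + theta   # state update
        dy = 0.9 * dy + 1.0   # hand-derived tangent of the update
    return y, dy

def tune(target, lr=0.005, iters=500):
    """Gradient descent on the squared error between program output and target."""
    theta = 0.0
    for _ in range(iters):
        y, dy = program(theta)
        grad = 2.0 * (y - target) * dy  # chain rule through the loss
        theta -= lr * grad
    return theta
```

After tuning, running the program with the learned parameter reproduces the target output, which is exactly what a DP framework automates for programs far too complex to differentiate by hand.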
DP frameworks for deep learning have dominated the ML landscape during the past decade, most notably automatic differentiation frameworks such as Theano [Bergstra2010TheanoAC], TensorFlow [Abadi2016TensorFlowAS], and PyTorch [Paszke2017AutomaticDI]. In contrast to deep learning, which mainly involves compositions of large matrix multiplications and nonlinear elementwise operations, physical simulation requires complex and customizable operations due to intrinsic computational irregularities, which can lead to unsatisfactory performance in the aforementioned frameworks. To this end, recent works have produced powerful, parallelizable DP frameworks such as JAX [jax2018github, Frostig2018CompilingML] and DiffTaichi [Hu2020DiffTaichiDP] for end-to-end simulator differentiation (and thus gradient-based learning and optimization); each can integrate DP into simulator code. Both have promising applications in the physical sciences, which we discuss in the context of examples below. Further, the Julia language and ecosystem [bezanson2017julia] provides system-wide differentiable programming, motivated by scientific and numerical computing, with recent AI integrations for deep learning and probabilistic programming.
Automatic differentiation (AD)
At the intersection of calculus and programming, AD is the workhorse of DP. AD involves augmenting standard computation with the calculation of various derivatives, automatically and efficiently [Baydin2017AutomaticDI]. More precisely, AD performs a nonstandard interpretation of a given computer program by replacing the domain of the variables to incorporate derivative values and redefining the semantics of the operators to propagate derivatives per the chain rule of differential calculus. AD is not simply non-manual differentiation, as the name might suggest. Rather, AD as a technical term refers to a specific family of techniques that compute derivatives through accumulation of values during code execution, generating numerical derivative evaluations rather than derivative expressions. This allows accurate evaluation of derivatives at machine precision with only a small constant factor of overhead and ideal asymptotic efficiency. Availability of general-purpose AD greatly simplifies the DP-based implementation of ML architectures, simulator codes, and the composition of AI-driven simulation, enabling their expression as regular programs that rely on the differentiation infrastructure. See Baydin et al. [Baydin2017AutomaticDI] for a thorough overview.
In the context of simulators, DP, like probabilistic programming, offers ways to exploit advances in ML [Cranmer2020TheFO] and provides an infrastructure to incorporate these into existing scientific computing pipelines. An intriguing approach with many applications is machine-learned surrogate modeling with scientific simulators and solvers. In Fig. 27 we illustrate one example of chaining a neural network surrogate with a simulator (as an ODE solver) to solve an inverse problem, i.e., inferring hidden states or parameters from observations or measurements. DP enables this approach because the end-to-end differentiable workflow, including both simulation code and ML model, can backpropagate errors and make updates to model parameters. We discuss this and related methods in depth in the multiphysics and surrogate modeling motifs, along with how all the motifs enable a spectrum of inverse problem solving capabilities.
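The nonstandard interpretation at the heart of AD can be made concrete with dual numbers, the simplest form of forward-mode AD: each variable carries its derivative, and overloaded operators propagate tangents by the chain rule. A minimal, dependency-free sketch, not a production AD system:

```python
import math

class Dual:
    """A value paired with its derivative; operators apply the chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    """Redefined primitive: propagates the tangent through sin."""
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def derivative(f, x):
    """Evaluate f(x) and df/dx in a single forward pass."""
    out = f(Dual(x, 1.0))  # seed the tangent with dx/dx = 1
    return out.val, out.dot
```

The derivative is accumulated as a numerical value during execution, exactly at machine precision, with no symbolic expression ever constructed – the distinction drawn above between AD and symbolic differentiation.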
Examples
Differentiable pipeline for learning protein structure
Computational approaches to protein folding not only seek to make structure determination faster and less costly (protein structure being a fundamental need for rational drug design), but also aim to understand the folding process itself. Prior computational methods aim to build explicit sequence-to-structure maps that transform raw amino acid sequences into 3D structures [Kolinski2011MultiscaleAT], either with expensive molecular dynamics (MD) simulations or “fragment assembly” methods that rely on statistical sampling and a given template solution. Another category of prior methods is “co-evolution”, where multiple sequence alignment is used to infer geometric couplings based on evolutionary couplings, yet in practice going from the sequence contact map to a full 3D structure is only reliable 25%–50% of the time [Ovchinnikov2017ProteinSD]. In general there are major shortcomings in the prior work, such as the inability to predict structural consequences of slight changes or mutations [Hopf2017MutationEP] and the inability to work with de novo sequences [AlQuraishi2019EndtoEndDL], not to mention the computational burdens (for example, the prior state-of-the-art pipeline that combines template-based modeling and simulation-based sampling takes approximately 20 CPU-hours per structure [Zhang2018TemplatebasedAF]). To address the above limitations, AlQuraishi [AlQuraishi2019EndtoEndDL] proposes differentiable programming to machine-learn a model that replaces structure prediction pipelines with differentiable primitives. The approach, illustrated in Fig. 28, is based on four main ideas: (1) encoding protein sequences using a recurrent neural network, (2) parameterizing (local) protein structure by torsional angles to enable the model to reason over diverse conformations without violating their covalent chemistry, (3) coupling local protein structure to its global representation via recurrent geometric units, and (4) a differentiable loss function capturing deviations between predicted and experimental structures, used to optimize the recurrent geometric network (RGN) parameters. Please see the original work [AlQuraishi2019EndtoEndDL] for full details, including thorough experiments comparing quantitative and qualitative properties of the DP-based approach (RGN) versus prior art. One important highlight: the RGN approach is 6–7 orders of magnitude faster than existing structure prediction pipelines, although there is the upfront cost of training the RGN for weeks to months. Beyond results that overall surpass the prior methods (including performance on novel protein folds), the differentiable structure approach can be useful towards a bottom-up understanding of systems biology. AlphaFold, the recent breakthrough approach that surpasses RGN and everything else in the field [Jumper2021HighlyAP], also exploits autodiff (Fig. 25), but in a deep-learned way that does not necessarily improve scientific knowledge of protein folding; more on AlphaFold is mentioned in the Honorable Mentions subsection later. A learning-based approach, in contrast to a deterministic pipeline, can provide valuable information towards future experiments and scientific knowledge. In this case of machine-learning protein sequences, the resulting model learns a low-dimensional representation of protein sequence space, which can then be explored, for example with the search and optimization algorithms we’ve discussed in the open-endedness and multiphysics motifs. In a standard pipeline (and to a different degree in AlphaFold), this knowledge is obfuscated or not present – the full domain knowledge must be encoded a priori. The use of simulation intelligence motifs in addition to DP holds great potential in this area, notably semi-mechanistic modeling to incorporate experimental and structural data into integrative multiscale simulation methods that can model the range of length and time scales involved in drug design and development [Jagger2020MultiscaleSA].
JAX: DP for physical simulation from molecular dynamics to rigid body motion
We’ve highlighted JAX several times now as a valuable system for utilizing machine learning to improve the simulation sciences. One use case demonstrating the integration of several motifs is building semi-mechanistic surrogate models in the DP framework in order to accelerate complex computational fluid dynamics (CFD) simulations [Kochkov2021MachineLC]. Another use case, alluded to in Fig. 5, is spanning multiple modeling scales, including molecular dynamics (MD). MD is frequently used to simulate materials to observe how small-scale interactions can give rise to complex large-scale phenomenology [jax2018github]. Standard MD packages used across physics simulations (such as HOOMD-blue [Anderson2013HOOMDblueAP] or LAMMPS [Plimpton1993FastPA, Thompson2021LAMMPSA]) are complicated, highly specialized pieces of code. Poor abstractions and technical debt make them difficult to use, let alone extend for advances in accelerated computing and ML – for instance, specialized and duplicated code for running on CPU vs. GPU, hard-coded physical quantities, and complex manual differentiation code. For these reasons, JAX MD [Schoenholz2020JAXMA] leverages differentiable programming as well as automatic hardware acceleration (via the compilation scheme in JAX). Several design choices rooted in functional programming make JAX MD distinct from traditional physics software, and more capable with autodiff and ML methods: transforming arrays of data, immutable data, and first-class functions. JAX MD primitives are functions that help define physical simulations, such as “spaces” (dimensions, topology, boundary conditions) and “interactions” (energy, forces). Higher-level functions can then be used to compose simulations of particles or atoms.
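The functional style can be illustrated with a dependency-free sketch in the spirit of, but not taken from, the JAX MD API: pure functions define a periodic 1D “space”, a pairwise energy, and forces, and an integration step returns a new state instead of mutating one. Here forces come from a central-difference gradient where JAX MD would apply `grad(energy)`; the soft-sphere potential and all constants are invented for illustration:

```python
import math

BOX = 10.0  # periodic box length (invented)

def displacement(xi, xj):
    """Minimum-image displacement in a periodic 1D 'space'."""
    d = xi - xj
    return d - BOX * round(d / BOX)

def energy(positions):
    """Soft-sphere pair energy: purely repulsive inside unit separation."""
    e = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = abs(displacement(positions[i], positions[j]))
            if r < 1.0:
                e += (1.0 - r) ** 2  # quadratic repulsion
    return e

def forces(positions, eps=1e-6):
    """Central-difference negative gradient of the energy (a stand-in for AD)."""
    fs = []
    for i in range(len(positions)):
        plus = list(positions); plus[i] += eps
        minus = list(positions); minus[i] -= eps
        fs.append(-(energy(plus) - energy(minus)) / (2 * eps))
    return fs

def verlet_step(positions, velocities, dt=0.01):
    """One velocity-Verlet step; returns a new state (no mutation)."""
    f0 = forces(positions)
    new_p = [(p + v * dt + 0.5 * f * dt * dt) % BOX
             for p, v, f in zip(positions, velocities, f0)]
    f1 = forces(new_p)
    new_v = [v + 0.5 * (a + b) * dt for v, a, b in zip(velocities, f0, f1)]
    return new_p, new_v
```

Two overlapping particles started at rest repel each other after one step, and because every function is pure, the whole pipeline composes cleanly with transformations like differentiation or vectorization – the property JAX MD exploits.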
The initial JAX MD work [Schoenholz2020JAXMA] demonstrated training several stateofart neural networks to model quantum mechanical energies and forces, specifically to represent silicon atoms in different crystalline phases based on density functional theory (DFT, the standard first principles approach to electronic structure problems). DFT is prevalent across physics problems but nontrivial to reliably compute or approximate in practice. In recent years there’s been interest in utilizing ML to improve DFT approximations, notably towards JAXDFT [Li2021KohnShamEA]. They leverage the differentiable programming utilities of JAX to integrate the DFT solver equations (i.e., KohnSham equations) directly into neural network training – that is, leveraging JAX as a physicsinformed ML framework. This integration of prior knowledge and machinelearned physics, the authors suggest, implies rethinking computational physics in this new paradigm: in principle, all heuristics in DFT calculations (e.g., initial guess, density update, preconditioning, basis sets) could be learned and optimized while maintaining rigorous physics [Li2021KohnShamEA]. JAX has similarly been leveraged for automated construction of equivariant layers in deep learning [Finzi2021APM], for operating on graphs and point clouds in domains like particle physics and dynamical systems where symmetries and equivariance are fundamental to the generalization of neural networks – equivariant neural networks are detailed later in the Discussion section. A more AImotivated extension of JAX is Brax [Freeman2021BraxA], a differentiable physics engine for accelerated rigid body simulation and reinforcement learning. Brax trains locomotion and dexterous manipulation policies with high efficiency by making extensive use of autovectorization, device parallelism, justintime compilation, and autodifferentiation primitives of JAX. For similar reasons, JAX has been leveraged specifically for normalizing flows
, a general mechanism for defining expressive probability distributions by transforming a simple initial density into a more complex one through a sequence of invertible transformations [rezende2015variational, Papamakarios2019NormalizingFF]. We touch on normalizing flows more in the Discussion section later.

DiffTaichi: purpose-built DP for highly efficient physical simulation
Taichi [Hu2019TaichiAL] is a performance-oriented programming language for writing simulations, and DiffTaichi [Hu2020DiffTaichiDP] augments it with additional automatic differentiation capabilities. While JAX MD and the various others in the JAX ecosystem provide physical simulation workflows out of the box because they are purpose-built for those applications, DiffTaichi is "closer to the metal" for fast simulations and thus generally offers better computational performance than JAX. This can be explained by several key language features required by physical simulation, yet often missing in existing differentiable programming tools:

Megakernels – The programmer can naturally fuse multiple stages of computation into a single kernel, which is later differentiated using source-code transformations and just-in-time compilation. Compared to mainstream DP frameworks for deep learning (PyTorch [Paszke2017AutomaticDI] and TensorFlow [Abadi2016TensorFlowAS]), which are built around linear algebra operators, megakernels have higher arithmetic intensity and are thus more efficient for physical simulation tasks.

Imperative parallel programming – DiffTaichi is built in an imperative style, which interfaces better with physical simulations than the functional array programming approach of mainstream deep learning frameworks. First, this approach provides parallel loops, "if" statements, and other control flow in a differentiable framework; these constructs are widely used in physical simulations for common tasks such as handling collisions, evaluating boundary conditions, and building iterative solvers. Second, traditional physical simulation programs are written in imperative languages such as Fortran and C++, so an imperative interface eases porting existing simulation code.

Flexible indexing – The programmer can manipulate array elements via arbitrary indexing, which is handy for the non-elementwise operations, such as numerical stencils and particle-grid interactions, that are common in physical simulations but not in deep learning.
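DiffTaichi expresses such stencils directly inside kernels; the indexing pattern itself can be shown with a plain-Python sketch of a periodic three-point Laplacian stencil (illustrative only, not DiffTaichi syntax):

```python
def laplacian_1d(u, dx):
    """Three-point stencil: each output gathers neighboring elements by index,
    a non-elementwise access pattern (here with periodic wrap-around)."""
    n = len(u)
    return [(u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) / dx ** 2 for i in range(n)]

# a single spike diffuses: the stencil reads u[i-1], u[i], u[i+1]
u = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
lap = laplacian_1d(u, dx=1.0)
```

The gather `u[i - 1]` and `u[(i + 1) % n]` is exactly the kind of arbitrary, data-dependent indexing that elementwise tensor operators do not express naturally.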
The initial work by Hu et al. [Hu2020DiffTaichiDP] explains the design decisions and implementations nicely, along with illustrative examples in incompressible fluids, elastic objects, and rigid bodies, showing significant performance gains over comparable implementations in other DP (and deep learning) frameworks. Nonetheless, these are toy problems, and much work remains to be done in complex real-world settings.
Julia: differentiable programming as a scientific ML lingua franca
Dynamic programming languages such as Python, R, and Julia are common in both scientific computing and AI/ML for their implementation efficiency, enabling fast code development and experimentation. Yet most systems reflect a compromise between convenience and performance: programmers express high-level logic in a dynamic language while the heavy lifting is done in C/C++ and Fortran. Julia [bezanson2017julia], on the other hand, can deliver the performance of a statically compiled language while providing interactive, dynamic behavior. The key ingredients of this performance are rich type information (provided to the compiler naturally by multiple dispatch), code specialization against runtime types, and just-in-time (JIT) compilation using the LLVM compiler framework [Lattner2004LLVMAC] – see Bezanson et al. [bezanson2017julia] for details.
Potentially resolving the trade-offs and annoyances of dynamic-static systems that couple Python and C, for example, Julia is a natural fit for differentiable programming towards scientific ML (not to mention probabilistic programming, geometric deep learning, and others we've touched on). The work by Innes et al. [Innes2019ADP] is a DP system that can take gradients of Julia programs, making automatic differentiation a first-class language feature. Thus Julia can support gradient-based learning and program transforms for the SI motifs and deep learning. We detailed several examples in the multi-physics and surrogate modeling motifs, in particular the Universal Differential Equations (UDE) system [Rackauckas2020GeneralizedPL] (Fig. 8). Other illustrative examples include differentiable programming for finance, where contract valuation can be mixed with market simulations and other DP tools like neural networks; accelerated neural differential equations for financial time series modeling; and end-to-end automatic differentiation of hybrid classical-quantum systems (in simulation) [Innes2019ADP].
Future directions
In the artificial intelligence field, differentiable programming has been likened to deep learning [lecunDP]. This does not do it justice, as DP is better viewed as a generalization of deep learning. The potential of DP is immense when taken beyond deep learning to the general case of complex programs: taking advantage of the extensive knowledge embedded within existing programs, and enabling never-before-possible models and in silico experiments. DP has the potential to be the lingua franca that further unites the worlds of scientific computing and machine learning [Rackauckas2020GeneralizedPL]. The former is largely governed by mechanistic equations and inefficient numerical solvers, the latter by data-driven methods that learn high-dimensional spaces. DP provides exciting opportunities at the intersection, as we explored in the multi-physics motif.
DP tied to ML and simulation is a relatively new area, and there is much to do in both theory and practice. On the theory side, it would be valuable to derive techniques for working with non-differentiable functions like ReLU used in deep learning [Abadi2020ASD], or with non-smooth functions. It would also be useful to work with approximate reals, rather than reals, and to seek numerical accuracy theorems. For practical use, particularly in ML, it is important for DP to add explicit tensor (multidimensional array) types with accompanying shape analysis. In general, richer DP languages are needed to support a wider range of types and computations – e.g., recursion, or programming with Riemannian manifolds (to accommodate natural gradient descent). There is also a need to move beyond Python as a host language, with computation runtimes orders of magnitude more efficient in Julia, C++ (e.g., DiffTaichi), and close-to-the-metal domain-specific languages (DSLs, see Fig. 1). Just as AD is the practical power underlying DP, nonstandard interpretation is the fundamental power underlying AD. Nonstandard interpretation poses the question: how can I change the semantics of this program to solve another problem? Its theory and formalism could provide rich developments beyond AD.
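To make the idea of nonstandard interpretation concrete, consider forward-mode AD by dual numbers: the same program is reinterpreted over values that carry a derivative alongside each number. A minimal, illustrative Python sketch (not any particular library's implementation):

```python
class Dual:
    """Nonstandard interpretation of arithmetic: each value carries (value, derivative)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule, applied mechanically by the reinterpreted multiply
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def derivative(f, x):
    """Run the unchanged program f on a seeded dual number; read off df/dx."""
    return f(Dual(x, 1.0)).dot

d = derivative(lambda x: x * x + 3 * x + 1, 2.0)   # d/dx = 2x + 3 = 7 at x = 2
```

The program being differentiated is untouched; only the semantics of `+` and `*` change, which is exactly the sense in which AD is a nonstandard interpretation.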
The Frontier
There are two nascent fields of AI and ML that we suggest have potentially massive upside for advancing the simulation sciences and machine intelligence: open-endedness and machine programming. We now describe these as motifs in the SI stack.
8. Open-Ended Optimization
Open-endedness, or open-ended search, can be defined as any algorithm that consistently produces novelty or increases in complexity in its outputs. Such open-endedness is a ubiquitous feature of biological, techno-social, cultural, and many other complex systems [Banzhaf2016DefiningAS], but its full algorithmic origins remain an open problem [Stepney2021]. This stands in contrast to most other kinds of optimization, which generally seek to minimize loss functions or otherwise reduce uncertainty [Friston2017, Krakauer2020a]. By instead seeking to increase complexity, open-ended search tends to increase uncertainty by definition [Rasmussen2019TwoMO]. We believe this is an extremely desirable property in optimization (and intelligence), as it has proven to be a source of robust behavior and mechanistic innovation [Adams2017], as we review below.
Open-ended learning algorithms are in general those with the characteristics of lifelong learning: they adaptively and continually pose new objectives, adapt to environment changes, and avoid pitfalls such as local minima, deceptive loss functions [Lehman2011a], and catastrophic forgetting [Kumaran2016WhatLS]. It follows that in open-ended optimization the components of a system continue to evolve new forms or designs continuously, rather than grinding to a halt when some optimum, iteration threshold, or stable position is reached [Taylor1999FromAE]. Open-ended optimization also has deep connections to fundamental issues in computer science. Theoretical analysis has shown it can produce unbounded complexity via computational universality [Hernandez2018]. Yet by these arguments we must embrace undecidability and irreducibility in the optimization [Hernandez2018], which has in turn led to practical results, such as the appearance of modularity in a memory system [Hernandez2018b].
In general, at least two major kinds of open-endedness can be distinguished: (1) processes that do not stop, and (2) processes that do not have specific objectives. It is straightforward to connect these features to evolution. We believe open-ended algorithms, like those we outline below, must serve as the mechanisms underlying evolution's creativity at multiple levels of abstraction, from the evolution of protein subunits, to protein networks, to the development of eukaryotic systems, to speciation events, and perhaps the Cambrian explosion [Hernandez2018b]. Rasmussen & Sibani [Rasmussen2019TwoMO] provide useful discussion to this end, offering a means of disentangling the processes and algorithms of optimization and expansion within complex evolutionary processes. We aim to extend these concepts broadly in the life sciences and other domains such as physics and engineering, as alluded to in the methods and examples discussed next.
Examples
Randomness and optimization
Randomness is, in a sense, the simplest kind of open-ended search: a random process continues indefinitely and can generate different values, which for a time leads to increases in entropy, and therefore output complexity. But for a finite-tailed random process, the entropy of the samples converges, and so complexity converges. Nevertheless, the right amount of randomness (or noise) can bump an optimization process out of local minima, and so aid the optimization [Burke2005]. Randomness offers a simple demonstration of how increasing complexity can later aid an optimization whose objective is decreasing uncertainty. But, as the measured entropy of a random process does plateau, randomness alone is insufficient for the consistent introduction of complexity over a system's lifetime [Thrun1998].
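A minimal illustration of noise aiding optimization is random-restart local search on a function with a deceptive local minimum; the objective and parameters below are hypothetical, chosen only for the demonstration:

```python
import random

def f(x):
    # hypothetical objective: local minimum near x = 3 (f = 1), global near x = -2 (f = 0)
    return min((x - 3) ** 2 + 1, (x + 2) ** 2)

def local_descent(x, step=0.01, iters=2000):
    """Greedy coordinate descent: gets stuck in whichever basin it starts in."""
    for _ in range(iters):
        if f(x - step) < f(x):
            x -= step
        elif f(x + step) < f(x):
            x += step
    return x

random.seed(0)
# random restarts: some starts land in the global basin and escape the deceptive one
best = min((local_descent(random.uniform(-10, 10)) for _ in range(20)), key=f)
```

A single descent from an unlucky start returns the local minimum at f = 1; the randomness of restarts is what recovers the global basin.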
Some Bayesian optimization (BO) methods are examples of entropy-based search strategies, where acquisition functions intelligently balance explore-exploit trade-offs [Bonawitz2014] – recall that an acquisition function guides the BO search by estimating the utility of evaluating the objective function at a given point or parameterization. A useful class of acquisition functions is information-based, such as entropy search [Hennig2012EntropySF, Haarnoja2018] and predictive entropy search [HernndezLobato2014PredictiveES]. Yet these information-based methods aim to reduce uncertainty in the location of the optimizer rather than increase entropy continuously; BO is a key example of ML-based optimization that is orthogonal to open-endedness. Some BO strategies, and Monte Carlo sampling broadly, utilize search algorithms that appear random but are in fact pseudo-random or quasi-random, such as Sobol sequences [Sobol1967OnTD] and other quasi-Monte Carlo techniques [Lemieux2009MonteCA, Letham2019ConstrainedBO, Lavin2018DoublyBO].
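Sobol sequences themselves require a more involved construction, but the flavor of quasi-random, low-discrepancy sampling is captured by the related van der Corput and Halton sequences, sketched here for illustration:

```python
def van_der_corput(n, base=2):
    """n-th element of the base-b van der Corput low-discrepancy sequence:
    reflect the base-b digits of n about the radix point."""
    q, denom = 0.0, 1.0
    while n:
        n, r = divmod(n, base)
        denom *= base
        q += r / denom
    return q

def halton(n, bases=(2, 3)):
    """Quasi-random point in [0,1)^d using coprime bases per dimension."""
    return tuple(van_der_corput(n, b) for b in bases)

pts = [van_der_corput(i) for i in range(1, 5)]   # 0.5, 0.25, 0.75, 0.125
```

Unlike i.i.d. uniform draws, successive points deterministically fill the gaps left by earlier ones, which is what makes such sequences attractive for space-filling designs in BO and quasi-Monte Carlo integration.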
Deceptive objectives and novelty search
Deceptive optimization problems are defined as problems for which the global optimum appears highly suboptimal over the first few samples, and they are generally not solved by simple noise injection [Lehman2011]. A concrete example of deception can be found by considering an agent in a maze. If the environment provides a proximity signal to the end of the maze, it would seem such a signal makes the maze trivial. But this is often not so: in most mazes an agent must, for a time, walk away from the objective (to get around a bend, for example). The proximity signal thus becomes deceptive, because while following it is the optimal policy overall, for short periods the winning agent must violate that policy.
Stanley and colleagues [Lehman2011a] were the first to suggest that open-ended novelty search, defined as open-ended search to increase behavioral complexity, is sufficient to overcome deception, because deceptive states are novel states. Other researchers have gone so far as to suggest that novelty and curiosity (discussed below) are sufficiently good search/exploration algorithms that they can and should be used universally, in place of traditional objective-based optimization methods [Peterson2019, Fister2019, Mouret2011b]. While novelty search does converge, there is early evidence that it interacts with traditional evolutionary algorithms to make evolvability (modularization and reusability) inevitable [Doncieux2020].

To increase search efficiency in open-endedness, some efforts combine novelty search with interactive evolutionary computation (IEC), where a human intervenes on the process to guide search [Takagi2001InteractiveEC, Woolley2014ANH, Lwe2016AcceleratingTE]. IEC is particularly well suited to domains or tasks for which it is hard to explicitly define a metric, including when criteria are inherently subjective or ill-defined (e.g., salience, complexity, naturalism, beauty, and of course open-endedness).
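At the core of novelty search is a behavioral novelty score, commonly the mean distance to the k nearest behaviors in an archive of past individuals; selection then favors high novelty rather than high fitness. A toy one-dimensional sketch:

```python
def novelty(behavior, archive, k=3):
    """Novelty = mean distance to the k nearest behaviors seen so far."""
    dists = sorted(abs(behavior - b) for b in archive)
    return sum(dists[:k]) / min(k, len(dists))

archive = [0.0, 0.1, 0.2, 5.0]       # behaviors of past individuals
common = novelty(0.15, archive)      # close to the archive: low novelty
rare = novelty(9.0, archive)         # far from anything seen: high novelty
```

In a deceptive maze, the behavior descriptor would be the agent's final position, so walking away from the goal (a deceptive but necessary move) is rewarded simply for being new.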
Woolley & Stanley [Woolley2014ANH] implement novelty-assisted interactive evolutionary computation (NA-IEC) as a human-computer interaction system: a human user selects individuals from a population of candidate behaviors and then applies one of three evolutionary operations: a traditional IEC step (i.e., subjective direction by the human), a short-term novelty search, or fitness-based optimization. Experiments in a deceptive maze domain show synergistic effects of augmenting a human-directed search with novelty search: NA-IEC found solutions in fewer evaluations, at lower complexities, and in significantly less time overall than the fully automated processes.
Curiosity and self-training artificial systems
Developmental artificial intelligence – the invention and use of training algorithms that force agents to be self-teaching – relies on giving agents a sense of curiosity. Curiosity, loosely defined as learning for learning's sake [Berlyne1954], has been difficult to pin down experimentally both in psychological research [Kidd2015, Gottlieb2018] and in AI research [Savinov2019, Burda2018, Ecoffet2019, Kosoy2020, Colas2019, Oudeyer2018a]. That said, there is recent progress toward a unified mathematical view [Peterson2019]. Details aside, the broad goal of self-teaching-by-curiosity is to lessen, or remove, the need for carefully curated datasets, which are both expensive to create and often innately and unfairly biased [Gebru2017, Jo2020].
According to leading practitioners, effective self-teaching [Sigaud2021] requires that an agent self-generate goals [Colas2019], discover controllable features in the environment [LaversanneFinot2018], and as a result make causal discoveries [Gopnik2012, Sontakke2020]. It also requires agents to learn to avoid catastrophic mistakes [Lipton2018] and to handle environmental nonstationarity [Thrun1998]. Self-learning inverse models may also prove important [Baranes2013]; in other words, self-teaching agents must generate self-organizing, long-term developmental trajectories.
Quality-diversity and robustness to the unknown in robotic systems
The broad goal of quality-diversity (QD) algorithms is to produce a diverse set of high-performing solutions to any given optimization problem [Pugh2016a]. Having diverse potential solutions for AI-driven research and decision-making is desirable because it enables the practitioner or downstream algorithms to be robust in problem solving [Pugh2016a, Colas2020]. In practice, this is done by combining novelty search (or other open-ended search approaches) with objective-based optimization. One of the most widely used algorithms is MAP-Elites, developed by Mouret & Clune [Mouret2015]. Clune and colleagues use MAP-Elites to generate generalizable action plans in robotic systems [Velez2014]. Remarkably, QD also seems to address a longstanding problem in robotics research: adaptation to the unexpected [Cully2015]. Unlike agents in the natural world, robotic systems that work perfectly in the environments they were trained in will fail frequently and catastrophically when even small but unexpected changes occur; they especially struggle when there are changes or damage to their appendages. However, robots trained using QD methods showed a remarkable ability to recover from the unknown, as well as from purposeful damage inflicted by the experimenters. In other words, seeking novelty, prioritizing curiosity, and building complexity prepares agents to handle the unexpected [Jaderberg2021]; open-ended learning and optimization can produce more robust AI agents.
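The essence of MAP-Elites is a behavioral archive that retains the best ("elite") solution per behavior bin, rather than a single global optimum. A toy sketch on a hypothetical one-dimensional domain (the fitness and behavior functions below are ours, purely for illustration):

```python
import random

def map_elites(evaluate, behavior, sample, mutate, bins=10, iters=2000, seed=1):
    """Keep the best ('elite') solution found in each behavior bin."""
    random.seed(seed)
    archive = {}                      # bin index -> (fitness, solution)
    for _ in range(iters):
        # mutate a random existing elite, or sample fresh if the archive is empty
        x = mutate(random.choice(list(archive.values()))[1]) if archive else sample()
        b = min(int(behavior(x) * bins), bins - 1)
        fit = evaluate(x)
        if b not in archive or fit > archive[b][0]:
            archive[b] = (fit, x)
    return archive

# toy domain: solutions are points in [0,1]; behavior = the point itself,
# fitness peaks at 0.5 -- MAP-Elites still fills diverse bins, not just the peak
archive = map_elites(
    evaluate=lambda x: -abs(x - 0.5),
    behavior=lambda x: x,
    sample=lambda: random.random(),
    mutate=lambda x: min(1.0, max(0.0, x + random.gauss(0, 0.1))),
)
```

The result is an "illuminated" map: one good solution per behavioral niche, which is exactly the repertoire a damaged robot can fall back on when its best-trained gait stops working.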
Robotic systems for scientific discovery
In addition to creating robust, flexible, and self-training robotics platforms, open-ended optimization methods have significant future potential for directly exploring scientific unknowns. For instance, Häse and colleagues have produced several objective-based optimization schemes for autonomous chemical experimentation, including approaches that combine objectives with user-defined preferences (with similar motivations as NA-IEC) [Hse2018ChimeraEH], and BO schemes like those mentioned above [Hse2018PhoenicsAB]. These are developed in the context of "Materials Acceleration Platforms" for autonomous experimentation in chemistry [FloresLeonar2020MaterialsAP], where specific implementations include the "Chemputer" [Steiner2019OrganicSI] for organic synthesis (an automated retrosynthesis module coupled with a synthesis robot and a reaction optimization algorithm) and "Ada" (autonomous discovery accelerator) [MacLeod2020SelfdrivingLF] for thin-film materials. We continue discussing open-ended SI systems of this nature in the Integrations section later.
Lifelong machine learning and efficient memory
A lifelong learning system is an adaptive algorithm capable of learning from a continuous stream of information, where the information becomes progressively available over time and the number of tasks to be learned is not predefined. In general, the main challenge is catastrophic forgetting, where models tend to forget existing knowledge when learning from new observations [Thrun1998]. This has been an active area of research in the neural network community (and connectionist models broadly, such as Hopfield networks [Hopfield1982NeuralNA]) for decades, most recently focused on deep learning; for a comprehensive review, refer to Parisi et al. [Parisi2019ContinualLL]. Here, in the context of open-endedness, we focus on the interplay of resource limitations and lifelong (continual) learning.
Memory-based methods write experience to memory to avoid forgetting [Gepperth2016IncrementalLA], yet run into obvious issues with fixed storage capacity. Further, it is nontrivial to predefine appropriate storage without strong assumptions on the data and tasks, which by definition evolve in continual learning settings. The open-endedness setting exacerbates this issue, as increasing novelty over time requires writing ever more knowledge to memory. Central to an effective memory model is efficient updating. Methods with learned read and write operations, such as the differentiable neural computer (DNC) [Graves2016HybridCU], use end-to-end gradient-based learning to simultaneously train separate neural networks to encode observations, read from the memory, and write to the memory. For continual learning problem domains, a DNC could hypothetically learn how to select, encode, and compress knowledge for efficient storage, and retrieve it for maximum recall and forward transfer. Design principles for the DNC and related external NN memories [Fortunato2019GeneralizationOR, Banino2020MEMOAD, Munkhdalai2019MetalearnedNM] are not yet fully understood; the DNC is extremely difficult to train even on stationary datasets, let alone in open-ended nonstationary environments [Hadsell2020EmbracingCC].
Associative memory architectures can provide insight into how to design efficient memory structures, in particular using overlapping representations to be space efficient. For example, the Hopfield network [Hopfield1982NeuralNA, Ramsauer2021HopfieldNI] pioneered the idea of storing patterns in the low-energy states of a dynamic system, and Kanerva's sparse distributed memory (SDM) model [Kanerva1988SparseDM] affords fast reads and writes and dissociates capacity from the dimensionality of the input by introducing addressing into a distributed memory store whose size is independent of the dimension of the data. The influence of SDM on modern machine learning has led to the Kanerva Machine [Wu2018TheKM, Marblestone2020ProductKM], which replaces NN-memory slot updates with Bayesian inference and is naturally compressive and generative. By implementing memory as a generative model, the Kanerva Machine removes assumptions of uniform data distributions and can retrieve unseen patterns from the memory through sampling, both of which behoove use in open-ended environments. SDM, initially a mathematical theory of human long-term memory, has additionally influenced architectures derived directly from neocortex studies, namely Hierarchical Temporal Memory (HTM) [Hawkinsetal2016Book]. HTM is in a class of generative memory methods that is more like human and rodent intelligence than the others we've mentioned, and it implements open-ended learning in a sparse memory architecture. HTM networks learn time-based sequences in a continuous online fashion using realistic neuron models that incorporate nonlinear active dendrites. HTM relies on sparse distributed representations (SDRs) – an encoding scheme that allows simultaneous representation of distinct items with little interference while still maintaining a large representational capacity, and which has been documented in auditory, visual, and somatosensory cortical areas [Cui2017TheHS].
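To illustrate the associative-memory idea, here is a minimal Hopfield network in pure Python: Hebbian learning stores a pattern as a low-energy attractor, and iterating the dynamics recovers it from a corrupted cue. This is a pedagogical sketch, not a scalable implementation:

```python
def train_hopfield(patterns):
    """Hebbian weights storing +/-1 patterns in low-energy attractor states."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def recall(w, state, steps=5):
    """Iterate the network dynamics; the state descends to a stored attractor."""
    n = len(state)
    for _ in range(steps):
        state = [1 if sum(w[i][j] * state[j] for j in range(n)) >= 0 else -1
                 for i in range(n)]
    return state

pattern = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hopfield([pattern])
corrupted = [-1, -1, 1, -1, 1, -1, 1, -1]   # first bit flipped
restored = recall(w, corrupted)
```

The content-addressable behavior (retrieving a whole pattern from a partial or noisy cue) is the property that SDM and its descendants scale up with distributed, overlapping codes.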
Leveraging the mathematical properties of sparsity appears to be a promising direction for continual learning models, and openended machine intelligence overall.
Future directions
Curiosity, novelty, QD, and even memory-augmentation algorithms are in practice quite limited in the kinds of complexity they generate, and all converge rather quickly, such that complexity generation plateaus and then ends. This is in striking contrast to even the simplest of biological systems (for example, E. coli monocultures [Rasmussen2019TwoMO]). And the approaches for working, short-term, and long-term memories remain limited compared to the observed characteristics of their biological counterparts. We seem to be missing some central insight(s); this is a ripe area for extending the domain of artificial intelligence, where the complexity sciences will have significant roles to play [Wolpert2019TheEO].
In the field of artificial life (ALife), open-endedness is a crucial concept. ALife is the study of artificial systems that exhibit the general characteristics of natural living systems. Over decades of research, practitioners have developed several kinds of ALife systems through synthesis or simulation using computational (software), robotic (hardware), and/or physicochemical (wetware) means. A central unmet goal of ALife research has been to uncover the algorithm(s) needed to consistently generate complexity and novelty consistent with natural evolution [Frans2021]. Discovering and developing these algorithms would have significant implications for open-endedness and artificial intelligence broadly [Stanley2019WhyOM].
In the context of simulation more generally, a fully realized open-ended system would not only solve optimization problems as posed by practitioners, but also pose new questions or approaches for practitioners themselves to consider. We believe this would be revolutionary: it would allow AI systems to leverage their prior expertise to create new problem formulations, insights, and eventually scientific theories. As important as investing in R&D to build such systems is investing in understanding the ethical implications of this approach to science, and in monitoring its real-world, downstream effects.
9. Machine Programming
Machine programming (MP) is the automation of software (and potentially hardware) development. Machine programming is a general technology, in ways that are quite different from the other simulation intelligence motifs. As systems based on the SI motifs grow in complexity, at some point they will necessitate automating aspects of software development. Machine programming can understand code across motifs, compose them into working systems, and then optimize these systems to execute on the underlying hardware – this interaction is illustrated at a high level in Fig. 1. Hence, while MP is not tied to any specific class of SI problems, MP will be integral as the glue that ties the other motifs together, serving as a bridge to the hardware.
The term machine programming made its debut in a paper by Gottschlich et al. [gottschlich:2018:mapl] that defined MP in terms of three distinct pillars: (i) intention, (ii) invention, and (iii) adaptation (Fig. 30). Intention is principally concerned with identifying novel ways, and simplifying existing ways, for users to express their ideas to a machine. It is also concerned with lifting meaning (i.e., semantics) from existing software and hardware systems [ahmad:2019:siggraph, kamil:2016:pldi, lee:2021:plp]. Invention aims to construct the higher-order algorithms and data structures necessary to fulfill a user's intention. In many cases, inventive systems simply fuse existing components together to form a novel solution that fulfills the user's intent. However, there are some corner cases, likely to become more prevalent in the future, where an inventive system constructs a truly novel solution not yet discovered by humans (e.g., Alam et al.'s inventive performance regression testing system [alam:2019:neurips] and Li and Malik's inventive optimization algorithm system [li:2017:iclr]). Adaptation takes an invented higher-order program and adapts it to a specific hardware and software system, ensuring that quality characteristics such as correctness, performance, security, and maintainability are maintained.
If trends continue, we anticipate programming languages will become more intentional by design. The byproduct of such intentionality is that these languages will be capable of more holistic automation of the inventive and adaptive pillars of MP, while simultaneously achieving superhuman levels in key quality characteristics (e.g., software performance, security, etc.). Early evidence of this has emerged with Halide [adams:2019:siggraph], a programming language (more precisely, a domain-specific language embedded in C++) designed for high-performance image and array processing code on heterogeneous modern hardware (e.g., various CPU and GPU architectures) and operating systems (Linux, Android, etc.). For example, Halide, in concert with verified lifting, has demonstrably improved the performance of Adobe Photoshop by a geometric-mean speedup across approximately 300 transpiled functions [ahmad:2019:siggraph]. Moreover, other forms of intentional communication from humans to machines are already emerging, such as translation from natural language to programming language and translation of visual diagrams to software programs; two embodiments of such systems are GitHub's Copilot [Chen2021EvaluatingLL] and Ellis et al.'s automated synthesis of code for visual diagrams [ellis:2018:neurips].

In machine programming, there appear to be at least two fundamental approaches to automation. The first is statistical, which consists of methods from machine learning and probabilistic analysis [alam:2019:neurips, hasabnis:2021:maps]. The second is rules-based, which consists of formal-methods approaches that include techniques such as satisfiability (SAT) solvers and satisfiability modulo theories (SMT) solvers [alur:2013:fm, kamil:2016:pldi] – SAT is the problem of determining whether there exists an interpretation that satisfies a given Boolean formula, and SMT generalizes SAT to more complex formulas involving real numbers, integers, and/or various data structures.
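For intuition only, satisfiability can be checked by brute-force enumeration over truth assignments; real MP systems rely on industrial SAT/SMT engines rather than enumeration. A toy Python sketch, with a simple clause encoding of our own choosing (positive integers for literals, negatives for negations):

```python
from itertools import product

def satisfiable(clauses, n_vars):
    """Brute-force SAT check over a CNF formula.

    Each clause is a list of literals: +v means variable v is true,
    -v means it is negated. The formula is the AND of OR-clauses.
    """
    for assignment in product([False, True], repeat=n_vars):
        def holds(lit):
            value = assignment[abs(lit) - 1]
            return value if lit > 0 else not value
        if all(any(holds(lit) for lit in clause) for clause in clauses):
            return True
    return False

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
sat = satisfiable([[1, 2], [-1, 3], [-2, -3]], 3)
unsat = satisfiable([[1], [-1]], 1)     # x1 and not x1: contradiction
```

Enumeration is exponential in the number of variables, which is precisely why rules-based MP leans on engineered solvers with clause learning and theory reasoning.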
As the field of MP matures, we are already seeing the emergence of systems that consist of both statistical and rules-based elements. A concrete example is Solar-Lezama et al.'s work fusing formal sketching techniques (where a program sketch is the schematic outline of a full program) with deep neural networks to improve computational tractability [nye:2019:infer]. Another example is the Machine Inferred Code Similarity (MISIM) system by Ye et al., an automated engine for quantifying the semantic similarity between two pieces of code, which uses a deterministic and configurable context-aware semantics structure (CASS) in conjunction with a learned similarity scoring system driven by a Siamese deep neural network with an extensible neural backend [ye:2020:misim].
The Department of Energy's 2020 report on Synthesis for Scientific Computing [Lopes2021PROGRAMSF] provides an overview of where and how MP can have significant impact on scientific programming. In general, the consortium of experts anticipates order-of-magnitude increases in the productivity of scientific programming, achieved by making all parts of the software life cycle more efficient, including reducing the time spent tracking down software quality defects such as those concerning correctness, performance, security, and portability [Alam2019AZL, Alam2016ProductionRunSF]. It is also believed that MP will enable scientists, engineers, technicians, and students to produce and maintain high-quality software as part of their problem-solving process without requiring specialized software development skills – a requirement that today creates choke points in organizations where researchers and engineers work in disconnected capacities.
Future directions
One of the core directions the field of MP is heading towards is intentional programming. Intentional programming is principally concerned with separating the intention of the program from the components that are used to realize that intention. A critical reason for this separation of concerns is that it tends to enable the machine to more exhaustively explore the inventive and adaptive components of software development – the programmer focuses on supplying the core ideas, while the machine handles the other two MP pillars, invention and adaptation.
A concrete embodiment of this is seen in the Halide programming language. Halide was designed with two primary components for two different types of programmers: (i) an algorithmic DSL for domain experts and (ii) a scheduling component for programmers with a deep understanding of building optimal software for a given hardware target. Due to the separation of the algorithm and the schedule, the Halide team was able to automate the scheduling component to automatically extract performance. This is currently done with the Halide autoscheduler using a learned cost model, which can synthesize more performant software (in the form of code, not neural networks) than the best humans it was pitted against, including the individuals who developed the language itself [adams:2019:siggraph].
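The separation of concerns can be shown with a small, language-agnostic sketch (in Python purely for illustration – Halide itself is a C++-embedded DSL): the algorithm defines what each output value is, while interchangeable schedules decide the evaluation order without changing the result.

```python
def blur_at(image, i):
    """The 'algorithm': defines the value at each point (a 3-tap box blur,
    clamped at the edges), with no commitment to evaluation order."""
    n = len(image)
    return (image[max(i - 1, 0)] + image[i] + image[min(i + 1, n - 1)]) / 3.0

def schedule_naive(image):
    # Schedule 1: evaluate points in order.
    return [blur_at(image, i) for i in range(len(image))]

def schedule_tiled(image, tile=4):
    # Schedule 2: evaluate tile by tile (a stand-in for cache-friendly ordering).
    out = [0.0] * len(image)
    for start in range(0, len(image), tile):
        for i in range(start, min(start + tile, len(image))):
            out[i] = blur_at(image, i)
    return out

img = [float(i % 5) for i in range(10)]
assert schedule_naive(img) == schedule_tiled(img)  # schedules change cost, not meaning
```

Because every legal schedule computes the same values, an autoscheduler is free to search over orderings (tiling, vectorization, parallelism in real Halide) guided only by a cost model.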
We anticipate the field of MP will help accelerate the development and deployment of the SI stack. Referring back to the SI operating system conceptualization in Fig. 1, without MP one can imagine bottom-up constraints on the types of data structures and algorithms that can be implemented above the hardware layers. This is not unlike a “hardware lottery”, where a research idea or technology succeeds and trumps others because it is suited to the available software and hardware, not necessarily because it is superior to alternative directions [Hooker2021TheHL]. Historically this has been decisive in the paths pursued by the largely disconnected hardware, systems, and algorithms communities; the past decade provides an industry-shaping example with the marriage of deep learning and GPUs. The various mechanisms of MP, however, provide an opportunity to mitigate such hardware lottery effects for SI, and broadly in the AI and software fields. With MP in the SI operating system, as a motif that interfaces hardware and software, the integration of SI motifs can become far more efficient (automatic HW-SW optimizations and co-design), robust (less prone to human and stochastic errors), and scalable (flexibly deployable modules from HPC to the edge).
Simulation Intelligence Themes
We first discuss several common themes throughout the SI motifs, namely the utility of the motifs for solving inverse problems, the advantages (and scientific necessities) of implementing uncertainty-aware motif methods, synergies of human-machine teaming, and additional value to be realized from integrating motifs in new ways. Each of these themes may prove significant in the development of SI-based scientific methods. The discussion then continues to cover important elements for SI in practice, critically the data engineering and high-performance computing elements of SI at scale.
Inverse-Problem Solving
An inverse problem is one of inferring hidden states or parameters from observations or measurements. If the forward simulation (or data-generation) process is $y = f(x)$, we seek to infer $x$ given $y$. In scientific settings, an inverse problem is the process of calculating from observations the causal factors that produced them. One typical example is computational image reconstruction, including MRI reconstruction, CT tomographic reconstruction, etc., as shown in Fig. 32. Most reconstruction methods can be classified as direct, analytical, or iterative reconstruction [wang2020deep]. Based on available knowledge and information, a forward model, either a simulation model or a mathematical/physical model, can be derived to predict data given an underlying object. Reconstruction is then addressed by computing the mathematical inverse of the forward model via nonlinear programming and optimization with regularization. Recently, deep learning based approaches provide a more promising way to solve inverse problems when conventional mathematical inversion is challenging: rather than relying entirely on an accurate physics or mathematical model, these data-driven machine learning methods can leverage large datasets to learn the inverse mapping end-to-end. Beyond the classic inverse problems, such as image reconstruction, inverse scattering, and computed tomography, much recent effort has been dedicated to the design of new materials and devices, and to the discovery of novel molecular, crystal, and protein structures via an inverse process. This can be described as a problem of inverse design: discovering the optimal structures or parameters of a design based on the desired functional characteristics. For example, computational inverse design approaches are finding many use-cases in optimizing nanophotonics [Molesky2018InverseDI, So2020DeepLE, zhang2021directional] and crystal structures [corpinot2018practical, noh2019inverse, noh2020machine, fung2021inverse]. Fig. 33 illustrates an example of how inverse design cycles can replace the existing deductive paradigm, which can be expensive and inefficient and yield suboptimal solutions. In vast, complex spaces like materials science, inverse design can yield de novo molecules, foreshadowing what could be an SI-driven, inductive scientific method.
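As a minimal concrete instance of inverting a known forward model, consider a linear operator with Tikhonov (ridge) regularization; the operator `A` and the noise level here are hypothetical stand-ins for, e.g., a tomographic projection matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))                    # hypothetical linear forward operator
x_true = rng.normal(size=10)                     # hidden state to recover
y_obs = A @ x_true + 0.01 * rng.normal(size=30)  # noisy measurements

# Tikhonov-regularized inversion: argmin_x ||A x - y||^2 + lam ||x||^2,
# solved in closed form via the regularized normal equations.
lam = 1e-3
x_hat = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y_obs)

err = np.linalg.norm(x_hat - x_true)  # close to the noise floor
```

The regularization term is what tames ill-posedness: without it, small measurement noise can be amplified arbitrarily when `A` is poorly conditioned.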
Inverse design differs from conventional inverse problems and poses several unique challenges in scientific exploration. First, the forward operator is not explicitly known; in many cases it is modeled by first-principles calculations, including molecular dynamics and density functional theory (DFT). This makes it difficult for inverse design to leverage recent advances in solving inverse problems, such as MRI reconstruction [wang2020deep] or inpainting on images through generative models with large datasets [asim2020invertible]. Second, the search space is often intractably large. For example, small drug-like molecules have been estimated to span between $10^{23}$ and $10^{60}$ unique cases [ertl2009estimation], while solid materials have an even larger space. This presents obvious obstacles to global search methods, such as Bayesian optimization [snoek2015scalable], evolutionary strategies [hansen2003reducing, salimans2017evolution, zhangenabling], or Bayesian inference via Markov chain Monte Carlo (MCMC), since each method is prohibitively slow for high-dimensional inverse problems. Third, we encounter multimodal solutions with one-to-many mappings; in other words, there are multiple solutions that match the desired property. This leads to difficulties in pursuing all possible solutions through gradient-based optimization, which converges to a single deterministic solution and is easily trapped in local minima. Probabilistic inference also has limitations in approximating complex posteriors, which may cause simple point estimates (e.g., the maximum a posteriori (MAP) estimate) to yield misleading solutions [sun2020deep]. Generally speaking, inverse design methods include Bayesian and variational approaches, deep generative models, invertible learning, and surrogate-based optimization.
Bayesian and variational approaches. From the inference perspective, solving inverse design problems can be achieved by estimating the full posterior distributions of the parameters conditioned on a target property. Bayesian methods, such as approximate Bayesian computation [yang2018predictive], are ideal choices for modeling the conditional posterior, but this idea still encounters various computational challenges in high-dimensional cases. An alternative choice is variational approaches, e.g., conditional GANs [wang2018high] and conditional VAEs [sohn2015learning], which enable efficient approximation of the true posterior by learning the transformation between latent variables and parameter variables. However, the direct application of both conditional generative models to inverse design is challenging because a large amount of data is typically required [tonolini2020variational].
Deep generative models. Many recent efforts have been made on solving inverse problems via deep generative models [asim2020invertible, whang2021composing, whang2021solving, daras2021intermediate, sun2020deep, kothari2021trumpets, zhangflow]. For example, Asim et al. [asim2020invertible] focus on producing a point estimate motivated by the MAP formulation, while [whang2021composing] aims at full distributional recovery via variational inference. A follow-up study [whang2021solving] addresses image inverse problems with a normalizing-flow prior, proposing a formulation that views the solution as the maximum a posteriori estimate of the image conditioned on the measurements. For MRI or inpainting on images, strong baseline methods exist that benefit from an explicit forward operator [sun2020deep, asim2020invertible, kothari2021trumpets].
Surrogate-based optimization. Another approach is to build a neural network surrogate and then conduct surrogate-based optimization via gradient descent. This is more common in scientific and engineering applications [forrester2009recent, gomez2018automatic, white2019multiscale, xue2020amortized, brookes2019conditioning], specifically for inverse design purposes. The core challenge is that the forward model is often time-consuming, so a faster surrogate enables an otherwise intractable search. A recent study in the scope of surrogate-based optimization is the neural-adjoint (NA) method [ren2020benchmarking], which directly searches the global space via gradient descent starting from random initializations, such that a large number of iterations are required to converge and its solutions are easily trapped in local minima [deng2021neural]. Although the neural-adjoint method boosts performance by down-selecting the top solutions from multiple starts, the computational cost is significantly higher, especially for high-dimensional problems [khatib2021deep].
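A sketch of the multi-start, down-selecting strategy in the spirit of the neural-adjoint method; an analytic toy function stands in for a trained neural surrogate, and the function, learning rate, and start count are illustrative assumptions, not the published method.

```python
import numpy as np

# Analytic toy standing in for a trained differentiable surrogate f: design -> property.
def surrogate(x):      return np.sin(3 * x) + x**2
def surrogate_grad(x): return 3 * np.cos(3 * x) + 2 * x

def inverse_design(n_starts=64, steps=300, lr=0.05, top_k=5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3, 3, size=n_starts)    # many random initializations
    for _ in range(steps):
        x = x - lr * surrogate_grad(x)       # descend the surrogate landscape
    order = np.argsort(surrogate(x))
    return x[order[:top_k]]                  # down-select the best solutions

candidates = inverse_design()
```

Each start can fall into a different local minimum; keeping the top-k ranked by surrogate value is exactly the down-selection step, and is why the cost grows with dimensionality and the number of restarts.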
Invertible models and learning. Flow-based models [rezende2015variational, dinh2016density, kingma2018glow, grathwohl2018ffjord, wu2020stochastic, nielsen2020survae] may offer a promising direction for inferring the posterior by training invertible architectures. Some recent studies have leveraged this unique property of invertible models to address several challenges in solving inverse problems [ardizzone2018analyzing, ardizzone2019guided, kruse2021benchmarking]. However, these existing inverse-model approaches suffer from limitations [ren2020benchmarking] in fully exploring the parameter space, leading to missed potential solutions, and fail to precisely localize the optimal solutions due to noisy solutions and inductive errors, specifically in materials design problems [deng2020neural]. To address these challenges, Zhang et al. [zhanginverse] propose a novel approach to accelerate the inverse design process by leveraging probabilistic inference from deep invertible models together with deterministic optimization via gradient descent. Given a target property, the learned invertible model provides a posterior over the parameter space; these posterior samples serve as an intelligent prior initialization that narrows the search space. A gradient descent is then performed to calibrate the inverse solutions within a local region. Meanwhile, a space-filling sampling [shields2016generalization] is imposed on the latent space to better explore and capture all possible solutions. More discussion of flow-based models follows in the context of “honorable mention motifs” later in this section.
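The mechanism that makes flow-based models attractive here, exact invertibility of every layer, can be illustrated with a single affine coupling layer; the `conditioner` below is a toy stand-in for a learned network.

```python
import numpy as np

def coupling_forward(x, conditioner):
    # Split the input; transform the second half conditioned on the first half.
    x1, x2 = x[:len(x) // 2], x[len(x) // 2:]
    s, t = conditioner(x1)
    y2 = x2 * np.exp(s) + t          # invertible affine transform
    return np.concatenate([x1, y2])

def coupling_inverse(y, conditioner):
    y1, y2 = y[:len(y) // 2], y[len(y) // 2:]
    s, t = conditioner(y1)           # same conditioner: y1 == x1 by construction
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

# Toy conditioner standing in for a learned scale/shift network (hypothetical).
def conditioner(h):
    return np.tanh(h), 0.5 * h

x = np.array([1.0, -2.0, 0.3, 0.7])
y = coupling_forward(x, conditioner)
x_rec = coupling_inverse(y, conditioner)
assert np.allclose(x, x_rec)
```

Because the inverse is exact and cheap, a trained stack of such layers maps a target property back to a distribution over designs without any iterative inversion.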
Across the sciences one can consider the simulation as implicitly defining the distribution $p(x, z \mid \theta)$, where $x$ refers to the observed data, $z$ are unobserved latent variables that take on random values inside the simulation, and $\theta$ are parameters of the forward model. As we have suggested, there exist many use-cases and workflows for the inverse, where one wishes to infer $z$ or $\theta$ from the observations $x$. The inverse design loop in Fig. 33 represents one class of this problem in a practical application. In many cases the solution to the inverse problem is ill-posed [Carleo2019MachineLA]: small changes in observed outputs lead to large changes in the estimate, implying the inverse problem solver will have high variance. For this reason, and to meet the needs of scientific applications in general, we often require uncertainty quantification, specifically given limited datasets [zhang2018quantification, zhang2018effect, zhang2020quantification] from high-fidelity simulations or experiments. Moreover, to fully assess the diversity of possible inverse solutions for a given measurement, an inverse solver should be able to estimate the complete posterior of the parameters (conditioned on an observation). This makes it possible to quantify uncertainty, reveal multimodal distributions, and identify degenerate and unrecoverable parameters, all highly relevant for applications in science and engineering [Ardizzone2019AnalyzingIP]. The challenge of inverse problems is also key in the context of causal discovery in science: Zenil et al. [nmi] describe formalisms connecting these challenges, based on the foundations of algorithmic probability, a framework in which the parallel simulation of a large number of computable programs is used as a generative model.
Uncertainty Reasoning
How to design a reliable system from unreliable components has been a guiding question in the fields of computing and intelligence [Neumann1956ProbabilisticLA]. In the case of simulators and AI systems, we aim to build reliable systems with myriad unreliable components: noisy and faulty sensors, human and AI error, unknown or incomplete material properties, boundary conditions, and so on. There is thus significant value to quantifying the myriad uncertainties, propagating them throughout a system, and arriving at a notion or measure of reliability.
Many of the recommended methods for each motif are in the class of probabilistic ML, first and foremost probabilistic programming, which naturally represents and manipulates uncertainty about models and predictions [Ghahramani2015ProbabilisticML]. Probabilistic ML encompasses Bayesian methods that use probabilities to represent aleatoric uncertainty, measuring the noise inherent in the observations, and epistemic uncertainty, accounting for uncertainty in the model itself (i.e., capturing our ignorance about which model generated the data). For example, Zhu & Zabaras [Zhu2018BayesianDC] develop a probabilistic encoder-decoder neural network to quantify the predictive uncertainty in fluid dynamics surrogate models, and recent extensions to physics-informed ML methods aim to implement Bayesian counterparts, such as the Bayesian Neural ODE [Dandekar2020BayesianNO] and Bayesian PINN [Yang2021BPINNsBP]. Interestingly, some findings suggest that typical NN uncertainty estimation methods such as MC-dropout [Gal2016DropoutAA] do not work well with PDEs, which runs counter to other PINN uncertainty quantification work [Zhang2019QuantifyingTU], implying there is still much work to be done in this area. Gaussian processes, on the other hand, used in the same physics-modeling and emulation scenarios, quantify aleatoric and epistemic uncertainties naturally [Bilionis2015BayesianUP].
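The aleatoric/epistemic split can be made concrete with conjugate Bayesian linear regression, where the predictive variance decomposes into irreducible observation noise plus parameter uncertainty that shrinks with data; the prior precision and noise level are assumed known for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, alpha = 0.3, 1.0                  # assumed noise std and prior precision
X = rng.uniform(-1, 1, size=(20, 1))
y = 2 * X[:, 0] + sigma * rng.normal(size=20)   # true weight = 2, plus noise

# Conjugate Gaussian posterior over the single weight w: N(mu, S).
S = np.linalg.inv(alpha * np.eye(1) + X.T @ X / sigma**2)
mu = S @ X.T @ y / sigma**2

x_star = np.array([[0.5]])
epistemic = float(x_star @ S @ x_star.T)  # parameter uncertainty; shrinks with more data
aleatoric = sigma**2                       # irreducible observation noise
total_var = epistemic + aleatoric          # predictive variance at x_star
```

With only 20 points the epistemic term is already small relative to the noise floor; collecting more data drives it toward zero, while the aleatoric term never shrinks, which is exactly the distinction the Bayesian surrogate methods above exploit.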
Probabilistic numerics [Hennig2015ProbabilisticNA, Poincare1896] is a highly relevant class of methods that provide statistical treatment of the errors and/or approximations that are made en route to the output of deterministic numerical methods, such as the approximation of an integral by quadrature, or the discretized solution of an ordinary or partial differential equation [Oates2019AMR]. The many numerical algorithms we have discussed in this paper estimate quantities that are not directly computable by using the results of more readily available computations, and the probabilistic numerics viewpoint provides a principled way to manage the parameters of such numerical computations. By comparison, epistemic uncertainty arises from the setup of the computation rather than the computation itself, and aleatoric uncertainty is concerned with the data. Hennig et al. [Hennig2015ProbabilisticNA] describe this class of methods in detail, along with practical uses such as Bayesian quadrature in astronomy. The more recent overview by Oates & Sullivan [Oates2019AMR] provides further context, including the suggestion that probabilistic programming will be critical for the integration of probabilistic numerics theory and software, and for enabling the tools from functional programming and category theory to be exploited in order to automatically compile codes built from probabilistic numerical methods (e.g., Ref. [Scibior2015PracticalPP]).
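To illustrate the spirit of attaching uncertainty to a numerical result, here is a deliberately simple sketch: a Monte Carlo quadrature whose output is reported with a standard-error estimate rather than as a bare number. True probabilistic numerics (e.g., Bayesian quadrature) places a prior such as a GP on the integrand itself; this is only a minimal proxy for the idea.

```python
import numpy as np

# Estimate I = ∫_0^1 exp(x) dx = e - 1, reporting the numerical result
# together with a statistical error estimate instead of a point value.
rng = np.random.default_rng(0)
n = 10_000
samples = np.exp(rng.uniform(0, 1, n))

estimate = samples.mean()                      # the quadrature output
std_err = samples.std(ddof=1) / np.sqrt(n)     # uncertainty in that output
truth = np.e - 1
```

A downstream computation can then propagate `std_err` rather than silently treating `estimate` as exact, which is the core shift of perspective probabilistic numerics advocates.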
In nearly all experimental sciences at all scales, from particle physics to cosmology, an important use of machine learning and simulation is to generate samples of labeled training data. For example, in particle physics, when the target $y$ refers to a particle type, particular scattering process, or parameter appearing in the fundamental theory, it can often be specified directly in the simulation code so that the simulation directly samples $x \sim p(x \mid y)$ [Carleo2019MachineLA]. In some cases this may not be feasible in the simulation code, but the simulation may provide samples $(x, z) \sim p(x, z \mid y)$, where $z$ are latent variables that describe what happened inside the simulation, but which are not observable in an actual experiment. Either case enables researchers to generate accurately labeled training data. Yet while the data-generating simulator has high fidelity, the simulation itself has free parameters to be tuned, and residual uncertainties in the simulation must be taken into account in downstream tasks. Comprehensively quantifying uncertainties in data-generators and simulators in general remains an area of active research; notably it must be interdisciplinary, as the uncertainty measures for ML do not necessarily suffice for the physical sciences, and vice versa.
We have implied that the development of robust AI and simulation systems requires careful consideration of dynamic combinations of data, software, hardware, and people [lavin2021technology]. AI and simulation systems in the sciences and in decision-making often construct human-machine teams and other variations of cooperative AI [Dafoe2020OpenPI], where it is valuable to calculate the various uncertainties in communications between combinations of humans, machines, agents, and sources of data and information broadly. Uncertainty-aware cooperative AI is a relatively new area of research, which we touch on more in the human-machine teaming section.
An additional direction for our work on uncertainty reasoning is to investigate how the motifs provide more powerful uncertainty methods than those currently used in probabilistic ML, for instance the use of differentiable programming (more specifically autodiff) towards automatic uncertainty quantification. Models built in probabilistic programming frameworks are particularly useful because not only is uncertainty quantification straightforward, but the modeler is also able to encode their expected uncertainties in the form of priors. Modern neural networks, on the other hand, are often miscalibrated (in the sense that their predictions are typically overconfident) due to ignoring epistemic uncertainty: NNs are typically underspecified by the data and thus can represent many different models that are consistent with the observations. Uncertainty estimation methods for NNs exist, although some research has demonstrated difficulties in translating these methods to physics-informed ML use-cases [Yang2021BPINNsBP]. One promising direction is uncertainty-driven open-ended search. We have described the future benefits in optimization settings, and also suggest these same methods may be key in the hypothesis-generation part of automating the scientific method, pursuits that Kitano [Kitano2021NobelTC] describes, and in helping shape new scientific methods (for instance, built on principles of uncertainty and complexity, rather than falsifiability).
Integrations
Throughout this paper we’ve suggested there is significant and valuable overlap between the module motifs: for instance, the use of multiscale and agent-based modeling in applications such as systems biology and sociopolitics, the use of physics-infused ML for building surrogate models in multiphysics scenarios, the use of simulation-based causal discovery in agent-based network models, and more. We’ve also discussed some of the many ways that engine motifs integrate with and enable the module motifs. For instance, probabilistic programming languages (PPL) provide useful representations for multimodal multiscale simulation [Pfeffer2009FigaroA], agent-based modeling [agentmodels], semi-mechanistic modeling [Lavin2018DoublyBO], and causal inference [Perov2019MultiVerseCR]. Another type of integration would be using probabilistic programs within the broader workflow of a module motif: e.g., within simulation-based inference (not to mention “human-machine inference” broadly) as shown in Fig. 10, or in scenarios for open-endedness. The same goes for the other engine motif, differentiable programming: semi-mechanistic modeling, which is critical in multiphysics simulation amongst other scenarios, is only practical because of differentiable programming. And as we explained earlier, differentiable programming is a generalization of deep learning, and these neural network methods play significant roles in many motifs and workflows, e.g., multiphysics modeling and inverse design. Not to mention, at the engine level, a common approach for implementing a PPL is to embed stochastic macros within an existing language such as Python or C++, utilizing automatic differentiation features from differentiable programming.
And there are valuable implications for leveraging both in the development of machine programming and open-endedness: for instance, auto-differentiating software to better enable learning and optimization in intentional programming, and utilizing PPL to readily encode and tune priors in probabilistic generative models for open-ended environments. Naturally, both machine programming and open-endedness will make heavy use of the module motifs such as causal discovery, ABM, and multiphysics and multiscale modeling.
The Nine Motifs of Simulation Intelligence (SI) are useful independently, yet when integrated they can provide synergies for advanced and accelerated science and intelligence. It is through these motif synergies that human-machine teams and AI agents can broaden the scientific method and potentially expand the space of achievable knowledge (Fig. 29). Implementation and some integration of the methods in existing machine learning frameworks and programming languages is doable, especially with growing ecosystems around frameworks based on autodiff and vector operations (notably PyTorch [Paszke2017AutomaticDI] and TensorFlow [Abadi2016TensorFlowAS]), as well as probabilistic programming systems at multiple levels of abstraction (such as the high-level Pyro [Bingham2019PyroDU] and PyMC3 [Salvatier2016ProbabilisticPI] based on Python, and the lower-level Gen [CusumanoTowner2019GenAG] based on Julia [bezanson2017julia]). For future work, we aim to develop a more purpose-built framework for integrating the motifs. It would be advantageous to use a functional paradigm where models are represented (implicitly or explicitly) as functions; in homage to Pearl, functions could be pi functions (generative direction), lambda functions (inference direction), or coupled pairs of pi and lambda functions. Functional representations lend themselves naturally to composition, abstraction, and multiscale reasoning, and SI methods work by analyzing these functions. Such a framework would bring multiple motifs together by composing these analyses, e.g., probabilistic programming, multiscale physics, differentiable programming, and causal discovery used together in a principled way in the same application. In general this can provide a unifying framework for machine intelligence, enabling the integration of our best methods for learning and reasoning.
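One way the proposed pi/lambda pairing could look in a functional framework (the names and interfaces are purely hypothetical): a model carries both a generative and an inference direction, and composition composes both coherently.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    pi: Callable   # latent -> observation (generative direction)
    lam: Callable  # observation -> latent (inference direction)

    def compose(self, other: "Model") -> "Model":
        # Generate through self then other; infer back through other then self.
        return Model(pi=lambda z: other.pi(self.pi(z)),
                     lam=lambda x: self.lam(other.lam(x)))

# Toy deterministic models whose lambda exactly inverts their pi.
double = Model(pi=lambda z: 2 * z, lam=lambda x: x / 2)
shift  = Model(pi=lambda z: z + 1, lam=lambda x: x - 1)

m = double.compose(shift)
assert m.pi(3) == 7 and m.lam(7) == 3
```

In a real framework the lambda direction would be an approximate posterior rather than an exact inverse, but the compositional structure, with analyses operating on the function pair, is the same.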
Next we highlight several more motif integrations, amongst the many synergies through the SI stack to further develop:
Causal reasoning with probabilistic programs
The execution of a simulator results in one or more simulation traces, which represent various paths of traversing the state space given the underlying data-generating model, which is not unlike the execution traces resulting from probabilistic programming inference. Formalizing the sequence of simulator states as a sequence of (probabilistic) steps, we can define a trace-based definition of causal dependency, which we introduce in an example below. With a simulator that produces a full posterior over the space of traces, as in probabilistic programming, we can attempt to draw causal conclusions with causal inference and discovery algorithms.
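A minimal sketch of trace-based causal dependency: a toy simulator records each stochastic choice into a trace, and forcing one choice (here `rain`, in a hypothetical rain/sprinkler model) reveals which downstream trace entries depend on it.

```python
import random

def simulator(rng, overrides=None):
    """Run the toy simulation, recording every stochastic choice in a trace."""
    overrides = overrides or {}
    trace = {}
    def sample(name, dist):
        trace[name] = overrides[name] if name in overrides else dist()
        return trace[name]
    rain  = sample("rain",      lambda: rng.random() < 0.3)
    sprin = sample("sprinkler", lambda: (not rain) and rng.random() < 0.5)
    wet   = sample("wet_grass", lambda: rain or sprin)
    return trace

base   = simulator(random.Random(0))
forced = simulator(random.Random(0), overrides={"rain": True})

# Trace-based dependency: the entries that changed under the forced choice
# are (in this run) causally downstream of "rain".
changed = {k for k in base if base[k] != forced[k]}
```

With the same seed, comparing the two traces isolates the effect of the single forced choice, which is the essence of reading causal dependencies off execution traces.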
As alluded to in the probabilistic programming motif, the abstractions of probabilistic programming languages (PPL) provide ample opportunity to integrate the mathematics of causality with machine learning, which thus far has been nontrivial to say the least [Scholkopf2021TowardCR, HernndezOrozco2020AlgorithmicPM]. There are several non-mutually-exclusive ways we suggest viewing this integration:

A probabilistic program is fundamentally a simulator that represents the data-generating process. As mentioned above, to address causal queries we must know something about the data generation process. A promising approach is to build a probabilistic program in Box’s model-inference-criticism-repeat loop [Blei2014BuildCC] to arrive at a model that robustly represents the true data-generating process, and use this data generator to infer causal relationships in the system – see Fig. 14. Said another way, one can conceptualize a structural causal model as a program for generating a distribution from independent noise variables through a sequence of formal instructions [Hardt2021PatternsPA], which also describes a probabilistic program. The Omega PPL is a newer framework that provides initial steps towards this type of causal interpretation [Tavares2019SoftCF]. A resource-bounded approach to generating computable models was also introduced in the context of algorithmic information dynamics [nmi], leading to large sets of generative models sorted by algorithmic likelihood.

PPL by definition must provide the ability to draw values at random from distributions, and the ability to condition the values of variables in a program via observations [Gordon2014ProbabilisticP]. These two constructs are typically implemented within a functional or imperative host language as sample and observe macros, respectively. Similarly, one can readily implement the do-operator from the aforementioned do-calculus of causality [Pearl1995CausalDF, Tucci2013IntroductionTJ], to force a variable to take a certain value or distribution. This allows simulating from interventional distributions, provided the structural causal model (including the governing equations) is known. Witty et al. [Witty2019BayesianCI] have provided intriguing work in this direction, although limited in both expressiveness and scope; more development in robust PPL is needed. In general, via do-calculus or otherwise, the abstractions of PPL enable interventions, notably in the context of PPL-based simulators: the simulation state is fully observable, proceeds in discrete time steps [WoodAISTATS2014], and can be manipulated such that arbitrary events can be prevented from occurring while the simulation is running.
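A bare-bones illustration of the do-operator outside any real PPL: the intervention replaces a variable's structural mechanism with a constant, so interventional samples differ from observational ones. The two-variable SCM here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def scm(n=10_000, do=None):
    """Toy SCM x -> y; do={'x': v} replaces x's mechanism with the constant v."""
    do = do or {}
    u_x, u_y = rng.normal(size=n), rng.normal(size=n)   # exogenous noise
    x = np.full(n, float(do["x"])) if "x" in do else u_x
    y = 2 * x + u_y                                      # y's mechanism is untouched
    return x, y

_, y_obs = scm()                  # observational world: E[y] ~ 0
_, y_int = scm(do={"x": 1.0})     # interventional world: E[y | do(x=1)] ~ 2
```

Note the intervention severs `x` from its own noise `u_x` while leaving the downstream mechanism for `y` intact, which is exactly what distinguishes `do(x=1)` from conditioning on the observation `x=1`.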

PPL provide the mechanisms to easily distinguish intervention from observational inference, and thus implement counterfactual reasoning. A counterfactual query asks “what if”: what would have been the outcome had I changed an input? More formally: what would have happened in the posterior representation of a world (given observations) if in that posterior world one or more things had been forced to change? This is different from observational inference as it (1) fixes the context of the model to a “posterior world” using observational inference, but (2) then intervenes on one or more variables in that world by forcing each of them to take a value specified by the programmer or user. Interventions in the “posterior” world can cause variables (previously observed or otherwise) to take new values (i.e., “counter” to their observed value, or their distribution). Thus for counterfactual reasoning, we build a probabilistic program that combines both conditioning and causal interventions to represent counterfactual queries such as: given that X is true, what if Y were the case? Recent extensions to the Omega language mentioned above aim to provide counterfactual queries via probabilistic programming, again with a version of Pearl’s do-operator [Tavares2019ALF].
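The abduction-action-prediction recipe behind such counterfactual queries can be shown on a two-variable SCM with the mechanisms written out explicitly; the toy mechanisms are assumed for illustration.

```python
# Mechanisms (assumed): x = u_x ;  y = 2*x + u_y
x_obs, y_obs = 1.0, 3.0           # the observed world

# 1) Abduction: recover the exogenous noise consistent with the observation
#    (this is the "fix the posterior world" step).
u_x = x_obs                        # = 1.0
u_y = y_obs - 2 * x_obs            # = 1.0

# 2) Action: intervene in that world, do(x := 2).
x_cf = 2.0

# 3) Prediction: replay the mechanisms with the abduced noise held fixed.
y_cf = 2 * x_cf + u_y              # "y would have been 5"
```

Steps (1) and (3) correspond to conditioning and forward simulation, both native PPL operations, with the do-operator supplying step (2) in between.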
Causal surrogates and physics-infused learning
Similar to the perspective that a probabilistic program encodes a causal data-generating process, one can view a system of differential equations as encoding the causal mechanisms of physical systems. To be more precise, consider the system of differential equations $\frac{dx}{dt} = f(x(t))$, where $x(t) \in \mathbb{R}^d$, $f : \mathbb{R}^d \to \mathbb{R}^d$, and $x(t_0) = x_0$. The Picard–Lindelöf theorem [Nevanlinna1989RemarksOP] states there is a unique solution as long as $f$ is Lipschitz, implying that the immediate future of $x$ is implied by its past values. With the Euler method we can express this in terms of infinitesimal differentials: $x(t + dt) = x(t) + dt \cdot f(x(t))$. This represents a causal interpretation of the original differential equation, as one can determine which entries of the vector $x(t)$ cause the future of other entries [Scholkopf2019CausalityFM]. A generalization of this approach in the context of algorithmic probability covers all computable functions, including differential equations and mathematical models [nmi]. And in [HernndezOrozco2020AlgorithmicPM], it was shown that differentiability is not needed to perform an effective search optimization. The implication is then that the algorithmic or physics-infused ML methods we discussed can represent the causal dynamics of physical systems, whereas neural networks fundamentally cannot, and that differentiability is not the essential feature believed to allow neural networks to minimize loss functions – see Table 1, and also Mooij et al. [Mooij2013FromOD] for a formal link between physical models and structural causal models. This provides theoretical grounding that algorithmic and physics-infused ML can be used as causal surrogate models, which are a more accurate representation of the true causal (physical) processes than universal function approximators (including physics-informed neural networks). This insight also suggests that using algorithmic physics-infused surrogates may provide grounded counterfactual reasoning. We highlight several specific approaches:
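This causal reading of the Euler discretization can be made executable: clamping one state variable during integration is precisely an intervention, and its downstream effect shows which entries drive which. The two-variable system below is an illustrative assumption.

```python
import numpy as np

def f(x):
    # Toy coupled system in which x0 drives x1: dx0/dt = -0.5*x0, dx1/dt = x0 - x1.
    return np.array([-0.5 * x[0], x[0] - x[1]])

def euler(x0, dt=0.01, steps=500, clamp=None):
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * f(x)             # x(t + dt) = x(t) + dt * f(x(t))
        if clamp is not None:         # intervention: hold one entry fixed
            i, v = clamp
            x[i] = v
        traj.append(x.copy())
    return np.array(traj)

free    = euler([1.0, 0.0])                   # natural dynamics: x0 pumps x1
clamped = euler([1.0, 0.0], clamp=(0, 0.0))   # do(x0 := 0): x1's driver removed
```

Because `x0` appears in the update for `x1` (and not vice versa), clamping `x0` suppresses `x1`, while clamping `x1` would leave `x0`'s trajectory untouched; the causal arrows are readable directly from the discretized mechanism.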
Continuous-time neural networks are a class of deep learning models with their hidden states represented by ordinary differential equations (ODEs) [Funahashi1993ApproximationOD]. Compared to discretized deep models, continuous-time deep models can approximate function classes otherwise not possible to generate; notably, the Neural ODE [Chen2018NeuralOD] showed that CT NNs can perform adaptive computations through continuous vector fields realized by advanced ODE solvers. Vorbach et al. [Vorbach2021CausalNB] investigate various CT network formulations and conditions in search of causal models. The popular Neural ODEs do not satisfy the requisite conditions for causal modeling simply because they cannot account for interventions; a trained Neural ODE statistical model can predict in i.i.d. settings and learn from data, but it cannot predict under distribution shifts nor under implicit/explicit interventions. Further, Neural ODEs cannot be used for counterfactual queries [Mooij2013FromOD], which naturally follows from the causal hierarchy discussed previously (see Fig. 13). Liquid time-constant networks (LTCs) [Hasani2021LiquidTN] do satisfy the properties of a causal model: the uniqueness of their solution and their ability to capture interventions make both their forward and backward modes causal. They can therefore impose inductive biases on their architectures to learn robust and causal representations [Lechner2020NeuralCP]. By this logic, we suggest UDEs can also be used as causal models of physical systems and processes. We also suggest that using these “causal” models in practice may be more elusive than structural causal models, which make assumptions and confounders explicit (whereas these NN-based approaches do not).
Physics-infused Gaussian processes in causal discovery – Earlier we described an “active causal discovery” approach where an agent iteratively selects experiments (or targeted interventions) that are maximally informative about the underlying causal structure of the system under study. The Bayesian approach to this sequential process can leverage a Gaussian process (GP) surrogate model, effectively implementing a special case of Bayesian optimization. The use of a physics-infused GP can provide robustness in the sense that the physical process being simulated is grounded as a causal surrogate – i.e., encoding causal mechanisms of said physical system – rather than a nonparametric function approximator. The several physics-infused GP methods we highlighted are: numerical GPs, whose covariance functions result from the temporal discretization of time-dependent partial differential equations (PDEs) that describe the physics [Raissi2018NumericalGP, Raissi2018HiddenPM]; modified Matérn GPs, which can be defined to represent the solution to stochastic partial differential equations [Lindgren2011AnEL] and extended to Riemannian manifolds to fit more complex geometries [Borovitskiy2020MaternGP]; and the physics-informed basis-function GP, which derives a GP kernel directly from the physical model [Hanuka2020PhysicsinformedGP].
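A minimal sketch of one step of this sequential loop with a GP surrogate, here with a plain RBF kernel; a physics-infused variant would swap in a kernel derived from the governing equations (e.g., a numerical or basis-function GP kernel). The toy observations and the pure uncertainty-sampling acquisition are assumptions for illustration.

```python
import numpy as np

def rbf(a, b, ls=0.5):
    # Squared-exponential kernel on 1-D inputs (stand-in for a physics-derived kernel).
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ls**2)

# Hypothetical measurements of the system's response at three interventions.
X_obs = np.array([-1.0, 0.0, 1.5])
y_obs = np.sin(X_obs)
X_cand = np.linspace(-2, 2, 50)          # candidate next experiments

K  = rbf(X_obs, X_obs) + 1e-6 * np.eye(3)   # jitter for numerical stability
Ks = rbf(X_cand, X_obs)
mu  = Ks @ np.linalg.solve(K, y_obs)                          # posterior mean
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)   # posterior variance

# Active step: query where the surrogate is most uncertain.
x_next = X_cand[np.argmax(var)]
```

In the active causal discovery setting, `x_next` would be the next targeted intervention; an information-theoretic acquisition over causal structures would replace the simple variance criterion used here.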
Algorithmic Information Dynamics (AID) [scholarpedia2020] is an algorithmic probabilistic framework for causal discovery and causal analysis. It enables a numerical solution to inverse problems based on (or motivated by) principles of algorithmic probability. AID studies dynamical systems in software space where all possible computable models can be found or approximated under the assumption that discrete longitudinal data such as particle orbits in state and phase space can approximate continuous systems by Turingcomputable means. AID combines perturbation analysis and algorithmic information theory to guide a search for sets of models compatible with observations, and to precompute and exploit those models as testable generative mechanisms and causal first principles underlying data and processes.
| Model class | Predict in i.i.d. setting | Predict under shifts/interventions | Answer counterfactual queries | Obtain physical insight | Learn from data |
|---|---|---|---|---|---|
| Mechanistic / Physical | ✓ | ✓ | ✓ | ✓ | ? |
| Structural causal (SCM) | ✓ | ✓ | ✓ | ? | ? |
| Causal graph | ✓ | ✓ | | ? | ? |
| Physics-infused | ✓ | ✓ | ? | ✓ | ✓ |
| Probabilistic graph (PGM) | ✓ | ✓ | | ? | ✓ |
| Physics-informed | ✓ | ✓ | | | ✓ |
| Statistical | | | | | ✓ |
Multi-fidelity simulation of electron paths for chemical reactions
In an earlier example of differentiable and probabilistic programming we introduced how generative modeling in molecular synthesis can help mitigate inefficiencies in the standard design-make-test-analyze molecular discovery process. In that specific method we assumed a “reaction outcome predictor” exists [Bradshaw2020BarkingUT]. That ability, to reliably predict the products of chemical reactions, is of central importance to the manufacture of medicines and materials, and to understanding many processes in molecular biology. In theory all chemical reactions can be described by the stepwise rearrangement of electrons in molecules [Herges1994CoarctateTS]. In practice this sequence of bond-making and breaking is known as the reaction mechanism, which can be described at several levels of abstraction (corresponding to multiple spatial scales) [Bradshaw2019AGM]: computationally expensive quantum-mechanical simulations of the electronic structure at the lowest level, and rules-based methods for transforming reactant molecules to products in a single step, which is often far too simplistic. A conceptual abstraction in the middle of these levels is used by chemists in practice: a qualitative model of quantum chemistry called arrow pushing, or sequences of arrows which indicate the path of electrons throughout molecular graphs [Herges1994CoarctateTS]. Bradshaw et al. aim to leverage the domain knowledge embedded in arrow-pushing diagrams to machine-learn the mechanism predictions – in contrast to other ML approaches for directly predicting the products of chemical reactions [Coley2017PredictionOO, Jin2017PredictingOR, Segler2017NeuralSymbolicML, Wei2016NeuralNF, Lee2019MolecularTU]. Much like the interpretability that emerges from causal structural models versus purely data-driven ones, the resulting mechanism models are more interpretable than product prediction models.
The aim of their approach is to describe the problem of predicting outcomes of chemical reactions – i.e., identifying candidate products, given a particular set of reactants – as a generative process which simulates arrow-pushing diagrams conditional on the reactant set. The generative modeling approach comes in handy with the next “choice” of atom in the sequence, which is something domain-expert humans have essentially zero intuition about. Since this would simulate a mechanism, even though it is an approximate one, it is useful as a predictive model which does more than simply produce an answer (at the very least, the predicted mechanism can function as a type of explanation). The arrow-pushing diagrams are sequences of arrows from atom to atom, which can be viewed as sequentially defining conditional distributions over the next atoms (i.e., a discrete distribution over atoms in the reactants), subject to some simple chemical constraints. The aim is to learn a distribution over model parameters for the probability of a series of electron actions that transform a set of reactant graphs into product graphs, which can also include graphs of reagents that aid the reaction process. In the current implementation the generative model is parameterized by a graph neural network and implemented in the autodiff (deep learning) library PyTorch. However, the well-practiced probabilistic programmer will recognize the direct analogy between series of conditional distributions and probabilistic programming inference, in particular sequential Monte Carlo based methods. The implementation in a PPL has notable advantages, for instance trace-based interpretability: different electron paths can form the same products (see Fig. 34), for instance due to symmetry, which can be inspected with PPL traces.
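The "sequence of conditional distributions" view can be sketched as follows – a toy chain of categorical distributions over atoms, with a simple no-self-loop constraint standing in for real chemical constraints. The `scores` matrix is a hypothetical stand-in for a learned graph-neural-network scoring function; this is not the parameterization of the cited paper.

```python
import numpy as np

def sample_electron_path(scores, n_steps, rng):
    """Sample a path of atoms step by step, where each step is a categorical
    distribution over atoms conditioned on the path so far. `scores` is a
    fixed matrix score[prev_atom, next_atom] standing in for a learned
    (e.g. graph-neural-network) scoring function."""
    n_atoms = scores.shape[0]
    path = [int(rng.integers(n_atoms))]   # starting atom
    for _ in range(n_steps):
        logits = scores[path[-1]].copy()
        logits[path[-1]] = -np.inf        # toy chemical constraint: no self-loop
        p = np.exp(logits - logits.max())
        p /= p.sum()
        path.append(int(rng.choice(n_atoms, p=p)))
    return path

rng = np.random.default_rng(1)
scores = rng.normal(size=(6, 6))          # 6 atoms in the reactant graph
path = sample_electron_path(scores, n_steps=4, rng=rng)
```

Each sampled `path` is one execution trace; two different traces can yield the same product graph, which is exactly the kind of multiplicity a PPL trace inspection would expose.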
With both GNNs and probabilistic programming, we can more easily encode the constraints imposed by chemistry. For GNNs, this is the geometric deep learning and equivariance properties we’ve discussed. Both also enable reasoning about multiple levels of fidelity: instead of simulating reactions at a low-level representation (involving atomic coordinates and other representations that are not understandable) we can simulate reactions at an intermediate granularity, one which we know domain experts are already familiar with and trust as an abstraction. Further, for probabilistic programming, this again exemplifies the value of PP for “partially specified” models, which occupy the rich area between data-driven (black-box, deep learning) models and fully-specified simulators. See the paper [Bradshaw2019AGM] for details, which has impressive results by training on large, unannotated reaction datasets (i.e., learning unsupervised), and further learns chemical properties it was not explicitly trained on.
SI-based materials acceleration platforms and open-ended search
One property to exploit with open-ended search and optimization is the potential to discover entirely novel solutions, in spaces that otherwise would never have been explored. For instance, in socioeconomics as a variation on the multi-scale agent-based modeling AI Economist framework we discussed above – the dynamic optimization scheme suggests an alternative to typical objective-based convergence criteria (corresponding to Nash equilibria [Holt2004TheNE] in the multi-agent setting) and can be extended with open-ended search algorithms to find non-obvious or even counterintuitive socioeconomic solutions. A different sector has been gaining interest in AI and simulation tools to search for de novo solutions, pursuing “materials acceleration platforms” for autonomous experimentation in chemistry [FloresLeonar2020MaterialsAP]. A fundamental goal of materials science and chemistry is to learn structure–property relationships and from them to discover de novo materials or molecules with desired functionalities. Yet the space of possible organic molecules is estimated to be astronomically large [Drew2012SizeEO], and the space of materials is larger still [FloresLeonar2020MaterialsAP]. The traditional approach is based on human expert knowledge and intuition, resulting in a process that is slow, expensive, biased, and limited to incremental improvements. Even with high-throughput screening (HTS), the forward search process is limited to predesignated target areas [CorreaBaena2018AcceleratingMD]. Inverse design methods can provide significant efficiency gains [sanchez2018inverse], representing a key component for the molecule/hypothesis-generation phase of automating the discovery of chemicals and materials – see Fig. 33. As we described, inverse design focuses on efficient methods to explore the vast chemical space towards the target region, and fast and accurate methods to predict the properties of a candidate material along with chemical space exploration [noh2020machine].
The focus in materials acceleration platforms has thus far been on objective-based optimization schemes for autonomous chemical experimentation, including approaches that combine objectives with user-defined preferences (with similar motivations as the NAIEC approach we described for open-ended search) [Hse2018ChimeraEH], and in Bayesian optimization schemes [Hse2018PhoenicsAB]. Specific implementations include the “Chemputer” [Steiner2019OrganicSI] for organic synthesis (where an automated retrosynthesis module is coupled with a synthesis robot and reaction optimization algorithm) and “Ada” [MacLeod2020SelfdrivingLF] (Autonomous discovery accelerator) for thin-film materials (with a purpose-built software package to coordinate researchers with lab equipment and robots [chemos]). These operate on closed-loop optimization schemes, which some suggest is a requirement for fully autonomous discovery systems [FloresLeonar2020MaterialsAP, chemos]. It is interesting to explore whether open-ended search and optimization methods can advance these discovery platforms, and how they could be implemented in the environment. In such an expansive space of molecules, one can leverage open-ended search not only to explore more broadly, but also in ways that increase uncertainty (and entropy) towards potentially unknown territories that can match target characteristics. We are not yet aware of work in this direction, and even so it may call for a rethinking of the broader discovery pipeline. That is, the closed-loop setting of current approaches may not support open-ended search in optimal ways. Continual learning methods may be usable, as the modeling and search algorithms aim to improve on subsequent passes through the cycle, but that is still very much objective-based and does not allow for exploration with ongoing, targeted increases in entropy.
Perhaps an offline process for open-ended search that is decoupled from the main process is needed, along with some mechanism for information flow between the two (so the open-ended search can learn from the materials synthesis process, for example). Integrating robotics with open-endedness would introduce additional challenges (not to mention risks). One can potentially look to surrogate modeling to alleviate some of the challenges of implementing open-ended search and optimization, but at this point we are well into speculative territory and there is still much platform engineering, and many baseline experiments, to be done.
It’s straightforward to connect materials acceleration platforms with nearly all the SI motifs, as well as inverse design and human-machine teaming workflows. For instance, semi-mechanistic modeling can be advantageous for incorporating domain-expert priors into the AI-driven search, as well as physics-infused learning. A significant challenge with chemical and materials design is the translation of properties towards production scaling, which can potentially be alleviated with improved multi-scale modeling techniques. With such a massive, complex environment that cannot possibly be understood by humans, causal modeling and inference can prove valuable for interpreting the solution space and navigating cause-effect relationships in the overall process.
Inducing physics programs with open-ended learning
One might suggest that open-ended methods should train in real time rather than in batches or episodes like mainstream deep learning – online learning could be a requirement of any model developed for open-ended use. However, “DreamCoder” [Ellis2020DreamCoderGG] provides an interesting counterexample with “wake-sleep” [Dayan2000HelmholtzMA] probabilistic program learning.
DreamCoder is a system that learns to solve problems by writing programs – i.e., a program induction perspective on learning, where learning a new task amounts to searching for a program that solves it. This perspective provides a number of advantages. Nonetheless, practical use is limited, as applications of program induction require not only a carefully engineered domain-specific language (DSL) with strong biases/priors on the types of programs to find, but also a search algorithm hand-designed to exploit that DSL for fast inference, as the search space is otherwise intractable. The wake-sleep approach in DreamCoder resolves these issues by learning to learn programs: its learned language defines a generative model over programs and tasks, where each program solves a particular hypothetical task, and its inference algorithm (parameterized by a neural network) learns to recognize patterns across tasks in order to best predict program components likely to solve any given new task. This approach produces two distinct kinds of domain expertise: (1) explicit declarative knowledge, in the form of a learned DSL that captures conceptual abstractions common across tasks in a domain; and (2) implicit procedural knowledge, in the form of a neural network that guides how to use the learned language to solve new tasks, embodied by a learned domain-specific search strategy [Ellis2020DreamCoderGG]. This succinct description does not do DreamCoder justice, as the paper thoroughly discusses the approach and learned representations, and presents illustrative examples like that in Fig. 36. The authors show this setup allows learning to scale to new domains, and to scale within a domain provided there are sufficiently varied training tasks. DreamCoder is open-ended in the sense that it learns to generate domain knowledge ceaselessly, implements a continual learning algorithm, and does not have a specific optimization objective.
Another important distinction from deep neural networks is that the system does not learn tabula rasa, but rather starts from minimal bases and discovers the vocabularies of functional programming, vector algebra, and physics.
Physical computing
Here we highlight several promising directions for improving physical computation with the incorporation of SI methods: optimizing the design of sensors and chips.
ML-enabled intelligent sensor design is the use of inverse design and related machine learning techniques to optimally design data-acquisition hardware with respect to a user-defined cost function or design constraint. Here is a general outline of the process (motivated by Ballard et al. [Ballard2021MLSensorD]):

Standard engineering practices produce an initial design, or a randomized initialization of parameters and spatial configuration (in terms of transduction elements or sensing components, for example).

Acquisition of raw sensing data from the initial design, and data preprocessing and transformations such that the data are amenable to machine learning (e.g., normalization, Gaussian noise handling, etc.).

Train an ML model (typically a neural network or Gaussian process, potentially a probabilistic program) to output the sensing results.

A cost function is used to evaluate the learned model against the ground-truth sensing information – here it is useful to develop a problem-specific cost function with physics-informed constraints (see the multiphysics motif section).

The results from step (4) inform a redesign that consists of adjusting or eliminating the less informative/useful features, and possibly replacing such features with alternative sensing elements. Ideally one can use a Bayesian optimization (BO) outer loop to inform the subsequent design, where the learned NN or GP serves as the surrogate (as in the surrogate modeling motif). Future work may investigate BO-based neural architecture search methods (e.g. [White2021BANANASBO]) applied to sensor components and modules.

Repeat from step (2), where termination conditions may be the number of budgeted cycles, convergence on a design or outcome, or achieving a downstream cost-function threshold. Depending on the surrogate model choice and optimization scheme, it may be useful to employ transfer learning techniques to reduce the data burden of subsequent iterations.
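The loop above can be sketched end-to-end in a toy setting. Here a linear least-squares model stands in for the NN/GP surrogate, a mask over five sensing elements stands in for the design, and the redesign step simply deactivates the least informative active element; all names and the toy "two informative elements" assumption are ours, not Ballard et al.'s.

```python
import numpy as np

rng = np.random.default_rng(0)

def acquire_data(design_mask, n=200):
    """Step 2: simulate raw sensing data from the active sensing elements.
    Toy assumption: the target depends only on elements 0 and 1."""
    X = rng.normal(size=(n, design_mask.size)) * design_mask
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.05 * rng.normal(size=n)
    return X, y

def train_surrogate(X, y):
    """Step 3: linear least squares stands in for the NN/GP surrogate."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def cost(w, X, y):
    """Step 4: mean-squared-error cost against ground truth."""
    return float(np.mean((X @ w - y) ** 2))

design = np.ones(5)                     # step 1: five sensing elements, all active
for cycle in range(3):                  # step 6: budgeted number of cycles
    X, y = acquire_data(design)
    w = train_surrogate(X, y)
    c = cost(w, *acquire_data(design))  # evaluate on freshly acquired data
    # step 5: redesign -- deactivate the least informative active element
    active = np.flatnonzero(design)
    if active.size > 2:
        design[active[np.argmin(np.abs(w[active]))]] = 0.0
```

After the budgeted cycles, only the two genuinely informative elements remain active – the kind of feature-pruning redesign step (5) describes, here without the BO outer loop.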
Such an approach can drastically improve the overall performance of a sensor and the broader data ingestion/modeling system, perhaps with non-intuitive design choices. Consider an intriguing example in the field of synthetic biology, aiming to intelligently design RNA sequences that synthesize and execute a specific molecular sensing task: Angenent-Mari et al. [AngenentMari2020ADL] developed a surrogate-based design framework with a neural network trained on RNA sequence-to-function mappings, resulting in an in silico design framework for engineering RNA molecules as programmable response elements to target proteins and small molecules. The data-driven approach outperforms the prediction accuracy resulting from standard thermodynamic and kinetic modeling. In general, data-driven, ML-based sensor designs can outperform “intuitive” designs that are solely based on analytical/theoretical modeling, but there are nontrivial limitations: generally large, well-characterized training datasets are needed to engineer and select sensing features that can statistically separate out various inherent noise terms or artifacts from the target signals of interest, not to mention the curse of dimensionality, where the high-dimensional space of training data may drown out the meaningful correlations to the target sensing information. See Ballard et al. [Ballard2021MLSensorD] for more discussion on ML-enabled intelligent sensor design.

ML-based chip design aims to automate and optimize the various processes in designing, fabricating, and validating computer chips. In particular, recent work has made significant strides in ML-based “chip floorplanning” [Mirhoseini2021AGP], or designing the physical layout of a computer chip. This slow, costly process involves placing hypergraphs of circuit components (such as macros (memory components) and standard cells (logic gates such as NAND, NOR, and XOR)) onto chip canvases (two-dimensional grids) so that performance metrics (e.g., power consumption, timing, area, and wirelength) are optimized, while adhering to hard constraints on density and routing congestion. Engineers approach this manual design task over the course of months per chip layout, where performance evaluation of each design takes multiple days. Mirhoseini et al. [Mirhoseini2021AGP] developed a reinforcement learning (RL) approach, not unlike those described in the agent-based modeling motif, by framing the task as a sequential Markov decision process (MDP) as follows:

States: hypergraph (as an adjacency matrix), node features (width, height, type), edge features (number of connections), current node (macro) to be placed, and metadata of the netlist graph (routing allocations, total number of wires, macros and standard cell clusters).

Actions: all possible locations (grid cells of the chip canvas) onto which the current macro can be placed subject to predefined hard constraints on density or blockages.

Transitions: probability distribution over next states, given a state and an action.

Rewards: zero for all but the last action, where the reward is a negative weighted sum of proxy wirelength, congestion, and density (see the paper for details [Mirhoseini2021AGP]).
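The MDP structure above can be sketched as a toy environment: macros placed one per step on a grid, a free-cell hard constraint, and a sparse terminal reward. Everything here is a deliberately simplified stand-in (the reward keeps only a pairwise-wirelength proxy, and a random policy replaces the learned one); it is not the authors' implementation.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class FloorplanEnv:
    """Toy sketch of the floorplanning MDP: place macros one at a time on a
    grid; reward is zero until the final action, then a negative weighted sum
    of proxy metrics (here just total pairwise wirelength)."""
    grid: int = 8
    n_macros: int = 4
    placements: list = field(default_factory=list)

    def actions(self):
        """All free grid cells (hard constraint: one macro per cell)."""
        taken = set(self.placements)
        return [(r, c) for r in range(self.grid) for c in range(self.grid)
                if (r, c) not in taken]

    def step(self, cell):
        self.placements.append(cell)
        done = len(self.placements) == self.n_macros
        reward = -self._wirelength() if done else 0.0  # sparse terminal reward
        return reward, done

    def _wirelength(self):
        P = np.array(self.placements, dtype=float)
        # proxy: sum of Manhattan distances between all placed macro pairs
        return float(sum(np.abs(P[i] - P[j]).sum()
                         for i in range(len(P)) for j in range(i + 1, len(P))))

env = FloorplanEnv()
rng = np.random.default_rng(0)
done, reward = False, 0.0
while not done:                       # a random policy stands in for the RL agent
    cells = env.actions()
    reward, done = env.step(cells[rng.integers(len(cells))])
```

In the real system, the policy choosing among `env.actions()` is the learned network described next, rather than a random draw.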
A graph convolutional NN is used to learn feature embeddings of the macros and other hypergraph components – this architecture provides advantageous geometric properties for this design space, which we discuss more in the “honorable mention motifs” later. The benchmark results of this approach show the generation of chip floorplans that are comparable or superior to human experts’ in under six hours, compared with multiple months of manual human effort. This is a nice practical example of AI-driven knowledge expansion in Fig. 29: the artificial agents approach the problem with chip placement experience that is orders of magnitude greater in size and diversity than any human expert’s.
As alluded to above, this is one step of many in a resource-intensive process of computer chip development. There are many ways automation and human-machine teaming can potentially optimize and accelerate the process, including additional ways simulation-AI can advance chip design: for example, the SimNet [Hennigh2020NVIDIASA] platform of NN solvers has been applied to the problem of design optimization of an FPGA heat sink, a multiphysics problem that can be approached with multiple parallel ML surrogates (with one NN for the flow field and another NN for the temperature field).
Physical Neural Networks – In the simulation intelligence spirit of minimizing the gaps between the in silico and in situ worlds, physical neural networks (PNNs) are an intriguing direction for physical learning hardware. That is, PNNs describe hierarchical physical computations that execute deep neural networks and are trained using backpropagation through physics components. The approach is enabled by a hybrid physical-digital algorithm called physics-aware training (PAT), which enables the efficient and accurate execution of the backpropagation algorithm on any sequence of physical input-output transformations, directly in situ [Wright2021DeepPN]. For a given physical system, PAT is designed to integrate simulation-based and physical modeling in order to overcome the shortcomings in both. That is, digital models (or simulations) describing physics are always imperfect, and the physical systems may have noise and other imperfections. Rather than performing the training solely with the digital model (i.e., gradient-based learning with the simulator), the physical systems are directly used to compute forward passes, which keeps the training on track – one can view this as another type of constraint in the physics-infused learning class of methods we discussed earlier. Naturally this necessitates a differentiable simulator of the physical system, one to be built with differentiable programming and perhaps additional SI motifs. Wright et al. [Wright2021DeepPN] trained PNNs based on three distinct physical systems (mechanical, electronic, and optical) to classify images of handwritten digits, among other experiments, effectively demonstrating PAT and the overall physical computing (or physical learning) concept. They also describe in depth the mathematical foundations of all components of the approach, including the implementation of autodiff for physics-aware training.
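The hybrid forward/backward split of PAT can be illustrated with a one-parameter toy: the forward pass is evaluated on a "physical system" (here a noisy function with a systematic imperfection) while the gradient comes from an imperfect but differentiable digital model. The functions and constants are our illustrative stand-ins, not the systems of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def physical_system(x, w):
    """Stand-in for real hardware: the true transformation plus noise and a
    systematic imperfection the digital model does not capture."""
    return np.tanh(w * x) + 0.02 * x**2 + 0.01 * rng.normal(size=x.shape)

def digital_model_grad_w(x, w):
    """Analytic dy/dw of the imperfect digital model y = tanh(w * x),
    used for the backward pass."""
    return x * (1.0 - np.tanh(w * x) ** 2)

# fit w so the *physical* output matches a target, backpropagating through
# the digital model (the PAT-style forward/backward split)
x = np.linspace(-1, 1, 64)
target = physical_system(x, 1.5)        # pretend this came from a reference device
w = 0.2
for _ in range(200):
    y_phys = physical_system(x, w)      # forward pass: run the hardware
    err = y_phys - target
    grad = np.mean(err * digital_model_grad_w(x, w))  # backward: digital model
    w -= 0.5 * grad
```

Because the forward pass is physical, the systematic imperfection (the quadratic term) cancels out of the training error, and `w` converges near the reference value despite the digital model never representing that term – the "keeping training on track" effect described above.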
The overall approach is very nascent but could expand possibilities for inverse design (such as inverse design of dynamical engineering systems via backpropagation through physical components and mechanisms). Physical neural networks may also facilitate cyber-physical simulation environments with a variety of machines and sensor systems, and may lead to unconventional machine learning hardware that is orders of magnitude faster and more energy-efficient than conventional electronic processors.
Human-Machine Teaming
The stack in Fig. 1 indicates that the motifs and applications interact with people of diverse specialties and use-cases, not to mention that the human-machine inference methods we’ve described (in simulation-based inference, causal reasoning, and other motifs) call for human-in-the-loop processes. To this end we define human-machine teaming (HMT) as the mutually beneficial coordination of humans and machine intelligence systems (which include AI and ML algorithms and models, various data and information flows, hardware including compute architecture and sensor arrays, etc.). HMT overlaps with the more well-established field of human-computer interaction (HCI), which integrates the disciplines of computer science, cognitive science, and human-factors engineering in order to design technologies that enable humans to interact with computers in efficient and novel ways.iv
iv We intentionally use the name “machine” in a general sense, overlapping with and extending beyond HCI, human-AI interaction, and human-robot interaction. The development of HMT is to include the variety of machines we’ve discussed in this paper, from in silico agents and algorithms, to in situ nuclear fusion reactors and particle accelerators, and of course the nascent field of machine programming.
Common practice in machine learning is to develop and optimize the performance of models in isolation, rather than seeking richer optimizations that consider human-machine teamwork, let alone considering additional human-in-the-loop challenges affecting model performance, such as changes in policies or data distributions. Existing work in HMT includes systems that determine when to consult humans, namely when there is low confidence in predictions [Horvitz2007ComplementaryCP, Kamar2012CombiningHA, Raghu2019TheAA], or informing the human annotator which data to label for active learning [Settles2012ActiveL, Hu2019ActiveLW].
HMT research addresses several fundamental issues with applicability across domains of science and intelligence:

How humans and machines communicate

How to model the relationship between humans and machines for accomplishing closed feedback loops

How to quantify and optimize human-machine teamwork

Machines learning how to best complement humans [Wilder2020LearningTC]

How human-machine dynamics can influence and evolve the characteristics of errors made by humans and AI models, especially in continual learning contexts

How to quantify human-machine uncertainties and predict reliability, especially in the context of broader and dynamic systems [lavin2021technology]

How to provide theoretical guarantees for various HMT subclasses (e.g., online optimization of experiments, active causal discovery, human-machine inference (Fig. 12)), and how to empirically validate them

How to capture and express the human-machine interactions for a particular application domain – where do general methods suffice, and where are domain-specific algorithms and interfaces needed?

Synergistic human-AI systems can be more effective than either alone – we should understand this tradeoff precisely, which will likely vary in metrics and dimensions depending on the domain and use-case

Laboratory processes and equipment are currently designed for the traditional, deductive scientific method, where tools assist human scientists and technicians. Simulation intelligence provides the opportunity to rethink this human-machine relationship towards new science methods that are inductive, data-driven, and managed by AI (beyond standard automation).
The standard HMT approach is to first train a model on data for prediction; the model is then held fixed when developing the human-interaction components (and broader computational system) and when training a policy for querying the human. One of the more glaring issues with this general approach is that introducing the querying policy can cause the data distributions to shift, resulting in altered behavior from the model that necessitates retraining. Joint training approaches could mitigate this concern, and further provide a more optimal teaming workflow. For instance, Wilder et al. [Wilder2020LearningTC] propose several joint training approaches that consider explicitly the relative strengths of the human and machine: a joint discriminative approach that trains the model while learning end-to-end from features to decisions, and a new decision-theoretic approach that additionally constructs probabilistic models for both the AI task and the human response. The latter allows a follow-up step that calculates the expected value of information for querying the human, and joint training optimizes the utility of the end-to-end system.
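The expected-value-of-information calculation can be sketched with a toy two-class model: given the AI's posterior and a modeled human accuracy, we compare the utility of acting alone against the expected utility of querying the human and acting on the Bayes-fused posterior, minus a query cost. This is a generic decision-theoretic sketch under our own simplifying assumptions, not the exact formulation of Wilder et al.

```python
import numpy as np

def expected_value_of_information(p_ai, human_acc, query_cost):
    """Toy two-class EVOI: utility of a decision = probability it is correct.
    p_ai: AI posterior over the two classes; human_acc: modeled probability
    the human answers correctly; query_cost: utility cost of asking."""
    p_ai = np.asarray(p_ai, dtype=float)
    u_alone = p_ai.max()                      # act on the AI posterior alone
    u_query = 0.0
    for answer in range(2):                   # marginalize over human answers
        like = np.array([human_acc if answer == c else 1.0 - human_acc
                         for c in range(2)])
        p_answer = float(like @ p_ai)         # probability of this answer
        posterior = like * p_ai / p_answer    # Bayes-fused posterior
        u_query += p_answer * posterior.max() # then act on the fused posterior
    return u_query - query_cost - u_alone

# query only when the AI is unsure and the human is reliable
evoi_unsure = expected_value_of_information([0.55, 0.45], human_acc=0.95, query_cost=0.05)
evoi_sure   = expected_value_of_information([0.99, 0.01], human_acc=0.95, query_cost=0.05)
```

With an uncertain AI the EVOI is positive (query the human); with a confident AI it is negative (the query cost outweighs the expected gain) – the qualitative behavior a querying policy should learn.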
In addition to quantifying the value of information in human-machine dynamics, using probabilistic ML methods can also enable the quantification and propagation of uncertainties in HMT (and overall throughout SI-based workflows for automating science – Fig. 37). We’ve described candidate methods in the previous section, and here point out another class of methods for uncertainty estimation in the human-machine coalition setting: evidential learning maps model inputs to the parameters of a Dirichlet distribution over outputs (classes), where smaller parameter values represent less evidence for a class and thus produce a broader distribution representing greater epistemic uncertainty [Sensoy2018EvidentialDL, Meinert2021MultivariateDE]. Evidential learning enjoys a direct mapping to subjective logic, a calculus for probabilities expressed with degrees of uncertainty where sources of evidence can have varying degrees of trustworthiness [Jsang2016SubjectiveLA]. In general, subjective logic is suitable for modeling and analysing situations where the available information is relatively incomplete and unreliable, which can be utilized in HMT scenarios for uncertainty-aware logical reasoning. One can extend this to trust calibration, a valuable yet overlooked factor in many HMT scenarios: automation bias and automation complacency can occur when humans too readily accept AI or machine outputs. In these cases, humans either replace their own thinking or judgement, or are less sensitive to the possibility of system malfunctions [Goddard2012AutomationBA, Tomsett2020RapidTC]. Conversely, algorithm aversion occurs when humans disregard algorithms for lack of trust, despite the algorithms consistently outperforming humans. In general, for feedback systems powered by ML, one cannot simply improve the prediction components and reliably improve performance.
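The evidential/subjective-logic mapping can be shown in a few lines: non-negative evidence values map to Dirichlet parameters, from which belief masses and a vacuity (epistemic uncertainty) term are read off directly. The `dirichlet_uncertainty` helper is ours; the alpha = evidence + 1 convention follows the evidential deep learning literature.

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Subjective-logic reading of evidential outputs: non-negative evidence
    e_k maps to Dirichlet parameters alpha_k = e_k + 1; belief mass
    b_k = e_k / S and vacuity (epistemic uncertainty) u = K / S, where
    S = sum(alpha). Little evidence -> broad Dirichlet -> high u."""
    e = np.asarray(evidence, dtype=float)
    alpha = e + 1.0
    S = alpha.sum()
    belief = e / S
    u = len(e) / S          # belief masses and u sum to 1 by construction
    return belief, u

b_lo, u_lo = dirichlet_uncertainty([0.2, 0.1, 0.1])   # almost no evidence
b_hi, u_hi = dirichlet_uncertainty([50.0, 2.0, 1.0])  # strong evidence for class 0
```

The explicit vacuity term `u` is what makes these outputs usable for trust calibration: a downstream policy can treat high-vacuity predictions as "consult the human" cases rather than trusting an over-confident softmax.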
As in control theory, the “Doyle effect” [Doyle1978GuaranteedMF, Petersen2014RobustCO] suggests that an improved prediction system can increase sensitivity to a modeling or data flaw in another part of the engineering pipeline – this certainly holds for human-in-the-loop ML, and scientific ML pipelines in general.
The HMT concerns are relevant and perhaps exacerbated in the broader perspective of a full workflow like that of Fig. 37, where we illustrate a scientist in the context of a largely autonomous workflow for AI-driven science.
Simulation Intelligence in Practice
The implementation and integration of simulation intelligence in workflows such as this can address the noted issues in important ways. For instance, the semi-mechanistic modeling and physics-infused learning we shared earlier are essential in the modeling-understanding phase, and are also critical in the sim-to-real transfer phase(s); we suggest these need to be integrated in closed-loop ways that coordinate learning and info-sharing between upstream and downstream phases. The inverse design approaches we presented (such as Fig. 33) could be useful to this end, and inverse problem solving may prove useful for mitigating losses in the synthesis steps – this is an area of future SI work for us. More generally, it can be advantageous to reshape such workflows from the perspective of SI. For instance, human-machine inference can in essence replace the left side of the diagram, and SI-based inverse design can flip/rewire much of the inner modeling-design-synthesis-validation loop. Nonetheless the data and engineering challenges remain nontrivial in practice, which we discuss next.
Data-intensive science and computing
As we’ve described, scientific datasets are different from the data typically available to ML research communities; scientific data can be high-dimensional, multimodal, complex, sparse, noisy, and entangled in ways that differ from image, video, text, and speech modalities. Scientific questions often require the combination of data from multiple sources, including data to be fused from disparate and unknown sources, and across supercomputing facilities. Further, many science applications where ML and AI can be impactful are characterized by extreme data rates and volumes, highly diverse and heterogeneous scenarios, extremely low latency requirements, and often environment-specific security and privacy constraints.
For these challenges and more, data processing, engineering, sharing, governance, and so on are essential to scientific AI and simulation intelligence, which are inherently data-intensive: problems characterized by data volume, velocity, variety, variability (changes both in the structure of data and their underlying semantics, especially in real-time cases), and veracity (data sources are uncontrolled and not always trustable) [HPCS2015datahpc]. Admittedly we do not discuss nearly enough the importance of data in the SI motifs and in simulation sciences broadly – one or more survey papers on this is warranted. Here we highlight several key concepts and references.
Data challenges and opportunities in merging scientific computation and AI/ML
In recent years there have been numerous reports on data and AI/ML best practices from various perspectives and fields. Argonne, Oak Ridge, and Lawrence Berkeley national laboratories have recommended that computational science communities need AI/ML and massive datasets [Stevens2020AIFS], including the suggestion that a lack of data is by far the largest threat to the dawn of strongly AI-enabled biology, advocating for advances in synthetic data generation, privacy-preserving ML, and data engineering. Leaders in the nuclear fusion community have collaborated on a set of needs for AI to move the field forward [Humphreys2020AdvancingFW], calling for an industry-wide “fusion data machine learning platform”. NASA perspectives on big data systems for climate science [Schnase2015BigDC, Schnase2017MERRAAS] cover topics such as data ETL (extract-transform-load) in massive multi-tier storage environments, and the need to discover and reuse data workflows. The National Science Foundation is pushing for domain-inspired AI architectures and optimization schemes to enable big-data-driven discovery [Huerta2020ConvergenceOA] across domain-specific fields such as chemistry and materials sciences [Himanen2019DataDrivenMS, Blaiszik2019ADE], and agriculture [Janssen2017TowardsAN]. Here we highlight several of the common themes across those reports, which are data themes shared with simulation intelligence. We also expand on them and clarify what is meant by concepts such as “ML-ready data” and “open data”:

Interoperability is one of the tenets of “FAIR” [Wilkinson2016TheFG], guiding principles that define 15 ways in which data should be findable, accessible, interoperable, and reusable. In general these are rather straightforward: for example, to be findable, data are assigned a globally unique and persistent identifier; to be reusable, data are released with a clear and accessible data usage license; and so on. Although generally agreed upon, FAIR is rather high-level and warrants more specific details for its tenets. Double-clicking on interoperability is particularly important for the data-intensive applications of SI. Data interoperability, as it pertains to AI/ML and simulation sciences, addresses the ability of computational systems to exchange data with unambiguous, shared meaning and content. The medical ecosystem is unfortunately emblematic of the main data interoperability challenges [Lehne2019WhyDM] that stymie ML and scientific progress: incompatible standards, isolated and legacy databases, incompatible systems and proprietary software, inconsistent privacy and security concerns, and datasets that are inherently difficult to exchange, analyze, and interpret. Many of these interoperability challenges are a consequence of business incentives or legal needs. Other problems could be solved by at least a common ontology and a standardized representation of knowledge within this ontology – for example, the “SyBiOnt” ontology for applications in synthetic biology design [Misirli2016DataIA], which brings together the collective wealth of existing biological information in databases and literature. We further suggest that such an ontology should integrate with a unanimous data lifecycle management framework in that field, where the main value-adds are proper semantic versioning for datasets and a lingua franca amongst stakeholders [lavin2021technology]; more on such frameworks is mentioned in the readiness entry.
When considering distributed computing and HPC (both highly important in scientific computing and simulation), these challenges can obstruct the sharing and interconnection of data such that scaling the experiments is infeasible.

Data readiness at one time could have described the preparedness and availability of data for programmer queries and data science or analytics use. The set of concerns is far more complex and dynamic with ML, and scientific use-cases can impose stricter requirements still. Data readiness describes the successful execution of the data tasks ahead of modeling: collating, processing, and curating data for the intended use(s). Challenges include poor and undocumented data collection practices, missing values, inconvenient storage mechanisms, intellectual property, security, and privacy – notice the overlap with the interoperability issues above. To help overcome these challenges, several frameworks for data readiness have been developed [Afzal2021DataRR, Lawrence2017DataRL, lavin2021technology]. These provide data handling structure and communication tools for ML practitioners, including required implementations such as data profiling and quality analyses as a key step before data enters an ML pipeline, and continuous testing for shifts in the data and environments of deployed models. These frameworks are highly relevant for SI practitioners and systems as well.
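To make the notion of a readiness gate concrete, here is a minimal sketch of profiling data before it enters an ML pipeline; the function names, the missingness threshold, and the crude drift score are our own illustrative choices, not part of any cited framework:

```python
import numpy as np

def readiness_report(X, max_missing_frac=0.05):
    """Profile a 2-D samples-by-features array before it enters an ML pipeline."""
    X = np.asarray(X, dtype=float)
    missing_frac = np.mean(np.isnan(X), axis=0)          # per-feature missingness
    ok = bool(np.all(missing_frac <= max_missing_frac))  # readiness gate: pass/fail
    return {"missing_frac": missing_frac, "ready": ok}

def drift_score(train_col, live_col):
    """Crude drift check: shift of the live mean in units of the training std."""
    mu, sigma = np.mean(train_col), np.std(train_col) + 1e-12
    return abs(np.mean(live_col) - mu) / sigma
```

A deployed pipeline would run `readiness_report` at ingestion and `drift_score` continuously on live data, alerting when either gate fails.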

Provenance in scientific workflows is paramount: a data product contains information about the process and data used to derive it, and provenance provides important documentation that is key to preserving the data, to determining the data’s quality and authorship, and to reproducing as well as validating the results. These are all important requirements of the scientific process [Davidson2008ProvenanceAS]. Several overviews on the topic point to methods for scientific data provenance as a means of reproducibility [Davidson2008ProvenanceAS] and integrity [Huang2018FromBD]. We suggest including knowledge provenance in this theme, referring to the audit trail, lineage, and pedigree of the computed information and results. In recent years the ML community has prioritized systems with tools for versioning datasets and models, which should be embraced by all in scientific computing and simulation, and extended to the versioning of environments and testbeds.
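As a lightweight sketch of such lineage tracking (the record fields and function names here are hypothetical, not a standard schema), content-hashing each input gives every derived artifact an auditable provenance entry that reproduces exactly on identical inputs:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Stable fingerprint of a dataset's bytes; any change yields a new hash."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(step_name, inputs, params):
    """Audit-trail entry: which inputs (by hash) and parameters produced this step."""
    return {
        "step": step_name,
        "input_hashes": [content_hash(b) for b in inputs],
        "params": params,
    }

raw = b"sensor readings v1"
rec = provenance_record("calibrate", [raw], {"gain": 2.0})
# Re-running on identical bytes reproduces the same lineage entry.
assert rec == provenance_record("calibrate", [b"sensor readings v1"], {"gain": 2.0})
```

Chaining such records (the hash of one step's output becoming the input hash of the next) yields the audit trail described above.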

Synthetic data is paramount in scientific simulation and AI/ML. The primary use-case is data generation to make up for shortcomings in real data: availability, annotation, distribution (variance), representation of various modes and anomalies, computational constraints, and of course sample size. With the popularity of data-hungry deep learning in recent years, there have been many works on synthetic data generation for ML, largely in general computer vision and robotics [Bousmalis2018UsingSA, JohnsonRoberson2017DrivingIT], and also scientific areas such as medical imaging and clinical records [Shin2018MedicalIS, Abay2018PrivacyPS, Torfi2020DifferentiallyPS], social and economic systems [Burgard2017SyntheticDF, Zheng2020TheAE], particle physics [Deja2019GenerativeMF] and astrophysics [DeRose2019TheBF], and human cognition and behavior [Battaglia2013SimulationAA, Hamrick2016InferringMI, Gerstenberg2021ACS]. The validation of generated data against the real environment, and the reliability of subsequent models to perform in said environment, are of principal importance – Rankin et al. [Rankin2020ReliabilityOS], for example, demonstrate the necessary analysis in the context of medical ML. A variety of statistical methods exist for quantifying the differences in data distributions [Kar2019MetaSimLT] (and relatedly for dataset drifts [Rabanser2019FailingLA]). In some use-cases, domain-specific metrics may be called for, imposing physics-informed or other environment constraints on the synthetic data [Ltjens2021PhysicallyConsistentGA].
An overarching challenge is the integrated use of synthetic and real datasets. Techniques for “sim-to-real” transfer learning help ML models train synthetically and deploy with real data [Kar2019MetaSimLT]. Sim-to-real work in the AI and ML field is largely dominated by simulation testbeds (including video games) for robotics use-cases. However, in scientific use-cases we encounter numerous scenarios where simulated data can be used as a catalyst in machine-based experiment understanding: generating synthetic diagnostics that enable the inference of quantities that are not directly measured in machine-based experiments – e.g., extracting meaningful physics from extremely noisy signals in particle physics, and automatically identifying important nuclear fusion plasma states and regimes for use in supervised learning. Humphreys et al. [Humphreys2020AdvancingFW] refer to this as one means of “machine learning boosted diagnostics”, which also includes the systematic integration of multiple data sources, encompassing the next theme.
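One simple instance of the statistical checks for distributional differences mentioned above is a two-sample Kolmogorov–Smirnov statistic. The sketch below uses Gaussian stand-ins for the “real” and “generated” samples (purely illustrative data, not from any cited study), and shows how a mis-calibrated generator is flagged by a larger gap:

```python
import numpy as np

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    real, synthetic = np.sort(real), np.sort(synthetic)
    grid = np.concatenate([real, synthetic])       # sup is attained at sample points
    cdf_r = np.searchsorted(real, grid, side="right") / len(real)
    cdf_s = np.searchsorted(synthetic, grid, side="right") / len(synthetic)
    return float(np.max(np.abs(cdf_r - cdf_s)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 2000)
good_syn = rng.normal(0.0, 1.0, 2000)   # generator matches the real distribution
bad_syn = rng.normal(2.0, 1.0, 2000)    # mean-shifted, mis-calibrated generator
assert ks_statistic(real, good_syn) < ks_statistic(real, bad_syn)
```

In practice one would compare the statistic against a significance threshold, and for scientific data augment it with the domain-specific, physics-informed constraints noted above.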

Data fusion corresponds to the integration of information acquired from multiple sources (sensors, databases, manual trials, experiment sites, etc.). Many challenges exist that are highly dependent on the complexity of application environments and the scales of sensors and systems: for example, data imperfections and inconsistencies due to noise in sensors and environments, the alignment or registration of data from different sensors or machines with different frames, and more, as discussed in the data fusion and ML survey by Meng et al. [Meng2020ASO]. With open data, an ongoing challenge is finding true values from conflicting information when there are a large number of sources, among which some may copy from others [Dong2013DataFR]. Large-scale computer simulations of physical phenomena must be periodically corrected using additional sources of information, including snapshots of reality and model forecasts. The fusion of observations into computer models is known as data assimilation and is extensively used in many areas, such as atmospheric prediction, seismology, and energy applications. Assimilating multiple sources of information into computer simulations helps maintain an accurate representation of the evolution of the physical phenomena and promotes realistic long-term prediction. Yet developing efficient and scalable assimilation algorithms capable of handling complex environments with nonlinear and non-Gaussian behavior is still an open research area, which is strongly aligned with several SI research directions we’ve discussed: optimal design of experiments and sensor placement, physics-infused optimization algorithms, and inverse problem solving.
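The essence of one assimilation step can be illustrated with a scalar Kalman analysis update, which blends a model forecast with a noisy observation in proportion to their uncertainties (the numbers below are purely illustrative; real assimilation systems generalize this to high-dimensional, nonlinear, and non-Gaussian settings, which is exactly where the open problems lie):

```python
def assimilate(forecast, forecast_var, obs, obs_var):
    """Scalar Kalman analysis step: weight forecast vs. observation by precision."""
    gain = forecast_var / (forecast_var + obs_var)      # Kalman gain in [0, 1]
    analysis = forecast + gain * (obs - forecast)       # corrected state estimate
    analysis_var = (1.0 - gain) * forecast_var          # uncertainty shrinks
    return analysis, analysis_var

# A relatively confident observation pulls the state toward the measurement.
state, var = assimilate(forecast=10.0, forecast_var=4.0, obs=12.0, obs_var=1.0)
```

With `forecast_var=4.0` and `obs_var=1.0` the gain is 0.8, so the analysis lands at 11.6 with variance 0.8 — repeated over time steps, this keeps the simulation tracking reality.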

Openness – “open science” can be defined as a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools [Burgelman2019OpenSO]. Hesitancy to share can come from several places, including security constraints, concerns over intellectual property (IP), and fear of being scooped by other researchers. Openness for data can be thought of as making data findable, accessible, interoperable, and reusable – open data is thus FAIR data. Open data also invites concerns with ethics and accountability, in addition to those prevalent in AI and scientific computing broadly; see Hutchinson et al. [Hutchinson2021TowardsAF] and references therein for details. We further suggest that practical openness requires industry- or domain-specific standards that are agreed upon and adhered to – from appropriate metrics and storage formats to procedures for gathering and labeling data – in order to resolve challenges in data acceptance, sharing, and longevity. These are also essential for multi-source databases, for instance the IDA (Image and Data Archive, ida.loni.usc.edu) with over 150 neuroimaging studies, most of which collect data from multiple clinical sites and over several years, such that shared, agreed-upon data procedures are necessary for open data.
Data pipelines and platforms for SI
Along with the SI motifs and their integrations that we’ve presented, the advancement of SI calls for cohesive data platforms that exemplify the above principles. The report on advancing nuclear fusion with ML research [Humphreys2020AdvancingFW] urges that “the deployment of a Fusion Machine Learning Data Platform could in itself prove a transformational advance, dramatically increasing the ability of fusion science, mathematics, and computer science communities to combine their areas of expertise in accelerating the solution of fusion energy problems”. Many fields in physical and life sciences could make the same assertion, including all the domains we’ve discussed in the context of SI. A recent example of such a platform is Microsoft’s “Planetary Computer”, an open-source cloud platform for multi-petabyte global environmental data: planetarycomputer.microsoft.com. Largely an Earth observation catalogue and development hub, its datasets exemplify the above list of principles, and the ecosystem is designed to cultivate interdisciplinary and open science. The open-source code should eventually include pipelines for data processing and engineering, including modules for the principled use of real and synthetic data. Hutchinson et al. [Hutchinson2021TowardsAF] provide a thorough regimen for ML dataset lifecycles and supporting software infrastructure, while the design of SI data platforms should also consider how new scientific workflows like Fig. 37 can have different requirements and challenges from ML in general. Consider that each phase of such a workflow may have several implicit dataflows, each executing a set of chained data transformations consuming and producing datasets. A challenge is that these phases are often heterogeneous in a few ways. First, software, hardware, and development practices are inconsistent between collaborating people, teams, and organizations. Second, various fields call for domain-specific needs such as specialized data types and structures. And mainly, most of the available software has grown through accretion, where code addition is driven by the need to produce domain science papers without adequate planning for code robustness and reliability [Grannan2020UnderstandingTL]. The Department of Energy’s 2019 AI for Science Report states “Today, the coupling of traditional modeling and simulation codes with AI capabilities is largely a one-off capability, replicated with each experiment. The frameworks, software, and data structures are distinct, and APIs do not exist that would enable even simple coupling of simulation and modeling codes with AI libraries and frameworks.” The interdisciplinary nature of simulation intelligence, along with the many possible motif integrations, offers an opportunity for SI data platforms to address these concerns.
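As a toy sketch of the chained data transformations just mentioned (the stage names are hypothetical), each dataflow can be viewed as a composition of stages, each consuming one dataset and producing the next:

```python
from functools import reduce

def pipeline(*stages):
    """Compose dataflow stages; each consumes a dataset and produces a new one."""
    return lambda dataset: reduce(lambda d, stage: stage(d), stages, dataset)

def drop_missing(records):
    """Hypothetical cleaning stage: remove missing entries."""
    return [r for r in records if r is not None]

def rescale(records):
    """Hypothetical unit-conversion stage."""
    return [2.0 * r for r in records]

run = pipeline(drop_missing, rescale)
assert run([1.0, None, 3.0]) == [2.0, 6.0]
```

The heterogeneity problem above arises when each stage of such a chain is owned by a different team, written against different software and hardware, and expects different data formats.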
The implementation and engineering of such data platforms is far from trivial [Chervenak2000TheDG, Simmhan2009BuildingRD]. First consider that the curation and preparation of data are often the most time-consuming and manual tasks, which end up being disconnected and distributed in scientific endeavors. For the former, much manual and highly specialized work is performed by domain scientists to gather the relevant scientific data, and/or to use ML to achieve automated knowledge extraction from scientific data. Even then there is a significant gap between raw (or even cleaned) scientific data and useful data for consumption by ML and other parts of the workflow, including simulators. Moreover, most scientific domains have specialized data formats that may require industry-specific software and domain-specific knowledge to inspect, visualize, and understand the data. See, for example, Simmhan et al. [Simmhan2009BuildingRD], who discuss pipeline designs for reliable, scalable data ingestion in a distributed environment in order to support the Pan-STARRS repository, one of the largest digital surveys, which accumulates 100TB of data annually to support 300 astronomers. Blamey et al. [Blamey2020RapidDO] also provide an informative walk through the challenges and implementations of cloud-native intelligent data pipelines for scientific data streams in life sciences. Deelman et al. [Deelman2018TheFO] advocate for co-located (or in situ) computational workflows, to minimize inefficiencies with distributed computing. One motivation for co-located workflows is to minimize the cost of data movement by exploiting data locality, operating on data in place. Other supporting reasons include supporting human-in-the-loop interactive analysis, and capturing provenance for interactive data analysis and visualization, as well as any computational steering that results.^v Data pipelines need to provide efficient, scalable, parallel, and resilient communication through flexible coupling of components with different data and resource needs, and should utilize extreme-scale architectural characteristics such as deep memory/storage hierarchies and high core counts.

^v Computational steering methods provide interactivity with an HPC program that is running remotely. They are most commonly used to alter parameters in a running program, receive real-time information from an application, and visualize the progress of an application. The idea is to save compute cycles by directing an application toward more productive areas of work, or at a minimum stopping unproductive work as early as possible. In the numerical simulation community the term refers more specifically to the practice of interactively guiding a computational experiment into some region of interest.
Technologies in large-scale data processing and engineering have flourished in the previous decades of “big data”, and even more so with the accelerating adoption of AI and ML. It can be expected that SI will be a similar accelerant of data technologies and ecosystems. Large-scale computing (i.e., peta- to exa-scale simulation) imposes significant challenges in engineering data pipelines [Fiore2018OnTR, Chang2019NISTBD], in particular for the HPC community, which we discuss in the next section.
Accelerated computing
High performance computing (HPC) plays a critical role in simulation systems and scientific computing. HPC systems are mainly used in domains with complex data-intensive applications, applying advanced mathematics and sophisticated parallel algorithms to solve large-scale problems appearing in physics, medicine, mathematics, biology, and engineering [Chen2013SynergisticCI]. Established supercomputer codes (where a supercomputer is a system optimized for HPC applications) build on many years of research in parallelization [Mezmaz2007AGB, Melab2018MulticoreVM, Oger2016OnDM] and numerical methods such as linear and nonlinear solvers, algorithms for continuous and discrete variables, automatic mesh refinement [Bryan2014ENZOAA, Hopkins2015ANC] and others – for example, the PETSc (petsc.org) suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations [petscuserref]. Advances in ML and AI have impacted supercomputers in significant ways: enabling lower precision data types, leveraging tensor operations, and adopting distributed machine learning frameworks. Further, it is becoming easier to solve traditional problems in a non-traditional way, leveraging neural networks. For instance, ML has proven highly successful for analyzing large-scale simulation data produced by supercomputers, such as extracting climate extremes [Kurth2018ExascaleDL], and towards efficient surrogate models or emulators based on HPC ensemble simulations [Lu2019EfficientSM]. One prominent example of a problem requiring massive computing resources is molecular dynamics (MD). “Anton” [7012191] is a family of supercomputers designed from scratch to solve precisely this one problem; recall from Fig. 5 that MD is one subset of the broader spatiotemporal space needed to simulate dynamics of our world.
Recently, researchers at Argonne National Laboratory developed an AI-driven simulation framework for solving the same MD problem, yielding a 50x speedup in time-to-solution over the traditional HPC method [10.1145/3468267.3470578]. And some work suggests these approaches need not be mutually exclusive: it has been shown in the context of computational fluid dynamics that traditional HPC workloads can be run alongside AI training to provide accelerated data-feed paths [stencil]. Co-locating workloads in this way may be necessary for petascale (approaching exascale) scientific simulations. Compute aside, supercomputers or large clusters (distributed compute nodes) are the only way to host some of the largest currently available models: as memory requirements grow into trillions of parameters, partitioning the model across nodes becomes unavoidable. In particular, ensemble simulations can only be done on HPC systems. That said, a unified, two-way coupling between such large-scale simulators and ML (which calls for data assimilation at each time step, or updating parameters and state variables) remains a significant challenge due to the complexity of the system’s mathematical equations (which are often nonlinear) as well as the existing input-output (I/O) structures that are often built in a black-box way.
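The surrogate idea referenced above can be illustrated in a few lines: fit a cheap model to a small ensemble of expensive runs, then answer new queries without re-invoking the simulator. Here the “simulator” is a stand-in cubic polynomial, not a real HPC code, and the surrogate is a simple least-squares polynomial fit:

```python
import numpy as np

def expensive_simulator(x):
    """Stand-in for a costly simulation code: y = x^3 - x."""
    return x**3 - x

# Train a polynomial surrogate on a small ensemble of simulator runs.
x_train = np.linspace(-2.0, 2.0, 25)
y_train = expensive_simulator(x_train)
coeffs = np.polyfit(x_train, y_train, deg=3)   # least-squares fit
surrogate = np.poly1d(coeffs)

# The surrogate now answers new queries without re-running the simulator.
err = abs(surrogate(1.5) - expensive_simulator(1.5))
assert err < 1e-6
```

Real surrogates (e.g., neural emulators trained on HPC ensembles) trade this trivially small approximation error for orders-of-magnitude speedups on far harder response surfaces, and validating that trade-off is exactly the benchmarking question raised later in this section.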
Many HPC experts suggest that the highest-performing computing systems of the future will be specialized for the application at hand, customized in the hardware itself and in the algorithms and abstractions underlying the software.^vi Both of these considerations – distributed compute and HW-SW specialization, not to mention heterogeneous systems incorporating a variety of specialized components – can be powerful, but they make application development more difficult [Stoller2019FutureDF]. At the hardware level, Fiore et al. [Fiore2018OnTR] suggest new processors, memory hierarchies, and interconnects have to be taken into consideration when designing HPC systems to succeed existing petascale systems. Some works have stated that silicon photonic interconnects could provide large bandwidth densities at high energy efficiencies to help solve some issues related to the increased parallelism and data intensity and movement [Rumley2017OpticalIF, Cheng2018RecentAI]. At the software level [Grannan2020UnderstandingTL], compilers, libraries, and other middleware should be adapted to these new platforms – perhaps in automated ways enabled by machine programming and program synthesis (e.g. in “Arbor” for neuronal network simulations on HPC [Akar2019ArborA]).

^vi A different perspective can be drawn from the example of “Folding@home” [Larson2009FoldingHomeAG], a distributed computing project for running protein-dynamics simulations on networks of volunteers’ personal computers, which consist of heterogeneous hardware components. The project aims to help scientists develop new therapeutics for a variety of diseases by simulating the processes of protein folding and the movements of proteins at scale.
At the applications level, algorithms, programming models, data access, analysis, visualization tools, and overall scientific workflows [Deelman2018TheFO] have to be redesigned when developing applications for petascale (potentially exascale) computing infrastructures – for instance, HPC systems (and teams) would do well to adopt the cycles and tooling of the DevOps and newer MLOps [Renggli2021ADQ] ecosystems, although unique characteristics related to HPC-level data storage and resource sharing will present new challenges. The interdisciplinary nature of these computing topics is paramount: only through collaboration and synergy among domain science, mathematics, computer science and engineering, and modeling and simulation can an optimal system and solution be built. As with the integration of SI motifs and the broader SI stack, more comprehensive and holistic approaches are required to tackle the many complex real-world challenges.
An interesting and active use-case for HPC is quantum computing, where researchers use supercomputers to simulate quantum circuits [haner2016high, smelyanskiy2016qhipster, villalonga2020establishing, liu2021closing, liu2021redefining] and quantum computers more generally. The invention of quantum computing was motivated [feynman1982simulating] by the fact that quantum computers are exponentially (in the number of quantum bits, circuit depth, or both) expensive to simulate – this both gives QCs their power and also makes it very challenging to simulate even intermediate-scale QCs [preskill2018quantum]. Classical simulations of QCs are important in at least two areas: quantum-algorithm design (enabling empirical testing and verification of new quantum algorithms) and quantum-system design (enabling the simulation of both quantum hardware and layers of the quantum stack [jones2012layered] above hardware but below algorithms). Accelerated and accurate simulation of quantum computers will play an important role in the development of practical, useful quantum computers. In parallel to the growth of conventional HPC acceleration of QC over the past several years, there has been an explosion of interest in the application of machine-learning tools to problems in quantum physics in general [carleo2017solving, torlai2018neural, carrasquilla2020machine] and to the simulation of QC in particular [carrasquilla2021probabilistic]. We are already seeing SI approaches applied to the domain of QC. Another potential application of HPC to QC, essentially separate from the simulation use-case, is performing classical co-processing to aid in the execution of error-correction protocols for quantum hardware. Already there have been efforts to apply neural networks to help with error decoding [carrasquilla2020machine], and it seems likely that quantum error correction will ultimately be implemented at scale using a combination of HPC and AI methods.
Finally, the practical uses of QC typically involve both quantum and classical processing [nielsen2011quantum], with many use-cases requiring very substantial amounts of classical computation that themselves warrant the use of HPC resources. As a concrete example, quantum-chemistry simulations are anticipated to involve a combination of both classical and QC computations [malone2021towards] (which we’ve alluded to earlier in several motif examples).
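The exponential cost of classically simulating QCs is visible even in a minimal statevector simulator, where an n-qubit state requires 2^n complex amplitudes. The sketch below is a textbook construction (not any of the cited simulation codes), preparing a uniform superposition with Hadamard gates:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2.0)   # Hadamard gate

def apply_1q(state, gate, target, n):
    """Apply a single-qubit gate to qubit `target` in an n-qubit statevector."""
    psi = state.reshape([2] * n)                       # one axis per qubit
    psi = np.tensordot(gate, psi, axes=([1], [target]))
    psi = np.moveaxis(psi, 0, target)                  # restore qubit ordering
    return psi.reshape(-1)

n = 3
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                       # start in |000>
for q in range(n):                   # Hadamard on every qubit
    state = apply_1q(state, H, q, n)
# Uniform superposition: each of the 2^n basis states has probability 1/2^n.
assert np.allclose(np.abs(state)**2, 1 / 2**n)
```

Doubling `n` doubles nothing in the code but doubles the exponent: at n = 50 the statevector alone needs petabytes of memory, which is precisely why supercomputers (and cleverer tensor-network contractions) are needed for intermediate-scale circuits.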
As the field makes advances in the specific motifs and SI overall, there are several open questions to keep an eye on regarding computation:

Will advances in the motifs lead to new breeds of algorithms that require radically new systems and/or architectures? Perhaps these can be implemented as conventional ASICs (Application-Specific Integrated Circuits), or will new substrates beyond CMOS electronics [shalf2015computing] be preferred?

Will SI inverse design produce novel compute architectures and substrates?

How will the open-endedness paradigm affect computing systems (memory allocation, load balancing, caching, etc.)?

How will Machine Programming influence hardware-software co-design for simulation intelligence (and high-performance computing)?

In addition to the traditional goals of performance and scalability (latency, throughput, etc.), energy efficiency, and reliability, how will the advancing technologies build abstractions to facilitate reasoning about correctness, security, and privacy [Stoller2019FutureDF]?

Will SI-based surrogate modeling enable a landscape shift where computationally lightweight surrogates running on a small number of GPUs replace massive supercomputing workloads (particularly in physical and life sciences simulations)? What will be the metrics and benchmarks to quantify the tradeoffs between such orthogonal computation approaches?
To the first question, there has already been a resurgence of interest in novel hardware platforms for both simulation [athale2016optical, estakhri2019inverse] and machine learning [tanaka2019recent, hughes2019wave, xiao2020analog, wetzstein2020inference, markovic2020physics, wright2021deep, furuhata2021physical]. Although CMOS platforms are still used essentially exclusively due to their performance and reliability, we anticipate that special-purpose hardware built on non-CMOS physical substrates may gain traction over the next decade. It remains to be seen if and how appropriate benchmarks are developed for comparing different compute substrates, which may ultimately be infeasible in a general, application-agnostic way. This echoes a significant benchmarking issue we discussed earlier in the context of physics-infused ML: it is customary that distributed training algorithms on HPC platforms are benchmarked using idealized neural network models and datasets [Huerta2020ConvergenceOA], which does not impart any insights regarding the actual performance of these approaches in real deployment scenarios – i.e., when using domain-inspired AI architectures and optimization schemes for data-driven discovery in the context of realistic datasets (which are noisy, incomplete, and heterogeneous), and when using purpose-built HPC stacks that can have application-specific customizations up and down the layers we described above.
Domains for use-inspired research
Much of the work to develop the motif methods and their many integrations is foundational SI research: understanding the mechanisms underlying thought and intelligent behavior and their implementation in machines. The main descriptions of each motif fall in this category, from the mathematics of causality to multi-agent cooperation. Pursuing foundational research in sync with use-inspired research is essential, in SI and broadly in simulation, AI, and data technologies. Use-inspired research is basic research with uses for society in mind, and thus a main driver of breakthrough innovations. Sometimes referred to as “Pasteur’s quadrant”,^vii the many example motif applications we’ve discussed align with this division of research and development. The National Science Foundation similarly describes their AI Institutes as operating in cycles of foundational and use-inspired AI research: nsf.gov/cise/ai.jsp.

^vii Pasteur’s quadrant is a classification of scientific research projects that seek fundamental understanding of scientific problems while also having immediate use for society. Louis Pasteur’s research is thought to exemplify this type of method, which bridges the gap between “basic” and “applied” research.
Table 2 provides a high-level, non-exhaustive look at the domains for use-inspired research per motif. Note that many methods and use-cases arise from the integrations of motifs, as discussed throughout the text, which are not necessarily elucidated in this high-level table view.
(Column abbreviations: MP = machine programming; DP = differentiable programming; PP = probabilistic programming; MPMS = multi-physics multi-scale; SBI = simulation-based inference; SM = surrogate modeling & emulation; CR = causal reasoning; ABM = agent-based modeling; OE = open-endedness.)

| Domain | MP | DP | PP | MPMS | SBI | SM | CR | ABM | OE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HE Physics – theoretical |   | ✓ | ✓ | ✓ | ✓ |   | ✓ |   | ✓ |
| HE Physics – experimental | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |   |   |
| Complexity |   |   |   | ✓ |   |   | ✓ | ✓ | ✓ |
| Synthetic biology | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |   |   |
| Chemistry |   | ✓ | ✓ | ✓ | ✓ | ✓ |   |   |   |
| Materials |   | ✓ | ✓ | ✓ | ✓ | ✓ |   |   |   |
| Medicine |   | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |   |   |
| Systems biology |   | ✓ | ✓ | ✓ |   | ✓ | ✓ | ✓ | ✓ |
| Neuro and Cog sciences | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Energy – nuclear fission & fusion |   | ✓ | ✓ | ✓ | ✓ | ✓ |   |   | ✓ |
| Energy – materials & storage |   | ✓ | ✓ | ✓ | ✓ | ✓ |   |   | ✓ |
| Manufacturing | ✓ | ✓ |   | ✓ |   | ✓ |   |   | ✓ |
| Energy systems |   |   |   | ✓ |   | ✓ | ✓ | ✓ | ✓ |
| Transportation & infrastructure |   |   |   | ✓ |   | ✓ | ✓ | ✓ | ✓ |
| Agriculture |   | ✓ |   | ✓ |   | ✓ | ✓ | ✓ | ✓ |
| Ecology |   | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Socioeconomics & markets |   | ✓ |   | ✓ |   |   | ✓ | ✓ | ✓ |
| Finance | ✓ | ✓ | ✓ |   |   |   | ✓ | ✓ | ✓ |
| Geopolitics |   |   |   | ✓ |   |   | ✓ | ✓ | ✓ |
| Defense | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Climate |   | ✓ |   | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Earth systems |   | ✓ |   | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Astrophysics & cosmology |   | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |   | ✓ |
Brain simulation
We’ve discussed several neuroscience (and cognitive science) use-cases of SI motifs, notably with simulation-based inference and probabilistic programming (as well as “working memory” and inverse design). The intersections of neuroscience and simulation, including big data and machine learning, have had profound effects on the various fields for decades. The review by Fan & Markram [Fan2019ABH] provides a thorough overview that is out of scope for the current paper. The authors detail the last century of brain research, building up to today’s supercomputer whole-brain models [Markram2006TheBB], with a focus on the recent phases of “simulation neuroscience” and the “multiscale brain”. The former describes a collaborative, integrated research setting where in silico and in situ experiments interplay to drive mathematical and predictive models (as well as produce hypotheses), all leveraging a wealth of biological datasets on a cohesive, open platform. The latter describes the need to causally link molecular, cellular, and synaptic interactions to brain function and behavior. We aim for SI to play critical roles in these phases of brain research for decades to come.
Simulations provide the crucial link between data generated by experimentalists and computational (as well as theoretical) neuroscience models; comparing predictions derived from such simulations with experimental data will allow systematic testing and refinement of models, and also inform theory in ways that will directly influence neuro-inspired AI (e.g., as in neural circuitry models for computer vision [George2017AGV], working memory [Hawkins2016WhyNH, Soto2008AutomaticGO], predictive coding [Bastos2012CanonicalMF, Friston2010TheFP], and others such as spiking NNs [Taherkhani2020ARO], language of thought [Ullman2012TheoryLA], and neuromorphic computing [Markovi2020PhysicsFN]) – see Refs. [Ullman2019UsingNT, George2020FromCT, Hassabis2017NeuroscienceInspiredAI] for more discussion.
Discussion
In addition to the nine SI motifs, a number of advances in ML also provide encouraging directions for simulation-based sciences and ML, which we describe below as “honorable mention” methods in scientific AI and ML. We then close with a discussion of the science of science: methods, breakthroughs, and the role for SI going forward.
Honorable mention motifs
Geometric and equivariant neural networks
In many areas of physics we encounter symmetries, described by equivariance: the property of being independent of the choice of reference frame. A function is equivariant to a transformation if applying that transformation to its inputs results in an equivalent transformation of its outputs. A special case of equivariance that has been widely exploited in machine learning is invariance, where any transformation of the inputs results in an identity transformation of the outputs: the architectures of CNNs and RNNs make them invariant to image translations and sequence position shifts, respectively. Neural networks with equivariance properties are thus an intriguing direction of research. Further, these methods share mathematics with category theory, which may help define equivariant architectures beyond CNNs and RNNs. To this end, Cohen & Welling [Cohen2016GroupEC] generalize CNNs to G-CNNs, where generalized convolutions replace the position shifts of the discrete convolution operation with a general group of transformations (such as rotations and reflections, in addition to translations). More recent developments by Cohen et al. [Cohen2019GaugeEC] extend neural network equivariance beyond group symmetries, developing neural networks that consume data on a general manifold and depend only on the intrinsic geometry of the manifold.
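To make the group-convolution idea concrete, here is a minimal numpy sketch (our own illustrative example, not code from [Cohen2016GroupEC]) of a lifting convolution for the four-fold rotation group C4, with a numerical check of the equivariance property: rotating the input rotates each feature map and cyclically permutes the rotation channels.

```python
import numpy as np

def conv2d(img, k):
    # "valid" cross-correlation of a 2-D image with a 2-D kernel
    kh, kw = k.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def lifting_gconv(img, base_kernel):
    # Lifting convolution for the rotation group C4: one output channel
    # per rotated copy of the base kernel.
    return np.stack([conv2d(img, np.rot90(base_kernel, r)) for r in range(4)])

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

out = lifting_gconv(img, kernel)
out_rot = lifting_gconv(np.rot90(img), kernel)

# Equivariance: rotating the input rotates every feature map and cyclically
# shifts the rotation channels.
expected = np.stack([np.rot90(out[(r - 1) % 4]) for r in range(4)])
assert np.allclose(out_rot, expected)
```

Stacking further group convolutions on these four channels, and pooling over the group at the end, is how such networks produce rotation-invariant predictions.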
Graph-structured data, one of the more prevalent types of data used in ML, presents many opportunities and challenges for exploiting symmetries. Graph neural networks (GNNs) are an extremely active area in the ML community for learning and predicting on flexible, graph-based inputs, with many applications in chemistry, materials science, medicine, and other fields [Fung2021BenchmarkingGN, Wu2019ACS]. Because of their scalability to large graphs, graph convolutional neural networks, or message-passing networks, are widely used [Haan2020NaturalGN]. Yet these networks, which pass messages along the edges of the graph and aggregate them in a permutation-invariant way, are fundamentally limited in their expressivity [Xu2019HowPA]. Equivariant graph networks are more expressive [Maron2019InvariantAE], thanks to generalization properties rooted in category theory. Recent work by de Haan et al. [Haan2020NaturalGN] leverages category theory to derive a more expressive class of “natural graph networks” that can describe maximally flexible global and local graph networks – some of this relationship is discussed in [Shiebler2021CategoryTI]. Specializing GNN architectures for the task at hand has delivered state-of-the-art results in areas like drug discovery [Stokes2020ADL]. One can view such architectural specializations as inductive biases, specifically on the structure of the input (i.e., the input’s dimensions are related such that they form a graph structure). This has taken shape as a class of methods called geometric deep learning, where encoding the symmetries present in the problem domain is required for building neural network models of geometric objects. Said another way, this direction of work aims to generalize equivariant and invariant neural network operators to non-Euclidean domains (e.g., graphs and manifolds) [Bronstein2017GeometricDL]. Bronstein et al. [Bronstein2017GeometricDL] illustrate that convolution commutes with the Laplacian operator, and they use this insight to define both spatially and spectrally motivated generalizations of convolution for non-Euclidean domains. In particular, equivariant neural networks have architectures designed around how vectors transform under symmetry operations (on reference frames) [Cohen2019GaugeEC]. By constraining NNs to be symmetric under these operations, we’ve seen successful applications in areas like molecular design [Simm2020ReinforcementLF] and quantum chemistry [Qiao2021UNiTEUN].
There is also recent exploration in tying these structural constraints to the classes of structural causal models (SCMs) we discussed earlier, which could lead to theoretical foundations for how GNNs can be used to perform causal computations [Zecevic2021RelatingGN]. This is a robust area of ML and applied mathematics research – the interested reader can explore [Gerken2021GeometricDL] for a thorough overview, and [Esteves2020TheoreticalAO] for theoretical foundations of group equivariance and NNs.
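The permutation-invariant aggregation at the core of message passing can be sketched in a few lines (an illustrative numpy example of the general scheme, not any particular published architecture): each node sums its neighbors’ features, a shared linear map and nonlinearity are applied, and a sum readout yields a graph embedding that is unchanged under node relabeling.

```python
import numpy as np

def message_passing_layer(A, H, W_self, W_nbr):
    # One round: each node aggregates neighbor features by a permutation-
    # invariant sum, then applies shared linear maps and a ReLU.
    M = A @ H                      # sum of neighbor features
    return np.maximum(H @ W_self + M @ W_nbr, 0.0)

def graph_readout(A, H, weights, rounds=2):
    W_self, W_nbr = weights
    for _ in range(rounds):
        H = message_passing_layer(A, H, W_self, W_nbr)
    return H.sum(axis=0)           # permutation-invariant graph embedding

rng = np.random.default_rng(1)
n, d = 5, 4
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T     # random undirected adjacency matrix
H = rng.normal(size=(n, d))
weights = (rng.normal(size=(d, d)), rng.normal(size=(d, d)))

emb = graph_readout(A, H, weights)
# Relabeling the nodes with a permutation P leaves the readout unchanged.
P = np.eye(n)[rng.permutation(n)]
emb_perm = graph_readout(P @ A @ P.T, P @ H, weights)
assert np.allclose(emb, emb_perm)
```

This same invariance is exactly what limits expressivity: two non-isomorphic graphs can produce identical sums of messages, which is the motivation for the more expressive equivariant constructions above.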
Learning mesh-based simulators
Software for modeling physical bodies and systems typically uses mesh-based representations for the many geometries of surfaces and volumes. Meshes are used to solve the underlying differential equations in the context of finite element analysis (FEA), computational fluid dynamics (CFD), and related statics and dynamics solvers. These present significant computational bottlenecks, which are only weakly mitigated by adaptive meshes that are finer in areas needing high precision and coarser elsewhere. New approaches from the machine learning community aim to alleviate such bottlenecks by learning the dynamics of deforming surfaces and volumes rather than the expensive discretization schemes [blazek2015principles] – that is, machine-learned emulators for CFD and FEA. Pfaff et al. [Pfaff2021LearningMS] develop a GNN model that can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. They show early results on common physical dynamics simulations of a deforming plate, cloth in a wind field, and flow around cylinders and airfoils. Similarly using learned message passing to compute dynamics, Sanchez-Gonzalez et al. [SanchezGonzalez2020LearningTS] implement another GNN-based approach for efficiently simulating particle-based systems (and they suggest the same approach may apply to data represented as meshes as well). Both works point to the value of differentiable simulators, providing gradients that can be used for design optimization, optimal control tasks, and inverse problems.
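The interface of such learned simulators reduces to a rollout loop: a trained network predicts per-node (or per-particle) accelerations, and a standard integrator advances the state. The sketch below is purely illustrative – the trained GNN is replaced by a hypothetical stand-in function, since training is omitted.

```python
import numpy as np

def learned_acceleration(pos, vel):
    # Stand-in for a trained GNN: a damped spring pulling particles to
    # the origin. A real emulator would compute this via message passing.
    return -0.5 * pos - 0.1 * vel

def rollout(pos, vel, dt=0.05, steps=100):
    # The emulator loop: the model predicts accelerations and a simple
    # semi-implicit Euler integrator advances the state.
    traj = [pos]
    for _ in range(steps):
        vel = vel + dt * learned_acceleration(pos, vel)
        pos = pos + dt * vel
        traj.append(pos)
    return np.stack(traj)

pos0 = np.array([[1.0, 0.0], [0.0, 1.0]])  # two particles in 2-D
vel0 = np.zeros_like(pos0)
traj = rollout(pos0, vel0)
```

Because every step is differentiable, gradients of any trajectory-level objective with respect to initial conditions or model parameters can flow back through the whole rollout – the property that enables the design optimization and inverse problems mentioned above.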
Differentiable rendering
Algorithms from the computer graphics community can have intriguing implications for ML, for instance the generation of synthetic training data for computer vision [wood2021fake, behl2020autosimulate, jaipuria2020deflating] and the class of methods referred to as “vision as inverse graphics” [Romaszko2017VisionasInverseGraphicsOA]. Rendering in computer graphics is the process of generating images of 3D scenes defined by geometry, textures, materials, lighting, and the properties and positions of one or more cameras. Rendering is a complex process and its differentiation is not uniquely defined, yet differentiable rendering (DR) techniques based on autodiff can tackle such an integration for end-to-end optimization by obtaining useful gradients [Kato2020DifferentiableRA]. A driving interest from the ML perspective is computer vision models (largely deep neural networks and PGMs) that understand 3D information from 2D image data. One expectation is improved vision applications with a reduced need for expensive 3D data collection and annotation. One general approach is an optimization pipeline with a differentiable renderer, where the gradients of an objective function with respect to the scene parameters and known ground truth are calculated. Another common approach is self-supervision based on differentiable rendering, where the supervision signal is provided in the form of image evidence and the neural network is updated by backpropagating the error between the image and the rendering output. See Kato et al. [Kato2020DifferentiableRA] for an overview of DR algorithms, data representations, evaluation metrics, and best practices.
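The first approach can be caricatured with a toy, fully illustrative example: here the “renderer” produces a 1-D image of a soft blob from a single scene parameter (its position), and gradient descent through the renderer recovers that parameter from a target image. Real DR systems operate on 3-D scenes and use autodiff rather than the hand-derived gradient below.

```python
import numpy as np

def render(center, width=10):
    # Toy differentiable "renderer": a 1-D image of a soft blob at `center`.
    xs = np.arange(width, dtype=float)
    return np.exp(-0.5 * (xs - center) ** 2)

def loss_and_grad(center, target, width=10):
    # MSE between rendered and target images, with the analytic gradient
    # d(img)/d(center) = img * (xs - center) chained through the loss.
    xs = np.arange(width, dtype=float)
    img = render(center, width)
    resid = img - target
    grad = 2.0 * np.sum(resid * img * (xs - center)) / width
    return np.mean(resid ** 2), grad

target = render(6.3)   # image of the ground-truth scene parameter 6.3
center = 4.0           # initial guess
for _ in range(500):
    loss, grad = loss_and_grad(center, target)
    center -= 0.5 * grad
```

After optimization, `center` converges to the ground-truth value that produced the target image, which is the essence of “vision as inverse graphics”: inverting a forward rendering model via its gradients.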
Normalizing and Manifold Flows
Although they are very expressive generative models, a main limitation of the popular VAE and GAN approaches is their intractable likelihoods (probability densities), which make inference challenging. They essentially represent a lower-dimensional data manifold that is embedded in the data space, and learn a mapping between the spaces. A class of generative models called normalizing flows [rezende2015variational, Papamakarios2019NormalizingFF, kobyzev2020normalizing] is instead based on a latent space with the same dimensionality as the data space and a diffeomorphism; their tractable density permeates the full data space and is not restricted to a lower-dimensional surface [Brehmer2020FlowsFS]. Normalizing flows describe a general mechanism for defining expressive probability distributions by transforming a simple initial density into a more complex one through a sequence of invertible (bijective) transformations [rezende2015variational, Papamakarios2019NormalizingFF], providing richer, potentially more multimodal distributions for probabilistic modeling. In theory the expressiveness is unlimited: Papamakarios et al. [Papamakarios2019NormalizingFF] show that for any pair of well-behaved distributions – a target and a base – there exists a diffeomorphism that can turn the base into the target; the two must have the same dimensionality, and every sub-transformation along the way must preserve dimensionality. With distributions of larger dimension there is a larger computational cost per sub-transform; various classes of composable transformations exist, such as residual and radial flows, with known computational complexities. Multiscale architectures can be used to mitigate the growing complexity by clamping most dimensions in the chain of transforms, thus applying all steps to only a small subset of dimensions, which is less costly than applying all steps to all dimensions [Dinh2017DensityEU].
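The change-of-variables computation at the heart of a flow can be shown with a single affine transformation (a deliberately minimal sketch; practical flows compose many learned, more expressive transformations). The same pair of operations appears in every flow: push base samples forward to generate data, and invert the map plus apply the Jacobian correction to evaluate an exact density.

```python
import numpy as np

def affine_flow_forward(u, scale, shift):
    # One invertible transformation: x = scale * u + shift
    return scale * u + shift

def log_density(x, scale, shift):
    # Change of variables: log p_x(x) = log p_u(u) - log|det J|,
    # with u = (x - shift) / scale and log|det J| = sum(log|scale|).
    u = (x - shift) / scale
    log_pu = -0.5 * np.sum(u ** 2) - 0.5 * len(u) * np.log(2 * np.pi)
    return log_pu - np.sum(np.log(np.abs(scale)))

scale = np.array([2.0, 0.5])
shift = np.array([1.0, -1.0])

# Sampling direction: push base samples u ~ N(0, I) through the flow.
rng = np.random.default_rng(0)
samples = affine_flow_forward(rng.standard_normal((1000, 2)), scale, shift)

# Density direction: this flow's density matches N(shift, diag(scale**2)).
x = np.array([1.5, 0.0])
expected = sum(-0.5 * ((xi - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
               for xi, m, s in zip(x, shift, scale))
assert np.isclose(log_density(x, scale, shift), expected)
```

Replacing the affine map with a chain of invertible neural transformations leaves both directions intact, which is what makes flows attractive for likelihood-based inference.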
The same approach can be utilized for multiscale modeling, which is a natural modeling choice in granular data settings such as pixels and waveforms [Papamakarios2019NormalizingFF], and in many scientific environments as we’ve discussed. A class of flows particularly relevant to SI may be continuous-time flows, where the transformation from noise to data is described by an ordinary differential equation (rather than a series of discrete steps) [Salman2018DeepDN, Grathwohl2019FFJORDFC, Durkan2019NeuralSF]. There are also equivariant manifold flows [Katsman2021EquivariantMF], which aim to learn symmetry-invariant distributions on arbitrary manifolds, overcoming issues with the main generalized manifold-modeling approaches (both Euclidean and Riemannian). Much like the other equivariant methods we’ve discussed, we need to properly model symmetries for many applications in the natural sciences – e.g., symmetry requirements arise when sampling coupled particle systems in physical chemistry [Khler2020EquivariantFE], or when sampling for lattice gauge theories in theoretical physics, enabling the construction of model architectures that respect the symmetries of the Standard Model of particle and nuclear physics and other physical theories [Boyda2021SamplingUS]. For a thorough review of normalizing flows for probabilistic modeling and inference, refer to Papamakarios et al. [Papamakarios2019NormalizingFF].
Brehmer & Cranmer [Brehmer2020FlowsFS] highlight a limitation: normalizing flows may be ill-suited to data that do not populate the full ambient data space they natively reside in, learning a smeared-out version of the data manifold rather than the exact lower-dimensional manifold (illustrated in Fig. 38). They introduce manifold-learning flows (M-flows): normalizing flows based on an injective, invertible map from a lower-dimensional latent space to the data space. The key novelty is that M-flows simultaneously learn the shape of the data manifold, provide a tractable bijective chart, and learn a probability density over the manifold, thus more faithfully representing input data that may be off-manifold (in the ambient data space). M-flows describe a new type of generative model that combines aspects of normalizing flows, autoencoders, and GANs; flows that simultaneously learn the data manifold and a tractable density over it may help unify generative and inference tasks in a way that is tailored to the structure of the data. In many scientific cases, domain knowledge allows for exact statements about the dimensionality of the data manifold, and M-flows can be a particularly powerful tool in a simulation-based inference setting [Cranmer2020TheFO].
AlphaFold
The recent landmark results from DeepMind’s AlphaFold [Jumper2021HighlyAP] in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition put it well ahead of the state of the art in three-dimensional protein structure modeling (see “It will change everything: DeepMind’s AI makes gigantic leap in solving protein structures” in Nature News, and “AlphaFold: a solution to a 50-year-old grand challenge in biology” from DeepMind). AlphaFold’s approach to computational protein structure prediction is the first to achieve near-experimental accuracy in a majority of cases. Overall, several deep learning innovations were assembled into a network purpose-built for protein structures [Jumper2021HighlyAP]: a new architecture to jointly embed multiple sequence alignments and pairwise features, a new output representation and associated loss that enable accurate end-to-end structure prediction, a new equivariant attention architecture, and others, including curated loss functions. While the success of AlphaFold is undeniable, questions remain as to how much of the AlphaFold modeling framework, and knowledge about it, can be adapted to solve other, similar fundamental challenges in molecular biology, such as RNA folding and chromatin packing – how much effort would be needed to customize the tool to other areas to make a fresh contribution, and how should we evaluate the contribution of the AI tool itself versus the human in the loop at most stages? A more philosophical outcome of AlphaFold is the implication that a scientific challenge can be considered partially or totally solved even if human scientists are unable to understand or gain any new knowledge from the solution, leaving open the question of whether the problem has really been solved.
It will be intriguing to explore this perspective and the related area of AI anthropomorphism [Salles2020AnthropomorphismIA] as more scientific, mathematical, and engineering challenges are overcome by various types of machine intelligence.
Quantum computing
Quantum computing [nielsen2011quantum] has strong connections with both simulation and machine learning. With the recent experimental demonstration of quantum computational supremacy [arute2019quantum] and rapid progress in the development of noisy intermediate-scale quantum (NISQ) machines [preskill2018quantum] and algorithms [bharti2021noisy], there is optimism that quantum computers might even deliver an advantage in tasks involving the simulation of quantum systems in the foreseeable future. In the longer term, once large-scale, fault-tolerant quantum computation is achieved, quantum simulation is widely expected to be a transformative use case [tacchino2020quantum, mcardle2020quantum, cerezo2021variational].
There also exists a large body of work investigating diverse ways in which quantum computers might be used to perform machine-learning tasks – the subfield of quantum machine learning (QML) [biamonte2017quantum, dunjko2020non]. QML methods exist both for tasks involving classical data and for tasks involving quantum data, although there is a general expectation that it will be easier to obtain an advantage on tasks with quantum data. The notion of differentiable programming has also been extended to quantum computing, first as part of QML, but now also applied to other areas of quantum algorithms [mitarai2018quantum, bergholm2018pennylane, broughton2020tensorflow, cerezo2021variational]. Since quantum algorithms – including quantum-simulation algorithms, but also algorithms simulating classical systems on quantum computers – can generate quantum data, there is a natural point of connection between simulation on quantum computers and QML for quantum data. One might therefore expect an increasingly close interplay between simulation and machine learning methods in quantum computing, mirroring the merger in classical computing that this paper describes.
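The differentiable-programming idea carries over to quantum circuits because gradients of expectation values can be obtained by running the same circuit at shifted parameter values. The sketch below is a minimal numpy simulation of a one-qubit variational circuit (illustrative only, not tied to any particular framework), verifying the parameter-shift rule against the analytic gradient.

```python
import numpy as np

def ry(theta):
    # Single-qubit rotation about the Y axis
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

Z = np.diag([1.0, -1.0])

def expval(theta):
    # <psi| Z |psi> for |psi> = RY(theta)|0>; analytically cos(theta)
    psi = ry(theta) @ np.array([1.0, 0.0])
    return psi @ Z @ psi

def parameter_shift_grad(theta):
    # Parameter-shift rule: the exact gradient from two shifted evaluations
    return 0.5 * (expval(theta + np.pi / 2) - expval(theta - np.pi / 2))

theta = 0.7
assert np.isclose(expval(theta), np.cos(theta))
assert np.isclose(parameter_shift_grad(theta), -np.sin(theta))
```

Because the shifted evaluations are themselves circuit executions, the same recipe works on real quantum hardware, which is what makes variational quantum algorithms trainable with classical optimizers.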
Conclusion
Donald Braben, a pioneer in innovation practices and the culture of scientific research, defines transformative research as “research that sets out radically to change the way we think about an important subject” [Braben2008ScientificFT]. Advancing the Nine Motifs and realizing synergistic artificial intelligence and simulation sciences can provide the requisite tools for transformative research across domains.
Yet simulation in conjunction with artificial intelligence does not necessarily yield solutions. Rather, when integrated in the ways we’ve described, it enables intelligent navigation of complex, highdimensional spaces of possible solutions: bounding the space helps to guide the user to optimal solutions, and, in many cases, expanding the space to new territories can enable solutions beyond existing knowledge or human priors (Fig. 29).
Unlike current statistical machine learning approaches, simulation-based approaches can better deal with cause-effect relationships, helping to find the first principles and generating mechanisms required to make progress in areas of scientific discovery. Whether for problems in the inverse design of synthetic biology compounds, predicting the results of interventions in social systems, or optimizing experimental design in high-energy physics, the Nine Motifs of SI explored here enable one to intelligently and tractably navigate the search space of possibilities.
This is why we put forth the concept of new SI-based scientific methods. While the traditional scientific method can be described as hypothesis-driven deduction, 21st-century advances in big data brought about a major shift affecting nearly all fields, opening the door to a second dimension of the scientific method: data-mining-inspired induction [Voit2019PerspectiveDO]. More recently, machine learning along with high-performance computing has shaped this inductive class of scientific method, leaning on the patterns and predictions of data-driven models.
It is clear that the traditional scientific method has provided a useful step-by-step guide for knowledge gathering, but this guide has also fundamentally shaped how humans think about the exploration of nature and scientific discovery. That is, presented with a new research question, scientists have been trained to think immediately in terms of hypotheses and alternatives, and how to practically implement controlled tests [Voit2019PerspectiveDO]. While effective for centuries, the process is also very slow, and subjective in the sense that it is driven by the scientist’s ingenuity as well as bias – a bias that has sometimes been a hindrance to necessary paradigm shifts [Kuhn1962TheSO]. It is possible that additional dimensions (or classes) of scientific methods can shift this classical regime toward additional modes of knowledge gathering, which is what we aim to motivate in this work with simulation intelligence.
One can expect significant advances in science and intelligence with the development and adoption of these modes, but it is also possible that additional ethical and societal implications will arise. Automation bias and algorithm aversion are concerns we raised in the context of human-machine teaming, and they can appear in varied ways when exploring new dimensions of scientific methods with simulation and artificial intelligence technologies. The high costs associated with data collection, data access, and computing power are already problematic in today’s society, particularly in the creation and exacerbation of inequalities in research output between state and private institutions, as well as between developed and developing nations. Moreover, the ethics and safety challenges we encounter with AI in high-stakes settings, such as biomedical and social research, are similarly relevant for SI-based methods in these settings; performance variability of healthcare and sociopolitical applications across subsets of the population is already a challenge for conventional data-driven methods, and increasing the reliance on such methods at the research stage might exacerbate these issues further.
The continued blending of data-driven, AI, simulation, and large-scale computing workflows is a promising path towards major progress and the emergence of a new scientific methodology [Succi2019BigDT]. It is not unreasonable to expect that the SI motifs and the integrations we’ve described can broaden the scientific method in general, with reliable and efficient hypothesis–simulation–analysis cycles [Stevens2020AIFS], including machine-driven hypothesis generation for highly autonomous science. Each of these motifs, particularly in the combinations described by the SI stack, allows this process to happen faster and more effectively – from probabilistic programming decoupling modeling and inference, to surrogate modeling allowing for rapid and powerful simulation of systems when detailed mechanistic modeling is infeasible, to simulation-based causal discovery deriving the true mechanisms of disease processes, to physics-infused learning deriving the governing equations of unknown physics, and so on. SI-driven scientific methods can help practitioners produce optimal and novel solutions across domains and use cases in science and intelligence, and provide an essential step towards the Nobel Turing Challenge: creating the machine intelligence engine for scientific discovery [Kitano2021NobelTC].
References
Acknowledgements
The authors would like to thank Tom Kalil and Adam Marblestone for their support, Joshua Elliot and Josh Tenenbaum for fruitful discussions, and Sam Arbesman for useful reviews.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to A.L.