Arby - Fast data-driven surrogates

by   Aarón Villanueva, et al.

The availability of fast to evaluate and reliable predictive models is highly relevant in multi-query scenarios where evaluating some quantities in real, or near-real-time becomes crucial. As a result, reduced-order modelling techniques have gained traction in many areas in recent years. We introduce Arby, an entirely data-driven Python package for building reduced order or surrogate models. In contrast to standard approaches, which involve solving partial differential equations, Arby is entirely data-driven. The package encompasses several tools for building and interacting with surrogate models in a user-friendly manner. Furthermore, fast model evaluations are possible at a minimum computational cost using the surrogate model. The package implements the Reduced Basis approach and the Empirical Interpolation Method along a classic regression stage for surrogate modelling. We illustrate the simplicity in using Arby to build surrogates through a simple toy model: a damped pendulum. Then, for a real case scenario, we use Arby to describe CMB temperature anisotropies power spectra. On this multi-dimensional setting, we find that out from an initial set of 80,000 power spectra solutions with 3,000 multipole indices each, could be well described at a given tolerance error, using just a subset of 84 solutions.



There are no comments yet.


page 7

page 8


PySINDy: A comprehensive Python package for robust sparse system identification

Automated data-driven modeling, the process of directly discovering the ...

Surrogate Models for Optimization of Dynamical Systems

Driven by increased complexity of dynamical systems, the solution of sys...

Reduced Order and Surrogate Models for Gravitational Waves

We present an introduction to some of the state of the art in reduced or...

On the stability and accuracy of the Empirical Interpolation Method and Gravitational Wave Surrogates

The combination of the Reduced Basis (RB) and the Empirical Interpolatio...

Deep Surrogate Models for Multi-dimensional Regression of Reactor Power

There is renewed interest in developing small modular reactors and micro...

Adjoint-Matching Neural Network Surrogates for Fast 4D-Var Data Assimilation

The data assimilation procedures used in many operational numerical weat...

A framework for data-driven solution and parameter estimation of PDEs using conditional generative adversarial networks

This work is the first to employ and adapt the image-to-image translatio...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Several problems arise in observational and theoretical contexts that demand the resolution of computationally intensive differential equations. From structural analysis in engineering to spacetime simulations in astrophysics, rapid and reliable evaluations of solutions to these equations are crucial due to the need for computing in real or near-real time quantities that depend on those solutions. Due to pervasive computational costs led by the inherent complexities present in many problems, achieving fast responses becomes an ubiquitous bottleneck.

As a case study, let us take the example of an ongoing problem in the field of gravitational wave research, namely the template bank problem (Field et al. (2012)

). To be able to detect the very faint gravitational wave signals among the noisy background of ground-based interferometers, the LIGO-Virgo collaboration use match filtering against a bank of GW signal templates to maximize the signal-to-noise ratio (SNR) (

Cutler & Flanagan (1994); Abbott et al. (2016, 2020)).

The observed time series is filtered to decide whether a binary coalescence took place, and the parameter estimation allows us to infer the properties of precursors of the remnant, such as masses and spins for binary black hole mergers.

It is convenient to count with large enough template banks of theoretical waveforms to fill the parameter space of GWs. However, this pose an exceedingly challenging task due to the complexity of solving the Einstein Equations of General Relativity. A single numerically generated GW waveform can take from days to weeks to become available for production (Lehner & Pretorius (2014)).

A palliative solution is the construction of approximate waveform models, which are more direct to build and deploy than numerical relativity ones. There are several methods to build these approximations. Analytical examples are the Post-Newtonian (Blanchet (2006)), Effective One Body (Damour & Nagar (2011)) and Phenomenological (Sturani et al. (2010); Hannam et al. (2014); Khan et al. (2019)) approximations. However, we focus here on a particular set of methods that have proven to be very fertile in GW research and produced some of the milestones in waveform modeling in recent years: Reduced Order methods.

Reduced Order Modeling (ROM) is an umbrella term that encompasses a variety of techniques for dimensional reduction, developed to address the problem of complexity in numerical simulations. In particular, Surrogate Models obtained through application of ROM to ground truth solutions are low resolution representations intended to be fast to build and evaluate without compromising accuracy. We take a data-driven approach, i.e. driven only by data, as opposed to more standard and intrusive ones in which reduction methods are coupled to differential solvers to build solutions (Quarteroni et al. (2015); Hesthaven et al. (2015)).

In waveform modeling, the combination of two ROM methods originally posed for intrusive problems and recreated later for data-driven ones led to significant success in constructing surrogate models for GWs. Those methods are the Reduced Basis (RB) (Boyaval et al. (2009); Field et al. (2011, 2012)) and the Empirical Interpolation (EI) (Barrault et al. (2004); Chaturantabut & Sorensen (2010)) methods, which we describe in the next section.

The primarily purpose of this work is to disseminate these tools in the astronomy/astrophysics community by introducing a single and user-friendly Python package for data-driven dimensional reduction: Arby.

Arby arises as a response to a lack of well documented, tested and actively maintained code for reduced basis and surrogate modeling in the scientific community while adhering to the data-driven and user-friendliness premise.

2 Theory Overview

This section briefly describes the basics of a reduced-order pipeline for building surrogate models. As we stated above, it merges two main ingredients, the Reduced Basis and the Empirical Interpolation methods, for dimensional reduction of raw data. This pipeline was first introduced in (Field et al. (2014)) followed by the construction of several surrogate waveform models based on this method (Blackman et al. (2015, 2017b, 2017a); Varma et al. (2019a, b); Rifat et al. (2020)). See also (Tiglio & Villanueva (2021)) for a review.

Representation.– We are interested in parametrized scalar (real or complex) functions, solutions or models of the form


where represents the parameter/s of the model and is the independent variable. Both are real and possibly multidimensional. In physical models, can represent parametrized time series with being the time variable. For convenience, we denote the spaces for and as the parameter and physical domains, respectively.

In the first stage, we look for a low-resolution representation of the model . The RB method consist in representing a whole set of solutions, usually called the training set

by linear combinations of basis elements of the form




defines the inner product between training functions, is the complex conjugate of –if we deal with complex functions– and is the physical domain. The set is called the reduced basis and is composed by a subset of optimally chosen solutions from the training set. The construction of the reduced basis is iterative: at each step of the algorithm, the most dissimilar (orthogonal) element from the training set joins to the current basis, and the process stops when an user-specified tolerance is met. This tolerance is related to the maximum error of the difference between solutions and approximations. In consequence, the different spaces spanned by each reduced basis built at each step are nested, i.e. The addition of a new element to the basis implies a fine tuning of the previous approximation space.

The RB method supposes a compression in the parameter space of solutions. The next step is to achieve compression in time. To this we turn to interpolation replacing the projection-based approach described so far by an interpolation scheme. We pose the problem by looking for an efficient linear interpolation operator such that


subject to


for strategically selected nodes out from the physical domain. The EI method (Maday et al. (2009); Barrault et al. (2004); Chaturantabut & Sorensen (2010)) gives us an algorithm for building such interpolant. The Empirical Interpolation (EI) algorithm, as described in (Field et al. (2014)), selects iteratively the nodes from the physical domain following a local optimization criteria (Tiglio & Villanueva (2020)).

The EI algorithm receives as unique input the reduced basis and selects the interpolation nodes for building the interpolant. Note that there is no need of the whole training set since the RB algorithm already did the introspection of it, and we assume that the relevant information about is hard-coded in the reduced basis.

It is possible to show that, under some conditions, the interpolation error is similar to the projection one, which in most applications has exponential decay in the number of basis elements (see (Tiglio & Villanueva (2021)) and citations therein). This leads to an efficient and (in most cases of interest) accurate representation of the training set by means of an empirical interpolation.

To summarizing: first, a reduced basis is built from a training set using the RB method, which leads to the linear representation in (2). This step completes a compression in the parameter domain. Next, an empirical interpolant is built solely from the reduced basis. This step completes a compression in the physical domain. Finally, we end up with an empirical interpolation (4) which provides efficient and high-accuracy representations of all functions in the training set.

Predictive models.– We want our model to represent solutions that are not present in the training set. That is, we look for the predictability. For this, we perform parametric fits at each empirical node along with training values. Let us break it down.

Let us rewrite the interpolation in (4) as





Recall the approach is data-driven, so we do not fill the training set with more solutions to approximate newer ones. Instead, we predict them by performing fits along data that we already know, that is, along . For -D problems (), deciding for problem-agnostic fitting methods that are well-suited for most cases can be a challenging task, not to mention the high dimensional case, which remains an open question (Tiglio & Villanueva (2021)). For a rough classification, we refer to regression and interpolation methods. The former deals with calibrating free parameters of some model by optimizing an error function; the latter, with solving an interpolation problem consisting essentially in solving algebraic systems which are possibly subject to constraints (e.g. Eqs. (4,5)).

Here we address the second approach for parametric fits. The procedure consists in interpolating the values

along parameter samples for each empirical node . As we describe in Section 4, as of now, Arby uses splines for this step, though we expect to generalize this. Once fits are done, we end up with a surrogate model which can represent and predict at any in the parameter domain with high accuracy and low computational cost.

3 Algorithms

The pipeline for building surrogates described in the previous section is valid in any number of parameter and physical dimensions if we consider arbitrary fittings through parameter space. For surrogate modeling, Arby supports in its present version 1-D parameter and physical domains (real number intervals of the form for ) and real-valued functions. Again, this restriction is only for building surrogate models. On the other hand, Arby supports multidimensional parameter domains (although still restricted to 1-D domains in the physical dimension) and complex-valued functions for building reduced bases and empirical interpolants.

Below we summarize the algorithm for surrogate modeling. We refer the reader to the Appendices for technical details about the RB and EI algorithms used in intermediate stages.

The inputs are the training set , the parameter set , and the greedy tolerance .

1:Input: , ,
2:Build the reduced basis up to tolerance .
3:Find the empirical nodes and build the interpolant by assembling the functions (see 7).
4:for  do
Build a continuous function by doing fits along values .
6:end for
7:Assemble the surrogate:
8:Output: surrogate model
Algorithm 1 Surrogate modeling

Let’s make some remarks on Alg. 1.

  • The training set is built from a discretization of the parameter domain. In the current version of Arby, is a discretized real interval .

  • The RB algorithm used for building the reduced basis is fully described in Alg. 2, see the Appendices. It selects from points , called the greedy points, labeling those functions in the training set that conform the reduced basis. For conditioning purposes, the basis is orthonormalized at each step, so the algorithm’s final output is a set of orthonormal basis elements along with the set of greedy points. So we use the term reduced basis interchangeably for both, the basis conformed by greedy solutions and its orthonormal version, due to its equivalence (they span the same space).

    The number depends on the greedy tolerance . In Arby we must specify a discretization of the physical domain so as to be able to do integrals (see (3)). In this context, is a real interval and Arby must receive as input an equispaced discretization of it.

  • Step 3 implements the EI algorithm described in Alg. 3, see the Appendices. It receives the reduced basis as unique input and finds empirical nodes to build the interpolant. In practice, the interpolant is specified by assembling the functions defined in Eqs. (6,7).

  • To achieve predictability Steps 4-6 perform parametric fits along training values for each empirical node . Let’s illustrate this by looking at the first iteration of the loop in Steps 4-6. For the first node collect all values

    and perform a fit along them. This results in a function that is continuous in the interval . This is repeated times for each empirical node. The resulted functions constitute along the reduced basis the building blocks for the final surrogate assembly. The current Arby version implements splines (J.H. Ahlberg & Walsh (1967)) for parametric fits, i.e., piecewise polynomial interpolation of some degree arbitrarily set by the user.

  • From eqs. 7 and 6 the empirical interpolant is defined through the functions . They comprise all the RB-EI information. Combining the functions with the fits built on previous steps, Step 7 leads to the desired surrogate which is continuous in inside the real interval .

3.1 Related works

A previous implementation of the RB-EI approach is GreedyCpp 111ttps://, an MPI/OpenMP parallel code written in C++ (Antil et al. (2018)). Although it is not designed for building surrogates and training sets have to be loaded at runtime, it allows for building reduced bases, empirical interpolants and reduced-order quadratures. Another example is ROMPy 222, a previous attempt written in pure Python which supports surrogate modeling.

Other ROM implementations in the Python ecosystem are not fully data-driven. Typically they are weakly or strongly coupled to solvers for differential equations. Mature examples are PyMOR (Milk et al. (2016)) and RBniCS (Ballarin et al. (2015)). The latter built on top of the FEniCS Logg et al. (2011) library for differential equations, and the former allows for coupling with external solvers.

4 Arby

Arby is a Python package for data-driven surrogate modeling satisfying standard software compliance along quality assurance (see Section 4.4). It allows the user for building reduced basis and empirical interpolants at any number of parameter dimensions. At the current release, Arby builds surrogate models for 1-D domains.

4.1 Implementation

Integrals and inner products (see 3) must be discretized in implementing Alg. 1. For this, the physical interval is sampled in equispaced points to define a discrete inner product between two functions,

The bar represents complex conjugation in case of being complex. The ’s are positive real values called weights. Weights and sample points constitute a quadrature rule. Arby uses quadrature rules to compute integrals.

4.2 Public API

Classes.– The main class in Arby is PythonReducedOrderModel. It implements Alg. 1 for surrogate modeling. There are three basic inputs for this class:

  • Pythontraining_set: 2-D array storing training functions (it corresponds to in Alg. 1);

  • Pythonphysical_points: 1-D array storing an equispaced discretization of the physical interval (see 3);

  • Pythonparameter_points: 1-D array storing the parameter points ( in Alg. 1);

These inputs represent the minimum and indispensable for building surrogates. Optional parameters are:

  • Pythonintegration_rule: a string specifying the quadrature rule;

  • Pythongreedy_tol: the greedy tolerance for the reduced basis, and

  • Pythonpoly_deg: the polynomial degree of splines.

These parameters can be tuned for controlling the model accuracy. See the Arby documentation 333 for a thorough tutorial on this.

Once a PythonReducedOrderModel object is created, the construction and evaluation phases of the surrogate are unified in a single call: just invoke the surrogate at some parameter/s to obtain a prediction.

In principle, the first call of the surrogate is more expensive than subsequent ones, but this is not a problem given that it is done only once. The algorithm design allows upcoming calls to be of fast deployment. By doing so, the process was split in offline-online stages (Field et al. (2014)). Thus, the offline stage corresponds to a (possibly) expensive first building; the online one corresponds to fast model evaluations.

There is a class to compute inner products and integrals: the PythonIntegration class. It receives a 1-D array corresponding to the physical points and the integration rule. Available rules are Riemann (default), Trapezoidal and Euclidean. The former two works for continuous data; the latter for discrete data reduces to discrete inner products. Once a rule is specified, the following methods are unlocked: Pythonintegral, Pythondot, Pythonnorm and Pythonnormalize. They compute integrals, inner products, norms and normalizations, respectively.

The PythonBasis class encompasses data utilities for handling arbitrary bases, whether they are reduced bases or user-specified ones. The PythonBasis class receives as input a basis and an PythonIntegration object. Available methods are: Pythoninterpolate, Pythonproject and Pythonprojection_error. The former two take input arrays and interpolate/project them in a finite-dimensional space spanned by the basis. The interpolation is empirical, i.e. following the EI directives. The projection is simply an orthogonal projection onto the basis. The Pythonprojection_error method computes squared projection errors due to projecting arrays onto the basis.

Auxiliary classes are PythonRB and PythonEIM. They are containers for RB/EIM information.

Functions.– The main function in Arby is Pythonreduced_basis. It builds a reduced basis giving as input a training set and physical points. This function takes full advantage of the hierarchical feature of the RB algorithm, leading to constant complexity with the number of basis elements (Tiglio & Villanueva (2021)). For conditioning purposes, there are two patterns related to the normalization of the training set which lead to two different implementations of the greedy algorithm.

There is a function for Gram-Schmidt orthonormalization called Pythongram_schmidt. It implements an Iterative Gram-Schmidt algorithm (Hoffmann (1989)) to orthonormalize a set of linearly independent arrays. Internally, this algorithm builds the reduced bases.

4.3 Benchmarks

It is interesting within the work to present a performance analysis of the the most important routine of the project: Pythonreduce_basis.

From a theoretical point of view, we expect an overall complexity of , where are the number of training and physical samples, respectively, and is the number of basis elements. Since the training set size is fixed, the addition of one element to the basis at some stage of the greedy loop carries a constant computational cost. If data is amenable for dimensional reduction, we hope . On the other side, in the worst case scenario we expect . This happens if there is almost no redundancy in data to exploit, for example random data. In this case the cost grows as .

Besides theoretical assumptions, we intend to empirically know how the different parameters of Pythonreduce_basis function influence the general performance of the algorithm and if this adjusts to the theoretical estimates. These parameters are

  • Pythonintegration_rule – The quadrature rule to define an integration scheme, can be: riemann, trapezoidal and euclidean.

  • Pythonnormalize – True if training data must be normalized before training or False otherwise.

  • Pythongreedy_tol – The greedy tolerance as a stopping condition for the reduced basis. This allows to control the representation accuracy of the basis. We choose to test two different tolerances, and .

  • Pythontraining_set – The training data as a 2-D array. We tested on square random arrays with sizes and .

  • Pythonphysical_points – Physical points for quadrature rules. Must match the number of columns of training set.

With these parameters, we simulated 100 training sets for each one of the 24 possible combinations, giving a total of 24000 test cases. The benchmark was then executed on a computer with the following specifications:

  • CPU – 4 x 2.4 GHz AMD Opteron(tm) Processor 6282 SE.

  • RAM – 251GB DDR3L.

  • OS – CentOS 7 Linux 3.10.0-514.el7.x86_64

  • Software – Python (64 bit), NumPy 1.21.1 and SciPy 1.7.0

The results are presented in Figure 1. As we can anticipate, the size of the training set is the most important factor impacting the execution times. All other parameters, out of Pythongreedy_tol, are mainly used for initial configurations outside the greedy loop, hence used only once.

Figure 1: Results of the benchmark on 24000 test cases varying the values of Pythonintegration_rule, Pythonnormalize, Pythongreedy_tol and Pythontraining_set_shape. In all cases, the horizontal axis represents the different parameter values, and the vertical axis represents the execution time in seconds. We can see that the algorithm increases execution time as the size of the Pythontraining_set_shape increases, while in all other cases the times remain relatively unchanged.

To further explore the relationship between Pythontraining_set sizes and execution time, we run a second benchmark leaving fixed Pythonnormalize=False, Pythonintegration_rule=’riemman’ and Pythongreedy_tol=1e-12. We generate 100 random Pythontraining_set by varying the size between and , giving a total of test samples.

In this case, the results (Figure 2) show the anticipated behavior: in front of random others the cost has a growth as we increase the size of Pythontraining_set.

For more details the entire benchmark dataset can be found in 444 (Villanueva et al., 2021).

Figure 2: Measured times for test samples of Pythontraining_set between and . The samples keep the Pythonintegration_rule, Pythongreedy_tool and Pythonnormalize fixed at riemman, and False, respectively.

4.4 Quality assurance

To ensure the proper software quality of Arby, we provide standard quantitative and qualitative metrics, in particular i) unit testing and ii) code-coverage, and adhere to the PEP style guide(Van Rossum et al., 2001) throughout the entire project.

  1. Unit testing:

    Its purpose is to ensure that the individual software components work as expected (Jazayeri, 2007).

  2. Code-coverage:

    It measures the amount of code covered by the unit test suite, expressed as a percentage of executed sentences(Miller & Maloney, 1963). By providing comprehensive code-coverage we ensure code validation, expand the ability and efficiency of error handling, and increase confidence in the code.

Arby currently uses pytest (Okken, 2017) and 555 for unit testing and coverage, respectively, completing up to of code-coverage using Python versions , , and .

The PEP - Style Guide for Python Code (Van Rossum et al., 2001) is one of a series of guidelines and practices on how to write Python code to improve code readability and consistency. There are several of PEP’s (Python Enhancement Proposals), including PEP 8. The latter has recommendations for code layout, whitespaces, comments, naming conventions and programming recommendations. In addition, there are tools, called linters, that can be used, in particular, to automate compliance with PEP 8; Arby currently uses flake8 666, which checks for any deviation in code style.

Finally, the entire source code is MIT-licensed (Initiative et al., 2006) and is publicly available from its GitHub repository777 All versions committed to this code are automatically tested in a continuous-integration service using Travis CI888 and GitHub Actions 999 Documentation is automatically built from the repository and made public in the read-the-docs service at  101010

Arby is built on top of the Python scientific stack: Numpy (Walt et al., 2011) to perform efficient numerical linear algebra operations; Scipy (Virtanen et al., 2020), used in the current release for splines interpolation.

The Arby package is available for installation on the Python-Package-Index (PyPI) and can be installed using the command Pythonpip install arby.

5 Toy model: a damped pendulum

We illustrate the construction of surrogate models applying Arby to a classical problem in physics: the damped pendulum. This system is a simple pendulum of given longitude subject to gravity and a dissipative force such as friction, allowing for the pendulum oscillations to damp at long times. We encode the generic dynamics of this system in the ordinary differential equation (ODE)


where represents the time-dependent angle of the pendulum with respect to the equilibrium axis and dots represent time differentiation. The symbols and represent friction and gravity strength per unit length, respectively, at fixed values of the pendulum’s longitude. Time units do not play any role in this example, so for practical purposes, we choose it adimensional. This election makes the two parameters of the model, and , also adimensional. In order to make the model one-dimensional and being able to apply Arby for surrogate modeling, fix to a convenient value so as to cover the variation of solutions in the selected time range widely. Fixed

, our parametrized model consist on damped oscillations

, being time the physical variable and the parameter.

We must solve numerically the ODE in (8) in order to generate the training set of solutions to feed Arby. To this, we set and choose intervals for physical and parameter ranges as and . The initial conditions are set to , where , meaning the pendulum departs from rest at and falls under the action of gravity. We generate a training set of solutions using an ODE solver from the Scipy Python package (Virtanen et al., 2020). We discretize the parameter and time domains in and equispaced points respectively, and generate solutions using the same initial conditions111111The code for this example can be found in See Fig. 3.

Figure 3: Graphical illustration of the training set. We plot a subset of training solutions.

To build a surrogate for solutions to the pendulum equation, we invoke the PythonReducedOrderModel class and feed it with the three main parameters: training set, physical points and parameter points which we denote Pythontraining, Pythontime and Pythonparam, respectively.

In addition, we modify default values of optional class parameters to increase the surrogate’s quality:

[breaklines, frame=lines, framesep=2mm]pycon ¿¿¿ from arby import ReducedOrderModel as ROM

# set the greedy tolerance to 1e-14 and # splines degree to 5 and create a model ¿¿¿ pendulum = ROM(training, time, param, greedy_tol=1e-14, poly_deg=5)

Once the model for the pendulum is created the next step is to build/evaluate the surrogate. For that, we just call it.

[breaklines, frame=lines, framesep=2mm,]pycon # define the parameter ‘par‘ for surrogate evaluation ¿¿¿ par = 2. # evalute ¿¿¿ pendulum.surrogate(par) array([1.57079633, 1.5683046 , 1.56086267, …, 0.00993474, 0.00975876, 0.009536])

In order to test the surrogate’s accuracy, we build a test set composed of solutions ( denser than training set) for the same parameter and physical domains. It means the test set contains the training set plus several solutions not used in the training stage.

We use the norm to compute the relative error between surrogate and ground truth,



This metric allows us to quantify how well the surrogate globally matches the ground truth model at some parameter value. Furthermore, we compute these errors not only for the surrogate model based on training parameters but for a set of surrogates built upon different discretizations, starting from very sparse ones until reaching the discretization of training parameters. With the integration tools available in Arby, these computations become straightforward.

In Fig. 4 we built a colormap of errors for different discretizations and parameter values. Naturally, the biggest errors correspond to very sparse training sets and fall below for discretizations . For a specific model (an horizontal line in the colormap), bright-dark patterns describe the behavior of the model when it alternates between in-sample and out-of-sample parameter evaluations. The lowest errors usually correspond to in-sample evaluations, where splines become exact. The largest ones usually correspond to out-of-sample evaluations, where the errors due to parametric fits become relevant.

Figure 4: Global errors for surrogate models built from different training discretizations . All models are evaluated at test parameters.

For the surrogate trained with solutions, in Fig. 5 (top panel) we show the function curves for both, surrogate and ground truth models, for a parameter value corresponding to the worst global error. The surrogate evaluation at this parameter is a prediction, i.e. the associated parameter do not correspond to a training one. Since both surrogate and ground truth models are indistinguishable at eyeball resolution, we plot the absolute value for both functions in logarithmic scale to locate the dissimilarity sectors between curves better. We conclude that even the worst-case scenario shows almost no difference between surrogate and ground truth solution. The bottom panel of Fig. 5 shows the absolute value of the point-wise difference between surrogate and test solutions. We removed from the test set those points which correspond to training parameters in order to focus only on generalization errors of the surrogate. We see that point-wise errors jump up at most to whereas the bulk remains close to .

Figure 5: Top. Absolute value of function curves for both models, surrogate and ground truth. They correspond to the worst prediction in the test set. Bottom. Point-wise difference errors for test parameters. Training points are excluded for which errors are much smaller than pure test errors.

6 Cosmic Microwave Background Anisotropies: a multidimensional case

A valuable application of surrogate models could stem from its application to modeling CMB temperature and polarization anisotropies power spectra as measured by satellites like Planck (Planck Collaboration et al., 2020). Such power spectra have a strong dependence on the underlying cosmological model, defined by a set of cosmological parameters , here taken to be -dimensional.

Using the CAMB (Code for Anisotropies in the Microwave Background)) for each cosmological model (Lewis et al., 2000), we generated their corresponding observed temperature anisotropies power spectra. We randomly sampled the cosmological parameter space using points; as we will see, this is sufficiently dense. The independent or physical variable here results to be the angular multipole index , which we sampled using discrete points .

We compute a reduced basis using Arby. For a greedy tolerance of we obtain a set of greedy parameters identifying those elements in the training set that conform the reduced basis, allowing us to describe any power spectra in our sample as linear combinations of them. In particular, any training set composed of power spectra is equivalent to a set of just 84 reduced functions. Thus, for example, in Fig. 6 we show the distribution of the cosmological parameters for a training set of

CMB power spectra as blue points and the corresponding selected set of reduced basis as orange points. This methodology could accelerate the estimation of the cosmological parameters from CMB anisotropies by using a surrogate model built using Machine Learning algorithms on the reduced basis set. The utilization of this approach to estimate cosmological parameters will be the subject of a forthcoming publication.

Figure 6: Cosmological parameters space of a training set in blue dots and their corresponding Arby selected reduced basis set in orange points.

We report the convergence of the reduced basis number as a function of the training set size in Fig. 7. This plot shows that the reduced basis number stabilizes close to , meaning that, for the specified greedy tolerance, the training set begins to saturate at . This proves that the full -points dataset involves highly redundant information, for which only functions are enough to represent the entire set. With the reduced basis we reach a compression factor of just simply by computing projections of the training set.

Figure 7: Convergence of the number of basis elements with the size of the training set. We observe the curve saturates around basis elements.

7 Conclusion

We have introduced Arby, an open source Python package that provides a set of tools to generate and handle fast and highly-accurate surrogate models in a non-intrusive way. Arby can be used to construct continuous models from sparse data composed by functions generated perhaps by differential equations or to explore redundancies in data by dimensional reduction. The offline-online architecture of Arby allows for fast deployments of predictive models that can approximate functions which otherwise can be expensive to compute.

We assessed the package with unit testing tools and ensured it satisfies proper software quality assurance, which improves in robustness and readability of code. We also perform benchmarks measuring computation times of reduced bases along different parameter combinations.

To date, for surrogate modeling Arby works on 1-D domains for both spaces, the parametric and the physical, though it supports multidimensional parameter domains for reduced basis and empirical interpolation. In future releases we want to extend Arby to several dimensions for surrogate modeling and combine it with state-of-the-art regression methods at the fitting stages. The curse of the dimensionality present in high dimensional problems like the CMB power spectra discussed in section 6 surely become a bottleneck, so parallelized scenarios are worth to be explored in the future.

The user-friendly API of Arby expands the usability of the code to virtually anyone looking for a continuous model out from discrete data, even with little or no knowledge of ROM methods.

The authors would like to thank to their families and friends, and also IATE astronomers and Manuel Tiglio for useful comments and suggestions. This work was partially supported by the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina). A.V., J.B.C and M.Ch. are supported by a fellowship from CONICET. This research has made use of the, Cornell University repository, adstex ( and the Python programming language.


  • Abbott et al. (2016) Abbott, B. P., Abbott, R., Abbott, T. D., et al. 2016, Phys. Rev. Lett., 116, 241102
  • Abbott et al. (2020) Abbott, B. P., Abbott, R., Abbott, T. D., et al. 2020, Classical and Quantum Gravity, 37, 055002
  • Antil et al. (2018) Antil, H., Chen, D., & Field, S. E. 2018, Comput. Sci. Eng., 20, 10
  • Ballarin et al. (2015) Ballarin, F., Sartori, A., & Rozza, G. 2015, ScienceOpen Posters
  • Barrault et al. (2004) Barrault, M., Maday, Y., Nguyen, N. C., & Patera, A. T. 2004, Comptes Rendus Mathematique, 339, 667
  • Blackman et al. (2015) Blackman, J., Field, S. E., Galley, C. R., et al. 2015, Phys. Rev. Lett., 115, 121102
  • Blackman et al. (2017a) Blackman, J., Field, S. E., Scheel, M. A., et al. 2017a, Phys. Rev. D, 95, 104023
  • Blackman et al. (2017b) Blackman, J., Field, S. E., Scheel, M. A., et al. 2017b, Phys. Rev. D, 96, 024058
  • Blanchet (2006) Blanchet, L. 2006, Living Rev. Rel., 9, 4
  • Boyaval et al. (2009) Boyaval, S., Bris, C. L., Maday, Y., Nguyen, N. C., & Patera, A. T. 2009, Computer Methods in Applied Mechanics and Engineering, 198, 3187
  • Chaturantabut & Sorensen (2010) Chaturantabut, S. & Sorensen, D. 2010, SIAM J. Scientific Computing, 32, 2737
  • Cutler & Flanagan (1994) Cutler, C. & Flanagan, E. E. 1994, Phys. Rev. D, 49, 2658
  • Damour & Nagar (2011) Damour, T. & Nagar, A. 2011, Fundam. Theor. Phys., 162, 211
  • Field et al. (2011) Field, S. E., Galley, C. R., Herrmann, F., et al. 2011, Phys. Rev. Lett., 106, 221102
  • Field et al. (2014) Field, S. E., Galley, C. R., Hesthaven, J. S., Kaye, J., & Tiglio, M. 2014, Phys. Rev. X, 4, 031006
  • Field et al. (2012) Field, S. E., Galley, C. R., & Ochsner, E. 2012, Phys. Rev., D86, 084046
  • Hannam et al. (2014) Hannam, M., Schmidt, P., Bohé, A., et al. 2014, Phys. Rev. Lett., 113, 151101
  • Hesthaven et al. (2015) Hesthaven, J. S., Rozza, G., & Stamm, B. 2015, Certified Reduced Basis Methods for Parametrized Partial Differential Equations, 1st edn., Springer Briefs in Mathematics (Switzerland: Springer), 135
  • Hoffmann (1989) Hoffmann, W. 1989, Computing, 41, 335
  • Initiative et al. (2006) Initiative, O. S. et al. 2006, 2015b.[Online]. Available: https://opensource. org/licenses/MIT.[Accessed 27 March 2017]
  • Jazayeri (2007) Jazayeri, M. 2007, in 2007 Future of Software Engineering, FOSE ’07 (USA: IEEE Computer Society), 199–213
  • J.H. Ahlberg & Walsh (1967) J.H. Ahlberg, E. N. & Walsh, J. L. 1967, The theory of splines and their applications (Academic Press)
  • Khan et al. (2019) Khan, S., Chatziioannou, K., Hannam, M., & Ohme, F. 2019, Phys. Rev. D, 100, 024059
  • Lehner & Pretorius (2014) Lehner, L. & Pretorius, F. 2014, Ann. Rev. Astron. Astrophys., 52, 661
  • Lewis et al. (2000) Lewis, A., Challinor, A., & Lasenby, A. 2000, ApJ, 538, 473
  • Logg et al. (2011) Logg, A., Wells, G., & Mardal, K.-A. 2011, Automated solution of differential equations by the finite element method. The FEniCS book, Vol. 84
  • Maday et al. (2009) Maday, Y., Nguyen, N. C., Patera, A. T., & Pau, S. H. 2009, Communications on Pure and Applied Analysis, 8, 383
  • Milk et al. (2016) Milk, R., Rave, S., & Schindler, F. 2016, SIAM Journal on Scientific Computing, 38, S194
  • Miller & Maloney (1963) Miller, J. C. & Maloney, C. J. 1963, Commun. ACM, 6, 58
  • Okken (2017) Okken, B. 2017, Python testing with Pytest: simple, rapid, effective, and scalable (Pragmatic Bookshelf)
  • Planck Collaboration et al. (2020) Planck Collaboration, Aghanim, N., Akrami, Y., et al. 2020, A&A, 641, A6
  • Quarteroni et al. (2015) Quarteroni, A., Manzoni, A., & Negri, F. 2015, Reduced Basis Methods for Partial Differential Equations: An Introduction, UNITEXT (Springer International Publishing)
  • Rifat et al. (2020) Rifat, N. E. M., Field, S. E., Khanna, G., & Varma, V. 2020, Phys. Rev. D, 101, 081502
  • Sturani et al. (2010) Sturani, R., Fischetti, S., Cadonati, L., et al. 2010, Journal of Physics: Conference Series, 243, 012007
  • Tiglio & Villanueva (2020) Tiglio, M. & Villanueva, A. 2020
  • Tiglio & Villanueva (2021) Tiglio, M. & Villanueva, A. 2021, Reduced Order and Surrogate Models for Gravitational Waves
  • Van Rossum et al. (2001) Van Rossum, G., Warsaw, B., & Coghlan, N. 2001, Python. org, 1565
  • Varma et al. (2019a) Varma, V., Field, S. E., Scheel, M. A., et al. 2019a, Phys. Rev. Research., 1, 033015
  • Varma et al. (2019b) Varma, V., Field, S. E., Scheel, M. A., et al. 2019b, Phys. Rev. D, 99, 064045
  • Villanueva et al. (2021) Villanueva, A., Beroiz, M., Cabral, J., Chalela, M., & Dominguez, M. 2021, Benchmark dataset for arby
  • Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nature methods, 17, 261
  • Walt et al. (2011) Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Computing in Science and Engineering, 13, 22

Appendix A Build reduced bases

By defining

  • the training set ,

  • the projection error at parameter

    where is the projector operator associated to a -sized basis,

  • : the greedy tolerance,

  • GS: Orthonormalizes against a basis through a Gram-Schmidt procedure,

  • gp: the greedy points, and

  • rb: the reduced basis,

the Reduced Basis-greedy algorithm proceeds as follows:

1:Input: ,
2:Seed choice (arbitrary):
4:rb = , gp =
8:while  do
10:     gp = gp
12:     rb = rb
15:end while
16:Output: rb = and gp =
Algorithm 2 RB greedy algorithm

Appendix B Build Empirical Interpolants

3:for  do
4:     Build
7:end for
8:Output: EIM nodes and interpolant
Algorithm 3 EIM algorithm

Appendix C On computing projection coefficients

The most relevant step in terms of computational cost in the RB greedy algorithm is the computation of projection coefficients, that is, step of Alg. 2. Taking full advantage of the reduced basis orthonormality

we can write projection errors as


where are the projection coefficients. Note that

We omitted the label for simplicity. This allows for constant computational cost in the addition of a new element to the basis, since one only need to compute the projection coefficients for the basis element while storing those corresponding to the previous basis.

In practice, the orthonormalization of the basis is not perfect and carries some error due to machine precision . Therefore, inner products between basis elements write as

and this error propagates through projection error as

Then, naive implementations of this rule (saving projection coefficients to compute the next projection error) can lead to undesired error amplifications whenever and therefore wrong estimations of projection errors.

This is avoided simply by normalizing the training set. By doing so, we ensure and keep controlled all orthonormalization errors.

Arby allows for normalizing options in the Pythonreduced_basis function. Depending on whether the training set is normalized or not, the computation of the greedy projection errors is done differently in each case, guaranteeing the conditioning of the algorithm at all resolutions within machine precision.