# The Random Feature Model for Input-Output Maps between Banach Spaces

Well known to the machine learning community, the random feature model, originally introduced by Rahimi and Recht in 2008, is a parametric approximation to kernel interpolation or regression methods. It is typically used to approximate functions mapping a finite-dimensional input space to the real line. In this paper, we instead propose a methodology for use of the random feature model as a data-driven surrogate for operators that map an input Banach space to an output Banach space. Although the methodology is quite general, we consider operators defined by partial differential equations (PDEs); here, the inputs and outputs are themselves functions, with the input parameters being functions required to specify the problem, such as initial data or coefficients, and the outputs being solutions of the problem. Upon discretization, the model inherits several desirable attributes from this infinite-dimensional, function space viewpoint, including mesh-invariant approximation error with respect to the true PDE solution map and the capability to be trained at one mesh resolution and then deployed at different mesh resolutions. We view the random feature model as a non-intrusive data-driven emulator, provide a mathematical framework for its interpretation, and demonstrate its ability to efficiently and accurately approximate the nonlinear parameter-to-solution maps of two prototypical PDEs arising in physical science and engineering applications: viscous Burgers' equation and a variable coefficient elliptic equation.

## Authors

• 2 publications
• 25 publications
08/26/2021

### Learning Partial Differential Equations in Reproducing Kernel Hilbert Spaces

We propose a new data-driven approach for learning the fundamental solut...
05/07/2020

### Model Reduction and Neural Networks for Parametric PDEs

We develop a general framework for data-driven approximation of input-ou...
06/14/2021

### On the Representation of Solutions to Elliptic PDEs in Barron Spaces

Numerical solutions to high-dimensional partial differential equations (...
04/13/2022

### Neural Operator with Regularity Structure for Modeling Dynamics Driven by SPDEs

Stochastic partial differential equations (SPDEs) are significant tools ...
07/24/2019

### Kernel Methods for Surrogate Modeling

This chapter deals with kernel methods as a special class of techniques ...
05/13/2021

### Informed Equation Learning

Distilling data into compact and interpretable analytic equations is one...
12/22/2020

### Enforcing exact physics in scientific machine learning: a data-driven exterior calculus on graphs

As traditional machine learning tools are increasingly applied to scienc...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The goal of this paper is to frame the random feature model, introduced in [rahimi2008random], as a methodology for the data-driven approximation of maps between infinite-dimensional spaces. Canonical examples of such maps include the semigroup generated by a time-dependent partial differential equation (PDE) mapping the initial condition (an input parameter) to the solution at a later time and the operator mapping a coefficient function (an input parameter) appearing in a PDE to its solution. Obtaining efficient and potentially low-dimensional representations of PDE solution maps is not only conceptually interesting, but also practically useful. Many applications in science and engineering require repeated evaluations of a complex and expensive forward model for different configurations of a system parameter. The model often represents a discretized PDE and the parameter, serving as input to the model, often represents a high-dimensional discretized quantity such as an initial condition or uncertain coefficient field. These outer loop applications commonly arise in inverse problems or uncertainty quantification tasks that involve control, optimization, or inference [peherstorfer2018survey]. Full order forward models do not perform well in such many-query contexts, either due to excessive computational cost (requiring the most powerful high performance computing architectures) or slow evaluation time (unacceptable in real-time contexts such as on-the-fly optimal control). In contrast to the big data

regime that dominates computer vision and other technological fields, only a relatively small amount of high resolution data is generated from computer simulations or physical experiments in scientific applications. Fast approximate solvers built from this limited available data that can efficiently and accurately emulate the full order model would be highly advantageous.

In this work, we demonstrate that the random feature model holds considerable potential for such a purpose. Resembling [lu2019deeponet] and the contemporaneous work in [bhattacharya2020pca, li2020neural]

, we present a methodology for true function space learning of black-box input-output maps between a Banach space and separable Hilbert space. We formulate the approximation problem as supervised learning in infinite dimensions and show that the natural hypothesis space is a reproducing kernel Hilbert space associated with an operator-valued kernel. For a suitable loss functional, training the random feature model is equivalent to solving a finite-dimensional convex optimization problem. As a consequence of our careful construction of the method as mapping between Banach spaces, the resulting emulator naturally scales favorably with respect to the high input and output dimensions arising in practical, discretized applications; furthermore, it is shown to achieve small relative test error for two model problems arising from approximation of a semigroup and the solution map for an elliptic PDE exhibiting parametric dependence on a coefficient function.

### 1.1 Literature Review

In recent years, two different lines of research have emerged that address PDE approximation problems with machine learning techniques. The first perspective takes a more traditional approach akin to point collocation methods from the field of numerical analysis. Here, the goal is to use a deep neural network (NN) to solve a prescribed initial boundary value problem with as high accuracy as possible. Given a point cloud in a spatio-temporal domain

as input data, the prevailing approach first directly parametrizes the PDE solution field as a NN and then optimizes the NN parameters by minimizing the PDE residual with respect to (w.r.t.) some loss functional (see [raissi2019physics, sirignano2018dgm, weinan2018deep] and the references therein). To clarify, the object approximated with this novel method is a low-dimensional input-output map , i.e., the real-valued function that solves the PDE. This approach is mesh-free by definition but highly intrusive as it requires full knowledge of the specified PDE. Any change to the original formulation of the initial boundary value problem or related PDE problem parameters necessitates an (expensive) re-training of the NN solution. We do not explore this first approach any further in this article.

The second direction is arguably more ambitious: use a NN as an emulator for the infinite-dimensional mapping between an input parameter and the PDE solution itself or a functional of the solution, i.e., a quantity of interest; the latter is widely prevalent in uncertainty quantification problems. We emphasize that the object approximated in this setting, unlike in the aforementioned first approach, is an input-output map , i.e., the PDE solution operator, where are infinite-dimensional Banach spaces; this map is generally nonlinear. For an approximation-theoretic treatment of parametric PDEs in general, we refer the reader to the article of Cohen and DeVore [cohen2015approximation]. In applications, the solution operator is represented by a discretized forward model , where is the mesh size, and hence represents a high-dimensional object. It is this second line of research that inspires our work.

Of course, there are many approaches to forward model reduction that do not explicitly involve machine learning ideas. The reduced basis method (see [barrault2004empirical, benner2017model, devore2014theoretical]

and the references therein) is a classical idea based on constructing an empirical basis from data snapshots and solving a cheaper variational problem; it is still widely used in practice due to computationally efficient offline-online decompositions that eliminate dependence on the full order degrees of freedom. Recently, machine learning extensions to the reduced basis methodology, of both intrusive (e.g., projection-based reduced order models) and non-intrusive (e.g., model-free data only) type, have further improved the applicability of these methods

[cheng2013data, gao2019non, hesthaven2018non, lee2020model, santo2019data]. However, the input-output maps considered in these works involve high dimension in only one of the input or output space, not both. Other popular surrogate modeling techniques include Gaussian processes [williams2006gaussian], polynomial chaos expansions [spanos1989stochastic][wendland2004scattered]; yet, these are only practically suitable for problems with input space of low to moderate dimension. Classical numerical methods for PDEs may also represent the forward model , albeit implicitly in the form a computer code (e.g.: finite element, finite difference, finite volume methods). However, the approximation error is sensitive to and repeated evaluations of this forward model often becomes cost prohibitive due to poor scaling with input dimension .

Instead, deep NNs have been identified as strong candidate surrogate models for parametric PDE problems due to their empirical ability to emulate high-dimensional nonlinear functions with minimal evaluation cost once trained. Early work in the use of NNs to learn the solution operator, or vector field, defining ODEs and time-dependent PDEs, may be found in the 1990s

[Kev98, Kev92]. There are now more theoretical justifications for NNs breaking the curse of dimensionality [kutyniok2019theoretical, ma2019generalization], leading to increased interest in PDE applications [geist2020numerical, schwab2019deep]. A suite of work on data-driven discretizations of PDEs has emerged that allow for identification of the governing model [bar2019learning, bigoni2020data, long2017pde, trask2019gmls]; we note that only the operators appearing in the equation itself are approximated with these approaches, not the solution operator of the PDE. More in line with our focus in this article, architectures based on deep convolutional NNs have proven quite successful for learning elliptic PDE solution maps (for example, see [tripathy2018deep, winovich2019convpde, zhu2018bayesian], which take an image-to-image regression approach). Other NNs have been used in similar elliptic problems for quantity of interest prediction [khoo2017solving]

, error estimation

[chen2020output][li2020variational]. Yet in all the approaches above, the architectures and resulting error are dependent on the mesh resolution. To circumvent this issue, the surrogate map must be well-defined on function space and independent of any finite-dimensional realization of the map that arises from discretization. This is not a new idea (see [chen1995universal, rossi2005functional] or for functional data analysis, [kadri2016operator, micchelli2005learning]). The aforementioned reduced basis method is an example, as is the method of [chkifa2013sparse, cohen2015approximation], which approximates the solution map with sparse Taylor polynomials and is proved to achieve optimal convergence rates in idealized settings. However, it is only recently that machine learning methods have been explicitly designed to operate in an infinite-dimensional setting, and there is little work in this direction [bhattacharya2020pca, li2020neural]. Here we propose the random feature model as another such method.

The random feature model (RFM) [rahimi2008random], detailed in Section 2.3, is in some sense the simplest possible machine learning model; it may be viewed as an ensemble average of randomly parametrized functions: an expansion in a randomized basis. These random features could be defined, for example, by randomizing the internal parameters of a NN. Compared to NN emulators with enormous learnable parameter counts (e.g., to , see [fan2020solving, feliu2020meta, li2020variational]) and methods that are intrusive or lead to nontrivial implementations [chkifa2013sparse, lee2020model, santo2019data], the RFM is one of the simplest models to formulate and train (often parameters, or fewer, suffice). The theory of the RFM for real-valued outputs is well developed, partly due to its close connection to kernel methods [cao2019generalization, jacot2018neural, rahimi2008random] and Gaussian processes [williams1997computing], and includes generalization rates and dimension-free estimates [ma2019generalization, rahimi2008random, sun2018random]. A quadrature viewpoint on the RFM provides further insight and leads to Monte Carlo sampling ideas [bach2017equivalence]; we explore this further in Section 2.3

. As in modern deep learning practice, the RFM has also been shown to perform best when the model is over-parametrized

[belkin2019reconciling]. In a similar high-dimensional setting of relevance in this paper, the authors of [griebel2017reproducing, kempf2017kernel] theoretically investigated nonparametric kernel regression for parametric PDEs with real-valued solution map outputs. However, these works require explicit knowledge of the kernel itself, rather than working with random features that implicitly define a kernel as we do here; furthermore, our work considers both infinite-dimensional input and output spaces, not just one or the other. A key idea underlying our approach is to formulate the proposed random feature algorithm on infinite-dimensional space and only then discretize. This philosophy in algorithm development has been instructive in a number of areas in scientific computing, such as optimization [hinze2008optimization]

and the development of Monte Carlo Markov Chain methodology

[cotter2013mcmc]. It has recently been promoted as a way of designing and analyzing algorithms within machine learning [haber2017stable, ma2019machine, ruthotto2019deep, weinan2017proposal, weinan2019mean] and our work may be understood within this general framework.

### 1.2 Contributions

Our primary contributions in this paper are now listed.

1. We develop the random feature model, directly formulated on the function space level, for learning input-output maps between Banach spaces purely from data. As a method for parametric PDEs, the methodology is non-intrusive but also has the additional advantage that it may be used in settings where only data is available and no model is known.

2. We show that our proposed method is more computationally tractable to both train and evaluate than standard kernel methods in infinite dimensions. Furthermore, we show that the method is equivalent to kernel ridge regression performed in a finite-dimensional space spanned by random features.

3. We apply our methodology to learn the semigroup defined by the solution operator for viscous Burgers’ equation and the coefficient-to-solution operator for the Darcy flow equation.

4. We demonstrate, by means of numerical experiments, two mesh-independent approximation properties that are built into the proposed methodology: invariance of relative error to mesh resolution and evaluation ability on any mesh resolution.

This paper is structured as follows. In Section 2, we communicate the mathematical framework required to work with the random feature model in infinite dimensions, identify an appropriate approximation space, and explain the training procedure. We introduce two instantiations of random feature maps that target physical science applications in Section 3 and detail the corresponding numerical results for these applications in Section 4. We conclude in Section 5 with discussion and future work.

## 2 Methodology

In this work, the overarching problem of interest is the approximation of a map , where are infinite-dimensional spaces of real-valued functions defined on some bounded open subset of and is defined by , where is the solution of a (possibly time dependent) PDE and is an input function required to make the problem well-posed. Our proposed approach for this approximation, constructing a surrogate map for the true map , is data-driven, non-intrusive, and based on least squares. Least squares-based methods are integral to the random feature methodology as proposed in low dimensions [rahimi2008random] and generalized here to the infinite-dimensional setting; they have also been shown to work well in other algorithms for high-dimensional numerical approximation [beylkin2005algorithms, cohen2016optimal, doostan2009least]. Within the broader scope of reduced order modeling techniques [benner2017model], the approach we adopt in this paper falls within the class of data-fit emulators. In its essence, our method interpolates the solution manifold

 M={u∈Y:u=F†(a),a∈X}. (1)

The solution map , as the inverse of a differential operator, is often smoothing and admits a notion of compactness, i.e., the output space compactly embeds into the input space. Then, the idea is that should have some compact, low-dimensional structure (intrinsic dimension). However, actually finding a model that exploits this structure despite the high dimensionality of the truth map is quite difficult. Further, the effectiveness of many model reduction techniques, such as those based on the reduced basis method, are dependent on inherent properties of the map itself (e.g., analyticity), which in turn may influence the decay rate of the Kolmogorov -width of the manifold  [cohen2015approximation]. While such subtleties of approximation theory are crucial to developing rigorous theory and provably convergent algorithms, we choose to work in the non-intrusive setting where knowledge of the map and its associated PDE are only obtained through measurement data, and hence detailed characterizations such as those mentioned above are essentially unavailable.

The remainder of this section introduces the mathematical preliminaries for our methodology. With the goal of operator approximation in mind, in Section 2.1 we formulate a supervised learning problem in an infinite-dimensional setting. We provide the necessary background on reproducing kernel Hilbert spaces in Section 2.2 and then define the RFM in Section 2.3. In Section 2.4, we describe the optimization principle which leads to algorithms for the RFM and an example problem in which and are one-dimensional spaces.

### 2.1 Problem Formulation

Let be real Banach spaces and a (possibly nonlinear) map. It is natural to frame the approximation of as a supervised learning problem. Suppose we are given training data in the form of input-output pairs , where ,

is a probability measure supported on

, and with, potentially, noise added to the evaluations of ; in our applications, this noise may be viewed as resulting from model error (the PDE does not perfectly represent the physics) and/or from discretization error (in approximating the PDE). We aim to build a parametric reconstruction of the true map from the data, that is, construct a model and find such that are close as maps from to in some suitable sense. The natural number here denotes the total number of parameters in the model. The standard approach to determine parameters in supervised learning is to first define a loss functional and then minimize the expected risk,

 minα∈PEa∼ν[ℓ(F†(a),F(a,α))]. (2)

With only the data at our disposal, we approximate problem Eq. 2 by replacing with the empirical measure , which leads to the empirical risk minimization problem

 minα∈P1nn∑j=1ℓ(yj,F(aj,α)). (3)

The hope is that given minimizer of Eq. 3 and of Eq. 2, well approximates , that is, the learned model generalizes

well; these ideas may be made rigorous with results from statistical learning theory

[hastie2009elements]. Solving problem Eq. 3 is called training the model . Once trained, the model is then validated on a new set of input-output pairs previously unseen during the training process. This testing phase indicates how well approximates . From here on out, we assume that is a real separable Hilbert space and focus on the squared loss

 ℓ(y1,y2):=12\normy1−y22Y. (4)

We stress that our entire formulation is in an infinite-dimensional setting and that we will remain in this setting throughout the paper; as such, the random feature methodology we propose will inherit desirable discretization-invariant properties, to be observed in the numerical experiments of Section 4.

### 2.2 Operator-Valued Reproducing Kernels

The random feature model is naturally formulated in a reproducing kernel Hilbert space (RKHS) setting, as our exposition will show in Section 2.3. However, the usual RKHS theory is concerned with real-valued functions [aronszajn1950theory, berlinet2011reproducing, cucker2002mathematical, wendland2004scattered]. Our setting, with the output space a separable Hilbert space, requires several new ideas that generalize the real-valued case. We now outline these ideas; parts of the presentation that follow may be found in the references [bach2017equivalence, carmeli2006vector, micchelli2005learning].

We first consider the special case for ease of exposition. A real RKHS is a Hilbert space comprised of real-valued functions such that the pointwise evaluation functional is bounded for every . It then follows that there exists a unique, symmetric, positive definite kernel function such that for every , and the reproducing kernel property holds. These two properties are often taken as the definition of an RKHS. The converse direction is also true: every symmetric, positive definite kernel defines a unique RKHS.

We now introduce the needed generalization of the reproducing property to arbitrary real Hilbert spaces , as this result will motivate the construction of the random feature model. With elements of now arbitrary elements of a vector space, the kernel is now operator-valued.

###### Definition 1

Let be a real Banach space and a real separable Hilbert space. An operator-valued kernel is a map

 k:X×X→L(Y,Y), (5)

where denotes the Banach space of all bounded linear operators on , such that its adjoint satisfies for all and for every ,

 N∑i,j=1⟨yi,k(ai,aj)yj⟩Y≥0 (6)

for all pairs .

Paralleling the development for the real-valued case, an operator-valued kernel also uniquely (up to isomorphism) determines an associated real RKHS . Now, choosing a probability measure supported on , we define a kernel integral operator (in the sense of the Bochner integral) by

 (7)

which is non-negative, self-adjoint, and compact (provided is compact for all  [carmeli2006vector]). Let us further assume that all conditions needed for to be an isometric isomorphism from into are satisfied. Generalizing the standard Mercer theory (see, e.g., [bach2017equivalence, berlinet2011reproducing]), we may write the RKHS inner product using

 ⟨F,G⟩Hk=⟨F,T−1kG⟩L2ν∀F,G∈Hk. (8)

Note that while Eq. 8 appears to depend on the measure on , the RKHS is itself determined by the kernel without any reference to a measure (see [cucker2002mathematical], Chp. 3, Thm. 4). With the inner product now explicit, we may directly deduce a reproducing property. A fully rigorous justification of the methodology is outside the scope of this article; however, we perform formal computations which provide intuition underpinning the methodology. To this end we fix and . Then,

 ⟨k(⋅,a)y,T−1kF⟩L2ν =∫⟨k(a′,a)y,(T−1kF)(a′)⟩Yν(da′) =∫⟨y,k(a,a′)(T−1kF)(a′)⟩Yν(da′) =⟨y,∫k(a,a′)(T−1kF)(a′)ν(da′)⟩Y =⟨y,F(a)⟩Y,

by using Definition 1 of operator-valued kernel and the fact that  ([carmeli2006vector]). So, we deduce the following: [Reproducing property for operator-valued kernels] Let be given. Then for every and ,

 ⟨y,F(a)⟩Y=⟨k(⋅,a)y,F⟩Hk. (9)

This identity, paired with a special choice of , is the basis of the random feature model in our abstract infinite-dimensional setting.

### 2.3 Random Feature Model

One could approach the approximation of from the perspective of kernel methods. However, it is generally a difficult task to explicitly design operator-valued kernels of the form Eq. 5 since the spaces may be of different regularity, for example. Example constructions of operator-valued kernels studied in the literature include diagonal operators, multiplication operators, and composition operators [kadri2016operator, micchelli2005learning], but these all involve some simple generalization of scalar-valued kernels. Instead, the random feature model allows one to implicitly work with operator-valued kernels by choosing a random feature map and a probability measure supported on ; the map is assumed to be square integrable w.r.t. the product measure . We now show the connection between random features and kernels; to this end, recall the following standard notation: Given a Hilbert space, the outer product is defined by for any .   Then, we consider maps of the form

 kμ(a,a′):=∫φ(a;θ)⊗φ(a′;θ)μ(dθ). (10)

Since may readily be shown to be an operator-valued kernel via Definition 1, it defines a unique real RKHS . Our approximation theory will be based on this space or finite-dimensional approximations thereof.

We now perform a purely formal but instructive calculation, following from application of the reproducing property Eq. 9 to operator-valued kernels of the form Eq. 10. Doing so leads to an integral representation of any : for all ,

 =⟨∫⟨φ(a;θ),y⟩Yφ(⋅;θ)μ(dθ),F†⟩Hkμ =∫⟨φ(a;θ),y⟩Y⟨φ(⋅;θ),F†⟩Hkμμ(dθ) =∫c(θ)⟨y,φ(a;θ)⟩Yμ(dθ)

where the coefficient function is defined by

 c(θ):=⟨φ(⋅;θ),F†⟩Hkμ. (11)

Since is Hilbert, the above holding for all implies the integral representation

 F†=∫c(θ)φ(⋅;θ)μ(dθ). (12)

The expression for needs careful interpretation because with probability one; indeed, is defined only as an limit. Nonetheless, the RKHS may be completely characterized by this integral representation. Define

 A:L2μ(Θ;R)→L2ν(X;Y)c↦Ac:=∫c(θ)φ(⋅;θ)μ(dθ). (13)

Then we have the following result whose proof, provided in Appendix A, is a straightforward generalization of the real-valued case given in [bach2017equivalence], Sec. 2.2: Under the assumption that , the RKHS defined by the kernel in Eq. 10 is precisely

 Hkμ=im(A)={∫c(θ)φ(⋅;θ)μ(dθ):c∈L2μ(Θ;R)}. (14)

We stress that the integral representation Eq. 12 is not unique since is not injective in general. A central role in what follows is the approximation of measure by the empirical measure

 μ(m):=1mm∑j=1δθj,θj∼μ i.i.d. (15)

Given this, define to be the empirical approximation to :

 k(m)(a,a′)=Eθ∼μ(m)[φ(a;θ)⊗φ(a′;θ)]=1mm∑j=1φ(a;θj)⊗φ(a′;θj). (16)

Then define to be the unique RKHS induced by the kernel The following characterization of is proved in Appendix A: Assume that and that the random features are linearly independent in Then, the RKHS is equal to the linear span of the .

Applying a simple Monte Carlo sampling approach to Equation 12, specifically, replacing the probability measure by the empirical measure , gives the approximation

 F†≈1mm∑j=1c(θj)φ(⋅;θj); (17)

by virtue of Eq. 16, this approximation is in and achieves the Monte Carlo rate . However, in the setting of interest to us, the Monte Carlo approach does not give rise to a practical method for two reasons: evaluation of requires knowledge of both the unknown mapping and of the RKHS appearing in the inner product defining from ; in our setting the kernel of the RKHS is not assumed to be known – only the random features are assumed to be given. To sidestep these difficulties, the RFM adopts a data-driven optimization approach to determine a different approximation to , also from the space .

We now define the RFM:

###### Definition 2

Given probability space , with a real Banach space, probability space , with a finite or infinite-dimensional Banach space, real separable Hilbert space , and , the random feature model is the parametric map

 Fm:X×Rm→Y(a;α)↦Fm(a;α):=1mm∑j=1αjφ(a;θj),θj∼μ i.i.d. (18)

We implicitly use the Borel -algebra to define the probability spaces in the preceding definition. The goal of the RFM is to choose parameters so as to approximate mappings by mappings . The RFM may be viewed as a spectral method since the randomized basis in the linear expansion Eq. 18 is defined on all of . Determining the coefficient vector from data obviates the difficulties associated with the Monte Carlo approach since the method only requires knowledge of sample input-output pairs from and knowledge of the random feature map .

As written, Equation 18 is incredibly simple. It is clear that the choice of random feature map and measure pair will determine the quality of approximation. In their original paper [rahimi2008random], Rahimi and Recht took a kernel-oriented perspective by first choosing a kernel and then finding a random feature map to estimate this kernel. Our perspective is the opposite in that we allow the choice of random feature map to implicitly define the kernel via the formula Eq. 10 instead of picking the kernel first. This methodology also has implications for numerics: the kernel never explicitly appears in any computations, which leads to storage savings. It does, however, leave open the question of characterizing the RKHS of mappings from to that underlies the approximation method.

The connection to kernels explains the origins of the RFM in the machine learning literature. Moreover, the RFM may also be interpreted in the context of neural networks. To see this, consider the setting where are both equal to the Euclidean space and choose

to be a family of hidden neurons

. A single hidden layer NN would seek to find in so that

 1mm∑j=1αjφNN(a;θj) (19)

matches the given training data . More generally, and in arbitrary Euclidean spaces, one may allow to be any deep NN. However, while the RFM has the same form as Eq. 19, there is a difference in the training: the are drawn i.i.d. from a probability measure and then fixed, and only the are chosen to fit the training data. This connection is quite profound: given any deep NN with randomly initialized parameters , studies of the lazy training regime and neural tangent kernel [gao2019non, jacot2018neural] suggest that adopting a RFM approach and optimizing over only is quite natural, as it is observed that in this regime the NN parameters do not stray far from their random initialization during gradient descent whilst the last layer of parameters adapt considerably.

Once the parameters are chosen at random and fixed, training the RFM only requires optimizing over which, due to linearity of in , is a simple task to which we now turn our attention.

### 2.4 Optimization

One of the most attractive characteristics of the RFM is its training procedure. With the -type loss Eq. 4 as in standard regression settings, optimizing the coefficients of the RFM with respect to the empirical risk Eq. 3 is a convex optimization problem, requiring only the solution of a finite-dimensional system of linear equations; the convexity also suggests the possibility of appending convex constraints (such as linear inequalities), although we do not pursue this here. We emphasize the simplicity of the underlying optimization tasks as they suggest the possibility of numerical implementation of the RFM into complicated black-box computer codes.

We now proceed to show that a regularized version of the optimization problem Eq. 3Eq. 4 arises naturally from approximation of a nonparametric regression problem defined over the RKHS To this end, recall the supervised learning formulation in Section 2.1. Given input-output pairs as data, with the drawn from unknown probability measure on , the objective is to find an approximation to the map . Let be the hypothesis space and its operator-valued reproducing kernel of the form Eq. 10. The most straightforward learning algorithm in this RKHS setting is kernel ridge regression, also known as penalized least squares. This method produces a nonparametric model by finding a minimizer of

 minF∈Hkμn∑j=112\normyj−F(aj)2Y+λ2\normF2Hkμ, (20)

where is a penalty parameter. By the representer theorem for operator-valued kernels [micchelli2005learning], the minimizer has the form

 F∗=n∑ℓ=1kμ(⋅,aℓ)βℓ (21)

for some functions . In practice, finding these functions in the output space requires solving a (block) linear operator equation. For the high-dimensional PDE problems we consider in this work, solving such an equation may become prohibitively expensive from both operation count and storage required. A few workarounds were proposed in [kadri2016operator] such as certain diagonalizations, but these rely on simplifying assumptions that are quite limiting. More fundamentally, the representation of the solution in Eq. 21 requires knowledge of the kernel ; in our setting we assume access only to the random features which define and not itself.

We thus proceed to explain how to make progress with this problem given only knowledge of random features. Recall the empirical kernel given by Eq. 16, the RKHS , and Eq. 16. The following result, proved in Appendix A, shows that a RFM hypothesis class with a penalized least squares empirical loss function in optimization problem Eq. 3Eq. 4 is equivalent to kernel ridge regression Eq. 20 restricted to .

Assume that and that the random features are linearly independent in . Fix . Let be the unique minimum norm solution of the following problem:

 minα∈Rmn∑j=112\normyj−1mm∑ℓ=1αℓφ(aj;θℓ)2Y+λ2m\normα22. (22)

Then, the RFM defined by this choice satisfies

 Fm(⋅;α∗)=argminF∈Hk(m)n∑j=112\normyj−F(aj)2Y+λ2\normF2Hk(m). (23)

Solving the convex problem Eq. 22 trains the RFM. The first order condition for a global minimizer leads to the normal equations

 1mm∑i=1n∑j=1αi⟨φ(aj;θi),φ(aj;θℓ)⟩Y+λαℓ=n∑j=1⟨yj,φ(aj;θℓ)⟩Y (24)

for each . This is an -by- linear system of equations for that is standard to solve. In the case , the minimum norm solution may be written in terms of a pseudoinverse operator [luenberger1997optimization].

###### Example 1 (Brownian bridge)

We now provide a simple one-dimensional instantiation of the random feature model to illustrate the methodology. Denote the input by and take the input space , output space , input space measure , and random parameter space . Then, consider the random feature map defined by the Brownian bridge

 φ(x;θ):=∞∑j=1θ(j)(jπ)−1√2sin(jπx),θ(j)∼N(0,1) i.i.d., (25)

where . For any realization of , the function is a Brownian motion constrained to zero at and . The kernel is simply the covariance function of this stochastic process:

 k(x,x′)=Eθ(j)∼N(0,1)[φ(x;θ)φ(x′;θ)]=min{x,x′}−xx′. (26)

Note that is the Green’s function for the negative Laplacian on with Dirichlet boundary conditions. Using this fact, we may explicitly characterize the associated RKHS as follows. First, define

 Tkf:=∫10k(⋅,y)f(y)\ddy=(−\dv[2]x)−1f, (27)

where the the negative Laplacian has domain . Viewing as an operator from into itself, from Eq. 8 we conclude, upon integration by parts,

 ⟨f,g⟩Hk=⟨f,T−1kg⟩L2=⟨\dvfx,\dvgx⟩L2=⟨f,g⟩H10∀f,g∈Hk; (28)

note that the last identity does indeed define an inner product on By this formal argument we identify the RKHS as the Sobolev space . Furthermore, Brownian bridge may be viewed as the Gaussian measure .

Approximation using the RFM with the Brownian bridge random feature map is illustrated in Figure 1. Since is a piecewise linear function, a kernel regression method will produce a piecewise linear approximation. Indeed, the figure indicates that the RFM with training points fixed approaches the optimal piecewise linear approximation as (see [ma2019generalization] for a related theoretical result).

The Brownian bridge example illuminates a more fundamental idea. For this low-dimensional problem, an expansion in a deterministic Fourier sine basis would of course be natural. But if we do not have a natural, computable orthonormal basis, then randomness provides a useful alternative representation; notice that the random features each include random combinations of the deterministic Fourier sine basis in this Brownian bridge example. For the more complex problems that we move on to study numerically in the next two sections, we lack knowledge of good, computable bases for general maps in infinite dimensions. The RFM approach exploits randomness to allow us to explore, implicitly discover the structure of, and represent, such maps. Thus we now turn away from this example of real-valued maps defined on a subset of the real line and instead consider the use of random features to represent maps between spaces of functions.

## 3 Application to PDE Solution Maps

In this section, we design the random feature maps for the RFM approximation of two particular PDE parameter-to-solution maps: the evolution semigroup of viscous Burgers’ equation in Section 3.1 and the coefficient-to-solution operator for the Darcy problem in Section 3.2. As practitioners of kernel methods in machine learning have found, the choice of kernel (which in this work, follows from the choice of random feature map) plays a central role in the quality of the function reconstruction. While our method is purely data-driven and requires no knowledge of the governing PDE, we take the view that any prior knowledge can, and should, be introduced into the design of . The maps that we employ are nonlinear in both arguments. We also detail the probability measure placed on the input space for each of the two PDE applications; this choice is crucial because while it is desirable that the trained RFM generalizes to arbitrary inputs in , we can in general only expect to learn an approximation of the truth map restricted to inputs that resemble those drawn from .

### 3.1 Burgers’ Equation: Formulation

Viscous Burgers’ equation in one spatial dimension is representative of the advection-dominated PDE problem class; these time-dependent equations are not conservation laws due to the presence of a small dissipative term, but nonlinear transport still plays a central role in the evolution of solutions. The initial value problem we consider is

 (29)

where is the viscosity (i.e., diffusion coefficient) and we have imposed periodic boundary conditions. The initial condition serves as the input and is drawn according to a Gaussian measure defined by

 a∼ν:=N(0,C), (30)

with covariance operator

 C=τ2α−d(−Δ+τ2Id)−α, (31)

where and the operator is defined on

. The hyperparameter

is an inverse length scale and controls the regularity of the draw. It is known that such are Hölder with exponent up to , so in particular . Then for all , the unique global solution to Eq. 29 is real analytic for all  (see [kiselev2008blow], Thm. 1.1). Hence, setting the output space to be for any , we may define the solution map

 F†:L2→Hsa↦F†(a):=ΨT(a)=u(T,⋅), (32)

where forms the solution operator semigroup for Eq. 29 and we fix the final time . The map is smoothing and nonlinear.

We now describe a random feature map for use in the RFM Eq. 18 that we call Fourier space random features. Let

denote the Fourier transform and define

by

 φ(a;θ)=σ(F−1(gFθFa)), (33)

where , the function defined below, is defined as a mapping on and applied pointwise to vectors. The randomness enters through , with the same covariance operator as in Eq. 31 but with potentially different inverse length scale and regularity, and the wavenumber filter function is

 g(k)=σg(δ\absk),σg(r):=max{0,min{2r,(r+1/2)−β}}, (34)

where . The map essentially performs a random convolution with the initial condition followed by a filtering operation. Figure 1(a) illustrates a sample input and output from . While not optimized for performance, the filter is designed to shuffle the energy in low to medium wavenumbers and cut off high wavenumbers (see Fig. 1(b)), reflecting our prior knowledge of the behavior of solutions to Eq. 29.

We choose the activation function

in Eq. 33 to be the exponential linear unit defined by

 ELU(r):={r,r≥0er−1,r<0. (35)

has successfully been used as activation in other machine learning frameworks for related nonlinear PDE problems [lee2020model, patel2018nonlinear]. We also find to perform better in the RFM framework over several other choices including , , , , , and . Note that the pointwise evaluation of in Eq. 33 will be well defined, by Sobolev embedding, for sufficiently large in the definition of . Since the solution operator maps into for any , this does not constrain the method.

### 3.2 Darcy Flow: Formulation

Divergence form elliptic equations [gilbarg2015elliptic] arise in a variety of applications, in particular, the groundwater flow in a porous medium governed by Darcy’s law [bear2012fundamentals]. This linear elliptic boundary value problem reads

 (36)

where is a bounded open subset in , represents sources and sinks of fluid, the permeability of the porous medium, and the piezometric head; all three functions map into and, in addition, is strictly positive almost everywhere in . We work in a setting where is fixed and consider the input-output map defined by . The measure on is a high contrast level set prior constructed as a pushforward of a Gaussian measure:

 a∼ν:=ψ♯N(0,C); (37)

here is a threshold function defined by

 ψ(r)=a+1(0,∞)(r)+a−1(−∞,0)(r),0

and the covariance operator is given in Eq. 31 with and homogeneous Neumann boundary conditions on the Laplacian . That is, the resulting coefficient almost surely takes only two values ( or ) and, as the zero level set of a Gaussian random field, exhibits random geometry in the physical domain . It follows that almost surely. Further, the size of the contrast ratio measures the scale separation of this elliptic problem and hence controls the difficulty of reconstruction [bernardi2000adaptive]. See Figure 2(a) for a representative draw.

Given , the standard Lax-Milgram theory may be applied to show that for coefficient , there exists a unique weak solution for Equation 36 (see, e.g., Evans [evans2010partial]). Thus, we define the ground truth solution map

 F†:L∞→H10a↦F†(a):=u. (39)

We emphasize that while the PDE Eq. 36 is linear, the solution map is nonlinear.

We now describe the chosen random feature map for this problem, which we call predictor-corrector random features. Define by such that equationparentequation

 −Δp0 =fa+σγ(θ1) (40a) −Δp1 =fa+σγ(θ2)+∇(loga)⋅∇p0, (40b)

where the boundary conditions are homogeneous Dirichlet, are two Gaussian random fields each drawn from measure , is the source term in Eq. 36, and are parameters for a thresholded sigmoid defined by

 σγ(r):=s+−s−1+e−r/δ+s− (41)

and extended as a Nemytskii operator when applied to and . In practice, since is not well-defined when drawn from the level set measure, we replace with , where is a smoothed version of obtained by evolving the following linear heat equation

 ⎧⎪⎨⎪⎩\dvvt=ηΔvin (0,1)×Dn⋅∇v=0on (0,1)×∂Dv(0)=ain D (42)

for one time unit. An example of the response of to a piecewise constant input is shown in Figure 3.

We remark that by removing the two random terms involving in Eq. 40, we obtain a remarkably accurate surrogate model for the PDE. This observation is representative of a more general iterative method, a predictor-corrector type iteration, for solving the Darcy equation Eq. 36, which is convergent for sufficiently large coefficients . The map is essentially a random perturbation of a single step of this iterative method: Equation 40a makes a coarse prediction of the output, then Eq. 40b improves this prediction with a correction term (here, this correction term is derived from expanding the original PDE). This choice of falls within an ensemble viewpoint that the RFM may be used to improve pre-existing surrogate models by taking to be an existing emulator, but randomized in a principled way through .

We are cognizant of the fact that, for this particular example, the random feature map requires full knowledge of the Darcy equation and a naïve evaluation of may be as expensive as solving the original PDE, which is itself a linear PDE; however, we believe that the ideas underlying the random features used here are intuitive and suggestive of what is possible in other applications areas. For example, random feature models may be applied on domains with simple geometries, which are supersets of the physical domain of interest, enabling the use of fast tools such as the fast Fourier transform (FFT) within the RFM even though they may not be available on the original problem, either because the operator to be inverted is spatially inhomogeneous or because of the geometry of the physical domain.

## 4 Numerical Experiments

We now assess the performance of our proposed methodology on the approximation of operators presented in Section 3. In the numerical experiments that follow, all infinite-dimensional objects are discretized on a uniform mesh with degrees of freedom. In this section, our notation does not make explicit the dependence on because the random feature model is consistent with the continuum in the limit ; we demonstrate this fact numerically.

The input functions and our chosen random feature maps Eq. 33 and Eq. 40 require

draws of Gaussian random fields to be fully defined. We efficiently sample these fields by truncating a Karhunen-Loéve expansion and employing fast summation of the eigenfunctions with FFT. More precisely, on a mesh of size

, denote by a numerical representation of a Gaussian random field on domain , :

 gθ=∑k∈ZKξk√λkϕk≈∑k′∈(Z+)dξk′√λk′ϕk′∼N(0,C), (43)

where and is a truncated one-dimensional lattice of cardinality ordered such that

is non-increasing. For clarity, the eigenvalue problem

for non-negative, symmetric, trace-class operator in Eq. 31 has solutions

 ϕk(x)=2cos(k1πx1)cos(k2πx2),λk=τ2α−2(π2\absk2+τ2)−α (44)

for homogeneous Neumann boundary conditions when , , , and solutions

 ϕk(x)=e2πikx,λk=τ2α−1(4π2k2+τ2)−α (45)

for periodic boundary conditions when , ,

(with appropriately modified random variables

in Eq. 43 to ensure that the resulting is real-valued). These forms of are used in all experiments that follow.

To measure the distance between the RFM approximation and the ground truth map , we employ the approximate expected relative test error

 en′,m:=1n′n′∑j=1\normF†(aj)−Fm(aj;α∗)L2\normF†(aj)L2≈Ea∼ν\normF†(a)−Fm(a;α∗)L2\normF†(a)L2, (46)

where the are drawn i.i.d. from and denotes the number of input-output pairs used in testing. All norms on the physical domain are numerically approximated by trapezoid rule quadrature. Since for both the PDE solution operators Eq. 32 and Eq. 39, we also perform all required inner products during training in rather than in ; this results in smaller relative test error .

### 4.1 Burgers’ Equation: Experiment

We generate a high resolution dataset of input-output pairs by solving Burgers’ equation Eq. 29 on a uniform periodic mesh of size (identifying the first mesh point with the last) using an FFT-based pseudospectral method for spatial discretization and a fourth-order Runge-Kutta integrating factor time-stepping scheme [kassam2005fourth] for time discretization. All mesh sizes are subsampled from this original dataset and hence we consider numerical realizations of up to . We fix training and testing pairs unless otherwise noted. The input data are drawn from where is given by Eq. 31 with parameter choices and . We fix the viscosity to in all experiments. Lowering leads to smaller length scale solutions and more difficult reconstruction; more data (higher ) and features (higher ) or a more expressive choice of would be required to achieve comparable error levels. For simplicity, we set the forcing , although nonzero forcing could lead to other interesting solution maps such as . It is easy to check that the solution will have zero mean for all time and a steady state of zero. Hence, we choose to ensure that the solution is far from attaining steady state. For the random feature map Eq. 33, we fix the hyperparameters , , , and . The map itself is evaluated efficiently with the FFT, and hyperparameters were not optimized. We find that regularization during training has a negligible effect for our problem, so the RFM is trained with by solving the normal equations Eq. 24 with the pseudoinverse to deliver the minimum norm least squares solution; we use the truncated SVD implementation in scipy.linalg.pinv2 for this purpose.