Learning mappings between infinite-dimensional function spaces is a challenging problem with numerous applications across various disciplines.
Examples arise in numerous differential equation models in science and engineering, in robotics and in computer vision.
In particular, interest lies in any map where the input space, the output space, or both are infinite-dimensional, and especially in maps whose inputs and/or outputs are themselves functions. The possibility of learning such mappings opens up a new class of problems in the design of neural networks with widespread applicability. New ideas are required to build upon traditional neural networks, which are mappings between finite-dimensional Euclidean spaces and/or sets of finite cardinality.
A naive approach to this problem is simply to discretize the (input or output) function spaces and apply standard ideas from neural networks. Instead, we formulate a new class of deep neural network based models, called neural operators, which map between spaces of functions on bounded domains. Such models, once trained, have the property of being discretization invariant: they share the same network parameters between different discretizations of the underlying functional data. In contrast, the naive approach leads to neural network architectures which depend heavily on the discretization: new architectures with new parameters are needed to achieve the same error for differently discretized data.
We demonstrate, numerically, that the same neural operator can achieve a constant error for any discretization of the data while standard feed-forward and convolutional neural networks cannot.
In this paper we experiment with the proposed model within the context of partial differential equations (PDEs). We study various solution operators or flow maps arising from PDE models; in particular, we investigate mappings between function spaces where the input data can be, for example, the initial condition, boundary condition, or coefficient function, and the output data is the respective solution. We perform numerical experiments with operators arising from the one-dimensional Burgers’ equation (Evans, 2010), the two-dimensional steady state of Darcy flow (Bear and Corapcioglu, 2012) and the two-dimensional Navier-Stokes equation (Lemarié-Rieusset, 2018).
1.1 Background and Context
“Differential equations […] represent the most powerful tool humanity has ever created for making sense of the material world.” Strogatz (2009). Over the past few decades, significant progress has been made on formulating (Gurtin, 1982) and solving (Johnson, 2012) the governing PDEs in many scientific fields from micro-scale problems (e.g., quantum and molecular dynamics) to macro-scale applications (e.g., civil and marine engineering). Despite the success in the application of PDEs to solve real-world problems, two significant challenges remain:
identifying the governing model for complex systems;
efficiently solving large-scale non-linear systems of equations.
Identifying/formulating the underlying PDEs appropriate for modeling a specific problem usually requires extensive prior knowledge in the corresponding field, which is then combined with universal conservation laws to design a predictive model. For example, modeling the deformation and fracture of solid structures requires detailed knowledge of the relationship between stress and strain in the constituent material. For complicated systems such as living cells, acquiring such knowledge is often elusive and formulating the governing PDE for these systems remains prohibitive, or the models proposed are too simplistic to be informative. The possibility of acquiring such knowledge from data can revolutionize these fields. Second, solving complicated non-linear PDE systems (such as those arising in turbulence and plasticity) is computationally demanding and can often make realistic simulations intractable. Again, the possibility of using instances of data from such computations to design fast approximate solvers holds great potential for accelerating scientific discovery and engineering practice.
Learning PDE solution operators.
Neural networks have the potential to address these challenges when designed in a way that allows for the emulation of mappings between function spaces (Lu et al., 2019; Bhattacharya et al., 2020; Nelsen and Stuart, 2020; Li et al., 2020a, b, c; Patel et al., 2021; Opschoor et al., 2020; Schwab and Zech, 2019; O’Leary-Roseberry et al., 2020). In PDE applications, the governing equations are by definition local, whilst the solution operator exhibits non-local properties. Such non-local effects can be described by integral operators explicitly in the spatial domain, or by means of spectral domain multiplication; convolution is an archetypal example. For integral equations, the graph approximations of Nyström type (Belongie et al., 2002) provide a consistent way of connecting different grid or data structures arising in computational methods and understanding their continuum limits (Von Luxburg et al., 2008; Trillos and Slepčev, 2018; Trillos et al., 2020). For spectral domain calculations, well-developed tools exist for approximating the continuum (Boyd, 2001; Trefethen, 2000). For these reasons, neural networks that build in non-locality via integral operators or spectral domain calculations are natural. This is the governing framework for our work, which is aimed at designing mesh-invariant neural network approximations for the solution operators of PDEs.
1.2 Our Contribution
We introduce the concept of neural operators by generalizing standard feed-forward neural networks to learn mappings between infinite-dimensional spaces of functions defined on bounded domains. The non-local component of the architecture is instantiated through either a parameterized integral operator or through multiplication in the spectral domain. The methodology leads to the following contributions.
We propose neural operators, a concept which generalizes neural networks that map between finite-dimensional Euclidean spaces to neural networks that map between infinite-dimensional function spaces.
By construction, our architectures share the same parameters irrespective of the discretization of the input and output spaces used for the purposes of computation. Consequently, neural operators are capable of zero-shot super-resolution, as demonstrated in Figure 1.
We propose four methods for practically implementing the neural operator framework: graph-based operators, low-rank operators, multipole graph-based operators, and Fourier operators. Specifically, we develop a Nyström extension to connect the integral operator formulation of the neural operator to families of graph neural networks (GNNs) on arbitrary grids. Furthermore, we study the spectral domain formulation of the neural operator which leads to efficient algorithms in settings where fast transform methods are applicable. We include an exhaustive numerical study of the four formulations.
Numerically, we show that the proposed methodology consistently outperforms all existing deep learning methods, even on the resolutions for which the standard neural networks were designed. For the two-dimensional Navier-Stokes equation, when learning the entire flow map, the method achieves low relative error across the viscosities and Reynolds numbers considered.
The Fourier neural operator has an inference time that is three orders of magnitude faster than the pseudo-spectral method used to generate the data for the Navier-Stokes problem (Chandler and Kerswell, 2013). Despite its tremendous speed advantage, the method does not suffer from accuracy degradation when used in downstream applications such as solving Bayesian inverse problems.
In this work, we propose neural operator models to learn mesh-free, infinite-dimensional operators with neural networks. Compared to previous methods, which we discuss in the related work Section 1.3, the neural operator remedies the mesh-dependent nature of standard finite-dimensional approximation methods, such as convolutional neural networks, by producing a single set of network parameters that may be used with different discretizations. It also has the ability to transfer solutions between meshes. Furthermore, the neural operator needs to be trained only once, and obtaining a solution for a new instance of the parameter requires only a forward pass of the network, alleviating the major computational issues incurred in physics-informed neural network methods (Raissi et al., 2019). Lastly, the neural operator requires no knowledge of the underlying PDE, only data.
1.3 Related Works
We outline the major neural network-based approaches for the solution of PDEs. To make the discussion concrete, we will consider the family of PDEs in the form
for some , and a bounded domain. We assume that the solution lives in the Banach space and is a mapping from the parameter Banach space to the space of (possibly unbounded) linear operators mapping to its dual . A natural operator which arises from this PDE is defined to map the parameter to the solution . A simple example that we study further in Section 5.2 is when is the weak form of the second-order elliptic operator subject to homogeneous Dirichlet boundary conditions. In this setting, , , and . When needed, we will assume that the domain is discretized into points and that we observe pairs of coefficient functions and (approximate) solution functions that are used to train the model (see Section 2.1).
An immediate approach to approximating the solution operator is to parameterize it as a deep convolutional neural network (CNN) between the finite-dimensional Euclidean spaces on which the data is discretized (Guo et al., 2016; Zhu and Zabaras, 2018; Adler and Oktem, 2017; Bhatnagar et al., 2019). Khoo et al. (2017) concerns a similar setting, but with a different output space. Such approaches are, by definition, not mesh independent and need modifications to the architecture for different resolutions and discretizations in order to achieve consistent error (if at all possible). We demonstrate this issue numerically in Section 6. Furthermore, these approaches are limited to the discretization size and geometry of the training data, and hence it is not possible to query solutions at new points in the domain. In contrast, for our method we show in Section 6 both invariance of the error to grid resolution and the ability to transfer the solution between meshes. The work of Ummenhofer et al. (2020)
proposed a continuous convolution network for fluid problems, where off-grid points are sampled and linearly interpolated. However, the continuous convolution method is still constrained by the underlying grid, which prevents generalization to higher resolutions. Similarly, to obtain finer-resolution solutions, Jiang et al. (2020) proposed learning super-resolution with a U-Net structure for fluid mechanics problems. However, fine-resolution data is needed for training, while neural operators are capable of zero-shot super-resolution with no new data.
Recently, a novel operator regression architecture, named DeepONet, was proposed by Lu et al. (2019, 2021); it designs a generic neural network based on the approximation theorem of Chen and Chen (1995). The architecture consists of two neural networks: a branch net applied to the input functions and a trunk net applied to the query locations. Lanthaler et al. (2021)
developed an error estimate for DeepONet. The standard DeepONet structure is a linear approximation of the target operator, where the trunk net and branch net learn the coefficients and basis. On the other hand, the neural operator is a non-linear approximation, which makes it constructively more expressive. We include a detailed discussion of DeepONet in Section 3.2, as well as a numerical comparison to DeepONet in Section 6.2.
Physics Informed Neural Networks (PINNs).
A different approach is to directly parameterize the solution as a neural network (E and Yu, 2018; Raissi et al., 2019; Bar and Sochen, 2019; Smith et al., 2020; Pan and Duraisamy, 2020). This approach is designed to model one specific instance of the PDE, not the solution operator. It is mesh-independent, but for any given new parameter coefficient function, one would need to train a new neural network, which is computationally costly and time-consuming. Such an approach closely resembles classical methods such as finite elements, replacing the linear span of a finite set of local basis functions with the space of neural networks.
ML-based Hybrid Solvers
Similarly, another line of work proposes to enhance existing numerical solvers with neural networks by building hybrid models (Pathak et al., 2020; Um et al., 2020a; Greenfeld et al., 2019). These approaches suffer from the same computational issue as classical methods: one needs to solve an optimization problem for every new parameter, similarly to the PINNs setting. Furthermore, the approaches are limited to a setting in which the underlying PDE is known. Purely data-driven learning of a map between spaces of functions is not possible.
Reduced basis methods.
Our methodology most closely resembles the classical reduced basis method (RBM) (DeVore, 2014) or the method of Cohen and DeVore (2015). The method introduced here, along with the contemporaneous work introduced in the papers (Bhattacharya et al., 2020; Nelsen and Stuart, 2020; Opschoor et al., 2020; Schwab and Zech, 2019; O’Leary-Roseberry et al., 2020; Lu et al., 2019)
, is, to the best of our knowledge, amongst the first practical supervised learning methods designed to learn maps between infinite-dimensional spaces. It remedies the mesh-dependent nature of the approach in the papers (Guo et al., 2016; Zhu and Zabaras, 2018; Adler and Oktem, 2017; Bhatnagar et al., 2019) by producing a single set of network parameters that can be used with different discretizations. Furthermore, it has the ability to transfer solutions between meshes. Moreover, it need only be trained once on a given data set of input-output pairs. Then, obtaining a solution for a new input parameter requires only a forward pass of the network, alleviating the major computational issues incurred in (E and Yu, 2018; Raissi et al., 2019; Herrmann et al., 2020; Bar and Sochen, 2019), where a different network would need to be trained for each input parameter. Lastly, our method requires no knowledge of the underlying PDE: it is purely data-driven and therefore non-intrusive. Indeed, the true map can be treated as a black-box, perhaps to be learned from experimental data or from the output of a costly computer simulation, not necessarily from a PDE.
Continuous neural networks.
Using continuity as a tool to design and interpret neural networks is gaining currency in the machine learning community, and the formulation of ResNet as a continuous-time process over the depth parameter is a powerful example of this (Haber and Ruthotto, 2017; E, 2017). The concept of defining neural networks in infinite-dimensional spaces is a central problem that has long been studied (Williams, 1996; Neal, 1996; Roux and Bengio, 2007; Globerson and Livni, 2016; Guss, 2016). The general idea is to take the infinite-width limit, which yields a non-parametric method and has connections to Gaussian process regression (Neal, 1996; Matthews et al., 2018; Garriga-Alonso et al., 2018), leading to the introduction of deep Gaussian processes (Damianou and Lawrence, 2013; Dunlop et al., 2018)
. Thus far, such methods have not yielded efficient numerical algorithms that can parallel the success of convolutional or recurrent neural networks for the problem of approximating mappings between finite-dimensional spaces. Despite the superficial similarity with our proposed work, this body of work differs substantially from what we are proposing: in our work, we are motivated by the continuous dependence of the data, or the functions it samples, on a spatial variable; in the work outlined in this paragraph, continuity is used to approximate the network architecture as the depth or width approaches infinity.
Nyström approximation, GNNs, and graph neural operators (GNOs).
The graph neural operator (Section 4.1) has an underlying Nyström approximation formulation (Nyström, 1930) which links different grids to a single set of network parameters. This perspective relates our continuum approach to graph neural networks (GNNs). GNNs are a recently developed class of neural networks that apply to graph-structured data; they have been used in a variety of applications. Graph networks incorporate an array of techniques from neural network design such as graph convolution, edge convolution, attention, and graph pooling (Kipf and Welling, 2016; Hamilton et al., 2017; Gilmer et al., 2017; Veličković et al., 2017; Murphy et al., 2018). GNNs have also been applied to the modeling of physical phenomena such as molecules (Chen et al., 2019) and rigid body systems (Battaglia et al., 2018), since these problems exhibit a natural graph interpretation: the particles are the nodes and the interactions are the edges. The work of Alet et al. (2019) performs an initial study that employs graph networks on the problem of learning solutions to Poisson’s equation, among other physical applications. They propose an encoder-decoder setting, constructing graphs in the latent space, and utilizing message passing between the encoder and decoder. However, their model uses a nearest-neighbor structure that is unable to capture non-local dependencies as the mesh size is increased. In contrast, we directly construct a graph in which the nodes are located on the spatial domain of the output function. Through message passing, we are then able to directly learn the kernel of the network which approximates the PDE solution. When querying a new location, we simply add a new node to our spatial graph and connect it to the existing nodes, avoiding interpolation error by leveraging the power of the Nyström extension for integral operators.
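As a sketch of how the Nyström view links different meshes, the kernel aggregation can be approximated by averaging the integrand over a random subsample of nodes, much as a GNO passes messages over sampled neighbors. The Gaussian kernel and sine input below are illustrative stand-ins for the learned kernel network and the input function, not the paper's trained models:

```python
import numpy as np

def nystrom_kernel_apply(kappa, v_fn, x_query, y_sample):
    """Nystrom-type approximation of (K v)(x) = \\int kappa(x, y) v(y) dy:
    average the integrand over a (sub)sample of nodes."""
    m = y_sample.shape[0]
    K = kappa(x_query[:, None], y_sample[None, :])   # kernel on all pairs
    return K @ v_fn(y_sample) / m                    # Monte Carlo weight 1/m

rng = np.random.default_rng(1)
kappa = lambda x, y: np.exp(-(x - y) ** 2)           # illustrative kernel
v_fn = lambda y: np.sin(2 * np.pi * y)               # illustrative input

x_query = np.arange(16) / 16
full = rng.uniform(0, 1, 4000)                       # a fine point cloud
sub = rng.choice(full, size=800, replace=False)      # Nystrom subsample

out_full = nystrom_kernel_apply(kappa, v_fn, x_query, full)
out_sub = nystrom_kernel_apply(kappa, v_fn, x_query, sub)
```

The same kernel parameters are reused on both node sets; the subsampled estimate stays close to the full one, which is the sense in which one set of parameters serves different grids.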
Low-rank kernel decomposition and low-rank neural operators (LNOs).
Low-rank decomposition is a popular method used in kernel methods and Gaussian processes (Kulis et al., 2006; Bach, 2013; Lan et al., 2017; Gardner et al., 2018). We present the low-rank neural operator in Section 4.2, where we structure the kernel network as a product of two factor networks inspired by Fredholm theory. The low-rank method, while simple, is very efficient and easy to train, especially when the target operator is close to linear. Khoo and Ying (2019) similarly propose to use neural networks with low-rank structure to approximate the inverse of differential operators. The framework of two factor networks is also similar to the trunk and branch networks used in DeepONet (Lu et al., 2019). In our work, however, the factor networks are defined on the physical domain and non-local information is accumulated through integration, making our low-rank operator resolution invariant. On the other hand, the number of parameters of the networks in DeepONet(s) depends on the resolution of the data; see Section 3.2 for further discussion.
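A minimal sketch of the factorized kernel: if kappa(x, y) decomposes as a sum of r products of factor functions, applying the operator costs O(nr) instead of the O(n^2) dense kernel matrix. The cosine/sine factors here are hypothetical stand-ins for the learned factor networks:

```python
import numpy as np

def low_rank_kernel_apply(phi, psi, v, dy):
    """Apply a rank-r kernel kappa(x, y) = sum_r phi_r(x) psi_r(y):
    (K v)(x) = sum_r phi_r(x) * (sum_j psi_r(y_j) v(y_j) dy).
    phi, psi: (n, r) factor evaluations on the grid; v: (n,) input values."""
    coeffs = psi.T @ v * dy     # r inner products <psi_r, v>, cost O(n r)
    return phi @ coeffs         # expand back in the phi factors

n, r = 128, 4
grid = np.arange(n) / n
dy = 1.0 / n
# Hypothetical fixed factor functions; in the LNO these are learned networks.
phi = np.stack([np.cos(2 * np.pi * k * grid) for k in range(r)], axis=1)
psi = np.stack([np.sin(2 * np.pi * (k + 1) * grid) for k in range(r)], axis=1)
v = np.exp(-grid)

out_lr = low_rank_kernel_apply(phi, psi, v, dy)

# Identical to materializing the dense n x n kernel matrix, but cheaper.
K_dense = phi @ psi.T
out_dense = K_dense @ v * dy
```

The two computations agree to floating-point precision; only their cost differs.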
Multipole, multi-resolution methods, and multipole graph neural operators (MGNOs).
To efficiently capture long-range interactions, multi-scale methods such as the classical fast multipole method (FMM) have been developed (Greengard and Rokhlin, 1997). Based on the assumption that long-range interactions decay quickly, FMM decomposes the kernel matrix into different interaction ranges and hierarchically imposes low-rank structure on the long-range components (hierarchical matrices) (Börm et al., 2003). This decomposition can be viewed as a specific form of the multi-resolution matrix factorization of the kernel (Kondor et al., 2014; Börm et al., 2003). For example, the works of Fan et al. (2019c, b) and He and Xu (2019) propose a similar multipole expansion for solving parametric PDEs on structured grids. However, the classical FMM requires nested grids as well as the explicit form of the PDEs. In Section 4.3, we propose the multipole graph neural operator (MGNO) by generalizing this idea to arbitrary graphs in the data-driven setting, so that the corresponding graph neural networks can learn discretization-invariant solution operators which are fast and can work on complex geometries.
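A one-level sketch of the decomposition (a crude stand-in for the full FMM hierarchy, with an illustrative decaying kernel rather than a learned one): keep near-field interactions dense, and compress the smooth far field to low rank before applying it.

```python
import numpy as np

n = 256
grid = np.arange(n) / n
dy = 1.0 / n
kappa = lambda x, y: 1.0 / (0.05 + np.abs(x - y))   # slowly decaying kernel
K = kappa(grid[:, None], grid[None, :])

# Split interactions into a near field, kept dense, and the remaining far field.
near = np.abs(grid[:, None] - grid[None, :]) < 0.1
K_near = np.where(near, K, 0.0)
K_far = K - K_near

# The smooth far field admits accurate low-rank compression; applying it through
# its factors is the structure the FMM hierarchy exploits level by level.
U, s, Vt = np.linalg.svd(K_far)
r = int(np.sum(s > 1e-6 * s[0]))          # numerical rank at tolerance 1e-6
K_far_lr = (U[:, :r] * s[:r]) @ Vt[:r]

v = np.sin(2 * np.pi * grid)
out_exact = K @ v * dy
out_fast = (K_near + K_far_lr) @ v * dy   # near-dense + far-low-rank apply
```

By construction of the truncation tolerance, the near-plus-low-rank application matches the dense one to high accuracy.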
Fourier transform, spectral methods, and Fourier neural operators (FNOs).
The Fourier transform is frequently used in spectral methods for solving differential equations, since differentiation is equivalent to multiplication in the Fourier domain. Fourier transforms have also played an important role in the development of deep learning. In theory, they appear in the proof of the universal approximation theorem (Hornik et al., 1989) and, empirically, they have been used to speed up convolutional neural networks (Mathieu et al., 2013). Neural network architectures involving the Fourier transform or the use of sinusoidal activation functions have also been proposed and studied (Bengio et al., 2007; Mingo et al., 2004; Sitzmann et al., 2020). Recently, some spectral methods for PDEs have been extended to neural networks (Fan et al., 2019a, c; Kashinath et al., 2020). In Section 4.4, we build on these works by proposing the Fourier neural operator architecture, defined directly in Fourier space, with quasi-linear time complexity and state-of-the-art approximation capabilities.
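A minimal single-channel sketch of the core spectral operation (assumptions: one channel, a uniform 1-D grid, and one complex scalar weight per retained mode; the full architecture uses per-mode weight matrices acting across a channel dimension): transform, truncate to the lowest modes, multiply by learned weights, transform back.

```python
import numpy as np

def fourier_layer(v, weights, n_modes):
    """Linear part of a Fourier layer sketch: FFT, keep the lowest n_modes
    frequencies, multiply each kept mode by a learned complex weight,
    inverse FFT back to physical space."""
    v_hat = np.fft.rfft(v)                          # forward transform
    out_hat = np.zeros_like(v_hat)
    out_hat[:n_modes] = weights * v_hat[:n_modes]   # spectral multiplication
    return np.fft.irfft(out_hat, n=v.shape[0])

rng = np.random.default_rng(0)
n_modes = 8
weights = rng.standard_normal(n_modes) + 1j * rng.standard_normal(n_modes)

# The same weights apply at any resolution: only the transform length changes.
x64 = np.arange(64) / 64
x256 = np.arange(256) / 256
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)
out64 = fourier_layer(f(x64), weights, n_modes)
out256 = fourier_layer(f(x256), weights, n_modes)
```

Because the weights act on Fourier modes rather than grid values, evaluating the same parameters on a 64-point and a 256-point grid yields matching outputs at shared points for band-limited inputs.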
1.4 Paper Outline
In Section 2, we define the general operator learning problem, which is not limited to PDEs. In Section 3, we define the general framework in terms of kernel integral operators and relate our proposed approach to existing methods in the literature. In Section 4, we define four different ways of efficiently computing with neural operators: graph-based operators (GNO), low-rank operators (LNO), multipole graph-based operators (MGNO), and Fourier operators (FNO). In Section 5, we define four partial differential equations which serve as a testbed for the various problems that we study numerically. In Section 6, we show the numerical results for our four approximation methods on the four test problems, and we discuss and compare the properties of each method. In Section 7, we conclude the work, discuss potential limitations and outline directions for future work.
2 Learning Operators
2.1 Problem Setting
Our goal is to learn a mapping between two infinite-dimensional spaces by using a finite collection of observations of input-output pairs from this mapping. We make this problem concrete in the following setting. Let and be Banach spaces of functions defined on bounded domains, respectively, and let the target be a (typically) non-linear map. Suppose we have observations where
are i.i.d. samples drawn from some probability measure supported on the input space, and the outputs are possibly corrupted with noise. We aim to build an approximation of the map by constructing a parametric map
with parameters from a finite-dimensional space, and then choosing the parameters so that the parametric map approximates the true one.
We will be interested in controlling the error of the approximation on average with respect to the data distribution. In particular, assuming measurability, we will aim to control the Bochner norm of the approximation error
This is a natural framework for learning in infinite dimensions, as one could seek to solve the associated empirical-risk minimization problem
which directly parallels the classical finite-dimensional setting (Vapnik, 1998).
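In symbols, with $\mathcal{G}^\dagger$ denoting the true operator, $\mathcal{G}_\theta$ its parametric approximation, $\mu$ the input measure, and $\mathcal{U}$ the output space (notation assumed here), the empirical-risk minimization problem reads:

```latex
\min_{\theta \in \Theta} \frac{1}{N} \sum_{j=1}^{N}
  \left\| u_j - \mathcal{G}_\theta(a_j) \right\|_{\mathcal{U}}^{2}
\;\approx\;
\min_{\theta \in \Theta} \,
  \mathbb{E}_{a \sim \mu}
  \left\| \mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a) \right\|_{\mathcal{U}}^{2}.
```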
Since our data are, in general, functions, to work with them numerically we assume access only to their point-wise evaluations. To illustrate this, we continue with the example of the preceding paragraph. For simplicity, suppose that the input and output functions are real-valued. Given a finite point discretization of the domain, assume we have observations of the input-output pairs at those points, for a finite collection of pairs. In the next section, we propose a kernel-inspired graph neural network architecture which, while trained on the discretized data, can produce the solution at any query point, for any given input. That is to say, our approach is independent of the discretization. We refer to this as being a function space architecture, a mesh-invariant architecture or a discretization-invariant architecture; this claim is verified numerically by showing invariance of the error as the discretization is refined. Such a property is highly desirable as it allows a transfer of solutions between different grid geometries and discretization sizes with a single architecture which has a fixed number of parameters.
We note that, while the application of our methodology is based on having point-wise evaluations of the function, it is not limited by it. One may, for example, represent a function numerically as a finite set of truncated basis coefficients. Invariance of the representation would then be with respect to the size of this set. Our methodology can, in principle, be modified to accommodate this scenario through a suitably chosen architecture. We do not pursue this direction in the current work.
3 Proposed Architecture
3.1 Neural Operators
In this section, we outline the neural operator framework. We assume that the input functions are -valued and defined on the bounded domain while the output functions are -valued and defined on the bounded domain . The proposed architecture has the following overall structure:
Lifting: Using a pointwise function , map the input
to its first hidden representation. Usually, we choose a hidden dimension larger than the input dimension, and hence this is a lifting operation performed by a fully local operator.
Iterative kernel integration: For each layer, map each hidden representation to the next via the action of the sum of a local linear operator, a non-local integral kernel operator, and a bias function, composing the sum with a fixed, pointwise nonlinearity. Here we impose that each intermediate domain is a bounded domain.
Projection: Using a pointwise function , map the last hidden representation to the output function. Analogously to the first step, we usually pick and hence this is a projection step performed by a fully local operator.
The outlined structure mimics that of a finite dimensional neural network where hidden representations are successively mapped to produce the final output. In particular, we have
where , are the local lifting and projection mappings respectively, are local linear operators (matrices), are integral kernel operators, are bias functions, and are fixed activation functions acting locally as maps in each layer. The output dimensions as well as the input dimensions and domains of definition
are hyperparameters of the architecture. By local maps, we mean that the action is pointwise: the lifting and projection maps act on the value of their input function at each point of the domain separately, and similarly the activation is applied pointwise to function values. These maps can thus be thought of as defining Nemitskiy operators (Dudley and Norvaisa, 2011, Chapters 6, 7) when each of their components is assumed to be Borel measurable. This interpretation allows us to define the general neural operator architecture when pointwise evaluation is not well-defined in the input or output spaces, e.g., when they are Lebesgue, Sobolev, or Besov spaces.
The crucial difference between the proposed architecture (5) and a standard feed-forward neural network is that all operations are directly defined in function space and therefore do not depend on any discretization of the data. Intuitively, the lifting step locally maps the data to a space where the non-local part of the operator
is easier to capture. This is then learned by successively applying integral kernel operators composed with a local nonlinearity. Each integral kernel operator is the function space analog of the weight matrix in a standard feed-forward network, since they are infinite-dimensional linear operators mapping one function space to another. We turn the biases, which are normally vectors, into functions and, using intuition from the ResNet architecture (He et al., 2016), we further add a local linear operator acting on the output of the previous layer before applying the nonlinearity. The final projection step simply gets us back to the space of our output function. The learnable parameters comprise those of the lifting and projection maps, which are usually themselves shallow neural networks, the parameters of the kernels, which are again usually shallow neural networks, and the local linear operators (matrices). We note, however, that our framework is general and other parameterizations, such as polynomials, may also be employed.
Integral Kernel Operators
We define three versions of the integral kernel operator used in (5). For the first, let and let be a Borel measure on . Then we define by
Normally, we take to simply be the Lebesgue measure on but, as discussed in Section 4, other choices can be used to speed up computation or aid the learning process by building in a priori information.
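On a uniform grid, the integral operator (6) with the Lebesgue measure reduces to a quadrature sum. A sketch with an illustrative fixed kernel (in the neural operator, the kernel is a learned network) shows that the same kernel gives consistent outputs across resolutions:

```python
import numpy as np

def kernel_integral_operator(kappa, v, grid):
    """Apply (K v)(x) = \\int kappa(x, y) v(y) dy on a uniform grid of [0, 1),
    approximating the integral by a Riemann sum (Lebesgue measure)."""
    n = grid.shape[0]
    dy = 1.0 / n                                # quadrature weight
    K = kappa(grid[:, None], grid[None, :])     # kernel on all (x, y) pairs
    return K @ v * dy

# Illustrative smooth kernel and input function.
kappa = lambda x, y: np.exp(-(x - y) ** 2)
v = lambda x: np.sin(2 * np.pi * x)

coarse = np.arange(32) / 32
fine = np.arange(256) / 256

Kv_coarse = kernel_integral_operator(kappa, v(coarse), coarse)
Kv_fine = kernel_integral_operator(kappa, v(fine), fine)
```

Refining the grid changes only the quadrature, not the operator: the coarse and fine outputs agree at shared points up to quadrature error.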
For the second, let . Then we define by
where are fixed mappings. We have found numerically that, for certain PDE problems, the form (7) outperforms (6) due to the strong dependence of the solution on the input parameters. Indeed, if we think of (5) as a discrete time dynamical system, then the input only enters through the initial condition, hence its influence diminishes with more layers. By directly building input dependence into the kernel, we ensure that it influences the entire architecture.
Lastly, let . Then we define by
where are fixed mappings. Note that, in contrast to (6) and (7), the integral operator (8) is nonlinear since the kernel can depend on the input function . With this definition and a particular choice of kernel and measure , we show in Section 3.3 that neural operators are a continuous input/output space generalization of the popular transformer architecture (Vaswani et al., 2017).
Single Hidden Layer Construction
Having defined possible choices for the integral kernel operator, we are now in a position to explicitly write down a full layer of the architecture defined by (5). For simplicity, we choose the integral kernel operator given by (6), but note that the other definitions (7), (8) work analogously. We then have that a single hidden layer update is given by
where are fixed mappings. We remark that, since we often consider functions on the same domain, we usually take to be the identity.
We will now give an example of a full single hidden layer architecture. We choose the integral kernel operator given by (6), take the domain map as the identity, and let the activation be any fixed activation function. Furthermore, for simplicity, we assume that the measure is the Lebesgue measure on the domain. We then have that (5) becomes
for any point in the domain. One can then parametrize the continuous kernel functions by standard feed-forward neural networks (or by any other means) and the local linear operator simply by its matrix entries. The parameter vector then becomes the concatenation of the parameters of the kernels and of the lifting and projection maps, along with the matrix entries. One can then optimize these parameters by minimizing the empirical risk using standard gradient-based minimization techniques. To implement this minimization, the functions entering the loss need to be discretized; but the learned parameters may then be used with other discretizations. In Section 4, we discuss various choices for parametrizing the kernels and picking the integration measure, and how those choices affect the computational complexity of the architecture.
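A toy instantiation of this single-hidden-layer architecture, with fixed random maps standing in for the trained shallow networks and a scalar periodic kernel shared across channels (both simplifying assumptions), illustrates that one set of parameters can be evaluated on different grids:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid-independent parameters: lift P, projection Q, local matrix W,
# bias function b, and kernel kappa (stand-ins for shallow MLPs).
d_v = 4
P = rng.standard_normal((d_v, 1))               # lift R -> R^{d_v}
Q = rng.standard_normal((1, d_v))               # project R^{d_v} -> R
W = rng.standard_normal((d_v, d_v)) / d_v       # local linear operator
b = lambda x: np.stack([np.sin(2 * np.pi * x)] * d_v, axis=-1)
kappa = lambda x, y: np.exp(np.cos(2 * np.pi * (x[:, None] - y[None, :])))

def neural_operator_1layer(a_vals, grid):
    """Single-hidden-layer neural operator on a uniform grid of [0, 1)."""
    n = grid.shape[0]
    v0 = a_vals[:, None] @ P.T                  # (n, d_v) lifted input
    K = kappa(grid, grid) / n                   # kernel with quadrature weight
    v1 = np.maximum(0.0, v0 @ W.T + K @ v0 + b(grid))   # sigma = ReLU
    return (v1 @ Q.T)[:, 0]                     # projected output function

a = lambda x: np.cos(2 * np.pi * x)
g64 = np.arange(64) / 64
g256 = np.arange(256) / 256
u64 = neural_operator_1layer(a(g64), g64)
u256 = neural_operator_1layer(a(g256), g256)
```

The identical parameters produce matching outputs at the points the two grids share, which is exactly the discretization invariance the architecture is built for.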
It is often beneficial to manually include features in the input functions to help facilitate the learning process. For example, instead of considering only the input vector field, we append the spatial coordinate itself as an extra input channel. By including this identity map of the domain, information about the geometry of the spatial domain is directly incorporated into the architecture. This gives the neural networks direct access to information that is already known in the problem and therefore eases learning. We use this idea in all of our numerical experiments in Section 6. Similarly, when dealing with smoothing operators, it may be beneficial to include a smoothed version of the inputs using, for example, Gaussian convolution. Derivative information may also be of interest, and thus one may also include derivatives of the input field as additional input channels. Many other possibilities may be considered on a problem-specific basis.
3.2 DeepONets are Neural Operators
We will now draw a parallel between the recently proposed DeepONet architecture in Lu et al. (2019) and our neural operator framework. In fact, we will show that a particular variant of functions from the DeepONets class is a special case of a single hidden layer neural operator construction. To that end, we work with (10) where we choose and denote by . For simplicity, we will consider only real-valued functions i.e. and set . Define by and by . Furthermore let be given as for some . Similarly let be given as for some . Then (10) becomes
where for some . Choose each for some and then we obtain
where are functionals defined as
The set of maps form the trunk net while the functionals form the branch net of a DeepONet. The only difference between DeepONet(s) and (11) is the parametrization used for the functionals . Following Chen and Chen (1995), DeepONet(s) define the functionals as maps between finite-dimensional spaces. Indeed, they are viewed as and defined to map pointwise evaluations of for some set of points . We note that, in practice, this set of evaluation points is not known a priori and could potentially be very large. Indeed, we show that (12) can approximate the functionals defined by DeepONet(s) arbitrarily well, therefore making DeepONet(s) a special case of neural operators. Furthermore, (12) is consistent in function space as the number of parameters used to define each is independent of any discretization that may be used for , while this is not true in the DeepONet case as the number of parameters grows as we refine the discretization of . We demonstrate numerically in Section 6 that the error incurred by DeepONet(s) grows with the discretization of while it remains constant for neural operators.
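To make the branch/trunk structure concrete, here is a minimal sketch of a (random, untrained) DeepONet-style evaluation; the single-layer "nets", the number of sensors `m`, and the number of basis functions `p` are all assumptions for illustration. Note how the branch net's input dimension is tied to the `m` sensor locations, which is the discretization dependence discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

p, m = 4, 16                          # number of basis functions / sensor points
ys = np.linspace(0.0, 1.0, m)         # fixed sensor locations y_1, ..., y_m

# branch net: R^m -> R^p, maps pointwise evaluations of a to coefficients
Wb = rng.standard_normal((p, m))
# trunk net: R -> R^p, evaluated at the query point x
Wt = rng.standard_normal((p, 1)); bt = rng.standard_normal(p)

def deeponet(a_fn, x):
    coeffs = np.tanh(Wb @ a_fn(ys))                # branch: functionals of a
    basis = np.tanh(Wt @ np.array([x]) + bt)       # trunk: basis functions of x
    return float(coeffs @ basis)                   # G(a)(x) = sum_k b_k(a) t_k(x)

out = deeponet(np.sin, 0.5)
```

Refining the discretization of `a` here means increasing `m`, which grows the branch net's parameter count; the integral parametrization of the functionals in (12) avoids this.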
We point out that parametrizations of the form (11) fall within the class of linear approximation methods since the nonlinear space is approximated by the linear space (DeVore, 1998). The quality of the best possible linear approximation to a nonlinear space is given by the Kolmogorov -width where is the dimension of the linear space used in the approximation (Pinkus, 1985). The rate of decay of the -width as a function of quantifies how well the linear space approximates the nonlinear one. It is well known that for some problems, such as the flow maps of advection-dominated PDEs, the -widths decay very slowly; hence a very large is needed for a good approximation for such problems (Cohen and DeVore, 2015). This can be limiting in practice, as more parameters are needed in order to describe more basis functions and therefore more data is needed to fit these parameters.
On the other hand, we point out that parametrizations of the form (5), and the particular case (10), constitute (in general) a form of nonlinear approximation. The benefits of nonlinear approximation are well understood in the setting of function approximation, see e.g. (DeVore, 1998); however, the theory for the operator setting is still in its infancy (Bonito et al., 2020; Cohen et al., 2020). We observe numerically in Section 6 that nonlinear parametrizations such as (10) outperform linear ones such as DeepONets or the low-rank method introduced in Section 4.2.
3.3 Transformers are Neural Operators
We will now show that our neural operator framework can be viewed as a continuum generalization of the popular transformer architecture (Vaswani et al., 2017), which has been extremely successful in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020) and, more recently, has become a popular choice for computer vision tasks (Dosovitskiy et al., 2020). The parallel stems from the fact that we can view sequences of arbitrary length as arbitrary discretizations of functions. Indeed, in the context of natural language processing, we may think of a sentence as a "word"-valued function on, for example, the domain . Assuming our function is linked to a sentence with a fixed semantic meaning, adding or removing words from the sentence simply corresponds to refining or coarsening the discretization of . We will now make this intuition precise.
We will show that by making a particular choice of the nonlinear integral kernel operator (8) and discretizing the integral by a Monte-Carlo approximation, a neural operator layer reduces to a pre-normalized, single-headed attention, transformer block as originally proposed in (Vaswani et al., 2017). For simplicity, we assume and that for any , the bias term is zero, and is the identity. Further, to simplify notation, we will drop the layer index from (9) and, employing (8), obtain
a single layer of the neural operator where is the input function to the layer and we denote by the output function. We use the notation to indicate that the kernel depends on the entirety of the function and not simply its pointwise values. While this is not explicitly done in (8), it is a straightforward generalization. We now pick a specific form for the kernel: in particular, we assume does not explicitly depend on the spatial variables but only on the input pair . Furthermore, we let
where is a matrix of free parameters, i.e. its entries are concatenated in so they are learned, and is defined as
Here are again matrices of free parameters, is a hyperparameter, and is the Euclidean inner-product on . Putting this together, we find that (13) becomes
Equation (14) can be thought of as the continuum limit of a transformer block. To see this, we will discretize to obtain the usual transformer block.
To that end, let be a uniformly-sampled, -point discretization of and denote and for . Approximating the inner-integral in (14) by Monte-Carlo, we have
Plugging this into (14) and using the same approximation for the outer integral yields
Define , where denotes the -dimensional probability simplex, as the softmax function
Then we may re-write (15) as
Furthermore, if we re-parametrize where and are matrices of free parameters, we obtain
which is precisely the single-headed attention, transformer block with no layer normalization applied inside the activation function. In the language of transformers, the matrices , , and correspond to the query, key, and value functions respectively. We note that tricks such as layer normalization (Ba et al., 2016) can be adapted in a straightforward manner to the continuum setting and incorporated into (14). Furthermore, multi-headed self-attention can be realized by simply allowing to be a sum over multiple functions of the above form, all of which have separate trainable parameters. Including such generalizations yields the continuum limit of the transformer as implemented in practice. We do not pursue this here as our goal is simply to draw a parallel between the two methods.
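The discrete picture can be checked directly: a single-headed attention block is an n x n "kernel matrix" of normalized weights applied to linearly transformed node values. The sketch below uses random parameters and a ReLU in place of a generic activation; it is meant only to exhibit the structure, not to reproduce the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 10, 4                          # sequence length (grid points), model width
V = rng.standard_normal((n, d))       # node values v(x_1), ..., v(x_n)
Q, K, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# attention weights play the role of a normalized kernel k(v(x_i), v(x_j))
A = softmax((V @ Q.T) @ (V @ K.T).T / np.sqrt(d))   # n x n "kernel matrix"
U = np.maximum(A @ (V @ Wv.T), 0.0)                 # sigma( sum_j A_ij * (W v_j) )
```

Each row of `A` is a probability vector, so the sum over `j` already plays the role of the (re-parametrized) Monte-Carlo average over the discretization points.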
While we have not rigorously experimented with using transformer architectures for the problems outlined in Section 5, we have found, in initial tests, that they perform worse, are slower, and require more memory than neural operators using (6)-(8). Their high computational complexity is evident from (14) as we must evaluate a nested integral of for each . Recently, more efficient transformer architectures have been proposed, e.g. (Choromanski et al., 2020), and some have been applied to computer vision tasks. We leave as interesting future work experimenting with these architectures and comparing them to the neural operator, both on problems in scientific computing and on more traditional machine learning tasks.
4 Parameterization and Computation
In this section, we discuss various ways of parameterizing the infinite-dimensional architecture (5). The goal is to find an intrinsic infinite-dimensional parameterization that achieves small error (say ), and then rely on numerical approximation to ensure that this parameterization delivers an error of the same magnitude (say ) for all data discretizations fine enough. In this way, the number of parameters used to achieve error is independent of the data discretization. In many applications we have in mind, the data discretization is something we can control, for example when generating input/output pairs from solutions of partial differential equations via numerical simulation. The proposed approach allows us to train a neural operator approximation using data from different discretizations, and to predict with discretizations different from those used in the data, all by relating everything to the underlying infinite-dimensional problem.
We also discuss the computational complexity of the proposed parameterizations and suggest algorithms which yield efficient numerical methods for approximation. Subsections 4.1-4.4 delineate each of the proposed methods.
To simplify notation, we will only consider a single layer of (5) i.e. (9) and choose the input and output domains to be the same. Furthermore, we will drop the layer index and write the single layer update as
where is a bounded domain, is the input function and is the output function. When the domains of and are different, we will usually extend them to a larger common domain. We will consider to be fixed, and, for the time being, take to be the Lebesgue measure on . Equation (16) then leaves three objects which can be parameterized: , , and . Since is linear and acts only locally on , we will always parametrize it by the values of its associated matrix, hence yielding parameters.
The rest of this section will be dedicated to choosing the kernel function and the computation of the associated integral kernel operator. For clarity of exposition, we consider only the simplest proposed version of this operator (6) but note that similar ideas may also be applied to (7) and (8). Furthermore, we will drop , , and from (16) and simply consider the linear update
To demonstrate the computational challenges associated with (17), let be a uniformly-sampled -point discretization of . Recall that we assumed and, for simplicity, suppose that , then the Monte Carlo approximation of (17) is
Therefore to compute on the entire grid requires matrix-vector multiplications. Each of these matrix-vector multiplications requires operations; for the rest of the discussion, we treat as constant and consider only the cost with respect to the discretization parameter since and are fixed by the architecture choice whereas varies depending on required discretization accuracy and hence may be arbitrarily large. This cost is not specific to the Monte Carlo approximation but is generic for quadrature rules which use the entirety of the data. Therefore, when is large, computing (17) becomes intractable and new ideas are needed in order to alleviate this. Subsections 4.1-4.4 propose different approaches to the solution to this problem, inspired by classical methods in numerical analysis. We finally remark that, in contrast, computations with , , and only require operations which justifies our focus on computation with the kernel integral operator.
It will often be useful to consider the kernel matrix associated to for the discrete points . We define the kernel matrix to be the block matrix with each block given by the value of the kernel i.e.
where we use to index an individual block rather than a matrix element. Various numerical algorithms for the efficient computation of (17) can be derived based on assumptions made about the structure of this matrix, for example, bounds on its rank or sparsity.
4.1 Graph Neural Operator (GNO)
We first outline the Graph Neural Operator (GNO) which approximates (17) by combining a Nyström approximation with domain truncation and is implemented with the standard framework of graph neural networks. This construction was originally proposed in Li et al. (2020c).
A simple yet effective method to alleviate the cost of computing (17) is employing a Nyström approximation. This amounts to sampling uniformly at random the points over which we compute the output function . In particular, let be randomly selected points and, assuming , approximate (17) by
We can view this as a low-rank approximation to the kernel matrix , in particular,
where is a block matrix and , are interpolation matrices, for example, linearly extending the function to the whole domain from the random nodal points. The complexity of this computation is ; hence it remains quadratic, but only in the number of subsampled points, which we assume is much smaller than the number of points in the original discretization.
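A small sketch of the Nyström idea with an assumed smooth Gaussian-type kernel: the integral defining each output value is evaluated using only m randomly sampled quadrature nodes instead of all n grid points, reducing the total cost from O(n^2) to O(n m).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 20
xs = np.linspace(0.0, 1.0, n)
a = np.sin(2 * np.pi * xs)

# assumed smooth scalar kernel for illustration
kappa = lambda x, y: np.exp(-10.0 * (x - y) ** 2)

# full quadrature: O(n) work per output point, O(n^2) total
v_full = kappa(xs[:, None], xs[None, :]) @ a / n

# Nystrom: integrate over m << n randomly sampled nodes, O(n m) total
idx = np.sort(rng.choice(n, size=m, replace=False))
v_nys = kappa(xs[:, None], xs[idx][None, :]) @ a[idx] / m

err = np.max(np.abs(v_full - v_nys))   # Monte-Carlo quadrature error
```

For smooth kernels the error decays with `m` at the usual Monte-Carlo rate, independently of the evaluation grid size `n`.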
Another simple method to alleviate the cost of computing (17) is to truncate the integral to a sub-domain of which depends on the point of evaluation . Let be a mapping of the points of to the Lebesgue measurable subsets of denoted . Define then (17) becomes
If the size of each set is smaller than then the cost of computing (19) is where is a constant depending on . While the cost remains quadratic in , the constant can have a significant effect in practical computations, as we demonstrate in Section 6. For simplicity and ease of implementation, we only consider where for some fixed . With this choice of and assuming that , we can explicitly calculate that .
Furthermore notice that we do not lose any expressive power when we make this approximation so long as we combine it with composition. To see this, consider the example of the previous paragraph where, if we let , then (19) reverts to (17). Pick and let with be the smallest integer such that . Suppose that is computed by composing the right hand side of (19) times with a different kernel every time. The domain of influence of is then hence it is easy to see that there exist kernels such that computing this composition is equivalent to computing (17) for any given kernel with appropriate regularity. Furthermore the cost of this computation is and therefore the truncation is beneficial if which holds for any when and any when . Therefore we have shown that we can always reduce the cost of computing (17) by truncation and composition. From the perspective of the kernel matrix, truncation enforces a sparse, block diagonally-dominant structure at each layer. We further explore the hierarchical nature of this computation using the multipole method in subsection 4.3.
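The effect of truncation on the kernel matrix can be illustrated by zeroing all blocks outside the ball B(x, r); with a rapidly decaying kernel (an assumption made here for illustration), the truncated output is close to the full one while only a small fraction of the kernel evaluations is needed.

```python
import numpy as np

n, r = 200, 0.1
xs = np.linspace(0.0, 1.0, n)
a = np.sin(2 * np.pi * xs)

# assumed rapidly decaying kernel, nearly supported in B(0, r)
kappa = lambda x, y: np.exp(-500.0 * (x - y) ** 2)

K = kappa(xs[:, None], xs[None, :])
v_full = K @ a / n

# truncation: integrate only over s(x) = B(x, r) intersected with D
mask = np.abs(xs[:, None] - xs[None, :]) <= r
v_trunc = (K * mask) @ a / n

kept = mask.mean()                       # fraction of kernel entries retained
err = np.max(np.abs(v_full - v_trunc))   # truncation error
```

The mask gives the kernel matrix the sparse, block diagonally-dominant structure described above: only roughly a 2r fraction of each row is nonzero.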
Besides being a useful computational tool, truncation can also be interpreted as explicitly building local structure into the kernel . For problems where such structure exists, explicitly enforcing it makes learning more efficient, usually requiring less data to achieve the same generalization error. Many physical systems such as interacting particles in an electric potential exhibit strong local behavior that quickly decays, making truncation a natural approximation technique.
Graph Neural Networks.
We utilize the standard architecture of message passing graph networks employing edge features as introduced in Gilmer et al. (2017) to efficiently implement (17) on arbitrary discretizations of the domain . To do so, we treat a discretization as the nodes of a weighted, directed graph and assign edges to each node using the function which, recall from the section on truncation, assigns to each point a domain of integration. In particular, for , we assign the node the value and emanate from it edges to the nodes which we call the neighborhood of . If then the graph is fully-connected. Generally, the sparsity structure of the graph determines the sparsity of the kernel matrix , indeed, the adjacency matrix of the graph and the block kernel matrix have the same zero entries. The weights of each edge are assigned as the arguments of the kernel. In particular, for the case of (17), the weight of the edge between nodes and is simply the concatenation . More complicated weighting functions are considered for the implementation of the integral kernel operators (7) or (8).
With the above definition the message passing algorithm of Gilmer et al. (2017), with averaging aggregation, updates the value of the node to the value as
which corresponds to the Monte-Carlo approximation of the integral (19). More sophisticated quadrature rules and adaptive meshes can also be implemented using the general framework of message passing on graphs, see, for example, Pfaff et al. (2020). We further utilize this framework in subsection 4.3.
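Under truncation, the update above can be written as a message-passing step on the induced graph: each node averages kernel-weighted messages from its neighborhood N(x). The edge kernel below is a hypothetical two-layer network and all parameters are random; the sketch only mirrors the structure of the Gilmer et al. (2017) update with averaging aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, r = 32, 2, 0.2
xs = np.linspace(0.0, 1.0, n)
v = rng.standard_normal((n, d))       # node features v(x_i)
W = rng.standard_normal((d, d))       # local linear part

# hypothetical edge network acting on the edge weight (x_i, x_j)
W1 = rng.standard_normal((8, 2)); W2 = rng.standard_normal((d * d, 8))
def kappa(xi, xj):
    return (W2 @ np.tanh(W1 @ np.array([xi, xj]))).reshape(d, d)

def update(v):
    out = np.empty_like(v)
    for i in range(n):
        # neighborhood N(x_i) = {j : |x_j - x_i| <= r}, the graph edges from truncation
        nbrs = [j for j in range(n) if abs(xs[j] - xs[i]) <= r]
        # averaging aggregation of kernel-weighted messages
        msg = sum(kappa(xs[i], xs[j]) @ v[j] for j in nbrs) / len(nbrs)
        out[i] = np.maximum(W @ v[i] + msg, 0.0)
    return out

out = update(v)
```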
Convolutional Neural Networks.
Lastly, we compare and contrast the GNO framework to standard convolutional neural networks (CNNs). In computer vision, the success of CNNs has largely been attributed to their ability to capture local features such as edges that can be used to distinguish different objects in a natural image. This property is obtained by enforcing the convolution kernel to have local support, an idea similar to our truncation approximation. Furthermore, by directly using a translation-invariant kernel, a CNN architecture becomes translation equivariant; this is a desirable feature for many vision models, e.g. ones that perform segmentation. We will show that similar ideas can be applied to the neural operator framework to obtain an architecture with built-in local properties and translational symmetries that, unlike CNNs, remain consistent in function space.
To that end, let and suppose that is supported on . Let be the smallest radius such that where denotes the center of mass of and suppose . Then (17) becomes the convolution
Notice that (20) is precisely (19) when and . When the kernel is parameterized by e.g. a standard neural network and the radius is chosen independently of the data discretization, (20) becomes a layer of a convolution neural network that is consistent in function space. Indeed the parameters of (20) do not depend on any discretization of . The choice enforces translational equivariance in the output while picking small enforces locality in the kernel; hence we obtain the distinguishing features of a CNN model.
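A function-space-consistent convolution layer can be sketched directly: the kernel depends only on x - y, is supported on a ball of radius r chosen independently of the grid, and the very same kernel is reused when the discretization is refined. The bump-shaped kernel below is an arbitrary illustrative choice.

```python
import numpy as np

r = 0.05
# translation-invariant kernel supported on B(0, r); fixed, discretization-independent
kappa = lambda s: np.where(np.abs(s) <= r, np.cos(np.pi * s / (2.0 * r)), 0.0)

def conv_layer(n):
    xs = np.linspace(0.0, 1.0, n, endpoint=False)
    a = np.sin(2 * np.pi * xs)
    # v(x) = integral of kappa(x - y) a(y) dy, approximated on the grid
    return xs, kappa(xs[:, None] - xs[None, :]) @ a / n

xs1, v1 = conv_layer(128)
xs2, v2 = conv_layer(256)            # refined grid, same kernel, same parameters
gap = np.max(np.abs(v2[::2] - v1))   # outputs agree at shared points as n grows
```

Since the parameterization of `kappa` never references the grid, refining the discretization changes only the quadrature, not the layer itself.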
We will now show that, by picking a parameterization that is inconsistent in function space and applying a Monte Carlo approximation to the integral, (20) becomes a standard CNN. This is most easily demonstrated when and the discretization is equispaced i.e. for any . Let
be an odd filter size and let be the points for . It is easy to see that , which we choose as the support of . Furthermore, we parameterize directly by its pointwise values, which are matrices at the locations , thus yielding parameters. Then (20) becomes
where we define if . Up to the constant factor, which can be re-absorbed into the parameterization of , this is precisely the update of a stride-1 CNN with input channels, output channels, and zero-padding so that the input and output signals have the same length. This example can easily be generalized to higher dimensions and different CNN structures; we made the current choices for simplicity of exposition. Notice that if we double the number of discretization points for i.e. and , the support of becomes