NOMAD: Nonlinear Manifold Decoders for Operator Learning

by Jacob H. Seidman, et al.
University of Pennsylvania

Supervised learning in function spaces is an emerging area of machine learning research with applications to the prediction of complex physical systems such as fluid flows, solid mechanics, and climate modeling. By directly learning maps (operators) between infinite dimensional function spaces, these models are able to learn discretization invariant representations of target functions. A common approach is to represent such target functions as linear combinations of basis elements learned from data. However, there are simple scenarios where, even though the target functions form a low dimensional submanifold, a very large number of basis elements is needed for an accurate linear representation. Here we present NOMAD, a novel operator learning framework with a nonlinear decoder map capable of learning finite dimensional representations of nonlinear submanifolds in function spaces. We show this method is able to accurately learn low dimensional representations of solution manifolds to partial differential equations while outperforming linear models of larger size. Additionally, we compare to state-of-the-art operator learning methods on a complex fluid dynamics benchmark and achieve competitive performance with a significantly smaller model size and training cost.




1 Introduction

Figure 1: The Operator Learning Manifold Hypothesis.

Machine learning techniques have been applied to great success for modeling functions between finite dimensional vector spaces. For example, in computer vision (vectors of pixel values) and natural language processing (vectors of word embeddings) these methods have produced state-of-the-art results in image recognition he2016deep and translation tasks vaswani2017attention. However, not all data has an obvious and faithful representation as finite dimensional vectors. In particular, functional data is mathematically represented as a vector in an infinite dimensional vector space. This kind of data appears naturally in problems coming from physics, where scenarios in fluid dynamics, solid mechanics, and kinematics are described by functions of continuous quantities.

Supervised learning in the infinite dimensional setting can be considered for cases where we want to map functional inputs to target functional outputs. For example, we might wish to predict the velocity of a fluid as a function of time given an initial velocity field, or predict the pressure field across the surface of the Earth given temperature measurements. This is similar to a finite dimensional regression problem, except that we are now interested in learning an operator between spaces of functions. We refer to this as a supervised operator learning problem: given a data-set of $N$ pairs of functions $\{(u^i, s^i)\}_{i=1}^N$, learn an operator $\mathcal{F}$ which maps input functions to output functions such that $\mathcal{F}(u^i) = s^i$ for all $i$.

One approach to solve the supervised operator learning problem is to introduce a parameterized operator architecture and train it to minimize a loss between the model’s predicted functions and the true target functions in the training set. One of the first operator network architectures was presented in chen1995universal with accompanying universal approximation guarantees in the uniform norm. These results were adapted to deep networks in lu2021learning and led to the DeepONet architecture and its variants wang2021improved; lu2022comprehensive; jin2022mionet. The Neural Operator architecture, motivated by the composition of linear and nonlinear layers in neural networks, was proposed in li2020neural. Using the Fourier convolution theorem to compute the integral transform in Neural Operators led to the Fourier Neural Operator li2020fourier. Other recent architectures include approaches based on PCA-based representations bhattacharya2021model, random feature approaches nelsen2021random, wavelet approximations to integral transforms gupta2021multiwavelet, and attention-based architectures kissas2022learning.

A common feature shared among many of these approaches is that they aim to approximate an operator using three maps: an encoder, an approximator, and a decoder, see Figure 1 and Section 3 for more details. In all existing approaches embracing this structure, the decoder is constructed as a linear map. In doing so, the set of target functions is being approximated with a finite dimensional linear subspace in the ambient target function space. Under this setting, the universal approximation theorems of chen1995universal; kovachki2021universal; lanthaler2022error guarantee that there exists a linear subspace of a large enough dimension which approximates the target functions to any prescribed accuracy.

However, as with finite dimensional data, there are scenarios where the target functional data concentrates on a low dimensional nonlinear submanifold. We refer to the phenomenon of data in function spaces concentrating on low dimensional submanifolds as the Operator Learning Manifold Hypothesis, see Figure 1. For example, it is known that certain classes of parametric partial differential equations admit low dimensional nonlinear manifolds of solution functions cohen2015approximation. Although linear representations can be guaranteed to approximate these spaces, their required dimension can become very large and thus inefficient in capturing the true low dimensional structure of the data.

In this paper, we are motivated by the Operator Learning Manifold Hypothesis to formulate a new class of operator learning architectures with nonlinear decoders. Our key contributions can be summarized as follows.

  • Limitations of Linear Decoders: We describe in detail the shortcomings of operator learning methods with linear decoders and present some fundamental lower bounds along with an illustrative operator learning problem which is subject to these limitations.

  • Nonlinear Manifold Decoders (NOMAD): This motivates a novel operator learning framework with a nonlinear decoder that can find low dimensional representations for finite dimensional nonlinear submanifolds in function spaces.

  • Enhanced Dimensionality Reduction: A collection of numerical experiments involving linear transport and nonlinear wave propagation shows that, by learning nonlinear submanifolds of target functions, we can build models that achieve state-of-the-art accuracy while requiring a significantly smaller number of latent dimensions.

  • Enhanced Computational Efficiency: As a consequence, the resulting architectures contain a significantly smaller number of trainable parameters and their training cost is greatly reduced compared to competing linear approaches.

We begin our presentation in Section 2 by providing a taxonomy of representative works in the literature. In Section 3 we formally define the supervised operator learning problem and discuss existing approximation strategies, with a focus on highlighting open challenges and limitations. In Section 4 we present the main contributions of this work and illustrate their utility through the lens of a pedagogical example. In Section 5 we provide a comprehensive collection of experiments that demonstrate the performance of using NOMAD against competing state-of-the-art methods for operator learning. Section 6 summarizes our main findings and discusses lingering limitations and broader impact. Additional details on architectures, hyperparameter selection, and training details are provided in the Supplemental Materials.

2 Related Work in Dimensionality Reduction

Low Dimensional Representations in Finite Dimensional Vector Spaces:

Finding low dimensional representations of high dimensional data has a long history, going back to 1901 with the original formulation of principal components analysis (PCA) pearson1901liii. PCA is a linear method that works best when data concentrates on low dimensional subspaces. When data instead concentrates on low dimensional nonlinear spaces, kernelized PCA scholkopf1998nonlinear and manifold learning techniques such as Isomap and diffusion maps tenenbaum2000global; coifman2006diffusion can be effective in finding nonlinear low dimensional structure, see Maaten2009DimensionalityRA for a review. The recent popularity of deep learning has introduced new methods for finding low dimensional structure in high dimensional data-sets, most notably using auto-encoders wang2016auto; champion2019data and deep generative models creswell2018generative; kingma2013auto. Relevant to our work, such techniques have found success in approximating submanifolds in vector spaces corresponding to discretized solutions of parametric partial differential equations (PDEs) sirovich1987turbulence; schilders2008model; geelen2022operator, where a particular need for nonlinear dimension reduction arises in advection-dominated problems common to fluid mechanics and climate science lee2020model; maulik2021reduced.

Low Dimensional Representations in Infinite Dimensional Vector Spaces:

The principles behind PCA generalize in a straightforward way to functions residing in low dimensional subspaces of infinite dimensional Hilbert spaces wang2016functional. In the field of reduced order modeling of PDEs this is sometimes referred to as proper orthogonal decomposition chatterjee2000introduction (see liang2002proper for an interesting exposition of the discrete version and connections to the Karhunen-Loève decomposition). Affine representations of solution manifolds to parametric PDEs and guarantees on when they are effective using the notion of linear $n$-widths pinkus2012n have been explored in cohen2015approximation. As in the case of finite dimensional data, using a kernel to create a feature representation of a set of functions, and then performing PCA in the associated Reproducing Kernel Hilbert Space can give nonlinear low dimensional representations song2021nonlinear. The theory behind optimal nonlinear low dimensional representations for sets of functions is still being developed, but there has been work towards defining what “optimal” should mean in this context and how it relates to more familiar geometric quantities cohen2021optimal.

3 Operator Learning


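Let us first set up some notation and give a formal statement of the supervised operator learning problem. We define $C(\mathcal{X}; \mathbb{R}^d)$ as the set of continuous functions from a set $\mathcal{X}$ to $\mathbb{R}^d$. When $\mathcal{X} \subset \mathbb{R}^k$, we define the Hilbert space,

$$L^2(\mathcal{X}; \mathbb{R}^d) = \left\{ f : \mathcal{X} \to \mathbb{R}^d \;\middle|\; \|f\|_{L^2} := \left( \int_{\mathcal{X}} \|f(x)\|^2_{\mathbb{R}^d} \, dx \right)^{1/2} < \infty \right\}.$$

This is an infinite dimensional vector space equipped with the inner product $\langle f, g \rangle = \int_{\mathcal{X}} \langle f(x), g(x) \rangle_{\mathbb{R}^d} \, dx$. When $\mathcal{X}$ is compact, we have that $C(\mathcal{X}; \mathbb{R}^d) \subset L^2(\mathcal{X}; \mathbb{R}^d)$. We can now present a formal statement of the supervised operator learning problem.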

Problem Formulation:

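Suppose we are given a training data-set of $N$ pairs of functions $(u^i, s^i)$, where $u^i \in C(\mathcal{X}; \mathbb{R}^{d_u})$ with compact $\mathcal{X} \subset \mathbb{R}^{d_x}$, and $s^i \in C(\mathcal{Y}; \mathbb{R}^{d_s})$ with compact $\mathcal{Y} \subset \mathbb{R}^{d_y}$. Assume there is a ground truth operator $\mathcal{G} : C(\mathcal{X}; \mathbb{R}^{d_u}) \to C(\mathcal{Y}; \mathbb{R}^{d_s})$ such that $\mathcal{G}(u^i) = s^i$ and that the $u^i$ are sampled i.i.d. from a probability measure $\mu$ on $C(\mathcal{X}; \mathbb{R}^{d_u})$. The goal of the supervised operator learning problem is to learn a continuous operator $\mathcal{F} : C(\mathcal{X}; \mathbb{R}^{d_u}) \to C(\mathcal{Y}; \mathbb{R}^{d_s})$ to approximate $\mathcal{G}$. To do so, we will attempt to minimize the following empirical risk over a class of operators $\mathcal{F}_\theta$, with parameters $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$,

$$\mathcal{L}(\theta) := \frac{1}{N} \sum_{i=1}^{N} \left\| \mathcal{F}_\theta(u^i) - s^i \right\|^2_{L^2(\mathcal{Y}; \mathbb{R}^{d_s})}. \tag{1}$$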
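As a rough illustration of the empirical risk in (1), the sketch below approximates each squared $L^2$ norm by a mean over query points on a uniform grid of the unit interval. The quadrature rule, the toy data, and the names `squared_l2_norm`, `empirical_risk`, and `predict` are all assumptions made for this example; the text does not prescribe them.

```python
def squared_l2_norm(values, domain_length=1.0):
    """Approximate the squared L2 norm of a scalar function from samples
    on a uniform grid, using a simple mean-value (Riemann) rule."""
    return domain_length * sum(v * v for v in values) / len(values)


def empirical_risk(predict, inputs, targets, query_points):
    """Average squared L2 error of `predict` over a data-set of function pairs.

    predict(u, y) is the model output for input function u, evaluated at y.
    inputs and targets are lists of Python callables u^i and s^i.
    """
    total = 0.0
    for u, s in zip(inputs, targets):
        residual = [predict(u, y) - s(y) for y in query_points]
        total += squared_l2_norm(residual)
    return total / len(inputs)


# Toy data (assumed): u^i(x) = a_i and s^i(y) = a_i * y,
# i.e. the antiderivative of a constant function.
amplitudes = [1.0, 2.0, 3.0]
inputs = [lambda x, a=a: a for a in amplitudes]
targets = [lambda y, a=a: a * y for a in amplitudes]
grid = [j / 100 for j in range(101)]

exact = lambda u, y: u(0.0) * y   # reproduces the targets exactly
zero = lambda u, y: 0.0           # always predicts the zero function

risk_exact = empirical_risk(exact, inputs, targets, grid)
risk_zero = empirical_risk(zero, inputs, targets, grid)
print(risk_exact, risk_zero)
```

A model that reproduces every target has zero risk, while the zero predictor incurs the average squared norm of the targets.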


An Approximation Framework for Operators:

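A popular approach to learning an operator $\mathcal{G} : C(\mathcal{X}; \mathbb{R}^{d_u}) \to C(\mathcal{Y}; \mathbb{R}^{d_s})$ acting on a probability measure $\mu$ on $C(\mathcal{X}; \mathbb{R}^{d_u})$ is to construct an approximation $\mathcal{F}$ out of three maps lanthaler2022error (see Figure 1),

$$\mathcal{G} \approx \mathcal{F} := \mathcal{D} \circ \mathcal{A} \circ \mathcal{E}. \tag{2}$$

The first map, $\mathcal{E} : C(\mathcal{X}; \mathbb{R}^{d_u}) \to \mathbb{R}^m$, is known as the encoder. It takes an input function and maps it to a finite dimensional feature representation. For example, $\mathcal{E}$ could take a continuous function to its point-wise evaluations along a collection of $m$ sensors, or project a function onto $m$ basis functions. The next map $\mathcal{A} : \mathbb{R}^m \to \mathbb{R}^n$ is known as the approximation map. This can be interpreted as a finite dimensional approximation of the action of the operator $\mathcal{G}$. Finally, the image of the approximation map is used to create the output functions in $C(\mathcal{Y}; \mathbb{R}^{d_s})$ by means of the decoding map $\mathcal{D} : \mathbb{R}^n \to C(\mathcal{Y}; \mathbb{R}^{d_s})$. We will refer to the dimension, $n$, of the domain of the decoder as the latent dimension. The composition of these maps can be visualized in the following diagram.

$$C(\mathcal{X}; \mathbb{R}^{d_u}) \xrightarrow{\;\mathcal{E}\;} \mathbb{R}^m \xrightarrow{\;\mathcal{A}\;} \mathbb{R}^n \xrightarrow{\;\mathcal{D}\;} C(\mathcal{Y}; \mathbb{R}^{d_s}) \tag{3}$$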
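The encoder, approximator, and decoder composition can be sketched in a few lines. This is only a minimal illustration: the sensor layout, the averaging approximator, and the one-dimensional decoder below are hypothetical toy choices, not the architectures used in the experiments.

```python
def make_encoder(sensors):
    """E: evaluate an input function at m fixed sensor locations."""
    return lambda u: [u(x) for x in sensors]


def make_operator(encoder, approximator, decoder):
    """F = D o A o E; the result maps an input function to an output function."""
    def F(u):
        beta = approximator(encoder(u))      # latent coordinates in R^n
        return lambda y: decoder(beta, y)    # output function evaluated at y
    return F


# Hypothetical toy choices (not taken from the paper):
sensors = [j / 4 for j in range(5)]          # m = 5 sensors on [0, 1]
approximator = lambda e: [sum(e) / len(e)]   # A: R^5 -> R^1 (mean sensor value)
decoder = lambda beta, y: beta[0] * y        # D: R^1 -> function of y

F = make_operator(make_encoder(sensors), approximator, decoder)
s_pred = F(lambda x: 2.0)                    # constant input u(x) = 2
print(s_pred(0.5))                           # antiderivative of 2 at y = 0.5
```

Here the latent dimension is $n = 1$ and the decoder is linear in `beta`; swapping in a different `decoder` changes only the last stage of the pipeline.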


Linear Decoders:

Many successful operator learning architectures such as the DeepONet lu2021learning, the (pseudo-spectral) Fourier Neural Operator in kovachki2021universal, LOCA kissas2022learning, and the PCA-based method in bhattacharya2021model all use linear decoding maps $\mathcal{D}$. A linear $\mathcal{D} : \mathbb{R}^n \to C(\mathcal{Y}; \mathbb{R}^{d_s})$ can be defined by a set of functions $\tau_k \in C(\mathcal{Y}; \mathbb{R}^{d_s})$, $k = 1, \ldots, n$, and acts on a vector $\beta \in \mathbb{R}^n$ as

$$\mathcal{D}(\beta) = \beta_1 \tau_1 + \cdots + \beta_n \tau_n. \tag{4}$$

For example, the functions $\tau_k$ can be built using trigonometric polynomials as in the $\Psi$-FNO kovachki2021universal, be parameterized by a neural network as in DeepONet lu2021learning, or created as the normalized output of a kernel integral transform as in LOCA kissas2022learning.

Limitations of Linear Decoders:

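We can measure the approximation accuracy of the operator $\mathcal{F}$ with two different norms. First is the $L^2(\mu)$ operator norm,

$$\|\mathcal{F} - \mathcal{G}\|^2_{L^2(\mu)} = \mathbb{E}_{u \sim \mu} \left[ \|\mathcal{F}(u) - \mathcal{G}(u)\|^2_{L^2} \right]. \tag{5}$$

Note that the empirical risk used to train a model for the supervised operator learning problem (see (1)) is a Monte Carlo approximation of the above population loss. The other option to measure the approximation accuracy is the uniform operator norm over a set of input functions $\mathcal{U} \subseteq C(\mathcal{X}; \mathbb{R}^{d_u})$,

$$\|\mathcal{F} - \mathcal{G}\|_{\mathcal{U}} = \sup_{u \in \mathcal{U}} \sup_{y \in \mathcal{Y}} \|\mathcal{F}(u)(y) - \mathcal{G}(u)(y)\|_{\mathbb{R}^{d_s}}. \tag{6}$$

When a linear decoder is used for $\mathcal{F} = \mathcal{D} \circ \mathcal{A} \circ \mathcal{E}$, a data-dependent lower bound to each of these errors can be derived.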

$L^2$ lower bound:

When the pushforward measure $\mathcal{G}_\# \mu$ has a finite second moment, its covariance operator $\Gamma = \mathbb{E}_{s \sim \mathcal{G}_\# \mu}[s \otimes s]$ is self-adjoint, positive semi-definite, and trace-class, and thus admits an orthogonal set of eigenfunctions spanning its image, $\{\phi_1, \phi_2, \ldots\}$, with associated decreasing eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$. The decay of these eigenvalues indicates the extent to which samples from $\mathcal{G}_\# \mu$ concentrate along the leading finite-dimensional eigenspaces. It was shown in lanthaler2022error that for any choice of $\mathcal{E}$ and $\mathcal{A}$, these eigenvalues give a fundamental lower bound to the expected squared $L^2$ error of the operator learning problem with architectures as in (3) using a linear decoder $\mathcal{D}$,

$$\mathbb{E}_{u \sim \mu} \left[ \|\mathcal{D} \circ \mathcal{A} \circ \mathcal{E}(u) - \mathcal{G}(u)\|^2_{L^2} \right] \geq \sum_{k > n} \lambda_k. \tag{7}$$

This result can be further refined to show that the optimal choice of functions $\tau_k$ (see equation (4)) for a linear decoder are given by the leading $n$ eigenfunctions of the covariance operator, $\{\phi_1, \ldots, \phi_n\}$. The interpretation of this result is that the best way to approximate samples from $\mathcal{G}_\# \mu$ with an $n$-dimensional subspace is to use the subspace spanned by the first $n$ “principal components” of the probability measure $\mathcal{G}_\# \mu$. The error incurred by using this subspace is determined by the remaining principal components, namely the sum of their eigenvalues $\sum_{k > n} \lambda_k$. The operator learning literature has noted that for problems with a slowly decaying pushforward covariance spectrum (such as solutions to advection-dominated PDEs) these lower bounds cause poor performance for models of the form (3) lanthaler2022error; de2022cost.
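As a rough worked example of how the bound (7) constrains the latent dimension, suppose (purely for illustration; this spectrum is an assumption and not a case analyzed in the text) that the eigenvalues decay algebraically as $\lambda_k = k^{-2}$. Comparing the tail sum with an integral gives

```latex
\sum_{k > n} \lambda_k \;=\; \sum_{k = n+1}^{\infty} \frac{1}{k^{2}}
\;\geq\; \int_{n+1}^{\infty} \frac{dx}{x^{2}} \;=\; \frac{1}{n+1}.
```

Under this assumption, no choice of encoder and approximator can bring the expected squared $L^2$ error of a linear decoder below $\epsilon$ unless $n \geq 1/\epsilon - 1$; for instance, $\epsilon = 10^{-3}$ would require $n \geq 999$.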

Uniform lower bound:

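In the reduced order modelling of PDEs literature cohen2015approximation; cohen2021optimal; lee2020model there exists a related notion for measuring the degree to which an $n$-dimensional subspace can approximate a set of functions $\mathcal{S}$. This is known as the Kolmogorov $n$-width pinkus2012n, and for a compact set $\mathcal{S} \subset C(\mathcal{Y}; \mathbb{R}^{d_s})$ is defined as

$$d_n(\mathcal{S}) = \inf_{\substack{V_n \subset C(\mathcal{Y}; \mathbb{R}^{d_s}) \\ \dim V_n = n}} \; \sup_{s \in \mathcal{S}} \; \inf_{v \in V_n} \|s - v\|_\infty, \tag{8}$$

where $\|\cdot\|_\infty$ denotes the supremum norm on $C(\mathcal{Y}; \mathbb{R}^{d_s})$.

This measure of how well a set of functions can be approximated by a linear subspace in the uniform norm leads naturally to a lower bound for the uniform error (6). To see this, first note that for any $u \in \mathcal{U}$, the error from $\mathcal{F}(u)$ to $\mathcal{G}(u)$ is bounded from below by the minimum distance from $\mathcal{G}(u)$ to the image of $\mathcal{D}$. For a linear decoder $\mathcal{D}$, define the (at most) $n$-dimensional subspace $V_n = \mathrm{im}(\mathcal{D})$. Note that $\mathcal{F}(\mathcal{U}) \subseteq V_n$, and we may write

$$\|\mathcal{F}(u) - \mathcal{G}(u)\|_\infty \geq \inf_{v \in V_n} \|v - \mathcal{G}(u)\|_\infty.$$

Taking the supremum of both sides over $u \in \mathcal{U}$, and then bounding the right-hand side from below by its infimum over all $n$-dimensional subspaces $V_n$ gives

$$\sup_{u \in \mathcal{U}} \|\mathcal{F}(u) - \mathcal{G}(u)\|_\infty \geq \inf_{V_n} \sup_{u \in \mathcal{U}} \inf_{v \in V_n} \|v - \mathcal{G}(u)\|_\infty.$$

The quantity on the right is exactly the Kolmogorov $n$-width of $\mathcal{G}(\mathcal{U})$. We have thus proved the following complementary statement to (7) when the error is measured in the uniform norm.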

Proposition 1

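Let $\mathcal{U} \subset C(\mathcal{X}; \mathbb{R}^{d_u})$ be compact and consider an operator learning architecture as in (3), where $\mathcal{D} : \mathbb{R}^n \to C(\mathcal{Y}; \mathbb{R}^{d_s})$ is a linear decoder. Then, for any $\mathcal{E}$ and $\mathcal{A}$, the uniform norm error of $\mathcal{F} := \mathcal{D} \circ \mathcal{A} \circ \mathcal{E}$ satisfies the lower bound

$$\sup_{u \in \mathcal{U}} \sup_{y \in \mathcal{Y}} \|\mathcal{F}(u)(y) - \mathcal{G}(u)(y)\| \geq d_n(\mathcal{G}(\mathcal{U})), \tag{9}$$

where $d_n(\mathcal{G}(\mathcal{U}))$ is the Kolmogorov $n$-width of $\mathcal{G}(\mathcal{U})$.

Therefore, we see that in both the $L^2$ and uniform norm, the error for an operator learning problem with a linear decoder is fundamentally limited by the extent to which the space of output functions “fits” inside a finite dimensional linear subspace. In the next section we will alleviate this fundamental restriction by allowing decoders that can learn nonlinear embeddings of $\mathbb{R}^n$ into $C(\mathcal{Y}; \mathbb{R}^{d_s})$.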

4 Nonlinear Decoders for Operator Learning

A Motivating Example:

Consider the problem of learning the antiderivative operator mapping functions to their antiderivative

$$\mathcal{G} : u \mapsto s(x) := \int_0^x u(y) \, dy, \tag{10}$$

acting on a set of input functions

$$\mathcal{U} := \left\{ u(x) = 2 \pi t \cos(2 \pi t x) \;\middle|\; 0 \leq t_0 \leq t \leq T \right\}. \tag{11}$$

The set of output functions is given by $\mathcal{G}(\mathcal{U}) = \{ \sin(2 \pi t x) \mid t_0 \leq t \leq T \}$. This is a one-dimensional curve of functions in $L^2([0, 1])$ parameterized by a single number $t$. However, we would not be able to represent this set of functions with a one-dimensional linear subspace. In Figure 2(a) we perform PCA on the functions in this set evaluated on a uniform grid of values of $t$. We see that the leading eigenvalues are nonzero and relatively constant, suggesting that an operator learning architecture with a linear or affine decoder would need a correspondingly large latent dimension to effectively approximate functions from $\mathcal{G}(\mathcal{U})$. Figure 2(b) gives a visualization of this curve of functions projected onto the first three PCA components. We will return to this example in Section 5, and see that an architecture with a nonlinear decoder can in fact approximate the target output functions with superior accuracy compared to the linear case, using a single latent dimension that can capture the underlying nonlinear manifold structure.
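The PCA experiment described above can be approximated with a short NumPy sketch. It assumes the output functions are sinusoids $\sin(2 \pi t x)$ indexed by a frequency $t$; the frequency range, grid sizes, and the 99% variance threshold are illustrative choices rather than the settings behind the figure.

```python
import numpy as np

# Assumed setup: output functions s_t(x) = sin(2*pi*t*x) on [0, 1].
# The frequency range and grid sizes are illustrative choices.
x = np.linspace(0.0, 1.0, 512)
t = np.linspace(1.0, 10.0, 200)
S = np.sin(2.0 * np.pi * np.outer(t, x))      # one output function per row

# PCA: eigenvalues of the empirical covariance from the singular values
# of the centered data matrix (returned in decreasing order).
S_centered = S - S.mean(axis=0)
singular_values = np.linalg.svd(S_centered, compute_uv=False)
eigenvalues = singular_values**2 / (len(t) - 1)

# Smallest number of components explaining at least 99% of the variance.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
n_needed = int(np.searchsorted(explained, 0.99) + 1)
print("components needed for 99% of the variance:", n_needed)
```

Although the family is indexed by the single parameter `t`, the number of linear components needed grows with the range of frequencies, which is the behavior the example is meant to expose.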

Figure 2: Antiderivative Example: (a) Logarithm of the leading PCA eigenvalues of $\mathcal{G}(\mathcal{U})$; (b) Projection of functions in the image of $\mathcal{G}$ on the first three PCA components, colored by the frequency of each projected function; (c) Relative testing error (log scale) as a function of latent dimension $n$ for linear and nonlinear decoders (over 10 independent trials).

Operator Learning Manifold Hypothesis:

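We now describe an assumption under which a nonlinear decoder is expected to be effective, and use this to formulate the NOMAD architecture. To this end, let $\mu$ be a probability measure on $C(\mathcal{X}; \mathbb{R}^{d_u})$ and $\mathcal{G} : C(\mathcal{X}; \mathbb{R}^{d_u}) \to C(\mathcal{Y}; \mathbb{R}^{d_s})$. We assume that there exists an $n$-dimensional manifold $\mathcal{M} \subseteq C(\mathcal{Y}; \mathbb{R}^{d_s})$ and an open subset $\mathcal{O} \subseteq \mathcal{M}$ such that, for a small tolerance $\epsilon > 0$,

$$\mathbb{E}_{u \sim \mu} \left[ \inf_{v \in \mathcal{O}} \|\mathcal{G}(u) - v\|^2_{L^2} \right] \leq \epsilon. \tag{12}$$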


In connection with the manifold hypothesis in deep learning cayton2005algorithms; brahma2015deep, we refer to this as the Operator Learning Manifold Hypothesis. There are scenarios where it is known this assumption holds, such as in learning solutions to parametric PDEs maulik2021reduced.

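This assumption motivates the construction of a nonlinear decoder for the architecture in (3) as follows. For each $u$, choose $v(u) \in \mathcal{O}$ such that

$$\|\mathcal{G}(u) - v(u)\|^2_{L^2} \leq \inf_{v \in \mathcal{O}} \|\mathcal{G}(u) - v\|^2_{L^2} + \epsilon. \tag{13}$$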

Figure 3: An example of linear versus nonlinear decoders.

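Let $\varphi : \mathcal{O} \to \mathbb{R}^n$ be a coordinate chart for $\mathcal{O} \subseteq \mathcal{M}$. We can represent $v(u) \in \mathcal{O}$ by its coordinates $\varphi(v(u)) \in \mathbb{R}^n$. Consider a choice of encoding and approximation maps such that $\mathcal{A}(\mathcal{E}(u))$ gives the coordinates for $v(u)$. If the decoder were chosen as $\mathcal{D} := \varphi^{-1}$, then by construction $\mathcal{F}(u) = v(u)$, and taking the expectation of (13) and applying (12) shows that the operator $\mathcal{F} := \mathcal{D} \circ \mathcal{A} \circ \mathcal{E}$ will satisfy

$$\mathbb{E}_{u \sim \mu} \left[ \|\mathcal{G}(u) - \mathcal{F}(u)\|^2_{L^2} \right] \leq 2 \epsilon. \tag{14}$$

Therefore, we interpret a learned decoding map as attempting to give a finite dimensional coordinate system for the solution manifold. Consider a generalized decoder of the following form

$$\tilde{\mathcal{D}} : \mathbb{R}^n \times \mathcal{Y} \to \mathbb{R}^{d_s}. \tag{15}$$

This induces a map $\mathcal{D} : \mathbb{R}^n \to C(\mathcal{Y}; \mathbb{R}^{d_s})$, as $\mathcal{D}(\beta) = \tilde{\mathcal{D}}(\beta, \cdot)$. If the solution manifold $\mathcal{M}$ is a finite dimensional linear subspace in $C(\mathcal{Y}; \mathbb{R}^{d_s})$ spanned by $\{\tau_k\}_{k=1}^n$, we would want a decoder to use the coefficients along the basis as a coordinate system for $\mathcal{M}$. A generalized decoder could learn this basis as the output of a deep neural network to act as

$$\tilde{\mathcal{D}}(\beta, y) = \beta_1 \tau_1(y) + \cdots + \beta_n \tau_n(y). \tag{16}$$

However, if the solution manifold is not linear, then we should learn a nonlinear coordinate system given by a nonlinear $\tilde{\mathcal{D}}$. A nonlinear version of $\tilde{\mathcal{D}}$ can be parameterized by using a deep neural network $f$ which jointly takes as arguments $(\beta, y)$,

$$\tilde{\mathcal{D}}(\beta, y) = f(\beta, y). \tag{17}$$

When used in the context of an operator learning architecture of the form (3), we call a nonlinear decoder from (17) NOMAD (NOnlinear MAnifold Decoder). Figure 3 presents a visual comparison between linear and nonlinear decoders.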
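To make the contrast between the two decoder types concrete, here is a minimal NumPy sketch for scalar-valued outputs with untrained random weights. The helper names `init_mlp` and `mlp`, the layer widths, and the `tanh` activation are assumptions for illustration only; the sketch shows the structural difference between a basis expansion and a network on the concatenated input, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Random weights for a fully connected network with the given layer sizes."""
    return [(rng.normal(size=(fan_in, fan_out)) / np.sqrt(fan_in), np.zeros(fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]


def mlp(params, z):
    """Apply the network to a batch of inputs z with shape (batch, sizes[0])."""
    for W, b in params[:-1]:
        z = np.tanh(z @ W + b)
    W, b = params[-1]
    return z @ W + b


n = 3                                   # latent dimension (illustrative)
trunk = init_mlp([1, 32, n])            # y -> (tau_1(y), ..., tau_n(y))
nomad = init_mlp([n + 1, 32, 1])        # (beta, y) -> scalar output


def linear_decoder(beta, y):
    """Linear decoder: sum_k beta_k * tau_k(y), with tau given by a network."""
    tau = mlp(trunk, y[:, None])        # shape (P, n)
    return tau @ beta                   # shape (P,)


def nomad_decoder(beta, y):
    """Nonlinear decoder: a network applied to the concatenation of beta and y."""
    z = np.concatenate([np.tile(beta, (len(y), 1)), y[:, None]], axis=1)
    return mlp(nomad, z)[:, 0]          # shape (P,)


y = np.linspace(0.0, 1.0, 50)           # P = 50 query points
beta = rng.normal(size=n)
s_lin = linear_decoder(beta, y)
s_non = nomad_decoder(beta, y)
```

Scaling `beta` scales the output of `linear_decoder` by the same factor, whereas `nomad_decoder` responds nonlinearly to the latent coordinates.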

Summary of NOMAD:

Under the assumption of the Operator Learning Manifold Hypothesis, we have proposed a fully nonlinear decoder (17) to represent target functions using architectures of the form (3). We next show that using a decoder of the form (17) results in operator learning architectures which can learn nonlinear low dimensional solution manifolds. Additionally, we will see that when these solution manifolds do not “fit” inside low dimensional linear subspaces, architectures with linear decoders will either fail or require a significantly larger number of latent dimensions.

5 Results

In this section we investigate the effect of using linear versus nonlinear decoders as building blocks of operator learning architectures taking the form (3). In all cases, we will use an encoder $\mathcal{E}$ which takes point-wise evaluations of the input functions, and an approximator map $\mathcal{A}$ given by a deep neural network. The linear decoder parametrizes a set of basis functions that are learned as the outputs of an MLP network. In this case, the resulting architecture exactly corresponds to the DeepONet model from lu2021learning. We will compare this against using NOMAD where the nonlinear decoder is built using an MLP network that takes as inputs the concatenation of $\beta \in \mathbb{R}^n$ and a given query point $y \in \mathcal{Y}$. All models are trained by performing stochastic gradient descent on the loss function in (1). The reported errors are measured in the relative $L^2$ norm by averaging over all functional pairs in the testing data-set. More details about architectures, hyperparameter settings, and training are provided in the Supplemental Materials.
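The reported error metric can be sketched as follows, assuming each function is represented by its values on a common set of query points so that the quadrature weights cancel in the ratio. The function names and the two-sample test set are hypothetical.

```python
import math


def relative_l2_error(pred, true):
    """Relative L2 error between predicted and true samples of one function."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den


def mean_relative_l2_error(preds, trues):
    """Average the per-function relative error over a test set."""
    errors = [relative_l2_error(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)


# Hypothetical test set with two output functions sampled at two points each.
trues = [[3.0, 4.0], [1.0, 0.0]]
preds = [[3.0, 4.5], [1.0, 0.0]]
mean_error = mean_relative_l2_error(preds, trues)
print(mean_error)   # first pair: 0.5 / 5 = 0.1, second pair: 0, mean: 0.05
```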

Learning the Antiderivative Operator:

First, we revisit the motivating example from Section 4, where the goal is to learn the antiderivative operator (10) acting on the set of functions (11). In Figure 2(c) we see the performance of a model with a linear decoder and NOMAD over a range of latent dimensions $n$. For each choice of $n$, 10 experiments with random initialization seeds were performed, and the mean and standard deviation of testing errors are reported. We see that the NOMAD architecture consistently outperforms the linear one (by one order of magnitude), and remains effective even when using only a single latent dimension, $n = 1$.

Solution Operator of a Parametric Advection PDE:

Here we consider the problem of learning the solution operator to a PDE describing the transport of a scalar field with conserved energy,

$$\frac{\partial}{\partial t} s(x, t) + \frac{\partial}{\partial x} s(x, t) = 0, \tag{18}$$

over a space-time domain. The solution operator maps an initial condition $s(x, 0) = u(x)$ to the solution $s(x, t)$ at all times, which satisfies (18). We consider a training data-set of initial conditions taking the form of radial basis functions with a very small fixed lengthscale centered at randomly chosen locations in a fixed spatial interval. We create the output functions by evolving these initial conditions forward in time for 1 time unit according to the advection equation (18) (see Supplemental Materials for more details). Figure 4(a) gives an illustration of one such solution plotted over the space-time domain.

Performing PCA on the solution functions generated by these initial conditions shows a very slow decay of eigenvalues (see Figure 4(b)), suggesting that methods with linear decoders will require a moderately large number of latent dimensions. However, since the data-set was constructed by evolving a set of functions with a single degree of freedom (the center of the initial conditions), we would expect the output functions to form a solution manifold of very low dimension.

In Figure 4(c) we compare the performance of a linear decoder and NOMAD as a function of the latent dimension $n$. Linear decoders yield poor performance for small values of $n$, while NOMAD appears to immediately discover a good approximation to the true solution manifold.

Figure 4: Advection Equation: (a) Propagation of an initial condition function (highlighted in black) through time according to (18); (b) Logarithm of the leading PCA eigenvalues of the solution functions; (c) Relative testing error (log scale) as a function of latent dimension $n$ for linear and nonlinear decoders (over 10 independent trials).

Propagation of Free-surface Waves:

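As a more challenging benchmark we consider the shallow-water equations, a set of hyperbolic equations that describe the flow below a pressure surface in a fluid vreugdenhil1994numerical. The underlying PDE system takes the form

$$\frac{\partial \rho}{\partial t} + \frac{\partial (\rho v_1)}{\partial x_1} + \frac{\partial (\rho v_2)}{\partial x_2} = 0,$$

$$\frac{\partial (\rho v_1)}{\partial t} + \frac{\partial}{\partial x_1} \left( \rho v_1^2 + \tfrac{1}{2} g \rho^2 \right) + \frac{\partial (\rho v_1 v_2)}{\partial x_2} = 0,$$

$$\frac{\partial (\rho v_2)}{\partial t} + \frac{\partial (\rho v_1 v_2)}{\partial x_1} + \frac{\partial}{\partial x_2} \left( \rho v_2^2 + \tfrac{1}{2} g \rho^2 \right) = 0,$$

where $\rho$ is the fluid height from the free surface, $g$ is the gravity acceleration, and $v_1$, $v_2$ denote the horizontal and vertical fluid velocities, respectively. We consider reflective boundary conditions and random initial conditions corresponding to a random droplet falling into a still fluid bed (see Supplemental Materials). In Figure 5(a) we show the average testing error of a model with a linear and nonlinear decoder as a function of the latent dimension. Figure 5(b) shows snapshots of the predicted surface height function on top of a plot of the errors to the ground truth for the best, worst, and median samples, as well as a random sample from the testing data-set.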

We additionally use this example to compare the performance of a model with a linear decoder and NOMAD to other state-of-the-art operator learning architectures (see Supplemental Materials for details). In Table 1, we present the mean relative $L^2$ error and its standard deviation for different operator learning methods, as well as the prediction that provides the worst error in the testing data-set when compared against the ground truth solution. For each method we also report the number of its trainable parameters, its latent dimension $n$, and the training wall-clock time in minutes. Since the general form of the FNO li2020fourier does not neatly fit into the architecture given by (3), there is not a directly comparable measure of latent dimension for it. We also observe that, although the model with NOMAD closely matches the performance of LOCA kissas2022learning, its required latent dimension, total number of trainable parameters, and total training time are all significantly smaller.

Figure 5: Propagation of Free-surface Waves: (a) Relative testing error (log scale) as a function of latent dimension $n$ for linear and nonlinear decoders (over 10 independent trials); (b) Visualization of predicted free surface height and point-wise absolute prediction error contours corresponding to the best, worst, and median samples in the test data-set, along with a representative test sample chosen at random.
Method   Latent dimension $n$   Cost (minutes)
LOCA     480                    12.1
DON      480                    15.4
FNO      N/A                    14.0
Table 1: Comparison of relative $L^2$ errors (in %) for the predicted output functions for the shallow water equations benchmark against existing state-of-the-art operator learning methods: LOCA kissas2022learning, DeepONet (DON) lu2021learning, and the Fourier Neural Operator (FNO) li2020fourier. The fourth column reports the relative $L^2$ error corresponding to the worst case example in the test data-set. Also shown is each model’s total number of trainable parameters, latent dimension $n$, and computational cost in terms of training time (minutes).

6 Discussion


We have presented a novel framework for supervised learning in function spaces. The proposed methods aim to address challenging scenarios where the manifold of target functions has low dimensional structure, but is embedded nonlinearly into its associated function space. Such cases commonly arise across diverse functional observables in the physical and engineering sciences (e.g. turbulent fluid flows, plasma physics, chemical reactions), and pose a significant challenge to the application of most existing operator learning methods that rely on linear decoding maps, forcing them to require an excessively large number of latent dimensions to accurately represent target functions. To address this shortcoming we put forth a fully nonlinear framework that can effectively learn low dimensional representations of nonlinear embeddings in function spaces, and demonstrated that it can achieve competitive accuracy to state-of-the-art operator learning methods while using a significantly smaller number of latent dimensions, leading to lighter model parametrizations and reduced training cost.


Our proposed approach relies on the Operator Learning Manifold Hypothesis (see equation (12)), suggesting that cases where a low dimensional manifold structure does not exist will be hard to tackle (e.g. target function manifolds with fractal structure, solutions to evolution equations with strange attractors). Moreover, even when the manifold hypothesis holds, the underlying effective latent embedding dimension is typically not known a priori, and may only be precisely found via cross-validation. Another direct consequence of replacing linear decoders with fully nonlinear maps is that the lower bound in (9) needs to be rephrased in terms of a nonlinear $n$-width, which in general can be difficult to quantify. Finally, in this work we restricted ourselves to exploring simple nonlinear decoder architectures, such as an MLP with the latent parameters and query location concatenated as inputs. Further investigation is needed to quantify the improvements that could be brought by considering more contemporary deep learning architectures, such as hypernetworks hypernetworks which can define input dependent weights for complicated decoder architectures. One example of this idea in the context of reduced order modeling can be found in Pan et al. pan2022neural, where the authors propose a hypernetwork-based method combined with an Implicit Neural Representation network sitzmann2020implicit.


Appendix A Nomenclature

Table 2 summarizes the main symbols and notation used in this work.

$C(A; B)$                    Space of continuous functions from a space $A$ to a space $B$.
$L^2$                        Hilbert space of square integrable functions.
$\mathcal{X}$                Domain for input functions, subset of $\mathbb{R}^{d_x}$.
$\mathcal{Y}$                Domain for output functions, subset of $\mathbb{R}^{d_y}$.
$x$                          Input function arguments.
$y$                          Output function arguments (queries).
$u$                          Input function in $C(\mathcal{X}; \mathbb{R}^{d_u})$.
$s$                          Output function in $C(\mathcal{Y}; \mathbb{R}^{d_s})$.
$n$                          Latent dimension for solution manifold.
$\mathcal{F}, \mathcal{G}$   Operator mapping input functions $u$ to output functions $s$.
Table 2: (Nomenclature) A summary of the main symbols and notation used in this work.

Appendix B Architecture Choices and Hyper-parameter Settings

In this section, we present all architecture choices and training details considered in the experiments for the NOMAD and the DeepONet methods.

For both NOMAD and DeepONet, we set the batch size of input and output pairs equal to 100 (see Table 4). We use a learning rate with exponential decay, applying a decay rate of 0.99 at regular intervals of training iterations. For the results presented in Table 1, we consider the same set-up as in [kissas2022learning] for LOCA, DeepONet and FNO, while for NOMAD we use the same number of hidden layers and neurons as the DeepONet. The order of magnitude difference in the number of parameters between NOMAD and DeepONet for the Shallow Water Equation comparison comes from the difference in latent dimension between the two methods ($n = 480$ for DeepONet, see Table 1) and the fact that in [kissas2022learning] the authors implement the improvements for DeepONet proposed in [lu2021comprehensive], namely performing a Harmonic Feature Expansion for the input functions.
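A learning rate schedule of this kind might be sketched as below. Only the decay rate of 0.99 comes from the text; the initial learning rate, the decay interval, and the staircase form (decay applied once per interval rather than continuously) are placeholder assumptions.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps):
    """Return lr(step) = initial_lr * decay_rate ** (step // decay_steps)."""
    def schedule(step):
        return initial_lr * decay_rate ** (step // decay_steps)
    return schedule


# decay_rate = 0.99 is from the text; the other two values are placeholders.
lr = exponential_decay(initial_lr=1e-3, decay_rate=0.99, decay_steps=100)

for step in (0, 99, 100, 1000):
    print(step, lr(step))
```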

B.1 Model Architecture

In the DeepONet, the approximation map is known as the branch network, and the neural network whose outputs are the basis functions is known as the trunk network. We present the structure of both networks in Table 3. The DeepONet employed in this work is the plain DeepONet version originally put forth in [lu2021comprehensive], without considering the improvements in [lu2021comprehensive, wang2021improved]. We choose the simplest possible architecture because we are interested in examining solely the effect of the decoder, without any additional moving parts. For the NOMAD method, we consider the same architecture as the DeepONet for each problem.
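The difference between the two decoders can be illustrated with a small sketch. It assumes tanh activations, randomly initialized weights, and toy layer sizes, none of which are taken from the paper; the helper names (`init_mlp`, `mlp`, `deeponet_predict`, `nomad_predict`) are hypothetical. The linear decoder combines branch coefficients with trunk basis values through an inner product, whereas the nonlinear decoder feeds the concatenation of the latent code and the query location to an MLP.

```python
import math
import random


def init_mlp(sizes, rng):
    """Random weights for a fully connected network with the given layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def mlp(layers, inputs):
    """Forward pass: tanh on hidden layers (an assumption), linear output layer."""
    h = list(inputs)
    for k, (weights, biases) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]
        if k < len(layers) - 1:
            h = [math.tanh(v) for v in h]
    return h


def deeponet_predict(branch, trunk, u_sensors, y):
    """Linear decoder: inner product of branch coefficients and trunk basis values."""
    beta = mlp(branch, u_sensors)  # n latent coefficients
    tau = mlp(trunk, y)            # n basis functions evaluated at the query y
    return sum(b * t for b, t in zip(beta, tau))


def nomad_predict(branch, decoder, u_sensors, y):
    """Nonlinear decoder: an MLP applied to the concatenation [beta, y]."""
    beta = mlp(branch, u_sensors)
    return mlp(decoder, beta + list(y))[0]


# Toy sizes for illustration only (not the depth/width values of Table 3).
rng = random.Random(0)
m, n, d_y = 8, 4, 1
branch = init_mlp([m, 16, 16, n], rng)
trunk = init_mlp([d_y, 16, 16, n], rng)
decoder = init_mlp([n + d_y, 16, 16, 1], rng)
u_sensors = [math.sin(k) for k in range(m)]
print(deeponet_predict(branch, trunk, u_sensors, [0.3]))
print(nomad_predict(branch, decoder, u_sensors, [0.3]))
```

Both models share the same branch network here, so any difference in behaviour comes from the decoder alone, which mirrors the comparison the text describes.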

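| Example | Branch depth | Branch width | Trunk depth | Trunk width |
| --- | --- | --- | --- | --- |
| Antiderivative | 5 | 100 | 5 | 100 |
| Parametric Advection | 5 | 100 | 5 | 100 |
| Free Surface Waves | 5 | 100 | 5 | 100 |

Table 3: Architecture choices for different examples.
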
| Example | $N_{train}$ | $N_{test}$ | $m$ | $P$ | Batch | Train iterations |
| --- | --- | --- | --- | --- | --- | --- |
| Antiderivative | 1000 | 1000 | 500 | 500 | 100 | 20000 |
| Parametric Advection | 1000 | 1000 | 256 | 25600 | 100 | 20000 |
| Free Surface Waves | 1000 | 1000 | 1024 | 128 | 100 | 100000 |

Table 4: Training details for the experiments in this work. We present the number of training and testing data pairs, $N_{train}$ and $N_{test}$, respectively, the number of sensor locations $m$ where the input functions are evaluated, the number of query points $P$ where the output functions are evaluated, the batch size, and the total number of training iterations.

Appendix C Experimental Details

C.1 Data-set generation

For all experiments, we use $N_{train}$ function pairs for training and $N_{test}$ for testing, and we denote by $m$ and $P$ the number of points where the input and output functions are evaluated, respectively. See Table 4 for the values of these parameters for the different examples, along with batch sizes and total training iterations. We train and test with the same data-set on each example for both NOMAD and DeepONet.

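We build collections of measurements for each of the input/output function pairs as follows. The input function $u^i$ is measured at $m$ locations $x_1, \dots, x_m$ to give the point-wise evaluations $\{u^i(x_1), \dots, u^i(x_m)\}$. The output function $s^i$ is evaluated at $P$ locations $y^i_1, \dots, y^i_P$, with these locations potentially varying over the data-set, to give the point-wise evaluations $\{s^i(y^i_1), \dots, s^i(y^i_P)\}$. Each data pair used in training is then given as $\left(\{u^i(x_j)\}_{j=1}^{m}, \{s^i(y^i_\ell)\}_{\ell=1}^{P}\right)$.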
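The assembly of a single data pair can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: the helper `make_pair`, the toy functions, and the choice of drawing query locations at random from a pool are all hypothetical. The sensor grid is shared by every pair, while the query locations may change from pair to pair.

```python
import random


def make_pair(u, s, sensors, query_pool, num_queries, rng):
    """Return (input measurements, query locations, output measurements)."""
    u_vals = [u(x) for x in sensors]               # fixed sensor grid, shared by all pairs
    queries = rng.sample(query_pool, num_queries)  # may differ from pair to pair
    s_vals = [s(y) for y in queries]
    return u_vals, queries, s_vals


# Toy example: u(x) = 2x and its antiderivative s(y) = y^2 on [0, 1].
rng = random.Random(0)
sensors = [j / 4 for j in range(5)]         # m = 5 sensor locations
query_pool = [j / 100 for j in range(101)]  # candidate query locations
u_vals, queries, s_vals = make_pair(lambda x: 2 * x, lambda y: y * y,
                                    sensors, query_pool, 3, rng)
print(u_vals, queries, s_vals)
```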

C.2 Antiderivative

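We approximate the antiderivative operator

$$\mathcal{G}: u \mapsto s(x) := \int_0^x u(y)\,dy,$$

acting on a set of input functions. The set of output functions is the image of this set under $\mathcal{G}$, and we consider the initial condition $s(0) = 0$. For a given forcing term $u$, the solution operator returns the antiderivative $s(x)$. Our goal is to learn the solution operator $\mathcal{G}$. In this case the input and output functions are scalar-valued functions of a single variable.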

To construct the data-sets we sample input functions and evaluate them on $m = 500$ equispaced sensor locations. We measure the corresponding output functions on $P = 500$ equispaced locations. We construct $N_{train} = 1000$ input/output function pairs for training and $N_{test} = 1000$ pairs for testing the model (see Table 4).
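A data-set of this kind could be generated along the following lines. The one-parameter family $u_t(x) = 2\pi t \cos(2\pi t x)$ with antiderivative $\sin(2\pi t x)$, the domain $[0, 1]$, the range of $t$, and the helper names are assumptions chosen for illustration; only the counts $m = 500$ and $P = 500$ are taken from Table 4. A cumulative trapezoid rule is included purely as a sanity check that the outputs are antiderivatives of the inputs with $s(0) = 0$.

```python
import math
import random


def sample_antiderivative_pair(t, m=500, P=500):
    """Evaluate u_t(x) = 2*pi*t*cos(2*pi*t*x) and s_t(x) = sin(2*pi*t*x) on [0, 1]."""
    xs = [j / (m - 1) for j in range(m)]  # equispaced sensor locations
    ys = [j / (P - 1) for j in range(P)]  # equispaced query locations
    u_vals = [2.0 * math.pi * t * math.cos(2.0 * math.pi * t * x) for x in xs]
    s_vals = [math.sin(2.0 * math.pi * t * y) for y in ys]
    return xs, u_vals, ys, s_vals


def cumulative_trapezoid(xs, vals):
    """Numerical antiderivative with s(0) = 0, used only as a sanity check."""
    out = [0.0]
    for k in range(1, len(xs)):
        out.append(out[-1] + 0.5 * (vals[k] + vals[k - 1]) * (xs[k] - xs[k - 1]))
    return out


def build_dataset(n_pairs, t_max, seed=0):
    """Draw the parameter t uniformly from [0, t_max] (assumed range) for each pair."""
    rng = random.Random(seed)
    return [sample_antiderivative_pair(rng.uniform(0.0, t_max)) for _ in range(n_pairs)]
```

Calling `build_dataset(1000, t_max)` twice with different seeds would give training and testing sets of the sizes listed in Table 4, with `t_max` left as a free choice.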

C.3 Advection Equation

To demonstrate the benefits of our method, we choose a linear transport equation benchmark, similar to [geelen2022operator],

$$\frac{\partial s}{\partial t} + c\,\frac{\partial s}{\partial x} = 0,$$

with an initial condition $s(x, 0) = s_0(x)$ given by a narrow peak whose parameter is sampled from a uniform distribution. Our goal is to learn the solution operator that maps the initial condition to the solution. The advection equation admits the analytic solution

$$s(x, t) = s_0(x - c t),$$

where the initial condition is propagated through the domain with speed $c$, as shown in Figure 3(a).

We construct training and testing data-sets by sampling $N_{train} = 1000$ and $N_{test} = 1000$ initial conditions and evaluating the analytic solution on a grid of temporal and spatial locations, for a total of $P = 25600$ query points per function (see Table 4). We use a high spatio-temporal resolution for training the model to avoid missing the narrow travelling peak in the pointwise measurements.
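Because the solution is a shifted copy of the initial condition, the output data can be generated without a numerical solver. The sketch below assumes a Gaussian-shaped peak, a unit speed, a domain of $[0, 2] \times [0, 1]$, and a grid of 100 temporal by 256 spatial points; these are illustrative choices, with the grid split picked only because its product matches $P = 25600$ in Table 4.

```python
import math


def gaussian_peak(x, mu, width=1e-2):
    """Hypothetical narrow initial condition centred at mu (illustrative shape only)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * width ** 2))


def advect(s0, x, t, c=1.0):
    """Analytic solution of s_t + c s_x = 0: the initial profile shifted by c*t."""
    return s0(x - c * t)


def solution_grid(mu, n_t=100, n_x=256, x_max=2.0, t_max=1.0, c=1.0):
    """Evaluate the analytic solution at n_t * n_x space-time query points."""
    def s0(x):
        return gaussian_peak(x, mu)

    ts = [t_max * i / (n_t - 1) for i in range(n_t)]
    xs = [x_max * j / (n_x - 1) for j in range(n_x)]
    return [(x, t, advect(s0, x, t, c)) for t in ts for x in xs]
```

A coarse grid can step over a peak this narrow entirely, which is the reason given in the text for training at high spatio-temporal resolution.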

C.4 Shallow Water Equations

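The shallow water equations are a hyperbolic system of equations that describe the flow below a pressure surface, given as

$$
\begin{aligned}
\frac{\partial \rho}{\partial t} + \frac{\partial (\rho v_1)}{\partial x_1} + \frac{\partial (\rho v_2)}{\partial x_2} &= 0, \\
\frac{\partial (\rho v_1)}{\partial t} + \frac{\partial}{\partial x_1}\left(\rho v_1^2 + \tfrac{1}{2} g \rho^2\right) + \frac{\partial (\rho v_1 v_2)}{\partial x_2} &= 0, \\
\frac{\partial (\rho v_2)}{\partial t} + \frac{\partial (\rho v_1 v_2)}{\partial x_1} + \frac{\partial}{\partial x_2}\left(\rho v_2^2 + \tfrac{1}{2} g \rho^2\right) &= 0,
\end{aligned}
$$

where $\rho$ is the total fluid column height, $v_1$ the velocity in the $x_1$-direction, $v_2$ the velocity in the $x_2$-direction, and $g$ the acceleration due to gravity.

We consider impenetrable reflective boundaries,

$$\mathbf{v} \cdot \hat{n} = 0,$$

where $\mathbf{v} = (v_1, v_2)$ and $\hat{n}$ is the unit outward normal of the boundary.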

Initial conditions are generated from a droplet of random width falling from a random height to a random spatial location, with zero initial velocities. The droplet is described by the altitude it falls from, its width, and the coordinates of the point where it lands at the initial time. Instead of choosing the solution at the initial time as the input function, we use the solution at a later time, so that the input velocities are not always zero. The components of the input functions are then the fluid column height and the two velocity components at that time.

The altitude, the width and the two landing coordinates of the droplet are random variables distributed according to uniform distributions.

For a given set of input functions, the solution operator of equation (24) maps the fluid column height and velocity fields at the input time to the fluid column height and velocity fields at later times. Therefore, our goal is to learn this solution operator.
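The sampling of a random droplet can be sketched as follows. The Gaussian bump shape, the unit base height, the unit square domain, the parameter ranges, and the names `sample_droplet` and `initial_state` are all assumptions made for illustration. The 32 by 32 grid is chosen only because it gives 1024 points, the value of $m$ in Table 4.

```python
import math
import random


def sample_droplet(rng, ranges):
    """Draw droplet altitude, width and landing coordinates from uniform distributions."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def initial_state(params, grid_size=32, base_height=1.0):
    """Gaussian bump on a flat surface with zero velocities (assumed droplet shape)."""
    coords = [k / (grid_size - 1) for k in range(grid_size)]
    a, w = params["altitude"], params["width"]
    c1, c2 = params["x1"], params["x2"]
    height = [[base_height + a * math.exp(-((x1 - c1) ** 2 + (x2 - c2) ** 2) / w)
               for x2 in coords] for x1 in coords]
    v1 = [[0.0] * grid_size for _ in range(grid_size)]
    v2 = [[0.0] * grid_size for _ in range(grid_size)]
    return height, v1, v2


# Placeholder ranges; the values used in the paper are not reproduced here.
RANGES = {"altitude": (1.5, 2.5), "width": (0.002, 0.008),
          "x1": (0.4, 0.6), "x2": (0.4, 0.6)}
params = sample_droplet(random.Random(0), RANGES)
height, v1, v2 = initial_state(params)
```

Advancing this state by one or more solver steps would then give non-zero velocities, which is the motivation stated above for taking the input function at a later time.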

We create a training and a testing data-set by sampling $N_{train} = 1000$ and $N_{test} = 1000$ input/output function samples: we sample initial conditions on a grid, solve the equations using a Lax-Friedrichs scheme [moin2010fundamentals], and consider five snapshots of the solution. For training, we randomly choose $P = 128$ measurements per data pair from the available spatio-temporal data of the output functions (see Table 4).
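To make the Lax-Friedrichs update concrete, the sketch below applies it to a much simpler problem than the one in the text: a scalar linear advection equation in one dimension with periodic boundaries. It is not the two-dimensional shallow water solver used for the experiments, and the function name and parameters are hypothetical. The update averages the two neighbours and subtracts a centred flux difference.

```python
import math


def lax_friedrichs_step(u, c, dt, dx):
    """One Lax-Friedrichs update for u_t + c u_x = 0 with periodic boundaries."""
    n = len(u)
    new = []
    for j in range(n):
        left, right = u[j - 1], u[(j + 1) % n]  # u[-1] wraps around for j = 0
        new.append(0.5 * (left + right) - 0.5 * c * dt / dx * (right - left))
    return new


# Advect a smooth bump for 50 steps at a CFL number of c*dt/dx = 0.5.
dx, dt, c = 0.01, 0.005, 1.0
u0 = [math.exp(-((j * dx - 0.5) ** 2) / 0.005) for j in range(100)]
u = u0
for _ in range(50):
    u = lax_friedrichs_step(u, c, dt, dx)
```

For a CFL number of at most one, each new value is a convex combination of its two neighbours, so the scheme does not create new extrema, and with periodic boundaries the total sum of the solution is preserved up to rounding.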

Appendix D Comparison Metrics

Throughout this work, we employ the relative error as a metric to assess the test accuracy of each model, namely the norm of the difference between the model-predicted solution and the ground truth solution, normalized by the norm of the ground truth solution, computed separately for each realization in the testing data-set. The relative error is computed across all examples in the testing data-set, and two statistics of this error vector are reported: the mean and the standard deviation. For the Shallow Water Equations, where we train on a lower resolution of the output domain, we compute the testing error using a full resolution grid.
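A minimal sketch of this metric is given below. It assumes the Euclidean norm over the query points, without squaring, and the sample (rather than population) standard deviation; both choices are assumptions, and the function names are hypothetical.

```python
import math
import statistics


def relative_l2_error(pred, truth):
    """||truth - pred||_2 / ||truth||_2 for one test example."""
    num = math.sqrt(sum((t - p) ** 2 for p, t in zip(pred, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    return num / den


def error_statistics(preds, truths):
    """Mean and sample standard deviation of the per-example relative errors."""
    errors = [relative_l2_error(p, t) for p, t in zip(preds, truths)]
    return statistics.mean(errors), statistics.stdev(errors)
```

Each test example contributes one entry to the error vector, so `error_statistics` expects one prediction and one ground truth array per example.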