 # DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators

While it is widely known that neural networks are universal approximators of continuous functions, a less known and perhaps more powerful result is that a neural network with a single hidden layer can approximate accurately any nonlinear continuous operator <cit.>. This universal approximation theorem is suggestive of the potential application of neural networks in learning nonlinear operators from data. However, the theorem guarantees only a small approximation error for a sufficient large network, and does not consider the important optimization and generalization errors. To realize this theorem in practice, we propose deep operator networks (DeepONets) to learn operators accurately and efficiently from a relatively small dataset. A DeepONet consists of two sub-networks, one for encoding the input function at a fixed number of sensors x_i, i=1,...,m (branch net), and another for encoding the locations for the output functions (trunk net). We perform systematic simulations for identifying two types of operators, i.e., dynamic systems and partial differential equations, and demonstrate that DeepONet significantly reduces the generalization error compared to the fully-connected networks. We also derive theoretically the dependence of the approximation error in terms of the number of sensors (where the input function is defined) as well as the input function type, and we verify the theorem with computational results. More importantly, we observe high-order error convergence in our computational tests, namely polynomial rates (from half order to fourth order) and even exponential convergence with respect to the training dataset size.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The universal approximation theorem states that neural networks can be used to approximate any continuous function to arbitrary accuracy if no constraint is placed on the width and depth of the hidden layers [7, 11]. However, another approximation result, which is yet more surprising and has not been appreciated so far, states that a neural network with a single hidden layer can approximate accurately any nonlinear continuous functional (a mapping from a space of functions into the real numbers) [3, 18, 25] or (nonlinear) operator (a mapping from a space of functions into another space of functions) [5, 4].

Before reviewing the approximation theorem for operators, we introduce some notation, which will be used through this paper. Let be an operator taking an input function , and then is the corresponding output function. For any point in the domain of , the output is a real number. Hence, the network takes inputs composed of two parts: and , and outputs (Fig. 1A). Although our goal is to learn operators, which take a function as the input, we have to represent the input functions discretely, so that network approximations can be applied. A straightforward and simple way, in practice, is to employ the function values at sufficient but finite many locations ; we call these locations as “sensors” (Fig. 1A). Next, we state the following theorem due to Chen & Chen , see appendix for more details.

###### Theorem 1 (Universal Approximation Theorem for Operator).

Suppose that is a continuous non-polynomial function, is a Banach Space, , are two compact sets in and , respectively, is a compact set in , is a nonlinear continuous operator, which maps into . Then for any , there are positive integers , , , constants , , , , , , such that

 ∣∣ ∣ ∣ ∣ ∣∣G(u)(y)−p∑k=1n∑i=1ckiσ(m∑j=1ξkiju(xj)+θki)branchσ(wk⋅y+ζk)trunk∣∣ ∣ ∣ ∣ ∣∣<ϵ (1)

holds for all and . Figure 1: Illustrations of the problem setup and architectures of DeepONets. (A) The network to learn an operator G:u↦G(u) takes two inputs [u(x1),u(x2),…,u(xm)] and y. (B) Illustration of the training data. For each input function u, we require that we have the same number of evaluations at the same scattered sensors x1,x2,…,xm. However, we do not enforce any constraints on the number or locations for the evaluation of output functions. (C) The stacked DeepONet in Theorem 1 has one trunk network and p stacked branch networks. (D) The unstacked DeepONet has one trunk network and one branch network.

This approximation theorem indicates the potential application of neural networks to learn nonlinear operators from data, i.e., similar to what the deep learning community is currently doing, that is learning functions from data. However, this theorem does not inform us how to learn operators effectively. Considering the classical image classification task as an example, the universal approximation theorem of neural networks for functions

[7, 11]

shows that fully-connected neural networks (FNNs) are capable to approximate the ground-truth classification function accurately, but in practice the performance of FNNs is far from networks with specific architectures, such as the widely-used convolutional neural networks (CNN)

 or the more recent capsule neural network (CapsNet) . The performance gap lies in the fact that the accuracy of NNs can be characterized by dividing the total error into three main types: approximation, optimization, and generalization [1, 17, 13, 16]. The universal approximation theorems only guarantee a small approximation error for a sufficiently large network, but they do not consider the optimization error and generalization error at all, which are equally important and often dominant contributions to the total error in practice. Useful networks should be easy to train, i.e., exhibit small optimization error, and generalize well to unseen data, i.e., exhibit small generalization error.

To demonstrate the capability and effectiveness of learning nonlinear operators by neural networks, we setup the problem as general as possible by using the weakest possible constraints on the sensors and training dataset. Specifically, the only condition required is that the sensor locations are the same but not necessarily on a lattice for all input functions , while we do not enforce any constraints on the output locations (Fig. 1B). To learn operators accurately and efficiently, we propose a specific network architecture, the deep operator network (DeepONet), to achieve smaller total error. We will demonsrate that the DeepONet significantly improves generalization based on a design of two sub-networks, the branch net for the input function and the trunk-net for the location to evaluate the output function.

We consider two types of operators, i.e., dynamic systems (e.g., in the form of ordinary differential equations, ODEs) and partial differential equations (PDEs). Dynamic systems are typically described by difference or differential equations, and identification of a nonlinear dynamic plant is a major concern in control theory. Some works

[22, 33] used neural networks to identify dynamic systems, but they only considered the dynamic systems described by difference equations. Some other works [20, 24, 23, 9] predict the evolution of a specific dynamic system rather than identifying the system behavior for new unseen input signals. The network architectures they employed includes FNNs 

, recurrent neural networks (RNNs)

, reservoir computing , residual networks , neural ordinary differential equations , and neural jump stochastic differential equations . For identifying PDEs, some works treat the input and output function as an image, and then use CNNs to learn the image-to-image mapping [30, 34], but this approach can be only applied to the particular type of problems, where the sensors of the input function are distributed on a equispaced grid, and the training data must include all the values with also on a equispaced grid. In another approach without this restriction, PDEs are parametrized by unknown coefficients, and then only the coefficient values are identified from data [2, 26, 31, 21, 15]. Alternatively, a generalized CNN based on generalized moving least squares  can be used for unstructured data, but it can only approximate local operators and is not able to learn other operators like an integral operator.

The paper is organized as follows. In Section 2, we present two network architectures of DeepONet: the stacked DeepONet and the unstacked DeepONet, and then introduce the data generation procedure. In Section 3, we present a theoretical analysis on the number of sensors required to represent the input function accurately for approximating ODE operators. In Section 4, we test the performance of FNN, stacked DeepONet, and unstacked DeepONet for different examples, and demonstrate the accuracy and convergence rates of unstacked DeepONet. Finally, we conclude the paper in Section 5.

## 2 Methodology

### 2.1 Deep operator networks (DeepONets)

We focus on learning operators in a more general setting, where the only requirement for the training dataset is the consistency of the sensors for input functions. In this general setting, the network inputs consist of two separate components: and (Fig. 1A), and the goal is to achieve good performance by designing the network architecture. One straightforward solution is to directly employ a classical network, such as FNN, CNN or RNN, and concatenate two inputs together as the network input, i.e., . However, the input does not have any specific structure, and thus it is not meaningful to choose networks like CNN or RNN. Here we use FNN as the baseline model.

In high dimensional problems,

is a vector with

components, so the dimension of does not match the dimension of for any more. This also prevents us from treating and equally, and thus at least two sub-networks are needed to handle and separately. Although the universal approximation theorem (Theorem 1) does not have any guarantee on the total error, it still provides us a network structure in Eq. 1. Theorem 1 only considers a shallow network with one hidden layer, so we extend it to deep networks, which have more expressivity than shallow ones. The architecture we propose is shown in Fig. 1C, and the details are as follows. First there is a “trunk” network, which takes as the input and outputs . In addition to the trunk network, there are “branch” networks, and each of them takes as the input and outputs a scalar for . We merge them together as in Eq. 1:

 G(u)(y)≈p∑k=1bktk.

We note that the trunk network also applies activation functions in the last layer, i.e.,

for , and thus this trunk-branch network can also be seen as a trunk network with each weight in the last layer parameterized by another branch network instead of the classical single variable. We also note that in Eq. 1 the last layer of each branch network does not have bias. Although bias is not necessary in Theorem 1, adding bias may increase the performance by reducing the generalization error. In addition to adding bias to the branch networks, we may also add a bias in the last stage:

 G(u)(y)≈p∑k=1bktk+b0. (2)

In practice, is at least of the order of 10, and using lots of branch networks is computationally and memory expensive. Hence, we merge all the branch networks into one single branch network (Fig. 1D), i.e., a single branch network outputs a vector . In the first DeepONet (Fig. 1C), there are branch networks stacked parallel, so we name it as “stacked DeepONet”, while we refer to the second DeepONet (Fig. 1D) as “unstacked DeepONet”. All versions of DeepONets are implemented in DeepXDE 

, a user-friendly Python library designed for scientific machine learning (

https://github.com/lululxvi/deepxde).

DeepONet is a high-level network architecture without defining the architectures of its inner trunk and branch networks. To demonstrate the capability and good performance of DeepONet alone, we choose the simplest FNN as the architectures of the sub-networks in this study. It is possible that using convolutional layers we could further improve accuracy. However, convolutional layers usually work for square domains with on a equispaced grid, so as alternative and for a more general setting we may use the “attention” mechanism .

Embodying some prior knowledge into neural network architectures usually induces good generalization. This inductive bias has been reflected in many networks, such as CNN for images and RNN for sequential data. The success of DeepONet even using FNN as its sub-networks is also due to its strong inductive bias. The output has two independent inputs and , and thus using the trunk and branch networks explicitly is consistent with this prior knowledge. More broadly, can be viewed as a function of conditioning on . Finding an effective way to represent the conditioning input is still an open question, and different approaches have been proposed, such as feature-wise transformations .

### 2.2 Data generation

The input signal of the process plays an important role in system identification. Clearly, the input signal is the only possibility to influence the process in order to gather information about its response, and the quality of the identification signal determines an upper bound on the accuracy that in the best case can be achieved by any model. In this study, we mainly consider two function spaces: Gaussian random field (GRF) and orthogonal (Chebyshev) polynomials.

We use the mean-zero GRF:

 u∼G(0,kl(x1,x2)),

where the covariance kernel

is the radial-basis function (RBF) kernel with a length-scale parameter

. The length-scale determines the smoothness of the sampled function, and larger leads to smoother .

Let and are Chebyshev polynomials of the first kind. We define the orthogonal polynomials of degree as:

 Vpoly={N−1∑i=0aiTi(x):|ai|≤M}.

We generate the dataset from by randomly sampling from to get a sample of .

After sampling from the chosen function spaces, we solve the ODE systems by Runge-Kutta (4, 5) and PDEs by a second-order finite difference method to obtain the reference solutions. We note that one data point is a triplet , and thus one specific input may appear in multiple data points with different values of . For example, a dataset of size 10000 may only be generated from 100 trajectories, and each evaluates for 100 locations.

## 3 Number of sensors for identifying nonlinear dynamic systems

In this section, we investigate how many sensor points we need to achieve accuracy for identifying nonlinear dynamic systems. Suppose that the dynamic system is subject to the following ODE system:

 {ddxs(x)=g(s(x),u(x),x)s(a)=s0, (3)

where (a compact subset of ) is the input signal, and is the solution of system (3) serving as the output signal.

Let be the operator mapping the input to the output , i.e., satisfies

 (Gu)(x)=s0+∫xag((Gu)(t),u(t),t)dt.

Now, we choose uniformly points from , and define the function as follows:

 um(x)=u(xj)+u(xj+1)−u(xj)xj+1−xj(x−xj), xj≤x≤xj+1, j=0,1,⋯,m−1.

Denote the operator mapping to by , and let , which is obviously a compact subset of since is compact and continuous operator keeps the compactness. Naturally, as the union of two compact sets is also compact. Then, set , and Lemma 7 points out that is still a compact set. Since is a continuous operator, is compact in . The subsequent discussions are mainly within and . For convenience of analysis, we assume that satisfies the Lipschitz condition with respect to and on , i.e., there is a constant such that

 ∥g(s1,u,x)−g(s2,u,x)∥2≤c∥s1−s2∥2∥g(s,u1,x)−g(s,u2,x)∥2≤c|u1−u2|.

Note that this condition is easy to achieve, for instance, as long as is differentiable with respect to and on .

For , there exists a constant depending on and compact space , such that

 maxx∈[a,b]|u(x)−um(x)|≤κ(m,V),κ(m,V)→0  as  m→∞. (4)

When is GRF with RBF kernel, we have , see Appendix C for the proof. Based on the these concepts, we have the following theorem.

###### Theorem 2.

Suppose that is a positive integer making less than , then for any , there exist , such that

 ∥(Gu)(d)−(W2⋅σ(W1⋅[u(x0)⋯u(xm)]T+b1)+b2)∥2<ε

holds for all .

###### Proof.

The proof can be found in Appendix B. ∎

## 4 Simulation results

In this section, we first show that DeepONets have better performance than FNNs due to the smaller generalization error even in the easiest linear problem, and then demonstrate the capability of DeepONets for three nonlinear ODE and PDE problems. In all problems, we use the Adam optimizer with learning rate 0.001, and the number of iterations is chosen to guarantee the training is convergent. The other parameters and network sizes are listed in Tables 1 and 2, unless otherwise stated. The codes of all examples are published in GitHub (https://github.com/lululxvi/deepxde).

### 4.1 A simple 1D dynamic system

A 1D dynamic system is described by

 ds(x)dx=g(s(x),u(x),x),x∈[0,1],

with an initial condition . Our goal is to predict over the whole domain for any .

#### 4.1.1 Linear case: g(s(x),u(x),x)=u(x)

We first consider a linear problem by choosing , which is equivalent to learning the antiderivative operator

 G:u(x)↦s(x)=∫x0u(τ)dτ.

As the baseline, we train FNNs to learn the antiderivative operator. To obtain the best performance of FNNs, we grid search the three hyperparameters: depth from 2 to 4, width from 10 to 2560, and learning rate from 0.0001 to 0.01. The mean squared error (MSE) of the test dataset with learning rate 0.01, 0.001, and 0.0001 are shown in Fig.

2. Although we only choose depth up to 4, the results show that increasing the depth further does not improve the error. Among all these hyperparameters, the smallest test error is obtained for the network with depth 2, width 2560, and learning rate 0.001. We observe that when the network is small, the training error is large and the generalization error (the difference between test error and training error) is small, due to small expressivity. When the network size increases, the training error decreases, but the generalization error increases. We note that FNN has not reached the overfitting region, where the test error increases. Figure 2: Errors of FNNs trained to learn the antiderivative operator (linear case). (A and B) Learning rate 0.01. (C and D) Learning rate 0.001. (E and F) Learning rate 0.0001. (A, C, E) The solid and dash lines are the training error and test error during the training process, respectively. The blue, red and green lines represent FNNs of size (depth 2, width 10), (depth 3, width 160), and (depth 4, width 2560), respectively. (B, D, F) The blue, red and green lines represent FNNs of depth 2, 3 and 4, respectively. The shaded regions in (D) are the one-standard-derivation (SD) from 10 runs with different training/test data and network initialization. For clarity, only the SD of learning rate 0.001 is shown (Fig. D). The number of sensors for u is m=100.

Compared to FNNs, DeepONets have much smaller generalization error and thus smaller test error. Here we do not aim to find the best hyperparameters, and only test the performance of the two DeepONets listed in Table 2. The training trajectory of an unstacked DeepONet with bias is shown in Fig. 3A, and the generalization error is negligible. We observe that for both stacked and unstacked DeepONets, adding bias to branch networks and Eq. (2) reduces both training and test errors (Fig. 3B); DeepONets with bias also have smaller uncertainty, i.e., more stable for training from random initialization (Fig. 3B). Compared to stacked DeepONets, although unstacked DeepONets have larger training error, the test error is smaller, due to the smaller generalization error. Therefore, unstacked DeepONets with bias achieve the best performance. In addition, unstacked DeepONets have fewer number of parameters than stacked DeepONets, and thus can be trained faster using much less memory. Figure 3: Errors of DeepONets trained to learn the antiderivative operator (linear case). (A) The training trajectory of an unstacked DeepONet with bias. (B) The training/test error for stacked/unstacked DeepONets with/without bias compared to the best error of FNNs. The error bars are the one-standard-derivation from 10 runs with different training/test data and network initialization.

#### 4.1.2 Nonlinear case: g(s(x),u(x),x)=−s2(x)+u(x)

Next we consider a nonlinear problem with . Because may explode for certain , we compute the test MSE by removing the 1 worst predictions. During the network training, the training MSE and test MSE of both stacked and unstacked DeepONets decrease, but the correlation between training MSE and test MSE of unstacked DeepONets is tighter (Fig. 4A), i.e., smaller generalization error. This tight correlation between training and test MSE of unstacked DeepONets is also observed across multiple runs with random training dataset and network initialization (Fig. 4B). Moreover, the test MSE and training MSE of unstacked DeepONets follow almost a linear correlation

 MSEtest≈10×MSEtrain−10−4.

Fig. 4C shows that unstacked DeepONets have smaller test MSE due to smaller generalization error. DeepONets work even for out-of-distribution predictions, see three examples of the prediction in Fig. 5. In the following study, we will use unstacked DeepONets. Figure 4: Nonlinear ODE: unstacked DeepONets have smaller generalization error and smaller test MSE than stacked DeepONets. (A) The correlation between the training MSE and the test MSE of one stacked/unstacked DeepONet during the training process. (B) The correlation between the final training MSE and test MSE of stacked/unstacked DeepONets in 10 runs with random training dataset and network initialization. The training and test MSE of unstacked DeepONets follow a linear correlation (black dash line). (C) The mean and one-standard-derivation of data points in B. Figure 5: Nonlinear ODE: predictions of a trained unstacked DeepONet on three out-of-distribution input signals. The blue and red lines represent the reference solution and the prediction of a DeepONet.

### 4.2 Gravity pendulum with an external force

The motion of a gravity pendulum with an external force is described as

 ds1dt =s2, ds2dt =−ksins1+u(t),

with an initial condition , and is determined by the acceleration due to gravity and the length of the pendulum. This problem is characterized by three factors: (1) , (2) maximum prediction time , and (3) input function space. The accuracy of learned networks is determined by four factors: (1) the number of sensor points ; (2) training dataset size; (3) network architecture, (4) optimizer. We investigate the effects of these factors on the accuracy.

#### 4.2.1 Number of sensors

The number of sensors required to distinguish two input functions depends on the value of , prediction time , and the input function space. For the case with , and , when the number of sensors is small, the error decays exponentially as we increase the number of sensors, Fig. 6A):

 MSE∝4.6−#sensors.

When is already large, the effect of increasing is negligible. The transition occurs at 10 sensors, as indicated by the arrow.

To predict for a longer time, more sensors are required (Fig. 6B). For example, predicting until requires 25 sensors. If the function is less smooth corresponding to smaller , it also requires more sensors (Fig. 6C). However, the number of sensors is not sensitive to (Fig. 6D). Although it is hard to quantify the exact dependency of on and , by fitting the computational results we show that

 m∝√Tandm∝l−1.

In Theorem 2, should be large enough to make small. In Appendix C we show theoretically that for the GRF function space with RBF kernel, which is consistent with our computational result here. in the bound is loose compared to the computational results. Figure 6: Gravity pendulum: required number of sensors for different T, k and l. (A) Training MSE (square symbols) and test MSE (circle symbols) decrease as the number of sensors increases in the case k=1, T=1 and l=0.2. Training and test MSE versus the number of sensors in different conditions of (B) T, (C) l, and (D) k. For clarity, the SD is not shown. The arrow indicates where the rate of the error decay diminishes.

#### 4.2.2 Error tendency and convergence

Here we investigate the error tendency under different conditions, including prediction time, network size, and training dataset size. We first observe that both the training and test errors grow exponentially with the maximum prediction time (Fig. 7A):

 MSE∝8T.

We note that the error is not the error at time but the average error over the domain , because we predict the for any . As shown in Fig. 6B, 100 sensor points are sufficient for , and thus the possible reasons for increased error are: (1) the network is too small, and (2) training data is not sufficient. Because the error in is already very small, to leverage the error, we use in the following experiments. By varying the network width, we can observe that there is a best width to achieve the smallest error (Fig. 7B). It is reasonable that increasing the width from 1 to 100 would decrease the error, but the error would instead increase when the width further increases. This could be due to the increased optimization error, and a better learning rate may be used to find a better network.

To examine the effect of training dataset size, we choose networks of width 100 to eliminate the network size effect. The training, test, and generalization errors using different dataset size are shown in Fig. 7C and D, and we have

 MSEtest∝{e−x/2000,for small % datasetx−0.5,for large dataset,MSEgen∝{e−x/2000,for small datasetx−1,for large dataset,

where is the number of training data points. It is surprising that test error and generalization error have exponential convergence for training dataset size . Even for large dataset, the convergence rate of for the generalization error is still higher than the classical in the learning theory . This fast convergence confirms the exceptional performance of DeepONets, especially in the region of small dataset. Figure 7: Gravity pendulum: error tendency and convergence. (A) Both training error (blue) and test error (red) increase fast for long-time prediction if we keep the network size fixed at width 40 and the training data size fixed at 10000 points. (B) Training and test errors first decrease and then increase, when network width increases (T=3). (C) Test error decreases when more training data are used (T=3, width 100). (D) Test error and generalization error have exponential convergence for small training dataset, and then converge with rate x−0.5 and x−1, respectively (T=3, width 100).

#### 4.2.3 Input function spaces

Next we investigate the effect of different function spaces, including GRF with different length scale and the space of Chebyshev polynomials with different number of bases. For a fixed sensor number, we observe that there exists a critical length scale, around where the training and test errors change rapidly, see the arrows in Fig. 8A and B. This sharp transition around the critical value is also observed in the space of Chebyshev polynomials with different number of bases (Fig. 8B). The relations between the critical values and the sensor numbers are

 l∝m−1and#Bases∝√m.

The inverse relation between and is consistent with the theoretical results in Appendix C and the computational results in Section 4.2.1. Therefore, when the function space complexity increases, one may increase to capture the functions well. Figure 8: Gravity pendulum: errors for different input function spaces. Training error (solid line) and test error (dash line) for (A) GRF function space with different length scale l. The colors correspond to the number of sensors as shown in the inset. (B) Chebyshev polynomials for different number of basis functions.

### 4.3 Diffusion-reaction system with a source term

A diffusion-reaction system with a source term is described by

 ∂s∂t=D∂2s∂x2+ks2+u(x),x∈[0,1],t∈[0,1],

with zero initial/boundary conditions, where is the diffusion coefficient, and is the reaction rate. We use DeepONets to learn the operator mapping from to the PDE solution . In the previous examples, for each input , we only use one random point of for training, and instead we may also use multiple points of . To generate the training dataset, we solve the diffusion-reaction system using a second-order implicit finite difference method on a 100 by 100 grid, and then for each we randomly select points out of these grid points (Fig. 9A). Hence, the dataset size is equal to the product of by the number of samples. We confirm that the training and test datasets do not include the data from the same . Figure 9: Learning a diffusion-reaction system. (A) (left) An example of a random sample of the input function u(x). (middle) The corresponding output function s(x,t) at P different (x,t) locations. (right) Pairing of inputs and outputs at the training data points. The total number of training data points is the product of P times the number of samples of u. (B) Training error (blue) and test error (red) for different values of the number of random points P when 100 random u samples are used. (C) Training error (blue) and test error (red) for different number of u samples when P=100. The shaded regions denote one-standard-derivation.

We investigate the error tendency with respect to (w.r.t.) the number of samples and the value of . When we use 100 random samples, the test error decreases first as increases (Fig. 9B), and then saturates due to other factors, such as the finite number of samples and fixed neural network size. We observe a similar error tendency but with less saturation as the number of samples increases with fixed (Fig. 9C). In addition, in this PDE problem the DeepONet is able to learn from a small dataset, e.g., a DeepONet can reach the test error of when it is only trained with 100 samples (). We recall that we test on 10000 grid points, and thus on average each location point only has training data points.

Before the error saturates, the rates of convergence w.r.t. both and the number of samples obey a polynomial law in the most of the range (Figs. 10A and B). The rate of convergence w.r.t. depends on the number of samples, and more samples induces faster convergence until it saturates (the blue line in Fig. 10C). Similarly, the rate of convergence w.r.t. the number of samples depends on the value of (the red line in Fig. 10C). In addition, in the initial range of the convergence, we observe an exponential convergence (Figs. 10D and E) as in Section 4.2.2. The coefficient in the exponential convergence also depends on the number of samples or the value of (Fig. 10F). It is reasonable that the convergence rate in Figs. 10C and F increases with the number of samples or the value of , because the total number of training data points is equal to . However, by fitting the points, it is surprising that there is a clear tendency in the form of either or , which we cannot fully explain yet, and hence more theoretical and computational investigations are required. Figure 10: Error convergence rates for different number of training data points. (A) Convergence of test error with respect to P for different number of u samples. (B) Convergence of test error with respect to the number of u samples for different values of P. (C) The polynomial rates of convergence versus the number of u samples or the values of P. (D) Exponential convergence of test error with respect to P for different number of u samples. (E) Exponential convergence of test error with respect to the number of u samples for different values of P. (F) The coefficient 1/k in the exponential convergence e−x/k versus the number of u samples or the values of P.

## 5 Conclusion

In this paper, we formulate the problem of learning operators in a more general setup, and propose DeepONets to learn nonlinear operators. In DeepONets, we first construct two sub-networks to encode input functions and location variables separately, and then merge them together to compute the output. We test DeepONets on four ordinary/partial differential equation problems, and show that DeepONets can achieve small generalization errors by employing this inductive bias. In our simulations, we study systematically the effects on the test error of different factors, including the number of sensors, maximum prediction time, the complexity of the space of input functions, training dataset size, and network size. We observe different order polynomial and even exponential error convergence with respect to the training dataset size. To the best of our knowledge, this is the first time exponential convergence is observed in deep learning. Moreover, we derive theoretically the dependence of approximation error on different factors, which is consistent with our computational results.

Despite the aforementioned achievements, more work should be done both theoretically and computationally. For example, there have not been any theoretical results of network size for operator approximation, similar to the bounds of width and depth for function approximation . We also do not understand theoretically yet why DeepONets can induce small generalization errors. On the other hand, in this paper we use fully-connected neural networks for the two sub-networks, but as we discussed in Section 2.1, we can also employ other network architectures, such as convolutional neural networks or “attention” mechanism. These modifications may improve further the accuracy of DeepONets.

## 6 Acknowledgments

We thank Yanhui Su of Fuzhou University for the help on Theorem 2. We thank Zhongqiang Zhang of Worcester Polytechnic Institute for the proof in Appendix C

. This work is supported by the DOE PhILMs project (No. de-sc0019453), the AFOSR grant FA9550-17-1-0013, and the DARPA-AIRA grant HR00111990025. The work of Pengzhan Jin is partially supported by the Major Project on New Generation of Artificial Intelligence from the Ministry of Science and Technology of China (Grant No. 2018AAA010100).

## References

•  L. Bottou and O. Bousquet (2008) The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pp. 161–168. Cited by: §1.
•  S. L. Brunton, J. L. Proctor, and J. N. Kutz (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113 (15), pp. 3932–3937. Cited by: §1.
•  T. Chen and H. Chen (1993) Approximations of continuous functionals by neural networks with application to dynamic systems. IEEE Transactions on Neural Networks 4 (6), pp. 910–918. Cited by: §1.
•  T. Chen and H. Chen (1995) Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Transactions on Neural Networks 6 (4), pp. 904–910. Cited by: §1.
•  T. Chen and H. Chen (1995) Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks 6 (4), pp. 911–917. Cited by: Appendix A, Appendix A, DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators, §1, §1.
•  T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583. Cited by: §1.
•  G. Cybenko (1989)

Approximation by superpositions of a sigmoidal function

.
Mathematics of Control, Signals and Systems 2 (4), pp. 303–314. Cited by: §1, §1.
•  V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. d. Vries, A. Courville, and Y. Bengio (2018) Feature-wise transformations. Distill. External Links: Document Cited by: §2.1.
•  N. B. Erichson, M. Muehlebach, and M. W. Mahoney (2019) Physics-informed autoencoders for Lyapunov-stable fluid flow prediction. arXiv preprint arXiv:1905.10866. Cited by: §1.
•  B. Hanin (2017)

Universal function approximation by deep neural nets with bounded width and ReLU activations

.
arXiv preprint arXiv:1708.02691. Cited by: §5.
•  K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366. Cited by: §1, §1.
•  J. Jia and A. R. Benson (2019) Neural jump stochastic differential equations. arXiv preprint arXiv:1905.10403. Cited by: §1.
•  P. Jin, L. Lu, Y. Tang, and G. E. Karniadakis (2019) Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness. arXiv preprint arXiv:1905.11427. Cited by: §1.
•  A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
•  L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis (2019) DeepXDE: a deep learning library for solving differential equations. arXiv preprint arXiv:1907.04502. Cited by: §1, §2.1.
•  L. Lu, Y. Shin, Y. Su, and G. E. Karniadakis (2019) Dying ReLU and initialization: theory and numerical examples. arXiv preprint arXiv:1903.06733. Cited by: §1.
•  L. Lu, Y. Su, and G. E. Karniadakis (2018) Collapse of deep and narrow neural nets. arXiv preprint arXiv:1808.04947. Cited by: §1.
•  H. N. Mhaskar and N. Hahm (1997) Neural networks for functional approximation and system identification. Neural Computation 9 (1), pp. 143–159. Cited by: §1.
•  M. Mitzenmacher and E. Upfal (2017) Probability and computing: randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press. Cited by: §4.2.2.
•  G. Neofotistos, M. Mattheakis, G. D. Barmparis, J. Hizanidis, G. P. Tsironis, and E. Kaxiras (2018) Machine learning with observers predicts complex spatiotemporal behavior. arXiv preprint arXiv:1807.10758. Cited by: §1.
•  G. Pang, L. Lu, and G. E. Karniadakis (2019) fPINNs: fractional physics-informed neural networks. SIAM Journal on Scientific Computing 41 (4), pp. A2603–A2626. Cited by: §1.
•  J. C. Patra, R. N. Pal, B. Chatterji, and G. Panda (1999) Identification of nonlinear dynamic systems using functional link artificial neural networks. IEEE transactions on systems, man, and cybernetics, part b (cybernetics) 29 (2), pp. 254–262. Cited by: §1.
•  T. Qin, K. Wu, and D. Xiu (2019) Data driven governing equations approximation using deep neural networks. Journal of Computational Physics. Cited by: §1.
•  M. Raissi, P. Perdikaris, and G. E. Karniadakis (2018) Multistep neural networks for data-driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236. Cited by: §1.
•  F. Rossi and B. Conan-Guez (2005)

Functional multi-layer perceptron: a non-linear tool for functional data analysis

.
Neural Networks 18 (1), pp. 45–60. Cited by: §1.
•  S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz (2017) Data-driven discovery of partial differential equations. Science Advances 3 (4), pp. e1602614. Cited by: §1.
•  S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866. Cited by: §1.
•  N. Trask, R. G. Patel, B. J. Gross, and P. J. Atzberger (2019) GMLS-Nets: a framework for learning from unstructured data. arXiv preprint arXiv:1909.05371. Cited by: §1.
•  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2.1.
•  N. Winovich, K. Ramani, and G. Lin (2019) ConvPDE-UQ: convolutional neural networks with quantified uncertainty for heterogeneous elliptic partial differential equations on varied domains. Journal of Computational Physics. Cited by: §1.
•  D. Zhang, L. Lu, L. Guo, and G. E. Karniadakis (2019) Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems. Journal of Computational Physics 397, pp. 108850. Cited by: §1.
•  Z. Zhang and G. E. Karniadakis (2017)

Numerical methods for stochastic partial differential equations with white noise

.
Springer. Cited by: Appendix C.
•  H. Zhao and J. Zhang (2009) Nonlinear dynamic system identification using pipelined functional link artificial recurrent neural network. Neurocomputing 72 (13-15), pp. 3046–3054. Cited by: §1.
•  Y. Zhu, N. Zabaras, P.-S. Koutsourelakis, and P. Perdikaris (2019) Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics 394, pp. 56–81. Cited by: §1.

## Appendix A Neural networks to approximate nonlinear operators

We list in Table 3 the main symbols and notations that are used throughout this paper.

Let denote the Banach space of all continuous functions defined on a compact set with sup-norm , where is a Banach space. We first review the definition and sufficient condition of Tauber-Wiener (TW) functions , and the definition of continuous operator.

###### Definition 1 (Tw).

If a function (continuous or discontinuous) satisfies that all the linear combinations , , , , , are dense in every , then is called a Tauber-Wiener (TW) function.

###### Theorem 3 (Sufficient condition for TW).

Suppose that is a continuous function, and (tempered distributions), then , if and only if is not a polynomial.

It is easy to verify that all the activation functions we used nowadays, such as sigmoid, and ReLU, are TW functions.

###### Definition 2 (Continuity).

Let be an operator between topological spaces and . We call continuous if for every , there exists a constant such that

 ∥G(x)−G(y)∥Y<ϵ

for all satisfying .

We recall the following two main theorems of approximating nonlinear continuous functionals and operators due to Chen & Chen .

###### Theorem 4 (Universal Approximation Theorem for Functional).

Suppose that , is a Banach Space, is a compact set, is a compact set in , is a continuous functional defined on , then for any , there are a positive integer , points , and real constants , , , , , such that

 ∣∣ ∣∣f(u)−n∑i=1ciσ(m∑j=1ξiju(xj)+θi)∣∣ ∣∣<ϵ

holds for all .

###### Theorem 5 (Universal Approximation Theorem for Operator).

Suppose that , is a Banach Space, , are two compact sets in and , respectively, is a compact set in , is a nonlinear continuous operator, which maps into , then for any , there are positive integers , , , constants , , , , , , such that

holds for all and .

Both Theorems 4 and 5 require the compactness of the function space. In fact, the function space like is “too large” for real applications, so it is suitable to consider a smaller space with compactness property. The necessary and sufficient conditions of the compactness are given by the following Arzelà–Ascoli Theorem.

###### Definition 3 (Uniform Boundedness).

Let be a family of real-valued functions on the set . We call uniformly bounded if there exists a constant such that

 |f(x)|≤M

for all and all .

###### Definition 4 (Equicontinuity).

Let be a family of real-valued functions on the set . We call equicontinuous if for every , there exists a such that

 |f(x)−f(y)|<ϵ

for all and all satisfying .

###### Theorem 6 (Arzelà–Ascoli Theorem).

Let be a Banach space, and be a compact set. A subset of is pre-compact (has compact closure) if and only if is uniformly bounded and equicontinuous.

## Appendix B Number of sensors for identifying nonlinear dynamic systems

is compact.

###### Proof.

At first, we prove that is pre-compact. For any , by (4), there exists an such that

 ∥u−Lm(u)∥C<ε4,∀u∈V,∀m>m0.

Since is a compact set subject to equicontinuity, there exists a such that

 |x−y|<δ⇒|u(x)−u(y)|<ε2,∀u∈Wm0,∀x,y∈[a,b].

Now for all and all , if , naturally we have , otherwise, . Suppose that , then there holds

 |u(x)−u(y)|=|u(x)−v(x)+v(x)−v(y)+v(y)−u(y)|≤|Lm(v)(x)−v(x)|+|v(x)−v(y)|+|Lm(v)(y)−v(y)|≤2∥Lm(v)−v∥C+|v(x)−v(y)|<2⋅ε4+ε2=ε,

which shows the quicontinuity of . In addition, it is obvious that is uniformly bounded, so that we know is pre-compact by applying the Arzelà–Ascoli Theorem.

Next we show that is close. Let be a sequence which converges to a . If there exists an such that , then . Otherwise, there is a subsequence of such that and as . Then we have

 ∥vin−w0∥C=∥vin−Lin(vin)+Lin(vin)−w0∥C≤∥vin−Lin(vin)∥C+∥Lin(vin)−w0∥C≤κ(in,V)+∥Lin(vin)−w0∥C,

which implies that . ∎

Next we show the proof of Theorem 2.

###### Proof.

For and , according to the bound (4) and the Lipschitz condition, we can derive that

 ∥(Gu)(d)−(Gum)(d)∥2 ≤ c∫da∥(Gu)(t)−(Gum)(t)∥2dt+c∫da|u(t)−um(t)|dt ≤ c∫da∥(Gu)(t)−(Gum)(t)∥2dt+c(b−a)κ(m,V).

By Gronwall inequality, we have

 ∥(Gu)(d)−(Gum)(d))∥2≤c(b−a)κ(m,V)ec(b−a).

Define which is a compact set in , and there is a bijective mapping between and . Furthermore, we define a vector-valued function on by

 φ(u(x0),u(x1),⋯,u(xm))=(Gum)(d).

For any , make large enough so that . By universal approximation theorem of neural network for high-dimensional functions, there exist , such that

 ∥φ(u(x0),⋯,u(xm))−(W2⋅σ(W1⋅[u(x0)⋯u(xm)]T+b1)+b2)∥2<ε−c(b−a)κ(m,V)ec(b−a).

Hence we have

 ∥(Gu)(d)−(W2⋅σ(W1⋅[u(x0)⋯u(xm)]T+b1)+b2)∥2≤∥(Gu)(d)−(Gum)(d)∥2+∥(Gum)(d)−(W2⋅σ(W1⋅[u(x0)⋯u(xm)]T+b1)+b2)∥2

In summary, by choosing the value of so that it makes less than is sufficient to achieve accuracy . ∎

## Appendix C Gaussian random field with the radial-basis function kernel

Suppose that . Then

 X(t)=√2(π)14∫R+(l)12cos(ωt)exp(−l2ω28)dW(ω)−√2(π)14∫R+(l)12sin(ωt)exp(−l2ω28)dB(ω),

where and are independent standard Brownian motions . Apply the change of variable and write as

 X(t)=√2(π)14∫R+cos(λlt)exp(−λ28)dW(λ)−√2(π)14∫R+sin(λlt)exp(−λ28)dB(λ).

Applying a linear interpolation

on the interval , then

 E[(X(t)−Π1X(t))2]  =  2(π)12∫R+((I−Π1)cos(λlt))2exp(−λ24)dλ  +2(π)12∫R+((I−Π1)sin(λlt))2exp(−λ24)dλ≤  (π)12(ti+1−ti)4∫R+(λl)4exp(−λ24)dλ=  24π(ti+1−ti)4l4,

where we recalled the error estimate of the linear interpolation on

(by Taylor’s expansion)

 |(I−Π1)g(t)|=12(b−a)2|f′′(ξ)|,

where lies in between and . Then by the Borel-Cantelli lemma, we have

 |X(t)−Π1X(t)|≤C(ti+1−ti)2−ϵl2,ϵ>0,

where

is an absolute value of a Gaussian random variable with a finite variance. Therefore, taking a piecewise linear interpolation of

with points will lead to convergence with order .