1 Introduction
With the growth of computing power and the availability of big data, machine learning, especially deep learning, has had success in a wide range of research fields such as image recognition, natural language process, and recommendation systems. Primarily, these successes are due to neural networks being readily trainable universal function approximators
[1]. For this same reason, neural networks can be used to describe complex physical phenomena modeled by differential equations.Many authors have recently proposed deep learning methods to solve differential equations [2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14]
arising in various fields including fluid dynamics and quantitative finance. These methods approximate the solution of a differential equation by a deep neural network, but differ in their learning methodology and particular choice of objective or loss function. An objective function measures how well a neural network approximation satisfies the differential equation and is used as a compass to train the neural network.
In the prior work, the objective function is selected either directly from the differential equation or an equivalent formulation. This is analogous to a finite difference method (FDM) directly discretizing the differential equation and a finite element method (FEM) using the variational formulation. Sirignano and Spiliopoulos [5] and Raissi et al. [6] select the objective function so that the neural network satisfies the PDE and the boundary conditions at points within the domain. This requires the computation of the derivatives of the neural network that appear in the differential equation. Zhu et al. [13] includes in the loss function either a residual of the PDE or an energy functional if it is available. Similarly, Karumuri et al. [7] and Weinan and Yu [12] first recast their elliptic PDEs in an equivalent variational formulation and then train a neural network to minimize the energy functional of the PDE. Sirignano and Spiliopoulos [5] use an efficient Monte Carlo method for computing the secondderivatives of a neural network. Weinan, Han, and Jentzen [4; 3] solve a class of parabolic PDEs by reformulating them as backward stochastic differential equations (BSDE). Their method is specialized to compute the solution at a single point and uses a neural network catered to the time discretization of the BSDE that is trained to satisfy the terminal conditions.
In this work, we propose a numerical method to solve a class of quasilinear elliptic PDEs. We reformulate the PDE to a probabilistic representation in the spirit of the FeynmanKac formula, which we use to train a neural network representing the solution. In particular, the solution is given by an expectation of a martingale process driven by a Brownian motion. Unlike the FeynmanKac formula [15], which is an expectation involving the trajectory of a stochastic process up until its first exit time, our formulation uses an expectation over a nonrandom increment of time, which is a common characteristic of reinforcement learning methods. In fact, our reformulation is the Bellman equation of a Markov reward process [16] and our learning methodology is similar to deep Qlearning [17]. As the Brownian motions explore the domain, a neural network is iteratively trained to satisfy the Bellman equation at the positions of these walkers on every deterministic time step. This distinguishes our approach from previous probabilistic approaches based on the FeynmanKac formula [18; 19; 20; 21; 22]. Moreover, our method is different from previous deep learning methods in that it is not necessary to explicitly compute the derivatives of a neural network with respect to input neurons in solving a differential equation.
After detailing our method in Sections 2 and 3, we will demonstrate its strengths. Our method is especially effective at solving problems in which the solution has singularities in its derivatives, which is shown in Section 4.1). Furthermore, the use of Brownian motion in our method allows us to elegantly handle jump conditions on internal boundaries, as can be seen in Section 4.2). A strength of a neural network representation is that it can represent multiple functions on multiple domains or parametrized functions by simply modifying the input layer. This is exhibited in Sections 4.2 and 4.3). These examples provide convincing evidence that our method is robust, versatile, and efficient.
2 Preliminaries
In this section, we present some background from the theory of stochastic processes and the core ideas underlying our numerical method. The details of the method and its implementation appear in Sections 3 and 4.
2.1 Martingales and quasilinear elliptic PDEs
There is an equivalence between the solutions of quasilinear elliptic PDEs and stochastic processes with the martingale property, which we briefly review here. We are interested in the quasilinear elliptic PDE,
(1) 
in which and may depend locally on the unknown function . For ease of notation we will write , and suppress the possible dependence on higher derivatives. Define the stochastic process as a solution to the stochastic differential equation
(2) 
in which is an ordinary Brownian motion on , is the function from the PDE, Eq. (1), and is arbitrary.
Define the stochastic process by
(3) 
for arbitrary . Note that actually depends on the entire history but we write it for convenience of notation.
With these definitions, the following two statements are equivalent:

is a solution to the PDE in Eq. (1): for all ;

the stochastic process satisfies the martingale property
(4)
This equivalence is a standard application of Itô’s lemma (see e.g. [23]), Moreover, Itô’s lemma shows that the infinitesimal drift of the stochastic process is precisely in the sense that
(5) 
which holds as . This drift is identically zero on precisely when satisfies the PDE. Having zero drift everywhere in the domain is equivalent to satisfying the martingale property. Both Eqs. (4) and (5) exchange solving a PDE into finding a martingale .
In order to impose a boundary condition, for example of the form
(6) 
we required that for . This follows from Eq. (3), since for all . In contrast, homogeneous Neumann conditions of the form
(7) 
can be imposed by reflecting in the normal direction on the boundary . Mixtures between these two boundary conditions, with Dirichlet on part of and Neumann on another part of , requires a straightforward modification.
2.2 Main idea of the numerical method
In the previous section, we transformed the search for a function that satisfies the PDE Eq. (1) into a search for the function that has the martingale property, Eq. (4). As a motivating case, consider the linear elliptic boundary value problem,
(8) 
Using the equivalence from the previous section, is the solution of Eq. (8) if and only if the stochastic process is a martingale satisfying , where is defined as
(9) 
and is the stochastic process which satisfies the stochastic differential equation,
(10) 
In particular, by using the martingale property for , the solution satisfies the following relation for any and any ,
(11) 
Note that Eq. (11) is also true when is a stopping time by application of the Doob’s optional stopping time theorem (see e.g. [23]). By using the stopping time , one obtains the well known FeynmanKac formula for this PDE, the basis of many Monte Carlo algorithms.
Equation (11) also appears in the context of Markov reward processes where it is known as the Bellman equation and the unknown function is known as the value function. Reinforcement learning [16] is a subfield of machine learning that has developed very powerful tools for numerically computing the value function in situations of this form. Algorithms such as temporal difference learning [16] or the deep Qlearning method [17] have been developed that apply very generally to this situation. Reinforcement learning techniques are usually applied to the even more complicated Markov decision process, where not only is the value function unknown, but the optimal action is also unknown. Our situation is simpler because there is no unknown action policy to be learned.
In our numerical method, we will consider a class of parametrized functions , and search for parameters such that Eq. (11) holds. We choose to use neural networks as our class of parametrized functions , as will be described in Section 3. Our method is not limited to this choice  any collection of arbitrary function approximators may be used.
Our search method to find the parameters is inspired by techniques from reinforcement learning. We fix a timestep , and construct an objective function which measures how well Eq. (11) is satisfied as a function of parameters ,
(12) 
In Eq. (12), the inner expectation is over the trajectory and the outer expectation is over the initial locations . Essentially, our algorithm performs gradient descent on this function to obtain a sequence of parameters that successively reduces making closer to solving Eq. (8). Section 3.2 provides the exact details on how this gradient descent is implemented in our method.
Since is small, one can think of the objective function Eq. (12) as approximating the value of the PDE as described in Eq. (5). We emphasize that many other more involved methods from reinforcement learning can be applied to solve the Bellman equation Eq. (11); the simple objective function we use here has this added interpretation of directly approximating the PDE.
2.3 The CameronMartinGirsanov theorem
The stochastic process explores the domain and contributes to learning the solution of the PDE at its location. The stochastic processes defined above in Eq. (2) or Eq. (10) might have a nontrivial drift that causes the stochastic process to be driven away from an important area of the domain. It may also have a small volatility that causes slow movement and therefore slow learning or a data imbalance in sampling and therefore an inaccurate solution. Both of these issues are discussed with respect to the example in Section 4.2.
To avoid these issues, we use an alternative martingale process with an ordinary Brownian motion, denoted . Being able to transformation from a possibly complicated process to a Brownian motion is a result of the CameronMartinGirsanov theorem (see e.g. [23]). Applying this theorem, we find that the function satisfies the PDE Eq. (1) if and only if the stochastic process
(13) 
is a martingale. In Eq. (13), is a Brownian motion on the time interval starting from an arbitrary initial condition . In order to remove the drift from the stochastic process , the martingale process Eq. (13) now contains an additional exponential factor, which can be thought of as discounting rewards obtained when happens to move in the upwind direction of the drift in . Using the CameronMartinGirsanov theorem allows the Brownian walkers to explore the domain unimpeded regardless of the PDE being solved.
From the martingale property for in Eq. (13), the solution of the Eq. (1) satisfies the following relation for any ,
(14) 
We search for the parametrized function to minimize the residual in Eq. (14) by minimizing the following objective function
(15) 
where is given by as Eq. (13) with the parameterized function .
3 The numerical method
This section provides the numerical algorithm to solve Eq. (1
) based on the theory described in the previous section. In particular, we choose to use a neural network as the parameterized approximation to the solution, which is trained so that Markov reward process has a value function that satisfies the Bellman equation. Note that any parametrized family of functions which are dense in the target function space could be used. This flexibility has many advantages, for example, prior knowledge of the solution (e.g. symmetry, homogeneity, etc.) can be incorporated into the choice of the parametrization. We opt to use neural networks for their ubiquity and the ease with which their parameters can be determined using gradient descent and backpropagation.
3.1 Neural networks
Neural networks are widely used in many applications including computer vision
[24], natural language processing [25], forecasting, and speech recognition [26]. The structures of neural networks are designed to extract the valuable features from data. For instance, a convolutional neural network (CNN) is suitable for image data while a recurrent neural network (RNN) is wellsuited for capturing sequential information from data.
In this work, we consider the multilayer perceptron (MLP)
[27] and a simple variant of the residual neural network (ResNet) [28] as the parameterized approximation of the solution. Other architectures are compatible with our method but not explored here. In particular, the accuracy of the neural network could be improved by including problemspecific information, but we do not focus on network architecture engineering in this study.3.1.1 Multilayer perceptron
Multilayer perceptron (MLP) approximator with dimension is recursively defined as
(16) 
where is a weight matrix,
is a bias vector,
is an activation function, and
denotes all the weight matrices and bias vectors. The first dimension is generally the same as the dimension of domain , . However, different choices of are made in Sections 4.2 and 4.3. The last dimension is typically equal to since we consider scalarvalued PDEs.3.1.2 A variant of the residual neural network
The residual neural network (ResNet) is widely used as a base network for feature extraction in image related applications, such as classification, object detection, and segmentation. The key of ResNet is, instead of fitting the desired mapping
directly, first fit the residual and then recover the original as . It is easy to implement by adding the identity shortcut connections between stacked layers and each connected block is called a residual building block. It has been empirically observed that ResNet preforms well in many applications and many researchers are interested in understanding why [29; 30; 31]. The variant of ResNet we use in this work is simply adding the identity shortcuts in a multilayer perceptron network, which is similar to the networks used elsewhere [12; 7; 14]. The following are stacks of identical residual blocks with dimension ;(17) 
in which is a residual block defined as
(18) 
in which , , are weight matrices, , , are bias vectors, and all with superscript are activation functions. Here and are the dimension of the input and output layer and we simply set
to avoid the linear transform between successive residual blocks.
3.2 The algorithm
The core of the method is computing and minimizing, with gradient descent, the residual of the Bellman equation for the stochastic process in Eq. (13) (or Eq. (3) if the CameronMartinGirsanov theorem is not being used). The psuedocode for our method is shown in Algorithm 1. Evaluating Eq. (13) with , we see that a function is a solution if and only if satisfies
(19) 
in which and are discounts and rewards defined as
(20)  
(21) 
The goal is to find parameters such that the neural network satisfies Eq. (19). To achieve this, the training methodology is to create a sequence of parameters , where each set of parameters is determined from
by a stochastic gradient descent update of a loss function
. The loss function combines information from the interior of obtained from a random sample of the expression in Eq. (15) and a contribution from the boundary .We simulate discrete Brownian motions , , which run in the interior of independently and simultaneously. The interior loss function is defined as
(22) 
The target values are
(23) 
in which and are the discretizations of and respectively, given as
(24)  
(25) 
It is important to note that the discretizations and above result in a discretetime stochastic process that has the martingale property like the continuoustime process . Also note that the target values in Eq. (23) depend on the previous parameters and not the argument to the loss function.
It is possible that a Brownian walker will exit the domain between time 0 and . We detect this by , which is an admittedly biased sampling of the exit since an exit may have occurred when . If a Brownian walker exits the domain, we approximate the exit position on the boundary by the intersection of line segment between and and the boundary
. We also approximate the exit time by linearly interpolating in time according to the approximated exit position in the line segment between
and . The approximation of the exit position and time is used in calculating the target value in Eq. (23) instead of the value of from the neural network. This target value is rich in the information of the exact solution since the exact value of function is given from the boundary condition. To enhance the information available on the boundary, we included an additional term in our loss function,(26) 
We train the neural network by minimizing the loss function,
(27) 
by sequentially updating the parameters with stochastic gradient descent,
(28) 
The learning rate may be adjusted on each step. In practice, this gradient descent step can be optimized to take previous steps into account, for example by using the ADAM optimization method [32].
4 Examples
In this section we use the DFLM to compute numerical solutions to some example quasilinear elliptic differential equation in the form of Eq. (1). We measure the accuracy of numerical solutions by the relative error, if the analytic solution is known (examples in Section 4.1 and 4.2), and we compute the relative difference between the numerical solution from our method and that from the finite element method (FEM) if the analytic solution is unknown (the example in Section 4.3).
We implement our algorithm using TensorflowGPU
[33], which are open source libraries for deep learning. In particular, Tensorflow is capable of automatic differentiation (AD) of functions specified by a computer program and, for instance, it can calculate the gradients of a parameterized function
, a neural network in this work, with respect to both and .We emphasize that our algorithm does not directly compute the derivatives of the neural network, or , and we only use the automatic differentiation in computing the gradients of the loss function, . In Section 4.1, we present some empirical results demonstrating how this is a benefit when solving Laplace’s equation. The example in Section 4.2 is an elliptic interface problem which we use to demonstrate how our algorithm deals with a discontinuous solution and discuss the benefits of our use of the CameronMartinGirsanov theorem. In Section 4.3, we consider a practical example of a differential equation that is a model of chemotaxis.
4.1 Laplace’s equation
As a representative example of a problem in elliptic PDEs, we apply our method to solve Laplace’s equation with a Dirichlet boundary condition,
(29) 
The domain is a circular sector and is the given boundary values. We choose this boundary data by evaluating a harmonic function in on . This provides us with an exact solution to the problem. We use two different harmonic functions and . Note that is smooth on , but the derivatives of , and , have a singularity at the origin. The elliptic problem with as the solution is known as a corner singularity problem [34], which is a benchmark for testing new numerical algorithms.
The Bellman equation corresponding to Eq. (29) is, for any ,
(30) 
Our method is to find that minimizes the residual of Eq. (30).
For these test problems, we compare our method with another method [5; 6; 8; 11] that uses a deep neural network that is explicitly differentiated with respect to its input neurons. We refer to this method as the derivativebased loss method (DBLM), which essentially finds that minimizes the residual of
(31) 
Note that , are derivatives of the neural network which can be computed using automatic differentiation, i.e backpropagation. The parameters are founded by stochastic gradient method using the loss function at each iteration,
(32) 
where , are minibatch samples in and , are minibatch samples on . The DFLM uses the positions of independent Brownian motions at each discrete time as minibatch samples in . In the DBLM, uniform minibatch sampling on at each iteration is used since it yields faster training speed than using Brownian walker sampling.
In order to compare the two methods, we use the multilayer perceptron and a variant of ResNet (defined in Section 3.1) with 4 different activation functions, LReLU, ELU, , and SWISH [35]
. All hyperparameters are identical between the two methods. The MLP has the dimension
and the ResNet is constructed as the stacks of 3 identical residual blocks with the dimension . The hyperparameters, (the number of Brownian walkers in the DFLM and minibatch samples in for the DBLM), (the number of minibatch samples on for both methods), (the number of samples for the estimation of the targets in the DFLM) and 5.0e4 (the discrete timestep of Brownian motion in the DFLM) are chosen, and the ADAM optimization method with an exponentially decaying learning rate is used to train the neural networks. The relative errors of the numerical solutions after iterations is presented in Table 1. The table also shows the error when the network is trained directly using samples of the exact solution. More detailed results are presented in Fig. 1.The ResNet is more effective than MLP for both methods, and when using the DFLM, the ResNet with SWISH activation approximates the solutions and with the highest accuracy. The convergence of the DFLM is relatively insensitive to the choice of activation function as can be seen in Fig. 1 (a) and (b)). Also, in this example, the training of our DFLM converges in less (wall)time than the DBLM.
Network  Activation  DFLM  DBLM  Direct  DFLM  DBLM  Direct 

MLP  LReLU  3.41e3  4.35e2  7.81e4  7.61e3  1.36e2  1.06e3 
ELU  3.26e3  6.78e2  8.33e4  5.70e3  2.61e2  1.25e3  
tanh  2.26e3  6.25e3  6.94e4  1.12e2  5.15e2  1.09e3  
SWISH  1.36e3  5.62e3  1.77e4  8.13e3  1.12e2  4.78e4  
ResNet  LReLU  2.77e3  3.75e2  3.34e4  5.25e3  1.21e2  8.22e4 
ELU  2.10e3  1.07e1  5.19e4  4.40e3  4.16e2  8.13e4  
tanh  1.68e3  7.96e4  2.69e4  4.30e3  5.82e3  9.53e4  
SWISH  5.74e4  1.11e3  1.69e4  2.95e3  6.92e3  2.23e4 
4.2 An interface problem
Interface problems arise when modeling diverse physical and biological phenomena such as electrostatics in composite materials, multiphase flow in fluid dynamics, heat conduction, and the electrical activity of biological cells. In this section, we solve the elliptic equation governing the electric potential on a domain with a single interface. Our method requires a small modification to account of the interface conditions. Although our example has only a single simple interface, our method is easily applied to problems with multiple, possible intricate, interfaces.
Let be the bounded domain in , which consists of two subdomains and with the interface . The conductivity on each has constant value for and the potential difference is given across the interface . The electric potential function satisfies
(33)  
(34)  
(35)  
(36) 
The function in Eq. (33) is the given source distribution over and the function in Eq. (34) is the potential on the boundary . Equation (35) is the jump condition resulting from a voltage difference across . Equation. (36) is the flux continuity condition across the interface . Here is defined as where is the limit of as from and is from .
To appreciate the idea of the modification we make to our method to satisfy the conditions Eq. (35) and Eq. (36), consider the Bellman equation corresponding to Eq. (33) without applying the CameronMartinGirsanov theorem, as discussed in Eq. (10) and (11),
(37) 
The Brownian walkers take steps from and according to
(38) 
for a time step . For in () near the interface , if is in (), the value is approximately equal to () by the jump condition Eq. (35). Thus, we modify the Bellman equation Eq. (37) to
(39) 
Next, in order to address the continuity condition Eq. (36), let’s assume in near the interface . Since is constant on as ,
(40) 
in which the distribution of is symmetric with respect to and only depends on the conductivity of in which lies. However, this is not the case for the continuous stochastic process. During the continuous time , once a Brownian walker crosses the interface to , it moves according to the conductivity . For instance, if , the walker moves faster in than in and, approximately, the distribution of has a longer tail in than in . Moreover, the distribution of depends on the distance between and the interface but the distribution from the Eq. (40) only depends on whether is or in . To capture this property of the continuous stochastic process, we regularize the discontinuity of the conductivity function on the interface . We define a regularized conductivity function that converges to as
using the sigmoid function,
(41) 
We therefore move the walkers according to
(42) 
There is a strong drift near the interface toward the region with higher conductivity and the drift is increasing closer to the interface. The solution of Eq. (33), Eq. (34), and Eq. (35) with the regularized conductivity automatically satisfies the continuity condition and is an approximation of the exact solution .
The parameter
should be selected so that, with high probability, the walkers sample the region where the drift
is large. If is too small relative to the typical distance travelled by the walkers, , the information about the regularized function near the interface goes unnoticed.This example illustrates an advantage of using the CameronMartinGirsanov theorem particularly well. Without writing the problem in terms of ordinary Brownian motion, the walkers move as Eq. (42) and their movement depends on the conductivity of their current positions. This dependence is problematic. If the conductivity is too small, the walkers move very slowly on that subdomain and no longer efficiently explore it. Moreover, if one conductivity is very large relative to the other conductivity, the walkers are effectively ‘absorbed’ into the subdomain with small conductivity causing an imbalance of sampling between the two subdomains. Furthermore, the walkers near the interface tend to move toward the direction of drift pushing them away from the interface, which slows the neural network from learning the information about the interface. The CameronMartinGirsanov theorem simplifies the walkers’ motion to that of ordinary Brownian motion which is able to explore the domain without interference from the PDE. All of the information from PDE is encoded in the discounts and rewards in the Bellman equation. This is akin to the method of importance sampling in Monte Carlo methods.
In summary, we apply our method to solve the interface problem with the following modified Bellman equation, which includes a term for the jump ,
(43) 
Considering that the solution is not continuous or differentiable across the interface, we construct the parameterized approximation of the solution separately on each subdomain and
. This is implemented simply in a neural network by adding a categorical variable of subdomains in the input layer. We encode the categorical variable by a onehot vector. The input layer is 4 dimensional vector in which the first 2 coordinates represent
, and last 2 coordinates are the onehot vector. In particular, in is represented by and in by . This is efficient in that a single neural network can represent multiple functions on multiple subdomains. Since a neural network could approximate discontinuous or nondifferentiable functions, since it is a universal approximator, this variation of input layer is not crucial in our methodology. However, as more information about the solution is reflected in the neural network, it is more accurate and trains faster.We test our method with , , , , , for which the analytic solution is
(44) 
Note that we do not use the knowledge about the exact function being a radial function and the Cartesian coordinates are used in the input layer. Based on our experiment, the variant of ResNet outperform the MLP and the structure of ResNet with LReLU activation in the first layer activation ( defined in Eq. (17)) and SWISH () activation in the residual blocks is suitable for this example. In particular, we suspect that the choice of LReLU activation in the first layer is effective for the neural network to represent the multiple functions based on the categorical input variable. The simulation results are presented in Fig. 2. The ResNet described above with 3 identical residual blocks with dimension is trained with (the number of Brownian walkers), (the number of minibatch samples on the boundary ), (the number of samples for the estimation of the target) by the ADAM optimization method. We measured the accuracy of the neural network approximation using a radial average with angular samples, , . We trained the neural network with different time steps of the Brownian motion. The neural network approximates the radial dependency of the solution, the jump condition, and the flux continuity condition on the interface, in total, with relative error 2.6975e4. The details are presented in Fig. 2.
4.3 A steadystate population model with taxis
In this section, we consider a quasilinear elliptic equation used in the modeling of taxis, which is the movement of living systems in response to external stimulus. For example, phototaxis refers to the movement of motile organism toward or away from the source of light, and chemotaxis is the migration of motile cell or organism in a direction affected by the gradient of diffusible substance. The phenomenological equation that describes the steadystate population density with taxis stimulus is
(45) 
in which is the diffusion coefficient, is the taxis flux, and is the kinetics of the population. In particular, we choose the taxis flux as , the diffusive flux of stimulus concentration with a coefficient proportional to the population. This choice of taxis flux is motivated from the model of chemotaxis, known as the KellerSegel model [36; 37]. For given and , the Bellman equation corresponding to Eq. (45) is
(46) 
We solved the equation in the domain with the taxis stimulus function, , in which the stimulus has the peak at . We choose quadratic population kinetics, , in which is the logistic growth rate and is the constant growth rate. We set the population to be zero on the boundary of the domain (i.e., on ). The simulation results with different logistic growth rates are presented in Fig. 3.
The results show that the smaller the logistic growth rate is, the more the population aggregates near the peak(s) of taxis stimulus functions. The numerical results are compared to the results from the finite element method using the MATLAB PDE Toolbox [38] by the relative difference measurement.
Since taxis models typically display diverse phenomenon as their parameters are varied, we present a modification of our method that finds a parameterdependent family of numerical solutions. In particular for this example, we compute the solution , which denotes the solution of Eq. (45) for the logistic growth rate from the kinetics function , within a bounded range . We regard the parameter as a new input variable and construct a neural network estimation , in which the input layer of the neural network is . We train this neural network to minimize the residual of the Bellman equation Eq. (46). In each iteration of our algorithm, each Brownian walker is assigned a value of in and contributes a term to the loss function corresponding to the residual of Eq. (46). There are many options for assigning the value of to each Brownian walker at each iteration. We choose to treat the value of each walker’s as a one dimensional Brownian motion in the domain to be consistent with the spirit of our algorithm. Each is randomly initialized in the domain and moves as Brownian motion at each iteration step, that is, . When a walker’s value of crosses the boundary ( or ), it is reinitialized to a uniformly random location in the parameter domain. The value of should be a small, but not very small, fraction of the size of the parameter domain. We use .
Figure 4 shows the dependence of the solution of Eq. (45) on the logistic growth rate for using the DFLM and the FEM. We see that the two methods give similar values, but are unable to attribute the difference primarily to errors in either method. The computation time was significantly shorter for the DFLM compared to the FEM.
The neural network provides a quicklyevaluated interpolant of the solution’s parameter dependence, which is useful for a variety of applications. In particular, it could be used to estimate the parameter from observed data by solving a nonlinear optimization problem (e.g. minimizing , in which , are the observed data) without having to solve the PDE each time the loss is evaluated.
5 Conclusions
We constructed a numerical method based on sampling Brownian motion that trains neural networks to solve quasilinear elliptic PDEs without explicitly having to compute the derivatives of the neural network with respect to the input variables. There being numerous applications of quasilinear elliptic PDEs, we expect our robust, versatile, and efficient numerical method to be immensely useful. Like other neural network methods for PDEs, this method is gridfree and naturally parallelizable. In addition, by using Brownian motions, the method can handle prescribed jumps along an interface (as in Section 4.2) and avoids some issues related choosing sample points in the domain (as in the corner singularity example of Section 4.1).
Our method was illustrated using two dimensional example problems, but it would be particularly effective for high dimensional problems. Explicitly computing the network derivatives and sampling in high dimensions is computationally expensive, while Brownian motions naturally explore space regardless of the dimension. Additionally, problems posed on infinite domains can be handled naturally within our method.
There are many future directions to explore with our method. One direction is to use other reinforcement learning methods to solve the Bellman equation, Eq. (11). It is possible that a different method could be much more efficient than the simple based loss function we use here. Other representations of the function beyond neural networks are compatible with our method primarily because it is not based on explicitly computing derivatives. By using Brownian motions, we have the option to handle boundary conditions by imposing conditions on the stochastic walkers, for example by reflecting them off the boundary for homogeneous Neumann boundary conditions, or by adapting the Bellman equation. Our method can also be improved and made more general by drawing on the rich theory of stochastic processes as we did with our use of the CameronMartinGirsanov theorem. A generalization to systems of PDEs would be straightforward by having a neural network for each unknown or a single network with multiple outputs, one for each unknown.
Considerable work remains to adapt traditional ideas from the numerical solution of PDEs to our method. Extrapolation (in ), operator splitting, and domain decomposition would be especially beneficial to include. A hybrid method, in which say finite elements or a boundary integral method were to be combined with our method, could be highly accurate and efficient. With regard to the use of neural networks, more complete studies of the training procedure and the network architecture best suited to solving partial differential equations are required.
Our method is readily applicable to quasilinear parabolic PDEs by adding an input neuron for the time variable and adapting the Bellman equation to account for ‘exits’ of the Brownian motion on the initial condition. In this case, the neural network would encode temporal information as it did for the parameter in the chemotaxis example. We hope to extend our approach to a wider class of partial differential equations by finding additional connections with stochastic processes.
Including a PDE problem’s parameterdependence in the neural network representation of the solution allows for efficient sensitivity analysis and regressionbased data fitting. Our method is based on a Markov reward process, but could readily be extended to a Markov decision process which opens the possibility of solving numerical control and design problems.
Acknowledgments
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC): RGPIN201906946 for A.R.S. and PDF5022872017 for M.N., as well as a University of Toronto Connaught New Researcher Award. We appreciate the input from the students participating in the 2019 Fields Undergraduate Summer Research Program: Julia Costacurta, Cameron Martin, and Hongyuan Zhang.
References
 [1] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (5) (1989) 359–366.
 [2] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Trans. Neural Netw. 9 (5) (1998) 987–1000.
 [3] E. Weinan, J. Han, A. Jentzen, Deep learningbased numerical methods for highdimensional parabolic partial differential equations and backward stochastic differential equations, Commun. Math. Stat. 5 (2017) 349–380.
 [4] J. Han, A. Jentzen, E. Weinan, Solving highdimensional partial differential equations using deep learning, Proc. Natl. Acad. Sci. USA 34 (2018) 8505–8510.
 [5] J. Sirignano, K. Spiliopoulos, DGM: a deep learning algorithm for solving partial differential equations, J. Comput. Phys. 375 (2018) 1339–1364.
 [6] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physicsinformed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, J. Comput. Phys. 378 (2019) 686–707.
 [7] S. Karumuri, R. Tripathy, I. Bilionis, J. Panchal, Simulatorfree solution of highdimensional stochastic elliptic partial differential equations using deep neural networks, J. Comput. Phys. 404 (2020) 109120.
 [8] M. Raissi, Forwardbackward stochastic neural networks: Deep learning of highdimensional partial differential equations, arXiv:1804.07010.
 [9] M. Raissi, G. E. Karniadakis, Hidden physics models: machine learning of nonlinear partial differential equations, J. Comput. Phys. 357 (2018) 125–141.
 [10] M. Raissi, A. Yazdani, G. E. Karniadakis, Hidden fluid mechanics: A NavierStokes informed deep learning framework for assimilating flow visualization data, arXiv:1808.04327.
 [11] M. Raissi, Deep hidden physics models: deep learning of nonlinear partial differential equations, J. Mach. Learn. Res. 19 (1) (2018) 932–955.
 [12] E. Weinan, B. Yu, The deep Ritz method: a deep learningbased numerical algorithm for solving variational problems, Commun. Math. Stat. 6 (1) (2018) 1–12.
 [13] Y. Zhu, N. Zabaras, P.S. Koutsourelakis, P. Perdikaris, Physicsconstrained deep learning for highdimensional surrogate modeling and uncertainty quantification without labeled data, J. Comput. Phys. 394 (2019) 56–81.
 [14] M. A. Nabian, H. Meidani, A deep learning solution approach for highdimensional random differential equations, Probabilistic Eng. Mech. 57 (2019) 14–25.
 [15] H. Pham, Continuoustime Stochastic Control and Optimization with Financial Applications, Vol. 61, Springer Science & Business Media, 2009.
 [16] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, The MIT Press, 2018.
 [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv: 1312.5602.
 [18] T. E. Booth, Exact Monte Carlo solution of elliptic partial differential equations, J. Comput. Phys. 39 (2) (1981) 396–404.
 [19] T. E. Booth, Regional Monte Carlo solution of elliptic partial differential equations, J. Comput. Phys. 47 (2) (1982) 281–290.
 [20] J. Delaurentis, L. Romero, A Monte Carlo method for Poisson’s equation, J. Comput. Phys. 90 (1) (1990) 123–140.
 [21] C.O. Hwang, M. Mascagni, J. A. Given, A FeynmanKac pathintegral implementation for poisson’s equation using an hconditioned Green’s function, Math. Comput. Simulat. 62 (36) (2003) 347–355.
 [22] S. Pauli, R. N. Gantner, P. Arbenz, A. Adelmann, Multilevel Monte Carlo for the FeynmanKac formula for the Laplace equation, Bit Numer. Math. 55 (2015) 1125–1143.
 [23] I. Karatzas, S. Shreve, Brownian Motion and Stochastic Calculus, 2nd Edition, Springer, 1998.

[24]
C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
 [25] Y. Goldberg, Neural network methods for natural language processing, Synthesis Lectures on Human Language Technologies 10 (1) (2017) 1–309.
 [26] A. Graves, A.r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: 2013 IEEE international conference on acoustics, speech and signal processing, IEEE, 2013, pp. 6645–6649.
 [27] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, 2nd Edition, The MIT Press, 2016.
 [28] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [29] B. Chang, L. Meng, E. Haber, F. Tung, D. Begert, Multilevel residual networks from dynamical systems view, in: International Conference on Learning Representations, 2018.
 [30] K. Greff, R. K. Srivastava, J. Schmidhuber, Highway and residual networks learn unrolled iterative estimation, in: International Conference on Learning Representations, 2017.
 [31] E. Haber, L. Ruthotto, E. Holtham, S.H. Jun, Learning across scales  multiscale methods for convolution neural networks, arXiv:1703.02009.
 [32] D. P. Kingma, J. L. Ba, ADAM: a method for stochastic optimization, International Conference on Learning Representations.
 [33] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, et al., Tensorflow: largescale machine learning on heterogeneous distributed systems, arXiv:1603.04467.
 <
Comments
There are no comments yet.