1 Introduction
Partial differential equations (PDEs) are used to model a variety of phenomena in the natural sciences. Common to most of the PDEs encountered in practical applications is that they cannot be solved analytically but require various approximation techniques. Traditionally, mesh based methods such as finite elements (FEM), finite differences (FDM), or finite volumes (FVM) are the dominant techniques for obtaining approximate solutions. These techniques require that the computational domain of interest is discretized into a set of mesh points and the solution is approximated at the points of the mesh. The advantage of these methods is that they are very efficient for low-dimensional problems on regular geometries. The drawback is that for complicated geometries, meshing can be as difficult as the numerical solution of the PDE itself. Moreover, the solution is only computed at the mesh points and evaluation of the solution at any other point requires interpolation or some other reconstruction method.
In contrast, other methods do not require a mesh but a set of collocation points where the solution is approximated. The collocation points can be generated according to some distribution inside the domain of interest and examples include radial basis functions (RBF) and Monte Carlo methods (MCM). The advantage is that it is relatively easy to generate collocation points inside the domain by a hit-and-miss approach. The drawbacks of these methods compared to traditional mesh based methods are, for example, numerical stability for RBF and inefficiency for MCM.
The last decade has seen a revolution in deep learning where deep artificial neural networks (ANNs) are the key component. ANNs have been around since the 1940s [25] and used for various applications. A historical review, particularly in the context of differential equations, can be found in ch. 2 in [36]. The success of deep learning during the last decade is due to a combination of improved theory, starting with unsupervised pretraining and deep belief nets, and improved hardware resources such as general purpose graphics processing units (GPGPUs). See for example [12, 20]. Deep ANNs are now routinely used with impressive results in areas such as image analysis, pattern recognition, object detection, natural language processing, and self-driving cars, to name a few.
While deep ANNs have achieved impressive results in several important application areas, there are still many open questions concerning how and why they actually work so efficiently. From the perspective of function approximation theory it has been known since the 1990s that ANNs are universal approximators that can be used to approximate any continuous function and its derivatives [13, 14, 6, 23]. In the context of PDEs, single hidden layer ANNs have traditionally been used since one hidden layer with sufficiently many neurons suffices to approximate any function, and all gradients that are needed can be computed in closed analytical form [21, 22, 26]. More recently there is a limited but emerging literature on the use of deep ANNs to solve PDEs [31, 30, 7, 4, 29, 8, 33]. In general, ANNs have the benefit that they are smooth, analytical functions which can be evaluated at any point inside, or even outside, the domain without reconstruction.

In this paper we introduce a method of solving PDEs which involves only feedforward deep ANNs and (close to) no user intervention. In previous works only a single hidden layer is used and the method outlined requires some user-defined functions which are nontrivial or even impossible to compute; see Lagaris et al. [21]. Later, Lagaris et al. [22] removed the need for user intervention by using a combination of feedforward and radial basis function ANNs in a nontrivial way. Later work by McFall and Mahan [26] removed the need for the radial basis function ANN by replacing it with length factors that are computed using thin plate splines. The thin plate spline computation is, however, rather cumbersome as it involves the solution of many linear systems, which adds extra complexity to the problem. In contrast, the method proposed in this paper requires only an implementation of a deep feedforward ANN together with a cost function and an arbitrary choice of unconstrained numerical optimization method.
We denote the method presented in this paper as unified as it only requires feedforward neural networks, in contrast to the previous methods presented by Lagaris et al. and McFall and Mahan. We are not claiming that ANNs are necessarily suitable for solving PDEs in low dimensions and simple geometries, where they are outperformed by classical mesh based methods. Instead, the method has its merits for high-dimensional problems and complex domains where most numerical PDE techniques become infeasible.
As training neural networks consumes a lot of time and computational resources, it is desirable to reduce the number of iterations required to reach a certain accuracy. During this work we have found two factors which strongly influence the number of required iterations in our PDE-focused applications. The first is pretraining of the network using the available boundary data, and the second is increasing the number of hidden layers. One of our particular findings is that by fixing the capacity of the network and increasing the number of hidden layers, we can see a dramatic decrease in the number of iterations required to reach a desired level of accuracy. Our findings indicate that the use of deep ANNs instead of just shallow ones adds real value in the context of using ANNs to solve PDEs.
The rest of the paper is organized as follows. In section 2 we recall the network architecture of deep fully connected feedforward ANNs and introduce some necessary notation together with the backpropagation scheme. In section 3 we develop unified artificial neural network approximations for stationary PDEs. Based on our ansatz for the solution we discuss extension of the boundary data, smooth distance function approximation and computation, gradient computations, and the backpropagation algorithm for parameter calibration. The detailed account of the latter computations in the case of advection and diffusion problems is given in Appendix A and B, respectively. In section 4 we provide some concrete examples concerning how to apply the method outlined. Our examples include linear advection and diffusion in 1D and 2D, but high-dimensional problems are also discussed. Section 5 is devoted to the convergence considerations briefly mentioned above, and we here highlight that the use of deep ANNs instead of just shallow ones seems to add real value in the context of using ANNs to solve PDEs. Finally, section 6 is devoted to a summary and conclusions.
2 Network architecture
In this paper we consider deep, fully connected feedforward ANNs. The ANN consists of $L+1$ layers, where layer $0$ is the input layer and layer $L$ is the output layer. The layers $1, \ldots, L-1$ are the hidden layers. The activation functions in the hidden layers can be any activation function such as, for example, sigmoids, rectified linear units, or hyperbolic tangents. Unless otherwise stated we use sigmoids in the hidden layers. The output activation is linear. The ANN defines a mapping $\mathbb{R}^n \to \mathbb{R}^m$. Each neuron in the ANN is supplied with a bias, including the output neurons but excluding the input neurons, and the connections between neurons in subsequent layers are represented by matrices of weights. We let $b_j^l$ denote the bias of neuron $j$ in layer $l$. The weight between neuron $k$ in layer $l-1$ and neuron $j$ in layer $l$ is denoted by $w_{jk}^l$. The activation function in layer $l$ will be denoted by $\sigma_l$ regardless of the type of activation. We assume for simplicity that a single activation function is used for each layer. The output from neuron $j$ in layer $l$ will be denoted by $y_j^l$. See Figure 1 for a schematic representation of a fully connected feedforward ANN.
A quantity that will be used extensively is the so-called weighted input $z_j^l$, which is defined as

(2.1) $z_j^l = \sum_k w_{jk}^l \sigma_{l-1}(z_k^{l-1}) + b_j^l,$

where the sum is taken over all inputs to neuron $j$ in layer $l$, that is, over the neurons in layer $l-1$. The weighted input (2.1) can of course also be written in terms of the output from the previous layer as

(2.2) $z_j^l = \sum_k w_{jk}^l y_k^{l-1} + b_j^l,$

where the output $y_k^{l-1} = \sigma_{l-1}(z_k^{l-1})$ is the activation of the weighted input. As we will be working with deep ANNs we prefer formula (2.1) as it naturally defines a recursion in terms of previous weighted inputs through the ANN. By definition we have

(2.3) $z_j^1 = \sum_k w_{jk}^1 x_k + b_j^1,$

which terminates any recursion.
By dropping the subscripts we can write (2.1) in the convenient vectorial form

(2.4) $z^l = W^l \sigma_{l-1}(z^{l-1}) + b^l,$

where each element in the $z^l$ and $b^l$ vectors is given by $z_j^l$ and $b_j^l$, respectively, and the activation function is applied elementwise. The elements of the matrix $W^l$ are given by $W_{jk}^l = w_{jk}^l$.
With the above definitions the feedforward algorithm for computing the output $y = y^L$, given the input $x$, is given by

(2.5) $z^1 = W^1 x + b^1, \quad z^l = W^l \sigma_{l-1}(z^{l-1}) + b^l, \; l = 2, \ldots, L, \quad y = \sigma_L(z^L) = z^L,$

where the last equality holds since the output activation is linear.
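As an illustration, the feedforward recursion (2.5) can be sketched in a few lines of NumPy. The function and variable names here are our own; sigmoid activations are used in the hidden layers and a linear activation in the output layer, as in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """Compute the network output y by the recursion (2.5).

    weights[l-1], biases[l-1] hold W^l and b^l for layers l = 1..L;
    sigmoid activations in the hidden layers, linear output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b       # weighted input, cf. (2.4)
        a = sigmoid(z)      # hidden-layer activation
    return weights[-1] @ a + biases[-1]  # linear output layer
```

With zero hidden weights the sigmoid outputs are all 1/2, which makes the sketch easy to check by hand.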
2.1 Backpropagation
Given some data $x_i \in \mathbb{R}^n$ and some target outputs $y_i \in \mathbb{R}^m$, $i = 1, \ldots, N$, we wish to choose our weights and biases such that $\hat{y}(x_i; W, b)$ is a good approximation of $y_i$. We use the notation $\hat{y}(x; W, b)$ to indicate that the ANN takes $x$ as input and is parametrized by the weights and biases $W$, $b$. To find the weights and biases we define some cost function $C$ and compute

(2.6) $W^*, b^* = \underset{W, b}{\arg\min}\; C(\hat{y}, y).$
The minimization problem (2.6) can be solved using a variety of optimization methods, both gradient based and gradient free. In this paper we focus on gradient based methods and we derive the gradients needed in order to use any gradient based optimization method. The standard method to compute gradients is backpropagation. The main ingredient is the error $\delta_j^l$ of neuron $j$ in layer $l$, defined by

(2.7) $\delta_j^l = \dfrac{\partial C}{\partial z_j^l}.$
The standard backpropagation algorithm for computing the gradients of the cost function is then given by

(2.8) $\delta_j^L = \dfrac{\partial C}{\partial y_j^L}\, \sigma_L'(z_j^L), \quad \delta_j^l = \sum_k \delta_k^{l+1} w_{kj}^{l+1} \sigma_l'(z_j^l), \quad \dfrac{\partial C}{\partial b_j^l} = \delta_j^l, \quad \dfrac{\partial C}{\partial w_{jk}^l} = \delta_j^l y_k^{l-1}.$
The terms in (2.8) can be written in vectorial form as

(2.9) $\delta^L = \nabla C \odot \sigma_L'(z^L), \quad \delta^l = \left( (W^{l+1})^T \delta^{l+1} \right) \odot \sigma_l'(z^l),$

where we use the notation $\sigma' = d\sigma/dz$ and $\odot$ denotes the Hadamard (componentwise) product. The notation $\nabla$ without a subscript will denote the vector of partial derivatives with respect to the output $y^L$. Note the transpose on the weight matrix $W^{l+1}$. The transpose $(W^{l+1})^T$ is the Jacobian matrix of the coordinate transformation between layer $l$ and $l+1$.
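A minimal sketch of (2.8)-(2.9) for a network with sigmoid hidden layers, a linear output, and the quadratic cost $C = \frac{1}{2}(y - t)^2$; the names are illustrative and the analytic gradients can be checked against finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, target, weights, biases):
    """Gradients of C = 0.5*(y - target)^2 via (2.8)-(2.9).

    Returns (dC/dW, dC/db) lists matching weights/biases."""
    # Forward pass, storing weighted inputs and activations.
    a, activations, zs = x, [x], []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    y = weights[-1] @ a + biases[-1]      # linear output layer
    # Backward pass: for the linear output, sigma'(z^L) = 1,
    # so delta^L reduces to the gradient of C w.r.t. the output.
    delta = y - target
    grad_W = [np.outer(delta, activations[-1])]
    grad_b = [delta]
    for l in range(len(weights) - 2, -1, -1):
        sp = sigmoid(zs[l]) * (1.0 - sigmoid(zs[l]))      # sigma'(z^l)
        delta = (weights[l + 1].T @ delta) * sp           # note the transpose, cf. (2.9)
        grad_W.insert(0, np.outer(delta, activations[l]))
        grad_b.insert(0, delta)
    return grad_W, grad_b
```

A central finite difference on any single weight should agree with the returned analytic gradient to several digits.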
3 Unified ANN approximations to PDEs
In this section we are interested in solving stationary PDEs of the form
(3.1) $L u = f \text{ in } \Omega, \qquad B u = g \text{ on } \Gamma,$

where $L$ is a differential operator, $f$ a forcing function, $B$ a boundary operator, and $g$ the boundary data. The domain of interest is $\Omega \subset \mathbb{R}^n$, $\partial\Omega$ denotes its boundary, and $\Gamma \subseteq \partial\Omega$ is the part of the boundary where boundary conditions should be imposed. In this paper we focus on problems where $L$ is either the advection or diffusion operator, or some mix thereof. The extension to higher order problems is conceptually straightforward based on the derivations outlined below.
We consider the ansatz $u \approx \hat{u}(x; p)$, where $p$ denotes the parameters of the underlying ANN. To determine the parameters we will use the cost function defined as the quadratic residual,

(3.2) $C = \dfrac{1}{2} \left( L\hat{u}(x; p) - f(x) \right)^2.$

For further reference we need the gradient of (3.2) with respect to any network parameter. Let $p$ denote any of $w_{jk}^l$ or $b_j^l$. Then

(3.3) $\dfrac{\partial C}{\partial p} = \left( L\hat{u} - f \right) \dfrac{\partial}{\partial p} L\hat{u}.$
We will use the collocation method [21] and therefore discretize $\Omega$ and $\Gamma$ into sets of collocation points $\{x_i\}_{i=1}^{N_d} \subset \Omega$ and boundary points $\{x_b\}_{b=1}^{N_b} \subset \Gamma$, with $N_d$ and $N_b$ points, respectively. In its discrete form the minimization problem (2.6) then becomes

(3.4) $\min_p \dfrac{1}{2} \sum_{i=1}^{N_d} \left( L\hat{u}(x_i; p) - f(x_i) \right)^2,$

subject to the constraints

(3.5) $B\hat{u}(x_b; p) = g(x_b), \quad b = 1, \ldots, N_b.$
There are many approaches to the minimization problem (3.4) subject to the constraints (3.5). One can for example use a constrained optimization procedure, Lagrange multipliers, or penalty formulations to include the constraints in the optimization. Another approach is to design the ansatz such that the constraints are automatically fulfilled, thus turning the constrained optimization problem into an unconstrained optimization problem. Unconstrained optimization problems can be more efficiently solved using gradient based optimization techniques.
In the following we let $B$ in (3.1) be the identity operator and hence consider PDEs with Dirichlet boundary conditions. In this case our ansatz for the solution is

(3.6) $\hat{u}(x; p) = G(x) + D(x) N(x; p),$

where $G$ is a smooth extension of the boundary data $g$, $D$ is a smooth distance function giving the distance for $x \in \Omega$ to $\Gamma$, and $N(x; p)$ is a feedforward ANN. Note that the form of the ansatz (3.6) ensures that $\hat{u}$ attains its boundary values at the boundary points. Moreover, $G$ and $D$ are precomputed using low-capacity ANNs for the given boundary data and geometry using only a small subset of the collocation points, and they do not add any extra complexity when minimizing (3.4). We call this approach unified as there are no ingredients involved other than feedforward ANNs and gradient based, unconstrained optimization methods to compute all of $G$, $D$, and $N$.
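Once $G$, $D$, and $N$ are available as callables, the ansatz (3.6) itself is a one-liner. The sketch below uses our own helper name and assumes all three have already been trained (or, in simple cases, chosen analytically).

```python
def ansatz(x, G, D, N):
    """Ansatz (3.6): boundary extension plus distance-weighted network.

    G, D, N are callables (e.g. trained ANNs). Since D(x) = 0 on the
    boundary, the ansatz attains the boundary values exactly there,
    regardless of the network N."""
    return G(x) + D(x) * N(x)
```

For example, with $G \equiv 1$, $D(x) = x(1-x)$ on $[0,1]$, and any $N$, the value at both endpoints is exactly $1$.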
3.1 Extension of the boundary data
The ansatz (3.6) requires that $G$ is globally defined and smooth (it is sufficient that $G$ has continuous derivatives up to the order of the differential operator $L$; however, since $G$ will be computed using an ANN it will be smooth) and that

(3.7) $G(x_b) = g(x_b), \quad x_b \in \Gamma.$
Other than these requirements the exact form of $G$ is not important. It is hence an ideal candidate for an ANN approximation. To compute $G$ we simply train an ANN to fit $g(x_b)$, $b = 1, \ldots, N_b$. The quadratic cost function used is given by

(3.8) $C_G = \dfrac{1}{2} \sum_{b=1}^{N_b} \left( G(x_b) - g(x_b) \right)^2,$

and in the course of calibration we compute the gradients using the standard backpropagation (2.8). Note that in general we have $N_b \ll N_d$, so the time it takes to compute $G$ is of lower order.
3.2 Smooth distance function computation
To compute the smooth distance function $D$ we start by computing a non-smooth distance function $d$ and approximate it using a low-capacity ANN. For each point $x \in \Omega \cup \Gamma$ we define $d(x)$ as the minimum distance to a boundary point where a boundary condition should be imposed. That is,

(3.9) $d(x) = \min_{x_b \in \Gamma} \| x - x_b \|_2.$

Again, the exact form of $D$ (and $d$) is not important other than that $D$ is smooth and

(3.10) $D(x) = 0 \iff x \in \Gamma.$

We can use a small subset of collocation points, with $N \ll N_d$, to compute $d$.
Once $d$ has been computed and normalized we fit another ANN $D$ and train it using the cost function

(3.11) $C_D = \dfrac{1}{2} \sum_i \left( D(x_i) - d(x_i) \right)^2.$

Note that we train the ANN over the collocation points together with the boundary points, where $d = 0$. Since the ANN defining the smooth distance function can be taken to be a single hidden layer, low-capacity ANN trained on $N \ll N_d$ points, the total time to compute both $d$ and $D$ is of lower order. See Figure 2 for (rather trivial) examples of $d$ and $D$ in one dimension.
Remark 3.1.
We stress that the distance function (3.9) can be computed efficiently with nearest-neighbor searches using for example k-d trees [2] or ball trees [27] for very high-dimensional input. A naïve nearest-neighbor search has complexity $O(N_d N_b)$, while an efficient nearest-neighbor search will typically have complexity $O(N_d \log N_b)$.
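As a sketch of Remark 3.1, the non-smooth distance (3.9) can be evaluated with SciPy's k-d tree; `nonsmooth_distance` is our own helper name.

```python
import numpy as np
from scipy.spatial import cKDTree

def nonsmooth_distance(interior_pts, boundary_pts):
    """Distance function (3.9): for each collocation point, the distance
    to the nearest boundary point, via a k-d tree nearest-neighbor query."""
    tree = cKDTree(boundary_pts)
    # One query per collocation point: roughly O(N_d log N_b)
    # instead of the naive O(N_d * N_b) pairwise scan.
    dist, _ = tree.query(interior_pts)
    return dist
```

The returned values can then be normalized and used as training targets for the smooth distance ANN $D$ in (3.11).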
Remark 3.2.
Instead of computing the actual distance function we could use the more extreme version

(3.12) $d(x) = \begin{cases} 0, & x \in \Gamma, \\ 1, & \text{otherwise}. \end{cases}$

However, simulations showed that this function and its approximating ANN have a negative impact on convergence and quality when computing $\hat{u}$.
3.3 Gradient computations
When the differential operator $L$ acts on the ansatz in (3.6) we need to compute the partial derivatives of the ANNs with respect to the spatial variables $x$. Also, when using gradient based optimization we need to compute the gradients of the quadratic residual cost function (3.2) with respect to the network parameters.
In the single hidden layer case all gradients can be computed in closed analytical form as shown in [21]. For deep ANNs, however, we need to modify the feedforward (2.5) and backpropagation (2.8) algorithms to compute the gradients with respect to and network parameters, respectively.
Note that the minimization problem in (3.4) involves the collocation points $x_i$. Throughout the paper, $\partial_i$ will always denote the partial derivative with respect to the spatial coordinate $x_i$, and not the derivative with respect to the collocation point $x_i$.
We supply the details of the relevant gradient computations/algorithms in the case of advection and diffusion problems in Appendix A and B, respectively.
4 Numerical examples
In this section we provide some concrete examples concerning how to apply the modified feedforward and backpropagation algorithms to model problems. Each problem requires its own modified backpropagation algorithm depending on the differential operator $L$. We start by presenting a few simple examples in some detail before we proceed to examples of more complicated problems.
In all of the following numerical examples we have implemented the gradients according to the schemes presented in the previous section and we consequently use the BFGS [9] method with default settings from SciPy [18] to train the ANN. We have tried all gradient-free and gradient-based numerical optimization methods available in the scipy.optimize package, as well as online, batch, and stochastic gradient descent. BFGS shows superior performance compared to all other methods that we have tried.
An issue one might encounter is that the line search in BFGS fails due to the Hessian matrix being very ill-conditioned. To circumvent this we use BFGS in combination with stochastic gradient descent. When BFGS fails due to a line search failure we run 1000 iterations of stochastic gradient descent with a very small learning rate to bring BFGS out of the troublesome region. This procedure is repeated until convergence or until the maximum number of allowed iterations has been exceeded.
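The restart strategy above can be sketched as follows. The function `train`, its signature, and the step-size default are our own choices, and plain gradient-descent steps stand in for the stochastic variant (which would subsample collocation points when forming the gradient).

```python
import numpy as np
from scipy.optimize import minimize

def train(cost, grad, p0, max_outer=50, sgd_steps=1000, lr=1e-6):
    """BFGS with a gradient-descent fallback.

    cost/grad map a flat parameter vector to the cost value and its
    gradient (hypothetical signatures). When the BFGS line search
    fails, take small descent steps to leave the troublesome region,
    then restart BFGS."""
    p = np.asarray(p0, dtype=float)
    for _ in range(max_outer):
        res = minimize(cost, p, jac=grad, method="BFGS")
        p = res.x
        if res.success:                 # converged normally
            return p
        for _ in range(sgd_steps):      # line search failed: small steps
            p = p - lr * grad(p)        # lr is a small, tunable choice
    return p
```

On a well-conditioned problem the fallback is never triggered and the routine behaves like a single BFGS run.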
The examples in 1D are rather simple and it suffices to use ANNs with two hidden layers with 10 neurons each. The 2D examples are somewhat more complicated and for those we use ANNs with five hidden layers with 10 neurons each.
4.1 Linear advection in 1D
The stationary, scalar, linear advection equation in 1D is given by

(4.1) $u_x(x) = f(x), \quad x \in (0, 1], \qquad u(0) = g_0.$
To get an analytic solution we take, for example,
(4.2) 
and plug it into (4.1) to compute $f$ and $g_0$. In this simple case the boundary data extension can be taken as the constant $G(x) = g_0$, and the smoothed distance function can be seen in Figure 1(a). In this simple case we could take the distance function to be the line $D(x) = x$. When $L$ acts on the ansatz we get

(4.3) $L\hat{u} = \hat{u}_x = D_x N + D N_x,$

since $G$ is constant.
To use a gradient based optimization method we need the gradients of the quadratic residual cost function (3.2). They can be computed from (3.3) as

(4.4) $\dfrac{\partial C}{\partial p} = \left( \hat{u}_x - f \right) \left( D_x \dfrac{\partial N}{\partial p} + D \dfrac{\partial N_x}{\partial p} \right).$
Each of the terms in (4.3) and (4.4) can be computed using the algorithms (2.5), (2.8), (A.2), (A.8), or (A.11). The result is shown in Figure 3. In this case we used an ANN with 2 hidden layers with 10 neurons each and 100 equidistant collocation points.
4.2 Linear diffusion in 1D
The stationary, scalar, linear diffusion equation in 1D is given by

(4.5) $u_{xx}(x) = f(x), \quad x \in (0, 1), \qquad u(0) = g_0, \quad u(1) = g_1.$
To get an analytical solution we let, for example,
(4.6) 
and compute $f$, $g_0$, and $g_1$. In this simple case the boundary data extension can be taken as the straight line $G(x) = g_0 + (g_1 - g_0)x$. The smoothed distance function is seen in Figure 1(b). Here we could instead take the smoothed distance function to be the parabola $D(x) = x(1 - x)$. When $L$ acts on the ansatz we get

(4.7) $L\hat{u} = \hat{u}_{xx} = D_{xx} N + 2 D_x N_x + D N_{xx},$

since $G_{xx} = 0$,
and to use gradient based optimization we compute (3.3) as

(4.8) $\dfrac{\partial C}{\partial p} = \left( \hat{u}_{xx} - f \right) \left( D_{xx} \dfrac{\partial N}{\partial p} + 2 D_x \dfrac{\partial N_x}{\partial p} + D \dfrac{\partial N_{xx}}{\partial p} \right).$
Each of the terms in (4.7) and (4.8) can be computed using the algorithms (2.5), (A.2), (A.8), (A.11), (B.2), and (B.7). The solution is shown in Figure 4 using an ANN with two hidden layers with 10 neurons each and 100 equidistant collocation points.
4.3 A remark on 1D problems
First and second order problems in 1D have a rather interesting property: the boundary data can be added in a pure post-processing step. A first order problem in 1D requires only one boundary condition and the boundary data extension can be taken as a constant. When the first-order operator acts on the ansatz, the boundary data extension vanishes and is not used in any subsequent training. A second order problem in 1D requires boundary data at both boundary points and the boundary data extension can be taken as the line passing through these points. When the second-order operator acts on the ansatz, the boundary data extension vanishes and is not used in any subsequent training. For one-dimensional problems we hence let the ansatz be given by
(4.9) $\hat{u}(x; p) = D(x) N(x; p),$
and we can then evaluate the solution for any boundary data using
(4.10) $u(x) = G(x) + D(x) N(x; p)$
without any retraining. Even 1D equations can be time-consuming for complicated problems and large networks, and being able to experiment with different boundary data without retraining is an improvement.
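As a sketch of (4.9)-(4.10) for a second-order 1D problem on $[a, b]$: train the network once with the ansatz $D(x)N(x; p)$, then attach any boundary data afterwards. The helper below is hypothetical and assumes a trained network `N`.

```python
def solution_1d(x, N, g0, g1, a=0.0, b=1.0):
    """Post-process boundary data for a second-order 1D problem, cf. (4.10).

    The boundary extension is the line through (a, g0) and (b, g1); the
    distance factor vanishes at both endpoints, so any (g0, g1) can be
    attached to the same trained network N without retraining."""
    line = g0 + (g1 - g0) * (x - a) / (b - a)   # boundary data extension G
    dist = (x - a) * (b - x)                    # vanishes at both endpoints
    return line + dist * N(x)
```

Whatever the network returns, the evaluated solution matches `g0` at `x = a` and `g1` at `x = b` exactly.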
4.4 Linear advection in 2D
The stationary, scalar, linear advection equation in 2D is given by

(4.11) $a u_x + b u_y = f \text{ in } \Omega, \qquad u = g \text{ on } \Gamma,$

where $a$, $b$ are the (constant) advection coefficients. The set $\Gamma \subseteq \partial\Omega$ is the part of the boundary where boundary conditions should be imposed. Let $n(x)$ denote the outer unit normal to $\partial\Omega$ at $x$. Following [19] we have

(4.12) $\Gamma = \{ x \in \partial\Omega : (a, b) \cdot n(x) < 0 \}.$
The condition (4.12), usually called the inflow condition, is easily incorporated into the computation of the distance function. Whenever we generate a boundary point to use in the computation of the distance function we check if (4.12) holds, and if it does not we add the point to the set of collocation points.
As an example we take the domain to be a star shape and generate uniformly distributed points inside $\Omega$ and uniformly distributed points on $\partial\Omega$, see Figure 5. Once the ANNs have been trained we can evaluate $\hat{u}$ on any number of points.
To compute the distance function we compute (3.9) on the finer grid shown in Figure 4(b) after traversing the boundary and moving all points that do not satisfy (4.12) to the set of collocation points. To compute the smoothed distance function we train an ANN on the coarser grid shown in Figure 4(a) and evaluate it on the finer grid. The results can be seen in Figure 6.
When the advection operator acts on the ansatz we get

(4.13) $L\hat{u} = a \left( G_x + D_x N + D N_x \right) + b \left( G_y + D_y N + D N_y \right).$
To compute the solution we need the gradients of the residual cost function, given here by

(4.14) $\dfrac{\partial C}{\partial p} = \left( L\hat{u} - f \right) \left( a \left( D_x \dfrac{\partial N}{\partial p} + D \dfrac{\partial N_x}{\partial p} \right) + b \left( D_y \dfrac{\partial N}{\partial p} + D \dfrac{\partial N_y}{\partial p} \right) \right).$
Each of the terms in (4.13) and (4.14) can be computed using the algorithms (2.5), (2.8), (A.2), (A.8), or (A.11) as before. Note that the full gradient vectors can be computed simultaneously and there is no need to do a separate feedforward and backpropagation pass for each component of the gradient.
To get an analytic solution we let, for example,
(4.15) 
and plug it into (4.11) to compute $f$ and $g$. An ANN with a single hidden layer with 20 neurons is used to compute the extension of the boundary data $G$. To compute the solution we use an ANN with five hidden layers with 10 neurons each. The result can be seen in Figure 7. Note that the errors follow the so-called streamlines. This is a well-known problem in FEM where streamline artificial diffusion is added to reduce the streamline errors [17].
4.5 Linear diffusion in 2D
The linear, scalar diffusion equation in 2D is given by

(4.16) $u_{xx} + u_{yy} = f \text{ in } \Omega, \qquad u = g \text{ on } \partial\Omega.$

In this case $\Gamma = \partial\Omega$ and the smoothed distance function is computed using all points along the boundary, as seen in Figure 8.
To get an analytic solution to compare with we let, for example,
(4.17) 
and use it to compute $f$ and $g$ in (4.16). The extension of the boundary data is again computed using an ANN with a single hidden layer with 20 neurons and trained on all boundary points of the fine mesh. When the diffusion operator acts on the ansatz we get

(4.18) $L\hat{u} = G_{xx} + G_{yy} + \left( D_{xx} + D_{yy} \right) N + 2 \left( D_x N_x + D_y N_y \right) + D \left( N_{xx} + N_{yy} \right),$
and the gradients of the residual cost function become

(4.19) $\dfrac{\partial C}{\partial p} = \left( L\hat{u} - f \right) \left( \left( D_{xx} + D_{yy} \right) \dfrac{\partial N}{\partial p} + 2 D_x \dfrac{\partial N_x}{\partial p} + 2 D_y \dfrac{\partial N_y}{\partial p} + D \left( \dfrac{\partial N_{xx}}{\partial p} + \dfrac{\partial N_{yy}}{\partial p} \right) \right).$
Each of the terms in (4.19) and (4.18) can be computed using the algorithms (2.5), (A.2), (A.8), (A.11), (B.2), and (B.7). The solution using an ANN with five hidden layers with 10 neurons each can be seen in Figure 9.
4.6 Linear diffusion in a complex 2D geometry
The examples above show how to use the ANN method on nontrivial geometries. To put the method to the test we consider the whole of Sweden as our domain. The domain is represented as a polygon with 160876 vertices and the boundary is highly irregular with very small details. The latitude and longitude coordinates of the vertices have been transformed to planar coordinates using the standard Mercator projection. As before we generate 500 collocation points uniformly distributed along the boundary and 1000 collocation points inside the domain, as shown in Figure 10, and compute the smoothed distance function using a single hidden layer ANN with 20 neurons. In this case we used all collocation points, rather than a subset as before, to compute the smoothed distance function due to the complexity of the boundary.
We take as an analytic solution the function
(4.20) 
where the solution is centered at the center of mass of the domain. The network solution and error can be seen in Figure 11, where we have, as before, used a network with 5 hidden layers with 10 neurons each.
4.6.1 A remark on mesh based methods
Classical methods for PDEs in complex geometries such as FEM and FVM require that the domain is discretized by a triangulation. Each mesh element along the boundary needs to coincide with a line segment of the polygon describing the boundary. For problems with simple polygonal boundaries for which a high quality mesh can be constructed the ANN method is not competitive with classical methods. Computing gradients and performing gradient based optimization is much more time consuming than solving the linear systems produced by classical numerical methods.
The problem above consists of a complicated polygon with very short line segments and fine-grained details, which places severe restrictions on the triangulation. In an attempt to compare the ANN method with FEM we used the state-of-the-art FEM software suite FEniCS [24]. FEniCS includes a mesh generator called mshr which is capable of creating high quality meshes on many complex geometries. Unfortunately, though, we did not succeed in creating a mesh to be used in a FEM simulation. We experimented with different settings and resolutions, but each attempt was aborted after 16 hours when the mesh generation had not completed. The above ANN simulation took about 10 minutes on a high-end laptop from start to finish, including computing the smoothed distance function and boundary data extension. With the recent progress in open source software for neural networks, for example TensorFlow [1] and PyTorch [28], a reimplementation would significantly reduce the computational time.

4.7 Higher-dimensional problems
Higher-dimensional problems follow the same pattern as the 2D case. For example, in $n$ dimensions we have the diffusion operator acting on the ansatz as

(4.21) $L\hat{u} = \sum_{i=1}^{n} \left( G_{x_i x_i} + D_{x_i x_i} N + 2 D_{x_i} N_{x_i} + D N_{x_i x_i} \right),$

and the gradients of the residual cost function become

(4.22) $\dfrac{\partial C}{\partial p} = \left( L\hat{u} - f \right) \sum_{i=1}^{n} \left( D_{x_i x_i} \dfrac{\partial N}{\partial p} + 2 D_{x_i} \dfrac{\partial N_{x_i}}{\partial p} + D \dfrac{\partial N_{x_i x_i}}{\partial p} \right).$
As we can see, the computational complexity increases linearly with the number of space dimensions, as one additional term is added for each dimension. The number of parameters in the deep ANN increases by the number of neurons in the first hidden layer for each added dimension. For a sufficiently deep ANN this increase is negligible.
The main cause of increased computational time is the number of collocation points. Regular grids grow exponentially with the number of dimensions and quickly become infeasible. Uniformly random numbers in high dimensions are in fact not uniformly distributed across the whole space but tend to cluster on hyperplanes, thus providing a poor coverage of the whole space. This is a well-known fact for Monte Carlo methods and has led to the development of sequences of quasi-random numbers called low discrepancy sequences. See for example ch. 2 in [32], or [5, 35] and the references therein. There are plenty of different low discrepancy sequences and an evaluation of their efficiency in the context of deep ANNs is beyond the scope of this paper. A common choice is the Sobol sequence [34], which is described in detail in [3] and later extended in [15] and [16] to higher dimensions. To compute a high-dimensional problem we thus generate points from a Sobol sequence in $n$ dimensions to use as collocation points and points in $n-1$ dimensions to use as boundary points. The rest follows the same procedure as previously described.
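With SciPy's quasi-Monte Carlo module, generating Sobol collocation points is straightforward. The helper name below is ours, and the points would still need to be mapped from the unit cube into the actual domain (for example by rejection sampling against the geometry).

```python
from scipy.stats import qmc

def sobol_collocation(m, dim, seed=0):
    """Return 2**m Sobol points in the unit cube [0, 1]^dim.

    Scrambling randomizes the sequence while preserving its low
    discrepancy; drawing a power-of-two number of points preserves
    the balance properties of the sequence."""
    sampler = qmc.Sobol(d=dim, scramble=True, seed=seed)
    return sampler.random_base2(m=m)
```

For instance, `sobol_collocation(10, 20)` yields 1024 well-spread points in a 20-dimensional unit cube, where a regular grid would already be hopeless.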
5 Convergence considerations
Training neural networks consumes a lot of time and computational resources. It is therefore desirable to reduce the number of required iterations until a certain accuracy has been reached. In the following we will discuss two factors which strongly influence the number of required iterations. The first is pretraining the network using the available boundary data and the second is to increase the number of hidden layers.
5.1 Boundary data pretraining
As previously described, we used a low-capacity ANN to compute the global extension of the boundary data to save some computational time. However, the low-capacity ANN that extends the boundary data is already an approximate solution to the PDE. Figure 12 shows the boundary and solution ANNs evaluated on all collocation and boundary points. Note that the boundary ANN has not been trained on a single collocation point inside the domain.
As fitting an ANN to the boundary data is an order of magnitude faster than solving the PDE, this suggests that we can pretrain the solution ANN by fitting it to the boundary data. It is still a good idea to keep the boundary ANN as a separate low-capacity ANN, as it will speed up the training of the solution ANN due to the many evaluations that are required during training.

When training with the BFGS method, pretraining has limited effect due to the efficiency with which the method quickly reduces the value of the cost function. For less efficient methods, such as gradient descent, the difference can be quite pronounced, as boundary data pretraining gives a good starting point for the iterations. There is, however, one prominent effect. The time until the first line search failure is greatly reduced by boundary data pretraining for the diffusion problem, which is significantly more ill-conditioned than the advection problem. In Figure 13 we show the convergence history of BFGS until the first line search failure with and without pretraining. Line search failures are the main bottleneck of the BFGS method and reducing them is a major gain.
5.2 Number of hidden layers
Most earlier works, as for example the pioneering work by Lagaris et al. [21, 22], have focused on neural networks with a single hidden layer. Single hidden layer networks are sufficient to approximate any continuous function to any accuracy by having a sufficient number of neurons in the hidden layer [13, 14, 23]. However, by fixing the capacity of the network and increasing the number of hidden layers we can see a dramatic decrease in the number of iterations required to reach a desired level of accuracy.
In the following we solve the diffusion problem in 2D as before (with boundary data pretraining) using networks with 1, 2, 3, 4, and 5 hidden layers with 120, 20, 14, 12, and 10 neurons in each hidden layer, respectively. This gives networks with 481, 501, 477, 517, and 481 trainable parameters (weights and biases), respectively, which we consider comparable. The networks are trained until the cost is less than a prescribed tolerance and we record the number of required iterations. The result can be seen in Figure 14. We can clearly see how the number of required iterations decreases as we increase the number of hidden layers. At five hidden layers we are starting to experience the vanishing gradient problem [10] and the convergence starts to deteriorate. Note that when using a single hidden layer we did not reach the required tolerance before exceeding the maximum number of allowed iterations.

6 Summary and conclusion
This paper presents a method for solving advection and diffusion type PDEs in complex geometries using deep feedforward artificial neural networks. By following the derivations of the modified backpropagation algorithms outlined here, one can extend this work to include nonlinear systems of PDEs of arbitrary order.
We show examples of advection and diffusion type PDEs in 1D and 2D with one example being a highly complex polygon for which traditional mesh based methods are infeasible.
It is shown that increasing the number of hidden layers is beneficial for the number of training iterations required to reach a certain accuracy. The effects will be even more pronounced when adding deep learning techniques that prevent the vanishing gradient problem. Extending the feedforward network to more complex configurations, and developing even deeper networks, using for example convolutional layers, dropout, and batch normalization, is a topic for future work.
7 Acknowledgements
The authors were partially supported by a grant from the Göran Gustafsson Foundation for Research in Natural Sciences and Medicine.
References
[1] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, Sept. 1975.
[3] P. Bratley and B. L. Fox. Algorithm 659: Implementing Sobol's quasi-random sequence generator. ACM Trans. Math. Softw., 14(1):88–100, Mar. 1988.
[4] P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier. Deep relaxation: Partial differential equations for optimizing deep neural networks. ArXiv e-prints, Apr. 2017.
[5] J. Cheng and M. J. Druzdzel. Computational investigation of low-discrepancy sequences in simulation algorithms for Bayesian networks. ArXiv e-prints, Jan. 2013.
[6] N. E. Cotter. The Stone–Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks, 1(4):290–295, Dec. 1990.
[7] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. ArXiv e-prints, June 2017.
[8] W. E and B. Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. ArXiv e-prints, Sept. 2017.
[9] R. Fletcher. Practical Methods of Optimization. Wiley, 2nd edition, 1987.
[10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. PMLR, 2010.
[11] J. Han, A. Jentzen, and W. E. Overcoming the curse of dimensionality: Solving high-dimensional partial differential equations using deep learning. ArXiv e-prints, July 2017.
[12] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[13] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[14] K. Hornik, M. Stinchcombe, and H. White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551–560, 1990.
[15] S. Joe and F. Y. Kuo. Remark on Algorithm 659: Implementing Sobol's quasi-random sequence generator. ACM Trans. Math. Softw., 29(1):49–57, Mar. 2003.
[16] S. Joe and F. Y. Kuo. Constructing Sobol sequences with better two-dimensional projections. SIAM Journal on Scientific Computing, 30(5):2635–2654, 2008.
[17] C. Johnson and J. Saranen. Streamline diffusion methods for the incompressible Euler and Navier–Stokes equations. Mathematics of Computation, 47(175):1–18, 1986.
[18] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python. http://www.scipy.org, 2001–. Accessed 2017-12-12.
[19] H.-O. Kreiss. Initial boundary value problems for hyperbolic systems. Communications on Pure and Applied Mathematics, 23(3):277–298, 1970.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[21] I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, Sept. 1998.
[22] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou. Neural-network methods for boundary value problems with irregular boundaries. IEEE Transactions on Neural Networks, 11(5):1041–1049, Sept. 2000.
[23] X. Li. Simultaneous approximations of multivariate functions and their derivatives by neural networks with one hidden layer. Neurocomputing, 12(4):327–343, 1996.
[24] A. Logg, K.-A. Mardal, G. N. Wells, et al. Automated Solution of Differential Equations by the Finite Element Method. Springer, 2012.
[25] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, Dec. 1943.
[26] K. S. McFall and J. R. Mahan. Artificial neural network method for solution of boundary value problems with exact satisfaction of arbitrary boundary conditions. IEEE Transactions on Neural Networks, 20(8):1221–1233, Aug. 2009.
[27] S. M. Omohundro. Five balltree construction algorithms. Technical report, International Computer Science Institute, 1989.
[28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[29] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics informed deep learning (Part I): Data-driven solutions of nonlinear partial differential equations. ArXiv e-prints, Nov. 2017.
[30] K. Rudd and S. Ferrari. A constrained integration (CINT) approach to solving partial differential equations using artificial neural networks. Neurocomputing, 155:277–285, 2015.
[31] K. Rudd, G. Muro, and S. Ferrari. A constrained backpropagation approach for the adaptive solution of partial differential equations. IEEE Transactions on Neural Networks and Learning Systems, 25(3):571–584, 2014.
[32] R. Seydel. Tools for Computational Finance. Universitext. Springer, 2004.
[33] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. ArXiv e-prints, Aug. 2017.
[34] I. M. Sobol. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967.
[35] X. Wang and I. H. Sloan. Low discrepancy sequences in high dimensions: How well are their projections distributed? J. Comput. Appl. Math., 213:366–386, Mar. 2008.
[36] N. Yadav, A. Yadav, and M. Kumar. An Introduction to Neural Network Methods for Differential Equations. SpringerBriefs in Applied Sciences and Technology. Springer Netherlands, 2015.
Appendix A Advection problems
For advection problems we have a PDE of the form

(A.1) $Lu = A(u)\,\nabla u$

for some (nonlinear) matrix coefficient $A = A(u)$. When the advection operator acts on $u$ we need to compute the gradient of the ANN with respect to the input. The gradients can be computed by taking partial derivatives of (2.5). We get in component form

(A.2) $\dfrac{\partial l^1_i}{\partial x_j} = W^1_{ij}, \qquad \dfrac{\partial l^k_i}{\partial x_j} = \sum_m W^k_{im}\,\sigma'(l^{k-1}_m)\,\dfrac{\partial l^{k-1}_m}{\partial x_j}, \quad k = 2, \ldots, L.$

Note the slight abuse of subscript notation in order to avoid having too many subscripts. Note also that (A.2) defines a new ANN with the same weights, but no biases, as the original ANN, with modified activation functions. A convenient vectorial form is obtained by dropping the subscripts and rewriting (A.2) as

(A.3) $\dfrac{\partial l^1}{\partial x_j} = W^1 e_j, \qquad \dfrac{\partial l^k}{\partial x_j} = W^k\left(\sigma'(l^{k-1}) \odot \dfrac{\partial l^{k-1}}{\partial x_j}\right),$

where $e_j$ is the $j$th standard basis vector. To compute all gradients simultaneously we define, for a vector $x$ and a vector-valued function $f$, the diagonal and Jacobian matrices

(A.4) $D(x) = \operatorname{diag}(x), \qquad J(f) = \left[\dfrac{\partial f_i}{\partial x_j}\right].$

Using these matrices we can write all partial derivatives in the compact matrix form

(A.5) $J(l^1) = W^1 I, \qquad J(l^k) = W^k\, D\big(\sigma'(l^{k-1})\big)\, J(l^{k-1}), \quad k = 2, \ldots, L,$

where $I$ is the identity matrix. All gradients of the ANN with respect to the input are contained in the rows of the Jacobian matrix as