1 Introduction
Deep learning has had great success in computer vision and other artificial intelligence tasks
[1]. Underlying this success is a new way to approximate functions, from an additive construction commonly used in approximation theory to a compositional construction used in deep neural networks. The compositional construction seems to be particularly powerful in high dimensions. This suggests that deep neural network based models can be of use in other contexts that involve constructing functions. This includes solving partial differential equations, molecular modeling, model reduction, etc. These aspects have been explored recently in
[2, 3, 4, 5, 6, 7].In this paper, we continue this line of work and propose a new algorithm for solving variational problems. We call this new algorithm the Deep Ritz method since it is based on using the neural network representation of functions in the context of the Ritz method. The Deep Ritz method has a number of interesting and promising features, which we explore later in the paper.
2 The Deep Ritz Method
An explicit example of the kind of variational problems we are interested in is [8]
(1) 
where
(2) 
and is the set of admissible functions (also called trial function, here represented by ), is a given function, representing external forcing to the system under consideration. Problems of this type are fairly common in physical sciences. The Deep Ritz method is based on the following set of ideas:

Deep neural network based approximation of the trial function.

A numerical quadrature rule for the functional.

An algorithm for solving the final optimization problem.
2.1 Building trial functions
The basic component of the Deep Ritz method is a nonlinear transformation defined by a deep neural network. Here
denotes the parameters, typically the weights in the neural network, that help to define this transformation. In the architecture that we use, each layer of the network is constructed by stacking several blocks, each block consists of two linear transformations, two activation functions and a residual connection, both the input
and the outputof the block are vectors in
. The th block can be expressed as:(3) 
where , are parameters associated with the block. is the (scalar) activation function [1].
Our experience has suggested that the smoothness of the activation function plays a key role in the accuracy of the algorithm. To balance simplicity and accuracy, we have decided to use
(4) 
The last term in (3
), the residual connection, makes the network much easier to train since it helps to avoid the vanishing gradient problem
[9]. The structure of the two blocks, including two residual connections, is shown in Figure 1.The full layer network can now be expressed as:
(5) 
denotes the set of all the parameters in the whole network. Note the input for the first block is in , not
. To handle this discrepancy we can either pad
by a zero vector when , or apply a linear transformation on when . Having , we obtain by(6) 
Here in the lefthand side and in what follows, we will use to denote the full parameter set . Substituting this into the form of , we obtain a function of , which we should minimize.
For the functional that occurs in (2), denote:
(7) 
then we are left with the optimization problem:
(8) 
2.2 The stochastic gradient descent algorithm and the quadrature rule
To finish describing the algorithm, we need to furnish the remaining two components: the optimization algorithm and the discretization of the integral in in (2) or in (8). The latter is necessary since computing the integral in (or ) explicitly for functions of the form (6) is quite an impossible task.
In machine learning, the optimization problem that one encounters often takes the form:
(9) 
where each term at the righthand side corresponds to one data point. , the number of data points, is typically very large. For this problem, the algorithm of choice is the stochastic gradient descent (SGD) method, which can be described as follows:
(10) 
Here
are i.i.d random variables uniformly distributed over
. This is the stochastic version of the gradient descent algorithm (GD). The key idea is that instead of computing the sum when evaluating the gradient of , we simply randomly choose one term in the sum. Compared with GD, SGD requires only one function evaluation of function evaluations at each iteration. In practice, instead of picking one term, one chooses a ”minibatch” of terms at each step.At a first sight, our problem seems different from the ones that occur in machine learning since there are no data involved. The connection becomes clear once we view the integral in as a continuous sum, each point in then becomes a data point. Therefore, at each step of the SGD iteration, one chooses a minibatch of points to discretize the integral. These points are chosen randomly and the same quadrature weight is used at every point.
Note that if we use standard quadrature rules to discretize the integral, then we are bound to choose a fixed set of nodes. In this case, we run into the risk where the integrand is minimized on these fixed nodes but the functional itself is far from being minimized. It is nice that SGD fits naturally with the needed numerical integration in this context.
In summary, the SGD in this context is given by:
(11) 
where for each , is a set of points in that are randomly sampled with uniform distribution. To accelerate the training of the neural network, we use the Adam optimizer version of the SGD [10].
3 Numerical Results
3.1 The Poisson equation in two dimension
Consider the Poisson equation:
(12)  
where . The solution to this problem suffers from the wellknown ”corner singularity” caused by the nature of the domain [11]
. A simple asymptotic analysis shows that at the origin, the solution behaves as
[11]. Models of this type have been extensively used to help developing and testing adaptive finite element methods.The network we used to solve this problem is a stack of four blocks (eight fullyconnected layers) and an output layer with . There are a total of parameters in the model. As far as we can tell, this network structure is not special in any way. It is simply the one that we used.
The boundary condition causes some problems. Here for simplicity, we use a penalty method and consider the modified functional
(13) 
We choose . The results from the Deep Ritz method is shown in see Figure 2(a). For comparison, we also plot the result of the finite difference method with (degrees of freedom), see Figure 2(b).
To analyze the error more quantitatively, we consider the following problem
(14)  
where . This problem has an explicit solution in polar coordinates. The error , where and are the exact and approximate solutions respectively, is shown in Table 1 for both the Deep Ritz method and the finite difference method (on uniform grids). We can see that with fewer parameters, the Deep Ritz method gives more accurate solution than the finite difference method.
Method  Blocks Num  Parameters  relative error 

DRM  3  591  0.0079 
4  811  0.0072  
5  1031  0.00647  
6  1251  0.0057  
FDM  625  0.0125  
2401  0.0063 
Being a naturally nonlinear variational method, the Deep Ritz method is also naturally adaptive. We believe that this contributes to the better accuracy of the Deep Ritz method.
3.2 Poisson equation in high dimension
Experiences in computer vision and other artificial intelligence tasks suggest that deep learningbased methods are particularly powerful in high dimensions. This has been confirmed by the results of the Deep BSDE method [3]. In this subsection, we investigate the performance of the Deep Ritz method in relatively high dimension.
Consider ()
(15)  
The solution of this problem is simply , and we will use the exact solution to compute the error of our model later.
For the network structure, we stack six fullyconnected layers with three skip connections and a final linear layer, and there are a total of 671 parameters. For numerical integration, at each step of the SGD iteration, we sample 1,000 points in
and 100 points at each hyperplane that composes
. We set . After 50,000 iterations, the relative error was reduced to about . The training process is shown in Figure 3(a).Also shown in Figure 3(b) is the training process for the problem:
(16)  
with with a similar network structure (stack 3 blocks of size m=100). The solution of this problem is . After 50000 iterations, the relative error is reduced to about .
3.3 An example with the Neumann boundary condition
Consider:
(17)  
The exact solution is
In this case, we can simply use
without any penalty function for the boundary.
With a similar network structure the relative error reaches for and for . The training process is shown in Figure 4.
3.4 Transfer learning
An important component of the training process is the initialization. Here we investigate the benefit of transferring weights in the network when the forcing function is changed.
Consider the problem:
(18)  
where . Here we used a mixture of rectangular and polar coordinates. The exact solution is
.
The network consists of a stack of 3 blocks with m=10, that is, six fullyconnected layers and three residual connections and a final linear transformation layer to obtain . We show how the error and the weights in the layers change during the training period in Figure 5.
We also transfer the weights from the problem:
(19)  
where .
The error and the weights during the training period are also shown in Figure 5. We see that transferring weights speeds up the training process considerably during the initial stage of the training. This suggests that transferring weights is a particularly effective procedure if the accuracy requirement is not very strigent.
3.5 Eigenvalue problems
Consider the following problem:
(20)  
Problems of this kind occur often in quantum mechanics where is the potential function.
There is a wellknown variational principle for the smallest eigenvalue:
(21)  
The functional we minimize here is called the Rayleigh quotient.
To avoid getting the trivial optimizer , instead of using the functional
we use
(22)  
In practice, we use
(23) 
One might suggest that with the last penalty term, the denominator in the Rayleigh quotient is no longer necessary. It turns out that we found in practice that this term still helps in two ways: (1) In the presence of this denominator, there is no need to choose a large value of . For the harmonic oscillator in , we choose , to be 100 and this seems to be large enough. (2) This term helps to speed up the training process.
To solve this problem, we build a deep neural network much like the Densenet [12]. There are skip connections between every pairwise layers, which help gradients flow through the whole network. The network structure is shown in Figure 6.
(24) 
(25) 
We use an activation function . If we use the same activation function as before, we found that the gradients can become quite large and we may face the gradient explosion problem.
The remaining components of the algorithm are very much the same as before.
Example 1: Infinite potential well
Consider the potential function
(26) 
The problem is then equivalent to solving:
(27)  
The smallest eigenvalue is .
The results of the Deep Ritz method in different dimensions are shown in Table 2.
Dimension  Exact  Approximate  Error 

1  9.87  9.85  0.20% 
5  49.35  49.29  0.11% 
10  98.70  92.35  6.43% 
Example 2: The harmonic oscillator
The potential function in is . For simplicity, we truncate the computational domain from to . Obviously, there are better strategies, but we leave improvements to later work.
The results in different dimensions are shown in Table 3.
Dimension  Exact  Approximate  Error 

1  1  1.0016  0.16% 
5  5  5.0814  1.6% 
10  10  11.26  12.6% 
The results deteriorate substantially as the dimension is increased. We believe that there is still a lot of room for improving the results. We will leave this to future work.
4 Discussion
We proposed a variational method based on representing the trial functions by deep neural networks. Our limited experience with this method suggests that it has the following advantages:

It is naturally adaptive.

It is less sensitive to the dimensionality of the problem and has the potential to work in rather high dimensions.

The method is reasonably simple and fits well with the stochastic gradient descent framework commonly used in deep learning.
We also see a number of disadvantages that need to be addressed in future work:

The variational problem that we obtain at the end is not convex even when the initial problem is. The issue of local minima and saddle points is nontrivial.

At the present time, there is no consistent conclusion about the convergence rate.

The treatment of the essential boundary condition is not as simple as for the traditional methods.
In addition, there are still interesting issues regarding the choice of the network structure, the activation function and the minimization algorithm. The present paper is far from being the last word on the subject.
Acknowledgement: We are grateful to Professor Ruo Li and Dr. Zhanxing Zhu for very helpful discussions. The work of E and Yu is supported in part by the National Key Basic Research Program of China 2015CB856000, Major Program of NNSFC under grant 91130005, DOE grant DESC0009248, and ONR grant N000141310338.
References
 [1] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning. MIT Press, 2016.
 [2] W. E, “A proposal for machine learning via dynamical systems”, Communications in Mathematics and Statistics, March 2017, Volume 5, Issue 1, pp 111.
 [3] J. Q. Han, A. Jentzen and W. E, “Overcoming the curse of dimensionality: Solving highdimensional partial differential equations using deep learning”, submitted, arXiv:1707.02568.
 [4] W. E, J. Q. Han and A. Jentzen, “Deep learningbased numerical methods for highdimensional parabolic partial differential equations and backward stochastic differential equations”, submitted, arXiv:1706.04702.
 [5] C. Beck, W. E and Arnulf Jentzen, “Machine learning approximation algorithms for highdimensional fully nonlinear partial differential equations and secondorder backward stochastic differential equations”, submitted. arXiv:1709.05963.
 [6] J. Q. Han, L. Zhang, R. Car and W. E, “Deep Potential: A general and “firstprinciple” representation of the potential energy”, submitted, arXiv:1707.01478.
 [7] L. Zhang, J.Q. Han, H. Wang, R. Car and W.E, “Deep Potential Molecular Dynamics: A scalable model with the accuracy of quantum mechanics”, submitted, arXiv:1707.09571.
 [8] L. C. Evans, Partial Differential Equations, 2nd ed. American Mathematical Society, 2010.

[9]
K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun, “Deep residual learning for image recognition”,
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, vol. 00, no. , pp. 770778, 2016, doi:10.1109/CVPR.2016.90  [10] D. P. Kingma, and J. Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980, 2014.
 [11] G. Strang and G. Fix, An Analysis of the Finite Element Method. PrenticeHall, 1973.
 [12] G. Huang, Z. Liu, K. Q. Weinberger, V. D. M. Laurens, “Densely connected convolutional networks.”, arXiv preprint arXiv:1608.06993, 2016.
Comments
There are no comments yet.