Learning to Discretize: Solving 1D Scalar Conservation Laws via Deep Reinforcement Learning

05/27/2019 ∙ by Yufei Wang, et al. ∙ Peking University 0

Conservation laws are considered to be fundamental laws of nature. It has broad application in many fields including physics, chemistry, biology, geology, and engineering. Solving the differential equations associated with conservation laws is a major branch in computational mathematics. Recent success of machine learning, especially deep learning, in areas such as computer vision and natural language processing, has attracted a lot of attention from the community of computational mathematics and inspired many intriguing works in combining machine learning with traditional methods. In this paper, we are the first to explore the possibility and benefit of solving nonlinear conservation laws using deep reinforcement learning. As a proof of concept, we focus on 1-dimensional scalar conservation laws. We deploy the machinery of deep reinforcement learning to train a policy network that can decide on how the numerical solutions should be approximated in a sequential and spatial-temporal adaptive manner. We will show that the problem of solving conservation laws can be naturally viewed as a sequential decision making process and the numerical schemes learned in such a way can easily enforce long-term accuracy. Furthermore, the learned policy network can determine a good local discrete approximation based on the current state of the solution, which essentially makes the proposed method a meta-learning approach. In other words, the proposed method is capable of learning how to discretize for a given situation mimicking human experts. Finally, we will provide details on how the policy network is trained, how well it performs compared with some state-of-the-art numerical solvers such as WENO schemes, and how well it generalizes.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Conservation laws are considered to be one of the fundamental laws of nature, and has broad applications in multiple fields such as physics, chemistry, biology, geology, and engineering. For example, Burger’s equation, a very classic partial differential equation (PDE) in conservation laws, has important applications in fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow.

Solving the differential equations associated with conservation laws has been a major branch of computational mathematics LeVeque (1992, 2002), and a lot of effective methods have been proposed, from classic methods such as the upwind scheme, the Lax-Friedrichs scheme, to the advanced ones such as the ENO/WENO schemes Liu et al. (1994); Shu (1998), the flux-limiter methods Jerez Galiano and Uh Zapata (2010)

, and etc. In the past few decades, these traditional methods have been proven successful in solving conservation laws. Nonetheless, the design of some of the high-end methods heavily relies on expert knowledge and the coding of these methods can be a laborious process. To ease the usage and potentially improve these traditional algorithms, machine learning, especially deep learning, has been recently incorporated into this field. For example, the ENO scheme requires lots of ‘if/else’ logical judgments when used to solve complicated system of equations or high-dimensional equations. This very much resembles the old-fashioned expert systems. The recent trend in artificial intelligence (AI) is to replace the expert systems by the so-called ‘connectionism’, e.g., deep neural networks, which leads to the recent bloom of AI. Therefore, it is natural and potentially beneficial to introduce deep learning in traditional numerical solvers of conservation laws.

1.1 Related works

In the last few years, neural networks (NNs) have been applied to solving ODEs/PDEs or the associated inverse problems. These works can be roughly classified into three categories according to the way that the NN is used.

The first type of works propose to harness the representation power of NNs, and are irrelevant to the numerical discretization based methods. For example, Raissi et al. (2017a, b) treated the NNs as new ansatz to approximate solutions of PDEs. More recent works along this line include Magiera et al. (2019); Michoski et al. (2019); Both et al. (2019). Besides, several works have focused on using NNs to establish direct mappings between the parameters of the PDEs (e.g. the coefficient field or the ground state energy) and their associated solutions Khoo et al. (2017); Khoo and Ying (2018); Li et al. (2019); Fan et al. (2018b). Furthermore, Han et al. (2018); Beck et al. (2017) proposed a method to solve very high-dimensional PDEs by converting the PDE to a stochastic control problem and use NNs to approximate the gradient of the solution.

The second type of works focus on the connection between deep neural networks (DNNs) and dynamic systems Weinan (2017); Chang et al. (2017); Lu et al. (2018); Long et al. (2018b); Chen et al. (2018). These works observed that there are connections between DNNs and dynamic systems (e.g. differential equations or unrolled optimization algorithms) so that we can combine deep learning with traditional tools from applied and computational mathematics to handle challenging tasks in inverse problems Long et al. (2018b, a); Qin et al. (2018).The main focus of these works, however, is to solve inverse problems, instead of learning numerical discretizations of differential equations. Nonetheless, these methods are closely related to numerical differential equations since learning a proper discretization is often an important auxiliary task for these methods to accurately recover the form of the differential equations.

The third type of works, which target at using NNs to learn new numerical schemes, are closely related to our work. However, we note that these works mainly fall in the setting of supervised learning (SL). For example,

Discacciati et al. (2019) proposed to integrate NNs into high-order numerical solvers to predict artificial viscosity; Ray and Hesthaven (2018)

trained a multilayer perceptron to replace traditional indicators for identifying troubled-cells in high-resolution schemes for conservation laws. These works greatly advanced the development in machine learning based design of numerical schemes for conservation laws. Note that in

Discacciati et al. (2019), the authors only utilized the one-step error to train the artificial viscosity networks without taking into account the long-term accuracy of the learned numerical scheme. Ray and Hesthaven (2018) first constructed several functions with known regularities and then used them to train a neural network to predict the location of discontinuity, which was later used to choose a proper slope limiter. Therefore, the training of the NNs is separated from the numerical scheme. Then, a natural question is whether we can learn discretization of differential equations in an end-to-end fashion and the learned discrete scheme also takes long-term accuracy into account. This motivates us to employ reinforcement learning to learn good solvers for conservation laws.

1.2 Our Approach

The main objective of this paper is to design new numerical schemes in an autonomous way. We propose to use reinforcement learning (RL) to aid the process of solving the conservation laws. To our best knowledge, we are the first to propose using (deep) RL to solve evolution PDEs. We carefully design the proposed RL-based method so that the learned policy can generate high accuracy numerical schemes and can well generalize in varied situations. Details will be given in section 3.

Here, we first provide a brief discussion on the benefits of using RL instead of SL to solve conservation laws (the arguments apply to general evolution PDEs as well):

  • [leftmargin=*]

  • Most of the numerical solvers of conservation law can be interpreted naturally as a sequential decision making process (e.g., the approximated grid values at the current time instance definitely affects all the future approximations). Thus, it can be easily formulated as a Markov Decision Process (MDP) and solved by RL.

  • In almost all the RL algorithms, the policy (which is the AI agent who decides on how the solution should be approximated locally) is optimized with regards to the values , which by definition considers the long-term accumulated reward (or, error of the learned numerical scheme), thus could naturally guarantee the long-term accuracy of the learned schemes. Furthermore, it can gracefully handle the cases when the action space is discrete, which is in fact one of the major strength of RL.

  • Non-smooth norms such as the infinity norm of the error is often used to evaluate the performance of the learned numerical schemes. As the norm of the error serves as the loss function for the learning algorithms, computing the gradient of the infinity norm can be problematic for supervised learning, while RL does not have such problem since it does not explicitly take gradients of the loss function (i.e. the reward function for RL).

  • Learning the policy within the RL framework makes the algorithm meta-learning-like Schmidhuber (1987); Bengio et al. (1992); Andrychowicz et al. (2016); Li and Malik (2016); Finn et al. (2017). The learned policy can decide on which local numerical approximation to use by judging from the current state of the solution (e.g. local smoothness, oscillatory patterns, dissipation, etc). This is vastly different from regular (non-meta) learning where the algorithms directly make inference on the numerical schemes without the aid of an additional network such as . As subtle the difference as it may seem, meta-learning-like methods have been proven effective in various applications such as in image restoration Jin et al. (2017); Fan et al. (2018a); Zhang et al. (2019).

Our paper is organized as follows. In section 2 we briefly review 1-dimensional conservation laws and the WENO schemes. In section 3, we discuss how to formulate the process of numerically solving conservation laws into a Markov Decision Process. Then, we present details on how to train a policy network to mimic human expert in choosing discrete schemes in a spatial-temporary adaptive manner by learning from WENO. In section 4, we conduct numerical experiments on 1-D conservation laws to demonstrate the performance of our trained policy network. Our experimental results show that the trained policy network indeed learned to adaptively choose good discrete schemes that offer better results than the state-of-the-art WENO scheme which is 5th order accurate in space and 4th order accurate in time. This serves as an evidence that the proposed RL framework has the potential to design high-performance numerical schemes for conservation laws in a data-driven fashion. Furthermore, the learned policy network generalizes well to other situations such as different initial conditions, mesh sizes, temporal discrete schemes, etc. The paper ends with a conclusion in section 5, where possible future research directions are also discussed.

2 Preliminaries

2.1 Notations

In this paper, we consider solving the following 1-D conservation laws:


For example, corresponds to the famous Burger’s Equation. We discretize the -plane by choosing a mesh with spatial size and temporal step size , and define the discrete mesh points by

We denote The finite difference methods will produce approximations to the solution on the given discrete mesh points. We denote point-wise values of the true solution to be , and the true point-wise flux values to be .

2.2 WENO – Weighted Essentially Non-Oscillatory Schemes

WENO (Weighted Essentially Non-Oscillatory) Liu et al. (1994) is a family of high order accurate finite difference schemes for solving hyperbolic conservation laws, and has been quite successful for a lot of practical problems. The key idea of WENO is a nonlinear adaptive procedure that automatically chooses the smoothest local stencil to reconstruct the numerical flux. Generally speaking, a finite difference method solves Eq.(1) by using a conservative approximation to the spatial derivative of the flux:


where is the numerical approximation to the point value and is the numerical flux generated by a numerical flux policy

which is manually designed. Note that the term “numerical flux policy" is a new terminology that we introduce in this paper, which is exactly the policy we shall learn using RL. In WENO, works as follows. Using the physical flux values , we could obtain a

order accurate polynomial interpolation

, where the indices is called a ‘stencil’. We could also use the stencil , or to obtain another three interpolants , and . The key idea of WENO is to average (with properly designed weights) all these interpolants to obtain the final reconstruction:


The weight depends on the smoothness of the stencil. A general principal is: the smoother is the stencil, the more accurate is the interpolant and hence the larger is the weight. To ensure convergence, we need the numerical scheme to be consistent and stable LeVeque (1992). It is known that WENO schemes as described above are consistent. For stability, upwinding is required in constructing the flux. The most easy way is to use the sign of the Roe speed to determine the upwind direction: if , we only average among the three interpolants , and ; if , we use , and .

Some further thoughts.

The key of the WENO method lies in how to compute the weight vector

, which primarily depends on the smoothness of the solution at local stencils. In WENO, such smoothness is characterized by handcrafted formula, and was proven to be successful in many practical problems when coupled with high-order temporal discretization. However, it remains unknown whether there are better ways to combine the stencils so that higher accuracy can be achieved. Furthermore, estimating the upwind directions is another key component of WENO, which can get quite complicated in high-dimensional situations and requires lots of logical judgments (i.e. “if/else"). Can we ease the (some time painful) coding and improve the estimation at the aid of machine learning?

3 Methods

In this section we present how to employ reinforcement learning to solve the conservation laws given by Eq.(1). To better illustrate our idea, we first show in general how to formulate the process of numerically solving a conservation law into an MDP. We then discuss how to incorporate a policy network with the WENO scheme. Our policy network targets at the following two key aspects of WENO:

  • [leftmargin=*]

  • Can we learn to choose better weights to combine the constructed fluxes?

  • Can we learn to automatically judge the upwind direction, without complicated logical judgments?

3.1 MDP Formulation

1 Input: initial values , flux , , , evolve time , left shift and right shift .
2 Output:
4 for  to  do
5       for  to  do
6             Compute the numerical flux , , …, and , , …, , e.g., using the WENO scheme
7             Compute
8             Compute = , e.g., using the Euler scheme
Algorithm 1 A Conservation Law Solving Procedure

As shown in Algorithm 1, the procedure of numerically solving a conservation law is naturally a sequential decision making problem. The key of the procedure is the numerical flux policy and the temporal scheme as shown in line 6 and 8 in Algorithm 1. Both policies could be learned using RL. However, in this paper, we mainly focus on using RL to learn the numerical flux policy , while leaving the temporal scheme with traditional numerical schemes such as the Euler scheme or the Runge–Kutta methods. A quick review of RL is given in the appendix.

Now, we demonstrate how to formulate the above procedure as an MDP and the construction of the state , action , reward and transition dynamics . Algorithm 2 shows in general how RL is incorporated into the procedure. In Algorithm 2, we use a single RL agent. Specifically, when computing :

  • [leftmargin=*]

  • The state for the RL agent is , where is a pre-processing function.

  • In general, the action of the agent is used to determine how the numerical fluxes and is computed. In the next subsection, we detail how we incorporate to be the linear weights of the fluxes computed using different stencils in the WENO scheme.

  • The reward should encourage the agent to generate a scheme that minimizes the error between its approximated value and the true value. Therefore, we define the reward generating function as , where is a another pre-processing function, e.g., a simplest choice is .

  • The transition dynamics is fully deterministic, and depends on the choice of the temporal scheme at line 10 in Algorithm 2. Note that the next state can only be constructed when we have obtained all the point values in the next time step, i.e., does not only depends on action , but also on actions (action can only determine the value ). This subtlety can be resolved by viewing the process under the framework of multi-agent RL, in which at each mesh point we use a distinct agent , and the next state depends on these agents’ joint action . However, it is impractical to train different agents as is usually very large, therefore we enforce the agents at different mesh point to share the same weight, which reduces to case of using just a single agent. The single agent can be viewed as a counterpart of a human designer who decides on the choice of a local scheme based on the current state in traditional numerical methods.

1 Input: initial values , flux , , , evolve time , left shift , right shift and RL policy
2 Output:
4 for Many iterations do
5        Construct initial states for
6        for  to  do
7               for  to  do
8                      Compute the action that determines how and is computed
9                      Compute
10                      Compute = , e.g., the Euler scheme
11                      Compute the reward .
12              Construct the next states for
13               Use any RL algorithm to train the RL policy with the transitions .
Return the well-trained RL policy .
Algorithm 2 General RL Running Procedure

3.2 RL Empowered WENO

In this subsection, we present how to transfer the actions of the RL policy network to the weights for WENO fluxes. Instead of directly using to generate the numerical flux, we use it to produce the weights of numerical fluxes computed using different stencils in WENO.

Specifically, at point (here we drop the time superscript for simplicity), to compute the numerical flux and , we first construct four fluxes and using four different stencils just as in WENO, and then use the RL policy to generate the weights of these fluxes:

The numerical flux is then constructed by averaging these fluxes:

Note that the determination of upwind direction is automatically embedded in the RL policy since it generates four weights at once. For instance, when the roe speed , we expect the weight and when , we expect . Note that the upwind direction can be very complicated in a system of equations or in the high-dimensional situations, and using the policy network to automatically embed such a process could save lots of efforts in algorithm design and implementation. Our numerical experiments show that can indeed automatically determine upwind directions for 1D scalar cases. Although this does not mean that it works for systems and/or in high-dimensions, it shows the potential of the proposed framework and value for further studies.

4 Experiments

4.1 Setup

In this subsection, we explain the general training setup. We train the RL policy network on the Burger’s equation, whose flux is computed as . In all the experiments, we set the left-shift and the right shift . The state pre-processing function will generate two vectors: , and for computing and respectively. and will be passed into the same policy neural network to produce the desired actions, as described in section 3.2. The reward pre-processing function simply computes the infinity norm, i.e.,

The policy network is a feed-forward Multi-layer Perceptron with

hidden layers, each has 64 neurons and use Relu

Goodfellow et al. (2016)

as the activation function. We use the Deep Deterministic Policy Gradient Algorithm

Lillicrap et al. (2015) to train the RL policy.

To guarantee the generalization power of the trained RL agent, we randomly sampled 20 initial conditions in the form , where , and . The goal of generating such kind of initial conditions is to ensure they have similar degree of smoothness and thus similar level of difficulty in learning. The computation domain is and with , , and evolve steps (which ensures the appearance of shocks). When training the RL agent, we use the Euler scheme for temporal discretization. The true solution needed for reward computing is generated using WENO on the same computation domain with , and the 4th order Runge-Kutta (RK4).

In the following, we denote the policy network that generates the weights of the WENO fluxes (as described in section 3.2) as RL-WENO. We randomly generated another different 10 initial conditions in the same form as training for testing.

0.02 0.04 0.05
0.002 5.66 (1.59) 5.89 (1.74) 8.76 (2.50) 9.09 (2.62) 9.71 (2.42) 10.24 (2.84)
0.003 5.64 (1.54) 5.86 (1.67) 8.73 (2.46) 9.06 (2.58) 9.75 (2.41) 10.28 (2.81)
0.004 5.63 (1.55) 5.81 (1.66) 8.72 (2.46) 9.05 (2.55) 9.61 (2.42) 10.13 (2.84)
0.005 5.08 (1.46) 5.19 (1.58) 8.29 (2.34) 8.58 (2.47) 9.30 (2.26) 9.78 (2.69)
0.006 - - 8.71 (2.49) 9.02 (2.61) 9.72 (2.38) 10.24 (2.80)
0.007 - - 8.56 (2.49) 8.84 (2.62) 9.59 (2.41) 10.12 (2.80)
0.008 - - 8.68 (2.55) 8.93 (2.66) 9.57 (2.49) 10.06 (2.92)
Table 1: Comparison of relative errors (

) of RL-WENO and WENO with standard deviations of the errors among 10 trials in the parenthesis. Temporal discretization: RK4; flux function:

0.02 0.04 0.05
0.002 4.85 (1.15) 5.17 (1.26) 7.77 (1.95) 8.05 (2.02) 8.16 (1.93) 8.56 (2.19)
0.003 - - 7.79 (1.96) 8.06 (2.03) 8.18 (1.92) 8.59 (2.18)
0.004 - - 7.72 (1.93) 7.98 (2.01) 8.15 (1.95) 8.54 (2.20)
0.005 - - - - 8.18 (1.94) 8.55 (2.15)
Table 2: Comparison of relative errors () of RL-WENO and WENO with standard deviations of the errors among 10 trials in the parenthesis. Temporal discretization: RK4; flux function: .

4.2 Results

We compare the performance of RL-WENO and WENO. We also test whether the trained RL policy can generalize to different temporal discretization schemes, mesh sizes and flux functions that are not included in training. Table 1 and Table 2 present the comparison results, where the number shows the relative error (computed as with the 2-norm taking over all ) between the approximated solution and the true solution , averaged over 250 evolving steps () and 10 random initial values. Numbers in the bracket shows the standard deviation over the 10 initial conditions. Several entries in the table are marked as ‘-’ because the corresponding CFL number () is not small enough to guarantee convergence. Recall that training of the RL-WENO was conducted with Euler time discretization, , and .

Our experimental results show that, compared with the high order accurate WENO (5th order accurate in space and 4th order accurate in time), the linear weights learned by RL not only achieves smaller errors, but also generalizes well to: 1) longer evolving time ( for training and for testing); 2) new time discretization schemes (trained on Euler, tested on RK4); 3) new mesh sizes (see Table 1 and Table 2 for results of varied and ); and 4) a new flux function (trained on shown in Table 1, tested on Table 2).

Figure 1 shows some examples of the solutions. As one can see that, the solutions generated by RL-WENO have clear advantage over WENO near singularities which is particularly challenging for numerical PDE solvers and important in applications. Figure 2 shows that the learned numerical flux policy can indeed correctly determine upwind directions and generate local numerical schemes in an adaptive fashion. More interestingly, comparing to WENO, RL-WENO seems to be able to select stencils in a different way from WENO, and eventually leads to a more accurate solution. This shows that the proposed RL framework has the potential to surpass human experts in designing numerical schemes for conservation laws.

(b) (c) (d)

(f) (g) (h)

Figure 1: First row: solutions of RL-WENO (red), WENO (blue) and exact solutions (green). Second row: zoom-in views corresponding to the first row.

(a) Weights of
(b) Weights of

Figure 2: This figure compares the weights generated by the learned numerical flux policy and those of WENO. The weights shown in (a) are ; while those in (b) are . In each of the two plots, the 4 numbers in the upper bracket of each location are the weights of RL-WENO and those in the lower bracket are the weights of WENO. The relative errors of RL-WENO and WENO are and respectively.

5 Conclusion

In this paper, we proposed a general framework to learn how to solve 1-dimensional conservation laws via deep reinforcement learning. We first discussed how the procedure of numerically solving conservation laws can be naturally cast in the form of Markov Decision Process. We then elaborated how to relate notions in numerical schemes of PDEs with those of reinforcement learning. In particular, we introduced a numerical flux policy which was able to decide on how numerical flux should be designed locally based on the current state of the solution. In other words, the numerical flux policy mimics human experts in designing numerical schemes, which made the proposed method meta-learning-like. Our numerical experiments showed that the proposed RL based solver was able to outperform high order WENO and was well generalized in various cases.

As part of the future works, we would like to consider using the numerical flux policy to inference more complicated numerical fluxes with guaranteed consistency and stability. Furthermore, we can use the proposed framework to learn a policy that can generate adaptive grids and the associated numerical schemes. Lastly, we would like consider system of conservation laws in 2nd and 3rd dimensional space.


Appendix A A Review of Reinforcement Learning

a.1 Reinforcement Learning

Reinforcement Learning (RL) is a general framework for solving sequential decision making problems. Recently, combined with deep neural networks, RL has achieved great success in various tasks such as playing video games from raw screen inputs Mnih et al. (2015), playing Go Silver et al. (2016), and robotics control Schulman et al. (2017). The sequential decision making problem RL tackles is usually formulated as a Markov Decision Process (MDP), which comprises five elements: the state space , the action space , the reward

, the transition probability of the environment

, and the discounting factor . The interactions between an RL agent and the environment forms a trajectory . The return of is the discounted sum of all its future rewards:

Similarly, the return of a state-action pair is:

A policy

in RL is a probability distribution on the action

given a state : . We say a trajectory is generated under policy if all the actions along the trajectory is chosen following , i.e., means and . Given a policy , the value of a state is defined as the expected return of all the trajectories when the agent starts at and then follows :

Similarly, the value of a state-action pair is defined as the expected return of all trajectories when the agent starts at , takes action , and then follows :

As aforementioned in introduction, in most RL algorithms the policy is optimized with regards to the values , thus naturally guarantees the long-term accumulated rewards (in our setting, the long-term accuracy of the learned schemes). Bellman Equation, one of the most important equations in RL, connects the value of a state and the value of its successor state:

The goal of RL is to find a policy to maximize the expected discounted sum of rewards starting from the initial state , , where is the initial state distribution. If we parameterize using , then we can optimize it using the famous policy gradient theorem:

where is the state distribution deduced by the policy . In this paper we focus on the case where the action space is continuous, and a lot of mature algorithms has been proposed for such a case, e.g., the Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2015), the Trust Region Policy Optimization algorithm Schulman et al. (2015), and etc.