Conservation laws are considered to be one of the fundamental laws of nature, and has broad applications in multiple fields such as physics, chemistry, biology, geology, and engineering. For example, Burger’s equation, a very classic partial differential equation (PDE) in conservation laws, has important applications in fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow.
Solving the differential equations associated with conservation laws has been a major branch of computational mathematics LeVeque (1992, 2002), and a lot of effective methods have been proposed, from classic methods such as the upwind scheme, the Lax-Friedrichs scheme, to the advanced ones such as the ENO/WENO schemes Liu et al. (1994); Shu (1998), the flux-limiter methods Jerez Galiano and Uh Zapata (2010)
, and etc. In the past few decades, these traditional methods have been proven successful in solving conservation laws. Nonetheless, the design of some of the high-end methods heavily relies on expert knowledge and the coding of these methods can be a laborious process. To ease the usage and potentially improve these traditional algorithms, machine learning, especially deep learning, has been recently incorporated into this field. For example, the ENO scheme requires lots of ‘if/else’ logical judgments when used to solve complicated system of equations or high-dimensional equations. This very much resembles the old-fashioned expert systems. The recent trend in artificial intelligence (AI) is to replace the expert systems by the so-called ‘connectionism’, e.g., deep neural networks, which leads to the recent bloom of AI. Therefore, it is natural and potentially beneficial to introduce deep learning in traditional numerical solvers of conservation laws.
1.1 Related works
In the last few years, neural networks (NNs) have been applied to solving ODEs/PDEs or the associated inverse problems. These works can be roughly classified into three categories according to the way that the NN is used.
The first type of works propose to harness the representation power of NNs, and are irrelevant to the numerical discretization based methods. For example, Raissi et al. (2017a, b) treated the NNs as new ansatz to approximate solutions of PDEs. More recent works along this line include Magiera et al. (2019); Michoski et al. (2019); Both et al. (2019). Besides, several works have focused on using NNs to establish direct mappings between the parameters of the PDEs (e.g. the coefficient field or the ground state energy) and their associated solutions Khoo et al. (2017); Khoo and Ying (2018); Li et al. (2019); Fan et al. (2018b). Furthermore, Han et al. (2018); Beck et al. (2017) proposed a method to solve very high-dimensional PDEs by converting the PDE to a stochastic control problem and use NNs to approximate the gradient of the solution.
The second type of works focus on the connection between deep neural networks (DNNs) and dynamic systems Weinan (2017); Chang et al. (2017); Lu et al. (2018); Long et al. (2018b); Chen et al. (2018). These works observed that there are connections between DNNs and dynamic systems (e.g. differential equations or unrolled optimization algorithms) so that we can combine deep learning with traditional tools from applied and computational mathematics to handle challenging tasks in inverse problems Long et al. (2018b, a); Qin et al. (2018).The main focus of these works, however, is to solve inverse problems, instead of learning numerical discretizations of differential equations. Nonetheless, these methods are closely related to numerical differential equations since learning a proper discretization is often an important auxiliary task for these methods to accurately recover the form of the differential equations.
The third type of works, which target at using NNs to learn new numerical schemes, are closely related to our work. However, we note that these works mainly fall in the setting of supervised learning (SL). For example,Discacciati et al. (2019) proposed to integrate NNs into high-order numerical solvers to predict artificial viscosity; Ray and Hesthaven (2018)
trained a multilayer perceptron to replace traditional indicators for identifying troubled-cells in high-resolution schemes for conservation laws. These works greatly advanced the development in machine learning based design of numerical schemes for conservation laws. Note that inDiscacciati et al. (2019), the authors only utilized the one-step error to train the artificial viscosity networks without taking into account the long-term accuracy of the learned numerical scheme. Ray and Hesthaven (2018) first constructed several functions with known regularities and then used them to train a neural network to predict the location of discontinuity, which was later used to choose a proper slope limiter. Therefore, the training of the NNs is separated from the numerical scheme. Then, a natural question is whether we can learn discretization of differential equations in an end-to-end fashion and the learned discrete scheme also takes long-term accuracy into account. This motivates us to employ reinforcement learning to learn good solvers for conservation laws.
1.2 Our Approach
The main objective of this paper is to design new numerical schemes in an autonomous way. We propose to use reinforcement learning (RL) to aid the process of solving the conservation laws. To our best knowledge, we are the first to propose using (deep) RL to solve evolution PDEs. We carefully design the proposed RL-based method so that the learned policy can generate high accuracy numerical schemes and can well generalize in varied situations. Details will be given in section 3.
Here, we first provide a brief discussion on the benefits of using RL instead of SL to solve conservation laws (the arguments apply to general evolution PDEs as well):
Most of the numerical solvers of conservation law can be interpreted naturally as a sequential decision making process (e.g., the approximated grid values at the current time instance definitely affects all the future approximations). Thus, it can be easily formulated as a Markov Decision Process (MDP) and solved by RL.
In almost all the RL algorithms, the policy (which is the AI agent who decides on how the solution should be approximated locally) is optimized with regards to the values , which by definition considers the long-term accumulated reward (or, error of the learned numerical scheme), thus could naturally guarantee the long-term accuracy of the learned schemes. Furthermore, it can gracefully handle the cases when the action space is discrete, which is in fact one of the major strength of RL.
Non-smooth norms such as the infinity norm of the error is often used to evaluate the performance of the learned numerical schemes. As the norm of the error serves as the loss function for the learning algorithms, computing the gradient of the infinity norm can be problematic for supervised learning, while RL does not have such problem since it does not explicitly take gradients of the loss function (i.e. the reward function for RL).
Learning the policy within the RL framework makes the algorithm meta-learning-like Schmidhuber (1987); Bengio et al. (1992); Andrychowicz et al. (2016); Li and Malik (2016); Finn et al. (2017). The learned policy can decide on which local numerical approximation to use by judging from the current state of the solution (e.g. local smoothness, oscillatory patterns, dissipation, etc). This is vastly different from regular (non-meta) learning where the algorithms directly make inference on the numerical schemes without the aid of an additional network such as . As subtle the difference as it may seem, meta-learning-like methods have been proven effective in various applications such as in image restoration Jin et al. (2017); Fan et al. (2018a); Zhang et al. (2019).
Our paper is organized as follows. In section 2 we briefly review 1-dimensional conservation laws and the WENO schemes. In section 3, we discuss how to formulate the process of numerically solving conservation laws into a Markov Decision Process. Then, we present details on how to train a policy network to mimic human expert in choosing discrete schemes in a spatial-temporary adaptive manner by learning from WENO. In section 4, we conduct numerical experiments on 1-D conservation laws to demonstrate the performance of our trained policy network. Our experimental results show that the trained policy network indeed learned to adaptively choose good discrete schemes that offer better results than the state-of-the-art WENO scheme which is 5th order accurate in space and 4th order accurate in time. This serves as an evidence that the proposed RL framework has the potential to design high-performance numerical schemes for conservation laws in a data-driven fashion. Furthermore, the learned policy network generalizes well to other situations such as different initial conditions, mesh sizes, temporal discrete schemes, etc. The paper ends with a conclusion in section 5, where possible future research directions are also discussed.
In this paper, we consider solving the following 1-D conservation laws:
For example, corresponds to the famous Burger’s Equation. We discretize the -plane by choosing a mesh with spatial size and temporal step size , and define the discrete mesh points by
We denote The finite difference methods will produce approximations to the solution on the given discrete mesh points. We denote point-wise values of the true solution to be , and the true point-wise flux values to be .
2.2 WENO – Weighted Essentially Non-Oscillatory Schemes
WENO (Weighted Essentially Non-Oscillatory) Liu et al. (1994) is a family of high order accurate finite difference schemes for solving hyperbolic conservation laws, and has been quite successful for a lot of practical problems. The key idea of WENO is a nonlinear adaptive procedure that automatically chooses the smoothest local stencil to reconstruct the numerical flux. Generally speaking, a finite difference method solves Eq.(1) by using a conservative approximation to the spatial derivative of the flux:
where is the numerical approximation to the point value and is the numerical flux generated by a numerical flux policy
which is manually designed. Note that the term “numerical flux policy" is a new terminology that we introduce in this paper, which is exactly the policy we shall learn using RL. In WENO, works as follows. Using the physical flux values , we could obtain a
order accurate polynomial interpolation, where the indices is called a ‘stencil’. We could also use the stencil , or to obtain another three interpolants , and . The key idea of WENO is to average (with properly designed weights) all these interpolants to obtain the final reconstruction:
The weight depends on the smoothness of the stencil. A general principal is: the smoother is the stencil, the more accurate is the interpolant and hence the larger is the weight. To ensure convergence, we need the numerical scheme to be consistent and stable LeVeque (1992). It is known that WENO schemes as described above are consistent. For stability, upwinding is required in constructing the flux. The most easy way is to use the sign of the Roe speed to determine the upwind direction: if , we only average among the three interpolants , and ; if , we use , and .
Some further thoughts.
The key of the WENO method lies in how to compute the weight vector
, which primarily depends on the smoothness of the solution at local stencils. In WENO, such smoothness is characterized by handcrafted formula, and was proven to be successful in many practical problems when coupled with high-order temporal discretization. However, it remains unknown whether there are better ways to combine the stencils so that higher accuracy can be achieved. Furthermore, estimating the upwind directions is another key component of WENO, which can get quite complicated in high-dimensional situations and requires lots of logical judgments (i.e. “if/else"). Can we ease the (some time painful) coding and improve the estimation at the aid of machine learning?
In this section we present how to employ reinforcement learning to solve the conservation laws given by Eq.(1). To better illustrate our idea, we first show in general how to formulate the process of numerically solving a conservation law into an MDP. We then discuss how to incorporate a policy network with the WENO scheme. Our policy network targets at the following two key aspects of WENO:
Can we learn to choose better weights to combine the constructed fluxes?
Can we learn to automatically judge the upwind direction, without complicated logical judgments?
3.1 MDP Formulation
As shown in Algorithm 1, the procedure of numerically solving a conservation law is naturally a sequential decision making problem. The key of the procedure is the numerical flux policy and the temporal scheme as shown in line 6 and 8 in Algorithm 1. Both policies could be learned using RL. However, in this paper, we mainly focus on using RL to learn the numerical flux policy , while leaving the temporal scheme with traditional numerical schemes such as the Euler scheme or the Runge–Kutta methods. A quick review of RL is given in the appendix.
Now, we demonstrate how to formulate the above procedure as an MDP and the construction of the state , action , reward and transition dynamics . Algorithm 2 shows in general how RL is incorporated into the procedure. In Algorithm 2, we use a single RL agent. Specifically, when computing :
The state for the RL agent is , where is a pre-processing function.
In general, the action of the agent is used to determine how the numerical fluxes and is computed. In the next subsection, we detail how we incorporate to be the linear weights of the fluxes computed using different stencils in the WENO scheme.
The reward should encourage the agent to generate a scheme that minimizes the error between its approximated value and the true value. Therefore, we define the reward generating function as , where is a another pre-processing function, e.g., a simplest choice is .
The transition dynamics is fully deterministic, and depends on the choice of the temporal scheme at line 10 in Algorithm 2. Note that the next state can only be constructed when we have obtained all the point values in the next time step, i.e., does not only depends on action , but also on actions (action can only determine the value ). This subtlety can be resolved by viewing the process under the framework of multi-agent RL, in which at each mesh point we use a distinct agent , and the next state depends on these agents’ joint action . However, it is impractical to train different agents as is usually very large, therefore we enforce the agents at different mesh point to share the same weight, which reduces to case of using just a single agent. The single agent can be viewed as a counterpart of a human designer who decides on the choice of a local scheme based on the current state in traditional numerical methods.
3.2 RL Empowered WENO
In this subsection, we present how to transfer the actions of the RL policy network to the weights for WENO fluxes. Instead of directly using to generate the numerical flux, we use it to produce the weights of numerical fluxes computed using different stencils in WENO.
Specifically, at point (here we drop the time superscript for simplicity), to compute the numerical flux and , we first construct four fluxes and using four different stencils just as in WENO, and then use the RL policy to generate the weights of these fluxes:
The numerical flux is then constructed by averaging these fluxes:
Note that the determination of upwind direction is automatically embedded in the RL policy since it generates four weights at once. For instance, when the roe speed , we expect the weight and when , we expect . Note that the upwind direction can be very complicated in a system of equations or in the high-dimensional situations, and using the policy network to automatically embed such a process could save lots of efforts in algorithm design and implementation. Our numerical experiments show that can indeed automatically determine upwind directions for 1D scalar cases. Although this does not mean that it works for systems and/or in high-dimensions, it shows the potential of the proposed framework and value for further studies.
In this subsection, we explain the general training setup. We train the RL policy network on the Burger’s equation, whose flux is computed as . In all the experiments, we set the left-shift and the right shift . The state pre-processing function will generate two vectors: , and for computing and respectively. and will be passed into the same policy neural network to produce the desired actions, as described in section 3.2. The reward pre-processing function simply computes the infinity norm, i.e.,
The policy network is a feed-forward Multi-layer Perceptron with2016)
as the activation function. We use the Deep Deterministic Policy Gradient AlgorithmLillicrap et al. (2015) to train the RL policy.
To guarantee the generalization power of the trained RL agent, we randomly sampled 20 initial conditions in the form , where , and . The goal of generating such kind of initial conditions is to ensure they have similar degree of smoothness and thus similar level of difficulty in learning. The computation domain is and with , , and evolve steps (which ensures the appearance of shocks). When training the RL agent, we use the Euler scheme for temporal discretization. The true solution needed for reward computing is generated using WENO on the same computation domain with , and the 4th order Runge-Kutta (RK4).
In the following, we denote the policy network that generates the weights of the WENO fluxes (as described in section 3.2) as RL-WENO. We randomly generated another different 10 initial conditions in the same form as training for testing.
|0.002||5.66 (1.59)||5.89 (1.74)||8.76 (2.50)||9.09 (2.62)||9.71 (2.42)||10.24 (2.84)|
|0.003||5.64 (1.54)||5.86 (1.67)||8.73 (2.46)||9.06 (2.58)||9.75 (2.41)||10.28 (2.81)|
|0.004||5.63 (1.55)||5.81 (1.66)||8.72 (2.46)||9.05 (2.55)||9.61 (2.42)||10.13 (2.84)|
|0.005||5.08 (1.46)||5.19 (1.58)||8.29 (2.34)||8.58 (2.47)||9.30 (2.26)||9.78 (2.69)|
|0.006||-||-||8.71 (2.49)||9.02 (2.61)||9.72 (2.38)||10.24 (2.80)|
|0.007||-||-||8.56 (2.49)||8.84 (2.62)||9.59 (2.41)||10.12 (2.80)|
|0.008||-||-||8.68 (2.55)||8.93 (2.66)||9.57 (2.49)||10.06 (2.92)|
) of RL-WENO and WENO with standard deviations of the errors among 10 trials in the parenthesis. Temporal discretization: RK4; flux function:.
|0.002||4.85 (1.15)||5.17 (1.26)||7.77 (1.95)||8.05 (2.02)||8.16 (1.93)||8.56 (2.19)|
|0.003||-||-||7.79 (1.96)||8.06 (2.03)||8.18 (1.92)||8.59 (2.18)|
|0.004||-||-||7.72 (1.93)||7.98 (2.01)||8.15 (1.95)||8.54 (2.20)|
|0.005||-||-||-||-||8.18 (1.94)||8.55 (2.15)|
We compare the performance of RL-WENO and WENO. We also test whether the trained RL policy can generalize to different temporal discretization schemes, mesh sizes and flux functions that are not included in training. Table 1 and Table 2 present the comparison results, where the number shows the relative error (computed as with the 2-norm taking over all ) between the approximated solution and the true solution , averaged over 250 evolving steps () and 10 random initial values. Numbers in the bracket shows the standard deviation over the 10 initial conditions. Several entries in the table are marked as ‘-’ because the corresponding CFL number () is not small enough to guarantee convergence. Recall that training of the RL-WENO was conducted with Euler time discretization, , and .
Our experimental results show that, compared with the high order accurate WENO (5th order accurate in space and 4th order accurate in time), the linear weights learned by RL not only achieves smaller errors, but also generalizes well to: 1) longer evolving time ( for training and for testing); 2) new time discretization schemes (trained on Euler, tested on RK4); 3) new mesh sizes (see Table 1 and Table 2 for results of varied and ); and 4) a new flux function (trained on shown in Table 1, tested on Table 2).
Figure 1 shows some examples of the solutions. As one can see that, the solutions generated by RL-WENO have clear advantage over WENO near singularities which is particularly challenging for numerical PDE solvers and important in applications. Figure 2 shows that the learned numerical flux policy can indeed correctly determine upwind directions and generate local numerical schemes in an adaptive fashion. More interestingly, comparing to WENO, RL-WENO seems to be able to select stencils in a different way from WENO, and eventually leads to a more accurate solution. This shows that the proposed RL framework has the potential to surpass human experts in designing numerical schemes for conservation laws.
(a) Weights of
|(b) Weights of|
In this paper, we proposed a general framework to learn how to solve 1-dimensional conservation laws via deep reinforcement learning. We first discussed how the procedure of numerically solving conservation laws can be naturally cast in the form of Markov Decision Process. We then elaborated how to relate notions in numerical schemes of PDEs with those of reinforcement learning. In particular, we introduced a numerical flux policy which was able to decide on how numerical flux should be designed locally based on the current state of the solution. In other words, the numerical flux policy mimics human experts in designing numerical schemes, which made the proposed method meta-learning-like. Our numerical experiments showed that the proposed RL based solver was able to outperform high order WENO and was well generalized in various cases.
As part of the future works, we would like to consider using the numerical flux policy to inference more complicated numerical fluxes with guaranteed consistency and stability. Furthermore, we can use the proposed framework to learn a policy that can generate adaptive grids and the associated numerical schemes. Lastly, we would like consider system of conservation laws in 2nd and 3rd dimensional space.
- Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems. 3981–3989.
- Beck et al. (2017) Christian Beck, E Weinan, and Arnulf Jentzen. 2017. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal of Nonlinear Science (2017), 1–57.
- Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. 1992. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks. Univ. of Texas, 6–8.
- Both et al. (2019) Gert-Jan Both, Subham Choudhury, Pierre Sens, and Remy Kusters. 2019. DeepMoD: Deep learning for Model Discovery in noisy data. arXiv preprint arXiv:1904.09406 (2019).
- Chang et al. (2017) Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. 2017. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348 (2017).
Chen et al. (2018)
Tian Qi Chen, Yulia
Rubanova, Jesse Bettencourt, and
David K Duvenaud. 2018.
Neural ordinary differential equations. InAdvances in Neural Information Processing Systems. 6571–6583.
- Discacciati et al. (2019) Niccolo’ Discacciati, Jan S Hesthaven, and Deep Ray. 2019. Controlling oscillations in high-order Discontinuous Galerkin schemes using artificial viscosity tuned by neural networks. Technical Report.
- Fan et al. (2018a) Qingnan Fan, Dongdong Chen, Lu Yuan, Gang Hua, Nenghai Yu, and Baoquan Chen. 2018a. Decouple learning for parameterized image operators. In Proceedings of the European Conference on Computer Vision (ECCV). 442–458.
- Fan et al. (2018b) Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. 2018b. A multiscale neural network based on hierarchical matrices. arXiv preprint arXiv:1807.01883 (2018).
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1126–1135.
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
- Han et al. (2018) Jiequn Han, Arnulf Jentzen, and E Weinan. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 34 (2018), 8505–8510.
- Jerez Galiano and Uh Zapata (2010) Silvia Jerez Galiano and Miguel Uh Zapata. 2010. A new TVD flux-limiter method for solving nonlinear hyperbolic equations. J. Comput. Appl. Math. 234, 5 (2010), 1395–1403.
et al. (2017)
Meiguang Jin, Stefan
Roth, and Paolo Favaro.
Noise-blind image deblurring. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3510–3518.
- Khoo et al. (2017) Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. 2017. Solving parametric PDE problems with artificial neural networks. arXiv preprint arXiv:1707.03351 (2017).
- Khoo and Ying (2018) Yuehaw Khoo and Lexing Ying. 2018. SwitchNet: a neural network model for forward and inverse scattering problems. arXiv preprint arXiv:1810.09675 (2018).
- LeVeque (1992) Randall J LeVeque. 1992. Numerical methods for conservation laws. Vol. 132. Springer.
- LeVeque (2002) Randall J LeVeque. 2002. Finite volume methods for hyperbolic problems. Vol. 31. Cambridge university press.
- Li and Malik (2016) Ke Li and Jitendra Malik. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885 (2016).
- Li et al. (2019) Yingzhou Li, Jianfeng Lu, and Anqi Mao. 2019. Variational training of neural network approximations of solution maps for physical models. arXiv preprint arXiv:1905.02789 (2019).
- Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
- Liu et al. (1994) Xu-Dong Liu, Stanley Osher, and Tony Chan. 1994. Weighted essentially non-oscillatory schemes. Journal of computational physics 115, 1 (1994), 200–212.
- Long et al. (2018a) Zichao Long, Yiping Lu, and Bin Dong. 2018a. PDE-Net 2.0: Learning PDEs from Data with A Numeric-Symbolic Hybrid Deep Network. arXiv preprint arXiv:1812.04426 (2018).
- Long et al. (2018b) Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. 2018b. PDE-Net: Learning PDEs from Data. In ICML.
- Lu et al. (2018) Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. 2018. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. In ICML.
- Magiera et al. (2019) Jim Magiera, Deep Ray, Jan S Hesthaven, and Christian Rohde. 2019. Constraint-Aware Neural Networks for Riemann Problems. arXiv preprint arXiv:1904.12794 (2019).
- Michoski et al. (2019) Craig Michoski, Milos Milosavljevic, Todd Oliver, and David Hatch. 2019. Solving Irregular and Data-enriched Differential Equations using Deep Neural Networks. arXiv preprint arXiv:1905.04351 (2019).
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
- Qin et al. (2018) Tong Qin, Kailiang Wu, and Dongbin Xiu. 2018. Data driven governing equations approximation using deep neural networks. arXiv preprint arXiv:1811.05537 (2018).
- Raissi et al. (2017a) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. 2017a. Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561 (2017).
- Raissi et al. (2017b) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. 2017b. Physics informed deep learning (part ii): Data-driven discovery of nonlinear partial differential equations. arXiv preprint arXiv:1711.10566 (2017).
- Ray and Hesthaven (2018) Deep Ray and Jan S Hesthaven. 2018. An artificial neural network as a troubled-cell indicator. J. Comput. Phys. 367 (2018), 166–191.
- Schmidhuber (1987) Jurgen Schmidhuber. 1987. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-… hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich (1987).
- Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 1889–1897.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
- Shu (1998) Chi-Wang Shu. 1998. Essentially non-oscillatory and weighted essentially non-oscillatory schemes for hyperbolic conservation laws. In Advanced numerical approximation of nonlinear hyperbolic equations. Springer, 325–432.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
- Weinan (2017) E Weinan. 2017. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5, 1 (2017), 1–11.
- Zhang et al. (2019) Xiaoshuai Zhang, Yiping Lu, Jiaying Liu, and Bin Dong. 2019. Dynamically unfolding recurrent restorer: A moving endpoint control method for image restoration. ICLR (2019).
Appendix A A Review of Reinforcement Learning
a.1 Reinforcement Learning
Reinforcement Learning (RL) is a general framework for solving sequential decision making problems. Recently, combined with deep neural networks, RL has achieved great success in various tasks such as playing video games from raw screen inputs Mnih et al. (2015), playing Go Silver et al. (2016), and robotics control Schulman et al. (2017). The sequential decision making problem RL tackles is usually formulated as a Markov Decision Process (MDP), which comprises five elements: the state space , the action space , the reward
, the transition probability of the environment, and the discounting factor . The interactions between an RL agent and the environment forms a trajectory . The return of is the discounted sum of all its future rewards:
Similarly, the return of a state-action pair is:
in RL is a probability distribution on the actiongiven a state : . We say a trajectory is generated under policy if all the actions along the trajectory is chosen following , i.e., means and . Given a policy , the value of a state is defined as the expected return of all the trajectories when the agent starts at and then follows :
Similarly, the value of a state-action pair is defined as the expected return of all trajectories when the agent starts at , takes action , and then follows :
As aforementioned in introduction, in most RL algorithms the policy is optimized with regards to the values , thus naturally guarantees the long-term accumulated rewards (in our setting, the long-term accuracy of the learned schemes). Bellman Equation, one of the most important equations in RL, connects the value of a state and the value of its successor state:
The goal of RL is to find a policy to maximize the expected discounted sum of rewards starting from the initial state , , where is the initial state distribution. If we parameterize using , then we can optimize it using the famous policy gradient theorem:
where is the state distribution deduced by the policy . In this paper we focus on the case where the action space is continuous, and a lot of mature algorithms has been proposed for such a case, e.g., the Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2015), the Trust Region Policy Optimization algorithm Schulman et al. (2015), and etc.