FD-Net with Auxiliary Time Steps: Fast Prediction of PDEs using Hessian-Free Trust-Region Methods

by   Nur Sila Gulgec, et al.

Discovering the underlying physical behavior of complex systems is a crucial, but less well-understood topic in many engineering disciplines. This study proposes a finite-difference inspired convolutional neural network framework to learn hidden partial differential equations from given data and iteratively estimate future dynamical behavior. The methodology designs the filter sizes such that they mimic the finite difference between the neighboring points. By learning the governing equation, the network predicts the future evolution of the solution by using only a few trainable parameters. In this paper, we provide numerical results to compare the efficiency of the second-order Trust-Region Conjugate Gradient (TRCG) method with the first-order ADAM optimizer.




1 Introduction

Partial differential equations (PDEs) are widely adopted in engineering fields to explain a variety of phenomena such as heat, diffusion, electrodynamics, fluid dynamics, elasticity, and quantum mechanics. Rapid developments in sensing and storage capabilities give engineers access to more knowledge about these phenomena. The massive data collected from multidimensional systems have the potential to provide a better understanding of system dynamics and lead to the discovery of more complex systems.

Exploiting data to discover physical laws has recently been investigated in several studies. Schmidt and Lipson (2009); Bongard and Lipson (2007) applied symbolic regression, and Rudy et al. (2017); Schaeffer (2017) proposed sparse regression techniques to explain nonlinear dynamical systems. Raissi and Karniadakis (2018); Raissi et al. (2017) introduced physics-informed neural networks using Gaussian processes. Chen et al. (2018) demonstrated continuous-depth residual networks and continuous-time latent variable models to train ordinary neural networks. Farimani et al. (2017) proposed conditional generative adversarial networks, and Long et al. (2017) proposed PDE-Net, which originates from wavelet theory.

This study proposes a finite-difference inspired convolutional neural network framework to learn the hidden partial differential equations from the given data and iteratively estimate the future dynamical behavior with only a few parameters. Additionally, we introduce auxiliary time steps to achieve higher accuracy in the solutions.

While first-order methods have been extensively used in training deep neural networks, they struggle to guarantee training efficiency. Because they only consider first-order information, these methods are sensitive to hyper-parameter settings and have difficulty escaping saddle points, among other issues. Hessian-free (second-order) methods Martens (2010) use curvature information, make more progress in every iteration, reduce the amount of hyper-parameter tuning, and only require Hessian-vector products. In this paper, the ADAM Kingma and Ba (2014) and TRCG Nocedal and Wright (2006); Steihaug (1983) methods are used to train the proposed network. The empirical results demonstrate that this particular second-order method is more favorable than ADAM for providing high-accuracy results in our engineering application of deep learning.

The rest of the paper is organized as follows. First, the motivation of our approach is provided in Section 2; then, the proposed methodology is described in Section 3. In Sections 4 and 5, the numerical study is introduced and the main findings of this study are discussed, respectively.

2 Motivation

Let us consider a partial differential equation of the general form

$$u_t = \mathcal{N}(u, u_x, u_{xx}, \dots) \qquad (1)$$

where $\mathcal{N}$ is a non-linear function of $u(x, t)$ and its partial derivatives in time or space, which are denoted by subscripts. The objective of the study is to implicitly learn $\mathcal{N}$ from time-series measurements given at specific time instances and to predict the behavior of the equation over long time sequences.

For easier interpretation, the proposed algorithm is explained through a motivating problem. Parabolic evolution equations describe processes that evolve in time. The heat equation is one of the most frequently used examples in physics and mathematics and describes how heat evolves over time in an object Incropera et al. (2007). Let $u(x, t)$ denote the temperature at point $x$ at time $t$. The heat equation has the following form for a 1-D bar of length $L$:

$$u_t = \kappa\, u_{xx} \qquad (2)$$

where $\kappa$ is a constant called the thermal conductivity of the material. The heat equation is accompanied by boundary conditions. If the boundaries are perfectly insulated, the boundary conditions reduce to

$$u_x(0, t) = 0, \qquad u_x(L, t) = 0 \qquad (3)$$

The heat equation can be solved with the Euler method, where $x$ and $t$ are discretized as $x_j = j\,\Delta x$ and $t_n = n\,\Delta t$ to approximate the directional derivatives:

$$u_j^{n+1} = u_j^n + r\,(u_{j+1}^n - 2u_j^n + u_{j-1}^n) \qquad (4)$$

where $r = \kappa\,\Delta t / \Delta x^2$. When the individual time steps are too far from each other, the Euler method fails to provide a good solution. The stability criterion is satisfied only when $r \le 1/2$ Olsen-Kettle (2011). Additionally, for each prediction step, the boundary conditions and values are assumed to be known, which is not necessarily true in real applications. In order to address these challenges, a data-driven approach is proposed.
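The stability limit of the Euler method can be illustrated with a minimal NumPy sketch. It assumes the standard forward-time, centered-space update with a single ratio `r` (diffusion constant times time step, divided by the squared grid spacing) and mirrored ghost points for the insulated ends. The function names, grid size, and initial profile are illustrative choices, not taken from the paper.

```python
import numpy as np

def euler_heat_step(u, r):
    """One forward-Euler step for the 1-D heat equation with insulated ends.

    r is the (assumed) ratio kappa * dt / dx**2.
    """
    # Mirror the inner neighbours so that the spatial derivative vanishes at both ends.
    padded = np.concatenate(([u[1]], u, [u[-2]]))
    return u + r * (padded[2:] - 2.0 * padded[1:-1] + padded[:-2])

def euler_heat(u0, r, n_steps):
    """Apply n_steps Euler updates starting from the profile u0."""
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(n_steps):
        u = euler_heat_step(u, r)
    return u

# Illustrative initial profile: a smooth mode plus a small grid-scale ripple.
x = np.linspace(0.0, 1.0, 21)
u0 = np.cos(np.pi * x) + 0.01 * np.cos(20.0 * np.pi * x)

stable = euler_heat(u0, r=0.4, n_steps=100)    # r <= 1/2: the ripple decays
unstable = euler_heat(u0, r=0.6, n_steps=100)  # r > 1/2: the ripple is amplified
print(np.max(np.abs(stable)), np.max(np.abs(unstable)))
```

With `r = 0.4` every update is a convex combination of neighboring values, so the solution stays bounded, whereas with `r = 0.6` the grid-scale ripple grows in magnitude at every step.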

3 Methodology

The proposed approach is inspired by the finite difference approximation. Each directional derivative in the $x$ direction is defined as a trainable finite difference filter of size three (i.e., one parameter for the left neighbor, one for the point itself, and one for the right neighbor). The trainable parameters only include weights, without any nonlinear activation function or bias. When higher-order partial derivatives are present, multiple sets of learnable weights are considered during training. At the boundaries, a filter of size two is adopted since there is only one neighbor. The main benefit of using such filters is to reduce the number of parameters of the network and to use more natural and interpretable building blocks for engineering applications.
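To make the filter layout concrete, the following sketch applies a size-three filter at interior points and size-two filters at the two ends. The weights here are fixed by hand to the classical second-difference stencil in order to show what such a filter can represent; in the proposed network they would be trainable. The function name `fd_filter` and the test profile are hypothetical.

```python
import numpy as np

def fd_filter(u, w_interior, w_left, w_right):
    """Size-3 linear filter in the interior, size-2 filters at the ends (no bias, no activation)."""
    u = np.asarray(u, dtype=float)
    out = np.empty_like(u)
    # Left neighbour, the point itself, right neighbour.
    out[1:-1] = w_interior[0] * u[:-2] + w_interior[1] * u[1:-1] + w_interior[2] * u[2:]
    # Only one neighbour is available at each end.
    out[0] = w_left[0] * u[0] + w_left[1] * u[1]
    out[-1] = w_right[0] * u[-2] + w_right[1] * u[-1]
    return out

x = np.linspace(0.0, 1.0, 101)
dx = x[1] - x[0]
u = np.cos(np.pi * x)

# Hand-set weights: the classical second-difference stencil, with the
# boundary versions obtained by mirroring the missing neighbour.
approx = fd_filter(
    u,
    w_interior=np.array([1.0, -2.0, 1.0]) / dx**2,
    w_left=np.array([-2.0, 2.0]) / dx**2,
    w_right=np.array([2.0, -2.0]) / dx**2,
)
exact = -np.pi**2 * np.cos(np.pi * x)
print(np.max(np.abs(approx - exact)))
```

With these weights the filter output approximates the second spatial derivative of the profile, which is the kind of quantity the learned filters are meant to recover from data.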

Figure 1: Predicting the function with artificial time steps.

In order to increase accuracy and stability, "artificial" time steps are introduced to the network (Figure 1), where the function value is computed from a linear combination of the input and the feature maps obtained from the difference approximations. These steps are repeated until the prediction at the next measured time step is obtained. A similar idea is used in residual neural networks He et al. (2016) because of its ease of optimization; in our case, however, it is a necessity for obtaining solutions of unstable PDEs. The relationship between these iterative updates and Euler discretization is also discussed in Chen et al. (2018).
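The effect of artificial time steps can be sketched through the Euler analogy. The example below is not the network itself: a fixed second-difference stencil stands in for the learned filters, and one measured step with ratio `r` is split into `k` residual sub-steps of ratio `r / k`. The names `predict_next` and `second_difference`, and all numerical values, are assumptions made for illustration.

```python
import numpy as np

def second_difference(u):
    """Fixed stand-in for the learned filters, with mirrored (insulated) ends."""
    padded = np.concatenate(([u[1]], u, [u[-2]]))
    return padded[2:] - 2.0 * padded[1:-1] + padded[:-2]

def predict_next(u, r, k):
    """Advance one measured time step through k residual sub-steps."""
    for _ in range(k):
        # Residual form: input plus a linear combination of filter outputs.
        u = u + (r / k) * second_difference(u)
    return u

x = np.linspace(0.0, 1.0, 21)
u0 = np.cos(np.pi * x) + 0.01 * np.cos(20.0 * np.pi * x)

r = 2.0  # well beyond the single-step stability limit of 1/2
u_single, u_split = u0.copy(), u0.copy()
for _ in range(30):
    u_single = predict_next(u_single, r, k=1)   # one large step per prediction
    u_split = predict_next(u_split, r, k=10)    # ten auxiliary sub-steps per prediction
print(np.max(np.abs(u_single)), np.max(np.abs(u_split)))
```

In this analogy each sub-step works with the smaller ratio `r / k`, which is why the split update stays bounded while the single-step update diverges.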


Training might take a considerable amount of time when working with long sequences. The proposed approach addresses this problem by training the architecture with randomized mini-batches. During each iteration, we generate samples from randomly picked time intervals, where each sample is taken from the $i$th time series at time $t$. For comparison purposes, the first-order ADAM Kingma and Ba (2014) and second-order TRCG Nocedal and Wright (2006); Steihaug (1983) methods are used to train the proposed network.
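A randomized mini-batch of this kind could be drawn as in the sketch below. It assumes the measurements are stored in an array of shape `(n_series, n_times, n_points)` and that each training pair consists of a snapshot and its successor. The storage layout, the function name `sample_minibatch`, and the synthetic data are hypothetical.

```python
import numpy as np

def sample_minibatch(data, batch_size, rng):
    """Pick random (series, time) pairs and return snapshots with their successors.

    data has the assumed shape (n_series, n_times, n_points).
    """
    n_series, n_times, _ = data.shape
    series_idx = rng.integers(0, n_series, size=batch_size)
    # The last time index is excluded so that time_idx + 1 is always valid.
    time_idx = rng.integers(0, n_times - 1, size=batch_size)
    inputs = data[series_idx, time_idx]
    targets = data[series_idx, time_idx + 1]
    return inputs, targets, series_idx, time_idx

rng = np.random.default_rng(0)
data = rng.normal(size=(8, 50, 21))  # synthetic placeholder data
inputs, targets, series_idx, time_idx = sample_minibatch(data, 4, rng)
print(inputs.shape, targets.shape)
```

Because each pair only needs two consecutive snapshots, the cost of one iteration does not depend on the length of the full sequence.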

The TRCG method Nocedal and Wright (2006) uses Steihaug's Conjugate Gradient (CG) method Steihaug (1983) to approximately solve the trust-region subproblem and obtain a search direction. Compared with the exact Newton's method, CG requires only the computation of Hessian-vector products, without explicitly storing the Hessian. This feature makes TRCG a Hessian-free method Martens (2010) and suitable for our deep learning application, where the Hessian matrix can be immense because its size grows quadratically with the number of parameters. To make TRCG more practical for the proposed network and datasets, stochastic mini-batch training is adopted: at every iteration of TRCG, one mini-batch is randomly selected to compute the gradient and, within CG, the Hessian-vector products used to solve the trust-region subproblem.
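The trust-region subproblem solver can be sketched as follows. This is a textbook-style version of Steihaug's CG, written under the assumption that the Hessian is available only through a callable `hvp(v)` returning a Hessian-vector product. In a neural network that product would typically come from automatic differentiation; here a small explicit matrix is used purely for demonstration. The function names and tolerances are illustrative, and the outer trust-region logic (radius updates, step acceptance) is omitted.

```python
import numpy as np

def _step_to_boundary(p, d, radius):
    """Positive tau such that the norm of p + tau * d equals the radius."""
    a = d @ d
    b = 2.0 * (p @ d)
    c = p @ p - radius**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def steihaug_cg(grad, hvp, radius, tol=1e-8, max_iter=50):
    """Approximately minimize grad.p + 0.5 * p.H.p subject to norm(p) <= radius."""
    p = np.zeros_like(grad)
    r = grad.copy()  # residual of the Newton system H p = -grad
    d = -r
    if np.linalg.norm(r) < tol:
        return p
    for _ in range(max_iter):
        Hd = hvp(d)  # the only place where the Hessian is touched
        dHd = d @ Hd
        if dHd <= 0.0:
            # Negative curvature: follow d to the trust-region boundary.
            return p + _step_to_boundary(p, d, radius) * d
        alpha = (r @ r) / dHd
        p_next = p + alpha * d
        if np.linalg.norm(p_next) >= radius:
            # The CG step leaves the region: stop on the boundary.
            return p + _step_to_boundary(p, d, radius) * d
        r_next = r + alpha * Hd
        if np.linalg.norm(r_next) < tol:
            return p_next
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        p, r = p_next, r_next
    return p

# Demonstration on a small quadratic model with an explicit matrix.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
p_large = steihaug_cg(g, lambda v: A @ v, radius=10.0)  # interior solution
p_small = steihaug_cg(g, lambda v: A @ v, radius=0.1)   # stops on the boundary
print(p_large, np.linalg.norm(p_small))
```

When the radius is large the routine returns the Newton step, and when the radius is small or negative curvature is met it returns a point on the trust-region boundary, without ever forming the Hessian.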

Architecture: The general map of the FD-Net architecture is shown in Figure 2. It shows an example of one artificial time step for a selected time from a sample generated from the PDE. The sample is passed through two sets of trainable finite difference layers, and the result of each layer is aggregated through a fully connected (FC) layer. Then, the output of the FC layer is fed into a residual building block to predict the function behavior at the next auxiliary time step. The loss function is defined as the mean squared error between the predicted and true function values, and it is penalized more heavily at the boundaries.
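One simple way to penalize the boundaries more heavily is a weighted mean squared error, sketched below. The weighting scheme and the value `boundary_weight=10.0` are assumptions for illustration; the text does not state how the extra penalty is implemented.

```python
import numpy as np

def boundary_weighted_mse(pred, target, boundary_weight=10.0):
    """Mean squared error with a larger (assumed) weight on the two end points."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.ones(pred.shape[-1])
    weights[0] = weights[-1] = boundary_weight  # first and last grid point
    return float(np.mean(weights * (pred - target) ** 2))

pred = np.zeros(5)
target = np.ones(5)
loss_weighted = boundary_weighted_mse(pred, target)                    # (10 + 1 + 1 + 1 + 10) / 5
loss_plain = boundary_weighted_mse(pred, target, boundary_weight=1.0)  # ordinary MSE
print(loss_weighted, loss_plain)
```

Setting `boundary_weight` to one recovers the ordinary mean squared error, so the extra penalty can be tuned independently of the rest of the loss.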

Figure 2: FD-Net predicting an auxiliary time step from the preceding time step.

4 Numerical Study

A dataset of samples is generated with varying initial conditions by drawing a different initial-condition parameter for each sample from a normal distribution. All samples share the same space-time domain. The dataset is split randomly into train/test sets following a 75/25 ratio. The samples are produced with fixed physical parameters and with varying time discretization to obtain a stable and an unstable case. The boundary conditions and the initial condition of the problem are defined as in (3) and (5), respectively. The reference solution of the heat conduction problem is adopted from Olsen-Kettle (2011).


5 Results and Discussion

During testing, the function value at time $t + \Delta t$ is predicted by using the function value at time $t$. Then, the function value at time $t + 2\Delta t$ is predicted by using the predicted function value at time $t + \Delta t$. These predictions are repeated for the full length of the sequence. The RMSE between the true and predicted sequences is then computed.
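The testing procedure amounts to an autoregressive rollout followed by an error computation, which can be sketched as below. The callable `step_fn` is a hypothetical stand-in for the trained network's one-step prediction; a simple decay map is used here only so that the example runs.

```python
import numpy as np

def rollout(step_fn, u0, n_steps):
    """Feed each prediction back as the next input; returns an array of shape (n_steps + 1, n_points)."""
    preds = [np.asarray(u0, dtype=float)]
    for _ in range(n_steps):
        preds.append(step_fn(preds[-1]))
    return np.stack(preds)

def rmse(pred, true):
    """Root mean squared error over all time steps and grid points."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2)))

# Placeholder one-step model: halve every value at each step.
predicted = rollout(lambda u: 0.5 * u, np.ones(3), n_steps=3)
print(predicted[-1], rmse(predicted, predicted))
```

Because every prediction is fed back as input, errors made early in the sequence propagate to all later steps, which is why the error over the full sequence is the quantity of interest.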

Figure 3: Testing error of FD-Net with .

To compare the performance of ADAM and TRCG in training the proposed network, we conduct experiments with various random seeds and mini-batch sizes on the dataset of the stable case. For ADAM, we use two learning rates. In each experiment, depending on the mini-batch size, ADAM is allowed to run for a larger number of epochs, while TRCG is given only a small budget of epochs.

Figure 4: Prediction of a sample at , , , and .

In spite of its small budget, TRCG reduces the testing RMSE to a smaller order of magnitude than ADAM is able to reach. Figure 3 presents an example result from one experiment, with a fixed random seed and mini-batch size, that illustrates the empirical performance of ADAM and TRCG on the proposed network well. The results demonstrate a relatively slow convergence of ADAM and suggest that, for the proposed network, second-order information is important and that the search directions generated by TRCG seem to capture this information.

The predictions on the testing data are also investigated. Figure 4 shows the predictions obtained by the proposed method trained with TRCG and with ADAM, and by the Euler approach, at several time instances. As can be seen from the figure, although the function characteristics change drastically in the longer term, the proposed architecture is able to determine the behavior with only a few parameters. The most accurate prediction is achieved by FD-Net trained with the TRCG method.

             FD-Net with TRCG
Batch Size   k = 1    k = 10   k = 20   Euler
32           0.0345   0.0037   0.0028   73.787
64           0.0342   0.0038   0.0033   73.787
128          0.0337   0.0079   0.0079   73.787
Table 1: RMSE for the unstable case.

Since the prediction at one time instance affects the prediction at the next, the effect of error accumulation is tested for the unstable case with different numbers of artificial time steps, $k = 1$, $10$, and $20$. Table 1 shows that the adopted approach performs better than the Euler approach in the unstable case and that increasing the number of artificial time steps increases the accuracy of the method. Although our approach mimics the Euler method when $k = 1$, better performance is still observed.

6 Acknowledgements

Research funding is partially provided by the National Science Foundation through Grant No. CMMI-1351537 by Hazard Mitigation and Structural Engineering program, and by a grant from the Commonwealth of Pennsylvania, Department of Community and Economic Development, through the Pennsylvania Infrastructure Technology Alliance (PITA). Martin Takáč was supported by National Science Foundation grants CCF-1618717, CMMI-1663256 and CCF-1740796.