Model Reduction with Memory and the Machine Learning of Dynamical Systems

08/10/2018 ∙ by Chao Ma, et al.

The well-known Mori-Zwanzig theory tells us that model reduction leads to memory effects. For a long time, modeling the memory effect accurately and efficiently has been an important but nearly impossible task in developing a good reduced model. In this work, we explore a natural analogy between recurrent neural networks and the Mori-Zwanzig formalism to establish a systematic approach for developing reduced models with memory. Two training models, a direct training model and a dynamically coupled training model, are proposed and compared. We apply these methods to the Kuramoto-Sivashinsky equation and the Navier-Stokes equation. Numerical experiments show that the proposed method can produce reduced models with good performance on both short-term prediction and long-term statistical properties.






1 The Mori-Zwanzig formalism

The starting point of the M-Z formalism is to divide all the variables into resolved and unresolved ones. The basic idea is to project functions of all variables onto the space of functions of only the resolved variables. The original dynamical system then becomes an equation for the resolved variables with memory and noise. This equation is called the Generalized Langevin Equation (GLE). Below we first demonstrate the idea of the M-Z formalism using a simple linear example. We then use the Kuramoto-Sivashinsky equation as an example to show how the M-Z formalism helps us develop a reduced model.

1.1 A brief review of the M-Z formalism

Consider a linear system

dx/dt = A11 x + A12 y,   (7)
dy/dt = A21 x + A22 y,   (8)

with initial values x(0) = x0, y(0) = y0. We take x as the resolved variable and y as the unresolved variable, and want to develop a reduced system for x. To achieve this, we can first fix x in (8) and solve for y, and then insert the result into (7). In this way we get

dx/dt = A11 x(t) + A12 ∫_0^t e^{A22(t−s)} A21 x(s) ds + A12 e^{A22 t} y0.   (9)

There are three terms on the right-hand side of (9). The first term is a Markovian term in x; the second term is a memory term. The third term contains the initial value y0 of the unresolved variable and will be viewed as a noise term. Hence, by eliminating y from the original linear system, we obtain an equation for x with memory and noise.
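To make the structure of (9) concrete, the following minimal NumPy sketch compares a full simulation of a toy two-variable linear system with a direct integration of the reduced equation. The coefficients, initial values, and step sizes are illustrative choices, not values from the paper.

```python
import numpy as np

# Toy 2x2 linear system: dx/dt = a11*x + a12*y, dy/dt = a21*x + a22*y.
# Coefficients and initial values are illustrative.
a11, a12, a21, a22 = -1.0, 0.5, 0.3, -2.0
x0, y0 = 1.0, 0.4
dt, T = 1e-3, 2.0
n = int(T / dt)

# Full simulation of (x, y) by forward Euler.
x_full, y = x0, y0
for _ in range(n):
    x_full, y = (x_full + dt * (a11 * x_full + a12 * y),
                 y + dt * (a21 * x_full + a22 * y))

# Reduced GLE for x alone: Markovian term + memory term + noise term.
xs = [x0]
for k in range(n):
    t = k * dt
    s = np.arange(k + 1) * dt
    # Left Riemann sum of the memory integral over the history of x.
    memory = a12 * a21 * np.sum(np.exp(a22 * (t - s)) * np.array(xs)) * dt
    noise = a12 * np.exp(a22 * t) * y0
    xs.append(xs[-1] + dt * (a11 * xs[-1] + memory + noise))

err = abs(x_full - xs[-1])  # the two integrations agree up to O(dt)
```

On this toy problem the reduced equation reproduces the resolved variable of the full simulation up to discretization error, which is the content of the GLE in the linear case.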

A similar derivation applies to nonlinear ODE systems. Assume we have a system

dz/dt = R(z),   (10)

with z and R(z) being vectors. We split z into resolved and unresolved parts, z = (x, y), and take x as the resolved variables. Then, by the Mori-Zwanzig formalism, we can get a GLE for x,

dx/dt = R̄(x(t)) + ∫_0^t K(x(t−s), s) ds + f(t, z0).   (11)

In (11), R̄ is the Markovian term, the integral is the memory term, with the kernel K acting on the history of x, and f(t, z0) is the noise term, which depends on the initial value z0. Although (11) is more complicated than (9), the essential ingredients are similar. The M-Z formalism tells us that model reduction leads to memory effects, and it inspires us to pursue the application of RNNs as a tool for performing efficient yet rigorous model reduction.

1.2 Application to the K-S equation

Going back to the K-S equation (1), let û and Ĝ be the Fourier transforms of the solution u and the filter G, respectively. Then, by (3), we have ū̂_k = Ĝ_k û_k for every frequency k. Here, we take a spectrally sharp filter which satisfies Ĝ_k = 1 for |k| ≤ K and Ĝ_k = 0 otherwise, for a certain positive integer K. Then, the Fourier transform of the filtered solution ū is a truncation of the Fourier transform of u, and our resolved variables are the Fourier modes with |k| ≤ K. Writing the K-S equation in Fourier space, and putting all terms involving unresolved variables on the right-hand side of the equation, we get


Applying the Fourier transform to (4) and comparing with (12), we have


for all resolved modes. Using the M-Z theory, the sub-grid stress can be expressed as a memory term and a noise term. By Galilean invariance, the sub-grid stress can be expressed as a function of the history of the resolved strain when the noise effect is negligible.
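As a concrete illustration of the spectrally sharp filter, the following sketch truncates the Fourier modes of a one-dimensional periodic field; the grid size N and cutoff K are illustrative choices.

```python
import numpy as np

# Sharp spectral filter: keep Fourier modes with |k| <= K, zero out the rest.
N, K = 256, 16                               # grid size and cutoff (illustrative)
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
u = np.sin(x) + 0.3 * np.sin(20 * x)         # resolved mode + sub-grid mode

u_hat = np.fft.fft(u)
k = np.fft.fftfreq(N, d=1.0 / N)             # integer wavenumbers
u_hat[np.abs(k) > K] = 0.0                   # spectral truncation
u_bar = np.real(np.fft.ifft(u_hat))          # filtered (resolved) field
```

The filtered field retains the k = 1 component and discards the k = 20 component, exactly as the sharp cutoff prescribes.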

2 Recurrent Neural Networks and LSTM

In this section, we briefly introduce recurrent neural networks and long short-term memory networks.

2.1 Recurrent neural networks

RNNs are neural networks with recurrent connections, designed to deal with time series. Usually, a recurrent connection links some units in a neural network to themselves. This connection means that the values of these units depend on their values at the previous time step. As a simple example, consider a one-layer RNN with a time series (x_t) as input. Let h_t and y_t be the hidden values and output values at time step t, respectively. Then, the simplest RNN model is given by

h_t = σ(W_h h_{t−1} + W_x x_t + b_h),
y_t = W_y h_t + b_y,

where W_h, W_x, W_y, b_h, and b_y are matrices or vectors of trainable parameters, and σ is a nonlinear activation function. To deal with complicated time series, the output of an RNN is sometimes fed to another RNN to form a multi-layer RNN.
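The update rule above can be sketched directly in NumPy; the dimensions and random initialization below are illustrative.

```python
import numpy as np

# Minimal one-layer RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b_h), y_t = W_y h_t + b_y.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 8, 2                   # illustrative sizes
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
b_h = np.zeros(d_h)
W_y = rng.normal(scale=0.1, size=(d_out, d_h))
b_y = np.zeros(d_out)

def rnn_forward(xs):
    """Run the RNN over a sequence xs of shape (T, d_in); return outputs (T, d_out)."""
    h = np.zeros(d_h)
    ys = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b_h)  # same parameters at every step
        ys.append(W_y @ h + b_y)
    return np.array(ys)

ys = rnn_forward(rng.normal(size=(5, d_in)))
```

Note that the same five parameter arrays are reused at every time step, which is the parameter sharing discussed below.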

Note that in RNNs the same group of parameters is shared by all time steps. In this sense, RNNs are stationary models. Among other things, this allows RNNs to learn information in long time series without introducing too many parameters. The training of RNNs is similar to the training of regular feed-forward neural networks, except that we have to propagate gradients through the time direction, such as the gradient of the output at one time step with respect to the hidden values at earlier steps. Usually, RNNs are trained by the so-called back-propagation through time (BPTT) method.

2.2 Long short-term memory networks

Theoretically, RNNs are capable of learning long-term memory effects in time series. However, in practice it is hard for RNNs to capture such dependencies, because of the exploding or vanishing gradient problem [13], [14]. The Long Short-Term Memory (LSTM) network is designed to solve this problem. Proposed by Hochreiter and Schmidhuber [15], the LSTM introduces a new group of hidden units called states, and uses gates to control the information flow through the states. Since the updating rule of the states is incremental instead of compositional as in plain RNNs, the gradients are less likely to explode or vanish. The computational rule of an LSTM cell is

f_t = σ(W_f [h_{t−1}, x_t] + b_f),
i_t = σ(W_i [h_{t−1}, x_t] + b_i),
o_t = σ(W_o [h_{t−1}, x_t] + b_o),
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_c [h_{t−1}, x_t] + b_c),
h_t = o_t ∘ tanh(c_t),

where f_t, i_t, and o_t are called the forget gate, input gate, and output gate, respectively, c_t is the state, h_t is the hidden value, and ∘ denotes elementwise multiplication. LSTMs can also be trained by BPTT.
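The cell update can be written out in a few lines of NumPy. This is a sketch of the standard LSTM equations; the stacked weight layout and the sizes are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM cell update following the standard equations.
    W stacks the four gate weight matrices, shape (4*d_h, d_h + d_in)."""
    d_h = h.shape[0]
    z = W @ np.concatenate([h, x]) + b
    f = sigmoid(z[:d_h])               # forget gate: how much old state to keep
    i = sigmoid(z[d_h:2 * d_h])        # input gate: how much new content to write
    o = sigmoid(z[2 * d_h:3 * d_h])    # output gate: how much state to expose
    g = np.tanh(z[3 * d_h:])           # candidate state content
    c_new = f * c + i * g              # incremental (not compositional) update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# One step with illustrative sizes.
rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```

The incremental update `c_new = f * c + i * g` is the reason gradients flow through many steps without repeatedly composing nonlinearities.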

3 The Two Training Models

For simplicity, we will focus on learning the memory-dependent terms in the GLE, neglecting for now the noise term. For the K-S equation, as shown in Figure 3, by directly fitting the stress using the history of the strain, we can bring the error down to a low level while maintaining a small gap between the testing and training errors. This shows that we are not overfitting, and that the history of the strain determines most of the stress.

We will discuss two models for learning the memory effect: a direct training model and a coupled training model. For both models, we generate data using direct numerical simulation (DNS). In the direct training model, after generating the data, we directly train the RNN to fit the sub-grid stress

using the time history of the strain. The loss function is defined as the difference between the output of the RNN and the true stress. The neural network model for the stress is then used in the macro-scale equation for the reduced system. In the coupled training model, the loss function is defined as the difference between the solutions of the reduced model (with stress represented by the neural network) and the ground truth solution. Therefore in the coupled model, the RNN is coupled with the solver of the macro-scale equation when training.

3.1 The direct training model

In the direct training model, an RNN is used to represent the stress as a function of the time history of the macro-scale strain. The loss function is simply the squared difference between the predicted stress and the ground truth stress. This RNN model is then used in the reduced system. Figure 1 shows the flow chart of the direct training model.

Figure 1: The flow chart of the direct training model.
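A minimal sketch of the direct training objective, with a linear map standing in for the LSTM and random arrays standing in for the DNS data; all names, sizes, and the learning rate are hypothetical.

```python
import numpy as np

# Direct training (sketch): fit the sub-grid stress from the strain history
# by minimizing a squared loss. A linear map stands in for the paper's LSTM.
rng = np.random.default_rng(1)
T_hist, d = 10, 4                              # history length, field dimension
strain_hist = rng.normal(size=(T_hist, d))     # stand-in DNS strain history
true_stress = rng.normal(size=d)               # stand-in DNS stress

theta = np.zeros((d, T_hist * d))              # trainable parameters

def predict_stress(hist, theta):
    return theta @ hist.ravel()                # stand-in for the LSTM

def direct_loss(theta):
    return np.mean((predict_stress(strain_hist, theta) - true_stress) ** 2)

# A few steps of gradient descent drive the loss toward zero on this toy fit.
loss0 = direct_loss(theta)
lr = 0.01
for _ in range(200):
    r = predict_stress(strain_hist, theta) - true_stress
    grad = (2.0 / d) * np.outer(r, strain_hist.ravel())
    theta -= lr * grad
loss1 = direct_loss(theta)
```

The essential point is that the loss compares the predicted and true stresses directly, with no equation solver in the training loop.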

3.2 The coupled training model

In the direct training model, we train an accurate stress model and hope that this stress model helps produce accurate results for the macro-scale solution. In the coupled model, getting accurate macro-scale solutions is ensured by defining the loss function to be the difference between the solution of the reduced system and the ground truth, with the stress in the reduced system represented by an RNN. Specifically, in every epoch, we solve the reduced system for a number of steps from some initial condition and obtain predictions of the solution at the later steps. The loss function is defined to be some function of the difference between the predicted and the true solutions. From another perspective, we can also view this coupled system as a single large RNN, some of whose hidden units are updated according to a macro-scale solver. At each step, the RNN state is updated from the current strain, the stress is read off from the RNN, and the macro-scale solver advances the solution using that stress; the update rule of this large RNN can be written as


Figure 2 shows part of the computational graph of the coupled training model.

Figure 2: The computation graph of the coupled training model.
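The "single large RNN" viewpoint can be sketched as follows; the RNN update, the stress readout, and the macro-scale solver step below are toy stand-ins for the actual models, with illustrative sizes.

```python
import numpy as np

# Coupled rollout (sketch): at every macro step the stress model reads the
# current strain and its own hidden state, and the macro solver advances the
# solution using the predicted stress.
rng = np.random.default_rng(2)
d, d_h, n_steps, dt = 4, 8, 20, 0.01

W_h = rng.normal(scale=0.1, size=(d_h, d_h + d))
W_out = rng.normal(scale=0.1, size=(d, d_h))

def rollout(u0, n_steps):
    u, h = u0.copy(), np.zeros(d_h)
    for _ in range(n_steps):
        strain = u                                        # stand-in macro strain
        h = np.tanh(W_h @ np.concatenate([h, strain]))    # RNN state update
        tau = W_out @ h                                   # predicted stress
        u = u + dt * (-u + tau)                           # stand-in solver step
    return u

u_pred = rollout(rng.normal(size=d), n_steps)
# Coupled loss: distance between the rolled-out and the true filtered solution,
# e.g. loss = np.mean((u_pred - u_true) ** 2) for DNS data u_true.
```

Training this rollout end-to-end requires differentiating through the solver step, which is exactly the issue addressed next.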

Computation of the gradient

In the coupled training model, the trainable parameters are still in the RNN, but to compute the gradient of the loss with respect to the parameters, we have to do back-propagation (BP) through the solver of the macro-scale equation, i.e. through (16). In many applications, this solver is complicated and it is hard to do BP through the solver. To deal with this problem, we take one step back to perform BP directly through the differential equations. This is done by writing down the differential equation satisfied by the gradient, which we call the backward equation. Another solver is used to solve this backward equation. In this way, BP is done by solving the backward equation, and is decoupled from the macro-scale solver.

As an illustration, let û(t) be the solution of the K-S equation at time t in Fourier space, and assume that we want to perform back-propagation through the K-S equation from time T back to time t0, which means we are going to compute the Jacobian ∂û(T)/∂û(t0). Let

J(t) = ∂û(T)/∂û(t),

then we have J(T) = I, and we want to compute J(t0). In the reduced system, we solve the K-S equation with stress,


where τ̂ is the sub-grid stress in Fourier space, ∗ means convolution, and diag(·) represents a diagonal matrix with the given vector as its diagonal entries. Taking the derivative with respect to t, and assuming that the stress can be treated as given along the trajectory, we obtain


Hence, as long as we know the solution û(t) for t in [t0, T], we can compute the Jacobian by solving equation (19) from T to t0, with initial condition J(T) = I.
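The same mechanics can be checked on a toy ODE. For du/dt = f(u), the Jacobian of the flow map satisfies a linear variational equation solved alongside the forward equation (integrated forward here for simplicity; (19) is the corresponding backward form), and the result can be compared against finite differences. The right-hand side f below is an illustrative stand-in, not the K-S equation.

```python
import numpy as np

# For du/dt = f(u), the Jacobian J(t) = du(t)/du(0) satisfies
# dJ/dt = f'(u(t)) J with J(0) = I.
def f(u):
    return -u + 0.1 * u ** 2

def fprime(u):                        # Jacobian of f (diagonal for this toy f)
    return np.diag(-1.0 + 0.2 * u)

d, dt, n = 3, 1e-3, 1000
u0 = np.array([0.5, -0.3, 0.2])

u, J = u0.copy(), np.eye(d)
for _ in range(n):
    J = J + dt * fprime(u) @ J        # step the variational equation
    u = u + dt * f(u)                 # step the forward equation

# Finite-difference check of the first Jacobian column.
eps = 1e-6
u_pert = u0 + eps * np.eye(d)[:, 0]
for _ in range(n):
    u_pert = u_pert + dt * f(u_pert)
fd_col = (u_pert - u) / eps
err = np.max(np.abs(J[:, 0] - fd_col))
```

Because the variational equation only needs the trajectory u(t) and the local Jacobian of the right-hand side, it decouples the gradient computation from the internals of whatever solver produced the trajectory.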

4 Numerical Experiments

We now present some numerical results of the proposed models for the K-S equation and the 2-dimensional shear flow problem. Below, when we talk about the true solution or the ground truth, we mean the exact solution after filtering.

4.1 The Kuramoto-Sivashinsky equation

Experiment setting

The K-S equation is considered in Fourier space. To generate the training data, we solve the K-S equation with a large number of Fourier modes to approximate the accurate solution, using an integrating-factor Runge-Kutta method [16]; the settings are similar to those in [12]. The micro-scale equation is solved over a long time horizon and the filtered outputs are saved at regular intervals. We drop the results of the initial transient, take the results of the following period as the training data, and use the results of the last period as the testing data.

For the reduced system, we solve (4) in Fourier space with fewer Fourier modes and a larger time step. The macro-scale solver is still an integrating-factor Runge-Kutta method. The solver works in Fourier space, and takes the output of the RNN as the sub-grid stress.

On the machine learning side, we use an LSTM to predict the stress. The LSTM has two layers, and an output layer with linear activation is applied to match the dimension of the stress. The LSTM works in the physical space: it takes strains in the physical space as inputs, and outputs predicted stresses in the physical space. A Fourier transform is applied to the output of the LSTM before feeding it into the macro-scale solver.

Direct training

For the direct training model, we train the LSTM to fit the stress as a (memory-dependent) function of the strains over the preceding time steps. The network is trained by the Adam algorithm [17] with mini-batches. Figure 3 shows the relative training and testing errors during the training process. We can see that the two curves go down simultaneously, finally reaching a small relative error. There is almost no gap between the training and the testing errors, hence little overfitting. This also suggests that most of the sub-grid stress can be explained by the time history of the strain.

Figure 3: Relative training and testing error during training.

First we consider some a priori results. Figure 4 presents the relative errors and the correlation coefficients between the predicted stress and the true stress at different locations in the physical space. We can see that our prediction of the stress has small relative errors and correlation coefficients very close to 1.

Figure 4: Relative error and correlation coefficients of the prediction at different locations in the physical space.

We next examine some a posteriori results. After training the model, we pick some times in the testing set, and solve the reduced system initialized from the true solution at these times. Then, we compare the reduced solution with the true solution at later times. Figure 5 shows two examples. In these figures, the horizontal axis measures the number of time units from the initial solution. From these figures we can see that the reduced solution given by the direct training model produces satisfactory predictions over a considerable number of time units (many time steps of the macro-scale solver).

As for the eventual deviation between the two solutions, it is not clear at this point what is more responsible, the intrinsic unstable behavior in the model or the model error. We will conduct careful studies to resolve this issue in future work.

Figure 5: Comparison of the reduced solution given by the direct training model and the true solution. The lines show values of the solutions at a certain location in the physical space.

Next, we compare the long-term statistical properties of the solution of the reduced model with those of the true solution. We solve the reduced system for a long time, then compute the distributions and the autocorrelation curves of the Fourier modes of the reduced solution, and compare them with those of the true solution. Figure 6 shows some results. From these figures we see that the reduced model can satisfactorily recover the statistical properties of the true solution.

Figure 6: Distribution and auto-correlation of the real part of a selected Fourier mode, for the direct training model and the coupled training model.

Coupled training

For the coupled training model, we train the LSTM together with the equation solver. A detailed description of this training model is given in Section 3.2. The size of the LSTM is the same as that in the direct training model. A simple one-step forward scheme is used to solve the backward equation (19) for the Jacobian, and training is performed with an Adam optimizer. During the training process, we measured the relative errors of the solutions predicted different numbers of steps after the initial true solution. The results are given in Figure 7. From the figure we can see that the relative errors for different numbers of steps all go down during training, and there is no gap between the different curves, which means that solutions at different numbers of steps from the initial time are fitted nearly equally well.

Figure 7: Relative error at different number of steps from the initial true solution during the training process.

Short-term prediction and long-term statistical performance are also examined for the coupled training model. Results are shown in Figures 8 and 6. From the figures we see that the reduced system trained by the coupled training model gives satisfactory predictions over a shorter time window than the direct training model. However, the auto-correlation of the Fourier modes matches the true solution better, while the distribution is as good as that of the direct training model. This suggests that, compared to the direct training model, the coupled model is less competitive for short-term prediction, but performs better on long-term statistical properties.

Figure 8: Comparison of the reduced solution and the true solution in the physical space.

4.2 The 2-dimensional shear flow

Experiment setting

Consider the 2-dimensional shear flow in a channel, whose governing equation is given by (6). We use periodic boundary conditions in the streamwise direction and zero boundary conditions at the channel walls, and take a constant driving force. To numerically solve the equation, we employ the spectral method used in [18], with Fourier modes in the streamwise direction and Legendre modes in the wall-normal direction.

For ease of implementation, we only perform model reduction on the Fourier modes in the streamwise direction, and keep all Legendre modes in the wall-normal direction; the macro-scale solution retains a reduced number of Fourier modes in the streamwise direction. Still, we generate the accurate solution by DNS, compute the macro-scale strain and stress, and use an LSTM to fit the stress as a function of the history of the strain. In this experiment, the neural network we use is a multi-layer LSTM. As for the K-S equation, the input and output of the LSTM are in the physical space.

In this problem, the stress has three components. Since each component has many modes in the spectral space, directly fitting the full stress would give our LSTM a very large number of outputs, which can make the training very difficult. Here, noting that both the stress and the strain are periodic in the streamwise direction, we choose to fit the stress column by column. We train an LSTM to predict one column of the stress (the stress at the same streamwise location in the physical space), using strains at this column and the neighboring columns. Figure 9 shows this idea. In practice, when predicting one column of the stress, we use strains from a few columns on each side of it.

Figure 9: The LSTM predicts one column of the stress each time, using strains at this column and the neighboring columns.
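The column-wise input gathering can be sketched as follows; the number of columns and rows and the neighborhood half-width are illustrative choices, and the periodicity in the streamwise direction is handled by wrapping the column indices.

```python
import numpy as np

# Column-wise prediction (sketch): to predict the stress at column j, gather
# the strains from columns j-w .. j+w, periodic in the streamwise direction.
n_cols, n_rows, half_width = 16, 8, 2
strain = np.random.default_rng(3).normal(size=(n_cols, n_rows))

def neighbor_input(strain, j, half_width):
    cols = [(j + o) % strain.shape[0] for o in range(-half_width, half_width + 1)]
    return strain[cols].ravel()          # flattened input for the model at column j

x_j = neighbor_input(strain, 0, half_width)  # wraps around periodically at j = 0
```

Predicting one column at a time keeps the output dimension fixed at one column of stress, independent of the total number of columns.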

For the 2-dimensional shear flow problem, we only show results from the direct training model. The LSTM is trained by an Adam optimizer with batch size 64. Still, the stress is predicted using the strains from the preceding time steps.

Numerical results

Figure 10 shows the relative training and testing errors during the training process. We can see that both the training and the testing errors decrease to low levels during training. Figure 11 shows the correlation coefficients between the predicted stress and the true stress at different wall-normal locations. The three curves represent the three components of the stress, respectively. We can see that the correlation coefficients are close to 1 except in the region near the boundary. Considering the boundary condition, the stress near the boundary is close to 0; hence, it is reasonable for the prediction to have a low correlation with the true stress there.

Figure 10: Training and testing error of 2D shear flow.
Figure 11: Correlation coefficients of the predicted stress and the true stress at different wall-normal locations.

Next, we solve the reduced system initialized from a true solution, and compare the quality of the reduced solution at later times with the solution given by the Smagorinsky model [19]. The Smagorinsky model is a classical and widely used model for large eddy simulation (LES). In the two-dimensional case, the Smagorinsky model for the sub-grid stress can be written as

τ_ij = −2 ν_t S̄_ij,

where S̄_ij = (1/2)(∂ū_i/∂x_j + ∂ū_j/∂x_i) is the resolved strain-rate tensor and ν_t is called the turbulent eddy viscosity. The turbulent eddy viscosity can be expressed as

ν_t = (C_s Δ)² |S̄|,   |S̄| = (2 S̄_ij S̄_ij)^{1/2},

with Δ being the grid size and C_s being a constant. In Figure 12, we compare the deviations of our reduced solution and of the Smagorinsky solution from the true solution. We see that the prediction given by our reduced system is much better than that given by the Smagorinsky model.
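For reference, the Smagorinsky stress at a single grid point can be computed as follows; the constant C_s, the filter width, and the velocity gradients below are illustrative values, not the ones used in our simulations.

```python
import numpy as np

# Smagorinsky eddy viscosity at one grid point:
#   nu_t = (C_s * Delta)^2 * |S|, with |S| = sqrt(2 * S_ij S_ij).
C_s, Delta = 0.17, 0.05                          # illustrative constant and grid size
dudx, dudy, dvdx, dvdy = 0.2, -0.1, 0.3, -0.2    # illustrative resolved gradients

S = 0.5 * np.array([[2 * dudx, dudy + dvdx],     # resolved strain-rate tensor
                    [dudy + dvdx, 2 * dvdy]])
S_mag = np.sqrt(2.0 * np.sum(S * S))             # strain-rate magnitude |S|
nu_t = (C_s * Delta) ** 2 * S_mag                # turbulent eddy viscosity
tau = -2.0 * nu_t * S                            # Smagorinsky sub-grid stress
```

In contrast to the learned model, the Smagorinsky stress is a purely local, memoryless function of the instantaneous resolved strain, which is one reason a memory-based model can outperform it.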

Figure 12: Distance of the reduced solution and the true solution, compared with the Smagorinsky method.

5 Conclusion

Much work remains to be done to develop the methods proposed here into a systematic and practical approach to model reduction for a wide variety of problems. For example, in the coupled training model, it is crucial to find an accurate and efficient way to compute the Jacobian. How to make the method scalable to larger systems also needs to be studied. For reduced systems where the noise effects cannot be neglected, the noise term in the GLE has to be modeled as well. Finally, we also need theoretical understanding to answer questions such as how to choose the size of the neural network, and how large the dataset should be in order to obtain a good reduced system.


  • [1] Hazime Mori. Transport, collective motion, and Brownian motion. Progress of Theoretical Physics, 33(3):423–455, 1965.
  • [2] Robert Zwanzig. Nonlinear generalized Langevin equations. Journal of Statistical Physics, 9(3):215–220, 1973.
  • [3] Alexandre J Chorin, Ole H Hald, and Raz Kupferman. Optimal prediction with memory. Physica D: Nonlinear Phenomena, 166(3-4):239–257, 2002.
  • [4] Eric J Parish and Karthik Duraisamy. Non-Markovian closure models for large eddy simulations using the Mori–Zwanzig formalism. Physical Review Fluids, 2(1):014604, 2017.
  • [5] Xiantao Li and Weinan E. Variational boundary conditions for molecular dynamics simulations of crystalline solids at finite temperature: Treatment of the thermal bath. Physical Review B, 76(10):104107, 2007.
  • [6] Zhen Li, Hee Sun Lee, Eric Darve, and George Em Karniadakis. Computing the non-Markovian coarse-grained interactions derived from the Mori–Zwanzig formalism in molecular systems: Application to polymer melts. The Journal of Chemical Physics, 146(1):014104, 2017.
  • [7] Masataka Gamahara and Yuji Hattori. Searching for turbulence models by artificial neural network. Physical Review Fluids, 2(5):054604, 2017.
  • [8] Antoine Vollant, Guillaume Balarac, and C Corre. Subgrid-scale scalar flux modelling based on optimal estimation theory and machine-learning procedures. Journal of Turbulence, 18(9):854–878, 2017.
  • [9] Julia Ling, Andrew Kurzawski, and Jeremy Templeton. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016.
  • [10] Jian-Xun Wang, Jin-Long Wu, and Heng Xiao. Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. Physical Review Fluids, 2(3):034603, 2017.
  • [11] H Xiao, J-L Wu, J-X Wang, R Sun, and CJ Roy. Quantifying and reducing model-form uncertainties in Reynolds-averaged Navier–Stokes simulations: A data-driven, physics-informed Bayesian approach. Journal of Computational Physics, 324:115–136, 2016.
  • [12] Fei Lu, Kevin K. Lin, and Alexandre J. Chorin. Data-based stochastic model reduction for the Kuramoto–Sivashinsky equation. Physica D: Nonlinear Phenomena, 340, 2016.
  • [13] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • [14] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
  • [15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [16] Chi-Wang Shu and Stanley Osher. Efficient implementation of essentially non-oscillatory shock-capturing schemes. Journal of Computational Physics, 77(2):439–471, 1988.
  • [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [18] Weinan E and Jianchun Wang. A thermodynamic study of the two-dimensional pressure-driven channel flow. Discrete & Continuous Dynamical Systems-A, 36(8):4349–4366, 2016.
  • [19] Joseph Smagorinsky. General circulation experiments with the primitive equations: I. The basic experiment. Monthly Weather Review, 91(3):99–164, 1963.