1 The MoriZwanzig formalism
The starting point of the MZ formalism is to divide all the variables into resolved and unresolved ones. The basic idea is to project functions of all variables into the space of functions of only the resolved variables. The original dynamical system then becomes an equation for the resolved variables with memory and noise. This equation is called the Generalized Langevin Equation (GLE). Below we first demonstrate the idea of the MZ formalism using a simple linear example. We then use the KuramotoSivashinsky equation as an example to show how MZ formalism help us develop a reduced model.
1.1 A brief review to the MZ formalism
Consider a linear system
(7)  
(8) 
with initial value , . We take as the resolved variable, as the unresolved variable, and want to develop a reduced system for . To achieve this, we can first fix in (8) and solve for , and then insert the result into (7). In this way we get
(9) 
There are three terms at the right hand side of (9). The first term is a Markovian term of ; the second term is a memory term. The third term contains the initial value of , this term will be viewed as a noise term. Hence, by eliminating from the original linear system, we obtain an equation for with memory and noise.
Similar deduction can be applied to nonlinear ODE systems. Assume we have a system
(10) 
with and
being vectors, and we split
into and take as resolved variables, then by the MoriZwanzig formalism we can get a GLE for ,(11) 
In (11), denotes at time with initial value . Although (11) is more complicated than (9), the essential ingredients are similar. The MZ formalism tells us that model reduction leads to memory effects, and inspired us to pursuing the application of RNNs as a tool for performing efficient yet rigorous model reduction.
1.2 Application to the KS equation
Going back to the KS equation (1), let and
be Fourier transforms of solution
and filter , respectively. Then, by (3), we have for all frequency . Here, we take a spectrally sharp filter which satisfiesfor a certain positive integer . Then, the Fourier transform of is a truncation of the Fourier transform of , and our resolved variables are those with . Writing the KS equation in Fourier space, and put all terms with unresolved variables to the right hand side of the equation, we can get
(12) 
Applying Fourier transform to (4), and compare with (12), we have
(13) 
for . Using the MZ theory, the subgrid stress can be expressed as a memory term and a noise term. By Galilean invariance, the subgrid stress can be expressed as a function of the history of the resolved strain, , when the noise effect is negligible.
2 Recurrent Neural Networks and LSTM
In this section, we briefly introduce recurrent neural networks and long shortterm memory networks.
2.1 Recurrent neural networks
RNNs are neural networks with recurrent connections, designed to deal with time series. Usually, a recurrent connection links some units in a neural network to themselves. This connection means that the values of these units depend on their values at the previous time step. For a simple example, consider a onelayer RNN with time series as input. Let and be the hidden values and output values at time step , respectively. Then, the simplest RNN model is given by
(14)  
where , , , , and are matrices or vectors of trainable parameters, and
is a nonlinear activation function. To deal with complicated time series, sometimes the output of an RNN is fed to another RNN to form a multilayer RNN.
Note that in RNNs the same group of parameters are shared by all time steps. In this sense, RNNs are stationary models. Among other things, this allows RNNs to learn information in long time series without inducing too many parameters. The training of RNNs is similar to the training of regular feedforward neural networks, except that we have to consider gradient through time direction, such as the gradient of
with respect to for . Usually RNNs are trained by the socalled backpropagation through time (BPTT) method.2.2 Long shortterm memory networks
Theoretically, RNNs is capable of learning longterm memory effects in the time series. However, in practice it is hard for RNN to catch such dependencies, because of the exploding or shrinking gradient effects [13], [14]. The Long ShortTerm Memory (LSTM) network is designed to solve this problem. Proposed by Hochreiter et al. [15], the LSTM introduces a new group of hidden units called states, and uses gates to control the information flow through the states. Since the updating rule of the states is incremental instead of compositional as in RNN, the gradients are less likely to explode or shrink. The computational rule of an LSTM cell is
where , , are called the forget gate, input gate, and output gate, respectively, is the state, and is the hidden value. The LSTMs can also be trained by BPTT.
3 The Two Training Models
For simplicity, we will focus on learning the memorydependent terms in the GLE, neglecting for now the noise term. For the KS equation, as shown in Figure 3, by directly fitting the stress using the history of the strain, we can reduce the error to nearly while maintaining a small gap between the testing and training error. This shows that we are not overfitting, and the history of the strain determines most of the stress.
We will discuss two models for learning the memory effect: a direct training model and a coupled training model. For both models, we generate data using direct numerical simulation (DNS). In the direct training model, after generating the data, we directly train the RNN to fit the subgrid stress
using the time history of the strain. The loss function is defined as the difference between the output of the RNN and the true stress. The neural network model for the stress is then used in the macroscale equation for the reduced system. In the coupled training model, the loss function is defined as the difference between the solutions of the reduced model (with stress represented by the neural network) and the ground truth solution. Therefore in the coupled model, the RNN is coupled with the solver of the macroscale equation when training.
3.1 The direct training model
In the direct training model, a RNN is used to represent the stress as a function of the time history of the macrocell strain. The loss function is simply the squared difference of the predicted stress and the ground truth stress. This RNN model is then used in the reduced system. Figure 1 shows the flow chart of the direct training model.
3.2 The coupled training model
In the direct training model, we train an accurate stress model and we hope that this stress model can help produce accurate results for the macroscale solution. In the coupled model, getting accurate macroscale solutions is ensured by defining the loss function to be the difference between the solution of the reduced system and the ground truth, with the stress in the reduced system represented by a RNN. Specifically, in every epoch, we solve the reduced system for
steps from some initial condition and get a prediction of the solution steps later (). We use to denote the predicted solution. The loss function is defined to be some function of . From another perspective, we can also view this coupled system as a single large RNN, part of the hidden units of this RNN is updated according to a macroscale solver. Let be the predicted solution at the th step, and , , be the state of RNN, stress, and strain at the step, respectively. Then the update rule of this large RNN can be written as(15)  
(16) 
Figure 2 shows part of the computational graph of the coupled training model.
Computation of the gradient
In the coupled training model, the trainable parameters are still in the RNN, but to compute the gradient of the loss with respect to the parameters, we have to do backpropagation (BP) through the solver of the macroscale equation, i.e. through (16). In many applications, this solver is complicated and it is hard to do BP through the solver. To deal with this problem, we take one step back to perform BP directly through the differential equations. This is done by writing down the differential equation satisfied by the gradient, which we call the backward equation. Another solver is used to solve this backward equation. In this way, BP is done by solving the backward equation, and is decoupled from the macroscale solver.
As an illustration, let be the solution of the KS equation at time in the Fourier space, and assume that we want to perform backpropagation through the KS equation from time back to , which means we are going to compute the Jacobian . Let
(17) 
then we have and we want to compute . In the reduced system, we solve the KS equation with stress,
(18) 
where is the subgrid stress in Fourier space, means convolution, and represents a diagonal matrix with vector being the diagonal entries. Taking derivative of with respect to , and assuming that , we obtain
(19) 
Hence, as long as we know the solution for , we can compute the Jacobian by solving the equation (19) from to , with initial condition .
4 Numerical Experiments
We now present some numerical results of the proposed models for the KS equation and the 2dimensional shear flow problem. Below when we talk about true solution or ground truth, we mean the exact solution after filtering.
4.1 The KuramotoSivashinsky equation
Experiment setting
The KS equation is considered in the Fourier space. To generate the training data, we solve the KS equation with Fourier modes to approximate the accurate solution, using a order integral factor RungeKutta method [16]. We set and the time step . The settings are similar to that in [12]. The microscale equation is solved for time units and filterd outputs are saved for every time units. We drop the results of the first time units, and take the results from the following time units as the training data, and the results of the last time units as the testing data.
For the reduced system, we solve (4) in Fourier space. We take Fourier modes and the time step . The macroscale solver is still a order integral factor RungeKutta method. The solver works in the Fourier space, and takes the output of the RNN as the subgrid stress.
On the machine learning side, we use an LSTM to predict the stress. The LSTM has two layers and hidden units in each layer. An output layer with linear activation is applied to ensure that the dimension of the outputs is . The LSTM works in the physical space: it takes strains in the physical space as inputs, and outputs predicted stresses in the physical space. A Fourier transform is applied to the output of the LSTM before feeding it into the macro solver.
Direct training
For the direct training model, we train the LSTM to fit the stress as a (memorydependent) function of the strains for the last time steps. The network is trained by the Adam algorithm [17] for iterations, with batch size being . Figure 3 shows the relative training and testing error during the training process. We can see that the two curves go down simultaneously, finally reaching a relative error of about . There is almost no gap between the training and the testing error, hence little overfitting. This also suggests that most contribution to the subgrid stress can be explained by the time history of the strain.
First we consider some a priori results. Figure 4 presents the relative error and the correlation coefficients between the predicted stress and the true stress at different locations in the physical space. We can see that our prediction of the stress has relative errors of about and correlation coefficients very close to .
We next examine some a posteriori results. After training the model, we pick some times in the testing set, and solve the reduced system initialized from the true solution at these times. Then, we compare the reduced solution with the true solution at later times. Figure 5 shows two examples. In these figures, the axis measures the number of time units from the initial solution. From these figures we can see that the reduced solution given by the direct training model produces satisfactory prediction for  time units ( time steps of macroscale solver).
As for the eventual deviation between the two solutions, it is not clear at this point what is more responsible, the intrinsic unstable behavior in the model or the model error. We will conduct careful studies to resolve this issue in future work.
Next we compare the longterm statistical properties of the solution to the reduced model with the true solution. We solve the reduced system for a long time ( time units in our case). We then compute the distributions and the autocorrelation curves of Fourier modes of the reduced solution, and compare with that of the true solution. Figure 6 shows some results. From these figures we see that the reduced model can satisfactorily recover statistical properties of the true solution.
Coupled training
For the coupled training model, we train the LSTM together with the equation solver. A detailed description of this training model is given in Section 3.2. Here we choose during training. The size of the LSTM is the same as that in the direct training model. A simple onestep forward scheme is used to solve the backward equation for the Jacobian (19). iterations were performed using an Adam optimizer with batch size . During the training process, we measured the relative error of predicted solutions steps later from the initial true solution. The results for different ’s are given in Figure 7. From the figure we can see that the relative error for different goes down to after training, and there is no gap between the different curves, which means that solutions with different steps from the initial time are fitted nearly equally well.
Shortterm prediction and longterm statistical performance are also considered for the coupled training model. Results are shown on Figure 8 and 6. From the figures we see that, the reduced system trained by the coupled training model gives satisfactory prediction for time units, which is shorter than the direct training model. However, the autocorrelation of Fourier modes match better with the true solution, while the distribution is as good as the direct training model. This suggests that, compared to the direct training model, the coupled model is less competitive for shortterm prediction, but performs better for longterm statistical properties.
4.2 The dimensional shear flow
Experiment setting
Consider the dimensional shear flow in channel, whose governing equation is given by (6). We take and use periodic boundary condition in direction and zero boundary condition at . We choose , and as a constant driving force. To numerically solve the equation, we employ the spectral method used in [18]. We take Fourier modes in direction and Legendre modes in direction. The time step are chosen to be .
For ease of implementation, we only do model reduction for Fourier modes in direction, and keep all Legendre modes in direction. The macro solution has Fourier modes in direction. Still, we generate accurate solution by DNS and compute the macroscale strain and stress, and use an LSTM to fit the stress as a function of the history of the strain. In this experiment, the neural network we use is a layer LSTM with hidden units in each layer. As for the KS equation, the input and output of the LSTM are in the physical space.
In this problem, the stress has components (, , ). Since each component has modes in the spectral space, we need at least variables in the physical space to represent the stress. If directly fit the stress, our LSTM will have about thousand outputs. This can make the training very difficult. Here, noting that both the stress and the strain are periodic in the direction, we choose to fit the stress by column. We train an LSTM to predict one column of the stress (the stress at the same in the physical space), using strains at this column and the neighboring columns. Figure 9 shows this idea. In practice, when predicting the th column of the stress, we use strains from the th to the th column.
For the dimensional shear flow problem, we only show results from the direct training model. The LSTM is trained by an Adam optimizer for steps, with batch size being 64. Still, the stress is predicted using the strains from the last time steps.
Numerical results
Figure 10 shows the relative training and testing error during the training process. We can see that the training error goes down to , while the testing error goes down to about . Figure 11 shows the correlation coefficients of the predicted stress and the true stress at different . The three curves represent three components of the stress, respectively. we can see that the correlation coefficients are close to except in the area close to the boundary. Considering the boundary condition, the stress near the boundary is close to . Hence, its reasonable for the prediction to have a low correlation with the true stress.
Next, we solve the reduced system initialized from a true solution, and compare the quality of the reduced solution at later times with the solution given by the Smagorinsky model [19]. The Smagorinsky model is a classical and widely used model for large eddy simulation (LES). In the two dimensional case, the Smagorinsky model for the subgrid stress can be written as
(20) 
where , is called the turbulent eddy viscosity, and
The turbulent eddy viscosity can be expressed as
(21) 
with being the grid size and being a constant. In Figure 12, we compare the deviation of our reduced solution and Smagorinsky solution from the true solution. We see that the prediction given by our reduced system is much better than that by the Smagorinsky model.
5 Conclusion
Much work needs to be done to develop the methods proposed here into a systematic and practical approach for model reduction for a wide variety of problems. For example, in the coupled training model, it is crucial to find an accurate and efficient way to compute the Jacobian. How to make the method scalable for larger systems is a problem that should be studied. For reduced systems where the noise effects cannot be neglected, how to model the noise term in the GLE is a problem that should be dealt with. Finally, we also need theoretical understanding to answer questions such as how to choose the size of the neural network, and how large the dataset should be in order to obtain a good reduced system.
References
 [1] Hazime Mori. Transport, collective motion, and brownian motion. Progress of theoretical physics, 33(3):423–455, 1965.
 [2] Robert Zwanzig. Nonlinear generalized langevin equations. Journal of Statistical Physics, 9(3):215–220, 1973.
 [3] Alexandre J Chorin, Ole H Hald, and Raz Kupferman. Optimal prediction with memory. Physica D: Nonlinear Phenomena, 166(34):239–257, 2002.
 [4] Eric J Parish and Karthik Duraisamy. Nonmarkovian closure models for large eddy simulations using the morizwanzig formalism. Physical Review Fluids, 2(1):014604, 2017.
 [5] Xiantao Li and Weinan E. Variational boundary conditions for molecular dynamics simulations of crystalline solids at finite temperature: treatment of the thermal bath. Physical Review B, 76(10):104107, 2007.
 [6] Zhen Li, Hee Sun Lee, Eric Darve, and George Em Karniadakis. Computing the nonmarkovian coarsegrained interactions derived from the mori–zwanzig formalism in molecular systems: Application to polymer melts. The Journal of chemical physics, 146(1):014104, 2017.
 [7] Masataka Gamahara and Yuji Hattori. Searching for turbulence models by artificial neural network. Physical Review Fluids, 2(5):054604, 2017.

[8]
Antoine Vollant, Guillaume Balarac, and C Corre.
Subgridscale scalar flux modelling based on optimal estimation theory and machinelearning procedures.
Journal of Turbulence, 18(9):854–878, 2017.  [9] Julia Ling, Andrew Kurzawski, and Jeremy Templeton. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016.
 [10] JianXun Wang, JinLong Wu, and Heng Xiao. Physicsinformed machine learning approach for reconstructing reynolds stress modeling discrepancies based on dns data. Physical Review Fluids, 2(3):034603, 2017.
 [11] H Xiao, JL Wu, JX Wang, R Sun, and CJ Roy. Quantifying and reducing modelform uncertainties in reynoldsaveraged navier–stokes simulations: A datadriven, physicsinformed bayesian approach. Journal of Computational Physics, 324:115–136, 2016.
 [12] Fei Lu, Kevin K. Lin, and Alexandre J. Chorin. Databased stochastic model reduction for the kuramoto–sivashinsky equation. Physica D Nonlinear Phenomena, 340, 2016.
 [13] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning longterm dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 [14] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 [15] Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 [16] ChiWang Shu and Stanley Osher. Efficient implementation of essentially nonoscillatory shockcapturing schemes. Journal of computational physics, 77(2):439–471, 1988.
 [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [18] Weinan E and Jianchun Wang. A thermodynamic study of the twodimensional pressuredriven channel flow. Discrete & Continuous Dynamical SystemsA, 36(8):4349–4366, 2016.
 [19] Joseph Smagorinsky. General circulation experiments with the primitive equations: I. the basic experiment. Monthly weather review, 91(3):99–164, 1963.