# Computing the Invariant Distribution of Randomly Perturbed Dynamical Systems Using Deep Learning

The invariant distribution, which is characterized by the stationary Fokker-Planck equation, is an important object in the study of randomly perturbed dynamical systems. Traditional numerical methods for computing the invariant distribution based on the Fokker-Planck equation, such as finite difference or finite element methods, are limited to low-dimensional systems due to the curse of dimensionality. In this work, we propose a deep learning based method to compute the generalized potential, i.e. the negative logarithm of the invariant distribution scaled by the noise strength. The idea of the method is to learn a decomposition of the force field, as specified by the Fokker-Planck equation, from the trajectory data. The potential component of the decomposition gives the generalized potential. The method can deal with high-dimensional systems, possibly with partially known dynamics. Using the generalized potential also allows us to deal with systems at low temperatures, where the invariant distribution becomes singular around the metastable states. These advantages make it an efficient method for analyzing invariant distributions of practical dynamical systems. The effectiveness of the proposed method is demonstrated by numerical examples.

## 1 Introduction

The probability density function of a randomly perturbed dynamical system is of great importance in studying its steady-state properties and transition events [wang2008potential, li2013quantifying, li2018landscape]. The long-term effects of the noise on the dynamics can be investigated through the invariant distribution of the system, for example, in biological networks [li2018landscape] and socio-economic systems [furioli2017fokker]. In particular, a theoretical framework based on the underlying potential landscape, which is derived from the invariant distribution, can be used to analyze the robustness and stability of nonequilibrium systems [wang2008potential].

Despite its analytical usefulness, numerical computation of the invariant distribution remains a central challenge for high-dimensional systems, especially at low temperatures. The invariant distribution is governed by the stationary Fokker-Planck equation. Traditional numerical methods such as the finite difference method [sepehrian2015numerical], the finite element method [galan2007stochastic] and the variational iteration method [torvattanabun2011numerical] have been used to effectively solve the Fokker-Planck equation in low dimensions. These methods require the discretization of a bounded domain in space, so the computational cost usually increases exponentially with the dimension of the problem, a difficulty known as the curse of dimensionality. For this reason, these traditional numerical methods become prohibitively expensive for practical systems where the dimension is larger than three [li2018landscape, chen2018efficient, li2013quantifying]. An alternative approach for computing the probability distribution is the Monte Carlo method. For example, in Ref. [zhai2020deep], the probability density is estimated using the direct Monte Carlo method by sampling long trajectories of the stochastic differential equation, or using the conditional Gaussian framework [chen2017beating, chen2018efficient]. A naive application of the Monte Carlo method suffers from the difficulty caused by meta-stability in systems with multiple meta-stable states, especially when the amplitude of the noise, i.e. the temperature, is low.

Recently, a number of deep learning based methods have been proposed for solving partial differential equations (PDEs) [han2018solving, weinan2018deep, li2019computing, khoo2019solving, nabian2018deep, raissi2019physics, zang2020weak]. These methods have been very successful even for problems in high dimensions. In the work of Ref. [han2018solving], a deep learning framework was designed for solving semilinear parabolic PDEs based on a reformulation of backward stochastic differential equations. In Refs. [weinan2018deep, li2019computing, khoo2019solving, nabian2018deep], the solution of PDEs is approximated by neural networks and computed by solving the corresponding variational problems. In Ref. [raissi2019physics], physics-informed neural networks (PINNs) were introduced to compute solutions of PDEs.

In this paper, we focus on the stationary Fokker-Planck equation and develop a deep learning based method for computing the invariant distribution of randomly perturbed dynamical systems modeled by stochastic differential equations. Instead of computing the invariant distribution directly, we propose to compute the generalized potential, which is the negative logarithm of the invariant distribution scaled by the noise strength. The method is based on a decomposition of the force field as specified by the Fokker-Planck equation. The potential component of the decomposition gives the generalized potential. We design loss functions to learn the decomposition for both known and unknown force fields. In the latter case, the decomposition is learned from trajectory data of the corresponding deterministic dynamics. Thus the method is applicable to high-dimensional systems with partially known dynamics in low-temperature regimes. The ability of the proposed method to compute the invariant distribution of practical systems at various temperatures and in high dimensions is demonstrated in model systems.

The idea of learning a decomposition of the force field was previously used to compute the quasipotential [lin2021data]. The quasipotential describes the asymptotic properties of the dynamics and characterizes the generalized potential in the zero noise limit [freidlin2012random, zhou2016construction, lin2021data]. The quasipotential satisfies a first-order Hamilton-Jacobi (HJ) equation. In Ref. [lin2021data], the HJ equation was solved by learning an orthogonal decomposition of the force field from trajectory data. In the current work, we use a similar idea to solve the Fokker-Planck equation for the invariant distribution at finite noise.

The proposed method has advantages over the recently proposed deep learning based methods for solving the Fokker-Planck equation [xu2020solving, zhai2020deep, chen2021solving]. The main idea of the previous methods was to use a neural network to represent the probability density function and then minimize a loss function involving the residual of the Fokker-Planck equation. These methods become less efficient when the magnitude of the noise is low, as the density function becomes singular around the metastable states. In contrast, parameterizing the generalized potential in the current method allows us to deal with systems at low temperatures. Furthermore, the proposed method is data-driven in the sense that it can deal with systems with partially known dynamics.

The paper is organized as follows. We first review the Fokker-Planck equation and some earlier work for solving the equation in Section 2. In Section 3, we propose the deep learning based method, including the decomposition of the force field, its parameterization using neural networks and the loss functions under two different problem settings. The effectiveness of the method is demonstrated by numerical examples in Section 4. We draw conclusions in Section 5.

## 2 The Fokker-Planck equation and related work

Consider a dynamical system in $\mathbb{R}^d$ modeled by the stochastic differential equation (SDE)

$$\mathrm{d}x_t = f(x_t)\,\mathrm{d}t + \sqrt{2\epsilon}\,\sigma\,\mathrm{d}W_t, \quad t > 0, \tag{1}$$

where $f: \mathbb{R}^d \to \mathbb{R}^d$ is a vector (force) field, $W_t$ is a $d$-dimensional Wiener process, $\sigma$ is a constant $d \times d$ matrix, and $\epsilon$ is a parameter controlling the strength of the noise. The invariant probability density function of the dynamical system, $p(x)$, solves the Fokker-Planck (FP) equation

$$\mathcal{N}p(x) := -\nabla \cdot \big(f(x)p(x)\big) + \epsilon \nabla \cdot \big(D \nabla p(x)\big) = 0, \quad x \in \mathbb{R}^d, \tag{2}$$

where $D = \sigma\sigma^T$ is the diffusion tensor, and $\nabla p$ and $\nabla \cdot J$ denote the gradient of the function $p$ and the divergence of the vector field $J$, respectively. The FP equation can also be written as $\nabla \cdot J = 0$, where $J(x) = f(x)p(x) - \epsilon D\nabla p(x)$ is the probability flux.

In low dimensions (e.g. $d \le 3$), the invariant distribution of the dynamical system (1) can be computed by solving the FP equation using traditional numerical methods, e.g. finite difference or finite element methods [sepehrian2015numerical, galan2007stochastic]. Take the finite difference method for example. The FP equation is restricted to a bounded domain $\Omega$ with a suitable boundary condition, e.g. the no-flux condition $J \cdot n = 0$ on $\partial\Omega$, where $n$ is the outward normal vector on the boundary $\partial\Omega$. Then the differential operators in the equation are approximated by difference operators on a mesh covering $\Omega$, and the resulting difference equations are solved for the density function at the grid points of the mesh. Such mesh-based numerical methods can only be applied to low-dimensional systems. For practical systems where the dimension is greater than three, these methods become too expensive, as the computational cost increases exponentially with the dimension.

Recently, learning methods based on artificial neural networks were proposed to compute invariant distributions. In Ref. [xu2020solving], it was proposed to solve the FP equation by parameterizing the solution using a neural network $p_\theta(x)$. The network weights $\theta$ are trained by minimizing the loss function

$$L = \int_\Omega |\mathcal{N}p_\theta(x)|^2 \,\mathrm{d}x + \lambda_1 \Big| \int_\Omega p_\theta(x)\,\mathrm{d}x - 1 \Big|^2 + \lambda_2 \int_{\partial\Omega} |p_\theta(x)|^2 \,\mathrm{d}x, \tag{3}$$

where the last two terms are introduced to impose the normalization condition and the homogeneous Dirichlet boundary condition, respectively, and $\lambda_1$, $\lambda_2$ are parameters controlling the weights of the two penalty terms. In Ref. [xu2020solving], the integral in the second term was approximated by quadrature on a uniform mesh covering $\Omega$. This limited the applicability of the method to low-dimensional systems.

In another learning-based method [zhai2020deep], the invariant density function was also parameterized by a neural network $p_\theta(x)$. The network weights were trained by minimizing the loss function

$$L = \int_\Omega |\mathcal{N}p_\theta(x)|^2 \,\mathrm{d}\mu_X(x) + \int_\Omega |p_\theta(y) - \tilde{p}(y)|^2 \,\mathrm{d}\mu_Y(y), \tag{4}$$

where $\mu_X$ and $\mu_Y$ are probability measures, and $\tilde{p}$ is a rough estimate of the invariant probability density obtained by sampling trajectories of the SDE (1). To emphasize high-probability regions, $\mu_X$ and $\mu_Y$ were chosen based on the sampled trajectories of the SDE (1) in Ref. [zhai2020deep].

Both learning-based methods solve the FP equation for the invariant density function directly. When the noise is small, the density function becomes rather singular, with sharp peaks at the meta-stable states and values near zero elsewhere. In this situation, directly computing the density function becomes less efficient and may lead to inaccurate solutions.

## 3 Computing the generalized potential

Let $V(x) = -\epsilon \log p(x)$, where $p(x)$ is the invariant probability density function of the dynamical system (1). The function $V$ is called the generalized potential. In a gradient system with the force field $f = -\nabla U$ and the diffusion matrix $D = I_d$, where $U$ is the potential function and $I_d$ denotes the $d$-dimensional identity matrix, the invariant distribution is given by the Boltzmann-Gibbs distribution $p(x) = Z^{-1} e^{-U(x)/\epsilon}$, where $Z$ is the normalization constant. In this case, the generalized potential differs from the potential function by a constant: $V(x) = U(x) + \epsilon \log Z$ for all $x$.
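
This relation is easy to verify numerically in one dimension (the double-well potential below is an illustrative choice, not a system from the paper):

```python
import numpy as np

# Sketch: 1-D gradient system with an assumed double-well potential U.
# The Boltzmann-Gibbs density is p = exp(-U/eps)/Z, so the generalized
# potential V = -eps*log(p) equals U + eps*log(Z), i.e. U up to a constant.
eps = 0.2
x = np.linspace(-2.0, 2.0, 2001)
U = 0.25 * x**4 - 0.5 * x**2          # illustrative potential
p = np.exp(-U / eps)
Z = p.sum() * (x[1] - x[0])           # normalization constant (Riemann sum)
p /= Z
V = -eps * np.log(p)                  # generalized potential
assert np.allclose(V - U, eps * np.log(Z))   # constant shift eps*log(Z)
```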

Remark. In general dynamical systems, the generalized potential is related to the global quasipotential $S(x)$, which can be constructed from local quasipotentials [freidlin2012random, zhou2016construction]. The local quasipotential is defined as the minimum action to reach a state from a metastable state [freidlin2012random]. The global quasipotential characterizes the invariant distribution of the dynamical system in the zero noise limit [zhou2016construction, lin2021data] up to a constant: $S(x) = \lim_{\epsilon \to 0} \big(-\epsilon \log p(x)\big) + \text{const}$. Therefore, the generalized potential converges to the quasipotential in the zero noise limit.

Using the ansatz $p(x) = e^{-V(x)/\epsilon}$ in the FP equation, we immediately obtain the following equation for the generalized potential:

$$\nabla V(x)^T \big(f(x) + D\nabla V(x)\big) - \epsilon \nabla \cdot \big(f(x) + D\nabla V(x)\big) = 0, \quad x \in \mathbb{R}^d, \tag{5}$$

where we have dropped the exponential factor $e^{-V(x)/\epsilon}$. Let $g(x) = f(x) + D\nabla V(x)$. Then solving the above equation is equivalent to finding a decomposition of the force field

$$f(x) = -D\nabla V(x) + g(x), \tag{6}$$

such that

$$\nabla V(x)^T g(x) - \epsilon \nabla \cdot g(x) = 0. \tag{7}$$

For convenience, we call the first term $-D\nabla V$ in the decomposition (6) the potential component of the force field, and the term $g$ is referred to as the residual component. Once the decomposition is found, we readily obtain the invariant distribution $p(x) = e^{-V(x)/\epsilon}$, where $V$ is shifted so that the normalization condition for $p$ is satisfied; this is possible since the decomposition (6)-(7) is invariant with respect to the addition of constants to $V$. Note that the generalized potential $V$ remains well-behaved even in the small noise limit. This is in contrast to the density function $p$, which becomes nearly singular and difficult to compute directly when $\epsilon$ is small.
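
The decomposition can be exercised on a toy example: with the illustrative choices $V(x,y) = (x^2+y^2)/2$, $g(x,y) = (-y, x)$ and $D = I_2$ (assumptions made for this sketch, not a system from the paper), constraint (7) holds exactly for every $\epsilon$, so the flux associated with $p \propto e^{-V/\epsilon}$ should be divergence-free:

```python
import numpy as np

# Illustrative check of (6)-(7): V(x,y) = (x^2+y^2)/2, g = (-y, x), D = I.
# Then grad(V)·g = 0 and div(g) = 0, so (7) holds for every eps, and
# p ∝ exp(-V/eps) should make the probability flux divergence-free.
eps, h = 0.5, 1e-3
xs = np.linspace(-1.0, 1.0, 21)
X, Y = np.meshgrid(xs, xs, indexing="ij")

p = lambda x, y: np.exp(-(x**2 + y**2) / (2.0 * eps))

def flux(x, y):
    f1 = -x - y                        # f = -D grad(V) + g, first component
    f2 = -y + x                        # second component
    px = (p(x + h, y) - p(x - h, y)) / (2 * h)   # central differences
    py = (p(x, y + h) - p(x, y - h)) / (2 * h)
    return f1 * p(x, y) - eps * px, f2 * p(x, y) - eps * py

divJ = (flux(X + h, Y)[0] - flux(X - h, Y)[0]) / (2 * h) \
     + (flux(X, Y + h)[1] - flux(X, Y - h)[1]) / (2 * h)
assert np.max(np.abs(divJ)) < 1e-4     # vanishes up to discretization error
```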

To compute the decomposition of the force field, we parameterize the two components using neural networks. Specifically, the generalized potential is approximated by

$$V_\theta(x) = \tilde{V}_\theta(x) + \sum_{i=1}^d \rho_i (x_i - c_i)^2, \tag{8}$$

where $\tilde{V}_\theta$ is a fully-connected neural network with a smooth activation function, and $\rho_i$, $c_i$ are trainable parameters, with $\rho_i > 0$. In practice, $\rho_i$ is reparameterized to ensure its positivity. In Eq. (8), the quadratic term is introduced so that the function $e^{-V_\theta(x)/\epsilon}$ is integrable in $\mathbb{R}^d$. Similarly, the residual component is represented by a neural network $g_\theta$. Then the force field is parameterized by

$$f_\theta(x) = -D\nabla V_\theta(x) + g_\theta(x). \tag{9}$$
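
A minimal sketch of the parameterization (8) with randomly initialized weights; the softplus reparameterization of $\rho_i$ is an illustrative choice for enforcing positivity, not necessarily the one used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 2, 32

# Sketch of the parameterization (8): a small tanh network plus a confining
# quadratic term. softplus(rho_raw) > 0 is an illustrative reparameterization.
W1 = rng.normal(size=(width, d)) / np.sqrt(d)
b1 = rng.normal(size=width)
W2 = rng.normal(size=width) / np.sqrt(width)
rho = np.log1p(np.exp(rng.normal(size=d)))   # softplus, strictly positive
c = rng.normal(size=d)

def V_theta(x):
    hidden = np.tanh(W1 @ x + b1)
    return W2 @ hidden + np.sum(rho * (x - c) ** 2)

# The tanh network is bounded, so V_theta is dominated by the quadratic term
# far from the origin, and exp(-V_theta/eps) is integrable on R^d.
assert np.all(rho > 0)
assert V_theta(np.full(d, 100.0)) > V_theta(np.zeros(d)) + 10.0
```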

Next we introduce the loss functions for training the networks under two different problem settings.

Fully known dynamics. First, consider the case where the dynamics (1) is completely known, i.e. the force field $f$, the diffusion tensor $D$ and the strength of the noise $\epsilon$ are all given. This is the case considered in traditional numerical methods [sepehrian2015numerical, galan2007stochastic, torvattanabun2011numerical] and existing learning-based methods [xu2020solving, zhai2020deep]. To learn the parameters in $V_\theta$ and $g_\theta$, we minimize the loss function

$$L = L_{\mathrm{dyn}} + \lambda L_{\mathrm{con}}, \tag{10}$$

where

$$L_{\mathrm{dyn}} = \frac{1}{d} \int_{\mathbb{R}^d} |f(x) - f_\theta(x)|^2 \,\mathrm{d}\mu(x), \qquad L_{\mathrm{con}} = \int_{\mathbb{R}^d} \big|\nabla V_\theta(x)^T g_\theta(x) - \epsilon \nabla \cdot g_\theta(x)\big|^2 \,\mathrm{d}\mu(x), \tag{11}$$

where $\mu$ is a probability measure, $L_{\mathrm{dyn}}$ ensures that $f_\theta$ approximates the given force field $f$, $L_{\mathrm{con}}$ imposes the constraint (7) for the decomposition of $f_\theta$, and $\lambda$ is a parameter that controls the relative weights of the two terms in the loss function. The probability measure $\mu$ can be chosen at our disposal to focus on regions of interest in the dynamics. In the numerical examples, the integrals in (11) are represented as finite sums using data points sampled from the uniform distribution on a bounded domain, or a mixture of the uniformly sampled data points and those sampled from the numerical simulation of the SDE (1). Note that the sampling scheme does not require a discretization mesh.
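
A sketch of this Monte Carlo estimation of the losses (11), using finite differences in place of automatic differentiation; the force field and the candidate pair $(V, g)$ below are illustrative and chosen so that the decomposition is exact:

```python
import numpy as np

# Sketch: Monte Carlo estimate of the losses (11) on uniformly sampled points.
# Gradients/divergences are computed by central differences here; in practice
# one would use automatic differentiation. f, V, g are illustrative choices.
rng = np.random.default_rng(1)
eps, lam, h, d = 0.1, 1.0, 1e-4, 2
D = np.eye(d)

f = lambda x: np.array([-x[0] - x[1], -x[1] + x[0]])   # "true" force
V = lambda x: 0.5 * np.sum(x**2)                        # candidate potential
g = lambda x: np.array([-x[1], x[0]])                   # candidate residual

def grad(F, x):
    return np.array([(F(x + h*e) - F(x - h*e)) / (2*h) for e in np.eye(d)])

def div(Fvec, x):
    return sum((Fvec(x + h*e)[i] - Fvec(x - h*e)[i]) / (2*h)
               for i, e in enumerate(np.eye(d)))

f_theta = lambda x: -D @ grad(V, x) + g(x)              # decomposition (9)
pts = rng.uniform(-2, 2, size=(256, d))
L_dyn = np.mean([np.sum((f(x) - f_theta(x))**2) / d for x in pts])
L_con = np.mean([(grad(V, x) @ g(x) - eps * div(g, x))**2 for x in pts])
loss = L_dyn + lam * L_con
# Here f = -grad(V) + g exactly and (7) holds, so both losses vanish.
assert L_dyn < 1e-6 and L_con < 1e-6
```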

Partially known dynamics. Next, we consider the case where the force field $f$ is unknown, but we have access to trajectory data of the deterministic dynamics corresponding to the SDE (1):

$$\dot{x} = f(x). \tag{12}$$

This is a common scenario adopted in recent works on learning dynamics from data [brunton2016discovering, yu2020onsagernet, lin2021data]. Furthermore, we assume the diffusion tensor $D$ and the strength of the noise $\epsilon$ are given. In the proposed method, we learn an interpretable dynamics with the force field in the form of the decomposition (9) from the trajectory data. Specifically, we denote the observed data by $X$, which consists of $N$ trajectories $X_i$, $1 \le i \le N$, each with states sampled at the times $t_j$ and $t_j + \Delta t$, $0 \le j \le M$, from the dynamics (12). Here $\Delta t$ is a small time step. The trajectories start from different initial states. To train the force field model using these data, we minimize the loss function

$$L = L_{\mathrm{dyn}} + \lambda L_{\mathrm{con}}, \tag{13}$$

where

$$L_{\mathrm{dyn}} = \frac{1}{N(M+1)d} \sum_{i=1}^N \sum_{j=0}^M \Big| \frac{1}{\Delta t} \big( I_{\Delta t}[f_\theta; X_i(t_j)] - X_i(t_j + \Delta t) \big) \Big|^2, \qquad L_{\mathrm{con}} = \frac{1}{S} \sum_{k=1}^S \big| \nabla V_\theta(\tilde{X}_k)^T g_\theta(\tilde{X}_k) - \epsilon \nabla \cdot g_\theta(\tilde{X}_k) \big|^2, \tag{14}$$

where $I_{\Delta t}[f_\theta; X_i(t_j)]$ is the end state obtained by performing a numerical integration of the dynamics $\dot{x} = f_\theta(x)$ by one time step $\Delta t$, starting from the state $X_i(t_j)$, and $\{\tilde{X}_k\}_{k=1}^S$ is a representative subset of $X$. The data points $\tilde{X}_k$, $1 \le k \le S$, in the loss $L_{\mathrm{con}}$ are chosen so that they are uniformly distributed in the regions visited by the sample trajectories. The loss $L_{\mathrm{con}}$ with these representative data points effectively imposes the constraint (7) in these regions. In this work, we use the algorithm proposed in Ref. [lin2021data] to sample the representative data points from the dataset $X$.

Note that the loss function $L_{\mathrm{dyn}}$ measures the difference between the force field $f$ and its approximation $f_\theta$. Indeed, let $X_\theta(t)$ and $X(t)$ be the solutions to the dynamics $\dot{x} = f_\theta(x)$ and $\dot{x} = f(x)$, respectively, starting from the same initial state $X_0$. It follows that

$$\frac{1}{\Delta t}\big|X_\theta(\Delta t) - X(\Delta t)\big| = \frac{1}{\Delta t}\Big|\int_0^{\Delta t} \big(f_\theta(X_\theta(t)) - f(X(t))\big)\,\mathrm{d}t\Big| \approx \big|f_\theta(X_0) - f(X_0)\big|. \tag{15}$$

In the loss $L_{\mathrm{dyn}}$, $X_\theta(\Delta t)$ is approximated by $I_{\Delta t}[f_\theta; X_0]$, the solution obtained from a numerical integrator of $\dot{x} = f_\theta(x)$ starting from $X_0$.
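
As an illustration of the one-step map and the estimate (15), with assumed linear dynamics that are not a system from the paper:

```python
import numpy as np

# Sketch: the one-step map I_dt[f; x] with a second-order Runge-Kutta
# (midpoint) integrator, and the finite-difference force estimate of Eq. (15).
def rk2_step(f, x, dt):
    k1 = f(x)
    return x + dt * f(x + 0.5 * dt * k1)

f_true = lambda x: -x                       # illustrative linear dynamics
f_model = lambda x: -1.05 * x               # slightly wrong model
dt = 1e-3
x0 = np.array([1.0, -2.0])
x_next = rk2_step(f_true, x0, dt)           # plays the role of data X(t + dt)
resid = (rk2_step(f_model, x0, dt) - x_next) / dt
# For small dt, |resid| ≈ |f_model(x0) - f_true(x0)|, as in Eq. (15).
assert np.allclose(resid, f_model(x0) - f_true(x0), atol=1e-3)
```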

## 4 Numerical examples

To illustrate the effectiveness of the proposed method, we apply it to three systems with different features: a two-dimensional system with two meta-stable states, a biochemical oscillation network model, and a dynamical system in high dimensions. In each example, we use fully connected neural networks with two hidden layers to parameterize $\tilde{V}_\theta$ and $g_\theta$, and the hyperbolic tangent function (tanh) as the activation function. The following three types of datasets are used in the loss functions:

• Data sampled from the uniform distribution on a bounded domain $\Omega$.

• Data sampled from trajectories of the SDE (1). The initial states of the trajectories are sampled from the uniform distribution on $\Omega$, and the SDE is solved using the Euler-Maruyama scheme with time step $\delta t$. After an initial number of time steps on each trajectory, one data point is sampled at a fixed interval of time steps.

• Data sampled from trajectories of the deterministic dynamics (12). The initial states of the trajectories are sampled from the uniform distribution on $\Omega$, and the dynamics is solved using the fourth-order Runge-Kutta method with time step $\delta t$. Along each trajectory, the data points are sampled at the times $t_j$ and $t_j + \Delta t$, $0 \le j \le M$. The representative data points are sampled using the algorithm in Ref. [lin2021data].

The first and second types of datasets cover the high-probability region of interest in the dynamics. The third type comes from the deterministic dynamics driven by the force field. The domain $\Omega$, the time steps $\delta t$ and $\Delta t$, the parameter $\lambda$ and the network structures are provided in Table 1. We use the second-order Runge-Kutta method as the numerical integrator $I_{\Delta t}$ in the loss function (14). The parameter $\lambda$ in the loss function is tuned so that both the loss $L_{\mathrm{dyn}}$ and the loss $L_{\mathrm{con}}$ are small. We train the neural networks using the Adam optimizer [kingma2014adam] with mini-batches. The learning rate decays exponentially over the training steps.
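
The generation of dataset (ii) can be sketched as follows; the one-dimensional force field, burn-in length and subsampling stride here are illustrative choices, not values from Table 1:

```python
import numpy as np

# Sketch of dataset (ii): Euler-Maruyama sampling of the SDE (1), discarding
# an initial transient and then subsampling at a fixed stride.
rng = np.random.default_rng(2)
eps, dt, burn_in, stride, n_keep = 0.1, 1e-3, 1000, 10, 500
f = lambda x: x - x**3                  # illustrative 1-D double-well force
sigma = 1.0

x = rng.uniform(-2, 2)                  # random initial state
data = []
for step in range(burn_in + stride * n_keep):
    x += f(x) * dt + np.sqrt(2 * eps * dt) * sigma * rng.normal()
    if step >= burn_in and (step - burn_in) % stride == 0:
        data.append(x)
data = np.asarray(data)
assert data.shape == (n_keep,)
# At this low temperature the samples stay near the wells at x = ±1.
assert np.all(np.abs(data) < 2.5)
```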

To assess the accuracy of the learned generalized potential, we compute the relative root mean square error (rRMSE) and the relative mean absolute error (rMAE):

$$\mathrm{rRMSE} = \frac{\big(\int_D |V_\theta(x) - V(x)|^2 \,\mathrm{d}x\big)^{1/2}}{\big(\int_D |V(x)|^2 \,\mathrm{d}x\big)^{1/2}}, \qquad \mathrm{rMAE} = \frac{\int_D |V_\theta(x) - V(x)| \,\mathrm{d}x}{\int_D |V(x)| \,\mathrm{d}x},$$

where $V_\theta$ is the learned generalized potential, $V$ is the reference solution computed from the FP equation using the finite difference (FD) method in Examples 1 and 2, and $D$ is the domain on which the errors are evaluated, which excludes regions of low density. To facilitate the comparison, the solutions $V_\theta$ and $V$ are shifted so that their minimum values are both 0.
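
On a grid, the shifted discrete versions of these errors can be sketched as follows (the double-well reference below is an illustrative choice):

```python
import numpy as np

# Sketch: discrete rRMSE/rMAE between a learned potential and a reference,
# after shifting both so that their minima are zero.
def errors(V_learned, V_ref):
    a = V_learned - V_learned.min()
    b = V_ref - V_ref.min()
    rrmse = np.sqrt(np.sum((a - b)**2)) / np.sqrt(np.sum(b**2))
    rmae = np.sum(np.abs(a - b)) / np.sum(np.abs(b))
    return rrmse, rmae

x = np.linspace(-2, 2, 101)
V_ref = x**4 / 4 - x**2 / 2                 # illustrative reference potential
rrmse, rmae = errors(V_ref + 0.3, V_ref)    # constant shifts are removed
assert rrmse < 1e-12 and rmae < 1e-12
```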

In the numerical examples, the generalized potential learned using the proposed method is compared with the one computed from the corresponding FP equation using the FD method. In Examples 1 and 2, the FP equation is solved on the domain $\Omega$ given in Table 1, with the no-flux boundary condition. A uniform mesh is used in the finite difference discretization. Details of the FD scheme are provided in Appendix A.

### 4.1 Example 1: A two-dimensional system with two metastable states

Consider the following two-dimensional dynamical system:

$$\begin{cases} \dot{x} = \dfrac{1}{5}x(1-x^2) + y(1+\sin x) + \sqrt{\dfrac{\epsilon}{5}}\,\xi_1, \\[4pt] \dot{y} = -y + 2x(1-x^2)(1+\sin x) + \sqrt{2\epsilon}\,\xi_2, \end{cases} \tag{16}$$

where the state of the system is $(x, y) \in \mathbb{R}^2$, $(\xi_1, \xi_2)$ is a two-dimensional white noise, and the diffusion tensor is $D = \mathrm{diag}(1/10,\, 1)$. The corresponding deterministic dynamics ($\epsilon = 0$) has two stable stationary points at $(\pm 1, 0)$ and one unstable stationary point at the origin.

First, we assume the dynamics in (16) is completely known to us. In this case, we use the loss function (10)-(11) to train the neural networks for $V_\theta$ and $g_\theta$. The integrals in the loss function are represented as finite sums using data points sampled from the uniform distribution on $\Omega$ (dataset (i)). Using these data, we train the model at three different values of $\epsilon$.

The learned generalized potentials $V_\theta$ at two values of $\epsilon$ are shown in Fig. 1. Also shown in the figure are the solution obtained by solving the FP equation for $p$ using the finite difference method (the FD solution) and the solution $-\epsilon \log p_\theta$, where $p_\theta$ is computed by minimizing the loss function (4). In the loss function (4), we use the FD solution as the estimator $\tilde{p}$, and the invariant density function is trained directly. This method gives a less accurate solution than the proposed method for learning the generalized potential, especially when the temperature is low and the density is narrowly peaked at the metastable states, as can be seen from the numerical results. Details of training the invariant density function using the loss function (4) are provided in Appendix B. A quantitative assessment of the numerical solutions is provided in Table 2, where we report the rRMSE and the rMAE of the learned potentials. For each value of $\epsilon$, the FD solution is used as the reference to compute the errors of the learned potential. Each error in the table is computed from independent runs, including sampling the data and training the networks. The advantage of parameterizing the potential over parameterizing the invariant density directly is also evident from the results shown in the table.

Next, we assume that the force field $f$ in (16) is unknown to us. In this case, we compute the neural network model for the force field, $f_\theta$, in the form of the decomposition (9) by minimizing the loss function (13)-(14). The dataset contains data points sampled from trajectories of the deterministic dynamics (dataset (iii)). Using these data, we train the neural network model at three different values of $\epsilon$. The learned potential $V_\theta$ at one value of $\epsilon$ is shown in Fig. 2, together with the finite difference solution for comparison. Both the rRMSE and the rMAE of $V_\theta$ are small.

With the sampled trajectory data, we also apply the method in Ref. [lin2021data] to compute the quasipotentials associated with each attractor of the system. The learned quasipotential is shown in the left panel of Fig. 3. In particular, we plot the quasipotential and the generalized potentials $V_\theta$ along a one-dimensional section in the right panel of Fig. 3. From the figure, we can see that as the noise tends to zero, the generalized potential converges to the global quasipotential. The numerical results also reveal that finite noise in the system (16) has a significant entropic effect on the landscape of the equilibrium distribution: the left well concentrates more equilibrium probability mass than the right one as the magnitude of the noise increases.

### 4.2 Example 2: A biochemical oscillation network model

To show the effectiveness of the proposed method on systems with other features, we evaluate it on a system whose potential landscape has a limit-cycle shape when the temperature is low. We consider a biochemical oscillation network of cell cycles [wang2008potential]. The network consists of two cyclins, CLN/CDC28 and CLB/CDC28. Let $x$ and $y$ denote the average concentrations of the two cyclins, respectively. The variation rates of the concentrations are described by the SDEs

$$\begin{cases} \dot{x} = 100\left(\dfrac{\alpha^2 + x^2}{1+x^2}\cdot\dfrac{1}{1+y} - ax\right) + \sqrt{2\epsilon}\,\xi_1, \\[4pt] \dot{y} = 100\,\tau_0\left(b - \dfrac{y}{1+cx^2}\right) + \sqrt{2\epsilon}\,\xi_2, \end{cases} \tag{17}$$

where the state of the system is $(x, y)$, $(\xi_1, \xi_2)$ is a two-dimensional white noise, and the diffusion tensor $D = I_2$. The parameters $\alpha$, $a$, $\tau_0$, $b$ and $c$ are fixed positive constants.

We assume the dynamics in (17) is completely known to us. We use the loss function (10)-(11) to train the neural network model $f_\theta$ in the form of the decomposition (9). The integrals in the loss function are represented as finite sums using a mixture of data points sampled from the uniform distribution on $\Omega$ (dataset (i)) and data points sampled from trajectories of the SDEs (17) (dataset (ii)). To focus on the region of interest, only the data points inside this region are kept when sampling the SDE data. Using these data, we train the force field model at a fixed value of $\epsilon$. The numerical solution for the potential $V_\theta$ is shown in Fig. 4, together with the finite difference solution. The two solutions agree very well.

We also conducted the computation at two other values of $\epsilon$. The errors of $V_\theta$ for the different values of $\epsilon$ are reported in Table 3. These results show that our method is effective in capturing different types of potential landscapes.

### 4.3 Example 3: A ten-dimensional system

To show the effectiveness of the proposed method in high-dimensional systems, where traditional numerical methods are not directly applicable, we consider a dynamical system in the ten-dimensional space $\mathbb{R}^{10}$. We choose a synthetic example with an explicitly known reduction to a lower-dimensional system, on which the invariant distribution can be computed accurately. This is important for validating the proposed method in high dimensions, since classical numerical methods such as finite difference and finite element methods cannot be directly applied to obtain reference solutions for general high-dimensional systems. Concretely, we consider the ten-dimensional system

$$\dot{x} = Bh(B^{-1}x) + \sqrt{2\epsilon}\,B\xi, \quad t > 0, \tag{18}$$

where $h: \mathbb{R}^{10} \to \mathbb{R}^{10}$ is a vector field with components

$$\begin{aligned} h_{2k-1}(y) &= v_1(y_{2k-1}, y_{2k}) := -y_{2k-1} + y_{2k}(1+\sin y_{2k-1}), \\ h_{2k}(y) &= v_2(y_{2k-1}, y_{2k}) := -y_{2k} - y_{2k-1}(1+\sin y_{2k-1}), \quad 1 \le k \le 5, \end{aligned}$$

$B$ is a $10 \times 10$ matrix given by

$$b_{i,j} = \begin{cases} 0.8, & \text{for } i = j = 2k-1,\ 1 \le k \le 5, \\ 1.25, & \text{for } i = j = 2k,\ 1 \le k \le 5, \\ -0.5, & \text{for } j = i+1,\ 1 \le i \le 9, \\ 0, & \text{otherwise}, \end{cases}$$

and $\xi$ is a ten-dimensional white noise. This system is obtained by coupling five independent two-dimensional systems of the same form:

$$\begin{cases} \dot{y}_{2k-1} = v_1(y_{2k-1}, y_{2k}) + \sqrt{2\epsilon}\,\xi_{2k-1}, \\ \dot{y}_{2k} = v_2(y_{2k-1}, y_{2k}) + \sqrt{2\epsilon}\,\xi_{2k}, \end{cases} \quad 1 \le k \le 5, \tag{19}$$

via the change of variables $x = By$. As shown in Appendix C, the generalized potential of the system (18) is given by

$$V(x) = V_0(y_1, y_2) + \cdots + V_0(y_9, y_{10}), \tag{20}$$

where $y = B^{-1}x$ and $V_0$ is the generalized potential of the two-dimensional system (19). Thus a reference solution of $V$ can be obtained by solving the FP equation associated with the two-dimensional system using the finite difference method.
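
As a sanity check of this construction (an illustrative sketch; the matrix entries and the components of $h$ are taken from the definitions above), one can build $B$ and verify that the coupled force field is the decoupled one expressed in the variables $x = By$:

```python
import numpy as np

# Sketch: build the matrix B of Example 3 and check that the coupled force
# field f(x) = B h(B^{-1} x) equals B h(y) for x = B y, i.e. the system (18)
# is the decoupled dynamics (19) in the transformed variables.
d = 10
B = np.zeros((d, d))
for k in range(5):
    B[2 * k, 2 * k] = 0.8          # odd diagonal entries
    B[2 * k + 1, 2 * k + 1] = 1.25 # even diagonal entries
for i in range(d - 1):
    B[i, i + 1] = -0.5             # superdiagonal coupling

def h(y):
    out = np.empty(d)
    for k in range(5):
        y1, y2 = y[2 * k], y[2 * k + 1]
        out[2 * k] = -y1 + y2 * (1 + np.sin(y1))
        out[2 * k + 1] = -y2 - y1 * (1 + np.sin(y1))
    return out

rng = np.random.default_rng(4)
y = rng.normal(size=d)
x = B @ y
# dx/dt = B dy/dt = B h(y) = B h(B^{-1} x); B is triangular with nonzero
# diagonal, hence invertible.
assert np.allclose(B @ h(y), B @ h(np.linalg.solve(B, x)))
```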

We assume the force field in Eq. (18) is unknown to us. In this case, we learn the neural network model for the force field, $f_\theta$, in the form of the decomposition (9) by minimizing the loss function (13)-(14). The dataset contains data points sampled from trajectories of the deterministic dynamics (dataset (iii)). Using these data, we train the neural network model at a fixed value of $\epsilon$. As the learned potential $V_\theta$ is defined in the ten-dimensional space, we plot a number of its cross sections in Fig. 5, together with the corresponding cross sections of the reference solution $V$. It can be observed that the two solutions agree well in all five cross sections, in both high-probability and low-probability regions, for this high-dimensional system.

## 5 Conclusion

In this paper, we developed a machine learning method to compute the invariant distribution of randomly perturbed dynamical systems modeled by SDEs. We considered two scenarios: in the first one, the force field is given, while in the second one, the force field is unknown but we have access to data of the deterministic dynamics. In each case, we proposed an appropriate loss function. The method learns the force field in the form of a decomposition suggested by the FP equation by minimizing the loss function. The two components of the decomposition are parameterized by neural networks, and the potential component of the decomposition gives the generalized potential of the invariant distribution.

The proposed method was shown to be effective in various systems with different features, including a dynamical system with two meta-stable states, a biological network model, and a system in high dimensions. In all these examples, the numerical results agreed well with the respective reference solutions. The advantage of parameterizing the generalized potential rather than the invariant density function directly was also demonstrated in one of the examples. Furthermore, the method is data-driven in the sense that it does not require any prior knowledge of the force field other than the data of the dynamics. The method enables us to study the equilibrium properties of practical dynamical systems in high dimensions at low temperatures.

In the current work, we considered systems with known noise covariance structure. Also, the data was sampled from the deterministic dynamics. In an ongoing work, we extend this method to learn the invariant distribution together with the structure of the noise from noisy data.

## Appendix A Solving the Fokker-Planck equation using the finite difference method

Consider the two-dimensional Fokker-Planck equation

$$\nabla \cdot J = 0, \quad (x, y) \in \Omega, \tag{21}$$

where $\Omega = (x_{\min}, x_{\max}) \times (y_{\min}, y_{\max})$, and the probability flux $J = (J_1, J_2)^T$ is given by

$$\begin{aligned} J_1 &= -f_1(x, y)\,p(x, y) + \epsilon_1 \partial_x p(x, y), \\ J_2 &= -f_2(x, y)\,p(x, y) + \epsilon_2 \partial_y p(x, y). \end{aligned} \tag{22}$$

Equation (21) is supplemented with the no-flux boundary condition

$$J(x, y) \cdot n(x, y) = 0, \quad (x, y) \in \partial\Omega, \tag{23}$$

where $n$ is the outward normal of the boundary of $\Omega$, and the normalization condition

$$\int_\Omega p(x, y)\,\mathrm{d}x\,\mathrm{d}y = 1. \tag{24}$$

We discretize the domain $\Omega$ using a uniform mesh with grid points $(x_i, y_j)$, where $x_i = x_{\min} + ih_x$, $y_j = y_{\min} + jh_y$, $0 \le i \le N_x$, $0 \le j \le N_y$, and $h_x = (x_{\max} - x_{\min})/N_x$, $h_y = (y_{\max} - y_{\min})/N_y$. We denote the mid-points of the mesh (i.e. cell centers) by $(x_{i-1/2}, y_{j-1/2})$, where $x_{i-1/2} = (x_{i-1} + x_i)/2$, $y_{j-1/2} = (y_{j-1} + y_j)/2$, $1 \le i \le N_x$, $1 \le j \le N_y$. The solution for $p$ is computed at the cell centers.

The first component of the flux at the point $(x_i, y_{j-1/2})$ is approximated by

$$J_1^{i, j-1/2} = -f_1(x_i, y_{j-1/2}) \cdot \frac{1}{2}\big(p_{i-1/2, j-1/2} + p_{i+1/2, j-1/2}\big) + \frac{\epsilon_1}{h_x}\big(p_{i+1/2, j-1/2} - p_{i-1/2, j-1/2}\big), \tag{25}$$

where $p_{i-1/2, j-1/2}$ denotes the approximate solution for $p$ at the cell center $(x_{i-1/2}, y_{j-1/2})$. Similarly, the second component of the flux at the point $(x_{i-1/2}, y_j)$ is approximated by

$$J_2^{i-1/2, j} = -f_2(x_{i-1/2}, y_j) \cdot \frac{1}{2}\big(p_{i-1/2, j-1/2} + p_{i-1/2, j+1/2}\big) + \frac{\epsilon_2}{h_y}\big(p_{i-1/2, j+1/2} - p_{i-1/2, j-1/2}\big). \tag{26}$$

In the Fokker-Planck equation (21), we approximate the derivatives using centered differences. This yields

$$\frac{1}{h_x}\big(J_1^{i, j-1/2} - J_1^{i-1, j-1/2}\big) + \frac{1}{h_y}\big(J_2^{i-1/2, j} - J_2^{i-1/2, j-1}\big) = 0, \quad 1 \le i \le N_x,\ 1 \le j \le N_y. \tag{27}$$

The no-flux boundary condition gives

$$J_1^{0, j-1/2} = J_1^{N_x, j-1/2} = 0, \quad 1 \le j \le N_y, \qquad J_2^{i-1/2, 0} = J_2^{i-1/2, N_y} = 0, \quad 1 \le i \le N_x. \tag{28}$$

Equations (27)-(28) form a linear system

$$A\mathbf{p} = 0, \tag{29}$$

where $A \in \mathbb{R}^{N_xN_y \times N_xN_y}$ and $\mathbf{p} \in \mathbb{R}^{N_xN_y}$ is the vector formed by $p_{i-1/2, j-1/2}$, $1 \le i \le N_x$, $1 \le j \le N_y$. The normalization condition (24) is approximated by

$$h_x h_y \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} p_{i-1/2, j-1/2} = 1. \tag{30}$$

Equations (29)-(30) determine the unique solution for $\mathbf{p}$. We solve these equations to obtain an approximate solution for the invariant density function.
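
The same finite-volume construction is easy to exercise in one dimension, where the no-flux condition forces the stationary flux to vanish at every grid point and the scheme reduces to a two-term recurrence. The sketch below (with an assumed double-well force, not a system from the paper) compares the result with the Boltzmann density:

```python
import numpy as np

# Sketch: the finite-volume scheme of Appendix A in one dimension. With the
# no-flux condition, J_i = 0 at every grid point, which gives a recurrence
# for the cell-center densities. For f = -U' the result should match the
# Boltzmann density p ∝ exp(-U/eps).
eps, xmin, xmax, N = 0.2, -2.5, 2.5, 2000
h = (xmax - xmin) / N
xg = xmin + h * np.arange(N + 1)              # grid points x_0..x_N
xc = 0.5 * (xg[:-1] + xg[1:])                 # cell centers
f = xg - xg**3                                # f = -U', U = x^4/4 - x^2/2

p = np.empty(N)
p[0] = 1.0
for i in range(1, N):
    # J_i = -f_i*(p_{i-1/2}+p_{i+1/2})/2 + eps*(p_{i+1/2}-p_{i-1/2})/h = 0
    p[i] = p[i - 1] * (eps / h + 0.5 * f[i]) / (eps / h - 0.5 * f[i])
p /= h * p.sum()                              # normalization (30), 1-D version

U = xc**4 / 4 - xc**2 / 2
p_exact = np.exp(-U / eps)
p_exact /= h * p_exact.sum()
assert np.max(np.abs(p - p_exact)) < 1e-2 * p_exact.max()
```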

## Appendix B Computing the generalized potential using the loss function (4)

In Example 1, for the purpose of comparison, we implemented the method proposed in Ref. [zhai2020deep] with the loss function (4). The invariant distribution is parameterized by $p_\theta(x) = \phi(N_\theta(x))$, where $N_\theta$ is a vanilla neural network with two hidden layers and the activation function tanh, and $\phi$ is a positive function acting on the output of the network, introduced to guarantee the positivity of the density function. In the loss function (4), we replace the Monte Carlo estimator $\tilde{p}$ with the finite difference solution $p$ and take

$$L = \frac{1}{N}\sum_{i=1}^N |\mathcal{N}p_\theta(x_i)|^2 + \frac{1}{M}\sum_{i=1}^M |p_\theta(y_i) - p(y_i)|^2, \tag{31}$$

where $x_i$, $1 \le i \le N$, and $y_i$, $1 \le i \le M$, are data points sampled from the uniform distribution on $\Omega$. The training problem is solved by the "double shuffling" method used in Ref. [zhai2020deep], which alternately performs gradient-descent steps for minimizing the two separate terms in the loss function (31). The contour plots of $p_\theta$ on $\Omega$ at two values of $\epsilon$ are shown in the right panel of Fig. 1 in the paper.
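
A minimal sketch of the alternating update, with two illustrative quadratic loss terms standing in for the two terms of (31):

```python
import numpy as np

# Sketch of the "double shuffling" idea: alternate gradient-descent steps on
# two separate loss terms instead of descending on their sum. The two
# quadratic terms below share a minimizer and are purely illustrative.
rng = np.random.default_rng(3)
theta = rng.normal(size=2)
target = np.array([1.0, -1.0])
lr = 0.1
for it in range(500):
    grad1 = 2 * (theta - target)     # gradient of loss term 1
    theta -= lr * grad1
    grad2 = 4 * (theta - target)     # gradient of loss term 2
    theta -= lr * grad2
assert np.allclose(theta, target, atol=1e-6)
```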

## Appendix C Proof of Eq. (20)

Consider the following dynamical system in the $d$-dimensional space,

$$\dot{y} = h(y) + \sqrt{2\epsilon}\,\xi, \quad t > 0, \tag{32}$$

where $\xi$ is a $d$-dimensional white noise. Let $B$ be a constant invertible $d \times d$ matrix, and let $x = By$. Then the process $x$ satisfies the equation

$$\dot{x} = Bh(B^{-1}x) + \sqrt{2\epsilon}\,B\xi, \quad t > 0. \tag{33}$$

The Fokker-Planck equations associated with the system (32) and the system (33) are respectively given by

$$-\nabla_y \cdot \big(h(y)\,p_y(y)\big) + \epsilon \nabla_y \cdot \big(\nabla_y p_y(y)\big) = 0, \quad y \in \mathbb{R}^d, \tag{34}$$

$$-\nabla_x \cdot \big(Bh(B^{-1}x)\,p_x(x)\big) + \epsilon \nabla_x \cdot \big(BB^T \nabla_x p_x(x)\big) = 0, \quad x \in \mathbb{R}^d. \tag{35}$$

It is straightforward to verify that

$$p_x(x) = p_y(B^{-1}x) \tag{36}$$

satisfies Eq. (35). In Example 3, the system (32) is composed of five independent two-dimensional systems of the same form, therefore the probability density $p_y$ is given by

$$p_y(y) = \prod_{k=1}^5 p_0(y_{2k-1}, y_{2k}), \tag{37}$$

where $p_0$ is the invariant distribution of the two-dimensional system (19). Let $V_0 = -\epsilon \log p_0$. Then from Eqs. (36)-(37), the generalized potential of the system (33) is given by

$$V_x(x) = \sum_{k=1}^5 V_0(y_{2k-1}, y_{2k}), \tag{38}$$

where $y = B^{-1}x$.

## Acknowledgements

The work of B. Lin and W. Ren was partially supported by A*STAR under its AME Programmatic Programme: Explainable Physics-based AI for Engineering Modelling & Design (ePAI) [Award No. A20H5b0142]. Q. Li is supported by the National Research Foundation of Singapore, under the NRF Fellowship NRFF13-2021-0106.