Partial Differential Equations is All You Need for Generating Neural Architectures – A Theory for Physical Artificial Intelligence Systems

03/10/2021 ∙ by Ping Guo, et al. ∙ Xi'an Jiaotong-Liverpool University Beijing Normal University Harbin Institute of Technology 420

In this work, we generalize the reaction-diffusion equation in statistical physics, Schrödinger equation in quantum mechanics, Helmholtz equation in paraxial optics into the neural partial differential equations (NPDE), which can be considered as the fundamental equations in the field of artificial intelligence research. We take finite difference method to discretize NPDE for finding numerical solution, and the basic building blocks of deep neural network architecture, including multi-layer perceptron, convolutional neural network and recurrent neural networks, are generated. The learning strategies, such as Adaptive moment estimation, L-BFGS, pseudoinverse learning algorithms and partial differential equation constrained optimization, are also presented. We believe it is of significance that presented clear physical image of interpretable deep neural networks, which makes it be possible for applying to analog computing device design, and pave the road to physical artificial intelligence.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 2

page 3

page 4

page 5

page 6

page 7

page 13

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In recent years, the field of artificial intelligence has made rapid progress, especially for the field of digital artificial intelligence (AI). Now researchers pay attention to physical artificial intelligence (PAI), and believe that PAI will be the most potential field in robot field in the next decades[1]. To develop PAI effectively, we need the solid foundation from theoretical physics. We believe that theories and methods from physics are able to pave the road to PAI, and it has being boosted artificial intelligence methods.

We have known in physics, theoretical work is said to be from first principles, if it does not make assumptions. While in physics, the principle of least action (PLA) is a very elegant and important discovery of mechanics[2] [3]. The PLA can be used to derive Newtonian, Lagrangian and Hamiltonian equations of motion, and even general relativity. The physicist Richard Feynman, demonstrated how this PLA can be used in quantum mechanics with path integral calculations [4]. It is well known that most of the fundamental laws in Physics are described with partial differential equations. As Stephen Wolfram said, “Indeed, it turns out that almost all the traditional mathematical models that have been used in physics and other areas of science are ultimately based on partial differential equations[5].

Now question is what is the first principle of AI? Is there any fundamental equations for AI field? In our viewpoint, if there exists the first principle for AI research field, we believe it is this PLA [6]. Also, as the problem asked by Yann LeCun in IJCAI 2018, “What is the equivalent of ‘Thermodynamics’ in intelligence? [7], this has inspired us to seek the simple rule to build a world model, in other words, to establish the fundamental equations for AI with physics laws inspired methods.

In this work, we developed the generalized nonlinear parabolic partial differential equations (PPDE), which can be considered as the fundamental equations in the AI research field. And we also show that the reaction-diffusion equation in statistical physics, Schrödinger equation in quantum mechanics, Helmholtz equation in paraxial optics, are all examples of PPDEs. With finite element method (FEM) and the finite difference method (FDM), we discrete PPDEs into neural partial differential equations (NPDE). The NPDEs can be used to describe an AI systems, such as the synergetic learning systems.

Ii Background

Here we briefly review the synergetic learning systems (SLS) firstly [8].

Ii-a SLS theory

Our proposed SLS is an information processing systems based on the PLA, and the dynamics of the SLS should be described by differential dynamic equations. What kind of equations is this differential dynamic equation? It is known that when generalizing the equation to a multi-component systems, we can obtain a variation of the “free energy” :

(1)

in which , is the diffusion matrix, and .

is called convection vector and

is reaction vector[9].

Therefore, the dynamics of the SLS is described by the reaction-diffusion equations (RDE). In statistical physics, the dissipative systems theory is a theoretical description of the self-organization of non-equilibrium systems, and the RDEs is utilized to model this systems also [10]. If the SLS is designed as a differential dynamical system and attention is paid to the attractor subnetwork [11], the RDEs can also be used to describe this systems.

Ii-B Turing’s Equations

Turing’s equations is a two-component RDEs [12], and is the mechanism for pattern formation[13]. When let and , the mathematical expression of Turing’s equations is

(2)

In the literature, Prigogine proposed the theory of the dissipative systems, believing that “the energy exchange with the outside world is the fundamental reason for making the system orderly (contrary to the principle of entropy increase),” and founded a new discipline called non-equilibrium statistical mechanics [14], and the RDEs is utilized to model the dissipative structures.

Iii Neural Partial Differential Equations

The RDEs in statistical physics is widely used, it belongs a family of semi-linear parabolic partial differential equations (PDE). We believe that nonlinear PDEs can be utilized to describe an AI systems, and it can be considered as a fundamental equations for the neural systems. Following we will present a general form of neural PDEs.

Now we use matrix-valued function , and , the elliptic operator is defined as [15]:

(3)

This is a more general form of a second-order divergence form nonlinear elliptic differential operator. When matrix-valued function is only function of , the operator is linear elliptic differential operator. Especially, if matrix-valued function is symmetric and positive definite for every , it said to be a uniformly elliptic operator.

Please note that this elliptic operator form is the same with right side of Eq. (1). The matrix is the diffusion matrix, the second term in right side of Eq. (3) is the convection vector, and the third therm is the reaction vector.

In mathematics, Laplace operator is expressed as:

(4)

In the theory of partial differential equations, elliptic operators are differential operators that generalize the Laplace operator, since the Laplace operator is obtained by taking .

In mathematics, a PDE is an equation which imposes relations between the various partial derivatives of a multivariable function. And second-order linear PDEs are classified as either elliptic, hyperbolic, or parabolic. Because we will discuss nonlinear PDEs, it has no mathematical strictly defined the neural PDEs belong to parabolic, hyperbolic , or elliptic types. Following we just take several parabolic PDEs besides RDEs.

Iii-a Fisher’s Equations

Fisher’s Equation, also known as the Kolmogorov-Petrovsky-Piskunov (KPP) equation, or Fisher–KPP equation[16]. The mathematical expression is

(5)

Fisher’s equation belongs to the class of RDE, in fact, it is one of the simplest semi-linear RDEs, the one which has the inhomogeneous term.

Iii-B Heat Equations

The heat equation is a certain partial differential equation in mathematics and physics, it has the form[17]

(6)

where denotes a general point of the domain. Or right side of the equation written in Laplacian,

(7)

When study thermal-optic effect in nonlinear optical medium, the heat equation has the form [18][19][20]:

(8)

where is temperature, , and , are heat conductive coefficient, medium density, medium specific heat capacity and light intensity, respectively.

Iii-C Schrödinger Equations

The Schrödinger equation is a linear partial differential equation that governs the wave function of a quantum-mechanical system[21], the most general form is the time-dependent Schrödinger equation:

(9)

where is imaginary index, is the state vector of the quantum system, is time, and is an observable, the Hamiltonian operator.

The Schrödinger equation does have the same form with heat equation, when we perform Wick rotation[22].

For example, one-dimensional Schrödinger equation for a free particle is

(10)

Take the Wick rotation . Then the Schrödinger equation (10) turns into

(11)

where . Eq. (11) is a homogeneous heat equation with diffusion coefficient . Conversely, apply the Wick rotation to the one-dimensional homogeneous heat equation

(12)

Then the resulting equation is

(13)

where . Eq. (13) is a Schrödinger equation for a free particle with .

This example indeed shows an intriguing relationship between Schrödinger equation and heat equation.

Also, by Wick rotation , we have the Schrödinger equation for a particle in a potential :

This equation has the same form with heat diffusion equation.

Iii-D Paraxial Helmholtz Equations

In mathematics, the eigenvalue problem for the Laplace operator is known as the Helmholtz equation. It corresponds to the linear partial differential equation:

where is the Laplace operator, is the eigenvalue, and

is the eigenfunction. When the equation is applied to waves,

is known as the wave number. The Helmholtz equation has a variety of applications in physics, including the wave equation and the diffusion equation.

A wave is said to be paraxial if its wavefront normals are paraxial rays. In paraxial Optics, parabolic approximation, also called slowly varying envelope approximation (SVEA)[23], is utilized. In the paraxial approximation of the Helmholtz equation[24], the complex amplitude is expressed as

(14)

where represents the complex-valued amplitude which modulates the sinusoidal plane wave represented by the exponential factor. Then under a suitable assumption, approximately solves

(15)

where is the transverse Laplacian operator.

This is a parabolic partial differential equation, it represents waves propagating in directions significantly different from the z-direction. And it has important applications in optics science, where it provides solutions that describe the propagation of electromagnetic waves in the form of either paraboloidal waves[25] or Gaussian beams[26][27].

Iii-E General Neural PDE

Above we presented several parabolic PDEs, including backward parabolic PDEs. Please note that parabolic PDEs can also be nonlinear. For example, Fisher’s equation is a nonlinear PDE that includes the same diffusion term as the heat equation but incorporates a linear growth term and a nonlinear decay term.

Now we write a general form of neural PDEs, it is named nonlinear parabolic PDEs.

(16)

With quasi-linear approximation, Eq. (16) will be substituted with

(17)

Here quasi-linear approximation means matrices are the linear function of , while can be the nonlinear function of , for example, .

Iv Generating Neural Architectures

Here we briefly review the related work firstly.

Iv-a Two-model SLS

We discussed two-model SLS in[28][29], Fig. (1) shown our designed systems.

Fig. 1: The synergetic learning systems (SLS) using two models, one is system generative model (system A), the other is called system reduction model (system B).

The generative model is called the System A, and RDE is adopted to describe the system evolution procedure. We introduce an auxiliary model to solve inverse procedure problems also, this auxiliary model just is defined as reduction model (System B).

Iv-B Finite Difference Methods

At present, most studies on partial differential equations in mathematics use the finite element method (FEM) [30] and the finite difference method (FDM) [31] to obtain numerical solutions. Before giving our method, we first describe the preliminary of finite difference methods for discrete Laplacian. For nonlinear partial differential equations, the numerical solution method usually is used. [32][33][34]

In one-dimension case, the discrete Laplace operator is approximated as

(18)

This approximation is usually expressed via the following 3-point stencil

(19)

As we known most used Laplace operator is Laplace of Gaussian in image processing, the discrete Laplace operator is a 2D filter [35]:

(20)

corresponds to the (five-point stencil) finite-difference formula. When nine-point stencil is used, it includes the diagonals:

(21)

The stencil represents a symmetric, tridiagonal matrix. If an equidistant grid for a finite element unit is utilized, one will get a Toeplitz matrix.

Following we will present three most popular deep neural network (DNN) building blocks. DNN can be built with simple stacked blocks and/or combined cascade / parallel manner. In physics, when we study field distribution in the layered-wise structure medium, PDE will be used to find the system solution. The layered-wise structure medium model has some relationship with DNN model, for example, they have same layered-wise structure. This implies that DNN structure is discretized from layered-wise structure medium physical model.

Iv-C Convolutional Neural Networks

Suppose we use one component RDE to describe the System A in SLS as presented in Fig. (1).

(22)

Following we will discuss one dimensional signal processing with convolutional neural network (CNN) problem first.

Iv-C1 1d Cnn

For one dimensional case, operator . For extreme deep and wide neural network, is the width of a layer in network. As we presented in ref. [36], the ResNet-like MLP is developed. Now we assign and

is an activation function.

(23)

Taking Euler’s method to above partial differential equation, we have

(24)

When taking step , we get

(25)

Almost other’s work just let as a black box, only consider the network depth direction variation, such as in [37][38] [39]. Now we open this black box, to investigate the NPDE in (width) direction case, explain why this RDE is a NPDE and how it can generate neural architecture.

Here we describe the physical image our model. Imaging that the information is expressed with variable , which propagate through a 2-dimensional virtual medium, for example, a thin film. This thin film has no thickness, only has width () and length (). The information along direction propagation with a constant speed111Strictly speaking, light speed is not a constant in the nonlinear medium. Here we just neglect the nonlinear effect. , .

So using symbol or will only has a constant difference. Recall our infinite width and infinite depth neural network model [36], in which stands for network’s width, and (or ) means network’s depth.

Let us give our interpretable NPDEs with Fig. (2).

Fig. 2: The explanation of elliptic operator for neural network based on physics model described with RDE.

In Fig. (2), and is explained as network layer index (

), and network neuron position (

) index, respectively. Variables, which is explained as network’s depth and width variable, respectively. is the function of

matrix is related to the properties of the medium model, it can be considered as diffusion matrix in RDE. This 2D virtual medium is inhomogeneous medium when we assign matrix is the function of and , and just like weight matrix in DNN, each layer is not same, because is dependent of (or ) variables.

Combining Eq. (19), the elliptic operator in Eq. ( 25) (the operator part of the second term) can be written in the form:

(26)

Because we assign the modeled medium is inhomogeneous medium everywhere, matrix is unknown, it may take any values. So we let this stencil is

(27)

as a weight matrix to be learnt in CNN. The second term in Eq. ( 25) become

(28)

We explain this to be the -th neuron input at layer . Similar, from Fig. (2), we have

and so on .

Apparently, the elliptic operator plays a convolutional operation performance, in this example, the 1D convolutional kernel size is 3. Also, after convolutional operation is implemented, slice increase a small amount of quality, say .

The third term in Eq. (25) will be taken as pooling and/or activation nonlinear operation. For Fisher’s equation, we observed that . When

is a sigmoid function,

. (Suppose are seperable).

Please note that information propagation speed at most is light speed because of physics law restriction, when calculate pooling and/or activate operation, we use the function , that means pooling or activate operation is after finishing convolution operation. A general express is

, where stands for a very small quality. In this way, the nonlinear PDE is approximated with quasi-linear PDE.

Here we obtain the basic building block of 1D convolutional neural network from NPDE by applying FDM.

When we assign , , Eq. (23) reduced into Fisher’s Equation,

(29)

Iv-C2 2d Cnn

Fig. 3: Discretized diffusion equation as convolutional neural network. The time slice is equivalent to layered structure along direction.

Our idea is that incorporating diffusion matrix with discrete transverse Laplace operator expression Eq. (21), the combined filter is assigned as a learnable filter, the elements in this filer will be learnt like the convolutional kernel in CNN. In fact, following filter is a special case of the Elliptic operator.

(30)

Here we can regard this filter as a second order derivative operator now.

With above discussion, discrete NPDE to the ResNet structure becomes very easy. We have known that ResNet can be regarded as ODE net, while discussed above showed how the ResNet is obtained by discretizing RDE.

Eq. (30) shows 9 point stencils, it is equivalent to convolutional kernel. Visualized expression is shown in Fig. (3). Apparently, it will proceed two periods in right side part of Eq. (24) when left side step one in ResNet. In other words, in Fig. (3) for ResNet.

Fig. 4: Multi-channel with full size convolutional kernel is equivalent to MLP network structure.

Iv-D Full Connection–DNN (MLP)

MLP is well studied structure of DNN in the literatures.

As we have proved in [36], the full connected MLP is equivalent to full size kernel convolution in addition with multi-channel operation. The channel number is desired neuron number in layer. Fig. (4) illustrates this relationship between full link layer and full size kernel with multi-channel design architecture.

For example, suppose the -th layer and -th layer has and neurons, respectively, information at channel propagation to reach the activation function input is

(31)

We need to design channels to get neurons at -th layer.

As discussed in ref. [36], random vector functional link network (RVFL) can evolve to ResNet-like MLP with our design. The ResNet-like MLP’s layers output is222The symbol in Eqs. (32) and (33) has the same meaning with in this paper.

(32)

This is the same with Eq. (3) in He et al’s paper[40].

(33)

From the literatures, we known that this equation is discrete ordinary differential equation by Euler’s method. While in this work, we consider not only ordinary differential equation, but also partial differential equation. Eq. (

31) is just the second term of the Eq. (25), when it is fed into an activation function of hidden neuron , the will be figured out.

Also quasi-linear approximation is utilized to solve nonlinear PDE.

Above we presented full connection NN such as MLP, the weight layer also can be regarded as full size convolution kernel combine with multi-channel operator [36], it is natural transferring to NPDE also according our idea.

Iv-D1 Restricted Boltzmann Machine

Restricted Boltzmann Machine (RBM) was proposed by Hinton et al in 2006 [41]

, which can be regarded as a generative stochastic neural network. In fact, the topological structure of RBM essential is the same with a full link single layer feed forward neural network, but the connecting weights and neurons are stochastic. So RBM can be generated with stochastic partial differential equation.

When using the FEM and the FDM to discrete RDEs, we know that if using a forward difference at time and a second-order central difference for the space derivative at position we get the recurrence equation:

(34)

This is an explicit method for solving the one-dimensional inhomogeneous heat equation with diffusion matrix .

We can obtain from the other values this way:

(35)

where .

If we use the backward difference at time and a second-order central difference for the space derivative at position we get the recurrence equation:

(36)

This is an implicit method for solving the one-dimensional heat equation.

We can obtain from solving a system of linear equations:

(37)

In explicit method (forward Euler), we regard in Eq. (35) as a hidden neuron. While in implicit method (backward Euler), we take in Eq, (37) as a visible neuron. Please note traditional heat equation only contains Laplace operator. When we take a general Elliptic operator, on considering and are learnable coefficients. with discrete Laplace operator stencil stands for connecting weight, one part

stands for bias vector, the other part is used as activation function, if we split

into two parts. Combining Eq. (17), Eq. (35) and Eq. (36

), we will get energy configuration of the RBM. And the free energy can be used as a loss function to perform energy-based learning

[42].

Iv-E Recurrent Neural Network

A recurrent neural network (RNN) is widely used in temporal sequence processing, it can exhibit temporal dynamic behavior [43]. Hopfield networks is special kind of RNN, while modern Hopfield networks proposed by Ramsauer et al [44] may have broadly applications compared with attention mechanics [45].

RNNs have many variants, such as Long Short-Term Memory model (LSTM)

[46], LSTM based model can be used to predict COVID-19 [47]. Here we just discuss the basic RNNs. Basic RNNs are a network of neuron-like nodes organized into successive layers. Fig. (5) shows unfolded basic recurrent neural network.

Fig. 5: Unfolded basic recurrent neural network. (Refer to [48])

The mathematical expression of basic RNN is [43]

(38)

This form belongs to Elman network [49].

The output is a single layer full connection network, this kind structure can be generated with Eq. (22), and it has been presented at subsection (IV-D). Following we will discuss hidden layer output .

Suppose in Eq. (22) is a diagonal matrix, and . Now let (traveling wave case), we have

(39)

When combining forward difference and a second order central difference for variable, we get the recurrence equation

Eq. (IV-E) can be rewritten in the form [50]:

(41)

Where

, , and .

Define

(42)

Rewrite Eq. (41) in the form

(43)

Especially, when we still consider information diffusion in a finite time, quasi-linear approximation can be applied to NPDE. Or above equation is the case of assigning activation function to be a linear function. Please note that NPDE contains spacial and time variables, it can be applied to describe spatiotemporal dynamic systems [51] also.

V NPDEs Solver

In the literatures, linear or semi-linear parabolic PDEs are well studied, but only a few simple cases can obtain analytical form solutions, such as Fisher’s equation. An integral transform may transform the PDE to a simpler one, in particular, a separable PDE. This corresponds to diagonalizing an operator[52]. In the book of “Nonlinear Reaction-Diffusion- Convection Equations[52], an important example of integral transform is shown, which is Fourier analysis. Fourier analysis diagonalizes the heat equation using the eigenbasis of sinusoidal waves. In the case of the domain is finite or periodic, an infinite sum of solutions such as a Fourier series is appropriate, but an integral of solutions such as a Fourier integral is generally required for infinite domains. The solution for a point source for the heat equation is an example of the utilization of a Fourier integral. For classical RDE solution problems, extensive studies have been done, investigation of various initial and boundary conditions can be found in the literatures, for example, in these books [53][54].

However, traditional PPDE Solver requires the system configuration function such as diffusion matrix should be given, but for NPDEs we suppose that the diffusion matrix has parameters which are determined by learning, that is the optimization algorithm should be adopted to develop the NPDEs Solver. With physical interpretation of our model, we image that the information wave pass through a layered medium, the state is described with NPDEs. When solving NPDEs with numerical method, we can use nine-point stencil (in 2D case) finite-difference formula to find numerical solutions. When input information is given, this can be regarded as initial condition for NPDEs. Because we don’t know the medium parameter matrix , we can just simple initialize them with random values. As for , we can assign it as a nonlinear operator, which play the role such as activating operation and/or neurons pooling. Then we can compute the value of at , get . With these values, we can compute loss function . Consequently, we can use the second equation to find network parameter matrix (equivalent to determining coefficient matrix ). From these description, we can know that NPDEs is forward, while optimization algorithm is backward scenario. In Appendix A, Eq. (A-B4

) describes these forward and backward scenarios mathematically. Once these configuration matrices are learnt, NPDEs are transferred into traditional deterministic PDEs. We know that almost all physical laws are described by PDEs, that means the physics systems are deterministic because they obey PDEs. A real physical processes governed by the PDEs will have deterministic solution. So when knowing the initial and boundary conditions, solving analytically or numerically these PDEs, we would know the future of a physical system for all times. This may give us a clue to solve the problem that to get expected effect is very difficult in deep learning, since that future states of a PAI systems would be precisely theoretically predicted by solution of the PDE.

Following we present several learning algorithms for finding numerical solutions of NPDEs.

V-a Supervised Learning for NPDE

Those classification/regression problems usually belong to supervised learning cases, we can writer the loss function as

norm with weight decay regularization. Refer to [55], the loss function can be written in the form:

(44)

here is network output at the , is weight parameter, and is expected output.

The algorithm is trivial, it can be described as follows:

Step 1

Give initial and boundary condition and , respectively.

Step 2

Random set parameter , discretized elliptic operator value with step size and ).

Step 3

Numerical solve NPDE, obtain the solution .

Step 4

Optimizing loss function Eq. (44) with either Adaptive moment estimation (ADAM) or L-BFGS algorithm to obtain parameter .

Step 5

If expected loss is reached, end optimization, otherwise jump to Step 3.

More details about learning strategies and classical optimization algorithms, such as ADAM, L-BFGS, are presented in Appendix A.

When taking the trivial algorithm to this NPDE, it has no advantage on learning speed. “God closes one door while opening another one for you”, the most important is its theoretical significance, we will present it on Section VI.

V-B Unsupervised Learning for NPDE

In our SLS paper [8], we have discussed that a systems can be described with nonlinear RDEs. In two-model SLS shown in Fig. (1), NPDEs can be used to model a generative network, and its inverse model is the reductive network[28]

. This framework can be utilized as unsupervised learning framework, since we can build an encoder-decoder framework with two-model SLS. In this encoder-decoder framework, System A is assigned to be a decoder, while System B is the encoder. In practice, we can apply approximated synergetic learning algorithm in order to reduce the system’s complexity

[56] [57].

V-C PDE-constrained Optimization

Above discussed can be considered as a traditional nonlinear PDE solver, here we suggest developing new PDE-constrained optimization method for parameters learning NPDEs.

The idea is to adopt some metrics to measure the difference between expected system output and real system output, while the system is constrained by NPDEs.

PDE-constrained optimization is a subset of mathematical optimization where at least one of the constraints may be expressed as a PDE

[58]. A standard formulation of PDE-constrained optimization is given by [59]:

(45)

where is the control variable, is regularization parameter and is the Euclidean norm. Closed-form solutions are generally unavailable for PDE-constrained optimization problems because of complexity, it is necessitating the development of numerical methods[60].

When we adopt PDE-constrained optimization method, we need to model the PAI systems with formal mathematical expression. In this work, we write the optimization equation as,

(46)

where is the system output, is expected system output, and is system configuration parameters (network parameters). In the equation constraint term, , and the is the same with Eq. (17). This loss function has the form of first order Tikhonov regularizer [61][62].

For supervised learning,

will be assigned as data labels. While for unsupervised learning problem, we can design an autoencoder system, and

will be assigned as given data representation function. In practice, there are two strategies can be used to deal with this kind optimization problems, that is, Discretize-then-optimize or Optimize-then-discretize. More about this topic, please refer to book[59].

Vi Discussion

By utilizing FEM and FDM, we discrete neural PDE to generate three popular building blocks for deep neural network architectures. The advantages of Neuronizing PDE have two folds.

  1. One is that we provide the theoretical foundation for applying AI to solve mathematical physics PDEs, this will bring various applications in the fields of physics, chemistry, and so on. Some tricks in deep learning, such as paddle method including extended, wrap, mirror, crop and so on, can be used to deal with boundary condition problem.

  2. The other is that we use NPDEs to describe the AI systems, this can give the interpretable AI based on statistical mechanics. NPDEs can be regarded as the theoretical foundation for physical artificial intelligence systems, which will provide theoretical guidance for designing AI analog computing device.

Some more words about these two folds are discussed as follows.

Vi-a AI for PDEs

Using the AI to solve PDEs has a been studied for a decades, why AI can be applied to the PDEs described problems successfully? Our work demonstrates that when adopting FEM and FDM to solve PDES numerically, it basically is equivalent to using deep neural network to do so. From this point of view, we think that our work provide theoretical foundation for applying AI to solving PDEs.

Recently, some researchers make great success in applying AI to solve PDEs. For example, Han et al have used deep learning to solve high-dimensional PDEs[63], including solving Hamilton - Jacobi -Bellman equation for control. They also applied DNN to solve many-electron schrödinger equation[64]. More recently, Pfau et al have developed a new neural network architecture, names FermiNet, to modeling the quantum state of large collections of electrons[65]. They believe that the FermiNet was the first demonstration of deep learning for computing the energy of atoms and molecules from first principles, and show how deep learning can help solve the fundamental equations of quantum mechanics for real-world systems.

Vi-B NPDEs for AI

In this work, we propose a general NPDEs as shown in Eq. (16). It can be considered as a fundamental equation for an AI systems, for example, a synergetic learning systems described with nonlinear RDEs [8].

Vi-B1 Interpretable AI

Using neural PDEs to describe an AI systems means that we should give an explainable AI systems. Taking a SLS shown in Fig. (1) as an example, we describe the PAI systems as following:

SLS can be regarded as a simple PAI system model, in which the information propagates through this system, and diffusion and reaction occurred in the system.

An information wave is different from an electromagnetic wave, which is the carrier of information, but the elliptic and parabolic equations in mathematical physics equations can be used to describe the process of information transmission. Methods to solve such equations depend on the complexity of the problem. At present, most studies on nonlinear partial differential equations in mathematics use the FEM and the FDM to obtain numerical solutions. In our earlier work, we used the heat diffusion equation to study the propagation of light beams in nonlinear media [26] [25] and the dynamics of interference filters [27]. The diffusion equation is a parabolic semi-linear PDE and can also be used to study dispersive optical tomography [66].

Vi-B2 Analog Computing Device Design

To realize a PAI systems, we should have hardware computing components to assemble the systems. Currently, photonics computing is a potential developing filed[67], here we just simple introduce optical analogy computing devices.

It is well known that wave-particle duality has been demonstrated for photons (light), electrons and other microscopic objects (de Broglie wave). Here we propose the hypothesis that information has wave-particle duality, since photons have long been considered as the carriers of information in optical computing. Compared with electrons computing, we believe that optical analogy computing may display the advantages of analog computer verse to optical digital computer. Our previous works include studies Gaussian beam propagation in a nonlinear medium [26][25], properties of optical interference filter device[19][20]. Sone more discussions about nonlinear diffusion PDEs which describe neural information processing will be presented elsewhere.

The advantage of using optical computing device is that it uses photons as information carriers, the propagation of information is at the speed of light. Also, when we take the passive transmission of optical waves, it can aids ultra-low power consumption. From our previous studies, we suggest that optical interference filters can be considered as the candidates of analog computing device[68][27]. For example, ZnSe film filter can be used as optical bistability device[69], but with proper design, it can be used as gate device to play the role of sigmoid type activation function[70][71].

Why does we consider to realize a PAI system with optical device? From Fourier optics principles we know that when an optical system consists of an input plane, and output plane, and a set of components. The output image is related to the input image by convolving the input image with the optical impulse response . If we design an optical device with layered micro-structure, at the input plane (input layer) the input image is

The output plane (layer) is at . The output image is

The 2D convolution of input function against the impulse response function

i.e.,

That is, when we assign the optical impulse response to be convolutional kernel, an optical system naturally plays convolution operation.

As Wu & Dai said, “a hybrid optical–electronic framework that takes advantage of both photonic and electronic processors could revolutionize AI hardware in the near future” [67]. Details discussion about this topic is beyond the scope of this paper, further research work may focus on the design of nonlinear integrated photonic computing architectures [72] [73] based on NPDEs.

Vi-B3 Materials Design

Neural PDEs can be applied to materials design for PAI systems, it is named “inverse design”. Design new materials is to find those unknown coefficients , and in Elliptic operator (Eq. (3)).

When obtaining the numerical values of those unknown function in each time slice, we model them with a functional expression. The obtained functions can be used for materials design. This is an interesting topic worthy of further study, Sanchez-Lengeling & Aspuru-Guzik’s review paper “

Inverse molecular design using machine learning: Generative models for matter engineering

[74] describes generative models for matter engineering.

Vi-C Insight into NPDEs

Here we give more discussions about neural PDEs.

Vi-C1 Difference with PDE Net

What is the difference with other’s ODE or PDE net?

We have discussed in [36], RVFL and MLP with direct link can evolve to NPDE, and we also shown that ResNet can evolve to NPDE with Elliptic operator. Previous research works, such as ODE and PDE net[37][38] [39], are transformation from discrete to continuous variables. Now we start from NPDEs, by taking FEM and FDM, it is transformation from continuous to discrete variables to make the problem become computable. As we have known, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This procedure can be regarded as quantized problem also. In PDEs numerical solution, there are two hyper-parameters and in Eq. (34), which control the depth and width of the NN structure, respectively. Also, from literatures, say book [33], we know that and control the error of PDEs solution.

Discretizing (Quantizing) PDEs can construct extreme deep and extreme wide neural architecture, this depends on the parameters and selection when FEM is applied. While PDEs with continuous variables can be applied to describe infinite deep and infinite wide neural architecture instead of infinite wide only neural networks [75]. In Eq. (4) of ResNet paper [40], summation term can be transformed to an integral form when network’s depth approach infinite. In verse, under sparse approximation, we can write an integral as summation[76]. The sparse approximation means we should assign relative large and parameters under a given finite volume of medium.

Vi-C2 Realizable PAI Systems

In order to realize a PAI systems, we must have solid theoretical foundations. It is notable that unlike ODEs, essentially all PDEs that have been widely studied come quite directly from physics. For example, the diffusion equation arises in physics from the evolution of temperature or of gas density, the wave equation represents the propagation of linear waves, such as along a compressible spring, along fluids. RDEs come from studies of dissipative structure, representing far from thermal equilibrium state in thermodynamics. In this work, we introduce NPDEs to describe the PAI systems, and several viewpoints are presented as follows.

  • Inspired by Blackbody radiation theory, radiated spectra from macro view is continues, from micro view is quantized. So mathematical method is FDM, its physical basis is quantization (discretization).

  • When we design a realizable PAI systems based on SLS architecture, it is mandatory that at least one sub-system should be real part, instead of two virtual sub-system. The real part sub-systems is consist of some components which can implement optical computation, say with materials of nonlinear optical medium.

  • Our developed nonlinear to quasi-linear PDEs approximated method makes NPDEs to be solvable. Quasi-linear approximation is reasonable because it has physical basis. According to special relativity theory, there is no possibility of superluminal velocity. So information propagation in a PAI systems has a limited speed, it cannot faster than light.

  • Strictly speaking, there is no absolute simultaneous computing convolution and activation / pooling operation in a PAI systems because of physics law restriction. So in a PAI systems, asynchronous computing is real case, time delay should be taken into account when solving NPDEs. Both asynchronous and time delay should be considered when we design a realizable PAI systems.

  • In physical world, causality is related to time, because “causes” can’t happen after the “consequences”. If we believe that the arrow of time is unidirectional, the NPDEs with time variable will allow us to predict the future behavior of a PAI system. It also allows us to gain insight in a PAI system, and develop an interpretable NN for causal inference.

Vi-C3 NPDEs Solution

First principle in quantum chemistry is ab initio methods, which is solving Schrödinger equation. In AI systems, we believe that first principle is PLA, which is solving NPDEs.

In solving NPDEs numerically, we should pay attention to following problems:

  • The initial conditions and boundary conditions for various real world applications.

  • Step parameters and selection problem, how to make trade-off between computation accuracy and complexity.

  • By applying FEM to NPDEs, how to establish relationship between Capsulate networks and elements obtained by FEM.

  • Compared with neural architecture search method, how to generate optimal neural architecture for the given condition (data set) with Truing’s equations.

Currently, we can use numerical methods to find NPDEs’ solution, however, like Navier-Stokes equations, one of the fundamental equations of fluid mechanics, it is considered very challenging mathematically to find analytical solutions. It may require mathematicians to develop new mathematics for solving neural PDEs analytically.

Vii Summary and Further Works

Vii-a Summary

Based on the first principle, we take free energy as the action, derive the RDE which used to describe SLS. In this work, we proposed the general neural PDEs to describe an AI systems. By utilizing FEM and FDM, we generate three most popular DNN building blocks by discretizing PDEs.

In addition to discuss generating neural architecture with NPDEs, we also involve the AI system optimization algorithm. The learning strategies, such as ADAM, L-BFGS, pseudoinverse learning algorithms and PDE - constrained optimization methods, are also presented. Based on physics law, we proposed quasi-linear approximation to solve nonlinear PDEs.

Most significant thing is that we present the clear physical image of interpretable deep neural networks, that is, an AI system is explained as information diffusion and reaction processing. When information is considered as wave propagation in an AI systems, analog optical computing device is suitable to construct a PAI system.

Classical PDEs describe the physics systems, while NPDEs describe the PAI system. By utilizing FEM and FDM, we built the bridge between mathematical physics PDEs and modern DNN architecture. In other words, DNNs is obtained by properly applying discretization to NPDEs with FEM and FDM. In this viewpoint, we can say that we developed a theory for PAI systems, currently deep neural networks are special case of NPDEs. Therefore, we intrinsically believe in that NPDEs would be the fundamental equations for a PAI systems, it provides the theoretical basis which makes it be possible for analog computing device design, and it paves the road to PAL also.

Vii-B Further Works

The path integral formulation is a description in quantum mechanics that generalizes the action principle of classical mechanics. It replaces the classical notion of a single, unique classical trajectory for a system with a sum, or functional integral, over an infinity of quantum-mechanically possible trajectories to compute a quantum amplitude.

In artificial intelligence, we assign free energy (FE) as an action , and RDE is derived. We intend to apply path integral formulation to find an optimal network architecture. The key idea is that we assign different NN architecture as different path to reach the minimum FE state, one NN structure is regarded as one system configuration (corresponding one path in Feynman’s interpretation). We believe it will be an alternative to neural architecture search method, and expect that it is more effective.

Acknowledgment

The research work described in this paper was fully supported by the National Key Research and Development Program of China (No. 2018AAA0100203).

-

References

  • [1] A. Miriyev and M. Kovač, “Skills for physical artificial intelligence,” Nature Machine Intelligence, vol. 2, no. 11, pp. 658–660, 2020. [Online]. Available: https://doi.org/10.1038/s42256-020-00258-y
  • [2] A. B. BASSET, “The principle of least action,” Nature, vol. 67, no. 1737, pp. 343–344, 1903. [Online]. Available: https://doi.org/10.1038/067343d0
  • [3] C. G. Gray, G. Karl, and V. A. Novikov, “Progress in classical and quantum variational principles,” Reports on Progress in Physics, vol. 67, no. 2, pp. 159–208, jan 2004. [Online]. Available: https://doi.org/10.1088%2F0034-4885%2F67%2F2%2Fr02
  • [4] R. P. Feynman, The Principle Of Least Action in Quantum Mechanics.   WORLD SCIENTIFIC, 2021/01/07 2005, pp. 1–69. [Online]. Available: https://doi.org/10.1142/9789812567635_0001
  • [5] S. Wolfram, A New Kind of Science.   Champaign, IL: Wolfram Media, 2002. [Online]. Available: https://www.wolframscience.com
  • [6] P. Guo, “What is the first principles for artificial intelligence (ver. 2),” Communications of the China Computer Federation, vol. 16, no. 10, pp. 53–58, October 2020, (in Chinese), [The First Principles for Artificial Intelligence (ver. 1), preprint, researchgate.net, https://doi.org/10.13140/RG.2.2.30526.72001] . [Online]. Available: https://dl.ccf.org.cn/institude/institudeDetail?_ack=1&id=5144072786659328
  • [7] Y. LeCun, “Learning world models: the next step towards AI,” Keynote, the 27th IJCAI, July 2018.
  • [8] P. Guo and Q. Yin, “Synergetic learning systems (I): Concept, architecture, and algorithms,” Preprint, researchgate.net (Chinese), https://doi.org/10.13140/RG.2.2.13681.12644/1; arXiv, (English), 01 2019, the third China Systems Science Conference (CSSC2019), Changsha, May 18-19, 2019. (Chinese version). [Online]. Available: https://arxiv.org/abs/2006.06367
  • [9] Q.-X. Ye, “Introduction to reaction diffusion equation,” Practice and understanding of mathematics, pp. 48–56, 1984-02, in Chinese.
  • [10] J. C. Willems, “Dissipative dynamical systems part I: General theory,” Archive for Rational Mechanics and Analysis, vol. 45, no. 5, pp. 321–351, 1972. [Online]. Available: https://doi.org/10.1007/BF00276493
  • [11] Domnisoru, Cristina, K. ans Amina A., Tank, and D. W., “Membrane potential dynamics of grid cells,” Nature, vol. 495, pp. 199–204, 2013.
  • [12] A. M. Turing, “The chemical basis of morphogenesis,” Philosophical Transactions of the Royal Society of London, Series B, vol. 237, no. 641, pp. 37–72, 01 1952.
  • [13] J. E. Pearson, “Complex patterns in a simple system,” Science, vol. 261, no. 5118, pp. 189–192, Jul 1993. [Online]. Available: http://dx.doi.org/10.1126/science.261.5118.189
  • [14] G. Nikolis and I. Prigogine, Self-Organization in Non-Equilibrium Systems.   New York: Wiley, 1977.
  • [15] R. O. Wells, Elliptic Operator Theory.   New York, NY: Springer New York, 2008, pp. 108–153. [Online]. Available: https://doi.org/10.1007/978-0-387-73892-5_4
  • [16] G. Adomian, “Fisher-kolmogorov equation,” Applied Mathematics Letters, vol. 8, no. 2, pp. 51 – 52, 1995. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089396599500010N
  • [17] A. Friedman, Partial differential equations of parabolic type.   Englewood Cliffs, N.J.: Prentice-Hall, 1964. [Online]. Available: //catalog.hathitrust.org/Record/000659848
  • [18] J. Y. Bigot and G. J. B., “Switching dynamics of thermally induced optical bistability: Theoretical analysis,” Phys. Rev. (A), vol. 35, no. 9, pp. 810–816, 1987.
  • [19] P. Guo, L. Chen, and Y.-G. Sun, “Laser pulse induced thermal distribution in interference filter,” Journal of Beijing Normal University (Natural Science Edition), vol. 29, no. 1, pp. 82–86, 3 1993, (in Chinese), Received: 1992-09-03. [Online]. Available: http://www.cnki.com.cn/Article/CJFDTOTAL-BSDZ199301018.htm
  • [20] P. Guo and Y.-G. Sun, “Thermo–optic dynamics analysis of interference filter bistable device,” Journal of Beijing Normal University (Natural Science Edition), vol. 29, no. 2, pp. 189–193, 6 1993, (in Chinese), Received: 1992-11-03. [Online]. Available: http://www.cnki.com.cn/Article/CJFDTOTAL-BSDZ199302009.htm
  • [21] D. J. Griffiths, Introduction to Quantum Mechanics.   Princeton, NJ, 2007.
  • [22] G. C. Wick, “Properties of bethe-salpeter wave functions,” Phys. Rev., vol. 96, pp. 1124–1134, Nov 1954. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRev.96.1124
  • [23] O. Svelto, Self-Focusing, Self-Trapping, and Self-Phase Modulation of Laser Beams.   Elsevier, 1974, vol. 12, pp. 1–51. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0079663808702634
  • [24] B. E. A. Saleh and M. C. Teich, Fundamentals of Photonics, 3rd ed.   Hoboken, NJ: John Wiley & Sons, Inc, 2019.
  • [25] P. Guo and Y.-G. Sun, “Gaussian beam propagation with nonlinear medium limiter,” Acta Optica Sinica, vol. 10, no. 12, 1990, (in Chinese). [Online]. Available: http://www.cnki.com.cn/Article/CJFDTOTAL-GXXB199012006.htm
  • [26] P. Guo, “Numerical solution of gaussian beam propagation in nonlinear gradient refractive media,” Laser Technology, vol. 14, no. 5, 1990, (in Chinese). [Online]. Available: http://www.cnki.com.cn/Article/CJFDTotal-JGJS199005009.htm
  • [27] P. Guo, A. A. S. Awwal, and C. L. P. Chen, “Dynamics of a coupled double-cavity optical interference filter,” Journal of Optics, vol. 46, no. 1, pp. 167–174, 1999.
  • [28] P. Guo, “Synergetic learning systems (II): Interpretable neural network model with statistical physics approach,” Preprint, researchgate.net, May 2019, the Fifth National Statistical Physics & Complex Systems Conference (SPCSC 2019), Hefei, July 26-29, 2019. [Online]. Available: https://doi.org/10.13140/RG.2.2.23969.66401
  • [29] ——, “Synergetic learning systems (III): Automatic organization and evolution theory of neural network architecture,” Preprint, researchgate.net, 9 2020, the Fourth China Systems Science Conference (CSSC2020), QingDao, September 19-20, 2020. [Online]. Available: https://doi.org/10.13140/RG.2.2.23186.07360
  • [30] O. C. Zienkiewicz, R. L. Taylor, and J. Z. Zhu, The Finite Element Method: Its Basis and Fundamentals.   Oxford: Butterworth-Heinemann, 2013, p. 756. [Online]. Available: https://doi.org/10.1016/B978-1-85617-633-0.00019-8
  • [31] Ferziger, J. H., and M. Perić, Finite Difference Methods.   Berlin, Heidelberg: Springer, 2002, in: Computational Methods for Fluid Dynamics.
  • [32] O. P. Iliev, S. D. Margenov, P. D. Minev, P. S. Vassilevski, and L. T. Zikatanov, Eds., Numerical Solution of Partial Differential Equations: Theory, Algorithms, and Their Applications, 1st ed., ser. Springer Proceedings in Mathematics & Statistics.   New York: Springer-Verlag, 2013, vol. 45.
  • [33] L. Lapidus, Numerical Solution of Partial Differential Equations in Science and Engineering.   Wiley-Interscience, 7 1999.
  • [34] J. Lu and Z. Guan, Numerical Solving Methods for Partial Differential Equations, 2nd ed.   Tsing-Hua University Press, 1 2004, (in Chinese).
  • [35] P. Guo, Q. Yin, and X.-L. Zhou, Image Semantic Analysis.   Beijing: Science Press, June 2015, (in Chinese). [Online]. Available: http://www.ecsponline.com/goods.php?id=174834
  • [36] P. Guo, “On the structure evolutionary of the pseudoinverse learners in synergetic learning systems,” Preprint, researchgate.net, 12 2020. [Online]. Available: https://doi.org/10.13140/RG.2.2.12262.45121
  • [37] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural ordinary differential equations,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 6572–6583. [Online]. Available: http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations
  • [38] Z. Long, Y. Lu, and B. Dong, “PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network,” Journal of Computational Physics, vol. 399, p. 108925, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0021999119306308
  • [39] Y. Lu, A. Zhong, Q. Li, and B. Dong, “Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations,” Proceedings of Machine Learning Research, J. G. Dy and A. Krause, Eds., vol. 80.   PMLR, 2018, pp. 3282–3291.
  • [40] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, 2016, pp. 630–645.
  • [41] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 7 2006.
  • [42] Y. Lecun, S. Chopra, R. Hadsell, M. Ranzato, and F. J. Huang, “A tutorial on energy-based learning,” 2006.
  • [43] Z. C. Lipton, “A critical review of recurrent neural networks for sequence learning,” 2015. [Online]. Available: http://arxiv.org/abs/1506.00019
  • [44] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, L. Gruber, M. Holzleitner, M. Pavlovic, G. K. Sandve, V. Greiff, D. P. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, “Hopfield networks is all you need,” CoRR, vol. abs/2008.02217, 2020. [Online]. Available: https://arxiv.org/abs/2008.02217
  • [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA pp. 5998–6008.
  • [46] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
  • [47] N. Zheng, S. Du, J. Wang, H. Zhang, W. Cui, Z. Kang, T. Yang, B. Lou, Y. Chi, H. Long, M. Ma, Q. Yuan, S. Zhang, D. Zhang, F. Ye, and J. Xin, “Predicting covid-19 in china using hybrid ai model,” IEEE Transactions on Cybernetics, vol. 50, no. 7, pp. 2891–2904, 2020. [Online]. Available: https://doi.org/10.1109/TCYB.2020.2990162
  • [48] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 5 2015.
  • [49] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990. [Online]. Available: https://doi.org/10.1207/s15516709cog1402_1
  • [50] T. W. Hughes, I. A. D. Williamson, M. Minkov, and S. Fan, “Wave physics as an analog recurrent neural network,” Science Advances, vol. 5, no. 12, p. eaay6946, 2019.
  • [51] J. W. Wang and H. N. Wu, “Design of suboptimal local piecewise fuzzy controller with multiple constraints for quasi-linear spatiotemporal dynamic systems,” IEEE Transactions on Cybernetics, pp. 1–13, 2019.
  • [52] R. Cherniha, M. Serov, and O. Pliukhin, Nonlinear Reaction-Diffusion-Convection Equations, 1st ed.   New York: Chapman and Hall /CRC, 11 2017, vol. 104.
  • [53] C.-V. Pao, Nonlinear parabolic and elliptic equations.   Boston, MA: Springer US, 1992. [Online]. Available: https://doi.org/10.1007/978-1-4615-3034-3
  • [54] R. Cherniha and V. Davydovych, Nonlinear reaction-diffusion systems.   Springer, 2017, vol. 2196, conditional Symmetry, Exact Solutions and their Applications in Biology.
  • [55] C. M. Bishop, Pattern Recognition and Machine Learning.   Springer-Verlag New York, 2006, chapter 10: Approximate Inference.
  • [56] P. Guo, J.-X. Hou, and B. Zhao, “Methodology for building synergetic learning system,” Preprint, researchget.net, (ICCS 2020, English version), 2019, the third China Systems Science Conference (CSSC2019), Changsha, May 18-19, 2019. (Chinese version). [Online]. Available: http://dx.doi.org/10.13140/RG.2.2.10146.07368
  • [57] Q. Yin, B. Xu, K. Zhou, and P. Guo, “Bayesian psudo-inverse learners: From uncertainty to deterministic learning,” 2020, submitted to IEEE Trans. on Cybernetics.
  • [58] G. Leugering, P. Benner, S. Engell, A. Griewank, H. Harbrecht, M. Hinze, R. Rannacher, and S. Ulbrich, Eds., Trends in PDE Constrained Optimization, 4th ed., ser. International Series of Numerical Mathematics.   Birkhäuser, Cham, 2014, vol. 165.
  • [59] H. Antil, D. P. Kouri, M. Lacasse, and D. Ridzal, Eds., Frontiers in PDE-Constrained Optimization, 1st ed., ser. The IMA Volumes in Mathematics and its Applications.   New York: Springer-Verlag, 2018, vol. 163.
  • [60] H. Antil, M. Heinkenschloss, R. H. W. Hoppe, and D. C. Sorensen, “Domain decomposition and model reduction for the numerical solution of PDE constrained optimization problems with localized optimization variables,” Computing and Visualization in Science, vol. 13, no. 6, pp. 249–264, 2010.
  • [61] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-Posed Problems.   Hoboken & New York: Wiley, 1977, chapter 2.
  • [62] P. Guo, M. R. Lyu, and C. L. P. Chen, “Regularization parameter estimation for feedforward neural networks,” IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 33, no. 1, pp. 35–44, 2003. [Online]. Available: https://doi.org/10.1109/TSMCB.2003.808176
  • [63] J. Han, A. Jentzen, and W. E, “Solving high-dimensional partial differential equations using deep learning,” Proceedings of the National Academy of Sciences, vol. 115, no. 34, pp. 8505–8510, 2018.
  • [64] J. Han, L. Zhang, and W. E, “Solving many-electron schrödinger equation using deep neural networks,” Journal of Computational Physics, vol. 399, p. 108929, 09 2019.
  • [65] D. Pfau, J. S. Spencer, A. G. D. G. Matthews, and W. M. C. Foulkes, “Ab initio solution of the many-electron schrödinger equation with deep neural networks,” Phys. Rev. Research, vol. 2, p. 033429, Sep 2020.
  • [66] H. Niu, P. Guo, L. Ji, Q. Zhao, and T. Jiang, “Improving image quality of diffuse optical tomography with a projection-error-based adaptive regularization method,” Optics Express, vol. 16, no. 17, pp. 12 423–34, 2008.
  • [67] H. Wu and Q. Dai, “Artificial intelligence accelerated by light,” Nature, vol. 589, pp. 25–26, January 2021, news & Views. [Online]. Available: https://doi.org/10.1038/d41586-020-03572-y
  • [68] P. Guo, Y.-G. Sun, J. Xiong, W. Wang, and Z. Jiang, Nonlinear three-mirror Fabry-Perot multistability element.   Singapore: World Scientific, 1987, pp. 83–87. [Online]. Available: https://www.researchgate.net/publication/330955256_Nonlinear_three-mirror_Fabry-Perot_multistability_element
  • [69] Y.-G. Sun, Z. Jiang, J. Xiong, W. Wang, P. Guo, and S. Shang, Optical Bistability in Very Thin ZnSe Film.   Singapore: World Scientific, 1987, pp. 95–97.
  • [70] W. Wang, Y.-G. Sun, J. Xiong, Z. Jiang, P. Guo, and S. Shang, Low-power Optical Bistability in Improved ZnSe Interference Filters.   Singapore: World Scientific, 1987, pp. 57–59. [Online]. Available: https://www.researchgate.net/publication/348390364_Low-power_Optical_Bistability_in_Improved_ZnSe_Interference_Filters
  • [71] P. Guo and M. R. Lyu, “A new approach to optical multilayer learning neural network,” in Proceedings of International Conference on artificial intelligence (IC-AI’01), H. R. Arabnia, Ed., vol. I.   Las Vegas, Nevada, USA: CSREA Press, 5 2001, pp. 426–432. [Online]. Available: https://www.researchgate.net/publication/331463645_A_New_Approach_to_Optical_Multilayer_Learning_Neural_Network
  • [72] X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, and D. J. Moss, “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature, vol. 589, no. 7840, pp. 44–51, 2021.
  • [73]

    J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,”

    Nature, vol. 589, no. 7840, pp. 52–58, 2021.
  • [74] B. Sanchez-Lengeling and A. Aspuru-Guzik, “Inverse molecular design using machine learning: Generative models for matter engineering,” Science, vol. 361, no. 6400, pp. 360–365, 2018.
  • [75] R. M. Neal, Priors for Infinite Networks.   New York, NY: Springer New York, 1996, pp. 29–53, in Bayesian Learning for Neural Networks.
  • [76] P. Guo, C. L. P. Chen, and M. R. Lyu, “Cluster number selection for a small set of samples using the bayesian ying-yang model,” IEEE Trans. Neural Networks, vol. 13, no. 3, pp. 757–763, 2002. [Online]. Available: https://doi.org/10.1109/TNN.2002.1000144
  • [77]

    Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, “Automatically designing cnn architectures using the genetic algorithm for image classification,”

    IEEE Transactions on Cybernetics, vol. 50, no. 9, pp. 3840–3854, 2020. [Online]. Available: https://doi.org/10.1109/TCYB.2020.2983860
  • [78] D. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, ser. https://iclr.cc/archive/www/doku.php%3Fid=iclr2015:accepted-main.html, 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
  • [79] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, 3rd ed., ser. Texts in Applied Mathematics.   Springer-Verlag New York, 2002, vol. 12, p.243, Translated by Gautschi, W., Bartels, R., Witzgall, C.
  • [80] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, vol. 45, no. 1, pp. 503–528, 1989.
  • [81] W. Chen, Z. Wang, and J. Zhou, “Large-scale L-BFGS using mapreduce,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 1332–1340.

Appendix A Learning Strategies

A-a Learning Optimization

Suppose the information along direction diffuse, at the time the system output is . We define the loss function and suppose it is differential, the optimal parameter at given data set is

(47)

We can use gradient-based optimization (first order methods) methods, such as plain gradient, steepest descent, conjugate gradient, Rprop, stochastic gradient, and so on. When used to minimize the loos function, a standard (or “batch”) gradient descent method would perform the following iterations:

(48)

where is a step size ( learning rate). From ref. [55], we have

(49)

This is an ordinary differential equation for network parameter .

There are several optimization methods, for example, genetic algorithm[77], but most used method is gradient descent based method. Following we will review gradient descent algorithms.

A-B Gradient Descent Algorithms

A-B1 First Order Gradient Descent Algorithm

One can also use Adam (Adaptive Moment Estimation) optimizer [78]. Adam’s parameter update is given by:

(50)

where is a small scalar (e.g. ) used to prevent division by 0, and (e.g. 0.9) and (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting is done element-wise.

A-B2 Second Order Optimization Algorithm

We hope to use second order optimization algorithm, for example, Newton’s method:

(51)

where is a Hessian,

Sometimes we hope to reduce the complexity of given problem, certain approximation for solving neural PDEs is adopted.

A-B3 Pseudoinverse Approximation of Hessian

In Newton’s method, the Hessian is needed to compute, but it is very computational intensive. , for a complex neural network, the parameter number is very large, hence the Hessian is huge. Therefore we need to take some approximations to reduce the complexity. Here we try to find other method, not along quasi- Newton method, to investigate PDE network solver theoretically.

As we know, If is invertible (it is indeed), its pseudoinverse is its inverse. That is, [79]. Now we use pseudoinverse instead of inverse in Eq. (51).

(52)

Where denotes Hermitian transpose (conjugate transpose), because is a real and symmetric matrix, it has . But we would like keep the pseudoinverse form. Now Eq. (51) becomes

(53)

When we assign the matrix , Eq. (53) can be written in the form:

(54)

This PDE is used for optimization, and we call it as pseudoinverse learning method in finding network’s paramenters.

A-B4 Symmetry Break

We rewrite the Eq. (16) and Eq. (54) as follows:

(55)

We merge and into new parameter , then we have:

(56)

Unlike Turing’s equations, two PDEs in Eq. (A-B4) is not symmetry, and they also has different meanings. variable is in data space, while variable is in parameter space. in first equation stands for depth direction in a virtual medium. While in second equation, means iterative step in parameter optimization processing.

A-B5 Gauss–Newton algorithm

In practical implementation, to avoiding direct computing Hessian, Quasi-Newton methods are utilized. Quasi-Newton methods are methods used to either find zeroes or local maxima and minima of functions, as an alternative to Newton’s method. They can be used if the Jacobian or Hessian is unavailable or is too expensive to compute at every iteration. The “full” Newton’s method requires the Jacobian in order to search for zeros, or the Hessian for finding extrema.

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton’s method for finding a minimum of a function. Unlike Newton’s method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.

Minimizing the sum of squares

we should start with an initial guess for the minimum, the method proceeds by the iterations,

where is the Jacobian matrix.

Note that is the left pseudoinverse of . This form looks like the same with Equation (53), in fact, it is a special case of Newton’s methods.

A-C L-BFGS Algorithm

Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden- Fletcher - Goldfarb -Shanno algorithm (BFGS) using a limited amount of computer memory [80].

Like the original BFGS, L-BFGS uses an estimate of the inverse Hessian matrix to steer its search through variable space, but where BFGS stores a dense approximation to the inverse Hessian ( being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly [81] (Large-scale L-BFGS using Mapreduce).

Now we have two equations, one is for forward propagation, the other is for back propagation. In addition to L-BFGS algorithm, other algorithms also can be used for learning optimization.