I Introduction
In recent years, the field of artificial intelligence (AI) has made rapid progress, especially digital AI. Researchers are now paying attention to physical artificial intelligence (PAI), and believe that PAI will be one of the most promising directions in robotics in the coming decades [1]. To develop PAI effectively, we need a solid foundation from theoretical physics. We believe that theories and methods from physics can pave the road to PAI, and they have already been boosting artificial intelligence methods.
In physics, theoretical work is said to be from first principles if it does not make empirical assumptions. The principle of least action (PLA) is a very elegant and important discovery of mechanics [2][3]. The PLA can be used to derive the Newtonian, Lagrangian, and Hamiltonian equations of motion, and even general relativity. The physicist Richard Feynman demonstrated how the PLA can be used in quantum mechanics with path-integral calculations [4]. It is well known that most of the fundamental laws in physics are described with partial differential equations. As Stephen Wolfram said, “Indeed, it turns out that almost all the traditional mathematical models that have been used in physics and other areas of science are ultimately based on partial differential equations” [5].
Now the question is: what is the first principle of AI? Are there any fundamental equations for the AI field? In our viewpoint, if there exists a first principle for the AI research field, we believe it is the PLA [6]. Also, the question asked by Yann LeCun at IJCAI 2018, “What is the equivalent of ‘Thermodynamics’ in intelligence?” [7], has inspired us to seek a simple rule to build a world model, in other words, to establish the fundamental equations for AI with physics-inspired methods.
In this work, we develop generalized nonlinear parabolic partial differential equations (PPDEs), which can be considered the fundamental equations of the AI research field. We also show that the reaction-diffusion equation in statistical physics, the Schrödinger equation in quantum mechanics, and the Helmholtz equation in paraxial optics are all examples of PPDEs. With the finite element method (FEM) and the finite difference method (FDM), we discretize PPDEs into neural partial differential equations (NPDEs). The NPDEs can be used to describe an AI system, such as the synergetic learning system.
II Background
Here we first briefly review the synergetic learning system (SLS) [8].
II-A SLS Theory
Our proposed SLS is an information-processing system based on the PLA, and the dynamics of the SLS should be described by differential dynamical equations. What kind of equations are these? It is known that when generalizing the equation to a multi-component system, we can obtain a variation of the “free energy”:

(1)  ∂u/∂t = ∇·(D∇u) + b·∇u + f(u),

in which u = u(x, t) is the state vector, D is the diffusion matrix, b is called the convection vector, and f is the reaction vector [9]. Therefore, the dynamics of the SLS are described by reaction-diffusion equations (RDEs). In statistical physics, dissipative-systems theory is a theoretical description of the self-organization of non-equilibrium systems, and RDEs are utilized to model these systems as well [10]. If the SLS is designed as a differential dynamical system and attention is paid to the attractor subnetwork [11], the RDEs can also be used to describe the system.
II-B Turing’s Equations
Turing’s equations are a two-component RDE [12], and provide a mechanism for pattern formation [13]. Letting the two components be u and v with diffusion coefficients D_u and D_v, the mathematical expression of Turing’s equations is

(2)  ∂u/∂t = D_u ∇²u + f(u, v),
     ∂v/∂t = D_v ∇²v + g(u, v).
In the literature, Prigogine proposed the theory of dissipative systems, believing that “the energy exchange with the outside world is the fundamental reason for making the system orderly (contrary to the principle of entropy increase),” and founded a new discipline called non-equilibrium statistical mechanics [14]; RDEs are utilized to model these dissipative structures.
III Neural Partial Differential Equations
The RDEs of statistical physics are widely used; they belong to the family of semilinear parabolic partial differential equations (PDEs). We believe that nonlinear PDEs can be utilized to describe an AI system and can be considered fundamental equations for neural systems. In the following we present a general form of neural PDEs.
Now, using a matrix-valued function A(x, u), a vector field b(x, u), and a scalar function c(x, u), the elliptic operator is defined as [15]:

(3)  Lu = ∇·(A(x, u)∇u) + b(x, u)·∇u + c(x, u)u.

This is a more general form of a second-order, divergence-form nonlinear elliptic differential operator. When the matrix-valued function A is a function of x only, L is a linear elliptic differential operator. In particular, if A(x) is symmetric and positive definite for every x, with eigenvalues bounded away from zero uniformly in x, L is said to be a uniformly elliptic operator.

Please note that this elliptic operator has the same form as the right-hand side of Eq. (1). The matrix A plays the role of the diffusion matrix, the second term on the right-hand side of Eq. (3) is the convection term, and the third term is the reaction term.
In mathematics, the Laplace operator is expressed as:

(4)  Δu = ∂²u/∂x₁² + ⋯ + ∂²u/∂x_n².

In the theory of partial differential equations, elliptic operators are differential operators that generalize the Laplace operator, since the Laplace operator is obtained by taking A = I, b = 0, and c = 0.
In mathematics, a PDE is an equation which imposes relations between the various partial derivatives of a multivariable function, and second-order linear PDEs are classified as elliptic, hyperbolic, or parabolic. Because we will discuss nonlinear PDEs, there is no mathematically strict classification of neural PDEs into parabolic, hyperbolic, or elliptic types. In the following we consider several parabolic PDEs besides the RDEs.
III-A Fisher’s Equations
Fisher’s equation, also known as the Kolmogorov-Petrovsky-Piskunov (KPP) equation or the Fisher-KPP equation [16], has the mathematical expression

(5)  ∂u/∂t = D ∂²u/∂x² + ru(1 − u).

Fisher’s equation belongs to the class of RDEs; in fact, it is one of the simplest semilinear RDEs, the one with the inhomogeneous term ru(1 − u).
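The dynamics of Eq. (5) are easy to probe numerically. The following is a minimal sketch (our own illustrative parameters, D = r = 1, a periodic toy grid, not the paper's setup) of an explicit finite-difference integration; the solution stays bounded in [0, 1] and the initial step spreads as a traveling front.

```python
import numpy as np

# Explicit finite-difference integration of Fisher's equation
#   du/dt = D u_xx + r u (1 - u)
# on a periodic grid; all parameters here are illustrative.
D, r = 1.0, 1.0
dx, dt = 0.5, 0.05                # D*dt/dx^2 = 0.2 < 0.5 keeps the scheme stable
x = np.arange(0.0, 50.0, dx)
u = (x < 10.0).astype(float)      # step initial condition

for _ in range(100):              # integrate to t = 5
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2   # 3-point Laplacian
    u = u + dt * (D * lap + r * u * (1 - u))                 # diffusion + reaction

print(float(u.min()), float(u.max()))   # solution stays within [0, 1]
```

The logistic reaction term grows the occupied region while diffusion spreads it, which is the traveling-front behavior Fisher's equation is known for.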
III-B Heat Equations
The heat equation is a certain partial differential equation in mathematics and physics; it has the form [17]

(6)  ∂u/∂t = α(∂²u/∂x₁² + ⋯ + ∂²u/∂x_n²),

where x = (x₁, …, x_n) denotes a general point of the domain. The right-hand side of the equation can also be written with the Laplacian,

(7)  ∂u/∂t = αΔu.
When studying the thermo-optic effect in a nonlinear optical medium, the heat equation has the form [18][19][20]:

(8)  ρc ∂T/∂t = κ∇²T + αI,

where T is the temperature; κ, ρ, c, and I are the heat conduction coefficient, medium density, medium specific heat capacity, and light intensity, respectively; and α is the absorption coefficient of the medium.
III-C Schrödinger Equations
The Schrödinger equation is a linear partial differential equation that governs the wave function of a quantum-mechanical system [21]; the most general form is the time-dependent Schrödinger equation:

(9)  iħ ∂|Ψ(t)⟩/∂t = Ĥ|Ψ(t)⟩,

where i is the imaginary unit, |Ψ(t)⟩ is the state vector of the quantum system, t is time, and Ĥ is an observable, the Hamiltonian operator.
The Schrödinger equation does have the same form as the heat equation when we perform a Wick rotation [22].
For example, the one-dimensional Schrödinger equation for a free particle is

(10)  iħ ∂ψ/∂t = −(ħ²/2m) ∂²ψ/∂x².

Take the Wick rotation t = −iτ. Then the Schrödinger equation (10) turns into

(11)  ∂ψ/∂τ = α ∂²ψ/∂x²,

where α = ħ/(2m). Eq. (11) is a homogeneous heat equation with diffusion coefficient α. Conversely, apply the Wick rotation τ = it to the one-dimensional homogeneous heat equation

(12)  ∂u/∂τ = α ∂²u/∂x².

Then the resulting equation is

(13)  iħ ∂u/∂t = −(ħ²/2m) ∂²u/∂x²,

where m = ħ/(2α). Eq. (13) is a Schrödinger equation for a free particle with mass m = ħ/(2α).
This example indeed shows an intriguing relationship between the Schrödinger equation and the heat equation.
Also, by the Wick rotation t = −iτ, the Schrödinger equation for a particle in a potential V(x),

iħ ∂ψ/∂t = −(ħ²/2m) ∂²ψ/∂x² + V(x)ψ,

turns into

∂ψ/∂τ = (ħ/2m) ∂²ψ/∂x² − (V(x)/ħ)ψ.

This equation has the same form as a heat (reaction-diffusion) equation.
III-D Paraxial Helmholtz Equations
In mathematics, the eigenvalue problem for the Laplace operator is known as the Helmholtz equation. It corresponds to the linear partial differential equation

∇²f = −k²f,

where ∇² is the Laplace operator, −k² is the eigenvalue, and f is the eigenfunction. When the equation is applied to waves, k is known as the wave number. The Helmholtz equation has a variety of applications in physics, including the wave equation and the diffusion equation.

A wave is said to be paraxial if its wavefront normals are paraxial rays. In paraxial optics, the parabolic approximation, also called the slowly varying envelope approximation (SVEA) [23], is utilized. In the paraxial approximation of the Helmholtz equation [24], the complex amplitude is expressed as
(14)  U(r) = A(r) e^{−ikz},

where A(r) represents the complex-valued amplitude which modulates the sinusoidal plane wave represented by the exponential factor. Then, under a suitable assumption, A approximately solves

(15)  ∇²_T A − 2ik ∂A/∂z = 0,

where ∇²_T = ∂²/∂x² + ∂²/∂y² is the transverse Laplacian operator.
This is a parabolic partial differential equation; it represents waves propagating in directions not significantly different from the z direction. It has important applications in optical science, where it provides solutions that describe the propagation of electromagnetic waves in the form of either paraboloidal waves [25] or Gaussian beams [26][27].
III-E General Neural PDE
Above we presented several parabolic PDEs, including backward parabolic PDEs. Please note that parabolic PDEs can also be nonlinear. For example, Fisher’s equation is a nonlinear PDE that includes the same diffusion term as the heat equation but incorporates a linear growth term and a nonlinear decay term.
Now we write a general form of neural PDEs, named nonlinear parabolic PDEs:

(16)  ∂u/∂t = ∇·(A(x, t, u)∇u) + b(x, t, u)·∇u + f(x, t, u).

With the quasilinear approximation, Eq. (16) will be substituted with

(17)  ∂u/∂t = ∇·(A(x, t)∇u) + b(x, t)·∇u + f(u).

Here the quasilinear approximation means the matrices A and b are linear functions of u (or independent of u), while f can be a nonlinear function of u, for example, f(u) = ru(1 − u).
IV Generating Neural Architectures
Here we first briefly review the related work.
IV-A Two-model SLS
The generative model is called System A, and an RDE is adopted to describe the system's evolution. We also introduce an auxiliary model to solve the inverse problem; this auxiliary model is defined as the reduction model (System B).
IV-B Finite Difference Methods
At present, most studies on partial differential equations in mathematics use the finite element method (FEM) [30] and the finite difference method (FDM) [31] to obtain numerical solutions; for nonlinear partial differential equations, numerical solution methods are usually required [32][33][34]. Before giving our method, we first describe the preliminaries of finite difference methods for the discrete Laplacian.
In the one-dimensional case, the discrete Laplace operator is approximated as

(18)  Δu(x) ≈ [u(x − h) − 2u(x) + u(x + h)]/h².

This approximation is usually expressed via the following 3-point stencil:

(19)  [1  −2  1].
As we know, the most used Laplace operator in image processing is the Laplacian of Gaussian; the discrete Laplace operator is a 2D filter [35]:

(20)  [[0, 1, 0], [1, −4, 1], [0, 1, 0]],

which corresponds to the (five-point stencil) finite-difference formula. When the nine-point stencil is used, it includes the diagonals:

(21)  [[1, 1, 1], [1, −8, 1], [1, 1, 1]].

The 1D stencil (19) represents a symmetric, tridiagonal matrix. If an equidistant grid for a finite element unit is utilized, one will get a Toeplitz matrix.
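As a sanity check on the stencil arithmetic, the sketch below (our own toy grid, not from the paper) assembles the 3-point stencil of Eq. (19) into the tridiagonal matrix just mentioned and applies it to u(x) = x², for which the second derivative is exactly 2.

```python
import numpy as np

# Assemble the 1D discrete Laplacian (3-point stencil [1, -2, 1] / h^2)
# as a matrix over an equidistant grid; the interior rows form a
# symmetric, tridiagonal, Toeplitz-structured matrix.
h, n = 0.1, 11
x = np.arange(n) * h
L = np.zeros((n, n))
for j in range(1, n - 1):
    L[j, j - 1:j + 2] = np.array([1.0, -2.0, 1.0]) / h**2

u = x**2
lap_u = L @ u
print(lap_u[1:-1])   # interior entries equal u'' = 2 exactly for a quadratic
```

The central difference is exact on quadratics, so the interior values come out as 2 with no truncation error, a convenient check before moving to learnable stencils.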
In the following we present the three most popular deep neural network (DNN) building blocks. A DNN can be built by simply stacking blocks and/or combining them in a cascade/parallel manner. In physics, when we study the field distribution in a layer-wise structured medium, a PDE is used to find the system solution. The layer-wise structured medium model has some relationship with the DNN model; for example, they share the same layer-wise structure. This implies that the DNN structure can be obtained by discretizing the layer-wise structured medium physical model.
IV-C Convolutional Neural Networks
Suppose we use a one-component RDE to describe System A in the SLS, as presented in Fig. (1):

(22)  ∂u/∂t = ∇·(D∇u) + f(u).
In the following, we first discuss the one-dimensional signal processing problem with a convolutional neural network (CNN).
IV-C1 1D CNN
For the one-dimensional case, the operator ∇ = ∂/∂x. For an extremely deep and wide neural network, x runs along the width of a layer in the network. As we presented in ref. [36], a ResNet-like MLP can be developed. Now we assign D = D(x, t), and f is an activation function:

(23)  ∂u/∂t = ∂/∂x(D(x, t) ∂u/∂x) + f(u).
Applying Euler's method to the above partial differential equation, we have

(24)  u(x, t + Δt) = u(x, t) + Δt[∂/∂x(D ∂u/∂x) + f(u)].

When taking step Δt = 1, we get

(25)  u(x, t + 1) = u(x, t) + ∂/∂x(D ∂u/∂x) + f(u).
Almost all other works just treat this as a black box and only consider variation in the network-depth direction, such as in [37][38][39]. Now we open this black box to investigate the NPDE in the x (width) direction, explain why this RDE is an NPDE, and show how it can generate a neural architecture.
Here we describe the physical picture of our model. Imagine that the information is expressed with the variable u, which propagates through a 2-dimensional virtual medium, for example, a thin film. This thin film has no thickness; it only has width (x) and length (z). The information propagates along the z direction with a constant speed¹, z = ct. (¹Strictly speaking, the light speed is not constant in a nonlinear medium. Here we just neglect the nonlinear effect.)
So using the symbol z or t will only differ by a constant factor. Recall our infinite-width and infinite-depth neural network model [36], in which x stands for the network's width, and z (or t) means the network's depth.
Let us give our interpretable NPDEs with Fig. (2).
In Fig. (2), n and j are explained as the network layer index (t = nΔt) and the network neuron position index (x = jΔx), respectively; correspondingly, t and x are explained as the network's depth and width variables. The matrix D is a function of x and t and is related to the properties of the medium model; it can be considered the diffusion matrix in the RDE. This 2D virtual medium is an inhomogeneous medium when we assign the matrix D to be a function of x and t, just like the weight matrices in a DNN: each layer is not the same, because D depends on the t (or z) variable.
Combining Eq. (19), the elliptic operator in Eq. (25) (the operator part of the second term) can be written in discrete form as:

(26)  ∂/∂x(D ∂u/∂x) ≈ [D_{j+1/2}(u_{j+1} − u_j) − D_{j−1/2}(u_j − u_{j−1})]/Δx²,

i.e., a 3-point stencil whose coefficients depend on D. Because we assign the modeled medium to be inhomogeneous everywhere, the matrix D is unknown; it may take any values. So we let this stencil,

(27)  [w₁  w₂  w₃],

be a weight matrix to be learnt in the CNN. The second term in Eq. (25) becomes

(28)  w₁u_{j−1}^n + w₂u_j^n + w₃u_{j+1}^n.
We explain this as the j-th neuron's input at layer n + 1. Similarly, from Fig. (2), we obtain the corresponding inputs of neurons j − 1, j + 1, and so on. Apparently, the elliptic operator performs a convolutional operation; in this example, the 1D convolutional kernel size is 3. Also, after the convolutional operation is implemented, slice n + 1 increases by a small quantity, say Δu.
The third term in Eq. (25) will be taken as a pooling and/or nonlinear activation operation. For Fisher's equation, we observe that f(u) = ru(1 − u). When f is a sigmoid function, f(u) = 1/(1 + e^{−u}). (Suppose the variables are separable.) Please note that the information propagation speed is at most the light speed because of physical-law restrictions; when calculating the pooling and/or activation operation, we use the function f(u^n), which means the pooling or activation operation is performed after the convolution operation is finished. A general expression is u^{n+1} = u^n + Δu + f(u^n), where Δu stands for a very small quantity. In this way, the nonlinear PDE is approximated with a quasilinear PDE.
Here we obtain the basic building block of a 1D convolutional neural network from the NPDE by applying the FDM.
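The correspondence can be sketched in a few lines of NumPy (all names and parameter values below are our own illustration, not the paper's implementation): one layer update of Eq. (25) is a residual connection plus a learnable kernel-size-3 convolution, the stencil (27), plus a pointwise activation.

```python
import numpy as np

# One NPDE layer read off from Eq. (25):
#   u^{n+1} = u^n + conv3(u^n; w) + f(u^n)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def npde_layer(u, w):
    # kernel-size-3 convolution = the discretized elliptic term, Eq. (28)
    conv = w[0] * np.roll(u, 1) + w[1] * u + w[2] * np.roll(u, -1)
    return u + conv + sigmoid(u)      # residual + convolution + activation

rng = np.random.default_rng(0)
u = rng.standard_normal(16)           # one width-16 slice of the signal
w = np.array([0.1, -0.2, 0.1])        # stencil weights, to be learnt in a CNN
for _ in range(3):                    # stacking layers = stepping in depth t
    u = npde_layer(u, w)
print(u.shape)
```

Stacking such layers is exactly repeated Euler stepping in the depth (t) direction, which is the ResNet-as-discretized-PDE reading developed in this subsection.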
When we assign D = const and f(u) = ru(1 − u), Eq. (23) reduces to Fisher's equation,

(29)  ∂u/∂t = D ∂²u/∂x² + ru(1 − u).
IV-C2 2D CNN
Our idea is that, incorporating the diffusion matrix with the discrete transverse Laplace operator expression Eq. (21), the combined filter is assigned as a learnable filter; the elements in this filter will be learnt like a convolutional kernel in a CNN. In fact, the following filter is a special case of the elliptic operator:

(30)  [[w₁₁, w₁₂, w₁₃], [w₂₁, w₂₂, w₂₃], [w₃₁, w₃₂, w₃₃]].

Here we can regard this filter as a second-order derivative operator.
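A minimal sketch (toy image and our own helper function, not the paper's code): initializing the 3×3 learnable filter with the nine-point stencil of Eq. (21) and applying it as a plain 2D convolution. On the quadratic test image i² + j² this stencil returns a constant, consistent with its reading as a second-order derivative operator.

```python
import numpy as np

# 3x3 filter initialized from the nine-point Laplacian stencil of Eq. (21);
# in a CNN its nine entries would be learnable parameters.
K = np.array([[1.0,  1.0, 1.0],
              [1.0, -8.0, 1.0],
              [1.0,  1.0, 1.0]])

def conv2d_valid(img, k):
    """Plain 'valid' 2D correlation with a 3x3 kernel."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

img = np.fromfunction(lambda i, j: i**2 + j**2, (8, 8))
out = conv2d_valid(img, K)
print(out[0, 0])   # constant 12 everywhere for this quadratic image
```

Replacing the fixed entries of K with learnable values gives exactly the "special case of the elliptic operator" filter of Eq. (30).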
With the above discussion, discretizing the NPDE into the ResNet structure becomes very easy. We have known that ResNet can be regarded as an ODE net, while the discussion above showed how ResNet is obtained by discretizing an RDE.
IV-D Fully Connected DNN (MLP)
The MLP is a well-studied DNN structure in the literature.
As we have proved in [36], a fully connected MLP is equivalent to full-size-kernel convolution together with a multi-channel operation. The channel number is the desired neuron number in the next layer. Fig. (4) illustrates this relationship between a fully connected layer and the full-size-kernel, multi-channel design architecture.

For example, suppose the n-th layer and the (n + 1)-th layer have J and K neurons, respectively; the information at channel k propagating to reach the activation function input is

(31)  a_k^{n+1} = Σ_{j=1}^{J} w_{kj} u_j^n.

We need to design K channels to get K neurons at the (n + 1)-th layer.
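The equivalence claimed around Eq. (31) can be checked directly (random toy sizes, our own notation): each output channel k uses a full-size kernel, i.e., one row of the weight matrix, so K channels reproduce the K-neuron fully connected layer.

```python
import numpy as np

rng = np.random.default_rng(1)
J, K = 5, 3
u = rng.standard_normal(J)          # J neurons at layer n
W = rng.standard_normal((K, J))     # K x J fully connected weights

dense = W @ u                       # fully connected pre-activation, Eq. (31)

# "Full-size kernel" convolution: a kernel of length J leaves a single
# valid position, so channel k reduces to the inner product with row k of W.
conv = np.array([np.convolve(u, W[k][::-1], mode='valid')[0] for k in range(K)])

print(np.allclose(dense, conv))     # True: the two views coincide
```

The kernel is reversed before `np.convolve` because convolution flips its second argument; with that flip, the single valid output is exactly the dot product of Eq. (31).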
As discussed in ref. [36], a random vector functional link network (RVFL) can evolve into a ResNet-like MLP with our design. The ResNet-like MLP's layer output is²

(32)  x_{l+1} = x_l + F(x_l, W_l).

(²The symbol x in Eqs. (32) and (33) has the same meaning as u in this paper.) This is the same as Eq. (3) in He et al.'s paper [40]. Propagating from layer l to a deeper layer L gives

(33)  x_L = x_l + Σ_{i=l}^{L−1} F(x_i, W_i).
From the literature, we know that this equation is a discrete ordinary differential equation obtained by Euler's method. In this work, we consider not only ordinary differential equations but also partial differential equations. Eq. (31) is just the second term of Eq. (25); when it is fed into the activation function of hidden neuron k, the layer output will be figured out. Also, the quasilinear approximation is utilized to solve the nonlinear PDE.
Above we presented fully connected NNs such as the MLP; the weight layer can also be regarded as a full-size convolution kernel combined with a multi-channel operator [36], so it naturally transfers to the NPDE as well, according to our idea.
IV-D1 Restricted Boltzmann Machine
The Restricted Boltzmann Machine (RBM) was proposed by Hinton et al. in 2006 [41]; it can be regarded as a generative stochastic neural network. In fact, the topological structure of an RBM is essentially the same as that of a fully connected single-layer feed-forward neural network, but the connecting weights and neurons are stochastic. So an RBM can be generated with a stochastic partial differential equation.
When using the FEM and the FDM to discretize RDEs, we know that if we use a forward difference at time t_n and a second-order central difference for the space derivative at position x_j, we get the recurrence equation:

(34)  (u_j^{n+1} − u_j^n)/Δt = D(u_{j+1}^n − 2u_j^n + u_{j−1}^n)/Δx².

This is an explicit method for solving the one-dimensional inhomogeneous heat equation with diffusion coefficient D. We can obtain u_j^{n+1} from the other values this way:

(35)  u_j^{n+1} = (1 − 2r)u_j^n + r u_{j+1}^n + r u_{j−1}^n,

where r = DΔt/Δx².
If we use the backward difference at time t_{n+1} and a second-order central difference for the space derivative at position x_j, we get the recurrence equation:

(36)  (u_j^{n+1} − u_j^n)/Δt = D(u_{j+1}^{n+1} − 2u_j^{n+1} + u_{j−1}^{n+1})/Δx².

This is an implicit method for solving the one-dimensional heat equation. We can obtain u_j^{n+1} by solving a system of linear equations:

(37)  (1 + 2r)u_j^{n+1} − r u_{j+1}^{n+1} − r u_{j−1}^{n+1} = u_j^n.
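The two recurrences can be sketched as follows (toy grid, zero Dirichlet boundaries, and our own parameter choices): the explicit step of Eq. (35) is a direct update, while the implicit step of Eq. (37) requires solving a tridiagonal linear system.

```python
import numpy as np

n, r = 6, 0.2                         # r = D * dt / dx^2
rng = np.random.default_rng(2)
u = rng.standard_normal(n)            # current time slice u^n

# explicit step, Eq. (35): u_j^{n+1} = (1-2r) u_j + r u_{j+1} + r u_{j-1}
up = np.pad(u, 1)                     # zero Dirichlet boundaries
u_explicit = (1 - 2 * r) * up[1:-1] + r * up[2:] + r * up[:-2]

# implicit step, Eq. (37):
#   (1+2r) u_j^{n+1} - r u_{j+1}^{n+1} - r u_{j-1}^{n+1} = u_j^n
A = (1 + 2 * r) * np.eye(n) - r * (np.eye(n, k=1) + np.eye(n, k=-1))
u_implicit = np.linalg.solve(A, u)    # tridiagonal solve

print(u_explicit.shape, u_implicit.shape)
```

The tridiagonal matrix A is exactly the symmetric Toeplitz structure noted earlier; a production solver would use a banded solver instead of the dense `np.linalg.solve` used here for brevity.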
In the explicit method (forward Euler), we regard u_j^{n+1} in Eq. (35) as a hidden neuron, while in the implicit method (backward Euler), we take u_j^{n+1} in Eq. (37) as a visible neuron. Please note that the traditional heat equation only contains the Laplace operator. When we take a general elliptic operator, the coefficients A and b are considered learnable: A, combined with the discrete Laplace operator stencil, stands for the connecting weights, and if we split the reaction term into two parts, one part stands for the bias vector while the other part is used as the activation function. Combining Eq. (17), Eq. (35), and Eq. (36), we will get the energy configuration of the RBM, and the free energy can be used as a loss function to perform energy-based learning
[42].
IV-E Recurrent Neural Network
A recurrent neural network (RNN) is widely used in temporal sequence processing; it can exhibit temporal dynamic behavior [43]. Hopfield networks are a special kind of RNN, and the modern Hopfield networks proposed by Ramsauer et al. [44] may have broad applications in connection with the attention mechanism [45].
RNNs have many variants, such as the Long Short-Term Memory model (LSTM) [46]; LSTM-based models can be used, for example, to predict COVID-19 [47]. Here we just discuss basic RNNs. Basic RNNs are a network of neuron-like nodes organized into successive layers. Fig. (5) shows an unfolded basic recurrent neural network. The mathematical expression of the basic RNN is [43]

(38)  h_t = σ_h(W_h x_t + U_h h_{t−1} + b_h),
      y_t = σ_y(W_y h_t + b_y).

This form belongs to the Elman network [49].
The output y_t is a single-layer fully connected network; this kind of structure can be generated with Eq. (22), as has been presented in subsection (IV-D). In the following we discuss the hidden-layer output h_t.
Suppose D in Eq. (22) is a diagonal matrix and f is an activation function σ. Now let u = u(x − vt) (traveling wave case), so that ∂u/∂t = −v ∂u/∂x, and we have

(39)  −v ∂u/∂x = ∇·(D∇u) + σ(u).

When combining a forward difference and a second-order central difference for the x variable, we get a recurrence equation of the form

(41)  u_{j+1} = a u_j + b u_{j−1} + c σ(u_j),

where a, b, and c are coefficients determined by v, D, and Δx. Identifying u_j with the hidden state and adding the input connection, this recurrence takes the form of the hidden-state update

(43)  h_t = σ_h(W_h x_t + U_h h_{t−1} + b_h).
Especially, when we still consider information diffusion in a finite time, the quasilinear approximation can be applied to the NPDE; alternatively, the above equation is the case of assigning the activation function to be a linear function. Please note that the NPDE contains spatial and time variables, so it can also be applied to describe spatiotemporal dynamic systems [51].
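For concreteness, a minimal sketch of the Elman recurrence Eq. (38); the dimensions and initialization below are our own toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_h, n_out = 4, 8, 2
W_h = 0.1 * rng.standard_normal((n_h, n_in))   # input-to-hidden weights
U_h = 0.1 * rng.standard_normal((n_h, n_h))    # hidden-to-hidden (recurrent)
b_h = np.zeros(n_h)
W_y = 0.1 * rng.standard_normal((n_out, n_h))  # hidden-to-output readout
b_y = np.zeros(n_out)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_h @ x_t + U_h @ h_prev + b_h)  # hidden recurrence, Eq. (38)
    y_t = W_y @ h_t + b_y                          # single-layer readout
    return h_t, y_t

h = np.zeros(n_h)
for x_t in rng.standard_normal((5, n_in)):          # unfold over 5 time steps
    h, y = rnn_step(x_t, h)
print(h.shape, y.shape)
```

The unrolled loop is the "unfolded" network of Fig. (5): the same weights are reused at every time step, exactly as the recurrence along x in the traveling-wave discretization reuses the same stencil coefficients.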
V NPDEs Solver
In the literature, linear and semilinear parabolic PDEs are well studied, but only a few simple cases admit analytical solutions, such as Fisher's equation. An integral transform may transform a PDE into a simpler one, in particular a separable PDE; this corresponds to diagonalizing an operator [52]. In the book “Nonlinear Reaction-Diffusion-Convection Equations” [52], an important example of an integral transform is shown, namely Fourier analysis, which diagonalizes the heat equation using the eigenbasis of sinusoidal waves. When the domain is finite or periodic, an infinite sum of solutions such as a Fourier series is appropriate, while an integral of solutions such as a Fourier integral is generally required for infinite domains. The solution for a point source of the heat equation is an example of the utilization of a Fourier integral. For classical RDE solution problems, extensive studies have been done; investigations of various initial and boundary conditions can be found in the literature, for example, in these books [53][54].
However, a traditional PPDE solver requires that the system configuration functions, such as the diffusion matrix, be given, whereas for NPDEs we suppose that the diffusion matrix has parameters which are determined by learning; that is, an optimization algorithm should be adopted to develop the NPDE solver. With the physical interpretation of our model, we imagine that the information wave passes through a layered medium, and the state is described with NPDEs. When solving NPDEs with numerical methods, we can use the nine-point stencil (in the 2D case) finite-difference formula to find numerical solutions. When input information is given, it can be regarded as the initial condition for the NPDEs. Because we don't know the medium parameter matrix A, we simply initialize it with random values. As for f, we assign it as a nonlinear operator, which plays a role such as the activating operation and/or neuron pooling. Then we can compute the value of u at t + Δt, obtaining u(x, t + Δt). With these values, we can compute the loss function. Consequently, we can use the second equation to find the network parameter matrix (equivalent to determining the coefficient matrix A). From this description, we can see that the NPDE is the forward scenario, while the optimization algorithm is the backward scenario. In Appendix A, Eq. (AB4) describes these forward and backward scenarios mathematically. Once these configuration matrices are learnt, NPDEs are transformed into traditional deterministic PDEs. We know that almost all physical laws are described by PDEs, which means that physical systems are deterministic because they obey PDEs. A real physical process governed by PDEs has a deterministic solution. So, knowing the initial and boundary conditions and solving these PDEs analytically or numerically, we would know the future of a physical system for all times. This may give us a clue for the well-known difficulty of obtaining an expected effect in deep learning, since future states of a PAI system could be precisely and theoretically predicted by the solution of the PDE.
In the following we present several learning algorithms for finding numerical solutions of NPDEs.
V-A Supervised Learning for NPDE
Classification/regression problems usually belong to supervised learning; we can write the loss function as an L2 norm with weight-decay regularization. Referring to [55], the loss function can be written in the form:

(44)  L(W) = Σᵢ ‖u(xᵢ, t_N; W) − yᵢ‖² + λ‖W‖²,

where u(xᵢ, t_N; W) is the network output at (xᵢ, t_N), W is the weight parameter, and yᵢ is the expected output.
The algorithm is trivial; it can be described as follows:

 Step 1: Give the initial and boundary conditions u(x, 0) and u(∂Ω, t), respectively.
 Step 2: Randomly set the parameters W, and discretize the elliptic operator with step sizes Δx and Δt.
 Step 3: Numerically solve the NPDE, obtaining the solution u(x, t).
 Step 4: Optimize the loss function Eq. (44) with either the Adaptive moment estimation (ADAM) or the L-BFGS algorithm to obtain the parameters W.
 Step 5: If the expected loss is reached, end the optimization; otherwise jump to Step 3.
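Steps 1-5 can be sketched on a toy problem (everything below, including the use of a plain finite-difference gradient in place of ADAM/L-BFGS, is our own illustrative simplification): learn a scalar diffusion coefficient D so that the explicit forward solve of u_t = D u_xx matches target data generated with D = 0.5.

```python
import numpy as np

dx, dt, steps = 0.1, 0.001, 50

def solve(D, u0):
    """Step 3: explicit finite-difference forward solve of u_t = D u_xx."""
    u = u0.copy()
    for _ in range(steps):
        lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
        u = u + dt * D * lap
    return u

x = np.arange(0.0, 1.0, dx)
u0 = np.sin(2 * np.pi * x)              # Step 1: initial condition (periodic)
target = solve(0.5, u0)                 # synthetic data from the "true" D

loss = lambda d: np.mean((solve(d, u0) - target) ** 2)  # Eq. (44) without decay

D = 0.1                                 # Step 2: rough initialization
for _ in range(200):                    # Steps 4-5: gradient descent on the loss
    grad = (loss(D + 1e-4) - loss(D - 1e-4)) / 2e-4
    D -= 0.5 * grad
print(round(D, 2))                      # recovers approximately 0.5
```

The forward solve is the NPDE (forward scenario) and the gradient loop is the optimizer (backward scenario); in a full NPDE solver, D would be a learnable matrix and the gradients would come from ADAM or L-BFGS with backpropagation.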
More details about learning strategies and classical optimization algorithms, such as ADAM and L-BFGS, are presented in Appendix A.
When applying this trivial algorithm to the NPDE, it has no advantage in learning speed. But “God closes one door while opening another one for you”: what matters most is its theoretical significance, which we present in Section VI.
V-B Unsupervised Learning for NPDE
In our SLS paper [8], we discussed that a system can be described with nonlinear RDEs. In the two-model SLS shown in Fig. (1), NPDEs can be used to model the generative network, and its inverse model is the reductive network [28]. This framework can be utilized as an unsupervised learning framework, since we can build an encoder-decoder framework with the two-model SLS: System A is assigned to be the decoder, while System B is the encoder. In practice, we can apply an approximated synergetic learning algorithm in order to reduce the system's complexity
[56][57].
V-C PDE-constrained Optimization
What we discussed above can be considered a traditional nonlinear PDE solver; here we suggest developing a new PDE-constrained optimization method for learning the parameters of NPDEs.
The idea is to adopt some metrics to measure the difference between expected system output and real system output, while the system is constrained by NPDEs.
PDE-constrained optimization is a subset of mathematical optimization where at least one of the constraints may be expressed as a PDE [58]. A standard formulation of PDE-constrained optimization is given by [59]:

(45)  min_{y,u} ½‖y − ŷ‖² + (β/2)‖u‖²,  s.t.  e(y, u) = 0,

where u is the control variable, β is the regularization parameter, ‖·‖ is the Euclidean norm, and e(y, u) = 0 is the (discretized) PDE constraint. Closed-form solutions are generally unavailable for PDE-constrained optimization problems because of their complexity, necessitating the development of numerical methods [60].
When we adopt the PDE-constrained optimization method, we need to model the PAI system with a formal mathematical expression. In this work, we write the optimization equation as

(46)  min_W ½‖u_out − y‖² + (λ/2)‖W‖²,  s.t.  N(u; W) = 0,

where u_out is the system output, y is the expected system output, and W denotes the system configuration parameters (network parameters). In the equality-constraint term, N(u; W) = ∂u/∂t − ∇·(A∇u) − b·∇u − f(u), which is the same as Eq. (17). This loss function has the form of a first-order Tikhonov regularizer [61][62].
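As a discretize-then-optimize flavored sketch of Eq. (46) (a quadratic-penalty relaxation; the residual form, grid sizes, and weights below are our own toy choices), the constrained problem can be turned into a single unconstrained objective:

```python
import numpy as np

def residual(u, W, dt, dx):
    """Discrete constraint N(u; W) = u_t - W u_xx (toy quasilinear case)."""
    u_t = (u[1:] - u[:-1]) / dt                              # forward time difference
    u_xx = (u[:-1, 2:] - 2 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2
    return u_t[:, 1:-1] - W * u_xx

def objective(u, y, W, mu=10.0, lam=1e-3):
    data = 0.5 * np.sum((u[-1] - y) ** 2)                    # output-misfit term
    reg = 0.5 * lam * W**2                                   # Tikhonov/weight-decay term
    pen = 0.5 * mu * np.sum(residual(u, W, 0.01, 0.1) ** 2)  # penalized PDE constraint
    return data + reg + pen

rng = np.random.default_rng(4)
u = rng.standard_normal((5, 8))     # space-time field: 5 time slices, 8 grid points
y = rng.standard_normal(8)          # expected system output
val = objective(u, y, W=0.3)
print(val >= 0.0)                   # True: a sum of squared terms
```

Minimizing `objective` jointly over the field u and the parameters W, with the penalty weight mu driven upward, approximates the equality-constrained problem (46); an optimize-then-discretize treatment would instead form the continuous adjoint first.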
For supervised learning, y will be assigned as the data labels, while for an unsupervised learning problem we can design an autoencoder system, and y will be assigned as a given data-representation function. In practice, there are two strategies to deal with this kind of optimization problem: Discretize-then-optimize or Optimize-then-discretize. For more on this topic, please refer to the book [59].
VI Discussion
By utilizing the FEM and FDM, we discretize neural PDEs to generate three popular building blocks for deep neural network architectures. The advantages of neuronizing PDEs are two-fold.

One is that we provide the theoretical foundation for applying AI to solve the PDEs of mathematical physics; this will bring various applications in the fields of physics, chemistry, and so on. Some tricks in deep learning, such as padding methods, including extend, wrap, mirror, crop, and so on, can be used to deal with the boundary-condition problem.

The other is that we use NPDEs to describe AI systems; this can give interpretable AI based on statistical mechanics. NPDEs can be regarded as the theoretical foundation for physical artificial intelligence systems, which will provide theoretical guidance for designing AI analog computing devices.
Some more words about these two folds are discussed as follows.
VI-A AI for PDEs
Using AI to solve PDEs has been studied for decades. Why can AI be applied successfully to problems described by PDEs? Our work demonstrates that adopting the FEM and FDM to solve PDEs numerically is basically equivalent to using a deep neural network to do so. From this point of view, we think that our work provides a theoretical foundation for applying AI to solving PDEs.
Recently, some researchers have had great success in applying AI to solve PDEs. For example, Han et al. have used deep learning to solve high-dimensional PDEs [63], including solving the Hamilton-Jacobi-Bellman equation for control. They have also applied DNNs to solve the many-electron Schrödinger equation [64]. More recently, Pfau et al. have developed a new neural network architecture, named FermiNet, to model the quantum state of large collections of electrons [65]. They believe that the FermiNet was the first demonstration of deep learning for computing the energy of atoms and molecules from first principles, and they show how deep learning can help solve the fundamental equations of quantum mechanics for real-world systems.
VI-B NPDEs for AI
In this work, we propose general NPDEs as shown in Eq. (16). They can be considered fundamental equations for an AI system, for example, a synergetic learning system described with nonlinear RDEs [8].
VI-B1 Interpretable AI
Using neural PDEs to describe an AI system means that we should give an explainable AI system. Taking the SLS shown in Fig. (1) as an example, we describe the PAI system as follows:
The SLS can be regarded as a simple PAI system model, in which information propagates through the system, with diffusion and reaction occurring within it.
An information wave is different from an electromagnetic wave, which is the carrier of information, but the elliptic and parabolic equations of mathematical physics can be used to describe the process of information transmission. Methods to solve such equations depend on the complexity of the problem. At present, most studies on nonlinear partial differential equations in mathematics use the FEM and the FDM to obtain numerical solutions. In our earlier work, we used the heat diffusion equation to study the propagation of light beams in nonlinear media [26][25] and the dynamics of interference filters [27]. The diffusion equation is a parabolic semilinear PDE and can also be used to study diffuse optical tomography [66].
VI-B2 Analog Computing Device Design
To realize a PAI system, we should have hardware computing components to assemble the system. Currently, photonic computing is a promising developing field [67]; here we just briefly introduce optical analog computing devices.
It is well known that wave-particle duality has been demonstrated for photons (light), electrons, and other microscopic objects (de Broglie waves). Here we propose the hypothesis that information has wave-particle duality, since photons have long been considered the carriers of information in optical computing. Compared with electronic computing, we believe that optical analog computing may display the advantages of the analog computer versus the optical digital computer. Our previous works include studies of Gaussian beam propagation in a nonlinear medium [26][25] and of the properties of optical interference filter devices [19][20]. Some more discussions about nonlinear diffusion PDEs that describe neural information processing will be presented elsewhere.
The advantage of using an optical computing device is that it uses photons as information carriers, so the propagation of information is at the speed of light. Also, when we take the passive transmission of optical waves, it can aid ultra-low power consumption. From our previous studies, we suggest that optical interference filters can be considered candidates for analog computing devices [68][27]. For example, a ZnSe film filter can be used as an optical bistability device [69], but with proper design it can be used as a gate device to play the role of a sigmoid-type activation function [70][71].
Why do we consider realizing a PAI system with optical devices? From Fourier optics we know that an optical system consists of an input plane, an output plane, and a set of components, and that the output image is related to the input image by convolving the input image with the optical impulse response $h(x, y)$. If we design an optical device with a layered microstructure, at the input plane (input layer) the input image is $f(x, y)$, and at the output plane (output layer) the output image $g(x, y)$ is the 2D convolution of the input function with the impulse response function, i.e.,

$$ g(x, y) = f(x, y) * h(x, y) = \iint f(x', y')\, h(x - x',\, y - y')\, dx'\, dy'. $$
That is, when we assign the optical impulse response to be the convolution kernel, an optical system naturally performs the convolution operation.
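This natural convolution is easy to mimic numerically. The following sketch is illustrative only: a hypothetical 3x3 averaging kernel stands in for the optical impulse response, and the FFT implements the 2D (circular) convolution:

```python
import numpy as np

def optical_convolve(field_in, impulse_response):
    """Output image of a shift-invariant optical system: the 2D circular
    convolution of the input field with the system impulse response."""
    F = np.fft.fft2(field_in)
    H = np.fft.fft2(impulse_response, s=field_in.shape)
    return np.real(np.fft.ifft2(F * H))

# a point source convolved with a 3x3 averaging kernel (stand-in for h)
img = np.zeros((8, 8))
img[4, 4] = 1.0
h = np.ones((3, 3)) / 9.0
out = optical_convolve(img, h)
```

In a real device the multiplication in the Fourier plane would be performed by the optics itself, at the speed of light.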
As Wu & Dai said, “a hybrid optical–electronic framework that takes advantage of both photonic and electronic processors could revolutionize AI hardware in the near future” [67]. A detailed discussion of this topic is beyond the scope of this paper; further research may focus on the design of nonlinear integrated photonic computing architectures [72] [73] based on NPDEs.
VI-B3 Materials Design
Neural PDEs can also be applied to materials design for PAI systems, in what is called “inverse design”. Designing new materials amounts to finding the unknown coefficient functions in the elliptic operator (Eq. (3)).
When the numerical values of those unknown functions are obtained in each time slice, we model them with a functional expression. The resulting functions can then be used for materials design. This is an interesting topic worthy of further study; Sanchez-Lengeling & Aspuru-Guzik’s review paper “Inverse molecular design using machine learning: Generative models for matter engineering” [74] describes generative models for matter engineering.
VI-C Insight into NPDEs
Here we give further discussion of neural PDEs.
VI-C1 Difference from PDE-Net
What is the difference from other ODE or PDE networks?
As we discussed in [36], RVFL and MLP with direct links can evolve into NPDEs, and we have also shown that ResNet can evolve into an NPDE with an elliptic operator. Previous research, such as ODE-Net and PDE-Net [37][38] [39], transforms discrete variables into continuous ones. Here we start from NPDEs and, by applying FEM and FDM, transform continuous variables into discrete ones to make the problem computable. As is well known, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts; this procedure can also be regarded as quantization. In the numerical solution of PDEs there are two step-size hyperparameters in Eq. (34), which control the depth and width of the NN structure, respectively. From the literature, e.g., the book [33], we also know that these parameters control the error of the PDE solution.
Discretizing (quantizing) PDEs can construct extremely deep and extremely wide neural architectures, depending on the step parameters selected when FEM is applied. PDEs with continuous variables, in contrast, can describe neural architectures of infinite depth and infinite width, rather than infinite width only [75]. In Eq. (4) of the ResNet paper [40], the summation term can be transformed into an integral form when the network depth approaches infinity. Conversely, under a sparse approximation, we can write an integral as a summation [76]; the sparse approximation means we should choose relatively large step parameters for a given finite volume of medium.
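The correspondence can be seen in a toy sketch of our own (not taken from the ResNet paper): an explicit Euler step of du/dz = f(u) has exactly the residual-block form y = x + F(x), and stacking N such steps mimics an N-block residual network:

```python
import numpy as np

def residual_step(u, f, dt):
    """Explicit Euler step u <- u + dt * f(u): algebraically a residual block."""
    return u + dt * f(u)

f = lambda u: -u   # toy "layer": linear decay dynamics du/dz = -u
u = np.array([1.0])
dt, N = 0.01, 100  # N blocks of step dt; dt * N = 1 unit of "depth"
for _ in range(N):
    u = residual_step(u, f, dt)
# as dt -> 0 with dt * N fixed, u approaches the continuous solution exp(-1)
```

Shrinking dt while growing N is precisely the passage from the discrete network to the continuous NPDE limit.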
VI-C2 Realizable PAI Systems
In order to realize a PAI system, we must have solid theoretical foundations. It is notable that, unlike ODEs, essentially all PDEs that have been widely studied come quite directly from physics. For example, the diffusion equation arises from the evolution of temperature or of gas density, and the wave equation represents the propagation of linear waves, such as along a compressible spring or through fluids. RDEs come from studies of dissipative structures, representing far-from-equilibrium states in thermodynamics. In this work, we introduce NPDEs to describe PAI systems; several viewpoints are presented as follows.

Inspired by blackbody radiation theory: the radiated spectrum is continuous from the macroscopic view and quantized from the microscopic view. The corresponding mathematical method is the FDM, whose physical basis is quantization (discretization).

When we design a realizable PAI system based on the SLS architecture, it is mandatory that at least one subsystem be a real (physical) part, rather than having two virtual subsystems. The real subsystem consists of components that can implement optical computation, e.g., made of nonlinear optical media.

Our method of approximating nonlinear PDEs by quasilinear PDEs makes NPDEs solvable. The quasilinear approximation is reasonable because it has a physical basis: according to special relativity, superluminal velocities are impossible, so information propagation in a PAI system has a limited speed; it cannot be faster than light.

Strictly speaking, there is no absolutely simultaneous computation of convolution and activation/pooling operations in a PAI system, because of physical law restrictions. Asynchronous computing is therefore the real situation in a PAI system, and time delay should be taken into account when solving NPDEs. Both asynchrony and time delay should be considered when designing a realizable PAI system.

In the physical world, causality is related to time, because “causes” cannot happen after their “consequences”. If we believe that the arrow of time is unidirectional, NPDEs with a time variable allow us to predict the future behavior of a PAI system. They also allow us to gain insight into a PAI system and to develop interpretable NNs for causal inference.
VI-C3 NPDE Solutions
The first principle in quantum chemistry is the ab initio method, i.e., solving the Schrödinger equation. In AI systems, we believe the first principle is the PLA, i.e., solving NPDEs.
In solving NPDEs numerically, we should pay attention to the following problems:

The initial conditions and boundary conditions for various real-world applications.

The step-parameter selection problem: how to trade off computational accuracy against complexity.

By applying FEM to NPDEs, how to establish the relationship between capsule networks and the elements obtained by FEM.

Compared with neural architecture search methods, how to generate the optimal neural architecture for a given condition (data set) with Turing’s equations.
Currently, we can use numerical methods to find NPDE solutions; however, as with the Navier–Stokes equations, among the fundamental equations of fluid mechanics, finding analytical solutions is considered mathematically very challenging. It may require mathematicians to develop new mathematics for solving neural PDEs analytically.
VII Summary and Further Works
VII-A Summary
Based on the first principle, taking the free energy as the action, we derived the RDE used to describe the SLS. In this work, we proposed general neural PDEs to describe AI systems. By utilizing FEM and FDM, we generated the three most popular DNN building blocks by discretizing PDEs.
In addition to discussing the generation of neural architectures with NPDEs, we also addressed AI system optimization algorithms. Learning strategies such as ADAM, L-BFGS, pseudoinverse learning algorithms, and PDE-constrained optimization methods were also presented. Based on physical law, we proposed the quasilinear approximation for solving nonlinear PDEs.
Most significantly, we presented a clear physical picture of interpretable deep neural networks: an AI system is explained as information diffusion and reaction processing. When information is considered as a wave propagating in an AI system, analog optical computing devices are suitable for constructing a PAI system.
Classical PDEs describe physical systems, while NPDEs describe PAI systems. By utilizing FEM and FDM, we built a bridge between the PDEs of mathematical physics and modern DNN architectures. In other words, DNNs are obtained by properly discretizing NPDEs with FEM and FDM. From this viewpoint, we can say that we have developed a theory for PAI systems, and current deep neural networks are special cases of NPDEs. We therefore believe that NPDEs will be the fundamental equations for PAI systems; they provide the theoretical basis that makes analog computing device design possible, and they pave the road to PAI as well.
VII-B Further Works
The path integral formulation is a description of quantum mechanics that generalizes the action principle of classical mechanics. It replaces the classical notion of a single, unique trajectory for a system with a sum, or functional integral, over an infinity of quantum-mechanically possible trajectories to compute a quantum amplitude.
In artificial intelligence, we assign the free energy (FE) as the action, from which the RDE is derived. We intend to apply the path integral formulation to find an optimal network architecture. The key idea is that we treat different NN architectures as different paths to reach the minimum-FE state; one NN structure is regarded as one system configuration (corresponding to one path in Feynman’s interpretation). We believe this will be an alternative to neural architecture search methods, and expect it to be more effective.
Acknowledgment
The research work described in this paper was fully supported by the National Key Research and Development Program of China (No. 2018AAA0100203).

References
 [1] A. Miriyev and M. Kovač, “Skills for physical artificial intelligence,” Nature Machine Intelligence, vol. 2, no. 11, pp. 658–660, 2020. [Online]. Available: https://doi.org/10.1038/s4225602000258y
 [2] A. B. BASSET, “The principle of least action,” Nature, vol. 67, no. 1737, pp. 343–344, 1903. [Online]. Available: https://doi.org/10.1038/067343d0
 [3] C. G. Gray, G. Karl, and V. A. Novikov, “Progress in classical and quantum variational principles,” Reports on Progress in Physics, vol. 67, no. 2, pp. 159–208, jan 2004. [Online]. Available: https://doi.org/10.1088%2F00344885%2F67%2F2%2Fr02
 [4] R. P. Feynman, The Principle Of Least Action in Quantum Mechanics. WORLD SCIENTIFIC, 2021/01/07 2005, pp. 1–69. [Online]. Available: https://doi.org/10.1142/9789812567635_0001
 [5] S. Wolfram, A New Kind of Science. Champaign, IL: Wolfram Media, 2002. [Online]. Available: https://www.wolframscience.com
 [6] P. Guo, “What is the first principles for artificial intelligence (ver. 2),” Communications of the China Computer Federation, vol. 16, no. 10, pp. 53–58, October 2020, (in Chinese), [The First Principles for Artificial Intelligence (ver. 1), preprint, researchgate.net, https://doi.org/10.13140/RG.2.2.30526.72001] . [Online]. Available: https://dl.ccf.org.cn/institude/institudeDetail?_ack=1&id=5144072786659328
 [7] Y. LeCun, “Learning world models: the next step towards AI,” Keynote, the 27th IJCAI, July 2018.
 [8] P. Guo and Q. Yin, “Synergetic learning systems (I): Concept, architecture, and algorithms,” Preprint, researchgate.net (Chinese), https://doi.org/10.13140/RG.2.2.13681.12644/1; arXiv, (English), 01 2019, the third China Systems Science Conference (CSSC2019), Changsha, May 1819, 2019. (Chinese version). [Online]. Available: https://arxiv.org/abs/2006.06367
 [9] Q.X. Ye, “Introduction to reaction diffusion equation,” Practice and understanding of mathematics, pp. 48–56, 198402, in Chinese.
 [10] J. C. Willems, “Dissipative dynamical systems part I: General theory,” Archive for Rational Mechanics and Analysis, vol. 45, no. 5, pp. 321–351, 1972. [Online]. Available: https://doi.org/10.1007/BF00276493
 [11] C. Domnisoru, A. A. Kinkhabwala, and D. W. Tank, “Membrane potential dynamics of grid cells,” Nature, vol. 495, pp. 199–204, 2013.
 [12] A. M. Turing, “The chemical basis of morphogenesis,” Philosophical Transactions of the Royal Society of London, Series B, vol. 237, no. 641, pp. 37–72, 01 1952.
 [13] J. E. Pearson, “Complex patterns in a simple system,” Science, vol. 261, no. 5118, pp. 189–192, Jul 1993. [Online]. Available: http://dx.doi.org/10.1126/science.261.5118.189
 [14] G. Nikolis and I. Prigogine, SelfOrganization in NonEquilibrium Systems. New York: Wiley, 1977.
 [15] R. O. Wells, Elliptic Operator Theory. New York, NY: Springer New York, 2008, pp. 108–153. [Online]. Available: https://doi.org/10.1007/9780387738925_4
 [16] G. Adomian, “Fisher–Kolmogorov equation,” Applied Mathematics Letters, vol. 8, no. 2, pp. 51–52, 1995. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089396599500010N
 [17] A. Friedman, Partial differential equations of parabolic type. Englewood Cliffs, N.J.: PrenticeHall, 1964. [Online]. Available: //catalog.hathitrust.org/Record/000659848
 [18] J. Y. Bigot and G. J. B., “Switching dynamics of thermally induced optical bistability: Theoretical analysis,” Phys. Rev. (A), vol. 35, no. 9, pp. 810–816, 1987.
 [19] P. Guo, L. Chen, and Y.G. Sun, “Laser pulse induced thermal distribution in interference filter,” Journal of Beijing Normal University (Natural Science Edition), vol. 29, no. 1, pp. 82–86, 3 1993, (in Chinese), Received: 19920903. [Online]. Available: http://www.cnki.com.cn/Article/CJFDTOTALBSDZ199301018.htm
 [20] P. Guo and Y.G. Sun, “Thermo–optic dynamics analysis of interference filter bistable device,” Journal of Beijing Normal University (Natural Science Edition), vol. 29, no. 2, pp. 189–193, 6 1993, (in Chinese), Received: 19921103. [Online]. Available: http://www.cnki.com.cn/Article/CJFDTOTALBSDZ199302009.htm
 [21] D. J. Griffiths, Introduction to Quantum Mechanics. Princeton, NJ, 2007.
 [22] G. C. Wick, “Properties of bethesalpeter wave functions,” Phys. Rev., vol. 96, pp. 1124–1134, Nov 1954. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRev.96.1124
 [23] O. Svelto, SelfFocusing, SelfTrapping, and SelfPhase Modulation of Laser Beams. Elsevier, 1974, vol. 12, pp. 1–51. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0079663808702634
 [24] B. E. A. Saleh and M. C. Teich, Fundamentals of Photonics, 3rd ed. Hoboken, NJ: John Wiley & Sons, Inc, 2019.
 [25] P. Guo and Y.G. Sun, “Gaussian beam propagation with nonlinear medium limiter,” Acta Optica Sinica, vol. 10, no. 12, 1990, (in Chinese). [Online]. Available: http://www.cnki.com.cn/Article/CJFDTOTALGXXB199012006.htm
 [26] P. Guo, “Numerical solution of gaussian beam propagation in nonlinear gradient refractive media,” Laser Technology, vol. 14, no. 5, 1990, (in Chinese). [Online]. Available: http://www.cnki.com.cn/Article/CJFDTotalJGJS199005009.htm
 [27] P. Guo, A. A. S. Awwal, and C. L. P. Chen, “Dynamics of a coupled doublecavity optical interference filter,” Journal of Optics, vol. 46, no. 1, pp. 167–174, 1999.
 [28] P. Guo, “Synergetic learning systems (II): Interpretable neural network model with statistical physics approach,” Preprint, researchgate.net, May 2019, the Fifth National Statistical Physics & Complex Systems Conference (SPCSC 2019), Hefei, July 2629, 2019. [Online]. Available: https://doi.org/10.13140/RG.2.2.23969.66401
 [29] ——, “Synergetic learning systems (III): Automatic organization and evolution theory of neural network architecture,” Preprint, researchgate.net, 9 2020, the Fourth China Systems Science Conference (CSSC2020), QingDao, September 1920, 2020. [Online]. Available: https://doi.org/10.13140/RG.2.2.23186.07360
 [30] O. C. Zienkiewicz, R. L. Taylor, and J. Z. Zhu, The Finite Element Method: Its Basis and Fundamentals. Oxford: ButterworthHeinemann, 2013, p. 756. [Online]. Available: https://doi.org/10.1016/B9781856176330.000198
 [31] Ferziger, J. H., and M. Perić, Finite Difference Methods. Berlin, Heidelberg: Springer, 2002, in: Computational Methods for Fluid Dynamics.
 [32] O. P. Iliev, S. D. Margenov, P. D. Minev, P. S. Vassilevski, and L. T. Zikatanov, Eds., Numerical Solution of Partial Differential Equations: Theory, Algorithms, and Their Applications, 1st ed., ser. Springer Proceedings in Mathematics & Statistics. New York: SpringerVerlag, 2013, vol. 45.
 [33] L. Lapidus, Numerical Solution of Partial Differential Equations in Science and Engineering. WileyInterscience, 7 1999.
 [34] J. Lu and Z. Guan, Numerical Solving Methods for Partial Differential Equations, 2nd ed. TsingHua University Press, 1 2004, (in Chinese).
 [35] P. Guo, Q. Yin, and X.L. Zhou, Image Semantic Analysis. Beijing: Science Press, June 2015, (in Chinese). [Online]. Available: http://www.ecsponline.com/goods.php?id=174834
 [36] P. Guo, “On the structure evolutionary of the pseudoinverse learners in synergetic learning systems,” Preprint, researchgate.net, 12 2020. [Online]. Available: https://doi.org/10.13140/RG.2.2.12262.45121
 [37] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural ordinary differential equations,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 38 December 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, Eds., 2018, pp. 6572–6583. [Online]. Available: http://papers.nips.cc/paper/7892neuralordinarydifferentialequations
 [38] Z. Long, Y. Lu, and B. Dong, “PDENet 2.0: Learning PDEs from data with a numericsymbolic hybrid deep network,” Journal of Computational Physics, vol. 399, p. 108925, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0021999119306308
 [39] Y. Lu, A. Zhong, Q. Li, and B. Dong, “Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations,” Proceedings of Machine Learning Research, J. G. Dy and A. Krause, Eds., vol. 80. PMLR, 2018, pp. 3282–3291.
 [40] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Computer Vision  ECCV 2016  14th European Conference, Amsterdam, The Netherlands, October 1114, 2016, Proceedings, Part IV, 2016, pp. 630–645.
 [41] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 7 2006.
 [42] Y. Lecun, S. Chopra, R. Hadsell, M. Ranzato, and F. J. Huang, “A tutorial on energybased learning,” 2006.
 [43] Z. C. Lipton, “A critical review of recurrent neural networks for sequence learning,” 2015. [Online]. Available: http://arxiv.org/abs/1506.00019
 [44] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, L. Gruber, M. Holzleitner, M. Pavlovic, G. K. Sandve, V. Greiff, D. P. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, “Hopfield networks is all you need,” CoRR, vol. abs/2008.02217, 2020. [Online]. Available: https://arxiv.org/abs/2008.02217
 [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 49 December 2017, Long Beach, CA, USA pp. 5998–6008.
 [46] S. Hochreiter and J. Schmidhuber, “Long ShortTerm Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
 [47] N. Zheng, S. Du, J. Wang, H. Zhang, W. Cui, Z. Kang, T. Yang, B. Lou, Y. Chi, H. Long, M. Ma, Q. Yuan, S. Zhang, D. Zhang, F. Ye, and J. Xin, “Predicting covid19 in china using hybrid ai model,” IEEE Transactions on Cybernetics, vol. 50, no. 7, pp. 2891–2904, 2020. [Online]. Available: https://doi.org/10.1109/TCYB.2020.2990162
 [48] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 5 2015.
 [49] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990. [Online]. Available: https://doi.org/10.1207/s15516709cog1402_1
 [50] T. W. Hughes, I. A. D. Williamson, M. Minkov, and S. Fan, “Wave physics as an analog recurrent neural network,” Science Advances, vol. 5, no. 12, p. eaay6946, 2019.
 [51] J. W. Wang and H. N. Wu, “Design of suboptimal local piecewise fuzzy controller with multiple constraints for quasilinear spatiotemporal dynamic systems,” IEEE Transactions on Cybernetics, pp. 1–13, 2019.
 [52] R. Cherniha, M. Serov, and O. Pliukhin, Nonlinear ReactionDiffusionConvection Equations, 1st ed. New York: Chapman and Hall /CRC, 11 2017, vol. 104.
 [53] C.V. Pao, Nonlinear parabolic and elliptic equations. Boston, MA: Springer US, 1992. [Online]. Available: https://doi.org/10.1007/9781461530343
 [54] R. Cherniha and V. Davydovych, Nonlinear reactiondiffusion systems. Springer, 2017, vol. 2196, conditional Symmetry, Exact Solutions and their Applications in Biology.
 [55] C. M. Bishop, Pattern Recognition and Machine Learning. SpringerVerlag New York, 2006, chapter 10: Approximate Inference.
 [56] P. Guo, J.X. Hou, and B. Zhao, “Methodology for building synergetic learning system,” Preprint, researchget.net, (ICCS 2020, English version), 2019, the third China Systems Science Conference (CSSC2019), Changsha, May 1819, 2019. (Chinese version). [Online]. Available: http://dx.doi.org/10.13140/RG.2.2.10146.07368
 [57] Q. Yin, B. Xu, K. Zhou, and P. Guo, “Bayesian pseudoinverse learners: From uncertainty to deterministic learning,” 2020, submitted to IEEE Trans. on Cybernetics.
 [58] G. Leugering, P. Benner, S. Engell, A. Griewank, H. Harbrecht, M. Hinze, R. Rannacher, and S. Ulbrich, Eds., Trends in PDE Constrained Optimization, 4th ed., ser. International Series of Numerical Mathematics. Birkhäuser, Cham, 2014, vol. 165.
 [59] H. Antil, D. P. Kouri, M. Lacasse, and D. Ridzal, Eds., Frontiers in PDEConstrained Optimization, 1st ed., ser. The IMA Volumes in Mathematics and its Applications. New York: SpringerVerlag, 2018, vol. 163.
 [60] H. Antil, M. Heinkenschloss, R. H. W. Hoppe, and D. C. Sorensen, “Domain decomposition and model reduction for the numerical solution of PDE constrained optimization problems with localized optimization variables,” Computing and Visualization in Science, vol. 13, no. 6, pp. 249–264, 2010.
 [61] A. N. Tikhonov and V. Y. Arsenin, Solutions of IllPosed Problems. Hoboken & New York: Wiley, 1977, chapter 2.
 [62] P. Guo, M. R. Lyu, and C. L. P. Chen, “Regularization parameter estimation for feedforward neural networks,” IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 33, no. 1, pp. 35–44, 2003. [Online]. Available: https://doi.org/10.1109/TSMCB.2003.808176
 [63] J. Han, A. Jentzen, and W. E, “Solving highdimensional partial differential equations using deep learning,” Proceedings of the National Academy of Sciences, vol. 115, no. 34, pp. 8505–8510, 2018.
 [64] J. Han, L. Zhang, and W. E, “Solving manyelectron schrödinger equation using deep neural networks,” Journal of Computational Physics, vol. 399, p. 108929, 09 2019.
 [65] D. Pfau, J. S. Spencer, A. G. D. G. Matthews, and W. M. C. Foulkes, “Ab initio solution of the manyelectron schrödinger equation with deep neural networks,” Phys. Rev. Research, vol. 2, p. 033429, Sep 2020.
 [66] H. Niu, P. Guo, L. Ji, Q. Zhao, and T. Jiang, “Improving image quality of diffuse optical tomography with a projectionerrorbased adaptive regularization method,” Optics Express, vol. 16, no. 17, pp. 12 423–34, 2008.
 [67] H. Wu and Q. Dai, “Artificial intelligence accelerated by light,” Nature, vol. 589, pp. 25–26, January 2021, news & Views. [Online]. Available: https://doi.org/10.1038/d4158602003572y
 [68] P. Guo, Y.G. Sun, J. Xiong, W. Wang, and Z. Jiang, Nonlinear threemirror FabryPerot multistability element. Singapore: World Scientific, 1987, pp. 83–87. [Online]. Available: https://www.researchgate.net/publication/330955256_Nonlinear_threemirror_FabryPerot_multistability_element
 [69] Y.G. Sun, Z. Jiang, J. Xiong, W. Wang, P. Guo, and S. Shang, Optical Bistability in Very Thin ZnSe Film. Singapore: World Scientific, 1987, pp. 95–97.
 [70] W. Wang, Y.G. Sun, J. Xiong, Z. Jiang, P. Guo, and S. Shang, Lowpower Optical Bistability in Improved ZnSe Interference Filters. Singapore: World Scientific, 1987, pp. 57–59. [Online]. Available: https://www.researchgate.net/publication/348390364_Lowpower_Optical_Bistability_in_Improved_ZnSe_Interference_Filters
 [71] P. Guo and M. R. Lyu, “A new approach to optical multilayer learning neural network,” in Proceedings of International Conference on artificial intelligence (ICAI’01), H. R. Arabnia, Ed., vol. I. Las Vegas, Nevada, USA: CSREA Press, 5 2001, pp. 426–432. [Online]. Available: https://www.researchgate.net/publication/331463645_A_New_Approach_to_Optical_Multilayer_Learning_Neural_Network
 [72] X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, and D. J. Moss, “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature, vol. 589, no. 7840, pp. 44–51, 2021.
 [73] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature, vol. 589, no. 7840, pp. 52–58, 2021.
 [74] B. Sanchez-Lengeling and A. Aspuru-Guzik, “Inverse molecular design using machine learning: Generative models for matter engineering,” Science, vol. 361, no. 6400, pp. 360–365, 2018.
 [75] R. M. Neal, Priors for Infinite Networks. New York, NY: Springer New York, 1996, pp. 29–53, in Bayesian Learning for Neural Networks.
 [76] P. Guo, C. L. P. Chen, and M. R. Lyu, “Cluster number selection for a small set of samples using the bayesian yingyang model,” IEEE Trans. Neural Networks, vol. 13, no. 3, pp. 757–763, 2002. [Online]. Available: https://doi.org/10.1109/TNN.2002.1000144
 [77] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, “Automatically designing CNN architectures using the genetic algorithm for image classification,” IEEE Transactions on Cybernetics, vol. 50, no. 9, pp. 3840–3854, 2020. [Online]. Available: https://doi.org/10.1109/TCYB.2020.2983860
 [78] D. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
 [79] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, 3rd ed., ser. Texts in Applied Mathematics. SpringerVerlag New York, 2002, vol. 12, p.243, Translated by Gautschi, W., Bartels, R., Witzgall, C.
 [80] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, vol. 45, no. 1, pp. 503–528, 1989.
 [81] W. Chen, Z. Wang, and J. Zhou, “Largescale LBFGS using mapreduce,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 813 2014, Montreal, Quebec, Canada, pp. 1332–1340.
Appendix A Learning Strategies
A-A Learning Optimization
Suppose information diffuses along the depth direction, and at a given time the system produces an output. We define a loss function $L(\theta)$, assumed differentiable; the optimal parameter for a given data set is

$$ \theta^{*} = \arg\min_{\theta} L(\theta). \qquad (47) $$
We can use gradient-based (first-order) optimization methods, such as plain gradient descent, steepest descent, conjugate gradient, Rprop, stochastic gradient descent, and so on. When used to minimize the loss function, a standard (or “batch”) gradient descent method performs the following iterations:

$$ \theta_{t+1} = \theta_{t} - \eta\, \nabla_{\theta} L(\theta_{t}), \qquad (48) $$
where $\eta$ is the step size (learning rate). From ref. [55], we have

$$ \frac{d\theta}{dt} = -\nabla_{\theta} L(\theta), \qquad (49) $$

which is an ordinary differential equation for the network parameter $\theta$.
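The batch iteration of Eq. (48) can be sketched in a few lines (the quadratic test loss is our illustrative choice, not from the paper):

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Batch gradient descent: theta <- theta - lr * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=[0.0])
```

As the step size shrinks, the iteration approaches the gradient-flow ODE of Eq. (49).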
There are several other optimization methods, for example, the genetic algorithm [77], but the most widely used are gradient-descent-based methods. In the following we review gradient descent algorithms.
A-B Gradient Descent Algorithms
A-B1 First-Order Gradient Descent Algorithm
One can also use the Adam (Adaptive Moment Estimation) optimizer [78]. Adam’s parameter update is given by:

$$ m_{t} = \beta_{1} m_{t-1} + (1-\beta_{1})\, g_{t}, \qquad v_{t} = \beta_{2} v_{t-1} + (1-\beta_{2})\, g_{t}^{2}, $$
$$ \hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^{t}}, \qquad \hat{v}_{t} = \frac{v_{t}}{1-\beta_{2}^{t}}, \qquad \theta_{t} = \theta_{t-1} - \eta\, \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}, \qquad (50) $$

where $g_{t} = \nabla_{\theta} L(\theta_{t-1})$, $\epsilon$ is a small scalar (e.g., $10^{-8}$) used to prevent division by zero, and $\beta_{1}$ (e.g., 0.9) and $\beta_{2}$ (e.g., 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting are done element-wise.
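A self-contained sketch of the Adam update above (default constants follow [78]; the quadratic test loss is our illustrative choice):

```python
import numpy as np

def adam(grad, theta0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    """Adam: bias-corrected first (m) and second (v) moment estimates."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1.0 - b1) * g
        v = b2 * v + (1.0 - b2) * g**2
        m_hat = m / (1.0 - b1**t)  # bias correction of the first moment
        v_hat = v / (1.0 - b2**t)  # bias correction of the second moment
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_star = adam(lambda t: 2.0 * (t - 3.0), theta0=[0.0])
```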
A-B2 Second-Order Optimization Algorithm
We may hope to use a second-order optimization algorithm, for example, Newton’s method:

$$ \theta_{t+1} = \theta_{t} - \mathbf{H}^{-1}\, \nabla_{\theta} L(\theta_{t}), \qquad (51) $$

where $\mathbf{H} = \nabla_{\theta}^{2} L(\theta_{t})$ is the Hessian.
Sometimes we wish to reduce the complexity of a given problem, so certain approximations are adopted when solving neural PDEs.
A-B3 Pseudoinverse Approximation of the Hessian
In Newton’s method the Hessian must be computed, which is very computationally intensive: for a complex neural network the number of parameters is very large, hence the Hessian is huge. Therefore we need to take some approximations to reduce the complexity. Here we try to find another route, not along the quasi-Newton line, to investigate PDE network solvers theoretically.
As we know, if $\mathbf{H}$ is invertible (as it is here), its pseudoinverse is its inverse, that is, $\mathbf{H}^{+} = \mathbf{H}^{-1}$ [79]. Now we use the pseudoinverse instead of the inverse in Eq. (51):

$$ \mathbf{H}^{+} = (\mathbf{H}^{*}\mathbf{H})^{-1}\mathbf{H}^{*}, \qquad (52) $$

where $\mathbf{H}^{*}$ denotes the Hermitian (conjugate) transpose; because $\mathbf{H}$ is a real symmetric matrix, $\mathbf{H}^{*} = \mathbf{H}$. But we would like to keep the pseudoinverse form. Now Eq. (51) becomes

$$ \theta_{t+1} = \theta_{t} - \mathbf{H}^{+}\, \nabla_{\theta} L(\theta_{t}). \qquad (53) $$
When we assign the matrix $\mathbf{P} = \mathbf{H}^{+}$ and pass to the continuous limit of the iteration, Eq. (53) can be written in the form:

$$ \frac{\partial \theta}{\partial s} = -\mathbf{P}\, \nabla_{\theta} L(\theta), \qquad (54) $$

where $s$ is the continuous counterpart of the iteration index (cf. Eq. (49)). This PDE is used for optimization, and we call it the pseudoinverse learning method for finding the network’s parameters.
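A one-step sketch of the discrete update in Eq. (53); the toy quadratic loss is our illustrative choice, and `np.linalg.pinv` supplies the Moore–Penrose pseudoinverse:

```python
import numpy as np

def newton_pinv_step(theta, grad, hess):
    """Newton-type step with the pseudoinverse: theta <- theta - pinv(H) @ grad."""
    return theta - np.linalg.pinv(hess(theta)) @ grad(theta)

# quadratic loss L = 0.5 * theta^T A theta - b^T theta; minimizer solves A theta = b
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 4.0])
grad = lambda th: A @ th - b
hess = lambda th: A
theta = newton_pinv_step(np.zeros(2), grad, hess)  # exact minimizer in one step
```

On a quadratic loss one pseudoinverse-Newton step lands exactly on the minimizer, which is why the method is attractive when the Hessian structure is known.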
A-B4 Symmetry Breaking
(55) 
We merge these quantities into a new parameter; then we have:
(56) 
Unlike Turing’s equations, the two PDEs above are not symmetric, and they also have different meanings: one evolution variable lives in the data space, while the other lives in the parameter space. In the first equation the evolution variable stands for the depth direction in a virtual medium, while in the second it denotes the iterative step in the parameter optimization process.
A-B5 Gauss–Newton Algorithm
In practical implementations, to avoid directly computing the Hessian, quasi-Newton methods are utilized. Quasi-Newton methods are used to find either zeroes or local maxima and minima of functions, as an alternative to Newton’s method. They can be used when the Jacobian or Hessian is unavailable or too expensive to compute at every iteration. The “full” Newton’s method requires the Jacobian in order to search for zeros, or the Hessian for finding extrema.
The Gauss–Newton algorithm is used to solve nonlinear least squares problems. It is a modification of Newton’s method for finding a minimum of a function. Unlike Newton’s method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.
Minimizing the sum of squares

$$ S(\theta) = \sum_{i} r_{i}(\theta)^{2}, $$

we start with an initial guess $\theta^{(0)}$ for the minimum, and the method proceeds by the iterations

$$ \theta^{(k+1)} = \theta^{(k)} - \left(\mathbf{J}^{T}\mathbf{J}\right)^{-1}\mathbf{J}^{T}\, \mathbf{r}\!\left(\theta^{(k)}\right), $$

where $\mathbf{J} = \partial \mathbf{r} / \partial \theta$ is the Jacobian matrix.
Note that $\left(\mathbf{J}^{T}\mathbf{J}\right)^{-1}\mathbf{J}^{T}$ is the left pseudoinverse of $\mathbf{J}$. This form looks the same as Eq. (53); in fact, it is a special case of Newton’s method.
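The Gauss–Newton iteration with the left pseudoinverse can be sketched as follows (the exponential model, its data, and the starting point are our illustrative choices):

```python
import numpy as np

def gauss_newton(r, J, theta0, steps=20):
    """Gauss-Newton for min ||r(theta)||^2: theta <- theta - pinv(J) @ r(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - np.linalg.pinv(J(theta)) @ r(theta)
    return theta

# fit y = a * exp(b * x) to noiseless data generated with a = 2, b = 0.5
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * np.exp(0.5 * x)
r = lambda th: th[0] * np.exp(th[1] * x) - y            # residual vector
J = lambda th: np.stack([np.exp(th[1] * x),             # dr/da
                         th[0] * x * np.exp(th[1] * x)  # dr/db
                         ], axis=1)
theta = gauss_newton(r, J, theta0=[1.5, 0.4])
```

Only first derivatives (the Jacobian of the residual) are needed, which is the practical advantage over full Newton.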
A-C L-BFGS Algorithm
Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory [80].
Like the original BFGS, L-BFGS uses an estimate of the inverse Hessian matrix to steer its search through variable space; but where BFGS stores a dense $n \times n$ approximation to the inverse Hessian ($n$ being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly [81].
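As a usage sketch (assuming SciPy is available; the Rosenbrock test loss is our illustrative choice), L-BFGS needs only the loss and its gradient:

```python
import numpy as np
from scipy.optimize import minimize

def loss(th):
    """Rosenbrock function, a classic non-convex optimization test problem."""
    return (1.0 - th[0])**2 + 100.0 * (th[1] - th[0]**2)**2

def grad(th):
    return np.array([
        -2.0 * (1.0 - th[0]) - 400.0 * th[0] * (th[1] - th[0]**2),
        200.0 * (th[1] - th[0]**2),
    ])

# L-BFGS-B builds its limited-memory inverse-Hessian estimate internally
res = minimize(loss, x0=[-1.0, 2.0], jac=grad, method="L-BFGS-B")
```

The limited-memory approximation makes this practical even when the parameter vector is far too large for a dense inverse-Hessian.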
Now we have two equations: one for forward propagation and the other for back propagation. In addition to the L-BFGS algorithm, other algorithms can also be used for learning optimization.