During the development of artificial intelligence (AI), scholars of different disciplines have refined their understanding of artificial intelligence, put forward different viewpoints, and produced different academic schools of thought. There are three major schools of thought that have great influence on artificial intelligence research: symbolism, connectionism, and actionism.
Symbolism, also known as logicism, psychologism, or computerism, is based on the physical symbol system hypothesis (i.e., symbol-manipulating systems) and the principle of bounded rationality.
The main principle of connectionism, also known as bionicism or physiologism, is the connection mechanism between neurons in neural networks and the associated learning algorithms.
Actionism (more commonly called behaviorism), also known as evolutionism or cyberneticism, is based on cybernetics and perception-action control systems.
After two troughs, the development of artificial intelligence benefited from the persistence and continuous efforts of Hinton et al. In 2006, the concept and algorithm of deep belief networks were proposed, which rekindled enthusiasm for neural networks in the field of artificial intelligence. In 2012, Hinton and his student Alex Krizhevsky designed AlexNet, based on a Convolutional Neural Network (CNN), and exploited the parallel computing power of GPUs in an image recognition competition at the forefront of computer vision. Its test error rate was 15.3%, much lower than the 26.2% of the second-place entry. In 2015, LeCun, Bengio and Hinton jointly published a review paper on deep learning Deeplearning in the journal Nature, and neural networks experienced a strong resurgence under the name “deep learning”. Thanks to the explosive growth of interconnected data, the significant increase in computing power, and the development and maturity of deep learning algorithms, we have ushered in the third wave of development since the emergence of artificial intelligence.
Over time and as research has expanded, deep learning has encountered bottlenecks, and the theory of artificial intelligence has stagnated. Gary Marcus, a professor of psychology at New York University, poured cold water on deep learning. He criticized various problems in deep learning, as detailed in Marcus2018.
Xu Zongben, an Academician at Xi'an Jiaotong University, stated Xu2019three:
“It is difficult to design topologies, to anticipate effects, and to explain mechanisms with deep learning. There is no solid mathematical theory to support solving these three problems. Solving these problems is the main focus of deep learning for future research. ”
Recently, M. Mitchell Waldrop published a news feature in the Proceedings of the National Academy of Sciences (PNAS) entitled “News Feature: What are the Limitations of Deep Learning?” Waldrop2019News. In this PNAS feature, Waldrop briefly describes the history of deep learning and argues that the benefits of abundant computing power have made artificial intelligence flourish today. However, deep learning has many limitations, including vulnerability to adversarial attacks, low learning efficiency, application instability, lack of common sense, and lack of interpretability. From a computability point of view, more and more people in the field of artificial intelligence research believe that some fundamentally new ideas are needed to overcome the shortcomings of deep learning.
Yann LeCun gave a talk entitled “Learning World Models: the Next Step towards AI” at the opening ceremony of IJCAI-2018 LeCun2018jicai. LeCun said that the future of the artificial intelligence revolution will be neither supervised learning nor pure reinforcement learning, but rather a world model with common sense reasoning and predictive ability. Intuitively, the world model contains general background knowledge about how the world works, the ability to predict the consequences of actions, and the ability to plan and reason over the long term. Yann LeCun summarized three learning paradigms, namely reinforcement learning, supervised learning and self-supervised learning, and argued that self-supervised learning (formerly known as predictive learning) is a promising research direction for realizing the world model. At the end of the lecture, Yann LeCun summarized how technology and science drive and promote each other, as with telescopes and optics, steam engines and thermodynamics, or computers and computer science. He also raised several questions:
What is the equivalent of “thermodynamics” in intelligence?
Are there underlying principles behind artificial intelligence and natural intelligence?
Are there simple principles behind learning?
Is the brain a collection of a large number of “hacks” that evolved?
As there are many schools of thought regarding basic research on artificial intelligence, it is difficult to construct a unified theory to answer these questions. However, we believe that computational intelligence (CI) is a new stage in the development of artificial intelligence. CI is a nature-inspired intelligence and a paradigm that has the potential to solve most problems. CI draws on the phenomena and laws of physics, chemistry, mathematics, biology, psychology, physiology, neuroscience and computer science, and integrates the three artificial intelligence schools of thought to form an organic whole. A system formed by integrating multiple disciplines and technologies can realize complementary advantages and will be more effective than a single discipline or technology. Therefore, in order to overcome the shortcomings of deep learning, we propose using cognitive neuroscience mechanisms and the mathematical tools of machine learning to develop a new generation of artificial intelligence. Based on computational intelligence, we should develop a Synergetic Learning Systems (SLS) Synergetic2019 to establish a theoretical foundation for intelligent “thermodynamics.”
For a more detailed analysis of the current status of artificial intelligence basic research and development trends, please refer to the literature ChinaXiv2019 .
The structure of Part I is as follows: the methodology for developing a Synergetic Learning Systems is given in section 2, section 3 introduces the basic concept of the Synergetic Learning Systems, section 4 describes the architecture of the Synergetic Learning Systems, section 5 lists the relevant optimization algorithms, and the last section summarizes and analyzes the future direction of the Synergetic Learning Systems.
We know that a methodology is a theoretical system aimed at solving problems, usually involving the elaboration of problem stages, tasks, tools, and methodological techniques.
We believe that the artificial intelligence system needs to be analyzed and studied systematically at multiple scales, levels, and perspectives. Multi-scale here refers to the study of artificial intelligence systems at the macroscopic, mesoscopic, and microscopic scales. These three scales constitute our “Three perspectives” method-SLS.
In the theme report for the Second China System Science Conference, “How does the brain work as a whole?”, Academician Guo Aike, a neuroscientist and biophysicist in China, mentioned: “The brain function linkage map should be drawn from the macro-brain scale, the mesoscopic neural network scale, and the micro synapse scale, and more scales can be considered.” CSSC08
We can also draw on the research results of basic disciplines, such as the evolution of the universe on a macroscopic scale. The main physical parameters for the evolution of the universe are temperature and gravity. According to the Big Bang theory, the initial temperature was very high, and the current galaxy structure evolved through cooling and gravity. Therefore, the temperature of a system is a very important basic quantity. In the evolution of the Synergetic Learning Systems, the simulated annealing algorithm, which gradually lowers a temperature parameter, can be considered in order to solve combinatorial optimization problems.
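To make the role of temperature concrete, the following is a minimal sketch of simulated annealing on a toy combinatorial problem (splitting a list of numbers into two groups with nearly equal sums). The problem instance, the cooling schedule and all function and parameter names are assumptions chosen for illustration; they are not taken from the SLS work.

```python
import math
import random

def simulated_annealing(weights, t_start=10.0, t_end=1e-3, cooling=0.995, seed=0):
    """Split `weights` into two groups whose sums are as equal as possible.

    State:  a list of +1/-1 signs (group membership of each weight).
    Energy: |sum_i sign_i * weight_i|, the imbalance between the two groups.
    """
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in weights]
    signed_sum = sum(s * w for s, w in zip(signs, weights))
    energy = abs(signed_sum)
    best_energy, best_signs = energy, signs[:]
    temperature = t_start
    while temperature > t_end:
        i = rng.randrange(len(weights))
        new_sum = signed_sum - 2 * signs[i] * weights[i]  # effect of flipping sign i
        new_energy = abs(new_sum)
        delta = new_energy - energy
        # Metropolis rule: always accept improvements; accept a worse state
        # with probability exp(-delta / T), which shrinks as T is lowered.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            signs[i] = -signs[i]
            signed_sum, energy = new_sum, new_energy
            if energy < best_energy:
                best_energy, best_signs = energy, signs[:]
        temperature *= cooling  # geometric cooling schedule (an assumption)
    return best_energy, best_signs

weights = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical toy instance
best_energy, best_signs = simulated_annealing(weights)
print("best imbalance found:", best_energy)
```

At high temperature the search accepts many uphill moves and explores widely; as the temperature falls it settles into low-energy configurations, mirroring the cooling picture described above.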
To analyze and study the artificial intelligence system systematically is to adopt the methodology of system science and use the viewpoint of systems to understand and grasp the essence, movement and development of artificial intelligence. Our proposed Synergetic Learning Systems is inspired by natural intelligence and is based on cognitive neuroscience mechanisms and computational intelligence, integrating multidisciplinary knowledge and adopting the mode of complex system thinking. The Synergetic Learning Systems is based on the concept of systems science. The use of the term “synergetic” is influenced by the idea of “synergetics” Hermann10 Hermann11. However, SLS includes cooperative synergetic learning and competitive synergetic learning among various elements, as well as concepts and methods such as system evolution and evolutionary games.
In the field of neural network research, Hinton’s Boltzmann machine, the Helmholtz machine, and the restricted Boltzmann machine draw on the concepts and research methods of statistical physics. Therefore, we also need to look at problems with physical thinking during the research process. Statistical mechanics looks through phenomena to the essence, where the phenomena are the observed data and the essence is the underlying law. In statistical mechanics, probability distributions, mathematical models and other tools are used to systematically quantify and analyze the general laws and randomness behind observed data.
For a more detailed discussion, please refer to method-SLS .
3 Basic Concept
The Synergetic Learning Systems we propose is an information processing system, which is equivalent to an intelligent “thermodynamic system.” As we know, neural networks process information to achieve intelligent information processing and decision making in a given environment. Nature follows the rule of “natural selection, survival of the fittest”; when an artificial intelligence system evolves, a “human selection” rule should be adopted instead. We believe that this rule is the principle of free energy. Nature favors physical systems with minimum free energy, so free energy can also be used as the objective function of system evolution.
The concept of free energy comes from statistical physics. It refers to the part of a system's energy that can be converted into external work in a given thermodynamic process. Depending on the thermodynamic process, this “useful energy” that the system can output is either the Helmholtz free energy or the Gibbs free energy.
At NIPS’93, Hinton et al. Hinton1994Autoencoders12 related the auto-encoder and the minimum description length (MDL) principle to the Helmholtz free energy, and changed the auto-encoder to a restricted Boltzmann machine. Hinton borrowed concepts from statistical physics to explain the deep belief network. From the viewpoint of statistical physics, the relationship between free energy, internal energy and entropy in the canonical ensemble can be used to address the interpretability problem of the model. In statistical physics, an ensemble is a collection of a large number of possible states of a system under certain conditions. In the canonical ensemble, the state functions free energy $F$, internal energy $U$, entropy $S$, and temperature $T$ satisfy $F = U - TS$, and the relationship between the free energy and the partition function $Z$ is:
$$F = -k_B T \ln Z,$$
where $k_B$ is the Boltzmann constant. Entropy is a linear combination of the free energy $F$ and the average internal energy $U = \langle E \rangle$, with coefficients set by the temperature $T$: $S = (U - F)/T$. The partition function is $Z = \sum_i g_i e^{-\beta E_i}$, where $g_i$ is the degeneracy of energy level $E_i$ and $\beta = 1/(k_B T)$. In the canonical ensemble, the probability distribution function is the Boltzmann distribution, $p_i = g_i e^{-\beta E_i}/Z$. Therefore, as long as the free energy of the ensemble is defined and the combined network model is optimized by the principle of least action, the desired neural network system can be obtained.
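The canonical-ensemble relations can be checked numerically. The sketch below assumes a hypothetical three-level system in units where the Boltzmann constant equals 1; the energies, degeneracies and function name are illustrative. It computes the partition function, the Boltzmann distribution, the free energy, the internal energy and the Gibbs entropy, so that the free energy obtained from the partition function can be compared with internal energy minus temperature times entropy.

```python
import math

def canonical_ensemble(energies, degeneracies, temperature, k_b=1.0):
    """Return (Z, probabilities, F, U, S) for a system with discrete energy levels."""
    beta = 1.0 / (k_b * temperature)
    weights = [g * math.exp(-beta * e) for e, g in zip(energies, degeneracies)]
    z = sum(weights)                                    # partition function
    probs = [w / z for w in weights]                    # Boltzmann distribution over levels
    free_energy = -k_b * temperature * math.log(z)      # F = -k_B T ln Z
    internal_energy = sum(p * e for p, e in zip(probs, energies))  # U = <E>
    # Gibbs entropy: each of the g_i microstates of level i has probability p_i / g_i.
    entropy = -k_b * sum(p * math.log(p / g) for p, g in zip(probs, degeneracies))
    return z, probs, free_energy, internal_energy, entropy

# Hypothetical three-level system (k_B = 1).
energies = [0.0, 1.0, 2.0]
degeneracies = [1, 2, 1]
temperature = 0.5
z, probs, f, u, s = canonical_ensemble(energies, degeneracies, temperature)
print("F from Z :", f)
print("U - T*S  :", u - temperature * s)
```

The two printed values agree up to floating-point rounding, which is the identity between free energy, internal energy and entropy used in the text.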
The energy-based learning algorithm Lecun2006A13 can also be traced back to statistical physics. Hinton also persuaded the neuroscientist Karl Friston to accept the idea that the best way to explore the brain is to think of it as a Bayesian probability machine. In 2010, Friston published a paper titled “The free-energy principle: a unified brain theory?” in Nature Reviews Neuroscience, explaining the brain’s operating mechanism using the principle of free energy Friston2010. The work of Hinton et al. and Friston convinces us that the study of an intelligent “thermodynamic” system based on the principle of free energy is a sound source of inspiration for the Synergetic Learning Systems theory.
To break through the dilemmas of deep learning, namely the difficulty of designing network topologies, anticipating effects, and explaining mechanisms, we propose building a Synergetic Learning Systems to establish a “grand unified theory” of intelligence.
Drawing on system theory to study SLS, the most fundamental Synergetic Learning Systems we designed has two subsystems (or models): the system reduction model (discriminative model) and the system evolution model (generative model). The evolution of the system is described by a differential dynamic system.
At IJCAI 2018, one of Yann LeCun’s questions was, “Is there a simple rule behind learning?” We think there should be a simple rule. It is well known that physics has simple and elegant laws, one of which is the principle of least action.
For a mechanical system, varying the action (Hamilton's action) according to the principle of least action yields the Lagrange equations that describe the system. If we define the total action of the system as the sum of the actions of the gravitational field and the matter field, Einstein’s field equations of general relativity can be derived from the principle of least action.
Our proposed Synergetic Learning Systems is an information processing system based on the principle of free energy. Therefore, we propose that in an SLS, the amount of action in the system is equal to the free energy. The principle of free energy in the SLS is, in particular, equivalent to the principle of least action, and free energy is equivalent to the Hamiltonian of the mechanical system. Therefore, our proposition is:
Free energy == Hamiltonian;
Principle of free energy == principle of least action.
Therefore, for a given environment (data), as long as the “Hamiltonian” of the neural network system is defined, the self-organization and evolution of the neural network structure can be systematically studied through multi-view and multi-scale dynamics equations. This gives us an important concept: the principle of least action is the first principle for artificial intelligence.
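The dynamics of the SLS should be described by differential dynamic equations. What kind of equation is this? We know that in the field of chemical research, the evolution equation of a one-component system is restricted in that it can be written in a variational form in terms of a “free energy” wikiRDE2016:
$$\partial_t u = -\frac{\delta \mathcal{F}}{\delta u}.$$
The “free energy” is given by:
$$\mathcal{F}[u] = \int_{-\infty}^{\infty} \left[ \frac{D}{2} \left( \partial_x u \right)^2 - V(u) \right] dx.$$
In the above formula, $V(u)$ is the chemical potential and $R(u) = dV(u)/du$. Therefore, we have
$$\partial_t u = D\, \partial_x^2 u + R(u).$$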
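If we generalize the equation to a multi-component system, we obtain from the variation of the “free energy”:
$$\partial_t \mathbf{q} = \mathbf{D} \nabla^2 \mathbf{q} - (\mathbf{v} \cdot \nabla)\, \mathbf{q} + \mathbf{R}(\mathbf{q}),$$
in which $\mathbf{q}(\mathbf{x}, t)$ is the vector of components, $\mathbf{D}$ is the diffusion matrix, $\mathbf{v}$ is the convection vector and $\mathbf{R}(\mathbf{q})$ is the reaction vector Qixiao16.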
Therefore, the free energy of the system is equal to the action, and varying the action according to the principle of least action (the principle of free energy) yields the reaction-diffusion equation. The dynamics of the SLS is therefore described by the reaction-diffusion equation. In statistical physics, dissipative structure theory is a theoretical description of the self-organization of non-equilibrium systems, and it uses the reaction-diffusion equation wikiRDE2016.
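The variational picture for a one-component system can be illustrated numerically. The sketch below is a minimal example under stated assumptions: a one-dimensional periodic grid, a double-well potential V(u) = u^2/2 - u^4/4 (so the reaction term is R(u) = u - u^3), and explicit Euler time stepping with illustrative values for the diffusion coefficient, grid spacing and time step. None of these choices come from the source. It evolves the reaction-diffusion equation and records the discrete “free energy” at every step.

```python
import random

D, DX, DT = 1.0, 0.5, 0.01   # hypothetical diffusion coefficient, grid spacing, time step

def potential(u):
    # Assumed double-well potential V(u) = u^2/2 - u^4/4
    return 0.5 * u ** 2 - 0.25 * u ** 4

def reaction(u):
    # R(u) = dV/du for the potential above
    return u - u ** 3

def free_energy(u):
    """Discrete version of F[u] = integral of (D/2)(du/dx)^2 - V(u), periodic grid."""
    n = len(u)
    total = 0.0
    for i in range(n):
        grad = (u[(i + 1) % n] - u[i]) / DX
        total += (0.5 * D * grad ** 2 - potential(u[i])) * DX
    return total

def step(u):
    """One explicit Euler step of du/dt = D d2u/dx2 + R(u) with periodic boundaries."""
    n = len(u)
    new_u = []
    for i in range(n):
        lap = (u[(i + 1) % n] - 2.0 * u[i] + u[i - 1]) / DX ** 2
        new_u.append(u[i] + DT * (D * lap + reaction(u[i])))
    return new_u

rng = random.Random(0)
u = [rng.uniform(-0.1, 0.1) for _ in range(64)]   # small random initial state
energies = [free_energy(u)]
for _ in range(2000):
    u = step(u)
    energies.append(free_energy(u))
print("initial free energy:", energies[0])
print("final free energy  :", energies[-1])
```

Because the update is a small gradient step on the discrete functional, the recorded free energy does not increase from one step to the next, which is the “permanent decrease” expressed by the variational form.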
Below we describe the architecture of the Synergetic Learning Systems.
4 System Architecture
The unity of structure and function is one of the basic concepts in biology. The brain has a complex neural network structure. Therefore, a unified architecture is very important, and it is unified with the function of the system.
Aristotle, a famous ancient Greek philosopher, proposed that “the whole is greater than the sum of the parts,” which is an ancient, simple holistic view and a basic principle of modern system theory. In accordance with this basic principle, we have designed an SLS that consists of two or more subsystems. The SLS cannot be a simple isolated system. It contains at least two subsystems, so that the performance of the system reflects both the whole and its parts. The brain is an open complex giant system. To simulate the neural network structure of the brain, the SLS should also be an open complex giant system.
According to Mr. Hsue-Shen Tsien’s classification method XueSen1990, if a system has many kinds of subsystems and a hierarchical structure, and the relationships between them are very complicated, it is a complex giant system. If this system is also open, it is called an open complex giant system. Openness here refers to the exchange of energy, information or matter between the system and the outside world. To be more precise: 1. the system and its subsystems have various exchanges of information with the outside world; 2. each subsystem in the system acquires knowledge through learning.
Academician Guo Aike elaborated on the working principle of the brain and the roots of its intelligence: “How does the human brain work as a whole? ‘The Tao produced One, One produced Two, Two produced Three, Three produced All things.’ My initial understanding is that it is the result of a multi-module synergetic operation; I believe that the function of the brain is the result of the collaboration of thousands of subsystems with different specialized skills, which is the result of the entangled combination of millions of years of evolution” CSSC08. Therefore, referring to the neurocognitive mechanism, the overall working state of the SLS is also the result of the coordinated operation of multiple subsystems (multiple modules).
The visual organ of Drosophila consists of more than 750 ommatidia (simple eyes). William Bialek, a theoretical physicist at Princeton University, has shown that these eyes work together to create a visual system that enables highly accurate calculations Haykin2011nnnlm. This system illustrates the synergy of individuals in complex systems and is one of the biological bases for our proposed SLS.
Figure 1 is a schematic diagram of the architecture of our Synergetic Learning Systems. In this SLS, subsystems can have multiple structures and layers. The system is flexible and scalable and has complex interrelationships between subsystems.
4.1 Multi-agent system
If each subsystem is a peer-to-peer model, and each subsystem is an agent, the Synergetic Learning Systems will be a Multi-Agent System (MAS).
A Multi-Agent System is a collection of multiple agents. Its goal is to transform a large, complex system into small, coordinated, and manageable systems that can easily communicate with each other. It can therefore be understood as a specific application of the divide-and-conquer strategy.
The Multi-Agent System is a coordination system in which the agents solve large-scale complex problems by coordinating with each other. The Multi-Agent System is also an integrated system, which uses information integration technology to combine the information from each subsystem and thereby complete the integration of a complex system.
In a Multi-Agent System, the agents communicate and coordinate with each other and solve problems in parallel, thereby effectively improving the system's ability to solve problems. Multi-Agent Systems are suitable for complex, open distributed systems. They accomplish tasks through the cooperation of the agents. The key to realizing a Multi-Agent System is the communication and coordination between these agents, that is, the synergy. After the data are provided, the process of building the MAS is the process of synergetic learning.
Data are the manifestation and carrier of information. Therefore, a specific SLS relies on a given data set; that is, building it is a data-driven modeling process. As the system evolves, the data-driven “single-wheel drive” will evolve into a “two-wheel drive” comprising both data and models.
We believe that the multi-agent system and swarm intelligence are highly similar, but the scope of swarm intelligence may be wider than that of the MAS. The SLS focuses on how each agent works with the others. Each subsystem can be a complex neural network system, and the emphasis on the system as a whole is stronger than in the MAS or swarm intelligence systems.
One special example is the mixture-of-experts architecture, in which a gating network can be used to combine the opinions of the various expert networks. The gating network can be designed to use simple voting or weighted voting. In addition, the gating network can be connected to the expert networks to coordinate their inputs and outputs according to the given task, which is the servo mechanism in synergetic learning. The gating network controls the expert networks; from the viewpoint of stacked generalization, the gating network module is a meta-learner.
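The two combination rules mentioned above, simple voting and weighted voting through a gating network, can be sketched as follows. Everything in this example is hypothetical: the three hand-written “experts”, the linear gating scores and the scalar input are stand-ins for trained networks, chosen only to show how the gate weights the experts’ outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(threshold):
    """Hypothetical expert: maps a scalar input to probabilities of classes 0 and 1."""
    def expert(x):
        p1 = 1.0 / (1.0 + math.exp(-4.0 * (x - threshold)))
        return [1.0 - p1, p1]
    return expert

EXPERTS = [make_expert(0.25), make_expert(0.50), make_expert(0.75)]
GATE_WEIGHTS = [(-2.0, 1.0), (0.0, 0.0), (2.0, -1.0)]  # assumed (slope, offset) per expert

def gate(x):
    """Gating network: input-dependent mixing weights that sum to one."""
    return softmax([a * x + b for a, b in GATE_WEIGHTS])

def weighted_vote(x):
    """Mixture output: gate-weighted average of the experts' class probabilities."""
    g = gate(x)
    outputs = [expert(x) for expert in EXPERTS]
    return [sum(g[k] * outputs[k][c] for k in range(len(EXPERTS))) for c in range(2)]

def simple_vote(x):
    """Unweighted majority vote over the experts' predicted classes."""
    votes = [max(range(2), key=lambda c: expert(x)[c]) for expert in EXPERTS]
    return max(set(votes), key=votes.count)

print("gate weights at x=0.0:", gate(0.0))
print("weighted vote at x=0.6:", weighted_vote(0.6))
print("simple vote at x=0.6:", simple_vote(0.6))
```

In a trained mixture of experts the gate and the experts would be learned jointly; here the point is only that the gate acts as the coordinating module that decides how much each subsystem contributes for a given input.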
4.2 Two-body system
From the viewpoint of reductionism, a complex system consists of multiple simple systems. We should go from simple to complex when researching problems. Nonliving systems usually obey the second law of thermodynamics: the system always spontaneously tends toward equilibrium and disorder, and its entropy reaches a high value. The system spontaneously changes from order to disorder, while disorder does not spontaneously change to order, which is due to the irreversibility of the system and the stability of the equilibrium state. Living systems, however, behave in the opposite way. Biological evolution and social development tend toward greater order: from simple to complex and from low to high. Such systems are capable of spontaneously forming an orderly, stable structure. Life evolves from a simple structure to a complex form. Therefore, we can start from simple structures and gradually evolve them into complex systems. The two-body system is the most basic SLS. In the AI field, examples of two subsystems abound. Moreover, when dealing with many-body problems in statistical physics, the mean field approximation reduces the many-body problem to a two-body problem. Although the quality of the approximation depends on the specific problem, it is a proven method of solving complicated problems.
4.3 Universal two-body system concept
A system can be divided into two subsystems. In the AI field, for example, computer graphics, for which GPUs are widely used, is dual to image understanding. A Chinese-to-English translation system has a dual system for English-to-Chinese translation. Two such systems can form a larger system, which can be viewed both as dual learning and as a Synergetic Learning Systems. Among probabilistic statistical models, the Generative Adversarial Network (GAN) and the Variational Auto-Encoder (VAE) can be considered Synergetic Learning Systems with two subsystems.
4.3.1 Simple two-body system
The simplest two-model system is the auto-encoder. The encoder is system A, the decoder is system B, and the two parts are combined into a simple SLS.
In this system, the cooperation is serial: the input of the decoder depends on the output of the encoder. We can understand these two subsystems as being enslaved by the loss function. Minimizing the loss function (reconstruction error) is one way to achieve synergetic learning. The restricted Boltzmann machine can also be considered a simple SLS.
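The serial cooperation of encoder and decoder under one shared loss can be sketched with a tiny linear auto-encoder. The data set, the one-dimensional code, the initial weights and the learning rate below are all assumptions made for illustration; the sketch uses plain gradient descent rather than any method specific to the SLS work.

```python
# Hypothetical toy data: 2-D points lying on the line through (1, 2).
DATA = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def encode(a, x):
    # System A (encoder): 2-D input -> 1-D code
    return a[0] * x[0] + a[1] * x[1]

def decode(b, z):
    # System B (decoder): 1-D code -> 2-D reconstruction
    return (b[0] * z, b[1] * z)

def reconstruction_loss(a, b):
    """Mean squared reconstruction error shared by both subsystems."""
    total = 0.0
    for x in DATA:
        x_hat = decode(b, encode(a, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(DATA)

def train(steps=500, lr=0.05):
    a, b = [0.1, 0.1], [0.1, 0.1]          # assumed initial weights
    history = [reconstruction_loss(a, b)]
    for _ in range(steps):
        grad_a, grad_b = [0.0, 0.0], [0.0, 0.0]
        for x in DATA:
            z = encode(a, x)
            x_hat = decode(b, z)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            back = b[0] * err[0] + b[1] * err[1]   # error signal passed from decoder to encoder
            for k in range(2):
                grad_b[k] += 2.0 * err[k] * z / len(DATA)
                grad_a[k] += 2.0 * back * x[k] / len(DATA)
        a = [a[k] - lr * grad_a[k] for k in range(2)]
        b = [b[k] - lr * grad_b[k] for k in range(2)]
        history.append(reconstruction_loss(a, b))
    return a, b, history

a, b, history = train()
print("initial loss:", history[0])
print("final loss  :", history[-1])
```

Neither subsystem has its own objective: the encoder's gradient is computed from the decoder's error, so both are driven by the same reconstruction loss, which is the sense in which they are “enslaved” by it.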
A detailed discussion will take place in future work, which is primarily concerned with the interpretability of neural network systems based on statistical physics. For the joint distribution $p(v, h)$ and its marginal distribution, it is understood that $p(h \mid v)$ is one subsystem and $p(v \mid h)$ is another subsystem. In fact, many energy-based models are called Boltzmann machines. The original Boltzmann machine was composed of two types of models: models with and without latent variables. What is now called the Boltzmann machine is a model with latent variables Ian18.
Based on the free-energy principle, the SLS can learn based on energy. The state of a physical system can be studied once the action is defined, and the extremum is obtained by the variational method based on the principle of least action. In the SLS, we need to define the free energy of the system. After adopting statistical physics, the mathematical tools for describing the model are probability and statistics. Hopfield neural networks were early energy-based models Hopfield40 Hopfield41. In an energy-based learning model, the negative variational free energy is also known as the Evidence Lower Bound (ELBO) Ian18. Relevant optimization algorithms include the Markov Chain Monte Carlo (MCMC) algorithm Bishop2006 and the variational inference algorithm Ian18.
Examples in the equilibrium state (thermal equilibrium):
5.1 Variational inference algorithm Springer43 Beal2003
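In the paper titled “A Tutorial on Energy-Based Learning” by Yann LeCun et al. Lecun2006A13, the free energy is defined through the negative log-likelihood loss as follows:
$$L_{\mathrm{nll}}(W, Y^i, X^i) = E(W, Y^i, X^i) + F_\beta(W, \mathcal{Y}, X^i),$$
where $F_\beta$ is the free energy of the ensemble $\{E(W, y, X^i),\ y \in \mathcal{Y}\}$:
$$F_\beta(W, \mathcal{Y}, X^i) = \frac{1}{\beta} \log \int_{y \in \mathcal{Y}} \exp\left(-\beta E(W, y, X^i)\right) dy.$$
With this sign convention, $F_\beta$ is the logarithm of the partition function of the ensemble scaled by $1/\beta$.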
In the paper titled “Energy-based Generative Adversarial Network” by Y. LeCun et al. DBLP:46, the discriminator is described as an energy function (a negative evaluation function). That is, the smaller the function value is, the more realistic the data are. An auto-encoder (AE) is used as the discriminator (energy function). The energy function is defined as the reconstruction error of the discriminator:
$$D(x) = \lVert \mathrm{Dec}(\mathrm{Enc}(x)) - x \rVert.$$
According to the Boltzmann distribution, the partition function is $Z = \int e^{-\beta D(x)}\, dx$ and the distribution function is $p(x) = e^{-\beta D(x)}/Z$. The thermodynamic free energy can be expressed as follows:
$$F = -\frac{1}{\beta} \ln Z = D(x) + \frac{1}{\beta} \ln p(x).$$
From this, we can see that the free energy can be calculated once the distribution function is found. Bengio et al. turned the problem into the estimation of a probability distribution Kim2016 Kumar2019. Therefore, the key issue is to estimate the probability density function, and one algorithm for doing so is variational inference Springer43 Beal2003.
To solve the SLS by the variational inference (Variational Bayes) algorithm, we need to define a certain environment, in other words, to assume some conditions. Suppose we design a fully Bayesian model in which all parameters are given a prior distribution. The model has both parameters and latent variables. $\mathbf{Z}$ is used to represent the set of all latent variables and parameters. We use $\mathbf{X}$ to represent the set of all the observed variables. For example, we might have a set of $N$ independent, identically distributed data, where $\mathbf{X} = \{x_1, \dots, x_N\}$ and $\mathbf{Z} = \{z_1, \dots, z_N\}$.
One of our models represents the joint distribution $p(\mathbf{X}, \mathbf{Z})$, and our goal is to find an approximation of the posterior distribution $p(\mathbf{Z} \mid \mathbf{X})$ and the model evidence $p(\mathbf{X})$.
In general, the form of the posterior probability is very complicated, so we want to approximate $p(\mathbf{Z} \mid \mathbf{X})$ with a relatively simple and easily understood model $q(\mathbf{Z})$, namely, $q(\mathbf{Z}) \approx p(\mathbf{Z} \mid \mathbf{X})$. The other model is described by $q(\mathbf{Z})$. The log evidence decomposes as
$$\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p), \qquad \mathrm{KL}(q \,\|\, p) = -\int q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X})}{q(\mathbf{Z})}\, d\mathbf{Z}.$$
Factorized distributions: The local interaction between individuals in the system can produce a relatively stable behavior at the macro level; therefore, we can assume posterior independence. That is,
$$q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i).$$
Since the log evidence $\ln p(\mathbf{X})$ is fixed with respect to $q$, in order to minimize the Kullback-Leibler (KL) divergence, only $\mathcal{L}(q)$ needs to be maximized. By selecting an appropriate $q$, $\mathcal{L}(q)$ is easily calculated and evaluated. In this way, an approximate analytical expression of the posterior and a lower bound of the log evidence can be obtained; this lower bound is also called the variational free energy:
$$\mathcal{L}(q) = \int q(\mathbf{Z}) \ln p(\mathbf{X}, \mathbf{Z})\, d\mathbf{Z} - \int q(\mathbf{Z}) \ln q(\mathbf{Z})\, d\mathbf{Z}.$$
For this equation, the first term on the right-hand side is defined as the energy, and the second term is the Shannon entropy. A more detailed discussion of problem solving can be found in chapter 10 of Bishop’s book Bishop2006.
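The relationship between the log evidence, the lower bound and the KL divergence can be verified on a small discrete example. The sketch below assumes a hypothetical model with a single latent variable taking three values and one fixed observation; the prior and likelihood values are made up for illustration. It evaluates the lower bound as an energy term plus a Shannon entropy term for an arbitrary approximating distribution.

```python
import math

# Hypothetical discrete model: latent z in {0, 1, 2}, one fixed observation x.
PRIOR = [0.5, 0.3, 0.2]            # p(z)
LIKELIHOOD = [0.10, 0.40, 0.70]    # p(x | z) for the observed x

joint = [pz * px for pz, px in zip(PRIOR, LIKELIHOOD)]   # p(x, z)
evidence = sum(joint)                                     # p(x)
posterior = [j / evidence for j in joint]                 # p(z | x)

def elbo(q):
    """Lower bound: sum_z q ln p(x, z) (energy term) plus the Shannon entropy of q."""
    energy_term = sum(qz * math.log(j) for qz, j in zip(q, joint) if qz > 0)
    entropy_term = -sum(qz * math.log(qz) for qz in q if qz > 0)
    return energy_term + entropy_term

def kl(q, p):
    """KL divergence KL(q || p) between two discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q_uniform = [1.0 / 3.0] * 3
print("ln p(x)            :", math.log(evidence))
print("ELBO + KL (uniform):", elbo(q_uniform) + kl(q_uniform, posterior))
print("ELBO at posterior  :", elbo(posterior))
```

For any choice of the approximating distribution the bound plus the KL divergence equals the log evidence, and the bound becomes tight when the approximation equals the true posterior; maximizing the bound is therefore the same as minimizing the KL divergence.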
5.2 Approximate Synergetic Learning Algorithm
As discussed above, we can consider minimizing a Helmholtz free energy; this is equivalent to minimizing the expected negative log likelihood under the model.
We can also start by simply minimizing the KL divergence between the two models:
$$\mathrm{KL}(q \,\|\, p) = \int q(x, z) \ln \frac{q(x, z)}{p(x, z \mid \theta)}\, dx\, dz,$$
where $\theta$ is the group of parameters we try to learn. In most cases this is intractable, but if we carefully design proper models, it becomes tractable. The key point is that we can use gradient descent optimization or the pseudo-inverse learning (PIL) algorithm to train the neural network model. Consequently, we developed the approximate synergetic learning (ASL) algorithm, which is based on our previous work Guo2002 Guo2003reg Guo2008, to tackle complicated variational inference computation problems.
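In our ASL algorithm, the models are designed as follows:
$$p(x, z \mid \theta) = \alpha_z\, G(x; \mu_z, \Sigma_z), \qquad q(x, z) = q(z \mid x)\, q_h(x),$$
so that $p(x \mid \theta) = \sum_z \alpha_z\, G(x; \mu_z, \Sigma_z)$ is a Gaussian mixture distribution, and $q_h(x)$ is a nonparametric kernel estimation,
$$q_h(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i),$$
where $h$ is a kernel parameter.
With this definition, $q(z \mid x)$ is the posterior probability estimated in the E-step of the EM algorithm. This means
$$q(z \mid x) = \frac{\alpha_z\, G(x; \mu_z, \Sigma_z)}{\sum_{z'} \alpha_{z'}\, G(x; \mu_{z'}, \Sigma_{z'})}$$
(normalized in hidden space, $\sum_z q(z \mid x) = 1$).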
5.2.1 Cluster Number Selection
The details of this work are discussed in Ref. Guo2002 .
With data set $D = \{x_i\}_{i=1}^{N}$, we intend to cluster the data into several clusters.
Now we use a Gaussian kernel density for $q_h(x)$, that is, $q_h(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i)$ with a Gaussian kernel $K_h$.
The hyper-parameter $h$ plays the key role in the cluster number selection problem; if it is estimated with a gradient descent approach, it will eventually approach zero. By minimizing the KL divergence (free energy), we obtain a new equation for estimating the smoothing parameter $h$.
We use a Taylor expansion at $x = x_i$. When $h$ is small, we can omit the higher-order terms and keep only the first-order term.
Now $z$ stands for the Gaussian component label.
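The tendency of the smoothing parameter to collapse toward zero can be seen in a small experiment. The sketch below is an illustration under assumptions that are not taken from the source: one-dimensional synthetic data, a Gaussian kernel whose variance plays the role of the smoothing parameter, and the in-sample log-likelihood as the criterion that a naive gradient-based estimate would increase.

```python
import math
import random

def kde_log_likelihood(samples, h):
    """In-sample log-likelihood of a 1-D Gaussian kernel density with kernel variance h."""
    n = len(samples)
    norm = 1.0 / math.sqrt(2.0 * math.pi * h)
    total = 0.0
    for x in samples:
        density = sum(norm * math.exp(-(x - xi) ** 2 / (2.0 * h)) for xi in samples) / n
        total += math.log(density)
    return total

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(50)]   # hypothetical data set
bandwidths = [1.0, 1e-2, 1e-4, 1e-6]
log_likelihoods = {h: kde_log_likelihood(samples, h) for h in bandwidths}
for h in bandwidths:
    print("h =", h, " in-sample log-likelihood =", log_likelihoods[h])
```

Each sample sits exactly on the center of its own kernel, so shrinking the kernel width makes the density at the training points grow without bound. An estimate that only follows this likelihood is therefore driven to zero, which is why a different criterion for the smoothing parameter is needed.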
5.2.2 Regularization Parameter Estimation for FNN
The details of this work are discussed in Ref. Guo2003reg .
Given data set $D = \{x_i, y_i\}_{i=1}^{N}$, for supervised learning, $y_i$ can be an output label or a sample drawn from the regressed function.
Now the joint distribution in this work is designed as
where the kernel density function used the most is Gaussian kernel,
This model is under a very strong assumption ( are statistically independent ).
When error is Gaussian Bishop1996 ,
In our work Guo2003reg , the following designed model is considered:
where $f(x, \theta)$ is a function of the input variable $x$ and the parameter $\theta$.
In our design, $f(x, \theta)$ is a neural network mapping function, for example, a single-hidden-layer feedforward neural network, a deep neural network, or any other network architecture.
In our method, a Taylor expansion approximation is used; this method can be considered a “local quadratic approximation” (Eq. (17) in Ref. Guo2003reg).
We derived the loss function as follows:
The first term is the traditional sum-of-squares error function, the second term is the Jacobian regularization term, and the third term is the Hessian regularization term. The regularization parameter can be estimated with the following formula (Eq. (50) in Guo2003reg, without considering the Hessian term). We also assume that the prior is a uniform distribution and regard it as independent,
If we omit the second-order derivative term of the loss function, the loss function reduces to the first-order Tikhonov regularizer.
With the generalized linear network assumption applied only to the Jacobian regularization term, the weight decay regularizer is obtained.
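The last reduction can be checked directly for a linear model: its Jacobian with respect to the input is the weight matrix itself, so penalizing the squared Jacobian norm is the same as penalizing the squared weights. The sketch below uses a hypothetical weight matrix and inputs and a finite-difference Jacobian; it illustrates the general relationship and is not the estimation formula of the cited work.

```python
def linear_model(weights, bias, x):
    """Linear map f(x) = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def numerical_jacobian(f, x, eps=1e-5):
    """Central-difference estimate of the Jacobian d f_k / d x_j at the point x."""
    out_dim = len(f(x))
    jac = [[0.0] * len(x) for _ in range(out_dim)]
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for k in range(out_dim):
            jac[k][j] = (f_plus[k] - f_minus[k]) / (2.0 * eps)
    return jac

def jacobian_penalty(f, inputs):
    """Average squared Frobenius norm of the input Jacobian over the data."""
    total = 0.0
    for x in inputs:
        total += sum(v * v for row in numerical_jacobian(f, x) for v in row)
    return total / len(inputs)

# Hypothetical weights and inputs.
W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
B = [0.1, -0.2]
INPUTS = [[0.0, 1.0, 2.0], [1.0, -1.0, 0.5], [3.0, 0.2, -2.0]]

penalty = jacobian_penalty(lambda x: linear_model(W, B, x), INPUTS)
weight_decay = sum(w * w for row in W for w in row)
print("Jacobian penalty:", penalty)
print("sum of squared weights:", weight_decay)
```

The two numbers coincide up to finite-difference rounding. For a nonlinear network the Jacobian depends on the input, so the Jacobian penalty and weight decay are no longer identical.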
5.3 Numerical Method for Reaction-Diffusion Equation
For non-equilibrium statistics, such as dissipative structures, we use the reaction-diffusion equation to study the dynamic process of the SLS. When studying a differential dynamical system, we are concerned with the properties (mainly global properties) of the system and how they change under perturbation. In a Synergetic Learning Systems, the reaction-diffusion equation can also be used to describe the evolution dynamics of the system. If the SLS is designed as a differential dynamical system and attention is paid to the attractor subnetwork RollsComputational21 DomnisoruMembrane22, the reaction-diffusion equation can also be used to describe it. Therefore, the mathematical basis of artificial intelligence should also include differential equations.
Example: Turing’s reaction-diffusion equation Turing1952 .
Turing’s reaction-diffusion equation is one of his revolutionary discoveries in the field of natural science and is the mechanism for pattern formation Pearson1993 .
It is for this reason that Prigogine proposed the theory of dissipative systems, believing that “the energy exchange with the outside world is the fundamental reason for making the system orderly (contrary to the principle of entropy increase),” and founded the new discipline of non-equilibrium statistical mechanics Self-Organization26 .
Here, we interpret the two substances in Turing’s model as two types of information. The reaction of substances is similar to the processes of information fusion and production, and the diffusion of substances is equivalent to the process of information transmission. By modeling the SLS as an information processing system, the so-called Information Granular processing system, we will study the general system of artificial intelligence. However, information consists not only of “particles” but also exhibits “wave-particle duality”. Information particles are described by high-dimensional random variables, and the function for information waves is the density distribution function.
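For orientation, a two-component reaction-diffusion system is commonly written in the generic form below. The notation is an assumption introduced here for illustration and is not taken from the original text: u and v denote the two concentrations (here, the two types of information), D_u and D_v are diffusion coefficients, and f and g are reaction terms.

```latex
\begin{aligned}
\frac{\partial u}{\partial t} &= D_u \nabla^2 u + f(u, v), \\
\frac{\partial v}{\partial t} &= D_v \nabla^2 v + g(u, v).
\end{aligned}
```

Under the interpretation above, the Laplacian terms play the role of information transmission, while f and g play the role of information fusion and production.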
An information wave is different from an electromagnetic wave, which is the carrier of information, but the elliptic and parabolic equations of mathematical physics can be used to describe the process of information transmission. Methods to solve such equations depend on the difficulty of the problem. At present, most studies on partial differential equations in mathematics use the finite element method Zienkiewicz2013 and the finite difference method FerzigerFinite28 to obtain numerical solutions. In our earlier work, we used the heat diffusion equation to study the propagation of light beams in nonlinear media Guo1990 Guo1990b and the dynamics of interference filters GuoDynamics31 . The diffusion equation is a parabolic semi-linear partial differential equation and can also be used to study diffuse optical tomography Niu2008Improving32 .
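As a concrete illustration of the finite difference approach, the following minimal sketch integrates a one-dimensional reaction-diffusion system with an explicit (forward Euler) scheme and periodic boundaries. The Gray-Scott reaction terms are used as an assumed example; the function names, grid size, initial condition, and parameter values are hypothetical choices for illustration and are not the SLS model itself.

```python
import numpy as np

def laplacian_1d(field, dx):
    """Second-order central difference with periodic boundaries."""
    return (np.roll(field, 1) - 2.0 * field + np.roll(field, -1)) / dx**2

def gray_scott_step(u, v, du, dv, feed, kill, dt, dx):
    """One explicit (forward Euler) time step of the 1D Gray-Scott model."""
    uvv = u * v * v  # reaction term coupling the two components
    u_next = u + dt * (du * laplacian_1d(u, dx) - uvv + feed * (1.0 - u))
    v_next = v + dt * (dv * laplacian_1d(v, dx) + uvv - (feed + kill) * v)
    return u_next, v_next

def simulate(n=128, steps=500, du=0.16, dv=0.08, feed=0.035, kill=0.065,
             dt=1.0, dx=1.0):
    """Evolve a locally perturbed uniform state for a number of steps."""
    u = np.ones(n)
    v = np.zeros(n)
    mid = slice(n // 2 - 5, n // 2 + 5)
    u[mid], v[mid] = 0.5, 0.25  # small perturbation in the middle of the grid
    for _ in range(steps):
        u, v = gray_scott_step(u, v, du, dv, feed, kill, dt, dx)
    return u, v

u, v = simulate()
print(u.min(), u.max(), v.min(), v.max())
```

The explicit scheme is the simplest choice and is only stable when the time step is small relative to dx**2 divided by the diffusion coefficient; implicit or finite element schemes are more robust for stiff or higher-dimensional problems.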
If we transform the problem into a deterministic system learning problem, we can use a stochastic gradient or a pseudoinverse learning algorithm to optimize the SLS guo1995exact guo2001pseudoinverse guo2004pseudoinverse . In our 2003 paper Guo2003reg , we set one of the two models as a parametric model and the other as a nonparametric model. After variational inference, a second-order approximation was adopted to turn it into a deterministic system learning (optimization) problem.
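To make the non-gradient route more concrete, the sketch below solves the output weights of a single-hidden-layer network in closed form with the Moore-Penrose pseudoinverse. It is a simplified illustration of the general idea only, not the pseudoinverse learning algorithm of the cited papers: the fixed random hidden weights, the tanh activation, and the names `fit_output_weights` and `predict` are assumptions made here.

```python
import numpy as np

def fit_output_weights(X, Y, hidden_units=50, seed=0):
    """Least-squares output weights for a network with fixed hidden weights."""
    rng = np.random.default_rng(seed)
    # Assumption: hidden weights are drawn at random and kept fixed.
    W_in = rng.normal(size=(X.shape[1], hidden_units))
    H = np.tanh(X @ W_in)              # hidden-layer output matrix
    W_out = np.linalg.pinv(H) @ Y      # minimum-norm least-squares solution
    return W_in, W_out

def predict(X, W_in, W_out):
    """Forward pass through the fitted network."""
    return np.tanh(X @ W_in) @ W_out

rng = np.random.default_rng(1)
X_train = rng.normal(size=(10, 3))     # hypothetical training inputs
Y_train = rng.normal(size=(10, 2))     # hypothetical training targets

W_in, W_out = fit_output_weights(X_train, Y_train)
residual = np.linalg.norm(predict(X_train, W_in, W_out) - Y_train)
print(residual)
```

When the hidden-layer matrix has full row rank (more hidden units than samples), the training residual is close to zero without any iterative optimization; in practice some form of regularization is needed to avoid overfitting in that regime.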
This paper briefly introduces the SLS: concept, architecture and algorithm. Its main goals are to build a large, unified framework of the world model and to explore the road toward “intelligent mechanics.”
However, by constructing the SLS, can we achieve “intelligent mechanics” and develop grand, unified theories about artificial intelligence? We already know the significance of building a world model, but why study the grand, unified theories?
The world model described by LeCun is the world model of artificial intelligence, but the world we live in is a physical world. The intellectual activity of human beings belongs to the mental world. Therefore, the transition space when constructing an intelligence world model is the Cyber-Physics Model (CPM), which means that one of the vital cornerstones of artificial intelligence is physics, and physicists like to unify theories. These theories describe complex phenomena as a set of concepts, and the mathematical formulas that express these concepts can make very successful predictions.
In physics, the grand unified theories, supersymmetry, and M-theory are not only very beautiful ideas but also address many problems that cannot be solved by the Standard Model, thus attracting many theoretical physicists. However, no matter how wonderful they are, they will eventually require extensive experimental verification. We need to remember that a good scientific theory must meet the following three conditions:
1. It must be able to replicate all successful predictions of existing scientific theories;
2. It should be able to explain the latest experimental and observational data that existing scientific theories cannot explain;
3. Most importantly, it should also make predictions that can be tested.
Maxwell’s equations, General Relativity, and the Standard Model all conform to these three points.
Therefore, a theory must not only explain the present but also predict the future. Does our SLS also meet these three conditions? We believe that the SLS world model basically meets them in the field of artificial intelligence.
In physics, a theory needs to make predictions that can be tested, but in the field of artificial intelligence, a theory ought to be subversive and innovative. The innovation behind our synergetic learning theory is based on the principle of least action, the variation of free energy, and the reaction-diffusion equation. This equation describes the process of system evolution. During the evolution of the system, the system is considered a differential dynamics system, described by the reaction-diffusion equation. Therefore, the intelligent “thermodynamic system” can be seen as the processing of information particles. The main innovation is that in the SLS with two subsystems, the model is generated by the differential dynamic system and is determined by the reduction model. We used the differential dynamic system to generate the model and the reduction model to correspond to the discriminant model.
In system science, a common phrase is “complex world, simple rules.” One of Yann LeCun’s questions is whether or not there is a simple rule behind learning. If there is such a simple rule, what is it? At present, no one else has answered this question, but we propose an answer: the principle of free energy.
Why is the principle of free energy this simple rule? Brain development is a Darwinian process of “Evolution & Selection.” The evolution of the human brain is not only related to the brain: it is also the sum of human evolutionary results and the coevolution of Earth’s entire biological community. The study of bio-intelligence has never been focused on a single individual but, rather, on the evolution of all organisms in all populations in the history of the world, and it is a learning process with survival as the optimization goal. The objective function of natural selection is driven by the probability of survival, and nature always prefers the state with the least energy. The objective function in machine learning is set by human beings; it is a human choice and conforms to human laws. Therefore, the principle of free energy is the law we chose for the evolution of the artificial intelligence world model.
The innovation of the SLS theory we propose is as follows. Most other methods consider either cooperation or competition; we believe that cooperation and competition coexist harmoniously between groups in a system, and that this system is a unity of opposites. During the process of evolution, the relationship between groups is not static but fluctuates from cooperation to competition and from competition to cooperation. In a Synergetic Learning System, competition is also a synergy.
7 Summary and Perspective
In this paper, a method for addressing difficult challenges in AI research is proposed. We believe that under a given environment (data, boundary conditions), the solution to the differential equation can be predicted with a defined evolutionary path, and the final effect is predictable and controllable. However, one source of uncertainty is that, as the system evolves from simple to complex, emergence becomes a phenomenon that needs attention. The term “emergent phenomena,” as used by condensed matter physicists, refers to the complex behavior produced by the interaction between a large number of simple components. In living systems, emergent phenomena arise from the interactions between molecules and from how the molecules combine to form a structure or perform a function. Living systems evolve, adapt and change through interactions or information exchanges with other systems. Biological systems have feedback loops that make them difficult to analyze using standard differential equations. We do not know how to solve this problem. Ramin Golestanian, director of the Max Planck Institute for Dynamics and Self-Organization, said: “Physicists have studied many complex systems, but in terms of complexity and the number of degrees of freedom, living systems belong to a completely different category.” Therefore, the SLS is not a living system, but an artificial intelligence system. How to integrate emergent phenomena into the differential dynamic equation during the evolution process is a future research direction.
Future work will focus on the difficult problems of explaining the underlying mechanism and of building interpretable neural network models. It will also address the challenges of designing neural network topology.
The research work described in this paper was fully supported by the National Key Research and Development Program of China (No. 2018AAA0100203). Prof. Ping Guo and Qian Yin are the authors to whom all correspondence should be addressed.
- (1) Beal, M.J.: Variational algorithms for approximate Bayesian inference (2003)
- (2) Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK (1996)
- (3) Bishop, C.M.: Pattern Recognition and Machine Learning. Springer-Verlag New York (2006). Chapter 10: Approximate Inference
- (4) Domnisoru, C., Kinkhabwala, A.A., Tank, D.W.: Membrane potential dynamics of grid cells. Nature 495, 199–204 (2013)
- (5) Ferziger, J.H., Perić, M.: Finite Difference Methods. Springer, Berlin, Heidelberg (2002). In: Computational Methods for Fluid Dynamics.
- (6) Friston, K.: The free-energy principle: a unified brain theory? Nature Reviews Neuroscience 11(2), 127–138 (2010). DOI 10.1038/nrn2787. URL https://doi.org/10.1038/nrn2787
- (7) Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
- (8) Guo, A.: How does the brain work on the whole. the second China Systems Science Conference (2018). In Chinese
- (9) Guo, P.: Numerical solution of gaussian beam propagation in nonlinear gradient refractive media. Laser Technology 14(5) (1990). In Chinese
- (10) Guo, P.: Synergetic learning systems (I): Concept, architecture, and algorithms. Preprint, researchgate.net (2019). DOI 10.13140/RG.2.2.13681.12644. URL https://doi.org/10.13140/RG.2.2.13681.12644. The third China Systems Science Conference (CSSC2019), Changsha, May 18-19, 2019. (Chinese version)
- (11) Guo, P., Awwal, A.A.S., Chen, C.L.P.: Dynamics of a coupled double-cavity optical interference filter. Journal of Optics 46(1), 167–174 (1999)
- (12) Guo, P., Chen, C.L.P., Lyu, M.R.: Cluster number selection for a small set of samples using the bayesian ying-yang model. IEEE Trans. Neural Networks 13(3), 757–763 (2002). DOI 10.1109/TNN.2002.1000144. URL https://doi.org/10.1109/TNN.2002.1000144
- (13) Guo, P., Chen, C.L.P., Sun, Y.: An exact supervised learning for a three-layer supervised neural network. In: Proceedings of the International Conference on Neural Information Processing (ICONIP’95), vol. 2, pp. 1041–1044 (1995). URL https://www.researchgate.net/publication/318445413_An_Exact_Supervised_Learning_for_a_Three-Layer_Supervised_Neural_Network
- (14) Guo, P., Jia, Y., Lyu, M.R.: A study of regularized gaussian classifier in high-dimension small sample set case based on MDL principle with application to spectrum recognition. Pattern Recognit. 41(9), 2842–2854 (2008). DOI 10.1016/j.patcog.2008.02.004. URL https://doi.org/10.1016/j.patcog.2008.02.004
- (15) Guo, P., Lyu, M.R.: Pseudoinverse learning algorithm for feedforward neural networks. In: N.E. Mastorakis (ed.) Advances in Neural Networks and Applications, pp. 321–326. World Scientific and Engineering Society Press, Athens, Greece (2001). URL https://www.researchgate.net/publication/293477570_Pseudoinverse_learning_algorithm_for_feedforward_neural_networks
- (16) Guo, P., Lyu, M.R.: A pseudoinverse learning algorithm for feedforward neural networks with stacked generalization applications to software reliability growth data. Neurocomputing 56, 101–121 (2004). DOI https://doi.org/10.1016/S0925-2312(03)00385-0. URL http://www.sciencedirect.com/science/article/pii/S0925231203003850
- (17) Guo, P., Lyu, M.R., Chen, C.L.P.: Regularization parameter estimation for feedforward neural networks. IEEE Trans. Systems, Man, and Cybernetics, Part B 33(1), 35–44 (2003). DOI 10.1109/TSMCB.2003.808176. URL https://doi.org/10.1109/TSMCB.2003.808176
- (18) Guo, P., Sun, Y.G.: Gaussian beam propagation with nonlinear medium limiter. Acta Optica Sinica 10(12) (1990). In Chinese
- (19) Guo, P., Zhao, B.: Methodology for building synergetic learning system. Preprint, researchgate.net (2019). DOI 10.13140/RG.2.2.10146.07368. The third China Systems Science Conference (CSSC2019), Changsha, May 18-19, 2019 (Chinese version)
- (20) Haken, H.: The mystery of nature. ISBN: 9787532736379 (2005-01)
- (21) Haken, H.: Information and self-organization: Sichuan education publishing. ISBN: 9787540853112 (2010-4)
- (22) Haykin, S.O.: Neural Networks and Learning Machines, 3rd edn. Pearson Higher Ed (2011)
- (23) Hinton, G.E., Zemel, R.S.: Autoencoders, minimum description length and helmholtz free energy. In: J.D. Cowan, G. Tesauro, J. Alspector (eds.) Advances in Neural Information Processing Systems 6, [7th NIPS Conference, Denver, Colorado, USA, 1993], pp. 3–10. Morgan Kaufmann (1993). URL http://papers.nips.cc/paper/798-autoencoders-minimum-description-length-and-helmholtz-free-energy
- (24) Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79(8), 2554–2558 (1982). DOI 10.1073/pnas.79.8.2554. URL https://www.pnas.org/content/79/8/2554
- (25) Hopfield, J.J.: Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences 81(10), 3088–3092 (1984). DOI 10.1073/pnas.81.10.3088. URL https://www.pnas.org/content/81/10/3088
- (26) Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2019)
- (27) Kim, T., Bengio, Y.: Deep directed generative models with energy-based probability estimation. CoRR abs/1606.03439 (2016). URL http://arxiv.org/abs/1606.03439
- (28) Kumar, R., Goyal, A., Courville, A.C., Bengio, Y.: Maximum entropy generators for energy-based models. CoRR abs/1901.08508 (2019). URL http://arxiv.org/abs/1901.08508
- (29) LeCun, Y.: Learning world models: the next step towards AI. Keynote, the 27th IJCAI (2018)
- (30) LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
- (31) Lecun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.J.: A tutorial on energy-based learning. In: Predicting Structured Data. MIT Press (2006)
- (32) Marcus, G.: Deep learning: A critical appraisal. arXiv:1801.00631 [cs.AI] (2018). URL http://arxiv.org/abs/1801.00631
- (33) Nicolis, G., Prigogine, I.: Self-Organization in Non-Equilibrium Systems. Wiley, New York (1977)
- (34) Niu, H., Guo, P., Ji, L., Zhao, Q., Jiang, T.: Improving image quality of diffuse optical tomography with a projection-error-based adaptive regularization method. Optics Express 16(17), 12423–34 (2008)
- (35) Pearson, J.E.: Complex patterns in a simple system. Science 261(5118), 189 – 192 (1993). URL http://dx.doi.org/10.1126/science.261.5118.189
- (36) Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y.N., Glorot, X.: Higher order contractive auto-encoder. In: D. Gunopulos, T. Hofmann, D. Malerba, M. Vazirgiannis (eds.) Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part II, Lecture Notes in Computer Science, vol. 6912, pp. 645–660. Springer (2011). DOI 10.1007/978-3-642-23783-6_41. URL https://doi.org/10.1007/978-3-642-23783-6_41
- (37) Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: Explicit invariance during feature extraction. In: L. Getoor, T. Scheffer (eds.) Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 833–840. Omnipress (2011). URL https://icml.cc/2011/papers/455_icmlpaper.pdf
- (38) Rolls, E.T., Loh, M., Deco, G., Winterer, G.: Computational models of schizophrenia and dopamine modulation in the prefrontal cortex. Nature Reviews Neuroscience 9, 696–709 (2008)
- (39) Smídl, V., Quinn, A.: The Variational Bayes Method in Signal Processing. Springer (2006)
- (40) Tsien, H.S., Yu, J.Y., Dai, R.W.: A new field of science - an open complex giant system and its methodology. Chinese Nature (1990). In Chinese
- (41) Turing, A.M.: The chemical basis of morphogenesis. Philosophical Transactions of the Royal Society of London, Series B 237(641), 37–72 (1952). URL https://doi.org/10.2307/92463
- (42) Waldrop, M.M.: News feature: What are the limits of deep learning? Proceedings of the National Academy of Sciences 116(4), 1074–1077 (2019)
- (43) Wikipedia: Reaction diffusion system. wikipedia.org (2016). URL https://en.wikipedia.org/wiki/Reaction_diffusion_system
- (44) Xin, X., Guo, P.: A survey on the past, present and development trend of the basic theory of artificial intelligence. ChinaXiv:201905.00013 (2019). URL https://doi.org/10.12074/201905.00013. (in Chinese)
- (45) Xu, Z.: Grasping the focus of next-generation information technology. People’s Daily (2019). In Chinese
- (46) Ye, Q.X.: Introduction to reaction diffusion equation. Practice and understanding of mathematics pp. 48–56 (1984). In Chinese
- (47) Zhao, J.J., Mathieu, M., LeCun, Y.: Energy-based generative adversarial network. CoRR abs/1609.03126 (2016). URL http://arxiv.org/abs/1609.03126
- (48) Zienkiewicz, O.C., Taylor, R.L., Zhu, J.Z.: The Finite Element Method: Its Basis and Fundamentals, p. 756. Butterworth-Heinemann, Oxford (2013). DOI 10.1016/B978-1-85617-633-0.00019-8. URL https://doi.org/10.1016/B978-1-85617-633-0.00019-8