Eigen Artificial Neural Networks

by Francisco Yepes Barrera, et al.

This work has its origin in intuitive physical and statistical considerations. An artificial neural network is treated as a physical system, composed of a conservative vector force field. The derived scalar potential is a measure of the potential energy of the network, a function of the distance between predictions and targets. Starting from some analogies with wave mechanics, the description of the system is justified with an eigenvalue equation that is a variant of the Schrödinger equation, in which the potential is defined by the mutual information between inputs and targets. The weights and parameters of the network, as well as those of the state function, are varied so as to minimize energy, using an equivalent of the variational theorem of wave mechanics. The minimum energy thus obtained implies the principle of minimum mutual information (MinMI). We also propose a definition of the work produced by the force field to bring a network from an arbitrary probability distribution to the potential-constrained system. At the end of the discussion we present a recursive procedure that allows us to refine the state function and bypass some initial assumptions. The results demonstrate how the minimization of energy effectively leads to a decrease in the average error between network predictions and targets.




I Introduction

This paper analyzes the problem of optimizing artificial neural networks (ANNs), i.e. the problem of finding functions, dependent on matrices of input data and parameters, that realize an optimal mapping between the inputs and a given target.

The starting point of this article is made up of some well-known theoretical elements:

  1. The training of an artificial neural network consists in the minimization of some error function between the output of the network and the target. In the best case it identifies the global minimum of the error; in general it finds local minima. The set of all minima forms a discrete set of values.

  2. The passage from a prior to a posterior or conditional probability, that is, the observation or acquisition of additional knowledge about the data, implies a collapse of the function that describes the system: the conditional probability calculated with Bayes' theorem leads to narrower and more localized distributions than the prior ones.


  3. Starting from the formulation of the mean square error produced by an artificial neural network, expressed through the conditional probability of the target given the input and the marginal probability of the input, and considering a set of targets whose distributions are independent, it can be shown that the error of any network function is bounded from below by the error of the expected value, or conditional average, of the target given the input, with equality valid only at the optimum. In practice, any trial function leads to a quadratic deviation with respect to the target greater than that generated by the optimal function, corresponding to the absolute minimum of the error, since this optimal function represents the conditional average of the target, as demonstrated by the known result [2].
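The third point can be checked numerically. The following is a minimal sketch, not taken from the paper: it builds a hypothetical toy dataset with a discrete input (the values in `true_mean` and the noise level are arbitrary assumptions) and compares the mean squared error of the empirical conditional average with that of randomly perturbed trial predictors.

```python
import random

random.seed(0)

# Hypothetical toy data: for each discrete input x, the targets are noisy
# samples around a conditional mean (values chosen only for illustration).
true_mean = {0: -1.0, 1: 0.5, 2: 2.0}
data = [(x, true_mean[x] + random.gauss(0.0, 0.3))
        for x in true_mean for _ in range(2000)]


def mse(predict):
    """Mean squared error of a predictor (a dict x -> y) over the toy data."""
    return sum((predict[x] - t) ** 2 for x, t in data) / len(data)


# Empirical conditional average of the target for each input value.
cond_avg = {
    x: sum(t for xi, t in data if xi == x) / sum(1 for xi, _ in data if xi == x)
    for x in true_mean
}

best = mse(cond_avg)

# Trial predictors: the conditional average plus an arbitrary perturbation.
trials = [{x: cond_avg[x] + random.uniform(-0.5, 0.5) for x in cond_avg}
          for _ in range(100)]
smallest_gap = min(mse(p) - best for p in trials)
print(best, smallest_gap)
```

Under these assumptions every trial predictor has an error at least as large as that of the conditional average, which is the behavior the inequality in point 3 describes.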


These three points can be directly related to three theoretical elements at the base of wave mechanics [8]:

  1. Any physical system described by the Schrödinger equation and constrained by a scalar potential leads to a quantization of energy values, which constitute a discrete set of real values.

  2. A quantum-mechanical system is formed by the superposition of a series of states described by the Schrödinger equation, corresponding to as many eigenvalues. The observation of the system causes the collapse of the wave function onto one of the states, and it is only possible to calculate the probability of obtaining each of the different eigenvalues.

  3. When it is not possible to obtain analytically the true wave function and the true energy of a quantum-mechanical system, it is possible to use trial functions, with their corresponding energies, dependent on a set of parameters. In this case we can find an approximation to the true wave function and energy by varying the parameters and taking into account the variational theorem, according to which the expected energy of any trial function is never lower than the true ground-state energy.

Regarding point 3, we can consider the condition (I.1) as an equivalent of the variational theorem for artificial neural networks.

II Treatment of the optimization of artificial neural networks as an eigenvalue problem

The analogies highlighted in Section I suggest the possibility of dealing with the problem of optimizing artificial neural networks as a physical system. Attempts at analysis using models from mathematical physics are not new [9]. The analogies are studied in this work to understand whether it is possible to model the ANN optimization problem with eigenvalue equations, as happens in the physical systems modeled by the Schrödinger equation. This model allows us to define the energy of the network, a concept already used in some types of neural networks, such as Hopfield networks, in which Lyapunov or energy functions can be derived for networks of binary elements, allowing a complete characterization of their dynamics. We will generalize the concept of energy to any type of ANN.

Suppose we can define a conservative force generated by the set of targets, represented in the input space by a vector field with the same dimensionality as the input. In this case we have a scalar function, called the potential, which depends exclusively on the position and which, writing $\mathbf{F}$ for the force and $V$ for the potential, is defined through

$$\mathbf{F} = -\nabla V \qquad \text{(II.1)}$$

which implies that the potential of the force at a point is proportional to the potential energy possessed by an object at that point due to the presence of the force. The negative sign in equation (II.1) means that the force is directed towards the target, where the force is maximum and the potential is minimal, so the target generates an attractive force acting on the bodies immersed in the field, represented by the average predictions of the network, with an intensity proportional to a function of the distance between predictions and targets.

Equation (I.2) highlights how, at the optimum, the output of an artificial neural network is an approximation to the conditional averages or expected values of the targets given the input. Both inputs and targets are given by the problem, with average values that do not vary over time. We can therefore hypothesize a stationary system and a time-independent eigenvalue equation, having the same structure as the Schrödinger equation


with the state function of the system (network), a scalar potential, the network energy, a multiplicative constant and a variational parameter. The variational parameter seems necessary since equation (II.2) does not arise from a true physical system, so the relative values of the first and second terms of the left-hand side are unknown. Preliminary calculations show that for some problems the value of can be very small compared to the first term. We will consider that and is dimensionless.

Equation (II.2) implements a parametric model for ANNs in which the optimization consists in minimizing, on average, the energy of the network, a function of the inputs and targets, modeled by appropriate probability densities and a set of variational parameters. The working hypothesis is that the minimization of energy through a parameter-dependent trial function that makes use of the variational theorem (I.1) leads, using an appropriate potential, to a reduction of the error committed by the network in the prediction of the target.

III The potential

A function that satisfies all the requirements to be used as a potential is the mutual information [14], which is a non-negative quantity. In this case, the minimization of energy through a variational state function that satisfies equation (II.2) implies the principle of minimum mutual information (MinMI) [4, 6, 7, 15], equivalent to the principle of maximum entropy (MaxEnt) [3, 10, 11, 13]. The scalar potential depends only on the input vector and, for multiple targets, becomes (footnote 1: when not specified, we implicitly assume that integrals extend over all space)


Equation (III.1) assumes a superposition principle, similar to the one valid for the electric field, in which the total potential is given by the sum of the potentials with respect to each of the targets of the problem.

Considering that the network provides an approximation to the target given by a deterministic function plus noise, and considering that the error is normally distributed with mean zero, the conditional probability of the target given the input can be written as a Gaussian centered on the network output [2] (equation (III.2)).

Note that the width of this Gaussian is the standard deviation of the error. To be able to integrate the differential equation (II.2) we will consider the vector of these standard deviations constant. We will see at the end of the discussion that it is possible to obtain an expression for it dependent on the input, which allows us to derive a more precise description of the potential.

We also write unconditional probabilities for inputs and targets as Gaussians to simplify the mathematical treatment


Considering the absence of correlation between the input variables, the probability is reduced to


with each factor being the Gaussian with the mean and variance of the corresponding component of the input vector. Equations (III.3) and (III.4) introduce into the model a statistical description of the problem starting from the observed data, through a set of constants.

Integrating equation (III.1) over the targets gives




Mutual information in the potential (III.1) is expressed in nats. We will call the units of energy calculated from (II.2) nats of energy or enats.

It is known that a linear combination of Gaussians can approximate an arbitrary function. Using a finite basis of Gaussians we can write the following expression for the network output




with a bias term for each output unit. Equations (III.7) and (III.8) propose a neural network model of the RBF (Radial Basis Function) type, which contains a single hidden layer and simplifies the calculation given the complexity of the model.
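As an illustration of this kind of network, the sketch below evaluates a single-output Gaussian RBF model. The exact functional form of equations (III.7) and (III.8) is not reproduced here: the form `bias + sum_p weights[p] * exp(-widths[p] * ||x - centers[p]||^2)` is an assumption, and the function name and the numerical values are hypothetical.

```python
import math


def rbf_output(x, centers, widths, weights, bias):
    """Output of a single-output Gaussian RBF network (assumed form).

    y(x) = bias + sum_p weights[p] * exp(-widths[p] * ||x - centers[p]||^2)
    """
    y = bias
    for center, width, weight in zip(centers, widths, weights):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        y += weight * math.exp(-width * sq_dist)
    return y


# Hypothetical network with two basis functions on a 4-dimensional input.
centers = [[0.0, 0.0, 0.0, 0.0], [0.5, -0.5, 0.5, -0.5]]
widths = [1.0, 2.0]
weights = [0.8, -0.3]
bias = 0.1

print(rbf_output([0.0, 0.0, 0.0, 0.0], centers, widths, weights, bias))
```

Each hidden unit responds most strongly when the input coincides with its centroid and decays with distance, so the output is a weighted sum of localized bumps plus the bias.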

Taking into account equations (II.1), (III.5) and (III.7), the components of the force are given by

In physical conservative fields, work is defined as the negative of the difference between the potential energy of a body subject to the forces of the field and that possessed by the body at a reference point. In some types of central force fields, as in the electrostatic or gravitational cases, the reference point is located at an infinite distance from the source where, given the dependence of the potential on distance, the potential energy is zero.

Since in the discrete case the mutual information is bounded above by the minimum of the marginal entropies (footnote 2: we use a lower-case h for the discrete entropy to distinguish it from the upper-case H, which in this work is used as the symbol of the Hamiltonian operator and of the corresponding integrals), given that the distribution with maximum entropy is the uniform one, and that the reference point against which to calculate the potential difference is arbitrary, we can propose the following definition of work (footnote 3: the treatment in the text considers the discrete case, since the differential entropy can be negative)


For , equation (III.9) expresses the work, in enats, carried out by the forces of the field in passing from a neural network that makes uniformly distributed predictions to a network that realizes an approximation to the density
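The two facts used in the argument above (mutual information bounded by the smaller marginal entropy, and the uniform distribution having maximum entropy) can be verified on a small discrete example. This is a sketch with a hypothetical joint probability table; it does not implement the definition of work in equation (III.9).

```python
import math


def entropy(p):
    """Discrete Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def mutual_information(joint):
    """Mutual information in nats of a joint probability table joint[i][j]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pij * math.log(pij / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pij in enumerate(row) if pij > 0)


# Hypothetical joint distribution of a discretized input and target.
joint = [[0.30, 0.05, 0.05],
         [0.05, 0.25, 0.05],
         [0.05, 0.05, 0.15]]

mi = mutual_information(joint)
h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
h_uniform = entropy([1 / 3] * 3)  # equals ln(3), the maximum for 3 states

print(mi, min(h_x, h_y), h_uniform)
```

All quantities are in nats because natural logarithms are used, matching the units adopted in the text for the potential.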


IV The state equation

A dimensional analysis of the potential (III.5) shows that the term is dimensionless. Thus, the units of the potential are determined by the factor . To maintain dimensional coherence in equation (II.2) we multiply the first term of the left-hand side by the factor , which (footnote 4: this setting makes it possible to incorporate into the value of , but in what follows we will leave it explicitly indicated) cannot be a constant factor independent of the individual components of the input vector, since in general every component has its own units and its own variance.

Given the variational parameter in the second term of the left-hand side of equation (II.2), we can without loss of generality multiply the first term by , obtaining . The Hamiltonian operator is real, linear and Hermitian, and has the same structure as that used in the Schrödinger equation. Hermiticity stems from the condition that the average value of the energy is real. (Footnote 5: in this article we only use real functions, so the hermiticity condition reduces to the symmetry of the corresponding matrices.) Its two terms represent the operators related respectively to the kinetic and potential components of the Hamiltonian.

The final state equation is


Making an analogy with wave mechanics, we can say that equation (IV.1) describes the motion of a particle subject to the potential (III.5). The factor multiplying the first term, as happens in quantum mechanics with the Planck constant, has the role of a scale factor: the phenomenon described by equation (IV.1) is relevant in the range of variance of each single component of the input vector.

We discussed the role of the potential operator: its variation in the input space implies a force that is directed towards the target, where the potential is minimum and the force is maximum. The kinetic operator contains the divergence of a gradient in the input space and represents the flow density of the gradient, being a measure of the deviation of the state function at a point with respect to the average of the surrounding points. Its role in equation (IV.1) is to introduce information about curvature. In neural networks a similar role is played by the Hessian matrix, calculated in the space of weights, in conventional second-order optimization techniques. (Footnote 6: in that case, as we will later show, the second derivatives are calculated in the space of the weights and not in the input space.)

Starting from the expected energy value obtained from equation (IV.1) (footnote 7: although all the functions used in this work are real, we make their complex conjugates explicit in the equations, as is usual in the wave mechanics formulation)


assuming a finite basis for the trial function


with the basis functions developed in a similar way to what we did for in the equations (III.7) and (III.8)


and considering the coefficients independent of each other, the Rayleigh-Ritz method leads to the linear system




To obtain a nontrivial solution the determinant of the coefficients has to be zero


which leads to as many energies as the size of the basis (IV.3). These energy values represent upper limits to the first true energies of the system. Substituting each energy into (IV.5) allows us to calculate the coefficients of the state function for the corresponding state. The lowest energy value represents the global optimum of the problem, or fundamental state, which leads, under the hypotheses of this article, to the global minimum of the error of the neural network in the prediction of the target, while the remaining eigenvalues can be interpreted as local minima. It can be shown that the eigenfunctions obtained in this way form an orthogonal set. The variational method we have discussed has a general character and can be applied, in principle, to artificial neural networks of any kind, not bound to any specific functional form for the network output.
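The Rayleigh-Ritz step amounts to a generalized symmetric eigenvalue problem. The sketch below assumes the usual form of the secular equations, (H - E S) c = 0, with a symmetric matrix H of Hamiltonian integrals and a symmetric positive-definite matrix S of overlap integrals; the two 3x3 matrices are hypothetical placeholders, not values computed from the integrals (IV.6) and (IV.7), and NumPy is used only for the linear algebra.

```python
import numpy as np


def solve_secular(H, S):
    """Solve (H - E S) c = 0 for symmetric H and positive-definite S.

    Returns the energies in ascending order and the coefficient vectors
    as the columns of a matrix.
    """
    L = np.linalg.cholesky(S)          # S = L L^T
    L_inv = np.linalg.inv(L)
    H_ortho = L_inv @ H @ L_inv.T      # ordinary symmetric eigenproblem
    energies, vecs = np.linalg.eigh(H_ortho)
    coeffs = L_inv.T @ vecs            # back-transform: c = L^-T v
    return energies, coeffs


# Hypothetical matrices for a basis of dimension 3.
H = np.array([[1.0, 0.2, 0.1],
              [0.2, 2.0, 0.3],
              [0.1, 0.3, 3.0]])
S = np.array([[1.0, 0.4, 0.1],
              [0.4, 1.0, 0.4],
              [0.1, 0.4, 1.0]])

energies, coeffs = solve_secular(H, S)
print(energies)                  # one energy per basis function
print(coeffs.T @ S @ coeffs)     # close to the identity matrix
```

The basis of dimension 3 yields three energies, each of which makes the determinant of H - E S vanish, and the coefficient vectors are orthonormal with respect to S, which corresponds to the orthogonality of the eigenfunctions mentioned above.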


The proposed model implies a change of paradigm with respect to some known methods of optimizing neural networks, such as gradient descent, which carry out a search in the parameter space, in particular the set of weights, guided by some form of error of the neural network. In this article the variables of the problem, and of the search in the corresponding space, are the inputs together with a set of variational parameters.

Using the equations (III.7) and (III.8) and taking into account the constancy of , the integrals (IV.6) and (IV.7) have the following expressions


The number of variational parameters of the model, , is


The energies obtained from the determinant (IV.8) allow us to obtain a system of equations resulting from the minimum condition


The system (IV.10) is implicit in , , and must be solved in an iterative way, as depends on which in turn is a function of .

V Interpretation of the state function

The model we have proposed contains two main weaknesses: 1) the assumed normality of the marginal densities of inputs and targets; 2) the constancy of the vector of standard deviations. The following discussion tries to resolve the second point.

Similarly to wave mechanics we can interpret the squared modulus of the state function as a probability. In this case, the Laplacian operator in equation (IV.1) models a probability flow. Given that we have obtained the state function from a statistical description of the known targets, we can assume that its squared modulus represents the conditional probability of the input given the target, subject to the set of parameters


Equation (V.1) is related to the conditional probability of the target given the input through Bayes' theorem


Since we considered the targets independent, using the expressions (III.2) and (V.1) into (V.2), separating variables and integrating over , assuming that at the optimum is satisfied the condition , we have

which leads to an implicit equation in . For networks with a single output, , we have


with , and functions of . Equation (V.3) allows in principle an iterative procedure which, starting from the constant initial value, which leads to an initial state function through the resolution of the system (IV.5), makes it possible to calculate successive corrections.

VI Results

Solving the system (IV.5) requires considerable computational power. For this reason the minimum energy was calculated in an approximate way with a genetic algorithm (GA), on an Intel 6-Core i7-8750H MT MCP processor. The equations have been treated symbolically with the computer algebra system Maxima.


The test problem comes from the Statlib repository (footnote 9: http://lib.stat.cmu.edu/datasets/). It is a synthetic dataset made up of 3848 records, generated by David Coleman, referred to for convenience as POLLEN, which represents geometric and physical characteristics of pollen grain samples. It consists of 5 variables: the first three are the lengths in the directions x (ridge), y (nub) and z (crack), the fourth is the weight and the fifth is the density, the latter being the target of the problem. In our model they represent, respectively, the four components of the input and the target. This problem was chosen because the data were generated with Gaussian distributions with low correlations, and it is therefore close to the initial assumptions of the model for the marginal densities of inputs and targets. Tables I and II show the general statistics of the dataset.

Var        Original data             Normalized data           Skewness   Kurtosis
ridge      -3.637e-03    6.398       0.284536    0.041489      -0.130     -0.057
nub         1.597e-04    5.186       0.312775   -0.026460       0.072     -0.311
crack       3.103e-03    7.875       0.258654    0.014287      -0.057     -0.158
weight      4.237e-03   10.004       0.285505   -0.024498       0.109     -0.163
density     1.662e-04    3.144       0.046501    0.274707       0.110      0.192
Table I: POLLEN dataset, general features. The table shows the means and standard deviations of the original and normalized data, together with skewness and kurtosis (the reference value for normality is 0). Rows follow the order in which the variables are listed in the text.
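           ridge    nub      crack    weight   density
ridge       1.00     0.13    -0.13    -0.90    -0.57
nub         0.13     1.00     0.08    -0.17     0.33
crack      -0.13     0.08     1.00     0.27    -0.15
weight     -0.90    -0.17     0.27     1.00     0.24
density    -0.57     0.33    -0.15     0.24     1.00
Table II: POLLEN dataset, correlation matrix. Rows and columns follow the order in which the variables are listed in the text.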

The characteristics of the genetic algorithm have been described in a previous paper [1]. This is a steady-state GA, with a generation gap of one or two, depending on the operator applied. The population has binary coding and implements a fitness sharing mechanism [5] to allow speciation and avoid premature convergence, according to the equations


where the sharing function depends on the diversity between two individuals, measured by their Hamming distance, and on the niche radius within which individuals are considered similar. Niche sharing applies a correction to the energy based on the similarity between the individual and the rest of the population. The more similar the individual is to the rest, the greater the correction, which penalizes the energy in equation (VI.1) since we are minimizing.
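A common form of fitness sharing can be sketched as follows. The paper's own equations are not reproduced here, so both the sharing function `1 - (d / radius)**alpha` and the way the correction enters (multiplying a positive energy by the niche count) are assumptions made for illustration; all function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def sharing(distance, radius, alpha=1.0):
    """Sharing function: 1 - (d / radius)**alpha inside the niche, 0 outside.

    alpha = 1 gives the triangular sharing function.
    """
    if distance < radius:
        return 1.0 - (distance / radius) ** alpha
    return 0.0


def shared_energies(population, energies, radius, alpha=1.0):
    """Assumed correction: multiply each positive energy by its niche count."""
    corrected = []
    for individual, energy in zip(population, energies):
        niche_count = sum(sharing(hamming(individual, other), radius, alpha)
                          for other in population)
        corrected.append(energy * niche_count)
    return corrected


population = ["0000", "0001", "1111"]
energies = [1.0, 1.0, 1.0]
print(shared_energies(population, energies, radius=2))
```

In this toy population the first two individuals are neighbors and are penalized, while the isolated third individual keeps its energy, which is the speciation pressure that sharing is meant to create.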

The decoding of the genotype uses the Gray code to avoid discontinuities in the binary representation. The transformation between the binary and Gray representations of each bit, considering numbers whose bits are numbered from right to left, with the most significant bit on the left, is given by

with the XOR operator.
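The standard binary-reflected Gray code can be sketched as below. Bits are stored here as a list with the most significant bit first, which is a representation chosen for the example; the helper names are hypothetical.

```python
def to_bits(n, width):
    """Binary representation of n as a list of bits, most significant first."""
    return [(n >> k) & 1 for k in reversed(range(width))]


def binary_to_gray(bits):
    """Each Gray bit is the XOR of a binary bit with its more significant neighbor."""
    return [bits[0]] + [bits[i - 1] ^ bits[i] for i in range(1, len(bits))]


def gray_to_binary(gray):
    """Each binary bit is the XOR of the Gray bit with the previously decoded bit."""
    bits = [gray[0]]
    for g in gray[1:]:
        bits.append(bits[-1] ^ g)
    return bits


for n in (7, 8):
    print(n, to_bits(n, 4), binary_to_gray(to_bits(n, 4)))
```

In plain binary, 7 and 8 differ in all four bits, whereas their Gray codes differ in a single bit, which is the discontinuity the encoding is meant to avoid.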

The GA uses four operators: crossover, mutation, uniform crossover and internal crossover, and performs a search in the space of the computed energies according to equation (IV.2), but simultaneously carries out a search in the space of the operators through the use of two additional bits in the genotype of each individual of the population. This allows a dynamic choice of the probabilities of each operator at each moment of the calculation, according to the fraction of elements of the population that encode for each of the four possibilities. The initial population is randomly generated.
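The operator-selection mechanism can be illustrated with a short sketch. The position of the two operator bits (here the last two bits of each genotype) and the mapping from bit codes to operators are assumptions made for the example.

```python
import random
from collections import Counter

# Hypothetical mapping from the two-bit code to the four operators.
OPERATORS = ("crossover", "mutation", "uniform crossover", "internal crossover")


def operator_probabilities(population):
    """Fraction of individuals encoding each operator in their last two bits."""
    counts = Counter(int(individual[-2:], 2) for individual in population)
    total = len(population)
    return {op: counts.get(code, 0) / total for code, op in enumerate(OPERATORS)}


def choose_operator(population, rng=random):
    """Draw an operator with probability equal to its share of the population."""
    probs = operator_probabilities(population)
    return rng.choices(list(probs), weights=list(probs.values()))[0]


population = ["000000", "000001", "000001", "000011"]
print(operator_probabilities(population))
print(choose_operator(population))
```

Because the operator bits are part of the genotype, they are subject to selection like any other gene, so operators that tend to produce low-energy individuals become more frequent over time.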

The procedure for assessing an individual consists of the following steps:

  1. the values of the parameters are generated within certain prefixed ranges through the application of one of the operators;

  2. the network output, , is generated for each element of the dataset. This set of values allows us to calculate ;

  3. the elements of the matrices and are computed by means of the integrals (IV.6) and (IV.7);

  4. the determinant (IV.8) is calculated;

  5. the system (IV.5) is solved.

The result is the energy value and the coefficients of the state function.

Before the execution of the tests, the dataset was preprocessed by normalizing inputs and targets within the range [-1:1]. Fifteen calculations were conducted, each consisting of 10 concurrent processes sharing the best solution found. In each calculation the set of lowest-energy solutions found in the previous calculations was introduced. The values of the parameters were varied within certain pre-established ranges, identified through a preliminary test campaign. The reference ranges are shown in Table III. Table IV shows the reference values of the parameters used in the genetic algorithm.

Variable Value
C 1
D 8
N 4
P 10
w [0:4]
Table III: Values of the model and range of variability of the variational parameters and of the data and of the dataset
Variable Value
Population 100
Point mutation probability 0.005
Chromosome length 2174 bits
Calculation cycles [5000:6000]
Table IV: Reference values of the genetic algorithm
P \ N
3.4013120 -0.764254 0.487696 0.828128 -0.309709
2.8162860 0.722523 -0.829504 0.884602 -0.440280
3.9395730 -0.977770 -0.860840 -0.385800 0.694424
1.1004576 0.563102 0.138821 0.801561 0.928019
0.1681290 0.829662 0.126966 -0.345363 -0.546116
0.8617840 -0.621750 -0.468552 -0.684078 0.462411
0.2275410 -0.969937 0.801109 -0.634125 -0.436867
3.2266080 0.424373 0.804881 0.896162 -0.218709
3.4019790 -0.756753 -0.440174 -0.688590 0.101702
3.6097500 0.937970 0.675499 0.980483 -0.884515
Table V: Results of the genetic algorithm for the parameters of the basis functions of the network
P \ C
Table VI: Results of the genetic algorithm for network weights
P \ N
-1.282546 0.110080 -0.391103 0.5176760 -0.572174 -0.302664
-1.438387 0.120303 -0.668461 0.857889 -0.470183 -0.626793
-0.419667 0.186642 -0.679090 0.564954 -0.457736 -0.676878
1.357698 0.100000 -0.552879 0.782675 -0.552136 -0.562775
0.233801 0.100029 -0.134230 0.284860 -0.598276 0.211275
-0.559200 0.194570 -0.661250 0.672405 -0.335728 -0.348150
0.027306 0.100185 0.173759 -0.846671 -0.848185 -0.514533
2.092270 0.159235 -0.635414 0.671673 -0.428168 -0.496624
Table VII: Results of the genetic algorithm for the coefficients and parameters of the basis functions of the state function
Variable Value
(train) 26.064396
(train) 2.748593
(train) -0.448009
(train) 0.194494
(test) 25.537031
(test) 2.375004
(test) -0.419022
(test) 0.195101
(train) 0.945%
(test) 0.951%
(train) 0.627771 enats
(test) 0.621814 enats
Table VIII: Results of the genetic algorithm for the network with lower energy
Figure VI.1: Evolution of the genetic algorithm that produced the solution with lower energy

For each element of the population, in addition to the energy value, the squared error percentage of the neural network [1, 12] was calculated, a quantity that depends on the number of dataset records and on the normalization interval used.
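As a sketch of this measure, the function below follows the squared error percentage as defined in the Proben1 report cited as [12], assuming that this is the definition used here: 100 times the width of the output interval, divided by the number of outputs times the number of records, times the summed squared error. The default interval [-1, 1] matches the normalization described in the text; the function name is hypothetical.

```python
def squared_error_percentage(outputs, targets, o_min=-1.0, o_max=1.0):
    """Squared error percentage in the style of Proben1 (assumed definition).

    outputs, targets: lists of records, each a list with one value per output.
    o_min, o_max: bounds of the interval used to normalize the targets.
    """
    n_records = len(outputs)
    n_outputs = len(outputs[0])
    total = sum((o - t) ** 2
                for out, tgt in zip(outputs, targets)
                for o, t in zip(out, tgt))
    return 100.0 * (o_max - o_min) / (n_outputs * n_records) * total


# Two records of a single-output network (hypothetical values).
print(squared_error_percentage([[0.1], [0.0]], [[0.0], [0.0]]))
```

The scaling by the interval width and by the number of outputs and records makes the value comparable across problems with different output ranges and dataset sizes.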

Some of the parameters in the Table IV deserve some observation:

  • implies the so-called triangular niche sharing;

  • has a considerable influence on the results and was chosen for each of the 10 concurrent processes of each calculation according to the criterion , with the process number. This avoids arbitrary choices, since can be dependent on the nature of the problem;

  • was chosen in the interval [0:4], which includes the value given by a heuristic RBF rule which proposes for the standard deviation of the associated Gaussian, , the reference value , with the average value between the centroids of the functions of equation (III.8). Considering an estimate of we get . The range is equivalent to . The same criterion has also been used for the vector .

The data were divided into two parts, a training set (2886 records) and a test set (962 records). Technically, this subdivision is not necessary, since the two data partitions are generated by the same distribution and the characterization of the problem in the model is given exclusively by the values of the constants introduced through equations (III.3) and (III.4).

The variational parameters of the best solution are reported in Tables V, VI and VII. The final results of the calculation, including error and energy, are shown in Table VIII. Figure VI.1 shows the evolution of error and energy (lowest and population-average values) in the calculation that generated the lowest-energy solution, and shows how the minimization of energy leads to a decrease of the error committed by the network in the target prediction. The final error value for the training (0.945%) and testing (0.951%) partitions is particularly significant given the low number of basis functions used in the definition of the network and of the state function.

VII Conclusions

In this work we have developed a model for artificial neural networks based on an analogy with a quantum-mechanical physical system. One of the advantages of this approach is the possibility, potentially, of using wave mechanics techniques in the study of such networks. An example is the generalized Hellmann-Feynman theorem

whose validity needs to be demonstrated in this context, but whose use seems justified since it can be demonstrated by assuming exclusively the normalization of the state function and the hermiticity of the Hamiltonian. Its applicability could help in the calculation of the system (IV.10).

It is necessary to carry out a systematic test campaign to verify the results obtained. These tests are currently underway and will be the subject of a subsequent work. The preliminary results obtained on a set of selected problems coming from the Statlib and UCI (footnote 10: https://archive.ics.uci.edu/ml/index.php) repositories confirm the validity of the model.


References

  [1] Francisco Yepes Barrera. Búsqueda de la estructura óptima de redes neurales con algoritmos genéticos y simulated annealing. Verificación con el benchmark Proben1. Inteligencia Artificial, Revista Iberoamericana de IA, 11(34):41–61, 2007.
  [2] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
  [3] Alex Finnegan and Jun S. Song. Maximum entropy methods for extracting the learned features of deep neural networks. PLOS Computational Biology, 13(10):e1005836, October 2017. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1005836.
  [4] Jeffrey D. Fitzgerald, Lawrence C. Sincich, and Tatyana O. Sharpee. Minimal Models of Multidimensional Computations. PLOS Computational Biology, 7(3):e1001111, March 2011. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1001111.
  [5] Lan Gao and Youwei Hu. Multi-target matching based on niching genetic algorithm. IJCSNS International Journal of Computer Science and Network Security, 6(7A), July 2006.
  [6] Amir Globerson and Naftali Tishby. The Minimum Information Principle for Discriminative Learning. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, pages 193–200, Arlington, Virginia, United States, 2004. AUAI Press. ISBN 978-0-9749039-0-3. Event place: Banff, Canada.
  [7] Amir Globerson, Eran Stark, Eilon Vaadia, and Naftali Tishby. The minimum information principle and its application to neural code analysis. Proceedings of the National Academy of Sciences of the United States of America, PNAS, 106(9), March 2009.
  [8] Ira N. Levine. Quantum Chemistry. Pearson, Boston, seventh edition, 2014. ISBN 978-0-321-80345-0.
  [9] Javier R. Movellan and James L. McClelland. Learning Continuous Probability Distributions with Symmetric Diffusion Networks. Cognitive Science, 17(4):463–496, October 1993. ISSN 03640213. doi: 10.1207/s15516709cog1704-1.
  [10] Joseph C. Park and Salahalddin T. Abusalah. Maximum Entropy: A Special Case of Minimum Cross-entropy Applied to Nonlinear Estimation by an Artificial Neural Network. Complex Systems, 11, 1997.
  [11] Carlos A. L. Pires and Rui A. P. Perdigao. Minimum Mutual Information and Non-Gaussianity Through the Maximum Entropy Method: Theory and Properties. Entropy, 14(6):1103–1126, June 2012. ISSN 1099-4300. doi: 10.3390/e14061103.
  [12] Lutz Prechelt. Proben1 - a set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, 76128 Karlsruhe, Germany, September 1994.
  [13] Zhang Xiaodong. Evaluation model and simulation of basketball teaching quality based on maximum entropy neural network. Page 5, 2014.
  [14] Dongxin Xu. Energy, entropy and information potential for neural computation. PhD thesis, University of Florida, 1999.
  [15] Yan Zhang, Mete Ozay, Zhun Sun, and Takayuki Okatani. Information Potential Auto-Encoders. arXiv:1706.04635 [cs, math, stat], June 2017.