I Introduction
This paper analyzes the problem of optimizing artificial neural networks (ANNs), i.e., the problem of finding functions, dependent on matrices of input data and parameters, such that, given a target, they realize an optimal mapping between input and target.
The starting point of this article is made up of some well-known theoretical elements:

The training of an artificial neural network consists in the minimization of some error function between the output of the network and the target. In the best case it identifies the global minimum of the error; in general it finds local minima. Together, these minima form a discrete set of values.

The passage from a prior to a posterior or conditional probability, that is, the observation or acquisition of additional knowledge about the data, implies a collapse of the function that describes the system: the conditional probability calculated with Bayes' theorem leads to probability distributions that are narrower and more localized than the prior ones [2].
Starting from the formulation of the mean square error produced by an artificial neural network, and considering a set of targets t_k whose distributions are independent, with p(t_k|x) the conditional probability of t_k given x and p(x) the marginal probability of x, it can be shown that

⟨(y_k(x; w) − t_k)²⟩ ≥ ⟨(⟨t_k|x⟩ − t_k)²⟩ (I.1)

with ⟨t_k|x⟩ = ∫ t_k p(t_k|x) dt_k the expected value or conditional average of t_k given x, the equality being valid only at the optimum. In practice, any trial function leads to a quadratic deviation with respect to t_k greater than that generated by the optimal function y_k(x; w) = ⟨t_k|x⟩, corresponding to the absolute minimum of the error, since this represents the conditional average of the target, as demonstrated by the known result [2]

E = ½ Σ_k ∫ {y_k(x; w) − ⟨t_k|x⟩}² p(x) dx + ½ Σ_k ∫ {⟨t_k²|x⟩ − ⟨t_k|x⟩²} p(x) dx (I.2)
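To make the inequality concrete, here is a small numerical illustration (not from the paper; the synthetic function and noise level are our own choices) showing that the conditional average of the target achieves a lower mean squared error than any other trial function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: t = f(x) + Gaussian noise, so E[t|x] = f(x).
x = rng.uniform(-1.0, 1.0, size=100_000)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

y_optimal = np.sin(np.pi * x)           # the conditional average E[t|x]
y_trial = np.sin(np.pi * x) + 0.1 * x   # any other trial function

# The optimal predictor's MSE approaches the irreducible noise variance (0.3^2),
# while every other trial function incurs an extra quadratic penalty.
mse_optimal = float(np.mean((y_optimal - t) ** 2))
mse_trial = float(np.mean((y_trial - t) ** 2))
```

Any deviation from the conditional average only adds to the residual noise variance, which is exactly the content of (I.2).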
These three points can be directly related to three theoretical elements at the base of wave mechanics [8]:

Any physical system described by the Schrödinger equation and constrained by a scalar potential leads to a quantization of energy values, which constitute a discrete set of real values.

A quantum-mechanical system is formed by the superposition of a series of states described by the Schrödinger equation, corresponding to as many eigenvalues. The observation of the system causes the collapse of the wave function onto one of the states, it being only possible to calculate the probability of obtaining the different eigenvalues.

When it is not possible to analytically obtain the true wave function and the true energy of a quantum-mechanical system, it is possible to use trial functions φ, with eigenvalues W, dependent on a set of parameters. In this case we can find an approximation to the true wave function and energy by varying the parameters, taking into account the variational theorem

W = ∫ φ* Ĥ φ dτ / ∫ φ* φ dτ ≥ E₀

with E₀ the true ground-state energy.
Regarding point 3, we can consider the condition (I.1) as an equivalent of the variational theorem for artificial neural networks.
II Treatment of the optimization of artificial neural networks as an eigenvalue problem
The analogies highlighted in Section I suggest the possibility of treating the problem of optimizing artificial neural networks as a physical system. Analysis attempts using models from mathematical physics are not new [9]. The analogies are studied in this work to understand whether it is possible to model the ANN optimization problem with eigenvalue equations, as happens in the physical systems modeled by the Schrödinger equation. This model allows us to define the energy of the network, a concept already used in some types of neural networks, such as Hopfield networks, for which Lyapunov or energy functions can be derived for networks of binary elements, allowing a complete characterization of their dynamics. We will generalize the concept of energy to any type of ANN.
Suppose we can define a conservative force generated by the set of targets, represented in the input space as a vector field of dimensionality equal to that of the input. In this case we have a scalar function V, called potential, which depends exclusively on the position and which is defined as
F(x) = −∇V(x) (II.1)
which implies that the potential of the force at a point is proportional to the potential energy possessed by an object at that point due to the presence of the force. The negative sign in equation (II.1) means that the force is directed towards the target, where the force is maximum and the potential is minimal; the target therefore generates an attractive force, acting on the bodies immersed in the field, represented by the average predictions of the network, with an intensity proportional to a function of the distance between prediction and target.
Equation (I.2) highlights how, at the optimum, the output of an artificial neural network is an approximation to the conditional averages or expected values of the targets given the input. Both the input and target distributions are given by the problem, with average values that do not vary over time. We can therefore hypothesize a stationary system and a time-independent eigenvalue equation, having the same structure as the Schrödinger equation
−C∇²ψ(x) + λV(x)ψ(x) = Eψ(x) (II.2)

with ψ the state function of the system (network), V a scalar potential, E the network energy, C a multiplicative constant and λ a variational parameter. The parameter λ seems necessary since equation (II.2) does not arise from a true physical system, so the relative magnitudes of the first and second terms on the left-hand side are unknown. Preliminary calculations show that for some problems the second term can be very small compared to the first. We will consider that C = 1 and that λ is dimensionless.
Equation (II.2) implements a parametric model for ANNs in which the optimization consists in minimizing, on average, the energy of the network, a function of the inputs and targets, modeled by appropriate probability densities and a set of variational parameters. The working hypothesis is that the minimization of the energy through a parameter-dependent trial function that makes use of the variational theorem (I.1) leads, using an appropriate potential, to a reduction of the error committed by the network in the prediction of the target.

III The potential
A function that satisfies all the requirements set out above for use as a potential is the mutual information [14], which is a positive quantity. In this case, the minimization of the energy through a variational state function that satisfies equation (II.2) implies the principle of minimum mutual information (MinMI) [4, 6, 7, 15], equivalent to the principle of maximum entropy (MaxEnt) [3, 10, 11, 13]. The scalar potential depends only on the vector x and for multiple targets becomes^1

(III.1)

^1 When not specified, we will implicitly assume that integrals extend to all space.
Equation (III.1) assumes a superposition principle, similar to the one valid for the electric field, in which the total potential is given by the sum of the potentials with respect to each of the targets of the problem.
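Under the Gaussian assumptions introduced below, the mutual information between two jointly Gaussian variables has the closed form I = −½ ln(1 − ρ²) nats. The sketch below (our illustration, not the paper's full expression for the potential) shows the kind of positive, superposable quantity that (III.1) sums over the targets:

```python
import math

def gaussian_mutual_information(rho: float) -> float:
    """Mutual information, in nats, of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)

# Superposition principle: the total potential is the sum of the (positive)
# mutual-information contributions of each target.
def total_potential(rhos):
    return sum(gaussian_mutual_information(r) for r in rhos)
```

The quantity is zero for independent variables and grows without bound as the correlation approaches one, consistent with an attractive potential well centered on the targets.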
Considering that the network provides an approximation to the target given by a deterministic function plus a noise term, and considering that the error is normally distributed with mean zero, the conditional probability of the target given the input can be written as [2]

(III.2)
Note that the width parameter appearing in (III.2) is the standard deviation of the noise. To be able to integrate the differential equation (II.2) we will consider this vector constant; we will see at the end of the discussion that it is possible to obtain an expression for it dependent on x, which allows a more precise description of the potential to be derived. We also write the unconditional probabilities for inputs and targets as Gaussians, to simplify the mathematical treatment

(III.3)
Considering the absence of correlation between the input variables, the probability p(x) reduces to a product of one-dimensional Gaussians

(III.4)

each with the mean and variance relative to the corresponding component of the vector x. Equations (III.3) and (III.4) introduce into the model a statistical description of the problem, starting from the observed data, through the set of constants (the means and standard deviations of inputs and targets). The integration of equation (III.1) over the targets gives
(III.5) 
with
(III.6) 
The mutual information in the potential (III.1) is expressed in nats. We will call the units of energy calculated from (II.2) nats of energy, or enats.
It is known that a linear combination of Gaussians can approximate an arbitrary function. Using a basis of fixed dimension we can write the following expression for the network output

(III.7)

with

(III.8)

and a bias term for the output unit. Equations (III.7) and (III.8) propose a neural network model of RBF (Radial Basis Function) type, which contains a single hidden layer and facilitates the calculation, given the complexity of the model.
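As a concrete sketch, an RBF model of the kind proposed by (III.7) and (III.8) can be written in a few lines; the centers, widths, weights and bias below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def rbf_output(x, mu, sigma, w, b):
    """y(x) = sum_j w_j * exp(-||x - mu_j||^2 / (2 sigma_j^2)) + b."""
    d2 = np.sum((x[:, None, :] - mu[None, :, :]) ** 2, axis=2)  # (N, J) squared distances
    phi = np.exp(-d2 / (2.0 * sigma ** 2))                      # Gaussian basis functions
    return phi @ w + b

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 2))        # 5 input points in 2 dimensions
mu = rng.normal(size=(3, 2))       # 3 Gaussian centers
sigma = np.array([1.0, 0.5, 2.0])  # one width per basis function
w = np.array([0.2, -0.4, 0.7])     # output weights
y = rbf_output(x, mu, sigma, w, b=0.1)
```

The single hidden layer makes the output linear in the weights, which is what keeps the later integrals tractable.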
Taking into account equations (II.1), (III.5) and (III.7), the components of the force are given by
In physical conservative fields, the work W is defined as minus the difference between the potential energy of a body subject to the forces of the field and that possessed by the body at a reference point. In some types of central force fields, as in the electrostatic or gravitational cases, the reference point is located at an infinite distance from the source, where, given the dependence of the potential on the inverse of the distance, the potential energy is zero.
Since in the discrete case the mutual information is bounded above by the minimum of the marginal entropies,^2 given that the distribution with maximum entropy is the uniform one, and that the reference point against which to calculate the potential difference is arbitrary, we can propose the following definition of work^3

^2 We use lower-case h for the discrete entropy to distinguish it from H, which in this work is used as the symbol of the Hamiltonian operator and of its matrix elements.
^3 The treatment in the text considers the discrete case, since the differential entropy can be negative.
(III.9) 
Equation (III.9) expresses the work, in enats, carried out by the forces of the field to pass from a neural network that realizes uniformly distributed predictions to a network that realizes an approximation to the conditional density of the targets.

IV The state equation
A dimensional analysis of the potential (III.5) shows that the mutual-information term is dimensionless; thus, the units of the potential are determined by the prefactor. To maintain dimensional coherence in equation (II.2) we multiply the first term on the left-hand side by a scale factor,^4 which cannot be a constant independent of the single components of x, since in general every component has its own units and its own variance.

^4 This setting makes it possible to incorporate C into the scale factor, but in what follows we will leave it explicitly indicated.
Given the variational parameter λ in the second term on the left-hand side of equation (II.2), we can, without loss of generality, multiply the first term by the inverse of λ. The Hamiltonian operator is real, linear and Hermitian, and has the same structure as that used in the Schrödinger equation. Hermiticity stems from the condition that the average value of the energy be a real value.^5 The kinetic and potential operators represent, respectively, the kinetic and potential components of the Hamiltonian.

^5 In this article we only use real functions, so the hermiticity condition reduces to the symmetry of the H and S matrices.
The final state equation is
(IV.1) 
Wanting to make an analogy with wave mechanics, we can say that equation (IV.1) describes the motion of a particle subject to the potential (III.5). The variance, as happens in quantum mechanics with the Planck constant, has the role of a scale factor: the phenomenon described by equation (IV.1) is relevant on the scale of the variance of each single component of the vector x.
We have discussed the role of the potential operator: its variation in the input space implies a force that is directed towards the target, where the potential is minimum and the force maximum. The Laplacian operator contains the divergence of a gradient in the input space and represents the flux density of the gradient, being a measure of the deviation of the state function at a point with respect to the average of the surrounding points. Its role in equation (IV.1) is to introduce information about curvature. In neural networks a similar role is played by the Hessian matrix, calculated in the space of the weights, in conventional second-order optimization techniques.^6

^6 In this case, as we will later show, the second derivatives are calculated in the space of the weights and not in the input space.
Starting from the expected energy value obtained from equation (IV.1)^7

^7 Although all the functions used in this work are real, we will make their complex conjugates explicit in the equations, as is usual in the wave-mechanics formulation.
(IV.2) 
assuming a basis of given dimension for the trial function
(IV.3) 
with the basis functions developed in a manner similar to what we did for the network output in equations (III.7) and (III.8)
(IV.4) 
and considering the coefficients independent of each other, the Rayleigh–Ritz method leads to the linear system
(IV.5) 
with
(IV.6) 
(IV.7) 
To obtain a nontrivial solution, the determinant of the coefficients has to be zero
(IV.8) 
which leads to a number of energies equal to the dimension of the basis (IV.3). These energy values represent an upper limit to the first true energies of the system. The substitution of each eigenvalue in (IV.5) allows the coefficients of the corresponding state to be calculated. The lowest eigenvalue represents the global optimum of the problem, or fundamental state, which leads, under the hypotheses of this article, to the global minimum of the error of the neural network in the prediction of the target, while the remaining eigenvalues can be interpreted as local minima. It can be shown that the eigenfunctions obtained in this way form an orthogonal set. The variational method we have discussed has a general character and can be applied, in principle, to artificial neural networks of any kind, not bound to any specific functional form for the output.

The proposed model assumes a change of paradigm with respect to some known methods of optimizing neural networks, such as gradient descent, which carry out a search in the parameter space, in particular over the set of weights, through the minimization of an expression for the error of the neural network. In this article the variables of the problem, and the space in which the search takes place, are the inputs, together with a set of variational parameters.
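The secular equation det(H − E S) = 0 of (IV.8) is a generalized symmetric eigenvalue problem, which can be solved numerically. The sketch below uses random symmetric H and positive-definite S matrices as stand-ins for the model's actual integrals (IV.6) and (IV.7):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
H = (A + A.T) / 2.0                  # symmetric Hamiltonian matrix (placeholder)
B = rng.normal(size=(4, 4))
S = B @ B.T + 4.0 * np.eye(4)        # symmetric positive-definite overlap matrix

# Reduce H c = E S c to a standard symmetric problem via Cholesky S = L L^T.
L = np.linalg.cholesky(S)
Linv = np.linalg.inv(L)
energies, yv = np.linalg.eigh(Linv @ H @ Linv.T)  # eigenvalues in ascending order
coeffs = Linv.T @ yv                              # back-transformed eigenvectors

ground_state_energy = energies[0]    # lowest eigenvalue: the global optimum
```

The lowest eigenvalue plays the role of the fundamental state; the remaining columns of `coeffs` give the states interpreted above as local minima.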
V Interpretation of the state function
The model we have proposed contains two main weaknesses: 1) the assumed normality of the marginal densities; 2) the constancy of the noise vector. The following discussion tries to resolve the second point.
Similarly to wave mechanics, we can interpret the square modulus of the state function as a probability. In this case, the Laplacian operator in equation (IV.1) models a probability flow. Given that we have obtained the potential from a statistical description of the known targets, we can assume that the squared state function represents the conditional probability of the targets given the input, subject to the set of variational parameters.
Since we considered the targets independent, using expressions (III.2) and (V.1) in (V.2), separating variables and integrating over the targets, and assuming that at the optimum the required condition is satisfied, we obtain an implicit equation in the noise standard deviation. For networks with a single output we have
(V.3) 
with coefficients that are functions of x. Equation (V.3) allows in principle an iterative procedure which, starting from the constant initial value that leads to a first state function, permits successive corrections to be calculated through the resolution of the system (IV.5).
VI Results
The resolution of the system (IV.5) requires considerable computational power. For this reason the minimum energy was calculated in an approximate way with a genetic algorithm (GA), on an Intel 6-core i7-8750H MT MCP processor. The equations have been treated symbolically with the computer algebra system Maxima.^8
^8 http://maxima.sourceforge.net/

The test problem comes from the StatLib repository.^9 It is a synthetic dataset made up of 3848 records, generated by David Coleman, referred to for convenience as POLLEN, which represents geometric and physical characteristics of pollen grain samples. It consists of 5 variables: the first three are the lengths in the directions x (ridge), y (nub) and z (crack), the fourth is the weight and the fifth is the density, the latter being the target of the problem. In our model the first four represent the inputs and the fifth the target. The choice of this problem lies in the fact that the data were generated with Gaussian distributions with low correlations, and it is therefore close to the initial assumptions of the model. Tables I and II show the general statistics of the dataset.

^9 http://lib.stat.cmu.edu/datasets/

Table I. General statistics of the dataset.

Original data  Normalized data  Skewness  Kurtosis

Var  /  /  /  /
x1  3.637e03  6.398  0.284536  0.041489  0.130  0.057
x2  1.597e04  5.186  0.312775  0.026460  0.072  0.311
x3  3.103e03  7.875  0.258654  0.014287  0.057  0.158
x4  4.237e03  10.004  0.285505  0.024498  0.109  0.163
t   1.662e04  3.144  0.046501  0.274707  0.110  0.192
Table II. Correlation matrix of the dataset.

      x1    x2    x3    x4    t
x1   1.00  0.13  0.13  0.90  0.57
x2   0.13  1.00  0.08  0.17  0.33
x3   0.13  0.08  1.00  0.27  0.15
x4   0.90  0.17  0.27  1.00  0.24
t    0.57  0.33  0.15  0.24  1.00
The characteristics of the genetic algorithm have been described in a previous paper [1]. It is a steady-state GA, with a generation gap of one or two, depending on the operator applied. The population has binary coding and implements a fitness-sharing mechanism [5] to allow speciation and avoid premature convergence, according to the equations
(VI.1) 
where the sharing function is a function of the diversity between individuals, measured by the Hamming distance, and of the niche radius within which individuals are considered similar. Niche sharing implements a correction to the energy, calculated on the basis of the similarity between an individual and the rest of the population: the more similar it is, the greater the correction factor, penalizing the energy in equation (VI.1), since we are minimizing.
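A minimal sketch of the sharing mechanism, under common assumptions (a triangular sharing function and positive energies scaled by the niche count; the paper's exact equation (VI.1) may differ in detail):

```python
def sharing_value(d: int, sigma_share: float, alpha: float = 1.0) -> float:
    """Triangular sharing: 1 - (d / sigma_share)^alpha inside the niche, 0 outside."""
    return 1.0 - (d / sigma_share) ** alpha if d < sigma_share else 0.0

def hamming(a: str, b: str) -> int:
    """Hamming distance between two binary genotypes of equal length."""
    return sum(x != y for x, y in zip(a, b))

def shared_energies(population, energies, sigma_share):
    """Penalize each individual's (positive) energy by its niche count (minimization)."""
    shared = []
    for genome, energy in zip(population, energies):
        niche_count = sum(sharing_value(hamming(genome, other), sigma_share)
                          for other in population)
        shared.append(energy * niche_count)
    return shared
```

Crowded regions of the genotype space see their energies inflated, steering the minimization toward under-explored niches.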
The decoding of the genotype implements Gray coding to avoid discontinuities in the binary representation. The transformation between the binary representation b and the Gray representation g, for the i-th bit, considering numbers composed of n bits numbered from right to left with the most significant bit on the left, is given by

g_n = b_n,   g_i = b_{i+1} ⊕ b_i   (i < n)

with ⊕ the XOR operator.
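The conversion can be sketched as follows (a standard formulation; function names are ours):

```python
def binary_to_gray(n: int) -> int:
    """g = b XOR (b >> 1): consecutive integers differ in exactly one Gray bit."""
    return n ^ (n >> 1)

def gray_to_binary(g: int) -> int:
    """Invert the encoding by cascading XORs from the most significant bit down."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b
```

For example, 5 (binary 101) maps to Gray 111, and decoding 111 recovers 5; adjacent integers always differ by a single bit in the Gray representation, which is the property that removes the discontinuities.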
The GA uses four operators: crossover, mutation, uniform crossover and internal crossover, and performs a search in the space of the computed energies according to equation (IV.2), but simultaneously realizes a search in the space of the operators through the use of two additional bits in the genotype of each individual of the population. This allows a dynamic choice of the probability of each operator at each moment of the calculation, according to the fraction of elements of the population that encode each of the four possibilities. The initial population is randomly generated.
The procedure for assessing an individual consists of the following steps:

the values of the parameters are generated within certain prefixed ranges through the application of one of the operators;

the network output is generated for each element of the dataset; this set of values permits the calculation of the quantities needed in the following steps;

the determinant (IV.8) is calculated;

the system (IV.5) is solved.
The result is the energy value and the coefficients of the state function.
Before the execution of the tests, a preprocessing of the dataset was performed, normalizing inputs and targets within the range [-1:1]. 15 calculations were conducted, each consisting of 10 concurrent processes sharing the best solution found. In each calculation the set of lowest-energy solutions found in the previous calculations was introduced. The values of the parameters were varied within certain pre-established ranges, identified through a preliminary test campaign. The reference ranges are shown in Table III. Table IV shows the reference values of the parameters used in the genetic algorithm.
Table III. Reference ranges of the parameters.

Variable  Value

C  1 
D  8 
N  4 
P  10 
[1:  
[1:1]  
[0:4]  
w  [0:4] 
[0:1] 
Table IV. Reference values of the parameters of the genetic algorithm.

Variable  Value

Population  100 
Point mutation probability  0.005 
[0:0.9]  
1  
Chromosome length  2174 bits 
Calculation cycles  [5000:6000] 
Table V. Variational parameters of the best solution.

P \ N

3.4013120  0.764254  0.487696  0.828128  0.309709  
2.8162860  0.722523  0.829504  0.884602  0.440280  
3.9395730  0.977770  0.860840  0.385800  0.694424  
1.1004576  0.563102  0.138821  0.801561  0.928019  
0.1681290  0.829662  0.126966  0.345363  0.546116  
0.8617840  0.621750  0.468552  0.684078  0.462411  
0.2275410  0.969937  0.801109  0.634125  0.436867  
3.2266080  0.424373  0.804881  0.896162  0.218709  
3.4019790  0.756753  0.440174  0.688590  0.101702  
3.6097500  0.937970  0.675499  0.980483  0.884515 
Table VI. Variational parameters of the best solution (continued).

P \ C

0.262204  
0.654650  
0.270344  
0.479204  
0.291074  
1.255660  
0.103152  
2.376997  
0.905523  
0.479075  
1.621334 
Table VII. Variational parameters of the best solution (continued).

P \ N

1.282546  0.110080  0.391103  0.5176760  0.572174  0.302664  
1.438387  0.120303  0.668461  0.857889  0.470183  0.626793  
0.419667  0.186642  0.679090  0.564954  0.457736  0.676878  
1.357698  0.100000  0.552879  0.782675  0.552136  0.562775  
0.233801  0.100029  0.134230  0.284860  0.598276  0.211275  
0.559200  0.194570  0.661250  0.672405  0.335728  0.348150  
0.027306  0.100185  0.173759  0.846671  0.848185  0.514533  
2.092270  0.159235  0.635414  0.671673  0.428168  0.496624 
Table VIII. Final results of the calculation.

Variable  Value

(train)  26.064396 
(train)  2.748593 
(train)  0.448009 
(train)  0.194494 
(test)  25.537031 
(test)  2.375004 
(test)  0.419022 
(test)  0.195101 
1  
(train)  0.945% 
(test)  0.951% 
(train)  0.627771 enats 
(test)  0.621814 enats 
For each element of the population, in addition to the energy value, the squared error percentage of the neural network [1, 12] has been calculated, a function of the number of dataset records and of the normalization interval used.
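A hedged sketch of this metric, assuming a Proben1-style normalization by the number of records and the squared width of the normalization interval (the paper's exact constants are not reproduced here):

```python
def squared_error_percentage(outputs, targets, lo=-1.0, hi=1.0):
    """100 / (N * (hi - lo)^2) * sum of squared errors over the dataset."""
    n = len(outputs)
    sse = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    return 100.0 * sse / (n * (hi - lo) ** 2)
```

With data normalized to [-1, 1], a network that is maximally wrong on every record scores 100%, and a perfect network scores 0%.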
Some of the parameters in the Table IV deserve some observation:

implies the so-called triangular niche sharing;

the niche radius has a considerable influence on the results and was chosen for each of the 10 concurrent processes of each calculation as a function of the process number. This avoids arbitrary choices, since the appropriate radius can depend on the nature of the problem;

the width parameter was chosen in the interval [0:4], which includes the value given by a heuristic RBF rule proposing, for the standard deviation of the associated Gaussian, a reference value proportional to the average distance between the centroids of the functions of equation (III.8). Considering an estimate of this distance, the chosen range covers the resulting reference value. The same criterion has also been used for the corresponding vector of the trial function.
The data was divided into two parts, a training set (2886 records) and a testing set (962 records). Technically, this subdivision is not necessary, since the two data partitions are generated by the same distribution and the characterization of the problem in the model is given exclusively by the values of the statistical constants.
The variational parameters of the best solution are reported in Tables V, VI and VII. The final results of the calculation, including error and energy, are shown in Table VIII. Figure VI.1 shows the evolution of error and energy (both the minimum and the population average) for the calculation that generated the lowest-energy solution, and shows how the minimization of the energy leads to a decrease of the error committed by the network in the prediction of the target. The final error values for the training (0.945%) and testing (0.951%) partitions are particularly significant given the low number of basis functions used in the definition of the state function and the network output.
VII Conclusions
In this work we have developed a model for artificial neural networks based on an analogy with a physical quantum-mechanical system. One of the advantages of this approach is the possibility, potentially, of using wave-mechanics techniques in their study. An example is the generalized Hellmann-Feynman theorem, whose validity needs to be demonstrated in this context, but whose use seems justified since it can be demonstrated by assuming exclusively the normality of the state function and the hermiticity of the Hamiltonian. Its applicability could help in the calculation of the system (IV.10).
It is necessary to carry out a systematic test campaign to verify the results obtained. These tests are currently underway and will be the subject of a subsequent work. The preliminary results obtained on a set of selected problems coming from the StatLib and UCI^10 repositories confirm the validity of the model.

^10 https://archive.ics.uci.edu/ml/index.php
References

[1] Francisco Yepes Barrera. Búsqueda de la estructura óptima de redes neurales con algoritmos genéticos y simulated annealing. Verificación con el benchmark Proben1. Inteligencia Artificial, Revista Iberoamericana de IA, 11(34):41–61, 2007.
[2] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] Alex Finnegan and Jun S. Song. Maximum entropy methods for extracting the learned features of deep neural networks. PLOS Computational Biology, 13(10):e1005836, October 2017. doi: 10.1371/journal.pcbi.1005836.
[4] Jeffrey D. Fitzgerald, Lawrence C. Sincich, and Tatyana O. Sharpee. Minimal models of multidimensional computations. PLOS Computational Biology, 7(3):e1001111, March 2011. doi: 10.1371/journal.pcbi.1001111.
[5] Lan Gao and Youwei Hu. Multitarget matching based on niching genetic algorithm. IJCSNS International Journal of Computer Science and Network Security, 6(7A), July 2006.
[6] Amir Globerson and Naftali Tishby. The minimum information principle for discriminative learning. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, pages 193–200, Arlington, Virginia, United States, 2004. AUAI Press.
[7] Amir Globerson, Eran Stark, Eilon Vaadia, and Naftali Tishby. The minimum information principle and its application to neural code analysis. Proceedings of the National Academy of Sciences of the United States of America, PNAS, 106(9), March 2009.
[8] Ira N. Levine. Quantum Chemistry. Pearson, Boston, seventh edition, 2014.
[9] Javier R. Movellan and James L. McClelland. Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17(4):463–496, October 1993. doi: 10.1207/s15516709cog1704_1.
[10] Joseph C. Park and Salahalddin T. Abusalah. Maximum entropy: A special case of minimum cross-entropy applied to nonlinear estimation by an artificial neural network. Complex Systems, 11, 1997.
[11] Carlos A. L. Pires and Rui A. P. Perdigao. Minimum mutual information and non-Gaussianity through the maximum entropy method: Theory and properties. Entropy, 14(6):1103–1126, June 2012. doi: 10.3390/e14061103.
[12] Lutz Prechelt. Proben1 - a set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany, September 1994.
[13] Zhang Xiaodong. Evaluation model and simulation of basketball teaching quality based on maximum entropy neural network. 2014.
[14] Dongxin Xu. Energy, entropy and information potential for neural computation. PhD thesis, University of Florida, 1999.
[15] Yan Zhang, Mete Ozay, Zhun Sun, and Takayuki Okatani. Information potential auto-encoders. arXiv:1706.04635, June 2017.