On randomization of neural networks as a form of post-learning strategy

11/26/2015
by   K. G. Kapanova, et al.

Today artificial neural networks are applied in various fields, such as engineering, data analysis and robotics. While they are a successful tool for a variety of relevant applications, mathematically speaking they are still far from being fully understood. In particular, they are often unable to find the best possible configuration during the training process (the local minimum problem). In this paper, we focus on this issue and suggest a simple but effective post-learning strategy that allows the search for an improved set of weights at a relatively small extra computational cost. To this end, we introduce a novel technique based on an analogy with quantum effects occurring in nature as a way to mitigate (and sometimes overcome) this problem. Several numerical experiments are presented to validate the approach.

1 Introduction

Neural networks represent a multidisciplinary field, spanning e.g. neuroscience, mathematics, statistics, computer science, engineering, and physics. The development of the artificial neural network (ANN) field can broadly be viewed as a few periods of extensive research, starting with the first idea of a neuron by W. McCulloch and W. Pitts (1943) [1]. Later, J. Hopfield introduced his recurrent neural network, while P. Werbos developed the back-propagation algorithm, one of the most widely used to this day [2], [3]. The reader should note that all of these networks are based on a deterministic approach.

ANNs try to mimic only the four fundamental elements of biological neurons: input, processing, learning and output. In order to generate an output, they exploit the interconnection principle between biological neurons. We can broadly classify them by the type of learning: supervised or unsupervised. Through the learning (or training) process, the weights and biases of the network are adapted. There are many strategies for learning, usually determined by the way these values evolve. The two most important requirements of the learning process are to use minimal computational resources and to provide robustness of the system. One further step is the choice of the error (or target) function; the goal is to minimize the error by varying the weights of the ANN.

In the standard approach, the weights are changed only during the training process, and essentially nothing is modified while the network is in use. Among the possible options for the learning process, the gradient descent method is one of the most common. The back-propagation algorithm proposed in [3] propagates the error through the network layer by layer until an outcome is produced. Back-propagation looks for the minimum of the error function in weight space by means of the delta rule or gradient descent. On the other hand, there are Evolutionary Algorithms, which are generally directed random searches: they start from a random population and slowly converge to a solution [4]. Another interesting alternative is simulated annealing [5], which in some situations can perform faster than back-propagation or genetic algorithms. By analogy with the physics problem discussed in [6], the strategy is based on the transition of a solid substance from a high temperature to thermal equilibrium. In this context, cooling the substance is equivalent to minimizing the cost function of an optimization problem. In [5], simulated annealing is obtained by substituting the cost for the energy and executing the algorithm while slowly decreasing the temperature.

To the best of our knowledge, the above concepts are strongly based on analogies with classical (or deterministic) physics. An interesting possibility is to exploit, in some sense, quantum mechanical effects in an ANN. This was suggested for the first time in [7], which describes biological aspects of quantum phenomena in brain activity related to the activation of a neuron by a nerve impulse.

In that work, the action potential plays a fundamental role in information processing inside the brain. The authors suggested that the firing of a neuron is caused by the motion of a quantum particle in the proximity of a potential (or energetic) barrier, where effects such as tunnelling typically occur. Alternatives to this explanation exist, such as the one proposed in [8], based on the concept of micro-tubules which are able to maintain a macroscopic coherent superposition. The explanation given in [7], in particular, inspired us to develop a technique which mimics quantum effects in order to improve the set of weights of an ANN in the post-learning stage. In more detail, our method aims to reinforce the reliability of a network even in the case of training failure, at a relatively low computational cost.

The paper is organized as follows (the reader should note that in this paper we use the words quantum, random and noise interchangeably, with the same meaning; the same applies to the terms classical and deterministic). In the next section we introduce the methodology behind the development of our proposed network. Then, in order to validate this novel approach, we perform a set of numerical experiments on the problem of approximating a known function. In spite of the simplicity of the proposed technique, we believe that it gives the network a further chance to escape from local minima, local optima or saddle points [9] at a reasonable computational cost.

2 Formulation and Methodology

Nowadays, neural networks are used for a great variety of tasks, such as classification, data analysis and dimensionality reduction, and thus many different implementations exist. Among them are, for example, the perceptron [10], the multilayer feedforward network [11], and the probabilistic network [12]. Interesting alternatives based on fuzzy logic can be found in [13], [14]. In this work, we focus on the multilayer feedforward network, which consists of many neurons, each fully connected to every neuron in the adjacent forward layer, although the technique is not limited to this particular implementation.

Figure 1 describes the basic processing calculation of an artificial neuron.

Figure 1: Sketch of a typical neural network architecture.

There are several components that are common to all artificial neurons, regardless of their position in the network (input, hidden or output layer): weights, summation functions, activation functions, transfer functions, and the error (target) function [2]. The arrangement of the neurons in layers, the connections between the neurons of different layers, and the summation and transfer functions are common to all neural networks and form their basis. In our specific implementation, the activation function used for the neurons is a rescaled sigmoid function with a bounded output range.


Many types of learning strategies exist, usually determined by the way the values change. Their main goal is to reach a good balance between finding a sufficiently accurate (usually local) minimum and the computational resources required [2]. The most widely used learning strategies are Hebb's rule [15], the Hopfield law [16], and the delta rule (also known as the Least Mean Square learning rule). In this particular work we have developed a three-layer feedforward network (see Fig. 2), consisting of one input neuron, four hidden neurons and one output neuron, capable of approximating a non-linear function (although this is certainly not the only possible choice). The number of hidden neurons is critical to the network performance. During the design of the network [11], [17], [18], we considered the hidden-layer neuron count, taking into account that more neurons provide the capability to approximate functions of greater complexity. At the same time, populating the hidden layer with too many neurons can lead to overfitting the training data and thus to worse results. On the other hand, too few nodes in the hidden layer leave the network without enough power to provide the desired outputs.

Figure 2: Schematic of the architecture of the neural network implemented in this work. The dots represent neurons, the lines express the connections between the neurons. The input layer consists of one neuron, the hidden layer is implemented by four neurons, and the output layer has one neuron.
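To make the 1-4-1 architecture concrete, the following is a minimal sketch of a forward pass through such a network. It is not the authors' implementation: the use of `tanh` as the rescaled sigmoid, the linear output neuron, the presence of bias terms, and all names (`forward`, `w_in`, `b_hidden`, `w_out`, `b_out`) are assumptions made for illustration.

```python
import math
import random


def forward(x, w_in, b_hidden, w_out, b_out):
    """Forward pass of a 1-4-1 feedforward network.

    Assumptions: tanh hidden activations and a linear output neuron.
    """
    # Each hidden neuron sees the single input scaled by its own weight.
    hidden = [math.tanh(w * x + b) for w, b in zip(w_in, b_hidden)]
    # The output neuron sums the weighted hidden activations.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Hypothetical random initialization of the weights (seeded for repeatability).
rng = random.Random(0)
n_hidden = 4
w_in = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
w_out = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
b_out = rng.uniform(-1.0, 1.0)

y = forward(0.5, w_in, b_hidden, w_out, b_out)
print(f"network output for x = 0.5: {y:.6f}")
```

With four hidden neurons this sketch has 13 adjustable parameters, which is the set of values a training procedure would tune.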

The training process is based on the simulated annealing method, which consists of two steps: first, an effective temperature is increased to a maximum value; second, the effective temperature is decreased slowly until the particles (representing the solution of the optimization problem) rearrange themselves in the ground state of the solid. The probability of accepting a move for a point is given by

$$ p = \exp\left(-\frac{\Delta E}{T}\right) \tag{1} $$

where $\Delta E$ is the difference between the current energy and the energy before the move, and $T$ is the effective temperature of the system [5]. A probabilistic acceptance is then obtained by generating a random number $r$ in the range $[0,1]$ and comparing it to $p$: if $r < p$, the move is accepted. As the optimization reaches lower effective temperatures, fewer energy-increasing moves are accepted, and the procedure closely resembles a downhill-only search.
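The acceptance rule and the slow cooling described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the training code used in the paper: the geometric cooling schedule, the uniform proposal step, the one-dimensional search variable and the names `accept_move` and `anneal` are all hypothetical.

```python
import math
import random


def accept_move(delta_e, temperature, rng):
    """Metropolis-style acceptance with p = exp(-delta_e / temperature)."""
    if delta_e <= 0.0:
        # Downhill (or neutral) moves have p >= 1 and are always accepted.
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def anneal(cost, x0, t_start=1.0, t_end=1e-3, cooling=0.95, step=0.1, seed=0):
    """Minimize a 1-D cost function with a geometric cooling schedule (assumed)."""
    rng = random.Random(seed)
    x, e = x0, cost(x0)
    best_x, best_e = x, e
    t = t_start
    while t > t_end:
        candidate = x + rng.uniform(-step, step)
        candidate_e = cost(candidate)
        if accept_move(candidate_e - e, t, rng):
            x, e = candidate, candidate_e
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling  # slowly decrease the effective temperature
    return best_x, best_e


best_x, best_e = anneal(lambda x: (x - 1.0) ** 2, x0=0.0)
print(f"best x = {best_x:.4f}, cost = {best_e:.6f}")
```

Stopping the schedule early, or cooling too fast, leaves the search wherever it happens to be, which is the kind of non-optimal end state the post-learning strategy is meant to improve.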

2.1 Miming quantum randomness

The possibility that biological neural networks exploit quantum effects strongly suggests introducing randomness into an artificial neural network, which could, in some way, bring computational advantages [7]. At first glance, in the context of ANNs, this may suggest the use of quantum mechanical laws inside the very core of a neuron. In practice, this would require numerically simulating the time-dependent Schrödinger equation (or any other equivalent formalism such as Feynman, Wigner, etc.) to quantitatively determine the eventual tunnelling effects. This would amount to numerically solving the following time-dependent partial differential equation [19]

$$ i\hbar\,\frac{\partial \psi(x,t)}{\partial t} = -\frac{\hbar^2}{2m}\nabla^2\psi(x,t) + V(x)\,\psi(x,t) \tag{2} $$

where $i$ is the imaginary unit, $\psi(x,t)$ is the wave function defined over space and time, $\hbar$ is the reduced Planck constant, $x$ is the position of the particle, $t$ is the time, $m$ is the mass of the particle, $\nabla^2$ is the Laplacian operator, and $V(x)$ is the potential energy acting on the particle. While this task would be affordable for a relatively small number of (independent) artificial neurons, it represents a daunting task in the context of ANNs, where one may have to deal with thousands of neurons. Therefore we suggest a computationally more convenient technique whose aim is to mimic the presence of randomness, intrinsic to a quantum system, without the burden of the high computational cost of highly accurate quantum simulations.


The technique is based on the following fact: it is possible to model, by analogy, a biological neuron as a semiconductor heterostructure consisting of one energetic barrier (e.g. AlGaAs) sandwiched between two energetically lower regions (e.g. GaAs) [7]. Therefore, the activation function of an artificial neuron can be viewed as one or more particles entering the heterostructure and interacting with the barrier (see Fig. 3). The squared modulus of the wave function provides the probability of finding the particle at some point of the device at time $t$, thus introducing randomness into the process (Born rule). If the probability of back scattering is higher than the probability of tunnelling, we consider the activation function inhibited, and vice versa.

Figure 3: Left plot: a Gaussian wave-packet, (blue) continuous line, is travelling against an energetic potential barrier, (red) dashed line. Right plot: after a certain time, the wave-packet is interacting with the barrier. Part of the packet is scattering back while the rest is tunnelling.
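As a rough illustration of the tunnelling-versus-back-scattering criterion, the sketch below uses the textbook transmission coefficient of a plane wave through a rectangular barrier, for energies below the barrier height. This is a simplification chosen for brevity and is not the wave-packet simulation shown in Fig. 3; the natural units ($\hbar = m = 1$), the barrier parameters and the names `transmission` and `neuron_fires` are assumptions.

```python
import math


def transmission(energy, barrier_height, barrier_width, mass=1.0, hbar=1.0):
    """Transmission probability through a rectangular barrier, for 0 < E < V0.

    Uses T = 1 / (1 + V0^2 sinh^2(kappa a) / (4 E (V0 - E))),
    with kappa = sqrt(2 m (V0 - E)) / hbar.
    """
    if not 0.0 < energy < barrier_height:
        raise ValueError("this formula assumes 0 < energy < barrier_height")
    kappa = math.sqrt(2.0 * mass * (barrier_height - energy)) / hbar
    s = math.sinh(kappa * barrier_width)
    denom = 4.0 * energy * (barrier_height - energy)
    return 1.0 / (1.0 + (barrier_height ** 2) * (s ** 2) / denom)


def neuron_fires(energy, barrier_height, barrier_width):
    """Activation is inhibited when back scattering is more likely than tunnelling."""
    t = transmission(energy, barrier_height, barrier_width)
    reflection = 1.0 - t
    return t > reflection


for width in (0.5, 1.0, 2.0, 5.0):
    t = transmission(0.5, 1.0, width)
    print(f"width = {width}: T = {t:.4e}, fires = {neuron_fires(0.5, 1.0, width)}")
```

Even this closed-form case shows how sharply the outcome depends on the barrier, which is one reason the paper replaces a full quantum simulation with inexpensive random noise.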

The reader might notice similarities between our idea and the structure of an action potential, which can be described in terms of distinct voltage-gated ion channels in a cell's membrane. The membrane maintains an electric potential difference (voltage); once this potential rises to a defined threshold value, the channels open and the action potential is triggered.


In practice, we achieve this goal by adding to the network a function addHiddenNoise which, for every neuron in the hidden layer, adds noise to the already computed set of weights:

(3)

which mathematically corresponds to the expression

(4)

In the context of ANN training, our suggested technique is in some sense comparable to certain elements of the Genetic Algorithm (GA), specifically its mutation step. The mutation step can be viewed as an initialization of random walks through the search space of possible solutions. The mutations are ordinarily small and are defined by a step and a rate, which can be constant or adaptive. The genetic algorithm creates new generations of individuals, from which it selects the best ones and further evolves the population according to a predefined set of rules. During the training process, the main purpose of the mutation operator is to maintain diversity within a population of ANN weights in order to prevent premature convergence. Our technique, on the other hand, is designed to reach a random improvement at very low computational cost. The procedure involves the generation of random doubles by means of a Mersenne Twister. The noise is restricted to a certain range, specified by the user. The new (random) set of weights is then compared to the one obtained by the classical network. Finally, the function backupState is used to copy the weights, so that they can be restored when the addition of noise produces a larger (root mean square) error than the noiseless set of weights. In these terms, our technique provides an algorithm that incurs negligible computational cost, since it relies only on the generation of random numbers.
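The post-learning step described above can be sketched as follows. The functions `add_hidden_noise` and `backup_state` only echo the names addHiddenNoise and backupState from the text; their bodies are assumptions, as is the form of the noise (a relative perturbation of each weight, uniform within a user-chosen level such as 0.01 for 1%). Python's `random` module is used because its generator is a Mersenne Twister; the toy linear model stands in for the trained network.

```python
import copy
import math
import random


def rms_error(predict, weights, data):
    """Root mean square error of the model over (x, target) pairs."""
    return math.sqrt(sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data))


def backup_state(weights):
    """Copy the weights so they can be restored if the noise does not help."""
    return copy.deepcopy(weights)


def add_hidden_noise(weights, level, rng):
    """Perturb each weight by a relative amount in [-level, +level] (assumed form)."""
    return [w * (1.0 + level * rng.uniform(-1.0, 1.0)) for w in weights]


def post_learning_step(predict, weights, data, level, rng):
    """Keep the noisy weights only if they lower the RMS error."""
    saved = backup_state(weights)
    base_error = rms_error(predict, saved, data)
    noisy = add_hidden_noise(weights, level, rng)
    noisy_error = rms_error(predict, noisy, data)
    if noisy_error < base_error:
        return noisy, noisy_error
    return saved, base_error  # restore the noiseless weights


# Toy stand-in for a trained network: y = w[0] * x + w[1], stuck at poor weights.
def predict(w, x):
    return w[0] * x + w[1]


rng = random.Random(42)  # Mersenne Twister generator
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
weights = [1.5, 0.5]
start_error = rms_error(predict, weights, data)
for _ in range(200):
    weights, error = post_learning_step(predict, weights, data, 0.04, rng)
print(f"RMS error: {start_error:.4f} -> {error:.4f}")
```

Because a perturbation is kept only when it lowers the error, the error after each step can never exceed the noiseless one; the only extra cost is one random number per weight and one additional error evaluation.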


In the next section, we describe several numerical experiments whose aim is the validation of our approach.

3 Numerical Validation

In this section, we present a numerical validation of our suggested post-learning strategy for a neural network. The network's aim is to fit two known functions, a polynomial of second degree and the square root of a polynomial, given three data points. The network architecture is identical for both functions: three layers, with one input neuron, four hidden neurons and one output neuron. Additionally, we have constrained the weight space. In the first case, the weight search is restricted to a fixed interval. The weights for the second function are randomly distributed over a different interval. The training process is based on the simulated annealing method for both functions.

Here we deliberately stop the network at an arbitrary local minimum by decreasing the temperature in a non-optimal fashion. This is done in order to show clearly that our technique can provide a way to further improve the training even after the optimization process. This situation is important because, with the big data available nowadays, it becomes difficult to find the best set of weights for an ANN due to the ever growing complexity. Learning is performed using three equidistant data points taken from the interior of the input range (the extrema are excluded). The first case (see Fig. 5, upper left plot) uses three such points for the polynomial of second degree, and three further points are used to train the network on the square root of a polynomial.

To understand the extent to which noise influences the network's output and the reduction of the error, we investigated different scenarios. First we run the network without the addition of noise in order to benchmark the quantum part of the network against the classical one (see Fig. 5 and Fig. 7, upper left plots). The remaining tests run the network with 0.5%, 1%, 2%, and 4% noise respectively.

The following two subsections attempt to explain how the level of noise is affecting the two functions.

3.1 Polynomial of second degree

Our first validation experiment addresses the fitting of a polynomial of second degree. To ascertain the network performance, we initially run the network without any noise to provide a benchmark test for its correct functioning. To observe how the level of noise affects the network's output, we first consider a case where minimal noise is added to the system (0.5%). The effects on the quantum behaviour of the network are minimal (see Fig. 5, upper right plot), influencing only the first few outputs. This is corroborated by the error of the network (see Fig. 6, upper right plot). As expected, for the first few data points, the quantum part of the network contributes a slight improvement of the result. To achieve this, at every point the algorithm compares the classical error to the quantum error, choosing the better option and discarding the other. Whenever the quantum solution is better, the network accepts it and continues forward.

The amount of noise applied is reflected in the dispersion of the quantum output and error across the plots as the amplitude of the noise is successively increased. The moderate increase of the noise to 1% reveals that the network output is less accurate in the middle of the solutions, with better results at the upper and lower ends of the curve.

The network precision deteriorates further when we apply 2% noise. This case supplies only a few reliable outputs close to the fitting curve. One can note that the increase in noise contributes to more outliers in the output (shown in Fig. 5, middle right plot). However, with the addition of noise, the quantum network outperforms the classical one in very few situations. Moving from 0.5% to 1% to 2% and 4% noise, the overall error tends to decrease slowly (Fig. 6, middle right and lower left plots).

In the current numerical validation, the performance of the network is stable with 0.5% and 1% noise added. The quantum part produces a smaller error from the beginning of the calculations, with the possibility of decreasing the error almost instantaneously.

The higher levels of noise still accomplish small improvements to the network, but at far lesser scale and depth.

3.2 Square root of a Polynomial

Since our proposed technique could behave differently when fitting different functions, we have executed a second test, fitting the square root of a polynomial. With the amount of noise set to 0.5%, the results of the quantum part resemble the classical ones (Fig. 7, upper right plot). The pattern of error reduction is stepwise, similar to the one observed in the corresponding situation for the first function. In this setup, only a few better outcomes are provided, notwithstanding the steep descent of the error from point to point.

Our further analysis, with the addition of 1% noise, finds an outcome close to the optimal one (Fig. 7, middle left plot). In fact, the error continues to taper off in a manner similar to the case with 0.5% noise. In accordance with the previous numerical validation test, one would expect an increase in noise to decrease the network performance. The current experiment exhibits the opposite trend.

The simulation shows that 2% noise contributes to an increased efficiency of the quantum network. The network provides as many as twice the number of quantum outcomes obtained with 0.5% or 1% noise, as well as lower errors (see Fig. 6, middle right plot). After the initial steep error reduction from the quantum part, a significant difference from the first experiment is the clustering of the quantum error points near the classical ones.

In the next step, at 4% noise, the error decreases more consistently than with any other amount of noise. The reader should note that in this situation (Fig. 7, lower left plot), the network provides as many as twice the number of quantum outcomes obtained with less noise. Throughout the various noise scenarios, the outputs from the quantum network are analogous to the classical ones.

One possible explanation for the difference of network performance for the two numerical experiments could be provided by the variation of the weight space range for the two functions. Further experiments are required to establish the network performance in relation to the weight space range, the noise amount and new functions.

The different error reduction for the two known functions is illustrated in Fig. 4 (for the sake of clarity we also report these results in the form of a table, see Table 1). The left plot confirms the impression of increased network performance at noise levels of 0.5% or 1%. The behaviour of the network is the opposite in the second numerical experiment: running the network with a gradual increase in the noise actually contributes to the tapering off of the error.

4 Conclusions

In this article, we introduced a novel post-learning strategy that is implemented as an auxiliary reinforcement of the classical learning process of neural networks. The main purpose of this technique is to provide a computationally reasonable method for a network trying to escape a local minimum reached during the training process. To achieve this, we suggested an approach based on the generation of random numbers at the core of artificial neurons, which attempts to mimic the presence of quantum randomness. By performing several numerical experiments, we validated the method on the problem of fitting a known function, given a certain number of training points, and we have shown how our technique provides certain improvements in the system without significant additional computational cost. Certainly, further investigation is necessary to establish the right amount of noise to be introduced in a network in order to achieve real improvements. This will be the subject of future work.

Figure 4: Final error after the network has computed the output points. The left plot illustrates the error for the polynomial of second degree. The right plot depicts the error for the square root of a polynomial. Within each plot, the initial error is the same for all levels of noise.
Noise    Polynomial of second degree    Square root of a polynomial
0%       0.1288600                      0.1020300
0.5%     0.0983995                      0.1015950
1.0%     0.0904621                      0.1009560
2.0%     0.1198510                      0.0971008
4.0%     0.1086160                      0.0802115

Table 1: Final error estimation after the network calculates the output points for the two numerical validation functions. See also Fig. 4.
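To read the values of Table 1 as improvements over the noiseless run, the following short sketch computes the relative error reduction with respect to the 0% row. The numbers are copied from the table; assigning the first column to the polynomial of second degree and the second to the square root of a polynomial is an assumption based on the left/right description of Fig. 4, and the names `final_error` and `relative_reduction` are illustrative.

```python
# Final errors from Table 1: noise level (%) -> (polynomial, square root).
# The column-to-function assignment is an assumption based on Fig. 4.
final_error = {
    0.0: (0.1288600, 0.1020300),
    0.5: (0.0983995, 0.1015950),
    1.0: (0.0904621, 0.1009560),
    2.0: (0.1198510, 0.0971008),
    4.0: (0.1086160, 0.0802115),
}


def relative_reduction(baseline, value):
    """Error reduction in percent relative to the noiseless baseline."""
    return 100.0 * (baseline - value) / baseline


base_poly, base_sqrt = final_error[0.0]
for noise, (poly, root) in final_error.items():
    print(
        f"{noise:>4}% noise: polynomial {relative_reduction(base_poly, poly):6.2f}%, "
        f"square root {relative_reduction(base_sqrt, root):6.2f}%"
    )

# Noise level with the lowest final error for each function.
best_poly = min(final_error, key=lambda n: final_error[n][0])
best_sqrt = min(final_error, key=lambda n: final_error[n][1])
print(f"best noise level: polynomial {best_poly}%, square root {best_sqrt}%")
```

Under this reading, the largest reduction for the polynomial is about 30% (at 1% noise), while for the square root it is about 21% (at 4% noise), consistent with the opposite trends discussed for the two experiments.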
Figure 5: Output of our neural network for the polynomial of second degree. The (red) star marks the desired network output, the (blue) symbol indicates the output of the classical mode, and the quantum mode is denoted by a square. The upper left plot shows a validation test of the network, running in both classical and quantum mode, with no noise applied. The upper right plot shows the output when 0.5% noise is applied. The middle left and right plots display the network's output with 1% and 2% noise respectively. The final plot shows the output when the network is executed with 4% noise.

Figure 6: Error reduction of the neural network in the case of the polynomial of second degree. The error of the classical part of the network is depicted by a (blue) symbol, while the error of the quantum part of the network is shown as a (red) square. The upper left plot illustrates the network running with no noise, for validation purposes. The upper right plot illustrates the network's error with 0.5% noise applied. The middle left and right plots show the network error in simulations with 1% and 2% noise. The lower left plot shows the error with 4% noise.

Figure 7: Neural network outputs in the case of the square root of a polynomial. The (red) star marks the desired network output, the (blue) symbol indicates the output of the classical mode, and the quantum mode is denoted by a square. The upper left and right plots show the network's output with 0% and 0.5% noise. The middle left and right plots show the output with 1% and 2% noise respectively. The lower left plot shows the results of executing the network with 4% noise.

Figure 8: Error decline of the neural network in the simulation of the square root of a polynomial. The error of the classical part of the network is depicted by a (blue) symbol, while the error of the quantum part of the network is shown as a (red) square. The upper left plot illustrates the network running with no noise, for validation purposes. The network's error with 0.5% noise applied is depicted in the upper right plot. The middle left and right plots illustrate the network's error in simulations with 1% and 2% noise. The lower left plot shows the network's error in a test with 4% noise.

References

  • (1) W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull Math Biophys 5(4), pp. 115–133, (1943).
  • (2) S. Haykin, Neural Networks and Learning Machines, Third edition, Pearson Education, (2009).
  • (3) P.J. Werbos, Generalization of Backpropagation with Application to a Recurrent Gas Market Model, Neural Networks 1.4, pp. 339-356, (1988).
  • (4) J. Branke, Evolutionary Algorithms for Neural Network Design and Training, In Proceedings of the First Nordic Workshop on Genetic Algorithms and its Applications, (1995).
  • (5) S. Kirkpatrick, C.D. Gelatt Jr, M.P. Vecchi, Optimization by Simulated Annealing, Neurocomputing: foundations of research, MIT Press, (1988).
  • (6) N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of State Calculations by Fast Computing Machines, The journal of chemical physics, 21(6), pp. 1087-1092, (1953).
  • (7) F. Beck, J.C. Eccles, Quantum Aspects of Brain Activity and the Role of Consciousness, How the SELF Controls Its BRAIN. Springer Berlin Heidelberg, pp. 145-165, (1994).
  • (8) R. Penrose, The Emperor’s New Mind: Concerning Computers, Brains and the Laws of Physics, Oxford University Press, (1999).
  • (9) T. Schaul, S. Zhang, Y. LeCun, No more pesky learning rates, arXiv preprint arXiv:1206.1106, (2012).
  • (10) F. Rosenblatt, The Perceptron: a Probabilistic Model for Information Storage and Organization in the Brain, Psychological review 65.6, 386, (1958).
  • (11) C.M. Bishop, Neural Networks for Pattern Recognition, MIT Press, (1993).
  • (12) D.F. Specht, Probabilistic Neural Networks, Neural networks 3.1, pp. 109-118, (1990).
  • (13) M. Pratama, S.G. Anavatti, E. Lughofer, GENEFIS: Toward an Effective Localist Network, Fuzzy Systems, IEEE Transactions, 22, no. 3, pp. 547-562, (2014).
  • (14) M. Pratama, S.G. Anavatti, P.P. Angelov, E. Lughofer, PANFIS: a novel incremental learning machine, Neural Networks and Learning Systems, IEEE Transactions, 25.1, pp. 55-68, (2014).
  • (15) A. Herz, B. Sulzer, R. Kühn, H.L. Van Hemmen, The Hebb Rule: Storing Static and Dynamic Objects in an Associative Neural Network, EPL (Europhysics Letters), 7(7), 663, (1988).
  • (16) J.J. Hopfield, Learning Algorithms and Probability Distributions in Feed-forward and Feed-back Networks, Proceedings of the National Academy of Sciences 84.23, pp. 8429-8433, (1987).
  • (17) O. Fujita, A Method for Designing the Internal Representation of Neural Networks and its Application to Network Synthesis, Neural Networks, 4(6), pp. 827-837, (1991).
  • (18) K. Hornik, Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks, 4(2), pp. 251-257, (1991).
  • (19) A. Goldberg, H.M. Schey, J.L. Schwartz, Computer-generated Motion Pictures of One-dimensional Quantum-mechanical Transmission and Reflection Phenomena, American Journal of Physics 35(3), (1967).