1. Introduction. The dynamics of gradient flow. Neural networks and backpropagation.
Let be a smooth function in some open domain . We equip with the topology induced by the standard Euclidean norm defined by the canonical scalar product
. The gradient vector field defined in by is given by , where are canonical coordinates in . The critical points of are the solutions of , . Let be the set of all critical points of in (which can be unbounded and/or contain non–isolated points).
Let be the initial condition of (1.1). Then every solution , either leaves all compact subsets of or approaches as the critical set , i.e.
Under the additional analyticity condition the above convergence result can be made stronger:
Note that the gradient system (1.1) cannot have any non–constant periodic or recurrent solutions, homoclinic orbits or heteroclinic cycles. Thus, trajectories of gradient dynamical systems have quite simple asymptotic behaviour.
Nevertheless, locating the basin of attraction of any equilibrium point (stable or saddle) belonging to is a non–trivial problem.
Supervised machine learning in multi–layered neural networks can be considered as an application of the gradient descent method to a non–convex optimisation problem. The corresponding cost (or error) functions are of the general form
with data set and a certain highly non–linear function containing the weights . The main problem of machine learning is to minimise the cost function with a suitable choice of weights . A gradient method, described above and called backpropagation in the context of neural network training, can get stuck in local minima or take a very long time to optimise . This is because the general properties of the cost surface are usually unknown and only trial–and–error numerical methods are available. No theoretical approach is known to provide exact initial weights for backpropagation with guaranteed convergence to the global minimum of . One of the most powerful techniques used in backpropagation is adaptive learning rate selection, where the step size of the iterations is gradually raised in order to escape a local minimum. Another approach is based on random initialisation of the weights, in the hope of selecting them close to the values that give the global minimum of the cost function. A deterministic approach, called global descent, has also been proposed, in which optimisation is formulated in terms of the flow of a special deterministic dynamical system.
The present work seeks to integrate ideas from the theory of ordinary differential equations to enrich the theoretical framework and to assist in better understanding the nature of convergence in the training of multi–layered neural networks. The principal contribution is to propose a natural extension of the classical gradient descent method by adding new degrees of freedom and reformulating the problem in a new extended phase space of higher dimension. We argue that this brings a deeper insight into the convergence problem, since the new equations become algebraically simpler and admit a family of known first integrals. While this proposal may seem radical, we believe that it offers a number of advantages at both the theoretical and the numerical level, as our experiments show. Common sense suggests that embedding the dynamics of a gradient flow into the phase space of a more general dynamical system is advantageous, since embedding the cost surface into a higher–dimensional phase space can bring new possibilities to improve the convergence and escape local minima.
The study is divided into three parts. In Section 2 we begin by recalling how the gradient descent method is applied to train the simplest possible neural network, with only an output layer. This corresponds to the conventional backpropagation algorithm, known for its simplicity and frequently used in deep learning. Next we introduce a natural extension of the gradient system, obtained by replacing the weights of individual neurones within the output layer by their nonlinear outputs. This brings more complexity to the iterative method, since the number of parameters rises considerably, but at the same time the training data becomes built into the network in a quite natural way. The generalised gradient system so obtained is later converted to an observer system. The aim is to turn the constant level set of the known first integrals into an attractor set. We will explain how the Euler iterative method applied to the observer system, called the overfly algorithm, helps to achieve convergence to the global minimum of the cost function. Sections 3 and 4 discuss the applications of this algorithm to the training of 1–layer and multilayer networks. The objective is to explain how to extend the backpropagation algorithm to its overfly version by modifying the weight–updating procedure only for the first layer of the network. In Section 5 we provide concrete numerical examples to illustrate the efficacy of the overfly algorithm in training some particular neural networks.
2. Neural network without hidden layers
We define the sigmoid function
as a particular solution of the logistic algebraic differential equation:
In particular, is an increasing and rapidly convergent map as .
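As a quick numerical illustration of the two properties just stated, the sketch below assumes the standard logistic sigmoid 1/(1 + exp(-t)) and the logistic equation s' = s(1 - s); both explicit forms, as well as the helper names `sigmoid`, `logistic_rhs` and `central_difference`, are assumptions made here for illustration and are not taken from (2.1) or (2.2).

```python
import math


def sigmoid(t):
    """Logistic sigmoid, assumed form 1 / (1 + exp(-t)), evaluated stably."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_rhs(s):
    """Right-hand side of the assumed logistic equation s' = s * (1 - s)."""
    return s * (1.0 - s)


def central_difference(f, t, eps=1e-5):
    """Second-order finite-difference approximation of f'(t)."""
    return (f(t + eps) - f(t - eps)) / (2.0 * eps)


# Compare the numerical derivative with sigma * (1 - sigma) on a grid.
grid = [-5.0 + 0.5 * k for k in range(21)]
max_residual = max(
    abs(central_difference(sigmoid, t) - logistic_rhs(sigmoid(t))) for t in grid
)

# Monotonicity on the grid and fast saturation towards 0 and 1.
values = [sigmoid(t) for t in grid]
is_increasing = all(a < b for a, b in zip(values, values[1:]))
tail_gap = 1.0 - sigmoid(20.0)
```

The residual stays at the level of the finite-difference error, and the gap to the limit value at t = 20 is already of order exp(-20), which is what "rapidly convergent" means in practice.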
Let and be two vectors called the weight and input vectors, respectively. The analytic map defined by
is called a neural network without hidden layers.
Let be the training set of (2.3) containing input data vectors and corresponding scalar output values . We want to determine the weight vector so that the values match the outputs as closely as possible. This can be achieved by minimising the so–called cost function
or, after the substitution of (2.3):
In general, is not coercive and is not necessarily a convex map.
To apply the gradient descent method one considers the following system of differential equations
Since is always decreasing along the trajectories of (2.7), it is natural to solve it starting from some initial point and use to minimise . The solution can converge (in the ideal case) to the global minimum of or, in the less favourable case, to a local minimum or a saddle point.
Here one approximates the time derivative by its discrete version
for some small step , so that the approximate solution of (2.7) at time can be obtained by iterations:
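The Euler iteration described above can be sketched as follows. This is a minimal illustration, assuming the cost is half the sum of squared errors between the sigmoid outputs and the targets; the normalisation factor, the toy data `xs`, `ys` and the names `cost`, `gradient` and `euler_descent` are assumptions introduced here, not the paper's notation.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cost(w, xs, ys):
    """Assumed cost: half the sum of squared output errors."""
    return 0.5 * sum((sigmoid(dot(w, x)) - y) ** 2 for x, y in zip(xs, ys))


def gradient(w, xs, ys):
    """Gradient of the assumed cost with respect to the weight vector w."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        s = sigmoid(dot(w, x))
        c = (s - y) * s * (1.0 - s)  # chain rule through the sigmoid
        for k, xk in enumerate(x):
            g[k] += c * xk
    return g


def euler_descent(w0, xs, ys, h, steps):
    """Explicit Euler steps w <- w - h * grad(cost) for the gradient system."""
    w = list(w0)
    history = [cost(w, xs, ys)]
    for _ in range(steps):
        g = gradient(w, xs, ys)
        w = [wk - h * gk for wk, gk in zip(w, g)]
        history.append(cost(w, xs, ys))
    return w, history


# Toy training set (illustrative only): three inputs in R^2.
xs = [(1.0, 0.5), (-0.5, 1.0), (0.3, -0.8)]
ys = [0.9, 0.2, 0.6]
w_final, history = euler_descent((0.0, 0.0), xs, ys, h=0.1, steps=200)
```

For a sufficiently small step the recorded cost values are non-increasing, mirroring the monotone decrease of the cost along the continuous trajectories.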
We write (2.7) in a simpler algebraic form by introducing the additional variables
representing the nonlinear outputs of the network for given inputs of the training set. Using the equations (2.7) to compute the derivatives , one obtains the following system of differential equations
with – the symmetric Gram matrix. We call (2.11) the generalised gradient system.
Let be the matrix defined by . Then and, as is known from elementary linear algebra, and . Since the number of training vectors usually exceeds the total number of weights of the network, we can assume that .
Thus, since , we have .
Let be a non–zero vector from the null space of and . As seen from the equations (2.11), is invariant under the flow of the system. Indeed, and are invariant hypersurfaces.
is a real analytic first integral of the system (2.11).
There exist functionally independent first integrals of the above form.
In the rest of the paper we will always assume that , i.e. the set contains sufficiently many independent vectors.
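The invariance statement can be checked numerically on a toy example. The sketch below is a reconstruction under stated assumptions, not the paper's formulas: it assumes the logistic sigmoid, the cost equal to half the sum of squared errors, the generalised system obtained from these by the chain rule, and a first integral of the form sum of a_i * ln(s_i / (1 - s_i)) for a null vector a of the Gram matrix. All names and data are illustrative.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def det2(u, v):
    return u[0] * v[1] - u[1] * v[0]


# Toy training set: three inputs in R^2, so the 3x3 Gram matrix has rank 2.
xs = [(1.0, 0.5), (-0.5, 1.0), (0.3, -0.8)]
ys = [0.9, 0.2, 0.6]
G = [[dot(u, v) for v in xs] for u in xs]

# Null vector of G from the identity a1*x1 + a2*x2 + a3*x3 = 0 in the plane.
a = [det2(xs[1], xs[2]), det2(xs[2], xs[0]), det2(xs[0], xs[1])]


def generalised_rhs(s):
    """Assumed generalised gradient system written in the output variables s_i."""
    g = [(si - yi) * si * (1.0 - si) for si, yi in zip(s, ys)]
    return [-si * (1.0 - si) * dot(row, g) for si, row in zip(s, G)]


def first_integral(s):
    """Assumed first integral: sum of a_i * ln(s_i / (1 - s_i))."""
    return sum(ai * math.log(si / (1.0 - si)) for ai, si in zip(a, s))


def integral_rate(s):
    """Time derivative of the first integral along the assumed system."""
    ds = generalised_rhs(s)
    return sum(ai * di / (si * (1.0 - si)) for ai, di, si in zip(a, ds, s))


# The derivative vanishes at arbitrary points of the open unit cube.
samples = [(0.3, 0.6, 0.8), (0.5, 0.5, 0.5), (0.1, 0.9, 0.4)]
rates = [integral_rate(s) for s in samples]
gram_residual = max(abs(dot(row, a)) for row in G)

# On genuine network outputs s_i = sigmoid(<w, x_i>) the integral equals zero.
w = (0.7, -1.3)
on_network = first_integral([sigmoid(dot(w, x)) for x in xs])
```

The derivative vanishes up to rounding error because the Gram matrix is symmetric and annihilates the vector a; the zero level of the integral is where the actual network outputs live.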
Let , be a basis of . Using the vector notation
the family of the first integrals given by Theorem 2.1 can be written simply as
Let , be the map defined by
is a submersion.
Thus, for all the set is a –dimensional invariant manifold for the system (2.11).
is diffeomorphic to .
Let . We define the map by
Then, and so .
To show that is invertible, let us fix . Since is one–to–one, there exists a unique vector , such that , for and
because by substitution into (2.13).
We now look for the solution of the linear system , which can be written in vector form as . The linear map , has . Moreover, where orthogonality is defined by the scalar product . Indeed, , by direct verification, and by the rank–nullity theorem. Hence, the map is a linear bijection and the linear equation admits a unique solution since , as follows from (2.17). This completes the proof. ∎
The system (2.11) can be written in vector form as where is a vector field complete in ( is a bounded open invariant set). Let and
be the –neighbourhood of . Together with (2.11), consider the following observer system
. Here, , and
The matrix is invertible and positive definite since . Thus, the vector field is well defined in .
is a Lyapunov function and satisfies for every solution , of (2.11).
It is sufficient to differentiate and to use the positiveness of the Gram matrix . ∎
Firstly, when using the standard gradient descent method, instead of dealing with the system (2.7) one can solve the observer equations (2.19) with some initial condition and then use Lemma 2.2 to compute as corresponding to for some sufficiently large . It is well known that applying the Euler method (2.8) to solve (2.7), i.e. following the conventional backpropagation algorithm, leads to the accumulation of a global error proportional to the step size . At the same time, the numerical integration of the observer system (2.19) is, owing to the existence of the attractor set , much more stable numerically, since the solution is attracted by the integral manifold .
The second improvement brought by the observer system (2.19) is more promising. Imagine we start the integration of (2.19) with the perturbed initial condition , for some . Then, according to Theorem 2.2, , and, as follows from Lemma 2.3, will be a decreasing function of in a neighbourhood of since on . This can be seen as the coexistence of the local dynamics of the observer system in , pushing to the equilibrium point , of (2.7), and the dynamics of the gradient system (2.7) on , forcing to approach the critical point set (see Figure 3).
One can suggest that this kind of double dynamics considerably increases the chances of convergence to the global minimum of the cost function (2.5). We call overfly the training of the neural network (2.3) done by solving the observer system (2.19) with the help of the Euler first–order method starting from some initial point .
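The idea of a dissipative term that attracts the solution to the invariant manifold can be illustrated with a simplified stand-in for the observer system. The sketch below does not reproduce (2.19) or (2.20): it assumes the logistic sigmoid, the cost equal to half the sum of squared errors, works in the pre-activation variables u_i = <w, x_i> (where the invariant manifold is the linear subspace a . u = 0), and uses an assumed dissipation term proportional to a parameter `k`. All names and data are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def det2(u, v):
    return u[0] * v[1] - u[1] * v[0]


# Toy training set: three inputs in R^2 with scalar targets.
xs = [(1.0, 0.5), (-0.5, 1.0), (0.3, -0.8)]
ys = [0.9, 0.2, 0.6]
G = [[dot(u, v) for v in xs] for u in xs]
# Null vector of the Gram matrix: a1*x1 + a2*x2 + a3*x3 = 0.
a = [det2(xs[1], xs[2]), det2(xs[2], xs[0]), det2(xs[0], xs[1])]


def observer_rhs(u, k):
    """Gradient dynamics in u plus an assumed pull towards the set a . u = 0."""
    s = [sigmoid(ui) for ui in u]
    g = [(si - yi) * si * (1.0 - si) for si, yi in zip(s, ys)]
    flow = [-dot(row, g) for row in G]
    pull = k * dot(a, u) / dot(a, a)
    return [f - pull * ai for f, ai in zip(flow, a)]


def overfly_sketch(u0, k, h, steps):
    """Explicit Euler integration, recording the constraint residual a . u."""
    u = list(u0)
    residuals = [dot(a, u)]
    for _ in range(steps):
        du = observer_rhs(u, k)
        u = [ui + h * di for ui, di in zip(u, du)]
        residuals.append(dot(a, u))
    return u, residuals


# Start off the manifold: genuine pre-activations plus a perturbation.
w0 = (0.2, -0.1)
eps = (0.3, -0.2, 0.1)
u0 = [dot(w0, x) + e for x, e in zip(xs, eps)]
u_final, residuals = overfly_sketch(u0, k=2.0, h=0.05, steps=400)

# Recover the weights from the first two components (2x2 Cramer solve).
d = det2(xs[0], xs[1])
w_final = (
    (u_final[0] * xs[1][1] - u_final[1] * xs[0][1]) / d,
    (xs[0][0] * u_final[1] - xs[1][0] * u_final[0]) / d,
)
```

In this simplified setting each Euler step multiplies the residual a . u by exactly 1 - h*k, so the perturbed trajectory is pulled back to the manifold while the gradient part of the dynamics continues to act; once the residual is negligible, a weight vector consistent with all three components can be recovered.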
3. The 1–hidden layer network case
In this section we describe the generalised gradient system of differential equations appearing in the supervised backpropagation training of a 1–hidden layer network. As in the previous section, let belong to the training set (2.4). Let be the weight vectors of the hidden layer and be the weight vector of the output layer.
The 1–hidden layer neural network is a real analytic map defined as follows
where are the outputs of the first layer. We want to minimise the same cost function
where , is the training set. To solve the optimisation problem one can define the gradient system analogous to (2.7) with respect to the vector variables and :
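A network of this kind and one Euler step of its gradient system can be sketched as follows. The sketch assumes a bias-free network whose output is the sigmoid of the scalar product of the output weights with the hidden-layer outputs, and the cost equal to half the sum of squared errors; these explicit forms, the toy data and all names are assumptions made for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def forward(ws, v, x):
    """Hidden-layer outputs and network output for one input vector x."""
    hidden = [sigmoid(dot(w, x)) for w in ws]
    return hidden, sigmoid(dot(v, hidden))


def cost(ws, v, xs, ys):
    """Assumed cost: half the sum of squared output errors."""
    return 0.5 * sum((forward(ws, v, x)[1] - y) ** 2 for x, y in zip(xs, ys))


def gradients(ws, v, xs, ys):
    """Backpropagated gradients with respect to hidden and output weights."""
    grad_ws = [[0.0] * len(w) for w in ws]
    grad_v = [0.0] * len(v)
    for x, y in zip(xs, ys):
        hidden, out = forward(ws, v, x)
        delta = (out - y) * out * (1.0 - out)
        for j, hj in enumerate(hidden):
            grad_v[j] += delta * hj
            back = delta * v[j] * hj * (1.0 - hj)
            for k, xk in enumerate(x):
                grad_ws[j][k] += back * xk
    return grad_ws, grad_v


def euler_step(ws, v, xs, ys, h):
    """One explicit Euler step of the gradient system in all weights."""
    grad_ws, grad_v = gradients(ws, v, xs, ys)
    new_ws = [[wk - h * gk for wk, gk in zip(w, g)] for w, g in zip(ws, grad_ws)]
    new_v = [vj - h * gj for vj, gj in zip(v, grad_v)]
    return new_ws, new_v


# Toy data and initial weights (illustrative only).
xs = [(1.0, 0.5), (-0.5, 1.0), (0.3, -0.8)]
ys = [0.9, 0.2, 0.6]
ws0 = [[0.1, -0.2], [0.3, 0.4]]
v0 = [0.5, -0.5]
cost_before = cost(ws0, v0, xs, ys)
ws1, v1 = euler_step(ws0, v0, xs, ys, h=0.1)
cost_after = cost(ws1, v1, xs, ys)
```

A single small step lowers the cost, in line with the Lyapunov property of the cost along the gradient flow.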
Let us introduce the following scalar variables:
The function (3.2), expressed in new variables, takes the following form
where is the Gram matrix defined by the training set (2.4).
The next theorem is a generalisation of Theorem 2.1. Let and , .
The generalised gradient system admits functionally independent first integrals
The cost function defined by (3.5) is a Lyapunov function for
One verifies by direct differentiation that is a first integral of . A rather tedious but elementary calculation shows that along the solutions of (see also Theorem 4.1 for the general proof). ∎
Indeed, let and be two matrices defined by
where the constant matrix is the same as in (2.20) and “” is the Kronecker matrix product.
The practical implementation of the overfly algorithm in the 1–layer case is analogous to the one described in Section 2. Instead of modifying the weights of the first layer at every step of the gradient descent, one updates the values of and by applying the Euler method to solve the observer equations (3.8).
For the sake of simplicity, we provide below the explicit matrix form of the system (3.8), which is better adapted to numerical implementations. We introduce the following diagonal matrices:
and the –vector
Let be the –vector of the output layer. The observer system (3.8) can be written in the following compact form
4. General multilayer case
We want to analyse a general multilayer neural network with the architecture . Here is the number of inputs and
is the number of neurones in the very first layer. The network has only one output, and the same sigmoid function (2.1) is used in every layer. The training set is defined by (2.4). Let , be the weight vectors of the neurones of the first layer. We denote by the weights of the other layers of the network. Let be the input vector. The generic multilayer neural network can be written as the composition of two maps:
where , is defined jointly by all layers different from the first one and
is the output vector of the first layer.
Using the chain rule one obtains for every:
where, according to (4.2),
We can compute now the partial derivatives of the cost function
with respect to the weights of the first layer:
The equation of the gradient system corresponding to the weight vector can be written as
Introducing the variables
the above equations can also be written as
Indeed, and are functions of and only. Moreover, the same holds for the cost function defined in (4.6) and its gradient : they can be written as functions of variables and .
Let be the dimension of the null space of the Gram matrix and . We note .
where , are the standard scalar products defined respectively in the spaces and , where is the total number of splitting weights and is the total number of weights of the neural network (4.1). With the help of (4.11) one writes:
The overfly algorithm for neural network training, already described in the previous sections, can easily be adapted to the general multilayer case. The only difference from conventional backpropagation applied to the network (4.1) consists in replacing the weights of the first layer by the splitting weights , while continuing to update the weights of the other layers according to the usual backpropagation algorithm. At each iteration step, the evolution of the parameters is governed by the Euler discretisation of the observer system (4.17).
5. Conclusion and numerical results
In this section we compare the usual backpropagation and the overfly methods for some particular neural networks. We start with the simple case (2.3) without hidden layers.
We put and . Let and the input values be defined by
with the corresponding output vector :
The couple defines the training set (2.4).
Analysing the equation , with defined in (2.5), one calculates, with the help of the RootFinding routine of Maple 10, two local minima and (see Figure 1) of the cost function at the points , and , with being the global minimum of .
To calculate the vector , corresponding to , one can apply Lemma 2.2 to find
Now, following the overfly approach, we consider the observer system (2.19) with and initial conditions with the perturbation vector defined by
The Euler method, applied to (2.19) with , provides after iterations the value with . Since is sufficiently close to , we conclude that the overfly network converges to the global minimum rather than to the local one . In this example, the benefit of the overfly training is immediately visible.
We have tested the overfly method numerically for a neural network (3.1). It has inputs and hidden layer with neurones (). Both the hidden and the output layers have biases. The input data set has entries arranged into the following matrix :
The columns of were chosen randomly and have zero mean. The output target vector is of the form
and corresponds to a highly deviated data set. In particular:
Firstly, the standard neural network (3.1) was trained on the above data set using the usual backpropagation method (BM), with weights and chosen randomly in the interval . The number of iterations was with the step size .
Then the overfly algorithm was applied, as described in Section , with randomly chosen initial splitting weights , the same and the dissipation parameter . The observer system was solved by the Euler method with the same step size and the same number of iterations . At each iteration we computed the cost function value for both methods, using the formula (3.2) for BM and the expression (3.5) for the overfly method (OM). The final cost value after iterations was for BM and for OM, with the ratio . Thus the overfly algorithm significantly outperforms conventional backpropagation on this particular problem. Figure 2 shows the graphs of both cost functions on a logarithmic scale. We note that our example is quite generic, since our numerical experiments show that, statistically, OM gives more precise results than BM for output data sets with large deviation.
We note that there is an obvious resemblance between the conventional backpropagation and overfly approaches. Below we briefly summarise the principal steps of the proposed method.
Step 1: Splitting. Assuming that the training data is given, it is first necessary to compute the generating vectors of the null space of the matrix , i.e. to determine . Secondly, one introduces the splitting weights (4.9) to replace the weights of the neurones of the first layer. In practice, the number of training examples can be considerably larger than the input size of the network , so the splitting introduces additional parameters to be stored in memory.
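The null–space computation of Step 1 can be sketched with plain Gauss–Jordan elimination. This is only one possible way to obtain generating vectors, chosen here for self-containedness; the function name `null_space`, the tolerance `tol` and the toy inputs are assumptions, and a production code would more likely rely on an SVD or QR routine from a numerical library.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def null_space(matrix, tol=1e-10):
    """Basis of the null space via reduced row echelon form (partial pivoting)."""
    rows = [[float(x) for x in r] for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivots = []
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        # Pick the largest entry in column c at or below row r.
        p = max(range(r, n_rows), key=lambda i: abs(rows[i][c]))
        if abs(rows[p][c]) < tol:
            continue  # no pivot here: column c corresponds to a free variable
        rows[r], rows[p] = rows[p], rows[r]
        piv = rows[r][c]
        rows[r] = [x / piv for x in rows[r]]
        for i in range(n_rows):
            if i != r:
                f = rows[i][c]
                rows[i] = [xi - f * xr for xi, xr in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    free = [c for c in range(n_cols) if c not in pivots]
    basis = []
    for fc in free:
        vec = [0.0] * n_cols
        vec[fc] = 1.0
        for row_idx, pc in enumerate(pivots):
            vec[pc] = -rows[row_idx][fc]
        basis.append(vec)
    return basis


# Four toy inputs in R^2: the 4x4 Gram matrix has rank 2 and nullity 2.
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
G = [[dot(u, v) for v in xs] for u in xs]
basis = null_space(G)
residual = max(abs(dot(row, a)) for a in basis for row in G)
```

With more training vectors than inputs the Gram matrix is rank deficient, so the number of basis vectors returned equals the number of training vectors minus the rank, as in the toy case above.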
Step 2: Dissipation. Using the vectors spanning one creates a procedure computing the dissipation term defined by (2.20). The matrix inversion in (2.20) can be done, in the beginning, using the conjugate gradient algorithm [