On the Role of Time in Learning

07/14/2019 ∙ by Alessandro Betti, et al. ∙ Università di Siena ∙ UNIFI

By and large, the process of learning concepts that are embedded in time is regarded as quite a mature research topic. Hidden Markov models and recurrent neural networks are, amongst others, successful approaches to learning from temporal data. In this paper, we claim that the dominant approach, which minimizes appropriate risk functions defined over time by classic stochastic gradient, might miss the deep interpretation of time given in other fields like physics. We show that a recent reformulation of learning according to the principle of Least Cognitive Action is better suited whenever time is involved in learning. The principle gives rise to a learning process that is driven by differential equations, which can describe the process within the same framework as other laws of nature.




1 Introduction

The process of learning has recently been formulated within the framework of laws of nature derived from variational principles [1]. While that paper addresses some fundamental issues concerning the links with mechanics, a major open problem is the satisfaction of the boundary conditions of the Euler-Lagrange equations of learning.

This paper springs from recent studies, especially on the problem of learning visual features [3, 4, 2], and it is also stimulated by an analysis of the interpretation of the equations of Newtonian mechanics in the variational framework [5]. It is pointed out that the formulation of learning as Euler-Lagrange (EL) differential equations is remarkably different from classic gradient flow. The difference originates mostly from the continuous nature of time: while gradient flow has a truly algorithmic flavor, the EL equations of learning, which are the outcome of imposing a null variation of the action, can be interpreted as laws of nature.

The paper shows that learning is driven by fourth-order differential equations that collapse to second order under an intriguing interpretation connected with the result given in [5] concerning the emergence of Newtonian laws.

2 Euler-Lagrange equations

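Consider an integral functional $F\colon X \to \mathbb{R}$ of the following form

$$F(x) = \int_0^T L(t, x(t), \dot{x}(t))\,dt, \tag{1}$$

where $L$ maps a point $(t, z, p)$ into the real number $L(t, z, p)$ and $x\colon [0, T] \to \mathbb{R}^n$ is a map of $X$. Consider a partition $0 = t_0 < t_1 < \dots < t_N = T$ of the interval $[0, T]$ into $N$ subintervals of length $\varepsilon = T/N$. Given a function $x \in X$ one can identify the point $(x(t_0), \dots, x(t_N))$, and in general one can define the subset of $\mathbb{R}^{n(N+1)}$

$$X_\varepsilon := \{(x(t_0), \dots, x(t_N)) : x \in X\}.$$

Now consider a point $(x_0, \dots, x_N) \in X_\varepsilon$ and the following "approximation" of the functional integral $F$:

$$F_\varepsilon(x_0, \dots, x_N) = \varepsilon \sum_{k=0}^{N-1} L(t_k, x_k, \Delta_\varepsilon x_k),$$

where $\Delta_\varepsilon x_k = (x_{k+1} - x_k)/\varepsilon$. The stationarity condition on $F_\varepsilon$ is $\nabla F_\varepsilon = 0$, thus for every interior index $k = 1, \dots, N-1$ we have

$$0 = \frac{\partial F_\varepsilon}{\partial x_k} = \varepsilon \sum_{i=0}^{N-1} \Big[ L_z(t_i, x_i, \Delta_\varepsilon x_i)\,\frac{\partial x_i}{\partial x_k} + L_p(t_i, x_i, \Delta_\varepsilon x_i)\,\frac{\partial \Delta_\varepsilon x_i}{\partial x_k} \Big].$$

Using the fact that $\partial x_i/\partial x_k = \delta_{ik}$ and $\partial \Delta_\varepsilon x_i/\partial x_k = (\delta_{i+1,k} - \delta_{ik})/\varepsilon$ we get

$$\frac{\partial F_\varepsilon}{\partial x_k} = \varepsilon\, L_z(t_k, x_k, \Delta_\varepsilon x_k) - \big[ L_p(t_k, x_k, \Delta_\varepsilon x_k) - L_p(t_{k-1}, x_{k-1}, \Delta_\varepsilon x_{k-1}) \big].$$

This means that the condition $\nabla F_\varepsilon = 0$ implies

$$L_z(t_k, x_k, \Delta_\varepsilon x_k) - \Delta_\varepsilon L_p(t_{k-1}, x_{k-1}, \Delta_\varepsilon x_{k-1}) = 0, \tag{4}$$

where, consistently with our previous definition, we are assuming that $\Delta_\varepsilon L_p(t_{k-1}, x_{k-1}, \Delta_\varepsilon x_{k-1}) = \big[ L_p(t_k, x_k, \Delta_\varepsilon x_k) - L_p(t_{k-1}, x_{k-1}, \Delta_\varepsilon x_{k-1}) \big]/\varepsilon$.

This last equation is indeed the discrete counterpart of the Euler-Lagrange equations in the continuum:

$$L_z(t, x(t), \dot{x}(t)) - \frac{d}{dt} L_p(t, x(t), \dot{x}(t)) = 0. \tag{5}$$

The discovery of stationary points of the cognitive action defined by Eq. 1 is somewhat related to the gradient flow that one might activate to optimize $F_\varepsilon$, namely by the classic updating rule

$$X^{(j+1)} = X^{(j)} - \eta\, \nabla F_\varepsilon(X^{(j)}), \qquad X = (x_0, \dots, x_N), \quad \eta > 0.$$

This flow is clearly different from Eq. 4 (see also its continuous counterpart, Eq. 5). Basically, while the Euler-Lagrange equations yield a computational model that updates $x_{k+1}$ from $x_k$ and $x_{k-1}$ as time advances, the gradient flow moves the whole vector $(x_0, \dots, x_N)$ at each iteration.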
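The relation between the gradient of the discretized action and the discrete Euler-Lagrange condition can be checked numerically. The sketch below is illustrative only. It assumes a forward-difference discretization with step `eps`, uses the hypothetical Lagrangian L(t, z, p) = p²/2 + z²/2 (not taken from the paper), and all function names are invented for the example. It compares the finite-difference gradient at the interior nodes with the Euler-Lagrange residual, then generates a trajectory forward in time from the Euler-Lagrange recursion and evaluates the residual along it.

```python
import numpy as np

# Hypothetical Lagrangian chosen for illustration (not from the paper):
# L(t, z, p) = p^2 / 2 + z^2 / 2, so that L_z = z and L_p = p.
def lagrangian(t, z, p):
    return 0.5 * p**2 + 0.5 * z**2

def discrete_action(x, eps):
    """Forward-difference action: eps * sum_k L(t_k, x_k, (x_{k+1} - x_k) / eps)."""
    t = eps * np.arange(len(x) - 1)
    p = np.diff(x) / eps
    return eps * np.sum(lagrangian(t, x[:-1], p))

def el_residual(x, eps):
    """eps * L_z(k) - (L_p(k) - L_p(k-1)) at the interior nodes k = 1, ..., N-1."""
    p = np.diff(x) / eps   # L_p at nodes 0..N-1 for this Lagrangian
    l_z = x[:-1]           # L_z at nodes 0..N-1 for this Lagrangian
    return eps * l_z[1:] - (p[1:] - p[:-1])

def numerical_gradient(f, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

def el_forward(x0, x1, n_nodes, eps):
    """Build a trajectory forward in time from the discrete Euler-Lagrange recursion.

    For this Lagrangian, stationarity at node k gives
    x_{k+1} = 2 x_k - x_{k-1} + eps^2 x_k.
    """
    x = np.empty(n_nodes)
    x[0], x[1] = x0, x1
    for k in range(1, n_nodes - 1):
        x[k + 1] = 2.0 * x[k] - x[k - 1] + eps**2 * x[k]
    return x

eps = 0.1
rng = np.random.default_rng(0)
x_random = rng.normal(size=21)
grad = numerical_gradient(lambda x: discrete_action(x, eps), x_random)
# Do the interior gradient components match the Euler-Lagrange residual?
print(np.allclose(grad[1:-1], el_residual(x_random, eps), atol=1e-5))

x_el = el_forward(1.0, 1.0, 21, eps)
# Largest residual along the trajectory produced by the recursion.
print(np.max(np.abs(el_residual(x_el, eps))))
```

A gradient-descent step would instead update all nodes at once, for example `x_random - 0.01 * grad`, which is the algorithmic flavor that the text contrasts with the forward-in-time recursion.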

3 A surprising link with mechanics

Let us consider the action


The Euler-Lagrange equations are


Since we have


In case we make no assumption on the variation then these equations must be joined with the boundary condition . Now suppose , with . Then Eq. 9 becomes


The Lagrangian , with and , and , is the one used in mechanics, which returns the Newtonian equations

of the damped oscillator. We notice in passing that this equation arises when choosing the classic action from mechanics, which does not seem to be adequate for machine learning, since the potential (analogous to the loss function) and the kinetic energy (analogous to the regularization term) come with different signs. It is also worth mentioning that the trivial choice

yields a pure oscillation with no dissipation, whereas dissipation is a fundamental ingredient of learning.
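For reference, the standard route from a time-weighted action to the damped Newtonian equation can be sketched as follows. The symbols used here ($q$ for the trajectory, $m$ for the mass, $\theta > 0$ for the dissipation rate, $V$ for the potential, and the weight $e^{\theta t}$) are assumptions made for this sketch and may differ from the notation of the original equations.

```latex
% Assumed notation: q(t) trajectory, m mass, theta > 0 dissipation rate, V potential.
\begin{align*}
\mathcal{A}(q) &= \int_0^T e^{\theta t}\Big(\tfrac{1}{2}\, m\,\dot q^2 - V(q)\Big)\,dt, \\
0 &= \frac{d}{dt}\Big(e^{\theta t}\, m\,\dot q\Big) + e^{\theta t}\,V'(q)
   = e^{\theta t}\Big(m\,\ddot q + \theta m\,\dot q + V'(q)\Big), \\
&\Longrightarrow\quad m\,\ddot q + \theta m\,\dot q + V'(q) = 0 .
\end{align*}
```

Under these assumptions, setting $\theta = 0$ removes the dissipative term and leaves the undamped equation $m\,\ddot q + V'(q) = 0$, while letting $m \to 0$ with the product $m\theta$ held fixed leaves the first-order equation $m\theta\,\dot q = -V'(q)$.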

This Lagrangian, however, does not convey a reasonable interpretation for a learning theory, since one would very much like , so that could be nicely interpreted as a temporal regularization parameter. Before exploring a different interpretation, we notice in passing that large values of , which correspond to strong dissipation on small masses, yield the gradient flow
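The overdamped limit can be illustrated numerically. The sketch below is an example under stated assumptions, not the paper's experiment: it takes the damped Newtonian equation in the form m q'' + θ m q' + V'(q) = 0, uses the hypothetical quadratic potential V(q) = q²/2, keeps the product mθ fixed at 1 while the mass shrinks, and compares the result with the gradient flow q' = -V'(q). The function name, step size and parameter values are invented for illustration.

```python
import math

def simulate_damped(mass, theta, q0=1.0, t_end=2.0, dt=1e-4):
    """Explicit Euler for mass*q'' + theta*mass*q' + q = 0 (V(q) = q^2/2), starting at rest."""
    q, v = q0, 0.0
    for _ in range(int(round(t_end / dt))):
        a = -theta * v - q / mass          # acceleration from the assumed equation
        q, v = q + dt * v, v + dt * a      # both updates use the old state
    return q

# Exact solution of the gradient flow q' = -q at t = 2 with q(0) = 1.
gradient_flow = math.exp(-2.0)

# mass * theta = 1 in both runs, so both share the same limiting gradient flow.
gap_moderate = abs(simulate_damped(mass=0.5, theta=2.0) - gradient_flow)
gap_strong = abs(simulate_damped(mass=0.01, theta=100.0) - gradient_flow)
print(gap_moderate, gap_strong)
```

With these hypothetical values, the run with the smaller mass and stronger dissipation ends closer to the gradient-flow solution than the moderately damped run, which still oscillates.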

4 Laws of learning and gradient flow

While the discussion in the previous section provides a somewhat surprising link with mechanics, the interpretation of learning as a least-action problem is not very satisfactory since, just like in mechanics, we only end up at stationary points of the action, which are typically saddle points.

We will see that an appropriate choice of the Lagrangian function yields genuine laws of nature in which the Euler-Lagrange equations turn out to minimize corresponding actions that are appropriate to capture learning tasks. We consider kinetic energies that also involve the acceleration and two different cases, which depend on the choice of . The new action is


where . In the continuum setting, the corresponding Euler-Lagrange equations can be determined by considering the variation associated with , where is a variation and . We have


If we integrate by parts, we get

and, therefore, the variation becomes

Now, suppose we give the initial conditions on and . In that case we can promptly see that this is equivalent to setting and . Hence, we get the Euler-Lagrange equation when setting

Now if we choose as a constant we immediately get


while if we choose as an affine function, when considering the above condition we get


Finally, the stationary point of the action corresponds to the Euler-Lagrange equations


which hold along with Cauchy initial conditions on and the boundary conditions of Eqs. 13 and 14.
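The computation above follows the standard pattern for a Lagrangian that depends on the acceleration. As a generic sketch, with assumed symbols $q$ for a scalar trajectory, $v$ for the variation, $s$ for the small parameter and $L(t, q, \dot q, \ddot q)$ for the Lagrangian (the paper's specific kinetic and potential terms are not reproduced here), two integrations by parts give:

```latex
% Generic second-order Lagrangian; symbols are assumptions of this sketch.
\begin{align*}
\frac{d}{ds}\Big|_{s=0}\int_0^T L(t,\, q+sv,\, \dot q+s\dot v,\, \ddot q+s\ddot v)\,dt
 &= \int_0^T \big(L_q\,v + L_{\dot q}\,\dot v + L_{\ddot q}\,\ddot v\big)\,dt \\
 &= \int_0^T \Big(L_q - \frac{d}{dt}L_{\dot q} + \frac{d^2}{dt^2}L_{\ddot q}\Big)\,v\,dt
    + \Big[\Big(L_{\dot q} - \frac{d}{dt}L_{\ddot q}\Big)\,v + L_{\ddot q}\,\dot v\Big]_0^T .
\end{align*}
```

With $v(0) = \dot v(0) = 0$, the integral term yields the equation $L_q - \frac{d}{dt}L_{\dot q} + \frac{d^2}{dt^2}L_{\ddot q} = 0$, which is of fourth order when $L$ depends quadratically on $\ddot q$. The remaining terms at $t = T$ vanish for every admissible variation only if $L_{\ddot q} = 0$ and $L_{\dot q} - \frac{d}{dt}L_{\ddot q} = 0$ at $t = T$.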

Now, let us consider the case in which . The Euler-Lagrange equations become


If we consider again the case we get


Now we consider the kinetic energy associated with the differential operator


Let us consider the following two different cases of . In both cases, they convey the unidirectional structure of time.

  1. In this case, when plugging the kinetic energy in Eq. 18 into Eq. 17 we get


    These equations hold along with Cauchy conditions and boundary conditions given by Eq. 13 and 14, that turn out to be


    One way to satisfy them is . Notice that as the Euler-Lagrange Eq. 19 reduces to


    and the corresponding boundary conditions are always verified.

  2. Let us assume that in the kinetic energy 18 and . In particular we consider the action


    In this case the Lagrange equations turn out to be


    along with the boundary conditions


    Interestingly, as the Euler-Lagrange equations become:


    where the boundary conditions are always satisfied.

Notice that, while we can choose the parameters in such a way that Eq. 19 is stable, the same does not hold for Eq. 24. Interestingly, stability can be gained for , which corresponds to a singular solution. Basically, if we denote by the solution associated with , we have that does not approximate the corresponding at in the case in which we can choose arbitrarily large domains .
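For linear equations with constant coefficients, stability claims of this kind can be checked from the roots of the characteristic polynomial: the zero solution is asymptotically stable exactly when every root has negative real part. The helper below is a generic sketch. Its name and the example coefficients are hypothetical, and they are not the coefficients of Eq. 19 or Eq. 24.

```python
import numpy as np

def is_asymptotically_stable(coefficients):
    """Return True when all characteristic roots have negative real part.

    `coefficients` lists the polynomial coefficients from the highest
    derivative down, e.g. [a, b, c] for a*q'' + b*q' + c*q = 0.
    """
    roots = np.roots(coefficients)
    return bool(np.all(roots.real < 0.0))

# Hypothetical second-order example: q'' + 2 q' + 2 q = 0 (roots -1 +/- i).
second_order = is_asymptotically_stable([1.0, 2.0, 2.0])

# Hypothetical fourth-order example: q'''' - q = 0 (roots 1, -1, i, -i).
fourth_order = is_asymptotically_stable([1.0, 0.0, 0.0, 0.0, -1.0])

print(second_order, fourth_order)
```

The same call can be applied to any candidate set of coefficients once the parameters of the equation under study are fixed.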

5 Conclusions

While machine learning is typically framed in the statistical setting, here time is exploited in such a way that one relies on a sort of underlying ergodic principle according to which statistical regularities can be captured in time. This paper shows that the continuous nature of time gives rise to computational models of learning that can be interpreted as laws of nature. Unlike traditional stochastic gradient, the theory suggests that, just like in mechanics, learning is driven by Euler-Lagrange equations that minimize a sort of functional risk. The collapse from fourth- to second-order differential equations opens the door to an in-depth theoretical and experimental investigation.


We thank Giovanni Bellettini for insightful discussions.