In machine learning one typically faces the longstanding dilemma of whether to use on-line or batch-mode learning. Clearly, the trend towards on-line schemes is strongly motivated by the need of learning on huge training sets, though the forgetting behavior of stochastic gradient descent algorithms has promoted intermediate solutions based on weight updating on mini-batches of data. When focusing on perceptual tasks, it is worth mentioning that the iteration steps of on-line weight updating algorithms do not fully capture the notion of time. Whenever time does matter, appropriate computational models have been proposed with the purpose of capturing temporal dependencies (see e.g. hidden Markov models and recurrent neural networks). In these cases, the relation between the iteration step of the learning algorithm operating on sequences and time is quite involved and very much depends on the specific approach that has been adopted (see e.g. the differences between recurrent networks with and without state relaxation towards a fixed point [1, 2, 3]).
A close relation between time and neural activations is typically assumed in neurobiological models, since in this case one is directly involved with natural processes. Recently, there has been a growing interest in the formulation of computational models of learning that are fully embedded in time. In response to the rising concerns on the biological plausibility of the backpropagation algorithm, an extension of the framework of “equilibrium propagation” has been proposed that does not compute the true gradient of the objective function [5, 4]. Using statistical physics, it has been shown that the inference processes from our sensory inputs and the learning of regularities can be described by the same principles.
The perceptual processes turn out to be an emergent property of systems that conform to a free-energy principle. Also in this case, the corresponding computational models are truly embedded in time, which leads to an extension of the gradient heuristics.
The approach followed in this paper is based on a formulation of learning that parallels the principle of least action in analytic mechanics, where the potential energy is related to the loss function, while a generalized form of the kinetic energy is used to model the temporal evolution of the model parameters (weights). While the proposed principle draws interesting paths to explore, there are at least a couple of remarkable issues that need to be addressed. First, unlike mechanics, it is clear that the corresponding variational index, which is referred to as the cognitive action, needs to be minimized. The authors of that work did not stress minimization issues, thus fully paralleling the approach used in mechanics, where one only looks for stationary points of the action. While the proposed energy balancing approach somewhat helps in understanding the dynamics, there was no effort to discover the minimum of the action, which is an important requirement for deeply understanding learning processes. Second, a more serious shortcoming of that work is that the concrete interpretation of the learning processes driven by the Euler-Lagrange equations derived from the cognitive action requires the fulfillment of boundary conditions, which are typically violated when following the inspiration from analytic mechanics, where one drives the trajectory from Cauchy’s initial conditions. This is a serious problem, since we need a causal dynamic computational model to provide a truly on-line update of the parameters and, at the same time, we need to satisfy consistency with the boundary conditions. The causality of the model is in fact required to gain computational tractability, since any algorithmic search for the satisfaction of the boundary conditions seems to be hopeless. The typical assumption behind learning policies is that of discovering constant weights at the end of the learning process, which corresponds to imposing that all the temporal derivatives of the weights are null.
In this paper we address both limitations. First, we provide sufficient conditions for achieving the minimum of the cognitive action. As will be shown, unlike mechanics, this requires choosing an action where the kinetic energy and the potential come with the same sign. This confers on the kinetic energy the role of classic regularization terms in machine learning. Second, we solve the problem of making the boundary conditions consistent with Cauchy initialization, so that the Euler-Lagrange equations turn out to be a causal computational model. This is made possible by enforcing special dynamics over a manifold that corresponds to trajectories driven by an appropriate manipulation of the input. In particular, we give conditions such that, when the input is turned to zero, all the temporal derivatives of the weights are quickly reset, thus respecting the boundary condition on the right border. Basically, the model turns out to be strongly stable, which supports the need of quick dynamics. The intuition behind this solution comes from the principle of avoiding information overload, which is somewhat similar to related ideas where the agent is expected to be exposed to the environment according to a certain teaching plan.
A fundamental result in this paper is that the analysis of the minimality of the action, joined with the need of enforcing stability, leads to a choice of the kinetic terms that yields fourth-order associated Euler-Lagrange differential equations of learning and inference which, throughout the paper, are referred to as the Cognitive Action Laws (CAL). Fourth order turns out to be the minimum order that guarantees the above conditions, which confers a special meaning on the structure of the CAL presented in this paper.
The proposed theory offers a framework to gain an in-depth understanding of the dynamics of learning processes related to stochastic gradient descent, to which they reduce under an appropriate selection of the action parameters. After having properly framed the CAL equations into the discrete setting of computation, we carry out a preliminary experimental analysis of the theory on problems of visual feature extraction. The purpose of this analysis is not to address classic computer vision benchmarks, but to provide an experimental assessment of the novel concepts introduced in the theory in a specific example. In particular, we introduce an unsupervised learning process that is based on the maximization of the mutual information from the video signal to a set of symbols. Basically, the mutual information turns out to be the potential of the action, while the kinetic term drives the temporal evolution. The results show that the theory leads to a consistent unsupervised scheme whose features resemble typical feature extraction in convolutional neural networks. This opens the doors to a systematic adoption of the theory, where the agent is simply exposed to its own visual environment. In a sense, this is a new perspective in which one can think of the learning of “living agents” whose behavior is driven by information-based laws.
2 Cognitive action
Human cognitive processes do not present a neat distinction between training and test sets. As time goes by, humans react surprisingly well to new stimuli, which suggests looking for alternative foundations of learning by embedding the agent into its own learning environment, so that we can think of learning as the outcome of laws of nature. This view of learning relies on the principle that the acquisition of cognitive skills obeys information-based laws, which hold regardless of biology. Following previous work, we establish a link with mechanics by paralleling the weights of a neural network with the Lagrangian coordinates of a system of particles. For reasons that will become clear in the rest of the paper, given an agent in its own environment, the following functional, referred to as the cognitive action, will be used to drive inferential and learning processes:
Here, the weights of the learning system are the coordinates of a Lagrangian function composed of a regularization term and of a potential term. In particular, we choose the following kinetic term
that somewhat characterizes the presence of an ongoing learning process. The choice of this kinetic energy is related to the one adopted in previous work. As will be shown in the remainder of the paper, the incorporation of second-order derivatives turns out to be useful when considering the need of minimizing the cognitive action while enforcing stability in the associated Euler-Lagrange equations. Notice that the kinetic energy is a sort of temporal regularization term that, once minimized, leads to weights that settle to constant values, while the quadratic term is the classic regularization term that favors solutions with small weights. The potential depends on the function that describes the input information coming from the interactions with the learning environment. For example, in computer vision, this is the video signal from which the agent is expected to learn. The purpose of learning is that of achieving conditions under which the potential is driven to small values as time grows. Hence, according to machine learning terminology, the potential can be regarded as a loss function. It is worth mentioning that, under an appropriate choice of the parameters, the cognitive action reduces to the classic action of analytic mechanics, where the regularization term is aggregated with the potential. Interestingly, in this case, one can look for learning processes that turn out to be stationary points of the action, while they are not necessarily minima. Unlike previous work, here we are mostly concerned with the extended notion of action, where we want to discover minimum configurations. Overall, the Lagrangian (1) contains a factor that provides a growing weight as time goes by. This term is equivalent under re-scaling of the cognitive action, and clearly prescribes that the weight is gained at the end, while past information is gradually forgotten.
A clear motivation for this weighing will be given in the following, but one can easily notice that it is a sort of discount factor that leads to focusing more on recent information. With appropriate substitutions for the parameters, Eq. (1) can be rewritten as
The interpretation of learning by means of functional (2) is especially interesting since, unlike the case of the classic action in mechanics, it admits a minimum under appropriate conditions. Of course, this property makes it more attractive for machine learning.
If the following coercivity conditions
hold true, then the functional defined by Eq. (2) admits a minimum on the set
See Appendix A. ∎
We report a few qualitative comments in order to understand this result. First, we notice that, unlike mechanics, the sign choice helps the lower boundedness of the functional. This is immediately clear for the quadratic term, but the remaining kinetic term will be proven to play a fundamental role in the approximation of stochastic gradient dynamics. Its sign contributes to developing solutions that generate consistent learning trajectories while minimizing the action. Suppose the weights are growing. Then a trajectory in this regime yields a concave function that clearly contributes to minimizing the action. Likewise, the same holds in the opposite regime, which acts coherently while enforcing convexity. In light of these comments, the coercivity conditions (3) clearly contribute to imposing a lower bound on the functional. The reason is that the corresponding choices of the parameters lead to attributing a relevant weight to the second- and first-order kinetic terms which, unlike the others, are positive.
In order to determine the minimum, we must impose the conditions for determining stationary points, which require the fulfillment of the Euler-Lagrange equations. We will perform the variation of the functional under the general assumption that the coefficients have an explicit dependence on time. Although this does not change the structure of the resulting differential equation, it will turn out to be useful in the remainder of the paper. In order to simplify the calculation of the variation, we use the following equivalent expression:
where the coefficients are defined accordingly. Now, let us consider the variation of the trajectory. In the analysis below, we will repeatedly use the fact that the variation vanishes at the initial time, which corresponds to the assignment of the initial values. Since we want to provide a causal computational framework, this is in fact the first step in that direction. The stationarity condition for the functional is the vanishing of the first variation. (Here and in the rest of the paper, we sometimes simplify the notation by removing the explicit dependence on time.)
According to the fundamental lemma of variational calculus, at a minimum the above expression should hold for all the allowed variations, so that we can proceed as follows:
Consider only the variations that vanish at the boundary. In this case the stationarity condition yields the following differential equations
Because of Eq. (5), the first variation reduces to the boundary terms. Moreover, since these terms can be chosen independently of each other, the vanishing of the first variation also implies that
The set of triples that satisfy this condition is denoted accordingly. Now, if we restore the explicit dependence of the coefficients on the weighing term, then Eq. (5) reads
These equations are referred to as the Cognitive Action Laws (CAL) of learning. Notice that any non-linearity in this differential equation resides entirely in the gradient term; the remaining part is a linear ODE. Joining the result stated by Theorem 1 with the above analysis, we can state the following theorem:
Notice that the two initial conditions, along with the right-boundary conditions (6), guarantee the existence of the minimum. While this is an interesting result, unfortunately, Theorem 2 does not offer a direct computational scheme for determining the minimum. There is in fact the typical causality issue that arises whenever one wants to optimize over time. Basically, the additional Neumann-like conditions (6) may be in conflict with the Cauchy conditions to be used for a causal solution. As pointed out in the introduction, this was in fact one of the important problems left unsolved in previous work. In the next section, however, we will show that a careful modulation of the input signal, which does not change the nature of the learning and inference task, allows us to establish a causal computational scheme that verifies, with an arbitrary degree of precision, the conditions (6).
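To fix ideas on how the CAL can be turned into a causal computational scheme, the following sketch integrates a fourth-order CAL-like equation by reducing it to a first-order system and stepping it with the explicit Euler method. The coefficient names `a, b, c, d` and the quadratic potential are illustrative placeholders, not the paper's exact parameters.

```python
import numpy as np

def cal_euler_step(state, grad_U, coeffs, dt):
    """One explicit-Euler step of a fourth-order linear ODE with a
    gradient forcing term,
        w'''' + a*w''' + b*w'' + c*w' + d*w + grad_U(w) = 0,
    reduced to a first-order system in (w, w', w'', w''')."""
    w, w1, w2, w3 = state
    a, b, c, d = coeffs
    w4 = -(a * w3 + b * w2 + c * w1 + d * w + grad_U(w))
    return (w + dt * w1, w1 + dt * w2, w2 + dt * w3, w3 + dt * w4)

# Illustrative run: quadratic potential U(w) = w^2/2, so grad_U(w) = w.
# With coefficients (4, 6, 4, 0) the overall characteristic polynomial
# is (s + 1)^4, so the weight and all its derivatives decay to zero,
# i.e. the "constant weights at the end of learning" regime.
state = (1.0, 0.0, 0.0, 0.0)
for _ in range(5000):
    state = cal_euler_step(state, lambda w: w, (4.0, 6.0, 4.0, 0.0), 0.01)
```

Cauchy initialization here amounts to fixing the four entries of `state` at the initial time, which is exactly what a causal solver can do.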
3 Boundary conditions and developmental issues
The theoretical results of the previous section suggest formulating learning and inference as the problem of determining
where the functional space is obtained by augmenting the original space with left and right boundary conditions:
Clearly, we cannot search in this space with a causal computational scheme, which would also require setting the values of the higher-order initial derivatives. These two additional conditions lead in fact to an overdetermined set of equations for finding the coefficients of the CAL equations. Hence, the chosen Cauchy conditions likely violate the Neumann-like conditions (6). It is worth mentioning that the degree of violation likely depends on the dynamics, and that the right-boundary conditions collapse to the Cauchy conditions in the case of null initial conditions. An early discussion on how to make causality consistent with the Neumann-like conditions (6) was given in previous work (Ch. 6), where the basic idea is naturally framed in the context of developmental learning, according to which one does not overload the agent with all the available information, but filters it properly so as to gain a causal optimization. In other words, the environmental information is presented gradually to the agent so as to favor the acquisition of small chunks, for which the approximate satisfaction of the boundary conditions is facilitated. The gradual exposure of the agent can also benefit from an appropriate filtering of the input with the purpose of reducing the associated information. In the experimental results reported in Section 5 we will provide evidence that a suitable modulation of the input signal, which does not change the nature of the problem, allows us to solve Eq. (5) with Cauchy conditions in such a way that the boundary conditions (6) are satisfied.
Now we show that there is another way of controlling the information overloading. Intuitively, consistency with the boundary conditions can be gained by two decoupled dynamics in Eq. (5), one of which performs a “reset” of the derivatives. Hence, the divergent dynamics that arises from Cauchy’s initialization and from the natural rhythm of incoming information can be controlled by discharging the learning state accumulated in the weights. While such decoupled dynamics are admissible in the general model of the CAL equations, we will show that we can always choose the parameters in such a way as to implement the “reset”. As in the previous discussion on developmental issues, where the information is supposed to be presented gradually, the rationale behind this result is that learning processes are typically consistent with the temporary detachment of the input for arbitrarily small intervals. Consider a sequence of times that defines two alternating sets of intervals and, using Iverson’s notation (the bracket of a statement equals 1 if the statement is true and 0 otherwise), let
Not only does Eq. (5) still hold, but with this special temporal dependence the equations turn out to be decoupled, for all times apart from the switching instants, as follows:
We make the fundamental assumption of controlling the input so as to reset it when we are outside the first set of intervals. This choice is natural when one bears in mind the previous discussion on the need to fulfill the boundary conditions. As a consequence, the equation that describes the temporal evolution in those intervals reduces to the linear equation
Now we state two important theorems showing how this input control leads to matching the desired boundary conditions, thus gaining consistency with the Cauchy initialization.
See Appendix A. ∎
While this theorem guarantees the consistency between Cauchy’s initialization and the boundary conditions, one might wonder whether the reset of the derivatives of the weights in any segment can also be paired with the latching of the weights developed in the previous segment. As stated in the following theorem, a proper choice of the roots of the characteristic polynomial in Eq. (12) guarantees that the values of the weights at the beginning and at the end of any such interval are the same to an arbitrary degree of precision. We will show that, for an appropriate choice of the roots, the achievable precision depends on the associated Vandermonde matrix, on the values of the derivatives at the start of the interval, and on a suitable constant that bounds the entries of the inverse of the Vandermonde matrix.
Let the intervals be defined as above. For every even order, it is always possible to choose the coefficients in Eq. (12) such that, with the indicated choice of the roots, the derivatives are reset within the prescribed precision on every such interval.
See Appendix A. ∎
This theorem enables the replacement of the solution of Eq. (12) with the enforcement of a reset, as described in Section 5. Basically, the information overloading associated with the temporal presentation of the source can be properly controlled by resetting all the derivatives of the weights, while keeping their values. The corresponding solution keeps all the discussed properties and, in particular, makes the Cauchy initialization consistent with the boundary conditions, an issue which was left open in previous work. The conclusion that can be drawn from Theorems 3 and 4 is quite surprising, since the reset of the derivatives turns out to be fully consistent with the causality of the problem.
4 CAL dynamics
In this section we discuss the dynamics behind the cognitive action laws stated by Eq. (7). This is important for the appropriate setup of the parameters in the application to any cognitive task, like the vision task described in the following experimental section. In particular, we will focus on the case of null input, where the free dynamics is driven by the kinetic term only. We also address the relationship of the CAL dynamics with the classic stochastic gradient and prove that the latter can be reproduced under appropriate choices of the parameters.
4.1 Free dynamics
On null input, Eq. (7) becomes a homogeneous linear ODE with constant coefficients. The solution is fully characterized by the nature of the roots of the characteristic polynomial. In particular, the behavior of the solution is mainly affected by the negativeness of the real part of the roots and by their imaginary part. The first condition ensures the asymptotic stability of the solution, while the absence of imaginary parts prevents oscillatory behavior.
The characteristic polynomial with real coefficients is asymptotically stable if and only if
The proof follows from a straightforward application of the Routh-Hurwitz criterion. ∎
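As a quick numerical companion to this criterion, the sketch below checks the asymptotic stability of a monic quartic both through the Routh-Hurwitz inequalities and by directly computing the roots; the coefficient names are generic, not the paper's.

```python
import numpy as np

def routh_hurwitz_quartic(a3, a2, a1, a0):
    """Routh-Hurwitz conditions for s^4 + a3 s^3 + a2 s^2 + a1 s + a0:
    all roots lie in the open left half-plane iff every coefficient is
    positive and a3*a2*a1 > a1**2 + a3**2 * a0."""
    return (a3 > 0 and a2 > 0 and a1 > 0 and a0 > 0
            and a3 * a2 * a1 > a1 ** 2 + a3 ** 2 * a0)

def stable_by_roots(a3, a2, a1, a0):
    """Direct check: every root has strictly negative real part."""
    return bool(np.all(np.real(np.roots([1.0, a3, a2, a1, a0])) < 0))
```

For example, `(s + 1)^4 = s^4 + 4s^3 + 6s^2 + 4s + 1` passes both checks, while `s^4 + s^3 + s^2 + s + 1` (whose roots are primitive fifth roots of unity) fails both.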
If we apply the standard substitution that removes the cubic term, then we obtain the reduced quartic equation.
The characteristic polynomial with real coefficients and with discriminant has only real roots if:
, and (4 distinct real roots);
and (roots real, only two equal);
, and (two pairs of equal real roots);
, and (all roots real, three equal);
, and .
See . ∎
If we choose such that and:
then the following conditions are jointly verified:
admits a minimum in ;
the homogeneous equation associated with Eq. (7) has the following two properties:
it is asymptotically stable;
it yields aperiodic dynamics (the roots of the characteristic polynomial are real).
4.2 Reproducing gradient flow
Let us consider Eq. (7) with and . Then this equation reduces to
We can promptly see that, in the appropriate limit, the CAL equation (7) restores the classic gradient flow with the potential acting as the loss. As anticipated in Section 2, the gradient flow arises from the first-order kinetic term, whose intuitive contribution to the system dynamics was already given. The choice of the parameters that reduces the CAL dynamics to a gradient flow transforms the boundary conditions (6) into the vanishing of the derivatives. This is in fact the ordinary condition that one expects to be matched at the end of gradient-driven learning processes, namely that the weights converge to a constant value. Clearly, such convergence tacitly assumes that the learning task presents some form of regularity to be induced. A recent result in this direction is given in the literature.
Notice that the gradient flow is also recovered from the action of analytic mechanics with strong dissipation. This corresponds to an appropriate choice of the parameters, in which case the Euler-Lagrange equation reduces to
Like Eq. (15), in the strong-dissipation limit the above equation indeed returns a gradient flow; in both cases the Euler approximation yields the classic gradient-descent update.
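The strong-dissipation limit can be checked numerically: with a small parameter `eps`, an Euler-integrated damped second-order equation tracks the plain gradient-flow trajectory. The specific damped form and all the constants below are illustrative, not the paper's exact equations.

```python
import numpy as np

def damped_step(w, v, grad_U, eps, dt):
    """Euler step for eps * w'' + w' + grad_U(w) = 0: for small eps the
    velocity v is slaved to -grad_U(w) and the motion reduces to the
    gradient flow w' = -grad_U(w)."""
    a = -(v + grad_U(w)) / eps
    return w + dt * v, v + dt * a

def gradient_step(w, grad_U, dt):
    """Plain explicit-Euler gradient-flow step."""
    return w - dt * grad_U(w)

# Quadratic potential U(w) = w^2/2; start on the slow manifold v = -w.
grad = lambda w: w
w_d, v = 1.0, -1.0
w_g = 1.0
for _ in range(10000):          # integrate up to t = 1 with dt = 1e-4
    w_d, v = damped_step(w_d, v, grad, 1e-2, 1e-4)
    w_g = gradient_step(w_g, grad, 1e-4)
```

After integrating to t = 1 both trajectories sit near exp(-1), with the damped solution deviating only through the slightly perturbed slow eigenvalue.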
The importance of incorporating the gradient flow in the CAL equations is that some of the results can also be inherited by the classic gradient descent algorithms that are massively used in most applications of machine learning. On-line stochastic gradient and gradient descent on mini-batches are typically given a foundation by their association with batch-mode gradient descent, so as to invoke a sort of overall minimization property that emerges from data redundancy. The theory herein presented offers a clear foundation for those intuitive connections in a natural framework driven by the temporal representation of the input in the context of variational calculus.
5 The case of visual features
In this section we carry out an experimental analysis aimed at understanding the dynamics of the CAL equations, with the final purpose of exploring their behavior in learning tasks. We are mostly interested in understanding the role of the different parameters in the action functional and in validating the theoretical results stated in the previous section. In addition to experimenting with the causal processing scheme, we aim at verifying the important role of filtering the input, as well as that of properly resetting the system dynamics. More specifically, we are not interested in carrying out experiments on classic benchmarks, which are typically based on large image collections, but in checking the agent behavior on real-world visual environments. With this purpose in mind, we consider the problem of unsupervised learning of visual features from videos. It is a classic perceptual task in which time plays a crucial role.
Consider the set of pixel coordinates and denote the input video by its frame at each time. We extract convolutional features from each pixel, where the coefficients of the convolutional filters are stored in a weight vector whose number of components is determined by the filter size and the number of input channels. The activations of the features on a pixel at a certain time are obtained by applying the softmax function to the convolution of the video with the filters computed at that pixel.
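The per-pixel feature computation can be sketched as follows; the naive convolution, the 'same' zero padding, and the array shapes are illustrative choices for a single-channel frame, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def feature_activations(frame, filters):
    """Per-pixel feature probabilities: convolve a grayscale frame
    (H x W) with n filters (n x k x k, zero padding) and apply a
    pixel-wise softmax so the n activations at each pixel sum to one."""
    n, k, _ = filters.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode="constant")
    H, W = frame.shape
    out = np.empty((H, W, n))
    for i in range(n):
        f = filters[i, ::-1, ::-1]  # flip kernel for true convolution
        for y in range(H):
            for x in range(W):
                out[y, x, i] = np.sum(padded[y:y+k, x:x+k] * f)
    return softmax(out, axis=-1)
```

The softmax makes the activations at each pixel a probability distribution over the features, which is what the entropy-based potential of the next paragraph requires.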
A possible criterion to learn the filters is to require that the Mutual Information (MI) between the input video and the extracted features is maximized [15, 9]. Instead of using the Shannon entropy we use the quadratic entropy, and the following associated potential can be chosen, which plays the same role as maximizing the MI (in this case it is always possible to keep the potential non-negative just by adding a constant):
Here, a probability measure on the pixel coordinates is assumed to be uniform. Following the spirit of MaxEnt, we relax the definition of MI by introducing parameters that weigh the contribution of the two entropies.
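A minimal sketch of such an MI-like index built from quadratic entropies is given below; the parameter names `alpha` and `beta` and the exact form are illustrative stand-ins for the paper's relaxation weights.

```python
import numpy as np

def quadratic_entropy(p, axis=-1):
    """Quadratic entropy 1 - sum_i p_i^2 (zero for one-hot vectors,
    maximal for the uniform distribution)."""
    return 1.0 - np.sum(p * p, axis=axis)

def mi_index(probs, alpha=1.0, beta=1.0):
    """MI-like potential: entropy of the feature distribution averaged
    over pixels (uniform spatial measure) minus the average per-pixel
    entropy, with weights alpha/beta on the two entropy terms."""
    per_pixel = quadratic_entropy(probs, axis=-1)
    avg_dist = probs.reshape(-1, probs.shape[-1]).mean(axis=0)
    return alpha * quadratic_entropy(avg_dist) - beta * per_pixel.mean()
```

Maximizing this index pushes each pixel towards a confident (low-entropy) feature while forcing the features to be used uniformly across the frame, which is the usual decomposition of mutual information into a marginal-entropy and a conditional-entropy term.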
We implemented a solver for the CAL of Eq. (7) that is based on the Euler method with a fixed step size. After having reduced the CAL equations to first order, the weights and their first three derivatives were updated at each step. The code and data we used to run the following experiments can be downloaded at https://github.com/alessandro-betti/see, together with the full list of model parameters. We randomly selected two real-world video sequences from the Hollywood Dataset HOHA2, which we will refer to as “skater” and “car”, and a clip from the movie “The Matrix” (©Warner Bros. Pictures). The frame rate of all the videos is 25 fps, each frame was rescaled and, unless differently specified, converted to grayscale. Videos have different lengths, and they were repeated in a loop until enough frames were generated, thus covering a significantly longer time span. We randomly initialized the weights, while the derivatives at the initial time were set to zero. Following the developmental plan indicated in Section 3, the video was gradually presented to the system, starting from a completely null signal (all pixel intensities are zero), and slowly increasing the level of detail and the pixel intensities as a function of time: the source video signal is convolved with a Gaussian filter of decreasing variance and multiplied by a customizable scaling factor that is progressively increased as time passes. We refer to the corresponding attenuation as the “blurring factor”.
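The developmental plan can be sketched as below; the schedules, the time constant `tau`, and the initial blur `sigma0` are illustrative choices, not the exact parameters used in the experiments.

```python
import numpy as np

def blur_plan(frame, t, tau=1000.0, sigma0=2.0):
    """Attenuate and blur the input frame: the gain starts at 0 (null
    signal) and grows to 1, while the Gaussian blur width shrinks, so
    detail and intensity are revealed gradually."""
    gain = 1.0 - np.exp(-t / tau)          # 0 at t = 0, -> 1 over time
    sigma = sigma0 * np.exp(-t / tau)      # strong blur early on
    if sigma > 0.3:
        frame = gaussian_blur(frame, sigma)
    return gain * frame

def gaussian_blur(frame, sigma):
    """Separable Gaussian blur with a truncated, normalized kernel."""
    k = 2 * int(3 * sigma) + 1             # odd kernel size ~ 6*sigma
    x = np.arange(k) - k // 2
    g = np.exp(-0.5 * (x / sigma) ** 2)
    g /= g.sum()
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, "same"), 1, frame)
    return np.apply_along_axis(lambda c: np.convolve(c, g, "same"), 0, rows)
```

At t = 0 the agent sees a null signal, and for large t the original frame is recovered essentially unchanged.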
According to the indications of Section 3, we also carried out the “reset plan”, according to which the video signal undergoes a reset whenever the derivatives become too large. Formally, if any of the derivative norms exceeds its threshold, then we forced the input to zero, switching from the case of Eq. (10) to the one of Eq. (11), and then we reset all the derivatives.
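The reset check can be sketched as follows; the function name and the per-derivative thresholds are hypothetical, and the caller is expected to also null the input for a short interval, as in the decoupled dynamics of Section 3.

```python
import numpy as np

def maybe_reset(derivs, thresholds):
    """If any derivative norm exceeds its threshold, zero all the
    derivatives (while the weights themselves are kept by the caller).
    Returns (derivs, reset_happened)."""
    if any(np.linalg.norm(d) > th for d, th in zip(derivs, thresholds)):
        return [np.zeros_like(d) for d in derivs], True
    return derivs, False
```

Keeping the weights while zeroing the derivatives is exactly the latching-plus-reset behavior that Theorems 3 and 4 justify.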
We evaluated the CAL dynamics by experimenting with four instances of the set of parameters. Each instance is characterized by the roots of the characteristic polynomial, leading to stable or non-stable configurations, with only real roots or with roots that also have imaginary parts, keeping the roots close to zero, and fulfilling the conditions of Proposition 1 when stability and reality are needed.
We performed experiments on the “skater” video clip with a fixed number of features and filter size. Results are reported in Fig. 1 (first column). The plots indicate that there is an initial oscillation, due to the effects of the blurring factor, that vanishes after about 10k frames. The MI portion of the cognitive action correctly increases over time (when evaluating results, we used the classical MI based on the Shannon entropy), and it is pushed toward larger values in the two extreme cases of “stability, reality” and “no-stability, no-reality”. The latter shows more evident oscillations in the frame-by-frame MI value, due to the non-stable configuration and to roots with imaginary parts. In all the configurations the norm of the weights increases over time, due to the small regularization, while the frequency of reset operations is larger in the “no-stability, no-reality” case. When moving to the second column of Fig. 1 (smaller frame rate), we can see that all the configurations have more difficulty in reducing the cognitive action and increasing the MI index. This is due to the faster changes in the video signal and to the larger step size, which makes it hard to follow the dynamics of the CAL (leading to a bad solution in the case of “stability, no-reality”). In contrast, when using a larger frame rate (third column of Fig. 1), we get better results, which seem to support our intuition of slowly presenting information to the system. The system is also able to develop the MI index in a smaller number of steps.
We investigated other configurations of parameters characterized by larger regularization values. Fig. 2 shows that the MI index is then always pretty small. This is due to the stronger regularization that we enforce in the problem, so that the system has difficulties in developing good features. However, the norm of the weights is either small or becomes almost constant after a while (with the exception of one configuration, where it still grows), showing the convergence of the variables to a fixed value. We clearly observe that the unstable configurations make a wider use of the reset mechanism. We evaluated the quality of the developed features by freezing the final weights of Fig. 1 and computing the MI index over a single repetition of the whole video clip, reporting the results in Tab. 1 (a). We notice that, while in Fig. 1 we compute the MI on a frame-by-frame basis, here we compute it over all the frames of the video at once, thus in a batch-mode setting. The result confirms that the 100 fps case is preferable, and that the two extreme configurations “stability, reality” and “no-stability, no-reality” show better results on average. While this was expected in the “stability, reality” case, we explain the performance of “no-stability, no-reality” by the effect of the reset mechanism, which allows even such an unstable configuration to develop good solutions.
We compared the behavior of the system on multiple video clips and using different filter sizes and numbers of features in Fig. 3. We selected the “stability, reality” configuration of Fig. 1, which fulfils Proposition 1. Changing the video clip does not change the considerations we made so far, while increasing the filter size and the number of features can lead to smaller MI index values, mostly due to the need of better balancing the two entropy terms to cope with the larger number of features. The MI of Table 1 (b) confirms this point. Interestingly, the best results are obtained on the longest video clip (“The Matrix”), which requires fewer repetitions of the video, being closer to the real online setting. In Fig. 6 we report some of the developed filters, which clearly resemble oriented edges and corners.
Fig. 4 and Table 1 (c) show the results we obtain when using different developmental plans (“skater” clip), that is, different schedules that lead to the blurring factors reported in the first graph of Fig. 4. These results suggest that a gradual introduction of the video signal helps the system to find better solutions than in the case in which no plan is used, but also that a too-slow plan is not beneficial. The cognitive action has a big bump when no plan is used, while this effect is more controlled and reduced in the case of both the slow and fast plans.
Finally, we experimented with the setting of Eq. (15), thus simulating an online gradient descent with . We generated an artificial video from the “skater” clip by concatenating 3 instances of it, each of them using one of the R-G-B channels only. Fig. 5 and Table 1 (d) show that the gradient-like case leads to a smaller MI index and to an unstable evolution of it. We found that the MI is zero when reproducing the portions of the video composed of shades of red or blue. This suggests that the system focussed on features that only cover the greenish portion of the video, and that it was not able to capture information from the rest of the video, due to the large . However, since there is only one derivative involved, the number of reset operations is almost zero.
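The construction of the artificial clip can be sketched as follows; array shapes and the helper name are our own assumptions.

```python
import numpy as np

def channelwise_concat(clip):
    """Build an artificial video by concatenating three instances
    of `clip` (shape: frames x H x W x 3), each keeping only one
    of the R, G, B channels and zeroing the other two."""
    parts = []
    for c in range(3):
        masked = np.zeros_like(clip)
        masked[..., c] = clip[..., c]   # keep a single channel
        parts.append(masked)
    return np.concatenate(parts, axis=0)

clip = np.random.rand(10, 4, 4, 3)      # toy 10-frame clip
video = channelwise_concat(clip)        # 30 frames: red, then green, then blue
```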
In this paper we have proven that the minimization of the cognitive action yields causal equations, referred to as the Cognitive Action Laws (CAL). Depending on the choice of the potential, they model different tasks, while the choice of the kinetic energy characterizes the system dynamics and nicely extends the classic gradient flow and, consequently, stochastic gradient descent. The main result that arises from the analysis of causal optimization is that the learning process must be driven by a gradual presentation of the input which, in particular, when turned to zero, yields the reset of the dynamics. This sheds light on the video blurring process in newborns, which stimulated the preliminary validation on the task of visual feature extraction presented in this paper. While the experiments are currently carried out on simple videos, it is worth mentioning that the proposed approach is naturally suited for carrying out learning and inference by an agent which is “living in its own environment” on a continuous video stream, where there is no difference between learning and test set. Enforcing the reset in the system dynamics might be well suited for processing the video on the basis of the focus of attention as proposed in . When considering the very nature of this unsupervised learning scheme integrated with motion invariance, the proposed theory opens the doors to a new approach to learning on video streams with no supervision. This is especially interesting when considering deep convolutional architectures, which can be modeled by cognitive action laws through an appropriate choice of the potential in the cognitive action.
Appendix A Proofs of theorems
Proof of Theorem 1: The proof can be readily adapted from Lemma 2.1 of . Because of the terms proportional to , and in Eq. (2) any minimizing sequence of this functional is bounded in ; hence it is compact in . This motivates us to choose the following notion of convergence in :
These arguments account for the coercivity of the functional (compactness of sub-levels). For the lower semicontinuity, the only adjustment we have to make is to show that the term is lower semicontinuous with respect to Eq. (18). This, however, can be deduced from a well-known result stating that in a separable Hilbert space, with and and bounded, the scalar product , since:
This being done we can conclude as in .
In the proofs of Theorems 3 and 4 we will use the fact that the roots of the characteristic polynomial of Eq. (12) can always be chosen to be distinct, real, more negative than any constant with , and always bigger than (). The property of having real and distinct roots can be achieved by imposing condition 1 of Lemma 2, while the other two conditions coincide with the Routh-Hurwitz criterion applied to and to . We have checked the existence of a solution to all these conditions using the licensed software Wolfram Mathematica.
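These conditions on the roots can also be checked numerically; the sketch below (with illustrative tolerances and an illustrative function name) tests whether a given characteristic polynomial has real, distinct roots that are all more negative than a given constant, mirroring numerically what was verified symbolically with Mathematica.

```python
import numpy as np

def roots_admissible(coeffs, theta):
    """Check that the polynomial with the given coefficients (highest
    degree first, as in numpy.roots) has real, distinct roots, all
    smaller than -theta (for theta > 0)."""
    r = np.roots(coeffs)
    if np.any(np.abs(r.imag) > 1e-9):
        return False                     # complex roots: rejected
    r = np.sort(r.real)
    distinct = np.all(np.diff(r) > 1e-9)
    negative_enough = np.all(r < -theta)
    return bool(distinct and negative_enough)

# (x+1)(x+2)(x+3) has real, distinct roots -1, -2, -3, all < -0.5:
ok = roots_admissible([1, 6, 11, 6], theta=0.5)
```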
Proof of Theorem 3: The dynamics in each is given by Eq. (12); since one of the roots of the characteristic polynomial is , the solution in is thus , where with are the remaining roots. In order to prove the theorem it is sufficient to show that can be made arbitrarily small for . Indeed, since it is possible to show that the coefficients in Eq. (12) can be chosen in such a way to have and with for any positive and , the magnitude of the derivatives
is exponentially suppressed.
Proof of Theorem 4: The proof follows by imposing the continuity of the derivatives at time . Suppose that we have already solved the problem in and therefore we know the values and . Since , the evolution in is given by . Hence for we must have and
Let be the above Vandermonde matrix. Since it is always possible to choose the coefficients of Eq. (12) so that and for all and positive, and since , if we let so that , we have
where , and is such that for all and .
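The continuity conditions above amount to solving a Vandermonde system for the coefficients of the exponential modes; a minimal numerical sketch, with illustrative roots and derivative values, is:

```python
import numpy as np

def match_derivatives(lams, derivs):
    """Find coefficients c_i of x(t) = sum_i c_i * exp(lams[i] * t)
    such that the k-th derivative at t = 0 equals derivs[k], i.e.
    sum_i c_i * lams[i]**k = derivs[k].  The system matrix is the
    Vandermonde matrix of the roots lams."""
    V = np.vander(lams, increasing=True).T   # V[k, i] = lams[i]**k
    return np.linalg.solve(V, derivs)

# Illustrative distinct, real, negative roots and prescribed values
# x(0) = 1, x'(0) = 0, x''(0) = 0 at the junction time:
lams = np.array([-1.0, -2.0, -3.0])
c = match_derivatives(lams, np.array([1.0, 0.0, 0.0]))
```

Distinctness of the roots guarantees that the Vandermonde matrix is invertible, so the coefficients are uniquely determined on each interval.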
-  R. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, pp. 270–280, 1989.
-  ——, “Experimental analysis of the real-time recurrent learning algorithm,” Connection Science, vol. 1, pp. 87–111, 1989.
-  F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” Trans. Neur. Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1109/TNN.2008.2005605
-  B. Scellier, A. Goyal, J. Binas, T. Mesnard, and Y. Bengio, “Extending the framework of equilibrium propagation to general dynamics,” 2018.
-  B. Scellier and Y. Bengio, “Equilibrium propagation: Bridging the gap between energy-based models and backpropagation,” Front. Comput. Neurosci., 2017.
-  K. Friston and K. E. Stephan, “Free-energy and the brain,” Synthese, vol. 3, pp. 417–458, 2007.
-  A. Betti and M. Gori, “The principle of least cognitive action,” Theoretical Computer Science, vol. 633, no. C, pp. 83–99, Jun. 2016.
-  Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in 26th Annual International Conference on Machine Learning, Montreal, Canada, June 2009, pp. 41–48.
-  M. Gori, M. Lippi, M. Maggini, and S. Melacci, “Semantic video labeling by developmental visual agents,” Computer Vision and Image Understanding, vol. 146, pp. 9–26, 2016.
-  A. Betti and M. Gori, “Convolutional Networks in Visual Environments,” ArXiv e-prints, Jan. 2018, arXiv:1801.07110 [cs.CV].
-  M. Gori, Machine Learning: A Constrained-Based Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2018.
-  J. Uspensky, Theory of Equations. McGraw-Hill, 1948.
-  E. L. Rees, “Graphical discussion of the roots of a quartic equation,” The American Mathematical Monthly, vol. 29, no. 2, pp. 51–55, 1922. [Online]. Available: http://www.jstor.org/stable/2972804
-  G. Bellettini, A. Betti, and M. Gori, “Generalization in quasi-periodic environments,” ArXiv e-prints, Jul. 2018, arXiv:1807.05343 [cs.LG].
-  S. Melacci and M. Gori, “Unsupervised learning by minimal entropy encoding,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 12, pp. 1849–1861, 2012.
-  M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” in IEEE Conference on Computer Vision & Pattern Recognition, 2009.
-  D. Zanca and M. Gori, “Variational laws of visual attention for dynamic scenes,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 3826–3835.
-  M. Liero and U. Stefanelli, “A new minimum principle for lagrangian mechanics,” Journal of Nonlinear Science, vol. 23, pp. 179–204, 2013.
-  Wolfram Research, Inc., “Mathematica, Version 10.4,” Champaign, IL, 2016.