
Convolutional Networks in Visual Environments

by   Alessandro Betti, et al.

The puzzle of computer vision might find new challenging solutions when we realize that most successful methods work at the image level, which is remarkably more difficult than directly processing visual streams. In this paper, we claim that processing visual streams naturally leads to the formulation of the motion invariance principle, which enables the construction of a new theory of learning with convolutional networks. The theory addresses a number of intriguing questions that arise in natural vision, and offers a well-posed computational scheme for the discovery of convolutional filters over the retina. The filters are driven by differential equations derived from the principle of least cognitive action. Unlike traditional convolutional networks, which need massive supervision, the proposed theory offers a truly new scenario in which feature learning takes place by unsupervised processing of video signals. It is pointed out that an appropriate blurring of the video, along with the interleaving of segments of null signal, makes it possible to conceive a novel learning mechanism that yields the minimum of the cognitive action. While the theory enables the implementation of novel computer vision systems, it also provides an intriguing explanation of the solution that evolution has discovered for humans: video blurring in newborns and the day-night rhythm seem to emerge from a general computational framework, regardless of biology.



1 Introduction

While a general theory of vision was already the main objective at the dawn of the discipline [16], the field has evolved without a systematic exploration of its foundations in machine learning. When the target is moved to unrestricted visual environments and the emphasis is shifted from huge labelled databases to a human-like protocol of interaction, we need to go beyond the current peaceful interlude that we are experiencing in vision and machine learning. A fundamental question that a good theory is expected to answer is why children can learn to recognize objects and actions from a few supervised examples, whereas today's supervised learning approaches struggle to achieve this task. In particular, why are these approaches so thirsty for supervised examples? Interestingly, this fundamental difference seems to be deeply rooted in the different communication protocols that underlie the acquisition of visual skills in children and machines.

So far, the semantic labeling of the pixels of a given video stream has mostly been carried out at the frame level. This seems to be the natural outcome of well-established pattern recognition methods working on images, which have given rise to today's emphasis on collecting big labelled image databases (e.g. [5]) with the purpose of devising and testing challenging machine learning algorithms. While this is the framework in which most state-of-the-art object recognition approaches have been developed, we argue that there are strong arguments for exploring the more natural visual interaction that animals experience in their own environment.

Learning in wild visual environments. This leads us to process video instead of image collections, which naturally leads to a paradigm shift in the associated processes of learning to see. The idea of shifting to video is closely related to the growing interest in learning in the wild that has been explored in the last few years (see e.g.

A crucial problem that has been recognized by Poggio and Anselmi [20] is the need to incorporate into deep nets visual invariances that go beyond the simple translation invariance that currently characterizes convolutional networks. They propose an elegant mathematical framework for visual invariance and highlight some intriguing neurobiological connections. Overall, the ambition of extracting distinctive features from vision poses a challenging task. While we are typically concerned with feature extraction that is independent of classic geometric transformations, it looks like we are still missing the fantastic human skill of capturing the distinctive features needed to recognize ironed and rumpled shirts! There is no apparent difficulty in recognizing a shirt, and in keeping the recognition coherent, when we roll up its sleeves or simply curl it up into a ball for the laundry basket. Of course, there are neither rigid transformations, like translations and rotations, nor scale maps that transform an ironed shirt into the same shirt thrown into the laundry basket. Is there any natural invariance?

In this paper, we claim that motion invariance is in fact the only invariance that we need. The paradigm shift of motion invariance. Translation and scale invariance, which have been the subject of many studies, are in fact examples of invariances that can be fully gained whenever we develop the ability to detect features that are invariant under motion. If my thumb moves closer and closer to my eyes, then any of its representing features that is motion invariant will also be scale invariant. The finger will become bigger and bigger as it approaches my face, but it is still my thumb! Clearly, translation, rotation, and complex deformation invariances derive from motion invariance. Human life always involves motion, so the visual invariances that we gain naturally arise from motion invariance. Animals with foveal eyes also quickly move the focus of attention when looking at fixed objects, which means that they continually experience motion. Hence, even in the case of fixed images, conjugate, vergence, saccadic, smooth pursuit, and vestibulo-ocular movements lead to the acquisition of visual information from relative motion. We claim that the production of such a continuous visual stream naturally drives feature extraction, since the corresponding convolutional filters are expected not to change during motion. The enforcement of this consistency condition creates a mine of visual data during animal life. Interestingly, the same can happen for machines. Of course, we need to compute the optical flow at pixel level so as to enforce the consistency of all the extracted features. Early studies on this problem [10], along with recent related improvements (see e.g. [2]), suggest determining the velocity field by enforcing brightness invariance. Once the optical flow is available, it is used to enforce motion consistency on the visual features. Interestingly, the theory we propose is closely related to the variational approach that is used to determine the optical flow in [10].
It is worth mentioning that an effective visual system must also develop features that do not follow motion invariance. These kinds of features can be conveniently combined with those that are discussed in this paper with the purpose of carrying out high-level visual tasks.
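The brightness invariance condition mentioned above can be illustrated with a small numerical sketch. For a grayscale video b(t, x), brightness invariance along a velocity field v asks that b_t + v · ∇b = 0. The function name, the finite-difference scheme, and the synthetic translating pattern below are illustrative assumptions, not the scheme adopted in the paper.

```python
import numpy as np

def brightness_residual(frame0, frame1, vx, vy):
    """Residual of the brightness-invariance equation b_t + v . grad(b) = 0.

    frame0, frame1: consecutive grayscale frames (2-D arrays, unit time step).
    vx, vy: candidate velocity components in pixels per frame.
    """
    mean = 0.5 * (frame0 + frame1)
    bt = frame1 - frame0            # finite difference in time
    by, bx = np.gradient(mean)      # central differences along rows (y) and columns (x)
    return bt + vx * bx + vy * by

# Synthetic pattern translating by 0.5 pixels per frame along x (assumed test data)
ys, xs = np.mgrid[0:64, 0:64].astype(float)

def pattern(shift):
    return np.sin(2 * np.pi * (xs - shift) / 32.0)

f0, f1 = pattern(0.0), pattern(0.5)

inner = (slice(2, -2), slice(2, -2))   # ignore one-sided differences at the border
err_true = np.abs(brightness_residual(f0, f1, 0.5, 0.0)[inner]).mean()
err_zero = np.abs(brightness_residual(f0, f1, 0.0, 0.0)[inner]).mean()
print(err_true, err_zero)
```

The residual is much smaller for the true velocity than for a zero velocity, which is the property that a variational optical flow method exploits when it searches for the velocity field.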

The convolutional filters are somewhat inspired by the research activity reported in [8], where the authors formulate the extraction of visual features as a constraint satisfaction problem, mostly based on information-based principles and early ideas on motion invariance.

Learning as the minimization of the cognitive action. In this paper, the importance of motion invariance is stressed and, moreover, the solution is derived in the framework of the principle of least cognitive action [4], which gives rise to a time-variant differential equation, where the Lagrangian coordinates correspond to the values of the convolutional filters. It is pointed out that, under causality conditions, the well-posedness of the problem arises thanks to the process of video blurring taking place at the beginning of learning, which has also been observed in children. The learning process can be interpreted as the minimization of the cognitive action, which offers a self-consistent framework. In particular, if the video signal is almost periodic [3], then the computational model reduces to an asymptotically stable differential equation that yields a sort of statistical consistency.

2 Driving principles and main results

We are given a retina, which can formally be regarded as a compact subset of the plane; for the moment we will not assume any specific shape, and any deformation of the closed disk will serve. The purpose of this paper is to analyze the mechanisms that give rise to the construction of local features for any pixel of the retina, at any time . These features, along with the video itself, can be regarded as visual fields that are defined on the retina and on a given horizon of time ; clearly the analysis of on-line learning of visual features leads to regard the horizon as . As will be clear in the remainder of the paper, a set of symbols is extracted at any layer of a deep architecture, so that any pixel, along with its context, turns out to be represented by the list of symbols extracted at each layer. The computational process that we define involves the video as well as appropriate vector fields that are used to express a set of pixel-based features properly used to capture contextual information. The video, as well as all the involved fields, are defined on the domain . In what follows, points on the retina will be represented with two-dimensional vectors in a given coordinate system on the retina. The temporal coordinate is usually denoted by , and, therefore, the video signal on the pair is . For further convenience we also define the map so that . The color field can be thought of as a special field that is characterized by the RGB color components of any single pixel; in this case .

Figure 1: Convolutional computation in a deep network. The input is processed by convolutional filters which transform . Notice that the features are extracted at different level on the same pixel .

Now, we are concerned with the problem of extracting visual features that, unlike the components of the video, express the information associated with the pair and its spatial context. Basically, one would like to extract visual features that characterize the information in the neighborhood of pixel . Kernel-based feature extraction. A possible way of constructing this kind of feature is to define them as in Eq. (1).¹

¹ Throughout the paper we use the Einstein convention to simplify the equations.

Here we assume that symbols are generated from the components of the video. Notice that the kernel is responsible for expressing the spatial dependencies, and that one could also extend the context in the temporal dimension. However, the immersion in the temporal dimension that arises from the formulation given in this paper makes it reasonable to begin by restricting the contextual information to spatial dependencies on the retina.
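A kernel-based computation of this kind can be sketched numerically. The notation here is an assumption made for illustration: on a discretized one-dimensional retina, `kernel[i, j, x, z]` weights channel j of the video at position z when computing feature i at position x, and the sum over j and z follows the Einstein convention. The sketch also checks a standard fact: when the kernel depends only on the displacement x - z, the computation coincides with an ordinary convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_channels, n_features = 16, 3, 4   # 1-D retina for brevity (assumption)

video_frame = rng.normal(size=(n_channels, n_pixels))                    # v_j(z)
kernel = rng.normal(size=(n_features, n_channels, n_pixels, n_pixels))   # phi_ij(x, z)

# General kernel-based features: implicit sum over channels j and positions z
features = np.einsum("ijxz,jz->ix", kernel, video_frame)

# Space-invariant special case: phi_ij(x, z) = w_ij(x - z)
width = 5
half = width // 2
taps = rng.normal(size=(n_features, n_channels, width))   # w_ij(d), d = -2..2
shift_kernel = np.zeros_like(kernel)
for x in range(n_pixels):
    for d in range(-half, half + 1):
        z = x - d
        if 0 <= z < n_pixels:
            shift_kernel[:, :, x, z] = taps[:, :, d + half]

conv_features = np.einsum("ijxz,jz->ix", shift_kernel, video_frame)

# The same result from an ordinary convolution, channel by channel
reference = np.zeros((n_features, n_pixels))
for i in range(n_features):
    for j in range(n_channels):
        reference[i] += np.convolve(video_frame[j], taps[i, j], mode="same")

print(np.allclose(conv_features, reference))
```

The general kernel has one weight per pair of positions, whereas the space-invariant one only has `width` weights per pair of channels, which is the saving that convolutional filters provide.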

Convolutions for built-in incorporation of space-invariance. In addition, it is worth mentioning that the agent is expected to return a decision also in case of fixed images, which represents a further element for considering features defined by Eq. (1). The filters can be regarded as maps from to , where is the number of the features defined by . It is worth mentioning that whenever the above definition reduces to an ordinary spatial convolution. The computation of yields a field with features, instead of the three components of color in the video signal. However, Eq. (1) can be used for carrying out a piping scheme where a new set of features is computed from . Of course, this process can be continued according to a deep computational structure with a homogeneous convolutional-based computation, which yields the features at the convolutional layer. The theory proposed in this paper focuses on the construction of any of these convolutional layers which are expected to provide higher and higher abstraction as we increase the number of layers. The filters are what completely determines the features . In this paper we formulate a theory for the discovery of that is based on three driving principles:

  • Optimization of information-based indices
    We use an information-based approach to determine . Beginning from the color field , we attach a symbol of a discrete vocabulary to pixel with probability . MMI and MaxEnt. The principle of Maximum Mutual Information (MMI) is a natural way of maximizing the transfer of information from the visual source, expressed in terms of mixtures of colors, to the source of symbols . Clearly, the same idea can be extended to any layer in the hierarchy. Given a certain visual environment over a certain time horizon (which can be extended to ), and once the filters have been defined, the mutual information turns out to be a functional of , that is denoted as . However, in the following, it will be shown that the more general view behind the maximum entropy principle (MaxEnt) offers a better framework for the formulation of the theory.

  • Motion invariance
    While information-based indices optimize the information transfer from the input source to the symbols, the major cognitive issue of invariance is not covered. The same object, presented at different scales and under different rotations, would require different representations, which transfers all the difficulty of learning to see to the subsequent problems intertwined with language interpretation. Hence, the most important requirement that the visual field must fulfill is that of exhibiting the typical cognitive invariances that humans and animals experience in their visual environment. We claim that there is only one such fundamental invariance, namely that of producing the same representation for moving pixels. Classic invariances as motion invariance. This incorporates classic scale and rotation invariances in a natural way, which is what newborns experience. Objects come at different scales and with different rotations simply because children experience their movement and manipulation. As we track moving pixels, we enforce consistent labeling, which is clearly far more general than enforcing scale and rotation invariance. The enforcement of the motion constraint is the key to the construction of a truly natural invariance. It will be pointed out that motion invariance can always be expressed as the minimization of a functional .

  • Parsimony principle
    Like any principled formulation of learning, we require the filters to obey the parsimony principle. Amongst the philosophical implications, it also favors the development of a unique solution. The development of filters that are consistent with the above principles requires the construction of an on-line learning scheme, where the role of time becomes of primary importance. The main reason for such a formulation is the need of imposing the development of motion invariance features. Given the filters , there are two parsimony terms, one , that penalizes abrupt spatial changes, and another one, that penalizes quick temporal transitions.
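The information-based index of the first principle can be made concrete with a small sketch. It assumes that the symbol probabilities are available as an array `p_cond[k, i]`, the probability of symbol i at the k-th (pixel, time) pair, and that these pairs are weighted uniformly; both the layout and the weighting are assumptions for illustration. The mutual information is computed as the entropy of the average symbol usage minus the average conditional entropy.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis` (eps avoids log(0))."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def mutual_information(p_cond):
    """I(Y; X) = H(Y) - H(Y | X) for uniformly weighted (pixel, time) samples."""
    p_marginal = p_cond.mean(axis=0)      # average usage of each symbol
    return entropy(p_marginal) - entropy(p_cond).mean()

n_symbols = 4
# Sharp decisions with evenly used symbols: mutual information close to log(4)
sharp = np.eye(n_symbols)[np.arange(400) % n_symbols]
# The same symbol everywhere: nothing is transferred
collapsed = np.tile(np.eye(n_symbols)[0], (400, 1))
# Uniform guess everywhere: high entropy, equally high conditional entropy
uninformative = np.full((400, n_symbols), 1.0 / n_symbols)

mi_sharp = mutual_information(sharp)
mi_collapsed = mutual_information(collapsed)
mi_uninformative = mutual_information(uninformative)
print(mi_sharp, mi_collapsed, mi_uninformative)
```

The index is high only when decisions are sharp at each pixel and all symbols are used; it says nothing about whether the same symbol follows a pixel during its motion, which is the role of the second principle.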

Figure 2: Motion invariance in the feature extraction process. The symbol , that defines a feature at the beginning of motion ( s), must be coherently extracted during the movement (see the enforcement of the coherence requirement at s).

Minimization of the cognitive action. Overall, the process of learning is regarded as the minimization of the cognitive action



are positive multipliers. While the first and third principles are typically adopted in classic unsupervised learning, motion invariance is what characterizes the approach followed in this paper. Of course, there are visual features that do not obey the motion invariance principle. Animals easily estimate the distance to the objects in the environment, a property that clearly indicates the need for features whose values do depend on motion. The perception of vertical visual cues, as well as a reasonable estimate of the angle with respect to the vertical line, also suggests the need for features that are motion dependent. Since the above action functional depends on the choice of the multipliers , it is quite clear that there is a wide range of different behaviors that depend on the relative weight that is given to the terms that compose the action. As will be shown in the following, the minimization of can be given an efficient computational scheme only if we give up optimizing the information transfer in one single step and rely on a piping scheme that is clearly reminiscent of deep network architectures. While this paper focuses on unsupervised learning, it is worth mentioning that the purpose of the agent can naturally be incorporated into the minimization of the cognitive action given by Eq. (15).

Now, we provide arguments to support the principled framework of this paper. As in human interaction, visual concepts are expected to be acquired by the agents solely by processing their own visual stream along with human supervision on selected pixels, instead of relying on huge labelled databases. In this new learning environment based on a video stream, any intelligent agent willing to attach semantic labels to a moving pixel is expected to take decisions that are coherent with its motion. Basically, any label attached to a moving pixel has to be the same during its motion. Hence, video streams provide a huge amount of information just from imposing coherent labeling, which is likely to be the primary information associated with visual perception experienced by any animal. Roughly speaking, once a pixel has been labeled, the constraint of coherent labeling virtually offers tons of other supervisions, which are essentially ignored in most machine learning approaches working on big databases of labeled images. It turns out that most of the visual information needed to perform semantic labeling comes from the motion coherence constraint, which explains why children learn to recognize objects from a few supervised examples. The linguistic process of attaching symbols to objects takes place at a later stage of child development, when the child has already developed strong pattern regularities. We conjecture that, regardless of biology, the enforcement of the motion coherence constraint is a high-level computational principle that plays a fundamental role in discovering pattern regularities. Concerning the MMI principle, it is worth mentioning that it can be regarded as a special case of the MaxEnt principle when the constraints correspond to the soft enforcement of the conditional entropy, where the weight of its associated penalty is the same as that of the entropy (see e.g. [17]).
Notice that while the maximization of the mutual information nicely addresses the need to maximize the information transfer from the source to the selected alphabet of symbols, it does not guarantee the temporal consistency of this attachment. Basically, the optimization of the index is also guaranteed by using the same symbol for different visual cues. Motion consistency addresses this issue for any pixel, even if it is fixed. As for the adoption of the parsimony principle in visual environments, we can use appropriate functionals to enforce both the spatial and temporal smoothness of the solution. While spatial smoothness can be gained by penalizing solutions with high spatial derivatives, including the zero-order derivatives, temporal smoothness arises from the introduction of kinetic energy terms which penalize high velocity and, more generally, high temporal derivatives.

Since the optimization is generally formulated over arbitrarily large time horizons, all terms are properly weighted by a discount factor that leads the agent to “forget” very old information in its life. This contributes to the well-posedness of the optimization problem and gives rise to dissipation processes [4].
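The temporal parsimony term and the discount factor can be sketched as follows. The exponential form of the weight, `exp(theta * (t - T))`, the restriction to first and second derivatives, and all names are assumptions made for illustration; the text only states that high temporal derivatives are penalized and that old information is progressively forgotten.

```python
import numpy as np

def temporal_parsimony(q, dt, theta=0.0):
    """Discounted kinetic-energy-like penalty on a sampled filter trajectory.

    q: array of shape (n_steps, n_params), filter values over time.
    dt: sampling step.
    theta: rate of the assumed weight exp(theta * (t - T)); a positive value
           down-weights samples that are far in the past.
    """
    t = np.arange(q.shape[0]) * dt
    weight = np.exp(theta * (t - t[-1]))
    velocity = np.gradient(q, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    energy = 0.5 * np.sum(velocity**2 + acceleration**2, axis=1)
    return np.sum(weight * energy) * dt

t = np.linspace(0.0, 10.0, 1001)
dt = t[1] - t[0]
constant = np.ones((t.size, 3))
slow = np.sin(0.5 * t)[:, None] * np.ones((1, 3))
fast = np.sin(5.0 * t)[:, None] * np.ones((1, 3))

p_constant = temporal_parsimony(constant, dt)
p_slow = temporal_parsimony(slow, dt)
p_fast = temporal_parsimony(fast, dt)
p_fast_discounted = temporal_parsimony(fast, dt, theta=1.0)
print(p_constant, p_slow, p_fast, p_fast_discounted)
```

A constant trajectory pays no penalty, quickly varying filters pay more than slowly varying ones, and with a positive `theta` the contribution of the distant past shrinks.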

The agent behavior turns out to be driven by the minimization of an appropriate functional that combines all the above principles. The main result in this paper is that this optimization can be interpreted in terms of laws of nature expressed by a temporal differential equation. When regarding the retina as a discrete structure, we can compute the probability that at time , in pixel , the emitted symbol is by . Here, for any pair of symbols , , and for any pixel with position , in the coordinate system defined by , the filter is the temporal function that the agent is expected to learn from the visual environment. Basically, the process of learning consists of determining

In Section 4 we prove that there is no local solution to this problem, since any stationary point of this functional turns out to be characterized by the integro-differential equation (14). We also show that we can naturally gain a local solution by introducing focus of attention mechanisms. Their purpose is to provide a weighted contribution of the single terms of the action by attaching higher weights to pixels where the agent is focusing attention. Fourth-order Euler-Lagrange model of learning. Under this restatement of the problem, we prove that the minimum of the cognitive action corresponds to the discovery of the filters that satisfy the time-variant differential equation


where is the linearized vector of , matrices , , , , depend on time through the video signal and the trajectory of the focus of attention, and is a bounded nonlinear vector. The subsequent analysis will provide clear evidence of the need for a fourth-order differential equation for the determination of the filters. Equation (3) has quite a complex structure, since it also contains the non-linear term that, however, as will be shown, is piecewise linear. It is shown that the dependence on time of the coefficients is inherited from the time variance of the video. Hence, the solution of the differential equation involves dynamics whose spectrum is induced by the video. The analysis carried out in the paper shows how we can attack the problem either in the case in which the agent is expected to learn from a given video stream with the purpose of working on subsequent test collections, or in the case in which the agent lives in a certain visual environment, where there is no distinction between learning and test phases. Basically, it is pointed out that only the second case leads to a truly interesting and novel result.

Learning by blurring trajectories. In particular, the solution of the above differential equation is strongly facilitated when performing an initial blurring of the video that lasts until all the visual statistical cues have likely been presented to the agent. This very much resembles the early stages of development in newborns [18]. In so doing, at the beginning, the coefficients of Eq. (3) are nearly constant. In this case, the analysis of the equations leads to the conclusion that only very slow dynamics take place, which means that all the derivatives of are nearly null and, consequently, is nearly constant. This strongly facilitates the numerical solution and, in general, the computational model turns out to be very robust, a property that is clearly welcome also in nature. As time goes by, while the blurring process gradually increases the visual acuity, the coefficients of the differential equation begin to change with a velocity that is connected with motion. However, in the meantime, the filters have reached nearly constant values. Basically, the learning trajectories are characterized by the mentioned nearly-null derivatives, a condition that, again, strongly facilitates the well-posedness of the problem.
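A blurring process of this kind can be sketched with a Gaussian filter whose width shrinks over time, so that early frames are heavily smoothed and later frames are passed almost unchanged. The Gaussian form, the exponential schedule, and the parameters `sigma0` and `tau` are assumptions made for illustration; the text does not specify them here.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel; sigma <= 0 returns the identity kernel."""
    if sigma <= 0:
        return np.array([1.0])
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(frame, sigma):
    """Separable Gaussian blur of a 2-D frame with edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(frame, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def blur_schedule(t, sigma0=4.0, tau=50.0):
    """Assumed schedule: the blur width decays exponentially from sigma0 to 0."""
    return sigma0 * np.exp(-t / tau)

rng = np.random.default_rng(0)
frames = rng.random((3, 32, 32))      # stand-in for three video frames
times = [0.0, 100.0, 1000.0]
blurred = [blur(f, blur_schedule(t)) for f, t in zip(frames, times)]
print([b.std() for b in blurred])
```

At the start of the agent's life the heavily blurred frames change slowly in space and time, which is consistent with the nearly constant coefficients described above; at late times the schedule leaves the frames essentially untouched.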

A further intuitive reason for slow dynamics of is a consequence of visual invariant features. For example, when considering a moving car and another one of the same type parked somewhere in the same frame, during the motion interval, the processing over the parked car would benefit from a nearly constant solution. This suggests also searching for the same constant solution on the corresponding moving pixel. When regarding the problem of learning in a truly on-line mode, the previous differential equation can be considered as the model for computing given the Cauchy conditions. Of course, the solution is affected by these initial conditions. Moreover, as will be clear in the remainder of the paper, the previous differential equations yield the minimization of the action under appropriate boundary conditions that correspond to forcing a trajectory with nearly-null first, second, and third derivatives of . When joined with the blurring process, this leads to causal dynamics driven by initial conditions that are compatible with boundary conditions imposed at any time of the agent’s life.

The puzzle of extracting robust cues from visual scenes has only been partially addressed by today's successful approaches to computer vision. The remarkable achievements of the last few years have been mostly based on the accumulation of huge visual collections gathered by crowdsourcing. An appropriate setup of convolutional networks trained in the framework of deep learning has given rise to very effective internal representations of visual features. They have been successfully used to tackle a number of relevant classification problems by transfer learning. Clearly, this approach has been stressing the power of deep learning when combining huge supervised collections with massive parallel computation. In this paper, we argue that, in doing so, we have been facing artificial problems that, from a purely computational point of view, are likely to be significantly more complex than the natural visual tasks that animals face daily. In humans, the emergence of cognition from visual environments is intertwined with language. This often leads to attacking the interplay between visual and linguistic skills with simple models that, as in supervised learning, strongly rely on linguistic attachment. However, when observing the spectacular skills of the eagle that catches its prey, one promptly realizes that an in-depth understanding of vision, which is also likely to have an impact on computer implementations, should begin with a neat separation from language! This paper is mostly motivated by the curiosity of addressing a number of questions that arise when looking at natural visual processes. While they come from natural observation, they are mostly regarded as general issues strongly rooted in information-based principles, which we conjecture are of primary importance also in computer vision.

The theory proposed in this paper offers a computational perspective of vision regardless of the “body” which sustains the processing. In particular, the theory addresses some fundamental questions, reported below, that involve vision processes taking place in both animals and machines.

  1. How can animals conquer visual skills without requiring “intensive supervision”?
    Recent remarkable achievements in computer vision are mostly based on tons of supervised examples, of the order of millions! This does not explain how animals can conquer visual skills with scarce “supervision” from the environment. The call for theories of unsupervised learning. Hence, there is plenty of evidence and motivation for invoking a theory of truly unsupervised learning capable of explaining the process of extraction of features from visual data collections. While the need for theories of unsupervised learning in computer vision has been advocated in a number of papers (see e.g. [23], [15], [21], [9]), so far, the powerful representations that arise from supervised learning, because of many recent successful applications, seem to attract much more interest. While information-based principles could themselves suffice to construct visual features, the absence of any feedback from the environment makes those methods quite limited with respect to supervised learning. Interestingly, the claim of this paper is that motion invariance offers a huge amount of free supervision from the visual environment, thus explaining why humans do not need the massive supervision process that dominates feature extraction in convolutional neural networks.

  2. How can animals gradually conquer visual skills in a truly temporal-based visual environment?
    Animals, including primates, not only receive scarce supervision, but they also conquer visual skills by living in their own visual environment. This is gradually achieved without needing to separate learning from test environments. At any stage of their evolution, it looks like they acquire the skills that are required to face the current tasks. By contrast, most approaches to computer vision do not really grasp the notion of time. The typical ideas behind on-line learning do not necessarily capture the natural temporal structure of the visual tasks. Time plays a crucial role in any cognitive process. One might believe that this is restricted to human life, but more careful analyses lead us to conclude that the temporal dimension plays a crucial role in the well-posedness of most challenging cognitive tasks, regardless of whether they are faced by humans or machines. Visual streams can easily surpass any large image collection. Interestingly, while many people struggle to acquire huge labeled databases, the true incorporation of time leads to a paradigm shift in the interpretation of the learning and test environments. In a sense, such a distinction ceases to apply, and we can regard unrestricted visual collections as the information accumulated during the whole life of the agent, which can likely surpass any attempt to collect images. The theory proposed in this paper is framed in the context of an agent life characterized by the ordinary notion of time, which emerges in all its facets. We are not concerned with huge visual data repositories, but merely with the agent's life in its own visual environment.

  3. Can animals see in a world of shuffled frames?
    One might wonder what human life could have been in a world of visual information with shuffled frames. Could children really acquire visual skills in such an artificial world, which is the one we are presenting to machines? Notice that in a world of shuffled frames, a video requires orders of magnitude more information for its storage than the corresponding temporally coherent visual stream. This is a serious warning that is typically neglected; any recognition process is remarkably more difficult when shuffling frames, which clearly indicates the importance of keeping the spatiotemporal structure that is offered by nature. This calls for the formulation of a new theory of learning capable of capturing spatiotemporal structures. Basically, we need to abandon the safe model of restricting computer vision to the processing of images. The reason for formulating a theory of learning on video instead of on images is not only rooted in the curiosity of grasping the computational mechanisms that take place in nature. In modern computer vision we have been facing a problem that is more difficult than the one offered by nature. It looks like, by ignoring the crucial role of temporal coherence, the formulation of most current computer vision tasks leads us to tackle a problem that is remarkably more difficult than the one nature has prepared for humans! We conjecture that animals could not see in a world of shuffled frames, which indicates that such an artificial formulation might lead to a very hard problem. In a sense, the very good results that we can already observe today are quite surprising, but they are mostly due to sheer computational power. The theory proposed in this paper relies on the choice of capturing temporal structures in natural visual environments, which is claimed to simplify dramatically the problem at hand, and to give rise to lighter computation.

  4. How can humans attach semantic labels at pixel level?
    Humans provide scene interpretation thanks to linguistic descriptions. This requires a deep integration of visual and linguistic skills, which are required to come up with compact, yet effective visual descriptions. However, amongst these high level visual skills, it is worth mentioning that humans can attach semantic labels to a single pixel in the retina. While this decision process is inherently intertwined with a certain degree of ambiguity, it is remarkably effective. The linguistic attributes that are extracted are related to the context of the pixel that is taken into account for label attachment, while the ambiguity is mostly a linguistic rather than a visual issue. The theory proposed in this paper addresses this visual skill directly, since the labels are extracted for a given pixel at different levels of abstraction. Unlike classic convolutional networks, there is no pooling; the connection between the single pixels and their corresponding features is kept also when the extracted features involve a high degree of abstraction, which is due to the processing over large contexts. The focus on single pixels allows us to go beyond object segmentation based on sliding windows, which somewhat reverses the pooling process. Instead of dealing with object proposals [26], we focus on the attachment of symbols at single pixels in the retina. The bottom line is that human-like linguistic descriptions of visual scenes are gained on top of pixel-based feature descriptions that, as a byproduct, must allow us to perform semantic labeling. Interestingly, there is more; as will be shown in the following, there are in fact computational issues that lead us to promote the idea of carrying out the feature extraction process while focussing attention on salient pixels.

  5. Why are there two different main systems in the visual cortex (ventral and dorsal mainstream)?
    It has been pointed out that the visual cortex of humans and other primates is composed of two main information pathways that are referred to as the ventral stream and dorsal stream [6]. Is motion invariance the fundamental functional property that differentiates dorsal and ventral streams? The traditional distinction separates the ventral “what” and the dorsal “where/how” visual pathways, so that the ventral stream is devoted to perceptual analysis of the visual input, such as object recognition, whereas the dorsal stream is concerned with providing motion ability in the interaction with the environment. The enforcement of motion invariance is clearly conceived for extracting features that are useful for object recognition, so as to accomplish the “what” task. Of course, neurons with built-in motion invariance are not adequate to make spatial estimations. Depending on the value of the corresponding parameter, the theory presented in this paper leads to interpret the computational scheme of “ventral neurons”, which are appropriate for recognition (high value of the parameter), or “dorsal neurons”, which are more appropriate for environmental interactions. The model behind the learning of the filters indicates the need to access velocity estimation, which is consistent with neuroanatomical evidence.

  6. Why is the ventral mainstream organized according to a hierarchical architecture with receptive fields?
    Beginning from early studies by Hubel and Wiesel [11], neuroscientists have gradually gained evidence that the visual cortex presents a hierarchical structure and that the neurons process the visual information on the basis of inputs restricted to receptive fields. Is there a reason why this solution has been developed? We can promptly realize that, even though the neurons are restricted to compute over receptive fields, deep structures easily gain the possibility of taking large contexts into account for their decision. Is there a computational framework that motivates hierarchical architectures? Is this biological solution driven by computational laws of vision? In this paper we provide evidence of the fact that receptive fields do favor the acquisition of motion invariance which, as already stated, is the fundamental invariance of vision. Since hierarchical architectures are the natural solution for developing more abstract representations by using receptive fields, it turns out that motion invariance is in fact at the basis of the biological structure of the visual cortex. The computation at different layers yields features with a progressive degree of abstraction, so that higher computational processes are expected to use all the information extracted in the layers.

  7. Why do animals focus attention?
    The retina of animals with a well-developed visual system is organized in such a way that there are very high resolution receptors in a restricted area, whereas lower resolution receptors are present in the rest of the retina. Is focus of attention driven by computational laws? Why is this convenient? One can easily argue that any action typically takes place in a relatively small zone in front of the animal, which suggests that evolution has led to the development of high resolution in a limited portion of the retina. On the other hand, this comes to the detriment of peripheral vision, which is also very important. In addition, this could apply to the dorsal system, whose neurons are expected to provide information that is useful to support movement and actions in the visual environment. The ventral mainstream, with neurons involved in the “what” function, does not seem to benefit from foveal eyes. From the theory proposed in this paper, the need for foveal retinas is strongly supported for achieving efficient computation for the construction of visual features. When looking at Eq. (3) it also becomes clear that eye movements that are quick with respect to the dynamics of change of the weights of the filters dramatically simplify the computation.

  8. Why do foveal animals perform eye movements?
    Human eyes make jerky saccadic movements during ordinary visual acquisition. One reason for these movements is that the fovea provides high resolution only in a small angular portion of the visual field. Because this high-resolution portion is so small, the overall sensing of a scene does require intensive movements of the fovea. Hence, the foveal movements do represent a good alternative to eyes with a uniformly high resolution retina. On the other hand, the preference for the solution of foveal eyes with saccadic movements is arguable, since while a uniformly high resolution retina is more complex to achieve than a foveal retina, saccadic movements are then less important. The information-based theory presented in this paper makes it possible to conclude that a foveal retina with saccadic movements is in fact a solution that is computationally sustainable and very effective.

  9. Why does it take 8-12 months for newborns to achieve adult visual acuity?
    There are surprising results that come from developmental psychology on what a newborn sees. Charles Darwin came up with the following remark:

    It was surprising how slowly he acquired the power of following with his eyes an object if swinging at all rapidly; for he could not do this well when seven and a half months old.

    Is there any computational basis of video blurring? At the end of the seventies, this early remark was given a technically sound basis [24]. In that paper, three techniques, namely optokinetic nystagmus (OKN), preferential looking (PL), and the visually evoked potential (VEP), were used to assess visual acuity in infants between birth and 6 months of age. More recently, the survey by Braddick and Atkinson [18] provides an in-depth discussion on the state of the art in the field. It is clearly stated that, depending on the specific visual test, several months are required for newborns to gain adult visual acuity. Is the development of adult visual acuity a biological issue or does it come from higher level computational laws? This paper provides evidence to conclude that the blurring process taking place in newborns is in fact a natural strategy to optimize the cognitive action defined by Eq. 15 under causality requirements. Moreover, according to the theory, the strict limitations in terms of both spatial and temporal resolution of the video signal help in conquering visual skills.

  10. Causality and Non Rapid Eye Movements (NREM) sleep phases
    Computer vision is mostly based on huge training sets of images, whereas humans use video streams for learning visual skills. Notice that because of the alternation of the biological rhythm of sleep, humans somewhat process collections of visual streams interleaved with relaxing segments composed of “null” video signal. This happens mostly during NREM phases of sleep, in which eye movements and the connection with visual memory are also nearly absent. Interestingly, the Rapid Eye Movements (REM) phase is, by contrast, similar to ordinary visual processing, the only difference being that the construction of visual features during the dream is based on the internal visual memory representations [22]. As a matter of fact, the process of learning the filters experiences an alternation of visual information with the reset of the signal. We provide evidence to claim that such a relaxation coming from the reset of the signal nicely fits the purpose of optimizing an overall index based on the previously stated principles. In particular, we point out that periodic resetting of the visual information favors the optimization under causality requirements. Hence, the theory offers an intriguing interpretation of the role of eye movement and of sleep for the optimal development of visual features. In a sense, the theory offers a general framework for interpreting the importance of the day-night rhythm, and of the associated relaxation of the system dynamics, in the development of visual features. When combined with the blurring in newborns, it contributes to a relaxation dynamical process that turns out to be of fundamental importance for the final purpose of optimization of the visual constraints.
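The remark in question 3 on the extra information needed to store shuffled frames can be illustrated with a small numerical sketch. It is not part of the theory: it assumes a synthetic one-dimensional "retina" with a blob that moves by one pixel per frame, and it uses the summed absolute difference between consecutive frames as a rough proxy for what a delta-coding scheme would have to store. The names `make_stream` and `temporal_variation` are hypothetical.

```python
import numpy as np

def make_stream(num_frames=32, width=64, sigma=3.0):
    """Synthetic stream: a Gaussian blob shifted by one pixel per frame (periodic retina)."""
    x = np.arange(width)
    base = np.exp(-((x - width // 2) ** 2) / (2 * sigma**2))
    return np.stack([np.roll(base, t) for t in range(num_frames)])

def temporal_variation(frames):
    """Sum of absolute differences between consecutive frames (proxy for delta-coding cost)."""
    return float(np.abs(np.diff(frames, axis=0)).sum())

frames = make_stream()
rng = np.random.default_rng(0)          # fixed seed, so the shuffle is reproducible
shuffled = frames[rng.permutation(len(frames))]

coherent_cost = temporal_variation(frames)
shuffled_cost = temporal_variation(shuffled)
print(f"coherent: {coherent_cost:.2f}  shuffled: {shuffled_cost:.2f}")
```

In this toy setting the temporally coherent ordering always has the smaller frame-to-frame variation, because every consecutive pair differs by the smallest possible displacement; real video codecs exploit the same coherence in far more elaborate ways.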

3 Visual constraints

We can provide an interpretation of the processing carried out by our visual agent in the framework of information theory. The basic idea is that the agent produces a set of symbols from a given alphabet while processing the video.

MMI principle. Let us define random variables that take into account the spatial and temporal probability distribution, together with a random variable that specifies the probability distribution over the possible symbols and one that specifies the frames. In order to assess the information transfer from the spatial, temporal and frame variables to the symbol variable, we consider the corresponding mutual information. Clearly, it is zero whenever the symbol variable is independent of the others. The mutual information can be expressed as the entropy of the symbol variable minus its conditional entropy given the spatial, temporal and frame variables.
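As a sketch in assumed notation, the standard decomposition of mutual information into entropy minus conditional entropy reads as follows. The symbol names $Y$ (symbols), $X$ (space), $T$ (time) and $F$ (frames) are placeholders chosen for illustration and are not necessarily the authors' own.

```latex
% Assumed notation: Y = symbols, X = space, T = time, F = frames, S = entropy
I(Y; X, T, F) \;=\; S(Y) \;-\; S(Y \mid X, T, F)
```

With this reading, the mutual information vanishes exactly when $S(Y \mid X, T, F) = S(Y)$, that is, when the symbol variable is independent of the other three.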


The conditional entropy is given by


where is the conditional probability of conditioned to the values of , and , is the joint measure of the variable , and is a Borel set in the space. The agent generates symbols along with the corresponding probabilities on the basis of input source that is based on symbols that are still given along with their probability. Now, let us make two fundamental assumptions:

  • The conditional probability , where is a realization of random variable , is given by the -th feature field .

  • The random variables follow an ergodic-like assumption: probabilistic indices emerge while living in “wild visual environments”, so that we can perform the replacement:

In what follows we will assume that the measure is . Moreover, we assume that is factorized according to


where is the trajectory of the focus of attention and is a monotonically increasing function. This ergodic translation of the probabilistic measure suggests that we pay attention where the eye is focussing, that is, in the neighborhood of the focus of attention: this can be achieved by means of a function peaked on the focus of attention, so that more weight is placed on the pixels near the focus of attention and on “recent visual cues.” Such a trajectory is assumed to be available but, as pointed out in Section 7, it can also be determined in the overall framework of the theory presented in this paper. In addition, ergodicity here means that we pay attention mostly to “recent visual life.” Clearly, this very much depends on the choice of this increasing function. It is quite obvious that the measure only makes sense provided that the function does not change significantly during statistically significant portions of visual environments. Whenever these two assumptions hold, we can rewrite the conditional entropy defined by Eq. (5) as


Similarly for the entropy of the variable we can write


Now, if we use the law of total probability to express

in terms of the conditional probability and use the above assumptions we get




Finally, the mutual information, expressed in terms of these probabilities, becomes


Of course, is subject to the probabilistic constraints


MaxEnt principle. A more general view is that of visual constraint satisfaction while maximizing the entropy. An agent driven by the MMI principle can carry out an unsupervised learning process aimed at discovering the symbols defined by the random variable of the symbols. Interestingly, when the constraints are given a soft enforcement, the MMI principle has a nice connection with the MaxEnt principle [13]: the maximization of the mutual information corresponds with the maximization of the entropy while softly enforcing the constraint that the conditional entropy is null. While both the entropy terms get the same absolute value of the weight, one can think of different implementations of the MaxEnt principle that very much depend on the special choice of the weights. When shifting towards the MaxEnt principle one is primarily interested in the satisfaction of the conditional entropy constraint, while bearing in mind that the maximization of the entropy protects us from the development of trivial solutions (see [7] pp. 99–103 for further details). Of course, the probabilistic normalization constraints stated by Eq. 12 come along with the conditional entropy constraint. The computational mechanism that drives the discovery of the symbols described in this paper is based on MaxEnt, but instead of limiting the unsupervised process to the fulfillment of the conditional entropy constraint, we enrich the model with other environmental constraints.
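The relation between the MMI and MaxEnt readings can be made concrete with a minimal numerical sketch. It assumes a finite set of retina points with weights `mu` that sum to one (a discrete stand-in for the measure) and a field `phi` whose columns are the symbol probabilities at each point; the function names and the weights `w_entropy` and `w_cond` are hypothetical, introduced only for illustration.

```python
import numpy as np

EPS = 1e-12  # guards the logarithm at zero probability

def entropies(phi, mu):
    """Return (entropy, conditional entropy) for symbol probabilities phi.

    phi: array (n_symbols, n_points), each column sums to 1.
    mu:  array (n_points,), non-negative weights summing to 1.
    """
    marginal = phi @ mu                                   # law of total probability
    entropy = -np.sum(marginal * np.log(marginal + EPS))
    cond_entropy = -np.sum(mu * np.sum(phi * np.log(phi + EPS), axis=0))
    return entropy, cond_entropy

def mutual_information(phi, mu):
    entropy, cond_entropy = entropies(phi, mu)
    return entropy - cond_entropy

def soft_maxent_objective(phi, mu, w_entropy=1.0, w_cond=1.0):
    """Quantity to minimize: penalize conditional entropy, reward entropy."""
    entropy, cond_entropy = entropies(phi, mu)
    return w_cond * cond_entropy - w_entropy * entropy

n_symbols, n_points = 4, 8
mu = np.full(n_points, 1.0 / n_points)
one_hot = np.eye(n_symbols)[:, np.arange(n_points) % n_symbols]  # crisp, balanced symbols
uniform = np.full((n_symbols, n_points), 1.0 / n_symbols)        # trivial, uninformative field

mi_one_hot = mutual_information(one_hot, mu)
mi_uniform = mutual_information(uniform, mu)
```

With equal weights the objective is just the negative mutual information; choosing `w_cond` larger than `w_entropy` corresponds to the shift towards MaxEnt described above, where the conditional entropy constraint dominates and the entropy term mainly prevents trivial solutions.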

First, we notice that the map which originates the symbol production mechanism has not been given any guideline. The conditional entropy constraint only involves the value taken by which depends on , but there is no structural enforcement on the function ; its spatiotemporal changes are ignored. Spatiotemporal regularization can be interpreted as a constraint in the framework of MaxEnt. Ordinary regularization issues suggest discovering functions such that

is “small”, where are spatial and temporal differential operators, and are non-negative reals. Notice that the ergodic translation of , in this case, only involves the temporal factor .

Second, as already pointed out, many relevant visual features need to be motion invariant. Just like an ideal fluid is adiabatic (meaning that the entropy of any fluid particle remains constant as the particles move about in space), in a video, once we have assigned the correct symbol to a pixel, that symbol is conserved as the object moves on the retina, because the movement of the object is continuous. If we focus attention on the pixel at time , which moves according to the trajectory then , being a constant. This “adiabatic” condition is thus expressed by the condition , which yields


where is the velocity field, which we assume to be given, and is the partial derivative with respect to . When replacing as stated by Eq. (1) we get

which holds for any . Notice that this constraint is linear in the field . This can be interpreted by stating that learning under motion invariance consists of determining elements of the kernel of the function . A discussion on the problem of determining the kernel of is given in [7].
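The "adiabatic" condition can be checked numerically on a toy example. The sketch below assumes a one-dimensional retina and a feature that simply translates with a known constant velocity, so that its total derivative along the motion (time derivative plus velocity times spatial derivative) should vanish up to discretization error. The function name, the Gaussian profile and all numerical values are illustrative assumptions.

```python
import numpy as np

def motion_invariance_residual(phi, velocity, dt, dx):
    """Finite-difference estimate of d(phi)/dt + v * d(phi)/dx on a 1-D retina.

    phi has shape (num_times, num_pixels); velocity is a scalar here.
    The first and last time samples are dropped, since np.gradient uses
    less accurate one-sided differences at the boundaries.
    """
    d_phi_dt = np.gradient(phi, dt, axis=0)
    d_phi_dx = np.gradient(phi, dx, axis=1)
    return (d_phi_dt + velocity * d_phi_dx)[1:-1]

dx, dt, v = 0.05, 0.01, 2.0
x = np.arange(0.0, 10.0 + dx / 2, dx)
t = np.arange(0.0, 1.0 + dt / 2, dt)

# Feature that is carried rigidly by the motion: phi(x, t) = g(x - x0 - v t)
phi = np.exp(-((x[None, :] - 3.0 - v * t[:, None]) ** 2) / (2 * 0.5**2))

residual_true = np.abs(motion_invariance_residual(phi, v, dt, dx)).max()
residual_static = np.abs(motion_invariance_residual(phi, 0.0, dt, dx)).max()
print(residual_true, residual_static)
```

The residual is small only when the velocity used in the constraint matches the actual motion, which is why the constraint needs access to a velocity estimate; since the residual is linear in the feature, it is also linear in the filters when the feature is obtained by convolution.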

4 Cognitive action

MaxEnt as the minimum of the cognitive action. In the previous section we have proposed a method to determine the filters based on the MaxEnt principle. We provide a soft interpretation of the constraints, so that the adoption of the principle corresponds with the minimization of the “action”


where the notation is used to stress the fact that depends functionally on the filters . Here the first line is the negative of the mutual information and the constants , and are positive multipliers. In the above formula, and in what follows, we will consistently use the Einstein summation convention.

We notice that the mutual information (the first line) is rather involved, and it becomes too cumbersome to be used with a principle of least action. However, if we give up interpreting the information-based terms in terms of bits, we can rewrite the entropies that define the mutual information as

Interestingly, this replacement does retain all the basic properties on the stationary points of the mutual information and, at the same time, it simplifies dramatically the overall action, which becomes


We shall, from now on, assume that the fields are extracted by convolution, so that . In order to be sure to preserve the commutativity of convolution (a property that in general holds when the integrals are extended to the entire plane) we have to make assumptions on the retina and on the domain on which the filters are defined. First of all assume that and define . If the video has support on , then it is convenient to assume that has spatial support on ; in doing so the commutative property of the convolution is maintained if we perform the integration on . Hence, we have


Expression of the variation of the action. In what follows we assume that .

The Euler-Lagrange equation of the action arises from . So we need to take the variational derivative of all the terms of action in Eq. (15). In the following calculation, we will assume that . The first term yields


while the second term gives


The variation of the third term similarly yields


The variation of the terms that implements positivity is a bit more tricky:

However, the second term is zero since

The difference of the two Iverson brackets is always zero unless the epsilon-term makes the argument of the first bracket have the opposite sign with respect to the second. Since is arbitrarily small, this can only happen if . Thus in either case the whole term vanishes. Hence, we get


Finally, the variation of the last term is a bit more involved and yields (see Appendix A):




In doing all these calculations we have used the commutative property of the convolution as stated in Eq. (16); had we not done so, we would have obtained, in some cases, expressions with a higher degree of space non-locality (i.e. with more than one integral over ). The resulting Euler-Lagrange equations for the cognitive action are integro-differential: they are local neither in time nor in space. They read:


where and .

Temporal locality. Temporal non-locality can be removed by approximate and adjoint variable-based methods. From Eq. (23) we immediately see that the first term in the second line of this equation is non-local in time; this means that the equations are non-causal and therefore it is impossible to regard them as evolution equations for the filters . To overcome this problem we propose two different approaches:

  • Enforce time locality by computing the entropy on frames rather than on the entire life of the agent:

  • Define a causal entropy

    and insert in the action the time average of this quantity together with the constraint that enforces this definition. In this way the entropy term in the Lagrangian will be replaced with

  • Define the same causal entropy of point 2. above but insert the derivative of the constraint that enforces this definition. In this way the entropy term in the Lagrangian will be replaced with

    In this way the E-L equations that we derive are automatically local in time.

For the moment we will develop the theory using the first assumption. The variation of this term gives, as expected, the local form of Eq. 17: .

Space locality. Adjoint equations and the focus of attention can be used to remove spatial non-locality. We showed that the Euler-Lagrange equations for our theory are

and, as we can see, the unknown fields appear inside a space integral. Now we ask ourselves whether it is possible to make these equations local in space, so that they can be regarded as differential equations.

We found that it is possible to do this “localization” exploiting a crucial property of human vision: The focus of attention. Once we choose , we can choose a differential operator such that .

Now if we define the adjoint function so that


and the function as a solution of


we can rewrite Euler Lagrange equations as


Equations (24), (25) and (26) together form a system of differential equations.

Notice that the spatial function that we used here to resolve space non-locality is the same function that appears in the measure ; however we could also have chosen a different function.

5 Neural interpretation in discrete retina

Allocating one neuron per pixel: the filters are defined on the quantized retina . Up to this point we have proposed our theory as a field theory; we now consider the corresponding theory defined on a discretized retina . For each point of the discretized retina we then have a variable .

Since all the terms in the cognitive action (except for the kinetic terms) are expressed in terms of the feature field , we need to show how these fields can be written on a discretized retina . On a discrete retina we will have, instead of the fields , a set of functions of the time variable , indexed by the point on the retina in addition to the filter indices and . Similarly, the color field will be replaced by .

Using Einstein notation we have that the discretized form of the feature fields is , where the sum over is performed over the discrete retina . Then, for example, the two pieces of the motion invariance term become

The term of motion invariance becomes a quadratic form in and since


The other relevant terms of the theory (the entropy, the relative entropy and the probabilistic constraints) are just a function .

Factorization of the weight function. Notice also that because of the proposed factorization of the weight function the term in the discretized formulation is also a function of time, and as it has been pointed out in Section 2. This contributes to the time dependence that affects the coefficients of the differential equation that governs the evolution of the filters. However, since plays the role of a probability distribution over the retina, it must be that for every .
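A factorized weight of this kind can be sketched on a discrete retina as follows. The Gaussian shape of the spatial factor, its width `sigma`, the exponential form of the temporal factor and the function names are all assumptions made for illustration; the text only requires a function peaked on the focus of attention, a monotonically increasing temporal factor, and a spatial weight that behaves like a probability distribution over the retina at every time.

```python
import numpy as np

def attention_weights(height, width, focus, sigma=2.0):
    """Spatial factor: Gaussian bump peaked on the focus of attention, normalized to sum to 1."""
    rows, cols = np.mgrid[0:height, 0:width]
    dist2 = (rows - focus[0]) ** 2 + (cols - focus[1]) ** 2
    g = np.exp(-dist2 / (2 * sigma**2))
    return g / g.sum()

def temporal_weight(t, rate=0.1):
    """Temporal factor: an increasing function of time, so recent frames weigh more (assumed exponential)."""
    return np.exp(rate * np.asarray(t, dtype=float))

# The focus of attention moves over time, so the spatial weights are time dependent.
trajectory = [(3, 5), (4, 6), (6, 6)]
weights = [attention_weights(12, 16, focus) for focus in trajectory]
peaks = [tuple(int(i) for i in np.unravel_index(np.argmax(w), w.shape)) for w in weights]
```

Because the focus moves, the same pixel receives a different weight at different times, which is the source of the time-dependent coefficients mentioned above.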

Tensor linearization. Before going on to describe the theory on the discretized retina we will show how it is possible to linearize the indices of in order to deal with a vectorial variable rather than a more complex tensorial index structure. In order to be more precise in the construction we will split the retina index of into its two discrete coordinates and so that the filters fields will be identified by four rather than three indices. For the same reason when considered necessary we will also explicitly write down the summations. As we have argued the first step towards discretization is

In Appendix B we show that the feature field can be rewritten as

where are the linearized features and the linearized input. The map transforms appropriately into another index depending on the point in which the convolution is computed (see Appendix B).
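The idea of linearizing the indices can be sketched on a single-channel example: each output of a discrete convolution is the inner product between a flattened filter and a flattened input patch, with the patch selected by the point where the convolution is computed. This is only an analogy to the construction of Appendix B, which is not reproduced here; the function names and the "valid" boundary handling are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_direct(image, kernel):
    """Plain 2-D convolution ('valid' region only), written with explicit loops."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]          # convolution flips the kernel
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

def conv2d_linearized(image, kernel):
    """Same result as a dot product between a flattened filter and flattened patches."""
    kh, kw = kernel.shape
    patches = sliding_window_view(image, (kh, kw))           # (rows, cols, kh, kw)
    u = patches.reshape(patches.shape[0], patches.shape[1], kh * kw)
    q = kernel[::-1, ::-1].reshape(kh * kw)                   # linearized filter
    return u @ q                                              # one inner product per pixel

rng = np.random.default_rng(1)
image = rng.standard_normal((6, 7))
kernel = rng.standard_normal((3, 3))
direct = conv2d_direct(image, kernel)
linear = conv2d_linearized(image, kernel)
```

Once the feature is written as an inner product with a single vector of filter weights, terms such as the motion invariance penalty become quadratic forms in that vector and its time derivative, which is what makes the discretized action tractable.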

The cognitive action then can be written with appropriate regularization terms as

where is a suitable regularization term that we will discuss in the following.

Let us now see how the action can be rewritten in terms of the variables . The motion invariance term becomes (see Eq. (27))

where the matrices can be expressed as:


As it is explained in Appendix B, by a careful redefinition of these matrices, we can transform the sum over as a sum over the entire set (definitions of and are also in Appendix B and they are essentially subsets of ). Thus the motion invariance term is just

and from now on the range of the indices of repeated sums is intended to be .

Because of the way in which the problem is formulated, it seems natural to choose as a criterion for the choice of the filters the minimization of the functional . For this reason the regularizing part of the functional must be carefully chosen so that it does not spoil the coercivity of the functional, and we also need to be sure that it will give rise to stable EL equations.

Coercivity and stability cannot be obtained with a regularization term that contains only first derivatives in time (see [4]).

Cognitive action in its linearized form. All the other terms can be discretized as well (see Appendix B for the details) so that the action as a functional of the s reads


where , , , , , and are real positive constants while


Then, if we formulate our minimization problem on the set under the assumption that is bounded in the interval , and that we can choose large enough so that the quadratic term in in the action above is positive definite, we can prove that the minimum of the functional on the set exists (the apparently dangerous linear term in with a possibly negative coefficient can also be controlled with the regularization term ).

Time-local form of the cognitive action in its linearized form; space locality is a direct consequence of retina quantization. If we also make the entropy term local in time we get


where . As it is remarked above the minimization problem takes place in the convex and closed set and then in order to evaluate the first variation of this functional we need to take as a varying function .

The differential E-L equation for the whole functional, a fourth-order Euler-Lagrange differential equation of learning in which the term is piecewise linear, thus reads: