Learning to see like children: proof of concept

08/11/2014 ∙ by Marco Gori, et al. ∙ Università di Siena

In the last few years we have seen a growing interest in machine learning approaches to computer vision and, especially, to semantic labeling. Nowadays state-of-the-art systems use deep learning on millions of labeled images with very successful results on benchmarks, though similar results can hardly be expected in unrestricted visual environments. Most learning schemes essentially ignore the inherent sequential structure of video: this might be a critical issue, since any visual recognition process is remarkably more complex when video frames are shuffled. Based on this remark, we propose a re-foundation of the communication protocol between visual agents and the environment, which we refer to as learning to see like children. As in human interaction, visual concepts are acquired by the agents solely by processing their own visual stream along with human supervisions on selected pixels. We give a proof of concept that remarkable semantic labeling can emerge within this protocol by using only a few supervised examples. This is made possible by exploiting a constraint of motion-coherent labeling that virtually offers tons of supervisions. Additional visual constraints, including those associated with object supervisions, are used within the context of learning from constraints. The framework is extended in the direction of lifelong learning, so that our visual agents live in their own visual environment without distinguishing between learning and test sets. Learning takes place in deep architectures under a progressive developmental scheme. In order to evaluate our Developmental Visual Agents (DVAs), in addition to classic benchmarks, we open the doors of our lab, allowing people to evaluate DVAs by crowd-sourcing. Such an assessment mechanism might result in a paradigm shift in methodologies and algorithms for computer vision, encouraging truly novel solutions within the proposed framework.




Section 1 Introduction

Semantic labeling of pixels is amongst the most challenging problems faced nowadays in computer vision. The availability of an enormous amount of image labels enables the application of sophisticated learning and reasoning models that have proven their effectiveness in related fields of AI.

margin: Shuffling frames makes vision hard

Interestingly, so far, the semantic labeling of pixels of a given video stream has been mostly carried out at frame level. This seems to be the natural outcome of well-established pattern recognition methods working on images, which have given rise to today's emphasis on collecting big labelled image databases (e.g. [12]) with the purpose of devising and testing challenging machine learning algorithms. While this framework is the one in which most of today's state-of-the-art object recognition approaches have been developed, we argue that there are strong arguments for starting to explore the more natural visual interaction that humans experience in their own environment. To better grasp this issue, one might imagine what human life would have been in a world of visual information with shuffled frames. Any cognitive process aimed at extracting symbolic information from images that are not frames of a temporally coherent visual stream would have been far harder than in our visual experience. Clearly, this follows from the information-based principle that in any world of shuffled frames, a video requires orders of magnitude more information for its storage than the corresponding temporally coherent visual stream. As a consequence, any recognition process is remarkably more difficult when frames are shuffled, and it seems that most current state-of-the-art approaches have been attacking a problem which is harder than the one faced by humans. This leads us to believe that the time has come for an in-depth re-thinking of machine learning for semantic labeling. As will be shown in Section 2, we need a re-foundation of the computational principles of learning under the framework of a human-like natural communication protocol, so as to naturally deal with unrestricted video streams.

margin: Beyond the “peaceful interlude”

From a rough analysis of the growing role played in the last few years by machine learning in computer vision, we can see that there is a rich collection of machine learning algorithms that have been successfully integrated into state-of-the-art computer vision architectures. On the other side, when the focus is on machine learning, vision tasks are often regarded as yet more benchmarks to provide motivation for the proposed theory. However, both these approaches seem to be the outcome of the bias coming from two related, yet different scientific communities. In so doing, we are likely missing an in-depth understanding of fundamental computational aspects of vision. In this paper, we start facing the challenge of disclosing the computational basis of vision by regarding it as a true learning field that needs to be attacked by an appropriate vision learning theory. Interestingly, while the emphasis on a general theory of vision was already the main objective at the dawn of the discipline [34], the field has evolved without a systematic exploration of foundations in machine learning. When the target is moved to unrestricted visual environments and the emphasis is shifted from huge labelled databases to a human-like protocol of interaction, we need to go beyond the current peaceful interlude that vision and machine learning are experiencing. A fundamental question a good theory is expected to answer is why children can learn to recognize objects and actions from a few supervised examples, whereas today's machine learning approaches struggle to achieve this task. In particular, why are they so thirsty for supervised examples? Interestingly, this fundamental difference seems to be deeply rooted in the different communication protocols at the basis of the acquisition of visual skills in children and machines.
In this paper we propose a re-foundation of the communication protocol between visual agents and the environment, which we refer to as learning to see like children (L2SLC). As in human interaction, visual concepts are expected to be acquired by the agents solely by processing their own visual stream along with human supervisions on selected pixels, instead of relying on huge labelled databases. In this new learning environment based on a video stream, any intelligent agent willing to attach semantic labels to a moving pixel is expected to take decisions that are coherent with its motion. Basically, any label attached to a moving pixel has to remain the same during its motion (interestingly, early studies on tracking exploited the invariance of brightness to estimate the optical flow [25]). Hence, video streams provide a huge amount of information just by imposing coherent labeling, which is likely to be the essential information associated with visual perception as experienced by any animal. Roughly speaking, once a pixel has been labeled, the constraint of coherent labeling virtually offers tons of other supervisions, which are essentially ignored in most machine learning approaches working on big databases of labeled images. It turns out that most of the visual information needed to perform semantic labeling comes from the motion coherence constraint, which explains why children learn to recognize objects from a few supervised examples. The linguistic process of attaching symbols to objects takes place at a later stage of child development, when strong pattern regularities have already been developed. We conjecture that, regardless of biology, the enforcement of the motion coherence constraint is a high-level computational principle that plays the fundamental role in discovering pattern regularities.
On top of the representation gained by motion coherence, the mapping to linguistic descriptions is dramatically simplified with respect to machine learning approaches to semantic labeling based on huge labeled image databases. This also suggests that the enormous literature on tracking is a mine of precious results for devising successful methods for semantic labeling.
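To make the idea of "free" supervisions from motion coherence concrete, here is a minimal sketch, not part of the original system: a single human-labeled pixel carries its tag forward along an estimated dense motion field, producing a virtual supervision in the next frame. All names and the synthetic flow are hypothetical.

```python
import numpy as np

def propagate_labels(labels, flow):
    """Propagate per-pixel labels one frame forward along a dense motion field.

    labels: (H, W) int array, -1 where unlabeled.
    flow:   (H, W, 2) array of per-pixel displacements (dy, dx).
    Returns the label map for the next frame: every labeled pixel carries its
    label to the position it moves to (motion-coherent labeling).
    """
    h, w = labels.shape
    out = -np.ones_like(labels)
    ys, xs = np.nonzero(labels >= 0)
    for y, x in zip(ys, xs):
        ny = int(round(y + flow[y, x, 0]))
        nx = int(round(x + flow[y, x, 1]))
        if 0 <= ny < h and 0 <= nx < w:
            out[ny, nx] = labels[y, x]
    return out

# One supervised pixel at frame t...
labels = -np.ones((4, 4), dtype=int)
labels[1, 1] = 7                                  # human supervision: tag 7 on pixel (1, 1)
flow = np.zeros((4, 4, 2)); flow[..., 1] = 1.0    # toy flow: everything moves right by 1 px
nxt = propagate_labels(labels, flow)
assert nxt[1, 2] == 7                             # ...becomes a free supervision at frame t+1
```

Iterating this over a long video turns one click into thousands of virtual supervised pixels, which is the intuition behind the motion coherence constraint.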

margin: Deep learning from visual constraints

The work described in this paper is rooted in the theory of learning from constraints [20], which allows us to model the interaction of intelligent agents with the environment by means of constraints on the tasks to be learned. It gives foundations and algorithms for discovering tasks that are consistent with the given constraints and minimize a parsimony index. The notion of constraint is very well-suited to express both visual and linguistic granules of knowledge. In the simplest case, a visual constraint is just a way of expressing the supervision on a labelled pixel, but the same formalism is used to express motion coherence, as well as complex dependencies on real-valued functions, which also include abstract logic formalisms, e.g. First-Order Logic (FOL) [13] (this is made possible by adopting the T-norm mechanism to express predicates by real-valued functions). In addition to learning the tasks, as with kernel machines, given a new constraint one can check whether it is compatible with the given collection of constraints [23]. While the representation of visual knowledge by logic formalisms is not covered in this paper, we can adopt the same mathematical and algorithmic setting used for representing the visual constraints discussed herein. The main reason for the adoption of visual constraints is that they nicely address the chicken-and-egg dilemma connected with the classic problem of segmentation. The task of performing multi-tag prediction for each pixel of the input video stream, with semantics that involve different neighbors, poses strong restrictions on the computational mechanisms, thus sharing intriguing connections with biology. We use deep architectures that progressively learn convolutional filters by enforcing information-theoretic constraints, thus maximizing the mutual information between the input receptive fields and the output codes (Minimal Entropy Encoding, MEE [38]). The filters are designed in such a way that they are invariant under geometric (affine) transformations. The learned features lead to a pixel-wise deep representation that, in this paper, is used to predict semantic tags by enforcing constraints coming from supervised pairs, spatial relationships, and motion coherence. We show that the exploitation of motion coherence is the key to reducing the computational burden of the invariant feature extraction process. In addition, we enforce motion coherence within the manifold regularization framework to express the consistency of high-level tag predictions.
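The mutual-information criterion mentioned above can be illustrated with a toy computation. This is a hedged sketch, not the paper's MEE implementation: it uses the standard decomposition MI = H(mean code) − mean per-sample entropy, applied to soft feature codes obtained by a softmax over random receptive inputs.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def mutual_information(codes):
    """codes: (N, n) soft feature activations over N receptive inputs.
    MI = H(mean code) - mean H(code): it is large when each input is coded
    confidently (low conditional entropy) while, on average, all features
    are used equally often (high entropy of the mean code)."""
    return entropy(codes.mean(axis=0)) - entropy(codes).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 9))     # toy receptive inputs (3x3 patches, flattened)
W = rng.normal(size=(9, 4))       # 4 features in one category (hypothetical filters)
mi = mutual_information(softmax(X @ W))
assert 0.0 <= mi <= np.log(4) + 1e-9   # MI is bounded by log(#features)
```

Maximizing this quantity with respect to the filters `W` pushes the features toward confident yet balanced codes, which is the spirit of the information-theoretic constraint used for feature development.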

margin: Life-long learning

The studies on learning from constraints covered in [20] lead to learning algorithms whose fundamental mechanism consists of checking the constraints on big data sets (this holds true for soft constraints, which are those of interest in this paper), which suggests that there are classic statistical learning principles behind the theory. It is worth mentioning that this framework of learning suggests dismissing the difference between supervised and unsupervised examples, since the case of supervised pairs is just an instance of the general notion of constraint. While this is an ideal view to embrace different visual constraints in the same mathematical and algorithmic framework, we clearly need a re-formulation of the theory to respond to the inherently on-line L2SLC communication protocol. Basically, the visual agent is expected to continuously collect its own visual stream and acquire human supervisions on labels to be attached to selected pixels. This calls for lifelong learning computational schemes in which the system adapts gradually to the incoming visual stream. In this paper, clustering mechanisms are proposed to store a set of templates under the restrictions imposed by the available memory budget. This allows us to have a stable representation to handle transformation invariances and perform real-time predictions while learning is still taking place. It turns out that, in addition to dismissing the difference between supervised and unsupervised examples, the lifelong computational scheme associated with our visual agents also leads to dismissing the difference between learning and test sets. These intelligent agents undergo developmental stages that very much resemble humans’ [21] and, for this reason, throughout the paper, they are referred to as Developmental Visual Agents (DVAs).
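A budget-limited template store of the kind mentioned above can be sketched as follows. This is a minimal illustration under assumptions of ours (nearest-template matching, least-recently-matched eviction); the actual clustering and removal policies of DVAs are described later in the paper.

```python
import numpy as np

class TemplateBudget:
    """Online clustering of incoming patterns under a fixed memory budget:
    a new pattern either refines its nearest template or, when the budget is
    full, replaces the least recently matched one (a simple removal policy)."""

    def __init__(self, budget, tau=1.0, lr=0.1):
        self.budget, self.tau, self.lr = budget, tau, lr
        self.templates, self.last_used = [], []

    def observe(self, x, t):
        if self.templates:
            d = [np.linalg.norm(x - c) for c in self.templates]
            i = int(np.argmin(d))
            if d[i] < self.tau:                       # close enough: refine template
                self.templates[i] += self.lr * (x - self.templates[i])
                self.last_used[i] = t
                return i
        if len(self.templates) < self.budget:         # room left: store as new template
            self.templates.append(x.astype(float).copy())
            self.last_used.append(t)
            return len(self.templates) - 1
        i = int(np.argmin(self.last_used))            # budget full: evict stalest
        self.templates[i] = x.astype(float).copy()
        self.last_used[i] = t
        return i

rng = np.random.default_rng(0)
store = TemplateBudget(budget=8, tau=0.5)
for t in range(200):                                  # lifelong stream: no train/test split
    store.observe(rng.normal(size=4), t)
assert 0 < len(store.templates) <= 8                  # memory never exceeds the budget
```

The point of the sketch is the invariant it maintains: the representation stays bounded in size no matter how long the agent lives, so predictions remain real-time while learning continues.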

margin: Proof of concept

This paper provides a proof of concept of the feasibility of learning to see like children with a few supervised examples of visual concepts. In addition to the CamVid benchmark, which allows us to relate DVA performance to the literature, we propose exploring a different experimental validation that seems to resonate perfectly with the proposed L2SLC communication protocol. Given any visual world, along with a related collection of object labels, we activate a DVA which starts living in its own visual environment by experiencing the L2SLC interaction. Interestingly, just like children, as time goes by, the DVA is expected to perform object recognition itself. Now, it takes little time to recognize whether a child is blind or visually impaired. The same holds for any visual agent, whose skills can be quickly evaluated: humans can easily and promptly judge visual skills by watching a visual agent at work. The idea can fully be grasped (from the site, you can also download a library for testing DVAs in your lab) at http://dva.diism.unisi.it, where we open our lab to people willing to evaluate DVA performance. Interestingly, the same principle can be used for any visual agent which experiences the L2SLC protocol. The identity of people involved in the assessment is properly verified so as to avoid unreliable results. A massive crowd-sourcing effort could give rise to a truly novel performance evaluation that could nicely complement benchmark-based assessment. DVAs are only expected to be the first concrete case of living visual agents that learn under the L2SLC protocol and are evaluated by crowd-sourcing. Other labs, either by using their current methods and technologies or by conceiving novel solutions, might be stimulated to come up with their own solutions in this framework, which could lead to further significant improvements with respect to those reported in this paper. Results on the CamVid benchmark confirm the soundness of the approach, especially when considering the simple and uniform mechanism used to acquire visual skills.

1.1 Related work

There are a number of papers related to our approach. Notably, the idea of receptive field can be traced back to the studies of Hubel and Wiesel [26] and, later on, it was applied to computer vision in the Neocognitron model [18]. Convolutional neural networks have widely embraced this idea, and they recently led to state-of-the-art results in object recognition on the ImageNet data [29, 12]. Those results were also extended towards tasks of localization and detection [44]. Recently, some attempts at transferring the internal representation to other tasks were studied in [42] with interesting results. Other approaches develop hierarchies of convolutional features without any supervision. The reconstruction error and sparse coding are often exploited [27, 28], as well as solutions based on K-Means. Different autoencoders have been proposed in the last few years [51, 43], which have been tested in very large scale settings [30]. The issue of representation has been nicely presented in [3], which also contains a comprehensive review of these approaches. Some preliminary results concerning low-level features developed by DVAs have been presented in [24, 39]. The notion of invariance in feature extraction has been the subject of many analyses on biologically inspired models [45, 6]. Invariances to geometric transformations are strongly exploited in hand-designed low-level features, such as SIFT [33], SURF [2], and HOG [10], as well as in the definition of similarity functions [46]. We share analogies with the principles inspiring scene parsing approaches, which aim at assigning a tag to each pixel in an image. Recent works have shown successful results on classical benchmarks [32, 48, 47], although they seem to be very expensive in terms of computational resources. Fully supervised convolutional architectures were exploited for this task in [15], while successful approaches based on random forests have also been proposed.

The theory of learning from constraints [20] was applied in several contexts and with different types of knowledge, such as First-Order Logic clauses [13, 23] and visual relationships in object recognition [40]. In the case of manifold-regularization-based constraints [37], our on-line learning system was evaluated using heterogeneous data, showing promising results [17]. Finally, the notion of constraint is used in this paper to model motion coherence, thus resembling what is usually done in optical flow algorithms [25, 52]. Motion estimation in DVAs is part of the feature extraction process; some qualitative results can be found in [22].

Section 2 En plein air

The impressive growth of computer vision systems has strongly benefited from the massive diffusion of benchmarks which, by and large, are regarded as fundamental tools for performance evaluation. However, in spite of their apparently indisputable dominant role in the assessment of progress in computer vision, some criticisms have recently been raised (see e.g. [49]), suggesting that the time has come to open our minds to new approaches.

Figure 1: En plein air in computer vision according to the current interpretation given in this paper for DVAs at http://dva.diism.unisi.it. Humans can provide strong and weak supervisions (Pink Panther cartoon, © Metro Goldwyn Mayer).

The benchmark-oriented attitude, which nowadays dominates the computer vision community, bears some resemblance to the influential testing movement in psychology, which has its roots in the turn-of-the-century work of Alfred Binet on IQ tests. In both cases, in fact, we recognize a familiar pattern: a scientific or professional community, in an attempt to provide a rigorous way of assessing the performance or the aptitude of a (biological or artificial) system, agrees on a set of standardized tests which, from that moment onward, becomes the ultimate criterion for validity. As is well known, though, the IQ testing movement has been severely criticized by many a scholar, not only for the social and ethical implications arising from the idea of ranking human beings on a numerical scale but also, more technically, on the grounds that, irrespective of the care with which these tests are designed, they are inherently unable to capture the multifaceted nature of real-world phenomena. As David McClelland put it in a seminal paper which set the stage for the modern competency movement in the U.S., the criteria for establishing the validity of these new measures really ought to be not grades in school, but grades in life in the broadest theoretical and practical sense. Motivated by analogous concerns, we maintain that the time is ripe for the computer vision community to adopt a similar grade-in-life attitude towards the evaluation of its systems and algorithms. We do not, of course, intend to diminish the importance of benchmarks, as they are indeed invaluable tools to make the field devise better and better solutions, but we propose to use them in much the same way as we use school exams for assessing the abilities of our children: once they pass the final one, and are therefore supposed to have acquired the basic skills, we allow them to find a job in the real world. Accordingly, in this paper we open the doors of our lab and go en plein air, thereby allowing people all over the world to freely play and interact with the visual agents that will grow up in our lab. (The idea of en plein air, along with the underlying relationships with human intelligence, has mostly come from enjoyable and profitable discussions with Marcello Pelillo, who also coined the term and contributed to the above comment during our common preparation of a Google Research Program Award proposal. In the last couple of years, the idea has circulated during the GIRPR meetings, thanks also to contributions of Paolo Frasconi http://girpr.tk/sites/girpr.tk/files/GIRPRNewsletter_Vol4Num2.pdf and Fabio Roli https://dl.dropboxusercontent.com/u/57540122/GirprNewsletter_V6_N1.pdf.)

margin: Evaluation by crowd-sourcing

A crowd-sourcing performance evaluation scheme can be conceived in which registered people can inspect and assess the visual skills of software agents. A prototype of such an evaluation scheme is proposed in this paper and can be tried at http://dva.diism.unisi.it. The web site hosts a software package with a graphical interface which can be used to interact with the DVAs by providing supervisions and observing the resulting predictions. The human interaction takes place at the symbolic level, where semantic tags are attached to visual patterns within a given frame. In our framework, users can provide two kinds of supervisions:

Strong supervision - one or more labels are attached to a specific pixel of a certain frame to express the presence of an object at different levels of abstraction;

Weak supervision - one or more labels are attached to a certain frame to express the presence of an object, regardless of its specific location in the frame.

The difference between strong and weak supervision can promptly be seen in Figure 1. Strong supervision conveys a richer message to the agent since, in addition to the object labels, it also specifies the location. In the extreme case, this can be a single pixel, but labels can also be attached to areas aggregated by the agent. Weak supervision has a higher degree of abstraction, since it also requires the agent to locate object positions. In both cases, an object is regarded as a structure identified by a position to which one can attach different labels depending on the chosen context. For example, in Figure 1, the labels eye and Pink Panther could be attached during strong supervision while pointing to a Pink Panther’s eye. Weak supervision can easily be provided by a microphone while wearing a camera, but it is likely to be more effective after strong supervisions have already been provided, thus reinforcing visual concepts in their initial stages. A visual agent is also expected to take the initiative by asking for supervision, thus carrying out an active learning scheme. The results reported in this paper for DVAs are only based on strong supervision, but the extension to weak supervision is already under investigation.

(An interesting solution for constructing visual environments for massive experimentation is that of using computer graphics tools. In so doing, one can create visual worlds along with symbolic labels that are available at the time of the construction of the visual world. Clearly, because of the abundance of supervised pixels, such visual environments are ideal for statistical assessment. This idea was suggested independently by Yoshua Bengio and Oswald Lanz.)
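The two kinds of supervision in the protocol can be captured by a small record type. This is an illustrative sketch of ours, not part of the DVA software; the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Supervision:
    """One human supervision in the L2SLC protocol: strong if it points at a
    specific pixel of the frame, weak if it only names objects present in it."""
    frame: int
    labels: Tuple[str, ...]                   # e.g. ("eye", "Pink Panther")
    pixel: Optional[Tuple[int, int]] = None   # (row, col) for strong supervision

    @property
    def strong(self) -> bool:
        return self.pixel is not None

s = Supervision(frame=120, labels=("eye", "Pink Panther"), pixel=(64, 80))
w = Supervision(frame=121, labels=("Pink Panther",))   # location left to the agent
assert s.strong and not w.strong
```

Note how the same record carries multiple labels at different levels of abstraction, matching the idea that an object is a position to which context-dependent labels are attached.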

While this paper gives the proof of concept of the L2SLC protocol along with the en plein air crowd-sourcing assessment, other labs could start exposing their models and technologies, as well as new solutions, within the same framework.

Section 3 Architectural issues

Figure 2: An overview of the architecture of a Developmental Visual Agent. A deep network learns a smooth satisfaction of the visual constraints within the general framework proposed in [20]. An appropriate interpretation of the theory is given to allow the implementation of a truly lifelong learning scheme.

The architecture of the whole system is depicted in Figure 2. Basically, it is a deep network whose layers contain features that are extracted from receptive fields. As we move towards the output, the hierarchical structure makes it possible to virtually cover larger and larger areas of the frames.

margin: Pixel-based features

Let $v$ be a video stream, and $v_t$ the frame processed at time $t$. For each layer $l$, a DVA extracts a set of features $f^{(l)}_i$, $i = 1, \ldots, n_l$, for each pixel $x$, where $v^{(l)}_t$ is the input of layer $l$ at time $t$, i.e. $v^{(1)}_t = v_t$. The features are computed over a neighborhood of $x$ (receptive field) at the different levels of the hierarchy. To this aim, we model a receptive field of $x$ by a set of Gaussians $g_j$, $j = 1, \ldots, m$, located nearby the pixel. We define the receptive input of $x$, denoted by $r(x,t)$, as the value whose $j$-th component is

$$ r_j(x,t) = \sum_{u} g_j(u - x)\, v_t(u). \qquad (1) $$

The receptive input is a filtered representation of the neighborhood of $x$, which expresses a degree of image detail that clearly depends on the number $m$ of Gaussians and on their variance $\sigma^2$. Notice that, at each $t$, the association of each pixel with its receptive input induces the function $x \mapsto r(x,t)$. Although the position of the centers is arbitrary, we select them on a uniform grid of unitary edge centered on $x$. From the set of features learned at layer $l$, a corresponding set of probabilities is computed by the softmax function

$$ p^{(l)}_i(x,t) = \frac{e^{f^{(l)}_i(x,t)}}{\sum_{k=1}^{n_l} e^{f^{(l)}_k(x,t)}}, \qquad (2) $$

so that all the features satisfy the probabilistic normalization, thus competing during their development. The feature learning process takes place according to information-theoretic principles, as described in Section 5. In order to compact the information represented by the features, we project them onto a space of lower dimensionality by applying stochastic iterations of the NIPALS (non-linear iterative partial least squares) algorithm [53] to roughly compute the principal components over a time window. Moreover, in order to enhance the expressiveness of DVA features, they are partitioned into subsets (categories), so that the learning process takes place in each category $c$ by producing a probability vector of $n_c$ elements, independently of those of the other categories, with $\sum_c n_c = n_l$. Different categories are characterized by the different portions of the input taken into account for their computation. For example, at the first layer, each category can operate on a different input channel (e.g., one category per channel for an RGB encoding of the input video stream) or on different projections of the input.
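The receptive input and the per-category competition can be sketched in a few lines. This is a hedged illustration under our own assumptions (a 3x3 unitary grid of Gaussian centers, isotropic Gaussians), not the exact DVA implementation:

```python
import numpy as np

def receptive_input(frame, y, x, centers, sigma=1.0):
    """Each component r_j is the frame averaged under a normalized Gaussian
    placed at center c_j of a grid around pixel (y, x)."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = []
    for cy, cx in centers:
        g = np.exp(-((ys - (y + cy)) ** 2 + (xs - (x + cx)) ** 2) / (2 * sigma ** 2))
        g /= g.sum()
        out.append(float((g * frame).sum()))
    return np.array(out)

def category_softmax(features):
    """Probabilistic normalization of the features of one category, so that
    they compete during their development."""
    e = np.exp(features - np.max(features))
    return e / e.sum()

frame = np.arange(36.0).reshape(6, 6)
centers = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]   # unitary 3x3 grid
r = receptive_input(frame, 3, 3, centers)                        # 9 Gaussian responses
p = category_softmax(np.array([0.2, 1.5, -0.3]))                 # one category of 3 features
assert r.shape == (9,) and abs(p.sum() - 1.0) < 1e-9
```

Categories simply run `category_softmax` on disjoint subsets of features, each computed from its own portion of the input (e.g., one RGB channel).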

margin: Region-based features

After features have been extracted at pixel level, an aggregation process takes place to partition the input frame into “homogeneous regions”. To this aim, we extend the graph-based region-growing algorithm by Felzenszwalb and Huttenlocher [16] in order to enforce motion coherence in the development. The original algorithm in [16] starts with each pixel belonging to a distinct region, and then progressively aggregates pixels by evaluating a dissimilarity function based on color similarity (basically, Euclidean distance between RGB triplets or grayscale values). We enrich this measure by decreasing (increasing) the dissimilarity score of pixels whose motion estimation is (is not) coherent. The idea is to enforce the similarity of neighbor pixels locally moving in the same direction. The similarity is also increased for those pairs of neighbor pixels that were assigned to the same static region (no motion) in the previous frame. Once the regions have been located, proper region-based features are constructed which summarize the associated information in different ways.
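The motion-coherent modification of the edge weight can be sketched as follows. This is an illustrative approximation under assumptions of ours (a threshold on flow difference, multiplicative bonus/penalty factors); the actual scoring in the system may differ.

```python
import numpy as np

def motion_coherent_dissimilarity(color_a, color_b, flow_a, flow_b,
                                  same_static_region=False,
                                  bonus=0.5, penalty=0.5):
    """Edge weight between two neighbor pixels for graph-based region growing
    (in the style of Felzenszwalb-Huttenlocher): start from the color
    distance, then decrease it when the two pixels move coherently (or were
    in the same static region at the previous frame) and increase it when
    their estimated motions disagree."""
    d = float(np.linalg.norm(np.asarray(color_a, float) - np.asarray(color_b, float)))
    coherent = np.linalg.norm(np.asarray(flow_a, float) - np.asarray(flow_b, float)) < 0.5
    if coherent or same_static_region:
        d *= bonus           # more similar: likely the same moving object
    else:
        d *= 1.0 + penalty   # less similar: likely a motion boundary
    return d

same = motion_coherent_dissimilarity((10, 10, 10), (12, 12, 12), (1, 0), (1, 0))
diff = motion_coherent_dissimilarity((10, 10, 10), (12, 12, 12), (1, 0), (-1, 0))
assert same < diff   # coherent motion pulls neighbor pixels into one region
```

Plugging such a weight into the standard region-growing merge criterion biases the segmentation toward regions that move as a whole.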

margin: Developmental Object Graph

The regions correspond to visual patterns that are described in terms of an appropriate set of features and that are stored into the nodes of a graph, referred to as Developmental Object Graph (DOG). The edges of the graph represent node similarity as well as motion flow information between two consecutive frames (see Section 6).

The symbolic layer is associated with functions that are defined on the DOG nodes. These functions are also forced to respect constraints based on the spatio-temporal manifold induced by the DOG structure. We overload the symbol $f$ to denote both pixel-wise low-level features and high-level tag predictors, i.e., to refer to functions that are developed under learning from constraints. We assume that $f$ operates in a transductive environment, both on the receptive inputs and on the DOG nodes. As will be shown later, this allows us to buffer predictions and provide real-time responses.

Section 4 Learning from visual constraints

The features and the symbolic functions involved in the architectural structure of Fig. 2 are learned within the framework of learning from constraints [20]. In particular, the feature functions turn out to be the smooth maximization of the mutual information between the output codes and the input video data (Section 5). The high-level symbolic functions are subject to constraints on motion and spatial coherence, as well as to constraints expressing object supervisions (Section 6). Additional visual constraints can express relationships among the symbolic functions, including logic expressions. The constraints enrich their expressiveness with the progressive exposure to the video, so as to follow a truly lifelong learning paradigm. DVAs are expected to react and make predictions at any time, while learning still evolves asynchronously.

margin: Parsimonious

Let $f$ be a vectorial function such that $f \in \mathcal{F}$. We introduce its degree of parsimony by means of an appropriate norm $\|f\|$ on $\mathcal{F}$ (see [19] for an in-depth discussion on the norms used to express parsimony in Sobolev spaces and for the connections with kernel machines). We consider a collection of visual constraints indexed by $j = 1, \ldots, m$ and indicate by $\phi_j(f) \geq 0$ a penalty that expresses their degree of fulfillment. The problem of learning from (soft) constraints consists of finding

$$ f^\star = \arg\min_{f \in \mathcal{F}} \; \lambda \|f\|^2 + \sum_{j=1}^{m} \phi_j(f). \qquad (3) $$

Its general treatment is given in [20], where a functional representation is given along with algorithmic solutions. In this paper we follow one of the proposed algorithmic approaches, which is based on considering the sampling of constraints. While the feature functions and the high-level symbolic functions operate on different domains, they can both be determined by solving eq. 3 and, therefore, for the sake of simplicity, we consider a generic domain, without making any distinction between feature-based and high-level symbolic functions. In addition, the hypothesis of sampling the constraints makes it possible to reduce the above parsimony index to the one induced by a Reproducing Kernel Hilbert Space (RKHS), i.e. $\|f\| = \|f\|_{\mathcal{H}}$, where $\lambda$ is the classic regularization parameter. In [20] it is shown that, regardless of the kind of constraints, the solution of 3 is given in terms of a Support Constraint Machine (SCM).

margin: Transductive

A representer theorem is given that extends the classical kernel-based representation of traditional learning from examples. In particular, $f$ can be given an optimal representation based on the following finite expansion

$$ f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i), \qquad (4) $$

where $k$ is the kernel associated with the selected norm (under certain boundary conditions, it is proven in [19] that $k$ is the Green function of the differential operator associated with the norm and its adjoint), the $x_i$ are the points on which the constraints are sampled, and the $\alpha_i$ are the parameters to be optimized. They can be obtained by gradient-based optimization of the function that arises when plugging 4 into 3, so that the functional optimization collapses to finite dimensions.

A very important design choice of DVAs is that they operate in a transductive environment. This is made possible by clustering the incoming data into a set of representative elements $\mathcal{X}^\star$.

margin: On-line learning

Clearly, the clustering imposes memory restrictions, and it turns out to be important to define a budget to store the elements of $\mathcal{X}^\star$, as well as their removal policy. The clustering differs in the case of low-level features and high-level symbolic functions, and it will be described in Sections 5 and 6, respectively. The values $f(x)$ are cached over $\mathcal{X}^\star$ after each update of the parameters $\alpha_i$, so that DVAs make predictions at any time, independently of the status of the optimization process. The on-line learning consists of updating the $\alpha_i$ along with the data stream. The parameters associated with newly introduced representatives are set to zero, to avoid abrupt changes of $f$.
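The on-line, budgeted kernel expansion described in this section can be sketched as follows. This is a toy illustration under assumptions of ours (a Gaussian kernel, a plain squared penalty for a supervision constraint, gradient steps); it is not the actual SCM solver, but it shows the two design choices highlighted in the text: new representatives enter with zero coefficients, and predictions are cached after each update.

```python
import numpy as np

class OnlineSCM:
    """Sketch of a transductive kernel expansion f(x) = sum_i alpha_i k(x, x_i):
    representatives come from clustering the stream, new representatives start
    with alpha = 0 (so f does not change abruptly), and predictions over the
    representative set are cached after each parameter update."""

    def __init__(self, gamma=1.0, lr=0.05, lam=0.01):
        self.gamma, self.lr, self.lam = gamma, lr, lam
        self.X = np.empty((0, 2))
        self.alpha = np.empty(0)
        self.cache = np.empty(0)       # f evaluated on the representatives

    def k(self, A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-self.gamma * d2)

    def f(self, x):
        if len(self.alpha) == 0:
            return 0.0
        return float(self.k(np.atleast_2d(x), self.X) @ self.alpha)

    def add_representative(self, x):
        self.X = np.vstack([self.X, x])
        self.alpha = np.append(self.alpha, 0.0)   # f is unchanged right now

    def step(self, x, y):
        """One gradient step on a supervision penalty (f(x) - y)^2 + lam*|alpha|^2."""
        g = self.f(x) - y
        kx = self.k(np.atleast_2d(x), self.X).ravel()
        self.alpha -= self.lr * (2 * g * kx + self.lam * self.alpha)
        self.cache = self.k(self.X, self.X) @ self.alpha   # refresh cached predictions

scm = OnlineSCM()
scm.add_representative(np.array([0.0, 0.0]))
scm.add_representative(np.array([1.0, 1.0]))
for _ in range(200):                        # stream of supervision constraints
    scm.step(np.array([0.0, 0.0]), 1.0)
    scm.step(np.array([1.0, 1.0]), -1.0)
assert scm.f(np.array([0.0, 0.0])) > 0.5 > scm.f(np.array([1.0, 1.0]))
```

Because `cache` always holds the current predictions over the representatives, the agent can answer queries at any time while the optimization keeps running asynchronously.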

Section 5 Learning invariant features

In this section we describe the on-line learning algorithm used by DVAs for developing the pixel-level features . As sketched in Figure 3, the features are learned by means of a two stage process. First, DVAs gain invariance by determining an appropriate receptive input and, then, they learn the local pattern shapes by a support constraint machine.

Let us start with the stage devoted to discovering invariant receptive inputs. Given a generic layer and category999For the sake of simplicity, in the rest of this section, we drop the layer and category indices., for each pixel , we want to incorporate the affine transformations of the receptive field into the receptive input . Since any 2D affine map can be rewritten as the composition of three 2D transformations and a scale parameter, we can express , where and , , are

with , , and [41, 36]. These continuous intervals are discretized into grids , and, similarly, we collect in a set of discrete samples of (starting from ). The domain collects all the possible transformation tuples, where , , and can be considered as hidden variables for , depending on pixel . Given a tuple , we can calculate each component of the receptive input as


where the value of affects both the width of the Gaussians and their centers, and the dependency of , , and on has been omitted to keep the notation simpler. Note that computing the receptive input for all the pixels and for all the transformations in only requires performing Gaussian convolutions per-pixel, independently of the number of centers and of the size of the grids , since only affects the shape of the Gaussian functions101010The non-uniform scaling of should generate anisotropic Gaussians (see [41]), which we do not consider here both for simplicity and to reduce the computational burden.. The receptive input can also include invariance to local changes in brightness and contrast, which we model by normalizing to zero-mean and unitary norm111111Those receptive inputs that are almost constant are not normalized. The last feature of each category, i.e., , is excluded from the learning procedure and hard-coded to react to constant patterns..
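The normalization step and the discretization of the transformation parameters can be sketched as follows. This is a simplified illustration only: we use two transformation parameters instead of the full affine tuple, and all grid values and the `eps` threshold are made-up choices, not the paper's.

```python
import numpy as np

def normalize_receptive_input(x, eps=1e-8):
    """Brightness/contrast invariance: subtract the mean and rescale to
    unit norm. Near-constant inputs are left as-is (they are handled by
    the hard-coded constant-pattern feature). `eps` is an assumption."""
    x = x - x.mean()
    n = np.linalg.norm(x)
    return x / n if n > eps else x

# Discretized grids of transformation parameters (illustrative values):
# every tuple indexes one candidate receptive input for a given pixel.
rotations = np.linspace(0.0, np.pi, 8, endpoint=False)
scales = np.array([0.5, 1.0, 2.0])
transformation_tuples = [(r, s) for r in rotations for s in scales]
```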

Figure 3: The two-step process for the computation of pixel-based features. First, invariance is gained by searching in for the optimal receptive input . Second, the parameters are learned in the framework of the support constraint machines, with mutual information constraints. Notice that an efficient search in (see dotted lines) is made possible by a local search based on motion coherence.

For any pair , the tuple is selected in order to minimize the mismatch of from a discrete sampling of the receptive inputs processed up to the current frame-pixel. Let be such a collection of receptive inputs121212We do not explicitly indicate the dependence of on the frame and pixel indices to keep the notation simpler., and let be a metric on . margin: Dealing with
Formally, we associate to such that


being the closest element to . Such a matching criterion allows us to associate each pixel with its nearest neighbor in and also to store the information about its transformation parameters . We introduce a tolerance which avoids storing near-duplicate receptive inputs in . Clearly, the choice of determines the sampling resolution, thus defining the clustering process. After having solved eq. (6), if , then is added to , otherwise it is associated with the retrieved (see Figure 3). The data in are distributed on a -sphere of radius , because of the normalization and the mean subtraction. When is chosen as the Euclidean distance, a similarity measure based on the inner product can be equivalently employed to compare receptive inputs, such that the constraint can be verified as . The set is an -net of the subspace of that contains all the observed receptive inputs. Such nets are standard tools in metric spaces, and they are frequently exploited in search problems because of their properties [22]. For instance, it can easily be shown that there exists a finite set for any processed video stream.
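Since the normalized receptive inputs lie on a unit sphere, the distance test reduces to an inner-product threshold: the distance between two unit vectors is at most eps exactly when their inner product is at least 1 - eps^2/2. A minimal sketch of the resulting match-or-insert clustering that builds the eps-net (function and parameter names are ours, for illustration):

```python
import numpy as np

def match_or_insert(x, net, eps):
    """Map a unit-norm receptive input x to its nearest element of the
    eps-net `net`, or insert it as a new representative.
    On the unit sphere: ||x - y|| <= eps  <=>  <x, y> >= 1 - eps**2 / 2."""
    thr = 1.0 - eps**2 / 2.0
    best, best_sim = None, -np.inf
    for i, y in enumerate(net):
        sim = float(np.dot(x, y))
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= thr:
        return best, False           # hit an existing representative
    net.append(x)
    return len(net) - 1, True        # new element added to the net

# Feed the net with random unit vectors (stand-ins for receptive inputs).
net = []
rng = np.random.default_rng(0)
for _ in range(100):
    v = rng.standard_normal(16)
    v /= np.linalg.norm(v)
    match_or_insert(v, net, eps=0.5)
```

By construction, any two representatives stored in the net are more than eps apart, which is exactly the eps-net property mentioned above.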

margin: The mutual

The second stage of learning consists of discovering a support constraint machine which operates on the set within the framework of Section 4. The idea is that of maximizing the mutual information (MI) between , which represents the codebook of features for the considered category, and , which represents the data stored in . When searching for smooth solutions , this problem is an instance of Minimal Entropy Encoding (MEE) [38] and, more generally, of learning with Support Constraint Machines [20]. Let us denote by and the entropy and the conditional entropy, so that

The constraint enforces the maximization of the mutual information, and we can use eq. (3) to define the penalty where

where the solution is given by eq. (4), i.e. by the kernel expansion

with .

Finding a solution to eq. (6) for all pixels in a given frame can be sped up by a pivot-based mechanism [22], but it still quickly becomes computationally intractable when the resolution of the video and the cardinality of the set of transformation tuples reach reasonable values.

margin: Motion
of invariance

In order to face tractability issues, we exploit the inherent coherence of video sequences, so that the pairs are used to compute the new pairs . The key idea is that the scene changes smoothly in subsequent frames and, therefore, at a certain pixel location , we expect to detect a receptive input which is very similar to one of those detected in a neighborhood of in the previous frame. In particular, we impose the constraint that both the transformation tuple and the receptive input should be (almost) preserved along small motion directions. Therefore, we use a heuristic technique which performs quick local searches that provide good approximations of the problem stated by eq. (6), while significantly speeding up the computation131313We refer to [22] for the details.. It is worth mentioning that the proposed heuristics to determine invariant parameters also yield, as a byproduct, motion estimation for all the pixels of any given frame. Strict time requirements in real-time settings can also be met by partitioning the search space into mini-batches, and by accepting sub-optimal solutions of the nearest neighbor computations within a pre-defined time budget.
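The local-search heuristic can be sketched as follows: candidate matches are taken from the assignments computed in a small neighborhood of the pixel in the previous frame, and a full scan of the eps-net is performed only when no candidate is close enough. This is our simplified reading of the idea, not the exact algorithm detailed in [22]; it assumes unit-norm inputs so the distance test reduces to an inner-product threshold.

```python
import numpy as np

def coherent_match(pixel, cur_input, prev_assign, net, eps, neighborhood=1):
    """Motion-coherent matching: try the eps-net representatives matched
    around `pixel` in the previous frame; fall back to a full scan of
    `net` only if none of them is within the eps tolerance."""
    thr = 1.0 - eps**2 / 2.0
    r, c = pixel
    # Representatives hit in the previous frame near this pixel location.
    candidates = {prev_assign[i, j]
                  for i in range(r - neighborhood, r + neighborhood + 1)
                  for j in range(c - neighborhood, c + neighborhood + 1)
                  if 0 <= i < prev_assign.shape[0]
                  and 0 <= j < prev_assign.shape[1]}
    best = max(candidates, key=lambda k: float(np.dot(cur_input, net[k])))
    if float(np.dot(cur_input, net[best])) >= thr:
        return best, True                       # quick local search succeeded
    sims = [float(np.dot(cur_input, y)) for y in net]
    return int(np.argmax(sims)), False          # fall back to full search
```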

margin: Blurring

At the beginning of the life of any visual agent, is empty, and new samples are progressively inserted as time goes by. From this dynamic mechanism of feature development, we can promptly see that the clustering process, along with the creation of , turns out to be strongly biased towards the very early stage of life. In principle, this does not seem to be an appropriate developmental mechanism, since the receptive inputs that become cluster representatives in might not represent visual patterns that only come out later in the agent's life. To address this problem, we propose a blurring scheme such that ends up in a nearly stable configuration only after a certain visual developmental time. We initially set the variance scaler of the Gaussian filters of eq. (5) to a large value, and progressively decrease it with an exponential decay that depends on . This mechanism makes the initial frames (layer inputs) strongly blurred, so that only a few ’s are added to for each frame (even just one or none141414The tuple assigned to the first addition to is arbitrary.). As is decreased, the number of items in grows until a stable configuration is reached. When a memory budget for is given, we propose a removal policy that discards those elements that are less frequently solutions of eq. (6). This somehow resembles curriculum learning [4], where examples are presented to learning systems following an increasing degree of complexity. Interestingly, the proposed blurring scheme is also related to the process which takes place in infants during the development of their visual skills [11, 14, 50]. In a sense, the agent gets rid of the information overload and operates with the amount of information that it can handle at the current stage of development. Remarkably, this seems to be rooted in information-based principles more than in biology.
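The developmental blurring schedule can be sketched as a simple exponential decay of the variance scaler of the Gaussian filters. All constants below are illustrative assumptions (the paper does not report the actual values):

```python
import math

def blur_sigma(t, sigma0=8.0, sigma_inf=1.0, tau=200.0):
    """Variance scaler of the Gaussian filters at frame t: large at the
    start of the agent's life (strongly blurred input), exponentially
    decaying towards a stable value sigma_inf as development proceeds."""
    return sigma_inf + (sigma0 - sigma_inf) * math.exp(-t / tau)
```

With a large initial sigma, nearly constant blurred frames add few representatives per frame; as sigma decays, finer patterns appear and the net grows until it stabilizes.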

margin: Deep nets and
developmental stages

At each layer of the deep net, the feature learning is based on the same principles and algorithms. However, we use developmental stages based on learning the layers separately, so that upper layers activate the learning process only when the features of the lower layers have already been learned. The pixel-based features that are developed in the whole net are used for the construction of higher-level representations that are involved in the prediction of symbolic functions.

Section 6 Learning symbolic constraints

In order to build high-level symbolic functions, we first aggregate homogeneous regions (superpixels) of , as described in Section 3. This reduces the computational burden of pixel-based processing, but it requires moving from the pixel-based descriptors to region-based descriptors , where is the index of a region of .

margin: High-level

In detail, the aggregation procedure generates regions151515In the following description we do not make explicit the dependence of the region variables on time, to keep the notation simple. (superpixels) , , where each collects the coordinates of the pixels belonging to the -th region. While the bare average of over all the could be directly exploited to build , we instead aggregate by means of a co-occurrence criterion. This allows us to capture spatial relationships among the pixel features that would otherwise be lost in the averaging process. First, we determine the winning feature in . Then, we count the occurrences of pairs of winning features in the neighborhood of , for all pixels in the region, building a histogram that is normalized by the total number of counts. The normalization yields a representation that is invariant w.r.t. scale changes (number of pixels) of the region . Then, we repeat the process for all the categories and layers, stacking the resulting histograms to generate the region descriptor . We also add the color histogram over , considering equispaced bins for each channel of the considered (RGB or Lab) color space. Finally, we normalize to sum to one, giving the same weight to the feature-based portion of and to the color-based one (more generally, the weight of the two portions could be tuned by a customizable parameter). The length of is .
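The co-occurrence criterion above can be sketched for a single category and layer as follows. This is a simplified illustration under our own assumptions: a 4-neighborhood for the pair counting, no color histogram, and no multi-layer stacking.

```python
import numpy as np
from collections import Counter

def region_descriptor(winning, region_mask, n_features):
    """Normalized co-occurrence histogram of winning-feature indices over
    a region: count neighboring pairs of winning features, then divide by
    the total count so the descriptor is invariant to the region's size."""
    H, W = winning.shape
    counts, total = Counter(), 0
    for r in range(H):
        for c in range(W):
            if not region_mask[r, c]:
                continue
            # Right and down neighbors cover each unordered pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < H and cc < W and region_mask[rr, cc]:
                    a, b = sorted((winning[r, c], winning[rr, cc]))
                    counts[(a, b)] += 1
                    total += 1
    desc = np.zeros((n_features, n_features))
    for (a, b), v in counts.items():
        desc[a, b] = v / total
    return desc.ravel()
```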

Region descriptors and their relationships are stored as vertices (also referred to as “nodes”) and edges of the Developmental Object Graph (DOG). Nodes are the entities on which tag-prediction is performed, whereas edges are used to generate coherence constraints, as detailed in the following. Similarly to the case of the set in Section 5, the set of the nodes in the DOG is an -net in which the minimum distance between pairs of nodes is , a user-defined tolerance. Each region descriptor is either mapped to its nearest neighbor in (if the distance from it is ), or added to (if the distance is greater than ). In the former case, we say that “ hits node ”, and it inherits the tag-predictions of , which can be easily cached (Section 4). As for the nearest neighbor computations concerning receptive inputs, also in this case the search procedure can be efficiently performed by partitioning the search space and tolerating sub-optimal mappings. A pre-defined time budget, which is crucial for real-time systems, is imposed, and we return the best response within such a time constraint. The distance is exploited, since it is well suited for comparing histograms.

margin: Node
by motion

As for receptive inputs, we can also use motion coherence to strongly reduce the number of full searches required to map the region descriptors to the nodes of . We partition the image into rectangular portions of the same size, and we associate each region to the portion containing its barycenter. Given the region-descriptor-to-node mappings computed in the frame , we can search for valid hits at time by comparing the current descriptors with the nodes associated to the regions of the nearby image portions in the previous frame.

margin: Spatial and

DOG edges are of two different types, spatial and motion-based, and their weights are indicated with and and stored into the (symmetric) adjacency matrices and , respectively. Spatial connections are built by assuming that close descriptors represent similar visual patterns. Only those nodes that are closer than a predefined factor are connected, leading to a sparse set of edges. The edge weights are computed by the Gaussian kernel, as .
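The construction of the sparse spatial adjacency can be sketched as follows; the distance threshold, the kernel width, and all names are our own illustrative choices.

```python
import numpy as np

def spatial_edges(nodes, dist_thr, sigma):
    """Sparse spatial adjacency over DOG node descriptors: connect only
    pairs closer than `dist_thr`, with Gaussian-kernel weights
    w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(nodes)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(nodes[i] - nodes[j]))
            if d < dist_thr:
                # Symmetric weight, so W matches the (symmetric) adjacency.
                W[i, j] = W[j, i] = np.exp(-d**2 / (2 * sigma**2))
    return W
```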

Nodes that represent regions with similar appearance may not actually be spatially close, due to slight variations in lighting conditions, occlusions, or the suboptimal solutions of the matching process (Section 5). The motion between frames and can be used to overcome this issue, and, for this reason, we introduce links between nodes that are estimated to be the source and the destination of a motion flow. The weights are initialized as at , for each pair , and then they are estimated by a two-step process. First, the likelihood that two DOG nodes are related in two consecutive frames and , due to the estimated motion, is computed. Then the weight of the edge between the two corresponding DOG nodes is updated. is computed by considering the motion vectors that connect each pixel in to another pixel of (Section 5). For each pair of connected pixels, one belonging to region and the other to , we consider the DOG nodes and to which and are respectively associated. The observed event gives evidence of the link between and , and, hence, the frequency count for is increased by a vote, scaled by to avoid penalizing smaller regions. Moreover, in the computation we consider only votes involving regions of comparable size, i.e. , to reduce the effects of significant changes in the detection of regions in two consecutive frames. Finally, since a DOG node corresponds to all the region descriptors that hit it, the total votes accumulated for the edge between and are also scaled by the number of distinct regions of that contributed to the votes. Similarly to the spatial case, sparse connectivity is favored by pruning the estimates below a given threshold , in order to avoid adding weak connections due to noisy motion predictions.

The edge weights are computed by averaging the computed over time, as . This step can be done with an incremental update that does not require storing the likelihood estimates for all the time steps.
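The incremental form of the time average can be sketched as the standard running-mean update, which is our reading of the update rule (the paper only states that past estimates need not be stored):

```python
def update_motion_weight(w_prev, t, p_t):
    """Running time average of the motion likelihood estimates:
    w_t = ((t - 1) * w_{t-1} + p_t) / t, for t = 1, 2, ...
    Only the previous average and the step counter are kept in memory."""
    return ((t - 1) * w_prev + p_t) / t
```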

The agent interacts with the external environment, gathering different kinds of knowledge over the data in , represented under the unifying notion of constraint [20] (Section 4). The most prominent example of knowledge comes from the interaction with human users (Section 2) who provide custom class-supervisions (with values in for negative and positive supervision, respectively). For each DOG node, the agent will be able to predict tag-scores for those classes for which it received at least one (positive) supervision.

At a given time , let us suppose that the agent has received supervisions for a set of classes. We indicate with the function that models the predictor of the -th class, and, again, we follow the framework of Section 4. We select the Gaussian kernel and we also assume that the agent is biased towards negative predictions, i.e., we add a fixed bias term in eq. (4) equal to , allowing it to learn from positive examples only.

The constraints in are of two types: supervision constraints , and coherence constraints . margin: Supervision
The former enforce the fulfillment of labels on some DOG nodes and for some functions . For each , the supervised nodes are collected into the set , and


The scalar is the belief [20] of each point-wise constraint. When a new constraint is provided by the user, its belief is set to a fixed initial value. Then, is increased if the user provides the same constraint multiple times, or decreased in case of mismatching supervisions, keeping . This allows the agent to focus on those supervisions that have been frequently provided, and to give less weight to noisy and incoherent labels. Weak supervisions (Section 3) on the tag are converted into constraints as in eq. (7) by determining whether there exists a node associated to the current frame for which the -th tag-score is above a predefined threshold.
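The belief update for repeated or mismatching supervisions can be sketched as follows. The step size and the clamping interval [0, 1] are assumptions (the paper only states that the belief is increased, decreased, and kept bounded):

```python
def update_belief(belief, agree, step=0.1):
    """Belief of a point-wise supervision constraint: increase it when the
    user confirms the same supervision, decrease it on a mismatching one,
    and clamp the result to [0, 1]."""
    belief = belief + step if agree else belief - step
    return min(1.0, max(0.0, belief))
```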

margin: Spatial
and motion

The coherence constraints enforce a smooth decision over connected vertices of the DOG,


leading to an instance of the classical manifold regularization [37]. In this case, the belief of each point-wise constraint is , that is given by a linear combination of the aforementioned edge weights and ,


Here defines the global weight of the coherence constraints while can be used to tune the strength of the spatial-based connections w.r.t. the motion-based ones.
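The coherence penalty over the DOG edges can be sketched as the usual manifold-regularization quadratic form, with edge beliefs obtained as a linear combination of the spatial and motion weights. The exact form of the combination and all parameter names are our assumptions:

```python
import numpy as np

def coherence_penalty(f, W_spatial, W_motion, lam, mu):
    """Manifold-regularization-style penalty over DOG edges:
    0.5 * sum_{i,j} beta_ij * (f_i - f_j)^2, with edge beliefs
    beta_ij = lam * (mu * s_ij + (1 - mu) * m_ij).
    `lam` is the global weight; `mu` trades spatial vs. motion edges."""
    beta = lam * (mu * W_spatial + (1.0 - mu) * W_motion)
    diff = f[:, None] - f[None, :]
    return 0.5 * float(np.sum(beta * diff**2))
```

A constant prediction over connected nodes incurs no penalty; large differences across strongly connected nodes are penalized quadratically.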

Following Section 4, the solution of the problem is given by eq. (4) and it is a kernel expansion , where is the real-valued descriptor associated to the corresponding DOG node in . The set of DOG nodes progressively grows as the video stream is processed, up to a predefined maximum size (due to the selected memory budget). When the set reaches the maximum allowed size, the adopted removal policy selects those nodes that have not recently been hit by any descriptor, that have a small number of hits, and that are not involved in any supervision constraint.
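The tag-score predictor can be sketched as a Gaussian-kernel expansion over the DOG nodes plus the fixed negative bias mentioned above, so that an unseen class defaults to a negative score and positive examples alone can raise it. All names and the bias value are our illustrative choices:

```python
import numpy as np

def predict_tag(x, nodes, alphas, bias, sigma):
    """Tag-score of descriptor x: Gaussian-kernel expansion over the DOG
    node descriptors `nodes` with coefficients `alphas`, plus a fixed
    (negative) bias towards negative predictions."""
    k = np.exp(-np.sum((nodes - x)**2, axis=1) / (2 * sigma**2))
    return float(alphas @ k) + bias
```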

Section 7 Experiments

In this section, we present experimental results to evaluate several aspects of the DVA architecture, from feature extraction up to the symbolic level of semantic labeling. Experiments were carried out on a variety of different videos ranging from artificial worlds and cartoons to real-world scenes, to show the flexibility of learning in unrestricted visual environments. The website of the project (http://dva.diism.unisi.it) hosts supplementary material with video sequences illustrating several case studies. The DVA software package can also be downloaded and installed under different versions of Linux and Mac OS X.

7.1 Feature extraction

We evaluated the impact of the invariances in building the set from which the low-level features are learned. A shallow DVA (1 layer) was run on three unrelated real-world video sequences from the Hollywood Dataset HOHA 2 [35], so as to explore the growth of . Videos were rescaled to , and converted to grayscale. We selected an architecture with receptive fields, , and we repeated the experiment by activating invariances to different classes of geometric transformations (with , , , , see Section 5). Figure 4 highlights the crucial impact of using invariances for reducing . We set a memory budget that allowed the DVA to store up to 6,000 ’s in . When full affine invariance is activated, there is a significant reduction of , thus simplifying the feature learning procedure. In the case with no invariances, we reached the budget limit earlier than in the case of scale invariance only.

Figure 4: The size of (Section 5) when different invariances to geometric transformations are activated. In these experiments, the memory budget was set to 6,000 data points.

A deeper DVA (3 layers) processed the same sequences in order to learn features per layer (one category, ). The same architecture was also used to process a cartoon clip with a different resolution. Figure 5 shows the feature maps on four sample frames. Each pixel is depicted with the color that corresponds to the winning feature, i.e., the color of is indexed by . While the features of the lowest layer closely follow the details of the input, higher layers develop functions that capture more abstract visual patterns. From the third row of Figure 5, we can see that bright red pixels basically indicate the feature associated with constant receptive inputs. Moving toward the higher layers, such a feature becomes less evident, since the hierarchical application of the receptive fields virtually captures larger portions of the input frame, thus reducing the probability of constant patterns. From the last two rows of Figure 5, the orange feature seems to capture edge-like patterns, independently of their orientation, thanks to the invariance property of the DVA features. For instance, we can appreciate that such a feature is high both along vertical and horizontal edges. Notice that feature orientation, scale, and other transformation-related properties are determined by the heuristic search procedure of Section 5. Hence, for each pixel, these transformations can also be recovered.

Frame                             Layer 0                            Layer 1                          Layer 2

Figure 5: The feature maps of a 3 layered DVA, processing a cartoon clip (Donald Duck, © The Walt Disney Company) and a sequence from the Hollywood Dataset HOHA 2 [35]. Each pixel is depicted with the color that corresponds to the winning feature (best viewed in color).

7.2 The role of motion

Motion plays a crucial role at several levels of the DVA architecture. In Section 5 we have seen that handling invariances to geometric transformations allows the DVA to estimate motion. It turns out that imposing motion coherence on low-level features also yields a proper estimate of the velocity of each pixel.

An example of motion estimation is given in Figure 6, where in the third column each pixel is colored with a different hue according to the angle of its velocity vector. In this context, the use of motion coherence in feature extraction is also crucial to speed up the solution of eq. (6). We again used three random clips from the HOHA 2 dataset, and measured the average computational time required by a 1-layer DVA to process one frame at resolution. The impact of motion is dramatic, as the required time dropped from 5.2 seconds per frame to 0.3 seconds per frame on an Intel Core-i7 laptop. It is worth mentioning that time budget requirements can also be imposed in order to further speed up the computation and perform real-time frame processing.

Figure 6: Two examples of motion estimation. Two subsequent frames are presented (columns 1 and 2) along with a colormap of the per-pixel motion estimation (column 3), where colors indicate the angle of the estimated velocity vector, according to the hue colorwheel (where red corresponds to degrees). Top row: the dog is lowering its leg (green/yellow) while moving its body up (cyan/blue) and its face and ear up-right (violet/red) (Donald Duck, © The Walt Disney Company). Bottom row: the actor is turning his head towards the left edge of the frame (cyan) while moving his right arm towards his chest (red).

Now we show some results illustrating the impact of using motion coherence in the process of region aggregation. Figure 7 shows a pair of consecutive frames taken from a Pink Panther cartoon (top row), together with the results obtained by the region-growing algorithm without (middle row) and with (bottom row) motion coherence. Regions having the same color are mapped to the same DOG node, therefore sharing very similar descriptors (their distance being , see Section 6). This example shows that the role of motion is crucial in order to obtain regions that are coherent through time161616Clearly, this dramatically simplifies the subsequent recognition process.. In the middle row we can observe that the body of the Pink Panther changes from blue to orange, while it is always light green when motion information is exploited (bottom row); the water carafe, the faucet, and the tiles in the central part of the frame are other examples highlighting this phenomenon.

Figure 7: The effect of motion coherence on aggregation. Top: two consecutive frames in a Pink Panther cartoon (© Metro Goldwyn Mayer); middle/bottom: aggregation without/with motion coherence. The same color corresponds to the same node within the DOG, which means identical descriptors, up to tolerance . More nodes are kept by exploiting motion coherence (e.g., the body of the Pink Panther, the water carafe, the faucet and tiles).
margin: Digit test:
learning to see
with one

In order to investigate the effect of motion constraints in the development of the symbolic functions (Section 6), we created a visual environment of Lucida (antialiased) digits which move from left to right by generic roto-translations. While moving, the digits also scale up and down. Each digit (from “0” to “9”) follows the same trajectory, as shown in Figure 8 for the case of digit “2”. The visual environment consists of a (loop) collection of 1,378 frames with a resolution of pixels, and a playback speed of frames per second.

Figure 8: The digits visual environment. Each digit moves from left to right by translations and rotations. While moving, it also scales up and down, as depicted for digit “2”. DVAs are required to learn the class of any pixel in any frame.

A DVA processed the visual environment while a human supervisor interacted with the agent by providing only 11 pixel-wise positive supervisions, i.e., 1 per digit and 1 for the background. The descriptors of rotated/scaled instances of the same digit turned out to be similar, due to the invariance properties of the low-level features. On the other hand, not all descriptors were mapped to the same DOG node, since, when imposing memory and time budgets, the solution of eq. (6) by local coherence (Section 5) might yield suboptimal results. Because of the simplicity of this visual environment, we selected a shallow DVA architecture, and we kept the processing of the video real-time: receptive fields, with minimum equal to , and output features. We compared three different settings for constructing the symbolic functions: the first is based on supervision constraints only; the second adds spatial coherence constraints; the last also includes motion coherence constraints. We also evaluated a baseline linear SVM that processed the whole frames, rescaled to , where each pixel-wise supervision was associated to the corresponding frame. For this reason, when reporting the results, we excluded the background class, which cannot be predicted by the baseline SVM, while the DVA can easily predict the background just like any other class. We also generated negative supervisions, not used by the DVA, to train the SVM in a one-vs-all scheme. Table 1 reports the macro accuracy for the digit classes (excluding the class of digit “9”, which is not distinguishable from a rotated instance of digit “6”).

Model Accuracy

SVM classifier (baseline)

DVA, 10 sup. constraints 87.47%
DVA, 10 sup. + spatial coherence constr. 92.70%
DVA, 10 sup. + spatial/motion coherence constr. 99.76%
Table 1: Macro accuracy on the digit visual environment. Notice that these experiments consider the extreme case in which each class received one supervision only. Motion coherence constraints play a fundamental role to disentangle the ambiguities among similar digits.

Clearly, the SVM classifier, which uses full-frame supervisions only, does not generalize in the digit visual environment, whereas the DVA produces very good predictions even with only one supervision per class. Spatial coherence constraints allow the DVA to better generalize the predictions to unlabeled DOG nodes, thus exploiting the underlying manifold of the descriptor space. However, it turns out that the classes “2” and “5” are confused, due to the invariance properties of the low-level features, which yield spatially similar descriptors for these classes. When introducing motion-based constraints, the DVA disentangles these ambiguities, since the motion flow enforces a stronger coherence over those subsets of nodes that are related to the same digit. Notice that the enforcement of motion constraints is not influenced by the direction of movement; moving along different trajectories (e.g., playing the video frames in reverse order) generates the same results as Table 1.

The experimental results reported for the DVA refer to the case in which it processes the given visual environment without making any distinction between learning and test sets. The results shown in Table 1 refer to a configuration in which the overall structure of the DVA does not change significantly as time goes by. We also constructed an analogous artificial dataset by using the Comic Sans MS font instead of Lucida. As shown in Figure 9, testing the agent on this new digit visual environment, without supplying any additional supervision, yielded very similar results, also when playing the video in reverse order.

Figure 9: Generalization on a different digit visual environment: Lucida (left) vs. Comic Sans MS (right). Only one supervision per digit on the Lucida font was given.

7.3 Crowd-sourcing and data-base evaluation

DVAs can naturally be evaluated within the L2SLC protocol by crowd-sourcing at http://dva.diism.unisi.it/rating.html. In this section we give insights on the performance of DVAs on some of the many visual environments that we have been experimenting in our lab.

margin: Artificial
and natural
visual environments

A visual environment from the AI-lab of UNISI was constructed using a 2-minute video stream acquired by a webcam. During the first portion of the video, a supervisor interacted with the DVA by providing as many as 72 pixel-wise supervisions, out of which only 2 were negative. The supervisions covered four object classes (bottle, chair, journal, face). In the remaining portion of the video no further supervision was given. Figure 10 collects examples of both user interactions and the most confident DVA predictions, highlighting only regions having the highest tag score above the threshold. The frame sequence is ordered following the real video timeline (left to right, top to bottom). The first two rows show samples taken from the first portion of the video, where red-framed pictures mark user supervisions, while the others illustrate the DVA’s responses. For example, we can observe a wrong “bottle” label predicted over the black monitor in the third sample, which is corrected later on by a subsequent negative supervision. The last two rows refer to the video portion in which no supervisions were provided. The system is capable of generalizing predictions even in the presence of small occlusions (chair, bottle), or in cases where objects appear in contexts that are different from the ones in which they were supervised (bottle, journal, face). Humans involved in the crowd-sourcing assessment are likely to provide different scores, but it is clear that the learning process of the DVA leads to remarkable performance from only a few human supervisions.

Figure 10: Sample predictions and user interactions taken from a real-world video stream. User interactions are marked with red crosses (in red-framed pictures). The most confident DVA predictions are highlighted only for regions having the highest tag score above . Only the first two rows refer to the portion of the video during which the user was providing supervisions (70 positive, 2 negative labels).

Following the same paradigm, DVAs have been developed on several visual environments, ranging from cartoons to movies. Regardless of the visual environment, we use a few supervisions, ranging from 1 to 10 per class, for a number of categories between 5 and 10. Many of these DVAs can be downloaded at http://dva.diism.unisi.it with screenshots and video sequences from which the frames of Figure 11 were extracted.

Figure 11: Some examples of semantic labeling performed by DVAs on a number of different videos (Donald Duck, © The Walt Disney Company; Pink Panther, © Metro Goldwyn Mayer). Only regions where the confidence of the prediction is above a certain threshold are associated with a tag.
margin: CamVid

Now, following the standard evaluation scheme, we show results on the Cambridge-driving Labeled Video Database (CamVid) [7]. This benchmark consists of a collection of videos captured by a vehicle driving through the city of Cambridge, with ground-truth labels from a set of 32 semantic classes associated with each frame. We reproduced the experimental setting employed by almost every recent work on the CamVid database [48, 1]. We considered only the 11 most frequent semantic classes, and used only the 600 labeled frames of the dataset (resulting in a video at 1Hz), splitting them into a training set (367 frames) and a test set (233 frames). For each ground-truth region in each frame of the training set, a supervision was provided to the DVA by computing the medoid of the region and attaching the supervision to the region, constructed by the DVA aggregation process, that contains the medoid pixel. It is worth mentioning that, unlike the existing work on this dataset, DVAs are conceived for online interaction rather than for massive processing of labeled frames. We therefore decided to use only a fraction of the available supervisions: while almost all the other approaches exploit all the supervised pixels (about 28 million), we used about 17,000 supervisions, that is, more than three orders of magnitude fewer. A variety of different approaches have been applied to this dataset. The state of the art is obtained by exploiting Markov Random Fields [1] or Conditional Random Fields [48]. In this paper, we do not compare DVAs against these approaches, because the analysis of a post-processing (or refinement) stage for DVA predictions based on spatial or semantic reasoning is beyond the scope of this work. We therefore compare only against methods that exploit appearance features, motion, and geometric cues.
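The medoid-based supervision transfer described above can be made concrete with a small sketch. The helper below (`region_medoid` and the toy region are our own illustration, not code from the DVA system) returns the region pixel minimizing the total distance to all other pixels of the region; unlike the centroid, it is guaranteed to lie inside the region, which makes it a safe pixel through which to attach the supervision to the DVA-aggregated region containing it:

```python
import numpy as np

def region_medoid(pixels):
    """Medoid of a set of (row, col) pixels: the member of the set that
    minimizes the total Euclidean distance to all other members.
    Unlike the centroid, the medoid always belongs to the region."""
    pts = np.asarray(pixels, dtype=float)
    # Pairwise distance matrix via broadcasting.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return tuple(int(v) for v in pts[d.sum(axis=1).argmin()])

# A plus-shaped ground-truth region: the medoid is its central pixel.
region = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
print(region_medoid(region))  # (1, 1)
```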
We refer to (i) [8], where bag-of-textons are used as appearance features, and motion and structure properties are estimated from cloud points, and to (ii) [1], where several versions of a convolutional neural network (CNN) are tested, either enforcing spatial coherence among superpixels (CNN-superpixels), or weighting the contributions of multilayer predictions with a single scale (CNN-MR fine) or with multiple scales (CNN-multiscale). In order to incorporate information about region positions within the frame, which is an important feature in this scenario, we simply estimated from the training set the a priori probability of each class given the pixel coordinates and, for each region, multiplied the score computed by the DVA for each class by the prior probability of its centroid.

Table 2 shows the results of the experimental comparison. The performance of DVAs is better than CNN-superpixels and comparable with motion-structure cues, while slightly inferior to appearance cues. Not surprisingly, given the general nature of our approach, more specific methods oriented to this task perform better than DVAs on most classes. This is the case for the last three competitors in Table 2, which, besides exploiting a larger amount of supervision, rely on combinations of multiple hypotheses specifically designed for the benchmark.
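The position prior can be sketched in a few lines. In the sketch below, the 8x8 grid, the Laplace smoothing, and all function names are our own illustrative choices, not the actual DVA implementation: class frequencies are accumulated on a coarse spatial grid over the training label maps, and the per-class scores of a region are multiplied by the prior of the cell containing the region centroid.

```python
import numpy as np

def fit_position_prior(label_maps, n_classes, grid=(8, 8)):
    """Estimate P(class | grid cell) from training label maps
    (H x W integer arrays), with Laplace smoothing to avoid zeros."""
    gh, gw = grid
    counts = np.ones((gh, gw, n_classes))
    for lm in label_maps:
        h, w = lm.shape
        for r in range(h):
            for c in range(w):
                counts[r * gh // h, c * gw // w, lm[r, c]] += 1
    return counts / counts.sum(axis=-1, keepdims=True)

def rescore(scores, centroid, prior, img_shape):
    """Multiply the per-class scores of a region by the position
    prior of the cell containing the region centroid."""
    gh, gw, _ = prior.shape
    h, w = img_shape
    r, c = centroid
    return scores * prior[int(r) * gh // h, int(c) * gw // w]

# Toy training data: class 1 ("sky") always occupies the top half.
lm = np.zeros((16, 16), dtype=int)
lm[:8, :] = 1
prior = fit_position_prior([lm], n_classes=2)

# An ambiguous region near the top of the frame is pushed towards "sky".
rescored = rescore(np.array([0.5, 0.5]), (2, 8), prior, (16, 16))
# rescored[1] > rescored[0]: the "sky" class now dominates
```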
Method                    | Bldg | Tree | Sky  | Car  | Sign | Road | Ped. | Fence | Pole | Swalk | Bike | Avg. | Global
DVA                       | 28.3 | 34.9 | 95.7 | 31.3 | 22.2 | 91.4 | 54.4 | 29.2  | 11.9 | 74.4  | 13.4 | 44.3 | 63.6
CNN-superpixels [1]       |  3.2 | 59.7 | 93.5 |  6.6 | 18.1 | 86.5 |  1.9 |  0.8  |  4.0 | 66.0  |  0.0 | 30.9 | 54.8
Motion-Structure cues [8] | 43.9 | 46.2 | 79.5 | 44.6 | 19.5 | 82.5 | 24.4 | 58.8  |  0.1 | 61.8  | 18.0 | 43.6 | 61.8
Appearance cues [8]       | 38.7 | 60.7 | 90.1 | 71.1 | 51.4 | 88.6 | 54.6 | 40.1  |  1.1 | 55.5  | 23.6 | 52.3 | 66.5
CNN-MR fine [1]           | 37.7 | 66.2 | 92.5 | 77.0 | 26.0 | 84.0 | 50.9 | 43.7  | 31.0 | 65.7  | 29.7 | 54.9 | 68.3
CNN-multiscale [1]        | 47.6 | 68.7 | 95.6 | 73.9 | 32.9 | 88.9 | 59.1 | 49.0  | 38.9 | 65.7  | 22.5 | 58.6 | 72.9
Combined cues [8]         | 46.2 | 61.9 | 89.7 | 68.6 | 42.9 | 89.5 | 53.6 | 46.6  |  0.7 | 60.5  | 22.5 | 53.0 | 69.1
Table 2: Quantitative evaluation on the CamVid dataset on the 11 most frequent classes (Avg. = per-class average, Global = overall score; Bldg = Building, Sign = Sign-Symbol, Ped. = Pedestrian, Pole = Column-Pole, Swalk = Sidewalk, Bike = Bicyclist).

8 Conclusions

In this paper we provide a proof of concept that fully learning-based visual agents can acquire visual skills only by living in their own visual environment and by human-like interactions, according to the "learning to see like children" (L2SLC) communication protocol. margin: L2SLC: proof of concept

This is achieved by DVAs according to a lifelong learning scheme, in which the differences between supervised and unsupervised learning, and between learning and test sets, are dismissed. The most striking result is that DVAs provide early evidence of the capability of learning in any visual environment by using only a few supervised examples. This is mostly achieved by shifting the emphasis onto the huge amount of visual information that becomes available within the L2SLC protocol. Basically, motion coherence yields tons of virtual supervisions that are not exploited by most of today's state-of-the-art approaches.

margin: DVAs: social

The DVAs described in this paper can be improved in different ways, and many issues are still open. The most remarkable problem that we still need to address with appropriate theoretical foundations is the temporal evolution of the agents. In particular, the dismissal of the difference between learning and test sets, along with the corresponding classic statistical framework, opens a seemingly unbridgeable gap with the computer vision community, which is accustomed to blessing scientific contributions on the basis of appropriate benchmarks on common databases. However, some recent influential criticisms of the technical soundness of some benchmarks [49] might open the doors to the crowd-sourcing evaluation proposed in this paper. The current version of DVAs can be extended to perform action recognition, as well as higher-level cognitive tasks, by exploiting logic constraints, which are contemplated in the proposed theoretical framework.

margin: Machine

The need to respond to the L2SLC protocol has led to a deep network in which novel learning mechanisms have been devised. In particular, we have extended the framework of learning from constraints to the on-line processing of videos. The basic ideas of [20] have been properly adapted to the framework of kernel machines by an on-line scheme that operates in a transductive environment. We are currently pursuing an in-depth reformulation of the theory given in [20] on the feature manifold driven by the visual data as temporal sequences. Instead of using the kernel machine mathematical and algorithmic apparatus, in that case the computational model is based on ordinary differential equations on manifolds.17

17 This seems to be quite a natural solution, which has more solid foundations than updating schemes based on kernel methods.
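As a purely illustrative sketch of the ODE-based view (the notation below is our own, not the exact formulation being developed from [20]), one can think of the agent's parameters as following a gradient flow driven by the visual stream:

```latex
% Gradient-flow sketch: the parameters w(t) evolve continuously,
% driven by an energy E that collects the visual constraints.
\dot{w}(t) = -\eta \, \nabla_{w} E\bigl(w(t), v(t)\bigr)
```

Here $w(t)$ collects the agent's parameters, $v(t)$ is the visual stream at time $t$, $\eta > 0$ is a dissipation factor, and $E$ aggregates the penalty terms deriving from supervisions and motion-coherence constraints; learning and inference then coincide with integrating the ODE over the agent's life.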

margin: En plein air: birth of the movement
Regardless of the performance of DVAs, this paper suggests that other labs can naturally face the challenge of learning to see like children. This could give rise to the birth of an "en plein air" movement in computer vision, which could stimulate a paradigm shift in the way machine learning is used. Hence, the most important contribution of this paper might not be the specific structure of DVAs, but the computational framework in which they are constructed.


Acknowledgments

The results coming from this research could not have been achieved without the contribution of many people, who have provided suggestions and support in different forms. In particular, we thank Salvatore Frandina, Marcello Pelillo, Paolo Frasconi, Fabio Roli, Yoshua Bengio, Alessandro Mecocci, Oswald Lanz, Samuel Rota Bulò, Luciano Serafini, Ivan Donadello, Alberto Del Bimbo, Federico Pernici, and Nicu Sebe.


References

  • [1] José Manuel Álvarez, Yann LeCun, Theo Gevers, and Antonio M. López. Semantic road segmentation via multi-scale ensembles of learned features. In Andrea Fusiello, Vittorio Murino, and Rita Cucchiara, editors, ECCV Workshops, volume 7584 of Lecture Notes in Computer Science, pages 586–595. Springer, 2012.
  • [2] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
  • [3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.
  • [4] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors, ICML, volume 382 of ACM International Conference Proceeding Series. ACM, 2009.
  • [5] Horst Bischof, Samuel Rota Bulò, Marcello Pelillo, and Peter Kontschieder. Structured labels in random forests for semantic labelling and object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(PrePrints):1, 2014.
  • [6] Jake Bouvrie, Lorenzo Rosasco, and Tomaso Poggio. On invariance in hierarchical models. In Advances in Neural Information Processing Systems, pages 162–170, 2009.
  • [7] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2008.
  • [8] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In David A. Forsyth, Philip H. S. Torr, and Andrew Zisserman, editors, ECCV (1), volume 5302 of Lecture Notes in Computer Science, pages 44–57. Springer, 2008.
  • [9] Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
  • [10] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
  • [11] Charles Darwin. A biographical sketch of an infant. Mind, 2:285–294, 1877.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [13] Michelangelo Diligenti, Marco Gori, Marco Maggini, and Leonardo Rigutini. Bridging logic and kernel machines. Machine learning, 86(1):57–88, 2012.
  • [14] Velma Dobson and Davida Y. Teller. Visual acuity in human infants: a review and comparison of behavioral and electrophysiological studies. Vision Research, 18(11):1469–1483, 1978.
  • [15] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1915–1929, 2013.
  • [16] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
  • [17] Salvatore Frandina, Marco Lippi, Marco Maggini, and Stefano Melacci. On-line Laplacian one-class support vector machines. In Artificial Neural Networks and Machine Learning–ICANN 2013, pages 186–193. Springer Berlin Heidelberg, 2013.
  • [18] Kunihiko Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks, 1(2):119–130, 1988.
  • [19] Giorgio Gnecco, Marco Gori, and Marcello Sanguineti. Learning with boundary conditions. Neural Computation, 25(4):1029–1106, April 2013.
  • [20] Giorgio Gnecco, Marco Gori, Stefano Melacci, and Marcello Sanguineti. Foundations of support constraint machines. Neural Computation (accepted for publication): http://www.dii.unisi.it/~melacci/lfc.pdf, 2014.
  • [21] Marco Gori. Semantic-based regularization and piaget’s cognitive stages. Neural Networks, pages 1035–1036, 2009.
  • [22] Marco Gori, Marco Lippi, Marco Maggini, and Stefano Melacci. On-line video motion estimation by invariant receptive inputs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 712–717, 2014.
  • [23] Marco Gori and Stefano Melacci. Constraint verification with kernel machines. Neural Networks and Learning Systems, IEEE Transactions on, 24(5):825–831, 2013.
  • [24] Marco Gori, Stefano Melacci, Marco Lippi, and Marco Maggini. Information theoretic learning for pixel-based visual agents. In Computer Vision–ECCV 2012, pages 864–875. Springer Berlin Heidelberg, 2012.
  • [25] Berthold K. Horn and Brian G. Schunck. Determining optical flow. In 1981 Technical Symposium East, pages 319–331. International Society for Optics and Photonics, 1981.
  • [26] David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106, 1962.
  • [27] Koray Kavukcuoglu, Marc’Aurelio Ranzato, Rob Fergus, and Yann LeCun. Learning invariant features through topographic filter maps. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 2009.
  • [28] Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michaël Mathieu, and Yann LeCun. Learning convolutional feature hierachies for visual recognition. In Advances in Neural Information Processing Systems (NIPS 2010), 2010.
  • [29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [30] Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeffrey Dean, and Andrew Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
  • [31] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [32] Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(12):2368–2382, 2011.
  • [33] David G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [34] David Marr. Vision: A Computational Investigation Into the Human Representation and Processing of Visual Information. Henry Holt and Company, 1982.
  • [35] Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. Actions in context. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2929–2936. IEEE, 2009.
  • [36] Stefano Melacci. A survey on camera models and affine invariance. Technical report, University of Siena, Italy, http://dva.diism.unisi.it/publications.html, 2014.
  • [37] Stefano Melacci and Mikhail Belkin. Laplacian Support Vector Machines Trained in the Primal. Journal of Machine Learning Research, 12:1149–1184, March 2011.
  • [38] Stefano Melacci and Marco Gori. Unsupervised learning by minimal entropy encoding. Neural Networks and Learning Systems, IEEE Transactions on, 23(12):1849–1861, 2012.
  • [39] Stefano Melacci, Marco Lippi, Marco Gori, and Marco Maggini. Information-based learning of deep architectures for feature extraction. In Image Analysis and Processing–ICIAP 2013, pages 101–110. Springer Berlin Heidelberg, 2013.
  • [40] Stefano Melacci, Marco Maggini, and Marco Gori. Semi–supervised learning with constraints for multi–view object recognition. In Artificial Neural Networks–ICANN 2009, pages 653–662. Springer, 2009.
  • [41] Jean-Michel Morel and Guoshen Yu. ASIFT: A new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences, 2(2):438–469, 2009.
  • [42] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, pages 1717–1724. IEEE, 2014.
  • [43] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
  • [44] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [45] Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired by visual cortex. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 994–1000. IEEE, 2005.
  • [46] Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition–tangent distance and tangent propagation. In Neural networks: tricks of the trade, pages 235–269. Springer, 2012.
  • [47] Joseph Tighe and Svetlana Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3001–3008. IEEE, 2013.
  • [48] Joseph Tighe and Svetlana Lazebnik. Superparsing - scalable nonparametric image parsing with superpixels. International Journal of Computer Vision, 101(2):329–349, 2013.
  • [49] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528. IEEE, 2011.
  • [50] Gerald Turkewitz and Patricia A Kenny. Limitations on input as a basis for neural organization and perceptual development: A preliminary theoretical statement. Developmental psychobiology, 15(4):357–368, 1982.
  • [51] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.
  • [52] Manuel Werlberger, Werner Trobin, Thomas Pock, Andreas Wedel, Daniel Cremers, and Horst Bischof. Anisotropic huber-l1 optical flow. In BMVC, volume 1, page 3, 2009.
  • [53] Herman Wold. Path Models with latent variables: The NIPALS approach. Acad. Press, 1975.