Code to reproduce the experiments on our NeurIPS2020 paper
Unsupervised learning from continuous visual streams is a challenging problem that cannot be naturally and efficiently managed in the classic batch-mode setting of computation. The information stream must be carefully processed according to an appropriate spatio-temporal distribution of the visual data, while most learning approaches commonly assume a uniform probability density. In this paper we focus on unsupervised learning for transferring visual information in a truly online setting by using a computational model that is inspired by the principle of least action in physics. The maximization of the mutual information is carried out by a temporal process which yields online estimation of the entropy terms. The model, which is based on second-order differential equations, maximizes the information transfer from the input to a discrete space of symbols related to the visual features of the input, whose computation is supported by hidden neurons. In order to better structure the input probability distribution, we use a human-like focus of attention model that, coherently with the information maximization model, is also based on second-order differential equations. We provide experimental results to support the theory by showing that the spatio-temporal filtering induced by the focus of attention allows the system to globally transfer more information from the input stream over the focused areas and, in some contexts, over the whole frames with respect to the unfiltered case that yields uniform probability distributions.
Nowadays the most popular benchmarks in the machine learning community are composed of batches of data that are commonly processed in an offline manner using stochastic updates of the model parameters, periodically shuffling the available samples (Russakovsky et al., 2015; Krizhevsky, 2009; Damen et al., 2018). Less effort has been devoted by the research community to focusing on a single, potentially life-long video, in which the model continuously processes a stream of frames, which is a very natural setting resembling the flow of information that hits the eyes of each human Betti et al. (2020b). An important feature of the human visual system that is frequently neglected in several algorithms is the attention mechanism that drives the gaze over different spatial regions of the input stimulus. As a matter of fact, it is implicitly assumed that all the pixels equally contribute to the learning process, assuming a uniform probability distribution of their coordinates over the retina. In the last few years, considerable importance has been given to attention in neural models, for example in learning to play games Zhang (2019), in learning task-specific attention Mnih et al. (2014), or in mixing bottom-up and top-down attention Xiao et al. (2015). A different research direction, closer to Neuroscience, is the one that specifically studies saliency in the context of the human visual attention systems Borji and Itti (2012), where dynamic models of visual attention have been recently proposed, able to predict in an online manner the trajectory of the attention Zanca and Gori (2017); Zanca et al. (2019).
In this paper, we cast the problem of processing a visual stream in a truly online setting, motivated by recent studies that connected learning over time and classical mechanics Betti et al. (2020b, a, 2019). The framework proposed in Betti et al. (2020b) naturally deals with learning problems in which time plays a crucial role, and it is well-suited to learn from streams of visual data in a principled way. The temporal trajectories of the variables of the learning problem are modeled by the so-called 4th order Cognitive Action Laws (CALs) that come from stationarity conditions of a functional, as happens for generalized coordinates in classical mechanics. We intersect these ideas with the recent human-like attention model of Zanca et al. (2019), which has shown state-of-the-art results in focus estimation. Motion and visual features are treated as a mass distribution in the gravitational field that determines the trajectory of the focus of attention. The focus of attention implements a filtering procedure on the input video, allowing the system to deal only with those areas that would attract human attention. We propose a 2nd order model that, under some mild conditions, leads to a simplified and more manageable instance of the CALs, yielding ODEs of the same order as the ones that drive the attention.
With the goal of studying the impact of the focus of attention dynamics in videos, we consider the problem of transferring information from the input visual stream to the output space of a neural architecture that performs pixel-wise predictions Betti et al. (2020a, 2019). This problem consists in maximizing the Mutual Information (MI) index Betti et al. (2020b). One of the key issues with MI maximization over time, especially when focusing the attention on a few pixels, is the fact that stochastic updates of the model parameters do not keep track of the entropy of the output space due to the data processed so far, leading to poorly informed updates. We investigate the case in which the global changes in the entropy of the output space are approximated by introducing a specific constraint or a moving average. It turns out that, when learning over the focus trajectory, the MI index grows more significantly over the focused areas with respect to the unfiltered case, and, in some configurations, it is also larger than considering other distributions of the pixel coordinates. This suggests that filtering the information by a bottom-up attention model helps the system in transferring information from the whole stream.
The topic of MI maximization has recently attracted the attention of several researchers Belghazi et al. (2018); Hjelm et al. (2019); Tian et al. (2019); Oord et al. (2018); Tschannen et al. (2020). Most of the recent works are about customized MI-based criteria to learn representations for downstream tasks, which is not the case in this paper. Moreover, Hjelm et al. (2019); Tian et al. (2019) are based on surrogate functions that loosely approximate Tschannen et al. (2020) the continuous MI formulation, while here we directly consider the discrete MI index, which, for instance, has been previously used as a criterion to relate different views of the input data Hu et al. (2017) or in clustering Melacci and Gori (2012). The information transferred by multi-layer networks is discussed in the context of the popular information bottleneck principle by Naftali Tishby and other authors as a means to study deep network internal dynamics Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017); Saxe et al. (2019).
In summary, the contributions of this paper are: (1) we study human-like attention mechanisms in conjunction with learning in video data, (2) considering a new 2nd order differential model and (3) evaluating the impact of different criteria to approximate the entropy estimate over the whole stream. This paper is organized as follows. Section 2 describes the learning framework, 2nd order models, and the problem of MI maximization. Section 3 is about injecting the focus of attention dynamics, while experiments are reported in Section 4. Section 5 concludes the paper with ideas for future work.
We consider the problem of processing a stream of data over time and, in particular, a stream of video frames from a target source, with one frame for each time $t$ in the time horizon $[0, T]$. The stream is processed by a neural network whose weights and biases at time $t$ are represented by the generic vector variable $w(t)$, while $\dot{w}(t)$, $\ddot{w}(t)$ are respectively its first and second derivatives. Our work is rooted in the ideas presented in Betti et al. (2020b, a, 2019), where learning is described in analogy with classical mechanics, as a variational problem whose objective is to find a stationary point of the following functional of the maps $t \mapsto w(t)$,
The Lagrangian is composed of a kinetic energy and a potential energy , while , when appropriately chosen, is responsible for introducing energy dissipation. The term is selected depending on the way is implemented (see Betti et al. (2020a) for details; in this paper we changed the notation w.r.t. Betti et al. (2020a) in order to simplify the description of our approach). In particular, in Betti et al. (2020a, 2019, b) we have , , is composed of the loss function of the considered problem and a quadratic regularizer on , and includes the squared norm of the derivatives plus their dot product, leading to
where and , , , are custom positive scalars, is the Euclidean norm in and is the standard scalar product in , being the size of .
The Euler-Lagrange (EL) equations of Eq. (2) yield the Cognitive Action Laws (CALs), 4th order differential equations that, when integrated, allow the weights to be updated over time. In particular, they are (we removed the time index to simplify the notation, and we will occasionally do so also in the rest of the paper)
being and the fourth and third derivatives of , respectively, and is the gradient of with respect to its first argument. Cauchy’s initial conditions can be provided on and , while stationarity conditions of prescribe that Eq. (3) must be paired with boundary conditions on the right border (). Thus, in order to solve the problem of determining in a causal way (i.e. in such a way that the solution at time does not depend on values in ), the fulfilment of the boundary conditions in is approximated in Betti et al. (2020a) by introducing a mechanism that sets ( “resets”) to zero all the derivatives up to , whenever their norms become too large. See Betti et al. (2020a) for more details on CALs.
Despite their robust principled formulation, the main drawbacks of the 4th order CALs are the difficulty in tuning the parameters that weigh the contribution of the derivatives, and the computational/memory burden due to the integration of a 4th order ODE. Moreover, the theoretical guarantees on the stability of Eq. (3) have been shown experimentally not to be strictly needed, mostly due to the aforementioned derivative reset procedure Betti et al. (2020a). For these reasons, in this paper we use the CAL theory in a particular causal regime of the parameters for which two important simplifications are attained. First, the dynamics of the weights are described by a 2nd order ODE (instead of Eq. (3)). Second, we get direct causality without the need for any reset mechanism.
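To make the notion of causal integration concrete, the following is a minimal sketch, assuming a generic damped second-order law of the form `rho * w'' + nu * w' + grad_u(w, t) = 0`. The coefficient names `rho` and `nu`, the toy quadratic potential, and the semi-implicit Euler scheme are illustrative assumptions and not the settings used in the paper.

```python
def integrate_second_order(grad_u, w0, v0, rho=1.0, nu=1.0, dt=0.01, steps=1000):
    """Integrate rho*w'' + nu*w' + grad_u(w, t) = 0 forward in time.

    Only the current state (w, v) is needed at each step, so the update is
    causal: no condition at the end of the time horizon is required.
    """
    w, v = list(w0), list(v0)
    t = 0.0
    for _ in range(steps):
        g = grad_u(w, t)
        # Semi-implicit Euler: update the velocity first, then the position.
        v = [vi + dt * (-(nu * vi + gi) / rho) for vi, gi in zip(v, g)]
        w = [wi + dt * vi for wi, vi in zip(w, v)]
        t += dt
    return w, v


# Toy potential U(w) = 0.5 * |w - target|^2, whose gradient is (w - target).
target = [1.0, -2.0]
grad = lambda w, t: [wi - ti for wi, ti in zip(w, target)]
w_final, v_final = integrate_second_order(
    grad, w0=[0.0, 0.0], v0=[0.0, 0.0], steps=5000
)
```

With these toy settings the state behaves like a damped oscillator and settles on `target`. The only state carried between steps is the pair `(w, v)`, which is half of what a 4th order law would require.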
The limiting procedure that leads to the 2nd order laws is based on a conjecture by De Giorgi Ambrosio et al. (2006) which has been subsequently proved and studied in Stefanelli (2011); Serra and Tilli (2012); Liero and Stefanelli (2013). In detail, we consider a reparametrization in terms of of the functional, where , , . This allows us to rewrite Eq. (2) in line with De Giorgi’s functional,
where we also chose, for simplicity, . Letting , the minima of the functional with fixed initial conditions on and converge to the solution of a Cauchy problem based on a 2nd order differential equation, thus gaining full causality, i.e., measures the “degree of causality” of the solution. Notice that the factor in Eq. (4) becomes peaked on as , and the minimization procedure of will be mainly concerned with the minimization of the loss calculated at . At first glance, this might seem counter-intuitive. However, it becomes a useful feature when considered in conjunction with the properties of the input signal . Let us indicate with the temporal scale of , that is a small time span under which the variations of are semantically negligible. The whole temporal interval can be partitioned into disjoint intervals , in each of which the aforementioned peaked behaviour is not critical due to the temporal scale of . The minimization of Eq. (4) can be iteratively defined by minimizing in each interval, where the conditions on the left boundary are given by the solution of the minimization in the previous interval. When , the minimization problem can be well interpreted in terms of the value of , for .
To introduce the EL equations of the newly introduced problem, for simplicity, we will describe the limiting procedure in the interval , that applies to each of the previously described intervals. The EL equations for the minimizer of with initial conditions and are
and the following theorem holds:
The solution of the problem (5) converges (weakly in to the solution of
We consider the problem of transferring information from an input visual stream to the output space of a multi-layer convolutional network with layers, that processes each frame and yields pixel-wise predictions. This corresponds to the maximization of the Mutual Information (MI) from the pixels of the input frames to the -dimensional output space yielded by the units of the last layer, being the size of the filter bank in layer
. Hyperbolic tangent is used as activation function in each layer, while the last layer is equipped with a softmax activation, generating probabilities , being a pair of pixel coordinates and the processed frame. This problem is studied in Betti et al. (2020b) and related papers Betti et al. (2019, 2020a), where single-layer models (or stacks of sequentially trained single-layer models) are considered, while, in this paper, we exploit a deep network trained end-to-end. Previous approaches based on kernel machines can be found in Gori et al. (2016, 2012).
In order to define the MI index, we consider a generic, time independent weight configuration . We introduce the average output activation on the video portion between time instants and ,
where is a spatio-temporal density and is the set of points that constitute the retina. The MI index over the video portion , is defined as
where is the entropy function, and and are random variables ( is discrete) associated with the input and output space, respectively. Since we are dealing with convolutional features, a realization of the input random variable is specified by the coordinates of a point , the value of the temporal instant and the value of the video . When the logarithm is taken in base equal to the number of output symbols, the MI is in $[0, 1]$, which is what we will assume in the rest of the paper. When no further information is available, is commonly assumed to be uniform in time and space and it is normalized such that .
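As a worked illustration of the discrete MI index, the sketch below computes the difference between the entropy of the average output activation and the average entropy of the per-location outputs, using a logarithm whose base is the number of output symbols. The function names and the plain-list data layout are assumptions made for this example, and the entropy weights discussed later are omitted.

```python
import math


def entropy(p, base):
    """Shannon entropy of a discrete distribution, in the given log base."""
    return -sum(pj * math.log(pj, base) for pj in p if pj > 0.0)


def mutual_information(probs, density=None):
    """MI between input locations and n >= 2 output symbols.

    probs   : list of per-location probability vectors (softmax outputs)
    density : non-negative weights over locations that sum to 1
              (uniform if omitted)
    """
    n = len(probs[0])
    if density is None:
        density = [1.0 / len(probs)] * len(probs)
    # Average output activation under the chosen density.
    mean_p = [sum(m * p[j] for m, p in zip(density, probs)) for j in range(n)]
    h_output = entropy(mean_p, n)  # entropy of the output symbols
    h_conditional = sum(m * entropy(p, n) for m, p in zip(density, probs))
    return h_output - h_conditional
```

Two locations with confident and different predictions, `[[1.0, 0.0], [0.0, 1.0]]`, give the maximum value 1, while two locations with identical predictions give 0.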
Performing maximum-MI-based online learning of using the CALs in the time horizon is not straightforward. Once we restore the dependency of on time, by inserting in place of , we cannot simply plug (minus) the MI index as a potential loss in the Lagrangian due to the lack of temporal locality. As a matter of fact, in order to implement online learning dynamics, must be temporally local, i.e., it should depend on and at time only. For this reason, the authors of Betti et al. (2020b) compute the MI index at time , and not in an interval; the approximation of the MI in is yielded by the outer integration in the functional of Eq. (4) (or, equivalently, in the one of Eq. (2)). A drawback of this formulation is that, due to this temporal assumption, it could lead to a loose approximation of the original term of Eq. (8), for which the inner integration on time (Eq. (7)) is lost, and replaced by the outer integration of the functional. In order to better cope with the optimization dynamics, the two entropy terms are commonly weighted by positive scalars , . In addition to the plain-vanilla case we just described (referred to as PLA), we explore two alternative criteria to mitigate the impact of time locality, which we will evaluate in Section 4. The first one (VAR) consists in introducing an additional auxiliary variable , that is used to replace of Eq. (7), while its variation, , is constrained to be almost equivalent to . The Lagrangian is augmented with , a soft constraint that enforces to approximate the case in which the probability estimate is not limited to the current frame (); probabilistic normalization must be enforced after every update of . This idea is presented in Betti et al. (2020b) but not followed up in any experimentation. As a second criterion (AVG), we propose to replace with the outcome of an averaging operation that keeps track of the past activation of the output units, i.e., , for two consecutive time instants .
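The AVG criterion can be pictured as a running average of the mean output activation. The sketch below assumes an exponential moving average with a hypothetical mixing coefficient `zeta`; the exact averaging rule and its coefficient are not recoverable from the text above, so this is one plausible instance rather than the paper's formula.

```python
def update_average(avg, frame_mean, zeta=0.9):
    """Blend the running average with the current frame's mean activation.

    avg        : running average of the output probabilities (sums to 1)
    frame_mean : mean output probabilities on the current frame (sums to 1)
    zeta       : hypothetical weight given to the past, in [0, 1)
    """
    mixed = [zeta * a + (1.0 - zeta) * m for a, m in zip(avg, frame_mean)]
    total = sum(mixed)
    # Renormalize so the result stays a probability vector despite rounding.
    return [x / total for x in mixed]
```

For instance, `update_average([0.5, 0.5], [1.0, 0.0], zeta=0.5)` returns `[0.75, 0.25]`. The result would be used in place of the single-frame average when computing the output entropy.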
The way video data is commonly processed by machines usually lacks a key property of human visual perception, that is, the capability of exploiting eye movements to perform shifts in selective visual attention. High visual acuity is restricted to a small area in the center of the retina (fovea), and the purpose of the Focus Of Attention (FOA) is to selectively orient the gaze toward relevant areas with high information, filtering out irrelevant information from cluttered visual scenes McMains and Kastner (2009); Kowler (2010); Zanca et al. (2019). In the context of Section 2.2, we consider a visual stream and a neural architecture with output dimensions (per pixel), and we aim at developing the network weights such that the MI index is maximized as strongly as possible with respect to the model capacity. Of course, restricting the attention to a subset of the spatio-temporal coordinates of the video, due to a FOA mechanism, seems to inherently carry less information than when considering the whole video. However, in the latter case, the processed data will be characterized by a larger variability, mixing up noisy/background information with what could be more useful to understand the video. Such a mixture of data could be harder to disentangle by a learning model than well-selected information coming from a human-like FOA trajectory, leading to a worse MI estimate. Curiously, the learning process restricted to the FOA trajectory could end up facilitating the development of the weights, so that the MI computed on the whole frame area could be larger than when learning without restrictions. Following the notation of Eq. (8), the MI maximization, for each , is based on the spatial distribution . Such a distribution models the relevance of each coordinate when learning from frame . In Betti et al. (2020a, 2019), is assumed to be uniform over the frame area, while Betti et al. (2020b) also describes the idea of considering ( in Betti et al. (2020b)) as the most natural candidate for implementing a FOA-based mechanism. Let us assume that are the spatial coordinates of the FOA at time , then we define
being a function that is peaked on . Following this parametrization of , we borrow a state-of-the-art model for scanpath prediction defined in Zanca et al. (2019), that shares a physics-inspired formulation with CALs. Such a FOA model has been shown to be strongly human-like in free-viewing conditions Zanca et al. (2020). It is based on the intuition that attention emerges as a gravitational process, in which both low-level (gradient, contours, motion) and high-level features (objects, context) may act as gravitational masses. In particular, given the gravitational field , the law that drives the attention is
that is indeed another 2nd order model, like the one we proposed in Section 2.1 (see Zanca et al. (2019) for more details). The dissipation is controlled by , and the importance of each mass can also be tuned. Interestingly, Eq. (10) describes the dynamics of the FOA, and it is not based on pre-computed or given saliency maps. In this paper, following Zanca et al. (2019), we consider two basic (low-level) perceptive features as masses, the spatial gradient of the brightness and the strength of the motion field. The trajectories simulated by the model show the same patterns of movement characteristic of human eyes: fixations, when the gaze remains still in a location of interest; saccades, rapid movements that reallocate attention to a new target; smooth pursuit, slow movements performed in the presence of a visual feedback with the purpose of tracking a stimulus.
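The gravitational picture can be sketched in a few lines, assuming a damped law of the form `a'' + damping * a' = field(a)` and an inverse-distance attraction from point masses. The field expression, the `damping` and `dt` values, the explicit integration step, and the dictionary of masses are all assumptions made for illustration; the actual model is the one defined in Zanca et al. (2019).

```python
def gravitational_field(a, masses, eps=1e-6):
    """Attraction felt at position a from point masses {(x, y): m}."""
    fx = fy = 0.0
    for (x, y), m in masses.items():
        dx, dy = x - a[0], y - a[1]
        d2 = dx * dx + dy * dy + eps  # eps avoids division by zero
        fx += m * dx / d2
        fy += m * dy / d2
    return fx, fy


def foa_step(a, v, masses, damping=1.0, dt=0.05):
    """One integration step of a'' + damping * a' = field(a)."""
    fx, fy = gravitational_field(a, masses)
    # Update the velocity with the field and the dissipation, then move.
    v = (v[0] + dt * (fx - damping * v[0]), v[1] + dt * (fy - damping * v[1]))
    a = (a[0] + dt * v[0], a[1] + dt * v[1])
    return a, v
```

Starting at rest in `(0.0, 0.0)` with a single mass in `(10.0, 10.0)`, one call to `foa_step` moves the focus slightly along the diagonal toward the mass. In a video setting the masses would be recomputed at each frame from the brightness gradient and the motion field.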
Different choices on are possible. In Section 4 we will consider the extreme case in which is a Dirac delta on the coordinates (we will refer to it as FOA), so that is essentially a one-dimensional signal. A less extreme setting is the one in which is a square window centered in that covers a small fraction of the frame (FOAW), while the most relaxed setting is when is simply uniform on the whole frame (UNI), i.e., is not used.
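On a discrete pixel grid, the three choices can be sketched as normalized weight maps. In this illustration the Dirac delta is approximated by a single pixel, and the function name, the `(row, col)` convention and the `window` parameter (edge length in pixels) are assumptions made for the example.

```python
def spatial_density(height, width, mode, foa=None, window=None):
    """Return a height x width grid of weights that sum to 1.

    mode   : "UNI", "FOA", or "FOAW"
    foa    : (row, col) of the focus of attention (FOA and FOAW only)
    window : edge length in pixels of the square window (FOAW only)
    """
    grid = [[0.0] * width for _ in range(height)]
    if mode == "UNI":
        cells = [(r, c) for r in range(height) for c in range(width)]
    elif mode == "FOA":
        cells = [foa]  # single-pixel stand-in for the Dirac delta
    elif mode == "FOAW":
        half = window // 2
        r0, c0 = foa
        # Clip the window to the frame borders.
        cells = [(r, c)
                 for r in range(max(0, r0 - half), min(height, r0 + half + 1))
                 for c in range(max(0, c0 - half), min(width, c0 + half + 1))]
    else:
        raise ValueError(f"unknown mode: {mode}")
    for r, c in cells:
        grid[r][c] = 1.0 / len(cells)
    return grid
```

Flattening such a grid yields per-location weights for a single frame. The window is clipped at the frame borders, so a focus near an edge covers fewer pixels, each with a larger weight.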
A PyTorch-based implementation can be downloaded as supplementary material.
Models. Architectures are referred to as S (Small), D (Deeper), DL (Deeper and with a Larger number of neurons), and they are based on filters (except for the last layer – filters), (S) or (D, DL) layers, and either (S, D) or (DL) filters in layer . Networks S and D are composed of filters in each hidden layer, while DL has filters in each hidden layer. Following Section 3, we compared 3 potential terms based on 3 different input probability densities , named UNI, FOA, FOAW (uniform, foa-restricted, and foa-window-restricted, respectively – window edge is of the min frame dimension). For each of them, we tested the 3 criteria of Section 2.2 to extend the temporal locality, PLA, VAR, AVG (fully local, variable-based, average).
Setting & Data. We considered three visual streams with frames each. The first frames are the ones on which learning is performed, integrating the CALs. Then, the developed weights are used to measure the MI index over the following frames, directly applying the MI formulation of Eq. (8), i.e., , which is what we report in the results of this section. For all the models, independently of the probability density used in their potentials, we measured the MI index using in the UNI, FOA, FOAW cases.
This means that, for example, a model trained following the FOA trajectory is then evaluated in the 5k test frames either considering the whole frame area, the FOA trajectory, or the window-based FOA trajectory. The three streams (Fig. 1), have different properties. The first one, SparseMNIST, is composed of a static frame () in which 10 digits from the MNIST data are sparsely located over a dark background. The second video, Carpark, is taken from a fixed camera monitoring a car parking area in front of a building. The last video, Call, is a recording taken from a webcam during a video call. Videos are repeated until the target number of frames is reached. The last two videos are processed at pixels per frame, grayscale, frames per second.
Parameters. The FOA trajectory was generated by weighing the two gravitational masses (frame details) and (motion), respectively, and adjusting in order to adapt it to each video. We analyze the behaviour of the FOA trajectories in Fig. 2. After a first experimentation in which we qualitatively observed the behaviour of the 2nd order laws, we set , , . For each model we considered multiple weighing schemes of the parameters , , , , selecting the ones that returned the largest MI. As a general rule of thumb, using a lower value of the conditional entropy weighing term w.r.t. the entropy weight , helps the model to exploit all the available output symbols. The network weights were randomly initialized, enforcing the same initialization for all the compared models.
Main result. Our main results are highlighted in Tab. 1.
Each column, starting from the third one, is about a model, defined by the pair (architecture, density used in the training potential). For each model, the MI index is reported when measured using different spatio-temporal densities (they are labeled in column “Test”). We used the temporal locality criterion that led to the best results. Overall, the models trained on FOA-based densities (columns FOA, FOAW) usually perform better than the ones that were exposed to a uniform density over the frame area (columns UNI). This is particularly noticeable in the SparseMNIST and Call streams, characterized by a still and not very detailed background and few regions of interest, i.e. the digits or the moving speaker, respectively. The filtering approach induced by the attention in the training stage highly improves the information transfer over most of the considered test measurements, with just a few exceptions. These considerations hold to a lesser degree also in the Carpark stream, in which frames are more detailed. The focus is attracted by a busy road or by people parking their cars. However, the immediate surroundings of those regions also contain much information, so that training with the FOAW density achieves the best results in architectures D and DL, while the more extreme FOA approach does not compete with models trained considering the whole frame (UNI). In both the Carpark and Call streams, the S architecture does not benefit from learning over the attention trajectory. We motivate this result by considering that S is a shallower model that inherently learns lower-level features than the other ones. These features are more common to different frame locations, making the impact of attention less evident. In the case of SparseMNIST, the dark uniform background dominates the frame, and learning over the FOA trajectory induces a larger information transfer also in network S.
Temporal locality. In order to evaluate the impact of the temporal locality criteria (PLA, AVG, VAR), we restrict our analysis to models trained with a FOA-restricted probability density. In this case, we describe each model by the pair (architecture, temporal locality criterion), and we report results in Tab. 2. In general, the moving average criterion (AVG) achieves the best performance in most settings, with some exceptions. The Carpark stream has temporal dynamics that are rather repetitive and periodic (e.g., cars crossing the same crossroad). Hence, the addition of a criterion to better keep track of the temporal information turns out to be less necessary. We notice a higher value of the MI index in the fully temporally local case (PLA) in architecture DL. This may be due to the fact that DL has a larger number of parameters and units than the other nets, and it has intrinsically more capacity to memorize the temporal information. The MI index is lower than the one of the other architectures due to the larger size of the output space.
Random scanpaths. We are left with the open question of whether the larger information transfer we observed is due to the state-of-the-art attention model we used or is only due to the reduction of the size of the input data. We compared models trained on the FOA trajectories used so far with the same networks trained by randomly sampling the focus coordinates from a uniform distribution over the retina. The results of Fig. 3 show that the human-like trajectory estimated by the selected attention model has a clear positive impact on the information transfer. Interestingly, in the Carpark case we sometimes observe that fixations which explore random coordinates highly foster information transfer. This confirms our previous statements regarding the large amount of information in the whole frame area.
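A random-scanpath baseline of this kind can be sketched as follows, assuming integer pixel coordinates and independent draws at every frame; the function name, the `(row, col)` convention and the fixed `seed` are choices made for this example only.

```python
import random


def random_scanpath(height, width, num_frames, seed=0):
    """Sample one (row, col) focus location per frame, uniformly over the retina."""
    rng = random.Random(seed)  # local generator, so runs are reproducible
    return [(rng.randrange(height), rng.randrange(width))
            for _ in range(num_frames)]
```

Unlike a human-like trajectory, consecutive locations are uncorrelated here, so the baseline has no fixations or smooth pursuit.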
Learning dynamics. We investigate the behaviour of the models during the training stage, in the case of architecture D and a single training/test probability density, FOA. The plots of Fig. 4, for each value of the $x$-axis, show the MI index computed in the interval along the FOA trajectory, for different temporal criteria (PLA, AVG, VAR). The variable-based (VAR) model tends to quickly find a stationary condition of the estimated MI index value. Both PLA and AVG go through an initial stage with evident fluctuations before becoming more stable, usually at larger values than VAR. The models have to deal with rather varied conditions in the first stages of learning, which is limited to a single location in each frame. As time passes and a larger portion of the stream is processed, fluctuations are mitigated, reaching more stable configurations.
In this work we delved into a novel approach to Mutual Information (MI) maximization arising from the conjunction of online entropy estimation mechanisms and human-like focus of attention. We introduced a 2nd order differential model, providing experimental results to support the intuition that using the focus of attention to drive the learning dynamics fosters an increase of the globally transferred information from the input stream. Future work will be devoted to enforcing coherence over the predictions performed on the focus trajectory to develop high-level representations.
Our work is a foundational study. We believe that there are neither ethical aspects nor future societal consequences that should be discussed.
This work was partly supported by the PRIN 2017 project RexLearn, funded by the Italian Ministry of Education, University and Research (grant no. 2017TWNMH2)
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 2009–2015.
European Conference on Computer Vision (ECCV).
On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment 2019 (12), pp. 124020.
The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
In order to prove Theorem 1 we first describe a technical hypothesis on the potential . In detail, for all positive there exist two positive integrable functions and such that for every and for all we have
Notice that here, in order to simplify the notation, we use the same symbol for and for . We will also denote with the solution of problem (5).
As is also remarked below, the proof is articulated as follows: first of all we assess the convergence of by compactness arguments, basically by performing an estimate on the solution ; then the uniform estimate on the norm of is used to check that the limit actually solves the problem (6).
The proof of this theorem follows the spirit of Theorem 4.2 of Serra and Tilli (2012). We will start with a uniform (in ) estimate of and then we will use this estimate in the weak form of the Euler equation to show the convergence of to the solution of (6). We will prove the theorem in the case and .
Uniform Estimate. Start from the differential equation in (5) and scalar multiply it by :
then integrate this equation on the interval , and using the boundary conditions (5) integrate by parts to obtain
Now let us integrate this equality again in the interval , therefore obtaining
Now we can take all the positive (for small enough) terms to the right hand side to obtain
Now using Eq. (1) we can choose to further reduce this inequality down to
where and are constants with respect to the parameter . Using the Peter-Paul inequality we have that for all . Similarly since , we can write and using the Peter-Paul and Cauchy-Schwarz inequalities we also end up with the estimate for all , which implies
Convergence. Once we have this uniform bound we can complete the proof by arguing along the very same lines of the proof of Section 3.2 of Serra and Tilli (2012) to obtain the thesis. ∎