Attentive monitoring of multiple video streams driven by a Bayesian foraging strategy

by Paolo Napoletano, et al.

In this paper we consider the problem of deploying attention to subsets of video streams for collating the data and information most relevant to a given task. We formalize this monitoring problem as a foraging problem, and propose a probabilistic framework that models the observer's attentive behavior as the behavior of a forager. The forager, moment to moment, focuses its attention on the most informative stream/camera, detects interesting objects or activities, or switches to a more profitable stream. The approach proposed here is suitable for multi-stream video summarization; meanwhile, it can serve as a preliminary step for more sophisticated video surveillance, e.g. activity and behavior analysis. Experimental results achieved on the UCR Videoweb Activities Dataset, a publicly available dataset, are presented to illustrate the utility of the proposed technique.




I Introduction

The volume of data collected by current networks of cameras for video surveillance far exceeds the capacity of human viewers to stay focused on a monitoring task. Further, much of the data that can be collected from multiple video streams is uneventful. Thus, the need for discovering and selecting the activities occurring within and across videos, so as to collate the information most relevant to the given task, has fostered the field of multi-stream summarisation.

At the heart of multi-stream summarisation there is a “choose and leave” problem that moment to moment an ideal or optimal observer (say, a software agent) must solve: choose the most informative stream; detect, if any, interesting activities occurring within the current stream; leave the handled stream for the next “best” stream.

In this paper, we provide a different perspective on such a “choose and leave” problem, based on a principled framework that unifies overt visual attention behavior and optimal foraging. The framework we propose is but one, albeit novel, way of formulating the multi-stream summarisation problem and its solution (see Section II for a discussion).

In a nutshell, we consider the foraging landscape of multiple streams, each video stream being a foraging patch, and the ideal observer playing the role of the visual forager (cfr. Table I). According to Optimal Foraging Theory (OFT), a forager that feeds on patchily distributed prey or resources spends its time traveling between patches or searching for and handling food within patches [1]. While searching, it gradually depletes the food; hence, the benefit of staying in the patch is likely to diminish with time. Moment to moment, striving to maximize its foraging efficiency and energy intake, the forager must make decisions: Which is the best patch to search? Which prey, if any, should be chased within the patch? When should the current patch be left for a richer one?
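To make the classic (deterministic) patch-leaving policy concrete, here is a minimal numerical sketch. The gain function and all parameter values below are illustrative assumptions, not taken from the paper: an optimal forager leaves a patch when the instantaneous gain rate falls to the long-term average rate over the landscape, so longer travel between patches implies longer residence within each patch.

```python
import numpy as np

# Diminishing-returns gain within a patch: g(t) = G_max * (1 - exp(-r * t)).
# The gain function and its parameters are illustrative, not from the paper.
def gain(t, g_max=10.0, r=0.5):
    return g_max * (1.0 - np.exp(-r * t))

def optimal_residence_time(travel_time, t_grid=np.linspace(0.01, 50, 5000)):
    """MVT: leave when g(t)/(t + travel) is maximal, i.e. when the
    instantaneous gain rate drops to the long-term average rate."""
    rates = gain(t_grid) / (t_grid + travel_time)
    return t_grid[np.argmax(rates)]

# Classic MVT prediction: longer travel between patches -> stay longer in each.
t_short = optimal_residence_time(travel_time=1.0)
t_long = optimal_residence_time(travel_time=10.0)
assert t_long > t_short
```

This is the deterministic Marginal Value Theorem picture that the Bayesian strategy of the paper generalizes to stochastic landscapes.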

Here visual foraging corresponds to the time-varying overt deployment of visual attention achieved through oculomotor actions, namely, gaze shifts. Like the forager, the observer is pressed to maximize his information intake over time under a given task, by sampling, moment to moment, the most informative subsets of video streams.

Multi-stream attentive processing | Patchy landscape foraging
Observer | Forager
Observer’s gaze shift | Forager’s relocation
Video stream | Patch
Proto-object | Candidate prey
Detected object | Prey
Stream selection | Patch choice
Deploying attention to object | Prey choice and handling
Disengaging from object | Prey leave
Stream leave | Patch leave or giving-up
TABLE I: Relationship between attentive vision and foraging

Altogether, choosing the “best” stream, deploying attention to within-stream activities, and leaving the attended stream represent the unfolding of a dynamic decision-making process. Such monitoring decisions have to be made by relying upon automatic interpretation of scenes for detecting actions and activities. To be consistent with the terminology proposed in the literature [2], an action refers to a sequence of movements executed by a single object (e.g., “human walking” or “vehicle turning right”). An activity contains a number of sequential actions, most likely involving multiple objects that interact or co-exist in a shared common space monitored by single or multiple cameras (e.g., “passengers walking on a train platform and sitting down on a bench”). The ultimate goal of activity modelling is to understand behavior, i.e. the meaning of activity in the shape of a semantic description. Clearly, action/activity/behavior analysis entails the capability of spotting objects that are of interest for the given surveillance task.

Thus, in the work presented here the visual objects of interest occurring in video streams are the preys to be chased and handled by the visual forager. Decisions at the finer level of a single stream concern which object is to be chosen and analyzed (prey choice and handling, depending on task), and when to disengage from the spotted object for deploying attention to the next (prey leave).

The reformulation of visual attention in terms of foraging theory is not simply an informing metaphor. What was once foraging for tangible resources in a physical space became, over evolutionary time, foraging in cognitive space for information related to those resources [3], and such adaptations play a fundamental role in goal-directed deployment of visual attention [4]. Under these rationales, we present a model of Bayesian observer’s attentive foraging supported by the perception/action cycle presented in Fig. 1. Building on the perception/action cycle, visual attention provides an efficient allocation and management of resources.

The cycle embodies two main functional blocks: the perceptual component and the executive control component. The perceptual component is in charge of “What” to look for, and the executive component accounts for the overt attention shifts, by deciding “Where and How” to look at, i.e., the actual gaze position, and thus the observer’s Focus of Attention (FoA). The observer’s perceptual system operates on information represented at different levels of abstraction (from raw data to task dependent information); at any time, the currently sensed visual stimuli depend on the oculomotor action or gaze shift performed either within the stream (within-patch) or across streams (between-patch). Based on perceptual inferences at the different levels, the main feedback information passed on to the executive component, or controller, is an index of stream quality formalized in terms of the configurational complexity of potential objects sensed within the stream.

A stream is selected by relying upon its pre-attentively sensed quality. Once within the stream, the observer attentively detects and handles objects that are informative for the given task. Meanwhile, by intra-stream foraging, the observer gains information on the actual stream quality in terms of experienced detection rewards. Object handling within the attended stream occurs until a decision is made to leave for a more profitable stream. Such a decision relies upon a Bayesian strategy, which is the core of this paper. The strategy extends the deterministic global policy derived from Charnov’s classic Marginal Value Theorem (MVT, [5]) to stochastic landscapes observed under incomplete information, while integrating the observer’s within-stream experience.
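As a toy illustration of the flavor of such a stay-or-leave strategy (the class name, the Beta-Bernoulli reward model, and the threshold rule below are our own simplifying assumptions; the paper’s actual strategy is developed in Section VII): the observer maintains a posterior belief over the current stream’s reward rate, updates it with each detection outcome, and gives up when the posterior expectation drops below the landscape average.

```python
# A minimal sketch of a Bayesian "stay or leave" rule. Assumption: detections
# are modeled as Bernoulli trials with a Beta prior on the stream's reward
# rate; the paper's strategy is more elaborate.
class StreamBelief:
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta  # Beta prior pseudo-counts

    def update(self, detected: bool):
        # Conjugate Beta-Bernoulli update with the latest detection outcome.
        if detected:
            self.alpha += 1
        else:
            self.beta += 1

    def expected_rate(self):
        return self.alpha / (self.alpha + self.beta)

def should_leave(belief, landscape_avg_rate):
    # MVT-style giving-up rule: leave when the posterior expected reward
    # rate of the current stream drops below the landscape average.
    return belief.expected_rate() < landscape_avg_rate

b = StreamBelief()
for outcome in [True, False, False, False, False]:
    b.update(outcome)
assert should_leave(b, landscape_avg_rate=0.5)  # mostly misses -> give up
```

The design choice mirrored here is that within-stream experience (the detection outcomes) continuously revises the giving-up decision, rather than fixing a residence time in advance.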

By relying on such perception/action cycle, we assume that the deployment of gaze to one video frame precisely reflects the importance of that frame. Namely, given a number of video streams as the input, at any point in time, we designate the current gazed frame as the relevant video frame to be included in the final output summarisation. The output succinctly captures the most important data (objects engaged in actions) for the surveillance analysis task.

Fig. 1: Monitoring of multiple video streams as attentive foraging. The ideal observer (forager) is involved in a perception/action loop that supports foraging activity. Multiple streams are the raw sensory input. The observer pre-attentively selects the most informative stream (patch) and sets his Focus of Attention via a gaze shift action; within the stream, interesting objects (preys) are attentively detected and handled through local gaze shifts. Moment to moment, a Bayesian optimal strategy is exploited to make a decision whether to stay or leave the scrutinized stream by shifting the gaze to a more profitable one. The strategy relies upon the perceptual feedback of the overall “quality” (complexity) of streams.

The idea of a layered framework for the control of gaze deployment, implementing a general perception/action loop, is an important one in the visual attention literature (cfr. Schütz et al. [6] for a discussion) and has been fostered by Fuster [7, 8]. Such an idea, together with the assumption that attention is algorithmic in nature and need not occupy a distinct physical place in the brain, is germane to our theme. Fuster’s paradigm has more recently been formalized by Haykin [9] under the name of Cognitive Dynamic Systems. In this perspective, our cognitive foraging approach to monitoring can be considered closely related to the Cognitive Dynamic Surveillance System (CDSS) approach, a remarkable and emerging domain proposed by Regazzoni and colleagues [10, 11, 12]. In CDSS, attentive mechanisms [13] are likely to add relevant value to the effort of designing the next generation of surveillance systems.

In the rest of this paper, Section II describes the related literature and contributions of this work. Section III provides a formal overview of the model. Section IV details the pre-attentive stage. Stream selection is discussed in Section V. Section VI describes within-stream visual attention deployment, while Section VII discusses the Bayesian strategy for leaving the stream. Experimental work is presented in Section VIII. Finally, Section IX concludes this paper.

II Related work and our contributions

Main efforts in the summarisation literature have been spent on the single-camera case, while the multi-camera setting has not received as much attention (see [14, 15, 16] for reviews). Specifically, the work by Leo and Manjunath [15] shares our concern of providing a unified framework to generate summaries. Differently from us, they rely on document-analysis-inspired activity motif discovery. Time series of activations are computed from dense optical flow in different regions of the video, and high-level activities are identified using topic model analysis. The step from activities detected in individual video streams to a complete network summary relies on identifying and reducing inter- and intra-activity redundancy. This approach, cognate with those based on sparse coding dictionaries for finding the most representative frames (e.g., [17]), requires off-line learning of activities from documents, each document being the time series of activations of a single stream. While offering some advantages for inferring high-level activities [14], these methods avoid confronting the complex vision problems and distributed optimal control strategies brought on by the multi-stream setting [18, 19, 20]. On the other hand, difficulties arise when dealing with large video corpora and with dynamic video streams, e.g. on-line summarisation in visual sensor networks [16], a case which is closer to our scenario.

In this view, beyond multi-camera summarisation, work concerning multi-camera surveillance is also of interest, since manual coordination becomes unmanageable when the number of cameras is large. To some extent, the “choose and leave” problem previously introduced bears relationships with two challenging issues: camera assignment (which camera is being used to extract essential information) and camera handoff (the process of finding the next best camera). Indeed, the complexity of these problems on large networks is such that Qureshi and Terzopoulos [21] have proposed the use of virtual environments to demonstrate camera selection and handover strategies. Two remarkable papers address the issue of designing a general framework inspired by non-conventional theoretical analyses, in a vein similar to the work presented here. Li and Bhanu [22] have presented an approach based on game theory: camera selection is based on a utility function computed through bargaining among the cameras capturing the tracked object. Esterle et al. [23] adopted a fully decentralised socio-economic approach for online handover in smart camera networks: autonomous cameras exchange responsibility for tracking objects in a market mechanism in order to maximize their own utility. When a handover is required, an auction is initiated, and the cameras that have received the auction initiation try to detect the object within their field of view.

At this point it is worth noting that, in the effort towards a general framework for stream selection and handling, all the works above, differently from the approach we present here, are quite agnostic about the image analysis techniques to adopt. They mostly rely on basic tools (e.g., dense optical flow [15], manually initialized Camshift tracking [22], simple frame-to-frame SIFT computation [23]). However, from a general standpoint, moving object detection and recognition, tracking, and behavioral analysis are stages that deeply involve the realms of image processing and machine vision. In these research areas, one major and omnipresent concern in recent years has been how to restrict the large amount of visual data to a manageable rate [19, 18].

Yet, to tackle information overload, biological vision systems have evolved a remarkable capability: visual attention, which gates relevant information to subsequent complex processes (e.g., object recognition). A series of studies published under the headings of Animate [24] or Active Vision [25] has investigated how the concepts of human selective attention can be exploited for computational systems dealing with a large amount of image data (for an extensive review, see [26]). Indeed, determining the most interesting regions of an image in a “natural”, human-like way is a promising approach to improve computational vision systems.

Surprisingly enough, the issue of attention has hitherto been overlooked by most approaches in video surveillance, monitoring and summarisation [14, 16], apart from those in the emerging domain of smart camera networks embedding pan-tilt-zoom (PTZ) cameras. PTZ cameras can actively change intrinsic and extrinsic parameters to adapt their field of view (FOV) to specific tasks [19, 20]. In such a domain, active vision is a pillar [19, 18], since FOV adaptation can be exploited to focus the “video-network attention” on areas of interest. In PTZ networks, each camera is assumed to have its own embedded target detection module, a distributed tracker that provides an estimate of the state of each target in the scene, and a distributed camera control mechanism [20]. Control issues have been central to this field: the large number of camera nodes in these networks and the tight resource limitations require balancing among conflicting goals [27, 21]. In this respect, the exploitation vs. exploration dilemma is as cogent here as in our work. For example, Sommerlade and Reid [28] present a probabilistic approach that maximizes the expected mutual information gain as a measure of the utility of each parameter setting and task. The approach allows balancing conflicting objectives such as target detection and obtaining high-resolution images of each target. Active distributed optimal control has been given a Bayesian formulation in a game-theoretic setting. The Bayesian formulation enables automatic trading-off of objective maximization versus the risk of losing track of any target; the game-theoretic design allows the global problem to be decoupled into local problems at each PTZ camera [29, 30].

In most cases visual routines and control are treated as related but technically distinct problems [20]. Clearly, these involve a number of fundamental challenges to the existing technology in computer vision and the quest for efficient and scalable distributed vision algorithms [18]. The primary goal of these systems has been tracking distinct targets, where the adopted schemes are extensions of the classic Kalman Filter to the distributed estimation framework [20]. However, it is important to note that tracking is but one aspect of multi-stream analysis and of visual attentive behavior ([2], but see Section III-B for a discussion). To sum up, while the development of PTZ networks has cast interest on the active vision techniques that are at the heart of the attentive vision paradigm [24, 25], even in this field we are far from a full exploitation of the tools made available by such a paradigm.

There are some exceptions to this general state of affairs. The use of visual attention has been proposed by Kankanhalli et al. [31]. They embrace the broad perspective of multimedia data streams, but the stream selection process is still handled within the classic framework of optimization theory, relying on an attention measure (saturation, [31]). Interestingly, they resort to the MVT, but only for experimental evaluation purposes; in our work the Bayesian extension of the MVT is at the core of the process. The interesting work by Chiappino et al. [13] proposes a bio-inspired algorithm for focusing attention on densely populated areas and for detecting anomalies in crowds. Their technique relies on an entropy measure and bears some resemblance to the pre-attentive monitoring stage of our model. Martinel et al. [32] identify the salient regions of a given person for re-identification across non-overlapping camera views. Recent work on video summarisation has borrowed salience representations from the visual attention realm. Ejaz et al. [33] choose key frames as salient frames on the basis of low-level salience. High-level salience based on the most important objects and people is exploited in [34] for summarisation, so that the storyboard frames reflect the key object-driven events. Albeit not explicitly dealing with salience, since building upon sparse coding summarisation, Zhao and Xing [35] differ from [17] in that they generate video summaries by combining segments that cannot be reconstructed using the learned dictionary. Indeed, this approach, which incorporates unseen and interesting contents in the summaries, is equivalent to denoting as salient those events that are unpredictable on prior knowledge (salient as “surprising”, [26]). Both [34] and [35] only consider single-stream summarisation. The use of high-level saliency to handle the multi-stream case has been addressed in [36], hinging on [37]; this method can be considered a baseline deterministic solution to the problem addressed here (cfr., for further analysis, Section VIII).

Our method is fundamentally different from all of the above approaches. We work within the attentive framework, but the main novelty is that, by focusing on gaze as the principal paradigm of active perception, we reformulate the deployment of gaze to a video stream, or to objects within the stream, as a stochastic foraging problem. This way we unify intra- and inter-stream analyses. More precisely, the main technical contributions of this paper lie in the following.

First, based on OFT, a stochastic extension of the MVT is proposed, which defines an optimal strategy for a Bayesian visual forager. The strategy combines in a principled way global information from the landscape of streams with local information gained in attentive within-stream analysis. The complexity measure that is used lends itself to within-patch analysis (e.g., from a group of people down to single-person behavior), much like some foragers exploit a hierarchy of patch aggregation levels [38].

Second, the visual attention problem is formulated as a foraging problem by extending previous work on Lévy flights as a prior for sampling gaze shift amplitudes [39], which mainly relied on bottom-up salience. At the same time, task dependence is introduced, and not through ad hoc procedures: it is naturally integrated within attentional mechanisms in terms of the rewards experienced in the attentive stage when the stream is explored. This issue is seldom taken into account in computational models of visual attention (see [26, 6], but in particular Tatler et al. [40]). A preliminary study on this challenging problem has been presented in [41], but limited to the task of searching for text in static images.

III Model overview

In this Section we present an overview of the model, to frame the detailed discussion of its key aspects covered in Sections IV (pre-attentive analysis), V (stream choice), VI (within-stream attentive analysis) and VII (Bayesian strategy for stream leave).

Recall from Section I that the input to our system is a visual landscape of multiple video streams, each stream being a sequence of time-parametrized frames defined over a spatial support whose points are identified by their coordinates. By relying on the perception/action cycle outlined in Fig. 1, at any point in time we designate the currently gazed frame of the selected stream as the relevant video frame to be included in the final output summarisation.

To such end, each video stream is the equivalent of a foraging patch (cfr. Table I) and objects of interest (prey) occur within the stream. In OFT terms, it is assumed that the landscape is stochastic and that the forager has sensing capabilities: it can gain information on patch quality and available prey as it forages. Thus, the model is conceived in a probabilistic framework, built on the following random variables (RVs):

  • a RV whose values correspond to the task pursued by the observer;

  • a multinomial RV whose values correspond to the object classes known by the observer.

As a case study, we deal with actions and activities involving people. Thus, the given task corresponds to “pay attention to people within the scene”, and the classes of objects of interest for the observer are faces and human bodies.

The observer engages in a perception/action cycle to accomplish the given task (Fig. 1). Actions are represented by the moment-to-moment relocations of gaze from the current position to a new one. We deal with two kinds of relocations: i) from the current video stream to the next selected one (between-patch shifts); ii) from one position to another within the selected stream (within-patch gaze shifts). Since we assume unitary time for between-stream shifts, in the following we will drop the stream index and simply refer to the center of the FoA within the frame without ambiguity. Relocations occur because of decisions taken by the observer upon his own perceptual inferences. In turn, moment to moment, perceptual inferences are conditioned on the observer’s current FoA as set by the gaze shift action.

III-A Perceptual component

Perceptual inference stands on the visual features that can be extracted from the raw data streams, a feature being a function computed over the raw data. In keeping with the visual attention literature [6], we distinguish between two kinds of features:

  • bottom-up or feed-forward features, such as edge, texture, color, and motion features, corresponding to those that biological visual systems learn along evolution or in early development stages for identifying the sources of stimulus information available in the environment (phyletic features, [8]);

  • top-down or object-based features, learned for the specific object classes of interest.

There is a large variety of bottom-up features that could be used (see [26]). Following [42], we first compute, at each point of the spatial support of the frame from the given stream, spatio-temporal first derivatives (w.r.t. the temporally adjacent frames). These are exploited to estimate, within a local window, local covariance matrices, which in turn are used to compute space-time local steering kernels. Each kernel response is vectorised; the resulting vectors are then collected both in a local window and in a center + surround window, centered at the same point, to form a feature matrix. The motivation for using such features stems from the fact that local regression kernels capture the underlying local structure of the data exceedingly well, even in the presence of significant distortions. Further, they do not require explicit motion estimation.
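The derivative-to-covariance step above can be sketched as follows, in the spirit of [42] (the window size and the exact descriptor form are our own illustrative choices, not the paper’s implementation):

```python
import numpy as np

# Sketch of the pre-attentive feature computation: spatio-temporal first
# derivatives -> local covariance matrices. Window size is an assumption.
def local_covariances(prev, curr, nxt, win=3):
    gy, gx = np.gradient(curr)          # spatial first derivatives
    gt = (nxt - prev) / 2.0             # temporal first derivative
    H, W = curr.shape
    covs = np.zeros((H, W, 3, 3))
    r = win // 2
    for y in range(r, H - r):
        for x in range(r, W - r):
            # Stack derivative samples of the local window as rows of G,
            # then form the 3x3 covariance G^T G / n.
            G = np.stack([g[y - r:y + r + 1, x - r:x + r + 1].ravel()
                          for g in (gx, gy, gt)], axis=1)
            covs[y, x] = G.T @ G / G.shape[0]
    return covs

frames = [np.random.rand(16, 16) for _ in range(3)]
C = local_covariances(*frames)
# Each local covariance is symmetric positive semi-definite by construction.
assert np.allclose(C[8, 8], C[8, 8].T)
assert np.all(np.linalg.eigvalsh(C[8, 8]) >= -1e-10)
```

The steering kernels of [42] are then derived from these covariances; the point of the sketch is merely that the descriptor is built from local gradient statistics, with no explicit motion estimation.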

As to object-based features, these are to be learned by specifically taking into account the classes of objects at hand. In the work presented here, the objects of interest are faces and human bodies; thus, we compute face and person features by using the Haar/AdaBoost features exploited by the well-known Viola-Jones detector. This is a technical choice guided by computational efficiency issues; other choices [43, 37] would be equivalent from the modeling standpoint.

In order to be processed, features need to be spatially organized in feature maps. A feature map is a topographically organized map that encodes the joint occurrence of a specific feature at a spatial location. It can be represented either as a unique map encoding the presence of the different object-based features (e.g., a joint face and body map), or as a set of object-specific feature maps (e.g., a face map, a body map, etc.). More precisely, for a given stream, a feature map is a matrix of binary RVs denoting whether the feature is present or not at each location at a given time. Simply put, a feature map defines the spatial mask of the corresponding feature.

To support gaze-shift decisions, we define a RV capturing the concept of a priority map. Namely, for each stream, the priority map is a matrix of binary RVs denoting whether each location is to be considered relevant or not at a given time. It is important to note that the term “relevant” is to be specified with respect to the kind of feature map used to infer a probability density function (pdf) over the priority map. For instance, if only bottom-up features are taken into account, then “relevant” boils down to “salient”, and gaze shifts will be driven by the physical properties of the scene, such as motion, color, etc.

Eventually, in accordance with object-based attention approaches, we introduce proto-objects as the actual dynamic support for gaze orienting. Following Rensink [44], they are conceived as the dynamic interface between attentive and pre-attentive processing: namely, a “quick and dirty” time-varying perception of the scene, from which a number of proto-objects can be glued into the percept of an object by the attentional process. Here, operatively, proto-objects are drawn from the priority map, and each proto-object is used to sample interest points (IPs). The latter provide a sparse representation of a candidate object to gaze at; meanwhile, the whole set of IPs sampled at a given time on a video stream is used to compute the stream’s configurational complexity, which is adopted as a prior quality index of the stream in terms of foraging opportunities (potential prey within the patch).
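As an illustration of how a configurational quality index might be computed from sampled IPs, here is one simple proxy of our own (the spatial entropy of the IP distribution over a coarse grid; the paper’s exact complexity measure is detailed in Section IV and may differ):

```python
import numpy as np

# Illustrative proxy (our assumption) for configurational complexity:
# Shannon entropy of the spatial distribution of IPs over a coarse grid.
def configurational_complexity(ips, frame_shape, grid=(4, 4)):
    H, W = frame_shape
    counts = np.zeros(grid)
    for (y, x) in ips:
        counts[int(y * grid[0] / H), int(x * grid[1] / W)] += 1
    p = counts.ravel() / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
spread = [(y, x) for y, x in rng.integers(0, 64, size=(50, 2))]
clumped = [(3, 3)] * 50
# IPs spread over many regions yield higher complexity than one tight clump.
assert configurational_complexity(spread, (64, 64)) > \
       configurational_complexity(clumped, (64, 64))
```

Intuitively, a stream whose candidate objects are numerous and scattered offers more foraging opportunities than one with a single isolated blob, and the entropy proxy captures exactly that ordering.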

More generally, whilst feature maps and the priority map can be conceived as perceptual memories, the dynamic ensemble of proto-objects is more similar to a working memory that allows attention to be temporarily focused on an internal representation [8].

Fig. 2: The perception component (cfr. Fig. 1) as a Probabilistic Graphical Model. Graph nodes denote RVs and directed arcs encode conditional dependencies between RVs. Grey-shaded nodes stand for RVs whose value is given (current gaze position and task). The time index has been omitted for simplicity.

Given the task and the current gaze position, perceptual inference relies upon the joint pdf over all the RVs introduced above. Such a pdf can be represented as the directed Probabilistic Graphical Model (PGM, [45]) presented in Fig. 2. The PGM structure captures the assumptions about the visual process previously discussed. For example, the assumption that, given the task, a certain object class is likely to occur is represented through a direct dependence of the object variable on the task variable.

Stated technically, the structure encodes the set of conditional independence assumptions over the RVs (the local independencies, [45]) entailed by the joint pdf. The joint pdf then factorizes, according to the graph, into the product of the local conditional distributions (cfr. Koller [45], Theorem 3.1); the time index is omitted for notational simplicity. The factorization makes explicit the local distributions, and the related inferences at the different levels of visual representation guiding gaze deployment.

III-A1 Object-based level

The object prior is the multinomial distribution defining the prior on object classes under the given task, whose parameters can be easily estimated via Maximum Likelihood (basically, object occurrence counting).
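Maximum-Likelihood estimation of a multinomial prior indeed reduces to normalized counting; a minimal sketch with made-up labels and counts:

```python
from collections import Counter

# ML estimation of the multinomial object prior by occurrence counting,
# as stated above (labels and counts are illustrative).
observations = ["face", "body", "body", "face", "body"]
counts = Counter(observations)
total = sum(counts.values())
prior = {c: n / total for c, n in counts.items()}

assert abs(prior["body"] - 0.6) < 1e-12     # 3 of 5 observations
assert abs(sum(prior.values()) - 1.0) < 1e-12
```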

The object-based feature likelihood is, in the current simulation, obtained by using the Viola-Jones detector for faces and persons and converting its outcome to a probabilistic output (see [46] for a formal justification).

III-A2 Spatial-based level

The spatial prior denotes the prior probability of gazing at a location of the scene. For example, specific pdfs can be learned to account for the gist of the scene [47] (given an urban scene, pedestrians are more likely to occur in the middle horizontal region) or for specific spatial biases, e.g. the central fixation bias [40]. Here we do not account for such tendencies, and thus assume a uniform prior. The other factor at this level is the proto-object likelihood given the priority map, which will be further detailed in Section V.

III-A3 Feature map level

This factor represents the likelihood of an object-based feature occurring at a given location. Following [43], when the feature is present at a location we set the likelihood equal to a Gaussian centered at that location, so as to activate nearby locations; otherwise, it is set to a small constant value.
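Such a likelihood map can be sketched as follows (the bandwidth and the floor value are illustrative assumptions, not the paper’s settings):

```python
import numpy as np

# Sketch of the object-feature likelihood map described above: a Gaussian
# bump around each detected location, a small floor value elsewhere.
def feature_likelihood(shape, detections, sigma=5.0, floor=1e-4):
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    lik = np.full(shape, floor)
    for (y, x) in detections:
        d2 = (yy - y) ** 2 + (xx - x) ** 2
        lik = np.maximum(lik, np.exp(-d2 / (2 * sigma ** 2)))
    return lik

L = feature_likelihood((32, 32), [(16, 16)])
assert L[16, 16] == 1.0          # peak at the detection
assert L[0, 0] < 1e-3            # far away: only the floor value remains
```

The Gaussian smoothing is what “activates nearby locations”: gaze candidates close to a detection still receive appreciable likelihood mass.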

The remaining factor is the feed-forward evidence obtained from the low-level features computed from the frame as sensed at the current gaze position. In the pre-attentive stage the position of gaze is not taken into account, and the input frame is a low-resolution representation of the original. In the attentive stage, the gaze position is used to simulate foveation, accounting for the contrast sensitivity fall-off moving from the center of the retina, the fovea, to the periphery; thus, the input frame is a foveated image [37].

The feed-forward evidence is proportional to the output of low-level filters. A variety of approaches can be used [26], from a simple normalization of filter outputs to more sophisticated Gaussian mixture modeling.


Here, based on local regression kernel center/surround features, the evidence at a location of the frame is computed via the matrix cosine similarity (see [42] for details) between the center and surround feature matrices computed at that location of the foveated frame.
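The matrix cosine similarity used above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the feature matrices are assumed to be given as NumPy arrays, and turning low similarity into high evidence (rare center/surround patterns are salient) is one plausible instantiation of the computation described in [42].

```python
import numpy as np

def matrix_cosine_similarity(Fc, Fs):
    """Matrix cosine similarity between center (Fc) and surround (Fs)
    feature matrices: the Frobenius inner product normalised by the
    Frobenius norms (a matrix generalisation of cosine similarity)."""
    num = np.sum(Fc * Fs)
    den = np.linalg.norm(Fc) * np.linalg.norm(Fs)
    return num / den if den > 0 else 0.0

def center_surround_evidence(Fc, Fs):
    """Saliency-like evidence at a location: low center/surround
    similarity yields high evidence (rare pattern => salient).
    The (1 - rho) mapping is an assumption of this sketch."""
    rho = matrix_cosine_similarity(Fc, Fs)
    return 1.0 - rho
```

Note that the similarity is scale-invariant: multiplying either matrix by a positive constant leaves it unchanged, which is why contrast normalisation is not needed at this step.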

Figure 3 illustrates the main representations discussed above (spatio-temporal priority maps, proto-objects and IPs sampled from proto-objects).

Fig. 3: The main perceptual representation levels involved in the pre-attentive (top row) and attentive (bottom row) stages. The input of the pre-attentive stage is the stream at low resolution. The priority map is visualized as a color map: reddish values specify the most salient regions. Selected proto-objects are parametrised as ellipses. IPs sampled from proto-objects are displayed as red dots (cfr. Section IV). The input of the attentive stage is the foveated stream obtained by setting the initial FoA at the centre of the image. Candidate gaze shifts are displayed as yellow trajectories from the center of the current FoA. The next FoA is chosen to maximise the expected reward (cfr. Section VI), and displayed as a white/blue circle.

III-B Action control

The model exploits a coarse-to-fine strategy. First, evaluation of stream “quality” is performed pre-attentively, resorting to the configurational complexity (cfr. Section IV). This stage corresponds to the pre-attentive loop briefly summarised in Algorithm 1.

1:{Parallel execution on all streams}
3:Compute the bottom-up features and weight the feature map.
4:Sample the priority map conditioned on the feature map.
5:Sample the potential object regions, or proto-objects, from the priority map.
6:Based on the available proto-objects, sample IPs and compute the quality of the stream via its complexity.
Algorithm 1 Pre-attentive loop

On this basis, the “best” quality stream is selected (cfr., Section V), and the within-stream potential preys are attentively handled, in order to detect the actual targets that are interesting under the given task (cfr., Section VI). Intra-stream behavior thus boils down to an instance of the classic deployment of visual attention: spotting an object and keeping to it - via either fixational movements or smooth pursuit - or relocating to another region (saccade) [39]. Thus, intra-stream behavior does not reduce to tracking, which is to be considered solely the computational realization of visual smooth pursuit. The attentive loop is summarised in Algorithm 2.

1:Input: the pre-attentively computed stream complexities.
2:{Patch choice}
3:Based on the complexities, sample the video stream to be analyzed.
4:Execute the between-stream gaze shift to the center of the current frame of the chosen stream and set the current FoA;
7:     Compute bottom-up and top-down features and sample the feature map, based on the current FoA.
8:     Sample the priority map conditioned on the feature map.
9:     Sample proto-objects from the priority map;
10:     Based on the proto-objects, sample IPs and compute the quality of the stream via its complexity.
11:{Prey handling}
12:     Execute the within-stream gaze shift so as to maximize the expected reward with respect to the IP values and analyze the current FoA.
14:until the giving-up condition is met
15:{Patch leave}
Algorithm 2 Attentive loop

Clearly, since the number of targets is a priori unknown (partial information condition), efficient search requires tailoring a stopping decision to target handling within the stream. From a foraging standpoint, the decision to leave should also depend on future prospects for food on the current patch, which in turn depends on posterior information about this patch. This issue is addressed in the framework of optimal Bayesian foraging [48, 49] (cfr., Section VII).

IV Pre-attentive sensing

The goal of this stage is to infer a proto-object representation of all the patches within the spatial landscape (Fig. 3, top row). To this end, the posterior is calculated from the joint pdf. In the derivations that follow we omit the time index for notational simplicity.

Rewrite the joint pdf factorization in Eq. 1 under the assumption of object-based feature independence. The posterior is then obtained by marginalizing over the remaining RVs, exploiting the definitions of the factors, the independence assumption, and the local conditional independencies in the model. Thus:


The term is a prior “tuning” the preference for specific object-based features. In the pre-attentive stage we assume a uniform prior and restrict the representation to feed-forward features. Then, Eq. 2 boils down to the probabilistic form of a classic feed-forward saliency map (see Fig. 3), namely,


where the likelihood is modulated by the bottom-up feature likelihood.

Given the priority map, a set of proto-objects or candidate preys can be sampled from it. Following [39], we exploit a sparse representation of proto-objects. These are conceived in terms of “potential bites”, namely interest points (IPs) sampled from the proto-object. At any given time, each proto-object is characterised by a different shape and location: a cluster of IPs sampled from it (its sparse representation) together with a parametric description of its location and shape.

A map of binary RVs indicates, at each time, the presence or absence of each proto-object, and the overall map of proto-objects is given by their collection. Location and shape of each proto-object are parametrized accordingly. Assume independent proto-objects:


and for


The first step (Eq. 4) samples the proto-object map from the landscape. The second (Eq. 5) samples the proto-object parameters.

Here, the proto-object map is drawn from the priority map by deriving a preliminary binary map obtained by thresholding the priority values. The threshold is adaptively set so as to achieve a preset significance level in deciding whether the given priority values are in the extreme tails of the pdf. The procedure is based on the assumption that an informative proto-object is a relatively rare region and thus results in values which are in the tails of the priority distribution. Then, following [50], the map is obtained by labelling the connected regions around the retained locations.

We set a maximum number of proto-objects so as to retain only the most important ones.
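A minimal sketch of this extraction step follows, under stated assumptions: the priority map is a NumPy array, the adaptive threshold is approximated by an upper-tail quantile (the quantile level and the maximum number of proto-objects are illustrative parameters, not the paper's values), and connected-region labelling uses plain 4-connectivity BFS in place of the labelling procedure of [50].

```python
import numpy as np
from collections import deque

def extract_proto_objects(priority, tail_quantile=0.95, max_protos=5):
    """Sketch of pre-attentive proto-object extraction: threshold the
    priority map at its upper tail, label 4-connected regions, and keep
    only the largest ones (assumed proxy for 'most important')."""
    thresh = np.quantile(priority, tail_quantile)
    mask = priority >= thresh
    labels = np.zeros(priority.shape, dtype=int)
    regions, current = [], 0
    H, W = priority.shape
    for i in range(H):
        for j in range(W):
            if mask[i, j] and labels[i, j] == 0:
                # flood-fill a new connected region
                current += 1
                q = deque([(i, j)])
                labels[i, j] = current
                pixels = []
                while q:
                    y, x = q.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < H and 0 <= nx < W
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            q.append((ny, nx))
                regions.append(pixels)
    regions.sort(key=len, reverse=True)   # retain the most important ones
    return regions[:max_protos]
```

Each returned region is a list of pixel coordinates, which then provides the spatial support for the ellipse fitting of Eq. 5.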

As to Eq. 5, the proto-object map provides the necessary spatial support for a 2D ellipse maximum-likelihood approximation of each proto-object, whose location and shape are parametrized accordingly (see [39] for a formal justification).

In the third step (Eq. 6), the procedure generates clusters of IPs, one cluster for each proto-object (see Fig. 3). By assuming a Gaussian distribution centered on the proto-object - thus with mean at its center and covariance matrix given by the axes parameters of the 2D ellipse fitting the proto-object shape - Eq. 6 can be further specified as [39]:


We set a maximum total number of IPs and, for each proto-object, we sample from a Gaussian centered on the proto-object as in Eq. 7. The number of IPs per proto-object is estimated in proportion to the size (area) of the proto-object. Eventually, the set of all IPs characterising the pre-attentively perceived proto-objects is obtained as the union of the per-proto-object clusters.
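The allocation-and-sampling step can be sketched as follows. The dictionary keys (`center`, `axes`, `area`) and the diagonal covariance built from the ellipse axes are assumptions of this illustration; the paper's Eq. 7 may use a full covariance from the fitted ellipse orientation.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_interest_points(protos, max_ips=100):
    """Allocate a budget of IPs across proto-objects in proportion to
    their area, then sample each cluster from a Gaussian whose mean and
    covariance come from the fitted ellipse (axis-aligned here)."""
    areas = np.array([p["area"] for p in protos], dtype=float)
    n_per = np.floor(max_ips * areas / areas.sum()).astype(int)
    clusters = []
    for proto, n in zip(protos, n_per):
        mean = proto["center"]                   # ellipse center
        cov = np.diag(np.square(proto["axes"]))  # ellipse axes -> variances
        clusters.append(rng.multivariate_normal(mean, cov, size=n))
    return np.vstack(clusters) if clusters else np.empty((0, 2))
```

Flooring the per-proto-object counts can leave a small part of the budget unused; distributing the remainder to the largest proto-objects would be a natural refinement.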

V Stream selection

Streams vary in the number of objects they contain and possibly in other characteristics, such as the ease with which individual items are found. We assume that in the pre-attentive stage the observer's choice of a stream to inspect is drawn on the basis of some global index of interest characterizing each stream in the visual landscape. In ecological modelling, for instance, one such index is the landscape entropy determined by dispersion/concentration of preys [1].

Here, generalizing these assumptions, we introduce the time-varying configurational complexity of each stream. Intuitively, by considering each stream a dynamic system, we resort to the general principle that complex systems are neither completely random nor perfectly ordered, and that complexity should reach its maximum at a level of randomness away from these extremes [51]. For instance, a crowded scene with many pedestrians moving represents a disordered system (high entropy, low order) as opposed to a scene where no activities take place (low entropy, high order). The highest complexity is thus reached when specific activities occur: e.g., a group of people meeting. To formalize the relationship between stream complexity and stream selection we proceed as follows. Given the complexities, the choice of a stream is obtained by sampling from the categorical distribution




Following [51], complexity is defined in terms of the order/disorder of the system,


where the disorder parameter and the order parameter are defined in terms of the Boltzmann-Gibbs-Shannon (BGS) entropy and its supremum. They are calculated as follows.

For each stream, we compute the BGS entropy as a function of the spatial configuration of the sampled IPs. The spatial domain is partitioned into a configuration space of cells (rectangular windows). By assigning each IP to the corresponding window, the probability for a point to be within a given cell at a given time can be estimated as the fraction of IPs falling within that cell.

Thus, Eq. 10 can be easily computed. Since we are dealing with a fictitious thermodynamical system, we set Boltzmann's constant to unity. The supremum of the entropy is associated with a completely unconstrained process, since with reflecting boundary conditions the asymptotic distribution over the cells is uniform.
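The whole pre-attentive stream-scoring step can be sketched as follows, under stated assumptions: complexity is taken as the product of disorder and order (our reading of the order/disorder construction in [51]), the cell grid size is illustrative, and the categorical distribution of Eq. 9 is assumed to be obtained by simple normalisation of the complexities.

```python
import numpy as np

rng = np.random.default_rng(7)

def configurational_complexity(ips, frame_shape, n_cells=4):
    """Disorder = BGS entropy of the IPs' cell-occupancy histogram,
    normalised by its supremum log(n_cells**2); order = 1 - disorder;
    complexity = disorder * order, maximal between the two extremes."""
    H, W = frame_shape
    hist, _, _ = np.histogram2d(ips[:, 0], ips[:, 1],
                                bins=n_cells, range=[[0, H], [0, W]])
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))           # BGS entropy, k_B = 1
    disorder = entropy / np.log(n_cells ** 2)  # normalise by the supremum
    return disorder * (1.0 - disorder)

def choose_stream(complexities):
    """Sample a stream index from the categorical distribution obtained
    by normalising the complexities (one plausible reading of Eq. 9)."""
    c = np.asarray(complexities, dtype=float)
    return rng.choice(len(c), p=c / c.sum())
```

Note that both a fully concentrated and a fully uniform IP configuration yield zero complexity, matching the intuition that neither an empty scene nor pure clutter is interesting.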

When a stream is chosen, attention is deployed to it via the between-stream gaze shift, and the corresponding “entering time” is set.

VI Attentive stream handling

When gaze is deployed to the chosen stream, the FoA is positioned at the centre of the frame, and foveation is simulated by blurring through an isotropic Gaussian function centered at the FoA, whose variance is taken as the radius of the FoA, approximately a fixed fraction of the dimension of the frame support. This way we obtain the foveated image, which provides the input for the next processing steps. The foveation process is updated for every gaze shift within the patch that involves a large relocation (saccade), but not during small relocations, i.e. fixational or pursuit eye movements. At this stage, differently from pre-attentive analysis, the observer exploits the full priority posterior as formulated in Eq. 2, rather than the reduced form specified in Eq. 3. In other terms, the object-based feature likelihood is now taken into account.
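Foveation can be sketched as a blending between the sharp frame and a blurred copy, weighted by a Gaussian acuity map centered at the FoA. This is a toy model under stated assumptions: the radius fraction is illustrative, and the "blur" is a global-mean placeholder for a proper low-pass pyramid such as the one used in [37].

```python
import numpy as np

def foveate(frame, foa, radius_frac=0.25):
    """Toy foveation: full acuity at the FoA, falling off with a
    Gaussian whose std is a fraction of the frame dimension."""
    H, W = frame.shape
    sigma = radius_frac * max(H, W)
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (ys - foa[0]) ** 2 + (xs - foa[1]) ** 2
    acuity = np.exp(-d2 / (2.0 * sigma ** 2))  # 1 at fovea, -> 0 in periphery
    # cheap "blur": global mean as a stand-in for a low-pass pyramid
    blurred = np.full_like(frame, frame.mean())
    return acuity * frame + (1.0 - acuity) * blurred
```

In practice the blurred branch would be a resolution pyramid, but the blending structure, and the fact that it is recomputed only after saccades, carries over.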

Object search is performed by sampling, from the current location, a set of candidate gaze shifts (cfr. Fig. 3, bottom-right picture). In simulation, candidate point sampling is performed as in [39]. In a nutshell, candidates are sampled via a Langevin-type stochastic differential equation, where the drift component is a function of the IPs' configuration, and the stochastic component is sampled from the Lévy α-stable distribution. The latter accounts for prior oculomotor biases on gaze shifts. We use different α-stable parameters for the different types of gaze shifts - fixational, pursuit and saccadic shifts - that have been learned from eye-tracking experiments with human subjects observing videos under the same task considered here. The time-varying choice of the family of parameters is conditioned on the current complexity index (see [39] for details).
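A discretised Langevin-like step of this kind can be sketched as follows. This is a simplification under stated assumptions: the drift is taken toward the IPs' center of mass, all parameters are illustrative, and the heavy-tailed component uses Cauchy noise (the α-stable law with α = 1) as a stand-in for the general Lévy α-stable sampler of [39].

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_candidate_shifts(current, ips, n_candidates=10,
                            drift=0.5, scale=5.0):
    """Candidate gaze shifts = deterministic drift toward the IPs'
    center of mass + heavy-tailed (Cauchy) stochastic component."""
    current = np.asarray(current, dtype=float)
    com = ips.mean(axis=0)              # attractor: IP center of mass
    step = drift * (com - current)      # drift component
    noise = scale * rng.standard_cauchy((n_candidates, 2))
    return current + step + noise
```

The heavy tails occasionally produce very long candidate shifts, mimicking saccades, while most candidates stay near the drift target, mimicking fixational or pursuit movements.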

Denote the reward consequent on a gaze shift. Then, the next location is chosen to maximize the expected reward:


The expected reward is computed with reference to the value of proto-objects available within the stream,


Here the average value of a proto-object with respect to the posterior, by using the samples generated via Eq. 7, can be simply evaluated as


The observer samples candidate gaze shifts. Using Eqs. 7 and 13, Eq. 12 can be written as


where the sum ranges over the region around the candidate location. In foraging terms, Eq. 12 formalises the expected reward of gaining valuable bites of food (IPs) in the neighbourhood of the candidate shift.
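The selection of the next FoA can be sketched as follows. The IP values and the neighbourhood radius are assumptions of this illustration; in the model they derive from the sampled proto-object values of Eq. 13.

```python
import numpy as np

def expected_reward(candidate, ips, ip_values, radius=10.0):
    """Expected reward of a candidate shift: summed value of the IPs
    ('bites') falling within a radius of the candidate location (a
    plausible instantiation of Eq. 14)."""
    d = np.linalg.norm(ips - np.asarray(candidate, dtype=float), axis=1)
    return float(ip_values[d <= radius].sum())

def choose_next_foa(candidates, ips, ip_values, radius=10.0):
    """Pick the candidate maximising the expected reward (Eq. 12)."""
    rewards = [expected_reward(c, ips, ip_values, radius) for c in candidates]
    return candidates[int(np.argmax(rewards))]
```

The candidate landing nearest the densest, most valuable cluster of IPs wins, which is exactly the "richest bite of food" reading of Eq. 12.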

Note that effective reward is gained by the observer only if the gaze shift is deployed to a point that sets a FoA overlapping an object of interest for the task (in the simulation, for simplicity, a reward is granted when a face or a body is detected, and none is granted otherwise). Thus, as the observer attentively explores the stream, he updates his estimate of stream quality in terms of accumulated rewards, which will provide the underlying support for the stream giving-up strategy.

A final remark concerns the number of objects that can be detected within the stream. Attentive analysis is sequential by definition. In principle, all relevant objects in the scene can be eventually scrutinized, provided that enough time is granted to the observer. As to detection performance, the current implementation of the model exploits AdaBoost face and body detectors that have been trained on a much larger dataset than the original Viola-Jones detectors, leading to good detection accuracy above a minimal detectable region size. But cogently, the actual number of scrutinized objects is the result of the observer's trade-off between the quality of the visited stream and the potential quality of the other streams. Namely, it depends on the stream giving-up time as dynamically determined by the Bayesian strategy.

VII The Bayesian giving-up strategy

In this Section we consider the core problem of switching from one stream to another. In foraging theory this issue is addressed as: “How long should a forager persevere in a patch?”. Two approaches can be pursued: i) patch-based, or global/distal, models; ii) prey-based, or local/proximal, models. These are, for historical reasons, subject to separate analyses and modeling [1]. The Bayesian strategy we propose here aims at filling this gap.

VII-A Global models. Charnov's Marginal Value Theorem

In the scenario envisaged by Charnov [5] the landscape is composed of food patches that deliver food rewards as a smooth decreasing flow. Briefly, Charnov’s MVT states that a patch leave decision should be taken when the expected current rate of information return falls below the mean rate that can be gained from other patches. MVT considers food intake as a continuous deterministic process where foragers assess patch profitability by the instantaneous net energy intake rate. In its original formulation, it provides the optimal solution to the problem, although only once the prey distribution has already been learnt; it assumes omniscient foragers (i.e. with a full knowledge of preys and patch distribution). The model is purely functional, nevertheless it is important for generating two testable qualitative predictions [52]: 1) patch time should increase with prey density in the patch; 2) patch times should increase with increasing average travel time in the habitat and should decrease with increasing average host density in the patches.

VII-B Local models

The MVT and its stochastic generalization do not take into account the behavioral proximate mechanisms used by foragers to control patch time or to obtain information about prey distribution [52]. Such a representation of intake dynamics is inadequate to account for the real search/capture processes occurring within the patch. These, in most cases, are discrete and stochastic events in nature. For instance, Wolfe [4] has examined human foraging in a visual search context, showing that departures from MVT emerge when patch quality varies and when visual information is degraded.

Experience on a patch, in terms of cumulative reward, gives information on the current patch type and on future rewards. A good policy should make use of this information and vary the giving-up time with experience. In this perspective, as an alternative to the MVT, local models, e.g., Waage's [38], assume that the motivation of a forager to remain and search on a particular patch is linearly correlated with host density. As long as this “responsiveness” is above a given (local) threshold, the forager does not leave the patch [38]. As a consequence, the total time spent within the patch eventually depends on the experience of the animal within that patch.

VII-C Distal and proximal strategies in an uncertain world

To deal with uncertainty [48, 49], a forager should persevere in a patch as long as the probability of the next observation being successful is greater than the probability of the first observation in one among the patches being successful, taking into account the time it takes to make those observations.

Recall that complexity is used as a pre-attentive stochastic proxy of the likelihood that a stream yields a reward to the observer. Thus, the distribution defined in Eq. 9 stands for the prior probability of objects being primed for a patch (in OFT, the base rate [1]).

A common detection or gain function, that is the probability of reinforcement vs. time, is an exponential distribution of times to detection [49], and can be defined in terms of the conditional probability of gaining a reward in a stream by a given time, given that the stream has been primed via its complexity:


where the rate parameter is the detection rate. Then, by generalising the two-patch analysis discussed in [49], the following holds.

Proposition VII.1

Denote the average complexity of the streams other than the current one. Under a suitable hypothesis on the complexities at the entering time (see Appendix A), leaving the current stream when


defines an optimal Bayesian strategy for the observer.


See Appendix A.       
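The qualitative content of the rule can be sketched as a toy decision function. All quantities here are assumptions of this illustration, not the paper's exact terms of Eq. 16: priors stand in for the complexity-derived probabilities, a single detection rate is shared across streams, and the switching cost is charged against the competing stream's observation window.

```python
import numpy as np

def keep_foraging(pi_s, pi_other, lam, dt_next, dt_switch):
    """Persevere while the probability that the NEXT observation on the
    current stream succeeds (prior pi_s, exponential detection with
    rate lam over the next dt_next time units) exceeds the probability
    that a FIRST observation on another stream (prior pi_other)
    succeeds once the switching time dt_switch has been paid."""
    p_stay = pi_s * (1.0 - np.exp(-lam * dt_next))
    p_leave = pi_other * (1.0 - np.exp(-lam * max(dt_next - dt_switch, 0.0)))
    return p_stay >= p_leave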

The strategy summarised via Eq. 16 can be considered as a Bayesian version of the MVT-based strategy [5]. In order to reconcile the distal functional constraint formalised through Eq. 16 with the behavioral proximate mechanisms used by foragers within the patch, we put a prior distribution on the rate parameter of the exponential distribution in the form of a Gamma distribution, whose hyper-parameters govern the distribution of the rate. Assume that when the observer selects the stream, an initial prior is assigned.

The hyper-parameters represent initial values of the expected rewards and of the “capture” time, respectively, and thus stand for “a priori” estimates of stream profitability. At any subsequent time, the posterior over the rate can be computed via Bayes' rule. Since the Gamma distribution is a conjugate prior for the exponential likelihood, the Bayesian update only calls for the determination of the hyper-parameter update


where the first hyper-parameter is incremented by the number of handled objects, that is the number of rewards effectively gained up to the current time, and the second by the intervals of time spent on the handled proto-objects. The latter, in general, can be further decomposed as the sum of the time to spot and the time to handle each proto-object. Clearly, spotting time elapses for any proto-object within the stream, whilst handling time is only taken into account when the object has been detected as such (e.g., a moving proto-object such as a pedestrian) and actual object handling occurs (e.g., tracking the pedestrian); otherwise it is zero. In the experimental analyses we will assume, for generality, that both times are functions (linear, quadratic, etc.) of the number of elements (pixels, super-pixels, point representations or parts) defining the prey, which depends on the specific algorithm adopted.
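The conjugate update can be sketched in a few lines. The argument names are assumptions of this illustration; the substance is our reading of Eq. 17: the Gamma shape grows with the rewards gained, the Gamma rate with the total time spent, and the point estimate of the detection rate is the posterior mean.

```python
def update_hyperparameters(alpha0, beta0, rewards, times):
    """Gamma-exponential conjugate update: returns the posterior
    hyper-parameters and the posterior-mean estimate of the rate."""
    alpha = alpha0 + sum(rewards)   # rewards effectively gained
    beta = beta0 + sum(times)       # time spent (spot + handle)
    lam_hat = alpha / beta          # posterior expected detection rate
    return alpha, beta, lam_hat
```

As the observer lingers without gaining rewards, the rate sum in the denominator grows while the shape does not, so the estimated rate, and hence the motivation to stay, decays, exactly the behavior described below.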

Eventually, when the hyper-parameters have been computed (Eq. 17), a suitable value for the rate can be obtained as its expected value under the posterior. As a consequence, the total within-stream time depends on the experience of the observer within that stream.

Here, this proximal mechanism is formally related to the distal global quality of all streams, via the condition specified through Eq. 16 so that the decision threshold is dynamically modulated by the pre-attentive observer’s perception across streams. As a result, even though on a short-time scale the observer might experience local motivational increments due to rewards, on a longer time scale the motivation to stay within the current stream will progressively decrease.

VIII Experimental work

VIII-A Dataset

We used a portion of the UCR Videoweb Activities Dataset [53], a publicly available dataset containing data recorded from multiple outdoor wireless cameras. The dataset contains several days of recording and several scenes for each day, amounting to many hours of video displaying dozens of activities along with annotations. For the first three days, each scene is composed of a collection of human activities and motions which forms a continuous storyline.

The dataset is designed for evaluating the performance of human-activity recognition algorithms, and it features multiple human activities viewed from multiple cameras located asymmetrically with overlapping and non-overlapping views, with varying degrees of illumination and lighting conditions. This amounts to a large variety of simple actions such as walking, running, and waving.

We experimented on three different scenes recorded on three different days. Here we present results obtained from scene 1, recorded on the second day (eight camera recordings). Results from the other scenes are reported as Supplementary Material. The scene contains eight streams identified by their camera ids. Each video runs at a fixed frame rate and the cameras are not time-synchronized; we synchronized the video streams by applying suitable temporal shifts between cameras, a subset of which can be used as time reference. Since one of the camera videos is the shortest, the analyses presented in the following consider only the frame interval common to all streams.

Annotated activities are: argue within two feet, pickup object, raised arms, reading book, running, sit cross legged, sit on bench, spin while talking, stand up, talk on phone, text on phone. All are performed by humans.

As previously discussed, we are not concerned with action or activity recognition. Nevertheless, the dataset provides a suitable benchmark. The baseline aim of the model is to dynamically set the FoA on the most informative subsets of the video streams in order to capture atomic events that are at the core of the activities actually recorded. In this perspective, the virtual forager operates under the task “pay attention to people within the scene”, so that the classes of objects of interest are represented by faces and human bodies. The output collection of subsets from all streams can eventually be evaluated in terms of the retrieved activities marked in the ground-truth.

VIII-B Experimental evaluation

Evaluation of results should consider the two dimensions of i) visual representation and ii) giving-up strategy. For instance, it is not straightforwardly granted that a pre-attentive representation for choosing the patch/video performs better (beyond computational efficiency considerations) than an attentive representation, where all objects of interest are detected before selecting the video stream.

As to the giving-up time choice, any strategy should in principle perform better than a random choice. Again, this should not be taken for granted, since in a complex scenario a bias-free, random allocation could perform better than expected. Further, a pure Charnov-based strategy, or a deterministic one, e.g. [36], could offer a reasonable solution. Under this rationale, the evaluation takes into account the following analyses.

VIII-B1 Representations of visual information

Aside from the basic priority map representation (denoted M in the remainder of this Section), which is exploited by our model (Eqs. 3 and 2 for the pre-attentive and attentive stages, respectively), the following alternatives have been considered.

  • Static (denoted S): the baseline salience computation by Itti et al. [54]. The method combines orientation, intensity and color contrast features in a purely bottom-up scheme. The frame-based saliency map is converted into a probability map (as in [37]) so as to implement the bottom-up priority map (Eq. 3). Attentive exploration is driven by bottom-up information and the object-based likelihood is kept uniform when computing Eq. 2.

  • Static combined with Change Detection and Face/Body Detection (S+CD+FB): this representation has been used in [36]. It draws on the Bayesian integration of top-down / bottom-up information as described in [37]. Novelties are computed by detecting changes between two subsequent frames at a lower spatial resolution [36]. In our setting, it amounts to assuming that the observer has the capability of detecting objects before selecting the stream; namely, it boils down to directly computing Eq. 2.

  • Proposed model with early prey detection (M+): akin to the S+CD+FB scheme, the full priority probability (Eq. 2) is exploited before stream selection, instead of the bottom-up priority (Eq. 3).

Clearly, there are differences between adopting one representation or the other. These can be readily appreciated by analyzing the behavior over time of the stream complexities obtained with the above representations. One stream is hardly distinguishable from another when using the S and the S+CD+FB representations; by contrast, higher discriminability is apparent for the M and M+ settings (cfr. Figs. 12 and 13, Supplementary Material). Yet, most interesting here is to consider representational performance as related to the foraging strategy.

VIII-B2 Foraging strategies

As to stream giving-up, we compare the following strategies.


Deterministic. The simplest strategy [1]. A camera switch is triggered after a fixed time. Higher values of the within-stream time entail a low number of switches.


Random. This strategy triggers a camera switch after a random time, drawn from a uniform pdf with a suitable scale parameter.


Charnov. We adapted the solution to Charnov's MVT [5] by Lundberg et al. [55]. If the observer chooses a stream at a given time, the optimal stream residence time is defined as a function of the resource level in the stream (here assessed in terms of complexity) at the entering time, the average resource level across streams, a parameter determining the initial slope of the gain function in the stream, and the average switching (travelling) time between two video streams. By assuming constant traveling time, the only parameter to determine is the slope. Note that higher values of the slope entail longer residence times.

VIII-C Evaluation measures

Fig. 4: A probabilistic glance at the information content of multiple video streams: the distribution of activities within each stream of the dataset, represented in terms of the joint and marginal probability distributions. Distributions are visualised as Hinton diagrams, i.e., the square size is proportional to the probability value. The plots for the marginal distributions have different scales from those for the joint distribution (on the same scale, the marginals would look larger, as they sum all of the mass from one direction).

Fig. 5: Distribution of activities across cameras and time. A darker color indicates that several activities co-occur at the same time while the white color indicates the total absence of activities.

The definition of measures that capture the subtleties of activity dynamics across multiple cameras is not straightforward. One has to account for the overall distribution of activities with respect to the different streams. Meanwhile, each stream should be characterized in terms of its activities as they evolve in time.

As to the first issue, consider the joint probability of streams and activities, where the stream can now be considered as a discrete RV and the activity is a discrete RV indexing the given activity set (argue within two feet, etc.). Such a joint distribution can be empirically estimated from the number of frames of each stream that display each activity, normalised over all streams and activities. In Fig. 4, the joint distribution is rendered as a 2D Hinton diagram. It is suitable to provide two essential pieces of information.

On the one hand, marginalizing the joint distribution over the activities offers an insight into the relevance of each stream - in terms of its marginal likelihood - to the job of collating informative stream subsets (cfr. Fig. 4). Intuitively, we expect the corresponding marginal computed after stream processing to be a sparse summarisation of the original one. Yet, it should account for the original representational relevance of the streams (cfr. Fig. 6). This can be further understood by displaying the distribution of activities across cameras and time as in Fig. 5 (a darker color indicates several activities co-occurring at the same time). By comparing with the Hinton diagram in Fig. 4, it is readily seen that some cameras capture a large number of activities, whilst others feature few activities. At the same time, the information displayed by subsets of one stream can be considered redundant with respect to subsets of another stream (Fig. 5).

On the other hand, by marginalizing over the streams, the distribution of the activities in the dataset is recovered. It can be noted (cfr. Fig. 4) that such a distribution is not uniform: some activities are under-represented compared to others. This class imbalance problem entails two issues. First, any kind of processing performed to select subsets of the video streams for collating the most relevant data and information of interest should preserve the shape of such a distribution, i.e. the marginal distribution of activities after processing should match the original one. Second, non-uniformity should be accounted for when defining quantitative evaluation measures [56]. Indeed, a suitable metric should reveal the true behavior of the method over minority and majority activities: the assessments of over-represented and under-represented activities should contribute equally to the assessment of the whole method. To cope with this problem we jointly use two assessment metrics: the standard accuracy and the macro average accuracy [56].

Denote, for each activity, the number of positives, i.e., the number of frames in which the activity occurs in the entire recorded scene, independently of the camera, and the number of true positives, i.e., the number of frames of the output video sequence that contain the activity. Given these quantities for each activity, the following can be defined.

  • Standard Accuracy: the ratio of the total number of true positives to the total number of positives, summed over all activities. Note that this is a global measure that does not take into account the accuracy achieved on a single activity. From now on we will refer to this metric simply as accuracy.

  • Macro Average Accuracy: the arithmetic average of the per-activity accuracies. It allows each partial accuracy to contribute equally to the method assessment. We will refer to this metric simply as average accuracy.
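The two metrics can be sketched as follows; the list-based interface is an assumption of this illustration, with one entry per activity.

```python
def accuracies(tp, p):
    """Standard accuracy (total true positives over total positives)
    and macro average accuracy (mean of per-activity accuracies)."""
    standard = sum(tp) / sum(p)
    macro = sum(t / q for t, q in zip(tp, p)) / len(p)
    return standard, macro
```

For example, with a majority activity detected well (90/100) and a minority one detected poorly (1/10), the standard accuracy stays high (91/110) while the macro average drops to 0.5, exposing the imbalance that motivates using both metrics jointly.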

VIII-D Parameters and experiments setup

We used a subset of frames to set up the strategy parameters:

  • Bayesian: the initial hyper-parameters;

  • Random: the parameter of the uniform probability distribution;

  • Deterministic: the within-stream time parameter that modulates camera switches;

  • Charnov: the slope of the gain function.

The remaining