Predicting the Driver’s Focus of Attention: the DR(eye)VE Project. A deep neural network learnt to reproduce the human driver focus of attention (FoA) in a variety of real-world driving scenarios.
In this work we aim to predict the driver's focus of attention. The goal is to estimate what a person would pay attention to while driving, and which part of the scene around the vehicle is more critical for the task. To this end we propose a new computer vision model based on a multi-branch deep architecture that integrates three sources of information: raw video, motion and scene semantics. We also introduce DR(eye)VE, the largest dataset of driving scenes for which eye-tracking annotations are available. This dataset features more than 500,000 registered frames, matching ego-centric views (from glasses worn by drivers) and car-centric views (from roof-mounted camera), further enriched by other sensors measurements. Results highlight that several attention patterns are shared across drivers and can be reproduced to some extent. The indication of which elements in the scene are likely to capture the driver's attention may benefit several applications in the context of human-vehicle interaction and driver attention analysis.
According to the J3016 SAE International standard, which defined the five levels of autonomous driving, cars will provide a fully autonomous journey only at the fifth level. At lower levels of autonomy, computer vision and other sensing systems will still support humans in the driving task. Human-centric Advanced Driver Assistance Systems (ADAS) have significantly improved safety and comfort in driving (e.g. collision avoidance systems, blind spot control, lane change assistance etc.). Among ADAS solutions, the most ambitious examples are related to monitoring systems [29, 21, 33, 43]: they parse the attention behavior of the driver together with the road scene to predict potentially unsafe manoeuvres and act on the car in order to avoid them, either by signaling the driver or by braking. However, all these approaches suffer from the complexity of capturing the true driver’s attention and rely on a limited set of fixed safety-inspired rules. Here, we shift the problem from a personal level (what the driver is looking at) to a task-driven level (what most drivers would look at), introducing a computer vision model able to replicate the human attentional behavior during the driving task.
We achieve this result in two stages. First, we conduct a data-driven study on drivers’ gaze fixations under different circumstances and scenarios. The study concludes that the semantics of the scene, the speed and bottom-up features all influence the driver’s gaze. Second, we advocate for the existence of common gaze patterns that are shared among different drivers. We empirically demonstrate the existence of such patterns by developing a deep learning model that can profitably learn to predict where a driver would be looking in a specific situation.
[Figure panels: (a) RGB frame; (b) optical flow; (c) semantic segmentation; (d) predicted map]
To this aim we recorded and annotated 555,000 frames (approx. 6 hours) of driving sequences in different traffic and weather conditions: the DR(eye)VE dataset. For every frame we acquired the driver’s gaze through an accurate eye tracking device and registered such data to the external view recorded from a roof-mounted camera.
The DR(eye)VE data richness enables us to train an end-to-end deep network that predicts salient regions in car-centric driving videos. The network we propose is based on three branches which estimate attentional maps from a) visual information of the scene, b) motion cues (in terms of optical flow) and c) semantic segmentation (Fig. 1).
In contrast to the majority of experiments, which are conducted in controlled laboratory settings or employ sequences of unrelated images [68, 11, 30], we train our model on data acquired in the field.
Final results demonstrate the ability of the network to generalize across different times of day, different weather conditions, different landscapes and different drivers.
Finally, we believe our work can be complementary to the current semantic segmentation and object detection literature [76, 70, 45, 13, 44] by providing a diverse set of information. According to , the act of driving combines complex attention mechanisms guided by the driver’s past experience, short reaction times and strong contextual constraints. Thus, very little information is needed to drive if guided by a strong focus of attention (FoA) on a limited set of targets: our model aims at predicting them.
The paper is organized as follows. In Sec. 2, related works about computer vision and gaze prediction are provided to frame our work in the current state-of-the-art scenario. Sec. 3 describes the DR(eye)VE dataset and some insights about several attention patterns that human drivers exhibit. Sec. 4 illustrates the proposed deep network to replicate such human behavior, and Sec. 5 reports the performed experiments.
The way humans favor some entities in the scene, along with key factors guiding eye fixations in presence of a given task (e.g. visual search), has been extensively studied for decades [66, 74]. The main difficulty that arises when approaching the subject is the variety of perspectives under which it can be cast. Indeed, visual attention has been approached by psychologists, neurobiologists and computer scientists, making the field highly interdisciplinary. We are particularly interested in the computational perspective, in which predicting human attention is often formalized as an estimation task delivering the probability of each point in a given scene to attract the observer’s gaze.
Conversely, bottom-up models capture salient objects or events naturally popping out in the image, independently of the observer, the undergoing task and other external factors. This task is widely known in literature as visual saliency prediction. In this context, computational models focus on spotting visual discontinuities, either by clustering features or considering the rarity of image regions, locally [57, 39] or globally [1, 77, 14].
For a comprehensive review of visual attention prediction methods, we refer the reader to .
Recently, the success of deep networks involved both task-driven attention and saliency prediction, as models have become more powerful in both paradigms, achieving state-of-the-art results on public benchmarks [34, 37, 28, 15, 16].
In video, attention prediction and saliency estimation are more complex than in still images, since motion heavily affects human gaze. Some models merge bottom-up saliency with motion maps, either by means of optical flow or feature tracking. Other methods enforce temporal dependencies between bottom-up features in successive frames. Both supervised [79, 59] and unsupervised [42, 72, 73] feature extraction can be employed, and temporal coherence can be achieved either by conditioning the current prediction on information from previous frames or by capturing motion smoothness with optical flow [79, 59]. While deep video saliency models are still lacking, one interesting work relies on a recurrent architecture fed with clip encodings to predict the fixation map by means of a Gaussian Mixture Model (GMM). Nevertheless, most methods are limited to bottom-up features, accounting only for visual discontinuities in terms of textures or contours. Our proposal, instead, is specifically tailored to the driving task and fuses the bottom-up information with semantics and motion elements that have emerged as attention factors from the analysis of the DR(eye)VE dataset.
presented a model that exploits visual saliency with a non-linear SVM classifier for the detection of traffic signs. The validation of this study was performed in a laboratory non-realistic setting, emulating an in-car driving session. A more realistic experiment was then conducted with a larger set of targets, e.g. including pedestrians and bicycles.
In this section we present the DR(eye)VE dataset (Fig. 2), the protocol adopted for video registration and annotation, the automatic processing of eye-tracker data and the analysis of the driver’s behavior in different conditions.
The dataset. The DR(eye)VE dataset consists of 555,000 frames divided into 74 sequences, each of which is 5 minutes long. Eight different drivers aged from 20 to 40, seven men and one woman, took part in the driving experiment, which lasted more than two months. Videos were recorded in different contexts, both in terms of landscape (downtown, countryside, highway) and traffic condition, ranging from traffic-free to highly cluttered scenarios. They were recorded in diverse weather conditions (sunny, rainy, cloudy) and at different hours of the day (both daytime and night). Tab. I recaps the dataset features and Tab. II compares it with other related proposals. DR(eye)VE is currently the largest publicly available dataset including gaze and driving behavior in automotive settings.
|# Videos||# Frames||Drivers||Weather conditions||Lighting||Gaze Info||Metadata||Camera Viewpoint|
|74||555,000||8||sunny, cloudy, rainy||day, evening, night||raw fixations, gaze map, pupil dilation||GPS, car speed, car course||driver (720p), car (1080p)|
|Pugeault et al. ||158,668||–||
|Simon et al.||40||30||Downtown||Gaze Maps||No||No|
|Underwood et al.||120||77||Urban Motorway||–||No||No|
|Fridman et al.||1,860,761||50||Highway||6 Gaze Location Classes||Yes||No|
The Acquisition System. The driver’s gaze information was captured using the commercial SMI ETG 2w Eye Tracking Glasses (ETG). The ETG capture attention dynamics even in the presence of head pose changes, which occur very often during the task of driving. While a frontal camera acquires the scene at 720p/30fps, the user's pupils are tracked at 60Hz. Gaze information is provided in terms of eye fixations and saccade movements. The ETG was manually calibrated before each sequence for every driver.
Simultaneously, videos from the car perspective were acquired using the GARMIN VirbX camera mounted on the car roof (RMC, Roof-Mounted Camera). This sensor captures frames at 1080p/25fps, and includes further information such as GPS data, accelerometer and gyroscope measurements.
Video-gaze registration. The dataset has been processed to move the acquired gaze from the egocentric (ETG) view to the car (RMC) view. The latter features a much wider field of view (FoV), and can contain fixations that are out of the egocentric view. For instance, this can occur whenever the driver takes a peek at something at the border of this FoV, but does not move the head. For every sequence, the two videos were manually aligned to cope with the difference in sensor framerate. Videos were then registered frame-by-frame through a homographic transformation that projects fixation points across views. More formally, at each timestep the RMC frame and the ETG frame are registered by means of a homography matrix $H$, computed by matching SIFT descriptors from one view to the other (see Fig. 3). A further RANSAC procedure ensures robustness to outliers. While homographic mapping is theoretically sound only across planar views, which is not the case in outdoor environments, we empirically found that projecting an object from one image to another always recovered the correct position. This makes sense if the distance between the projected object and the camera is far greater than the distance between the object and the projective plane. In Sec. 13 of the supplementary material, we derive formal bounds to explain this phenomenon.
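As a minimal sketch of the projection step described above, the snippet below maps gaze points from one view to another through a given 3x3 homography. The function name `project_points` and the example matrix are illustrative assumptions, not the authors' code; in practice the matrix would be estimated from matched keypoints with a robust estimator (for instance OpenCV's `cv2.findHomography` with the `cv2.RANSAC` method), which is not shown here.

```python
import numpy as np

def project_points(H, points):
    """Map (N, 2) pixel coordinates through a 3x3 homography H."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coordinates
    mapped = homog @ np.asarray(H, dtype=float).T     # apply the homography
    return mapped[:, :2] / mapped[:, 2:3]             # back to pixel coordinates

# Hypothetical homography (scale by 1.5, shift by (100, 50) pixels),
# standing in for a matrix estimated from SIFT matches.
H_example = np.array([[1.5, 0.0, 100.0],
                      [0.0, 1.5, 50.0],
                      [0.0, 0.0, 1.0]])
gaze_etg = [[640.0, 360.0]]                 # gaze point in the egocentric view
gaze_rmc = project_points(H_example, gaze_etg)
```

The division by the third homogeneous coordinate is what distinguishes a general homography from a plain affine map; for the affine example above it is a no-op.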
Fixation map computation.
The pipeline discussed above provides a frame-level annotation of the driver’s fixations.
In contrast to image saliency experiments, there is no clear and indisputable protocol for obtaining continuous maps from raw fixations. This is even more evident when fixations are collected in task-driven real-life scenarios. The main reason is that the observer’s subjectivity cannot be removed by averaging different observers’ fixations. Indeed, two different observers cannot experience the same scene at the same time (e.g. two drivers cannot be at the same time at the same point of the street). The only way to average among different observers would be to adopt a simulation environment, but it has been shown that the cognitive load in controlled experiments is lower than in real test scenarios and that it affects the true attention mechanism of the observer. In our preliminary DR(eye)VE release, fixation points were aggregated and smoothed by means of a temporal sliding window. In such a way, temporal filtering discarded momentary glimpses that contain precious information about the driver’s attention. Following established psychological protocols, this limitation was overcome in the current release, where the new fixation maps were computed without temporal smoothing.
These protocols highlight the high degree of subjectivity of scene scanpaths in short temporal windows ( sec) and suggest neglecting the pop-out order of fixations within such windows. This mechanism also ameliorates the inhibition of return phenomenon, which may prevent interesting objects from being observed twice in short temporal intervals [51, 27], leading to the underestimation of their importance.
More formally, the fixation map $F_t$ for a frame at time $t$ is built by accumulating projected gaze points in a temporal sliding window of $k$ frames, centered in $t$. For each time step $t+i$ in the window, where $i \in \{-\frac{k}{2}, \dots, \frac{k}{2}\}$, gaze point projections on $F_t$ are estimated through the homography transformation $H_{t+i}^{t}$ that projects points from the image plane at frame $t+i$, namely $p_{t+i}$, to the image plane in $F_t$. A continuous fixation map is obtained from the projected fixations by centering on each of them a multivariate Gaussian having a diagonal covariance matrix $\Sigma$ (the spatial variance of each variable is set to $\sigma_s^2$ pixels) and taking the max value along the time axis:

$$F_t(x, y) = \max_{i \in \{-\frac{k}{2}, \dots, \frac{k}{2}\}} \mathcal{N}\big((x, y) \mid H_{t+i}^{t} \cdot p_{t+i}, \Sigma\big) \quad (1)$$

The Gaussian variance has been computed by averaging the ETG spatial acquisition errors on 20 observers looking at calibration patterns at different distances from 5 to 15 meters. The described process can be appreciated in Fig. 4. Finally, each map $F_t$ is normalized to sum to 1, so that it can be considered a probability distribution of fixation points.
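The construction above (one Gaussian per projected fixation, maximum over the temporal window, normalization to unit sum) can be sketched as follows. This is an illustrative sketch, not the released pipeline: the function name `fixation_map`, the isotropic variance `sigma2` and the toy image size are assumptions, and the gaze points are taken as already projected onto the target frame.

```python
import numpy as np

def fixation_map(projected_points, shape, sigma2):
    """Build a fixation map from gaze points already projected on the target frame.

    projected_points: one entry per frame in the temporal window, each a list of
                      (x, y) points; shape: (height, width); sigma2: assumed
                      isotropic spatial variance in pixels^2.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    fmap = np.zeros(shape, dtype=float)
    for pts in projected_points:
        for px, py in np.asarray(pts, dtype=float).reshape(-1, 2):
            # Unnormalized Gaussian: the constant factor is shared by all points
            # and disappears in the final normalization.
            g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma2))
            fmap = np.maximum(fmap, g)  # max along the time axis, not a sum
    total = fmap.sum()
    return fmap / total if total > 0 else fmap

# Toy window of three frames; the last one has no valid fixation.
window = [[(10, 5)], [(12, 5)], []]
fmap = fixation_map(window, shape=(20, 30), sigma2=4.0)
```

Taking the maximum rather than the sum means that a location fixated in several frames of the window is not weighted more than one fixated once, which matches the idea of neglecting the order and repetition of fixations within the window.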
[Figure panels: (a) Acting - 69 719; (b) Inattentive - 12 282; (c) Error - 22 893; (d) Subjective - 3 166]
Labeling attention drifts. Fixation maps exhibit a very strong central bias. This is common in saliency annotations  and even more in the context of driving.
For these reasons, there is a strong imbalance between the many easy-to-predict scenarios and the infrequent but interesting hard-to-predict events.
is reported as an indication of its spread (the determinant equals the product of eigenvalues, each of which measures the spread along a different data dimension). The bar plots illustrate the amount of downtown (red), countryside (green) and highway (blue) frames that concurred to generate the average gaze position for a specific speed range. Best viewed on screen.
To enable the evaluation of computational models under such circumstances, the DR(eye)VE dataset has been extended with a set of further annotations. For each video, subsequences whose ground truth poorly correlates with the average ground truth of that sequence are selected. We employ Pearson’s Correlation Coefficient (CC) and select subsequences whose CC falls below a fixed threshold. This happens when the attention of the driver focuses far from the vanishing point of the road. Examples of such subsequences are depicted in Fig. 5. Several human annotators inspected the selected frames and manually split them into (a) acting, (b) inattentive, (c) errors and (d) subjective events:
errors can happen either due to failures in the measuring tool (e.g. in extreme lighting conditions) or in the successive data processing phase (e.g. SIFT matching);
inattentive subsequences occur when the driver focuses his gaze on objects unrelated to the driving task (e.g. looking at an advertisement);
subjective subsequences describe situations in which the attention is closely related to the individual experience of the driver, e.g. a road sign on the side might be an interesting element to focus for someone that has never been on that road before but might be safely ignored by someone who drives that road every day.
acting subsequences include all the remaining ones.
Acting subsequences are particularly interesting as the deviation of driver’s attention from the common central pattern denotes an intention linked to task-specific actions (e.g. turning, changing lanes, overtaking …). For these reasons, subsequences of this kind will have a central role in the evaluation of predictive models in Sec. 5.
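The selection criterion described above can be illustrated with a short sketch that flags frames whose fixation map correlates poorly with the average map of the sequence. The helper names and the threshold value of 0.5 are hypothetical placeholders, and the sketch works per frame rather than per subsequence for brevity.

```python
import numpy as np

def pearson_cc(a, b):
    """Pearson's correlation coefficient between two maps of equal shape."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def flag_low_correlation(maps, threshold):
    """Return one boolean per map: True when CC with the mean map is below threshold."""
    mean_map = np.mean(maps, axis=0)
    return [pearson_cc(m, mean_map) < threshold for m in maps]

# Toy sequence: four frames fixating the same pixel, one frame fixating elsewhere.
center = np.zeros((4, 4)); center[1, 1] = 1.0
aside = np.zeros((4, 4)); aside[3, 0] = 1.0
flags = flag_low_correlation([center] * 4 + [aside], threshold=0.5)  # hypothetical threshold
```

Only the last frame is flagged here, since its gaze departs from the pattern shared by the rest of the sequence; the flagged frames are the ones that would then go to human annotators.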
By analyzing the dataset frames, the very first insight is the presence of a strong attraction of driver’s focus towards the vanishing point of the road, that can be appreciated in Fig. 6.
The same phenomenon was observed in previous studies [67, 6] in the context of visual search tasks.
We observed indeed that drivers often tend to disregard road signals, cars coming from the opposite direction and pedestrians on sidewalks.
This is an effect of human peripheral vision, which allows observers to still perceive and interpret stimuli out of, but sufficiently close to, their focus of attention (FoA). A driver can therefore achieve a larger area of attention by focusing on the road’s vanishing point: due to the geometry of the road environment, many of the objects worthy of attention come from there and have already been perceived when distant.
Moreover, the gaze location tends to drift from this central attractor when the context changes in terms of car speed and landscape. Indeed  suggests that our brain is able to compensate spatially or temporally dense information by reducing the visual field size. In particular, as the car travels at higher speed the temporal density of information (i.e. the amount of information that the driver needs to elaborate per unit of time) increases: this causes the useful visual field of the driver to shrink . We also observe this phenomenon in our experiments, as shown in Fig. 7.
DR(eye)VE data also highlight that the driver’s gaze is attracted towards specific semantic categories. To reach this conclusion, the dataset is analysed by means of a semantic segmentation model, and the distribution of semantic classes within the fixation map is evaluated. More precisely, given a segmented frame and the corresponding fixation map, the probability for each semantic class to fall within the area of attention is computed as follows. First, the fixation map (which is continuous in $[0, 1]$) is normalized such that the maximum value equals 1. Then, nine binary maps are constructed by thresholding such continuous values linearly in the interval $[0, 1]$. As the threshold moves towards 1 (the maximum value), the area of interest shrinks around the real fixation points (since the continuous map is modeled by means of several Gaussians centered in fixation points, see previous section). For every threshold, a histogram over semantic labels within the area of interest is built, by summing up occurrences collected from all DR(eye)VE frames. Fig. 8 displays the result: for each class, the probability of a pixel to fall within the region of interest is reported for each threshold value. The figure provides insight about which categories represent the real focus of attention and which ones tend to fall inside the attention region just by proximity to the former. Object classes that exhibit a positive trend, such as road, vehicles and people, are the real focus of the gaze, since the ratio of pixels classified accordingly increases when the observed area shrinks around the fixation point. In a broader sense, the figure suggests that, although our focus while driving is dominated by road and vehicles, we often observe specific object categories even if they contain little information useful to drive.
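The thresholding analysis described above can be sketched for a single frame as follows. The function name, the toy label map and the exact threshold spacing are assumptions made for illustration; in the study the histograms are accumulated over all dataset frames rather than computed on one frame.

```python
import numpy as np

def class_probability_by_threshold(fix_map, labels, num_classes, thresholds):
    """For each threshold, return the share of each semantic class inside the
    region where the max-normalized fixation map is at least the threshold."""
    fm = np.asarray(fix_map, dtype=float)
    fm = fm / fm.max()                        # maximum value becomes 1
    out = np.zeros((len(thresholds), num_classes))
    for k, t in enumerate(thresholds):
        region = fm >= t                      # binary area of interest
        counts = np.bincount(labels[region], minlength=num_classes)
        out[k] = counts / max(counts.sum(), 1)
    return out

# Toy frame: class 1 sits under the fixation peak, classes 0 and 2 around it.
fix_map = np.array([[0.1, 0.5, 1.0],
                    [0.1, 0.2, 0.9]])
labels = np.array([[0, 0, 1],
                   [0, 2, 1]])
thresholds = np.linspace(0.1, 0.9, 9)         # nine thresholds, assumed spacing
probs = class_probability_by_threshold(fix_map, labels, 3, thresholds)
```

A class whose share grows as the threshold increases (class 1 in the toy frame) is the one lying under the fixation peak, which mirrors the positive trend discussed above for road, vehicles and people.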
The DR(eye)VE dataset is sufficiently large to allow the construction of a deep architecture to model common attentional patterns. Here, we describe our neural network model to predict human FoA while driving.
A short temporal window (e.g. half a second) holds sufficient contextual information for predicting where the driver would focus at that moment. Indeed, human drivers can take even less time to react to an unexpected stimulus. Our architecture takes a sequence of 16 consecutive frames (0.65s) as input (called clips from now on) and predicts the fixation map for the last frame of such a clip.
the drivers’ FoA exhibits consistent patterns, suggesting that it can be reproduced by a computational model;
the drivers’ gaze is affected by a strong prior on objects semantics, e.g. drivers tend to focus on items lying on the road;
motion cues, like vehicle speed, are also key factors that influence gaze.
Accordingly, the model output merges three branches with identical architecture, unshared parameters and different input domains: the RGB image, the semantic segmentation and the optical flow field. We call this architecture multi-branch model. Following a bottom-up approach, in Sec. 4.1 the building blocks of each branch are motivated and described. Later, in Sec. 4.2 it will be shown how the branches merge into the final model.
Each branch of the multi-branch model is a two-input two-output architecture composed of two intertwined streams. The aim of this peculiar setup is to prevent the network from learning a central bias, which would otherwise stall the learning in early training stages (for further details the reader can refer to Sec. 14 and Sec. 15 of the supplementary material). To this end, one of the streams is given as input (output) a severely cropped portion of the original image (ground truth), ensuring a more uniform distribution of the true gaze, and runs through the COARSE module, described below. Similarly, the other stream uses the COARSE module to obtain a rough prediction over the full resized image and then refines it through a stack of additional convolutions called the REFINE module. At test time, only the output of the REFINE stream is considered. Both streams rely on the COARSE module, the convolutional backbone (with shared weights) which provides the rough estimate of the attentional map corresponding to a given clip. This component is detailed in Fig. 9.
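In a 3D convolutional layer, the value at position $(x, y, z)$ of the $j$-th feature map in the $i$-th layer is computed as

$$v_{ij}^{xyz} = b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \quad (2)$$

where $m$ indexes different input feature maps, $w_{ijm}^{pqr}$ is the value at the position $(p, q)$ at time $r$ of the kernel connected to the $m$-th feature map, and $P_i$, $Q_i$ and $R_i$ are the dimensions of the kernel along width, height and temporal axis respectively; $b_{ij}$ is the bias of the $j$-th feature map in layer $i$.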
From C3D, only the most general-purpose features are retained by removing the last convolutional layer and the fully connected layers which are strongly linked to the original action recognition task. The size of the last pooling layer is also modified in order to cover the remaining temporal dimension entirely. This collapses the tensor from 4D to 3D, making the output independent of time. Eventually, a bilinear upsampling brings the tensor back to the input spatial resolution and a 2D convolution merges all features into one channel. See Fig. 9 for additional details on the COARSE module.
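To make the role of 3D convolution and of the final temporal pooling more concrete, here is a naive sketch of a single-output 3D convolution (computed as cross-correlation, as is customary in deep learning frameworks) followed by a pooling that spans the whole remaining temporal axis. It is a didactic assumption-laden toy with no stride or padding, not the C3D implementation, and the tensor sizes are arbitrary.

```python
import numpy as np

def conv3d_single(inputs, kernel, bias=0.0):
    """Naive 'valid' 3D convolution producing one output feature map.

    inputs: (M, T, H, W) input feature maps; kernel: (M, R, Q, P).
    Returns an array of shape (T-R+1, H-Q+1, W-P+1).
    """
    _, t, h, w = inputs.shape
    _, r, q, p = kernel.shape
    out = np.full((t - r + 1, h - q + 1, w - p + 1), bias, dtype=float)
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                patch = inputs[:, z:z + r, y:y + q, x:x + p]
                out[z, y, x] += np.sum(patch * kernel)  # sum over maps, time, space
    return out

clip_features = np.ones((3, 4, 5, 5))   # 3 input maps, 4 time steps, 5x5 pixels
kernel = np.ones((3, 2, 3, 3))          # spans 2 time steps and a 3x3 window
response = conv3d_single(clip_features, kernel, bias=1.0)

# Pooling over the entire remaining temporal axis removes the time dimension.
collapsed = response.max(axis=0)
```

After the temporal pooling the result is a purely spatial map, which is what allows a 2D convolution to merge the features into a single-channel prediction.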
Training the two streams together. The architecture of a single FoA branch is depicted in Fig. 10. During training, the first stream feeds the COARSE network with random crops, forcing the model to learn the current focus of attention given visual cues rather than prior spatial location. The C3D training process described in , employs a image resize, and then a random crop. However, the small difference in the two resolutions limits the variance of gaze position in ground truth fixation maps and is not sufficient to avoid the attraction towards the center of the image. For this reason, training images are resized to before being cropped to . This crop policy generates samples that cover less than a quarter of the original image, thus ensuring a sufficient variety in prediction targets. This comes at the cost of a coarser prediction: as crops get smaller, the ratio of pixels in the ground truth covered by gaze increases, leading the model to learn larger maps.
In contrast, the second stream feeds the same COARSE model with the same images, this time resized to – and not cropped. The coarse prediction obtained from the COARSE model is then concatenated with the final frame of the input clip, i.e. the frame corresponding to the final prediction. Eventually, the concatenated tensor goes through the REFINE module to obtain a higher resolution prediction of the FoA.
The overall two-stream training procedure for a single branch is summarized in Algorithm 1.
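The input preparation for the two streams can be sketched as below: one stream receives a random crop of an enlarged clip (with the matching crop of the ground truth), the other receives the whole clip resized. All names and sizes here are hypothetical toy values chosen only so that the crop covers less than a quarter of the enlarged image; they are not the resolutions used by the authors, and nearest-neighbour resizing is used purely to keep the sketch dependency-free.

```python
import numpy as np

def resize_nearest(frames, size):
    """Nearest-neighbour resize of (T, H, W, C) frames to (T, size, size, C)."""
    _, h, w, _ = frames.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frames[:, rows][:, :, cols]

def two_stream_inputs(clip, fix_map, big, crop, small, rng):
    """Return ((cropped clip, cropped map), (resized clip, resized map))."""
    gt = fix_map[None, :, :, None]                      # treat the map as 1 frame, 1 channel
    big_clip, big_gt = resize_nearest(clip, big), resize_nearest(gt, big)
    top = int(rng.integers(0, big - crop + 1))          # same random window for
    left = int(rng.integers(0, big - crop + 1))         # clip and ground truth
    cropped = (big_clip[:, top:top + crop, left:left + crop],
               big_gt[0, top:top + crop, left:left + crop, 0])
    resized = (resize_nearest(clip, small), resize_nearest(gt, small)[0, :, :, 0])
    return cropped, resized

rng = np.random.default_rng(0)
clip = rng.random((16, 108, 192, 3))                    # toy 16-frame clip
fix_map = rng.random((108, 192))
(crop_x, crop_y), (full_x, full_y) = two_stream_inputs(
    clip, fix_map, big=64, crop=28, small=28, rng=rng)  # hypothetical sizes
```

Cropping the clip and the ground truth with the same window is the essential point: the gaze target then lands at varying positions inside the crop, which is what counteracts the central bias.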
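Prediction cost can be minimized in terms of Kullback-Leibler divergence:

$$D_{KL}(Y \,\|\, \hat{Y}) = \sum_i Y(i) \log\left(\epsilon + \frac{Y(i)}{\epsilon + \hat{Y}(i)}\right) \quad (3)$$

where $Y$ is the ground truth distribution, $\hat{Y}$ is the prediction, the summation index $i$ spans across image pixels and $\epsilon$ is a small constant that ensures numerical stability (inputs are always normalized to be a valid probability distribution, although this may be omitted in the notation to improve the readability of the equations). Since each single FoA branch computes an error on both the cropped image stream and the resized image stream, the branch loss can be defined as:

$$\mathcal{L}_d(\mathcal{X}_d, \mathcal{Y}) = \sum_m \Big( D_{KL}\big(\phi(Y^m) \,\|\, \mathcal{C}(\phi(X_d^m))\big) + D_{KL}\big(Y^m \,\|\, \mathcal{R}(\mathcal{C}(\psi(X_d^m)), X_d^m)\big) \Big) \quad (4)$$

where $\mathcal{C}$ and $\mathcal{R}$ denote COARSE and REFINE modules, $(X_d^m, Y^m)$ is the $m$-th training example in the $d$-th domain (namely RGB, optical flow, semantic segmentation), and $\phi$ and $\psi$ indicate the crop and the resize functions respectively.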
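A Kullback-Leibler cost between a ground truth map and a predicted map can be sketched as follows. The epsilon placement shown is one common stabilized form used in saliency evaluation and is an assumption here, as are the function name and the default value of `eps`; both maps are normalized to sum to 1 before the comparison.

```python
import numpy as np

def kl_divergence(y_true, y_pred, eps=1e-7):
    """KL divergence between two non-negative maps, each normalized to sum to 1."""
    p = np.asarray(y_true, dtype=float)
    q = np.asarray(y_pred, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # eps keeps the logarithm finite where the prediction (or the ratio) is zero.
    return float(np.sum(p * np.log(eps + p / (eps + q))))

ground_truth = np.array([[0.7, 0.1],
                         [0.1, 0.1]])
uniform_guess = np.full((2, 2), 0.25)
loss_same = kl_divergence(ground_truth, ground_truth)
loss_uniform = kl_divergence(ground_truth, uniform_guess)
```

Under this sketch a branch loss would simply add two such terms, one evaluated on the cropped stream and one on the resized stream.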
Inference step. While the cropped stream is beneficial in training to reduce the spatial bias, at test time only the REFINE stream, which produces a higher-quality prediction, is used. The outputs of this stream from each branch are then summed together, as explained in the following section.
As described at the beginning of this section and depicted in Fig. 11, the multi-branch model is composed of three identical branches. The architecture of each branch has already been described in Sec. 4.1 above. Each branch exploits complementary information from a different domain and contributes to the final prediction accordingly. In detail, the first branch works in the RGB domain and processes raw visual data about the scene. The second branch focuses on motion through an optical flow representation. Finally, the last branch takes as input semantic segmentation probability maps. For this last branch, the number of input channels depends on the specific algorithm used to extract the results, 19 in our setup (Yu and Koltun). The three independent predicted FoA maps are summed and normalized to result in a probability distribution.
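The final merging step, summing the three per-branch maps and normalizing the result, can be sketched in a few lines. The function name and the toy 2x2 maps are illustrative assumptions; the inputs are assumed to be non-negative maps of identical shape.

```python
import numpy as np

def merge_branches(pred_rgb, pred_flow, pred_seg):
    """Sum three non-negative FoA maps and normalize the result to sum to 1."""
    total = (np.asarray(pred_rgb, dtype=float)
             + np.asarray(pred_flow, dtype=float)
             + np.asarray(pred_seg, dtype=float))
    s = total.sum()
    return total / s if s > 0 else total

rgb = np.array([[0.2, 0.0], [0.0, 0.8]])
flow = np.array([[0.0, 0.0], [0.5, 0.5]])
seg = np.array([[0.1, 0.1], [0.0, 0.8]])
foa = merge_branches(rgb, flow, seg)
```

Because the merge is a plain sum, a location receives a high final score when any branch supports it strongly, and the highest when all three agree.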
To allow for larger batch size, we choose to bootstrap each branch independently by training it according to Eq. 4. Then, the complete multi-branch model which merges the three branches is fine-tuned with the following loss:
The algorithm describing the complete inference over the multi-branch model is detailed in Alg. 2.
In this section we evaluate the performance of the proposed multi-branch model. First, we compare our model against some baselines and other methods in the literature. Following the guidelines in , for the evaluation phase we rely on Pearson’s Correlation Coefficient ($CC$) and Kullback–Leibler Divergence ($D_{KL}$) measures. Moreover, we evaluate the Information Gain ($IG$) measure to assess the quality of a predicted map $\hat{Y}$ with respect to a ground truth map $Y$ in presence of a strong bias, as:

$$IG(\hat{Y}, Y) = \frac{1}{N} \sum_i Y(i) \left[\log_2(\epsilon + \hat{Y}(i)) - \log_2(\epsilon + B(i))\right] \quad (6)$$

where $i$ is an index spanning all the $N$ pixels in the image, $B$ is the bias computed as the average training fixation map and $\epsilon$ ensures numerical stability.
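An information-gain style score, which rewards a prediction only for what it explains beyond a fixed bias map, can be sketched as below. This is one plausible reading rather than the evaluation code: the function name, the choice of dividing by the number of pixels and the default `eps` are assumptions, and other normalizers (such as the number of fixated pixels) are used elsewhere in the saliency literature.

```python
import numpy as np

def information_gain(y_true, y_pred, baseline, eps=1e-7):
    """Ground-truth-weighted log2 ratio between prediction and bias map."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    b = np.asarray(baseline, dtype=float)
    y, p, b = y / y.sum(), p / p.sum(), b / b.sum()
    n = y.size                                  # assumed normalizer: pixel count
    return float(np.sum(y * (np.log2(eps + p) - np.log2(eps + b))) / n)

ground_truth = np.array([[0.7, 0.1],
                         [0.1, 0.1]])
bias_map = np.full((2, 2), 0.25)                # stand-in for the mean training map
ig_good = information_gain(ground_truth, ground_truth, bias_map)
ig_none = information_gain(ground_truth, bias_map, bias_map)
```

A prediction identical to the bias map scores zero, and a prediction that places less mass than the bias on the fixated pixels scores below zero, which is why negative values can appear for weak models.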
Furthermore, we conduct an ablation study to investigate how different branches affect the final prediction and how their mutual influence changes in different scenarios. We then study whether our model captures the attention dynamics observed in Sec. 3.1. Eventually, we assess our model from a human perception perspective.
Implementation details. The three different pathways of the multi-branch model (namely FoA from color, from motion and from semantics) have been pre-trained independently using the same cropping policy of Sec. 4.2 and minimizing the objective function in Eq. 4. Each branch has been respectively fed with:
frames clips in raw RGB color space;
frames clips with optical flow maps, encoded as color images through the flow field encoding ;
frames clips holding semantic segmentation from  encoded as scalar activation maps, one per segmentation class.
During individual branch pre-training clips were randomly mirrored for data augmentation. We employ Adam optimizer with parameters as suggested in the original paper , with the exception of the learning rate that we set to . Eventually, batch size was fixed to 32 and each branch was trained until convergence. The DR(eye)VE dataset is split into train, validation and test set as follows: sequences 1-38 are used for training, sequences 39-74 for testing. The 500 frames in the middle of each training sequence constitute the validation set.
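The split described above can be written down as a small helper. The function name is hypothetical, and the default of 7,500 frames per sequence is an assumption derived from the stated totals (555,000 frames over 74 sequences); whether the validation frames are also excluded from training is not stated in the text and is left to the caller.

```python
def dreyeve_split(frames_per_sequence=7500, val_frames=500):
    """Return train sequence ids, test sequence ids, and the frame index range
    (shared by every training sequence) that forms the validation set."""
    train_sequences = list(range(1, 39))    # sequences 1-38
    test_sequences = list(range(39, 75))    # sequences 39-74
    start = (frames_per_sequence - val_frames) // 2   # middle of the sequence
    val_frame_range = range(start, start + val_frames)
    return train_sequences, test_sequences, val_frame_range

train_ids, test_ids, val_range = dreyeve_split()
```

With the assumed sequence length, the validation frames of each training sequence are those with zero-based indices from 3500 up to, but not including, 4000.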
Moreover, the complete multi-branch architecture was fine-tuned using the same cropping and data augmentation strategies minimizing cost function in Eq. 5. In this phase batch size was set to due to GPU memory constraints and learning rate value was lowered to . Inference time of each branch of our architecture is milliseconds per videoclip on an NVIDIA Titan X.
|Method||Test sequences: CC||Test sequences: DKL||Test sequences: IG||Acting subsequences: CC||Acting subsequences: DKL||Acting subsequences: IG|
|Mathe et al.||0.04||3.30||-2.08||-||-||-|
|Wang et al.||0.04||3.40||-2.21||-||-||-|
|Wang et al.||0.11||3.06||-1.72||-||-||-|
|Palazzi et al.||0.55||1.48||-0.21||0.37||2.00||0.20|
In Tab. III we report results of our proposal against other state-of-the-art models [72, 59, 15, 46, 4, 73] evaluated both on the complete test set and on acting subsequences only.
With one exception, all the competitors are bottom-up approaches that mainly rely on appearance and motion discontinuities. To test the effectiveness of deep architectures for saliency prediction we compare against the Multi-Level Network (MLNet), which scored favourably in the MIT300 saliency benchmark, and the Recurrent Mixture Density Network (RMDN), which represents the only deep model addressing video saliency. While MLNet works on images, discarding the temporal information, RMDN encodes short sequences in a similar way to our COARSE module, and then relies on an LSTM architecture to model long-term dependencies and estimates the fixation map in terms of a GMM. For a fair comparison, both models were re-trained on the DR(eye)VE dataset.
Results highlight the superiority of our multi-branch architecture on all test sequences. The gap in performance with respect to bottom-up unsupervised approaches [72, 73] is larger, and is explained by the peculiarity of the attention behavior within the driving context, which calls for a task-oriented training procedure. Moreover, MLNet’s low performance testifies to the need to account for the temporal correlation between consecutive frames, which distinguishes the tasks of attention prediction in images and videos. Indeed, RMDN processes video inputs and outperforms MLNet on both and metrics, performing comparably on . Nonetheless, its performance is still limited: qualitative results reported in Fig. 12 suggest that long-term dependencies captured by its recurrent module lead the network towards regression to the mean, discarding contextual and frame-specific variations that would be preferable to keep. To support this intuition, we measure the average $D_{KL}$ between RMDN predictions and the mean training fixation map (Baseline Mean), resulting in a value of 0.11. Being lower than the divergence measured with respect to ground truth maps, this value highlights the closer correlation to a central baseline rather than to ground truth. Finally, we also observe improvements with respect to our previous proposal, which relies on a more complex backbone model (also including a deconvolutional module) and processes RGB clips only. The gap in performance resides in the greater awareness of our multi-branch architecture of the aspects that characterize the driving task, as emerged from the analysis in Sec. 3.1. The positive performance of our model is also confirmed when evaluated on the acting partition of the dataset. We recall that acting indicates sub-sequences exhibiting a significant task-driven shift of attention from the center of the image (Fig. 5).
Being able to predict the FoA also on acting sub-sequences means that the model captures the strong centered attention bias but is capable of generalizing when required by the context.
This is further shown by the comparison against a centered Gaussian baseline (BG) and against the average of all training set fixation maps (BM). The former baseline has proven effective on many image saliency detection tasks  while the latter represents a more task-driven version. The superior performance of the multi-branch model w.r.t. the baselines highlights that, although attention is often strongly biased towards the vanishing point of the road, the network is able to deal with sudden task-driven changes in gaze direction.
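To make the two baselines concrete, the following is a minimal sketch of how a centered Gaussian map (BG), a mean training map (BM) and a divergence between two maps could be computed. The spread parameter `sigma_frac`, the `eps` stabilizer and all function names are assumptions made for illustration; they are not taken from the released code, and the divergence shown is the plain Kullback-Leibler form rather than necessarily the exact variant used in our evaluation.

```python
import math


def centered_gaussian_map(height, width, sigma_frac=0.25):
    """Gaussian centered in the image, normalized to sum to 1.

    sigma_frac (std as a fraction of each side) is a hypothetical knob.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = sigma_frac * height, sigma_frac * width
    rows = [[math.exp(-0.5 * (((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2))
             for x in range(width)] for y in range(height)]
    total = sum(sum(r) for r in rows)
    return [[v / total for v in r] for r in rows]


def mean_map(maps):
    """Pixel-wise average of equally sized maps (a BM-style baseline)."""
    n = len(maps)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / n for x in range(w)]
            for y in range(h)]


def kl_divergence(target, pred, eps=1e-12):
    """KL divergence of `pred` from `target`, after normalizing both to sum 1."""
    t_sum = sum(sum(r) for r in target)
    p_sum = sum(sum(r) for r in pred)
    kl = 0.0
    for t_row, p_row in zip(target, pred):
        for t, p in zip(t_row, p_row):
            t, p = t / t_sum, p / p_sum
            if t > 0:  # terms with zero target mass contribute nothing
                kl += t * math.log(t / (p + eps))
    return kl
```

A low divergence between a model's predictions and `mean_map(training_maps)` would indicate, as discussed for RMDN, that the model stays close to the central baseline.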
In this section we investigate the behavior of our proposed model under different landscapes, time of day and weather (Sec. 5.2.1); we study the contribution of each branch to the FoA prediction task (Sec. 5.2.2); and we compare the learnt attention dynamics against the one observed in the human data (Sec. 5.2.3).
The DR(eye)VE data has been recorded under varying landscapes, times of day and weather conditions. We tested our model in all such driving conditions. As would be expected, Fig. 13 shows that human attention is easier to predict on highways than downtown, where the focus can shift towards more distractors. The model seems more reliable in evening scenarios than in morning or night ones, since in the evening we observed better lighting conditions and fewer shadows, over-exposures and so on. Lastly, in rainy conditions we notice that human gaze is easier to model, possibly due to the higher level of awareness demanded of the driver and the consequent inability to look away from the vanishing point. To support the latter intuition, we measured the performance of the BM baseline (i.e. the average training fixation map), grouped by weather condition. As expected, the value in rainy weather () is significantly lower than the ones for cloudy () and sunny weather (), highlighting that when it rains the driver is more focused on the road.
|Test sequences||Acting subsequences|
In order to validate the design of the multi-branch model (see Sec. 4.2), here we study the individual contributions of the different branches by disabling one or more of them.
Results in Tab. IV show that the RGB branch plays a major role in FoA prediction. The motion stream is also beneficial and provides a slight improvement, which becomes clearer in the acting subsequences. Indeed, optical flow intrinsically captures a variety of peculiar scenarios that are non-trivial to classify when only color information is provided, e.g. when the car is still at a traffic light or is turning. The semantic stream, on the other hand, provides very little improvement. In particular, from Tab. IV and by specifically comparing I+F and I+F+S, a slight increase in the measure can be appreciated. Nevertheless, such improvement has to be considered negligible when compared to color and motion, suggesting that in the presence of efficiency concerns or real-time constraints the semantic stream can be discarded with little loss in performance. However, we expect the benefit from this branch to increase as more accurate segmentation models are released.
The previous sections validate quantitatively the proposed model. Now, we assess its capability to attend like a human driver by comparing its predictions against the analysis performed in Sec. 3.1.
First, we report the average predicted fixation map in several speed ranges in Fig. 14. The conclusions we draw are twofold: i) generally, the model succeeds in modeling the behavior of the driver at different speeds, and ii) as the speed increases fixation maps exhibit lower variance, easing the modeling task, and prediction errors decrease.
We also study how often our model focuses on different semantic categories, in a fashion that recalls the analysis of Sec. 3.1, but employing our predictions rather than ground truth maps as focus of attention. More precisely, we normalize each map so that the maximum value equals 1, and apply the same thresholding strategy described in Sec. 3.1. Likewise, for each threshold value a histogram over class labels is built by counting all pixels falling within the binary map, over all test frames. This results in nine histograms over semantic labels, which we merge by averaging the probabilities obtained at different thresholds. Fig. 15 shows the comparison. Color bars represent how often the predicted map focuses on a certain category, while gray bars depict ground truth behavior and are obtained by averaging the histograms in Fig. 8 across different thresholds. Please note that, to highlight differences for sparsely populated categories, values are reported on a logarithmic scale. The plot shows that a certain degree of absolute error is present for all categories. However, in a broader sense, our model replicates the relative weight of different semantic classes while driving, as testified by the importance of roads and vehicles, which still dominate, against other categories such as people and cycles that are mostly neglected. This correlation is confirmed by the Kendall rank coefficient, which scored when computed on the two bar series.
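The procedure above can be sketched as follows for a single frame. This is an illustrative sketch only: the threshold values passed by the caller, the function names and the single-frame scope are assumptions (the analysis in the paper accumulates counts over all test frames), and the rank correlation shown is the simple tau-a variant without tie correction.

```python
from collections import Counter


def class_histogram(fix_map, seg_map, thresholds):
    """Average, over thresholds, of the class distribution inside the
    thresholded attention map. Assumes thresholds lie in (0, 1] and the
    map has a strictly positive maximum."""
    peak = max(max(row) for row in fix_map)
    classes = sorted({c for row in seg_map for c in row})
    averaged = {c: 0.0 for c in classes}
    for th in thresholds:
        counts = Counter()
        for f_row, s_row in zip(fix_map, seg_map):
            for f, s in zip(f_row, s_row):
                if f / peak >= th:  # pixel falls inside the binary map
                    counts[s] += 1
        total = sum(counts.values())
        for c in classes:
            averaged[c] += counts[c] / total / len(thresholds)
    return averaged


def kendall_tau(a, b):
    """Kendall rank coefficient (tau-a) between two equally long series."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Calling `kendall_tau` on the per-class values of the predicted and ground truth histograms, listed in the same class order, gives the kind of rank agreement discussed above.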
To further validate the predictions of our model from the human perception perspective, 50 people with at least 3 years of driving experience were asked to participate in a visual assessment. Participants were students (11 females, 39 males) of age between 21 and 26 () recruited at our University on a voluntary basis through an online form.
First, a pool of 400 videoclips (40 seconds long) is sampled from the DR(eye)VE dataset. Sampling is weighted such that resulting videoclips are evenly distributed among different scenarios, weathers, drivers and daylight conditions. Also, half of these videoclips contain sub-sequences that were previously annotated as acting.
To approximate as realistically as possible the visual field of attention of the driver, sampled videoclips are pre-processed following the procedure in . As in  we leverage the Space Variant Imaging Toolbox  to implement this phase, setting the parameter that halves the spatial resolution every 2.3° to mirror human vision [71, 36]. The resulting videoclip preserves details near the fixation points in each frame, whereas the rest of the scene gets increasingly blurred farther from fixations, until only low-frequency contextual information survives. Consistently with  we refer to this process as foveation (in analogy with human foveal vision). Thus, pre-processed videoclips will be called foveated videoclips from now on. To appreciate the effect of this step the reader is referred to Fig. 16.
Foveated videoclips were created by randomly selecting one of the following three fixation maps: the ground truth fixation map (G videoclips), the fixation map predicted by our model (P videoclips) or the average fixation map in the DR(eye)VE training set (C videoclips). The latter central baseline allows us to account for a potential preference for a "stable" attentional map (i.e. lack of switching of focus). Further details about the creation of foveated videoclips are reported in Sec. 8 of the supplementary material.
Each participant was asked to watch five randomly sampled foveated videoclips. After each videoclip, they answered the following question:
Would you say the observed attention behavior comes from a human driver? (yes/no)
Each of the 50 participants evaluates five foveated videoclips, for a total of 250 examples.
The confusion matrix of provided answers is reported in Fig. 17. Participants were not particularly good at discriminating between human gaze and model-generated maps, scoring about 55% accuracy, which is comparable to random guessing; this suggests that our model is capable of producing plausible attentional patterns that resemble proper driving behavior to a human observer.
This paper presents a study of human attention dynamics underpinning the driving experience. Our main contribution is a multi-branch deep network capable of capturing such factors and replicating the driver’s focus of attention from raw video sequences. The design of our model has been guided by a prior analysis highlighting i) the existence of common gaze patterns across drivers and different scenarios; and ii) a consistent relation between changes in speed, lighting conditions, weather and landscape, and changes in the driver’s focus of attention. Experiments with the proposed architecture and related training strategies yielded state-of-the-art results. To our knowledge, our model is the first able to predict human attention in real-world driving sequences. As the model’s only input is car-centric video, it might be integrated with already adopted ADAS technologies.
We acknowledge the CINECA award under the ISCRA initiative, for the availability of high performance computing resources and support. We also gratefully acknowledge the support of Facebook Artificial Intelligence Research and Panasonic Silicon Valley Lab for the donation of GPUs used for this research.
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, June 2009.
State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 2013.
Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
Deep gaze I: boosting saliency prediction with feature maps trained on imagenet. In International Conference on Learning Representations Workshops (ICLRW), 2015.
Central and peripheral vision for scene recognition: A neurocomputational modeling exploration. Wang & Cottrell. Journal of Vision, 17(4), 2017.
The following table reports the design of the DR(eye)VE dataset. The dataset is composed of 74 sequences of 5 minutes each, recorded under a variety of driving conditions. Experimental design played a crucial role in preparing the dataset, to rule out spurious correlations between driver, weather, traffic, daytime and scenario. Here we report the details for each sequence.
The aim of this section is to provide additional details on the implementation of the visual assessment presented in Sec. 5.3 of the paper. Please note that additional videos regarding this section can be found together with other supplementary multimedia at https://ndrplz.github.io/dreyeve/. Finally, the reader is referred to https://github.com/ndrplz/dreyeve for the code used to create foveated videos for the visual assessment.
Space Variant Imaging System (SVIS) is a MATLAB toolbox that allows images to be foveated in real time, and has been used in a large number of scientific works to approximate human foveal vision since its introduction in 2002. In this context, the term foveated imaging refers to the creation and display of static or video imagery where the resolution varies across the image. In analogy to human foveal vision, the highest resolution region is called the foveation region. In a video, the location of the foveation region can change dynamically. It is also possible to have more than one foveation region in each image.
The foveation process is implemented in the SVIS toolbox as follows: first, the input image is repeatedly low-pass filtered and down-sampled to half of the current resolution by a Foveation Encoder. In this way a low-pass pyramid of images is obtained. Then a foveation pyramid is created by selecting regions from different resolutions proportionally to the distance from the foveation point. Concretely, the foveation region will be at the highest resolution; the first ring around the foveation region will be taken from the half-resolution image; and so on. Finally, a Foveation Decoder up-samples, interpolates and blends each layer in the foveation pyramid to create the output foveated image.
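As an illustration of the region selection step only, the sketch below assigns each pixel a pyramid level according to its distance from the nearest foveation point: level 0 is full resolution, level 1 is half resolution, and so on. This is not the SVIS implementation; the fixed `ring_width`, the `max_level` cap and the function name are assumptions introduced for the example, and filtering, up-sampling and blending are left out.

```python
import math


def pyramid_level_map(height, width, fixations, ring_width, max_level):
    """Per-pixel pyramid level: 0 inside the foveation region, then one
    level coarser for each ring of `ring_width` pixels (hypothetical)."""
    levels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Distance to the closest foveation point (several may exist).
            d = min(math.hypot(y - fy, x - fx) for fy, fx in fixations)
            row.append(min(int(d // ring_width), max_level))
        levels.append(row)
    return levels


# Resolution retained at level k is 1 / 2**k of the original.
levels = pyramid_level_map(1, 9, [(0, 0)], ring_width=2, max_level=3)
scales = [0.5 ** k for k in levels[0]]
```

With more than one fixation, taking the minimum distance keeps every foveation region at full resolution, in line with the description above.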
The software is open-source and publicly available here: http://svi.cps.utexas.edu/software.shtml. The interested reader is referred to the SVIS website for further details.
From fixation maps back to fixations. The SVIS toolbox foveates images starting from a list of coordinates that represent the foveation points in the given image (please see Fig. 18 for details). However, we do not have this information, as in our work we deal with continuous attentional maps rather than discrete fixation points. To be able to use the same software API we need to regress from the attentional map (either true or predicted) a list of approximate yet plausible fixation locations. To this aim we simply extract the 25 points with the highest value in the attentional map. This is justified by the fact that, during dataset creation, the ground truth fixation map for a frame at time is built by accumulating projected gaze points in a temporal sliding window of frames, centered in (see Sec. 3 of the paper). The output of this phase is thus a fixation map we can use as input for the SVIS toolbox.
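Extracting the highest-valued points from an attentional map can be sketched as below. The function name and the (row, column) output convention are assumptions for illustration; the released code may order or break ties differently.

```python
import heapq


def top_k_fixations(att_map, k=25):
    """Return the (row, col) coordinates of the k largest values of a
    2D attentional map, from highest to lowest value."""
    cells = ((v, y, x)
             for y, row in enumerate(att_map)
             for x, v in enumerate(row))
    return [(y, x) for v, y, x in heapq.nlargest(k, cells)]
```

For instance, `top_k_fixations([[0.1, 0.9], [0.5, 0.3]], k=2)` returns `[(0, 1), (1, 0)]`.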
Taking the blurred-deblurred ratio into account. For the purposes of the visual assessment, keeping track of the amount of blur that a videoclip has undergone is also relevant. Indeed, a certain video may give rise to higher perceived safety only because a more delicate blur allows the subject to see a clearer picture of the driving scene. In order to account for this phenomenon we do the following.
Given an input image the output of the Foveation Encoder is a resolution map , taking values in the range [0, 255], as depicted in Fig. 18 (b). Each value indicates the resolution that a certain pixel will have in the foveated image after decoding, where 0 and 255 indicate minimum and maximum resolution respectively.
For each video , we measure video average resolution after foveation as follows:
where N is the number of frames in the video ( in our setting) and denotes the pixel of the resolution map corresponding to the frame of the input video. The higher the value of the more information is preserved in the foveation process. Due to the sparser location of fixations in ground truth attentional maps, these result in much less blurred videoclips. Indeed, videos foveated with model-predicted attentional maps have on average only 38% of the resolution of videos foveated starting from ground truth attentional maps. Despite this bias, model-predicted foveated videos still gave rise to higher perceived safety among assessment participants.
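A minimal sketch of such a measure is given below, assuming it is the mean of the resolution maps over all pixels and all frames of a video. The exact normalization is an assumption of this sketch; it does not affect ratios between videos of the same size, which is how the measure is used here.

```python
def video_average_resolution(resolution_maps):
    """Mean value of the per-frame resolution maps of one video.

    `resolution_maps` is a list with one 2D map per frame, each holding
    values in [0, 255] as produced by the Foveation Encoder.
    """
    total = 0.0
    count = 0
    for rmap in resolution_maps:
        for row in rmap:
            total += sum(row)
            count += len(row)
    return total / count


def relative_resolution(maps_a, maps_b):
    """Resolution of video A expressed as a fraction of that of video B."""
    return video_average_resolution(maps_a) / video_average_resolution(maps_b)
```

A value of `relative_resolution(predicted_maps, ground_truth_maps)` around 0.38 would correspond to the 38% figure reported above.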
The assessment of predicted fixation maps described in Sec. 5.3 has also been carried out to validate the model in terms of perceived safety. Indeed, participants were also asked to answer the following question:
If you were sitting in the same car of the driver whose attention behavior you just observed, how safe would you feel? (rate from 1 to 5)
The aim of the question is to measure the comfort level of the observer during a driving experience when suggested to focus at specific locations in the scene.
The underlying assumption is that the observer is more likely to feel safe if they agree that the suggested focus is lighting up the right portion of the scene, that is, what they think is worth looking at in the current driving scene. Conversely, if the observer wishes to focus on some specific location but cannot retrieve details there, they are going to feel uncomfortable.
The answers provided by subjects, summarized in Fig. 19, indicate that perceived safety for videoclips foveated using the attentional maps predicted by the model is generally higher than for the ones foveated using either human or central baseline maps. Nonetheless the central bias baseline proves to be extremely competitive, in particular in non-acting videoclips, in which it scores similarly to the model prediction. It is worth noticing that in this latter case both kinds of automatic prediction outperform human ground truth by a significant margin (Fig. 19b). Conversely, when we consider only the foveated videoclips containing acting subsequences, the human ground truth is perceived as much safer than the baseline, although it still scores worse than our model prediction (Fig. 19c). These results hold even though, due to the localization of the fixations, the average resolution of the predicted maps is only 38% of the resolution of ground truth maps (i.e. videos foveated using prediction maps retain much less information). We did not measure a significant difference in perceived safety across the different drivers in the dataset ().
We report in Fig. 20 the composition of each score in terms of answers to the other visual assessment question (“Would you say the observed attention behavior comes from a human driver? (yes/no)”). This analysis aims to measure participants’ bias towards human driving ability. Indeed, the increasing trend of false positives towards higher scores suggests that participants were tricked into believing that “safer” clips came from humans. The reader is referred to Fig. 20 for further details.
The driving task is inherently composed of many subtasks, such as turning or merging in traffic, looking for parking and so on. While such fine-grained subtasks are hard to discover (and probably also to emerge during learning) due to scarcity, here we show how the proposed model has been able to leverage more common subtasks to get to the final prediction. These subtasks are: turning left, turning right, going straight, being still. We gathered automatic annotations through the GPS information released with the dataset. We then train a linear SVM classifier to distinguish the above 4 actions starting from the activations of the last layer of the multi-path model, unrolled into a feature vector. The SVM classifier scores 90% accuracy on the test set (5000 uniformly sampled videoclips), supporting the fact that network activations are highly discriminative for distinguishing the different driving subtasks. Please refer to Fig. 21 for further details. Code to replicate this result is available at https://github.com/ndrplz/dreyeve along with the code of all other experiments in the paper.
In this section we report exemplar cases that particularly benefit from the segmentation branch. In Fig. 22 we can appreciate that, among the three branches, only the semantic one captures the real gaze, that is focused on traffic lights and street signs.
In Fig. 23 we showcase several examples depicting the contribution of each branch of the multi-branch model in predicting the visual focus of attention of the driver. As expected, the RGB branch is the one that most heavily influences the overall network output.
A homography is a projective transformation from a plane to another plane such that the collinearity property is preserved during the mapping. In real world applications, the homography matrix is often computed through an overdetermined set of image coordinates lying on the same implicit plane, aligning points on the plane in one image with points on the plane in the other image. If the input set of points is approximately lying on the true implicit plane, then can be efficiently recovered through least square projection minimization.
Once the transformation has been either defined or approximated from data, to map an image point from the first image to the respective point in the second image, the basic assumption is that actually lies on the implicit plane. In practice this assumption is widely violated in real world applications, when the process of mapping is automated and the content of the mapping is not known a-priori.
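For reference, mapping an image point through a given 3x3 homography amounts to a matrix product in homogeneous coordinates followed by a division by the last component. The sketch below assumes the matrix is supplied as a nested list; the function name is illustrative.

```python
def apply_homography(H, point):
    """Map a 2D image point through the 3x3 homography H.

    The point is lifted to homogeneous coordinates (x, y, 1), multiplied
    by H and projected back by dividing by the third component.
    """
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point is mapped to infinity")
    return (xh / w, yh / w)
```

The result is only meaningful when the 3D point imaged at `point` lies on the plane inducing the homography, which is exactly the assumption discussed above.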
In Fig. 24 we show the generic setting of two cameras capturing the same 3D plane. To construct an erroneous case study, we put a cylinder on top of the plane. Points on the implicit 3D world plane can be consistently mapped across views with a homography transformation and retain their original semantics. As an example, the point is the center of the cylinder base both in world coordinates and across different views. Conversely, the point on the top of the cylinder cannot be consistently mapped from one view to the other. To see why, suppose we want to map from view to view . Since the homography assumes to also be on the implicit plane, its inferred 3D position is far from the true top of the cylinder and is depicted with the leftmost empty circle in Fig. 24. When this point gets reprojected to view , its image coordinates are unaligned with the correct position of the cylinder top in that image. We call this offset the reprojection error on plane , or . Analogously, a reprojection error on plane could be computed with a homographic projection of point from view to view .
The reprojection error is useful to measure the perceptual misalignment of projected points with their intended locations, but due to the (re)projections involved is not an easy tool to work with. Moreover, the very same point can produce different reprojection errors when measured on and on . A related error also arising in this setting is the metric error , or the displacement in world space of the projected image points at the intersection with the implicit plane. This measure of error is of particular interest because it is view-independent, does not depend on the rotation of the cameras with respect to the plane and is zero if and only if the reprojection error is also zero.
Since the metric error does not depend on the mutual rotation of the plane with the camera views, we can simplify Fig. 24 by retaining only the optical centers and from all cameras and by setting, without loss of generality, the reference system on the projection of the 3D point on the plane. This second step is useful to factor out the rotation of the world plane, which is unknown in the general setting. The only assumption we make is that the non-planar point can be seen from both camera views. This simplification is depicted in Fig. 25(a), where we have also named several important quantities such as the distance of from the plane.
In Fig. 25(a), the metric error can be computed as the magnitude of the difference between the two vectors relating points and to the origin:
The aforementioned points are at the intersection of the lines connecting the optical center of the cameras with the 3D point and the implicit plane. An easy way to get such points is through their magnitude and orientation. As an example, consider the point . Starting from the following two similar triangles can be built:
Since they are similar, i.e. they share the same shape, we can measure the distance of from the origin. More formally,
from which we can recover
The orientation of the vector can be obtained directly from the orientation of the vector, which is known and equal to
Eventually, with the magnitude and orientation in place, we can locate the vector pointing to :
Similarly, can also be computed. The metric error can thus be described by the following relation:
The error is a vector, but a convenient scalar can be obtained by using the preferred norm.
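The geometric construction can also be checked numerically. The sketch below assumes, for illustration only, that the implicit plane is z = 0 and that camera centers and the 3D point are given as (x, y, z) tuples; it intersects the two camera-to-point lines with the plane and returns their displacement. Function names are hypothetical.

```python
import math


def intersect_with_plane(camera, point):
    """Intersection with the plane z = 0 of the line through the camera
    optical center and the 3D point."""
    cx, cy, cz = camera
    px, py, pz = point
    if cz == pz:
        raise ValueError("line is parallel to the plane")
    t = cz / (cz - pz)  # parameter at which the line reaches z = 0
    return (cx + t * (px - cx), cy + t * (py - cy))


def metric_error(camera_a, camera_b, point):
    """Displacement on the plane between the two intersection points."""
    ax, ay = intersect_with_plane(camera_a, point)
    bx, by = intersect_with_plane(camera_b, point)
    return (ax - bx, ay - by)


# Point 5 units above the plane, cameras 10 units above it and 2 apart.
err = metric_error((0, 0, 10), (2, 0, 10), (1, 0, 5))
magnitude = math.hypot(*err)
```

When the point lies on the plane (z = 0), both lines meet the plane at the point itself and the error vanishes.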
When the plane inducing the homography remains unknown, the bound and the error estimation from the previous section cannot be directly applied. A more general case is obtained if the reference system is set off the plane, and in particular, on one of the cameras. The new geometry of the problem is shown in Fig. 25(b), where the reference system is placed on camera . In this setting, the metric error is a function of four independent quantities (highlighted in red in the figure): i) the point , ii) the distance of such point from the inducing plane , iii) the plane normal and iv) the distance between the cameras , which is also equal to the position of camera .
To this end, starting from Eq. (13), we are interested in expressing , , and in terms of this new reference system. Since is the projection of on the plane it can also be defined as
where is the plane normal, is an arbitrary point on the plane that we set to , i.e. the projection of on the plane. To ease the readability of the following equations, . Now, if describes the distance from to , we have
Through similar reasoning, and are also rewritten as follows:
Notably, the vector and the scalar both appear as multiplicative factors in Eq. (17), so that if any of them goes to zero, then the magnitude of the metric error also goes to zero.
If we assume that , we can go one step further and obtain a formulation where and are always divided by , suggesting that what really matters is not the absolute position of or camera with respect to camera but rather how many times further and camera are from than from the plane. Such relation is made explicit below:
Let , being the angle between and , and let be the angle between and . Then can be rewritten as . Note that under the assumption that , always holds. Indeed for to hold, we need to require . Next, consider the scalar : it is easy to verify that if , then . Since both and are versors, the magnitude of their dot product is at most one. It follows that if and only if . Now we are left with a versor that multiplies the difference of two matrices. If we compute such product we obtain a new vector with magnitude less or equal to one, , and the versor . The difference of such vectors is at most 2. Summing up all the presented considerations, we have that the magnitude of the error is bounded as follows.
If and , then .
We now aim to derive a projection error bound from the above presented metric error bound. In order to do so, we need to introduce the focal length of the camera . For simplicity, we’ll assume that . First, we simplify our setting without losing the upper bound constraint. To do so, we consider the worst case scenario, in which the mutual position of the plane and the camera maximizes the projected error:
the plane rotation is so that ;
the error segment is just in front of the camera;
the plane rotation along the axis is such that the parallel component of the error w.r.t. the axis is zero (this allows us to express the segment end points with simple coordinates without losing generality);
the camera falls on the middle point of the error segment.
In the simplified scenario depicted in Fig. 25(c), the projection of the error is maximized. In this case, the two points we want to project are and (we consider the case in which , see Observation. 13.4) where is the distance of the camera from the plane. Considering the focal length of camera , and are projected as follows:
Thus, the magnitude of the projection of the metric error is bounded by .
Now, we notice that , so
Notably, the right term of the equation is maximized when (since when the point is behind the camera, which is impossible in our setting). Thus, we obtain that .
Fig. 24(b) shows a use case of the bound in Eq. 22. It shows values of up to , where the presented bound simplifies to (dashed black line). In practice, if we require i) and ii) that the camera-object distance is at least three times the plane-object distance , and if we let px, then the error is always lower than 200px, which translates to a precision of up to 20% of an image at 1080p resolution.
In order to advocate for the peculiar training strategy illustrated in Sec. 4.1, involving two streams processing both resized clips and randomly cropped clips, we perform an additional experiment as follows. We first re-train our multi-branch architecture following the same procedures explained in the main paper, except for the cropping of input clips and groundtruth maps, which is always central rather than random. At test time we shift each input clip in the range pixels (negative and positive shifts indicate left and right translations respectively). After the translation, we apply mirroring to fill borders on the opposite side of the shift direction. We perform the same operation on groundtruth maps and report the mean of the multi-branch model when trained with random and central crops, as a function of the translation size, in Fig. 26. The figure highlights how random cropping consistently outperforms central cropping. Importantly, the gap in performance increases with the amount of shift applied, from a relative difference when no translation is performed to and increases for px and px translations, suggesting that the model trained with central crops is not robust to positional shifts.
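The test-time translation can be sketched on a single image row as follows. This is an assumption-laden illustration: the mirroring is taken to be a reflection of the pixels adjacent to the exposed border, shifts are limited to half the row width, and function names are hypothetical.

```python
def shift_row_with_mirror(row, shift):
    """Translate a row by `shift` pixels (positive = right, negative =
    left) and fill the exposed border by mirroring the adjacent content."""
    w = len(row)
    s = abs(shift)
    if 2 * s > w:
        raise ValueError("shift too large to fill by mirroring")
    if shift == 0:
        return list(row)
    if shift > 0:
        kept = list(row[:w - s])   # content still visible after moving right
        fill = kept[:s][::-1]      # reflection filling the exposed left border
        return fill + kept
    kept = list(row[s:])           # content still visible after moving left
    fill = kept[-s:][::-1]         # reflection filling the exposed right border
    return kept + fill


def shift_frame_with_mirror(frame, shift):
    """Apply the same horizontal shift to every row of a 2D frame."""
    return [shift_row_with_mirror(row, shift) for row in frame]
```

The same function would be applied to every frame of the input clip and to the corresponding groundtruth map, so that inputs and targets stay aligned.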
Learning a globally localized solution seems theoretically impossible when using a fully convolutional architecture. Indeed, convolutional kernels are applied uniformly along all spatial dimensions of the input feature map. Conversely, a globally localized solution requires knowing where kernels are applied during convolutions. We argue that a convolutional element can know its absolute position if there are latent statistics contained in the activations of the previous layer. In what follows, we show how the common habit of padding feature maps before feeding them to convolutional layers, in order to maintain borders, is an underestimated source of spatially localized statistics. Indeed, padded values are always constant, and unrelated to the input feature map. Thus, a convolutional operation, depending on its receptive field, can localize itself in the feature map by looking for statistics biased by the padding values. To validate this claim, we design a toy experiment in which a fully convolutional neural network is tasked to regress a white central square on a black background, when provided with a uniform or a noisy input map (in both cases, the target is independent from the input). We position the square (bias element) at the center of the output as it is the furthest position from borders, i.e. where the bias originates. We perform the experiment with several networks featuring the same number of parameters yet different receptive fields (a simple way to decrease the receptive field without changing the number of parameters is to replace two convolutional layers featuring output channels with a single one featuring output channels). Moreover, to advocate for the random cropping strategy employed in the training phase of our network (recall that it was introduced to prevent a saliency-branch to regress a central bias), we repeat each experiment employing such strategy during training.
All models were trained to minimize the mean squared error between target maps and predicted maps by means of the Adam optimizer (the code to reproduce this experiment is publicly released at https://github.com/DavideA/can_i_learn_central_bias/). The outcome of such experiments, in terms of regressed prediction maps and loss function value at convergence, is reported in Fig. 27 and Fig. 28. As shown by the top-most region of both figures, despite the lack of correlation between input and target, all models can learn a centrally biased map. Moreover, the receptive field plays a crucial role, as it controls the number of pixels able to “localize themselves” within the predicted map. As the receptive field of the network increases, the responsive area shrinks to the groundtruth area, and the loss value decreases, reaching zero. For an intuition of why this is the case, we refer the reader to Fig. 29. Conversely, as clearly emerges from the bottom-most region of both figures, random cropping prevents the model from regressing a biased map, regardless of the receptive field of the network. The reason underlying this phenomenon is that padding is applied after the crop, so its relative position with respect to the target depends on the crop location, which is random.
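The role of padding can be illustrated with a one-dimensional toy that is much simpler than the experiment above and is not the released code: a uniform input is passed through a stack of zero-padded convolutions with fixed (not learned) box kernels, and we list the positions whose output differs from the central one. Those positions have "seen" the padding and can therefore be told apart from the interior. Kernel values, input length and function names are assumptions of the sketch, which also assumes an odd kernel length and a stack shallow enough that the central position does not reach the border.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D convolution preserving the input length
    (odd kernel length assumed)."""
    k = len(kernel) // 2
    padded = [0.0] * k + list(signal) + [0.0] * k
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(signal))]


def border_aware_positions(signal, kernel, depth):
    """Positions whose response, after `depth` stacked convolutions,
    differs from the response at the center of the signal."""
    out = list(signal)
    for _ in range(depth):
        out = conv1d_same(out, kernel)
    center = out[len(out) // 2]
    return [i for i, v in enumerate(out) if v != center]


uniform = [1.0] * 9
box = [1.0, 1.0, 1.0]
shallow = border_aware_positions(uniform, box, depth=1)  # small receptive field
deeper = border_aware_positions(uniform, box, depth=2)   # larger receptive field
```

Even though the input is perfectly uniform, the set of positions that can localize themselves grows with the receptive field, which is consistent with the behavior described above for the two-dimensional networks.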
|TRAINING WITHOUT RANDOM CROPPING|
|TRAINING WITH RANDOM CROPPING|
We represent in Fig. 30 the process employed for building the histogram in Fig.8 (in the paper). Given a segmentation map of a frame and the corresponding ground-truth fixation map, we collect pixel classes within the area of fixation thresholded at different levels: as the threshold increases, the area shrinks to the real fixation point. A better visualization of the process can be found at https://ndrplz.github.io/dreyeve/.
|different fixation map thresholds|
In Fig. 31 we report several clips in which our architecture fails to capture the groundtruth human attention.