Cameras are one of the most powerful sensors in robotics, as they capture detailed information about the environment and thus can be used for object detection [wang2019salient, liu2010learning] and segmentation [wang2015saliency, wang2017video] - something that is much harder to achieve with a basic range sensor. However, an image/video may contain irrelevant information. Therefore, there is a need to filter out these unimportant regions and instead learn to focus our "attention" on parts of the image which are necessary to solve the task at hand. This is crucial for autonomous driving scenarios, where a vehicle should pay more attention to other vehicles, pedestrians and cyclists present in its vicinity, while ignoring the inconsequential objects. Upon successfully identifying the objects of interest, the controller driving the vehicle only needs to attend to them in order to make optimal decisions.
We propose a novel framework for predicting a driver's focus of attention through a learnt saliency map, by taking into consideration the semantic context in an image. Typical saliency prediction algorithms [palazzi2018predicting, palazzi2017learning, xia2018predicting, tawari2018learning] in driving scenarios rely only on human-gaze information, either in an in-car [Alletto_2016_CVPR_Workshops] or an in-lab [xia2018predicting] setting. However, gaze by itself does not completely describe everything a driver should attend to, mainly for three reasons:
(i) Single focus: Humans tend to rely on peripheral vision, which gives us the ability to fixate our eyes on one object while attending to another. This cannot be captured by an eye-tracking device; thus, in-car driver gaze [Alletto_2016_CVPR_Workshops] alone does not convey sufficient information. While in-lab annotation [xia2018predicting] alleviates this problem to some extent by aggregating the gazes of multiple independent observers, it does not completely remove it, since it too relies on real human gaze. Furthermore, when we realize that the trajectory of an oncoming car or pedestrian is not likely to collide with ours, we tend to shift our focus away from it as it approaches. This is a major cause of accidents. To address this, we propose a method of tracking the motion of every driving-relevant object by detecting its instances until it leaves the field of view of the camera. This is possible because the limitation of a human's single focus does not apply to an autonomous vehicle system.
(ii) Distracted gaze: While driving, a human driver might often get distracted by some roadside object - say, a brightly colored building, or an attractive billboard advertisement. We take care of this issue by training to detect only those objects which influence the task of driving. The in-lab gaze [xia2018predicting] also eliminates this noise by averaging the eye movements of independent observers. However, it assumes that the annotators are positioned in the co-pilot's seat, and therefore cannot realistically emulate a driver's gaze.
For the majority of a driving task, human gaze remains on the road ahead of the vehicle, as this is where the vehicle is headed. When deep learning models are trained on such gaze maps, they invariably recognize this pattern and learn to keep the focus there. However, this is not enough, since there might be important regions away from the center of the road which demand attention - such as when cars or pedestrians approach from the sides. Thus, relying only on gaze data does not capture these important cues.
Figure 1 shows an example of an accident-prone situation, with the saliency maps predicted by an algorithm trained using different target labels. Gaze-only models were able to detect the car ahead, but completely missed the jaywalking pedestrian. In contrast, our approach successfully detects both objects, since it has learnt to predict semantic context in an image.
It is important to note, however, that semantics alone does not completely provide insight into the action that a driver might take at run-time. This is because a saliency map obtained only from training on semantics will give equal-weighted attention to all the objects present. Also, when there is no object of relevance (e.g. an empty road in the countryside), this saliency map will not provide any attention. In reality, the focus here should be towards road boundaries, lane dividers, curbs etc. These regions can be effectively learnt through gaze information, which is an indicator of a driver's intent. Thus, we design a Semantics Augmented GazE (SAGE) ground-truth, which successfully captures both gaze and semantic context. Figure 2 shows how our proposed ground-truth looks as compared to the existing gaze-only ground-truths.
There are three novel contributions in this paper. Firstly, we propose SAGE - a combined attention mechanism that can be used to train saliency models for accurately predicting an autonomous vehicle's (hereafter termed the driver's) focus of attention. Secondly, we provide a thorough saliency detection framework, SAGE-Net, which includes important driving cues such as the distance to objects (depth), the speed of the ego-vehicle, and pedestrian crossing intent, to further enhance the initial raw prediction obtained from SAGE. Finally, we conduct a series of experiments using multiple saliency algorithms on different driving datasets to evaluate the flexibility, robustness, and adaptability of SAGE - both over the entire dataset, and in specific important driving scenarios such as intersections and busy traffic regions. The rest of the paper is organized as follows. Section 2 discusses the existing state-of-the-art research in driver saliency prediction. Section 3 then provides details of the proposed framework, followed by the extensive experiments conducted in Section 4. Finally, Section 5 concludes the discussion and mentions the real-world implications of the conducted research.
2 Related Work
Advances in Salient Object Detection: Detection [wang2019salient, liu2010learning] and segmentation [wang2015saliency, wang2017video] of salient objects in natural scenes have long been an active area of research in the computer vision community. One of the earliest works in saliency prediction, by Itti et al. [itti1998model], considered general computational frameworks and psychological theories of bottom-up attention, based on center-surround mechanisms [treisman1980feature, wolfe1989guided, koch1987shifts]. Subsequent behavioral [parkhurst2002modeling] and computational investigations [bruce2006saliency] used "fixations" as a means to verify the saliency hypothesis and compare models. Our approach differs from these, as we incorporate both a bottom-up strategy, by scanning through the entire image and detecting object features that are relevant for driving, and a top-down strategy, by incorporating human gaze, which is purely task-driven. Some later studies [liu2010learning, achanta2008salient]
defined saliency detection as a binary segmentation problem. We adopt a similar strategy, but instead of using handcrafted features that do not generalize well to real-world scenes, we use deep learning techniques for robust feature extraction. Since the introduction of Convolutional Neural Networks (CNNs), a number of approaches have been developed for learning global and local features through varying receptive fields, both for 2D image datasets [wang2019salient, liu2019end, choe2019attention, fu2019dual] and for video-based saliency prediction [wang2019learning, lu2019see, fan2019shifting]. However, these algorithms are either too heavily biased towards image datasets, or involve complicated architectures which make them difficult to train. In contrast, our approach helps to improve existing architectures without any additional training parameters, thereby keeping the complexity unchanged. This is very important for an autonomous system, since we want it to run as close to real-time as possible. For a detailed survey of salient object detection, we refer the reader to the work by Borji et al. [borji2014salient].
Saliency for driving scenarios: Lately, there has been some focus on driver saliency prediction due to the rise in the number of driving [kotseruba2016joint, yu2018bdd100k, Ramanishka_behavior_CVPR_2018, 360LiDARTracking_ICRA_2019, narayanan2019dynamic] and pedestrian tracking [Dollar2012PAMI, ess2007depth, kotseruba2016joint] datasets. Most saliency prediction models are trained using human gaze information, either through in-car eye trackers [Alletto_2016_CVPR_Workshops, palazzi2018predicting], or through in-lab simulations [xia2018predicting, tawari2018learning]. However, as discussed above, these methods only give an estimate of the gaze, which is often prone to center bias or distracted focus. In contrast, our approach combines scene semantics with the existing gaze data. This ensures that the predicted saliency map can effectively mimic a real driver's intent, with the added feature of also being able to successfully detect and track important objects in the vicinity of the ego-vehicle.
3 SAGE-Net: Semantic Augmented GazE detection Network
Figure 3 provides a simplified illustration of the entire SAGE-Net framework, which comprises three components: a SAGE detection module, a distance-based attention update module, and a pedestrian intent-guided saliency module. We begin by describing how the SAGE maps are obtained in §3.1. Next, in §3.2, we describe how the relative distances of objects from the ego-vehicle should impact saliency prediction. Lastly, in §3.3, we highlight the importance of detecting pedestrian crossing intent and how it influences the focus of attention.
3.1 SAGE saliency map computation
We propose a new approach to predicting driving attention maps which not only uses raw human gaze information, but also learns to detect the scene semantics directly. This is done using the Mask R-CNN (M-RCNN) [he2017mask] object detection algorithm, which returns a segmented mask around an object of interest, along with its identity and location.
We used the Matterport implementation of M-RCNN [matterport_maskrcnn_2017], which is based on the Feature Pyramid Network (FPN) [lin2017feature] and uses ResNet-101 [he2016deep] as the backbone. The model is trained on the MS-COCO dataset [lin2014microsoft]. However, out of the total 80 object categories in [lin2014microsoft], we select the 12 categories which are most relevant to driving scenarios. For each video frame, M-RCNN provides an instance segmentation of every detected object. However, as the relative importance of different instances of the same object is not a significant cue, we stick to a binary classification approach where we segment all objects vs. the background. This object-level segmented map is then superimposed on top of the existing gaze map provided by a dataset, so as to preserve the gaze information. This gives us the final saliency map, as seen in Figure 2. Upon inspection, it can be clearly seen that our ground-truth has managed to capture much more semantic context from the scene, which gaze-only maps have missed.
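The ground-truth construction described above can be sketched in a few lines. The function name and array conventions below are illustrative assumptions; NumPy stands in for the actual Mask R-CNN pipeline:

```python
import numpy as np

def sage_ground_truth(gaze_map, instance_masks):
    """Combine a continuous gaze map with binary object masks.

    gaze_map:       (H, W) float array in [0, 1] (human-gaze saliency).
    instance_masks: list of (H, W) boolean arrays, one per detected
                    driving-relevant object (e.g. from Mask R-CNN).

    Returns an (H, W) float map in which every object pixel is fully
    salient and the original gaze values are preserved elsewhere.
    """
    semantic = np.zeros_like(gaze_map, dtype=bool)
    for mask in instance_masks:   # collapse instances into a single
        semantic |= mask          # objects-vs-background binary map
    # Superimpose: object pixels -> 1.0, background keeps gaze value.
    return np.maximum(gaze_map, semantic.astype(gaze_map.dtype))
```

Taking the element-wise maximum is one simple way to realize "superimposing" the segmented map on the gaze map without discarding gaze values outside object regions.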
3.2 Does relative distance between objects and ego-vehicle impact focus of attention?
Depth estimation through supervised [eigen2014depth, liu2015learning, mayer2016large] and unsupervised [garg2016unsupervised, vijayanarasimhan2017sfm] learning methods, as a measure of the relative distance between objects and the ego-vehicle, has been a long-studied problem in the autonomous driving community [mahjourian2018unsupervised, godard2017unsupervised, godard2018digging]. Human beings inherently react and give more attention to vehicles and pedestrians that are "closer" to them, as opposed to those at a distance, since the chances of collision are much higher in the former case. Unfortunately, to the best of our knowledge, this crucial information is yet to be exploited for predicting driving saliency maps. In this paper, we consider it through the recently proposed self-supervised monocular depth estimation approach Monodepth2 [godard2018digging]. However, SAGE-Net is not restricted to this algorithm, and can effectively incorporate stereo or LiDAR-based depth estimators into its framework as well.
We considered two methods of incorporating depth maps into our framework. The first involves a parallel depth channel which does not undergo any training, but is simply used to amplify nearby regions of the predicted saliency map. The second is to use it as a separate trainable input to the saliency prediction model along with the raw image, in a manner similar to how optical flow and semantic segmentation maps are trained in [palazzi2018predicting]. We decided to go with the first strategy because, in addition to being much simpler and faster to implement, it also removes the issue of training a network only on depth maps, which have much less variance in the data, thus leading to overfitting towards the vanishing point in the image.
Given an input clip of 16 image frames, $\mathcal{X}_t = \{x_{t-15}, \dots, x_t\}$, we obtain the raw prediction $\hat{S}_t$. In addition, for each frame, we also compute the depth map $\mathcal{D}_t$. Finally, we combine the raw prediction with the depth map to obtain $\hat{S}_t^{d}$ using the $\oplus$ operator, which is defined as
$$\hat{S}_t^{d}(i,j) \;=\; \bigl(\hat{S}_t \oplus \mathcal{D}_t\bigr)(i,j) \;=\; \hat{S}_t(i,j)\,\bigl(1 + \tilde{\mathcal{D}}_t(i,j)\bigr),$$
where $\tilde{\mathcal{D}}_t \in [0,1]$ is the depth map normalized so that larger values correspond to regions closer to the ego-vehicle.
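A minimal sketch of this depth-based amplification follows, assuming the depth map is a per-pixel array with larger values for farther regions (as monocular estimators such as Monodepth2 typically produce). The function name and the exact normalization are illustrative assumptions, not the paper's exact operator:

```python
import numpy as np

def amplify_by_depth(saliency, depth):
    """Amplify near-field regions of a predicted saliency map.

    saliency: (H, W) raw prediction in [0, 1].
    depth:    (H, W) per-pixel depth estimate (larger = farther).
    """
    d = depth.astype(float)
    # Convert depth to a [0, 1] "closeness" map (1 = nearest pixel).
    closeness = (d.max() - d) / (d.max() - d.min() + 1e-8)
    # Element-wise amplification; renormalize back into [0, 1].
    out = saliency * (1.0 + closeness)
    return out / (out.max() + 1e-8)
```

Because the depth channel is untrained, this step adds no parameters to the saliency network, in line with the first strategy chosen above.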
3.3 Should we pay extra attention to pedestrians crossing at intersection scenarios?
Accurate pedestrian detection at crosswalks is a vital task for an autonomous vehicle. Thus, we include an additional module which focuses solely on the crossing intent of pedestrians at intersections, and correspondingly updates the saliency prediction. It should be noted that even though SAGE does capture pedestrians in its raw prediction in general driving scenarios, it does not distinguish between them and other objects in crowded traffic conditions such as intersections. This is critical, since the chances of colliding with a pedestrian are much higher around intersection regions than on other roads. However, this is a slow process, since it involves detecting pedestrians and predicting their pose at run-time. Fortunately, this situation only occurs when the speed of the ego-vehicle itself is low. Thus, we only include this module in our framework when the speed of the ego-vehicle ($v_{ego}$) is below a certain threshold velocity $v_{th}$. $v_{ego}$ is not difficult to obtain, since most driving datasets provide this annotation [Alletto_2016_CVPR_Workshops, xia2018predicting]; for an autonomous vehicle, the odometry readings contain it. $v_{th}$ is a tunable hyper-parameter which can vary as per the road and weather conditions. When $v_{ego} < v_{th}$, we check whether there are pedestrians crossing the road. This is done using the recently proposed ResEnDec algorithm [gujjar:icra19], which predicts the intent of pedestrians as "crossing" or "not crossing" through an encoder-decoder framework using a spatio-temporal neural network and a ConvLSTM. We trained this algorithm on the JAAD dataset [kotseruba2016joint], considering 16 consecutive frames as the temporal strip while making a prediction on the last frame. Our framework is designed such that if the prediction is "crossing", we use an object detector such as YOLOv3 [redmon2018yolov3] to get the bounding boxes of the pedestrians from that last frame. Consequently, we amplify the predicted attention for pixels inside the bounding boxes, while leaving the rest of the image intact.
This is given by the $\otimes$ operator, defined as follows:
$$\bigl(\hat{S}_t^{d} \otimes \mathcal{B}_t\bigr)(i,j) \;=\; \begin{cases} \gamma \cdot \hat{S}_t^{d}(i,j), & (i,j) \in \mathcal{B}_t,\\[2pt] \hat{S}_t^{d}(i,j), & \text{otherwise,} \end{cases}$$
where $\hat{S}_t^{d}$ is the depth-refined saliency prediction for frame $t$, $\mathcal{B}_t$ is the set of pixels inside the detected pedestrian bounding boxes, and $\gamma > 1$ is an amplification factor by which the predicted map is strengthened. If the predicted intent is "not crossing", we simply retain the original prediction $\hat{S}_t^{d}$. The entire SAGE-Net algorithm is summarized in Algorithm 1.
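The speed-gated intent branch described above can be sketched as follows. The function names, the box format `(x1, y1, x2, y2)`, and the default amplification factor are illustrative assumptions:

```python
import numpy as np

def amplify_pedestrians(saliency, boxes, gamma=2.0):
    """Strengthen attention inside pedestrian bounding boxes.

    boxes: list of (x1, y1, x2, y2) pixel boxes from a detector
           such as YOLOv3; gamma > 1 is the amplification factor.
    """
    out = saliency.copy()
    for (x1, y1, x2, y2) in boxes:
        out[y1:y2, x1:x2] = np.clip(gamma * out[y1:y2, x1:x2], 0.0, 1.0)
    return out

def sage_net_step(saliency, v_ego, v_th, crossing, boxes, gamma=2.0):
    """Speed-gated update: run the intent branch only at low speed,
    and only when the intent classifier predicts 'crossing'."""
    if v_ego < v_th and crossing:
        return amplify_pedestrians(saliency, boxes, gamma)
    return saliency  # keep the (depth-refined) raw prediction
```

Gating on the ego-vehicle speed keeps the expensive detection and intent-prediction step off the critical path whenever the vehicle is moving fast, exactly when low-latency predictions matter most.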
4 Experiments and Results
Due to the simplicity of computing our proposed ground-truth, several experiments can be run using it. These experiments can be split into a two-stage hierarchy: (i) those conducted over the entire dataset, comprising multiple combinations of driving scenarios - day vs. night, city vs. countryside, intersection vs. highway etc., and (ii) those over specific important driving conditions, such as intersection regions and crowded streets. The reason for the latter set of experiments is that averaging the predicted results over all scenarios is not always reflective of the most important situations requiring maximum human attention [xia2018predicting]. For all the experiments, we describe the evaluation metrics used for comparison and, using those, compare the results of the gaze-only ground-truth and our proposed SAGE ground-truth for the different algorithms and datasets.
4.1 Some popular saliency prediction algorithms
We selected four popular saliency prediction algorithms from an exhaustive list for training with the SAGE ground-truth, and compared their performance against those trained with gaze-only maps. The first set of algorithms, DR(eye)VE [palazzi2017learning] and BDD-A [xia2018predicting], were created exclusively for saliency prediction in the driving context. For DR(eye)VE, we only consider the image branch for our analysis, instead of the multi-branch network [palazzi2018predicting], for two main reasons which make real-time operation possible. Firstly, it has a fraction of the number of trainable parameters, and hence is faster to train and evaluate. Secondly, the latter assumes that the optical flow and semantic segmented maps are pre-computed even at test time, which is difficult to achieve online. The BDD-A algorithm is more compact: it consists of a visual feature extraction module [krizhevsky2012imagenet], followed by feature and temporal processing units in the form of 2D convolutions and a Convolutional LSTM (Conv2D-LSTM) [xingjian2015convolutional] network, respectively. However, both these algorithms combine the features extracted from the final convolution layers to make the saliency maps. This mechanism ignores low-level intermediate representations such as edges and object boundaries, which are important detections for driving scenarios. Thus, we also consider ML-Net [mlnet2016], which achieved the best results on the largest publicly available image saliency dataset, SALICON [jiang2015salicon]. It extracts low-, medium-, and high-level image features and generates a fine-grained saliency map from them. Finally, PiCANet [liu2018picanet] extends this notion further by generating an attention map at each pixel over a context region and constructing an attended contextual feature to further enhance the feature representability of ConvNets. Figure 4 shows a comparison of the predicted saliency maps trained on gaze-only ground-truth, and those obtained from SAGE.
For nearly every gaze-only model, the focus of attention is entirely towards the center of the image, thereby ignoring other cars. In contrast, SAGE-trained models have managed to successfully capture this vital information. We refer the reader to Appendix B of the supplementary material for implementation details of these four algorithms.
4.2 Evaluation metrics
We consider a set of metrics which are suitable for evaluating saliency prediction in the driving context, as opposed to general saliency prediction. More specifically, for the driving purpose, we want to be more careful about identifying "False Negatives (FN)" than "False Positives (FP)", since the former error carries a much higher cost. As illustrated in Section 3, our proposed ground-truth has both a gaze component and a semantic component. Thus, we classify the set of metrics broadly into two categories - (i) fixation-centric and (ii) semantic-centric.
For the first category, we choose two distribution-based metrics - Kullback-Leibler Divergence ($KL$) and Pearson's Cross Correlation ($CC$). $KL$ is an asymmetric dissimilarity metric that penalizes FN more than FP. $CC$, on the other hand, is a symmetric similarity metric which is affected equally by FN and FP, thus giving overall information regarding the misclassifications that occurred. Another variant of fixation metrics is the location-based metrics, such as Area Under ROC Curve ($AUC$), Normalized Scanpath Saliency ($NSS$) and Information Gain ($IG$), which operate on the ground-truth represented as discrete fixation locations [bylinskii2018different]. But for the driving task, it is crucial to identify every point on a relevant object, especially its boundaries, in order to mitigate risks. Thus, continuous distribution metrics are more appropriate here, as they can better capture object boundaries.
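The two distribution-based metrics can be computed directly on the predicted and ground-truth maps. The sketch below follows the standard saliency-benchmark formulations; the function names and the epsilon smoothing are illustrative choices:

```python
import numpy as np

def kl_divergence(pred, gt, eps=1e-8):
    """KL(gt || pred) between the two maps treated as distributions.
    Heavily penalizes false negatives: salient ground-truth pixels
    to which the prediction assigns (near-)zero mass."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.sum(g * np.log(eps + g / (p + eps))))

def cross_correlation(pred, gt, eps=1e-8):
    """Pearson's correlation coefficient between the two maps;
    symmetric, so FN and FP affect it equally."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    g = (gt - gt.mean()) / (gt.std() + eps)
    return float((p * g).mean())
```

An identical pair of maps yields a $KL$ near 0 and a $CC$ near 1, while anti-correlated maps drive $CC$ towards -1.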
[Table 1: Fixation-centric and semantic-centric metrics for each model, trained on the gaze-only vs. the SAGE ground-truth.]
In the second category, we again consider two metrics - namely the $F_{\beta}$-measure, which measures the region similarity of a detection, and Mean Absolute Error ($MAE$), which gives pixel-wise accuracy. The $F_{\beta}$-measure is given by the formula
$$F_{\beta} \;=\; \frac{(1+\beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}},$$
where $\beta^2$ is a parameter that weighs the relative importance of precision and recall. In most of the literature [wang2019learning, achanta2009frequency, li2014secrets], $\beta^2$ is taken to be 0.3, thus giving a higher weightage to precision. However, following the earlier discussion regarding the varying costs associated with FN and FP for the driving purpose, we consider $\beta^2$ to be 1, thereby assigning equal weightage to each. For a formal proof of this, we refer the reader to Appendix A of the supplementary material.
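The semantic-centric metrics can be sketched as follows; the binarization threshold and function names are illustrative assumptions:

```python
import numpy as np

def f_measure(pred, gt, beta2=1.0, thresh=0.5):
    """Region-similarity F-measure between a predicted saliency map
    and a binary ground-truth mask. beta2 (= beta squared) = 1 weighs
    precision and recall equally; beta2 = 0.3 favors precision, as is
    common in the salient-object-detection literature."""
    p = pred >= thresh
    g = gt.astype(bool)
    tp = np.logical_and(p, g).sum()
    precision = tp / max(p.sum(), 1)
    recall = tp / max(g.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta2) * precision * recall
                 / (beta2 * precision + recall))

def mae(pred, gt):
    """Mean absolute pixel-wise error between the two maps."""
    return float(np.abs(pred.astype(float) - gt.astype(float)).mean())
```

With `beta2=1.0` the measure reduces to the familiar F1 score, matching the equal-weight choice argued for above.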
4.3 Results and Discussion
In this section, we discuss the experiments and results of algorithms trained on our proposed SAGE ground-truth, along with how they compare to the performance of the same algorithms when trained on existing gaze-only ground-truths [Alletto_2016_CVPR_Workshops, xia2018predicting]. We compare our results with the BDD-A gaze in most of the experiments, since it is more reflective of scene semantics than the DR(eye)VE gaze. For fair comparison, we adopt different strategies for evaluating the fixation-centric and semantic-centric metrics. Since both the traditional gaze-only approach and SAGE contain gaze information, we use the respective ground-truths to evaluate the fixation metrics (i.e. gaze for the gaze-only trained model, and SAGE for our trained model). However, for the semantic metrics, we use the segmented maps generated by M-RCNN as ground-truth, to evaluate how well each of the methods can capture semantic context. The first set of comparisons, given by Table 1 and Figure 5, are calculated by taking the average over the entire test set, while the remaining comparisons are for a subset of the test set involving two important driving scenarios, namely pedestrians crossing at an intersection in Table 2, and cars approaching the ego-vehicle in Table 3.
Overall comparison - In Table 1, we train the four algorithms described in §4.1 on the BDD-A dataset [xia2018predicting]. We show the results obtained when evaluating the algorithms trained on the gaze-only data, and then on SAGE data generated by combining semantics with the gaze of [xia2018predicting]. As observed from the table, models trained on SAGE achieve the better score on most metrics for almost all the algorithms, and where they do not, they are only marginally poorer. Overall, this analysis shows that our proposed SAGE ground-truth performs well across a diverse set of algorithms, thus demonstrating its flexibility and robustness.
We next consider Figure 5, where a cross-evaluation of our method with respect to different driving datasets is performed. For this set of experiments, we fix one algorithm, namely DR(eye)VE [palazzi2017learning], while we vary the data. We evaluate two variants of SAGE - first, by combining scene semantics with the gaze of [Alletto_2016_CVPR_Workshops], and second, with the gaze of [xia2018predicting]. For each of these, we compare against the gaze-only ground-truth of the respective dataset. Like before, we evaluate the performance of the predicted saliency maps using the same fixation-centric and semantic-centric metrics. The results show that the proposed SAGE models are not strongly tied to a dataset and can adapt to different driving conditions. It is important to note that even though the cross-evaluation (SAGE-D tested on [xia2018predicting], and SAGE-B tested on [Alletto_2016_CVPR_Workshops]) is slightly unfair, the results for SAGE still significantly outperform those of the respective gaze-only models.
[Table 2: Fixation-centric and semantic-centric metrics for each model, gaze-only vs. SAGE ground-truth, on the pedestrian-crossing scenarios.]
[Table 3: Fixation-centric and semantic-centric metrics for each model, gaze-only vs. SAGE ground-truth, on the approaching-cars scenarios.]
Comparison at important driving scenarios - In Table 2, we consider the scenario of pedestrians crossing at intersections. For this purpose, we used a subset of the JAAD dataset [kotseruba2016joint] containing more than five pedestrians (not necessarily as a group) crossing the road. The same four algorithms described in §4.1 are reconsidered for this case. Using M-RCNN, the segmented masks of all the crossing pedestrians were computed, and the predicted saliency maps from the models were evaluated against this baseline. Upon comparison, it can be seen that models trained on SAGE surpass those trained on the gaze-only ground-truth. It is to be noted that even though none of the models were trained on the JAAD dataset [kotseruba2016joint], the results are still consistent across all the algorithms. This shows that learning from SAGE indeed yields a better saliency prediction model, which can detect pedestrians crossing at an intersection more reliably.
Finally, in Table 3, we take into account another important driving scenario, where we consider the detection of cars approaching the ego-vehicle. We constructed the evaluation set from snippets of the DR(eye)VE [Alletto_2016_CVPR_Workshops] and BDD-A [xia2018predicting] datasets, in which a single car or a group of cars approaches the ego-vehicle from the opposite direction in an adjacent lane. Once again, we evaluated the four algorithms on this evaluation set. Like in Table 2, here too we analyze the detections with respect to those made by M-RCNN. The results in Table 3 show that in almost every experiment, the algorithms trained on SAGE detect the approaching vehicles more accurately than the models trained on the gaze-only ground-truth.
To summarize, the experiments clearly show that the proposed SAGE ground-truth can easily be used to train different saliency algorithms, and that the resulting models operate well across a wide range of driving conditions. This makes it more reliable for the driving task than existing approaches which rely only on raw human gaze. Overall, the performance of our method is better than the gaze-only ground-truth in 49 out of 56 (87.5%) cases, not only when averaged over the entire dataset, but more importantly, in specific driving situations demanding a higher focus of attention.
5 Conclusion and Future Work
In this paper, we introduced SAGE-Net, a novel deep learning framework for predicting "where the autonomous vehicle should look" while driving, through predicted saliency maps that learn to capture semantic context in the environment while retaining the raw gaze information. With the proposed SAGE ground-truth, saliency models have been shown to attend to the important driving-relevant objects while discarding irrelevant or less important cues, without adding any computational overhead to the training process. An extensive set of experiments demonstrates that our proposed method improves the performance of existing saliency algorithms across multiple datasets and various important driving scenarios, thus establishing the flexibility, robustness and adaptability of SAGE-Net. We hope that the research conducted in this paper will motivate the autonomous driving community to look at simple but effective strategies for enhancing the performance of existing algorithms.
Our future work will involve incorporating depth into the SAGE ground-truth and then training the entire framework end-to-end. Currently, this could not be achieved due to the low variance in the depth data, which leads to overfitting. Another possible direction being considered is to explicitly add the motion dynamics of segmented semantic objects in the surroundings, in the form of SegFlow [cheng2017segflow]. Work in this area is in progress, as we are building a campus-wide dataset with these kinds of annotations through visual sensors and camera-LiDAR fusion techniques.
The authors would like to thank the Army Research Laboratory (ARL) Distributed and Collaborative Intelligent Systems and Technology (DCIST) Collaborative Technology Alliance (W911NF-10-2-0016) for supporting this research.